How to check GPU utilization when running multiple jobs per node?

I often use ssh to monitor how my jobs are doing, especially to check whether running jobs are making good use of their allocated GPUs.

If only a single job is running on a node, a simple ssh into the allocated node works fine. But often I have two (or more) jobs per node, and the ssh session seems to show the GPU allocation for the last job only.

For example, below I have two jobs on node a11-02: the first job has five GPUs and the second has one GPU:

$ squeue -u $USER -S +i -o "%7i %8u %4P %30j %2t %10R %10b"
JOBID   USER     PART NAME     ST NODELIST(R TRES_PER_N
2747454 tnarayan isi  jobname1 R  a11-02     gpu:a40:5 
2747461 tnarayan isi  jobname2 R  a11-02     gpu:a40:1

I want to check whether the first job is making good use of its GPUs, so I ssh into the node and run nvidia-smi:

$ ssh a11-01 'nvidia-smi'

Thu Dec 16 13:05:39 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.01    Driver Version: 470.82.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A40          Off  | 00000000:81:00.0 Off |                    0 |
|  0%   63C    P0   168W / 300W |  23130MiB / 45634MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     65465      C   .../envs/rtg-py39/bin/python    23127MiB |
+-----------------------------------------------------------------------------+

But I see only one GPU, which was allocated to the most recent job.
How can I check on the GPUs given to my other jobs on the same node?

Thanks
TG

@tnarayan The squeue output shows node a11-02, but the ssh command uses a11-01. Maybe it’s just that.

Try specifying the devices if needed. For example: nvidia-smi -i 0,1,2,3,4,5,6,7

Oops, that was a typo.

Even if I use the correct node name (a11-02) and specify all the device IDs, only the GPUs of the most recent job are visible in the ssh session.

$ squeue -u $USER -S +i -o "%7i %8u %4P %30j %2t %10R %10b"
JOBID   USER     PART NAME   ST NODELIST(R TRES_PER_N
2747454 tnarayan isi  jobname1 R  a11-02     gpu:a40:5
2747461 tnarayan isi  jobname2 R  a11-02     gpu:a40:1


$ ssh a11-02 'nvidia-smi -i 0,1,2,3,4,5,6,7'

Thu Dec 16 18:38:12 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  A40                 Off  | 00000000:A1:00.0 Off |                    0 |
|  0%   64C    P0   261W / 300W |  38258MiB / 45634MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      5674      C   .../envs/rtg-py39/bin/python    38255MiB |
+-----------------------------------------------------------------------------+

Okay, so when you ssh into a node that is running multiple of your jobs, the session is attached to the most recent job (check printenv | grep SLURM), and Slurm restricts access to that job’s allocated GPU(s). However, you can attach to a specific job using srun. For example:

srun --jobid=<job_id> nvidia-smi

Or for an interactive session within that job:

srun --jobid=<job_id> --pty bash
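
For instance, with the first job from the squeue output above (job 2747454), a minimal sketch; depending on your Slurm version you may need to add --overlap so the extra step can run alongside the job’s existing steps:

# check utilization of the GPUs allocated to that specific job
srun --jobid=2747454 nvidia-smi

# the step sees only that job's devices, so this should list five GPU indices
srun --jobid=2747454 bash -c 'echo $CUDA_VISIBLE_DEVICES'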

Thank you!