I often use ssh to monitor how my Slurm jobs are doing, especially to check whether running jobs are making good use of their allocated GPUs.
If only a single job is running on a node, a simple ssh into the allocated node works fine. But often I have two (or more) jobs on the same node, and the ssh session seems to show the GPU allocation of only the last job.
For example, below I have two jobs on node a11-02; the first job has five GPUs and the second has one GPU:
$ squeue -u $USER -S +i -o "%7i %8u %4P %30j %2t %10R %10b"
JOBID USER PART NAME ST NODELIST(R TRES_PER_N
2747454 tnarayan isi jobname1 R a11-02 gpu:a40:5
2747461 tnarayan isi jobname2 R a11-02 gpu:a40:1
I want to check on the first job to see if it is making good use of its GPUs, so I ssh into the node and run nvidia-smi:
$ ssh a11-02 'nvidia-smi'
Thu Dec 16 13:05:39 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.01 Driver Version: 470.82.01 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A40 Off | 00000000:81:00.0 Off | 0 |
| 0% 63C P0 168W / 300W | 23130MiB / 45634MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 65465 C .../envs/rtg-py39/bin/python 23127MiB |
+-----------------------------------------------------------------------------+
But I see only one GPU, the one allocated to the most recent job.
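(I suspect my ssh session is being adopted into one job's cgroup, which would hide the other job's GPUs. Assuming the node uses cgroup-based job tracking, e.g. pam_slurm_adopt, I'd guess the adopting job's ID shows up in the session's cgroup path, which can be checked with:

$ ssh a11-02 'grep -i job /proc/self/cgroup'

and I'd expect to see job_2747461 there, matching the most recent job. But I haven't confirmed how our nodes are configured.)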
How do I check on the GPUs given to my other jobs on the same node?
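One workaround I'm considering, though I haven't verified it on our cluster, is to launch nvidia-smi as a step inside the target job's own allocation, which should inherit that job's GPU restrictions. This assumes our Slurm is new enough (20.11+) to support the --overlap flag:

$ srun --jobid=2747454 --overlap nvidia-smi

If that is the right approach, or if there is something more standard, I'd appreciate a pointer.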
Thanks
TG