Faulty node on Endeavour cluster's ISI partition

The node a11-01 appears to have a faulty GPU. When I request a GPU (a40:1) via salloc/sbatch and the job lands on a11-01, the allocation comes up without a usable GPU: nvidia-smi reports no devices, and torch.cuda.is_available() returns False.
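
For reference, these are the checks I run inside the allocation once it lands on a11-01 (the torch check assumes PyTorch is installed in the active environment):

# check what the driver sees on the allocated node
nvidia-smi
# check what PyTorch sees
python -c "import torch; print(torch.cuda.is_available())"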

Other members of my lab have run into the same issue with this node. It has 8 GPUs in total, and most of them seem to work fine. This is frustrating because submitted jobs keep getting placed on this node and failing for lack of a GPU, instead of staying in the queue for a node that doesn't have this problem. (I chose Discovery cluster as the topic because there was no option for the Endeavour cluster.)
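
In the meantime, would explicitly excluding the node be a reasonable workaround? Something like the following, where job.sh is just a placeholder for our batch script:

# keep jobs off a11-01 until it is fixed
sbatch -p isi --gres=gpu:a40:1 --exclude=a11-01 job.sh
# or for an interactive session
salloc -p isi --gres=gpu:a40:1 --exclude=a11-01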

Hi,
Are you still having issues with this? I'm not able to replicate this behavior, but we may have rebooted this node since you initially reported it.

-Cesar

[csul@a11-01 ~]$ python -c "import torch; print(torch.cuda.is_available())"
True
[csul@a11-01 ~]$ nvidia-smi
Tue Feb 13 11:16:44 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A40                     Off | 00000000:E1:00.0 Off |                    0 |
|  0%   29C    P0              70W / 300W |      4MiB / 46068MiB |      4%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

I think 7 of the 8 GPUs on this node work fine and only one was failing, so I'm not sure the output you're showing captures the full behavior; while the problem was occurring, I was also able to get the same output from another job on this node.

I just checked, and not all 8 GPUs were taken; 4 should have been free. So I requested 4 for an interactive job and only 3 were given, which suggests the same issue is still there.

[hjcho@endeavour2 modeling]$ salloc -p isi --gres=gpu:a40:4 --mem 64GB --time 4:00:00 --cpus-per-task=4
salloc: Pending job allocation 19744515
salloc: job 19744515 queued and waiting for resources
salloc: job 19744515 has been allocated resources
salloc: Granted job allocation 19744515
salloc: Nodes a11-01 are ready for job
Warning: '/project/jonmay_231/hjcho/.conda/pkgs/' already in 'pkgs_dirs' list, moving to the top
Warning: '/project/jonmay_231/hjcho/.conda/envs/' already in 'envs_dirs' list, moving to the top
[hjcho@a11-01 modeling]$ nvidia-smi
Tue Feb 13 11:32:38 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A40                     Off | 00000000:01:00.0 Off |                    0 |
|  0%   31C    P0              74W / 300W |      4MiB / 46068MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A40                     Off | 00000000:25:00.0 Off |                    0 |
|  0%   30C    P0              76W / 300W |      4MiB / 46068MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A40                     Off | 00000000:41:00.0 Off |                    0 |
|  0%   29C    P0              69W / 300W |      4MiB / 46068MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
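
For what it's worth, this is roughly how I compare what Slurm granted against what the job can actually see (the exact GRES/TRES field names may differ on this cluster):

# what Slurm thinks it allocated to this job
scontrol show job $SLURM_JOB_ID | grep -iE "gres|tres"
# what is actually visible inside the allocation
echo $CUDA_VISIBLE_DEVICES
nvidia-smi -L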

Hi,

We’ve set the node to ‘drain’. Once all currently running jobs complete, we will reboot it, which should make the last GPU available again.
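
While it drains, you can watch the node's state with something like:

# node state and configured GRES; it should report a drain/draining state until the reboot
sinfo -n a11-01 -o "%n %t %G"
scontrol show node a11-01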

-Cesar