The node a11-01 has a faulty GPU: when I request a GPU (gres=gpu:a40:1) via salloc/sbatch, I am allocated a11-01 but without a GPU attached. nvidia-smi reports no devices, and torch.cuda.is_available() returns False.
Other members of my lab have experienced the same issue with this node. It has 8 GPUs in total, and most of them seem to work fine. This is frustrating because submitted jobs keep getting placed on this node and failing for lack of a GPU, instead of staying in the queue for nodes that do not have this problem. (I chose the Discovery cluster as the topic because there was no option for the Endeavour cluster.)
I believe 7 of the 8 GPUs on this node work fine and only one does not, so I'm not sure the output you're showing captures the full behavior; I was also able to get the same output for another job on this same node while the problem was occurring.
I just checked that not all 8 GPUs were taken; 4 were remaining, so I requested 4 for an interactive job and only 3 were provided. I believe this is the same issue.
[hjcho@endeavour2 modeling]$ salloc -p isi --gres=gpu:a40:4 --mem 64GB --time 4:00:00 --cpus-per-task=4
salloc: Pending job allocation 19744515
salloc: job 19744515 queued and waiting for resources
salloc: job 19744515 has been allocated resources
salloc: Granted job allocation 19744515
salloc: Nodes a11-01 are ready for job
Warning: '/project/jonmay_231/hjcho/.conda/pkgs/' already in 'pkgs_dirs' list, moving to the top
Warning: '/project/jonmay_231/hjcho/.conda/envs/' already in 'envs_dirs' list, moving to the top
[hjcho@a11-01 modeling]$ nvidia-smi
Tue Feb 13 11:32:38 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03 Driver Version: 535.129.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A40 Off | 00000000:01:00.0 Off | 0 |
| 0% 31C P0 74W / 300W | 4MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A40 Off | 00000000:25:00.0 Off | 0 |
| 0% 30C P0 76W / 300W | 4MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA A40 Off | 00000000:41:00.0 Off | 0 |
| 0% 29C P0 69W / 300W | 4MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
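For what it's worth, here is a quick way to confirm the mismatch from inside the allocation by comparing the devices exposed to the job against the number requested. This is a hypothetical helper sketch, assuming Slurm's gres plugin communicates the assigned GPUs to the job via the CUDA_VISIBLE_DEVICES environment variable; the simulated values below just mirror what this job saw (4 requested, 3 visible).

```python
import os

def visible_gpu_count() -> int:
    """Count the GPU indices listed in CUDA_VISIBLE_DEVICES.

    An unset or empty variable means no GPUs were exposed to the job,
    which matches the zero-device case originally reported.
    """
    devices = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    return len([d for d in devices.split(",") if d.strip()])

# Simulate this job's situation: 4 GPUs requested, only 3 exposed.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2"
requested = 4
print(visible_gpu_count())               # 3
print(visible_gpu_count() == requested)  # False -> the node shorted the job
```

Running something like this at the top of a job script would let a job fail fast (or requeue) when it lands on a11-01 with fewer GPUs than requested, rather than crashing mid-run.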