B11-10 No CUDA GPUs are available

I launched a GPU job on b11-10, requesting 1 GPU and 12 CPUs for 4 nodes. However, the job stopped immediately, and the log reported: RuntimeError: No CUDA GPUs are available.

The job ID is 25584436. The corresponding log is at /project/weiwenfe_1082/zjin8285/4_GAR/01_Exps/84_BB05_base_bb04_maskdecoder_depth8/train/train_slurm_25584436.log.

Here is the screenshot.

I would greatly appreciate any help in solving this problem. Thanks.

I am wondering why this problem has received no response after 15 days.

There are multiple nodes where GPUs are not available even though the job is allocated successfully. Another example is a03-01.
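In case it helps to narrow down which nodes are affected, here is a minimal sketch of a check that could be run inside an allocation. It only reports what the node itself sees: the Slurm job ID and node name, the value of CUDA_VISIBLE_DEVICES, and the GPUs listed by `nvidia-smi -L`. The function name `gpu_report` is my own and hypothetical, and I am assuming that `nvidia-smi` is on the PATH of the compute nodes and that Slurm sets CUDA_VISIBLE_DEVICES for GPU jobs on this cluster.

```python
import os
import shutil
import socket
import subprocess


def gpu_report(env=None):
    """Collect basic facts about GPU visibility on the current node.

    `env` defaults to os.environ; passing a dict makes the function easy to test.
    """
    env = os.environ if env is None else env
    report = {
        # SLURMD_NODENAME is set by Slurm inside a job; fall back to the hostname.
        "host": env.get("SLURMD_NODENAME") or socket.gethostname(),
        "job_id": env.get("SLURM_JOB_ID"),
        # None means the variable is unset; "" means no GPU is exposed to the job.
        "cuda_visible_devices": env.get("CUDA_VISIBLE_DEVICES"),
        # None means nvidia-smi was not found (assumption: it is normally on PATH).
        "driver_gpus": None,
    }
    smi = shutil.which("nvidia-smi")
    if smi is not None:
        try:
            out = subprocess.run(
                [smi, "-L"], capture_output=True, text=True, timeout=30
            )
            # `nvidia-smi -L` prints one line per GPU, each starting with "GPU ".
            report["driver_gpus"] = [
                line for line in out.stdout.splitlines() if line.startswith("GPU ")
            ]
        except (OSError, subprocess.TimeoutExpired):
            # The tool exists but could not be run or did not answer in time.
            report["driver_gpus"] = []
    return report


if __name__ == "__main__":
    for key, value in gpu_report().items():
        print(f"{key}: {value}")
```

If the script reports an empty GPU list on a node where the job was granted a GPU, that would point to the node itself (driver or hardware state) rather than to the training code.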

Based on the link, I think it would be enough to just reboot those servers.