`torch` device count is 0

I requested one A100 GPU. With the same `sbatch` arguments, the job usually works, but intermittently I get this warning and `torch` reports zero devices:

/home1/zhejianz/envs/arena/.venv/lib/python3.11/site-packages/torch/cuda/__init__.py:128: UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0
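
To help tell a bad node apart from a bad environment, a small diagnostic like the one below could be run at the start of each job, so that failing and working runs can be compared. This is only a sketch: the function name `collect_gpu_diagnostics` is hypothetical, and it assumes `nvidia-smi` is on the compute node's `PATH` and that Slurm exports `SLURM_JOB_ID` and `CUDA_VISIBLE_DEVICES` for the allocation. It does not fix the error, it only records what the job sees.

```python
import os
import socket
import subprocess


def collect_gpu_diagnostics(env=None):
    """Collect facts about the node and GPU visibility (hypothetical helper)."""
    env = os.environ if env is None else env
    info = {
        "hostname": socket.gethostname(),
        "slurm_job_id": env.get("SLURM_JOB_ID"),
        "cuda_visible_devices": env.get("CUDA_VISIBLE_DEVICES"),
    }

    # Ask the driver which GPUs it can see, independently of PyTorch.
    try:
        result = subprocess.run(
            ["nvidia-smi", "-L"], capture_output=True, text=True, timeout=30
        )
        info["nvidia_smi"] = result.stdout.strip() or result.stderr.strip()
    except (OSError, subprocess.TimeoutExpired) as exc:
        info["nvidia_smi"] = f"unavailable: {exc}"

    # Ask PyTorch how many CUDA devices it can initialize (None if torch is absent).
    try:
        import torch

        info["torch_device_count"] = torch.cuda.device_count()
    except ImportError:
        info["torch_device_count"] = None

    return info


if __name__ == "__main__":
    for key, value in collect_gpu_diagnostics().items():
        print(f"{key}: {value}")
```

If the failing runs all report the same hostname, or `nvidia-smi -L` also fails on them, that would point at the node rather than at the Python environment.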
