Error when running PyTorch on GPUs

Hello,

I’m trying to train a model on HPC’s GPUs using PyTorch, but when I do I get the following error:

  File "/home1/sommerer/.conda/envs/rocus/lib/python3.7/site-packages/torch/nn/modules/container.py", line 100, in forward
input = module(input)
  File "/home1/sommerer/.conda/envs/rocus/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
  File "/home1/sommerer/.conda/envs/rocus/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 87, in forward
return F.linear(input, self.weight, self.bias)
  File "/home1/sommerer/.conda/envs/rocus/lib/python3.7/site-packages/torch/nn/functional.py", line 1370, in linear
ret = torch.addmm(bias, input, weight.t())
RuntimeError: CUDA error: no kernel image is available for execution on the device

One post (https://github.com/pytorch/pytorch/issues/31285) seems to indicate that I would have to build PyTorch from source. If that is the solution, would I have to build from source every time I submit a job to the GPUs? That doesn’t seem ideal. If anyone knows how to fix this error, I’d greatly appreciate it.
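In case it helps, this is the kind of quick check I can run inside the job to see which GPU it lands on and which CUDA architectures my PyTorch build supports (a rough sketch; torch.cuda.get_arch_list() only exists in newer PyTorch releases, so it is guarded with hasattr):

import torch

# PyTorch build and the CUDA version it was compiled against
print("torch:", torch.__version__, "| built for CUDA:", torch.version.cuda)

if torch.cuda.is_available():
    # GPU the job was allocated and its compute capability
    # (K40 is 3.5, P100 is 6.0, V100 is 7.0)
    print("device:", torch.cuda.get_device_name(0))
    print("compute capability:", torch.cuda.get_device_capability(0))

    # Newer PyTorch releases can also report which architectures the binary includes
    if hasattr(torch.cuda, "get_arch_list"):
        print("built for architectures:", torch.cuda.get_arch_list())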

@sommerer I think the version of PyTorch you have was only built for certain types of GPUs and doesn’t include kernels for the older K40s. We have K40, P100, and V100 GPUs on the cluster. Try specifying the GPU type with the --gres option in your job script so the job runs on the newer P100 or V100 GPUs:

#SBATCH --gres=gpu:p100:1
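
For example, the relevant part of a job script might look something like this (the resource numbers and train.py are just placeholders for whatever your job normally uses):

#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --time=01:00:00
#SBATCH --gres=gpu:p100:1    # request one P100 instead of a K40

python train.py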

Great, thanks for the reply! I’ll try it out when I have time and update with the results.

I switched from the K40 to the P100 and it worked; I no longer get the error shown above. Thank you!

I’ve run into this issue as well, and yes, it’s a problem with the CUDA compute capability: some newer PyTorch code won’t run on K40s anymore unless you specifically compile and build the older GPU kernels yourself (if that is still possible).
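
For anyone who really does need to stay on the K40s, a from-source build can be pointed at the older architecture through the TORCH_CUDA_ARCH_LIST environment variable (a rough sketch; the exact steps depend on the PyTorch version and the CUDA toolkit available on the cluster):

# Build kernels for the K40 (compute capability 3.5) as well as the newer cards;
# this only applies when compiling PyTorch from source.
export TORCH_CUDA_ARCH_LIST="3.5;6.0;7.0"

git clone --recursive https://github.com/pytorch/pytorch
cd pytorch
python setup.py install

The build only has to be done once per conda environment, not once per job submission.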