Note: this question is about Slurm, not the internals of the job. I have a PyTorch task using distributed data parallel (DDP); I just need to figure out how to launch it with Slurm.
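For context, this is roughly the batch script I am hoping to end up with once the allocation problem is solved. It is only a sketch: train.py is a placeholder for my DDP training script (it would derive its rank and world size from SLURM_PROCID and SLURM_NTASKS itself), and the port number is arbitrary.
#!/bin/bash
#SBATCH --time=1:00:00
#SBATCH --mem=4G
#SBATCH --nodes=2
#SBATCH --ntasks=2
#SBATCH --gres=gpu:2
# first node in the allocation acts as the rendezvous host for DDP
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=29500   # any free port
# one task per node, matching the interactive allocations below
srun python train.py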
Here is what I have tried (please correct me if I am wrong).
Without GPUs, Slurm works as expected
Step 1: get an allocation.
# TODO: use sbatch with a batch script instead of an interactive srun (rough sketch after this step)
$ srun -t 1:00:00 --mem=4G -N 2 -n 2 --pty bash
srun: job 59667 queued and waiting for resources
srun: job 59667 has been allocated resources
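Regarding the TODO above, I assume the non-interactive equivalent would be a batch script along these lines (a sketch I have not verified):
#!/bin/bash
#SBATCH --time=1:00:00
#SBATCH --mem=4G
#SBATCH --nodes=2
#SBATCH --ntasks=2
srun hostname
submitted with sbatch (the filename is just a placeholder):
$ sbatch test_hostname.sbatch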
Step 2: view allocation
$ scontrol show hostnames
d05-06
d05-07
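With no argument, scontrol show hostnames expands the SLURM_JOB_NODELIST of the current job, which is what makes it usable inside a script; it is the same trick used in the sketch near the top of the post to pick a master node:
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)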
Step 3: run
$ srun hostname
d05-06.hpc.usc.edu
d05-07.hpc.usc.edu
srun worked as expected: it launched two tasks, one on each node, each printing its hostname.
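(I did not capture the output, but as a sanity check of the task-to-rank mapping that DDP relies on, something like this inside the same allocation should print one line per task with its SLURM_PROCID and node:)
$ srun bash -c 'echo "task $SLURM_PROCID of $SLURM_NTASKS on $(hostname)"'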
What goes wrong with GPU jobs
Step 1: get an allocation
$ srun -t 1:00:00 --mem=4G -N 2 -n 2 --gres=gpu:2 --pty bash
srun: job 59662 queued and waiting for resources
srun: job 59662 has been allocated resources
Step 2: view allocation
$ scontrol show hostnames
d11-03
d23-16
Step 3: run
But the srun that worked before without GPUs now freezes:
$ srun hostname
It just hangs there forever.
I tried srun --gpus-per-task=1 hostname, but it didn't make any difference.
I have the GPUs as expected (on the first node):
$ gpustat
d11-03.hpc.usc.edu Fri Aug 21 17:25:20 2020 440.64.00
[0] Tesla V100-PCIE-32GB | 30'C, 0 % | 0 / 32510 MB |
[1] Tesla V100-PCIE-32GB | 32'C, 0 % | 0 / 32510 MB |
If I run ssh d23-16 from the first node, I get:
ssh_exchange_identification: read: Connection reset by peer
If I try to log in to the second node from the head node (discovery.hpc.usc.edu) while the allocation is active, my ssh works! However, that is not useful for running jobs from a batch script. (I mention this only to show that my ssh settings are fine.)
In summary, I am trying to launch my GPU job on multiple nodes, but neither ssh nor srun works when I have GPUs in my job allocation.
How are others able to do it?
Thanks,
-TG