How to launch a GPU job on multiple nodes?

Note: the question is about Slurm, not the internals of the job. I have a PyTorch task using distributed data parallel (DDP); I just need to figure out how to launch it with Slurm.

Here is what I have tried so far (please correct me if I am wrong).

Without GPUs, Slurm works as expected.

Step 1: Get an allocation.

# TODO: sbatch instead of srun on bash script
$ srun -t 1:00:00 --mem=4G -N 2 -n 2  --pty bash
srun: job 59667 queued and waiting for resources
srun: job 59667 has been allocated resources

Step 2: view allocation

$ scontrol show hostnames
d05-06
d05-07

Step 3: run

$ srun hostname
d05-06.hpc.usc.edu
d05-07.hpc.usc.edu

srun worked as expected: it ran 2 tasks, one on each node, each printing its hostname.

What is wrong with GPU jobs?

Step 1: get an allocation

$ srun -t 1:00:00 --mem=4G -N 2 -n 2 --gres=gpu:2 --pty bash
srun: job 59662 queued and waiting for resources
srun: job 59662 has been allocated resources

Step 2: view allocation

$ scontrol show hostnames
d11-03
d23-16

Step 3: Run
But the same srun that worked without GPUs now freezes.

$ srun hostname

It stays there forever.

I tried srun --gpus-per-task=1 hostname, but it didn't make any difference.

I have the GPUs as expected (on the first node):

$ gpustat
d11-03.hpc.usc.edu       Fri Aug 21 17:25:20 2020  440.64.00
[0] Tesla V100-PCIE-32GB | 30'C,   0 % |     0 / 32510 MB |
[1] Tesla V100-PCIE-32GB | 32'C,   0 % |     0 / 32510 MB |

If I run ssh d23-16 from the first node, I get:
ssh_exchange_identification: read: Connection reset by peer

If I try to log in to the second node from the head node (discovery.hpc.usc.edu) while the allocation is active, my ssh works! However, that is not useful for running the jobs from a bash script. (Just saying so you know my ssh settings are fine.)

In summary, I am trying to launch my GPU job on multiple nodes, but neither ssh nor srun works when I have GPUs in my job allocation.
How are others able to do it?

Thanks,
-TG

Thanks a lot for using our new Discovery cluster. I believe you’re describing SSHing from one of your allocated nodes to another, but the Discovery cluster doesn’t allow that for security reasons.

I was able to allocate two interactive sessions on two of the GPU nodes (using the --gres=gpu:<n> option) by doing something like this, where I used ‘salloc’ instead of a couple of sruns (on our cluster the default is for salloc to run srun to get an interactive shell):

[christay@discovery ~]$ salloc --ntasks=2 --cpus-per-task=1 --ntasks-per-node=1 --mem-per-cpu=1GB --time=10:00 --account=<myaccount> --gres=gpu:1
salloc: Granted job allocation 60912
salloc: Waiting for resource configuration
salloc: Nodes e21-[05-06] are ready for job
[christay@e21-05 ~]$

This was handy, because it logs me into the e21-05 node and prints out the hostname pattern that includes the second node, which I was also able to SSH into from another window. I tried the same thing with ‘srun’, but it’s not quite as friendly: it doesn’t tell me the job ID and nodes I got.

[christay@discovery ~]$ srun --ntasks=2 --cpus-per-task=1 --ntasks-per-node=1 --mem-per-cpu=1GB --time=10:00 --account=<myaccount> --gres=gpu:1 --pty bash -i
[christay@e21-05 ~]$ scontrol show hostnames
e21-05
e21-06

I’m not sure what your use case is, but is this sort of like what you were trying to do? The ‘salloc’ command seems more geared toward user-friendliness. I think people usually follow a pattern where they get one interactive shell to test their code, and then, when it’s time to kick off the real job across nodes, they use ‘sbatch’ to run parallel tasks, perhaps with ‘srun’ inside the batch script, along the lines of the sketch below.
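To make that concrete, the workflow might be something like this (just a sketch; my_job.sh and the resource numbers are placeholders, not anything specific to your job):

# Step 1: test interactively on a single GPU node first
salloc --ntasks=1 --cpus-per-task=1 --mem-per-cpu=1GB --time=10:00 --gres=gpu:1

# Step 2: when ready, submit the real multi-node run; my_job.sh is a
# placeholder for a batch script that calls srun for its parallel steps
sbatch --ntasks=2 --ntasks-per-node=1 --gres=gpu:1 my_job.sh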

— EDIT ----
Interesting, I see what you mean about srun hostname hanging when you run it from an allocated GPU node. I’ll have to ask my teammates why it does that. I Ctrl-C’d it to get out and it printed:

[christay@e21-04 ~]$ srun hostname
^Csrun: Cancelled pending job step with signal 2
srun: error: Unable to create step for job 60921: Job/step already completing or completed

Thanks a lot for the reply.

My question is: suppose I get two nodes with GPUs using salloc, e.g.

salloc --ntasks=2 --cpus-per-task=1 --ntasks-per-node=1 --mem-per-cpu=1GB --time=10:00  --gres=gpu:2

I get a shell on the first node, where my job runs.
I assumed I should explicitly run something on that first node to launch my job on all the allocated nodes (either using srun or ssh).
But I am unable to get that to work so far.

So, the same question again: how do I launch my job script on all the nodes that are allocated to the job?
Is there some setting in Slurm that runs the job script on all the nodes, and not just the first node?

I’d appreciate it if anyone on your team could point me to docs or an example job script for doing so.

Here is a script I am using to test this setting:

#!/usr/bin/env bash
#SBATCH --ntasks=2 --cpus-per-task=1 --ntasks-per-node=1
#SBATCH --mem-per-cpu=1GB --time=10:00
#SBATCH --gres=gpu:2

nvidia-smi
echo $(hostname -f) $(date) sleeping
sleep 5s
echo $(hostname -f) $(date) exiting
echo "Done"

When I run it with sbatch, I see that it runs only on the first node.

Thanks,
TG

Glad you’re able to confirm allocating multiple GPU nodes for your test!

I see what you mean about your script only executing on one node, even though two nodes get allocated when you submit it with ‘sbatch’. I think it’s because it’s basically a serial job: sbatch runs the batch script itself only on the first allocated node, and without ‘srun’ commands inside it there is nothing to parallelize across the two allocated nodes. I’ll have to ask our research computing team to confirm this; they have lots of expert experience with this kind of thing.
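If that’s right, then putting ‘srun’ in front of the commands you want replicated should do it. Here is a sketch of how your test script might look (untested on my end):

#!/usr/bin/env bash
#SBATCH --ntasks=2 --cpus-per-task=1 --ntasks-per-node=1
#SBATCH --mem-per-cpu=1GB --time=10:00
#SBATCH --gres=gpu:2

# Each srun runs its command once per task, i.e. once on each allocated node.
srun nvidia-smi
srun bash -c 'echo $(hostname -f) $(date) sleeping; sleep 5s; echo $(hostname -f) $(date) exiting'
echo "Done"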

I see that 'man sbatch' on Discovery has an example near the end:

Specify a batch script by filename on the command line. The batch script specifies a 1 minute time limit for the job.

$ cat myscript
#!/bin/sh
#SBATCH --time=1
srun hostname |sort

I tried it and it ran on four different nodes; it looks like a pretty good way to use ‘srun’:

[christay@discovery work]$ sbatch -N4 myscript
Submitted batch job 64414
[christay@discovery work]$ cat slurm-64414.out
d22-52.hpc.usc.edu
d23-10.hpc.usc.edu
d23-11.hpc.usc.edu
d23-12.hpc.usc.edu

A passage in the online Slurm documentation (Slurm Workload Manager - Quick Start User Guide) helped me understand how ‘sbatch’ and ‘srun’ work together:

sbatch is used to submit a job script for later execution. The script will typically contain one or more srun commands to launch parallel tasks.

Also, 'man srun' has an example near the end that I found interesting; it uses ‘srun’ by itself. It’s under the EXAMPLES section and starts with:

This simple example demonstrates the execution of different jobs on different nodes in the same srun.

Using the example I created a script ‘test.sh’:

#!/usr/bin/bash
case $SLURM_NODEID in
     0) echo "I am running on "
        hostname ;;
     1) hostname
        echo "is where I am running" ;;
esac

I ran it as in the example and it produced output from the two allocated nodes; pretty cool.

[christay@discovery work]$ srun --nodes 2 test.sh
e22-09.hpc.usc.edu
is where I am running
I am running on
e22-08.hpc.usc.edu

It uses one of the many environment variables Slurm makes available to the user when the job runs. I think this is interesting because this simple script can produce slightly different output depending on which node it runs on! But I’m easily amused by some things.
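If you want to poke around, something like this should dump the variables that each task sees from inside an allocation:

# Prints the SLURM_* environment for each of the two tasks (one per node).
srun --nodes 2 bash -c 'env | grep ^SLURM | sort'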

As you know, there are a million things ‘sbatch’ and ‘srun’ can do, and those are just two of the tools the Slurm job scheduler makes available. Optimizing parallel computations is a real art as well as a science, so when you get ready to start running your workloads on Discovery, I hope you’ll open a Jira support ticket so our research computing team can work with you. They’re experts in their field and will be able to help you really optimize your code, whether it uses PyTorch or anything else we support (i.e., all the popular research computing tools). The team also does consultations, workshops, and regular office hours (by Zoom, nowadays), so you can contact them or drop in to run your ideas past them and get some great help and pointers.


Thanks for the reply and detailed explanation.
I was tripped up by srun freezing when run inside an srun interactive session.
Now I use srun inside an sbatch script, and it works as expected.
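For anyone who finds this thread later, the rough shape of the script is something like the following (just a sketch; train.py, the port, and the resource numbers are placeholders rather than my exact setup):

#!/usr/bin/env bash
#SBATCH --ntasks=2 --ntasks-per-node=1 --cpus-per-task=1
#SBATCH --mem-per-cpu=4GB --time=1:00:00
#SBATCH --gres=gpu:2

# Use the first allocated node as the DDP rendezvous host.
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=29500

# srun starts one task per node; each task can read SLURM_PROCID and
# SLURM_NTASKS to work out its rank and the world size.
# train.py stands in for the actual DDP entry point.
srun python train.py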