I am trying to use PyTorch Lightning for multi-node training. However, the job stalls at startup: it prints "Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/4" and then hangs. Any idea what might be happening? Here is the SLURM script I am running.
#!/bin/bash
#SBATCH --account=shrikann_35
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --gpus-per-task=1
#SBATCH --mem=8G
#SBATCH --time=0:05:00
module purge
module load gcc/11.3.0
module load nvhpc/22.11
eval "$(conda shell.bash hook)"
conda activate base
python test_distrib_training.py
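One thing I noticed: the log reports a world size of 4 (MEMBER: 1/4), while this allocation only provides 1 node x 2 tasks = 2 ranks. A variant I was planning to try (assuming the Trainer is configured with num_nodes=2 and devices=2; adjust to whatever the script actually uses) requests two nodes and launches through srun so that SLURM actually starts one process per task:

```shell
#!/bin/bash
#SBATCH --account=shrikann_35
#SBATCH --partition=gpu
#SBATCH --nodes=2                 # match Trainer(num_nodes=2)
#SBATCH --ntasks-per-node=2       # one task per GPU, match Trainer(devices=2)
#SBATCH --gpus-per-task=1
#SBATCH --mem=8G
#SBATCH --time=0:05:00

module purge
module load gcc/11.3.0
module load nvhpc/22.11
eval "$(conda shell.bash hook)"
conda activate base

# srun launches one process per allocated task; a bare "python ..."
# starts only a single process, so rank 0 waits forever for the
# other ranks to join the process group.
srun python test_distrib_training.py
```

I have not confirmed this is the cause, but both the missing srun and the node/task mismatch seem consistent with the hang.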
Thanks!
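For context, my understanding of the world-size arithmetic, with hypothetical Trainer numbers inferred from the log line (I have not verified these against the actual script):

```python
# Lightning's DDP world size is num_nodes * devices (per node).
# "MEMBER: 1/4" means rank 0 out of a world size of 4, which would
# correspond to a Trainer configured for 2 nodes x 2 GPUs (assumed):
trainer_num_nodes = 2            # assumed Trainer(num_nodes=2, ...)
trainer_devices = 2              # assumed Trainer(devices=2, ...)
expected_world_size = trainer_num_nodes * trainer_devices

# ...whereas the SBATCH allocation above only provides:
slurm_nodes = 1                  # #SBATCH --nodes=1
slurm_ntasks_per_node = 2        # #SBATCH --ntasks-per-node=2
allocated_ranks = slurm_nodes * slurm_ntasks_per_node

# Rank 0 would then block waiting for ranks 2 and 3, which are never launched.
print(expected_world_size, allocated_ranks)
```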