I am trying to use PyTorch Lightning for multi-node training. However, the job stalls at startup: it prints "Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/4" and then hangs. Any idea what might be happening? Here is the SLURM script I am running.
#!/bin/bash
#SBATCH --account=shrikann_35
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --gpus-per-task=1
#SBATCH --mem=8G
#SBATCH --time=0:05:00
module purge
module load gcc/11.3.0
module load nvhpc/22.11
eval "$(conda shell.bash hook)"
conda activate base
python test_distrib_training.py
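One thing I noticed: the log reports a world size of 4 (MEMBER: 1/4), while this allocation only provides 1 node x 2 tasks = 2 ranks. A variant I was planning to try (assuming the Trainer is configured with num_nodes=2 and devices=2; adjust to whatever the script actually uses) requests two nodes and launches through srun so that SLURM actually starts one process per task:

```shell
#!/bin/bash
#SBATCH --account=shrikann_35
#SBATCH --partition=gpu
#SBATCH --nodes=2                 # match Trainer(num_nodes=2)
#SBATCH --ntasks-per-node=2       # one task per GPU, match Trainer(devices=2)
#SBATCH --gpus-per-task=1
#SBATCH --mem=8G
#SBATCH --time=0:05:00

module purge
module load gcc/11.3.0
module load nvhpc/22.11
eval "$(conda shell.bash hook)"
conda activate base

# srun launches one process per allocated task; a bare "python ..."
# starts only a single process, so rank 0 waits forever for the
# other ranks to join the process group.
srun python test_distrib_training.py
```

I have not confirmed this is the cause, but both the missing srun and the node/task mismatch seem consistent with the hang.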
Thanks!
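For context, my understanding of the world-size arithmetic, with hypothetical Trainer numbers inferred from the log line (I have not verified these against the actual script):

```python
# Lightning's DDP world size is num_nodes * devices (per node).
# "MEMBER: 1/4" means rank 0 out of a world size of 4, which would
# correspond to a Trainer configured for 2 nodes x 2 GPUs (assumed):
trainer_num_nodes = 2            # assumed Trainer(num_nodes=2, ...)
trainer_devices = 2              # assumed Trainer(devices=2, ...)
expected_world_size = trainer_num_nodes * trainer_devices

# ...whereas the SBATCH allocation above only provides:
slurm_nodes = 1                  # #SBATCH --nodes=1
slurm_ntasks_per_node = 2        # #SBATCH --ntasks-per-node=2
allocated_ranks = slurm_nodes * slurm_ntasks_per_node

# Rank 0 would then block waiting for ranks 2 and 3, which are never launched.
print(expected_world_size, allocated_ranks)
```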