How to enable low-latency network (InfiniBand?) for distributed GPU training?

How can I ensure that the allocated nodes are rack-local and/or have a low-latency network interface?

I have heard of InfiniBand interfaces but have never used one so far.

My sbatch script is /scratch2/tnarayan/papers/006-many-to-eng/runs/rtg/slurm-multinode-launch.sh
It uses PyTorch’s DistributedDataParallel (DDP). Please let me know how to enable InfiniBand or a similar low-latency setup for my distributed training.
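For reference, the script itself is not reproduced here; a minimal sketch of a multi-node DDP launch under Slurm (the node count, GPU count, and training entry point below are placeholders, not my actual setup) looks roughly like this:

#!/usr/bin/env bash
#SBATCH --nodes=2                # placeholder node count
#SBATCH --ntasks-per-node=1      # one launcher task per node
#SBATCH --gres=gpu:4             # placeholder GPU count per node

# Rendezvous info for torch.distributed: use the first node of the allocation
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=29500

srun python train.py             # placeholder for the actual training command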

I think I figured it out!
Nodes on the cluster have a network interface called ib0 for InfiniBand.
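One quick way to confirm that the interface exists on a compute node (assuming the standard ip tool is available there) is:

srun --nodes=1 --ntasks=1 ip addr show ib0   # errors out if the node has no ib0 interface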

Since I use the NCCL backend for PyTorch DDP, I should set the NCCL_SOCKET_IFNAME variable before launching the training processes, and set NCCL_DEBUG=INFO to see whether ib0 is really being used.

export NCCL_SOCKET_IFNAME=ib0   # default is eth0
export NCCL_DEBUG=INFO          # use WARN for less verbose output
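These exports need to be in place before the training processes start, so in my case they go into the sbatch script right before the launch line, roughly like this (the srun line is a placeholder for the actual training command):

export NCCL_SOCKET_IFNAME=ib0
export NCCL_DEBUG=INFO
srun python train.py

With NCCL_DEBUG=INFO, NCCL prints its initialization details at startup, including which interface and transport it picked, so the job log should mention ib0 / InfiniBand if the setting took effect.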