How can I ensure that the allocated nodes are rack-local and/or have a low-latency network interface?
I have heard of InfiniBand interfaces but have never used one so far.
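For the rack-locality part, here is a sketch of the kind of allocation constraint I am asking about, assuming the cluster runs Slurm's topology plugin (`--switches` is a standard sbatch option, but I am not sure whether one leaf switch maps to one rack on our site, and the job name here is made up):

```bash
#!/bin/bash
#SBATCH --job-name=ddp-locality-test   # hypothetical job name
#SBATCH --nodes=4
# Ask Slurm for all nodes under a single leaf switch, which often
# approximates "same rack"; wait at most 60 minutes for such a placement,
# after which the job is scheduled without the constraint. This only has
# an effect if the site has Slurm's topology plugin configured.
#SBATCH --switches=1@60
```

Is this the right knob, or is rack placement something only the admins can control?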
My sbatch script launches a training job that uses PyTorch's DistributedDataParallel (DDP). Please let me know how to enable InfiniBand or a similar low-latency setup for my distributed training.
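For the InfiniBand part, this is roughly what I have been experimenting with (a sketch, not my exact script; the interface name `ib0`, the GPU counts, and `train.py` are placeholders for my site). My understanding is that NCCL, which DDP uses for GPU-to-GPU communication, picks up InfiniBand automatically when the devices are visible, so these variables mostly make that explicit and debuggable:

```bash
#!/bin/bash
#SBATCH --job-name=ddp-ib-test     # hypothetical job name
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1        # one torchrun launcher per node
#SBATCH --gpus-per-node=4          # placeholder GPU count

export NCCL_IB_DISABLE=0           # allow InfiniBand transport (0 is the default)
export NCCL_SOCKET_IFNAME=ib0      # route NCCL's bootstrap/socket traffic over IB
export NCCL_DEBUG=INFO             # log which transport (IB vs. plain socket) NCCL chose

# Use the first allocated node as the rendezvous endpoint.
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

srun torchrun \
    --nnodes="$SLURM_NNODES" \
    --nproc_per_node=4 \
    --rdzv_backend=c10d \
    --rdzv_endpoint="$MASTER_ADDR:29500" \
    train.py
```

With `NCCL_DEBUG=INFO` I expect the logs to say whether NCCL selected the IB transport; is checking that output the right way to confirm the low-latency path is actually being used?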