How can I ensure that the allocated nodes are rack-local and/or have a low-latency network interface?
I have heard of InfiniBand interfaces but have never used one so far.
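For the rack-locality part, here is a sketch of the kind of allocation constraint I am asking about, assuming the cluster runs Slurm's topology plugin (`--switches` is a standard sbatch option, but I am not sure whether one leaf switch maps to one rack on our site, and the job name here is made up):

```bash
#!/bin/bash
#SBATCH --job-name=ddp-locality-test   # hypothetical job name
#SBATCH --nodes=4
# Ask Slurm for all nodes under a single leaf switch, which often
# approximates "same rack"; wait at most 60 minutes for such a placement,
# after which the job is scheduled without the constraint. This only has
# an effect if the site has Slurm's topology plugin configured.
#SBATCH --switches=1@60
```

Is this the right knob, or is rack placement something only the admins can control?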
My sbatch script launches a training job that uses PyTorch's DistributedDataParallel (DDP). Please let me know how to enable InfiniBand or a similar low-latency setup for my distributed training.
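For the InfiniBand part, this is roughly what I have been experimenting with (a sketch, not my exact script; the interface name `ib0`, the GPU counts, and `train.py` are placeholders for my site). My understanding is that NCCL, which DDP uses for GPU-to-GPU communication, picks up InfiniBand automatically when the devices are visible, so these variables mostly make that explicit and debuggable:

```bash
#!/bin/bash
#SBATCH --job-name=ddp-ib-test     # hypothetical job name
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1        # one torchrun launcher per node
#SBATCH --gpus-per-node=4          # placeholder GPU count

export NCCL_IB_DISABLE=0           # allow InfiniBand transport (0 is the default)
export NCCL_SOCKET_IFNAME=ib0      # route NCCL's bootstrap/socket traffic over IB
export NCCL_DEBUG=INFO             # log which transport (IB vs. plain socket) NCCL chose

# Use the first allocated node as the rendezvous endpoint.
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

srun torchrun \
    --nnodes="$SLURM_NNODES" \
    --nproc_per_node=4 \
    --rdzv_backend=c10d \
    --rdzv_endpoint="$MASTER_ADDR:29500" \
    train.py
```

With `NCCL_DEBUG=INFO` I expect the logs to say whether NCCL selected the IB transport; is checking that output the right way to confirm the low-latency path is actually being used?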