I am currently using the debug partition to debug my code. My sbatch script usually takes less than 10 minutes, but I need at least 1 GPU, 8 CPUs, and 16GB of memory.
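For reference, a request of this size might look roughly like the sketch below. It assumes the partition is named `debug` and that GPUs are requested with `--gres=gpu:1`. The exact GPU syntax can differ between clusters, and the echoed summary line is only illustrative.

```shell
#!/bin/bash
# Sketch of the resource request described above (assumed flag values).
#SBATCH --partition=debug      # debug partition named in this thread
#SBATCH --gres=gpu:1           # 1 GPU (assumed GRES syntax for this cluster)
#SBATCH --cpus-per-task=8      # 8 CPUs
#SBATCH --mem=16G              # 16GB of memory for the job
#SBATCH --time=00:10:00        # under 10 minutes of wall time

# Print what Slurm actually granted; the defaults keep this runnable outside Slurm.
summary="Job ${SLURM_JOB_ID:-unknown} on node ${SLURMD_NODENAME:-unknown} with ${SLURM_CPUS_PER_TASK:-?} CPUs"
echo "$summary"
```

Requesting only the memory the job needs (here `--mem=16G`) leaves the rest of the node available to other jobs.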
However, my jobs cannot get allocated, for the following reasons:
Many CPU-only jobs are not allocated to the CPU-only debug nodes, such as d05-[41-42]. Instead, they get allocated to GPU nodes first, which does not seem reasonable.
GPU debug users keep requesting huge amounts of memory, such as 248GB. Even if they request only 1 GPU, my job still cannot get allocated because there is not enough memory left on the node.
I understand that the debug partition is very popular, but I don’t think the behaviors above are healthy or reasonable. They block others from using those nodes. I would greatly appreciate it if someone could solve my problem. Thanks.
Unfortunately, we have limited resources, so there will have to be queue times. Generally, over 95% of jobs in the debug partition start within 30 seconds of submission. For February so far, 99% have started within 5 minutes. I think these are reasonable wait times, especially compared with queue times in the gpu partition. We have also set a maximum job time of 1 hour, which I think discourages the monopolization of resources that you are worried about.
The debug partition mostly sits idle, so it was hard to justify having a single a40 node there when it could be more heavily utilized in the gpu partition. We plan to set up utilization monitoring so we can get a better idea of how efficiently resources are being consumed and educate users as needed.
Additionally, we are planning a future purchase to add new resources to the gpu partition. Until then, we will have to be patient with the resources currently available.