GPUs in Debug Partition Cannot Be Allocated

ZhangyuJin · February 13, 2024, 2:36am

I am currently using the debug partition to debug my code. My sbatch script usually takes less than 10 minutes, but I need at least 1 GPU, 8 CPUs, and 16GB of memory.

However, my jobs cannot get allocated for the following reasons:

Many CPU-only jobs are not allocated to the CPU-only debug nodes, such as d05-[41-42]. Instead, they are allocated to GPU nodes first, which does not seem reasonable.
GPU debug users keep requesting huge amounts of memory, such as 248GB. Even if they request only 1 GPU, my job still cannot be allocated because too little memory is left on the node.
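
To illustrate the second point, here is a minimal sketch of the fit check a scheduler has to make. The node size (2 GPUs, 32 CPUs, 256GB) is a hypothetical value chosen only for illustration, and `Resources`, `fits`, and `allocate` are made-up names, not part of Slurm. Only the 248GB and 16GB request sizes come from this post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    gpus: int
    cpus: int
    mem_gb: int


def fits(free: Resources, job: Resources) -> bool:
    """A job fits only if every requested resource is still available."""
    return (free.gpus >= job.gpus
            and free.cpus >= job.cpus
            and free.mem_gb >= job.mem_gb)


def allocate(free: Resources, job: Resources) -> Resources:
    """Return what is left on the node after the job is placed."""
    if not fits(free, job):
        raise ValueError("job does not fit on this node")
    return Resources(free.gpus - job.gpus,
                     free.cpus - job.cpus,
                     free.mem_gb - job.mem_gb)


# Hypothetical node size, assumed for illustration only.
node = Resources(gpus=2, cpus=32, mem_gb=256)
big_mem_job = Resources(gpus=1, cpus=8, mem_gb=248)  # 1 GPU, 248GB
small_job = Resources(gpus=1, cpus=8, mem_gb=16)     # 1 GPU, 8 CPUs, 16GB

remaining = allocate(node, big_mem_job)
print(remaining)                   # one GPU and 24 CPUs are free, but only 8GB
print(fits(remaining, small_job))  # False: 8GB free is less than 16GB requested
```

Under these assumptions the node still has a free GPU and free CPUs, yet the small job cannot start because memory is the exhausted resource.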

I understand that the debug partition is very popular, but I don’t think the above behaviors are healthy or reasonable. They block others from using those nodes. I would greatly appreciate it if someone could solve my problem. Thanks.

csul · February 13, 2024, 6:53pm

Hi,

Unfortunately, we have limited resources so there will have to be queue times. Generally, over 95% of jobs in the debug partition start within 30 seconds of submission. For February so far, 99% start within 5 minutes. I think these are reasonable wait times, especially considering queue times in the gpu partition. We also put a maximum job time of 1 hour, which I think discourages the monopolization of resources you are worried about.

The debug partition mostly sits idle, so it was hard to justify keeping a single a40 node there when it could be more heavily utilized in the gpu partition. We have plans to set up utilization monitoring so we can get a better idea of how efficiently resources are being consumed and educate users as needed.

Additionally, we are planning a future purchase to add new resources to the gpu partition. Until then, we will have to be patient with the resources currently available.

-Cesar

ZhangyuJin · February 14, 2024, 1:15am

Thanks for your reply, but I still hope the allocation policy can be improved:

CPU-only jobs should be launched on CPU-only nodes first, then on GPU nodes.
1-GPU jobs should request at most half of the CPUs and memory of a single node.
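
The two suggestions above can be sketched as a small policy model. This is only an illustration, not how Slurm is actually configured: the node names and sizes are hypothetical, `candidate_order` and `cap_single_gpu` are made-up helpers, and clamping an oversized request is an assumption (a real scheduler might reject the job instead).

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Resources:
    gpus: int
    cpus: int
    mem_gb: int


def candidate_order(job: Resources, nodes: dict) -> list:
    """Return node names in the order they should be tried for this job."""
    if job.gpus == 0:
        # CPU-only job: CPU-only nodes sort first (False < True), GPU nodes last.
        return sorted(nodes, key=lambda name: nodes[name].gpus > 0)
    # GPU job: only nodes with enough GPUs are candidates.
    return [name for name in nodes if nodes[name].gpus >= job.gpus]


def cap_single_gpu(job: Resources, node_total: Resources) -> Resources:
    """Clamp a 1-GPU job to at most half of the node's CPUs and memory."""
    if job.gpus != 1:
        return job
    return replace(job,
                   cpus=min(job.cpus, node_total.cpus // 2),
                   mem_gb=min(job.mem_gb, node_total.mem_gb // 2))


# Hypothetical node names and sizes, assumed for illustration only.
nodes = {
    "gpu-node": Resources(gpus=2, cpus=32, mem_gb=256),
    "cpu-node": Resources(gpus=0, cpus=32, mem_gb=256),
}

cpu_job = Resources(gpus=0, cpus=4, mem_gb=8)
print(candidate_order(cpu_job, nodes))  # CPU-only node is tried first

greedy_job = Resources(gpus=1, cpus=8, mem_gb=248)
print(cap_single_gpu(greedy_job, nodes["gpu-node"]))  # memory clamped to 128GB
```

With a cap like this, a 1-GPU job on the assumed 2-GPU node would always leave room for a second 1-GPU job.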