Cgroup out-of-memory handler

luisalbe · September 29, 2022, 7:17pm

Hi!

I submitted a job with these configuration parameters:

#SBATCH --ntasks=32
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=1
#SBATCH --time=1:00:00
#SBATCH --mem-per-cpu=7GB
#SBATCH --partition=debug

But I get an out-of-memory error. How can I increase the memory in my request so that my job can run? I saw in a previous post that I can request the largemem partition with --mem=0. I already tried that, but the request has not been approved yet. Is there any way to use the debug partition so the request is approved faster while still using more memory?

Thank you!

dstrong · September 29, 2022, 8:32pm

You could increase the --mem-per-cpu request or use --mem=0 to request all memory on a node. Use the nodeinfo command to see the different node configurations. You may need to target nodes with more memory, like the epyc-64 nodes.
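As a sketch of the --mem=0 option applied to the original request (assuming the rest of the job stays the same), only the memory line changes. Slurm treats --mem, --mem-per-cpu, and --mem-per-gpu as mutually exclusive, so --mem-per-cpu is removed rather than kept alongside --mem=0. The --constraint line is an assumption and is left commented out: it only applies if "epyc-64" is actually defined as a node feature on your cluster.

```bash
#!/bin/bash
# Same layout as the original request: 32 tasks spread 8 per node (4 nodes).
#SBATCH --ntasks=32
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=1
#SBATCH --time=1:00:00
# --mem=0 asks for all the memory on each allocated node.
# It replaces --mem-per-cpu; the two options cannot be combined.
#SBATCH --mem=0
#SBATCH --partition=debug
# Assumption: uncomment only if "epyc-64" is a node feature on your cluster
# (check nodeinfo or the site documentation first). Slurm ignores "##SBATCH".
##SBATCH --constraint=epyc-64
```

Note that --mem=0 on a node with less memory than the job needs will still run out of memory, so checking nodeinfo for the node's total memory comes first.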

luisalbe · September 29, 2022, 10:26pm

If I do not want to use --mem=0, how do I determine the maximum --mem-per-cpu? In the example above, I was given 4 nodes in the debug partition, and each had:

CPUs/node = 16
Memory/node = 59 GB

So I'd have 59/16 ≈ 3.7 GB per CPU, right? However, I requested 7GB per CPU and the job still ran. How do I know whether I can request more memory per CPU in the debug partition?

dstrong · September 29, 2022, 11:10pm

With those original sbatch options, the job requested 4 nodes, 8 CPUs per node, and 56 GB of memory per node.
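The arithmetic behind those numbers, worked out from the original directives and the 59 GB per node reported for those debug nodes:

```latex
\begin{aligned}
\text{nodes} &= \frac{\text{ntasks}}{\text{ntasks-per-node}} = \frac{32}{8} = 4 \\
\text{CPUs per node} &= \text{ntasks-per-node} \times \text{cpus-per-task} = 8 \times 1 = 8 \\
\text{memory per node} &= \text{CPUs per node} \times \text{mem-per-cpu} = 8 \times 7\ \text{GB} = 56\ \text{GB} \le 59\ \text{GB}
\end{aligned}
```

Because only 8 of the 16 CPUs on each node were requested, the 56 GB request still fits within the node's 59 GB.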

Max memory per CPU would depend on how many CPUs you request per node and the total memory that node has. Even if you request 1 CPU on a node, you could still request all the memory on that node.
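To make that concrete, here is a minimal sketch of the calculation. `max_mem_per_cpu_mb` is a hypothetical helper, not a Slurm or CARC command. It divides a node's memory by the CPUs you request on that node; for the original job that divisor is 8 rather than 16, so the ceiling is about 7.4 GB per CPU instead of 3.7 GB. The node memory shown by nodeinfo may be rounded, so treat the result as an estimate and leave some headroom.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not a Slurm or CARC command): estimate the largest
# --mem-per-cpu value, in MB, that still fits on one node.
#   $1 = node memory in GB (e.g. as reported by nodeinfo)
#   $2 = CPUs requested per node (ntasks-per-node * cpus-per-task)
max_mem_per_cpu_mb() {
  local node_mem_gb=$1
  local cpus_per_node=$2
  # Slurm's default memory unit is MB, and 1G = 1024M.
  # Bash integer division rounds down, which keeps the estimate on the safe side.
  echo $(( node_mem_gb * 1024 / cpus_per_node ))
}

# Original job: 8 tasks per node x 1 CPU per task on a 59 GB debug node.
max_mem_per_cpu_mb 59 $(( 8 * 1 ))    # prints 7552 (about 7.4 GB)

# Dividing by all 16 CPUs on the node gives the smaller 3.7 GB figure.
max_mem_per_cpu_mb 59 16              # prints 3776

# Requesting a single CPU still allows (roughly) the whole node's memory.
max_mem_per_cpu_mb 59 1               # prints 60416
```

With 8 CPUs per node, --mem-per-cpu=7G (56 GB per node) fits, while 8G would need 64 GB per node and would not fit on these 59 GB nodes.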