Cgroup out-of-memory handler


I submitted a job with these configuration parameters:

#SBATCH --ntasks=32
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=1
#SBATCH --time=1:00:00
#SBATCH --mem-per-cpu=7GB
#SBATCH --partition=debug

But I get an out-of-memory error. How can I increase the memory in my request so that my job can run? I saw in a previous post that I can request the largemem partition with --mem=0. I already tried that, but the request has not been approved yet. Is there any way to use the debug partition to have the request approved faster while using more memory?

Thank you!

You could increase the --mem-per-cpu request or use --mem=0 to request all memory on a node. Use the nodeinfo command to see the different node configurations. You may need to target nodes with more memory, like the epyc-64 nodes.
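As a minimal sketch of the --mem=0 option, assuming every other setting from the original script stays unchanged, the header could look like the fragment below. Note that --mem and --mem-per-cpu are mutually exclusive, so the --mem-per-cpu line is dropped here.

```shell
#!/bin/bash
# Sketch only: same request as the original script, but asking for all
# memory on each allocated node instead of a fixed amount per CPU.
#SBATCH --ntasks=32
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=1
#SBATCH --time=1:00:00
#SBATCH --mem=0
#SBATCH --partition=debug
```

With this request, the memory available to the job depends on which nodes it lands on, so it is still worth checking the node configurations first.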

If I do not want to use --mem=0, how do I determine the maximum --mem-per-cpu? In the example above, I was given 4 nodes in the debug partition, each with:

  1. CPUs/node = 16
  2. Memory/node = 59 GB

Thus I’d have 59/16 ≈ 3.7 GB of memory per CPU, right? However, I requested 7GB and I could still run the job. How do I know whether I can request more memory per CPU in the debug partition?

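Those original sbatch options requested 4 nodes (32 tasks at 8 tasks per node), 8 CPUs per node, and 56 GB of memory per node (8 CPUs × 7 GB per CPU). That fits within the 59 GB available on each of those nodes, which is why the request was accepted.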

The maximum memory per CPU depends on how many CPUs you request per node and how much total memory that node has. Even if you request only 1 CPU on a node, you can still request all the memory on that node.
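The arithmetic behind this can be sketched in a few lines of shell. The node figures are the ones quoted in this thread (16 CPUs and 59 GB per node); the variable names are purely illustrative, and converting GB to MB with a factor of 1024 is an assumption, so the exact limit on a given node may differ slightly.

```shell
#!/bin/bash
# Figures from this thread; adjust them to match the nodes you are targeting.
node_mem_gb=59               # memory per node
cpus_per_node_total=16       # CPUs per node
cpus_requested_per_node=8    # from --ntasks-per-node=8 and --cpus-per-task=1
mem_per_cpu_gb=7             # from --mem-per-cpu=7GB

# Memory the original request asks for on each node.
requested_per_node_gb=$(( cpus_requested_per_node * mem_per_cpu_gb ))

# Rough upper bound on --mem-per-cpu (in MB) when requesting 8 CPUs per node.
# Assumes 1 GB = 1024 MB; integer division rounds down.
max_mem_per_cpu_mb=$(( node_mem_gb * 1024 / cpus_requested_per_node ))

# The same bound if all 16 CPUs on the node were requested.
max_mem_per_cpu_all_mb=$(( node_mem_gb * 1024 / cpus_per_node_total ))

echo "Requested per node: ${requested_per_node_gb} GB"
echo "Max --mem-per-cpu with ${cpus_requested_per_node} CPUs/node: ${max_mem_per_cpu_mb} MB"
echo "Max --mem-per-cpu with ${cpus_per_node_total} CPUs/node: ${max_mem_per_cpu_all_mb} MB"
```

In other words, the 3.7 GB per CPU figure only applies when all 16 CPUs on a node are requested. With 8 CPUs per node the bound is roughly twice that, and fewer CPUs per node raise it further.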