Partition with the largest memory

Hi, I keep running out of memory on my jobs. I wanted to know which partition has more than 100GB of memory. I tried the main partition with a 100GB limit and my job hit that amount. Which specifications should I give as well? This was the seff output from my most recent job:

Job ID: 926978
Cluster: discovery
User/Group: mohazzab/mohazzab
State: OUT_OF_MEMORY (exit code 0)
Cores: 1
CPU Utilized: 03:28:50
CPU Efficiency: 99.74% of 03:29:23 core-walltime
Job Wall-clock time: 03:29:23
Memory Utilized: 97.78 GB
Memory Efficiency: 97.78% of 100.00 GB

It wouldn’t let me run this:

#!/bin/bash
#SBATCH --time=18:30:00
#SBATCH --partition=epyc-64
#SBATCH --mem-per-cpu=256GB
#SBATCH --export=none
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=mohazzab@usc.edu
#SBATCH --output=adult_cis.log # Standard output and error log

Please specify which partition I should use. Thanks

There are two parts to your question:

  1. Which nodes have > 100GB Mem?

For information about node types, you can query Slurm using the sinfo command. Since you’re interested in memory, you can run something like this:

sinfo -o "%40N %20P %5D %c %m  %f" | sort -hk 5
NODELIST                                 PARTITION            NODES CPUS MEMORY  AVAIL_FEATURES
d17-[03-44],d18-[01-38],d22-[51-52],d23- main*                168   20 63400+  xeon-2640v4
e01-[46,48,52,62,64,76,78],e02-[40-41,43 oneweek              16    16 63400+  xeon-2650v2
e01-60,e05-[42,76,78,80]                 debug                5     16 63400  xeon-2650v2
e06-[01-22,24],e07-[01-16,18],e09-18,e10 main*                78    16 63400  xeon-2640v3
d05-[06-15,26-42],d06-[15-29],d11-[09-47 main*                81    24 94000+  xeon-4116
d11-[02-04],d13-[02-11],d14-[03-18]      main*                29    32 191000  xeon-6130
b22-[01-32]                              epyc-64              32    64 256000  epyc-7542

The main and epyc-64 partitions have nodes with > 100GB memory.
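
If you only want to see the nodes with enough memory, you can filter that output, e.g. (a sketch; the %m column reports memory in MB, so 100000 is roughly 100GB):

# List individual nodes whose configured memory is above ~100GB
sinfo -N -o "%N %P %m" | awk 'NR > 1 && $3+0 > 100000'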

  2. Why doesn’t your resource request work?

Even though our epyc-64 nodes have 256GB of memory, not all of it is usable: we reserve some of it for the operating system. You probably got a message similar to this:

error: Job submit/allocate failed: Requested node configuration is not available

That’s because no compute node exists that can satisfy your request (256GB per CPU). If you decrease your memory request, it should work; in this specific case, please try requesting less than 248GB.
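
For example, adapting your script, something like this should fit on an epyc-64 node (a sketch; 248GB is the usable-memory figure mentioned above, and --mem requests total memory for the job rather than per CPU):

#!/bin/bash
#SBATCH --time=18:30:00
#SBATCH --partition=epyc-64
#SBATCH --mem=248GB             # total job memory, within what an epyc-64 node can actually provide
#SBATCH --export=none
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=mohazzab@usc.edu
#SBATCH --output=adult_cis.log  # Standard output and error log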

Thank you!

Another question… I split my job into two datasets to get this process done with less memory, but I got kicked out again for memory issues. Looking at the job diagnostics, it appears that I wasn’t even at the threshold. What happened?

(base) [mohazzab@discovery2 adult]$ seff 932130
Job ID: 932130
Cluster: discovery
User/Group: mohazzab/mohazzab
State: OUT_OF_MEMORY (exit code 0)
Cores: 1
CPU Utilized: 02:04:27
CPU Efficiency: 99.93% of 02:04:32 core-walltime
Job Wall-clock time: 02:04:32
Memory Utilized: 81.35 GB
Memory Efficiency: 81.35% of 100.00 GB

This Grafana page shows the memory utilization of the compute node you were using around the time your job was scheduled.

Near the end of your job you can see jumps in the node’s total memory usage (which includes other users). The jumps keep getting larger, so it’s possible that your program was at ~80GB, tried to allocate more, and failed; Slurm’s accounting samples memory usage periodically, so a short spike like that may not be captured in the seff numbers. It’s worth increasing your memory request and seeing if that works.
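
If you want to compare what Slurm recorded against what you requested, sacct can show both (a sketch using standard accounting fields):

# Requested vs. peak recorded memory for job 932130 and its steps
sacct -j 932130 --format=JobID,State,ReqMem,MaxRSS,Elapsed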

Thanks for all your help!