OOM issue in batch mode

Hi everyone, I submitted a job via sbatch but it ended up with an OOM issue:

slurmstepd: error: Detected 5 oom-kill event(s) in step 464046.batch cgroup. 
Some of your processes may have been killed by the cgroup out-of-memory handler.

Strangely, the same job runs fine under interactive mode (srun).

No matter how large or how small I set --mem-per-cpu or --mem, the job always gets killed after running for about 20 seconds due to OOM.

The following is my job script (train.sh):

#!/bin/bash

# SBATCH --gres=gpu:k40:1
# SBATCH --ntasks=1
# SBATCH --mem=16GB
# SBATCH --cpus-per-task=8
# SBATCH --time=0:10:00

module load gcc/8.3.0
module load anaconda3
module load cuda/10.1.243
module load cudnn/8.0.2-10.1

eval "$(conda shell.bash hook)"
conda activate diffart

python -m dev.models.racnn.train

I submit it with:

sbatch train.sh

Please let me know if anyone has an idea of what is happening.
(This happens on Discovery.)

Hello,

Can you try using srun in your train.sh Slurm job script:
srun python -m dev.models.racnn.train

Thanks for your suggestion, but I still get confusing errors:

srun python -m dev.models.racnn.train

srun: job 477154 queued and waiting for resources
srun: job 477154 has been allocated resources
  0%|          | 0/300000 [00:00<?, ?it/s]

slurmstepd: error: Detected 6 oom-kill event(s) in step 477154.0 cgroup. 
Some of your processes may have been killed by the cgroup out-of-memory handler.

srun: error: d11-04: task 0: Out Of Memory

The results and error are exactly the same as those from sbatch.

Running it directly with srun and explicit resource flags is even worse:

srun \
  --gres=gpu:k40:1 \
  --ntasks=1 \
  --mem=32GB \
  --cpus-per-task=8 \
  --time=0:10:00 \
    python -m dev.models.racnn.train
Traceback (most recent call last):
  File "/home1/chincheh/.conda/envs/diffart/lib/python3.6/runpy.py", line 183, in _run_module_as_main
    mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
  File "/home1/chincheh/.conda/envs/diffart/lib/python3.6/runpy.py", line 109, in _get_module_details
    __import__(pkg_name)
  File "/scratch/chincheh/gard-adversarial-audio/dev/models/__init__.py", line 1, in <module>
    from dev.models.racnn.racnn import RawAudioCNN
  File "/scratch/chincheh/gard-adversarial-audio/dev/models/racnn/racnn.py", line 1, in <module>
    import torch.nn as nn
  File "/home1/chincheh/.conda/envs/diffart/lib/python3.6/site-packages/torch/__init__.py", line 16, in <module>
    import ctypes
  File "/home1/chincheh/.conda/envs/diffart/lib/python3.6/ctypes/__init__.py", line 538, in <module>
    _reset_cache()
  File "/home1/chincheh/.conda/envs/diffart/lib/python3.6/ctypes/__init__.py", line 273, in _reset_cache
    CFUNCTYPE(c_int)(lambda: None)
MemoryError
srun: error: e07-03: task 0: Exited with exit code 1

The module I wrote was not even imported correctly, and the error message is still confusing.

Hello,

Just to be sure: there should not be a space between ‘#’ and ‘SBATCH’ in your Slurm job script. With the space, sbatch treats those lines as plain comments and ignores the resource requests, so the job runs with the default memory limit. It should look like the following:

#!/bin/bash
#SBATCH --gres=gpu:k40:1
#SBATCH --ntasks=1
#SBATCH --mem=16GB
#SBATCH --cpus-per-task=8
#SBATCH --time=0:10:00

module load gcc/8.3.0
module load anaconda3
module load cuda/10.1.243
module load cudnn/8.0.2-10.1

eval "$(conda shell.bash hook)"
conda activate diffart

srun python3 -m dev.models.racnn.train
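
Once the directives are actually being read, you can double-check that the requested memory was applied. A minimal sketch, assuming job accounting (sacct) is available on Discovery; <jobid> is a placeholder for your own job ID:

# After the job finishes: compare requested memory (ReqMem) with peak usage (MaxRSS)
sacct -j <jobid> --format=JobID,State,ReqMem,MaxRSS,Elapsed

# Or from inside the running job script: print what Slurm allocated to this job
scontrol show job $SLURM_JOB_ID | grep -i mem

If ReqMem shows the cluster default instead of the 16GB you requested, Slurm never saw the directives.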

It solved the issue. Thank you so much!
