Hi everyone, I submitted a job via sbatch,
but it ended up with an out-of-memory (OOM) error:
slurmstepd: error: Detected 5 oom-kill event(s) in step 464046.batch cgroup.
Some of your processes may have been killed by the cgroup out-of-memory handler.
Strangely, the same job runs fine in interactive mode (srun).
No matter how large or small I set --mem-per-cpu or --mem, the job is always killed by OOM after running for about 20 seconds.
The following is my job script (train.sh):
#!/bin/bash
# SBATCH --gres=gpu:k40:1
# SBATCH --ntasks=1
# SBATCH --mem=16GB
# SBATCH --cpus-per-task=8
# SBATCH --time=0:10:00
module load gcc/8.3.0
module load anaconda3
module load cuda/10.1.243
module load cudnn/8.0.2-10.1
eval "$(conda shell.bash hook)"
conda activate diffart
python -m dev.models.racnn.train
I submit it with:
sbatch train.sh
Please let me know if anyone has any idea what is happening.
(This happens on Discovery.)
Hello,
Can you try using srun in your train.sh Slurm job script?
srun python -m dev.models.racnn.train
Thanks for your suggestion, but I still get a confusing error:
srun python -m dev.models.racnn.train
srun: job 477154 queued and waiting for resources
srun: job 477154 has been allocated resources
0%| | 0/300000 [00:00<?, ?it/s]
slurmstepd: error: Detected 6 oom-kill event(s) in step 477154.0 cgroup.
Some of your processes may have been killed by the cgroup out-of-memory handler.
srun: error: d11-04: task 0: Out Of Memory
The results and error are exactly the same as those from sbatch.
The following is even worse:
srun \
--gres=gpu:k40:1 \
--ntasks=1 \
--mem=32GB \
--cpus-per-task=8 \
--time=0:10:00 \
python -m dev.models.racnn.train
Traceback (most recent call last):
File "/home1/chincheh/.conda/envs/diffart/lib/python3.6/runpy.py", line 183, in _run_module_as_main
mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
File "/home1/chincheh/.conda/envs/diffart/lib/python3.6/runpy.py", line 109, in _get_module_details
__import__(pkg_name)
File "/scratch/chincheh/gard-adversarial-audio/dev/models/__init__.py", line 1, in <module>
from dev.models.racnn.racnn import RawAudioCNN
File "/scratch/chincheh/gard-adversarial-audio/dev/models/racnn/racnn.py", line 1, in <module>
import torch.nn as nn
File "/home1/chincheh/.conda/envs/diffart/lib/python3.6/site-packages/torch/__init__.py", line 16, in <module>
import ctypes
File "/home1/chincheh/.conda/envs/diffart/lib/python3.6/ctypes/__init__.py", line 538, in <module>
_reset_cache()
File "/home1/chincheh/.conda/envs/diffart/lib/python3.6/ctypes/__init__.py", line 273, in _reset_cache
CFUNCTYPE(c_int)(lambda: None)
MemoryError
srun: error: e07-03: task 0: Exited with exit code 1
The module I wrote was not even imported correctly, and the error message is just as confusing.
Hello,
Just to be sure: there should not be a space between ‘#’ and ‘SBATCH’ in your Slurm job script. With a space, Slurm treats the line as an ordinary comment and silently ignores the directive, so your job ran with the cluster's default (small) memory limit no matter what you set. The script should look like the following:
#!/bin/bash
#SBATCH --gres=gpu:k40:1
#SBATCH --ntasks=1
#SBATCH --mem=16GB
#SBATCH --cpus-per-task=8
#SBATCH --time=0:10:00
module load gcc/8.3.0
module load anaconda3
module load cuda/10.1.243
module load cudnn/8.0.2-10.1
eval "$(conda shell.bash hook)"
conda activate diffart
srun python3 -m dev.models.racnn.train
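As a side note, a quick way to catch this typo before submitting is to grep the script for directives that Slurm will treat as plain comments. A minimal sketch; the file name check_me.sh and its contents are hypothetical sample values re-creating the original mistake:

```shell
# Re-create a script containing the broken "# SBATCH" form (sample content,
# mirroring the typo in the original train.sh).
cat > check_me.sh <<'EOF'
#!/bin/bash
# SBATCH --mem=16GB
#SBATCH --time=0:10:00
EOF

# Any hit here is a directive that Slurm parses as a plain comment
# and silently ignores.
grep -n '^#[[:space:]]\+SBATCH' check_me.sh
```

Here the grep flags line 2 (`# SBATCH --mem=16GB`) and leaves the correctly written `#SBATCH --time` line alone, so an empty result means the header is at least syntactically clean.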
It solved the issue. Thank you so much!