Signalling a job before time limit is reached

sagendor · January 21, 2021, 2:20am

I’m trying to signal my job so that it can save its state to disk before Slurm kills the process when it runs out of time. I’m using the --signal flag in my batch submission file like so:

#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --mem=0
#SBATCH --time=00:3:30 
#SBATCH --partition=debug 
#SBATCH --signal=USR1@30
#SBATCH --output=test.out
#SBATCH --job-name=test

echo "Job started!"
python ~/scratch/ascript.py
echo "Job ended!"

The idea here is that Slurm should send SIGUSR1 to all running processes 30 seconds before the job ends. The Python script looks like this:

import signal
import time

def handler(signum, frame):
    print('Signal handler called with signal', signum)
    exit(0)

signal.signal(signal.SIGUSR1, handler)
signal.signal(signal.SIGTERM, handler)
signal.signal(signal.SIGINT, handler)

# do nothing
print("Going to sleep...")
time.sleep(100000)

The script should catch the signal SIGUSR1 which I specified in my slurm file, and then execute the function “handler”, but it’s not working. When I run this script on the terminal and use Ctrl+C to send the SIGINT signal, it works.

Am I missing something here? Do I need to know something about how slurm has been configured on Discovery?

sagendor · January 21, 2021, 9:25pm

If anyone has a suggestion for a better way to do this I am open to it

sagendor · January 21, 2021, 9:50pm

So I managed to find a somewhat convoluted way to achieve what I wanted, though I would still like to understand why my original attempt didn’t work. Here is a modified submission script:

#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --time=00:3:30 
#SBATCH --partition=debug 
#SBATCH --signal=B:USR1@30
#SBATCH --output=test.out
#SBATCH --job-name=test

# On USR1, INT or TERM: forward SIGTERM to the python process and wait for it to exit
trap 'echo signal received!; kill "${PID}"; wait "${PID}"' USR1 SIGINT SIGTERM

echo "Job started!"

# run python script
python ~/scratch/ascript.py &
PID="$!"
wait "${PID}"

echo "Job ended!"

The idea is that we run the python script (or whatever your job is) in the background and capture its process ID with the $! variable. We then run the wait command, which tells the script to wait until this process finishes before continuing.

The #SBATCH --signal=B:USR1@30 flag means that we will send the signal USR1 to the bash shell (rather than any child processes), where it can be captured by the trap command, which then sends the SIGTERM signal to our actual job via kill, and our job can then handle that signal.

Pretty convoluted, but it seems to work. I wish there were better documentation on how to do this properly.

csul · March 28, 2023, 7:09pm

Hi,

This is a very late response, but I was going to do a blog post on this topic and was reminded of this question while doing research. Here’s a hint from the SchedMD mailing list: we have to launch python with srun. The explanation seems to be that using srun puts your python script into a job step, while launching it directly from the job script only creates a child process of the batch shell. By default, --signal is delivered to job steps only, not to the batch shell or its children; the B: prefix sends it to the batch shell instead.

Anyway, your job script should look like this:

#!/bin/bash
#SBATCH --time=00:01:30
#SBATCH --partition=debug
#SBATCH --signal=USR1

module load gcc/11.3.0
module load python

echo 'job started!'
srun python3 ascript.py
echo 'job ended!'

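Note that an interactive bash shell will try to interpret the ! character (history expansion) if it’s in " quotes but not ' quotes. History expansion is off by default in non-interactive scripts, but single quotes are safe either way.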

ascript.py should look the same:

import signal
import time

def handler(signum, frame):
    print('Signal handler called with signal', signum)
    exit(0)

signal.signal(signal.SIGUSR1, handler)
signal.signal(signal.SIGTERM, handler)
signal.signal(signal.SIGINT, handler)

# do nothing
print("Going to sleep...")
time.sleep(100000)

Finally, for anyone wondering how this is useful: one thing you can do is put functionality in your signal handler to save the state of your program before it gets terminated, so you can pick up where you left off in the next job.
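As a rough sketch of that idea (not the only way to do it), the handler below only sets a flag, and the main loop saves a checkpoint and stops once it sees the flag. Everything here is an assumption for illustration: the JSON checkpoint format, the state dictionary, and the names request_stop, save_checkpoint, run and do_step are all hypothetical.

```python
import json
import os
import signal

stop_requested = False  # set by the signal handler, polled by the main loop


def request_stop(signum, frame):
    """Keep the handler tiny: only record that the warning signal arrived."""
    global stop_requested
    stop_requested = True


def save_checkpoint(state, path):
    """Write to a temporary file, then rename it, so a kill in the middle
    of writing cannot leave a half-written checkpoint behind."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def run(checkpoint_path, total_steps, do_step):
    """Call do_step(state) until total_steps is reached or USR1 arrives.

    Returns True if the work finished, False if it stopped early after
    saving a checkpoint.
    """
    global stop_requested
    stop_requested = False
    signal.signal(signal.SIGUSR1, request_stop)

    # Resume from an earlier job's checkpoint if one exists.
    state = {"step": 0, "total": 0}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)

    while state["step"] < total_steps:
        if stop_requested:
            save_checkpoint(state, checkpoint_path)
            return False
        do_step(state)
        state["step"] += 1
    return True
```

Doing the save in the main loop rather than inside the handler means the state is never written halfway through a step. The trade-off is that a single step has to be shorter than the warning time given to --signal, otherwise the job is killed before the loop gets a chance to check the flag.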