Signalling a job before time limit is reached

I’m trying to signal my job so that it can save its state to disk before Slurm kills the process when it runs out of time. I’m using the --signal flag in my batch submission file like so:

#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --mem=0
#SBATCH --time=00:3:30 
#SBATCH --partition=debug 
#SBATCH --signal=USR1@30
#SBATCH --output=test.out
#SBATCH --job-name=test

echo "Job started!"
python ~/scratch/ascript.py
echo "Job ended!"

The idea is that Slurm should send SIGUSR1 to all running processes 30 seconds before the job ends. The Python script looks like this:

import signal
import sys
import time

def handler(signum, frame):
    # Save state here, then exit cleanly.
    print('Signal handler called with signal', signum)
    sys.exit(0)

signal.signal(signal.SIGUSR1, handler)
signal.signal(signal.SIGTERM, handler)
signal.signal(signal.SIGINT, handler)

# do nothing
print("Going to sleep...")
time.sleep(100000)

The script should catch the SIGUSR1 signal that I specified in my Slurm file and then run the function “handler”, but it’s not working. When I run the script in a terminal and press Ctrl+C to send SIGINT, it works.
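
One way to rule out the Python side is to deliver SIGUSR1 to the process directly, without Slurm involved. The following is a minimal sketch, assuming a Unix system; the `received` list is a hypothetical helper added here so that the handler records the signal instead of exiting.

```python
import os
import signal

received = []  # signal numbers seen by the handler (illustrative helper)

def handler(signum, frame):
    # Record the signal instead of exiting, so the check can continue.
    received.append(signum)

signal.signal(signal.SIGUSR1, handler)

# Send SIGUSR1 to this very process, much as `kill -USR1 <pid>`
# would do from another terminal.
os.kill(os.getpid(), signal.SIGUSR1)

print("handler saw:", [signal.Signals(s).name for s in received])
```

If the handler fires here but not inside the job, the signal is probably not reaching the Python process at all, which points at how the signal is delivered rather than at the handler.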

Am I missing something here? Do I need to know something about how Slurm has been configured on Discovery?

If anyone has a suggestion for a better way to do this, I am open to it.

So I managed to find a somewhat convoluted way to achieve what I wanted, though I would still like to understand why my original attempt didn’t work. Here is the modified submission script:

#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --time=00:3:30 
#SBATCH --partition=debug 
#SBATCH --signal=B:USR1@30
#SBATCH --output=test.out
#SBATCH --job-name=test

# On USR1/INT/TERM: forward SIGTERM to the job and wait for it to exit
trap 'echo "signal received!"; kill "${PID}"; wait "${PID}"' USR1 SIGINT SIGTERM

echo "Job started!"

# run python script
python ~/scratch/ascript.py &
PID="$!"
wait "${PID}"

echo "Job ended!"

The idea is that we run the Python script (or whatever your job is) in the background and capture its process ID with the $! variable. We then call wait, which blocks the script until that process finishes. Unlike a foreground command, wait returns as soon as a trapped signal arrives, so the trap runs right away instead of after the job has finished.
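
For readers more at home in Python than in bash, the same launch, remember, wait, and forward pattern can be sketched with subprocess. This is only an illustration, assuming a Unix system: `CHILD_CODE` is a hypothetical stand-in for the real job, and the script sends SIGUSR1 to itself to simulate the warning signal.

```python
import os
import signal
import subprocess
import sys

# Hypothetical stand-in for the real job: exits with status 0 on SIGTERM.
CHILD_CODE = """
import signal, sys, time
signal.signal(signal.SIGTERM, lambda signum, frame: sys.exit(0))
print("ready", flush=True)
time.sleep(60)
"""

def forward(signum, frame):
    # Like the trap: pass the warning on to the child as SIGTERM.
    child.terminate()

# Like `python ascript.py &` followed by PID="$!".
child = subprocess.Popen([sys.executable, "-c", CHILD_CODE],
                         stdout=subprocess.PIPE, text=True)
signal.signal(signal.SIGUSR1, forward)

child.stdout.readline()               # wait until the child has installed its handler
os.kill(os.getpid(), signal.SIGUSR1)  # simulate the warning signal
exit_code = child.wait()              # like `wait "${PID}"`
print("child exit code:", exit_code)
```

The parent does nothing with the signal except relay it, so the child alone decides how to shut down.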

The #SBATCH --signal=B:USR1@30 flag tells Slurm to send USR1 to the batch shell only (rather than to any child processes), where the trap command catches it. The trap then sends SIGTERM (the default signal of kill) to our actual job, which can handle that signal, and waits for the job to exit.
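
On the Python side, handling that SIGTERM so that state actually gets saved could look roughly like the sketch below. It is an assumption-laden illustration, not the original script: the `state` dictionary, the `run` function, and the checkpoint path are all hypothetical, and the loop sends SIGTERM to itself at step 3 only to simulate what the trap's kill would deliver.

```python
import json
import os
import signal
import tempfile

stop_requested = False

def request_stop(signum, frame):
    # Only set a flag; the main loop does the actual saving.
    global stop_requested
    stop_requested = True

signal.signal(signal.SIGTERM, request_stop)

def run(checkpoint_path, max_steps=1000):
    state = {"step": 0}  # placeholder for the real job state
    while state["step"] < max_steps and not stop_requested:
        state["step"] += 1  # one unit of placeholder work
        if state["step"] == 3:
            # Simulate the SIGTERM that the trap's `kill` would send.
            os.kill(os.getpid(), signal.SIGTERM)
    with open(checkpoint_path, "w") as f:
        json.dump(state, f)  # save state before exiting
    return state

# Hypothetical checkpoint location.
checkpoint = os.path.join(tempfile.gettempdir(), "checkpoint.json")
final_state = run(checkpoint)
print("stopped at step", final_state["step"])
```

Setting a flag in the handler and saving from the main loop keeps the handler short and avoids writing a checkpoint while the state is half-updated.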

Pretty convoluted, but it seems to work. I wish there was better documentation on how to do this properly.