Slurm provides a number of commands for monitoring your jobs, but their default options and output formats are not always ideal. A few existing open-source scripts take job information from Slurm and, for example, customize the output format.
We’re interested in customizing a few of these scripts to provide alternative Slurm commands to make the process of monitoring your jobs on CARC systems a bit easier. We’re seeking your thoughts on what tools would be useful as well as the best default options and output formats to use. Some examples are shown below.
The jobq command would be a simple alias for squeue -u $USER with different formatting.
$ jobq
483928 epyc-64 job4 PENDING 0:00       (Priority)
481377 epyc-64 job1 RUNNING 1-23:01:39 b22-[18,23-24]
482304 epyc-64 job2 RUNNING 1-16:34:12 b22-[10-12]
483705 epyc-64 job3 RUNNING 1-16:34:12 b22-[25-27]
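As a rough sketch of how such an alias could be defined, here is a shell function that wraps squeue. The name jobq comes from the proposal above; the format string and column widths are assumptions chosen for illustration, not the final version.

```shell
# Hypothetical jobq: list the current user's jobs in a compact layout.
# squeue format fields: %i job ID, %P partition, %j job name, %T state,
# %M time used, %R reason (pending jobs) or node list (running jobs).
# The widths below are illustrative assumptions.
jobq() {
    squeue --user="$USER" --noheader \
        --format="%.8i %.10P %.12j %.9T %.11M %R" "$@"
}
```

Any extra arguments are passed through to squeue, so something like `jobq --partition=main` would still work.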
The jobhist command would take output from sacct and provide a compact overview of your recent job history. By default it would show basic information for jobs from the past 7 days, but specifying a longer period would also be possible. It could also be used instead of jobq or squeue, because pending and running jobs would show up in its output as well.
$ jobhist
JobID        Partition  JobName  State      Start               Elapsed    Timelimit  NNodes NCPUS ReqMem
------------ ---------- -------- ---------- ------------------- ---------- ---------- ------ ----- ----------
481508       debug      interac+ COMPLETED  2021-06-15T15:11:13 00:27:40   00:30:00   1      8     2Gc
482373       debug      mpi-tes+ COMPLETED  2021-06-17T16:58:08 00:01:24   00:30:00   2      32    3Gc
482601       debug      io-test+ COMPLETED  2021-06-18T09:22:02 00:04:00   00:30:00   1      8     16Gn
482659       debug      io-test+ FAILED     2021-06-18T10:23:38 00:00:01   00:30:00   1      8     16Gn
482665       main       b25-tes+ COMPLETED  2021-06-18T10:45:44 00:02:45   00:10:00   1      8     2Gn
483687       oneweek    interac+ FAILED     2021-06-22T11:56:08 00:00:00   01:00:00   1      1     250Gn
483687       oneweek    interac+ FAILED     2021-06-22T11:56:13 00:00:00   01:00:00   1      1     249Gn
483687       oneweek    interac+ CANCELLED+ 2021-06-22T11:56:16 00:00:02   01:00:00   1      1     248Gn
483699       debug      interac+ RUNNING    2021-06-22T12:42:17 00:00:05   00:30:00   1      1     2Gc
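A minimal sketch of what a jobhist wrapper around sacct might look like. The optional argument for the number of days and the exact field widths are assumptions for illustration; the field names are standard sacct fields matching the columns shown above.

```shell
# Hypothetical jobhist: compact job history for the last N days (default 7).
# --allocations prints one line per job allocation instead of one per job step.
# --starttime accepts relative times such as "now-7days".
jobhist() {
    local days="${1:-7}"
    sacct --allocations --starttime="now-${days}days" \
        --format=JobID,Partition,JobName%8,State,Start,Elapsed,Timelimit,NNodes%6,NCPUS%5,ReqMem
}
```

With this sketch, `jobhist` would cover the past 7 days and `jobhist 30` the past 30 days.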
The jobinfo command would take output from sacct for a specific job and provide detailed information about that job. The output could also include job efficiency information, as an alternative to seff.
$ jobinfo 482665
Job ID : 482665
Name : b25-test.job
User : ttrojan
Account : ttrojan_123
Cluster : discovery
Partition : main
Nodes : 1
Nodelist : d05-15
CPUs : 8
GPUs : 0
Exit code : 0:0
Submit time : 2021-06-18T10:45:44
Start time : 2021-06-18T10:45:44
End time : 2021-06-18T10:48:29
Wait time : 00:00:00
Reserved walltime : 00:10:00
Used walltime : 00:02:45
Used CPU time : 00:04:02
% User (computation) : 97.01%
% System (I/O) : 2.99%
Mem reserved : 2G/node
Max mem used : 831.57M (d05-15)
Max disk write : 634.88K (d05-15)
Max disk read : 20.16M (d05-15)
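One way such a command could be put together, as a sketch only: ask sacct for machine-readable output and turn the header row and the data row into "field : value" lines. The selection of fields and the label width are assumptions for illustration, and this version leaves out the efficiency and memory details shown above, which would need per-step accounting data.

```shell
# Hypothetical jobinfo: print one job's accounting record as "field : value" lines.
# --parsable2 gives "|"-delimited output: a header row, then one row per record.
jobinfo() {
    sacct --jobs="$1" --allocations --parsable2 \
        --format=JobID,JobName,User,Account,Cluster,Partition,NNodes,NodeList,NCPUS,ExitCode,Submit,Start,End,Timelimit,Elapsed |
    awk -F'|' '
        NR == 1 { for (i = 1; i <= NF; i++) name[i] = $i; next }  # remember field names
        { for (i = 1; i <= NF; i++) printf "%-10s : %s\n", name[i], $i }
    '
}
```

The labels here are the raw sacct field names; a finished script would likely map them to friendlier labels like those in the example output.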
Other scripts
Other scripts are also available that, for example, summarize job efficiency for your recent job history or for a job array, or report GPU usage.
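To make "job efficiency" concrete: CPU efficiency, as reported by tools like seff, is the CPU time a job used divided by the CPU time it reserved (walltime multiplied by the number of allocated CPUs). The sketch below works this out from Slurm-style durations; the helper names to_seconds and cpu_eff are hypothetical and not part of any existing script.

```shell
# Convert a Slurm duration ([D-]HH:MM:SS or MM:SS, optional fraction) to whole seconds.
to_seconds() {
    awk -v t="$1" 'BEGIN {
        days = 0
        if (split(t, d, "-") == 2) { days = d[1]; t = d[2] }
        n = split(t, p, ":")
        secs = 0
        for (i = 1; i <= n; i++) secs = secs * 60 + p[i]
        printf "%d\n", days * 86400 + secs
    }'
}

# Hypothetical helper: cpu_eff <used CPU time> <used walltime> <number of CPUs>
cpu_eff() {
    local used reserved
    used=$(to_seconds "$1")
    reserved=$(( $(to_seconds "$2") * $3 ))
    awk -v u="$used" -v r="$reserved" 'BEGIN {
        if (r > 0) printf "%.2f%%\n", 100 * u / r; else print "n/a"
    }'
}

# Values from the jobinfo example: 00:04:02 CPU time, 00:02:45 walltime, 8 CPUs.
cpu_eff 00:04:02 00:02:45 8   # prints 18.33%
```

In other words, the example job used 242 of the 1320 CPU-seconds it reserved.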
Once we decide on the specific command names, options, and formats to use, we will make the commands available to users. We will also modify and improve them over time and add new ones as needed. You will also be able to copy these scripts and customize them to your own liking.
Let us know what you think!