Custom Slurm commands for monitoring jobs

Slurm provides a number of commands to monitor your jobs, but their default options and output formats are not always ideal. A few existing open-source scripts take job information from Slurm and, for example, customize the output format.

We’re interested in customizing a few of these scripts to provide alternative Slurm commands to make the process of monitoring your jobs on CARC systems a bit easier. We’re seeking your thoughts on what tools would be useful as well as the best default options and output formats to use. Some examples are shown below.

jobq

This command would be a simple alias for squeue -u $USER with different formatting.

$ jobq
  JOBID PARTITION     NAME     STATE       TIME NODELIST(REASON)
 483928   epyc-64     job4   PENDING       0:00 (Priority)
 481377   epyc-64     job1   RUNNING 1-23:01:39 b22-[18,23-24]
 482304   epyc-64     job2   RUNNING 1-16:34:12 b22-[10-12]
 483705   epyc-64     job3   RUNNING 1-16:34:12 b22-[25-27]
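
As a rough illustration of how such a wrapper might look (not the final implementation), here is a minimal sketch. It is written as a shell function rather than a literal alias so that extra squeue options can be passed through. The column widths are assumptions chosen to resemble the output above.

```shell
# Hypothetical sketch of jobq: squeue limited to the current user,
# with a custom column layout. Widths are illustrative assumptions.
jobq() {
    # %i job ID, %P partition, %j job name, %T state (long form),
    # %M time used, %R node list or pending reason
    squeue -u "$USER" -o "%.7i %.9P %.8j %.9T %.10M %R" "$@"
}
```

Any additional arguments are handed to squeue unchanged, so something like jobq -p main would restrict the listing to one partition.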

jobhist

This command would take output from sacct and provide a compact overview of your recent job history. By default it would show basic information for jobs from the past 7 days, with an option to specify a longer period. Because pending and running jobs would also appear in the output, it could be used instead of jobq or squeue.

$ jobhist
       JobID  Partition  JobName      State               Start    Elapsed  Timelimit NNodes NCPUS     ReqMem
------------ ---------- -------- ---------- ------------------- ---------- ---------- ------ ----- ----------
      481508      debug interac+  COMPLETED 2021-06-15T15:11:13   00:27:40   00:30:00      1     8        2Gc
      482373      debug mpi-tes+  COMPLETED 2021-06-17T16:58:08   00:01:24   00:30:00      2    32        3Gc
      482601      debug io-test+  COMPLETED 2021-06-18T09:22:02   00:04:00   00:30:00      1     8       16Gn
      482659      debug io-test+     FAILED 2021-06-18T10:23:38   00:00:01   00:30:00      1     8       16Gn
      482665       main b25-tes+  COMPLETED 2021-06-18T10:45:44   00:02:45   00:10:00      1     8        2Gn
      483687    oneweek interac+     FAILED 2021-06-22T11:56:08   00:00:00   01:00:00      1     1      250Gn
      483687    oneweek interac+     FAILED 2021-06-22T11:56:13   00:00:00   01:00:00      1     1      249Gn
      483687    oneweek interac+ CANCELLED+ 2021-06-22T11:56:16   00:00:02   01:00:00      1     1      248Gn
      483699      debug interac+    RUNNING 2021-06-22T12:42:17   00:00:05   00:30:00      1     1        2Gc
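
One possible way to build this on top of sacct is sketched below. The function body, the optional "days" argument, and the JobName width are assumptions for illustration, and the real command may end up with different options.

```shell
# Hypothetical sketch of jobhist: one line per job allocation from sacct.
jobhist() {
    # Optional first argument: how many days back to look (default 7).
    # -X shows only the job allocation, not the individual job steps.
    sacct -X -S "now-${1:-7}days" \
        -o JobID,Partition,JobName%8,State,Start,Elapsed,Timelimit,NNodes,NCPUS,ReqMem
}
```

With this sketch, jobhist alone covers the past 7 days and jobhist 30 would cover the past 30 days.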

jobinfo

This command would take output from sacct for a specific job and provide detailed information about that job. The output could also include job efficiency information, as an alternative to seff.

$ jobinfo 482665
Job ID               : 482665
Name                 : b25-test.job
User                 : ttrojan
Account              : ttrojan_123
Cluster              : discovery
Partition            : main
Nodes                : 1
Nodelist             : d05-15
CPUs                 : 8
GPUs                 : 0
State                : COMPLETED
Exit code            : 0:0
Submit time          : 2021-06-18T10:45:44
Start time           : 2021-06-18T10:45:44
End time             : 2021-06-18T10:48:29
Wait time            : 00:00:00
Reserved walltime    : 00:10:00
Used walltime        : 00:02:45
Used CPU time        : 00:04:02
% User (computation) : 97.01%
% System (I/O)       :  2.99%
Mem reserved         : 2G/node
Max mem used         : 831.57M (d05-15)
Max disk write       : 634.88K (d05-15)
Max disk read        : 20.16M (d05-15)
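
A simplified sketch of the idea is shown below: it asks sacct for pipe-delimited fields for one job and prints them as "label : value" lines. The function body and the choice of fields are assumptions. It covers only the allocation-level fields, since usage figures such as maximum memory are recorded per job step and would need extra handling.

```shell
# Hypothetical sketch of jobinfo: detailed view of a single job.
jobinfo() {
    # -X: allocation only, -n: no header, -P: pipe-delimited output
    sacct -X -n -P -j "$1" \
        -o JobID,JobName,User,Account,Cluster,Partition,NNodes,NodeList,State,ExitCode,Submit,Start,End,Timelimit,Elapsed |
    awk -F'|' '{
        # Labels are listed in the same order as the sacct fields above.
        split("Job ID|Name|User|Account|Cluster|Partition|Nodes|Nodelist|State|Exit code|Submit time|Start time|End time|Reserved walltime|Used walltime", label, "|")
        for (i = 1; i <= NF; i++) printf "%-20s : %s\n", label[i], $i
    }'
}
```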

Other scripts

Other available scripts summarize job efficiency for your recent job history or for a job array, or report GPU usage, for example.
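
To give a sense of what an efficiency figure could mean, here is a minimal sketch that computes CPU efficiency as used CPU time divided by (walltime x number of CPUs), which is our understanding of how seff reports it. The helper names to_seconds and cpu_efficiency are hypothetical.

```shell
# Convert a Slurm duration of the form [DD-][HH:]MM:SS[.mmm] to whole seconds.
to_seconds() {
    printf '%s\n' "$1" | awk '{
        days = 0; t = $0
        if (index(t, "-") > 0) { split(t, d, "-"); days = d[1]; t = d[2] }
        n = split(t, p, ":")
        secs = 0
        for (i = 1; i <= n; i++) secs = secs * 60 + p[i]
        printf "%d\n", days * 86400 + secs
    }'
}

# Hypothetical helper: cpu_efficiency <used CPU time> <used walltime> <CPUs>
cpu_efficiency() {
    awk -v u="$(to_seconds "$1")" -v w="$(to_seconds "$2")" -v n="$3" 'BEGIN {
        if (w * n == 0) { print "n/a"; exit }   # avoid dividing by zero
        printf "%.2f%%\n", 100 * u / (w * n)
    }'
}
```

Using the figures from the jobinfo example above, cpu_efficiency 00:04:02 00:02:45 8 prints 18.33%, since 242 CPU-seconds were used out of 165 s x 8 CPUs = 1320 available.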

Feedback

Once we decide on the specific command names, options, and formats to use, we will make the commands available to users. We will also improve them over time and add new ones as needed. You would also be able to copy these scripts and customize them to your own liking.

Let us know what you think!


I think the command jobhist is great for me.
