Getting information about jobs¶
Getting cluster(s) state¶
In Slurm, nodes can be in different states [2]; a node's state determines whether a job can be allocated to it.
State | Description |
---|---|
DOWN | The node is unavailable for use |
ALLOCATED | The node has been allocated to one or more jobs |
IDLE | The node is not allocated to any jobs and is available for use |
MIXED | The node has some of its CPUs ALLOCATED while others are IDLE |
DRAINED | The node is unavailable for use per system administrator request |
MAINT | The node is under maintenance by the system administrator |
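As a quick sketch of how these states can be summarized, the pipeline below counts how many nodes are in each state. The node states here are sample data standing in for real `sinfo` output, so it runs without a cluster.

```shell
# Count how many nodes are in each state. Against a real cluster you would use:
#   sinfo -h -N -o "%t" | sort | uniq -c | sort -rn
# The printf below provides sample states so the sketch runs without Slurm.
printf 'idle\nalloc\nmix\nidle\ndrain\n' | sort | uniq -c | sort -rn
```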
The simplest way to get information about the state of our clusters is with the commands sinfo and squeue. Here we list some useful examples [1] [2] [3].
View information about nodes and partitions; add -N for a node-oriented (longer) listing.
$ sinfo
$ sinfo -N
Show nodes that are in a specific state.
$ sinfo -t idle
$ sinfo -t mix
$ sinfo -t alloc
Report the reason a node is in its current state (if any).
$ sinfo -R
Show queued jobs, and a more detailed (long) version.
$ squeue
$ squeue -l
Note
squeue also includes running jobs.
Show queued jobs of a specific user. In most cases you will need information about your own jobs, so the $USER variable can be handy.
$ squeue -u $USER
$ squeue -u pepito77
Show queued jobs of a specific partition/queue.
$ squeue -p debug
$ squeue -p bigmem
$ squeue -p accel
Show queued jobs that are in a specific state. To learn more about job states see: What’s going on with my job? Getting information about submitted jobs.
$ squeue -t PD
$ squeue -t R
$ squeue -t F
$ squeue -t PR
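A common follow-up is to summarize how many of your jobs sit in each state. The sketch below does this with standard tools; the printed states are sample data standing in for real `squeue` output, so it runs anywhere.

```shell
# Summarize jobs by state. Against a real cluster (assumed access) you would use:
#   squeue -u "$USER" -h -o "%T" | sort | uniq -c
# Sample states below stand in for squeue output so this runs without Slurm.
printf 'PENDING\nRUNNING\nRUNNING\nPENDING\nPENDING\n' | sort | uniq -c
```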
Show detailed information about specific node(s).
$ scontrol show node compute-1-25
$ scontrol show node compute-0-5
$ scontrol show node debug-0-0
Note
If you need further information, you can always check the commands’ manuals: man squeue, man sinfo, etc.
What’s going on with my job? Getting information about submitted jobs¶
Once your job is queued in a specific partition you may want to know its state. Here are some of Slurm’s job states [3].
State | Description |
---|---|
CANCELLED (CA) | Job was explicitly cancelled by the user or system administrator |
COMPLETED (CD) | Job has terminated all processes on all nodes with an exit code of zero |
PENDING (PD) | Job is awaiting resource allocation; this can happen for several different reasons |
RUNNING (R) | Job currently has an allocation |
STOPPED (ST) | Job has an allocation, but execution has been stopped with the SIGSTOP signal. CPUs have been retained by this job |
SUSPENDED (S) | Job has an allocation, but execution has been suspended and CPUs have been released for other jobs |
You can check the expected start time of a job (or jobs) based on the current queue state:
$ squeue --start --job 1234
$ squeue --start -u $USER
You can also check the reason why your job is waiting; it is usually displayed by
default in squeue’s output. You can also change the output format to display
the reason field (%R) more clearly.
$ squeue -u $USER --format="%i %j %u %R"
$ squeue --jobid 1234 --format="%i %j %u %R"
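For pending jobs the %R column holds the reason in parentheses, while running jobs show a node list there, so the reasons can be filtered out mechanically. The sample lines below (hypothetical job ids and names) stand in for real `squeue` output.

```shell
# Keep only the pending jobs (whose %R field starts with an opening parenthesis)
# from "id name user reason" lines. For real use, pipe in:
#   squeue -u "$USER" -h -o "%i %j %u %R"
printf '1234 jobA user1 (Priority)\n1235 jobB user1 compute-0-5\n' \
  | awk '$4 ~ /^\(/ {print $1, $4}'
```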
Note
The reason field is not set only for pending jobs; failed jobs also set it, showing their failure message.
Note
You can also use sprio to find out the priority of your job(s).
For further information see man sprio.
In the following table [3] we describe the most common reasons:
Reason | Description |
---|---|
QOSMaxCpuPerUserLimit | User’s allocated jobs are already using the maximum number of CPUs allowed per user. Once the number of allocated CPUs decrease, the job(s) will start |
Priority | One or more higher priority jobs exist for this queue, usually jobs are allocated with a First In First Out set up, for further information see man sprio |
Resources | The job is waiting for resources (CPUs, memory, nodes, etc.) to become available |
TimeLimit | The job exhausted its time limit |
BadConstraints | The job’s constraints can not be satisfied |
Warning
Regarding the QOSMaxCpuPerUserLimit reason: the maximum amount of resources
(specifically memory and CPUs) that can be allocated at the same time per user
differs between clusters:
- Apolo:
  - CPUs: 96, Memory: 192G
- Cronos:
  - CPUs: 96, Memory: 384G
It is important to note that those are policies defined by Apolo - Centro de Computación Científica.
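Given those limits, a small arithmetic check tells you how many more CPUs you can still request before QOSMaxCpuPerUserLimit kicks in. The `used` value below is a sample; the commented pipeline is one assumed way to obtain it from your running jobs.

```shell
# Sketch: how many more CPUs fit under the 96-CPU-per-user limit on Apolo?
# "used" would normally be the sum of %C over your running jobs, e.g.:
#   squeue -u "$USER" -t R -h -o "%C" | paste -sd+ - | bc
limit=96
used=40        # sample value (hypothetical)
echo "CPUs still available to you: $((limit - used))"
```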
Another useful command to show information about recent jobs is:
$ scontrol show job 1234
Here is an example of its output on Apolo II.
JobId=1234 JobName=CuteJob
UserId=user1(11) GroupId=user1(34) MCS_label=N/A
Priority=2000 Nice=0 Account=ddp QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=2-22:33:43 TimeLimit=4-03:00:00 TimeMin=N/A
SubmitTime=2019-01-29T03:46:05 EligibleTime=2019-01-29T03:46:05
AccrueTime=2019-01-29T03:46:05
StartTime=2019-01-29T15:47:12 EndTime=2019-02-02T18:47:12 Deadline=N/A
PreemptTime=None SuspendTime=None SecsPreSuspend=0
LastSchedEval=2019-01-29T15:47:12
Partition=accel AllocNode:Sid=apolo:2222
ReqNodeList=(null) ExcNodeList=(null)
NodeList=compute-0-5
BatchHost=compute-0-5
NumNodes=1 NumCPUs=32 NumTasks=32 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=32,mem=60000M,node=1,billing=32
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryCPU=1875M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/home/user1/cutejob/slurm.sh
WorkDir=/home/user1/cutejob
StdErr=/home/user1/cutejob/cutejob.1234.err
StdIn=/dev/null
StdOut=/home/user1/cutejob/cutejob.1234.out
Power=
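Since this output is a list of key=value pairs, individual fields can be extracted with standard tools. The echoed line below is a sample taken from the output above; against a real job you would pipe `scontrol show job 1234` instead.

```shell
# Pull a single field (JobState) out of `scontrol show job` output by splitting
# the space-separated key=value pairs onto their own lines.
echo "JobState=RUNNING Reason=None Dependency=(null)" \
  | tr ' ' '\n' | awk -F= '$1 == "JobState" {print $2}'
```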
Note
We also recommend logging in (using ssh) to the compute node(s)
of your job and running htop to check whether your processes are actually
running as you expect and whether the node’s CPU load is optimal.
To know more, see: FAQ
Canceling a job¶
Once your job is submitted, you can perform some operations to change its state. Here we list some useful examples [1] [4].
Cancel job 1234
$ scancel 1234
Cancel only array ID 9 of job array 1234
$ scancel 1234_9
Cancel all my jobs (regardless of their state)
$ scancel -u $USER
Cancel my waiting (pending state) jobs
$ scancel -u $USER -t pending
Cancel the jobs queue on a given partition (queue)
$ scancel -p longjobs
Cancel one or more jobs by name
$ scancel --name MyJobName
Hold the job 1234 (prevent a pending job from starting)
$ scontrol hold 1234
Release the held job 1234
$ scontrol release 1234
Resume the suspended job 1234
$ scontrol resume 1234
Requeue (cancel and resubmit) the job 1234
$ scontrol requeue 1234
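When you only want to cancel jobs whose names match a pattern, you can select their IDs first and then feed them to scancel. The printf below provides sample "id name" pairs (hypothetical jobs) so the selection step runs without a cluster; the commented line shows the assumed real usage.

```shell
# Sketch: find the IDs of your jobs whose name matches a pattern. For real use:
#   squeue -u "$USER" -h -o "%i %j" | awk '$2 ~ /^test_/ {print $1}' | xargs -r scancel
printf '101 test_a\n102 prod_run\n103 test_b\n' \
  | awk '$2 ~ /^test_/ {print $1}'
```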
What happened with my job? Getting information about finished jobs¶
Here we are going to explain how to get information about completed jobs (that are no longer in the queue). Those commands use the Slurm database to get the information.
Note
By default, these commands only search jobs associated with the cluster you are
logged in to. However, if you want to look up a job that was executed on
Cronos while you are in a session on Apolo II, you can do it using the argument
-M slurm-cronos. Other possible options are -M slurm-apolo and -M all.
sacct is used to get general accounting data for all jobs and job steps in Slurm [5]. In case you remember the jobid, you can use:
$ sacct -j 1234
Get information about today’s jobs submitted by a user (or users)
$ sacct -S$(date +'%m/%d/%y') -u $USER
Get information about jobs submitted by a user (or users) within the last week
$ sacct -S$(date +'%m/%d/%y' --date="1 week ago") -u $USER
Get information about the job(s) by its name(s)
$ sacct -S$(date +'%m/%d/%y') --name job_name
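The dates passed to -S in the examples above are built with date(1); the flags shown assume GNU coreutils date, as found on our Linux clusters.

```shell
# Print today's date and the date one week ago in the MM/DD/YY form used by -S.
date +'%m/%d/%y'                      # today
date +'%m/%d/%y' --date="1 week ago"  # one week ago (GNU date extension)
```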
Note
The -S argument selects eligible jobs in any state after the specified time. It is mandatory when searching for jobs if a jobid was not specified. It supports multiple date formats; see man sacct to know more.
References¶
[1] | (1, 2) University of Luxembourg (UL) HPC Team (2018). UL HPC Tutorial: Advanced scheduling with SLURM. Retrieved 16:45 January 28, 2019 from https://ulhpc-tutorials.readthedocs.io/en/latest/scheduling/advanced/ |
[2] | (1, 2) SchedMD LLC (2018). Slurm, resource management [sinfo]. Copy of manual text available at https://slurm.schedmd.com/sinfo.html. Retrieved 14:24 January 31, 2019 |
[3] | (1, 2, 3) SchedMD LLC (2018). Slurm, resource management [squeue]. Copy of manual text available at https://slurm.schedmd.com/squeue.html. Retrieved 12:30 February 1, 2019 |
[4] | SchedMD LLC (2018). Slurm, resource management [scancel]. Copy of manual text available at https://slurm.schedmd.com/scancel.html. Retrieved 15:47 January 31, 2019 |
[5] | SchedMD LLC (2018). Slurm, resource management [sacct]. Copy of manual text available at https://slurm.schedmd.com/sacct.html. Retrieved 8:44 February 4, 2019 |