Getting information about jobs

Getting cluster(s) state

In Slurm, nodes can be in different states [2]; a node's state determines whether jobs can be allocated to it.

Slurm node states
State       Description
DOWN        The node is unavailable for use.
ALLOCATED   The node has been allocated to one or more jobs.
IDLE        The node is not allocated to any jobs and is available for use.
MIXED       The node has some of its CPUs ALLOCATED while others are IDLE.
DRAINED     The node is unavailable for use per system administrator request.
MAINT       The node is under maintenance by the system administrator.
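
The sinfo command, described below, can report this state for every node. For example, the following sketch lists each node with its compact state code and its CPU usage, using the standard sinfo format specifiers %N (node name), %t (compact state) and %C (allocated/idle/other/total CPUs); adjust them to your needs:

$ sinfo -N --format="%N %t %C"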

The simplest way to get information about the state of our clusters is to use the sinfo and squeue commands. Here we list some useful examples [1] [2] [3].

  • View information about nodes and partitions; add -N for a longer, node-oriented listing

    $ sinfo
    $ sinfo -N
    
  • Show nodes that are in a specific state.

    $ sinfo -t idle
    $ sinfo -t mix
    $ sinfo -t alloc
    
  • Report the reason a node is in its current state (if any)

    $ sinfo -R
    
  • Show queued jobs, and a longer report (-l)

    $ squeue
    $ squeue -l
    

    Note

    squeue also includes running jobs.

  • Show queued jobs of a specific user. In most cases you will need information about your own jobs, so the environment variable $USER can come in handy.

    $ squeue -u $USER
    $ squeue -u pepito77
    
  • Show queued jobs of a specific partition/queue.

    $ squeue -p debug
    $ squeue -p bigmem
    $ squeue -p accel
    
  • Show queued jobs that are in a specific state. To know more about job states, see: What’s going on with my job? Getting information about submitted jobs

    $ squeue -t PD
    $ squeue -t R
    $ squeue -t F
    $ squeue -t PR
    
  • Show detailed information about the node(s)

    $ scontrol show node compute-1-25
    $ scontrol show node compute-0-5
    $ scontrol show node debug-0-0
    

Note

If you need further information, you can always check the commands' manual pages: man squeue, man sinfo, etc.

What’s going on with my job? Getting information about submitted jobs

Once your job is queued in a specific partition, you may want to know its state. These are some of Slurm’s job states [3].

Job states
State          Description
CANCELLED (CA) The job was explicitly cancelled by the user or a system administrator.
COMPLETED (CD) The job has terminated all processes on all nodes with an exit code of zero.
PENDING (PD)   The job is awaiting resource allocation; there are several possible reasons (see below).
RUNNING (R)    The job currently has an allocation.
STOPPED (ST)   The job has an allocation, but execution has been stopped with the SIGSTOP signal; CPUs are retained by the job.
SUSPENDED (S)  The job has an allocation, but execution has been suspended and the CPUs have been released for other jobs.
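
For instance, a quick way to see the state code of each of your own jobs is to combine the -u filter with a custom output format. The specifiers below (%i job ID, %j job name, %t compact state, %M elapsed time) are standard squeue format options; adjust them as needed:

$ squeue -u $USER --format="%i %j %t %M"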

You can check the expected start time of your job(s) based on the current queue state:

$ squeue --start --job 1234
$ squeue --start -u $USER

You can also check the reason why your job is waiting; it is usually displayed by default by squeue. You can also change the output format to display the reason field (%R) more clearly.

$ squeue -u $USER --format="%i %j %u %R"
$ squeue --job 1234 --format="%i %j %u %R"

Note

Not only pending jobs set the reason field; failed jobs also set it, showing their failure message.

Note

You can also use sprio to find out the priority of your job(s). For further information see man sprio.
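
For example, assuming 1234 is the ID of one of your pending jobs (the ID is only illustrative), these sketches show the priority components of a single job or of all your jobs:

$ sprio -j 1234
$ sprio -u $USER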

In the following table [3] we describe the most common reasons:

Job reasons
Reason                 Description
QOSMaxCpuPerUserLimit  The user's allocated jobs are already using the maximum number of CPUs allowed per user. Once the number of allocated CPUs decreases, the job(s) will start.
Priority               One or more higher-priority jobs exist for this partition; by default jobs are scheduled First In, First Out. For further information see man sprio.
Resources              The job is waiting for resources (CPUs, memory, nodes, etc.) to become available.
TimeLimit              The job exhausted its time limit.
BadConstraints         The job's constraints cannot be satisfied.

Warning

Regarding the QOSMaxCpuPerUserLimit reason: the maximum amount of resources (specifically CPUs and memory) a user can have allocated at the same time differs between clusters:

  • Apolo: 96 CPUs and 192 GB of memory
  • Cronos: 96 CPUs and 384 GB of memory

It is important to note that those are policies defined by Apolo - Centro de Computación Científica.
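
If you want to check these limits yourself, a sacctmgr query along the lines of the sketch below should show the per-user limits attached to each QOS; note that the exact field name used here (MaxTRESPU) may vary between Slurm versions, so check man sacctmgr if the column comes back empty:

$ sacctmgr show qos format=Name,MaxTRESPU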

Another useful command to show information about recent jobs is:

$ scontrol show job 1234

Here is an example of its output on Apolo II:

JobId=1234 JobName=CuteJob
   UserId=user1(11) GroupId=user1(34) MCS_label=N/A
   Priority=2000 Nice=0 Account=ddp QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=2-22:33:43 TimeLimit=4-03:00:00 TimeMin=N/A
   SubmitTime=2019-01-29T03:46:05 EligibleTime=2019-01-29T03:46:05
   AccrueTime=2019-01-29T03:46:05
   StartTime=2019-01-29T15:47:12 EndTime=2019-02-02T18:47:12 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2019-01-29T15:47:12
   Partition=accel AllocNode:Sid=apolo:2222
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=compute-0-5
   BatchHost=compute-0-5
   NumNodes=1 NumCPUs=32 NumTasks=32 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=32,mem=60000M,node=1,billing=32
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=1875M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/user1/cutejob/slurm.sh
   WorkDir=/home/user1/cutejob
   StdErr=/home/user1/cutejob/cutejob.1234.err
   StdIn=/dev/null
   StdOut=/home/user1/cutejob/cutejob.1234.out
   Power=

Note

We also recommend logging in (using ssh) to the compute node(s) assigned to your job and running htop there, to check that your process(es) are running as you expect and that the node's CPU load is optimal. To know more, see: FAQ
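
A minimal sketch of that workflow, assuming your job ID is 1234 and it was allocated the node compute-0-5 (replace both with your own values):

$ squeue -j 1234 --format="%N"   # list the node(s) allocated to the job
$ ssh compute-0-5                # log in to one of those nodes
$ htop                           # inspect CPU and memory usage interactively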

Canceling a job

Once your job is submitted, you can perform some operations to change its state. Here we list some useful examples [1] [4].

  • Cancel job 1234

    $ scancel 1234
    
  • Cancel only array ID 9 of job array 1234

    $ scancel 1234_9
    
  • Cancel all my jobs (regardless of their state)

    $ scancel -u $USER
    
  • Cancel my waiting (pending state) jobs.

    $ scancel -u $USER -t pending
    
  • Cancel jobs queued on a given partition (queue)

    $ scancel -p longjobs
    
  • Cancel one or more jobs by name

    $ scancel --name MyJobName
    
  • Hold the job 1234 (prevent it from being scheduled)

    $ scontrol hold 1234
    
  • Release the held job 1234 so it can be scheduled again

    $ scontrol release 1234
    
  • Cancel and restart the job 1234

    $ scontrol requeue 1234
    

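These filters can be combined. For example, the following sketch cancels only your pending jobs on the longjobs partition used above:

$ scancel -u $USER -p longjobs -t pending
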
What happened with my job? Getting information about finished jobs

Here we explain how to get information about completed jobs (jobs that are no longer in the queue). These commands query the Slurm accounting database.

Note

By default, these commands only search for jobs associated with the cluster you are logged in to. However, if, for example, you want to look up a job that was executed on Cronos while you are in a session on Apolo II, you can do so with the argument -M slurm-cronos. Other possible options are -M slurm-apolo and -M all.
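
For example, a sketch of looking up a job that ran on Cronos from a session on Apolo II, using the sacct command described below (the job ID 1234 is only illustrative):

$ sacct -M slurm-cronos -j 1234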

  • sacct: used to get general accounting data for all jobs and job steps in Slurm [5].

    • If you remember the job ID, you can use

      $ sacct -j1234
      
    • Get information about today’s jobs submitted by a user (or users)

      $ sacct -S$(date +'%m/%d/%y') -u $USER
      
    • Get information about jobs submitted by a user (or users) within the last week

      $ sacct -S$(date +'%m/%d/%y' --date="1 week ago") -u $USER
      
    • Get information about job(s) by name

      $ sacct -S$(date +'%m/%d/%y') --name job_name
      

    Note

    The -S argument selects jobs in any state that were eligible after the specified time. It is mandatory when no job ID is specified. It supports multiple date formats; see man sacct for details.
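
    When diagnosing what happened to a finished job, it often helps to ask sacct for specific fields. The field names below are standard sacct format fields (you can list all available ones with sacct -e); adjust them to your needs:

      $ sacct -j 1234 --format=JobID,JobName,Partition,State,ExitCode,Elapsed,MaxRSS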

References

[1] University of Luxembourg (UL) HPC Team (2018). UL HPC Tutorial: Advanced scheduling with SLURM. Retrieved 16:45 January 28, 2019 from https://ulhpc-tutorials.readthedocs.io/en/latest/scheduling/advanced/
[2] SchedMD LLC (2018). Slurm, resource management [sinfo]. Copy of manual text available at https://slurm.schedmd.com/sinfo.html. Retrieved 14:24 January 31, 2019
[3] SchedMD LLC (2018). Slurm, resource management [squeue]. Copy of manual text available at https://slurm.schedmd.com/squeue.html. Retrieved 12:30 February 1, 2019
[4] SchedMD LLC (2018). Slurm, resource management [scancel]. Copy of manual text available at https://slurm.schedmd.com/scancel.html. Retrieved 15:47 January 31, 2019
[5] SchedMD LLC (2018). Slurm, resource management [sacct]. Copy of manual text available at https://slurm.schedmd.com/sacct.html. Retrieved 8:44 February 4, 2019