Submitting jobs

What is sbatch?

Slurm offers many options for managing the resources of a cluster, so you can request almost any combination of CPUs, nodes, memory, time, GPUs, licenses, etc.

The sbatch command submits a batch script to Slurm, which queues the job and runs it on the cluster:

$ sbatch <batch_script>

A Slurm batch script is a shell script (usually written in bash) in which you specify these options, set up the environment your job needs to run correctly, and list the commands that run the job.

Thus, we say that a batch script has three parts:

  1. Sbatch parameters:

    Include everything Slurm should know about your job (name, notification email, partition, std_out, std_err, etc.) and request the resources you need: at least the number of CPUs, the expected run time and the amount of RAM.

    Each parameter goes on its own line, starting with the comment #SBATCH. These lines must be placed at the beginning of the file, right after the shebang (e.g. #!/bin/bash), which should be the first line.

    The following table [3] shows the most common options; for further information see man sbatch.

    Sbatch options
    Option | Description | Possible values | Mandatory
    -J, --job-name | Job’s name | Letters and numbers | no
    -t, --time | Maximum walltime of the job | Numbers with the format DD-HH:MM:SS | yes
    --mem | Requested memory per node | Size with units: 64G, 600M | no
    -n, --ntasks | Number of tasks of the job | Number | no (default 1)
    --ntasks-per-node | Number of tasks assigned to a node | Number | no (default 1)
    -N, --nodes | Number of nodes requested | Number | no (default 1)
    -c, --cpus-per-task | Number of threads per task | Number | no (default 1)
    -p, --partition | Partition/queue where the job will be submitted | longjobs, bigmem, accel and debug | no (default longjobs)
    --output | File where the standard output will be written | Letters and numbers | no
    --error | File where the standard error will be written | Letters and numbers | no
    --mail-type | Notify user by email when certain event types occur to the job | NONE, ALL, BEGIN, END, FAIL, REQUEUE, TIME_LIMIT, TIME_LIMIT_% | no
    --mail-user | Email to receive notification of state changes | Valid email | no
    --exclusive | The job allocation cannot share nodes with other running jobs | Does not take a value | no
    --test-only | Validate the batch script and return an estimate of when a job would be scheduled to run | Does not take a value | no
    --constraint | Some nodes have features associated with them. Use this option to specify which features the nodes associated with your job must have | The name of the feature to use | no

    Note

    Each option is written as #SBATCH <option>=<value>; options that take no value (e.g. --exclusive) are written as #SBATCH <option>.

    Warning

    Some option values (e.g. partition names) are specific to our clusters.

    Note

    For the --mail-type option, the % in TIME_LIMIT_% stands for a percentage of the walltime: TIME_LIMIT_90 notifies you when the job reaches 90% of its walltime, TIME_LIMIT_50 at 50%, etc. The sbatch manual lists 50, 80 and 90 as the supported percentages.
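
The placement rule for #SBATCH lines can be checked before submitting. The sketch below is a rough approximation of how sbatch reads directives (it stops at the first line that is neither blank nor a comment). The function name list_sbatch_directives and the demo script are hypothetical and are not part of Slurm.

```shell
#!/bin/bash
# Hypothetical helper: print the #SBATCH options found before the first command.
# Rough approximation of the parsing done by sbatch, for illustration only.
list_sbatch_directives() {
    awk '
        NR == 1 && /^#!/ { next }                 # skip the shebang
        /^#SBATCH[ \t]/  {                        # a directive line
            sub(/^#SBATCH[ \t]+/, "")             # drop the prefix
            sub(/[ \t]+#.*$/, "")                 # drop a trailing comment
            print
            next
        }
        /^[ \t]*#/ || /^[ \t]*$/ { next }         # other comments, blank lines
        { exit }                                  # first command: stop reading
    ' "$1"
}

# Demo script: the last directive comes after a command, so it is not listed.
demo=$(mktemp)
cat > "$demo" <<'EOF'
#!/bin/bash
#SBATCH --job-name=demo     # Job name
#SBATCH --time=01:00        # Walltime
hostname
#SBATCH --mem=2G            # Too late: placed after a command
EOF

directives=$(list_sbatch_directives "$demo")
echo "$directives"
rm -f "$demo"
```

Only --job-name and --time are listed; the --mem line is skipped because it follows the hostname command.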

  2. Environment creation

    Next, create the environment your job needs to run correctly. This often means including in your sbatch script the same steps you follow to run your application locally: exporting environment variables, creating or deleting files and directory structures, etc. Remember that a Slurm script is a shell script.

    If your job uses an application that is installed in our clusters, you have to load its module.

    An application module creates the specific environment needed by your application.

    The following table [1] shows useful module commands.

    Useful module commands
    Command | Functionality
    module avail | Check what software packages are available
    module whatis <module-name> | Find out more about a software package
    module help <module-name> | A module file may include more detailed help for the software package
    module show <module-name> | See exactly what effect loading the module will have
    module list | Check which modules are currently loaded in your environment
    module load <module-name> | Load a module
    module unload <module-name> | Unload a module
    module purge | Remove all loaded modules from your environment

    Warning

    By default, Slurm propagates the environment of the submitting user to the job, which could affect the behavior of the job. If you want a clean environment, add #SBATCH --export=NONE to your sbatch script. This option is particularly important for jobs that are submitted on one cluster and executed on a different cluster (e.g. with different paths).
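
As a minimal sketch of what an environment creation section might look like, the block below resets the modules when the module command is available, exports a variable and creates a working directory. APP_INPUT_DIR and the directory layout are hypothetical placeholders, not names required by Slurm or by our clusters.

```shell
#!/bin/bash
# Sketch of an "environment creation" section; all names are placeholders.

# Start from a known module state when the module command is available.
if command -v module >/dev/null 2>&1; then
    module purge
    # module load <module-name>    # load what your application needs
fi

# Variable the application is assumed to read (hypothetical name).
export APP_INPUT_DIR="$PWD/input"

# Per-job working directory; SLURM_JOB_ID is unset outside Slurm, so fall back.
workdir="${TMPDIR:-/tmp}/job_${SLURM_JOB_ID:-local}"
mkdir -p "$workdir"

echo "Working directory: $workdir"
```

The fallbacks let the same lines run in a local shell, which is handy for testing the setup before submitting.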

  3. Job(s) steps

Finally, add the command(s) that run your application, including all their parameters. You will often see srun launching the executable instead of the binary being called directly. For more information, see the MPI jobs section.

Besides sbatch, there are other ways to submit jobs to Slurm, such as salloc or plain srun. We recommend sbatch, but depending on the needs of your application those alternatives could be a better fit. To learn more, see FAQ and Testing my job.

Serial jobs

A serial job uses a single process with one execution thread. Since our configuration does not use HTT (Hyper-Threading Technology), this means one core of a CPU.

This kind of job does not take advantage of the parallel resources of the cluster, but it is the basic building block for more complex jobs.

In Slurm terms, this job uses one task (process) and one CPU per task (thread) on one node. In fact, we do not need to request any resource explicitly, because the default value of those options in Slurm is 1.

Here is a good article about the differences between Processes and Threads.

In the template below we specify ntasks=1 to make it explicit.

#!/bin/bash

#SBATCH --job-name=serial_test       # Job name
#SBATCH --mail-type=FAIL,END         # Mail notification
#SBATCH --mail-user=<user>@<domain>  # User Email
#SBATCH --output=slurm-serial.%j.out # Stdout (%j expands to jobId)
#SBATCH --error=slurm-serial.%j.err  # Stderr (%j expands to jobId)
#SBATCH --ntasks=1                   # Number of tasks (processes)
#SBATCH --time=01:00                 # Walltime (MM:SS, i.e. 1 minute)
#SBATCH --partition=longjobs         # Partition


##### ENVIRONMENT CREATION #####



##### JOB COMMANDS #### 
hostname
date
sleep 50
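
After a successful submission, sbatch prints a confirmation line of the form "Submitted batch job <id>". The sketch below shows one way to keep that ID for later use, for example to locate the output file. The helper name and the file name serial.sh are assumptions made for this example, and the confirmation line is a sample so that the snippet also runs outside the cluster.

```shell
#!/bin/bash
# Hypothetical helper: take the last word of the confirmation line.
job_id_from_sbatch_output() {
    echo "${1##* }"
}

# On the cluster this line would come from:  out=$(sbatch serial.sh)
# (serial.sh is an assumed file name for the template above).
out="Submitted batch job 12345"   # sample line, not a real job

job_id=$(job_id_from_sbatch_output "$out")
echo "Job ID: $job_id"
echo "Expected stdout file: slurm-serial.${job_id}.out"
```

sbatch also has a --parsable flag that prints the job ID without the surrounding text, which avoids the string handling.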

Shared Memory jobs (OpenMP)

This setup creates parallelism with threads on a single machine. OpenMP handles the communication between threads (requested with -c in Slurm), but all of them must run on the same machine; it does not provide any communication between processes or threads on different physical machines.

In the example below we launched the classic “Hello world” OpenMP example [5]. It was compiled in Cronos with the Intel compiler 18.0.1 as follows:

$ module load intel/18.0.1
$ icc -fopenmp omp_hello.c -o hello_omp_intel_cronos

We used 16 threads, the maximum number allowed in Cronos’ longjobs partition. In Slurm terms, we request one task (ntasks=1) with 16 CPUs (cpus-per-task=16).

#!/bin/bash

#SBATCH --job-name=openmp_test      # Job name
#SBATCH --mail-type=FAIL,END        # Mail notification
#SBATCH --mail-user=<user>@<domain> # User Email
#SBATCH --output=slurm-omp.%j.out   # Stdout (%j expands to jobId)
#SBATCH --error=slurm-omp.%j.err    # Stderr (%j expands to jobId)
#SBATCH --time=01:00                # Walltime
#SBATCH --partition=longjobs        # Partition
#SBATCH --ntasks=1                  # Number of tasks (processes)
#SBATCH --cpus-per-task=16          # Number of threads per task (Cronos-longjobs)


##### ENVIRONMENT CREATION #####
module load intel/18.0.1 


##### JOB COMMANDS #### 
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

./hello_omp_intel_cronos

Output

Hello World from thread = 8
Hello World from thread = 0
Number of threads = 16
Hello World from thread = 4
Hello World from thread = 15
Hello World from thread = 5
Hello World from thread = 3
Hello World from thread = 2
Hello World from thread = 10
Hello World from thread = 9
Hello World from thread = 1
Hello World from thread = 11
Hello World from thread = 6
Hello World from thread = 12
Hello World from thread = 14
Hello World from thread = 7
Hello World from thread = 13

Warning

Keep in mind the maximum number of threads that can run at the same time on a compute node:

  • Apolo:
    • Longjobs queue: 32
    • Accel queue: 32
    • Bigmem queue: 24
    • Debug queue: 2
  • Cronos:
    • Longjobs queue: 16

If you exceed these limits, your job oversubscribes the cores of the node, which drastically degrades the performance of your application. To learn more, see the FAQ.

Note also that our setup does not use HTT (Hyper-Threading Technology).

Note

We highly recommend using the Slurm variable $SLURM_CPUS_PER_TASK to set the number of threads that OpenMP works with. Most applications read this number from the variable OMP_NUM_THREADS.
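
According to the sbatch manual, SLURM_CPUS_PER_TASK is only set when --cpus-per-task is given. A defensive variant of the export line, sketched below, falls back to 1 thread so that OMP_NUM_THREADS is never left empty. The function name set_omp_threads is hypothetical.

```shell
#!/bin/bash
# SLURM_CPUS_PER_TASK is only set when --cpus-per-task is given,
# so fall back to 1 thread instead of exporting an empty value.
set_omp_threads() {
    export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"
}

# Case 1: no --cpus-per-task (or running outside Slurm).
unset SLURM_CPUS_PER_TASK
set_omp_threads
threads_default=$OMP_NUM_THREADS

# Case 2: the value Slurm would export for --cpus-per-task=16.
SLURM_CPUS_PER_TASK=16
set_omp_threads
threads_slurm=$OMP_NUM_THREADS

echo "without --cpus-per-task: $threads_default"
echo "with --cpus-per-task=16: $threads_slurm"
```

In a real batch script only the export line is needed; the two cases are simulated here so the behavior can be checked locally.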

MPI jobs

MPI jobs can launch multiple processes on multiple nodes. There are many possible MPI workflows; here we explain a basic one. Starting from this example and modifying its parameters, you can find the configuration for your specific needs.

The example was compiled in Cronos using impi as follows:

$ module load impi
$ impicc hello_world_mpi.c -o mpi_hello_world_apolo

We submitted the classic “Hello world” MPI example [6] using 5 processes (--ntasks=5), each one on a different machine (--ntasks-per-node=1). In other words, we used 5 machines and 1 CPU on each, leaving the other CPUs (15 per machine, in this specific case) free for Slurm to allocate to other jobs.

#!/bin/bash

#SBATCH --job-name=mpi_test         # Job name
#SBATCH --mail-type=FAIL,END        # Mail notification
#SBATCH --mail-user=<user>@<domain> # User Email
#SBATCH --output=slurm-mpi.%j.out   # Stdout (%j expands to jobId)
#SBATCH --error=slurm-mpi.%j.err    # Stderr (%j expands to jobId)
#SBATCH --time=01:00                # Walltime
#SBATCH --partition=longjobs        # Partition
#SBATCH --ntasks=5                  # Number of tasks (processes)
#SBATCH --ntasks-per-node=1         # Number of task per node (machine)


##### ENVIRONMENT CREATION #####
module load impi


##### JOB COMMANDS #### 
srun --mpi=pmi2 ./mpi_hello_world_apolo

Note

The use of srun is mandatory here. It creates the environment needed to launch the MPI processes, and it accepts further parameters; see man srun for more information.

The --mpi=pmi2 option is also mandatory: it tells srun to use Slurm’s pmi2 plugin. The right value could change with a different implementation of MPI (e.g. MVAPICH, OpenMPI), but we strongly encourage our users to specify it explicitly.

Output

HELLO_MPI - Master process:
  C/MPI version
  An MPI example program.

  Process 3 says 'Hello, world!'
  The number of processes is 5.

  Process 0 says 'Hello, world!'
  Elapsed wall clock time = 0.000019 seconds.
  Process 1 says 'Hello, world!'
  Process 4 says 'Hello, world!'
  Process 2 says 'Hello, world!'

HELLO_MPI - Master process:
  Normal end of execution: 'Goodbye, world!'

30 January 2019 09:29:56 AM

Warning

As you can see in this example, we did not specify -N or --nodes to run the job on 5 different machines. You can let Slurm decide how many machines your job needs.

Try to think in terms of “tasks” rather than “nodes”.

This table shows some other useful cases [2]:

MPI jobs table
You want | You ask
N CPUs | --ntasks=N
N CPUs spread across distinct nodes | --ntasks=N --nodes=N
N CPUs spread across distinct nodes and nobody else around | --ntasks=N --nodes=N --exclusive
N CPUs spread across N/2 nodes | --ntasks=N --ntasks-per-node=2
N CPUs on the same node | --ntasks=N --ntasks-per-node=N
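
When combining --ntasks with --ntasks-per-node, the job needs at least ceil(ntasks / ntasks-per-node) nodes. The sketch below works through that arithmetic for a few cases; nodes_needed is a hypothetical helper written for this example, not a Slurm command.

```shell
#!/bin/bash
# Hypothetical helper: minimum nodes for N tasks at a fixed tasks-per-node,
# computed as ceil(ntasks / ntasks_per_node) with integer arithmetic.
nodes_needed() {
    ntasks=$1
    per_node=$2
    echo $(( (ntasks + per_node - 1) / per_node ))
}

echo "5 tasks, 1 per node: $(nodes_needed 5 1) nodes"
echo "8 tasks, 2 per node: $(nodes_needed 8 2) nodes"
echo "5 tasks, 2 per node: $(nodes_needed 5 2) nodes"
```

The first case matches the MPI example above (5 tasks, 1 per node, 5 machines); the last one shows that an odd task count still rounds up to a whole node.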

Array jobs

Also called embarrassingly parallel jobs, this setup is commonly used when an application is not natively parallel: users run multiple instances of it in parallel, changing the input of each one. Each instance is independent and does not communicate with the others.

To do this, we specify an array with the sbatch parameter --array. Multiple values may be given as a comma-separated list and/or a range of values with a “-” separator (e.g. --array=1,3,5-10 or --array=1,2,3). These are the values that the variable SLURM_ARRAY_TASK_ID takes in each array-job.

The input that changes between instances is usually one of the following:

  1. File input

    You have multiple files/directories to process.

    In the example/template below we made a “parallel copy” of the files contained in the test directory using the cp command.

    ./test/
    ├── file1.txt
    ├── file2.txt
    ├── file3.txt
    ├── file4.txt
    └── file5.txt
    

    We used one process (called a task in Slurm) for each array-job. The array goes from 0 to 4, so 5 processes copied the 5 files contained in the test directory.

    #!/bin/bash
    
    #SBATCH --job-name=array_file_test       # Job name
    #SBATCH --mail-type=FAIL,END             # Mail notification
    #SBATCH --mail-user=<user>@<domain>      # User Email
    #SBATCH --output=slurm-arrayJob%A_%a.out # Stdout (%A expands to the array master job ID, %a to the array index)
    #SBATCH --error=slurm-array%J.err        # Stderr (%J expands to jobid.stepid)
    #SBATCH --ntasks=1                       # Number of tasks (processes) for each array-job
    #SBATCH --time=01:00                     # Walltime for each array-job
    #SBATCH --partition=debug                # Partition
    
    #SBATCH --array=0-4    # Array index
    
    
    ##### ENVIRONMENT CREATION #####
    
    
    ##### JOB COMMANDS ####
    
    # Array of files
    files=(./test/*)
    
    # Work based on the SLURM_ARRAY_TASK_ID
    srun cp "${files[$SLURM_ARRAY_TASK_ID]}" "copy_$SLURM_ARRAY_TASK_ID"
    

    Thus, the generated file copy_0 is a copy of the file test/file1.txt, the file copy_1 is a copy of the file test/file2.txt, and so on. Each copy was made by a different Slurm process in parallel.
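
A common failure in this pattern is an array index with no matching file (for example --array=0-9 with only 5 files). The sketch below adds a bounds check and a default index so the logic can be tried outside Slurm. The temporary directories stand in for ./test and the output location; they are assumptions of this example.

```shell
#!/bin/bash
# Self-contained demo: build a small input directory (stand-in for ./test).
demo_dir=$(mktemp -d)
out_dir=$(mktemp -d)
for i in 1 2 3 4 5; do echo "data $i" > "$demo_dir/file$i.txt"; done

# Outside Slurm the variable is unset; default to 0 for a local dry run.
task_id="${SLURM_ARRAY_TASK_ID:-0}"

# Glob results are sorted, so index 0 is file1.txt, index 1 is file2.txt, ...
files=("$demo_dir"/*)

# Refuse indices that have no matching file.
if [ "$task_id" -ge "${#files[@]}" ]; then
    echo "No input file for array index $task_id" >&2
    exit 1
fi

cp "${files[$task_id]}" "$out_dir/copy_$task_id"
echo "Copied ${files[$task_id]} to $out_dir/copy_$task_id"
```

Inside a real array-job the default is never used, because Slurm sets SLURM_ARRAY_TASK_ID for every array-job.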

Warning

Except for --array, ALL other #SBATCH options specified in the submitted Slurm script are used to configure EACH array-job, including ntasks, ntasks-per-node, time, mem, etc.

  2. Parameters input

    You have multiple parameters to process.

    Similarly to the last example, we created an array with some values that we wanted to use as parameters of the application. We used one process (task) per array-job. We had 4 parameters (0.05 100 999 1295.5) to process and 4 array-jobs.

    Force Slurm to run array-jobs in different nodes

    To add another feature to this example, we used 1 node for each array-job. Even though one node can run up to 16 processes (in the case of Cronos) and the 4 array-jobs could have been assigned to a single node, we forced Slurm to use 4 nodes.

    To achieve this we use the parameter --exclusive: for each array-job, Slurm makes sure that no other Slurm job runs on the same node, not even another array-job of yours.

    Note

    Just to be clear, using --exclusive as an SBATCH parameter tells Slurm that the job allocation cannot share nodes with other running jobs [4]. However, it has a slightly different meaning when you use it as a parameter of a job step (each separate srun execution inside an SBATCH script, e.g. srun --exclusive $COMMAND). For further information see man srun.

#!/bin/bash

#SBATCH --job-name=array_params_test     # Job name
#SBATCH --mail-type=FAIL,END             # Mail notification
#SBATCH --mail-user=<user>@<domain>      # User Email
#SBATCH --output=slurm-arrayJob%A_%a.out # Stdout (%A expands to the array master job ID, %a to the array index)
#SBATCH --error=slurm-array%J.err        # Stderr (%J expands to jobid.stepid)
#SBATCH --ntasks=1                       # Number of tasks (processes) for each array-job
#SBATCH --time=01:00                     # Walltime for each array-job
#SBATCH --partition=debug                # Partition

#SBATCH --array=0-3    # Array index
#SBATCH --exclusive    # Force slurm to use 4 different nodes

##### ENVIRONMENT CREATION #####


##### JOB COMMANDS ####

# Array of params
params=(0.05 100 999 1295.5)

# Work based on the SLURM_ARRAY_TASK_ID
srun echo "${params[$SLURM_ARRAY_TASK_ID]}"

Remember that the main idea behind using Array jobs in Slurm is based on the use of the variable SLURM_ARRAY_TASK_ID.
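
When the list of parameters grows, it is often easier to keep one parameter per line in a file and pick the line that matches the array index. The sketch below illustrates this variant; the helper param_for_index and the temporary parameter file are hypothetical names used only for this example.

```shell
#!/bin/bash
# Demo parameter file (temporary, hypothetical), one parameter per line.
params_file=$(mktemp)
printf '%s\n' 0.05 100 999 1295.5 > "$params_file"

# Hypothetical helper: print the parameter for a zero-based array index.
# Array indices start at 0 here, while sed line numbers start at 1.
param_for_index() {
    sed -n "$(( $1 + 1 ))p" "$2"
}

# Outside Slurm the variable is unset; default to 0 for a local dry run.
task_id="${SLURM_ARRAY_TASK_ID:-0}"
param=$(param_for_index "$task_id" "$params_file")

echo "Array index $task_id uses parameter $param"
```

With --array=0-3 each array-job would read a different line of the file.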

Note

The parameter ntasks specifies the number of processes that EACH array-job uses. If you need more than one, just request it. The same applies to all other sbatch parameters.

Note

You can also limit the number of simultaneously running tasks from the job array using a % separator. For example --array=0-15%4 will limit the number of simultaneously running tasks from this job array to 4.
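
To preview which values SLURM_ARRAY_TASK_ID will take for a given --array value, the index syntax can be expanded by hand. The sketch below is a hypothetical helper, not a Slurm tool: it handles comma lists, "a-b" ranges and a trailing "%limit", and it does not handle the "a-b:step" form that sbatch also accepts.

```shell
#!/bin/bash
# Hypothetical helper: list the indices described by an --array value.
expand_array_spec() {
    spec=${1%%\%*}                      # drop an optional "%limit" suffix
    for part in $(echo "$spec" | tr ',' ' '); do
        case $part in
            *-*) seq "${part%-*}" "${part#*-}" ;;   # range: first to last
            *)   echo "$part" ;;                    # single index
        esac
    done | tr '\n' ' ' | sed 's/ $//'
}

echo "1,3,5-10 -> $(expand_array_spec 1,3,5-10)"
echo "0-3%2    -> $(expand_array_spec 0-3%2)"
```

The "%limit" part only controls how many array-jobs run at once, so it does not change the list of indices.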

Slurm’s environment variables

In the examples above we often used environment variables provided by Slurm. The following table [3] lists the most common ones.

Output environment variables
Variable | Functionality
SLURM_JOB_ID | Job ID
SLURM_ARRAY_TASK_ID | Index of the Slurm array
SLURM_CPUS_PER_TASK | Same as --cpus-per-task
SLURM_NTASKS | Same as -n, --ntasks
SLURM_JOB_NUM_NODES | Number of nodes allocated to the job
SLURM_SUBMIT_DIR | The directory from which sbatch was invoked
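
Printing these variables at the start of a job makes the output file easier to interpret later. The sketch below is one possible way to do it; job_summary is a hypothetical function name, and each variable falls back to "n/a" because some of them may be unset (for example outside Slurm, or when the matching option was not given).

```shell
#!/bin/bash
# Hypothetical helper: print a short summary of the Slurm variables above.
job_summary() {
    echo "job id: ${SLURM_JOB_ID:-n/a}"
    echo "array index: ${SLURM_ARRAY_TASK_ID:-n/a}"
    echo "cpus per task: ${SLURM_CPUS_PER_TASK:-n/a}"
    echo "tasks: ${SLURM_NTASKS:-n/a}"
    echo "nodes: ${SLURM_JOB_NUM_NODES:-n/a}"
    echo "submit dir: ${SLURM_SUBMIT_DIR:-n/a}"
}

job_summary
```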

Slurm’s file-patterns

sbatch allows filename patterns, which are useful for naming the std_err and std_out files. The following table [3] lists some of them.

Slurm’s file-patterns
File-pattern | Expands to
%A | Job array’s master job allocation number
%a | Job array ID (index) number
%j | Job ID of the running job
%x | Job name
%N | Short hostname. This will create a separate IO file per node

Note

If you need a separate output file for each node requested, %N is especially useful, for example in array-jobs.

For instance, if you use #SBATCH --output=job-%A.%a in an array-job, the output files will be named something like job-1234.1, job-1234.2, job-1234.3, where 1234 is the job array’s master job allocation number and 1, 2 and 3 are the indices of the array-jobs.
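
To see what a pattern will produce before submitting, the expansion can be imitated with sed. The sketch below is a hypothetical preview helper, not Slurm's own expansion code: it only handles %A, %a, %j and %x, the IDs are sample values, and a job name containing "/" or "&" would need escaping.

```shell
#!/bin/bash
# Hypothetical helper: preview a filename pattern for given sample IDs.
# Arguments: pattern, array master job ID, array index, job ID, job name.
preview_pattern() {
    pattern=$1 array_job_id=$2 array_index=$3 job_id=$4 job_name=$5
    echo "$pattern" | sed \
        -e "s/%A/$array_job_id/g" \
        -e "s/%a/$array_index/g" \
        -e "s/%j/$job_id/g" \
        -e "s/%x/$job_name/g"
}

preview_pattern "job-%A.%a" 1234 1 1235 demo
preview_pattern "%x.%j.out" 1234 1 1235 demo
```

The first call prints job-1234.1 and the second prints demo.1235.out.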

Constraining Features on a job

In Apolo II, one can specify which CPU instruction set the nodes of a job must support, choosing between AVX2 and AVX512. These features are specified with the SBATCH option --constraint=<list>, where <list> is the list of features to require. For example, --constraint="AVX2" will allocate only nodes that have AVX2 in their instruction set, and --constraint="AVX2|AVX512" will allocate only nodes that have either AVX2 or AVX512.

One can also have a job that requires some nodes with AVX2 and others with AVX512, using the operators ‘&’ and ‘*’. The ampersand works as an ‘and’ operator, and ‘*’ gives the number of nodes that must have a given feature. For example, --constraint="[AVX2*2&AVX512*3]" asks for two nodes with AVX2 and three with AVX512. The square brackets are mandatory.
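
The multiple-count syntax is easy to mistype, so a small helper can assemble it from feature:count pairs. This is only a sketch; build_constraint is a hypothetical function, and it does nothing more than produce the string shown above.

```shell
#!/bin/bash
# Hypothetical helper: build a multiple-count constraint from feature:count pairs.
build_constraint() {
    result=""
    for pair in "$@"; do
        feature=${pair%%:*}     # text before the colon
        count=${pair##*:}       # text after the colon
        # Join the terms with "&", without a leading "&" on the first one.
        result="${result:+$result&}${feature}*${count}"
    done
    echo "[$result]"
}

constraint=$(build_constraint AVX2:2 AVX512:3)
echo "#SBATCH --constraint=\"$constraint\""
```

The printed line can be pasted into the parameters section of a batch script.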

References

[1] NYU HPC (n.d.). Slurm + tutorial - Software and Environment Modules. Retrieved 17:47, January 21, 2019 from https://wikis.nyu.edu/display/NYUHPC/Slurm+Tutorial
[2] UCLouvain - University of Louvain (n.d.). Slurm Workload Manager - Slide 57. Retrieved 11:33, January 25, 2019 from http://www.cism.ucl.ac.be/Services/Formations/slurm/2016/slurm.pdf
[3] SchedMD LLC (2018). Slurm, resource management [sbatch]. Copy of manual text available at https://slurm.schedmd.com/sbatch.html. Retrieved 17:20, January 30, 2019
[4] SchedMD LLC (2018). Slurm, resource management [srun]. Copy of manual text available at https://slurm.schedmd.com/srun.html. Retrieved 12:20, January 31, 2019
[5] Barney Blaise (2005). OpenMP Example - Hello World - C/C++ Version. Example was taken from https://computing.llnl.gov/tutorials/openMP/samples/C/omp_hello.c. Retrieved 09:32, February 12, 2019
[6] Burkardt John (2008). Using MPI: Portable Parallel Programming with the Message-Passing Interface. Example was taken from https://people.sc.fsu.edu/~jburkardt/c_src/heat_mpi/heat_mpi.c. Retrieved 09:38, February 12, 2019