Submitting jobs¶
What is sbatch?¶
Slurm provides many options to manage all the resources of a cluster and satisfy any possible combination of needs: number of CPUs, number of nodes, memory, time, GPUs, licenses, etc.
The command sbatch is used to submit a batch script, which runs your job on the cluster. Like this:
$ sbatch <batch_script>
A Slurm batch script is a shell script (usually written in bash) where you specify all these options to Slurm, including the creation of the environment your job needs to run correctly and the set of commands that run the job.
Thus, we say that a batch script has three parts:
Sbatch parameters:
The idea is to include all the information you think Slurm should know about your job (name, notification mail, partition, std_out, std_err, etc.) and to request all your computational needs, which consist of at least a number of CPUs, the expected computing duration, and the amount of RAM to use.
All these parameters must start with the comment #SBATCH, one per line, and need to be included at the beginning of the file, just after the shebang (e.g. #!/bin/bash), which should be the first line. The following table [3] shows important and common options; for further information see man sbatch.
Option | Description | Possible value | Mandatory |
---|---|---|---|
-J, --job-name | Job's name | Letters and numbers | no |
-t, --time | Maximum walltime of the job | Numbers with the format DD-HH:MM:SS | yes |
--mem | Requested memory per node | Size with units: 64G, 600M | no |
-n, --ntasks | Number of tasks of the job | Number | no (default 1) |
--ntasks-per-node | Number of tasks assigned to a node | Number | no (default 1) |
-N, --nodes | Number of nodes requested | Number | no (default 1) |
-c, --cpus-per-task | Number of threads per task | Number | no (default 1) |
-p, --partition | Partition/queue where the job will be submitted | longjobs, bigmem, accel and debug | no (default longjobs) |
--output | File where the standard output will be written | Letters and numbers | no |
--error | File where the standard error will be written | Letters and numbers | no |
--mail-type | Notify user by email when certain event types occur to the job | NONE, ALL, BEGIN, END, FAIL, REQUEUE, TIME_LIMIT, TIME_LIMIT_% | no |
--mail-user | Email to receive notifications of state changes | Valid email | no |
--exclusive | The job allocation cannot share nodes with other running jobs | Does not have values | no |
--test-only | Validate the batch script and return an estimate of when the job would be scheduled to run | Does not have values | no |
--constraint | Some nodes have features associated with them. Use this option to specify which features the nodes allocated to your job must have | The name of the feature to use | no |
Note
Each option must be included using #SBATCH <option>=<value>
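For instance, the --test-only option can also be used directly from the command line to validate a script before actually queuing it (my_job.sh is just a placeholder name):
$ sbatch --test-only my_job.sh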
Warning
Some values of the options/parameters may be specific to our clusters.
Note
Regarding the --mail-type option, the value TIME_LIMIT_% refers to a percentage of the requested walltime: TIME_LIMIT_90 notifies when 90% of the walltime has been reached, TIME_LIMIT_50 at 50%, and so on.
Environment creation
Next, you should create the environment your job needs to run correctly. This often means including in your sbatch script the same set of steps you follow to run your application locally: exporting environment variables, creating or deleting files and directory structures, etc. Remember that a Slurm script is a shell script.
If you want to submit a job that uses an application installed on our clusters, you have to load its module. An application module creates the specific environment needed by that application.
The following table [1] shows useful commands for working with modules.
Command | Functionality |
---|---|
module avail | Check what software packages are available |
module whatis <module-name> | Find out more about a software package |
module help <module-name> | A module file may include more detailed help for the software package |
module show <module-name> | See exactly what effect loading the module will have |
module list | Check which modules are currently loaded in your environment |
module load <module-name> | Load a module |
module unload <module-name> | Unload a module |
module purge | Remove all loaded modules from your environment |
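As a small sketch, the environment-creation part of a batch script that needs Intel MPI (the impi module used later in this guide) could look like this; any module listed by module avail is loaded the same way:
##### ENVIRONMENT CREATION #####
module load impi # load the Intel MPI environment
module list      # optional: record the loaded modules in the job's standard output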
Warning
Slurm always propagates the environment of the current user to the job. This could impact the behavior of the job. If you want a clean environment, add #SBATCH --export=NONE to your sbatch script. This option is particularly important for jobs that are submitted on one cluster and execute on a different one (e.g. with different paths).
Job(s) steps
Finally, you put the command(s) that execute your application, including all its parameters. You will often see the command srun calling the executable instead of running the application binary directly. For more information see the MPI jobs section.
There are other options beyond sbatch to submit jobs to Slurm, like salloc or simply srun. We recommend sbatch, but depending on the specific needs of your application those options could be better.
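For instance, a quick interactive session could look like this (a sketch; the requested resources are only illustrative):
$ salloc --ntasks=1 --time=00:10:00 --partition=debug # request an interactive allocation
$ srun hostname                                       # run a command inside that allocation
$ exit                                                # release the allocation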
To know more, see: FAQ and Testing my job
Serial jobs¶
Serial jobs use a single process with one execution thread, which means one core of a CPU, given our configuration without HTT (Hyper-Threading Technology).
This kind of job does not take full advantage of our computational resources, but it is the basic step to create more complex jobs.
In terms of Slurm, this job uses one task (process) and one cpu-per-task (thread) on one node. In fact, we don't need to specify any of these resources, since the default value for those options in Slurm is 1.
Here is a good article about the differences between Processes and Threads.
In the template below we specify --ntasks=1 to make it explicit.
#!/bin/bash
#SBATCH --job-name=serial_test # Job name
#SBATCH --mail-type=FAIL,END # Mail notification
#SBATCH --mail-user=<user>@<domain> # User Email
#SBATCH --output=slurm-serial.%j.out # Stdout (%j expands to jobId)
#SBATCH --error=slurm-serial.%j.err # Stderr (%j expands to jobId)
#SBATCH --ntasks=1 # Number of tasks (processes)
#SBATCH --time=01:00 # Walltime
#SBATCH --partition=longjobs # Partition
##### ENVIRONMENT CREATION #####
##### JOB COMMANDS ####
hostname
date
sleep 50
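To try this template, save it to a file (here called serial.sh, a name of your choice) and submit it; you can then follow its state with squeue:
$ sbatch serial.sh
$ squeue -u $USER # list your pending and running jobs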
MPI jobs¶
MPI jobs are able to launch multiple processes on multiple nodes. There are many possible workflows using MPI; here we explain a basic one. Based on this example, and by modifying its parameters, you can find the configuration for your specific need.
The example was compiled on Cronos using impi as follows:
$ module load impi
$ impicc hello_world_mpi.c -o mpi_hello_world_apolo
We submitted the classic “Hello world” MPI example [6] using 5 processes (--ntasks=5), each one on a different machine (--ntasks-per-node=1). Just to be clear, we used 5 machines and 1 CPU on each, leaving the other CPUs (15 per node, in this specific case) free to be allocated by Slurm to other jobs.
#!/bin/bash
#SBATCH --job-name=mpi_test # Job name
#SBATCH --mail-type=FAIL,END # Mail notification
#SBATCH --mail-user=<user>@<domain> # User Email
#SBATCH --output=slurm-mpi.%j.out # Stdout (%j expands to jobId)
#SBATCH --error=slurm-mpi.%j.err # Stderr (%j expands to jobId)
#SBATCH --time=01:00 # Walltime
#SBATCH --partition=longjobs # Partition
#SBATCH --ntasks=5 # Number of tasks (processes)
#SBATCH --ntasks-per-node=1 # Number of task per node (machine)
##### ENVIRONMENT CREATION #####
module load impi
##### JOB COMMANDS ####
srun --mpi=pmi2 ./mpi_hello_world_apolo
Note
The use of srun is mandatory here. It creates the necessary environment to launch the MPI processes. There you can also specify other parameters; see man srun for more information.
Also, the use of --mpi=pmi2 is mandatory: it tells MPI to use Slurm's pmi2 plugin. This could change when you are using a different implementation of MPI (e.g. MVAPICH, OpenMPI), but we strongly encourage our users to specify it.
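If you are not sure which MPI plugins your Slurm installation provides, you can list them with:
$ srun --mpi=list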
Output
HELLO_MPI - Master process:
C/MPI version
An MPI example program.
Process 3 says 'Hello, world!'
The number of processes is 5.
Process 0 says 'Hello, world!'
Elapsed wall clock time = 0.000019 seconds.
Process 1 says 'Hello, world!'
Process 4 says 'Hello, world!'
Process 2 says 'Hello, world!'
HELLO_MPI - Master process:
Normal end of execution: 'Goodbye, world!'
30 January 2019 09:29:56 AM
Warning
As you can see in that example, we do not specify -N or --nodes to submit the job on 5 different machines. You can let Slurm decide how many machines your job needs.
Try to think in terms of “tasks” rather than “nodes”.
This table shows some other useful cases [2]:
You want | You ask |
---|---|
N CPUs | --ntasks=N |
N CPUs spread across distinct nodes | --ntasks=N --nodes=N |
N CPUs spread across distinct nodes and nobody else around | --ntasks=N --nodes=N --exclusive |
N CPUs spread across N/2 nodes | --ntasks=N --ntasks-per-node=2 |
N CPUs on the same node | --ntasks=N --ntasks-per-node=N |
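For example, taking N=8 in the “N CPUs spread across N/2 nodes” row, the corresponding header lines would be:
#SBATCH --ntasks=8          # 8 CPUs in total
#SBATCH --ntasks-per-node=2 # 2 tasks per node, so 4 nodes are used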
Array jobs¶
Also called embarrassingly parallel, this setup is commonly used by users who do not have a natively parallel application, so they run multiple parallel instances of their application, changing its input. Each instance is independent and does not have any kind of communication with the others.
To do this, we specify an array using the sbatch parameter --array. Multiple values may be specified using a comma-separated list and/or a range of values with a “-” separator (e.g. --array=1,3,5-10 or --array=1,2,3). These are the values that the variable SLURM_ARRAY_TASK_ID will take in each array-job.
This input usually refers to these cases:
File input
You have multiple files/directories to process.
In the example/template below we made a “parallel copy” of the files contained in the test directory using the cp command.
./test/
├── file1.txt
├── file2.txt
├── file3.txt
├── file4.txt
└── file5.txt
We used one process (called task in Slurm) per array-job. The array goes from 0 to 4, so there were 5 processes copying the 5 files contained in the test directory.
directory.#!/bin/bash #SBATCH --job-name=array_file_test # Job name #SBATCH --mail-type=FAIL,END # Mail notification #SBATCH --mail-user=<user>@<domain> # User Email #SBATCH --output=slurm-arrayJob%A_%a.out # Stdout (%a expands to stepid, %A to jobid ) #SBATCH --error=slurm-array%J.err # Stderr (%J expands to GlobalJobid) #SBATCH --ntasks=1 # Number of tasks (processes) for each array-job #SBATCH --time=01:00 # Walltime for each array-job #SBATCH --partition=debug # Partition #SBATCH --array=0-4 # Array index ##### ENVIRONMENT CREATION ##### ##### JOB COMMANDS #### # Array of files files=(./test/*) # Work based on the SLURM_ARRAY_TASK_ID srun cp ${files[$SLURM_ARRAY_TASK_ID]} copy_$SLURM_ARRAY_TASK_ID
Thus, the generated file copy_0 is the copy of the file test/file1.txt, the file copy_1 is the copy of the file test/file2.txt, and so on. Each one was done by a different Slurm process in parallel.
Warning
Except for --array, ALL other #SBATCH options specified in the submitted Slurm script are used to configure EACH array-job, including ntasks, ntasks-per-node, time, mem, etc.
Parameters input
You have multiple parameters to process.
Similarly to the last example, we created an array with some values that we wanted to use as parameters of the application. We used one process (task) per array-job. We had 4 parameters (0.05 100 999 1295.5) to process and 4 array-jobs.
Force Slurm to run array-jobs on different nodes
To add another feature to this example, we used 1 node for each array-job: even though one node can run up to 16 processes (in the case of Cronos) and the 4 array-jobs could be assigned to 1 node, we forced Slurm to use 4 nodes.
To achieve this we use the parameter --exclusive; thus, for each array-job Slurm makes sure no other Slurm job runs on the same node, not even another array-job of yours.
.Note
Just to be clear, the use of --exclusive as an SBATCH parameter tells Slurm that the job allocation cannot share nodes with other running jobs [4]. However, it has a slightly different meaning when you use it as a parameter of a job step (each separate srun execution inside an SBATCH script, e.g. srun --exclusive $COMMAND). For further information see man srun.
#!/bin/bash
#SBATCH --job-name=array_params_test # Job name
#SBATCH --mail-type=FAIL,END # Mail notification
#SBATCH --mail-user=<user>@<domain> # User Email
#SBATCH --output=slurm-arrayJob%A_%a.out # Stdout (%a expands to the array index, %A to the master jobid)
#SBATCH --error=slurm-array%J.err # Stderr (%J expands to GlobalJobid)
#SBATCH --ntasks=1 # Number of tasks (processes) for each array-job
#SBATCH --time=01:00 # Walltime for each array-job
#SBATCH --partition=debug # Partition
#SBATCH --array=0-3 # Array index
#SBATCH --exclusive # Force slurm to use 4 different nodes
##### ENVIRONMENT CREATION #####
##### JOB COMMANDS ####
# Array of params
params=(0.05 100 999 1295.5)
# Work based on the SLURM_ARRAY_TASK_ID
srun echo ${params[$SLURM_ARRAY_TASK_ID]}
Remember that the main idea behind using array jobs in Slurm is based on the use of the variable SLURM_ARRAY_TASK_ID.
Note
The parameter ntasks specifies the number of processes that EACH array-job is going to use. So if you want to use more, you can just specify it. This idea also applies to all other sbatch parameters.
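For example (a sketch), these header lines give each of four array-jobs four tasks of its own, 16 tasks in total:
#SBATCH --array=0-3 # 4 array-jobs
#SBATCH --ntasks=4  # EACH array-job gets 4 tasks (processes)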
Note
You can also limit the number of simultaneously running tasks from the job
array using a %
separator. For example --array=0-15%4
will limit the
number of simultaneously running tasks from this job array to 4.
Slurm’s environment variables¶
In the above examples, we often used the environment variables provided by Slurm. Here you have a table [3] with the most common variables.
Variable | Functionality |
---|---|
SLURM_JOB_ID |
job Id |
SLURM_ARRAY_TASK_ID |
Index of the slurm array |
SLURM_CPUS_PER_TASK |
Same as --cpus-per-task |
SLURM_NTASKS |
Same as -n , --ntasks |
SLURM_JOB_NUM_NODES |
Number of nodes allocated to job |
SLURM_SUBMIT_DIR |
The directory from which sbatch was invoked |
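For example (a sketch; my_threaded_app is a placeholder binary), a multithreaded job can read SLURM_CPUS_PER_TASK so its thread count matches exactly what was allocated:
#SBATCH --cpus-per-task=4
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK # match the threads to the allocated CPUs
srun ./my_threaded_app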
Slurm’s file-patterns¶
sbatch allows filename patterns, which can be useful to name the std_out and std_err files. Here you have a table [3] with some of them.
File-pattern | Expands to |
---|---|
%A |
Job array’s master job allocation number |
%a |
Job array ID (index) number |
%j |
jobid of the running job |
%x |
Job name |
%N |
short hostname. This will create a separate IO file per node |
Note
If you need to separate the output of a job per node requested, %N is especially useful, for example in array-jobs.
For instance, if you use #SBATCH --output=job-%A.%a in an array-job, the output files will be something like job-1234.1, job-1234.2, job-1234.3; where 1234 refers to the job array's master job allocation number and 1, 2 and 3 refer to the index of each array-job.
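As another small sketch, combining %x and %j names the files after the job itself:
#SBATCH --output=%x-%j.out # e.g. serial_test-1234.out (the job id will differ)
#SBATCH --error=%x-%j.err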
Constraining Features on a job¶
In Apolo II, one can specify what type of CPU instruction set to use. One can choose between AVX2 and AVX512. These features can be specified using the SBATCH option --constraint=<list>, where <list> is the list of features to constrain.
For example, --constraint="AVX2" will allocate only nodes that have AVX2 in their instruction set, while --constraint="AVX2|AVX512" will allocate only nodes that have either AVX2 or AVX512.
One can also have a job requiring some nodes to have AVX2 and some others AVX512. For this one would use the operators ‘&’ and ‘*’. The ampersand works as an ‘and’ operator, and the ‘*’ is used to specify the number of nodes that must satisfy a single feature. For example, --constraint="[AVX2*2&AVX512*3]" asks for two nodes with AVX2 and three with AVX512. The square brackets are mandatory.
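Inside a batch script, the constraint is just another header line; a minimal sketch:
#SBATCH --constraint="AVX2"              # only nodes with the AVX2 feature
# or, mixing features across nodes as described above:
#SBATCH --constraint="[AVX2*2&AVX512*3]" # two nodes with AVX2 and three with AVX512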
References¶
[1] | NYU HPC. (n.d). Slurm Tutorial - Software and Environment Modules. Retrieved 17:47, January 21, 2019 from https://wikis.nyu.edu/display/NYUHPC/Slurm+Tutorial |
[2] | UCLouvain - Université catholique de Louvain (n.d). Slurm Workload Manager - Slide 57. Retrieved 11:33 January 25, 2019 from http://www.cism.ucl.ac.be/Services/Formations/slurm/2016/slurm.pdf |
[3] | (1, 2, 3) SchedMD LLC (2018). Slurm, resource management [sbatch]. Copy of manual text available at https://slurm.schedmd.com/sbatch.html. Retrieved 17:20 January 30, 2019 |
[4] | SchedMD LLC (2018). Slurm, resource management [srun]. Copy of manual text available at https://slurm.schedmd.com/srun.html. Retrieved 12:20 January 31, 2019 |
[5] | Barney Blaise (2005) OpenMP Example - Hello World - C/C++ Version. Example was taken from https://computing.llnl.gov/tutorials/openMP/samples/C/omp_hello.c Retrieved 09:32 February 12, 2019 |
[6] | Burkardt John (2008) Using MPI: Portable Parallel Programming with the Message-Passing Interface. Example was taken from https://people.sc.fsu.edu/~jburkardt/c_src/heat_mpi/heat_mpi.c Retrieved 09:38 February 12, 2019 |