DMTCP-2.5.2¶
Basic Information¶
- Deploy date: 3 August 2018.
- Official Website: http://dmtcp.sourceforge.net/
- License: GNU Lesser General Public License (LGPL)
- Installed on: Cronos
- Supported versions: Serial, parallel jobs
Installation¶
This entry covers the installation and configuration of DMTCP on a cluster with the characteristics described above.
Usage¶
This subsection explains how to submit jobs to the cluster and restart them using DMTCP’s checkpointing services.
For both types of jobs, load the required environment in the SLURM launch script, including the DMTCP module. Then source the coordinator bash script (coordinator.sh) to make the start_coordinator function available. Remember to pass it a checkpointing interval in seconds with the -i flag.
The last step in both cases is to launch the program as follows:
dmtcp_launch --rm <Your program binary> <args>...
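The coordinator.sh helper itself is not reproduced on this page. As a rough illustration of what a start_coordinator function could look like, the following is a minimal sketch loosely modeled on the SLURM job examples distributed with DMTCP. The script installed on the cluster may differ, and the variable names and the 30-second wait are assumptions made for this example.

```shell
# Hypothetical sketch of a start_coordinator helper (not the cluster's actual
# coordinator.sh). Extra arguments, such as -i 35, go to dmtcp_coordinator.
start_coordinator() {
    local job_id="${SLURM_JOB_ID:-manual}"
    local cmd_file="dmtcp_command.${job_id}"
    local host port tries=0
    host="$(hostname 2>/dev/null || uname -n)"

    if ! command -v dmtcp_coordinator >/dev/null 2>&1; then
        echo "dmtcp_coordinator not found; load the dmtcp module first" >&2
        return 1
    fi

    # Port 0 lets the coordinator pick a free port and write it to cmd_file.
    rm -f "$cmd_file"
    dmtcp_coordinator --daemon --exit-on-last -p 0 --port-file "$cmd_file" "$@" >/dev/null 2>&1

    # Wait (up to about 30 s, an arbitrary limit) for the port number to appear.
    until [ -s "$cmd_file" ]; do
        tries=$((tries + 1))
        if [ "$tries" -gt 30 ]; then
            echo "coordinator did not report a port" >&2
            return 1
        fi
        sleep 1
    done
    port="$(cat "$cmd_file")"

    # Replace the port file with a small wrapper around dmtcp_command.
    cat > "$cmd_file" <<EOF
#!/bin/bash
export DMTCP_COORD_HOST=$host
export DMTCP_COORD_PORT=$port
dmtcp_command "\$@"
EOF
    chmod +x "$cmd_file"

    # Make the coordinator address visible to dmtcp_launch and the restart script.
    export DMTCP_COORD_HOST="$host"
    export DMTCP_COORD_PORT="$port"
}
```

With a helper like this sourced, start_coordinator -i 35 would start a coordinator that checkpoints every 35 seconds and leave a dmtcp_command.<job_id> wrapper in the current directory.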
For serial software¶
#!/bin/bash
# Put your SLURM options here
#SBATCH --time=00:30:00 # put proper time of reservation here
#SBATCH --nodes=1 # number of nodes
#SBATCH --ntasks-per-node=1 # processes per node
#SBATCH --job-name=serial_example # change to your job name
#SBATCH --output=serial_example.%j.out # Stdout (%j expands to jobId)
#SBATCH --error=serial_example.%j.err # Stderr (%j expands to jobId)
module load dmtcp
source coordinator.sh
################################################################################
# 1. Start DMTCP coordinator
################################################################################
start_coordinator -i 35
################################################################################
# 2. Launch application
################################################################################
dmtcp_launch --rm ./serial_example
To restart the job from its last checkpoint, submit a script like this one:
#!/bin/bash
# Put your SLURM options here
#SBATCH --time=00:02:00 # put proper time of reservation here
#SBATCH --nodes=1 # number of nodes
#SBATCH --ntasks-per-node=1 # processes per node
#SBATCH --job-name=serial_example # change to your job name
#SBATCH --output=serial_example.%j.out # Stdout (%j expands to jobId)
#SBATCH --error=serial_example.%j.err # Stderr (%j expands to jobId)
module load dmtcp
source coordinator.sh
################################################################################
# 1. Start DMTCP coordinator
################################################################################
start_coordinator
################################################################################
# 2. Restart application
################################################################################
/bin/bash ./dmtcp_restart_script.sh -h $DMTCP_COORD_HOST -p $DMTCP_COORD_PORT
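The launch and restart steps can also be folded into a single script, so that the same file is submitted whether or not a checkpoint exists yet. The following is a minimal sketch, assuming dmtcp_restart_script.sh is written to the submission directory (as the restart script above expects); run_or_restart is a hypothetical helper name, not part of DMTCP.

```shell
# Hypothetical helper: restart from a checkpoint if one exists in the current
# directory, otherwise launch the program under DMTCP for the first time.
# Expects start_coordinator to have set DMTCP_COORD_HOST and DMTCP_COORD_PORT.
run_or_restart() {
    if [ -f ./dmtcp_restart_script.sh ]; then
        # A previous run left a restart script: resume from its checkpoint.
        /bin/bash ./dmtcp_restart_script.sh -h "$DMTCP_COORD_HOST" -p "$DMTCP_COORD_PORT"
    else
        # No checkpoint yet: start the program from scratch.
        dmtcp_launch --rm "$@"
    fi
}
```

In a job script this would replace step 2, for example run_or_restart ./serial_example after the call to start_coordinator.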
For parallel software¶
In this example we run an OpenMP application. Notice that the restart script does not set the OMP_NUM_THREADS variable again, since the restarted process resumes with the state it had when the checkpoint was taken.
#!/bin/bash
# Put your SLURM options here
#SBATCH --time=00:02:00 # put proper time of reservation here
#SBATCH --nodes=1 # number of nodes
#SBATCH --ntasks-per-node=8 # processes per node
#SBATCH --job-name=parallel_example # change to your job name
#SBATCH --output=parallel_example.%j.out # Stdout (%j expands to jobId)
#SBATCH --error=parallel_example.%j.err # Stderr (%j expands to jobId)
module load dmtcp
source coordinator.sh
export OMP_NUM_THREADS=8
################################################################################
# 1. Start DMTCP coordinator
################################################################################
start_coordinator -i 35
################################################################################
# 2. Launch application
################################################################################
dmtcp_launch --rm ./parallel_example
The corresponding restart script:
#!/bin/bash
# Put your SLURM options here
#SBATCH --time=00:02:00 # put proper time of reservation here
#SBATCH --nodes=1 # number of nodes
#SBATCH --ntasks-per-node=8 # processes per node
#SBATCH --job-name=parallel_example # change to your job name
#SBATCH --output=parallel_example.%j.out # Stdout (%j expands to jobId)
#SBATCH --error=parallel_example.%j.err # Stderr (%j expands to jobId)
module load dmtcp
source coordinator.sh
################################################################################
# 1. Start DMTCP coordinator
################################################################################
start_coordinator
################################################################################
# 2. Restart application
################################################################################
/bin/bash ./dmtcp_restart_script.sh -h $DMTCP_COORD_HOST -p $DMTCP_COORD_PORT
Sending commands to the coordinator¶
The start_coordinator function used in the scripts above generates a dmtcp_command.<job_id> file in your launch directory. Use this file to send commands to the coordinator of your running processes, for example to trigger a manual checkpoint or to change the checkpointing interval.
Examples¶
To trigger a manual checkpoint, run:
$JOBDIR/dmtcp_command.$JOBID -c
To change the checkpointing interval, run:
$JOBDIR/dmtcp_command.$JOBID -i <time_in_seconds>
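The interval command can be wrapped in a small function that validates its argument before anything is sent to the coordinator. This is only a sketch: set_ckpt_interval is a hypothetical name, and the job directory and job ID are passed in explicitly rather than read from JOBDIR and JOBID.

```shell
# Hypothetical wrapper around the generated dmtcp_command.<job_id> file.
# Usage: set_ckpt_interval <job_dir> <job_id> <seconds>
set_ckpt_interval() {
    local cmd="$1/dmtcp_command.$2" seconds="$3"

    # Reject empty or non-numeric intervals before contacting the coordinator.
    case "$seconds" in
        ''|*[!0-9]*)
            echo "interval must be a whole number of seconds" >&2
            return 2
            ;;
    esac

    # The command file only exists while the job's coordinator was started
    # from that directory.
    if [ ! -x "$cmd" ]; then
        echo "no coordinator command file at $cmd" >&2
        return 1
    fi

    "$cmd" -i "$seconds"
}
```

For example, set_ckpt_interval "$JOBDIR" "$JOBID" 120 would ask the coordinator of that job to checkpoint every 120 seconds.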
Authors¶
- Sebastian Patiño Barrientos <spatino6@eafit.edu.co>