DMTCP-2.5.2

Basic Information

Installation

This entry covers the entire process performed for the installation and configuration of DMTCP on a cluster with the conditions described above.

Usage

This subsection explains how to submit jobs to the cluster and restart them using DMTCP’s checkpointing services.

For both types of jobs, the SLURM launch script must first load the necessary environment, including DMTCP’s module. Then source the coordinator bash script to make the start_coordinator function available. Remember to set a checkpointing interval, in seconds, with the -i flag.

The last step in both cases is to launch the program as follows:

dmtcp_launch --rm <Your program binary> <args>...
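
Because the -i flag takes a plain number of seconds, longer intervals are easy to get wrong. The following is a minimal sketch of a conversion helper; to_seconds is a hypothetical name introduced here for illustration and is not part of DMTCP or of the coordinator script.

```shell
#!/bin/bash
# Hypothetical helper (not provided by DMTCP or coordinator.sh):
# convert hours, minutes and seconds into the number of seconds
# that the -i flag expects. Pass plain integers without leading zeros.
to_seconds() {
    echo $(( $1 * 3600 + $2 * 60 + $3 ))
}

# Example: a checkpoint every 10 minutes.
interval=$(to_seconds 0 10 0)

# Print the line you would place in the launch script.
echo "start_coordinator -i $interval"
```

In a real launch script you would call start_coordinator directly with the computed value instead of printing the line.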

For serial software

Launch script:

#!/bin/bash
# Put your SLURM options here
#SBATCH --time=00:30:00           # put proper time of reservation here
#SBATCH --nodes=1                 # number of nodes
#SBATCH --ntasks-per-node=1       # processes per node
#SBATCH --job-name=serial_example     # change to your job name
#SBATCH --output=serial_example.%j.out # Stdout (%j expands to jobId)
#SBATCH --error=serial_example.%j.err # Stderr (%j expands to jobId)

module load dmtcp

source coordinator.sh

################################################################################
# 1. Start DMTCP coordinator
################################################################################

start_coordinator -i 35


################################################################################
# 2. Launch application
################################################################################

dmtcp_launch --rm ./serial_example

Restart script:

#!/bin/bash
# Put your SLURM options here
#SBATCH --time=00:02:00           # put proper time of reservation here
#SBATCH --nodes=1                 # number of nodes
#SBATCH --ntasks-per-node=1       # processes per node
#SBATCH --job-name=serial_example     # change to your job name
#SBATCH --output=serial_example.%j.out # Stdout (%j expands to jobId)
#SBATCH --error=serial_example.%j.err # Stderr (%j expands to jobId)

module load dmtcp

source coordinator.sh

################################################################################
# 1. Start DMTCP coordinator
################################################################################

start_coordinator

################################################################################
# 2. Restart application
################################################################################

/bin/bash ./dmtcp_restart_script.sh -h $DMTCP_COORD_HOST -p $DMTCP_COORD_PORT
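
The launch and restart scripts above differ only in their last steps, so a single job script could pick the right branch by checking whether a dmtcp_restart_script.sh file is already present. The sketch below assumes the restart script sits in the directory you pass in (the scripts above use the launch directory); choose_action is a hypothetical helper name, not something DMTCP provides.

```shell
#!/bin/bash
# Hypothetical helper: report whether a job should restart from a
# checkpoint or launch from scratch, depending on whether a
# dmtcp_restart_script.sh file exists in the given directory.
choose_action() {
    local work_dir="${1:-.}"
    if [ -f "$work_dir/dmtcp_restart_script.sh" ]; then
        echo restart
    else
        echo launch
    fi
}

# Possible use inside a job script, after sourcing coordinator.sh:
#   if [ "$(choose_action .)" = restart ]; then
#       start_coordinator
#       /bin/bash ./dmtcp_restart_script.sh -h $DMTCP_COORD_HOST -p $DMTCP_COORD_PORT
#   else
#       start_coordinator -i 35
#       dmtcp_launch --rm ./serial_example
#   fi
```

Keeping the decision in a small function makes it possible to resubmit the same script after a time limit is reached without editing it.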

For parallel software

In this example we run an OpenMP application. Notice that the restart script does not set the OMP_NUM_THREADS variable again, since the restarted process resumes with the environment it had when it was checkpointed.

Launch script:

#!/bin/bash
# Put your SLURM options here
#SBATCH --time=00:02:00           # put proper time of reservation here
#SBATCH --nodes=1                 # number of nodes
#SBATCH --ntasks-per-node=8       # processes per node
#SBATCH --job-name=parallel_example     # change to your job name
#SBATCH --output=parallel_example.%j.out # Stdout (%j expands to jobId)
#SBATCH --error=parallel_example.%j.err # Stderr (%j expands to jobId)

module load dmtcp

source coordinator.sh

export OMP_NUM_THREADS=8

################################################################################
# 1. Start DMTCP coordinator
################################################################################

start_coordinator -i 35


################################################################################
# 2. Launch application
################################################################################

dmtcp_launch --rm ./parallel_example

Restart script:

#!/bin/bash
# Put your SLURM options here
#SBATCH --time=00:02:00           # put proper time of reservation here
#SBATCH --nodes=1                 # number of nodes
#SBATCH --ntasks-per-node=8       # processes per node
#SBATCH --job-name=parallel_example     # change to your job name
#SBATCH --output=parallel_example.%j.out # Stdout (%j expands to jobId)
#SBATCH --error=parallel_example.%j.err # Stderr (%j expands to jobId)

module load dmtcp

source coordinator.sh

################################################################################
# 1. Start DMTCP coordinator
################################################################################

start_coordinator

################################################################################
# 2. Restart application
################################################################################

/bin/bash ./dmtcp_restart_script.sh -h $DMTCP_COORD_HOST -p $DMTCP_COORD_PORT
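
In the parallel launch script the thread count (8) is written twice, once in --ntasks-per-node and once in OMP_NUM_THREADS. As a sketch, the value could instead be read from SLURM_NTASKS_PER_NODE, which sbatch sets when --ntasks-per-node is given. The helper name threads_for_job and the fallback of 1 thread are assumptions made for this illustration.

```shell
#!/bin/bash
# Hypothetical helper: take the thread count from the SLURM reservation
# so it cannot drift from the #SBATCH line. Falls back to 1 when the
# variable is not set (for example, outside of a SLURM job).
threads_for_job() {
    echo "${SLURM_NTASKS_PER_NODE:-1}"
}

export OMP_NUM_THREADS=$(threads_for_job)
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"
```

This only belongs in the launch script; as noted above, the restart script does not set the variable again.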

Sending commands to the coordinator

To send commands to the coordinator of a set of processes, use the dmtcp_command.<job_id> file that the start_coordinator function generates in your launch directory. Through this file you can communicate with your running applications, for example to trigger a manual checkpoint or to change the checkpointing interval.
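
To make the mechanism less opaque, here is a minimal sketch of how a wrapper such as dmtcp_command.<job_id> could be generated. It is an assumption about what the coordinator script does, not a copy of it: the function name write_command_wrapper and its arguments are hypothetical, and the real script on your cluster may differ. The only DMTCP-specific pieces it relies on are the dmtcp_command tool and the DMTCP_COORD_HOST and DMTCP_COORD_PORT variables already used in the restart scripts.

```shell
#!/bin/bash
# Hypothetical sketch: write an executable wrapper that remembers the
# coordinator's host and port and forwards its arguments to dmtcp_command.
write_command_wrapper() {
    local dir="$1"
    local job_id="$2"
    local host="$3"
    local port="$4"
    local wrapper="$dir/dmtcp_command.$job_id"
    {
        echo '#!/bin/bash'
        echo "export DMTCP_COORD_HOST=$host"
        echo "export DMTCP_COORD_PORT=$port"
        echo 'dmtcp_command "$@"'
    } > "$wrapper"
    chmod +x "$wrapper"
    # Print the wrapper's path so the caller can log or reuse it.
    echo "$wrapper"
}

# Possible use: write_command_wrapper "$PWD" "$SLURM_JOB_ID" "$(hostname)" 7779
# (7779 is only an example port number).
```

Because the host and port are baked into the file, you can run the wrapper from a login node without knowing where the coordinator was started.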

Examples

To trigger a manual checkpoint, use this command, where $JOBDIR is the job's launch directory and $JOBID is its job ID:

$JOBDIR/dmtcp_command.$JOBID -c

To change the checkpointing interval, use this command:

$JOBDIR/dmtcp_command.$JOBID -i <time_in_seconds>
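
If you send commands often, the two calls above can be wrapped in a small function that first checks that the wrapper file exists. This is only a sketch: send_to_coordinator is a hypothetical name, and it assumes the wrapper is an executable file named dmtcp_command.<job_id> in the launch directory, as described above.

```shell
#!/bin/bash
# Hypothetical helper: forward a command to the coordinator of a job
# through the dmtcp_command.<job_id> wrapper in its launch directory.
send_to_coordinator() {
    local job_dir="$1"
    local job_id="$2"
    shift 2
    local wrapper="$job_dir/dmtcp_command.$job_id"
    if [ ! -x "$wrapper" ]; then
        echo "No executable wrapper at $wrapper; is the job running?" >&2
        return 1
    fi
    "$wrapper" "$@"
}

# Possible use:
#   send_to_coordinator "$JOBDIR" "$JOBID" -c        # manual checkpoint
#   send_to_coordinator "$JOBDIR" "$JOBID" -i 120    # new interval
```

The existence check gives a clear error message when the job has already finished or the directory is wrong.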

Authors