DMTCP

Section: Distributed MultiThreaded CheckPointing (1)
Updated: June 17, 2008
 

NAME

dmtcp - Distributed MultiThreaded Checkpointing  

SYNOPSIS

dmtcp_coordinator [port]

dmtcp_checkpoint command [args...]

dmtcp_restart ckpt1.mtcp [ckpt2.mtcp...]

dmtcp_command coordinatorCommand  

DESCRIPTION

dmtcp is a tool for transparently checkpointing the state of an arbitrary group of programs spread across many machines and connected by sockets. It modifies neither the user's program nor the operating system.  

OPTIONS

Most options are controlled through environment variables. These can be set in bash with "export NAME=value" or in tcsh with "setenv NAME value".

DMTCP_CHECKPOINT_INTERVAL=integer
Time in seconds between automatic checkpoints. Checkpoints can also be initiated manually by typing 'c' into the coordinator. (default: 0, disabled; dmtcp_coordinator only)

DMTCP_HOST=string
Hostname where the cluster-wide coordinator is running. (default: localhost; dmtcp_checkpoint, dmtcp_restart only)

DMTCP_PORT=integer
The port the cluster-wide coordinator listens on. (default: 7779)

DMTCP_GZIP=(1|0)
Set to "0" to disable compression of checkpoint images. (default: 0, compression disabled; dmtcp_checkpoint only)

DMTCP_CHECKPOINT_DIR=path
Directory to store checkpoint images in. (default: ./)

DMTCP_SIGCKPT=integer
Signal number to use for checkpointing. Must not be used by the user program. (default: SIGUSR2; dmtcp_checkpoint only)
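
As a minimal sketch of how these variables might be set before launching a program, the fragment below uses bash syntax. The directory path is a hypothetical placeholder, and ./myprogram is the illustrative program name used under EXAMPLE USAGE.

```shell
# Sketch only: the checkpoint directory below is a hypothetical placeholder.
ckpt_dir="${TMPDIR:-/tmp}/dmtcp-ckpt"
mkdir -p "$ckpt_dir"

export DMTCP_CHECKPOINT_DIR="$ckpt_dir"   # where checkpoint images are stored
export DMTCP_GZIP=0                       # "0" disables image compression

# With these exported, the program would then be started under DMTCP, e.g.:
#   dmtcp_checkpoint ./myprogram
```

In tcsh the equivalent lines would use "setenv NAME value" instead of "export NAME=value".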
 

DMTCP_COORDINATOR

A dmtcp_coordinator process must be started in order for either dmtcp_checkpoint or dmtcp_restart to operate. There should be exactly one dmtcp_coordinator for each network of processes. In the case of multiple hosts, the address of the single global coordinator should be communicated to dmtcp_checkpoint and dmtcp_restart through the DMTCP_HOST and DMTCP_PORT environment variables.
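
A multi-host setup along these lines could look like the sketch below. The host name node01 is a hypothetical placeholder; 7779 is the default port listed under OPTIONS, and the port argument follows the form shown in the SYNOPSIS.

```shell
# Sketch only: "node01" is a hypothetical host name.
coord_host=node01
coord_port=7779

# On the coordinator host, the port is the optional argument from SYNOPSIS:
#   dmtcp_coordinator "$coord_port"

# On every other host, point dmtcp_checkpoint and dmtcp_restart at it:
export DMTCP_HOST="$coord_host"
export DMTCP_PORT="$coord_port"
#   dmtcp_checkpoint ./myprogram
```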

The coordinator is stateless and is not checkpointed. The dmtcp_coordinator initiates checkpoints of all processes in the system. Checkpoints can be performed automatically on an interval (see DMTCP_CHECKPOINT_INTERVAL above), or they can be initiated manually on the command line of the coordinator.
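
For interval-based checkpointing, the value is given in seconds, so a longer period has to be converted first. The sketch below assumes an arbitrary example period of 30 minutes.

```shell
# Sketch only: 30 minutes is an arbitrary example period.
minutes=30
export DMTCP_CHECKPOINT_INTERVAL=$((minutes * 60))   # the variable is in seconds

# Start the coordinator from this shell so that it inherits the variable:
#   dmtcp_coordinator
```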

The coordinator accepts the following commands on its standard input. Each command should be followed by the <return> key. The commands are:

  l : List connected nodes

  s : Print status message

  c : Checkpoint all nodes

  f : Force a restart even if there are missing nodes (debugging)

  k : Kill all nodes

  q : Kill all nodes and quit

  ? : Show this message

Coordinator commands can also be issued remotely using dmtcp_command.  

EXAMPLE USAGE

1. In a separate terminal window, start the dmtcp_coordinator. (See previous section.)


 dmtcp_coordinator

2. In separate terminal(s), replace each command to be checkpointed with "dmtcp_checkpoint [command]". The checkpointed program will connect to the coordinator specified by DMTCP_HOST and DMTCP_PORT. Child processes will automatically be checkpointed. Remote processes started via ssh will also be checkpointed automatically. (The ssh command line will be modified to call dmtcp_checkpoint on the remote host.)


 dmtcp_checkpoint ./myprogram

3. To manually initiate a checkpoint, either run the command below or type "c" followed by <return> into the coordinator. Checkpoint files for each process will be written to DMTCP_CHECKPOINT_DIR. The dmtcp_coordinator will write "dmtcp_restart_script.sh" to its working directory. This script contains the necessary calls to dmtcp_restart to restart the entire computation.


  dmtcp_command -c

4. To restart, one should use dmtcp_restart_script.sh created by the dmtcp_coordinator. One can optionally edit this script to migrate processes to different hosts. In order to give a restarted program standard input, the script must be edited to run the desired process in the foreground of a terminal.


 ./dmtcp_restart_script.sh
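
The restart step can be wrapped in a small guard so that a missing script is reported clearly. This is only a sketch: the function name restart_from_checkpoint is hypothetical and is not part of DMTCP. It assumes the script sits in the coordinator's working directory unless another path is given.

```shell
# Sketch only: restart_from_checkpoint is a hypothetical helper, not a DMTCP command.
restart_from_checkpoint() {
    script="${1:-./dmtcp_restart_script.sh}"
    if [ ! -x "$script" ]; then
        echo "no executable restart script at $script" >&2
        return 1
    fi
    # Run the script that the coordinator wrote at checkpoint time.
    "$script"
}
```

Calling restart_from_checkpoint with no argument runs ./dmtcp_restart_script.sh from the current directory.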

 

PROGRAMMING INTERFACE

DMTCP provides a programming interface to allow checkpointed applications to interact with dmtcp.

The user application should link with libdmtcpaware.so (-ldmtcpaware) and use the header file dmtcp/dmtcpaware.h.
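
A build step for such an application might look like the sketch below. The function name build_dmtcp_aware and the file names are hypothetical, and the compiler is assumed to be available as cc (or through the CC variable).

```shell
# Sketch only: build_dmtcp_aware and its arguments are hypothetical names.
build_dmtcp_aware() {
    src="$1"
    out="$2"
    # -ldmtcpaware links against libdmtcpaware.so. Extra -I/-L flags may be
    # needed if DMTCP is installed outside the default search paths.
    "${CC:-cc}" -o "$out" "$src" -ldmtcpaware
}

# Example call (requires DMTCP to be installed):
#   build_dmtcp_aware myprogram.c myprogram
```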

For more information see: http://dmtcp.sourceforge.net/  

SEE ALSO

Full documentation is available from http://dmtcp.sourceforge.net/  

AUTHORS

DMTCP and its standalone single-process component MTCP (MultiThreaded CheckPointing) were created and are maintained by Jason Ansel, Kapil Arya, Gene Cooperman, Mike Rieker, Ana Maria Visan, and Alex Brick.


 

