DMTCP (Distributed MultiThreaded Checkpointing) is a tool for transparently checkpointing the state of an arbitrary group of programs spread across many machines and connected by sockets. It modifies neither the user's program nor the operating system.
Among the applications supported by DMTCP are OpenMPI, MATLAB, Python, Perl, and many other programming languages and shell scripting languages. With the use of TightVNC, it can also checkpoint and restart X-Windows applications, as long as they do not use extensions (e.g., no OpenGL, no video). Among the Linux features supported by DMTCP are open file descriptors, pipes, sockets, signal handlers, process id and thread id virtualization (which ensures that old pids and tids continue to work upon restart), ptys, fifos, process group ids, session ids, terminal attributes, and mmap/mprotect (including mmap-based shared memory). See the QUICK-START file of the distribution for further details.
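The idea behind pid virtualization can be illustrated with a small translation table. The sketch below is only a conceptual model, not DMTCP's implementation: the class `PidTable` and its methods are hypothetical names introduced here, and it assumes that the application keeps using the pid it saw at first launch while the kernel assigns a fresh pid on restart.

```python
class PidTable:
    """Hypothetical model: maps stable virtual pids to current real pids."""

    def __init__(self):
        self._virt_to_real = {}

    def register(self, real_pid):
        # At first launch the virtual pid is simply the original real pid.
        self._virt_to_real[real_pid] = real_pid
        return real_pid

    def rebind(self, virt_pid, new_real_pid):
        # After a restart the kernel hands out a new real pid;
        # the virtual pid seen by the application stays the same.
        if virt_pid not in self._virt_to_real:
            raise KeyError(f"unknown virtual pid {virt_pid}")
        self._virt_to_real[virt_pid] = new_real_pid

    def to_real(self, virt_pid):
        # Translate a pid used by the application into the kernel's pid.
        return self._virt_to_real[virt_pid]

    def to_virtual(self, real_pid):
        # Translate a pid reported by the kernel back to the application's view.
        for virt, real in self._virt_to_real.items():
            if real == real_pid:
                return virt
        raise KeyError(f"unknown real pid {real_pid}")


if __name__ == "__main__":
    table = PidTable()
    virt = table.register(4242)   # original launch
    table.rebind(virt, 9001)      # process restored under a new real pid
    print(virt, table.to_real(virt))
```

In this model, a pid stored by the program before the checkpoint (here 4242) remains valid afterwards, because every use of it is translated to the current real pid.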
DMTCP does not yet support InfiniBand or Myrinet for OpenMPI. Support for these is planned for the near term. Additional developers are welcome.
DMTCP is also the basis for URDB, the Universal Reversible Debugger. URDB is still experimental. Nevertheless, it currently adds reversibility to gdb, MATLAB, Python (pdb), and Perl (perl -d). It also supports reverse expression watchpoints, a form of temporal search within a process's lifetime.
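To give a feel for what a temporal search means, the sketch below looks for the earliest recorded point at which an expression becomes true. It is an illustration only and does not describe how URDB works internally: the function `first_change` is a hypothetical name, the process history is assumed to be available as a list of state snapshots, and the expression is assumed to be false up to some point and true from then on.

```python
def first_change(history, expr):
    """Return the smallest index i with expr(history[i]) true, or None.

    Assumes expr is false for every snapshot before i and true from i on,
    so a binary search over the recorded history is valid.
    """
    lo, hi = 0, len(history)
    while lo < hi:
        mid = (lo + hi) // 2
        if expr(history[mid]):
            hi = mid          # the change happened at mid or earlier
        else:
            lo = mid + 1      # the change happened after mid
    return lo if lo < len(history) else None


if __name__ == "__main__":
    # Hypothetical snapshots of a single variable x over time.
    snapshots = [{"x": 0}, {"x": 0}, {"x": 3}, {"x": 3}, {"x": 3}]
    print(first_change(snapshots, lambda state: state["x"] != 0))  # prints 2
```

The binary search needs only a logarithmic number of evaluations of the expression, which matters when each evaluation requires restoring an earlier state.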
For further information, see the DMTCP Sourceforge project page.
To obtain the most recent (possibly unstable) source from Subversion, run the following command:
svn co https://dmtcp.svn.sourceforge.net/svnroot/dmtcp/trunk dmtcp
DMTCP is completely transparent and can checkpoint unmodified Linux binaries. However, if you wish to call DMTCP from within your checkpointed program, we provide an optional programming interface called DMTCP Aware. To use DMTCP Aware:
DMTCP: Transparent Checkpointing for Cluster Computations and the Desktop.
Jason Ansel, Kapil Arya, and Gene Cooperman.
23rd IEEE International Parallel and Distributed Processing Symposium (IPDPS'09).
Rome, Italy. May, 2009.
Transparent User-Level Checkpointing for the Native POSIX Thread Library for Linux.
Michael Rieker, Jason Ansel, and Gene Cooperman.
The 2006 International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA'06).
Las Vegas, NV. June, 2006.