DMTCP: Distributed MultiThreaded CheckPointing

About DMTCP:

DMTCP (Distributed MultiThreaded Checkpointing) transparently checkpoints a single-host or distributed computation in user-space -- with no modifications to user code or to the O/S. It works on most Linux applications, including Python, Matlab, R, GUI desktops, MPI, etc. It is robust and widely used (on Sourceforge since 2007).

Among the applications supported by DMTCP are MPI (various implementations), OpenMP, MATLAB, Python, Perl, R, and many programming languages and shell scripting languages. With the use of TightVNC, it can also checkpoint and restart X-Window applications. The OpenGL library for 3D graphics is supported through a special plugin. It also has strong support for HPC (High Performance Computing) environments, including MPI, SLURM, InfiniBand, and other components. See the QUICK-START file for further details.

DMTCP supports the commonly used OFED API for InfiniBand, as well as its integration with various implementations of MPI and with resource managers (e.g., SLURM). See contrib/infiniband/README for more details.


Announcement!

We are currently looking for well qualified applicants who are interested in joining a Ph.D. program in order to do research on checkpointing and reversible debugging. Interested applicants should write to Gene Cooperman (gene@ccs.neu.edu) at Northeastern University.
[2015-05-06]: DMTCP 2.4.0-rc4 released!
[The release candidate version is available from the DMTCP Sourceforge project page.]

[NOTE: dmtcp_command is not working in rc4. It is already fixed on GitHub. If you need this command (e.g., in MPI batch scripts), the fixed version is available through downloads. We believe the standard commands (dmtcp_launch, dmtcp_restart, dmtcp_coordinator) are all working. Thank you for your patience as we continue to fully test release 2.4.0.]

[NOTE: If you are running a 32-bit Linux, then after configure and make, you will need to execute (cd bin && ln -s mtcp_restart mtcp_restart-32) from the DMTCP root directory. This mis-feature will also be fixed in the next release candidate.]
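The 32-bit workaround above can be wrapped in a small guard so that it is safe to re-run. This is only a sketch: the function name link_mtcp_restart_32 and its directory argument are illustrative and not part of DMTCP; only the file names mtcp_restart and mtcp_restart-32 come from the note.

```shell
# Hypothetical helper (not part of DMTCP): create the mtcp_restart-32 link
# described in the note, inside the given bin directory (default: ./bin).
link_mtcp_restart_32() {
    bindir="${1:-bin}"
    # Nothing to link against if the build has not produced mtcp_restart.
    if [ ! -e "$bindir/mtcp_restart" ]; then
        echo "no mtcp_restart in $bindir" >&2
        return 1
    fi
    # Skip if the link (or a real file) already exists, so re-runs are harmless.
    if [ -e "$bindir/mtcp_restart-32" ] || [ -L "$bindir/mtcp_restart-32" ]; then
        return 0
    fi
    # Relative link, same as: (cd bin && ln -s mtcp_restart mtcp_restart-32)
    (cd "$bindir" && ln -s mtcp_restart mtcp_restart-32)
}
```

Run it from the DMTCP root directory after configure and make, and only on a 32-bit Linux, for example: [ "$(getconf LONG_BIT)" = 32 ] && link_mtcp_restart_32 bin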

[NOTE: On some RedHat/Fedora/CentOS distros only, the perl, tcsh, vim binaries are using libfreebl3.so with the prelink application. This causes a bug on DMTCP launch. Unfortunately, we do not have access to a debuginfo package or a build of libfreebl3.so with '-g' on a distro that exhibits the bug. Until we have such access, we are unable to read the symbol names needed to debug these binaries. Assistance anyone?]

This is release candidate 4 for DMTCP 2.4.0. It is especially important to upgrade:
  if you are using MATLAB;
  or if you are using MPI, including the following resource managers: SLURM, Torque, ibrun;
  or if you have a newer Linux kernel using '[vvar]' (test with: grep '\[vvar]' /proc/self/maps );
  or if you use glibc-2.21 or later (test with: ls -l /lib*/libc.so.6 /lib/*/libc.so.6 );
  or if you are using the ARM CPU (either 32- or 64-bit versions);
  or if you are building DMTCP with the Intel icc compiler;
  or if you wish to create a standalone m32 build (32-bits: ./configure --enable-m32) on a 64-bit Linux.
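The '[vvar]' and glibc tests from the list above can be scripted rather than read by eye. The following is a minimal sketch: the helper names has_vvar, version_ge, and glibc_version are illustrative and not part of DMTCP, and the version comparison assumes GNU sort (for the -V option).

```shell
# Hypothetical helpers (not part of DMTCP) for two checks in the list above.

# True if the given maps file has a [vvar] region.
# Default file: /proc/self/maps, as in the grep test above.
has_vvar() {
    grep -q '\[vvar\]' "${1:-/proc/self/maps}"
}

# True if dotted version $1 is >= dotted version $2 (needs GNU sort -V).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Print the glibc version number, e.g. "2.21" (works on glibc systems only).
glibc_version() {
    getconf GNU_LIBC_VERSION | awk '{print $2}'
}
```

For example, has_vvar && echo "kernel uses [vvar]" reports the kernel case, and version_ge "$(glibc_version)" 2.21 && echo "glibc-2.21 or later" reports the glibc case.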
Note: The options --host and DMTCP_HOST are now deprecated in favor of --coord-host and DMTCP_COORD_HOST (and similarly for --port/DMTCP_PORT).
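Existing job scripts that still export the deprecated variables could be bridged with a small shim while they are being updated. This is a sketch under assumptions: the function name migrate_dmtcp_env is illustrative, and DMTCP_COORD_PORT is inferred from the "similarly for --port/DMTCP_PORT" wording above rather than spelled out there.

```shell
# Hypothetical shim (not part of DMTCP): copy the deprecated variables to
# their replacements when only the old names are set.
migrate_dmtcp_env() {
    if [ -z "${DMTCP_COORD_HOST:-}" ] && [ -n "${DMTCP_HOST:-}" ]; then
        DMTCP_COORD_HOST="$DMTCP_HOST"
        export DMTCP_COORD_HOST
    fi
    # DMTCP_COORD_PORT is assumed by analogy with DMTCP_COORD_HOST.
    if [ -z "${DMTCP_COORD_PORT:-}" ] && [ -n "${DMTCP_PORT:-}" ]; then
        DMTCP_COORD_PORT="$DMTCP_PORT"
        export DMTCP_COORD_PORT
    fi
}
```

A new-style variable that is already set always wins, so calling the shim in a script that has been fully migrated changes nothing.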
Here is the NEWS file from the internal development branch for the upcoming full release.
[2015-04-25]: DMTCP 2.4.0-rc3 released!
This is release candidate 3 for DMTCP 2.4.0. (See above for latest release candidate.)
[2015-03-25]: DMTCP 2.4.0-rc2 released!
This is release candidate 2 for DMTCP 2.4.0. (See above for latest release candidate.)
[2015-03-17]: DMTCP 2.4.0-rc1 released!
This is release candidate 1 for DMTCP 2.4.0. (See above for latest release candidate.)
[2014-07-14]: DMTCP 2.3.1 released!
This is primarily a bug fix release.
[2014-07-03]: DMTCP 2.3 released!
This is primarily a bug fix release. However, if you are using DMTCP for the ARM v7 CPU, or if you are using DMTCP either with the InfiniBand network or with the SLURM batch system, then it is strongly recommended to upgrade. Check the release notes for more details.
[2014-03-20]: DMTCP 2.2.1 released!
This is a bug fix release. Users relying on the --enable-unique-checkpoint-filenames configure flag are strongly advised to upgrade to this release. Check the release notes for more details.
[2014-03-14]: DMTCP 2.2 released!
In this release, the lowest layers have been re-organized and partially re-written for greater clarity of code and greater maintainability. Also, users relying on the use of DMTCP with MPI, InfiniBand, or the Torque or SLURM batch queues are strongly advised to upgrade. Check the release notes for more details.
[2014-01-12]: DMTCP 2.1 released!
This release includes enhancements to the core feature set and some newly stable plugins. Check the release notes for more details.
[2013-10-03]: DMTCP 2.0 released!
This version 2.0 release represents the future of DMTCP. DMTCP version 2.0 has been re-designed around the concept of plugins. The older DMTCP version 1.2.x branch will continue to be maintained for bug fixes. Check the release notes for more details.
DMTCP is currently maintained by Kapil Arya, Gene Cooperman, Rohan Garg, Jiajun Cao, and Artem Polyakov. The list of active developers continues to evolve.
The DMTCP project is partially supported by Intel Corporation and by the National Science Foundation under grant OCI-0960978. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of Intel Corporation or of the National Science Foundation.
