DMTCP: Distributed MultiThreaded CheckPointing

Parallel Computing:

There are three particular areas where DMTCP can support your parallel computing needs:

  • Parallel Languages: Transparent support for parallel languages (MPI, UPC (Unified Parallel C/C++), etc.). These are viewed by DMTCP as just a "black box" consisting of distributed processes; DMTCP also supports transparent checkpointing of the ssh connections often used by these languages.
  • The Network: Transparent support for distributed processes over TCP sockets and over InfiniBand.
  • Resource Managers (the batch queue): Support for Checkpoint-Restart for several popular resource managers (e.g., SLURM, Torque).

If DMTCP doesn't work for your favorite parallel language, network, or resource manager, please write to us. (See "Contact Us".)

Specific Application Targets:

  • MPI: We regularly test on Open MPI and MVAPICH2, but we do not have the resources to test frequently on every MPI implementation or configuration. If DMTCP is not working for your setup, please send a message to dmtcp-forum (see "Contact Us" for that and other channels). If you run MPI over InfiniBand, see the InfiniBand item below. See also the FAQ questions on MPI.
  • InfiniBand: Please use the --infiniband flag of dmtcp_launch for InfiniBand support. This invokes the InfiniBand plugin for DMTCP. MPI/InfiniBand support has been shown to be robust in many configurations in release 2.2.1, but we are still testing other configurations. Please write to us (see "Contact Us") if you observe any bugs.
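
As an illustration of the --infiniband flag described above, here is a minimal sketch of how a launch command for an MPI job might be assembled. Only dmtcp_launch and --infiniband come from the text; the helper build_launch_cmd, the mpirun arguments, and the application name ./my_mpi_app are hypothetical placeholders, and the right invocation for your MPI implementation may differ. The script only prints the command, so it can be tried without DMTCP installed.

```shell
#!/bin/sh
# Hypothetical helper (not part of DMTCP): build a dmtcp_launch command line.
# First argument: "yes" to request the InfiniBand plugin, anything else to skip it.
# Remaining arguments: the program to run under DMTCP and its own arguments.
build_launch_cmd() {
    use_ib="$1"
    shift
    if [ "$use_ib" = "yes" ]; then
        # --infiniband is the dmtcp_launch flag named in the text above.
        echo "dmtcp_launch --infiniband $*"
    else
        echo "dmtcp_launch $*"
    fi
}

# Placeholder job: 4 MPI ranks of an assumed binary ./my_mpi_app.
cmd=$(build_launch_cmd yes mpirun -np 4 ./my_mpi_app)

# Dry run: print the command instead of executing it.
echo "$cmd"
# prints: dmtcp_launch --infiniband mpirun -np 4 ./my_mpi_app
```

To run the job for real, execute the printed command on a system where DMTCP and your MPI implementation are installed.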
