DMTCP: Distributed MultiThreaded CheckPointing

Frequently Asked Questions:


If you don't see your question here, consider searching within the archive of past dmtcp-forum messages, or else asking your question directly (see "Contact Us" on the left).


    1. % dmtcp_launch ./a.out arg1 arg2 ...
      % dmtcp_command --checkpoint     [from another terminal window on same computer]
      % dmtcp_restart ckpt_a.out_*.dmtcp
      If using the above recipe, make sure to first remove old ckpt images. DMTCP also writes out ./dmtcp_restart_script.sh, which handles various bookkeeping and is safer to use.
      NOTE: Numerous configure and run-time options are also available.
    2. It checkpoints most binary programs on most Linux distributions. Some examples on which users have verified that DMTCP works are: Matlab, R, Java, Python, Perl, Ruby, PHP, Ocaml, GCL (GNU Common Lisp), emacs, vi/cscope, Open MPI, MPICH-2, MVAPICH2, Intel® MPI, OpenMP, and Cilk. Both TCP and InfiniBand connections are supported. See Supported Applications for further details. Our goal is to support DMTCP for all vanilla programs. If DMTCP does not work correctly on your program, then this is a bug in DMTCP, and we would appreciate it if you filed a bug report with DMTCP.
    3. It is free software distributed under the terms of the GNU Lesser General Public License (LGPL). This license was chosen to be non-contagious. If you distribute a modified version of DMTCP, you must make your modifications to DMTCP available to your users. But proprietary or other software may freely use the DMTCP libraries and utilities with no restrictions on the proprietary or other software.
    4. No. DMTCP is completely transparent in this sense.
    5. No. DMTCP requires no root permissions. Hence, other software packages can easily include DMTCP as part of their distribution (subject to DMTCP's LGPL license). (See also setuid and other special privileges.)
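      The launch, checkpoint, restart recipe in the first answer of this list can be wrapped in a few shell functions. This is only a sketch: the DRY_RUN variable and the function names are hypothetical helpers, not part of DMTCP, and the only DMTCP commands used are the ones quoted in the recipe.

```shell
# Sketch of the launch / checkpoint / restart recipe.
# DRY_RUN and all function names here are hypothetical helpers, not DMTCP features.

# Run a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Step 0: remove stale checkpoint images so a later glob matches only new ones.
clean_old_images() {
  run rm -f ./ckpt_*.dmtcp
}

# Step 1: launch the application under DMTCP.
launch_app() {
  run dmtcp_launch "$@"
}

# Step 2: from another terminal on the same computer, request a checkpoint.
request_checkpoint() {
  run dmtcp_command --checkpoint
}

# Step 3: restart, preferring the generated script, which the recipe calls safer.
restart_app() {
  if [ -x ./dmtcp_restart_script.sh ]; then
    run ./dmtcp_restart_script.sh
  else
    run dmtcp_restart ./ckpt_*.dmtcp
  fi
}
```

      With DRY_RUN=1 the functions only print the commands, which is a convenient way to review them before running anything under DMTCP.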
    1. A DMTCP computation consists of all processes connected to a given coordinator. To have two simultaneous DMTCP computations on the same host, you will need two DMTCP coordinators listening to different port numbers. The command dmtcp_coordinator generates a new coordinator. By default, dmtcp_launch (dmtcp_checkpoint) will first look for an existing coordinator on the localhost on port 7779 and join that existing DMTCP computation. If no such coordinator is found, dmtcp_launch (dmtcp_checkpoint) will create a new coordinator and then join it as the first process of that computation. A checkpoint is initiated when the coordinator tells all its connected processes to create a checkpoint. Each client of the coordinator writes a file, ckpt_*.dmtcp, on its local machine, and the coordinator writes dmtcp_restart_script.sh in its own local directory. The script can be used to restart all processes using the ckpt_*.dmtcp files on the various hosts. Since the coordinator initiates all checkpoints, it is the coordinator that remembers the checkpoint interval. Both dmtcp_command --interval and dmtcp_launch --interval can be used to set the checkpoint interval on the coordinator.
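      As an illustration of two simultaneous computations on one host, the sketch below only prints the command lines for two coordinators and their clients. It assumes the port flag is spelled --coord-port for both dmtcp_coordinator and dmtcp_launch, as in recent releases; please confirm with --help on your version. The helper names, the second port (7780), and the application names are hypothetical.

```shell
# Build the command lines for two independent DMTCP computations on one host.
# ASSUMPTION: --coord-port is the port flag; check `dmtcp_coordinator --help`.
DEFAULT_PORT=7779   # default coordinator port named in this FAQ

coordinator_cmd() {   # usage: coordinator_cmd [PORT]
  echo "dmtcp_coordinator --coord-port ${1:-$DEFAULT_PORT}"
}

launch_cmd() {        # usage: launch_cmd PORT INTERVAL_SECONDS APP [ARGS...]
  local port="$1" interval="$2"
  shift 2
  echo "dmtcp_launch --coord-port $port --interval $interval $*"
}

# Print the plan instead of running it, so it can be reviewed first.
# ./sim_a and ./sim_b are placeholder application names.
print_two_computations() {
  coordinator_cmd 7779
  launch_cmd 7779 60 ./sim_a
  coordinator_cmd 7780
  launch_cmd 7780 300 ./sim_b
}
```

      Each client joins the coordinator listening on the port it was given, so the two computations checkpoint independently and each coordinator keeps its own interval.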
    2. Yes, migration works. Look at dmtcp_restart_script.sh for some options on migrating both single-process and distributed process computations. Homogeneous host architectures are best, but some heterogeneity is also tolerated. This works best if the source and destination Linux distro are both recent. However, test the migration first. Problems are most likely to occur when migrating from a newer Linux/CPU to an older Linux/CPU, since an older distro is often not future-proof.

      Note that in migrating between arbitrary computers, it is possible to encounter "illegal instruction" on restart. This can occur if the CPUs of the source machine and destination machine are different. Most often, this occurs when migrating from a newer CPU to an older CPU that does not support the full range of extensions to the CPU instruction set. It may sometimes occur in migrating between AMD and Intel CPUs if the application or libraries were optimized during compilation for one CPU only. In principle, one can get around this with gcc -mtune=generic, but you would also have to make sure that (i) DMTCP, (ii) your target application, and (iii) all libraries used by it (including libc.so) are compiled with gcc -mtune=generic.
          Finally, DMTCP generally supports process migration when migrating from an older Linux kernel to a newer Linux kernel. This works, because the checkpoint image of the target application contains all of the original run-time libraries, including libc.so. So, when an older libc.so makes a call to a newer Linux kernel, the newer Linux kernel will generally preserve backwards compatibility.

    3. The most full-featured mechanism is through DMTCP plugins. See especially application-initiated checkpointing (using dmtcp_checkpoint()), and application-delayed checkpointing (using dmtcp_disable_ckpt()/dmtcp_enable_ckpt()).
    4. Normally, commands like dmtcp_launch a.out (dmtcp_checkpoint a.out) and dmtcp_restart ckpt_a.out_*.dmtcp pass on the exit code that is returned by a.out itself. If dmtcp_launch or dmtcp_restart is passed an invalid command line (e.g., no such ckpt file), then they will exit with exit code 99 (by default), or the integer value of DMTCP_FAIL_RC if that environment variable has been set.
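      A wrapper script can use this rule to tell a DMTCP failure from a failure of the application. The following is a minimal sketch; classify_dmtcp_rc is a hypothetical helper, not a DMTCP command.

```shell
# Classify the exit status of dmtcp_launch or dmtcp_restart.
# classify_dmtcp_rc is a hypothetical helper, not a DMTCP command.
classify_dmtcp_rc() {   # usage: classify_dmtcp_rc EXIT_STATUS
  local rc="$1"
  local fail_rc="${DMTCP_FAIL_RC:-99}"   # 99 is the documented default
  if [ "$rc" -eq "$fail_rc" ]; then
    echo "dmtcp-error"      # invalid command line, e.g. no such ckpt file
  elif [ "$rc" -eq 0 ]; then
    echo "app-success"
  else
    echo "app-failure"      # exit code passed on from the application
  fi
}

# Typical use (not executed here):
#   dmtcp_restart ckpt_a.out_*.dmtcp; classify_dmtcp_rc $?
```

      If the application itself can exit with 99, the two cases are indistinguishable; setting DMTCP_FAIL_RC to a value the application never returns avoids that ambiguity.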
    5. Please look at the directory of the modify-env plugin. In particular, look at the README file and the dmtcp_env.txt example from that directory. You can invoke this plugin with:
          dmtcp_launch --with-plugin /absolute/path/to/libdmtcp-modify-env.so
      (If you use LD_LIBRARY_PATH, you can also avoid the need for absolute pathnames.)
    6. Yes. Please look at the directory of DMTCP plugin examples for a flexible mechanism for third-party plugins (add-ons). This includes support for: (i) wrappers around system calls; (ii) hooks for particular events (e.g.: startup, ckpt, resume, restart); and (iii) for populating and querying a nameservice database across distributed processes. Application-initiated checkpointing is also provided. No re-compilation or relinking of the application software is necessary. The DMTCP source also provides a tutorial, doc/plugin-tutorial.pdf.
    7. DMTCP works internally by preloading its own library into the target application (see How does DMTCP Work?). Linux will not honor the setuid bit if a foreign library is being preloaded (for obvious reasons). So, either the application must be run in a mode not requiring special privileges, or DMTCP must be run in a privileged manner. This "Stack Overflow" web page describes two strategies for allowing DMTCP to initially run with privileges. The constructor to use in the case of DMTCP is dmtcp::DmtcpWorker::DmtcpWorker() (or dmtcp::DmtcpWorker::DmtcpWorker(bool) for earlier than DMTCP-2.4) in DMTCP_ROOT/src/dmtcpworker.cpp. (If you're putting this in a gdbinit script, don't forget to use: set break pending on .) Group permissions, and security authorizations such as polkit, ACL and PAM may offer other options.
    8. Use readdmtcp.sh. (The example below assumes only one dmtcp ckpt file is present.)
        <DMTCP_DIR>/util/readdmtcp.sh ckpt_*.dmtcp
    9. Detailed advice is available in the file doc/debugging-dmtcp.txt (although this FAQ is often more current). General advice follows.

      When using GDB with DMTCP, you may find it useful to load some utilities (available since DMTCP-2.6.1 and DMTCP-3.0):
      (gdb) source util/gdb-dmtcp-utils
      (gdb) dmtcp # lists the new GDB commands

      For compiling DMTCP with debugging support under GNU gcc, we recommend:
         ./configure --enable-debug ("-Wall -g3 -O0" for gcc and g++)
         make -j clean && make -j
      (Omit the "-j" if you are on a less powerful computer.)

      CATCHING DMTCP INTERNAL ERRORS (CREATE A CORE DUMP):
      Prior to running dmtcp_launch, do:
      ulimit -c unlimited && export DMTCP_ABORT_ON_FAILED_ASSERT=1
      (and also, if using GDB, set a breakpoint at _exit)

      DEBUGGING DURING INITIAL LAUNCH:     Run it as: gdb --args dmtcp_launch ./a.out
      (Older versions of DMTCP use dmtcp_checkpoint instead of dmtcp_launch.)
      Then dmtcp_launch will call execvp on ./a.out. So, try the following:
      (gdb) break execvp
      (gdb) run
      (gdb) # [stops at execvp]
      (gdb) break main
      (gdb) next
      (gdb) next
      (gdb) ...
          Note that dmtcp_launch calls execvp to execute the main routine of the application (a.out in the above example). If you want to see the actions of the dmtcphijack.so library starting after the call to execvp and before the application's main routine, then at the beginning of your GDB session, do:
      (gdb) break execvp
      (gdb) run
      (gdb) # 'pending on' is required if using a gdbinit script
      (gdb) set breakpoint pending on
      (gdb) # For DMTCP-2.4 and later in DMTCP-2.x:
      (gdb) break dmtcp_initialize
      (gdb) # For DMTCP-3.0 and later:
      (gdb) #   break dmtcp_initialize_entry_point
      (gdb) #     OBSOLETE: For prior to DMTCP-2.4:
      (gdb) #     OBSOLETE: break 'dmtcp::DmtcpWorker::DmtcpWorker(bool)'
      (gdb) continue # Might need to repeat 'continue' a few times.
      We have found the GDB command info proc mappings useful for deciding if an address belongs to libc.so, dmtcphijack.so, your target application, or other.

      NOTE: If you want to trace the internals of DMTCP (in addition to using GDB as above), see tracing DMTCP internals.

      DEBUGGING DURING CHECKPOINT: Begin your DMTCP session under GDB (gdb --args dmtcp_launch ...), and just run (without any checkpoints). At the time of checkpoint, the checkpoint thread (typically thread 2 in GDB) will send a SIGUSR2 to each user thread. By default, GDB will intercept that signal, announce it to the user, and wait until the user executes:
      (gdb) signal SIGUSR2
      At this time, you can gain control with GDB. Tell GDB to switch to thread 2, and you can then examine the stack and set a breakpoint, before issuing the "signal SIGUSR2" command.

      DEBUGGING DURING RESTART:     To capture your process under GDB during dmtcp_restart, you need a more roundabout strategy. This is because dmtcp_restart calls mmap to reload the memory of your process. So, the best way is to use gdb attach or gdb ./a.out `pgrep -n a.out` after your process has restarted.
          In order to assist in using gdb to attach, your restarted process can be forced to pause deep within DMTCP just as it restarts, by setting the environment variable DMTCP_RESTART_PAUSE2. (Set DMTCP_RESTART_PAUSE2 before the original dmtcp_launch, since the restarted process will remember only the original environment variables prior to checkpoint. Prior to DMTCP-2.4, the variable had the name MTCP_RESTART_PAUSE.) The environment variable DMTCP_RESTART_PAUSE is available to capture the restart even earlier; it is primarily of interest to DMTCP developers. On restart, dmtcp_restart will then pause with a message to attach using a gdb command. In earlier versions of DMTCP, the gdb attach command may fail with: Operation not permitted. In those cases, you may execute: echo 0 > /proc/sys/kernel/yama/ptrace_scope (requires root privilege, and may pose a security risk).
          Beginning with DMTCP-2.4, one can also set the environment variable DMTCP_GDB_ATTACH_ON_RESTART prior to executing dmtcp_restart. This also allows one to use "gdb attach" on the restarted process for debugging.
          Note that after DMTCP 2.0, the MTCP directory was merged into DMTCP itself, and it is no longer possible to run MTCP as a standalone application. If you need that functionality, please consider using dmtcp_launch --no-coordinator with the latest DMTCP release. Having said that, if you are using an older version of DMTCP, the information above is valid.

      DEBUGGING PLUGINS: Some bugs may be produced by interactions among plugins. In such cases, consider temporarily disabling plugins, and see if the bug goes away. (This is similar to the standard advice often given for web browsers.) In some exceptional cases, there can also be a bug in the interaction with internal plugins of DMTCP. See "debugging internal DMTCP plugins" for more information concerning this issue.

      In debugging during restart using 'gdb attach', we received reports in early 2020 that under Ubuntu (e.g., Ubuntu 18.04) the symbol tables are not visible after attaching: the stack is shown with addresses, but without function names. We see this on Ubuntu, but not on CentOS. We are guessing that GDB under Ubuntu is having trouble "walking the stack" to find the symbol table. As a workaround, after restart, please use inside GDB:
      (gdb) source DMTCP_ROOT/util/gdb-add-symbol-files-all (or the older bash shell script from the command line, DMTCP_ROOT/util/gdb-add-symbol-file , if not using the latest DMTCP). The script DMTCP_ROOT/util/save-symbol-files-to-gdb-script.py may also be useful in this setting.
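      For the "Operation not permitted" failure mentioned under DEBUGGING DURING RESTART, it can help to check the current Yama setting before changing it. The sketch below decodes the value of /proc/sys/kernel/yama/ptrace_scope using the meanings given in the Linux kernel's Yama documentation; both function names are hypothetical helpers.

```shell
# Explain a Yama ptrace_scope value before trying `gdb attach`.
# explain_ptrace_scope and current_ptrace_scope are hypothetical helpers.
explain_ptrace_scope() {   # usage: explain_ptrace_scope VALUE
  case "$1" in
    0) echo "unrestricted: attaching to your own processes is allowed" ;;
    1) echo "restricted: only an ancestor or a declared tracer may attach" ;;
    2) echo "admin-only: attaching requires CAP_SYS_PTRACE" ;;
    3) echo "disabled: no attaching, and the setting cannot be changed" ;;
    *) echo "unknown value: $1" ;;
  esac
}

# Read the live setting if the Yama file exists on this kernel.
current_ptrace_scope() {
  local f=/proc/sys/kernel/yama/ptrace_scope
  if [ -r "$f" ]; then cat "$f"; else echo "unavailable"; fi
}

# Typical use (not executed here):
#   explain_ptrace_scope "$(current_ptrace_scope)"
```

      A restarted process is not a descendant of a freshly started gdb, which is why a value of 1 is enough to make the attach fail.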

    1. DMTCP has supported x86 and x86_64 since the beginning. It has been ported to the 32-bit ARM CPU (armv7/armv7a), using the newer EABI API for Linux on ARM. An experimental port to 64-bit ARM (armv8) has been added as of DMTCP-2.4.0. For porting to other CPUs, please see src/mtcp/PORTING (notes on how to port DMTCP).

      DMTCP has also been verified to work on the Intel Xeon Phi (back end, only, at this time) when built with the Intel icc compiler. We used
      ./configure --host=mic CC=icc CXX=icpc CFLAGS=-mmic CXXFLAGS=-mmic LDFLAGS=-mmic
      to build on the Intel MIC. Optionally, one can also add "-static-intel" to CFLAGS, CXXFLAGS and LDFLAGS.

    2. See doc/multi-arch.txt for details. In short,
      ./configure --enable-m32 && make clean && make -j && make install
      ./configure && make clean && make -j && make install
      Use ./configure --prefix=$PWD/build if you wish to build a local copy in $PWD/build, instead of installing globally.

      Note that --enable-m32 will not create the necessary DMTCP commands on a 64-bit Linux. So, on a 64-bit Linux, you will still need the 64-bit install, even if you intend only to run with 32-bit targets.

      On recent versions of DMTCP (late 2019), a --enable-multilib option has been made available to automate this.

    3. DMTCP operates only under Linux as of this writing. Because its design stays close to the POSIX API, it can be ported to other operating systems, given sufficient demand. If someone is interested in doing the work, please write to us, and we will share our ideas on how to do that port.

      See also "Implementing Checkpointing for Android" for work toward porting DMTCP to Android.

    4. DMTCP works directly with both TCP and InfiniBand. It would also be relatively easy to port it to IP, but we haven't seen a demand for this. When using InfiniBand, launch with: dmtcp_launch --infiniband ...
    5. There are also "dmtcp" packages in Debian (since version 7.0, "wheezy"), Ubuntu (since version 11.10), Fedora (since version 17), and Red Hat (via Fedora EPEL). In-house, we commonly use DMTCP under Ubuntu, CentOS, and openSUSE. On an irregular basis, we also test on other distributions.
    1. Small applications should checkpoint and restart in a second. There are several ways to speed up checkpoint/restart for larger applications:
      1. The disk is usually the slowest part of checkpoint/restart. Consider using a RamDisk. For example:
        sudo mount -t ramfs -o size=200m ramfs /path/to/dmtcp_ckpt_dir
        export DMTCP_CHECKPOINT_DIR=/path/to/dmtcp_ckpt_dir
        rm -f /path/to/dmtcp_ckpt_dir/ckpt_*.dmtcp
        dmtcp_launch YOUR_APP
        dmtcp_restart /path/to/dmtcp_ckpt_dir/ckpt_*.dmtcp
        (Warning: ramfs can continue to grow with the total size of files in the dmtcp_ckpt_dir, eventually freezing your system if you write too much. The related tmpfs does not suffer from this, but it uses the swapfile on disk, which can slow it down for large writes.)
      2. Restart will be faster with the environment variable DMTCP_TMPDIR (on ckpt and restart) pointing into the RamDisk created above.
        (Set DMTCP_TMPDIR before the initial launch of your application, since the environment variable is saved with your checkpoint image. This policy may be changed in the future.)
      3. By default, DMTCP uses gzip for dynamic compression of checkpoint images. Consider using dmtcp_launch --no-gzip . Alternatively, set an environment variable: export DMTCP_GZIP="0" . On restart, DMTCP will auto-detect whether gzip was used.
      4. Gzip was chosen because it is available almost universally. Some examples of newer, faster compression packages are: Snappy, LZO, FastLZ, and QuickLZ. Currently, you will have to modify the DMTCP source code to use these. With enough demand, we will make it easy for the end user to select a different compression package.
      5. Two configure options for improving checkpoint and restart speeds are offered. Please test that these optimizations are compatible with your application.
        • ./configure --enable-forked-checkpointing : Use fork-based copy-on-write to have a child checkpoint while the parent continues to execute
        • ./configure --enable-fast-restart : mmap-based on-demand paging from the checkpoint image
    2. If you write all zeroes to the memory region that does not need to be checkpointed, then DMTCP will convert those pages into zero-fill-on-demand pages in the checkpoint image. This creates a smaller checkpoint image, and results in a faster restart.
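      The environment-variable settings from the speed-up list above can be collected in one place and exported before the first launch. This is a sketch under the assumption that a RAM-backed directory has already been prepared (for example, the ramfs mount shown earlier); use_fast_ckpt_settings is a hypothetical helper name.

```shell
# Export the checkpoint speed-up settings before the first dmtcp_launch.
# use_fast_ckpt_settings is a hypothetical helper, not part of DMTCP.
use_fast_ckpt_settings() {   # usage: use_fast_ckpt_settings /path/to/ramdisk
  local dir="$1"
  [ -d "$dir" ] || { echo "no such directory: $dir" >&2; return 1; }
  export DMTCP_CHECKPOINT_DIR="$dir"   # where ckpt_*.dmtcp images are written
  export DMTCP_TMPDIR="$dir"           # saved in the image, so set before launch
  export DMTCP_GZIP=0                  # skip gzip; restart auto-detects this
}

# Typical use (not executed here):
#   use_fast_ckpt_settings /path/to/dmtcp_ckpt_dir && dmtcp_launch YOUR_APP
```

      Skipping gzip trades a larger image for less CPU time, which usually pays off only when the image is written to fast storage such as a RamDisk.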
    1. A DMTCP coordinator process is created on a host (default: localhost). As new processes are created (via fork or ssh), the LD_PRELOAD environment variable (supported by the Linux loader) is used to preload the DMTCP library (dmtcphijack.so). That library runs before the routine main(). It creates a second thread (DMTCP checkpoint thread). The checkpoint thread then creates a socket to the DMTCP coordinator and registers itself. The checkpoint thread also creates a signal handler (SIGUSR2 by default). Control is then returned to the original user thread, which executes its standard startup routines. The DMTCP coordinator can request a checkpoint by sending a message through the socket to the checkpoint thread. The checkpoint thread then sends the checkpoint signal (SIGUSR2) to each of the user threads. (Note that the checkpoint signal is for internal DMTCP use only. It should not be used by non-DMTCP programs.)
          A previous question (see readdmtcp.sh) shows how to view a summary of the contents of a DMTCP checkpoint image.
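      Because the DMTCP library is preloaded into every process of the computation, one way to check whether a running process is under DMTCP is to look for that library in the process's memory map. The sketch below assumes the library file name contains dmtcphijack.so (the name used in this FAQ) or begins with libdmtcp (as libdmtcp-modify-env.so does); the function names are hypothetical.

```shell
# Report whether a process has a DMTCP library mapped into its address space.
# ASSUMPTION: the library name contains "dmtcphijack.so" or starts with "libdmtcp".
# maps_has_dmtcp and pid_is_under_dmtcp are hypothetical helpers.
maps_has_dmtcp() {   # usage: maps_has_dmtcp /proc/PID/maps
  grep -Eq '(dmtcphijack\.so|/libdmtcp[^/]*\.so)' "$1"
}

pid_is_under_dmtcp() {   # usage: pid_is_under_dmtcp PID
  maps_has_dmtcp "/proc/$1/maps"
}

# Typical use (not executed here):
#   pid_is_under_dmtcp "$(pgrep -n a.out)" && echo "running under DMTCP"
```

      The same information is what the GDB command info proc mappings displays from inside a debugging session.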
    2. Some sources of information follow. If you wish to cite DMTCP, please cite the paper by Ansel et al.
      1. The best high-level overview of the design of DMTCP is still in the paper DMTCP: Transparent Checkpointing for Cluster Computations and the Desktop, by Ansel et al. (2009).
      2. For the design of just the single-process checkpointing layer (MTCP), see Transparent User-Level Checkpointing for the Native POSIX Thread Library for Linux, by Rieker et al. (2006).
      3. For a recent overview of the DMTCP architecture, see doc/architecture-of-dmtcp.pdf (available with the source distribution).
      4. For low-level documentation of the implementation of DMTCP, see the doc subdirectory of the source distribution.
    3. Yes. You will have to re-configure and re-make:
      ./configure --enable-logging && make -j5 clean && make -j5
      After this, you will see lots of output sent to stderr on screen. In addition, there will be files in $DMTCP_TMPDIR/dmtcp-$USER@$HOST (where $DMTCP_TMPDIR is $TMPDIR or /tmp). Look at the files jassert_*.log in the given directory. To add low-level debugging (MTCP for single processes), change mtcp/Makefile to uncomment:
      CFLAGS += -O0 -g -DDEBUG -DTIMING -Wall
      NOTE: If you also want to debug under GDB, see debugging under GDB.
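      The location of the log directory follows from the rule quoted above: $DMTCP_TMPDIR if set, else $TMPDIR, else /tmp. A small sketch for computing it is shown here; dmtcp_log_dir and list_jassert_logs are hypothetical helpers, and the host string that DMTCP itself uses may differ from the output of hostname.

```shell
# Compute the directory where a logging-enabled DMTCP build writes its files.
# dmtcp_log_dir and list_jassert_logs are hypothetical helpers, not DMTCP commands.
dmtcp_log_dir() {   # usage: dmtcp_log_dir [USER] [HOST]
  local base="${DMTCP_TMPDIR:-${TMPDIR:-/tmp}}"   # fallback order from the FAQ
  local user="${1:-${USER:-$(id -un)}}"
  local host="${2:-$(hostname)}"                  # ASSUMPTION: matches $HOST
  echo "$base/dmtcp-$user@$host"
}

# List the jassert logs, if any exist yet.
list_jassert_logs() {   # usage: list_jassert_logs [USER] [HOST]
  ls "$(dmtcp_log_dir "$@")"/jassert_*.log 2>/dev/null
}
```

      Passing the user and host explicitly is useful when inspecting logs that were copied over from another machine.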
    4. Normally this should not happen. However, a complex new user plugin might uncover a bug in DMTCP itself. If so, the first thing to look at is a bug in the interaction with an internal DMTCP plugin. Just as browser plugins have a "safe mode", DMTCP can be loaded without some of its internal plugins.
           If one is only testing the initial launch (no checkpoint or restart), one can safely disable all plugins for "safe mode". However, if testing checkpoint or restart, then some (but not all) of the plugins will generally be required for correct operation.
           All plugins can be disabled by launching an application with dmtcp_launch --disable-all-plugins . If one wishes to disable only some of the plugins, then one must modify the source code. The file src/dmtcp_launch.cpp has global variables of the form enableIPCPlugin=true, etc. Try setting some of the internal plugins to false, re-compiling, and testing if the bug goes away. If an interaction with an internal plugin is uncovered, try commenting out some of the wrapper functions in that plugin. Note that it is generally safe to disable the internal plugins when testing only DMTCP launch and resume after writing a checkpoint. However, in testing DMTCP restart (from a checkpoint image file), some of the DMTCP internal plugins may be required for correct operation.
    5. The run-time overhead of DMTCP is usually negligible. When there is no checkpoint or restart in progress, DMTCP code will run only within DMTCP wrappers around certain less frequently used system calls. Examples of such wrappers are wrappers for open(), getpid(), socketpair(), etc. We explicitly do not place a wrapper around read() or write(), since those are frequently called system calls that could produce measurable run-time overhead.
    6. Among the Linux features supported by DMTCP are open file descriptors, pipes, sockets, signal handlers, process id and thread id virtualization (ensure old pids and tids continue to work upon restart), fifos, process group ids, session ids, ptys, terminal attributes, System V shared memory (shmget, etc.), timers, epoll, eventfd, signalfd, and mmap/mprotect (including mmap-based shared memory). Such common O/S daemons as NSCD and LDAP are also transparently supported.
    1. Please see the "Contact Us" links for writing to us. That link includes pointers to the public DMTCP forum, and a private e-mail for private comments. Other possibilities are to add a bug report to the bug tracker, or to write to an individual administrator. We are also always interested in finding others interested in helping develop DMTCP as an open source project.
    2. The origins of DMTCP lie in a project begun in Fall, 2004, and reported on at the CCGrid-06 conference (Transparent Adaptive Library-Based Checkpointing for Master-Worker Style Parallelism). Since then, the work has taken on more ambitious goals. The list of active developers continues to change over time. As of this writing, the sourceforge site lists ten developers/administrators. We are always happy to include new developers.
    3. Please see the Publications page for the standard citation at the top of that page.



II. Support for Specific Types of Software Applications