*** Working Draft ***
Truly-Transparent Checkpointing of Parallel Applications

Eduardo Pinheiro
Federal University of Rio de Janeiro UFRJ
edpin@cos.ufrj.br

Abstract

Checkpointing is a technique used for many purposes, including, but not limited to, fault tolerance, transaction rollback and process migration. Many tools have been proposed in the past to help solve these problems. In the field of migration, however, there is still a gap, because either: (a) the tool is limited to a particular parallel programming library (PVM, MPI), (b) the checkpoint image is too big to be worth migrating, (c) checkpointing is limited to certain well-behaved applications or (d) the checkpoint image must be saved to a file or sent to a centralized server instead of going directly to the target machine's memory. We developed a new tool, called Epckpt, that fills this gap in process migration. Epckpt can: (a) checkpoint almost all kinds of applications, independent of their behavior, (b) reduce the size of the application's image to its minimum, (c) checkpoint fork-based parallel applications, (d) checkpoint an application that was never meant to be checkpointed (neither recompiled nor relinked with any special library) and (e) send the checkpoint image directly to the target machine instead of to a server or a file. Our checkpointing tool is implemented inside the Linux kernel. Results show that the maximum checkpoint overhead in an application's running time is up to 9%, including the cost of restarting it. Migration results show that a reduction of up to 77% in total running time can be achieved under heavy load.

1. Introduction

We have developed a checkpointing tool (Epckpt) aimed at job scheduling. This tool improves on existing tools in several respects and introduces new implementation techniques, described in the points that follow.

     Because we have implemented Epckpt inside the Linux kernel, we are able to checkpoint a broad range of applications. We can checkpoint applications that call fork(), exec() and mmap(), and applications that use pipes and files. We are still working on checkpointing shared memory and semaphores, but we expect to have it done in a few weeks. Having access to kernel-level information makes checkpointing easier and faster. Checkpointing tools that sit outside the operating system kernel cannot handle all these cases well, and in practice only a few of them are handled.

     By not saving shared libraries, dynamically linked libraries or the application's code segment, we can reduce the checkpoint image to its minimum. In the real-life applications we studied, this reduced the checkpoint size by 58% on average (see Table 1). In all cases we found the reduction to be significant.

     Because we operate at the operating system level, applications need neither a special library linked in nor any source code modifications. This makes just about any long-running application a candidate for migration, which is important for job scheduling algorithms, since it gives them a broader range of applications over which to make migration decisions.

     None of the freely distributable checkpointing tools available [1, 2, 3, 4] can checkpoint fork-based parallel applications, so this is a novel capability of Epckpt. Our tool is able to checkpoint a group of processes running in parallel on an SMP machine.

     And finally, our tool can save the checkpoint image to any file descriptor abstraction, be it a regular file, a pipe or a socket connected to a remote machine. This eliminates disk I/O and duplicate network transfers (in systems using NFS, for instance, a file is sent to a file server through the network, saved to disk and, when needed on another machine, sent over the network again, wasting one network transfer and some useless disk writes).

     In Section 2 we compare our work to related tools. In Section 3 we explain its interface with the user or a higher-level system. Section 4 shows our first practical results. Finally, in Section 5 we draw some conclusions and address future work.

2. Related Work

     Condor [1] requires programs to be relinked with a special library. It does not offer the option of excluding the common shared libraries and dynamically linked libraries from the checkpoint. It only works for a special, well-behaved kind of process: one that does not call fork() or exec() and does not communicate via pipes or files. Because it does not allow processes to fork, Condor cannot checkpoint a group of parallel processes.

     Libckpt [2] is much like Condor and shares many of its limitations, such as not allowing processes to fork or exec. It also uses only regular files to save process images. To be most effective, Libckpt requires the programmer to modify the program's source code to give the library hints on when to checkpoint. This makes it intrusive and requires the programmer to know the heuristics of where to place these calls; it also requires recompilation of the source code.

     Libckpt implements incremental checkpointing, a technique most valuable for fault-tolerant applications, which is not our major concern. Incremental checkpointing lowers the overhead of checkpointing but is of no use for migration, since a migrating task has to be checkpointed in full at the exact moment of migration. As far as we know, Libckpt does not allow the checkpoint image to be saved to other file abstractions such as a socket or a pipe, requiring regular files, although we suppose this would be an easy modification to implement.

     CoCheck [3] uses the same protocol as Condor and thus inherits its limitations. CoCheck checkpoints parallel processes running on multicomputers. It uses centralized servers to receive the checkpoints, and this is a major concern for them. Epckpt, instead, can send checkpoints directly to the target machine, eliminating the need to allocate servers for this task and minimizing network traffic (checkpoints do not have to travel to the server and then again over the network to the target machine, thus reducing migration latency). Our approach also differs in that we checkpoint parallel processes running on a multiprocessor, not on a multicomputer as CoCheck does.

3. How It Works

Three new system calls were created in Linux. They are

checkpoint (int pid, int fd, int flags)
restart (char *ckpt_filename)
collect_data (int pid)

The system call checkpoint() does the job of saving the image of the process identified by pid to the already open file descriptor fd. This file descriptor need not be a regular file; it can be a socket connection, a pipe or any other file abstraction that modern file systems support. This makes it easier and faster to migrate a process to another machine without relying on the performance of a distributed file system such as NFS: the checkpoint image does not have to be saved to a file before being read on another machine, avoiding I/O transfer time and eliminating the need to use the network twice (once to send the image to the file server and once to read it back on the other machine). The flags argument controls the behavior of the checkpoint. Currently four options are available:

  1. CKPT_CHECK_ALL
  2. CKPT_KILL
  3. CKPT_NO_SHARED_LIBRARIES
  4. CKPT_NO_BINARY_FILE
The first option checkpoints all descendents of process pid. A descendent process is one created after process pid issued a fork() system call. On an SMP machine, when a process calls fork(), a new image of it can be created on a different processor of the same multiprocessor. This option makes it possible to checkpoint all processes in a group so that they can later be restarted on a possibly different multiprocessor.

The second option is used when the process (or processes) being checkpointed is to be stopped immediately after the checkpoint has taken place. This is useful when migration will follow: the process does no further computation after the checkpoint, so the image is in a safe state. When this option is given, the process dies after the checkpoint completes and no longer exists. Without it, the process keeps running after the checkpoint is saved, which is useful for fault-tolerant computation (periodically saving the process image to stable storage so that, after a failure, it can be restarted where it left off).

The third option (no shared libraries) is used when it is guaranteed that the restarted process will find the same shared libraries and dynamically loaded libraries on the target machine. This option saves a lot of space in the checkpoint image. It must be used carefully: even if the shared libraries are nominally the same, they must be the same version, otherwise it will not be possible to restart the saved process.

The fourth option is similar to the previous one. It tells the kernel not to save the process's code segment in the checkpoint image. Since the code is read-only and can be retrieved from the binary image of the executable, it does not need to be saved. These two options (no shared libraries and no code) are kept separate because shared libraries usually live in the machine's local file system while the binary usually lives in the user's distributed file system, so there may be a benefit in avoiding one access to the file server at the expense of adding more data to the checkpoint image. An illustrative invocation of checkpoint() with these flags is sketched below.
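
As an illustration, the sketch below shows how a user-level tool might drive checkpoint() to dump a whole process group straight into a socket connected to the target machine. The syscall number, the numeric flag values, and the target address and port are all placeholders invented for this example; the real values come with the Epckpt kernel patch.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/* Hypothetical flag values -- the real ones are defined by the kernel patch. */
#define CKPT_CHECK_ALL            0x01  /* include all descendents of pid   */
#define CKPT_KILL                 0x02  /* stop the process after the dump  */
#define CKPT_NO_SHARED_LIBRARIES  0x04  /* omit shared libraries            */
#define CKPT_NO_BINARY_FILE       0x08  /* omit the read-only code segment  */

#define __NR_checkpoint 0               /* placeholder syscall number */

static int checkpoint(int pid, int fd, int flags)
{
    return syscall(__NR_checkpoint, pid, fd, flags);
}

int main(int argc, char **argv)
{
    struct sockaddr_in target;
    int pid, sock;

    if (argc != 2) {
        fprintf(stderr, "usage: ckpt-to-socket <pid>\n");
        return 1;
    }
    pid = atoi(argv[1]);

    /* Connect directly to a receiver on the target machine
     * (address and port are arbitrary for this example). */
    sock = socket(AF_INET, SOCK_STREAM, 0);
    memset(&target, 0, sizeof(target));
    target.sin_family = AF_INET;
    target.sin_port = htons(7000);
    inet_aton("10.0.0.2", &target.sin_addr);
    if (connect(sock, (struct sockaddr *)&target, sizeof(target)) < 0) {
        perror("connect");
        return 1;
    }

    /* Dump pid and all its descendents into the socket, stop them
     * afterwards, and assume the target has the same shared libraries. */
    if (checkpoint(pid, sock, CKPT_CHECK_ALL | CKPT_KILL |
                              CKPT_NO_SHARED_LIBRARIES) < 0) {
        perror("checkpoint");
        return 1;
    }
    return 0;
}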

     The system call restart() is responsible for restarting one process at a time after it has been checkpointed. Originally the checkpoint file was intended to have the same format as an ordinary Linux executable (ELF), but because we need more information than an ELF file provides, we created our own format. We could have registered this format in Linux's list of executable types, but since additional steps must be taken before restarting a process, we decided it was better not to make restarting automatic. All these issues will be discussed in the next section.
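
Although the exact on-disk layout of this format is not given here, the description of the split tool below suggests a header recording where each process's sub-image begins and ends. The struct below is a purely hypothetical illustration of such a layout, not the actual Epckpt format:

/* Hypothetical layout only -- the real Epckpt format is not published here. */
struct ckpt_proc_entry {
    int           pid;      /* original process id                      */
    unsigned long offset;   /* byte offset of this process's sub-image  */
    unsigned long length;   /* size of the sub-image in bytes           */
};

struct ckpt_header {
    unsigned long          magic;      /* identifies an Epckpt image       */
    int                    flags;      /* flags used at checkpoint time    */
    int                    num_procs;  /* > 1 when CKPT_CHECK_ALL was used */
    struct ckpt_proc_entry proc[];     /* one entry per dumped process     */
};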

     Finally, the system call collect_data() tells the kernel that the given process (pid) should have certain information recorded at run time so that it can be checkpointed later. The kernel needs this in order to record file names and shared library names during the execution of a process; if the process is ever checkpointed, this information is saved with the checkpoint image. We could have had the kernel record this data for every running process, but it is wasteful to collect it for processes that will never migrate (such as short-running processes, daemons and other special processes). Besides the wasted space, it would take time to record the additional information for every running process. To minimize this cost, the kernel keeps checkpointing information only if the process or one of its ancestors called collect_data().

     Processes need not call these functions explicitly (although they can). We developed some tools that do this job. The first tool is the program spawn. It simply takes the file name of an application to run, calls collect_data(), forks off a child, and the child calls exec() to run the application. This tool tells the kernel that the application the user is starting may be checkpointed in the future, so the kernel needs to collect the relevant data (file names and library names). A minimal sketch of spawn follows.
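
The sketch again reaches collect_data() through a placeholder syscall number. The only assumption beyond the text above is that the collection mark set by collect_data() is inherited across fork(), which is what the ancestor rule described earlier implies.

#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>

#define __NR_collect_data 0             /* placeholder syscall number */

static int collect_data(int pid)
{
    return syscall(__NR_collect_data, pid);
}

int main(int argc, char **argv)
{
    pid_t child;

    if (argc < 2) {
        fprintf(stderr, "usage: spawn <program> [args...]\n");
        return 1;
    }

    /* Mark this process as a checkpoint candidate; the child created
     * below inherits the mark because its ancestor called collect_data(). */
    if (collect_data(getpid()) < 0) {
        perror("collect_data");
        return 1;
    }

    child = fork();
    if (child == 0) {
        execvp(argv[1], &argv[1]);      /* child becomes the application */
        perror("execvp");
        return 1;
    }
    waitpid(child, NULL, 0);            /* parent simply waits */
    return 0;
}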

     The second tool is checkpoint. Given a process id and some flags, it calls the checkpoint function in the kernel. This tool, too, is just an interface between the user (or a higher-level application) and the operating system.

     Its counterpart is the restart tool. It takes the checkpoint image of one process and calls the kernel function restart() to put the checkpointed process back into execution.

      The remaining two tools, split and mrestart, are more elaborate. When a group of parallel processes is checkpointed, they are all dumped into a single image. Before it can be restarted, this image must be split into individual process images; split is responsible for doing this. It examines the file's header, finds the end of each process and splits them into separate images. It also extracts extra information that must be restored prior to the call to the system function restart(). The other tool, mrestart (for multiple restart), is used after split has done its job and is responsible for restarting a group of processes. It takes all the images produced by split and starts them up by calling restart() multiple times. Since restart() does not return on success (it overwrites the image of the process that called it), mrestart has to fork off as many new processes as there are split checkpoints to be restarted. Since forked children run asynchronously, we used semaphores to synchronize them before calling restart(), so that all processes in a group restart together; the sketch below illustrates this structure.
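
The following sketch shows the fork-plus-semaphore structure of mrestart under the same placeholder-syscall assumption. A System V semaphore, initialized to zero, holds every child at a barrier; the parent then raises it by the number of children so that all of them call restart() at roughly the same time.

#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/wait.h>

#define __NR_restart 0                  /* placeholder syscall number */

static int restart(const char *ckpt_filename)
{
    return syscall(__NR_restart, ckpt_filename);
}

union semun { int val; };               /* Linux requires the caller to define this */

int main(int argc, char **argv)
{
    struct sembuf wait_op = { 0, -1, 0 };   /* P(): blocks while the sem is 0 */
    struct sembuf release;
    union semun zero = { 0 };
    int i, sem;

    sem = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
    semctl(sem, 0, SETVAL, zero);           /* start the barrier closed */

    for (i = 1; i < argc; i++) {            /* one split image per child */
        if (fork() == 0) {
            semop(sem, &wait_op, 1);        /* wait until all are forked  */
            restart(argv[i]);               /* does not return on success */
            perror("restart");
            _exit(1);
        }
    }

    /* Open the barrier for all children at once, then reap them. */
    release.sem_num = 0;
    release.sem_op  = (short)(argc - 1);
    release.sem_flg = 0;
    semop(sem, &release, 1);
    while (wait(NULL) > 0)
        ;
    semctl(sem, 0, IPC_RMID);
    return 0;
}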

4. Results

We have tested each step of our implementation with artificial programs written specifically to suit our tests. As we added new features to the checkpoint (support for pipes, files, shared libraries, parallel processes, etc.), we tested each with a new simple application. All tests have been successful so far. To be sure the tool would also work in real-life situations, we tested it again using three common applications: zip, li and fpppp.

     Besides these three sequential applications, we used an artificial parallel application of our own called art6. This application forks off two new processes that communicate via a pipe; one writes to a file and the other reads from and writes to a memory-mapped file. The parallel processes were run on a multitasking uniprocessor.

     Our tests were divided into two parts. Part I measures the overhead introduced by Epckpt, which takes two forms: the collection of data during some system calls, and the checkpoint time itself. Part II is a simple simulation of the performance gain when a process migrates from a heavily loaded machine to a lightly loaded one.

4.1. Part I

Every test was run on a standalone machine with the minimum load possible, to avoid any interference from the network. All tests were repeated at least three times and the average was taken, which minimizes discrepancies caused by random or non-deterministic events.

First we show that merely collecting checkpoint data, without actually dumping a checkpoint, does not noticeably slow down an application. Collecting checkpoint data simply means storing useful information in the kernel (as described above).

Figure 1 shows that zip slowed down by less than 1 second over a total computing time of 46 seconds, about 0.6% of execution time. The li application also slowed down by less than 1%. Fpppp actually sped up by 2%, which we attribute to some uncontrolled event. Art6 slowed down by about 0.2%. In all cases the collection of checkpoint data caused an acceptable amount of overhead (below 1%).

Figure 2 shows the total running time when each application was checkpointed once during its execution and immediately restarted. This time does not include any migration time: the applications were checkpointed to a file and then restarted automatically. The total time is split into three categories: running time before the checkpoint, time to checkpoint, and running time after restarting the application. The time consumed by collecting data is counted as computing time, since it is difficult to isolate and represents less than 1% of total computing time. The time taken by the split tool is counted as checkpoint time. As seen, the checkpoint time for the zip application is around 9% of the total running time, while for li it is below 0.1%, for fpppp less than 3% and for art6 less than 1% of the total computing time.


Figure 1: Total running time without collecting data for the checkpoint and total running time with data collection.



Tables 1 and 2 summarize the Part I experiments. Table 1 shows the improvement of our checkpoint image size over the size the image would have if all libraries and the code segment were dumped. The non-optimized size is the amount of data that other checkpointing tools would have to handle. Our optimizations show a reduction of up to 68% in the checkpoint size, with an average of around 58%. This reduction helps minimize the network burst during a migration, thus reducing migration overhead. Table 2 shows the total checkpoint overhead in execution time caused by our tool, not yet accounting for migration time. The maximum overhead measured was 11.4%, for the art6 application.


Figure 2: Total running time with the checkpoint overhead. The checkpoint was taken after ten seconds for the zip and fpppp applications, after five seconds for li and after two seconds for art6. The time to migrate the application is not included.



Application   Non-optimized Checkpoint Size (bytes)   Optimized Checkpoint Size (bytes)   Reduction
Zip           1,228,986                               598,202                             51%
Li              979,060                               323,700                             67%
Fpppp         1,474,632                               753,736                             48%
Art6          1,712,445                               536,893                             68%

Table 1: The optimized checkpoint size is the minimum amount of data necessary for the checkpoint. The non-optimized checkpoint size is the regular size, including libraries and code.



Application   Total Running Time without Checkpoint (seconds)   Total Running Time with 1 Checkpoint (seconds)
Zip           46.43                                              50.69
Li            11.02                                              11.66
Fpppp         42.71                                              41.78
Art6          12.03                                              13.40

Table 2: Total running time with one checkpoint (optimized checkpoint size) taken during the run, versus total running time with neither data collection nor checkpointing.



4.2. Part II

We simulated a situation in which a workstation has too many CPU-consuming applications running. When this situation arises, a hypothetical job scheduling daemon could trigger a checkpoint and then migrate the entire application to a more lightly loaded workstation. We assumed it takes the scheduler 15 seconds to make the decision to move the application. The simulation of heavy load was simple: four identical processes were launched, each an infinite mathematical computation loop. This is not a realistic application, but it behaves just like a number-crunching application such as fpppp and many other scientific codes. We are not concerned here with how the scheduler finds out about the heavy load or how it decides which process (or group of processes) should be moved.
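
For concreteness, one instance of the synthetic load could be as simple as the loop below. The exact expression used in our runs is not specified in the text; any CPU-bound floating-point loop behaves the same way.

/* One instance of the synthetic load: an endless CPU-bound floating-point
 * loop. Four copies of this were launched to load the workstation. The
 * particular arithmetic is unimportant; it only has to keep the CPU busy.
 * Build with: cc -O2 load.c -lm */
#include <math.h>

int main(void)
{
    volatile double x = 1.0;            /* volatile keeps the loop alive */
    for (;;)
        x = sin(x) * cos(x) + sqrt(x * x + 1.0);
    return 0;
}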

Application   Total Time in Heavily Loaded Machine (seconds)   Total Time with Migration after 15 Seconds (seconds)   Reduction
Zip           260.07                                            58.29                                                 77.5%
Li             55.20                                            30.03                                                 45.5%
Fpppp         206.69                                            55.39                                                 78.7%
Art6           14.00                                            ---                                                   ---

Table 3: Performance gain after migration from a heavily loaded workstation to a lightly loaded workstation.



Table 3 shows our results. The second column gives the total running time when the application was left to run on the heavily loaded workstation. The third column gives the total time spent when the application was moved to a lightly loaded workstation after spending 15 seconds on the heavily loaded machine; this total includes checkpoint overhead, migration time and restart time. We observed a significant reduction in total running time after migration compared to running on the heavily loaded workstation for the whole computation. The application art6 did not suffer much from the heavy load because it is communication-bound and blocks frequently, so when it needs the CPU its priority is high enough that it always gets its share. Because art6 did not run longer than 15 seconds, it was not migrated. The average reduction, including the worst case (art6), is about 50% of the total running time.

5. Conclusions and Future Work

We have demonstrated our tool's potential for use in conjunction with an efficient scheduling algorithm. Our tool is also of great help to those seeking to checkpoint parallel applications on an SMP machine: most of the previous work is either not available or is a sequential checkpointing tool with a distributed checkpoint algorithm implemented over a communication library, with the limitations discussed earlier. By placing the checkpoint's main routine inside the Linux kernel we were able to optimize the amount of data saved in the checkpoint and to include information that other checkpointing tools leave out, making the tool valuable for most Unix applications. On the other hand, working inside the OS kernel narrows the applicability of our tool, since porting it to other Unix kernels would be difficult. Linux nowadays supports a wide range of platforms and is becoming more popular in the Unix community, so this may not be a great disadvantage. The uses for our tool are many, but it is intended mainly for dynamic load balancing on clusters of multiprocessors. Since scheduling within clusters of multiprocessors has been studied before [4, 5], we propose a tool that facilitates the study of scheduling across clusters of multiprocessors. We are working toward realistic job scheduling strategies to further optimize the performance of both sequential and fork-based parallel applications.

References

[1] M. Litzkow, T. Tannenbaum, J. Basney, M. Livny. Checkpoint and Migration of UNIX Processes in the Condor Distributed Processing System. University of Wisconsin-Madison.

[2] J. Plank, M. Beck, G. Kingsley, K. Li. Libckpt: Transparent Checkpointing under Unix. University of Tennessee.

[3] J. Pruyne, M. Livny. Managing Checkpoints for Parallel Programs. University of Wisconsin-Madison.

[4] M. Squillante, R. Nelson. Analysis of Task Migration in Shared Memory Multiprocessor Scheduling. IBM Research Division.

[5] R. Chandra, S. Devine, B. Verghese, A. Gupta, M. Rosenblum. Scheduling and Page Migration for Multiprocessor Compute Servers. Stanford University.