The Keykos checkpoint mechanism is described in a paper by Charlie Landau. That paper briefly describes the ability to record a checkpoint to magnetic tape. I explore here a scheme, a simple extension of the tape checkpoint scheme, that produces a checkpoint stream over a communications link to another site. It has not been implemented. This note assumes familiarity with the current checkpoint logic, which is well described in the paper.

While the Keykos system is running, it is instructed to establish a clone of itself without greatly impacting its normal workload. To do this, a communications link is provided to a remote system that is able to serve as a standby for the first machine, or at least for its mission-critical load. The plan is to produce a complete checkpoint on the remote machine. That checkpoint will be identical to one of the locally produced checkpoints, but which one is not determined until near the end of the process.

The first phase is to send the versions of the pages and nodes found in their home locations. We omit pages and nodes with unmigrated images in the checkpoint area, because transmitting their home versions is sure to be wasted effort: the newer images will replace them. As pages or nodes are migrated they are inserted immediately into the outgoing transmission stream. If this process finishes, then a complete checkpoint is available at the remote machine. If the checkpoint areas exchange roles during the transmission, however, there are no more unmigrated pages from the old area (by nature of the local checkpoint logic) and the pages in the new checkpoint area take their place. As those are migrated in turn, new versions of pages and nodes that were already sent will be retransmitted.
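The bookkeeping on the sending side can be sketched as a toy model. This is only an illustration under stated assumptions, not the Keykos implementation (the scheme has not been implemented): the class `CheckpointStreamer`, its method names, and the use of integer ids and string versions for pages and nodes are all hypothetical.

```python
from collections import deque


class CheckpointStreamer:
    """Toy model of the sending side; every name here is hypothetical."""

    def __init__(self, home_versions, unmigrated):
        # home_versions: object id -> version now stored in its home location
        # unmigrated: ids whose checkpointed image is still in the checkpoint area
        self.home = dict(home_versions)
        self.unmigrated = set(unmigrated)
        self.sent = {}  # object id -> version the remote site holds
        # First phase: queue home copies, skipping those about to be superseded.
        self.queue = deque(i for i in sorted(self.home) if i not in self.unmigrated)

    def migrate(self, obj_id, version):
        """Local migration wrote an image to its home location; send it at once."""
        self.home[obj_id] = version
        self.unmigrated.discard(obj_id)
        self.queue.appendleft(obj_id)

    def swap_areas(self, newly_checkpointed):
        """A new local checkpoint was taken: the checkpoint areas exchange roles."""
        assert not self.unmigrated, "old area must be fully migrated before a swap"
        self.unmigrated = set(newly_checkpointed)
        # Home copies of these objects are now stale, so drop them from the queue.
        self.queue = deque(i for i in self.queue if i not in self.unmigrated)

    def send_one(self):
        """Transmit one queued object; return its id, or None if nothing is due."""
        while self.queue:
            obj_id = self.queue.popleft()
            if self.sent.get(obj_id) != self.home[obj_id]:
                self.sent[obj_id] = self.home[obj_id]
                return obj_id
        return None

    def complete(self):
        """Conservative test that the remote site holds one whole checkpoint."""
        return not self.unmigrated and self.sent == self.home
```

In this model an object that was sent before a swap and then appears in the new checkpoint area is sent a second time when it is migrated, which is the retransmission described above. The `complete` test is deliberately conservative: it reports success only when nothing remains unmigrated and the remote copy matches every home location.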

It may not be clear that this will ever produce a complete checkpoint on the other machine. All that I will say just now is that it can be made to complete by throttling the production of dirty pages on the original machine, at some cost in throughput, and by getting a faster channel. That the fastest channels are rather faster than the fastest disks indicates that this is possible. Some applications may dirty pages at a high rate and require a faster channel to make this work.
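A back-of-the-envelope way to see the trade-off is a crude fluid model, which is my assumption and not part of the checkpoint logic: treat the untransmitted backlog as draining at the channel rate while being refilled at the rate at which new versions need resending. The function name, its parameters, and the example figures below are hypothetical.

```python
def seconds_to_complete(backlog_pages, channel_pages_per_s, resend_pages_per_s):
    """Crude fluid model: the backlog drains at (channel rate - resend rate)."""
    drain = channel_pages_per_s - resend_pages_per_s
    if drain <= 0:
        # The backlog never empties: throttle harder or get a faster channel.
        return None
    return backlog_pages / drain


print(seconds_to_complete(1_000_000, 5_000, 1_000))  # 250.0
print(seconds_to_complete(1_000_000, 5_000, 6_000))  # None
```

Under this model the transfer completes only when the channel rate exceeds the resend rate. Throttling the production of dirty pages lowers the second figure and a faster channel raises the first. Since a page dirtied many times between two checkpoints is resent only once, the raw dirtying rate is an upper bound on the resend rate.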