Comparing Checkpoint Mechanisms

See Charlie Landau’s “The Checkpoint Mechanism in KeyKOS” for an extensive description of how Keykos takes checkpoints. In this note I only want to contrast the Keykos mechanism with what is available in certain other systems such as Windows or Unix. See “Design Evolution of the EROS Single-Level Store” for a detailed description of a later design for EROS.

I have seen systems running Windows and, I think, Unix that were able to write RAM to disk in anticipation of an orderly shutdown. This seems to be a useful feature and I want to contrast it with the Keykos system. I assume that the data that is written to disk is merely a bit-for-bit copy of what is in RAM. I would welcome any correction here. This is a simple addition to almost any operating system.

In Keykos, RAM is merely a cache for pages and nodes, each of which has an assigned disk location. The checkpoint merely assures that a consistent snapshot of the entirety of pages and nodes will always be available, even in the event of any common error condition such as power failure. Keykos holds recently used pages and nodes in RAM and also keeps information derived from them in RAM to make its operating system functions efficient. Whenever this derived information is mutable, the pages or nodes are updated either synchronously, or when a checkpoint is taken, or when the page or node must be swapped out.
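The idea in the paragraph above can be illustrated with a toy model. This is only a sketch under my own assumptions, not the Keykos implementation: the class `ToyStore` and its methods are hypothetical names, Python dictionaries stand in for disk and RAM, and the real mechanism (described in Landau's paper) is far more careful about how the snapshot is written. The point shown is only that a committed, consistent snapshot exists at every moment, and that a crash loses nothing but the work done since the last checkpoint.

```python
class ToyStore:
    """Toy model: RAM as a write-back cache over objects with disk homes.

    Hypothetical illustration only; not the Keykos data structures.
    """

    def __init__(self):
        self.disk = {}      # last committed checkpoint: object id -> contents
        self.ram = {}       # cached copies of recently used objects
        self.dirty = set()  # ids modified since the last checkpoint

    def read(self, oid):
        if oid not in self.ram:
            # Cache miss: fetch the object from its disk home.
            self.ram[oid] = self.disk.get(oid, b"")
        return self.ram[oid]

    def write(self, oid, data):
        # Writes touch only the cached copy until the next checkpoint.
        self.ram[oid] = data
        self.dirty.add(oid)

    def checkpoint(self):
        # Build the new snapshot off to the side, then switch to it in one
        # step, so the old snapshot stays valid until the new one is whole.
        new_disk = dict(self.disk)
        for oid in self.dirty:
            new_disk[oid] = self.ram[oid]
        self.disk = new_disk  # the "commit"
        self.dirty.clear()

    def crash_and_restart(self):
        # Power failure: RAM contents vanish, the last checkpoint survives.
        self.ram.clear()
        self.dirty.clear()
```

For example, writing an object, taking a checkpoint, writing it again and then calling `crash_and_restart()` leaves the object with the contents it had at the checkpoint, together with every other object as of that same instant.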

A fundamental difference is that Windows or Unix may sometimes be unable to fulfill some request by an application because of exhaustion of RAM. This follows from the fact that certain kinds of system state may be represented only in data structures that are stored only in real RAM. Schemes to keep such data in virtual memory that is available to the kernel are particularly prone to real-time failures, where the kernel must take a page fault while serving application A due to RAM pressure caused by application B. Such page faults are liable to interfere with the overlapping of computation with disk I/O.

Keykos took much inspiration from processor design, including the design of caches. I have never heard of a hardware cache design that would cause faults due to insufficient cache space. Keykos maintained that property partly by borrowing and extending various hardware cache ideas. I suspect that the Windows checkpoint mechanism does not mitigate hard failures upon RAM exhaustion.
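The cache property mentioned above can be sketched as follows. This is a minimal illustration under my own assumptions, not Keykos code: `WriteBackCache` is a hypothetical name, a dictionary stands in for the disk, and least-recently-used replacement is chosen only for simplicity. What it shows is that a cache in which every entry has a home in backing storage never refuses a request for lack of space; when full, it writes back and evicts an entry instead.

```python
from collections import OrderedDict


class WriteBackCache:
    """Fixed-size cache that makes room by eviction instead of failing.

    Hypothetical illustration; capacity must be at least 1.
    """

    def __init__(self, capacity, backing):
        self.capacity = capacity
        self.backing = backing      # dict standing in for disk
        self.lines = OrderedDict()  # key -> (value, dirty), oldest first

    def _make_room(self):
        # Evict least recently used entries until one slot is free.
        while len(self.lines) >= self.capacity:
            key, (value, dirty) = self.lines.popitem(last=False)
            if dirty:
                self.backing[key] = value  # write back before dropping

    def get(self, key):
        if key in self.lines:
            self.lines.move_to_end(key)  # mark as recently used
            return self.lines[key][0]
        self._make_room()
        value = self.backing[key]        # miss: load from the disk home
        self.lines[key] = (value, False)
        return value

    def put(self, key, value):
        if key in self.lines:
            self.lines.pop(key)          # re-insert below as most recent
        else:
            self._make_room()
        self.lines[key] = (value, True)
```

Storing far more entries than the capacity raises no error here; the cost of the pressure shows up as extra traffic to the backing store, not as a failed request.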