KeyKOS Principles of Operation - Checkpoint-Restart Facility

Checkpoint-Restart Facility

KeyKOS uses the checkpoint-restart facility to recover from hardware failures, software failures, and scheduled downtime.

From time to time, KeyKOS takes a system-wide checkpoint, which is a record of the state of all pages and nodes. The checkpoint represents the state of the system at a single instant of time.

At any one instant two states of the system are available: the current state and the state at the last checkpoint. In normal operation, checkpoint states are not referred to. The current state is represented, in part, by information in volatile memory (e.g., in machine registers); the checkpointed state is represented only in non-volatile memory.

The kernel takes a checkpoint on two occasions: when the space reserved to hold the checkpoint state is nearly full, and when the Checkpoint Key is invoked.

After a checkpoint has been taken, the pages and nodes that have changed since the previous checkpoint are copied from the checkpoint area to their "home positions" in a process called migration. This work is overlapped with normal execution and takes place under the control of a domain using the External Migrator Tool key. This key must be used properly to ensure that checkpoints can continue to be taken.

If another checkpoint must be taken before migration has finished, normal processing may be delayed.

When KeyKOS recovers from a failure, it restarts from the state at the last checkpoint. With the exception of the journal page and pages that have been saved with the journalize page key, all pages and nodes are restored to their state at the time of the last checkpoint. Domains with TRUE process-running bits will begin running from their checkpointed state.
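The relationship between the current state, the checkpoint area, and the home positions can be illustrated with a toy model. The following Python sketch is not KeyKOS code: the class ToySystem and its method names are hypothetical, pages are modeled as dictionary entries, and migration is shown as a single synchronous step, whereas in KeyKOS it is overlapped with normal execution.

```python
class ToySystem:
    """Toy model of checkpoint, migration, and restart (all names hypothetical)."""

    def __init__(self, pages):
        self.home = dict(pages)      # non-volatile home positions
        self.checkpoint_area = {}    # non-volatile; pages saved by the last checkpoint
        self.current = dict(pages)   # current state (partly volatile in a real system)
        self.dirty = set()           # pages changed since the last checkpoint

    def write(self, page, value):
        self.current[page] = value
        self.dirty.add(page)

    def migrate(self):
        # Copy checkpointed pages to their home positions and free the area.
        self.home.update(self.checkpoint_area)
        self.checkpoint_area = {}

    def checkpoint(self):
        # Simplification: any unfinished migration is completed first,
        # which is where a real system might delay normal processing.
        self.migrate()
        self.checkpoint_area = {p: self.current[p] for p in self.dirty}
        self.dirty.clear()

    def restart(self):
        # The state at the last checkpoint is the home positions overlaid
        # with any checkpointed pages that have not yet been migrated.
        self.current = {**self.home, **self.checkpoint_area}
        self.dirty.clear()


system = ToySystem({"a": 1, "b": 2})
system.write("a", 10)
system.checkpoint()
system.write("b", 20)    # changed after the checkpoint
system.restart()         # the change to "b" is lost; the change to "a" survives
print(system.current)
```

In this model a restart before migration and a restart after migration yield the same state, which is the property that lets migration proceed in the background.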

Of necessity, certain things are not backed up. These include the real time clock, the wait objects accessed with the bwait key, charge sets, devices, and things outside of KeyKOS, such as the memory of users.

Journalizing Facilities

There are some applications where the checkpoint-restart facility is not adequate. A transaction-oriented system will want to accept a transaction to update a data base and at some point give the user a positive acknowledgment that the transaction will be remembered. The application cannot normally afford to wait until the next system-wide checkpoint to give the acknowledgment.

The following argument shows that, in such a system, transactions that change the data base (write transactions) must be idempotent: performing such a transaction twice must have the same effect as performing it once.

Suppose a user submits a write transaction and KeyKOS crashes before it is acknowledged. The user knows KeyKOS has crashed because the connection with the application must be re-established. The user does not know, however, whether KeyKOS has committed to remember the transaction: KeyKOS may have crashed after the transaction was committed but just before the acknowledgment was sent. Therefore, the transaction must be resubmitted when KeyKOS restarts. Because the transaction may already have been committed, it must be idempotent.

Transactions that are not idempotent can usually be made so by simple expedients. For example, "transfer $100 from account A to account B" is not idempotent. But "transfer $100 from account A to account B and associate a unique transaction number, N, with this transaction, unless the number N has been used before" is.
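The transaction-number expedient can be sketched in a few lines of Python. This is a minimal illustration, not part of KeyKOS: the class Ledger, its transfer method, and the in-memory set of used numbers are all assumptions made for the example.

```python
class Ledger:
    """Toy account ledger with idempotent transfers (hypothetical names)."""

    def __init__(self, balances):
        self.balances = dict(balances)
        self.used_numbers = set()    # transaction numbers already applied

    def transfer(self, n, src, dst, amount):
        """Apply the transfer unless transaction number n has been used before.

        Returns True if the transfer was applied, False if it was a repeat.
        """
        if n in self.used_numbers:
            return False
        self.balances[src] -= amount
        self.balances[dst] += amount
        self.used_numbers.add(n)
        return True


ledger = Ledger({"A": 500, "B": 0})
ledger.transfer(1, "A", "B", 100)    # applied
ledger.transfer(1, "A", "B", 100)    # resubmission after a crash: no effect
print(ledger.balances)
```

For the scheme to work across a restart, the record of used numbers would have to be recovered together with the balances; the sketch keeps both in the same object to suggest this.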

See also:

Gray, J., Notes on Operating Systems. Report RJ 3120, IBM Res. Ctr., San Jose, CA, Oct. 1978. A Definitive Report of Locking and Recovery in a Database System
Lampson, B. and Sturgis, H., Crash Recovery in a Distributed System, Xerox Res. Ctr., Palo Alto, CA, 1976 (working paper)

Transaction applications will have to store a record of recent transactions in some non-volatile storage. That record is called a journal and the procedure is called journalizing.

It is significant that only data, not keys, are stored.

To support journalizing, KeyKOS provides a special journal page that only the kernel can write. At restart, the kernel sets the values in this page before any processes run.

Figure 2-27

Here is an example of journalizing, written in Algol68.

SEMA mutex = LEVEL 1;
INT local restart count := 0;
REF INT restart count = locations 16 to 23 in journal page;
FLEX [0:] TRANSACTION nonvolatile storage;
INT next serial number;
PROC process transaction = (TRANSACTION transaction) ACKNOWLEDGEMENT:
BEGIN
     DOWN mutex; # only one process at a time here #
     WHILE local restart count < restart count
     DO # replay transactions since the restart #
          local restart count := restart count;
          WHILE UPB nonvolatile storage >= next serial number
          DO update database(nonvolatile storage[next serial number]);
             next serial number +:= 1
          OD
     OD;
     IF modifies database(transaction)
     THEN nonvolatile storage[next serial number] := transaction;
          # the above unit may overwrite an entry in nonvolatile
            storage with identical data #
             update database(transaction);
             next serial number +:= 1
     FI;
     ACKNOWLEDGEMENT a = read from database(transaction);
     UP mutex;
     a
END

# Program notes:

Both read and write transactions must go through this procedure. An ACKNOWLEDGEMENT includes any data read. The acknowledgment returned will be sent to the user through the network: if the network circuit connection is broken, the acknowledgment should be discarded (it may be incorrect). #
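For readers unfamiliar with Algol68, the same control flow can be sketched in Python. This is an illustrative model under stated assumptions, not KeyKOS code: the class JournaledApp and all of its names are hypothetical, a Python list stands in for the journalized non-volatile storage, an integer attribute stands in for the restart count in the journal page, and a crash is simulated by restoring a saved copy of the checkpointed state. Transactions are modeled as (key, value) pairs, so applying one twice is harmless.

```python
import copy
import threading


class JournaledApp:
    """Toy model of the journalizing procedure (all names hypothetical)."""

    def __init__(self):
        # These survive a restart unchanged: stand-ins for the journal
        # page field and for pages saved by journalizing.
        self.restart_count = 0
        self.journal = []
        # These revert to their checkpointed values on restart.
        self.state = {"db": {}, "next_serial": 0, "local_restart_count": 0}
        self._snapshot = copy.deepcopy(self.state)
        self.mutex = threading.Lock()

    def checkpoint(self):
        self._snapshot = copy.deepcopy(self.state)

    def crash_and_restart(self):
        self.state = copy.deepcopy(self._snapshot)
        self.restart_count += 1      # in the real system the kernel does this

    def _apply(self, txn):
        key, value = txn
        self.state["db"][key] = value    # idempotent update

    def process(self, txn, modifies=True):
        with self.mutex:                 # only one caller at a time here
            s = self.state
            while s["local_restart_count"] < self.restart_count:
                # Replay journal entries made after the last checkpoint.
                s["local_restart_count"] = self.restart_count
                while s["next_serial"] < len(self.journal):
                    self._apply(self.journal[s["next_serial"]])
                    s["next_serial"] += 1
            if modifies:
                n = s["next_serial"]
                if n < len(self.journal):
                    self.journal[n] = txn     # overwrite with identical data
                else:
                    self.journal.append(txn)
                self._apply(txn)
                s["next_serial"] += 1
            return s["db"].get(txn[0])        # the acknowledgment: data read


app = JournaledApp()
app.process(("x", 1))
app.checkpoint()
app.process(("y", 2))        # acknowledged after the checkpoint
app.crash_and_restart()      # the data base reverts; the journal does not
print(app.process(("y", None), modifies=False))
```

The final read finds the update to "y" even though it was made after the last checkpoint, because the first call after the restart replays the journal before doing anything else.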