William S. Frantz
Periwinkle Computer Consulting
16345 Englewood Ave.
Los Gatos, CA USA 95032
frantz@netcom.com
(408) 356-8506
Charles R. Landau
Tandem Computers Inc.
19333 Vallco Pkwy, Loc 3-22
Cupertino, CA USA 95014
landau_charles@tandem.com
(408) 285-6148
Copyright © 1993 USENIX Association. Permission has been granted by the USENIX Association to reprint this paper. It was originally published in the Proceedings of the USENIX Symposium on Micro-Kernels and Other Kernel Architectures, September 1993.
Three major directions in computer technology are transaction processing, object orientation, and microkernel operating systems. The KeyKOS operating system and the KeyTXF transaction processing system combine all three of these technologies. The design of KeyKOS directly provides operating system level objects on a microkernel base. In order to maintain the integrity of these objects, KeyKOS takes periodic checkpoints of the entire system. In addition, KeyKOS provides facilities for transaction processing which achieve very high transaction rates. Object oriented technology facilitates construction and reuse of transaction applications. This paper describes how these ideas are combined in the KeyKOS system.
This paper examines the structure of an application environment that combines three technologies: transaction processing, object orientation, and microkernel operating systems. The KeyTXF transaction processing system, which runs on the KeyKOS microkernel operating system, provides high performance object oriented transaction processing. We believe that this work demonstrates that these technologies can be successfully combined, gaining the advantages of each.
The technology of transaction processing significantly reduces the effort required to build applications by allowing applications to combine a group of one or more operations on one or more objects while providing the following "ACID" properties:
Atomicity - "all or nothing"
Consistency - The application can ensure that the database state does not violate application defined consistency rules.
Isolation - Transactions are logically executed serially.
Durability - Results of a completed transaction are never lost.
Object oriented systems support encapsulation and isolation of function. They support the association of data with the algorithms that manipulate that data and the protection of that data from other algorithms. They make it easy to change the implementation of functions without affecting the users of those functions. They make it easy to build new function using existing functions as building blocks.
Microkernel systems support easy extension of "system" facilities. They permit support of more than one application programming interface. They significantly reduce the effort required to port the system to a new hardware architecture. They increase reliability by decreasing the amount of code whose failure crashes the entire system.
Satish M. Thatte describes a design for a persistent virtual memory system [9] with transactions for object-oriented garbage collected programming systems and object-oriented databases. In contrast with the approach to persistent memory used in KeyKOS, his design requires unforgeable pointers, bounds checking, and automatic garbage collection. His transaction manager keeps an external redo log outside virtual memory, while KeyTXF keeps the log in virtual memory.
QuickSilver [8] is an operating system with a small kernel that supports extensive use of transactions. QuickSilver is not explicitly object-oriented. In contrast, KeyKOS uses object technology to support transactions without requiring awareness of transactions in the microkernel.
These standards address the need for defined interfaces between a transaction application and the software components with which it interacts. Though not specifically object-oriented, such an interface can be viewed as an object interface, whether the object is local or remote. This interface makes explicit use of Transaction ID's.
The Object Management Group has identified the Object Transaction Service as a service which may be the subject of a future Request For Proposal. This work is in a very preliminary stage.
KeyKOS [2, 4] is a fault tolerant, object-oriented, microkernel operating system. KeyKOS uses capabilities [6] to address objects. KeyKOS provides operating system level protection to its objects.
The checkpoint mechanism [5] is the principal way KeyKOS makes data durable or persistent. The system level checkpoint periodically saves the entire state of the system on non-volatile storage (i.e. disk). This checkpoint protects all processes, data, files, registers, etc. against the system going down (whether planned or unplanned). When KeyKOS is restarted, the state of the system at the most recent checkpoint is restored, and all processes continue running from that state. Taking a checkpoint interrupts execution of processes for less than a second. Restart requires only the time for active processes to fault in their working sets.
A system level checkpoint is taken every few minutes, so the system must be able to commit transactions between checkpoints. To support the durability requirement of transaction processing, KeyKOS has an additional kernel-implemented operation, the Journalize Page operation. This operation saves the current contents of a virtual memory page to disk immediately so that the data will not be lost. After a restart, the data in the page will represent the data at the time the Journalize Page operation was called (or at the previous checkpoint, whichever is more recent). Using the Journalize Page operation, the KeyTXF transaction processing system can fail at any point in time without losing any committed transactions.
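The interaction between checkpoints and the Journalize Page operation can be modeled as follows. This is an illustrative Python simulation of the restart semantics just described; the function name, the page dictionaries, and the `restart` model are our own, not the KeyKOS interface:

```python
def restart(checkpoint, journal):
    """Model of KeyKOS restart: every page reverts to its checkpointed
    contents, except pages saved later with Journalize Page, which keep
    their journalized contents. Data structures are illustrative."""
    state = dict(checkpoint)    # everything backs up to the checkpoint...
    state.update(journal)       # ...except journalized pages
    return state

checkpoint = {"log_page": "old", "process_state": "at checkpoint"}
journal = {"log_page": "committed entries"}  # Journalize Page ran after the checkpoint
after = restart(checkpoint, journal)
assert after["log_page"] == "committed entries"   # committed work survives
assert after["process_state"] == "at checkpoint"  # processes back up
```

The asymmetry shown here, where journalized pages move forward in time while everything else moves back, is exactly what forces the restart-detection logic described later.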
Fault tolerance was a major goal of the KeyKOS microkernel. In addition to the checkpoint and journaling mechanisms, the microkernel uses automatic disk mirroring to ensure that no data can be lost from a single disk failure at any time. A special mechanism, described in section 6, is used to handle disk failures that occur in the middle of writing a disk page, resulting in the page being partially written.
KeyKOS uses a time stamp which allows it to record a time with a resolution of better than a microsecond and a date covering over 100 years. (The actual resolution depends on the hardware timing facilities available.) This time stamp is used extensively in KeyTXF. KeyTXF requires that time stamps never go backwards, and that the resolution is sufficient to generate distinct time stamps at the desired transaction rate.
KeyKOS maintains the time of the last restart and the time of the last checkpoint in a special page called the checkpoint page. This information is used by KeyTXF to detect when a restart has occurred.
The Compare and Swap operation atomically compares an "old" value with a value in memory. If they are the same, it stores a "new" value into that memory location. It returns an indication of whether a store occurred. If Compare and Swap is not available in hardware, it may be implemented in the kernel. KeyTXF assumes that it is atomic in relation to checkpoint/restart.
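The semantics of Compare and Swap can be sketched as follows. This is an illustrative Python simulation (the function name and the list-as-memory model are ours); in KeyKOS it is a hardware instruction or kernel operation, so the atomicity is real rather than simulated:

```python
def compare_and_swap(memory, index, old, new):
    """Atomically store `new` into memory[index] if it still equals `old`.

    Returns True if the store occurred. Here the check-and-store runs
    sequentially; the real operation performs it atomically with respect
    to other processors and to checkpoint/restart.
    """
    if memory[index] == old:
        memory[index] = new
        return True
    return False

mem = [42]
assert compare_and_swap(mem, 0, 42, 99)      # unchanged since read: store occurs
assert mem[0] == 99
assert not compare_and_swap(mem, 0, 42, 7)   # value changed meanwhile: store fails
assert mem[0] == 99
```

The failure case is the interesting one for KeyTXF: a mismatch between the "old" value and memory is how the log code discovers that a restart has intervened.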
KeyTXF is implemented directly on the microkernel interface. KeyTXF applications can be implemented in any of the supported operating system environments including the microkernel.
A transaction processing application is viewed as an object which communicates through references to other objects. On the "front end", the application communicates with requesters of transactions. On the "back end", the application communicates with data storage services, often called resource managers. The application performs its own specific operations needed to process the transaction. The application is responsible for maintaining the consistency of its objects. See Figure 1.
Because the KeyKOS microkernel supports cheap processes, KeyTXF uses several processes for each transaction, simplifying both system and application design. These processes are objects in KeyKOS terminology. Multiple instances of the application are used to allow more than one transaction to be processed at a time. Using multiple instances allows the application programmer to program with the simplicity of single threaded logic.
Both the Transaction Distributor and the Commit/Abort Manager dedicate an interface object to each application instance. The Transaction Distributor uses its interface object to send transactions to the application instance, monitor its status, and provide it with networking services. The Commit/Abort Manager uses its interface object, called the Commit/Abort Interface Object (CAIO), to save data to be committed as described below. KeyTXF uses these objects to serve many of the functions that the transaction ID serves in other systems.
Transactions enter the system as messages from a communications network. They are routed to the Transaction Distributor, which selects an available application instance to process the transaction and passes the message to the application. It returns messages from the application to the transaction source. It can also detect the need for restart recovery.
In addition to selecting an available application instance to process the transaction, the Transaction Distributor is responsible for recovering from application program crashes by aborting any transaction that is in progress and notifying the transaction source that the transaction has failed.
The Monitor provides an operator with a real time display of the status of the system.
The back end consists of the Commit/Abort Manager, and one or more database objects.
The heart of the system is the Commit/Abort Manager (CAM). The CAM ensures atomicity of transactions by supporting Commit and Abort operations. During transaction processing, it records all changes to the database objects. The actual database objects are not changed until the transaction is committed. When a commit is issued, the CAM writes all the database changes into a log, uses the Journalize Page operation to protect that log, sends the changes to the database objects, and then releases all the locks acquired by the transaction.
In addition, the CAM is responsible for detecting whether the system has restarted, and for recovering from the restart. During restart recovery, the CAM reads the log and re-sends the logged changes to the database objects.
The CAM is described in detail in section 5.
The database ensures isolation or serialization of transactions by means of record-level locking. The data storage objects do not need to know about commit/abort since that logic is in the Commit/Abort Manager.
The microkernel's object orientation helped us build a database out of smaller objects. We used a preexisting object type, called a record collection, which provided efficient indexed access to data. We used record collections to build a new object type, called a Multiple User Record Collection (MURC), which provides two enhancements to the basic record collection. The MURC allows multi-threaded access to data by spreading the data over multiple instances of the record collection (which is single threaded). In addition the MURC provides locking services. This construction resulted in a simple, efficient database, and took a new KeyKOS programmer two months to implement. Other data storage objects could also use wrappers to provide locking and/or interface translation.
The MURC goes through two stages of instantiation during system operation. The first stage occurs when a database is created. The MURC creates a number (specified by the caller) of record collections to hold the data. The caller has the option of arranging for these record collections to use space on different physical disk drives, to achieve parallel access. See figure 3.
The second stage occurs when application instances are created. Each application instance has an associated MURC Accessor (one for each database) to hold its locks and to select the record collection which holds the desired record. See figure 4. Figure 5 shows how the CAM connects to the database objects.
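The MURC construction can be sketched as follows. This is an illustrative Python model, not the KeyKOS API: the class names, the hash-based spreading of records, and the lock table are our assumptions about one reasonable realization of the design described above:

```python
class RecordCollection:
    """Stand-in for the single-threaded KeyKOS record collection object."""
    def __init__(self):
        self.data = {}
    def write(self, key, value):
        self.data[key] = value
    def read(self, key):
        return self.data.get(key)

class MURC:
    """Sketch of a Multiple User Record Collection: spreads records over
    several record collections by hashing the key, and tracks
    record-level locks on behalf of its callers."""
    def __init__(self, n_collections):
        self.collections = [RecordCollection() for _ in range(n_collections)]
        self.locks = {}                   # record key -> lock holder
    def _select(self, key):
        # Pick the record collection that holds this record.
        return self.collections[hash(key) % len(self.collections)]
    def lock(self, key, holder):
        if self.locks.get(key, holder) != holder:
            return False                  # held by another transaction
        self.locks[key] = holder
        return True
    def unlock_all(self, holder):
        self.locks = {k: h for k, h in self.locks.items() if h != holder}
    def write(self, key, value, holder):
        assert self.locks.get(key) == holder, "must hold the lock to write"
        self._select(key).write(key, value)
    def read(self, key):
        return self._select(key).read(key)

db = MURC(4)
assert db.lock("acct-1", holder="txn-A")
db.write("acct-1", 100, holder="txn-A")
assert not db.lock("acct-1", holder="txn-B")  # isolation via record-level locking
db.unlock_all("txn-A")
assert db.read("acct-1") == 100
```

Note that the wrapper adds locking without the underlying record collections knowing anything about it, which is the wrapper pattern the text suggests for other data storage objects.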
The log contains information about all the transactions that have been committed since the last checkpoint. Each log entry consists of an entry header, and the data which defines the changes that were made to the database objects during the transaction, called the Local Event Queue (LEQ, see below). The entry header contains the length of the entry, and the internal application ID of the application instance responsible for this entry. The last log entry is followed by a header containing a length of zero which acts as an End of File (EOF) mark.
The log is maintained in a fixed set of pages used in a circular manner. Access to the log is serialized by a lock called the log lock. The CAM maintains a global pointer to the place for the next log entry. Each page of the log has a header and a trailer, which are completely independent of the entry headers. The log page header consists of a word which is used to ensure disk write consistency (described later), and the time stamp of the last log entry that was placed in the page. The trailer consists of another word for disk write consistency assurance. The pages of the log have time stamps that are monotonically increasing, up to and including the page containing the EOF mark. As a refinement, if the EOF mark would occupy the first location in a page, it is omitted; in this case, non-monotonically increasing time stamps will mark the EOF.
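The page layout and the timestamp-based EOF rule can be sketched as follows. The field names and the `LogPage` class are illustrative; the real format is internal to KeyTXF:

```python
class LogPage:
    """Illustrative model of a log page with its header and trailer."""
    def __init__(self, serial, last_entry_time):
        self.header_serial = serial         # word used for disk write consistency
        self.header_time = last_entry_time  # time stamp of last entry in page
        self.entries = []                   # the log entries themselves
        self.trailer_serial = serial        # must equal header_serial on read

def find_last_valid_page(pages):
    """Return the index of the last page before EOF: page time stamps
    increase monotonically up to and including the page holding the
    EOF mark, so a non-increasing time stamp marks the end."""
    for i in range(1, len(pages)):
        if pages[i].header_time <= pages[i - 1].header_time:
            return i - 1
    return len(pages) - 1

# Times 100, 150, 210 increase; 90 is a stale page from an earlier wrap.
pages = [LogPage(7, t) for t in (100, 150, 210, 90, 95)]
assert find_last_valid_page(pages) == 2
```

This is why the EOF mark can safely be omitted when it would fall at the start of a page: the drop in time stamps alone identifies the boundary.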
Since log entries that were made before the most recent checkpoint are not needed to recover the database, the log need contain no more data than is generated in a checkpoint interval. However, there are no significant costs (other than storage costs) in having the log larger than necessary. The CAM will force a checkpoint if the log space is exhausted. This action causes all transactions to wait until the checkpoint completes, and is undesirable because it introduces a noticeable delay in transaction processing. Allocating enough space for the log avoids this cause of delay.
Understanding the logic of the Commit/Abort Manager requires understanding the states of the log. A somewhat simplified view of the log, ignoring transient states while the CAIO is writing to the log, starts with the end of the previous replay and proceeds to the start of the next replay.
At the end of replay, the Commit/Abort Manager initializes the Log Pointer to point to an EOF mark in the log. See figure 6.
After transactions have been committed, the log contains some transactions that were committed before the most recent checkpoint, and some that were committed after it. See figure 7.
If the system fails at this point, the Log Pointer, the processes, and the databases (in fact, everything but the log), are backed up to their state at the last checkpoint. The log is not backed up because its pages are written out with the Journalize Page operation. The part of the log that must be replayed is the part between the log pointer and the EOF mark. See figure 8.
Most KeyKOS programs do not need to detect restarts. When KeyKOS restarts from a checkpoint, all their state (registers, memory, files, etc.) is restored to the (consistent) state at the time of the checkpoint, and the program runs from there, producing the same results. When a program uses the Journalize Page operation, it loses this consistency, and needs to be aware of restarts.
Consider the case of a CAIO about to store an EOF mark in the log. The simple-minded logic would be: (1) Write the LEQ into the log, (2) store an EOF mark in the log, and (3) call Journalize Page to write the log to disk. If a checkpoint is taken during step one, and the system continues to commit transactions, we end up with a log that looks like figure 7.
If KeyKOS restarts from that checkpoint, then the CAIO, whose state was backed up to the time of the checkpoint, completes step one, which is not a problem since copying the LEQ is an idempotent operation. But it then stores an EOF in the log, truncating the log and "uncommitting" the transactions following it in the log. To avoid this problem, the CAIO checks for restart when storing EOF in the log and at other critical points, and executes special logic if it finds that a restart has occurred.
When the CAIO completes a replay, it saves the time of last restart in a global variable. To check for restart, it compares the time of last restart in the checkpoint page with the value it saved. If they are different, then a restart has occurred.
Any time the CAIO detects that a restart has occurred, it first aborts any transaction in progress. (If the transaction was committed before the restart, then its effects on the database will be restored by replay.) The CAIO then initiates replay of the log.
A Begin Transaction operation is used by the CAIO only to ensure that the application is processing in the correct sequence. Because the CAIO's provide transaction protection at all times, KeyTXF always considers the application instance to be in a transaction. As explained below, KeyKOS does not use Transaction ID's, so when the application instance issues a Begin Transaction, there is no work to be done. This operation is not fundamental, and could be eliminated, saving a small amount of overhead.
During transaction processing, the CAIO passes the read requests along to the database. When the application instance issues a read with lock, a write, or a delete, the CAIO saves data about the operation internally in a structure called the Local Event Queue (LEQ).
An Abort Transaction operation causes the CAIO to discard the LEQ and release the database locks. Since no changes have been made to the database, that is the only processing necessary.
The commit operation is performed by a single phase commit. A simplified view of what happens when the application issues a "commit transaction" operation, is that the CAIO obtains the log lock, appends the LEQ to the log, releases the log lock, sends the changes recorded in the LEQ to the database, signals the database to release the locks, and returns "commit successful" to the application instance.
Several optimizations contribute to the subtlety of the actual logic. Journalizing the pages of the log is done by independent threads for parallelism. The sequence of updating the database and journalizing the pages is arranged to maximize the probability of committing several transactions with a single journal write. And, as mentioned above, special logic is required to preserve the log when a restart occurs.
As the Commit/Abort Manager appends to the log, it needs to advance through the pages of the log. The procedure getnextlogpage is used to advance to the next page, ensuring that committed transaction entries are not nullified. The logic of getnextlogpage is:
Save the current time (called the new time stamp). If all goes well, this will be used as the new time stamp in the header of the current log page.
Check for restart. By ensuring, after storing the time stamp, that a restart has not occurred, we ensure that the time stamp stored is earlier than any need for replay. If getnextlogpage detects a restart, it does not update the time stamp in the log page header, and a replay will be started.
Copy the old time stamp from the header of the current page. Check that this time stamp is earlier than the new one. (It could be later if a checkpoint occurred after the check for restart, subsequent logging had written later entries into the page, and the system then restarted from that checkpoint.) If it is earlier, then use Compare and Swap to update the log page header time stamp, passing the old time stamp as the "old" value and the new time stamp as the "new" value. This update will only succeed if the value in the page has not changed. If the value has changed, a restart has occurred.
Get the capability to the next log page and make the page addressable. If the new page header's time stamp is more recent than the last checkpoint, then the log has wrapped. Take a checkpoint to allow the log page to be reused.
Fork a log write thread to journalize the just filled page.
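The steps above can be sketched as follows. This is an illustrative Python rendering of getnextlogpage in which the restart check, Compare and Swap, checkpoint, and journaling operations are passed in as stubs; the parameter names and data structures are ours, not the KeyKOS interface:

```python
def getnextlogpage(log, state, now, restart_occurred, compare_and_swap,
                   take_checkpoint, fork_journal_write, last_checkpoint_time):
    new_stamp = now()                        # 1. save the current time
    if restart_occurred():                   # 2. check for restart; do not
        return "replay"                      #    update the header time stamp
    page = log[state["page"]]
    old_stamp = page["header_time"]          # 3. copy the old time stamp
    if old_stamp >= new_stamp:
        return "replay"                      #    later stamp: a restart intervened
    if not compare_and_swap(page, "header_time", old_stamp, new_stamp):
        return "replay"                      #    value changed: a restart intervened
    state["page"] += 1                       # 4. advance to the next page
    next_page = log[state["page"] % len(log)]
    if next_page["header_time"] > last_checkpoint_time:
        take_checkpoint()                    #    log wrapped: free pages for reuse
    fork_journal_write(page)                 # 5. journalize the just-filled page
    return "ok"

log = [{"header_time": 10}, {"header_time": 5}]
state = {"page": 0}
def cas(page, field, old, new):
    if page[field] == old:
        page[field] = new
        return True
    return False
written = []
assert getnextlogpage(log, state, lambda: 100, lambda: False, cas,
                      take_checkpoint=lambda: None,
                      fork_journal_write=written.append,
                      last_checkpoint_time=50) == "ok"
assert log[0]["header_time"] == 100 and state["page"] == 1
```

The "replay" returns correspond to the cases in which the real code abandons the commit and initiates restart recovery.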
The complete logic to commit a transaction is:
Obtain the log lock.
Check for restart.
Write a log entry header. Writing the header destroys the EOF mark. The end of the log must be detected by time stamps until the EOF mark is written at the end of the new entry. Call getnextlogpage whenever a new page is needed to write the log entry.
Copy data from the LEQ to the log using getnextlogpage as needed.
If there is room in the current page, then write an EOF mark (a length word of zero), using the following logic to avoid truncating a log after a restart:
Store the current time stamp.
Copy the old time stamp from the log page header, and the current value of the word where we will write the EOF, into local variables (subject to checkpoint).
Check for restart. If the time stamp stored is earlier than or the same as the value found in the log page header, then a restart has occurred.
Use Compare and Swap to update the EOF mark using the value saved above as the "old" value. If the compare part of the Compare and Swap failed, a restart has occurred.
Use Compare and Swap to update the time stamp in the page header using the value saved above as the "old" value. If the compare matched then the EOF mark has been successfully stored.
Otherwise, the time stamp is different from the one saved and the system has restarted. If the time stamp we were trying to store is the same as the one now in the log page header, then the log entry being made is the last one in the log and the EOF mark just stored should remain in the log. Otherwise additional entries have been made in the log after ours and if the EOF mark is still zero, we must restore the old value (known because of the Compare and Swap). Continue with the normal restart logic.
Update the log pointer to point to the next log slot.
Release the log lock.
Perform a pre-page operation on the next few log pages to avoid a paging operation while the log lock is held.
Save the time stamp from the last log page for this log entry for use in checking for completion of the Journalize Page operations.
Process the entries in the LEQ in the order they were made. The lock entries need no action because the locks have already been obtained. For write or delete entries, call the appropriate database instance to perform the write or delete. Note whether there are any unlocked append entries, but do not perform them.
Release all database locks.
If there were any unlocked append entries, reprocess the LEQ to perform them. This logic allows an application to maintain a sequential log of transactions without that log file forcing serialization.
Ensure that the entire log entry has been journalized by checking that log writes for all log pages older than or the same age as the last page of the entry have completed. If it is necessary to schedule a write, this transaction may be delayed up to 100ms to allow other transactions to be committed by the same write. Journalizing a log page is done by forking a thread. Releasing the log lock before performing the Journalize Page operation allows one disk write of the log to commit more than one transaction if their log entries fit in the same page.
Return "transaction committed" to the caller.
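The commit path above can be condensed into the following sketch. This is an illustrative Python rendering, with journaling, restart detection, and the Compare-and-Swap EOF logic reduced to stubs and with the page-level detail omitted; all names are ours, not the KeyTXF interface:

```python
from threading import Lock

def commit_transaction(app_id, leq, log, db, release_locks, ensure_journalized,
                       restart_occurred, log_lock):
    """Single-phase commit, condensed from the steps above."""
    with log_lock:                           # obtain, then release, the log lock
        if restart_occurred():
            raise RuntimeError("restart detected; abort and replay")
        log[-1:] = [                         # overwrite the old EOF mark with
            {"length": len(leq),             # the new entry header and events,
             "app_id": app_id,
             "events": list(leq)},
            {"length": 0},                   # then a fresh EOF mark
        ]
    for op, key, value in leq:               # apply logged changes to the database
        if op == "write":
            db[key] = value
        elif op == "delete":
            db.pop(key, None)                # "lock" entries need no action
    release_locks()                          # release all database locks
    ensure_journalized()                     # wait for journal writes; one disk
    return "transaction committed"           # write may commit many transactions

log = [{"length": 0}]                        # empty log: just an EOF mark
db = {}
result = commit_transaction(
    "app-1", [("write", "acct", 500), ("delete", "tmp", None)],
    log, db, release_locks=lambda: None, ensure_journalized=lambda: None,
    restart_occurred=lambda: False, log_lock=Lock())
assert result == "transaction committed"
assert db == {"acct": 500}
assert log[-1] == {"length": 0}              # log again ends with an EOF mark
```

Releasing the log lock before waiting on the journal write is the key batching optimization: several transactions can append entries to the same page and be committed by a single disk write.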
During replay, pages of the log are processed sequentially beginning from the log pointer. If a log page is found with a time stamp earlier than that of the previous page, then the current log entry is incomplete; it is discarded, and the log is considered to be at EOF.
Replay runs in a process of its own, but calls the CAIO's associated with each application instance to actually perform the database updates. Replay first gets the log lock to ensure all CAIO's writing into the log have completed their writes. Replay then processes the log, starting from the log pointer saved by the checkpoint, and continuing until it detects log EOF. It releases the log lock before calling the CAIO's, to ensure that they can complete their attempts to log a transaction (by aborting it) and become available to help in the replay. They regain the locks and perform the database updates. When replay is completed the application and CAIO's are destroyed and rebuilt to avoid application errors that could continue a transaction that was entered before the restart.
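The replay scan can be sketched as follows. This is an illustrative Python model (the page and entry dictionaries are ours); locking, the CAIO's, and rebuilding the application instances are omitted:

```python
def replay(log_pages, start, db):
    """Re-send committed changes to the database, scanning from the saved
    log pointer until EOF: either an explicit zero-length entry header or
    a page whose time stamp fails to increase (an incomplete entry)."""
    replayed = 0
    prev_time = -1
    for page in log_pages[start:]:
        if page["time"] <= prev_time:
            break                        # stale page: treat as EOF
        prev_time = page["time"]
        for entry in page["entries"]:
            if entry["length"] == 0:
                return replayed          # explicit EOF mark
            for op, key, value in entry["events"]:
                if op == "write":
                    db[key] = value      # idempotent: safe to re-send
                elif op == "delete":
                    db.pop(key, None)
            replayed += 1
    return replayed

pages = [
    {"time": 10, "entries": [{"length": 1, "events": [("write", "a", 1)]}]},
    {"time": 20, "entries": [{"length": 1, "events": [("write", "b", 2)]},
                             {"length": 0}]},
    {"time": 5, "entries": []},          # stale page from a previous log wrap
]
db = {}
assert replay(pages, 0, db) == 2
assert db == {"a": 1, "b": 2}
```

Because every logged operation is idempotent, re-sending changes that had already reached the database before the failure is harmless.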
One non-obvious failure mode occurs when the computer system fails halfway through writing a page. This failure can occur if the disk sectors are smaller than a page and only some of them have been written. If this occurs while the Journalize Page operation is writing a page, it is possible that, after restart, the page will contain data from two separate times. The microkernel provides support for recovery from failures of this kind.
The KeyKOS microkernel protects against this occurrence by allocating log pages in mirrored disk storage, and completing the write operation on one page before starting it on the other. The Journalize Page operation allows pages to be placed in a special state, where the kernel uses the first and last words of the page to hold a write serial number. If a page in this state is read and the serial number at the start of the page does not match the serial number at the end, then the page is known to be corrupted and the other copy is read.
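The torn-page check can be sketched as follows. This is an illustrative Python model of the recovery rule just described; the function and field names are ours:

```python
def read_journalized_page(primary, mirror):
    """Return the first mirror copy whose leading and trailing serial
    numbers agree; a mismatch means the write was interrupted partway
    through and the copy holds sectors from two different writes."""
    for copy in (primary, mirror):
        if copy["first_word"] == copy["last_word"]:
            return copy
    raise IOError("both copies of the page are torn")

good = {"first_word": 7, "last_word": 7, "data": "committed log entries"}
torn = {"first_word": 8, "last_word": 7, "data": "mixed old and new sectors"}
assert read_journalized_page(torn, good) is good  # torn primary: read the mirror
assert read_journalized_page(good, torn) is good
```

Because the two mirror copies are never written concurrently, at most one copy can be torn by a single failure, so this check always finds a consistent copy, subject to the disk controller caveat below.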
This technique of failure protection can be defeated by certain algorithms in the disk controller. If for example, the disk controller caches sectors of a page and writes the first and last sectors before writing one of the middle ones, a failure may escape detection. Most SCSI disk systems do not specify whether they use such algorithms.
It is instructive to compare the structure of KeyTXF with that of other systems. The KeyTXF structure takes advantage of the KeyKOS microkernel's object orientation and efficient support for many objects. Without these kernel features, the design approach used in KeyTXF would not be fast enough to be useful.
Other systems generate a unique Transaction ID for each transaction that is processed [3]; KeyTXF does not use Transaction ID's at all. A transaction is identified with the application instance that generates it. To distinguish transactions from different application instances, an application ID is used in the log. The application ID is reused for each transaction from the same application instance.
The Commit/Abort Interface Object serves to carry the identification of the application ID, and therefore the transaction, to the CAM. It is worthwhile to note the generality of the object-oriented microkernel paradigm here. Systems that use Transaction ID's must carry that ID in messages to entities that are involved in the transaction, such as databases and the transaction processing monitor. Often the Transaction ID is passed implicitly using an operating system level mechanism that is specific to transaction processing.
The KeyKOS microkernel was not designed specifically for transaction processing and has no such specific mechanism. But the object-oriented design of KeyTXF makes such a mechanism unnecessary. Lacking such a mechanism, the microkernel remains small and fast.
The CAIO's are the key to this design. For each application instance, the CAIO mediates the application's communication with the database objects. The interface objects are part of the CAM; they are responsible for identifying the different application instances and therefore the transactions. The application code does not need to concern itself with identifying transactions.
Mediating the application's access to the database objects is possible because of the security features of KeyKOS. An application instance has access to the CAM, but does not have any direct access to the database objects. Only the CAM has access to the database objects. Thus the CAM is in a position to control the application's access to the database objects by interposing an interface object. The KeyKOS microkernel allows this type of access control to be strictly enforced.
It is worth pointing out that an application instance is a general KeyKOS object, and as such is not restricted to a single process. It can create processes and objects, and can communicate with existing objects. The interface objects ensure that a database object is transaction protected even though the application instance may pass its capability to the CAIO on to other objects.
An implementation of the TP1 performance benchmark [1] in 1986 showed that KeyTXF can process 18 transactions/second on an IBM 4341 (a 1.3 MIPS processor) with two channels of 1.5 megabyte/second disk. This is equivalent to 13 transactions/second/MIPS. There were 3.3 disk I/O's, 3 X.25 packets and 76,000 instructions per transaction. Additional benchmarks show that performance scales linearly with CPU speed if the I/O system is also upgraded for increased capacity. This compared with 0.7 to 7 transactions/second/MIPS for a variety of other systems at the time [7]. IBM's TPF achieved about 22 transactions/second/MIPS, but offers no protection; in TPF, all applications run in supervisor state.
The superior performance resulted from the microkernel design and implementation. The microkernel's checkpoint mechanism supports the caching in virtual memory of information normally stored in disk files, while maintaining consistency and durability.
A commercial bank designed and implemented an application to process credit card authorization transactions using KeyKOS and KeyTXF. Though no formal measures were made, the development required approximately half the effort that it was estimated would have been required using traditional transaction processing systems.
The KeyTXF experience shows that it is possible to gain the advantages of microkernel architecture, object oriented operating systems, and high performance transaction processing in one system. Object orientation made it possible to keep the design of the transaction processing system and its applications simple, even to the extent of eliminating the use of Transaction ID's. The microkernel architecture allowed us to achieve superior performance. Finally, we believe that making reliability assurances requires that disk system manufacturers provide sufficient specifications to allow software recovery from system failures that occur while writing.
KeyTXF was developed by William Frantz at Key Logic in 1986. The MURC was written by Susan Rajunas. This paper is dedicated to the memory of Codie Wells.