UNIX Implementation on KeyTECH

The UNIX API provides process families, access to devices, and access to disk based files. The simple model of a process is a program space, some open file descriptors, and a tree of child processes. A process requests system service by executing one of about one hundred system calls. These system calls result in either returning information about files or processes, or altering the state of files or processes.

Access to files is via th UNIX hierarchical name structure with a directory at each level in the hierarchy. A directory associates a name with a type of file and, if it is a disk based file or directory, with the address of the file Inode. An Inode is a disk based description of a file or directory which describes the files attributes and identifies the regions of the disk that contain the file or directory information. Implied but not visible to applications are disk allocation maps.

Each process has a description block (a process table entry) that describes the process' address space, open files, and signal handling. Process table entries contain chains of child processes and pointers to the parent process table entry.

Each open file has an entry in the Open File Table which keeps track of the number of processes that have the file open, the attributes of the file, and a pointer to the data structures that buffer the file data in memory.

UNIX functions as objects

An approach to supporting the UNIX application programming interface (API) that takes advantage of the object nature of the KeyTECH Microkernel and the KeyTECH programming interface (SPI) described above uses objects to represent UNIX concepts. The KeyNIX system described here emphasizes the simplicity of design made possible by using objects at every opportunity. In the section covering performance of this system we will see that while performance can be improved it is entirely satisfactory. Later discussions outline changes that reduce the object orientation of the design and will improve the performance.

A UNIX process is a Domain with a Segment as its program space. Each process has a Keeper that handles "system calls". There is a shared Segment and SuperNode for the Process Table and Open File Table. There is a Domain for each Inode and a Segment for each file. Directories are Record Collections with a Directory Accessor object to emulate the sequential reads required by many UNIX programs like ls().

By making each UNIX object into a KeyTECH Object, all caching is done by the Microkernel. The Keeper does not have to manage an Inode cache and worry about doing I/O to read and write Inodes. When the Keeper needs to read that status information from an Inode it sends a message to the Inode object and waits for the reply. Similar arguments apply to other operations. The Keeper does not cache file blocks, does not cache and search directory blocks, and does not maintain paging tables for support of virtual memory. All of these functions are handled by the Microkernel.

It is the Microkernel that handles the sharing of virtual memory based on much simpler operations performed by the Keeper. When a program is loaded by execv(), the Keeper builds a Segment with the program file Segment as a sub-Segment with a read-only attribute. This manipulation of Nodes and Keys that define the address space of the UNIX process Domain are vastly simpler than the maintenance of shared text structures and the construction of shared segment and page tables. The Microkernel is responsible for the construction of mapping tables for the particular hardware and has been optimized for that hardware. The UNIX Keeper manipulates the KeyTECH structures in an optimized fashion that does not change from platform to platform.

There is a directory of Keys to Device Driver objects. There is an Object for each major device in the UNIX configuration. I/O is performed by looking up the correct Device Driver object in the directory based on the major device number (similar to the CDEVSW table) and sending a message to the Device Driver object which describes the I/O operation. The Device Driver object consists of the original UNIX Device Driver code linked with a support library that emulates the UNIX kernel interface and translates the requests for UNIX kernel services into Key invocations.

A Discussion of expected performance

In an object oriented implementation of UNIX as one might find if the UNIX kernel were written in C++, one would use the object paradigm and much of the object management code in the kernel would disappear. When there are permanent objects available because they are implemented at the Microkernel level, then the kernel code that manages a cache of objects in memory and transfers those objects to and from disk also disappears. The use of objects implemented by the Microkernel greatly simplifies the code that uses those objects because the UNIX kernel does not have to have code to manage the data structures.

The cost of using system implemented objects is the cost of task switches between the requester of a service and an object providing that service. What is difficult to accept is that the task switch cost can be lower than the data management code that it replaces. The object paradigm was at the heart of the design of the KeyTECH system and, as a result, the task switch costs are very much lower than in traditional systems and several times lower than in competing microkernels such as MACH and Chorus.

The low cost of task switches makes it possible to obtain better performance with much simpler software. Among the benefits of simpler software is that it is easier to maintain and is much more reliable. The UNIX implementation described here takes the maximum advantage of system level objects and the result is a very small UNIX kernel. A total of 7000 lines of C implement enough of BSD4.3 to run X Windows and the standard set of shells and utilities.

Compatibility

The use of a Keeper to support all of the UNIX functions allows KeyNIX to be 100% binary compatible with BSD 4.3. The BSD 4.3 system calls from the application are "trap" instructions which result in an invocation of the Keeper. The Keeper acts as the UNIX kernel and performs the requested service and restarts the UNIX application in the same manner as the BSD 4.3 keeper.

The application address space is exactly the same in KeyNIX as it is in BSD 4.3. There is no code either linked in or mapped in with the application code that has anything to do with the KeyNIX system. All of the KeyNIX code is in separate objects with separate address spaces. The application address spaces are bit by bit identical.

While KeyNIX could be equally compatible with MACH 2.5, the existing prototype is not. There are three minor incompatibilities that require that applications be "re-linked" with the "ld" processor.

The application prolog ("crt0") in MACH 2.5 initializes certain MACH ports. KeyNIX does not yet have any support of these MACH functions. The MACH port functions can be supported by the Keeper and by Port object, but this work has not be done at this time.
MACH 2.5 port functions are accessed by a "trap" instruction in the same fashion as are UNIX system calls. The Keeper needs code to handle these traps appropriately.
The FORK() function does the same port initialization for the new task that was done by "crt0" in the parent task.
MACH 2.5 does not implement the SBRK() system call. This call is handled by a library routine that uses the "VMALLOC" of MACH 2.5 to handle memory expansion and contraction.

Programs compiled on the Luna 88K under MACH 2.5 that are to be run in the KeyNIX system must be linked with a new prolog and two library stubs for FORK() and SBRK(). In cases where the ".o" files exist, there is no need to re-compile the programs, but the programs must be re-linked.

KeyNIX could be made 100% binary compatible with MACH 2.5 by implementing the basic port mechanisms used by initialization, fork(), and sbrk().

The existing prototype does not support all BSD 4.3 system calls. The major criterion for chosing what to implement and what not to implement was the need to run X-Windows, csh(), ls() and similar useful utilities. If the system call is not needed to run these applications then it is not implemented. There are a number of calls that are implemented in a limited fashion, again sufficiently to run the required applications. As an example, csh() makes usage() calls but does not depend on the answers for correct behavior. Usage() always returns the same fixed values and is not useful as a measuring tool as a result.

If a call is not implemented then an error is returned. However, if a call is only partially implementd then no error is indicated but the results may not meet the needs of the application.

The file structure present in the KeyNIX system is not complete. Only those directories that are basic to the operation of BSD 4.3 and X11 are present. All programs are in /bin. The only files in /etc are the password files and termcap file. The directory structure is miminized in order to keep the initial system installation tape small. Only about 6 megabytes of data is installed on the prototype KeyNIX system.

Description of UNIX support objects

Each of the Objects used to support the UNIX implementation and its relationship to all the Objects that comprise the UNIX implementation are described here. As in the previous descriptions of Objects, only the major operations are described here. A full description of these objects can be found in KeyKOS Object Reference(KL230).

INODE

A file is represented by an Inode object. The Inode object holds all of the state data that describes the file (Owner ID, File Type, Link count, etc) as well as a Key to a Segment, (if the file is a disk file) or Record Collection (if the file is a directory). Inodes for special files do not hold keys to the object that supports the device. Special devices are accessed through a directory of device keys indexed by the major device number. In addition, the Inode object has a Factory key for a Directory Accessor which an Inode for a directory uses to create a Directory Accessor object when the directory is opened.

  KC (INODEF,INODEF_Create) KEYSFROM(sb,m,sb,object) STRUCTFROM(stat) KEYSTO(INODE) RCTO(rc)

The "stat" structure is from "/usr/include/sys/stat.h" and describes the Inode data. The type of file is determined by the contents of this structure.

If the file type is "regular" then "object" is a Segment key. Usually this is a Fresh Segment so that it can grow arbitrarily. If the Segment key is in fact a Read Only Segment key, then the file is read only regardless of the access mode information in the "stat" structure.

If the file type is "directory" then "object" is a Named Sequence Record Collection.

If the file type is some form of character device (tty, pipe, socket, symbolic link, etc.) then "object" is a Zero Data key (or not provided) since the major device number in the "stat" data is used to locate the Device Driver key in the Device Driver directory.

It would seem natural to access the device object directly from the Inode. However, many device drivers link to other device drivers using the switch tables generated based on major device number. Rather than have two mechanisms, all special device access goes through the switch table (which is implmented as a directory of device object keys).

  KC (INODE,INODE_ReturnInodeData) STRUCTTO(stat) RCTO(rc)

The fstat() and stat() functions return the stat.h structure. For fstat() the Open File Table contains the slot number of the Inode key in the shared supernode. The Keeper fetches the Inode key and invokes the Inode to return the "stat" data, receiving the reply string directly into the process memory. For stat() the Keeper gets the correct Inode by calling a directory search subroutine before invoking the Inode.

  KC (INODE,INODE_IncrementLinkCount) RCTO(rc)

When an Inode is put into a directory its link count must be incremented.

  KC (INODE,INODE_DecrementLinkCount) RCTO(rc)

When an Inode is removed from a directory its link count must be decremented. If the link count and the open count both reach zero the Inode object destroys the file Segment (if it can) and self destructs.

  KC (INODE,INODE_UpdatePermissions) STRUCTFROM(st_mode) RCTO(rc)

Change the Inode "stat" data.

  KC (INODE,INODE_ChangeOwner) STRUCTFROM(st_uid,st_gid) RCTO(rc)

Change the Inode "stat" data.

  KC (INODE,INODE_ReturnPermanentDiracc) KEYSTO(diracc) RCTO(rc)

A DIRACC object is used to hide the fact that the Directory is represented as a Record Collection. Several task switches can be saved by adding the lookup function to the Inode. The Inode object can access the Record Collection directly and return the Inode to the requested path segment.

  KC(INODE,INODE_LookUpName) STRUCTFROM(name) KEYSTO(inode) RCTO(rc)

The current prototype actually uses this call. There are similar calls for AddName and DeleteName to completely supplant the need for the Keeper to use DIRACC except in the cases of sequential access to the directory.

  KC (INODE,INODE_Open) STRUCTFROM(mode) KEYSTO(object) RCTO(rc)

Increment the open count and return an object Key. When the file is a Directory the Key returned is to a new created DIRACC object. For regular files the Segment key is returned.

  KC (INODE,INODE_Close) RCTO(rc)

Decrement the open count and destroy the DIRACC object if the opened file was a Directory. When both the open count and the link count reach zero the Inode destroys the file and self destructs.

  KC (INODE,INODE_SaveLinkName) STRUCTFROM(name) RCTO(rc)

For symbolic link support the Inode keeps the link name. When the symbolic link is created the Keeper creates the Inode and sets the link name.

  KC (INODE,INODE_ReturnLinkName) STRUCTTO(name) RCTO(rc)

Read the symbolic link data.

DIRACC

The Directory Accessor is used for two functions. First, it serves as a front end to the Record Collection to simplify Adding, Deleting, and Resolving names. Second, it produces UNIX system dependent directory blocks for sequential reads. BSD 4.3 stores directory entries in a sequence of file chunks (directory blocks) each having a storage allocation technique for variable length path names. Functions such as ls() read access these blocks through library routines that know the structure of the directory blocks. The new UNIX standards call for READDIR, etc. as kernel calls, but on many systems these are implemented in libraries that continue to access a directory as a sequential file.

  KC (DIRACCF,DIRACCF_Create) STRUCTFROM(mode) KEYSFROM(sb,m,sb,rcol) KEYSTO(diracc) RCTO(rc)

The extended jump protocol is used to create a Directory Accessor. Mode is a 2 byte field that holds the MODE parameter of the open() function (See fcntl.h).

  KC (DIRACC,DIRACC_AddName) STRUCTFROM(name) KEYSFROM(inode) RCTO(rc)

A Name is added to the directory. Name is a single path segment. relative to the directory. The "inode" Key is the Key to the file Inode object.

  KC (DIRACC,DIRACC_DeleteName) STRUCTFROM(name)  RCTO(rc)

A Name is deleted from the directory. Name is a single path segment.

  KC (DIRACC,DIRACC_ResolveName) STRUCTFROM(name) KEYSTO(inode) RCTO(rc)

The Inode for a file is retrieved from the directory. Name is a single path segment. A complete name resolution involving a complete path involves repetitive invocations of Inode and Diracc until the complete path name has been parsed and resolved. A Name Server object could remove this logic from the Keeper but would serialize the open() functions. The Keeper has several subroutines to handle path name resolution.

  KC (DIRACC,DIRACC_DirRewind) RCTO(rc)

This operation rewinds the sequential access simulation so that the next DirRead function will receive the first simulated directory block.

  KC (DIRACC,DIRACC_DirRead) STRUCTTO(dirblock) RCTO(rc)

This operation constructs the next directory block required for sequential simulation. In the cases when the invoker does not wish to receive the entire directory block at one time (rare), sequential segments of the directory block is returned on subsequent invocations. If the directory block is greater than 4096 bytes, an extended jump protocol is used to return the entire block.

  KC (DIRACC,DIRACC_NumEntry) STRUCTO(numentries) RCTO(rc)

Returns the number of entries in the directory.

IODEVICE

All Devices and pseudo devices (sockets, streams, etc.) have the same interface to the Keeper. I/O operations that can complete immediately (buffered data) involve a single interaction with the Keeper while operations that cannot complete immediately involve two interactions with the Keeper. The implementation of sockets as devices simplifies the Keeper logic as there are only two styles of I/O, devices and regular files. In a port of a UNIX system the socket code and other device code would be encapsulated in an object that gave the socket code the appearance of running in a UNIX kernel and would interact with the Keeper using device I/O logic.

The Keeper supplies an IOREQ (I/O Request) structure and a Key to the memory of the requesting UNIX process. The Device Driver object returns the IOREQ structure with some of the fields updated.

Since delayed completion I/O operations require the Device Driver object to invoke a Key to the Keeper at the same time that the Keeper might invoke a Key to the Device Driver object to initiate or cancel I/O operations, it is necessary to use an intervening object (a queue object) on one of the calls. The current implementation uses a queue object (called an IOF or IO Forker) on the Keeper invocation of the Device Driver object to initiate or cancel the I/O. The Device Driver object invokes the Keeper directly upon completion of the I/O.

  KC (DEV,DEV_Startio) STRUCTFROM(ioreq) KEYSFROM(memory,notify) STRUCTTO(ioreq) RCTO(rc)

Start I/O operation. The "ioreq" structure contains a description of the I/O operation including the buffer address, count, device number (major and minor), process id, and command. The "memory" Key is a Key to the UNIX process memory which is mapped into the address space of the Device Driver object. The commands are briefly described below. All return values are in the returned ioreq.

CLOSE - Close device.

READ - Read data into user memory

READIOV - Do a vector read into user memory

WRITE - Write from user memory

WRITEV - Do a vector write from user memory

IOCTL - perform IOCTL function (the user parameter is in buffer count field)

SEEK - perform any SEEK functions

SELECT - perform SELECT() function (select function in buffer count field)

MMAP - perform MMAP function which may modify user memory tree.

UNMMAP - perform UNMMAP function which may modify memory tree.

And for Socket devices the following commands which sometimes return device numbers in the "ioreq" device number field.

CREATEPAIR - create a connect pair (for pipes)

LISTEN - put socket into listening state

ACCEPT - accept a queued connection

CONNECT - make a connection

In all cases the return code value indicates the state of the I/O operation.

1 - I/O is complete. The "ioreq" has a "completion" code field.

2 - Device is busy.

At completion of a delayed operation the Notify Key is invoked with the following order codes.

1 - I/O is canceled.

2 - Select operation notification of ability to do I/O

  KC (DEV,DEV_CancelIO) STRUCTFROM(ioreq) KEYSFROM(memory,notify) STRUCTTO(ioreq) RCTO(rc)

Cancel a pending I/O operation. This is treated as a normal I/O operation in that its completion is noted either by a direct return code of 1 (not implemented) or direct return code of 0 followed by a Notify key invocation indicating the cancel is complete. The Keeper may receive notification of the I/O Complete before the CancelIO completes and it is the Keeper's responsibility to prepare for this sequence of events.

Keeper

The Keeper is described by the various Keys (each having different databytes) that designate it. Each invocation of these keys results in the Keeper performing some function or changing the state of the UNIX process, followed by the Keeper becoming "available" for the next invocation.

For instance, when the Keeper starts an I/O operation as the response to a UNIX process I/O request, it becomes available awaiting the results of the I/O. It behaves this way so that it can always receive a signal from another task and take some action. If I/O is in progress when a signal is received the I/O is canceled before normal signal processing takes place.

When the Keeper must become "available" and the UNIX process is not supposed to run, then the Keeper saves the Restart key for use when the UNIX process should resume running.

The Keeper has 9 different start Keys, several of which have more than one order code. In the documentation below, the order codes will be summarized under the Key description. In the descriptions below the presence of an RCTO() specification indicates that the Key is usually invoked with a CALL instruction. When the RCTO() specification is not present the Key is invoked with a FORK or RETURN instruction.

  KC (FAULT,oc) STRUCTTO(process-state) RCTO(rc)

FAULT is called for all process faults (arithmetic faults, system calls, etc.) and the "oc" value indicates what kind of fault it is. For system calls, the process-state information is used to determine the system call and the parameters. The process is ready to execute the instruction following the system call when the Restart key is used to restart the UNIX process. The process-state information is altered by the Keeper before restarting the task via:

  LDEXBL (userdom,Domain_PutRegistersAndControl) STRUCTFROM(process-state) KEYSFROM(,,,Restart)

  RETJUMP()

The fault Key is invoked via a simulated CALL instruction by the kernel. While it appears to the Keeper that it is replying to a CALL instruction, in this case the return code is not used. When a System Call cannot be processed immediately because it involves doing I/O, the UNIX process is marked as waiting for I/O when the Keeper invokes the Device Driver key to start the I/O. When the I/O is complete (or when a signal comes in while waiting for I/O) the UNIX process is taken out of I/O wait and restarted. Disk I/O is accomplished synchronously as it is anticipated that this will happen quickly and the user is not interested in processing a signal during disk I/O.

  KC (SynDomain,oc) KEYSFROM(x) KEYSTO(x) RCTO(rc)

Since the Keeper is the Domain Keeper for the UNIX process, it must also act as a Synthetic Domain key for the process domain. This is so that any Object (such as debugger) that uses the process Domain key will invoke the Keeper instead. The Keeper stops the invoker from removing the Keeper from its Domain Keeper role.

  KC (SIGNAL,sig) STRUCTFROM(usecount) RCTO(rc)

This Key is called by other Keepers who wish to send a signal to the process that this Keeper is keeping. The order code is the signal number. The sender includes the current use count of the process table entry of the receiver of the signal. This allows the receiver to recognize obsolete signals (signals that arrive so late that the keeper is being used for a different process than the one that the sender wanted to signal).

  KC (IOStarted,oc) STRUCTFROM(ioreq)

IOREQ may be updated by the Device Driver.

OC = 0 The I/O has been started.

OC = 1 The I/O was completed immediately.

  KC (Notify,oc) STRUCTFROM(ioreq)

OC = 0 I/O completed

OC = 1 I/O canceled

OC = 2 Select Notification

  KC (TimerReturning,0)

A Wait object has returned indicating that a requested timer has expired. The task may no longer be waiting for this event.

  KC (SelectTimerReturning,0)

The Wait object that was invoked as a result of timing out a select() operation has expired. The task may no longer be waiting for this event

More about Device Drivers

The interface between a UNIX device driver and the UNIX kernel is nominally well defined. In KeyTECH each device driver is compiled and linked with a library that meets that interface so that the device driver software believes that it is installed a UNIX kernel, when in fact the device driver is inside of a Device Driver object. There is an Object for each Major Device Number in the UNIX configuration. When the UNIX system is initiated, the Device Driver objects are created (from their respective Factories) and the Start key to each driver is put into a directory using the Major Device Number as the name of the Key. When a device is opened with an open(), the Start key to the Device Driver object is fetched from the directory and put into the shared supernode with its slot number in the Open File Table (where the Segment key for a regular file would be). The Keeper uses the Key to initiate I/O.

Most Device Driver objects have two Domains. The "top half" of the device driver runs in sthe "main" Domain and the "bottom half" of the driver runs in a second Domain that CALL Microkernel DEVICEIO keys waiting for interrupts. The bottom half Domain will CALL the main Device Driver Domain to notify it of the interrupt. The structure here is very much like the Top Half and Bottom Half structure of a UNIX driver in the kernel except that there is a "bottom half" domain for every minor device number. The Driver Object need not worry about interrupts as all Domains run single threaded all the time (both Top Half and Bottom Half).

Each Device Driver object may be linked with several UNIX device drivers. This happens when the TTY driver uses the UART driver. One of the functions linked with each UNIX device driver is a Device Switch Table (CDEVSW) for used by the driver software. Only those entries appropriate to the driver will be filled in.

Kernel services such as cpass(), copyin(), copyout() are provided by the library that uses KeyTECH mapping functions to obtain access to the process memory. The Process Table Segment is made available read-only so that those UNIX drivers that need to access the Process Table Entry may do so directly. When a driver is multiplexing several devices (such as a TTY driver) the driver may do a SLEEP() function. The library linked in with the driver takes care of this function by making the Device Driver object "available". When the Domain waiting for the interrupt invokes the central object, the interrupt portion of the device driver runs and does a WAKEUP() call which sets the Object up to resume execution after the SLEEP() (SETJUMP and LONGJUMP are used for this with a jump buffer for each sleep() channel)

Analysis of various UNIX operations

This section describes the basic control flow for selected UNIX system calls. Many of the details are omitted from these descriptions so that an overall behavior pattern can be observed. Specific descriptions should refer to the Keeper code until the Keeper Logic Manual is produced. Calls to be described are Simple system calls, open(), File I/O, Device I/O, and Signals.

Simple System Calls

In BSD 4.3 (and other UNIX systems) system calls are traps to the kernel. One of the hardware's privileged instructions (usually a TRAP instruction) is used with the register contents indicating the type of call and the parameters. The KeyNIX system is binary compatible with this approach. The TRAP instruction is reflected to the Keeper by the Microkernel.

GETPROCID, PUTPROCID and, GETTIMOFDAY are examples of Simple System Calls. The UNIX process sets up some registers to describe the call and executes some form of TRAP instruction. The Keeper receives a Message via the FAULT key. The Keeper selects the proper subroutine using a switch(sysnumber) statement, calculates the correct response (or alters some structure like the Process Table Entry), alters the registers (and/or moves data in or out of the user's memory), and replies to the message.

Open()

Open() represents one of the more resource consuming operations of the Keeper. The open subroutine resolves the name presented to an Inode key for the requested file. This operation involves repeated invocations of Inode and Diracc keys until the proper Inode key is isolated.

The Inode key is put into the shared supernode and the Open File Table Entry is updated to reflect the slot number. The Inode key is invoked to open the Inode. If the Inode was for a regular file, the Segment key for the file (returned by the Inode open invocation) is put into the shared supernode and the Open File Table Entry is updated to reflect the slot number of the Segment. If the Inode is for a device, then the Device Directory is used to fetch the Device Driver key based on the Major Device Number. This Key is put into the shared supernode and the Open File Table Entry is updated to reflect the slot number of the Device Driver.

Pseudo devices like Sockets are not "open()ed" but there is a corresponding function like "create()" which results in the proper Key being in the shared supernode and the Open File Table Entry properly updated.

After the open() completes the FD number is put into the process-state and the process is restarted.

File I/O

Read and write calls result in memcpy() call to move the data between the user process memory and the file Segment. The Keeper maintains a small cache of windows over file Segments (based on the Open File Table Entry address) and must check that the correct file Segment is in one of its windows before doing the move.

When one of the remote file systems is installed, file I/O will be treated much the same as device I/O.

Device I/O

Read, write, and ioctl calls result in putting the UNIX process into the I/O wait state and invoking the Device Driver key to initiate the I/O. When the Device Driver object notifies the Keeper that the I/O is completed, the process is restarted. The Device Driver is responsible for moving the data to and from the user's memory.

The Device Driver object is not invoked directly because it may be in the process of notifying the Keeper of some completion. This condition arises when the Keeper receives a signal notification while waiting for the Device Driver to notify the Keeper of the completion of the I/O. The Keeper must invoke the Device Driver object to cancel the I/O and the Device Driver Object may be notifying it of completion. Select functions also can cause the same effect. The Keeper may initiate an I/O operation at the same time that the Device Driver object is notifying the Keeper of a select() completion.

Because of the situations described above, the Keeper never invokes the Device Driver object directly. The Keeper always uses an IOF object. The IOF object is invoked with a FORK instruction and passed the IOStart key as the fourth Key in the Message. The IOF object will likewise use a FORK instruction to invoke the Device Driver object, with the IOStart key still as the fourth Key in the Message. Then, when the Device Driver object invokes the IOStart key, the Keeper also knows that the IOF object is available for another use. This use of an IOF object effects queuing of messages sent to the Device Driver object at the cost of only one extra task switch. Each IOF object holds a single message with the Microkernel queueing the message invocations rather than the message.

Signals

UNIX process A executes a kill() system call. A's Keeper locates the Signal key for process B by looking in the process table entry for B and using the signal slot number to fetch the Key from the shared supernode. A's Keeper does not invoke this Key directly as B's Keeper may be sending a signal to A. Instead A's Keeper uses a FORK object to deliver the Message so that A may become available immediately.

If B is not doing I/O then the B's Keeper uses the Domain key of the UNIX process domain to stop the process and dispatch the signal handler. The complete status of the task is pushed onto the stack for use when the signal handler returns. Since the signal handler may do I/O or other system calls, this complete status cannot be saved in the usual places. When the signal handler is dispatched, the return address is to an address above the user's stack but still in the address space of the UNIX process. A single sigret() call is at this location.

If B was waiting for I/O when the signal was received, the I/O operation must be canceled before the signal is processed. The Device Driver object involved in the I/O is called (using and IOF object) to cancel the I/O, and when this operation completes the signal handler is dispatched as above. Care must be taken to avoid confusion if a second signal request comes during the I/O cancel operation.

When the signal handler returns, the sigret() system call causes B's keeper to restore the process state from the information on the user's stack and restart any I/O that needs to be restarted. If I/O is restarted, the process goes back into I/O wait; otherwise the process is given an EINT error return.

Performance comparison

This section compares the performance of MACH 2.5 on the Luna88K with KeyTECH/KeyNIX on the Luna88K. Several performance problems were identified from the measurement of the performance of this prototype and several anticipated problems were confirmed. A production version would have two major changes that will improve the performance a good deal.

The prototype does not have a fancy loader. Each time a program is loaded, the object file is read from the file system into memory. Programs are not shared. Similarly on a fork() operation the entire contents of the process memory is copied. These simplifications were made to shorten the development time of the demonstration. The performance implications of using the simple loader are well understood and the use of a smart loader will improve FORK() and EXEC() performance.

The use of a File System Object will improve the open()/close() times enormously. It was felt important to understand the impact of the totally Object-oriented design. The good news is that the totally Object-oriented design demonstrates adequate performance commensurate with the other popular microkernels. The other good news is that the use of a File System Object will make the KeyTECH/KeyNIX implementation faster than the other popular microkernels.

Simple System Calls

10000 calls to GETPID were timed.

This ratio will be maintained for all simple system calls as the major expenses are the two context switches. It is possible to alter the implementation to reduce a simple call to a single context switch by putting some of the processing of system calls into the process address space (MACH 3 does this). However, if as much as 10% of the CPU time is spent in simple System Calls, the application will run only 15% slower with two context switches.

Open/Close

1000 calls to open() and close() were timed. The file that was opened was local to the current directory.

The use of Inode objects has an impact here. Even though this result is commensurate with at least one of the popular microkernels UNIX has a tendency to use lots of little files and this 4 to 1 ratio might have an impact. If 10% of the CPU time is spent opening and closing files the application will run 30% slower. The use of a File System Object will improve the performance of KeyNIX to a level better than MACH 2.5.

FORK and EXEC

100 FORK/EXIT pairs and 100 EXEC calls were timed.

MACH 2.5	10 FORK()/EXIT()/second	13 EXEC()/second
KeyNIX	64 FORK()/EXIT()/second	151 EXEC()/second
Ratio	6.4	11.6

The fact that the Microkernel is doing all process and memory management shows to great advantage here. The Keeper manages a pool of processes all ready to be initialized when a FORK() is processed. The work of copying the memory and registers of the process and starting the new process is small compared to the creation of a new process and address space in MACH 2.5 MACH 2.5 has additional work over BSD 4.3 because of the need to set up initial ports for the new process.

Since these two functions are at the heart of UNIX function, these speed advantages will balance well with some of the functions that are somewhat slower.

SBRK

100 sbkr(4096) and sbrk(-4096) were executed with a fetch of a byte from the newly allocated memory. The fetch of the byte forced each system to actually allocate the main store for the page.

The optimized memory management software in the KeyTECH Microkernel again shows great advantage here.

PIPE bandwidth

One megabyte of data is passed through a pipe to a daughter task and back in 1000 byte chunks (a total of 2 megabytes of data transferred).

This result is completely commensurate with the other microkernels. Some small improvement is expected if a new primary Object (a queue) is developed.

Disk File I/O

In this test a file is created and then written and read repeatedly. The I/O model of KeyNIX and MACH 2.5 are so radically different that other comparisons are very difficult. KeyNIX, for instance, never writes to disk as a direct result of writing in a file. All writes to the disk are part of the paging, checkpoint, and migration system. Uncached disk read times are dominated by the actual physical disk I/O time and are of little use. Since many programs use disk files as scratch files (compilers, editors, etc) this test is a reasonable comparison of the two systems. The times reported are the elapsed time to write and then read a one megabyte file ten times.

If this test is run while KeyTECH is doing a checkpoint and migration, the KeyNIX time is extended to 4.4 seconds. This I/O performance is many times better than either of the other microkernels and makes the overall performance of many applications better under KeyNIX than under a "standard" system.

X-Window operations

Several of the common operations in X-Windows are timed using a stopwatch. One observes that the timing differences are often well within the timing errors of the stopwatch method. The timing of XINIT (opening three Xterm windows) is heavily impacted by the lack of look-ahead while paging in program files. The CPU is only 50% busy most of the time during the XINIT process. Paging look-ahead is easy to add and will improve this timing comparison.

XINIT: MACH 2.5 - 2.0 seconds
KeyNIX - 1.5 seconds
Ratio 1.3
New Window: MACH 2.5 - 2.0 seconds
KeyNIX - 1.5 seconds
Ratio 1.3
TWM Restart: MACH 2.5 - 2.2 seconds
KeyNIX - 2.0 seconds
Ratio 1.1

Running some Scripts

Two scripts were prepared to exercise a mix of system functions. These scripts contain loops and the report is the number of loops per minute. SCRIPT is a series of pipe operations that moves larger and large volumes of data through about five pipes. SCRIPT1 does not involve pipes, but moves a file using CAT several times and uses WC to count the words in the file. Other miscellaneous commands are executed within the loop of each of the scripts.

The timings of the scripts gives a better picture of the performance comparison under more typical loads.

SCRIPT (lots of pipes): MACH 2.5 - 9 iterations/minute
KeyNIX - 23 iterations/minute
Ratio 2.5
SCRIPT1: MACH 2.5 - 8 iterations/minute
KeyNIX - 32 iterations/minute
Ratio 4.0

Performance Summary

The overall performance of the KeyNIX system is quite comparable with MACH 2.5. Some operations are slower and some quite a bit faster. A user using X-Windows doing VI and using a variety of shell commands and scripts is unaware of any significant performance difference between MACH 2.5 and KeyNIX. When paging look-ahead is installed, the performance of KeyNIX should always be noticeably faster than MACH 2.5.

Some Ideas about alternate Implementations

There are several ideas that come to mind to simplify some of the operations of the Keeper and to cut down on some of the overhead. Each of these Ideas represents a compromise in the use of Objects and multiple instantiation.

Using a Segment to simulate a Disk Drive

This change would restore the entire UNIX file system to the KeyNIX kernel emulator. The UNIX file system would still be protected by the system-wide checkpoints but would not use Objects for Inodes and Files.

This is part of the solution to supporting existing UNIX file systems. One would want to access existing file systems read/only so as to preserve the restart capability of the KeyNIX system.

This approach trades the simplification of code and cost of (Object) context switching with the complexity of code and the cost of executing that code.

Represent Inodes as Nodes

Representing an Inode as a Domain means that Inodes occupies a page of disk space. This may be viewed as excessive. A Node has sufficient storage to maintain all the state data required by an Inode and uses only 1/13th of a page of disk space.

In order to serialize access to the information in an Inode one would use separate Inode manipulation object that is passed the Inode and asked to do things like modify the open or link counts. This introduces an unwanted serialization which could be avoided by adding some operations on a Node (such as AddOneToDatakeyInSlot).

If the Inode manipulation object is the Keeper of the Inode (much like Tape Keys) then this change can be added transparently.

The Inode server could store Inode data in a common segment indexed by Inode number. The only disadvantage of this is that the page faults would serialize the server. Using a Red Segment node guarantees that the Node is in memory when the Keeper tries to read the Inode data from it. Symbolic links can be larger than 176 bytes and may have to be small files when they get this large or else use special logic in the Inode server for this case.

Use a Process Table Manipulation object

The current process table is an array of process table entries. The process number is the table index. Process numbers are reused. This has certain problems in the human interface for system maintenance. Also there are circumstances when process table entries should be chained so that children can be located more quickly. This is best handled by introducing a process table entry manipulation object that allocates and chains process table entries. The Keeper can still reference its own process table entry but may want to access other process tables entries (to obtain a signal Key) using the process table manipulation object.

Use a File Table Manipulation object

A much weaker argument exists for making the Open File Table a structure manipulated by a server Object. There are no chains. Count fields can be manipulated by CompareAndSwap operations either using the hardware or Microkernel support.

Use a Memory Object

At this time the Keeper is using a Fresh Segment for the memory of the UNIX process, and the Keeper is Loading programs into this memory and copying the memory on a fork(). The next generation of KeyNIX will use a Memory Object that understands Loading with maximum sharing, Forking with sharing based on Ranges, and Mapping for mmap() functions. This will increase performance for fork() and execv() as well as isolate the Keeper from the different memory models of the various UNIX systems.

Small File Objects

The data for small files can be kept in Nodes. A small file might be a Node of up to 16 Nodes each holding 176 bytes of data. When the 17th Node is required the file is converted to a Fresh Segment. The Inode would Build a Fresh Segment for the file while it is open. At the last close the file would be put back into n-Nodes if it is small enough. This might be costly for 8 Node files.

An alternative is to have small files all share some large Segment. Again at open() time the small file data would be read out of the Segment into a Fresh Segment. At close time the file would be copied back. For files of up to one Page this is fine. Files greater than one Page in size will be in units of Pages.

Fresh segments each have a keeper unless they are "sibling" segments. One Keeper probably suffices for all Segments in a file system because no I/O is done to allocate new pages.

File System Object.

A File System Object combines the ideas of using a segment as a simulated disk with the ideas of using Objects for Inodes. A File System Object is something like the File Manager tasks of CHORUS and MACH 3. Each Keeper has an instance of a File System Object for each File System that is mounted. All the File System Objects for a mounted File System share a segment in which the Inodes are stored. A shared supernode is used to store the Segment keys to the file Segments. A second shared segment is used to hold the data of small files.

When a File System Object is called to open() a file, a Segment key to a Fresh Segment or a Read Only segment is always returned. Small files are copied into Fresh Segments by the File System Object. The caller of the File System Object need not be a UNIX Keeper and need not know anything about UNIX other than the fact that a string of characters represents a file name.

A File System Object has operations to import and export segments for use in a secure environment and supports operations like Remove, Create, Rename, etc.