This describes the release 1.2 kernel (the “new item space” kernel).

KERNEL DESIGN PRINCIPLES

Overview

The kernel runs in privileged mode, unswapped, with DAT off, and with interrupts disabled.

Some tables are allocated at IPL time depending on the size of real memory.

For efficiency and other reasons the logical state of kernel implemented objects is represented with some latitude. The real representation must always suffice to efficiently compute the unique logical state but many different real states may represent the same logical state.

More plainly, there is more data in the kernel than strictly necessary to code the logical state. (prepkeys) describes how keys that designate pages or nodes may be prepared for rapid use. (prepnode) describes how nodes may be held in the kernel so as to expedite their expected use.

Design by Assertions

There are many assertions about the state of the kernel’s working storage. Many of the assertions are merely data structure descriptions in a new form. Others are more subtle. Most of the assertions concern structures designed for efficiency. Some of these were carefully written down (in this manual) before we coded and some were formulated afterwards. This is how we designed and understand the kernel.

Many kernel algorithms are immediate consequences of the assertions.

The assertions are frequently expressed formally in this manual. They may also be expressed informally but we have had some trouble with imprecision of informal expressions. Informal expression serves well to introduce an idea, however.

No Stuck Resources

A general principle is that any real resource be reclaimable between atomic actions. This means that such a resource is never locked awaiting the action of some domain program.

Domains with processes, however, are stuck in core. The decongester will remove this exception when we implement it. An exception to this exception is that processes are recorded in checkpoints by special kernel logic.

THE REAL THINGS

The Gnosis kernel manual {(p1,external)} describes relations between nodes, keys, and pages. Except for performance considerations, the information in the Gnosis kernel manual is all that is needed to explain the environment that the Gnosis kernel provides to a collection of programs and is thus sufficient for the design of programs that run under the Gnosis kernel.

A program in Gnosis can sense and act upon nodes only via kernel mediation. There are real keys and nodes but their representation is a matter internal to the Gnosis kernel. This manual is about those real things.

The use of the terms “node”, “key”, and “page” in the Gnosis manual is slightly at variance with the use in this manual -- this manual refers to the real representations where the Gnosis manual refers to the logical entities. It is necessary to understand the properties of the logical entities in order to understand the motivation of the designs described in this manual, for these designs are merely to support the logical entities efficiently. The terms “logical node”, “logical key”, and “logical page” will be used here to mean the same thing as those terms in the Gnosis manual without the qualifier “logical”.

Real nodes, keys, and pages have attributes that their logical counterparts lack. Primary among these is their physical location. A key always lives in its node, but nodes and pages can be in a variety of states such as: in core, on drum, on disk, combinations of above, having channel queue entries to make copies on another storage level.

See (cda-depend) for information on which CDA’s the kernel knows about.

THE KERNEL’S USE OF REAL MEMORY

An Operational View of Virtual Memory

As a program runs under Gnosis, all core {main storage} references are mapped from the virtual address that the program produces to a real address that Gnosis controls. Gnosis uses this mapping as its ultimate control of access to data stored in pages. A program can access whatever is in its virtual address space.

To run a domain, the kernel creates on demand segment and page tables by examining the memory tree that is rooted in the domain.

Due to the limited size of main storage, Gnosis keeps only recently accessed data in real main storage and brings in other data that the program is entitled to access only when the program actually references it. This fact is relevant to the programmer only for performance considerations and effects on the rate at which meters run. See (p1,chrgset) for more information about charge sets.

The kernel knows which pages the program can access by consulting the “address segment” that is designated in slot 3 of the domain root of the current process. The address segment indicates what page is designated by a given virtual address.

Address segments are constructed recursively of pages and nodes. The keys used to construct address segments {segmode keys} represent authority to access data. The data accessible by such a key is called a segment. These keys may be passed about and occur in more than one segment tree.

This chapter is about the format, meaning, and use of the segment.

Core Page Frames

See (packs) about nodes and pages on the disk.

There is a one-to-one correspondence between core table entries {(coretable)} and the dynamically pageable core page frames.

Not all of these page frames are devoted to holding logical pages. Some are empty {and chained off of an anchor called FREELIST in WSPACE}, and some are used for the following miscellaneous uses:

Core table entries chained from APAGCHHD with CTPOT bit on and CTALLOCATIONPOT bit off have storage key 0 and are used as a cache of page sized node collections {nodepots} from the disk.

Core table entries chained from WETHNPHH hold nodepots and have storage key = 3.

Core table entries chained from WRTAPHCH hold allocation pots {flags and allocation counts for up to 819 contiguous pages in a disk range} and have key = 5.

Some pages are not chained and are used by the migrate logic or checkpoint logic. These have a storage key = 3. Checkpoint uses at most (number of checkpoint devices) + 1 of these. The number used by migration is a function of the external migrator; see (ext-migrator). These pages have CTIOCOMPLETECOUNT not equal to zero between units of transformation.

Contents of Memory

Real memory is divided into the following logical sections {from small addresses to large addresses}. The storage key access-control bits are indicated thus: “{key n}”.

(0:X'FFF') Page Zero {key 3}

Page zero is used for the hardware defined functions such as interrupt information {see 370 Principles of Operation}. Because page zero can be addressed without a base register, it is also used for constants and variables that must be accessed easily or with high performance. These are primarily subroutine save areas and work areas, but also include some variables shared between modules. These are described elsewhere as appropriate. {See (linkage).}

Important cells in Page Zero

PZDOMAINDIBP: While a CPU executes a domain this cell remembers the address of the DIB of that domain. When a domain traps and thus causes the kernel to run, this cell locates the actor.

(X'1000':X'1FFFF') CMS {if any}

(X'1000' or X'20000') KERNDEF {key 0}

Code of running Kernel {key 0} Loaded from KERNCODE EXEC

Kernel DDT {if any} {Keys 8 and 9 set by DDT. The partial page at the beginning has key 0, and the partial page at the end has key 3.}

Storage defined in assembly file WS {key 3}

Miscellaneous counters

Miscellaneous Queue Heads

These queues are of hook keys. Since a hook key is always in slot 13 of a node, the corresponding node is easily located. The queue heads are segregated from node space, since hooks will sometimes point to these queue heads and sometimes to nodes, and it is necessary to distinguish.

The following queue head names are external symbols.

The queue heads now defined are: (See (queue) also.)

CPUQUE queue of domains waiting for the CPU.

FCPUQUE queue of domains waiting for checkpoint or migration.

MIGRWAIT wait queue for the external migrator while it waits for migration to be needed.

KROQUE queue of domains waiting to write on kernel read only pages.

MIGRTCZ wait queue for external migrator while migration in progress.

WORRY wait queue for domains being worried about. See (worry).

JUNK queue for domains that are malformed or have a nonzero trap code and a domain keeper key that is not a gate {no server}.

RGUNAVL queue of domains wanting pages or nodes that are not mounted.

NOPAGES queue of domains waiting for page frames to become available.

NONODES queue of domains waiting for node frames to become available.

NOIORQBS queue of domains waiting for I/O request blocks to become available.

NOTCCWBS queue of domains blocked on no device I/O request blocks.

IOQUEUES array of queue heads for domains waiting for disk I/O.

BWAITQ array of queue heads, one per BWAIT key.

LASTQUE first address after queues.

Work space for memory tree evaluation {(memalgo)}

Timer queue elements for the BWAIT keys {(bwait)}

Work areas for I/O system modules {key 3} Loaded by KERNWORK EXEC

(TABEND) Label defining end of work areas

The next part of memory is overlaid:

During initialization:

(INIT:KERNEND) Initialization Code and Data {key 3} Loaded by KERNCOD2 EXEC

Padding to page boundary

After initialization:

Padding to page boundary

(AMEMORX) Pageable Core Frames {key 0 or 1 for logical pages, key 3 for temporary kernel work areas, and key 5 for allocation pots.}

(to AEMEM) Pageable Core Frames {key 0 or 1 for logical pages, key 3 for temporary kernel work areas, key 5 for allocation pots, and key 6 for node pots}

(ADEPSPAC:AENDDEP) Space for DEPEND {key 3}

(NULSPPT:AENDSGTB) Segment Tables {(memalgo)} {key 3}

(APAGCHHD, PAGCHHDM) Page hash chain heads

(ANODCHHD, NODCHHDM) Node hash chain heads

(DIRBASE:DIREND) Directory entries

DIR# has the number of directory entries.

(APT:APTEND) Page Tables {key 3}

Here PTAGE is the right 3 bits of PTSTATE in the page table header. A page table is in one of the following states:

On the free list

Chained off PTFRAMEHD through PTP. PTAGE=0. PTPRODUCER has NIL*. PTNEXT has 0. FREEPAGETABLE=1. No segment tables point to it.

Valid

PTAGE=0. PTPRODUCER -> Producing node frame or CTE.

Ageing

0 < PTAGE < 7. PTP has all invalid bits on. PTPRODUCER is not NIL*. Even invalid PT entries locate correct page unless entry = 0008. They are “xvalid” according to (ptev).

Forsaken

PTAGE=7. PTP has all invalid bits on. PTPRODUCER may be NIL*. Segment tables may point to it.

When new page tables are needed and there are none on the free list, forsaken page tables are the first ones to be “stolen”.

FREEPAGETABLE=1 only for tables on the free list.
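The four states above can be told apart from PTAGE and FREEPAGETABLE alone. The following Python sketch is only an illustration of that classification, not kernel code; the function name and the use of a boolean for FREEPAGETABLE are assumptions made for the example.

```python
def page_table_state(ptage: int, free_flag: bool) -> str:
    """Classify a page table from PTAGE (0..7) and FREEPAGETABLE."""
    if not 0 <= ptage <= 7:
        raise ValueError("PTAGE is a 3-bit field")
    if free_flag:
        # FREEPAGETABLE=1 only for tables on the free list, where PTAGE=0.
        if ptage != 0:
            raise ValueError("a free page table has PTAGE=0")
        return "free"
    if ptage == 0:
        return "valid"
    if ptage == 7:
        return "forsaken"
    return "ageing"          # 0 < PTAGE < 7
```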

(ARNDIBS:AENDDIBS) Derivative Information Blocks {key 3}

The first DIB is used only for the idle process. Only the following fields of this DIB are valid: GENREGS, FLTREGS, PSW. READINESS has zero. CACHE has X'7FFFFFFFFFFFFFFF' minus the time spent in the idle process since the last restart.

(ASTOT:ASTOTEND) Segment Table Origin Table {key 3}

(AFRSTNOD:ANODEEND) Node Space {key 3}

Pointer Architecture

When a key that designates a node is prepared, it directly locates a node in item space by virtue of holding the address of the node. In order to re-allocate a node frame, it is necessary to find all prepared keys designating that node. The prepared keys that designate node X are chained together on a doubly linked list through the keys and through the backchain head in X. This backchain head is in X’s node frame. See (install).
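The backchain can be pictured as a circular doubly linked list whose head lives in the node frame. The Python sketch below is a minimal model under that assumption; the class and function names are hypothetical, and object references stand in for the real addresses.

```python
class Link:
    """Anything on a backchain: the head in a node frame, or a key."""
    def __init__(self):
        self.left = self
        self.right = self

class NodeFrame(Link):
    """Hypothetical node frame; its Link fields act as the backchain head."""

class Key(Link):
    def __init__(self):
        super().__init__()
        self.subject = None          # set only while the key is prepared

def prepare(key, node):
    """Link key onto node's backchain and point it at the node."""
    key.subject = node
    key.left, key.right = node, node.right
    node.right.left = key
    node.right = key

def unprepare(key):
    """Unlink key from whatever backchain it is on."""
    key.left.right = key.right
    key.right.left = key.left
    key.left = key.right = key
    key.subject = None

def prepared_keys(node):
    """Walk the backchain from the head until it comes back around."""
    k = node.right
    while k is not node:
        yield k
        k = k.right

def release_frame(node):
    """Unprepare every key designating node so the frame can be re-allocated."""
    for key in list(prepared_keys(node)):
        unprepare(key)
```

Because every prepared key is on the chain, re-allocating the frame costs one pass over the chain rather than a search of all of item space.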

The DSECT (NODEFRAME) has fields that:

indicate the mode of preparation, if any (NFPREPCODE),

hold the coded disk address of the node (NFHOMEADDRESS),

hold the allocation count (NFALLOCCNT),

provide the two ends of the doubly linked list of prepared keys that designate this node (NFLEFTBACKCHAIN, NFRIGHTBACKCHAIN),

locate the DIB, if any (NFDDIBOFS, etc.),

link this node to the next node with the same CDA hash (NFHASHCHAIN),

provide two locks and an age for the node (NFPREPLOCK, NFCORELOCK), {See (corelock)}

and provide an invocation count for domains (NFCALLCNT).

Flags in Real Nodes (NFFLAGS)

A flag (EXTERNALQUEUE) indicates that another domain, which was waiting on this one, was removed from item space {by the decongester (decngst)} in order to relieve congestion.

A bit (NFGRATIS) indicates that this node is (_gratis.) {This attribute is logical.} See (p1,gp) and (p2,nrange).

(NFDIRTY) means that there is no copy of this node in non-volatile storage. This bit is set when a node is prepared as other than a segment node. It is also set when a slot is stored into, in an unprepared node or a node that is prepared as a segment node.

A bit (NFALLOCATIONCOUNTUSED) indicates that there has existed an unprepared key {other than an exit} designating this node whose allocation count is the same as the allocation count of the node. See (usedcountbits).

A bit (CALLCOUNTUSED) indicates that an unprepared exit key to this node has existed with the current call count. See (usedcountbits).

The (REJECT) bit is turned on when another process tried to enter this node when it was busy. It will be on when there is an internal stall queue for this node.

The bit (DRYMETER) indicates that there are no caches in DIB’s below this meter. Its use is private to the SCAVENGE routine.

The bit (NEEDSCLEANING) means that the node has been selected for cleaning. The use of this bit is private to the swapper.

(ANODEEND) The node chain terminator. {key 3}

(ACORETAB:APAGEEND) Core Table {key 3}

The order of the entries in the core table is the same as the order of the page frames to which they correspond. Each entry contains the real address of the page frame to which it corresponds.
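Because the two arrays run in parallel, an entry index and a frame address can be converted into each other by arithmetic. The sketch below assumes 4096-byte page frames and a known address for the first pageable frame; both the constant and the function names are assumptions for illustration.

```python
PAGE_SIZE = 4096   # assumption: 4K page frames

def frame_address(first_frame: int, index: int) -> int:
    """Real address of the page frame for core table entry number index."""
    if index < 0:
        raise ValueError("index must be non-negative")
    return first_frame + index * PAGE_SIZE

def entry_index(first_frame: int, real_address: int) -> int:
    """Core table entry number for the frame containing real_address."""
    offset = real_address - first_frame
    if offset < 0:
        raise ValueError("address is below the pageable frames")
    return offset // PAGE_SIZE
```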

The DSECT CTCORETBEN (defined in CORETBEN MACRO) describes the format and contents of the core table entry. Fields in the core table entry are:

4 bytes: Head of chain of prepared keys that designate this page

4 bytes: Tail of chain of prepared keys that designate this page

If CTPOT=0 {a page}, the following fields are present:

CTCDA (6 bytes): coded disk address

CTALLOCATIONID (4 bytes): allocation id.

If CTPOT=1 {a pot}, the following fields are present:

CTPOTADDRESS (4 bytes): swaploc (range and offset in range) of pot {not used for pots to be written to the swap area (dirty pots)}

CTPOTFLAGS {flags for pots}

CTALLOCATIONPOT {page is an allocation pot, otherwise a node pot}

A reason for having the pot in memory (used for measurement) as follows:

CTNODEPOTTYPEHOME - The pot was fetched from a home range (not for migration).

CTNODEPOTTYPESWAPCLEAN - The pot was fetched from the swap range

CTNODEPOTTYPESWAPDIRTY - The pot was created in memory as part of the node cleaning action.

CTNODEPOTTYPEMIGR - This flag may be added to node pots marked CTNODEPOTTYPEHOME or CTNODEPOTTYPESWAPCLEAN to indicate that the pot was read by migration.

1 byte: CTNODEPOTNODECOUNTER {number of current nodes in node pot to be written to swap area, unused in other node pots}

4 bytes: CDA-hash chain through item space

CTFLAGS (1 byte): contains the following bits:

STORKEYISZERO

CTALLOCATIONIDUSED {See (usedcountbits).}

CTPOT {This page is an allocation pot or a node pot.}

CTBACKUPVERSION {This page is a backup version, not the current version}

CTKERNELREADONLY {This page is kernel-read-only.}

CTGRATIS {This page is gratis.}

CTCORELOCK (1 byte): lock byte. Contains the number of reasons that domain actions require the page to remain in memory. See (corelock).

CTIOCOUNT (1 byte): Number of cleaning and journalizing operations in progress for the page. {These both have the goal of clearing the dirty bit.}

For cleaning, each copy counts for 1, so cleaning can contribute 1 or 2 (or 0) to this field. Journalizing can contribute 1 (or 0). So the maximum value of this field is 3.

CTEXTENSIONFLAGS (1 byte): contains the following bits:

CTMARKCLEANOK - Routine that builds a write channel program should reset the “changed” bit in the hardware key field associated with this page.

CTONCLEANLIST {This page is on the clean-list.}

CTKERNELLOCK - Locked in memory by the kernel. There are three uses of this bit {these uses never conflict}:

During checkpoint, to reserve the page for GCKPT’s private pool or checkpoint header.

During migration, for various purposes by GMIGRATE.

When a page or pot is being fetched, to lock the page frame in memory during reading.

CTWHICHBACKUP - If CTBACKUPVERSION, this bit tells whether the page is the backup version or the next backup version.

CTDEVICELOCKCOUNT (1 byte): Number of device I/O channel programs using this page.

CTREALADDRESS (4 bytes): Real address of the page frame this coretable entry is associated with

CTDIB (4 bytes): Head of the list of {degenerate} page and/or segment tables produced by this page.

(APAGEEND) The page chain terminator. {key 3}

AERMEM contains 1 + the address of the end of the last page of memory that Gnosis will use.

If CMS was used during initialization, its storage follows.

Design Notes

The design of the kernel is based on the idea of segregating storage into permanently allocated contiguous areas, each of which holds objects of just one type. Each of these areas is used as a table. We sometimes call the entries in these tables “frames”. Thus we may speak of a slot or entry in the table of DIB’s as a DIB frame. Some of the more obvious but less profound reasons for this strategy are:

Locators for these objects are indexes and can be very short.

Allocating space for new such objects is usually very fast.

Perhaps the only alternative to this general scheme is to follow the pattern of IBM operating systems {and others as well} and allocate objects {usually called control blocks} according to the logic of a centralized space allocator that is concerned only about the size of the control block.

All objects kept by the Gnosis kernel that come in a variable number are subject to being deleted from real storage independent from the logic of the application which caused their existence. {Nodes with processes are currently an exception for which we have plans and designs to remedy.} There are two general ways that we achieve this situation:

The objects that correspond to externally defined {logical} objects are pages and nodes. Those are swapped out in order to regain the real storage that they occupy.

The other objects are nonessential elaborations on logical entities and can either be reconstructed if necessary, or the requesting module is coded to ensure that the supply does not run out {e.g., the swap area directory entries}.

When we keep objects of one sort in a fixed size array, we have a clear strategy of when to reclaim storage for these kinds of objects, namely when that array is crowded. This may not be an optimal strategy but it is simple and fast.

This is a complete list of these tables:

Core Frames and parallelly allocated core table

Node frames (FRSTNODE..NODEEND)

Segment table descriptors

Segment tables

Page tables

DIB’s {Derivative Information Blocks}

DEPEND entries {(depend)}

REQUEST entries - for a logical I/O operation

DEVREQ entries - for devices that may satisfy a logical I/O operation

CCWBLOK entries - for actual disk channel programs

CLEANLIST entries - for pages that should be cleaned

BADPAGE entries - for disk addresses that could not be read

TCCWBLOK entries - for Device I/O channel programs

CSREP entries - For Simple Charge Sets

The nature of the logic to reclaim slots in these tables for other entries depends on the table. There are several sorts of complication depending on the table.

Some of the entries in the I/O system are used only while an I/O operation is queued or in progress. When there is no room for such entries we enqueue the domain whose action requires the entries. The domain thus waits for entries to be freed by the completion of I/O. There is a special wait queue for this (NOIORQBS). A similar strategy is used for the case in which there are no page or node frames available {(NOPAGES) and (NONODES)}. Running out of page or node frames forces a cleaning operation to make additional ones available.

In each case, for more permanent table entries, there is some attempt to sense which entries are frequently used, so as to avoid reclaiming the space of popular entries.

The segment tables and page tables, for example, are consulted by the hardware, and such table entries may not be re-allocated without issuing the PTLB command. This command is expensive, and so we free up several such entries at once so as to reduce the number of PTLB’s which must be executed.

Segment tables are of greatly varying size, and segment tables larger than the minimum size {64K bytes} occupy adjacent entries in the space reserved for segment tables. See (stot).

There are counters in the kernel to aid our quantitative understanding of the kernel’s operation. Some of these counters may indirectly help in choosing the right size for kernel tables.

These counters tend to measure some activity induced by a lack of frames in some table. In many circumstances, however, making the table many times larger would not significantly reduce these counts. In such cases it is unwarranted to enlarge the table.

DIBSTOLN counts the number of times that a DIB frame was reclaimed for another DIB.

DEPSTOLN reports how many entries in the DEPEND relation were sacrificed to hold new entries.

STCRNCH counts how many times segment tables were sacrificed because the limit on the number of segment tables was exceeded.

STECRNCH counts how many times segment tables were sacrificed due to the total size of the segment tables exceeding the available segment table space.

PTRESCUE is a histogram of the ages of page tables at the point that they were rescued from the page table reclaimer.

This reclaimer searches the page tables cyclically when page table frames are needed. Each page table has a count {right 3 bits of PTSTATE} that is examined. If the count is 0, the page table is invalidated by turning on the invalid bit in each of the page table entries, and the count is set to 1.

If the count is from 1 to 6, then it is incremented.

If the count is 7, the page table frame is reclaimed.

If the page table is accessed by the hardware while the count is greater than 0, the page table entries will be invalid and the resulting page translation exception interruption will then reset the count to 0 and re-validate the entries. The value of the count at the time of the interrupt will be used to index into the histogram to record this event.
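The ageing scheme just described can be simulated in a few lines. This Python sketch is a model of the counting rules only, with hypothetical names; the `age` attribute stands in for the right 3 bits of PTSTATE, and `histogram` for PTRESCUE.

```python
class PageTable:
    def __init__(self):
        self.age = 0          # models the right 3 bits of PTSTATE
        self.valid = True     # False once the invalid bits are turned on

def reclaimer_visit(pt: PageTable) -> bool:
    """One visit by the cyclic reclaimer; True means the frame is reclaimed."""
    if pt.age == 0:
        pt.valid = False      # invalidate every page table entry
        pt.age = 1
        return False
    if pt.age < 7:
        pt.age += 1
        return False
    return True               # age 7: reclaim the page table frame

def translation_exception(pt: PageTable, histogram: list) -> None:
    """The hardware touched an invalidated table: record its age and rescue it."""
    if pt.age > 0:
        histogram[pt.age] += 1
        pt.age = 0
        pt.valid = True       # re-validate the entries
```

A table that is never touched is reclaimed on the eighth visit after it was last valid, so a busy table is rescued long before the reclaimer can take it.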

REQOUT measures the number of times a disk I/O operation could not be queued because the REQUEST block, which describes the logical disk I/O operation, was not available. N.B. The migrator makes heavy use of these table entries.

DEVRQOUT measures the number of times a disk I/O operation could not be queued because a DEVREQ, which describes one of the physical I/O operations which may satisfy a logical I/O operation, was not available. N.B. The migrator makes heavy use of these table entries.

CCWBOUT measures the number of times the disk I/O system attempted to build a channel program for an operation and could not obtain space for it {a CCWBLOK}.

CLEANOUT measures the number of times the pager identified a page that needed cleaning and was unable to record the information in the CLEANLIST.

BADLOUT measures the number of times GBADPAGE had to forget the address of a disk page that could not be read in order to remember a more recent one.

OUTSCLSH measures the number of times a disk I/O operation was requested and there was already an outstanding operation for hash(disk address). This logic is used to avoid reading a page twice.

Segment tables

370 segment tables must be contiguous and may be as long as 8K. They must start on 64 byte boundaries. (stot) describes the data structures that support the management of segment table space. SHOVE, described here, manages that space.

Since STOT entries designate segment tables and are in the same order as segment tables it is easy to find segment table space or make it. The fundamental idea in SHOVE is that segment table space may be managed by shoving segment tables (and thus preserving the work of creating them) and sacrificing them when space cannot be found.

SHOVE was written and debugged in Algol 68 and that code remains as comments in SHOVE.

FREESTOTES is an internal routine of SHOVE and takes a STOTE count and a space count. It reclaims segment tables until both counts have been met. FREESTOTES is thus used to gain free STOTEs as well as table space. FREESTOTES’s last action is to scan the DIBS for pointers to STOTES that have just been deallocated. FREESTOTES is thus responsible for maintaining assertions concerning STOTE pointers in DIBS. (SHRINK, below, is not.)

SHRINK is an internal routine of SHOVE and sets a segment table to length 0 without freeing the STOTE.

ALSEGTAB {(alsegtab)}, EXSEGTAB {(exsegtab)} and ZPSEGTAB {(zpsegtab)} are the normal external entries to SHOVE. ISHOVE initializes SHOVE’s state.

REAL KEYS

While keys are stored on the disk or in core node pots {(node-pot)} they occupy 12 bytes. Keys in nodes in node frames (between FRSTNODE and NODEEND) each occupy 16 bytes.

Between units of transformation, a key in a node frame that designates a page or node is in one of two states: prepared or unprepared.

Unprepared Keys

Keys that designate pages or nodes:

Such keys are “prepared” before they are used. A byte of a key is (KEYTYPE). Bit 0 is zero in unprepared format. Bits 3-7 are a code for the key type. (DATABYTE) is the data byte referred to in the Gnosis manual. This information may be logical in that it is visible to the user (depending on the type). Some key types do not use this field.
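In the 370 bit numbering convention bit 0 is the leftmost (most significant) bit of the byte, so the KEYTYPE fields described above can be extracted with two masks. The Python sketch below shows this; the constant and function names are hypothetical, and bits 1-2 are left uninterpreted because this section does not define them.

```python
PREPARED_BIT = 0x80   # bit 0 in 370 numbering (most significant bit)
TYPE_MASK = 0x1F      # bits 3-7: the key type code

def is_prepared(keytype: int) -> bool:
    """True if bit 0 of the KEYTYPE byte is on."""
    return bool(keytype & PREPARED_BIT)

def key_type_code(keytype: int) -> int:
    """The 5-bit key type code held in bits 3-7."""
    return keytype & TYPE_MASK
```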

CDA’s

If a key designates a page or node, the (CDA) field of the key holds 6 bytes of a coded form of disk address called a (_coded disk address) or (_CDA). This coded disk address is unique within a pack set {(packs)}. The decoding of this address consults small tables that indicate such things as where packs are mounted now. The high order bit is sometimes used as a page/node indicator: 0 indicates a page and 1 indicates a node. This coding evidently allows for 2**47 pages and 2**47 nodes. Since the size of the coded disk address is not visible to the user, it may be changed without disrupting user programs or processes.
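The arithmetic behind the 2**47 figure follows from the 6-byte size: 48 bits, of which the high order bit is the page/node indicator. The Python sketch below illustrates the coding for the fields that use the indicator; the helper names are hypothetical and the CDA is modelled as a plain integer.

```python
CDA_BITS = 48                      # a 6-byte coded disk address
NODE_FLAG = 1 << (CDA_BITS - 1)    # high order bit: 1 = node, 0 = page

def make_cda(number: int, is_node: bool) -> int:
    """Combine a 47-bit number with the page/node indicator."""
    if not 0 <= number < NODE_FLAG:
        raise ValueError("number must fit in 47 bits")
    return number | (NODE_FLAG if is_node else 0)

def is_node_cda(cda: int) -> bool:
    return bool(cda & NODE_FLAG)

def cda_number(cda: int) -> int:
    """The CDA with the indicator bit stripped."""
    return cda & (NODE_FLAG - 1)

def cda_bytes(cda: int) -> bytes:
    """The 6-byte big-endian form, as it would sit in storage."""
    return cda.to_bytes(6, "big")
```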

The following fields use the high order bit of the CDA to indicate whether the CDA refers to a page or a node:

DISKNODE.DNCDA (always on)

DIRENTCDA

DISKDIR6ENTRYCDA

DISKDIRENTRYCDA

RANGETABLEFIRST

RANGETABLELAST

The following fields have the high order bit of the CDA always zero no matter whether the CDA refers to a page or node:

DISKKEY.DKCDA

KKEY.KCDA

NODE.NFCDA

The following fields have the high order bit of the CDA always zero because they only contain page CDAs:

CORETBEN.CTCDA

REQUEST.REQCDA (except for pots)

Hash Chains

Given the CDA of a page or node, one finds the page or node in item space {if it is there} by means of hash chains. Pages and nodes are linked together by singly linked lists that connect things with the same CDA hash. Such a chain is headed by an element of an array of heads indexed by the CDA hash.

The hash of a CDA is merely bits from the CDA.

NODCHHD is the array of hash chain heads for nodes and PAGCHHD is the array for pages. Page chains terminate in a special item called PAGEEND, and node chains terminate in a special node called NODEEND. The pointers in node chains are item indexes but the pointers for page chains are offsets from CORETABL. To search one of these chains one may: seize the terminator {NODEEND or PAGEEND} by its lock, cause it to conform to the search criterion {e.g., place the desired coded disk address therein}, and do the search without otherwise testing for the end of the chain. Don’t forget to unlock the list. Several terminators can reduce contention for multiprocessor systems.
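The terminator trick described above is the classic sentinel search. The Python sketch below models it for a single processor, so the locking of the terminator is omitted; the class names, the table size, and the choice of the low-order bits as the hash are all assumptions for the example.

```python
NHEADS = 8   # assumption: a small power-of-two number of chain heads

class Item:
    def __init__(self, cda: int = 0):
        self.cda = cda
        self.next = None

class HashChains:
    def __init__(self):
        self.terminator = Item()                  # plays NODEEND or PAGEEND
        self.heads = [self.terminator] * NHEADS   # empty chains hold only it

    @staticmethod
    def cda_hash(cda: int) -> int:
        return cda & (NHEADS - 1)                 # assumption: low-order bits

    def insert(self, item: Item) -> None:
        h = self.cda_hash(item.cda)
        item.next = self.heads[h]
        self.heads[h] = item

    def find(self, cda: int):
        # Make the terminator conform to the search criterion, so the
        # loop needs no separate test for the end of the chain.
        self.terminator.cda = cda
        item = self.heads[self.cda_hash(cda)]
        while item.cda != cda:
            item = item.next
        return None if item is self.terminator else item
```

The inner loop has a single comparison per item, which is the point of seizing and adjusting the terminator before searching.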

CDAs known by the kernel code

Page CDA’s

CDA 0 is used to define a core frame that does not contain a logical page.

CDA 1 is the journal page {(p1,journal-page)}.

Page CDAs for which special range keys exist (not known by the kernel)

The 300 CDA’s starting from (X'7FFFFF'-4100-300) are used for the IPL range.

The 4100 CDA’s starting from (X'7FFFFF'-4100) are reserved for the dump range.

CDA’s 2**23 through 2**24-1 are reserved for use by the disk transfer mechanism {(p2,cmstogno)}.

Node CDA’s

CDA 0 is used to define a node frame that does not contain a logical node. Also, the “super meter” key is a meter key with CDA 0.

CDA 1 identifies a node known as the kernel node. See (p2,errkey).

CDA 9 is the prime meter node. When a checkpoint is needed and migration has not finished, the kernel will stop and later restart this meter. See (migr-cntrl).

Other bits of a key that designates a page or node {except the exit key} are called the (_allocation id). As pages and nodes are deleted, the disk space is reused by new pages. Keys to the old page must not designate the new page. This is prevented by keeping the allocation id with each page and incrementing that id when the page is re-allocated and the old id has been copied to some key. A bit in the flag byte indicates whether the current allocation id has been placed in a key. The key is thus able to distinguish the successive tenants of the same disk page. The same goes for nodes.
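The rule for incrementing the allocation id only when the old id has escaped into some key can be modelled as follows. This is a sketch with hypothetical names; the `id_used` attribute stands in for the flag bit, and wrap-around of the id is not modelled.

```python
class Frame:
    """Hypothetical stand-in for a page or node with an allocation id."""
    def __init__(self):
        self.allocation_id = 0
        self.id_used = False      # models the "used" flag bit

def copy_id_to_key(frame: Frame) -> int:
    """Record the current id in a key and remember that it has been used."""
    frame.id_used = True
    return frame.allocation_id

def reallocate(frame: Frame) -> None:
    """Reuse the disk space; bump the id only if some key holds the old one."""
    if frame.id_used:
        frame.allocation_id += 1
        frame.id_used = False

def key_is_current(frame: Frame, key_id: int) -> bool:
    """A key designates the present tenant only if the ids match."""
    return key_id == frame.allocation_id
```

If no key ever received the old id, no stale key can exist, so the increment is skipped.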

In the DSECT “KEY” there are these fields:

KDATABYTE and KKEYTYPE are what they say.

KDATABODY is 7 bytes that hold the 7 bytes of a data key. KDATABODY6 is the right 6 bytes of KDATABODY.

KCDA is the 4 byte CDA.

KALLOCATIONID is the one word allocation id.

In order to make resume keys disappear {(p1,exitzap)} quickly, another 24-bit field similar to the allocation id is kept physically with each node. This field is called the (_call id). When a resume key is invoked, all prepared resume keys to the node are deleted. Also, the node’s call id is incremented if the old call id has been placed in some resume key {as indicated by a flag in the node}. The call id is placed in the resume key when the resume key must be unprepared. The call id is used in the resume key in place of the allocation id. When a resume key is prepared, its call id must match the call id of the node. A resume key must be prepared before it can be used as a gate.

The call id must not be set to zero when the allocation id is incremented, because it serves not only to distinguish different calls but also to distinguish between successive tenants of the disk node frame.

Prepared Keys

The following fields are valid in a prepared key:

LEFTCHAIN, RIGHTCHAIN, SUBJECT, which are one word each.

DATABYTE & KEYTYPE which are both one byte. Bit 0 of KEYTYPE is 1.

Keys that designate pages or nodes may be prepared. Bit 0 (PREPARED) in KEYTYPE is 1 for prepared keys. When a key has been recently used, it and its node will be in core. It is necessary to be able to transform a key into its unprepared form at any time. The prepared form is to expedite the use of the keys. When a key is prepared, the page or node that it designates is also in core. When a key to a page or node is prepared, it has been modified in such a way as to locate the designated page or node much more quickly. Locating a page or node given a prepared key is 1 instruction, but from an unprepared key it requires perhaps 100 instructions. In fact, we always prepare a key before using it to locate its designated page or node.

The last 2 bytes of a prepared key are the same as for the unprepared form. The first 14 bytes contain 3 four-byte pointers. The first two fields are pointers that forward and backward link the prepared keys that point to that page or node (LEFTCHAIN and RIGHTCHAIN). The third pointer contains the main storage address of the indicated core table entry or node (SUBJECT). In particular, see (ptrs) about the logic of prepared keys. The first of the extra two bytes is used to hold temporary information for CHECK. The other is reserved.
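The 16-byte layout just described can be written down as a packing format. The Python sketch below uses the standard `struct` module with big-endian fields. It assumes that the two extra bytes immediately follow the three pointers and that DATABYTE precedes KEYTYPE in the last two bytes; neither ordering is stated in this section, and the function names are hypothetical.

```python
import struct

# LEFTCHAIN, RIGHTCHAIN, SUBJECT (4 bytes each), CHECK byte, reserved byte,
# then DATABYTE and KEYTYPE: 16 bytes in all.
PREPARED_KEY = struct.Struct(">IIIBBBB")
PREPARED_BIT = 0x80   # bit 0 of KEYTYPE in 370 numbering

def pack_prepared_key(left, right, subject, databyte, keytype, check=0):
    """Build the 16-byte image of a prepared key (PREPARED bit forced on)."""
    return PREPARED_KEY.pack(left, right, subject, check, 0,
                             databyte, keytype | PREPARED_BIT)

def unpack_prepared_key(raw: bytes) -> dict:
    """Split a 16-byte prepared key image into its fields."""
    left, right, subject, check, _reserved, databyte, keytype = \
        PREPARED_KEY.unpack(raw)
    if not keytype & PREPARED_BIT:
        raise ValueError("key is not in prepared form")
    return {"leftchain": left, "rightchain": right, "subject": subject,
            "check": check, "databyte": databyte, "keytype": keytype}
```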

The item index of a page is the item index of the page’s core table entry, and the item index of a node is that of the node frame’s first item in item space.

See (keytodsk) for simple code that understands prepared keys.

Involved Keys

A key may be (_involved.) The detailed meaning of involved keys is determined by the various rules in sections (partrls) and (inv). See (keytodsk) for code form of definition. See (sum-inv) for summary of reasons for involvement.

A slot is a REF KEY. There are just two operations on a REF key: assignment and de-referencing.

In certain situations the kernel constructs real data on the basis of the key in the slot. {This data must be retracted if another key is placed in the slot.} In this case the bit INVOLVEDW is set. INVOLVEDW means “involved for writing”, i.e. you can’t write a new key into this slot without uninvolving the key or taking special consideration of why it is involved.

In certain of the above situations, the kernel must place new keys in the slot. It may defer doing so, but in that case it sets the bit INVOLVEDR while the slot does not hold the logical key. INVOLVEDR means “involved for reading”, i.e. you can’t read this slot without uninvolving the key or taking special consideration of why it is involved.

If another part of the kernel needs the key of a slot, it may be had directly from the involved slot unless the slot is INVOLVEDR, in which case it must first cause the slot to become uninvolved.

If another part of the kernel needs to replace the key of a slot, it must first cause the slot to become uninvolved. Slots that are INVOLVEDR are also INVOLVEDW.

A key is considered involved if either INVOLVEDW or INVOLVEDR is on.
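The reading and writing rules for involved slots can be sketched as below. The bit values, the `Slot` class, and the `retract` callback are hypothetical; the callback stands for whatever the kernel must do to retract derived data and restore the logical key.

```python
INVOLVEDW = 0x1  # hypothetical bit values
INVOLVEDR = 0x2


class Slot:
    def __init__(self, key, retract=lambda slot: None):
        self.key = key
        self.flags = 0
        self._retract = retract  # restores the logical key, drops derived data

    def involve(self, for_reading=False):
        # Slots that are INVOLVEDR are also INVOLVEDW.
        self.flags |= INVOLVEDW
        if for_reading:
            self.flags |= INVOLVEDR

    def is_involved(self):
        return bool(self.flags & (INVOLVEDW | INVOLVEDR))

    def uninvolve(self):
        if self.is_involved():
            self._retract(self)
            self.flags = 0

    def read(self):
        # The key may be had directly unless the slot is INVOLVEDR.
        if self.flags & INVOLVEDR:
            self.uninvolve()
        return self.key

    def write(self, key):
        # Replacing the key always requires uninvolving the slot first.
        if self.is_involved():
            self.uninvolve()
        self.key = key
```

The sketch always uninvolves; the alternative the text allows, taking special consideration of why the slot is involved, is not modelled.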

If an involved key designates a node, the key will be prepared. If node x holds an involved key to node y, then x is said to be coupled to y. The meaning of this coupling is determined by the mode of preparation of the first node and which slot in the first node designates the second node. The set of all prepared keys to a given node are linked by a two-way chain to each other. All of the involved keys are on one end of this chain. Thus it is quick to determine if a node is active.

Typically, coupled nodes have information moved from one to the other relative to where the information would be if the nodes were standard. Thus to standardize a coupled node it may be necessary to modify the nodes to which it is coupled.

Involved keys are hooks or are in prepared nodes. Thus to uninvolve a non-hook it suffices to unprepare the node.

Keys that are not involved may be unprepared without considering their context.

All of the situations that lead to involved keys:

The text “{OB}” means that the mentioned slot has INVOLVEDR on if it is involved.

A hook in slot 13 of some node {OB} {(hook)}.

One of the following slots of a node prepared as a domain root:

Slot  Function       Type            Reference
1     Meter          Meter           (metindom)
3     Memory tree    Memory or Data  (invmemroot)
4     PSW {OB}       Data            (invpsw)
5     Trap Code      Data            (invtc)
6     Monitor Code   Data            (invmon)
7     PER            Data            (invper)
8     PER            Data            (invper)
9     PER            Data            (invper)
10    KVMA           Data            (invkvma)
11    KVMA           Data            (invkvma)
12    Priority {OB}  Data
13    Hook {OB}      Data, Hook      (pdhook)
14    Gen Keys node  Node            (prepgnky)
15    Gen Regs node  Node            (prepgnrg)

One of the following slots of a node prepared as a meter:

Slot  Function          Type                         Reference
0                       Zero Data
1     Sup. meter        Meter
3     CPU counter {OB}  Data
4     {OB}              Data
5     {OB}              Data
6                       Zero Data
7     Charge Set        Chargeset {OB} or Zero Data
8                       Data
10                      Zero Data
12                      Zero Data
15                      Data

An involved charge set key {slot 7} will hold the index of the CSA {(scs)} and a link to the next involved key for the same charge set. This is similar to prepared keys.

One of the following slots of a node prepared as a segment:

The format key {in slot 15}, an initial slot, or the background slot with a memory key {(invmem)}.

A data key in a node prepared as a general regs node {OB}.

A data key in slot 13 of a prepared node {OB}. {Such a key is involved perhaps for no {current} reason.}

The Key Machine

This is a fanciful description of the kernel as it might be written more in accordance with data abstraction principles. The mechanism that we imagine hidden here is that of prepared keys and ancillary mechanisms. We speculate mainly on the procedures exported by this abstraction. I hope that this effort will help in understanding the current kernel and help in the design of locks for the MP kernel. It might even lead to improvement in the kernel implementation.

All of the fields of a key currently known as prepareable are obscured from those parts of the kernel outside those parts supporting the abstraction. In particular the CDA, allocation ID, backchains, subject fields, and prepared bit are denied to the “outside” kernel. The keytype (exclusive of the prepared bit) and data byte are exported. The involved bits are for further study.

PROC de ref = (REF SLOT s, REF NODE actr, BOOL e)UNION(REF NODE, REF PAGE): ...
This is a procedure that locates and locks a frame that holds the page or node designated by the key in the slot. If e then the lock is exclusive, otherwise shared. If the action is impossible now the procedure does not return but the domain root at actr is set to execute when the action might be possible.

PROC gen key = (REF SLOT s, UNION(REF NODE, REF PAGE) r, CAT c)VOID: ...
This procedure produces, in slot s, a category c key to the node or page frame r. There are three categories: Exit, Hook and Other.

There is a fundamental problem in this idea -- how does it relate to prepared nodes and involved keys? Prepared keys cannot be said to transparently support prepared nodes because prepared node logic makes vital use of the preparedness of keys. Witness the rule that involved node keys must be prepared. To consider that the involved mechanisms are more primitive includes too much of the kernel “under the cover”.

One way out of this is to declare that the prepared key mechanism serves the involved key mechanisms and that the prepared key mechanism is not at liberty to unprepare a key at will. However unpreparing a key can force the unpreparing of a node which seems to violate Habermann ordering.

REAL NODES

See (packs) about nodes and pages on the disk.

Real nodes have allocation ids, call ids, and flag bytes.

When a node is in core, there is some additional information kept about it. See (itemndflag) for names and meaning of this information.

Prepared Nodes

If a node has been most recently used as a domain, it is most likely to be used next as a domain. There is information that may be derived from the node and the nodes that it designates that is necessary every time the node is used as a domain. Two examples are the general register values and a locator of a segment table necessary for control register 1. A node recently used as a domain root is thus held in a prepared form anticipating further use as a domain root; much information necessary in its role as a domain root is held in an associated block called the (_Derivative Information Block) {DIB}. See (domrules) for some information about the state of the domain.

There are several distinct ways that a node may be prepared. For example, if a node has been designated for use as a meter node it will be prepared as a meter node. If it has recently been designated as a domain or gate node it will be prepared as a domain. The preparation code in the node (PREPCODE) indicates whether the node is unprepared or in one of the prepared forms. Terms such as “prepare” or “prepare as domain” and “unprepare” are used to designate the transformation between these states. The meaning of each of these preparedness states is defined individually for each type of preparedness in (partrls).

The preparedness of one node may imply the preparedness of another. Therefore the unpreparing of one node may require the unpreparing of another. For this reason we list all (_node preparation implications) here. A prepared domain root implies a prepared general keys node {(prepgk)} and a prepared general regs node {(prepgr)}. Prepared general segment nodes may imply other prepared segment nodes. Prepared meter nodes imply other prepared {superior} meter nodes. Prepared general keys nodes or general registers nodes imply prepared domain nodes. (_There are no other implications!)
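The consequence for unpreparing can be sketched as a closure computation. If the preparedness of A implies the preparedness of B, then unpreparing B requires unpreparing A. The function name and the dictionary representation are our assumptions; the kernel does not hold the implications in this form.

```python
def nodes_to_unprepare(target, implies):
    """Return every node that must be unprepared in order to unprepare target.

    `implies` maps a prepared node to the set of nodes whose preparedness
    it implies.
    """
    # Reverse the relation: for each node, who depends on it being prepared?
    required_by = {}
    for a, implied in implies.items():
        for b in implied:
            required_by.setdefault(b, set()).add(a)

    seen = {target}
    stack = [target]
    while stack:
        node = stack.pop()
        for dependent in required_by.get(node, ()):
            if dependent not in seen:
                seen.add(dependent)
                stack.append(dependent)
    return seen
```

For example, with a domain root that implies its general keys node and general regs node, each of which implies the root in turn, unpreparing the keys node drags the root and the regs node with it.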

Kinds of Prepared Nodes

General Points

This chapter is devoted to describing and motivating the rules that govern the particular varieties of prepared nodes.

This section gives the meaning of involvement for each slot of each kind of prepared node. See (inv) for supplementary information of the same sort.

Domains

General Points

The primary design goal of the prepared domain is to be able to start and stop the designated program quickly. In particular, the design must be as efficient as possible for gate jumps. A coded pointer in the node designates a block of storage called the derivative information block {DIB}. Most of the data required for executing the program on a real processor is stored in this block in a format convenient for starting and stopping the program.

C4, C5, C13, C14, and C15 are involved. C1 and C3 may or may not be involved and may change state without unpreparing the node. Slots 7 and 8 are involved data keys if bit 1 {P} in C4 is on. Slots 10 and 11 are involved if bit 0 of the PSW is on. Other slots are not involved. The domain root and the nodes designated by C14 and C15 {if any} must be distinct.

See (disjdom).

Allocation of DIB’s

DIB’s for domains are allocated from an array (RNDIB) which is permanently divided into DIB frames of size (RNDIBSIZE). (DIBBASE) in low core holds the virtual origin of the DIB’s. To locate a DIB from a prepared node, add (DIBBASE) to the signed value NFDDIBOFS. The unallocated frames are described by an address chain rooted in (DIBFREHD) and linked through the beginning of the free frame. When a DIB frame is required and none is available, we steal a DIB by unpreparing some prepared domain. DIB frames are aligned on double words.
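A minimal model of this allocation scheme follows. The class `DibPool`, the dictionary standing in for the link word at the beginning of each free frame, and the choice of victim are all assumptions made for illustration; the text does not say which prepared domain is chosen when a DIB is stolen.

```python
class DibPool:
    """Fixed array of DIB frames with a free chain (a model of RNDIB)."""

    def __init__(self, dibbase, frame_size, frame_count):
        assert dibbase % 8 == 0 and frame_size % 8 == 0  # double word aligned
        self.dibbase = dibbase
        self.free_head = None          # plays the role of DIBFREHD
        self.next_free = {}            # link stored at the start of a free frame
        self.owner = {}                # frame address -> prepared domain
        self.stolen = []               # domains unprepared to free a frame
        for i in reversed(range(frame_count)):
            self._push_free(dibbase + i * frame_size)

    def _push_free(self, addr):
        self.next_free[addr] = self.free_head
        self.free_head = addr

    def locate(self, nfddibofs):
        """Address of a DIB given the offset held in the prepared node."""
        return self.dibbase + nfddibofs

    def allocate(self, domain):
        """Return the offset (to be kept as NFDDIBOFS) of a frame for domain."""
        if self.free_head is None:
            victim = next(iter(self.owner))          # hypothetical policy
            self.stolen.append(self.owner.pop(victim))  # "unprepare" the victim
            self._push_free(victim)
        addr = self.free_head
        self.free_head = self.next_free.pop(addr)
        self.owner[addr] = domain
        return addr - self.dibbase
```

Because a frame can always be obtained by unpreparing some domain, a DIB frame is never a stuck resource.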

The byte called (READINESS)

This is a byte in the domain’s DIB. The bits in this byte represent an obstacle (when they are 1) to running the domain. This byte is tested for zero just prior to dispatching the domain. If there is a 1 bit in the byte a corresponding action is taken to remove the obstacle.

If the logical domain has a non-zero trap code then bit TRAPPED is on {(invtc)}.

Bit MONITOR indicates that the logical domain has non-trivial monitor register values.

Bit DIBPER means that the logical domain has non-trivial PER parameters.

The kernel was once coded so that DIBPER meant the above and that the real registers did not hold the right values. The current implementation saves the work of turning off DIBPER but causes the extra work of comparing PZDOMAINDIBP with PEROWNER upon each dispatch of each domain using PER. The current design violates a currently unused principle that bits in READINESS are obstacles to be removed before running the domain.

Bit BUSY {(busy)} merely means that the logical domain is busy and must not be started with a start key.

Bit STALECACHE: LRU counter for the cache. See (stalecache).

HOOKED: See (hooked) about hooks.

The work of inspecting various data keys in the domain root is sometimes deferred and the READINESS bit PSWNOTCHECKED is set during the deferral. The keys whose inspection is deferred are: PSW, PERA, PERB, PERC, KVMAA, KVMAB. The routine CHECKPSW in GATE is where the check is finally done just before dispatching the domain. Among prepared domain roots those data keys are not involved while PSWNOTCHECKED is set and they are involved (depending on other considerations) when the bit is not set. The logical PSW resides in the slot of the domain root while PSWNOTCHECKED is set and in the DIB at other times. This bit simplifies the kernel code that changes the domain root PSW slot while avoiding the work of preparing and unpreparing the entire domain. See (pswnotcheckeda).
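The READINESS test made just before dispatch can be sketched as follows. The bit positions, the dictionary used for the DIB, and the handler table are assumptions; the manual names the bits but not their positions, and `check_psw` is only a stand-in for CHECKPSW.

```python
# Hypothetical bit assignments within READINESS.
TRAPPED = 0x01
MONITOR = 0x02
DIBPER = 0x04
BUSY = 0x08
STALECACHE = 0x10
HOOKED = 0x20
PSWNOTCHECKED = 0x40


def check_psw(dib):
    """Stand-in for CHECKPSW: do the deferred inspection, then clear the bit."""
    dib["psw"] = dib["root_psw_slot"]   # logical PSW moves from the slot to the DIB
    dib["readiness"] &= ~PSWNOTCHECKED


def try_dispatch(dib, handlers):
    """Return True if the domain may be dispatched now."""
    if dib["readiness"] == 0:           # the common, fast case
        return True
    for bit, handler in handlers.items():
        if dib["readiness"] & bit:
            handler(dib)                # attempt to remove this obstacle
    return dib["readiness"] == 0
```

A single comparison with zero covers every obstacle at once, which is why the obstacles are gathered into one byte.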

Preparedness of the Various Slots of a Prepared Domain Root

C1 - Meter

When the program is running, the CPU timer holds MIN(CPU cache, Storage cache / Charge set size). When the program is running, the system is enabled for CPU timer interrupts. The values of these caches are less than the logical balance in any of the meters superior to this node. The caches have been deducted from each of the real balances in the superior nodes. The cache is fair game for being retrieved for use by some other program that, according to the logical meters, should be able to run. When the program is not running, the cache is kept in its own location in the DIB. C1 is involved or the cache is 0. If C1 is involved, then it designates a prepared meter node. The DIB of a prepared domain has an item space pointer (RNVACWAIT) to the lowest superior meter {if any} that holds a “wait for no process” gate.

The caches in the DIB are stored with floating exponent X'40'. This is to facilitate arithmetic with the involved data keys in the meters.

C3 - Address Segment

The derivative information contains the address (in SEGTABPP) of a slot in the STOT table {which holds a value for control register 1}. This is the segment table origin. This STOT entry is either the null entry (NULSPPT), which points to a void segment table, or designates a segment table that is associated with a page or prepared segment node, or designates a forsaken segment table. C3 is involved when the STOT entry is not NULSPPT. It describes a subset of those page tables that are accessible via paths along involved keys in the memory tree. If C3 is involved, then it designates a page or a node prepared as a segment node. It may seem strange that we do not require C3 to be prepared. The above assertion about paths insures, however, that the segment table origin will describe tables with no valid entries if the key in C3 is unprepared. This latitude allows us to unprepare a domain just to the extent of zapping the memory tree portion of its preparation without undoing the other investment in the prepared node.

C3 of a domain is considered the top slot of the memory tree. See (prepsgnd) for preparing keys in such slots.

C4 - PSW, etc.

This data key is involved (unless PSWNOTCHECKED) because a real PSW is in the DIB. The DIB holds {in PSW} the prepared PSW. Its contents in hex digits are: Y71DPP00 00PPPPPP where the Ps are pure logical program data. Y is either 0 or 4 depending on whether bit 1 of the data body of C4 specifies the PER function.

The DIB field SVCLIM is 252 if PER is not active {bit 1 of C4 is 0} and if gate jumps are allowed {bits 14 and 15 are one}. Otherwise SVCLIM is 255.
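The PSW template and the SVCLIM rule can be worked through in a few lines. This is a sketch: we assume that the P digits are copied from the same digit positions of the logical PSW, and the function names and the `gate_jumps_allowed` parameter are ours.

```python
FIXED_BITS = 0x071D0000_00000000     # the constant digits of Y71DPP00 00PPPPPP
PROGRAM_MASK = 0x0000FF00_00FFFFFF   # the P digits: pure logical program data
PER_BIT = 0x40000000_00000000        # makes the leading digit Y equal to 4


def prepared_psw(logical_psw, per_enabled):
    """Combine the program's digits with the kernel-controlled digits."""
    psw = FIXED_BITS | (logical_psw & PROGRAM_MASK)
    if per_enabled:
        psw |= PER_BIT
    return psw


def svclim(per_enabled, gate_jumps_allowed):
    """252 if PER is not active and gate jumps are allowed, otherwise 255."""
    return 252 if (not per_enabled and gate_jumps_allowed) else 255
```

For instance, `hex(prepared_psw(0, True))` gives `0x471d000000000000`, which is the template with Y = 4 and all P digits zero.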

C5 - The trap code is an involved data key. Bit TRAPPED in READINESS is 0 iff C5 is a zero data key.

C6 - Monitor stuff

C6 is involved. See (monprep) for preparedness of monitor information.

C7 through C9. See (per) for preparedness of the PER hardware. C7 through C9 are involved data keys if bit one of C4 is 1 and PSWNOTCHECKED=0.

C10 & C11 - KVMA stuff

These are involved data keys if PSWNOTCHECKED=0 and bit 8 of the PSW is one indicating that PSW manipulating privileged instructions are to be simulated by the kernel. See (p1,sm).

C13 - Hook

General Points

This key is an artifice to join the domain to various collections of domains at different times. HOOKED in READINESS in the DIB is on iff the prepared domain has a hook in C13.

Except for the WORRY queue, hooks are held only by nodes with processes in them. See also (domrules) and (procrules).

Nodes with hooks in slot 13 may not be removed from core by the kernel without the help of external programs. See (hook).

Stalling

When there is a process in the domain A and it is stalled, waiting for domain B to be available, then A’s hook designates B. Thus the backchain through the hook includes the queue of domains waiting on B. This handles the case where the domain holds no logical key to B, such as when the domain is stalled on a busy segment keeper or busy meter keeper.

The prepared keys to a prepared domain root are segregated on the backchain into three segments: hooks, exits, and others in that order. {The only involved keys to domain roots are hooks.} The hooks are involved, thus conforming to the rule on involved keys at the left of backchains. See (bchorder). LASTINVOLVED, in the DIB of the domain, points to the rightmost hook on the chain. The left end of the backchain holds the newest stallees. This facilitates adding queue members at the end and finding the exits to zap them.
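The three-segment ordering might be modelled like this. A Python list stands in for the two-way chain, and a count of hooks stands in for the LASTINVOLVED pointer; both are simplifications, and the class and method names are hypothetical.

```python
class Backchain:
    """Prepared keys to one domain root, left to right: hooks, exits, others."""

    def __init__(self):
        self.keys = []
        self.hook_count = 0     # index of the rightmost hook is hook_count - 1
        self.exit_count = 0

    def add_hook(self, key):
        self.keys.insert(0, key)            # newest stallee goes at the left end
        self.hook_count += 1

    def add_exit(self, key):
        self.keys.insert(self.hook_count, key)  # just right of the last hook
        self.exit_count += 1

    def add_other(self, key):
        self.keys.append(key)

    def last_involved(self):
        """The rightmost hook, i.e. the oldest stallee, or None."""
        return self.keys[self.hook_count - 1] if self.hook_count else None

    def zap_exits(self):
        """Remove and return all exits; they sit contiguously after the hooks."""
        start, end = self.hook_count, self.hook_count + self.exit_count
        zapped = self.keys[start:end]
        del self.keys[start:end]
        self.exit_count = 0
        return zapped
```

With the real chain, LASTINVOLVED gives the position of the exits without a search, which is what makes zapping them quick.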

CPU Queue

When a domain is waiting on availability of a CPU, its hook designates the special item space slot (CPUQUE).

When the domain is in page wait, its hook designates one of many special I/O queue head slots chosen according to the CDA of the awaited node or page.

When the domain is ready but has a non-empty stall queue {presumably a transitory state}, its hook designates the special item space slot WORRY. Several times a second, every member of this queue has its head queue member put on the CPU queue.

C14 - General Keys

{C14 holds a node key.} This slot is involved because the address of the general keys node is kept in the DIB (GENKEY). The general keys node is prepared to facilitate the prevention of such things as one node being both the keys node of a jumper and domain root of the jumpee.

C15 - Registers

The derivative information holds the logical contents of the general and floating point registers in a form suitable for quick loading and storing. Slot 15 holds a node key and is involved. The designated node is prepared as a general register node.

See (domrules) for other things about prepared domains.

Key Nodes

This is the mode of preparation of the general keys node of a prepared domain. This node has no involved keys except perhaps a hook key in slot 13.

Prepared Register Nodes

Domain annexes are prepared to facilitate the prevention of several pathologies that would be of little value and great expense. See (prepgnky) and (prepgnrg). A node prepared as a general registers node holds only involved data keys and perhaps a hook in slot 13. It is designated by just one involved key, and that key {if any} is in a prepared domain root.

Meter Nodes

The design goal for prepared meter nodes is to make it efficient to run under several levels of meters. To this end, each prepared domain potentially has a cache of time that it may use without having to examine the controlling {superior} meters. See (metindom) for some details on cache use. This cache is never greater than the smallest logical meter balance. It is also less than 5 CPU seconds. In fact, the sum of all caches inferior to a meter node is less than the logical balance of the node. The real balance in a meter node is the logical balance less that sum. The format of a real or logical balance is the same as the format of the CPU timer except shifted right 8 bits. Fortuitously, this format is the representation of a data key and can be added, subtracted, and compared with unnormalized double precision floating point commands.
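The relation between caches, real balances, and logical balances can be illustrated with a small model. Balances here are plain integers in arbitrary units, the 5 CPU second limit is not modelled, and the class and function names are assumptions.

```python
class Meter:
    def __init__(self, real_balance, superior=None):
        self.real_balance = real_balance
        self.superior = superior


def chain(meter):
    """The meter and all of its superiors, lowest first."""
    while meter is not None:
        yield meter
        meter = meter.superior


def grant_cache(meter, wanted):
    """Give a domain under `meter` a cache, deducting it from every superior."""
    amount = min([wanted] + [m.real_balance for m in chain(meter)])
    for m in chain(meter):
        m.real_balance -= amount
    return amount


def return_cache(meter, amount):
    """Retrieve a cache (or its unused part) back into the real balances."""
    for m in chain(meter):
        m.real_balance += amount


def logical_balance(meter, inferior_caches):
    """Real balance plus the sum of all caches inferior to the meter."""
    return meter.real_balance + sum(inferior_caches)
```

The logical balance is unchanged by granting or retrieving a cache; only its real representation moves between the meter nodes and the DIB.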

A prepared meter node has an involved key in slot 1 to another prepared meter node. The primordial meter node is special and does not point to other meter nodes.

Slots 3, 4, 5, 8, and 15 hold involved data keys.

Slots 0, 6, 10, 11, and 12 hold involved zero data keys.

If slot 7 holds a charge set key, then it is involved.

If slot 9 of the meter holds a gate key of a “no process” waiter, then the prepared form of the node has an item space pointer to the next higher superior that designates such a meter. This meter also holds a count of the prepared domains that are inferior to this meter and are in the running state.

A prepared meter holds its level {(p1,mhier)}. See also (scavenge) for the meaning of the DRY bit of the prepared meter.

Segment Nodes

Strategy

We review here the motivation of prepared segment nodes and involved memory keys. If all memory accesses were interpreted by the kernel according to (p1,memtreeformal) without the aid of the dynamic address translation {DAT} hardware, there would be no need of prepared segment nodes, involved memory keys, and the stored dependency relation, or even page and segment tables for that matter. Each memory reference would traverse the memory tree path by consulting unprepared nodes.

In order to use the DAT hardware, we must derive information from the segment nodes and format it as DAT requires. This formatted information is of three forms: segment table descriptors {values for control register 1}, segment tables, and page tables.

This derived information must never permit access to data that is not permitted by the segment nodes. In order to preserve this state of affairs, we establish some rules about segment nodes and the entries that depend on them.

See (memalgo) for more about this.

Fragments of a Correctness Proof

This section has the same goal as (memorder).

The code MEMORY and its friends call on storage management and I/O routines to bring necessary pages and nodes to main store and then build DAT mapping tables and control storage keys to provide access to those pages indicated by the nodes of the memory tree. It is also involved in freeing page and node frames and adjusting tables and storage keys accordingly. It must also react to modifications of the memory tree.

These are some theorems of varying precision. These tend to depend on following theorems of this group. The proof is upside down.

Any access granted to a domain by the DAT hardware and storage keys is allowed by the domain’s address segment.

This describes the collective properties of mapping tables verified by CHECK.

For each DIB and each valid entry of each produced mapping table, CHECK executes its version of SEAP and demands that the table or page designated by the entry is produced by the correct node and that xxBGNODE and xxFORMAT in that table have the unique correct values. This version of SEAP verifies that DEPEND entries exist when they should. (chmem) says all of this more completely. In short all valid entries of all produced tables are justified by slots of suitably prepared nodes.

Background Keys

That transactions involving background keys preserve the above has been the source of great confusion, perhaps some bugs.

xxBGNODE fields in produced tables may locate unallocated node frames. Nonetheless the above assertion holds for STBGNODE because there will be no DIBS designating that STOTE.

Background Key Transactions:

We list here transactions concerning background keys and how each transaction preserves (chmem). We assume that CHECK has run (by induction) in each of these cases.

Beginning to share a table:

A mapping fault has just been caused by invalid mapping table entry m and we find that the address of an extant table (or page) x is suitable to replace mapping table entry m. We replace the invalid entry with the address of x.

Entry m has become newly valid. Any slots used to locate x are involved. Side effects might cause nodes to be dislodged from their frames and their slots to become uninvolved.

Consulting a background key upon encountering a background window key.

When an invalid mapping table entry causes a fault, its table holds a xTBGNODE field. The search starts from the table’s producer. If a background window key is encountered the BGNODE field is used to locate the background key.

To find the parts of the kernel that believe that all keys reside in nodes look at references to NODESIZE, FRSTNODE, NODEEND,...

Backing off from the problem a bit leads to the following perspective: We need quick access to the effective background key when a background window key is encountered. One scheme is to start from the domain root when we need a background key. It is hard to argue that this is infrequent.

We might chain mapping tables together that depend on the same background key. The chain head would be at the node with the key. Mapping tables could be found when the node was moved. When tables are reclaimed the chain would have to be repaired.

Another idea, of course, is the lately discredited depend relation between slots and tables. We need to find the table when the node moves.

Another idea is to add a bit to the depend entry that indicates that the slot holds a background key and that the table holding the entry should have its BGKEY field zapped as well.

If the above is a good idea, this is better: do in the entire table! Does this lead back to the old scheme of linking slots to tables via DEPEND?

Presumably such a DEPEND entry would be made only upon allocation of the table and not upon each access to the background key. Following this line we need not record in DEPEND the relation of a slot to an entry. When the slot is zapped the whole table goes!

Creating a new mapping table.

Displacing a red segment node from its frame

A fault on an invalid entry in a table with a xxBGNODE value.

Local Integrity: (obsolete)

page P is accessed when address b is used to consult table T (segment or page) and

T is produced by page or node N and

T’s background field (xxBGKEY) designates a slot with key B in an allocated node frame

Then

access to page P results when address b is applied to N with B as background key

page P is accessed when address b is used to consult table T (segment or page) and

T is produced by page or node N and

T’s background field (xxBGKEY) is null

Then

access to page P results when address b is applied to N with no background key

The above is true but too weak. (chmem) is stronger and sufficient to perpetuate itself.

A situation that seems paradoxical may arise:

The BGKEY field of some table may designate a slot in a node that has replaced the node that was in the same node frame when the table was built. What is certain is that while the node frame was unoccupied there were no entries in the table that depended on the background key. (Since such entries would have required the format key within the old node to be involved.) Any entries now in the table are thus built in light of the background key in the new node.

The above overlooks the situation when a DIB designates a STOTE with a stale STBGKEY and gets a fault subsequent to the departure of the background key.

No it doesn’t! In that case there will be no SEGTABPPs designating the segment table and so no fault can occur. This is because the SEGTABPP field in the DIB was zapped when the node located by the STBGKEY field was displaced.

As a domain accesses address X in its address segment, each key of its memory tree that is conceptually consulted to define the meaning of X, is involved.

Actions that change involved slots cause special action if they are in nodes prepared as segments. (If it is not obvious how to preserve the predicate locks the key is first uninvolved.)

Involved keys are hooks or reside in prepared nodes.

Valid Collective States of Segment Nodes

{Some Rules for Memory Mapping. This supplements section (prepsgnd).} Each of the following is true:

Control register 1 = CTL1OWNER.SEGTABPP.0 .

The DEPEND Module {see (depndintro)}

Definition of the “depend” relation.

When field SEGTABPP in DIB d refers to STOT entry st and st isn’t NULSPPT then SEGTABPP depends on each slot of the memory tree starting from the domain root (that produced d) that would be consulted until the key that designates the page or node that produced st.

When entry E of segment table st refers to page table PT, E depends on each key that would be consulted if the memory-tree algorithm were executed using the address corresponding to E starting from st and running thru the key designating the page or node that produces PT.

When entry E of page table pt refers to page p then E depends on every slot that would be consulted by the memory tree algorithm using the address corresponding to E starting from the node that produced pt thru page p.

VNXA-If S is slot i of node N and E is a table entry that depends on S, then S is involved and (either:

N is prepared as a segment node and

(either E is the i’th entry of a page table produced by N

or E is the i’th entry of an SSC4 segment table produced by N

or there exists a node N2 and slot j such that

{N2 is prepared as a segment node and}

slot j of N2 has an involved SSC4 segmode key to N and

E is the (16*j+i)’th entry of an SSC5 segment table produced by N2

or there exists a j such that

i = ENTIER(j/16) = j%16 and

E is the j’th entry of an SSC5 segment table produced by N

We used to also say:

or i = 0 and E is the 0’th entry of a {degenerate} SSC3 segment table produced by N

or N is prepared as a domain root and E is the field SEGTABPP of a DIB which is associated with N and i=3 {S is N’s memory tree root}

or N is prepared as a segment node and the pair (S,E) is recorded in the stored dependency relation {(depend)}).
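The index arithmetic in the SSC5 clauses of VNXA can be checked directly. We assume 16 slots per node, so that an SSC5 segment table has 256 entries; the function names are ours.

```python
SLOTS = 16  # slots per node, numbered 0 through 15


def ssc5_entry(j, i):
    """Entry number reached via slot j of the upper node, then slot i of the lower."""
    return SLOTS * j + i


def diagonal_entries(i):
    """Entries j with i = ENTIER(j/16) = j%16, for a given slot i."""
    return [j for j in range(SLOTS * SLOTS)
            if j // SLOTS == i and j % SLOTS == i]
```

The condition i = ENTIER(j/16) = j%16 picks out exactly one entry per slot, namely j = 17*i, so the qualifying entries are 0, 17, 34, and so on up to 255.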

VXA-If S is slot i of node N and E is a table entry that depends on S, then S is involved and (either:

N is prepared as a segment node and:

or N is prepared as a domain root and E is the field SEGTABPP of a DIB which is associated with N and i=3 {S is N’s memory tree root}

or N is prepared as a segment node and the pair (S,E) is recorded in the stored dependency relation {(depend)}.

The DEPEND Module for XA MP {see (depndintro)}

Definition of the “depend” relation.

Terminology for the following

“big seg” refers to big segments that will be used for XA and may be optionally available for 370.

When we say below that the SSC of a table is i we mean that it is appropriate for either holders of segmode keys with LSS=i or that the producer’s format key has s=i and the segmode key’s LSS is 0.

SSC3 page tables are 16 entries long and SSC4 page tables are 256 entries long. See (big-ptab) for much more detail.

If S is slot i of node N and E is a table entry that depends on S, then S is involved and (either:

This isn’t the big seg kernel and N is prepared as a segment node and

(either E is the i’th entry of a page table produced by N

or E is the i’th entry of an SSC4 segment table produced by N

or there exists a node N2 and slot j such that

{N2 is prepared as a segment node and}

slot j of N2 has an involved SSC4 segmode key to N and

E is the (16*j+i)’th entry of an SSC5 segment table produced by N2

or there exists a j such that

i = ENTIER(j/16) = j%16 and

E is the j’th entry of an SSC5 segment table produced by N

or this is the big seg kernel and N is prepared as a segment node and:

(either E is the i’th entry of an SSC3 page table produced by N

or E is the (16*j+i)’th entry of an SSC4 page table which is produced by a node {prepared as a segment node} whose j’th slot holds an involved SSC3 segmode key to N

or E is the j’th entry of an SSC4 page table produced by N where i = ENTIER(j/16) = j%16)

or E is the i’th entry of a segment table produced by N)

or N is prepared as a domain root and E is the field SEGTABPP of a DIB which is associated with N and i=3 {S is N’s memory tree root}

or N is prepared as a segment node and the pair (S,E) is recorded in the stored dependency relation {(depend)}).

MEMORY’s support of these assertions:

When SEARCHPORTION is called, an invalid entry has been located and the purpose of SEARCHPORTION is to find a value to be placed in that entry. The entry address has been computed and is held in DEPENDSAFE during the execution of SEARCHPORTION.

A slot S consulted by SEARCHPORTION while determining the value for this entry will be registered in DEPEND, unless the sequence of SSC values of keys {excluding S} encountered so far in this execution of SEARCHPORTION is one of the set of sequences given here.

The slots consulted to find a value for RNSEGTABPP are all entered into DEPEND, except for the slot in the domain root.

What (memory assertions) CHECK checks

CHECK checks some assertions that are difficult to express in the notation and terminology of the rest of sections (prepsgnd) and (meminv).

An “xvalid” page (or segment) table entry is either a valid entry or an entry with the invalid bit on and a non-zero page frame number (or page table address). This definition holds for just the next two paragraphs. See (ptrescue), (pt-age) and (rpt) about xvalid page table entries. Actually we do not make xvalid segment table entries that are not valid although that might be a good idea. xvalid entries have the invalid bit on for purposes of LRU.

CHECK considers all xvalid entries e of all produced segment tables of all nodes prepared as segment nodes. CHECK executes MEMTRECH {its version of SEAP (seap)} to ensure that e designates the page table produced by the node predicted by SEAP and that that page table has the predicted values in PTBGNODE and PTFORMAT.

CHECK considers all xvalid entries e of all produced page tables of all nodes prepared as segment nodes. CHECK executes MEMTRECH to ensure that e designates the page whose core table entry is predicted by MEMTRECH and that STOREKEYISZERO in CTFLAGS in the CTE entry agrees with the yield of MEMTRECH. {See (storkeyrule) about connection with storage key.}

Similarly CHECK verifies that prepared domains access (via SEGTABPP in their DIB) segment tables produced by the node predicted by SEAP with the predicted STBGNODE and STFORMAT values.

MEMTRECH also demands that all referenced keys be involved and in nodes prepared as segments or slot 3 of a domain root.

CHECK checks the setting of the storage key controlling write access of mapped pages against the access determined by SEAP as it proceeds from valid page table entries.

It follows from the above checked facts that memory mapping is safe in that it provides only that access justified by the memory tree. See (bgx) for some ramifications of these assertions. See (seaploop) for logic of SEAPLOOP used to ensure the above assertions.

What MEMTRECH does

The calls to MEMTRECH are described above {(chmem)}.

MEMTRECH is much like SEAP in that it updates CHBGNODE (in analogy to SEAP’s BGNODE) and returns a node or page locator.

An involved segmode key designates a prepared segment node, and the value of the right 6 bits of its data byte are 0 or from 3 through 12.

Production Chains

Mapping tables produced {(produce)} by either a page or a node are all on the chain headed at CTDIB or NFDIB respectively and chained thru either STNEXT or PTNEXT and terminating with zero.

Such a chain may go thru both page table space and STOTE space. We use the fact that page tables precede STOTEs to tell where we are. Code using this fact is always marked by the word “PTFIRST” in comments nearby.

The forsaken segment tables are a set of STOTE’s. This set is just those STOTE’s found on the queue starting at FORSAKENSEGTABHEAD and chained through NEXTSTOTE. This set is also just those STOTE’s where STPRODUCER holds NIL*.

The free segment tables are a set of STOTE’s. This set is just those STOTE’s found on the queue starting at FREESEGTABHEAD and chained through NEXTSTOTE. This set is also just those STOTE’s where SZCD holds -1.

Forsaken STOTE’s aren’t free.

DIB’s don’t point {via RNDIBSEGTABPP} to free STOTE’s.
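The four assertions above about forsaken and free STOTE’s can be checked mechanically. The following is a minimal sketch of such a check, not the kernel’s CHECK code. The field names STPRODUCER, SZCD and NEXTSTOTE come from the text; the dictionary model of a STOTE, the NIL sentinel, the use of None as a chain terminator and the function names are assumptions made for illustration.

```python
NIL = object()  # stand-in for the manual's NIL* value (assumption)


def walk(head, stotes):
    """Return the set of STOTE indexes on a queue chained through NEXTSTOTE.

    The chain is assumed to end with None (an assumption of this sketch).
    """
    seen = set()
    i = head
    while i is not None:
        if i in seen:
            raise ValueError("chain loops back on itself")
        seen.add(i)
        i = stotes[i]["NEXTSTOTE"]
    return seen


def check_stote_sets(stotes, forsaken_head, free_head, dib_targets):
    """Return a list of violated assertions (empty when all hold)."""
    forsaken = walk(forsaken_head, stotes)
    free = walk(free_head, stotes)
    problems = []
    if forsaken != {i for i, s in enumerate(stotes) if s["STPRODUCER"] is NIL}:
        problems.append("forsaken queue differs from STPRODUCER = NIL* set")
    if free != {i for i, s in enumerate(stotes) if s["SZCD"] == -1}:
        problems.append("free queue differs from SZCD = -1 set")
    if forsaken & free:
        problems.append("a forsaken STOTE is free")
    if set(dib_targets) & free:
        problems.append("a DIB points to a free STOTE")
    return problems
```

Here dib_targets stands for the STOTE indexes reached from the DIB’s. A DIB may still designate a forsaken STOTE; only free ones are forbidden.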

Background Keys

These are checked in CHECK.

If entry i of segment table T designates page table U then U.PTBGNODE is the value that would be placed in location BGNODE as a result of executing SEARCHPORTION starting at table T with address i*16**5 and initializing BGNODE with T.xxBGNODE.

If DIB d designates a STOTE s, then s.STBGNODE points to the red node with the background key in effect when you start from the memory key of the domain associated with d and proceed to the segment node that produces s.

Ordering of Memory Assertions

I think that the assertions about the page and segment tables need to be ordered into a few levels. One level would come before another if assertions from the first are used to prove assertions from the latter. The other branches of this plex describe these levels. {These assertions hold only when the world is valid.} This has the same goals as (memorder1).

First level

Producers {see (mem-context) and (prod-chains)}

PAPL is a function closely related to the apply function {(p1,applyaddr)}. PAPL(addr,sl,MK,BK) is a node or a page where addr is an address (a number from 0 to 2**48) and MK and BK are memory keys. PAPL(addr,sl,MK,BK) is the result of partially executing the apply function but returning the first node or page whose ssc <= sl. BK is the initial background key or DK(0) if there is none. In evaluating PAPL, slots are consulted: the segmode keys that constitute the access path, the format keys of red nodes, and any window or background keys used.

If the DIB of a prepared domain designates {via RNSEGTABPP and a STOT entry {not NULSPPT}} a segment table T, then:

PAPL(0,5, root of memory tree of the domain, DK(0)) is a node or page which produces T.

All of the slots consulted by the above call to PAPL are involved.

If node N produces a segment table T, and E is entry i of T, and E is valid, and E designates page table PT, then:

PAPL(i*16**4,3,KN,BG) is a node or page which produces PT, where KN is a segmode key designating N with data byte from the field STFORMAT of T and BG is the key located by STBGNODE in T.

All of the slots consulted by the above call to PAPL are involved.

If node N produces a page table T, and E is entry i in T, and E is valid, then:

PAPL(i*16**3,2,KN,BG) is the page designated by E, where KN is a segmode key designating N with data byte from the field PTFORMAT of T and BG is the key located by PTBGNODE in T.

All of the slots consulted by the above call to PAPL are involved.

The segment and page tables accessible to prepared domains via their DIB’s provide no access that is not provided by logical nodes and pages.

The pages accessible in domains are no more than those described in the memory tree of the domain.

When xTBGKEY designates an unallocated frame.

See (dt) for bibliography on this subject.

Ramifications of the assertions (chmem) (called “chmem” here) concerning the xTBGKEY fields with the page and segment tables lead to states that seem paradoxical at first. Note the absence of assertions that fields xTBGKEY designate involved keys. Mapping tables with such fields may come to be used by domains. How can this be safe?

The short answer is that the assertions of (chmem) perpetuate themselves and while they hold, no domain can access pages not provided by their memory tree. Indeed domains can’t use such tables, even though they are produced.

It is instructive to see how the assertion survives the following events:

An invalid page table entry causes a fault. Its page table’s PTBGKEY designates a key in node X. SEARCHPORTION displaces X while accessing nodes necessary to define the page at the offending virtual address. PTBGKEY now locates another key. Subsequent (or even this) accesses consult the new (wrong) key giving unjustified access to data. How is this chain of events avoided?

A corollary of (chmem) is that valid page table entries that depend on background keys are known to the depend relation. Displacement of a node holding a background key thus invalidates all page table entries dependent thereon.

At the time of the fault the page table was located by some segment table entry and that segment table was located by some domain’s DIB. The segment table entry was built pursuant to an execution of SEAP that provided the value found in PTBGKEY at the time of the fault. This value was established either (1) by encountering the background key in node X during that execution, or (2) by that value having been copied from STBGKEY upon initiation of SEAP.

In the first case the format key of node X will be related via depend to the segment table and X’s displacement will invalidate that segment table entry thus preserving chmem even while PTBGKEY points to a non-node.

In the second case the SEGTABPP of the DIB will be changed to NULSPPT and chmem still holds because STBGKEY and PTBGKEY will agree and SEAP, starting from the segment table will reach the page table leaving BGKEY unchanged. The test will have succeeded even while STBGKEY and PTBGKEY point to non-keys.

In either case the tables may come back into play but only if it should happen that X’s old frame comes to be occupied by some node with a background key in the same slot. Any page table entries will now be built from the new key.

About free page table frames

Each page table frame {between PT and PTEND} is either on the free chain and has its FREEPAGETABLE bit on, or it is not on the chain and that bit is off.

See (free-pt) for a formal version of this.

The free chain is rooted at PTFRAMEHD and linked through PTP.

For any prepared domain D and any virtual address A, if D’s DIB designates a segment table and address A is valid according to that segment table, then each of the segmode keys of the access path {(p1,accesspath)} is involved. The keys in the 15th slot of red segment nodes of the access path and the window keys consulted to define address A are also involved, as well as background keys. {A key is involved when DAT table information depends on it.} The page key at the end of the access path is involved too.

See (lmi) for more on mapping tables.

The “Produces” Relationship

See (prod-stot) for some implementation clues.

Contexts of Shared Memory Objects

We use the term “object” here to refer to either a page or a segment node. Objects produce tables (segment or page) that depend on the context of the object within a memory tree. When an object produces a table, extra information about the context of that object is recorded with the table to discern the circumstances where that table can be shared. {A segment node may have different LSS’s on different memory paths.}

The extra information is the LSS {not SSC!} of the segmode key in the memory tree that designated the shared node, the background key in effect there, and whether there were read-only bits in the path from the memory tree root. For this purpose, a page key virtually has an LSS of 2.

The context of a segment table produced by a segment node will have an LSS from 3 to 5, or an LSS=0 and SSC from 3 to 5. The context of a segment table produced by a page will have LSS=0. {Perhaps it should be 2 but consider 0 as merely a code for “produced by page”.} A segment table is produced by an LSS3 segment node when a key to that node is found in an LSS6 or greater segment node or in D3 of a domain root. In this case, only the first entry of the segment table is valid. {The segment table is degenerate.} A {degenerate} segment table is produced by a page when a page key is found in one of the same places.

A page table is either produced by a segment node with SSC3 or a page. A page table is produced by a page when a page key to that page is found in a segment node with LSS > 3.

The field PTBGNODE in a page table locates the node with the background key {it is involved}. The field PTFORMAT has the other information in the format of a segment key data byte. Likewise for segment tables with “ST” substituted for “PT”.

See (prod-chains) about chains to find produced tables from producers.

Relations Between Domains, Segment Nodes, and Tables

A segment table is (_active) iff it is designated by the DIB of some prepared domain. A page table is (_active) iff it is designated by an active segment table.

Every active segment table or page table is (_produced) by exactly one segment node or page {not respectively} or is a forsaken segment table or page table {(forsake)}.

If a node is prepared as a segment node and it produces tables, then those tables are appropriate to appear in the address map of a domain whose memory tree holds a segmode key to the node in a context like the context {(mem-context)} of the table.

If the DIB of a prepared domain designates a STOT entry other than NULSPPT, then C3 of the domain is involved. See (invmem). The DIB’s of several domains may designate the same STOT entry.

There is a one to one relation between STOT entries and segment tables.

There is a relationship called “produces” that relates pages and segment nodes {nodes prepared as segments} to segment tables and page tables. Each table {segment or page} is produced by just one producer.

A thing X produces a thing Y when Y is used solely to support the function of X.

Circumstances Under Which Things are Produced

A node or page produces a segment table when it is the first node or page on a memory tree path with SSC less than 6.

A node or page produces a page table when it is the first node or page on a memory tree path with SSC less than 4.

A node {SSC = 3} produces a page table and a page produces a degenerate page table.

Note that such a node or page can produce both segment and page tables according to the above rules!
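The two “first on the path” rules can be illustrated with a small sketch. It assumes that a memory tree path is given as the list of SSC values met from the root downward, and that a page counts as SSC 2 (by analogy with the virtual LSS of 2 mentioned in (mem-context)). The function name and the list representation are ours.

```python
def producers(path_sscs):
    """Return (segment table producer, page table producer) as path indexes.

    Each result is the index of the first object on the path whose SSC is
    below the threshold, or None if the path holds no such object.
    """
    seg = next((i for i, ssc in enumerate(path_sscs) if ssc < 6), None)
    pt = next((i for i, ssc in enumerate(path_sscs) if ssc < 4), None)
    return seg, pt
```

For the path [7, 5, 4, 3, 2] the SSC5 node (index 1) produces the segment table and the SSC3 node (index 3) produces the page table. For the path [3, 2] the same SSC3 node is the producer of both, which is the case remarked on above.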

It is important to find the tables produced by a given segment node or page, and it is important to find the segment node or page that produces a given table.

Finding the Producer

To this end, a page table has a field called (PTPRODUCER), and a segment table has a field in the associated STOTE called (STPRODUCER) that locates the page or segment node that produces the table.

Finding the Products: See (prod-chains).

Forsaken Segment Tables and Page Tables

See (map-age) for related material.

When reclaiming page table frames, it is awkward to find the STOTE locators in DIB’s to a degenerate segment table owned by the page table. Instead we change STPRODUCER of that segment table to NIL* and invalidate the entry to the page table. It is now a (_forsaken segment table.) Similarly, we must release a specific segment table when its producer is unprepared. Again we set STPRODUCER to NIL* and invalidate each entry. Forsaken segment tables can only be reclaimed by garbage collection. All of the page table pointers in this segment table are void.

In each case we defer the work of finding the pointers to a table (segment or page) by making all entries of the table invalid and canceling the producer pointer (mainly esthetics). The table is now called forsaken. The work is deferred until there are probably many such forsaken tables. The space holding pointers to such tables is now searched and pointers to forsaken tables are efficiently deleted.

The DEPEND Module

When a memory tree is consulted to build a page table or segment table entry, several slots of the memory tree may be consulted. Should these slots be modified, the corresponding entry may be affected. The (_DEPEND) module {(depend), (dependfunc)} is responsible for remembering part of the relationship between such slots and entries. This module also remembers when the field SEGTABPP of the DIB depends on such a memory tree slot. (whendepend) describes which relations DEPEND remembers. See (dependcl) for calls to the DEPEND module. See (depend) for more complete introduction. See (depend-wart) for some suggestions to improve performance. See (dt) for a history of design problems here.

The Use of Protection Keys

The storage key of a user page is used to prevent modification by users who are accessing the page via read-only keys and also to prevent modification by programs that are running while checkpointing or migration is going on. The checkpoint logic sets all changed pages to kernel-read-only and any programs that try to modify such a page are placed on the KROQUE. The migrate logic may also set pages to kernel-read-only after migration read is completed. If the page is still current, it is made generally available for read, but remains in the kernel-read-only state until migration write is also completed. See (kro) and (storkeyrule).

Allocation of Segment Tables

There is a table called the (_segment table origin table), or STOT. The first field of each entry of STOT is 4 bytes long and is suitable for control register 1. The first and remaining bytes of this field are called (SZCD) and (STORIGIN). The DIB of a domain holds the address of the STOT entry. The next field in the STOT entry (STPRODUCER) locates the real node that produced the segment table. Another field of the STOT entry is STFORMAT, which is from the data byte of the segmode key that caused this segment table to be constructed. This STOT entry is on the chain rooted in NFSDIB of the producing node and chained through NEXTSTOT of the STOT entries. If two DIB’s designate the same STOTE, they must have gained access to it through segmode keys with the same data byte (as attenuated in the NO-CALL and RO bits).

SZCD is (length of segment table)/64-1. {This is required by the format of control register 1.} SZCD is -1 {FF} when the table is empty. In this case, there are no domain DIB’s that designate that entry.
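As a worked example of the SZCD arithmetic, the sketch below computes the byte value from the table length. It assumes the length is given in bytes and is a multiple of 64; the function name is ours.

```python
def szcd(table_bytes):
    """Return SZCD as an unsigned byte for a segment table of table_bytes bytes.

    An empty table (0 bytes) yields 0xFF, which is -1 as a signed byte.
    """
    if table_bytes < 0 or table_bytes % 64 != 0:
        raise ValueError("length must be a non-negative multiple of 64")
    return (table_bytes // 64 - 1) & 0xFF
```

A table of 64 bytes gives SZCD 0, and a table of 1024 bytes gives SZCD 15. If segment table entries are 4 bytes long (our assumption here), the 16-entry table designated by NULSPPT is 64 bytes and so has SZCD 0.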

The segment tables themselves are in a block of storage called (_segment table space). These segment tables are in the same order as the corresponding entries in STOT. This table structure allows for segment tables of various and changing sizes. Segment table space may be re-allocated, but all processors must be stopped and their TLB’s purged.

The first member of STOT is called NULSPPT. This entry designates a segment table of 16 invalid entries. NULSPPT is designated by domain DIB’s when the domain’s memory tree root is unprepared.

When a page table is to be de-allocated, all segment table entries must be found. To do this, start at the segment node associated with the page table and walk the access tree rooted at that segment node. One sufficient strategy is to scan each segment table on the access tree and delete references to the departing page table.

See (shove) for a description of SHOVE, a routine that manages this space.

Allocation and Reclamation of Page Tables

Next to a page table is a field (PTPRODUCER) pointing to the LSS=3 segment node or CTE that produced this table. Also next to the page table is (PTSTATE), which indicates whether the table was built for domains with read-only or read-write access to the segment defined by that node. PTSTATE also indicates whether the table is in limbo. When the page table reclaimer {in MEMORY} considers a page table that is not in limbo, it is placed in limbo and all of its entries are made invalid. If such an invalid entry causes an interrupt, the entries are made valid again and the page table is removed from limbo state. If the reclaimer encounters a page table still in limbo, it will reclaim the page table’s frame.

The right three bits of PTSTATE estimate a time since last use for the page table. It is incremented by the page table reclaimer. When PTSTATE reaches 7 it is doomed. Before that its values can be rebuilt. See (ptstate).
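The limbo protocol can be sketched as follows. This is an illustration only: it models the two-step “place in limbo, then reclaim” behavior and leaves out the three-bit age held in PTSTATE. The class, its fields and the function names are assumptions, not kernel names.

```python
class PageTable:
    def __init__(self, frame_numbers):
        self.frames = list(frame_numbers)        # entry contents, never disturbed
        self.valid = [True] * len(self.frames)   # per-entry validity
        self.saved = None                        # validity before entering limbo
        self.limbo = False


def reclaimer_visit(pt):
    """One visit by the page table reclaimer; True means reclaim the frame."""
    if pt.limbo:
        return True                  # untouched since the last visit
    pt.saved = pt.valid
    pt.valid = [False] * len(pt.frames)
    pt.limbo = True
    return False


def fault_on_limbo_entry(pt):
    """An invalid entry caused an interrupt: the table is in use after all."""
    if pt.limbo:
        pt.valid = pt.saved
        pt.limbo = False
```

A table that is referenced between two visits of the reclaimer is rescued by the fault; a table that sees no reference between two visits is reclaimed on the second.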

Unpreparing and Rescinding

See (forsake) for related material.

One general problem described here is to find all pointers to a thing so that the space occupied by the thing may be reused for another thing. In the current case the thing may be a page, page table, or segment table. This scheme relies on the fact that pointers to one of these classes of thing are segregated so as to be easily searched. We batch the search to bound the overhead.

Reclaiming Segment Table Space (FREESTOTES in SHOVE)

There are two sorts of reclamation here: STOT entries and the space for the tables proper.

Both of the types of reclamation are accomplished by the same activity, but the criterion for the termination of the activity depends on the type of reclamation in progress.

Forsaken segment tables are freed first.

Then we visit the STOTE’s in a non-sequential but exhaustive order and reclaim the space. Upon termination of this activity, several segment tables have been sacrificed and all DIB’s are searched to ensure that no DIB designates a sacrificed STOTE. Such DIB’s are made to designate the constant invalid segment table.

Reclaiming Page Tables

To discern that a page table is unused, we mark the table and make the entries of that table invalid without otherwise disturbing the contents of those entries. When such an invalid entry of a table causes an interrupt, we immediately unmark the table and re-validate the entries of the table. If a table is still marked after a time, we will reclaim the space. If we search the space that holds all pointers to a given class of table, we are now in a position to invalidate the pointers to the table frames that we are about to reclaim. The essential point here is that one pass will find all pointers to that set of things that will be reclaimed on a given reclamation cycle.

See (pt-age) for more info.

Reclaiming Page Frames

Since pages cannot be marked invalid, another strategy is used. We make use of the reference bits by occasionally resetting them and at the same time indicating in the core table that the page is a candidate to be replaced.

Zapping Memory Keys in Slots {with DEPEND’s help}

What must we do when we store into a slot of a prepared segment node? This action must be immediately accompanied by the adjustment of any DIB, page table or segment table constructed by examining the prior contents of the slot.

For this discussion, “.entry” refers to either a segment or page table entry or the STOTE locator (SEGTABPP) in a domain’s DIB. A page table entry is considered different if it is intended to point to a page with a different storage key.

When the memory tree is consulted to build .entries, slots are examined. We say that the .entry (_depends) on the slot in this case. More accurately, an .entry depends on a slot if that slot would be examined while computing the value of the .entry.

We store part of this relation explicitly in tables {(dependfunc)} chained to make accessing by slot address fast. These tables are fixed in size, and extra room is made by zapping old members of the relation by zapping the indicated .entries. All of the slots upon which an .entry depends reside in prepared segment nodes or domain roots. All slots upon which an .entry depends are involved. See (dependcl) and (whendepend).
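The behavior of a fixed-size stored relation can be sketched as below. This is a simplified model and not the DEPEND module: it uses a linear scan rather than chains keyed by slot address, it evicts the oldest pair first (the text says only “old members”), and the names Depend, register, slot_changed and the zap callback are hypothetical.

```python
from collections import OrderedDict


class Depend:
    def __init__(self, capacity, zap):
        self.capacity = capacity      # the table is fixed in size
        self.zap = zap                # callback that invalidates an .entry
        self.pairs = OrderedDict()    # (slot, entry) pairs, oldest first

    def register(self, slot, entry):
        """Record that entry depends on slot, making room if necessary."""
        key = (slot, entry)
        if key in self.pairs:
            return
        if len(self.pairs) >= self.capacity:
            (_, old_entry), _ = self.pairs.popitem(last=False)
            self.zap(old_entry)       # a zapped .entry no longer needs its pair
        self.pairs[key] = None

    def slot_changed(self, slot):
        """A store into slot: zap every .entry recorded as depending on it."""
        for key in [k for k in self.pairs if k[0] == slot]:
            del self.pairs[key]
            self.zap(key[1])
```

The point of the sketch is that dropping a pair is always safe provided the .entry is invalidated at the same time: the next fault rebuilds the .entry and registers its dependencies afresh.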

Those relation members generated by the most common forms of path portions are omitted from the stored relation, because they can be discovered by efficient forms of tree climbing. Other relation members, sometimes called “obscure”, are recorded in DEPEND. (whendepend) is the official definition.

See (unprdep) for some related DEPEND strategy comments. See SUPERZAP {(superzap)} for function to complement the DEPEND relation. See (good-shrapnel) for design note. See (checkdep) for CHECKDEPS use.

Depend and Mapping Tables

See also (background), (bkt), (lmi), (bgx), (backimp), (peso), (dep-tab), (seaploop) and (bbk) concerning these issues.

Originally there were depend table entries that related slots to entire segment or page tables. Then came a design (that may never have been entirely correctly implemented) where the BGKEY field in table x was allowed to point to a slot in a node frame that had been displaced. It was argued that x held no entries that depended on the background key in that case. The discussion at (bgx) seems the clearest.

A scheme without such entries

Definition: An entry in a page table is said to be “BG” if the entry is valid and, starting from the table’s producer, to evaluate that entry, a background window key is encountered before any background key.

Imagine two assertions for a page (or segment) table:

For each BG entry of the table the field PTBGNODE locates a red node whose format key appoints slot b as the background key. Let w be the window key that exists by the definition of BG. Applying the offset in w to the key in b causes the consulting of a series of slots, starting with b. Each of these slots is involved(w) and associated with the entry in DEPEND.

If some segment table entry locates the page table then DEPEND will hold entries (to the segment table entry and) from each of the slots starting from the segment table’s producer thru the key to the page table’s producer.

Entries in tables cannot cause faults unless entries in higher tables refer to them. (The SEGTABPP is the table entry higher than a segment table.)

A surprising ramification of this is that for some period (while there are no BG entries in a page table) the node referred to by PTBGNODE is “loose” and may be displaced. Another node may be placed in the same frame and PTBGNODE will locate the new node. With some small probability the new node will also have a background key and a memory reference to the new node will access the producer of the page table. There is no harm in this. It is difficult and unnecessary to prevent this case.

Charlie remembers a scheme where the segment table entry designating the page table would have been related via a depend entry to the slot at b. This would require the key in b to be involved regardless of its type. We are not sure of the ramifications of involving arbitrary keys.

Either of these schemes should be tested against the logic at (bkt).

The Read-Only Problem

Due to the fact that storage protection on the 370 is associated with real storage instead of the access path between the process and storage {i.e., the segment and page tables}, we must adopt one of the following strategies to simulate the effect of the memory key’s read-only bit. This bit provides the effect of allowing one program to have read-write access to a page at exactly the same time that another program has read-only access.

We describe three alternatives. We have implemented (disjtwo).

Storage Key Switching

We assume that a program will never run unless it is intended that it have read access to all of the pages in its physical map. Thus the fetch protect bit in the hardware will never be turned on. This rule might be violated if we want to provide a virtual machine with fetch protect. We will not consider that problem now.

We might observe in a particular situation that a program has write access to everything in its map and run the program with PSW key 0. This does not seem to win in many real cases.

One scheme might be to run various programs with various keys at different times. This implies that real storage has a variety of storage keys. Before dispatching a program, it must be determined that the write access of that program will not exceed that defined in the Gnosis manual. To do this, we require two storage key value sets in each prepared memory node. Each of these sets is represented by a 16 bit boolean vector. These sets are called the (_total set) and (_vulnerable set). The total set indicates which storage key values correspond to real pages below this node. The vulnerable set indicates which storage key values are represented by pages below that require protection against writing according to the read-only bits in the path from the prepared memory node to the page. Thus if bit 4 of the vulnerable set of the root of the memory tree of a program is on, that program must not be allowed to run with PSW key 4. Obviously if all of the bits of the vulnerable set of the root of the memory tree are on, the program cannot be run until the memory tree is modified.

The total set of a prepared memory node is the union of the total sets of the nodes that this node points to with: node keys, read node keys, or memory keys. The vulnerable set of a prepared memory node is the union of the vulnerable sets of the same nodes ORed with the union of the total sets of those nodes which are pointed to by a memory node with the read-only bit on.
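The two set definitions can be sketched as a recursive computation over 16-bit masks. The sketch computes exact sets, whereas the text allows the stored sets to carry extra bits. The dictionary model of nodes and pages, the convention that bit k of a mask is 1 << k, and the function names are assumptions made for illustration.

```python
def sets_for(obj):
    """Return (total, vulnerable) for a page or a prepared memory node.

    A page is {"storage_key": k}; a node is {"slots": [(child, read_only), ...]}.
    """
    if "storage_key" in obj:                  # a page contributes only its own key
        return 1 << obj["storage_key"], 0
    total = vulnerable = 0
    for child, read_only in obj["slots"]:
        t, v = sets_for(child)
        total |= t
        vulnerable |= v | (t if read_only else 0)
    return total, vulnerable


def may_run_with(root, psw_key):
    """True if running with psw_key cannot write a page that needs protection."""
    _, vulnerable = sets_for(root)
    return ((vulnerable >> psw_key) & 1) == 0
```

For a node holding a read-only key to a page with storage key 4 and a read-write key to a page with storage key 1, the total set has bits 1 and 4 on and the vulnerable set has bit 4 on, so the program may run with PSW key 1 but not with PSW key 4.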

Sometimes there will be bits on in these sets that should not be on. The cause of this is that it may be too expensive to turn them off on certain occasions. The consequence of this is that the system will run less efficiently on occasion.

The following strategy seems to prevent “storage key thrashing” in the cases that come to mind. When a storage key must be changed in order to prevent write access, change it to 0. When there is a key mismatch that prevents a legitimate write, and the storage key is 0, change the storage key to match the PSW key. If the storage key is not 0, then with probability .05 change the PSW key to match the storage key, otherwise change the storage key to match the PSW key.
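The mismatch rule can be written down as a small function. This is a sketch of the rule as stated, with the random draw passed in so that the behavior can be examined deterministically; the function and parameter names are ours.

```python
import random

PROTECT_KEY = 0  # a storage key changed to prevent write access becomes 0


def resolve_mismatch(storage_key, psw_key, draw=random.random):
    """Return (storage_key, psw_key) after a mismatch blocks a legitimate write."""
    if storage_key == PROTECT_KEY:
        return psw_key, psw_key          # storage key follows the PSW key
    if draw() < 0.05:
        return storage_key, storage_key  # occasionally the PSW key moves instead
    return psw_key, psw_key
```

In every case the two keys agree afterwards, so the retried write succeeds. The occasional change of the PSW key lets a program drift toward the key already held by the pages it writes.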

Two Page Tables and Two Keys

In this scheme, the user mode PSW key is always 1. For some segments, Gnosis keeps two page tables, called R and W. For each page of the segment either: the page has storage key 0 and is valid in both W and R, or the page has storage key 1 and the page is valid in W and invalid in R.

A program that is to run with write access to the segment is given access via page table W. If the program attempts to write into a page with storage key 0 and the segment node responsible for the page table has write access to the page, the storage key is set to 1 and the page made invalid in R {and any other R type page tables that may be associated with other segment nodes}.

A program that is to run with read but not write access to the segment is run with access via R. If access is made to a page that is invalid in R but valid in W, then the storage key is set to 0 and the entry in R is made valid.
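The two transitions of this scheme, and the per-page condition they preserve, can be sketched as follows. The Page class and the function names are assumptions, and the sketch tracks a single R table rather than every R type table that may exist.

```python
class Page:
    def __init__(self):
        self.storage_key = 0
        self.valid_in_W = True
        self.valid_in_R = True


def invariant(page):
    """Key 0 and valid in both tables, or key 1 and valid in W only."""
    if page.storage_key == 0:
        return page.valid_in_W and page.valid_in_R
    return page.storage_key == 1 and page.valid_in_W and not page.valid_in_R


def write_fault_via_W(page, node_has_write_access):
    """A writer using W hit a key 0 page; return True if the write may proceed."""
    if page.storage_key == 0 and node_has_write_access:
        page.storage_key = 1
        page.valid_in_R = False
        return True
    return False


def fault_via_R(page):
    """A reader using R hit an entry that is invalid in R; True if repaired."""
    if not page.valid_in_R and page.valid_in_W:
        page.storage_key = 0
        page.valid_in_R = True
        return True
    return False
```

Since the user mode PSW key is always 1, a page with storage key 0 cannot be written by any user program, and a page with storage key 1 is reachable only through W.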

Two Disjoint Page Tables and Two Keys

This system is like the above {(twokv)}, except that pages that would be designated by both tables above would be designated in table R and invalid in W. The disadvantage is that when switching between programs that alternatively have read-only and read-write access to the segment, storage key thrashing will occur if both programs are accessing the pages. See (pagerules) about some of the details.

Implications of Background Keys {(p1,backkey)}

Table Dependencies {pre-implementation design considerations}

Page and segment tables are specific to a background key. E.g., two users sharing an LSS3 segment node can not share a page table if they have different background keys for that 64K.

Perhaps tables with no slots that depend on the background key can be a special case. They can be shared and may be very common. (They are not shared.)

Even easier would be to share tables only if they had not held “background dependent entries” since they were allocated.

Even so, the background dependent tables must be designed. This implies that the table DIB’s {STOT entries for segment tables and PTFRAMES of page tables} would have to be expanded to reference the controlling background key.

The problem of finding the table entries that depend on a slot in a node does not seem to be compounded by background keys.

Charge Sets

Idea # 1 {Held in abeyance}

Charge Set States

Initially, there will be a fixed number of charge sets, and the scheduler will hold the only keys to them. They will each be implemented with a fixed core page frame.

Tentatively, a charge set will be implemented by a set of CDA’s in order. Each CDA will occupy 3 or 4 bytes {to be decided before implementation}. The CDA’s will be stored in order. To make a binary search work, we allow duplicate entries, although logically a page only belongs to a C.S. once. The logical population of the charge set is maintained to guide the strategy of hole finding.

An alternative, perhaps intermingling several chargesets, is with chained lists.

Charge Set State Transitions

This is one specific proposal. Every 32 seconds of real time, we turn on the invalid bit in every page table entry. At this time we consult each prepared charge set. If it has made at least 4 seconds of headway since it was last (_bumped), we bump it again. Associated with every charge set is a (_state cursor) that cycles around eight states, 0 through 7. Every member of the charge set {pages or nodes} has a state value associated with it. Bumping a charge set consists of incrementing the state cursor of the charge set and then deleting all members from the set whose state is the same as the new value of the state cursor.

When a translation exception occurs due to our invalid bits, we revalidate the entry and consider the real address of the involved page, consult the core table to find the coded disk address of the page, consider the charge set of the domain that was trapped, and look up the page in the charge set {via its CDA}. We then copy the current state cursor for the charge set into the member state. This is a close approximation to the LRU algorithm.
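The state cursor mechanism can be sketched as below. Members are identified by CDA as in the text; the class and method names are ours, and the sketch assumes that a reference to a page not yet in the set adds it.

```python
class ChargeSet:
    def __init__(self):
        self.cursor = 0      # state cursor, cycles through 0..7
        self.state = {}      # CDA -> state value of that member

    def reference(self, cda):
        """A translation exception touched this page: stamp it with the cursor."""
        self.state[cda] = self.cursor

    def bump(self):
        """Advance the cursor and delete members whose state equals the new value."""
        self.cursor = (self.cursor + 1) % 8
        dropped = [cda for cda, s in self.state.items() if s == self.cursor]
        for cda in dropped:
            del self.state[cda]
        return dropped
```

A member survives seven bumps after its last reference and is deleted on the eighth, when the cursor returns to the value with which the member was stamped.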

Implementation for ideas described in (p2,soon-sets).

Assume some small fixed number of active charge sets.

Let CSREP be an array with an entry for each active distinct charge set. Indexes into this array will serve as kernel charge set designators.

A CSREP entry will hold a “charge set identifier” {CSID} that must match a field in a charge set key.

These charge set identifiers will be assigned by a domain using a “charge set tool” to avoid the chore of keeping a new piece of permanent Gnosis state on the disk.

The CSREP will also hold the current page count for the set.

The actual CDA’s in the charge set will be kept in Charge Set CDA Space {CSCDAS} with link fields to chain them off the CSREP. New entries will be placed at the tail of the chain. Old entries will be placed at the head of the chain if they are referenced. All chains will end in a dummy entry.

CSCDAS will be allocated out of a fixed space and shared between all CSREP’s.

When space is required, some charge set will be made inactive. A LRU counter in each CSREP will be used to select the charge set to be inactivated.

Two kinds of space are involved here: CSREP entries and CSCDAS.

When a charge set is activated, the current TOD clock value will be placed in the CSREP. This value will be available to the key holder to allow him to determine if the set has remained active over some duration of interest to him.

We add to the DIB a CSREP entry locator. It may be null, which is represented by a zero entry.

This value is set up whenever CPUCACHE is set up.
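
A minimal sketch of the CSREP table as described so far, under several assumptions: the names CsrepTable and activate are hypothetical, a simple counter stands in for the LRU counter, and slot 0 is left unused so that a zero locator can mean null. CSCDAS chains are not modelled.

```python
class CsrepTable:
    """Hypothetical fixed table of active charge sets; locator 0 is null."""

    def __init__(self, size):
        self.entries = [None] * (size + 1)   # slot 0 is never used
        self.clock = 0                       # stand-in source for LRU values

    def activate(self, csid, tod):
        """Return (locator, evicted_csid); evicted_csid is None when no
        other set had to be inactivated to make room."""
        self.clock += 1
        free = None
        victim = None
        for i in range(1, len(self.entries)):
            entry = self.entries[i]
            if entry is None:
                if free is None:
                    free = i
            elif entry["csid"] == csid:
                entry["lru"] = self.clock    # already active: note the use
                return i, None
            elif victim is None or entry["lru"] < self.entries[victim]["lru"]:
                victim = i
        evicted = None
        if free is None:
            # No room: inactivate the least recently used set.
            free = victim
            evicted = self.entries[victim]["csid"]
        # The TOD value recorded here lets a key holder see how long
        # the set has stayed active.
        self.entries[free] = {"csid": csid, "lru": self.clock,
                              "tod": tod, "pages": 0}
        return free, evicted
```

The returned locator is what would be placed in the DIB; a real inactivation would also have to invalidate table entries, which this sketch leaves out.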

{ni}Alternate Representation of Sets

We assume a hash table with quick access given the pair (set identifier, page cda). The hash table entry is (set identifier, page cda, hash link, same set links). The CSREP entry is (ends for “same set links”, TOD value, page count). The table of hash chain heads is indexed by a hash of set identifier and page CDA and holds a CSREP entry locator.

Another approach would be to sub-allocate sections of a vector of CSCDAS to each CSREP. Within each section, the CDA’s would be in sort order, allowing a binary search. When a new entry is added, the entries would be moved to allow the insertion. If the current section abuts another section, then the space would be re-organized. This scheme resembles the Tymnet I node buffer allocation scheme and the scheme used in the TYMCOM 370 to allocate typeahead buffer space.

We may be able to put the set representations in a different storage key.

New Code

Whenever a meter key is involved, the charge set ID in the DIB is compared with the charge set derived from a traverse of the meter chain. If they are different, all the page table entries for that domain are invalidated.

When a page-fault is repaired, the page is added, if necessary, to the charge set indicated by the set locator in the DIB of the actor {(add-page)}. If the charge set is inactive, it is activated {(act-set)}.

Charge Set Inactivation

There are two techniques used to inactivate a charge set. The first is used for the external call on the charge set key or if space is needed to remember additional CDA’s for another charge set. The other is used if the space for the charge set representation itself is needed.

When a set is inactivated for the first reason, all pages in the set are DETACHED. Furthermore all the page table entries that refer to those pages are invalidated. This is so those pages will be added to the new charge set if they are subsequently referenced. This is done with a call to CLRPAGTB {(clrpagtb)}. The space occupied by the set representation is reclaimed.

When the charge set representation itself must be reclaimed, CLRALLTB {(clralltb)} is called to clear out page tables, STOTE’s, and DIB’s for the charge set. The storage used to represent the pages and the charge set itself is then reclaimed.

When a set is activated, it may be necessary to inactivate another set {(inact-set)} to make room in the CSREP.

In order to add a page it may be necessary to make room in CSCDAS space. This may be done by inactivating another set.

If we implement the function described in (p2,old-page) {with the ideas of (set-age) perhaps}, we would need the two-way set links. Otherwise, one-way set links will do.

A “set link” or “same set link” is a CSREP locator in a CSREP to another entry for the same set.

When a CPU timer interrupt occurs, the field CPUCACHE is refilled. {This is current logic.} As the routine BORROW seeks the prime meter, it now also notices meters with charge sets. The first charge set key encountered is involved {(invcsk)}, and the associated set is activated, and its CSREP index is placed in the DIB.

This section is redundant but ties a few things together:

A node X is active when there is a node Y with an involved key to X. This means that node Y is prepared and holds information derived from node X. This means that node X must not be changed without considering the state of all of the nodes with involved keys to X. There is a general routine to de-activate a node by unpreparing all of the nodes with involved keys pointing to it.

A prepared key contains a form of core address of the designated node or page. It also holds the FBlinks that chain the other prepared keys to that node or page.

A key is involved if information from the designated node is held in the prepared form of the node that holds the key.

If a key designates a node and is prepared, it holds the item index of the designated node and is chained together with other prepared keys to that node. If {and only if} a key designates a node and is involved, there is information gleaned from the designated node that contributes to the prepared state of the node that holds the key. If a key designates a page and is prepared, it holds an item index that points to the item that holds the core table entry for that page. Such a key is involved if the contents have been consulted in the preparation of the node holding the slot.

Lock Bytes

{See also (locksx).}

Lock bytes are partly in anticipation of multiprocessor configurations. The transformations of nodes between prepared and non-prepared require several intermediate states that must not be observed by other processors. It is required that Gnosis lock a node before reading or writing that node. In the interim, the lock bytes serve another vital function. When a piece of Gnosis is running and is considering several nodes at once {such as preparing a node}, it is usually necessary that each of these nodes be distinct. Locking each node guarantees this. {Deadlocks are avoided by the logic of section (locks).}

Processors will not lock resources for long periods of time. A processor will not leave nodes locked while it waits for I/O or for nodes to be unlocked.

A processor will leave a node locked while it is running that node. There is a table of real page zeros for the respective processors. In each real page 0, ROOTNODESV holds the node designation of the node that it is running, or about to run, or has just finished running, if any. This way an external call between processors can handle the case of one processor wanting to unlock a node that is being run by another processor.

While a lock is not locked, it holds a LRU clock value.

ORGANIZATION OF THE CODE OF THE KERNEL

Linkage Conventions and Reentrancy, etc.

These are some ideas that organize the various modules that constitute the kernel of Gnosis.

Problem # 1

How are we to cope with the fact that sometimes the node that we need turns out not to be in core? We are liable to find ourselves with some unfinished business, in the middle of some clever algorithm, several subroutines down, and not be able to proceed, and unable to back out. Yet we must proceed with other things while the node is coming in.

Problem # 2

How do we discover that some node of the jumper is the same as some node of the jumpee during a gate jump? The gate jump code won’t work correctly in such cases.

Problem # 3

What happens if we have two processors that, during simultaneous units of transformation, try to transform the same node?

Problem # 4

How do we tell that a node has been unused for a long time?

Without exploring here the various methods that have been used in other operating systems, we will describe one that seems simple now and solves these four problems.

This idea constrains the set of algorithms. The real 370 encounters some of these same design dilemmas. Two cases from the 370 are instructive, MVCL and TR. The MVCL instruction may use more pages in its execution than any real 370 has ever had installed. Yet it works correctly in these cases. It does this by defining the command such that there are “units of execution” between which the command may be interrupted. The TR command is more nearly similar, however, to the problems of Gnosis. The TR command makes references to memory that are data dependent. That is to say that the bits of the command and contents of registers are not sufficient to determine which memory addresses will be used in the execution. To make matters worse, the command destroys some of its input as it goes, eliminating the possibility of going back to the beginning of the command if a real page fault occurs. The solution for the 370 is first to test whether there may be a priori reasons why no page faults are possible {all pages of the table are in}, and, failing that, do a “dry run” to determine whether the real data will cause parts of the table not in core to be referenced. If such is referenced, the page fault is caused before any data is changed.

I believe that the TR solution will generalize to solve most of the Gnosis problems with page and node faults. This requires that some routines be divided into two passes, the dry run and the real thing. If, during the dry run, all relevant nodes are locked, we also solve a class of deadly embrace problems between two processors. Like on a narrow road, if either party can back up {and does when he meets someone}, the conflicts are not fatal.
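
The two-pass idea can be sketched as below. This is a simplified model under stated assumptions: core is a dictionary standing in for the nodes currently in core, each node carries a boolean lock, and the names Backout and transform are hypothetical.

```python
class Backout(Exception):
    """Raised during the dry run when the unit of operation must back out."""


def transform(core, wanted, action):
    """Run one unit of operation in two passes.

    Pass 1 (dry run) locks every needed node and changes nothing else.
    Pass 2 (the real thing) runs only if every node was present and free.
    """
    locked = []
    try:
        for name in wanted:
            node = core.get(name)
            if node is None:
                raise Backout("node not in core: " + name)
            if node["locked"]:
                # Either another processor holds it, or the same node
                # was named twice; in both cases we back up.
                raise Backout("node already locked: " + name)
            node["locked"] = True
            locked.append(node)
        action(*locked)
    finally:
        # Release everything locked during the dry run.
        for node in locked:
            node["locked"] = False
```

Because the dry run modifies nothing but lock bytes, backing out is always possible, and naming the same node twice is detected as a lock conflict rather than corrupting the transformation.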

Primary I/O is handled a little differently; the I/O is done as part of the “dry run” even though it may be irreversible. See (p2,iokeys). An I/O key is in one of three states, viz., “device busy”, “status”, and “idle”. In the idle state, if a jump to perform I/O occurs, the I/O is initiated, the jumper is enqueued on an item index associated with the device, and the key enters the device busy state. In the device busy state, any jump to perform I/O is an error. When the device becomes not busy {an interrupt occurs}, any information that needs to be saved is saved, the key enters the “status” state, and the jumper is restarted. In the status state, if a jump to perform I/O occurs, no I/O is done, the parameters are ignored, the jumper receives the saved information, and the key enters the idle state. The same command to invoke the key is thus executed twice per device operation.
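
The three states of an I/O key can be modelled as a small state machine. The sketch below is illustrative only: IoKey, jump, and interrupt are hypothetical names, start_io is a caller-supplied stand-in for initiating the device operation, and the error in the device busy state is represented by an exception.

```python
class IoKey:
    """Hypothetical model of an I/O key: idle, device busy, or status."""

    def __init__(self):
        self.state = "idle"
        self.waiter = None   # jumper enqueued on the device
        self.saved = None    # information saved at interrupt time

    def jump(self, jumper, params, start_io):
        if self.state == "idle":
            start_io(params)             # the I/O is initiated
            self.waiter = jumper         # the jumper is enqueued
            self.state = "device busy"
            return None
        if self.state == "device busy":
            raise RuntimeError("jump to perform I/O while device busy")
        # Status state: no I/O is done and the parameters are ignored.
        info, self.saved = self.saved, None
        self.state = "idle"
        return info

    def interrupt(self, info):
        # The device has become not busy.
        assert self.state == "device busy"
        self.saved = info
        self.state = "status"
        jumper, self.waiter = self.waiter, None
        return jumper                    # the jumper to restart
```

The restarted jumper re-executes the same jump, which now finds the key in the status state; this is how one device operation comes to execute the invoking command twice.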

Two ways to unlock all of the nodes. (We use the second.)

A general facility may be of use here: a table of lock addresses locked by the current program, and 3 registers describing the current state of that table. If it is understood that any locks were done at the time of a dry run and that the table and table descriptor were always valid, then the FINDNODE routine would be able to abort, cleanly, when the node was unavailable and to release the locked nodes. The process would be restarted when the node arrived, because the item index of the root node is kept in the IOBLOCK. In case all of the nodes were available, a general routine may be used to release the nodes at the end of the program. The current plan is to leave the root node locked while the processor is running in it. For multiprocessor configurations we may unlock the root node but leave involved keys to it from the CPU.

The alternative is to have the calling code remember how to unlock everything. This is a bit tedious but it does work. It is bug prone but the bugs are soon manifest in CHECK.

Linkage Conventions

We assign, in real page 0, a fixed work space for each Gnosis module. This space serves as a register save area as well as temporary space. These spaces need not be disjoint, but they are. Routines that are not simultaneously active could have overlapping work spaces. Such routines are reentrant in the sense that two processors may be executing the same routine at once because they have different page zeros. BALR 14,15 or its equivalent is the standard (but not exclusive) linkage convention.

Locks for MP

Initial study of the MP problem has suggested the following style of lock bytes for page frames and perhaps node frames.

Frame States

These ideas are elaborated at (flp).

We name here a set of states that a frame can take on. The variable FRAMESTATE in the frame will hold a code for this state and be manipulated mainly by CS instructions. Some of these states will be “exclusive” meaning that some particular CPU is executing kernel code on their behalf. The state code will be indicative of which kernel code but not of which CPU. This design is for pages and perhaps for nodes.

It may be desirable to know which CPU holds the lock for the exclusive state. Either CPUX asking for exclusive on a frame held by CPUX represents a bug or the lock can be quickly granted (but beware the double unlock problem).

NULL: The frame is occupied. There are no “core locks” or “exclusive locks” in effect.

AL1, AL2, .. ALn: There are n readers of the frame and no writers. There are thus n reasons not to deallocate this frame just now. None of them require exclusive access to the frame’s contents. Some CPUs may count for more than one reader according to the logic of the kernel code being executed. ‘AL’ stands for ‘allocation lock’. If the frame holds a node the slots are logically fixed which means that a key cannot be replaced by another. The representation of the key may be changed however regarding prepared and involved.

Decide if a frame with a shared lock may hold an involved key with the “don’t read” bit zero. Why not? It might require exclusive access to de-involve the key.

EX: Some CPU has gained exclusive access to this frame. It is thus in a position to change its logical contents. Prepared slots in the frame are still protected from representation changes.

OLD1, OLD2, ..OLD6: GSPACE steps thru these states in its quest for really old frames.

SEVERING: Some CPU is zapping page table entries to this page frame. Only for page frames.

FREE: This frame is unallocated.

“To hold a lock” means to have put a frame in state EX or to have incremented the shared lock count and have put the frame in some state AL1, AL2, etc.
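
A sketch of how the NULL, ALn, and EX states might be manipulated with a compare-and-swap primitive. Everything here is an assumption made for illustration: the Frame class, the tuple encoding of ALn, and the helper names are hypothetical, the CS instruction is simulated with a Python lock, and the OLDn, SEVERING, and FREE states are simply treated as not lockable.

```python
import threading

NULL, EX = "NULL", "EX"


class Frame:
    def __init__(self):
        self.state = NULL                 # plays the role of FRAMESTATE
        self._guard = threading.Lock()    # only makes the simulated CS atomic

    def cs(self, expected, new):
        """Simulated compare-and-swap on the frame state."""
        with self._guard:
            if self.state == expected:
                self.state = new
                return True
            return False


def lock_shared(frame):
    """NULL -> AL1, ALn -> ALn+1; fails in any other state."""
    while True:
        s = frame.state
        if s == NULL:
            new = ("AL", 1)
        elif isinstance(s, tuple):
            new = ("AL", s[1] + 1)
        else:
            return False
        if frame.cs(s, new):
            return True


def unlock_shared(frame):
    """ALn -> ALn-1, AL1 -> NULL."""
    while True:
        s = frame.state
        if not isinstance(s, tuple):
            raise RuntimeError("no shared lock to release")
        new = NULL if s[1] == 1 else ("AL", s[1] - 1)
        if frame.cs(s, new):
            return


def lock_exclusive(frame):
    return frame.cs(NULL, EX)


def unlock_exclusive(frame):
    return frame.cs(EX, NULL)
```

Note that this sketch gives a waiting writer no priority; the adjunct to the ALn states discussed next would be one way to add that.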

We may need an adjunct for each of the ALn states indicating that some actor wants an exclusive lock on this frame. In that case no new shared locks would be established. Perhaps the actor can inhabit the backchain of the locked node.

This need depends on the strategy chosen when a lock is not available.

Perhaps we need to lock the domain annexes while a domain is running? If not, why not?

Strategy

Obey the following rule: Don’t unprepare a key unless you have an exclusive lock on either the frame that it is in or the frame it designates.

Having locked a node frame n, to lock a frame f designated by a key in a slot of n: concurrently copy the right half of the key to HALFKEY. If HALFKEY is prepared consider the frame g that its pointer points to. Attempt whatever lock you want on g. If you succeed compare HALFKEY with the original half slot. If it is the same, f=g and you have found and locked your node.

After locking g you have locked both the holding node frame and the designated frame. No other CPU holds an exclusive lock to either frame. Thus no other CPU can unprepare the key during the interval beginning when you acquire the lock at frame g and ending when you do the compare.
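
The copy, lock, and compare sequence for a prepared key can be sketched as follows. The names HalfKey, Slot, and lock_designee are hypothetical, try_lock and unlock are caller-supplied stand-ins for whatever frame lock is wanted, and releasing g when the compare fails is an assumption (the text above does not say what to do in that case).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HalfKey:
    prepared: bool
    pointer: object      # the frame designated when prepared


class Slot:
    def __init__(self, right_half):
        self.right_half = right_half


def lock_designee(slot, try_lock, unlock):
    """Return the locked designated frame, or None if it was not obtained.

    The caller is assumed to hold a lock on the node frame containing slot.
    """
    hk = slot.right_half          # stands in for the concurrent copy to HALFKEY
    if not hk.prepared:
        return None               # the unprepared path is not modelled here
    g = hk.pointer
    if not try_lock(g):
        return None
    if slot.right_half == hk:     # key unchanged since the copy: f = g
        return g
    unlock(g)                     # assumption: back out and let the caller retry
    return None
```

The compare after the lock is what closes the window between copying the half key and locking g, during which another CPU could still have unprepared the key.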

If HALFKEY is not prepared lock the slot by putting 31 in KEYTYPE with CS (keeping the old value, of course). If KEYTYPE already has 31 then someone else is preparing the key. Wait if you wish. If you succeeded in replacing KEYTYPE with 31 then you have exclusive access to the slot. You can now consult the CDA hash chains to locate a frame if some frame holds your object.

If this is the only use of 31 in the keytype field, and if the holder of the “31 lock” releases the lock when he determines that the designated frame is not in a hash chain (a short operation), then a spin lock on the field is probably a reasonable way to handle the lock conflict.

Proc cs = (Ref Int i, Proc(Int)Int p)Void: [function to do the 370 CS instruction];
(Struct(Ref Frame f, Bool prepared) hk =
    [right half of key slot. This binding occurs with a concurrent access.];
 If prepared Of hk Then cs(frame state Of f, (Int s)Int: (If s = null Then exclusive Else Go To excluded));

         MVC   HALFKEY,SLOT+8           concurrently copy prepared bit and ptr
         TM    HALFKEY+7,PREPARED       was it prepared
         IF NZ
         L     1,HALFKEY                -> frame where designee was
         L     2,FRAMESTATE-FRAME(1)    probably still there
         CH    2,=Y(FREE)               Are we too late?
         BNH   JUSTMISSEDTHEBUS         FREE or SEVERING; incredibly rare
         (is it one of the AL values?)
         IF YES
         Put next higher AL value in R3
         ELSE

Real Storage Assertions

If slot s in frame n is prepared and holds a key to frame m then the pointer in s designates m and m’s FRAMESTATE is

Singly Linked Chains

This is a method of adding elements to the head of singly linked chains. Let HEAD be the head of such a chain and let R1 hold the address of a BLOCK to be added at the chain head. The blocks are linked thru a word LINK in the block.

For example:
         L     R2,HEAD
LP       ST    R2,LINK-BLOCK(,R1)
         CS    R2,R1,HEAD
         BNE   LP

This should work for adding elements to the page and node hash chains at least. It is also compatible with an uninterlocked chain search which includes an explicit chain termination test in the loop. Deleting an element from the chain may be done with a global lock. That occurs only as often as I/O occurs and only lasts a few micro-seconds.
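
For readers less familiar with CS, the same head insertion can be modelled in Python. This is a sketch only: the CS instruction is simulated with a lock inside the hypothetical Chain.cs method, and Block, push, and walk are illustrative names.

```python
import threading


class Block:
    def __init__(self, value):
        self.value = value
        self.link = None              # the LINK word


class Chain:
    def __init__(self):
        self.head = None              # HEAD
        self._guard = threading.Lock()

    def cs(self, expected, new):
        """Simulated CS: store new in HEAD only if HEAD is still expected."""
        with self._guard:
            if self.head is expected:
                self.head = new
                return True
            return False


def push(chain, block):
    old = chain.head                  # L   R2,HEAD
    while True:
        block.link = old              # ST  R2,LINK-BLOCK(,R1)
        if chain.cs(old, block):      # CS  R2,R1,HEAD
            return
        old = chain.head              # a failed CS leaves the new HEAD in R2


def walk(chain):
    """Uninterlocked search with an explicit chain termination test."""
    block = chain.head
    while block is not None:
        yield block.value
        block = block.link
```

The block's link is always set before the block becomes reachable from the head, which is why a concurrent searcher never sees a half-built element.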

Compare the global lock strategy with local locks in the terminator(s). Another problem is one CPU requesting that a page or node be brought into memory because it was not on the chain while another CPU is bringing it in and just hasn’t quite added it to the chain. The problem is to avoid having two copies. See {(paging-locks)} for a discussion of solutions to this problem.

MP, Locks and Assertions

This is an attempt to explain some of these ideas in terms of assertions. Indeed we will try to make some assertions that might be checked by some sort of program.

Imagine that some processor signals all other processors to stop. To make the strongest assertions it would be good to stop them between memory references, but we must settle for stopping between units of operation. A version of CHECK would run (perhaps collectively) that would ignore locked things. Once per checkpoint a routine of check would run that waited for all processors to finish their “kernel units of operation” and then insist that there be no locked things.

Hierarchy

Haberman has suggested that constructs should be strictly ordered such that a lesser construct does not depend on a greater construct for its implementation. This section is a first attempt to sort the modules of Gnosis into such an order.

Gratis Nodes

This is a category of nodes that do not require meters to pay for their residence in core; they need not belong to charge sets.

Meters

Meters are built out of gratis nodes.

Normal Pages and Nodes

Meters pay for the bringing and holding of most pages and nodes in core.

The Memory Tree Interpreter

This program is given a memory key and address and returns either a page key or gate key to some segment keeper.

Domain Control

These programs cause domains to run.

Decongester, Primordial Space Bank, Primordial Domain Creator, Space Bank Transformers, and Transformed Space Banks.

Scheduler

This is the program that is outside the kernel and manages the resources on a gross scale.

Passing Strings Via Addresses

When, in a gate jump, the jumpee accepts the string by specifying a virtual address, the code {called “MEMORY”} that normally interprets the memory tree of the actor is required to interpret the memory tree of the jumpee, who has no process.

Relevant Facts About the Current MEMORY

Just one node is locked at the beginning of the execution of MEMORY, namely, the domain root of the domain whose action caused the requirement for the page.

MEMORY tries to restart the domain as quickly as possible if it is able to remove some obstacle.

As MEMORY runs, there are three generally used designators by which MEMORY was designed to remember its subject {a domain}: control register 1, PZDOMAINDIB, and R1, which usually contains PZDOMAINDIB.

Let z be the high bit of 90. {Currently z is 0 or false.} When the machine is in privileged state and (ia = SEGTRANSEXNOPER or ia = PAGETRANSEXNOPER), then:

z or (PZDOMAINDIBP points to the DIB of a prepared locked domain called here “the domain”).

z or (Putting the domain on the CPUQUE and unlocking it is a safe {but unproductive} transition).

TRANSEXCEPTADDR holds an address that the domain requires and (ia = SEGTRANSEXNOPER and that address is invalid according to the domain’s segment table or ia = PAGETRANSEXNOPER and that address is invalid according to a page table).

More accurately, we must say that the process has required it {and probably still requires it}.

Thus, either z or it is safe to jump to a segment keeper that this address determines.

Control register 1 holds the address of the domain’s segment table.

Cell ACTOR holds the address of the root of a domain with an unregistered process whose action caused {somewhat indirectly perhaps} the address to be required.

This means that this is the domain that should be presented as the actor to the paging subroutines.

A statement at a different level that is like the above is:

z and (ACTOR designates the domain whose memory tree is under consideration) or not z and (ACTOR designates a jumper).

I think that the exits from the current MEMORY are:

The LPSW at label LP2,

This LPSW is encountered when MEMORY has removed some obstacle and has not otherwise meddled with the domain that it presumes has caused the fault.

We could get back control here by putting a kernel PSW in the DIB. This is slightly gross.

I think we will test bit 0 of 90 instead.

RESUME,

This is used when the domain has been given a trap code.

We could afford a test of byte 0 of TRANSEXCEPADDR here.

INVOLVE which sometimes doesn’t return,

To get a node or page.

We use a page zero cell, ACTOR, to address the actor explicitly.

KEYJUMP which doesn’t return,

For doing the gate jump to the segment keeper.

We could afford a test of byte 0 of TRANSEXCEPADDR here.

The LPSW for the Hyper-jump {not relevant here},

PERSVC {not relevant here},

HYPERTO {not relevant here}.

While MEMORY runs, the kernel must retain designators to the jumper and jumpee.

MEMORY is concerned directly with the memory tree of the jumpee. It needs the address of the STOTE entry and the top node of the memory tree. It is currently coded with the idea that it has the addresses of the DIB.

I think that the correct way to use MEMORY for bringing the required pages to core is to feign a read-request.

The way that MEMORY is now coded, the pages will have storage key 1 just in case the string can be accepted.

Distinctness of Argument and Parameter Pages

There may now be two argument pages and two parameter pages. What are the problems caused by indistinctness of these pages?

Since the argument pages only provide values, there is no external design problem caused by two argument pages being the same.

Suppose a page occurs at consecutive page addresses in the jumpee’s space and the parameter crosses the page boundary between those addresses. There is no problem, since argument strings are limited to 4096 bytes, and thus no byte of that page gets more than one byte value.
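
The arithmetic behind this claim can be checked directly. The sketch below assumes 4096-byte pages and a string of at most 4096 bytes written starting at some offset in a page that is mapped at two consecutive page addresses; the function name is hypothetical.

```python
PAGE = 4096


def offsets_touched(start, length):
    """Offsets within the single underlying page that receive a byte when
    a string of the given length is stored starting at offset start."""
    return [(start + i) % PAGE for i in range(length)]


# Any run of at most 4096 consecutive addresses covers each offset at most once.
offs = offsets_touched(4000, 4096)
all_distinct = len(set(offs)) == len(offs)
```

A string of 4097 bytes would wrap onto its own first byte, which is why the 4096-byte limit matters here.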

We have decided that when an argument comes from two pages, we will change things so that it comes from one. We may thus talk of one argument page.

The transformation referred to above can also be used to solve problems introduced by the coincidence of the argument page and a parameter page.

In particular, there is a permanently allocated page for use whenever a parameter page is the same as an argument page.

Musings

Suppose that we try to change MEMORY as little as possible. We enumerate the changes here that seem unavoidable.

If MEMORY should decide that the process should be turned back to the CPU {when it has removed an obstacle} then we must ensure that the jumpee does not get control since it has no process yet.

Some places in MEMORY branch to RESUME in order to restart the user program. These places can be easily changed to designate the jumper rather than the jumpee for restarting.

We must prevent MEMORY from calling a segment keeper. This can be done if we can cause the no-call bit to be turned on in R0 at the appropriate time. This has to do with where we give control to MEMORY.

Even if no-call is on, we must consider how the domain fault is presented. The jumpee must be faulted but not started.

If MEMORY finds that the page isn’t there, then we provide a trap code for the jumpee and proceed as if the string had been rejected. Presumably we do this by changing the copy of the entry that the kernel is keeping.

Except for considerations of making MEMORY slow for its “normal” case, we would want memory to consider an address in a specified memory tree and cause required pages or nodes to be summoned in the normal way but not to call any segment keepers.

When MEMORY comes to the point of calling a segment keeper, it should instead return to its kernel caller.

When it summons a page or node, it should either return to its caller or treat the jumper as the actor.

Considering the primitives available to MEMORY {GET}, it seems that the latter choice should be made.

Setting Up the Parameter Pages

I think that I will follow the external design of the routines SETUPDESTPAGE, UNSETUPDESTPAGE, and RETBYTESTRING in PRIMARY. One difference is that PRIMARY has already been changed so that only SETUPDESTPAGE sets the cell ENTRYBLOCK, apart from a few places that set it to zero.

If RETBYTESTRING consults the returnee’s DIB, then we count on the DIB not changing between SETUPDESTPAGE and RETBYTESTRING. {This possibility causes complexity in order code DOMAIN__PUT_REGS for domain keys when the key designates the returnee.}

Actually it does change but we anticipate the new values in FDIB.

We must also worry about the case where the jump changes the returnee’s memory tree by making valid or invalid the address of the parameter. This is a problem in cases such as the order code NODE__SWAP for node keys.

We solve the above problem by noticing that primary keys that change a node return a zero length argument string.

In case of the old-style entry block, changing a node will not influence the parameter page, because that page isn’t defined by a memory tree.

In case of the new-style entry block, we count on the fact that the argument length is provided to the returnee in its registers and that, there being no bytes in the argument, no parameter page need be present.

We cannot implement a primary key that returns a string with bytes and changes a node with this scheme {but none are specified}.

To make all of this work safely, we may have to change MEMORY to take special action when it is called upon {via DETPAGE I think} to zap a page and that page is locked by virtue of “DEST PAGE SET UP”. The special action might be to unsetup that page. {“DEST PAGE SET UP” would still hold.}

About the Meaning of “DEST PAGE SET UP”

General Desiderata

“DEST PAGE SET UP” means that the page{s} are locked and may also have to mean that the page{s} may not be changed to read-only.

“DEST PAGE SET UP” can hold {as it does in the current design} even when there is no page defined by the designated address. If RETBYTESTRING is called while this is the case, and that page is required to accept a value, the returnee will receive a trap code.

We must change the definition of “DEST PAGE SET UP”. The definition should make SETUPDESTPAGE fast since it is executed more often than RETBYTESTRING.

If while DEST PAGE SET UP we hold ENTRYBLOCK in a standard form {regardless of the style provided by the returnee}, then RETBYTESTRING will not {presumably} depend on which style was used.

In this case we will need a field {set by SETUPDESTPAGE} to keep the parameter length, since RETBYTESTRING can’t go to the DIB for it.

Design:

While “DEST PAGE SET UP” holds, there are two {unordered} control blocks to describe the potential parameter pages that are set up. For each of the control blocks:

The following three states are explicitly distinguished:

page not required by virtue of parameter length or style of entry block;

either:

new-style entry block and page required by parameter length but undefined by memory tree or

old-style entry block and parameter page key invalid;

page present and core-locked.

In the latter two cases, the following values are provided:

a page locator (address and CTE address);

an argument portion descriptor indicating which portion of the argument belongs in this page;

a locator relative to the page indicating where in the page the argument portion belongs;

an indication of whether bytes 0 and 1 of this page should receive the argument length.

Note that none of this information depends on the argument length, which is unknown when these values are created.

Note that RETBYTESTRING can process these two control blocks in either order {or even concurrently}.

Loading Control Register 1

The problem

The instruction “LCTL 1,1,..” in the context of the Gnosis kernel under VM on a 3033 seems to take about 160 microseconds. If the value is not changed, it takes about 50 microseconds. We should minimize the number of times that we load this register. The register is currently loaded in GATE about a dozen instructions before loading the user’s PSW.

We have now changed GATE by moving the LCTL backwards from RUNIT to the point in the string copying logic where the jumpee’s map is required. This will make it easier to move the LCTL back further into the PRIMARY module.

SETUPDESTPAGE requires the use of an LRA instruction with the returnee’s segment table designated by X1. For this it seems necessary to specify that, at RESUMEJ, X1 hold the value called for by the DIB pointer R12.

An alternative would be to indicate {perhaps in READINESS} that X1 was not loaded. This would be acceptable for execution under VM but would be far from optimal for execution on a real machine. Also, there may not be room in READINESS.

The goal is thus to find a few places to put “LCTL 1,1,..” instructions so that the register will always be loaded but not often be loaded more than once per jump.

Trial Solution

I think that “DEST PAGE SET UP” is made to hold once per jump. If so, we can put an “LCTL 1,1,..” wherever this is made to hold.

I intend to formalize “DEST PAGE SET UP” in GATE also.

Currently, “DEST PAGE SET UP” is made to hold in SETUPDESTPAGE and a few places with the instruction “MVC ENTRYBLOCK,=A(0,0)”.

RETBYTESTRING: For each of the two control blocks {concurrently}:

IF a portion of the parameter exists in this page and some of the corresponding part of the argument exists THEN IF page is present THEN copy appropriate argument portion into page ELSE fault jumpee FI FI;

IF the page requires the argument length THEN IF page present THEN store length in page ELSE fault jumpee FI FI;

IF the page is present THEN unlock it FI
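
The control blocks and the RETBYTESTRING steps above can be sketched together. This is an illustrative model under several assumptions: the field names are hypothetical, a bytearray stands in for the page locator, the argument portion descriptor is modelled as a half-open range of argument byte positions, the length is stored big-endian in bytes 0 and 1, and the descriptor is assumed to keep each portion within its page.

```python
from dataclasses import dataclass
from typing import Optional

NOT_REQUIRED, REQUIRED_ABSENT, PRESENT = range(3)


@dataclass
class ControlBlock:
    state: int
    page: Optional[bytearray] = None   # stands in for the page locator
    arg_start: int = 0                 # first argument byte belonging here
    arg_limit: int = 0                 # one past the last byte that could
    page_offset: int = 0               # where arg_start lands in the page
    wants_length: bool = False         # bytes 0 and 1 receive the length
    locked: bool = False


def ret_byte_string(blocks, argument):
    """Process the control blocks in any order.

    Returns True if the jumpee must be faulted."""
    fault = False
    for cb in blocks:
        if cb.state == NOT_REQUIRED:
            continue
        portion = argument[cb.arg_start:cb.arg_limit]
        if portion:
            if cb.state == PRESENT:
                end = cb.page_offset + len(portion)
                cb.page[cb.page_offset:end] = portion
            else:
                fault = True
        if cb.wants_length:
            if cb.state == PRESENT:
                cb.page[0:2] = len(argument).to_bytes(2, "big")
            else:
                fault = True
        if cb.state == PRESENT:
            cb.locked = False          # unlock the page
    return fault
```

Nothing in a control block depends on the argument length, so the blocks can be built before the length is known, and an absent page causes a fault only when some argument byte or the length actually has to be stored in it.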

Obsolete Scheme

When we determine that the jumper has a string argument and the LRA instruction tells us that a page holding part of that string isn’t mapped, we simulate {to the satisfaction of MEMORY} a translation exception interrupt. To do this, we:

deduct SVCILC from PSW in jumper’s DIB,

set 90 {translation exception address} to the required virtual address,

set registers 10 and 13 to hold MEMBASE and the address of the jumper’s DIB,

store the address of the jumper’s root in ACTOR,

jump to either PAGETRANSEXNOPER or SEGTRANSEXNOPER, depending on whether we are lacking a page table entry or segment table entry. {LRA tells us this directly.} {See (trex).}

{It is as if the SVC had produced the memory reference.}

If we determine that we lack the real address of a page for the jumpee’s parameter {or returnee’s in the case of primary keys}, we:

store the virtual address of that page in TRANSEXECPADDR (=90) with the high bit on,

load control register 1 via the jumpee’s {or returnee’s} DIB,

set registers 10 and 13 to hold MEMBASE and the address of the jumpee’s DIB,

store the address of the jumper’s root in ACTOR,

jump to either PAGETRANSEXNOPER or SEGTRANSEXNOPER, depending on whether we are lacking a page table entry or segment table entry. {LRA tells us this directly.}

MEMORY does its thing and comes back to us instead of dispatching the domain, since the high bit of 90 is on.

See (prepwrit).

To support the above strategies, we modify MEMORY as follows:

ACTOR is set to the address of the root of the running domain after the interrupt and before the dispatch on interrupt type;

A test of location 90 is inserted before LP2 so that the jumpee will not be dispatched if control arrived at memory from GATE.

Another test of 90 will prevent a segment keeper from being jumped to, but will instead cause the jumpee to receive a trap code.

Another test of 90 in the domain fault logic will prevent starting the jumpee just after it has received its trap code.

To support the above strategies, we modify GATE as follows:

Locating the Argument String

We cause GATE to produce the argument string in real storage, with its address placed in REFREFARGUMENT and its length in REFARGLENGTH. These are contiguous words in page zero. The length of the string will already have been checked for validity. The high byte of REFARGLENGTH will be 0 and REFARGLENGTH will hold 0 if there is no string.

I think that the instruction “LM 2,3,REFREFARGUMENT” can replace the calls to GETSRCSTRING in PRIMARY and other places.

We must decide what to do about recording argument strings to primary keys. This is now done in GETSRCSTRING.

If the argument string is contiguous in the jumper’s memory, then REFREFARGUMENT will point directly into the jumper’s memory. Otherwise, it will point into a page acquired for this jump to hold the argument string.

GATE will be modified to use this setup for the case in which the key is an entry or exit.

SOURCEPAGECTE designates the locked page, if any, that holds the argument.

See (arg-ready) for assertions.

Placing the Argument String

We provide code to accept and deposit the argument string in the jumpee, in the case of a gate, or the returnee, in the case of a primary key.

This code will copy the argument, whose address and size are provided by the caller, into the pages locked above. The pages will then be unlocked.

We build a subroutine to cause the page or pages that hold the parameter string of a designated domain to be present and locked.

This routine is called by GATE, perhaps, after it has been determined that one of the pages is not mapped.

This routine is also called from PRIMARY when it is noticed that the returnee’s parameter pages aren’t all mapped.

This routine verifies that there is write access.

These calls are both from the dry run of gate jump logic.

The actor must be passed to this routine.

We provide a subroutine to determine in the dry run that an argument string can be accepted.

This will result in locking 0, 1, or 2 pages.

If the parameter string is inadequate, the entry block will be modified to indicate that no string is to be accepted, and the jumpee or returnee will be endowed with a trap code.

The parameter is inadequate if the parameter occupies, in part, a page that is not defined by the memory tree, or calling a segment keeper would be required.

When a jumper specifies an argument string that crosses a page boundary:

We copy the argument string to contiguous storage in a page frame permanently allocated for this use. We then place the real address of the string in REFREFARGUMENT {the same place as if the argument had not crossed a boundary}.

If a jumpee or returnee specifies a string parameter, we verify in the dry run that all pages required to hold the string are locked down.

There comes a point after the dry run {it turns out} when the argument exists somewhere in real contiguous storage. At this point we copy the argument into the possibly discontiguous parameter storage.
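The copy from contiguous argument storage into a parameter that may straddle a page boundary can be sketched as follows. This is a minimal illustration in Python, not kernel code. The 4096-byte page size, the name `place_argument`, and the dictionary standing in for the locked page frames are all assumptions made for the example. Copying only the shorter of the argument and the parameter follows the RETBYTESTRING outline but is otherwise assumed.

```python
PAGE_SIZE = 4096  # assumed page size for this sketch


def place_argument(argument, param_addr, param_len, locked_pages):
    """Copy a contiguous argument into a parameter area that may span pages.

    locked_pages maps a virtual page number to a bytearray that stands in
    for a locked page frame (hypothetical representation).
    Returns the number of bytes copied.
    """
    count = min(len(argument), param_len)
    copied = 0
    while copied < count:
        addr = param_addr + copied
        page_no, offset = divmod(addr, PAGE_SIZE)
        # Copy no further than the end of the current page.
        chunk = min(count - copied, PAGE_SIZE - offset)
        frame = locked_pages[page_no]
        frame[offset:offset + chunk] = argument[copied:copied + chunk]
        copied += chunk
    return copied
```

With a parameter that starts three bytes before a page boundary, the first three bytes land at the end of one frame and the rest at the start of the next.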

Algorithms

This section describes some of the kernel algorithms whose designs are not obviously determined by the kernel’s global data.

Elementary Scheduling

An I/O interrupt causes the current job to be put on the good end of the queue. {The world is then valid.} If the I/O interrupt causes some process to be unblocked, that process is put at the good end of the queue {also}.

When the interrupt is done, the unblocked jobs run {typically for a very short while}, and then the original job resumes.

The interrupted job usually gets a shortened slice in this case but occasionally gets parts of two slice quanta for his trip through the CPU queue. It averages out {exactly!}.
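The queue discipline on an I/O interrupt can be sketched as follows, assuming that the "good end" of the queue is the end served next. The function name `io_interrupt` and the use of a Python deque are assumptions for illustration only.

```python
from collections import deque


def io_interrupt(cpu_queue, current_job, unblocked=()):
    """Requeue the interrupted job, then any newly unblocked processes.

    The left end of the deque is taken to be the "good end" (assumption).
    Unblocked processes are pushed after the interrupted job, so they
    end up ahead of it and run first.
    """
    cpu_queue.appendleft(current_job)
    for process in unblocked:
        cpu_queue.appendleft(process)
    return cpu_queue
```

Because the unblocked process is pushed last, it runs before the interrupted job, which then resumes ahead of everything that was already queued.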

For Meters

Connecting Up Meters

Before a domain takes any action, its DIB must hold a positive CPU cache. Before a DIB may hold a positive cache {of any sort}, its domain root must hold an involved meter key. Since involved meter keys designate nodes prepared as meters and prepared meters hold involved meter keys {except for the prime meter}, we follow the meter chain that we hope will lead to the prime meter prep-locking the nodes.

First time up:

We prep-lock the nodes. If we encounter a node prepared as a meter, we have succeeded, unless the number of nodes we have passed plus the level number in the found meter exceeds 20. We also fail if we find a locked node or a non-meter key or a node whose format precludes it becoming a meter.

While traversing this path, we leave prepared {uninvolved} keys as they are, but we leave keys that were unprepared half-prepared {KSUBJECT filled in but no backchains}. This is because we do not know whether the keys will be left involved or not, and this determines how they will be backchained. They may not end up involved because the chain may not reach the prime meter.

First time down if we succeed:

For each node, we involve the counter and charge set slots and prepare and involve the meter keys that were left half-prepared. Meter keys that started out already prepared are rechained to reflect their involved status. We also mark the node as “PREPASMETER” and set NFDRY to indicate that inferior caches are empty and unlock the node. We also fill in the level numbers.

First time down if we fail:

If we fail, we run back the path backchaining the half-prepared keys as uninvolved and unlocking the nodes.
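The trip up and down the meter chain can be sketched as follows. This is a simplified model under stated assumptions: the `Node` class, its `superior` field {the node designated by the meter key, or None when the key is not a meter key}, and level numbers that count upward from the prime meter at level 0 are all hypothetical. Half-prepared keys and backchaining are left out.

```python
MAX_METER_DEPTH = 20  # limit taken from the text


class Node:
    def __init__(self, name, superior=None, prepared=False, level=0,
                 can_be_meter=True):
        self.name = name
        self.superior = superior          # None models a non-meter key
        self.prepared_as_meter = prepared
        self.level = level
        self.can_be_meter = can_be_meter
        self.locked = False


def connect_meters(first):
    """Prep-lock nodes on the way up; prepare or roll back on the way down."""
    path = []
    node = first
    ok = False
    while node is not None:
        if node.prepared_as_meter:
            # Found a prepared meter: succeed unless the chain is too deep.
            ok = len(path) + node.level <= MAX_METER_DEPTH
            break
        if node.locked or not node.can_be_meter:
            break
        node.locked = True
        path.append(node)
        node = node.superior
    if ok:
        level = node.level
        for n in reversed(path):          # way down, success
            level += 1
            n.level = level
            n.prepared_as_meter = True
            n.locked = False
    else:
        for n in reversed(path):          # way down, failure: just unlock
            n.locked = False
    return ok
```

Either way every node touched on the way up is unlocked again on the way down, so no lock outlives the call.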

To this point we have not considered the resource values in the meters. We now have a straight shot at the prime meter. If some intermediate counters are too small, we may resort to the logic of the next section. See (borrow) for calling these subroutines.

Scavenging Caches

The routine SCAVENGE operates on a prepared meter and returns stale caches that are inferior to that meter to the primordial meter. A bit in the prepared meter called DRY and a bit called STALECACHE in READINESS in the domain DIB are for the use of SCAVENGE.

STALECACHE is a one-bit LRU counter for the cache in the DIB. If STALECACHE is on the cache is least recently used. SCAVENGE traverses the meter tree below the given meter and for each inferior DIB turns on the STALECACHE bit or, if the bit is already on, returns the cache to the primordial meter.

DRY indicates that all caches below the prepared meter are zero. SCAVENGE turns on DRY when it finds no caches under a given node of the tree. SCAVENGE uses DRY, set in previous executions, to short-cut the walk. BORROW uses DRY to determine whether to call SCAVENGE to try to get more cache, or to call the meter keeper.
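The interplay of STALECACHE and DRY can be sketched as follows. The `Dib` and `Meter` classes and the return value {the amount handed back to the primordial meter} are assumptions for illustration. The sketch also assumes that DRY is turned on only by a pass that finds no non-zero cache at all, and it leaves out whatever turns STALECACHE and DRY off again when a cache is used or refilled.

```python
class Dib:
    def __init__(self, cache=0):
        self.cache = cache
        self.stale = False      # models STALECACHE in READINESS


class Meter:
    def __init__(self, children=()):
        self.children = list(children)   # inferior Meters and Dibs
        self.dry = False                 # models DRY


def scavenge(meter):
    """Age or reclaim caches below meter; return the amount reclaimed."""
    if meter.dry:
        return 0                 # short-cut: nothing below holds a cache
    reclaimed = 0
    found_cache = False
    for child in meter.children:
        if isinstance(child, Meter):
            reclaimed += scavenge(child)
            if not child.dry:
                found_cache = True
        elif child.cache > 0:
            found_cache = True
            if child.stale:
                reclaimed += child.cache   # second visit: take it back
                child.cache = 0
                child.stale = False
            else:
                child.stale = True         # first visit: mark as stale
    if not found_cache:
        meter.dry = True
    return reclaimed
```

In this model a cache that is never touched is reclaimed on the second pass, and the subtree is marked DRY on the third.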

Memory

The memory tree interpreter (SEARCHPORTION) provides the function described in (p1,memtreeformal). This program treats the memory tree path as made of three parts. The first part is that portion of the path between the domain root and the first node with LSS <= 5. This portion merely determines which segment table will be used and is called the (_trunk). The second part of the memory tree path extends from the end of the trunk and runs until a segment of LSS <= 3 occurs. This portion is called the branch and determines a page table, and thus a segment table entry. The third portion of the memory tree path is called the stem and runs from the end of the branch until the page is found. This portion determines the page table entry. The design of the MEMORY program is to traverse only those portions of the memory tree path that are relevant to the specific fault.

Logic to Support Background Keys

Fields in both STOTE and PTFRAME locate a background key (STBGKEY & PTBGKEY). Each field designates the {involved} background key that was in effect when its table was built. These fields must match in order to adopt an existing table.

Exercise for the reader: Prove that if two users share the node {that produced this table} and each reached the node with the same background key in effect, then they can share the same {page or segment} table.

The variable BGKEY is global to MEMORY. It is set by SEARCHPORTION each time a background key is consulted, and it is read when a background key is required. BGKEY locates that key.

Ignore values of BGKEY after the write-trap, fault, and trunk calls to SEARCHPORTION, but after the branch and stem calls use the value in BGKEY for the new background key locator fields of STOTE or PTFRAME. {These are called STBGKEY and PTBGKEY.} Before these calls, initialize BGKEY to null for the case in which no background key is used.

New tables will be required if all existing locators are wrong.
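The adoption rule can be sketched as follows. The dictionary representation of a table, the name `adopt_table`, and the `build_table` callback are assumptions for illustration; `None` stands for the null BGKEY used when no background key is in effect.

```python
def adopt_table(existing_tables, bgkey, build_table):
    """Adopt a table built under the same background key, else build one.

    existing_tables: tables already produced by this node (hypothetical
    list of dicts, each with a "bgkey" locator field).
    """
    for table in existing_tables:
        if table["bgkey"] == bgkey:
            return table            # locator matches: share this table
    table = build_table(bgkey)      # all existing locators are wrong
    existing_tables.append(table)
    return table
```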

MAKEMOREROOM for Running Out of STOTE’s or Segment Table Space

MAKEMOREROOM is a routine internal to the SHOVE module {which holds EXSEGTAB and ALSEGTAB}. It is called by ALSEGTAB when there are no available STOTE’s, or by EXSEGTAB when there is not enough available segment table space. MAKEMOREROOM first scans the forsaken queue {(forsaken)} and converts each forsaken STOTE to free by setting SZCD to -1 and STPRODUCER to 0. When the end of the forsaken queue is found, the queue is spliced to the free queue {(freest)}.

If there are not enough free STOTE’s to make the DIB scan economical, we scan the STOTE’s to free some more: We scan for non-free STOTE’s and set their SZCD to -1. Then:

If the STOTE is produced by a segment node {as determined by STPRODUCER}, the STOTE is removed from the producer chain.

If the STOTE is produced by a page table, that page table is forsaken.

At this point the world is valid, except that there may be DIB’s that point to free STOTE’s. This is now fixed by scanning the DIB’s.

See (shove) for more about this stuff.

See (ptrescue) about managing page table frames.

Swapping

This chapter is concerned with the real I/O necessary to compensate for the small size and volatility of main storage.

Real Core Allocation Strategy

For this section, we use the term (_core frame) to refer to either a page frame or a node frame in item space. Each core frame has a lock byte. The page’s lock byte is in its core table entry in item space. When a core frame’s lock byte is unlocked, the value 3 is placed in the byte. There is a global cursor for page frames and another for node frames. Each cursor passes cyclically over its range of frames.

When a frame must be acquired, the frame at the cursor is considered. If the lock byte is 1 and the frame has been changed since the last copy on disk {it is dirty}, it is scheduled for cleaning. If its lock byte is 0 and the frame is not dirty, it is selected. If it is dirty, the cursor is advanced. If its lock byte is locked, the cursor is advanced. If the lock byte holds a value from 1 to 3, its value is decremented and the cursor is advanced. This process continues until a clean frame is found with a 0 lock byte, in which case if it can be unprepared and removed from the hash chains, it is returned. {If it cannot be unprepared, the search continues. See (cntunhk).} If all the frames have been examined and no clean ones were found and none were cleaned, the process returns NIL.

I think that this is a scheme used by Multics. It is also used by the hardware of the DEC KL-10 processor to manage the cache.
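The cursor sweep can be sketched as follows. Several details are assumptions made for the example: the value 255 as the encoding of "locked", the `Frame` and `FramePool` classes, the `clean_queue` list that stands in for scheduling a frame for cleaning, and the rule that the search gives up once a full sweep changes nothing. Waiting for scheduled cleaning to finish is left to the caller.

```python
LOCKED = 255         # assumed encoding of a locked frame
UNLOCKED_START = 3   # value stored when a frame is unlocked (from the text)


class Frame:
    def __init__(self, lock=UNLOCKED_START, dirty=False, can_unprepare=True):
        self.lock = lock
        self.dirty = dirty
        self.can_unprepare = can_unprepare


class FramePool:
    def __init__(self, frames):
        self.frames = frames
        self.cursor = 0
        self.clean_queue = []    # frames scheduled for cleaning

    def acquire(self):
        """Return a clean frame whose lock byte has aged to 0, or None."""
        n = len(self.frames)
        idle = 0                 # consecutive frames seen with no progress
        while idle < n:
            frame = self.frames[self.cursor]
            self.cursor = (self.cursor + 1) % n
            if frame.lock == LOCKED:
                idle += 1
            elif frame.lock == 0:
                if not frame.dirty and frame.can_unprepare:
                    return frame
                idle += 1        # dirty or cannot be unprepared: move on
            else:
                if frame.lock == 1 and frame.dirty:
                    self.clean_queue.append(frame)
                frame.lock -= 1  # age the frame and advance
                idle = 0
        return None
```

A frame that was just unlocked therefore survives three passes of the cursor before it becomes a candidate, and a dirty frame is queued for cleaning one pass before it would otherwise be chosen.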

Waiting for Pages and Nodes to Come In

When a unit of transformation encounters a missing page or node {not in core}, it requests that the thing be brought to core and puts the corresponding process in a hashed queue of waiters. The hash is a few bits from the offset on the first real disk to the page or node to be brought in. When such an operation finishes, members of that queue are restarted.
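The hashed queue of waiters can be sketched as follows. Taking "a few bits" to mean the low four bits of the disk offset is an assumption, as are the class and method names.

```python
NUM_QUEUES = 16  # assumed: "a few bits" taken as the low 4 bits


class WaitQueues:
    def __init__(self):
        self.queues = [[] for _ in range(NUM_QUEUES)]

    def wait_for(self, process, disk_offset):
        """Queue a process that needs the object at disk_offset."""
        self.queues[disk_offset % NUM_QUEUES].append(process)

    def io_finished(self, disk_offset):
        """Empty the queue this offset hashes to; return who to restart."""
        queue = self.queues[disk_offset % NUM_QUEUES]
        restarted = list(queue)
        queue.clear()
        return restarted
```

Every member of the queue is restarted, including any that were waiting for a different object whose offset hashes to the same queue. Such a process simply rediscovers that its object is still missing and queues itself again.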

A Problem

An exposure that we know of {and for which we haven’t even invented a fix} struck when we moved the packs from a sick Gnosis to a VM machine, IPL’d, ran for a while {writing the packs}, and then moved the packs back to the real machine. We did not re-IPL the real machine, which still had {now obsolete} page and node-pot values in its core frames, and it then wrote them out.

Granted that more careful operation would have avoided this crud, we had not invented the rules for the operator to follow.

Retry

This chapter discusses some strategies to cope with fallible hardware and power supplies.

Checkpointing

See (swap-check) for another description of some of this stuff.

Checkpointing is used to save a consistent state of the system. This allows restart, avoiding the problems that would occur if a key were saved in one allocation state and the node it pointed to were saved in a previous allocation state.

Checkpointing Logic - Top View

There are two logical areas on the disk {spread over several packs}. One of these is called the “current” checkpoint area, and the other is called the “backup” checkpoint area. Normal page and node write activity takes place to the current area. When a checkpoint is taken, all dirty pages and nodes are written in the current area, along with a copy of the directory and some other information. Then the current area becomes the backup area, and the backup area becomes the new current area. The migrator is started to make the backup area ready for the next checkpoint.

CHECKPOINTING LOGIC - Processing steps

Wait for migration in progress {if any} to complete.

Inhibit dispatching domains.

Force each dirty page or page being cleaned “kernel-read-only”.

Copy all dirty nodes to node pots {to be written out into the current swap area}.

Inhibit creation of dirty node pots and the cleaning of new dirty objects.

Get a page frame for the CHECKPOINT HEADER, and build a free list of page frames for DISK DIRECTORIES; the list holds at least 1 frame and at most as many frames as there are swap devices. Timestamp the CHECKPOINT HEADER.

Start dispatching domains again.

N.B. At this point all dirty objects that need checkpointing are either:

Pages marked “kernel-read-only”.

Dirty node pots marked “kernel-read-only”.

If there is anything on the clean-me-first list {implemented as an ordering of the clean list}, and it is marked “kernel-read-only”, then clean it; otherwise, clean anything marked “kernel-read-only” and repeat until there are no objects marked “kernel-read-only”.

N.B. All objects are now checkpointed.

Swap swap-area directories.

Allow dirty node pot creation.

WHILE there are more entries in directory(back) DO

Get page frame for DISK DIRECTORY from the free list created above.

If a page frame is available then:

Fill it with entries from directory(back), or if no more, pad with null entries.

Write it on two separate disks {in current swap area}.

Enter the disk address into CHECKPOINT HEADER.

Write CHECKPOINT HEADER onto two fixed locations on two separate disks.

N.B. Checkpoint complete. {HOORAY!!}

Allow cleaning new dirty objects.

Start migration {see (migrator)}.
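The directory-writing loop in the steps above can be sketched as follows. The number of entries per page, the use of `None` as the null entry, and the dictionaries standing in for two disks are assumptions for illustration. Writing both copies at the same hypothetical address is also an assumption; the text says only that the two copies go to separate disks.

```python
ENTRIES_PER_PAGE = 4   # assumed capacity of one DISK DIRECTORY page
NULL_ENTRY = None      # assumed representation of a null entry


def build_directory_pages(entries):
    """Split directory entries into full pages, padding the last one."""
    pages = []
    for start in range(0, len(entries), ENTRIES_PER_PAGE):
        page = list(entries[start:start + ENTRIES_PER_PAGE])
        page += [NULL_ENTRY] * (ENTRIES_PER_PAGE - len(page))
        pages.append(page)
    return pages


def write_directory(entries, disk_a, disk_b, header):
    """Write each page to two disks and record its address in the header."""
    for page in build_directory_pages(entries):
        addr = len(disk_a)          # hypothetical allocation: next free slot
        disk_a[addr] = page
        disk_b[addr] = list(page)   # second copy on a separate disk
        header["directory_addresses"].append(addr)
    return header
```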

In order that the order of stalled domains is preserved, the nodes are unprepared, but the keys {especially the hook} remain prepared when it is required to copy a node to a node pot during node cleaning. Providing unprepared keys for the disk image is integrated with the copying of the node to the node pot.

If the system must restart from a checkpoint, the migration will be restarted from the beginning. The order information that was in the stall queue is lost; the fact that there was a process in the node is kept in the disk directory and is not lost.

Checkpointing Issues

The checkpointing logic is invoked for three reasons. With some of these there are special problems.

Entered because of timer interrupt.

The world is valid. There are no special problems.

Entered because swap area directory near full.

This is detected by GDIRECT when the space in the directory is near full. As this is several levels down in the subroutine stack, the stack must be unwound before any serious work can be done on the checkpoint.

Entered because swap area space is becoming exhausted.

This is detected by GSWAPA when the space remaining in the current swap area falls below a threshold. As this is detected several levels down in the subroutine stack, the stack must be unwound before serious work can be done on the checkpoint.

When a checkpoint is needed, there may be nodes locked when the checkpointer goes to clean them.

Checkpointing Solutions

When the need for a checkpoint is detected internally to the I/O logic, the routine that detects the need will set a bit which will be interrogated by the logic that returns control to the caller of GET. If this bit is on, the checkpointing logic will then be performed.

When the checkpointer stops all domains in order to make a consistent copy of the nodes in item space, it will replace the dispatcher procedure with an entry to itself. At this entry all nodes will be unlocked and the world will be valid.

Multi-Copy Pages and Nodes

Each page or node has 2 or more “homes”. These are written to two separate pages on two separate packs in the current swap area.

The checkpoint header is written to two permanently assigned locations on two separate packs.

The migrator must move to all “homes”.

Migration

Migration is the process of copying pages and nodes from the backup swap area to their “home” positions on the disks and updating the appropriate allocation pots. This allows the backup area to be used after the next checkpoint. The migrator consists of a special primary capability and code that runs in a domain.

During migration there are two versions of a page that are important.

The version as of the last checkpoint. This is found in the backup swap area, or if not there, in the home area. The migrator doesn’t mess with pages that aren’t in the backup swap area. Those that are, will be gotten from there if we restart from that checkpoint, regardless of the state of the backup directories in core. The migrator doesn’t change anything on disk in the backup swap area. Therefore nothing the migrator does or doesn’t do will affect this version of a page.

The current version. This may be in core, in the current swap area, in the backup swap area, or in the home area. If in core and clean, it is also in one of the other places. The job of the migrator is to remove the dependency of the current version of a page on the backup swap area, so it won’t be lost when we scrap the backup swap area at the next checkpoint.

Therefore:

The migrator need only concern itself with the current version of a page. When it brings a page into core it always links it into PAGCHHD.

If a page is in the current swap area, its current version can’t depend on the backup swap area, so there is no need to migrate it.

If a page is in core and dirty, it doesn’t depend on the backup swap area, so we don’t need to migrate it.
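These rules reduce to a small predicate, sketched below. The three boolean arguments are a hypothetical summary of what the kernel would learn from the directories and the core tables.

```python
def needs_migration(in_backup_swap_area, in_current_swap_area,
                    in_core_and_dirty):
    """True if the current version of a page depends on the backup area."""
    if in_current_swap_area:
        return False    # current version lives in the current swap area
    if in_core_and_dirty:
        return False    # current version is in core and will be written
    return in_backup_swap_area   # pages not in the backup area are ignored
```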

Migrator Domain

The migrator domain uses the information returned from the kernel about mounted ranges and the backup swap area CDA’s and locations that need migration to minimize the number of disk accesses and arm movements needed to perform the migration. It specifies the order and, implicitly, the number of core frames needed to perform this migration step, by the number of CDA’s specified in the CDA-list. See (p2,migrate1). When all CDA’s have been migrated, it uses MIGRATE(0=>rc) {(p2,migrate0)} to wait until the next migration is needed.

See (p2,migrate) for external definition of the MIGRATOR CAPABILITY.

See (migr-cntrl) for some kernel details.

Proving that checkpoint/migration works

At the end of a checkpoint, GDISWAP empties the backup directory and moves the working directory to the backup directory. The old backup swap area becomes the working swap area. We must prove that these actions do not change the logical state of the system (except that the “next backup version” becomes the “backup version”). Specifically:

When we look for the current version of a page or node, if we look in the backup directory, then (a) if we find it there, we use the copy in the backup swap area, or (b) if we don’t find it there, we use the copy in the home area. Migration has ensured that if we look in the backup directory (i.e. if the page or node is not in the current directory), these two copies are the same. Therefore emptying the backup directory does not change the current version of anything.

Since we want to re-use the old backup swap area, we must prove there are no pointers to it. Pointers to the backup swap area are called backup swaplocs. These can exist in the following places:

In the backup directory. All backup directory entries are freed by GDISWAP.

In REQUESTs that were built before the backup directory entries were freed. These REQUESTs must be found and aborted.

Otherwise, such a REQUEST might read the swap area after a new page or pot had been written to it.

You might think that this problem is solved by re-checking the directory(s) for the page or node on whose behalf the request was built. If the directory says the object is at the location we just read it from, you might think you had gotten the correct version.

But just because the node you requested is at the location you read, doesn’t mean that the potaddress you would use to identify the pot is correct, because a potaddress is only one of the swaplocs (the “primary” swaploc) where the pot is. I know of no proof that this would work.

Here is a pathological case I want to record. Suppose you start to fetch a node pot from the backup swap area; a checkpoint completes; another domain requests the same node, now from the home area; the home pot is read and the node moved to node space; the node is cleaned into a pot that happens to go to the same swaploc as the first request; then the first request reads the new pot. I can imagine confusion resulting, depending on whether the swaploc was the primary swaploc of the old directory entry, or of the new one, etc.

In CTCORETBEN.CTPOTADDRESS of frames that hold swap pots. Indeed, these SWAPLOCs will continue to exist after the end of the checkpoint that obsoletes them. However, they will not be used. A CTPOTADDRESS containing a SWAPLOC is used only if it is equal to a primary SWAPLOC found in a directory entry (for a node). At the time GDISWAP completes, there are no directory entries pointing to the backup swap area. If any directory entry is created with the same primary SWAPLOC, the frame containing the old CTPOTADDRESS is freed (by GETQNODP).

So if a REQUEST completes, it was not aborted, therefore its swaplocs are still valid, therefore in the case of a node pot its REQPOTADDRESS is still valid. We can put the pot on the hash chains and restart the actor. (Something may have happened to the node he wanted, but the pot is still valid.) In the case of a page, if the page is in core (it was materialized while the REQUEST was outstanding), the version in core is the current version. If the page isn’t in core, and there is a directory entry for it, it had better be the directory entry we were fetching from.

A Proposal to implement the preceding

This proposal affects requests that are built by GETBIR. Such requests are of two kinds; reads from a swap area, and reads from the home area.

I propose that requests to read from a swap area be linked with the directory entry that gave rise to the request.

A bit in the dirent can say whether this is the case. To save space in the dirent, I plan to put the pointer to the request in DIRENTSECOND, and move DIRENTSECOND to a (new) field in the request. The request will also point to the dirent.

When such a dirent is destroyed (in GDISWAP at the end of a checkpoint), or its SWAPLOCs are changed (in GDISET etc.), the associated request is aborted. This means it is completed with a special code that indicates it was aborted because its swaploc was invalidated.

Then the logic at GETENDED will be as follows:

Call GETENDEDCLEANUP (clear outstanding I/O bit, restart I/O queue, return DEVREQs)

If the request was aborted, return any page frame and exit.

Otherwise, If the request was for a node pot, add to hash chain using REQPOTADDRESS.

If the request was for a page, determine if the page fetched is the current version, the backup version, or neither, and link it into core or discard it as appropriate. This is irrespective of whether the original request sought to bring in the current or the backup version.

If the current version of this CDA is in core (in the hash chains)
Then (page brought in isn’t the current version)
    If the backup version of this CDA is in core
    Then (page brought in isn’t the backup version) discard the page.
    Else If the CDA is in any backup directory
        Then If the request was from a swap area and its dirent matched the backup directory entry
            Then this page is the backup version
            Else (not the backup version) discard the page.
        Else If the request was from a swap area
            Then (request was from a dirent in the working directory) (page is not the backup version) discard the page.
            Else this page is the backup version.
Else (current version not in core)
    If the CDA is in the working directory
    Then If the request was from a swap area and its dirent matches the working directory entry
        Then this page is the current version.
        Else (not the current version) Go to check-backup
    Else If the CDA is in any backup directory
        Then If the request was from a swap area
            Then (its dirent must match the backup dirent) this page is the current version
            Else (neither current nor backup) discard the page.
        Else (the request must be from a home area) this page is the current version.
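The page classification at GETENDED can be sketched as a function. The boolean arguments and the three result strings are assumptions for illustration. The sketch also assumes that "check-backup" means the test that begins "If the backup version of this CDA is in core"; the text does not define that label.

```python
def classify_fetched_page(current_in_core=False, backup_in_core=False,
                          in_working_dir=False, in_backup_dir=False,
                          from_swap_area=False,
                          dirent_matches_working=False,
                          dirent_matches_backup=False):
    """Return "current", "backup" or "discard" for a page just read in."""

    def check_backup():
        if backup_in_core:
            return "discard"
        if in_backup_dir:
            if from_swap_area and dirent_matches_backup:
                return "backup"
            return "discard"
        if from_swap_area:
            return "discard"    # request came from a working dirent
        return "backup"         # read from the home area

    if current_in_core:
        return check_backup()   # the page read cannot be the current version
    if in_working_dir:
        if from_swap_area and dirent_matches_working:
            return "current"
        return check_backup()   # assumed target of "Go to check-backup"
    if in_backup_dir:
        return "current" if from_swap_area else "discard"
    return "current"            # request must have been from a home area
```

Writing it out this way makes it easy to check that every combination of inputs reaches exactly one of the three outcomes.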

Journalizing Between Checkpoints

Some application programs may wish to keep a transaction journal to allow more recovery than the system checkpoint described above would allow.

To provide this we add two functions to the kernel as follows:

We add the journal page {(p1,journal-page)}, which allows the application to detect a kernel restart by comparing the remembered time of last restart with the actual time of last restart.

We add a kernel call that allows a journaling program to protect the current contents of a page against checkpoint rollback. If there exists a copy of the page in the area BACKUP {and possibly MULTI-BACKUP {NI}}, the page will be written there as well as to the home{s}. This process is interlocked with the migration process so as to keep a current copy of the journalized data. This process returns after the writes have completed.

The kernel journalizer first sees if there is an outstanding I/O request for the page. If there is, the actor is immediately queued on the JOURNALWAITQUEUE. The page is then checked to see if it is dirty. If it is not dirty and if there is no entry for it in the current swap area directory, then the page has been journalized and the actor is run. Otherwise the page is queued for output to the appropriate disk locations.

When a domain performs an action that would replace some hook key, that action is suppressed. This is justified on the basis that slot 13 of domains are set to DK(0) or DK(1) at random times anyway under those conditions when a hook may be found in a node. See (p1,realdom).

If the node is being severed, the process will leave and the replacement can take place. See (p2,nrange).

An Overview of the Module MEMORY

See (trex) and (prepwrit) for some external specifications for this code.

There are two domains of significance while this code is running: the actor, designated by PZDOMAINDIBP, and the subject, usually designated by R13. The actor is the domain which holds the process whose action requires the service of this code. Some virtual address in the address space of the subject domain is required. The actor and the subject are normally the same, except when one domain refers to another domain’s address space. This only occurs when passing a string argument to another domain during a jump.

See (build-depend) for maintenance of DEPEND.

For this exposition, we divide the code into these sections:

PRGINT to PER,

This code gets control when a program interrupt occurs. If the program interrupt is a translation exception from user mode, then control is given to (map-logic). If the interrupt is a user mode protection exception, control is given to (cae-logic).

If any of these interrupts is accompanied by a PER event, (per-logic) gets control first.

PER to CAE,

Compute Virtual Address,

This section compensates for the fact that hardware does not provide the offending virtual address upon a protection exception. That address is computed and placed in location 090 as in a translation exception. To do this we must examine the instruction that caused the exception. An obscure case is an MVCL instruction that causes a protection exception after it has deleted itself.

SIXHASIT, DOSEGTREX, DOPAGTREX, and SEARCHPORTION.

There are three ways this code can gain control {besides a return from a subroutine}:

following a translation exception via PRGINT,

from GATE via SEGTREX or PAGETREX for locating the string argument of an explicit jumper,

from GATE or PRIMARY via PREPWRIT to locate the parameter string of a jumpee or returnee.

“SEAP” is short for “SEARCHPORTION” both here and in the code. SEAP knows about the semantics of memory trees but nothing about the particular 370 mapping tables. When a mapping table entry must be evaluated the producer of the table is identified (a node) and SEAP is called to find a node that should and might already produce the table which the current entry should locate.

SEAP has some state as it comes to its loop point at SEAPLOOP. I focus here on its key cursor and the value in field BGKEY. When background keys are encountered in the interpretation of a memory tree, they displace earlier background keys. BGKEY has the address of the current background key. When SEAP reaches the node with the desired ssc it returns identifying that node. The value of BGKEY is also part of the yield of SEAP. BGKEY is both input to and output from SEAP.

Miscellaneous Entries

DETPAGE

MAKEKRO

RESETKRO

CLRPAGTB

CLRALLTB

Designing the ICM Trick Implementation

We propose that when an instruction ICM 0,0,X causes a page fault due to the page being on disk, the domain obeying that instruction not be blocked while that page comes in.

I organize my thoughts here so as to minimize conceptual errors.

A naïve view is to put the following code somewhere in the kernel after a page has been summoned due to a memory fault:

L 1,RNPSW+4
LRA 2,0(1)
IF Z
CLC 0(2,2),=X'BF00' ICM 0,0
IF E
LA 1,4(1)
ST 1,RNPSW+4 psw +:= 4
NI RNPSW+2,X'CF' cc := 0 [PRETEND THAT THE TRAP DIDNT HAPPEN]
JOIN

Note:

Any I/O queue member can be safely put on the CPU queue so long as you don’t do it so often as to fail to make headway. At worst extra cycles will be required to rediscover that it should be back where it was.

Any domain with a process, including those on I/O queues, can have their PSW’s incremented by 4 and their CC’s set to 0 if their PSW -> ICM 0,0,X.
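The test described in the note above can be sketched outside the kernel. The Python below is only an illustration under stated assumptions: the `Domain` record, its field names, and the flat byte-string model of memory are invented for the example, and the address translation done by LRA in the assembler fragment is left out. It relies only on what is stated above: ICM 0,0,X begins with the bytes X'BF00', occupies 4 bytes, and is skipped by adding 4 to the PSW address and setting the condition code to 0.

```python
from dataclasses import dataclass

ICM_0_0 = bytes([0xBF, 0x00])  # first two bytes of ICM 0,0,X
ICM_LENGTH = 4                 # ICM is a 4-byte instruction


@dataclass
class Domain:
    """Hypothetical stand-in for the process state held in a DIB."""
    psw_address: int
    condition_code: int


def skip_icm_if_present(domain, memory):
    """If the PSW points at ICM 0,0,X, step past it and set cc to 0.

    Returns True when the instruction was skipped, meaning the domain
    may be treated as runnable although the page has not yet arrived.
    """
    addr = domain.psw_address
    if memory[addr:addr + 2] != ICM_0_0:
        return False
    domain.psw_address = addr + ICM_LENGTH  # psw +:= 4
    domain.condition_code = 0               # cc := 0
    return True
```

A domain for which this returns False is left untouched, which matches the remark that at worst extra cycles are spent rediscovering where it belongs.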

Strategies

As we become idle

When a CPU has time it considers domains on I/O queues: If (PSW -> ICM 0,0,X) Then PSW +:= 4; CC := 0; put domain on CPU queue Fi.

This is correct and will cost little since the CPU “has time”.

Just after we summon page due to fault

After starting the I/O for a disk page do the test to see if the domain should be blocked.

Advantages

This is better in that we do not pay the cost of switching contexts.

If a program ICM’s several pages at once and those pages have successive CDA’s, the I/O system would most likely fetch the pages in one disk operation. This gets the job done with less main store time, channel time and device time.

Disadvantages

It is worse in that we do the test when there may be more important work.

It is not clear where to install the code.

It seems that each of the routines: SEARCHPORTION, INVOLVE, PREPKEY, FIND, FINDPAG2, SRCHPAGE and GET must have new returns defined for the situation “The page isn’t here but the actor is runnable anyway”.

As we put faulter on I/O queue

Perhaps we place faulted domain {actor} on CPU queue and tell the rest of the system that he has been “enqueued”.

This lets memory act as it does now which is to assume that if the return from SEARCHPORTION indicates that the page isn’t in core, the actor can’t be resumed.

This has the effect that the domain issuing the ICM will give up its turn on the CPU but it will be allowed to run as the CPU becomes available. This might be better scheduling but it does involve more overhead because we do more context switching.

Perhaps we can put the code in PREPKEY.

The Fiasco at (bbk)

The “pending entry state” idea.

While it would be possible to merely lock the node holding the background key when it was referenced (using LOCKSTAK) the following seems better all around.

The crash at (bbk) provoked the following idea from Charlie. Suppose that when MEMORY has located an invalid mapping entry that caused a translation specification exception, it puts (with a CS instruction) a distinctive invalid value there just before calling SEAP. When SEAP returns it installs the value proposed by SEAP with another CS. In the common case the distinctive value will still be in the entry and we are done. In cases where DEPENDSAFE would have done its thing the distinctive value would have been replaced by an ordinary invalid value. In this case the new value proposed by SEAP will not be placed in the table and the crash like (bbk) will be avoided.

In this scheme I see no need to lock nodes in the UP kernel. (In an MP kernel it would be necessary to lock exclusively one node at a time at least while we called depend.) Update PENDVAL implements this scheme.
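The pending-value idea can be illustrated with a small model. This is a sketch, not the PENDVAL update itself: the marker values, the list used as a mapping table, and the function names are all hypothetical, and `compare_and_swap` merely imitates the effect of a CS instruction on one entry.

```python
PENDING = "pending"  # hypothetical distinctive invalid value
INVALID = "invalid"  # hypothetical ordinary invalid value


def compare_and_swap(table, index, expected, new):
    """Model of CS: store `new` only if the entry still holds `expected`."""
    if table[index] == expected:
        table[index] = new
        return True
    return False


def repair_entry(table, index, search):
    """Mark the entry pending, run the search, install its yield if unchanged.

    `search` stands in for SEAP. If anything replaced the pending mark with
    an ordinary invalid value while the search ran, the yield is rejected.
    """
    if not compare_and_swap(table, index, INVALID, PENDING):
        return False
    proposed = search()
    return compare_and_swap(table, index, PENDING, proposed)
```

In the common case the pending mark survives and the proposed value is installed. If the search's own side effects invalidate the entry, the second swap fails and the entry stays invalid, so the fault is simply taken again.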

There seems to be a related bug: we do not ensure that a node is prepared as a seg node when we refer to a background key therein.

This is not a bug. We only use the xxBGKEY value upon a fault due to an invalid entry herein. This occurs only if this table is located by a higher one. The node located by xxBGKEY remains prepared as a seg node with the help of DEPEND. Were it unprepared the higher table would no longer locate the table with the BGKEY field.

As DEPEND entries designate table entries and these entries hold “pending values” during execution of SEAP, don’t DEPEND entries that designate whole tables need a similar pending value to put in the xxBGKEY field while executing SEAP to find the “real value”?

It seems just now that the background slot and keeper slot should be treated just the same. This would mean either adding a new field in the mapping table headers xxSKKEY, or deleting the xxBGKEY field from each. This seems like a good but unnecessary idea.

The benefit in treating these two kinds of slots alike is fewer bugs and fewer ideas to go wrong. They have in common that they are found by SEAP and noted for later use. I do not know which is more frequently encountered.

There are two ways to unify these two schemes, make keepers like backgrounds, or vice-versa. The table headers have room for another whole word even without expanding them.

The necessary changes to support above ideas:

New Fields:

xxyySLOT where xx is ‘PT’ or ‘ST’ and yy is ‘BG’ or ‘KP’. (xxBGKEY is gone.)

Concept change:

We have talked about the background key but I think we should talk of the background slot. The field xxBGKEY should perhaps be renamed xxBGSLOT. It identifies the key only by virtue of identifying the slot. The information in xxBGSLOT is derived by consulting a format key. It does not depend on the content of the slot. There should be a DEPEND entry from the format slot to mapping table.(no! See (bgx)) The format slot should be involved (it already is) but there is no need to involve the key in the background or keeper slot since it is still OK to change the key in that slot. If there were entries in the table depending on the key in the slot then there would be separate DEPEND entries from the key in the slot to the entry.

It is important to document and cross-reference the following ideas so that we do not remove the ‘mapping table DEPEND’ function again. (Note the imbedded refutation here starting ‘NO!’. See (bgx) for the best explanation of why these depend entries are not needed.)

There are two distinct functions for DEPEND regarding the xxyySLOT field:

That the location of a background slot be valid.

DEPEND entries exist here between format keys in red segment nodes, and mapping tables (and implicitly their xxyySLOT fields).

If the format key in the node with that slot is modified or that node is moved, then subsequent use of the (unmodified) xxyySLOT field is clearly invalid. --- NO! If the slot is modified or the node is moved then references to the table from higher tables will vanish due to the actions of depend entries from the format slot.

You can only “use” a BGKEY field:

by comparing it while seeking a suitable existing table to use

In which case there will be no BG entries {(dt)} in the table and thus no harm in finding a “spurious” match

or by initializing SEAP’s BGKEY from xxBGKEY in response to a fault

Which cannot happen because there are no references to the table from higher tables.

This is an assertion about the situation above the producer of the mapping table with the field.

That entries in the mapping tables be valid.

DEPEND entries exist here between the background keys and the table entries built after consulting those keys.

If the background key is replaced then those entries must be invalidated.

This is an assertion about the situation below the producer of the mapping table with the field.

As SEAP progressed from the domain root to the node with the background slot, it examined other slots. Note that these other slots have no DEPEND entries relating them to mapping tables. This is unneeded because we do not have (and don’t need (and can’t even frame)) an assertion that the xxBGSLOT field have the correct value.

Proposed New Assertions: to be moved to (chmem) upon committal.

If DIB x locates STOTE y (via SEGTABPP) then field STzzSLOT in y locates the (keeper or background) slot that would be found by running MEMTRECH from x.

If a segment table whose STOTE is x locates page table y, then y’s PTzzSLOT field locates the slot that would be found starting MEMTRECH from x’s producer, or if none is found then the field is the same as the corresponding field from x.

Ramifications of above:

One need never start from the top to find a segment keeper.

If two mapping tables are alike except for xxKPSLOT then any entry in one would be valid in the other. This would be very unusual.

There are DEPEND entries from the format keys to mapping tables for each table located by a higher table.

Code Changes:

The DEPEND entries for mapping tables are made upon return from SEAP when and if a new table is established. The field xxyySLOT of mapping tables is constant. When this DEPEND entry is made the address of the respective slot is provided. This will require new variables for SEAP to remember where it saw the format keys that located the background slot and keeper slot respectively.

Remove the code that “runs from the top” upon lack of knowing where the keeper key may be found.

Misgivings about (pes).

With the thought that the above ideas may not be converging I consider backing off a bit. There remain the issues of:

The crash described at (bbk),

Old bugs revealed by the “exacerbator” (update PENDTEST).

The exacerbator caused a background key to be consulted and thereby involved in a node that had just been uninvolved (by the exacerbator). That was a bug in the new code but it has a brother in the old code. If the background slot had been in the trunk the STBGKEY could have located it in a vacated node frame, and some new (unprepared) node would have acquired an involved key.

With the insight that keeper slots and background slots (which hold background keys) should be treated the same, we could fix the (bbk) crash by eliminating the xxBGKEY fields and find that slot like we find the keeper slot under the same circumstances -- start at the top! The trouble with this idea is that there could never be sharing of tables using the same background key.

The PENDVAL fix

I write this as I am about to release the PENDVAL update.

For each of the three kinds of mapping table entry, there is a distinctive value that SEAP’s caller places in the entry prior to calling SEAP. If a node that was consulted early in SEAP’s execution is displaced late in that execution, then the distinctive value will have been squashed by DEPEND or whatever else maintains currency of mapping tables with slots. Upon such squashing SEAP’s caller knows that SEAP’s yield is based on slots in displaced nodes, and rejects the yield.

BGKEY has been changed to BGNODE (and so STBGKEY etc.) and it now locates the node with the background key. Similarly SEGKEEPKEY is now SEGKEEPNODE.

LOCKSTAK is now gone and locking is described below.

While SEAP runs, fields BGNODE and SEGKEEPNODE are kept valid by core-locking the nodes they designate, even though this may change during the execution of SEAP.

The initial node for an execution of SEAP is prep-locked for SEAP’s duration so that the caller can find its NFDIB field in order to search and perhaps augment the node’s production list.

Node locking in MEMORY

As SEARCHPORTION begins it prep-locks its initial node. This node produces the table in which an invalid entry caused the fault that caused the current execution of MEMORY. This locking ensures that tables produced by this node are not sacrificed by the displacement of the node until the original invalid table entry has been repaired. This, in turn, is to ensure that pending value marks left by callers of SEARCHPORTION will not be lost thereby.

As SEARCHPORTION begins the nodes located by BGNODE and SEGKEEPNODE are core locked. This ensures that these pointers remain valid until they are used by the callers of SEARCHPORTION.

SEARCHPORTION unlocks these nodes before it returns. Actions subsequent to SEARCHPORTION (such as ALSEGTAB) don’t move nodes.

Support of termination protection (p1,tp)

The kernel will presumably determine from the model number at IPL whether the CPU terminates instructions.

The following state transition table describes the state coded in the RO and TP bits of SEAPFORMAT.

input   TP   RO   ROT     page
00      01   10   10      rw access to page
01      01   10   fault   rw access to page
10      10   10   10      ro access to page

ROT is a read only bit in the data byte of a memory key when running on a machine that terminates instructions.

The three states lead to three possible tables produced by a node or page. xTFORMAT and SEAPFORMAT can hold the state code as it does for the two state system.
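One possible reading of the transition table above is sketched here in Python. Several things are assumptions of the sketch rather than statements of the manual: that each row is a current state and each of the columns TP, RO and ROT is the kind of memory key encountered, that the walk starts in state 00, and that the "page" column gives the access granted when a page is finally reached. The dictionary and function names are invented for the illustration.

```python
FAULT = "fault"

# Rows are current states; columns are the kind of key encountered.
TRANSITIONS = {
    "00": {"TP": "01", "RO": "10", "ROT": "10"},
    "01": {"TP": "01", "RO": "10", "ROT": FAULT},
    "10": {"TP": "10", "RO": "10", "ROT": "10"},
}

# Access granted when a page is reached in each state.
PAGE_ACCESS = {"00": "rw", "01": "rw", "10": "ro"}


def walk(key_kinds, state="00"):
    """Step the state through a sequence of key kinds, then reach a page."""
    for kind in key_kinds:
        state = TRANSITIONS[state][kind]
        if state == FAULT:
            return FAULT
    return PAGE_ACCESS[state]
```

Under this reading, state 10 is absorbing: once reached, no later key restores write access.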

We ensure NOT(RO & TP) in involved memory key data bytes as we involve a memory key. (We provide a memory trap at that point.)

Code W as referred to in (p1,tp) is coded as 10 and NEEDWRITE=1.

NI SEAPFORMAT,X'E0'; OC SEAPFORMAT,KEYDATABYTE;
TM SEAPFORMAT,X'A0'; IF O; TM KEYDATABYTE,X'80'; IF O we were in state B; if either NEEDWRITE or the machine terminates then FAULT (In this situation we know that TP in the key was on and that RO in SEAPFORMAT was on. Therefore we were in state C and we thus return to state C) NI SEAPFORMAT,X'DF'; JOIN

OLDER Alternative Implementations

Instructions LH 1,SEAPFORMAT; SRA 1,5; IC 1,KDATABYTE; SRA 1,5; MVC SEAPFORMAT,TRANTAB+32(1) step the machine state. TRANTAB is a 64 byte table initialized depending on the CPU model number.

The above assumes that the information now stored in NEEDWRITE is incorporated into SEAPFORMAT. Both NEEDWRITE and SEAPFORMAT come alive before calls to CHECKMKEY, i.e. outside of SEAP!

The above instructions lose the SSC in SEAPFORMAT!

Splitting SEAPFORMAT

I think that SEAPFORMAT can be split into two fields, say SEAPSTATE and SEAPSSC. The parallel structures in STOTEs and PTFRAMEs would be changed accordingly. The data byte of the memory key would remain packed as it is now. SEAPSTATE codes the state of the “state machine” referred to in (p1,tp).

In this scheme state transition table entries that are to produce faults can have bit 0 set so as to be testable.

Three fields in prepared memory key

This scheme changes the format of prepared memory keys. There are three one byte fields (in a row), one for each of the possible states that the “machine” might be in as it arrives at the key. The field content is the new state and the key’s LSS. The field row is at offset 12 in the slot and encodes but displaces the data byte found there other times. The field KCHECKMARK used in CHECK can be changed to bit 3 of the first byte in the row. CHECK would be changed to leave the bit 0 when it is done.

Encoding the machine state as (A=00, B=01, C=11) the instructions LH 1,SEAPFORMAT; SRA 1,13; IC 1,KROW+1(1); STC 1,SEAPFORMAT; step the machine state and also record the key’s LSS in SEAPFORMAT. KROW is a 3 byte field in the key slot that overlaps KDATABYTE. IT DOES NOT! INCLUDE UPDATING NO_CALL!!

LA 2,15; LH 1,SEAPFORMAT; SRA 1,5; IC 1,KDATABYTE; NR 2,1;] SRA 1,5; IC 3,TRANTAB(1); OR 3,2; STC 3,SEAPFORMAT;] TM TRANTAB(1),1; IF NZ; FAULT.

Logic of NEEDWRITE

I note that (contrary to the comments) NEEDWRITE is an implicit parameter to MEMFAULT since MEMFAULT calls CHECKMKEY directly without setting NEEDWRITE and NEEDWRITE is an implicit parameter to CHECKMKEY (and so documented).

MEMFAULT is a subroutine called only from FAULT1 in PROGINT, and then only upon return 4(R12) from the calls to DOPROTEX, DOSEGTEX and DOPAGTEX there. DOPROTEX sets NEEDWRITE to 1 just as it makes its 4(R12) return (to FAULT1). DOSEGTEX and DOPAGTEX have returns to 4(R12) as well. They have presumably left NEEDWRITE = 0.

Before entering MEMFAULT we go thru one of DOPAGTEX, DOSEGTEX or DOPROTEX. Each of these sets NEEDWRITE.

The sequence of events is:(NEEDWRITE carries information thus:)

Either a translation exception or protection exception leads to a DOxxxTEX (via PROGINT) or we get to one of DOxxxTEX via PREPWRIT. (xxx = {SEG, PAG, PRO}.)

At this point NEEDWRITE has been set because each of DOxxxTEX sets it.

If DOxxxTEX returns to 4(R12) and the original entry was thru PROGINT, then MEMFAULT is called which uses NEEDWRITE by virtue of calling CHECKMKEY without having set NEEDWRITE.

It seems poor style for NEEDWRITE to be a parameter to a routine in MEMORY (MEMFAULT) from outside MEMORY to which NEEDWRITE is local.

Wimpy arguments for putting NEEDWRITE in GLOBALFLAGS.

NEEDWRITE is not local to MEMORY according to the comments above despite the fact that all references to it are in MEMORY.

NEEDWRITE and TERMINATOR need to be tested in the same TM command and thus need to be in the same byte and TERMINATOR is not local to MEMORY. It is set in INIT.

Since we allocate page zero space to modules such as MEMORY by words, saving a byte there saves a word in page zero. That is why we put ISPAGE in GLOBALFLAGS too.
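The second argument rests on how TM works: one instruction tests several bits of a single byte and reports through the condition code whether the selected bits are all zero, mixed, or all one. The Python below models that behaviour. The bit positions chosen for NEEDWRITE and TERMINATOR are hypothetical and are not taken from GLOBALFLAGS; only the condition code rule is a property of the instruction.

```python
NEEDWRITE = 0x80   # hypothetical bit position
TERMINATOR = 0x40  # hypothetical bit position


def tm_condition_code(byte, mask):
    """Condition code set by TM: 0 all zero, 1 mixed, 3 all ones."""
    selected = byte & mask
    if selected == 0:
        return 0
    if selected == mask:
        return 3
    return 1


def either_flag_set(flags):
    """True if NEEDWRITE or TERMINATOR is on, decided by one TM."""
    return tm_condition_code(flags, NEEDWRITE | TERMINATOR) != 0
```

Were the two flags in different bytes, two TM instructions and two branches would be needed to make the same decision.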

Locks

There are two kinds of locks for a node. A node is said to be prep-locked iff bit 0 of NFPREPLOCK is on. This is an exclusive lock. While a node is prep-locked, the prepare/unprepare routines will not change the mode of preparation of the node except by explicit request.

NFPREPLOCK also holds ageing information. 3 {symbolic 'F'} is placed in NFPREPLOCK as the node is unlocked. GSPACE decrements it and reallocates the frame when it reaches 0. Values just less than 255 are often put in NFPREPLOCK for debugging purposes. They leave the frame locked and identify the locker.

A node is said to be core-locked iff NFCORELOCK isn’t 0. While a node is core-locked or prep-locked, the pager will not move or remove the node from its node frame in item space. {Addresses and item indexes will remain valid even when the pager is called.} Prepared keys to the node will not be unprepared by the pager.

NFCORELOCK contains the number of reasons for core-locking the node. We intend to make an argument showing that NFCORELOCK can never overflow. {In fact that the sum of all of the core-locks < 255.}

Both pages and nodes have corelock fields. The following goes for both of them. Small positive values for corelock are from 0 thru HILRU(3). These values represent age information. 3 is youngest and 0 is oldest. Reclamation code decrements these values and reclaims upon reaching 0. Negative values of the form -n*8+k where n>0 and 0<=k<8 denote the locked state where n is the number of reasons the node or page is locked. corelock is set to HILRU when it is unlocked.
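The corelock encoding can be checked with a little arithmetic. The sketch below decodes a corelock value into the number of lock reasons n and the residue k using floor division, which inverts the form -n*8+k directly. The `lock` and `unlock` helpers go beyond what is stated above: they assume that adding a reason subtracts 8 (so the age at the moment of first locking survives as k) and that removing one adds 8. Only the reset to HILRU on becoming unlocked is taken from the text.

```python
HILRU = 3  # youngest age value, as given above


def decode(corelock):
    """Return (n, k): n lock reasons (0 if unlocked) and residue or age k."""
    if corelock >= 0:
        return 0, corelock
    quotient, k = divmod(corelock, 8)  # corelock = -n*8 + k with 0 <= k < 8
    return -quotient, k


def lock(corelock):
    """Add one reason for locking (assumed: subtract 8)."""
    return corelock - 8


def unlock(corelock):
    """Remove one reason; the field becomes HILRU when none remain."""
    n, _ = decode(corelock)
    if n == 0:
        raise ValueError("not locked")
    return HILRU if n == 1 else corelock + 8
```

For example, a frame of age 3 that is locked once holds -5, which decodes to one reason with k = 3.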

The Null Job

By “null job” we mean what the CPU does when the CPUQUE is empty. The null job is sort of like a domain. It has a domain DIB at RNDIBS. The null job has no domain root or annex nodes.

The PSW in the DIB has the wait state bit on. The PSW, general registers, floating registers, and CPUCACHE are loaded and stored as if the domain were real. These components are constant except for CPUCACHE.

Note that while the null job never executes SVC’s or causes program interrupts, it does experience external and I/O interrupts and perhaps machine checks.

The CPUCACHE of the DIB accumulates the wait state time.

Note that this time is not checkpointed.

The routine CHECK

This is a routine that may be executed when the kernel is presumably in a valid state. It checks the kernel’s storage and user page storage keys for validity. It crashes upon discrepancy. In production, CHECK runs just before taking a checkpoint.

It verifies most of the invariants in (valid). It extensively checks page and segment table conformance with segment nodes. Check calls CHECKDEP in DEPEND to verify conformance to (whendepend). See (chmem) for memory assertion checking.

CHECKOFF is a dummy routine used when it is desired to bypass all checking.

An object view of the kernel

This is a log of an attempt to explain the kernel code via the object paradigm. The fundamental transformation rule for concepts is to try to invent a type of object for each DSECT.

Exporting a procedure to set a bit is not an improvement over exporting direct access to the bit. It is merely slower. Object design is more than merely not exporting data, it is not thinking about the data within the DSECT.

The DIB

The code that is primarily concerned with the DIB is: DOMAIN which prepares and unprepares domains, the first level interrupt handlers which move process state from the CPU into the DIB, and the dispatchers which move process state from the DIB into the CPU.

Various parts of the kernel (outside the implementation of domain abstraction) may not need to do to a domain any more than a domain key holder may do. If this is so then we may use the external domain abstraction to define the semantics of internal domain object.

DEPEND

Alas the first module accessing the DIB that I considered was DEPEND. Its reason was to zap SEGTABPP. This access does not change the logical state of the domain.

Perhaps we need a conceptual object like a domain that includes as part of its state, access to a segment table (another object).

ENQUEDOM

ENQUEDOM’s access to the DIB is to set the HOOKED bit to indicate the domain is enqueued. This may be considered as supporting an assertion about that bit.

EXTERNAL

In addition to its role as an FLIH, EXTERNAL maintains the trapped bit so as to maintain an assertion.

GATE

Like EXTERNAL, GATE is an FLIH but GATE also does all of the first order effects required by jump semantics. Most accesses to the DIB are either clearly for accomplishing the external semantics or maintaining an assertion.

PREPKEY

PREPKEY makes several references to the DIB to preserve back chain ordering assertions.

SHOVE

SHOVE makes a few references to the DIB to preserve assertions about SEGTABPP and valid segment tables.

This brief survey makes me think that modifications to the DIB (and presumably many other types of DSECT) can be divided into those that can be explained in terms of external semantics and those made to preserve assertions.

I further suspect that DSECTS within the kernel can be divided into those that support externally defined objects (pages, nodes, domains, wait objects, devices, segments) and those that support concepts that might well be defined as internal kernel objects. A DSECT for a channel is an example of an object with no corresponding external object. CCW block is an internal transient object. A PPSETUP block is an even more transient object.

Branch (multi) was once here.

Initialization (INIT)

INIT is loaded by CMS along with the kernel and runs to initialize main store to conform to the assertions. After INIT has run CHECK can be executed.

RESTART - Entry point in INIT to initialize the kernel and restart from the latest checkpoint.

As INIT begins, CMS rules still apply. INIT uses CMS to read the module files (holding domain code) into the real core frames where they will either first execute or from which they will be swapped to disk. INIT calls INITIS which moves the primordial nodes (which are assembled by GNOSIS ASSEMBLE) to node frames and adjusts them in minor ways. Various module initialization routines are called.

KERNEL SUBROUTINE AND MACRO FUNCTIONS

This section documents the meaning of entry points of modules that comprise the kernel.

Preparing

These are routines concerned with preparing things. See also (dependcl). Definition: A node is (_tied down) if it is core-locked or prep-locked or if it is an annex of a domain whose root is prep-locked. {A tied-down node cannot be swapped out.}

PREPKEY: Prepare Key

This routine tries to prepare a key. It will prepare a key to even a locked node or page. The key must be in a node that is tied down. {Otherwise the key might be swapped out to make room for the thing designated.} The key must be unprepared and uninvolved {else crash}. 1 -> key; 2 -> actor {a domain root}. Actor’s domain root must be tied down. {Of course the backchain order {(bchorder)} is maintained.} PREPKEY may call GET if the key designates a node or page that is designated by no prepared key. PREPKEY preserves registers 6:15 only.

Outcomes:

0(14) The key was obsolete {allocation count or call count too old} and has become DK(0), or the key did not designate a page or node.

4(14) The designated page or node is not in core. It has been queued to come to core. Actor has been enqueued and unlocked. {Other locks are unchanged.}

8(14) The key has been prepared.
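The outcomes above use the multiple-return convention that recurs throughout this section: the routine returns to R14 plus a multiple of 4, and the caller places one 4-byte branch instruction per outcome immediately after the call. A rough Python analogue is given below. The enum member names and the `dispatch` helper are invented for the illustration; only the offsets 0, 4 and 8 and their meanings come from the list above.

```python
from enum import IntEnum


class PrepkeyReturn(IntEnum):
    """Return offsets from R14 for PREPKEY (member names are hypothetical)."""
    OBSOLETE_OR_NOT_OBJECT = 0  # key became DK(0), or designates no page or node
    NOT_IN_CORE = 4             # object queued to come to core; actor enqueued
    PREPARED = 8                # key has been prepared


def dispatch(offset, branch_table):
    """Select a handler the way a branch table after the call would."""
    return branch_table[PrepkeyReturn(offset) // 4]()
```

A caller that ignores an outcome in assembler falls into the wrong branch slot. Here an unknown offset raises ValueError instead.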

TRYPREP - try to prepare a key

Tries to prepare a key, but gives up if I/O would be required. R3 -> the key, unprepared, uninvolved. R14 has return address. R15 has entry point address. Preserves R6:R15

INVOLVEN: Involve a Key to a node

1 -> key; 2 -> actor {a tied-down domain root}. 3 holds a node preparation code. The key MUST designate a node (if that node is not prepared as (R3) then (if the node can be unprepared, unprepare it else return to 0(14)). Prep-lock the node (return to 4(14) if already prep-locked).) Turn on the INVOLVEDW bit in the key. Adjust key’s backchain {if necessary} to preserve (bchorder) {viewing the key as involved}. The node that holds the key will be effectively core-locked. This operation will preserve validity assertions except for the node being prepared. I.e., it is up to the caller to preserve assertions about nodes prepared such as (R3); on the other hand, the rules about prepared and unprepared keys will be handled by the subroutine. {Key may or may not be left prepared if the node cannot be unprepared.} Preserves 0:15.

INVOLVEP: Involve a key to a page

1 -> key; 2 -> actor {a tied-down domain root}. 15 -> INVOLVEP. The key must designate a page. Return to 0(14) upon permanent I/O error. Return to 4(14) with actor enqueued upon page not in core. Return to 8(14) upon obsolete key (key =DK(0)). Return to 12(14) with key involved (INVOLVEDR). Preserves all registers.

HALFPREP: Prepare Half-Prepared Key

R3 -> a key that has prepared bit on and involved bit off and object pointer in SUBJECT. The key will be linked into the backchain, preserving (bchorder) and (midpntr). Upon entry: R15 -> PREPKEY, R3 -> the key, Returns to 0(R14). Clobbers R1, R2 & R5.

PREPDOM: Prepare Domain

Initially: 1 -> DR, must be unprepared else crash, node must be prep-locked. If an annex might not be in core, 2 -> actor’s DR which is tied down. It may crash if there is some domain with an un-prep-locked DR and prep-locked annex. {We believe that this will not occur and thus do not tolerate the situation.} Routine will never return {i.e., loop forever} if there are no DIB’s and none can be stolen because too many domains are prep-locked.

Finally: Registers 1:13 are preserved. PREPDOM may change the mode of preparation of any node, except that it will not change the mode of preparation of any prep-locked node {other than this one} and will not violate the node preparation implications {(prepimpl)}. PREPDOM calls GET if an annex is not in core. {Locks will be unchanged.}

0(14) if delay required. Actor enqueued.

4(14) if some prep-locked node prevented action.

8(14) malformed domain {(p1,malform)}. Not prepared.

12(14) successful. The dirty bits of all 3 domain parts are set. {Keys to annexes prepared and involved, and annexes prepared in their respective modes.}

BORROW: Prepare a Meter

This routine tries to get a cache of some specified resource {and put it in a DIB}. In this process it tends to prepare meters.

12 -> actor. 15 ->BORROW. The resource type is designated by an offset in R11 into a meter node. Floating register 0 provides an upper bound on the cache size; its exponent is X'40'. 1 -> meterkey.

If meter is invalid, actor is un-prep-locked and return is to IDLEX.

Jumps to meter keeper if required.

Returns to 0(14) if cache is gotten.

See (borrow-logic) for implementation details. See (tf) about time formats. See (scavenge) for some related logic.

Unpreparing - Taking Things Apart

These routines tend to undo the effects of those of the preceding section. See also (slotzap).

DETPAGE

Upon entry R1 -> core table entry for a page. All involved keys to the page are uninvolved. {Therefore {see (invmem)} all page table entries to this page are deleted.} Preserves R0:R14.

The above might be restated: The system is transformed into an equivalent valid state such that keys to the indicated page are not involved.

UNPRND - Unpreparing Nodes

UNPRND: Unprepare Node

R1 -> node frame {need not be tied down}. First byte of R1 is 0. If it is prepared, one of the other routines of (unprnd) will be called. 0:13 preserved. Destroys floating registers. Returns to 0(14) if some prep-locked node other than this one prevented any action. {See (prepimpl) for assurance in particular cases.} Returns to 4(14) if successful.

UNPRDR: Unprepare Domain Root

Upon entry 1 -> locked prepared domain root. Locks are unchanged. Preserves all general registers; clobbers floating point registers. Returns to 0(14). Sets age since last reference of annexes to be no older than domain root.

UNPRMET: Unprepare meter

R1 -> prepared meter. That meter {and its inferiors} will be unprepared. Preserves all general registers and destroys floating registers.

UNPRSEG: Unprepare segment

This action proceeds by forsaking {(forsake)} all produced segment and page tables. It then calls SLOTZAP to uninvolve each of the involved keys in the node. (There is an argument that SLOTZAP is sufficient and MEMZAP needn’t be used.) The backchain of these keys is adjusted if necessary.

Finally all involved keys to the node are considered. MEMZAP, SLOTZAP, or ZAPKEY is called depending on the situation.

While UNPRND’s logical function is to leave the system in an equivalent valid state with the node unprepared, it also recovers resources. These resources are: DEPEND entries and page and segment tables produced by the node. After a DEPEND entry is used to invalidate a page or segment table entry (as a consequence of unpreparing a node), the DEPEND entry has no further purpose and is reclaimed.

We note a strange performance effect here. Ageing of segment nodes is not done well. There is indeed no mechanism that ensures that actively used segment nodes are protected from ageing. DEPEND’s shrapnel effect comes to the rescue here. DEPEND zaps mapping entries randomly due to several effects and this causes SEARCHPORTION to run and repair the damaged map entries. The consulted segment nodes are marked as new as this proceeds. This effect may be too much or too little. There may be something that we can do to improve it. (Shrapnel has been considered bad!)

UNPRKY: Unprepare Key

1 -> key which must designate page or node and must be prepared but not involved. It unprepares the key. Returns to 14. 15 -> UNPRKY. Clobbers 2, 4, 5.

KEYTODSK: Produce unprepared version of key

1 -> Key slot in node in node frame. 15 -> KEYTODSK. Produces disk version of key at 0(3). Returns to 0(14). Clobbers R6. This routine is the most accessible of the precise definitions of what it means to be a prepared or involved key.

UNINV

Uninvolves designated key. 1 -> key. Key remains prepared but is placed at other end of backchain. Involved bit is turned off. {Involved bit need not have been on.} Caller is responsible for eliminating reason for involvedness. Preserves all registers. {Must not be used on hooks!}

ZAPKEY

1 -> key. Key must be a hook. De-links key and leaves garbage in the slot. ZAPKEY also notices if its action would cause the LASTINVOLVED relation {(midpntr)} to fail, and if so adjusts the LASTINVOLVED of the designated domain’s DIB. Similarly, ZAPKEY notices if the action removes the last stallee, and turns off NFFLAGS.REJECT if so {(rejectrule)}. Turns off READINESS.HOOKED if necessary to preserve (hookrule). Preserves R0:R15

EMPTSTAL: Empty Stall Queue

R1 -> a node, which must have a hook on its backchain. EMPTSTAL tries to remove one such hook {by putting the node that holds the hook on the external queue}, and returns condition code=0 iff it did. Preserves R0:R13.

UNHOOK: Unhook Slot 13

R1 -> a node, which must have a hook in its slot 13. UNHOOK tries to remove it {without removing any process {in the sense of the extended kernel} from the node}, and returns condition code=0 iff it did. Preserves R0:R13.

SST: Sever Segment Table

This is the implementation of the routine called “sever segment table” in the Algol-68 program “shove”. All DIB’s designating the segment table are made to designate the nul table instead, and the C3 of the domain is made uninvolved. 1 holds 8 times the argument. Preserves 2 through 0.

RETCACHE:

Returns CPU cache to superior meters. (Does not: turn off involved bit in meter key, rechain key ring, reset cache in dib.) R1 -> Domain Root. R11 -> DIB. R13 -> DOMAIN. R14 -> return address. Floating registers and R1 clobbered. Other registers preserved.

SUPERZAP For efficiently uninvolving a single slot.

Function

If a slot is involved and does not hold a hook then SUPERZAP leaves the system in a logically equivalent state with the slot still prepared, if it was, but no longer involved.

This call should be done only for involved prepared keys. It is safe but wasted for prepared uninvolved keys.

Costs

SUPERZAP is near optimal for involved slots in segment nodes, or for a meter slot or memory tree root of a domain root. In other cases UNPRND is called for the containing node.

Necessary and Spurious side effects

Any of the side effects of UNPRND may occur.

What SUPERZAP does when invoked for slot i of node N:

If N is prepared as a domain root then

If i = 3 then make SEGTABB in the DIB refer to NULSEGTB

Else If i = 1 then the CPU cache in the DIB is added to the superior meters and the CPU cache is set to 0

Else call UNPRND for N.

If N is prepared as a segment node then

Call SLOTZAP for the slot.

By the SSC of a STOTE we mean here the value from STFORMAT if that isn’t zero, otherwise the size code from the format key of N.

For each SSC5 STOTE produced by N, zap entries 16*i thru 16*(i+1)-1 of that segment table if the table is that long.

For each SSC4 STOTE produced by N, zap entry i of that segment table.

For each SSC5 STOTE produced by any node whose slot j holds an involved SSC4 segmode key designating N, zap entry 16*j+i.

Zap slot i of each page table produced by N.

If N is prepared as something other than a domain root or a segment, then call UNPRND for N.
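The segment-node case above reduces to index arithmetic on segment tables. The following is a minimal Python sketch of that arithmetic only, assuming a segment table can be modelled as a list of entries. The function names are illustrative and are not kernel names, and we assume that "if the table is that long" means entries beyond the end of the table are simply skipped.

```python
def ssc5_entries(i, table_len):
    """Indices zapped in an SSC5 segment table produced by N, for slot i of N.

    The text gives entries 16*i through 16*(i+1)-1. Entries past the end of
    the table are skipped (an assumption about "if the table is that long").
    """
    return range(16 * i, min(16 * (i + 1), table_len))


def ssc4_entry(i):
    """Index zapped in an SSC4 segment table produced by N, for slot i of N."""
    return i


def ssc5_entry_via_parent(j, i):
    """Index zapped in an SSC5 table produced by a node whose slot j holds an
    involved SSC4 segmode key designating N, for slot i of N."""
    return 16 * j + i
```

For example, slot 2 of a node producing a 64-entry SSC5 table covers entries 32 through 47, while slot 4 covers nothing because the table is too short.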

To Call SUPERZAP

R1 -> Slot, Return to 4(R14), R15 -> SUPERZAP. All registers preserved.

Return to 0(R14) if slot is in a node that can’t be unprepared due to a preplock and SLOTZAP can’t otherwise figure out how to uninvolve slot.

Surprises during the implementation of SUPERZAP

Perhaps because SUPERZAP depends on more assertions holding, we learned some peculiar facts about the kernel.

DEPENDSAFE

DEPENDSAFE was never reset. After SEARCHPORTION had run DEPENDSAFE maintained its value and thus would occasionally nullify the required side-effects of calling SLOTZAP when such a call referred to the same mapping entry as DEPENDSAFE. Calling SLOTZAP must do more than preserve whendepend, it must do that and leave access thru the mapping tables invalid.

SEARCHPORTION now zaps DEPENDSAFE just before it returns as it unlocks the nodes mentioned in the lockstack. Callers to SEARCHPORTION must not do anything that might indirectly call SLOTZAP after calling SEARCHPORTION and before placing the mapping table entry value in the table.

LSS=0 STOTEs

Imagine an LSS0 STOTE produced by a red segment node. Imagine someone replacing the format key with a memory key. The table entries depending on that key will be found and zapped. The STOTE will remain however with its STFORMAT field reading 0.

If that segment table subsequently causes a segment translation exception, the LSS0 STOTE will be taken as evidence of a red node and the SSC will be fetched from the representation of the key in slot 15 which may no longer be a data key.

The SUPERZAP now notices the zapping of a format key and forsakes all LSS0 STOTEs produced by the node.

DEPEND space pollution

Imagine a mapping table entry that depends on several slots, S and some others. Imagine a program that alternately refers to the address defined by that entry and puts a memory key in slot S. Each memory reference causes each of the slots to be recorded in DEPEND. Only the DEPEND entries associated with S are recovered. The other entries accumulate and are protected from normal DEPEND space reclamation by the DEPENDSAFE logic. Eventually DEPEND loops because it can find no space to reclaim that is not protected by DEPENDSAFE.

I have not fixed this yet but I propose the following:

Note that more than one DEPEND entry on a hash chain referring to the same EA is of no use.

Suppose that after the CLC in ZAPCHAIN finds a match we fall into code that no longer worries about protecting that entry.

The effect of this is to remove all but one copy of the DEPEND entry referring to the protected mapping entry.

Entire tables

DEPEND remembered an association between a background key and an entire mapping table. This was redundant because any entries in that table built by consulting the background key would be invalid anyway!

This redundant feature of DEPEND was removed because that call to DEPEND was made after DEPENDSAFE was reset.

BUGS

We now observe the following behavior: MEMBUG deletes the top node of a memory tree of its test domain. It then calls that domain expecting a domain fault.

About 99% of the time the domain fault occurs as it should. (It always occurs on the old item space kernel.) Occasionally no fault occurs and the process disappears.

I have instrumented the kernel to catch the following mysterious happenings.

The last PRGINT to happen to the test domain was a segment translation exception. The most recent translation exception address is 00000A. (These may be different exceptions!)

The test domain crashed while implicitly calling the segment keeper for a strange segment that I can’t relate to the test domain. (It crashed because of extra code to find this bug.)

The kernel is modified to verify the contents of CTL1 upon each PRGINT.

See (slotzap) for SLOTZAP, a way to uninvolve a key.

Altering Domain Situations

ENQUEDOM: Enqueue a Domain

R2 -> domain root. R15 -> ENQUEDOM. Domain is inserted in chain whose head is located by R1. Domain is inserted just to the left of the item designated by R1. {If R1 designates queue, domain is inserted at tail of queue.} R1 must not designate a node. If DR is prepared, then HOOKED in READINESS is turned on. If C13 was a hook before, that hook is removed with ZAPKEY. If C13 was not a hook, it must be unprepared. Preserves 0:15. Returns to 0(14).

KEYJUMP: Jump to a Key

This is not a call; there is no return. ORDERCODE holds order code. R8 -> jumper’s domain root, which is prepared as domain and prep-locked. R10 -> jumper’s DIB. NFKEY0-NFNODE(R12) holds the key to be invoked, which is prepared if it is of a type which can be prepared. {This key is in a node which is not necessarily tied down.} The key must be uninvolved or a hook. R13 holds GATEBASE. Domain is running {(domrun)}.

If some component of the jumpee is not in core, the jumper will be enqueued upon its arrival. In other words, the dry run isn’t over at this point.

JUMPTYPE contains an integer between 2 and 16 inclusive, not 9. JUMPTYPE is > 7 iff P2NODE -> a node which is core-locked. If JUMPTYPE is > 9, then the invoked key is not a segment or meter key. (JUMPTYPE And 6) determines the jump type: 0 is implicit, 2 is call, 4 is return, and 6 is fork. JUMPTYPE is odd iff the jump is explicit and the jumper has PER on. If JUMPTYPE is odd, then PERADDR has the address of the key jump instruction, and PERCODE indicates the PER events that have happened so far.

2 - jump is a call, no per

3 - jump is a call, with per

4 - jump is a return, no per

5 - jump is a return, with per

6 - jump is a fork, no per

7 - jump is a fork, with per

8 - implicit jump, no per

10 - jump is a call, no per, to segment key with a keeper

11 - jump is a call, with per, to segment key with a keeper

12 - jump is a return, no per, to segment key with a keeper

13 - jump is a return, with per, to segment key with a keeper

14 - jump is a fork, no per, to segment key with a keeper

15 - jump is a fork, with per, to segment key with a keeper

16 - implicit jump, no per, to segment key with a keeper

Currently the kernel forbids this case. It is debatable whether it should. The problem is if a trapped domain’s domain keeper is a segment key with a data key as a segment keeper. We must prevent the loop.
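The JUMPTYPE encoding listed above can be decoded mechanically. The following Python sketch is illustrative only: the function name and the returned tuple are our own, and the rules are taken directly from the text (JUMPTYPE And 6 gives the kind, oddness gives PER, and a value above 7 means P2NODE designates a core-locked node).

```python
JUMP_KINDS = {0: "implicit", 2: "call", 4: "return", 6: "fork"}


def decode_jumptype(jumptype):
    """Return (kind, per, p2node_locked) for a JUMPTYPE value.

    Valid values are 2 through 16 inclusive, excluding 9.
    """
    if not 2 <= jumptype <= 16 or jumptype == 9:
        raise ValueError("JUMPTYPE must be in 2..16 and not 9")
    kind = JUMP_KINDS[jumptype & 6]   # (JUMPTYPE And 6) selects the jump kind
    per = bool(jumptype & 1)          # odd iff explicit jump with PER on
    p2node_locked = jumptype > 7      # P2NODE -> a core-locked node
    return kind, per, p2node_locked
```

Decoding each listed value with this function reproduces the table: for instance 13 is a return with PER and a locked P2NODE, and both 8 and 16 are implicit jumps.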

ARGUMENT-is-READY {(arg-ready)}.

If bit 4 of EXITBLOCK = 0 then REFARGLENGTH holds 0.

The left byte of REFARGLENGTH is 0.

If the jump instruction is nullified {the jumper becomes queued or trapped}, the contents of SVCINTILC will be subtracted from the jumper’s instruction address. If the jump is implicit, SVCINTILC should be zero.

A jump to the invoked key is made. The arguments passed are determined as follows. The argument page and the first two keys are determined by EXITBLOCK in the usual manner.

The third key passed is:

If bit 2 of EXITBLOCK = 0 Then DK(0)

Elif JUMPTYPE < 8 Then key from EXITBLOCK

Elif bit 3 of P2SWITCH = 1 Then key from EXITBLOCK

Else a key to the node pointed to by P2NODE; the type and data byte of the key are taken from P2TYPE {the type field should have the PREPARED bit on}

Additional logic after the LOCALKEY change.

If bit 2 of P2SWITCH is one Then a key must be stored in the slot of P2NODE selected by bits 4-7 of P2NODE. If bit 1 of P2NODE is zero then this key is DK0, Else it is the third key passed by the jumper. The slot has been prepared for storing by PUNINV (in PRIMCOM), so if it can not be un-prepared, then there is incestuous overlap with the jumpee and the jumpee is trapped with trapcode 6, subcode 52.

Fi

The fourth key passed is:

If JUMPTYPE And 6 > 2 Then

If bit 3 of EXITBLOCK = 0 Then DK(0)

Else key from EXITBLOCK Fi

Elif JUMPTYPE And 6 = 2 Then a return exit to jumper

Elif bit 3 of EXITBLOCK = 0 Then a restart exit to jumper

Else a fault exit to jumper

Fi
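The rule for the fourth key can be restated as a small decision function. This is a sketch in Python, not kernel code: the function name is hypothetical, bit 3 of EXITBLOCK is passed in as a boolean so that no bit layout is assumed, and the results are descriptive strings rather than real keys.

```python
def fourth_key(jumptype, exitblock_bit3):
    """Describe the fourth key passed, following the If/Elif rule in the text."""
    kind = jumptype & 6
    if kind > 2:                      # return (4) or fork (6)
        return "key from EXITBLOCK" if exitblock_bit3 else "DK(0)"
    if kind == 2:                     # call
        return "return exit to jumper"
    # implicit jump (kind == 0)
    return "fault exit to jumper" if exitblock_bit3 else "restart exit to jumper"
```

Note that for a call the fourth key does not depend on EXITBLOCK at all: the jumpee always receives a return exit to the jumper.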

KEEPJUMP: Jump to a Keeper Key

This is like KEYJUMP, except for putting the jumper on the JUNK queue if the key to be invoked is not a gate key. {This is used by code that senses that a keeper is needed and that no effects will be had from calling a non-gate.}

MIDJUMP: Complete Gate Jump

This entry completes a gate jump to an exit key. R8 -> JRDR, which is prepared as domain and prep-locked. R10 -> JRDIB. R13 -> JEDR, which is prepared as domain and prep-locked. R12 -> JEDIB, which has the BUSY bit on. R11 holds ENTRY. ORDERCODE has the order code to return. ENTRYBLOCK has the entry block. The byte at GATEKEY has PREPARED+EXITKEY. The byte at GATEKEY+1 has the data byte of the exit key being invoked. EXITBLOCK, JUMPTYPE, P2NODE, P2TYPE, P2SWITCH, SOURCEPAGEAE, SOURCEPAGEIX, and SVCINTILC are as at KEYJUMP {(keyjump)}. The data string, if any, has been transferred. The argument keys, order code, and data byte are to be transferred.

MIDFLT: Cause a Fault for Jumper

Jumper’s trap code is set to X'060000000000' plus the contents of R0. R8 -> JRDR, R10 -> JRDIB. JUMPTYPE, P2NODE, SOURCEPAGEAE, EXITBLOCK, and SVCINTILC are as at KEYJUMP {(keyjump)}.

ABANDONJ: Abandon a Jump - Dry run failed

This routine backs up the jumper’s PSW, cleans up from the jump, and goes to IDLED. The jumper has presumably been placed on some queue.

R8 has address of jumper’s domain root, which must be prep-locked. R10 has address of jumper’s DIB. R15 has ABANDONJ.

RELARGPG: Release Argument Page

If SOURCEPAGECTE is not 0 then SOURCEPAGECTE -> core table entry for a locked page. R14 has return address. R15 has RELARGPG. The lock will be removed. Clobbers R2 only.

PERSVC: Get Back to SVC Code

This is not a call; there is no return. If the execution of an SVC causes an instruction fetch event {PER}, we get into the program interrupt code instead of the SVC code. PERSVC is the reentry into the SVC code. The process has been saved except for the process timer. 13 holds GATEBASE, 2 holds the SVC interrupt code, and 8 holds the address of the root node. EXITBLOCK holds the exit block and PERADDR has what the hardware put there. 15 points to the DIB. ORDERCODE is a copy of register 1 of the process.

PRIMSR: Execute simple key invocation (in PRIMCOM)

This entry point is branched to, to return a return code on the invocation of a key. It causes a return to the fourth key parameter as do primary keys. R1 has the return code. R8 has addr of JRDR which is prepared as domain and preplocked. R10 -> JRDIB. JUMPTYPE and EXITBLOCK are set up.

PRIMRET (in PRIMCOM)

This entry point does not return. It performs a jump to a domain called here the jumpee, passing no key parameters. There is no jumper domain. {One can consider the kernel to be the jumper.} R1 has order code. R6 -> beginning of string to pass. R7 has length of string to pass. R8 -> jumpee’s domain root which is prepared as a domain root and prep-locked. Nothing else is prep-locked. R10 -> jumpee’s DIB. R13 -> PRIMCOM. ENTRYBLOCK has the entry block. If EXITTYPE is nonzero, the jumpee will trap if he rejects a nonzero order code. If bit 4 of EXITBLOCK is 1, SOURCEPAGEAE -> core table annex entry for a page which is core-locked. If JUMPTYPE is > 6, P2NODE -> a node, which is core-locked. Dest page is set up, which means: If STRING_SIZE_LIMIT in ENTRYBLOCK is nonzero then if jumpee’s parameter page is OK then DESTPAGEIX has itemindex of core table entry for the page and DESTPAGEANNEXEN -> core table annex entry for it and the page is locked, else DESTPAGEANNEXEN has zero fi fi.

RESUMEJ: Conclude Gate Jump

As at RESUMEQ plus: ENTRYBLOCK and EXITBLOCK are as at KEYJUMP; R13 -> JEDR; GATEKEY and GATEKEY+1 are as at MIDJUMP. Checks for register-modification PER event and sets trap code for any PER event. {{ni} REFARGLENGTH holds the length of the offered argument string. It will be used if the domain is monitoring storage alterations.}

{Alternative to foregoing} The world is valid except that: There is a domain with an unregistered process. R13 -> the domain root and R12 -> the DIB. ENTRYBLOCK holds the entry block of this domain.

The trap code of the domain does not reflect changes to the domain that have happened to the domain under the influence of the current value of the entry block.

CKPTSTLL: Stall This Guy for Checkpointing{de-implemented?}

R10 holds MEMORY. The domain indicated by PZDOMAINDIBP is placed on the KROQUE and goes to IDLEX.

EXTWSTRT: Start Worrier

Called when a domain is placed on the worry queue.

R14 holds return address; R15 holds entry point address. Set a timer element to run the worrier {(worrier)} if none is already set. Preserves 0:15.

CHRGE: Charge Some CPU Time

10 -> DIB of a domain. 15 -> CHRGE. Domain is running {(domrun)}.

The domain has an unregistered process in it. It tries to add to the CPU allocation. It then restarts domain.

Scheduling

IDLEX: Find Work for CPU - Executable entry in PAGEZERO

The world must be valid. Branch to IDLEX. There is no return. If there is an entry on the CPU queue, it will be run; otherwise a wait state PSW will be loaded.

PUTAWAYD: Stop the Running Domain

R14 has return address. R15 has PUTAWAYD. PZDOMAINDIBP has address of the running domain, which must be prep-locked.

This routine un-prep-locks the running domain and un-caches its CPU allocation from the process timer.

STARTDOM: Start Running a Domain

R10 has address of DIB of domain to start. R14 has return address. R15 has STARTDOM.

This routine produces a CPU allocation for the domain and loads it into the process timer. It stores the address of the DIB in PZDOMAINDIBP, making the domain the running domain.

KSSTALL: Adjust priorities for stalling domain

R8 has address of jumper’s domain root, R10 has address of jumper’s DIB, R12 has address of jumpee’s DIB, R13 has address of jumpee’s domain root. R14 has return address. R15 has KSSTALL.

This routine sets the priority of the jumpee to be the minimum of that of the jumper and the jumpee. It adjusts the queueing order on the CPU queue if necessary.

IDLED: PUTAWAYD Then IDLEX

R15 has IDLED. This routine calls PUTAWAYD and then goes to IDLEX.

ENQMVCPU: Move Queue to the CPU Queue

R3 -> the queue head. R14 holds return address. R15 holds ENQMVCPU.

Moves all domains waiting on the indicated queue head (pointed to by R3) to either the CPU queue or to the frozen CPU queue {Iff checkpoint has “dispatching domains inhibited”}. Preserves R0-R15.

RESUMEQ

Give CPU to domain whose DIB is pointed to by R12. Its domain root must be prep-locked. If domain has worry hook in C13 {and HOOKED is on}, it will be removed. R11 holds ENTRY. Domain is running {(domrun)}.

RUNIT: Run a Domain

R12 holds the address of the DIB. R15 has RUNIT. Domain is running {(domrun)} and its READINESS bits have been checked. The domain is dispatched.

RESTPER

R12 -> DIB, R11 -> ENTRY (in GATE), R13 -> Domain root. Restores real PER regs (in case of kernel PER) and resumes execution of domain.

RUNIDLE

Like RUNIT, but the domain’s control register 1 has already been validated. R15 has RUNIT.

STOPDISP: Stop All Domains

Stop dispatching domains. Called during a checkpoint. R14 holds return address; R15 holds entry point address.

RUNDOM

Put a domain on the CPU queue. R2 holds address of domain root. R14 holds return address; R15 holds entry point address.

Running Only the Migrator Domain

Why it is necessary

When swap space is very low, it is necessary to ensure that the migration finishes before swap space runs out. The same holds for directory entries. “Very low” in this case means that we have enough to bring in the external migrator and run it and clean all the pages and nodes in core for the next checkpoint.

We accomplish this by running only the migrator domain. We know that it dirties only a few pages (I think 36) and nodes. We could run any domains as long as the migration finishes and a limited number of pages and nodes are dirtied.

The mechanism

There are only two meters that contain the super meter key. They are the prime meter and the external migrator meter. Only the external migrator runs under the latter. (Its keepers and helpers, if any, would have to run under this meter too. At present there are none. They must not use many pages and nodes.)

When we want to run only the external migrator, we stop the prime meter. Since its keeper is a data key, domains that try to run under it will go to the JUNK queue.

When we want to resume running other domains, we restore the CPU count in the prime meter (having saved it in slot 14) and restart the JUNK queue.

The prime meter is known to have CDA 9. It is locked into item space to make it easy to get.

An older mechanism

The current plan is to modify the code at IDLEX {which is already modified by INIT}, to branch to an alternative to IDLE called IDLEEXP. IDLEEXP does the same as IDLE, except when it finds a member on the CPUQUE that is not the migrator. In this case, it puts that queue member on the FROZENCPUQUE and goes to IDLEEXP. When migration has finished, the FROZENCPUQUE is transferred to the CPUQUE. Branches to IDLE should be changed to go to IDLEX.

It seems as if none of the schemes really provide the opportunity for the migrator to call any friends; it will have to do the job all alone. {Can’t even call segment keepers implicitly.}

IDNTMIGR: Notifies the Dispatcher Which CDA is the Migrator

Entry conditions: R8 - pointer to domain root of the migrator, R15 - entry point address, R14 - return address. Preserves R0-R15.

A buggy proposal

Suppose that RUNMIGR copied the queues of nodes with processes {all queues but the worry queue and the stall queues} onto a new queue, except for the migrator. SLOWMIGR would copy the new queue onto the CPU queue. The worrier would be suspended for the duration.

Two advantages of this scheme are:

The migrator can call other domains {such as its domain keeper}.

We don’t need to insert a test in frequent code.

Possible pitfalls are:

The function of the worrier would be absent.

If a process was serving the migrator but was not in the migrator’s domain when RUNMIGR was called, things would stop.

I suspect that there may be other problems, but I can’t think of them now.

The above pitfalls seem to be a manifestation of bad hierarchy levels.

Design Issues

If we want the kernel to survive bugs in the external migrator, we may have to provide fall-back logic in the kernel to do a crude job if the migrator fails. This is one of the few cases where this is possible, since the migrator is merely a dispensable advisor to the kernel.

See (migrator) for external view.

RUNMIGR: Runs Only the External Migrator Domain

Entry conditions: R15 - entry point address, R14 - return address. Preserves R0-R15.

SLOWMIGR: Returns Domains Other Than the Migrator

Entry conditions: R15 - entry point address, R14 - return address. Preserves R0-R15.

Memory Addressing

Terminology

Clean

A page or node is clean when the disk version {either at the home address or in a swap area} agrees with the core version. Each page and node is cleaned upon each checkpoint.

New and Larger Segment Tables

ALSEGTAB: Allocate STOTE

1 -> STOT entry to be protected from re-allocation. Returns with 1 -> new STOT entry. Length of new segment table is zero. {SZCD will be -1.} On entry 15 -> ALSEGTAB. Returns to 0(14). Preserves registers other than 1. See (shove) for overview.

EXSEGTAB: Expand a Segment Table

The designated segment table is expanded by the amount given. Other segment tables may be moved or sacrificed to this end. 0: size increment. {multiple of 64} 1 -> STOTE to be expanded. Garbage is left in the new portion of the segment table. SHOVE always does a PTLB. On entry 15 -> EXSEGTAB. Returns to 0(14). Preserves 0:15. See (shove) for overview.

ZPSEGTAB: Zap a Segment Table

1 -> STOT entry to be zapped. Length must be greater than 0 {else crash}. Table must not be on production chain. Table length will be set to 64 bytes {smallest} and all entries made invalid. {Locators to this table in DIB’s will not be searched out and removed at the time ZPSEGTAB is called.} Table will be placed on FORSAKENSEGTABHEAD chain. Returns to 0(14). Preserves all registers. See (shove) for overview.

The caller must do a Purge TLB instruction.

The DEPEND Module

General Points

This module is in charge of remembering which entries {SEGTABPP’s, page, and segment table entries} depend on which memory tree slots. See (depndintro).

We describe here a near conceptual bug in the implementation of the DEPEND relation! A kernel correctness proof requires a delicate argument here.

One difficulty arises in the possibility that during the traverse of a portion of a memory tree path, members of the DEPEND relation created in that traverse would be scavenged to make room for other members created in the same traverse. This scavenging is done to reclaim some of the limited space in the representation of the DEPEND relation.

Another problem is that SEARCHPORTION may call GET which may call UNPRND (to free a node frame) which may call SLOTZAP which may call ZAPCHAIN which may destroy nascent entries.

This is a disaster, because the scavenged member’s designated entry is set to invalid in order to preserve {the last clause of} (whendepend). This setting to invalid will not take effect yet, because the entry that it must zap has not yet even been created. The effect is to lose a DEPEND member.

This disaster is avoided by DEPEND not scavenging members that refer to the entry address referred to in location DEPENDSAFE. This value is established by MEMORY when SEARCHPORTION is first called. When SEARCHPORTION is done DEPENDSAFE is made to refer to no entry.
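The protection rule can be illustrated with a toy model. The Python sketch below is a deliberate simplification and makes several assumptions: the real module uses hash chains rather than a flat list, the capacity and the oldest-first scavenging order are invented for illustration, and the class and attribute names are our own. It shows only that scavenging skips members whose entry address equals DEPENDSAFE, and that space cannot be reclaimed when every member is protected.

```python
class Depend:
    """Toy model of a bounded DEPEND relation with a protected entry address."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.members = []        # (slot, entry_address) pairs, oldest first
        self.dependsafe = None   # entry address protected from scavenging
        self.invalidated = []    # entry addresses set invalid by scavenging

    def add(self, slot, ea):
        """Record that the entry at ea depends on slot, scavenging if full."""
        if len(self.members) >= self.capacity:
            self._scavenge()
        self.members.append((slot, ea))

    def _scavenge(self):
        """Reclaim one member that does not refer to the protected entry."""
        for k, (_slot, ea) in enumerate(self.members):
            if ea != self.dependsafe:
                del self.members[k]
                # The designated entry is made invalid to preserve (whendepend).
                self.invalidated.append(ea)
                return
        raise RuntimeError("no reclaimable DEPEND space: all members protected")
```

In this model the failure at the end of `_scavenge` corresponds to the situation where nothing can be reclaimed because every remaining member refers to the protected entry.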

An alternative idea:

During SEARCHPORTION we collect the list of slots consulted. If the search succeeds the entries would be made to DEPEND at once. Otherwise (node not in memory, etc.) the list would be abandoned. The list need not exceed 80 entries since at most four slots per node may be consulted.

If this collection is made in depend space then there will be no problems of chain zaps during the transfer to depend space upon success of SEARCHPORTION.

Advantages:

The duplicate entries in DEPEND that now arise from redoing parts of depend after an obstacle has been removed would be avoided.

The cell DEPENDSAFE would no longer have its strange global significance.

Disadvantages:

It might be slower.

It is a change.

An idea that has not been adopted is to allow “DEPEND” members of the form (core table entry, entry address). This would have more directly solved the problem of changing the storage key of a page or reclaiming page frames.

The current DEPEND module is implemented by chains linking members whose slot address hash to the same value. Thus when a slot is zapped that causes an entry at EA to be invalidated, we do not locate and reclaim other entries mentioning EA. Another scheme would be to doubly link members for the slot and singly link members to the same entry {with an unrooted chain loop}.

Use of DEPEND

As SEARCHPORTION begins it sets DEPENDSAFE to the address of the translation table entry whose value is the purpose of the execution of SEARCHPORTION. SEARCHPORTION calls DEPEND for certain slots that it consults in producing that value. DEPEND builds the members at that time.

IDEPEND - Initialization for the DEPEND module

DEPEND

At entry, register 0 holds the entry address and register 1 holds the slot.

B0 of the entry address holds the access at this point in the memory tree, a 1 if read-only else a zero if read-write. B1-B3 contain the entry type. Currently there are five types, namely the DIB, segment table, segment table entry, page table, and page table entry. The type defs are declared in KERNEL MACLIB. B4-B15 contain the offset for segment table entries and page table entries. B16-B31 contain the main index, namely the DIB number, segment table number, or page table number.

The slot contains an item index, the 16 significant bits of which reside in B13-B28.

The return is to 0(14). This call may have the side effect of invalidating other members of the DEPEND relation but no members referring to the same EA. {See (depndintro).} Preserves 0:13.
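The two register formats described above can be sketched as bit-field packing. The Python below is illustrative: it assumes 32-bit registers with IBM bit numbering (B0 is the most significant bit), the function names are ours, and the numeric entry type codes are hypothetical because the real definitions live in KERNEL MACLIB.

```python
def pack_entry_address(read_only, entry_type, offset, main_index):
    """Build the register 0 value: B0 access, B1-B3 type, B4-B15 offset,
    B16-B31 main index."""
    assert 0 <= entry_type < 8          # three bits; actual codes not shown here
    assert 0 <= offset < (1 << 12)
    assert 0 <= main_index < (1 << 16)
    return (int(read_only) << 31) | (entry_type << 28) | (offset << 16) | main_index


def unpack_entry_address(word):
    """Inverse of pack_entry_address."""
    return (bool((word >> 31) & 1), (word >> 28) & 0x7,
            (word >> 16) & 0xFFF, word & 0xFFFF)


def slot_item_index(word):
    """Extract the 16 significant bits of the item index from B13-B28."""
    return (word >> 3) & 0xFFFF
```

A read-only entry of (hypothetical) type 3 with offset X'123' and main index X'BEEF' packs to X'B123BEEF'.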

SLOTZAP

This routine zaps {invalidates} all entries recorded in DEPEND as having been derived from the contents of a memory tree slot. Other entries may also be zapped. R1 -> slot, R15 -> SLOTZAP. R1:R14 are preserved. Returns to 0(14).

The caller must do a PTLB instruction.

See also (scheduler).

Externally visible entry points not documented elsewhere

IDLEZ - Find a domain to run

ISCHED - Initialization

LOGERROR - Maintain I/O trace table, KERRORWAIT key

Privops: LPSW

Externally visible entry points

KERRLOG - KERRORLOG key service

Input is order code, REFREFARGUMENT, and jumper’s root node and dib pointers

LOGERROR - KERRORWAIT key service

Input is order code, REFREFARGUMENT, and jumper’s root node and dib pointers

LOGMOUNT - Log disk mounts and dismounts

Input is the device address and device block pointer and a boolean whether this is a mount or dismount. Writes a record into the error log.

LOGOBR - Log an outboard record for device I/O error

Input is the device address and device block pointer, an error code describing the event and the number of retries or 0 if the error was not permanent. Writes a record into the error log.

LOGINCCH - Log a channel check recognized at interrupt time

Input is a pointer to the channel table entry for the failing channel, and the limited channel logout, CSW and interrupting device address in page zero.

LOGTRACE - Maintain the I/O trace table

Input is the device address and a code for the event type.

NORESTRT - Indicate device doesn’t need restart to IOINTER

Externally visible entry points

NORESTRT - Returns “Channel still available”

PROGINT - First level program interrupt handler, simulate page faults

Privops: ISK, LCTL, LPSW, LRA, STCTL, STOSM

Externally visible entry points

PROGINT - Program first level interrupt

Entered from hardware program interrupt. Uses PZDOMAINDIBP to locate running domain (or idle process). For interrupts from supervisor mode, detects PER after SVC, and certain expected kernel program interrupts.

PAGTREX - See (trex)

SEGTREX - See (trex)

SCAFOLD - Define entries for un-implemented features

Privops: LCTL, LPSW, SPT, STCTL, STPT

Externally visible entry points

EMPTSTAL - Always returns “didn’t”. See (emptstal).

UNHOOK - Always returns “didn’t”. See (unhook).

DUMMYINT - Always returns “channel and control unit available”

SLEEP

WAKE

Macros

KEYTABLE

KEYTABLE RN,P=XX emits executable code that branches one of 17 (number of key types) ways depending on the type of the key in the slot whose address is in register RN. If that key were a page key control would pass, for example, to XXPAGE.

LESSTHAN A,B

This produces no code and consumes no storage. It causes an assembler diagnostic unless A<B and that fact can be determined before the end of the assembly. Neither A nor B need be evaluable during macro expansion, but both must be evaluable during assembly.

Timers

See keytech-kl,process-timer.

Process Timer (370 implementation)

The timer value is: If stopped Then pzcputimer Else value in the hardware process timer Fi shifted right 8 bits. Stopped = CPUTIMERSTORED.

Time Formats

There is a confusing set of formats used within the kernel to store time values. This arises in part because the time hardware is 52(+) bit fixed point while the facilities to add and subtract such values are floating point. Also various scalings are appropriate for storing 32 bit approximations of time values.

Hardware

The kernel uses the TOD, the clock comparator and the process timer each of which may be described as a 64 bit fixed point integer value in units of 2**(-12) microseconds. These hardware registers are transferred to and from memory on 8 byte boundaries.

DIB

The CPU cache in the DIB is called P.CPUCACHE. It never represents a value greater than five seconds and is stored as fixed 2**62 + 16*microseconds. This can equally be viewed as a double unnormalized floating value 2**(-52)*microseconds when not negative. As such, double unnormalized floating commands can add, subtract, and compare these values when they are not negative.
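The equivalence of the two views can be checked numerically. The Python sketch below assumes the standard S/370 long hexadecimal floating format (one sign bit, a 7-bit excess-64 characteristic, a 56-bit fraction); the function names are illustrative and are not kernel names.

```python
def encode_cpucache(microseconds):
    """Fixed-point form: 2**62 + 16*microseconds, as a 64-bit integer."""
    units = 16 * microseconds
    assert 0 <= units < (1 << 56)     # must fit in the 56-bit fraction field
    return (1 << 62) + units


def as_long_hex_float(word):
    """Read a 64-bit word as an S/370 long hexadecimal float (unnormalized ok)."""
    sign = -1 if (word >> 63) else 1
    characteristic = (word >> 56) & 0x7F
    fraction = word & ((1 << 56) - 1)
    return sign * fraction / (1 << 56) * 16.0 ** (characteristic - 64)
```

The 2**62 term sets the leading byte to X'40', which is a characteristic of 64 and hence a scale factor of 16**0. The remaining 56 bits are read as a fraction, so 16*microseconds becomes 16*microseconds/2**56, which is 2**(-52)*microseconds.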

Values in datakeys in meters.

Here the values are stored as microseconds*16. It will be required that this value be less than 256**7. Meters not conforming will not be prepared and prepared meters will conform.

PROCESS SCHEDULER

See (p1,domserv) for a conceptual abstraction of the domain server. See (p2,supsched) about a scheduler external to the kernel.

Theory

Definition of Priority

With each domain, the kernel associates a number called the Priority. The Priority depends on the domain’s history of running, and it changes with time. The Priority P(t) is defined by:

P(t) = Integral From (-infinity) To (t) Of (R(x) * exp(-b*(t-x))) dx

where exp is the exponential function and R(x) is the characteristic function of the domain. R(x) = 1 if the domain is using the CPU at time x, otherwise 0.

Intuitively, the Priority represents the amount of time the domain has spent using the CPU, weighted towards the present. b is a constant which determines how heavily towards the present the Priority is weighted.

At any given time, of the domains that can use the CPU, the scheduler runs the one with the lowest value of Priority.

Limit Theorem

For all domains and all t, 0 <= P(t) <= 1/b.

Idle Theorem

If R(x) = 0 for t0 <= x <= t1 {i.e., the domain is idle from time t0 to time t1}, then

P(t1) = P(t0) * exp(-b*(t1-t0))

Running Theorem

If R(x) = 1 for t0 <= x <= t1 {i.e., the domain has the CPU from time t0 to time t1}, then

P(t1) = P(t0) * exp(-b*(t1-t0)) + (1 - exp(-b*(t1-t0)))/b

Approximation: If b*(t1-t0) << 1, then

exp(-b*(t1-t0)) is approximately 1-b*(t1-t0) and

P(t1) is approximately P(t0) + (t1-t0).
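The idle and running theorems translate directly into two update functions. This Python sketch is for checking the formulas numerically; the function names are ours, and time and b are in arbitrary but mutually consistent units.

```python
import math


def idle(p0, b, dt):
    """Priority after the domain has been idle for dt (idle theorem)."""
    return p0 * math.exp(-b * dt)


def running(p0, b, dt):
    """Priority after the domain has held the CPU for dt (running theorem)."""
    decay = math.exp(-b * dt)
    return p0 * decay + (1.0 - decay) / b
```

Two consequences are easy to confirm with these functions: running for a very long time approaches the bound 1/b of the limit theorem, and when b*dt is small and the priority is well below 1/b the result is close to p0 + dt.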

Time Slice Theorem

Suppose at time t0 we have two domains, with P1(t0) < P2(t0), and we want to run domain 1 until P1 = P2. At what time t will P1(t) = P2(t)?

t - t0 = (1/b)*ln(1+b*(P2(t0)-P1(t0)))

Approximation: Assume b*(t-t0) << 1. Then we can use the approximation of the running theorem for P1(t). Also, using the idle theorem, P2(t) is approximately P2(t0). Then

t - t0 is approximately P2(t0) - P1(t0).
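
As a numerical check, the exact time slice and its small-slice approximation can be compared in a few lines of Python. The function names and sample priorities below are assumptions chosen for illustration.

```python
import math

def time_slice_exact(p1: float, p2: float, b: float) -> float:
    """Time domain 1 must run, from t0, until P1(t) = P2(t)."""
    return math.log(1.0 + b * (p2 - p1)) / b

def time_slice_approx(p1: float, p2: float) -> float:
    """Approximation valid when b*(t-t0) << 1."""
    return p2 - p1

b, p1, p2 = 0.5, 0.010, 0.012         # hypothetical values
tau = time_slice_exact(p1, p2, b)

# After tau, the running domain's priority meets the idle domain's.
decay = math.exp(-b * tau)
p1_after = p1 * decay + (1.0 - decay) / b    # running theorem
p2_after = p2 * decay                        # idle theorem
assert abs(p1_after - p2_after) < 1e-12
```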

Guarantee of service theorem

We assume that the set of domains is finite and fixed. To prove: a domain that wants to run will eventually run.

Proof: Let Pw be the priority of the domain. If it wants to run and isn’t running, some other domain with priority Pr < Pw is running. By the time slice theorem, that other domain cannot run for more than G(Pr(t0)) = (1/b)*ln(1+b*(Pw(t0)-Pr(t0))) without exceeding Pw. (This time need not be used all at once. More detail needed here.) The domain that wants to run will never have to wait longer than the sum of G(Pr(t0)) for all domains with Pr(t0) < Pw(t0).

Pie Slice Theorem

Assume that domain 1 is using the constant b1 in computing its priority, and domain 2 is using b2. {We leave open the question of how these b’s are determined.} Both domains want to run and no other domains do.

The scheduler runs the domain with lowest priority P. In equilibrium, P1(t) will equal P2(t). In the limit of small time slices, we can treat R1 as a constant between 0 and 1 representing the fraction of time domain 1 runs. Similarly for R2. Then

R1/b1 = R2/b2

In other words, the fraction of time a domain runs is proportional to its value of b.
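
The pie slice result can be checked with a small discrete simulation. This Python sketch is not the kernel's scheduler; the step size, run length, and tie-breaking rule are assumptions made for illustration.

```python
import math

def simulate_pie_slice(b1: float, b2: float,
                       dt: float = 1e-3, steps: int = 20000) -> float:
    """Return the fraction of the second half of the run given to domain 1."""
    bs = (b1, b2)
    p = [0.0, 0.0]
    ran1 = 0
    half = steps // 2
    for step in range(steps):
        # Run the domain with the lowest priority value (ties go to domain 1).
        runner = 0 if p[0] <= p[1] else 1
        for i in (0, 1):
            decay = math.exp(-bs[i] * dt)
            p[i] *= decay                        # both histories decay
            if i == runner:
                p[i] += (1.0 - decay) / bs[i]    # runner accumulates CPU time
        if step >= half and runner == 0:
            ran1 += 1
    return ran1 / (steps - half)

share = simulate_pie_slice(b1=1.0, b2=3.0)
# R1/b1 = R2/b2 with R1 + R2 = 1 gives R1 = b1/(b1+b2).
assert abs(share - 1.0 / (1.0 + 3.0)) < 0.02
```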

Implementation

Implementation of Priority

Since the Priority changes with time, it is impractical to maintain its current value at all times. The kernel stores with each domain the pair of values (P(t0), t0). t0 is some time in the past, in standard S/370 time-of-day epoch, and P(t0) is the priority at that time. The domain has not run since time t0, so its current priority is given by the idle theorem.

The pair of values is stored in the data key in C12. The seven bytes of the data body are divided into the first three bytes, which contain P(t0), and the last four bytes, which contain t0. P(t0) is the high three bytes of a floating point number. t0 is the high half of the time-of-day clock.
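
The layout of the seven-byte data body can be sketched as follows. This is an illustration in Python only; the function names are assumptions, and the P(t0) field is carried as three opaque bytes rather than decoded as a floating point number.

```python
import struct

def pack_priority_body(p_high3: bytes, t0_high_word: int) -> bytes:
    """Build the 7-byte data body: 3 bytes of P(t0), then 4 bytes of t0."""
    if len(p_high3) != 3:
        raise ValueError("P(t0) field must be exactly three bytes")
    # S/370 stores integers big-endian, hence ">I".
    return p_high3 + struct.pack(">I", t0_high_word)

def unpack_priority_body(body: bytes):
    """Split a 7-byte data body into (P(t0) bytes, t0 high word)."""
    if len(body) != 7:
        raise ValueError("data body must be exactly seven bytes")
    return body[:3], struct.unpack(">I", body[3:])[0]
```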

Units

Time (t0) is in units of “big seconds”, that is 2**20 microseconds {the unit of the low-order bit of the high half of the time-of-day clock}.

Time slices, CPU allocations, etc. are in meter units, that is 2**(-4) microseconds.

Priority (P(t0)) is represented in floating point (for historical reasons) and is in units of 2**(-52) microseconds. This is equal to a floating exponent of X'3A' followed by a 4-byte fixed number in meter units.

Selection of Domain to Run

The CPU queue is ordered by Priority. The domain with the lowest value of Priority is at the head of the CPU queue {i.e., to the left of the other domains}, and it will be the first to be run.

The CPU queue is ordered by P(t0) values. It is also ordered by Priority, i.e. P(t) for t=now. This is because before we add a domain to the queue, we update the priority of its right neighbor, and because the value of b is the same for all domains.

Known Problems:

If we use different values of b for different domains, the order of the domains on the CPU queue would have to change with time.

Someone with access to a domain tool key could prevent other domains from running by the following means. He gets a domain onto the CPU queue with a priority lower than that of the domains he wishes to thwart, i.e., in front of them. Then he frequently stores into C12 a value indicating a very high priority. That domain and the ones behind it will never run if there are enough other domains to keep the CPU busy.

These problems could be solved with limited overhead by periodically reordering the CPU queue using current priorities.

Measurement of Time

While a domain is not running, time is measured by the time-of-day clock, for purposes of the idle theorem.

While a domain is running, time is measured by the CPU timer, for purposes of the running theorem.

The difference is that the CPU timer effectively does not run during I/O and external interrupts.

Accounting for Running Time

Accounting for running time is for two purposes, namely for scheduling and for the CPU counter in the meter.

When we begin to run a domain, we select a time slice for the domain. We make the time slice somewhat larger than called for by the time slice theorem to reduce the overhead of switching domains.

The CPU allocation for the domain is the minimum of the meter cache and the time slice. The CPU allocation is either in the hardware CPU timer {and running down}, or in PZCPUTIMER. PZBITS.CPUTIMERSTORED tells which.

In the terminology of (68000-kl,process-timer), the CPU allocation is the same as the process timer.

The {remaining} meter cache at any given time is DIB.CPUCACHE plus the CPU allocation. We always have 0 <= DIB.CPUCACHE < 128 seconds.

The {remaining} time slice at any given time is SLICECACHE plus the CPU allocation. We always have 0 <= SLICECACHE < 128 seconds.

The amount of running time the running domain has accumulated that has not been reflected in its Priority is SLICESTART minus the remaining time slice.
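
The relations among these quantities can be sketched in Python. The field names below merely model DIB.CPUCACHE, SLICECACHE, and SLICESTART; the assumption that SLICESTART holds the slice chosen when the domain began to run, and that the two caches hold the residues after the allocation is taken out, is ours.

```python
from dataclasses import dataclass

@dataclass
class RunAccounting:
    cpu_allocation: int   # lives in the CPU timer or in PZCPUTIMER
    cpu_cache: int        # models DIB.CPUCACHE
    slice_cache: int      # models SLICECACHE
    slice_start: int      # models SLICESTART (assumed: slice chosen at start)

    def remaining_meter_cache(self) -> int:
        return self.cpu_cache + self.cpu_allocation

    def remaining_time_slice(self) -> int:
        return self.slice_cache + self.cpu_allocation

    def unaccounted_running_time(self) -> int:
        """Running time not yet reflected in the Priority."""
        return self.slice_start - self.remaining_time_slice()

def begin_run(meter_cache: int, time_slice: int) -> RunAccounting:
    """Allocate min(meter cache, time slice) and keep the residues."""
    allocation = min(meter_cache, time_slice)
    return RunAccounting(allocation, meter_cache - allocation,
                         time_slice - allocation, time_slice)
```

For example, after begin_run(5000, 3200) the allocation is 3200; once the timer has run down by 1000 units, unaccounted_running_time() reports 1000.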

Queues

General Points

In the view of the kernel, domain roots are the active elements of the world. The kernel’s job is to serve these domains, obey the instructions of the domain, but enforce kernel rules. Some domains have processes in them, and the kernel must remember which. Some such domains are waiting for the CPU, some are waiting for channels to finish, etc. This section describes how the kernel distinguishes these various domain situations.

Each of these situations has a queue associated with it, and some have many queues. These queues have been designed so that when the cause for waiting has ceased, the domain can be quickly found and given the service required to get on with its business. An action of domain B may influence the situation with domain A. Such actions generally cause domain A to be removed from its queue and placed on the CPU queue. When it gets its CPU slot, it may be immediately queued on the same or different queue. The larger problem is to identify the domain A whose situation has changed.

These queues are implemented by (_hooks) {(pdhook)}. A hook is a special type of key that is always prepared and involved and always resides in slot 13 of some node. Just as other prepared keys designate pages or nodes, hooks designate queue heads or the domain root of a busy domain. The members of a queue are the domains holding a hook designating the same queue head. The queue of domains waiting for domain X to be ready each hold a hook to X. The queue of domains awaiting an I/O event hold a hook designating an item allocated to that event. The general back pointer mechanism of keys provides for finding members of such queues quickly.

See (quelist) about what queues there are and their names.

On Hooks and Decongesters

Hooks appear only in slot 13 of nodes, and nodes with hooks are in core. A hook can be in a node with any mode of preparation, and can designate a node with any mode of preparation or can “designate a queue”.

A hook must be exactly one of the following types:

1. The node containing the hook is the root of a domain with a process that is waiting to execute on a CPU. The hook designates the CPU queue, and the order of hooks in the CPU queue is of interest only to the dispatcher.

2. The node containing the hook is the root of a domain with a process that is in some I/O wait, e.g.,

- waiting for a page/node to be brought in

- waiting for a range to be mounted

- waiting for I/O request blocks

- waiting for page/node space

- waiting for the clock comparator

The hook designates the appropriate I/O queue, and the order of hooks in the queue is of interest only to the I/O routines.

3. The node containing the hook is the root of a domain {the (_stallee)} with a process whose most recent action was to attempt to jump to another domain {the (_staller)} {(p1,entrystall)}. The hook designates the staller, and the order of hooks in the backchain is an approximation to the order in which the stallees became stalled. The staller may be ready {see worrying logic}.

4. The node containing the hook is the root of a domain with no process, which is in the state called (_worrying). The hook, called the (_worry hook), designates the worry queue, and the order is of no significance. {Then why is it called a queue?} The worry queue header is at (WORRY).

Every domain with a process has in slot 13 of its root a hook in state 1, 2, or 3, or is currently running on the CPU.
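
The four kinds of hook and the invariant just stated can be modeled as follows. This Python sketch is illustrative only; the enumeration names are assumptions, and the rule that a running domain holds no hook is taken from the definition of the hoard later in this section.

```python
from enum import Enum
from typing import Optional

class HookType(Enum):
    CPU_QUEUE = 1   # waiting to execute on a CPU
    IO_WAIT = 2     # waiting on some I/O queue
    STALLED = 3     # stallee hooked to its staller
    WORRYING = 4    # domain with no process, on the worry queue

def process_invariant_holds(has_process: bool,
                            hook: Optional[HookType],
                            running: bool) -> bool:
    """Check the rule for domains that have a process."""
    if not has_process:
        return True            # the rule says nothing about such domains
    if running:
        return hook is None    # a running domain has no hook
    return hook in (HookType.CPU_QUEUE, HookType.IO_WAIT, HookType.STALLED)
```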

Since hooks can’t exist on the disk {they are prepared}, the information that they convey must be held in some other form when queue members or queue heads {stallers} are on the disk. Since item space is limited, nodes with processes may overflow it. We use the term (_decongester) to refer to a program, running in problem state, that holds this information. The decongester is probably in everyone’s performance kernel.

There is a facility for communication from the kernel to the decongester. I describe here the essential features of this facility but not the details. This memo gives a complete list of the messages that can be sent. The kernel has {I assume a fixed amount of} space reserved in core to buffer a few messages {how many is a design parameter; there is a minimum requirement given below}. The decongester has a key called the decongester tool {(p2,decongtool)}, with which it can receive messages and wait until there is a message to be received. If the decongester does not accept messages as fast as they are generated, and the buffer becomes full, then no more messages can be sent until the decongester accepts a message. Messages must be buffered first-in first-out {to preserve order of stallees and to properly handle external queues for severed nodes}.

Messages to the decongester are divided into two types: “unhook messages”, which are generated as a result of unpreparing a hook or entering a busy domain with an external queue, and requests for internal queue replenishment. Of the buffers used to store messages, a number, U, are reserved for use solely by unhook messages.
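
One way to picture the buffer is as a first-in first-out queue in which replenishment requests may occupy at most the total number of buffers minus U, so that U buffers always remain usable by unhook messages. The Python sketch below is a model under that reading; the class and method names are assumptions.

```python
from collections import deque

class MessageBuffer:
    def __init__(self, total: int, reserved_for_unhook: int):
        if not 0 <= reserved_for_unhook <= total:
            raise ValueError("reservation must fit within the buffer")
        self.total = total
        self.reserved = reserved_for_unhook      # the number U
        self.queue = deque()

    def _count(self, kind: str) -> int:
        return sum(1 for k, _ in self.queue if k == kind)

    def send(self, kind: str, payload) -> bool:
        """kind is "unhook" or "replenish"; returns False if not sent."""
        if len(self.queue) >= self.total:
            return False
        if kind == "replenish" and \
                self._count("replenish") >= self.total - self.reserved:
            return False
        self.queue.append((kind, payload))
        return True

    def accept(self):
        """Decongester side: take the oldest message, or None if empty."""
        return self.queue.popleft() if self.queue else None
```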

{LOOSE END} Hooks should not be in this section!

Worrying

The worry queue exists to counteract the effect that occurs when the situation of a stall queue member changes such that it does not re-attempt the jump that originally placed it on the stall queue. Without the worrier, such a situation would strand subsequent members of the stall queue.

In addition to the queue of stallees hooked on a staller’s domain root’s backchain, every node has a bit telling whether there is an external queue. When a domain becomes ready and has either an internal queue of stallees or an external queue, we perform procedure X and hook the domain onto the worry queue. Also, every so often {e.g., every 0.5 seconds} the kernel performs procedure X for every domain in the worry queue.

Procedure X: {There is either an internal or external queue.} If there is an internal queue, take the domain at its head and put it in the CPU queue. {{NI} Then, if the internal queue is {nearly?} empty, and there is an external queue, attempt to send a message to the decongester of the form, “This domain, identified by its CDA, wants its internal queue replenished”. If the number of buffered messages of this type is already at the maximum {namely the total number of buffers minus the number reserved for unhook messages}, the message is not sent. {The worrier will try to send it again later.} If there is no longer either an internal or external queue, remove the domain from the worry queue.} End of procedure X.

When a domain becomes non-ready, we remove it from the worry queue {but if it was non-ready when placed on the worry queue we leave it there - see below for “sever”}.

When a domain uses an entry key to jump to a domain, if the jumpee is ready the jump proceeds. Otherwise {the jumpee is busy}, if the jumpee’s external queue bit is off, the jumper joins the tail of the internal queue of stallees. Otherwise {the external queue bit is on}, the jumper attempts to send an {unhook} message of the form, “This domain, identified by a restart exit key, tried to enter this other domain, identified by its CDA, and found its external queue bit on.” If the message is sent, the jumper no longer has a process {or a hook}, and the decongester will put it at the tail of the external queue for that CDA. If the message cannot be sent, see below.
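
The decision made on such a jump can be summarized in a short Python sketch. The function name, the outcome strings, and the callable that stands in for the attempt to send an unhook message are all hypothetical.

```python
from typing import Callable

def attempt_entry_jump(jumpee_ready: bool,
                       external_queue_bit: bool,
                       send_unhook_message: Callable[[], bool]) -> str:
    """Return what happens to a jumper using an entry key."""
    if jumpee_ready:
        return "jump proceeds"
    # The jumpee is busy.
    if not external_queue_bit:
        return "join tail of internal stall queue"
    # The external queue bit is on: the decongester must queue the jumper.
    if send_unhook_message():
        return "handed to decongester for external queue"
    return "message buffer full"
```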

When a “sever” operation is done on a node (NRANGE(2,..) - see (p2,nrange)), the following is done. If the node has a hook, it is destroyed {removing any process from the node}, and all keys to the node are changed to zero data keys {i.e., the allocation count is incremented}. Then, if there is an internal or external queue of stallees, the node is hooked onto the worry queue {in spite of the fact that it may be busy}. {Or, alternatively, put the internal queue members on the CPU queue and send a message to the decongester to do the same to the external queue.} Then, the severing domain attempts to unprepare the hook as described below. {This procedure ensures that a domain {in fact, a p2 process - see definition below} will not remain stalled forever on a key that has become a zero data key.}

Unpreparing Hooks

When a hook is copied, the copy is unprepared. An unprepared worry hook is DK(0). Other unprepared hooks are DK(1). {This supports the illusion of (p1,dsterm).} Normally, a key is unprepared (1) when the node frame that the key resides in or {for type 3 hooks} that it designates is required for other uses, and (2) when a domain performs an action that causes the key’s slot to be overwritten. Since the domain server is conceptually continuously storing DK(0) or DK(1) into a domain’s C13, hooks need not be influenced by cause 2. See (wrnop).

To unprepare a hook of type 1 or 2, the kernel attempts to send an {unhook} message to the decongester of the form, “This domain, identified by a restart key, was removed from this queue, identified by some queue ID.” The information as to the order of the hook in the queue is lost. {The representation of an exit key in a message buffer must include a call count.}

In unpreparing a hook of type 3, it is a design goal to preserve the ordering information. To unprepare a hook of type 3 we go to the tail of the chain of stallees of which this hook is a part, take the domain there, and attempt to send an {unhook} message to the decongester of the form “This domain, identified by a restart exit key, was on the tail of the internal queue of this other domain, identified by its CDA.” If the message is sent, the domain is removed from the queue and the external queue bit is turned on. The attempt to unprepare the hook is successful if the hook was removed from the queue in the above procedure, otherwise not. If it was the pager that was trying to unprepare the hook, it may wish to look for space elsewhere after repeating the procedure some small number of times.

To unprepare a hook of type 4 {worrying}, we first require that the internal queue of stallees be empty; if it is not, we take all the domains in the internal queue and put them on the CPU queue {as in procedure X, but do not send a message}. If there is no external queue, that procedure will have unhooked the domain. Otherwise, the kernel attempts to send an {unhook} message to the decongester of the form “This domain, identified by its CDA, was removed from the worry queue.” If the message is sent, the external queue bit is turned off. When the decongester receives the message, it will restart all domains on the external queue.

When the Buffer is Full

Unhook messages are sent only when (a) the pager tries to unprepare a hook because it wants to re-use space occupied by a node which either contains a hook {in slot 13} or has a hook on its backchain; or (b) a domain tries to unprepare a hook {because it is doing a “sever”} or tries to jump to a busy domain with an external queue. {When a domain root is unprepared, the hook, if any, is not unprepared.} When the pager tries to send an unhook message and is unable, it will look for space elsewhere {we show below that it will always be able to find it}.

When a domain tries to send an unhook message and is unable, then instead of what it was doing it will perform an implicit call to a gate to the decongester known as the unhooker. {Or {quick and dirty}, it should be put on the end of the CPU queue.} The decongester will restart the domain when the message buffer is no longer full.

The Decongester

In order to allow synchronization between the kernel and the decongester, we introduce an integer variable N maintained by the kernel. We define the (_hoard) of the decongester to be a constant minus (the number of hooks in existence plus 1 if a domain is running on the CPU {such a domain has no hook} plus the number of buffered unhook messages plus the contents of N). The constant is the size of node space {i.e., the number of hooks that can exist} plus one. The initial contents of N are chosen so that the hoard is initially zero.

The use of the variable N is determined by these two rules: (1) only using the decongester tool changes the hoard; (2) N is never negative.

Initially, there are no buffered {unhook} messages; therefore N is initially non-negative. Rule 1 determines when N must be increased. {Note that sending a message never causes N to change.} Other than decongester tool actions, there are only three actions that can cause N to be decreased: (a) fork, (b) a return jump to a domain, where the jumper has an internal or external queue of stallees {and hence becomes worrying}, and (c) the situation of (p1,noproc). {LOOSE END -- how do we handle (c)?} When a domain attempts one of these actions and N is zero, rule 2 forbids the action. Instead, the domain does an implicit call to an entry to the decongester called the hooker.
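
The definition of the hoard is simple arithmetic, sketched here in Python. The function and parameter names are assumptions; only the formula comes from the text.

```python
def hoard(node_space: int, hooks: int, domain_running: bool,
          buffered_unhook_messages: int, n: int) -> int:
    """Hoard = constant - (hooks + running + buffered unhook messages + N)."""
    constant = node_space + 1
    running = 1 if domain_running else 0
    return constant - (hooks + running + buffered_unhook_messages + n)

def initial_n(node_space: int, hooks: int, domain_running: bool) -> int:
    """Choose N so the hoard starts at zero with no messages buffered."""
    n = (node_space + 1) - hooks - (1 if domain_running else 0)
    if n < 0:
        raise ValueError("rule 2: N is never negative")
    return n

n0 = initial_n(node_space=100, hooks=10, domain_running=True)
assert hoard(100, 10, True, 0, n0) == 0
# Unpreparing a hook into an unhook message leaves the hoard unchanged.
assert hoard(100, 9, True, 1, n0) == 0
# When the decongester accepts that message, its hoard rises by one.
assert hoard(100, 9, True, 0, n0) == 1
```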

When the decongester uses DECONGESTER_TOOL(2,..), it uses a timeout to detect the case in which the exit key refers to a domain whose root node is dismounted. Such a domain loses its place in the queue, and goes into the set of domains which the decongester is trying to restart.

When the decongester uses DECONGESTER_TOOL(1,..), it uses a timeout to detect unmounted domains. Such domains will be given another opportunity to run later.

We require that there be an upper bound B on the amount of node space that the decongester needs to be able to accept messages from the kernel. {It may be possible to show that there is a bound on the amount of node space that any domain needs to make headway.} We require that U, the number of buffers reserved for unhook messages, be greater than B. Then, if the decongester never allows its hoard to become negative, it will always be able to make headway. Exercise: Prove this.

Theorem: The decongester can always increase its hoard from zero.

Proof: The following program does it:

BEGIN SEMA mutex = LEVEL 1,
      INT hoard := 0;
  PAR BEGIN
    WHILE TRUE DO
      accept messages;
      DOWN mutex;
      hoard +:= number of unhook messages accepted;
      IF hoard > 0 THEN use hoard FI;
      UP mutex
    OD,
    BEGIN
      DOWN mutex;
      hoard +:= attempt to increase hoard by(1) # returns the amount it was increased by #;
      IF hoard > 0 THEN use hoard FI;
      UP mutex
    END
  END
END

If the call on the decongester tool to increase the hoard succeeds, we are done. If it fails, then N must be zero. Since the hoard is zero, by definition of the hoard there must be at least one unhook message buffered; therefore the next call to accept messages will increase the hoard. End of proof.

Definition: A domain contains a (_P2 process) if and only if either it contains a process, or it contains no process and there is an exit key to it either in a message buffer or held by the decongester.

Exercise: Show that there is a decongester such that (1) external queues work right {define this!}, and (2) a domain with a P2 process makes headway at a nonzero rate, and there is no limit to the number of P2 processes {but only one to a domain}. {This requires showing that the decongester never gets into a deadlock.} {LOOSE END} The decongester probably needs to use a timer, but it cannot use (p2,wait) because that is potentially a client of the decongester.

EXTERNAL INTERRUPTS

External interrupt masking: All external interrupts will be enabled when running domains. If the kernel is in the PSW sampling mode {not yet implemented} {NI}, then the I/O, SVC, and Program interrupt handlers will disable all external interrupts and immediately re-enable only Interval timer interrupts.

Clock Comparator interrupts:

Any process that is running is first saved. Then these interrupts are passed to a kernel function that multiplexes them among a small {fixed} number of users as follows:

The BWAIT server provides timing services to a few extra-kernel functions. The following BWAIT numbers are allocated:

0 for CKPTDVR {timer-driven checkpoint driver}

1 unused

2 for the BWAIT multiplexor

3 reserved for future use

{obsolete}The current time slicer interrupts the CPU on a fixed schedule {period in SLICEQUANTUM} and puts the current domain on the bad end of the CPU queue and goes to IDLEX {to pick up a domain from the good end of the queue}.

This activity ceases when the CPU queue is empty.

The kernel BASE I/O driver code.

The worrier. See (worrier).

CPU Timer interrupts are passed to the meter handling logic.

Timer interrupts and PSW sampling:

Interval Timer interrupts will be passed to a routine that performs internal kernel PSW sampling. This processing will use no other kernel functions, so it may transparently interrupt the kernel.

All others are ignored.

Synthetic external interrupts:

The process timer is used to measure and limit the time used by the domain. There are situations in which the kernel may perform an unending task for a domain. The kernel must explicitly handle these situations, especially when there may be no termination of the task. For example, a trapped domain with DK(0) as a trap key will occupy the kernel indefinitely with no opportunity for the external interrupt mechanism to intervene.

For this reason, the kernel explicitly charges {via the cache in the DIB} 200 microseconds for a trap. If this action exhausts the cache, a synthetic external interrupt is generated.
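
The effect of the explicit charge can be sketched in Python as below. The function names are hypothetical, the cache is kept in microseconds for simplicity, and treating a cache that reaches zero as exhausted is an assumption.

```python
from typing import Tuple

TRAP_CHARGE_US = 200   # microseconds charged per trap, from the text

def charge_trap(cache_us: int) -> Tuple[int, bool]:
    """Return (new cache, whether a synthetic external interrupt is due)."""
    remaining = cache_us - TRAP_CHARGE_US
    if remaining <= 0:
        return 0, True
    return remaining, False

def traps_until_interrupt(cache_us: int) -> int:
    """Count traps a looping domain can take before it is interrupted."""
    traps = 0
    interrupted = False
    while not interrupted:
        cache_us, interrupted = charge_trap(cache_us)
        traps += 1
    return traps
```

With a cache of 1000 microseconds, traps_until_interrupt returns 5, so even a domain that traps forever is bounded by its cache.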

On each trap it is ascertained whether the CPU has been in user mode since the previous trap. If not, an explicit test is performed to see if it is time for a time slice.

Whether the CPU has been in problem state is determined by a TS (Test and Set) instruction on byte 4 of the location from where the user mode PSW is loaded.

When the interrupt is done, the unblocked jobs run {typically for a very short while}, and then the original job resumes.

The interrupted job usually gets a shortened slice in this case but occasionally gets parts of two slice quanta for its trip through the CPU queue. It averages out {exactly!}.

INPUT/OUTPUT

The I/O System is structured in four parts.

The Interrupt Handler (IOINTER) fields interrupts from all devices, and communicates with the DEVICEIO part and the Paging Device Driver part.

VNXA- This module handles any I/O interrupts that occur. It calls the “interrupt proc” associated with the interrupting device. It is responsible for restarting the channel and control unit. When this is necessary, it does this by calling the “restart proc” for each eligible device. It also maintains the control unit busy array and the restart flags in the channel table.

VXA- This module receives the initial I/O interrupt. It also contains the DOIOINT routine. I/O interrupt processing:

Put away running process

IF device block address not returned by hardware THEN

look up device block address

IF there is a device block for device THEN

Get DEVICELOCK, IF not available mark lock “Interrupt pending”

ELSE { device lock available }

Call DOIOINT with device block to process interrupt

Restart process

VXA- DOIOINT processing

Do TSCH for device into common (CPU dependent) area

Perform any necessary Channel Check processing

CASE device interrupt proc IN

: CALL device restart proc

Release device lock

IF lock indicated interrupt pending THEN

GOTO DOIOINT start processing to process it

return completed requests

re-enqueue incomplete requests

check for clean start, checkpoint etc.
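
The interplay between the device lock and its “interrupt pending” mark can be modeled in Python as below. This is a single-threaded sketch of the control flow only; DeviceLock, on_io_interrupt, and the handler callable are hypothetical names, and the real routines deal with hardware state that is not modeled here.

```python
from typing import Callable

class DeviceLock:
    def __init__(self) -> None:
        self.held = False
        self.interrupt_pending = False

def on_io_interrupt(lock: DeviceLock,
                    process_interrupt: Callable[[], None]) -> int:
    """Return how many interrupts this call ended up processing."""
    if lock.held:
        # Lock not available: leave a mark for the current holder.
        lock.interrupt_pending = True
        return 0
    lock.held = True
    handled = 0
    while True:
        process_interrupt()
        handled += 1
        # Release the lock, noting whether an interrupt arrived meanwhile.
        pending = lock.interrupt_pending
        lock.interrupt_pending = False
        lock.held = False
        if not pending:
            return handled
        lock.held = True    # go round again to process the pending one
```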

The DEVICEIO Handler supports the Device key for all devices other than kernel paging devices.

The Paging Device Driver (GINTDSK, GCOMDSK, GDSKDVR, GCCWBULD) services REQUESTs from the Device Independent Paging Subsystem.

The Paging Device Driver is entered at GDDENQ passing a REQUEST with chained DEVREQs. The Paging Device Driver retains control of the REQUEST and DEVREQs until the REQDONEPROC is called.

The REQDONEPROC is called when no DEVREQs are either PENDING or SELECTED and REQCOMPLETIONCOUNT = 0. If none of the DEVREQs are in either of these states but REQCOMPLETIONCOUNT is not equal to 0 (e.g. because CCWs could not be built), then the request may be placed on the ENQREQUESTWORKQUEUE for further processing after the next interrupt.
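
The rule for finishing a REQUEST can be restated as a small Python predicate. The function name and the outcome strings are assumptions; the state names and the two conditions come from the text.

```python
from typing import Iterable

def request_disposition(devreq_states: Iterable[str],
                        req_completion_count: int) -> str:
    """Decide what may happen to a REQUEST given its DEVREQ states."""
    busy = any(s in ("PENDING", "SELECTED") for s in devreq_states)
    if busy:
        return "still in progress"
    if req_completion_count == 0:
        return "call REQDONEPROC"
    # e.g. CCWs could not be built; retry after the next interrupt.
    return "may be placed on ENQREQUESTWORKQUEUE"
```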

The Device Independent Paging Subsystem handles paging, checkpoint, migration, etc. at a level that is independent of specific devices and channels. Its modules include GET, GRANGET, GDIRECT, GSWAPA, GBADPAGE, GCKPT, GMIGRATE, GJOURNAL, GRSYNC, GCLEANL, GUPDPDR, GUINT.

Swapping and Checkpointing Concepts

See (checkpoint) for another introduction which describes the algorithms more and the states less.

Top Level

There are two or three entire logical system states within a real system at any point in time. There are always the “current” state and the “backup state” which would be used upon system restart. For a few seconds after the initiation of a new checkpoint there is also the “next backup state” corresponding to the time of initiation of the new checkpoint.

Naïvely, two physical areas of disk, called “swap areas”, alternatively hold the pages and nodes that have been changed in the duration between successive checkpoints. More accurately a completed swap area holds a set of page states and node states. A page state consists of a CDA, 4K bytes, a gratis bit, and an allocation id. A node state consists of a CDA, 16 unprepared keys, a process bit, a gratis bit, an allocation id and a call id.

As main store fills, long-unaccessed pages are moved to the current swap area. When a page is needed, it is sought first in core, then in the current swap area, then in the backup area, and only then in a home range.
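
The search order can be sketched in Python with each place modeled as a dictionary keyed by CDA. The function name and the use of dictionaries are assumptions made for illustration.

```python
def locate_page(cda, core, current_swap, backup_swap, home):
    """Return (where found, contents), searching in the order given above."""
    places = (("core", core),
              ("current swap area", current_swap),
              ("backup area", backup_swap),
              ("home range", home))
    for name, store in places:
        if cda in store:
            return name, store[cda]
    raise KeyError(cda)
```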

During migration from a swap area that area remains unmodified, thus the migration process is restartable (idempotent). The alternate swap area forms during migration. Migration as described above is truncated to the extent that a page modified after a migration has begun but before that page has migrated is not migrated since such a home disk frame will not be read before the next migration.

Next Level

Much is explained by the observation that a swap area is written sequentially. Of the page states only the 4K values are written early. A structure called the “directory” remains in core to provide an association to locate the page on the disk swap area given its CDA. There is a directory for each swap area.

Long unreferenced nodes are collected into page frames, called “swap pots”, and sent to the swap area along with the pages. Directory entries for nodes identify these swap pots.

The life cycle of a swap pot: Created when there are no non full swap pots and node frames are crowded. Added to when node frames are crowded and modified only upon this event. Read when a node is moved back to a node frame. Swapped out when core is crowded. Brought into core when a node therein is required. Read to produce node values upon migration. Deallocated along with all the others in a swap area upon allocation of that area to a new checkpoint.

Another class of swap pages hold directories of the swap area to be used upon restart. These are written on disk and never read except upon restart. They are written after the next checkpoint has been initiated.

A directory entry in core has a CDA, two “SWAPLOCs”, and an allocation count. This holds for both pages and nodes. The call count and process bit for a node are next to the node in some swap pot. Access to the entry is by hashed CDA. Co-hashes are chained together on chains rooted in a table of chain heads indexed by hash. A SWAPLOC is a locator that indirects through range tables to the place on disk that holds the swap ranges. Since SWAPLOCs may be invalidated upon IPL they are transformed into something akin to CDAs when transformed for storing on the disk. These special CDAs are a name space local to swap ranges.
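
A directory of this shape can be sketched in Python as a table of chain heads indexed by a hash of the CDA. The class names, the modulus hash, and the representation of a SWAPLOC as an optional integer are assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class DirectoryEntry:
    cda: int
    swaplocs: Tuple[Optional[int], Optional[int]]   # two SWAPLOCs
    allocation_count: int

class SwapDirectory:
    def __init__(self, n_heads: int = 64) -> None:
        # Table of chain heads, indexed by hash.
        self.heads: List[List[DirectoryEntry]] = [[] for _ in range(n_heads)]

    def _chain(self, cda: int) -> List[DirectoryEntry]:
        return self.heads[cda % len(self.heads)]

    def insert(self, entry: DirectoryEntry) -> None:
        self._chain(entry.cda).append(entry)

    def find(self, cda: int) -> Optional[DirectoryEntry]:
        """Walk the chain of co-hashes looking for the CDA."""
        for entry in self._chain(cda):
            if entry.cda == cda:
                return entry
        return None
```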

The backup directory is divided into three dynamic parts: Unmigrated, Dataforap and Journal. Each has its separate set of hash heads. They describe disjoint sets of CDAs. A page or node is initially represented in the unmigrated section. When the home position has been updated from the swap area (migrated) the corresponding directory entry is “moved” to the “dataforap” part of the directory. When the allocation pot for a page is updated in the home, the directory entry moves to the journal part of the directory. When a checkpoint begins the unmigrated and dataforap parts must be empty. When a checkpoint ends the journal part is emptied.
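
The movement of an entry through the three parts can be modeled as below. This Python sketch tracks only CDAs; the class and method names are hypothetical, and the assumption that an entry reaches the journal part from the dataforap part is ours.

```python
class BackupDirectory:
    def __init__(self) -> None:
        # The three parts describe disjoint sets of CDAs.
        self.unmigrated = set()
        self.dataforap = set()
        self.journal = set()

    def add(self, cda: int) -> None:
        """A page or node is initially represented as unmigrated."""
        self.unmigrated.add(cda)

    def migrated(self, cda: int) -> None:
        """Home position updated from the swap area."""
        self.unmigrated.remove(cda)
        self.dataforap.add(cda)

    def allocation_pot_updated(self, cda: int) -> None:
        """Allocation pot updated in the home."""
        self.dataforap.remove(cda)
        self.journal.add(cda)

    def may_begin_checkpoint(self) -> bool:
        return not self.unmigrated and not self.dataforap

    def end_checkpoint(self) -> None:
        self.journal.clear()
```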

Page swapping

There are two versions of a page available: the current version, and the backup version. While a checkpoint is in progress there is a third version, the next backup version. The chart below is a complete list of all the states of a page, giving the locations of all of its versions. This analysis does not account for allocation ids.

Fine Print: In this section (as in the rest of this manual) a “page” refers to a particular cda and the collective states of its various versions. The “next older version” of the current version is the next backup version, if any, or the backup version otherwise. The next older version of the next backup version is the backup version.

Legend: Each entry gives the location(s) of the current version followed by a comma, the location(s) of the next backup version, if any, followed by a comma, and the location(s) of the backup version. The following symbols are used:

S - This version is the same as the next older version.

D - in core, dirty, not kernel read only

K - in core, kernel read only (dirty)

C - in core, clean

N - not in core

SD - either the same as the next older version, or in core, dirty, not kernel read only

CN - either in core and clean, or not in core

W - in the working directory

U - in the unmigrated directory

UAJ - in the backup directory (either unmigrated, dataforap, or journal)

AJH - in either the dataforap directory or the journal directory, and also in the home location (allocation count in directory entry)

H - in the home location (allocation count in home a-pot)

The states of a page

D,UAJ-CN

D,H-CN

W-CN,UAJ-CN

W-CN,H-CN

S,U-CN

S,AJH-CN

S,H-CN

SD,K,UAJ-CN

SD,K,H-CN

SD,W-CN,UAJ-CN

SD,W-CN,H-CN

SD,S,AJH-CN

SD,S,H-CN

Complication arises from the fact that a page can change state while a read or write is in progress. Therefore, when a read completes, we look at the state of the page at that time to decide what to do with the data read.

Reads and (migrate) writes to the home area must be locked (using a bit in the outstanding io array, selected on a hash of the cda). Thus, when the completion of a read is processed, you know that the page is still as valid now as when the read took place. The read may have taken place earlier depending on the vagaries of the device driver and the hardware.

Reads and writes to the swap area are synchronized using the directory entry for that page and that swap area. While a devreq to read from the swap area exists, it is linked to the directory entry. While a directory entry exists, we never write to the swaplocs that it refers to. And cleaning a page keeps the page locked in core (using IOCOMPLETECOUNT) while it writes the page and updates the directory entry. Updating or destroying a directory entry aborts any linked devreq.

Disk I/O. The Gnosis Disk I/O system is completely described by the Algol-68 programs in which it is designed. This section serves as a guide to those programs and their philosophy.

As in classical operating systems there is a set of addresses at which the kernel expects to find disk packs which it will adopt as its own if the format is suitable (a strong test!).

Contents of the Disk Packs

The contents of the disk packs can define the complete state of the Gnosis system. In contrast to most other operating systems, the list of processes in the system is kept on the disk when information has been evacuated from core. The areas written on the packs are:

IBM standard disk information

A volume label, IPL program, and VTOC to keep OS-based systems off the rest of the pack. These are the only disk records that are not 4K.

Pack descriptor record {see (pack-disc)}

A 4K block which describes the ranges {(range)} on the pack and indicates which pack set this pack belongs to.

Node pots {see (node-pot)}

Contain images of logical nodes. A 4K block each.

Allocation pots {see (allocation-pot)}

Contain allocation information for pages. A 4K block each.

Pages

4096 bytes of logical page data. A 4K block each.

Swap area directory pages {see (disk-directory)}

Locate logical pages and nodes in the swap area.

Checkpoint header pages {see (checkpoint-header)}

Locate swap area directory pages in the swap area.

Pack Sets

Packs will come in sets. Normally, Gnosis will run with all of a given set of packs mounted.

Each pack has at a fixed location {page 5 on the pack} a block of data that explains the pack to the kernel. This block contains a 64-bit TOD reading from the time that the seed pack was originally created. These TOD readings permanently group packs together into the sets described above.

Ranges

A given pack may hold several ranges of consecutive coded disk addresses for pages and several ranges of consecutive coded node disk addresses. Each range may be of one of several types; see (pack-disc) for details. Typically a pack will not contain more than one range of each type. The provision for more than one anticipates problems when converting a set of Gnosis packs from one type of disk drive to another. A coded disk address is unique only within a pack set.

Multiplexed Ranges

Ranges may be multiplexed. This means that two or more ranges in the set contain identical information. The value of this is a decreased probability of lost data and greater read bandwidth to that data. A multiplexed range is like others, except for a counter indicating there is more than one copy. Multiplexed ranges usually reside on different packs. Multiplexed ranges need not reside on the same device type. Critical ranges from the seed pack are prime candidates for being multiplexed.

Write Verify

Some ranges may be marked as requiring write-verify. This means that a read is performed just after a write while the data is still available in core. {{NI} not currently implemented.}

Subset Mode

Gnosis will run with less than the entire set of packs mounted. Programs touching pages or nodes that are not on the mounted packs will be blocked until the packs are mounted. We may make provision for evacuation of those pages and nodes of a given pack short of terminating Gnosis. Shutting Gnosis down would merely consist of evacuating all mounted packs. Each range descriptor has the TOD value of when that version of the range was last migrated. This is to detect ranges that are potentially out of date.

Missing Range

Running without one member of a multiplexed range is possible with very explicit action on the part of the operator. {In the current version, this occurs automatically, without special operator action.} Special procedures will be required to reintroduce the missing range member.

Cylinder and Track Layout

Each range of consecutive coded disk addresses resides on a set of contiguous tracks. The pages are 4096 bytes long. Pages are on tracks formatted three plus pages per track {for the 3330}. The coded disk address, allocation count, and the flags {e.g., the gratis flag} are held in an (_Allocation pot). Each home page range has an allocation pot that precedes every 1024 {or fewer for the last allocation pot if the range does not contain an exact multiple of 1024} logical pages in the range. Since we cannot switch heads fast enough to avoid missing a page time if all tracks have pages starting in corresponding locations, we put a non-integral number of pages on a track.
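The interleaving of allocation pots and pages can be worked out with simple arithmetic. The sketch below assumes that each allocation pot occupies exactly one 4K block slot immediately before its group of up to 1024 pages, and that offsets are counted in 4K blocks from the start of the range. The function names are hypothetical, and the real placement on a given device may differ.

```python
PAGES_PER_POT = 1024  # from the text: one allocation pot per 1024 logical pages


def pots_in_range(page_count):
    """Number of allocation pots needed; the last pot may cover fewer pages."""
    return -(-page_count // PAGES_PER_POT)  # ceiling division


def block_offset_of_pot(page_index):
    """Block offset (from range start) of the pot describing this page."""
    group = page_index // PAGES_PER_POT
    # Each earlier group contributes its 1024 pages plus one pot.
    return group * (PAGES_PER_POT + 1)


def block_offset_of_page(page_index):
    """Block offset (from range start) of the page itself.

    One pot precedes every group, so the page is shifted by the number of
    pots at or before its own group.
    """
    return page_index + page_index // PAGES_PER_POT + 1


if __name__ == "__main__":
    for page in (0, 1023, 1024, 2047, 2048):
        print(page, block_offset_of_pot(page), block_offset_of_page(page))
```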

With allocation pots, we can do the trick described in (p2,fast-copy).

Tracks of nodes are formatted in the same way as tracks of pages. About 18 nodes fit in a 4096-byte block called a “nodepot”. The coded disk addresses, the allocation counts, the call counts, and the flags are included along with the node image in the nodepot. See (flags) for further information on nodes.

The header on the pack also holds a format type field. The format described here will correspond to format type 1. Other values are reserved for formats not yet designed. These formats may be needed to support IBM 3310, 3370, and other Fixed Block Architecture devices.

Introduction of Packs to Gnosis

When Gnosis starts up and when an unsolicited device end interrupt occurs, indicating that a drive has gone from not ready to ready, the header information is read and each range discovered is added to the range table {(rangetable)}. No two such ranges may overlap. Any pack that overlaps with another {previous} range is rejected.

{LOOSE END --- How to reintroduce absent twin. Can Gnosis run with two seed sets?}

Disk Resident Table Formats

The “Pack Descriptor Record” describes which pack set this pack belongs to, what ranges are contained on the pack, and where they are on the pack. For each range it contains the first and last CDA’s, the address of the first CDA relative to the start of the pack, the number of copies of this range in the pack set, the TOD of the last migration to this range when another copy of this range was not completely migrated, and the type of the range, as follows:

Normal range - contains CDA’s with no special interpretation by the kernel. Page ranges of this type have allocation pots.

Dump range - contains ranges that the kernel may use to take storage dumps when a kernel failure or other event that requires a kernel storage dump occurs. These ranges have allocation pots.

Kernel IPL image - a core image of the kernel that may be loaded by the IPL program. These ranges have allocation pots.

Disk record - a range where the kernel record logic may save its record of stimuli to the kernel. These ranges have allocation pots. See (record).

An allocation pot contains the external flags and allocation counts for 1024 pages. Each entry is 4 bytes long with the first byte being the flags {X'02' is the virtual zero bit indicating that the logical contents of the page are all zero} and the next 3 bytes being the allocation count.
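As an illustration of the entry layout, the following sketch encodes and decodes one 4-byte allocation pot entry. It assumes the 3-byte allocation count is stored big-endian, as is usual on System/370; the function names are hypothetical. Note that 1024 entries of 4 bytes fill exactly one 4K block.

```python
VIRTUAL_ZERO = 0x02      # from the text: X'02' is the virtual zero bit
ENTRY_SIZE = 4           # one flag byte plus a 3-byte allocation count
ENTRIES_PER_POT = 1024   # 1024 * 4 bytes = 4096 bytes, one 4K block


def encode_entry(flags, allocation_count):
    """Build the 4-byte entry: flags first, then the allocation count."""
    if not 0 <= flags <= 0xFF:
        raise ValueError("flags must fit in one byte")
    if not 0 <= allocation_count <= 0xFFFFFF:
        raise ValueError("allocation count must fit in three bytes")
    # Assumption: big-endian byte order for the count.
    return bytes([flags]) + allocation_count.to_bytes(3, "big")


def decode_entry(raw):
    """Return (flags, allocation_count, is_virtual_zero) for one entry."""
    if len(raw) != ENTRY_SIZE:
        raise ValueError("an allocation pot entry is 4 bytes long")
    flags = raw[0]
    allocation_count = int.from_bytes(raw[1:4], "big")
    return flags, allocation_count, bool(flags & VIRTUAL_ZERO)


if __name__ == "__main__":
    entry = encode_entry(VIRTUAL_ZERO, 5)
    print(entry.hex(), decode_entry(entry))
```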

The “Disk Directory Entry” describes the data in the swap areas. It is written on the disk during a checkpoint. It contains the CDA for the entry, whether there is a process in this node, the disk location for the first copy, and the disk location for the second copy. For pages, it also has the flags and allocation count.

The “Checkpoint Header” is permanently assigned to two locations on separate disks. These locations appear in the pack descriptor record as swap areas for a swap area with an ID of X'8000' {the lowest negative number}. These headers are written during checkpoint and contain the two addresses for each disk directory page written as part of the checkpoint.

Disk Routines for Pages and Nodes

Data types

RANGELOC - A combination of a range identifier and an offset within that range. Implemented as a word; the high halfword is a signed index (relative to RANGETABLE) in the range table, and the low halfword is the unsigned offset. Swap area ranges have negative indexes and home area ranges have positive indexes. I don’t know about checkpoint header ranges.

SWAPLOC - A RANGELOC that refers to a swap area range. The range table index is negative.
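The RANGELOC word layout can be sketched as follows. This is a minimal illustration, assuming a 32-bit big-endian word; the function names are hypothetical and the treatment of checkpoint header ranges is not modeled.

```python
import struct


def pack_rangeloc(index, offset):
    """Combine a signed range table index and an unsigned offset into one word.

    High halfword: signed index. Low halfword: unsigned offset.
    """
    return struct.unpack(">I", struct.pack(">hH", index, offset))[0]


def unpack_rangeloc(word):
    """Split a 32-bit RANGELOC word into (signed index, unsigned offset)."""
    return struct.unpack(">hH", struct.pack(">I", word))


def is_swaploc(word):
    """A SWAPLOC is a RANGELOC whose range table index is negative."""
    return unpack_rangeloc(word)[0] < 0


if __name__ == "__main__":
    word = pack_rangeloc(-1, 5)
    print(hex(word), unpack_rangeloc(word), is_swaploc(word))
```

Because the index is the high halfword, the sign bit of the whole word distinguishes swap area locations from home area locations.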

Tables and work areas.

DEVICE extension for disks.

The device dependent segment of the DEVICE block for disks contains queue heads for the DEVREQ queue, a pointer to the list of CCWBLOK’s which have not yet signaled completion {called “active”}, and the list of completed CCWBLOK’s to return to available status when a channel end occurs. Two lists are needed when running under CP and using the DIAGNOSE 28 interface to dynamically modify the channel program. It also contains the pack ID {within the pack set}, the last device address accessed, and the current state of the device for new requests.

CCWBLOK - contains part of the channel program.

The CCWBLOK contains the channel program for disk I/O. It also contains a list pointer to the next CCWBLOK, the address of the DEVREQ for which it was built, the device address, and sector number for the transfer.

REQUEST - describes one request.

The REQUEST block describes one logical request to the disk I/O system. It contains a pointer to the list of DEVREQ’s that describe devices on which this request may be satisfied, a counter of how many of these DEVREQ’s must complete before the request is to be considered complete, an area to hold the CDA, flags, and allocation count, the address of the procedure to call when the request has completed {correctly or not}, and a parameter to pass to that procedure. The REQUEST block also contains the core table offset of the page assigned for the transfer and a flag to indicate what type of request this is {page read, directory write, checkpoint header write, migrate read, or migrate write}.

The field REQUEST.REQDIRENTRY is managed by the device independent paging system (specifically GDIRECT).

See also (driver-locks) and (paging-locks) for more information on the use of specific fields.

DEVREQ - describes device for REQUEST.

The DEVREQ describes which device and where on that device a request may be satisfied. The DEVREQ contains a pointer to the next DEVREQ in the chain {rooted in the REQUEST}, a pointer to the REQUEST, a pointer to the DEVICE, the address on that device, two queue links to link it into the DEVICE DEVREQ chain, and a status field which indicates where it is in the selection process or if it has completed - whether the completion was successful or not.

The fields DEVREQFLAGS.DEVREQSWAPAREA and DEVREQSWAPLOC are managed by the device independent paging system (specifically GDIRECT).

See also (driver-locks) and (paging-locks) for more information on the use of specific fields.

RANGETABLE - describes mounted ranges.

The RANGETABLE describes the ranges that are currently mounted on the system. It is divided into two segments, one for normal ranges and the other for swap ranges. The segment for swap ranges is indexed with negative subscripts. Each entry contains the starting and ending CDA’s in the range, a pointer to the first RANGELIST element for the range, TOD for the last partial migration for this range, the number of copies of this range for normal ranges, current or backup swap area indicator for swap ranges, and some flags.

RANGELIST - describes which devices a range is on.

The RANGELIST describes which devices a range is on. It contains a pointer to the DEVICE, a pointer to the next RANGELIST entry for this range, the page offset on the device to the start of the range, and some flags.

The swap area directory.

There are four swap area directories, named below. Each directory contains the heads of a number of chains of directory blocks {DIRENTRY’s} which are hashed into for directory lookup.

WORKINGDIRECTORY holds entries for the current swap area.

UNMIGRATEDDIRECTORY holds entries for the backup swap area that haven’t been migrated.

DATAFORAPDIRECTORY holds page entries for the backup swap area that have had their pages migrated but have not yet had their allocation pots updated.

JOURNALDIRECTORY holds entries for the backup swap area that have been completely migrated.

Each DIRENTRY contains the CDA which it describes, some flags, the allocation count for pages, a pointer to the next DIRENTRY in the hash chain, and the two locations of the CDA. The first location is the CORETABLE entry for a nodepot in core or the swap area location for a page on disk. The second location is the second swap area location for pages or nodes that are duplicated in the swap area or zero.

If I/O is in progress involving this DIRENTRY, the DIRENTRY is locked (a bit is set) and the DEVREQs are linked to the DIRENTRY. If the DIRENTRY is invalidated, the DEVREQs will be aborted. The DEVREQs are linked using the DIRENTFIRST and DIRENTSECOND fields to save space. The displaced values from those fields are saved in the DEVREQs.

The cylinder table.

The cylinder table contains one entry for each page on a cylinder {plus one dummy entry at the end}. It is used to translate from page number on the cylinder to the HHR portion of the hardware address for the page. It contains the head number, the record number, the sector number, the “radial” {used for slot sorting requests on the same cylinder}, and the number of bytes in the first segment. It is assembled as a part of the module GCCWBULD.

The bad pages table.

The BADDEVLOC table keeps track of the last 16 disk pages that could not be read. Each entry contains a DEVICE pointer and the offset on the device of a page that could not be read. These table entries are chained from 4 hash chain heads.
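A table of this shape might be sketched as below. The text gives only the sizes (16 entries, 4 hash chain heads); the hash function, the replacement of the oldest entry, and all names in the sketch are assumptions made for illustration.

```python
from collections import deque

TABLE_SIZE = 16   # from the text: the last 16 unreadable pages are remembered
CHAIN_HEADS = 4   # from the text: entries are chained from 4 hash chain heads


class BadDevLocTable:
    def __init__(self):
        self.recent = deque()                      # oldest entry on the left
        self.chains = [[] for _ in range(CHAIN_HEADS)]

    def _chain(self, device, offset):
        # Hypothetical hash: the real kernel's hash is not given in the text.
        return self.chains[offset % CHAIN_HEADS]

    def record(self, device, offset):
        """Remember that (device, offset) could not be read."""
        entry = (device, offset)
        if entry in self._chain(device, offset):
            return
        if len(self.recent) == TABLE_SIZE:
            # Assumption: the oldest entry is reused when the table is full.
            old = self.recent.popleft()
            self._chain(*old).remove(old)
        self.recent.append(entry)
        self._chain(device, offset).append(entry)

    def is_bad(self, device, offset):
        return (device, offset) in self._chain(device, offset)


if __name__ == "__main__":
    table = BadDevLocTable()
    for page in range(17):
        table.record("dev0", page)
    print(table.is_bad("dev0", 0), table.is_bad("dev0", 16))
```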

Basic Concepts and Control Flow

Requests for disk I/O come from calls to GET {see (get)}, requests via the migrate key {see (p2,migrate1)}, the journalize page key {see (p2,journalize-page)}, and internally generated requests for taking checkpoints {see (checkpoint)}, restart and cleaning dirty pages. The control flow for GET is illustrative of the facilities used by all of these entry points.

Overview of Processing in GET

When GET is called, it must find where the desired CDA is located. It calls “look up in directory” (GDILOOK) to see if the CDA is in the current swap area, or the backup swap area, or neither. (A nodepot which is to be written to the current swap area is considered a part of that swap area.) If the page or node is in a swap area, it will be read from there (GDILOOK builds a REQUEST).

If a page or node is in the backup swap area and has been migrated, we could additionally try to read it from the home area. We don’t currently do this.

If a page is virtually zero, no I/O is necessary. If a node is in a swap area nodepot, the pot may already be in memory.

If the page or node is not in either swap area, it must be read from the home location. GRTHOMEP and GRTHOMEN consult the range table to find the devloc (device and offset on device) for the page or node, respectively. In the case of a page we also must obtain the allocation data (allocation id and flags). This may require reading the allocation pot from disk.

Conceptually, GRTHOMEP and GRTHOMEN return a list of devlocs. The first is returned immediately; the others are returned by successive calls to GRTNEXT.

The home nodepot may already be in memory.

GRTHOMEP and GRTHOMEN also check that the disk location is mounted and readable.
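The lookup order described so far can be summarized in a short sketch. It is a simplification under stated assumptions: directories are modeled as dictionaries, home lookup as a function returning a list of devlocs, and the in-memory nodepot and allocation pot cases are left out. The function name locate and the dictionary keys are hypothetical.

```python
def locate(cda, current_swap, backup_swap, home_devlocs):
    """Return (source, detail) for a CDA, following the order in the text:
    current swap area, then backup swap area, then home location."""
    for name, directory in (("current swap area", current_swap),
                            ("backup swap area", backup_swap)):
        entry = directory.get(cda)
        if entry is not None:
            if entry.get("virtual_zero"):
                # A virtually zero page needs no disk read.
                return ("no I/O needed", "page is virtually zero")
            return (name, entry["swaploc"])
    # Not in either swap area: consult the range table (modeled by a callback).
    devlocs = home_devlocs(cda)
    if not devlocs:
        # The home range is not mounted, so the requester must wait.
        return ("blocked", "home location not mounted")
    return ("home", devlocs)


if __name__ == "__main__":
    current = {1: {"swaploc": 0xFFFF0005}}
    backup = {2: {"virtual_zero": True}}
    home = lambda cda: [("dev0", 40)] if cda == 3 else []
    for cda in (1, 2, 3, 4):
        print(cda, locate(cda, current, backup, home))
```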

If a home location must be read, GETLOCK checks the “outstanding io” array to see if there is already a request for a disk block with the same address as the current one. If there is, it is assumed that it is a request for the same page, and the actor is enqueued on the I/O wait queue for that page. {If there is a hash collision, the request for the CDA will be repeated when the first I/O request ends.}
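The “outstanding io” array behaves like a small table of lock bits selected by a hash. The following sketch shows the idea, including the effect of a hash collision; the array size, the hash, and the names are assumptions, since the text does not give them.

```python
class OutstandingIO:
    """A bit array used as a hashed lock table for home-area disk addresses."""

    def __init__(self, bits=64):
        # Hypothetical size; the real array size is not stated in the text.
        self.bits = [False] * bits

    def _slot(self, disk_address):
        # Hypothetical hash: address modulo the array size.
        return disk_address % len(self.bits)

    def try_lock(self, disk_address):
        """Set the bit; return False if a request with the same hash exists."""
        slot = self._slot(disk_address)
        if self.bits[slot]:
            return False  # caller waits on the I/O wait queue and retries later
        self.bits[slot] = True
        return True

    def unlock(self, disk_address):
        """Turn the bit off when the request ends."""
        self.bits[self._slot(disk_address)] = False


if __name__ == "__main__":
    locks = OutstandingIO()
    print(locks.try_lock(5))    # first request for address 5
    print(locks.try_lock(69))   # different address, same slot: must wait
    locks.unlock(5)
    print(locks.try_lock(69))   # repeated after the first request ends
```

A colliding request is treated exactly like a duplicate request: it waits, and it is simply repeated once the first request ends.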

GET then acquires a REQUEST block and formats it, and loops on the process of acquiring a DEVREQ block, formatting it, and calling GRTNEXT to get the next device and location on device. If any request for a REQUEST block or a DEVREQ block fails, the actor is placed on the “no io req blocks” queue. If all goes well, the actor is placed on one of the I/O wait queues {based on a hash} and the I/O operation is enqueued with a call to “enq request” in GDSKDVR.

“Enq request” examines the DEVREQ’s associated with the REQUEST to find the device most likely to respond fastest to the REQUEST. If DEVENQSTATE=start “start device” is called to start the device. If it returns “don’t continue”, processing stops. If DEVENQSTATE=runningadd, “enq request” will attempt to add this REQUEST after the currently executing REQUEST via a call to “append request” in GCCWBULD. If DEVENQSTATE=notready, the DEVREQ will be marked “device not available”. Otherwise the REQUEST will be enqueued on all the device I/O queues that might satisfy it.

“Start device” first attempts to build the CCW’s for the REQUEST via a call to “build ccw’s” in GCCWBULD. If they were successfully built, it calls “start one device” for each online path to the device until the operation is started, an error occurs, or no more paths exist. If a busy condition is encountered, the device is set up to restart the operation when the busy condition is cleared. If the CCWs can not be built (no CCWBLOKS, no page frame, no space for swap directory, device preempted for clean) “start device” calls GCODIDNT to re-queue the request via check request done and the ENQUEUEREQUESTQUEUE and returns “don’t continue” to its caller.

“Build ccw’s” builds the CCW’s for the REQUEST in a CCWBLOK and sets “active list” in the device block to point to the CCWBLOK. It will return failure if it cannot get a CCWBLOK, or if it cannot get a page frame via call to “get page frame” in GSPACE for an input operation.

When the I/O interrupt comes in, signaling completion of the REQUEST, the first level I/O interrupt handler IOINTER passes control to “io interrupt disk active” in GINTDSK. It checks the status in the CSW, invokes error recovery procedures if the status is not what is expected, and marks the DEVREQ with the completion status of the operation. If channel end was indicated in the status, it sets up the device block to call “restart disk reads” when IOINTER attempts to restart the channel and calls “check request done” to see if the procedure in the REQUEST for completed requests should be called.

When a REQUEST built by GET is completed, “get ended” in GET will be called. “Get ended” returns the REQUEST and DEVREQ’s to the available list, turns off the bit in the “outstanding io” array, and calls “move domains to cpu queue” {see (enqmvcpu)} to run the domains waiting for this page. If a page or nodepot was read, it places it in the correct hash chain.

Overview of Processing to Clean Pages

When GSPACE determines that a page frame is old enough to need cleaning, it calls “add page to clean list” in GCLEANL. This routine attempts to add the page to the list of pages that are to be cleaned. If the page is all zeros, it calls “add virtual zero page to directory” and returns with the page marked clean. Otherwise it returns an indication of whether a page clean operation should be performed.

When GET is ready to return to its caller, it checks to see if a page clean operation is needed. If one is needed, GET calls “start page clean” in GDSKDVR. “Start page clean” repeatedly calls “first best swap out device” and “next best swap out device” in GSWAPA to get the available swap devices in order of preference. For each device returned, start page clean attempts to start a page cleaning operation. This process stops when the available devices are exhausted or when the number of pages needing cleaning falls below the threshold.

The second way that a page cleaning operation can be started is when “restart disk reads” in GDSKDVR discovers that there are no outstanding requests for a swap device and that page cleaning is needed. It will cause “restart clean pages” to be called to perform the page clean operation.

When either “restart page clean” gets control or “start page clean” attempts to start an operation, they call “build clean pages cp” in GCLEANL. This routine causes a channel program to be built that will clean a block of several pages onto contiguous disk pages. The disk pages are allocated by calls to “next swap area slot” in GSWAPA.

A page that is duplexed will be written into the swap area twice {on different devices}. All nodepots will be written twice as well. When the channel program for the first {or only} write is built, the page is marked “clean” by resetting the “dirty” bit in the CPU hardware key field. The page will not be stolen until the clean operation is finished, because the field “io complete count” in the core table entry for the page is not zero. Resetting the “dirty” bit at this time allows modifications to the page that occur after the clean is started to be properly recognized.

When the completion interrupt from the clean channel program comes in, the routine “io interrupt clean pages” in GINTDSK will be called. If the channel program ended normally, the io complete count will be decremented, and the page will then become a candidate for selection by the LRU algorithm in GSPACE. If there was an error cleaning the page, “reclean pages” in GCOMDSK will be called to set the dirty bit in the CPU hardware key field and cause cleaning to be tried again.
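The reason for resetting the dirty bit when the write is built, rather than when it completes, can be seen in a small model. The sketch below is illustrative only: the class and function names are hypothetical, the count is assumed to be raised once per write started, and it is assumed that the count is also released when a write fails.

```python
class PageFrame:
    def __init__(self):
        self.dirty = True            # stands in for the hardware change bit
        self.io_complete_count = 0   # nonzero while a clean is in progress


def start_clean(frame, writes=1):
    """Channel program built: mark the page clean and pin it in core."""
    frame.dirty = False
    frame.io_complete_count += writes  # assumption: one count per write


def store(frame):
    """A store into the page sets the change bit, even during a clean."""
    frame.dirty = True


def clean_ended(frame, ok):
    """Completion interrupt for one write of the page."""
    frame.io_complete_count -= 1       # assumption: released on error as well
    if not ok:
        frame.dirty = True             # "reclean pages": try again later


def can_steal(frame):
    return frame.io_complete_count == 0


if __name__ == "__main__":
    frame = PageFrame()
    start_clean(frame)
    store(frame)                       # modified while the write is in flight
    clean_ended(frame, ok=True)
    print(frame.dirty, can_steal(frame))
```

In the usage above the page ends up dirty again after a successful clean, because the modification happened after the dirty bit was reset, so the page will be cleaned once more before its disk copy is trusted.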

Guide to the Modules of the Disk I/O System

This is a large complex program broken into modules based around access and updates of certain tables. The basic philosophy is late binding. For example, a page frame is not allocated or a channel program built until it is probable that it can be immediately started. If a busy condition is received, the page will be returned to the free pool and the channel program released until the busy condition is cleared. The disk interface uses a technique of dynamic chaining of channel programs with Program Controlled Interrupts to attempt to reduce the number of SIOF instructions that must be issued.

The main interface module GET provides the routines “getxxxx” {see (get)}. See (get-logic) for an overview of processing.

The space module GSPACE allocates page and node frames in main memory. It also includes the logic for cleaning nodes and returning pages to the free pool. See (scavitem) for the LRU algorithm. See (getpagex) and (getnodex) for subroutine specifications.

ISPACE - Initialization

GSPCLNOD - Build a node pot out of NODESMARKEDFORCLEANING

GSPDETPG - Remove all access to a page

Used when a page frame is stolen.

The bad disk blocks module GBADPAGE keeps track of unreadable disk blocks.

The range table module GRANGET maintains the table of what ranges are mounted on which devices. See (granget) and (range).

The swap area directory module GDIRECT maintains the directory of which pages and nodes are where in the current and backup swap areas. See (swaparea). It also implements the migrate code 2 {(p2,migrate2)} routine “return backup directory entries” and the routines to update the allocation pots in core as part of migration. It provides routines to build new entries in the directory, modify existing entries with new swap area location information, remove entries from the directory, look up entries in the directory, and read all entries in the directory.

The clean list module GCLEANL maintains a list of the page frames that need cleaning so they may later be superseded by other pages from the disk. It provides entries to add pages to the list, build a channel program to clean a set of pages, prevent new entries from being added to the list, and force pages to be cleaned first during checkpoint.

The swap area allocation module GSWAPA allocates space in the current swap area {see (swaparea)} for cleaning page frames. It provides entries to get device pointers in order of suitability for swapping, modify the swap space available as packs are mounted and dismounted, and allocate space in the swap areas.

The start disk I/O module GDSKDVR performs all disk I/O operations. It contains all the “restart procs” for the disks.

The disk interrupt module GINTDSK processes all interrupts from the disks.

The build ccw’s module GCCWBULD builds all the disk channel programs, except for the restart after error correction channel program, which is built in GCOMDSK.

VXA- A hack which would permit suspend/resume to be used and allow “unlimited” pre-fetch of CCWs is to change the APPENDPAGEREAD routine to (1) Use a full seek instead of seek head on the same cylinder (builds resumable channel programs for all but read next record. Check set sector though!) and (2) in the read next record case, build a SEARCH ID EQ, recognizably bad CCW, read sequence in place of the current read CCW. This sequence would fail if the RSCH was executed late enough to be treated as another SSCH by the control unit. Error recovery would be responsible for re-starting those page requests.

The common disk routines module GCOMDSK contains the routines used by both GDSKDVR and GINTDSK. It also holds the device error recovery logic. See also (gcomdsk-module).

The checkpoint module GCKPT performs the checkpoint logic. See (gckpt) and (checkpoint-logic). It also contains the code to implement the “force checkpoint” key {(p2,force-checkpoint)}.

The restart module GRESTART restarts the system from a checkpoint after an IPL. See also (grestart-module).

The migrate module GMIGRATE performs migrations under direction of the external migrator. See (gmigrate-module).

The journalize page module GJOURNAL implements the journalize page key {(p2,journalize-page)}. See (journalizing-logic).

IJOURNAL - Initialization for the GJOURNAL module.

Known Bugs in the Disk I/O System

I/O errors are not handled during migrating.

Migration does not stall when a swap area is missing.

One missing swap area can stall migration.

There is no architecture {or implementation} for I/O errors.

VXA- Locking protocols for the MP I/O system

There are two kinds of locks used in the I/O system.

Spin locks are locks where the locker re-tests the lock until it is available. No routine may attempt to acquire another lock while it holds a spin lock.

Defer locks are locks where the locker finds some alternative action when the lock is held by another CPU.
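The difference between the two kinds of lock can be sketched with ordinary threading primitives. This is an analogy, not the kernel's implementation: Python's threading.Lock with a non-blocking acquire stands in for a test-and-set style instruction, and the class and function names are hypothetical.

```python
import threading


class SpinLock:
    """The locker re-tests the lock until it is available."""

    def __init__(self):
        self._flag = threading.Lock()

    def acquire(self):
        while not self._flag.acquire(blocking=False):
            pass  # spin: try again immediately

    def release(self):
        self._flag.release()


class DeferLock:
    """The locker never waits; it is told whether it got the lock."""

    def __init__(self):
        self._flag = threading.Lock()

    def try_acquire(self):
        return self._flag.acquire(blocking=False)

    def release(self):
        self._flag.release()


def with_defer_lock(lock, action, alternative):
    """Run action under the lock, or the alternative if the lock is held."""
    if lock.try_acquire():
        try:
            return action()
        finally:
            lock.release()
    return alternative()


if __name__ == "__main__":
    lock = DeferLock()
    print(with_defer_lock(lock, lambda: "did the work", lambda: "deferred"))
```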

Paging Device Driver locks

The commonly used resource at the device driver level is the DEVICE block. It has two locks as follows:

The DEVICEQLOCK controls access to the DEVREQ doubly linked list chained from the device block. It is a spin lock. It must be held when adding or removing DEVREQs to/from the chain or when searching the chain. A routine holding this lock must not attempt to obtain any other locks.

The rest of the fields of the device block are controlled by the DEVICELOCK. This lock is a defer lock. This lock must be held before issuing I/O instructions to the device (subchannel). This rule prevents the race condition that could occur if one CPU were to be starting an I/O operation to the device while another CPU received an ALERT, device end interrupt. Since the second CPU will not do the TSCH until it holds the lock, the first’s SSCH will not actually start until the pack id can be verified. This lock can take on three values.

Available - Any CPU may gain this lock

Locked - A CPU has exclusive access to the DEVICE block.

Interrupt pending - A CPU was notified of an interrupt from the device while another CPU held the DEVICELOCK. The holder of the lock is responsible for processing the interrupt.
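A three-valued defer lock of this kind might be modeled as follows. The sketch assumes the state changes are made with compare and swap (simulated here with a small guard lock) and that the releasing CPU learns from the old value whether it must process an interrupt. All names are hypothetical.

```python
import threading

AVAILABLE, LOCKED, INTERRUPT_PENDING = "available", "locked", "interrupt pending"


class DeviceLock:
    def __init__(self):
        self.state = AVAILABLE
        self._guard = threading.Lock()  # simulates the atomicity of CS

    def _compare_and_swap(self, expected, new):
        with self._guard:
            if self.state != expected:
                return False
            self.state = new
            return True

    def try_lock(self):
        """Defer lock: succeed only if the lock is available."""
        return self._compare_and_swap(AVAILABLE, LOCKED)

    def note_interrupt(self):
        """Called by a CPU notified of an interrupt from the device.

        Returns True if this CPU obtained the lock and must process the
        interrupt itself, False if the current holder has been left a note.
        """
        while True:
            if self.try_lock():
                return True
            if self._compare_and_swap(LOCKED, INTERRUPT_PENDING):
                return False
            if self.state == INTERRUPT_PENDING:
                return False

    def unlock(self):
        """Release; True means an interrupt is pending for the old holder."""
        with self._guard:
            old, self.state = self.state, AVAILABLE
        return old == INTERRUPT_PENDING


if __name__ == "__main__":
    lock = DeviceLock()
    lock.try_lock()                 # CPU A holds the device
    print(lock.note_interrupt())    # CPU B sees an interrupt: leaves a note
    print(lock.unlock())            # CPU A must now process the interrupt
```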

There are certain fields in the REQUEST and DEVREQ that are used at the device driver level. They are controlled as follows:

The following fields are not modified at the device driver level, so no synchronization is necessary at that level:

REQNEXT, REQDEVREQS, REQDONEPROC, REQDONEPARM, REQPOTADDRESS, REQCDA, REQALLOCCNT, REQFLAGS, REQTYPE. DEVREQNEXT, DEVREQREQUEST, DEVREQDEVICE, DEVREQADDRESS

REQCOMPLETIONCOUNT

This field is controlled with compare and swap. Since it controls how many DEVREQs move from pending to selected, the logic is as follows:

In FINDBESTNEXTREQUEST (caller must hold DEVICELOCK):

Get the DEVICEQLOCK

Find the best next request by searching DEVREQ list

Remove selected DEVREQ from device queue and change its status to SELECTED.

Release the DEVICEQLOCK

Use CS to decrement REQCOMPLETIONCOUNT (but not below zero)

IF old value > 0 THEN

IF new value = 0 THEN

remove other pending DEVREQs for request from their device queues.

RETURN selected DEVREQ

ELSE (old value = 0, some other CPU found it)

Use CS to change SELECTED to OFFQUEUE, IF status not SELECTED

THEN IF status = ABORTATEND

THEN status := ABORTED

CALL check request done

GOTO find best next request again
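The “decrement but not below zero” step is a standard compare and swap retry loop. The sketch below shows it in isolation, with the CS instruction simulated by a small guard lock. The class and function names and the outcome strings are hypothetical; only the old value > 0 and new value = 0 tests come from the logic above.

```python
import threading


class AtomicCounter:
    """A counter changed only through compare and swap (simulated)."""

    def __init__(self, value):
        self._value = value
        self._guard = threading.Lock()

    def load(self):
        return self._value

    def compare_and_swap(self, expected, new):
        with self._guard:
            if self._value != expected:
                return False
            self._value = new
            return True


def decrement_not_below_zero(counter):
    """Return (old value, new value); the counter never goes negative."""
    while True:
        old = counter.load()
        if old == 0:
            return 0, 0
        if counter.compare_and_swap(old, old - 1):
            return old, old - 1
        # Another CPU changed the counter between load and CS: retry.


def selection_outcome(counter):
    old, new = decrement_not_below_zero(counter)
    if old > 0:
        if new == 0:
            return "use this DEVREQ and dequeue the other pending DEVREQs"
        return "use this DEVREQ"
    return "some other CPU found it"


if __name__ == "__main__":
    count = AtomicCounter(1)
    print(selection_outcome(count))
    print(selection_outcome(count))
```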

In GDDENQ or GDDREENQ (caller must not hold any device or deviceq locks):

Put all OFFQUEUE DEVREQs on their device queues # Gets and releases DEVICEQLOCKs #

DO # Search until we have pending devreq and both locks #

Find the “best” available device for request by scanning the devreqs from the request for pending devreqs whose devices are DEVICELOCK=unlocked and available. (N.B. Requires assurance that request is not returned while scanning.) Use the “best” of those devices.

IF no device found THEN

call check request done

RETURN

Get the DEVICELOCK

EXITLOOP if DEVICELOCK was obtained

ENDDO

# Now hold DEVICELOCK for selected device #

Get DEVICEQLOCK

EXITLOOP if our devreq still pending

Release DEVICEQLOCK

Release the DEVICELOCK

IF old lock status was interrupt pending THEN

RETURN via DOIOINT (N.B. GCOREENQ needs recursion suppression)

RETURN if REQCOMPLETIONCOUNT = 0

ENDDO # Have pending devreq, DEVICELOCK and DEVICEQLOCK #

Remove selected DEVREQ from device queue and change its status to SELECTED.

Use CS to decrement devreq’s REQCOMPLETIONCOUNT (but not below zero)

Release the DEVICEQLOCK

IF old value of REQCOMPLETIONCOUNT > 0 THEN

IF new value = 0 THEN

remove other pending DEVREQs for request from their device queues. # Gets and releases DEVICEQLOCKs #

Start selected DEVREQ

IF device was not started THEN

IF DEVREQSTATUS \= SELECTED THEN

IF DEVREQSTATUS = ABORTATEND THEN DEVREQSTATUS := ABORTED

Release the DEVICELOCK

IF old lock status was interrupt pending THEN

call check request done

RETURN via DOIOINT (N.B. GCOREENQ needs recursion suppression)

GOTO put all offqueue devreqs... at top

Use CS to change SELECTED to OFFQUEUE

IF status not SELECTED

THEN IF status = ABORTATEND

THEN status := ABORTED

call check request done

ELSE (old value = 0, some other CPU found it)

Use CS to change SELECTED to OFFQUEUE

IF status not SELECTED

THEN IF status = ABORTATEND

THEN status := ABORTED

call check request done

Release the DEVICELOCK

IF old lock status was interrupt pending THEN

call check request done

RETURN via DOIOINT (N.B. GCOREENQ needs recursion suppression)

REQENQTOD

This field is used to calculate page service times. It is set to the current TOD value, with a block-concurrent STCK instruction, at the time the REQUEST is passed to the device driver level. No other synchronization is necessary.

REQPAGECTE

The following DEVREQ fields are changed only when the associated DEVICELOCK is held.

DEVREQADDRESSONDEVICE (?)

DEVREQFLAGS - DEVREQSECONDTRY bit

The following DEVREQ fields are changed only when the associated DEVICEQLOCK is held.

DEVREQNEXTIO and DEVREQPREVIO

The controls on the DEVREQSTATUS field depend on its current value.

The transition table for this field is:

OFFQUEUE --> PENDING, NODEVICE, ABORTED

Must hold the DEVICEQLOCK.

PENDING --> SELECTED, OFFQUEUE, NODEVICE, ABORTED

Must hold the DEVICEQLOCK.

SELECTED --> COMPLETE, NODEVICE, PERMERROR, OFFQUEUE, ABORTATEND

Must hold the DEVICELOCK.

ABORTATEND --> ABORTED

Must hold the DEVICELOCK.
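The transition table lends itself to a table-driven check. The following sketch encodes the four rows above together with the lock each requires; it is an illustration, not the kernel's code. The special handling of aborts described in this section is not modeled, and the names TRANSITIONS and check_transition are hypothetical.

```python
# old status -> (allowed new statuses, lock that must be held), from the table.
TRANSITIONS = {
    "OFFQUEUE": ({"PENDING", "NODEVICE", "ABORTED"}, "DEVICEQLOCK"),
    "PENDING": ({"SELECTED", "OFFQUEUE", "NODEVICE", "ABORTED"}, "DEVICEQLOCK"),
    "SELECTED": ({"COMPLETE", "NODEVICE", "PERMERROR", "OFFQUEUE", "ABORTATEND"},
                 "DEVICELOCK"),
    "ABORTATEND": ({"ABORTED"}, "DEVICELOCK"),
}


def check_transition(old, new, held_locks):
    """Return the new status if the change is listed and the lock is held."""
    if old not in TRANSITIONS:
        raise ValueError(f"no transitions are listed from {old}")
    targets, required_lock = TRANSITIONS[old]
    if new not in targets:
        raise ValueError(f"{old} --> {new} is not in the table")
    if required_lock not in held_locks:
        raise PermissionError(f"{old} --> {new} requires {required_lock}")
    return new


if __name__ == "__main__":
    status = "OFFQUEUE"
    status = check_transition(status, "PENDING", {"DEVICEQLOCK"})
    status = check_transition(status, "SELECTED", {"DEVICEQLOCK", "DEVICELOCK"})
    status = check_transition(status, "COMPLETE", {"DEVICELOCK"})
    print(status)
```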

Note that the transition to ABORTED is special. The logic is:

gddabtdr: CASE status IN

OFFQUEUE: Get DEVICEQLOCK

IF status PENDING THEN

release DEVICEQLOCK; GOTO gddabtdr

status = ABORTED

release DEVICEQLOCK; CALL check request done

PENDING: Get DEVICEQLOCK

IF status \= PENDING THEN

release DEVICEQLOCK; GOTO gddabtdr

remove from queue; status = ABORTED

release DEVICEQLOCK; CALL check request done

SELECTED: use CS to change SELECTED to ABORTED, IF status was \= SELECTED GOTO gddabtdr

OUT SKIP # No abort needed #

Device Independent Paging Subsystem locks

Ensuring Pages, Nodes, Nodepots, Allocation Pots do not have two copies in memory.

This section assumes two possibilities for hash chain locks. (1) There is some lock a reader gets to lock out a writer, and (2) The algorithms of the readers and writers are such that only writers need to get a lock. (Readers always complete and always find an entry if it is there.)

Nodes

Search hash chain - If found return node

Find the correct node pot - If not in memory see node pot.

Lock hash chain for write, search (if found return node), move node to item space (beware: possibly a long operation), update hash chain, unlock hash chain.

Pages

Search hash chain - If found return page
While still holding r/o lock (or get r/w lock and search again) update OUTSTANDING I/O for CDA. If OUTSTANDINGIO was on, release chain lock and serve queue.
Find disk location for page. If you need to get an allocation pot, release OUTSTANDINGIO, serve queue, and do allocation pot logic.
Queue I/O

Nodepots and Allocation Pots

Search hash chain (on pot address) - If found return pot
While still holding r/o lock (or get r/w lock and search again) update OUTSTANDING I/O for pot address. If OUTSTANDINGIO was on, release chain lock and serve queue.
Queue I/O

I/O end (pages and pots)

Lock hash chain (r/w), add entry, and unlock.
Reset OUTSTANDINGIO and serve queue

Ensuring that blocked domains are on the queue before it is served by the unblocker.

The NOIORQBS (No I/O request (and devreq) blocks) queue

There will be a single spin lock which permits modification of this queue, AND the two chains of available blocks. The logic (which requires these routines get the ACTOR pointer):

Get (REQUEST/DEVREQ)

Get lock

IF block available then dequeue it, release lock and return it.

ELSE enqueue actor on queue, release lock and return.

Free REQUEST

get lock, queue block on free chain, serve queue, release lock, return

Free DEVREQ (or REQUEST without serving queue)

get lock, queue block on free chain, release lock, return
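
The three routines above can be sketched as follows. This is a simplified model under stated assumptions: BlockPool and its members are hypothetical names, a Python lock stands in for the spin lock, a single free chain stands in for the two chains of available blocks, and "serve queue" is modeled as moving every waiting actor to a runnable list.

```python
import threading
from collections import deque


class BlockPool:
    """Hypothetical model: one lock guards the free chain and the NOIORQBS queue."""

    def __init__(self, blocks):
        self.lock = threading.Lock()  # stands in for the single spin lock
        self.free = deque(blocks)     # chain of available blocks
        self.waiters = deque()        # the NOIORQBS queue of blocked actors
        self.runnable = []            # actors released by "serve queue"

    def get(self, actor):
        """Return a block, or None after enqueueing the actor on the wait queue."""
        with self.lock:
            if self.free:
                return self.free.popleft()
            self.waiters.append(actor)
            return None

    def free_and_serve(self, block):
        """Free REQUEST: queue the block and serve the queue under the same lock."""
        with self.lock:
            self.free.append(block)
            while self.waiters:
                self.runnable.append(self.waiters.popleft())

    def free_only(self, block):
        """Free DEVREQ: queue the block without serving the queue."""
        with self.lock:
            self.free.append(block)
```

Because the test for an available block and the enqueue of the actor happen under one lock, a block cannot be freed and the queue served between the two, so no actor is left on the queue after the serve that should have released it.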

The MIGRTCZ (Migrate transit count zero) queue

Note that the counter is maintained with compare and swap

Test the counter

If not zero, get queue lock, put actor on queue, test counter

If counter zero serve queue

release queue lock
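
A minimal sketch of the test, enqueue, re-test pattern above. The names TransitCountQueue, wait_for_zero and io_done are hypothetical, a Python lock stands in for compare and swap, and it is an assumption of this sketch (not stated above) that whoever decrements the counter to zero also serves the queue.

```python
import threading
from collections import deque


class TransitCountQueue:
    """Hypothetical model of the MIGRTCZ queue and its transit counter."""

    def __init__(self, count):
        self.count = count                  # the migrate transit count
        self.count_lock = threading.Lock()  # stands in for compare and swap
        self.queue_lock = threading.Lock()
        self.queue = deque()                # actors waiting for the count to reach zero
        self.served = []                    # actors made runnable by "serve queue"

    def _serve(self):
        # The caller holds queue_lock.
        while self.queue:
            self.served.append(self.queue.popleft())

    def wait_for_zero(self, actor):
        """Return True if the actor may proceed at once, False if it was queued."""
        if self.count == 0:
            return True
        with self.queue_lock:
            self.queue.append(actor)
            if self.count == 0:  # re-test: the count may have dropped meanwhile
                self._serve()
        return False

    def io_done(self):
        """Assumed counterpart: decrement, and serve the queue on reaching zero."""
        with self.count_lock:
            self.count -= 1
            now_zero = self.count == 0
        if now_zero:
            with self.queue_lock:
                self._serve()
```

The second test of the counter, made while holding the queue lock, is what ensures the actor is on the queue before the queue is served.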

Common Tables Used by the I/O System

Device Table

DSECT DEVICE, defined in DEVICE MACRO

There is one DEVICE block for each device known to the system. DEVICE blocks for like device types are assembled in contiguous locations so as to facilitate searching all devices of a given type. The DEVICE blocks are built by assembling the I/O configuration module GIOGEN. The basic DEVICE block contains the following fields:

DEVRESTARTPROC - Pointer to the procedure to call to restart the device.

DEVINTERRUPTPROC - Procedure to call when an I/O interrupt occurs for the device.

DEVADRS - Up to four addresses by which the device can be reached.

Following these common fields, there is the device dependent segment. These are defined separately for each device type.

Channel Table

There is one channel table in the system with one entry per channel. Each entry describes the lowest address on the channel and the number of addresses between the lowest address and the highest address. It is used {in conjunction with the device pointer table} to find the DEVICE block for a device given the device address. It also contains a “restart” bit, which signals that the channel needs to be restarted. The channel table is built by assembling the I/O configuration module GIOGEN.

Device Pointer Table

There is one device pointer table per channel in the system. There is one entry for each device address defined between the lowest and the highest address in the channel table entry. The device pointer table contains the address of the DEVICE block and the index into the control unit busy bit array. The device pointer tables are built by assembling the I/O configuration module GIOGEN.
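
The lookup from device address to DEVICE block can be sketched like this. The layout is an assumption made for illustration: ChannelEntry and find_device are hypothetical names, the channel number is taken to be the high byte of the device address, and "span" is taken to be the highest address minus the lowest.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ChannelEntry:
    lowest: int  # lowest device address on the channel
    span: int    # highest address minus lowest address (assumed meaning)
    # Device pointer table: (DEVICE block, control unit index) or None per address.
    pointers: List[Optional[Tuple[str, int]]]


def find_device(channel_table, device_address):
    """Return (DEVICE block, control unit index) for the address, or None."""
    channel = device_address >> 8  # assumption: high byte selects the channel
    if channel >= len(channel_table) or channel_table[channel] is None:
        return None
    entry = channel_table[channel]
    offset = device_address - entry.lowest
    if not 0 <= offset <= entry.span:
        return None
    return entry.pointers[offset]


# Example configuration with made-up device names.
channel_table = [
    ChannelEntry(0x000, 0, [("CONSOLE", 0)]),
    ChannelEntry(0x190, 2, [("DISKA", 1), None, ("DISKB", 1)]),
]
```

Keeping only the lowest address and the span per channel lets each device pointer table be as short as the range of addresses actually defined on that channel.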

Control Unit Table

There is one entry in this array for each control unit defined in the system. One flag bit, when on, indicates the control unit returned a control unit busy condition that has not yet been cleared. Another flag bit indicates that a sense operation got a SIOF CC2. When this bit is on, the device address is saved in the entry. The control unit table is built by assembling the I/O configuration module GIOGEN.

Design for controlling unreported IO errors

Imagine a disk system with blocks formatted as some power of two that makes occasional unreported disk addressing errors. We describe here a scheme that copes with such errors in conjunction with disk twinning.

For simplicity consider user pages at home positions first. This is the most critical problem because it is the largest category of disk storage. Select some fixed offset within a page, probably near the beginning. Each user home page has two fields, c and j, of integrity control information inserted into the page. c is a modular count of how many times the home page has been migrated to and j is a hash of the disk address of the home page. Adjacent to the page’s allocation count in the allocation pot is: the displaced user data from the page, the running count of how many times this page has been migrated to, and

The simplest scheme for other disk pages is to keep a two or three bit “write count” field in main store for each such disk page. This information would be preserved on disk at checkpoint time. There are more sophisticated schemes as well.

This scheme has the unfortunate consequence that the page must be removed from user memory while the channel runs.

Note that upon discovering an error after disk read substantial clues are available about the nature of the error.

This scheme is related to what would be necessary to support “page version keys” described in p2,.

IUCV logic (conditional upon the &IUCVSW assembly time switch)

We assume that the parameter list is consulted by CP only for the duration of the execution of the IUCV instruction (B2F0). The data areas and area lists specified by SEND, however are referenced asynchronously by CP. The data areas specified in the RECEIVE and REPLY calls are referenced only during the call.

Tied Pages

For now we mingle TCCWBLOCKS for the two purposes. The code which is sensitive to this has “MINGLE” in a comment.

We use GETPAGEADDRESS (GTPAGADR), and we pass to it an R5 value that looks like a TCCWBLOCK and that will keep a list of CTE addresses at the beginning. Indeed we might invent a variant of the TCCWBLOCK that would include the array of 17 pointers at the beginning {TCCWPAGES} and TCCWNEXT, which is used to chain active TCCWBLOCKs starting at ACTIVETCCWBLOCKS. The variant is marked so that DEVICEHD will content itself with unlocking the pages.

The kernel IUCV code mainly contents itself to be a transparent intermediary between the IUCV key holder and CP’s IUCV function. The kernel must, however, keep track of the send and answer data areas that are tied down.

The SEND areas must be tied down thru checkpoints just as pages with IO in progress. Indeed the IUCV areas and IO segments are so much alike that I steal much of that mechanism. CTDEVICELOCKCOUNT is used to count the times a page frame is part of an IUCV area along with what it counts now. An examination of all references to CTDEVICELOCKCOUNT confirms that this is natural.

The routine GETPAGEADDRESS in DEVICEIO is externalized so as to be available to IUCV logic. The first part of RELEASECCWS in DEVICEIO (DEVRELCC) is also used to back out.

The kernel must keep track of active IUCV data areas (and their data area lists). This includes both the send and answer areas. The receive and reply data areas are used synchronously with the IUCV RECEIVE and REPLY commands. This set grows upon a SEND order and shrinks upon Message Complete External Interrupt or PURGE. It entirely disappears upon a sever of a page in an active data area. Data area address lists must be allocated and freed. Knowing where the lists are suffices to know where the data areas are. Each of these lists is associated with the message by the “message id” assigned by CP.

An Implementation Issue with tied pages

There is the following implementation trap. A page is sold that is in an active data area for some path. We have decided that the path must be severed. We must free all TCCW blocks involved with that path. We zap the page twice!!

Tying and Untying and Dry Runs

KIUCV finishes in one of two ways, actor blocked, or headway made.

Actor blocked

This may occur when the returnee is not in memory or when a page of a data area is not in memory, or when there are no TCCWBLOCKs. In these cases the actor is enqueued.

Headway made

In this case a message is always sent to a domain.

Data area pages are tied down in anticipation of headway. If there is headway they may remain tied such as in the SEND order. When there is a blockage and headway is impossible they must be untied as IUCV finishes.

After trying to code what was locked in IA and stacked IA I gave up and defined two bits SL1 and SL2 in FLAGS.

Dead Pages

See (devicehd) about severing pages in which IUCV is operating.

The routine DEVICEHD is generalized to sever the path if a page is severed that belongs to a buffer. This requires some accounting of how many times a page was part of some buffer. I presume that DEVICEHD is seldom called and that its performance is not critical. (True.) Just as the old DEVICEHD considers each active CCW of each active channel program, now each active IUCV data area is considered as well.

We may decide to do an IUCV RETRIEVE BUFFER (Draconian) or purge upon a sold page. I think that purge poses no problem of sticking the kernel and it makes path keys even more civilized.

Special care is necessary on the IUCV PURGE. It is necessary for the kernel to know whether the associated data areas were in fact released by CP. Considering the IPRCODE after the purge suffices.

I propose two compromises for first implementation:

Allocate MESSAGEMEMOs in the TCCWBLOCK pool even though they are much smaller.

This makes DEVICEHD less different as to putting these blocks back on the free chain, and indeed avoids another free chain.

Put active MESSAGEMEMOs on the ACTIVETCCWBLOCKS list and thus cause an n**2 performance problem (in RELEASECCWS) where n is the number of members on the list.

A fix to this is to replace ACTIVETCCWBLOCKS with an array of heads hashed on either I/O device address or message id.

Storing Interrupts

Interrupts are divided into two categories, specific and general. Specific interrupts are for established paths and all others are general. (Connection Pending is the only general interrupt.) Interrupts for a specific path must be handed to the path key in the order they occur. The same goes for general interrupts delivered to the IUCVC key.

I propose to maintain an array of chained FIFO queues. A pathid hash indexes into the array. The general interrupts are kept on their own FIFO queue. The same hash may as well index into an array of domain queues. When a domain issues the wait order on a key the specific queue of interrupts is searched looking for a matching interrupt. If found the interrupt is delivered to the key holder and removed from the queue. Otherwise the domain is enqueued.
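
The proposal above can be sketched as follows. This is an illustration only: InterruptStore, NBUCKETS and the modulo hash are hypothetical choices, and a pathid of None is used here to mark a general interrupt.

```python
from collections import deque

NBUCKETS = 8  # hypothetical array size


class InterruptStore:
    """Hypothetical model of the hashed FIFO queues of stored interrupts."""

    def __init__(self):
        self.specific = [deque() for _ in range(NBUCKETS)]  # FIFO per hash bucket
        self.general = deque()                               # its own FIFO queue
        self.waiting = [deque() for _ in range(NBUCKETS)]    # domain queues

    def _bucket(self, pathid):
        return pathid % NBUCKETS  # assumed hash

    def post(self, pathid, interrupt):
        """Store an interrupt; pathid None marks a general interrupt."""
        if pathid is None:
            self.general.append(interrupt)
        else:
            self.specific[self._bucket(pathid)].append((pathid, interrupt))

    def wait(self, pathid, domain):
        """Deliver the oldest interrupt for pathid, else enqueue the domain."""
        queue = self.specific[self._bucket(pathid)]
        for index, (pid, interrupt) in enumerate(queue):
            if pid == pathid:
                del queue[index]  # safe: we return before iterating further
                return interrupt
        self.waiting[self._bucket(pathid)].append(domain)
        return None
```

Searching a bucket front to back preserves the order of interrupts for one path even when several paths hash to the same queue.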

ASSERTIONS

This supplements section (partrls). It is organized for reference and devoid of motivation. It currently lacks some information found in (partrls). In this section, “P” is the problem state bit in the PSW. See (check) for a program that frequently checks most of these invariants. See (meminv) for assertions about memory maps and (chemem) for assertions about memory maps checked by CHECK.

Valid States and Units of Transformation

When nodes conform to the special rules concerning preparedness and coupling, we say they are valid. It is clear that during the transformation from one valid state into another, the system passes through invalid states. The duration of these invalid states is limited to a (_unit of transformation). A design goal is to keep these units of transformation short in duration. On a given processor, only kernel instructions will be executed in a unit of transformation, and any interruptions thereto will not access potentially invalid data. The case of multiprocessors dictates that more than one unit of transformation may be occurring at once.

Some Rules about the CPU States

In this section: P is the problem mode; T is the address translation mode; I is I/O interrupts enabled; and E is external interrupts enabled. These are bits 15, 5, 6, and 7 of the PSW, respectively.

The CPU is in exactly one of the following three states:

Executing a domain: P=T=1, I=E=1

Executing the idle process: P=T=0, I=E=1

Executing the kernel: P=T=0, I=E=0
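
The three states can be recognized mechanically from the PSW bits named above. A minimal sketch, with psw_bits and cpu_state as hypothetical names; bit 0 is taken to be the leftmost bit of the PSW.

```python
def psw_bits(psw_high):
    """Extract (P, T, I, E) from the first 32 bits of the PSW (bit 0 leftmost)."""
    def bit(n):
        return (psw_high >> (31 - n)) & 1
    return bit(15), bit(5), bit(6), bit(7)


def cpu_state(p, t, i, e):
    """Name the CPU state, or return None if the combination is not allowed."""
    states = {
        (1, 1, 1, 1): "domain",
        (0, 0, 1, 1): "idle",
        (0, 0, 0, 0): "kernel",
    }
    return states.get((p, t, i, e))
```

For example, cpu_state(*psw_bits(0x07090000)) names the domain state, and any other combination of the four bits yields None.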

The CPU remains in EC mode (after initialization).

The translation architecture remains 4K pages and 64K segments.

This is the only combination implemented by all models.

Some Rules about the state of Main Storage

Storage keys of user pages are 0 or 1.

Storage keys of kernel pages are 3, 5, 6, or 15 if they must be modified and 0 otherwise. {N.B. Kernel DDT uses keys 8 and 9.}

Fetch-protection bits of storage keys are off except for page frames on the GSPACE free list {except DDT}.

Things that are true when executing a domain

The storage protection key in the PSW is 1.

PZDOMAINDIBP.ITEMADDR.NFPREPCODE=PREPASDOMAIN.

PZDOMAINDIBP -> DIB of some running domain. Its DR is prep-locked. Nothing else is prep-locked.

PZDOMAINDIBP.ITEMADDR.DOMHOOKKEY= an involved data key {with garbage data}.

PZDOMAINDIBP.READINESS.BUSY=1

To transform the system into a valid state {(valid)}, move general registers, floating point registers, CPU timer and PSW to designated DIB, and unlock the domain root. Then place the domain in the CPU queue.

VXAMP - For 370-XA, the following are also true:

CR0 = 00B0BC00, bit 0 of CR1 = 0 (no address space switching)

PSW bit 16 is 0. (No secondary address mode)

Things that are true when executing the idle process

The storage protection key in the PSW is 3.

PZDOMAINDIBP -> the first DIB in RNDIBS.

Things that are true when executing the kernel

The storage protection key in the PSW is one of the following:

( K = 3 {normal}

or K = 5 {in GDIRECT or GUINT while updating an allocation pot}

or K = 6 {in GMIGRATE or GUINT while updating a node pot}

or K = 15 {in RECORD}

or K = 0 {writing into user pages} )

Definition: (_Valid) means that each of these sections is true:

{Some Rules for Node Space} Each of the following is true:

The following are some invariants about keys that do not depend on the mode of preparation of nodes or the types of keys so long as they designate nodes or pages. Some of these assertions depend on the fact that each of the DSECTs KEY, NODE and CORETBEN has fields called RIGHTCHAIN and LEFTCHAIN and that those fields are in the same position within the DSECT. In this section “Ri” and “Li” refer to those two respective fields in the DSECT at address i. “Si” refers to the contents of field SUBJECT in the slot at i. Let “c”, “s” and “h” represent sets of addresses: The address J is in the set “h” (informally, the chain heads) if J is the address of a core table entry {CORETBEN} with the bit BACKUPVERSION off, or a node header {NODE} or a queue head. (The queue heads are those 8 byte fields between FRSTQUE and LASTQUE together with the field DEVWAITQUEUE in each of the DEVICE DSECT instances located by a table of pointers at DEVPTRS.) The address J belongs to the set “s” (informally, the prepared slots) iff J is the address of a slot {in a node} holding a prepared key. The set “c” is the union of “s” and “h”.

If i is in c then RLi = LRi = i and Ri and Li are in c.

If i is in s then Si is in h.

If i is in s and Li is in s then Si = SLi.

If i is in s and Li is in h then Si = Li.

The above assertions mean that the mappings L and R are one-to-one on the set c and are inverses of each other. This means that L is a permutation of c. A permutation partitions its space into cycles.
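
The four chain assertions are easy to check mechanically. A minimal sketch, assuming the fields are modeled as Python dictionaries keyed by address; check_chains and the sample addresses are hypothetical.

```python
def check_chains(R, L, S, h, s):
    """Check the four chain assertions; R, L, S map an address to a field value."""
    c = h | s
    for i in c:
        if R[i] not in c or L[i] not in c:
            return False
        if R[L[i]] != i or L[R[i]] != i:  # RLi = LRi = i
            return False
    for i in s:
        if S[i] not in h:                  # a prepared slot designates a chain head
            return False
        if L[i] in s and S[i] != S[L[i]]:  # neighbours on a chain share a subject
            return False
        if L[i] in h and S[i] != L[i]:     # the first slot's subject is its head
            return False
    return True


# Node header N with prepared slots a and b on its chain; header M has no keys.
R = {"N": "a", "a": "b", "b": "N", "M": "M"}
L = {"N": "b", "a": "N", "b": "a", "M": "M"}
S = {"a": "N", "b": "N"}
h = {"N", "M"}
s = {"a", "b"}
```

In the sample, the cycle through N contains exactly the slots whose SUBJECT is N, and the header M with no prepared keys forms a cycle of length one.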

If a node is prepared as PREPASGENKEYS, there is exactly one involved key to the node, and that key is in slot 14 of a node prepared as PREPASDOMAIN.

If a node is prepared as a general regs node, it is designated by at most 1 involved key {besides hooks} and holds (16 involved data keys or 15 involved data keys and a hook), and if it is designated by an involved key {other than a hook} that key is in slot 15 of a node prepared as a domain root.

See (prepimpl).

If L and R are prepared keys to some node and RIGHTCHAIN of L designates R, then ((If R is involved then L is involved) and (if R is an exit then (L is an exit or L is involved))). {Involved keys to nodes are to the left of exits which are to the left of other keys. Exits are never involved.} See (install).

An involved key to a page or node is prepared.

An involved key either is a non-hook in a prepared node or is a hook in slot 13 of some node for which NFFLAGS.HOOKED is on. See design note in (hook).

If a hook key designates the header of node N, then N.NFFLAGS.REJECT = 1.

{See (storkeyrule) and (storkeyrule2) also.}

See (midpntr) for stuff concerning the backchain.

{LOOSE END} We need some assertions about where all of the keys are {in item space, node pots, in backup areas, etc.} so that we can make some assertions about “all keys”.

{Some Rules for Domains} Each of the following is true:

If node N is prepared as PREPASDOMAIN then there is a DIB frame D such that N.NFDDIBOFS = D and D.ITEMADDR = N and each of the following is true:

If D.JEREADINESS.PSWNOTCHECKED then each of the following is true:

NFDOMPSW is uninvolved. (and thus holds its logical value)

D.READINESS.DIBPER = 0

etc.

If D.READINESS.PSWNOTCHECKED = 0 then each of the following is true:

NFDOMPSW is INVOLVEDW and INVOLVEDR and the logical psw is in the DIB.

D.READINESS.DIBPER = D.PSW.PER_bit = N.PSW.PER_bit

N.C13 is either an involved data key {with garbage data}, or a hook. If it is a hook, it is a worry hook iff D.READINESS.BUSY=0. The domain is busy iff D.READINESS.BUSY=1.

(D.LASTINVOLVED = D.ITEMADDR {no involved keys} or D.LASTINVOLVED points to an involved key) and
(D.LASTINVOLVED.RIGHTCHAIN = D.ITEMADDR {no uninvolved keys} or D.LASTINVOLVED.RIGHTCHAIN points to an uninvolved key)

If an involved key designates N then the key is a {stall} hook.

D.READINESS.MONITOR iff N.C6 isn’t DK(0).

If A’s monitor information is not zero, then A’s information will be in control register 8 while A is running. {If A’s monitor information is zero, there might be anything in control register 8; we will set it to 0 if A traps on a monitor instruction.}

C13 of D is a hook iff HOOKED in JEREADINESS of D’s DIB is 1.

D.LASTINVOLVED \= D.ITEMADDR {there are stallees} iff D.ITEMADDR.NFFLAGS.REJECT.

C5 is an involved data key. N.NFFLAGS.NFDIRTY = D.GENREG.NFFLAGS.NFDIRTY = D.GENKEY.NFFLAGS.NFDIRTY = 1

IF bit 8 of the PSW in the domain root is 1 then the right byte of C10 anded with the right byte of C11 is 0 or bit 3 of TC0 of the trap code is one.

If RNDIB is a member of the array RNDIBS then ( {in use} RNDIB=RNITEMADDR.NFDDIBOFS and RNITEMADDR.NFPREPCODE=PREPASDOMAIN and EC bit of RNPSW is 1) or ({not in use}RNDIB is a member of the chain headed by (DIBFREHD) and EC bit of RNPSW is 0) or RNDIB is the first member of the array {which is not used for processes}.

If node N is prepared as PREPASGENREGS or PREPASGENKEYS, then there is exactly one node M prepared as PREPASDOMAIN, such that M designates N with an involved key.

If two distinct node frames in item space hold nodes that are prepared as domains {NFPREPCODE holds PREPASDOMAIN}, then the three node frames of each domain comprise six distinct node frames.

{Some Rules for Processes}

If node A has a process in it, exactly one of the following is true, otherwise none of the following is true:

A’s slot 13 holds a hook other than a worry hook and one of the following is true: (This list is somewhat redundant with that at (quelist).)

A is on the CPU queue. A holds an involved hook key to the CPUQUE queue head or the CPU queue is frozen and A holds an involved hook key to the FCPUQUE queue head.

A is in page wait on a page wait queue and the pager knows about the page. A holds an involved hook key to an I/O queue head chosen by a hash of the I/O address. This queue head is in the block called IOQUEUES.

A is in page wait and the page or node is not in a mounted range. A holds an involved hook key to the queue head RGUNAVL.

A is in page wait and there is no room in item space for the node. A holds an involved hook key to the queue head NONODES.

A is in page wait and there is no room in memory for the page. A holds an involved hook key to the queue head NOPAGES.

A is in page wait and there are no I/O control blocks available to queue the request. A holds an involved hook key to the queue head NOIORQBS.

A has attempted to write on a kernel-read-only page. A holds an involved hook key to the queue head KROQUE.

A is stalled. A holds an involved hook key to the domain that it wants to enter.

A has invoked a basic I/O operation. A holds an involved hook key to a queue head for that particular I/O device.

A has invoked a BWAIT key. A holds an involved hook key to an item for that particular BWAIT key.

A has invoked the MIGRATE key function “wait for migration needed”. A holds an involved hook key to the item MIGRWAIT.

A has invoked a migrate function with the MIGRATE key and is waiting for I/O to complete. A holds an involved hook key to the item MIGRTCZ.

A is prepared as a domain and the CPU is executing that domain and PZDOMAINDIBP holds the domain’s DIB address. {See also (ifp).}

See (ch1) for rules about memory mapping.

{Some Rules for Core Page Frames}

The storage key of a page is If (STORKEYISZERO or CTKERNELREADONLY) {in CTFLAGS of the coretable} Then 0 Else 1 Fi.

If a page table designates a page, then the read-only bit of PTFORMAT of the page table is the complement of STORKEYISZERO of CTFLAGS in the corresponding core table entry.

If node N produces page table T and entry i of T designates page P, then either:

slot i of N holds a memory key that provides read-write access to P and STORKEYISZERO for P is NOT (PTFORMAT of T), or

slot i of N holds a memory key that provides read-only access to P and STORKEYISZERO for P is 1.

If CTIOCOMPLETECOUNT {in CTFLAGS of the coretable} equals 0 Then:

If CTPOT {in CTFLAGS} Then If CTALLOCATIONPOT {in CTPOTFLAGS} Then {frame holds an allocation pot} the allocation pot must be on the allocation pot hash chain (WRTAPHCH) Else {frame holds a nodepot} the nodepot must be on the node pot hash chain (WETNPHCH) or on the home node pot hash chain (WETHNPHH) Fi Else {frame holds a page} it must be on the appropriate page hash chain{(CDA of page & 01FE) + PAGCHHD}

{Some Rules for Meters} Each of the following is true:

If meter A designates meter B and A is prepared then (B is prepared and If B’s DRY bit is on then A’s DRY bit is on).

If a prepared domain root designates a prepared meter whose DRY bit is on, the domain’s DIB’s caches are zero.

If a node is prepared as a meter, it is the primordial meter or it holds an involved meter key to a prepared node whose meter level is one less. The meter level of the primordial meter is 0.

An involved meter key designates a node that is prepared as a meter.

A node prepared as a meter holds an involved meter key or holds a meter key with a CDA = 0 {the super meter key}.

{Allocation Count Rules}

These rules are in support of an assertion that after a sever operation {(p2,sever)} the only slot in the system with a key to the page or node is the returnee’s slot.

These rules are for any key K in item space, a node pot, a disk swap range, or at home on the disk. They apply to any core page frame p on a hash chain and any node frame f in item space on a hash chain.

If K is an unprepared page key and DISKADDR of K = CTHOMEADDRESS of p and CTALLOCATIONCOUNTUSED in p is 0 then ALLOCCNT of K is less than CTALLOCATIONCOUNT of p {the key is obsolete}.

If K is an unprepared key designating a node and not an exit and DISKADDR of K = NFHOMEADDRESS of f and NFALLOCATIONCOUNTUSED in f is 0 then ALLOCCNT of K is less than NFALLOCCNT of f {the key is obsolete}.

If K is an unprepared exit and DISKADDR of K = NFHOMEADDRESS of f and CALLCOUNTUSED in f is 0 then ALLOCCNT of K is less than NFCALLCNT of f {the key is obsolete}.

“Count used” bits

When a key that designates a page or node is unprepared {in item space or in any of the places that keys are kept besides item space} and the thing that it designates is in item space and the count within the key has the same value as the count in the thing, then a certain bit in the thing in item space is on.

If the key is an exit the bit is NFCALLIDUSED.

If the key is a page key then the bit is CTALLOCATIONIDUSED {(coretable)}.

Otherwise the key designates a node and the bit is NFALLOCATIONIDUSED.

The count-used-bits mean:

A key {of the appropriate type} to this node has existed with the current count value

or the node has been brought to core more recently than it has been severed.

They are maintained this way:

The bit is turned on when:

a key to the thing is unprepared,

the thing is brought in.

The bit is turned off when:

the count is incremented.

The bits are not maintained on the disk. {They are virtually 1 when the thing is on the disk.}
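
The maintenance rules above can be sketched as a tiny model. Thing and its methods are hypothetical names, and one count with one bit stands in for whichever of the three count and bit pairs applies.

```python
class Thing:
    """Hypothetical page or node in item space with one count and its used bit."""

    def __init__(self, count):
        self.count = count
        self.count_used = True  # bringing the thing in turns the bit on

    def unprepare_key(self):
        """Unprepare a key: the bit is turned on; return the count the key records."""
        self.count_used = True
        return self.count

    def increment_count(self):
        """Incrementing the count turns the bit off."""
        self.count += 1
        self.count_used = False


def obeys_rule(thing, key_count):
    """An unprepared key with the current count implies that the bit is on."""
    return key_count != thing.count or thing.count_used
```

Note that unprepare_key changes only the bit, never the count, which is what lets unpreparing a key leave the thing's contents unmodified.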

Design notes:

The advantage of these rules is that unpreparing a key to a node does not constitute modifying the node {turning the dirty bit on}. The main advantage of this, in turn, is that checkpointing will cause less I/O and use less swap space.

The only useful meaning of a “used-bit” on the disk whose value is zero would be that all keys to that page were prepared. But no keys to a page are prepared when that page is on disk. Therefore in this case there are no keys to the page. While this may happen, it is not a case of importance for performance. Turning on a “used-bit” never causes the kernel to get the wrong answer. I conclude, therefore, that keeping the bit on the disk has neither logical nor performance justification.

SEAPLOOP Invariant

Each of SEGKEEPSLOT and BGNODE is either not positive or locates a core-locked node. (N.B. 0 is not positive!)

Unless MEMORYFLAGS.ISPAGE, locates a preplocked node.

Some assertions while running in the module GATE:

Points in the code for which there are assertions have “as” in the listing as a comment.

If IA = SVCINT then {user program has just executed an SVC instruction and} the following actions comprise a safe {(safe)} transformation:

Put the process timer, general, and floating registers in the DIB at PZDOMAINDIBP.

Deduct the SVC ILC from the old SVC PSW and place the results in the same DIB.

Place the domain {whose DIB is at PZDOMAINDIBP} on CPUQUE.

If IA = GATEBASE then 13R holds GATEBASE and the following actions comprise a safe transformation:

Put process timer in DIB designated by 15R.

Deduct SVC ILC from PSW in that DIB.

Put that DIB’s domain on CPUQUE.

PER

Exactly one of the following is true:

PEROWNER points to a domain DIB, DIB.READINESS.DIBPER=1, and the PER control registers (9:11) hold the data from DIB.ITEMADDR.DOMPERA and DOMPERB.

PEROWNER contains 1 and the PER control registers hold the kernel’s PER information {KERNPERR}.

PEROWNER contains 0

If bit 1 {PER} of the PSW is on then either:

{P=0 and LPER=X'40' and nibble 0 of KERNPERR is not 0 and PEROWNER contains 1} or the IA is in some code specifically involved with kernel PER (loose end!) or

P=1 and PZDOMAINDIBP = PEROWNER.

See (argument-predicates) for the definition of some propositions concerning argument passing.

PROOFS

Definition: (_Domain is running) means PZDOMAINDIBP has the address of the DIB of a domain, its CPU allocation is in the hardware process timer, PZBITS.CPUTIMERSTORED = 0, and either:

the DIB belongs to the idle job,

or the DIB doesn’t belong to the idle job, the domain’s root is prep-locked, and PZDOMAINDIBP = CTL1OWNER.

Ideas for Proofs about the Kernel

This is a collection of ideas about how to prove some desired attributes of the kernel. The effort implied here is at least several man-years and is beyond the state of the art.

I imagine a formal, machine-oriented language {(lang)} to express propositions, especially about programs.

By “program”, I mean here a machine language program stored in memory {real in the case of the kernel} at the locations where it will run.

This view of a program, in place, addresses problems, normally ignored, of the correctness of compilers, assemblers, file systems {to hold the input and output of the compilers, etc.}, loaders, and bootstrap mechanisms.

It is also perhaps the most convenient place for our proof checker to examine the program.

I further imagine a definition of proof that can be checked in one pass by a small simple fast program called the proof checker.

A hypothetical scenario:

Imagine a person who wishes to be sure that a version of the kernel has certain attributes that can be expressed in our language. Imagine that he has a bare machine at his disposal. He has three reels of tape.

Reel one has the program that checks proofs. This program is trusted because:

it is simple {it consists of about 1000 instructions},

it has been published,

it constitutes a direct check of our definition of proof which, in turn, is directly accessible to the intuition, and

no one has been able to find a proof that some program has an attribute that the program demonstrably lacks and that this checker accepts the proof.

Reel two contains an array of theorems that are to be assumed.

These theorems include common simple mathematical theorems and some axioms about the operation of the computer.

Reel three contains an IPL’able version of the kernel followed by the proofs.

Reel three is mounted and the kernel is IPL’d. The machine is stopped at the point where the kernel is in place. Tapes one and two are mounted. Tape one is IPL’d. {The proof checker loads into main storage at a different location than the kernel.} The proof checker now starts.

The proof checker reads the proof from reel three. The proof consists of a sequence of propositions, each of which is an elementary consequence of previous propositions of reel three or of the theorem tape {reel two}. For each proposition, the proof indicates which of a few simple rules of logic is used and which of the previous propositions are used. The proof also indicates which propositions will no longer be required so that the space in main storage to hold them can be reused.
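
To make the one-pass character of the checker concrete, here is a toy sketch. Everything in it is an assumption made for illustration: propositions are strings or tuples, modus ponens is the only rule of logic, and check_proof and the step layout (proposition, rule, references, propositions to discard) are hypothetical.

```python
def check_proof(theorems, steps):
    """One-pass check; return the propositions still held at the end."""
    held = {}  # step number -> proposition kept in main storage
    for n, (prop, rule, refs, drop) in enumerate(steps):
        if rule == "theorem":
            ok = prop in theorems  # taken from the theorem tape
        elif rule == "mp" and len(refs) == 2:
            a, b = held.get(refs[0]), held.get(refs[1])
            ok = a is not None and b == ("implies", a, prop)
        else:
            ok = False
        if not ok:
            raise ValueError("step %d does not follow" % n)
        held[n] = prop
        for r in drop:  # the proof says which propositions are no longer needed
            held.pop(r, None)
    return [held[n] for n in sorted(held)]


theorems = {"p", ("implies", "p", "q")}
steps = [
    ("p", "theorem", (), ()),
    (("implies", "p", "q"), "theorem", (), ()),
    ("q", "mp", (0, 1), (0, 1)),
]
```

Each step is examined once and refers only to propositions still held, so the storage needed is bounded by the number of propositions the proof keeps alive at one time.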

At the end of the proof, the proof checker prints out the proven propositions.

Note that other proof checkers may be used. In fact, several may be used.

Some ideas about the language:

The first order predicate calculus with equality is the foundation.

Quine’s “Mathematical Logic” has the easiest set of axioms that I know. Proofs in that form, however, tend to be excessively long; the “proofs” in Quine’s book are really informal recipes for constructing proofs.

More accessible to the intuition are the proofs described in Frederic Brenton Fitch’s “Symbolic Logic”. While his definition is more complex, real proofs in his form are easier to understand and very much shorter. They are block-structured in the sense of modern programming languages. A program to check Fitch’s proofs is only marginally more complex than one for Quine’s proofs.

A fast simple program can translate a Fitch proof into a Quine proof if there is room to store the results.

The term “natural deduction” has been used to describe proof forms similar in style to Fitch’s.

Functional notation would be nice, but I don’t know of axioms for that extended language.

Functional language provides for individual constants and variables, predicates and functions. Quantifiers are allowed on individual variables but not on predicates or functions.

I think that it is possible to define a language with functions and a program to translate propositions and proofs in that language to propositions and proofs in the predicate calculus without functions. I don’t know if the transformed proofs would be of practical size.

Suppose that we wish to use an integer function F(x) which is defined when Dx. {x ranges over integers and “Dx” is a predicate expression.}

We will need a predicate F where Fxy means that x = F(y).

Now propositions involving expressions with functional notation can be transformed into propositions involving the corresponding predicates. Proofs involving propositions with functional notation can likewise be transformed into proofs involving only propositions without functional notation.

Proofs of theorems such as “(y)(if Dy then (Ex)(Fxy and (z)(if Fzy then z = x)))” must be supplied to make the transformations.

These transformations might be made before the proof checker starts. I hope that the resulting proofs are not so large as to be infeasible for the checker. Perhaps the proof checker could run as the expander runs, but then you would have to trust the expander not to damage the checker or run in the system that you were trying to validate.

As an example of the kind of axiom that would be available about a 370, consider the following. This is informal, but unabridged.

If bits 1, 5, 6, 7, and 14 of the PSW are off {PER not active, translation off, disabled for I/O and external interrupts, and not in wait state} and I is the value of bits 40 through 63 of the PSW and I is even and the 4 bytes at location I are (X'58'*16**6 {a load instruction} + i*16**5 + j*16**4 + k*16**3 + d) and 0<=i<16 and 0<=j<16 and 0<=k<16 and 0<=d<16**3 and general register j holds J and general register k holds K and d+J+K < main storage size and the 4 bytes at d+J+K hold S then it will come to pass that bits 40 through 63 of PSW hold I+4 and general register i holds S and everything else that held something holds the same thing as it did before.
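The axiom can be read as a partial state transition. The following is a minimal sketch under stated assumptions: the dictionary layout of the machine state and the names load_axiom_applies and apply_load are invented for illustration. It follows the informal axiom literally, so the 370's special treatment of register 0 as a base or index is not modeled, and it requires that all four operand bytes lie within storage, which is slightly stronger than the bound written above.

```python
# Sketch of the informal axiom for the 370 Load instruction (op-code X'58').
# The state layout and function names are illustrative assumptions.

def _decode(state):
    """Return (i, j, k, d) for the word at the instruction address."""
    ia = state["ia"]
    word = int.from_bytes(state["mem"][ia:ia + 4], "big")
    return word >> 24, (word >> 20) & 15, (word >> 16) & 15, (word >> 12) & 15, word & 0xFFF

def load_axiom_applies(state):
    """True when the hypothesis of the axiom holds for this state."""
    if state["psw_bits"] & {1, 5, 6, 7, 14}:      # set of PSW bits that are on
        return False
    ia, mem = state["ia"], state["mem"]
    if ia % 2 or ia + 4 > len(mem):
        return False
    op, _, j, k, d = _decode(state)
    if op != 0x58:
        return False
    addr = d + state["regs"][j] + state["regs"][k]
    return addr + 4 <= len(mem)                    # assumption: whole operand in storage

def apply_load(state):
    """Return the successor state that the axiom promises."""
    assert load_axiom_applies(state)
    _, i, j, k, d = _decode(state)
    regs, mem = state["regs"], state["mem"]
    addr = d + regs[j] + regs[k]
    new_regs = list(regs)
    new_regs[i] = int.from_bytes(mem[addr:addr + 4], "big")
    # Everything else holds what it held before.
    return {"psw_bits": state["psw_bits"], "ia": state["ia"] + 4,
            "mem": mem, "regs": new_regs}
```

A proof checker would of course hold this as a proposition rather than as a program; the sketch only shows how little state the hypothesis and conclusion mention.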

It appears that our axiom above talks about time in a way alien to predicate logic. The weakest precondition and post conditions are ideas that may be useful here.

Since nearly all theorems to be proven are of the form “for every time x” or “for some time x” it might not be extravagant to build time into the language.

It seems likely that the semantics must be streamlined even if programs are the only ones to process these propositions. If we had not called for the external interrupts to be disabled, the description would have been much longer. The description would also have necessarily been nondeterministic.

Sets:

Most mathematical proofs involve sets. I think that we may avoid sets. One problem that seems to require sets is the following: How do you phrase the proposition “Element X is on the chain headed at Y”?

One solution is to build a disjunction of n propositions, where the i’th proposition says that X is the i’th element of the chain and n is the number of cells through which one might possibly chain. Each of these propositions may be built in the predicate logic but such long propositions may not be practical.

One way to get the benefits of sets is to have a “small set theory” that would provide for the existence of sets only by enumeration and subsetting.

Numbers and rows of numbers:

As an alternative to sets, I think it suffices to have quantified variables ranging over integers and rows of integers.

As an example, we describe informally here a proposition in such a language. It expresses the {current} fact that the chain rooted in PTFRAMEHD is intact and for just those page table frames on the PTFRAMEHD chain is it true that FREEPAGETABLE = 0. Here “pte” stands for “page table entry”.

There exists an integer j >= 0 and a row of j numbers N such that for each number i (if 0<=i<j then (N(i) is the address of a pte and PTP of that pte holds N(i+1) and PTFRAMEHD holds N(0) and N(j) = 0)) and that for each pte p ((there exists an integer i such that (0<=i<j and N(i) is the address of p)) iff bit FREEPAGETABLE of p is 0).
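To make the content of this proposition concrete, here is a minimal sketch that decides it on a toy model of storage. The model (a dictionary from pte address to its fields) and the function names are assumptions made for illustration; only PTFRAMEHD, PTP and FREEPAGETABLE come from the text.

```python
# Sketch: decide the PTFRAMEHD chain proposition on a toy model of storage.
# ptes maps the address of each pte to a dict with keys "PTP" and "FREEPAGETABLE".

def chain_row(ptframehd, ptes):
    """Return the row N (ending in 0) of chained addresses, or None if no such row exists."""
    row, seen, addr = [], set(), ptframehd
    while addr != 0:
        if addr not in ptes or addr in seen:   # dangling pointer or a loop
            return None
        seen.add(addr)
        row.append(addr)
        addr = ptes[addr]["PTP"]
    return row + [0]

def proposition_holds(ptframehd, ptes):
    """True when the chain is intact and exactly the chained ptes have FREEPAGETABLE = 0."""
    row = chain_row(ptframehd, ptes)
    if row is None:
        return False
    on_chain = set(row[:-1])
    return all((addr in on_chain) == (pte["FREEPAGETABLE"] == 0)
               for addr, pte in ptes.items())
```

The existential quantifier over rows disappears in the sketch because following the PTP fields determines the only candidate row.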

Things to prove:

The first thing to prove is the conjunction of a large number of integrity assertions about the kernel. This conjunction is proven by induction over the number of kernel instructions executed since IPL. Examples of these assertions are:

If PSW is in privileged state then its right 24 bits hold one of the following values: ..., ..., .

If PSW is in privileged state and designates instruction at ... then register ... holds ... and register ... holds ..., where “...” represents a numeral in each case.

The instructions and certain data are never modified.

If the PSW is in privileged mode then (the PSW key is 3 or the PSW designates one of the following instructions: ..., ..., ).

When the PSW designates one of the above instructions register ... {which is the address of storage that will be modified by a kernel instruction running in PSW key 0} holds a value that does not designate one of the instructions of the kernel.

The next thing to prove is that the world is valid {(valid)} at the proper times.

Next we prove that objects influence each other only via keys and that certain kernel implemented keys obey some read-only principle. This suffices to assure certain static protection policies where the TCB is the kernel.

The last security relevant theorem about the kernel is that it supports the correct execution of the rest of the TCB. We do not know which kernel features the external TCB will use yet. We must prove that the kernel, optimization aside, executes the right algorithm. The best that I can imagine here is to prove that the kernel behaves externally the same as a simple kernel that lacks {say} I/O, checkpoint, prepared and involved keys, prepared nodes, page and segment tables {??}, etc. More precisely, it must be proven that the real kernel will do one of the things that the simple kernel might have done.

This simple kernel must be written in some language that is nondeterministic. I am not sure whether machine language is reasonable in this case. It should have semantics that are convenient both to programs {to make the programs that interpret them easy to use} and to humans {who are trying to understand what some standard algorithm is}.

The last set of things to prove that I have imagined is certain performance assertions. The easier of these is that user programs will not cease to run. Much more difficult is that under some useful situations certain problem mode programs will make headway at some specified rate. Alas, I think that this is not true for the current kernel!

See (memorder1) for related ideas in the area of address translation.

Multiprocessors

What is the nature of theorems with multiprocessors? Barring techniques such as Lamport’s and excluding here uses of compare and swap logic, I think that the theorems are modified according to the following ideas:

The kernel’s storage is divided into hunks, normally contiguous, each of which has a lock byte. These hunks are disjoint or perhaps one hunk may be completely contained within another. In the current kernel implementation these hunks are permanently established at IPL time. The hunks would be implicit in the kernel code and explicit in the proof.

The hunks might be: node frames, page tables, segment tables etc.

It might be necessary to exclude from the node frame hunk the chaining fields of prepared keys. Neighboring keys along the chain will sometimes be in other locked nodeframe hunks. To operate on the key chain it may be necessary and sufficient to lock the chain head to ensure that only one processor is meddling with the chain. In this case the hunk would be the entire backchain.

If one were to stop all processors at the end of their respective next units of operation then:

All unlocked hunks would be valid in some modified sense.

All locked hunks would belong to some processor. No locked hunk would belong to two processors. An integrity theorem would apply to each processor that would mention each hunk that belongs to it along with the processor’s PSW and the processor’s prefix page contents.

Expected benefits:

A validity proof would have eliminated at least 95 percent of the debugging so far. The effort to produce that proof, however, would have greatly exceeded the debugging effort so far. On the other hand, we are not nearly done debugging and have no clear idea how long or how much effort remains here. Further, we will not know when we are done finding bugs of this form.

Of the bugs that we have found, a kernel integrity proof would have found about three. {This is a guess.} These bugs have been easy to find. They constitute, however, a much more serious form of bug. These bugs may manifest themselves to users who may then learn to utilize them to thwart the security mechanisms of the kernel. It is this kind of proof that I believe is required in order to satisfy military security requirements.

A validity proof would find roughly the same bugs that CHECK finds but statically and completely. Most of the bugs found to date have been of this form. Those not detected by CHECK have been hard to find.

Ideas for a Kernel Integrity Proof

These are some ideas for a modest attack on the kernel integrity problem {(integrity)}. The form of the proof is induction on the number of instructions executed in privileged mode. The induction hypothesis is merely the conjunction of the integrity assertions, augmented by various corollaries necessary to make the proof go thru.

The immediate plan is not to produce a proof, but to capture the reason for integrity, which can with more work be turned into a proof.

These assertions take the form of a data base that was primed by data generated by macros scattered thruout the kernel source. These macros would compile to CSECTS to be loaded in a segregated part of memory and would thus not take up space in the running kernel.

Flow Analysis

I am not now sure that we need any of this stuff for integrity!

The induction proof apparatus will require efficient flow information. Given a PSW it is required to know quickly the possible immediate preceding PSWs.

Sequential flow

A bit map with a bit per possible IA value can be built. It can be proven that no instructions overlap (this being the case). This allows quick location of sequentially preceding instructions.
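A minimal sketch of such a map follows. It relies on the 370 convention that the two high bits of the op-code give the instruction length (00 is 2 bytes, 01 and 10 are 4 bytes, 11 is 6 bytes). The linear sweep used to fill the map is a simplifying assumption that holds only for a region containing nothing but instructions; the function names are invented.

```python
# Sketch: one flag per possible (even) instruction address value.

def instruction_length(opcode):
    """Length in bytes, taken from the two high bits of the op-code."""
    return (2, 4, 4, 6)[opcode >> 6]

def build_start_map(code, entry=0):
    """Mark the start of every instruction reached by falling through from entry."""
    starts = [False] * ((len(code) + 1) // 2)
    ia = entry
    while ia < len(code):
        starts[ia // 2] = True
        ia += instruction_length(code[ia])
    return starts

def sequential_predecessor(code, starts, ia):
    """Return the address of the instruction that falls through to ia, or None."""
    for length in (2, 4, 6):
        prev = ia - length
        if prev >= 0 and starts[prev // 2] and instruction_length(code[prev]) == length:
            return prev
    return None
```

Because no instructions overlap, at most one of the three candidate addresses can qualify, so the lookup costs a constant number of probes.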

Branches

Scheme One

The conditional and unconditional branches can be turned into macros that compile what they do now and, in addition, information in remote CSECTS describing old-new IA pairs.

This information is not quickly gleaned from the bare instruction because branch instructions are based and the base register values are not known.

Scheme Two

The USING directives can be turned into macros that generate using information to help interpret the branch instructions found in the program. Such information becomes part of the induction hypothesis (to be proved).

I prefer scheme two now.

Modification of Storage

Layout of the kernel’s variable areas

Most of the kernel’s variable areas are allocated as a vector of some structure {“ROW REF STRUCT(...)” in Algol68, “Array of Record(...)” in Pascal}. Such vectors are allocated at kernel initialization time. An instruction whose purpose is to modify field three of such a structure will have an effective address within some finite arithmetic sequence that is established at kernel initialization time.

Each instruction whose OP-code indicates modification of storage must have its effective address explained. This requires assertions about base and index register values for such instructions.

Since such values are most frequently loaded into the register from some other structure, a common conjunct of the induction hypothesis is of the form:

“If i is in set A then location i+12 holds a value in set B” where A and B are arithmetic sequences fixed at kernel initiation time.
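A conjunct of this form is cheap to check mechanically when A and B are arithmetic sequences. The sketch below models each sequence as a Python range and storage as a dictionary from address to word; the particular origins, strides and the function name are made up for illustration.

```python
# Sketch: check "if i is in set A then location i+12 holds a value in set B".
# A and B are arithmetic sequences fixed at kernel initiation time,
# modelled here as ranges. storage maps an address to the word held there.

def conjunct_holds(storage, a, b, offset=12):
    """True when every member of a, displaced by offset, holds a member of b."""
    return all(storage.get(i + offset) in b for i in a)

# Hypothetical layout: three 64 byte frames whose field at +12
# points at one of four 256 byte blocks.
A = range(0x1000, 0x1000 + 3 * 0x40, 0x40)
B = range(0x8000, 0x8000 + 4 * 0x100, 0x100)
```

Membership in a range is decided arithmetically rather than by enumeration, which is the property that makes arithmetic sequences convenient here.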

Alternate Idea (Current Project)

The first major task is to produce the induction hypothesis. This is too tedious to do manually. I imagine the following process to produce the bulk of the hypothesis:

Imagine a partial computer (simulated by a program (PROVE ASSEMBLE) that I have begun to write) that executes the kernel keeping a partial system state. What part of the system state to keep is decided by ad-hoc heuristics. The output of this calculation is conceptually a CPU state for each IA value. (I say conceptually because this is too expensive and I have ideas to save an order of magnitude of storage with relatively little CPU cost.)

A partial state includes a “partial value” for several of the 16 general registers and selected words of memory. A partial value describes some set of 32 bit values that might constitute the corresponding real value at the same point. There are three kinds of partial values: A known word, a known offset from one of a known set, or ignorance {all 2**32 values}.
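The three kinds of partial value might be represented as follows. This is a sketch only: the class names, the may_hold test and the add_constant operation are assumptions, since the text names the kinds but not their representation.

```python
# Sketch of the three kinds of partial value.
from dataclasses import dataclass

MASK = 0xFFFFFFFF

@dataclass(frozen=True)
class Known:                 # a known word
    word: int

@dataclass(frozen=True)
class OffsetFrom:            # a known offset from one member of a known set
    base_set: range          # for example a set of block addresses
    offset: int

@dataclass(frozen=True)
class Ignorance:             # any of the 2**32 values
    pass

def may_hold(pv, value):
    """Could the real 32 bit value be `value`, given partial value pv?"""
    if isinstance(pv, Known):
        return value == pv.word
    if isinstance(pv, OffsetFrom):
        return ((value - pv.offset) & MASK) in pv.base_set
    return True

def add_constant(pv, c):
    """Partial value that results from adding the constant c."""
    if isinstance(pv, Known):
        return Known((pv.word + c) & MASK)
    if isinstance(pv, OffsetFrom):
        return OffsetFrom(pv.base_set, pv.offset + c)
    return pv
```

For example, a register known to point at some control block stays an OffsetFrom value after a field displacement is added, so a later store through it can still be explained.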

The partial computer (PC) starts just after the kernel initiation code finishes, where it branches to IDLEX. The PC first establishes several BA (block address) sets, one for each of the major kinds of kernel control block. For example N belongs to the DIB’s BA set iff some DIB frame begins at N. We use the fact that these sets are known initially. It is a less important characteristic that these sets are arithmetic sequences.

Evolution of the Real State of the System

The previous section might lead to the establishment of the safety of data. We discuss here how progress towards some processing goal can be made despite these restrictions.

Each valid {(valid)} real state maps to one logical state.

Given a valid state of the kernel, the corresponding logical state is determined by unpreparing all of the nodes.

The evolution of the real state is constrained by the external specifications of the kernel. For its part, the kernel really goes through all of the states called out by the external specifications. {It need only appear to!}

The kernel will typically go through several valid states which all map onto the same logical state. We call these states equivalent.

A transformation is called (_safe) if it results in a valid state whose corresponding logical state evolves from some previous logical state. {Intuitively, a transformation is safe if it leaves the system in a valid state and does not cause wrong answers. A safe transformation may fail to make headway.}

We use the idea of safe transformations to define the meaning of some invalid states by describing some simple transformation to a valid state whose meaning is known. These transformations are typically hypothetical.

Starting from a valid state, it is safe to change the PSW to 00CF0000 IDLEX.

The following transformations lead to valid states if the prior state was valid {(valid)}:

Unpreparing a prepared uninvolved key.

Replacing an unprepared uninvolved key with another unprepared uninvolved key. {While the transformation leads to another valid state, the new state is not typically one that might evolve from the prior state.}

See (p1,xa) & (ifpx) for other comments.

These are some thoughts on the implementation of an XA kernel. We simultaneously raise the issue of expanded limits merely because XA machines are typically larger.

31 bit mode for kernel?

The XA BALR and BAL put the right 32 bits of the PSW into R1 unlike the model 67. This eliminates one reason that all of the XA kernel TEXT files would have been distinct from pre XA files.

Multi Processing

See (ncl) & (locksx) about locks. See (dry-run) about avoiding deadly embrace. See (multi-tail) for an idea about the chains to find pages and nodes in frames given their CDAs.

Assertions

Each processor has a distinct prefix register value pointing to its “page zero”.

When distinct processors are in problem mode their respective PZDOMAINDIBP’s point to distinct DIBs.

Channel I/O is done only to user page frames and other designated buffers. (The only ones I am aware of are the sense data and the device type, both in the device block.) Therefore unless the declaration of a variable says it is used for channel I/O, it isn’t, and the rules for block-concurrent access apply to all accesses.

Actors acting on Actors

Suppose a queue for domains that would act on domains running on another CPU. A domain would join this queue upon failing to gain a lock on a node for any of these reasons: the node is part of a domain running on another CPU; or the node is a prepared meter node and UNPRND finds that there is an inferior domain whose DIB is locked (!!!); or UNPRND is rejected by DEPEND because some DIB’s SEGTABPP is unzappable, that DIB being locked because its domain is now running. The CPU running the target domain would be signaled and, upon receiving such a signal, domains on the new queue would be put on the CPU queue.

The same queue might serve anyone denied an exclusive lock on a frame already held by shared locks.

Perhaps we should assume that the fast signal order codes are fast enough to wait for. We could then simplify several things. We would then have to watch out for deadly embrace. Perhaps the deadly embrace means that we can’t avoid any of the hair by waiting!

An elaboration of “valid” {(valid)}.

This is to capture the logic of locks. Suppose that we stopped all processors even within units of operation but between accesses to storage, what could we assert?

Assertions within section (valid) hold with exceptions made somehow for those parts of the kernel’s storage covered by locks currently set. This requires a formal mapping between storage and locks covering that storage.

Frame Lock Priorities

These are some fuzzy statements that need to be made precise. Even the nouns need to be clarified.

“Frame” refers to a page frame (and its core table entry) or a node frame. Perhaps “frame” must include hook designees (such as I/O device queues) as well. “Designator” refers to the frame holding the node with the slot that holds a key that designates a page or node. “Designee” refers to the frame that holds the page or node designated by that key.

I see these locks for each frame just now: a shared and an exclusive lock, an exclusive backchain lock and a prep lock. To change the logical content of a slot requires an exclusive lock on the holding frame. To prepare or unprepare a slot does not. (This must be coordinated with the old rule that you can store into a slot of a node subject only to the rule that the slot be uninvolved and then only after unpreparing it.)

To prepare a key requires a shared lock on the designator and designee and the backchain lock. To unprepare a key requires an exclusive lock on either the designator or designee and the backchain lock. Note that a key will remain prepared while you hold shared locks on designator and designee.

Some rules: Don’t insist on getting any locks while you hold a backchain lock. More fundamentally -- don’t keep a backchain lock for long. Preparing a node so as to hold involved slots with the “don’t read” bit requires an exclusive node frame lock. After preparation the exclusive lock can be downgraded to a shared lock. Preparing a node so as to hold no keys with the “don’t read” bit requires only a shared node frame lock. (Note that singly linked lists of page tables and such can be added to without an exclusive lock.) There are no exclusive or shared frame locks except for CPUs executing the kernel or for the domain root of the domain being executed (in problem state) by some CPU.

You may prepare a node for which you hold a shared lock unless that would cause involved keys with “don’t read” on. The latter requires an exclusive lock.

UNPRND is at a higher level than this and must thus obey these rules. Suppose that one invokes a node key designating the regs node of some executing domain. An exclusive lock on the regs node is acquired. The order code is to store a key and the target slot is involved so that SLOTZAP is called. SLOTZAP calls UNPRND there being no short cut available for this case. UNPRND is now coded to reject the request upon encountering a preplocked node. The current code uses PREPLOCK as an exclusive lock.

You may have acquired an exclusive lock on a node in which you have no interest in its preparation mode. If some further action indirectly requires it to be unprepared that is what you would like to happen. In other cases you may have required some particular preparation in which case you want to retain that. This argument seems to require a prep lock. I also imagine a useful state of being prep locked while share-locked. This would apply to segment nodes while executing SEARCHPORTION. This state says that anyone can look at the node but cannot unprepare it.

Summarized rules:

This summarizes the rules implicit in (flp). Assume the frame lock state set described at (fs).

For each frame (and hook designee) there is: FRAMELOCK, whose state is in the set described at (fs); a bit called CHAIN (for the back chain lock); and a bit called PREPLOCK.

While CHAIN is zero every prepared slot which designates this frame is on the back chain headed at this frame and every slot on the chain designates this frame. CHAIN is level 0. The CPU that turned CHAIN on turns it off in several instructions without going into problem state. CHAIN may be turned on or off without reference to other properties of the frame although we don’t see any reason to have a prepared key to a null frame.

PREPLOCK is turned on only during some unit of operation. While it is on the CPU that turned it on may prepare and unprepare the node in the frame.

Key accessing primitives

I explore here the idea of abstracting certain parts of key representation logic from most of the kernel by providing a few primitives with which to access keys. When a shared lock is held on a node the key type of a slot may be examined but not the prepared or involved bits. {See (lr1)}

PROC locate frame = (REF SLOT slot, BOOL ex, PROC VOID obs, b, def, REF NODE act)REF FRAME:

‘slot’ is the address of a key slot in a locked node. The value returned points to the frame designated by the key in the slot. A lock has been granted on the frame, exclusive if ex and shared otherwise. If the frame is already exclusively locked, or if ex and there is already a shared lock, then b is called. If the designated page or node has been severed then obs is called. If the object is not in memory it is summoned, the domain whose root is act is put on the I/O queue, and procedure def is called.

Clearly this procedure can be built of a more primitive procedure that does not deal with the actor but merely returns an exception when the designee is not in core. I do it this way here to get a feel for the most commonly used interface.

PROC unlock frame = (REF FRAME f, BOOL ex)VOID: ... ;

Frame f is unlocked. (ex|exclusive|shared).
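The lock behaviour of these two primitives can be sketched as follows. Everything beyond the parameter roles is an assumption: the Frame class, the use of a shared count plus an exclusive flag in place of the lock states described at (fs), the order of the checks, and the use of exceptions to stand for calling b, obs and def.

```python
# Sketch of the lock behaviour of 'locate frame' and 'unlock frame'.

class Busy(Exception):       # stands for calling b
    pass

class Obsolete(Exception):   # stands for calling obs
    pass

class Deferred(Exception):   # stands for calling def
    pass

class Frame:
    def __init__(self):
        self.shared = 0          # number of shared locks held
        self.exclusive = False

def locate_frame(slot, ex, io_queue, act):
    """slot is a dict with 'severed' and 'frame' (None when not in memory)."""
    if slot["severed"]:
        raise Obsolete
    frame = slot["frame"]
    if frame is None:
        io_queue.append(act)     # the actor waits for the object to arrive
        raise Deferred
    if frame.exclusive or (ex and frame.shared):
        raise Busy
    if ex:
        frame.exclusive = True
    else:
        frame.shared += 1
    return frame

def unlock_frame(frame, ex):
    """Release the lock granted by locate_frame with the same ex."""
    if ex:
        frame.exclusive = False
    else:
        frame.shared -= 1
```

A more primitive version would return an exception for the absent designee and leave the queueing of the actor to its caller, as suggested above.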

The stuff in CMS file XA SCRIPT should be brought here.

Limits

This is historical. These changes have now been made to the pre-XA kernel.

CDA

The old 24 bit CDA supports a disk farm of 2**36 or 64 Gb. That may be bought today for 1.2 M$. This is clearly too small. Installations with more than this much storage already exist.

A 32 bit CDA is easy to code and supports 16 Tb. At current IBM prices it would take about 300 M$ to buy that much disk store. If it were easy to go beyond 32 bits we would. It isn’t.

Parts of the kernel are currently coded for 31 bit CDA’s with the high bit distinguishing between page and node. A 31 bit limit would be OK for a first cut.
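The capacities quoted above follow from simple arithmetic, sketched here on the assumption that each CDA value names one 4096 byte unit (the figure implied by 2**24 CDAs covering 2**36 bytes). The function name is invented.

```python
# Sketch: disk capacity addressed by a CDA of a given width.
UNIT_BYTES = 2**12            # assumed size named by one CDA value

def cda_capacity(bits):
    """Bytes of disk store addressable with a CDA of this many bits."""
    return 2**bits * UNIT_BYTES

GB = 2**30
capacities = {bits: cda_capacity(bits) // GB for bits in (24, 31, 32)}
# capacities == {24: 64, 31: 8192, 32: 16384}, that is 64 Gb, 8 Tb and 16 Tb
```

The 31 bit figure is an upper bound, since it treats every value as naming a full 4096 byte unit.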

Allocation Count

The old count is 24 bits. There are (unimplemented) strategies to cope with its overflow. Our experience with system 2686 indicates that in many months of service (between big bangs) counts seldom reach even 100. A 16 bit count might even suffice if there were sufficient gain thereby. I don’t see that gain yet.

Item space

Here the issue is primarily that of the form of the prepared key. The 16 bit locators {“item index”} of old kernel limit item space to 64 K entries or about 3000 nodes. It is clear that this limit binds for some applications. The bind is that node slots in item space are oversubscribed and nodes are thus shuttled to and from their pots. This, in turn, forfeits the investment in domain, meter and segment preparation. The cost averages perhaps one thousand instructions each for each shuttled node.

The three locators of the prepared key are used to: (1) locate the designated object, (2) provide a doubly linked list of the prepared keys to that object. This list is called the “backchain” and is intimately involved in most of the optimizations of the kernel. The use of the back chain pervades the kernel and I don’t know of alternatives.

The old plan requires a shift and add to convert an item index into a pointer or back. Sometimes an extra register is required when index and pointer are required at once. Register 9 holds the virtual origin of item space in most of the kernel. The 370 instruction set is awkward at storing fields ending on other than byte boundaries. This is especially true for MP hardware.

There are only about 14 references to the backchain in the kernel. 5 or 6 bytes could hold two locators and serve the backchain function.

Item space for a 16Mb machine nominally occupies 512Kb.

Niggling Details

Figure out what to do about ILC. It isn’t very quick anymore because you have to see if user was in 31 bit mode to handle wrap-around.

The simple places to put the code are executed much more often than is required.

Subtracting the ILC from the PSW gets the wrong answer when the machine was in 24 bit mode and the subtract borrows from bit 7. In this case the high byte of the result is FF because the high bit was 0 because that is the ex-mode. A high FF can also be caused by an interrupt when the PSW reads FFXXXXXX. It seems that “SH X,ILC; IF M; TM JRPSW+4,X'80'; IF Z; N X,=X'00FFFFFF'; JOIN; JOIN;” is necessary.
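The instruction sequence quoted above amounts to the following, shown as a minimal sketch. It takes the X'80' bit at JRPSW+4 to be the addressing mode bit (PSW bit 32) and works on the second word of the PSW as a 32 bit integer; the function name is invented.

```python
# Sketch of the ILC correction, transcribed from the SH / TM / N sequence.

def back_up_psw(psw_word2, ilc):
    """Subtract the instruction length from the second PSW word (32 bits)."""
    x = (psw_word2 - ilc) & 0xFFFFFFFF
    minus = bool(x & 0x80000000)                  # IF M: the result is negative
    was_24_bit = not (psw_word2 & 0x80000000)     # TM JRPSW+4,X'80' ; IF Z
    if minus and was_24_bit:
        x &= 0x00FFFFFF                           # N X,=X'00FFFFFF'
    return x
```

The test on the saved PSW is what separates a borrow in 24 bit mode from a legitimately high 31 bit address, both of which leave the high bit of the result on.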

Polish the assertion at (getpag).

Turn on bit 32 in each new PSW.

Waken waiting CPU upon: FORK, I/O or clock comparator interrupt.

If a primary key is invoked that designates a locked node or node that cannot be unprepared due to a locked node, distinguish the cases where the node is the actor’s domain root or a domain root running on another CPU. In the latter case signal that CPU and put your domain on the worry queue and put pointer to current DIB in INGERER.

Ensure that DAT is off while waiting, lest the TLB be polluted.

Change the code in GATE which dispatches on invoked key type to work correctly with 31 bit addresses. (It currently uses high bits to test if the key must be prepared.)

Expanded Store

Several suggestions have been made concerning special purpose use of XA’s expanded store (ES). I consider here its general use.

Perhaps pages can live in ES in some status between being swapped out and being in main store. If a page is in no core frame then it might be found in expanded store. The rumored performance of ES permits the fetching of the page upon page fault thus avoiding the several thousand kernel instructions involved in arranging for I/O and rescheduling the CPU, not to mention the elapsed time saved in fetching the page. This would require a ready supply of page frames to receive the page.

Memory Trees

Purge Time

Presume a global 32 bit value that always increases by one when some CPU does PTLB. (Alternatively imagine the TOD value upon PTLB.) Such a value is called a “purge time”. Each segment table keeps a purge time recorded when an entry in that table is changed or invalidated (validated doesn’t count). Each CPU recalls the purge-time when it last did a PTLB.

When a CPU decides it must attach to a segment table it compares its purge time to that of the table. It purges if it hasn’t purged since the table was last changed.

With each segment table there is a field with a bit per CPU. The bit is 1 if that CPU is attached to the table and usually 0 otherwise. If the table must be changed then first set a “stay out” bit in the table and then signal each CPU whose bit is 1 until they are all 0. CPUs will not attach to tables whose “stay out” bit is on. When a CPU gets the signal it merely switches to another domain which guarantees that it will give up the table and, of course, turn off its bit there.
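The purge time comparison can be sketched as below. The class and method names are assumptions, PTLB is represented only by a counter, and the sketch assumes that a CPU records the global value as it stands after its own increment. The attached bits and the “stay out” protocol are not modelled.

```python
# Sketch of the purge time bookkeeping.

class System:
    def __init__(self):
        self.purge_time = 0              # global value, bumped on every PTLB

class SegmentTable:
    def __init__(self, system):
        self.system = system
        self.changed_at = system.purge_time

    def change_or_invalidate_entry(self):
        # Validating an entry would not need to record anything.
        self.changed_at = self.system.purge_time

class Cpu:
    def __init__(self, system):
        self.system = system
        self.ptlb_count = 0
        self.ptlb()                      # start with an empty TLB

    def ptlb(self):
        self.system.purge_time += 1
        self.last_purge = self.system.purge_time
        self.ptlb_count += 1

    def attach(self, table):
        # Purge only if this CPU has not purged since the table last changed.
        if self.last_purge <= table.changed_at:
            self.ptlb()
```

With this convention a CPU that purged after the change always holds a strictly larger value than the table, so the comparison never skips a needed purge.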

The Coarse Segment Table Origin Problem

Control register 1 (and 7) defines only the left 19 bits of the segment table origin. Segment tables must consequently start on 4K boundaries. Keykos may run with tens of thousands of domains recently active. We have implemented an important set of types of objects that require little more than one page per object. The naïve segment table allocation scheme would require an additional page per object instance. Two schemes have been suggested to ameliorate this cost. The first is described at (trans-sto) and the second at (sync-frame).

Prepared segment nodes:

This is a nascent idea. Suppose that segment nodes produce segment dibs (SDIBs). An SDIB would be an array of 16 SDIB pointers or an array of 16 page table pointers. Each SDIB is associated with just one SSC. SDIBs always point to other SDIBs whose SSC is just one less. SDIB pointers can be null with some null code very quick to sense. Small segment nodes in large contexts would produce an SDIB for each missing level. The SEARCHPORTION code is now reduced to about five instructions per level. (LR, SHIFT, NR, AL, BM FAULT)

The above mechanism is to make the following scheme affordable. What if we are running some system, such as MVS or UNIX, that wants to put “System Data” at the top of virtual memory? We will need many 8K segment tables. One way to meliorate this is to make rebuilding segment tables very fast. We would keep segment tables for only seconds instead of minutes and we would have only dozens of attached segment tables instead of thousands.

The validity assertions would be to the effect that the SDIBs indicate no access beyond the prepared segment nodes and that the mapping tables provide no access beyond the SDIBs.

A node prepared as a segment might produce several SDIBs for different combinations of SSC and RO status just as the current kernel segment node may produce multiple translation tables. In particular a segment node with small SSC in a context of large SSC would produce several degenerate SDIBs in order to conform to the rule that each SDIB points only to SDIBs of one less level.

An SSC=5 SDIB might be a bit-for-bit image of a portion of a real segment table such that one traverse of the new search-portion got 16M of virtual storage defined.

{This scheme seems well suited to a machine where we could make small changes in the translation table architecture.}

The SLOTZAP function requires an SDIB entry to be nullified. This is always direct. The SDIB entry nullification requires tables built therefrom to be nullified. I think that a uniform application of the DEPEND logic might serve here.

This scheme might do very well for a RISC system where the TLB is allocated by kernel code.

It would also presumably have much in common with the 68030 style kernel.

Syncopated Node Frames

A bunch of contiguous real core pages are allocated at IPL time. The first 64 bytes of each of these pages hold a small segment table. Such a segment table can define 16M. Most objects can be described in such a segment table. The remainder of each of these core pages is divided into node frames. Other space would be allocated for segment tables too large to fit in the mini-frames.

Perhaps all node frames are provided in this form. The task of turning a slot address into a node frame address is easy.

Bill says that directory entries are also natural candidates for the “near page frames” left over by small segment tables.

In this scheme I would expect that there was a STOT for each size of segment table. The STOTs could be contiguous for the convenience of routines that wanted to scan just one table. The same goes for page table headers.

Steps Toward Syncopated Node Frames

I have defined two versions of each of two macros. NEXTNODEFRAME Rx,Ry,LOOP produces in Rx the address of the next node frame after the node initially designated by Rx. Control is continued at LOOP if there are more node frames. The XA version skips over the interleaved segment tables. SLOTTONODEFRAME Rx,Ry,Rz converts the slot address in Rx to the address of the node frame in which the slot exists. The XA versions are activated by the global &XA assembler switch.
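The address arithmetic behind the two macros can be sketched as follows. This is Python rather than assembler and is only an illustration of the syncopated layout: the 4096 byte page and the 64 byte leading segment table come from the description above, while NODE_FRAME_SIZE, the function names and the pool bounds are assumptions made for the example.

```python
PAGE_SIZE = 4096        # real core page size
SEGTAB_SIZE = 64        # mini segment table at the front of each page (from the text)
NODE_FRAME_SIZE = 288   # hypothetical frame size, chosen only for illustration
FRAMES_PER_PAGE = (PAGE_SIZE - SEGTAB_SIZE) // NODE_FRAME_SIZE


def slot_to_node_frame(slot_addr):
    """Return the address of the node frame that contains slot_addr."""
    page_base = slot_addr - slot_addr % PAGE_SIZE
    offset = slot_addr - page_base - SEGTAB_SIZE
    if not 0 <= offset < FRAMES_PER_PAGE * NODE_FRAME_SIZE:
        raise ValueError("address is not inside a node frame")
    return page_base + SEGTAB_SIZE + (offset // NODE_FRAME_SIZE) * NODE_FRAME_SIZE


def next_node_frame(frame_addr, pool_end):
    """Return the next node frame, skipping the interleaved segment table.

    Returns None when there are no more frames before pool_end.
    """
    page_base = frame_addr - frame_addr % PAGE_SIZE
    index = (frame_addr - page_base - SEGTAB_SIZE) // NODE_FRAME_SIZE
    if index + 1 < FRAMES_PER_PAGE:
        nxt = frame_addr + NODE_FRAME_SIZE
    else:
        # Last frame of this page: step over the next page's segment table.
        nxt = page_base + PAGE_SIZE + SEGTAB_SIZE
    return nxt if nxt < pool_end else None
```

The sketch assumes the pool begins on a page boundary, so that the page base can be found by masking the address.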

The Segment Table Staging Area

In this scheme we would have a mechanism very much like SHOVE that would allocate segment table space. The tables would not be on page boundaries and could thus not be used directly by the hardware. Upon demand one of these tables would be moved to one of a smaller set of frames that begin on a page boundary. The page boundary segment table frames would be allocated more briefly than those shifty frames managed by SHOVE. There would be about as many real frames as STO-STACK entries.

There is here the dilemma as to whether to mix these ideas and give small segment tables their own real frames.

Segment Tables Among User Pages

This is an idea that might go with any of the above ideas. Where shall we allocate big segment tables? If we grab from the same frame pool as for user pages we gain from not having to purge as often. In the extreme imagine that we allocated a new page frame (or two) each time we needed to change a segment table. Copy the old table into the new frame, make the mod, and use the new frame. No PTLB is required in this case.

The Issues

The STO-STACK

Beyond the evident storage and CPU costs of the above schemes, there are costs that depend on hardware implementation details. Engineers speak of the size of the “sto-stack”. This is where the TLB remembers some set of segment table origins. Reallocating a real segment table frame requires a PTLB that ruins the TLB investment in all segment tables. If we reallocated these frames with an age estimate similar to that used by the hardware we might minimize this TLB loss.

A hack that limits the PTLBs while allocating segment tables only briefly is to allocate them in general page frame space. There would be a bit in the core table entry indicating that this frame has held a segment table since the last PTLB. This has potential performance problems if one must relocate user pages to new frames merely to fool the STO-STACK.

Process-Processor Affinities

A kindred issue that may be necessary to consider here is the TLB and cache investments in a process by a processor. The idea described at (cpu-tlb) tracks which TLBs have been attached to which segment tables. In conjunction with such a scheme it is profitable to maintain an affinity between address segments and processors. This minimizes the refilling of the TLB and cache. It also increases their effective sizes.

It may be necessary to arrange that two domains that invoke each other frequently are connected to the same processor.

Reclamation Issues

Any of these schemes must address the reclamation of any cache type storage.

Big Page Tables for 370

The Scheme

It is proposed that, as a test, we change the 370 kernel to use the “big page table translation architecture”. This would make the 370 kernel more like the XA kernel, aid in testing XA code without access to an XA machine and also provide what might be a clean cleavage point in the XA project. This is a detailed proposal along those lines.

There are three arrays fixed at IPL time: 64 byte segment tables, 16 entry page tables, 256 entry page tables. The latter are called little and big page tables respectively.

Segment table headers are separated from segment tables on XA and are also separated on the old kernel. For commonality they are separated for the 370 big seg kernel as well.

Page table headers are separated on the big seg 370 kernel for the same reason they are on XA.

The segment tables are produced by prepared segment nodes of SSC<=5. Big page tables are produced by prepared segment nodes with SSC<=4. Little page tables are produced by prepared segment nodes with SSC=3.

Degenerate tables

An SSC3 node in a 16M context produces a degenerate 16 entry segment table that designates its normally produced little page table.

An SSC4 node in a 16M context produces a degenerate 16 entry segment table that designates its normally produced big page table.

A page in a 16M context produces both a degenerate segment table and a degenerate little page table. (These can be on the same chain.)

A page in a 1M or 64K context produces a degenerate little page table.

370 segment and page tables must start at multiples of 2**6 and 2**3 bytes respectively while XA tables must start at 2**12 and 2**6 respectively. XA segment tables will present problems that I don’t address here. In particular I don’t want to burden the 370 with the XA segment table allocation hair. The array of small page tables can be made rather similar in the two systems except that the XA page table entry is four bytes instead of two as in the 370.

It is necessary to convert from page table entry address to page table header address. This can be done by CL, BC, SL, N, SRL, MH, AL if the page tables are packed and not interleaved with headers. This argument, along with commonality, persuades me not to interleave and either locate page tables and page table headers with indexes or to include a page table pointer in the header. The page table headers could be uniformly indexed if this is any benefit. Indexes are wrong if we decide to chain produced tables on one chain which I now propose.
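The conversion can be sketched as below, in Python rather than 370 assembler. It assumes two packed arrays (little and big page tables), each with a parallel packed array of headers; the two byte 370 page table entry is from the text, while HEADER_BYTES, the TableArray record and all base addresses are hypothetical. The range test plays the part of CL and BC, and the subtract, divide, multiply and add play the parts of SL, SRL, MH and AL.

```python
from collections import namedtuple

PTE_SIZE = 2                          # 370 page table entry is two bytes
LITTLE_TABLE_BYTES = 16 * PTE_SIZE    # 16 entry page table
BIG_TABLE_BYTES = 256 * PTE_SIZE      # 256 entry page table
HEADER_BYTES = 16                     # hypothetical header size

# One packed array of page tables and its parallel array of headers.
TableArray = namedtuple("TableArray", "base count table_bytes header_base")


def pte_to_header(pte_addr, arrays):
    """Map a page table entry address to its page table header address."""
    for a in arrays:
        end = a.base + a.count * a.table_bytes
        if a.base <= pte_addr < end:                       # which array?
            index = (pte_addr - a.base) // a.table_bytes   # which table?
            return a.header_base + index * HEADER_BYTES
    raise ValueError("address is not in any page table array")
```

For example, with a little array at x'20000' whose headers start at x'30000', an entry in the fourth little table maps to the header at x'30030'.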

Reclamation

Segment tables are reclaimed much as now except that they are all the same size and miscellaneous hair is avoided.

Big and little page tables are reclaimed separately but by the same algorithm. It is done just like the old kernel.

Remaining Questions

How far should we go in changing segment table stuff? Shove has no remaining function. How much of the change would move toward XA? The answer may depend on yet undesigned XA code. Note that in the XA system a DIB will point to just one size of segment table while it points to a valid table. There is no dynamic resizing as in the old kernel. This means that we could have a different STOT for each size of segment table.

Some Differences

STOTE and PTFRAME now both require pointers to the table proper. That pointer should perhaps include the table size code as required by the hardware although this will force occasional adjustment when such a pointer is used by the program to locate the table.

Steps to get there

DONE!! The STOTE and the PTFRAME should be unified so that fields BGKEY, FMT, CSID & NEXT are at the same offsets. The Field PRODUCER might as well be unified as well. When MEMORY looks for a sharable mapping table it can now search on the concatenation of BGKEY, FMT & CSID.

Throw out SHOVE. Add allocation of three kinds of tables to INIT. Define two tables of page table heads.

Change callers of SEAP to reflect the new relations between nodes and tables. This includes references to the new coroutine stuff.

There are two bits in FMT that can be naturally used to code the kind of mapping table that this is to the extent that that information is not already determined by FMT.

DONE!! kernel-logic,whendepend must be reworked.

The code SEARCHPORTION (or SEAP) is little modified. The callers will be changed considerably.

XA ideas

This is a trial design assertion. Every HASHNEXT field of a node frame or element of the array NODCHHD either points to a node frame or to a terminator. There are no instantaneous loops in chains except the terminators that are chained in a loop. (This might be relaxed to say that there are only momentary loops.) The purpose of this assertion is to make valid the simple following of a chain. Each seek will terminate.

The above is probably confused because it seems necessary for a processor to lock a terminator exclusively during the search. This lock may as well deter any updater as well. The deallocation of a node frame can naïvely be done without finding the terminator.

Without Terminator

When I consider the cost of locking the terminator I wonder if it might not be better to add two termination instructions to the inner loop and ensure short loops by many chain heads. If we maintain that the link in each frame points to a frame then we can search without locking and almost certainly find the target if it is in a frame. If you think that you found the target, try to lock it. If you get the lock look again and see if it is really the right node. If it is the wrong node, increment some strange counter, unlock the node and try again. If the node is already locked consult a (newly invented) field of the frame that indicates whether this is a long term lock (some other processor is running domain code that involves this node) or short term. If short term go find something better to do. If long term find out which processor is running the job and trip him (SIGNAL PROCESSOR). Put your actor in a location where he will find it. Put your actor on the CPU queue as well in case the other processor gets distracted. (This can happen if some third processor signals the same guy.) He should run what was your actor who will then find his operand modifiable.

The following code searches the chain fairly fast.

USING NFNODE,R1
SL: CLC NFCDA,TCDA; L R1,NFNEXT; BE FOUND;
CLC NFCDA,TCDA; BE FOUND; L R1,NFNEXT; CLR R1,R2; BNE SL;

(SEE IF YOU CAN FIND THE BUG IN THIS!)

When one would remove a (locked) node from its hash chain a compare and swap seems ideal. Search from the chain head, guarding against the possibility that your node isn’t found due to someone mucking with the chain even as you run. If you find your node the previous pointer can be modified with the CS. Adding a node seems nicely done with the CS as well. First copy the old chain head value into the link field of the newly allocated frame and then do a CS to redirect the chain head to the new frame, but only if no one else has meddled with it in the meantime.
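The add case can be sketched as follows. Python has no compare and swap instruction, so the Cell class below stands in for a word of storage updated by CS (a small internal lock makes the compare and the store one indivisible step). Cell, Frame and add_to_chain are names invented for this illustration, and only the addition of a node is covered, not removal.

```python
import threading


class Cell:
    """A word of storage with an atomic compare-and-swap (stand-in for CS)."""

    def __init__(self, value=None):
        self.value = value
        self._guard = threading.Lock()

    def compare_and_swap(self, expected, new):
        # Store `new` only if the cell still holds `expected`.
        with self._guard:
            if self.value is expected:
                self.value = new
                return True
            return False


class Frame:
    """A node frame with a link field (hypothetical layout)."""

    def __init__(self, cda):
        self.cda = cda
        self.next = None


def add_to_chain(head, frame):
    """Push frame onto the chain whose head is the Cell `head`."""
    while True:
        old = head.value            # copy the old chain head value
        frame.next = old            # into the link field of the new frame
        if head.compare_and_swap(old, frame):
            return                  # nobody meddled; the head now points to us
        # Someone else changed the head in the meantime: retry.
```

The retry loop is the usual idiom when the CS fails: reread the head and try again.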

Upon closer examination I find that updating the chain, either to add members or remove them seems to raise serious interference problems. If processor x who has locked node x argues that he can change the link in node y with a CS, then some other processor can change x’s link for the same reason. The obvious answer is to lock the chain for updating.

Another stab, however, is to try the following: For deletion, you have node X locked. X is to be removed. Seek the reference to X. Lock the node or header with that reference. (Call that Y.) Verify that it still points to X. Copy X’s link value into Y’s. Obliterate X’s link value. Unlock both. Rule: Only one link (even among free frames) points to a given frame. Worry about deadlocks...

Another really wild stab is: Almost forget about locks. Maintain the rule that links in headers and frames point to frames or terminators. Chains may now become scrambled. Terminate searches by count. There will be no false hits, only false misses. If you think the chains are scrambled lock everyone out from modifying them and rebuild them.
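A search terminated by count might look like the sketch below. It is Python, not kernel code; the Frame class, the use of None as terminator and the limit value are assumptions for the illustration. The point is that even a chain scrambled into a loop cannot hang the searcher; it can only produce a miss.

```python
class Frame:
    """A node frame with a disk address and a link (hypothetical layout)."""

    def __init__(self, cda):
        self.cda = cda
        self.next = None   # None plays the part of the terminator here


def bounded_search(first, target_cda, limit):
    """Follow at most `limit` links looking for target_cda.

    Returns the frame, or None on a miss. A miss may be false if the
    chain was scrambled, but a hit is always a frame with the right CDA.
    """
    frame = first
    for _ in range(limit):
        if frame is None:
            return None
        if frame.cda == target_cda:
            return frame
        frame = frame.next
    return None   # count exhausted; the chain may be scrambled
```

A caller that sees the count exhausted might treat that as the hint that the chains need rebuilding.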

Changing Segment Table Entries

I think this scheme is inferior to that at (purge-time).

While the hardware provides the IPTE command to aid changing page tables in an MP configuration there is no such aid in changing segment tables. I propose here a scheme that nearly works and is easy to understand, point out the flaw and try to fix it.

Associated with each segment table there are three bits associated with each CPU. Their meanings are:

Corrupt: The CPU may have TLB entries from this table that are not to be used because they are no longer in the segment table and access must be denied.

Attached: The CPU may just now be attached to this segment table.

WasAttached: The CPU has been attached since the last PTLB.

There is also an exclusive lock for the table.

When you need to change a segment table:

Seize the lock.

For each CPU currently attached to this segment table (as indicated by Attached) signal that CPU to switch to another process.

Wait until there are no CPUs attached. (There were probably none to begin with!)

Change the table.

Or all WasAttached bits for this table into the Corrupt bits.

Release the lock.

When you need to use a segment table:

Turn on your Attached bit and WasAttached bit for this table.

If the exclusive lock is set wait or go do something else.

If your Corrupt bit is set for this table, do a PTLB and reset your Corrupt and WasAttached bits for all segment tables. (Too slow.)

Use the table.

When you are done using the table:

Turn off your Attached bit for this table.

Summary:

Set Attached when you load CR1. Reset it when you load something else.

Set WasAttached when you set Attached and reset it when you do PTLB.

Set Corrupt for a table and CPU when you change that table and that CPU WasAttached to the table. Reset it when you do PTLB.
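The bookkeeping can be modelled in a few lines. The sketch below is a single threaded Python model, so the exclusive lock and the signalling of other CPUs are left out entirely; the class and method names are invented for the illustration. As a simplification it also tests the Corrupt bit before setting Attached and WasAttached, so that the PTLB does not clear the bit just set.

```python
class SegTable:
    """Per-table state: three bits for each CPU (lock omitted)."""

    def __init__(self, ncpus):
        self.attached = [False] * ncpus
        self.was_attached = [False] * ncpus
        self.corrupt = [False] * ncpus
        self.entries = {}


class Cpu:
    def __init__(self, ident, tables):
        self.ident = ident
        self.tables = tables      # every segment table in the system
        self.ptlb_count = 0

    def ptlb(self):
        self.ptlb_count += 1
        for t in self.tables:     # this sweep is the expensive part
            t.corrupt[self.ident] = False
            t.was_attached[self.ident] = False

    def attach(self, table):
        if table.corrupt[self.ident]:
            self.ptlb()
        table.attached[self.ident] = True
        table.was_attached[self.ident] = True

    def detach(self, table):
        table.attached[self.ident] = False


def change_table(table, key, value):
    """Change a table that no CPU is currently attached to."""
    if any(table.attached):
        raise RuntimeError("signal the attached CPUs and wait first")
    table.entries[key] = value
    for cpu, was in enumerate(table.was_attached):
        if was:                   # OR WasAttached into Corrupt
            table.corrupt[cpu] = True
```

The loop over all tables in ptlb() is where the cost noted in this section shows up.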

There are some lock problems in the above. They may be hard. A serious problem is the cost of turning off the Corrupt and WasAttached bits for each table.

A simpler scheme

There is a remarkably simple idea that I think obviates all of the above: Each segment table carries the time (or event counter value) at which it was last modified (much like the date on a file directory). Each CPU remembers when it did its last PTLB. Before it attaches a segment table it checks whether the table has been modified since its last PTLB and does a PTLB if necessary.

For these ideas an event is the modification of a segment table. The event counter is the global counter maintained (with CS) by any one who modifies segment tables. An event counter value is the state of that counter when a certain CPU did PTLB.
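A minimal sketch of the event counter idea, in Python and single threaded (the real counter would be advanced with CS). The class and function names are invented for the illustration.

```python
class EventCounter:
    """Global count of segment table modifications."""

    def __init__(self):
        self.value = 0


class SegTableFrame:
    def __init__(self):
        self.last_modified = 0    # counter value at the last modification


class Cpu:
    def __init__(self):
        self.last_ptlb = 0        # counter value when this CPU last did PTLB
        self.ptlb_count = 0


def modify(counter, frame):
    """Record a modification of the table in this frame."""
    counter.value += 1
    frame.last_modified = counter.value


def attach(counter, cpu, frame):
    """Attach cpu to frame, purging the TLB only if the frame is newer."""
    if frame.last_modified > cpu.last_ptlb:
        cpu.last_ptlb = counter.value
        cpu.ptlb_count += 1       # stands for the PTLB instruction
```

Note that a PTLB brings the CPU up to date with respect to every frame at once, so no per-table bits need to be reset.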

Ramifications on the preservation of TLB investment.

It is best to think here, as the hardware does, of table frames rather than tables. The hardware knows a table by its address, not by its logical identity. A new table at an old address may be confused with the old one by the hardware. The ‘time of modification’ associated with a segment table must be associated with the frame. This observation militates against the idea of a few segment table frames presented to the hardware and a cache of many small segment table values briefly moved into one of these few frames in order to avoid using a page frame to store a few bytes of segment table.

I think that many words in POP can be boiled down to saying that when a segment table entry value is used to translate an implicit address, any value that has been in storage between the most recent PTLB and now may be used (even if more recent values have already been used).

This scheme doesn’t work without IPTE.

Disabling unneeded ESA features

This list was made by considering the definition of each instruction with the ‘Q’ or ‘A1’ attributes in appendix B of PoO.

CR 0.4 {extraction authority control}

Instructions EPAR, ESAR, IAC, IPK, IVSK are also disabled hereby. This effectively reserves the meanings of these instructions.

This hides system states from domain programs. These system states may contain information that may be secret or may change at undocumentable times.

CR 0.5 {secondary space control}

This must be zero to intercept the instructions MVCP, MVCS and SAC.

CR 0.15

This bit must be 0 to disable implicit access to the “home address space” by instructions: BAKR, EREG, ESTA, MSTA and TAR.

This bit also must be 0 to establish the meaning of bit 0 of CR 5. See below.

CR 3.0:15

These bits must be 0 to thwart user mode SPKA, MVCDK, MVCK and MVCSK.

CR 5.0 and CR 0.15

These bits must be 0 to thwart PC, PT and TR instructions.

CR 14.12

This bit must be 0 to thwart LASP, SSAR, PC and PT.

DEBUGGING TOOLS

Record and Playback Logic

As the kernel runs, the progress of its state-vector is determined by stimuli that are external to the kernel. The kernel has been instrumented to run in two special modes, record mode and play mode.

In record mode, the kernel records all of these stimuli.

In play mode, the kernel is fixed to act as if it were experiencing these stimuli. Most of the kernel will execute in play mode as it did in record mode, including bugs that are being sought.

The state vector includes item space, directories, the DEPEND relationship, etc. It excludes everything stored outside main storage, user pages, and node pots.

The stimuli are interrupts {including SVC’s by user programs}, results of looking in user’s memory, copying of nodes from pots to item space, reading the time of day clock, strings passed to primary keys, and condition codes, CSW, sense bytes, etc., resulting from I/O commands.

Format of the Records

The records are stored contiguously in main storage. The first byte determines the meaning and format of the record. This is a list of the stimuli and the format of their records:

(1,ilc,ic,...) Program interrupt from user mode.

If ic was protection exception then the right 3 bytes of the old PSW follow.

If there was a program event {PER} then the per code and per address {6 bytes} follow.

If it was a monitor event then the 6 bytes of monitor information follow.

If bit 11 of the ic is on {translation exception} then 3 bytes of the translation exception address follow.

(2,ilc,svc code,...) SVC

If the svc code >= 00FD then bytes 1 through 5 of the entry block and the 8 byte exit block follow.

(3) Clock comparator interrupt

(4) Process timer interrupt

<(5, csw {8 bytes}, dev addr, (gtt stuff))> I/O interrupt.

If the first byte of the device address is zero, the following 21 bytes follow on the assumption that the device is a GTT port: 4 bytes of TIREADDR, 4 bytes of TIRECUSR, 1 byte of TIRENDCD, 12 bytes of TIREOLINE

(6, ilc, ic, oldpsw, ...) Program interrupt from privileged mode.

The same fields follow conditionally as for the program mode prog-int, except for the PSW fragment.

(7, 2 byte string length, string) Primary key accepts data string.

(8, 4 byte virtual addr) A virtual address has been determined in response to a protection interrupt.

(9) MVCL Ghost.

The ghost of a suicidal MVCL has caused a protection exception that will not repeat.

(10, 3 byte DIB addr) A user has been started. If the DIB addr is RNDIBS, the idle process has been started.

(11, 8 byte TOD value) The clock was read.

<(12, cc, (csw {8 bytes} if cc=1))> A BRS was issued and a condition code was returned.

(13) Check ran successfully.

(14, 8 byte CPU timer value) The CPU timer was read.
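To make the layout concrete, here is a sketch in Python that builds and scans a stream containing only the record types whose length is fixed or self-describing: types 3, 4, 8, 9, 10, 11, 13, 14 and the string record 7. The conditionally sized records (1, 2, 5, 6 and 12) are deliberately left out, and the function names are invented for the illustration. The scan stops at a zero type byte, which is taken to mark the end of the records.

```python
import struct

# Payload length in bytes for the fixed-size record types listed above.
FIXED_PAYLOAD = {3: 0, 4: 0, 8: 4, 9: 0, 10: 3, 11: 8, 13: 0, 14: 8}


def encode(kind, payload=b""):
    """Build one record: a type byte followed by its payload."""
    if kind == 7:
        # Primary key data string: 2 byte big-endian length, then the string.
        return bytes([7]) + struct.pack(">H", len(payload)) + payload
    if len(payload) != FIXED_PAYLOAD[kind]:
        raise ValueError("wrong payload length for record type %d" % kind)
    return bytes([kind]) + payload


def scan(stream):
    """Return (type, payload) pairs up to the first zero type byte."""
    records = []
    pos = 0
    while pos < len(stream) and stream[pos] != 0:
        kind = stream[pos]
        if kind == 7:
            (length,) = struct.unpack_from(">H", stream, pos + 1)
            start = pos + 3
        else:
            length = FIXED_PAYLOAD[kind]   # KeyError for types not covered
            start = pos + 1
        records.append((kind, bytes(stream[start:start + length])))
        pos = start + length
    return records
```

A scanner for the full format would need the conditional fields of the interrupt records as well.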

We anticipate adding records of nodes as they are moved from node pots to item space.

Same with the information from the external migrator.

Some information from the TBA will be needed.

Records of allocation pot contents will be needed.

The record is kept in an area of storage next to the code. The record logic runs just if it is loaded {external symbol RECORD is defined}. KERNLINK provides an option to load RECORD. The word at ASTREAM contains the address of the beginning of the area, and the word at AESTREAM contains the address of the end.

The record and playback mechanisms both use the field LOGP in page zero. That cell holds the address of the next record.

The record is kept in contiguous half-page frames reserved for the purpose. The storage key for these frames is usually 15. In order to know when the record is about to come to the end of the space reserved for it, the last few {currently only one} frames normally have storage key 0. These are called end frames.

The commands that produce the record run with PSW key 15 and will cause a kernel protection exception when they come to the end frames. The protection exception logic modifies COMEHERE and changes the storage keys of these end frames to 15 and then restarts the program. When a user would have been dispatched, the PSW in COMEHERE leads to code called VTIME {for “valid time”}. VTIME will copy the records that have been produced in the end frames to the beginning of the record frames and adjust LOGP to wherever the end of the records has been moved, and change the end frame keys back to 0.

The above protection exception logic will, as a consistency check, verify that LOGP indeed points into the end frames.

If the end frames are filled before VTIME has run, the system will probably crash. This may be good as evidence of a loop in the kernel. If it turns out that there are legitimate activities in the kernel that might produce many records without dispatching users, those activities may be modified to test for a pending VTIME execution.

Proofs about assertions about denial of service would be closely related to assertions that the end frames will never be filled.

{NI}The Copy Logic

A rather independent mechanism may be running that copies the records onto disk.

The copy logic is unaware of the end frames.

When this mechanism is running, it may create an obstacle to keep the record logic from getting too far ahead of the copy logic.

This obstacle is formed by changing the keys of some contiguous record frames from 15.

This obstacle consists of two parts, the elastic and inelastic parts. The elastic part is motivated for reasons like the end frames, namely that the record logic cannot be stopped instantaneously. The elastic part will be the size of the end frames {perhaps one}.

If the record logic runs into the inelastic part of the obstacle, the system crashes. This is similar to the exhaustion of the end frames and indicates that the kernel is in a loop. The inelastic part is just one frame long.

The copy logic can read LOGP. If it were to see it in the end frames it would suffice to act as if they were just at the beginning of the record frames.

Two rather different strategies for the copy logic are supported by the above ideas.

In the first, one always tries to write the data as soon as a frame is done. This can be accomplished by placing the obstacle in the frame just beyond LOGP when there are no uncopied but done frames. This scheme might take more CPU overhead and more arm time.

In the second scheme, one always waits until the frames are exhausted and then starts the channels and stalls. The disadvantages are clear here.

A cross between these two would be optimal in most cases.

It might be that the pages which constituted the record frames might be logical pages with permanent core-locks. They would thus have core table entries. The journalizing logic might be used to cause these pages to be written.

The routine SCANREC was designed to scan over each record type while examining a crash with DDT. SCANREC stops when it comes to a record whose initial byte is zero, presumably the end of the records. LOGP should be increased by some amount before running SCANREC, because the record logic is still in effect and will record the “kernel traps” in SCANREC, which has caused great confusion and consternation.

Internal Logic of Keeping the “record”.

The half page frames reserved for the stream normally have storage key 5. A few at the end {perhaps one} normally have another storage key. These are called the end frames. The address of the first frame is STREAM and ESTREAM is the address of the first byte beyond the end frames. The address of the first end frame is REDLINE.

The scattered parts of the kernel that produce the records that constitute the stream place their records at the address specified on LOGP and update LOGP.

There are two uses of storage keys different from 5 in these frames; to cause wrap-around and to trigger copying the stream to disk.

When an attempt is made to form the first record in an end frame a protection exception leads to code that changes the storage keys of the end frames to 5 and also sets the transition trap.

When this trap occurs the record fragments that are in the end frames are copied to the beginning of the frames (STREAM) and the storage keys of the end frames are changed from 5 again. The value in LOGP is also reduced by REDLINE-STREAM.
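The copy back step can be sketched as follows, in Python with the stream held in a bytearray. Storage keys and the trap itself are not modelled; only the movement of the record fragments and the adjustment of LOGP are shown. The function name and the small buffer sizes in the example are assumptions.

```python
def wrap_stream(buf, stream, redline, logp):
    """Move records written into the end frames back to the start.

    buf     -- bytearray holding the frames
    stream  -- offset of the first frame (STREAM)
    redline -- offset of the first end frame (REDLINE)
    logp    -- offset of the next record (LOGP)
    Returns the new value of LOGP.
    """
    if logp < redline:
        return logp                       # nothing in the end frames yet
    length = logp - redline               # bytes of records in the end frames
    buf[stream:stream + length] = buf[redline:logp]
    return logp - (redline - stream)      # LOGP reduced by REDLINE-STREAM
```

For instance, with STREAM at 16 and REDLINE at 48, five bytes written at 48 through 52 end up at 16 through 20 and LOGP moves from 53 to 21.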

When the stream information is being recorded on disk there is always one other frame with key different from 5. This is the frame before.

When the stream information is being recorded on disk, two activities chase each other around the ring of frames from STREAM to REDLINE; the production of the records and the copying to disk.

Rule: The storage key of every frame from STREAM to REDLINE is 5 except the frame whose address is in LOGC. The storage key of that frame isn’t 5.

The frame at LOGC can accept records but the next and succeeding frames cannot because the disk copy may not be completed. When the copy is complete, LOGC will be circularly incremented and the storage keys will be adjusted to conform to the rule.

If we choose to stall the entire system when the recording on disk of the stream cannot keep up with its production, we will stall when the transition trap fires.

Much of this logic is predicated on the assumption that the user mode transition trap will fire before a frame of records has been produced. If the kernel should loop this may not be the case. The production of more than a frame of records will be detected and that may serve the useful function of detecting a class of kernel loops.

Kernel Storage Dumps

When the kernel running without a DDT detects an internal error {e.g., an unexpected program interrupt}, it will take a storage dump. The dump will be written in the first range it finds which is marked in the pack descriptor record as being a “dump” range.

A directory called “DUMP.” should be in your directory called “USER.”. New dumps are collected there. If D is a key to an unneeded domain, then “KC D 48 (,dump-name)” puts the key in the domain’s key 0 so that a “kernsyms;;” command to a DDT over that domain will give your DDT access to the kernel’s state at the time of the crash. The general registers from the dump are moved to the domain.

There is a domain that will copy the dump from the dump area {to which it has a range key} to a segment and enter it into a directory where a version of DDT that analyzes kernel dumps may access it. See (p2,cdump) and (p3,cdump-today).

If D is a key from DUMP. then:

D is a segment node via which the first meg of the dumped storage appears read only.

D(0;==>;SEG) returns a key, SEG, with the memory image of the dumped kernel. It has DK(0) at undumped addresses.

D(1;==>0;Header_Page). See (d-header).

D(2;==>0;BANK) returns this dump’s bank.

D(3;==>0;Guard) returns this dump’s bank’s guard.

D(kt+4;==>0) deletes the dump.

D also obeys some orders issued by Dump DDT’s “kernsyms;;” command.

Dump Format

Header Page - First page in kernel dump range.

000 Format version number - current is zero.

004 Filler

008 General Registers - at entry to dump.

048 Floating Registers - at entry to dump.

068 Control Registers - at entry to dump.

0A8 CPUID - of the processor dump is running on.

0B0 Time of Day Clock - at entry to dump.

0B8 Clock Comparator - at entry to dump.

0C0 CPU Timer - at entry to dump.

0C8 The first 256 bytes of memory - at entry to dump.

1C8 4K bits {512 bytes} of page present bits. 1 = page is present in dump image; 0 = page is not present in dump image.

Core Image Pages - Rest of pages in kernel dump range.

One page for each “1” bit in the page present bits in the dump header.
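The header layout above can be unpacked with a short Python sketch. The field offsets follow the table; the dictionary keys and function names are invented here, and two details are assumptions rather than documented facts: that the leftmost bit of the first bitmap byte stands for page 0, and that fields are stored big-endian as is natural on the 370.

```python
import struct

# version, filler, 16 GRs, 4 FPRs (raw), 16 CRs, CPUID, TOD, clock
# comparator, CPU timer, first 256 bytes of storage, page present bits.
HEADER = struct.Struct(">I4x16I32s16I8sQQQ256s512s")


def parse_header(page):
    """Split the dump header page into named fields."""
    f = HEADER.unpack_from(page)
    return {
        "version": f[0],
        "general_registers": f[1:17],
        "floating_registers": f[17],
        "control_registers": f[18:34],
        "cpuid": f[34],
        "tod_clock": f[35],
        "clock_comparator": f[36],
        "cpu_timer": f[37],
        "low_storage": f[38],
        "page_present": f[39],
    }


def present_pages(bits):
    """List page numbers whose bit is 1 (leftmost bit first, an assumption)."""
    return [i * 8 + b
            for i, byte in enumerate(bits)
            for b in range(8)
            if byte & (0x80 >> b)]
```

The structure size works out to x'3C8' bytes, which agrees with the offsets in the table (the bitmap starts at 1C8 and is 512 bytes long).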

State Checksumming

In the module SCAFOLD, there is code in support of the following idea.

Suppose that we run the kernel and checksum the state at each SVC {for example}. If we do this to a trusted version of the kernel and then to an untrusted version, we might find out what is wrong with the untrusted version. If we discover that the checksum differs first after the n’th SVC, we run the trusted kernel and record states n-1 and n. We then run the untrusted kernel to produce the same states and compare these states with those of the trusted kernel.
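Finding n from the two runs is a simple comparison of the checksum sequences. The Python sketch below is not the SCAFOLD code; the function name is invented and states are numbered from zero.

```python
def first_divergence(trusted, untrusted):
    """Return the index of the first differing checksum, or None."""
    for n, (good, suspect) in enumerate(zip(trusted, untrusted)):
        if good != suspect:
            return n
    return None
```

If the result is n, states n-1 and n are the ones worth recording in full on the next pair of runs.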

At IDLE, one does WATCH;g if checksums are to be produced for each state. {This is slow!}

One does WAIT;g at IDLE if it is desired to record or compare states n-1 and n.

In this case, n is placed in cell Q of SCAFOLD. {Actually the states are numbered by addresses where the checksums are, or would be stored; thus they are multiples of 4, starting, say, at x'400000'.}

Bit 7 of byte at WRITE in SCAFOLD determines whether this run will produce information or compare it.

After doing WATCH;g in WRITE mode, the checksums will be stored using the cursor at P in SCAFOLD.

After doing WATCH;g in “not WRITE” mode, the checksums will be compared using the cursor at P in SCAFOLD.

After doing WAIT;g, the two states designated in Q will be awaited and either recorded or compared {depending on WRITE}.

The CP command “I 190” does not clear storage. This allows records of states and checksums to be preserved across two runs of the kernel.

Reproducibility

The problems of reproducibility severely restrict the use of this tool. Changing {at IDLE} control register 0 from 80800C00 to 80800000 prevents external interrupts. This helps reproducibility but may prevent the activity that provokes the bug.

If programs in domains make decisions based on TOD values, this scheme may fail. This scheme might be combined with the record/playback logic to avoid this problem.

{NI}Diagnose

PERD {PER Diagnose}

This is an idea for a key to aid in finding kernel bugs and to aid in kernel tuning.

I concentrate first on a facility utilizing the PER hardware while running the kernel in support of discovering the behavior of the kernel. We call these facilities “Kernel PER”. The PER DIAGNOSE key {PERD} controls kernel PER and attempts to protect the system from misuse of the key. Hopefully, PERD should be no more dangerous than the PEEK key.

The Kernel PER state has the following state variables:

values to be placed in control registers 9:11

a PER event limit

which {a subset} of the four kernel entries should enable PER

These are SVC, I/O interrupts, program interrupts, and external timers.

a buffer of recorded events called the “PER buffer”

a bit vector indicating which fields of the PER record should be included in the PER buffer

The fields of the record are:

address of the responsible instruction

the right end of the PSW upon PER interrupt

the main store and register value

an address whose contents may be requested in each recorded event

a register designation whose value may be included in the record

{?}a boolean expression which conditions the recording of the event

There are calls on PERD to set and start the facility. There is also a call to read and empty the buffer.

The more delicate parts of this facility are in place already.

The event limit is used to handle the case in which the recording takes so many resources that PERD cannot be used to turn off the facility. The facility turns off when the limit is reached.
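The self-disabling behavior of the event limit can be sketched as below. This is a Python model of the idea only; the class and method names are invented, and nothing here reflects actual PERD calls.

```python
class PerFacility:
    """Model of a PER buffer that switches itself off at an event limit."""

    def __init__(self, limit):
        self.limit = limit
        self.count = 0          # events seen since the facility was started
        self.enabled = True
        self.buffer = []

    def record(self, event):
        """Record one PER event; returns False if the facility is off."""
        if not self.enabled:
            return False
        self.buffer.append(event)
        self.count += 1
        if self.count >= self.limit:
            self.enabled = False    # limit reached: turn the facility off
        return True

    def read_and_empty(self):
        """Return the recorded events and clear the buffer."""
        events, self.buffer = self.buffer, []
        return events
```

Once the limit is reached no further events are taken, so the holder of the key can always get control back to read the buffer.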

KKRD {Call Kernel Routine Diagnose}

There are several routines in the kernel that have no effect on the logical state of the system. They cause transformations between states whose purpose is to optimize the kernel. Driving the kernel into combinations of these states in synchronism with actions by a domain may be able to provoke bugs solidly that are otherwise very intermittent.

Some examples are:

Remove a node from item space.

Swap out a node pot or page.

Call stotzap. {We can already do that pretty well!}

Unprepare some node.

See (check) about a built-in runtime debugging aid.

PERFORMANCE CONSIDERATIONS

The First Benchmark

An initial load test has been prepared to stress test Gnosis and identify performance bottlenecks.

Objectives

Determine whether Gnosis will run with many active processes.

Develop initial performance data on both memory and CPU utilization of “trivial” transactions. This information will be used to project Gnosis applicability for use as an ISD switch.

Identify operational problems in running Gnosis for real.

Identify and resolve performance bottlenecks in trivial transactions.

Building a New Version of Adventure under CMS

To create a new Adventure, do:

GLOBAL MACLIB PLIGATES

PLIOPT GNADVENT (MACRO

PLIFLINK GNADVENT PLISUBS

PLIFLINK EXEC is on the OSSIM disk.

Initialization of the Benchmark System

Many files must be transferred from CMS for use in the benchmark. These files reside on the Gnosis application disk {GNADVENT MODULE, ADVENT DATA, ADVBUILD CMDFILE, and the application controller files}. You may bring up a fresh Gnosis on either the real disks or minidisks. The minidisks owned by GNOSIS2 will run up to 5 users if a dump has been collected, and up to 10 if no dump has been collected. Larger tests will only run with the real disks.

After acquiring the proper disks, LG GNOSIS. Before starting the system, get symbols for ws {“ws;t”} to locate the system counters. It will save you a lot of grief later if you locate ws now {“ws=”} and write down its address.

There are two ways to move the Adventure files to the new Gnosis, depending on whether or not the Gnosis system has a base.

If the Gnosis system has a base, do the following:

Log on to Gnosis using an ID that has a circuit key to some CMS userid. Issue the following two commands to get command files that will build the application controller system and adventure. “AUXFILE GET RECORD APPGET TYMNET.userid APPGET CMDFILE "XFER"”, “AUXFILE GET RECORD ADVGET TYMNET.userid ADVGET CMDFILE "XFER"”.

Edit the two command files using the EDIT command. Change the circuit key designated in the command files to one that corresponds to the circuit key used to get the files. Then issue the following commands, “CMDFILE APPGET”, “CMDFILE ADVGET”.

If the Gnosis system has no base but has access to a transfer disk, then do the following:

Acquire a transfer disk {the Gnosis 1AF is particularly spacious}, and use the CMS ADVSEND exec to transfer the generalized applications controller system files, and the Adventure files {painlessly!}. Write down the password it returns; you will need it later. {ADVSEND invokes MGSEND, the CMS side of the multiple segment receiver and unpacker {SRUP} described in (p2,srup).}

Now log onto S370 and create the Adventure controller as follows: {It is necessary to ready the 134 disk, if the transfer was done while Gnosis was running.}

A:RECEIVE password

THE CMS FILE NAME WAS "CCC MODULE E1"

WHAT NAME WOULD YOU LIKE FOR THIS SEGMENT? ccc.seg

THE CMS FILE NAME WAS "CREATOR MODULE E1"

WHAT NAME WOULD YOU LIKE FOR THIS SEGMENT? creator.seg

THE CMS FILE NAME WAS "APPLDB MODULE E1"

WHAT NAME WOULD YOU LIKE FOR THIS SEGMENT? appldb.seg

THE CMS FILE NAME WAS "APPLQURY MODULE E1"

WHAT NAME WOULD YOU LIKE FOR THIS SEGMENT? applqury.seg

THE CMS FILE NAME WAS "APPBUILD CMDFILE E1"

WHAT NAME WOULD YOU LIKE FOR THIS SEGMENT? appbuild.seg

THE CMS FILE NAME WAS "GNADVENT MODULE E1"

WHAT NAME WOULD YOU LIKE FOR THIS SEGMENT? adventseg

THE CMS FILE NAME WAS "ADVENT DATA E1"

WHAT NAME WOULD YOU LIKE FOR THIS SEGMENT? advdata

THE CMS FILE NAME WAS "ADVBUILD CMDFILE E1"

WHAT NAME WOULD YOU LIKE FOR THIS SEGMENT? advbuild.seg

A:CREATE APPBUILD APPBUILD.SEG

A:CREATE ADVBUILD ADVBUILD.SEG

The files needed to build the Adventure environment are now in the Gnosis system. Issue the following two commands: “CMDFILE APPBUILD” {to build the application controller environment}, “CMDFILE ADVBUILD” {to install Adventure in the application environment}.

Log onto Gnosis with the userid of ADVENT. The prompt “Enter password for this user>” will appear. Enter “RABBIT”. This action adds the ADVENT userid to Gnosis and connects it with the application creator for Adventure. Subsequent logons to the userid ADVENT will begin an Adventure game.

There will be a bell during the build process. This bell corresponds to a branch on the switcher that acts as the application controller. Commands such as LIST, KILL, SPACE exist and allow some control of the application environment.

Initialization is now complete.

Running the Benchmark Driver

The driver is a program called SIMIN which runs on a PDP-10. It runs up to 10 simultaneous ports. To run more, it is necessary to log another user on {the same or another PDP-10}. The script follows:

Log on as .... SRCDEV8;ADVENTURE;BENCHMARK

The command prompt is “-”

-RUN c;SIMnncr

where nn is 5 or 10 to run that many users through the CP base, or 5t or 10t to run users through the Gnosis base.

SIMIN will reply: ENTER cr TO START. Do so.

a } prompt means SIMIN is listening to you. Several useful commands are:

}STOPcr to suspend the drivers

}GOcr to resume execution

}STATUS to find out what is happening

When it is desired to abort a test, enter two escapes. When the command level prompt is received, RUN ZAPPER to disconnect all the ports. From this point, SIMIN may be re-initialized and another test conducted.

Measuring Performance on a Virtual Machine

Technique 1.

Start Gnosis.

Start as many drivers as desired. When each driver prints the message “Tests initiated”, enter STOP. At this point, all circuits should be connected, and many Adventures should be initializing themselves.

Watch the Gnosis PSW, waiting for it to go to idle, after all initialization is complete. All subsequent activity should be Adventure “transactions”.

When Gnosis has become idle, enter GO command at the control console for each of the simulators. This should start the transaction drivers again.

Quickly return to the Gnosis control console, and enter the following commands. Do this quickly so the base doesn’t time out.

.Query TIME

.INDicate USER

.Display whatever address ws counters are at, c0 bytes

for example: .D 9D5A0.C0

Now go to each of the SIMIN consoles, turn on the printer, and enter “STATUS”.

.Query TIME {again}

.Begin

Repeat this process after some reasonable interval {say 3 minutes}. These two snapshots define one test interval. The time length is defined by the second Query TIME from the first snapshot, and the first time stamp from the second one. The number of transactions is determined by adding all of the inputs from each of the scripts {which are printed as part of the STATUS output} and subtracting numbers from the beginning of the interval.
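The bookkeeping for one test interval can be sketched as follows. This is an illustrative helper, not part of SIMIN or SYSTOOL: the function name and argument layout are assumptions, times are taken as seconds already read off the two Query TIME outputs, and each list holds the per-script input counts printed by STATUS at that snapshot.

```python
def interval_rate(start_seconds, end_seconds, inputs_at_start, inputs_at_end):
    """Return (transactions, elapsed seconds, transactions per second).

    start_seconds:   second Query TIME of the first snapshot
    end_seconds:     first Query TIME of the second snapshot
    inputs_at_start: per-script input counts from STATUS at the first snapshot
    inputs_at_end:   per-script input counts from STATUS at the second snapshot
    """
    elapsed = end_seconds - start_seconds
    if elapsed <= 0:
        raise ValueError("second snapshot must be later than the first")
    # Sum over all scripts at the end, minus the sum at the beginning.
    transactions = sum(inputs_at_end) - sum(inputs_at_start)
    return transactions, elapsed, transactions / elapsed
```

For example, two scripts that go from 10 and 12 inputs to 200 and 182 inputs over 180 seconds give 360 transactions, or 2.0 per second.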

If you have SYSTOOL handy, you might want to enter “READPRIMEMETER;G” during each measurement point.

Technique 2.

Much useful information has been gleaned by stopping everything which you are not interested in, and using cp trace to monitor SVC’s, PROGram interrupts and so forth.

Measuring Performance on a Real Machine

SYSTOOL has a command GATETEST;G to measure the time to jump through a gate and return. See the listing.

There is a program on the domain disk called RDTIME which measures the time it takes to execute a PL/I READ statement. It takes 200 samples and prints them both in sample order and sorted by time. It may be transferred using standard installation techniques.

Tests Conducted

RDTIME Test Results

RDTIME samples the time to read a record from a sequential file in PL/I using READ FILE(...) SET (...);. Times are in microseconds. Each run is based on 200 samples.

% Observations less than:        0      20      40      60      80
==================================================================
IMFS-3033                    4,462   4,899   5,824   8,614  16,515
ratio                          2.5     2.7     3.1     3.0     3.0
Gnosis 3033 VM               1,801   1,829   1,868   2,844   5,589
ratio                           22      22      23      35      50
CMS 3033 VMA                    81      81      81      82     112
------------------------------------------------------------------
Gnosis 158-I RM Record       1,077   1,087   1,089   1,092   1,095
ratio                          1.1     1.1     1.1     1.1     1.1
Gnosis 158-I RM No Record      970     973     975     977     979
ratio                           .9      .9      .9      .9      .8
CMS 158-I No VMA             1,094   1,097   1,099   1,101   1,170
ratio                          2.0     2.0     2.0     2.0     1.9
CMS 158-I VMA                  552     554     555     558     625
CMS No VMA/Gnosis Record       1.0     1.0     1.0     1.0     1.1
Gnosis Record/CMS VMA          2.0     2.0     2.0     2.0     1.8
Gnosis No Record/CMS VMA       1.8     1.8     1.8     1.8     1.6
------------------------------------------------------------------
CMS 158-III No VMA             947     953     962     967   1,043
ratio                          2.0     2.0     2.0     2.0     1.9
CMS 158-III VMA                469     471     472     473     535
CMS VMA ratio -I/-III          1.2     1.2     1.2     1.2     1.2
CMS No VMA ratio -I/-III       1.2     1.2     1.1     1.1     1.1
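One plausible way to obtain the five columns of the table above from a run of samples is sketched here. This is an assumption about the reduction, not a description of RDTIME itself: the function name is hypothetical, and the rule used {take the sorted sample below which the given percentage of observations falls} is simply the most direct reading of the column heading.

```python
def observation_thresholds(samples, percents=(0, 20, 40, 60, 80)):
    """For each percentage p, return the sorted sample that has p% of the
    observations below it. With p = 0 this is the minimum."""
    ordered = sorted(samples)
    n = len(ordered)
    if n == 0:
        raise ValueError("no samples")
    return [ordered[p * n // 100] for p in percents]
```

With 200 samples the chosen positions in the sorted list are 0, 40, 80, 120 and 160, so the long tail of the slowest 20% of reads never enters the table.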

GATETEST Results

GATETEST in SYSTOOL finds the minimum time to do a gate jump call and return. The call passes three keys {plus the exit} and a 32-byte string. The return passes nothing.

On a real 158-I w/ Record: 527 microseconds

On a real 158-I w/out Record: 424 microseconds

Adventure Benchmark Results

The benchmark was run on a real 158-I with Record on and Check off. There was 5 Meg of core, but the core table had only 900 entries. All measurements are approximate. Transac/sec is the total number of transactions per second. Diskio is the number of reads or writes to disk, assuming that migrating a node pot takes one read and one write.

   users : transac/sec : diskio/sec : jumps/sec :

10           3.3          0.2         478
20           2.          53.          234
30           1.          70.          125
40           0.3         83.           59

Prognosis

We anticipate that further work can make very substantial improvements in both the CPU required to do a transaction and the real storage required to process an Adventure transaction.

CPU time

We have indications that over 90% of the CPU time went to executing kernel code. We anticipate that the greatest gains here may be had by reorganizing the supplementary systems to provide the same function with fewer kernel transactions.

Two multiplicative factors occur here: the number of calls to the terminal system for an interaction, and the number of kernel calls to support a call to the terminal system.

The PL/I program runs in a domain with a good deal of run-time subroutine support. These routines call the terminal support system {in another domain} for each line of output. This is sometimes required but not here. Fixing this {in the support system} would give a factor of a few.

The terminal support system currently runs in several domains linked serially as co-routines. Each terminal event goes through each of these domains twice. A terminal transaction is either two or more events, depending on whether the above change is made. There is a plan to reduce the number of domains peculiar to a given circuit to one, without modifying the external specifications of that system.

The terminal system currently does not make use of the Tymnet facility of remote echoing and thus does a few gate jumps on each keyboard stroke.

Space

In the Adventure application, there are a number of pages that each game writes at the game’s beginning and refers to throughout the game. These pages are the same for all of the games and should be shared. The changes to the terminal support system above would save several pages per game.

This sharing is easier after the loader project is done.

The results of the test indicate about twice as many pages as we can easily explain.

Charge sets are not currently implemented {in the kernel}.

Thus we do not yet have convenient tools for determining where the pages go that are required for a transaction.

In all we would expect a factor of a few fewer pages required per transaction when the above projects are done.

CMS Benchmark Results

The Adventure benchmark was run on the standard TYMCOM-370 CMS system on June 4, 1981. The measurements are as follows:

Test Number: 1 : 2 : 3 : 4 :

Wait Time 47.4 3.6 10.1 2.0

Problem State Time 17.7 28.7 61.6 25.8

Page I/O Count 5 1117 3919 4525

Transaction Count 420 646 461 715

Elapsed Time 120 120 120 120

Num Simulated Users 10 20 40 40

Total CPU Time 72.6 116.4 109.9 118

CP Time 54.9 87.7 48.3 92.2

CPU Utilization (%) 61 97 92 98

Paging Rate .0 9.3 32.7 37.7

Transaction Rate 3.5 5.4 3.8 6.0

CP Utilization (%) 46 73 40 77

Problem State (%) 15 24 52 21

Transactions/User 42.0 32.3 11.5 17.9

Problem/Transaction .042 .044 .134 .036

Total CPU/Trans .173 .180 .238 .165

CP Time/Trans .131 .136 .105 .129
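The lower rows of this table appear to be derived from the upper ones. The sketch below reproduces those relationships so that a new run can be reduced the same way. The function and key names are ours, and the relations {total CPU as elapsed less wait, CP time as total CPU less problem state} are inferred from the figures rather than stated by the measurement tools.

```python
def derive_cms_rows(wait, problem, page_io, transactions, elapsed, users):
    """Compute the derived rows of the CMS benchmark table from measured rows.

    All times are in seconds. The relations are inferred from the table.
    """
    total_cpu = elapsed - wait        # time the machine was not waiting
    cp_time = total_cpu - problem     # CPU time not spent in problem state
    return {
        "total_cpu": total_cpu,
        "cp_time": cp_time,
        "cpu_utilization_pct": 100 * total_cpu / elapsed,
        "paging_rate": page_io / elapsed,
        "transaction_rate": transactions / elapsed,
        "transactions_per_user": transactions / users,
        "problem_per_transaction": problem / transactions,
        "total_cpu_per_transaction": total_cpu / transactions,
        "cp_per_transaction": cp_time / transactions,
    }


# Test 1 of the table: 47.4 s wait, 17.7 s problem state, 5 page I/Os,
# 420 transactions, 120 s elapsed, 10 simulated users.
test1 = derive_cms_rows(47.4, 17.7, 5, 420, 120, 10)
```

Rounded to the precision of the table, test1 agrees with the Test 1 column.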

Description of tests:

Ten simulated users from system 54. Timing started after Adventure was initialized.

Twenty simulated users, ten from system 54 and ten from system 56.

Forty simulated users, 2 simulators with five each + 2 simulators with ten each from system 54 and one simulator with ten from system 56. Timing was mostly logon and Adventure startup.

Forty simulated users, 2 simulators with five each + 2 simulators with ten each from system 54 and one simulator with ten from system 56. Timing started after Adventure was initialized.

Gnosis Benchmark Results - June 7, 1981.

The Adventure benchmark was run on a 158-I on June 7, 1981. The significant differences between this test and previous tests were:

The PL/I memory tree was modified to not use background keys.

The PL/I program that implements the Adventure game was modified not to use PL/I I/O to talk with the terminal. It now uses a Gnosis version of $TIN/$TOUT which corresponds to the technique that the CMS program uses.

Results of the test {N.B. does not include all zero counters or the Page Table rescue histogram}:

Tests: 1 : 2 : 3 : 4 : 5 :

Number of Users 5 10 15 20 40

Elapsed Time 120.1 120.0 119.4 119.6 303.2

Test Time @ PDP-10 120 120 120 120 180

Idle Time 99.6 90.6 76.4 31.3 16.4

Prime Meter Time 7.3 16.7 21.6 22.6 10.0

Prime / Transaction .028 .029 .030 .033 .217

PL/I Meter Time 4.0 9.2 11.6 12.2 6.5

PL/I / Transaction .015 .016 .016 .018 .141

Terminal I/O Meter .6 1.5 2.3 2.6 1.1

Term / Transaction .002 .003 .003 .004 .024

Tymnet Adapter Meter 1.4 3.1 3.8 3.6 .7

TA / Transaction .005 .005 .005 .005 .015

Kernel Time 13.2 12.7 21.4 65.7 276.8

Kernel/Transaction .051 .022 .030 .096 6.017

Total CPU Time 20.5 29.4 43 88.3 286.8

/Transaction .079 .051 .060 .129 6.235

Transaction Count 259 580 717 687 46

Transactions/Users 51.8 58.0 47.8 34.4 1.2

Page I/O Count ? ? 786 2707 18703

Data Keys Called ? ? 3780 2711 322

Node Keys Called ? ? 11548 7989 2855

Fetch Keys Called ? ? 25 10 23

Entry Keys Called ? ? 16444 12564 3833

Entries/Transaction 22.9 18.3 83.3

Exit Keys Called ? ? 34477 32592 8695

Exits/Transaction 48.1 47.4 189.0

Domain Keys Called ? ? 10380 9432 2215

DomKey/Transaction 14.5 13.7 48.2

Misc Keys Called ? ? 9932 5602 2133

NRange Keys Called ? ? 1304 0 1120

PRange Keys Called ? ? 549 0 348

RESUMEJ Entries ? ? 73451 53812 12249

RESUMEJ/Transaction 102 78 266

Pages read ? ? 251 386 12162

Nodepots Read ? ? 109 750 4128

Pages Written ? ? 291 172 1730

Nodepots Written ? ? 79 696 253

Checkpoints Written ? ? 0 0 741

Queue Slices (200ms) ? ? 1251 832 1630

DIB’s Stolen ? ? 89 460 57

/Transaction .12 .67 1.24

DEP’s Stolen ? ? 4794 3494 1154

/Transaction 6.69 5.09 25.09

Seg Tbl Crunch ? ? 6181 7198 2001

Out of CCWBLOK’s ? ? 0 0 9

Cleanlist Entry Out ? ? 84 11 4009325

Out of Requests ? ? 0 0 1670

Outstanding I/O Hits ? ? 8 71 16388

Node LRU Wraps ? ? 13 124 47

Page LRU Wraps ? ? 3 18 4704

Disk I/O Count ? ? 786 2707 18703

Channel Pgm’s Built ? ? 786 2721 21900

Xfers/Channel Pgm 1.00 .99 .85
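The kernel figures in this table are consistent with being computed by subtraction rather than metered directly. The sketch below shows the relationships the numbers satisfy. The function names are ours and the relationships are inferred from the table, so treat this as a reading aid rather than a description of the measurement code.

```python
def gnosis_cpu_breakdown(elapsed, idle, prime_meter, transactions):
    """Return (total CPU, kernel time, kernel time per transaction).

    Inferred relations: total CPU is elapsed time less idle time, and
    kernel time is whatever the prime meter did not account for.
    """
    total_cpu = elapsed - idle
    kernel = total_cpu - prime_meter
    return total_cpu, kernel, kernel / transactions


def transfers_per_channel_program(disk_io_count, channel_programs_built):
    """Average number of disk transfers carried by each channel program."""
    return disk_io_count / channel_programs_built


# Test 5 (40 users): 303.2 s elapsed, 16.4 s idle, 10.0 s on the prime
# meter, and 46 transactions.
total_cpu, kernel, kernel_per_tx = gnosis_cpu_breakdown(303.2, 16.4, 10.0, 46)
```

For Test 5 this gives about 286.8 seconds of CPU, 276.8 of them in the kernel, which is the basis for the 6.017 seconds of kernel time per transaction shown above.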

Description of the tests.

Five users from system 54 {the slower of the two PDP-10s today.}

Ten users, five each from system 54 and system 56.

Fifteen users, five from system 54 and 2 times five from system 56. The information from the counters includes circuit zapping after the test proper was done.

Twenty users, 2 times five from system 54 and 2 times 5 from system 56.

Forty users, 2 times ten from system 54 and 2 times ten from system 56. Transaction counts are from 180 second run.

Gnosis Benchmark Results - June 13, 1981.

The Adventure benchmark was run on a 158-I on June 13, 1981. The significant differences between this test and previous tests were:

The time accounting in the kernel was changed to reduce the nonreproducible time for a gate jump that is charged to a domain’s meter. WS was re-allocated to increase the number of node frames and DIB entries. The 30 and 40 user tests were run with a new page selection algorithm that waits for an LRU=0 page to be cleaned and then allocates it, rather than skipping it and continuing to search for an LRU=0 clean page.
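The difference between the two page selection policies can be sketched as follows. This is a hedged illustration only: the Frame record, the linear scan order, and the wait_for_clean callback are all assumptions made for the example, and the kernel's real frame tables and cleaning mechanism are not shown here.

```python
from dataclasses import dataclass


@dataclass
class Frame:
    name: str
    lru: int       # 0 means the frame is a replacement candidate
    dirty: bool    # True while the page still has to be written out


def select_old(frames):
    """Old policy: skip LRU=0 frames that are dirty and keep searching
    for an LRU=0 frame that is already clean."""
    for frame in frames:
        if frame.lru == 0 and not frame.dirty:
            return frame
    return None


def select_new(frames, wait_for_clean):
    """New policy: take the first LRU=0 frame; if it is dirty, wait for it
    to be cleaned and then allocate it."""
    for frame in frames:
        if frame.lru == 0:
            if frame.dirty:
                wait_for_clean(frame)  # hypothetical: returns once written out
            return frame
    return None
```

Under the old policy a long run of dirty LRU=0 frames lengthens every search, while the new policy pays for one write and stops scanning.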

Two separate types of test were run. They are described separately here.

Adventure Startup Test

Results of the test {N.B. does not include all zero counters or the Page Table rescue histogram}:

test: 1 : 2 : CMS :

Total CPU Used 48.5 56.8 54.3

Idle Wait 40.7

Page Wait .1

I/O Wait 1.7

Problem Time 34.8

Timers Running Time 85.0 124.5 96.8

Elapsed Time 38.2 49.8 64.8

CP Time 19.5

Kernel Time 21.0 29.3

Timers Running Wait 36.5 67.7 42.5

Elapsed Wait -10.3 -7.0 10.5

Page Reads 63 451 240

Page Writes 114 213 0

Total Paging 179 664 240

Idle Wait 36.5 67.7

Prime Meter 27.5 27.5

PL/I Meter 25.1 25.1

Term Sys Meter .05 .05

TA Meter .08 .05

Echoing Meter .06 .05

DataKey Jumps 379 338

Page Key Jumps 1090 1166

Segment Key Jumps 10 11

Node Key Jumps 6198 6149

Fetch Key Jumps 841 842

Entry Key Jumps 12885 13211

Exit Key Jumps 9918 9714

Domain Key Jumps 2556 2478

Misc Key Jumps 4227 4144

NRange Key Jumps 1720 1747

PRange Key Jumps 580 820

RESUMEJ Entries 39263 39711

Pages Read 0 413

Nodepots Read 0 38

Pages Written 0 213

Queue Slices 403 603

Seg Tbl Crunch 86 13

Outstanding I/O Clash 0 38

Page Wraps 0 1

Disk I/O Count 0 664

CCW Builds 0 665

Description of the Tests

Starting 5 simulated Adventures in Gnosis. System had Check, no record, and old page replacement algorithm. Pages already in core may have produced optimistic results.

Starting 5 simulated Adventures in Gnosis. System had Check, no record, and old page replacement algorithm. Rerun of test 1.

Starting 5 simulated Adventures in CP/CMS.

Adventure Steady State Test {This is comparable to previous tests.}

Results of the test {N.B. does not include all zero counters or the Page Table rescue histogram}:

tests: 1 : 2 : 3 :

Simulated Users 20 30 40

Timer Running Time 145.5 122.3 128.1

Percent CPU 45 74 82

Idle Time 80.1 32.4 23.4

Kernel Time 42.0 59.5 75.6

/Transaction .048 .065 .091

CPU Time 65.4 89.9 104.7

/Transaction .074 .098 .126

Prime Meter Time 23.4 30.4 29.1

/Transaction .027 .033 .035

PL/I Meter Time 14.4 17.9 17.0

/Transaction .016 .020 .020

Terminal Meter 1.77 3.24 3.13

TA Meter 3.20 3.80 3.47

Echoing Meter 4.00 5.43 5.45

Transactions 883 913 834

/Second 6.1 7.5 6.5

Data Key Jumps 3284 3462 2834

Node Key Jumps 8570 9012 7868

Fetch Key Jumps 6 6 7

Entry Key Jumps 12448 15323 13452

Exit Key Jumps 32884 37270 34859

Domain Key Jumps 9064 11012 9817

Misc Key Jumps 6211 5894 5042

RESUMEJ GOTO’s 61568 69432 60763

/Transaction 69.7 76.0 72.9

Pages Read 62 456 1347

Nodepots Read 3 75 515

Pages Written 113 338 1707

Nodepots Written 1 27 274

Queue Slices 625 597 627

DIB’s Stolen 1326 11081 10386

DEP’s Stolen 462 23868 20169

Seg Tbl Crunch 220 7480 7537

Cleanlist Out 52 306 1452

Outstanding I/O On 0 51 137

Node LRU Wrap 0 2 25

Page LRU Wrap 2 3 22

Disk I/O Ops 179 923 4117

/Second 1.2 7.5 32.1

CCW Builds 180 923 4124

Home Node Fetches ? 106 543

Swap Node Fetch ? 340 3078

Dirty Node Fetches ? ? ?

Description of the Tests

Twenty simulated users. Kernel had Check, no record, and no logging of sources for nodes.

Thirty simulated users. Kernel had no Check, no record, improved page replacement algorithm, and logging of sources for nodes.

Forty simulated users. Kernel had no Check, no record, improved page replacement algorithm, and logging of sources for nodes.

In July, 1982, we concluded that the above benchmark had been slowed by the congestion of the DEPEND tables and the limits on the size of the DIB area.

Since then the DEPEND tables were replicated many times to relieve the congestion.

The DIB locators in the node frame are still 16-bit offsets. They could be changed to indexes at a cost of a multiply. They could also be changed to a shifted offset at the cost of a shift.
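The trade-off among the three locator encodings can be made concrete with a little arithmetic. The sketch below is illustrative only: the entry size and shift amount passed in are hypothetical parameters, not the kernel's actual DIB size or alignment.

```python
LOCATOR_BITS = 16  # width of the DIB locator held in the node frame


def reach_plain_offset():
    """Bytes of DIB area addressable by a plain byte offset."""
    return 1 << LOCATOR_BITS


def reach_index(entry_size):
    """Bytes addressable when the locator is an entry index (needs a multiply)."""
    return (1 << LOCATOR_BITS) * entry_size


def reach_shifted(shift):
    """Bytes addressable when the locator is an offset shifted left (needs a
    shift, and entries must be aligned on 2**shift byte boundaries)."""
    return (1 << LOCATOR_BITS) << shift


def decode_index(locator, entry_size):
    """Byte offset of the entry named by an index locator."""
    return locator * entry_size


def decode_shifted(locator, shift):
    """Byte offset of the entry named by a shifted-offset locator."""
    return locator << shift
```

A plain offset reaches 64K bytes. With a hypothetical 40-byte entry an index reaches 40 times that, and a shift of 3 reaches 8 times that at the price of 8-byte alignment.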

Benchmark - July 1982

The benchmark of June, 1981 {(benchmark)}, was redone with modifications to the terminal system and associated modifications to the creator.

Objectives

Determine what effects eliminating many domains from the terminal system would have.

Develop performance data on both memory and CPU utilization of “trivial” transactions.

Identify and resolve performance bottlenecks in trivial transactions.

Building a New Version of Adventure under CMS

To create a new Adventure controller, do “VMFBUILD ADVCON2 FT”.

To create a new Adventure, do “PLILINK GNOADV2 PLISUBS2”.

VMFBUILD EXEC is on the Domain disk.

PLILINK EXEC is on the OS disk.

Initialization of the Benchmark System

Three files must be transferred from CMS for use in the benchmark. Two of the files are on the Gnosis domain disk {ADVCON2 MODULE and GNOADV2 MODULE}, while the third may be acquired by ATTACHing UPL370 {ADVENT DATA}.

Acquire the transfer disk and type “MGSEND ADVSEND2 A” to transfer the three files. Write down the password it returns; you will need it later. {MGSEND is the CMS side of the multiple segment receiver and unpacker {SRUP} described in (p2,srup).}

Log on to GNOSIS2 and type “READY 134” followed by “DISC” to get it to recognize the new transferred files.

Now log onto some Gnosis USERID and create the Adventure controller as follows:

If the VBSTIOC factory key is not available by name in the command system, then get it as follows:

G:KEYCALL WOMBFACIL 1 () (,WOMBFACIL.SUPP)

G:KEYCALL WOMBFACIL.SUPP 41 (%X00000015) (,WOMBFACIL.SUPP.VBSTIOC)

Receive the files and build the Adventure system as follows:

G:RECEIVE password

THE CMS FILE NAME WAS "ADVCON2 MODULE E1" WHAT NAME WOULD YOU LIKE FOR THIS SEGMENT? cntrlseg2

THE CMS FILE NAME WAS "GNOADV2 MODULE E1" WHAT NAME WOULD YOU LIKE FOR THIS SEGMENT? advseg2

THE CMS FILE NAME WAS "ADVENT DATA E1" WHAT NAME WOULD YOU LIKE FOR THIS SEGMENT? advdataseg

G:FILEDEF CAVES ADVDATASEG

G:FILEDEF SBANK PRIMARY.SPACEBANK {or some other space bank}

G:FILEDEF METER METER

G:FILEDEF VBSTIOC WOMBFACIL.SUPP.VBSTIOC

G:WOMBTEST ADVENT2 CNTRLSEG2 ADVSEG2 FILEDEFQUERY

A bell will ring, telling you that the Adventure controller has run into a program interrupt {which was planted there to get a DDT created for it}. The correct action is to switch to the ddt esc cr, and restart the program at location 10. The script for this is “;w10;g” or, if you like symbols, “;t 0,advcon;u ;w start 2;g”. When the controller completes its initialization, another bell will ring, and the command system will be ready for more input. Switch back to the command system with esc cr.

Next make the monitoring keys available to SYSADMIN as follows:

G:KEYCALL ADVENT2.DOMKEY 31 () (,ADVENT2.KEYS)

G:KEYCALL ADVENT2.KEYS 14 () (,ADVENT2PLIMETER)

G:KEYCALL ADVENT2.KEYS 15 () (,ADVENT2TERMMETER)

You may then take a checkpoint and log off the command system as follows:

G:CHECKPOINT

G:LOG

Now dial in to the system again, accessing SYSADMIN’s command port. Switch to branch “a” (SYSTOOL) by typing “esc a”, and install the Adventure system as a Gnosis user as follows:

Set the username and user name length to the user name you used to build the Adventure system as follows: {N.B. un is the username and unl is its length.}

>USERNAMEL/ >unl CR

>USERNAME%A unl; >'un ' CR

Set the key name and key name length to install as follows:

>KEYNAME;C7; >"ADVENT2" CR

>KEYNAMEL/ >7 CR

Get the entry key to the Adventure creator as follows:

>GETUSERKEY;G

And install it in the LUD as follows: {N.B. nm is the new user name to be installed and lnm is its length in characters.}

>USERNM;X lnm; 'nm' CR

>USERLNM/ > lnm CR

>INSTALLSIKUSER;G

Initialization is now complete.

Running the Benchmark Driver

The driver is a program called SIMIN which runs on a PDP-10. It runs up to 10 simultaneous ports. To run more, it is necessary to log another user on {the same or another PDP-10}. The script follows:

Log on as ] c] rSRCDEV8;ADVENTURE;BENCHMARK

The command prompt is “-”

-RUN c;SIMnncr

where nn is 5 or 10 to run that many users through the CP base, or 5t or 10t to run users through the Gnosis base.

SIMIN will reply ENTER cr TO START. Do so.

a } prompt means SIMIN is listening to you. Several useful commands are:

}STOPcr to suspend the drivers

}GOcr to resume execution

}STATUS to find out what is happening

When it is desired to abort a test, enter two escapes. When the command level prompt is received, RUN ZAPPER to disconnect all the ports. From this point, SIMIN may be reinitialized and another test conducted.

Tests Conducted

Adventure Benchmark Results

The benchmark was run on a real 4341-2 with Record off and Check off. There was 8 Meg of core, but the core table had only 888 entries. All measurements are approximate. Transac/sec is the total number of transactions per second. Diskio is the number of reads or writes to disk, assuming that migrating a node pot takes one read and one write.

users : transac/sec : diskio/sec : jumps/sec :

10 2.7 .2 142

20 3.3 .3 172

30 6.5 6.0 547

40 4.9 9.4 250

60 4.2 58.9 264

Gnosis Benchmark Results - July 30, 1982.

The Adventure benchmark was run on a 4341-2 on July 30, 1982. The significant differences between this test and previous tests {described in (benchmark)} were:

The Adventure controller was re-written to use the VBSTIO terminal system {(p2,vbstio)}. This terminal system had only 3 domains {CCK, SIK, and SOK} with one shared R/W page for each logged-on Adventure.

The following special observations were made during the tests and subsequent data evaluation:

Test 3 {30 users} had considerable page and node allocation activity, indicating Adventures were either starting or ending during the test. Adventures were observed to end during the test. The ending counters were read 10 to 20 seconds late due to fumble fingers on the part of the observer.

Test 4 {40 users} showed migration activity during the test.

Test 5 {60 users} showed log on/off activity. Also, the PDP-10 monitor logs showed not all users ran. Speculation says this may be due to insufficient ports on the PDP-10. This test also showed migration activity.

Adventure steady state test results {N.B. does not include all zero counters or the Page Table rescue histogram}:

tests: 1 : 2 : 3 : 4 : 5 :

Simulated Users 10 20 30 40 60

Percent CPU 9 17 34 20 63

Timer Running Time 118.6 145.0 165.9 143.5 140.9

Idle Time 107.8 128.5 110.3 115.5 51.8

CPU Time 10.8 16.5 55.6 28.0 89.1

/Transaction .034 .034 .052 .039 .149

Kernel Time 6.9 10.3 24.9 18.9 70.8

/Transaction .022 .021 .023 .027 .118

Prime Meter Time 3.9 6.2 30.7 9.1 18.3

/Transaction .012 .013 .029 .013 .031

PL/I Meter Time 2.6 4.3 25.4 6.4 15.0

/Transaction .008 .009 .024 .009 .025

Terminal Meter .3 .5 1.1 .7 .7

TA Meter .8 1.3 2.8 1.8 1.6

Transactions 318 484 1,074 709 598

/Second 2.7 3.3 6.5 4.9 4.2

Data Key Jumps 644 969 2515 1394 1073

Node Key Jumps 1892 2869 15769 4142 6,024

Fetch Key Jumps 0 0 533 0 185

Entry Key Jumps 3956 5816 20825 8975 10201

Exit Key Jumps 7183 10668 26150 15372 16664

Domain Key Jumps 3249 4772 15486 7534 7730

Misc Key Jumps 2448 4054 13200 6145 8513

RESUMEJ GOTO’s 16827 24901 90758 35865 37159

/Transaction 52.9 51.4 84.5 50.6 62.1

/Second 142 172 547 250 264

Pages Read 25 24 450 429 2867

Nodepots Read 0 6 63 0 515

Pages Written 0 13 428 195 2603

Nodepots Written 0 1 24 0 139

Queue Slices 147 220 325 278 243

DIB’s Stolen 0 17 319 280 2241

DEP’s Stolen 0 0 177 0 455

Seg Tbl Entry Crunch 0 0 28 0 592

Cleanlist Out 0 0 172 7 2952

REQUESTs out 0 0 0 8132 134524

Outstanding I/O On 0 2 20 562 3807

Node LRU Wrap 0 0 1 0 17

Page LRU Wrap 0 0 7 3 38

Disk I/O Ops 25 45 989 1350 8295

/Second .2 .3 6.0 9.4 58.9

CCW Builds 25 45 1008 1362 8395

VBS Benchmark - February 11, 1983

A simple measurement of the VBS system was made. This VBS was moved from the PDP-10 with no redesign for Gnosis. The Gnosis was running in a virtual machine in a 3 Meg V=R region on a 4341-2.

The meter time {i.e., user mode time} to do a simple command {e.g., selecting the next screen} was 0.066 to 0.102 seconds.

Let us extrapolate from this figure:

Call it 0.08 seconds. Add 0.01 seconds as a guess of the time spent in the Tymnet adapter {which was not measured by the VBS meter}. Multiply by 2 to include the time spent in the kernel {assuming a real machine}.

We get 0.18 seconds per transaction or about 5 transactions per second.

This is about twice the value measured for the Adventure benchmark, which seems reasonable.

Taking the cost of a 4341-2 as about $0.01/second, the cost of a transaction would be $0.002 or one-fifth of a cent.

If you consider a “transaction” to be more than a single input and the response, multiply accordingly.
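The extrapolation above can be restated as a short calculation, so that the guesses are easy to vary. The function and parameter names are ours. The default values are the ones used in the text, and the Tymnet adapter allowance and the kernel factor remain guesses.

```python
def extrapolate_vbs(meter_seconds=0.08,
                    adapter_seconds=0.01,
                    kernel_factor=2,
                    cost_per_cpu_second=0.01):
    """Return (CPU seconds per transaction, transactions per second,
    dollars per transaction) for a simple VBS command."""
    # User-mode time plus the unmetered adapter time, scaled for kernel time.
    seconds_per_transaction = (meter_seconds + adapter_seconds) * kernel_factor
    transactions_per_second = 1 / seconds_per_transaction
    dollars_per_transaction = seconds_per_transaction * cost_per_cpu_second
    return seconds_per_transaction, transactions_per_second, dollars_per_transaction


seconds, rate, dollars = extrapolate_vbs()
```

With the defaults this gives 0.18 seconds per transaction, between 5 and 6 transactions per second, and $0.0018 per transaction, which rounds to the one-fifth of a cent quoted above. Substituting the measured extremes of 0.066 and 0.102 seconds for the meter time shows how far the estimate moves.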

A simple command touched about 450K of memory. We don’t know how much of this is code that would be shared among all users, but we think much of it is.

The meter time to log on was over 26 seconds. We think this could be improved substantially with some architectural changes.

How to Set Up VBS

Log on to Gnosis userid VBS. Type:

AUXFILE GET RECORD CMD.GNOTRAN TYMNET.GNOSISVB 142 GNOTRAN CMDFILE "XFER"

CMDFILE CMD.GNOTRAN

CMDFILE CMD.SAVEINIT

{Ignore “INVALID OPERAND” messages.}

CMDFILE CMD.FILEDEF’s

Check that there is a SYSTEM.VBSTIOC.

FORK FRONTER

FORK CREATOR

ADDLUDKY username password 1 FRONTER username.CONTROL

Now log on to the Gnosis machine under username and type password in upper case.

Then under VBS again:

KEYCALL FRONTER.DOMKEY 64 (%X01) (,FRONTER.ENTRY1)

ADDLUDKY mastername masterpw 0 FRONTER.ENTRY1 mastername.CONTROL

Now log on to the Gnosis machine under mastername and type masterpw in upper case.

Now you can log on to username from a TRS-80 and mastername from a line terminal.

CRASHES

This section is a list of crashes and what we know of them.

Allocation Overflow

Observations

On March 10, 1985, 2686 crashed when a page allocation count overflowed.

We proceeded by ignoring the overflow. Numerous page allocation counts were observed that were statistically incompatible with a “fast count” bug. In the next days more overflows occurred and we patched the kernel to ignore them.

We eventually determined that all impacted pages had their allocation counts in the same allocation pot.

Bill and Jeff then noticed that the allocation pot looked much like a home node pot. TOD values in domain roots indicate that most of the domains had not run that year. This is presumably prior to the most recent big bang!

Speculations

One

Assuming that the node pot value was created during a prior big bang:

That value must have existed somewhere during the beginning of the recent bang. The only obvious place that comes to mind is at the disk address where the current allocation pot lives. Presumably a node pot lived there during the previous bang.

This has been denied by the proper authorities! See (two)!

A failure of the disk format routine would be able to explain all. I presume that the formatting routine uses “formatting writes” {is there any other way?}.

A formatting write replaces an integral number of tracks. It does not partially replace information on a track.

Allocation pots are intermingled on the disk with pages which are initially virtually zero {at least if the allocation pot stuff is working right!}.

It should be noted that the format process writes each record of the disk that is formatted. As such, the disk should have been OK. Since this range is on pack 4, one of the old packs, it seems likely that that disk address has been an allocation pot for at least a year.

Two

It should be noted that all the time stamps found in root nodes were earlier than the last big bang, but later than the one before that (22-Oct-84 to 21-Jan-85).

The ranges involved are duplexed and live on packs 3 and 4. I think the disk system was increased in size at the last big bang, but the formats for packs three and four were unchanged. I have even looked at some other formats and have not found any that put that node pot at the location which is now supposed to be an allocation pot. I cannot see any sequence of pack switching during the last big bang which would have caused this either.

If my calculations are correct, the allocation pot of interest spans CDA’s D8AE-DCAD and lives on cylinder 752, page 54. The node pot of interest contains CDA’s 7920-793C and lives on cylinder 345, page 23.

The observed time stamps seem to put the format program at the scene of the crime. But that requires the far-fetched assumption that it read the node-pot from its home and then wrote it on the clobbered allocation pot page. But the formatter writes a whole cylinder at a time with diag 20 and never does reads. And anyway the node pot would have already been formatted earlier in the pass with an empty one since it lives on a lower cylinder number!

Three

OK! According to (two) we need to look for scenarios that don’t blame the formatter. See (stale) for one particular problem that might cause such havoc.

Suggestions

Perhaps we should implement the TOD value suggested at (seed-tod). This would have prevented a hazy category of suspected causes for this apparition.

We could even put the SEED TOD in the node pot!

Clobbered Home Node Pot {Fixed}

The theory

2686 experienced a memory failure which took out a power supply. After IBM fixed the machine, Gnosis crashed on a bad home node pot. This pot was all zero after about 0A00 within the pot. With DDR we determined that the pot was bad on both twins.

I argue here that not only is this a possible consequence of such a machine failure but, indeed, the likely consequence.

Gnosis’s main task late at night is to migrate. {It was late at night.} The most significant activity of migrating is migrating nodes. This is because to migrate a node requires reading, then writing two node pot images. As I understand the habits of the kernel the following things will happen in sequence:

The kernel is instructed to migrate node N.

If the swap pot isn’t in core, it is brought in.

It reads N’s home node pot {if it is not already in core}.

It copies the new node state into the core frame holding N’s home pot.

It does a SIO for each of the two channels supporting the two ranges where the image of N’s pot belongs.

The system is now in a state in which it spends, perhaps, most of its time.

A memory box fails which causes addresses 0AA4 in the pot’s page frame to become inaccessible due to double errors.

In some irrelevant order each channel comes to write the node pot image. It writes the portion of the page that it can. When it reaches 0AA4 in the page, each channel fails to get the data from the memory and the controllers supply zeros to the disk.

The system is now in the state that we observed.

I believe that this can be fixed by:

Add a special {perhaps unique} nonzero pattern at the end of each home node pot.

Await the successful completion of the first home node pot write before starting the other. {Crash, perhaps, if the pot can’t be written.}

This is involved with a general strategy that we may provide someday to survive failed page frames.

Upon reading a home node pot, check that the special pattern at the end of node pot image is present, and if not declare a bad read and try the alternate range.

This fix has been installed.
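The three parts of the fix can be sketched as follows. This is a minimal model, not the kernel's code: the page size, the sentinel bytes, the function names and the `write`/`read` callbacks are all assumptions made for illustration.

```python
# Sketch only: PAGE_SIZE, SENTINEL and every name below are hypothetical.
PAGE_SIZE = 4096
SENTINEL = b"\xA5\x5A\xC3\x3C"   # assumed nonzero pattern at the end of a pot


def seal_pot(body: bytes) -> bytes:
    """Pad the pot body to a full page and put the sentinel at the very end."""
    room = PAGE_SIZE - len(SENTINEL)
    if len(body) > room:
        raise ValueError("pot body does not fit in a page")
    return body.ljust(room, b"\x00") + SENTINEL


def pot_is_good(image: bytes) -> bool:
    """A pot whose tail was zero-filled by a failed write lacks the sentinel."""
    return len(image) == PAGE_SIZE and image.endswith(SENTINEL)


def write_pot(ranges, image, write) -> None:
    """Write the twins one after the other, never both at once."""
    for r in ranges:
        if not write(r, image):          # wait for completion before the next
            raise RuntimeError("home node pot write failed")


def read_pot(ranges, read) -> bytes:
    """Declare a bad read when the sentinel is missing; try the other range."""
    for r in ranges:
        image = read(r)
        if pot_is_good(image):
            return image
    raise RuntimeError("no good copy of the home node pot")


# Model of the observed failure: zeros from 0AA4 to the end of the page.
good = seal_pot(b"node images")
torn = good[:0x0AA4] + bytes(PAGE_SIZE - 0x0AA4)
disk = {"primary": torn, "twin": good}
recovered = read_pot(["primary", "twin"], disk.__getitem__)
```

Because the writes are serialized, a single memory failure can tear at most one twin, and the sentinel check lets the reader reject that twin.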

The June 17, 1987 crash

I note that there is code in MEMORY to involve any background key in a consulted red node. I see no code to make depend entries for the slots of such keys. I see no code in UNPRND to take special note of involved background keys. I conclude that no purpose is successfully served in involving such a background key. Perhaps there is a good unserved purpose in involving them.

Background keys are made involved and entered into DEPEND when they are consulted (for real). Is there some reason to involve them when their host node is consulted?

One suggested design change may bear on this issue:

Share page and segment tables despite background keys

This project would make possible sharing page tables for many more shared segments (perhaps the PL/I library).

Logic could be added to note that a page (or segment) table makes no use of the background key, even though one might be in effect at that point in the memory tree.

When a node needs to produce a page table, we first look to see if there is already a page table produced by that node using the matching background key. If not, we look for a page table produced by that node that doesn’t make use of any background key. If there is one, we use it, otherwise we build an (empty) page table (that doesn’t use any background key).

When we need to fill in an entry in a page table, if the entry makes use of the background key and the table doesn’t, we must build a new page table.

The same argument applies to segment tables.
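The proposed lookup and fill rules might look like the sketch below. The `PageTable` class and both functions are hypothetical names. Copying the existing entries into the newly built table is an assumption, since the text does not say how the new table is populated.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PageTable:
    producer: str                  # node that produced the table
    bgkey: Optional[str] = None    # None: table uses no background key
    entries: dict = field(default_factory=dict)


def find_table(tables, node, bgkey):
    """Pick a page table for `node` under the background key in effect."""
    mine = [t for t in tables if t.producer == node]
    for t in mine:                 # 1. table built with the matching key
        if bgkey is not None and t.bgkey == bgkey:
            return t
    for t in mine:                 # 2. table that uses no background key
        if t.bgkey is None:
            return t
    t = PageTable(node)            # 3. build an empty key-independent table
    tables.append(t)
    return t


def fill_entry(tables, table, index, value, uses_bgkey, bgkey):
    """Fill one entry; build a new table if the entry needs the key."""
    if uses_bgkey and table.bgkey is None:
        # Assumption: entries already present are carried over.
        table = PageTable(table.producer, bgkey, dict(table.entries))
        tables.append(table)
    table.entries[index] = value
    return table
```

With these rules, domains whose background keys differ keep sharing one table until some entry actually depends on the key.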

Absent Memory Tree

1988, March 17 and May 27: We crashed in MEMFAULT because reexamining the entire memory tree to acquire a segment keeper key failed to discover a fault. This logic is called after a page or segment exception occurs for an invalid entry, or a protection exception for a read-only page. It is not called when the domain is using the null segment table. The domain’s DIB designated that null table.

March 17: (See (xxz) about this!)

Indeed the domain’s address slot held an unprepared node key (LSS0) for a node not in the hash chains.

The faulting domain was executing MBWAIT2, the new wait multiplexor code. The memory tree has an LSS0 node at the top with 6 in the format slot. When the domain was last dispatched it was using a segment table that might well have been produced by such a tree. This fact was determined by the TREETRAP update. Many domains share the red segment node. (The control register 1 value recorded in the header of the dump was to NULSPPT. There is much code to change CTL1 as it changes a DIB if the running domain uses the designated DIB.)

The segment table with which the domain was dispatched was produced by a node that conforms to the above theory, i.e. a node with many background window keys.

The domain faulted accessing address 020027. This address is meant to be defined via a background key to an FSC segment.

I suspect that this domain had not run for a while and that its address tree’s top node was swapped out for valid LRU reasons. When this occurs calls to DEPEND are supposed to cause the DIB’s segment table locator to designate the null segment table. Something presumably failed in this event sequence.

Three other domains held unprepared keys to the same top node (as determined by DDT’s search command on the CDA in the key). Two of those were prepared and their DIBs designated NULSPPT as they should.

This crash occurred about a week after installing the UNIFTBL change. (See UNIFTBL SCRIPT.) This somewhat changes the way that mapping tables are found to be shared. If the current domain had last run by acquiring access to a shared segment table but somehow failed to record that DIB.SEGTABPP in DEPEND then we might fail where we failed. This seems unlikely because SEAP is responsible for notifying DEPEND while the new table sharing logic runs only after SEAP has found the node that produced the desired table.

May 27:

The domain was executing code from a page labeled “MURCAL”. Its PSW was 0A16 which designates a ST 300XXX. Access to the page with the instruction was via an unprepared, uninvolved segment key. The designated node is in a node frame and was located via the hash chain. The fault was a protection exception for the address 300XXX. That address is defined by a background key and is via a node not in any node frame. There are thus two keys that should have been prepared and involved for there to exist mapping table entries.

A remarkable thing about the dump is that the field BGKEY in the page table produced by a segment node with window keys, designates the right slot of the wrong node.

The update TREETRAP is a temporary change to keep the value with which a domain was dispatched. Knowing the segment table that yielded the trap should be of some value.

Explanation

I think that I can explain the behavior of the kernel now. I will proceed to the solution when the explanation is clear.

Suppose that a DIB designates a STOTE via its SEGTABPP. Suppose that a background key is in effect at the STOTE and that this is reflected in the STOTE’s STBGKEY. Suppose that some node that holds the background key is removed from its frame. UNPRND will make mapping tables valid at this point but will leave BGKEY designating the slot in the node frame that no longer holds any key. While this state is not immediately invalid according to the current validity rules it leads to invalid states when the following happens.

A new node is placed in the old frame. A segment fault occurs for the STOTE. SEARCHPORTION is executed with BGKEY initialized to the STOTE’s STBGKEY. The faulting domain now has access to the key designated by STBGKEY which the domain should not have access to!

Counter Claim

The above does not explain the crash -- The removal of the node with the background key from its frame would have caused the DIB to cease designating the STOTE! There could therefore be no such segment table fault.

Explanation Two

Suppose that a segment table, st, exists with STBGKEY -> slot x in an unallocated node frame. Suppose that the frame is allocated to a red segment node with a background key at x. CHECK would be happy at this point. If some domain with logical access to the red segment node accesses a page defined by the key now at x then it may occur that st will be found to meet the requirements of that domain.

Explanation Three

Suppose that domain d’s DIB locates segment table st where STBGKEY locates slot x. Suppose that no domain has touched pages defined by the key at x.

Explanation Four (and fix)

The domain faults on page table entry y in page table Y accessed via segment table entry x. While SEAP works to find the page to repair the fault, a node upon which x depends is evicted from its frame. This invalidates x. SEAP finds that a segment keeper is required and goes to the top of the tree but unexpectedly finds a node-not-in-memory signal from SEAP instead of a segment keeper report.

The obvious fix is to redispatch the domain. Since bugs have been found when we crashed here before and this event is very rare (once a month), we will call CHECK before we redispatch.

This seems to explain the March 17 crash as well.

Update CLAYTREE is one such update; it does not call CHECK.

Bad Background key, Dec '88

CHECK crashed when it could not explain a valid page table entry in a page table x whose PTBGKEY pointed to a page key in slot 14 of node frame z. The context required a large background segment.

In another node frame, there was a node with a background key in slot 14. We suspect that this node was in frame z earlier.

We can explain this crash. An invalid page table entry e caused a translation exception in a table t. t’s background pointer pointed to an unlocked node N. SEAP is called to evaluate e. The background key in N is consulted due to a background window key while in SEAP. (But N is not locked!) This causes a depend entry d to be made. While still in SEAP, N is displaced from its frame. This causes e to be zapped but the zapping happens before the intended zappee has reached e. SEAP returns with a value for e and now e holds a valid entry but the depend entry d, which was intended for this value, is already gone and we are in the unsafe state which CHECK found!

See (dependcl) for related theory. See (pes) for a proposed fix.

See (bkt) for ruminations that need to be completed to make this stuff work!

January 89

The crash was at MEMFAULT upon failing to find a segment keeper slot in core. CHECKMKEY has replied to MEMFAULT that a necessary node is absent and has been summoned. MEMFAULT’s job is to notify a segment keeper.

MEMFAULT is responding to the call (at FAULT1) in PROGINT. FAULT1 is in response to

Oct 89

In general we find an involved page key in a black segment node with a SUBJECT field that designates that same segment node. This infelicity is sometimes caught by CHECK and sometimes by MEMORY. In each case the offset of the strange slot within the node is the same as the offset of the only other prepared key to the node with the strange slot. As of the crash, both keys are involved.

Suppose that slot i of node x holds a segmode key to node y. I imagine that the key in x is prepared but not involved. A translation exception requires the key to be involved.

On two crashes the segment node was an LSS3 node in the prime space bank’s bit map. One crash for the node map and one for the page map.

DESIGN PROBLEMS and BUGS

Kernel Code Design Problems

DEPEND inefficiencies

Redundant entries

When a program alternately changes a slot and then refers to storage defined by that slot, DEPEND may accumulate redundant entries. These are for slots that are different from the changed slot but also obscure and required to define the storage.

Two cures for this are suggested: avoid adding redundant entries, and periodically sweep for redundant entries. The sweep could be done when space is needed and no such sweep has been done recently.
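Both cures can be illustrated with a toy model of DEPEND. The `Depend` class, its dictionary layout and its method names are assumptions for illustration; the real DEPEND space is organized differently.

```python
class Depend:
    """Toy DEPEND: slot -> mapping-table entries that depend on the slot."""

    def __init__(self):
        self.chains = {}

    def add_unchecked(self, slot, mte):
        """Current behavior as described: every reference appends an entry."""
        self.chains.setdefault(slot, []).append(mte)

    def add(self, slot, mte):
        """Cure 1: do not add an entry that is already present."""
        chain = self.chains.setdefault(slot, [])
        if mte not in chain:
            chain.append(mte)

    def sweep(self):
        """Cure 2: drop duplicates in every chain; return how many went."""
        removed = 0
        for slot, chain in self.chains.items():
            unique = list(dict.fromkeys(chain))   # keeps first occurrences
            removed += len(chain) - len(unique)
            self.chains[slot] = unique
        return removed


# A program alternately changes slot A and touches storage that is also
# defined through the obscure slot B: B's chain grows on every round.
dep = Depend()
for _ in range(5):
    dep.add_unchecked("B", "pte7")
removed = dep.sweep()
```

The membership test in `add` costs time proportional to the chain length, which is one reason a periodic sweep may be preferable for long chains.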

Ancient entries

DEPEND entries expire only when the associated slot is zapped or when room is required in DEPEND space for other entries. In particular they persist even after the mapping tables that they point to disappear. When such an ancient entry does expire it zaps its mapping table entry and may thus inconvenience the new user of that entry.

If DEPEND entries held the mapping table entry value in addition to their current information, then zapping new tenants could be largely avoided and stale entries could be found and deleted for better space management.

This can be implemented by chaining together DEPEND entries produced on one execution of SEAP. This is necessary because the mapping table entry value is unknown when the DEPEND entry is formed. The chain can be followed just before SEAP returns.
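A sketch of the chaining idea follows. `DependEntry`, `seap`, `is_stale` and `expire` are hypothetical stand-ins: the entries formed during one SEAP execution are kept on a chain, the resolved mapping table entry value is written into each of them just before returning, and an expiring entry zaps the table entry only if that value is still there.

```python
class DependEntry:
    def __init__(self, slot, locator):
        self.slot = slot
        self.locator = locator   # which mapping table entry depends on slot
        self.value = None        # not yet known when the entry is formed


def seap(depend, table, slots, locator, resolve):
    """One SEAP execution: form entries, resolve, then back-fill the value."""
    chain = [DependEntry(s, locator) for s in slots]
    depend.extend(chain)
    value = resolve()
    for entry in chain:          # follow the chain just before returning
        entry.value = value
    table[locator] = value
    return value


def is_stale(table, entry):
    """The table entry no longer holds what this DEPEND entry vouched for."""
    return table.get(entry.locator) != entry.value


def expire(depend, table, entry):
    """Remove the entry; zap the table entry only if it is not a new tenant."""
    depend.remove(entry)
    if is_stale(table, entry):
        return False
    table[entry.locator] = None
    return True
```

A new tenant that happens to hold the same value would still be zapped, which is consistent with the text's claim that such zapping is largely, not entirely, avoided.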

Eliminating redundant entries

Even without extending the depend entry to include mapping table entry values, one may search a chain and eliminate entries that designate the same mapping table entry. There are two pitfalls here: one must avoid the quadratic cost for long legitimate chains, and one must know when to stop hoping to find redundant entries and resign oneself to the old-fashioned way of zapping whole chains.
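One way to address both pitfalls is sketched below. The function name, the `probe` parameter and the give-up rule are assumptions, not anything the kernel does. A set of locators already seen keeps the scan linear, and the scan gives up if no duplicate has turned up within the first `probe` entries.

```python
def prune_chain(chain, probe=32):
    """Return (kept, gave_up) for a chain of mapping-table entry locators."""
    seen, kept, dups = set(), [], 0
    for i, locator in enumerate(chain):
        if i == probe and dups == 0:
            # Probably a long legitimate chain: stop hoping, keep it whole.
            return list(chain), True
        if locator in seen:
            dups += 1
        else:
            seen.add(locator)
            kept.append(locator)
    return kept, False
```

When `gave_up` is true the caller would fall back to zapping the whole chain in the usual way.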

Ageing of Nodes

Ageing of a node depends on setting the core-lock byte to 0F after use of the node. When a node is unprepared one of two bad things happens:

If the node was prepared as a segment node then the NFDIRTY bit will be off and the age indicated by the lock byte will be greater than when the node was in fact required to be in memory. This subjects segment nodes to premature removal from node space which:

Uses CPU time,

Wastes swap node pot space,

Wastes channel capacity, and

Leads to panic migrations due to swap space exhaustion.

If the node was prepared as a domain part or a meter then the NFDIRTY bit is on and the node will be misjudged to be very recently used, which is the opposite of the problem above. This leads to removing younger nodes from node space.

A kludge is to set a moderate age on recently unprepared nodes. This is somewhat at odds with LRU theory but I know of no better scheme that is compatible with the reasons for prepared nodes.

Bugs

Parameter strings that cross page boundaries

There is currently a bug concerned with passing a string to a domain whose parameter string crosses a page boundary. The immediate bug is that the second page may be the same page as the string argument.

I think that if the second page must be faulted in then the first part of the string is deposited before the I/O occurs and thus the jump isn’t instantaneous.

Domain Tool Abuse

We know of no bugs here. However, there are several places in the kernel that protect it from node configurations that arise only from use of the domain tool in unanticipated ways, and there is no code now designed to wield the domain tool beyond the official domain creator. We therefore conclude that there are probably bugs in this untested kernel code.

A secure system must ensure either that the domain tool is safely hidden or that this kernel code is tested.

Non sharing of mapping tables

This is a minor performance bug in MEMORY. When seeking a mapping table produced by some specific node we require that SEAPFORMAT = xTFORMAT. Bit 1 of this byte is the no call bit. That bit should not be required to match in order to share the table.
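The relaxed comparison can be written as a masked test. The mask value below is an assumption: it takes "bit 1" in the IBM convention, where bit 0 is the most significant bit of the byte, giving X'40'. The function names are hypothetical.

```python
NOCALL = 0x40   # assumption: bit 1 with bit 0 as the most significant bit


def formats_match_strict(seapformat: int, xtformat: int) -> bool:
    """Current test: the whole format byte must be equal."""
    return seapformat == xtformat


def formats_match(seapformat: int, xtformat: int) -> bool:
    """Proposed test: ignore the no-call bit when deciding to share."""
    return ((seapformat ^ xtformat) & ~NOCALL & 0xFF) == 0
```

Two format bytes that differ only in the no-call bit fail the strict test but pass the masked one, so the table could be shared.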

GLOSSARY AND INDEX

ABANDONJ - (abandonj)

Active - (active)

Actor - The domain whose action caused the kernel to run (pzdomaindibp)

Apply {an address to a key} - see (p1,applyaddr)

Allocation count - (allcount)

Allocation pot - (allocation-pot)

ALSEGTAB - (alsegtab)

Assertions - Introduction: (ass-intro), (inv)

Backchain order - (bchorder)

Background Keys - (bgkey), (background), (backimp), (bmemalgor), (seapl)

BACKMVCL - (backmvcl)

BDEVICE - BDEVICE key service (bdevice)

Benchmark results - (benchmark)

BORROW {ext. spec.} - (borrow)

BORROW {int. spec.} - (borrow-logic)

BUSYIO - DEVICEIO routine (busyio)

BUSYSNS - DEVICEIO routine (busysns)

BWAIT keys - (bwait)

Call count - (callcount)

CALLDUMP - Take kernel dump and IPL (calldump)

CDA {coded disk address} - (cda)

Charge sets - (charge-set)

CHECK the routine - (check)

CHECKDEP - (checkdep)

CHECKINT - (checkint)

CHECKOFF - (checkoff)

Checkpointing - (checkpoint)

Checkpoint header page - (checkpoint-header)

CHECKPSW - (pswnotchecked)

CHRGE - (chrge)

CKPTSTLL - (ckptstll)

CLEAN - (clean)

Clean Me First - (gcleanmf)

CLRALLTB - (clralltb)

CLRPAGTB - (clrpagtb)

Coded Disk Address - (cda)

Core Table - (coretable)

Corelock - lock for a page of a node (corelock)

Crash Log - (crash)

CSWINCP - Internal routine in GINTDSK (gintdsk-cswincp)

DAT - Dynamic Address Translation, a 370 hardware term

Dataforap - A part of the backup directory (dir-part)

DEPEND - Intro: (depndintro), to call: (dependc)

DEPENDSAFE - (dependcl)

DEST PAGE SET UP - (setupdestpage), (argument-predicates)

DETPAGE - (detpage)

Device Driver Level Locks - (driver-locks)

DEVICEHD - Halt I/O involving page (for range_key__sever) (devicehd)

DEVICEIO - Module and entry point for non-kernel I/O (deviceio)

DEVICELOCK - Main lock for device block (devicelock)

DEVICEQLOCK - Spin lock for DEVREQ queue off device block (deviceqlock)

DEVINT - Unsolicited interrupt from non-kernel device (devint)

DIB - Derivative Information Block - (dib)

Disk Directory - (disk-directory)

Disk I/O - (disk)

DOIOINT - (doioint)

DOIPL - Simulate IPL button (doipl)

Domain - (domain)

DOPAGTEX - (trex)

DOPROTEX - (doprotex)

DOSEGTEX - (trex)

DPSU - (primcom)

Dry Run - (dry-run)

DUMMYINT - (dummyint)

Dumps - (dump)

EMPTSTAL - (emptstal) implemented in (scafold-emptstal)

ENDUPTO - (gintdsk-endupto)

ENQMVCPU - (enqmvcpu)

ENQUEDOM - (enquedom)

ENSUREDESTPAGE - (primcom)

ENSURRET - ENSURERETURNEE (ensurret)

ENTRY - (entry)

EXIT - (exit)

Exit Zapping - (callcount)

EXSEGTAB - (exsegtab)

EXTBWAIT - (extbwait)

EXTERNAL - (external)

EXTKWAIT - (extkwait)

EXTWORRY - (extworry)

EXTWSTRT - (extwstrt)

EXTWUKIW - (extwukiw)

FINDNOD3 - (findnode)

FINDPAG3 - (findpage)

FLIH - First Level Interrupt Handler

Forsaken - attribute of a segment or page table (forsake)

FREESTOTES - Internal to SHOVE (freestotes)

GATEINIT - (gateinit)

GBADDMNT - (gbaddmnt)

GBADLOG - (gbadlog)

GBADPAGE - Bad disk page module. See (gbadpage) and (gbadpage-module) for module information

GBADREAD - (gbadread)

GBADREWT - (gbadrewt)

GCCBCS - (gccbcs)

GCCWAPND - (gccwapnd)

GCCWBULD - (gccwbuld) see (gccwbuld-module) for module and entry point information.

GCKDECPC - (gckdecpc)

GCKFCKPT - (gckfckpt)

GCKINCPC - (gckincpc)

GCKTKCKP - (gcktkckp)

GCLADD - (gcladd)

GCLBUILD - (gclbuild)

GCLCTPN - (gclctpn)

GCLFREZE - (gclfreze)

GCLEANMF - Clean me first (gcleanmf) and (gcleanmf-entry)

GCOCKRD - (gcockrd)

GCODIDNT - (gcodidnt)

GCODISMT - (gcodismt)

GCODMPR - (gcodmpr)

GCODSE - (gcodse)

GCODSKOF - (gcodskof)

GCOENQDR - (gcoenqdr)

GCOFBNR - (gcofbnr)

GCOIMPR - (gcoimpr)

GCORCR - (gcorcr)

GCORECLP - (gcoreclp)

GCOREENQ - (gcoreenq)

GCOREQA - (gcoreqa)

GCOSELDQ - (gcoseldq)

GCOSSE - (gcosse)

GDDACCW - (gddaccw)

GDDACDRQ - (gddacdrq)

GDDACREQ - (gddacreq)

GDDDOVV - (gdddovv)

GDDENQ - (gddenq-entry)

GDDIOTYP - (gddiotyp)

GDDRAEC - (gddraec)

GDDRCP - (gddrcp)

GDDRDR - (gddrdr)

GDDRECCW - (gddreccw)

GDDREENQ - (gddreenq)

GDDSENSE - (gddsense)

GDDSTPC - (gddstpc)

GDIMIGR2 - (gdimigr2)

GDIMIGR3 - (gdimigr3)

GDIMIGR4 - (gdimigr4)

GDINDE6 - (gdinde6)

GDINTSDR - (gdintsdr)

GDIPMID - (gdipmid)

GDIREDRQ - (gdiredrq)

GDIREMBK - (gdirembk)

GDIRESET - (gdireset)

GDISET - (gdiset)

GDISETVZ - (gdisetvz)

GDISWAP - (gdiswap)

GDIVERRQ - (gdiverrq)

GET - (get)

GETBVN - (getbvn)

GETBVP - (getbvp)

GETCFCS - (getcfcs)

GETENDCL - (getendcl)

GETENDED - (getended)

GETENDSU - (getendsu)

GETENQIO - (getenqio)

GETFPIC - (getfpic)

GETLOCK - (getlock)

GETMNTIS - (getmntis)

GETNODE - (getnode)

GETPAG - (getpag)

GETPAGE - (getpage)

GETQNODP - (getqnodp)

GETREDRQ - (getredrq)

GETREQAP - (getreqap)

GETREQBA - (getreqba)

GETREQBN - (getreqbn)

GETREQBP - (getreqbp)

GETREQHP - (getreqhp)

GETREQN - (getreqn)

GETREQNM - (getreqnm)

GETREQP - (getreqp)

GETREQPM - (getreqpm)

GETREREQ - (getrereq)

GETRET - (getret)

GETSUBVP - (getsubvp)

GETSUCAP - (getsucap)

GETSUCNP - (getsucnp)

GETSUVZP - (getsuvzp)

GETUNLOK - (getunlok)

GINBCPI - (ginbcpi)

GINBERI - (ginberi)

GINBRI - (ginbri)

GINBSI - (ginbsi)

GINBVVI - (ginbvvi)

GINGETDA - (gingetda)

GINIOICP - (ginioicp)

GINIOIDA - (ginioida)

GINIOTYP - (giniotyp)

GINITIS - (ginitis)

GINSE - (ginse)

GINUDSKI - (ginudski)

GINVVE - (ginvve)

GIPL - (gipl)

GJOURNAL - (gjournal)

GMEASURE - (gmeasure)

GMIGRATE - Module (gmigrate-module) and entry point (gmigrate)

GMIMAP - (gmimap)

GRANGET - (granget)

GRESTART - (grestart-module)

GRSRUNOK - (grsrunok)

GRSTART - (grstart)

GRTADD - (grtadd)

GRTBRD - (grtbrd)

GRTCCTSL - (grtcctsl)

GRTCDAP0 - (grtcdap0)

GRTCDAP1 - (grtcdap1)

GRTCHDRL - (grtchdrl)

GRTCLEAR - (grtclear)

GRTCRL - (grtcrl)

GRTFADFP - (grtfadfp)

GRTFDFCC - (grtfdfcc)

GRTFNDRG - (grtfndrg)

GRTFPUD - (grtfpud)

GRTHOMEN - (grthomen)

GRTHOMEP - (grthomep)

GRTHOMWL - (grthomwl)

GRTMOR - (grtmor)

GRTMRO - (grtmro)

GRTNEXT - (grtnext)

GRTNMP - (grtnmp)

GRTNPLEX - (grtnplex)

GRTSLEDI - (grtsledi)

GRTSTODL - (grtstodl)

GRTSYNCD - (grtsyncd)

GRTUPGRD - (grtupgrd)

GRTUPPDR - (grtuppdr)

GSPACE - (gspace)

GSPCLNOD - (gspclnod)

GSPDETPG - (gspdetpg)

GSPGNODE - (gspgnode)

GSPGPAGE - (gspgpage)

GSPMNFA - (gspmnfa)

GSPMPFA - (gspmpfa)

GSWCKMP - (gswckmp)

GSWDESS - (gswdess)

GSWFBEST - (gswfbest)

GSWINSS - (gswinss)

GSWNBEST - (gswnbest)

GSWNEXT - (gswnext)

GSWNUMSD - (gswnumsd)

GSWPPOD - (gswppod)

GSWRESET - (gswreset)

GSWSCSAA - (gswscsaa)

GUINTAP - (guintap)

GUINTCR - (guintcr)

GUINTNP - (guintnp)

GUINTPD - (guintpd)

GUPDPDR - (gupdpdr)

GWRTBOOT - (gwrtboot)

HALFPREP - (halfprep)

HARDSTOP - see (p3,hardstop)

HNDLJMPR - (hndljmpr)

Hoard - (decngst)

Hook - (hook)

ICKPT - (ickpt)

ICLEANL - (icleanl)

ICLOCK

ICM trick

ICOMDSK - (icomdsk)

IDEPEND - (idepend)

IDIBDE - (idibde)

IDIBDE6 - (idibde6)

IDIIRAP - (idiirap)

IDINAP - (idinap)

IDIRECT - (idirect)

IDLED - (idled)

IDLEX - (idlex)

IDLEZ - (idlez)

IDSKDVR - (idskdvr)

IET - (iet)

IEVICEIO - (ieviceio)

IEXTERNA - (iexterna)

IIMPLECS - (iimplecs)

IINTDSK - (iintdsk)

IJOURNAL - (ijournal)

IMIGRATE - (imigrate)

INIT - (init)

INITIS - (initis)

Input - Output - (io)

Invariants - (inv)

Involved Keys - (involved)

INVOLVEDR - Involved (don’t read) (involvedr)

INVOLVEDW - Involved (don’t write) (ob)

INVOLVEN - (involven)

INVOLVEP - (involvep)

IOICCCC2 - (ioicccc2)

IOIDEVLK - (ioidevlk)

IOIMKCUB - (ioimkcub)

IOIMKNR - (ioimknr)

IOINTER - (iointer) for function and (iointer-module) for module information

IOIPTHOF - (ioipthof)

IRANGET - (iranget)

ISCHED - (isched)

ISHOVE - (ishove)

ISPACE - (ispace)

ISWAPA - (iswapa)

IUCV - (iucv)

IUPDPDR - (iupdpdr)

IVCLOCK - (ivclock)

JDATA, JCHRGSET, JDEVICE, JDOMAIN, JFETCH, JHOOK, JMETER, JMISC, JNODE, JNRANGE, JPAGE, JPRANGE, JSEGMENT, JSENSE - (jprim)

Journalizing - (journalizing)

JUMPTYPE - (jumptype)

KDIAG - (kdiag)

KEEPJUMP - (keepjump)

Kernel-Read-Only state - (kro)

KERRLOG - (kerrlog)

Key - (keys)

KEYJUMP - (keyjump)

KEYTABLE - A macro to dispatch on keytype. (keytable)

KEYTODSK - (keytodsk)

KEYTONOD - (keytonod)

KSSTALL - (ksstall)

LESSTHAN - An assertion macro (lessthan)

Locks - General ideas (locks), (locksx), (ncl) and (io-locks)

Device Driver Level Locks - (driver-locks)

Paging Level Locks - (paging-locks)

LOGERROR - (logerror)

LOGINCCH - (logincch)

LOGMOUNT - (logmount)

LOGOBR - (logobr)

LOGTRACE - (logtrace)

LOOK1 - (look1)

MAKEKRO - MAKE Kernel Read Only (makekro) (kro)

Mapping Table - Collective term for page table and segment table.

MEMCHECK - (memcheck)

MEMCHKEY - (chmem)

MEMFAULT - (memfault)

Memory:

Memory Code (mem-code)

MEMTRECH - (memtrech)

MEMZAP - (memzap)

Meters - (borrow), (metalgo)

MIDFLT - (midflt)

MIDJUMP - (midjump)

Migration - (migrator)

Multi Processing - (multi)

NFCORELOCK - Keep this node in this node frame - (nfcl)

Nodepot - (node-pot)

NORESTRT - (norestrt)

Null Job - (null-job)

Obscure - Slot recorded in DEPEND (obscure)

P2NODE - (jumptype), (p2n), (p2n2), (p2n3)

P2SWITCH - (p2s)

Pack descriptor record - (pack-disc)

Paging - Swapping (swap)

Paging Level Locks - (paging-locks)

PAGTREX - (trex)

Parameter string setup - (parameter-setup)

PCOMRTN - (primcom)

PCOMPR0 - (pcompr0)

PENDVAL - (pes)

Performance Tuning - (table-size)

PERSVC - (persvc)

Pots

Node pots - (node-pot)

Allocation Pots - (allocation-pot)

Prepared Key - (prepkeys)

Prepared Node - (prepnode)

PREPDOM - (prepdom)

PREPKEY - (prepkey)

PREPLOCK - Don’t change preparation mode - (nfpl)

PREPWRIT - (prepwrit)

PRIMRET - (primret)

PRIMSR - (primsr)

PRIRETK0 - (priretk0)

Produce - (produce), (prod-chains)

PROGINT - (progint)

Proofs - (proof)

PSWNOTCHECKED - Bit in READINESS (pswnotchecked)

PTFIRST - (ptfirst)

PTSTATE - (pt-age), (ptstate)

PUTAWAYD - (putawayd)

PZDOMAINDIBP - Pointer to DIB of actor (pzdomaindibp)

Queues {of domains} - (queues)

Ranges - (granget)

Real Storage Allocation - (real-stor)

Reclamation - (forsake) for mapping tables, (pt-age) for page tables, (map-age) for ...., (nfpl) for node frames.

Record Mode - (record)

READINESS - Byte in DIB: Reasons not to run. (readiness)

REFARGLENGTH - assertions: (arg-ready), intro: (getarg)

REFREFARGUMENT - See REFARGLENGTH

RELARGPG - (relargpg)

RESETKRO - (resetkro)

RESSAME - (ressame)

RESTART - (restart)

RESTPER - (restper)

RESUMEI - (resumei)

RESUMEJ - (resumej)

RESUMEQ - (resumeq)

RETBYTESTRING - (primcom), (rbs)

RETCACHE - (retcache)

Retry - (retry)

Returnee - The domain gaining control after a jump to primary key

RUNDOM - (rundom)

RUNIDLE - (runidle)

RUNIT - (runit)

RUNMIGR - (runmigr)

Safe - Hypothetical transformation of kernel state - (safe)

SCAVENGE - (scavenge)

Scheduler - (scheduler)

SCSADDP - (scsaddp)

SCSINVKY - (scsinvky)

SCSNODES - (scsnodes)

SCSPAGES - (scspages)

SCSRESET - (scsreset)

SCSUNINV - (scsuninv)

SCSUNLK - (scsunlk)

SEARCHPORTION - (seap)

SEGTREX - (trex)

SETUPPER - (setupper)

Sever - (allcount)

SETUPDESTPAGE - (parameter-setup)

SHOVE - Manage segment table space (shove)

Shrapnel - Phenomenon of DEPEND (good-shrapnel)

SHRINK - (shrink)

SIMPLEKT - (simplekt)

SLOTZAP - (slotzap)

SLOWMIGR - (slowmigr)

SOURCEPAGECTE - (getarg)

SRCHBVOP - (srchbvop)

SRCHNBVP - (srchnbvp)

SRCHNODE - (srchnode)

SRCHPAGE - (srchpage)

SST - (sst)

Stalling - (install), see (p1,stall)

STARTDOM - (startdom)

Storage Keys - (real-stor)

STORKEYISZERO - (storkeyrule)

STOPDISP - (stopdisp)

STOT - Segment Table Origin Table - (stot)

Allocation - (alsegtab)

Background key of - (bgkey), (bmemalgor)

Charge sets - (ch-stot), (clralltb)

DEPEND logic - (dep-stot), (dependc)

Expanding - (exsegtab)

Forsaken - (forsaken)

Free - (freest)

Produced by Segment Nodes - (prod-stot), (thix)

Reclamation - (rec-stot), (stot-dep)

References from DIB’s - (dib-stot)

Reference to Seg Table - (stot-seg)

Sever Seg Tab - (sst), (sst1)

Zapping - (stot-zap)

SUPERZAP - (superzap)

SVCINT - (svcint)

Swap Area - (swaparea)

Swap Area Directory, on disk - (disk-directory), in core - (core-directory)

Swapping - (swap)

Tied down - (tie)

TIMERDEQ - (timerdeq)

TIMERENQ - (timerenq)

Time Storage Formats - (tf)

Trunk - (memalgo)

TRYPREP - (tryprep)

Tymnet Today - (tiretore)

UNHOOK - (unhook) implemented in (scafold-unhook)

UNINV - (uninv)

Unit of Transformation - (transunit)

Unprepared Keys - (unprep)

UNPRKY - (unprky)

UNPRMET - (unprmet)

UNPRDR - (unprdr)

UNPRND - (unprndc)

UNPRSEG - (unprseg)

UNSETRET - (unsetret)

UNSETUPDUSTDESTPAGE - (primcom)

USENWPSW - (usenwpsw)

Valid - (valid)

WHENDEPEND - (whendepend)

Worrying - (worry)

XA - (xa)

Xvalid - (ptev)

ZAPKEY - (zapkey)

ZPSEGTAB - (zpsegtab)

GNOSIS

KERNEL LOGIC

Key Logic Proprietary
Key Logic

Second Edition

This Gnosis Kernel Logic manual obsoletes all previous versions.

This manual is intended both for the developers of Gnosis and for those interested in the internal logic of the Gnosis kernel. A user-oriented view is presented in the Gnosis External Specifications manual.

This document is one of a set continuously maintained online in the Key Logic “Keydoc” system. These design documents represent the current state of a continuously evolving system. Anyone needing current information should access the online version.

KEY LOGIC PROPRIETARY MATERIALS

These materials contain confidential and proprietary information which is the property of Key Logic. These materials may not be duplicated, displayed, disclosed or used, in whole or in part, without the prior written consent of an officer of Key Logic.