Floating Point State

These are some ideas on how to implement Keykos and similar systems on a machine with a large floating point state which is slow to save and restore. They also apply to machines with vector state or other large application state that is closely bound to the processor. Some of these ideas are similar to solutions proposed for Cal-TSS. The same techniques may be even more useful for newer, larger optional user state such as vector registers.

Modern processors that I have studied recently have modes that deny access to the floating state. In such a mode user instructions that attempt such access are cleanly trapped so as to grant access “at the last possible instruction” as it were. A general Keykos inclination to postpone work until absolutely necessary seems to win here.

This user access bit (UAB) may already reside in a privileged register that must be restored for other reasons upon transition to user mode. This seems not to be true of the x86, but that makes the scheme only a bit slower.

In the following I use “image” to refer to the process state kept in RAM while the process is not running. The image is loaded into various registers as the process starts, and those parts that may be changed by execution of user mode instructions are preserved upon traps and interrupts of the process. The frequent kernel paths are coded as if there were no floating point hardware, and the kernel does not itself use floating point for its work. In particular the floating state is neither saved nor restored with the rest of the process image upon entry to and exit from the kernel.

When a domain is prepared its UAB image is set to deny access, regardless of the existence of floating values in the domain’s image. (The image is the data that is loaded from RAM to privileged registers by the kernel in preparing to begin obeying the user mode instructions of the domain, typically a register with a name such as “process state”.) Some domains (depending on the particular hardware architecture) may lack storage space for a floating point image.

If a user instruction attempts to access floating state, the kernel is entered and the domain is examined to see whether it is equipped with a floating point image. The domain gets a fault code if it lacks such floating values. If it has an FP image then the floating state is prepared (to use a Keykos kernel term) by allocating real storage to hold the image in a form which is quick for saving and restoring the real floating point registers, where the state lives during execution of the program. Of course the floating image may already be prepared. As with the domain’s DIB, this image can always be immediately reclaimed, as it merely reflects those floating values otherwise held in the swapped domain annex, just as a cache reflects RAM contents. After the floating image is prepared it remains to verify that the real hardware floating state for the processor is not already occupied by the only copy of some other domain’s state.
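The first half of that trap path (fault code, prepare, or already prepared) might be sketched as follows. This is only an illustration in user-space C: the record layout, the names `fp_trap_prepare`, `has_fp_annex` and `fp_image`, the size `FP_WORDS`, and the use of `malloc` as a stand-in for allocating real storage are all assumptions, not Keykos kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define FP_WORDS 32   /* hypothetical size of the floating state, in doubles */

/* Outcome of the trap taken when user code touches floating state. */
enum fp_trap_result {
    FP_FAULT_NO_IMAGE,   /* domain lacks floating values: deliver a fault code */
    FP_IMAGE_PREPARED,   /* image is in real storage, ready for the registers  */
    FP_NO_STORAGE        /* no real storage available right now (assumption)   */
};

/* Hypothetical domain record; every field name here is illustrative. */
struct domain {
    bool    has_fp_annex;      /* does the domain own space for floating values? */
    double  annex[FP_WORDS];   /* stands in for the swapped domain annex         */
    double *fp_image;          /* prepared image, or NULL when not prepared      */
    bool    uab_allows;        /* UAB image; it denies whenever this trap runs   */
};

enum fp_trap_result fp_trap_prepare(struct domain *d)
{
    assert(!d->uab_allows);              /* the trap only fires while UAB denies */
    if (!d->has_fp_annex)
        return FP_FAULT_NO_IMAGE;
    if (d->fp_image == NULL) {           /* otherwise it is already prepared */
        d->fp_image = malloc(sizeof d->annex);
        if (d->fp_image == NULL)
            return FP_NO_STORAGE;
        memcpy(d->fp_image, d->annex, sizeof d->annex);
    }
    return FP_IMAGE_PREPARED;
}
```

A second trap on the same domain finds the image already prepared and does no allocation, which is the cheap case the text expects to be common.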

For each real processor with floating hardware, there is a real storage location LF that locates the storage area allocated to hold the image that is now in the real hardware floating point registers. In design X that area will belong to the particular domain that has most recently obeyed floating point commands on this processor. In design Y that area will be shared by (a club of) domains which share floating state. LF will be null exactly when the real hardware state is already duplicated in RAM. Such state is left in the real hardware registers to optimize the case where only one domain (or club) is frequently accessing such state. In such cases floating state need not be saved or restored upon context switches. If LF indicates that the real registers are occupied, then LF says where that state must first be saved.

Prepared domains are in one of two states:

Their UAB denies.
Their UAB allows and a field in the DIB designates the real processor which holds the logical state for this domain. (That processor’s LF field will, in turn, locate the storage for this domain’s floating image.) If the domain is running and not just now occupying a real CPU, then it may be on a segregated ready queue (CPUQUEUE) for the particular processor that holds its floating state. This is a design decision not made here.

In design X there is at most one domain per processor with UAB set. In design Y there may be a list of them. The domains on this list share one floating state. Such a club must not run on another processor while its UAB allows. LF is often null and no save is required. After the possible save, the new state of the current domain may be loaded from the prepared floating area and LF is made to locate that area. The UAB for this domain is set to permit access and the interrupt ends by returning to the interrupted program.
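For design X, the save, load, and UAB steps just described can be simulated in a few lines of C. This is a sketch under stated assumptions: the hardware registers are modelled as an array, `user_fadd` stands in for any user floating instruction, and for brevity `lf` points at the owning domain rather than at its storage area. None of these names come from Keykos.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define FP_REGS 8   /* hypothetical register count */

struct domain {
    double image[FP_REGS];   /* prepared floating image in RAM */
    bool   uab_allows;       /* UAB image for this domain      */
};

struct cpu {
    double         fpregs[FP_REGS]; /* stand-in for the hardware registers      */
    struct domain *lf;              /* LF: domain whose state lives only in     */
                                    /* fpregs; NULL when RAM already has a copy */
    unsigned       saves, loads;    /* counters, for illustration only          */
};

/* Trap taken when domain d touches floating state while its UAB denies. */
void fp_access_trap(struct cpu *c, struct domain *d)
{
    assert(!d->uab_allows && c->lf != d);
    if (c->lf != NULL) {                       /* registers hold the only copy */
        memcpy(c->lf->image, c->fpregs, sizeof c->fpregs);
        c->lf->uab_allows = false;             /* at most one allowed domain   */
        c->saves++;
    }
    memcpy(c->fpregs, d->image, sizeof c->fpregs);
    c->loads++;
    c->lf = d;
    d->uab_allows = true;
}

/* Simulated user instruction: add x to floating register 0. */
void user_fadd(struct cpu *c, struct domain *d, double x)
{
    if (!d->uab_allows)
        fp_access_trap(c, d);   /* access granted at the last possible instruction */
    c->fpregs[0] += x;
}
```

If one domain calls `user_fadd` repeatedly there is a single load and no save, however many context switches intervene; a save happens only when a second domain first touches the registers.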

Another possibility is that the domain’s floating state currently resides in some real processor. In fact it may reside in the real processor that took the fault but kernel design X would avoid this. In design X we turn on the UAB just as we load the real floating state. In design Y we turn on a domain’s UAB only after the domain has been trapped for its being 0.

A surprising thing about this scheme is that domains can share floating state if that state lives in its own (sharable) domain annex. I don’t know if it is easier to allow (design Y) or prevent (design X) this. I cannot think of a compelling reason to allow it except as it may be simpler for the kernel. In this scheme it would seem unnecessary to consider the floating state to belong to the domain any more than the address segment belongs to the domain. (By belong I refer to the habit of saying that a domain (in 370 Keykos) consists of three nodes.)

Ramifications

If only one domain issues floating instructions over some span of time then that floating state will reside in the real registers and be neither saved nor restored. Other domains will be unable to access that state as their UABs will prevent it. If another domain should begin to use floating point then saves and restores will be minimized and will probably occur much less frequently than transitions to and from the kernel. In an SMP system there is the embarrassing possibility that a domain will issue a floating point instruction only to find that its floating values currently reside in another real processor. Probably the best thing to do here is to retire from the current processor and go to the head of the queue for the processor that already holds its state.
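The SMP suggestion can be made concrete with a small sketch. Everything here is hypothetical: the DIB field is called `fp_cpu`, the per-processor queue is a tiny array with slot 0 as its head, and setting `fp_cpu` stands in for actually loading the state locally.

```c
#include <assert.h>
#include <stddef.h>

#define NCPU   2
#define QMAX   8
#define NO_CPU (-1)

struct domain {
    int id;
    int fp_cpu;   /* processor holding this domain's floating state, or NO_CPU */
};

/* Tiny per-processor ready queue; slot 0 is the head. */
struct cpu_queue {
    struct domain *slot[QMAX];
    int            len;
};

static void queue_push_head(struct cpu_queue *q, struct domain *d)
{
    assert(q->len < QMAX);
    for (int i = q->len; i > 0; i--)   /* shift everyone back one place */
        q->slot[i] = q->slot[i - 1];
    q->slot[0] = d;
    q->len++;
}

/* Floating trap on processor `here`; returns the processor d continues on. */
int fp_trap_smp(int here, struct domain *d, struct cpu_queue queues[NCPU])
{
    if (d->fp_cpu != NO_CPU && d->fp_cpu != here) {
        queue_push_head(&queues[d->fp_cpu], d);   /* retire from this processor */
        return d->fp_cpu;
    }
    d->fp_cpu = here;   /* state is in RAM or already here: continue locally */
    return here;
}
```

Going to the head rather than the tail of the other queue keeps the penalty for the migrating domain small; whether that is fair to the domains already waiting is a policy question the text leaves open.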

Issues Outside the Kernel

It might seem that there remains the dilemma of buying the disk space for many domains that may not need the large state. This takes time and disk space. If it is not known whether the domain will need the large state, then the space for the state can be bought only when the program comes to the point of using the special commands. This would invoke a domain keeper which can be prepared to buy the space if that is the only problem. That domain keeper can even be shared rather like the Keykos “virtual domain keeper” (bad name) shared by many domains. Procrastination wins at all abstraction levels! (See “VDK” here.) If it was called it would install a debugger on the faulted domain.
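A shared keeper of this kind might look roughly like the sketch below. The fault codes, the `space_bank` record, the page count, and the function names are invented for illustration; the text does not give the real Keykos interfaces.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical fault codes; the real codes are not given in the text. */
enum fault { FAULT_NO_FP_STATE = 1, FAULT_OTHER = 2 };
enum keeper_action { RESTART_DOMAIN, CALL_NEXT_KEEPER };

struct space_bank { unsigned pages_left; };   /* stand-in for a space bank */
struct domain     { bool has_fp_annex; };

#define FP_ANNEX_PAGES 1u   /* assumed cost of the floating annex */

static bool bank_buy(struct space_bank *b, unsigned pages)
{
    if (b->pages_left < pages)
        return false;
    b->pages_left -= pages;
    return true;
}

/* The keeper holds no per-domain state, so many domains can share it. */
enum keeper_action fp_keeper(struct domain *d, enum fault f,
                             struct space_bank *bank)
{
    if (f == FAULT_NO_FP_STATE && bank_buy(bank, FP_ANNEX_PAGES)) {
        d->has_fp_annex = true;
        return RESTART_DOMAIN;    /* the faulting instruction is retried */
    }
    return CALL_NEXT_KEEPER;      /* not ours to fix, or the bank is empty */
}
```

Space is bought only for domains that actually reach a floating instruction, which is the procrastination the paragraph describes.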

The x86

The page references below are to the 2006 x86 manuals from Intel. Page 509 (sec 12.5.1) of “Volume 3A: System Programming Guide” describes these ideas, where bit TS in CR0 plays the role of UAB above. There is a CLTS instruction that clears the TS bit, presumably more efficiently than replacing the entire contents of CR0. This scheme seems to need to set CR0 on every switch, or at least to compare the value that belongs there with what you last put there if replacing it is slow. Some machines (at least non-Intel machines) do this compare themselves and attend only to bits that really change. Bits AM and NE of CR0 bear on the user experience and are perhaps changed occasionally as well. Vol. 1, Section D.3.6 (page 413) covers these ideas, but restricted to floating point register state. The SSE* architecture extensions use the new set of XMM registers introduced in Vol. 1, section 10.1.1 (page 280).
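The "compare before replacing" idea can be sketched as below. The bit positions (TS is bit 3, NE bit 5, AM bit 18) are those of the architecture, but the rest is an assumption for illustration: CR0 is simulated by a struct field, since a real write is a privileged MOV to CR0, and the helper names are made up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CR0_TS (UINT32_C(1) << 3)    /* task switched: FP/MMX/SSE use traps */
#define CR0_NE (UINT32_C(1) << 5)    /* numeric error reporting mode        */
#define CR0_AM (UINT32_C(1) << 18)   /* alignment mask                      */

struct cpu {
    uint32_t cr0;          /* simulated register; also the last value put there */
    unsigned cr0_writes;   /* counts simulated privileged writes                */
};

/* Value that belongs in CR0 while a given domain runs. */
uint32_t cr0_for_domain(uint32_t base, bool uab_allows, bool ne, bool am)
{
    uint32_t v = base & ~(CR0_TS | CR0_NE | CR0_AM);
    if (!uab_allows) v |= CR0_TS;   /* TS set plays the role of "UAB denies" */
    if (ne)          v |= CR0_NE;
    if (am)          v |= CR0_AM;
    return v;
}

/* Write CR0 only when the wanted value differs from what is there. */
void cr0_switch_to(struct cpu *c, uint32_t wanted)
{
    if (c->cr0 != wanted) {   /* compare first, in case the write is slow */
        c->cr0 = wanted;      /* a real kernel would do a ring-0 MOV here */
        c->cr0_writes++;
    }
}
```

When the only change is TS going from set to clear, a real kernel could use CLTS instead of the full write; this sketch does not model that shortcut.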

Page 263 of Vol. 1 states that the MMX state is aliased to the floating point state. Bit TS in CR0 disables MMX instructions. Vol. 3a, section 2.5 (page 67) describes the TS bit of CR0 as controlling access to SSE* state as well as floating point state. (Section 12.8.1 of Feb 2014 edition.) The text seems to use “task switch” as a technical term. See Vol. 3a, §6.3. §6.4.2 talks about task switches and introduces “task gate descriptor”. It recommends Chapter 5, Vol. 3b. Vol. 1, section 11.6.10.2 speaks of XMM state and task switches. Page 414 of Vol. 1 refers to Chapter 6 of Vol. 3a for more info on task switching. After an hour or so with Adobe Reader I sort of think that the TS bit in CR0 does its thing for the unified SSE* and floating states. I think that EROS and Capros do not use the task structures defined by the x86 architecture. I presume that bit TS in CR0 can be explicitly set.