2005 Caveat: I noticed that search engines were leading people to this page and so I read it closely once again. I found the sort of bug that can be difficult to find in a running system. It is referred to below. Towards the end there is a section titled With the Compare-and-Swap Instruction which I think works. I hope that anyone who finds bugs therein will let me know!

Imagine a circular buffer with a pointer for the writer and another for the reader, all in shared memory. There is just one reader and one writer and they may be running simultaneously on different CPUs. Both pointers are located so that fetches and stores of pointers are atomic and even when there are two CPUs, one doing a store and the other doing a fetch, the fetch is guaranteed to get either the old value or the new value intact. No special instructions, even user mode instructions are required. This is in part because for each pointer there is just one process (thread) that writes it and thus at most one CPU.

The first use of this construct that I saw was Seymour Cray’s small OS for the CDC 6600. Typically one of the buffer accessors was a “Peripheral Processing Unit (PPU)” which managed one IO stream at a time and did nothing but care for that one stream. It had thus no need to be diverted from other tasks to look again to see if the buffer allowed further progress. I do not recall how the main CPU task manager (a PPU) noticed that some task was no longer blocked by the state of some circular buffer. It almost certainly polled.

There remains the problem of what a program is to do when it reads the other’s pointer and discovers that it cannot proceed. At least in this case it is significant that the program is ahead of the game and may presumably assume a lower priority for a while. The program must generally relinquish its time slot if the CPU is shared with other tasks. If the CPU allocated to the task is shared among several tasks attending to circular buffers with similar characteristics, then a round robin scheduler would seem to do well.

A simple scheme to restart blocked threads quickly is as follows. When a participant becomes blocked it sets a flag in RAM that it shares with the other participant. It then waits on some synchronization mechanism and resets the flag upon resumption. Just after a participant updates its pointer it examines the flag of the other participant and, if it is set, signals the scheduler that the other thread should be scheduled once again.

The above fails in case of the following event order: (1) P1 discovers he is blocked. (2) P2 adjusts her pointer. (3) P2 notices that P1’s flag is clear (not blocked). (4) P1 sets his flag. Now P1 sleeps awaiting P2’s signal which does not arrive. One efficient fix to this is for a participant to re-examen the other’s pointer after setting the flag. If per chance it has been fixed, unset the flag. One must now understand the semantics of a message to the scheduler asking to wake a process that is not sleeping.

A convenient synchronizing primitive would be an object with an internal variable integer. One method on the object would O.set(n) which sets the variable to n. Another method O.wait(n) does not return until the variable reaches or exceeds n. Meanwhile the object would signal the scheduler so as to yield. This neatly avoids a class of races. For circular buffers the participant that discovers from the pointer states that he is blocked would leave a mark in shared memory and wait until the blocking pointer was sufficiently advanced. Upon awakening and noting that the blocking pointer was indeed advanced, it would undo the mark. Note that this process never waits unless the mark is made. Notice further that invoking set on a wait object is harmless if no one is waiting. The correspondent upon advancing the blocking pointer would notice the mark and notify the wait object of the new position.

In the example below I shall assume another trick. I choose the buffer size to be a power of two and map it twice in my address space in adjacent page frames at virtual address B. Portions of the buffer that wrap around the end thus appear contiguously in my space. The reader and writer share offset into the buffer and indeed only the low bits play a communications role. The parties to the buffer need not agree on its virtual address nor need they both employ the double mapping scheme. The example code is for one that does.

static char ctt(int a, int b, int c) {return (a<b) + (b<c) + (c<a) >= 2;}

void deposit(int bc, void * p){
  // We must deposit bc bytes.
  if(ctt(co&bsm, po&bsm, (po+bc)&bsm)) { // there is room for the bc bytes.
    while (bc--) *(B+(po++)) = *(p++);
    po &= bsm;
    return;}
  // We must stall.
} 
This may be inefficient on machines with virtual caches for the two addresses at which data appears even in one address space cannot be shared in the cache. See this note on such cache issues.
One way out of this morass is to note that when I discover that I cannot proceed that this is merely means that I am probably ahead of the game. The only thing I know for sure is that I was blocked. If the rate of progress thru the buffer were known within a factor of two it might be efficient to merely wait until there is about 1/2 a buffer to consume (or 1/2 a buffer to fill). This may be better if the overhead cost in switching tasks is large compared to the cost of a bigger buffer. This is poor for very bursty traffic and I will not pursue it further.

With the Compare-and-Swap Instruction

Suppose we share a 3 state variable X under compare-and-swap control. The states are:

  1. only I am running, you are blocked on pointers.
  2. only you are running, I am blocked on pointers,
  3. we are both running.
State 0 indicates system failure—can’t get there. X is in addition to the pointers but is modified only upon blockage.

After advancing my pointer I observe X. If you are not running (X=1) I set X = 3 and direct the scheduler to wake you.

When I am blocked by the pointers I update X atomically with map <1➝3, 2➝0, 3➝2>. Then depending on the initial state of X, I go three ways:

  • If X was 1, I set X = 3, wake you and resume thwarted pointer check.
  • If X was 2, we have a hardware failure or software bug and stop.
  • If X was 3, I yield, confident that you will wake me. When I awake I resume with my thwarted attempt to advance my buffer pointer.

    You perform the same algorithm with states 1 and 2 reversed.