Streams for Tymnet

The task was to allocate a single processor to pushing character streams thru a network of simple processing steps (nodes). The application required low latency and thus single character processing, and there was no buffering except as explicitly provided by one of the nodes. There were a dozen or so stream processing tasks that we needed and we foresaw a slowly growing set of new processes. The re-plumbing of these nodes was to be dynamic; it is not necessary to lose data to change plumbing.

I describe the implementation ideas in machine language for that is how they were conceived. I suspect that implementing these ideas in C might cost very substantial performance. I would love to be proven wrong. (2004: Perhaps I was wrong.) It is not clear to me how best to express this in C except to simulate a machine obeying pseudo machine language. I see now that the scheme amounts to an efficient use of continuations. Perhaps Scheme continuations could do the trick, but again not efficiently. Perhaps we could describe this code plan as heavy use of continuations in a context without garbage collection. And in Keykos?

As in SVR4 streams, the idea is to provide a small fixed set of stream transformations but with the ability to dynamically compose these into data-flow like networks of nodes. Each node in the net performs some particular transformation from the fixed set. Each node is implemented as a small memory block with:

the address of machine code that does the transformation which is the reason for this node,
the address(es) of the down stream nodes to which output stream(s) from this node are delivered,
any state that the node may have between characters in the stream.

The format of a particular node must conform to the transformation code located by the first word of the node.

The Varian Data Machines 620i had three registers, A, B and X which fit our scheme as if it had been designed for us. The entry point for each transformation routine assumed that X would locate the node, Register A would hold the next character in the input stream to be processed by the node and B would locate the node to pull whenever more characters from the stream were required. In other words A is the first character and B is a promise for the rest. B is sort of like Scheme’s continuation.

Each node pointer within a node would correspond to a conceptual character stream. Each time that pointer was used, by branching to the code that it located, the next character of that stream had its fleeting existence.

So far we have described the situation for nodes that produce one output for each input but we have not described “pull”. Some routines have pull entry points that assume no input character. A node that produces an infinite stream of ‘a’s will have only a pull entry point. Branching to that entry point produces yet another ‘a’. At such an entry point neither A nor B are meaningful. In all cases the first word of block for a node will locate the entry point for some transformation. Both push and pull entry points assume that X locates a node which will know where any yield is to be delivered.

The above is the idea in a nut-shell. Below are some examples of stream processing. Dispatch is merely shared code.

Convert to upper case

For this transformation the node comprises: the code address (see below) and the addresses of the downstream node. The code is:

if A holds a lower case letter, put corresponding upper case letter in A.
Go to (not call) Dispatch. Note that B does not change.

Stream of ‘a’s (no input).

Put ‘a’ in A.
Copy X into B.
Go to Dispatch.

Discard Stream (no output)

Copy B to X.
Go to Dispatch.

Stutter (Make “axb” into “aaxxbb”.)

The words of this node are
0: code address for transformation just below,
1: downstream node address,
2: place to keep character to be repeated,
3: place to keep promise for stream remainder,
4: code address to produce second instance of repeated character.

Store B at X + (3 words).
Put X + (4 words) into B.
Store A at X + (2 words).
Go to Dispatch.

The code whose address is located at X + (4 words) does:

Subtract 4 words from X.
Load A from X + (2 words).
Load B from X + (3 words).
Go to Dispatch.

Discard ‘x’s

If A doesn’t hold ‘x’, go to Dispatch.
Copy B to X.
Go to Dispatch.

Duplicate Stream

Just like stutter but there are two distinct down stream node pointers.

Merge characters alternately from two streams

Node state is address of pull node for “other input stream”.

Exchange B with word 2 of node.
Go to Dispatch.

Dispatch

Locate the address of the downstream node from X + (1 word), leaving new node address in X.
Locate the entry point of the new node, using (new) X.
Go to the new code.

Note that those transformations that go to Dispatch do so with the next character in A, the promise in B but the current node address in X. All nodes that point to such code have the downstream node address in word 1. (Most such nodes have an entry point address in word 0.)

There are probably several bugs in the above “code”. Note that most of the steps amount to one machine instruction. We average about 7 machine instructions per node per character processed, even on a severely “register starved” machine. Note also that in contrast to the presumably more efficient batching of blocks of characters, this scheme runs the character stream thru the processing nodes without placing them in RAM or even cache for each node.

Note that the code is entirely reentrant and uses no stack. It might be described as using register B as a link register and doing tail call optimization. I don’t know how to code the stutter routine that way however as that seems to require explicitly naming the continuation.

Control

Streams would be explicitly built by normal program logic, either nearby or remote. Changes in the plumbing in mid stream would not be supported except perhaps by in band signaling within the stream and with the explicit cooperation of one of the processing elements.

At the edge of the network of nodes would be buffers for comm links in the case of Tymnet. These buffers are to improve link efficiency at minimal cost to latency.

Determinism

It has recently been brought to my attention that schemes such as this might adhere to two different disciplines regarding nodes which merge two input streams. In the first scheme it is allowed that a node react to input data as upstream events deliver new characters for the stream. Indeed for some purposes such as multiplexing, this must be done. The second discipline is that a node not invite new input until it is ready. If the purpose of the node is to form an output stream alternating output characters from the two input streams, then the node must store one continuation. I have not explored the ramifications of mixing these two disciplines.

A similar dichotomy bears on a node with two outputs. The deterministic discipline would store the continuation from an incoming character among with the character, and send the character along to now of the down stream nodes with a continuation to this node. When control is regained the same character is sent to the other downstream node with the continuation originally received with this character.

The first discipline if adhered to by all nodes, leads to output streams whose content is not sensitive to order of events.

There may be design issues here to avoid deadlocks, but I have seen none.

In C the declaration of a routine can include the string “__dead2” or “__attribute__((__noreturn__))” which might make the compiler refrain from creating a stack frame or at least release its current frame as it calls a routine that does not return. Perhaps this is no better than faking routine patterns that enable tail call optimization; this is usually natural and perhaps even necessary.