Hardware

I am an amateur concerning hardware, but here are some vague notions. I mostly presume 10 Gb/s links because I imagine that short links do not warrant expensive link termination hardware. 10 Gb/s is about the capacity of Intel's LightPeak technology. I am aware of 50 km Pb/s fiber bundles, but there was no quoted price on the equipment necessary to terminate such 10^15 b/s links, let alone switch the traffic. 10 Gb/s is also about the bandwidth of one DRAM chip. Such a chip can be filled or emptied in about 100 ms. I don't know whether lower-cost memory with that bandwidth but fewer bits is available. The DRAM prices are not a big issue, however. Juniper Networks' data center products use short 40 Gb/s links.
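A quick sanity check of the 100 ms fill time, as a sketch in C. The 10 Gb/s figure is from above; the ~1 Gb chip capacity is my assumption, since the text does not state it.

```c
#include <stdio.h>

/* Rough check of the "filled or emptied in about 100 ms" claim, assuming
 * a chip capacity around 1 Gb (2^30 bits) -- an assumption -- and the
 * 10 Gb/s of aggregate data-pin bandwidth mentioned above. */
int main(void) {
    double chip_bits = 1ull << 30;   /* ~1 Gb DRAM chip (assumed)        */
    double pin_bw    = 10e9;         /* 10 Gb/s data-pin bandwidth       */
    printf("fill/empty time: %.0f ms\n", chip_bits / pin_bw * 1e3); /* ~107 ms */
    return 0;
}
```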

See this about managing the link.

The people here were rumored on Slashdot to have produced a demo, with inexpensive off-the-shelf equipment, of 26 Tb/s over a single fiber. It is daunting to terminate such a fiber, let alone put the streams in DRAM and back out again, for perhaps 5 streams in and 5 streams out. Here is a plan.

I looked at the DDR4 specs briefly and assumed that the time should soon come when, for $10, you could get a chip with 16 data pins at 2 Gb/s each. I presume here that 8-chip DIMMs with 128 data pins will be available, but I will count DDR4 chips. The chip bandwidth is 2^35 b/s and 10 data streams are 2^48 b/s. That is 2^13 chips, which is about $100,000 for just enough chips to keep up with the 10 fire hoses, 5 in and 5 out. No conventional memory bus will keep up with this, but here is an unconventional hardware plan. First note that with the multi-bank DDR4 chips (DDR3 too, I think) the 60 ns access time can be overlapped with reading or writing data in previously opened banks. Also DRAM timing can be planned many clock cycles ahead of time.
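The chip-count arithmetic above, worked out in C; the $10 chip price, pin count, pin rate, and 26 Tb/s per stream are all taken from the text.

```c
#include <stdio.h>

/* Chip count and cost for 10 fire hoses:
 * 16 data pins * 2 Gb/s ~= 2^35 b/s per DDR4 chip,
 * 10 streams * ~26 Tb/s ~= 2^48 b/s total, $10 per chip. */
int main(void) {
    double chip_bw  = 16 * 2e9;        /* ~3.2e10 b/s ~= 2^35 b/s */
    double total_bw = 10 * 26e12;      /* ~2.6e14 b/s ~= 2^48 b/s */
    double chips    = total_bw / chip_bw;        /* ~8000 ~= 2^13 */
    printf("chips: %.0f, cost: $%.0f\n", chips, chips * 10.0);
    return 0;
}
```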

Here are the critical ideas that I think are compatible and together seem to me to provide a proof of concept:

Color Muxing

Each fiber is color-multiplexed down to some decent number of digital streams, and I don't think we need to be specific yet about how many. I suppose that there is some dispersion skew (red signals arrive before blue) but skew is old hat for 1960s tape controller designers. Assuming a 2 GHz system clock for conventional digital logic, input or output to one fiber is about 2^14 wires. (Ouch) It is not necessary for this hardware to be perfect because the software is in a good position to route around failing parts.
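The wire count follows from the fiber rate and the clock; a minimal sketch, assuming the 26 Tb/s fiber from the demo above and one bit per wire per clock.

```c
#include <math.h>
#include <stdio.h>

/* Wires needed to carry one fiber's traffic into conventional logic:
 * total fiber bandwidth divided by the 2 GHz system clock. */
int main(void) {
    double fiber_bw = 26e12;                 /* 26 Tb/s over one fiber   */
    double clock_hz = 2e9;                   /* 2 GHz digital logic clock */
    double wires    = fiber_bw / clock_hz;   /* ~13000                    */
    printf("wires: %.0f (~2^%.1f)\n", wires, log2(wires));
    return 0;
}
```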

Lambda switches or routers bear on these designs. Transport over lambda switches is like the freight consolidation business. It is tricky to keep the latency down given that you need central coordination of time slots on fibers.

Electronic (not optical) Bits

Blocks

Blocks are currently out of favor. Quotas must be worked back into the new schemes.

Here is an older version which was here on 2012 Nov 11; I now make several 'improvements'. (Some text is in both old and new.) Error control is by blocks. This is one plan. Special 10-bit codes may be inserted on the fiber if the memory bus logic is incapable of delivering data fast enough. Such codes are discarded at the receiving end and excluded from the error control. More ideas here.

Invitations and acknowledgments are by blocks. Quotas are by blocks. A block may carry several packets, but the headers come at the beginning of the block. The headers include the money field, either the channel number or turn ops, and the length of the corresponding payload. The size of the collected payloads (CPLs) of a block is a multiple of some power of 2, agreed upon for a link, about 2^11 bits. CPLs are never larger than 2^19 bits. CPLs are seldom smaller than 2^18 bits. CPLs arrive in aligned uncached DRAM buffers of 2^19 bits. The hardware knows all of this. The headers are placed in the cache of the CPU. The path between the fiber and the DRAM is expensive, and each fiber has DRAM devoted to it with sufficient bandwidth to capture the data and also feed it, via a crossbar, to other outgoing links. It is the job of software, and perhaps some special hardware, to control the crossbar so that the data gets to the right fiber, and perhaps color, as it leaves the switch.
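A sketch of what such a header and the CPL constants might look like in C. The field names and widths are my assumptions; the text only says a header carries the money field, either the channel number or turn ops, and the payload length.

```c
#include <stdint.h>

/* CPL sizing, per the agreed link parameters above. */
#define CPL_GRAIN_BITS (1u << 11)   /* CPL sizes are multiples of ~2^11 bits */
#define CPL_MAX_BITS   (1u << 19)   /* CPLs never exceed 2^19 bits           */

/* Hypothetical per-packet header leading a block; widths are assumptions. */
struct block_header {
    uint32_t money;          /* money field                    */
    uint32_t chan_or_turn;   /* channel number, or turn ops    */
    uint32_t payload_bits;   /* length of the matching payload */
};

/* Headers go to a cached queue for the CPU; the collected payloads land in
 * an aligned, uncached 2^19-bit DRAM buffer devoted to the fiber. */
```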

Packet Headers

The stuff above is a payload strategy. 8b/10b encoding, commonly used to code arbitrary bit streams into DC-balanced light streams, has several extra 10-bit codes left over after encoding all the 8-bit bytes. One of those extra codes precedes a header (money, SS, payload length and current location), and the hardware directs such headers to a cached queue for the software. A different extra code informs the hardware of the ensuing payload, which is directed to DRAM associated with the fiber and perhaps the color. The address chosen by the hardware for the payload is delivered to the software along with the header.
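A software model of how the link hardware might dispatch on those extra 8b/10b control codes. The code names and the stub queue/DRAM interfaces (enqueue_header_cached, write_payload_dram, next_payload_addr) are assumptions for illustration, not real hardware.

```c
#include <stdint.h>
#include <stdio.h>

enum ctl_code {
    CTL_HEADER,    /* the ensuing symbols form a packet header         */
    CTL_PAYLOAD,   /* the ensuing symbols form a payload               */
    CTL_FILLER     /* inserted when memory can't keep up; just discard */
};

static uint64_t next_payload_addr(void) { return 0x100000; }   /* stub */

static void enqueue_header_cached(const uint8_t *h, unsigned n, uint64_t a) {
    (void)h;   /* stub: a real queue would copy the header bytes */
    printf("header (%u bytes) -> cached queue, payload at %#llx\n",
           n, (unsigned long long)a);
}

static void write_payload_dram(uint64_t a, const uint8_t *p, unsigned n) {
    (void)p;   /* stub: a real DMA would copy the payload bytes */
    printf("payload (%u bytes) -> uncached DRAM at %#llx\n",
           n, (unsigned long long)a);
}

void dispatch(enum ctl_code c, const uint8_t *sym, unsigned len) {
    switch (c) {
    case CTL_HEADER:    /* header to cached queue, with its payload address */
        enqueue_header_cached(sym, len, next_payload_addr());
        break;
    case CTL_PAYLOAD:   /* payload to the DRAM devoted to this fiber/color  */
        write_payload_dram(next_payload_addr(), sym, len);
        break;
    case CTL_FILLER:    /* excluded from error control; nothing to store    */
        break;
    }
}

int main(void) {
    uint8_t hdr[12] = {0}, pay[64] = {0};
    dispatch(CTL_HEADER, hdr, sizeof hdr);
    dispatch(CTL_PAYLOAD, pay, sizeof pay);
    return 0;
}
```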

There are alternative sub-plans here:

In either case, special stream markers would precede both headers and payloads, and the hardware would switch between placing payloads in non-cached DRAM and headers in cached memory. I can't decide now. Perhaps the hardware could place the payloads on aligned boundaries if that is convenient.

Colors vs. Blocks

If a block went over one color:

Summary

This needs considerable additional detail. The buffers seldom see a cache. Perhaps some special code in the stream directs a buffer to cached memory. The manifest probably goes to cached memory.

ITU Standards

Interface to Program

Here are some ideas on how the hardware should interact with the software. The notion of a circular buffer is introduced here.

If output to a fiber stalls for lack of headers, then it should put flags or the equivalent symbol on the fiber until resumption. Output of payloads to fibers should respect the wrap-around points of the input payload rings from which it gathers the data. How does it learn these sizes? If the hardware knew the size of the buffer (which power of 2), and the buffer were aligned on its size, then it could easily do the right thing. This could be provided in the output packet header.
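A minimal sketch of why a power-of-2, size-aligned ring makes the wrap easy: the DMA only needs the log2 of the size (which could travel in the output packet header) and a mask. The function name and example values are illustrative.

```c
#include <stdint.h>
#include <stdio.h>

/* Advance an address within a ring that is 2^log2_size bytes long and
 * aligned on its own size: the base falls out of the alignment and the
 * offset wraps with a single mask. */
static uint64_t ring_advance(uint64_t addr, uint64_t step, unsigned log2_size) {
    uint64_t size = 1ull << log2_size;
    uint64_t base = addr & ~(size - 1);          /* ring base from alignment */
    return base | ((addr + step) & (size - 1));  /* offset wraps modulo size */
}

int main(void) {
    /* A 2^19-bit (2^16-byte) buffer: advancing near the top wraps to the base. */
    uint64_t a = 0x40000 + 0xFFF0;
    printf("%#llx\n", (unsigned long long)ring_advance(a, 0x40, 16)); /* 0x40030 */
    return 0;
}
```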

This hardware is a slightly fancy DMA. Such hardware could easily be adapted to moving payloads from DRAM to DRAM when some packet must be delayed. I imagine such a move operation being controlled by the equivalent of an output header ring which orchestrates the payloads in peril of being overwritten. Hardware nearly identical to the fiber output DMA would gather payloads that must be moved to avoid being overrun and produce a stream of payloads that another piece of hardware, much like a fiber input DMA, would place in a more dynamically allocated contiguous holding area. It is not much more than a fiber link without the fiber. DRAM priority can be lower for such access. Software can move the headers.
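One possible shape for the descriptors such a move engine would consume; all field names are assumptions for illustration.

```c
#include <stdint.h>

/* A hypothetical entry in the output-header-ring-like structure that
 * orchestrates payloads in peril of being overwritten: the move engine
 * copies each named payload from its input ring to the holding area. */
struct move_desc {
    uint64_t src_addr;      /* payload about to be overrun in its input ring */
    uint32_t payload_bits;  /* length, in the same units as packet headers   */
    uint64_t dst_addr;      /* slot in the contiguous holding area           */
};

/* Software fills a ring of these at its leisure; the move engine drains it
 * at lower DRAM priority than the fiber DMAs. Software moves the headers. */
```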

Another variation on the above idea is to move payloads that are queued to leave on a high-latency fiber into a DRAM holding area whose turnover time matches the fiber latency.
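Sizing that holding area is the usual bandwidth-delay product; a small sketch, where the 10 Gb/s rate is from the text and the 50 ms latency is an assumed example.

```c
#include <stdio.h>

/* Holding area for a high-latency fiber must hold roughly bandwidth * latency. */
int main(void) {
    double bw_bps = 10e9;      /* 10 Gb/s link           */
    double lat_s  = 0.050;     /* assumed 50 ms latency  */
    double bits   = bw_bps * lat_s;
    printf("holding area: %.0f Mb (%.1f MB)\n", bits / 1e6, bits / 8e6);
    return 0;
}
```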

X86 interlocks

See section 7.3.1.2 of x86-12-08.pdf: XCHG, XADD, CMPXCHG.
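For concreteness, here is how the three interlocked instructions show up from C. On x86 these C11 atomic operations typically compile to XCHG, LOCK XADD, and LOCK CMPXCHG respectively; the lock and counter are only an illustration of their use.

```c
#include <stdatomic.h>
#include <stdio.h>

static atomic_int lock_word = 0;
static atomic_int counter   = 0;

void spin_lock(void) {                        /* XCHG-style test-and-set */
    while (atomic_exchange(&lock_word, 1))
        ;                                     /* spin until we store the 1 */
}

void spin_unlock(void) { atomic_store(&lock_word, 0); }

int main(void) {
    spin_lock();
    int old = atomic_fetch_add(&counter, 1);  /* XADD-style fetch-and-add */
    spin_unlock();

    int expected = 1;                         /* CMPXCHG-style compare-and-swap */
    atomic_compare_exchange_strong(&counter, &expected, 2);

    printf("old=%d counter=%d\n", old, atomic_load(&counter));
    return 0;
}
```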