See this about managing the link.
The people here were rumored on Slashdot to have produced a demo, with inexpensive off-the-shelf equipment, of 26 Tb/s over a single fiber. It is daunting to terminate such a fiber, let alone put the streams into DRAM and back out again, for perhaps 5 streams in and 5 streams out. Here is a plan.
I looked at the DDR4 specs briefly and assumed that the time should soon come when for $10 you could get a chip with 16 data pins at 2 Gb/s each. I presume here that 8-chip DIMMs with 128 data pins will be available, but I will count DDR4 chips. The chip bandwidth is 2^35 b/s and 10 data streams is 2^48 b/s. That is 2^13 chips, which is about $100,000 for just enough chips to keep up with the 10 fire hoses, 5 in and 5 out. No conventional memory bus will keep up with this, but here is an unconventional hardware plan. First note that with the multi-bank DDR4 chips (DDR3 too, I think) the 60 ns access time can be overlapped with reading or writing data in previously opened banks. Also DRAM timing can be planned many clock cycles ahead of time.
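To make that arithmetic explicit, here is a back-of-envelope sketch in Python; the 26 Tb/s, 2 Gb/s per pin, 16 pins, and $10 per chip figures are just the assumptions stated above.

import math

pin_rate = 2e9              # 2 Gb/s per data pin, as assumed above
pins_per_chip = 16          # DDR4 chip with 16 data pins
chip_bw = pin_rate * pins_per_chip      # about 2^35 b/s per chip

fiber_rate = 26e12          # 26 Tb/s per fiber, the rumored demo
streams = 10                # 5 in and 5 out
total_bw = fiber_rate * streams         # about 2^48 b/s

chips = math.ceil(total_bw / chip_bw)
print("chip bandwidth  = 2^%.1f b/s" % math.log2(chip_bw))
print("total bandwidth = 2^%.1f b/s" % math.log2(total_bw))
print("chips needed    = %d, about $%d at $10 each" % (chips, chips * 10))

This prints 8125 chips and about $81,000, which rounds up to the $100,000 above.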
Here are the critical ideas that I think are compatible and together seem to me to provide a proof of concept:
Lambda switches or routers bear on these designs. Transport over lambda switches is like the freight consolidation business. It is tricky to keep the latency down given that you need central coordination of time slots on fibers.
Here is an older version which was here on 2012 Nov 11; I now make several ‘improvements’. (Some text is in both old and new.) Error control is by blocks. This is one plan. Special 10-bit codes may be inserted on the fiber if the memory bus logic is incapable of delivering data fast enough. Such codes are discarded at the receiving end and excluded from the error control. More ideas here.
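The fill-code rule amounts to the following receiver-side logic. This is only a sketch; the particular 10-bit fill value and the per-block check are stand-ins I made up, since the plan fixes neither.

import zlib

FILL = 0x3FF    # hypothetical 10-bit fill code; the plan does not fix its value

def receive_block(symbols):
    # symbols: the 10-bit codes of one block as they arrive from the fiber.
    # Fill codes, inserted when the sender's memory bus fell behind, are
    # discarded here and excluded from the error control.
    kept = [s for s in symbols if s != FILL]
    check = zlib.crc32(b"".join(s.to_bytes(2, "big") for s in kept))
    return kept, check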
Invitations and acknowledgments are by blocks. Quotas are by blocks. A block may carry several packets but the headers come at the beginning of the block. The headers include the money field, either the channel number or turn ops, and the length of the corresponding payload. The size of the collected payloads (CPLs) of a block is a multiple of some power of 2, agreed upon for a link, about 2^11 bits. CPLs are never larger than 2^19 bits. CPLs are seldom smaller than 2^18 bits. CPLs arrive in aligned uncached DRAM buffers of 2^19 bits. The hardware knows all of this. The headers are placed in the cache of the CPU. The path between the fiber and the DRAM is expensive and each fiber has DRAM devoted to it with sufficient bandwidth to capture the data and also feed it, via a crossbar, to other outgoing links. It is the job of software, and perhaps some special hardware, to control the crossbar so that the data gets to the right fiber, and perhaps color, as it leaves the switch.
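Here is a sketch of the block layout rules as I read them; the field names are mine, and the plan itself only says that a header carries the money field, a channel number or turn ops, and the payload length.

from dataclasses import dataclass

GRAIN = 2**11        # CPL sizes are multiples of this, agreed upon per link
BUF_BITS = 2**19     # aligned uncached DRAM buffers that receive CPLs

@dataclass
class PacketHeader:
    money: int             # the money field
    channel_or_turn: int   # either the channel number or turn ops
    payload_bits: int      # length of the corresponding payload

def check_cpl(headers):
    # The collected payloads (CPL) of one block must obey the size rules above.
    cpl = sum(h.payload_bits for h in headers)
    assert cpl % GRAIN == 0, "CPL must be a multiple of the agreed power of 2"
    assert cpl <= BUF_BITS, "CPLs are never larger than 2^19 bits"
    if cpl < 2**18:
        print("note: CPL below 2^18 bits (allowed, but seldom)")
    return cpl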
There are alternative sub-plans here:
If output to a fiber stalls for lack of headers, then it should put fill flags, or the equivalent symbol, on the fiber until resumption. Output of payloads to fibers should respect the wrap-around points of the input payload rings from which it gathers the data. How does it learn these sizes? If the hardware knew the size of the buffer (which power of 2), and the buffer were aligned on its size, then it could easily do the right thing. This size could easily be provided in the output packet header.
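If the buffer size is indeed carried as a power of two in the output packet header, and the buffer is aligned on its own size, the wrap-around logic collapses to bit masking. A sketch, with log2_buf standing in for that hypothetical size field:

def next_payload_addr(addr, advance, log2_buf):
    # Advance through an input payload ring of size 2**log2_buf that is
    # aligned on its own size.  The high bits identify the ring; the low
    # bits wrap around without any comparison against an end pointer.
    size = 1 << log2_buf
    base = addr & ~(size - 1)
    return base | ((addr + advance) & (size - 1))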
This hardware is a slightly fancy DMA. Such hardware could easily be adapted to moving payloads from DRAM to DRAM when some packet must be delayed. I imagine such a move operation being controlled by the equivalent of an output header ring which orchestrates the payloads in peril of being overwritten. Hardware nearly identical to the fiber output DMA would gather payloads that must be moved to avoid being overrun, and produce a stream of payloads that another piece of hardware, much like a fiber input DMA, would place in a more dynamically allocated contiguous holding area. It is not much more than a fiber link without the fiber. DRAM priority can be lower for such access. Software can move the headers.
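A software stand-in for that move engine, assuming a descriptor ring analogous to the output header ring; all of the names here are mine.

from dataclasses import dataclass

@dataclass
class MoveDescriptor:
    src: int      # DRAM offset of a payload in peril of being overwritten
    nbytes: int   # its length

def run_move_ring(ring, dram, holding_base):
    # Gather the listed payloads (fiber-output-DMA style) and lay them down
    # contiguously in the holding area (fiber-input-DMA style), returning the
    # new offsets so software can fix up the headers it keeps.
    new_offsets, cursor = [], holding_base
    for d in ring:
        dram[cursor:cursor + d.nbytes] = dram[d.src:d.src + d.nbytes]
        new_offsets.append(cursor)
        cursor += d.nbytes
    return new_offsets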
Another variation on the above idea is to move payloads that are queued to leave on a high-latency fiber into a DRAM holding area whose turnover time matches the fiber latency.
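Reading "turnover time matches the fiber latency" as a bandwidth-delay product, the holding area would be sized roughly like this; the 1000 km figure is only an example.

def holding_area_bits(fiber_rate_bps, latency_s):
    # Data queued for the link fills the holding area in about the time a
    # bit takes to cross the fiber, so the area turns over at the link's pace.
    return fiber_rate_bps * latency_s

# e.g. a 26 Tb/s link over ~1000 km of fiber (about 5 ms one way):
print(holding_area_bits(26e12, 5e-3) / 8 / 2**30, "GiB")   # ~15 GiB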