Some slow serial path thru the registers (scan-in, scan-out) is used to initialize and debug the system. The smallest distinct primes are used.
Shamir’s machine would truly beautiful to see. Alas perhaps it is not needed.
This leads to solving the equation
x2 = y2 (mod n) where n is the number to be factored.
gcd(x−y, n) divides n.
I am writing some notes on the algorithm.
Clock distribution is the dual problem. The symmetric clock (50% duty cycle) would travel from the root towards the leaves. The only requirement on the clock is a fixed phase between a unit and its signal suppliers. That phase is very sensitive to the length of the links between those units. When traveling more than a few millimeters it might suffice to invert the clock. Some clever engineering may be necessary here. Also the clock would probably have to be repeated. I don’t know what is involved here. Some analog magic may be necessary to maintain the 50% duty cycle. It will probably take other magic to send the signal between chips, if it is not wafer scale.
We can process the counts in redundant form as in carry-save-add.
This makes the systolic links wider by at most a factor of two.
This is the trick makes multipliers fast.
(Carry save add takes three binary numbers and reduces them to two in two levels of logic.
Read about “cook”.
{lump s = cook(a, b, c); assert(2*s.two + s.one == a+b+c);}
Two more levels of logic can shrink the link width to nearly log2(count) bits.
This may pay off for the longer branches near the root where both the means (time) and incentive (space) exist.)
A candidate tree shape is a fractal given in the spring 1999 “The Mathematical Intelligencer”, page 21. The leaves uniformly fill the area of a rectangle. The rectangle corners are: (0, 0), (sqrt(2), 0), (sqrt(2), 1), (0, 1). The trunk runs from (sqrt(2)/2, 0) up to (sqrt(2)/2, 1/2) where it branches into the two trunks of the two similar sub trees which fill the bisection at x = 1/2 of the original rectangle into two similar rectangles.
This plan is somewhat more sensitive to bad spots than is Shamir’s. Bad spots degrade the hardware but only in proportion to their count.
Consider the data streams over the various data paths described above. They concern consecutive numbers and report some measure of how many small prime factors the number has. When we are at some sub tree of largish primes this stream will be mostly zeros. Such streams are a prime candidate for compression. In fact you compress the very calculation of the stream. The leaf node just adds its prime to its accumulator and reports the next multiple of that prime modulo 232. Downstream nodes must merge these streams. The tree must run asynchronously! Indeed a simple computer can multiplex to serve the function of quite a few nodes. The optimal configuration is far from clear! This computing task reminds me of what I suggest for Google.
This idea rests on the plausible hypothesis that a memory cell is very much cheaper than a counter. This is true in commodity hardware, but it may not be true of the above designs. It is not clear where there are any real improvements in this direction. This second improvement is certainly more complex.
More recent proposals: TWIRL which also extends TWINKLE and probably includes all the insights above.