TWINKLE

(the dark) TWINKLE

Here is my understanding of Shamir’s hardware proposal. Here is RSA’s analysis of the system.

About 10⁵ LEDs (lights) each flashing with a different prime period. Two flashes per period.
A photo detector that reports a coincidence of many flashes.

Each unit has an LED and three ~20 bit registers, one to hold the constant prime for the unit and two others to count the common clock pulses and define the unit’s phase.

Some slow serial path thru the registers (scan-in, scan-out) is used to initialize and debug the system. The smallest distinct primes are used.

Shamir’s machine would truly beautiful to see. Alas perhaps it is not needed.

What’s the Problem?

This hardware will search some big range of consecutive numbers for those with a large set of small prime factors. I presume that the range that you search depends on the number that you are trying to factor. The length of range is about 10¹⁵.

This leads to solving the equation
x² = y² (mod n) where n is the number to be factored.
gcd(x−y, n) divides n. I am writing some notes on the algorithm.

Improvement One

A modified systolic array transformation works here. Connect the units into a tree designed for packaging convenience. Linking a unit to each subunit is a data link carrying a weighted factor count each clock cycle. These messages accumulate and travel towards the tree’s root, one message per link per clock. Replace the LED with an adder that accumulates the results of its subtrees for the previous clock time, plus its own contribution. Messages along these links are the weighted “small factor count” for some subset (sub-tree) of primes.

Clock distribution is the dual problem. The symmetric clock (50% duty cycle) would travel from the root towards the leaves. The only requirement on the clock is a fixed phase between a unit and its signal suppliers. That phase is very sensitive to the length of the links between those units. When traveling more than a few millimeters it might suffice to invert the clock. Some clever engineering may be necessary here. Also the clock would probably have to be repeated. I don’t know what is involved here. Some analog magic may be necessary to maintain the 50% duty cycle. It will probably take other magic to send the signal between chips, if it is not wafer scale.

We can process the counts in redundant form as in carry-save-add. This makes the systolic links wider by at most a factor of two. This is the trick makes multipliers fast. (Carry save add takes three binary numbers and reduces them to two in two levels of logic. Read about “cook”. {lump s = cook(a, b, c); assert(2*s.two + s.one == a+b+c);}
Two more levels of logic can shrink the link width to nearly log2(count) bits. This may pay off for the longer branches near the root where both the means (time) and incentive (space) exist.)

A candidate tree shape is a fractal given in the spring 1999 “The Mathematical Intelligencer”, page 21. The leaves uniformly fill the area of a rectangle. The rectangle corners are: (0, 0), (sqrt(2), 0), (sqrt(2), 1), (0, 1). The trunk runs from (sqrt(2)/2, 0) up to (sqrt(2)/2, 1/2) where it branches into the two trunks of the two similar sub trees which fill the bisection at x = 1/2 of the original rectangle into two similar rectangles.

This plan is somewhat more sensitive to bad spots than is Shamir’s. Bad spots degrade the hardware but only in proportion to their count.

Improvement Two(?)

The next step is rather easy to describe in broad outline but much harder to evaluate as there are many incomplete sub-designs.

Consider the data streams over the various data paths described above. They concern consecutive numbers and report some measure of how many small prime factors the number has. When we are at some sub tree of largish primes this stream will be mostly zeros. Such streams are a prime candidate for compression. In fact you compress the very calculation of the stream. The leaf node just adds its prime to its accumulator and reports the next multiple of that prime modulo 2³². Downstream nodes must merge these streams. The tree must run asynchronously! Indeed a simple computer can multiplex to serve the function of quite a few nodes. The optimal configuration is far from clear! This computing task reminds me of what I suggest for Google.

This idea rests on the plausible hypothesis that a memory cell is very much cheaper than a counter. This is true in commodity hardware, but it may not be true of the above designs. It is not clear where there are any real improvements in this direction. This second improvement is certainly more complex.

Acknowledgments

Shamir’s paper made clear the nature of the heavy lifting to be done. I had failed to find good descriptions before. Perhaps I had not looked far enough. Many of the novel aspects of Shamir’s design remain in this plan. Chris Hibbert has helped my debug these ideas.

More recent proposals: TWIRL which also extends TWINKLE and probably includes all the insights above.