Counting clocks

Here we fantasize a minimum-latency double divide predicated on 8 quotient bits per cycle and 52 bits of CSA. I make ad hoc assumptions about logic levels per clock, but I try to make them explicit. First we compute N, a 9-bit reciprocal approximation, fetched from a 2^10-entry table lookup.
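
A small functional sketch of such a table, under my own assumptions about its construction (index by the top 10 fraction bits of the denominator in [1, 2), round 1/d at the interval midpoint to 9 bits):

```python
# Sketch of the 2^10-entry reciprocal table. The construction details
# (midpoint rounding, clamping) are my assumptions, not the note's.

def make_recip_table():
    table = []
    for i in range(1 << 10):
        # d lies in [1 + i/1024, 1 + (i+1)/1024); use the midpoint.
        d_mid = 1.0 + (i + 0.5) / 1024.0
        # Round 1/d to 9 fraction bits; clamp so N fits in 9 bits.
        n = min(round((1.0 / d_mid) * (1 << 9)), 511)
        table.append(n)
    return table

TABLE = make_recip_table()

def lookup_N(d):
    """d is a fraction in [1, 2); return N with N/2^9 ~ 1/d."""
    index = int((d - 1.0) * 1024) & 0x3FF
    return TABLE[index]
```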

We compare the fractions of the operands and subtract the operand exponents to determine the ultimate exponent. Also in that cycle we consider pulling the emergency cord due to a sub-normal denominator or a foreseen sub-normal quotient. Call that clock 0.
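
The clock-0 work might look like this (field extraction is standard IEEE-754 binary64; the flag names and the exponent test for the foreseen sub-normal quotient are my reading):

```python
# Sketch of the clock-0 checks for a/b, given raw 64-bit IEEE-754 doubles.
def clock0(a_bits, b_bits):
    EXP_BIAS = 1023
    ea = (a_bits >> 52) & 0x7FF
    eb = (b_bits >> 52) & 0x7FF
    fa = a_bits & ((1 << 52) - 1)
    fb = b_bits & ((1 << 52) - 1)
    # Tentative quotient exponent; one less when the numerator
    # fraction is smaller than the denominator fraction.
    eq = ea - eb + EXP_BIAS - (1 if fa < fb else 0)
    subnormal_denominator = (eb == 0)
    foreseen_subnormal_quotient = (eq <= 0)
    return eq, subnormal_denominator, foreseen_subnormal_quotient
```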

We compute Nd and 3Nd. Nd requires 10 logic levels plus a full add. 3Nd also requires a full add, and is delayed beyond Nd by 4 logic levels. Nf is needed soon too. That might all be squeezed into 3 clocks, but 4 is more likely.
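
As a fixed-point sketch (the integer scalings and the role of Nf as the iteration's starting remainder are my reading): 2Nd is free wiring, so 3Nd sits one full add behind Nd.

```python
def prepare_multiples(N, d_int, n_int):
    """N: 9-bit reciprocal approximation; d_int, n_int: operand
    fractions as integers. Returns the values the iteration carries."""
    Nd = N * d_int           # ~1.0 in fixed point; 10 levels plus a full add
    Nd3 = Nd + (Nd << 1)     # 3Nd = Nd + 2Nd; the shift is free, the add is not
    Nf = N * n_int           # prescaled numerator
    return Nd, Nd3, Nf
```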

Now we iterate the following 7 times serially. Between iterations we have the 63-bit r expressed as 11 high plain bits plus 52 low bits of CSA. q is carried somehow and rb is coded in some convenient way. Nd and 3Nd are carried along unmodified. We consult the high bits of r to get nq. We code the 10-bit nq according to this plan into one multiplier bit and 3 nibbles. This results in 4 augends and 2 r's from the previous iteration. We also accumulate nq into q, which causes a potential carry that we may have prepared for in the previous cycle by producing one q assuming a carry and another version assuming no carry. This feels doable in a clock.
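
A functional model of the recurrence, in exact arithmetic rather than carry-save, with fixed-point scalings of my own choosing (the note leaves them implicit). Here nq is simply the high bits of r, and the nibble decomposition of nq·Nd is left as a comment:

```python
F = 52    # operand fraction bits (assumed)
NB = 9    # reciprocal table output width

def divide_fractions(n_int, d_int):
    """n_int, d_int: fractions in [1, 2) as F-bit fixed-point integers.
    Returns q such that q / 2**56 approximates n/d from below."""
    N = (1 << (NB + F)) // d_int   # table stand-in: 9-bit reciprocal
    Nd = N * d_int                 # prescaled denominator, just under 2**(NB+F)
    r = N * n_int                  # Nf: the starting remainder
    q = 0
    for _ in range(7):             # 7 iterations, 8 quotient bits each
        r <<= 8
        nq = r >> (NB + F)         # digit selection: high bits of r
        # Hardware would form nq*Nd as one multiplier bit plus three
        # nibble-selected multiples drawn from {0, Nd, 2Nd, 3Nd}.
        r -= nq * Nd
        q = (q << 8) + nq          # digit accumulate (carry-select in hw)
    return q
```

Since Nd is within a relative 2^-8 of 2^(NB+F), each subtraction leaves r small, and the digits land near the 10-bit redundant range the note describes.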

There is not much to do at the end. Counting clock 0, the 4 clocks for the multiples, and the 7 iteration clocks, this is optimistically a 12-clock operation.

Total gate count

(OK, just the big data paths.) Each bit of CSA work takes 16 3-way AND gates and 4 4-way OR gates. I count 12∙52 = 624 bits for Nd and 3Nd. I count 7∙6∙63/2 for the 7 iterations; the half is for the fact that r shrinks in the case of a systolic divider. Without the systolic hack I count (624 + 7∙6∙63)∙16 = 3270∙16 = 52320 gate ops for a division. That is roughly 3 times a naïve 1954 implementation. The table costs something too.
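
The arithmetic, as a check (labels mine; only the AND gates enter the totals, as above):

```python
BIT_COST = 16                  # 3-way AND gates per CSA bit; the 4-way
                               # ORs are not counted in these totals
nd_bits = 12 * 52              # Nd and 3Nd
iter_bits = 7 * 6 * 63         # 7 iterations at full width
assert nd_bits + iter_bits == 3270
assert (nd_bits + iter_bits) * BIT_COST == 52320
# With the systolic hack, the iteration work halves:
systolic_bits = nd_bits + iter_bits // 2
assert systolic_bits * BIT_COST == 31152
```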