48 bit multiplication

The IBM Stretch provided floating point arithmetic with 48 significant bits. I believe that the following is fairly close to how it multiplied. The CDC 6600 used almost exactly the same scheme but specified that only 8 bits beyond the least significant bit of the result were carried and those extra bits were used for rounding.

The first step was to multiply the multiplicand by 3 and assimilate the resulting carries. The multiplier was divided into four 12 bit hunks and each hunk was processed in successive (300 ns) clock cycles with the same hardware; low order hunk first. Each hunk was divided into four 3 bit nibbles all processed in one cycle but as if the following step had been taken to produce a 50 bit addend for each nibble in turn, low order nibble first:

If nibble=0 then select zero as addend.
If nibble=1 then select multiplicand as addend.
If nibble=2 then select 2*multiplicand as addend. (exploit binary nature)
If nibble=3 then select 3*multiplicand as addend as pre-calculated.
If nibble=4 then select −4*multiplicand as addend and ‘save carry’.
If nibble=5 then select −3*multiplicand as addend and ‘save carry’.
If nibble=6 then select −2*multiplicand as addend and ‘save carry’.
If nibble=7 then select −multiplicand as addend and ‘save carry’.
If nibble=8 then select zero as addend and ‘save carry’.

As a nibble was examined it was incremented by one if the prior nibble had produced a carry (thus the possible value 8). The carry was retained between hunks. An extra addend was needed if there was a carry after the last hunk. Negative addends are supplied as one’s complement which adds confusion but saves circuitry and delay. Of course the addend for successive nibbles were left shifted by 3*(nibble number) + 12*(hunk number).

The result of this was four 59 bit quantities to be summed in each cycle, plus the sum from the previous cycle. The previous sum was in redundant format and consisted thus of two 49 bit values. These six binary numbers were added to produce a redundant sum of two 59 bit values plus more bits to the right that might be discarded or might be retained for rounding consideration. (The Stretch arithmetic was checked modulo 3 and I think that the bits were discarded after contributing to the check.) These 6 addends were reduced to 2 addends with four carry-save adders in 6 levels of logic. After the four hunks were processed there was one more carry save stage to turn the three addends into two and then thru the normal adder.

The full floating multiply took 2.4 μsec or 8 clock cycles.

In C (using q.s from here) product = mul(multiplicand, multiplier).
We must reason here about carry-save add and the cook routine in the context of two’s complement negative numbers. We consider the fundamental rule:

{rp x = cook(a, b, c); assert(a+b+c == (x.hi<<1) + x.lo);}

In order that the left shift not lose a significant bit, the magnitude of x.hi must be suitably bounded, or the shift must occur after the value has been coerced into a higher precision context with sign extension. In this modest illustration the expression x.hi within (x.hi<<1) is coerced to a 32 bit int with sign extension, before the shift.

See this about routines divq, mulq, addq. llsrs is long long (128 bit) signed right shift. lllx is long long (128 bit) left shift. addq is 128 bit add. eqr is (128 bit) equality test. Given three bits of multiplier, sel returns addend as described above.

The 16 fold iteration (for(k=0; k<48; k+=3)) in routine mul accumulates the product as described above. The accumulator comprises these places whose sum accumulates the product:

sum.lo,
sum.hi,
dump,
a*carry. (carry is either 0 or 1; ‘a’ is the multiplicand.)

Every cycle, each of these is shifted three places to the right. At the end the low order bits of sum.lo and a*carry are weighted as 2^k, sum.hi as 2^k+1, dump as 2^k−34. The routine ck verifies the loop invariant. It is no mean feat to get the right answer to 0*4 with this code! (It takes a fancy algorithm to screw up 0*4!) Bits in variable ‘dump’ all have positive weight. The left-most bit in each of the two parts of ‘sum’ is a sign bit or has a negative weight, as you may wish to think. The code survived 10⁹ random tests.

There is so much code to optionally print and assert beliefs that the multiply logic is hard to find. Here is a version largely shorn of the cruft.

Here is how the Stretch divided.

This program draws this image which describes a Boolean function on the 4 high bits (excluding high bit) by which you can usually be sure of the exponent of the result of a floating multiply. For the red squares the exponent will emerge late.