You may just want to see the algorithm.

I suggested this division scheme to Seymour Cray late in the design of the 6600. It was too late in the design and was not used there. I then described it to some friends including T.C. Chen who was still at IBM where I had worked in Poughkeepsie on the Stretch project. A couple of years later T.C. told me that the system 360 model 95 used my scheme. The manual for that model specified that floating division might be one bit off. The follow-on machine, the model 195 was unchanged in this area.

T.C. also told me later that Gene Amdahl had fixed my algorithm with an insight, obvious to me only in hindsight, but not soon enough for the 195. I describe Gene’s scheme here for it is manifestly error free. (Mine is sometimes one bit off.)

Normalize the denominator d and declare that 1 ≤ d < 2. The high bit of d is 1 and the next M bits (M is about 8) are used to index into a table T of 2M M+1 bit constants. The table index, r = [(d−1)2M], is taken directly from bits in d. Let N = Tr from the table. The original division problem, n/d, is replaced by (Nn)/(Nd). Neither Nn nor Nd are truncated thus the new quotient is exactly the same as the old. Table values are construed as scaled fixed point numbers so that for all 2M r values, 1/2 ≤ N < 1. For each r, N is chosen so that 1 − 2−n ≤ Nd < 1. (I had assumed that they would be truncated to ‘appropriate precision’.) For each r, the high bit of N is 1, which saves a bit in the table.

Meta comment: The above paragraph uses a math theory that views some values such as d to be rational numbers between 1 and 2. That is self consistent but inconsistent with other parts of this note. In either view M+1 bits are required to express N.

We need 53 bits of quotient and if we also declare that 1 ≤ n < 2 we have q = n/d = (Nn)/(Nd) ≥ 1 just in case n≥d. Thus we know early how many quotient bits to compute.

We approximate as we compute bits of the quotient, left to right. High order bits may remain uncertain until the very end. The value of nq (in both Scheme and C) may exceed 2M and that may cause carries in the accumulation of the quotient. It would be possible but slow to eliminate this uncertainty. At the very end we do eliminate the uncertainty at the cost of one or perhaps a few full adds with carry. Various hacks allow us to accommodate this uncertainty with a small hardware cost but no extra latency, until the very end. I don’t know yet whether one must pay for more than one full long add to be certain of the last bit.

If |x| < 1 then 1/(1−x) = 1 + x + x2 + x3 ..., especially so for small x.


When 1 ≤ d < 2 then for some integer r, 0≤r<2M and 1+r2−M ≤ d < 1+(r+1)2−M.

Sketch

We choose N and compute Nd in the beginning keeping all bits. Between iterations we have: The quotient bits are unsettled which means that new quotient contributions can produce carries that influence previously computed quotient bits.

We need to make these quantities and their allowed ranges precise. Reference this


This scheme was inspired by a table of 128 digit natural logarithms I once saw. The logs of 10n were given for 0<n<10 were given. The logs of integers 1 thru 9999 were given. Then for numbers of the decimal form 1.0000xxxx. Also 1.00000000xxxx, etc. With a mechanical calculator it was not too much work to factor a 128 digit argument into the form 10n∙xxxx∙(1.0000xxxx)∙(1.00000000xxxx) etc., and look up and sum the logs of each factor.
This Scheme code seems to work but lacks all rounding and only for positive operands. No subnormals, but with subnormals in mind. Information is preserved for rounding. Scaling is for perspicacity.

This Scheme code rounds to nearest and compares with native float. (function gfdivr produces a rounding to nearest version of divide.) It includes a random test suite but does not probe corner cases.

This Scheme code has confusing shift counts but total shifts and consequent operand sizes are much smaller by eliminating right zeros. Shift amounts depend on M, ad and rb. M and rb don’t depend on operands. M is a design time parameter, ad has just 2 operand dependent values and rb takes on just a short sequence of known values. They can be tabulated and then simplified by noting that (sl (sl x 1) −1) = x and (+ (sl x 1) (sl y 1)) = (sl (+ x y) 1).

This Scheme code seems better in all ways than the rest. It runs the 105 test suite with rounding without complaint.

This Scheme code is instrumented with worst case analysis in order to seek out bugs and cheats. The instrumentation is not pretty. It introduces a parameter z which is how many more than M bits of df we examine to compute N. For M=8 we get the following worst case reports for parameter z ranging thru {0, 1, 2}:

(#t #7(7583 5190 3623 2397 1513 945 511) . 42)
(#t #7(3180 2591 2034 1553 1148 812 511) . 15)
(#t #7(1993 1631 1453 1274 1004 745 511) . 8)
For M=7:
(#t #8(4922 3435 2416 1674 1110 721 480 255) . 56)
(#t #8(1934 1586 1277 994 778 568 399 255) . 18)
(#t #8(1192 1064 899 806 659 519 389 255) . 10)
For M=8 and z=2 this amounts to looking at 10 variable in d to select 8 variable bits for N. It seems that examining the denominator more closely pays off, but with diminishing returns. We can get the benefit of z=2 with interpolation.

This Scheme code performs well enough. Alas it requires an N with M+2 bits to guarantee M bits per cycle. For M=8, z must be at least 1.

This version has been slimmed down to remove some report code, worst case analysis, and table logic.

This C code (and notes) is a compact bare bones version that performs and confirms double division 108 times in a few seconds. (It passes 109 tests.) This version notes ranges of intermediate values and suggests invariants we should prove. Curiously the expressions Nd*nq and (r<<8) produce 72 bit values but (r<<8) - Nd*nq fits nicely in 62 bits. That too is to be proven. No sweat!

The C code assumed that z=1. In 108 random cases this caused about 70 cases where the last ‘while’ in the C code ran three times. The first such case was:

(let ((c (λ (x) (cons 1024 (string->number x 16)))))
  (div (c "dc3829aaf2f6d") (c "e085d0264e5b1")))
This was not found in the 105 random cases explored in Scheme. This raises the requirement for a real proof. I think that the last ‘while’ cannot be eliminated but conceivably it can be reduced to at most one execution. (For z=3, in 15 cases from 108 random cases, the loop runs twice.)

Plan:

  1. Make honest test suite.
  2. Change scaling to be more like engineering version.
  3. Change arithmetic to do range stuff.
  4. Incorporate discovered ranges into version 2 as asserts.
Here is how to do most of these long adds in two logic levels.

Table Logic

Computing N from df is a delicate natter. It requires computing 9 bits from 9 bits. The function maps [512, 1023] into itself. The input is the high 10 bits of the normalized denominator whose high bit is 1. It is a numerically well behaved function and interpolation in a table should serve well.

To be chintzy I use a table with 128 entries instead of 512 entries. Each entry has an extra 3 bit field (.sl) saying how the omitted three values differ—whether they jump, or take an extra jump. To decode this field I need yet another 2D table of 3 bit numbers. One index into this table is the .sl field from the main table. The other index is the low two bits of the current argument from the denominator bits.

Here is hairy code that computes, verifies and prints these two tables. It takes z=1. This if for z=2 and may not be practical. Judge for your self. This makes probably fatal errors.

Proof

I think out loud here. We need to establish an upper bound for r, the remainder, and also show that it never goes negative. Perhaps one bound will serve, or perhaps we will need 8 of them, one for each iteration. The remainder mutates thus r = (r<<8) - Nd*(r>>54);. We depend vitally on the fact that Nd is very nearly 262, indeed it perhaps suffices to know that the first 8 bits of Nd are all 1. I.e. 262 − 254 ≤ Nd < 262 which we prove elsewhere. Inductively we begin 0 ≤ r < 263. An upper bound for the new r is [max of (r<<8)] − [min of Nd*(r>>54)]. That is far too pessimistic for we ignore the fact that the value for r in the first bound is the same as in the second. We require a case analysis depending on 210 ranges for d considering that we use 10 bits of df to choose N. (M=8 & z=2) We also need to consider each discrete possible value of r<<8. Fortunately we have a computer. Current candidate. old, older.
One way to reason about floating divide, producing 54 quotient bits plus indication of inexactitude, is to take the two 53 bit fractions, n and d, as integers and reason about the entirely integer division (n∙254)÷d. We want 54 bits in order to reason about rounding. If n<d then we need (n∙255)÷d instead. We also note whether the remainder is zero.

a÷b is defined for b>0 and is the greatest integer q such that q∙b ≤ a.
r = a − q∙b and 0≤r<b.

The process produces approximations to q, high bits first. We continuously monitor whether q∙b ≤ a and keep r exactly.

Possible useful theorems

Guess of critical path timing.

Older intro