I suggested this division scheme to Seymour Cray late in the design of the 6600. It was too late in the design and was not used there. I then described it to some friends, including T.C. Chen, who was still at IBM, where I had worked in Poughkeepsie on the Stretch project. A couple of years later T.C. told me that the System/360 Model 95 used my scheme. The manual for that model specified that floating division might be one bit off. The follow-on machine, the Model 195, was unchanged in this area.

T.C. also told me later that Gene Amdahl had fixed my algorithm with an insight, obvious to me only in hindsight, but not soon enough for the 195. I describe Gene’s scheme here for it is manifestly error free. (Mine is sometimes one bit off.)

Normalize the denominator d and declare that 1 ≤ d < 2.
The high bit of d is 1, and the next M bits (M is about 8) are used to index into a table T of 2^{M} constants of M+1 bits each.
The table index, r = ⌊(d−1)·2^{M}⌋, is taken directly from bits in d.
Let N = T_{r} from the table.
The original division problem, n/d, is replaced by (Nn)/(Nd).
Neither Nn nor Nd is truncated, thus the new quotient is exactly the same as the old.
(I presumed truncating to original precision.)
Table values are construed as scaled fixed point numbers so that for all 2^{M} r values, 1/2 ≤ N < 1.
For each r, N is chosen so that 1 − 2^{−M} ≤ Nd < 1.
(I had assumed that they would be truncated to ‘appropriate precision’.)
For each r, the high bit of N is 1, which saves a bit in the table.

Meta comment: the above paragraph uses a mathematical view in which some values, such as d, are rational numbers between 1 and 2. That is self-consistent but inconsistent with other parts of this note. In either view, M+1 bits are required to express N.

We need 53 bits of quotient, and if we also declare that 1 ≤ n < 2, then q = n/d = (Nn)/(Nd) ≥ 1 exactly when n ≥ d. Thus we know early how many quotient bits to compute.

We approximate as we compute bits of the quotient, left to right.
High order bits may remain uncertain until the very end.
The value of nq (in both the Scheme and the C code) may exceed 2^{M}, and that may cause carries in the accumulation of the quotient.
It would be possible but slow to eliminate this uncertainty.
At the very end we do eliminate the uncertainty at the cost of one or perhaps a few full adds with carry.
Various hacks allow us to accommodate this uncertainty with a small hardware cost but no extra latency, until the very end.
I don’t know yet whether one must pay for more than one full long add to be certain of the last bit.

If |x| < 1 then 1/(1−x) = 1 + x + x^{2} + x^{3} ..., especially so for small x.

When 1 ≤ d < 2 we iterate, maintaining three quantities:

- r, a ‘remainder’ which changes each iteration; we may code low bits of r redundantly to shorten carry latency,
- the quotient, whose bits we are developing most significant first, about 8 bits per iteration,
- the count of how many more quotient bits we need to compute.

We need to make these quantities and their allowed ranges precise.

This scheme was inspired by a table of 128-digit natural logarithms I once saw. The logs of 10

This Scheme code seems to work, but it lacks all rounding and handles only positive operands. It does not handle subnormals, but was written with subnormals in mind. Information is preserved for rounding. Scaling is chosen for perspicuity.

This Scheme code rounds to nearest and compares with native float.
(function `gfdivr` produces a rounding to nearest version of divide.)
It includes a random test suite but does not probe corner cases.

This Scheme code has confusing shift counts, but the total shifts and consequent operand sizes are much smaller because right zeros are eliminated. Shift amounts depend on M, ad and rb. M and rb don’t depend on the operands; M is a design-time parameter, ad has just 2 operand-dependent values, and rb takes on just a short sequence of known values. They can be tabulated and then simplified by noting that (sl (sl x 1) −1) = x and (+ (sl x 1) (sl y 1)) = (sl (+ x y) 1).

This Scheme code seems better in all ways than the rest.
It runs the 10^{5} test suite with rounding without complaint.

This Scheme code is instrumented with worst case analysis in order to seek out bugs and cheats. The instrumentation is not pretty. It introduces a parameter z which is how many more than M bits of df we examine to compute N. For M=8 we get the following worst case reports for parameter z ranging thru {0, 1, 2}:

(#t #7(7583 5190 3623 2397 1513 945 511) . 42)
(#t #7(3180 2591 2034 1553 1148 812 511) . 15)
(#t #7(1993 1631 1453 1274 1004 745 511) . 8)

For M=7:

(#t #8(4922 3435 2416 1674 1110 721 480 255) . 56)
(#t #8(1934 1586 1277 994 778 568 399 255) . 18)
(#t #8(1192 1064 899 806 659 519 389 255) . 10)

For M=8 and z=2 this amounts to looking at 10 variable bits in d to select 8 variable bits for N. It seems that examining the denominator more closely pays off, but with diminishing returns. We can get the benefit of z=2 with interpolation.

This Scheme code performs well enough. Alas it requires an N with M+2 bits to guarantee M bits per cycle. For M=8, z must be at least 1.

This version has been slimmed down to remove some report code, worst case analysis, and table logic.

This C code (and notes) is a compact bare bones version that performs and confirms double division 10^{8} times in a few seconds.
(It passes 10^{9} tests.)
This version notes ranges of intermediate values and suggests invariants we should prove.
Curiously, the expressions Nd*nq and (r<<8) produce 72-bit values, but (r<<8) - Nd*nq fits nicely in 62 bits.
That too is to be proven.
No sweat!

The C code assumed that z=1.
In 10^{8} random cases this caused about 70 cases where the last ‘while’ in the C code ran three times.
The first such case was:

(let ((c (λ (x) (cons 1024 (string->number x 16)))))
  (div (c "dc3829aaf2f6d") (c "e085d0264e5b1")))

This was not found in the 10

Plan:

- Make honest test suite.
- Change scaling to be more like engineering version.
- Change arithmetic to do range stuff.
- Incorporate discovered ranges into version 2 as asserts.

Being chintzy, I use a table with 128 entries instead of 512 entries. Each entry has an extra 3-bit field (.sl) saying how the three omitted values differ: whether they jump, or take an extra jump. To decode this field I need yet another 2D table of 3-bit numbers. One index into this table is the .sl field from the main table; the other index is the low two bits of the current argument from the denominator bits.

Here is hairy code that computes, verifies and prints these two tables. It takes z=1. This is for z=2 and may not be practical. Judge for yourself. This makes probably fatal errors.

One way to reason about floating divide, producing 54 quotient bits plus an indication of inexactitude, is to take the two 53-bit fractions, n and d, as integers and reason about the entirely integer division (n∙2

a÷b is defined for b>0 and is the greatest integer q such that q∙b ≤ a.

r = a − q∙b and 0≤r<b.

The process produces approximations to q, high bits first. We continuously monitor whether q∙b ≤ a and keep r exactly.

- Either there are fewer than 53 bits in an exact quotient or an infinite number of them.
(whence we can ignore ‘to even’.)
(A subnormal quotient may be another problem.)
- A float denotes a sum of distinct powers of 2.
The largest exponent minus the smallest exponent is the width of that sum.
The widths of 4 and 0.125 are both 0.
The widths of 5 and 1.75 are both 2.
A width of the product of two numbers is the sum of their widths, or just one more.
If A/B can be computed exactly then there is a float value Q such that A = BQ.
The width of A cannot be less than the width of Q.
If the round bit is on, the width of the exact quotient is at least 53.
The width of A must then be at least 53, but the width of a floating value is never more than 52.
This is a contradiction; thus A/B is not exactly representable, and thus there are remainder bits which preclude the ‘round to even’ clause.

- Round to nearest never changes exponent. (I went thru a crude proof carelessly.)

Guess of critical path timing.