You may just want to see the algorithm.
I suggested this division scheme to Seymour Cray late in the design of the 6600. It was too late in the design and was not used there. I then described it to some friends including T.C. Chen who was still at IBM where I had worked in Poughkeepsie on the Stretch project. A couple of years later T.C. told me that the system 360 model 95 used my scheme. The manual for that model specified that floating division might be one bit off. The follow-on machine, the model 195 was unchanged in this area.
T.C. also told me later that Gene Amdahl had fixed my algorithm with an insight, obvious to me only in hindsight, but not soon enough for the 195. I describe Gene’s scheme here for it is manifestly error free. (Mine is sometimes one bit off.)
Normalize the denominator d and declare that 1 ≤ d < 2. The high bit of d is 1 and the next M bits (M is about 8) are used to index into a table T of 2M M+1 bit constants. The table index, r = [(d−1)2M], is taken directly from bits in d. Let N = Tr from the table. The original division problem, n/d, is replaced by (Nn)/(Nd). Neither Nn nor Nd are truncated thus the new quotient is exactly the same as the old. (I presumed truncating to original precision.) Table values are construed as scaled fixed point numbers so that for all 2M r values, 1/2 ≤ N < 1. For each r, N is chosen so that 1 − 2−n ≤ Nd < 1. (I had assumed that they would be truncated to ‘appropriate precision’.) For each r, the high bit of N is 1, which saves a bit in the table.
Meta comment: The above paragraph uses a math theory that views some values such as d to be rational numbers between 1 and 2. That is self consistent but inconsistent with other parts of this note. In either view M+1 bits are required to express N.
We need 53 bits of quotient and if we also declare that 1 ≤ n < 2 we have q = n/d = (Nn)/(Nd) ≥ 1 just in case n≥d. Thus we know early how many quotient bits to compute.
We approximate as we compute bits of the quotient, left to right. High order bits may remain uncertain until the very end. The value of nq (in both Scheme and C) may exceed 2M and that may cause carries in the accumulation of the quotient. It would be possible but slow to eliminate this uncertainty. At the very end we do eliminate the uncertainty at the cost of one or perhaps a few full adds with carry. Various hacks allow us to accommodate this uncertainty with a small hardware cost but no extra latency, until the very end. I don’t know yet whether one must pay for more than one full long add to be certain of the last bit.
If |x| < 1 then 1/(1−x) = 1 + x + x2 + x3 ..., especially so for small x.
We need to make these quantities and their allowed ranges precise. Reference this
This Scheme code rounds to nearest and compares with native float. (function gfdivr produces a rounding to nearest version of divide.) It includes a random test suite but does not probe corner cases.
This Scheme code has confusing shift counts but total shifts and consequent operand sizes are much smaller by eliminating right zeros. Shift amounts depend on M, ad and rb. M and rb don’t depend on operands. M is a design time parameter, ad has just 2 operand dependent values and rb takes on just a short sequence of known values. They can be tabulated and then simplified by noting that (sl (sl x 1) −1) = x and (+ (sl x 1) (sl y 1)) = (sl (+ x y) 1).
This Scheme code seems better in all ways than the rest. It runs the 105 test suite with rounding without complaint.
This Scheme code is instrumented with worst case analysis in order to seek out bugs and cheats. The instrumentation is not pretty. It introduces a parameter z which is how many more than M bits of df we examine to compute N. For M=8 we get the following worst case reports for parameter z ranging thru {0, 1, 2}:
(#t #7(7583 5190 3623 2397 1513 945 511) . 42) (#t #7(3180 2591 2034 1553 1148 812 511) . 15) (#t #7(1993 1631 1453 1274 1004 745 511) . 8)For M=7:
(#t #8(4922 3435 2416 1674 1110 721 480 255) . 56) (#t #8(1934 1586 1277 994 778 568 399 255) . 18) (#t #8(1192 1064 899 806 659 519 389 255) . 10)For M=8 and z=2 this amounts to looking at 10 variable in d to select 8 variable bits for N. It seems that examining the denominator more closely pays off, but with diminishing returns. We can get the benefit of z=2 with interpolation.
This Scheme code performs well enough. Alas it requires an N with M+2 bits to guarantee M bits per cycle. For M=8, z must be at least 1.
This version has been slimmed down to remove some report code, worst case analysis, and table logic.
This C code (and notes) is a compact bare bones version that performs and confirms double division 108 times in a few seconds. (It passes 109 tests.) This version notes ranges of intermediate values and suggests invariants we should prove. Curiously the expressions Nd*nq and (r<<8) produce 72 bit values but (r<<8) - Nd*nq fits nicely in 62 bits. That too is to be proven. No sweat!
The C code assumed that z=1. In 108 random cases this caused about 70 cases where the last ‘while’ in the C code ran three times. The first such case was:
(let ((c (λ (x) (cons 1024 (string->number x 16))))) (div (c "dc3829aaf2f6d") (c "e085d0264e5b1")))This was not found in the 105 random cases explored in Scheme. This raises the requirement for a real proof. I think that the last ‘while’ cannot be eliminated but conceivably it can be reduced to at most one execution. (For z=3, in 15 cases from 108 random cases, the loop runs twice.)
Plan:
Being chintzy I use a table with 128 entries instead of 512 entries. Each entry has an extra 3 bit field (.sl) saying how the omitted three values differ—whether they jump, or take an extra jump. To decode this field I need yet another 2D table of 3 bit numbers. One index into this table is the .sl field from the main table. The other index is the low two bits of the current argument from the denominator bits.
Here is hairy code that computes, verifies and prints these two tables. It takes z=1. This if for z=2 and may not be practical. Judge for your self. This makes probably fatal errors.
a÷b is defined for b>0 and is the greatest integer q such that q∙b ≤ a.
r = a − q∙b and 0≤r<b.
The process produces approximations to q, high bits first. We continuously monitor whether q∙b ≤ a and keep r exactly.
Guess of critical path timing.