I am not a computer designer, but over the years I have learned the schemes that several fast computers have used to divide floating point numbers.

The naïve division (used in many early computers) was to repeatedly shift the remainder left one bit, compare it with the divisor, and, when the divisor fit, subtract it and record a 1 quotient bit (otherwise record a 0). This yields one quotient bit per cycle.
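A minimal sketch of that restoring scheme for non-negative integers (the bit width and the names here are mine, not any particular machine's):

```python
def restoring_divide(a, b, bits=16):
    """Naive restoring division: one quotient bit per iteration.

    For non-negative integers a < 2**bits and b > 0, returns (q, r)
    with a == q*b + r and 0 <= r < b, developing q one bit at a time.
    """
    assert b > 0
    q = 0
    r = 0
    for i in reversed(range(bits)):
        r = (r << 1) | ((a >> i) & 1)  # shift the next dividend bit into the remainder
        if r >= b:                     # trial subtraction fits:
            r -= b                     #   keep it and record a 1 quotient bit
            q |= 1 << i
    return q, r
```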

The LARC was a decimal machine. At the beginning of a divide it would compute two times the divisor (I recall) and in each clock cycle (500 ns) compare the divisor and twice the divisor with the current remainder. This produced two quotient bits per clock. It took two clocks per quotient digit. No carry propagation is required to multiply or divide a decimal number by two!
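The doubling trick takes only a few lines to see: under doubling, the carry into each decimal position is 1 exactly when the digit to its right is 5 or more, and never depends on another carry, so all the carries can be formed at once. A sketch, with digits stored most significant first:

```python
def double_decimal(digits):
    """Double a decimal number given as a list of digits (most significant
    first). Each position's carry-in depends only on its right neighbor's
    own digit (>= 5 or not), so no carry ever propagates."""
    carry_in = [1 if d >= 5 else 0 for d in digits[1:]] + [0]
    out = [(2 * d) % 10 + c for d, c in zip(digits, carry_in)]
    if digits[0] >= 5:                 # carry out of the top position
        out.insert(0, 1)
    return out
```

Note that a doubled digit is at most 8 before the carry-in, so adding the carry can never generate a second carry.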

The CDC 6600 did just about the same thing. Being binary, it produced two quotient bits per clock (100 ns).
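Two quotient bits per clock amounts to radix-4 division: each cycle picks a quotient digit in {0, 1, 2, 3} by comparing the shifted remainder against one, two, and three times the divisor (multiples a real machine would precompute). This is a sketch of the idea, not the 6600's actual circuit:

```python
def radix4_divide(a, b, digit_pairs=8):
    """Radix-4 restoring division: two quotient bits per iteration.

    For non-negative a < 4**digit_pairs and b > 0, returns (q, r) with
    a == q*b + r and 0 <= r < b.
    """
    assert b > 0
    q = 0
    r = 0
    for i in reversed(range(digit_pairs)):
        r = (r << 2) | ((a >> (2 * i)) & 3)          # bring in two dividend bits
        d = (r >= b) + (r >= 2 * b) + (r >= 3 * b)   # quotient digit 0..3
        r -= d * b
        q = (q << 2) | d
    return q, r
```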

Each cycle the Stretch (IBM 7030) did a table lookup using several high-order bits of the normalized remainder and divisor. The result was an approximation to the next several bits of quotient. I think it was guaranteed to produce four bits of quotient each cycle (300 ns) but would take advantage of up to nine bits when luck prevailed. It averaged about six bits per cycle. Here is a recent paper on this algorithm. It differs in some respects from what I remember, but I will leave my recollections in place. I can vouch for Dura Sweeney (the ‘S’ in ‘SRT’) being there during the design of the Stretch.
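The guess-and-correct idea can be sketched in software, with the hardware table modelled as a division of truncated operands. The digit width k, the table width t, and the correction loop below are illustrative assumptions, not the 7030's parameters:

```python
def lookup_divide(a, b, nbits=16, k=4, t=6):
    """Guess k quotient bits per step from high-order bits only, then correct.

    The 'table lookup' is modelled by dividing truncated operands; the
    estimate is never too high, so the fix-up only ever adjusts upward.
    Returns (q, r) with a == q*b + r and 0 <= r < b, for a < b * 2**nbits.
    """
    assert b > 0
    q, r = 0, a
    for i in reversed(range(0, nbits, k)):
        shift = max(b.bit_length() - t, 0)
        e = ((r >> i) >> shift) // ((b >> shift) + 1)  # table-style guess
        r -= (e * b) << i
        while r >= (b << i):          # guess was low: correct upward
            r -= b << i
            e += 1
        q |= e << i
    return q, r
```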

The Illiac II is rumored to have done non-restoring division without performing carry assimilation each cycle; in other words, the running remainder was expressed redundantly. I know no other details (not even these!). It seems they could have done carry assimilation on, say, the top ten bits so as to see farther ahead. (Wikipedia says that the Illiac II divide was designed by Robertson, who co-invented the SRT division algorithm. This fits my rumor.)
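Plain non-restoring division (without the rumored redundant remainder) looks like this: the quotient digits are +1 or -1, and a negative running remainder is carried forward into the next step rather than restored:

```python
def nonrestoring_divide(a, b, bits=16):
    """Non-restoring division with quotient digits +1/-1.

    Each step subtracts the shifted divisor when the remainder is
    non-negative and adds it otherwise; a final fix-up handles a
    remainder that ends negative. Returns (q, r) = divmod(a, b)
    for 0 <= a < b * 2**bits.
    """
    assert b > 0
    r = a
    q = 0
    for i in reversed(range(bits)):
        if r >= 0:
            r -= b << i
            q += 1 << i
        else:
            r += b << i
            q -= 1 << i
    if r < 0:        # final correction when the remainder ends negative
        r += b
        q -= 1
    return q, r
```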

The IBM/360 model 91 did a table lookup on the high bits of the divisor at the beginning of the divide. This provided a few-bit approximation, n, to the reciprocal of the divisor. Numerator and denominator were both multiplied by n so that the divisor of the resulting division problem would be close to one, which makes guessing quotient bits trivial. (Try dividing 3.141593 by .999543.) This resulted in a quotient that might be one off. This was documented. See this.
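A sketch of the prescaling step, with the hardware table modelled by truncating the divisor to a few leading bits; the table width is my assumption, not the model 91's:

```python
def prescale(a, b, table_bits=5):
    """Model 91-style prescaling sketch: a small table, indexed by the
    leading bits of the normalized divisor, supplies a rough reciprocal
    n; a/b becomes (n*a)/(n*b) with the new divisor close to 1, so
    quotient digits can be read almost directly off the remainder."""
    assert 1.0 <= b < 2.0                      # normalized divisor
    b_trunc = int(b * (1 << table_bits)) / (1 << table_bits)
    n = 1.0 / b_trunc                          # the 'table lookup'
    return n * a, n * b
```

With b = 1.3, for instance, the scaled divisor comes out within a couple of percent of 1, while (n*a)/(n*b) is still a/b up to rounding.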

I thought that the IBM/360 195 fixed the 91's problem, but I recently found that it did not; it too was documented to be occasionally one bit off in the low position. Here is the plan I heard proposed for the 195. Again with n being a few-bit approximation to the reciprocal of B, the division problem A/B is replaced with nA/nB, but the modified division problem discards no bits: A/B rounded down to an integer is always the same as nA/nB rounded down. Floating point needs no remainder, which is just as well, for this scheme cannot produce one!
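The claim can be checked with exact rational arithmetic; the particular n, operands, and truncation width below are mine, chosen to show both halves of the story:

```python
import math
from fractions import Fraction

# Scaling with exact arithmetic discards nothing, so the two quotients
# provably agree: nA/nB is the very same ratio as A/B.
a, b = 29, 10
n = Fraction(1, 7)            # stand-in for a few-bit reciprocal approximation
assert math.floor(Fraction(a, b)) == math.floor((n * a) / (n * b))   # both 2

# By contrast, if the scaled operands are truncated to a finite width
# (bits discarded, the model 91 situation), the quotient can be wrong:
frac_bits = 4
na = math.floor(n * a * 2**frac_bits)    # 66 sixteenths
nb = math.floor(n * b * 2**frac_bits)    # 22 sixteenths
assert na // nb == 3                     # yet 29 // 10 == 2
```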

Machines from Cray Research didn’t divide directly. There was an instruction that produced an approximate reciprocal. The result bits that did not hold correct reciprocal values instead carried information that a second instruction could use to produce an accurate reciprocal of the original number. If a division was really necessary, the compiled code would multiply the numerator by the reciprocal of the divisor. A general division took three instructions, which the compiler could schedule.
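The refinement step is, in effect, a Newton-Raphson iteration for the reciprocal, r' = r(2 - br), which roughly doubles the number of correct bits each time. The sketch below shows the arithmetic only, not the Cray instruction encodings:

```python
def refine_reciprocal(b, r):
    """One Newton-Raphson step for 1/b: r' = r * (2 - b*r).

    If r is accurate to k bits, r' is accurate to about 2k bits."""
    return r * (2.0 - b * r)

# A division a/b then compiles to: estimate, refine, multiply.
b = 3.0
r = 0.3                        # coarse estimate, as if from the reciprocal op
r = refine_reciprocal(b, r)    # ~0.33
r = refine_reciprocal(b, r)    # ~0.3333
a_over_b = 7.0 * r             # the actual 'divide' is a multiply
```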

The Motorola AltiVec architecture uses this scheme.

I would guess that the Pentium used tables like the 195. I have heard that the problem was tracked down to an error made when someone cut and pasted a table from an earlier design. I now (2006) have better information: there was a proof that the division algorithm was correct, and this was taken as an excuse not to check that part of the design as carefully as they might have. It seems no one read the proof carefully. After the error was discovered, they found the bug in the proof.