I am not a computer designer, but I have learned the schemes that several fast computers have used through the years to divide floating-point numbers.

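The naïve division (used in many early computers) was to repeatedly compare the divisor with the running remainder, subtract the divisor if it fit, record one quotient bit, and shift.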
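A minimal Python sketch of the usual shift-and-subtract (restoring) loop follows. It works on unsigned integers rather than floating-point fractions, and the function name and the `bits` parameter are my own choices for illustration, not taken from any particular machine.

```python
def restoring_divide(dividend, divisor, bits=32):
    """Unsigned shift-and-subtract division: one quotient bit per step."""
    if dividend < 0 or divisor <= 0:
        raise ValueError("sketch handles a non-negative dividend and a positive divisor")
    if dividend >> bits:
        raise ValueError("dividend does not fit in the stated width")
    remainder = 0
    quotient = 0
    for i in range(bits - 1, -1, -1):
        # Shift the next dividend bit into the running remainder.
        remainder = (remainder << 1) | ((dividend >> i) & 1)
        quotient <<= 1
        # Compare, and subtract only when the divisor fits.
        if remainder >= divisor:
            remainder -= divisor
            quotient |= 1
    return quotient, remainder
```

Each pass through the loop yields exactly one quotient bit, which is why the faster machines looked for ways to get more than one bit per cycle.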

The LARC was a decimal machine. At the beginning of a divide it would compute two times the divisor (I recall) and in each clock cycle (500 ns) compare the divisor and twice the divisor with the current remainder. This produced two quotient bits per clock. It took two clocks per quotient digit. No carry propagation is required to multiply or divide a decimal number by two!

The CDC 6600 did just about the same thing. Being binary, it produced two quotient bits per clock (100 ns).
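To show what two quotient bits per step can look like, here is a hedged Python sketch of radix-4 division on unsigned integers. It precomputes multiples of the divisor once and compares the remainder against them in each step. Comparing against three times the divisor as well is my assumption, made so that every two-bit digit (0 to 3) can be selected; the function name and `bits` parameter are also mine, and this is not a description of the 6600's actual circuitry.

```python
def radix4_divide(dividend, divisor, bits=32):
    """Unsigned division that retires two quotient bits per step."""
    if dividend < 0 or divisor <= 0:
        raise ValueError("sketch handles a non-negative dividend and a positive divisor")
    if bits % 2 or dividend >> bits:
        raise ValueError("width must be even and large enough for the dividend")
    # Multiples of the divisor are formed once, before the loop.
    d1, d2, d3 = divisor, 2 * divisor, 3 * divisor
    remainder = 0
    quotient = 0
    for i in range(bits - 2, -1, -2):
        # Bring in the next two dividend bits.
        remainder = (remainder << 2) | ((dividend >> i) & 3)
        # Pick the largest multiple that still fits.
        if remainder >= d3:
            digit, remainder = 3, remainder - d3
        elif remainder >= d2:
            digit, remainder = 2, remainder - d2
        elif remainder >= d1:
            digit, remainder = 1, remainder - d1
        else:
            digit = 0
        quotient = (quotient << 2) | digit
    return quotient, remainder
```

In hardware the three comparisons would happen in parallel, so the step costs about one subtraction time while the loop runs half as many times as the one-bit version.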

Each cycle the Stretch (IBM 7030) did a table lookup using several bits of remainder and several bits of divisor. The result was an approximation of the next several bits of quotient. I think that it was guaranteed to produce four bits of quotient each cycle (300 ns) but would take advantage of up to nine bits when luck prevailed. It averaged about six bits per cycle.

The Illiac II is rumored to have done non-restoring division without performing carry assimilation each cycle. In other words, the running remainder was expressed redundantly. I know no other details (not even these!). It seems that they could have done carry assimilation on, say, the top ten bits so as to see farther ahead. (Wikipedia says that the Illiac II divide was designed by Robertson, who co-invented the SRT division algorithm. This fits my rumor.)
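For readers who have not met non-restoring division, here is a small Python sketch of the textbook version on unsigned integers. It lets the remainder go negative and corrects by adding the divisor on the next step instead of restoring immediately. It does not model the redundant (carry-save) remainder rumored above, and the function name and `bits` parameter are mine.

```python
def nonrestoring_divide(dividend, divisor, bits=32):
    """Unsigned non-restoring division: add or subtract every step, never restore."""
    if dividend < 0 or divisor <= 0:
        raise ValueError("sketch handles a non-negative dividend and a positive divisor")
    if dividend >> bits:
        raise ValueError("dividend does not fit in the stated width")
    remainder = 0
    quotient = 0
    for i in range(bits - 1, -1, -1):
        bit = (dividend >> i) & 1
        if remainder >= 0:
            # Remainder is non-negative: shift and try subtracting.
            remainder = 2 * remainder + bit - divisor
        else:
            # Previous subtraction overshot: shift and add back instead.
            remainder = 2 * remainder + bit + divisor
        # The sign of the new remainder is the quotient bit.
        quotient = (quotient << 1) | (1 if remainder >= 0 else 0)
    if remainder < 0:
        # One final correction so the remainder is non-negative.
        remainder += divisor
    return quotient, remainder
```

Because every step is a plain add or subtract chosen by the sign alone, the loop only needs to know the sign of the remainder, which is what makes a redundant remainder with partial carry assimilation plausible.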

The IBM/360 model 91 did a table lookup on the high bits of the divisor at the beginning of the divide. This provided n, an approximation to the reciprocal of the divisor good to a few bits. Numerator and denominator were both multiplied by n so that the divisor of the resulting division problem would be close to one. This makes guessing quotient bits trivial. (Try dividing 3.141593 by .999543.) This resulted in a quotient that might be off by one. This was documented.

The IBM/360 model 195 fixed the 91 problem. Again with n being a few-bit approximation to the reciprocal of B, the division problem A/B is replaced with nA/nB. The modified division problem discarded no bits. A/B rounded down to an integer is always the same as nA/nB rounded down. Floating point needed no remainder, which this scheme could not produce!
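The prescaling idea can be tried out with a toy integer sketch in Python. Here n is derived from the top few bits of B (computed on the fly rather than read from a stored table), and both operands are multiplied by n exactly, so no bits are discarded. The names `prescale`, `table_bits` and `unit` are hypothetical, and this is only an illustration of the arithmetic, not the 195's hardware.

```python
def prescale(a, b, table_bits=6):
    """Return (n*a, n*b, unit) where n*b lands close to the power of two `unit`."""
    if a < 0 or b <= 0:
        raise ValueError("sketch handles a non-negative numerator and a positive divisor")
    # The high bits of the divisor act as the table index.
    shift = max(b.bit_length() - table_bits, 0)
    top = b >> shift
    # A few-bit reciprocal estimate, standing in for the table entry.
    n = (1 << (2 * table_bits)) // top
    # n*b is close to this power of two, i.e. "close to one" after scaling.
    unit = 1 << (2 * table_bits + shift)
    return n * a, n * b, unit


scaled_a, scaled_b, unit = prescale(3141593, 999543)
# The quotient is unchanged because both operands were scaled by the same n.
same_quotient = (scaled_a // scaled_b) == (3141593 // 999543)
# The scaled divisor is within a few percent of the power of two.
near_one = abs(scaled_b - unit) * 16 < unit
```

With the scaled divisor this close to a power of two, the leading bits of the running remainder are already a good guess at the next quotient bits.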

Machines from Cray Research didn’t divide directly. There was a command that produced an approximate reciprocal. The result bits that did not carry correct reciprocal values carried information that another command could use to produce an accurate reciprocal of the original number. If a division was really necessary, the compiled code would multiply the numerator by the reciprocal of the divisor. A general division took three instructions, which could be scheduled by a compiler.

The Motorola AltiVec architecture uses this scheme.
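The reciprocal-then-multiply approach can be sketched in Python as below. The crude estimate stands in for the hardware's approximate-reciprocal command, and the correction step is a Newton-Raphson iteration, which is my assumption about the kind of refinement involved; the sketch does not model how spare result bits might carry information to the second command. All function names and the `bits` and `steps` parameters are hypothetical.

```python
import math


def approx_reciprocal(b, bits=8):
    """Crude reciprocal of a nonzero float, good to roughly `bits` bits."""
    m, e = math.frexp(b)              # b == m * 2**e with 0.5 <= |m| < 1
    scale = 1 << bits
    r_m = round(scale / m) / scale    # quantize 1/m to `bits` fractional bits
    return math.ldexp(r_m, -e)        # 1/b == (1/m) * 2**(-e)


def refine(b, r):
    """One Newton-Raphson step; roughly doubles the number of correct bits."""
    return r * (2.0 - b * r)


def divide(a, b, steps=3):
    """Compute a/b as a times a refined reciprocal of b (no divide of a by b)."""
    r = approx_reciprocal(b)
    for _ in range(steps):
        r = refine(b, r)
    return a * r
```

Since the refinement uses only multiplies and a subtract, a compiler can interleave these steps with unrelated work, which is the scheduling advantage mentioned above.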

I would guess that the Pentium used tables like the 195. I have heard that the problem was tracked down to an error when someone cut and pasted some table from an earlier design. I now (2006) have better information. There was a proof that the division algorithm was correct. This was an excuse not to check that part of the design as carefully as they might have. It seems no one read the proof carefully. After the error was discovered, they found the bug in the proof.