Logic of sqrt.s

The file sqrt.s is an assembler file for the PowerPC that computes sqrt(x) in less than about 70 cycles instead of the 300 required by earlier routines. The assembler dialect is that of Apple’s system 9. Here it is in Apple’s OS X dialect which is more gnu like. The sqrt routine that comes with OS X is nearly as fast as mine and is frequently a bit more accurate. I have not taken the time to see if I could make mine as accurate.

The identifier “prec” is defined as 50 in the source. Its definition may be changed to trade off accuracy and performance. See comments in code about this tradeoff. See below about logic of tradeoffs.

Unlike the OS 9 version, the X version saves all floating registers that it uses except f0 and f1.

The Math

To compute sqrt(x) we factor x as

x = 4ⁿ * oe * e * nu

where n is an integer, oe is either 1 or 2, e is of the form 1+k/32, abs(nu−1) < 1/32, where k is an integer and 0 <= k < 32.
k is taken directly from bits 12 thru 16 of the double floating IEEE representation. nu is computed by multiplying x by 1/(1+k/32) which comes from a table indexed by k.
sqrt(x) = sqrt(4ⁿ * oe * e * nu) = 2ⁿ * sqrt(oe) * sqrt(nu) * sqrt(e).
sqrt(nu) is computed as a Taylor series segregating odd and even terms to achieve processing overlap provided by the 601 and accommodating its latency. sqrt(e) is looked up in a table of 32 values.
This gives about 52 bits of precision on the average but worst case is about 50 bits. The 50 bit version stops here. The rest of the theory stems from the idea that we are trying to find the root of r²−x. We use the 601’s multiply add instruction to compute r²−x where r is the approximate root. We use an estimate of the slope of the curve, again from the table, to adjust r from the error. This gives us about 57 bits of accuracy. It might seem strange that there may still be errors after we have more bits than there are in the double float but errors arise when the root is very near the midpoint between two representable answers. The 57 bit version stops here. For the exact answer we evaluate r² at that midpoint in the suspect cases and observe the sign of r²−x. From that we know which interval the root is in and we thus know which value for r is the closest.
The code does not “signal inexact” because while the PowerPC Numerics manual from Apple uses the term, it does not define it.

The Mac OS X version restores floating registers except 0 and 1. It preserves all 32 fixed registers. This is more than required by the calling conventions of Mac OS X.