You may not like my formatting but the algorithm is obscure with any formatting.
fft provides about 8 bits less precision than provided by the type named “c”. This is due to the way that w and b are calculated. A more complicated way of computing b and w can improve their accuracy or alternatively the computing platform might provide a type for b and w with 12 more bits of precision than needed for the entire calculation. If the type identifier “cx” in this version is bound to that type, the problem is solved. This affects the computation of b, “b = csqrtl(b);”, and w, “w *= b;” even if it is slow. Most critically w needs to carry 9 or 10 more bits of precision than the input to fft so that the 512 iterations of “w *= b;” do not corrupt the calculation. The 10 iterations of “b = csqrtl(b);” could be replaced by a table lookup for any practical application. Again each successive value of b needs one more bit than the previous and the first needs the precision of the input.
In 2006 Nov the Mac OS X csqrtl produced only 52 bits of precision. In 2007 July, it gives 64 bits as it should! Here is a complex square root if you should need it.
Here is experimental code for bit reversal.