```c
#include <stdio.h>

/* Print the first 20 terms of the continued fraction of x. */
void pcf(double x) {
    int j = 20;
    while (j--) {
        int c = x;          /* next term: the integer part of x */
        printf("%d ", c);
        x = 1 / (x - c);    /* continue with the reciprocal of the fractional part */
    }
}

int main(void) { pcf(3.1415926535897932384626); }
```

Done in multi-precision arithmetic, this would require a multi-precision division for each of many terms. Instead we keep the initial and working value x represented as the quotient of two multi-precision integers, xn and xd. Each term c is the integer part of xn/xd, which can almost always be found from the leading words of xn and xd alone; only rarely will c exceed 2^{32}, or will more bits of the large integers be required to compute the exact value of c.

`x = x - c` is done as `xn = xn - c*xd`.
Note the simplicity of multiplying a multi-precision number by a single-precision one.
Next the reciprocal is performed by swapping `xn` and `xd`.

When we compute c we may be off by one.
We compute `(xn - c*xd)` with the uncertain c and verify that

0 ≤ (xn − c*xd) < xd.
If this does not hold we decrement or increment c, and correspondingly add xd to, or subtract it from, xn once again.

All of this nicely sidesteps the gory multi-precision division required of the naïve algorithm.
It also fits nicely into a distributed calculation where one processor specializes in the high order portions of xn and xd.
If that processor handles the first 2^{14} words of precision, then in each iteration c is determined locally and exactly with extraordinarily high probability.
Signals to the processor handling the next most significant batch of precision are of the form *subtract c times your denominator from your numerator and return to me the (single precision) overflow; then swap xd and xn*.
The sender of this message must process the response but may proceed with the next iteration while the response is prepared.
This process extends to a whole string of processors each handling a portion of the precision of the multi-precision xn and xd.
In cases so rare that it will be difficult to debug, this full-duplex pipeline of calculations must drain out when the exact value of c in fact depends immediately on low order bits of xn and xd.

A further improvement assimilates about 20 of these inter-processor messages to mean:
*Compute <xn, xd> := <a*xn + b*xd, c*xn + d*xd> and return to me the two overflows.*
In these messages a, b, c and d are all single precision and so are the overflows.
This cuts down the work by a factor of about 20.
This trick works even in a single processor, which handles just the first few words for each iteration of c and assimilates the processing for the bulk of the multi-precision values.

This scheme only tolerates a message latency of about 10^{6} instructions.
A complete calculation requires about as many messages as there are words of precision in the collective calculation.
For 16 processors each handling 2^{14} 32-bit words of precision, this converts values of 2^{23}-bit precision, yielding somewhat fewer values for c.
The calculation would take about 2^{18} messages at each interface and about 10 ms of compute per message.
The processors at the low end can stop sooner, for their later yields are unlikely to influence the answer.
That comes to about 20 minutes to compute several million terms.

The work is still quadratic in the number of terms, but the inner loop is simple and fast. It does not use the forms of multiplication that are better than quadratic, but it does suggest ways of doing the problem with larger factors than the naïve algorithm suggests on its surface.