About this code
Αfter α in the code we have selected one of the ranges of the denominator which collectively cover all possible denominators. df is the lower bound of this range. df determines the magic value N. nl will be a list of bounds for r, one element for each stage but the first element is the number of extra subtracts that are conditionally necessary at the end to get the last quotient bits needed.
The named let rlp steps over each division iteration calculating a bound for r and passing that onto the next iteration. The initial remainder is at most twice Nd. The variable ll holds a growing list of quotient piece limits, one per iteration. Keeping those in limits is what makes this work.
For each iteration we are given an upper bound as r and must compute the new bound: r ← f(r) where f(x) = 2Mx − Nd(floor(2−54x)). The input range for r may be divided into disjoint ranges [n254, (n+1)2−54 − 1] for 2M+z ≤ n < 2M+z+1. In such a range f is monotonic increasing. From one range to the next x increases by 254 and f increases by 254+M − Nd which we taken precautions to make positive. The maximum value for f(x) in the total range will be either for x within the last partial range, or at the end of the previous full range. For an adequate bound we compute both of these function values and take the max as the next bound for r.
For interpreting the output (at the bottom of the code) the integer 422 appears once. The next to last iteration will receive an r such that 0≤r≤422 when we are computing 7 bits per iteration and looking at 7+1 bits in the remainder to choose the next 7 quotient bits. In that case we may have to subtract 3Nd from the last remainder and add 3 to the quotient for the ultimate quotient. We see that surveying the enumerated cases we generally need just M+2 multiplier bits at each iteration.
The code has two nested loops, both employing Scheme’s named let pattern. The first loop, dlp, runs thru denominator ranges. For each iteration the integer j is taken from the high 9 (= M+s−1) bits of the denominator. Since the high bit is always on there are 2M+s of these cases.
The second loop, rlp, goes thru the same stages as the division algorithm. The inner loop assumes that the C expression r>>54 produces the largest value for nq that the previous max on r allows. That, in turn produces the largest r for the next iteration.
First we know the denominator, df; we iterate over 210 (= 2M+z) values ranges for d. For such a range there is some value of j such that 210≤j<211 and j∙242≤ d < (j+1)∙242. For this range we compute a specific N = 220/(j+1) and then Nd.
Having fixed j and a lower bound for Nd, we consider a sequence of 53÷M upper bounds for r, which is initially Nn, each range corresponding to a value for nq that we extract from the remainder each iteration of the division loop. The first ‘remainder’ is Nn which cannot be more than twice Nd.
This variation models using carry save add between the division iterations to save time. The right most y bits of the remainder are carried redundantly and this makes those bits of r inaccessible for nq. That means we must consider the worst case and add 2y to the bound. It is easy to modify our code to see the impact on the upper bounds of r that we provide. A new parameter, y, indicates how many bits are carried redundantly. Each iteration requires an ordinary M bit full adder to materialize these bits from the CSA output of the previous iteration, but without CSA the alternative is one 62 bit full adder each iteration.
With y=3 I see no change. This code needs close study!