What is the situation just after we discover that rb is negative (to wit -3 if M=8)? q already holds 3 too many bits of nearly correct quotient. if q were augmented by (÷ r Nd) it would be exactly correct. I.e. q+(÷ r Nd) is the exact unrounded quotient to 4 too many places. We need the left of those bits to do the required round. The invariant ('Invar) still holds! That is just what SchemeF computes with s recursion.

Let x be the low 4 bits of q (q&15). I really need a hard bound on r!!!!!!!!

Notation

floor(x) is the greatest integer not greater than x.
fp(x) = x − floor(x), the fractional part of x.
x = floor(x) + fp(x).
fp(8/3) = ⅔.
In the beginning 252 ≤ df < 253 and 252 ≤ nf < 253. For some j, 210 ≤ j < 211 and j∙242 ≤ df < (j+1)∙242.
Enumerating the cases we find (or here) that 0x3fd9200000000000 ≤ Nd < 262.
r < 254. The code “lu r = (N*nf)<<ad” ensures that the high order bit of the quotient appears where we need it and that we compute just enough quotient bits.
r = (r<<8) - Nd*(r>>54);
The value (r>>54) is named “nq” in the code. Since Nd is near 262 we can see that this expression nearly cancels, but how closely is the vital question.

The initial r is N∙nf or twice that, and we take the product of their upper bounds as an upper bound for the initial r. Thus r ≤ 210+1+53 = 264 — close call.

Here is statistical evidence for bounds we aspire to prove in this sequence of programs: 2, 3 and 4. They are each small textual modifications on this short version which works for 108 random cases. Number 4 makes evident that M=8 is at the edge of testing this design in simple C programs on 64 bit computers. Note the ramifications of the fact that max exceeds 239 and thus Nd*nq would require 72 bits to hold properly, were it not for the fact that 8 of those high bits cancel in the subtraction and thus nothing is lost.

I note one possibility that might be considered, but not considered beyond this paragraph. A bit or two in the table entries for N which would indicate interpreting one or more bits in r to choose nq. Such might require only a few dozen gates whereas allowing another bit in nq would require hundreds. Parameter z in some of the code specifies how many more bits of df to consult to choose N. The code without z has implicitly chosen z to be 2. Perhaps a better choice would be to set z to 1 and consult an extra bit in r while choosing nq.