Note preprocessor symbol C (for “`const`”) and defined type `B` for a 64 bit value.

The routine `ml` is for debugging only.
`void ml(int C * C M, int C m, int C * C N, int C n, int * C P)`

The number in M[0] thru M[m−1] is multiplied by the number in N[0] thru N[n−1].
The product is put into P[0] thru P[m+n−1].
One of the inputs may have a 31 bit value in the high order word, which might produce an output with a 31 bit value in the high order.

Routine `mm` multiplies one 31 bit value by 4 words of integer.
`void mm(int r0, int N[4], int M[5])`

The product of r0 and N is placed in M.
M[4] is a 31 (wb+1) bit value.

This routines assumes that

The product of A and B, reduced modulo M, is placed in P. A, B & P are all L words long. L must be a multiple of 4. The modulus, M, is found in words *m[0] thru *m[L−1] where m has file scope. A precondition of calling multMod is that ms (of file scope) be the number of leading zero words in the modulus. A and B must be less than M and P will also be less than M. The storage areas for A, B and P may overlap.

d is the “same” 120 bits taken from the remainder as from the modulus when its reciprocal was computed. Those 120 bits are multiplied by the 120 bit reciprocal to compute qe. The high 120 bits of that product become qe.

qe in m4 in multMod qe is a 120 bit number which is ideally what should be multiplied by the modulus and this deducted from the accumulator in order that the accumulator be less than the modulus. It is only an approximation, however.

- wb = 30
- bits of precision in a word. See above.
- ugg = 0
- Flag (0 or 1) that turns on diagnostic printing of intermediate results in main loop.
- ug = 0
- Flag that turns on printing of intermediate results in computing of 120 bit reciprocal.
- stat = 0
- Flag to count and report measure of inaccuracy in approximation of reciprocal.
- aQ = 0
- This flag turns on a powerful check in multMod that checks a = b*c%m by verifying that b*c = q*m + a where q is the quotient, which is normally discarded. q is accumulated in array Q.
- fs = 1
- Selects a multi-precision square calculation that takes only about half as long as multMod.
- cs = 0
- A flag that turns on a simple check of the special square routine by calling the simple ml routine for comparison.

The problem just now is that when we attempt to compute the reciprocal of a number that begins 0x20000000, the approximation of the reciprocal of that word is 0x80000000, but this is viewed as a negative number by the hardware multiplier. This was fixed by decreeing that 1/1 = 1−2

Just now m4, is broke. m4 is intended as an internal routine to multMod but Apple’s gcc is broken and can’t do nested routines. The parameters

does

Here is how it is supposed to do it:
The code is a specialization of `{B p = 0; int j;
for(j=0; j<L+4; ++j) {p = (p << wb) + `val_{j}`; acc[j] += p&msk;}}`.

The plan is that if val_{j} produces a number less than 2^{2*wb} then
p4 will add ∑val_{j}2^{wb*j} to acc.
That much is pretty well tested.
I had hoped that by presenting a 32 bit one’s complement of each Y value, that `acc −= Y*X` would happen.
The result is accurate to only about 32 bits.
When I make it two’s complement, the result is accurate to about 120 bits, with some high order noise.
I need better theory here.
When each Y is 2’s complemented, the respective val’s are two’s complemented and the effect on acc is exactly reversed.
This needs rigor.