Numbers are stored 30 bits per word, (actually wb bits per word), in a word array in which the units position is at index zero. There are special conditions where values of 2wb or more may be allowed. Only positive values are used. We avoid multiplication by words with the high bit on. This may make the code suitable for machines without high performance unsigned multiplies. We use the high 2 (32−wb) bits for deferred carry assimilation.

Note preprocessor symbol C (for “const”) and defined type B for a 64 bit value.

ml

The routine ml is for debugging only.
void ml(int C * C M, int C m, int C * C N, int C n, int * C P)
The number in M[0] thru M[m−1] is multiplied by the number in N[0] thru N[n−1]. The product is put into P[0] thru P[m+n−1]. One of the inputs may have a 31 bit value in the high order word, which might produce an output with a 31 bit value in the high order.

mm

Routine mm multiplies one 31 bit value by 4 words of integer.
void mm(int r0, int N[4], int M[5])
The product of r0 and N is placed in M. M[4] is a 31 (wb+1) bit value.

inv

This routine produces a 120 bit approximation to a reciprocal.
void inv(int C * C a)
This routines assumes that a[0] holds the most significant bit of a number. It will produce in r[0] thru r[3] a 121 bit approximation to the reciprocal of the most significant 120 bits of that number, which it finds in a[0], a[−1], ... There will be 31 significant bits in r[3]. The number of leading 0’s in the left wb bits of a[0] is placed in the file scope variable J.

multMod

This computes the product of two big numbers, modulo a modulus.
void multMod(int C * A, int C * B, int * P, int L)
The product of A and B, reduced modulo M, is placed in P. A, B & P are all L words long. L must be a multiple of 4. The modulus, M, is found in words *m[0] thru *m[L−1] where m has file scope. A precondition of calling multMod is that ms (of file scope) be the number of leading zero words in the modulus. A and B must be less than M and P will also be less than M. The storage areas for A, B and P may overlap.

d is the “same” 120 bits taken from the remainder as from the modulus when its reciprocal was computed. Those 120 bits are multiplied by the 120 bit reciprocal to compute qe. The high 120 bits of that product become qe.

qe in m4 in multMod qe is a 120 bit number which is ideally what should be multiplied by the modulus and this deducted from the accumulator in order that the accumulator be less than the modulus. It is only an approximation, however.

modPow

See this for the rules for calling modPow.

Switches and Such

The suggested production value is given after the name.
wb = 30
bits of precision in a word. See above.
ugg = 0
Flag (0 or 1) that turns on diagnostic printing of intermediate results in main loop.
ug = 0
Flag that turns on printing of intermediate results in computing of 120 bit reciprocal.
stat = 0
Flag to count and report measure of inaccuracy in approximation of reciprocal.
aQ = 0
This flag turns on a powerful check in multMod that checks a = b*c%m by verifying that b*c = q*m + a where q is the quotient, which is normally discarded. q is accumulated in array Q.
fs = 1
Selects a multi-precision square calculation that takes only about half as long as multMod.
cs = 0
A flag that turns on a simple check of the special square routine by calling the simple ml routine for comparison.

The problem just now is that when we attempt to compute the reciprocal of a number that begins 0x20000000, the approximation of the reciprocal of that word is 0x80000000, but this is viewed as a negative number by the hardware multiplier. This was fixed by decreeing that 1/1 = 1−2−wb. It is, after all, only an approximation!
Just now m4, is broke. m4 is intended as an internal routine to multMod but Apple’s gcc is broken and can’t do nested routines. The parameters L & acc were added to unnest it. Here is what m4 is supposed to do:
void m4(int C * C X, int C Y[4], int L, int * acc)
does acc += Y*X where X is L words long and acc is L+4 words long.

Here is how it is supposed to do it: The code is a specialization of
{B p = 0; int j; for(j=0; j<L+4; ++j) {p = (p << wb) + valj; acc[j] += p&msk;}}.
The plan is that if valj produces a number less than 22*wb then p4 will add ∑valj2wb*j to acc. That much is pretty well tested. I had hoped that by presenting a 32 bit one’s complement of each Y value, that acc −= Y*X would happen. The result is accurate to only about 32 bits. When I make it two’s complement, the result is accurate to about 120 bits, with some high order noise. I need better theory here. When each Y is 2’s complemented, the respective val’s are two’s complemented and the effect on acc is exactly reversed. This needs rigor.

Some Inequalities

We explain the logic of the code by viewing the payload as a fraction between −1 and 2. In each case the payload is the multi-precision values which are the input, output and intermediate values. Often we require that the words other than the most significant be non-negative and less than 2wb. The accumulator is the array acc in the routine multMod. We take the binary point in the acc so that 1/2 <= m < 1 for m, the modulus.