The x86 can multiply two 32 bit numbers to get a 64 bit product, a capability that seems to be disappearing in other architectures. On a 1.8 GHz Intel Core Duo (using one core) this does a 1024 bit job in 50 ms.
#include <stdio.h>

typedef int i32;
typedef long long B;

/* Sum of four 32x32 -> 64 bit products: a[0]*b[3] + a[1]*b[2] + a[2]*b[1] + a[3]*b[0].
   Each word is presumed to hold a 30 bit operand. */
static B xp(int * a, int * b){
  return (B)*a * *(b+3) + (B)*(a+1) * *(b+2) + (B)*(a+2) * *(b+1) + (B)*(a+3) * *b;}

int main(){
int f[100];
{int j = 100; while(j--) f[j] = j*31007 + 47;}   /* arbitrary test data */
{short c = 0; B bs=0;
 int j = 1000000; while(j--) {int k = 100; while(k -= 4) {
    {B b = xp(f, f + k); c += b >>58; bs += b;}}}  /* c sums the carry bits (b >> 58), bs the full sums */
printf ("%d\n", c);
}}
The code does 10^8 64 bit products and sums the carries. It presumes 30 bits per operand. An n bit multiply or modular reduction takes (n/30)^2 of these products. An n bit modPow operation takes n*3*(n/30)^2. For n = 3K this is about 10^8.
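As a check on that arithmetic, take n = 3072: 3072 * 3 * (3072/30)^2 = 9216 * (102.4)^2 ≈ 9.7*10^7, roughly 10^8 word products, which is about what the benchmark loop above performs (10^6 passes, each making 24 calls to xp of 4 products each).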

An ElGamal key takes two modPows to encrypt and one to decrypt.


More detail:

The logic of the routine xp above assumes that we use batches of 4*30 multiplier bits on each pass through the outer loop of the multiply. One scheme that I have used before for a*b%m is to take batches of multiplier bits, high order first, and with each batch:

  1. multiply batch by multiplicand,
  2. add that to accumulator,
  3. reduce accumulator by modulus,
  4. shift accumulator left by length of batch.
The accumulator’s length is just the modulus length plus the batch length. With one bit batches this is the familiar multiply that takes the multiplier bits high order first, with the reduction logic folded in. A toy single-word sketch of the batch scheme follows.
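Here is a minimal single-word sketch of steps 1 through 4, shrunk to 4 bit batches and a machine-word modulus so plain C integer types suffice. The function name mulmod_batched and the use of the % operator for step 3 are this sketch's own; the real routine reduces with a precomputed reciprocal as described below.

#include <stdio.h>
#include <stdint.h>

/* Toy illustration of the batch scheme: a*b % m with the multiplier b taken
   in 4 bit batches, high order first.  Each pass multiplies the batch by the
   multiplicand, adds it to the accumulator, reduces mod m, and shifts the
   accumulator left by the batch length before the next batch. */
static uint64_t mulmod_batched(uint32_t a, uint32_t b, uint32_t m) {
  uint64_t acc = 0;
  for (int shift = 28; shift >= 0; shift -= 4) {
    uint32_t batch = (b >> shift) & 0xF;   /* next 4 multiplier bits        */
    acc += (uint64_t)batch * a;            /* 1. multiply  2. accumulate    */
    acc %= m;                              /* 3. reduce by the modulus      */
    if (shift) acc <<= 4;                  /* 4. shift by the batch length  */
  }
  return acc;
}

int main(void) {
  uint32_t a = 123456789, b = 987654321, m = 1000000007;
  printf("%llu\n", (unsigned long long) mulmod_batched(a, b, m));
  printf("%llu\n", (unsigned long long)(((uint64_t)a * b) % m));  /* check */
  return 0;
}

The if (shift) guard reflects that the final batch is not followed by a shift; equivalently one could shift first and then add.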

Reducing modulo the modulus sounds like division, but the inner loop need be no worse than multiplication. I suggest pre-computing rm, the 120 leading significant bits of the reciprocal of the modulus. This is especially strategic when the same modulus is reused many times, as in multi-precision modular powers. With proper fixed point scaling, pr -= (rm*pr)*m; is the inner loop for modular reduction, where pr is the partial remainder and m is the modulus. The same 120 bits of pr may be taken each cycle, and the value (rm*pr) is computed only to a near integer, which requires just 16 32 bit multiplies. Thereafter (rm*pr)*m is computed exactly.
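A minimal single-word sketch of that inner step, with the reciprocal scaled to 32 fraction bits rather than 120. The names rm, pr and m follow the text; the function reduce, the constants, and the final correction loop are this sketch's assumptions.

#include <stdio.h>
#include <stdint.h>

/* rm is floor(2^32 / m), so q = (rm*pr) >> 32 slightly underestimates pr/m.
   pr -= q*m then lands within a few multiples of m and a short correction
   loop finishes; no division appears in the reduction itself. */
static uint32_t reduce(uint64_t pr, uint32_t m, uint64_t rm) {
  uint64_t q = (rm * pr) >> 32;   /* approximate quotient, never too large */
  pr -= q * m;                    /* the pr -= (rm*pr)*m step of the text  */
  while (pr >= m) pr -= m;        /* at most a couple of corrections       */
  return (uint32_t)pr;
}

int main(void) {
  uint32_t m = 40009;                        /* small modulus for the toy   */
  uint64_t rm = (((uint64_t)1) << 32) / m;   /* precomputed reciprocal      */
  uint64_t pr = (uint64_t)33333 * 29999;     /* product of two residues < m */
  printf("%u %u\n", reduce(pr, m, rm), (uint32_t)(pr % m));  /* both agree  */
  return 0;
}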

Computing rm is tricky but not especially time critical. We talk here as if the 120 bits of rm and the high 120 bits of the modulus were each scaled to lie between 0 and 1, using the low 30 bits in each of 4 words. Since a number and its reciprocal cannot both be less than 1, we settle for seeking to make their product = 1/2. The high bit of each will be 1. (Alternatively we might use 31 bits in the high word of the reciprocal.) We should do something about the edge condition where the high 120 bits produce exactly 1/2. I think it will do to say that we always round down; a value too low is merely slightly inefficient.
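The same conventions at a toy 32 bit scale; the fraction width, the sample modulus bits, and the clamp are this sketch's choices, not the author's.

#include <stdio.h>
#include <stdint.h>

int main(void) {
  /* mf = high 32 bits of the modulus, high bit set, read as the fraction
     mf/2^32 in [1/2, 1).  We want rm/2^32 in [1/2, 1) with mf*rm as close
     to 1/2 (that is, 2^63 in this scale) as possible from below, so
     rm = floor(2^63 / mf).  The clamp handles the edge case where the high
     bits are exactly 1/2 and the quotient would be 2^32. */
  uint64_t mf = 0xB504F334u;                 /* example high bits, about 0.707 */
  uint64_t rm = (((uint64_t)1) << 63) / mf;
  if (rm > 0xFFFFFFFFu) rm = 0xFFFFFFFFu;    /* round down; merely a bit less efficient */
  printf("mf = %llx  rm = %llx  mf*rm = %llx (close to 2^63)\n",
         (unsigned long long)mf, (unsigned long long)rm,
         (unsigned long long)(mf * rm));
  return 0;
}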

We use this approximate reciprocal of the modulus just after we have multiplied 120 bits of the multiplier by the entire multiplicand. We thus reduce S = (that product) + ((previous reduction) << 120). This is n+121 bits long where the multiplicand is n bits long. We do this by extracting the high 121 bits of S and multiplying that by the reciprocal. This gives us a 120 bit approximation by which to multiply the modulus. This product will be subtracted from S to get something fairly likely to be >=0 and < modulus.

Here are some notes on the specific code.

Except for finding the approximate inverse of the modulus we need not require a normalized number; that is, we need not locate the high order bit. We begin by ensuring that the factors in a product are each less than the modulus. Let M be the number of bits in the modulus, M = ceil(log2(modulus)), and let wb be the bits per word (30 here). The two factors each reside in an array of 4*ceil(M/(4*wb)) words. The modulus resides in an array whose high word is not 0. The modular product is developed in a field four words longer yet. The multiplier bits are used 4*wb bits at a time, or 4 words at a time, starting at the high order end, much as in the batched scheme described above.
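A concrete instance of that layout, assuming M = 3072, wb = 30, and reading "four words longer" as longer than the factor arrays; the names FWORDS and PWORDS are this sketch's own.

#include <stdio.h>

/* For M = 3072 and wb = 30: ceil(M/(4*wb)) = ceil(3072/120) = 26, so each
   factor occupies 4*26 = 104 words of 30 payload bits, and the field that
   holds the developing modular product is four words longer. */
enum { WB = 30, M_BITS = 3072,
       FWORDS = 4 * ((M_BITS + 4*WB - 1) / (4*WB)),   /* 104 words per factor   */
       PWORDS = FWORDS + 4 };                          /* 108 word product field */

int main(void) {
  printf("factor words: %d, product field words: %d\n", FWORDS, PWORDS);
  return 0;
}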