The above works on recent Macs (tested on Mac OS X 10.7.5 with a 64 bit x86). On a different OS it was necessary to remove the first line and each of the 6 underscores. Suitable prototypes for the routines are:

    typedef unsigned long int ul;
    typedef struct{ul q; ul r;} res;
    res divq(res, ul);
    res mulq(ul, ul);
    res addq(res, res);

mulq multiplies two unsigned 64 bit integers and returns a 128 bit integer.
divq divides a 128 bit number by a 64 bit number, returning a 64 bit quotient and remainder. 128 bit values are passed as a res to fully exploit divq.
In such a context `(res){j, k}` denotes j∙2^{64} + k.
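The intended results of the three routines can be pinned down with a portable reference written against the compiler's 128 bit integer type. This is only a sketch: the names mulq_ref, addq_ref and divq_ref are made up here, `unsigned __int128` is a GCC/Clang extension rather than ISO C, and it is an assumption that mulq returns the high half in q and the low half in r and that addq wraps modulo 2^128.

```c
#include <assert.h>

typedef unsigned long int ul;            /* 64 bits on LP64 systems (macOS, Linux) */
typedef struct { ul q; ul r; } res;
typedef unsigned __int128 u128;          /* GCC/Clang extension, not ISO C */

/* (res){j, k} stands for j*2^64 + k. */
static u128 to_u128(res v)    { return ((u128)v.q << 64) | v.r; }
static res  from_u128(u128 x) { return (res){ (ul)(x >> 64), (ul)x }; }

/* 64 x 64 -> 128 bit product (assumed layout: high half in q, low half in r). */
res mulq_ref(ul a, ul b) { return from_u128((u128)a * b); }

/* 128 + 128 bit sum (assumed to wrap modulo 2^128). */
res addq_ref(res a, res b) { return from_u128(to_u128(a) + to_u128(b)); }

/* 128 / 64 bit division: quotient in q, remainder in r. */
res divq_ref(res n, ul d) {
    /* The quotient fits in 64 bits only when the high half of the dividend
       is smaller than the divisor; the hardware instruction faults otherwise. */
    assert(n.q < d);
    u128 x = to_u128(n);
    return (res){ (ul)(x / d), (ul)(x % d) };
}
```

For example, divq_ref((res){1, 0}, 3) divides 2^64 by 3 and yields the quotient 0x5555555555555555 with remainder 1.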

The “q” in “divq” is a GAS (GNU Assembler) convention to indicate 64 bit operands.

Warning: divq is fairly slow. On a 2.4 GHz Intel Core 2 Duo, divq takes about 22 ns.

    clang m.c q.s
    ./a.out

produces

    6700417 3 32820881070691349 462887844924339 2559402536 11508347135197814379 11000000000001 1000000000000000

which is right.

lsq below serves as a 128 bit left shift and llsrs as long long signed right shift:

    typedef long int L;
    /* 128 bit sign-propagating right shift, 0 <= n < 128 */
    res llsrs(res v, int n){
      return n<64 ? (res){(L)v.q>>n, (n?(v.q<<(64-n)):0)|(v.r>>n)}
                  : (res){(L)v.q>>63, (L)v.q>>(n-64)};}
    /* 128 bit left shift, 0 <= n < 128 */
    res lsq(res v, int n){
      return n<64 ? (res){(v.q<<n)|(n?(v.r>>(64-n)):0), v.r<<n}
                  : (res){v.r<<(n-64), 0};}
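One way to gain confidence in the two shift routines is to compare them with the compiler's own 128 bit shifts for every count from 0 to 127. The sketch below repeats the routines so that it stands alone. The helper names same and shift_mismatches are invented for this check, `__int128` is a GCC/Clang extension, and the right shift of a negative value is assumed to be arithmetic, which those compilers provide.

```c
#include <assert.h>

typedef unsigned long int ul;
typedef long int L;
typedef struct { ul q; ul r; } res;

/* The two routines from the text. */
res llsrs(res v, int n){
  return n<64 ? (res){(L)v.q>>n, (n?(v.q<<(64-n)):0)|(v.r>>n)}
              : (res){(L)v.q>>63, (L)v.q>>(n-64)};}
res lsq(res v, int n){
  return n<64 ? (res){(v.q<<n)|(n?(v.r>>(64-n)):0), v.r<<n}
              : (res){v.r<<(n-64), 0};}

/* Does the pair a hold the same 128 bits as x? */
static int same(res a, unsigned __int128 x) {
    return a.q == (ul)(x >> 64) && a.r == (ul)x;
}

/* Count the shift amounts 0..127 for which either routine disagrees
   with the compiler's 128 bit shift (hypothetical helper). */
int shift_mismatches(res v) {
    assert(sizeof(ul) == 8);                 /* the routines assume 64 bit longs */
    unsigned __int128 u = ((unsigned __int128)v.q << 64) | v.r;
    __int128 s = (__int128)u;                /* same bits, read as signed */
    int bad = 0;
    for (int n = 0; n < 128; n++) {
        if (!same(lsq(v, n), u << n)) bad++;
        if (!same(llsrs(v, n), (unsigned __int128)(s >> n))) bad++;
    }
    return bad;
}
```

A result of 0 from shift_mismatches for a value with the top bit set, such as (res){0x8000000000000001, 0xF0F0F0F0F0F0F0F0}, exercises both the sign propagation and the n ≥ 64 branches.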

This is the program that motivated these routines. It computes e and converts it to decimal to many digits. C does not expose the power of the divq instruction, which is exploited here for a particularly simple calculation. Before hardware floating point, computers relied on instructions such as these for scaled fixed point arithmetic. They still have their uses.
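The following is not the program referred to above, only a small sketch of the kind of scaled fixed point calculation it describes. It assumes a 256 bit fraction held in four 64 bit words, evaluates e − 2 = 1/2! + 1/3! + … in nested form, and peels off decimal digits by repeated multiplication by 10. The compiler's `unsigned __int128` (a GCC/Clang extension) stands in for divq and mulq; all function names are made up for illustration.

```c
#include <assert.h>

typedef unsigned long int ul;                /* 64 bits on LP64 systems */
typedef unsigned __int128 u128;              /* GCC/Clang extension */

enum { W = 4 };                              /* fraction words, most significant first */

/* f = (1 + f) / k for a fraction f in [0, 1). Each step divides the
   128 bit value rem:f[i] by k, which is the shape of work divq does. */
static void one_plus_f_over_k(ul f[W], ul k) {
    assert(k >= 2);
    ul rem = 1;                              /* the integer part "1" */
    for (int i = 0; i < W; i++) {
        u128 n = ((u128)rem << 64) | f[i];
        f[i] = (ul)(n / k);                  /* fits in 64 bits because rem < k */
        rem  = (ul)(n % k);
    }
}

/* f = fractional part of 10*f; returns the integer part, a decimal digit. */
static int times_ten(ul f[W]) {
    ul carry = 0;
    for (int i = W - 1; i >= 0; i--) {
        u128 p = (u128)f[i] * 10 + carry;    /* 64 x 64 -> 128 bit product */
        f[i]  = (ul)p;
        carry = (ul)(p >> 64);
    }
    return (int)carry;
}

/* Write the first ndigits decimal digits of e - 2 into out, NUL terminated.
   With 256 bits, a few dozen digits are reliable. */
void e_fraction_digits(char *out, int ndigits) {
    ul f[W] = {0};
    /* e - 2 = (1/2)(1 + (1/3)(1 + (1/4)(1 + ...))); 60! exceeds 2^256. */
    for (ul k = 60; k >= 2; k--) one_plus_f_over_k(f, k);
    for (int i = 0; i < ndigits; i++) out[i] = (char)('0' + times_ten(f));
    out[ndigits] = '\0';
}
```

The remainder of each word's division becomes the high half of the next dividend, so it is always smaller than the divisor and the quotient never overflows 64 bits. That property is what makes a 128 by 64 bit divide such a good fit for this job.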

This takes three 64 bit parameters and returns two. “clang c.c -O3 -S” produces the file “c.s”, which reveals that a arrives in register %rdi, b in %rsi and d in %rdx. The yield goes to registers %rax and %rdx.

divq divides the 128 bit value in RDX:RAX by the named divisor and puts the quotient in %rax and the remainder in %rdx. See Vol. 2A 3-221 of the Intel x86 manual: “Intel® 64 and IA-32 Architectures Software Developer’s Manual”
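The same register conventions can be reached from C without a separate assembler file by using GCC/Clang extended inline assembly. This is a sketch and not the routine from this page: the name divq_inline is made up, and on machines other than x86-64 it falls back to the compiler's `unsigned __int128` so that it still compiles.

```c
#include <assert.h>

typedef unsigned long int ul;                /* 64 bits on LP64 systems */
typedef struct { ul q; ul r; } res;

/* Divide n.q*2^64 + n.r by d; quotient goes to q, remainder to r. */
res divq_inline(res n, ul d) {
    /* divq raises a divide error if the quotient needs more than 64 bits,
       which happens exactly when the high half is not below the divisor. */
    assert(n.q < d);
#if defined(__x86_64__)
    ul quot, rem;
    __asm__("divq %[d]"
            : "=a"(quot), "=d"(rem)          /* results come back in %rax, %rdx */
            : "a"(n.r), "d"(n.q), [d] "r"(d) /* dividend goes in as RDX:RAX */
            : "cc");
    return (res){ quot, rem };
#else
    /* Portable stand-in using the GCC/Clang 128 bit integer extension. */
    unsigned __int128 x = ((unsigned __int128)n.q << 64) | n.r;
    return (res){ (ul)(x / d), (ul)(x % d) };
#endif
}
```

The "a" and "d" constraints pin the operands to %rax and %rdx, and "r" leaves the choice of register for the divisor to the compiler.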

This is useful: