The code at the end produces an RC4-compatible (ARCFOUR) cipher stream at 357 MB/sec. This assumes that a 2.5 GHz PowerPC produces a stream byte in 7 clocks, as my 400 MHz machine does.
The driver file invokes the assembler file, which computes the stream, and compares the result with the simplest C version, which was adapted from Wikipedia. Wikipedia's test vectors may be verified thus:
gcc tvh.c x.c; ./a.out
The following shows the permutation table of the state after 227 steps.
This should reveal any bugs resulting from erroneous hoisting of loads over aliased stores.
gcc -O3 -Wmost tv6.c x.c; ./a.out
The PPC machine code may be invoked and tested thus, even on an Intel Mac:
gcc -fnested-functions -arch ppc -O3 -Wall b.c xx1.s x.c; ./a.out keytext
The main tweakable values in the driver file are:
comp: boolean; whether to compare results or merely time the assembler.
m: the size of the tested key stream.
jx: a handicap, the better to test timing.
x.c produces about 23 MB/sec on a 400 MHz PowerPC when compiled by gcc with -O3. If a 2.5 GHz machine scales by clock rate, then this C program would produce about 146 MB/sec of keystream.
The next assembler version is less flexible and always produces just 256 bytes of keystream. This removes 2 of 8 inherently sequential dependencies.
gcc b2.c is.c -fnested-functions -arch ppc -O3 -Wall xx2.s; ./a.out keytext
always does just 256 bytes of output thus removing a mask instruction from the inner loop.
gcc b2.c is.c -fnested-functions -arch ppc -O3 -Wall xx3.s; ./a.out keytext
does progressive indexing to eliminate another dependency.
gcc b3.c is.c -fnested-functions -arch ppc -O3 -Wall xx3.s tb.s; ./a.out keytext
includes high resolution timing.
Time units are 16 clocks.
A 400 MHz machine does 400 clks/μs.
This version takes 10 clks per byte of keystream.
gcc b3.c is.c -fnested-functions -arch ppc -O3 -Wall xx4.s tb.s; ./a.out keytext
permutes instructions for better scheduling and thus produces a new byte every 7 clks.
typedef unsigned char ch; typedef struct {ch s[257]; ch j;} state;
An instance of state provides an opaque state memory for the cipher stream generator.