The code at the end produces an RC4 (or ARCFOUR) compatible cipher stream at a projected 357 MB/sec. This assumes that a 2.5 GHz PowerPC produces a stream byte every 7 clocks, as my 400 MHz machine does.

This driver file invokes this assembler file, which computes the stream, and compares the result with this simplest C version, adapted from Wikipedia, whose test vectors may be verified thus:
gcc tvh.c x.c; ./a.out
The following shows the permutation table of the state after 227 steps. This should reveal any bugs resulting from erroneous hoisting of loads over aliased stores.
gcc -O3 -Wmost tv6.c x.c; ./a.out

The PPC machine code may be invoked and tested thus, even on an Intel Mac:
gcc -fnested-functions -arch ppc -O3 -Wall b.c xx1.s x.c; ./a.out keytext

The main tweakable values in the driver file are:
comp: boolean: whether to do comparison or mere timing test of assembler.
m: the size of the tested key stream.
jx: a handicap, the better to test timing.

x.c, compiled by gcc with -O3, produces about 23 MB/sec on a 400 MHz PowerPC. If a 2.5 GHz machine scales by clock rate, then this C program would produce about 146 MB/sec of keystream.

The next assembler version is less flexible and always produces just 256 bytes of keystream. This removes 2 of 8 inherently sequential dependencies.

gcc b2.c is.c -fnested-functions -arch ppc -O3 -Wall xx2.s; ./a.out keytext
always does just 256 bytes of output thus removing a mask instruction from the inner loop.

gcc b2.c is.c -fnested-functions -arch ppc -O3 -Wall xx3.s; ./a.out keytext
does progressive indexing to eliminate another dependency.

gcc b3.c is.c -fnested-functions -arch ppc -O3 -Wall xx3.s tb.s; ./a.out keytext
includes high resolution timing. Time units are 16 clocks. A 400 MHz machine does 400 clks/μs. This version takes 10 clks per byte of keystream.

gcc b3.c is.c -fnested-functions -arch ppc -O3 -Wall xx4.s tb.s; ./a.out keytext
permutes instructions for better scheduling and thus produces a new byte every 7 clks.


Together, the code in xx4.s and is.c provides the following:
typedef unsigned char ch;
typedef struct {ch s[257]; ch j;} state;
An instance of state provides an opaque state memory for the cipher stream generator.
void initst(state * st, ch * key); initializes the state st given a zero-terminated string key.
void mv2(char * outputstream, state * st); generates the next 256 bytes of keystream in outputstream and updates the generator memory st.
Here is a Scheme version to supply a quality pseudo random number generator for Scheme programs.