Just as a permutation can always be decomposed into elementary permutations (by various definitions of “elementary”), the corresponding exponential permutations can likewise be decomposed. Such a pattern may not always lead to the most efficient plan for the large permutation, but it may lead to a general and easy-to-understand program. Many exponential permutations corresponding to elementary permutations are much more cache efficient than the general direct exponential permutation, and thus the decomposed version may be faster even when more instructions are issued.

could be performed without the instruction counter.

Harvest was big endian and also numbered and addressed bits from the left. The basic transformation took 8-bit bytes b[j] and b[j+65536] and, labeling their respective bits abcdefgh and ijklmnop, produced aibjckdlemfngohp by a table lookup using those two input bytes as table indexes. This 16-bit value was placed in the output array at a[2*j]. This was iterated for 0≤j<65536. The above “round” did not require the program counter and was repeated 10 times. Bit 0 of byte 0 stayed put for each of the 10 rounds. Quoting bit addresses in hex, bit 1 of the array moved on successive rounds to bit addresses 2, 4, 8, 10, 20, 40, 80, 100, 200, 400. Bit 8 moved as follows: 10, 20, 40, 80, 100, 200, 400, 800, 1000, 2000. Meanwhile the bit starting at 1000 progressed to 2000, 4000, 8000, 10000, 20000, 40000, 80000, 1, 2, 4. In general each round doubled a bit's 20-bit address modulo 2^20 - 1, that is, rotated the address left by one; ten rounds rotated it by ten, swapping its two 10-bit halves. Viewing the array as a square bit array, bit b[0,0] stayed put, b[0,1] went to b[1,0], and b[0,8] exchanged with b[8,0]. The array was thus transposed. The hardware accessed the variable data by 64-bit words and the table lookups from a small fast memory by 16-bit hunks.

I am not sure that I would have been able to think thru this if bits were numbered in a byte from right to left and bytes in a string from left to right. My mental model was that of an array of bits and the hardware was documented and programmed this way.

While each bit traveled between core memory and the stream processor 20 times, it traveled each time with 63 other bits that were each also on a profitable trip.
The naïve program would have required 2^20 fetches or stores of isolated bits.
The optimized program required 10*2*2^20/64 = 327,680 fetches and stores of 64-bit words.
Alan Karp makes and extends these performance observations in his note:
Bit Reversal on Uniprocessors.

Peter Deutsch selected a different elementary operation to transpose a square array of pixels. His elementary permutations swapped one bit from the left half of the array index with the corresponding bit from the right half, one pair at a time. The intermediate images were amusing. The nature of BitBlt made this efficient for the Smalltalk world.

Yet another instance of this pattern is the index bit reversal portion of the Fast Fourier Transform. The required permutation of array indexes is to reverse the order of the bits. I suspect that the fastest way to do this does not fit in the exponential pattern. The exponential version, with caches, may be more efficient than the fancy schemes that require fewer instructions, and it is easier to understand.

The Harvest could do pocket sorts too, but less efficiently. It could maintain just one efficient output stream but two efficient input streams.

There are “block algorithms” for manipulating matrices that pass over the matrices by sub-blocks instead of the naïve row and column order. This is a related idea that achieves similar memory bandwidth benefits.

Database query optimization resorts to such patterns.

typedef long unsigned int L; …

  L rb(L x) {
      L a = x << 32 | x >> 32;                                     /* swap 32-bit halves */
      L b = ((a & 0xffff0000ffff0000L) >> 16)
          | ((a << 16) & 0xffff0000ffff0000L);                     /* swap 16-bit quarters */
      return ((b & 0xff00ff00ff00ff00L) >> 8)
           | ((b << 8) & 0xff00ff00ff00ff00L);                     /* swap adjacent bytes */
  }

This reverses the order of the eight bytes in a 64-bit word. It can clearly be extended to nibble or bit reversal. Here the address bits are complemented, not moved.