Here is some code that counts bits fast: a maze of twisty little routines. There are many techniques and several ingenious tricks. Only dcount might be original with me.

This code counts one bits fast, especially when there are multiple words' worth of bits to count.

“cook” takes three words and returns two. In the result, twice the ones count of the first word plus the ones count of the second is the number of ones in the original three words. If the operands are viewed as binary numbers and (x, y) = cook(a, b, c), then 2x+y = a+b+c. cook is two logic levels for most circuit families. This idea is used in “carry save adders” and makes it possible to multiply 48-bit numbers in a few clock cycles. The designers of the LARC told me in 1956 that the idea was ancient. It seems it was invented by John von Neumann. The LARC used a decimal analog.

The property of cook that is relevant to fast multiply is that {word a = random(), b = random(), c = random(); lump x = cook(a, b, c); a+b+c == (x.two<<1) + x.one;}.

“gcount” is very old; gcount(x) is the number of one bits in x.

“tcount” uses a pre-computed table that knows how many bits there are in each of the 256 possible bytes. This code can easily be modified to use tables whose sizes are other powers of two. On-chip cache size bears on your choice.

“dcount” counts the ones in two words. It is useful when your machine prefers gcount to tcount. The thing to notice is that gcount(a&b) is nearly twice as fast as gcount(a) because there are only about half as many bits in a&b. Thus dcount(a, b) spends roughly half as many cycles inside gcount as gcount(a) + gcount(b) would.

“acount” takes an array of 11 words and counts the bits using the above tricks. There is nothing magic about 11.

“bcount” is like acount but some compilers understand it better.

Together the three routines IA, New & report provide a systematic use of cook. When you have a stream of words whose cumulative bit count you need, first call IA, then call “New(x, 0)” for each word x in the stream, and then call “report(-1)” to get the final sum. The state of this computation resides in the array “t”, which knows the sum for each of the 32 bit positions. The implementation of “report” says better than I can how the intermediate sum is maintained in the array. The argument to “report” determines for which subset of the 32 bit positions the count is wanted. “New(x, n)” adds 2^n times the bit count of x to the state. The average number of calls to “New” for each call “New(x, 0)” is two. There is no check, and after 2^w calls to “New(x, 0)” for non-zero x, the code will access off the end of the array t!

“xew(w)” and “yew(w)” are like New(w, 0) but further optimized.

The routine “acount”, described above, is in effect an “unrolled” and folded version of this scheme.

main() provides a short demonstration and test of each of these routines.

Here are some timing results run on a 60 MHz PowerPC 601, optimized for the 604, with the Metrowerks compiler.

```
Ones = 15421, 0.0
ns = 15421000, 0.70
gs = 15421000, 1.18
ds = 15421000, 0.92
ts = 15421000, 0.68
bs = 15421000, 0.37
```
The right column is microseconds per word counted.

A machine and compiler able to process 64-bit words should gain an easy factor of two. With a 120 MHz clock this might count 400 bits per microsecond.

2012, January; 2.4 GHz Intel Core 2 Duo; clang -O3:

```
Ones = 16447,  0.0021
ns = 16447000,  0.0076
gs = 16447000,  0.0240
ds = 16447000,  0.0180
ts = 16447000,  0.0066
bs = 16447000,  0.0044
```
A 64-bit version yields:
```
Ones = 32099,  0.0022
ns = 32099000,  0.0086
gs = 32099000,  0.0377
ds = 32099000,  0.0238
ts = 32099000,  0.0105
bs = 32099000,  0.0060
```