Abstractly a Bloom filter is a mutable object that maps strings onto bools. Given a string it returns a bool. It is allowed false positives. A creation time parameter trades off the frequency of false positives against storage costs. A bloom filter can also have strings added which causes the filter to respond true to that string in the future. There is no delete operation.
The easiest way to describe the internals of Bloom filters is to describe how they are read. The writing is then obvious. Given a string, its hash is computed. This must be a long very high quality hash. The hash is divided into n (about 10) pieces each of m (about 20) bits. The filter’s state is a string of 2m bits. Each of the n hash pieces is treated as an index into the state string. The filter responds true if all of the accessed bits are one.
Under usual circumstances it is optimal for the density of the state string to be 1/2. Then the probability of a false positive is 2−n. Each new word adds about .7*n bits to the data base. We will call this number (.7*n) the load on the filter.
Two Bloom filters can be merged by oring them together. The merged filter will respond positively at least whenever either of the inputs would have. The density of the merged filter will be greater than that of the inputs and thus either the inputs or the outputs will have a sub-optimal density.
Several Bloom filters can be ored together offset by various small numbers of bits. Now hits can be related to a particular original filter. If you think this thru you will find that such a merged filter is very efficient to read. Of course the merged filter may be originally constructed as merged.
This elegant application relies on the impossibility of retrieving the strings from the filter.
Here are the cost and performance considerations for this scheme. For each annotator that annotates an URL, there is a load of .7*n (about 7). Multiple annotations by the same annotator to the same URL do not increase the load. Consulting the offset filter for a new URL requires computing the hash. SHA is a very good hash and costs about 30 instructions per character. Retrieving the p bits requires about another 50 instructions.