Theory

Notes on the Transclusion Finder

The html version of this file has some insignificant corrections. I have lost the ability to edit the other formats.

The purpose of this program is to find transclusions in a Unix tree of files. It treats each file as a bit string and seeks sub strings that occur more than once.

The stuff that I have written so far is more of a logic manual than a presentation of ideas.

The main idea is to define a landmark that is a property of a string of bits that is suitably rare. Each file is scanned for landmarks and a checksum of the data around each landmark is made and included in a new file along with the address of where it was found. The new file is then sorted by checksum and then landmark sites with the same checksum are compared by reading the original files.

The main difficulty is finding a suitable landmark definition. I invented a stream hash function that is fairly efficient to compute. It amounts to a convolution of the bit stream of the searched file with a string of about 500 bits. I look for strings of about 11 ones in the hash and call this a landmark. Bits surrounding this string of ones are used as the checksum. This has the convenient property that if two file portions are alike except for different bit offsets then the transclusion will be found even if the offset difference is not a multiple of one byte.

This scheme is likely to find transclusions of more than 256 characters as such a stream is likely to have a landmark. Shorter transclusions are less likely to be found.

One piece of hair is removal of redundant reports resulting from multiple landmarks in one transclusion.

Bit Order

We must consider how a long bit stream is represented in a Unix file. We are persuaded by the notion that fetching a 32 bit word from the file should yield 32 bits of the stream in their natural order on both big-endian and little-endian machines. Adjacent bits in the word are thus adjacent in the stream. Alas for little-endian machines, that means that the left end of the word, which is the more significant end, holds bits later in the stream.

If we flip the bits in each byte of a file end-for-end and move the file to a machine of the other endianness, and search the file, there are within the calculation bit stream values that are identical on the two machines except being flipped bit-by-bit, end-for-end. These values come in sizes 8, 16 and 32 bits.

Local Stationary Bit Streams Functions

This is the theory of a string hash function. It is a function on a long string of bits. Its value is a few hundred bits longer than the argument. It is local, stationary and fairly fast to compute. We use “string” here to refer to arrays of bits indexed by the integers. If b is such a string then b_j is bit j of that string. The strings are infinitely long in both directions but the bits are zero outside some finite range. Note that we picture bits with a larger index on the left! e.g. b₂ b₁ b₀ b₋₁ b₋₂ ... —this is the classic order for polynomial coefficients. Both the argument and value of string hash is a string. b ≪ n is the string b shifted n bits to the left. (b≪n)_i= b_i−n for any integers i and n. We also write “b(x)” to represent the polynomial Σb_ixⁱ over the field GF(2).

A string function s is stationary iff s(b≪1) = s(b)≪1 for all b. In this case s(b≪n) = s(b)≪n for all b and n.

A function s is local when there are integers a and b such that a<b and s(b)_i is determined by b_j only when i+a < j < i+b. This means that such functions depend on only bits in the neighborhood. More precisely L is local if there are integers a and b such that a<b and for any strings s and t (for all i, if (for all j if (i+a < j < i+b) then s_j= t_j) then L(s)_i = L(t)_i).

We want a stationary function S so that properties of a string that are preserved under shifting will remain visible in the yield of S if they were visible before. We want S to be local so that equality of sub-strings of modest length will lead to equality of the yield of S. We will look for landmarks in the yield such as strings of ones. No particular pattern can be presumed to have a useful density in the original data—zeros, for instance, are excluded from ascii text. Such patterns should have known densities in the randomized data that comes from S.

Our particular string hash function is based on a polynomial h over GF(2) of degree P≈50. We identify a bit string b with the polynomial h(x) = Σb_ixⁱ summed over non-negative i.

When polynomials are used to checksum transmission blocks a checksum is computed and transmitted that would have caused the block to be a multiple of the polynomial had it been exclusive-ored into the right end of the block. In this case the checksum is a function of all of the bits in the block and is thus not local.

Our string hash function S is a convolution function. S(b)_i = Σb_i−jp_j. (In this context Σ means the one bit exclusive-or: ⊕) If h(x) = x⁶⁴+x⁶¹+1 then p may be recursively defined as p_j = 0 for j<0 or j≥B and p₀ = 1 and p_j = p_j−64 ⊕ p_j−61 for 0 < j < B. (B is some constant ≈ 500). Alternatively p may be defined by the relations x^B + r(x) = p(x)h(x) and deg(r)≤deg(h). r(x) is the remainder after dividing x^B by h(x) and p(x) = (x^B + r(x))/h(x). It may be seen that S is stationary and local for the range a = −B, b = 0.

The “polynomial” associated with p for infinite B is the reciprocal of h in the sense that the terms of the product of p and h are the coefficients of the polynomial 1 despite the fact there are infinitely many terms in p. This is like 1/(1−x) = 1+x⁻¹+x⁻²+x⁻³... You may have seen the old fallacious proof that 1+2+4+8... = −1. That proof is valid in this funny world.

The desired convolution is m(x)p(x) where m_i are the bits of the argument to S. If there are N bits in the argument then the first bit is m_N−1 and the last bit is m₀. We will process the bits from first to last.

m(x)p(x) = m(x)[(x^B + r(x))/h(x)] = [m(x)(x^B + r(x))]/h(x)

We first describe a conceptual method that does not have the factor B in its cost. Then we show how to do it several bits at a time instead of one. The first conceptual method of computation proceeds as follows. The string argument is delivered one bit at a time, largest index first. The algorithm does not know the value of that index. S being stationary, we can compute the answer as soon as the data begin to arrive. By the first and last bits of the string argument we refer to some range of the infinite input outside of which is all zero. We buy three left infinite accumulators m, s and q. m accumulates the argument, s will accumulate m(x)(x^B + r(x)) and q will accumulate the quotient [m(x)(x^B + r(x))]/h(x). We prepare the initial answer as if the input string were empty (all zero) by setting m, s and q to all zeros.

When ever a bit of m arrives:

1): Shift m, s and q each to the left one bit and deposit the new bit in m₀.
2): If the new bit is 1 we exclusive-or x^B + r(x) into s.
3): Set q₀ to s_B.
4): If s_B then exclusive-or h(x)x^(B−P) into s. (This turns s_B off.)

When an oracle tells us that there are no new bits we supply B−P zeros after which s will be all zeros. Note that the last step above keeps s_j = 0 for j>B. Supplying zeros continues to increase the number of low bits in s.

The loop invariant for the above is:

s(x) = (x^B + r(x))m(x) − q(x)h(x)x^(B−P) and deg(s)<B.

Since s is driven to zero we have q(x) = (x^B + r(x))m(x)x^(B−P)/(h(x)x^(B−P)) upon exit from the loop. The extra factor x^(B−P) in m(x)x^(B−P) above arose when we fed B−P more zero bits into the algorithm after the last bit of the argument arrived.
Steps 1 and 2 constitute the multiplication (x^B + r(x))m(x) and steps 3 and 4 perform the division (x^B + r(x))m(x)/h(x). We now have:

q(x) = (x^B + r(x))m(x)/h(x) = m(x)p(x)

which is the desired answer.

....We reformulate the above in notation without mutation such as shifting and “exclusive-or”ing.

To accelerate the above computations we consider the data flow. The only parts of the computation that are proportional to B are the shifts of s, m and q. m is entirely unneeded—it was computed only to make the loop invariant testable and thus make the computation easier to debug. The bits of q are produced as the answer and are unused again in this computation. Only the first B bits of s are used—indeed the rest are all zero. After s_P is produced it is not used again, aside for shifting until it becomes s_B−P. This is thus a fixed length FIFO queue and may be implemented without shifting. Alternatively we can dispense with the queue and perform steps 3 and 4 sooner if we can get an early edition of the input bits so as to perform the x^B element of step 2. The latter is the only thing delaying the computation in steps 3 and 4. In either case there remain no factors of B in the computation.

Substantial further savings are possible by processing several bits of input at a time. For simplicity we will describe the 8 bit strategy. As the algorithm processes 8 bits, its storage, s, undergoes a linear transformations. s₀ thru s_P−1 are transformed by being shifted 8 to the left, becoming s₀ thru s_P+7, and having a P+8 bit string depending on the next 8 input bits, exclusive-ored into them. The P+8 bits may be found in a table of pre-computed constants indexed by the next input byte.

Bits s_B−P thru s_B−1 undergo a similar linear transformation. Any time after the input bits have been contributed at position x^B eight iterations of step 4 may be done in an 8 bit left shift followed by an exclusive-or of P bits into s_B−P thru s_B−1. These P bits are taken from another constant table indexed by the old bits s_B−8 thru s_B−1. Yet another table indexed the same yields the 8 answer bits.

There is a simple but unsatisfying proof of the above. The output is a linear function of the input in the sense that input and output are in a vector space over GF(2). The answer is correct when the message has a single one bit. (you can work this out in your head.) It is therefore correct for any argument with a single bit since S is stationary. The messages of a single bit form a basis for the vector space. The algorithm above computes a linear function. Linear functions that agree for each member of a basis agree everywhere. Q.E.D.

Some Combinatorics

How many bit strings of length n have at least one contiguous string of eight ones? A recursive formula is S_i = 2S_i−1 + (2ⁱ⁻⁹ − S_i−9), S₈ = 1 and S₀...S₇ = 0. The idea behind this formula is that an n bit string is composed of the concatenation of a one bit string and an n−1 bit string. We use “8string” to designate a bit string with one or more runs of 8 ones. Each n bit 8string is in exactly one of the following three forms: a zero followed by an n−1 bit 8string, a one followed by an n−1 bit 8string or a one followed by an n−1 bit string that begins with 11111110 and is then followed by an n−9 bit string that is not an 8string. The last proviso avoids double counting. The formula above follows from this trichotomy.

This yields S₃₂ = 131730944.

Running the Program

To invoke TF:
TF [-s options] directory_name.

“options” is a string of letters, each suppressing a different form of diagnostic output to “terminal”. The more characters the less output. Without the -s option, “-s CL” is assumed. “directory_name” is traversed recursively.

Classes of diagnostic and amusement output:

“F” names each file as it is first read, and directory as it is entered.

“R” reports how many characters in each file and prints “b” for each block.

“L” reports details about where matches were found.

“V” is for vernier faults, unfortunate and rare.