The simple idea is to view a flat file as a tree structure and compute a secure hash for each node and leaf. It is necessary, however, to divide the tree into branches and leaves so that replacement of an earlier part of the file with a part of different size does not impact how later parts are partitioned into nodes and leaves. Thus partitions at simple addresses, such as powers of 16, will make the tree for a file look entirely unlike the tree for the same file with the first character deleted. We will form a tree such that a depth first traversal is able to produce the original flat file. The node hashes will be a secure hash of the child hashes and thus the root hash may be considered a secure hash of the entire file.

The challenge is to choose natural cleavage points in a file so that distant differences do not affect the choice. We propose a stream hash for this purpose. A stream hash is a boolean function of the bits in the corresponding neighborhood of the input stream.

 hi = f(si-8, si-7, si-6, ..., si+8)
produces a stream hash whose span is 17 bits. If we look in the stream hash for strings of zeros of length 24-2*n to split the file at level n, then we have a tree with average fan out of 4. We advocate spans of about 100 here.

What can go wrong

If the stream hash span is too short then small repetitive patterns in the original file will show thru in the stream hash. We may not want each “hereinafter” to trigger a split. This can lead to badly balanced trees. If the span is too large then it might be hard to calculate (but don’t despair yet), and smaller blocks will not be discovered.

If the file consists of strings of 0’s or strings of 1’s whose length is greater than the span then will find too many or too few cleavage points. The sorts of file that cause these problems, compress well.

There is fast stream hash code buried in the duplication finder code.

It is easy to produce files that defeat this scheme. For the obvious ways to do so there seem to be obvious counters. I don’t know how this race plays out. I don’t know a context for such a race.