Here are some core pieces of a duplication finder. Here are some notes that describe the mathematical parts of the algorithm, along with a brief report of which routines do which math. Older versions are also available in other formats: PDF, PostScript, and RTF.

Here are the code fragments proper. I am sorry that I have not had the time to make these presentable. They have been run on a moderately large Unix directory of ASCII files, on big-endian machines.

Program Logic

How to find differences between remote files without transmitting them first.