High Level View of Program

Here are the source files:
Fast.c, fl.c, FM.c, Lmain.c, New.c, sstream.c, p.c, sort.c, local.h Umain.c, m.h.

The above stream hash function has a mathematical style of theory. The rest of the program is anything but mathematical.
Indeed this is an ugly program. It would be rather improved if written in C++, but still ugly. Part of the ugliness is due to the lack of a crisp spec. Compromises on just what duplications to find are scattered about the code. In the limit large duplicates will be found.

There are two phases to the program and it is currently coded to finish one phase before starting the next. There is some flexibility to merge the phases and achieve trade-offs for space and against execution time.

Phase one reads the accessible files, eliminating duplication arising from links. The large output of this phase is a list of marks. Each mark is represented by an instance of the sx structure that comprises a 32 bit signature and a 32 bit mark locator. This is accumulated into a large array LM. A small output of phase one is a coded map which can translate a mark locator into a file name and a block within file.

At the end of phase one we sort the marks by signature and carry along the locations. Then in phase two we pass over the sorted list. When we come to a pair of adjacent equal marks we call locate and read the files where the marks were found. We extend the equality span as far as possible. Then, so as not to process this span again as triggered by other marks in the same span, we remove the marks from the sorted list associated with one of the two files. We do this by again passing over that file portion and for each mark find we do a binary search in the sorted mark list and “remove” the mark from the list. We cannot merely zap the entry for that would break the binary search code. Instead we disable the entry by changing its location field to −5. We keep such invalidated mark entries at the end of the run of entries with equal signatures.

Navigating Unix File Systems

Directory and file links present a special hazard for our purposes. Loops in directories can prevent termination and even without loops would cause false transclusion indications. The lstat system call in SVR4 alerts us that a directory entry is a symbolic link. We merely ignore such entries. Hard links to files are a bigger problem as the entries are symmetric. lstat returns a structure that includes a count of hard links to the file. To avoid double counting we remember the “i_node” number of each file with more than one hard link and visit the file only the first time we find a directory entry (link) to it.

Locating Data in Files

We must remember where each landmark was found. Considering that we must remember and sort millions of landmarks we must code this memory efficiently. Since S is strictly local it may be quickly recomputed starting anywhere in a file. We choose a block count to locate both which file and where in the file. A block is 512 bytes even where the underlying file system block is something different. This provides a 32 bit locater which is unique within two terabytes of storage.

We imagine the accessible files to be concatenated in directory order and their blocks numbered in that concatenation. This means that we must provide an auxiliary memory of which file holds block numbers within a given range. This memory will be large but not nearly so large as the landmark memory. I propose that we form a chronological and linear record of each directory and file visited on the first pass. If we merely record the size of each file then the file may be efficiently located. Each directory entry will record the total size of the files within and point to the previous directory entry at the same level. This allows rapid location of the file given the block locator. It is also used in conjunction with the opendir call to reconstruct the file name that is necessary to open the file to confirm the transclusion. This form of file memory is suitable to writing into a file during the first pass and landmark sort. The pointers between sibling directories are relative to the file memory.

We build a large FILO structure FM to hold our file memory. We build it in one order and consult it repeatedly in the other order. Abstractly FM is a stack of integers—it grows like a stack but does not shrink as it is used backwards. As we process a file yield of readdir we append something to FM. If the yield indicated a file we append the file’s block count. If we cannot process the file, perhaps because access was denied, we append zero. If the yield indicates a directory we note the FM cursor and the current global block count. We then process the entries in the new directory. After we exhaust that directory we append the size of the data appended to FM for this directory. We also append the total block count. We then a append −1 as a distinctive marker indicating a directory. This allows to skip this part of FM as we read FM. Since FM is large and most of its integers are small we compress them by techniques best understood by reading the code.

There is an infelicity in this design. It takes special logic to discover a string that is repeated in the same file block. We omit that logic for now.

Some Program Logic

Lmain.c and Umain.c

are alternative files. Umain.c is the production version and Lmain.c is a test version. Lmain takes the sources of the program as a test case. File m.h is included twice to form a transclusion to be found. File junk is now a random fragment copied from p.c which is not found on little endian machines. This is a bug to be fixed.

fport

fport is a data structure to designate a file and a point in that file. These are short lived, on the stack, and do not persist between the two phases.

sx

sx is the type of a structure that remembers the location of a landmark between phases. There is the array LM of sx’s.

locate

locate is the only code that makes values for fports. It opens files in phase two.

scan

scan proceeds forwards thru the stream hash and watching for landmarks. scan calls so for each anded (see so) word. scan is shared between the two phases. It is called by fl for each file in the first phase and vernier and forget during second phase.

gaw

scan calls gaw which successively returns each word of hashed stream. gaw calls the system fread function to get data from the file.

vernier(fport * C p, b32 C bn, fport C * C dp)

veriner starts with a crude recollection of where a landmark was found in the first pass and locates it exactly. It does this by rereading and computing the string hash.

void find_match()

Scans the list of landmarks, sorted by signature, looking for the same signature at different points. When it finds the same signature at for two places it calls span.

span(fport C * C X, fport C * C Y, sstream * C x, sstream * C y)

is local to file p.c. It is called when two marks with the same signature are noted. Its job is to maximally extend the transclusion forward and backward. X and Y identify the marks and where they were found. The bit offset for X is < that of Y. This is significant only when the files are the same. It is easy for span’s caller and convenient for span. x and y are open sstreams for the respective files available for accessing the files. span makes the transclusion report and removes (calling forget) the marks from the base generated by the first file in the specified span.

sread(uchar * wh, sstream * s, b32 C loc)

sread is like fread except for having improved read backwards performance, taking a modified stream parameter and doing bit oriented offsets. sread(uchar * wh, sstream * ss, b32 loc) reads 512 bytes from stream ss at bit offset loc into wh. The sstream thing buffers some data to improve performance. sread uses the buffer in the sstream as a cache for this file.

void sopen(sstream * s, FILE *f)

opens an sstream on an already open file.

sstream

This is a structure that lives on the stack in phase two. It includes a file descriptor and 1K buffer.

void forget(fport C * p, b32 start, b32 stop)

locates and removes all marks from the base that were generated by p’s file between byte offsets start and stop. It does this by re-doing the hash function over that span and doing a binary search in the base for each mark regenerated. Marks are removed only if the loc field in the mark indicates it came from the start-stop span since the same signature may appear elsewhere in the file and thus in the mark data base. Marks are “removed” by changing their loc’s to −5. Modifying their sig field would break the logic of binary search! Some future version of this program might learn to consolidate the mark data base by removing the dead entries and then shrinking it. (Note the peculiar effects of the signedness coercions of the arguments of the max function. max operates on signed values so as to make otherwise unsigned values such as ~0 seem small.)

The routines w_log and r_log store and retrieve a list if 32 bit integers. They compact leading zeros in these integers. r_log retrieves successive integers in the reverse order that w_log stored them. log_crsr is a variable that can be read and written to provide random access to the integer store. w_log and r_log are used to record an excessively clever data base called name-memory and located by log_crsr that is used to convert a global block number recorded in that mark data base back into a file name. It is very compact but a bit delicate. The delicacy is that it may fail if a directory in the path for the file name is changed between forming the memory (first phase) and consulting it (second phase). I don’t know a way overcome this delicacy short of vast increase in the size of the main mark data base. It could be made more self diagnostic by including some sort of name hash which could be used to detect most of these changes. This would double the size but not ultimately improve the situation for there would be nothing safe to do to find the real old file. The logic of the name-memory assumes constant directories. For each item in a directory (file or sub-directory) it remembers the number of global blocks traversed while processing that entry in phase one. When phase one finishes a directory It assumes that the order and number of the items in a directory as presented in the SVR4 system call readdir is the same in phase one and two. When this fails transclusions are lost and copious diagnostic messages arise unless suppressed. No other ill effects should result.

so

is called once with each word of the ‘anded stream hash’. This stream is the and of eight streams consisting of the stream hash shifted i bits to the right for i from 0 thru 7. A one bit in this stream indicates eight consecutive ones in the stream hash and thus a mark. In order of “early to late” so finds each 1 bit and its offset in the word.

It calls a deeper routine with the bit offset within file and 32 bits of relevant bits of stream hash computed from array grunge, a small circular array or stream hash bits. These hash bits are aligned with the landmark.

Which deeper routine it calls, Rlm, sm or forget_sam, is determined by the current value of the file global variable pointer rlm.

X8: uchar → uchar

is the sole interface to the stream hash function. If b_i are bytes of a bit string then X8(b_i++), yields successive bytes of the hash of b_i. Its output depends exactly on its current and previous 57 (=(B−P)/8) inputs. zreg() means the same as {int j; for(j=0; j<58; ++j) X8(0);} — it merely clears the memory.

core(much, who)

slightly improves malloc by testing for exhaustion. “who” is merely for reporting the failing request.

GBlc

is a global symbol that counts all blocks of all files. It supports locating stuff in files for later location and perusal.

IGBl

is a copy of GBlc’s value upon opening of the current file.

At the bottom, fread at the zx label in routine gaw in file fl.c reads the files in both phases.

The preprocessor symbol “C” is for “const” and I have tried to put it just about anywhere it could go. This is in line with my suspicion that the default identifier should be constant.

bgary

is an epiphenomenon, merely for debugging.

bi is the index thru the file block buffer, viewed as a word array.

X8

is the function that does the stationary hash function, one character at a time.

in

is a file opening variable local to fl.c. It gets its value as routine fl is called. It is for the current subject file.

Endianess

This code was designed and debugged on big endian machine (Sparc and 68K). Images of horizontal bit streams inhabit my head as I try to understand the code. I thought it unwise to code it ambidextrously from the beginning.

The main code transformation for a little endian machine is end-for-end swapping of the bits in fields that represent file data. This consists largely of interchanging C’s ≪ and ≫ operators. The two preprocessor symbols SA (towards Small Addresses) and LA (towards Large Addresses) map to ≪ and ≫ respectively for big endian machines and vice-versa for little endian ones. When the preprocessor symbol L is 1, the high bit of a byte is considered to be adjacent to the low bit of the following byte from the file. It is 1 for little endian machines and 0 for others. If preprocessor symbol Ebug is set then ‘L must be wrong for the machine’. This funny mode allows one to check out code pretty well on a machine of the wrong endianess. Ebug entails some run-time cost.

The function X8 is an 8 bit interface to the hash magic. It suffices to bit invert both the argument and value of X8. We go a bit farther for efficiency. The magic constant tables, head and tail, are special. Their entries are reversed just as they are computed. The internal variables a, b, c, d, y and z, are all reversed. The array index is also reversed. This seemed simpler than reworking the synthetic division logic. rBy reverses bytes and rW reverses words. The code X8pr tests the X8 code and the code in Fast.c. Its yield depends on L but not the endianess of the machine! It can be debugged on a machine with the wrong end.

Debugging little Endianess

I have verified that the computed hash function of the string ...0001 on the little endian machine, is indeed the reverse of the hash function of 1000... on the big endian machine.

More accurately, the yield of X8(1) le is the same as the yield of X8(0x80) be. I think the bits of the yield should reversed!!

That was then; now the yield of X8 for the one bit stream agrees on the two machines.