The stream hash function described above has a mathematical theory behind it.
The rest of the program is anything but mathematical.
Indeed this is an ugly program.
It would be somewhat improved if written in C++, but would still be ugly.
Part of the ugliness is due to the lack of a crisp spec.
Compromises on just which duplications to find are scattered about the code.
In the limit, however, sufficiently large duplicates will be found.
There are two phases to the program, and it is currently coded to finish one phase before starting the next. There is some flexibility to merge the phases and trade space against execution time.
Phase one reads the accessible files, eliminating duplication arising from links. The large output of this phase is a list of marks. Each mark is represented by an instance of the sx structure that comprises a 32 bit signature and a 32 bit mark locator. This is accumulated into a large array LM. A small output of phase one is a coded map which can translate a mark locator into a file name and a block within file.
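The mark record might be sketched as follows. Only the two 32 bit widths and the names sx and LM come from the text; the field names, the capacity, and the helper are illustrative:

```c
#include <stdint.h>

/* One landmark mark: a 32 bit signature plus a 32 bit locator that
   names a block in the imagined concatenation of all files. */
struct sx {
    uint32_t sig;   /* landmark signature */
    uint32_t loc;   /* mark locator: global block number */
};

/* LM accumulates one sx per mark found during phase one. */
#define LM_MAX (1u << 16)            /* illustrative capacity */
static struct sx LM[LM_MAX];
static uint32_t lm_count;

static void note_mark(uint32_t sig, uint32_t loc) {
    if (lm_count < LM_MAX)
        LM[lm_count++] = (struct sx){ sig, loc };
}
```

Two 32 bit fields pack into eight bytes per mark, which is what makes LM large but tolerable.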
At the end of phase one we sort the marks by signature and carry along the locations. Then in phase two we pass over the sorted list. When we come to a pair of adjacent equal marks we call locate and read the files where the marks were found. We extend the equality span as far as possible. Then, so as not to process this span again as triggered by other marks in the same span, we remove the marks associated with one of the two files from the sorted list. We do this by again passing over that file portion and, for each mark found, doing a binary search in the sorted mark list and “removing” the mark from the list. We cannot merely zap the entry, for that would break the binary search code. Instead we disable the entry by changing its location field to −5. We keep such invalidated mark entries at the end of the run of entries with equal signatures.
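A minimal sketch of this invalidation trick, under assumed names. The −5 sentinel and the placement at the end of the equal-signature run are from the text; everything else, including the bubbling loop, is illustrative:

```c
#include <stdint.h>
#include <stddef.h>

struct sx { uint32_t sig, loc; };

#define DEAD ((uint32_t)-5)   /* disabled location field */

/* Disable entry i of the sorted mark list lm (length n) without
   disturbing binary search: mark it DEAD, then bubble it to the end
   of the run of entries with equal signatures so live entries with
   that signature stay contiguous. */
static void disable_mark(struct sx *lm, size_t n, size_t i) {
    lm[i].loc = DEAD;
    while (i + 1 < n && lm[i + 1].sig == lm[i].sig
                     && lm[i + 1].loc != DEAD) {
        struct sx t = lm[i];
        lm[i] = lm[i + 1];
        lm[i + 1] = t;
        ++i;
    }
}
```

The signatures remain sorted, so later binary searches still work; the span-processing code merely skips entries whose location is DEAD.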
We imagine the accessible files to be concatenated in directory order and their blocks numbered in that concatenation. This means that we must provide an auxiliary memory recording which file holds the block numbers in a given range. This memory will be large, but not nearly so large as the landmark memory. I propose that we form a chronological and linear record of each directory and file visited on the first pass. If we merely record the size of each file then the file may be efficiently located. Each directory entry will record the total size of the files within and point to the previous directory entry at the same level. This allows rapid location of the file given the block locator. It is also used in conjunction with the opendir call to reconstruct the file name that is necessary to open the file to confirm the transclusion. This form of file memory is suitable for writing into a file during the first pass and landmark sort. The pointers between sibling directories are relative to the file memory.
We build a large FILO structure FM to hold our file memory. We build it in one order and consult it repeatedly in the other order. Abstractly FM is a stack of integers: it grows like a stack but does not shrink as it is used backwards. As we process each yield of readdir we append something to FM. If the yield indicated a file we append the file’s block count. If we cannot process the file, perhaps because access was denied, we append zero. If the yield indicates a directory we note the FM cursor and the current global block count. We then process the entries in the new directory. After we exhaust that directory we append the size of the data appended to FM for this directory. We also append the total block count. We then append −1 as a distinctive marker indicating a directory. This allows us to skip this part of FM as we read it. Since FM is large and most of its integers are small we compress them by techniques best understood by reading the code.
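A toy version of the FM discipline, with invented names. Only the −1 marker and the order of the appended integers follow the text; the compression of small integers is omitted:

```c
#include <assert.h>
#include <stddef.h>

/* FM: built forwards during the directory walk, read backwards later. */
#define FM_MAX 1024
static long FM[FM_MAX];
static size_t fm_cursor = 0;

static void fm_push(long v) {
    assert(fm_cursor < FM_MAX);
    FM[fm_cursor++] = v;
}

/* readdir yielded a file: record its block count (zero if unreadable). */
static void fm_file(long blocks) { fm_push(blocks); }

/* Called before descending into a directory: note the FM cursor.
   (The caller also notes the current global block count.) */
static size_t fm_dir_open(void) { return fm_cursor; }

/* Called after the directory is exhausted: append the size of the
   data added to FM for this directory, the running global block
   total, and the distinctive -1 directory marker. */
static void fm_dir_close(size_t mark, long total_blocks) {
    fm_push((long)(fm_cursor - mark));
    fm_push(total_blocks);
    fm_push(-1);
}
```

Reading FM backwards, the −1 announces a directory, the next integer back is the block total, and the one before that says how many FM entries to skip to hop over the whole subtree.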
There is an infelicity in this design. It takes special logic to discover a string that is repeated in the same file block. We omit that logic for now.
The routines w_log and r_log store and retrieve a list of 32 bit integers. They compact leading zeros in these integers. r_log retrieves successive integers in the reverse of the order in which w_log stored them. log_crsr is a variable that can be read and written to provide random access to the integer store. w_log and r_log are used to record an excessively clever data base, called name-memory and located by log_crsr, that is used to convert a global block number recorded in the mark data base back into a file name. It is very compact but a bit delicate. The delicacy is that it may fail if a directory in the path for the file name is changed between forming the memory (first phase) and consulting it (second phase). I don’t know a way to overcome this delicacy short of a vast increase in the size of the main mark data base. It could be made more self-diagnostic by including some sort of name hash which could be used to detect most of these changes. This would double the size but not ultimately improve the situation, for there would be nothing safe to do to find the real old file. The logic of the name-memory assumes constant directories. For each item in a directory (file or sub-directory) it remembers the number of global blocks traversed while processing that entry in phase one. When phase one finishes a directory, it assumes that the order and number of the items in that directory, as presented by the SVR4 system call readdir, are the same in phase one and phase two. When this fails, transclusions are lost and copious diagnostic messages arise unless suppressed. No other ill effects should result.
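The text leaves the encoding to the code, so here is only a plausible stand-in: an LEB128-style scheme with 7 data bits per byte and a continuation bit, which compacts leading zeros and, because each integer ends in a marked byte, can also be walked backwards as r_log requires. The real w_log/r_log surely differ in detail:

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative integer log: bytes with the high bit set continue an
   integer; a byte with the high bit clear ends it. */
static unsigned char log_buf[4096];
static size_t log_crsr = 0;     /* readable and writable, as in the text */

static void w_log(uint32_t v) {
    while (v >= 0x80) {
        log_buf[log_crsr++] = (unsigned char)(0x80 | (v & 0x7f));
        v >>= 7;
    }
    log_buf[log_crsr++] = (unsigned char)v;   /* terminal byte */
}

/* Retrieve the integer just below log_crsr; caller must not read
   past the beginning of the store. */
static uint32_t r_log(void) {
    size_t end = log_crsr;
    size_t start = end - 1;                   /* terminal byte */
    while (start > 0 && (log_buf[start - 1] & 0x80))
        --start;                              /* continuation bytes */
    uint32_t v = 0;
    int shift = 0;
    for (size_t i = start; i < end; ++i, shift += 7)
        v |= (uint32_t)(log_buf[i] & 0x7f) << shift;
    log_crsr = start;                         /* step back over it */
    return v;
}
```

A small integer costs one byte and a full 32 bit integer five, which is the leading-zero compaction the text describes.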
It calls a deeper routine with the bit offset within the file and the 32 relevant bits of stream hash computed from the array grunge, a small circular array of stream hash bits. These hash bits are aligned with the landmark.
Which deeper routine it calls, Rlm, sm or forget_sam, is determined by the current value of rlm, a file-global pointer variable.
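The dispatch amounts to a call through a function pointer. The names rlm, Rlm, sm and forget_sam are from the text; the signature (bit offset plus 32 hash bits) is inferred, and the stub is purely for illustration:

```c
#include <stdint.h>

typedef void (*lm_fn)(long bit_off, uint32_t hash32);

static long seen_off = -1;                 /* records calls, demo only */
static void sm_stub(long bit_off, uint32_t hash32) { /* stands in for sm */
    seen_off = bit_off;
    (void)hash32;
}

static lm_fn rlm = sm_stub;  /* elsewhere set to Rlm, sm, or forget_sam */

/* The landmark loop calls through rlm, so the current phase decides
   which deeper routine runs without any test at the call site. */
static void landmark_hit(long bit_off, uint32_t hash32) {
    (*rlm)(bit_off, hash32);
}
```

Switching phases is then a single assignment to rlm rather than a conditional in the inner loop.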
At the bottom, fread at the zx label in routine gaw in file fl.c reads the files in both phases.
The preprocessor symbol “C” is for “const” and I have tried to put it just about anywhere it could go. This is in line with my suspicion that the default identifier should be constant.
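The convention amounts to something like this (the variable is illustrative):

```c
/* "C" abbreviates const, used wherever the language allows. */
#define C const

C int block_size = 4096;   /* reads as: const int block_size */
```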
bi is the index thru the file block buffer, viewed as a word array.
The main code transformation for a little endian machine is end-for-end swapping of the bits in fields that represent file data. This consists largely of interchanging C’s << and >> operators. The two preprocessor symbols SA (towards Small Addresses) and LA (towards Large Addresses) map to << and >> respectively for big endian machines, and vice versa for little endian ones. When the preprocessor symbol L is 1, the high bit of a byte is considered to be adjacent to the low bit of the following byte from the file. L is 1 for little endian machines and 0 for others. If the preprocessor symbol Ebug is set, then L must be set wrong for the machine. This funny mode allows one to check out code pretty well on a machine of the wrong endianness. Ebug entails some run-time cost.
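A sketch of how SA and LA might be defined; BIG_ENDIAN_MACHINE is an illustrative configuration flag, not a symbol from the program:

```c
#ifndef BIG_ENDIAN_MACHINE
#  define BIG_ENDIAN_MACHINE 0      /* assume little endian for this sketch */
#endif

#if BIG_ENDIAN_MACHINE
#  define SA <<    /* shift file bits towards small addresses */
#  define LA >>    /* shift file bits towards large addresses */
#else
#  define SA >>    /* on little endian the directions swap */
#  define LA <<
#endif
```

With this convention an expression like `w SA 4` always moves file bits towards smaller file addresses, whichever machine compiles it, so the endian dependence lives in one place.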
The function X8 is an 8 bit interface to the hash magic. It suffices to bit-reverse both the argument and the value of X8. We go a bit farther for efficiency. The magic constant tables, head and tail, are special: their entries are reversed just as they are computed. The internal variables a, b, c, d, y and z are all reversed. The array index is also reversed. This seemed simpler than reworking the synthetic division logic. rBy reverses the bits of a byte and rW reverses the bits of a word. The code X8pr tests the X8 code and the code in Fast.c. Its yield depends on L but not on the endianness of the machine! It can be debugged on a machine with the wrong endianness.
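For concreteness, a classic bit-reversal sketch in the spirit of rBy and rW; the real routines may use tables or other logic entirely:

```c
#include <stdint.h>

/* Reverse the bits of a byte: swap nibbles, then pairs, then bits. */
static unsigned char rBy(unsigned char b) {
    b = (unsigned char)(((b & 0xF0u) >> 4) | ((b & 0x0Fu) << 4));
    b = (unsigned char)(((b & 0xCCu) >> 2) | ((b & 0x33u) << 2));
    b = (unsigned char)(((b & 0xAAu) >> 1) | ((b & 0x55u) << 1));
    return b;
}

/* Reverse the bits of a 32 bit word: reverse each byte and swap the
   byte order (one plausible construction, not from the source). */
static uint32_t rW(uint32_t w) {
    return ((uint32_t)rBy((unsigned char)(w >> 24)))
         | ((uint32_t)rBy((unsigned char)(w >> 16)) << 8)
         | ((uint32_t)rBy((unsigned char)(w >>  8)) << 16)
         | ((uint32_t)rBy((unsigned char) w)        << 24);
}
```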
More accurately, the yield of X8(1) on a little endian machine is the same as the yield of X8(0x80) on a big endian machine. I think the bits of the yield should be reversed!!
That was then; now the yield of X8 for the one bit stream agrees on the two machines.