Usage

    links [options] Unix-name-of-served-directory
This is an ordinary Unix program that uses read access to a directory tree served by a web server. It recursively examines each HTML file in the named directory tree. It reports internal links to nonexistent files, and it reports files to which there are no internal links. The program ignores symbolic links in the Unix file system, and it does not reprocess a file that is reachable through more than one hard link.
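
The traversal described above can be sketched in Python. This is an illustration only, not the program's actual C code: the function name `html_files` and the choice of `.html` and `.htm` as the extensions that mark an HTML file are assumptions. Symbolic links are skipped, and a file with several hard links is yielded once by remembering its device and inode numbers.

```python
import os

def html_files(root):
    """Yield each HTML file under root once, skipping symbolic links."""
    seen = set()  # (device, inode) pairs of files already yielded
    for dirpath, dirnames, filenames in os.walk(root, followlinks=False):
        # Drop symlinked directories from the walk entirely.
        dirnames[:] = [d for d in dirnames
                       if not os.path.islink(os.path.join(dirpath, d))]
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # symbolic links are ignored
            # Assumption: an HTML file is recognized by its extension.
            if not name.lower().endswith((".html", ".htm")):
                continue
            st = os.lstat(path)
            key = (st.st_dev, st.st_ino)
            if key in seen:
                continue  # another hard link to a file already processed
            seen.add(key)
            yield path
```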

Links of the form <a … href="…" … >, <embed … src="…" … > and <img … src="…" … > are recognized. A link is considered internal if its href or src field does not contain a colon. For example, <a href="link.html"> is an internal link but <a href="http://www.google.com"> is external.
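
The recognition rule can be illustrated with Python's standard `html.parser` module. This is a sketch under assumptions: the names `LinkCollector`, `collect_links` and `is_internal` are hypothetical, and the real program parses HTML in its own way. The only rules taken from the text above are the three tag and attribute pairs and the colon test.

```python
from html.parser import HTMLParser

# Tag name -> attribute that carries the link, as listed above.
LINK_ATTR = {"a": "href", "embed": "src", "img": "src"}

def is_internal(target):
    """A link is internal when its href or src value contains no colon."""
    return ":" not in target

class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []  # (line number, target, internal?) tuples

    def handle_starttag(self, tag, attrs):
        wanted = LINK_ATTR.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value is not None:
                line, _column = self.getpos()
                self.links.append((line, value, is_internal(value)))

def collect_links(text):
    """Return every recognized link in text with its line number."""
    parser = LinkCollector()
    parser.feed(text)
    parser.close()
    return parser.links
```

Applied to a page containing the two example links above, `collect_links` marks `link.html` as internal and `http://www.google.com` as external.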

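There is as yet no “transitive closure” logic in the program: a file counts as referenced if any file links to it, whether or not the linking file can itself be reached. In particular, a file that refers to itself is considered referenced even if no other files refer to it.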
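
A minimal sketch of what the absence of transitive closure means, assuming the set of files and the links found in each file have already been collected. The function name `unreferenced` and the file names in the usage note are hypothetical. A file is referenced when it appears as the target of any link, including one from itself; nothing checks whether the linking file can be reached from a starting page.

```python
def unreferenced(all_files, links_from):
    """Return the files that no file links to.

    all_files:  set of file names
    links_from: dict mapping a file name to the targets it links to
    """
    referenced = set()
    for source, targets in links_from.items():
        # Every target counts, even when source == target (a self-link)
        # or when source is itself unreferenced.
        referenced.update(targets)
    return sorted(all_files - referenced)
```

With `home.html` and `a.html` linking to each other, `b.html` linking only to itself, and `island.html` linking to `c.html`, the only unreferenced file is `island.html`. `c.html` counts as referenced even though it is reachable only from an unreferenced file.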

Output Format

Output is written to standard output. For each broken link, the report names the file containing the bad link and the file name that could not be resolved, as in:
In file /home/norm/cap-lore.com/CapTheory/Rees/figs.html :
   on line 11, Bad keyword
   on line 14, No such file as domain-figure.png
The unreferenced files are listed at the end, one file name per line. When a whole directory and its contents are unreferenced, the report looks like:
Priv/obscure/  …
When a file is unreferenced but sits in a directory without an index file, and some other file in that directory is referenced, the unreferenced file name is listed followed by “except trunc”, as in:
annotes/88314F2.JPG except trunc
A surfer may find that file by truncating the URL of the referenced file to its directory, since a server commonly lists the contents of a directory that has no index file.
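
The “except trunc” rule can be sketched as follows. This is a hedged illustration, not the program's logic: the function name `annotate_unreferenced` is hypothetical, the index file is assumed to be called `index.html` (the text does not say which names count), and the collapsing of wholly unreferenced directories is left out.

```python
import posixpath

def annotate_unreferenced(all_files, referenced, index_names=("index.html",)):
    """List unreferenced files, marking those reachable by URL truncation.

    all_files and referenced are sets of paths relative to the server root.
    index_names is an assumption about what counts as an index file.
    """
    lines = []
    for path in sorted(all_files - referenced):
        directory = posixpath.dirname(path)
        siblings = {f for f in all_files if posixpath.dirname(f) == directory}
        has_index = any(posixpath.basename(f) in index_names for f in siblings)
        sibling_referenced = any(f in referenced for f in siblings)
        if not has_index and sibling_referenced:
            lines.append(path + " except trunc")
        else:
            lines.append(path)
    return lines
```

For instance, if `annotes/notes.html` (a made-up name) is referenced and `annotes/` has no index file, then `annotes/88314F2.JPG` is reported with the “except trunc” suffix.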

These names are relative to the root of the server tree. Funny characters in reported file names are escaped in a form compatible with HTTP conventions, I think.
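
For readers unfamiliar with such escaping, here is what percent-encoding of a path looks like with Python's standard `urllib.parse.quote`. This only illustrates the general convention. Exactly which characters the program escapes, and how, is not stated here, and the name `escape_name` is hypothetical.

```python
from urllib.parse import quote

def escape_name(relative_path):
    """Percent-encode a path, leaving '/' and unreserved characters alone."""
    return quote(relative_path)
```

Under this convention a space becomes `%20`, so a file named `notes/My File.html` would be reported as `notes/My%20File.html`.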

A few miscellaneous statistics are reported and explained at the very end.

Options

“-N fileName” causes files or directories whose last pathname component is fileName to be skipped.
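
The matching rule for “-N” compares only the last pathname component. A minimal sketch, with the hypothetical name `should_skip` and with the set of names given by “-N” options passed in as `skipped_names`:

```python
import os

def should_skip(path, skipped_names):
    """True when the last component of path is one of the -N names."""
    # normpath drops a trailing slash so "Priv/obscure/" yields "obscure".
    last = os.path.basename(os.path.normpath(path))
    return last in skipped_names
```

A file inside a skipped directory does not match by itself. In a recursive walk it is left out because the walk never descends into the skipped directory.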

An option beginning “-o” selects one or more debugging options, one per character following the “-o”. The possible selection characters are “n”, “A”, “H” and “F”.

“-mmax” makes the program refrain from reading more than max bytes from files, where “max” is a decimal number. This voluntarily limits the impact on the system, especially while debugging the program. The default is 50000000.

Performance

This code is fast enough that running it on the file server is probably less work for the server than serving the files to another machine that runs the code.

Security

Some low-grade security comes from disseminating URLs for some of your pages to only a few colleagues and not linking to them from your more public pages. It is tempting to put the output of this program in the web space for easy viewing. The “-N” option is a natural candidate for such “private files”. The report, however, includes the names of the files that were thus neglected. Giving the report output an obscure name may be good enough.

Build the code

On Solaris and x86 Linux I get an executable by saying:
gcc -O3 tree.c

On Mac OS X (2011 Apr) I get an executable with:
gcc -O3 -arch i386 -fnested-functions tree.c

Some guidelines for portable html