The html_analyzer-0.30 README file

OVERVIEW:

This file describes the types of processing performed by the html_analyzer software and provides copyright, disclaimer, and funding information. Please read the file Installation in this directory for information on installing the software. To walk through an example run of the analyzer, see Example.

AVAILABILITY:

The software is currently distributed via anonymous ftp from ftp.ncsa.uiuc.edu in the /Mosaic/misc directory, in compress'd and gzip'd forms. It is also available via anonymous ftp from ftp.gvu.cc.gatech.edu in pub/gvu/www/pitkow/html_analyzer.

MOTIVATION:

The intent of the html_analyzer is to assist in the maintenance of HyperText Markup Language (HTML) databases. As the number of HTML databases increases, so does the potential for hyperlinks that point to files or servers that no longer exist. This creates a need for an automated hyperlink validation program, which is exactly what the html_analyzer is. The program also explores the relationship between hyperlinks and the contents of the hyperlink.

PROCESSING:

This directory contains the software to perform analysis of HTML databases. Specifically, the following tasks are performed:

Extract all hyperlinks (a.k.a. anchors) from all *.html files within a given directory hierarchy. HREF values may be quoted or unquoted. The following types of hyperlinks are not processed:
- HREF=""
- HREF=" "
- HREF="#foo"
- HREF=#foo
- HREF="telnet"
- HREF=telnet
- HREF="rlogin"
- HREF=rlogin
Note: Within-document hyperlinks are pointless to verify; either the hyperlink goes to the intended section or it does not. The telnet and rlogin hyperlinks only require that the intended machine be alive. If the machine is alive, the user must then enter information interactively. Since this user interaction defeats the automated goal of the software, these access methods are not processed.
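The skip rules above can be sketched in a few lines of Python. This is only an illustration and not the analyzer's own C code: the names HREF_RE, SKIPPED_SCHEMES, is_processed, and extract_hrefs are hypothetical, and treating telnet and rlogin as URL schemes (or as the bare words listed above) is an assumption.

```python
import re

# Matches HREF=value where the value is double-quoted or a bare token.
HREF_RE = re.compile(r'href\s*=\s*(?:"([^"]*)"|([^\s">]+))', re.IGNORECASE)

# Access methods that are not processed (assumed to appear as URL schemes).
SKIPPED_SCHEMES = ("telnet", "rlogin")


def is_processed(href):
    """Return True if the skip rules listed above do not apply to this HREF."""
    value = href.strip()
    if not value:                  # HREF="" or HREF=" "
        return False
    if value.startswith("#"):      # within-document hyperlink
        return False
    scheme = value.split(":", 1)[0].lower()
    if scheme in SKIPPED_SCHEMES:  # telnet and rlogin need user interaction
        return False
    return True


def extract_hrefs(html):
    """Return the HREF values in html that would be processed, in order."""
    hrefs = []
    for quoted, bare in HREF_RE.findall(html):
        href = quoted or bare      # an unused regex group is the empty string
        if is_processed(href):
            hrefs.append(href.strip())
    return hrefs
```

Calling extract_hrefs() on the text of each *.html file would yield only the hyperlinks that remain after the exclusions listed above.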
Create non-html versions of the files. These files are placed in /var/tmp/html_analyzer by default and are used to examine the relationship between hyperlinks and the contents of the hyperlinks. To change the location of this repository, give the desired directory as the last command line argument, e.g.
shell_prompt> html_analyzer . /users/pitkow/swap
This creates the directory html_analyzer in /users/pitkow/swap and places the non-html files there.
Note: The path must already exist for execution to succeed. The html_analyzer creates a directory within this path; it does not create the path itself.
Validate the availability of the documents pointed to by the hyperlinks. This test is called validation and is accomplished via routines from Mosaic's modified WWWLibrary2.
Look for hyperlink contents that occur in the database but are not themselves hyperlinks (see Example). This test is termed completeness.
Look for a one-to-one relation between hyperlinks and the contents of the hyperlink (see Example). This test is called consistency.
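
As a rough illustration of the completeness test, the following Python sketch collects the contents of every hyperlink and then reports documents in which that same text appears outside of any anchor. It is not the analyzer's implementation: it assumes that "contents" means the text enclosed by an anchor, it uses plain substring matching, and the names ANCHOR_RE, TAG_RE, and completeness_report are hypothetical.

```python
import re

# An anchor element and the text it encloses.
ANCHOR_RE = re.compile(r'<a\b[^>]*>(.*?)</a>', re.IGNORECASE | re.DOTALL)
# Any markup tag.
TAG_RE = re.compile(r'<[^>]+>')


def completeness_report(documents):
    """documents maps a file name to its HTML text.

    Returns (file name, content) pairs where a known hyperlink content
    occurs in the file's text without being a hyperlink there.
    """
    # Gather the contents of every hyperlink in the database.
    contents = set()
    for html in documents.values():
        for text in ANCHOR_RE.findall(html):
            text = TAG_RE.sub("", text).strip()
            if text:
                contents.add(text)

    report = []
    for name, html in sorted(documents.items()):
        # Drop whole anchors first, then the remaining tags, so that
        # only text that is not hyperlinked is left.
        unlinked = TAG_RE.sub(" ", ANCHOR_RE.sub(" ", html))
        for content in sorted(contents):
            if content in unlinked:
                report.append((name, content))
    return report
```

Substring matching is a simplification; it would also report "Installation" inside "Installations", so a word-boundary match may be preferable in practice.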

RATIONALE:

We believe that there ought to exist a one-to-one correspondence between hyperlinks and the hyperlink's contents, such that every occurrence of the hyperlink points to only one document (or section of a document). This means every time a user sees a hyperlink, it will always point to the same section of a document. It also means that each section of a document will have only one hyperlink pointing to it. We hypothesize that such a correspondence is necessary for the user to form a clear internal representation of the connections in the HTML database.
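
The one-to-one correspondence described above can be checked mechanically. The Python sketch below is an assumption-laden illustration, not the analyzer's code: it takes (content, target) pairs that are presumed to have been extracted already, and the name consistency_report is hypothetical.

```python
from collections import defaultdict


def consistency_report(links):
    """links is an iterable of (content, target) pairs.

    Returns two dicts: contents that point to more than one target,
    and targets that are pointed to by more than one content.
    """
    targets_by_content = defaultdict(set)
    contents_by_target = defaultdict(set)
    for content, target in links:
        targets_by_content[content].add(target)
        contents_by_target[target].add(content)

    # The same hyperlink contents lead to different documents.
    one_to_many = {c: sorted(t) for c, t in targets_by_content.items()
                   if len(t) > 1}
    # Different hyperlink contents lead to the same document.
    many_to_one = {t: sorted(c) for t, c in contents_by_target.items()
                   if len(c) > 1}
    return one_to_many, many_to_one
```

A database satisfies the correspondence when both returned dictionaries are empty.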

RUNNING:

To run the html_analyzer after it has been installed (please read the file Installation in this directory for information on installing the software), type:

html_analyzer [-val] [-com] [-con] directory [path of repository]

The -val, -com, and -con options turn off the validation, completeness, and consistency tests, respectively. Only a directory name can be specified as the target to check; all *.html files within that directory hierarchy will be processed. The optional repository path (default /var/tmp) can be given if /var/tmp is full or otherwise undesirable. A directory (html_analyzer) is created in the repository path to store the temporary files generated by execution. The program does not create the repository path itself.
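
For readers who want to script around the tool, the command line described above can be modeled as follows. This Python sketch only mirrors the documented behavior; it is not how the html_analyzer itself parses its arguments, and the names parse_args and prepare_repository are hypothetical.

```python
import argparse
import os


def parse_args(argv):
    """Parse: html_analyzer [-val] [-com] [-con] directory [repository]."""
    parser = argparse.ArgumentParser(prog="html_analyzer")
    parser.add_argument("-val", action="store_true",
                        help="turn off the validation test")
    parser.add_argument("-com", action="store_true",
                        help="turn off the completeness test")
    parser.add_argument("-con", action="store_true",
                        help="turn off the consistency test")
    parser.add_argument("directory",
                        help="root of the hierarchy of *.html files")
    parser.add_argument("repository", nargs="?", default="/var/tmp",
                        help="existing path for the temporary files")
    return parser.parse_args(argv)


def prepare_repository(repository):
    """Create html_analyzer inside an existing repository path."""
    # The repository path itself is never created, only checked.
    if not os.path.isdir(repository):
        raise FileNotFoundError("repository path does not exist: " + repository)
    workdir = os.path.join(repository, "html_analyzer")
    os.makedirs(workdir, exist_ok=True)
    return workdir
```

For example, parse_args(["-val", ".", "/users/pitkow/swap"]) describes a run that skips validation and stores its temporary files under /users/pitkow/swap/html_analyzer.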

COPYRIGHT:

The libwww2 directory is the modified WWW library that accompanies xmosaic-pre4. The libhtmlw directory is also from the prerelease. Mosaic was developed by Marc Andreessen at the National Center for Supercomputing Applications. This code is available from ftp.ncsa.uiuc.edu in the /Web directory. The original WWWLibrary2 library was developed by Tim Berners-Lee at the European Laboratory for Particle Physics (CERN). This code is available from ftp.info.ch in the /pub/www/src directory. Please see the file Copyrights in this directory for more information on the copyrights that apply to these portions of code.

The Regents of the University of Colorado claim copyright on the other portions of the distribution.

This distribution of the software may be freely distributed, used, and modified, but may not be sold, as a whole or in part, without permission of the copyright owners of the respective parts.

DISCLAIMER:

This software is provided as is. The Laboratory for Atmospheric and Space Physics (LASP) and the author are not responsible for support of this distribution.

FUNDING:

Development of this software was funded by the NASA Earth Observing System Project under NASA contract NAS5-32392.

CHANGES:

Version 0.30 from 0.10

0) made compliant with current HTML as implemented by NCSA
1) removed memory leak
2) removed MOTIF dependencies

Version 0.10 from 0.02

0) made dependent on Mosaic's libwww2 and libhtmlw; this means that all valid Mosaic files are now valid html_analyzer files.
1) removed unnecessary temporary files created by extract_links(); extract_links() now loads the skiplists directly.
2) enabled validation of other access methods, e.g. gopher, wais, etc.

Version 0.02 from 0.01

0) converted CHECK_HTML_DB and GET_ANCHORS to C code.
1) added verification of relatively addressed hyperlinks.
2) added one-to-many check of the hyperlink's contents to documents pointed to (previously: many-to-one check of hyperlinks to the hyperlink's contents)
3) cleaned up

ENHANCEMENTS:

Here's a list of things that could be done to improve the html_analyzer:

0) create a program to automatically prune hyperlinks that no longer point to valid files. This raises some tricky questions about how automated the process should be. For example, it might be nice for the user to have the option of specifying the correct location of the file and having the software change the HREFs as needed, AS WELL as the option of having the software remove all anchors pointing to the file that no longer exists. Let me know if you're interested in this option; it seems like the next logical addition to the software.
1) add a linked list to the data struct of the skiplist that points to a list of other files that have the same hyperlink and hyperlink content. This will enable more sophisticated analysis, e.g. enable option 0) above by producing a list of files that point to a document for pruning purposes, etc.
2) add statistical analysis of the HTML database, i.e. number of hyperlinks per document, number of links to a document, list of files that point to a document, etc.
3) perform an empirical study to confirm the hypothesis of the importance of a one-to-one correspondence between hyperlinks and their content. [I might do this this fall if time allows].

COMMENTS:

The purpose of this distribution is to further the development of HTML database creation and maintenance utilities. Comments, questions, and REVISIONS are indeed welcome.

To be added to the html_analyzer mailing list, mail

pitkow@cc.gatech.edu with the subject: html_analyzer add

James E. Pitkow
Graphics, Visualization and Usability Laboratory
Georgia Institute of Technology

pitkow@cc.gatech.edu