NAME
SYNOPSIS
DESCRIPTION
OPTIONS
FORMATTING CONVENTIONS
LIMITATIONS
HTML 2.0 CONSTRAINTS
SEE ALSO
AUTHOR

NAME

html-pretty - prettyprint HTML files

SYNOPSIS

html-pretty [ -? ] [ -a ] [ -c ] [ -f filename ] [ -h ] [ -i nnn ] [ -n ] [ -v ] [ -w nnn ] < infile > outfile

or

html-pretty [ -? ] [ -a ] [ -c ] [ -f filename ] [ -h ] [ -i nnn ] [ -n ] [ -v ] [ -w nnn ] file(s) > outfile

DESCRIPTION

html-pretty filters its HTML input from stdin, or from one or more named files given on the command line, and prettyprints it to stdout.

HTML (HyperText Markup Language) is the language used to specify formatting instructions in text files intended for viewing with World-Wide Web (WWW) client programs (browsers), such as arena(1), hotjava(1), lynx(1), netscape(1), and xmosaic(1).

The WWW idea began in late 1992. Because viewer programs support display of text, line drawings, color raster images, hypertext links, and uniform access to several Internet services, including file transfer, the number of WWW servers grew from zero to several hundred thousand in the first two years, and some of the more popular sites receive up to two million accesses a day from all over the Internet. Consequently, many Internet computer users are beginning to write HTML documents for their own home pages, and html-pretty is written for them.

The goal of a prettyprinter is to recognize all legal inputs, and to produce output that is indented to reflect the structure, with line lengths restricted for improved readability. Irregularities in coding practice, and outright errors, are more likely to be detected in the prettyprinted output than in the input.

SGML (Standard Generalized Markup Language, ISO 8879), and its particular document type definition instance, HTML, follow a rigorous grammar for text markup that makes it possible to clearly identify document parts, such as headings, sections, subsections, paragraphs, figures, tables, equations, and so on. Files with such standardized markup are particularly good candidates for prettyprinting.

The current definition of HTML is still in flux (version 2.0 is nearing standardization, and version 3.0, informally called HTML Plus, is under development in early 1995). html-pretty follows the grammar of version 3.0, which is a superset of that of version 2.0. Version 3.0 introduces several new tags (see the FORMATTING CONVENTIONS section below), and supports figures, input forms, tables, and a limited math mode.

One significant difference between the two grammar versions is that the HTML tag <P> is a paragraph separator in version 2.0, while it is a paragraph begin in version 3.0, and consequently expects to have a matching </P> end tag that is not required in 2.0. html-pretty will supply missing </P> tags, and delete empty <P> ... </P> environments. Since HTML translators ignore unrecognized tags, this is transparent to HTML version 2.0 implementations, and causes no problems.
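The paragraph handling described above can be illustrated with a minimal Python sketch. It is not the code used by html-pretty (which is a lex program); the function name normalize_paragraphs is hypothetical, and the sketch assumes that paragraphs are ended only by the next <P> tag or by end of input, which a real HTML parser would not assume.

```python
import re


def normalize_paragraphs(text: str) -> str:
    """Close every <P> paragraph and drop empty ones (illustrative only)."""
    # Remove existing </P> tags so that all paragraphs are closed uniformly.
    text = re.sub(r"</P\s*>", "", text, flags=re.IGNORECASE)
    # Split at each <P>; pieces[0] is whatever preceded the first <P>.
    pieces = re.split(r"<P\s*>", text, flags=re.IGNORECASE)
    out = []
    if pieces[0].strip():
        out.append(pieces[0].strip())
    for body in pieces[1:]:
        body = body.strip()
        if body:  # empty <P> ... </P> environments are deleted
            out.append(f"<P>{body}</P>")
    return "\n".join(out)
```

For example, the input <P>one<P><P>two</P> yields two closed paragraphs, and the empty paragraph in the middle disappears.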

html-pretty expects that its input is reasonably well-formed. Usually it is sufficient that the file can be displayed by one or more WWW browsers, producing the expected form. However, it would be unwise to write a large amount of program code without a compiler to check it, and it is similarly unwise to write documentation in HTML or SGML without at least a validating parser to ensure that the text is syntactically correct.

Fortunately, at least two such programs are publicly available (thanks to the generosity of their author, James Clark): nsgmls(1) and sgmls(1), together with UNIX shell scripts, html-check(1) and html-ncheck(1), to facilitate their use with HTML files. In addition, the nsgmls(1) distribution is accompanied by two SGML tag normalizers, sgmlnorm(1) and spam(1), and there is a UNIX shell script, html-spam(1), for one of them. You may therefore find it useful to apply html-spam(1), and either html-check(1) or html-ncheck(1), to your HTML files, and fix all of the errors that they detect, before filtering the files with html-pretty.

HTML strictly requires a certain amount of boiler-plate to be wrapped around the text, and there is ample evidence that most HTML files omit these wrappers, because WWW browsers are written to be tolerant of grammatical deviations. html-pretty will supply the wrappers if they are omitted; indeed, if given an empty input file, html-pretty produces output similar to this:

<!-- -*-html-*- -->
<!-- Prettyprinted by html-pretty lex version 0.09 [05-May-1995] -->
<!-- on Sat May  6 09:55:25 1995 -->
<!-- for Nelson H. F. Beebe (beebe@sunrise) -->

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">

<HTML>
    <HEAD>
        <TITLE>
            <!-- Please supply a descriptive title here -->
        </TITLE>
        <!-- Please supply a correct e-mail address here -->
        <LINK REV="made" HREF="mailto:USERNAME@HOSTNAME">
    </HEAD>
    <BODY>
    </BODY>
</HTML>

This example, minus the comments, shows the minimal markup that should be expected in an HTML file, although the grammar permits the HTML, HEAD and BODY environments to be implicitly assumed if they are omitted. While WWW browsers ignore the DOCTYPE declaration, it is essential for SGML parsers, since it identifies the grammar rules that apply to what follows.
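As a rough illustration of how missing wrappers might be supplied, here is a minimal Python sketch. The function name supply_wrappers and the test for an existing <HTML> tag are assumptions made for this example; html-pretty itself may detect and insert the wrappers quite differently.

```python
def supply_wrappers(body: str, indent: int = 4) -> str:
    """Wrap a bare fragment in DOCTYPE, HTML, HEAD, TITLE, and BODY markup."""
    if "<html" in body.lower():
        return body  # wrappers appear to be present; leave the input alone
    pad = " " * indent
    lines = [
        '<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">',
        "",
        "<HTML>",
        f"{pad}<HEAD>",
        f"{pad * 2}<TITLE>",
        f"{pad * 3}<!-- Please supply a descriptive title here -->",
        f"{pad * 2}</TITLE>",
        f"{pad}</HEAD>",
        f"{pad}<BODY>",
    ]
    # Body lines are placed two levels deep, inside HTML and BODY.
    lines += [f"{pad * 2}{line.strip()}" for line in body.splitlines() if line.strip()]
    lines += [f"{pad}</BODY>", "</HTML>"]
    return "\n".join(lines)
```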

OPTIONS

The following command-line options are supported, and they affect all following filenames. Option values are always provided as separate arguments following the option name. Letter case in option names is not significant.

Any argument that begins with a hyphen is expected to be an option, and will raise an error if it is not recognized. If a filename begins with a hyphen, you therefore need to disguise it by supplying a leading directory path. For example, ./-foo represents the file named -foo in the current directory in UNIX.
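The argument rules above can be sketched in Python as follows. This is only an illustration of the stated behavior, not the option parser in html-pretty; the names KNOWN_OPTIONS, classify_argument, and disguise are hypothetical.

```python
# Option names taken from the SYNOPSIS; letter case is not significant.
KNOWN_OPTIONS = {"-?", "-a", "-c", "-f", "-h", "-i", "-n", "-v", "-w"}


def classify_argument(arg: str) -> str:
    """Return "option" or "file"; unrecognized hyphen arguments are errors."""
    if arg.startswith("-"):
        if arg.lower() in KNOWN_OPTIONS:
            return "option"
        raise ValueError(f"unrecognized option: {arg}")
    return "file"


def disguise(filename: str) -> str:
    """Prefix a leading-hyphen filename with ./ (UNIX path syntax assumed)."""
    return "./" + filename if filename.startswith("-") else filename
```

With these definitions, -W is accepted as an option, -foo raises an error, and disguise("-foo") returns ./-foo, which is then treated as a filename.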

-a: Show author information on stderr.
-c: Show copyright information on stderr.
-f filename: Supply an alternate input filename for use in the output comment banner. This overrides the actual filename(s), and provides a way to name the output, even when no named input file is available, because standard input is redirected, or comes from a pipe.
-h or -?: Display brief usage information on stderr.
-i nnn: Set the number of spaces for each indentation level (default: 4).
-n: Suppress generation of the default leading comment banner.
-v: Show version information on stderr.
-w nnn: Set the maximum output line width (default: 72). This limit may be exceeded if an excessively long string without embedded spaces is encountered, and it is ignored completely inside preformatted or verbatim text.
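The interaction of the -i and -w options can be approximated with Python's standard textwrap module. This is a sketch under the assumption that wrapping happens only at spaces; the function name wrap_text and its parameters are hypothetical and do not correspond to anything in the html-pretty source.

```python
import textwrap


def wrap_text(text: str, level: int, indent: int = 4, width: int = 72) -> str:
    """Fill text to the given width at the given indentation level."""
    prefix = " " * (indent * level)
    return textwrap.fill(
        text,
        width=width,
        initial_indent=prefix,
        subsequent_indent=prefix,
        break_long_words=False,  # a long unbroken string may exceed the width
        break_on_hyphens=False,  # wrap only at whitespace
    )
```

Because long words are never split, a string of 100 characters without spaces is emitted on one line of 100 characters even when the width is 72, which mirrors the exception noted for the -w option.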

FORMATTING CONVENTIONS

HTML 2.0 contains the following 49 tags: A, ADDRESS, B, BASE, BLOCKQUOTE, BODY, BR, CITE, CODE, DD, DIR, DL, DT, EM, FORM, H1, H2, H3, H4, H5, H6, HEAD, HR, HTML, I, IMG, INPUT, ISINDEX, KBD, LI, LINK, LISTING, MENU, META, NEXTID, OL, OPTION, P, PLAINTEXT, PRE, SAMP, SELECT, STRONG, TEXTAREA, TITLE, TT, UL, VAR, and XMP.

HTML 3.0 augments the 2.0 grammar with 53 additional tags: ABBREV, ABOVE, ACRONYM, ARRAY, ATOP, AU, BAR, BELOW, BIG, BOX, BQ, BT, CAPTION, CHOOSE, CREDIT, DDOT, DEL, DFN, DIV, DOT, FIG, HAT, INS, ITEM, LANG, LEFT, LH, MATH, NOTE, OF, OVER, OVERLAY, PERSON, PRE, Q, RIGHT, ROOT, ROW, S, SMALL, SQRT, STYLE, SUB, SUP, T, TAB, TABLE, TD, TH, TILDE, TR, U, and VEC.

These tags are identified by their occurrence in the html.dtd and html-3.dtd document type definition files in lines like these:

<!ENTITY % font " TT | B | I ">
<!ENTITY % phrase "EM | STRONG | CODE | SAMP | KBD | VAR | CITE ">
<!ELEMENT (%font;|%phrase) - - (%text)+>
<!ELEMENT XMP - -  %literal>

ENTITY declarations define text string substitutions, and ELEMENT declarations define the tags recognized by the grammar.
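To make the substitution concrete, the following Python sketch expands the two parameter entities shown above inside an element name group. It is an illustration only: the names ENTITIES, expand, and element_names are hypothetical, and the regular expression is a simplification of SGML's actual reference syntax (it accepts a reference with or without the closing semicolon).

```python
import re

# Replacement text copied from the two ENTITY declarations shown above.
ENTITIES = {
    "font": " TT | B | I ",
    "phrase": "EM | STRONG | CODE | SAMP | KBD | VAR | CITE ",
}


def expand(text: str, entities: dict) -> str:
    """Replace each %name; (or %name) reference with its replacement text."""
    return re.sub(
        r"%([A-Za-z][A-Za-z0-9.-]*);?",
        lambda m: entities.get(m.group(1), m.group(0)),
        text,
    )


def element_names(group: str, entities: dict) -> list:
    """Return the tag names declared by an ELEMENT name group."""
    return [name for name in re.split(r"[\s|()]+", expand(group, entities)) if name]
```

Applied to the name group (%font;|%phrase), element_names returns the ten tag names TT, B, I, EM, STRONG, CODE, SAMP, KBD, VAR, and CITE.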

The HTML grammar permits certain end tags to be omitted, when their implied position can be determined from the grammatical context. In HTML 3.0, this includes the following tags: BODY, DD, DT, HEAD, ITEM, LI, MESSAGE, OPTION, P, TD, TH, and TR. Supporting such a feature requires the ability to parse a complete SGML grammar, which requires a great deal more code than html-pretty provides. Consequently, it does not support optional end tags; based on typical usage, they are expected to be always present, or always absent, according to the rules given below.

HTML comments are prettyprinted on separate lines. Their internal form is preserved exactly, without any line wrapping, since they will often contain specially-formatted material.

The following HTML tag names occur in begin/end pairs (<TAG> and </TAG>), often with substantial amounts of intervening text. They are prettyprinted on separate lines, with their enclosed text indented one level: A, ABBREV, ABSTRACT, ACRONYM, ADDED, ADDRESS, ARG, AROW, ARRAY, AU, BLOCKQUOTE, BODY, BQ, CAPTION, CITE, CMD, CREDIT, DIV1, DIV2, DIV3, DIV4, DIV5, DIV6, FIG, FN, FOOTNOTE, FORM, H1, H2, H3, H4, H5, H6, HEAD, HIDE, HTML, LANG, LH, MARGIN, MATH, MESSAGE, NOTE, OPTION, OVERLAY, P, PERSON, Q, QUOTE, REMOVED, SELECT, TABLE, TEXTAREA, and TITLE.

These HTML tag names occur in begin/end pairs, usually with smaller amounts of enclosed material. They appear inline in the running text, and do not alter indentation: B, BIG, BT, CODE, DFN, EM, I, KBD, Q, REV, S, SAMP, SMALL, STRONG, T, TT, U, and VAR.

These HTML tag names occur in begin/end pairs, and delimit lists. They appear on separate lines, with their enclosed text indented two levels: DIR, DL, MENU, OL, and UL.

These HTML tags mark the beginning of list items, and have matching end tags which are supplied if they are absent. They are output on separate lines, indented one level from the enclosing list: DD, DT, and LI.

These HTML tag names appear inline, without affecting indentation: TAB, TAG, TD, TH, and TR.

These HTML tag names occur only inside MATH mode, and appear inline, without affecting indentation: ABOVE, ATOP, BAR, BELOW, BOX, CHOOSE, DDOT, DEL, DOT, HAT, INS, LEFT, OF, RIGHT, ROOT, ROW, SQRT, SUB, SUP, TILDE, and VEC.

This HTML tag marks an explicit line break, and has no matching end tag; preceding space is deleted, and a newline follows: BR.

These HTML tags have no matching end tag; they appear alone on separate lines: BASE, CHANGED, HR, IMG, ITEM, INPUT, ISINDEX, LINK, META, NEXTID, OVER, RENDER, STYLE, and STYLES.

These HTML tags appear in begin/end pairs, and delimit preformatted, or verbatim text. The beginning and ending tags are output on separate lines, with the enclosed material copied exactly as it appeared in the input stream: CDATA, LISTING, PRE, and XMP.

Finally, the HTML tag PLAINTEXT marks the beginning of verbatim text that continues to end-of-file; it appears on a separate line. Although some HTML viewers will terminate the verbatim text environment on reaching a matching end tag, </PLAINTEXT>, that practice is now considered erroneous.

All tags that are not explicitly named above are treated as normal text, and have no effect on the indentation.
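The first two conventions above (block tags on separate lines with indented contents, and everything else kept inline) can be sketched in a few lines of Python. This is a toy model, not the html-pretty algorithm: the names BLOCK, TOKEN, and prettyprint are hypothetical, BLOCK holds only a small subset of the block tags listed above, and lists, comments, verbatim text, and line wrapping are not handled.

```python
import re

# A small subset of the tags that are printed on separate lines.
BLOCK = {"BLOCKQUOTE", "BODY", "H1", "H2", "HEAD", "HTML", "P", "TITLE"}

# A token is either a begin/end tag or a run of text between tags.
TOKEN = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>|[^<]+")


def prettyprint(html: str, indent: int = 4) -> str:
    level, out, buffer = 0, [], []

    def flush():
        # Emit pending inline material as one line at the current level.
        line = " ".join("".join(buffer).split())
        if line:
            out.append(" " * (indent * level) + line)
        buffer.clear()

    for m in TOKEN.finditer(html):
        closing, name = m.group(1), (m.group(2) or "").upper()
        if name in BLOCK:
            flush()
            if closing:  # an end tag is outdented before it is printed
                level = max(level - 1, 0)
            out.append(" " * (indent * level) + m.group(0))
            if not closing:  # a begin tag indents what follows it
                level += 1
        else:
            buffer.append(m.group(0))  # inline tags and text stay together
    flush()
    return "\n".join(out)
```

Given <HTML><BODY><P>Hello <B>world</B></P></BODY></HTML> as input, the sketch prints each of HTML, BODY, and P on its own line, one level deeper each time, with the text and its inline B tags on a single line inside the paragraph.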

LIMITATIONS

Like LaTeX, Lisp, and TeX, SGML is an extensible language. In particular, it is possible to redefine at run time the syntax and semantics of all lexical elements, including the base character set, the characters that may appear in identifiers, and the special characters that delimit tags or strings.

In SGML, this is conventionally done in SGML declaration files for syntax definitions, and in Document Type Definition (DTD) files, which are analogous to LaTeX style files. Low-level typesetting commands of the flavor ``select a 14-point Lucida-BoldItalic font'', ``skip down vertically 6 picas'', and ``draw a horizontal rule 3 ems long'' are notably absent. SGML documents are expected to use only high-level markup commands, leaving the visual appearance entirely up to the DTD specification, and the formatting software.

Unlike LaTeX and TeX, SGML is not a typesetting system. It is only a grammar for a standard markup language, and it is left to SGML software implementors to write DTDs, and to provide for translation of SGML documents to specific document formatting, typesetting, or word processing systems. In this respect, SGML is similar to the RTF (Rich Text Format) supported by many popular word processors, in that it can serve as an intermediate language for electronic document exchange; several language translators between SGML and other text representations are mentioned in the SEE ALSO section below. Regrettably, there is wide variation in the capabilities and document models of current text formatting systems, so such translations are usually rough approximations that may require substantial hand patching to make them truly satisfactory.

HTML is a modest subset of SGML, with tag names apparently chosen from SGML, the Free Software Foundation's TeXinfo system (which in turn is modeled on the earlier Scribe document formatting system), and occasionally, also from LaTeX.

html-pretty is written in this spirit: it knows about the meaning and typical use of all standard HTML tags, but nothing about other SGML DTDs. While it would be reasonably straightforward to prepare modifications of html-pretty to support other, specific, SGML DTDs, you cannot expect it to be very effective for handling arbitrary SGML text.

HTML 2.0 CONSTRAINTS

It is not immediately obvious from the HTML grammar just which tags can be contained in, or by, other tags. The following list, adapted from the SoftQuad HoTMetaL documentation, provides a useful summary. HTML tags that are obsolete in the HTML 2.0 grammar are identified; they should not be used in new documents.

A

contained in: ADDRESS B BODY CITE CODE DD DFN DT EM H1 H2 H3 H4 H5 H6 I KBD LI P PRE SAMP STRONG TT U VAR

contains: #PCDATA B CITE CODE DFN EM I IMG KBD SAMP STRONG TT U VAR

ADDRESS

contained in: BLOCKQUOTE BODY FORM LI

contains: #PCDATA A HR IMG P

B

contained in: A B CITE CODE DD DFN DT EM H1 H2 H3 H4 H5 H6 I KBD KEY LI P PRE SAMP STRONG TT U VAR

contains: #PCDATA A B BR CITE CODE DFN EM I IMG INPUT KBD SAMP STRONG TEXTAREA TT U VAR

BASE

contained in: HEAD

BLOCKQUOTE (obsolete)

contained in: BODY FORM

contains: #PCDATA ADDRESS DL HR IMG OL P PRE UL

BODY

contained in: HTML

contains: ADDRESS BLOCKQUOTE DL FORM H1 H2 H3 H4 H5 H6 HR IMG LISTING OL P PRE UL XMP

BR

contained in: A ADDRESS B BLOCKQUOTE CITE CODE DD DFN DT EM H1 H2 H3 H4 H5 H6 I KBD P PRE SAMP STRONG TT U VAR

CITE

contained in: A B CITE CODE DD DFN DT EM H1 H2 H3 H4 H5 H6 I KBD KEY LI P PRE SAMP STRONG TT U VAR

contains: #PCDATA A B BR CITE CODE DFN EM I IMG INPUT KBD SAMP STRONG TEXTAREA TT U VAR

CODE

contained in: A B CITE CODE DD DFN DT EM H1 H2 H3 H4 H5 H6 I KBD KEY LI P PRE SAMP STRONG TT U VAR

contains: #PCDATA A B BR CITE CODE DFN EM I IMG INPUT KBD SAMP STRONG TEXTAREA TT U VAR

DD

contained in: DL

contains: #PCDATA A B BR CITE CODE DFN DL EM HR I IMG INPUT KBD OL P PRE SAMP STRONG TEXTAREA TT U UL VAR

DFN

contained in: A B CITE CODE DD DFN DT EM H1 H2 H3 H4 H5 H6 I KBD KEY LI P PRE SAMP STRONG TT U VAR

contains: #PCDATA A B BR CITE CODE DFN EM I IMG INPUT KBD SAMP STRONG TEXTAREA TT U VAR

DIR (obsolete)

contains: LI

DL

contained in: BLOCKQUOTE BODY DD FORM LI

contains: DD DT

DT

contained in: DL

contains: #PCDATA A B BR CITE CODE DFN EM I IMG INPUT KBD SAMP STRONG TEXTAREA TT U VAR

EM

contained in: A B CITE CODE DD DFN DT EM H1 H2 H3 H4 H5 H6 I KBD KEY LI P PRE SAMP STRONG TT U VAR

contains: #PCDATA A B BR CITE CODE DFN EM I IMG INPUT KBD SAMP STRONG TEXTAREA TT U VAR

FORM

contained in: BODY FORM LI

contains: ADDRESS BLOCKQUOTE DL FORM H1 H2 H3 H4 H5 H6 HR LISTING MESSAGE OL P PRE UL XMP

H1

contained in: BODY FORM

contains: #PCDATA A B BR CITE CODE DFN EM I IMG INPUT KBD SAMP STRONG TEXTAREA TT U VAR

H2

contained in: BODY FORM

contains: #PCDATA A B BR CITE CODE DFN EM I IMG INPUT KBD SAMP STRONG TEXTAREA TT U VAR

H3

contained in: BODY FORM

contains: #PCDATA A B BR CITE CODE DFN EM I IMG INPUT KBD SAMP STRONG TEXTAREA TT U VAR

H4

contained in: BODY FORM

contains: #PCDATA A B BR CITE CODE DFN EM I IMG INPUT KBD SAMP STRONG TEXTAREA TT U VAR

H5

contained in: BODY FORM

contains: #PCDATA A B BR CITE CODE DFN EM I IMG INPUT KBD SAMP STRONG TEXTAREA TT U VAR

H6

contained in: BODY FORM

contains: #PCDATA A B BR CITE CODE DFN EM I IMG INPUT KBD SAMP STRONG TEXTAREA TT U VAR

HEAD

contained in: HTML

contains: BASE ISINDEX LINK META TITLE

HR

contained in: ADDRESS BLOCKQUOTE BODY DD FORM LI

HTML

contains: BODY HEAD

I

contained in: A B CITE CODE DD DFN DT EM H1 H2 H3 H4 H5 H6 I KBD KEY LI P PRE SAMP STRONG TT U VAR

contains: #PCDATA A B BR CITE CODE DFN EM I IMG INPUT KBD SAMP STRONG TEXTAREA TT U VAR

IMG

contained in: A ADDRESS B BLOCKQUOTE BODY CITE CODE DD DFN DT EM FORM H1 H2 H3 H4 H5 H6 I KBD LI P PRE SAMP STRONG TT U VAR

INPUT

contained in: B CITE CODE DD DFN DT EM H1 H2 H3 H4 H5 H6 I KBD P PRE SAMP STRONG TT U VAR

ISINDEX

contained in: HEAD

KBD

contained in: A B CITE CODE DD DFN DT EM H1 H2 H3 H4 H5 H6 I KBD KEY LI P PRE SAMP STRONG TT U VAR

contains: #PCDATA A B BR CITE CODE DFN EM I IMG INPUT KBD SAMP STRONG TEXTAREA TT U VAR

KEY (obsolete)

contains: #PCDATA B CITE CODE DFN EM I KBD STRONG TT U VAR

LI

contained in: DIR MENU OL UL

contains: #PCDATA A ADDRESS B CITE CODE DFN DL EM FORM HR I IMG KBD OL P PRE STRONG TT U UL VAR XMP

LINK

contained in: HEAD

LISTING

contained in: BODY FORM

contains: 12 A

MENU (obsolete)

contains: LI

MESSAGE

contained in: FORM

META

contained in: HEAD

NEXTID (obsolete)

OL

contained in: BLOCKQUOTE BODY DD FORM LI

contains: LI

OPTION

contained in: SELECT

P

contained in: ADDRESS BLOCKQUOTE BODY DD FORM LI

contains: #PCDATA A B BR CITE CODE DFN EM I IMG INPUT KBD SAMP STRONG TEXTAREA TT U VAR

PLAINTEXT (obsolete)

PRE

contained in: BLOCKQUOTE BODY DD FORM LI

contains: #PCDATA A B BR CITE CODE DFN EM I IMG INPUT KBD SAMP STRONG TEXTAREA TT U VAR

SAMP

contained in: A B CITE CODE DD DFN DT EM H1 H2 H3 H4 H5 H6 I KBD KEY LI P PRE SAMP STRONG TT U VAR

contains: #PCDATA A B BR CITE CODE DFN EM I IMG INPUT KBD SAMP STRONG TEXTAREA TT U VAR

SELECT

contained in: B CITE CODE DD DFN DT EM H1 H2 H3 H4 H5 H6 I KBD P PRE SAMP STRONG TT U VAR

contains: OPTION

STRONG

contained in: A B CITE CODE DD DFN DT EM H1 H2 H3 H4 H5 H6 I KBD KEY LI P PRE SAMP STRONG TT U VAR

contains: #PCDATA A B BR CITE CODE DFN EM I IMG INPUT KBD SAMP STRONG TEXTAREA TT U VAR

TEXTAREA

contained in: B CITE CODE DD DFN DT EM H1 H2 H3 H4 H5 H6 I KBD P PRE SAMP STRONG TT U VAR

TITLE

contained in: HEAD

contains: #PCDATA

TT

contained in: A B CITE CODE DD DFN DT EM H1 H2 H3 H4 H5 H6 I KBD KEY LI P PRE SAMP STRONG TT U VAR

contains: #PCDATA A B BR CITE CODE DFN EM I IMG INPUT KBD SAMP STRONG TEXTAREA TT U VAR

U

contained in: A B CITE CODE DD DFN DT EM H1 H2 H3 H4 H5 H6 I KBD KEY LI P PRE SAMP STRONG TT U VAR

contains: #PCDATA A B BR CITE CODE DFN EM I IMG INPUT KBD SAMP STRONG TEXTAREA TT U VAR

UL

contained in: BLOCKQUOTE BODY DD FORM LI

contains: LI

VAR

contained in: A B CITE CODE DD DFN DT EM H1 H2 H3 H4 H5 H6 I KBD KEY LI P PRE SAMP STRONG TT U VAR

contains: #PCDATA A B BR CITE CODE DFN EM I IMG INPUT KBD SAMP STRONG TEXTAREA TT U VAR

XMP

contained in: BODY DD FORM LI
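A containment list of this kind lends itself to a simple table lookup. The Python sketch below encodes a handful of the "contained in" entries from the list above and checks whether one tag may appear directly inside another. The names CONTAINED_IN and may_contain are hypothetical, and the table is deliberately partial; it is not part of html-pretty, which does not validate its input.

```python
# A few "contained in" entries copied from the list above (partial table).
CONTAINED_IN = {
    "BASE": {"HEAD"},
    "BODY": {"HTML"},
    "HEAD": {"HTML"},
    "ISINDEX": {"HEAD"},
    "LINK": {"HEAD"},
    "MESSAGE": {"FORM"},
    "META": {"HEAD"},
    "OPTION": {"SELECT"},
    "TITLE": {"HEAD"},
}


def may_contain(parent: str, child: str):
    """Return True or False, or None if the child tag is not in the table."""
    allowed = CONTAINED_IN.get(child.upper())
    if allowed is None:
        return None
    return parent.upper() in allowed
```

For instance, may_contain("HEAD", "TITLE") is true, while may_contain("BODY", "TITLE") is false, because TITLE is listed as contained only in HEAD.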

AUTHOR

Nelson H. F. Beebe, Ph.D.
Center for Scientific Computing
Department of Mathematics
University of Utah
Salt Lake City, UT 84112
Tel: +1 801 581 5254
FAX: +1 801 581 4148
Email: <beebe@math.utah.edu>
URL: http://www.math.utah.edu/~beebe
