..
An inverted index is essentially a list of pointers to occurrences of every word in a collection of documents, so that, for example, in a collection of Shakespeare's works, one can pose questions like How often is a fool mentioned in the plays? and Are Romeo and Juliet ever mentioned in the same sentence? While search utilities like the UNIX grep(1) family can be used for simple queries, they suffer from several limitations:
The mg software is described in Appendix A of the book
Ian H. Witten, Alistair Moffat, and Timothy C. Bell Managing Gigabytes: Compressing and Indexing Documents and Images Van Nostrand Reinhold 1994 xiv + 429 pages US$54.95 ISBN 0-442-01863-0 Library of Congress catalog number TA1637 .W58 1994
Many of the algorithms implemented in mg represent significant advances over previous work, both in speed, and in storage requirements. On a fast workstation, in tens of minutes, or a few hours, mgbuild(1) can create an index to all of the words in hundreds of thousands of documents occupying hundreds of megabytes, or even more than a gigabyte, of disk space. mgquery(1) can then be used to answer complex queries, with responses often returned in a second or less. mg also contains algorithms to deal with images, so that with a small amount of descriptive text for each image, it is possible to do searches in collections of images, and to have retrievals display the images using a viewer like xv(1).
mg can deal with compressed text and image files and surprisingly, it usually runs faster than it would if the files were not compressed! Thus, the considerable disk space savings possible from file compression are not lost because of the need for fast document search and retrieval.
The Free Software Foundation GNU Project compression utilities gzip(1) and gunzip(1) are recommended for general use over older alternatives, like compress(1), because of their speed and high compression ratios.
A document for mg is a fragment of text suitable for retrieval as a unit when it is found to contain a requested word, or words. In a collection of poetry, a document might be a stanza, while in a novel, it could be a paragraph. In an index of first lines of poems, a document would likely be just a single line.
Just what constitutes a document is decided by a user-modifiable UNIX shell script, mg_get(1). The default script provided with the mg source distribution knows about these named document collections:
Let us take two examples: all BibTeX .bib files, and all TeX files, contained in subdirectories under the login directory.
For BibTeX, each bibliographic entry will be considered a separate document. In order to facilitate easy identification of entries, we shall require them to begin at the start of a line; the bibclean(1) utility can be used to standardize the format of .bib files, and to validate their string values, so that this requirement is met.
For TeX, each paragraph will be a separate document, and we assume that paragraphs are separated by blank lines. We assume that files with extensions .atx, .ltx, .stx, and .tex contain input to common TeX macro package variants.
Make a personal copy of the mg_get script, using the one in the mg source distribution (mg-1.0/mg/mg_get.sh), or the one in the local binary program directory, at many sites called /usr/local/bin/mg_get.
Examination of the mg_get script shows that each document collection name is used in a csh(1) case statement selector, and that most of work is done by very simple awk(1) programs that extract documents from files. In your private copy of the mg_get file, after the line
breaksw #davinci
and before the line
default:
insert this new code:
case bibfiles: # Takes a list of files that contain BibTeX entries, and splits them up # by putting ^B after each entry. Assumes that each entry # begins with a line '^@'. switch ($flag) case '-init': breaksw case '-text': find $HOME -name '*.bib' -print | \ sort | \ xargs -l100 awk \ '/^@/&&NR!=1{print "^B"} {print $0} END{print "^B"}' breaksw #-text case '-cleanup': breaksw #-cleanup endsw #flag breaksw #bibfiles case texfiles: # Takes a list of TeX files and split them up # by putting ^B after each paragraph. Assumes that each entry # begins with a line '^@'. switch ($flag) case '-init': breaksw case '-text': find $HOME -name '*x' -print | \ egrep '[.]tex$|[.]ltx$|[.]atx$|[.]stx$' | \ sort | \ xargs -l100 nawk ' /^ *$/ {if (b!=1) printf "^B";b=1} \ \!/^ *$/ {print;b=0} \ END {printf "^B"}' breaksw #-text case '-cleanup': breaksw #-cleanup endsw #flag breaksw #texfiles
The ^B characters here are Control-B characters, not caret-B pairs.
If you have a large number of BibTeX or TeX files, it is likely that a list of them would be too long for the UNIX shell to hold in a single variable, or on a single command line. Thus, instead of storing the output of find(1) in a variable, we proceed more cautiously, and employ it to produce a list of the required files, then pipe them to xargs(1), which in turn passes up to 100 filenames at a time to nawk(1) for document selection.
Install this modified mg_get script in your private directory for executable programs (e.g. $HOME/bin), create a directory $HOME/mgdata to hold the index, issue a rehash command if you are using csh(1) or tcsh(1), ensure that mg_get occurs in your search path before any system-wide one (the command which mg_get will tell you which version will be selected), then create the inverted indexes by mgbuild bibfiles and mgbuild texfiles. These commands may take several minutes to run if you have a lot of BibTeX or TeX files, or a large home directory tree. Once they are complete, you can then query the index with the commands mgquery bibfiles and mgquery texfiles. These should respond very rapidly.
In order to keep your index up-to-date, you should arrange for it to be recreated automatically and regularly, probably every night. You can do this with cron(1). Use the command crontab -e to edit your crontab file and add two lines like this:
00 04 * * * mgbuild bibfiles >$HOME/mgdata/bibfiles.log 2>&1 15 04 * * * mgbuild texfiles >$HOME/mgdata/texfiles.log 2>&1
Save the file and exit the editor. Now, every night at 4am and 4:15am, mgbuild(1) will reconstruct your inverted indexes, and the results of the builds will be saved in log files in your $HOME/mgdata directory.