WebGlimpse Technical Documentation

Note: This information is not meant for the beginner. It is meant for purposes of modifying WebGlimpse and giving advanced users more insight and control over the power of WebGlimpse. The formatting isn't pretty, and the writing isn't the best.

For a more user-level view of WebGlimpse, see the WebGlimpse home page.

This document is an overview of the WebGlimpse code; for more specific details, see the comments in the code itself.

It is assumed that the reader knows:

Contents

Cgi-bin scripts

The cgi-bin scripts are the 'heart' of WebGlimpse. There are two main files used here: webglimpse and webglimpse-fullsearch.

webglimpse

webglimpse is the 'main' script and the one that performs the actual search. It reads the arguments from the query string and stores them in variables prefixed with $QS_. These arguments are parsed, and the arguments to Glimpse are set accordingly. File names are resolved into titles and URLs (using some of the generated files). The output of Glimpse is then formatted as HTML and returned to the user.
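The query-string step can be illustrated with a minimal sketch. This is Python rather than the Perl of the actual script, and it is not the webglimpse code itself: the function name parse_query_string and the "query" parameter are hypothetical, while "lines" and "file" are options mentioned in this document.

```python
from urllib.parse import parse_qsl


def parse_query_string(query_string):
    """Return a dict mapping 'QS_<name>' to the URL-decoded value."""
    params = {}
    for name, value in parse_qsl(query_string, keep_blank_values=True):
        # Mirror the $QS_ prefix convention described above.
        params["QS_" + name] = value
    return params


# "query" is a made-up parameter name used only for illustration.
params = parse_query_string("query=glimpse&lines=1&file=%2Fdocs%2Findex.html")
```

Keeping every value under a common prefix makes it easy to tell user-supplied input apart from the script's own variables.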

If line searching is done (option lines=1), things become a bit more complicated. webglimpse calls mfs (also in the cgi-bin directory), which in turn does a bit of checking and calls getfile. (Earlier versions of mfs and getfile had a security problem that allowed someone to sneak by them and grab any world-readable file on the system. We believe we have solved this problem.) getfile copies the file line by line (prepending <pre> if it is not an HTML file) and highlights the line that matches the query. It also inserts a 'name' anchor at that line so that the browser jumps to it.
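The following Python sketch shows one way the two ideas above could fit together: a path check of the kind mfs might perform, and the line-by-line copy done by getfile. It is an illustration under assumptions, not the actual code. The names is_inside_archive and render_with_highlight are hypothetical, as are the closing </pre>, the HTML escaping of plain text, and the anchor name "match". The check shown is not necessarily the fix the authors made.

```python
import html
import posixpath


def is_inside_archive(path, archive_root):
    """Lexical check that path stays under archive_root (assumed policy).

    Symbolic links are not resolved here; a real check would need to.
    """
    root = posixpath.normpath(archive_root)
    full = posixpath.normpath(posixpath.join(root, path))
    return posixpath.commonpath([root, full]) == root


def render_with_highlight(lines, match_lineno, is_html, anchor="match"):
    """Copy lines, marking line number match_lineno (1-based)."""
    out = []
    if not is_html:
        out.append("<pre>")
    for lineno, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        if not is_html:
            # Assumption: plain text is escaped so it displays literally.
            text = html.escape(text)
        if lineno == match_lineno:
            # The 'name' anchor lets the browser jump to this line.
            text = '<a name="%s"></a><b>%s</b>' % (anchor, text)
        out.append(text)
    if not is_html:
        out.append("</pre>")
    return "\n".join(out)


rendered = render_with_highlight(["alpha\n", "a < b\n"], 2, is_html=False)
```

A link ending in #match would then bring the highlighted line into view.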

webglimpse-fullsearch

This script dynamically creates a large search box with many options. It remembers the referring page if the "file" option is set (i.e., if the code in .wgbox.html is written correctly), and it shows the neighborhood if the option shownh is set to 1 in the search string (i.e., http://...?file=...&shownh=1). This script does not call Glimpse itself; it calls webglimpse.

Executables

The executables are by far the most complicated part of WebGlimpse. They allow you to manipulate archives and to create or modify the search boxes.

confarc

This script prompts the user for the archive directory, tries to read a configuration file (archive.cfg) from that directory (if there is one, it reads in the default settings), then prompts the user for several settings.

The information is then stored in the archive.cfg file. If this is the first configuration of the archive (i.e., the configuration file didn't exist before), it copies over the files from the distribution directory. archive.cfg is largely self-explanatory. You can change it at any time and rerun wgreindex.

makecron

This generates the wgreindex script (see generated files below). It takes the archive directory as the only argument, reads the configuration file (archive.cfg) there, and creates the wgreindex script there.

addsearch

addsearch is a very powerful program. It can operate in two modes: insert and remove. It inserts the .wgbox.html HTML 'snippet' into, or removes it from, all the files listed in the .wg_madenh file (see generated files).

In insert mode, it reads the .wgbox.html file and inserts this file (with the appropriate variable substitutions) into the files before the first <!GH_SEARCH>, </body>, or </html> tag or EOF it sees (in that order). If it sees a <!GH_SEARCH> tag it replaces everything up to the <!GH_END> tag (including both tags). When the box is inserted, the tags are inserted appropriately as well (to delimit the box for removal or replacement).

In remove mode, it simply looks for the tags as before and replaces them, together with everything between them, with nothing (effectively removing the box and both tags).
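The two modes can be sketched in a few lines of Python. This is only an illustration of the rules described above, not the addsearch code: the function names insert_box and remove_box are hypothetical, the variable substitutions are left out, and the handling of a <!GH_SEARCH> tag without a matching <!GH_END> (it is ignored) is an assumption.

```python
import re

START, END = "<!GH_SEARCH>", "<!GH_END>"
# Everything from the start tag through the end tag, both included.
BOX_RE = re.compile(re.escape(START) + r".*?" + re.escape(END), re.DOTALL)


def insert_box(page, box):
    """Insert box (wrapped in the delimiter tags) into page."""
    wrapped = START + "\n" + box + "\n" + END
    if BOX_RE.search(page):
        # An existing box is replaced, tags included.
        return BOX_RE.sub(lambda m: wrapped, page, count=1)
    for pattern in (r"</body>", r"</html>"):
        found = re.search(pattern, page, re.IGNORECASE)
        if found:
            cut = found.start()
            return page[:cut] + wrapped + "\n" + page[cut:]
    # None of the tags was found: fall back to the end of the file.
    return page + wrapped + "\n"


def remove_box(page):
    """Remove the delimited box, including both tags."""
    return BOX_RE.sub("", page, count=1)


page = "<html><body>Hi</body></html>"
with_box = insert_box(page, "<form>search</form>")
replaced = insert_box(with_box, "<form>new</form>")
without_box = remove_box(with_box)
```

Because the box is always written back between the two tags, running insert mode repeatedly replaces the box instead of stacking up copies.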

makenh

This is by far the most complicated script. makenh is in charge of traversing the archive, figuring out which pages to index, which pages to fetch from remote servers, and how to create the neighborhoods.

The input to makenh is complicated as well. The diagram below shows which files makenh reads and which files it produces.

wginstall

This is the Perl installation script. The code is self-explanatory. See the installation manual for more information.

wginstall.server

This is a 'sub-script' of wginstall, and is used to parse the configuration file of the http daemon and get the settings for the server. After it gets the necessary information, it writes the information to the .wgsiteconf file in the $WEBGLIMPSE_HOME directory.

Configuration files

.wgfilter-box

This controls which files will get a search box added to them. Files that are not indexed (i.e., do not pass the .wgfilter-index) will not be used. It uses the Deny/Allow syntax similar to that of Harvest's configuration files.
It is copied from the /dist directory when confarc is first invoked for a directory.
It is used by makenh.

.wgfilter-index

This controls which files will be indexed. It uses the Deny/Allow syntax similar to that of Harvest's configuration files.
It is copied from the /dist directory when confarc is first invoked for a directory.
It is used by makenh.
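To make the relationship between the two filters concrete, here is a Python sketch of a Deny/Allow evaluator. The exact file syntax is not given in this document, so everything about the format here is an assumption: one rule per line, a keyword followed by a regular expression, the first matching rule wins, and files matching no rule are allowed. The names parse_rules, is_allowed, and gets_search_box are hypothetical.

```python
import re


def parse_rules(text):
    """Parse 'Allow <regex>' / 'Deny <regex>' lines (assumed format)."""
    rules = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # skip malformed lines
        action, pattern = parts
        rules.append((action.lower() == "allow", re.compile(pattern)))
    return rules


def is_allowed(path, rules, default=True):
    """First matching rule decides; 'default' applies if none match."""
    for allow, regex in rules:
        if regex.search(path):
            return allow
    return default


def gets_search_box(path, index_rules, box_rules):
    """A file gets a box only if it is indexed and passes the box filter."""
    return is_allowed(path, index_rules) and is_allowed(path, box_rules)


# Example rules, invented for illustration only.
INDEX_RULES = parse_rules(r"""
Deny \.gif$
Allow \.html$
Deny .*
""")
BOX_RULES = parse_rules(r"""
Deny /private/
Allow .*
""")
```

With these example rules, a/index.html is indexed and gets a box, while private/index.html is indexed but gets no box.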

.wgsiteconf

This file contains the configuration for the site.
It is generated during wginstall when the srm.conf is parsed.
It is used by makenh.

.wgbox.html


This file contains the HTML 'snippet' for the search box that addsearch inserts into pages.
It is copied from the /dist directory when confarc is first invoked for a directory.
It is used by addsearch. You are free to modify this box as you wish, as long as the WebGlimpse name and a pointer to WebGlimpse Home Pages appear.

Generated files

.remote/*

The .remote directory contains all files retrieved by makenh.
Generated by makenh.

.wg_toindex

This contains the list of files to index and is passed to glimpseindex when indexing is performed.
Generated by makenh.

.wg_madenh

This is a list of the names of the files for which makenh made neighborhoods.
Generated by makenh.

.nh.*

These are neighborhood files. Each one contains the list of files that WebGlimpse should consider 'neighbors' of the corresponding file (and pass to Glimpse).
Generated by makenh.

Library files (/lib)

html2txt

This is a simple Perl script that reads stdin, removes everything between < and >, and writes the result to stdout. It is used as a filter by glimpseindex.

Change on 5/16... Udi rewrote html2txt in C (now html2txt.c in the /lib directory) and it works faster. By default, the glimpse_filters file in the /dist directory refers to html2txt in the /lib directory, so the filters will be used for html files.
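The filter's behavior, as described above, can be sketched in Python as follows. This is not the Perl or C source; the names strip_tags and filter_stream are hypothetical, and the sketch makes no attempt to handle a '>' inside an attribute value or to decode entities.

```python
import io
import re

# A '<', any run of characters that are not '>', then '>'.
# The character class also matches newlines, so tags may span lines.
TAG_RE = re.compile(r"<[^>]*>")


def strip_tags(text):
    """Remove everything between < and >, brackets included."""
    return TAG_RE.sub("", text)


def filter_stream(src, dst):
    """Read all of src, strip the tags, and write the result to dst."""
    dst.write(strip_tags(src.read()))


source = io.StringIO("<h1>Title</h1>\n<p>Some <b>bold</b> text</p>")
result = io.StringIO()
filter_stream(source, result)
plain = result.getvalue()
```

Calling filter_stream(sys.stdin, sys.stdout) would give the stdin-to-stdout behavior of a filter.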

Distribution files (/dist)

These files are the defaults for the individual archives. Basically, confarc just takes these files, prepends a '.' to the name, and puts them into the archive directory. For information on what the files themselves are used for, see Executables and Generated files above.
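The copy step amounts to something like the Python sketch below. It is an illustration only, not the confarc code: the function name install_defaults is hypothetical, and the undotted names wgbox.html and wgfilter-index are inferred from the dotted names used elsewhere in this document.

```python
import os
import shutil
import tempfile


def install_defaults(dist_dir, archive_dir):
    """Copy each regular file in dist_dir to archive_dir as '.<name>'."""
    copied = []
    for name in sorted(os.listdir(dist_dir)):
        src = os.path.join(dist_dir, name)
        if not os.path.isfile(src):
            continue
        dst = os.path.join(archive_dir, "." + name)
        shutil.copyfile(src, dst)
        copied.append(dst)
    return copied


# Demonstration in throwaway directories; the file names are assumed.
with tempfile.TemporaryDirectory() as dist, \
        tempfile.TemporaryDirectory() as archive:
    for name in ("wgbox.html", "wgfilter-index"):
        with open(os.path.join(dist, name), "w") as handle:
            handle.write("default contents\n")
    install_defaults(dist, archive)
    installed = sorted(os.listdir(archive))
```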

File diagram (This diagram is slightly out of date -- gettitles and wgmapfile are no longer used; the information is obtained directly from the glimpse index.)