site-index.pl, Perl Script to index WWW sites, version 0.2...

rst@ai.mit.edu (Robert S. Thau)

Errors-To: listmaster@www0.cern.ch
Date: Sat, 2 Apr 1994 07:12:35 --100
Message-id: <9404011712.AA08880@volterra>
Reply-To: rst@ai.mit.edu
Originator: www-talk@info.cern.ch
Sender: www-talk@www0.cern.ch
Precedence: bulk
From: rst@ai.mit.edu (Robert S. Thau)
To: Multiple recipients of list <www-talk@www0.cern.ch>
Subject: site-index.pl, Perl Script to index WWW sites, version 0.2...
X-Listprocessor-Version: 6.0c -- ListProcessor by Anastasios Kotsikonas
Content-Length: 3089

I have a new version of site-index.pl, a Perl script which largely
automates the job of building local indexes for Martijn Koster's ALIWeb
(and, perhaps, other future services), at sites running NCSA httpd.
Documentation, and pointers to code, are available at:

  http://www.ai.mit.edu/tools/site-index.html

The script works by looking for HTML documents keyed with metainformation,
using the <META ...> tag which has been proposed on this list as a possible
feature for HTML+.  It's also, as of this version, capable of building
multiple indexes (e.g., one for information of local interest only, and one
for export).  As an example of the metainformation, site-index.html itself
has the following <META> info:

  <meta name="keywords" value="resource discovery, site management, tools">
  <meta name="description"
  value="This is the documentation for site-index.pl, a tool which allows
  administrators of sites running NCSA httpd to largely automate the
  construction of local indexes for services such as Martijn Koster's ALIWEB">
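
To make the tag format concrete, here is a minimal sketch of pulling
name/value pairs out of tags like the ones above.  It is written in Python
with the standard html.parser module, not taken from site-index.pl itself;
the names MetaExtractor and extract_meta are hypothetical.

```python
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collect <meta name=... value=...> pairs from an HTML document."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser hands us tag and attribute names already lowercased.
        if tag != "meta":
            return
        attrs = dict(attrs)
        name, value = attrs.get("name"), attrs.get("value")
        if name is not None and value is not None:
            # Fold a value that spans several lines into one line of text.
            self.meta[name.lower()] = " ".join(value.split())


def extract_meta(html_text):
    """Return a dict mapping lowercased meta names to their values."""
    parser = MetaExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.meta
```

Folding whitespace is a choice made for this sketch, so that a description
wrapped over several lines comes out as a single line of text.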

There are configuration flags which determine whether the script follows
symlinks, where it puts its output, etc.  (There's also an undocumented
flag which will cause the script to build an index including every HTML
file you have with so much as a title, whether or not it has any <meta>
information.  I put it in for debugging only, and I can't imagine that
it's the right thing for very many people, but if you really think you
want it, it's in there...).
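
As a rough illustration of what a symlink flag controls, the sketch below
walks a document tree and lists the HTML files in it.  It is Python, not
the script's own code; find_html_files and its follow_symlinks parameter
are hypothetical names, and matching on a ".html" suffix is an assumption.

```python
import os


def find_html_files(root, follow_symlinks=False):
    """Yield the paths of .html files found under root."""
    # followlinks=True makes os.walk descend into symlinked directories;
    # a symlink cycle would then make the walk loop, so it defaults to off.
    for dirpath, dirnames, filenames in os.walk(root, followlinks=follow_symlinks):
        for filename in sorted(filenames):
            if filename.lower().endswith(".html"):
                yield os.path.join(dirpath, filename)
```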

A note of possible wider interest --- I've changed the NAMEs of the
<META> information to which the script responds.  The current lot are:

  description --- used to fill in the Description field of the IAFA
     templates (i.e., index entries) which the script builds.

  keywords --- used to fill in the Keywords field of the index entries

  resource-type --- what sort of thing this HTML file is.  The default
     (if none specified) is 'document', which is almost always appropriate;
     however, cover pages for search engines, input forms, and the like
     may be more appropriately indexed as being a 'service' (which is the
     other recognized value).

  distribution --- if the script is configured to build multiple indexes,
     this meta-datum is used to determine which index is appropriate for
     a particular file.  The mapping between distributions and the names
     of the index files is configurable, as is the distribution used for
     documents which don't specify any.

(These are case-insensitive; also, the 0.1 meta-names, 'iafa-description',
'iafa-keywords', and 'iafa-type', are deprecated but still work).
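
The rules above (case-insensitive names, deprecated aliases, a default
resource-type, and a distribution that selects the index file) can be
sketched as follows.  This is Python for illustration, not the script's
code.  The index file names and the default distribution are made up, and
mapping 'iafa-type' onto 'resource-type' is my reading of the list above.

```python
# Deprecated 0.1 names and the current names assumed to replace them.
DEPRECATED_NAMES = {
    "iafa-description": "description",
    "iafa-keywords": "keywords",
    "iafa-type": "resource-type",
}

# Hypothetical configuration: which index file each distribution feeds.
INDEX_FILES = {"local": "local-index.txt", "global": "site.idx"}
DEFAULT_DISTRIBUTION = "global"


def normalize_meta(raw):
    """Lowercase names, map deprecated ones, and fill in defaults."""
    meta = {}
    for name, value in raw.items():
        key = name.lower()
        if key in DEPRECATED_NAMES:
            # A current name, if also present, takes precedence.
            meta.setdefault(DEPRECATED_NAMES[key], value)
        else:
            meta[key] = value
    meta.setdefault("resource-type", "document")
    meta.setdefault("distribution", DEFAULT_DISTRIBUTION)
    return meta


def index_file_for(meta):
    """Pick the index file for a document from its distribution."""
    distribution = meta["distribution"].lower()
    return INDEX_FILES.get(distribution, INDEX_FILES[DEFAULT_DISTRIBUTION])
```

Sending an unrecognized distribution to the default index is a choice made
for this sketch; a real tool might prefer to warn about it.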

A few final notes:

The script should be getting the description for the index entries out of
an HTML+ <ABSTRACT>, if one is present, but it doesn't do that yet.  (I
haven't yet figured out what to do with formatting tags in the abstract.
That problem also arises with <TITLE>s, BTW, but there I'm just stripping
them out).
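
Stripping formatting tags out of a <TITLE> might look roughly like the
sketch below.  It is Python with regular expressions, which is one simple
way to do it and not necessarily how the script does; plain_title is a
hypothetical name.

```python
import re

# Match the first <title>...</title> pair, in any letter case.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title\s*>", re.IGNORECASE | re.DOTALL)
# Match any single tag, such as <EM> or </EM>.
TAG_RE = re.compile(r"<[^>]+>")


def plain_title(html_text):
    """Return the title text with tags removed, or None if there is no title."""
    match = TITLE_RE.search(html_text)
    if match is None:
        return None
    text = TAG_RE.sub("", match.group(1))
    # Fold line breaks and runs of spaces into single spaces.
    return " ".join(text.split())
```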

Also, it should be fairly easy to adapt the script to servers other than
NCSA --- the NCSA dependencies are entirely confined to one function which
reads the config files.  
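
To illustrate what keeping the server dependencies in one function can look
like, here is a sketch in Python (not the script's code).  It assumes the
only setting needed is the DocumentRoot directive from NCSA-style config
text; read_ncsa_config and the document_root key are hypothetical names.

```python
def read_ncsa_config(config_text):
    """Return the settings the indexer needs from NCSA-style config text."""
    settings = {}
    for line in config_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[0].lower() == "documentroot":
            settings["document_root"] = parts[1].strip()
    return settings
```

Supporting another server would then mean writing one more function that
returns the same dictionary.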

rst