Future of meta-indices: site indexing proposal and Perl script

rst@ai.mit.edu (Robert S. Thau)
Errors-To: listmaster@www0.cern.ch
Date: Mon, 21 Mar 1994 16:11:00 --100
Message-id: <9403211507.AA01703@volterra>
Reply-To: rst@ai.mit.edu
Originator: www-talk@info.cern.ch
Sender: www-talk@www0.cern.ch
Precedence: bulk
From: rst@ai.mit.edu (Robert S. Thau)
To: Multiple recipients of list <www-talk@www0.cern.ch>
Subject: Future of meta-indices: site indexing proposal and Perl script
X-Listprocessor-Version: 6.0c -- ListProcessor by Anastasios Kotsikonas
Content-Length: 3749
In the latest spin of the Archie-for-the-Web discussion, there seems to be
a consensus (at least among the discussants) that having individual sites
provide an index for their contents, along the lines of the IAFA templates
used by Martijn Koster's ALIWeb, is a good way for such a service to work.

Now, harvesting the titles and URIs from an existing site is just a SMOP.
(It took me about an hour to write the Perl code to do it, starting from my
(NCSA) server config files --- pointer to the code below).  However,
Martijn's guidelines for ALIWeb strongly recommend "Keywords" and
"Description" fields in the templates as well.  No automated script is
likely to do a really good job writing Description fields in the reasonably
near term, and even automatically choosing appropriate keywords from among
a preselected base set is a non-trivial problem.

What I'm doing, at least provisionally, to drive the final version of my Perl
script (site-index.pl v0.1, pointer to code below) is to put the source
for the IAFA description and keywords fields inside the document itself,
by (ab?)use of the <META ...> tag which was discussed on this list some
time ago to solve a different problem.  For example, the document at
http://www.ai.mit.edu/events/events-list.html contains, near the top, the
following:

  <meta name="iafa-description"
  value="MIT AI lab events, including seminars, conferences, and tours">
  <meta name="iafa-keywords"
  value="MIT, Artificial Intelligence, seminar, conference">

(There's one other kind of meta-information my indexer uses --- if it sees
<meta name="iafa-type" value="service">, it indexes the page in question
with a SERVICE template, as opposed to a DOCUMENT template.  This is useful
for cover pages of search engines and the like).
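
As an illustration of the extraction step, here is a minimal sketch in
Python (not the Perl from site-index.pl, whose internals are not shown in
this message).  It assumes the tags are written exactly as in the example
above, with "name" and "value" attributes; the class and function names
are made up for the sketch.

```python
from html.parser import HTMLParser


class IafaMetaParser(HTMLParser):
    """Collect the <title> text and any <meta name="iafa-..." value="..."> pairs."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}          # e.g. {"description": "...", "keywords": "..."}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser hands us tag and attribute names already lowercased.
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attrs = dict(attrs)
            name = (attrs.get("name") or "").lower()
            value = attrs.get("value")
            if name.startswith("iafa-") and value is not None:
                # Store under the part after the "iafa-" prefix.
                self.meta[name[len("iafa-"):]] = value

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_iafa(html_text):
    """Return (title, meta_dict) for one HTML document given as a string."""
    parser = IafaMetaParser()
    parser.feed(html_text)
    parser.close()
    return parser.title.strip(), parser.meta
```

Meta tags whose names do not start with "iafa-" are ignored, so other
uses of <META ...> in the same document do not leak into the index.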

This use of <META ...> solves another problem as well, that of determining
which documents make the index.  Files with the <meta name="iafa-...">
fields get indexed; the rest don't.  So, once these tags are in the
documents, the rest of IAFA template preparation (finding the files,
getting the titles out and the URIs right) can be completely automated
(which is effectively what my site-index.pl script does).
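
To make the selection rule concrete, the following Python sketch (again,
not the actual Perl) turns one document's title and iafa-* values into a
template, and skips documents that carry no such tags.  The field names
(Template-Type, Title, URI, Description, Keywords) follow the ALIWeb
conventions as I understand them and should be treated as an assumption;
the function name and the dictionary layout are hypothetical.

```python
def make_template(uri, title, meta):
    """Format one index entry, or return None if the page has no iafa-* tags.

    `meta` maps the part of the tag name after "iafa-" to its value,
    e.g. {"description": "...", "keywords": "...", "type": "service"}.
    """
    if not meta:
        # No iafa-* tags at all: the document stays out of the index.
        return None

    # iafa-type=service selects a SERVICE template; everything else is a DOCUMENT.
    is_service = meta.get("type", "").lower() == "service"
    lines = [
        "Template-Type: " + ("SERVICE" if is_service else "DOCUMENT"),
        "Title: " + title,
        "URI: " + uri,
    ]
    if "description" in meta:
        lines.append("Description: " + meta["description"])
    if "keywords" in meta:
        lines.append("Keywords: " + meta["keywords"])
    return "\n".join(lines) + "\n"
```

Entries produced this way could simply be concatenated, separated by
blank lines, to form the site index file.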

However, the <META ...> tags do raise another problem: whether this use
of <META ...> is appropriate and, if it is, how to make sure that the uses
which different tools may eventually make of meta-information don't
conflict.  A central registry of meta-information names would be a good
idea, if people are going to start using them.

BTW, the Perl code I'm using to build my site index from the <meta ...>
tags (site-index.pl v0.1) is available at


N.B. the script knows the structure of NCSA httpd site configuration files,
which it reads to find out which directories (and their subdirectories) to
index; it would have to be modified to work with the configuration files of
another server, but that shouldn't be hard.  A sample of the output is,
naturally, at

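On the configuration-reading step mentioned above: a rough Python sketch
of the idea (not the Perl in site-index.pl) is given here.  It assumes
the directories of interest come from the DocumentRoot and Alias
directives of an NCSA-style srm.conf; which directives the real script
consults is not stated in this message, and the function name is made up.

```python
def read_indexable_dirs(srm_conf_text):
    """Return (url_prefix, directory) pairs from DocumentRoot and Alias lines."""
    dirs = []
    for raw in srm_conf_text.splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        directive = fields[0].lower()
        if directive == "documentroot" and len(fields) == 2:
            # The document root is served at the top of the URL space.
            dirs.append(("/", fields[1]))
        elif directive == "alias" and len(fields) == 3:
            # Alias <url-prefix> <directory>
            dirs.append((fields[1], fields[2]))
    return dirs
```

Each returned pair gives a directory to walk and the URL prefix under
which the files found there would appear, which is what is needed to get
the URIs right.
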

For the moment, the script has to be configured by changing some variables
at the top of it, as described in the comments; cleaner configuration and
documentation will be there eventually.

Incidentally, even if you don't want to deal with the <meta ...> tags,
site-index.pl can still be of some use; setting the $require_meta variable to
0 in the prolog will get you a draft site.idx with entries for every HTML
document you have with so much as a title.  The result is probably not
suitable for direct submission to an indexing service, but culling the
inappropriate files and filling in the blanks is probably a better way to
construct a useful site.idx than typing the whole thing in from scratch.
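
The effect of that switch can be sketched in a few lines of Python (an
illustration only; the parameter name mirrors $require_meta, and the
function name is hypothetical).  It assumes a title is needed in either
mode, which the message states only for the $require_meta = 0 case.

```python
def should_index(title, meta, require_meta=True):
    """Decide whether a document gets an entry in the site index.

    `meta` is a dict of the iafa-* values found in the document
    (empty if there were none).
    """
    if not title.strip():
        # Untitled documents are never indexed.
        return False
    if require_meta and not meta:
        # Strict mode: only documents carrying iafa-* tags qualify.
        return False
    return True


# With require_meta=False, any titled page yields a draft entry
# whose description and keywords still have to be filled in by hand.
pages = [("Events", {"keywords": "MIT"}), ("Scratch notes", {}), ("", {})]
strict = [t for t, m in pages if should_index(t, m)]
draft = [t for t, m in pages if should_index(t, m, require_meta=False)]
```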