I would like HTML documents to have some facility for telling an
indexing tool (a web worm, spider, etc.) what the author believes
is significant. Currently, most webworms simply index the entire text
of the HTML document, throwing out excessively common words ("the",
"web", ...) and 'words' that contain numbers or special characters.
Some webworms pay special attention to what's inside <title> or <h1>
tags, as a way of trying to figure out what the author thinks is
significant about the document.

Both of these methods have serious flaws. The first tends to index
both too much and too little. For example, while the word "web" may
be so prevalent in web documents that indexing tools are forced to
drop it, it should be retained for the W3 Consortium site, or for a
nature site with a special page about spiders. And webworms that drop
words with numbers and special characters in them will drop the word
"3Com" from their index of 3Com's web site.
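
To make the problem concrete, here is a rough sketch of the kind of
filtering described above. It is not taken from any real webworm: the
stop list, the function name naive_index_terms, and the sample sentence
are all made up for illustration.

```python
# Hypothetical stop list; a real indexer would have its own, longer list.
STOP_WORDS = {"the", "a", "of", "and", "web"}


def naive_index_terms(text):
    """Return the set of terms a simple full-text indexer might keep."""
    terms = set()
    for token in text.split():
        # Trim surrounding punctuation and fold case.
        word = token.strip(".,;:!?\"'()").lower()
        if not word or word in STOP_WORDS:
            continue  # excessively common word: dropped
        if not word.isalpha():
            continue  # contains digits or special characters: dropped
        terms.add(word)
    return terms


sample = "3Com builds network adapters for the web"
print(sorted(naive_index_terms(sample)))
# prints ['adapters', 'builds', 'for', 'network']
```

Note that both "3Com" and "web" are gone from the result, even though
they may be exactly the words the page's author would want indexed.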

At least one web index that I have looked at does specify a way for
authors to build the index. Since there is no direct support in HTML,
the tool requires web server maintainers to build their own separate
index file, in a specific format, and leave it in an HTTP-accessible
document with a specific filename so the indexing tool can find it.

What we really should have is some sort of markup that goes into the
<head> portion of an HTML document (since it should not be displayed)
and specifies which keywords the author intends to be used to index
the document. For example, maybe <kl>..</kl> to mark a key list, with
<ki> to denote each individual key item.
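
As a sketch of how an indexing tool might read such a key list, the
following uses Python's standard html.parser module. The <kl> and <ki>
tags are only the hypothetical names suggested above, not part of any
HTML specification. The class name KeyListParser and the sample page
are likewise invented, and the sketch assumes <ki> may be left unclosed
the way <li> often is.

```python
from html.parser import HTMLParser


class KeyListParser(HTMLParser):
    """Collect the text of each <ki> item found inside a <kl> key list."""

    def __init__(self):
        super().__init__()
        self.keys = []
        self._in_list = False
        self._current = None  # text of the key item being read, or None

    def _finish_item(self):
        # Store the item read so far, with whitespace normalized.
        if self._current is not None:
            key = " ".join(self._current.split())
            if key:
                self.keys.append(key)
            self._current = None

    def handle_starttag(self, tag, attrs):
        if tag == "kl":
            self._in_list = True
        elif tag == "ki" and self._in_list:
            self._finish_item()  # a new <ki> implicitly ends the previous one
            self._current = ""

    def handle_endtag(self, tag):
        if tag in ("ki", "kl"):
            self._finish_item()
        if tag == "kl":
            self._in_list = False

    def handle_data(self, data):
        if self._current is not None:
            self._current += data


# Hypothetical page using the proposed markup in its <head>.
page = """<html><head>
<title>3Com Products</title>
<kl>
<ki>3Com
<ki>network adapters
<ki>web
</kl>
</head><body><h1>Products</h1></body></html>"""

parser = KeyListParser()
parser.feed(page)
parser.close()
print(parser.keys)
# prints ['3Com', 'network adapters', 'web']
```

With something like this, the author's own keys (including "3Com" and
"web") reach the index unchanged, regardless of what a generic
stop-word filter would have done with the body text.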
-- Cos (Ofer Inbar) -- cos@leftbank.com cos@cs.brandeis.edu
-- The Left Bank Operation -- lbo@leftbank.com http://www.leftbank.com
"We all misuse the net for personal gain, one way or another."
-- Larry Wall <lwall@netlabs.com>