Re: Searchable Indexes: LISTEN NOW!

Tim Berners-Lee <timbl@www3.cern.ch>
Date: Thu, 1 Jul 93 11:38:34 +0200
From: Tim Berners-Lee <timbl@www3.cern.ch>
Message-id: <9307010938.AA05487@www3.cern.ch>
To: Nathan Torkington <Nathan.Torkington@vuw.ac.nz>
Subject: Re: Searchable Indexes: LISTEN NOW!
Cc: www-talk@nxoc01.cern.ch
Reply-To: timbl@nxoc01.cern.ch


Summarizing, we need

1.	A standard URL like http:/Catalogue.html

2.	A convention for saying "no" which could be just an existing
	but void file.

3.	A standard format for the file.

4.	A PD program distributed with the server to generate
	the file so that many people will do it weekly.

The format could be a set of
links where the content was the title of the document.
<a href="/docs/overview">Overview of our documentation</a>
which would have the advantage of human readability.
I agree it would be useful to have some depth information
or at least a weight.

It could be alternatively
	
	<LINK TITLE="Overview of our documentation"
		HREF="/docs/overview"
		WEIGHT=0.654>
	
where we could argue for hours about the meaning of
WEIGHT. (WEIGHT is an extra, but the rest is standard
HTML).

In either case, a "no" catalogue could contain
a polite message, and no list.
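
As a rough illustration of point 4 and the anchor format above, here is a
minimal Python sketch of a generator.  The function name render_catalogue
and its refusal parameter are hypothetical, not part of any proposal in
this thread; it only shows the two cases discussed: a list of links, or a
"no" catalogue holding a polite message and no list.

```python
from html import escape


def render_catalogue(entries, refusal=None):
    """Return catalogue text: one anchor per line, or only a refusal note.

    entries -- iterable of (href, title) pairs (assumed already collected)
    refusal -- if given, produce a "no" catalogue: the message and no list
    """
    if refusal is not None:
        return escape(refusal) + "\n"
    lines = []
    for href, title in entries:
        # The link content is the document title, as in the anchor example.
        lines.append('<a href="%s">%s</a>' % (escape(href, quote=True),
                                              escape(title)))
    return "\n".join(lines) + "\n"


catalogue = render_catalogue(
    [("/docs/overview", "Overview of our documentation")])
no_catalogue = render_catalogue([], refusal="Please do not index this server.")
print(catalogue, end="")
```

A weight or depth could be carried as a third element of each pair, but
its meaning is exactly what remains to be agreed.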

A FEW NUMBERS
	
From time to time I run a breadth-first traversal of the web from the
http://info.cern.ch/hypertext/DataSources/WWW/Geographical.html list  
of servers. Yesterday, counting unique
hostname:port pairs (without checking for CNAME aliases),

	95	registered servers (level 0)
	99	servers referred to by level 0 (level 1)
	172	servers referred to by level 1 (level 2)
	174	distinct servers in levels 0-2.

Going this deep takes long enough (an hour or so).  I use a filter at  
each stage to cut out known slow sites (typically Eastern Europe) or  
known buggy servers.  I also have to clean the links quite a bit, dropping
references with local hostnames only (not FQDN) and a small
amount of junk.  [Mailing the webmaster at sites with bad links
would obviously be a possibility.]
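
For anyone wanting to reproduce this kind of count, here is a hedged
Python sketch of a level-by-level (breadth-first) traversal.  It is not
the scripts used for the numbers above: the names server_key,
servers_by_level and links_of are hypothetical, the toy WEB table stands
in for real fetching, and treating a dot-less hostname as "not FQDN" is
an assumption.

```python
from urllib.parse import urlsplit


def server_key(url, default_port=80):
    """Return 'hostname:port' for a URL, or None for junk/local-only hosts."""
    parts = urlsplit(url)
    host = parts.hostname
    if not host or "." not in host:  # assumption: no dot means not an FQDN
        return None
    return "%s:%d" % (host.lower(), parts.port or default_port)


def servers_by_level(seed_urls, links_of, max_level=2, skip=frozenset()):
    """Return a list of sets; entry n holds the servers seen at level n."""
    frontier = list(dict.fromkeys(seed_urls))  # de-duplicate, keep order
    levels = []
    for level in range(max_level + 1):
        kept = []
        for url in frontier:
            key = server_key(url)
            # Filter out junk links and known slow or buggy servers.
            if key is not None and key not in skip:
                kept.append((url, key))
        levels.append({key for _, key in kept})
        if level == max_level:
            break
        refs = []
        for url, _ in kept:
            refs.extend(links_of(url))
        frontier = list(dict.fromkeys(refs))
    return levels


# Toy link table standing in for the real web (all hosts are made up).
WEB = {
    "http://a.example/": ["http://b.example/x", "http://a.example/y",
                          "http://localbox/page", "local/page"],
    "http://b.example/x": ["http://c.example:8001/", "http://slow.example/"],
}
levels = servers_by_level(["http://a.example/"],
                          lambda url: WEB.get(url, []),
                          skip={"slow.example:80"})
distinct = set().union(*levels)
print([len(s) for s in levels], len(distinct))
```

As in the counts above, a server may appear at more than one level, so
the distinct total is the union of the level sets rather than their sum.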

This is only for interest.  I don't generate an index.  The engine is
just a bunch of scripts using www -listrefs.

Tim