Re: Resource discovery, replication (WWW Announcements archives?)

Martijn Koster <m.koster@nexor.co.uk>

Mail folder: WWW Talk Apr 94-present

Errors-To: listmaster@www0.cern.ch
Date: Wed, 4 May 1994 14:15:44 +0200
Message-id: <9405041212.AB08521@dxmint.cern.ch>
Reply-To: m.koster@nexor.co.uk
Originator: www-talk@info.cern.ch
Sender: www-talk@www0.cern.ch
Precedence: bulk
From: Martijn Koster <m.koster@nexor.co.uk>
To: Multiple recipients of list <www-talk@www0.cern.ch>
Subject: Re: Resource discovery, replication (WWW Announcements archives?) 
X-Listprocessor-Version: 6.0c -- ListProcessor by Anastasios Kotsikonas


> How? How does the robot know where to go and look? And does each
> robot have to search the entire space?

I didn't say it was the most efficient thing to do, just that it was
possible. The important thing about ALIWEB is that it is reasonably
efficient for the server, the server admin has a say about what stuff
is indexed, and the index is in a common format.  The star-shaped
gathering and the single indexing site haven't become a problem yet.

> Each of these is an all-or-nothing proposition: in the first
> case, I have to locate an ALIWEB server with all the data in the
> world on it (scalability test says: BZZZZT). Or I can copy
> all the data to my machine (BZZZZT). Or I can get "the list of
> hosts" (BZZZT) and do it myself.

Sure, the current implementation of ALIWEB doesn't scale up forever.
The central data gathering might be a bottleneck, which can be solved
by a hierarchy of data gatherers, or as you suggest by broadcasting,
and the single database could be a problem too.
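A rough sketch of what I mean by a hierarchy of gatherers, in Python. It
assumes each index is simply a mapping from URL to description; the site
names, the regions and the merge_indexes helper are all made up for
illustration and are not how ALIWEB currently works.

```python
def merge_indexes(indexes):
    """Merge several {url: description} indexes into one.

    When two indexes describe the same URL, the later index in the list wins.
    """
    merged = {}
    for index in indexes:
        merged.update(index)
    return merged


# Hypothetical sites and regions, for illustration only.
uk_sites = [{"http://a.example.uk/": "Site A"}, {"http://b.example.uk/": "Site B"}]
us_sites = [{"http://c.example.com/": "Site C"}]

# Each regional gatherer contacts only its own sites...
uk_index = merge_indexes(uk_sites)
us_index = merge_indexes(us_sites)
# ...and the top-level gatherer contacts only the regional gatherers.
top_index = merge_indexes([uk_index, us_index])
```

The point is only that no single gatherer has to contact every server; the
merged database at the top is still a single database.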

> With my broadcast strategy, I just set up a process that gathers new
> articles and expires old ones. ...
> And its scalable: everybody has access to everything without anybody
> having to do everything.

Regarding the scalability, you still have "all the data in the world"
on a single machine, namely in the News spool area and in our
database, and you have still in effect "copied all the data across", with
NNTP instead of HTTP. So your first two BZZZZTs apply to your own approach
too.

The big advantages I see in using NetNews to broadcast IAFA templates
are that you don't have to register explicitly, and that it is more
likely that a number of different sites will provide searchable
indices for it.  Because it pushes rather than pulls there is little
overhead (although with If-Modified-Since this isn't a problem in HTTP
either), but you do put the burden of pushing on the people who
provide the information.
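To illustrate the If-Modified-Since remark: a gatherer can remember when it
last fetched a site's index and ask the server to send the file only if it
has changed since then; a server that honours the header answers with 304
Not Modified and no body when nothing has changed. A minimal sketch in
Python that only builds the request and does not send it. The URL and the
function name are invented for the example.

```python
from email.utils import formatdate
from urllib.request import Request


def conditional_request(url, last_fetch_epoch):
    """Build a GET request that asks for the body only if it has changed.

    last_fetch_epoch: time of the previous fetch, in seconds since the Unix epoch.
    """
    request = Request(url)
    # usegmt=True yields the HTTP date form, ending in "GMT".
    request.add_header("If-Modified-Since", formatdate(last_fetch_epoch, usegmt=True))
    return request


# Hypothetical index URL; 768053744 is 4 May 1994 12:15:44 UTC.
req = conditional_request("http://www.example.org/site.idx", 768053744)
# urllib stores header names capitalised as "If-modified-since".
stamp = req.get_header("If-modified-since")
```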

Security is a problem, though. With ALIWEB I at least know that the
index comes from the server I just pulled it off; with NetNews that's
not so easy. Yes, PEM could help, but then you are starting to make it a
complicated procedure. Also, because anybody can post, the
signal-to-noise ratio will drop, but that may well be acceptable.

> Plain text messages containg URL's should be deprecated in favor of
> articles that use MIME to indicate text/html,
> application/wais-source, message/external-body, etc. body parts.

As you are going to end up with a large database, you do need some sort
of attribute-value schema so that you can search on something sensible.
I suggest that IAFA-like templates do that quite nicely.
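To make the attribute-value point concrete, here is a small Python sketch
that reads templates written as "Name: value" lines separated by blank
lines, and then searches on one named field. The sample records, the field
values and both function names are invented for the example; this is a
simplification and not a full IAFA parser (it ignores continuation lines,
for instance).

```python
def parse_templates(text):
    """Split blank-line-separated templates into dicts keyed by lower-cased field name."""
    records = []
    for block in text.strip().split("\n\n"):
        record = {}
        for line in block.splitlines():
            name, sep, value = line.partition(":")
            if sep:  # skip lines that carry no "Name: value" pair
                record[name.strip().lower()] = value.strip()
        if record:
            records.append(record)
    return records


def search(records, field, term):
    """Return the records whose given field contains term, ignoring case."""
    term = term.lower()
    return [r for r in records if term in r.get(field, "").lower()]


# Invented sample data in the style of an IAFA template file.
SAMPLE = """\
Template-Type: DOCUMENT
Title: Guide to the archive
URI: /docs/guide.html
Keywords: archive, guide

Template-Type: SERVICE
Title: Staff directory lookup
URI: /cgi-bin/staff
Keywords: directory, people
"""

records = parse_templates(SAMPLE)
hits = search(records, "keywords", "directory")
```

Because every record has named fields, a query can be restricted to, say,
the keywords or the title, which a plain-text article doesn't let you do.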

-- Martijn
__________
Internet: m.koster@nexor.co.uk
X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M
X-500: c=GB@o=NEXOR Ltd@cn=Martijn Koster
WWW: http://web.nexor.co.uk/mak/mak.html