Re: Last-modified date & indexing

Mike Schwartz (schwartz@latour.cs.colorado.edu)
Fri, 11 Nov 1994 01:53:55 +0100

> Date: Thu, 10 Nov 1994 11:53:02 +0000
> From: " (Nick Arnett)" <narnett@verity.com>
> To: www-talk@www0.cern.ch, /DD.Common=robots/@nexor.co.uk
> Subject: Last-modified date & indexing

> ...
> As we're building our indexing tools, we're trying to figure out how to
> trigger index updates efficiently, which is why I'm asking.

I think HTTP, FTP, etc., are the wrong places to look for indexing
information. They are very inefficient for collecting it, because you
have to set up a connection, fork a server process, etc. for each
object retrieved. Also, they don't supply all the information you need
to build effective indexes (timestamps, MD5 signatures, etc.).
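
To make the per-object cost concrete, here is a rough sketch (in
Python; the host, export path, and one-record-per-line export format
are made up for illustration - this is not Harvest code) contrasting
robot-style retrieval with a single bulk export:

    import gzip
    import hashlib
    import http.client

    def summarize(body):
        # Toy "content summary": just an MD5 and a size. A real
        # gatherer would also extract keywords, timestamps, etc.
        return {"md5": hashlib.md5(body).hexdigest(), "size": len(body)}

    # Robot-style gathering: a fresh TCP connection (and, on a
    # forking httpd, a fresh server process) for every object.
    def gather_per_object(host, paths):
        summaries = []
        for path in paths:
            conn = http.client.HTTPConnection(host)
            conn.request("GET", path)
            summaries.append(summarize(conn.getresponse().read()))
            conn.close()
        return summaries

    # Archive-site gathering: the site summarizes its own objects
    # locally and exports them all over one connection as one
    # compressed stream.
    def gather_bulk(host, export_path="/summaries.gz"):
        conn = http.client.HTTPConnection(host)
        conn.request("GET", export_path)
        data = gzip.decompress(conn.getresponse().read())
        conn.close()
        return data.splitlines()  # assumed: one summary record per line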

In Harvest (http://harvest.cs.colorado.edu/) you can run a Gatherer at
the archive site, and it builds all of this information and exports it
very efficiently. It provides MD5s (which the Broker uses for duplicate
elimination) and timestamps (which you can use for incremental
updates). It is also much more efficient than gathering each object
remotely via HTTP, etc. As an example, when we gathered the data needed
to index the AT&T 1-800 Telephone Directory, it took about 10 hours to
pull the data across the Internet. In contrast, you can pull all of
that data from our server as a single compressed, structured stream in
just a few minutes. On average, running an archive-site Gatherer causes
about 6,660x less load on the archive's CPU and 50x less network
traffic than doing remote gathering a la the robots - and this doesn't
include the savings from incremental updates (retrievals of the form
"give me all the content summaries for objects changed/created since
date XYZ").
- Mike