Date: Wed, 13 Jan 1993 21:07 PDT
From: "Tony Johnson (415) 926 2278" <TONYJ@scs.slac.stanford.edu>
Subject: Re: web roaming robot (was: strategy for HTML spec?)
To: Guido.van.Rossum@cwi.nl
Cc: www-talk@nxoc01.cern.ch
Message-id: <0F3E6794C0824DD6@SCS.SLAC.STANFORD.EDU>
X-Envelope-To: www-talk@nxoc01.CERN.CH
X-Vms-To: IN%"Guido.van.Rossum@cwi.nl"
X-Vms-Cc: TONYJ, in%"www-talk@nxoc01.CERN.CH"
>I have written a robot that does this, except it doesn't check for
>valid SGML -- it just tries to map out the entire web. I believe I
>found roughly 50 or 60 different sites (this was maybe 2 months ago --
>I'm sorry, I didn't save the output). It took the robot about half a
>day (a saturday morning) to complete.
If you do run your robot again I would be very interested if you could
generate a simple list of document titles and their corresponding
document IDs (or URLs). We have a powerful SPIRES database here,
interfaced to the web, into which we could easily import such a file to
create a VERONICA-like index of the web. I think that would be pretty
useful (unless someone is already doing it??).
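The title/URL list such a robot would emit is easy to picture. A minimal sketch in modern Python (all names here are hypothetical, not the actual robot's code): for each fetched page, pull out the `<title>` and the outgoing `<a href>` links, record title -> URL in an index, and queue the links for the next round.

```python
import html.parser

class TitleLinkParser(html.parser.HTMLParser):
    """Collect the document <title> and all <a href> targets from one page."""
    def __init__(self):
        super().__init__()
        self.title = None
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Accumulate text only while inside the <title> element.
        if self._in_title:
            self.title = (self.title or "") + data

def index_page(url, html_text, index):
    """Record this page's title -> URL in the index; return links to crawl next."""
    parser = TitleLinkParser()
    parser.feed(html_text)
    if parser.title:
        index[parser.title.strip()] = url
    return parser.links

# Example: index one page and get its outgoing links.
index = {}
page = "<title>Example Page</title><a href='next.html'>next</a>"
links = index_page("http://info.cern.ch/example.html", page, index)
```

The resulting `index` dictionary is exactly the "titles and document IDs" file described above, ready to load into a database.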
One other problem to add to your list: many documents are probably
only accessible by giving a "keyword". Unless you can write a robot
which can successfully guess all possible keywords, you cannot
guarantee to be able to traverse the whole web.
Tony