Date: Wed, 13 Jan 1993 21:07 PDT
From: "Tony Johnson (415) 926 2278" <TONYJ@scs.slac.stanford.edu>
Subject: Re: web roaming robot (was: strategy for HTML spec?)
To: Guido.van.Rossum@cwi.nl
Cc: www-talk@nxoc01.cern.ch
Message-id: <0F3E6794C0824DD6@SCS.SLAC.STANFORD.EDU>
X-Envelope-To: www-talk@nxoc01.CERN.CH
X-Vms-To: IN%"Guido.van.Rossum@cwi.nl"
X-Vms-Cc: TONYJ, in%"www-talk@nxoc01.CERN.CH"
>I have written a robot that does this, except it doesn't check for
>valid SGML -- it just tries to map out the entire web. I believe I
>found roughly 50 or 60 different sites (this was maybe 2 months ago --
>I'm sorry, I didn't save the output). It took the robot about half a
>day (a saturday morning) to complete.
If you do run your robot again I would be very interested if you could
generate a simple list of document titles and their corresponding
document IDs (or URLs). We have a powerful SPIRES database here,
interfaced to the web, into which we could easily import such a file to
create a VERONICA-like index of the web. I think that would be pretty
useful (unless someone is already doing it??).
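The title/URL list such a robot would emit is easy to picture. A minimal sketch in modern Python (all names here are hypothetical, not the actual robot's code): for each fetched page, pull out the `<title>` and the outgoing `<a href>` links, record title -> URL in an index, and queue the links for the next round.

```python
import html.parser

class TitleLinkParser(html.parser.HTMLParser):
    """Collect the document <title> and all <a href> targets from one page."""
    def __init__(self):
        super().__init__()
        self.title = None
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Accumulate text only while inside the <title> element.
        if self._in_title:
            self.title = (self.title or "") + data

def index_page(url, html_text, index):
    """Record this page's title -> URL in the index; return links to crawl next."""
    parser = TitleLinkParser()
    parser.feed(html_text)
    if parser.title:
        index[parser.title.strip()] = url
    return parser.links

# Example: index one page and get its outgoing links.
index = {}
page = "<title>Example Page</title><a href='next.html'>next</a>"
links = index_page("http://info.cern.ch/example.html", page, index)
```

The resulting `index` dictionary is exactly the "titles and document IDs" file described above, ready to load into a database.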
One other problem to add to your list: many documents are probably
only accessible by giving a "keyword". Unless you can write a robot
which can successfully guess all possible keywords, you cannot
guarantee to be able to traverse the whole web.
Tony