Re: WWW Information Discovery Tools

"Tony Johnson (415) 926 2278" <TONYJ@scs.slac.stanford.edu>
Date: Thu, 8 Apr 1993 13:44 PDT
From: "Tony Johnson (415) 926 2278" <TONYJ@scs.slac.stanford.edu>
Subject: Re: WWW Information Discovery Tools
To: wmperry@guava.ucs.indiana.edu
Cc: www-talk@nxoc01.cern.ch
Message-id: <9C69093772A094D1@SCS.SLAC.STANFORD.EDU>
X-Envelope-To: www-talk@nxoc01.CERN.CH
X-Vms-To: IN%"wmperry@guava.ucs.indiana.edu"
X-Vms-Cc: TONYJ, in%"www-talk@nxoc01.CERN.CH"
William M. Perry (wmperry@indiana.edu) writes:

>  Well, right now it would be pretty trivial to modify my emacs browser to
>follow _every_ link it finds and record it.  Only problem would be in
>keeping it from getting in an infinite loop, but that wouldn't be too hard.
>Problem would be disk space & CPU time.

Unfortunately I don't think infinite loops are the only problem to be solved.
For example we have databases of Physics Publications accessible via the web,
and cross-referenced for citations. These databases contain ~300,000 entries. A
robot, even if it is smart enough not to get into a loop, could spend many days
roaming this one database trying to find all the entries. One way around that
would be to have a list of places where the robot should not look, but compiling
this list would itself be a time-consuming task.
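
To make the point concrete, here is a rough sketch of a robot that avoids
loops, honours a "do not look here" list, and also stops after a fixed number
of pages. Everything in it is an assumption for illustration: the function
names, the made-up example.org URLs, and the toy link table that stands in
for actually fetching documents. It is not how any existing browser or robot
works.

```python
from collections import deque


def crawl(start, get_links, excluded_prefixes=(), max_pages=1000):
    """Breadth-first walk that records each URL at most once.

    get_links is a hypothetical callable mapping a URL to the list of
    URLs it links to; a real robot would fetch and parse the document.
    """
    seen = {start}            # loop protection: never queue a URL twice
    queue = deque([start])
    visited = []
    while queue and len(visited) < max_pages:   # budget caps total work
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            if link in seen:
                continue
            if any(link.startswith(p) for p in excluded_prefixes):
                continue      # a place the robot was told not to look
            seen.add(link)
            queue.append(link)
    return visited


# Toy link table standing in for real documents (all URLs are made up).
TOY_WEB = {
    "http://example.org/home": ["http://example.org/a",
                                "http://example.org/db/1"],
    "http://example.org/a": ["http://example.org/home"],
    "http://example.org/db/1": ["http://example.org/db/2"],
    "http://example.org/db/2": ["http://example.org/db/1"],
}


def toy_links(url):
    return TOY_WEB.get(url, [])


if __name__ == "__main__":
    # Skip everything under the (hypothetical) database prefix.
    for url in crawl("http://example.org/home", toy_links,
                     excluded_prefixes=("http://example.org/db/",)):
        print(url)
```

The exclusion list only helps if someone has already written it down, which
is the time-consuming part; the page budget is a cruder fallback that needs
no prior knowledge of the site.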

Conversely, there are many interesting documents that can only be accessed by
giving a keyword, making it difficult for a robot to discover these documents 
at all.  

>  Once I get the browser stable, I can work on something like this - unless
>someone else wants to work on it in the meantime.  Might be more
>stable/faster if written in C though. :)  But then what isn't?
>
>  What type of format would the output have to be in?  It would be very
>easy to spit out "URL :: TITLE" into a file.

If anyone does solve the problems and generates a "URL :: TITLE" list (a few
other fields, such as last modified date, would be useful too), I would be
happy to try to make the information available through the database we have
interfaced to WWW.
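
As a sketch of what such a list might look like with a last modified field
added, something along these lines would be easy to load into a database. The
three-field layout, the function names and the date format are only
assumptions for illustration, not an agreed format.

```python
SEP = " :: "


def format_record(url, title, last_modified=""):
    """Build one 'URL :: TITLE :: LAST-MODIFIED' line.

    The third field is left empty when the date is unknown, so every
    line has the same number of separators.
    """
    return SEP.join([url, title, last_modified])


def parse_record(line):
    """Split a line back into (url, title, last_modified).

    The URL is taken from the left and the date from the right, so a
    title that itself contains ' :: ' survives the round trip.
    """
    url, rest = line.rstrip("\n").split(SEP, 1)
    title, last_modified = rest.rsplit(SEP, 1)
    return url, title, last_modified


if __name__ == "__main__":
    line = format_record("http://example.org/a", "An Example Page",
                         "1993-04-08")
    print(line)
    print(parse_record(line))
```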

Tony Johnson