WWW indexing and the resource location service

Date: Fri, 28 Jan 1994 11:30:01 -0800
Message-id: <199401281930.LAA29948@osiris.ac.hmc.edu>
To: www-talk@www0.cern.ch
Subject: WWW indexing and the resource location service
From: Jared_Rhine@hmc.edu
X-Attribution: JRhine
Content-Length: 8671
(This message originally bounced to www-talk; the original recipients are
listed below.  My apologies to those who are on multiple lists.)

To: www-talk@www0.cern.ch, interpedia@telerama.lm.com, Peter Deutsch
    <peterd@bunyip.com>, Chris Weider <clw@bunyip.com>, Martijn Koster
    <m.koster@nexor.co.uk>, bajan@bunyip.com, quality@sunsite.unc.edu,
    unite@mailbase.ac.uk

Koster> Are you aware of ALIWEB? It is currently the only system that is
Koster> doing what you want: it retrieves hand-prepared IAFA templates via
Koster> HTTP, and rolls them into a searchable database (which in turn is
Koster> included in the W3 Catalog; one of the best WWW catalogs about).
Koster> Currently about 20 sites have deployed index files, and the number
Koster> is growing.

I was not aware that ALIWEB was performing its magic via IAFA files.  I,
too, am developing systems to interface with the IAFA indexing system; it's
good to hear it is catching on elsewhere.  I'm taking a slightly different
approach from the W3 Catalog's, though.

My system, called ResInfo (for Resource Info), is a site-local resource
maintenance tool; that is, it is designed to track and maintain information
about the information resources available within a given administrative
domain.

ResInfo is a key/value database utilizing a fairly lightweight protocol for
transactions.  Transactions are handled via TCP sockets, allowing any host
(subject to access restrictions) to query the site for information
pertaining to the resources.  Administration of the database (such as
updates) is handled via the same protocol.
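
To give a flavor of the protocol, here is a rough sketch of what a query
client might look like in Perl.  The command verb, port number, and
end-of-reply marker are all invented for illustration; the real protocol
details are still in flux.

  #!/usr/bin/perl
  # Hypothetical ResInfo query client.  "QUERY", port 4321, and the
  # lone-dot terminator are assumptions, not the actual protocol.
  use IO::Socket::INET;

  my $sock = IO::Socket::INET->new(
      PeerAddr => 'resinfo.hmc.edu',   # assumed server host
      PeerPort => 4321,                # assumed port
      Proto    => 'tcp',
  ) or die "connect: $!";

  # Ask for a few fields of one resource's entry.
  print $sock "QUERY interpedia/common-sense title author url\n";
  while (my $line = <$sock>) {
      last if $line =~ /^\.\s*$/;      # assumed end-of-reply marker
      print $line;                     # e.g. "title=Common Sense"
  }
  close $sock;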

I prefer this kind of system because it seems that manual maintenance of the
IAFA files would be rather tedious and prone to error; all around, not fun.
The IAFA files are fairly flexible and useful, but their implementation does
not lend itself to easy interfacing with other indexing systems and
information protocols.

My plan is instead to have IAFA files generated __automatically__ by
periodically querying the ResInfo database and building the template files
from the information it returns.  This means that changes to the database
are automatically reflected elsewhere.  Other formats besides IAFA files
could be extracted from the database as well, such as static gopher and web
indexes.
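
As a sketch of what the generator might look like (the dump format on
stdin and the IAFA field mapping are both assumptions on my part):

  #!/usr/bin/perl
  # Hypothetical IAFA-template generator.  Assumes a ResInfo dump has
  # already been pulled down as "key<TAB>field<TAB>value" lines on
  # stdin; the field names below are illustrative only.
  my %rec;
  while (<STDIN>) {
      chomp;
      my ($key, $field, $value) = split /\t/, $_, 3;
      $rec{$key}{$field} = $value;
  }
  for my $key (sort keys %rec) {
      print "Template-Type: DOCUMENT\n";
      print "Title: $rec{$key}{title}\n"          if $rec{$key}{title};
      print "Author-Name: $rec{$key}{author}\n"   if $rec{$key}{author};
      print "Description: $rec{$key}{abstract}\n" if $rec{$key}{abstract};
      print "URI: $rec{$key}{url}\n"              if $rec{$key}{url};
      print "\n";    # blank line separates templates
  }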

The simplicity of the protocol allows for a variety of clients to be written
to access the information in the database.  For instance, some way of
searching the database is required.  While w3-catalog is an excellent tool
and a boon to the Internet community, its searching capabilities are rather
limited.  To search the ResInfo database, I've developed an interactive
document which uses HTML+ forms.  Through a gateway to ResInfo, the
interactive document presents a form whose fields specify search parameters
such as keywords, author, abstract, dates, languages, and so forth.  The
gateway formulates those into a ResInfo query, asks ResInfo for a search,
gets the results, and formats them into a pleasing HTML document, complete
with hypertext links.  The format and level of detail of the returned
document are specified by the user.
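
A stripped-down sketch of such a gateway follows.  The form field names,
the SEARCH command, and the tab-separated reply lines are all invented for
illustration:

  #!/usr/bin/perl
  # Hypothetical forms-to-ResInfo search gateway (CGI style).
  use IO::Socket::INET;

  # Crude parse of a query string like "keywords=paine&author=rhine".
  my %form = map { my ($k, $v) = split /=/, $_, 2;
                   $v = '' unless defined $v;
                   $v =~ tr/+/ /;
                   ($k, $v) }
             split /&/, ($ENV{QUERY_STRING} || '');

  my $kw = $form{keywords} || '';
  my $au = $form{author}   || '';

  my $sock = IO::Socket::INET->new(PeerAddr => 'localhost',
                                   PeerPort => 4321, Proto => 'tcp')
      or die "connect: $!";
  print $sock "SEARCH keywords=$kw author=$au\n";

  print "Content-type: text/html\n\n<HTML><BODY><UL>\n";
  while (my $line = <$sock>) {
      last if $line =~ /^\.\s*$/;          # assumed end-of-reply marker
      my ($title, $url) = ($line =~ /title=(.*?)\turl=(\S+)/);
      print qq(<LI><A HREF="$url">$title</A>\n) if $url;
  }
  print "</UL></BODY></HTML>\n";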

ResInfo has been designed with the Interpedia project in mind.  Interpedia
articles are stored on the local server in whatever form you wish, perhaps
as plain text, HTML, or PostScript.  Each article has a ResInfo entry and a
variety of associated information, such as SOAPs (seals of approval),
authors, abstracts, languages, and so forth.  The search procedure described
above is also used with Interpedia documents: when you perform a search, you
get back a list of matching documents, displayed with varying levels of
detail (toggling the display of the abstract, for example), with a link
present to take you to the actual document.
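
Purely as an illustration, here is the shape one article's entry might
take; the field names and sample values are my own invention:

  # Hypothetical key/value pairs for one Interpedia article's entry.
  my %entry = (
      title    => 'Common Sense',
      author   => 'Thomas Paine',
      format   => 'html',
      language => 'en',
      soaps    => 'interpedia-review-board',   # seals of approval
      abstract => 'A 1776 pamphlet arguing for American independence.',
      url      => 'http://www.hmc.edu/interpedia/common-sense.html',
  );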

Since ResInfo is a fairly simple protocol, it is easy to have simple
clients interact with it.  For example, if you are familiar with Deutsch and
Weider's Internet draft, "A Vision of an Integrated Internet Information
Service", you may view ResInfo as a part of the Resource Location Service
(RLS).  Resource transponders, entities which travel with a resource and
keep track of information about it, could easily contact ResInfo and
announce, "Here I am; here's some information about my history", and all
the systems which query ResInfo about resources would suddenly know about
the new resource.
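
A transponder's announcement could be as simple as the following sketch;
again, the ANNOUNCE verb and its arguments are hypothetical:

  #!/usr/bin/perl
  # Hypothetical resource-transponder announcement to ResInfo.
  use IO::Socket::INET;

  my $sock = IO::Socket::INET->new(PeerAddr => 'resinfo.hmc.edu',
                                   PeerPort => 4321, Proto => 'tcp')
      or die "connect: $!";
  # "Here I am; here's some information about my history."
  print $sock "ANNOUNCE key=interpedia/common-sense ",
              "url=http://www.hmc.edu/interpedia/common-sense.html ",
              "moved-from=ftp://old.example.edu/pub/common-sense.txt\n";
  my $reply = <$sock>;                 # assumed one-line acknowledgement
  close $sock;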

An important part of ResInfo is that it was designed primarily to be
accessed by other automated systems.  I've already mentioned a variety of
these: resource transponders, other parts of the Resource Location Service,
Interpedia gateways, IAFA file generation programs, and other interactive
documents.  Since ResInfo keeps track of the current location(s) of the
resources (probably on the local LAN, but not necessarily), gateways which
wish to describe an instance of a resource can ask ResInfo where it is
currently located and hopefully be fairly assured of getting a correct
answer.  HTML gateways would translate the location information into a link
on the fly, giving a w3-catalog kind of feel.
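
The on-the-fly translation might look something like this sketch, where
the LOCATE verb and the "url<TAB>title" reply format are assumptions:

  #!/usr/bin/perl
  # Hypothetical on-the-fly link generation from a ResInfo answer.
  use IO::Socket::INET;

  sub resolve_link {
      my ($key) = @_;
      my $sock = IO::Socket::INET->new(PeerAddr => 'localhost',
                                       PeerPort => 4321, Proto => 'tcp')
          or die "connect: $!";
      print $sock "LOCATE $key\n";
      chomp(my $reply = <$sock>);      # e.g. "http://...<TAB>Common Sense"
      close $sock;
      my ($url, $title) = split /\t/, $reply, 2;
      return qq(<A HREF="$url">$title</A>);
  }

  print resolve_link('interpedia/common-sense'), "\n";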

Since administration commands are built into the protocol, a variety of
gateways could be built for administrative purposes.  My primary tool for
administration of the database is an interactive HTML document which uses
HTTP authentication.  This is difficult to do right, but I think the system
is generally more useful if you can manage all aspects of it from within a
single client: your Web browser.  It is pretty slick and makes maintenance
very simple.

I see the primary advantage of ResInfo as being the consolidation of
infosystem administration into an easy-to-use, automated package.  It is not
infosystem-specific, and it allows much of the work of updating the database
to be done automatically.  (How about having ftp mirroring software contact
ResInfo to tell it that a new version of a document just got mirrored?  A
sketch of such a hook follows below.)  ResInfo can also expand to fulfill
other parts of the Resource Location Service, and it can return the URN of a
document instead of a local URL, if desired.  It integrates well with
current systems such as the IAFA indexing project.  Gateways are trivial to
write, since there is a (loosely) defined API (at least for Perl).
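
The mirroring hook mentioned above might be as small as this sketch (the
UPDATE verb is, as before, hypothetical):

  #!/usr/bin/perl
  # Hypothetical post-mirror hook: tell ResInfo a fresh copy arrived.
  use IO::Socket::INET;

  my ($key, $path) = @ARGV;    # e.g. "rfc/rfc1436" "/archive/rfc1436.txt"
  my $sock = IO::Socket::INET->new(PeerAddr => 'localhost',
                                   PeerPort => 4321, Proto => 'tcp')
      or die "connect: $!";
  print $sock "UPDATE $key local-path=$path mirrored=", time, "\n";
  close $sock;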

I currently do not have plans to turn ResInfo into a full-blown part of an
RLS.  It is still highly experimental, and I'm playing around to see what
works and what doesn't.  It is not coded for efficiency or speed, but rather
for flexibility and rapid prototyping.  The current implementation is in
Perl, which means that searches are generally Perl regexps.  The database is
implemented as a shared dbm file.  It could be reimplemented with faster
algorithms and coding techniques, but for now, Perl gives acceptable
performance.
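
The search core is about as simple as it sounds; a sketch, assuming the
dbm values are flat strings of each record's fields:

  #!/usr/bin/perl
  # Sketch of the regexp search over the shared dbm file.  The database
  # path and the flat-string value layout are assumptions.
  my %db;
  dbmopen(%db, '/usr/local/resinfo/db', undef)   # undef = don't create
      or die "dbmopen: $!";
  my $pattern = shift @ARGV;           # the search is a Perl regexp
  for my $key (keys %db) {
      print "$key\n" if $db{$key} =~ /$pattern/i;
  }
  dbmclose(%db);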

I'm not worrying too much about RLS-specific features, such as
inter-ResInfo communication.  The standards of the RLS will certainly not be
decided by me; if standard protocols are agreed upon for those transactions,
it should be fairly easy to write a gateway between them and the ResInfo
protocol.  Although that looks like a very interesting area of research, it
will require external coordination, so I'll put it off for now.

I'm not sure what the current state of the art is for this kind of thing.
I haven't looked at Whois++ since I started writing these site-oriented
databases.  (I have another one, called UserInfo, which keeps track of
account information such as centralized mail-forwarding info, site-wide
UIDs, directory information, and so forth; lots of good gateways and
interfaces have been written for that.)  At this early stage in the
development of this field, it seems like a good idea to get multiple systems
using different paradigms deployed, in order to gain some understanding of
the relative strengths and weaknesses of the various approaches.

I'd be interested in hearing about other such resource management tools.
The ResInfo database is currently pretty empty, but as the parts fall into
place and the fundamental format of the database is nailed down, it is
getting easier and easier to add data.  The next step is writing programs
that automatically scan the local resources and update the database.

--
Jared Rhine         Jared_Rhine@hmc.edu
wibstr              Harvey Mudd College
                    http://www.hmc.edu/www/people/jared.html

"Society in every state is a blessing, but Government, even in its
best state, is but a necessary evil; in its worst state, an intolerable one."
                                              -- Thomas Paine