Opening the WAIS document-id syntax

Jonny Goldman <jonathan@think.com>
Date: Thu, 26 Mar 92 09:47:34 PST
Message-id: <9203261747.AA00262@philo.quake.think.com.>
From: Jonny Goldman <jonathan@think.com>
Sender: jonathan@quake.think.com
To: timbl@nxoc01.cern.ch
Cc: www-talk@nxoc01.cern.ch, wais-talk@think.com
In-reply-to: Tim Berners-Lee's message of Thu, 26 Mar 92 15:25:12 GMT+0100 <9203261425.AA23337@nxoc01.cern.ch>
Subject: Opening the WAIS document-id syntax

First, I'd like to point out that WAIS-FTP doesn't mean a client or server
understands the FTP protocol.  It's simply a customized server that functions
like FTP (but is read-only).  It's mainly an experiment in modifying
servers and providing services under WAIS.

   Date: Thu, 26 Mar 92 15:25:12 GMT+0100
   From: timbl@nxoc01.cern.ch (Tim Berners-Lee)

   [...]

   The data model of WAIS (documents in databases) could be deconstrained
   to allow documents themselves to be or contain lists of documents, and
   for lists of documents to point to things other than documents in the
   same database.

I take it you're suggesting a new TYPE for a document: Derived types?  In a
sense the catalog is one of these.

   This is the way the second part can work.  Normally, a search returns a
   list of doc-ids, each one (basically) like

	   /usr/local/lib/wais/mydatabase/fred/myfile.txt

   which is in fact a filename.

Let me also point out that this is just the method used in the sample
server.  The CM server does not return DocID's that are derived from
filenames.

In fact, DocID's are "any"s, and that means they can have anything in them,
so long as the server understands how to return a specified amount of data
to a client when presented with a DocID and a range.
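
As a rough illustration of that contract (a sketch, not code from the WAIS
sources), the server below treats the DocID as uninterpreted bytes and only
promises to return a byte range for it. The layout of `any` (a size plus a
byte pointer), the function name fetch_range, and the one-document store are
all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed shape of an opaque "any": a byte count plus uninterpreted bytes. */
typedef struct any {
    unsigned long size;
    char *bytes;
} any;

/* Hypothetical one-document store standing in for a real server index. */
static const char doc_key[] = "doc-17";
static const char doc_text[] = "WAIS servers return ranges of bytes.";

/* Copy bytes [start, end) of the document named by id into out.
   Returns the number of bytes copied, or -1 if the id is unknown. */
long fetch_range(const any *id, unsigned long start, unsigned long end,
                 char *out, unsigned long out_cap)
{
    unsigned long key_len = sizeof doc_key - 1;
    unsigned long doc_len = sizeof doc_text - 1;
    unsigned long n;

    assert(id != NULL && out != NULL);

    /* Only the server decides what the id bytes mean; here, an exact match. */
    if (id->size != key_len || memcmp(id->bytes, doc_key, key_len) != 0)
        return -1;

    /* Clamp the requested range to the document and to the output buffer. */
    if (end > doc_len)
        end = doc_len;
    if (start >= end)
        return 0;
    n = end - start;
    if (n > out_cap)
        n = out_cap;
    memcpy(out, doc_text + start, n);
    return (long)n;
}
```

The client never inspects the id; it just hands it back with a range.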

   There's a load of other stuff in there which we can ignore for now.
   What a WAIS search needs to be able to do, when you are pointing to
   files, is to return a pointer to a file in FTP say. We do that in two
   steps.

I don't agree.  I think the server should do the retrieval.  The client
should not have to know anything about the REAL location of the document.
More on that below.

   First, we recognise that that id is local to the context of a wais server
   on host myhost and port myport. When the server returns that string, the
   client uses knowledge of the context in which it was quoted to expand
   that to

	   wais://myhost.dom.net:myport/usr/local/lib/wais/mydatabase/fred/myfile.txt

   This is a reference you can quote to anyone as it makes sense anywhere.
   No context.  I called it a UDI but we'll have to change the name.
   Document Access Token maybe.  It's like Brewster's proposal but
   extendable to other protocols.  [Yes, WAIS is a good protocol but there
   are others. Including name servers and directories which will be needed
   for long-lived but movable documents.]

This is a good idea, but I feel rather strongly that we should be very
careful in overloading the protocol.  Specifying a syntax for DocID's is
one way of overloading the protocol.  Standardizing types is another.
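
For concreteness, the client-side expansion described in the quoted text
could look something like the sketch below. The function name expand_doc_id
is hypothetical, as is the rule that an id already containing "://" is
passed through unchanged; neither comes from any WAIS or WWW source.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Expand a server-local id into a context-free reference of the form
   wais://host:port/path.  Returns 0 on success, -1 if out is too small. */
int expand_doc_id(const char *host, unsigned port, const char *local_id,
                  char *out, size_t cap)
{
    int n;

    assert(host != NULL && local_id != NULL && out != NULL);

    if (strstr(local_id, "://") != NULL)
        /* Assumption: an id that already names a scheme is left untouched. */
        n = snprintf(out, cap, "%s", local_id);
    else
        n = snprintf(out, cap, "wais://%s:%u%s", host, port, local_id);

    return (n < 0 || (size_t)n >= cap) ? -1 : 0;
}
```

Called with host "myhost.dom.net", an example port such as 210, and the
filename-style id above, this produces a string of the same shape as the
quoted wais:// reference.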

   Now suppose one day a server returns a doc-id INCLUDING the protocol,
   host, etc.  For example, your WAIS FTP engine (like the ARCHIE WAIS)
   returns what are basically pointers to files. Just now, because of the
   constraints of the model, it has to return a part of a file within the
   database. Suppose we change that, so that in your case it just returns a
   doc-id which specifies anonymous ftp access, like:

WAIS-FTP doesn't return pointers to remote files.  It returns local DocIDs
for use in retrieving a file local to the server.  Archie WAIS (and
ftpable-readmes) returns these pointers.  That's a different story.

Now for a small discussion of WAIS DocID's. So far WAIS DocID's have only a
few fields:

typedef struct DocID{
   any* originalServer;
   any* originalDatabase;
   any* originalLocalID;
   any* distributorServer;
   any* distributorDatabase;
   any* distributorLocalID;
   long copyrightDisposition;
} DocID;

The part you refer to is just the LocalID part.  If you look at some of the
DocID's returned by the serial server, you'll see the other fields are
filled in (though the Server fields don't contain much useful information -
it's that part we were trying to standardize with the doc-id proposal).

	   file://otherhost.com/pub/doc/mydoc.txt

   The client has a general retrieval engine which can accept doc-ids in
   many domains -- not just WAIS. That allows it to go out over a different
   protocol to retrieve the object.

There are two ways to handle this, of course.  Either the client or the
server could do the retrieval.  I believe the server should handle the
protocol part (if the document is stored on some FTP server somewhere, the
WAIS server can just fetch the file, and return it to the client).  This
reduces client complexity.  I have no objection to specifying the
protocol/server in the DocID (perhaps with another field), but we must
standardize the meanings.
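
A minimal sketch of the server-side option, assuming the protocol is
available to the server as a short string: the server looks up a fetcher for
that protocol and returns the bytes, so the client only ever speaks WAIS.
The names server_retrieve, fetch_local and fetch_ftp are hypothetical, and
the fetchers are stubs that return canned text rather than touching a disk
or a network.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* A fetcher copies the document into out and returns its length, or -1. */
typedef long (*fetch_fn)(const char *host, const char *path,
                         char *out, size_t cap);

/* Helper for the stubs: copy a NUL-terminated string if it fits. */
static long copy_text(const char *text, char *out, size_t cap)
{
    size_t n = strlen(text);
    if (n + 1 > cap)
        return -1;
    memcpy(out, text, n + 1);
    return (long)n;
}

/* Stub: a real server would read the file from its own database here. */
static long fetch_local(const char *host, const char *path,
                        char *out, size_t cap)
{
    (void)host; (void)path;
    return copy_text("local document body", out, cap);
}

/* Stub: a real server would run an FTP session to host here. */
static long fetch_ftp(const char *host, const char *path,
                      char *out, size_t cap)
{
    (void)host; (void)path;
    return copy_text("document fetched over ftp", out, cap);
}

/* Table of protocols the server knows how to resolve on the client's behalf. */
static const struct { const char *protocol; fetch_fn fetch; } fetchers[] = {
    { "wais", fetch_local },
    { "ftp",  fetch_ftp   },
};

/* Dispatch on the protocol name; unknown protocols yield -1. */
long server_retrieve(const char *protocol, const char *host, const char *path,
                     char *out, size_t cap)
{
    size_t i;

    assert(protocol != NULL && out != NULL);
    for (i = 0; i < sizeof fetchers / sizeof fetchers[0]; i++)
        if (strcmp(protocol, fetchers[i].protocol) == 0)
            return fetchers[i].fetch(host, path, out, cap);
    return -1;
}
```

Adding a protocol then means adding a table entry on the server, with no
change to any client.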

   This is the way WWW and Gopher work.  They are open systems -- you can
   link into any other system within reason.  That's why the fuss about
   universal document identifiers.  Maybe the WAIS people would like to
   incorporate them -- that is, just make sure that the normal WAIS server
   returns things which are -- like the one above -- special cases of the
   more general syntax.

   I haven't had much comment from the WAIS side about the UDIs, but I'd
   like to have some. (file://info.cern.ch/pub/www/doc/udi1.ps was
   background for the IETF discussions.) We plan a small working group
   hacking out the details before an RFC is submitted.

Come up with an RFC, and we'll try to abide by it.  I'd like to caution you
against overloaded strings.  We've got enough of them already.

For a start, I'd suggest we use the originalServer as the identifier for
the HOST, and the originalDatabase can inform us of the protocol.
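
One way to picture that suggestion, as a sketch only: fill the existing
DocID fields so that originalServer carries the host and originalDatabase
carries a protocol name. The `any` layout and the helpers any_from_cstr,
docid_set_location and docid_protocol_is are assumptions made for
illustration; the agreed meanings of the fields would still have to be
standardized.

```c
#include <assert.h>
#include <string.h>

/* Assumed shape of an opaque "any": a byte count plus uninterpreted bytes. */
typedef struct any {
    unsigned long size;
    char *bytes;
} any;

typedef struct DocID {
    any *originalServer;
    any *originalDatabase;
    any *originalLocalID;
    any *distributorServer;
    any *distributorDatabase;
    any *distributorLocalID;
    long copyrightDisposition;
} DocID;

/* Wrap a NUL-terminated string as an any without copying it. */
any any_from_cstr(char *s)
{
    any a;
    a.size = (unsigned long)strlen(s);
    a.bytes = s;
    return a;
}

/* Record host, protocol and local id; all other fields are cleared. */
void docid_set_location(DocID *id, any *host, any *protocol, any *local_id)
{
    DocID zero = {0};

    assert(id != NULL);
    *id = zero;
    id->originalServer = host;        /* identifier for the HOST */
    id->originalDatabase = protocol;  /* hypothetical: protocol name */
    id->originalLocalID = local_id;
}

/* Return 1 if the DocID's protocol field equals name, else 0. */
int docid_protocol_is(const DocID *id, const char *name)
{
    const any *p;

    assert(id != NULL && name != NULL);
    p = id->originalDatabase;
    return p != NULL && p->size == strlen(name)
        && memcmp(p->bytes, name, p->size) == 0;
}
```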

- Jonny G