URL's and SGML documents

gtn@ebt.com (Gavin Nicol)
Errors-To: listmaster@www0.cern.ch
Date: Sun, 29 May 1994 22:26:24 +0200
Message-id: <9405292020.AA14281@ebt-inc.ebt.com>
Reply-To: gtn@ebt.com
Originator: www-talk@info.cern.ch
Sender: www-talk@www0.cern.ch
Precedence: bulk
From: gtn@ebt.com (Gavin Nicol)
To: Multiple recipients of list <www-talk@www0.cern.ch>
Subject: URL's and SGML documents
X-Listprocessor-Version: 6.0c -- ListProcessor by Anastasios Kotsikonas

Here is the document I mentioned earlier. I would appreciate any feedback
anyone might have on this, but you should all realise that I regard it
as something of a kluge, even if a workable one.

There must be something better than URL's...

<TITLE>URL's and structured documents</TITLE>

Recently, the World Wide Web project has gained a great deal
of momentum. The World Wide Web proposes to tie together all the
information sources available on the Internet, and has achieved
considerable success by providing a single hypertext interface to
services such as Usenet, FTP, and mail.
One of the crucial elements of the World Wide Web is the URL, or
Uniform Resource Locator. Currently, URL's appear much like a
Unix filename, with extensions for deciding the type of service to be
used, the port number for the server, and other such parameters. While
URL's are in wide use, they do suffer from a number of problems,
including object uniqueness and equality problems (CORBA is
currently facing similar issues). An IETF group is working to overcome
these problems, but one problem remains: all of an object is
retrieved (except in searches), and there is no way to take
advantage of the inherent structure in a correct SGML document.

<TITLE>The URL syntax</TITLE>
The generic URL has the following structure:

    scheme://path?search

where scheme names the service to use (ftp, wais, http etc.), path
specifies the location of the document, and the optional search
parameter specifies a list of keywords to search for. In theory, this
provides a single, simple naming scheme, but in practice, almost all
of the different servers use a slightly different syntax.
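As an illustration of this three-way split, here is a minimal Python
sketch (using the modern standard library purely for demonstration; it
is not part of the proposal). The host and path are rejoined, since
the generic form above treats them loosely as one "path":

```python
from urllib.parse import urlsplit

def split_generic_url(url):
    """Return (scheme, path, search) for a generic URL."""
    parts = urlsplit(url)
    # urlsplit separates the host into netloc; rejoin it with the
    # path to match the loose "path" of the generic URL form.
    path = parts.netloc + parts.path
    return parts.scheme, path, parts.query

print(split_generic_url("http://ebt.com/collection/book?foo+bar"))
```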

<TITLE>The URL Path Extensions</TITLE>
This document specifies extensions to the generic URL which can be
used in conjunction with SGML document servers to provide a much finer
level of control over what is to be retrieved. The extensions have, as
far as possible, been designed to be compatible with the concepts in both
the URL and HTTP RFC's. 

The key concept is that an SGML document can be represented as a
tree of nodes, in much the same way that files and directories
correspond to nodes in the tree of the file system. As such, we can
map elements into an extended Unix path to create something like the
following:

    http://ebt.com/collection/book/chapter=2/section=1

where the extension syntax consists of an element GI and a specifier
saying which one of possibly multiple elements to choose. The equals sign
here is arbitrary. Any character that cannot be used in an element GI
in the Reference Concrete Syntax may be used (which agrees with the
TEI). In addition to element GI's, the following keywords should be
recognised:

<TI>toc</TI><TT>Table of contents. If a specifier follows
                it is the name of the TOC to use.</TT>
<TI>max-bytes</TI><TT>Specify the maximum number of bytes to transfer. If
                the number of bytes exceeds this, generate a TOC as a
                guide to a more specific search. If the requested
                element is a graphic, scaling might be used, or a
                small icon attached to a hyperlink with a higher
                max-bytes value could be sent. </TT>
<TI>username</TI><TT>Specify the name of the user.</TT>
<TI>passwd</TI><TT>Specify the password to be used. The password is
                not encoded.</TT>
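The max-bytes behaviour could be sketched as follows; make_toc stands
in for whatever TOC generation the server provides, and is a
hypothetical helper rather than part of this proposal:

```python
def respond(content, max_bytes, make_toc):
    """Return the content if it fits within max_bytes; otherwise
    fall back to a table of contents so the client can issue a more
    specific request. make_toc is a hypothetical server helper."""
    if len(content) <= max_bytes:
        return content
    return make_toc()

# Hypothetical usage: a 1 KB limit forces a TOC for a large element.
big = b"x" * 2048
print(respond(big, 1024, lambda: b"<toc>...</toc>"))
```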

The grammar for the path extensions can be specified as:

    extended_path_member ::= member_name optional_specifier
    member_name          ::= SGML_GI | '!' keyword
    keyword              ::= 'toc' | 'max-bytes' | 'username' | 'passwd'
    optional_specifier   ::= empty | '=' specifier_list
    specifier_list       ::= specifier | specifier ',' specifier_list
    specifier            ::= string | number
    number               ::= [0-9]* | [0-9]* '.' [0-9]*
    string               ::= '"' character_constant '"'
where the character constant rules follow the rules for ANSI C.
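A minimal recogniser for one extended path member might look like the
following Python sketch. The GI pattern is an illustrative assumption,
and ANSI C escape sequences inside strings are not handled:

```python
import re

# Keywords from the grammar above; '!' marks a keyword member.
KEYWORDS = {"toc", "max-bytes", "username", "passwd"}
# A specifier is a quoted string or a number.
SPECIFIER = r'(?:"[^"]*"|[0-9]+(?:\.[0-9]+)?)'
MEMBER = re.compile(
    r'^(?P<name>![a-z-]+|[A-Za-z][A-Za-z0-9.-]*)'
    r'(?:=(?P<spec>' + SPECIFIER + r'(?:,' + SPECIFIER + r')*))?$'
)

def parse_member(text):
    """Parse one extended path member into (name, [specifiers])."""
    m = MEMBER.match(text)
    if not m:
        raise ValueError("not a valid extended path member: " + text)
    name = m.group("name")
    if name.startswith("!") and name[1:] not in KEYWORDS:
        raise ValueError("unknown keyword: " + name)
    spec = m.group("spec")
    return name, spec.split(",") if spec else []

print(parse_member("section=2"))   # ('section', ['2'])
print(parse_member("!toc"))        # ('!toc', [])
```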

The group of recognised keywords is currently very small; as time
goes by, it will be expanded to include ideas from the TEI and
other groups. In addition, it is expected that the specifier data
format will be expanded over time to allow for selection based upon
attribute values. Finally, the choice of the bang character
as a keyword marker is arbitrary, and may need to be reconsidered. A
sharp sign might be more suitable, but might confuse Mosaic.

<TITLE>Other extensions</TITLE>

The following are a few extensions, or notes on how the extensions
mentioned above will work with the normal HTTP URL's.

<ITEM>In HTTP, anything following the # character is interpreted as an
identifier for a fragment of a document. A set of keywords will be
used here to specify document fragments. Currently, the supported
keywords will be the same as those in the extended URL path space
grammar, with the addition of the following three:

<TI>ebt-search</TI><TT>For using the EBT query language in searches.</TT>
<TI>element</TI><TT>An arbitrary element number</TT>
<TI>node</TI><TT>The internal node identifier</TT>


<ITEM> The normal HTTP search mechanism (the '?' character followed by
a '+' separated list of keywords) will be supported.

<ITEM> In addition to the normal HTTP search mechanisms, extended
searches using the EBT query language will be supported via text
following the fragment ID delimiter. Such searches would appear like
the following:   

    http://collection/book/section#!ebt-search="foo within bar"
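Extracting such a query on the server side could be sketched as below;
extract_query is a hypothetical helper, and the query language itself
is treated as opaque text:

```python
import re

# Match a fragment of the form !ebt-search="..." and capture the query.
FRAGMENT = re.compile(r'^!ebt-search="([^"]*)"$')

def extract_query(fragment):
    """Return the EBT query from a fragment ID, or None."""
    m = FRAGMENT.match(fragment)
    return m.group(1) if m else None

url = 'http://collection/book/section#!ebt-search="foo within bar"'
print(extract_query(url.split("#", 1)[1]))  # foo within bar
```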


<ITEM> There will be some pathological cases where the above
extensions can be used to create an illegal URL. In such cases, the
server should move back up the path until a legal URL is created. For
example,

    <a href="http://ebt.com/collection/book/!toc/!toc/!toc/list=2">

is basically meaningless. This should be reduced to:

   <a href="http://ebt.com/collection/book/!toc">

which is legal, but perhaps not what the user wanted. If a legal URL
cannot be found, an error should be returned.
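The reduction described above can be sketched as follows; is_legal
stands in for whatever legality check the server applies, and the
sample check shown is purely hypothetical:

```python
def reduce_path(path, is_legal):
    """Drop trailing path members until is_legal accepts the result.
    Return None if no prefix of the path is legal (an error case)."""
    members = path.split("/")
    while members:
        candidate = "/".join(members)
        if is_legal(candidate):
            return candidate
        members.pop()
    return None

# Hypothetical check: a path is legal with at most one !toc member.
legal = lambda p: p.count("!toc") <= 1
print(reduce_path("collection/book/!toc/!toc/!toc/list=2", legal))
# collection/book/!toc
```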

By supporting multiple named TOC's, it is possible to easily create
alternate views of the document. For example, it should be possible
to have lists of figures, etc.