SGML for URLs

Dan Connolly <connolly@imagine.convex.com>

Mail folder: WWW Talk 1992 Archives

Message-id: <9207241532.AA26087@imagine.convex.com>
To: cni-arch@uccvma.bitnet
Cc: wais-talk@think.com, www-talk@nxoc01.cern.ch
Subject: SGML for URLs
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="cut-here"
Date: Fri, 24 Jul 92 10:32:14 CDT
From: Dan Connolly <connolly@imagine.convex.com>

--cut-here
Content-Type: multipart/alternative; boundary=alt

--alt

                                   OBJECTIVE
                                       
   The issue of what to call these things we're defining has been discussed at
   length. First it was Universal Document Identifier. The name has changed as
   the objective has been refined. The latest name is Universal Resource
   Locator. The provisional charter is:
   
    To define a printable string syntax to allow
    
      The expression of the address on the network of any accessible object
      using existing information retrieval protocols;
      
      The expression of the name of any object held in a directory system or
      unique naming space on the network;
      
      The distinction to be made easily in the syntax between such protocols
      and directories and name spaces;
      
      New protocols, directories and naming schemes to be included as and when
      they are developed. [1]
      
   Clearly what we are about is defining a language, i.e. a syntax and
   semantics for communicating some information.
   
   The information is the location and/or identity of some information object
   in the global hypertext. It's a citation or a reference or a hypertext link
   anchor.
   
   I propose a specification for the language of URLs, in the context of a
   specification for a language of global hypertext references.
   
   These global hypertext references include more semantics than just
   differentiating between protocols and accessing data. There are also issues
   of determining the type and the identity of the referent data.
      
SGML as a syntactic specification tool

   That's what it's for, after all. What I propose is a DTD that (with the
   default SGML declaration) defines the language of global hypertext
   references.
   
    Some examples of the language:
    

<http host="info.cern.ch" path="hypertext/TheProject.html">
<http host="info.cern.ch" path="hypertext/people.html" anchor="timbl">
<http host="info.cern.ch" path="XFIND" search="SGML">

<prospero host="archie.mgil.ca" path="pub/ftp">

<file host="snoopy" path="~connolly/bin/cgrep.pl" type=appl subtype=x-perl>

<ftp host="export.lcs.mit.edu" dir="contrib"
name="XcRichText-1.2.tar.Z">

<usenet group="comp.infosystems.gopher">
<usenet article="<abc@convex.com>">

<wais host="quake.think.com" database="INFO" search="help">
<wais host="quake.think.com" database="INFO" wtype="TEXT" size=1000
        path="/usr/local/wais/README" >

<telnet host="info.cern.ch">

<gopher host="boombox.umn.edu" port=70>
<gopher host="boombox.umn.edu" selector="foo &#34;bar&#34;" gtype=0 >

   The DTD uses only the most basic features of SGML, and thus the resulting
   language is not very complex. Implementing a parser for this particular
   SGML language is a far simpler task than implementing a general SGML
   parser. At the same time, we get the benefits of a rigorously defined
   language based on established standards.
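
   As a rough illustration of how small such a parser can be, here is a
   minimal Python sketch that reads one start tag of this language into a
   scheme name and an attribute table. The name parse_url_tag and both
   regular expressions are assumptions made for this example, not part of the
   proposal. It handles only the quoting forms used in the examples above,
   folds names to lower case, and does not expand numeric character
   references.

```python
import re

# Hypothetical helper for illustration; covers only the subset of SGML
# start-tag syntax that appears in the examples, not SGML in general.
TAG_RE = re.compile(
    r"""<\s*([A-Za-z][A-Za-z0-9.-]*)((?:[^>"']|"[^"]*"|'[^']*')*)>""")
ATTR_RE = re.compile(
    r"""([A-Za-z][A-Za-z0-9.-]*)\s*=\s*(?:"([^"]*)"|'([^']*)'|([A-Za-z0-9.-]+))""")


def parse_url_tag(text):
    """Split one start tag into (scheme, {attribute: value})."""
    match = TAG_RE.fullmatch(text.strip())
    if match is None:
        raise ValueError("not a start tag: %r" % (text,))
    scheme = match.group(1).lower()
    attrs = {}
    # Each attribute is name="value", name='value', or name=token.
    for name, dq, sq, bare in ATTR_RE.findall(match.group(2)):
        attrs[name.lower()] = dq or sq or bare
    return scheme, attrs


scheme, attrs = parse_url_tag(
    '<http host="info.cern.ch" path="hypertext/people.html" anchor="timbl">')
# scheme == "http"; attrs["anchor"] == "timbl"
```

   Because quoted values are consumed whole, a ">" inside a literal (as in
   the usenet article example) does not end the tag early.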
   
  Note:                  I haven't studied the HyTime standard very carefully.
                         I think it's beyond the scope of the task at hand, but
                         I'd like to have that opinion substantiated by someone
                         who really knows. In particular, its Finite Coordinate
                         Systems could be used to model positions within
                         documents: characters, lines, paragraphs.
                         
  RELEVANT ISSUES
  
  Verbosity              This syntax is somewhat verbose, but I think that
                         implicit markup (punctuation rather than names) will
                         lead to a mass of quoting in many cases. And the
                         consistency between schemes is not necessarily very
                         high.
                         
  Long URLs              Extra whitespace between tokens has no effect. There
                         is still the problem of quoted strings that are longer
                         than a mailer allows. Certainly there's some SGML
                         feature that I'm not aware of that addresses the
                         issue.
                         
                         I don't believe there's a way to restrict the length
                         of an element, though there is a 960 character limit
                         on the length of an attribute value (in the default
                         SGML declaration).
                         
  Quoting                The SGML numeric character reference (e.g. &#128;)
                         allows an attribute value literal to represent any
                         sequence of bytes.
                         
  NAMELEN                The default SGML declaration specifies that names of
                         elements and attributes be 8 characters or less. It's
                         a conceptually simple matter to operate under an SGML
                         declaration where NAMELEN is higher.
                         
  Extensibility          One problem with the current UDI syntax specification
                         is that it seems to allow new schemes to add arbitrary
                         complexity to the grammar. This specification limits
                         the language to an SGML start tag.
                         
                         If we adopt this spec, we need to give it a public
                         text identifier, and maintain a registry of the names
                         used (probably with the IANA).
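
   The Quoting and NAMELEN points above can be sketched in a few lines of
   Python. This is only an illustration under stated assumptions: the
   function names are invented for the example, code values are mapped to
   characters with chr() (so values 0-255 stand for single bytes), and the
   encoder quotes the double quote, the ampersand, and anything outside
   printable ASCII.

```python
import re

CHAR_REF = re.compile(r"&#([0-9]+);")


def decode_char_refs(value):
    """Replace each numeric character reference &#N; by character N."""
    return CHAR_REF.sub(lambda m: chr(int(m.group(1))), value)


def encode_char_refs(raw):
    """Quote '"', '&' and non-printable-ASCII characters as &#N;."""
    out = []
    for ch in raw:
        code = ord(ch)
        if ch in '"&' or code < 32 or code > 126:
            out.append("&#%d;" % code)
        else:
            out.append(ch)
    return "".join(out)


def names_over_namelen(names, namelen=8):
    """Return the names longer than NAMELEN (8 in the default declaration)."""
    return [name for name in names if len(name) > namelen]


selector = decode_char_refs("foo &#34;bar&#34;")
# selector == 'foo "bar"'
```

   Running names_over_namelen over a candidate scheme or attribute name
   before registering it shows whether it fits under the default
   declaration.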
                         
  DEPLOYMENT AND USAGE
  
   The first place to try this specification out is in the WWW browser. (I'll
   try to make the code changes if I find time.) It's a simple matter of
   elevating UDIs as SGML attributes to URLs as SGML elements. I'd like
   someone who really knows SGML to have a look at this DTD and see if it can
   be improved. And I'd like to study the HyTime standard, the Davenport DASH,
   the CFCM standard, etc. to see how this element meshes with their citation
   strategies. Also, it would be nice to have explicit support from WAIS and
   Gopher clients -- drag and drop comes to mind.
   
SGML and semantics

   SGML is famous for being divorced from application semantics. Most of the
   semantics of URLs is in the constituent protocols. All we need to do is
   define a way to parse a URL and pass the various bits to the protocol. But
   as long as we're going to all the trouble to gather information accessible
   with all these protocols into one specification, it makes sense to define
   some semantics common to most applications that will use URLs.
   
  DATA TYPES
  
   Some of the schemes have explicit type information (wais, gopher), some have
   implicit typing (html, USENET), and some have no typing at all (file, ftp).
   The MIME content-type system is general and useful enough to warrant
   support. An application should be able to determine the content-type of the
   data regardless of the protocol.
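
   One way an application might determine the content type regardless of
   protocol is to fall back on per-scheme defaults. The Python sketch below
   copies the type and subtype defaults from the ATTLIST declarations in the
   DTD attached to this message. The names DEFAULT_TYPES and content_type are
   assumptions for illustration, and None stands for #IMPLIED (no default).

```python
# Defaults taken from the ATTLIST declarations in the attached DTD.
# None means the attribute is #IMPLIED there, i.e. has no default.
DEFAULT_TYPES = {
    "http": ("text", None),
    "prospero": ("appl", "octets"),
    "file": ("appl", "octets"),
    "ftp": ("appl", "octets"),
    "usenet": ("message", "rfc822"),
    "wais": ("text", "plain"),
    "gopher": (None, None),
}


def content_type(scheme, attrs):
    """Return (type, subtype) for a parsed reference, or None if the
    scheme declares no content attributes (telnet in the DTD)."""
    if scheme not in DEFAULT_TYPES:
        return None
    default_type, default_subtype = DEFAULT_TYPES[scheme]
    return (attrs.get("type", default_type),
            attrs.get("subtype", default_subtype))


kind = content_type("file", {"host": "snoopy",
                             "path": "~connolly/bin/cgrep.pl",
                             "type": "appl", "subtype": "x-perl"})
# kind == ("appl", "x-perl")
```

   Explicit attributes win over the defaults, so a scheme with no typing of
   its own (file, ftp) still yields a usable answer.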
   
  RESOURCE IDENTITY
  
   Many applications have use for determining whether two URLs refer to the
   same information. Various schemes (such as USENET article IDs) may have
   semantics for identifying resources. But I think this capability is so
   widely useful that it should be coherently supported for all protocols.
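
   The message does not define identity semantics, so the following Python
   sketch is only one possible reading. It assumes that two references with
   equal datasig values (the MD5 signature attribute declared in the attached
   DTD) name the same data, that USENET article IDs identify articles, and
   that nothing can be concluded otherwise. The function same_resource and
   its three-valued result are inventions for this example.

```python
def same_resource(a, b):
    """Compare two parsed references, each a (scheme, attrs) pair.

    Returns True or False when the attributes settle the question,
    and None when identity cannot be decided from them.
    """
    (scheme_a, attrs_a), (scheme_b, attrs_b) = a, b

    # Assumption: equal data signatures mean equal data, across schemes.
    sig_a, sig_b = attrs_a.get("datasig"), attrs_b.get("datasig")
    if sig_a and sig_b:
        return sig_a.lower() == sig_b.lower()

    # USENET article IDs are scheme-specific identifiers.
    if scheme_a == scheme_b == "usenet":
        art_a, art_b = attrs_a.get("article"), attrs_b.get("article")
        if art_a and art_b:
            return art_a == art_b

    # Identical references trivially agree; anything else is undecided.
    if scheme_a == scheme_b and attrs_a == attrs_b:
        return True
    return None
```

   Returning None rather than False for the undecided case keeps "different
   location" from being mistaken for "different information".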
   
                                                            connolly@convex.com
--alt
Content-Type: text/x-html

<!DOCTYPE html SYSTEM>
<title>Using SGML to define Universal Resource Locators</title>

<H1>Objective</H1>

The issue of what to call these things we're defining has been
discussed at length. First it was Universal Document Identifier. The
name has changed as the objective has been refined. The latest name is
Universal Resource Locator. The provisional charter is:

<a HREF="x-message-id:<9206262004.AA29919@zippy.lcs.mit.edu>">
<h4>To define a printable string syntax to allow</h4>

<ol>
<li>The expression of the address on the network of any accessible
object using existing information retrieval protocols;

<li>The expression of the name of any object held in a directory
system or unique naming space on the network;

<li>The distinction to be made easily in the syntax between such
protocols and directories and name spaces;

<li>New protocols, directories and naming schemes to be included as
and when they are developed.
</ol>
</a>

<p>
Clearly what we are about is defining a language, i.e.  a syntax and
semantics for communicating some information.
<p>

The information is the location and/or identity of some information
object in the global hypertext. It's a citation or a reference or a
hypertext link anchor.
<p>

I propose a specification for the language of URLs, in the context of
a specification for a language of global hypertext references.
<p>

These global hypertext references include more semantics than just
differentiating between protocols and accessing data. There are also
issues of determining the type and the identity of the referent data.

<H2>SGML as a syntactic specification tool</H2>

That's what it's for, after all. What I propose is a DTD that
(with the default SGML declaration) defines the language of
global hypertext references.
<p>

<h4>Some examples of the language:</h4>
<XMP>
<http host="info.cern.ch" path="hypertext/TheProject.html">
<http host="info.cern.ch" path="hypertext/people.html" anchor="timbl">
<http host="info.cern.ch" path="XFIND" search="SGML">

<prospero host="archie.mgil.ca" path="pub/ftp">

<file host="snoopy" path="~connolly/bin/cgrep.pl" type=appl subtype=x-perl>

<ftp host="export.lcs.mit.edu" dir="contrib"
name="XcRichText-1.2.tar.Z">

<usenet group="comp.infosystems.gopher">
<usenet article="<abc@convex.com>">

<wais host="quake.think.com" database="INFO" search="help">
<wais host="quake.think.com" database="INFO" wtype="TEXT" size=1000
	path="/usr/local/wais/README" >

<telnet host="info.cern.ch">

<gopher host="boombox.umn.edu" port=70>
<gopher host="boombox.umn.edu" selector="foo &#34;bar&#34;" gtype=0 >
</XMP>

The DTD uses only the most basic features of SGML, and thus
the resulting language is not very complex. Implementing
a parser for this particular SGML language is a far
simpler task than implementing a general SGML parser.

At the same time, we get the benefits of a rigorously
defined language based on established standards.

<dl><dt>Note:
<dd>I haven't studied the HyTime standard very
carefully. I think it's beyond the scope of the task at
hand, but I'd like to have that opinion substantiated by someone
who really knows. In particular, its Finite Coordinate Systems
could be used to model positions within documents: characters,
lines, paragraphs.
</dl><p>

<h3>Relevant Issues</h3>

<dl>
<dt>Verbosity <dd>This syntax is somewhat verbose, but I think that
implicit markup (punctuation rather than names) will lead to a mass of
quoting in many cases. And the consistency between schemes is not
necessarily very high.

<dt>Long URLs
<dd>Extra whitespace between tokens has no effect. There is still
the problem of quoted strings that are longer than a mailer allows.
Certainly there's some SGML feature that I'm not aware of that
addresses the issue.
<p>
I don't believe there's a way to restrict the length of an element,
though there is a 960 character limit on the length of an attribute
value (in the default SGML declaration).

<dt>Quoting
<dd>The SGML numeric character reference (e.g. &#128;) allows
an attribute value literal to represent any sequence of bytes.

<dt>NAMELEN
<dd>The default SGML declaration specifies that names of
elements and attributes be 8 characters or less. It's a
conceptually simple matter to operate under an SGML declaration
where NAMELEN is higher.

<dt>Extensibility
<dd>One problem with the current UDI syntax specification is that it
seems to allow new schemes to add arbitrary complexity to the grammar.
This specification limits the language to an SGML start tag.
<p>
If we adopt this spec, we need to give it a public text identifier,
and maintain a registry of the names used (probably with the IANA).
</dl>

<h3>Deployment and Usage</h3>

The first place to try this specification out is in the
WWW browser. (I'll try to make the code changes if I find
time.) It's a simple matter of elevating UDIs as SGML
attributes to URLs as SGML elements.

I'd like someone who really knows SGML to have a look
at this DTD and see if it can be improved. And I'd like
to study the HyTime standard, the Davenport DASH, the CFCM
standard, etc. to see how this element meshes with their
citation strategies.

Also, it would be nice to have explicit support from WAIS and Gopher
clients -- drag and drop comes to mind.

<h2>SGML and semantics</h2>

SGML is famous for being divorced from application semantics.
Most of the semantics of URLs is in the constituent protocols.
All we need to do is define a way to parse a URL and pass
the various bits to the protocol.

But as long as we're going to all the trouble to gather information
accessible with all these protocols into one specification, it makes
sense to define some semantics common to most applications that will
use URLs.

<h3>Data Types</h3>

Some of the schemes have explicit type information (wais, gopher),
some have implicit typing (html, USENET), and some have no typing at
all (file, ftp). The MIME content-type system is general and useful
enough to warrant support. An application should be able to determine
the content-type of the data regardless of the protocol.

<h3>Resource Identity</h3>

Many applications have use for determining whether two URLs refer
to the same information. Various schemes (such as USENET article
IDs) may have semantics for identifying resources. But I think this
capability is so widely useful that it should be coherently supported
for all protocols.

<address>connolly@convex.com</>
</HTML>

--alt--
--cut-here

<!-- Universal Resource Locator specification
     derived from http://info.cern.ch/hypertext/WWW/Addressing/BNF.html
     on 24 July 1992
     by connolly@convex.com -->

<!-- Typical usage:
	<!DOCTYPE url SYSTEM>
		(we need a public identifier)
	or as part of another SGML document type:
	<!ENTITY % url SYSTEM>
	%url;
	-->

<!-- minimization? I believe you can omit the name= part
	of an SGML attribute specification in some circumstances.
	I don't think it works with CDATA attributes because
	order is not significant. -->

<!-- news: scheme renames USENET -->
<!-- file: is somewhat vague. I suggest explicit support for FTP: -->
<!ENTITY % schemes "http|file|ftp|usenet|telnet|prospero|gopher|wais">

<!ELEMENT url - - (%schemes;)* >
<!-- content model of URL: more than one element in a URL? (obviously
	an application can use multiple URLs. The question is whether
	to define semantics for multiple elements in a single URL.)

	Also, what about type, size, search information? Perhaps
	one element should describe the connection information,
	another element or elements describes the path to the data
	(allowing us to define semantics of hierarchical databases)
	and another element defines the type of information there.

	-->

<!ELEMENT (%schemes;) - O EMPTY >

<!-- TCP connection info: internet domain address and port number -->
<!ENTITY % host "host CDATA #REQUIRED" >
<!ENTITY % hostp "%host; port NUMBER #IMPLIED" >

<!ENTITY % types "text|image|audio|video|message|multi|appl">
<!ENTITY % stypes "plain|richtext|
	gif|g3fax|
	basic|
	mpeg|
	rfc822|external|partial|
	mixed|altern|parallel|
	octets|ps|oda">
<!-- content-type parameters? -->

<!ENTITY % cte "7bit|8bit|qp|base64|binary"
	-- we could define several of the gopher types
	   in terms of encodings and types
		e.g. x-binhex, application/x-stuffit
	-->

<!ENTITY % MD5 "datasig CDATA #IMPLIED" -- MD5 data signature -->
<!ENTITY % bytes "bytes NUMBER #IMPLIED">
<!ENTITY % lines "lines NUMBER #IMPLIED">

<!ATTLIST http
	-- information accessing attributes --
	%hostp;
	path CDATA #REQUIRED -- server local name --
		-- must match xalpha [/ path ] --
		-- can a CDATA attribute contain an arbitrary bytestream? --
	search CDATA #IMPLIED -- search terms --
	anchor CDATA #IMPLIED -- HTML anchor name --

	-- information content attributes --
	type (%types) text
	subtype (%stypes) #IMPLIED
	encoding (%cte) 7bit
	%MD5;
	%bytes;
	>

<!ATTLIST prospero
	%hostp;
	path CDATA #REQUIRED
	-- prospero path should not be constrained to WWW path syntax --


	-- information content attributes --
	type (%types) appl
	subtype (%stypes) octets
	encoding (%cte) binary
	%MD5;
	%bytes;
	>

<!ATTLIST file
	%host;
	path CDATA #REQUIRED
	-- unix path should not be constrained to WWW path syntax --

	-- information content attributes --
	type (%types) appl
	subtype (%stypes) octets
	encoding (%cte) binary
	%MD5;
	%bytes;
	>

<!ATTLIST ftp
	%hostp;
	dir CDATA #REQUIRED -- directory for cd command --
	name CDATA #REQUIRED -- name for get command --
	user CDATA "anonymous" -- anonymous ftp by default --
	password CDATA #IMPLIED -- not always needed --

	-- information content attributes --
	type (%types) appl
	subtype (%stypes) octets
	encoding (%cte) binary -- use 7bit for ascii transfers --
	%MD5;
	%bytes;
	>

<!ATTLIST usenet
	group CDATA #IMPLIED -- usenet newsgroup name --
	article CDATA #IMPLIED -- article message-id --

	-- information content attributes --
	type (%types) message
	subtype (%stypes) rfc822
	encoding (%cte) 7bit
	%MD5;
	%lines; -- you can add headers without changing a USENET
			article, so bytes isn't a good measure --
	>

<!-- should we split this into two nodes so that
	we can put #REQUIRED on the size and type for documents? -->
<!ATTLIST wais
	%hostp;
	database CDATA #IMPLIED -- WAIS database name --
	search CDATA #IMPLIED -- search terms --
		-- what about relevant documents? --
	wtype CDATA #IMPLIED -- WAIS data type --
		-- this should be obsoleted by the MIME type system --
	bytes NUMBER #IMPLIED
	path CDATA #IMPLIED -- split into original x, y? --

	-- information content attributes --
	type (%types) text
	subtype (%stypes) plain
	encoding (%cte) binary
	%MD5;
	>

<!ATTLIST telnet
	%hostp;
	user CDATA #IMPLIED -- username --
	>

<!ATTLIST gopher
	%hostp;
	gtype CDATA "1" -- gopher type --
		-- again, MIME types should be used --
		-- www browser can be inundated by non-text data
		   unless it recognizes other types --
	selector CDATA "" -- gopher object selector --
	search CDATA #IMPLIED -- fulltext search terms --

	-- information content attributes --
	type (%types) #IMPLIED
	subtype (%stypes) #IMPLIED
	encoding (%cte) binary
	%MD5;
	%bytes;
	>
--cut-here
Content-type: text/sgml
Content-Description: Example URLs

<!DOCTYPE url SYSTEM>
<url>
<http host="info.cern.ch" path="hypertext/TheProject.html">
<http host="info.cern.ch" path="hypertext/people.html" anchor="timbl">
<http host="info.cern.ch" path="XFIND" search="SGML">

<prospero host="archie.mgil.ca" path="pub/ftp">

<file host="snoopy" path="~connolly/bin/cgrep.pl" type=appl subtype=x-perl>

<ftp host="export.lcs.mit.edu" dir="contrib"
name="XcRichText-1.2.tar.Z">

<usenet group="comp.infosystems.gopher">
<usenet article="<abc@convex.com>">

<wais host="quake.think.com" database="INFO" search="help">
<wais host="quake.think.com" database="INFO" wtype="TEXT" size=1000
	path="/usr/local/wais/README" >

<telnet host="info.cern.ch">

<gopher host="boombox.umn.edu" port=70>
<gopher host="boombox.umn.edu" selector="foo &#34;bar&#34;" gtype=0 >
</url>

--cut-here--