Updated URI test suite; resolving some issues...

"Daniel W. Connolly" <connolly@hal.com>
Errors-To: listmaster@www0.cern.ch
Date: Wed, 16 Mar 1994 22:50:18 --100
Message-id: <9403162137.AA10696@ulua.hal.com>
Reply-To: connolly@hal.com
Originator: www-talk@info.cern.ch
Sender: www-talk@www0.cern.ch
Precedence: bulk
From: "Daniel W. Connolly" <connolly@hal.com>
To: Multiple recipients of list <www-talk@www0.cern.ch>
Subject: Updated URI test suite; resolving some issues...
X-Listprocessor-Version: 6.0c -- ListProcessor by Anastasios Kotsikonas
Content-Length: 5705

I've updated my URI test suite
	<http://www.hal.com/%7Econnolly/dist/url_test-19940316.tar.Z>
to address _lots_ more issues.  While I was at it, I of course had to
tweak the grammar because of things I hadn't thought of.

WHICH CHARACTERS?

While I was at it, I settled on a workable final set of data
characters. I started with the POSIX portable filename character
set (letters, digits, hyphen, underscore, and period). Then I looked
at the MIME recommendations about characters that make it through
mailers without harm. But in the end, I went with the set from the
isAcceptable table from HTParse.c in the libwww distribution out of
Mosaic.

So the data characters are letters, numbers, period, hyphen,
underscore, at-sign and asterisk. In regexp-speak, that's
	[0-9A-Za-z*\.@_-]
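
As a quick illustration, that character class can be tried out in a few
lines of Python. This is only a sketch: the names DATA_WORD and is_word
are made up for the example and are not part of the test suite.

```python
import re

# Data characters as listed above: letters, digits, period, hyphen,
# underscore, at-sign and asterisk. One or more of them make a "word".
DATA_WORD = re.compile(r"[0-9A-Za-z*.@_-]+")

def is_word(text):
    """Return True if text consists only of data characters."""
    return DATA_WORD.fullmatch(text) is not None

print(is_word("user@host"))    # True
print(is_word("file.ext"))     # True
print(is_word("user:passwd"))  # False, ':' is not a data character
```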

FTP CHANGES

The first result of this is that the user@host in
	ftp://user@host/dir/file.ext
is one word to the parser. So it's no longer part of the URI syntax --
it's specific to the FTP scheme. This is handy in that it makes the
grammar LR(1) again! There is a conflict when using user:passwd@host,
though. The ':' is special and can't be part of a word unless it's
escaped. So the full ftp syntax will have to change to:
	ftp://user*passwd@host/dir/file.ext
or
	ftp://user%3Apasswd@host/dir/file.ext
or something else where the whole user/passwd/host triple is one word.
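
To show what "one word" would mean for an FTP client, here is a rough
Python sketch that takes such a word apart after the URI parser is done
with it. It assumes one of the two candidate forms above ('*' as the
separator, or an escaped ':'); the function name split_ftp_login is
hypothetical, and neither form is settled.

```python
from urllib.parse import unquote

def split_ftp_login(word):
    """Split a user/passwd/host word into (user, passwd, host).

    Assumes the 'user*passwd@host' or 'user%3Apasswd@host' form.
    Parts that are absent come back as None.
    """
    login, at, host = word.rpartition("@")
    if not at:
        return None, None, word          # just a host, no user part
    if "*" in login:
        user, _, passwd = login.partition("*")
    else:
        # Undo the escaping so that %3A becomes ':' again, then split.
        user, _, passwd = unquote(login).partition(":")
    return user, (passwd or None), host

print(split_ftp_login("user*passwd@host"))
print(split_ftp_login("user%3Apasswd@host"))
```

Either way, the splitting is the FTP scheme's business, not the
generic grammar's.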


WAIS STUFF

The other result of picking that char set is that all the other
characters ("!@#$%^&*()=+~`':...") are either markup or reserved.

This caused a conflict with WAIS URLs. So I extended the grammar to
include ';' and '=' as tokens, and added keyword=value syntax. So
the syntax for WAIS files is:
	wais://host/database/type/size/keyword=value;keyword=value;...
and the parser extracts the keyword/value pairs.

The keyword=value syntax is allowed in the path and in the search
string. So the syntax includes things like:
	x-smart-database://host/database-name?author=fred;year=1994
	x400:/G=Jack;S=Jansen;O=cwi;PRMD=surf;ADMD=400net;C=nl
	x500:/c=GB@o=NEXOR%20Ltd@cn=Martijn%20Koster
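
For the ';'-separated form, extracting the pairs could look roughly
like the Python sketch below. It is not the parser from the test suite;
parse_pairs is a made-up name, and decoding %XX escapes in the values
is my assumption. It does not try to handle the '@'-separated x500
example.

```python
from urllib.parse import unquote

def parse_pairs(segment):
    """Return the keyword/value pairs found in one path or search segment."""
    pairs = []
    for item in segment.split(";"):
        if "=" in item:
            key, _, value = item.partition("=")
            pairs.append((key, unquote(value)))
    return pairs

print(parse_pairs("author=fred;year=1994"))
# [('author', 'fred'), ('year', '1994')]
print(parse_pairs("G=Jack;S=Jansen;O=cwi;PRMD=surf;ADMD=400net;C=nl"))
```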

URNs vs URLs vs RELATIVE URIs

I have been thinking about near-term ways to deploy URNs. Even if
there is no generalized way to resolve a URN to a URL, they are
useful. For example, I have a whole bunch of cached documents from the
web in my local filesystem. But the connection between them and the
place they came from is lost. So when I'm browsing some document that
references the MIME RFC, for example, my browser has no way of taking
advantage of the fact that I've already got a copy of it locally. And
the problem scales as documents are copied, mirrored, cached, etc.

On the other hand, if we had an rfc: URN scheme registered, I could
perhaps configure my browser (or my proxy server) to map
	rfc:*	=> local-file:/u/connolly/web/rfc/*	(try this first)
		=> ftp://ds.internic.net:/rfc/*		(try this next...)

The same is true of mailing lists. When I'm browsing the www-talk
archive, I actually have local copies of many of the messages. We
could register a message-id:<id> scheme or even a mailing-list:mbox/<id>
scheme. Then I could map
	mailing-list:www-talk@info.cern.ch/*
		=> local-file:/u/connolly/Mail/by-id/www-talk/*
	newsgroup:comp.text.sgml/*
		=> local-file:/u/connolly/News/by-id/comp.text.sgml/*
		=> wais://ifi.no/comp.text.sgml/TEXT/99999/*
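
A browser or proxy could hold such mappings as an ordered table and
try each target in turn. The Python sketch below only expands a name
into its candidate locations; it does no fetching. The table layout and
the name candidates are assumptions for illustration, not an existing
configuration format.

```python
# Prefix => list of replacement prefixes, in the order they should be tried.
RULES = [
    ("mailing-list:www-talk@info.cern.ch/",
     ["local-file:/u/connolly/Mail/by-id/www-talk/"]),
    ("newsgroup:comp.text.sgml/",
     ["local-file:/u/connolly/News/by-id/comp.text.sgml/",
      "wais://ifi.no/comp.text.sgml/TEXT/99999/"]),
]

def candidates(name):
    """Return the locations to try for name, first choice first."""
    for prefix, targets in RULES:
        if name.startswith(prefix):
            rest = name[len(prefix):]      # the part matched by '*'
            return [target + rest for target in targets]
    return []                              # no rule applies

for location in candidates("newsgroup:comp.text.sgml/some-id"):
    print(location)
```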

I extended the grammar to include relative URIs, and I invented a way
to merge URNs into the URL namespace while still being able to tell
them apart. A URL always looks like:
	scheme://WORD...
or
	scheme:/WORD...
whereas a URN always looks like:
	scheme:WORD...
(i.e. no slash)
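
The rule is simple enough to check in a few lines. Here is a Python
sketch of it; the function name classify and the exact set of
characters allowed in a scheme name are my assumptions, and anything
without a scheme is treated as a relative URI.

```python
import re

# Assumed shape of a scheme name: a letter, then letters, digits, hyphens.
SCHEME = re.compile(r"[A-Za-z][0-9A-Za-z-]*")

def classify(uri):
    """Return 'url', 'urn' or 'relative' using the slash rule above."""
    scheme, colon, rest = uri.partition(":")
    if not colon or SCHEME.fullmatch(scheme) is None:
        return "relative"                  # no scheme at all
    # scheme:/WORD and scheme://WORD are URLs; scheme:WORD is a URN.
    return "url" if rest.startswith("/") else "urn"

print(classify("ftp://ds.internic.net/rfc/rfc822.txt"))  # url
print(classify("rfc:rfc822.txt"))                        # urn
print(classify("dir/file.ext"))                          # relative
```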

So we can begin to deploy things like:
	message-id:9403161725.AA11467@dragget.hpl.hp.com
	isbn:IBM/832u9283
	issn:29o3u7982
by, for example, using the www_proxy mechanism in Mosaic.

Why is it necessary to distinguish URNs from URLs? To me, the
distinction between URNs and URLs is that URNs identify immutable
objects, and URLs identify mutable objects. Once you've resolved a
URN, you can keep that copy forever and use it to satisfy other
queries for that URN. As to the issues of versioning, translation,
etc., I'd say that a URN may identify a set of documents, and the
versions, translations, etc. are elements of the set.

For example, the URNs
	rfc:rfc822.ps
and
	rfc:rfc822.txt
are elements of, say
	rfc:rfc822.*

The last URN above can't be directly resolved.
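
One way to picture the set idea: treat the '*' as a shell-style
wildcard over the URNs a client already knows about. The Python sketch
below does that with the standard fnmatch module. Reading '*' as a
wildcard is my assumption here, and the list KNOWN is hypothetical.

```python
from fnmatch import fnmatchcase

# Hypothetical list of URNs a client already has copies of.
KNOWN = ["rfc:rfc822.ps", "rfc:rfc822.txt", "rfc:rfc1327.txt"]

def members(pattern, urns):
    """Return the URNs that are elements of the set named by pattern."""
    return [urn for urn in urns if fnmatchcase(urn, pattern)]

print(members("rfc:rfc822.*", KNOWN))
# ['rfc:rfc822.ps', 'rfc:rfc822.txt']
```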

In many ways, the URN <rfc:rfc822.txt> is the same as the URL
<ftp://ds.internic.net/rfc/rfc822.txt>. But a WWW client has no
way of knowing that the ftp file is guaranteed not to change.

Hmmm... this isn't all coming together like I had hoped. The goal is
to deploy the more sophisticated "URCs" or IAFA-templates or whatever
in a scalable, distributed fashion. In the short term, I'd like to be
able to compose documents with references like:

  <REFERENCE linkend="x1">RFC 822: Format for Internet Mail Messages
	</REFERENCE>
  <urnloc ID="x1" locsrc="loc1"
      DATE="19910434094433" EXPIRATION="19990101000000">rfc:rfc822.txt</urnloc>
  <url ID="loc1" backup="loc2">local-file://ulua.hal.com/u/connolly/rfc/822.txt
	</url>
  <url ID="loc2" backup="loc3">wais://host/rfcs/.../822.txt
	</url>
  <url ID="loc3" backup="loc4">ftp://ds.internic.net/rfc/rfc822.txt</url>
  <bibloc ID="loc4">
0822 S     D. Crocker, "Standard for the format of ARPA Internet text  
           messages", 08/13/1982. (Pages=47) (Format=.txt) (Obsoletes 
           RFC0733) (STD 11) (Updated by RFC1327, RFC0987) 
  </bibloc>
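
Read this way, each backup attribute names the next location to try.
The Python sketch below walks that chain for the example above. It is
only an illustration of the idea: the dictionary layout and the name
fallback_chain are made up, and the elided wais path is kept exactly as
written.

```python
# ID => (url, backup ID), copied from the example above.
LOCATIONS = {
    "loc1": ("local-file://ulua.hal.com/u/connolly/rfc/822.txt", "loc2"),
    "loc2": ("wais://host/rfcs/.../822.txt", "loc3"),
    "loc3": ("ftp://ds.internic.net/rfc/rfc822.txt", "loc4"),
    "loc4": (None, None),   # the bibloc entry: a citation, nothing to fetch
}

def fallback_chain(start, locations):
    """Return the URLs to try, in order, starting from the given ID."""
    chain = []
    seen = set()            # guards against a backup loop
    current = start
    while current is not None and current not in seen:
        seen.add(current)
        url, backup = locations.get(current, (None, None))
        if url is not None:
            chain.append(url)
        current = backup
    return chain

for url in fallback_chain("loc1", LOCATIONS):
    print(url)
```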


Anyway... this citation stuff is still muddling around at this point.
But I think I've got most of the URL issues hammered out, while
leaving room for URNs and allowing this stuff to be used in URCs.

Dan