Testing URIs for equality

"Daniel W. Connolly" <connolly@hal.com>
Errors-To: listmaster@www0.cern.ch
Date: Tue, 15 Mar 1994 14:13:50 --100
Message-id: <9403142053.AA07867@ulua.hal.com>
Errors-To: listmaster@www0.cern.ch
Reply-To: connolly@hal.com
Originator: www-talk@info.cern.ch
Sender: www-talk@www0.cern.ch
Precedence: bulk
From: "Daniel W. Connolly" <connolly@hal.com>
To: Multiple recipients of list <www-talk@www0.cern.ch>
Subject: Testing URIs for equality
X-Listprocessor-Version: 6.0c -- ListProcessor by Anastasios Kotsikonas
Content-Type: multipart/mixed; boundary="----- =_aaaaaaaaaa0"
Content-Type: multipart/mixed; boundary="----- =_aaaaaaaaaa0"
Content-Type: multipart/mixed; boundary="----- =_aaaaaaaaaa0"
Content-Type: multipart/mixed; boundary="----- =_aaaaaaaaaa0"
Content-Type: multipart/mixed; boundary="----- =_aaaaaaaaaa0"
Mime-Version: 1.0
Mime-Version: 1.0
Mime-Version: 1.0
Mime-Version: 1.0
Mime-Version: 1.0
Content-Length: 9767
------- =_aaaaaaaaaa0
Content-Type: text/plain; charset="us-ascii"
Content-ID: <7861.763677456.1@ulua>

I'm new to this discussion, so I may be missing a few issues, but I
read through the nifty hypertext archive of the list, and I'd like to
make some suggestions that may resolve several outstanding lexical
issues one way or another:

It seems to me that there's a lot of discussion about the syntax of
URI's without regard to what the syntax represents. In order to
resolve the question of what a URI in string form represents, let's
consider the question:

	How does one decide if two URI's are "equal"?

I suggest that any specification which leaves this question unanswered
or unanswerable in any cases is sorely lacking, as many caching
strategies will need to be able to ask "have I been here before?"
independent of the scheme involved.

I suggest that the algorithm for comparing URI's is simple string
comparison [after reducing it to canonical form]. I believe it is
already fairly widely deployed. As we'll see, this leaves no room for
different implementations to use different quoting mechanisms, but it
does allow for different strategies for reducing a data stream to
canonical URI form (for example, removing all whitespace and strip <
and > in mail messages).

Question 1: What's a scheme? The grammar in
<http://info.cern.ch/hypertext/WWW/Addressing/URL/5_BNF.html> is
highly ambiguous. One can trace
	scheme => ialpha => xalphas => xalpha => extra => ':'
so that a scheme can have a ':' character in it. That seems pretty
silly, but in practice it's not much of an issue. Most implementations
disambiguate the same way. But, one can also
trace:
	scheme => ialpha => xalphas => xalpha => escape

So are the following equal?

	x-my-scheme:abc
	x-my-schem%65:abc

[my answer: no. The latter is illegal]

There's also the question of whether

	ftp://ftp.host.com/foo.txt
and
	FTP://ftp.host.com/foo.txt
are equal, but I think that's in the spec somewhere

[my answer: no, they're not equal, though I'm willing to bend on this]


Question 2: Is the "opaque string" limited to a certain set of
characters, or are all 256 octets available? If the character set
is limited, how does one represent octets outside that set?

For example, suppose I want to compose a URI consisting of the
string "<%a/b/c%>" and the scheme "x-my-scheme". Can I write:

	x-my-scheme:<%a/b/c%>

I suggest that if we allow that, we'd break a lot of existing
implementations, and that's a Bad Thing To Do. So perhaps I write:

	x-my-scheme:%3C%25a%2Fb%2Fc%3E

Hmmm... is that equal to the following?

	x-my-scheme:%3C%25%61%2F%62%2F%63%3E

If the answer is yes, then we have violated the premise that
everything after the first ':' is opaque. (which I will do again
below anyway, but for now...)

If the answer is no, then implementations will have to be very careful
to do only "minimal" quoting; i.e. there will have to be at least a
per-scheme agreement on the canonical representation of the opaque
string.

[my answer: no. The latter is illegal, or at least not in canonical form.]

I suggest that (a) the character set is limited, but (b) we agree on a
common quoting mechanism, carefully specified so that it is not
necessary to "unquote" a pair of URI's for comparison. It goes like
this:

We partition the 256 octets into three classes:
	(1) markup characters (exactly the 6 characters ":/@?#%").
	These _must_ be quoted ala "%3A%2F%40%3F%23%25"
	(2) reserved characters, e.g. <, >, and space because historical
	implementations can't support them, { and } because they
	don't survive email transport, octets 0-31 and 127-255 because
	they're not "printable", etc.
	These _must_ be quoted.
	(3) data characters, i.e. whatever's left.
	(e.g. letters, digits, -, _ ., perhaps others. We need to
	specify this set carefully)
	These _must not_ be quoted.

This quoting scheme maps artibrary octet strings into strings over the 
set of data characters plus % in an invertible fashion; i.e. if
	strcmp(quote(s1), quote(s2)) == 0
then we can conclude that
	strcmp(s1, s2) == 0
without going to the trouble to unquote s1 and s2.

For backwards compatibility, some systems may want to check for
canonical form before comparing URI's. For example, a routine might
take:

	URI_compare("  <http://host.com  /%61%62%63>",
		"HTTP://host.com/abc")
and reduce them both to
	"http://host.com/abc" before passing them to strcmp().
but that routine could _not_ change:
	http://host.com/abc%2Fdef
to
	http://host.com/abc/def
before passing it to strcmp() -- markup and reserved characters
_remain quoted_ in canonical form. Otherwise there is no way to
represent those characters opaquely.


Question 3: Is there a well-defined way to determine the host, port,
username, password, path, search string, and fragment-identifier, etc.
from a URI in a scheme-independent manner? In other words, can you
learn _anything_ about a URI if you don't recognize the scheme?

Again, using the premise that contradicting widely-deployed
implementations is Bad, the answer is yes, even though this violates
the "everything after the : is opaque" rule.

Using the following grammar and the above quoting rules, we can achive
both goals of being able to parse the stuff after the : and being able
to encode arbitrary data in an opaque fashion: note that a WORD can
represent _any_ sequence of octets in a way that's opaque to this
grammar. On the other hand, if you want to take advantage of the
hierarchical namespace properties of URL's, you would, e.g. use the
'/' character unescaped.

The purpose of a grammar, to me, is to give implementors an exact,
unambiguous description of the syntax. The grammar in the current spec
is so ambiguous that it does not accomplish this goal. I suggest the
following replacement. I've used yacc syntax rather than the
"BNF-like" syntax so that I can use the machine to check for
ambiguity.

[The resulting grammar is LR(1) with the exception of the
user/password production -- you need to look ahead about three tokens
before you see the '@' that disambiguates. So you can't build a URL
parser that groks user:password@host.com with yacc. Take the line:
      | '//' user passwdOpt '@' host portOpt 
out, and it becomes completely LR(1).]

I've also introduced a lexical token WORD that separates the quoting
issues from the parsing issues. So the terminals of the grammar are:

	':' '@' '//' '/' '?' '#' and WORD


------- =_aaaaaaaaaa0
Content-Type: text/plain; charset="us-ascii"
Content-ID: <7861.763677456.2@ulua>

/* YACC spefication of URI grammar */
/* $Id: url-formal.y,v 1.1 1994/03/14 17:17:33 connolly Exp connolly $ */

%token WORD /* string of data characters and escape squences, i.e.
		DATA = [^:@/?#%]
		HEX = [0-9A-F]
		WORD = ({DATA}|%{HEX}{HEX})+ */ ;

%%
uri : scheme ':' hostpartOpt abspath suffix
	;

hostpartOpt : /* null */
	| '//' user passwdOpt '@' host portOpt
	| '//' host portOpt
	;

passwdOpt : /* null */
	| ':' password
	;

portOpt : /* null */
	| ':' port
	;

abspath	: '/' path
	;

path : WORD '/' path
	| WORD '/'
	| WORD
	;

suffix : '#' fragmentid
	| '?' search
	;

scheme : WORD ;
user : WORD ;
password : WORD ;
host : WORD ;
port : WORD ;
fragmentid : WORD ;
search : WORD ;


------- =_aaaaaaaaaa0
Content-Type: text/plain; charset="us-ascii"
Content-ID: <7861.763677456.3@ulua>

We should perhaps introduce '+' as a markup character so that clients
could parse the individual search words from the following:

	x-my-scheme://host/database?word1+word2


Consider the follwing applications of the above suggestions:

A Macintosh filename 'Vol X:dir1:dir two:xyz.html' available via FTP
could be represented as:

	ftp://mac.host.name/Vol%20X%3Adir1%3Adir%20two%3Axyz.html

or as:

	ftp://mac.host.name/Vol%20X/dir1/dir%20two/xyz.html

A client won't be able to tell that these point to the same file. It
should interpret the first as:

	ftp mac.host.com
	ftp> get "Vol X:dir1:dir two:xyz.html"	[not sure about quoting in ftp]

and the second as:

	ftp mac.host.com
	ftp> cd "Vol X"
	ftp> cd dir1
	ftp> cd "dir 2"
	ftp> get xyz.html

Relative links in the xyz.html file probably won't work if the path is
encoded the first way.

These options are also available for VMS files, WAIS doc-ids, etc.


The point here is to agree on some semantics for URIs that matches
current practice without infringing on future mechanisms. The grammar
I've given is arbirary in some ways: it specifies a scheme-independent
way to parse the scheme, user, password, host, path, search string,
and fragment identifier, but leaves other parsing to scheme-specific
parsing of WORDs (e.g. gopher type, wais size, type, etc.). But it is
unambiguous and workable.

Plus, it leaves some room for extension. For example, the string:

	x-my-scheme://host/keyword:value/keyword:value/keyword:value

is illegal by this grammar. This means that in the future we can
safely extend the grammar to include this syntax with the knowledge
that we will not be changing the meaning of any extant URLs.


Comments?

If this proposal receives a somewhat favorable response, I think
the next thing to do is to augment the above yacc specification
with enough supporting materials to build a reference implementation
of a URI parser and begin to compile a suite of test cases so
that implementors can validate their work. I volunteer to compile
the test suite.


Dan


p.s. What was the motivation for abbreviating Message-Id: down to
mid:? I vote we just use message-id: and content-id: as the scheme
names.

p.p.s. The ~ character is used quite a bit these days in urls, e.g.

	http://host/~user/index.html

Is this legal? The grammar has an unused production for "national"
that mentions the ~ character, as if it's not allowed in URI's.


------- =_aaaaaaaaaa0--