Re: Is this use of BASE kosher?

Daniel W. Connolly (connolly@beach.w3.org)
Tue, 1 Aug 95 09:22:20 EDT

In message <9508011125.AA20439@plato.ansa.co.uk>, Owen Rees writes:
>"Daniel W. Connolly" <connolly@beach.w3.org> writes:
>> Q2. What is the address of the head of the link whose tail is "Support
>> and ..."
>>
>> A2. http://www.hp.com/go/ftp-sites#Miscellaneous Support
>> aka
>> http://www.hp.com/go/ftp-sites#Miscellaneous%20Support
>>
>> (space and %20 have identical semantics. Both are correct.)
>
>I disagree - I think that space must be written %20 in this context, my
>reasoning follows from Larry Masinter's message. A browser may be lenient
>here, but that is a separate matter.

Hmm... after re-reading some of the specs, I see that they agree with
you.

My understanding of URL escaping was that there are "reserved"
characters, "safe" characters, and "unsafe" characters. Reserved
characters are those that have different meaning when escaped
vs. unescaped; for example, '/'. The "safe" and "unsafe" characters
(e.g. 'a', '~', and ' ') may or may not be escaped, depending on whether
the context permits them or not. For example, ' ' must be escaped
in an HTTP GET because space delimits arguments in HTTP commands.
'~' must be escaped due to brokenness in internet mail gateways, etc.
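
To make the escaping step concrete, here is a minimal sketch using
Python's standard urllib.parse module. The helper name escape_everything
is made up for illustration, and Python's idea of which characters need
escaping is its own, not necessarily that of RFC1738.

```python
from urllib.parse import quote, unquote

def escape_everything(text: str) -> str:
    """Percent-encode every character that quote() does not treat as always safe."""
    # safe="" overrides quote()'s default of leaving '/' unescaped.
    return quote(text, safe="")

fragment = "Miscellaneous Support"
escaped = escape_everything(fragment)

print(escaped)                       # the space becomes %20
print(unquote(escaped) == fragment)  # decoding restores the original text
print(escape_everything("a/b#c"))    # '/' and '#' are escaped as well
```

Decoding the escaped form gives back the original string, which is the
sense in which the two spellings carry the same text.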

So ' ' could be considered unsafe given the behaviour of Netscape, but
as far as SGML is concerned, it's safe. Hmmm... actually, it's not,
since in SGML attribute value literals, tabs, newlines, and multiple
spaces are collapsed to single spaces to form the attribute value.
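
A rough sketch of that collapsing, taking the description above at face
value (the exact SGML rules depend on the attribute's declared type, and
the function name collapse_whitespace is hypothetical):

```python
import re

def collapse_whitespace(literal: str) -> str:
    """Replace each run of spaces, tabs, and line breaks with one space."""
    return re.sub(r"[ \t\r\n]+", " ", literal)

# An attribute value literal as it might appear in the markup,
# broken across lines and padded with extra whitespace.
literal = "#Miscellaneous \t\n  Support"
value = collapse_whitespace(literal)
print(repr(value))
```

Under this reading, a literal space in an HREF does not reliably survive
as written, whereas %20 is untouched by the collapsing.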

But in any case, RFC1738 disagrees with me:

| All unsafe characters must always be encoded within a URL. For
| example, the character "#" must be encoded within URLs even in
| systems that do not normally deal with fragment or anchor
| identifiers, so that if the URL is copied into another system that
| does use them, it will not be necessary to change the URL encoding.

[I would consider '#' to be reserved, not just unsafe.]

> (I am happy with either encoded or not encoded, but absolutely opposed to
>encoding being optional - that old ambiguity argument again.)

I think it's folly to say that we've found all the characters that
might be disallowed in any context in which URLs will be used, and
hence to say that any URL has exactly one unambiguous "spelling."

I think that http://a and http://%61 are exactly the same URL.
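
A small sketch of that comparison, assuming "the same URL" means "equal
after percent-decoding". The function naive_same is hypothetical, and
the second call shows where such a naive rule goes wrong for reserved
characters:

```python
from urllib.parse import unquote

def naive_same(url1: str, url2: str) -> bool:
    """Compare two URLs after decoding every %XX escape in both."""
    return unquote(url1) == unquote(url2)

# 'a' is not reserved, so escaping it should not change the meaning.
print(naive_same("http://a", "http://%61"))

# '/' is reserved: %2F and '/' mean different things, yet the naive
# comparison reports these two as equal as well.
print(naive_same("http://host/a%2Fb", "http://host/a/b"))
```

So a decode-and-compare rule is only sound if reserved characters are
left out of the decoding step.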

Hmmm... on the other hand, the recently released path: URN spec says
that letters are folded to lower-case by default, and so %41 must be
used if you _really_ mean upper-case A.

RFC1738 is due for revision: RFC1808 already pointed out several
mistakes in RFC1738. I wonder how this issue will be resolved in
the revision.

>Oh dear! '#' and fragment identifiers might not be valid in anchors
>according to the HTML draft. In the comment in the DTD it says 'The term
>URI means a CDATA attribute whose value is a Uniform Resource Identifier,
>as defined by "Universal Resource Identifiers" by Tim Berners-Lee aka
>RFC 1630' and RFC1630 page 22 makes it clear that the fragmentid is not
>part of the URI.
>
>A possible solution is to replace this with a reference to 'URL as defined in
> RFC1808'.

Good point! I'll fix this in the upcoming draft.

>Since anchor names are not URIs, presumably the encoding rules do not apply
>to them. Therefore <A HREF="#a%20name"> introduces a tail for
><A NAME="a name">, and there is no option of not encoding the space in the
>tail, or encoding it in the head. I think it needs to be made explicit
>whether or not names are encoded since there is potential for confusion
>here.

Another good point. I'll see if I can work this in.
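
For what it's worth, the matching rule Owen describes could be sketched
like this, assuming the fragment in the HREF is percent-decoded and the
NAME value is taken literally. The function fragment_matches is made up
for illustration; no draft specifies it.

```python
from urllib.parse import unquote, urlsplit

def fragment_matches(href: str, anchor_name: str) -> bool:
    """Decode the fragment of href and compare it to a literal NAME value."""
    fragment = urlsplit(href).fragment  # the text after '#', still escaped
    return unquote(fragment) == anchor_name

# The HREF is encoded, the NAME is not.
print(fragment_matches("#a%20name", "a name"))

# If the NAME were written with %20, it would not match under this rule.
print(fragment_matches("#a%20name", "a%20name"))
```

The second case is the confusion in question: an author who encodes the
NAME gets a link that silently fails to resolve.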

Dan