Re: what to do about ...

Daniel W. Connolly (connolly@hal.com)
Thu, 10 Nov 1994 05:15:53 +0100

[Apologies for the wide distribution, but I think it's worth
relaying this situation to all these forums.]

In message <Pine.3.89.9411091954.A20308-0100000@is.rice.edu>, Rick Troth writes:
>> > What do I do about things like this:
>> >
>> ><li> <a href="//vm.rice.edu/local/problem?assignee=troth&status=open">
>>
>> Yuk... this is a problem. &foo; is an entity reference, even inside
>> attribute value literals. The way to represent & inside an attribute
>> value literal is &#38; or &amp; e.g.
>>
>> <li> <a href="//vm.rice.edu/local/problem?assignee=troth&#38;status=open">
>>
>> but I doubt the current browsers handle that!
>
> Likewise.
>
> Further, it can't be percent-escaped like your URL.
>"...assignee=troth%26status=open..." would result in an embedded
>ampersand instead of the desired input stream line break.
>(that's how my server passes it to CGI, others may vary)
>
>> Hmmm... this calls for a red-alert message to html-wg@oclc.org!
>>
>> Any chance you could get your CGI script to handle ; as a parameter
>> separator and write:
>>
>> <li> <a href="//vm.rice.edu/local/problem?assignee=troth;status=open">
>
> Oh, sure, but ...
>
>> Any chance of getting the CGI-bin spec amended at this point?
>
> That's the real problem. I used that syntax because that's
>what was in use elsewhere in the Web. Red alert? If you say so.

Well... extending the CGI bin scripts to allow ';' in addition
to '&' as a parameter separator is a pretty much upward compatible
change.

It's fine for URLs computed by HTTP clients like Mosaic to go
right on using

http://foo/bar?x=y&a=b

but any such URLs actually placed in the text of an HTML document
would be changed to:

href="http://foo/bar?x=y;a=b"

since existing browsers don't grok

href="http://foo/bar?x=y&#38;a=b"

And CGI-bin scripts would be enhanced to allow '&' or ';'. The
only possible problem is if a form value had a ';' character in it, like:

http://foo/bar?x=a;b&z=123

I doubt that happens very much, and there's a good chance that browsers
already %xx-ify these things. If so, then the only remaining nasty
is coding these in documents. It would look like

href="http://foo/bar?x=a%3Bb;z=123"
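
To make the proposed change concrete, here is a minimal sketch of a query
parser that accepts either '&' or ';' as the parameter separator. The
function name parse_query is hypothetical, and Python stands in for
whatever language a given CGI script is written in; this is an
illustration of the idea, not a description of any existing CGI library.

```python
import re
from urllib.parse import unquote_plus


def parse_query(query):
    """Split a query string on '&' or ';' and %xx-decode each name and value.

    Hypothetical helper: the splitting happens before any decoding, so an
    escaped %26 or %3B inside a value is never mistaken for a separator.
    """
    pairs = []
    for field in re.split(r"[&;]", query):
        if not field:
            continue  # ignore empty fields such as a trailing separator
        name, _, value = field.partition("=")
        pairs.append((unquote_plus(name), unquote_plus(value)))
    return pairs


print(parse_query("x=y&a=b"))        # old style, as computed by clients
print(parse_query("x=y;a=b"))        # new style, as written in documents
print(parse_query("x=a%3Bb;z=123"))  # a ';' inside a value, escaped
```

Both separator styles yield the same pairs, and the escaped %3B comes back
as a literal ';' inside the value of x. An unescaped ';' inside a value, as
in x=a;b&z=123, would be split into a separate parameter named b, which is
exactly the compatibility risk mentioned above.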

> Is it at all possible for the DTD to be "fixed"?

Well, this is something that's not specified by the HTML DTD, but
rather by SGML itself. So no, it can't be fixed.

The problem is that folks have been lazily slinging strings around
in various contexts without taking encoding considerations into
account.

For example, to encode an arbitrary string as a form value in an href
attribute, you must go through the following steps:

1. Encode the string as a URL "word" by replacing all occurrences
of "reserved" characters ':', '/', '?', etc. and whitespace
by %xx equivalents.

2. Encode the resulting string as an SGML attribute value
literal by replacing all occurrences of '&', '\'', '"', '\n',
'\r' by &#ddd; equivalents.

3. Write "http://host/path?name=_resulting_string_"

But current browsers don't implement the reverse of step 2 correctly,
and I understand there was considerable difficulty getting CGI
implementations to handle the reverse of step 1 consistently.
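
The layering can be sketched in a few lines of Python. Everything named
here (url_word, attr_literal, href_value, decode_href) is hypothetical and
chosen for illustration. One assumption goes beyond the steps as listed:
the attribute-literal encoding is applied to the whole URL rather than to
the single word, so that a '&' separating two parameters is covered as
well. The decoder uses html.unescape, which also resolves named entities,
as a stand-in for a proper SGML attribute value parser.

```python
from html import unescape
from urllib.parse import quote, unquote


def url_word(text):
    """Step 1: %xx-escape reserved characters and whitespace."""
    return quote(text, safe="")


def attr_literal(text):
    """Step 2: replace & ' " newline and return by &#ddd; references."""
    return "".join(f"&#{ord(ch)};" if ch in "&'\"\n\r" else ch for ch in text)


def href_value(base, params):
    """Step 3: compose the URL, then encode it as an attribute value.

    Assumption: step 2 is applied to the complete URL so that the '&'
    between parameters is escaped too.
    """
    query = "&".join(f"{url_word(n)}={url_word(v)}" for n, v in params)
    return attr_literal(f"{base}?{query}")


def decode_href(value):
    """Undo the layers in the opposite order: step 2 first, then step 1."""
    url = unescape(value)
    base, _, query = url.partition("?")
    fields = (field.partition("=") for field in query.split("&"))
    return base, [(unquote(n), unquote(v)) for n, _, v in fields]


params = [("name", "a b&c"), ("status", "open")]
value = href_value("http://host/path", params)
print(value)               # the text that goes between the quotes of href
print(decode_href(value))  # the original base and parameters
```

The point of the sketch is the ordering: the decoder must undo the
attribute-literal layer before it looks for '&' separators, and only then
undo the %xx layer on each name and value.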

For example, if you write:

<img src="eq1.gif" alt="a > b">

you'll be surprised to find that the > in a > b is interpreted as
the end of the IMG tag, in some browsers. And you'll be further
disappointed that you can't work around this bug by writing:

<img src="eq1.gif" alt="a &#62; b">

Regarding:

>I want Dan's %7E to go away too (as in tilde being
>okay per spec) and I'm not alone. But ...

There's nothing wrong with using

http://www.hal.com/~connolly

in most contexts. There are no nasty interactions with HTML or HTTP.
But if you try to mail that through some BITNET or JANET sites, you'll
get gibberish on the other end. There's a limited "mail safe"
character set documented somewhere in the RFCs.

And the URL spec, for some dubious reason, says that URLs should be
written in a "transport friendly" way, i.e. in a way that they don't
have to be scanned and escaped for various protocols. That is, rather
than using base64 or quoted-printable or some other mechanism specially
designed to deal with the brokenness in the internet mail world,
the URL syntax should provide a way to escape the characters so that
these interactions are prevented. Hence ~ is supposed to be written
%7E for maximum transport friendliness.

This is a silly thing to do, since it is impossible to be certain that
_any_ URL syntax has no interaction with all surrounding syntaxes.
This '&' interaction is a perfect case in point: the '&' character
cannot be %xx escaped, because that changes its meaning from "form
parameter separator" to "form value constituent character." This
business of %xx-ifying the "unsafe" characters doesn't work in the
case that a character is both "reserved" and "unsafe."
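
The change of meaning is easy to demonstrate with a present-day query
parser. The sketch below uses parse_qsl from Python's standard urllib.parse
module on the two forms of Rick's query string; it illustrates the general
behaviour and is not a claim about how any particular 1994 server passed
data to CGI.

```python
from urllib.parse import parse_qsl

# A literal '&' acts as the form parameter separator.
separated = parse_qsl("assignee=troth&status=open")

# An escaped '&' (%26) is a constituent character of the value instead.
embedded = parse_qsl("assignee=troth%26status=open")

print(separated)  # two parameters: assignee and status
print(embedded)   # one parameter whose value contains an ampersand
```

The first call yields two name/value pairs; the second yields a single
pair whose value is "troth&status=open", which is the embedded ampersand
Rick describes.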

In summary: each layer of encoding should be handled carefully, but
independently. The URL syntax should not attempt to address internet
mail brokenness (I should use base64 or quoted-printable if I'm really
worried about ~ not making it through the mail), and HTTP clients
should not gloss over SGML attribute value literal syntax when parsing
URLs in href attributes.

Daniel W. Connolly "We believe in the interconnectedness of all things"
Software Engineer, Hal Software Systems, OLIAS project (512) 834-9962 x5010
<connolly@hal.com> http://www.hal.com/%7Econnolly