Re: Globalizing URIs

Daniel W. Connolly
Wed, 2 Aug 95 18:47:33 EDT

In message <>, Glenn Adams writes:
>It is my current understanding that arbitrary bytes can be encoded in URLs.

Well... that's stretching it. Arbitrary bytes can be encoded in
Morse code too. A URL is a sequence of US-ASCII characters. Check
the URL spec:

2.2. URL Character Encoding Issues

URLs are sequences of characters, i.e., letters, digits, and special
characters. A URL may be represented in a variety of ways: e.g., ink
on paper, or a sequence of octets in a coded character set. The
interpretation of a URL depends only on the identity of the
characters used.

In most URL schemes, the sequences of characters in different parts
of a URL are used to represent sequences of octets used in Internet
protocols. For example, in the ftp scheme, the host name, directory
name and file names are such sequences of octets, represented by
parts of the URL. Within those parts, an octet may be represented by
the character which has that octet as its code within the US-ASCII
[20] coded character set.
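In modern terms (a Python illustration, which obviously postdates this
discussion), the rule above means any octet can ride along in a URL
part, either literally as its US-ASCII character or as a %XX escape:

```python
from urllib.parse import quote, unquote_to_bytes

# Per the spec excerpt: a URL part represents a sequence of octets.
# An octet that is not a "safe" US-ASCII character is written as %XX.
octets = bytes([0x2F, 0xC3, 0xA9, 0x09])  # '/', two non-ASCII octets, TAB
escaped = quote(octets, safe="")           # escape everything non-alphanumeric
print(escaped)                             # %2F%C3%A9%09

# Decoding the escapes recovers the original octets exactly.
assert unquote_to_bytes(escaped) == octets
```

Note that the round trip is exact only at the octet level; what
*characters* those octets stand for is precisely the charset question
under discussion.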

>This provides a means for an HTML UA to formulate a response to a form
>submission for arbitrary character encodings.

Er... well.. I suppose so. But that seems like a pretty roundabout
way to go about it.

I'd much prefer to see a general purpose replacement for
the application/x-www-form-urlencoded media type.

Something like text/tab-separated-values might work nicely.
Or something SQLish, or lispish, or Tcl-ish. text/tab-separated-values
would be nice because you could use other charset= values for
other encodings. Of course you'd have the same nasty interactions
with octet 9 for the TAB character as with octets 13 and 10 for CR/LF.
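To make the TSV idea concrete, here's a sketch of what such a form
submission body might look like. Everything here is hypothetical: the
field names, the backslash-escaping convention for embedded TAB/CR/LF,
and the use of Python at all are illustrations, not anything specified
anywhere:

```python
# Hypothetical text/tab-separated-values form submission: a header row
# of field names, then a data row of values.  The Content-Type header
# could carry any charset, e.g.:
#   Content-Type: text/tab-separated-values; charset=ISO-8859-1
fields = {"name": "Glenn Adams", "comment": "line one\nline two\ttabbed"}

def escape(value: str) -> str:
    # TAB, CR and LF are structural in TSV, so embedded occurrences
    # need some escaping convention -- backslash escapes here, purely
    # as an illustration of the "nasty interactions" mentioned above.
    return (value.replace("\\", "\\\\").replace("\t", "\\t")
                 .replace("\r", "\\r").replace("\n", "\\n"))

body = ("\t".join(fields) + "\r\n"
        + "\t".join(escape(v) for v in fields.values()) + "\r\n")
print(body)
```

The escaping step is exactly the wart in question: the record and
field delimiters collide with legitimate data, just as "&" and "="
do in x-www-form-urlencoded.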

> However, I do have one important
>question: how does an HTTP server identify the encoding of such bytes (i.e.,
>the CHARSET) and communicate that encoding to the consumer of this data (e.g.,
>a CGI script)?

Well... I gather you're still talking about the
application/x-www-form-urlencoded media type. The only "specification" for
that is in the HTML spec. (hang on... I'd better check the CGI
spec... nope. It just says stuff like "Examples of the command line
usage are much better demonstrated than explained.")

I think the character encoding scheme is US-ASCII, or perhaps
ISO-Latin-1, by convention.
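That convention can be made concrete with a sketch (modern Python,
purely illustrative) of what a CGI script on the receiving end might
do: it gets raw octets with no declared charset and, by convention,
decodes them as ISO-Latin-1, every one of whose 256 octets maps to a
character:

```python
from urllib.parse import unquote_to_bytes

# Raw x-www-form-urlencoded bytes as a CGI script might receive them;
# %FC is octet 0xFC, which ISO-Latin-1 maps to the character u-umlaut.
raw = b"name=Connolly&city=Z%FCrich"

pairs = {}
for field in raw.split(b"&"):
    key, _, value = field.partition(b"=")
    # "+" means space; %XX escapes become octets; then decode by the
    # *assumed* charset -- nothing in the protocol actually says Latin-1.
    decode = lambda b: unquote_to_bytes(b.replace(b"+", b" ")).decode("latin-1")
    pairs[decode(key)] = decode(value)

print(pairs)   # {'name': 'Connolly', 'city': 'Zürich'}
```

If the client had actually sent those octets in some other encoding,
the script has no way to know, which is the heart of the complaint.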

Like I said... I'd much prefer to see x-www-form-urlencoded replaced
than having other character sets shoehorned into that hack.