Re: Globalizing URIs

Gavin Nicol (gtn@ebt.com)
Wed, 2 Aug 95 22:10:36 EDT

I have something written up on this that was going into the
rfc. It is appended.

---
<H2>Notes on the URL syntax</H2>

<P>The URL specification allows arbitrary 8-bit data to form part of the <TT>&lt;scheme-specific-part></TT> of a URL, but requires that only octets which correspond to character codes for printable ASCII be used in the URL itself. Octets that fall outside of this set must be encoded using the <EM>URL-encoding</EM> mechanism, which encodes the octet as a '%' followed by two hexadecimal digits. The '%' sign itself must also be encoded.</P>
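<P>As a rough illustration of the encoding rule just described, here is a minimal Python sketch. The function name <TT>url_encode_octets</TT> is made up for this note, and the sketch assumes that space is encoded along with everything outside 0x21-0x7E; it deliberately ignores the reserved and unsafe character sets that a real URL implementation would also have to handle.</P>

```python
def url_encode_octets(data: bytes) -> str:
    """Percent-encode octets that are not printable ASCII, plus '%' itself.

    Assumption for this sketch: "printable" means 0x21..0x7E, so space is
    encoded too. Reserved/unsafe characters are not treated specially.
    """
    out = []
    for octet in data:
        if 0x21 <= octet <= 0x7E and octet != ord("%"):
            # Printable ASCII passes through unchanged.
            out.append(chr(octet))
        else:
            # Everything else becomes '%' followed by two hex digits.
            out.append("%{:02X}".format(octet))
    return "".join(out)


if __name__ == "__main__":
    # Four non-ASCII octets followed by an ASCII suffix.
    print(url_encode_octets(b"\xb0\xf5\xba\xfe.html"))
    # A literal percent sign must itself be encoded.
    print(url_encode_octets(b"100%"))
```

<P>Note that the result carries no trace of which coded character set or encoding produced the original octets.</P>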

<P>URLs often point to files on a file system, which increasingly may <EM>not</EM> have names that use only printable ASCII characters. For example, on a Japanese system, a file might have the name "insatsu.html", in which the "insatsu" might be represented in romaji, katakana, hiragana, or kanji. In such cases, the octets that fall outside the range of printable ASCII would be encoded as per the specification, resulting in something like the following on EUC-based systems:</P>

<PRE>
http://www.jacme.co.jp/%B0%F5%BA%FE.html
</PRE>

<P>In general, this does not present a problem, because URLs are seldom decoded on machines where the coded character set and encoding differ from those used in the URL. However, where they do differ (robots, for example), the decoder may or may not be able to sense the coded character set and encoding used. Even if the decoder guesses correctly, it is not guaranteed to be able to decode the URL successfully and then process the resulting text.</P>

<P>To allow the coded character set and encoding to be explicitly stated in the URL, the URL syntax should be expanded as follows:</P>

<PRE>
&lt;scheme>:&lt;character-set-data>&lt;scheme-specific-part>
</PRE>

<P>where the BNF definition of <TT>&lt;character-set-data></TT> would be:</P>

<PRE>
character-set-data = "[" [ character-set ":" ] encoding "]:"
character-set      = name-string
encoding           = name-string
name-string        = 1*[ alpha | digit | "-" | "," ]
</PRE>

<P>and the <TT>&lt;character-set-data></TT> part of a URL should be optional, thereby resolving any backward compatibility concerns. An example of such a URL would be:</P>

<PRE>
http:[EUC]//www.jacme.co.jp/%B0%F5%BA%FE.html
</PRE>

<P>It would also be advisable for the HTTP protocol to provide some mechanism for indicating the coded character set and encoding used with URLs that are part of a request. For example, the <TT>PUT</TT> method syntax could be extended such that the coded character set and encoding of the URL are an optional part of the method parameters:</P>

<PRE>
request-line          = method SP request-uri SP optional-charset-data SP http-version CRLF
optional-charset-data = "[" [ character-set ":" ] encoding "]:"
character-set         = name-string
encoding              = name-string
name-string           = 1*[ alpha | digit | "-" | "," ]
</PRE>
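<P>To make the proposed syntax a little more concrete, the following Python sketch shows one way a decoder might pick apart the bracketed prefix, both in a URL and in the extended request line. All function names and the label-to-codec table are invented for this note; in particular, mapping the label <TT>EUC</TT> to Python's <TT>euc_jp</TT> codec is only an assumption. The BNF ends the prefix with <TT>"]:"</TT> while the example URL omits that colon, so the sketch accepts both forms rather than picking one.</P>

```python
import re
from urllib.parse import unquote_to_bytes

# name-string = 1*[ alpha | digit | "-" | "," ]
_NAME = r"[A-Za-z0-9,-]+"
# "[" [ character-set ":" ] encoding "]" with the trailing ":" optional
# (assumption: tolerate both the BNF form and the example form).
_DATA = r"\[(?:(?P<charset>" + _NAME + r"):)?(?P<encoding>" + _NAME + r")\]:?"

_PREFIXED_URL = re.compile(
    r"^(?P<scheme>[A-Za-z][A-Za-z0-9+.-]*):(?:" + _DATA + r")?(?P<rest>.*)$",
    re.DOTALL,
)
_CHARSET_DATA = re.compile(r"^" + _DATA + r"$")

# Hypothetical table from encoding labels to Python codec names.
CODEC_FOR_LABEL = {"EUC": "euc_jp"}


def split_charset_prefix(url):
    """Return (charset, encoding, url_without_prefix); labels may be None."""
    m = _PREFIXED_URL.match(url)
    if m is None:
        raise ValueError("not a URL: %r" % url)
    plain = m.group("scheme") + ":" + m.group("rest")
    return m.group("charset"), m.group("encoding"), plain


def decode_url(url):
    """Decode a prefixed URL to text, or return None if it has no label."""
    _charset, encoding, plain = split_charset_prefix(url)
    if encoding is None:
        return None  # the decoder would have to guess
    codec = CODEC_FOR_LABEL.get(encoding.upper())
    if codec is None:
        raise ValueError("unknown encoding label: %r" % encoding)
    return unquote_to_bytes(plain).decode(codec)


def parse_request_line(line):
    """Return (method, uri, charset, encoding, version) for a request line."""
    parts = line.rstrip("\r\n").split(" ")
    if len(parts) == 3:  # charset data omitted
        method, uri, version = parts
        return method, uri, None, None, version
    if len(parts) == 4:
        method, uri, data, version = parts
        m = _CHARSET_DATA.match(data)
        if m is None:
            raise ValueError("bad charset data: %r" % data)
        return method, uri, m.group("charset"), m.group("encoding"), version
    raise ValueError("bad request line: %r" % line)


if __name__ == "__main__":
    print(split_charset_prefix("http:[EUC]//www.jacme.co.jp/%B0%F5%BA%FE.html"))
    print(decode_url("http:[EUC]//www.jacme.co.jp/%B0%F5%BA%FE.html"))
    print(parse_request_line("PUT /%B0%F5%BA%FE.html [EUC]: HTTP/1.0\r\n"))
```

<P>Because the prefix is optional in both places, an unlabelled URL or request line passes through unchanged, which is the backward compatibility property the proposal relies on.</P>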