Re: Globalizing URIs

Gavin Nicol (gtn@ebt.com)
Wed, 2 Aug 95 22:10:36 EDT

Next message: Larry Masinter: "Re: Globalizing URIs"
Previous message: Paul Burchard: "Re: Is this use of BASE kosher?"
Maybe in reply to: Glenn Adams: "Globalizing URIs"
Next in thread: Larry Masinter: "Re: Globalizing URIs"

I have something written up on this that was going into the
RFC. It is appended.

---
<H2>Notes on the URL syntax</H2>

<P>The URL specification allows arbitrary 8-bit data to form part of
the <TT>&lt;scheme-specific-part></TT> of a URL, but requires that only
octets corresponding to printable ASCII characters appear in the URL
itself. Octets that fall outside of this set must
be encoded using the <EM>URL-encoding</EM> mechanism, which represents
the octet as a '%' followed by two hexadecimal digits. The '%'
sign itself must also be encoded.</P>
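
<P>As a rough illustration of the mechanism just described, the
following Python sketch encodes a string of octets. The function name
<TT>url_encode_octets</TT> is hypothetical, and the sketch assumes
"printable ASCII" means 0x21 through 0x7E; it deliberately ignores the
reserved and unsafe characters that the URL specification also
restricts.</P>

```python
def url_encode_octets(data: bytes) -> str:
    """Percent-encode octets outside printable ASCII, and '%' itself.

    Simplifying assumption: printable ASCII is taken to be 0x21..0x7E.
    Reserved and unsafe characters are not treated specially here.
    """
    out = []
    for octet in data:
        if 0x21 <= octet <= 0x7E and octet != ord("%"):
            # Printable ASCII other than '%' passes through unchanged.
            out.append(chr(octet))
        else:
            # Everything else becomes '%' followed by two hex digits.
            out.append("%{:02X}".format(octet))
    return "".join(out)


print(url_encode_octets(b"100%.html"))
print(url_encode_octets(bytes([0xB0, 0xF5, 0xBA, 0xFE]) + b".html"))
```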

<P>URLs often point to files on a file system, and increasingly those
files may <EM>not</EM> have names that use printable ASCII characters. For
example, on a Japanese system, a file might have the name
"insatsu.html", in which the "insatsu" might be written in
romaji, katakana, hiragana, or kanji. In such cases, the octets that
fall outside the range of printable ASCII would be encoded as per the
specification, resulting in something like the following on
EUC-based systems:</P>
<PRE>
       http://www.jacme.co.jp/%B0%F5%BA%FE.html
</PRE>
<P>In general, this does not present a problem, because URLs are
seldom decoded on machines where the coded character set and encoding
differ from those used within the URL. However, when they do differ
(for example, when a robot decodes the URL), the decoder may or may
not be able to detect the coded character set and encoding used. Even
if the decoder guesses correctly, it is not guaranteed to be able to
decode the URL and then process the resulting text.</P>
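
<P>The ambiguity can be seen in a small Python sketch using the
standard <TT>urllib.parse</TT> module. The codec names
<TT>euc_jp</TT> and <TT>shift_jis</TT> are Python's own, and the choice
of Shift_JIS as the "wrong guess" is only an assumption made for
illustration.</P>

```python
from urllib.parse import quote, unquote_to_bytes

# "insatsu" written in kanji (U+5370 U+5237).
name = "\u5370\u5237"

# The same file name yields different URLs under different encodings.
euc_url_part = quote(name, encoding="euc_jp")
sjis_url_part = quote(name, encoding="shift_jis")
print(euc_url_part)
print(sjis_url_part)

# A decoder only sees octets; nothing in the URL says which encoding applies.
raw = unquote_to_bytes(euc_url_part)
right_guess = raw.decode("euc_jp")
wrong_guess = raw.decode("shift_jis", errors="replace")
print(right_guess == name)
print(wrong_guess == name)
```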
    
<P>To allow the coded character set and encoding to be explicitly
stated in the URL, the URL syntax should be expanded as follows: </P>
<PRE>
       &lt;scheme>:&lt;character-set-data>&lt;scheme-specific-part>
</PRE>
<P>where the BNF definition of <TT>&lt;character-set-data></TT>
would be:</P>
<PRE>
       character-set-data = "[" [ character-set ":" ] encoding "]:"
       character-set      = name-string
       encoding           = name-string
       name-string        = 1*[ alpha | digit | "-" | "," ]
</PRE>
<P>and the <TT>&lt;character-set-data></TT> part of a URL should be
optional, thereby resolving any backward compatibility
concerns. An example of such a URL would be:</P>
<PRE>
       http:[EUC]://www.jacme.co.jp/%B0%F5%BA%FE.html
</PRE>
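<P>A parser for the proposed syntax might look like the following
Python sketch. It follows the BNF literally, so the closing delimiter
is taken to be <TT>"]:"</TT>. The function name
<TT>split_charset_data</TT> and the character-set label
<TT>JIS-X-0208</TT> are hypothetical, chosen only for illustration.</P>

```python
import re

# name-string = 1*[ alpha | digit | "-" | "," ]
NAME = r"[A-Za-z0-9,-]+"

# <scheme>:<character-set-data><scheme-specific-part>, where the
# character-set-data part ("[" [ character-set ":" ] encoding "]:")
# is optional.
URL_RE = re.compile(
    r"^(?P<scheme>[A-Za-z0-9+.-]+):"
    r"(?:\[(?:(?P<charset>" + NAME + r"):)?(?P<encoding>" + NAME + r")\]:)?"
    r"(?P<rest>.*)$",
    re.DOTALL,
)


def split_charset_data(url):
    """Return (scheme, character_set, encoding, scheme_specific_part).

    character_set and encoding are None when the URL carries no
    character-set-data, which keeps old-style URLs parseable.
    """
    m = URL_RE.match(url)
    if m is None:
        raise ValueError("not a URL: %r" % (url,))
    return (m.group("scheme"), m.group("charset"),
            m.group("encoding"), m.group("rest"))


print(split_charset_data("http:[EUC]://www.jacme.co.jp/%B0%F5%BA%FE.html"))
print(split_charset_data("http:[JIS-X-0208:EUC]://www.jacme.co.jp/%B0%F5%BA%FE.html"))
print(split_charset_data("http://www.jacme.co.jp/index.html"))
```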
<P>It would also be advisable for HTTP to provide some
mechanism for indicating the coded character set and encoding used
with URLs that are part of a request. For example, the <TT>PUT</TT>
method syntax could be extended such that the coded character set and
encoding of the URL are an optional part of the method parameters:</P>
<PRE>
       request-line          = method SP request-uri SP
                               [ optional-charset-data SP ] http-version CRLF
       optional-charset-data = "[" [ character-set ":" ] encoding "]:"
       character-set         = name-string
       encoding              = name-string
       name-string           = 1*[ alpha | digit | "-" | "," ]
</PRE>
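
<P>A server reading such a request line could separate the fields
roughly as follows. This is only a sketch: it treats the charset data
as optional, as the prose above describes, and the function name
<TT>parse_request_line</TT> and the sample request are hypothetical.</P>

```python
import re

# name-string = 1*[ alpha | digit | "-" | "," ]
NAME = r"[A-Za-z0-9,-]+"
# optional-charset-data = "[" [ character-set ":" ] encoding "]:"
CHARSET_DATA = re.compile(r"^\[(?:(" + NAME + r"):)?(" + NAME + r")\]:$")


def parse_request_line(line):
    """Return (method, request_uri, character_set, encoding, http_version)."""
    parts = line.rstrip("\r\n").split(" ")
    if len(parts) == 3:
        # No charset data: an ordinary request line.
        method, uri, version = parts
        return method, uri, None, None, version
    if len(parts) == 4:
        method, uri, data, version = parts
        m = CHARSET_DATA.match(data)
        if m is None:
            raise ValueError("malformed charset data: %r" % (data,))
        return method, uri, m.group(1), m.group(2), version
    raise ValueError("malformed request line: %r" % (line,))


print(parse_request_line("PUT /%B0%F5%BA%FE.html [EUC]: HTTP/1.0\r\n"))
print(parse_request_line("PUT /index.html HTTP/1.0\r\n"))
```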
