Re: Globalizing URIs

Glenn Adams (glenn@stonehand.com)
Fri, 4 Aug 95 09:04:59 EDT

From: Larry Masinter <masinter@parc.xerox.com>
Date: Thu, 3 Aug 1995 15:14:38 PDT

> Your example shows why implementors of TCP/IP protocol stacks ...
> should not just blindly translate their internal character codes into
> bytes used on the Internet.

I think the finger is being pointed in the wrong direction here. If the
mechanism used to interchange character data (in this case URIs) is not
adequate in terms of its character coverage, then it has to be
overloaded. The real problem is that the designers of application-level
protocols have routinely ignored the minimum requirements regarding
character interchange.

> I'd recommend ... using unicode-1-1-utf7 ...

OK. This is one possible solution. Rather than using the single limited
character set US-ASCII, use the much less limited character set ISO/IEC 10646.
This solution would not require communicating the character set along with the
pathname or URI or identifier.
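
To make this concrete, here is a minimal sketch of such a transcription
(in Python, whose built-in "utf-7" codec implements this scheme; the
pathnames are made-up examples):

    # Sketch: transcribing non-ASCII identifiers into unicode-1-1-utf7.
    # Every resulting octet is 7-bit US-ASCII, so no charset label is
    # needed on the wire.
    assert u"r\u00e9sum\u00e9".encode("utf-7") == b"r+AOk-sum+AOk-"
    assert u"\u65e5\u672c\u8a9e".encode("utf-7") == b"+ZeVnLIqe-"  # "Japanese"
    # The receiver recovers the original characters unambiguously:
    assert b"+ZeVnLIqe-".decode("utf-7") == u"\u65e5\u672c\u8a9e"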

Can you elaborate on your proposal so that Francois Y. can include it in his
draft RFC on HTML I18N?

> Otherwise, the client has to know not only what protocol it is
> speaking but also more about the internal operation of the server than
> it can possibly be expected to know.

Of course, without something to indicate the CHARSET of the input to the
transcription, such a system could only work by magic. However, a
designation of the CHARSET (in the absence of using UTF-7) *does* by
itself provide the information necessary for a proper interpretation.
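
To illustrate why the label matters, consider one octet under two
candidate CHARSETs (a sketch; the octet value is arbitrary):

    # One octet, two readings: without a CHARSET designation the
    # receiver can only guess which interpretation was intended.
    octet = b"\xe9"
    assert octet.decode("iso-8859-1") == u"\u00e9"  # LATIN SMALL LETTER E ACUTE
    assert octet.decode("koi8-r") == u"\u0418"      # CYRILLIC CAPITAL LETTER I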

I too would prefer using the UTF-7 approach rather than simply marking
the character set. However, keep in mind that this *will* present
considerable difficulties for users, since they routinely type URIs into
applications that know nothing about UTF-7, and they do so using their
own local character set. Although UTF-7 would be the cleaner approach, I
don't see UTF-7 support spreading quickly enough to make your approach
workable. I suspect that we should do the following (a rough sketch in
code follows the list):

(1) consider how to designate the input character set to a naive URI
transcription (one based on simple octet transcription)
(2) specify that, in the absence of such a designation, the simple
octet transcription will be interpreted according to the receiving
application's character set (which may result in a misinterpretation)
(3) specify that, if no mechanism exists for satisfying (1), a UTF-7
transcription should be used
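
Expressed as code, the policy might look like the following (a
hypothetical sketch; the function names and the choice of Latin-1 as the
receiver's local character set are mine, not part of any specification):

    # Hypothetical sketch of steps (1)-(3) above.
    def decode_uri_octets(octets, declared_charset=None,
                          local_charset="iso-8859-1"):
        if declared_charset is not None:
            # (1) an explicit designation accompanies the octets
            return octets.decode(declared_charset)
        # (2) no designation: fall back on the receiving application's
        #     character set, accepting the risk of misinterpretation
        return octets.decode(local_charset)

    def encode_uri_chars(text, can_designate_charset,
                         local_charset="iso-8859-1"):
        if can_designate_charset:
            # (1) ship local octets together with an explicit CHARSET label
            return text.encode(local_charset), local_charset
        # (3) no designation mechanism exists: use a UTF-7 transcription
        return text.encode("utf-7"), "unicode-1-1-utf7"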

Even with the above, we would still have the problem of communicating
which URI transcription system is being used. Note that if UTF-7 is
used, then the ASCII character '+' must be represented as the string
"+-", since it serves as an escape character in UTF-7.
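
For instance (a minimal illustration using Python's utf-7 codec):

    # '+' begins a base64 shift sequence in UTF-7, so a literal plus
    # sign must be written as the two-character string "+-".
    assert u"+".encode("utf-7") == b"+-"
    assert b"+-".decode("utf-7") == u"+"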

Regards,
Glenn