Re: Globalizing URIs

Larry Masinter (masinter@parc.xerox.com)
Thu, 3 Aug 95 18:21:02 EDT

I didn't respond directly before, so I'll try now:

> OK. Say the transcription system used by both the FTP client and
> server says:

> (1) if an octet in the local encoding of a file name is in positions
> 0x20 - 0x24 or 0x26 - 0x7E of US-ASCII, then use that octet;
> (2) if an octet in the local encoding of a file name is in position
> 0x25 of US-ASCII (i.e., '%'), then use "%%"
> (3) otherwise, use %XX for each octet in the encoded file name (in big
> endian order), where XX is the hexadecimal value of each such octet.

This transcription system will not be effective if the FTP client and
the FTP server use different internal character encodings in file
names.
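
To make the failure concrete, here is a minimal sketch of the quoted rules (1)-(3) in Python. The function name `transcribe` and the sample file name are illustrative assumptions, not part of any FTP implementation. It applies the same rules to one file name held in two different local encodings and prints what each side would put on the wire:

```python
def transcribe(octets: bytes) -> str:
    """Apply the quoted rules (1)-(3) to the octets of a locally encoded name."""
    out = []
    for b in octets:
        if b == 0x25:                # rule (2): '%' is doubled
            out.append("%%")
        elif 0x20 <= b <= 0x7E:      # rule (1): other printable US-ASCII octets pass through
            out.append(chr(b))
        else:                        # rule (3): everything else becomes %XX
            out.append(f"%{b:02X}")
    return "".join(out)


# Hypothetical file name: U+4E2D U+570B followed by ".TXT".
name = "\u4e2d\u570b.TXT"

# Same name, two internal encodings (big-endian 16-bit Unicode vs. BIG5).
unicode_client = transcribe(name.encode("utf-16-be"))
big5_server = transcribe(name.encode("big5"))

print(unicode_client)
print(big5_server)
```

The two printed strings differ even though both sides followed the rules exactly, so the server cannot match the request against its own file names without knowing the client's encoding.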

> Now, say I am a Chinese version of Windows/NT using Unicode and I ask you for:

> RETR N-W%0B%00.%00T%00X%00T

> That is, in Unicode I have the following encoding of my file name:

> 4E2D 570B 002E 0054 0058 0054

> Say you are a Taiwanese server using the BIG5 character set for your file
> names. How do you interpret my request? Do you interpret it as a BIG5
> string? If so, then you think I just asked for the file "N-W??.?T?X?T"
> (assuming for a moment that you don't throw up on NUL and you interpret
> NUL and other C0 control characters as '?').

Your example shows why implementors of TCP/IP protocol stacks on
systems that do not use ISO-8859-1 as their internal character set
should not just blindly translate their internal character codes into
bytes used on the Internet.

I'd recommend, for example, that both the Windows/NT client using
Unicode and the Taiwanese server using BIG5 transliterate file
names into unicode-1-1-utf7 in their implementations of FTP and HTTP.

Otherwise, the client has to know not only what protocol it is
speaking but also more about the internal operation of the server than
it can possibly be expected to know.
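
As a rough sketch of that recommendation: each side converts between its own internal encoding and one agreed wire encoding. The helper names `to_wire` and `from_wire` are hypothetical, and Python's built-in "utf-7" codec is used here as a stand-in for unicode-1-1-utf7, which is an assumption about how closely the two match.

```python
def to_wire(local_octets: bytes, local_encoding: str) -> bytes:
    """Convert a file name from the local encoding to the agreed wire encoding."""
    return local_octets.decode(local_encoding).encode("utf-7")


def from_wire(wire_octets: bytes, local_encoding: str) -> bytes:
    """Convert a file name received on the wire back to the local encoding."""
    return wire_octets.decode("utf-7").encode(local_encoding)


# Hypothetical file name: U+4E2D U+570B followed by ".TXT".
name = "\u4e2d\u570b.TXT"
nt_local = name.encode("utf-16-be")   # what the Unicode client stores
big5_local = name.encode("big5")      # what the BIG5 server stores

# The client sends the wire form; the server maps it to its own encoding.
wire = to_wire(nt_local, "utf-16-be")
print(wire)
print(from_wire(wire, "big5") == big5_local)
```

Neither side needs to know the other's internal encoding; each only needs to know its own and the one used on the wire.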