Re: Comments on: "Character Set" Considered Harmful

Glenn Adams (glenn@stonehand.com)
Thu, 27 Apr 95 18:54:46 EDT

Date: Thu, 27 Apr 95 18:04:47 EDT
From: bobj@netscape.com (Bob Jung)

At 10:27 PM 4/26/95, Gavin Nicol wrote:
>No. I for one, would like to use UCS-2.

Why? Please elaborate on the advantages of UCS-2 and the disadvantages
of UTF8.

I think it's useful to discriminate between interchange and processing here.
HTTP is based on an 8-bit (octet) data stream (principally TCP). While any
binary data can be transmitted, it is perhaps better to interchange
10646/Unicode data in the only standard octet-oriented encoding form, i.e.,
UTF-8. This way we can avoid the problem of byte order, etc. Furthermore,
the vast majority of HTML out there today is ASCII, and so is in effect
already encoded in UTF-8 (since ASCII data expressed in UTF-8 is identical
to its ASCII form).
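
To make the two interchange points concrete, here is a minimal sketch in
Python. It is only an illustration, not anything from the standards
themselves. The function names are hypothetical, and Python's "utf-16-be"
and "utf-16-le" codecs are used as stand-ins for the two byte orders of
UCS-2, which is an assumption that holds only for characters in the Basic
Multilingual Plane.

```python
def ascii_is_utf8(text: str) -> bool:
    """True if the ASCII octets of `text` equal its UTF-8 octets."""
    return text.encode("ascii") == text.encode("utf-8")


def sixteen_bit_forms(text: str) -> tuple[bytes, bytes]:
    """Return (big-endian, little-endian) 16-bit serializations.

    Assumption: utf-16-be/le stand in for UCS-2; they agree with it
    for any character in the Basic Multilingual Plane.
    """
    return text.encode("utf-16-be"), text.encode("utf-16-le")


sample = "<P>Hello</P>"

# ASCII markup needs no conversion at all to be valid UTF-8.
assert ascii_is_utf8(sample)

# The same text has two different octet sequences in the 16-bit form,
# so sender and receiver must agree on byte order out of band.
big, little = sixteen_bit_forms(sample)
assert big != little
```

With UTF-8 there is only one octet sequence for a given text, so the
byte-order question never comes up on the wire.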

As for processing, however, there are pros and cons to using a
variable-length encoding like UTF-8 as an internal process code. Most
likely people will use the 16-bit form here.
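
One of the trade-offs can be sketched as follows, again in Python and again
only as an illustration under stated assumptions. The helper names are
hypothetical, and "utf-16-be" stands in for the 16-bit form (valid only for
Basic Multilingual Plane characters). In the 16-bit form the n-th character
sits at a fixed offset; in UTF-8 it has to be found by scanning.

```python
def utf8_lengths(text: str) -> list[int]:
    """Number of UTF-8 octets used by each character of `text`."""
    return [len(ch.encode("utf-8")) for ch in text]


def char_at_16bit(data: bytes, n: int) -> str:
    """Fetch character n from 16-bit data: a fixed-offset lookup.

    Assumption: `data` is big-endian and contains only BMP characters.
    """
    return data[2 * n:2 * n + 2].decode("utf-16-be")


def char_at_utf8(data: bytes, n: int) -> str:
    """Fetch character n from UTF-8 data by scanning for lead octets."""
    count = -1
    start = None
    for i, octet in enumerate(data):
        if octet & 0xC0 != 0x80:      # not a 10xxxxxx continuation octet
            count += 1
            if count == n:
                start = i
            elif count == n + 1:
                return data[start:i].decode("utf-8")
    if start is None:
        raise IndexError(n)
    return data[start:].decode("utf-8")


text = "a\u00e9\u4e2dz"               # 'a', e-acute, a CJK ideograph, 'z'
assert utf8_lengths(text) == [1, 2, 3, 1]
assert char_at_16bit(text.encode("utf-16-be"), 2) == "\u4e2d"
assert char_at_utf8(text.encode("utf-8"), 2) == "\u4e2d"
```

The scan costs time proportional to the position sought, which is one
reason an application might prefer a fixed-width form internally while
still interchanging UTF-8.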

Since we can't stipulate what people will use internally in their apps,
we can't really push any particular form for the latter.

Thus it seems to me that UTF-8 makes quite good sense as a standard
interchange encoding form.

The more important point facing us now is to shift to the use of 10646/Unicode
as the standard document character set.

Regards,
Glenn Adams