Re: Followup on I18N comments

Sandra Martin O'Donnell (odonnell@osf.org)
Wed, 30 Nov 1994 21:42:33 +0100

>BTW, you might want to check with some Japanese users before
>deciding that the UTF-8 encoding of Unicode characters is the
>right solution for Japanese. . .

First, let me say that I live in Japan, and have done so for over 8
years now. I think I know the industry here reasonably well.

Most people here who have something against Unicode are either pushing
their own standards, or have some *political* reason for not agreeing...

Yes, I've heard these arguments too. An additional one is
that characters as currently encoded in Japanese-specific
code sets are in an order such that they can be collated
using their encoded values, much the way ASCII is often
collated. In Unicode, however, the characters are no longer
"in order", so collation based on encoded values doesn't work.
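(To make the distinction concrete, here is a minimal sketch in C.
Nothing in it is specific to any Japanese code set; strcmp and strcoll
simply stand in for "compare encoded values" versus "consult the
locale's collation rules".)

    #include <locale.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Two ways to order strings: by encoded value (strcmp) and by the
     * locale's collation rules (strcoll).  The argument above is that
     * the traditional Japanese code sets make the first one adequate,
     * while Unicode forces you to use the second. */

    static int by_encoded_value(const void *a, const void *b)
    {
        return strcmp(*(const char *const *)a, *(const char *const *)b);
    }

    static int by_collation(const void *a, const void *b)
    {
        return strcoll(*(const char *const *)a, *(const char *const *)b);
    }

    int main(void)
    {
        const char *words[] = { "pear", "Apple", "orange" };
        size_t n = sizeof words / sizeof words[0];

        qsort(words, n, sizeof words[0], by_encoded_value);
        printf("encoded-value order: %s %s %s\n",
               words[0], words[1], words[2]);

        setlocale(LC_COLLATE, "");      /* use the user's locale */
        qsort(words, n, sizeof words[0], by_collation);
        printf("collation order:     %s %s %s\n",
               words[0], words[1], words[2]);

        return 0;
    }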

Every code set has pros and cons.

. . . they always
start by saying "well, let's have the browsers understand EUC, and
SJIS". I then point out that there are *dozens* of local character
encoding systems, and there are more being invented every day. I ask
them if it is reasonable to expect every browser to be able to handle
a potentially huge number of encoding systems. I think not.

There are many local encoding systems, but in the past few
years I've seen a good bit of progress toward consolidation.
There is a national profile for Japanese that specifies
a standard EUC encoding very carefully and completely, and
those who produce EUC-based systems are moving toward it.
There still is a lot of variability in Shift-JIS implementations.
I agree that it clearly is not reasonable for every browser to
handle a multitude of varying encoding implementations.

Rather, what I think is a more reasonable approach is to use UTF-8 or
UTF-7 as the character encoding while the document is *in the process
of being transferred*. It is then converted to a local encoding by the
browser. With such a system we might end up with many cases like:

Windows "The Ether where electrons flow" Unix
SJIS----------------->UTF-8--------------------EUC

And this *vastly* simplifies the problem. Instead of having to
potentially understand hundreds of encodings, and being able to convert
them all into the local encoding (and UTF would be one of them), one
now only has to understand *two* encodings, and be able to convert
between them.
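(The two hops in the diagram map onto an ordinary character-set
conversion step at each end. The sketch below uses the iconv interface
purely for illustration; the encoding names "SHIFT_JIS", "UTF-8" and
"EUC-JP" are common iconv spellings, not a prescription for how
browsers would have to do it.)

    #include <iconv.h>
    #include <stdio.h>
    #include <string.h>

    /* One hop of the diagram: convert a buffer from one code set to
     * another.  A real implementation would handle partial and failed
     * conversions; this only shows the shape of the idea. */
    static int convert(const char *to, const char *from,
                       char *in, size_t inlen, char *out, size_t outlen)
    {
        iconv_t cd = iconv_open(to, from);
        if (cd == (iconv_t)-1)
            return -1;

        size_t rc = iconv(cd, &in, &inlen, &out, &outlen);
        iconv_close(cd);
        return rc == (size_t)-1 ? -1 : 0;
    }

    int main(void)
    {
        char sjis_doc[] = "example text"; /* sender's local code set (SJIS) */
        char wire[1024] = {0};            /* UTF-8, used only in transit    */
        char euc_doc[1024] = {0};         /* receiver's local code set      */

        /* Sender side: local encoding -> UTF-8 before transfer. */
        if (convert("UTF-8", "SHIFT_JIS", sjis_doc, strlen(sjis_doc),
                    wire, sizeof wire - 1) != 0)
            return 1;

        /* Receiver side: UTF-8 -> its own local encoding. */
        if (convert("EUC-JP", "UTF-8", wire, strlen(wire),
                    euc_doc, sizeof euc_doc - 1) != 0)
            return 1;

        puts(euc_doc);
        return 0;
    }

The point is that each end only ever needs to know its own local
encoding plus the transfer form.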

Interestingly, this is part of what we implemented in OSF's
DCE (Distributed Computing Environment). We actually use
Unicode (ISO 10646) as a transfer form rather than UTF-8. However,
the major difference between your suggestion and our implementation
is that we require support for Unicode as a transfer encoding
but do *not* require that all transfers be done in Unicode. If
the source and target are both using, say, the same version
of SJIS, we allow data to be sent without conversion. If the
source and target are different but there is a direct conversion
module between them (say, SJIS to EUC), we allow the data to be
converted from the source to the target.

If the source and target are different *and* there is no direct
conversion module, we require that there be source-to-Unicode
and Unicode-to-target conversion modules. This is the same as
your example above.

The benefit of this model is that it avoids unnecessary
conversions. However, by requiring support for conversions
into and out of Unicode/10646, we avoid trying to support
an infinite number of encoding variations.
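(A rough sketch of that selection rule, in C. None of the names below
come from DCE itself; the "registry" lookups are placeholders, and the
code only shows the three cases in order.)

    #include <stdio.h>
    #include <string.h>

    /* Illustrative only: 1. identical code sets -> send unconverted;
     * 2. a direct module exists -> convert source to target directly;
     * 3. otherwise -> convert source to Unicode, then Unicode to
     * target. */

    typedef int (*converter)(const char *in, char *out, size_t outlen);

    /* Placeholder converter: copies its input.  A real module would
     * map between code sets. */
    static int copy_through(const char *in, char *out, size_t outlen)
    {
        snprintf(out, outlen, "%s", in);
        return 0;
    }

    /* Placeholder registry lookups. */
    static converter find_direct(const char *from, const char *to)
    {
        (void)from; (void)to;
        return NULL;              /* pretend no direct module exists */
    }

    static converter find_to_unicode(const char *from)
    {
        (void)from;
        return copy_through;
    }

    static converter find_from_unicode(const char *to)
    {
        (void)to;
        return copy_through;
    }

    int transfer(const char *from, const char *to,
                 const char *in, char *out, size_t outlen)
    {
        if (strcmp(from, to) == 0)                 /* case 1: same code set */
            return copy_through(in, out, outlen);

        converter direct = find_direct(from, to);  /* case 2: e.g. SJIS->EUC */
        if (direct)
            return direct(in, out, outlen);

        converter enc = find_to_unicode(from);     /* case 3: via Unicode */
        converter dec = find_from_unicode(to);
        if (!enc || !dec)
            return -1;

        char unicode_form[4096];
        if (enc(in, unicode_form, sizeof unicode_form) != 0)
            return -1;
        return dec(unicode_form, out, outlen);
    }

    int main(void)
    {
        char out[64];
        if (transfer("SJIS", "EUC", "example data", out, sizeof out) == 0)
            puts(out);
        return 0;
    }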

---------------------------------------------------------------------
Sandra Martin O'Donnell            email: odonnell@osf.org
Open Software Foundation           phone: +1 (617) 621-8707
11 Cambridge Center                fax:   +1 (617) 225-2782
Cambridge, MA 02142 USA
---------------------------------------------------------------------