Re: Followup on I18N comments

Gavin Nicol (gtn@ebt.com)
Wed, 30 Nov 1994 04:03:43 +0100

>BTW, you might want to check with some Japanese users before
>deciding that the UTF-8 encoding of Unicode characters is the
>right solution for Japanese. Common encodings in Japan are
>Shift-JIS and EUC, and there is a *far* greater installed
>base of users of these encodings than there is for UTF-8.
>Therefore, users will probably want something that works
>with their existing software rather than an encoding that
>has been dictated by a group that has few Asian representatives.

First, let me say that I live in Japan, and have done for over 8 years
now. I think I know the industry here reasonably well.

Most people here who have something against Unicode are either pushing
their own standards, or have some *political* reason for not agreeing.
One can waffle about the fact that the Kanji Unification leads to
cases where the character displayed is not what one would expect, but
that is an *application* and *display* problem, not a codeset problem.

Due to the line of work I'm in, I have to talk to many people here who
are in the publishing, printing, and computer industries. When I ask
them how we will deal with large scale international document
exchange (and actually, this is relevant to OMG too), they always
start by saying "well, let's have the browsers understand EUC and
SJIS". I then point out that there are *dozens* of local character
encoding systems, and there are more being invented every day. I ask
them if it is reasonable to expect every browser to be able to handle
a potentially huge number of encoding systems. I think not.

Rather, I think a more reasonable approach is to use UTF-8 or UTF-7
as the character encoding while the document is *in the process of
being transferred*. It is then converted to a local encoding by the
browser. With such a system we might end up with many cases like:

Windows    "The Ether where electrons flow"    Unix
SJIS----------------->UTF-8-------------------->EUC

And this *vastly* simplifies the problem. Instead of potentially
having to understand hundreds of encodings, and being able to convert
them all into the local encoding (and UTF would be one of them), one
now only has to understand *two* encodings, and be able to convert
between them.
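
To make the picture above concrete, here is a minimal sketch in Python of
the SJIS -> UTF-8 -> EUC path. The function names to_wire and from_wire
are my own, purely for illustration, and I am assuming the codec names
"shift_jis" and "euc_jp" from Python's standard codec tables. It is not
meant as a definitive implementation, only a demonstration that each end
needs to know just its own local encoding plus UTF-8.

```python
def to_wire(local_bytes: bytes, local_encoding: str) -> bytes:
    """Sender side: convert from the local encoding to UTF-8 for transfer."""
    return local_bytes.decode(local_encoding).encode("utf-8")


def from_wire(wire_bytes: bytes, local_encoding: str) -> bytes:
    """Receiver side: convert from UTF-8 to the local encoding."""
    return wire_bytes.decode("utf-8").encode(local_encoding)


# A document authored on a Shift-JIS system (the "Windows" end).
sjis = "日本語".encode("shift_jis")

# On the wire it is UTF-8, whatever the sender's local encoding was.
wire = to_wire(sjis, "shift_jis")

# The receiver (the "Unix" end) converts to EUC without ever
# needing to know that the sender used Shift-JIS.
euc = from_wire(wire, "euc_jp")
```

Note that from_wire will raise an error if the text contains a character
the local encoding cannot represent, which is exactly the "can't display
the document" case discussed next.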

Many people will now be saying "yes, but what about the cases where
you *can't* display the document, and the case where the Kanji
Unification rears its head". Well, in the first case, one can offer
to save the document in its UTF form, which would be the same even if
UTF-8 wasn't being used (except you'd have to save it in some other
format). In addition, I think that as computational linguistics
improves, it will be possible to convert the text into a local encoding
of the phonetic representation of the text (for example, the Kanji
"hara" would be converted from the Unicode encoding to the word "hara"
on ASCII-Only systems). This will probably happen *after* most systems
become internationalised enough to handle the most common cases :-(.
For the Unification problem, I think it should be possible to mark up
the UTF-8 document such that the display characteristics will be
preserved as part of the translation from the local encoding to the
UTF-8 encoding. For example, if one sends a Kanji, one could mark the
document language as "Japanese" or "Chinese", or even do it with some
kind of inline escape sequence (using the area reserved for
extensions).
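
As a rough illustration of the language-marking idea, here is a sketch in
Python. Everything in it is hypothetical: the "lang=" header line, the
TaggedText type, and the font names are inventions for the example, not
part of any standard. The point is only that the same unified code point
travels unchanged, while the language mark lets the receiving end choose
the appropriate glyph style.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaggedText:
    lang: str  # language mark, e.g. "ja" or "zh" (hypothetical convention)
    text: str  # the text itself, as Unicode


# Hypothetical mapping from language mark to a display font.
FONT_BY_LANG = {
    "ja": "hypothetical-japanese-font",
    "zh": "hypothetical-chinese-font",
}


def serialize(doc: TaggedText) -> bytes:
    """Put the language mark on a header line, then the text, all as UTF-8."""
    return f"lang={doc.lang}\n{doc.text}".encode("utf-8")


def parse(wire: bytes) -> TaggedText:
    """Recover the language mark and the text from the wire form."""
    header, _, body = wire.decode("utf-8").partition("\n")
    key, _, value = header.partition("=")
    if key != "lang":
        raise ValueError("missing language mark")
    return TaggedText(lang=value, text=body)


def pick_font(doc: TaggedText) -> str:
    """Choose a display font from the language mark, with a fallback."""
    return FONT_BY_LANG.get(doc.lang, "hypothetical-default-font")


# U+76F4 is a unified character that is conventionally drawn
# differently in Japanese and Chinese typography.
japanese = parse(serialize(TaggedText("ja", "直")))
chinese = parse(serialize(TaggedText("zh", "直")))
```

Here japanese.text and chinese.text are identical, and only the mark
differs, so the display decision stays with the application rather than
the codeset.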

When I discuss these ideas with people here, they are very receptive.
Many people say that the more conservative and governmental thinkers
are very stubborn, so things will not change quickly. I think that by
implementing such systems *now*, we can force them to accept what I
think is a very practical solution to handling the myriad character
encoding schemes extant in the world.

I should note that the stylesheet format which is being defined by
SGML-Open *will* be able to handle international display
characteristics like bidirectional text, though it will be an optional
feature in the lowest functionality version. The upgrade path will be
seamless, or close to it.