i18n and the WWW

Gavin Nicol (gtn@ebt.com)
Wed, 7 Dec 94 22:11:20 EST

>Gavin Nicol (gtn@ebt.com) recently proposed using Unicode as a kind of
>character set "bus". Servers and clients would deal with their own, local

Yes, but a very important part of my proposal is the auto-tagging
performed in the conversion process. Simply using Unicode, with no
disambigutation tags, will cause information loss. In addition to the
Han Unification problem, my proposal could probably be used to extend
the coverage for the languages that Unicode does not cover
completely. In addition, I think it is very important to maintain the
presentational differences of local encoding systems. For example, on
certain Japanese platforms, zenkaku and hankaku are used extensively,
and in Japanese typography, the Kinsoku (rendering) rules are
different. In Unicode, they have the same code position I believe. As
you note toward the end of your document, such "hints" are not
generally required to get legible renderings, but they are *vital* for
getting top quality output.

>1. Since there is complete round-trip mapping between Unicode and the
>character sets in use today.
I understand that it does not cover Thai completely, and that it only
covers 12 of the 15 Indian languages. I may be wrong though.

>The only disadvantage I see is how to enable such use of Unicode while
>still supporting existing clients who don't understand it (the issue raised
>by Sandra O'Donnell of OSF). In the absence of a character set negotiation
>protocol in HTTP, it's not clear how to do this. It is certainly not the
Yes, and as I pointed out, the "text/html; charset=xxxxx" is not
sufficient. We need a proper field in HTTP.

>clients. It has the disadvantage that Asian characters take three bytes
>each, a 50% increase in overhead over the encoding methods used now.
Are all Asian encodings 2 bytes? I believe Thai is 4....

>Straight Unicode results in no overhead for Asian users, but users of
>8859-x will pay double overhead. Since it is very easy to mechanically
>convert between UTF-8 and Unicode, if there were a character set
>negotiation protocol, you could select the appropriate encoding based on
>client preferences.

Yes, and as another optimisation, 2 systems with the same local
character set and encoding could use that (SJIS->SJIS). In the case of
SGML/HTML, this buys you little though: in the parser, it is simpler
to have everything be 16 bits wide. That is, you have:

Machine #1 Machine #2 Parser
SJIS--------------->SJIS--(conv)------->Unicode/Wide Characters
SJIS----(conv)----->UTF---(conv)------->Unicode/Wide Characters

>Another issue with using straight Unicode is that the MIME spec (at least,
>the new draft version) specifies that all subtypes of the "text" content
>type must use CRLF (0x0D 0x0A) line feed conventions, which rules out any
>character set that does not use ASCII as a base, and certainly rules out a
>16 bit character set like Unicode.

So this suggests UTF-8 or UTF-7, or a change in the MIME standard.

>Also note that this only occurs when the recipient has a multilingual,
>multi-character set operating system with the right fonts installed. This
>capability is still a rarity. In all other cases, the language tagging
>won't make any difference in the end result.

Planning for the future is never a bad thing. The
language/presentational hint auto-tagging can be ignored by many
current systems (and it *could* be part of format negotiation), but as
Unicode aware systems become more popular, they will be useful.

>In conclusion, I'm pretty familiar with Unicode but not very familiar
>with HTTP and HTML. I am hoping to work together with the people on
>this list to enable use of Unicode on WWW.

Nice to have you here! I am not a Unicode expert....

Anyway, as I said in a recent posting, and again, above: just
"text/html; charset=xxx" will not suffice for all the upcoming
document formats, so we need a specific field in HTTP allocated for
charset negotiation, and I would like to see UTF-7 or UTF-8 strongly
recommended as the preferred encoding for documents in which the
default MIME and HTML (US-ASCII and 8859-1) do not suffice. I will do
the necessary editorial work if need be.

Of course, for quick acceptance, we also need *free* libraries (in the
GNU/BSD sense) for Unicode conversion, and Unicode font mapping. I
would be *very* interested in hearing about peoples experiences
implementing Unicode based systems. My main experience is in a
windowing system I am writing for an experimental OS (shades of

Gavin Nicol gtn@ebt.com
*Not* speaking for EBT.