Re: Draft minutes from Stockholm

Glenn Adams (glenn@stonehand.com)
Sat, 29 Jul 95 08:19:47 EDT

Date: Fri, 28 Jul 95 17:22:39 EDT
From: eric@rafiki.spyglass.com (Eric W. Sink)

There is a problem with unicode that it uses the same code to
represent different glyphs as a function of the language being Chinese
or Korean.

I object to this being characterized as a "problem" with Unicode. It
may be an implementation issue, but it is not a "problem." In fact
it is a principal design feature of Unicode, and, for that matter, all
international and most national character sets. Namely, character set
standards encode characters, not glyphs. That is, except for certain
compatibility cases, distinct glyphs which may depict a character are
not distinctly encoded.

For example, a dollar sign may be depicted with two glyphs, one using
a single vertical line, another using two vertical lines. A lower
case 'a' is depicted in one fashion in a Helvetica font and another
fashion in a Times font. In both of these cases, it is not necessary
to distinctly encode the glyphs since basic text processes consider
them as equivalent, and since readers are capable of interpreting either
glyph unambiguously. The same case holds in the CJKV uses of Han
ideographs. Although conventional usage may prefer a particular form
in China and another form in Japan, a reader is still able to recognize
either form as an instance of the same character.

An HTML user agent may be characterized as primarily mono-lingual or
as primarily multi-lingual. The vast majority of users only require
the former (here I consider small, fixed combinations of languages to
be inherently mono-lingual -- e.g., Japanese and English).

For the East Asian market, there are primarily four mono-lingual
configurations: Japanese, Korean, Chinese (PRC), and Taiwanese (ROC).
In each of these cases, the primary fonts available will contain glyphs
for those ideographs found in JIS X 0208, KS C 5601, GB 2312, and BIG5,
respectively. Looking only at the ideographic characters in these
standards, this amounts to covering only a small subset of the CJK
Unified Ideographs of ISO/IEC 10646 (Unicode 1.1):

    Font Encoding       # Ideograph Glyphs

    JIS X 0208-1990       6356
    KS C 5601-1987        4620
    GB2312-80             6763
    BIG5                 13053

Given that Unicode contains 20,902 distinct ideographs, none of the
fonts currently available on these platforms will even come close to
providing full display support for the entire Unicode Han repertoire.
Consequently, for the present and for some time to come, user agents in
East Asia will be able to display completely only those texts whose
characters are covered by their system fonts.
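
To put rough numbers on this, here is a minimal sketch (Python is used
purely for illustration) that divides each count in the table above by the
20,902 unified ideographs. It assumes that every ideograph in a set
corresponds to exactly one unified ideograph; the names in the code are
made up for this example.

```python
# Ideograph counts exactly as given in the table above.
IDEOGRAPH_COUNTS = {
    "JIS X 0208-1990": 6356,
    "KS C 5601-1987": 4620,
    "GB2312-80": 6763,
    "BIG5": 13053,
}

# Number of CJK Unified Ideographs cited in the text.
UNIFIED_HAN_TOTAL = 20902


def coverage_percent(count, total=UNIFIED_HAN_TOTAL):
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100.0 * count / total, 1)


if __name__ == "__main__":
    for name, count in IDEOGRAPH_COUNTS.items():
        print(f"{name:<16} {count:>6} {coverage_percent(count):>5}%")
```

Under that assumption, the largest of the four sets covers a little under
two thirds of the unified repertoire, and the others about a third or less.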

Given the above, when a Japanese UA receives a Japanese HTML file
encoded in Unicode, it will be able to display the entire file using the
expected font(s) (assuming the Japanese HTML data limited itself to
those ideographs present in JIS X 0208). Similarly, when a Chinese UA
receives a Chinese HTML file encoded in Unicode, it will be able to
display the entire file using the expected font.

Now, if such a mono-lingual UA receives HTML data in a language its
fonts don't fully cover, it may still be able to display a legible
document using glyphs its user is familiar with. For example,
the fonts supporting GB2312 can display kana and many kanji in JIS X 0208,
though not all. Thus, if a Chinese UA receives a Japanese document, the
user may be able to read a great deal of it (provided they read Japanese).
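
One way to estimate this kind of partial coverage is to ask how much of a
text can be round-tripped through the legacy character set that a font
supports. The sketch below uses Python's built-in codecs ("gb2312",
"shift_jis", and so on) as a stand-in for a font's repertoire. That is an
assumption: a real font may cover more or fewer characters than the codec,
and the function name is invented for this example.

```python
def displayable_fraction(text, codec):
    """Return the fraction of non-space characters in text that codec can encode.

    The codec is only a proxy for the repertoire of a font built
    around the corresponding character set.
    """
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 1.0
    encodable = 0
    for c in chars:
        try:
            c.encode(codec)
            encodable += 1
        except UnicodeEncodeError:
            # No code point for this character in the legacy set.
            pass
    return encodable / len(chars)


if __name__ == "__main__":
    # "Nihongo desu": three kanji followed by two hiragana.
    japanese = "\u65e5\u672c\u8a9e\u3067\u3059"
    for codec in ("shift_jis", "gb2312"):
        print(codec, displayable_fraction(japanese, codec))
```

In this sample the kana and two of the three kanji have GB 2312 code
points, while U+8A9E does not, since GB 2312 carries only its simplified
counterpart.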

The above pretty much describes how Unicode will be used in the vast
majority of cases in East Asia, particularly until fonts are available
which cover the full 20,902 ideographs in Unicode. Though see the new
announcement of such a font from DynaLab:

http://www.stonehand.com/unicode/products/dynalab.html

When it comes time to implement the general, multi-lingual East Asian
solution, Unicode will be the perfect vehicle when combined with the
proposed LANG attribute (actually an architectural form). Assuming a
UA has multiple East Asian fonts designed for each cultural domain, the
LANG attribute can be used to select an appropriate font collection
to be employed when displaying content bound to a particular language.
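
As a rough sketch of that selection step, the following maps a LANG value
to a font collection. The tag spellings ("ja", "zh-TW", and so on) follow
the RFC 1766 style and are an assumption, as are the collection names,
which are placeholders rather than real fonts.

```python
# Hypothetical font collection names; real names depend on the platform.
FONT_COLLECTIONS = {
    "ja": "japanese-jisx0208",
    "ko": "korean-ksc5601",
    "zh-cn": "chinese-gb2312",
    "zh-tw": "chinese-big5",
}


def select_font_collection(lang, default="system-default"):
    """Pick a font collection for a language tag.

    The full tag is tried first, then its primary subtag; anything
    unknown (or a missing tag) falls back to the default collection.
    """
    if not lang:
        return default
    tag = lang.strip().lower()
    if tag in FONT_COLLECTIONS:
        return FONT_COLLECTIONS[tag]
    primary = tag.split("-", 1)[0]
    return FONT_COLLECTIONS.get(primary, default)


if __name__ == "__main__":
    for lang in ("ja", "ko-KR", "zh-CN", "zh-TW", None):
        print(lang, "->", select_font_collection(lang))
```

A bare "zh" falls through to the default here, because without a region
the tag does not say which Chinese font collection is wanted.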

Even in cases where no LANG attribute is specified, it is trivial to
determine whether a Unicode encoded mono-lingual document is Japanese,
Korean, Chinese, or Taiwanese (or traditional Chinese) using the
following heuristics:

(1) Japanese sentences contain kana (hiragana and/or katakana) characters.
Japanese sentences do not use hangul.

(2) Korean sentences contain hangul characters (in fact Han ideographs --
hanja -- are hardly used at all today). Korean sentences do not use
kana.

(3) Chinese (PRC) sentences contain simplified Chinese characters, e.g.,
the common classifier GE4 is written in the PRC with U+4E2A but with
U+500B in Taiwan. Chinese does not use kana or hangul.

(4) Taiwanese (ROC) sentences contain traditional Chinese characters which
are simplified in China, e.g., U+500B. This holds for other cultural
uses of traditional Chinese writing, such as found in Hong Kong, and
in overseas Chinese communities.
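
The four heuristics above can be sketched as follows. This is an
illustration under several assumptions, not a definitive detector: the
code point ranges follow later versions of Unicode, in which the hangul
syllables occupy U+AC00..U+D7A3, so adjust them for the version in use;
the simplified and traditional sets are tiny samples that a real
implementation would replace with full tables; and the returned tags are
hypothetical labels.

```python
# Kana letters only (punctuation and length marks are left out on purpose).
KANA_RANGES = [(0x3041, 0x3096), (0x30A1, 0x30FA)]

# Hangul jamo, compatibility jamo, and precomposed syllables.
HANGUL_RANGES = [(0x1100, 0x11FF), (0x3130, 0x318F), (0xAC00, 0xD7A3)]

# Tiny sample sets, for illustration only.
SIMPLIFIED_SAMPLE = {"\u4e2a", "\u8fd9", "\u4eec"}
TRADITIONAL_SAMPLE = {"\u500b", "\u9019", "\u5011"}


def _in_ranges(ch, ranges):
    """Return True if the code point of ch falls in any (low, high) range."""
    cp = ord(ch)
    return any(low <= cp <= high for low, high in ranges)


def guess_language(text):
    """Guess the language of a mono-lingual East Asian text.

    Returns "ja", "ko", "zh-CN", "zh-TW", or None when nothing decides it.
    """
    # (1) Any kana marks the text as Japanese.
    if any(_in_ranges(c, KANA_RANGES) for c in text):
        return "ja"
    # (2) Any hangul marks the text as Korean.
    if any(_in_ranges(c, HANGUL_RANGES) for c in text):
        return "ko"
    # (3) and (4) Compare simplified and traditional indicator characters.
    simplified = sum(1 for c in text if c in SIMPLIFIED_SAMPLE)
    traditional = sum(1 for c in text if c in TRADITIONAL_SAMPLE)
    if simplified > traditional:
        return "zh-CN"
    if traditional > simplified:
        return "zh-TW"
    return None


if __name__ == "__main__":
    samples = [
        "\u3053\u308c\u306f\u65e5\u672c\u8a9e",  # Japanese, contains hiragana
        "\ud55c\uad6d\uc5b4",                    # Korean, hangul only
        "\u8fd9\u4e2a",                          # uses U+4E2A
        "\u9019\u500b",                          # uses U+500B
    ]
    for s in samples:
        print(s, "->", guess_language(s))
```

Kana and hangul are tested first because Japanese and Korean text may also
contain ideographs that appear in one of the Chinese sample sets.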

In the case of a multi-lingual document, there is no substitute for using
the LANG attribute or some other marker of the language. However, this type
of document will be a rarity and in any case will be readable only on systems
that have much larger font collections.

If you want more information on the above subjects, see my upcoming paper
"Unicode and the WWW: Supporting Unicode in HTTP and HTML" to be presented
at the 7th International Unicode Conference (see http://unicode.org for more
info).

Regards,
Glenn Adams