Re: Revised language on: ISO/IEC 10646 as Document Character Set

Glenn Adams (glenn@stonehand.com)
Wed, 10 May 95 21:23:54 EDT

>>> Unicode maps characters from different repertoires into single code
>>> positions. This is done to reduce the number of characters you need
>>> to something manageable.

>> This latter statement is quite untrue and misrepresentative.
>> Reduction in code space was emphatically not the reason for
>> undertaking Han unification.

> Not that it matters, but this isn't what I said. I neither said
> nor implied that reduction in _code space usage_ was a motivation.

You did say "... [in order] to reduce the number of characters you need".
To me that equates to "code space usage"; I'm not sure how else to
interpret it. In any case, Han unification was undertaken neither to
"reduce the number of characters" nor to reduce code space usage (if you
deem these to be different).

The basic reasons for performing Han unification are quite simple:

1. 10646/Unicode encodes scripts independently of the writing systems
which use them. [For example, it encodes the Latin script once even
though it is used to write over a thousand languages.]

2. 10646/Unicode identifies the collection of symbols used by one
writing system with the collection of symbols used by another writing
system if (a) those collections of symbols derive from the same historical
source, and (b) such identification does not sacrifice the distinctions
necessary to support basic character processes or properties. The result
of such identification is the union of the two or more collections of
symbols, which is collectively called a script.

3. Basic character processes are defined to be:

(1) identification of lexical category (letter, ideograph, digit, etc.)
(2) case conversion and normalization (for those symbols which have case)
(3) numeric evaluation of digits
(4) minimum legibility in the display thereof
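
To make processes (1) through (3) concrete, here is a minimal sketch (in
Python, whose unicodedata module is my choice of illustration, not
something mandated by 10646/Unicode); process (4) is a display matter and
is not shown:

    import unicodedata

    # (1) lexical category, (2) case conversion, (3) numeric evaluation
    # for: Latin capital A, the ideograph "kan", the digit 7, and the
    # Han numeral "five".
    for ch in ("A", "\u6F22", "7", "\u4E94"):
        print(ch,
              unicodedata.category(ch),       # Lu, Lo, Nd, Lo
              ch.lower(),                     # identity for caseless symbols
              unicodedata.numeric(ch, None))  # None, None, 7.0, 5.0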

Other potential character processes or properties, which are excluded
and are not considered basic, are:

-- collation order
-- language
-- typographic quality display

4. In addition, the source set separation rule excludes unification where
it would otherwise be called for by the above. [The source set separation
rule says that two characters which are distinct in a source standard must
remain distinct in their 10646/Unicode encoding. This facilitates
round-trip conversion between existing source-set-encoded data and
10646-encoded data.]
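
To make the round-trip guarantee concrete, here is a minimal sketch (in
Python; the language and its Shift-JIS codec are my choice of
illustration, standing in for any source standard):

    # The ideographs "kan" and "ji" as Shift-JIS bytes (a source standard).
    src = bytes([0x8A, 0xBF, 0x8E, 0x9A])
    text = src.decode("shift_jis")           # source bytes -> 10646/Unicode
    assert text == "\u6F22\u5B57"            # each source character stays distinct
    assert text.encode("shift_jis") == src   # ...and converts back losslessly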

Given the above principles, Chinese ideographs (Hanzi), Japanese ideographs
(Kanji), Korean ideographs (Hanja), and Vietnamese ideographs (chữ Hán Nôm)
were identified as collections of symbols which derive from a single
historical source (the Han dynasty of China) and whose identification does
not sacrifice the basic processes specified above. [It is useful to note
that within these four languages, the words used to designate ideographs
all mean "Han character(s)".]

A common misperception (and misrepresentation) has been that Unicode and/or
ISO initiated the Han unification. This was not the case. The unification
process began within Asia (China, Taiwan, Korea, Vietnam). Both Taiwan (ROC)
and China (PRC) had made significant progress towards unification prior to
the involvement of Western standards organizations. In fact, certain types of
unification were considered and executed in the design and specification of
JIS X 0208. The actual rules for unification that were followed, and which
continue to be followed in ISO/IEC JTC1/SC2/WG2/IRG (Ideographic Rapporteur
Group), were primarily developed in Japan and agreed to by all the concerned
national standards bodies in Asia. [I am an active participant in the IRG
and have variously represented both Unicode and the US (ANSI) in this forum
in the recent past. Most IRG meetings have had only one or two participants
from the West, with most of the active work being done in Japan and China.]

Since I didn't participate in the earlier MIME discussions, I'm afraid I
can't comment on how well the facts were represented. However, I have had
plenty of opportunity to communicate with certain vocal participants who
fail to understand or acknowledge the above principles. I have no objection
to someone requesting that the above principles be modified according to
their perceived needs; at the same time, though, I believe the principles
above were chosen based not so much on theory as on well-established
conventions and principles currently embodied by existing character set
standards. I think most Westerners would be surprised if I suggested that
the letters A-Z used to write French constituted a different script from
that used to write German, and that, consequently, the two should be
encoded separately. This is no different from what certain people have
suggested is required for Japanese vs. Chinese, etc. I happen to reject
such an argument not on theoretical grounds but on practical grounds:
there is no identified need embodied by current character encoding
practice that would admit of such a marked departure from existing
practice.

Before someone asks: Latin, Cyrillic, and Greek cannot be unified without
sacrificing both case conversion semantics and the source set separation
rule.
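
A small sketch of the case-conversion point (Python again; the example is
mine, not anything in the standard): three look-alike capitals lowercase
to three different letters, which is only possible because each script
retains its own code points:

    # Latin CAPITAL B, Cyrillic CAPITAL VE, and Greek CAPITAL BETA look
    # alike, but each lowercases within its own script.
    for cap in ("B", "\u0412", "\u0392"):
        print("U+%04X -> %s" % (ord(cap), cap.lower()))  # b, ve, beta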

Again, I suspect most readers of this list aren't all that interested in
this topic (but then again I could be wrong). I've asked for a separate
list to filter this traffic away from those who aren't. But if there's no
consensus, I guess we're stuck with continuing this conversation here.

In any case, I agree with you that the HTML spec need not require the
set of processable data characters to be limited to those in the document
character set.

Regards,
Glenn