Re: Revised language on: ISO/IEC 10646 as Document Character Set

Glenn Adams (glenn@stonehand.com)
Wed, 10 May 95 09:34:04 EDT

Date: Wed, 10 May 95 00:33:50 EDT
From: erik@netscape.com (Erik van der Poel)

> SGML parsers parse documents in the "document character set",

Not necessarily. SGML parsers parse documents using a system character
set. That system character set must not be inconsistent with the
document character set, but it does not have to be identical to the
document character set.
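
A rough sketch of this distinction, not taken from the original
exchange. It assumes ISO/IEC 10646 as the document character set and
uses Python's shift_jis and euc_jp codecs as stand-ins for two different
system character sets. The function names are made up for illustration.

```python
# Illustrative only: assumes the document character set is ISO/IEC 10646
# and treats a Python codec name as the "system character set".

def code_position(raw: bytes, system_charset: str) -> int:
    """Decode one character stored in the system character set and
    return its code position in the document character set (10646)."""
    text = raw.decode(system_charset)
    if len(text) != 1:
        raise ValueError("expected exactly one character")
    return ord(text)

def numeric_reference(position: int) -> str:
    """Numeric character references are expressed in the document
    character set, whatever the storage encoding is."""
    return f"&#{position};"

# The same kanji stored under two different system character sets.
kanji = "\u65e5"
sjis_bytes = kanji.encode("shift_jis")
euc_bytes = kanji.encode("euc_jp")
```

The stored bytes differ between the two systems, yet both resolve to the
same code position, which is the sense in which a system character set
can be consistent with the document character set without being
identical to it.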

> My only excuse is that I haven't looked at those tables for a while, and
> in the past they *did* appear to have the problem I mentioned. Sorry.

All the zenkaku (fullwidth) roman characters have been in Unicode since
version 1.0. Early versions of the mapping tables, however, only specified
Kanji mappings, while the mappings of the non-Han characters were still
being determined. Whether the tables had those mappings or not had no
bearing on whether Unicode was a superset from its first release -- which
it was.

> As far as I can recall, someone even asked about
> these new Taiwanese character sets on the net, and Glenn (or some other
> Unicoder) answered that Unicode/10646's repertoire was frozen before
> those Taiwanese character sets hit the streets.

It is correct that the latest version of CNS 11643 (1994) contains
characters which are not yet in 10646/Unicode. 10646/Unicode is not
frozen for all time, though. New characters will be added. This
addition process takes time and depends on submissions from National
Standards Bodies. So far, Taiwan (which is not an ISO P-Member, although
TCA is a class C liaison ISO member) has not submitted the full
collection of new characters in CNS 11643 to the Ideographic Rapporteur
Group (SC2/WG2/IRG), whose current job includes specifying an extension
to the Han repertoire in 10646 (which will eventually go into Unicode).
Furthermore, neither TCA nor Taiwan has registered this new version with
ISO for use with ISO 2022. Given these facts, you should not infer that
10646 either is inadequate now or will remain inadequate. It remains to
be seen whether any user community will even develop in Taiwan around
this new standard (BIG5 being the current widespread standard).

Even if 10646 is the document character set, it remains possible to
represent and process all of the characters in this new version of 11643
(or any other character repertoire which includes characters not present
in 10646); namely, by using SDATA general entities.
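
As a minimal sketch of that approach (an illustration, not something
from the original exchange): characters that have a 10646 code position
are written as numeric character references, and anything else falls
back to a reference to an SDATA general entity. The entity naming
scheme, the bracketed replacement text, and the CNS code "3-2121" are
all placeholders chosen for the example; no claim is made that this
particular code is absent from 10646.

```python
# Hypothetical model: each character is a (source, code) pair.
#   ("ucs", 0x4E2D)   -> has a 10646 code position
#   ("cns", "3-2121") -> placeholder CNS 11643 code assumed to have none

def entity_name(cns_code: str) -> str:
    """Build a made-up entity name for a CNS 11643 code."""
    return "cns." + cns_code

def entity_declaration(cns_code: str) -> str:
    """Return an SGML SDATA general entity declaration.
    The replacement text is system-specific; the bracketed form
    used here is only a convention."""
    name = entity_name(cns_code)
    return f'<!ENTITY {name} SDATA "[{name}]">'

def encode(chars) -> str:
    """Serialize characters, using entity references where the
    document character set has no code position."""
    out = []
    for source, code in chars:
        if source == "ucs":
            out.append(f"&#{code};")               # numeric character reference
        else:
            out.append(f"&{entity_name(code)};")   # SDATA entity reference
    return "".join(out)
```

For example, encode([("ucs", 0x4E2D), ("cns", "3-2121")]) yields
"&#20013;&cns.3-2121;", and the matching declaration would go in the
DTD or an entity set shared by the systems exchanging the document.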

> So, how about defining the "document character set" to be the union
> of the "charset" and 10646?

We can't redefine "document character set". It already has a fixed,
known definition. Furthermore, as I indicated above, it isn't even
necessary.

Glenn