Re: Revised language on: ISO/IEC 10646 as Document Character Set

Glenn Adams (
Wed, 10 May 95 18:24:11 EDT

Date: Wed, 10 May 95 16:56:05 EDT
From: Ned Freed <>

Disclaimer: I am not an expert on character sets. In addition, I
cannot speak or write in Japanese or Chinese or any of the other
languages here.

I am an expert on character sets. If for no other reason, I can say this
because I was sworn in as an expert witness in a Federal Court on the basis
of my knowledge of characters, particularly those of non-Roman languages. In
addition, I do speak and read Chinese (both modern & classical, both
simplified & traditional) and also two other non-Han Asian languages, one
of which has used Chinese characters for nearly two millennia.

The problem is that some people don't agree that it's the same
character. They believe that the language the character is
associated with is part of the character that has to be preserved.

The problem was that those people were not experts on characters. In
fact, they know very little about character encoding principles
as practiced in the larger national and international standards communities.
Unfortunately, they were a noisy lot, and you had no one to turn to who
could have adequately answered the objections that you, lacking the
expertise, could not answer yourself.

Therefore ISO-2022-JP is NOT a subset of ISO-10646.

If this is true, then why has the Japanese National Standards Body prepared
a new standard which specifies the full mapping from the character sets
encapsulated by ISO-2022-JP to ISO/IEC 10646-1:1993? [That standard is called
JIS X 0221.] That is, why does JIS consider ISO-2022-JP to be a subset of
ISO/IEC 10646?
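
The idea of a full mapping from ISO-2022-JP to 10646 can be sketched with
present-day tools. The snippet below is only an illustration, not part of
the original message: it assumes a modern Python 3 whose standard library
ships an "iso2022_jp" codec and Unicode character names, and the sample
text is an arbitrary choice.

```python
import unicodedata

# Three kanji written as escapes so the snippet itself stays ASCII.
# (The sample text is an arbitrary choice for illustration.)
text = "\u65e5\u672c\u8a9e"

# Encode as ISO-2022-JP: the codec emits ESC $ B to switch into
# JIS X 0208 and ESC ( B to return to ASCII at the end.
wire = text.encode("iso2022_jp")

# Decode back and list the code point each character maps to.
decoded = wire.decode("iso2022_jp")
for ch in decoded:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")

code_points = [ord(ch) for ch in decoded]
```

Every character that arrives in the ISO-2022-JP byte stream comes back
as a single 10646 code position, and the round trip loses nothing.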

To be quite frank, the person who was doing the rather forceful arguing
in the MIME discussions is not even an active participant in the Japanese
National Standards Body groups which standardize character encodings in
Japan. He does not represent the consensus of either knowledgeable experts
on character encoding in Japan or practical implementors of Japanese
systems. It is unfortunate that you and others were put in a position
where you had to accept his statements at face value without having
knowledgeable sources to turn to.

The same does not have to be the case with respect to HTML and the Web.

Unicode maps characters from different repertoires into single code
positions. This is done to reduce the number of characters you need
to something manageable.

This latter statement is quite untrue and misrepresentative. Reduction in
code space was emphatically not the reason for undertaking Han unification.

The only problem I see here is the notion that the charset has to be
a subset of ISO 10646. This, as far as I can tell, is a relatively
new notion and, I think, a very dangerous one that is best avoided if
at all possible.

Personally, I have never said this, nor is there a need to specify it.
I have recently indicated that this is not a requirement imposed by SGML
(that is, a requirement that one can only have data characters which are
also found in the document character set).

I think it has been pointed out here before that SGML requires a document
character set, that neither ISO-2022 nor any of its usages constitutes a
character set, and that it is desirable to choose a document character set
which covers the widest array of linguistic territory. If there is any other
solution than specifying 10646 as a standard document character set which at
the same time has as significant a linguistic coverage as 10646, then I'd be
pleased to hear about it (provided it is a recognized standard). Otherwise,
I'd suggest that any further discussion about which document character set
to use is

Furthermore, the questions of which document character set to choose and
which Content-Type encoding to use are completely unrelated. As Larry M. has
also pointed out, the HTML spec should not and need not say anything about
the transport encoding (other than that it exists and that it may be
different from the document character set).
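
As a rough illustration of why the two choices are independent, the
sketch below sends the same text through several transport encodings and
compares the code points that come out. It is not taken from any spec:
the codec names are those of a modern Python 3 standard library, and the
sample text and the list of encodings are assumptions picked for the
example.

```python
# Sample text (an arbitrary choice): three kanji followed by ASCII.
TEXT = "\u65e5\u672c\u8a9e HTML"

# Transport encodings to compare; the names are Python codec names.
ENCODINGS = ["iso2022_jp", "euc_jp", "shift_jis", "utf_8", "utf_16_be"]

wire_forms = {}
decoded_points = {}
for enc in ENCODINGS:
    wire = TEXT.encode(enc)            # bytes as they would travel
    wire_forms[enc] = wire
    # After decoding, only code points in the document character set remain.
    decoded_points[enc] = [ord(ch) for ch in wire.decode(enc)]
    print(f"{enc:12} {wire.hex()}")

# The byte streams differ, but the decoded code points are identical.
all_same = all(p == decoded_points["utf_8"] for p in decoded_points.values())
print("same code points after decoding:", all_same)
```

The bytes on the wire differ from one encoding to the next, yet what the
parser sees after decoding is the same sequence of code positions in
every case.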

Glenn Adams