Re: Revised language on: ISO/IEC 10646 as Document Character Set

Wed, 10 May 95 20:02:12 EDT

> Disclaimer: I am not an expert on character sets. In addition, I
> cannot speak or write in Japanese or Chinese or any of the other
> languages here.

> I am an expert on character sets. If for no other reason, I can say this
> because I was sworn in as an expert witness in a Federal Court on the basis
> of my knowledge of characters, particularly those of non-Roman languages. In
> addition, I do speak and read Chinese (both modern & classical, both
> simplified & traditional) and also two other non-Han Asian languages, one
> of which has used Chinese characters for nearly two millennia.

> The problem is that some people don't agree that it's the same
> character. They believe that the language the character is
> associated with is part of the character that has to be preserved.

> The problem was that those people were not experts on characters. In
> fact, those persons know very little about character encoding principles
> as practiced in the larger National and International standards communities.
> Unfortunately, they were a noisy lot and you didn't have people to turn
> to who could have sufficiently responded to their objections to which you
> could not respond because of your lack of expertise.

I'm sorry, but I have to say that this simply isn't true. We did have access to
experts in this area, and we did consult them. At considerable length. This
included people involved in the initial ISO 10646 work, several people involved
with the Unicode work, as well as several others with substantial experience in
other aspects of character set work. We also had access to people higher up in
the ISO hierarchy (i.e. IESG equivalents) and we used those contacts as well.

There is only one word that adequately describes the results of all these
consultations, and that's "inconsistent". Part of this was due to the fact that
the initial ISO 10646 proposal was in the process of being shot down in flames
and it was far from clear that the new proposal would get much support, but
there were other sources of dissension as well. Procedural issues were also
raised that I haven't even talked about.

You may believe that there was a clear, unambiguous consensus of expert
opinions out there, but this is only true if you pick and choose the experts
you listen to pretty carefully.

> Therefore ISO-2022-JP is NOT a subset of ISO-10646.

> If this is true, then why has the Japanese National Standards Body prepared
> a new standard which specifies the full mapping from the character sets
> encapsulated by ISO-2022-JP to ISO/IEC 10646-1:1993? [That standard is called
> JIS X 0221.] That is, why does JIS consider ISO-2022-JP to be a subset of
> 10646?

I never said that the Japanese National Standards Body was a source of dissenting
views. In fact their position on these matters (as well as on several other
issues) has been brought to our collective attention many times.

> To be quite frank, the person that was doing the rather forceful arguing
> in the MIME discussions is not even an active participant in the Japanese
> National Standards Body groups which standardize character encodings in
> Japan. He does not represent the consensus of either knowledgable experts
> on character encoding in Japan or practical implementors of Japanese
> systems. It is unfortunate that you and others were put in a position
> where you had to accept his statements at face value without having
> knowledgable sources to turn to.

The assumption that a single individual is responsible for all this is also
false. There was one particularly vocal participant but he most certainly
wasn't the only one expressing similar views.

> The same does not have to be the case with respect to HTML and the Web.

> Unicode maps characters from different repertoires into single code
> positions. This is done to reduce the number of characters you need
> to something manageable.

> This latter statement is quite untrue and misrepresentative. Reduction in
> code space was emphatically not the reason for undertaking Han unification.

Not that it matters, but this isn't what I said. I neither said nor implied
that reduction in _code space usage_ was a motivation.

> The only problem I see here is the notion that the charset has to be
> a subset of ISO 10646. This, as far as I can tell, is a relatively
> new notion and, I think, a very dangerous one that is best avoided if
> at all possible.

> Personally, I have never said this nor is there a need to specify this.
> I have recently indicated that this is not a requirement imposed by SGML
> (that is, a requirement that one can only have data characters which are
> also found in the document character set).

You didn't say it, but it is what the text that has been quoted on this list
currently says (or can be interpreted to say). And given what you say here, I
see absolutely no remaining reason to retain the wording about subsets of
ISO 10646.

> I think it has been pointed out here before that SGML requires a document
> character set, that neither ISO-2022 nor any of its usages constitutes a character set,
> and that it is desirable to choose a document character set which covers the
> widest array of linguistic territory. If there is any other solution than
> specifying 10646 as a standard document character set which at the same time
> has as significant a linguistic coverage as 10646, then I'd be pleased to
> hear about it (provided it is a recognized standard). Otherwise, I'd suggest
> that any further discussion about which document character set to use is
> fruitless.

I never said that standardizing on ISO 10646 as a document character set is a
bad idea. On the contrary, I think it is a very good idea and I fully support
doing it. My only objection is, and has always been, to the wording that states
that valid charsets have to be subsets of ISO 10646. I think this needs to be

> Furthermore, the issue of which document character set to choose and which
> Content-Type encoding to use are completely unrelated. As Larry M. has also
> pointed out, the HTML spec should not and need not say anything about the
> transport encoding (other than it exists and that it may be different from
> the document character set.)

Agreed. But it currently does say something about this, and that's what I'd
like to change.
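
To make the distinction concrete, here is a rough sketch in Python. It is only
an illustration: the sample string and the variable names are my own
assumptions, and it relies on Python's built-in iso2022_jp codec rather than
on anything in the HTML draft. The point is that the bytes on the wire and the
ISO 10646 code positions are two separate views of the same characters.

```python
# Hypothetical sample text: three Han characters used in Japanese,
# written with escapes so the source stays pure ASCII.
text = "\u65e5\u672c\u8a9e"

# Transport view: the byte stream that a charset label such as
# ISO-2022-JP describes (7-bit bytes with escape sequences).
wire_bytes = text.encode("iso2022_jp")

# Document character set view: decode the transport bytes, then list
# each character by its ISO 10646 code position.
decoded = wire_bytes.decode("iso2022_jp")
code_positions = [f"U+{ord(ch):04X}" for ch in decoded]

print(wire_bytes)
print(code_positions)
```

Nothing in this round trip requires the transport charset to be described as a
subset of ISO 10646. All it needs is a defined mapping from the transport
bytes to code positions in the document character set.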