Re: Revised language on: ISO/IEC 10646 as Document Character Set

Ned Freed (NED@SIGURD.INNOSOFT.COM)
Wed, 10 May 95 16:56:54 EDT

Disclaimer: I am not an expert on character sets. In addition, I cannot speak
or write in Japanese or Chinese or any of the other languages here. I am,
however, a "veteran" of MIME character set "wars", and I like to think I
learned something about people's positions on character set issues from it all.
All I'm doing here is repeating some of the things I believe are the positions
of others. I apologize in advance if I misrepresent anyone's views or misuse
the terminology in some way.

> Now, the problem is that there may be MIME/HTTP "charsets" that cannot
> be mapped to 10646 (the proposed document character set). Earlier, I
> gave an example: iso-2022-jp. But Glenn pointed out that this example
> is incorrect. Having re-checked the JIS X 0201 and 0208 vs Unicode
> tables today, I now find that he appears to be correct. My only excuse
> is that I haven't looked at those tables for a while, and in the past
> they *did* appear to have the problem I mentioned. Sorry.

You're missing the point of the objection that was raised in the original MIME
work (and will almost certainly be raised again if present language is left
unchanged). Yes, there are mappings from ISO-2022-JP to Unicode and vice versa.
And these mappings appear to be fully invertible from a mathematical
perspective, so that you can go from one to the other without any loss of
information.
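
To make the invertibility claim concrete, here is a minimal Python sketch of
such a round trip. It assumes Python's built-in "iso2022_jp" codec, and the
sample string is purely illustrative. It shows only that the code-level
mapping is lossless for this text; it says nothing about whether the
characters on either side are "the same".

```python
# Round-trip a short Japanese string through Python's built-in iso2022_jp codec.
# The sample text is an illustrative assumption, not taken from the discussion.
sample = "日本語のテキスト"

encoded = sample.encode("iso2022_jp")    # Unicode -> ISO-2022-JP bytes
decoded = encoded.decode("iso2022_jp")   # ISO-2022-JP bytes -> Unicode

# True when no information was lost at the level of code points.
round_trip_ok = decoded == sample
print(round_trip_ok)
```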

It doesn't matter. The issue isn't whether or not such a mapping exists,
it's whether or not the mapping is correct, and the characters really are
equivalent.

Unicode maps characters from different repertoires into single code positions.
This is done to reduce the number of characters you need to something
manageable. The result is that, say, a Hanzi character, a Kanji character, and
a Hanja character all end up in the same position.
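
The effect of this unification can be sketched in a few lines of Python. The
choice of U+4E2D as the example character and the three codec names are
assumptions made for illustration only. The same character is encoded in a
Chinese, a Japanese, and a Korean encoding, and every one of them decodes to
the same code position, so nothing in the decoded text records which
repertoire it came from.

```python
# U+4E2D is an illustrative unified Han code point chosen for this sketch;
# the discussion itself does not name a specific character.
unified = "\u4e2d"

# Python codec names for three national encodings (assumed for illustration).
national_encodings = {
    "Chinese (GB 2312)": "gb2312",
    "Japanese (ISO-2022-JP)": "iso2022_jp",
    "Korean (EUC-KR)": "euc_kr",
}

decoded_code_points = set()
for label, codec in national_encodings.items():
    raw = unified.encode(codec)      # byte form in the national encoding
    back = raw.decode(codec)         # decoded back into Unicode
    decoded_code_points.add(ord(back))
    print(f"{label}: bytes={raw!r} -> U+{ord(back):04X}")

# All three sources collapse to one code position; the source language is gone.
all_same = decoded_code_points == {0x4E2D}
```

If the language has to survive, it must be carried alongside the text by some
other means, which is exactly what the objection described next is about.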

You say: So what? It's the same character, after all, so why not map them all to
the same code position?

The problem is that some people don't agree that it's the same character. They
believe that the language the character is associated with is part of the
character and has to be preserved. According to this logic you can talk about
mapping from ISO-2022-JP to something you might call ISO-10646-JP, but you
cannot map to generic ISO-10646, and therefore ISO-2022-JP is NOT a subset
of ISO-10646. (Some have even gone so far as to assert that ISO-10646 does not
meet the requirements of being a character set.)

This argument was raised, quite forcefully, during the MIME work. Speaking as
one of the coauthors of MIME, I felt that the right thing to do in MIME was to
try to move to some sort of universal character set that could represent all of
the world's characters. Lots of other people felt this way as well, and some
felt that either ISO 10646 or Unicode (they were completely different critters
back then) was the way to go. Other people felt, however, that neither of these
character sets was adequate. There was a huge battle and no consensus was ever
reached. This is why the MIME specification now says:

NOTE: Beyond US-ASCII, an enormous proliferation of character sets is
possible. It is the opinion of the IETF working group that a large number of
character sets is NOT a good thing. We would prefer to specify a SINGLE
character set that can be used universally for representing all of the world's
languages in Internet mail. Unfortunately, existing practice in several
communities seems to point to the continued use of multiple character sets in
the near future. For this reason, we define names for a small number of
character sets for which a strong constituent base exists.

In other words, this is still an open issue. Some people believe that all
character sets are, or can be made to be, subsets of ISO 10646. And others do
not. And I don't see any chance of this changing any time soon.

> Also, the word 10646 was only recently actually added to the draft,
> and the author specifically asked for comments on that wording. So
> I submitted comments.

Exactly right. The only problem I see here is the notion that the charset
has to be a subset of ISO 10646. This, as far as I can tell, is a relatively
new notion and, I think, a very dangerous one that is best avoided if at all
possible.

Ned