Very carefully. First of all, it's not the charset (i.e. character
encoding scheme) that's a subset; it's the character repertoire.
And I define subset in the traditional manner:
    x subset y iff
        ForAll z: (z in x) implies (z in y)
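
A minimal sketch of that definition in Python, treating a repertoire as a
plain set of characters. The helper names `repertoire` and `is_subset` are
made up for illustration, and `repertoire` only makes sense for single-byte
encodings, since it just tries every byte value in turn:

```python
def repertoire(codec):
    """Set of characters reachable from single bytes under `codec`.

    Hypothetical helper: only meaningful for single-byte encodings.
    """
    chars = set()
    for b in range(256):
        try:
            chars.add(bytes([b]).decode(codec))
        except UnicodeDecodeError:
            pass  # this byte value is not assigned in this encoding
    return chars


def is_subset(x, y):
    # x subset y  iff  ForAll z: (z in x) implies (z in y)
    return all(z in y for z in x)


ascii_rep = repertoire("ascii")
latin1_rep = repertoire("latin-1")

print(is_subset(ascii_rep, latin1_rep))  # True
print(is_subset(latin1_rep, ascii_rep))  # False
```

Note that the test is on sets of characters, not on byte values or on the
encodings themselves.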
> If you convert iso-2022-jp to 10646 and then back again to iso-2022-jp,
> you could end up with a file that is different from the original
> iso-2022-jp document.
Again: take care with the terms. ISO10646 is a coded character set.
ISO-2022-JP is a character encoding scheme. Type mismatch.
ISO-10646-UCS-2 is a character encoding scheme.
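
To make the distinction concrete, here is a small Python sketch. It is
only an illustration under assumptions: Python ships no codec named
ISO-10646-UCS-2, so `utf-16-be` stands in for it (the two agree for this
particular character), and the sample character is chosen arbitrarily.

```python
ch = "\u4e9c"  # one character, chosen arbitrarily for the example

# Coded character set: the character is assigned a number (code position).
code_position = ord(ch)
print(hex(code_position))  # 0x4e9c

# Character encoding schemes: ways of writing characters down as bytes.
# Assumption: utf-16-be stands in for UCS-2 here.
ucs2_bytes = ch.encode("utf-16-be")
jis_bytes = ch.encode("iso2022_jp")

print(ucs2_bytes.hex())  # 4e9c
print(jis_bytes.hex())   # 1b244230211b2842
```

The code position is a property of the coded character set; the byte
strings are properties of the encoding schemes. Only the latter can be
"converted" into one another.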
> For example, some of the "double-width" Roman
> characters in the JIS X 0208 portion of iso-2022-jp are not in 10646.
Then the character repertoire of iso-2022-jp is _not_ a subset
of the character repertoire of ISO10646. I was misled.
> Also, you could lose info encoded in the escape sequences themselves:
> ESC ( B, ESC ( J, ESC $ @ and ESC $ B. So iso-2022-jp is not a subset
> of 10646, if you look at it this way.
I don't follow this portion: does iso-2022-jp somehow encode more than
a sequence of characters? I don't care if there is more than one
way to encode a given sequence of characters in iso-2022-jp.
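
A sketch of what "more than one way to encode" looks like, using Python's
`iso2022_jp` codec. This rests on an assumption about that codec: it maps
ESC $ @ and ESC $ B to the same table, and ESC ( J to the same letters as
ESC ( B for plain Roman letters. Whether the character sets those
sequences designate really contain the same characters is a separate
question that this example does not settle.

```python
ESC = b"\x1b"

# The same two bytes 0x30 0x21, announced by two different escape sequences.
old = ESC + b"$@" + b"0!" + ESC + b"(B"  # ESC $ @ ... ESC ( B
new = ESC + b"$B" + b"0!" + ESC + b"(B"  # ESC $ B ... ESC ( B

# Under this codec both byte strings decode to the same character sequence.
same_chars = old.decode("iso2022_jp") == new.decode("iso2022_jp")
print(same_chars)  # True

# Decoding and re-encoding does not reproduce the original bytes.
round_trip = old.decode("iso2022_jp").encode("iso2022_jp")
print(round_trip == old)  # False
print(round_trip == new)  # True

# Likewise for a Roman letter announced with ESC ( J.
roman = ESC + b"(J" + b"A" + ESC + b"(B"
roman_trip = roman.decode("iso2022_jp").encode("iso2022_jp")
print(roman_trip)  # b'A'
```

The bytes change on the round trip; the sequence of characters, as this
codec sees it, does not.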
> Why not just remove the restriction that the HTTP charset has to be a
> subset of 10646? I.e. remove the word "subset" somehow.
The restriction is currently that the document character set must
agree with the ISO10646 coded character set. That implies that no
characters outside the ISO10646 character repertoire may be used.
This is a restriction. I called it out loud and clear a long time ago,
but everybody said "Yes, we know. That's OK." Now you're saying it's
not OK? I'm afraid you've got some serious lobbying to do!
Dan