Re: Revised language on: ISO/IEC 10646 as Document Character Set

Dan Connolly (connolly@w3.org)
Tue, 9 May 95 10:55:18 EDT

Next message: Gavin Nicol: "Re: format nego in HTML/10646?"
Previous message: Glenn Adams: "Re: Revised language on: ISO/IEC 10646 as Document Character Set"
Maybe in reply to: Dan Connolly: "Revised language on: ISO/IEC 10646 as Document Character Set"
Next in thread: Gavin Nicol: "Re: Revised language on: ISO/IEC 10646 as Document Character Set"

Erik van der Poel writes:
> >For example, the
> >ISO-2022-JP character encoding scheme can be used for HTML documents,
> >since its repertoire is a subset of the ISO10646 repertoire.
>
> Does the HTTP charset have to be a subset of 10646? How do you define
> "subset"?

Very carefully. First of all, it's not the charset (i.e. character
encoding scheme) that's a subset; it's the character repertoire.
And I define subset in the traditional manner:

x subset y iff
ForAll z: (z in x) implies (z in y)
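
As a small illustration of that definition, here is a sketch in Python
that treats a character repertoire as a plain set of characters. The
name is_subset and the two toy repertoires are made up for this
example; they are not taken from any standard.

```python
def is_subset(x, y):
    """Return True iff every z in x is also in y (x subset y)."""
    return all(z in y for z in x)


# Hypothetical toy repertoires, for illustration only.
small = {"A", "B", "\u00e9"}
large = {"A", "B", "C", "\u00e9", "\u65e5"}

print(is_subset(small, large))  # every member of small is in large
print(is_subset(large, small))  # "C" is in large but not in small
```

Python's built-in set type offers the same test as small <= large; the
explicit loop is only there to mirror the ForAll in the definition.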

> If you convert iso-2022-jp to 10646 and then back again to iso-2022-jp,
> you could end up with a file that is different from the original
> iso-2022-jp document.

Again: take care with the terms. ISO10646 is a coded character set.
ISO-2022-JP is a character encoding scheme. Type mismatch.
ISO-10646-UCS-2 is a character encoding scheme.
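
To make the type distinction concrete, here is a rough sketch in
Python. It assumes Python's "iso2022_jp" codec for ISO-2022-JP and
uses "utf-16-be" as a stand-in for ISO-10646-UCS-2 (the two agree for
characters in the Basic Multilingual Plane); the sample text is
arbitrary. The code positions belong to the coded character set; the
byte strings belong to the encoding schemes.

```python
text = "\u65e5\u672c"  # two sample characters (arbitrary choice)

# Coded character set: each character has one code position.
code_points = [ord(ch) for ch in text]

# Character encoding schemes: two different byte serializations
# of the same sequence of characters.
as_2022_jp = text.encode("iso2022_jp")
as_ucs2 = text.encode("utf-16-be")  # stand-in for UCS-2 (BMP only)

print([hex(cp) for cp in code_points])
print(as_2022_jp)
print(as_ucs2)

# Both byte strings decode to the same sequence of characters.
same = as_2022_jp.decode("iso2022_jp") == as_ucs2.decode("utf-16-be")
print(same)
```

So the meaningful comparison is between repertoires, not between an
encoding scheme and a coded character set.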

> For example, some of the "double-width" Roman
> characters in the JIS X 0208 portion of iso-2022-jp are not in 10646.

Then the character repertoire of iso-2022-jp is _not_ a subset
of the character repertoire of ISO10646. I was misled.

> Also, you could lose info encoded in the escape sequences themselves:
> ESC ( B, ESC ( J, ESC $ @ and ESC $ B. So iso-2022-jp is not a subset
> of 10646, if you look at it this way.

I don't follow this portion: does iso-2022-jp somehow encode more than
a sequence of characters? I don't care if there is more than one
way to encode a given sequence of characters in iso-2022.
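
Here is a sketch of what "more than one way to encode" looks like in
practice. It assumes Python's "iso2022_jp" codec accepts all four
escape sequences mentioned above; the byte strings are hand-built for
this example, not taken from a real document.

```python
ESC = b"\x1b"

# The same two kanji, designated via ESC $ @ (older JIS set)
# and via ESC $ B (JIS X 0208), then back to ASCII with ESC ( B.
kanji_old = ESC + b"$@" + b"F|K\\" + ESC + b"(B"
kanji_new = ESC + b"$B" + b"F|K\\" + ESC + b"(B"

# The letter "A" in plain ASCII, and after ESC ( J (JIS-Roman).
latin_ascii = b"A"
latin_roman = ESC + b"(J" + b"A" + ESC + b"(B"

decoded_kanji = [b.decode("iso2022_jp") for b in (kanji_old, kanji_new)]
decoded_latin = [b.decode("iso2022_jp") for b in (latin_ascii, latin_roman)]

# Different byte strings, same sequence of characters.
print(decoded_kanji[0] == decoded_kanji[1])
print(decoded_latin[0] == decoded_latin[1])
```

Viewed as sequences of characters, the variants are indistinguishable
after decoding; which escape sequence was used is the only thing lost.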

> Why not just remove the restriction that the HTTP charset has to be a
> subset of 10646? I.e. remove the word "subset" somehow.

The restriction is currently that the document character set must
agree with the ISO10646 coded character set. That implies that no
characters outside the ISO10646 character repertoire may be used.

This is a restriction. I called it out loud and clear a long time ago,
but everybody said "Yes, we know. That's OK." Now you're saying it's
not OK? I'm afraid you've got some serious lobbying to do!

Dan
