Re: Revised language on: ISO/IEC 10646 as Document Character Set

Martin J Duerst (mduerst@ifi.unizh.ch)
Tue, 9 May 95 15:44:27 EDT

>
>> > > Does the HTTP charset have to be a subset of 10646?
>> > >
>> > >No. It can be anything. Making 10646 the doc charset doesn't place
>> > >any requirements on the HTTP charset.
>> >
>> > Except that all the characters in the document have to be within ISO
>> > 10646.
>>
>>I think I missed this too: We were talking about HTML, but the
>>above question is about HTTP. Nobody means to restrict the charset
>>parameter values for all of HTTP, right?

Yes, I think we don't want to restrict them. See below for some
examples that go beyond what people probably assume
we are saying (but are still allowed according to the text).

>Implementors need to know the relationship between the HTTP/MIME
>"charset" parameter and the HTML "document character set".

I think the problem (if any) lies with wording (not of the
portion of the standard under discussion, but of the discussion
itself!). The "charset" parameter is in fact an encoding, i.e. one or
more sets of characters, with a number assigned to each character
in each set, and a way to map sequences of characters from
these sets to octets. E.g. ISO-2022-JP contains
two sets of characters, namely the ASCII set and the set from
JIS 0208. They are mapped to octets with escape sequences
according to ISO 2022. The character repertoire of ISO-2022-JP
is the union of ASCII and JIS 0208, and is a subset of the
character repertoire of ISO 10646.
[For Glenn: The escape sequences are not to be translated to
ISO 10646, as far as I understand. If this is necessary for
round-trip conversion, we could also define round-trip
integrity by saying that we can always convert octets to
the corresponding Latin-1 entries in Unicode and back,
which of course doesn't make sense.]
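
To make the ISO-2022-JP description above concrete, here is a minimal
sketch in Python. It assumes Python's built-in "iso2022_jp" codec as a
stand-in for the encoding; the sample string and variable names are
only illustrative. It shows that the escape sequences appear in the
octets but not among the decoded characters.

```python
# Sketch: how the ISO-2022-JP "charset" maps characters to octets.
# Assumes Python's built-in "iso2022_jp" codec; the sample text is arbitrary.
text = "abc\u65e5\u672c"  # "abc" followed by two kanji (U+65E5, U+672C)

octets = text.encode("iso2022_jp")

ESC_TO_JIS0208 = b"\x1b$B"  # escape sequence selecting the JIS X 0208 set
ESC_TO_ASCII = b"\x1b(B"    # escape sequence selecting the ASCII set

# The octet stream carries the escape sequences that switch between sets.
print(octets)
print(ESC_TO_JIS0208 in octets, ESC_TO_ASCII in octets)

# Decoding yields ISO 10646 / Unicode code points only; the escape
# sequences themselves are not translated into characters.
decoded = octets.decode("iso2022_jp")
print(decoded == text)
print([hex(ord(ch)) for ch in decoded])
```

The round trip is judged on the characters, not on the escape
sequences, which is the point of the bracketed note above.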

>I don't care whether you put that in the HTML spec or somewhere else.
>But it has to be put somewhere.
>
>Glenn seems to agree that the charset does not have to be a subset
>of 10646.

I think this has been resolved to the point that there may be characters
outside the repertoire of the document character set, but only
in a construction that doesn't appear in the HTML DTD anyway
(I think it was DATA or SDATA). So this turned out to be irrelevant.

>Can we remove the word "subset" from that part of your spec please.
>Or are you referring to something other than the charset
>in the following:
>
> The document character set is somewhat independent of the character
> encoding scheme used to represent a document. For example, the
> ISO-2022-JP character encoding scheme can be used for HTML documents,
> since its repertoire is a subset of the ISO10646 repertoire. The
> critical distinction is that numeric character references agree
> with ISO10646 regardless of how the document is encoded.

I do not see any problem with the word 'subset' here.
Note that the paragraph above doesn't say that the character
repertoire of an encoding has to be a subset of ISO 10646.
One could easily assume some encoding with a character
repertoire that goes, in some places, beyond ISO 10646.
As long as none of these characters outside ISO 10646
is used, there will be no problems at all. I think such cases
are captured in the wording *somewhat independent*,
and the *subset* in the *example* just shows a
straightforward (and probably very frequent) case.
Another case would be an encoding that, in some of
its aspects, makes distinctions between objects that are
subsumed by the same character in ISO 10646.
A combination of Japanese and Korean via ISO 10646 could
be an example. There would not be any problem using
this as an encoding for the charset parameter as long
as the distinction between the same character as encoded
via JIS or via KS is irrelevant for the document, or where
it plays some role, is made by other means such as fonts
and so on. Again, the above paragraph doesn't say that
this is not allowed; it just doesn't take such a rather rare
construction as an example.
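
The "critical distinction" in the quoted paragraph can be sketched as
follows. This is only an illustration, assuming Python's "iso2022_jp"
and "utf-8" codecs and the standard html.unescape function as a
stand-in for a parser resolving numeric character references; the
sample markup is made up.

```python
import html

# Sketch: a numeric character reference names an ISO 10646 code point,
# independent of the encoding used to ship the document.
# U+65E5 is 26085 in decimal, so "&#26085;" refers to the same kanji
# that appears literally at the start of the paragraph.
source = "<p>\u65e5 &#26085;</p>"

as_jis = source.encode("iso2022_jp")   # one possible "charset"
as_utf8 = source.encode("utf-8")       # another possible "charset"

# The octets on the wire differ between the two encodings ...
print(as_jis != as_utf8)

# ... but once decoded, the reference resolves to the same character.
resolved_jis = html.unescape(as_jis.decode("iso2022_jp"))
resolved_utf8 = html.unescape(as_utf8.decode("utf-8"))
print(resolved_jis == resolved_utf8)
```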

There are some ways to misinterpret the above paragraph.
The worst, I guess, is that somebody assumes that, because it says
ISO-2022-JP "can be used for HTML documents",
this works for *all* HTML documents. Maybe
adding *some* could help here?
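
A small sketch of why *some* is the safer word, assuming Python's
"iso2022_jp" codec and its standard "xmlcharrefreplace" error handler
(the sample word is arbitrary): a document containing a character
outside the ISO-2022-JP repertoire cannot be encoded directly, although
the character can still be written as a numeric character reference.

```python
# Sketch: not every HTML document fits the ISO-2022-JP repertoire.
# U+00E9 (e with acute accent) is in neither ASCII nor JIS X 0208.
text = "caf\u00e9"

try:
    text.encode("iso2022_jp")
    directly_encodable = True
except UnicodeEncodeError:
    directly_encodable = False
print(directly_encodable)

# Fallback: replace the unencodable character with a numeric character
# reference, which always refers to the ISO 10646 code point.
fallback = text.encode("iso2022_jp", errors="xmlcharrefreplace")
print(fallback)
```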

----
Dr.sc. Martin J. Du"rst ' , . p y f g c R l / =
Institut fu"r Informatik a o e U i D h T n S -
der Universita"t Zu"rich ; q j k x b m w v z
Winterthurerstrasse 190 (the Dvorak keyboard)
CH-8057 Zu"rich-Irchel Tel: +41 1 257 43 16
S w i t z e r l a n d Fax: +41 1 363 00 35 Email: mduerst@ifi.unizh.ch
----