Re: Revised language on: ISO/IEC 10646 as Document Character Set

Albert Lunde (Albert-Lunde@nwu.edu)
Tue, 9 May 95 16:13:54 EDT

>Glenn seems to agree that the charset does not have to be a subset
>of 10646. Can we remove the word "subset" from that part of your
>spec please. Or are you referring to something other than the charset
>in the following:
>
> The document character set is somewhat independent of the character
> encoding scheme used to represent a document. For example, the
> ISO-2022-JP character encoding scheme can be used for HTML documents,
> since its repertoire is a subset of the ISO10646 repertoire. The
> critical distinction is that numeric character references agree
> with ISO10646 regardless of how the document is encoded.

I think you/we are confusing two meanings of subset.

Let me try to take a crack at this. (Correct me if I'm wrong.)

The MIME/HTTP charset is a mapping from octets on the wire to character
names/glyphs.

The SGML document character set can be looked at as a mapping from a
whole different space of numbers (numeric references) into
character names/glyphs.

You need some kind of outside knowledge of which characters correspond
between the ranges of the two mappings in order to translate characters
into the document character set, or to parse them as if you were using the
document character set.

If you ignore the question of processing arbitrary numeric references, it
seems to me that what an implementation needs to know is the mapping from the
characters in the MIME charset range to corresponding positions in Unicode.

It is not required that characters have the same numeric positions in the
MIME charset as they have in Unicode.

A simple concrete example may, I hope, illustrate the two notions of subset
that are getting confused here.

* ISO Latin-1 is a "subset" of Unicode in two senses:

1) All the numbers in its domain map to the same characters that those
numbers map to in Unicode.

2) All the characters in its range correspond to characters in Unicode.

* EBCDIC is a "subset" of Unicode in only the second sense, and the second
sense is all we are asking for.
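The two senses can be checked mechanically. The following is only a rough sketch in Python, assuming that Python's "latin-1" codec stands in for ISO Latin-1 and that "cp037" (one particular EBCDIC code page) stands in for EBCDIC; the function names are made up for illustration.

```python
def decoded_table(charset):
    """Map each octet 0..255 to the character the codec assigns to it.

    Octets the codec leaves undefined are skipped.
    """
    table = {}
    for octet in range(256):
        try:
            table[octet] = bytes([octet]).decode(charset)
        except UnicodeDecodeError:
            pass
    return table


def subset_by_position(charset):
    # Sense 1: every number maps to the same character it maps to in Unicode.
    return all(ord(ch) == octet for octet, ch in decoded_table(charset).items())


def subset_by_repertoire(charset):
    # Sense 2: every character in the range has some position in Unicode.
    # Python strings are Unicode, so this holds for any codec Python ships;
    # the check is spelled out only to make the second sense explicit.
    return all(0 <= ord(ch) <= 0x10FFFF for ch in decoded_table(charset).values())


for name in ("latin-1", "cp037"):
    print(name, subset_by_position(name), subset_by_repertoire(name))
```

Under these assumptions Latin-1 passes both checks, while the EBCDIC code page passes only the second.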

EBCDIC could be used as a MIME charset (disregarding issues of IANA
registration) with this definition of HTML.

The appropriate SGML document character set would _not_ be EBCDIC, but it
could either be ISO10646 or any subset of ISO10646 that contained in its
range the characters corresponding to the range of the EBCDIC mapping. It
would map the same numbers to the same characters as ISO10646, but this
would mostly be used to interpret numeric references.
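As an illustration of numeric references being interpreted against ISO10646 no matter how the octets are encoded, here is a small Python sketch. It assumes "cp037" as a stand-in for EBCDIC and uses the standard library's html.unescape to resolve the reference; resolve() is a hypothetical helper, not part of any spec.

```python
import html


def resolve(octets, mime_charset):
    """Decode the octets with the MIME charset, then expand numeric references."""
    text = octets.decode(mime_charset)
    # html.unescape reads &#233; as Unicode position 233,
    # whatever charset the octets arrived in.
    return html.unescape(text)


source = "caf&#233;"

from_ebcdic = resolve(source.encode("cp037"), "cp037")
from_latin1 = resolve(source.encode("latin-1"), "latin-1")

# The octet that cp037 itself uses for the same character is not 233.
ebcdic_octet = "\u00e9".encode("cp037")

print(from_ebcdic == from_latin1, ebcdic_octet != bytes([233]))
```

Both decodings yield the same text, even though the EBCDIC octet for that character differs from its ISO10646 position.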

Getting more abstract:

If we denote the MIME charset as a function mapping:

M: O->C from octets to characters (ignoring multi-byte issues for now)

and call Unicode/ISO10646 the mapping:

U: I' -> C' from numeric references to Unicode characters

and call the abstract mapping that translates the MIME characters to
corresponding characters in Unicode:

T: C -> C'

then a function that could represent a minimal document character set would
be the restriction of the function U to the domain set: U-inverse(T(M(O)))
and the range set: T(M(O)).
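The restriction above can be sketched in Python under some loud assumptions: a Python codec's decode step plays the role of the composition T(M(o)) in one go, chr() plays U, ord() plays U-inverse, and only single octets are considered. The name minimal_document_charset is invented for this sketch.

```python
def minimal_document_charset(mime_charset):
    """Return U restricted to the domain U-inverse(T(M(O))) as a dict."""
    # T(M(O)): the Unicode characters reachable from single octets.
    image = set()
    for octet in range(256):
        try:
            image.add(bytes([octet]).decode(mime_charset))
        except UnicodeDecodeError:
            pass  # octet not defined in this charset
    # U-inverse(T(M(O))): the numeric positions of those characters.
    domain = sorted(ord(ch) for ch in image)
    # The restricted U maps the same numbers to the same characters as
    # Unicode, just on a smaller domain.
    return {number: chr(number) for number in domain}


ebcdic_doc_charset = minimal_document_charset("cp037")
ascii_doc_charset = minimal_document_charset("ascii")
print(len(ebcdic_doc_charset), len(ascii_doc_charset))
```

The keys of the resulting dict are Unicode positions, not octet values, which is the point of keeping the document character set separate from the MIME charset.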

---
    Albert Lunde                      Albert-Lunde@nwu.edu