Re: ISO/IEC 10646 as Document Character Set

Glenn Adams (glenn@stonehand.com)
Sun, 30 Apr 95 11:54:56 EDT

From: erik@netscape.com (Erik van der Poel)
Date: Fri, 28 Apr 95 21:07:28 -0700

Just to clarify things in my mind, would the following be allowed
in your world? HTTP headers followed by HTML document:

HTTP/1.0 200 OK
Date: Saturday, 29-Apr-95 03:53:33 GMT
Server: ...
MIME-version: 1.0
Content-Type: text/html; charset=iso-2022-jp
Last-modified: Tuesday, 18-Apr-95 16:10:13 GMT
Content-length: 15132

<TITLE>...</TITLE>
<BODY>
Here is some normal text.
Here is a 10646 numerical entity: &#23598732;.
Here is some ISO-2022-JP text: ...
</BODY>

I.e. is the charset allowed to be iso-2022-jp (or any other non-Latin-1
and non-10646/Unicode charset), and are you still allowed to use 10646
numeric entities within such documents?

Yes. The charset parameter of the Content-Type header specifies the
encoding scheme which applies to the *representation* of the document entity
(or another entity), and not to the document entity itself. A document entity
is really an abstraction which can have multiple representations (e.g.,
be encoded using different system character sets). Numeric character
references, on the other hand, are always in terms of the document character
set (or the document character set which applies to the entity in which
the numeric charref occurs).

Thus the above example is completely acceptable.
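
To make the distinction concrete, here is a minimal sketch in Python. It
assumes the document character set is 10646/Unicode (which is what Python's
str type models) and that the charset parameter names a codec Python knows.
The helper names resolve_charrefs and read_entity are hypothetical, and
references whose number lies outside the range chr() accepts are simply left
untouched.

```python
import re

MAX_CODE = 0x10FFFF  # largest code value Python's chr() accepts


def resolve_charrefs(text: str) -> str:
    """Replace decimal numeric character references with the character
    they denote in the document character set (assumed to be 10646)."""
    def repl(match):
        code = int(match.group(1))
        # Leave out-of-range references as-is rather than guessing.
        return chr(code) if code <= MAX_CODE else match.group(0)
    return re.sub(r"&#([0-9]+);", repl, text)


def read_entity(raw: bytes, charset: str) -> str:
    """Undo the representation encoding first, then resolve references."""
    return resolve_charrefs(raw.decode(charset))


# The same document entity in three different representations.
SOURCE = "日本語 &#26085;"
results = {
    charset: read_entity(SOURCE.encode(charset), charset)
    for charset in ("iso2022_jp", "euc_jp", "shift_jis")
}

# The reference resolves identically whatever the representation was.
assert len(set(results.values())) == 1
```

The charset only affects how the bytes are turned back into characters; the
number inside the reference is never interpreted in terms of ISO-2022-JP.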

Keep in mind, however, that the lexical properties of characters as processed
by an SGML parser are expressed in terms of the properties assigned to the
characters of the document character set. That is, these properties are not
assigned or specified for the character set used in the *representation* of
the document. What this means is that if you are a parser and someone gives
you a storage object which represents an entity, then you must be able to parse
the entity _as if it were represented using the document character set_. This
means you can do one of the following given an entity represented using
ISO-2022-JP:

(1) translate from the representation character set to the document
character set and parse using the latter;

Example: translate storage object to 10646, parse as 10646

(2) parse using the representation character set (e.g., ISO-2022-JP)
and infer the lexical properties of its characters in terms of
their relationship to the document character set's characters.

Example: parse as ISO-2022-JP, inferring lexical properties of
characters in ISO-2022-JP from their corresponding 10646 chars.

(3) translate from the representation character set to a third character
set (i.e., a character set other than the representation or document
character set), then treat the result as a new representation character
set and apply (2) above.

Example: translate storage object to (pick your favorite charset),
then parse as (your favorite charset), inferring properties of
(your favorite charset) from their corresponding 10646 chars.
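
Approaches (1) and (2) can be sketched side by side in Python. This is an
illustration under stated assumptions, not a real SGML parser: is_name_char
is a hypothetical stand-in for a lexical class, the function names are
invented here, and the in-place variant only handles single-byte
representation character sets (ISO-8859-7 in the usage below). A stateful
encoding such as ISO-2022-JP would additionally need to track escape
sequences.

```python
def is_name_char(ch: str) -> bool:
    # Hypothetical lexical class, defined on document (10646) characters.
    return ch.isalnum() or ch in "-."


def names_via_translation(raw: bytes, charset: str) -> list:
    """Approach (1): translate to the document character set, then scan."""
    names, current = [], []
    for ch in raw.decode(charset):
        if is_name_char(ch):
            current.append(ch)
        elif current:
            names.append("".join(current))
            current = []
    if current:
        names.append("".join(current))
    return names


def build_byte_table(charset: str) -> list:
    """Infer the class of each byte value from its corresponding 10646 char."""
    table = []
    for value in range(256):
        try:
            ch = bytes([value]).decode(charset)
        except UnicodeDecodeError:
            table.append(False)  # byte value unassigned in this charset
        else:
            table.append(is_name_char(ch))
    return table


def names_in_place(raw: bytes, charset: str) -> list:
    """Approach (2): scan the representation bytes directly."""
    table = build_byte_table(charset)
    names, start = [], None
    for i, value in enumerate(raw):
        if table[value]:
            if start is None:
                start = i
        elif start is not None:
            names.append(raw[start:i])
            start = None
    if start is not None:
        names.append(raw[start:])
    return names


raw = "αβγ <δ>".encode("iso8859_7")
translated = names_via_translation(raw, "iso8859_7")
in_place = [n.decode("iso8859_7") for n in names_in_place(raw, "iso8859_7")]
assert translated == in_place
```

Both routes find the same tokens; they differ only in whether the translation
happens to the data up front or to the property table.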

In the case of current practice where 8859-1 is the representation character
set, one can parse as 8859-1, according to (1) above, since the lexical
properties of all 8859-1 characters are identical to those of their
corresponding 10646 characters (furthermore, they have the same code values).
This is the reason I am saying that changing to 10646 as the document
character set won't necessarily require any change to existing practice.
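
The code-value claim is easy to check mechanically. A small sketch, assuming
Python's "latin-1" codec as the 8859-1 mapping and chr() as the 10646 code
value lookup (the function name is hypothetical):

```python
def latin1_matches_10646() -> bool:
    """True if every 8859-1 byte value decodes to the 10646 character
    with the same code value."""
    return all(bytes([value]).decode("latin-1") == chr(value)
               for value in range(256))


same_code_values = latin1_matches_10646()
```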

Glenn