Re: ISO/IEC 10646 as Document Character Set

Glenn Adams (glenn@stonehand.com)
Mon, 1 May 95 14:25:55 EDT

Date: Mon, 1 May 1995 13:33:52 -0400
From: Gavin Nicol <gtn@ebt.com>

Irrespective, if I have a system
character set of ISO 8859-1, and a numeric character reference of
&#9312;, the result is undefined by the SGML standard, and is
therefore system dependent.

What do you mean by "the result is undefined"? If the document character
set is 10646, then &#9312; is well-defined. It designates U+2460 CIRCLED
DIGIT ONE. That it is well-defined is a consequence of the fact that it
expresses a valid, assigned code value of the document character set. None
of these facts has anything to do with the system character set used
to represent the entity in which this reference occurs. For example,
the representation of this character reference would be as follows:

         &  #  9  3  1  2  ;
ASCII  : 26 23 39 33 31 32 3B
EBCDIC : 50 7B F9 F3 F1 F2 5E
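
To make the distinction concrete, here is a small sketch in Python (the
'ascii' and 'cp037' codecs stand in for the ASCII and EBCDIC system
character sets above; the codec names are my choice, not part of SGML):

# The reference is just a string of characters. Its byte representation
# depends on the system character set of the entity that contains it...
reference = "&#9312;"

# Bytes under two different system character sets.
print("ASCII :", reference.encode("ascii").hex(" ").upper())
print("EBCDIC:", reference.encode("cp037").hex(" ").upper())

# ...but the character it designates is fixed by the document character
# set. With a document character set of ISO/IEC 10646, code position
# 9312 is U+2460 CIRCLED DIGIT ONE.
code = int(reference[2:-1])   # 9312
print(hex(code))              # 0x2460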

The replacement text of the entity would be expressed by reference to
the document character set. Such a reference may or may not have a
representation in the system character set. In the above case, neither
ASCII nor EBCDIC has a suitable representation, and if this is what
you mean by "result", then yes, it is system dependent how such a
reference is interpreted.
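
As a hedged illustration of that system dependence (again using Python's
'ascii' and 'cp037' codecs as stand-ins for the system character sets):

# Whether the *referenced* character can be represented at all is decided
# against the system character set(s) at hand, not by SGML.
ch = "\u2460"                      # the character designated by &#9312;

for codec in ("ascii", "cp037"):   # neither assigns a code to U+2460
    try:
        ch.encode(codec)
    except UnicodeEncodeError:
        print(codec, "has no representation; handling is system dependent")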

It seems that you aren't distinguishing properly between

(1) the well-definedness of a numeric character reference as such; and
(2) the interpretation of the character which the reference specifies
in terms of the system character set used in the parsing process.

SGML requires that the interpretation of a numeric character reference
occur in relation to the document character set which applies to the
entity in which the reference occurs. This means that if I were to
translate an entity containing &#9312; to another document character
set then I would have to translate the numeric character reference
accordingly. For example, if I translate this entity to a document
character set of KS C 5601:1987 (the primary Korean character set),
then I would have to translate the numeric charref to &#10343;, which
is the equivalent character. If I translated it to an entity whose
document charset was ASCII, then I could do one of the following:
(1) indicate an error due to inability to translate; (2) translate it
to &#26; ('SUB'), the ASCII substitution control character; (3) translate
it to &#49; ('1') as an approximate mapping, etc.
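
A minimal sketch of such a translation, assuming a hand-written mapping
table (the helper name and the table are illustrative; the KS C 5601 code
position is the one given above):

# Translate a numeric character reference from a 10646 document character
# set to another document character set, taking one of the options above
# when no equivalent exists.
def translate_charref(code, target_charset):
    if target_charset == "KSC5601":
        equivalents = {9312: 10343}        # U+2460 -> KS C 5601 position
        return "&#%d;" % equivalents[code]
    if target_charset == "ASCII":
        approximations = {9312: 49}        # '1' as an approximate mapping
        if code in approximations:
            return "&#%d;" % approximations[code]
        return "&#26;"                     # SUB; or raise an error instead
    raise ValueError("unknown document character set")

print(translate_charref(9312, "KSC5601"))  # &#10343;
print(translate_charref(9312, "ASCII"))    # &#49;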

What we can't do with numeric charrefs is to say they are interpreted
according to the system character set (in general). We can only say this
as a side-effect of the parsing process (or other processes) where we
need to represent the referenced character according to the system character
set(s) at hand.

I think you understand all of the above. Perhaps we are just in violent
agreement but are using different terminology?

Glenn