Re: ISO/IEC 10646 as Document Character Set

Gavin Nicol (gtn@ebt.com)
Mon, 1 May 95 22:07:11 EDT

>The replacement text of the entity would be expressed by reference to
>the document character set. Such a reference may or may not have a
>representation in the system character set. In the above case, neither
>ASCII nor EBCDIC have a suitable representation, and, if this is what
>you mean by "result", then, yes, it is system dependent as to how to
>interpret such a reference.

This is precisely what I am trying to say, though I should note that
the mapping from document character set to representation in the
system character set (to use your terminology) is not defined at all
by SGML. It is assumed that LATIN CAPITAL LETTER A will be mapped to
LATIN CAPITAL LETTER A, but this is not *required* behaviour, though
it is obviously *desirable* behaviour.

>It seems that you aren't distinguishing properly between
>
>(1) the well-definedness of a numeric character reference as such; and
>(2) the interpretation of the character which the reference specifies
> in terms of the system character set used in the parsing process.

Well, perhaps I do suffer from lack of precision. I'll pull out the
old excuse about not using English enough ;-) I've been talking about
case 2, because 1 is well defined (even if it's only an abstract
step).

>For example, if I translate this entity to a document
>character set of KS C 5601:1987 (the primary Korean character set),
>then I would have to translate the numeric charref to &#10343; which
>is the equivalent character. If I translated it to an entity whose
>document charset was ASCII, then, I could do one of the following:
>(1) indicate an error due to inability to translate; (2) translate it
>to &#26; ('SUB'), the ASCII substitution control character; (3) translate
>it to &#49; ('1') as an approximate mapping, etc.

Exactly. It is system dependent.
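
To make the three options above concrete, here is a rough sketch of how
a system might represent a referenced character in its own character
set. It is only an illustration, not anything SGML prescribes: it
assumes the document character set is ISO 10646 (so the reference number
can be used directly as a code point), and the function name
`represent`, the `policy` values, and the approximation table are all
hypothetical names of my own.

```python
SUB = "\x1a"  # ASCII SUB, code 26, the substitution control character


def represent(code_point, system_charset="ascii", policy="error",
              approximations=None):
    """Represent a numeric character reference in the system character set.

    Assumes the document character set is ISO 10646, so code_point
    identifies the abstract character directly. What happens when the
    system character set cannot represent it is a local policy choice.
    """
    ch = chr(code_point)  # step 1: well defined, yields the abstract character
    try:
        ch.encode(system_charset)  # step 2: can the system charset hold it?
        return ch
    except UnicodeEncodeError:
        if policy == "error":
            raise  # option (1): report the inability to translate
        if policy == "substitute":
            return SUB  # option (2): fall back to SUB
        if policy == "approximate":
            # option (3): use a locally supplied approximate mapping,
            # falling back to SUB when the table has no entry
            return (approximations or {}).get(code_point, SUB)
        raise ValueError("unknown policy: %r" % policy)
```

For example, with a hypothetical table mapping CIRCLED DIGIT ONE
(0x2460) to '1', `represent(0x2460, policy="approximate",
approximations={0x2460: "1"})` gives '1', while the default policy
raises an error. All of these are defensible, which is the point.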

>What we can't do with numeric charrefs is to say they are interpreted
>according to the system character set (in general). We can only say this
>as a side-effect of the parsing process (or other processes) where we
>need to represent the referenced character according to the system character
>set(s) at hand.

Yes, this is entirely correct.

>I think you understand all of the above. Perhaps we are just in violent
>agreement but are using different terminology?

I think so. As Dan has pointed out, clarity is not my strong point...

Anyway, this is all really academic. Obviously the most desirable
behaviour is to map and represent the characters in accordance with
ISO 10646. All I have been trying to show is that current browsers all
exhibit legal behaviour, including things like Mosaic L10N. I'm sure
this conversation is quite soporific to most readers...