Re: partial draft: "Character Set" Considered Harmful

Gavin Nicol (gtn@ebt.com)
Tue, 11 Apr 95 04:51:38 EDT

>Does this mean that, if I have a document coded (in a simple-minded sense)
>in US-ASCII containing, say, SGML stuff referring to ISO Latin-1 characters
>(not in US-ASCII) via numeric character references, and I translate it to
>EBCDIC, I have to do something to convert the numeric character references,
>to stay consistent with SGML?

Not the encoding; the coded character set. For example, say I
translated a document from JIS 201 to ISO 8859-1. In any case, when
one does this, numeric character references must have their character
numbers changed, or they'll have a different meaning. In other words,
you must *parse* the document in order to translate the coded character
set.
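
A minimal sketch of that renumbering, in Python, assuming both coded
character sets are single-byte and that references take the simple
decimal form `&#N;`. The name `remap_ncrs` and the choice of Latin-1
and code page 437 are purely illustrative; a real SGML parser would
have to handle much more than this regular expression does.

```python
import re

# Matches decimal numeric character references such as "&#233;".
NCR = re.compile(r"&#(\d+);")

def remap_ncrs(text: str, src: str, dst: str) -> str:
    """Renumber numeric character references from coded character set
    `src` to `dst`. Illustration only: assumes single-byte sets."""
    def fix(match: re.Match) -> str:
        number = int(match.group(1))
        # Find out which character the number means in the source set...
        char = bytes([number]).decode(src)
        # ...then look up that character's number in the target set.
        return f"&#{char.encode(dst)[0]};"
    return NCR.sub(fix, text)

# 233 is e-acute in ISO 8859-1; the same character is 130 in code page 437.
print(remap_ncrs("caf&#233;", "latin-1", "cp437"))  # prints caf&#130;
```

Without this step the reference `&#233;` would survive the translation
unchanged and silently point at a different character.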

>The idea of having all numeric character references refer to Unicode (which
>seems to be a feature of Gavin's proposal) treats this case nicely. (One
>just converts the text in a simple-minded way, and the references mean the
>same thing.)

Quite. Numeric character references are not a nice thing, but this
minimises the damage they can do (and simplifies the general
processing model for handling multiple coded character sets as well).
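
As a rough illustration, assuming references always name Unicode code
points, a document can be transcoded byte for byte and its references
still resolve to the same characters. The helper name `resolve` is
hypothetical, and `cp037` stands in for "EBCDIC" here.

```python
import re

# Matches decimal numeric character references such as "&#233;".
NCR = re.compile(r"&#(\d+);")

def resolve(raw: bytes, encoding: str) -> str:
    """Decode the document, then expand each reference as a Unicode
    code point, independent of the encoding the bytes arrived in."""
    text = raw.decode(encoding)
    return NCR.sub(lambda m: chr(int(m.group(1))), text)

ascii_doc = "caf&#233;".encode("ascii")
# Simple-minded conversion: transcode the bytes, leave references alone.
ebcdic_doc = ascii_doc.decode("ascii").encode("cp037")

# Both copies resolve to the same text, with no renumbering step.
assert resolve(ascii_doc, "ascii") == resolve(ebcdic_doc, "cp037")
```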

>I'm not sure if you are saying that this (Gavin's proposal) would not be
>legal SGML, suggesting a different scheme for interpreting numeric
>references, or what.

My proposal is perfectly legal, and IMHO, should make it easier for
people to produce software that is conformant. My proposal also offers
a legal escape hatch for browsers that only want to be concerned with
ISO-8859-1 or US-ASCII.