Numeric Char Ents in 2.0 draft

Terry Allen (terry@ora.com)
Thu, 30 Mar 95 09:56:03 EST

This is a different character set issue than the one that concerned
me previously.

I confess I didn't fully understand what was intended by the following
until yesterday afternoon. I believe the functionality it describes
is not conformant with ISO 8879.

| Character octet references are represented in an HTML document as
| SGML entities whose name is number sign (#) followed by a numeral
| from 32-126 and 161-255. The HTML DTD includes a numeric character
| for each of the printing characters of the ISO-8859-1 encoding, so
| that one may reference them by number if it is inconvenient to
| enter them directly.
|
| The character octet references are not dependent on the character
| set encoding of the document. For example, "&#215;" always
| represents the ISO-8859-1 multiply sign, even when the document's
| declared character set is other than ISO-8859-1.

It seems to me that this means that even if I declare the document
character set encoding to be ISO 8859-6 (English/Arabic, I think),
in which 215 means something else (say, the letter mim), an HTML
app is supposed to interpret that &#215; as referring to a character
in another character set (which is available through the named
character reference &times; anyway). However, an SGML system will
understand that &#215; as "mim". From the SGML Handbook, section
9.5, production 64, ll. 10--13:

A replacement character is treated as though it were entered
directly except that the replacement for a numeric character
reference is always treated as data in the context in which the
replacement occurs.
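
The two readings can be compared with a small sketch. This is illustrative only: the helper name resolve_numeric_refs is hypothetical, it uses Python's standard codecs in place of an SGML parser, and it assumes a single-octet character set with references in the range 0-255. By Python's ISO 8859-6 table, octet 215 is the letter tah rather than mim, which does not change the argument.

```python
import re
import unicodedata


def resolve_numeric_refs(text, charset):
    """Replace each &#n; with the character that octet n denotes in `charset`.

    Hypothetical helper: assumes a single-octet charset and 0 <= n <= 255.
    """
    def repl(match):
        return bytes([int(match.group(1))]).decode(charset)

    return re.sub(r"&#(\d+);", repl, text)


sample = "text &#215; trombones"

# Reading in the 2.0 draft: the reference always means the ISO-8859-1 character.
draft = resolve_numeric_refs(sample, "iso8859_1")

# Reading per the SGML Handbook passage: the reference is resolved in the
# document character set, assumed here to be ISO 8859-6.
sgml = resolve_numeric_refs(sample, "iso8859_6")

print(unicodedata.name(draft[5]))  # MULTIPLICATION SIGN
print(unicodedata.name(sgml[5]))   # ARABIC LETTER TAH
```

The same reference yields two different characters, which is the conflict at issue.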

And when I parse a sample with sgmls (using a different numeric
character entity) this input

<!doctype html system "recon.dtd"[
]>
text &#76; trombones</> <body></body></html>

I get

AVERSION CDATA -//IETF//DTD HTML 2.0//EN
ASDAFORM CDATA Book
(HTML
(HEAD
ASDAFORM CDATA Ti
(TITLE
-text L trombones
)TITLE
)HEAD
(BODY
)BODY
)HTML

(for the entity &#215; the output shows "-text \327 trombones")

So the language in the 2.0 spec describes an apparently illegal
variant of SGML parsing, and has no force, as we also say

When the above conflicts with the SGML standard, the SGML standard
may be ignored.

[in the Character Set section] and presumably the same warning applies
to all those parts of the spec that describe SGML functionality
(Roy points out that the spec also describes browser behavior).

So why isn't the language "character octet references are not
dependent on the character set encoding of the document" bogus
and to be deleted?

-- 
Terry Allen (terry@ora.com)
O'Reilly & Associates, Inc.
Editor, Digital Media Group
101 Morris St.
Sebastopol, Calif., 95472
occasional column at: http://gnn.com/meta/imedia/webworks/allen/

A Davenport Group sponsor. For information on the Davenport
Group see ftp://ftp.ora.com/pub/davenport/README.html
or http://www.ora.com/davenport/README.html