Re: Numeric Char Ents in 2.0 draft

Roy T. Fielding (fielding@avron.ICS.UCI.EDU)
Fri, 31 Mar 95 05:25:26 EST

> | The character octet references are not dependent on the character
> | set encoding of the document. For example, "&#215;" always
> | represents the ISO-8859-1 multiply sign, even when the document's
> | declared character set is other than ISO-8859-1.
>
> It seems to me that this means that even if I declare the document
> character set encoding to be ISO 8859-6 (English/Arabic, I think),
> in which 215 means something else (say, the letter mim), an HTML
> app is supposed to interpret that &#215; as referring to a character
> in another character set (which is available through the named
> character reference &times; anyway).

Yep, that's what it means. I could have sworn that is what a previous
discussion (months ago) recommended, but I can't find that discussion now.

> However, an SGML system will
> understand that &#215; as "mim". From the SGML Handbook, section
> 9.5, production 64, ll. 10--13:
>
> A replacement character is treated as though it were entered
> directly except that the replacement for a numeric character
> reference is always treated as data in the context in which the
> replacement occurs.

Hmmmmm, well that would have been nice to know -- I was wondering where
the × definition was coming from anyway, since it's not in the DTD
and not in my handy-dandy SGML reference. It's too bad that the SGML
Handbook is not on the Web; as it stands, the fact that HTML is SGML is
more by accident than design, and would have remained an accident if
it were not for Dan's persistence.

It needs to be changed in the spec, then, to not be iso-8859-1 specific.
I'll think about the new wording when I find the time.
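
To make the two readings concrete, here is a minimal sketch in Python. It
is only an illustration under stated assumptions: the function names are
hypothetical, Python's built-in codec tables stand in for a document
character set declaration, and only single-octet (8-bit) character sets
are handled. It shows that the number 215 resolves to the multiply sign
when it is always read against ISO-8859-1 (the draft's current wording),
and to a different character when it is read against a declared
ISO-8859-6 character set (the SGML reading quoted above).

```python
import unicodedata

CODE = 215  # the number carried by the reference &#215;


def resolve_fixed_latin1(code: int) -> str:
    """Draft 2.0 wording: the number always indexes ISO-8859-1."""
    return bytes([code]).decode("iso8859_1")


def resolve_in_charset(code: int, charset: str) -> str:
    """SGML reading: the number indexes the declared character set.

    `charset` is a Python codec name standing in for that declaration
    (an assumption of this sketch, not something the spec defines).
    """
    return bytes([code]).decode(charset)


if __name__ == "__main__":
    readings = [
        ("always ISO-8859-1", resolve_fixed_latin1(CODE)),
        ("declared ISO-8859-1", resolve_in_charset(CODE, "iso8859_1")),
        ("declared ISO-8859-6", resolve_in_charset(CODE, "iso8859_6")),
    ]
    for label, ch in readings:
        # Print the code point and its Unicode name for comparison.
        print(f"{label}: U+{ord(ch):04X} {unicodedata.name(ch)}")
```

The two readings agree when the declared character set is ISO-8859-1 and
diverge otherwise, which is the conflict being discussed. The "mim" in
the quoted message was offered only as a guess; the sketch simply reports
whatever character the ISO-8859-6 table assigns to 215.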

> So the language in the 2.0 spec describes an apparently illegal
> variant of SGML parsing, and has no force, as we also say
>
> When the above conflicts with the SGML standard, the SGML standard
> may be ignored. [in the Character Set section]
>
> and presumably the same warning applies to all those parts of
> the spec that describe SGML functionality (Roy points out that
> the spec also describes browser behavior).

No, it applies to user agent functionality. User agent is a well-defined
term in MIME and HTTP (we could add it to the HTML spec as well, if you want).

> So why isn't the language "character octet references are not
> dependent on the character set encoding of the document" bogus
> and to be deleted?

Well, we can't just delete it -- a suitable (and conformant) alternative
should be proposed.

....Roy T. Fielding Department of ICS, University of California, Irvine USA
<fielding@ics.uci.edu>
<URL:http://www.ics.uci.edu/dir/grad/Software/fielding>