Re: Parsing &#60; and &lt;

Joe English (joe@trystero.art.com)
Thu, 20 Apr 95 21:13:34 EDT

Luke <ylu@ccwf.cc.utexas.edu> wrote:

> Please tell me which part of the spec states the _exact_ difference between
> named reference and numeric reference. Section 13. of the current spec
> (http://www.ics.uci.edu/pub/ietf/html/draft-ietf-html-spec-03.txt)
> Character Entity Sets, esp. 13.1 Numeric and Special Graphic Entity Set is
> not clear for this question...

[...]

> My question has nothing to do with the HTML DTD (grep the DTD you'll find
> no &#60; or &lt;). It seems to be more like a SGML question. Which
> behavior is conforming? or is it undefined? Tell me, SGML gurus...

Yep, it's an SGML question.

Clause 9.5 "Character Reference", lines 10-13 (Goldfarb p. 357), states:

    (The) replacement character (of a numeric character reference)
    is treated as though it were entered directly except that
    *the replacement for a numeric character reference is always
    treated as data* in the context in which the replacement occurs.

[Parenthetical comments and emphasis mine.]

The precise difference between character references and general
(or "named") entity references can be subtle, since it depends
on how the entity text was declared, but in this case 'lt'
and 'gt' are CDATA data entities so they are also always treated as
data.

So, in HTML document content, "&lt;foo&gt;" should be identical to
"&#60;foo&#62;".
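A quick way to check this equivalence is to feed both forms to a parser and
compare what comes out. The sketch below is only an illustration: it assumes
Python's standard html.parser module as a stand-in (it is an HTML tokenizer,
not a conforming SGML parser), and the EventRecorder class and parse helper
are hypothetical names made up for this example.

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Hypothetical helper: records start-tags and character data separately."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        # Called only when the parser recognizes real markup.
        self.tags.append(tag)

    def handle_data(self, data):
        # Called for character data, including expanded references.
        self.text.append(data)


def parse(content):
    """Return (start-tags seen, concatenated character data) for content."""
    recorder = EventRecorder()
    recorder.feed(content)
    recorder.close()
    return recorder.tags, "".join(recorder.text)


named = parse("&lt;foo&gt;")      # named entity references
numeric = parse("&#60;foo&#62;")  # numeric character references
literal = parse("<foo>")          # literal markup, for contrast

print(named)
print(numeric)
print(literal)
```

With this stand-in parser, the two reference forms should both come back as
the data characters "<foo>" with no start-tag reported, while the literal
form is reported as a start-tag with no data.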

I don't have a copy of Netscape 1.1b3 handy, but Netscape 1.0N,
Lynx 2.3 BETA, and NCSA Mosaic for X version 2.4 all do the right thing.

--Joe English

joe@trystero.art.com

[

As an aside, note that if you have:

<!ENTITY lt CDATA "<">
<!ENTITY gt CDATA ">">

<!ENTITY e1 "&lt;foo&gt;">
<!ENTITY e2 "&#60;foo&#62;">

then &e1; and &e2; are *not* identical -- since the &#60; and &#62;
character references are expanded when 'e2' is declared, &e2; will
be parsed as a start-tag, while &e1; will be parsed as data.
I recently got bit by this.

]
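The two-stage behavior in the aside can be modeled with a toy script. This is
a rough sketch under stated assumptions, not an SGML parser: it assumes that
numeric character references in an entity's literal are expanded once at
declaration time, that 'lt' and 'gt' are CDATA entities whose replacement is
always data, and that a tag looks like <name>. The names declare, parse,
CDATA_ENTITIES, and TOKEN are hypothetical and exist only for this example.

```python
import re

# Assumption: 'lt' and 'gt' are CDATA entities, so their replacement is data.
CDATA_ENTITIES = {"lt": "<", "gt": ">"}

# Toy token grammar: a bare start-tag, a named reference, or a numeric reference.
TOKEN = re.compile(r"<([A-Za-z][A-Za-z0-9]*)>|&([A-Za-z]+);|&#([0-9]+);")


def declare(literal):
    """Model the declaration step: numeric character references in the
    entity literal are expanded now; named references are left alone."""
    return re.sub(r"&#([0-9]+);", lambda m: chr(int(m.group(1))), literal)


def parse(text):
    """Model the reference step: scan replacement text into
    ('tag', name) and ('data', chars) events."""
    events = []

    def add_data(chars):
        # Merge adjacent data so the result is easy to compare.
        if not chars:
            return
        if events and events[-1][0] == "data":
            events[-1] = ("data", events[-1][1] + chars)
        else:
            events.append(("data", chars))

    pos = 0
    for match in TOKEN.finditer(text):
        add_data(text[pos:match.start()])
        tag, name, number = match.groups()
        if tag:
            events.append(("tag", tag))
        elif name:
            add_data(CDATA_ENTITIES[name])  # CDATA entity: data, never markup
        else:
            add_data(chr(int(number)))      # numeric reference: always data
        pos = match.end()
    add_data(text[pos:])
    return events


e1 = declare("&lt;foo&gt;")      # stored with the named references intact
e2 = declare("&#60;foo&#62;")    # stored as the literal characters <foo>

print(parse(e1))
print(parse(e2))
```

In this model, the stored text of e1 still contains the named references, so
referencing it yields only data, whereas the stored text of e2 already
contains a literal "<" and is scanned as a start-tag.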