Re: HTML 2.0 LAST CALL: Numeric character refs

Daniel W. Connolly (connolly@beach.w3.org)
Fri, 2 Jun 95 12:47:08 EDT

In message <9506020810.ZM23914@dmg.west.ora.com>, "Terry Allen" writes:
>
>Because this section is about entities, not numeric charrefs, which
>are dealt with elsewhere (grep for 10646).

Huh? This section (3.2.1. Undeclared Markup Error Handling) is about
how to handle screwy markup.

>| > I strongly urge we stay with the
>| >present language here, much as I feel your pain.
>|
>| Personally, I don't give a flying flip one way or the other. I'm
>| pretty tired of specifying what HTML user agents should do when the
>| modem introduces line noise into the document, your baby brother pukes
>| on it, and the stars align to signal the end of the world. An error
>| is an error. Deal with it.
>
>As you may recall, we are talking about SGML conformance here.

Huh? We're talking about how to handle errors -- in general, whether
to throw out the erroneous markup or display it as data characters,
and in particular, what to do with &#NNN; where NNN is outside
the domain of the document character set.

Perhaps this is clearer:

|On the other hand, references to undeclared entities and undefined
|numeric character references (i.e. references to code positions that
|are not in the domain of the document character set) should be treated
|as data characters.

>If you parse this document
>
><!doctype html system "html.dtd">
><title>chars</title>
><p>charref: &#62123;
>
>with sgmls and the HTML sdecl you get in the error stream:
>
>sgmls: SGML error at teal.html, line 3 at ";":
> Numeric character reference exceeds 255; reference ignored
>
>and in the output:
>
>AVERSION CDATA -//IETF//DTD HTML 2.0//EN
>ASDAFORM CDATA Book
>(HTML
>(HEAD
>ASDAFORM CDATA Ti
>(TITLE
>-chars
>)TITLE
>)HEAD
>(BODY
>ASDAFORM CDATA Para
>(P
>-charref:
>)P
>)BODY
>)HTML
>
>
>Notice that the NCR is not in the output.

Notice also that there is no "C" at the end of the output; i.e. the
document is not conforming. Ignoring &#62123; altogether is one way to
handle the error. The HTML 2.0 specification suggests another.

> There is thus no way
>to convert it to a text string.

Sure there is: pretend you never recognized the characters as markup,
and just treat them as data characters.
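
To make the two error-handling choices concrete, here is a minimal
sketch, not taken from the spec or from sgmls. The function name
expand_charrefs, the "undefined" parameter, and the DEFINED set are
hypothetical; the set assumes an ISO 8859-1 based document character
set, and the pattern only recognizes references closed by ";".

```python
import re

# Assumption: code positions taken to be defined in the document
# character set (ISO 8859-1 based). Adjust to the SGML declaration in use.
DEFINED = {9, 10, 13} | set(range(32, 127)) | set(range(160, 256))

# Simplification: only references of the form &#NNN; are recognized.
NCR = re.compile(r"&#([0-9]+);")


def expand_charrefs(text, undefined="data"):
    """Expand numeric character references in text.

    undefined="data":   leave an undefined reference in place, as if it
                        had never been recognized as markup.
    undefined="ignore": drop an undefined reference from the output.
    """
    def replace(match):
        code = int(match.group(1))
        if code in DEFINED:
            return chr(code)
        if undefined == "ignore":
            return ""
        # Treat the reference itself as data characters.
        return match.group(0)

    return NCR.sub(replace, text)


print(expand_charrefs("charref: &#62123;"))
print(expand_charrefs("charref: &#62123;", undefined="ignore"))
```

With undefined="data" the reader sees the literal text &#62123; with
undefined="ignore" the reference simply disappears. Either way the
document is still in error; the only question is what the user sees.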

> That will have to wait until
>we agree upon 10646 as the doc charset.

Huh? What does 10646 have to do with the price of tea in China?

>I repeat my opposition to Dave's proposed language. We spent too
>much time on this matter to regress in this fashion, and if we
>specify HTML so that it is not conformant to 8879, we will
>deserve what we get if people ignore our spec.

It's not that big a deal: we're not going against 8879; it's just
one more "should" in the interest of consistent error handling,
which I sensed from the working group is a good thing.

I'll take it out if Mr. Morris will stipulate.

Dan