Undeclared entities, weird numeric character references

Dan Connolly (connolly@w3.org)
Fri, 5 May 95 19:46:31 EDT

dwm@shell.portal.com writes:
>
>
> On Fri, 5 May 1995, Alex Hopmann wrote:
>
> > 2) HTML 2.0 uses 10646. We say that minimally compliant browsers must only
> > support the first 256 positions, or in other words Latin-1. A reference to
> > &#2789; gets rounded to 8 bits like Glenn found from experience. People
>
> Seems to me as a publisher and reader of published material, there is
> no conceptual difference between &#2789; and &xxx; where the rendering
> program doesn't understand what they mean.

Hmmm... as an SGML implementor, they're quite distinct to me. There
was some confusing terminology like "character octet entities" and
"numeric entity references" thrown around for a while. To be clear:
the terms are "numeric character reference" and "entity reference".
Numeric character references have nothing to do with entities.

The term "character entity" is strictly informal: it's just a text
entity that happens to be one character long.
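To make the distinction concrete, here is a minimal Python sketch (an
illustration only, not anything from the spec). The names classify and
truncate_to_8_bits are hypothetical, and reading "rounded to 8 bits" as
"keep the low 8 bits" is an assumption about what the browsers in
question do:

```python
import re

# Numeric character reference: "&#" followed by decimal digits.
NUMERIC_CHAR_REF = re.compile(r"&#([0-9]+);?")
# Entity reference: "&" followed by an entity name.
ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9.-]*);?")


def classify(ref):
    """Return ("numeric", code_position) or ("entity", name), else None."""
    m = NUMERIC_CHAR_REF.fullmatch(ref)
    if m:
        # Refers directly to a position in the document character set;
        # no entity declaration is involved.
        return ("numeric", int(m.group(1)))
    m = ENTITY_REF.fullmatch(ref)
    if m:
        # Refers to a named entity, which must be declared somewhere.
        return ("entity", m.group(1))
    return None


def truncate_to_8_bits(code_position):
    """Assumed behavior: keep only the low 8 bits of the code position."""
    return code_position & 0xFF
```

Under that assumption, classify("&#2789;") yields a numeric reference to
position 2789, which such a browser would treat as position 229, while
classify("&alpha;") yields an entity reference whose meaning depends on
a declaration.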

But anyway...

> I would expect (and have
> seen for &nbsp; at least) browsers to just leave the unknown entity
> as written in the text.

Yup.

> Irrespective of the ultimate document character set, should the standard
> spell out handling of undefined entities?

The spec currently says this:

"Undeclared Markup Error Handling"
http://www.w3.org/hypertext/WWW/MarkUp/html-spec/html-spec_3.html#SEC18

|To facilitate experimentation and interoperability between
|implementations of various versions of HTML, the installed base of
|HTML user agents supports a superset of the HTML 2.0 language by
|reducing it to HTML 2.0: markup in the form of a start-tag or end-tag
|whose generic identifier is not declared is mapped to nothing during
|tokenization. Undeclared attributes are treated similarly. The entire
|attribute specification of an unknown attribute (i.e., the unknown
|attribute and its value, if any) should be ignored. On the other hand,
|references to undeclared entities should be treated as data
|characters.
|
|For example:
|
|<div class=chapter><h1>foo</h1><p>...</div>
| => <H1>,"foo",</H1>,<P>,"..."
|xxx <P ID=z23> yyy
| => "xxx ",<P>," yyy"
|Let &alpha; and &beta; be finite sets.
| => "Let &alpha; and &beta; be finite sets."
|
|Support for notifying the user of such errors is encouraged.
|
|Information providers are warned that this convention is not binding:
|unspecified behavior may result, as such markup is not conforming to
|this specification.
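The convention quoted above can be sketched roughly as follows. This is
a toy Python tokenizer, not a real SGML parser: the tables
DECLARED_ELEMENTS and DECLARED_ENTITIES are hypothetical, deliberately
tiny stand-ins for the DTD, and the regular expressions ignore comments,
marked sections, and ">" inside attribute values.

```python
import re

# Hypothetical mini "DTD": element name -> set of declared attribute names.
DECLARED_ELEMENTS = {"H1": set(), "P": set(), "A": {"HREF"}}
# Hypothetical declared entities and their replacement text.
DECLARED_ENTITIES = {"amp": "&", "lt": "<", "gt": ">"}

NAME = r"[A-Za-z][A-Za-z0-9.-]*"
TOKEN = re.compile(r"<(/?)(%s)([^>]*)>|&(%s);" % (NAME, NAME))
ATTR = re.compile(r"(%s)(?:\s*=\s*(\"[^\"]*\"|'[^']*'|[^\s>]+))?" % NAME)


def tokenize(text):
    """Return a list of tags and data strings, dropping undeclared markup."""
    tokens, data = [], []

    def flush():
        s = "".join(data)
        data.clear()
        if s:
            tokens.append(s)

    pos = 0
    for m in TOKEN.finditer(text):
        data.append(text[pos:m.start()])
        pos = m.end()
        if m.group(2):  # start-tag or end-tag
            name = m.group(2).upper()
            if name not in DECLARED_ELEMENTS:
                continue  # undeclared element: mapped to nothing
            flush()
            if m.group(1):
                tokens.append("</%s>" % name)
            else:
                # Keep declared attributes; ignore undeclared ones entirely.
                kept = [a.group(0) for a in ATTR.finditer(m.group(3))
                        if a.group(1).upper() in DECLARED_ELEMENTS[name]]
                tokens.append("<" + " ".join([name] + kept) + ">")
        else:  # entity reference
            # Undeclared entity references stay in the text as data characters.
            data.append(DECLARED_ENTITIES.get(m.group(4), m.group(0)))
    data.append(text[pos:])
    flush()
    return tokens
```

Fed the three inputs from the spec's example, this sketch produces the
token sequences shown there: the DIV tags vanish, the ID attribute is
dropped from P, and &alpha; and &beta; pass through untouched.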

It doesn't say anything about numeric character references that aren't
in the document character set. I'd prefer to leave it that way.

Dan