Hmmm... as an SGML implementor, they're quite distinct to me. There
was some confusing terminology like "character octet entities" and
"numeric entity references" thrown around for a while. To be clear:
the terms are "numeric character reference" and "entity reference".
Numeric character references have nothing to do with entities.
The term "character entity" is strictly informal: it's just a text
entity that happens to be one character long.
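A rough sketch of the distinction, in Python. The regular expressions and the name `classify` are mine, purely for illustration, and they cover only the decimal `&#NNN;` form and plain entity names, not everything SGML allows:

```python
import re

# Numeric character reference: "&#" followed by decimal digits, e.g. &#233;
# It names a position in the document character set; no entity is involved.
NUMERIC_CHAR_REF = re.compile(r"&#([0-9]+);?")

# Entity reference: "&" followed by a name, e.g. &eacute;
# It names an entity that must be declared somewhere (e.g. in the DTD).
ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9.-]*);?")

def classify(ref):
    """Say which kind of reference `ref` is, judged by syntax alone."""
    if NUMERIC_CHAR_REF.fullmatch(ref):
        return "numeric character reference"
    if ENTITY_REF.fullmatch(ref):
        return "entity reference"
    return "neither"
```

So `classify("&#233;")` gives "numeric character reference" and `classify("&eacute;")` gives "entity reference", even though both may end up as the same character.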
But anyway...
> I would expect (and have
> seen for at least) browsers to just leave the unknown entity
> as written in the text.
Yup.
> Irrespective of the ultimate document character set, should the standard
> spell out handling of undefined entities?
The spec currently says this:
"Undeclared Markup Error Handling"
http://www.w3.org/hypertext/WWW/MarkUp/html-spec/html-spec_3.html#SEC18
|To facilitate experimentation and interoperability between
|implementations of various versions of HTML, the installed base of
|HTML user agents supports a superset of the HTML 2.0 language by
|reducing it to HTML 2.0: markup in the form of a start-tag or end-tag
|whose generic identifier is not declared is mapped to nothing during
|tokenization. Undeclared attributes are treated similarly. The entire
|attribute specification of an unknown attribute (i.e., the unknown
|attribute and its value, if any) should be ignored. On the other hand,
|references to undeclared entities should be treated as data
|characters.
|
|For example:
|
|<div class=chapter><h1>foo</h1><p>...</div>
| => <H1>,"foo",</H1>,<P>,"..."
|xxx <P ID=z23> yyy
| => "xxx ",<P>," yyy
|Let &alpha; and &beta; be finite sets.
| => "Let &alpha; and &beta; be finite sets."
|
|Support for notifying the user of such errors is encouraged.
|
|Information providers are warned that this convention is not binding:
|unspecified behavior may result, as such markup is not conforming to
|this specification.
It doesn't say anything about numeric character references that aren't
in the document character set. I'd prefer to leave it that way.
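For what it's worth, here is a minimal sketch of that recovery rule in Python. It is not how any particular browser does it: the names `tokenize`, `KNOWN_ELEMENTS` and `KNOWN_ENTITIES` are hypothetical, the declared sets are a tiny made-up subset of HTML 2.0, and to stay short it drops every attribute specification rather than only the undeclared ones. It also leaves numeric character references alone, since the spec text is silent on them.

```python
import re

# Hypothetical, deliberately tiny stand-ins for the HTML 2.0 declarations.
KNOWN_ELEMENTS = {"H1", "P"}
KNOWN_ENTITIES = {"amp": "&", "lt": "<", "gt": ">"}

# Either a start-tag/end-tag (groups 1-3) or an entity reference (group 4).
TOKEN = re.compile(
    r"<(/?)([A-Za-z][A-Za-z0-9.-]*)([^>]*)>"
    r"|&([A-Za-z][A-Za-z0-9.-]*);"
)

def tokenize(text):
    """Split `text` into tags and data, recovering from undeclared markup."""
    tokens = []
    data = ""   # data characters collected since the last emitted tag
    pos = 0
    for m in TOKEN.finditer(text):
        data += text[pos:m.start()]
        pos = m.end()
        if m.group(2) is not None:
            name = m.group(2).upper()
            if name in KNOWN_ELEMENTS:
                if data:
                    tokens.append(data)
                    data = ""
                # Simplification: all attributes are dropped in this sketch.
                tokens.append("<" + m.group(1) + name + ">")
            # Undeclared generic identifier: the tag maps to nothing.
        else:
            # Declared entity: expand it. Undeclared: keep the reference
            # text itself as data characters.
            data += KNOWN_ENTITIES.get(m.group(4), m.group(0))
    data += text[pos:]
    if data:
        tokens.append(data)
    return tokens
```

With those assumptions, `tokenize("xxx <P ID=z23> yyy")` returns `["xxx ", "<P>", " yyy"]`, and a string containing `&alpha;` comes back with the reference left in the data exactly as written.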
Dan