Re: Undeclared entities, weird numeric character references

David - Morris (
Mon, 8 May 95 01:56:28 EDT

On Fri, 5 May 1995, Dan Connolly wrote:

> writes:
> Hmmm... as an SGML implementor, they're quite distinct to me. There
> was some confusing terminology like "character octet entities" and
> "numeric entity references" thrown around for a while. To be clear:
> the terms are "numeric character reference" and "entity reference".
> Numeric character references have nothing to do with entities.

> But anyway...
> > I would expect (and have
> > seen for   at least) browsers to just leave the unknown entity
> > as written in the text.
> Yup.
> > Irrespective of the ultimate document character set, should the standard
> > spell out handling of undefined entities?
> The spec currently says this;
> "Undeclared Markup Error Handling"
> |To facilitate experimentation and interoperability between
> |implementations of various versions of HTML, the installed base of
> |HTML user agents supports a superset of the HTML 2.0 language by
> |reducing it to HTML 2.0: markup in the form of a start-tag or end-tag
> |whose generic identifier is not declared is mapped to nothing during
> |tokenization. Undeclared attributes are treated similarly. The entire
> |attribute specification of an unknown attribute (i.e., the unknown
> |attribute and its value, if any) should be ignored. On the other hand,
> |references to undeclared entities should be treated as data
> |characters.

> It doesn't say anything about numeric character references that aren't
> in the document character set. I'd prefer to leave it that way.

As an engineer, I think things which look and feel about the same ought to
have similar handling when not otherwise defined. A group of characters
which starts with & and ends with ; and is replaced with one or more other
characters if handled correctly sure looks, smells and feels the same to me.

An undefined entity and a numeric character reference which can't be
resolved by an implementation share a significant lack of utility to
the end user and also appear quite similar if viewed in the raw.

I believe both should be treated the same ... left as is in the text
presented to the user if they can't be resolved in a way that conforms
to the author's expectations as defined by the standards.
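
To make the proposal concrete, here is a minimal sketch in Python of the
"leave it as written" rule applied uniformly to both kinds of reference.
It is an illustration under stated assumptions, not how any particular
browser works: the entity table is a tiny hypothetical subset, the set of
displayable code points is assumed to be the printable part of ISO 8859-1,
and the name expand_references is invented for this example.

```python
import re

# Hypothetical, deliberately tiny entity table; a real user agent would
# carry every entity declared by the DTD it supports.
KNOWN_ENTITIES = {"amp": "&", "lt": "<", "gt": ">", "quot": '"', "nbsp": "\u00a0"}

# Matches either a numeric character reference (&#233;) or an
# entity reference (&lt;), both terminated by a semicolon.
REFERENCE = re.compile(r"&(#[0-9]+|[A-Za-z][A-Za-z0-9.-]*);")


def is_displayable(code):
    """Assumption: only printable ISO 8859-1 code points can be shown."""
    return 32 <= code <= 126 or 160 <= code <= 255


def expand_references(text):
    """Expand the references we can resolve; leave all others untouched."""

    def replace(match):
        body = match.group(1)
        if body.startswith("#"):
            # Numeric character reference: resolve only if displayable.
            code = int(body[1:])
            return chr(code) if is_displayable(code) else match.group(0)
        # Entity reference: undeclared names fall through as data characters.
        return KNOWN_ENTITIES.get(body, match.group(0))

    return REFERENCE.sub(replace, text)
```

With this sketch, expand_references("caf&#233; &wxyz; &#54321;") resolves
only the first reference and passes the other two through exactly as the
author wrote them, so the reader sees the same raw text in both failure
cases.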

The result may be ugly but at least no information is lost and the
user doesn't see bizarre characters on the screen which have no
correlation with the author's intent. In the end, I think what we
are doing with the WWW is trying to maximize the potential that
authors know how to write stuff for delivery to end users. All of
our standards and other machinations should be to that end.

A user who sees &wxyz; may from other sources understand the semantic
meaning or at least understand that their UA isn't handling content
provided by their information source. The same is true for &#54321;.
If either of these is subject to 'correct' and 'permitted' behavior
which transforms either to something else which is not required to
uniquely correlate with the original input, then no one is served by
the transformation, notwithstanding the transformation's conformance to
HTML, SGML, and ISOxyzab.

Dave Morris