<XMP> and <LISTING>, declared content (Was Re: HTML 2.0 LAST CALL: ...)

Joe English (joe@trystero.art.com)
Fri, 2 Jun 95 21:12:38 EDT

Arthur.Vanhoff@Eng.Sun.COM (Arthur van Hoff) wrote:
> > [joe@art.com wrote:]
> > CDATA declared content is PURE EVIL, and I'd like to
> > see <XMP> and <LISTING> pushed as far away as possible
> > from the other more respectable block elements.
> >
> > Calling them "deprecated" is not strong enough for me.
> > They were "obsolete" as of May 31, and I *strongly urge* that
> > they stay that way until HTML 2.1, when they should
> > be eliminated altogether.
> Could you explain why you think that elements with a CDATA content
> model are bad? It seems like perfectly valid and useful SGML construct
> to me. PRE is annoying if you simply want to list verbatim...

Not just bad. PURE EVIL.

There are several reasons. CDATA and RCDATA declared
content (along with DATATAG, another bad idea) are the
only SGML features whereby the element structure can change the
delimiter recognition mode. This has several practical consequences.

Suppose I write a search engine with support for SGML documents.
If there are no elements with declared content, then determining what's
data (and should be indexed) and what's markup (and should not be)
is a fairly straightforward lexical analysis task. On the other hand,
the presence of elements like <XMP> and <LISTING> either requires
special-case hacks in the scanner for each supported DTD or a much more
comprehensive analysis of the document structure than would
otherwise be necessary.
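
To make the scanner problem concrete, here is a minimal sketch in Python.
It is not a real SGML parser: the MARKUP pattern is a deliberately
simplified assumption (comments, tags, entity references only), and the
names indexable_text and CDATA_ELEMENTS are hypothetical. The point is
only that the "pure lexical" version needs a DTD-specific special case
as soon as elements with CDATA declared content exist.

```python
import re

# Simplified, assumed token pattern: comments, tags, entity references.
MARKUP = re.compile(r"<!--.*?-->|</?[A-Za-z][^>]*>|&#?[A-Za-z0-9.-]+;?", re.S)

# End-tag open in context: "</" followed by a name start character.
ETAGO = re.compile(r"</[A-Za-z]")

# The DTD-specific hack: element names with CDATA declared content.
CDATA_ELEMENTS = frozenset({"xmp", "listing"})


def indexable_text(doc, cdata_elements=frozenset()):
    """Return the data portions of doc, skipping anything that is markup."""
    out = []
    pos = 0
    while pos < len(doc):
        m = MARKUP.search(doc, pos)
        if m is None:
            out.append(doc[pos:])
            break
        out.append(doc[pos:m.start()])
        pos = m.end()
        # Special case: after <XMP> or <LISTING>, everything up to the
        # next "</" + name start character is data, whatever it looks like.
        tag = re.match(r"<([A-Za-z][A-Za-z0-9.-]*)", m.group())
        if tag and tag.group(1).lower() in cdata_elements:
            end = ETAGO.search(doc, pos)
            stop = end.start() if end else len(doc)
            out.append(doc[pos:stop])
            pos = stop
    return "".join(out)


doc = "<p>a &amp; b</p><xmp><b>&foo;</xmp>"
naive = indexable_text(doc)                  # treats <b> and &foo; as markup
aware = indexable_text(doc, CDATA_ELEMENTS)  # treats them as data
```

Without the special case the scanner drops "<b>&foo;" from the index;
with it, the scanner has to know which element names the DTD declares
as CDATA.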

Suppose also I want to use this engine as a server back-end,
and insert PIs or other markup to identify search hits in
indexed documents before returning them to the browser.
(There was talk of just this sort of application on html-wg
a while back; in fact, it's part of the rationale behind the
proposed HTML 3 <SPOT> element.) Now if there's a hit inside
an <XMP> element, there is *no way* to insert a hit marker
since the browser will treat it as data.

Not to mention that it complicates browser implementations,
and isn't really all that useful to begin with.

<!-- this looks just like a comment, but it really isn't -->
This: &foo; looks just like an entity reference, but it isn't.
This: <bar> looks just like a start-tag, but it isn't.
This: </baz> looks just like an end-tag, and guess what?
It *is* an end-tag. Oops.

There is no way to include an end-tag open delimiter-in-context
inside an element with CDATA or RCDATA declared content, just
like you can't include the string "\end{verbatim}" in a LaTeX
verbatim environment. Now the string "</" followed by a name
start character probably doesn't show up in "verbatim" text very
often, unless you're discussing SGML itself, but it's still
problematic. (You can't include the sequence "]]>", either.)
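
The rule can be sketched in a few lines of Python. This is an
illustration only, assuming ASCII letters as the name start characters;
split_cdata_content is a hypothetical helper, not part of any parser.

```python
import re

# End-tag open delimiter-in-context: "</" followed by a name start character.
ETAGO_IN_CONTEXT = re.compile(r"</(?=[A-Za-z])")


def split_cdata_content(text):
    """Split the text after a start-tag such as <XMP> into (data, rest).

    The data ends at the first "</" followed by a letter, whether or not
    the name that follows matches the element that was opened.
    """
    m = ETAGO_IN_CONTEXT.search(text)
    if m is None:
        return text, ""
    return text[:m.start()], text[m.start():]


data, rest = split_cdata_content("This: <bar> is data. This: </baz> is not.</XMP>")
```

Here data stops just before "</baz>", so the intended "</XMP>" is never
reached as the end of the verbatim text.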

The *right* way to suppress delimiter recognition
is with a <![ CDATA [ marked section ]]>. Marked
sections are independent of element boundaries, so
they are much more flexible: if you need to add
an entity reference (or PI or comment declaration or...)
you can close the marked section, add the markup,
and open a new one without disturbing the document
structure. This is not the case with <XMP> and <LISTING>.
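
As a sketch of the close / add markup / reopen pattern, the Python below
wraps verbatim text in CDATA marked sections and drops a marker around a
search hit. The function name mark_hit and the PIs "<?hit-start>" and
"<?hit-end>" are made up for illustration; any markup could go in their
place.

```python
MS_OPEN = "<![CDATA["
MS_CLOSE = "]]>"


def mark_hit(verbatim, start, end,
             open_marker="<?hit-start>", close_marker="<?hit-end>"):
    """Wrap verbatim in CDATA marked sections, with markers around [start:end].

    The marked section is closed before each marker and reopened after it,
    so the markers are recognized as markup while the text stays verbatim.
    """
    if MS_CLOSE in verbatim:
        # "]]>" would end the marked section early; this sketch refuses it.
        raise ValueError('verbatim text must not contain "]]>"')
    return (
        MS_OPEN + verbatim[:start] + MS_CLOSE
        + open_marker
        + MS_OPEN + verbatim[start:end] + MS_CLOSE
        + close_marker
        + MS_OPEN + verbatim[end:] + MS_CLOSE
    )


marked = mark_hit("a <b> c", 2, 5)
```

The element structure around the text is untouched; only the marked
section boundaries move.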

Now if only browser implementors would support marked sections...
In the meantime, we're stuck with <PRE> and
sed -e "s/&/\&amp;/g" -e "s/</\&lt;/g"
(or the equivalent on your platform of choice).
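
For instance, a rough Python equivalent of that sed command might look
like this (escape_for_pre is a hypothetical name):

```python
def escape_for_pre(text):
    """Escape "&" and "<" so text can be placed inside <PRE> as data."""
    # Order matters: "&" must be replaced first, otherwise the "&" in
    # the freshly inserted "&lt;" would itself be escaped.
    return text.replace("&", "&amp;").replace("<", "&lt;")


escaped = escape_for_pre("if (a < b && c)")
```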

To summarize my position:

  - <XMP> and <LISTING> should remain obsolete, and should be
    removed as soon as is convenient;
  - CDATA and RCDATA declared content should not be
    used for any new elements in future HTML versions;
  - Browser authors should be strongly encouraged to
    support marked sections [*].

--Joe English


[*] and null end tags, too, while they're at it, since they can
save <em/many more/ keystrokes than the ill-advised <XMP>.

P.S. My next rant will be on the FORM element inclusion exceptions.
After the RFC is published, of course :-)