Re: #PC Definition

Joe English (joe@trystero.art.com)
Wed, 2 Nov 94 14:38:38 EST

newtonjs@vnet.net (Stan Newton) wrote:

> Peter Flynn said in private message:
>
> >#PCDATA is <em>permitted</em>
> >to contain markup, but doesn't <em>have</em> to: CDATA and SDATA must
> >not contain markup.

The difference between #PCDATA in a content model,
and CDATA and RCDATA declared content (not SDATA, btw,
that applies to entities, not element content)
is in what markup is *recognized*.

If an element has mixed content:

<!element title - - (#PCDATA) >
<!element el-mixed - - (foo|bar|#PCDATA)* >

then all markup delimiters are recognized.
Entity references are allowed, but whether
or not an element may appear is determined
by the rest of the content model.

In other words:

<el-mixed>You can have <foo>foo</foo>s and <bar>bar</bar>s
here, but no other subelements.
</el-mixed>

<!-- error -->
<title>You can't have any <foo>foo</foo>s here oops.
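
The same point as a rough Python sketch. The table is a hand-written stand-in for the two declarations above, not something a parser would hand you, and `child_allowed` is a made-up helper name. All it shows is that the content model, not #PCDATA itself, decides which subelements may appear.

```python
# Hypothetical table mirroring the two declarations above: the
# subelements each element admits alongside its #PCDATA.
ALLOWED_CHILDREN = {
    "title": set(),              # (#PCDATA): character data only
    "el-mixed": {"foo", "bar"},  # (foo|bar|#PCDATA)*
}

def child_allowed(parent, child):
    """True if `child` may occur directly inside `parent`."""
    return child in ALLOWED_CHILDREN[parent]

print(child_allowed("el-mixed", "foo"))  # True
print(child_allowed("title", "foo"))     # False
```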

If an element has CDATA declared content:

<!ELEMENT el-cdata - - CDATA>

then only the ETAGO (end-tag open, </ followed by a name start
character) is recognized as markup; all other characters are
treated as data. (The null end-tag delimiter / is also
recognized if it's applicable.)

In other words,

<el-cdata>
This thing here: <tag> looks like a start tag, but it's not.
This: &foo; looks like an entity reference, but it isn't either.
This: </ might have been an end-tag open delimiter if it were
followed by a letter, like </this one is oops
<!-- prematurely terminated the el-cdata element, even though
the end-tag is invalid. All it takes is the ETAGO.
-->
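
The termination rule can be sketched in a few lines of Python. This is only an illustration: it assumes the reference concrete syntax (ETAGO is `</` and a name start character is an ASCII letter), it ignores the null end-tag case, and `split_cdata_content` is a made-up name.

```python
import re

# ETAGO is only recognized when "</" is followed by a name start
# character; a bare "</" is plain data.
ETAGO = re.compile(r"</(?=[A-Za-z])")

def split_cdata_content(text):
    """Split text into (data, rest) at the first recognized ETAGO.

    Everything before the ETAGO is data, including things that look
    like start-tags or entity references.
    """
    m = ETAGO.search(text)
    if m is None:
        return text, ""
    return text[:m.start()], text[m.start():]

sample = "a <tag>, an &foo;, a bare </ and then </this one is oops"
data, rest = split_cdata_content(sample)
print(repr(data))  # 'a <tag>, an &foo;, a bare </ and then '
print(repr(rest))  # '</this one is oops'
```

Note that the split happens at `</this` whether or not `this` names the open element, which is exactly the trap described above.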

In RCDATA declared content:

<!ELEMENT el-rcdata - - RCDATA>

the delimiter that would terminate the element is recognized,
just as in CDATA declared content, and entity references
(and character references) are expanded as well.

<el-rcdata>
This: <tag> looks like a start-tag, but it isn't.
This: &foo; *is* an entity reference.
<!-- This looks like a comment, but it isn't -->
</el-rcdata>
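
A comparable sketch for RCDATA, under the same assumptions as for CDATA (reference concrete syntax, no null end-tag). The entity table is passed in by hand, and undefined references are left untouched rather than reported as errors. Character references and the finer points of scanning replacement text are left out. `read_rcdata` is a made-up name.

```python
import re

ETAGO = re.compile(r"</(?=[A-Za-z])")
# General entity reference: "&", a name, and an optional ";".
ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9.-]*);?")

def read_rcdata(text, entities):
    """Return the element's data: cut at the first ETAGO, then expand
    entity references found in the `entities` dict."""
    m = ETAGO.search(text)
    raw = text if m is None else text[:m.start()]

    def expand(ref):
        # Unknown names are left as written (a real parser would complain).
        return entities.get(ref.group(1), ref.group(0))

    return ENTITY_REF.sub(expand, raw)

body = "<tag> &foo; <!-- not a comment --></el-rcdata>"
print(read_rcdata(body, {"foo": "FOO"}))
# <tag> FOO <!-- not a comment -->
```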

> Very confusing still!

Yep. That's why it's a good idea not to use
CDATA and RCDATA declared content for any elements
in the DTD.

The recommended practice is to use marked sections in
the document:

<![ CDATA [
Anything at all can appear here & it will all
be treated as data, except for the <MSC> closing
delimiter, which looks like this: ]]>

Most HTML browsers choke on these though.
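
For comparison, the marked-section rule is simpler still: nothing but the closing `]]>` is recognized. A minimal Python sketch, assuming the keyword is written literally as CDATA (not supplied through a parameter entity, and not in lower case); `cdata_sections` is a made-up name.

```python
import re

# "<![", optional space, the CDATA keyword, optional space, "[",
# then everything up to the first "]]>".
CDATA_SECTION = re.compile(r"<!\[\s*CDATA\s*\[(.*?)\]\]>", re.DOTALL)

def cdata_sections(text):
    """Return the data inside each CDATA marked section in `text`."""
    return CDATA_SECTION.findall(text)

doc = "before <![ CDATA [a <b>, an &c;, even ]] > is data]]> after"
print(cdata_sections(doc))
# ['a <b>, an &c;, even ]] > is data']
```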

To make it more confusing, the keyword CDATA is used in
several places:

* elements may have CDATA or RCDATA declared content,
and #PCDATA may appear in a content model group

* data entities may be declared as CDATA or SDATA:

<!ENTITY lt CDATA "&#60;">

* external data entities may be CDATA, SDATA, or NDATA.

<!ENTITY listing1 SYSTEM "foo.txt" CDATA>

* attributes may have a declared value of CDATA (or NAME,
NAMES, NUMBER, NUMBERS, ID, IDS, and a bunch of others).
Even if the attribute value is CDATA, entity references
are *still expanded* in attribute value literals:
<img alt="G&ouml;del">

(Possibly others too; these are all I can think of.)

--Joe English

joe@trystero.art.com