Re: #PCDATA definition??

Daniel W. Connolly (connolly@hal.com)
Thu, 27 Oct 94 00:16:43 EDT

In message <199410270133.AA16729@char.vnet.net>, Stan Newton writes:
>The HTML 2.0 DTD uses #PCDATA in the definition of allowed content for
> %text ENTITY
> %pre.content ENTITY
> Title tag
> Option tag
> TextArea tag
>
>The only definition I could find for #PCDATA (from an HTMLPLUS draft,
>I think) says:
>
>PCDATA
>"text occurring in a context in which markup and entity references may occur"
>
>This should be contrasted with the definition for CDATA (same source)
>"text which doesn't include markup or entity references"
>
>OK, now I'm (once again) confused. These ambiguous definitions don't really
>seem to help.

Could you expand on what makes you think these definitions are
ambiguous? The "Understanding Structured Text" section of the spec is
something of a "gentle introduction to SGML." As such, it's not very
good. (It's one of the sections of the text that has survived since my
original January 1993 draft. Nobody seems to know enough about both
SGML and writing clearly to take a crack at re-writing it.)

If you really want to know what's going on, I'm afraid you'll have to
read the SGML standard or one of the books on SGML. SoftQuad has a
20-page document that covers most of what you need to know in their
documentation. Yuri?

I'll reiterate that the way I learned these things was interactively,
by feeding different test cases to sgmls and looking at the output.
You can do the same with the validation service at:

http://www.hal.com/%7Econnolly/html-test/service/validation-form.html

Here's an attempt at an explanation:

An element declaration declares the content of an element to
be one of five types of content:

    EMPTY   -- no content. No end tag either. e.g.
                <br>
    CDATA   -- character data. No markup. e.g.
                <xmp>things that look like <start> tags,
                &entity; references,
                <!-- comments -->
                etc. are all just data characters
                inside CDATA content </xmp>
    RCDATA  -- like CDATA, but &entity; markup is recognized
               (there are no elements of this type in HTML)
    MIXED   -- any content model that contains #PCDATA is MIXED
               content. The content of the element P in:
                <p>chars <br>&#65; <!--comment--> </p>
               is:
                * 6 characters "chars "
                * a BR element
                * three characters "A  " (the A and the two
                  spaces around the comment)
    ELEMENT -- Any content model that contains elements but
               no #PCDATA is element content. Whitespace is not
               considered data inside element content. For example:
                <HEAD> <title>blah</title> </head>
               The spaces before and after the title tags are
               completely ignored.
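
The five-way split above can be sketched in a few lines of Python.
This is only an illustration, not how sgmls does it: the function
name classify_content is made up, the two model groups passed in for
P and HEAD are simplified stand-ins rather than the real DTD text,
and parameter entities such as %text are assumed to be expanded
already.

```python
def classify_content(declaration: str) -> str:
    """Classify the content part of an element declaration.

    Hypothetical helper: it only looks at the text of the declaration
    and assumes parameter entities have already been expanded.
    """
    text = declaration.strip().upper()
    # Declared content keywords are matched literally.
    if text in ("EMPTY", "CDATA", "RCDATA"):
        return text
    # Any content model that mentions #PCDATA is mixed content.
    if "#PCDATA" in text:
        return "MIXED"
    # Whatever is left is treated as a model group naming only elements.
    return "ELEMENT"


print(classify_content("EMPTY"))              # like BR
print(classify_content("CDATA"))              # like XMP
print(classify_content("(#PCDATA | BR)*"))    # simplified stand-in for P
print(classify_content("(TITLE & ISINDEX?)")) # simplified stand-in for HEAD
```

The point of the sketch is just that MIXED versus ELEMENT is decided
by whether #PCDATA appears anywhere in the content model.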

We can see this in the ESIS representation output by sgmls:

Input

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">

<HEAD> <title>blah</title> </head>

<BODY>
<xmp>things that look like <start> tags,
&entity; references,
<!-- comments -->
etc. are all just data characters
inside CDATA content </xmp>

<p>chars <br>&#65; <!--comment--> </p>

</body>

Parsed Output (Element Structure Information Set)

AVERSION CDATA -//IETF//DTD HTML//EN//2.0
(HTML
(HEAD
(TITLE
-blah
)TITLE
)HEAD
(BODY
(XMP
-things that look like <start> tags,\n &entity; references,\n <!-- comments -->\n etc. are all just data characters\n inside CDATA content
)XMP
-\n\n
(P
-chars
(BR
)BR
-A
)P
-\n
)BODY
)HTML
C
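
If you want to poke at output like this programmatically, a small
reader is enough. The following is a rough sketch, assuming only the
line codes that appear above: "(" starts an element, ")" ends it, and
"-" carries data. The name read_esis is made up, attribute ("A") and
"C" lines are simply skipped, and the only escape it handles is \n.

```python
def read_esis(lines):
    """Build a nested tree of (name, children) tuples from ESIS lines.

    Sketch only: handles "(" / ")" / "-" lines and the \\n escape,
    and ignores every other line code.
    """
    root = ("#ROOT", [])
    stack = [root]
    for line in lines:
        if not line:
            continue
        code, rest = line[0], line[1:]
        if code == "(":                      # start of an element
            node = (rest, [])
            stack[-1][1].append(node)
            stack.append(node)
        elif code == ")":                    # end of the current element
            stack.pop()
        elif code == "-":                    # data characters
            stack[-1][1].append(rest.replace("\\n", "\n"))
    return root[1]


sample = ["(P", "-chars ", "(BR", ")BR", "-A  ", ")P", "C"]
print(read_esis(sample))
# [('P', ['chars ', ('BR', []), 'A  '])]
```

Note how the comment never shows up: by the time the parser reports
the content of P, only the data characters and the BR element remain.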

>Yesterday's message thread on attribute values, started by Keith Ball,
>contained examples for what I understood to be CDATA attribute values but
>containing expanded characters (such as &gt; for >). I thought this kind of
>character escaping is what was meant by 'entity references' and would have
>been prevented by the definition above.

Well, CDATA attributes are completely different from CDATA elements.
Isn't that handy?

There's this process of going from an "attribute value literal"
to an attribute value, i.e. looking at:

<img src="foo.gif" alt="abc &lt; def">

and determining that the value of the alt attribute is "abc < def".
This process is the same for all attributes.
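
As a rough illustration of that step, here is a Python sketch that
strips the quotes and resolves references. It is not the full SGML
rule: the function name attribute_value is made up, the entity table
is a tiny assumed subset (a real parser takes the entities from the
DTD), and whitespace handling is left out.

```python
import re

# Assumed, deliberately tiny entity table for illustration only.
ENTITIES = {"lt": "<", "gt": ">", "amp": "&", "quot": '"'}


def attribute_value(literal: str) -> str:
    """Turn an attribute value literal into an attribute value (sketch)."""
    # Drop the surrounding quote delimiters, if there are any.
    if len(literal) >= 2 and literal[0] == literal[-1] and literal[0] in "\"'":
        literal = literal[1:-1]

    def resolve(match):
        if match.group(1) is not None:       # numeric reference, e.g. &#65;
            return chr(int(match.group(1)))
        # Named reference, e.g. &lt;  Unknown names are left untouched.
        return ENTITIES.get(match.group(2), match.group(0))

    return re.sub(r"&#(\d+);|&([A-Za-z][A-Za-z0-9.-]*);", resolve, literal)


print(attribute_value('"abc &lt; def"'))   # abc < def
```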

_After_ the sgml parser computes the value of an attribute, it
checks it against the "declared value" for an attribute, which might
be NAME, CDATA, NUMBER, an enumerated list of names, or a few other
possibilities. At that point, you might get an error, for example,
if you said
<form method=123 action="xxx">

because 123 doesn't match the declared value for method.

A declared value of CDATA for an attribute means anything goes
(up to LITLEN characters, which is 1024 in our case).
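
The check against the declared value can be sketched the same way.
Everything here is an assumption for illustration: the function name
matches_declared_value is made up, only CDATA, NUMBER, NAME and
enumerated lists are covered, the NAME pattern is a simplification,
and the (GET, POST) list stands in for whatever the DTD actually
declares for method.

```python
import re

LITLEN = 1024  # the limit mentioned above


def matches_declared_value(value, declared):
    """Check an already-computed attribute value against a declared value.

    `declared` is "CDATA", "NUMBER", "NAME", or a tuple of allowed names.
    """
    if len(value) > LITLEN:
        return False
    if declared == "CDATA":                  # anything goes
        return True
    if declared == "NUMBER":                 # digits only
        return re.fullmatch(r"[0-9]+", value) is not None
    if declared == "NAME":                   # simplified name pattern
        return re.fullmatch(r"[A-Za-z][A-Za-z0-9.-]*", value) is not None
    if isinstance(declared, tuple):          # enumerated list of names
        return value.upper() in {name.upper() for name in declared}
    raise ValueError("declared value not covered by this sketch")


# Assumed declaration for illustration: method is one of GET or POST.
print(matches_declared_value("123", ("GET", "POST")))   # False
print(matches_declared_value("xxx", "CDATA"))           # True
```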

>Conversely, I don't think that you really mean that markup (highlighting?)
>should be allowed in the document Title text box, in the listbox entries
>provided by the Option tags, or in the TextArea text box.

I'm not clear on what you mean here. It _is_ valid to say:

<title> Kurt G&ouml;del's Writings </title>

[And not only is it valid, but the &ouml; is recognized as
an entity reference...]

It's _not_ valid to say

<title> Really <em>Neato</em> stuff! </title>

>Could somehow set me straight on this or perhaps direct me to a reference
>that can help me.

The online resources that I use are the comp.text.sgml newsgroup
[Unfortunately, asking for a FAQ in that newsgroup causes great
commotion. At least it did for some time] and the sgml archive at

ftp://ftp.ifi.uio.no/pub/SGML/

I also have a copy of ISO-8879 near my desk. I hear that Charles
Goldfarb's handbook (sorry, no citation handy) is indispensable,
but somehow I have managed to get by without it.

I don't recommend a trip through the sgmls source code (long story --
not James Clark's fault), and the documentation strictly complements
the SGML standard (i.e. it does not attempt to explain it at all). But
I do recommend playing with it.

> More directly, for purposes of this Working Group,
>What is the intention for content within the tags listed above?

I hope I have answered this question above. I invite you to use the
validation service and/or install sgmls and play around to further
refine your understanding of the corner cases.

Dan