comments on the DTD in Nov 16 draft

Paul Grosso (pbg@texcel.no)
Mon, 21 Nov 94 10:33:13 EST

I've got a few comments on the latest DTD. I've tried to check other
comments first, but apologies if I'm repeating anything. Some of my
comments border on SGML-esoterica. For those who don't care to delve
into SGML nits, there is nothing in this message that affects the
actual definition of HTML 2.0. I'm just trying to make sure the
DTD is as rigorous as possible.

Use of public text display version field in FPIs
------------------------------------------------

First, I'm not sure that the Formal Public Identifiers (FPIs) shown
for the DTD(s) make appropriate use of the public text display version
field (the "//2.0" after the "//EN"). Though the FPIs are syntactically
valid, the SGML standard makes it pretty clear that this field "distinguishes
among public text that has a common public text description [the "DTD HTML"
field] by describing the devices supported or coding scheme used. If omitted,
the public text is not device-dependent." [definition 4.244]. In the
explanation of the production for public text display version [clause 10.2.2.5],
it indicates that, "if the public text is device-dependent, the text
identifier must include a public text display version that describes the
devices supported or coding scheme used." In further explanation added
by Goldfarb in his "SGML Handbook," he reinforces the fact that this
'public text display version' field is for device dependent variants.
I don't feel it is appropriate to use it for distinguishing distinct
(device-independent) versions of a DTD. Instead, DTDs whose contents
differ (even if "only" different versions of related DTDs) should have different
public text descriptions. In other words, I would recommend that the
version info (be it version number and/or revision date) should be
part of the public text description field. This would mean that
the FPIs in the DTD and in various places in the spec (including
in the sample SGML Open entity catalog) would look more like:

PUBLIC "-//IETF//DTD HTML//EN" html.dtd
PUBLIC "-//IETF//DTD HTML 2.0//EN" html.dtd
PUBLIC "-//IETF//DTD HTML Level 2//EN" html.dtd
PUBLIC "-//IETF//DTD HTML 2.0 Level 2//EN" html.dtd

-- Ways to refer to Level 1: most general to most specific --
PUBLIC "-//IETF//DTD HTML Level 1//EN" html-1.dtd
PUBLIC "-//IETF//DTD HTML 2.0 Level 1//EN" html-1.dtd

-- Ways to refer to Level 0: most general to most specific --
PUBLIC "-//IETF//DTD HTML Level 0//EN" html-0.dtd
PUBLIC "-//IETF//DTD HTML 2.0 Level 0//EN" html-0.dtd
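
To make the field layout concrete, here is a minimal Python sketch (mine,
not part of the draft or of 8879) that splits an FPI into its fields. The
names FPI and parse_fpi are hypothetical, and the sketch assumes the simple
shape "owner//class description//language[//display version]"; it does not
handle the unavailable text indicator.

```python
from typing import NamedTuple, Optional


class FPI(NamedTuple):
    owner: str
    text_class: str
    description: str
    language: str
    display_version: Optional[str]


def parse_fpi(fpi: str) -> FPI:
    """Split a formal public identifier into its fields (simplified)."""
    parts = fpi.split("//")
    if parts[0] in ("-", "+"):
        # Unregistered ("-//") or registered ("+//") owner prefix.
        owner = parts[0] + "//" + parts[1]
        rest = parts[2:]
    else:
        # ISO owner identifier, which carries no prefix.
        owner = parts[0]
        rest = parts[1:]
    if len(rest) not in (2, 3):
        raise ValueError("unexpected number of fields in FPI: %r" % fpi)
    # The public text class is the first word; the rest is the description.
    text_class, _, description = rest[0].partition(" ")
    display_version = rest[2] if len(rest) == 3 else None
    return FPI(owner, text_class, description, rest[1], display_version)


draft = parse_fpi("-//IETF//DTD HTML//EN//2.0")
proposed = parse_fpi("-//IETF//DTD HTML 2.0//EN")
```

With the draft's form, "2.0" ends up in display_version; with the form
recommended above, it is part of description and display_version is None.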

Creating a copy of ISO Added Latin 1 character entity declaration set
---------------------------------------------------------------------

The suggestion of specifying an IETF version of ISO's Added Latin 1
characters confuses and troubles me. I admit that I have
not been part of any discussions of this in the past, so I realize
I'm leaving myself open to criticism for speaking without full
background. However, given the time constraints on both the
HTML WG work and my own, I am taking the risk and mentioning it here.

Why is the HTML spec creating a new character entity set that is
basically identical to the ISO set (as far as I can tell) and giving it
a different FPI? [I'm reading section 3.15.2 of the spec.] I don't
think we should do that. I'm guessing it might have something to
do with the device-specific replacement text one might wish to have for
the entities (and if that is the case, we shouldn't change the
base FPI, but instead make use of the public text display version
field), but I can't really get into the details of a suggestion without
understanding more of the reasoning behind this.

-- ISO latin 1 entity set for HTML --
PUBLIC "-//IETF//ENTITIES Added Latin 1//EN" ISOlat1.sgml

. . .

<!--================ Character mnemonic entities ==========================-->

<!ENTITY % ISOlat1 PUBLIC
"-//IETF//ENTITIES Added Latin 1 for HTML//EN">
%ISOlat1;

If, in fact, we decide after all to create a new public entity with
the FPI shown above, I would suggest that we shouldn't recommend
the use of the file name "ISOlat1.sgml" for it. First, this is
misleading since it isn't an ISO set (given the FPI); second, this
may cause conflicts with systems that map the original ISO FPI into the file
named "ISOlat1.sgml" since that is the obvious thing to do; third,
we might want to consider sticking to 8.3 for file names recommended
by the standard. Likewise, I'd prefer to see the parameter entity
named something like %HTMLlat1; instead of %ISOlat1;. Of course,
all of this paragraph is irrelevant if we just decide to reference
the ISO set and not redeclare an IETF one.
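
The file name conflict can be sketched in a few lines of Python. This is
only an illustration: the function name filename_collisions is
hypothetical, the catalog text is a made-up example in the style of the
sample catalog, and I am assuming the original ISO set is referenced as
"ISO 8879-1986//ENTITIES Added Latin 1//EN".

```python
import re
from collections import defaultdict

# Matches simple catalog entries of the form: PUBLIC "fpi" filename
ENTRY = re.compile(r'PUBLIC\s+"([^"]+)"\s+(\S+)')


def filename_collisions(catalog_text):
    """Return {file name: sorted FPIs} for names claimed by more than one FPI."""
    by_file = defaultdict(set)
    for fpi, filename in ENTRY.findall(catalog_text):
        by_file[filename].add(fpi)
    return {name: sorted(fpis) for name, fpis in by_file.items() if len(fpis) > 1}


catalog = '''
PUBLIC "ISO 8879-1986//ENTITIES Added Latin 1//EN" ISOlat1.sgml
PUBLIC "-//IETF//ENTITIES Added Latin 1//EN" ISOlat1.sgml
PUBLIC "-//IETF//DTD HTML//EN" html.dtd
'''

collisions = filename_collisions(catalog)
```

Here both entity sets claim "ISOlat1.sgml", which is the conflict a system
with the obvious ISO mapping would run into.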

Use of + versus * occurrence indicator on content models with #PCDATA
---------------------------------------------------------------------

This comment is not noting an erroneous use of SGML, but is making
more of a "stylistic" suggestion that may help make some models
more obvious to readers.

Wherever #PCDATA is allowed, this token can always be satisfied
by the null string. For example, given that %text is an OR group
including #PCDATA and given the following model:

<!ELEMENT P - O (%text)+>

then it is the case that a P element can have empty content.
Therefore, the above declaration is always completely equivalent to

<!ELEMENT P - O (%text)*>

while the use of the * instead of the + helps remind the casual
reader at a glance that P's can be empty. Therefore, I have
heard it recommended that the * be used instead of the + in
these cases for the increased clarity it provides. I would make
this recommendation throughout the DTD. My quick scan notes
the following declarations:

<!ELEMENT (%font;|%phrase) - - (%text)+>
<!ENTITY % A.content "(%text)+">
<!ENTITY % A.content "(%heading|%text)+">
<!ELEMENT P - O (%text)+>
<!ELEMENT ( %heading ) - - (%text;)+>
<!ELEMENT PRE - - (%pre.content)+>
<!ELEMENT DT - O (%text)+>
<!ELEMENT OPTION - O (#PCDATA)> [I'd make it ... (#PCDATA)*> ]
<!ELEMENT TEXTAREA - - (#PCDATA)> ditto
<!ELEMENT TITLE - - (#PCDATA)> ditto

Non-compliant use of parameter entities
---------------------------------------

There are several occurrences of non-compliant use of parameter entities
in the latest DTD. In brief, you cannot define a parameter entity with
"dangling" connectors such as "| FORM | ISINDEX".

I know that many parsers accept this--but others do properly flag this
as an error. It turns out that it isn't made illegal directly by the
productions of 8879 itself, but by the text. To quote a few parts
(with thanks for help/confirmation by Sam Wilmott of Exoterica):

"A parameter entity reference can be used anywhere in a group that a token
could occur. The entity must consist of one or more of the consecutive
complete tokens that follow the _ts_ [token separator] in which the reference
occurs in the same group (i.e., at the same nesting level), together with any
surrounding or intervening _ts_ separators and any intervening connectors.
The entity must end within the same group." [from clause 10.1.3]

The relevant term "token" is defined in clause 4.319 as:

"The portion of a group, including a complete nested group (but not a
connector), that is bounded by _ts_ separators (whether required or optional)."

So:
1. A connector is not a token.
2. An entity must consist of one or more ... tokens ..., together with
... any intervening connectors.
3. Any connector in the replacement text of a parameter entity referenced
in a _ts_ is allowed by the "intervening connectors" provision, not by the
"one or more ... tokens" provision.

Therefore, in the declarations below, the leading "|" and "&" are
connectors but are not "intervening" and so are in error:

<!ENTITY % block.forms "| FORM | ISINDEX">
<!ENTITY % head.link "& LINK*">
<!ENTITY % head.nextid "& NEXTID?">
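
A rough check for this kind of problem is easy to sketch in Python. The
function name has_dangling_connector is hypothetical, the connector
characters are those of the reference concrete syntax ("|", "&", ","), and
the test only looks at the first and last character of the replacement
text, so it is a lint-style approximation rather than a validating parse.

```python
CONNECTORS = "|&,"  # OR, AND and SEQ in the reference concrete syntax


def has_dangling_connector(replacement_text):
    """True if the entity text starts or ends with a connector."""
    text = replacement_text.strip()
    if not text:
        return False
    return text[0] in CONNECTORS or text[-1] in CONNECTORS


flagged = [
    text
    for text in ("| FORM | ISINDEX", "& LINK*", "& NEXTID?", "PRE | XMP | LISTING")
    if has_dangling_connector(text)
]
```

The three declarations quoted above are flagged; "PRE | XMP | LISTING" is
not, because its connectors all sit between tokens.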

There are several ways to rearrange things to be valid. Here I
make one suggestion:

<!--=================== Text Flows ========================================-->

<![ %HTML.Deprecated [
<!ENTITY % preformatted "PRE | XMP | LISTING">
]]>

<!ENTITY % preformatted "PRE">

<![ %HTML.Forms [
<!ENTITY % block "P | %list | DL
| %preformatted
| BLOCKQUOTE | FORM | ISINDEX">
]]>

<!ENTITY % block "P | %list | DL
| %preformatted
| BLOCKQUOTE">

. . .

<!--================ Document Head ========================================-->

<![ %HTML.Recommended [
<!ENTITY % head.content "TITLE & ISINDEX? & BASE? & META* & LINK*">
]]>

<!ENTITY % head.content "TITLE & ISINDEX? & BASE? & META* & NEXTID? & LINK*">

Why not use LITA?
-----------------

I don't understand why numeric character entity references are being
used in the HTML element's VERSION attribute's default value.

<!ELEMENT HTML O O (%html.content)>
<!ENTITY % version.attr "VERSION CDATA #FIXED &#34;%HTML.Version;&#34;">

Since all literals in SGML can be delimited equally well by either the
LIT character (by default, the double quote) or the LITA character (by
default, the single quote), why are we using &#34 for the double quote?
Instead, I recommend the above line be replaced with either:

<!ENTITY % version.attr 'VERSION CDATA #FIXED "%HTML.Version;"'>
or (equivalently)
<!ENTITY % version.attr "VERSION CDATA #FIXED '%HTML.Version;'">

CDATA terminated by any ETAGO
-----------------------------

The comment associated with the declaration for %literal; is somewhat
misleading. Since the use of CDATA for element content models can lead
to surprises [I'm glad this is deprecated], I think it's important to
make the comment clearer.

<![ %HTML.Deprecated [

<!ENTITY % literal "CDATA"
-- historical, non-conforming parsing mode where
the only markup signal is the end tag
in full
-->

<!ELEMENT (XMP|LISTING) - - %literal>
<!-- <XMP> Example section -->
<!-- <LISTING> Computer listing -->

<!ELEMENT PLAINTEXT - O %literal>
<!-- <PLAINTEXT> Plain text passage -->

]]>

The fact is that an element whose content model is CDATA is terminated
by ANY end tag--even an invalid one! That is, if the string "</"
occurs in CDATA content, the current element will be terminated.

Furthermore, there is no way for a user or editing interface to
escape the "</" to allow, for example, SGML code to be shown within
such an element. That is why I usually recommend the use of RCDATA
for such content models. This would allow the escaping of "</"
and "&" by using entity references. In any case, I think the
comment should be made more accurate.
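
To show the surprise concretely, here is a small Python sketch. The
function names cdata_end and escape_for_rcdata are hypothetical. I am
assuming the stricter reading in which "</" counts as an end-tag open only
when a letter follows it; a scanner that stops at any "</" would simply use
content.find("</"). The escaping also assumes the entities "amp" and "lt"
are declared.

```python
import re

# Assumption: "</" is recognized only when a name start character follows.
ETAGO_IN_CONTEXT = re.compile(r"</[A-Za-z]")


def cdata_end(content):
    """Offset at which CDATA content would be terminated, or None."""
    match = ETAGO_IN_CONTEXT.search(content)
    return match.start() if match else None


def escape_for_rcdata(text):
    """Escape text so it can be shown inside an RCDATA element."""
    # Replace "&" first so the "&" of "&lt;" is not escaped a second time.
    return text.replace("&", "&amp;").replace("<", "&lt;")


sample = "end a paragraph with </P> & carry on"
cut = cdata_end(sample)
escaped = escape_for_rcdata(sample)
```

In CDATA content the sample is cut off at the "</P>", although P is not the
open element. After escaping, no "</" remains, so the whole string survives
inside an RCDATA element.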

paul

Paul Grosso
VP Research                    Chief Technical Officer
ArborText, Inc.                SGML Open

Email: paul@arbortext.com
or pbg@texcel.no