Re: New DTD (final version?)

Paul Grosso (pbg@texcel.no)
Mon, 30 Jan 95 06:12:25 EST

> From: "Daniel W. Connolly" <connolly@hal.com>
>
> Would some of the SGML experts out there look over the SGML
> declaration? I think the capacities/quantities need some
> tweaking. Anybody who really knows about character set declarations is
> invited to look those over too. I'm still not clear on the distinction
> between NONSGML, UNUSED, SHUNCHAR, etc.

I'm afraid I am not a character set expert. I see they haven't changed
since the IETF draft when I last looked at them, so I doubt I'll have
much more input. Hopefully some other SGML experts with more expertise
in character sets will check it out (I'll ask around a bit).

As for the rest, my first comment is on the newly introduced line
breaks in the public identifiers. A public identifier is a minimum
literal (clause 10.1.7 of 8879) whose exact normalized form is crucial
for doing such things as catalog lookup for external entity resolution.
In particular, the existence of space characters (in the normalized
minimum literal) and case of letters is significant. A minimum literal
is "normalized by ignoring record starts, condensing record end and
space sequences to a single space, and stripping spaces at the start
or end of the minimum literal." All of this to say that, if one wishes
to introduce a line break in a public identifier, it *must* be done at
an existing space to avoid changing the (normalized) minimum literal.
The latest SGML declaration breaks the public identifier in all three
character set references just before "//ESC...", which is not
a place where a space appears in the original public identifier. You
are therefore, in fact, not referencing the registered public texts
you think you're referencing. You should change your "line breaking
algorithm" so that it breaks the public identifiers at an existing space.
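
To make the normalization rule concrete, here is a minimal Python sketch of
it. It assumes the Reference Concrete Syntax function characters (RS is
character 10, RE is character 13) and that each line of the declaration is
delivered as RS, text, RE. The function names are made up for illustration,
and the identifier string is only an example of the general shape of a
character set public identifier, not a quotation from the declaration.

```python
import re

RS = "\n"  # record start: character 10 in the Reference Concrete Syntax
RE = "\r"  # record end: character 13 in the Reference Concrete Syntax


def normalize_minimum_literal(literal: str) -> str:
    """Apply the minimum literal normalization quoted from ISO 8879."""
    without_rs = literal.replace(RS, "")             # ignore record starts
    condensed = re.sub(r"[\r ]+", " ", without_rs)   # RE/space runs -> one space
    return condensed.strip(" ")                      # strip leading/trailing spaces


def across_lines(*lines: str) -> str:
    """Model a literal typed over several lines: RE ends one, RS starts the next."""
    return (RE + RS).join(lines)


# Illustrative identifier only (an assumption, not copied from the declaration).
original = ("ISO 646:1983//CHARSET International Reference Version "
            "(IRV)//ESC 2/5 4/0")

# Line breaks placed where the original already has a space.
broken_at_space = across_lines(
    "ISO 646:1983//CHARSET",
    "    International Reference Version",
    "    (IRV)//ESC 2/5 4/0",
)

# Line break placed just before "//ESC", where the original has no space.
broken_before_esc = across_lines(
    "ISO 646:1983//CHARSET International Reference Version (IRV)",
    "    //ESC 2/5 4/0",
)

print(normalize_minimum_literal(broken_at_space) == original)    # True
print(normalize_minimum_literal(broken_before_esc) == original)  # False
```

In the second case the normalized literal contains "(IRV) //ESC", with an
extra space, so a catalog lookup keyed on the registered identifier would
not match it.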

As for the quantities, I've written a long message about that already.
In summary, here are my suggestions, though other values may make as much
sense (see my Jan 4 message appended below):

    QUANTITY SGMLREF
             ATTSPLEN 2100
             LITLEN   1024
             NAMELEN  72    -- somewhat arbitrary; taken from
                               internet line length conventions --
             PILEN    1024
             TAGLEN   2100

(I've removed the non-RCS values of GRPCNT, GRPGTCNT, and TAGLVL since
no one gave me a good reason why they're necessary when I last raised the
question, but you can put them back in if you think there's good reason.)

paul

Paul Grosso
VP Research          Chief Technical Officer
ArborText, Inc.      SGML Open

Email: paul@arbortext.com
or pbg@texcel.no

----- Begin Included Message -----

>From pbg Wed Jan 4 14:39:27 1995
Date: Wed, 4 Jan 95 14:39:25 GMT
From: pbg (Paul Grosso)
To: html-wg@oclc.org
Subject: Re: HTML 2.0 SGML declaration [was: ATTSPLEN?]

> From: Paul Burchard <burchard@horizon.math.utah.edu>
>
> pbg@texcel.no (Paul Grosso) writes:
> > LITLEN [...] ATTSPLEN [...] TAGLEN [...]
>
> > Note that it doesn't make sense to expect to be able to
> > enter large values for attributes unless one increases
> > all three of the above quantities.
>
> Thanks for the explanation -- it looks like we have a definite
> problem in the current SGML declaration for HTML, then. It sets
> LITLEN to 1024 in order to provide reasonable room for URLs and FORM
> values, but then leaves ATTSPLEN and TAGLEN at their default values.

Looking at the HTML 2.0 SGML decl, I realize I don't remember any
discussion on the quantity values before, so I might have missed
something. But here are my comments on the SGML declaration.

I didn't consider the capacities. I think capacities are more of an
annoyance than a help, and almost all products I have seen rightly ignore
them for all practical purposes (usually after giving a warning).
Besides, appropriate values for capacities are usually only determinable
by trial and error, and I figure Dan's already done that.

I have to admit to lack of expertise in the area of the details of the
character set stuff in SGML declarations. Not that I haven't tried,
but there are just too many issues for me to know if the ones given
are the best ones for HTML use. Character sets in HTML are an open
issue, and I have no reason to think that the ones Dan has included
aren't the best ones for now.

The features are quite reasonable and standard, and the syntax is the same
as the Reference Concrete Syntax (RCS) with the exception of the quantities.

What's currently in the HTML 2.0 spec as far as quantities is:

    QUANTITY SGMLREF
             NAMELEN  72    -- somewhat arbitrary; taken from
                               internet line length conventions --
             TAGLVL   100
             LITLEN   1024
             GRPGTCNT 150
             GRPCNT   64

For reference, the RCS quantities are:

    QUANTITY SGMLREF
             ATTCNT   40
             ATTSPLEN 960
             BSEQLEN  960
             DTAGLEN  16
             DTEMPLEN 16
             ENTLVL   16
             GRPCNT   32
             GRPGTCNT 96
             GRPLVL   16
             LITLEN   240
             NAMELEN  8
             NORMSEP  2
             PILEN    240
             TAGLEN   960
             TAGLVL   24
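
The two lists can be compared mechanically. The following is a minimal
Python sketch, not part of any SGML tool; the names RCS, HTML2_OVERRIDES,
effective_quantities, and too_small_for_litlen are made up for
illustration. It overlays the HTML 2.0 values on the RCS defaults (which is
what QUANTITY SGMLREF followed by a few names amounts to) and reports which
of ATTSPLEN and TAGLEN end up smaller than LITLEN.

```python
# RCS quantity values, as listed above.
RCS = {
    "ATTCNT": 40, "ATTSPLEN": 960, "BSEQLEN": 960, "DTAGLEN": 16,
    "DTEMPLEN": 16, "ENTLVL": 16, "GRPCNT": 32, "GRPGTCNT": 96,
    "GRPLVL": 16, "LITLEN": 240, "NAMELEN": 8, "NORMSEP": 2,
    "PILEN": 240, "TAGLEN": 960, "TAGLVL": 24,
}

# Values currently given in the HTML 2.0 SGML declaration.
HTML2_OVERRIDES = {
    "NAMELEN": 72, "TAGLVL": 100, "LITLEN": 1024,
    "GRPGTCNT": 150, "GRPCNT": 64,
}


def effective_quantities(overrides):
    """Return the RCS defaults with the given overrides applied."""
    unknown = set(overrides) - set(RCS)
    if unknown:
        raise KeyError(f"not RCS quantity names: {sorted(unknown)}")
    return {**RCS, **overrides}


def too_small_for_litlen(quantities):
    """List tag-related lengths that cannot hold even one LITLEN-sized literal."""
    return [name for name in ("ATTSPLEN", "TAGLEN")
            if quantities[name] < quantities["LITLEN"]]


html2 = effective_quantities(HTML2_OVERRIDES)
print(too_small_for_litlen(html2))  # ['ATTSPLEN', 'TAGLEN']
```

Both quantities stay at 960 while LITLEN is 1024, which is the
inconsistency raised in the exchange quoted above.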

My comments:

1. I'm not sure why it was felt necessary for GRPCNT, GRPGTCNT, and
TAGLVL to be raised from their RCS values. In my experience, I
have rarely seen the need, and the HTML application is one of the
smaller ones I've seen. I don't see anything wrong with the larger
values; I was just a bit surprised to see them.

2. A value of 1024 for LITLEN makes sense. Most people increase PILEN
when LITLEN is increased. Basically, if you expect to have large
literals, you might well have large PIs. In particular, PIs may be
used to contain things that are related to things tags contain, so
I usually recommend a PILEN at least as large as TAGLEN. (From the
following paragraph, that would imply a value of 4230 if you follow
the argument in a strict fashion.) I would recommend making PILEN
at least the same as LITLEN--in this case, 1024.

3. As the earlier exchange discusses, ATTSPLEN and TAGLEN should usually
be increased when LITLEN is increased. [This isn't necessarily the
case--one might want to allow for large literals in parameter literals
(e.g., for the replacement text of entities), but still not expect
such long literals for attribute value literals. I am assuming that
we wish to allow URLs, VALUEs, and such to have lengths up to
LITLEN in the rest of this paragraph.] A quick glance at the DTD shows
that the elements A and INPUT have four CDATA attributes plus a few
others, LINK has three CDATA atts plus others, and IMG and FORM have
two CDATA atts plus others. Unless someone has a good argument for
thinking it isn't necessary to allow for the case that all four of
A's and INPUT's CDATA attributes have values that approach LITLEN,
that would indicate a value of ATTSPLEN near 4150. With a NAMELEN
of 72 (even though no element names currently approach that), that
would suggest a TAGLEN near 4230 in round numbers. In practice, one
would rarely expect such extremes, so smaller numbers may be reasonable,
but I'm just laying out the appropriate logic. In particular, the
elements A (with HREF and NAME), IMG (with SRC and ALT), INPUT (with
SRC and VALUE), and LINK (with HREF and URN) all have at least two
CDATA attributes that, I would think, could both get long (either by
virtue of being a URL or URN or by having a long textual string for
a value), so a value of at least 2100 for ATTSPLEN and TAGLEN seems
necessary if we want to be consistent with LITLEN.
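
The arithmetic in comment 3 can be sketched as follows. This is only an
illustration in Python; the helper names literal_floor and round_up_to are
made up, and the floor counts only the attribute value literals. It ignores
attribute names, delimiters, and the shorter attributes, which is why the
figures suggested above sit a little higher than the floor.

```python
LITLEN = 1024   # proposed literal length
NAMELEN = 72    # proposed name length


def literal_floor(long_cdata_attributes, litlen=LITLEN):
    """Rough lower bound on ATTSPLEN if this many CDATA values approach LITLEN."""
    return long_cdata_attributes * litlen


def round_up_to(value, step):
    """Round value up to the next multiple of step."""
    return -(-value // step) * step


# Four long CDATA attributes (A, INPUT) versus two (IMG, LINK, and so on).
print(literal_floor(4))                  # 4096, below the 4150 suggested
print(literal_floor(2))                  # 2048
print(round_up_to(literal_floor(2), 100))  # 2100

# TAGLEN adds room for the element name on top of ATTSPLEN.
print(round_up_to(4150 + NAMELEN, 10))   # 4230
```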

paul

----- End Included Message -----