SGML Open recommendations on HTML 3

Steven J. DeRose (sjd@ebt.com)
Mon, 20 Mar 1995 18:21:23 -0500

Most readers of these lists not only know about these recommendations, but
have participated in their preparation in one way or another. So, fresh
from the SGML Open Technical Committee meetings at Documation, here is the
first major portion of the document. Many thanks to all those in SGML Open,
the HTML Working Group, and elsewhere who helped develop this document.

Steve DeRose

-------------------------------------------------------------------------------

SGML OPEN RECOMMENDATIONS FOR HTML 3.0
Posting #1 - March 20, 1995

SUMMARY
-------

The new draft provides some exciting and ambitious functionality that
should make HTML even more appealing to Web publishers. However, there
appear to be some flaws that could hinder its achieving this objective.
In summary, the principal points are:

1. In some small details, some of the proposed constructs in HTML 3
are incompatible with SGML. That loss of compatibility jeopardizes
access to industry-standard tools for HTML production. It also
jeopardizes the ability in future to extend the Web to non-Internet
documents.

2. The more HTML becomes a procedural formatting language rather than a
neutral document structure description language, the more vulnerable it
becomes to competition from RTF and other popular formatting languages.
However, there appears to be a resistance in HTML 3.0 to any sort of
structural container elements. We have heard it even proposed that input
objects be allowed outside forms, which we especially oppose. It is
relatively simple to maintain a structure stack when parsing, and users
will find it much easier to get the results they want if structure
checking is made available to them.

3. The new tables are a great idea, although the table markup could be
simplified. In addition, we encourage that extensions be implemented in an
orderly fashion -- moving towards a target with consistent names for
likely extensions so that where common functionality is desired,
all browser implementations use the same names -- even while in an
experimental mode.

4. The design objectives regarding backward compatibility seem
inconsistent. On one hand, the DOCTYPE declaration clearly identifies
the version of HTML. On the other, HTML 3.0 includes deprecated
constructs for compatibility with earlier versions. As a browser can
easily tell which version it is processing, there should be no need for
this approach to backward compatibility, which is restricting HTML's potential.
Indeed, we wonder whether HTML 3 might be designed without allowing
constructs termed "deprecated" in HTML 2, assuming that creators of
HTML 3 documents should not be encouraged to use deprecated constructs.
If they want to use those, they should simply use HTML 2.0.

Further details on these points and others can be found in the
attachment.

INTRODUCTION
------------

At the SGML Open technical committee meetings last summer, Dave Raggett
invited members of the consortium to participate, either formally or
informally, in the ongoing development of HTML 3.0. On the informal
side, many of our members have reviewed various drafts and participated
in discussions online, at conferences, by private email, or by other
means. Formally, SGML Open has organized its participation in this
area through the "SGML and the Internet" technical committee, chaired
by Steve DeRose (sjd@ebt.com) and reporting to Paul Grosso
(paul@arbortext.com), SGML Open's chief technical officer. Online
committee discussions take place through the sgml-internet mailing
list (subscribe through majordomo@ebt.com).

This posting represents the first of our formal set of SGML Open
comments, based on discussions at the SGML Open technical committee
meeting on March 10, 1995 and subsequent follow-up through the
sgml-internet mailing list. Understanding that Dave Raggett and others
are under a constant barrage of suggestions to improve or enhance the
HTML 3.0 DTD, we explicitly wish to keep our ideas simple and to the
point. To this end, we are consolidating all the various comments and
observations from SGML Open membership into a single prioritized list
maintained by the "SGML and the Internet" technical committee. This
list is organized into three major categories (in priority order), the
first of which is covered in detail by this posting:

* Easily avoidable incompatibilities with
SGML and current SGML applications.

* Suggestions that would broaden the
impact of HTML beyond simple browsing and linking without
significantly increasing complexity.

* Additional comments and observations.

Between them, SGML Open members offer authoring, document management,
database, print and electronic publishing tools and application
development environments based on SGML. Having HTML conform to
standard industry practices, including full conformance to the ISO
standard for SGML, means that Web information providers and users will
inherit a suite of available SGML software tools in addition to those
designed specifically for HTML. It will also enable HTML content to be
readily integrated into existing document production and management
systems. At the same time, we fully appreciate the new perspectives
and vigor that HTML activity brings to the SGML world. Nothing in our
comments should be construed as an attempt to stifle the
straightforward thinking and clever experimentation which has been such
a vital part of Web history. We are sensitive to the fact that HTML
must be kept simple and understandable, and that so-called
"improvements" must be carefully examined to avoid needlessly forcing
code changes in browsers or causing existing documents to "break."
However, we believe that HTML 3.0 provides the critical opportunity to
resolve fundamental issues, and our formal comments are heart-felt.

* * *

EASILY AVOIDABLE INCOMPATIBILITIES WITH SGML
AND CURRENT SGML APPLICATIONS

Hard Incompatibilities With SGML
--------------------------------

We consider it crucial to ensure that documents which are valid HTML
are also valid SGML. Without this, users' options for editing,
formatting, and other processing software will be needlessly restricted
(they can't drop their HTML into an SGML system, apply the many SGML
data conversion tools, etc.) and implementors' work is significantly
increased.

First, we recommend in the strongest possible terms that HTML 3.0 be
changed to remove any constructs inherently incompatible with SGML.
Currently the most important example of this is the use of a
"quasi-CDATA" content model for elements XMP, LISTING and PLAINTEXT in
%HTML.Deprecated. Although deprecated, this construct, if used, can
result in potentially significant and confusing error conditions. For
example, a conforming SGML parser will interpret an occurrence of "</"
intended as data as the end of the element, causing the rest of it to
be parsed normally, including any unintended SGML markup it may
contain. We believe this "non-conforming parsing mode" should be
removed entirely rather than simply being deprecated; at minimum users
should be strongly warned against using it.
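
To make the failure concrete, here is a minimal Python sketch (an
illustration only, not taken from any parser) of where a conforming
parser would end such an element. It assumes that "</" followed by a
letter is recognized as an end-tag open delimiter inside CDATA content;
the name split_cdata_content and the sample text are hypothetical.

```python
import re

# Assumption: inside CDATA content, "</" followed by a name start
# character (a letter) is recognized as the start of an end-tag.
ETAGO = re.compile(r"</(?=[A-Za-z])")


def split_cdata_content(text):
    """Split text at the first end-tag open delimiter.

    Returns (data, rest): the data kept inside the element, and
    everything from the first recognized "</" onward, which would be
    parsed as ordinary markup.
    """
    match = ETAGO.search(text)
    if match is None:
        return text, ""
    return text[:match.start()], text[match.start():]


# The author means everything before "</XMP>" to be literal data.
data, rest = split_cdata_content("Use <EM>this</EM> for emphasis.</XMP>")
```

Here data is 'Use <EM>this' and rest begins with '</EM>', so the element
ends well before the author intended and the remainder is parsed as
markup.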

Second, we note that there are potential compatibility problems if Web
documents assume that HTML parsers will not enforce standard SGML
rules. For example, a document might include "</>" or "<![" as data
because the particular parser in use does not support empty end-tags or
marked sections, and does not provide a warning upon seeing these
delimiters. When the same document is processed by an SGML parser,
however, it will fail or be processed in unintended ways.

The most classic of these errors is failing to quote a URL attribute,
in which case "/" would be taken as the SGML "NET" delimiter.

The opposite error may also arise: some existing browsers do not
notice "</>" as a tag, and would for example render an entire
document in huge type if it has "<h1>Title</>" near the top.

Therefore, we strongly recommend that HTML parsers (such as are
embedded within WWW clients) be built to interpret HTML documents in
the same way SGML parsers would (though of course they need not support
features not required by HTML). At minimum, the HTML DTD documentation
should encourage this approach and provide explicit warnings about
known issues.

Specifically, HTML 3.0 should require that attributes (e.g., URLs)
be quoted according to the same rules as described in the SGML standard.
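
As a rough illustration, a document checker could flag attribute values
that are unsafe to leave unquoted. The Python sketch below assumes the
reference concrete syntax, in which an unquoted value may contain only
name characters (letters, digits, "." and "-"); the name needs_quotes
and the example URL are hypothetical.

```python
import re

# Assumption: only name characters may appear in an unquoted attribute
# value (letters, digits, period, hyphen in the reference concrete syntax).
NAME_CHARS_ONLY = re.compile(r"[A-Za-z0-9.-]+")


def needs_quotes(value):
    """Return True if the attribute value must be written as a quoted literal."""
    return NAME_CHARS_ONLY.fullmatch(value) is None


url_needs_quotes = needs_quotes("http://example.org/doc.html")  # has ":" and "/"
keyword_needs_quotes = needs_quotes("center")                   # name characters only
```

Under these assumptions the URL needs quotes and the keyword does not. A
simpler authoring rule, of course, is to quote every attribute value.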

We strongly recommend the use of DOCTYPE declarations at the top of
HTML documents to indicate the HTML level to which they conform. Given
this, we feel the design work for HTML should become considerably
easier.

Use of Deprecated Constructs As The Default
-------------------------------------------

We strongly recommend that the HTML 3.0 DTD's normative, distributed
form turn the deprecated material (if any) off, which is not the case in the
drafts we've seen. Otherwise the DTD tacitly recommends what it claims
to deprecate, and will tend to encourage continued use of the older
methods.

Asynchronous Elements
---------------------

Anything that subverts the tree-structuring of an SGML document is
prone to inconsistencies of implementation and behavior. In
particular, use of EMPTY elements that are logically paired (such as
MARK in the current HTML 3.0 draft) is a classic problem case that has
been discovered "the hard way" and has now been removed from nearly all
SGML applications.

The problem occurs because SGML parsers and other generic processors
have no knowledge of the intended pairing of the elements, and cannot
prevent illogical combinations like two "end" elements occurring before
the first "start." Similarly, without complex, application-specific
logic there is no way to prevent the removal of an "end" element while
leaving the orphaned "start" element in place. Many optimizations that
editors, formatters and other processing programs can use are lost if
such structures are permitted. For example, a program can no longer
tell how to format part of a document without going all the way back to
the beginning, on the off chance that a "start" member of the pair
occurred a long way back. Likewise, one cannot easily build a
stack-based formatter that keys styles off the list of element types in
one's ancestry. It is extremely difficult for an editor to even validate
that such pairs match, since "matching" becomes a non-generic notion
that must be custom-built for the specific semantics of each kind of
pair. For these reasons, we strongly recommend that all such
"asynchronous" elements be removed from the HTML 3.0 DTD.

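The custom logic involved can be sketched in Python. This is only an
illustration under stated assumptions: the element names MARKSTART and
MARKEND are hypothetical, and the function check_marker_pairs is not
part of any existing tool. The point is that the pairing must be passed
in by the application; a generic SGML parser has no such knowledge,
whereas it checks ordinary container elements for free.

```python
def check_marker_pairs(events, start_name, end_name):
    """Report unmatched paired EMPTY markers in a flat list of element names.

    start_name and end_name encode application-specific knowledge that
    the caller must supply for each kind of pair.
    """
    problems = []
    open_count = 0
    for position, name in enumerate(events):
        if name == start_name:
            open_count += 1
        elif name == end_name:
            if open_count == 0:
                problems.append(f"{end_name} at {position} has no start")
            else:
                open_count -= 1
    if open_count:
        problems.append(f"{open_count} {start_name} marker(s) never ended")
    return problems


# An "end" before any "start", followed by an orphaned "start".
problems = check_marker_pairs(["MARKEND", "P", "MARKSTART"],
                              "MARKSTART", "MARKEND")
```

Note also that the whole sequence has to be scanned from the beginning
before anything can be said about a marker late in the document.
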
Tables
------

Having support for tables in HTML 3.0 is an important step forward, and
we encourage the creation of an HTML 2.1 (as has previously been
suggested) if only to bring forward the most crucial aspects of HTML 3
into an earlier spec. (We would place tables and superscript/subscript
support in this category.)

The suggestions here come from several years of discussions about
optimal table implementation and are particularly focussed on enabling
ready support of HTML tables by the existing collection of "What you
see is what you get" table editors and publishing tools.

In the long run, we recommend that the HTML 3 DTD use a table model as
similar as possible to those supported by existing products, preferably
a proper subset. In a separate posting, later, we'll make proposals for
future directions for richer formatting of tables. The current comments
are directed only to the existing HTML 3.0 tables.

Omitting formatting information, while admirable in principle, is
dangerous for tables. In practice it forces the user to use ad-hoc
line-break and other tags to manage formatting, thus engendering severe
tag-abuse and cross-client incompatibility problems. Indeed, the HTML
3 table examples we have seen regularly use non-table tags in a
bewildering variety of ways to "trick" table formatting code into
coming up with specific column widths, rather than just saying what is
wanted up front.

The current HTML DTD shows a combined attribute on the table element,
with widths and justifications packed: COLSPEC="L20C8L40". Although
this is compact, we find it too compact: it is not readily readable by
either humans or computers.

For a start, it implies that a column would maintain its
characteristics throughout the length of a table. While this is often
true, there are many cases where it's not, for example, a table
footnote area broken into two columns (or none) but still part of the
table.

The most common existing support for the capability carried by the
attribute uses a very similar technique: It associates an empty COLSPEC
element with each column of the table. This becomes the holder of the
attributes which HTML3 bunches together in one string. There are
several advantages to this approach:

1) A foundation for extensibility: We believe that people will want to
add greater capability to the ability to describe the nature of an
individual column. Beyond simply left, right, center, and width
information, they will want to specify the nature of the column
separator, and alignment to decimal point or comma (or indeed to any
special character).

2) Ease of editing: Either a human or a computer removing, adding or
moving a column can do so readily, simultaneously removing, adding or
shifting the COLSPEC element. (In fact, in our proposal below, we
suggest using the name COLSPEC for compatibility with the ICADD tables
and suggest attribute names based on those too. Certainly the names can
be changed if necessary.)

3) Compatibility with some half a dozen existing table editing tools
which use a form of COLSPEC to carry this information.

4) These changes have several benefits. SGML can validate widths as
being of type NUMBER if units are not allowed in attributes where
they are appropriate. This is very useful online, since users can then
choose either unitless "rubber" tables which pro-rate columns to
the available width, or unitful "hard" tables that use absolute
widths and clip. Users are known to demand both of these.

It is our recommendation that the content model therefore become:

<!ELEMENT TABLE - - (CAPTION?, COLSPEC*, TR*) >
<!ELEMENT COLSPEC - o EMPTY -- only exists to hold attributes -->

There has been some recent discussion about moving beyond simple left,
right and center to character alignment. To do this in an ICADD
compatible way, one could use the following attribute list
declaration:

<!ATTLIST COLSPEC
    align    (left|justify|center|right|char) "Left"
    char     CDATA    #IMPLIED
        -- character upon which to align ( such as . or , ) --
    charoff  NUTOKEN  #IMPLIED
        -- position of character upon which to align --
    colwidth CDATA    #IMPLIED>
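
To show how the packed attribute maps onto this form, here is a small
Python sketch that expands a packed string into one COLSPEC element per
column. It is an illustration only: the function name expand_colspec is
hypothetical, and it assumes that the letters L, C and R stand for left,
center and right and that each letter is followed by a whole-number
width.

```python
import re

# Assumption: L, C and R mean left, center and right alignment.
ALIGN_NAMES = {"L": "left", "C": "center", "R": "right"}

PACKED = re.compile(r"(?:[LCR]\d+)+")  # the whole packed string
ITEM = re.compile(r"([LCR])(\d+)")     # one column: letter plus width


def expand_colspec(packed):
    """Turn a packed string such as "L20C8L40" into COLSPEC start-tags."""
    if PACKED.fullmatch(packed) is None:
        raise ValueError(f"unrecognized packed column spec: {packed!r}")
    return [
        f'<COLSPEC align="{ALIGN_NAMES[letter]}" colwidth="{width}">'
        for letter, width in ITEM.findall(packed)
    ]


tags = expand_colspec("L20C8L40")
```

For "L20C8L40" this yields three tags, with alignments left, center and
left and widths 20, 8 and 40. Adding or removing a column then means
adding or removing one whole tag rather than editing inside a string.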

We are recommending omitting the separate "units" attribute and permitting
units (using compatible unit names in the colspec's colwidth attribute) to
be suffixed directly to the numbers to which they apply. We would recommend
accepting the full set of measurements in current use in publishing and
suggest the following abbreviations: px|pt|pi|mm|cm|in (and * to indicate
a relative width).
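
A reader of such values might split each one into a number and a unit
along the following lines. This Python sketch is only an illustration:
the function name parse_colwidth is hypothetical, and it assumes
lowercase unit names, decimal numbers, and that a bare number with no
suffix is reported with a unit of None.

```python
import re

# Units suggested above, plus "*" for a relative width.
WIDTH = re.compile(r"(\d+(?:\.\d+)?)(px|pt|pi|mm|cm|in|\*)?")


def parse_colwidth(text):
    """Split a colwidth value such as "20pt" or "3*" into (number, unit)."""
    match = WIDTH.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"unrecognized column width: {text!r}")
    number, unit = match.groups()
    return float(number), unit


absolute = parse_colwidth("20pt")
relative = parse_colwidth("3*")
```

With these assumptions "20pt" gives (20.0, "pt") and "3*" gives
(3.0, "*"), while a value such as "2em" is rejected.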

We have intentionally left out ems; although they are frequently useful,
it seems likely that people will want to mix point sizes in tables,
and using EMs as an overall unit of measurement will often be open to
misinterpretation. We suggest that they should also be removed anywhere
else where they may be misinterpreted in the HTML 3 DTD.

It is nearly impossible to anticipate all the questions of
interpretation a new table DTD raises; they arise over the course of
use. One such ambiguity we notice the HTML 3 DTD does not address is
whether empty cells must, may, or must not appear as placeholders in
the non-first rows of a vertical span. This needs to be stated, or
implementation disagreements will arise. If a precedent is useful, the
ICADD tables specifically call for cells to be removed if they are
"spanned over".

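The effect of this choice on an implementation can be sketched in
Python. The sketch below is only an illustration of the "removed"
reading; the function name place_cells and the (label, rowspan) input
format are hypothetical and are not taken from either DTD.

```python
def place_cells(rows, column_count):
    """Assign cells to grid columns, skipping slots covered by a span.

    rows is a list of rows, each a list of (label, rowspan) tuples.
    Cells covered by a vertical span from an earlier row are assumed to
    be absent from the markup.
    """
    grid = [[None] * column_count for _ in rows]
    for r, row in enumerate(rows):
        cells = iter(row)
        for c in range(column_count):
            if grid[r][c] is not None:
                continue  # slot already covered by a span from above
            cell = next(cells, None)
            if cell is None:
                break  # the row has no more cells
            label, rowspan = cell
            for covered in range(r, min(r + rowspan, len(rows))):
                grid[covered][c] = label
    return grid


# "A" spans two rows; the second row omits the spanned-over cell.
grid = place_cells([[("A", 2), ("B", 1)], [("C", 1)]], column_count=2)
```

Here the grid comes out as [["A", "B"], ["A", "C"]]. If an author had
instead written an empty placeholder cell before "C", this code would
put the placeholder in the second column and drop "C", which is the kind
of disagreement between implementations that an explicit rule avoids.
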
In a later posting, we would like to propose an alternate table model
which would act as the base for future development of HTML tables. That
is, we would like to propose a model capable of rich formatting and
suggest that as people (inevitably) add new functionality to their
table support that they do so in a way that is compatible with an
agreed upon HTML 4 (for instance) fashion. More on that later. (Our
goal would be to propose something which is a legal subset of CALS,
forward compatible with HTML 3 tables and incorporates the ICADD
requirements. We believe this is possible.)

Form Elements In Forms
----------------------

We strongly support the current restriction that form-relevant
elements (such as input objects) be permitted only inside FORM
elements. We also recommend clarifying the intended semantics of
"submit": a number of current browsers send all input from all forms in
a document, whenever the 'submit' of any single form is clicked. This
behavior seems undesirable.

Steve DeRose
Chair, SGML Open Technical Committee on
SGML and the Internet
EBT, 1 Richmond Square
Providence RI 02906 USA
Phone: +1 401 421-9550
Fax: +1 401 421-9551
sjd@ebt.com