Come 'n Get it: A DTD for current practice in HTML

"Daniel W. Connolly" <connolly@hal.com>
Errors-To: listmaster@www0.cern.ch
Date: Thu, 7 Apr 1994 04:37:02 --100
Message-id: <9404070228.AA20819@ulua.hal.com>
Reply-To: connolly@hal.com
Originator: www-talk@info.cern.ch
Sender: www-talk@www0.cern.ch
Precedence: bulk
From: "Daniel W. Connolly" <connolly@hal.com>
To: Multiple recipients of list <www-talk@www0.cern.ch>
Subject: Come 'n Get it: A DTD for current practice in HTML
X-Listprocessor-Version: 6.0c -- ListProcessor by Anastasios Kotsikonas
Content-Length: 5611

Ok. I did it. I grabbed a bunch of HTML files from all over creation
(NCSA, CERN, Leeds, U Hawaii, etc.), and ran them all through sgmls
with the same DTD. The results are pretty raw, but the DTD
is at:

http://www.hal.com/%7Econnolly/html-test/html.dtd

and the whole shootin' match is at:

http://www.hal.com/%7Econnolly/html-test/

I had to tweak the docs a little, but mostly, when in doubt, I tweaked
the DTD. I did find quite a few "coding errors" (e.g. missing or extra
</a> tags, HREF spelled HERF).

I'd like to take the diffs from the draft-iiir-html-01.txt version
of the DTD and enumerate them, but I pretty much rewrote the DTD,
so diff won't tell me much.

But from memory:

* I changed OMITTAG to YES in the <!SGML declaration, and tweaked
the DTD to take advantage of it. This means that

	<!DOCTYPE HTML PUBLIC "...">
	<TITLE>title</TITLE>
	<H1>header</H1>
	...

parses the same as

	<!DOCTYPE HTML PUBLIC "...">
	<HEAD><TITLE>title</TITLE></HEAD>
	<BODY><H1>header</H1>
	...
	</BODY>

which is pretty much current practice anyway.

* I also changed LI, DT, and DD from EMPTY to being containers with
omitted end tags. I think this is the way people see it.

* But I left <P> as EMPTY. The problem is not just changing all the
markup that's out there, but changing the tutorials, conversion tools,
etc. Soon, I'd like to change <P> to being a container, but not until
the folks that write tutorials and converters sign up to support the
change. And even then, perhaps a better strategy would be to introduce
a new element name <PP>..</PP> and retire <P> altogether. It's bad to
change the meaning of a widely-bound symbol.

* I changed SHORTTAG to YES to support
	<DL COMPACT> and <OPTION SELECTED>
as short forms for
	<DL STYLE="COMPACT"> and <OPTION STATE="SELECTED">
This opens up all sorts of tricky SGML parsing snafus in theory, but
in practice nobody does stuff like <em/this/. Yet.

* I added forms as per

http://www.ncsa.uiuc.edu/SDG/Software/Mosaic/Docs/fill-out-forms/overview.html

All 13 example documents are valid (except for a missing </FORM> in one
of them...). I also checked a couple of random forms from other places.

* The content model for BODY is ANY. Technically, this means you
could put TITLE, LINK, BASE, and ISINDEX tags in the body even though
they're only supposed to go in the HEAD. But I got lazy...

* I added a FIG element, cuz I got grossed out when folks used PRE
to attach captions to images.

* Most elements are allowed anywhere:
	- The basic "tag soup" level is called %htext;
	It consists of #PCDATA, plus %inline (EM, STRONG, ...)
	plus %fonts (B, I, TT, ...) plus A, P, and BR.

	All these can be nested except A.

	- A can contain anything (ANY content model), like headers, lists, etc.

	- The structured elements are called %block,
	i.e. DL, UL, OL, PRE, BLOCKQUOTE, FORM...
	Most of these can be nested, but FORM can't contain another FORM.

	- Headers can only occur inside BODY (or inside A).
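
As a rough illustration of the "FORM can't contain another FORM" rule
above, here is a minimal Python sketch that flags an element opened
inside another element of the same name. It is not how sgmls or the DTD
does it (an SGML parser checks this against the content model); the
class name NestingChecker, the function nesting_errors, and the
no_self_nesting parameter are all made up for this example. Only the
FORM rule is taken from the list above; also passing "a" is one reading
of "except A", which is an assumption.

```python
from html.parser import HTMLParser


class NestingChecker(HTMLParser):
    """Records every element that is opened inside an element of the same name."""

    def __init__(self, no_self_nesting=("form",)):
        super().__init__()
        # HTMLParser reports tag names in lower case, so keys are lower case.
        self.depth = {name.lower(): 0 for name in no_self_nesting}
        self.errors = []

    def handle_starttag(self, tag, attrs):
        if tag in self.depth:
            if self.depth[tag] > 0:
                self.errors.append("<%s> nested inside <%s>" % (tag, tag))
            self.depth[tag] += 1

    def handle_endtag(self, tag):
        # Ignore stray end tags rather than letting the count go negative.
        if self.depth.get(tag, 0) > 0:
            self.depth[tag] -= 1


def nesting_errors(markup, no_self_nesting=("form",)):
    """Return a list of self-nesting violations found in markup."""
    checker = NestingChecker(no_self_nesting)
    checker.feed(markup)
    checker.close()
    return checker.errors
```

For example, nesting_errors('<FORM><UL><LI><FORM></FORM></UL></FORM>')
returns ['<form> nested inside <form>'], while a FORM that merely
contains lists and text returns an empty list.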

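The SHORTTAG short forms mentioned earlier (<DL COMPACT> standing for
<DL STYLE="COMPACT">) can be pictured as a lookup: a bare token in a
start tag is matched against the enumerated values declared for that
element, and the attribute name is filled in. The Python sketch below
is only an illustration under that assumption. The ENUMERATED table
holds just the two examples from this message (a real parser reads the
values from the DTD's ATTLIST declarations), and the function name
expand_short_attributes is hypothetical.

```python
import re

# Enumerated attribute values per element, taken from the two examples in
# this message. Assumption: a real parser builds this table from the DTD.
ENUMERATED = {
    "DL": {"STYLE": {"COMPACT"}},
    "OPTION": {"STATE": {"SELECTED"}},
}

# A start tag: element name plus an optional run of attributes.
# Limitation of this sketch: a ">" inside a quoted value is not handled.
TAG = re.compile(r"<([A-Za-z][A-Za-z0-9.-]*)((?:\s+[^<>]*)?)>")

# One attribute: either NAME=value (quoted or not) or a bare token.
ATTR = re.compile(r'''([^\s=]+)(\s*=\s*(?:"[^"]*"|'[^']*'|\S+))?''')


def expand_short_attributes(markup):
    """Rewrite bare enumerated tokens as NAME="VALUE" pairs."""

    def expand_tag(match):
        declared = ENUMERATED.get(match.group(1).upper(), {})
        parts = [match.group(1)]
        for attr in ATTR.finditer(match.group(2)):
            text = attr.group(0)
            if attr.group(2) is None:  # bare token, no "=value" part
                token = attr.group(1).upper()
                for name, values in declared.items():
                    if token in values:
                        text = '%s="%s"' % (name, token)
                        break
            parts.append(text)
        return "<" + " ".join(parts) + ">"

    return TAG.sub(expand_tag, markup)
```

With this table, expand_short_attributes('<OPTION SELECTED>') gives
'<OPTION STATE="SELECTED">', and a tag with no declared short forms,
such as <UL COMPACT>, is left unchanged.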

Here are the idioms I saw on my journey that I'd like to ban ASAP,
since they'll never work with SGML tools:

	* Big long unquoted attribute values, like
		<A HREF=http://foo/bar/baz.html>
	The LaTeX2HTML converter is particularly bad about this.

	* Anchor names that don't work as IDs, e.g. <A NAME=1> or
	<A NAME="my_name">. If we make the anchor name an SGML ID,
	the SGML parser can validate uniqueness.

	* <SELECT SIZE=12,3>. Use <SELECT SIZE="12 3"> and
	an SGML parser can validate that the SIZE attribute is
	a list of numbers. Gotta use ""'s in any case.
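
The three idioms above are easy to spot mechanically. The following is
a rough Python sketch of such a check, not a substitute for running the
documents through sgmls. The function name lint and the message wording
are invented for this example, and the name rule (a letter followed by
letters, digits, "." or "-") assumes the usual SGML name characters,
which is why "1" and "my_name" fail as IDs.

```python
import re

# A start tag and its attribute run (a ">" inside a quoted value is not handled).
START_TAG = re.compile(r"<([A-Za-z][A-Za-z0-9.-]*)(\s[^<>]*)?>")
# NAME=value, where value is double-quoted, single-quoted, or bare.
ATTRIBUTE = re.compile(
    r'''([A-Za-z][A-Za-z0-9.-]*)\s*=\s*("[^"]*"|'[^']*'|[^\s"'>]+)'''
)
# Assumed SGML name syntax: a letter, then letters, digits, "." or "-".
SGML_NAME = re.compile(r"[A-Za-z][A-Za-z0-9.-]*\Z")
# Values made only of name characters may go unquoted.
BARE_SAFE = re.compile(r"[A-Za-z0-9.-]+\Z")
# A whitespace-separated list of numbers, e.g. "12 3".
NUMBERS = re.compile(r"[0-9]+(\s+[0-9]+)*\Z")


def lint(markup):
    """Return a list of messages for the three idioms described above."""
    problems = []
    for tag in START_TAG.finditer(markup):
        element = tag.group(1).upper()
        for attr in ATTRIBUTE.finditer(tag.group(2) or ""):
            name, raw = attr.group(1).upper(), attr.group(2)
            quoted = raw[0] in "\"'"
            value = raw[1:-1] if quoted else raw
            if not quoted and not BARE_SAFE.match(value):
                problems.append("unquoted value: %s=%s" % (name, value))
            if element == "A" and name == "NAME" and not SGML_NAME.match(value):
                problems.append("NAME is not a valid ID: %s" % value)
            if element == "SELECT" and name == "SIZE" and not NUMBERS.match(value):
                problems.append("SIZE is not a list of numbers: %s" % value)
    return problems
```

For instance, lint('<SELECT SIZE=12,3>') reports both the unquoted
value and the malformed number list, while lint('<SELECT SIZE="12 3">')
returns an empty list.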

Here are some that just gross me out:

	* Use of %font (<B> and <I>) outside of PRE. If folks
	don't have enough info to choose between EM, CITE, STRONG,
	etc., just use <EM> for italics and <STRONG> for bold.

	* Use of %inline (EM, CITE, STRONG) inside PRE. It doesn't
	make sense to me.
	
	* <IMG> inside <PRE> to attach captions to images.

	* Using <DL><DD>...<DD>..</DL> instead of <UL>...</UL>
	just cuz it looks better.


So... now.. what's next? If folks are interested in what I've got
(that is: interested enough to test out the DTD on their stuff, use it
to guide experimentation, stuff like that) then I'm willing to add
some comments and write up what I've got from a technical point of view.

Then, in order to publish this DTD, we need a human-readable document
to accompany it. The CERN spec is pretty bulky, and it would need
quite a bit of editing to get all the new features in and the old
crufties out.

By the way... is anybody interested in various "levels of
conformance"? The CERN spec categorizes stuff as "Mainstream" or
"Extra". If we're going to standardize that stuff, we need
corresponding versions of the DTD so that we can ask an SGML parser
"Did I use any non-Mainstream features?" i.e. "Will this document be
presented without loss of information on all minimally-conforming
implementations?"

Anyway... What I've created here matches NCSA Mosaic for X version 2.2
pretty well. I'd be willing to attach NCSA's "A Beginner's Guide to
HTML" (aka
http://www.ncsa.uiuc.edu/General/Internet/WWW/HTMLPrimer.html) and
submit it as an informational RFC. Or if the NCSA pubs folks would
like to write an HTML reference, I'm sure that would go over well.

Then, after that... is anybody out there interested in making HTML a
little more useful as an SGML document type? e.g. make <P> a
container, use the ID/IDREF feature here and there... use processing
instructions instead of <BR> tags... stuff like that so that
we could start building tools to take advantage of structure
(outlining tools, automated ways to combine documents for printing and
such...).

Dan