Re: Does HTML today comply with SGML?

"Daniel W. Connolly" <connolly@hal.com>
Errors-To: listmaster@www0.cern.ch
Date: Mon, 4 Apr 1994 21:28:35 --100
Message-id: <9404041915.AA16588@ulua.hal.com>
Reply-To: connolly@hal.com
Originator: www-talk@info.cern.ch
Sender: www-talk@www0.cern.ch
Precedence: bulk
From: "Daniel W. Connolly" <connolly@hal.com>
To: Multiple recipients of list <www-talk@www0.cern.ch>
Subject: Re: Does HTML today comply with SGML? 
X-Listprocessor-Version: 6.0c -- ListProcessor by Anastasios Kotsikonas
Content-Type: text/plain; charset="us-ascii"
Mime-Version: 1.0
Content-Length: 7766
In message <9404041841.AA09871@dxmint.cern.ch>, Rich Wiggins writes:
>Our best Web-weaver, Chuck Henrich, and I are trying to resolve a simple
>question:  Does HTML comply with SGML?  All along it's been my
>interpretation that it does.  By that I assume that it is possible
>to take the Document Type Definition for HTML and feed that to
>an SGML tool, and use that tool to validate HTML documents.

Yes. For example, if your doc is called 'foo.html' and you have
some version of "the HTML DTD" (e.g. the text from the appendix
of draft-ietf-iiir-html-01.txt, currently available as
http://info.cern.ch/hypertext/WWW/MarkUp/HTML.dtd.html)
and you have the sgmls executable in your path, you
can issue:

	sgmls -s html.dtd foo.html

and sgmls will validate foo.html.
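
For anyone who wants to script this, here is a minimal Python sketch that
wraps the same command. The function names are made up for illustration,
and the success test is an assumption: it treats a zero exit status with
nothing on stderr as "valid", which may not match every sgmls build.

```python
import shutil
import subprocess


def build_sgmls_command(dtd_path, doc_path):
    """Return the argv for the command quoted above: sgmls -s <dtd> <doc>."""
    return ["sgmls", "-s", dtd_path, doc_path]


def validate(dtd_path, doc_path):
    """Run sgmls on the DTD plus document.

    Returns True or False, or None when sgmls is not on the PATH.
    Assumption: a clean run exits with status 0 and prints nothing
    on stderr.
    """
    if shutil.which("sgmls") is None:
        return None
    proc = subprocess.run(
        build_sgmls_command(dtd_path, doc_path),
        capture_output=True,
        text=True,
    )
    return proc.returncode == 0 and not proc.stderr.strip()
```

Calling validate("html.dtd", "foo.html") mirrors the command line above.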

>  Chuck
>is under the impression that at least some aspects of HTML are not
>compliant; ie one could not use a commercial SGML tool to validate
>documents as used on the Web today.

Sad but true. Few authors take the time to validate their documents.
Few tool developers have ensured that their tools generate
valid HTML. Idioms like nested lists and such are not valid according
to the DTD, but they are supported and used widely.

The DTD has not been maintained over the last year or so.

>If HTML does comply, is there anyone out there who actually uses
>an SGML tool to verify their HTML documents?

I do. And I'm trying to develop a new DTD that captures more of
the current idioms. I'm currently collecting the issues and
developing a test suite.


>  Or is this just
>talked about?  (There was a great deal of discussion a few months
>ago as to the badness of using Mosaic or another browser as a
>validation tool.)

Yes. Things have deteriorated. It used to be that you could validate
a document with something like sgmls and be pretty sure it would
work; today the only way to be reasonably confident that a
document is useful is to try it on all the different browsers.

>How about the proposed HTML+ spec?  Is it any more or less compliant
>with SGML?

There's a DTD, but I'm not sure how the various features are being
used and/or supported.


Here's a reply I gave to a similar inquiry in comp.text.sgml:

In-reply-to: swlodin@kocrsv01.delcoelect.com's message of Thu, 24 Feb 1994 02:00:10 GMT
Newsgroups: comp.text.sgml,comp.infosystems.www
Subject: Re: Help with the relationship between SGML and HTML
References: <1994Feb24.020010.7431@kocrsv01.delcoelect.com>
Distribution: 
--text follows this line--
In article <1994Feb24.020010.7431@kocrsv01.delcoelect.com> swlodin@kocrsv01.delcoelect.com (Steve Lodin) writes:

   I would like some clarification of the relationship between SGML, HTML,
   DTD, and the ISO standards.  Here is my interpretation.  Please correct
   me where necessary.

As I wrote the original draft of the HTML spec with the SGML standard
in my lap, I'll be glad to comment.

   - SGML (Standard Generalized Markup Language) is an ISO standard 
   (ISO 8879:1986).  It is a system for defining documents and the 
   markup languages that represent those document types.

You got it. The official way to cite SGML is:

ISO 8879:1986, Information Processing -- Text and Office Systems --
Standard Generalized Markup Language (SGML)

   - A DTD (Document Type Definition??) is a specific set of SGML semantics
   used to specify a document type and the markup language for representing
   that document.

   - HTML is an example of a SGML DTD.  Other examples are??

Here's the way I phrased it when I wrote the spec:

From http://info.cern.ch/hypertext/WWW/MarkUp/Intro.html
(which is part of:
 ftp://ds.internic.net/internet-drafts/draft-ietf-iiir-html-01.txt):

The HyperText Markup Language is defined in terms of the ISO Standard
Generalized Markup Language [SGML]. SGML is a system for defining
structured document types and markup languages to represent instances
of those document types.

Every SGML document has three parts:

 o An SGML declaration, which binds SGML processing quantities and
syntax token names to specific values. For example, the SGML
declaration in the HTML DTD specifies that the string that opens an
end tag is </ and the maximum length of a name is 40 characters.

 o A prologue including one or more document type declarations, which
specify the element types, element relationships and attributes, and
references that can be represented by markup. The HTML DTD specifies,
for example, that the HEAD element contains at most one TITLE element.

 o An instance, which contains the data and markup of the document. 

We use the term HTML to mean both the document type and the markup
language for representing instances of that document type.

All HTML documents share the same SGML declaration and prologue. Hence
implementations of the WorldWide Web generally only transmit and store
the instance part of an HTML document. To construct an SGML document
entity for processing by an SGML parser, it is necessary to prefix the
text from ``HTML DTD'' on page 10 to the HTML instance.

Conversely, to implement an HTML parser, one need only implement those
parts of an SGML parser that are needed to parse an instance after
parsing the HTML DTD.
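
The "prefix the DTD text to the instance" step can be sketched in a few
lines of Python. This is only an illustration: build_document_entity is a
hypothetical helper name, and the prologue string in the usage note is a
placeholder standing in for the real declaration-plus-DTD text.

```python
def build_document_entity(prologue, instance):
    """Prefix the shared SGML declaration and prologue to an HTML instance.

    `prologue` is the text every HTML document shares; `instance` is
    what Web servers actually store and transmit.
    """
    # Keep the prologue and the instance on separate lines.
    if not prologue.endswith("\n"):
        prologue += "\n"
    return prologue + instance
```

For example, build_document_entity("<!-- HTML DTD text here -->",
"<TITLE>x</TITLE>") yields the placeholder line followed by the instance,
which is the form an SGML parser expects to be handed.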


   I have some other questions:

   - Can I effectively say that HTML is ISO-compliant or ISO-compatible?

I'm not sure how those terms are defined. These terms _are_ defined by
the SGML standard:

Conforming SGML Document: An HTML document (that is, when you take an
html file and prepend the HTML.DTD text, and change all the unix
newlines to SGML RE's) is a Conforming SGML Document (as per section
15.1 of the SGML standard). (Well... they're supposed to be anyway...
there are a lot of html files out there that wouldn't parse correctly
relative to any published DTD).
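
The newline step, read literally, is a one-line transformation. The Python
sketch below assumes the record end (RE) is character 13, as in the SGML
reference concrete syntax; the function name is hypothetical, and a real
parser may do this mapping for you when it reads the file.

```python
# Record end (RE): character 13 (carriage return) in the reference
# concrete syntax. Assumed here to apply to HTML's declaration too.
RE = "\r"


def unix_newlines_to_re(text):
    """Replace every Unix newline (character 10) with an SGML RE."""
    return text.replace("\n", RE)
```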

Minimal SGML Document: An HTML document (as above) is also a Minimal
SGML document (meaning it doesn't take a very powerful SGML parser to
parse it.) [There may be a few corners of the HTML declaration that
aren't quite minimal -- I'm not 100% sure at the moment, since we
tweaked LITLEN and NAMELEN a little]

Conforming SGML Application: WWW is _not_ a conforming SGML
application. For one, you have to document the fact that you are one
in order to be one, and nobody's done that. For two, most WWW
implementations allow all kinds of crap in HTML documents, and section
15.2.2 says "A conforming SGML application shall require its documents
to be conforming SGML documents... ."

Conforming SGML System: WWW is _not_ a conforming SGML system -- you
have to support arbitrary DTD's for this part.

   - When creating an SGML document, is the DTD the filter or translator
   that defines the resulting output?

Nope. The only output specified by the SGML standard is "Yes. It is a
valid document instance" or "No. It is not a valid document instance."
There's also this sort-of-formal ESIS (Element Structure Information
Set) that an SGML parser magically exposes to an
application. The question of how to translate the ESIS to PostScript,
for example, is not specified by the DTD or any other SGML entity
(you'd have to look into DSSSSSSSSL or FOSI or something).

There are some tools for converting HTML to LaTeX for printing I
believe... though I am not familiar with their completeness/quality.

   - What commercial programs exist to create SGML documents?

Try the comp.text.sgml FAQ or some such... this isn't my area of
expertise.

   - What is the status/relationship of HTML+ to this?

   - What documentation exists about HTML other than what is on info.cern.ch?

Try this: http://www-external.hal.com/~connolly/html-design.html
It's my notebook on the design of a successor for HTML. It's got
pointers to all sorts of tutorials, discussion, and related specs.


Dan