Re: proposed registration of type 'text/html' for MIME

Dan Connolly <connolly@pixel.convex.com>
Message-id: <9211102338.AA02403@pixel.convex.com>
To: Edward Vielmetti <emv@msen.com>
Cc: www-talk@nxoc01.cern.ch
Subject: Re: proposed registration of type 'text/html' for MIME 
In-reply-to: Your message of "Tue, 10 Nov 92 15:13:07 EST."
             <m0mp1xh-00009MC@garnet.msen.com> 
Date: Tue, 10 Nov 92 17:38:19 CST
From: Dan Connolly <connolly@pixel.convex.com>

>Here's the form for registering 'text/html' partly filled in, from RFC
>1341.

I strongly suggest we bring the definition of HTML into conformance
with the SGML standard before we register it with the IANA.

>Published specification:
>	"The HTTP Protocol as Implemented in W3", available for
>	anonymous ftp from ftp://info.cern.ch/pub/doc/www/http.txt.  
>	Describes the HTTP interactive access protocol and the tags used
>	in HTML documents.

This is the HTTP document, not the HTML document:

     This document defines the Hypertext Transfer protocol (HTTP) as
     currently implemented by the WorldWideWeb initiative software.

The HTML document is: http://info.cern.ch/hypertext/WWW/MarkUp/MarkUp.html
an old version of which is contained in http.txt.

In any case, both documents mention some relationship between HTML and
SGML which is not formally defined:

   The hypertext mark-up language is an SGML format. This defines the
   basic syntax used. The particular language, the set of tags and the
   rules about their use, and their significance is not part of the
   SGML standard. There being no standard on this, we have adopted a
   set which seems sensible. We call them HTML -- hypertext markup
   language. HTML is not an alternative to SGML, it is a particular
   format within the SGML rules (an SGML "DTD").

The standard is very clear on this kind of thing. [I just got myself a
copy, so I can quote it:]

	4.103 (document) type declaration: A markup declaration that
	contains the formal specification of a document type
	definition.

	4.104 document type declaration subset: The element, entity,
	and short reference sets occurring within the declaration
	subset of a document type declaration.

	4.105 document (type) definition: Rules, determined by an
	application, that apply SGML to the markup of documents of a
	particular type. A document type definition includes a formal
	specification, expressed in a document type declaration, of
	the element types, element relationships, and attributes, and
	references that can be represented by markup. It thereby
	defines the vocabulary of the markup for which SGML defines
	the syntax.

So it seems that the HTML DTD is missing the "formal specification."
I have written a document type declaration subset that matches HTML as
currently defined and implemented, with a few exceptions (most
importantly, the PLAINTEXT tag). See
http://info.cern.ch/hypertext/WWW/MarkUp/HTML.dtd

Most existing HTML documents need only small modifications to bring
them into conformance (quote attribute values, add the <!DOCTYPE ...>
prologue). And the existing WWW browser parses conforming documents
just fine.

     Currently HTML documents are transmitted without the normal SGML framing
     tags, but if these are included parsers will ignore them.

I don't know what "the normal SGML framing tags" are. An SGML document
has three parts: the SGML declaration, the prologue, and the instance.
It is common in SGML applications to use an implied SGML declaration
and include the prologue by reference (kinda like an #include
directive in C), but without these "framing tags," it's just not an
SGML document.

Besides, it's very little work to add the line:

<!DOCTYPE HTML SYSTEM>

at the beginning of HTML documents.
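
As a rough illustration of how small these modifications are, here is a
minimal Python sketch that adds the prologue line and quotes bare
attribute values. The names conform and quote_attributes and both
regular expressions are assumptions made for this example, not part of
any WWW software. The tag pattern is deliberately naive: it ignores
comments and will misfire on text inside an already quoted value that
happens to look like name=value.

```python
import re

DOCTYPE = "<!DOCTYPE HTML SYSTEM>"

# Naive start-tag pattern: "<", a name, then anything up to the next ">".
_START_TAG = re.compile(r"<[A-Za-z][A-Za-z0-9.-]*[^<>]*>")
# An attribute whose value is not yet quoted: name=value
_BARE_ATTR = re.compile(r"""(\s[A-Za-z][A-Za-z0-9.-]*\s*=\s*)([^\s"'>]+)""")


def quote_attributes(tag: str) -> str:
    """Wrap unquoted attribute values inside one start tag in double quotes."""
    return _BARE_ATTR.sub(r'\1"\2"', tag)


def conform(document: str) -> str:
    """Quote attribute values and prepend the DOCTYPE line if it is missing."""
    body = _START_TAG.sub(lambda m: quote_attributes(m.group(0)), document)
    if body.lstrip().upper().startswith("<!DOCTYPE"):
        return body
    return DOCTYPE + "\n" + body


if __name__ == "__main__":
    print(conform("<TITLE>T</TITLE>\n<A NAME=intro HREF=foo.html>x</A>"))
```

Running conform on its own output changes nothing, so it is safe to
apply to documents that already conform.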

More non-conforming stuff in MarkUp.html:

Plaintext

   This tag indicates that all following text is to be taken literally, up to
   the end of the file.  Plain text is designed to be represented in the same
   way as example XMP text, with fixed width character and significant line
   breaks. Format:
   

                <PLAINTEXT>

   This tag allows the rest of a file to be read efficiently without parsing.
   Its presence is an optimisation. There is no closing tag.

This should be moved outside the definition of HTML. It should just be
part of the HTTP protocol: if the server starts the response with
<PLAINTEXT>, what you're getting is plain text, not SGML.
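
A sketch of what that could look like on the client side, in Python.
The function name classify_response, the case-insensitive match, and
the dropping of one newline after the marker are all assumptions made
for illustration; nothing here is taken from the HTTP document.

```python
from typing import Tuple

PLAINTEXT_TAG = "<PLAINTEXT>"


def classify_response(body: str) -> Tuple[str, str]:
    """Return (kind, content), where kind is 'plain' or 'sgml'.

    A response that starts with <PLAINTEXT> is treated as plain text and
    never reaches the SGML parser. Anything else is passed on unchanged.
    """
    if body[:len(PLAINTEXT_TAG)].upper() == PLAINTEXT_TAG:
        content = body[len(PLAINTEXT_TAG):]
        # Assumption: one newline directly after the marker is not content.
        if content.startswith("\n"):
            content = content[1:]
        return "plain", content
    return "sgml", body


if __name__ == "__main__":
    print(classify_response("<PLAINTEXT>\na < b </P>"))
    print(classify_response("<!DOCTYPE HTML SYSTEM>\n<TITLE>x</TITLE>"))
```

With the check done before parsing, the HTML DTD never has to describe
an element that has no end tag and swallows the rest of the file.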

Another problem:

Example sections

       The text may contain any ISO Latin printable characters, including the
          tag opener, so long as it does not contain the closing tag in full.

This doesn't fit in SGML. In an element with CDATA content, the ETAGO
delimiter ("</") followed by a letter ends the content, whichever tag
name comes after it.
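
To make the conflict concrete, the following Python sketch mimics how a
conforming parser would find the end of character data in an example
section: it stops at the first "</" that is followed by a letter, no
matter which tag name follows. The function name split_cdata is
hypothetical, and the rule is a simplified reading of the standard's
delimiter-in-context recognition.

```python
import re
from typing import Tuple

# Simplification: "</" counts as a delimiter only when a letter follows.
_ETAGO_IN_CONTEXT = re.compile(r"</(?=[A-Za-z])")


def split_cdata(text: str) -> Tuple[str, str]:
    """Split text into (character data, remainder) at the first ETAGO.

    The remainder starts with the "</" that ended the character data,
    or is empty when no such delimiter occurs.
    """
    match = _ETAGO_IN_CONTEXT.search(text)
    if match is None:
        return text, ""
    return text[:match.start()], text[match.start():]


if __name__ == "__main__":
    data, rest = split_cdata("if (a </ b) <B>bold</B> tail</XMP>")
    print(repr(data))
    print(repr(rest))
```

In this example the character data ends at "</B>", not at "</XMP>" as
MarkUp.html intends, while the "</" followed by a space is left alone.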

A clarification:

Paragraph

   This tag indicates a new paragraph. The exact representation of this
   (indentation,  leading, etc) is not defined here, and may be a function of
   other tags, style sheets etc. The format is simply
   

        <P>

   (In SGML terms, paragraph elements are transmitted in minimised form).

The implementation suggests that the <P> tag marks an empty element, a
paragraph separator, rather than allowing minimization in the form of
an omitted end tag, </P>.
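
The two readings give different results for the same text, as this
small Python sketch shows. Both function names are made up for the
example, and dropping empty paragraphs is a simplification.

```python
import re
from typing import List

_P_TAG = re.compile(r"<P>", re.IGNORECASE)


def split_as_separator(text: str) -> List[str]:
    """Reading 1: <P> is an empty element that sits between paragraphs."""
    return [part.strip() for part in _P_TAG.split(text) if part.strip()]


def split_as_container(text: str) -> List[str]:
    """Reading 2: <P> opens a paragraph whose end tag </P> is omitted.

    Text before the first <P> is not inside any paragraph element,
    so it is left out of the result.
    """
    parts = _P_TAG.split(text)
    return [part.strip() for part in parts[1:] if part.strip()]


if __name__ == "__main__":
    sample = "First.<P>Second.<P>Third."
    print(split_as_separator(sample))
    print(split_as_container(sample))
```

Under the separator reading the sample has three paragraphs; under the
container reading "First." is loose text outside any paragraph.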



We could even go so far as to call WWW an SGML application:

	 4.279 SGML Application: Rules that apply SGML to a text
	 processing application. An SGML application includes a formal
	 specification of the markup constructs used in the
	 application, expressed in SGML. It can also include a
	 non-SGML definition of semantics, application conventions,
	 and/or processing.

	 Note 2 The formal specification of an SGML application
	 constitutes the common portions of the documents processed by
	 the application. These common portions are frequently made
	 available as public text.

In other words, ftp://info.cern.ch/pub/doc/the_www_book.txt would
serve as the "non-SGML definition." [by the way: I could only find
PostScript and LaTeX versions of the book: no txt file.] The "common
portion" is HTML.dtd (we could obtain a public text identifier for
it...).

If we want to do this (define an SGML application) section 15.5
requires this statement to be plastered all over the place:

	 An SGML Application Conforming to International Standard
	 ISO 8879 -- Standard Generalized Markup Language

If we're gonna use SGML, why not do it right?
 
Dan