Comments on MIME/SGML

"Daniel W. Connolly" <connolly@hal.com>
Errors-To: listmaster@www0.cern.ch
Date: Fri, 25 Feb 1994 10:59:17 --100
Message-id: <9402190010.AA10063@ulua.hal.com>
Reply-To: connolly@hal.com
Originator: www-talk@info.cern.ch
Sender: www-talk@www0.cern.ch
Precedence: bulk
From: "Daniel W. Connolly" <connolly@hal.com>
To: Multiple recipients of list <www-talk@www0.cern.ch>
Subject: Comments on MIME/SGML
X-Listprocessor-Version: 6.0c -- ListProcessor by Anastasios Kotsikonas
Content-Length: 21544
------- =_aaaaaaaaaa0
Content-Type: text/plain; charset="us-ascii"
Content-ID: <10024.761615492.2@ulua>
Content-Description: MIME/SGML as text

                                                          Comments on MIME/SGML
               PROPOSED CHANGES FOR MIME REPRESENTATION OF SGML
                                       
                                          Daniel W. Connolly <connolly@hal.com>
                                                                               
SGML Entities in MIME

   I believe that the goals of Mr. Levinson's MIME Content-types for SGML
   Documents[1] are essential to the success of the Internet as an Integrated
   Open Hypermedia (cf. HyTime[2]) system.
   
   However, after a careful reading of the SGML standard[3] (specifically
   section 6: Entity Structure), I believe that SGML/MIME[4] fails to specify
   the most important machine representation of an SGML document; that is, the
   SGML document entity (production 2, section 6.2).
   
   In conventional practice, an SGML document "doc" of type "T" is represented
   as a file doc.sgml, which looks like:
   

<!DOCTYPE T SYSTEM "t.dtd" [
<!ENTITY fig1 SYSTEM "foo.ps" postscript>
]>
<T>blah blah blah <figure graphic=fig1> blah blah blah</T>

   along with an SGML declaration in the file T.decl. Technically speaking, the
   SGML document entity is the concatenation of T.decl and doc.sgml (with
   perhaps some system-specific newline->RS/RE conversions). But according to
   the standard it's OK to interchange SGML documents with an implied SGML
   declaration, and in practice, the SGML declaration is often compiled into
   the processing software.
   
   So for all intents and purposes, the file doc.sgml is an SGML document
   entity. And it seems critical that the file doc.sgml should correspond to
   the body of some MIME body part.
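   
   As a rough illustration of that convention, here is a minimal Python
   sketch of how a receiving system might assemble the document entity from
   the declaration and the document file. The function name document_entity
   is hypothetical, and the newline->RS/RE conversion mentioned above is
   deliberately left out:

```python
from typing import Optional


def document_entity(document: str, declaration: Optional[str] = None) -> str:
    """Return the text of the SGML document entity.

    `document` is the text of the document file (prologue plus instance).
    `declaration` is the text of the SGML declaration, or None when the
    declaration is implied, i.e. already known to the processing software.
    """
    if declaration is None:
        # Implied declaration: the document file alone is interchanged.
        return document
    # Keep the declaration and the prologue on separate lines.
    if not declaration.endswith("\n"):
        declaration += "\n"
    return declaration + document
```

   With an implied declaration the function returns the document file
   unchanged, which is why that file is the natural candidate for the body
   of a MIME body part.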
   
   The draft[5] misuses the term "DTD", aka "Document Type Definition" (defn
   4.104). An SGML document indeed has three parts: the SGML declaration, the
   prologue, and the instance. And the distinction between the term prologue
   and the term DTD is not trivial.
   
   First, to be pedantic, a DTD is not generally representable in SGML syntax.
   The concept of the DTD includes not only the SGML-representable formal part,
   but also the associated application conventions which cannot be represented
   in SGML. Also -- a document may have more than one DTD in its prologue.
   
   Second, to be practical, the conventional machine representation of a DTD is
   just an SGML text entity in the file t.dtd which looks like:
   

<!NOTATION postscript PUBLIC "-/Adobe/Postscript">
<!ELEMENT T - - (#PCDATA) >
..

   Note that it does not correspond exactly to the prologue of the document, in
   that it does not contain the <!DOCTYPE [ ... ]> markup.
   
   For these reasons, I suggest the following modifications to the proposed
   MIME representation of SGML documents:
   
   1. We make the following correspondence between the terms of the SGML
   standard and the MIME RFC:
   
      SGML notation => MIME content-type
      
       SGML SYSTEM identifier => MIME Content-ID
      
       SGML data entity (= notation, data)
      
       => MIME body part (=content type, body)
      
       SGML text entity (= sequence of characters)
      
       => body of MIME text/sgml body part (= seq of chars)
      
       SGML document => a MIME multipart/SGML body part
      
   2. We use text/sgml instead of application/sgml for the SGML "files", since
   they are in general readable on teletype devices.
   
   3. We change the parameters of the multipart/SGML content type from
   

               sgml-part       := "instance" / "declaration"
                                / "dtd" / "fosi" / extension-token

   to
   

               sgml-part       := "document" / "declaration"
                                / "dtd" / "fosi" / extension-token

   where "document" is required, declaration is optional, and dtd is actually
   redundant (since it's in the document entity) but useful, since a MIME UA
   might want to know what kind of document it is without parsing the document.
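   
   To make the proposed correspondence concrete, here is a minimal Python
   sketch, using the standard email package, that packages a document
   entity, a DTD, and an optional declaration as one multipart body part,
   and that looks up a part by Content-ID the way an SGML SYSTEM identifier
   would be resolved. The helper names and the example.org Content-IDs are
   made up for illustration, and the experimental x-sgml subtypes stand in
   for the text/sgml and multipart/SGML types proposed above:

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def sgml_text_part(text, content_id, description):
    """Wrap one SGML text entity as a text/x-sgml body part."""
    part = MIMEText(text, "x-sgml", "us-ascii")
    part["Content-ID"] = "<%s>" % content_id
    part["Content-Description"] = description
    return part


def build_sgml_message(document, dtd=None, declaration=None):
    """Build a multipart/x-sgml body part.

    Each argument is a (content_id, text) pair. Only "document" is
    required; "dtd" and "declaration" are optional.
    """
    entities = [("document", document, "SGML document entity")]
    if dtd is not None:
        entities.append(("dtd", dtd, "DTD"))
    if declaration is not None:
        entities.append(("declaration", declaration, "SGML declaration"))

    # The multipart parameters name each part by its Content-ID.
    params = {role: pair[0] for role, pair, _ in entities}
    msg = MIMEMultipart("x-sgml", **params)
    for _, (content_id, text), description in entities:
        msg.attach(sgml_text_part(text, content_id, description))
    return msg


def find_part(msg, system_id):
    """Resolve an SGML SYSTEM identifier to the part with that Content-ID."""
    wanted = "<%s>" % system_id
    for part in msg.walk():
        if part.get("Content-ID") == wanted:
            return part
    return None


DOC = '<!DOCTYPE T SYSTEM "dtd.1@example.org">\n<T>blah blah blah</T>\n'
DTD = "<!ELEMENT T - - (#PCDATA) >\n"

message = build_sgml_message(
    ("doc.1@example.org", DOC),
    dtd=("dtd.1@example.org", DTD),
)

if __name__ == "__main__":
    print(message.as_string())
```

   A user agent that only wants to know the document type can read the dtd
   parameter from the multipart header and call find_part with it, without
   parsing the document entity at all.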
   
References

        Network Working Group, Internet Draft: MIME/SGML,
      <draft-levinson-sgml-01.txt>, E. Levinson, Accurate Information Systems,
      Inc., January 17, 1993
      
        ISO 8879:1986, Information Processing: Text and Office Systems:
      Standard Generalized Markup Language (SGML)
      
      ISO/IEC 10744 Information technology -- Hypermedia/Time-based Structuring
      Language (HyTime)
      
Editorial Note

   I used HTML because it works to a certain extent, not because I think it's
   exactly how Internet IOH should work. My comments on HTML are still
   under development. See my notebook on the design of an HTML successor[6].
   
Production Note

   This document is brought to you by the following tools:
   
       Lucid emacs
      
       html-mode by Marc Andreessen
      
       SGMLs
      
       NCSA Mosaic
      
   

------- =_aaaaaaaaaa0
Content-Type: multipart/x-sgml; boundary="----- =_aaaaaaaaaa1";
	document="10024.761615492.5@ulua";
	dtd="10024.761615492.6@ulua";
	declaration="10024.761615492.7@ulua"
Content-ID: <10024.761615492.3@ulua>

------- =_aaaaaaaaaa1
Content-Type: text/x-html; charset="us-ascii"
Content-ID: <10024.761615492.4@ulua>
Content-Description: comments on MIME/SGML as html
Content-Transfer-Encoding: quoted-printable

<HEAD>
<TITLE>Comments on MIME/SGML</TITLE>
</HEAD>
<BODY>
<H1>Proposed Changes for MIME representation of SGML
</H1>

<ADDRESS>Daniel W. Connolly &lt;connolly@hal.com&gt;</ADDRESS>

<H2>SGML Entities in MIME</H2>

I believe that the goals of Mr. Levinson's <CITE><A HREF=3D"#r1">MIME
Content-types for SGML Documents</A></CITE> are essential to the
success of the Internet as an Integrated Open Hypermedia (cf. <A
HREF=3D"#HyTime">HyTime</A>) system. <P>

However, after a careful reading of <A HREF=3D"#SGML">the SGML
standard</A> (specifically section 6: Entity Structure), I believe
that <A HREF=3D"#r1">SGML/MIME</A> fails to specify the most
important machine representation of an SGML document; that is, the
SGML document entity (production 2, section 6.2). <P>

In conventional practice, an SGML document "doc" of type "T" is
represented as a file doc.sgml, which looks like:

<PRE>
&lt;!DOCTYPE T SYSTEM "t.dtd" [
&lt;!ENTITY fig1 SYSTEM "foo.ps" postscript&gt;
]&gt;
&lt;T&gt;blah blah blah &lt;figure graphic=3Dfig1&gt; blah blah blah&lt;/T=
&gt;
</PRE>

along with an SGML declaration in the file T.decl. Technically
speaking, the SGML document entity is the concatenation of T.decl and
doc.sgml (with perhaps some system-specific newline-&gt;RS/RE
conversions). But according to the standard it's OK to interchange
SGML documents with an implied SGML declaration, and in practice, the
SGML declaration is often compiled into the processing software. <P>

So for all intents and purposes, the file doc.sgml is an SGML document
entity. And it seems critical that the file doc.sgml should correspond
to the body of some MIME body part. <P>

<A HREF=3D"#r1">The draft</A> misuses the term "DTD", aka "Document Type
Definition" (defn 4.104). An SGML document indeed has three parts: the
SGML declaration, the <EM>prologue</EM>, and the instance. And the
distinction between the term prologue and the term DTD is not trivial. <P>

First, to be pedantic, a DTD is <EM>not</EM> generally representable
in SGML syntax. The concept of the DTD includes not only the
SGML-representable formal part, but also the associated application
conventions which cannot be represented in SGML. Also -- a document
may have more than one DTD in its prologue. <P>

Second, to be practical, the conventional machine representation of a
DTD is just an SGML text entity in the file t.dtd which looks like: <P>

<PRE>
&lt;!NOTATION postscript PUBLIC "-/Adobe/Postscript"&gt;
&lt;!ELEMENT T - - (#PCDATA) &gt;
..
</PRE>


Note that it does <EM>not</EM> correspond exactly to the prologue of
the document, in that it does not contain the <CODE>&lt;!DOCTYPE [ ...
]&gt;</CODE> markup. <P>

For these reasons, I suggest the following modifications to the
proposed MIME representation of SGML documents: <P>

1. We make the following correspondence between the terms of the SGML
standard and the MIME RFC:

<UL>
<LI>SGML notation =3D> MIME content-type
<LI> SGML SYSTEM identifier =3D> MIME Content-ID
<LI> SGML data entity (=3D notation, data)
		 =3D> MIME body part (=3Dcontent type, body)
<LI> SGML text entity (=3D sequence of characters)
		 =3D> body of MIME text/sgml body part (=3D seq of chars)
<LI> SGML document =3D> a MIME multipart/SGML body part
</UL>

2. We use text/sgml instead of application/sgml for the SGML "files",
since they are in general readable on teletype devices. <P>

3. We change the parameters of the multipart/SGML content type from

<PRE>
               sgml-part       :=3D "instance" / "declaration"
                                / "dtd" / "fosi" / extension-token

</PRE>

to

<PRE>
               sgml-part       :=3D "document" / "declaration"
                                / "dtd" / "fosi" / extension-token

</PRE>

where "document" is required, declaration is optional, and dtd is
actually redundant (since it's in the document entity) but useful,
since a MIME UA might want to know what kind of document it is without
parsing the document.

<H2>References</H2>

<OL>
<LI> =

<A NAME=3D"r1">
Network Working Group,
Internet Draft: MIME/SGML,
&lt;draft-levinson-sgml-01.txt&gt;,
E. Levinson,
Accurate Information Systems, Inc.,
January 17, 1993
</A>

<LI> =

<A NAME=3D"SGML">ISO 8879:1986, Information Processing: Text and Office Sy=
stems:
Standard Generalized Markup Language (SGML)
</A>

<LI><A NAME=3D"HyTime">ISO/IEC 10744 Information technology -- Hypermedia/=
Time-based
Structuring Language (HyTime)
</A>
</OL>

<H2>Editorial Note</H2>

I used HTML because it works to a certain extent, not because I think
it's exactly how Internet IOH should work. My comments on HTML
are still under development. See <A
HREF=3D"http://www-external.hal.com/~connolly/html-design.html">my
notebook on the design of an HTML successor</A>.

<H2>Production Note</H2>

This document is brought to you by the following tools:

<UL>
<LI> Lucid emacs
<LI> html-mode by Marc Andreessen
<LI> SGMLs
<LI> NCSA Mosaic
</UL>

</BODY>

------- =_aaaaaaaaaa1
Content-Type: text/x-sgml; charset="us-ascii"
Content-ID: <10024.761615492.5@ulua>
Content-Description: SGML document wrapper

<!DOCTYPE HTML SYSTEM "10024.761615492.6@ulua"
	-- PUBLIC "-//IETF/DRAFT/ietf-iiir-html-01" @@ -- [
<!-- $Id$ -->

<!ENTITY web-node SYSTEM "10024.761615492.4@ulua">
]>
<HTML>
&web-node;
</HTML>

------- =_aaaaaaaaaa1
Content-Type: text/x-sgml; charset="us-ascii"
Content-ID: <10024.761615492.6@ulua>
Content-Description: HTML dtd
Content-Transfer-Encoding: quoted-printable

<!-- Jul 1 93 -->
<!--    Regarding clause 6.1, SGML Document:

        [1] SGML document =3D SGML document entity,
            (SGML subdocument entity |
            SGML text entity | non-SGML data entity)*

        The role of SGML document entity is filled by this DTD,
        followed by the conventional HTML data stream.
-->

<!-- DTD definitions -->

<!ENTITY % heading "H1|H2|H3|H4|H5|H6" >
<!ENTITY % list " UL | OL | DIR | MENU ">
<!ENTITY % literal " XMP | LISTING ">

<!ENTITY % headelement
         " TITLE | NEXTID |ISINDEX" >

<!ENTITY % bodyelement
         "P | HR | %heading |
         %list | DL | ADDRESS | PRE | BLOCKQUOTE
        | %literal">

<!ENTITY % oldstyle "%headelement | %bodyelement | #PCDATA">

<!ENTITY % URL "CDATA"
        -- The term URL means a CDATA attribute
           whose value is a Uniform Resource Locator,
           as defined. (A URN may also be usable here when defined.)
        -->

<!ENTITY % linkattributes
        "NAME NMTOKEN #IMPLIED
        HREF %URL;  #IMPLIED
        REL CDATA #IMPLIED -- forward relationship type --
        REV CDATA #IMPLIED -- reversed relationship type
                              to referent data:

                                PARENT CHILD, SIBLING, NEXT, TOP,
                                DEFINITION, UPDATE, ORIGINAL etc. --

        URN CDATA #IMPLIED -- universal resource number --

        TITLE CDATA #IMPLIED -- advisory only --

        METHODS NAMES #IMPLIED -- supported public methods of the object:
                                        TEXTSEARCH, GET, HEAD, ... --

        ">


<!-- Document Element -->

<!ELEMENT HTML O O  (( HEAD | BODY | %oldstyle )*, PLAINTEXT?)>

<!ELEMENT HEAD - -  ( TITLE?  & ISINDEX?  & NEXTID?  & LINK*
                              & BASE?)>

<!ELEMENT TITLE - -  RCDATA
          -- The TITLE element is not considered part of the flow of text.
             It should be displayed, for example as the page header or
             window title.
          -->

<!ELEMENT ISINDEX - O EMPTY
          -- WWW clients should offer the option to perform a search on
             documents containing ISINDEX.
          -->

<!ELEMENT NEXTID - O EMPTY>
<!ATTLIST NEXTID N NAME #REQUIRED
          -- The number should be a name suitable for use
             for the ID of a new element. When used, the value
             has its numeric part incremented. EG Z67 becomes Z68
          -->
<!ELEMENT LINK - O EMPTY>
<!ATTLIST LINK
        %linkattributes>
        =

<!ELEMENT BASE - O EMPTY    -- Reference context for URLS -->
<!ATTLIST BASE

        HREF %URL; #IMPLIED

        >
<!ENTITY % inline "EM | TT | STRONG | B | I | U |
                        CODE | SAMP | KBD | KEY | VAR | DFN | CITE "
        >

<!ELEMENT (%inline;) - - (#PCDATA)>

<!ENTITY % text "#PCDATA | IMG | %inline;">

<!ENTITY % htext "A | %text"    -- Plus links, no structure -->

<!ENTITY % stext                -- as htext but also nested structure --
                        "P | HR | %list | DL | ADDRESS
                        | PRE | BLOCKQUOTE
                        | %literal | %htext">


<!ELEMENT BODY - -  (%bodyelement|%htext;)*>


<!ELEMENT A     - -  (%text)>
<!ATTLIST A
        %linkattributes;
        >

<!ELEMENT IMG    - O EMPTY --  Embedded image -->
<!ATTLIST IMG
        SRC %URL;  #IMPLIED     -- URL of document to embed --
        >


<!ELEMENT P     - O EMPTY -- separates paragraphs -->
<!ELEMENT HR    - O EMPTY -- horizontal rule -->

<!ELEMENT ( %heading )  - -  (%htext;)+>

<!ELEMENT DL    - -  (DT | DD | %stext;)*>
<!--    Content should match ((DT,(%htext;)+)+,(DD,(%stext;)+))
        But mixed content is messy.  -Dan Connolly
  -->

<!ELEMENT DT    - O EMPTY>
<!ELEMENT DD    - O EMPTY>

<!ELEMENT (UL|OL) - -  (%htext;|LI|P)+>
<!ELEMENT (DIR|MENU) - -  (%htext;|LI)+>
<!--    Content should match ((LI,(%htext;)+)+)
        But mixed content is messy.
  -->
<!ATTLIST (%list)
        COMPACT NAME #IMPLIED -- COMPACT, etc.--
        >

<!ELEMENT LI    - O EMPTY>

<!ELEMENT BLOCKQUOTE - - (%htext;|P)+
        -- for quoting some other source -->

<!ELEMENT ADDRESS - - (%htext;|P)+>

<!ELEMENT PRE - - (#PCDATA|%inline|A|P)+>
<!ATTLIST PRE
        WIDTH NUMBER #implied
        >

<!-- Mnemonic character entities. -->
<!ENTITY AElig "&#198;"  -- capital AE diphthong (ligature) -->
<!ENTITY Aacute "&#193;" -- capital A, acute accent -->
<!ENTITY Acirc "&#194;"  -- capital A, circumflex accent -->
<!ENTITY Agrave "&#192;" -- capital A, grave accent -->
<!ENTITY Aring "&#197;"  -- capital A, ring -->
<!ENTITY Atilde "&#195;" -- capital A, tilde -->
<!ENTITY Auml "&#196;"   -- capital A, dieresis or umlaut mark -->
<!ENTITY Ccedil "&#199;" -- capital C, cedilla -->
<!ENTITY ETH "&#208;"    -- capital Eth, Icelandic -->
<!ENTITY Eacute "&#201;" -- capital E, acute accent -->
<!ENTITY Ecirc "&#202;"  -- capital E, circumflex accent -->
<!ENTITY Egrave "&#200;" -- capital E, grave accent -->
<!ENTITY Euml "&#203;"   -- capital E, dieresis or umlaut mark -->
<!ENTITY Iacute "&#205;" -- capital I, acute accent -->
<!ENTITY Icirc "&#206;"  -- capital I, circumflex accent -->
<!ENTITY Igrave "&#204;" -- capital I, grave accent -->
<!ENTITY Iuml "&#207;"   -- capital I, dieresis or umlaut mark -->
<!ENTITY Ntilde "&#209;" -- capital N, tilde -->
<!ENTITY Oacute "&#211;" -- capital O, acute accent -->
<!ENTITY Ocirc "&#212;"  -- capital O, circumflex accent -->
<!ENTITY Ograve "&#210;" -- capital O, grave accent -->
<!ENTITY Oslash "&#216;" -- capital O, slash -->
<!ENTITY Otilde "&#213;" -- capital O, tilde -->
<!ENTITY Ouml "&#214;"   -- capital O, dieresis or umlaut mark -->
<!ENTITY THORN "&#222;"  -- capital THORN, Icelandic -->
<!ENTITY Uacute "&#218;" -- capital U, acute accent -->
<!ENTITY Ucirc "&#219;"  -- capital U, circumflex accent -->
<!ENTITY Ugrave "&#217;" -- capital U, grave accent -->
<!ENTITY Uuml "&#220;"   -- capital U, dieresis or umlaut mark -->
<!ENTITY Yacute "&#221;" -- capital Y, acute accent -->
<!ENTITY aacute "&#225;" -- small a, acute accent -->
<!ENTITY acirc "&#226;"  -- small a, circumflex accent -->
<!ENTITY aelig "&#230;"  -- small ae diphthong (ligature) -->
<!ENTITY agrave "&#224;" -- small a, grave accent -->
<!ENTITY amp "&#38;"     -- ampersand -->
<!ENTITY aring "&#229;"  -- small a, ring -->
<!ENTITY atilde "&#227;" -- small a, tilde -->
<!ENTITY auml "&#228;"   -- small a, dieresis or umlaut mark -->
<!ENTITY ccedil "&#231;" -- small c, cedilla -->
<!ENTITY eacute "&#233;" -- small e, acute accent -->
<!ENTITY ecirc "&#234;"  -- small e, circumflex accent -->
<!ENTITY egrave "&#232;" -- small e, grave accent -->
<!ENTITY eth "&#240;"    -- small eth, Icelandic -->
<!ENTITY euml "&#235;"   -- small e, dieresis or umlaut mark -->
<!ENTITY gt "&#62;"      -- greater than -->
<!ENTITY iacute "&#237;" -- small i, acute accent -->
<!ENTITY icirc "&#238;"  -- small i, circumflex accent -->
<!ENTITY igrave "&#236;" -- small i, grave accent -->
<!ENTITY iuml "&#239;"   -- small i, dieresis or umlaut mark -->
<!ENTITY lt "&#60;"      -- less than -->
<!ENTITY nbsp "&#32;"    --  should be NON_BREAKING space -->
<!ENTITY ntilde "&#241;" -- small n, tilde -->
<!ENTITY oacute "&#243;" -- small o, acute accent -->
<!ENTITY ocirc "&#244;"  -- small o, circumflex accent -->
<!ENTITY ograve "&#242;" -- small o, grave accent -->
<!ENTITY oslash "&#248;" -- small o, slash -->
<!ENTITY otilde "&#245;" -- small o, tilde -->
<!ENTITY ouml "&#246;"   -- small o, dieresis or umlaut mark -->
<!ENTITY szlig "&#223;"  -- small sharp s, German (sz ligature) -->
<!ENTITY thorn "&#254;"  -- small thorn, Icelandic -->
<!ENTITY uacute "&#250;" -- small u, acute accent -->
<!ENTITY ucirc "&#251;"  -- small u, circumflex accent -->
<!ENTITY ugrave "&#249;" -- small u, grave accent -->
<!ENTITY uuml "&#252;"   -- small u, dieresis or umlaut mark -->
<!ENTITY yacute "&#253;" -- small y, acute accent -->
<!ENTITY yuml "&#255;"   -- small y, dieresis or umlaut mark -->

<!-- deprecated elements -->

<!ELEMENT (%literal) - -  CDATA>

<!ELEMENT PLAINTEXT - O EMPTY>

<!-- Local Variables: -->
<!-- mode: sgml -->
<!-- compile-command: "sgmls -s -p " -->
<!-- end: -->

------- =_aaaaaaaaaa1
Content-Type: text/x-sgml; charset="us-ascii"
Content-ID: <10024.761615492.7@ulua>
Content-Description: HTML SGML declaration

<!SGML  "ISO 8879:1986"
--
        Document Type Definition for the HyperText Markup Language
        as used by the World Wide Web application (HTML DTD).

        NOTE: This is a definition of HTML with respect to
        SGML, and assumes an understanding of SGML terms.

        If you find bugs in this DTD or find it does not compile
        under some circumstances please mail www-bug@info.cern.ch
--

CHARSET
         BASESET  "ISO 646:1983//CHARSET
                   International Reference Version (IRV)//ESC 2/5 4/0"
         DESCSET  0   9   UNUSED
                  9   2   9
                  11  2   UNUSED
                  13  1   13
                  14  18  UNUSED
                  32  95  32
                  127 1   UNUSED
     BASESET   "ISO Registration Number 100//CHARSET
                ECMA-94 Right Part of Latin Alphabet Nr. 1//ESC 2/13 4/1"
     DESCSET   128 32 UNUSED
               160 95 32
               255  1 UNUSED


CAPACITY        SGMLREF
                TOTALCAP        150000
                GRPCAP          150000

SCOPE    DOCUMENT
SYNTAX
         SHUNCHAR CONTROLS 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
                           19 20 21 22 23 24 25 26 27 28 29 30 31 127 255
         BASESET  "ISO 646:1983//CHARSET
                   International Reference Version (IRV)//ESC 2/5 4/0"
         DESCSET  0 128 0
         FUNCTION RE          13
                  RS          10
                  SPACE       32
                  TAB SEPCHAR  9
         NAMING   LCNMSTRT ""
                  UCNMSTRT ""
                  LCNMCHAR ".-"
                  UCNMCHAR ".-"
                  NAMECASE GENERAL YES
                           ENTITY  NO
         DELIM    GENERAL  SGMLREF
                  SHORTREF SGMLREF
         NAMES    SGMLREF
         QUANTITY SGMLREF
                  NAMELEN  34
                  TAGLVL   100
                  LITLEN   1024
                  GRPGTCNT 150
                  GRPCNT   64

FEATURES
  MINIMIZE
    DATATAG  NO
    OMITTAG  NO
    RANK     NO
    SHORTTAG NO
  LINK
    SIMPLE   NO
    IMPLICIT NO
    EXPLICIT NO
  OTHER
    CONCUR   NO
    SUBDOC   NO
    FORMAL   YES
  APPINFO    NONE
>


------- =_aaaaaaaaaa1--

------- =_aaaaaaaaaa0--