HTML DTD: past, present, future [was: HTML DTD and HyTime]

"Daniel W. Connolly" <connolly@hal.com>
Errors-To: listmaster@www0.cern.ch
Date: Tue, 15 Feb 1994 21:27:38 +0100
Message-id: <9402152009.AA04644@ulua.hal.com>
Reply-To: connolly@hal.com
Originator: www-talk@info.cern.ch
Sender: www-talk@www0.cern.ch
Precedence: bulk
From: "Daniel W. Connolly" <connolly@hal.com>
To: Multiple recipients of list <www-talk@www0.cern.ch>
Subject: HTML DTD: past, present, future [was: HTML DTD and HyTime]
X-Listprocessor-Version: 6.0c -- ListProcessor by Anastasios Kotsikonas

                                Toward a Formalism for Communication On the Web
                                          Daniel W. Connolly <connolly@hal.com>
                                                                               
         $Id: html_essay.html,v 1.2 1994/02/15 20:07:12 connolly Exp connolly $
                                                                               
Status

   I had hoped to polish this more before publishing it, but I can't seem to
   get caught up... there's so much new stuff all the time!
   
                SOME BACKGROUND ON SGML FOR THE WORLD-WIDE WEB
                                       
   In late 1992 and early 1993, I did quite a bit of work on the HTML DTD while
   I was working at Convex in the online documentation group.
   
   When I began, there was the LineMode browser and the NeXT implementation,
   and a few nodes in The Web describing HTML with some oblique references to
   SGML. I was not intimately familiar with SGML, but I was quite familiar with
   the problems of document interchange, and I was eager to apply some of my
   formal systems background to the problem.
   
On Formally Unconvertible Document Formats

   My experience with document interchange led me to classify document formats
   using the essential distinction that some are "programmable" and some are
   not. Most widely used source forms are programmable: TeX, troff, PostScript,
   and the like. On the other hand, there are several "static" formats: plain
   text, Microsoft RTF, FrameMaker MIF, and GNU's TeXinfo.
   
   The reason that this distinction is essential with respect to document
   interchange is that extracting information from documents in "programmable"
   document formats is equivalent to the halting problem. That is, it is
   arbitrarily difficult and cannot be automated in a general fashion.
   
   For example, I conjecture that it is impossible to write a program that will
   extract the third word from a TeX document. It would be an easy task for 80%
   of the TeX documents out there -- just skip over some formatting stuff and
   grab the third bunch of characters surrounded by whitespace. But that
   "formatting stuff" might be a program that generates 100 words from the
   hyphenation dictionary. So the simple lexical scan of the TeX source would
   find a word that is not the third word of the document when printed.
   
   This may seem like an obscure and unimportant problem, but I assure you that
   the problem of converting TeX tables to FrameMaker MIF is just as
   unsolvable.
   
   So while "programmable" document formats have the advantage that features
   can be added on a per-document basis, they suffer the disadvantage that
   these features cannot be recovered by the machine and translated in an
   automated fashion.
   
Document Formats as Communications Media

   If we look at document formats in light of the conventional
   sender/message/medium/receiver communications model, we see that document
   formats capture the message at various levels of "concreteness".
   
   The message begins as a collection of concepts and ideas in the mind of the
   sender. In order to communicate, the sender and receiver must share some
   language. That is, they must both understand some common set of symbols and
   the way those symbols combine to represent ideas. The sender's job is to
   express the message in terms of the common symbols and render -- or
   "present" -- them on the medium. The medium then stimulates the receiver
   to reconstruct the symbols in his/her brain -- that is, the
   receiver "interprets" or "recognizes" the symbols from the medium. Those
   symbols interact with other symbols in the receiver's brain, and the
   receiver "gets the message."
   
   The communications medium is often a layered combination of more and less
   concrete media. For example, folks first render their ideas in the symbology
   of the English language, and then render those symbols as sequences of
   spoken phonemes or written characters. Those written characters are in turn
   combinations of lines, curves, strokes, and points. The receiving folks then
   assemble the strokes into characters, the characters into words, the words
   into phrases, sentences, thoughts, ideas, and so on.
   
   The most common and ubiquitous document format, plain ASCII text, captures
   or digitizes messages at the level of written characters. PostScript
   captures the characters as lines, curves, and paths. The GIF format captures
   a document as an array of pixels. GIF is in many ways infinitely more
   expressive than plain text, which is limited to arrangements of the 96 ASCII
   characters.
   
   The RTF, TeX, nroff, etc. document formats provide very sophisticated
   automated techniques for authors of documents to express their ideas. It
   seems strange at first to see that plain text is still so widely used. It
   would seem that PostScript is the ultimate document format, in that its
   expressive capabilities include essentially anything that the human eye is
   capable of perceiving, and yet it is device-independent.
   
   And yet if we take a look at the task of interpreting data back into the
   ideas that they represent, we find that plain text is much to be preferred,
   since reading plain text is so much easier to automate than reading GIF
   files (optical character recognition) or PostScript documents (halting
   problem). In the end, while the source of various TeX or troff documents
   may correspond closely to the structure of the ideas of the author, and
   while PostScript allows the author very precise control and tremendous
   expressive capability, all these documents ultimately capture an image of a
   document for presentation to the human eye. They don't capture the original
   information as symbols that can be processed by machine.
   
   To put it another way, rendering ideas in PostScript is not going to help
   solve the problem of information overload -- it will only compound the
   situation.
   
   As a real world example, suppose you had a 5000 page document in PostScript,
   and you wanted to find a particular piece of information inside it. The
   author may have organized the document very well, but you'd have to print it
   to use those clues. If the characters aren't kerned much, you might be able
   to use grep or sic a WAIS indexing engine on it. Then, once you've found
   what looks like PostScript code for some relevant information, you'd pray
   that the document adheres to the Adobe Document Structuring Conventions so
   that you could pick out the page containing the information you need and
   view that page.
   
   If that's too perverse, look at the problem of navigating a large collection
   of technical papers coded in TeX. Many of the authors use LaTeX, and you may
   be able to convince the indexing engine to filter out common LaTeX
   formatting idioms -- or better yet, weight headings, abstracts, etc. more
   heavily than other sections based on the formatting idioms. While there are
   heuristic solutions to this problem that will work in the typical 80%/20%
   fashion, the general solution is once again equivalent to the halting
   problem; for example, individual documents might have bits of TeX
   programming that change the significance of words in a way that the indexing
   engine won't be able to understand.
   
SGML as a Layered Communications Medium

   So where does SGML fit into the sender/message/medium/receiver game?
   
   I'll use PostScript as a basis of comparison. The PostScript model consists
   of a fairly powerful and general purpose two dimensional imaging model, that
   is, a set of primitive symbols for specifying sets of points in two
   dimensions using handy computational techniques, and a general purpose
   programming model for building complex symbols out of those primitives. That
   model is applied extensively to the problem of typography, and there is an
   architecture (that is, a set of well known symbols derived from the
   primitives) for using and building fonts.
   
   So to communicate a message consisting of symbols from human communication
   in PostScript, one may choose from a well known set of typefaces, or create
   a new typeface using the well known font architecture, or free-hand draw
   some characters using PostScript primitives, or draw lines, boxes, circles
   and such using PostScript primitives, or scribble on a piece of paper, scan
   it, and convert the bits to use the PostScript image operator. The space of
   symbols is nearly limitless, as long as those symbols can be expressed
   ultimately as pixels on a page.
   
   The distinctive feature of PostScript (an advantage at times, and a
   disadvantage at others) is that whether you print it and deliver the paper
   or you deliver the PostScript and the receiver prints it out, the result is
   the same bunch of images.
   
   The SGML model, on the other hand, specifies no general purpose programming
   model where complex symbols can be defined in terms of primitive symbols.
   The meaning of a symbol is either found in the SGML standard itself, or in
   some PUBLIC document (which may or may not be machine readable), or in some
   SYSTEM specific manner, or defined by an SGML application. The only real
   primitives are the character and the "non-SGML data entity".
   
   The model prescribes that a document consist of a declaration, a prologue,
   and an instance. The declaration is expressed in ASCII and specifies the
   character sets and syntactic symbols used by the prologue and instance. The
   prologue is expressed in a standard language using the syntactic symbols
   from the declaration, and specifies a set of entities and a grammar of
   element types available to the instance.
   
   The instance is a sequence of elements, character data, and entities
   constrained by the grammar set forth in the prologue, and the SGML standard
   does not specify any semantics or meaning for the instance.
   
   So to communicate using SGML, the sender first chooses a character set and
   certain processing quantities and capacities. For example "I'm writing in
   ASCII, and I'll never use an element name more than 40 characters long" is
   some information that can be expressed in the SGML declaration. [The
   standard allows the SGML declaration to be implicitly agreed upon by sender
   and receiver, and this is generally the case].
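
   To make that concrete, here is a minimal sketch of how such a statement
   reads in an SGML declaration (not the actual WWW declaration; the clauses
   elided here are mandatory in a real one):

      <!SGML "ISO 8879:1986"
          -- CHARSET, CAPACITY, and SCOPE clauses elided --
          SYNTAX
              -- shunned characters, syntax character set, function
                 characters, naming and delimiter rules elided --
              QUANTITY SGMLREF
                  NAMELEN 40 -- no name longer than 40 characters --
          -- FEATURES and APPINFO clauses elided --
      >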
   
   The tricky part is the prologue, where the sender gives a grammar that
   constrains the structure of the document. Along with the information
   actually expressed in SGML in the prologue, there is usually some amount of
   application defined semantics attached to the element types. For example,
   the prologue may express in SGML that an H2 element must occur within the
   content of an H1 element. But the convention that text in an H1 is usually
   displayed larger and considered more important is application defined.
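
   A hypothetical pair of declarations expressing just that structural
   constraint (this is not how the real HTML DTD treats headings):

      <!ELEMENT H1 - - (#PCDATA | H2)*  -- H2 legal only within H1 -- >
      <!ELEMENT H2 - - (#PCDATA)        -- rendering is left to the
                                           application -- >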
   
   Once the prologue is determined (this usually involves considerable
   discussion between a collection of authors and consumers in some domain --
   in the end, there may be some "parameter entities" in the prologue which
   allow some variation on a per-document basis), the sender is constrained to
   a rigorous structure for the organization of the symbols and character data
   of the document. On the other hand, s/he has an automated technique for
   verifying that s/he has not violated the structure, and hence there is some
   confidence that the document can be consumed and processed by machine.
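
   For instance (a hypothetical fragment -- the entity name and element set
   are invented for illustration), given a prologue containing

      <!ENTITY % emphasis "EM | STRONG" >
      <!ELEMENT P - - (#PCDATA | %emphasis;)* >

   an individual document can widen the set by re-declaring the entity in
   its declaration subset, which an SGML parser reads before the external
   DTD:

      <!DOCTYPE HTML SYSTEM "html.dtd" [
          <!ENTITY % emphasis "EM | STRONG | CODE">
      ]>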
   
                  THE HTML DTD: CONFORMING, THOUGH EXPEDIENT
                                       
Design Constraints of the HTML DTD

   Tim's original conception of HTML is that it should be about as expressive
   as RTF. In contrast to traditional SGML applications where documents might
   be batch processed and complex structure is the norm, HTML documents are
   intended to be processed interactively. And the widespread success of
   WYSIWYG word processors based on fairly flat paragraph structure was proof
   that something like RTF was suitable for a fairly wide variety of tasks.
   
   As I learned a little about SGML, it was clear that the WWW browser
   implementation of HTML sorely lacked anything resembling an SGML entity
   manager. And there were some syntactic inconsistencies with the SGML
   standard. And it didn't use the ID/IDREF feature where it should have...
   
   Then, as I began to comprehend SGML with all its warts, (whose idea was it
   to attach the significance of a newline character to the phase of the moon
   anyway?) I was less gung-ho about declaring all the HTML out there to be
   blasphemy to the One True SGML Way.
   
   Thus the battle I chose was to find some formal relationship between the
   SGML standard and the HTML that was "out there." The quest was:
   
  FIND SOME DTD SUCH THAT THE VAST MAJORITY OF HTML DOCUMENTS ARE INSTANCES OF
  THAT DTD AND, CONVERSELY, SUCH THAT ALL ITS INSTANCES MAKE SENSE TO THE
  EXISTING WWW CLIENTS.
  
   I struggled mightily with such issues as:
   
      Should we be sticking <!DOCTYPE HTML SYSTEM> in .html files? What if
      somebody puts an entity declaration in there? (And does that mean that
      WWW clients have to be able to parse SGML prologues in general? See the
      sketch after this list.)
      
      What's the syntax of an attribute value? If we allow SHORTTAG YES, does
      that mean we have to parse <em/this/ style of markup too?
      
      Can we put some short reference maps in the DTD that will cause real SGML
      parsers and current WWW browsers to do the same thing w.r.t. newlines?
      (i.e. can we make all that phase-of-the-moon processing with newlines a
      moot issue?)
      
      What about marked sections? Short reference maps?
      
      What character set should we be using? How do I express ISO-Latin-1 in
      the SGML declaration? How should authors express the '<' character? How
      should this be expressed in the DTD?

      How do you put quotes in an attribute value literal?

      How can I deal with the current paragraph element idioms without using
      minimization?

      Can I stick base64 encoded stuff in a CDATA element? Do I have to watch
      out for <'s and such?

      How do we combine SGML and multimedia data in the same data stream?
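
   (To make two of those concrete -- a sketch, with the entity invented for
   illustration:

      <!DOCTYPE HTML SYSTEM [
          <!ENTITY disclaimer "Opinions are mine alone.">
      ]>

   is a legal prologue that a strict client would have to cope with, and
   under SHORTTAG YES the null-end-tag form <em/this/ means the same as
   <em>this</em>.)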
      
   I found solutions to some problems, and punted on others. I probably should
   have put more comments in the DTD regarding the compromises. But I wanted to
   keep the DTD stripped down to the normative information and keep the
   informative information in other documents.
   
   I did, by the way, draft a series of 4 or 5 documents demonstrating various
   structural and syntactic features of SGML -- a sort of validation suite. I'm
   not sure where it went.
   
   I'd like to respond to Elliot Kimber's critique of the HTML DTD that I
   posted.
   

>At the bottom of this posting is a slightly modified copy of the
>HTML DTD that conforms to the HyTime standard.  I have not modified
>the elements or content models in any way.  I have not added any
>new elements.  I have only added to the attribute lists of a few
>elements.
>
>The biggest change I made was to the way URL addresses are handled.
>In order to use HyTime (as opposed to application-specific)
>methods for doing addressing, I had to change the URL address
>from a direct reference into an entity reference where the
>entity's system identifier is its URL address.

   I suggested this long ago, but Tim shot the idea down. As I recall, he said
   that all that extra markup was a waste. On the one hand, I agree with him --
   the purpose of a language is to be able to express common idioms succinctly,
   and SGML/HyTime are poor in that respect. On the other hand, once you've
   chosen SGML, you might as well do as the Romans do.
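
   To make the contrast concrete (a sketch -- the URL is only illustrative),
   the current direct form

      <A HREF="http://info.cern.ch/hypertext/WWW/TheProject.html">the project</A>

   becomes, under Kimber's scheme, an entity declaration plus a reference to
   the entity's name:

      <!ENTITY project SYSTEM
               "http://info.cern.ch/hypertext/WWW/TheProject.html" SUBDOC>
      ...
      <A HREF="project">the project</A>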
   

>  This makes
>the link elements conform to the architectural forms and puts
>in enough indirection to allow other addressing methods to
>be used to locate the objects without having to modify the
>links, only the entity declarations.

   Why is it easier to modify entity declarations than links? Six of one,
   half-dozen of the other if you ask me.
   

>  I use SUBDOC entities
>for referring to other complete documents, although I'm not
>sure this is the best thing, but there's no other construct in
>SGML that works as well.  Note that nowhere in 8879 does it
>define what must happen as the result of a SUBDOC reference,
>except that a new parsing context is established.  The actual
>result of a SUBDOC reference is a matter of style and presumably
>in a WWW context it would result in the retrieval of the document
>and its presentation in a separate window.  The key is that
>the subdoc reference establishes a specific relationship between
>the source of the link and the target, namely one document
>referring to another.  The target document could also be defined
>as a data entity with whatever notation is appropriate (possibly
>even SGML if it's another SGML document).  This may be the better
>approach, I don't know.

   I don't expect that the data entity/subdocument entity distinction matters
   one hill of beans to contemporary WWW clients. I'm interested to know if it
   means anything to HyTime engines.
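
   For reference, the two flavors of declaration look like this (a sketch;
   the names and the notation declaration are invented):

      <!NOTATION gif SYSTEM "GIF image format" >
      <!ENTITY chap2   SYSTEM "http://host/chap2.html" SUBDOC    >
      <!ENTITY diagram SYSTEM "http://host/fig1.gif"   NDATA gif >

   A SUBDOC entity is a complete SGML document with its own prologue; an
   NDATA entity is non-SGML data in a declared notation, to be handed off to
   the application.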
   

>If I were re-designing the HTML, I would add direct support
>for HyTime location ladders using at a minimum the nameloc,
>notloc, and dataloc addressing elements.  However, if these
>elements are needed for interchange they could be generated
>from the information contained in WWW documents using the
>DTD below, so it's not critical.
>

   Could you expand on that? If we'll be "generating" compliant SGML for
   interchange, we might as well use TeXinfo or something practical like that
   for application-specific purposes.
   

>This is just one attempt at applying HyTime to the HTML.
>I'm sure there are other equally-valid (or more valid)
>ways it could be done.  Given the current functionality
>of the WWW, I'm sure there are ways to express that functionality
>using HyTime constructs.  HyTime constructs may also suggest
>useful ways to extend the WWW functionality, who knows.

   I finally got to actually read the HyTime standard the other day, and the
   clink and noteloc forms looked most useful. I'm also interested in
   expressing some of the "relative link" idioms used in HTML. (e.g. how would
   we express HREF="../foo/bar.html#zabc" using HyTime? The object of the game
   is to do it in such a way that the markup can be copied verbatim from one
   system to another (say unix to VMS) and have the right meaning)
   

><!ENTITY % URL "CDATA"
>        -- The term URL means a CDATA attribute
>           whose value is a Universal Resource Locator,
>           as defined in ftp://info.cern.ch/pub/www/doc/url3.txt
>        -->
><!--=====================================================================
>    WEK:  I have defined URL addresses as a notation so that they can
>          be then used in a notloc element.
>    =====================================================================-->
><!NOTATION url PUBLIC "-//WWW//NOTATION URL/Universal Resource Locator
>                             /'ftp: info.cern.ch/pub/www/doc/url3.txt'
>                             //EN"
>>

   Cool, good idea.
   

>
><!ENTITY % linkattributes
>        "NAME NMTOKEN #IMPLIED
>        HREF ENTITY #IMPLIED
>
> --=== WEK =======================================================
>
>      HREF is now an entity attribute rather than containing a
>      URL address directly.  To create a link using a URL address,
>      declare a SUBDOC or data entity and make the system
>      identifier the URL address of the object:
>
>      <!ENTITY  mydoc SYSTEM "URL address of document " SUBDOC >
>
>      This indirection gives two things:
>
>      1. A way to protect links in the source from changes in the
>         location of a document since the physical address is only
>         specified once.

   Ah... now I get it... in case you have lots of links to mydoc or parts of
   mydoc, you only have one place that defines where mydoc is. Nifty.
   

>
>      2. An opportunity to use other addressing methods, including
>         possibly replacing the URL with an ISO formal public
>         identifier.
>    =================================================================-->
>
>        TYPE NAME #IMPLIED -- type of relationship to referent data:
>                                PARENT, CHILD, SIBLING, NEXT, TOP,
>                                 DEFINITION, UPDATE, ORIGINAL etc. --
>        URN CDATA #IMPLIED -- universal resource number. unique doc id --
>        TITLE CDATA #IMPLIED -- advisory only --
>        METHODS NAMES #IMPLIED -- supported methods of the object:
>                                        TEXTSEARCH, GET, HEAD, ... --
>        -- WEK: --
>        LINKENDS  NAMES #IMPLIED
>          -- Linkends takes one or more NAME= values for local links--
>        HyNames  CDATA #FIXED 'TYPE ANCHROLE URN DOCORSUB'
>        ">

   I thought the ANCHROLEs of a clink were defined by HyTime to be REFsomething
   and REFSUB. Or are those just defaults? Also... does the HyNames thing work
   locally like this? What a HACK!
   

>
><!--=== WEK ==========================
>
>    The HyNames= attribute maps the local attribute names to their
>    corresponding HyTime forms.
>
>    The Methods= attribute is a bit of a puzzle since it is really
>    a part of the hyperlink presentation/processing style, not
>    a property of the anchors, but there's nothing wrong with
>    having application-specific stuff in your HyTime application.

   The Methods= attribute has been stricken :-(. It was motivated by the
   observation that textsearch interactions in WWW go like this:
   
      Doc A says "click here[23] to see the index"
      
      user clicks
      
      client fetches link 23, "http://host/index"
      
      displays "cover page" document
      
      user enters FIND abc
      
      client fetches "http://host/index?abc"
      
      search results are displayed
      
   Whereas in gopher, you get to save a step if you like:
   
      Doc A says "click here[23] to search the index"
      
      user clicks
      
      client displays "enter search words here: " dialog
      
      user enters FIND abc
      
      client fetches "http://host/index?abc"
      
      search results are displayed
      
   So to specify the latter, you would create a link with Methods=textsearch.
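
   In other words, with the stricken attribute, a link like this sketch (the
   host name is invented) could have told the client to put up the search
   dialog straight away:

      <A HREF="http://host/index" METHODS="TEXTSEARCH">search the index</A>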
   

>    I added LinkEnds= so that the various linking elements will
>    completely conform to the clink and ilink forms.  The presence
>    of the LinkEnds= attribute does not imply required support
>    for this type of linking, but it does make HTML more consistent
>    with other DTDs that do use the LinkEnds= attribute form.
>
>    Note that 10744 shows the attribute name for the ILINK form
>    to be 'linkend', not 'linkends'.  I consider this to be a
>    typo, as there's no logical reason to disallow multiple anchors
>    from a clink and lack of it puts an undue requirement of
>    specifying otherwise unneeded nameloc elements.  In any case,
>    an application can transform linkends= to linkend= plus a
>    nameloc, so it doesn't matter in practice.

   Are there any HyTime implementations out there? Do they use 'linkend' or
   'linkends'? It's hard to believe that HyTime became a standard without a
   proof-of-concept implementation.
   

>
><!ELEMENT P     - O EMPTY -- separates paragraphs -->
><!--=== WEK ==========================================================
>
>    Design note:  This seems like a clumsy way to structure information.
>                  One would expect paragraphs to be containers.
>
>    ==================================================================-->

   Yeah, well, try implementing end tag inference in <1000 or so lines of code.
   Maybe we'll get it right next time...
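
   That is, extant markup reads like this sketch:

      <H1>Some heading</H1>
      First paragraph...
      <P>
      Second paragraph...

   Declaring P as an EMPTY separator lets a small parser take such text
   verbatim; a containing P element would force the parser to infer where
   each missing </P> belongs.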
   

><!ELEMENT DL    - -  (DT | DD | P | %hypertext;)*>
><!--    Content should match ((DT,(%hypertext;)+)+,(DD,(%hypertext;)+))
>        But mixed content is messy.
>  -->
><!--=== WEK ============================================================
>
>    Design note:  This content should be:
>
>    <!ELEMENT DL  - - (DT+, DD)+ >
>    <!ELEMENT (DT | DD) - O (%hypertext;)* >
>
>    There's no reason for DT and DD to be empty.  Perhaps there was
>    some confusion about the problems with mixed content?  There are
>    none here.
>
>    These comments apply to the other list elements as well.
>
>    ====================================================================-->

   The problem is that DL, DT, DD, UL, OL, and LI were marked up in extant HTML
   documents as if minimization were supported. But I didn't want to introduce
   minimization into the implementation, so I made the DT, DD, and LI elements
   empty.
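
   That is, extant documents read like this sketch (the terms are invented):

      <DL>
      <DT>SGML
      <DD>the standard
      <DT>HTML
      <DD>an application of the standard
      </DL>

   With DT, DD, and LI declared EMPTY, that parses as-is; under the container
   model Kimber suggests, it parses only if the browser infers the missing
   end tags.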
   
   It's possible I'm confused about mixed content, but the way I understand it,
   you don't want to use mixed content except in repeatable or-groups, because
   authors will stick whitespace in where it is meant to be ignored but it
   won't be.
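
   A sketch of the hazard as I understand the record-end rules (assuming
   %hypertext; includes #PCDATA): under a sequence-group model like the one
   in the comment quoted above,

      <!ELEMENT DL - - ((DT, (%hypertext;)+)+, (DD, (%hypertext;)+)) >

   the newlines an author types between entries are potentially data --
   subject to delicate record-end handling -- rather than plain ignorable
   separators, so parsers can disagree about the phase of the moon. A
   repeatable or-group like (DT | DD | #PCDATA)* at least keeps the
   treatment uniform.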
   

>
><!-- Character entities omitted.  These should be separate from
>     the main DTD so specific applications can define their values.
>     ISO entity sets could be used for this.
>  -->

   Another point I should have explained in the DTD: the WWW application
   specifies that HTML uses the Latin-1 character set, and that the Ouml entity
   represents exactly that character from the Latin-1 character set and not
   some system specific thingy. Translation to system character sets is done
   outside of the SGML parser.
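
   In other words, the entity declaration amounts to this sketch (the real
   entity set may be expressed by some other mechanism):

      <!ENTITY Ouml "&#214;" -- capital O with umlaut: code
                                position 214 in ISO Latin-1 -->

   and mapping position 214 to whatever the local system uses happens after
   parsing, not by redefining the entity.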
