thoughts on the future of HTML [long]

Dan Connolly <connolly@pixel.convex.com>

Mail folder: WWW Talk Jan-Mar 1993 Archives

Message-id: <9301212240.AA24439@pixel.convex.com>
To: www-talk@nxoc01.cern.ch
Subject: thoughts on the future of HTML [long]
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="cut-here"
Date: Thu, 21 Jan 93 16:40:54 CST
From: Dan Connolly <connolly@pixel.convex.com>

--cut-here

HTML was designed to be simple. Folks are supposed to be
able to whack out HTML with a text editor -- no rocket
science required.

Also, you ought to be able to use MS/Word or the equivalent
to write your documents or to view HTML documents.

But it's also designed to be processed by machine -- lots
of machines all over the planet.

Enter SGML.

It seemed like the natural choice, so Tim implemented an
informal SGML parser in his WWW clients.

Nobody really knew the ins and outs of SGML, so information
providers who wanted to produce HTML automatically just
checked to be sure the public WWW client grokked the result.

Then other folks tried to write HTML parsers. We discovered
that there were a lot of issues that were not covered by
any spec other than the WWW source code.

Then I tried to use the sgmls parser to develop an HTML
to FrameMaker tool. I discovered that the WWW source code
conflicted with the SGML standard. Uh oh!

By now I think we all agree that we should actually use SGML
to specify the syntax and structure of HTML.

But I wonder: on whom rests the responsibility for validating
HTML documents? This is really an HTTP issue: is it part
of the protocol that the data stream is _valid_ HTML? Or
is it the client's responsibility to deal with errors?

I suggest that it should be the responsibility of the _server_
to produce valid HTML. Of course the client should be robust
in the face of errors. But I suggest that when a client and
a server differ on their interpretation of a document, the
client is at fault if the document is valid, and the server
is at fault if the document is not.

It's too late to introduce this scenario into HTTP v0.9. But
future servers should have the burden of producing valid documents.
This will add complexity to the server code: it can no longer
just grab the contents of any old .html file and ship it
out the port. But it could, for example, fix the markup errors
on the fly and write error messages to a log file.
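
To make that concrete, here is a rough sketch of the kind of check
a server could run before shipping a file. It is an illustration
under assumptions, not real validation: it only looks at tag
balance using Python's standard html.parser module, it knows
nothing about the DTD, and it reports errors rather than fixing
them. The names BalanceChecker, check_before_send, and EMPTY_TAGS
are made up for this example.

```python
import logging
from html.parser import HTMLParser

log = logging.getLogger("htmlcheck")

# Tags treated as having no end tag. This set is an assumption made
# for the sketch; a real checker would take it from the DTD.
EMPTY_TAGS = {"p", "br", "hr", "img"}


class BalanceChecker(HTMLParser):
    """Collect start/end tag mismatches instead of raising on them."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.errors = []  # human-readable problem descriptions

    def handle_starttag(self, tag, attrs):
        # HTMLParser hands us tag names already lowercased.
        if tag not in EMPTY_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in EMPTY_TAGS:
            return
        if tag not in self.stack:
            self.errors.append(f"stray </{tag}>")
            return
        # Close everything opened after the matching start tag.
        while self.stack:
            open_tag = self.stack.pop()
            if open_tag == tag:
                break
            self.errors.append(f"<{open_tag}> not closed before </{tag}>")


def check_before_send(text):
    """Return the markup problems found in text, logging each one."""
    checker = BalanceChecker()
    checker.feed(text)
    checker.close()
    for tag in reversed(checker.stack):
        checker.errors.append(f"<{tag}> never closed")
    for problem in checker.errors:
        log.warning("markup error: %s", problem)
    return checker.errors
```

A server could call check_before_send() on the contents of a file
and either repair the document or refuse to send it when the
returned list is not empty.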

If a server knows the structure of the document it sends,
it should be able to send the document using SGML, ASN.1,
MIME, or whatever transport mechanism we choose. This is the
real value of standardizing on SGML: the syntax is one
thing, but we don't even have to use it! We have a DTD
that tells, in a more abstract way, what the content of
the document is.

With that in mind, I suggest we make HTML2 more prescriptive
than HTML. It should match the way documents are structured
and processed more than the way they are typed in a text editor.

For example, the following document is legal, but it's a pain
to process:

--cut-here
Content-Type: text/x-html

<html>
Here's the first paragraph. It's at the outer structural level.<P>

This is the <em>Second</em> paragraph.
<body>
Here's another paragraph.
<H1>Another one</h1>
The last paragraph.
</body>
</html>

--cut-here

Imagine you want to parse that document and answer queries
like: "show me the second paragraph of the document."

HTML isn't supposed to be too sophisticated, but it _is_
supposed to model typical word-processor documents fairly
well. A paragraph is a pretty fundamental chunk of information.
The definition of a paragraph in HTML is much more complex
than it need be.

Consider the following representation of essentially
the same document:

--cut-here
Content-Type: text/x-html

<html>
 <body>
  <A>Here's the first paragraph.</A>
  <a>This is the <em>Second</em> paragraph.</a>
  <a>Here's another paragraph.</a>
  <H1>Another one</h1>
  <a>The last paragraph.</a>
 </body>
</html>

--cut-here

The elements that make up the content of the BODY are
all paragraphs. Wouldn't it be a lot easier to write
a formatter for the latter type of document?
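
As a rough illustration of how little code the second form needs,
here is a sketch that answers the "second paragraph" query by
treating every direct child element of BODY as one block. It uses
Python's standard html.parser module; the names BodyBlocks,
nth_block, and DOC are invented for this example, and DOC is just
a copy of the second sample document above.

```python
from html.parser import HTMLParser


class BodyBlocks(HTMLParser):
    """Collect the text of each direct child element of BODY."""

    def __init__(self):
        super().__init__()
        self.path = []    # open elements, outermost first
        self.blocks = []  # one (tag, text) pair per child of BODY
        self.text = []    # text pieces of the block being read

    def handle_starttag(self, tag, attrs):
        if self.path[-1:] == ["body"]:
            self.text = []              # a new block starts here
        self.path.append(tag)

    def handle_data(self, data):
        # Keep text only while some element is open below BODY.
        if "body" in self.path[:-1]:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag not in self.path:
            return                      # stray end tag: ignore it
        while self.path.pop() != tag:
            pass                        # drop elements left open inside
        if self.path[-1:] == ["body"]:
            self.blocks.append((tag, "".join(self.text)))


def nth_block(document, n):
    """Return the text of the n-th (1-based) child element of BODY."""
    parser = BodyBlocks()
    parser.feed(document)
    parser.close()
    return parser.blocks[n - 1][1]


DOC = """<html>
 <body>
  <A>Here's the first paragraph.</A>
  <a>This is the <em>Second</em> paragraph.</a>
  <a>Here's another paragraph.</a>
  <H1>Another one</h1>
  <a>The last paragraph.</a>
 </body>
</html>"""

print(nth_block(DOC, 2))
```

The same query against the first form would have to guess where
paragraphs begin and end from stray text and <P> separators, which
is exactly the complexity I'd like to get rid of.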

The original HTML design was motivated by conventional use
of SGML, with shortrefs and other markup minimization features
to aid keyboarding of documents.

But Tim (wisely) didn't want to put those features in his
parser, so we ended up with a compromise: it's fairly
easy to keyboard, but it has virtually no structure.

so...

I suggest that future versions of HTML should have more structure.

How much structure? Enough. Enough to model whatever kinds of
documents make up a WWW node. About as much as a TeXinfo node, which
is pretty similar to a FrameMaker TextFlow, or an MS/Word section.
We should probably also model typical markup conventions of internet
mail and USENET news.

Then... the big step: HyTime. I _really_ think we should look at
HyTime architectural forms to model things like threads, webs,
hierarchies of documents, etc.

I think we could use HyTime mechanisms to form an abstraction
that models the structure of unix filesystems, message threads,
and other typical hypertext organizations on the internet.

This is how we should model "relative links." The unix ../../foo
syntax is fine as a model. But we should abstract the features from
that syntax, so that we can use the same model on VMS systems without
ad-hockery.
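
Here is a minimal sketch of what I mean by abstracting the
features from the syntax. It is not a HyTime construct, just an
illustration: a relative link becomes "climb N levels, then descend
through these names", and the unix spelling is only one way to
write that down. RelativeLink, from_unix, and resolve are names
made up for this example, and the base is assumed to be the
containing node given as a list of names from the root.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class RelativeLink:
    """A relative reference, independent of any path syntax."""
    up: int                 # levels to climb from the base node
    down: Tuple[str, ...]   # names to descend through after climbing


def from_unix(path: str) -> RelativeLink:
    """Read the unix spelling (e.g. ../../foo) into the abstract form."""
    up = 0
    down: List[str] = []
    for step in path.split("/"):
        if step in ("", "."):
            continue
        if step == "..":
            if down:
                down.pop()  # "a/../b" cancels within the link itself
            else:
                up += 1
        else:
            down.append(step)
    return RelativeLink(up, tuple(down))


def resolve(base: List[str], link: RelativeLink) -> List[str]:
    """Apply a link to a base node given as names from the root."""
    if link.up > len(base):
        raise ValueError("link climbs above the root")
    return base[: len(base) - link.up] + list(link.down)


link = from_unix("../../foo")
print(link)
print(resolve(["pub", "www", "doc"], link))
```

Only from_unix() knows about slashes and dots. A system with a
different file name syntax would supply its own reader and writer
for the same RelativeLink value.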

That syntax happens to work for most gopher holes too,
but that's cheating: the gopher "path" string is supposed to be opaque.
The URL spec says something about how servers that use hierarchical
databases should use the unix path syntax. [What a load of hooey!
Oh... Ahem... sorry.]

And the syntax of most WAIS URLs has _two_ unix paths in it. What
do you do with that?

I really think HyTime links and locs are a good way to model all this.
The connection between HyTime and SGML is incidental. The only reason
to use SGML markup is to _interchange_ information between HyTime
applications (...or to talk about HyTime constructs in email, or any
of the other things that text is convenient for.)

Standardizing HTML was one thing: it's only used in the WWW community.
But standardizing the WWW addressing architecture is a much larger
venture. I hope eventually the various IETF groups etc. will realize
that the HyTime community has thought about formal mechanisms to
name and reference information a lot, and the product of their labors,
HyTime, may have some technical merit as well as the weight [and
overhead...] of an international standard.

Dan

p.s. I thought I subscribed to the cni-arch mailing list where URL
stuff was supposed to happen. I don't remember getting anything from
that list for a long time. Is there any URL discussion going on?

--cut-here--