MIME, SGML, UDIs, HTML and W3

timbl@zippy.lcs.mit.edu (Tim Berners-Lee)
Date: Thu, 11 Jun 92 12:22:56 -0400
From: timbl@zippy.lcs.mit.edu (Tim Berners-Lee)
Message-id: <9206111622.AA03819@zippy.lcs.mit.edu>
To: connolly@pixel.convex.com, enag@ifi.uio.no, www-talk@nxoc01.cern.ch
Subject: MIME, SGML, UDIs, HTML and W3
Cc: timbl@zippy.lcs.mit.edu

I have printed off the recent discussion on the new
HTTP, HTML, MIME and UDIs and done what I can
to disentangle it all in my mind.  I will reply
in one message, because many of the points are linked.
I know this should be hypertext, with references, but
(a) I am away from home and (b) we don't yet have a
universal mail/news archive server running to link to.

	HTTP and HTML

First of all, Jean-Francois <jfg@dxcern.cern.ch>
points out very properly that the enhanced HTTP
protocol and the enhanced HTML spec are quite
separate things, and should be specified separately.
I agree wholeheartedly about all this, and
I apologize for muddling the levels up till now.

(As a small aside, I would point out that whereas an
HTERR file is not very useful, an HTFWD file IS.
It is like a hypertext soft link. But I am happy to
leave that as a separate type of file. It should
certainly get a different extension so that it gets a
different icon.)

        HTTP: SGML vs ASN.1

Let's look at the HTTP protocol first. Carl <barker@cernnext.cern.ch>
is mapping out the requirements for this, and assuming that SGML
would be a reasonable representation for it in practice.
And so it is.  When the requirements are clear,
it would certainly be interesting to look at mapping them
onto a z39.50-style ASN.1 implementation. This would
be useful for two reasons. First, the comparison would
point out to us things in z39.50 which we might not have thought of
which would be useful for HTTP. Second, the comparison might give
a nice short, or at least well-defined, list of things which the WAIS
guys might like to take into account in the next version
of their protocol.  (I demoed W3 to Brewster, who hadn't
seen it live before, and he was very keen that WAIS and W3
should merge, changing the WAIS protocol if necessary.)

There is no reason why we shouldn't try both protocols.
If they map well onto each other, it's just a question
of having two separate parsers at the low level, building
the same internal structures.

When we're talking about an SGML representation,
and describing a file to come later down the link,
I don't think we have to use the NOTATION= attribute with a notation
type, because we won't in fact be talking about
the notation of an SGML element.
The format in this case is not something which the SGML
parser is aware of.

I must admit I was disappointed to learn that SGML
didn't allow for any way of including 8 bit data. Thanks Eric
<enag@ifi.uio.no> for your explanations.


	MIME and SGML

Dan <connolly@pixel.convex.com> rightly points out
the relevance of the coming MIME standards. There
are several things which we must separate here, though:

   1. The MIME classification of data formats
   2. The MIME format for multi-part messages
   3. The MIME format for rich text.
   4. The MIME format for external document addresses (MIME UDIs)

1. MIME classification of data formats

	We must do for MIME the same disentangling job which
	JF did for HTML.

	First of all, the MIME job of classifying data formats
	is a useful job which is ideally done by just one
	bunch of people. There has been some suggestion that
	the MIME classifications are not well enough defined,
	but they seem to be the best effort yet and one can only
	assume they will evolve in the right direction. So I'd
	back the use of these for W3.


2. The MIME format for multi-part messages

	This is necessary for sending a multi-part
	document over a mail link.  We have to ask ourselves
	whether it is reasonable to use over a binary link.
	Personally, my initial impression is that the MIME
	stuff, using as it does terminators such as
	--xxx-- separated by blank lines, looks more horrible
	to work with in this respect than SGML! Still we have
	the problem of restrictions on the content:
	Must not contain delimiters, limited 7 bit character set,
	line orientation, in fact all the things which email
	carries as a restriction.  This is really taking on board
	a legacy of all the mail which has evolved over the years.
	Do we need that for our new ultra-fast hypertext access
	protocol?
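
	The mail-inherited restrictions described above can be seen in a
	small sketch using Python's standard email package (a present-day
	library, not anything from this discussion). The boundary string
	"xxx" and the sample payloads are assumptions chosen for
	illustration only. The binary part has to be re-encoded (base64
	here) so that the whole message stays 7-bit and line-oriented.

```python
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_message(text, binary, boundary="xxx"):
    """Wrap one text part and one binary part in a multipart/mixed message.

    The boundary value is an illustrative assumption; real mailers pick a
    string that is unlikely to occur in the content.
    """
    msg = MIMEMultipart("mixed", boundary=boundary)
    msg.attach(MIMEText(text, "plain", "us-ascii"))
    # MIMEApplication base64-encodes its payload by default, which is how
    # arbitrary 8-bit data is squeezed through a 7-bit mail transport.
    msg.attach(MIMEApplication(binary))
    return msg


msg = build_message("Hello, W3\n", bytes(range(256)))
wire = msg.as_string()

# Parts are separated by "--xxx" lines and the message ends with "--xxx--".
print("--xxx--" in wire)
# Every character on the wire is 7-bit ASCII, whatever the payload was.
print(wire.isascii())
```

	Base64 turns every 3 bytes into 4 characters, so the binary part
	grows by about a third. That cost is what a binary link could avoid.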

	[Compare the MIME format with the rather cleaner NeXT
	Mail format which is as far as I understand simply
	a uuencoded compressed tar file of all the bits, where
	uuencoding is designed as an optimal way of getting over
	mail transport restrictions, compress does what it says
	and tar is a multipart wrapper designed for that only. Not
	standard outside unix, perhaps, but cleaner in that the
	mail formatting is done at the last minute and doesn't
	affect the other operations]

	Of course, with HTTP2, multipart/alternative shouldn't
	be needed.

  Multipart for hypertext?

	Now, Dan not only suggests the use of this for
	multipart messages, but also suggests that a hypertext
	document should necessarily contain many parts,
	one in SGML and one for each link as a MIME external document.
	This means that an SGML hypertext document can never stand
	on its own! An SGML parser will always need to have
	a MIME parser sitting just outside.  I don't like
	this: I feel we have to separate these two things.

	Suppose that an SGML document does want to
	be sent in a MIME message and does want to
	refer to other parts of that MIME message. In that case,
	it seems reasonable to have a format for that.
	However, when an SGML document is seen by itself, and
	refers to a news message for example, then there is
	no reason for it not to be able to contain a
	complete reference within itself.

	When SGML documents include other files, then
	the SYSTEM value is typically a file name.
	It is a reference to something outside. The
	precedent is set that SGML documents are allowed
	to refer to things outside.

	I think part of your objection, Dan, is based on
	a dislike of the UDI syntax -- which I'll come to later.
  
3. The MIME format for rich text.

	Here, I am not so impressed.  Basically, the MIME
	people are at the same level that we were before we started
	this cleanup, that they have SGML-LIKE stuff which isn't SGML.
	As it's not difficult to make it SGML, they should do that.
	Comparing MIME's rich text and HTML, I see that
	we lack the character formatting attributes BOLD and ITALIC
	but on the other hand I feel that our treatment of
	logical heading levels and other structures is much more powerful
	and has turned out to provide more flexible formatting
	on different platforms than explicit semi-references
	to font sizes.  This is borne out by all the systems which
	use named styles in preference to explicit formatting,
	LaTeX or other macros instead of TeX, etc etc.

	So technically, HTML has some things to give MIME's rich
	text. Are the MIME people still open to additions?
	If not, I would suggest we add BOLD and ITALIC (or
	two emphasis styles for characters), and keep HTML
	separate from MIME's rich text, proposing it as a
	MIME text standard.
	(HP0 and HP1 were in the HTML spec but were unimplemented.)
  
4. The MIME format for external document addresses (MIME UDIs)

	As Ed <emv@msen.com> says, this is a bit of a non-issue,
	as MIME addresses and current style UDIs map onto
	each other. However, we have to agree on a "concrete
	syntax" (or two... :-) in the end.

	It's like the difference between an x400 style mail address
	generated from an internet address, and that internet address.
	Which do you prefer

		timbl@zippy.lcs.mit.edu

	where the sections of the domain name are defined
	to have no semantics at all, or

		S=timbl; HO=zippy; OU=lcs; O=MIT; SECTOR=edu

	(this is not real x400 - don't use it!) or

		user=timbl
		host=zippy
		group=lcs
		organization=mit
		sector=education

	You say, Dan, that you "don't think [UDIs] work".
	Do you mean people don't use them in all correspondence?
	Well, what DO they use? They use ange-ftp addresses
	for FTP (like info.cern.ch:/pub/www/doc/*.ps),
	which are even more terse than UDIs! They use news
	message-ids which are UDIs.
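
	To illustrate how close the terse ange-ftp form already is to a
	UDI, here is a minimal sketch that rewrites one into the other.
	The function name and the "ftp" name-space prefix are assumptions
	made for the example; the exact prefix is whatever the UDI
	document specifies.

```python
def ange_ftp_to_udi(address, name_space="ftp"):
    """Rewrite a 'host:/path' address as 'name_space://host/path'.

    The name_space default is an assumption for illustration.
    """
    host, sep, path = address.partition(":")
    if not sep or not host or not path.startswith("/"):
        raise ValueError("expected host:/path, got %r" % address)
    # The only information added is the name space; host and path
    # are carried over untouched.
    return "%s://%s%s" % (name_space, host, path)


print(ange_ftp_to_udi("info.cern.ch:/pub/www/doc/*.ps"))
```

	The conversion adds nothing but the name space, which is the one
	thing the ange-ftp form leaves implicit.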

	Let me say that I personally don't much care about the
	arbitrary punctuation. There are a few things, though,
	which are important:

	-  The thing should be printable 7-bit ASCII.

	   Unlike arbitrary document formats,
	   UDIs must be sendable in the mail.

	- White space should not be significant. I would
	  accept the presence of some arbitrary white space
	  as a delimiter, but one cannot distinguish between
	  different forms and quantities of white space.
	  This is because things get wrapped and unwrapped.

	  Dan, you object to UDIs because they don't
	  contain white space. But that is purely so that
	  you CAN wrap them onto several lines and still
	  recover them.  You can put white space
	  in but it shouldn't mean anything. (This is not possible
	  in W3 as is but it is in the UDI document.)

	  I don't see why you say they
	  can't be put as an SGML attribute. They are just
	  text strings. They will be quoted of course.
	  (Yes, I know the old NeXT browser doesn't quote them.)
	  Is that not allowed? What are the problem characters?
	  If there are SGML problem characters in the UDI spec, they
	  probably are ruled out of SGML for a reason.

	  (I recently saw a galley proof of an article in which
	  our mail address had been hyphenated! UDIs must be
	  squeezable into 2 inch columns.)
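
	The point about insignificant white space can be sketched as
	follows. The helper names and the sample address are assumptions
	for illustration. Because a UDI contains no white space of its
	own, any white space introduced by wrapping can simply be thrown
	away to recover the original.

```python
def wrap_udi(udi, width):
    """Break a UDI into lines of at most `width` characters."""
    return "\n".join(udi[i:i + width] for i in range(0, len(udi), width))


def unwrap_udi(text):
    """Recover a UDI by discarding all white space, of any kind or amount."""
    return "".join(text.split())


udi = "http://info.cern.ch/hypertext/WWW/TheProject.html"
wrapped = wrap_udi(udi, 20)       # narrow enough for a 2 inch column
print(wrapped)
print(unwrap_udi(wrapped) == udi)
```

	The round trip only works because white space carries no meaning
	inside the UDI. If spaces were significant, a wrapped address could
	not be told apart from one that really contained them.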

	There is a semantic difference between a tagged
	list and a punctuation-divided set, and that is that
	the former has defined semantics but the latter doesn't and
	can therefore be extended more easily.  I suggest that tagging
	could be used for the four bits of an address
	that must be separable by all sides, which are
	limited in number (4). Within those bits, the string should
	be transparent as the protocol does not require
	every party to understand the innards. 

	The bits are

			MIME		Used by

	name space:	ACCESS		client
	server details:	HOST, PORT	client, protocol-dependent
	local doc id:	PATH		server only
	anchor id:	(none)		presentation application only

	It seems useful to maintain the ability to work out which
	bits are seen by whom.
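
	As a rough sketch of that separation, the four bits can be pulled
	apart with Python's standard urllib.parse module. This assumes the
	punctuation-divided syntax (name space, "://", host and port, path,
	"#" anchor); the type name, field names, and the "#intro" anchor
	are hypothetical.

```python
from typing import NamedTuple, Optional
from urllib.parse import urlsplit


class UdiParts(NamedTuple):
    name_space: str        # MIME: ACCESS     (used by client)
    host: Optional[str]    # MIME: HOST       (used by client)
    port: Optional[int]    # MIME: PORT       (used by client)
    local_doc_id: str      # MIME: PATH       (used by server only)
    anchor_id: str         # no MIME analogue (presentation only)


def split_udi(udi):
    """Split a punctuation-divided UDI into its four separable bits."""
    parts = urlsplit(udi)
    return UdiParts(parts.scheme, parts.hostname, parts.port,
                    parts.path, parts.fragment)


p = split_udi("http://info.cern.ch:80/hypertext/WWW/TheProject.html#intro")
print(p)
# The client picks the protocol and server from the first three fields,
# sends only local_doc_id to the server, and keeps anchor_id for itself.
print(p.local_doc_id)
```

	Nothing inside local_doc_id is interpreted by the client, which
	matches the requirement that each bit be transparent to the parties
	that do not need to understand it.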

	I only used punctuation to separate these parts in the W3 UDI
	because people like internet addresses and mail addresses
	and filenames and telephone numbers and message-ids and
	room numbers and zip codes which don't have tags and
	do make do with punctuation.  If the groundswell of
	opinion on this list is that tags are better, then
	let's use tags!

	Whatever we use, it should be as quotable in an SGML
	attribute as in a MIME external reference as in a
	scribbled note or a link-pasteboard or whatever.
	(The U is for Universal, NOT Unique!)

PHILOSOPHY

	In the W3 world, the model is of a dynamic world of
	documents which generally have some "home"
	(or several), which can be found using sufficient
	intelligence and the help of one's friends given the UDI.

	A mail message has no home, and so in principle the parts
	of it have no home. When a hypertext multipart message
	(really consisting of multiple hypertext documents)
	has links between its parts they refer to each other
	within a completely isolated context.

	There are now two possibilities when the message is in fact
	archived and made readable. One is we say that the parts
	are then addressed as parts of the message, wherever it
	may be. The other is to say that the parts of the message
	are very likely things which had some original home.
	In that case, the message is just giving the receiver
	a copy to save him the (perhaps insurmountable) trouble
	of retrieving it.  In this case the parts should be
	identified with their original UDIs so that the
	receiver is not confused with multiple documents which
	are in fact the same thing.
	

I think that's all the comments I have on what I've read so far.

	Tim
________________________________________________________________
Tim Berners-Lee
World-Wide Web initiative
CERN, 1211 Geneva 23, Switzerland        timbl@info.cern.ch
Visiting MIT: NE43-513, (617)234 6016    timbl@zippy.lcs.mit.edu