Re: MIME, SGML, UDIs, HTML and W3

Dan Connolly <connolly@pixel.convex.com>
Message-id: <9206120131.AA29502@pixel.convex.com>
To: timbl@zippy.lcs.mit.edu (Tim Berners-Lee)
Cc: enag@ifi.uio.no, www-talk@nxoc01.cern.ch
Subject: Re: MIME, SGML, UDIs, HTML and W3 
In-reply-to: Your message of "Thu, 11 Jun 92 12:22:56 EDT."
             <9206111622.AA03819@zippy.lcs.mit.edu> 
Date: Thu, 11 Jun 92 20:31:08 CDT
From: Dan Connolly <connolly@pixel.convex.com>

Now my comments on your comments:

>There is no reason why we shouldn't try both protocols.
>If they map well onto each other, it's just a question
>of having two separate parsers at the low level, building
>the same internal structures.
>
On the other hand, I'd like to keep a telnet based protocol
around -- maybe gopher is good enough.

>When we're talking about an SGML representation,
>and describe a file to come later down the link,
>I don't think we have to use the NOTATION= attribute with a notation
>type, because we won't in fact be talking about
>the notation of an SGML element.
>The format in this case is not something which the SGML
>parser is aware of.
>
I don't believe this is true. From the horse's mouth (Erik Naggum, that is):
----
|   What's the scoop? Do we have to use external entities for raw data?

Yes.  An external entity that is not an SGML text entity requires a
notation identifier, so you only need to list the entities in the DTD,
with notation, and refer to them by name in the document instance.

----

>1. MIME classification of data formats
>
>	 So I'd
>	back the use of these for W3.
>
Yeah!!

>
>2. The MIME format for multi-part messages
>
>	This is necessary for sending a multi-part
>	document over a mail link.  We have to ask ourselves
>	whether it is reasonable to use over a binary link.
>	Personally, my initial impression is that the MIME
>	stuff, using as it does terminators such as
>	--xxx-- separated by blank lines, looks more horrible
>	to work with in this respect than SGML!

The algorithm to separate a MIME multipart message into its
parts is simply: search the data stream for CRLF--boundary CRLF
(the last part is closed by CRLF--boundary--CRLF).
It can be done by a finite state machine. Even the simplest
SGML documents require a pushdown automaton to parse.
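
A minimal sketch of that boundary search, assuming parts are separated
by CRLF--boundary and the last one is closed by CRLF--boundary--. The
function name split_multipart is made up for illustration, and the
sketch does not check that the delimiter is followed only by white
space on its line, which a real parser should do.

```python
def split_multipart(body: bytes, boundary: bytes) -> list[bytes]:
    """Return the raw parts (headers + content) of a multipart body.

    Preamble and epilogue are ignored. Hypothetical helper, not a full parser.
    """
    delimiter = b"\r\n--" + boundary
    # Prefix a CRLF so a delimiter on the very first line is found too.
    data = b"\r\n" + body
    parts = []
    pos = data.find(delimiter)
    while pos != -1:
        start = pos + len(delimiter)
        if data.startswith(b"--", start):
            break  # closing delimiter: nothing after it is a part
        # Skip to the end of the delimiter line.
        line_end = data.find(b"\r\n", start)
        if line_end == -1:
            break
        next_pos = data.find(delimiter, line_end)
        if next_pos == -1:
            break  # unterminated part; a stricter parser might raise here
        parts.append(data[line_end + 2:next_pos])
        pos = next_pos
    return parts
```

The whole thing is a forward scan with no nesting state, which is the
point of the comparison with SGML.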

> Still we have
>	the problem of restrictions on the content:
>	Must not contain delimiters, limited 7 bit character set,
>	line orientation, in fact all the things which email
>	carries as a restriction.  This is really taking on board
>	a legacy of all the mail which has evolved over the years.
>	Do we need that for our new ultra-fast hypertext access
>	protocol?
>

No, we don't. MIME _allows_ transfer of data over 7 bit ASCII
channels, but it hardly requires it. The Content-transfer-encoding
can be:
	7bit (default): line oriented 7 bit data
	8bit: line oriented 8 bit data
	binary: raw 8 bit data, no CRLF's required
	base64: like uuencode, but standardized
	quoted-printable: text with escape sequences

The MIME standard explicitly supports expansion to 8 bit transport
mechanisms.
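
As a rough illustration of how little work the encodings impose on a
receiver, here is a sketch of a decoder that dispatches on the
Content-transfer-encoding value. The function name decode_body is an
assumption for this example; base64 and quopri are the Python standard
library modules. Note that 7bit, 8bit and binary need no decoding at
all, so a binary link pays nothing for them.

```python
import base64
import quopri


def decode_body(raw: bytes, encoding: str = "7bit") -> bytes:
    """Undo a Content-transfer-encoding. Hypothetical helper for illustration."""
    name = encoding.strip().lower()
    if name in ("7bit", "8bit", "binary"):
        # Identity encodings: the data is already in its final form.
        return raw
    if name == "base64":
        return base64.b64decode(raw)
    if name == "quoted-printable":
        return quopri.decodestring(raw)
    raise ValueError("unknown Content-transfer-encoding: " + encoding)
```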

>	[Compare the MIME format with the rather cleaner NeXT
>	Mail format which is as far as I understand simply
>	a uuencoded compressed tar file of all the bits, where
>	uuencoding is designed as an optimal way of getting over
>	mail transport restrictions, compress does what it says
>	and tar is a multipart wrapper designed for that only. Not
>	standard outside unix, perhaps, but cleaner in that the
>	mail formatting is done at the last minute and doesn't
>	affect the other operations]
>
It was a requirement of MIME that the structure of the document
be accessible without decoding or uncompressing data, especially
since MIME messages are recursive and complex messages might
otherwise go through more than one encoding.

Compression was not addressed by the MIME standard, and uuencode
doesn't make it through some gateways.

>	Of course, with HTTP2, multipart/alternative shouldn't
>	be needed.
>
What does HTTP2 define that obviates the multipart/alternative
type?


>  Multipart for hypertext?
>
>	Now, Dan not only suggests the use of this for
>	multipart messages, but also suggests that a hypertext
>	document should necessarily contain many parts,
>	one in SGML and one for each link as a MIME external document.
>	This means that an SGML hypertext document can never stand
>	on its own!

That's exactly the point. Anything besides text should be handled
as an external entity to be resolved by the parsing system. I just
suggested that a portable way to resolve SGML external entities
is to refer to MIME attachments.

> An SGML parser will always need to have
>	a MIME parser sitting just outside.  I don't like
>	this: I feel we have to separate these two things.
>
Well, it has to have something sitting outside. The SGML parsers
I've seen resolve system entities using the file system. I proposed
we use a MIME message like a mini file system, with links to
other file systems.
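
A minimal sketch of the "mini file system" idea, using Python's
standard email package. It assumes that each attachment is named by a
Content-ID header and that the SGML entity's system identifier is that
same string; the message above does not fix either convention, and the
names build_entity_table and resolve_entity are invented for the
example.

```python
from email import message_from_bytes


def build_entity_table(raw_message: bytes) -> dict[str, bytes]:
    """Map Content-ID -> decoded content for every leaf part of a message."""
    msg = message_from_bytes(raw_message)
    table = {}
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers hold no data of their own
        cid = part.get("Content-ID")
        if cid:
            # Assumption: the identifier is the Content-ID without <>.
            table[cid.strip().strip("<>")] = part.get_payload(decode=True)
    return table


def resolve_entity(table: dict[str, bytes], system_id: str) -> bytes:
    """Look up an external entity the way a parser would open a file."""
    return table[system_id]
```

The SGML parser itself never sees MIME; it only asks the entity
manager for a name, just as it would ask the file system.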

>	Suppose that an SGML document does want to
>	be sent in a MIME message and does want to
>	refer to other parts of that MIME message. In that case,
>	it seems reasonable to have a format for that.
>	However, when an SGML document is seen by itself, and
>	refers to a news message for example, then there is
>	no reason for it not to be able to contain a
>	complete reference within itself.
>
OK, I can see that we should be able to resolve the lexical
issues and put the whole UDI/MIME access specification inside
the SGML document.

But what about multimedia web nodes?

SGML describes text and references to other texts just fine.
But if we want a format that can include more than just
text, I don't think we should try to fit it _inside_ SGML.

I think SGML should be used to convey text and document
structure. But I still like the idea of wrapping it in
a MIME message for multimedia interoperability.


>3. The MIME format for rich text.
>
>	Here, I am not so impressed.
Nor am I.


>4. The MIME format for external document addresses (MIME UDIs)
>
>	As Ed <emv@msen.com> says, this is a bit of a non-issue,
>	as MIME addresses and current style UDIs map onto
>	each other. However, we have to agree on a "concrete
>	syntax" (or two... :-) in the end.
>
Exactly. And why not the MIME concrete syntax?

>	Let me say that I personally don't much care about the
>	arbitrary punctuation. There are a few things, though,
>	which are important:
>
>	-  The thing should be printable 7-bit ASCII.
>
MIME: check.

>	   Unlike arbitrary document formats,
>	   UDIs must be sendable in the mail
>
MIME: check.

>	- White space should not be significant. I would
>	  accept the presence of some arbitrary white space
>	  as a delimiter, but one cannot distinguish between
>	  different forms and quantities of white space.
>	  This is because things get wrapped and unwrapped.
>
MIME: check.

>	  Dan, you object to UDIs because they don't
>	  contain white space. But that is purely so that
>	  you CAN wrap them onto several lines and still
>	  recuperate them.  You can put white space
>	  in but it shouldn't mean anything. (This is not possible
>	  in W3 as is but it is in the UDI document)
>
I must not have read the UDI document closely. I certainly
got the impression that a UDI should look like one word
when "written on the back of an envelope."

>	  I don't see why you say they
>	  can't be put as an SGML attribute. They are just
>	  text strings.

The WAIS UDIs are huge. An SGML declaration defines a maximum
for the length of an attribute value. The default value is ...
oh. ahem. it's 960. I think the MIME 72 character line length
is a little more restrictive than that :-)

> They will be quoted of course
>	  (Yes, I know the old NeXT browser doesn't quote them)
>	  Is that not allowed? What are the problem characters?
>	  If there are SGML problem characters in the UDI spec, they
>	  probably are ruled out of SGML for a reason.
>
Good question. These are the things we should research before
we go _any_ further implementing this stuff.

>	Whatever we use, it should be as quotable in an SGML
>	attribute as in a MIME external reference as in a
>	scribbled note or a link-pasteboard or whatever.
>	(The U is for Universal, NOT Unique!)
>
Here's an idea for a quoting strategy for the four parts: Either
	a) it's a quoted string delimited by "" with \" allowed
	in the middle, or
	b) it's a base-64 representation of an arbitrary
	binary stream.
Just an idea.
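
A rough sketch of the two options, to make the idea concrete. All
four function names are invented for the example. One assumption goes
beyond what is proposed above: the backslash itself is also escaped
(as \\), since otherwise a value ending in a backslash could not be
round-tripped.

```python
import base64


def quote_field(value: str) -> str:
    """Option (a): wrap in double quotes, escaping \\ and \" inside."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def unquote_field(token: str) -> str:
    """Inverse of quote_field."""
    if len(token) < 2 or token[0] != '"' or token[-1] != '"':
        raise ValueError("not a quoted string")
    out = []
    chars = iter(token[1:-1])
    for ch in chars:
        if ch == "\\":
            # A backslash makes the next character literal.
            ch = next(chars, None)
            if ch is None:
                raise ValueError("dangling backslash")
        out.append(ch)
    return "".join(out)


def encode_binary_field(data: bytes) -> str:
    """Option (b): base64 text for an arbitrary binary stream."""
    return base64.b64encode(data).decode("ascii")


def decode_binary_field(token: str) -> bytes:
    """Inverse of encode_binary_field."""
    return base64.b64decode(token)
```

Either form is printable 7 bit ASCII; a long base64 value would still
need to be wrapped to fit line length limits.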

I'm late for an appointment. Gotta go.

Dan