Re: non-text documents

Dan Connolly <connolly@pixel.convex.com>

In-reply-to: Jim Davis: "non-text documents"

Message-id: <9210161929.AA21965@pixel.convex.com>
To: www-talk@nxoc01.cern.ch
Cc: Jim Davis <davis@dri.cornell.edu>
Subject: Re: non-text documents
In-reply-to: Your message of "Fri, 16 Oct 92 12:11:46 EDT."
             <199210161611.AA06348@willow.tc.cornell.edu> 
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="8<"
Date: Fri, 16 Oct 92 14:29:23 CDT
From: Dan Connolly <connolly@pixel.convex.com>


--8<

>Can you tell us when WWW and Viola will support non-text
>documents?  I would very much like to be able to return
>pictures including Postscript.  I know this has been discussed
>in the past.

That discussion in the past never really went anywhere. I'm going
over my suggestions, and I'd like to summarize them here and solicit
support.

If we use the semantics of the MIME body part to define a WWW
document, many of these issues have straightforward solutions, and
there is hope of interoperating with other MIME compliant systems.
For example, I also hope WAIS and gopher will use the MIME typing
system.

This means every document has an associated Content-Type and
Content-Transfer-Encoding. The MIME RFC defines content types for
text, image, audio, video, and application data, and two encapsulation
types, message and multipart. Plain text, GIF, PostScript, and several
other commonly used data formats are mentioned in the standard, and
there is a well-defined method for adding formats to MIME.

It does _not_ mean that all 8-bit data has to be base64 encoded. 8bit
is a perfectly valid Content-Transfer-Encoding, if you're using FTP or
WAIS or some other 8-bit-clean transport mechanism. But if you're
using email or HTTP, base64 solves the issue quite neatly.
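
A minimal sketch of that choice in Python, assuming the sender already
knows whether the transport is 8-bit clean. The names encode_body and
decode_body are hypothetical helpers, not part of any WWW or MIME
library, and the "8bit" label simply follows the wording above.

```python
import base64
from typing import Tuple


def encode_body(data: bytes, eight_bit_clean: bool) -> Tuple[str, bytes]:
    """Return (Content-Transfer-Encoding label, body bytes as sent)."""
    if eight_bit_clean:
        # FTP, WAIS and similar transports: send the bytes untouched.
        return "8bit", data
    # 7-bit transports: base64 keeps every output byte in ASCII.
    return "base64", base64.encodebytes(data)


def decode_body(encoding: str, body: bytes) -> bytes:
    """Invert encode_body on the receiving side."""
    if encoding == "base64":
        return base64.decodebytes(body)
    return body


gif_header = b"GIF89a\xff\x00"  # a few bytes that are not 7-bit safe
for clean in (True, False):
    label, wire = encode_body(gif_header, clean)
    assert decode_body(label, wire) == gif_header
```

Either way the receiver recovers the same bytes; only the label and the
on-the-wire form differ.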

Nor does it mean that all documents look like RFC-822 messages.
Message is just one _type_ of MIME body part. The text/plain type covers
plain text documents. Image/gif is good for lots of pictures.
Application/postscript covers printer-ready stuff. Unknown data can be
tagged application/octet-stream. And you can use types starting with
"x-" for experimental private types.


WWW should use typed links, either implicitly or explicitly. For example,
client software can infer the type of a link to a USENET news article
to be message/rfc822. But it should not assume that a link to a file
on an FTP server is text! That link should include the type of the
data so that the client software can process it intelligently. There's
already quite a bit of this going on in the WWW browser code. The MIME
semantics are just a way to formalize it with the possibility of
interoperating with other systems.
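
The rule above (use an explicit type when the link carries one, infer
it only where the access scheme makes it safe, and never guess "text")
can be sketched in Python as follows. The IMPLICIT_TYPES table and the
link_type function are illustrative assumptions, not existing browser
code; the fallback is the application/octet-stream type mentioned
earlier for unknown data.

```python
from typing import Optional
from urllib.parse import urlsplit

# Hypothetical table: schemes whose content type can be inferred safely.
IMPLICIT_TYPES = {"news": "message/rfc822"}


def link_type(url: str, declared: Optional[str] = None) -> str:
    """Return the MIME type a client should assume for a link."""
    if declared:
        # An explicit type on the link always wins.
        return declared.lower()
    scheme = urlsplit(url).scheme.lower()
    # No safe inference: treat the data as opaque rather than as text.
    return IMPLICIT_TYPES.get(scheme, "application/octet-stream")


assert link_type("news:alt.hypertext") == "message/rfc822"
assert link_type("ftp://cs.utexas.edu/contrib/x.tar.Z") == "application/octet-stream"
```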

For example, an anchor that points to a PostScript document should
look like:

<A href="http://info.cern.ch/hypertext/foo.ps"
type="application/postscript"> here is a postscript file.</A>

I wrote a proposal to recast the whole syntax of URLs so that the
above sample would look more like:

<A><HTTP host="info.cern.ch" path="hypertext/foo.ps"
         type="application" subtype="postscript">
here is a postscript file.</A>

The idea is that we've already got some sort of SGML parser: why not
let it do the work of parsing the URL too? We could use SGML features
for certain semantics; we could use the SGML numeric character entity
reference (&#123;) instead of the %FF quoting scheme, and we could
use attribute default values to infer type, subtype, and other
attributes.
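
To make the two quoting schemes concrete, here is a small Python
comparison. It leans on the standard library's HTML entity decoder,
which accepts the same &#N; syntax; to_char_refs is a hypothetical
helper and its choice of "safe" characters is an assumption made only
for this example.

```python
import html
from urllib.parse import quote, unquote


def to_char_refs(text: str, safe: str = "/._-") -> str:
    """Replace each character that is not ASCII-alphanumeric or in `safe` with &#N;."""
    return "".join(
        c if (c.isascii() and c.isalnum()) or c in safe else f"&#{ord(c)};"
        for c in text
    )


path = "pub/a{b}.txt"
percent_form = quote(path)          # hex escapes, as in the %FF scheme
entity_form = to_char_refs(path)    # decimal SGML character references

# Both forms decode back to the same path.
assert unquote(percent_form) == path
assert html.unescape(entity_form) == path
```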

I also think we should have optional identity information in the
links. This allows clients to determine whether two links point to the
same information. For example, someone might extend some FTP and WAIS
servers to return the MD5 signature of documents. Then a client could
conclude (with some very high probability) that

<ftp host="cs.utexas.edu" path="contrib" name="x.tar.Z" md5="abcdef">

and

<wais host="quake.think.com" database="software" path="x-2.1.tar.Z"
  md5="abcdef">

refer to the same release of the package.
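
A rough Python sketch of that comparison, assuming each link has
already been parsed into a mapping of attribute names to values. The
functions md5_signature and same_document are hypothetical, and the
payload bytes are a placeholder rather than a real tarball.

```python
import hashlib
from typing import Mapping


def md5_signature(data: bytes) -> str:
    """What a server might return alongside a document: its MD5 in hex."""
    return hashlib.md5(data).hexdigest()


def same_document(link_a: Mapping[str, str], link_b: Mapping[str, str]) -> bool:
    """True only when both links carry an md5 attribute and the values match."""
    a, b = link_a.get("md5"), link_b.get("md5")
    return a is not None and b is not None and a.lower() == b.lower()


payload = b"placeholder bytes standing in for the package"
digest = md5_signature(payload)

ftp_link = {"host": "cs.utexas.edu", "path": "contrib",
            "name": "x.tar.Z", "md5": digest}
wais_link = {"host": "quake.think.com", "database": "software",
             "path": "x-2.1.tar.Z", "md5": digest}

assert same_document(ftp_link, wais_link)
```

Because the identity information is optional, a link without an md5
attribute never matches anything; the client just falls back to
treating the two links as distinct.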

Thoughts?

Dan

p.s. I've attached the MIME RFC below. If you've got a MIME mail UA,
it should be able to bring up the text. If not, there's enough info
for you to ftp the thing manually.
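
For the curious, this is roughly what a MIME-aware reader has to do
with the attachment. The sketch uses Python's standard email package to
read the access parameters and build an FTP location from them; the
external_body_url function and the ftp:// URL form are assumptions for
illustration, not something the MIME RFC prescribes.

```python
from email.parser import HeaderParser

# Headers of the attached part, copied from this message.
EXTERNAL_BODY = (
    "Content-Description: RFC1341 MIME  (Multipurpose Internet Mail Extensions)\n"
    "Content-Type: message/external-body;\n"
    "\taccess-type=anon-ftp;\n"
    '\tsite="thumper.bellcore.com";\n'
    '\tdir="pub/nsb";\n'
    '\tname="BodyFormats.txt";\n'
    "\tsize=214081\n"
    "\n"
)


def external_body_url(header_text: str) -> str:
    """Turn an anon-ftp external-body part into an ftp:// location."""
    part = HeaderParser().parsestr(header_text)
    if part.get_content_type() != "message/external-body":
        raise ValueError("not an external-body part")
    if part.get_param("access-type") != "anon-ftp":
        raise ValueError("only anon-ftp is handled in this sketch")
    site = part.get_param("site")
    directory = part.get_param("dir")
    name = part.get_param("name")
    return f"ftp://{site}/{directory}/{name}"


url = external_body_url(EXTERNAL_BODY)
```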

--8<
Content-Description: RFC1341 MIME  (Multipurpose Internet Mail Extensions)
Content-Type: message/external-body;
	access-type=anon-ftp;
	site="thumper.bellcore.com";
	dir="pub/nsb";
	name="BodyFormats.txt";
	size=214081

Content-Type: text/plain


--8<--