MIME for global hypertext

Dan Connolly <connolly@pixel.convex.com>
Message-id: <9206080349.AA00415@pixel.convex.com>
Subject: MIME for global hypertext
To: www-talk@nxoc01.cern.ch, wais-talk@think.com
Organization: Engineering, CONVEX Computer Corp., Richardson, Tx., USA
Date: Sun, 07 Jun 92 22:49:51 CDT
From: Dan Connolly <connolly@pixel.convex.com>

[This was posted to several newsgroups, but someone from wais-talk
suggested I forward it there also.]


The WAIS, gopher, and world-wide-web projects are all client/server
information retrieval systems. All three deliver plain text information
quite well, and they each have evolving mechanisms for delivering
other forms of information.

The MIME RFC defines a system for processing multi-part, multimedia
messages on the internet. I would like to see these systems, along
with USENET news and internet mail, interoperate with MIME as the substrate.

The clients for these systems go something like this:
0	user invokes client (and chooses a starting point)
1	client retrieves and displays the page the user requested
2	user reads page, chooses a reference to more info
3	user informs client of choice
		 (e.g. "show me item #1," or "search for googoo")
4	go to step 1
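
The loop above can be sketched in a few lines of Python. This is only
an illustration: the PAGES table stands in for a real server, and the
page names and the browse() helper are made up for the example.

```python
# Hypothetical in-memory "server": page name -> (text, list of reference names).
PAGES = {
    "start": ("Welcome. 1) news  2) docs", ["news", "docs"]),
    "news": ("Today's news. 1) back to start", ["start"]),
    "docs": ("Documentation (no further references).", []),
}

def browse(start, choices):
    """Run the loop for a scripted list of 1-based choices; return pages visited."""
    visited = []
    current = start                      # step 0: starting point
    for choice in list(choices) + [None]:
        text, refs = PAGES[current]      # step 1: fetch and display the page
        visited.append(current)
        print(text)
        if choice is None or not refs:   # nothing left to follow
            break
        current = refs[choice - 1]       # steps 2-3: user picks a reference
    return visited                       # step 4 is the loop itself

history = browse("start", [1, 1, 2])
```

A real client would read the choices from the keyboard or mouse instead
of a list, but the control flow is the same.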

These systems often consist of a hierarchy of menus with text files at
the leaf nodes. The system allows the user to interactively navigate
the menus and browse leaf nodes. But 1) the format of the menus is
particular to the system (USENET newsgroups/articles, unix
directories/files, WAIS source/database/document). And 2) once a user
is at a leaf node, the system can no longer interactively follow
references.

The novel aspect of hypertext is that the distinction between the
menu pages and the text pages disappears. In the world-wide-web,
text documents have machine-readable links inside them, and all
menus are represented as hypertext documents.

The WWW format works well, but it would benefit from use of MIME's
features.

For a common hypertext document format, I propose we define a
subtype of the MIME multipart message: X-HYPERTEXT. The first
part of a multipart/X-HYPERTEXT message is the content of
the document, and the remaining parts are multimedia attachments
and links to other documents.

The content part contains references (by Content-ID) to the
attachments and links. The client software allows the user
to interactively choose references to display/follow.
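
As a rough sketch of that layout, the following Python uses the standard
email package to build such a message and to look up a part by its
Content-ID. The helper names, the "[ref: ...]" notation in the content
part, and the example Content-ID are assumptions made for illustration;
the proposal does not fix any of them.

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

def make_hypertext(content, attachments):
    """Build a multipart/X-HYPERTEXT message.

    content     -- text of the document (becomes the first part)
    attachments -- list of (content_id, text) pairs; text-only for brevity
    """
    msg = MIMEMultipart("X-HYPERTEXT")
    msg.attach(MIMEText(content, "plain"))
    for content_id, text in attachments:
        part = MIMEText(text, "plain")
        part["Content-ID"] = f"<{content_id}>"
        msg.attach(part)
    return msg

def find_part(msg, content_id):
    """Return the attachment whose Content-ID matches, or None."""
    for part in msg.get_payload()[1:]:       # skip the content part
        if part["Content-ID"] == f"<{content_id}>":
            return part
    return None

doc = make_hypertext(
    "See the glossary [ref: glossary@example] for definitions.",
    [("glossary@example", "MIME: Multipurpose Internet Mail Extensions")],
)
print(doc.get_content_type())
print(find_part(doc, "glossary@example").get_payload())
```

Following a reference is then just a lookup: the client takes the
Content-ID the user picked and displays (or retrieves) the matching part.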

The remaining parts may be attached image/audio/video using
MIME's various types and transfer encodings (text attachments
would work too) or they may be references to information
accessible elsewhere using MIME's message/external-body type.
The parameters to the external-body content-type provide the
same information as WWW's Universal Document Identifier.
(MIME only defines ANON-FTP, FTP, TFTP, LOCAL-FILE and AFS.
The remaining access-types (WAIS, gopher, etc) would be
experimental (X-WAIS, X-GOPHER) until standardized.)
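
To make the correspondence concrete, here is a small Python sketch that
turns the parameters of an external-body header into a single address
string. The scheme://site/path output form and the SCHEMES table are
assumptions for illustration, not the WWW project's definition, and
experimental access-types are deliberately left unmapped.

```python
from email.message import Message

# Assumed mapping from MIME access-types to address schemes (illustrative only).
SCHEMES = {
    "anon-ftp": "ftp",
    "ftp": "ftp",
    "tftp": "tftp",
    "afs": "afs",
    "local-file": "file",
}

def external_body_address(content_type):
    """Turn a message/external-body Content-Type value into one address string."""
    holder = Message()
    holder["Content-Type"] = content_type
    if holder.get_content_type() != "message/external-body":
        raise ValueError("not a message/external-body header")
    params = dict(holder.get_params()[1:])   # drop the type/subtype entry
    scheme = SCHEMES[params["access-type"].lower()]
    pieces = (params.get("directory", ""), params["name"])
    path = "/".join(p.strip("/") for p in pieces if p)
    return f"{scheme}://{params.get('site', '')}/{path}"

address = external_body_address(
    'message/external-body; name="BodyFormats.ps"; '
    'site="thumper.bellcore.com"; access-type=ANON-FTP; '
    'directory="pub"; mode="image"'
)
print(address)
```

An access-type missing from the table (X-WAIS, say) raises KeyError here;
a real client would need a handler for each type it supports.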

The emerging standard for structured, platform-independent text
is SGML. The WWW project defines an SGML document type with
traditional elements (title, heading, paragraph, list) and
new hypertext elements (anchor). Soon it will have multimedia
elements (image, audio).

The current design places external document references (to files,
WWW servers, WAIS documents, gophers, etc.) inside the SGML as
attributes. There are lexical incompatibilities, and the design
is under strain. I suggest that we implement references as
SGML entities that identify message/external-body parts
by content-id.

Representing document content in SGML allows the same information
to be accessed using different user interface paradigms (e.g. dumb
terminals vs. curses style vs. x windows point-and-click).

Short of full SGML parsing, we could adopt the MIME text/richtext
format, with the addition of a <REF ID="xxx">...</REF> tag.
In fact, any representation that allows the user to interactively indicate
one of the attached body parts by content-id will do. For example,
plain text with one-line descriptions would do. The Andrew ez
data stream would also work, but only Andrew sites could parse it.
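
For the richtext-plus-REF variant, a client needs little more than a
pattern match to find the references. The Python below is a minimal
sketch that assumes REF tags are written exactly as shown above and are
not nested; the function names and the sample text are invented.

```python
import re

# Matches <REF ID="xxx">label</REF>; assumes no nesting and double-quoted IDs.
REF = re.compile(r'<REF ID="([^"]+)">(.*?)</REF>', re.DOTALL)

def list_references(richtext):
    """Return (content_id, label) pairs in document order."""
    return REF.findall(richtext)

def render_menu(richtext):
    """Replace each REF with its label plus a number the user can type."""
    counter = 0

    def numbered(match):
        nonlocal counter
        counter += 1
        return f"{match.group(2)} [{counter}]"

    return REF.sub(numbered, richtext)

sample = ('See <REF ID="fig1@host">the diagram</REF> or '
          '<REF ID="spec@host">the spec</REF>.')
print(render_menu(sample))
print(list_references(sample))
```

On a dumb terminal the user types the number; a point-and-click client
could highlight the label instead. Either way the choice maps back to a
content-id from list_references().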

This brings up the issue of format negotiation. No one format is
optimal for all information. Clients are likely to be able to process
information in several formats, and servers are likely to be able
to provide different representations.

The various formats can be enclosed in a MIME multipart/alternative
message. And rather than including the data for all formats in
the message, the data could be in message/external-body parts. The
client chooses the type of data it likes and retrieves the corresponding
external-body. This (modified) example from the MIME RFC may help explain:

MIME-Version: 1.0
Content-Type: multipart/alternative; boundary=42

--42
Content-Type: message/external-body;
	name="BodyFormats.ps";
	site="thumper.bellcore.com";
	access-type=ANON-FTP;
	directory="pub";
	mode="image"

Content-type: application/postscript

--42
Content-Type: message/external-body;
	name="/u/nsb/writing/rfcs/RFC-XXXX.ez";
	site="thumper.bellcore.com";
	access-type=AFS

Content-type: application/x-ez

--42
Content-Type: message/external-body;
	name="BodyFormats.txt";
	site="thumper.bellcore.com";
	access-type=ANON-FTP;
	directory="pub"

Content-type: text/plain

--42--

The client can choose between PostScript, ez, and plain text, and
retrieve the corresponding message body.
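
The client side of this negotiation might look like the Python sketch
below, which parses a copy of the example with the standard email
package and picks the first acceptable format. The choose() function
and the preference list are assumptions; actually fetching the chosen
body is left out.

```python
from email import message_from_string

MESSAGE = """\
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary=42

--42
Content-Type: message/external-body;
 name="BodyFormats.ps"; site="thumper.bellcore.com";
 access-type=ANON-FTP; directory="pub"; mode="image"

Content-type: application/postscript

--42
Content-Type: message/external-body;
 name="/u/nsb/writing/rfcs/RFC-XXXX.ez";
 site="thumper.bellcore.com"; access-type=AFS

Content-type: application/x-ez

--42
Content-Type: message/external-body;
 name="BodyFormats.txt"; site="thumper.bellcore.com";
 access-type=ANON-FTP; directory="pub"

Content-type: text/plain

--42--
"""

def choose(msg, preferred):
    """Return (type, access parameters) for the first preferred type on offer."""
    offers = {}
    for part in msg.get_payload():
        if part.get_content_type() != "message/external-body":
            continue
        inner = part.get_payload(0)            # headers describing the real body
        params = dict(part.get_params()[1:])   # skip the type/subtype entry
        offers[inner.get_content_type()] = params
    for ctype in preferred:
        if ctype in offers:
            return ctype, offers[ctype]
    return None

# A client that cannot display ez or PostScript but can show plain text.
choice = choose(message_from_string(MESSAGE), ["text/x-unknown", "text/plain"])
print(choice)
```

The order of the preference list belongs to the client, so the same
message serves a PostScript previewer and a dumb terminal alike.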


The question then becomes: how do these systems interoperate?
By making information available as multipart/X-HYPERTEXT MIME
messages.

The WWW client interfaces with the other systems by defining
"addressing schemes," implementing the various protocols,
and translating the data into HTML. Gopher has a similar
typing scheme -- one character is reserved to indicate
the access type and the data type. WAIS clients have yet
another method of resolving types, though they only support
one protocol. The NewsGrazer application has its own
encapsulation mechanism. This is becoming a mess.

In the short term, global hypertext viewers will have to support
the access-type and content-type of each system with which they
interoperate (so we have access-types X-WAIS, X-HTTP, X-GOPHER, and
X-NNTP, as well as content-types X-WAIS-SRC, X-HTML, and X-GOPHER-1
through X-GOPHER-9).

Some of the access types will become standard, and some will die out.
But all the data types should be encapsulated in MIME messages. Any
data that has machine-readable pointers to other data should be made
into a multipart/X-HYPERTEXT message. For example, a WAIS question
should have attachments for each of the result documents (the content
part can stay application/x-wais-question, or it could be converted to
a text type, or both), at least in the case where those documents are
available by some standard access method.  [I wrote a perl script that
will change an HTML document into a MIME message with attachments.]

Leaf documents, i.e. documents with no external links, can stay in
single-part types. For example, plain text files become MIME messages
by simply adding a blank line at the beginning (to separate the empty
header block from the body).
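
The blank-line trick is easy to check with Python's standard email
parser. This is a small sketch; text_file_as_mime() is a made-up name,
and the sample text is chosen so that its first line would be mistaken
for a header if the blank line were left out.

```python
from email import message_from_string

def text_file_as_mime(text):
    """Prefix a blank line (an empty header block) and parse the result."""
    return message_from_string("\n" + text)

plain = "Note: this first line looks like a header.\nBody text.\n"

wrapped = text_file_as_mime(plain)
print(wrapped.keys())              # header names found in the wrapped file
print(wrapped.get_content_type())  # type assumed when no Content-Type is given
print(wrapped.get_payload() == plain)

# Without the leading blank line, the parser reads "Note:" as a header field.
unwrapped = message_from_string(plain)
print(unwrapped.keys())
```

With no Content-Type header present, the parser falls back to the MIME
default of text/plain, which is exactly what a plain text file should be.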

Under this model, a mail message can point to a news article
which references a WAIS document which contains several drawings
and pointers to several more available by FTP, and a user could
just point-and-click between them. The only need for
protocols like gopher and HTTP is to encapsulate data that's not
already MIME compliant.

This is clearly a pipe dream, but it's the kind of thing we can work
towards today.

Dan