MIME as a hypertext architecture

Dan Connolly <connolly@pixel.convex.com>
Message-id: <9206060553.AA23369@pixel.convex.com>
To: www-talk@nxoc01.cern.ch
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary=-
Subject: MIME as a hypertext architecture
Date: Sat, 06 Jun 92 00:53:20 CDT
From: Dan Connolly <connolly@pixel.convex.com>
NOTE: This message uses existing and proposed MIME structuring
conventions. Some parts of it may look strange on pre-MIME viewers.

---

The WWW project needs an architecture for interchange of structured
multimedia hypertext documents. The original architecture, HTML,
introduced some structuring conventions and a way of specifying
hypertext links.

The HTML format is under stress from several issues:
	* We need an SGML DTD so that we can parse HTML using
	something besides the public implementation of WWW, and so that
	we can verify documents converted from other authoring
	systems such as GNU info, Andrew's EZ, or FrameMaker.

	* We need to be able to distribute documents and document
	elements in other formats, including raw 8 bit data streams.
	The SGML NOTATION feature falls short of providing an
	adequate mechanism.

	* The UDI syntax doesn't match the SGML attribute syntax.
	There are problems with quoting out-of-band characters, and
	the length of complex UDIs may exceed SGML limits and/or
	line-length limits of transport mechanisms. Also, the
	terse syntax of UDIs conflicts with the goal that they
	be human-readable.

This is a proposed architecture for global hypertext, addressing
the issues raised by the WWW project, but using the MIME architecture.

We define a new subtype of the MIME multipart content type called
x-HTDOC. The syntax is the same as multipart/mixed, but the semantics
are those of a WWW client: the first part is displayed, and the rest
represent links to other documents or other elements of this document.
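
A rough sketch of those semantics, for illustration only: the Python
below leans on the standard email package to split an x-HTDOC message
into the part to display and the parts that stand for links. The name
split_htdoc and the SAMPLE message are made up for this example; no
existing WWW client is implied to work this way.

```python
from email import message_from_string

# A cut-down message in the proposed format (hypothetical sample data).
SAMPLE = (
    "Content-Type: multipart/x-HTDOC; boundary=cut-here\n"
    "\n"
    "--cut-here\n"
    "Content-Type: text/x-HTML\n"
    "\n"
    "<TITLE>Demo</TITLE>\n"
    "--cut-here\n"
    "Content-id: QuickGuide.html\n"
    "Content-type: message/external-body\n"
    "\t;access-type=x-relative\n"
    "\t;name=\"QuickGuide.html\"\n"
    "\n"
    "Content-Type: message\n"
    "\n"
    "--cut-here--\n"
)


def split_htdoc(text):
    """Split an x-HTDOC message into (displayed part, list of link parts)."""
    msg = message_from_string(text)
    # get_content_type() returns the type in lower case.
    if msg.get_content_type() != "multipart/x-htdoc" or not msg.is_multipart():
        raise ValueError("not a multipart/x-HTDOC message")
    parts = msg.get_payload()
    return parts[0], parts[1:]


if __name__ == "__main__":
    shown, links = split_htdoc(SAMPLE)
    print(shown.get_payload())
    for link in links:
        print(link["Content-id"], link.get_param("access-type"))
```

A viewer that knows nothing of x-HTDOC would presumably fall back to
treating the message as multipart/mixed, which is why the syntax is
kept identical.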

Then we define a new subtype of the MIME text content type called
x-HTML. This is an SGML markup language using the default SGML declaration
(i.e. the reference concrete syntax, default processing limits, etc.)
and the HTML DTD (included below).

---

<!-- This DTD was produced by DeveGram on Tue Jun  2 18:58:16 1992 -->
<!-- and hand-edited by connolly@convex.com -->

<!--     Parameter Entities       -->

<!--      Terminal symbols        -->

<!ENTITY % words "#PCDATA" >

<!--    Non-ELEMENT symbols       -->

<!ENTITY % inline	"%words | A" >
<!ENTITY % text         "%inline | P" >
<!ENTITY % heading "H1|H2|H3|H4|H5|H6" >

<!ENTITY lt "<">
<!ENTITY gt ">">
<!ENTITY amp "&">

<!ENTITY lt. "<">
<!ENTITY gt. ">">
<!ENTITY amp. "&">

<!--     Document structure       -->

<!ELEMENT html	O O  (TITLE, NEXTID?, ISINDEX?, section+, ADDRESS?)>

<!ELEMENT TITLE	- -  (%inline)+>
<!ELEMENT ADDRESS - - (%text)+>

<!ELEMENT NEXTID - O EMPTY >
<!ATTLIST NEXTID N NUMBER #IMPLIED>

<!ELEMENT ISINDEX - O EMPTY >


<!ELEMENT section O O ((%heading)?,
			(
			%text |
			section |
			MENU |
			UL |
			OL |
			DIR |
			DL)+)>

<!ELEMENT (H1|H2|H3|H4|H5|H6)	- -  (%inline)+>

<!ELEMENT P	- O  EMPTY -- paragraph SEPARATOR -->


<!ELEMENT A	- -  (%inline)+>
<!ATTLIST A
	NAME CDATA #IMPLIED
	PART ENTITY #IMPLIED >

<!ELEMENT MENU	- -  (LI+)>

<!ELEMENT UL	- -  (LI+)>

<!ELEMENT OL	- -  (LI+)>

<!ELEMENT DIR	- -  (LI+)>

<!ELEMENT LI	- O  (%text)+>

<!ELEMENT DL	- -  ((DT, DD)+)>

<!ELEMENT DT	- O  (%inline)+>

<!ELEMENT DD	- O  (%text)+>

---

An HTML document would use external entities to reference other parts
of the multipart message. The system identifier matches the
Content-Id field of the intended part. The content-type of the indicated
part could be image, audio, or video for multimedia inclusions; text for
quotes and the like; or message/external-body for references to other documents.
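
To make the lookup concrete, here is a small Python sketch of how a
reader might follow a PART attribute to its body part. It assumes the
entity declarations take the SDATA form used in the sample message in
this posting, and that the parts have already been indexed by
Content-Id in a dictionary; entity_table, resolve_part, SUBSET and
PARTS are all names invented for the example.

```python
import re

# Matches declarations of the form used in the sample message:
#   <!ENTITY part1 SDATA "QuickGuide.html">
ENTITY_RE = re.compile(r'<!ENTITY\s+(\S+)\s+SDATA\s+"([^"]*)"\s*>')


def entity_table(doctype_subset):
    """Map each declared entity name to the identifier it carries."""
    return dict(ENTITY_RE.findall(doctype_subset))


def resolve_part(entity_name, doctype_subset, parts_by_content_id):
    """Return the part whose Content-Id equals the entity's identifier.

    parts_by_content_id is a hypothetical dict keyed by Content-Id.
    None is returned when the entity or the part is missing.
    """
    identifier = entity_table(doctype_subset).get(entity_name)
    return parts_by_content_id.get(identifier)


# Hypothetical data, shaped like the sample message.
SUBSET = '''
<!ENTITY part1 SDATA "QuickGuide.html">
<!ENTITY part2 SDATA "http://info.cern.ch/hypertext/WWW/TheProject.html">
'''
PARTS = {
    "QuickGuide.html": {"access-type": "x-relative"},
    "http://info.cern.ch/hypertext/WWW/TheProject.html": {"access-type": "x-HTTP"},
}

if __name__ == "__main__":
    print(resolve_part("part2", SUBSET, PARTS))
```

A real client would take the entity table from its SGML parser rather
than from a regular expression; the regex only keeps the sketch short.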

MIME defines access-types for local-file and anon-ftp. We could define
x-HTTP, x-NEWS, x-WAIS, and the other UDI access types.
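
As a hedged illustration of what such access types might carry, the
Python sketch below breaks a UDI into message/external-body parameters
for the x-HTTP case and for a relative name (called x-relative here).
The parameter names site, port and name are the ones used in the
sample message in this posting; the function external_body_params is
an invention for the example, and the other access types are left out
to keep it short.

```python
import re
from urllib.parse import unquote


def external_body_params(udi):
    """Return (parameter, value) pairs describing how to fetch a UDI."""
    match = re.match(r"(\w+):", udi)
    if match is None:
        # No access scheme: treat the UDI as relative to this document.
        return [("access-type", "x-relative"), ("name", unquote(udi))]

    scheme = match.group(1).lower()
    rest = udi[match.end():]
    if scheme != "http":
        raise ValueError("access type not covered by this sketch: " + scheme)

    params = [("access-type", "x-HTTP")]
    # //host with an optional :port, followed by the name on that host.
    host = re.match(r"//([^:/]+)(?::(\d+))?", rest)
    if host:
        params.append(("site", host.group(1)))
        if host.group(2):
            params.append(("port", host.group(2)))
        rest = rest[host.end():]
    params.append(("name", unquote(rest)))
    return params


if __name__ == "__main__":
    for pair in external_body_params("http://info.cern.ch:8001/archive.orst.edu:9000/archie-orst.edu"):
        print("\t;%s=%s" % pair)
```

Splitting the UDI into separate parameters sidesteps the quoting and
line-length problems noted earlier, at the cost of a longer header.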

Within HTML documents, SGML IDREFs and IDs are used to reference and define
elements of a document. (I think HyTime defines a way to reference elements
without explicit IDs.)


The next part of this message is a default.html from the WWW
distribution adapted to use the conventions here.

It should interoperate with existing MIME systems,
though they will not be able to do anything intelligent with HTML.

---
Content-Type: multipart/x-HTDOC; boundary=cut-here

--cut-here
Content-Type: text/x-HTML

<!DOCTYPE HTML SYSTEM 
[
<!ENTITY part1 SDATA "QuickGuide.html">
<!ENTITY part2 SDATA "http://info.cern.ch/hypertext/WWW/TheProject.html">
<!ENTITY part3 SDATA "http://crnvmc.cern.ch./WHO">
<!ENTITY part4 SDATA "http://crnvmc.cern.ch./FIND/yellow?">
<!ENTITY part5 SDATA "http://crnvmc.cern.ch./FIND/jaune?">
<!ENTITY part6 SDATA "http://crnvmc.cern.ch./FIND">
<!ENTITY part7 SDATA "http://crnvmc.cern.ch/NEWS/?">
<!ENTITY part8 SDATA "http://crnvmc.cern.ch./NEWS/cern">
<!ENTITY part9 SDATA "http://crnvmc.cern.ch./NEWS/vmnews">
<!ENTITY part10 SDATA "http://crnvmc.cern.ch/NEWS/student">
<!ENTITY part11 SDATA "http://info.cern.ch/hypertext/DataSources/NewsFromVM/Overview.html">
<!ENTITY part12 SDATA "http://info.cern.ch/hypertext/DataSources/News/Overview.html">
<!ENTITY part13 SDATA "http://info.cern.ch/hypertext/DataSources/bySubject/Overview.html">
<!ENTITY part14 SDATA "http://info.cern.ch./hypertext/DataSources/Overview.html">
<!ENTITY part15 SDATA "http://slacvm.slac.stanford.edu./FIND/spires">
<!ENTITY part16 SDATA "http://crnvmc.cern.ch/FIND/DESY?">
<!ENTITY part17 SDATA "http://info.cern.ch:8001/archive.orst.edu:9000/archie-orst.edu">
<!ENTITY part18 SDATA "http://iicm.tu-graz.ac.at./jargon">
<!ENTITY part19 SDATA "http://info.cern.ch./hypertext/Products/WAIS/Sources/Overview.html">
<!ENTITY part20 SDATA "http://info.cern.ch/rpc/doc/User/UserGuide.html">
<!ENTITY part21 SDATA "http://otax.tky.hut.fi/tky/default.html">
<!ENTITY part22 SDATA "gopher://gopher.micro.umn.edu:70/11/Other%20Gopher%20and%20Information%20Servers">
<!ENTITY part23 SDATA "http://info.cern.ch./hypertext/WWW/LineMode/Defaults/default.html">
]>
<TITLE>CERN Information</TITLE>
<NEXTID N=10>
<SECTION><H1>CERN Information - Select by number</H1>
<DL>
<DT><A PART="part1">Help</A>
<DD>On this program, or the
<A PART="part2">World-Wide Web project</A>.
<DT><A PART="part3" NAME=2>Phone book</A>
<DD>People, phone numbers, accounts and email addresses.
See also the analytical
<A PART="part4" NAME=yellow>Yellow Pages</A>, or
the same index in French :
<A PART="part5" NAME=jaune>Pages Jaunes</A>.
<DT><A PART="part6" NAME=1>"XFIND" index</A>
<DD>Index of computer centre documentation, newsletters, news,
help files, etc...
<DT><A PART="part7" NAME=groups>News</A>
<DD>A complete list of all public CERN news groups, such as
<A PART="part8" NAME=3>news from the CERN User's
Office</A>,<A PART="part9" NAME=4>
CERN computer center news</A>,<A PART="part10">
student news</A>. See also <A PART="part11" NAME=5>private
groups</A> and <A PART="part12" NAME=inews>Internet
news</A>.
</dl>
</section>
<section>
<SECTION><H2>From other sites</h2>
See online data by
<A PART="part13" NAME=subject>subject</A>,
pointers to
<A PART="part14">other forms of online data</a>, and the following specific databases:
<DL>
<DT><A PART="part15" NAME=spires>SLAC SPIRES</A>
<DD>The High Energy Physics preprint index at Stanford Linear Accelerator, California.
(This is the same information available via the QSPIRES facility on BITNET.
Include the word "FIND" as the first keyword, eg: K FIND AUTHOR FRED.).
<DT><A PART="part16" NAME=desy>DESY documents</a>
<DD>Documents and help files from the DESY lab in Hamburg.
<DT><A PART="part17" NAME=archie>
Archie</a>
<DD>An index of almost everything available by "anonymous FTP".
<DT><A PART="part18" NAME=7>Hacker Jargon</a>
<DD>An index to a cross-referenced set of hacker terms. A demonstration
of the WWW gateway to the Graz Technical University Hyper-G database.
<DT><A PART="part19" NAME=9>W.A.I.S.</a>
<DD>All kinds of information available from "Wide Area Information Servers".
<DT><A PART="part20" NAME=6>CERN RPC</A>
<DD>The user guide for the RPC system developed in CERN CN division
(not Sun/RPC). This is an example of documentation (partially) converted
into hypertext.
<DT><A PART="part21" NAME=hut>Helsinki</a>
<DD>Helsinki Technical University information service (Mostly Finnish).
<DT><A PART="part22" NAME=gopher>Gophers</a>
<DD>Campus-wide information systems using "Gopher" software. (Requires www version 1.1 or higher)
</DL>
(This page may be an out of date copy. See the
<A PART="part23" NAME=latest>latest version</a>.)

--cut-here
Content-id: QuickGuide.html
Content-type: message/external-body
	;access-type=x-relative
	;name="QuickGuide.html"

Content-Type: message


--cut-here
Content-id: http://info.cern.ch/hypertext/WWW/TheProject.html
Content-type: message/external-body
	;access-type=x-HTTP
	;site=info.cern.ch
	;name=/hypertext/WWW/TheProject.html

Content-Type: message


--cut-here
Content-id: http://crnvmc.cern.ch./WHO
Content-type: message/external-body
	;access-type=x-HTTP
	;site=crnvmc.cern.ch.
	;name=/WHO

Content-Type: message


--cut-here
Content-id: http://crnvmc.cern.ch./FIND/yellow?
Content-type: message/external-body
	;access-type=x-HTTP
	;site=crnvmc.cern.ch.
	;name=/FIND/yellow?

Content-Type: message


--cut-here
Content-id: http://crnvmc.cern.ch./FIND/jaune?
Content-type: message/external-body
	;access-type=x-HTTP
	;site=crnvmc.cern.ch.
	;name=/FIND/jaune?

Content-Type: message


--cut-here
Content-id: http://crnvmc.cern.ch./FIND
Content-type: message/external-body
	;access-type=x-HTTP
	;site=crnvmc.cern.ch.
	;name=/FIND

Content-Type: message


--cut-here
Content-id: http://crnvmc.cern.ch/NEWS/?
Content-type: message/external-body
	;access-type=x-HTTP
	;site=crnvmc.cern.ch
	;name=/NEWS/?

Content-Type: message


--cut-here
Content-id: http://crnvmc.cern.ch./NEWS/cern
Content-type: message/external-body
	;access-type=x-HTTP
	;site=crnvmc.cern.ch.
	;name=/NEWS/cern

Content-Type: message


--cut-here
Content-id: http://crnvmc.cern.ch./NEWS/vmnews
Content-type: message/external-body
	;access-type=x-HTTP
	;site=crnvmc.cern.ch.
	;name=/NEWS/vmnews

Content-Type: message


--cut-here
Content-id: http://crnvmc.cern.ch/NEWS/student
Content-type: message/external-body
	;access-type=x-HTTP
	;site=crnvmc.cern.ch
	;name=/NEWS/student

Content-Type: message


--cut-here
Content-id: http://info.cern.ch/hypertext/DataSources/NewsFromVM/Overview.html
Content-type: message/external-body
	;access-type=x-HTTP
	;site=info.cern.ch
	;name=/hypertext/DataSources/NewsFromVM/Overview.html

Content-Type: message


--cut-here
Content-id: http://info.cern.ch/hypertext/DataSources/News/Overview.html
Content-type: message/external-body
	;access-type=x-HTTP
	;site=info.cern.ch
	;name=/hypertext/DataSources/News/Overview.html

Content-Type: message


--cut-here
Content-id: http://info.cern.ch/hypertext/DataSources/bySubject/Overview.html
Content-type: message/external-body
	;access-type=x-HTTP
	;site=info.cern.ch
	;name=/hypertext/DataSources/bySubject/Overview.html

Content-Type: message


--cut-here
Content-id: http://info.cern.ch./hypertext/DataSources/Overview.html
Content-type: message/external-body
	;access-type=x-HTTP
	;site=info.cern.ch.
	;name=/hypertext/DataSources/Overview.html

Content-Type: message


--cut-here
Content-id: http://slacvm.slac.stanford.edu./FIND/spires
Content-type: message/external-body
	;access-type=x-HTTP
	;site=slacvm.slac.stanford.edu.
	;name=/FIND/spires

Content-Type: message


--cut-here
Content-id: http://crnvmc.cern.ch/FIND/DESY?
Content-type: message/external-body
	;access-type=x-HTTP
	;site=crnvmc.cern.ch
	;name=/FIND/DESY?

Content-Type: message


--cut-here
Content-id: http://info.cern.ch:8001/archive.orst.edu:9000/archie-orst.edu
Content-type: message/external-body
	;access-type=x-HTTP
	;site=info.cern.ch
	;port=8001
	;name=/archive.orst.edu:9000/archie-orst.edu

Content-Type: message


--cut-here
Content-id: http://iicm.tu-graz.ac.at./jargon
Content-type: message/external-body
	;access-type=x-HTTP
	;site=iicm.tu-graz.ac.at.
	;name=/jargon

Content-Type: message


--cut-here
Content-id: http://info.cern.ch./hypertext/Products/WAIS/Sources/Overview.html
Content-type: message/external-body
	;access-type=x-HTTP
	;site=info.cern.ch.
	;name=/hypertext/Products/WAIS/Sources/Overview.html

Content-Type: message


--cut-here
Content-id: http://info.cern.ch/rpc/doc/User/UserGuide.html
Content-type: message/external-body
	;access-type=x-HTTP
	;site=info.cern.ch
	;name=/rpc/doc/User/UserGuide.html

Content-Type: message


--cut-here
Content-id: http://otax.tky.hut.fi/tky/default.html
Content-type: message/external-body
	;access-type=x-HTTP
	;site=otax.tky.hut.fi
	;name=/tky/default.html

Content-Type: message


--cut-here
Content-id: gopher://gopher.micro.umn.edu:70/11/Other%20Gopher%20and%20Information%20Servers
Content-type: message/external-body
	;access-type=x-gopher
	;site=gopher.micro.umn.edu
	;port=70
	;type=11
	;selector="Other Gopher and Information Servers"

Content-Type: message


--cut-here
Content-id: http://info.cern.ch./hypertext/WWW/LineMode/Defaults/default.html
Content-type: message/external-body
	;access-type=x-HTTP
	;site=info.cern.ch.
	;name=/hypertext/WWW/LineMode/Defaults/default.html

Content-Type: message
--cut-here--

---

Here's the perl script I used to convert default.html into
the above message. It's full of gross hacks, but it worked
this evening.

---

#!/usr/local/bin/perl

print "Content-Type: multipart/x-HTDOC; boundary=cut-here\n\n";
print "--cut-here\n";
print "Content-Type: text/x-HTML\n\n";
print "<!DOCTYPE HTML SYSTEM \n[\n";

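$o = 0;		# level of the most recent heading seen
$/ = ">";	# read the input in chunks that each end at a tag close

while(<>){
    # turn each HREF anchor into a PART reference, recording the link
    s/(<A[^>]*>)/&fix_anchor($1)/ige;
    s/<NEXTID\s*(\d*)\s*>/<NEXTID N=$1>/ig;
    if(/<H(\d)/i){
	local($n) = $1;
	# a deeper heading opens a nested section; otherwise close
	# the current section and open a sibling
	if($n>$o) { $rep = "<SECTION>"; }
	else { $rep = "</SECTION><SECTION>"; }
	s/(<H\d)/$rep$1/ig;
	$o = $n;
    }
    $doc .= $_;
}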

@entities = @anchors;
while(@entities){
    local($id) = shift(@entities);
    local($_) = shift(@entities);
    local($name) = shift(@entities);
    local($type) = shift(@entities);

    print "<!ENTITY part$id SDATA \"$_\">\n";
}

print "]>\n", $doc;

while(@anchors){
    local($id) = shift(@anchors);
    local($_) = shift(@anchors);
    local($name) = shift(@anchors);
    local($type) = shift(@anchors);
    local($access_type);

    print "\n\n--cut-here\n";
    print "Content-id: $_\n";
    print "Content-type: message/external-body\n";

    $access_type = $1 if s/^(\w+)://;
    if(s/#([^#]+)$//){
	print "\t;x-element-id=\"$1\"\n";
    }

    if($access_type =~ /file/i){
	print "\t;access-type=LOCAL-FILE\n";
	print "\t;name=$_\n";
    }elsif($access_type =~ /http/i){
	print "\t;access-type=x-HTTP\n";
	if(s-//([^:/]+)--){
	    print "\t;site=$1\n";
	    print "\t;port=$1\n" if s/^:(\d+)//;
	}
	&unescape;
	print "\t;name=$_\n";
    }elsif($access_type =~ /news/i){
	print "\t;access-type=x-news\n";
	&unescape;
	if(/@/){
	    print "\t;message-id=$_\n";
	}else{
	    print "\t;group=$_\n";
	}
    }elsif($access_type =~ /telnet/i){
	print "\t;access-type=x-telnet\n";
	&unescape;
	print "\t;user=$1\n" if s/^(.*)@//;
	print "\t;port=$1\n" if s/:(.*)$//;
	print "\t;site=$_\n";
    }elsif($access_type =~ /gopher/i){
	print "\t;access-type=x-gopher\n";
	if(s-^//([^:/]+)--){
	    print "\t;site=$1\n";
	    print "\t;port=$1\n" if s/:(\d+)//;
	}
	print "\t;type=$1\n" if s-^/(\d+)/--;
	&unescape;
	print "\t;selector=\"$_\"\n";
    }elsif($access_type =~ /wais/i){
	print "\t;access-type=x-wais\n";
	if(s-//([^:/]+)--){
	    print "\t;site=$1\n";
	    print "\t;port=$1\n" if s/:(\d+)//;
	}
	if(m-^/-){
	    print "\t;type=$1\n" if s-^/(\w+)--;
	    print "\t;size=$1\n" if s-^/(\d+)--;
	    &unescape;
	    print "\t;path=\"$_\"\n";
	}else{
	    &unescape;
	    print "\t;words=\"$1\"\n" if /\?(.*)/;
	}
    }elsif($access_type eq ""){
	print "\t;access-type=x-relative\n";
	&unescape;
	print "\t;name=\"$_\"\n";
    }else{
	warn "unknown access type: $access_type in $_";
    }

    print "\nContent-Type: message\n";
}

print "--cut-here--\n";

# Decode %XX escapes in $_, where XX is a pair of hexadecimal digits.
sub unescape{
    s/%([0-9A-Fa-f][0-9A-Fa-f])/sprintf("%c",hex($1))/ge;
}

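# Rewrite one <A ...> tag: swap its HREF for a PART attribute naming an
# entity, and remember each distinct link so that one entity and one
# MIME part are emitted for it later.
sub fix_anchor{
    local($_) = @_;
    local($name, $href, $type, $id);
    $href = $1 if /HREF\s*=\s*(\S+)/i;
    return $_ unless $href;
    $href =~ s/>$//;

    $name = $1 if /NAME\s*=\s*(\S+)/i;
    $name =~ s/>$//;
    $type = $1 if /TYPE\s*=\s*(\S+)/i;
    $type =~ s/>$//;

    # Number each distinct HREF once, starting at 1, and reuse that
    # number when the same HREF turns up again.
    unless($content_id{$href}){
	$content_id{$href} = ++$content_id;
	push(@anchors, $content_id, $href, $name, $type);
    }
    $id = $content_id{$href};
    local($ret) = "<A PART=\"part$id\"";
    $ret .= " NAME=$name" if $name;
    $ret .= ">";
    return $ret;
}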

-----