Global HyperLinks was: quotes around tags and escape sequences

Dan Connolly <connolly@pixel.convex.com>
Message-id: <9212010635.AA23999@pixel.convex.com>
To: Edward Vielmetti <emv@msen.com>
Cc: "Tony Johnson (415) 926 2278" <TONYJ@scs.slac.stanford.edu>,
        www-talk@nxoc01.cern.ch
Subject: Global HyperLinks was: quotes around tags and escape sequences 
In-reply-to: Your message of "Mon, 30 Nov 92 23:37:23 EST."
             <m0mwPMf-0000A7C@garnet.msen.com> 
Date: Tue, 01 Dec 92 00:35:22 CST
From: Dan Connolly <connolly@pixel.convex.com>

OK, now you're asking for it. I've been mulling this
stuff over in my head for a couple weeks, and I've got some
pretty good ideas as to how it all fits together.

My model of global hypermedia includes the following terms:

Entity -- SGML and MIME use this term. WAIS calls it a document.
	Gopher calls it an item or a textfile or something.
	WWW used to call it a document, and now calls it
	a resource.

	The meaning is the same in all of them: a unit
	of retrieval [from the URL document].

Content-Type -- MIME coined this term. SGML calls it a NOTATION.
		WAIS used to call it :type, but they'll call
		it :content-type if they follow up on what they
		told me. Most gopher types fall under this scheme
		(telnet, cso, and other types that don't use gopher
		protocol don't fit)

Reference -- This is the WWW anchor, the Gopher menu item, the WAIS
	:document-id structure, the MIME message/external-body. It is
	enough information to 1) decide whether to retrieve the entity,
	2) perform the retrieval transaction, and 3) process the entity
	once you've got it.

>Really, though, the gopher reference is (in gopherspeak)
>
>Name=An arbitrary, but meaningful name
>Host=gopher.micro.umn.edu
>Port=70
>Type=0
>Path=Some Stuff

NOTE: Some Stuff is terminated by a newline, and may not contain tabs.

>And the "href=" is just a way to squash it down to a single string.
>It could just as well be a set of attributes and not a single one.
>E.g.
>
><a gopherhost="gopher.micro.umn.edu" 
>   gopherport="70" 
>   gopherpath="/Some Stuff" 
>   gophertype="0">
>An arbitrary, but meaningful, name</a>

NOTE: for type 7 items, you need gophersearch="terms" too.

>expresses the meaning of what's going on in a way that's far closer to
>how SGML might do it as far as I have been able to make out...Dan is
>that actually legal SGML?

Sure, that's legal. I suggested that URLs be expressed in SGML a long
time ago. Tim said it was overkill, and I'm starting to agree.
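
To make the "squash it down to a single string" idea concrete, here is a
rough sketch in Python. The function name gopher_href is made up for
illustration, and the exact layout (type character right after the slash,
search terms after an encoded tab) is an assumption of this sketch, not
something this thread settles.

```python
from urllib.parse import quote


def gopher_href(host, port, gtype, path, search=None):
    """Squash the separate gopher fields into one URL-style string."""
    # The path is terminated by a newline and may not contain tabs,
    # so reject anything that would break the gopher transaction.
    if "\t" in path or "\n" in path or "\r" in path:
        raise ValueError("gopher path may not contain tabs or newlines")
    url = "gopher://%s:%d/%s%s" % (host, port, gtype, quote(path, safe="/"))
    if search is not None:
        # Assumed convention: search terms (type 7 items) follow an encoded tab.
        url += "%09" + quote(search, safe="")
    return url


print(gopher_href("gopher.micro.umn.edu", 70, "0", "/Some Stuff"))
```

Going the other way (string back to attributes) is the same table read in
reverse, which is why the single-string and multi-attribute forms carry the
same information.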

Let's take a closer look at references:

1) What features allow users and clients to decide to retrieve an entity:

WWW	context and content of the anchor (Is it relevant?)

MIME	content-id (do I have this entity cached already?)
	content-description (relevant?)
	content-type (can I process it once I've got it?)
	SIZE (is it too big to bother?)

WAIS	:score (relevant to my query?)
	:headline (relevant?)
	:doc-id (in cache?)
	 original/distributor-server,database,local-id particularly useful
	:number-of-lines, :number-of-bytes (too big?)
	:type, :content-type (can I process it?)
	:date (how old is it?)

Gopher	name (is it interesting?)
	type (can I process it?)
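
The questions in parentheses above boil down to a small decision
procedure. A minimal sketch, assuming a made-up Reference record that
holds whichever of these fields a given system supplies (relevance is
left to the user, since it depends on the description text):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Reference:
    """Hypothetical record; field names are illustrative only."""
    description: str                    # anchor text, headline, or name
    content_type: Optional[str] = None  # e.g. "text/plain"
    entity_id: Optional[str] = None     # content-id or doc-id, if any
    size: Optional[int] = None          # bytes, if the reference says


def should_retrieve(ref, cache, supported_types, max_bytes):
    """Return (decision, reason) using only what the reference tells us."""
    if ref.entity_id is not None and ref.entity_id in cache:
        return (False, "cached")
    if ref.content_type is not None and ref.content_type not in supported_types:
        return (False, "unsupported type")
    if ref.size is not None and ref.size > max_bytes:
        return (False, "too big")
    # Missing fields never block retrieval; we simply know less.
    return (True, "ok")
```

A Gopher item fills in only the description and type, so it can never be
rejected as cached or too big; a MIME or WAIS reference can.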

2) What features allow the client to make the transfer?

WWW	URL -- protocol, host, port, path, type, size, search terms
	handles local files, HTTP, gopher, WAIS connections.
	includes search terms for fulltext indexes.
	scheme mechanism allows gateways to new protocols

MIME	access-type, etc.: handles ftp, anon-ftp, local-file
	Ghost body allows arbitrary extra data.

Gopher	host, port, path, search words

WAIS	source (host, port, database), doc-id, search terms,
	relevant documents (these are the novel feature. Quite handy)

3) What features allow the client to process the entity?
(Keep in mind that these are features of the reference -- this
is information we have _before_ we transfer the entity).

WWW	processing is tied to the protocol. Content-Type
	of local files is inferred from file extensions.

	Entities from HTTP connections are assumed to
	be text/x-html.

	Gopher entities are typed: 0=text/plain, 1=application/x-gopher,
	w=text/x-html.

	WAIS entities are typed: TEXT=text/plain, WSRC=application/x-wais.

MIME	content-type mechanism is quite expressive. Any content-type
	can be encapsulated in a message/rfc822 entity. Multiple
	entities can be encapsulated in a multipart/mixed entity.

Gopher	gopher type tells you what to do with the data.
	text/plain, application/x-gopher are universally supported.
	other types are supported by pilot projects.

WAIS	:type tells what to do. text/plain and application/x-wsrc are
	supported. Other types are supported by pilot projects.
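
The per-system type codes above all reduce to a lookup into MIME-style
content types. A small sketch of that table in Python, using only the
codes listed in this message; the function name is hypothetical, and
since both application/x-wais and application/x-wsrc appear above for
WAIS sources, the sketch arbitrarily uses the first.

```python
# Type codes as listed in this message; anything else is unknown here.
GOPHER_TYPES = {
    "0": "text/plain",
    "1": "application/x-gopher",
    "w": "text/x-html",
}
WAIS_TYPES = {
    "TEXT": "text/plain",
    "WSRC": "application/x-wais",  # the message also says x-wsrc
}


def content_type_for(system, native_type):
    """Map a system-specific type code to a content type, or None."""
    table = {"gopher": GOPHER_TYPES, "wais": WAIS_TYPES}.get(system)
    if table is None:
        return None
    return table.get(native_type)


print(content_type_for("gopher", "0"))
```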


Now let's see how we should change the WWW reference mechanism.

Here's what we've got currently:

<!ELEMENT A	- -  (#PCDATA)>
<!ATTLIST A
	NAME ID #IMPLIED
	HREF CDATA #IMPLIED
	TYPE CDATA #IMPLIED
	>

What's the TYPE used for? It's not a data type. There's some
code in LineMode to handle it, but I'm not sure what it does.

The NAME identifies the anchor as the target of some other anchor.
We should have NAME (or ID) attributes on pretty much all elements,
for example:

<DL>
<DT ID=term>term<DD>definition
</DL>

The HREF attribute is enough information to retrieve an Entity.
Good. But it's got this #anchor stuck on the end. That should
be a separate attribute. It should be an IDREF, so that we
can validate that it references an existing ID with an SGML
parser.
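
As an illustration of pulling the #anchor off into its own attribute,
here is a sketch using Python's standard urllib.parse.urldefrag. The
function name and the dictionary keys are placeholders, not proposed
attribute names.

```python
from urllib.parse import urldefrag


def split_href(href):
    """Separate the entity's location from the #anchor part, if any."""
    location, anchor = urldefrag(href)
    # An empty fragment means the HREF had no #anchor at all.
    return {"location": location, "ref": anchor or None}


print(split_href("http://example.org/doc.html#term"))
```

Once the two parts are separate, the location can stay free-form text
while the anchor part can be declared as an IDREF and checked by a parser.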

"But," you say, "what if it references an ID outside the current document?"

I suggest we treat a group of nodes that reference each other not
as separate documents, but as entities of one big document. That
way, an author can validate the internal links in his/her web.

I suggest two new elements: XREF, for intra-document links (i.e.
links within the local web), and SEE for inter-document links
(i.e. links that go outside the local web).

<!ELEMENT XREF - - (#PCDATA)
 -- This element is for links within an HTML document. (a document
    is a collection of entities, or a web of nodes).
 -->
<!ATTLIST XREF
	CONTEXT CDATA #IMPLIED -- entity containing the XREF is implied --
	-- SGML purists would make this attribute an ENTITY reference,
	   and put the URL in the SYSTEM identifier in the prologue.
	   For expediency, we put the URL right in the attribute.
	--
	ORIGIN CDATA #IMPLIED
	-- another URL, used as an identifier, rather than a locator.
	   Ala the WAIS original-server,database,local-id triple.
	--
	REF IDREF #REQUIRED  -- ID of referent element --
	>
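
The kind of check this buys an author can be sketched without a real SGML
parser. The Python below is only an approximation: it uses the standard
html.parser module rather than validating against the DTD, it treats a
whole web of entities as one ID namespace, and it compares IDs literally.
The names LinkCollector and dangling_refs are invented for the example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect every ID attribute and every XREF REF in one entity."""

    def __init__(self):
        super().__init__()
        self.ids = set()
        self.refs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)  # tag and attribute names arrive lowercased
        if attrs.get("id"):
            self.ids.add(attrs["id"])
        if tag == "xref" and attrs.get("ref"):
            self.refs.append(attrs["ref"])


def dangling_refs(entities):
    """entities maps entity name -> markup; return unresolved (entity, ref)."""
    all_ids = set()
    refs = []
    for name, text in entities.items():
        collector = LinkCollector()
        collector.feed(text)
        collector.close()
        all_ids |= collector.ids
        refs.extend((name, ref) for ref in collector.refs)
    return sorted((name, ref) for name, ref in refs if ref not in all_ids)


web = {
    "glossary": "<DL><DT ID=term>term<DD>definition</DL>",
    "intro": "<XREF REF=term>term</XREF> and <XREF REF=missing>oops</XREF>",
}
print(dangling_refs(web))
```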

<!ELEMENT SEE - - (#PCDATA)
 -- This element is for links from an HTML document to any entity
    in the global web. The location and content-type of the entity
    are sufficient to resolve the reference.

    The other attributes could be specified in the text of the
    SEE content, but by making them attributes, the client software
    can process them, for example, to display a table of references
    sorted by date.
 -->
<!ATTLIST SEE
	LOCATION CDATA #REQUIRED -- URL of referent entity --
	CONTENT-TYPE CDATA #REQUIRED -- MIME Content-Type for the entity --
	CHUNK CDATA #IMPLIED
	-- This is the analogue of the #anchor mechanism.
	   If the referent is an SGML entity, this would be an ID,
	   though it won't be validated.
	   However, if the referent is a text file, this could be a line number.
	   The meaning is defined by the content-type.
	--
	ORIGIN CDATA #IMPLIED
	FROM CDATA #IMPLIED -- email address or name of author/provider --
	DATE NUMBER #IMPLIED -- in ISO format: YYYYMMDDHHMMSSZ --
	BYTES NUMBER #IMPLIED -- useful in many cases --
	MD5 CDATA #IMPLIED -- data signature --
	>
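
For a feel of how a server or authoring tool might fill in these
attributes, here is a sketch in Python. It assumes the entity body is
available as bytes and that its modification time is a timezone-aware
datetime; see_element is a made-up helper, and the URL in the usage line
is a placeholder.

```python
import hashlib
from datetime import datetime, timezone
from html import escape


def see_element(location, content_type, body, modified, text):
    """Build a SEE element with DATE, BYTES and MD5 computed from the body."""
    attrs = [
        ("LOCATION", location),
        ("CONTENT-TYPE", content_type),
        # YYYYMMDDHHMMSSZ, always expressed in UTC
        ("DATE", modified.astimezone(timezone.utc).strftime("%Y%m%d%H%M%SZ")),
        ("BYTES", str(len(body))),
        ("MD5", hashlib.md5(body).hexdigest()),
    ]
    rendered = " ".join(
        '%s="%s"' % (name, escape(value, quote=True)) for name, value in attrs
    )
    return "<SEE %s>%s</SEE>" % (rendered, escape(text))


print(see_element(
    "http://example.org/notes.txt",
    "text/plain",
    b"hello",
    datetime(1992, 12, 1, 6, 35, 22, tzinfo=timezone.utc),
    "my notes",
))
```

A client holding such a reference can compare BYTES and MD5 against a
cached copy before deciding whether a transfer is needed at all.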

What do you think?

Dan