Redundancy in links, Davenport Proposal [long]

Daniel W. Connolly (connolly@hal.com)
Fri, 27 Jan 95 19:04:11 EST

[Look out folks. This could be a long one. We finally finished and
released OLIAS 1.1, so I have a little time, plus I've been thinking a
lot about Terry's proposal and how the Harvest Technology applies, and
how we increase quality on the web in general in preparation for my
"Formalizing Web Technology" presentation for next week's WebWorld
conference.

I copied all these lists because I think there may be interested folks
on all these lists. I suggest follow-ups be sent only to
uri@bunyip.com and davenport@ora.com.]

In message <199501271917.LAA24883@rock>, Terry Allen writes:
>Dan says
>>For example, if there's a postscript file on an FTP server out there
>>called "report_127," you effectively can't link to it given today's
>>web.
>
>But doesn't that mean simply that not enough info is being sent
>about the file by the server, or that the client isn't smart enough?
>Putting a content-type att on <A> seems like a fragile solution
>to the problem, as it shifts responsibility to the author of
>the doc, who is in most cases just a poor dumb human.

Yes, it's fragile, but it's better than completely broken.

This is _distributed_ hypertext. It spans domains of authority. As an
author, I have authority over the info I put in the link, but I may
not have the authority to change the filename on the server. So I'm
stuck.

This situation will only get more complex: as a value-added proxy
server, I can add annotations, show references to related documents,
etc., but I can't change the original.

I think this is directly relevant to your URN/davenport application[1].

From the evidence that I have studied, the way to make links more
reliable is not to deploy some new centralized namespace (a la URNs
with publisher IDs), but to put more redundant info in links.

Rather than looking at the web as documents addressed by an
identifier, I think we should look at it as a great big
content-addressable-memory. "Give me the document written by Fred in
1992 whose title is 'authentication in distributed systems'."

I think the same sort of thing that makes for a high-quality citation
in written materials will make for a reliable link in a distributed
hypermedia system. A robust _link_ should look like a BibTeX entry
(or a MARC record, etc.).
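
To make the idea concrete, here is a minimal sketch of a link that
carries citation-style attributes and is resolved by matching them
against a catalog. The field names, the catalog, and the example.org
locations are all hypothetical; nothing here is an existing system.

```python
# Hypothetical sketch: a link that carries redundant, citation-style
# attributes instead of a single address. All names are illustrative.

def matches(link, record):
    """True if every attribute given in the link agrees with the record."""
    return all(
        str(record.get(key, "")).lower() == str(value).lower()
        for key, value in link.items()
    )


def resolve(link, catalog):
    """Return all catalog records consistent with the link's attributes."""
    return [record for record in catalog if matches(link, record)]


# A made-up catalog of known documents.
catalog = [
    {"author": "Fred", "year": 1992,
     "title": "Authentication in Distributed Systems",
     "location": "ftp://ftp.example.org/report_127"},
    {"author": "Fred", "year": 1993,
     "title": "Another Report",
     "location": "ftp://ftp.example.org/report_128"},
]

# The link says what the document *is*, not only where it lives.
link = {"author": "Fred", "year": 1992,
        "title": "authentication in distributed systems"}

hits = resolve(link, catalog)
```

The more attributes the link carries, the fewer records can match it
by accident, and the link keeps working if the location changes.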

Given a system like harvest[2], it makes sense to handle queries like
"find me the document whose publisher is O'Reilly and Associates,
published in 1994 under the title 'DNS and BIND'." Their model for
distributed indexing, brokers, replication, and caching (with
taxonomies and query routing in the works) has me convinced that it's
the right way to go.

One party actually develops the document (or program or
database...). Another publishes it. Some folks referee it. Another
party advertises/markets it. Another party provides shared disk space
and bandwidth for a fee. Another party is an expert librarian for some
field. All of these parties are humans or groups of humans, but they
are all aided more or less by the machines that participate in this
distributed hypermedia system.

All these folks share resources. Each of them has different policies
and procedures, different expertise, different goals. The way to make
the whole thing work is
(1) let the computer do the work, wherever we can,
(2) keep the simple things simple, and
(3) make the complex things possible.

So if I as the link author know more than the reader's client can get
from the FTP server, I should be _able_ to contribute the knowledge
that I have. Making all the authors put content type info in their
links is the wrong answer; the optimal solution is for the
provider to adapt to the .ps convention. But the link author should
be able to add value and quality despite the poor efforts of the
FTP server maintainer.

"But the link author could just copy that file and put a .ps extension
on his own machine," you might reply. This doesn't allow for the case
when the document in question changes daily, and it doesn't provide an
audit trail, and it violates my #1 engineering principle: never
maintain the same information in more than one place.

The whole point is that as long as links just give one little point of
information, they're going to be fragile. In effect, URLs give several
pieces of information. They usually give a DNS domain name, so you can
deploy conventions like having webmaster@host be the point of contact
for a given server. From a typical "home page" address

http://host/~user.html

I can infer that user@host is the associated mailbox. It's not 100%,
but it usually works.
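
As a rough sketch of that kind of inference (a heuristic, not a
standard), the function below guesses contact mailboxes from a URL.
The function name and the handling of the ".html" suffix are my own
assumptions for illustration.

```python
from urllib.parse import urlsplit


def guess_contacts(url):
    """Guess mailboxes from a URL by convention; the result may be wrong."""
    parts = urlsplit(url)
    host = parts.hostname
    # Convention: webmaster@host is the contact for the server.
    contacts = {"webmaster": f"webmaster@{host}"}

    # Convention: a first path segment of the form ~user names a mailbox.
    first = parts.path.lstrip("/").split("/", 1)[0]
    if first.startswith("~"):
        user = first[1:]
        if user.endswith(".html"):
            user = user[:-len(".html")]
        if user:
            contacts["user"] = f"{user}@{host}"
    return contacts


home_page = guess_contacts("http://host/~user.html")
```

For the address above this yields webmaster@host and user@host; for a
URL with no "~user" segment it yields only the webmaster guess.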

That brings me to another point: The sharing of information can only
be automated to the point that it can be formalized. I've been trying
to find some formalism for the way the web works. I've decided that
this is a useful exercise for areas like security, where you have to
be 100% sure of your conclusions relative to your premises.

But for the web in general, 100% accuracy and authenticity is not
necessary. The web is a model for human knowledge, and human knowledge
is generally not clean and precise -- it's not even 100% consistent.
So I think that instead of modelling the web with formal systems like
Larch[3], a more "fuzzy" AI knowledge-representation sort of approach
like Algernon[4] is the way to go. Traditional formal systems like
Larch rely on consistency, which is not a good model for the knowledge
base deployed on the web.

The URN model of publisher ID/local-identifier may be sufficient for
the applications of moving the traditional publishing model onto the
web. But that is only one application of the technology that it takes
to achieve high quality links. Another application may have some other
idea of what the "critical meta-information" is. For example, for bulk
file distribution (a la archie/ftp), the MD5 checksum is critical.

OK... so... now that I've done a brain dump, how about a specific answer
to the "Davenport proposal":

Problem Statement
=================

The Davenport Group is a group of experts in technical documentation,
mostly representing Unix system vendors. They have developed DocBook,
a shared SGML-based representation for technical documentation. They
will probably be using a combination of CD-ROM distribution and the
Internet to deliver their technical documentation.

They are developing hypertext documentation; they each have solutions
for CD-ROM distribution, but while the World-Wide Web is the most
widely-deployed technology for internet distribution, it does not meet
their needs for response time nor reliability of links over time. As
publishers, they are willing to invest resources to increase the
quality of service for the information they provide over the web.

Moreover, the solution for increased reliability must be shared among
the vendors and publishers, as the links will cross company
boundaries. Ideally, the solution will be part of an Internet-wide
strategy to increase the quality of service in information retrieval.

Theory of Operation
===================

The body of information offered by these vendors can be regarded as a
sort of distributed relational database, the rows being individual
documents (retrievable entities, to be precise), and the columns being
attributes of those documents, such as content, publisher, author,
title, date of publication, etc.

The pattern of access on this database is much like many databases:
some columns are searched, and then the relevant row is selected. This
motivates keeping a certain portion of this data, sometimes referred
to as "meta-data," or indexing information, highly available.

The harvest system is a natural match. Each vendor or publisher would
operate a gatherer, which culls the indexing information from the rows
of the database that it maintains. A harvest broker would collect the
indexing information into an aggregate index. This gatherer/broker
collection interaction is very efficient, and the load on a
publisher's server would be minimal. The broker can be replicated to
provide sufficiently high availability.
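
The division of labor can be sketched as follows. This is only an
illustration of the gatherer/broker idea under my own assumptions: the
function and class names are hypothetical, and it does not use
Harvest's actual interfaces or record format.

```python
# Hypothetical sketch of the gatherer/broker split. Not Harvest's API.

def gather(documents):
    """Gatherer: keep the indexing attributes, leave the content behind."""
    return [
        {key: value for key, value in doc.items() if key != "content"}
        for doc in documents
    ]


class Broker:
    """Broker: aggregate summaries from many gatherers and answer queries."""

    def __init__(self):
        self.index = []

    def collect(self, summaries):
        self.index.extend(summaries)

    def search(self, **criteria):
        return [
            entry for entry in self.index
            if all(entry.get(key) == value for key, value in criteria.items())
        ]


# Two publishers, each with its own rows of the "database" (made-up data).
publisher_a = [{"publisher": "O'Reilly and Associates", "year": 1994,
                "title": "DNS and BIND", "content": "..."}]
publisher_b = [{"publisher": "Example Vendor", "year": 1995,
                "title": "System Administration Guide", "content": "..."}]

broker = Broker()
for rows in (publisher_a, publisher_b):
    broker.collect(gather(rows))

found = broker.search(title="DNS and BIND")
```

Only the small summaries travel to the broker, so searching never
touches the publishers' servers; the content is fetched only after a
row has been selected.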

Typically, a harvest broker exports a forms-based HTTP searching
interface. But locating documents in the davenport database is a
non-interactive process in this system. Ultimately, smart browsers
can be deployed to conduct the search of the nearest broker and
select the appropriate document automatically. But the system should
interoperate with existing web clients.

Hence the typical HTTP/harvest proxy will have to be modified to not
only search the index, but also select the appropriate document and
retrieve it. To decrease latency, a harvest cache should be collocated
with each such proxy.

Ideally, links would be represented in the harvest query syntax, or a
simple s-expression syntax. (Wow! In surfing around for references, I
just found an example of how these links could be implemented. See the
PRDM project[2].) But since the only information passed from
contemporary browsers to proxy servers is a URL, the query syntax will
have to be embedded in the URL syntax.

I'll leave the details aside for now, but for example, the query:

(Publisher-ISBN: 1232) AND (Title: "Microsoft Windows User Guide")
AND (Edition: Second)

might be encoded as:

harvest://davenport?publisher-isbn=1232;title=Microsoft%20Windows%20User%20Guide;edition=Second
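
A minimal sketch of one such encoding follows, assuming the
"//davenport" form, semicolon-separated attribute=value pairs, and
percent-escaping of the values. The harvest: scheme is only proposed
here, and the function names are hypothetical.

```python
from urllib.parse import quote, unquote

SCHEME_PREFIX = "harvest://"  # proposed scheme, not a registered one


def encode_link(index, attributes):
    """Embed attribute/value pairs in a harvest: URL for the given index."""
    pairs = ";".join(
        f"{quote(key.lower(), safe='-')}={quote(str(value), safe='')}"
        for key, value in attributes.items()
    )
    return f"{SCHEME_PREFIX}{index}?{pairs}"


def decode_link(url):
    """Recover the index name and the attribute/value pairs from the URL."""
    if not url.startswith(SCHEME_PREFIX):
        raise ValueError("not a harvest: link")
    index, _, query = url[len(SCHEME_PREFIX):].partition("?")
    attributes = {}
    for pair in query.split(";"):
        if not pair:
            continue
        key, _, value = pair.partition("=")
        attributes[unquote(key)] = unquote(value)
    return index, attributes


url = encode_link("davenport", {
    "Publisher-ISBN": 1232,
    "Title": "Microsoft Windows User Guide",
    "Edition": "Second",
})
index, attributes = decode_link(url)
```

Because the values are escaped, a title containing ";" or "=" still
survives the round trip through a browser that treats the whole thing
as an opaque URL.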

Each client browser is configured with the host and port of the
nearest davenport broker/HTTP proxy. The reason for the "//davenport"
in the above URL is that such a proxy could serve other application
indices as well. Ultimately, browsers might implement the harvest:
semantics natively, and the browser could use the Harvest Server
Registry to resolve the "davenport" keyword to the address of a
suitable broker.

To resolve the above link, the browser client contacts the proxy and
sends the full URL. The proxy contacts a nearby davenport broker,
which processes the query and returns results. The proxy then selects
any match from those results and retrieves it.

Through careful administration of the links and the index, all the
matches should identify replicas of the same entity, possibly on
different ftp/http/gopher servers. An alternative to manually
replicating the data on these various servers would be to let the
harvest cache collocated with the broker provide high availability of
the document content.
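
Since every match should be a replica of the same entity, the
resolution step can simply try the matches in turn. Here is a sketch
under that assumption; the search and fetch callables stand in for the
broker query and the ftp/http/gopher retrieval, and all names and
locations are hypothetical.

```python
def resolve_link(attributes, search, fetch):
    """Return (location, content) for the first replica that can be fetched.

    search(attributes) -> list of locations (the broker's matches)
    fetch(location)    -> bytes, or raises OSError if the server is down
    """
    errors = []
    for location in search(attributes):
        try:
            return location, fetch(location)
        except OSError as exc:
            errors.append((location, str(exc)))
    raise LookupError(f"no replica could be retrieved: {errors}")


# Stand-ins for a broker and two replica servers, one of them down.
def fake_search(attributes):
    return ["ftp://a.example.org/guide.ps", "http://b.example.org/guide.ps"]


def fake_fetch(location):
    if location.startswith("ftp://a.example.org"):
        raise OSError("connection refused")
    return b"%!PS-Adobe document body"


location, content = resolve_link({"title": "User Guide"}, fake_search, fake_fetch)
```

The link stays usable as long as any one replica answers, which is
exactly the redundancy a single-address link lacks.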

Security Considerations
=======================

The main considerations are authenticity and access control for the
distributed database.

Securely-obtained links (from a CD-ROM, for example) could include the
MD5 checksum of the target document. If the target document changes, a
digital signature providing a secure override to the MD5 could be
transmitted in the HTTP header. Assuming the publishers' public keys
are made available to the cache/proxies in a secure fashion, this
would allow the cache/proxy to detect a forgery. But the link from the
cache/proxy to the client is insecure until clients are enhanced to
implement more of this functionality natively. At that point, the
problem of key distribution becomes more complex.
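
The checksum half of this is easy to sketch; the signature override is
left out. The sketch assumes the link carries the MD5 as a hex string,
which is my assumption, and the function names are hypothetical.

```python
import hashlib


def md5_hex(content: bytes) -> str:
    """Hex MD5 digest of the document content."""
    return hashlib.md5(content).hexdigest()


def verify_target(content: bytes, expected_md5: str) -> bool:
    """True if the retrieved content matches the checksum from the link."""
    return md5_hex(content) == expected_md5.strip().lower()


# A link obtained securely (say, from a CD-ROM) records the checksum
# of the document as it was when the link was made.
original = b"Chapter 1. Getting Started\n"
link_md5 = md5_hex(original)

# Later, the cache/proxy checks whatever it actually retrieved.
genuine_ok = verify_target(original, link_md5)
tampered_ok = verify_target(b"Chapter 1. Something else\n", link_md5)
```

A mismatch tells the cache/proxy only that the content differs from
what the link author saw; telling a legitimate revision from a forgery
is the job of the signed override.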

This proposal does not address access control. As long as all
information distributed over the web is public, this solution is
complete. But over time, the publishers will expect to be able
to control access to their information.

If the publishers were willing to trust the cache/proxy servers to
implement access control, I expect an access control mechanism could
be added to this system. If the publishers are willing to allow the
indexing information to remain public, I believe that performance
would not suffer tremendously. The primary difficulty would be
distributing a copy of the access control database among the proxies
in a secure fashion.

Conclusions
===========

I believe this solution scales well in many ways. It allows the
publishers to be responsible for the quality of the index and the
links, while delegating the responsibility of high-availability to
broker and cache/proxy servers. The publishers could reach agreements
with network providers to distribute those brokers among the client
population (much as GNN is available through various sites).

It allows those cache/proxy servers to provide high-availability to
other applications as well as the davenport community. (The Linux
community and the Computer Science Technical reports community already
operate harvest brokers.)

The impact on clients is minimal -- a one-time configuration of the
address of the nearest proxy. I believe that the benefits to the
respective parties outweigh the cost of deployment, and that this
solution is very feasible.

[1] http://www.acl.lanl.gov/URI/archive/uri-95q1.messages/0080.html
Sun, 22 Jan 1995 12:41:10 PST

[2] PRDM
http://www-pcd.stanford.edu/ANNOT_DOC/annotations.html

[3] http://www.research.digital.com/SRC/larch/larch-home.html

[4] http://www.cs.utexas.edu/~qr/algernon.html