Re Re Customer pull on HTTP2

Dave_Raggett <dsr@hplb.hpl.hp.com>

Mail folder: WWW Talk Jan-Mar 1993 Archives

From: Dave_Raggett <dsr@hplb.hpl.hp.com>
Message-id: <9301111105.AA23480@manuel.hpl.hp.com>
Subject: Re Re Customer pull on HTTP2
To: www-talk@nxoc01.cern.ch
Date: Mon, 11 Jan 93 11:05:48 GMT
Cc: dsr@hplb.hpl.hp.com
Mailer: Elm [revision: 66.25]

Kevin Hoadley says in:

>> Caching
>>-------
>>
>> It will be desirable to avoid overloading servers with popular documents by
>> supporting a caching scheme at local servers (or even at browsers?).

> As well as caching, replication would be nice. But this is only
> practical if resource identifiers do not contain location information
> (otherwise replication is only possible by making all the peer servers
> appear to be one machine, as in the DNS CNAME suggestion I made some time
> ago). But if resource identifiers do not contain host information then you
> need an external means of determining how to reach the resource.
> This is analogous to routing protocols (an address is not a route ...).

> Such a system is probably over ambitious for now.

I agree, but it is important to keep an eye on where things are going.
The ability to replicate documents in this way will depend on name servers
e.g. X.500. In the meantime is this necessary? At first, a simple scheme is
to send all remote requests via a fast local server. This server checks if
this Udi is in its cache, and if not forwards the request to the machine
named in the Udi itself. You can extend this to take advantage of several
caches, and the work done by ANSA (Advanced Networked Systems Architecture)
on trading may be appropriate.
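
To make the simple scheme concrete, here is a minimal Python sketch of the
lookup step in such a local server. The names "CachingRelay" and
"fetch_from_origin" are made up for illustration, and the forwarding step is
passed in as a function because nothing above says how the request should
actually be forwarded.

```python
from typing import Callable, Dict


class CachingRelay:
    """Local server that answers from its cache or forwards to the origin."""

    def __init__(self, fetch_from_origin: Callable[[str], bytes]) -> None:
        # fetch_from_origin is a hypothetical callback that contacts the
        # machine named in the Udi and returns the document body.
        self._fetch = fetch_from_origin
        self._cache: Dict[str, bytes] = {}

    def get(self, udi: str) -> bytes:
        # Serve from the cache when this Udi has been seen before.
        if udi in self._cache:
            return self._cache[udi]
        # Otherwise forward the request and remember the answer.
        body = self._fetch(udi)
        self._cache[udi] = body
        return body
```

This sketch never purges anything, so it only shows the hit/miss path; when
to refresh or drop an entry is a separate question.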

... talking about when to purge the cache

> I think this is silly. I haven't changed a document for six months,
> therefore it is safe to say that it won't be changed for the next six
> months ...

Yes, perhaps not one of my best ideas! I think we need some position in
between caching docs only for several minutes, and the full replication
mechanism used with network news and nntp.

One approach is for the server to periodically refresh cache contents. (This
is what Lotus Notes does). You set it up to refresh docs at night, or perhaps
on a trickle basis in the background. The problem is knowing what the
appropriate interval is for each document. The "Expiry:" field or an
equivalent "KeepFor:" (time to live) field when present gives an explicit
suggestion. My "silly" suggestion was a rule of thumb aimed at allowing the
server to "learn" that some docs don't change much, and so can be refreshed at
longer intervals.
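
As a rough sketch of how the two ideas could combine, the Python function
below prefers an explicit time to live when one is present and otherwise
falls back to the rule of thumb. The fraction (one tenth of the time the doc
has been unchanged) and the lower and upper bounds are assumptions chosen
only for illustration; all times are in seconds.

```python
from typing import Optional

DAY = 86400.0  # seconds


def refresh_interval(
    now: float,
    last_modified: float,
    keep_for: Optional[float] = None,
    fraction: float = 0.1,      # assumed: refresh after 10% of the unchanged time
    shortest: float = 300.0,    # assumed lower bound (5 minutes)
    longest: float = 7 * DAY,   # assumed upper bound (one week)
) -> float:
    """Return how long to wait before refreshing a cached doc."""
    # An explicit "KeepFor:" style value always wins over the heuristic.
    if keep_for is not None:
        return max(keep_for, 0.0)
    # Rule of thumb: the longer a doc has gone unchanged, the longer we wait,
    # clamped so that no doc is refreshed too often or too rarely.
    unchanged_for = max(now - last_modified, 0.0)
    return min(max(unchanged_for * fraction, shortest), longest)
```

The upper bound is what keeps the heuristic from concluding that a doc
untouched for six months is safe for another six.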

Another complementary approach is to provide a mechanism whereby a server
owning a document informs a list of client servers when it determines that a
given document has changed. This is critical for a successful room booking
application. For this to work, the protocol needs to include the requests:

    NOTIFY hplose.hpl.hp.com:8001
    ADD Udi
    RETRY   1000 10

This is used to inform a server (named in the Udi) that another server
(host: hplose.hpl.hp.com, port 8001) wishes to be informed when this
document is changed or deleted. The RETRY parameter is optional; it gives
the notification retry interval in seconds, followed by the number of times
to try.

    NOTIFY hplose.hpl.hp.com:8001
    REMOVE Udi

The reverse operation, removing a server from a notification list.

    CHANGED Udi
    <Doc header>
    <Doc body>

The message sent to servers on the notification list when the specified doc
has changed. If the doc has been deleted then the body should be empty.
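
To illustrate the registration side, here is a minimal Python sketch that
parses the NOTIFY request shown above and keeps a notification list per Udi.
The class and function names are hypothetical, and the sketch assumes the
request arrives as exactly the lines shown (one field per line, separated by
whitespace); the real wire format is not pinned down here.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set, Tuple


@dataclass(frozen=True)
class NotifyRequest:
    host: str
    port: int
    action: str                           # "ADD" or "REMOVE"
    udi: str
    retry_interval: Optional[int] = None  # seconds between attempts
    retry_count: Optional[int] = None     # how many times to try


def parse_notify(text: str) -> NotifyRequest:
    """Parse a NOTIFY request with an ADD/REMOVE line and optional RETRY."""
    lines = [ln.split() for ln in text.strip().splitlines() if ln.strip()]
    if len(lines) < 2 or lines[0][0] != "NOTIFY" or len(lines[0]) != 2:
        raise ValueError("expected 'NOTIFY host:port'")
    host, _, port = lines[0][1].rpartition(":")
    if not host or not port.isdigit():
        raise ValueError("expected 'host:port'")
    if len(lines[1]) != 2 or lines[1][0] not in ("ADD", "REMOVE"):
        raise ValueError("expected 'ADD Udi' or 'REMOVE Udi'")
    retry_interval = retry_count = None
    if len(lines) > 2:
        if len(lines[2]) != 3 or lines[2][0] != "RETRY":
            raise ValueError("expected 'RETRY interval count'")
        retry_interval, retry_count = int(lines[2][1]), int(lines[2][2])
    return NotifyRequest(host, int(port), lines[1][0], lines[1][1],
                         retry_interval, retry_count)


class NotificationList:
    """Servers to inform, keyed by the Udi they asked about."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, Set[Tuple[str, int]]] = {}

    def apply(self, req: NotifyRequest) -> None:
        subs = self._subscribers.setdefault(req.udi, set())
        if req.action == "ADD":
            subs.add((req.host, req.port))
        else:
            subs.discard((req.host, req.port))

    def subscribers(self, udi: str) -> Set[Tuple[str, int]]:
        # Return a copy so callers cannot alter the list by accident.
        return set(self._subscribers.get(udi, set()))
```

When a doc changes, the owning server would send a CHANGED message to each
entry returned by subscribers() for that Udi.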

In the case where it is currently impossible to establish a connection with a
server on the notification list, the notification should be periodically
retried until a suitable timeout period has expired. See earlier RETRY field.
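
The retry behaviour might look something like the Python sketch below. It
assumes that the second RETRY value counts the total number of attempts,
including the first, and that a failed connection shows up either as a False
return or as an OSError from a caller-supplied "send" function; both are
assumptions made for the example.

```python
import time
from typing import Callable


def notify_with_retry(
    send: Callable[[], bool],
    retry_interval: float,
    retry_count: int,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try to deliver one notification, waiting between failed attempts.

    send() is a hypothetical callback that delivers the CHANGED message and
    returns True on success. Returns True once delivery succeeds, or False
    when all attempts have been used up.
    """
    for attempt in range(retry_count):
        try:
            if send():
                return True
        except OSError:
            # Could not establish the connection; treat as a failed attempt.
            pass
        # Wait before the next attempt, but not after the last one.
        if attempt < retry_count - 1:
            sleep(retry_interval)
    return False
```

With "RETRY 1000 10" this would give up after ten attempts, roughly two and a
half hours after the first failure.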

You don't need to complicate the http server loop to implement this
mechanism as notifications can be handled by a separate program.

... talking about problems with comparing date/time info

> This also depends on hosts agreeing on the date. To quote
> RFC1128, talking about a 1988 survey of the time/date on
> Internet hosts, "... a few had errors as much as two years"

Wow! I had no idea that this was the case. I had hoped that all machines
would have support for date/time conversion for all known time zones, so
that by including the time zone as part of the format, there would be no
problem.

>> I think that we need to provide an operation in which the server returns a
>> document only if it is later than a date/time supplied with the request.

> This would be useful as part of a replication system, as long as both ends
> exchanged timestamps initially so that the dates can be synchronised.

In this case we need to define how servers should process date/time info,
particularly when a mismatch is detected.
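
One possible rule, sketched in Python below, is for the client to send its
own current time along with the date/time it is asking about, so that the
server can shift the request onto its own clock before comparing. This is
only an illustration of the timestamp exchange suggested above: the function
names and the "tolerance" parameter are hypothetical, network delay is
ignored, and all times are seconds on each host's own clock.

```python
def clock_offset(local_now: float, remote_now: float) -> float:
    """Seconds to add to a remote timestamp to express it on the local clock."""
    return local_now - remote_now


def changed_since(
    doc_modified: float,     # server clock
    client_since: float,     # client clock: date/time supplied with request
    client_now: float,       # client clock: when the request was sent
    server_now: float,       # server clock: when the request arrived
    tolerance: float = 0.0,  # assumed slack; errs towards resending the doc
) -> bool:
    """Return True if the server should send the document."""
    # Translate the client's timestamp onto the server's clock.
    since_on_server_clock = client_since + clock_offset(server_now, client_now)
    return doc_modified > since_on_server_clock - tolerance
```

For example, if the server's clock runs 100 seconds ahead of the client's, a
doc modified at server time 950 is correctly treated as unchanged since
client time 900, where a naive comparison of 950 against 900 would resend it.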

... talking about copyright protection

> It may be stating the obvious, but once you allow a user to access your
> data such that they can save it, there is no technical way you can prevent
> them from publicly redistributing your data. This is a social/legal
> problem, not a technical one. Accepting that nothing can be done to stop
> deliberate abuse of licensed information, there is a need to prevent
> accidental abuse.

There is no *technical way* to stop me driving my car at a passerby and
killing him! The answer is that it is illegal to breach the copyright law.
In HP we have notices next to each photocopier, reminding us of what the law
allows us to do. The same will apply to networked access. Publishers are
concerned that they receive fair payment for their information, and the
critical issue is to ensure that all processes can pass an audit to show
their compliance.

My idea is that it's OK to cache copyrighted docs so long as you put in an
effective mechanism for logging and handling payments. This mechanism must be
able to pass a suitable audit procedure. I believe that the scheme I described
would do this.

> Probably the simplest way to do this is to mark the
> document as one which should NOT be cached.

You need to separate the issue of copyright protection from ensuring secure
access to restricted information. I proposed the "Distribution:" header for
this purpose.

Many thanks for your comments,

Best wishes,

Dave Raggett, dsr@hplb.hpl.hp.com