Re HTTP2: caching and copyright

Dave_Raggett <dsr@hplb.hpl.hp.com>

Mail folder: WWW Talk Jan-Mar 1993 Archives

From: Dave_Raggett <dsr@hplb.hpl.hp.com>
Message-id: <9301111542.AA23888@manuel.hpl.hp.com>
Subject: Re HTTP2: caching and copyright
To: www-talk@nxoc01.cern.ch
Date: Mon, 11 Jan 93 15:42:41 GMT
Cc: dsr@hplb.hpl.hp.com
Mailer: Elm [revision: 66.25]

These are comments on Tim's responses to my recent message on HTTP2.

>>   o   the "Expires:" field is optional

> agreed.

>>   o   the date values should be in a prescribed format to simplify
>>       machine interpretation (Is this adequately defined by existing RFCs?)

> agreed. yes it is, in RFC850

RFC977 provides a tighter definition for date/time, restricting it to the time
zone of the server or GMT. I would like us to restrict it to GMT only, since
otherwise how can a client in general find out the time zone of the server?

    NEWGROUPS date time [GMT] [<distribution>]

    The date is sent as 6 digits in the format YYMMDD, where YY is the last
    two digits of the year, MM is the two digits of the month, (with leading
    zero, if appropriate), and DD is the day of the month (with leading zero,
    if appropriate). The closest century is assumed as part of the year, i.e.
    86 specifies 1986, 30 specifies 2030, 99 is 1999, 00 is 2000).

    Time must also be specified. It must be as 6 digits HHMMSS with HH being
    hours in the 24-hour clock, MM minutes 00-59, and SS seconds 00-59.
    The time is assumed to be in the server's time zone unless the token "GMT"
    appears, in which case both time and date are evaluated at the 0 meridian.
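
As a rough illustration of the format quoted above, here is a minimal Python
sketch that reads the YYMMDD and HHMMSS fields as GMT and expands the year to
the closest century. The function names and the reference_year parameter are
my own assumptions for illustration; they do not come from RFC977.

```python
from datetime import datetime, timezone


def expand_year(yy: int, reference_year: int) -> int:
    """Pick the four-digit year ending in yy that is closest to reference_year."""
    base = reference_year - reference_year % 100
    candidates = (base - 100 + yy, base + yy, base + 100 + yy)
    return min(candidates, key=lambda year: abs(year - reference_year))


def parse_nntp_datetime(date6: str, time6: str, reference_year: int = 1993) -> datetime:
    """Parse YYMMDD and HHMMSS strings, treating the value as GMT (assumption)."""
    if len(date6) != 6 or len(time6) != 6 or not (date6 + time6).isdigit():
        raise ValueError("expected six digits YYMMDD and six digits HHMMSS")
    year = expand_year(int(date6[0:2]), reference_year)
    return datetime(
        year, int(date6[2:4]), int(date6[4:6]),
        int(time6[0:2]), int(time6[2:4]), int(time6[4:6]),
        tzinfo=timezone.utc,
    )
```

With a reference year of 1993 this reproduces the examples in the quoted
text: 86 gives 1986 and 30 gives 2030.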
    
RFC850 mentions that not all time zones have well known abbreviations,
making it difficult to carry out date/time arithmetic. Furthermore, Kevin
Hoadley's comment:

          This also depends on hosts agreeing on the date. To quote
      RFC1128, talking about a 1988 survey of the time/date on
      Internet hosts, "... a few had errors as much as two years"

suggests we will have problems with servers in this regard.
One solution would be for the server to send a field with each document:

        KeepFor: nnnn seconds | mmm days

The cache management software then notes the date/time received and works
out the expiry date/time for itself. This method has the great advantage of
avoiding all need for date/time conversion, and any reliance on servers
having their clocks set correctly.
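
To make the idea concrete, here is a minimal Python sketch of how a cache
might turn a "KeepFor:" value into a local expiry time using only its own
clock. The exact field syntax (a whole number followed by "seconds" or
"days") and the function names are assumptions for illustration.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumed syntax: a whole number followed by the unit "seconds" or "days".
_KEEPFOR = re.compile(r"^\s*(\d+)\s+(seconds|days)\s*$")


def keepfor_to_timedelta(value: str) -> timedelta:
    """Convert a KeepFor value such as '3600 seconds' into a duration."""
    match = _KEEPFOR.match(value)
    if match is None:
        raise ValueError("unrecognised KeepFor value: %r" % value)
    amount, unit = int(match.group(1)), match.group(2)
    return timedelta(seconds=amount) if unit == "seconds" else timedelta(days=amount)


def local_expiry(value: str, received_at: datetime) -> datetime:
    """Expiry is the cache's own receipt time plus the KeepFor duration."""
    return received_at + keepfor_to_timedelta(value)
```

Note that the server's clock never enters the calculation; only the time at
which the cache received the document does.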


>> I think that we need to provide an operation in which the server returns a
>> document only if it is later than a date/time supplied with the request. If
>> it is the same (or earlier) the server should return a suitable status code
>> and an optional "Cost:" header, see below.

> Need to look at NNTP here.  We end up getting very close indeed to it.
> I would want the functionality of this search to map onto the NEWNEWS
> very nicely.  A newsgroup is just a hypertext list anyway.

I like the NEWNEWS command, but feel we should keep the GET & SINCE command.
The latter allows you to refresh a cached document with one exchange, whereas
you need two when using NEWNEWS and a subsequent GET.

The NEWNEWS command is intended for finding new basenotes and responses,
and should be contrasted with the NEWGROUPS command.

>> Note that servers shouldn't cache documents with restricted readership since
>> each server doesn't know the restrictions to apply. This requires a further
>> header to identify such documents as being unsuitable for general caching:
>> 
>>     Distribution: restricted | unrestricted

> Good point.  Note that the distribution of other messages is in the form of
> To: and Cc: and Newsgroup: and in fact Distribution:.  (See  
> http://info.cern.ch/hypertext/WWW/Protocols/rfc850/rfc850.html#z12)
> So you'll need a new fieldname.  If we could only merge the functionality of  
> these systems in some cool way, it would be grand.

I don't understand the description of "Distribution: nj.all" in RFC850
(section 2.2.8). It is unclear what its argument is. Is it a geographical
hierarchy or is it some kind of newsgroup name with the "all" wildcard?

It would be nice in some circumstances to define the readership groups for
situations where a server could apply group membership information to restrict
readership. This field would be supplied by the author. This idea is I believe
in the same spirit as RFC850. Consider the following example:

    Distribution: incl.kbpd psl.all

This says that the document can be given to anyone in psl and anyone in the
kbpd subgroup of incl. You can make these names correspond to your
organisation. The maintenance of these readership groups is outside the scope
of the HTTP2 protocol. Local servers shouldn't cache documents including this
header unless they "understand" the specified readership groups and can apply
the same membership tests. This involves sharing the same definitions across a
group of servers, for instance within a campus or a company.
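
Here is a minimal Python sketch of one way a server might apply such a
header. It assumes that readers are described by dotted group names (for
example "psl.research"), that an entry ending in ".all" admits everyone at or
below that group, and that any other entry admits that group and its
subgroups. This interpretation and the function names are my assumptions.

```python
from typing import Iterable


def entry_matches(entry: str, membership: str) -> bool:
    """True if one Distribution entry admits a reader with this group membership."""
    prefix = entry[: -len(".all")] if entry.endswith(".all") else entry
    return membership == prefix or membership.startswith(prefix + ".")


def may_read(distribution: str, memberships: Iterable[str]) -> bool:
    """True if any of the reader's memberships is admitted by any entry."""
    entries = distribution.split()
    return any(entry_matches(e, m) for m in memberships for e in entries)
```

A cache that cannot evaluate the memberships in this way should simply not
keep the document.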

>> I would like the document header to include an optional cost header, e.g.
>>
>>      Cost: 4.05 US DOLLARS
>>      Copyright: Reuters Inc.

> I note here that both the copyright holder and the account for charging are 
> items in some address space, and we ought to be as flexible with these
> address spaces as with the udi.  So I would propose something like
>
>    ChargeTo:  HPInternal:/8126/148689  upto $2.00
>    
> would be better.  But how does this fit in with authentication?  Once you
> are authenticated, your preferred method of paying will be known.  You can't
> have charging without authentication!

Four points:

First, it certainly isn't the case that once you are authenticated
then the way of charging you is known. For example consider members of
the public wishing to pay for information using a credit card number for a
service they have never before accessed. In HP it may in some cases be
sufficient to check that the client's internet address starts with the
company's subnet code. However, the server still needs the employee
name, number and location code to cross charge.

Second, I don't think the "upto" concept is needed. In the vast majority of
cases a fixed cost will suffice. A point to watch when keeping documents in
local caches, is that this cost may change from time to time. This
corresponds to pricing of normal goods, and I believe can be adequately
handled by appropriately setting the "KeepFor" or "Expires" field.

Third, for legal purposes it is still necessary to tag documents with who owns
the copyright, as in books, music and other products. For this reason we
should include the "Copyright:" field.

Fourth, I like the idea of a universal scheme for naming the copyright owner
and charging method, but feel that this will take some time to take effect.
For the moment I would like to stick to the following:

    Cost: 4.05 US DOLLARS
    Copyright: 1988 Time International Inc.
    ChargeTo: HP/8126/148689

Where the meaning of the "Copyright:" and "ChargeTo:" fields is outside the
scope of the HTTP2 protocol specification. The "Cost:" field always starts
with the amount and should be followed by the currency name.

The requesting GET could include an optional header:

    CostingUpTo:  2.50 US DOLLARS

This would result in the server returning an error message if this was less
than the cost of the requested document. It also introduces the issues of how
to recognise currency types and perform currency conversion. Users should
be able to see how much they will have to pay on preceding hypertext pages
(as supplied by the server).
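
As a minimal Python sketch of the server-side check, assuming both fields use
the "amount then currency name" layout given above. The function names are
hypothetical, and the sketch deliberately refuses to compare different
currencies rather than attempt conversion.

```python
from decimal import Decimal
from typing import Tuple


def parse_amount(value: str) -> Tuple[Decimal, str]:
    """Split a value such as '4.05 US DOLLARS' into amount and currency name."""
    parts = value.split()
    if len(parts) < 2:
        raise ValueError("expected '<amount> <currency name>'")
    return Decimal(parts[0]), " ".join(parts[1:]).upper()


def within_limit(cost: str, costing_up_to: str) -> bool:
    """True if the document's Cost does not exceed the client's CostingUpTo."""
    cost_amount, cost_currency = parse_amount(cost)
    limit_amount, limit_currency = parse_amount(costing_up_to)
    if cost_currency != limit_currency:
        # No currency conversion is attempted in this sketch.
        raise ValueError("cannot compare different currencies")
    return cost_amount <= limit_amount
```

When within_limit returns False the server would send back the error message
described above instead of the document.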

> A simple thing in the first instance is to say that it is illegal to cache
> a for-pay document unless you have a private arrangement with the owner
> about refunding him.  This could be done using a completely separate billing
> process.

No. You can only get copies of documents for which the server recognises that
there is an effective arrangement for making the payment. However, if this is
the case, then caching presents no problems, provided the authentication and
ChargeTo information is preserved and supplied to the server with the GOT
command.

I will try and lay my hands on a copy of "Literary Machines".

>> The protocol ought to allow for multiple GOT statements (and associated
>> headers) in the same message. For this it seems simple enough to require a
>> terminating blank line.

> Hey, that's not something you do for one method, it's a change to the whole
> protocol to introduce pipelining.

Oh dear! It seems a waste to have to set up a connection for each such
request. Perhaps the safest thing is to allow multiple Udi's with the GOT
command, all of which must be for the same client. What limits are there on
line length for headers or is there a mechanism for continuing arguments on
subsequent lines? This would still be effective in limiting network traffic,
and processing time.

>>  Effective support for discussion groups

>> My model is that discussion groups each have unique Udi's. Each discussion
>> group has a sequence of base notes, and each base note is associated with a
>> sequence of responses. I am unsure of how to deal with cross postings!

> I agree that the POST method is well defined as a method of the
> newsgroup class which takes an article as a parameter. In fact, as you say,
> cross-posting makes a mess of this, as it involves many groups in one atomic
> operation.  This is a peculiarity of news which makes it difficult to map
> onto the object model.  Any ideas?

The NNTP protocol employs the POST command to post an article, and relies
on the document's header to specify the news groups for posting to with the
Newsgroups header. The "References:" header is used to link a response to any
articles prompting submission of this article.

Thus each article can be posted to multiple groups, and can have zero or more
references to preceding articles. For convergence between NNTP and HTTP2 we
need to clarify the mapping of groups and references.

News groups as currently defined are hierarchical name spaces without
reference to a server or filing system. The WWW model currently ties
documents to both of these. I would like to be able to post responses as
WWW documents which refer to one (or more?) existing documents. We can already
do this. What we can't do is to find what documents reference a given
document. This is hard in principle and practice since the various documents
can be on different machines scattered over the entire world.

The answer is to provide a mechanism which allows servers to track which Udi's
should be recorded as being "responses" to other Udi's.

What is the analogous concept to news group?

News groups are lists of articles, and not articles as such. The GROUP command in
NNTP allows you to identify articles in given groups. The LIST command returns
the complete list of groups known to the server, while the NEWGROUPS command
returns groups created since a specified date/time, and matching on specified
distribution categories.

I think that in WWW we should treat groups as named documents which are
generated by the server from the database of postings stored under that name.
The important thing is to distinguish between the Udi's of references and
those of news groups.

Given these ideas I will now present my suggestions for the POST command:

The document header supplied with the POST command has the following fields:

    Newsgroups:  <followed by one or more Udi's>        /* optional */
    References:  <followed by one or more Udi's>        /* optional */

Followed by one of the following:

    DocumentName: Udi       /* for an existing document - body is void */
    NewDocument:            /* for new document, contents follow as body */

The semantics are the same as for NNTP, except that the Newsgroups header is
optional. In other words you can post responses to any WWW document - it
doesn't need to be in a news group. The server should return the Udi of the
document if successful (note that the NNTP POST command doesn't bother with
this).
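
A minimal Python sketch of how a server might check the header fields
proposed above before accepting a POST. It assumes the header has already
been parsed into a dictionary keyed by field name; the function name and the
dictionary representation are assumptions for illustration.

```python
from typing import Dict


def validate_post_header(fields: Dict[str, str], body: str) -> None:
    """Raise ValueError if the POST header breaks the rules proposed above."""
    has_name = "DocumentName" in fields
    has_new = "NewDocument" in fields
    # Exactly one of the two document fields must be present.
    if has_name == has_new:
        raise ValueError("exactly one of DocumentName or NewDocument is required")
    # An existing document is named by its Udi, so no body is expected.
    if has_name and body:
        raise ValueError("body must be void when DocumentName is given")
    # Newsgroups and References are both optional, so nothing more to check.
```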

We can include support for ARTICLE, BODY, HEAD, LIST and NEWGROUPS commands in
a way very similar to NNTP. The GROUP command in NNTP returns the first and
last article number in the group. This is unlikely to be what we want - as it
depends on the special naming scheme used for network news articles. We are
probably more interested in getting a list of article names in the group. In
my earlier message I suggested that this could best be achieved using the GET
command in conjunction with SINCE and BEFORE parameters (to allow for really
humungous groups with thousands of base notes). The server is responsible for
interpreting this command with the appropriate database query.

A really useful command missing from NNTP is the ability to list the responses
to a given document, i.e. the command names a given document/article, and is
returned with the list of Udi's for documents which were posted with that
article as part of their references. It would be great if this list was sent
by the server along with the base document, as a separate part in a multipart
message.
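
As a minimal Python sketch, a server could maintain a reverse index from each
referenced Udi to the Udi's of the documents posted in response to it. The
class and method names are hypothetical, and the sketch only covers postings
made to this one server.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


class ResponseIndex:
    """Maps a document's Udi to the Udi's of documents posted in response."""

    def __init__(self) -> None:
        self._responses: Dict[str, List[str]] = defaultdict(list)

    def record_post(self, udi: str, references: Iterable[str]) -> None:
        """Record that the document `udi` was posted with these References."""
        for ref in references:
            if udi not in self._responses[ref]:
                self._responses[ref].append(udi)

    def responses_to(self, udi: str) -> List[str]:
        """Return the Udi's posted as responses to `udi`, in posting order."""
        return list(self._responses.get(udi, []))
```

The list returned by responses_to is what the server would send alongside the
base document.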

Finally, I want to draw attention to post-it style annotations. It would be
really nice to be able to post a note at a particular point within a given
document. The browser would show such annotations as little post-it symbols
which you click to see their contents. This requires similar mechanisms to
that for discussion groups. Perhaps we could have an ANNOTATE command:

    ANNOTATE   Udi  /* including an anchor for positioning the annotation */

   (body follows)

Authors place anchors in documents to suit their own needs and not the
unforeseen needs of others. It is therefore necessary to generalise the anchor
syntax in document Udi's to support a more flexible scheme based on pattern
matching. Servers should send the document along with the list of annotations.

Comments please.

Best wishes and sorry for such a long response,

Dave Raggett