Re: Caching Servers Considered Harmful (was: Re: Finger URL)

Kevin Altis (altis@ibeam.jf.intel.com)
Tue, 23 Aug 1994 01:16:30 +0200

At 8:07 PM 8/22/94 +0200, Rob Raisch, The Internet Company wrote:
>On Mon, 22 Aug 1994, Daniel W. Connolly wrote:
>
>> In message <Pine.3.85.9408221141.A462-0100000@enews>, "Rob Raisch, The
>>Internet
>> Company" writes:
>>
>> >You can provide no guarantee that the versions that you present to your
>> >users are accurate or timely.
>>
>> This is extremely misleading, if not just plain incorrect.

It is incorrect! For those of you new to caching, if you haven't already
done so, please go and read the online information first, before making
more comments. Two relevant pages cover the Pragma: no-cache header
<http://info.cern.ch/hypertext/WWW/Proxies/ProtocolAdds.html> and
<http://info.cern.ch/hypertext/WWW/Protocols/HTTP/HTRQ_Headers.html>. The
basic idea is that when the user does a reload, the client sends Pragma:
no-cache to "force" the proxy or server on the other side to return the
latest version of the item rather than a cached copy. First though, you
should probably read the entire Proxy paper
<http://info.cern.ch/hypertext/WWW/Proxies/> and maybe reread the HTTP
protocol and CERN server documentation as well. If you've got technical
questions, Ari or I should be able to answer them. Luckily this stuff
isn't too difficult, and a lot of other folks on this list already
understand the technology and the issues, so they're good resources too.

Based on current usage (this is empirical evidence, not conclusive), most
cache administrators at least check for modified web documents via the
If-modified-since header; it is the default setting, if nothing else. That
check is relatively lightweight in terms of bandwidth, and it still gives
content providers the benefit of seeing a document request. Content
providers: if you aren't running a server that supports If-modified-since,
then you're wasting both your bandwidth and your machine resources.
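
As a rough sketch of that check, assuming the cache has stored the
Last-modified value it received earlier and simply echoes it back: the
function names below are hypothetical, and a real proxy does a good deal
more than this.

```python
def revalidation_request(url, cached_last_modified):
    """Build a conditional GET for a document the cache already holds.

    cached_last_modified is the Last-modified string the server sent
    when the document was first fetched; it is echoed back unchanged.
    """
    return (
        "GET %s HTTP/1.0\r\n"
        "If-Modified-Since: %s\r\n"
        "\r\n" % (url, cached_last_modified)
    )


def choose_body(status_code, cached_body, response_body):
    """Pick which copy to hand to the user after revalidation."""
    if status_code == 304:
        # 304 Not Modified: the server sent headers only, so the
        # cached copy is still current and costs almost no bandwidth.
        return cached_body
    # Any other status: use whatever the server returned.
    return response_body
```

Either way the content provider still sees the request in its logs; it
just doesn't have to resend the document when nothing has changed.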

Servers that don't even send a Last-modified field to facilitate efficient
caching of their documents are just asking for abuse. I'll pick on
<http://www.wired.com/>, since they appear to think it is important to
serve all files via server-side includes just so a last-modified time
shows up at the bottom of the document. The problem is that this hoses the
caching algorithm: the response carries no Last-modified field, so a cache
has nothing to compare against. Some sites that I've checked with had just
configured their NCSA servers incorrectly, and to be fair to WiReD they
might just have a configuration problem too. I can outline what to look
for and how to fix it if you're having a configuration problem. To test
for the problem, simply telnet to the web server in question and request
its home page or some other typical document. If the response doesn't
include a Last-modified field, that server will most likely screw up your
well-intentioned caching for that site.
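
If you'd rather script the telnet test, something like the following
should do. It is only a sketch: fetch_head and has_last_modified are
hypothetical names, it uses HEAD instead of fetching the whole page, and
the Host line is an assumption added because many servers expect it.

```python
import socket


def has_last_modified(raw_response):
    """Return True if the response headers include a Last-modified field."""
    lines = raw_response.splitlines()
    for line in lines[1:]:          # lines[0] is the status line
        if not line.strip():
            break                   # blank line: end of headers
        name, sep, _ = line.partition(b":")
        # Header names are case-insensitive.
        if sep and name.strip().lower() == b"last-modified":
            return True
    return False


def fetch_head(host, path="/", port=80, timeout=10):
    """Send a HEAD request over a plain socket, as one would via telnet."""
    request = "HEAD %s HTTP/1.0\r\nHost: %s\r\n\r\n" % (path, host)
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(request.encode("ascii"))
        while True:
            data = sock.recv(4096)
            if not data:            # server closed the connection
                break
            chunks.append(data)
    return b"".join(chunks)
```

Calling has_last_modified(fetch_head("some.host")) against a site's home
page gives a quick yes or no on whether that site is cache-friendly.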

Caching is over 25 years old (unfortunately I don't have references handy
to the standard computer science books); the cat is out of the bag, folks.
Web caching is not going to go away. In fact, most of today's web browsers
already cache on a per-session basis, and disk-based persistent caching
like the CERN server does will most likely show up in the best Web clients
within the next six months.

Proper caching does not mean that content providers won't see hits. In the
case of corporations or other folks behind firewalls, it does mean that
only one or a few proxy servers will be making requests to the content
servers. It is always possible that some sites or even individual users
will decide to change their caching strategies in order to reduce bandwidth
loads (bandwidth isn't free, you know), possibly by only fetching certain
items once a week. The important point is that it is up to them. If they
want to read the Sunday paper and only the Sunday paper all week, then
that's their business.
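
To make the once-a-week case concrete, here is a small sketch of such a
site-chosen policy. The name needs_refetch and the one-week default are
illustrative assumptions, not settings of any particular server.

```python
WEEK = 7 * 24 * 60 * 60  # one week, in seconds


def needs_refetch(fetched_at, now, max_age=WEEK):
    """Decide whether a cached item is old enough to fetch again.

    fetched_at and now are timestamps in seconds; max_age is whatever
    the cache administrator (or the individual user) chooses.
    """
    return (now - fetched_at) >= max_age
```

Until needs_refetch returns True, the cache keeps serving its stored copy
and makes no request at all to the content server.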

Content providers should be aware that users will use their content with
different frequency, possibly sharing it (even without caching), and should
build that into their cost models. There are valid points to discuss
regarding copyright, and perhaps caching has just brought those issues to
the foreground sooner than expected. The discussion of content copyright
(redundant copies, etc.) should happen on some other list, not here!

ka