Logging user access. Was: a question...

Tim Berners-Lee <timbl@www3.cern.ch>
Date: Wed, 24 Feb 93 12:41:52 +0100
From: Tim Berners-Lee <timbl@www3.cern.ch>
Message-id: <9302241141.AA04763@www3.cern.ch>
To: marca@ncsa.uiuc.edu (Marc Andreessen)
Subject: Logging user access. Was: a question...
Cc: www-talk@nxoc01.cern.ch
Reply-To: timbl@nxoc01.cern.ch

Logging user access is something which we definitely want, for a
number of reasons:

	-  Justifying the project by showing statistics
	-  Demonstrating the readership profiles of different material
	-  Demonstrating the usage profile across sites

The privacy issue is very important, and so I had intended to
log each action "A read B" as "A read something" and "B was read"
independently.  This would give the basic profiles.  Anything further
would be an infringement of privacy, so yes, the user would
have to agree to it. The problem is that the sociological data would
then be immediately filtered ... all the alt.sex.bondage readers would
filter themselves out!  Perhaps two levels are needed.
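
A minimal Python sketch of the two independent tallies described above,
assuming simple in-memory counters. The class name SplitAccessLog and the
reader/document labels are hypothetical, chosen only for illustration:

```python
from collections import Counter


class SplitAccessLog:
    """Tallies "A read something" and "B was read" separately.

    No (reader, document) pair is ever stored, so the log cannot
    answer the question "who read what".
    """

    def __init__(self):
        self.reads_by_reader = Counter()    # "A read something"
        self.reads_of_document = Counter()  # "B was read"

    def record(self, reader, document):
        # The two facts are counted independently; the link
        # between them is discarded as soon as this call returns.
        self.reads_by_reader[reader] += 1
        self.reads_of_document[document] += 1
```

Both basic profiles (usage per site, readership per document) can be read
off the two counters, while the pairing stays unrecoverable.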

The network load is also something which I considered a possible  
problem, so I decided on a scheme (have I said this before?) in which
an event was logged with probability p=exp(-a*t) and the probability
p is included in the message so that the message can be given weight  
1/p in the analysis. The time t over which p decays is measured from
compilation of the source, so you get more fine-grained
info on the new releases.
The messages would be UDP packets so as not to clog gateways.
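
The sampling scheme can be sketched roughly as follows in Python. This is
only an illustration under stated assumptions: the function names, the
dictionary message layout, and the units of a and t are all hypothetical,
and the UDP transport itself is left out.

```python
import math
import random


def log_probability(a, t):
    """Probability p = exp(-a*t) of logging an event.

    t is the time since the source was compiled and a is the decay
    rate (both in assumed, matching units).
    """
    return math.exp(-a * t)


def maybe_log(event, a, t, rng=random):
    """Return a message carrying p, or None if the event is skipped."""
    p = log_probability(a, t)
    if rng.random() < p:
        # p travels with the message so the analysis can reweight it.
        return {"event": event, "p": p}
    return None


def estimate_total(messages):
    """Estimate the true event count by giving each message weight 1/p."""
    return sum(1.0 / m["p"] for m in messages)
```

A message logged with p = 0.25 stands in for about four real events, so
the weighted sum stays an unbiased estimate even as older releases send
fewer and fewer packets.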

We have a monitoring service here which is already monitoring the use  
of other CERN software -- I am not sure whether it is TCP or UDP.

*Coincidence:*  As I write, the file system on our server has JUST
filled up in attempting to process January's server log data....
is this a warning?!

BTW: Marc, you were going to log how LONG an article was read for.
I think that is very tricky... if you can come up with a good measure
of how much the person LIKED the article (automatically) then you
will really have something.  Someone in Stockholm whose name I forget
just gave a talk about inferring document affinities from readership
profiles... using the user as a more refined text comparison program
than a word occurrence engine.  I suggested WWW usage data as a source,
but realized that, for example, in the talk I had just given with
XMosaic, the document which was left on the screen for the longest
time was quite irrelevant.

Something linked with this is finding relevant material for
a particular person.  How about a service which takes someone's
global history file and tells them all that's new in the world
which would interest them?  In other words, if you do keep
data about a particular person, then that can help them find more  
data like it.... a sophisticated form of relevance feedback.

 - - -

I think that, as you are collecting data from the public, the
data should also be made available to the public, with names and
addresses removed.

Another possibility is that all servers keep logs and share the
results... but it will always be incomplete.