Re: Configurable log proposal

"Roy T. Fielding" <fielding@simplon.ICS.UCI.EDU>
To: Ari Luotonen <luotonen@ptsun00.cern.ch>
Cc: www-talk@www0.cern.ch
Subject: Re: Configurable log proposal 
In-reply-to: Your message of "Tue, 01 Feb 1994 21:45:20 +0100."
             <9402012045.AA00523@ptsun03.cern.ch> 
Date: Tue, 01 Feb 1994 14:43:00 -0800
From: "Roy T. Fielding" <fielding@simplon.ICS.UCI.EDU>
Message-id: <9402011443.aa24929@paris.ics.uci.edu>
Content-Length: 3133

Ari said:

> How about having a fully configurable log files, something
> that would understand escapes like this: ...

Although on the surface this seems like a good idea, I want to determine
exactly why it is desirable and make it clear where the pitfalls lie.

I can think of three reasons why it is desirable:

1) Make the largest set of known access data available for logging;

2) Allow the individual service provider to choose the exact subset which
   should NOT be logged, thus saving local disk space;

3) Make the log file format consistent across multiple servers/services
   (assuming those servers follow the same log conventions).

Note, however, that 1 and 3 are also accomplished by establishing a
fixed format which contains all the information.

Now, as for the pitfalls:

1) There is no compelling need for log flexibility other than saving
   disk space.  Regardless of how the data is organized, people will still
   want to use some log analyzer to view the data.  The format should
   therefore be designed primarily for machine readability and only
   secondarily for occasional human reading or grep search.  The log
   analyzer will have its own (possibly configurable) output format for
   human consumption.

2) The content of the data is only known at the time the log entry is
   made.  Any information that was not written at that time is lost to
   any later analysis.  Thus, it is usually preferable to log everything
   and let the log analysis program choose what should be ignored.

3) Every time the configuration changes, the old log file must be deleted.
   This is because any log analyzer will only be able to understand one
   log format at a time.

4) Special formatting conventions (like the square brackets surrounding
   the date field in NCSA httpd logs) make it much easier for analyzers
   to parse the data and identify mangled entries -- a condition which
   occurs quite often with NCSA httpd.  A freely configurable format
   makes it easy to drop such conventions by accident.

5) It makes it slightly harder for people like me to write and test a
   simple log analyzer program.
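
To illustrate pitfall 4, here is a minimal Python sketch of how a
delimiter convention lets an analyzer reject mangled entries.  The
three-field layout ("host [date] request"), the sample lines, and the
names ENTRY and parse_entry are assumptions made for illustration; this
is not meant to reproduce the exact NCSA httpd format.

```python
import re

# Assumed layout, for illustration only: "<host> [<date>] <request>".
# The date must sit inside exactly one pair of square brackets.
ENTRY = re.compile(r"^(\S+) \[([^\[\]]+)\] (\S.*)$")

def parse_entry(line):
    """Return (host, date, request), or None if the line looks mangled."""
    match = ENTRY.match(line.rstrip("\n"))
    return match.groups() if match else None

good = "simplon.ics.uci.edu [Tue Feb  1 14:43:00 1994] GET /index.html"
# Hypothetical mangled line: a second entry was written into the
# middle of the first one.
mangled = ("simplon.ics.uci.edu [Tue Feb  1 14:4"
           "paris.ics.uci.edu [Tue Feb  1 14:43:01 1994] GET /")

print(parse_entry(good))
print(parse_entry(mangled))
```

Because the brackets must pair up around the date, the interleaved line
fails to match and can be counted or skipped instead of being misread.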


Having said all that, I still think that it may be a good idea, provided
that the above concerns are addressed (i.e., I have faith that the server
authors will go out of their way to make my life easier, provided that
I let them know what will make my life easier).  Although I personally
would prefer a fixed format, I am willing to go with the flow.

In that spirit, let me propose that some generic indicator (such as "-")
be used for any field which is requested by the configuration string (or
by the fixed format) but is unknown or empty for a particular log entry.
Thus, if the configuration string indicates that REMOTE_IDENT should be
logged between FULL YEAR and CLIENT HOST ADDRESS (as in "%Y %I %C"), and
REMOTE_IDENT is empty, then the output should be like:

        "1994 - simplon.ics.uci.edu"

rather than

        "1994  simplon.ics.uci.edu"

for reasons that should be obvious to most hackers.
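
For anyone who wants the reason spelled out, here is a minimal Python
sketch.  Only the three escapes from the example above are handled; the
ESCAPES table, the field names, and format_entry are hypothetical and
not part of any server.

```python
import re

# Hypothetical escape table covering only the example's three escapes.
ESCAPES = {"Y": "year", "I": "remote_ident", "C": "client_host"}

def format_entry(fmt, fields, placeholder="-"):
    """Expand %X escapes in fmt; unknown or empty values become placeholder."""
    def expand(match):
        key = ESCAPES.get(match.group(1))
        if key is None:
            return match.group(0)  # leave unrecognized escapes untouched
        value = fields.get(key)
        return str(value) if value not in (None, "") else placeholder
    return re.sub(r"%([A-Za-z])", expand, fmt)

entry = {"year": 1994, "remote_ident": "", "client_host": "simplon.ics.uci.edu"}

with_dash = format_entry("%Y %I %C", entry)
without_dash = format_entry("%Y %I %C", entry, placeholder="")

# Splitting on whitespace, as a simple analyzer might:
print(with_dash.split())     # three fields, each in its expected position
print(without_dash.split())  # two fields; the host lands in the ident slot
```

With the placeholder, every entry has the same number of fields, so an
analyzer can index them by position without guessing which one is missing.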


....Roy Fielding   ICS Grad Student, University of California, Irvine  USA
                   (fielding@ics.uci.edu)
    <A HREF="http://www.ics.uci.edu/dir/grad/Software/fielding">About Roy</A>