Footprints
Web Statistician Software
Another Small Step for the Web Kind ...
Copyright © 1996-1997
Last Edited: Sep 16, 1997

[ Purpose ] [ Typical Usage ] [ Features ] [ Installing ] [ Revisions ]

Purpose:
To create an all-purpose Web statistician that parses the common log format using PERL5.
What do you mean by all-purpose?

  1. Produce statistics from two alternative perspectives: a list of which URLs each remote host read, and a list of which hosts read each URL. Sometimes you want to track statistics on a URL or a section of URLs, while other times you may want to track a particular host or domain.

  2. Compile statistics using an optional disk-based medium rather than memory. This lets you compile statistics on huge logs that you know will not fit into memory. Plus, you can "save state" and produce multiple displays from the same lump of compiled statistics.

  3. Organize statistics into groups such as HTTP status, so you can display which document requests "failed", were "redirected", were "cached", ...

  4. Produce results in both flat text and HTML formats with graphs.

  5. Make it as easy as possible to parse log formats other than the "common" log format. Since "virtual hosted" machines can have uniquely strange formats, we hope you will only have to change one line of our code: "the" PERL regexp that defines how your server logs connections.

  6. Print statistics based on a regular expression. This includes egrep expressions like ``*(here|there)*''.

  7. Parse log data from a file named on the command line or piped in on standard input. This allows possibilities like piping a directory's worth of compressed logs into the parser and displaying months or even years' worth of statistics.

  8. Compile statistics on one or more directories of log files selected by a regular expression.

  9. Display statistics on only the most heavily used sections/sites.
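
To make the "two perspectives" of item 1 concrete, here is a minimal sketch. It is written in Python purely for illustration (Footprints itself is a Perl script, and this is not its code); the function name build_indexes and the sample requests are made up for the example.

```python
from collections import defaultdict

def build_indexes(requests):
    """Count hits from both perspectives: host -> URL and URL -> host."""
    by_host = defaultdict(lambda: defaultdict(int))
    by_url = defaultdict(lambda: defaultdict(int))
    for host, url in requests:
        by_host[host][url] += 1   # which URLs did this host read?
        by_url[url][host] += 1    # which hosts read this URL?
    return by_host, by_url

# Hypothetical sample data, not taken from a real log.
sample = [
    ("foo.com", "/home/baz/"),
    ("foo.com", "/projects/"),
    ("bar.edu", "/home/baz/"),
]
by_host, by_url = build_indexes(sample)
print(sorted(by_host["foo.com"]))     # URLs read by one host
print(sorted(by_url["/home/baz/"]))   # hosts that read one URL
```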
Ok, so why is it called footprints?

Almost every other Perl script just crunches numbers. This script also outlines how a site traversed the pages matching your search criteria. In other words, we display "how" a site read your web pages, or the "footprints" it left behind. This is optional, and it can be rather CPU/memory intensive depending on the scale of your data.
Ok, so big deal ... what else you got?

We've noticed that a lot of other Web statisticians parse the common log format incorrectly. We check for HTTP error 400, "Bad Request", because the server logs funky things when it records botched transactions. For example, we've noticed parsing quirks like these:

Sometimes, the URL request is blank: "" ...
foo.com - - [06/Oct/1996:00:32:48 -0400] "" 400 207

Sometimes, the bytes sent is nil: "-" ...
foo.com - - [08/Sep/1996:00:35:42 -0400] "GET /home/baz/ HTTP/1.0" 200 -

Typically, these statistics get lumped into some random category, which is not necessarily the right thing.
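
The two quirks above can be handled explicitly instead of falling into a random bucket. The following is only a sketch in Python (Footprints is Perl, and its actual regexp is not shown on this page). The pattern CLF, the function parse_line, and the choice to count "-" as 0 bytes are all assumptions made for the example.

```python
import re

# One pattern describes the whole log line, so supporting another format
# would mean changing only this regexp.
CLF = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)$'
)

def parse_line(line):
    """Return a dict for one common-log-format line, or None if it does not match."""
    m = CLF.match(line.strip())
    if m is None:
        return None
    parts = m.group("request").split()
    # A blank request ("") has no URL; record None rather than guess one.
    url = parts[1] if len(parts) >= 2 else None
    raw = m.group("bytes")
    return {
        "host": m.group("host"),
        "url": url,
        "status": int(m.group("status")),
        # Assumption: "-" means nothing was sent, so count it as 0 bytes.
        "bytes": 0 if raw == "-" else int(raw),
    }

blank = parse_line('foo.com - - [06/Oct/1996:00:32:48 -0400] "" 400 207')
nobytes = parse_line('foo.com - - [08/Sep/1996:00:35:42 -0400] "GET /home/baz/ HTTP/1.0" 200 -')
print(blank)
print(nobytes)
```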

Is it abso-smurfly free?

But of course! If you thought the days of Internet freeware were gone, we offer some hope at footprints.tar.gz or footprints.tar.Z.


Typical usage:
footprints -key <reg-exp>
To compile statistics on <reg-exp>, where <reg-exp> can include special characters like period (.) and asterisk (*). For instance, a search key like "/projects" will compile statistics on the URL "/projects" only. However, a search key like "/projects/*" will compile statistics on any URL starting with "/projects/". Finally, a search key like "*projects*" will compile statistics on any URL with the word "projects" in it. The search key "*" will compile statistics on every URL accessed (which is the default).
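
One way to read the search key rules above is "asterisk means anything, and the key must match the whole URL". The Python sketch below works under exactly that assumption; it is an illustration, not the translation Footprints performs, and key_to_regex and matches are hypothetical names.

```python
import re

def key_to_regex(key):
    """Compile a search key, treating '*' as 'any run of characters'."""
    # Assumption: everything other than '*' is passed through as regexp
    # syntax, so '.' and '(here|there)' keep their usual meaning.
    return re.compile(key.replace("*", ".*"))

def matches(key, url):
    """True if the key matches the entire URL."""
    return key_to_regex(key).fullmatch(url) is not None

print(matches("/projects", "/projects"))
print(matches("/projects", "/projects/x.html"))
print(matches("/projects/*", "/projects/x.html"))
print(matches("*projects*", "/old/projects/a"))
```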
footprints -key <reg-exp> -html
To compile statistics on <reg-exp>, but output in HTML.
footprints -key <reg-exp> -ascending
To compile statistics on <reg-exp>, but display statistics in ascending order rather than descending.
footprints -key <reg-exp> -filename </path/different/logfile>
To compile statistics on <reg-exp>, but read the log from the path and filename you define.
footprints -key <reg-exp> -internal
To compile statistics on <reg-exp>, but discard all requests from internal machines (machines with similar IP addresses). (See $my_IP_number and $my_IP_name)
footprints -key <reg-exp> -images
To compile statistics on <reg-exp>, but discard all requests for "images". Images are defined as files with the suffix gif, jpg, jpeg, xbm, tif, or png.
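
The image test boils down to a suffix check. Here is a small Python sketch of it (for illustration only; Footprints is Perl). The name is_image is hypothetical, and ignoring letter case and any query string are assumptions of the example.

```python
IMAGE_SUFFIXES = (".gif", ".jpg", ".jpeg", ".xbm", ".tif", ".png")

def is_image(url):
    """True if the URL path ends in one of the image suffixes listed above."""
    # Assumption: a query string is not part of the file name.
    path = url.split("?", 1)[0]
    # Assumption: suffixes are compared without regard to case.
    return path.lower().endswith(IMAGE_SUFFIXES)

urls = ["/pics/logo.GIF", "/home/baz/", "/projects/index.html"]
print([u for u in urls if not is_image(u)])
```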
gzcat log.gz | footprints -key <reg-exp> -
Typically, old log files get compressed. As a result, this script accepts STDIN (the trailing "-") so that gzip'd logs can be piped into footprints.
footprints -key <reg-exp> -directory <directory> -expr <reg-exp>
Sometimes, you may want to compile statistics on an entire directory of log files. With these two options you can define where to look and which files to read. For example, to read all files in a directory that are suffixed with ".log", you could do something like -directory ~logs -expr "*.log". (See $default_directory and $default_expr)
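
As an illustration of what -directory plus -expr selects, here is a Python sketch (not the Footprints code, which is Perl). It assumes the expression behaves like a shell-style wildcard, as the "*.log" example suggests; select_logs is a hypothetical name, and the temporary directory only exists to make the example self-contained.

```python
import fnmatch
import os
import tempfile

def select_logs(directory, expr):
    """Return sorted paths of regular files in directory whose names match expr."""
    paths = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if fnmatch.fnmatch(name, expr) and os.path.isfile(path):
            paths.append(path)
    return paths

# Build a throwaway directory with made-up file names to try it out.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("access.log", "error.log", "notes.txt"):
        open(os.path.join(tmp, name), "w").close()
    chosen = [os.path.basename(p) for p in select_logs(tmp, "*.log")]

print(chosen)
```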
footprints -key <reg-exp> -verbose > ~/stats.txt
Some logs tend to be huge, so you may want to redirect the output to a file. The verbose option displays a cute "thermometer" animation that represents how much time is left to finish.
footprints -key <reg-exp> -minhits <#> -maxhits <#>
If your output is just too large, then you can narrow the range of output by displaying only the sites, URLs, or domains with at least (-minhits) or at most (-maxhits) the number of hits you define.

Note: you may define minhits and maxhits for each section of output: url, site (host), and domain. For instance, you can define something like: -minhits url=10 -minhits site=5 -minhits domain=15.

footprints -key <reg-exp> -top <#>
If your output is still too large, then you can narrow the range of output by displaying only the top "#" of sites or URLs.

Note: you may define top for each section of output: url, site (host), and domain. For instance, you can define something like: -top url=10 -top site=5 -top domain=15.
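
To show how -minhits, -maxhits, and -top narrow one section of output, here is a Python sketch (illustration only; Footprints is Perl). The page does not say how the options combine, so applying the hit range first and the top count second is an assumption, as are the name narrow and the sample counts.

```python
def narrow(counts, minhits=None, maxhits=None, top=None):
    """Filter a {name: hits} mapping by hit range, then keep the top N rows."""
    rows = [
        (name, hits)
        for name, hits in counts.items()
        if (minhits is None or hits >= minhits)
        and (maxhits is None or hits <= maxhits)
    ]
    # Descending by hits; the name breaks ties so the order is repeatable.
    rows.sort(key=lambda row: (-row[1], row[0]))
    return rows if top is None else rows[:top]

# Hypothetical hit counts for the "url" section.
url_hits = {"/a": 12, "/b": 3, "/c": 7, "/d": 40}
print(narrow(url_hits, minhits=5))
print(narrow(url_hits, minhits=5, maxhits=20))
print(narrow(url_hits, top=2))
```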

footprints -key <reg-exp> -nofootprints {blank, "all", "site", or "url"}
If you don't want every possible trace of footprints, you can define which set of footprints you do NOT want to calculate and display.

Note: a blank definition defaults to "all", meaning you don't want any footprints.

footprints -key <reg-exp> -noerrors {blank, "all", "site", or "url"}
If you don't want every possible trace of HTTP error footprints, you can define which set of error footprints you do NOT want to calculate and display.

Note: a blank definition defaults to "all", meaning you don't want any error footprints.

footprints -key <reg-exp> {-today, -yesterday, -tomorrow}
To compile statistics on just today or yesterday, or tomorrow. :-)
footprints -key <reg-exp> -start <start date> -stop <stop date>
To compile statistics on a certain date range. Dates can be given in any of the following formats, where Mon is the abbreviated month name (for example, Sep):
  • d/Mon/yy (?)
  • dd/Mon/yy
  • d/Mon/yyyy (?)
  • dd/Mon/yyyy
Note: it is not necessary to have both start and stop defined. You may simply use one or the other.
WARNING: results for start/stop periods will have discrepancies, since the time order of some logging is imperfect. Note, however, that the amount of error for the start time is equivalent to the amount of error for the stop time. This can be viewed two ways: they can cancel each other out, or you're getting twice the error. How painful!
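
The date formats accepted by -start and -stop can be sketched as follows. This is Python for illustration only, not the Footprints parser. Reading a two-digit year as 19yy, treating the stop date as inclusive, and the names parse_date and in_range are all assumptions of the example.

```python
import re
from datetime import date

MONTHS = {name: number for number, name in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"], start=1)}

def parse_date(text):
    """Parse d/Mon/yy, dd/Mon/yy, d/Mon/yyyy, or dd/Mon/yyyy into a date."""
    m = re.fullmatch(r"(\d{1,2})/([A-Za-z]{3})/(\d{4}|\d{2})", text)
    if m is None or m.group(2).title() not in MONTHS:
        raise ValueError("unrecognized date: " + text)
    year = int(m.group(3))
    if len(m.group(3)) == 2:
        year += 1900  # Assumption: a two-digit year means 19yy.
    return date(year, MONTHS[m.group(2).title()], int(m.group(1)))

def in_range(day, start=None, stop=None):
    """True if day is within the range; either bound may be left out."""
    # Assumption: both the start and the stop date are inclusive.
    return (start is None or day >= start) and (stop is None or day <= stop)

print(parse_date("6/Oct/96"), parse_date("06/Oct/1996"))
print(in_range(date(1996, 9, 8), start=parse_date("1/Sep/96")))
```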


Features:
Feature                                                              Status
Track url viewing by site or by url                                  Version 0.9
User defined regular expression as search key                        Version 0.9
Read and parse an entire directory of logs                           Version 0.9
Search by user defined time and/or date                              Version 0.9
HTTP Error tracking (percentage of entries, percentage of kilobyte)  Version 0.9
Total hits (Images)                                                  Version 0.9
Total bytes (Images)                                                 Version 0.9
Internal requests vs External Requests (Images)                      Version 0.9
Total hits (Urls/Text)                                               Version 0.9
Total bytes (Urls/Text)                                              Version 0.9
Internal requests vs External Requests (Urls/Text)                   Version 0.9
Total Bytes transferred (Internal Requests)                          Version 0.9
Total Bytes transferred (External Requests)                          Version 0.9
Total Bytes transferred (External Requests that were cached)         Not implemented yet?
Total Accesses requested by robots                                   Not implemented yet?
Produce graphs for output                                            Not implemented yet?
Memory vs. Disk cache                                                Not implemented yet?
-exclude, -only-internal, and -only-images                           Not implemented yet?
Why not ftpstats                                                     Not implemented yet?
Stats by a time period: Hour, Day, Week, Month, Year                 Not implemented yet?
User set interval                                                    Not implemented yet?


Unpacking and Installing:
Use gunzip or uncompress depending on the tar file you downloaded. Here are a few things you'll have to change:
  1. Change the location of PERL on line 1.
  2. Change the location of "country_codes.txt" for the $domain_list.
  3. Change the name of _your_ "default" domain name and IP in both $my_IP_name and $my_IP_number.
  4. Change the location of _your_ "default" web server log file via $default_directory and $default_expr.


Revisions:
March 17, 1997 Our first "get it out the door" release. Too early to tell, but I'm sure there'll be some slight, ahem "modifications".
May 18, 1997 Major re-write of how we store data. The dictionary is now one dimension deep in order to allow a disk-based medium (dbm storage rather than memory). We're now enumerating all the possible ways people would want to display the statistics and how this reflects on the command line argument structure. Essentially, you have this "footprint" object, how do you tell it what you want to do... what makes the most sense ...
Sep 16, 1997 Log parsing section has been rewritten and is about 2-3 times faster. We're still having problems with disk-based storage and haven't found a solution that is fast enough for our tastes yet. Time periods are now done entirely in epoch format, in preparation for a future edition with user set time period averages. Version 0.91 has been in production use at our site for the last few weeks and the results have been excellent.
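
For the curious, the "one dimension deep" idea from the May 18 entry can be sketched like this: instead of a nested host/URL dictionary, every counter lives under a single flat key, which is the shape a dbm file wants. This is a Python illustration only and not the Footprints storage code (which is Perl). The tab separator, the name record_hit, and the sample hits are assumptions.

```python
import dbm.dumb
import os
import tempfile

SEP = "\t"  # Assumption: a tab never appears in a host name or URL.

def record_hit(db, host, url):
    """Increment the counter stored under the flat key 'host<TAB>url'."""
    key = (host + SEP + url).encode()
    count = int(db[key]) if key in db else 0
    db[key] = str(count + 1).encode()

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "stats")
    db = dbm.dumb.open(path, "c")
    record_hit(db, "foo.com", "/home/baz/")
    record_hit(db, "foo.com", "/home/baz/")
    record_hit(db, "bar.edu", "/projects/")
    db.close()

    # Reopen the file to show that the counts were saved on disk.
    db = dbm.dumb.open(path, "r")
    saved = {key.decode(): int(db[key]) for key in db.keys()}
    db.close()

print(saved)
```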

Brought to you by the partners of ... thigpen@ccs.neu.edu and danielr@ccs.neu.edu