ServerSolved

README for superstat

©1999 Pete Nelson
http://www.serversolved.com/superstat/

A Special Note: This program is no longer maintained by the author.

This program may soon be replaced - no sooner had I posted it online than I started to notice its shortcomings. For example -- once the meta file is generated, you have a list of pages visited and their number of hits; a list of clients and their number of hits; a list of 404s and their number of hits -- but none of these lists relate back to each other in any way.

So, I'm working on a bigger, better log analysis tool (possibly utilizing a database such as MySQL or Oracle). Here is a preliminary version, but it's far from complete (I don't recommend downloading it unless you have the time and experience to deal with any problems it may cause). More will come soon.

Also, it has been brought to my attention that very large log files (such as a gigabyte) can cause the application to run out of memory -- not too shocking, really.

Superstat was one of the first Perl programs I wrote that felt worth distributing. While it has its limitations, I'm hoping that, if nothing else, the program might give someone insight into ways Perl & regular expressions can be used.

Copyright Info

You may use, edit, and distribute this program free of charge, on the condition that the copyright information remains intact.

The sale of this program is expressly forbidden.

What does it do?

Superstat takes a server's access log and generates statistics including top requests, most frequently visited directories, top clients, and even an hourly breakdown of activity. The most common way to use superstat is with the command:

	superstat --infile=access.archive --outfile=stats.html

(If you don't have the 'Getopt::Long' module for Perl, you would use the command 'superstat access.archive stats.html' to achieve the same effect.)
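For readers curious how this kind of tally works, here is a minimal sketch. It is not superstat's actual code. It assumes the log is in Common Log Format, and the sub name 'tally_line', the hash names, and the sample log lines are all invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical tallies; superstat's real data structures may differ.
my (%requests, %clients, %hours, %not_found);

# Parse one Common Log Format line and update the tallies.
# Returns 1 if the line matched the pattern, 0 otherwise.
sub tally_line {
    my ($line) = @_;
    return 0 unless $line =~ m{
        ^(\S+)\s+\S+\s+\S+\s+          # client host, ident, user
        \[[^:\]]+:(\d\d):[^\]]*\]\s+   # timestamp, capturing the hour
        "\S+\s+(\S+)[^"]*"\s+          # request line, capturing the URL
        (\d{3})\s+\S+                  # status code and byte count
    }x;
    my ($client, $hour, $url, $status) = ($1, $2, $3, $4);
    $clients{$client}++;
    $hours{$hour}++;
    $requests{$url}++;
    $not_found{$url}++ if $status == 404;
    return 1;
}

# Invented sample lines standing in for an archived access log.
my @sample = (
    '10.0.0.1 - - [06/Nov/1999:13:45:02 -0600] "GET /index.html HTTP/1.0" 200 2326',
    '10.0.0.2 - - [06/Nov/1999:13:50:10 -0600] "GET /index.html HTTP/1.0" 200 2326',
    '10.0.0.1 - - [06/Nov/1999:14:01:33 -0600] "GET /missing.html HTTP/1.0" 404 209',
);
tally_line($_) for @sample;

# Print requests sorted by hit count, highest first.
for my $url (sort { $requests{$b} <=> $requests{$a} || $a cmp $b } keys %requests) {
    printf "%5d  %s\n", $requests{$url}, $url;
}
```

Note that each hash is filled independently, which is exactly why the lists cannot be related back to each other afterwards.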

View an Example

It is highly advised that you not run superstat on an active log file! This is just a bad idea in general. You should always rotate (archive) the access log or at least make a copy of the log to analyze. If you run this program on an active log file, it is quite possible your server will stop writing to it!

Superstat also creates a metafile through the 'Storable' module for Perl to save the hashes generated from parsing the log. (It's a very straightforward use of Storable - see Storable's documentation for more details.) The metafile is handy for changing the formatting of the output file without having to parse the log file again. As far as I know, you can use any option on a metafile and receive the same results as reparsing the log. The one exception would be --infile (duh!).
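As a rough illustration of the metafile idea, the sketch below saves a nested hash with Storable's 'store' and reads it back with 'retrieve'. The layout of '%stats' and the temporary file are assumptions made for the example; superstat's real metafile contents may be organized differently.

```perl
use strict;
use warnings;
use Storable qw(store retrieve);
use File::Temp qw(tempfile);

# Invented tallies standing in for the hashes built while parsing a log.
my %stats = (
    requests => { '/index.html' => 42, '/about.html' => 7 },
    clients  => { '10.0.0.1'    => 30, '10.0.0.2'   => 19 },
);

# Hypothetical metafile; a temp file keeps the example self-contained.
my ($fh, $metafile) = tempfile(SUFFIX => '.meta', UNLINK => 1);
close $fh;

store(\%stats, $metafile);          # write the nested hashes to disk
my $loaded = retrieve($metafile);   # read them back as a hash reference

printf "index.html hits: %d\n", $loaded->{requests}{'/index.html'};
```

Because the saved structure already holds the counts, a later run can change formatting options and regenerate the page without touching the original log.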

A Note on Modules

This program uses three modules by default. You can still analyze access logs and generate a web page without these modules, but you'll have to make a few simple changes, and you'll miss out on a few features!

Storable
used to create a 'meta' file for faster loading.
Comment out the 'use Storable;' line, and add '$NO_STOR = 1;'
Socket
used for performing hostlookups.
Comment out the 'use Socket;' line, and add '$NO_SOCK = 1;'
Getopt::Long
used to parse command line options.
Comment out the 'use Getopt::Long;' and add '$NO_OPT = 1;'
(you can still use all the options by including them in the config file)

Of course, all three modules are freely available from CPAN.

(If you're using Win32, you can download precompiled modules for ActiveState's Perl. You'll need ppm to install them -- see the ActiveState documentation for details.)

Download

You can download superstat.tar.gz for Unix systems, or superstat.zip for Win32. The only difference is the Win32 version has those goofy characters at the end of each line that CPM DOS likes so much. (c'mon guys, I'm using a TTY, not a printer!)

These archives contain:

You'll also need Perl 5.
(For Win32: you can download a very stable version of Perl from www.ActiveState.com)

Installation

This should be a cakewalk. Set '$DEFAULT_CONFIG_FILE' to wherever you keep superstat.cfg, and have '$null_dev' point to '/dev/null' (Unix) or 'nul' (DOS).

Usage

If you have the Getopt::Long module available, you have the following switches:

--infile=filename
the access log you want to analyze. If the name ends with '.meta', the file is imported as a metafile
--outfile=filename
The HTML output file. (You can specify 'STDOUT')
--[no]graph
use --graph to generate an hourly breakdown
--[no]hostlookup
do hostlookups for Top Clients (off if $NO_SOCK is set)
--use_meta_default
if filename.meta exists, use the meta file
--silent | --quiet
Does not send messages to STDOUT. (Setting 'outfile=STDOUT' automatically sets --silent, sending only the HTML to STDOUT instead)
--hidden_dirs=list
a quoted list of directories that you don't want displayed on the generated page. '/cgi-bin/' is a good example. Note that the command line option does not overwrite the list in the config file, but appends to it. There is no way to display directories listed as 'hidden' in the config file.
--least_hits=n
Don't show URLs that got fewer than n hits
--max_dirs=n
Show the top n dirs
--max_clients=n
Show the top n clients
--server=servername
The name of your server
--body_def=s
Anything here is placed between '<BODY' and '>' in the generated HTML
(eg: bgcolor=white text=black)
--table_def=s
Like --body_def, but between '<TABLE' and '>'
--table_header_def=s
Like --body_def, but between '<TH' and '>'
--title=s
Sets the title for the generated page. Variables in this string are interpolated right before the title is used, so you can use almost any variable in the program. See the superstat.cfg file for more details.
--img_dir=dir
the directory containing 'just_green.gif' and 'just_red.gif' for the hourly breakdown graph. Should be relative to the final location of the output file, or to the DocumentRoot of your web server if it starts with '/'
--config=configfile
the path to the config file

If you don't have the Getopt::Long module, you will have to place any settings in the config file (isn't that better than typing them every time, anyway?)

superstat will also accept command line arguments for the infile & outfile:

	superstat access.110699 stats.html

The command line switches take precedence over everything else.

	superstat --infile=access.110699 --outfile=stats.html
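The sketch below shows one way such precedence could be arranged with Getopt::Long. It is not taken from superstat: the sub name 'parse_args' is hypothetical, only a handful of the switches are declared, and ranking plain arguments above the config file is an assumption (the text above only says that switches win).

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Merge config-file settings, plain arguments, and switches into one hash.
# $config stands in for values already read from the config file.
sub parse_args {
    my ($argv, $config) = @_;
    my @args = @$argv;   # copy, because parsing removes the switches it recognizes
    my %cli;

    GetOptionsFromArray(
        \@args, \%cli,
        'infile=s', 'outfile=s',
        'graph!',          # accepts both --graph and --nograph
        'hostlookup!',     # accepts both --hostlookup and --nohostlookup
        'least_hits=i', 'max_dirs=i', 'max_clients=i',
        'silent|quiet',    # --quiet is stored under the key 'silent'
    ) or die "bad command line options\n";

    # Whatever is left over is treated as "infile outfile".
    my %positional;
    $positional{infile}  = $args[0] if defined $args[0];
    $positional{outfile} = $args[1] if defined $args[1];

    # Later entries win: config file, then plain arguments, then switches.
    return { %$config, %positional, %cli };
}

my $opt = parse_args(
    [ 'access.110699', 'stats.html', '--nograph' ],
    { graph => 1, max_clients => 10 },
);
print "$_ = $opt->{$_}\n" for sort keys %$opt;
```

The '!' suffix in an option spec is what gives the '--[no]graph' style of switch listed above.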

History

I started writing this program for the City of Saint Paul, Minnesota (http://www.ci.saint-paul.mn.us/) for analyzing server access logs. We had several different servers - Netscape FastTrack, Apache, etc., and had a perl script that I found myself hacking for each different log file. This program was written to avoid having different programs for different logs.

The logs we were analyzing were often 500k lines long, so they took some time to parse. The first versions sent the HTML output to STDOUT, but too many times the program would run for hours without any indication of what it was doing, and eventually I'd have to give it the ol' Ctrl-C, not knowing if it was still going, or had gotten stuck somewhere. This is when I decided that it needed to send messages to STDOUT so the operator can see what it's doing.

The hostlookup feature was added since our servers have hostlookups turned off. Originally, I had the program doing hostlookups for every client that was in the log. Andrew Moravec had the good sense to point out, "Hey, why don't you only do it for the ten-or-so clients that you're going to display?" Thanks Andy! (and Duh! - why didn't I think of that?!?). This means that the hostnames are not stored in the metafile, and have to be looked up every time, but it's still way better than looking up all of them.
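Andy's suggestion can be sketched roughly as follows: sort the clients by hit count, keep the top few, and resolve only those. The sub names 'lookup_host' and 'top_clients' and the sample data are invented here, and the resolver is passed in as a code reference purely so the demo can run without touching the network.

```perl
use strict;
use warnings;
use Socket qw(inet_aton AF_INET);

# Reverse-resolve a dotted-quad address; fall back to the address itself.
sub lookup_host {
    my ($ip) = @_;
    my $packed = inet_aton($ip) or return $ip;
    my $name = gethostbyaddr($packed, AF_INET);
    return defined $name ? $name : $ip;
}

# Return the top $max clients as [name, hits] pairs, resolving only those.
sub top_clients {
    my ($hits, $max, $resolver) = @_;
    $resolver ||= \&lookup_host;
    my @sorted = sort { $hits->{$b} <=> $hits->{$a} || $a cmp $b } keys %$hits;
    splice @sorted, $max if @sorted > $max;   # drop everything past the top $max
    return map { [ $resolver->($_), $hits->{$_} ] } @sorted;
}

# Invented tallies; a real log would have thousands of clients.
my %client_hits = ('10.0.0.1' => 30, '10.0.0.2' => 19, '10.0.0.3' => 4);

# A stub resolver keeps this demo off the network; omit it to use lookup_host.
my @top = top_clients(\%client_hits, 2, sub { "host-for-$_[0]" });
printf "%-20s %d\n", @$_ for @top;
```

With this arrangement the number of lookups is bounded by the number of clients displayed, not by the size of the log.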

The hourly breakdown graph I added mostly because I wanted to see if I could do it. just_red.gif & just_green.gif are just that - one pixel of red (or green). If you lose the original image (or never got it), it shouldn't be too hard to make a new one. The width of the image for the graph is deduced by this formula:

$imgwidth = int((hits-this-hour) * (400 / (most-hits-an-hour) ));

So the maximum length is 400 pixels. If the number of hits is greater than average, it uses red, otherwise, green. If there are no hits for a given hour, it's not displayed.
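To make the arithmetic concrete, here is a small sketch that applies the formula to some invented hourly counts. The sub name 'hourly_bars' is hypothetical, and taking the average over only the hours that had traffic is an assumption, since the text does not say how the average is computed.

```perl
use strict;
use warnings;

# Return [hour, width, color] rows for every hour that had at least one hit.
sub hourly_bars {
    my ($hits_by_hour) = @_;   # hash reference: hour => hits
    my @hours = grep { $hits_by_hour->{$_} > 0 } sort keys %$hits_by_hour;
    return () unless @hours;

    my ($most, $total) = (0, 0);
    for my $h (@hours) {
        $total += $hits_by_hour->{$h};
        $most = $hits_by_hour->{$h} if $hits_by_hour->{$h} > $most;
    }
    # Assumption: the average covers only the hours that had traffic.
    my $average = $total / @hours;

    return map {
        my $n = $hits_by_hour->{$_};
        # Same shape as the formula above: hits * (400 / busiest hour).
        [ $_, int($n * (400 / $most)), $n > $average ? 'red' : 'green' ]
    } @hours;
}

# Invented counts; hour 03 has no hits, so it produces no row.
my %hits = ('03' => 0, '09' => 50, '13' => 100, '14' => 10);
printf "%s:00  %3d px  %s\n", @$_ for hourly_bars(\%hits);
```

The busiest hour gets the full 400 pixels and every other bar is scaled in proportion to it.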


The User-Agent info was added out of curiosity. Being a City Government website, we wanted to make our pages accessible to everyone, not just those with the latest & greatest browsers. We used this information to show other departments how important it was that we not alienate people just for using an older browser. As it turned out, about 25% of our customers were still using level 3 browsers at the end of 1999. I also wanted to see what browser strings were out there besides Netscape & IE. So, I added an 'other_browsers' array. When I found one that was pretty common (like AOL), I added it to the list. I was a little surprised when I first ran the program with the other_browsers feature to find a browser named 'F**k You/2.0'. So I added a little regex substitution to clean things like this up.

Special Thanks

To Dennis Grittner, former webmaster for the City of Saint Paul, Minnesota. Thank you for never accepting "It can't be done."

Andrew Moravec, and the constant challenge of keeping up with him, and his excellent examples and suggestions.

O'Reilly & Associates for publishing (IMHO) the best programming books, including The Perl Cookbook by Tom Christiansen & Nathan Torkington. This book has been an incredible asset not only in writing this program, but in programming in general. Thanks for showing me the other 'Ways To Do It'.

And of course Larry Wall, author of Perl. His decision to have Perl be free and open source has allowed it to grow to be one of the most robust programming languages available. (and for free!)

