Desperately Seeking Web Log File Standards

As any webmaster or search engine marketer knows, you can’t measure the success of your web site or of your online marketing campaigns without knowing your site statistics. And the only way to know your stats is to dig deep into the bowels of your server’s log files. But once you get in, you might not make it back!

Every page viewed on your site, every visit to your site, every referring URL, and hundreds of other bits of information are stored in these labyrinthine text files, which can grow to hundreds of megabytes in size. It’s nearly impossible to decipher these log files yourself, which is why software companies have created versatile – and pricey – programs to extract the useful nuggets of information contained therein.

Here’s a basic definition of log files (also called extended log files), courtesy of the World Wide Web Consortium (or W3C), the Internet standards group:

“An extended log file contains a sequence of lines containing ASCII characters terminated by either the sequence LF or CRLF. Log file generators should follow the line termination convention for the platform on which they are executed. Analyzers should accept either form. Each line may contain either a directive or an entry.

Entries consist of a sequence of fields relating to a single HTTP transaction. Fields are separated by whitespace, the use of tab characters for this purpose is encouraged. If a field is unused in a particular entry dash “-” marks the omitted field.”
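
To make that definition a little more concrete, here’s a rough sketch (in Python) of how a parser might read such a file. The sample lines and field names below are hypothetical, but they follow the rules quoted above: directive lines start with “#”, entries are whitespace-separated, and a dash marks an omitted field.

    # A rough sketch of reading a W3C extended log file, following the
    # definition quoted above. The sample lines and field names are
    # hypothetical.
    sample_log = """#Version: 1.0
    #Fields: date time cs-uri-stem sc-status cs(Referer)
    2003-05-01 08:15:22 /index.html 200 http://www.example.com/
    2003-05-01 08:15:40 /contact.html 404 -
    """

    field_names = []
    for raw_line in sample_log.splitlines():
        line = raw_line.strip()
        if line.startswith("#Fields:"):
            # A directive line that names the fields used by later entries.
            field_names = line.split()[1:]
        elif line.startswith("#") or not line:
            continue  # other directives (such as #Version) and blank lines
        else:
            # Entries are whitespace-separated; a dash marks an omitted field.
            values = line.split()
            print(dict(zip(field_names, values)))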

WebTrends is the industry standard log file software and is used by thousands of companies around the world. Some hosting companies offer WebTrends reports to every site hosted on their servers, either free of charge or for a small fee. There are also companies, such as HitBox, that offer free ASP (application service provider) hosted services that track your site stats in exchange for placing ads on your site.

Any hosting company worth its salt will give you access to your raw log files, which you can download and then analyze on your own using software from vendors like 123LogAnalyzer (my current favorite), SurfStats, and Sawmill.

These log file analyzers are usually good at giving you a general picture of the overall health of your site, but if you analyze your log files with more than one log analysis tool, you’ll see that the world of site stats is a murky one, filled with competing standards, conflicting definitions of basic terminology, and few easy ways to understand what the numbers mean.

One log file tool may report 100,000 page views for your site in a month’s time, and another may report just 80,000. I talked with several of these vendors, and they all gave a litany of possible reasons for the discrepancy in page views: some log file software counts failed pages as page views, some programs have different definitions of what constitutes a page view or user session, or maybe the other program’s parser – the part of the software that scans the log file entries – isn’t up to snuff. But none of the companies would admit that their software could do a better job of reporting the numbers.
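
To see how easily two tools can drift apart, here’s a rough illustration. The sample entries and the counting “rules” are hypothetical: one tool counts every request for an HTML page, the other throws out failed requests, and the totals no longer match.

    # Hypothetical illustration of why two tools can disagree: one counts
    # every request for a page, the other skips failed requests.
    entries = [
        ("/index.html", 200),
        ("/about.html", 200),
        ("/missing.html", 404),  # a failed page request
        ("/logo.gif", 200),      # an image, which neither tool calls a page
    ]

    PAGE_EXTENSIONS = (".html", ".htm")  # one possible definition of a "page"

    page_requests = [(uri, status) for uri, status in entries
                     if uri.endswith(PAGE_EXTENSIONS)]
    successful_pages = [(uri, status) for uri, status in page_requests
                        if status < 400]

    print("Tool A, counting failed pages too:", len(page_requests))  # 3
    print("Tool B, successful pages only:", len(successful_pages))   # 2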

So, why does it seem so impossible for different log file programs to report numbers consistently? Here are some of the reasons:

1. There is no standard log file format.

Log file formats come in many different flavors: there are different formats based on Microsoft’s Internet Information Server (IIS); there’s another for the free Apache web server; and there are still others for proxy servers (which act as Internet access gateways for networks).
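
Here’s a hypothetical example of how the same page request might look in two of those flavors – Apache’s “combined” format versus the W3C extended format that IIS can write – to show why a parser built for one layout can stumble over the other.

    # Hypothetical examples of the same request recorded in two formats.

    # Apache "combined" format: one fixed layout, with quoted request,
    # referrer, and user agent.
    apache_line = ('192.0.2.10 - - [01/May/2003:08:15:22 -0500] '
                   '"GET /index.html HTTP/1.1" 200 5120 '
                   '"http://www.example.com/" "Mozilla/4.0"')

    # W3C extended format (one of the layouts IIS can write): a #Fields
    # directive decides which fields appear and in what order.
    iis_lines = ("#Fields: date time c-ip cs-method cs-uri-stem sc-status\n"
                 "2003-05-01 08:15:22 192.0.2.10 GET /index.html 200")

    # A parser written for one layout will misread the other, which is one
    # reason different analyzers can disagree about the same traffic.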

2. There is no standard method for interpreting and parsing log files.

Many log file analyzers – OpenWebScope, for example – report the useless term “hits” as the more-useful-to-know term “page views.” There are also different definitions of what constitutes a visitor session. Some programs say that if a visitor to your site is inactive for 10 minutes or more and then comes back, that visitor counts as a new visitor. Obviously users shouldn’t be counted twice, but when you deal with dynamic IP addresses, it’s hard to know whether the user with IP address 64.217.243.22 is the same person it was 10 minutes ago.
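
As a rough illustration of how much the session rule matters, the sketch below (with made-up timestamps, grouping requests by IP address only) counts the same three requests as either one visitor session or two, depending entirely on the inactivity timeout you pick.

    # Hypothetical illustration: the same three requests from one IP address,
    # counted as visitor sessions under two different inactivity timeouts.
    from datetime import datetime, timedelta

    requests = [
        ("64.217.243.22", datetime(2003, 5, 1, 8, 0)),
        ("64.217.243.22", datetime(2003, 5, 1, 8, 5)),
        ("64.217.243.22", datetime(2003, 5, 1, 8, 20)),  # a 15-minute gap
    ]

    def count_sessions(requests, timeout_minutes):
        sessions = 0
        last_seen = {}  # IP address -> time of that address's previous request
        for ip, when in sorted(requests, key=lambda r: r[1]):
            previous = last_seen.get(ip)
            if previous is None or when - previous > timedelta(minutes=timeout_minutes):
                sessions += 1  # first request, or a gap longer than the timeout
            last_seen[ip] = when
        return sessions

    print(count_sessions(requests, 10))  # 2 sessions under a 10-minute rule
    print(count_sessions(requests, 30))  # 1 session under a 30-minute rule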

3. There is no standard way to track and measure success.

Unless you contract with a specialized software company to track banner ad campaigns, PPC campaigns, and other sales promotions, you will have a difficult time calculating ROI, tracking referrers, and so on. Log files do record the pass-through URL parameters appended to links, but the data is so generic as to be almost useless.
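
For what it’s worth, here’s a rough sketch of the kind of bookkeeping you end up doing by hand: pulling a campaign tag out of the logged request URLs and tallying visits per campaign. The “source” parameter name and the URLs are made up; the point is that the log hands you the raw parameter and nothing more – no costs, no conversions, no ROI.

    # Hypothetical sketch: tally visits by a campaign tag ("source") pulled
    # out of logged request URLs. The parameter name and URLs are made up.
    from urllib.parse import urlparse, parse_qs
    from collections import Counter

    logged_requests = [
        "/landing.html?source=ppc-campaign-1",
        "/landing.html?source=banner-ad",
        "/landing.html?source=ppc-campaign-1",
        "/index.html",
    ]

    visits_by_source = Counter()
    for uri in logged_requests:
        params = parse_qs(urlparse(uri).query)
        visits_by_source[params.get("source", ["(none)"])[0]] += 1

    print(visits_by_source)
    # Counter({'ppc-campaign-1': 2, 'banner-ad': 1, '(none)': 1})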

Someone really needs to create a program for small to midsize businesses that will not only analyze log files but will also provide data mining, ROI tracking, etc. To be sure, there are programs like this, but they are usually geared toward the enterprise or corporate market.

So, until web server vendors and log file software companies get together and agree on a set of standards for recording, interpreting, and tracking web site traffic, webmasters and search engine marketers will have to deal with inconsistent tools and rely more on guesswork than hard numbers to determine the health and success of a web site.

If you do analyze your server logs with two different log file analyzers and get different numbers, which one do you trust? That’s a question for you to struggle with, but I usually go with the one with the higher numbers! If you have the time and money and want to get fancy, you could even analyze your logs with a third program and then average all three totals to get a more accurate picture.

