Why web usage statistics are (worse than) meaningless

Note: This document was originally written in 1995 to explain the stats situation at Cranfield University, where I worked at the Cranfield Computer Centre. It was initially intended for local users there, but quickly gained popularity (notoriety?) elsewhere. The content has had few changes and updates since then.

  1. There is no discussion of cookie tracking (yet) in this document.
  2. There is no discussion of the very similar problem of guessing web browser popularity from webserver logs.
  3. I had over-estimated the extent to which caching (and hierarchical caching) would be used.
  4. Cranfield University has a proud history of leadership in the web. It was one of the very first UK sites to have a webserver at all (in 1993), was at the forefront of the UK caching effort, and in enabling individual users to publish on the web (early 1994). I am grateful that they permitted what was essentially my personal rant to be hosted so prominently for so long, and have provided a long-term redirect to this page.
  5. On re-reading my original, I see that this document is a bit hyperbolic. So be it. It is after all an acknowledged rant.
Web usage statistics, such as those produced by programs like analog, cannot be used to make strong inferences about the number of people who have read a website or webpage. Although those who compile these statistics usually try to make this clear, people still insist on misusing them to make overly strong inferences. Attaching meaning to meaningless numbers is worse than not having the numbers at all. When you lack information, it is best to know that you lack the information. Web statistics may give the user a false sense of knowledge, which can be worse than being knowingly ignorant.

A useful analogy is with putting up advertising posters. You will never really know how many people have noticed them or read them.

It is not enough to say that the statistics should be taken with a grain of salt; they should be taken with a salt lick. If you want to understand why no inference about the number of people reading your pages can be made from web statistics read on. Otherwise, you may wish to just trust that statement or may wish to skip to the section on Quick Questions and Answers.

What are web stats really good for?

Web stats are useful for web administrators to get a sense of the actual load on the server. This is useful for diagnostics and capacity planning, and for detecting unusual behaviour that may require action. The goal of the administrator is to keep the server running smoothly under expected loads, while improving the speed and reliability of obtaining documents from the site. The best way to achieve this is to have browsers retrieve documents from places closer to where they will be used (even from memory) rather than from the disk on the server. It is only when the file is retrieved from the server itself that the server has any ability to keep track of the access.
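To make the legitimate use concrete, here is a brief sketch (in Python, which obviously post-dates this document; the log lines are invented, though they follow the Common Log Format that real servers write) of the kind of load summary an administrator might compute:

```python
# Sketch: tally requests per hour from Common Log Format access-log lines.
# The hosts, pages and timestamps below are invented for illustration.
from collections import Counter
import re

def requests_per_hour(lines):
    # CLF timestamps look like [10/Oct/1995:13:55:36 -0700];
    # capture everything up to and including the hour.
    stamp = re.compile(r'\[(\d{2}/\w{3}/\d{4}:\d{2})')
    hours = Counter()
    for line in lines:
        m = stamp.search(line)
        if m:
            hours[m.group(1)] += 1
    return hours

log = [
    '127.0.0.1 - - [10/Oct/1995:13:55:36 -0700] "GET / HTTP/1.0" 200 2326',
    '127.0.0.1 - - [10/Oct/1995:13:58:12 -0700] "GET /a.html HTTP/1.0" 200 150',
    '10.0.0.9 - - [10/Oct/1995:14:02:01 -0700] "GET / HTTP/1.0" 200 2326',
]
print(requests_per_hour(log))
```

Note that this answers "how busy was the server?", a question the logs genuinely can answer, and says nothing about how many people read anything.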

Caching:
Essential for the web and disastrous for statistics

Let's take a fictitious example of what might happen when someone in Nome, Alaska, say at Nome Community College (this would be a polytechnic in the UK), wants to read Cranfield's Prospectus. The user would somehow select the URL with his/her browser, which would then try the following.
Browser Cache
The particular instance of the browser will look in its own memory (or at what it may have saved on its local disk).

If it finds the page corresponding to the sought-for URL there, it will not go any further, and our site will never know that the request was made.

Local site cache
If the page was not in the browser cache, the browser may look to its site cache. That is, if someone at the user's same site recently retrieved the page, it may be available to the user there.

If it finds the page corresponding to the sought-for URL there, it will not go any further, and our site will never know that the request was made.

Local regional cache
The site cache may be configured to look in a local regional cache, say at the University of Alaska, Nome campus which might provide a caching service for smaller sites around Nome.

If it finds the page corresponding to the sought-for URL there, it will not go any further, and our site will never know that the request was made.

Large regional cache
The local regional cache may be configured to look in a large regional cache, say in Fairbanks Alaska, which might provide caching for sites in Alaska that use it.

If it finds the page corresponding to the sought-for URL there, it will not go any further, and our site will never know that the request was made.

The Cranfield accelerator
An accelerator is an out-going cache for a site. When a document is requested from the site, the accelerator sees whether it has it stored (it stores documents in ways much faster to find and retrieve than the server does with files in the directory structure) and serves that copy up.

While it would be possible to have the accelerator keep a record of which files it served up and to whom, this would defeat the purpose, because it would require a disk operation to make that record.

In addition to over-estimating the degree of caching that would be in place, this last step about accelerators is also no longer relevant. The accelerator was needed when Cranfield was running the original CERN server over an AFS filesystem. Given the nature of modern web server set-ups, accelerators are no longer needed.

Now that you have an idea of what caching is, you are in a better position to understand why it is impossible to make any inference about numbers of people reading your pages from web statistics. But there is more to come, described in the section on multiple hits per user. What is necessary to understand about caching is that some users may go through a long and efficient cache chain (as described in the example) while other users may not. Much of this depends on how their site is set up or how they have set things up themselves.
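The chain just described can be sketched in miniature (the cache names and page are of course hypothetical, and real caches speak proper protocols rather than sharing dictionaries):

```python
# Sketch of the cache chain from the example above. Each cache is modelled
# as a simple dict; only a miss at every level reaches the origin server,
# and only that final fetch appears in the server's logs.
def fetch(url, caches, origin):
    for name, cache in caches:
        if url in cache:
            return cache[url], name   # served from a cache: origin never sees it
    origin["hits"] += 1               # only now does the server log anything
    return origin["pages"][url], "origin server"

origin = {"pages": {"/prospectus.html": "<html>...</html>"}, "hits": 0}
caches = [
    ("browser cache", {}),
    ("site cache", {"/prospectus.html": "<html>...</html>"}),
    ("regional cache", {}),
]
page, source = fetch("/prospectus.html", caches, origin)
print(source, origin["hits"])
```

Here the reader gets the page from the site cache, and the origin server's hit count stays at zero: a real reader, no logged access.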

One user many hits

Imagine (in the extreme case) a user who is doing no caching whatsoever. Now if that user comes across the Cranfield Home Page 20 times while browsing around the Cranfield pages that will count as 20 hits. Remember the statistics are about accesses, not about people.
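A toy illustration of the point (the host name is invented):

```python
# Twenty hits on the home page from one uncached client: the log shows
# twenty accesses, but there is only one person behind them.
hits = [("client.example.edu", "/index.html")] * 20
total_hits = len(hits)
unique_hosts = len({host for host, page in hits})
print(total_hits, unique_hosts)   # 20 hits, but only 1 distinct host
```

And note that counting distinct hosts does not rescue the numbers either: a single proxy host may stand in for hundreds of readers, and the same reader may appear from several machines.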

Big pages are little pages

When comparing hits for different directories, it is important to note how documents are structured. If on one hand you have a directory with a single document, and on the other another directory with the same amount of real content broken into twenty smaller documents, you will find far more hits on that second section.

Quick Questions and Answers about web statistics

Most of what is listed here is either mentioned above or can be inferred from the explanations above. If there is a question that you would like to see added to this list, or if you have other comments on this document, please use the form at the end to submit queries. [Sorry, that form is now defunct.]

The questions are listed and answered below.

Can stats be used to assess changes over time?

Not really. The number of individuals and sites using caches is rising all the time, as is the amount of disk space and memory used for caching. When the Cranfield Accelerator goes live (early November, 1995), there should be an actual drop in our server stats even while accesses increase, owing to the increased speed and reliability of the server. Caching has been on the rise for more than a year now. Even so, loads on systems (including ours) have gone up dramatically.

Can stats be used to assess relative popularity in different Internet domains such as .ac.uk, or .jp?

Unfortunately not even this is possible. Suppose for example that Japan has a very high level of regional and national caching while Singapore does not (the example is fictitious). Under these circumstances, web statistics might show more accesses from Singapore than from Japan even if more people in Japan read our pages.

A clear example of this is the number of accesses from "numerical domains" that have recently started to top various lists. These are accesses from sites that don't have proper reverse DNS listings. Such sites are probably misconfigured single-user machines, where either the particular machine used is misconfigured or the organisation it belongs to has not straightened out its machine names properly. It is reasonable to assume that those running such misconfigured systems are far more likely not to have configured their proxies correctly, so far less caching will be seen from those sites.
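The reverse lookup in question is the kind a log analyser performs before compiling per-domain tables. A minimal sketch using Python's standard library (the address shown is a reserved documentation address, not any real site):

```python
# Try to turn a numeric address back into a host name, as a log analyser
# would before grouping accesses by domain. If the site has no reverse
# DNS entry, the lookup fails and the access is lumped under a
# "numerical domain" in the stats.
import socket

def hostname_or_numeric(addr):
    try:
        name, _, _ = socket.gethostbyaddr(addr)
        return name
    except OSError:
        return addr   # no reverse mapping: stays numeric in the stats

# 192.0.2.1 is a reserved documentation address (TEST-NET-1) with no
# reverse DNS entry, so it stays numeric.
print(hostname_or_numeric("192.0.2.1"))
```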

Can stats be used to assess relative popularity of different pages?

Not really. The more popular pages will cache more, meaning that real differences between page hits will be dramatically distorted. It is probably safe to say that if one page shows more hits than another then there really were more accesses to that page, but there are circumstances under which even that weak inference won't be true.

Is there some multiplier which can be applied to the stats to get more meaningful results?

Not really. This is because any such multiplier would have to differ from page to page and differ from access region to access region.

Can I ensure that my document is never cached?

Yes you can. There are several ways to do so, and there are some circumstances for which it is even legitimate, but to do so merely to get better stats is seriously misguided. This is for two reasons:
  1. You will make your page (much) harder for people to get to and add to network traffic unnecessarily.
  2. If someone fails to reach your page at our site, they may give up on the site altogether. Thus hard-to-reach pages (unless there is a clear reason for them being so) are unfair to the other providers at the site.

Quite embarrassingly, many of the pages on this site don't normally cache properly. This is because I had some technical difficulties with my configuration of server-side includes and the so-called "XBitHack". I've fixed that now, but still have to fix dozens of documents to use things properly.
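For those wondering, the "XBitHack" is an Apache directive: with it enabled, an HTML file whose owner-execute bit is set gets parsed for server-side includes, and in `full` mode a file whose group-execute bit is also set is sent with a Last-Modified header, which is what lets caches treat it sensibly. An illustrative configuration fragment (not the actual setup here):

```apache
# Illustrative only, not the actual configuration used on this site.
# Files made executable (chmod +x page.html) are parsed for server-side
# includes; with "full", files that also carry the group-execute bit
# (chmod g+x page.html) are sent with a Last-Modified header based on
# the file's date, so caches can handle them properly.
XBitHack full
```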

Can I put counters in my page?

You may have noticed some pages with web counters. There are basically two ways to put them in your page: the wrong way and the very wrong way. The wrong way merely doesn't work and will be no more useful than the normal statistics. The very wrong way is counter-productive, because it subverts the caching mechanism, which is not something to do just to get statistics.

Please note that even if you think the statistics can be made useful, counters on individual pages are displayed to the reader, who is not in a position to make the various adjustments needed to get some sense of true readership.

Can we get stats from the sites that do caching?

Yes and no, but mostly no. There are two reasons for "mostly no". One is simply that there are too many small caches out there which may have cached our stuff (including the browser software internal cache). Clearly not all of these are going to send us records on a regular basis which we would then have to incorporate into all of the other records to process statistics.

The other reason for "mostly no" is that even the large caches are willing to send only a byte count. That is, one major UK cache is considering sending out, on a monthly basis, how many bytes of data they served up in our name.

We must remember that the caches are doing us a favour by making our pages much easier to reach. We cannot ask them to take on a task that would degrade the service or place an additional administrative, disk, memory and CPU load on them. Without caching, the web would have collapsed long ago.

Can I infer from stats a minimum number of readers?

Yes and no. If by minimum you mean "at least one" then yes. If you have 400 hits from Japan then you can conclude that during that period you had at least one reader from Japan. You cannot infer that there were at least 400 readers, because the same reader may hit a page many times in a short period of time.

So, the only certain inference that can be made is that there was at least one reader from a particular domain, or of a particular page.

How can I gauge interest in pages?

One way is to set up Mail Reply Forms in your pages like the one at the end of this document. Of course many more people will read your pages than will complete the form, but the form can be used to judge serious interest. Most people will, however, not fill out a form unless they think they will get some sort of useful response, even if they read the document seriously. (Did you fill out the form for this document?)

Setting up these forms is not as difficult to do as it first appears, and courses are offered on it by the computing centre staff.

If web stats are so bad why are they kept at all?

They are useful for system administrators to judge the actual load on the server. The section on what stats are good for contains more information.

Then why make the stats public?

Popular demand. It is not the computer centre's job to deny users some service just because we know the request to be misguided. Attempts to eliminate these statistics from the system met with complaints. However, no great effort will be put into maintaining statistics or access to them either. It is hoped that this document will make it easier for the computer centre to withdraw statistics altogether, except for what is required for system maintenance.

Is this all just an excuse to avoid the work of maintaining stats?

No. But you may have noticed that many of the individual problems and difficulties could be partially mitigated by collecting and using more information (from some caches, for example, or times of requests) and using that to make very rough estimates of various correction factors. It would take serious statistical analysis of the sort that professional market research firms may be able to undertake, and still the estimates (and relative hits on pages or from regions) would remain iffy. Performing complicated analyses on dubious data only compounds the problem, and the marginal utility would be negative (i.e., the large amount of extra effort would not be justified by the tiny gain in meaningfulness of the statistics).

Time to ask your questions

When this page was hosted by Cranfield there was a form for mailing comments. I have disabled that since moving this document to its current location, because (a) I don't have as good a mailform system as was available at Cranfield, (b) there are spam/privacy concerns about collecting unconfirmed email addresses, which I hadn't considered in 1995 for what was initially intended as an internal document, (c) this was partially an attempt to promote the use of Mailforms at Cranfield, and (d) history has shown that I am often not very good at responding to the queries that I get.

Version: $Revision: 2.7 $
Last Modified: $Date: 2004/07/13 18:30:32 $ GMT
First established at original site: Summer 1995
First established at goldmark.org: April 25, 2001
Author: Jeffrey Goldberg