how to measure access to webpages at MIT

Counting or analyzing accesses to webpages

DRAFT DRAFT DRAFT
How popular is this page? This page resides in an Athena "locker", and like all such pages presents some challenges to those who want to know how widely-read a particular webpage might be. There are various ways "accesses" to webpages can be counted, but an access does not imply that a person actually reads, understands or reacts to a page. And, while I might like to find out "Who's been looking at this page, and when?", MIT privacy policies would probably not let me find out that same information for pages that are not mine.

I present here several widely used approaches for counting accesses to webpages, none of which is really appropriate for the MIT Athena environment. I then propose an approach which may be suitable for us.

Approach 1: Access counts at the filesystem level
Since Athena "lockers" actually reside in /afs/..., it is possible on an Athena workstation to get a rough count of all accesses to the entire AFS "volume" in which a page resides. But, this count includes all accesses from anything that happens to touch any files or directories in the volume. For example, backups or users running 'find' affect the count. Thus, the count does not indicate the kind of access, who is doing the accessing, or which particular files were accessed. The following example shows recent accesses to the 'cwis' Athena locker:

athena% attach cwis
attach: /afs/athena.mit.edu/org/c/cwis linked to /mit/cwis for filesystem cwis
athena% fs lq /mit/cwis
Volume Name            Quota    Used    % Used   Partition 
org.cwis               80000   52074       65%         87%  
athena% vos ex org.cwis
org.cwis                          537055786 RW      52074 K  On-line
    MOROS.MIT.EDU /vicepb 
    RWrite  537055786 ROnly          0 Backup  537055788 
    MaxQuota      80000 K 
    Creation    Tue Sep 13 17:40:58 1994
    Last Update Mon Jul 27 17:02:01 1998
    11527 accesses in the past day (i.e., vnode references)

    RWrite: 537055786     Backup: 537055788 
    number of sites -> 1
       server MOROS.MIT.EDU partition /vicepb RW Site 
athena%
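
If one wanted to track this volume-wide number over time, the relevant line could be pulled out of the 'vos ex' output with a few lines of script. The following is only a sketch: the sample text is copied from the transcript above, the function name is made up for illustration, and it assumes the "accesses in the past day" line keeps exactly this wording, which other AFS versions may not do.

```python
import re

# Sample text copied from the `vos ex org.cwis` transcript above.
VOS_OUTPUT = """\
org.cwis                          537055786 RW      52074 K  On-line
    MOROS.MIT.EDU /vicepb
    RWrite  537055786 ROnly          0 Backup  537055788
    MaxQuota      80000 K
    Creation    Tue Sep 13 17:40:58 1994
    Last Update Mon Jul 27 17:02:01 1998
    11527 accesses in the past day (i.e., vnode references)
"""

# Assumption: the count always appears on a line worded like this one.
ACCESS_LINE = re.compile(r"^\s*(\d+) accesses in the past day", re.MULTILINE)


def daily_volume_accesses(vos_text):
    """Return the volume-wide access count, or None if the line is missing."""
    match = ACCESS_LINE.search(vos_text)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    print(daily_volume_accesses(VOS_OUTPUT))
```

Note that the number obtained this way still has all the limitations described above: it covers the whole volume, not a page, and it includes backups, 'find' runs and everything else.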

Approach 2: Access counts in webserver logs
There are many free and commercial packages on the net for analyzing and displaying data that is logged by a running webserver. These are based on several assumptions about webservers that do not apply to web.mit.edu, tute.mit.edu and others, since our webservers are shared resources:
- First, the "content" does not reside on the local disk of the webserver. It resides in /afs, thus it can be served via any webserver which is an AFS client that can see the athena.mit.edu cell and has sufficient access rights. Thus, the content can be served from many webservers, and only a few are under our control. So, there's no way we could look through "all" the webserver logs.
- Second, our main webservers are very heavily used, serving pages from many different "websites" at MIT. Basically, any Athena locker can be considered a website since each locker has its own set of access control lists ("acls") which the locker maintainer can set. Logging places a load on the webserver, and we need to be able to turn logging on or off as necessary for operations. Thus, we cannot guarantee that logging will always be turned on. So, we cannot use any software that relies on webserver access logs for its data.
- Last, our main webservers serve many clients, but webserver logging is generally "all or nothing". Basically, if logging is turned on, every page that gets served gets logged. Thus, our webserver logfiles contain data from many different websites. We would have to handle many privacy issues associated with what's in the webserver logs.
So, in summary, analyzing webserver logs is not appropriate for us because we don't have control over all the webservers that serve content stored in Athena lockers; we cannot guarantee that the webservers we manage will always have logging turned on; and we'd have to do a lot of work to determine and maintain privacy of the data logged.
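
To make the "all or nothing" point concrete, here is a rough sketch of what a log analyzer has to do on a shared webserver. It assumes the server writes Common Log Format (this page does not say which format our servers use) and that the first component of the URL path identifies the locker; both are assumptions, and the sample log lines are invented for illustration.

```python
import re
from collections import Counter

# Common Log Format: host ident user [time] "METHOD path PROTO" status bytes
CLF = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+)[^"]*" (\d{3}) (\S+)')


def count_by_locker(log_lines):
    """Tally successful GETs by first path component (assumed to be the locker)."""
    counts = Counter()
    for line in log_lines:
        m = CLF.match(line)
        if not m:
            continue  # skip lines that are not in the assumed format
        method, path, status = m.group(3), m.group(4), m.group(5)
        if method != "GET" or status != "200":
            continue
        locker = path.lstrip("/").split("/", 1)[0] or "(top level)"
        counts[locker] += 1
    return counts


# Invented sample lines; hostnames and paths are hypothetical.
SAMPLE = [
    'host1.example.com - - [27/Jul/1998:17:02:01 -0400] "GET /cwis/index.html HTTP/1.0" 200 2326',
    'host2.example.org - - [27/Jul/1998:17:02:05 -0400] "GET /otherlocker/page.html HTTP/1.0" 200 512',
    'host1.example.com - - [27/Jul/1998:17:03:10 -0400] "GET /cwis/missing.html HTTP/1.0" 404 -',
]

if __name__ == "__main__":
    for locker, n in sorted(count_by_locker(SAMPLE).items()):
        print(locker, n)
```

Even in this tiny example, whoever runs the script sees the client hostnames and the pages of every locker in the log, not just their own, which is exactly the privacy problem.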

Approach 3: Access counts via html "counters"
This approach is demonstrated by the counter examples at the bottom of this page. Putting a counter in a page can increase the time it takes the page to load in a browser, since the page contains a line of html that increments a counter somewhere. The counter can be "invisible" on the page, or can show the count, but the basic method is the same. These methods are simply counters; they do not provide any methods for analyzing accesses or for maintaining the privacy of the data.
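
The server side of such a counter can be very small. The sketch below is not how any of the counters on this page are actually implemented; it is a minimal illustration that assumes the count lives in a plain file on a Unix filesystem where flock works, and the function name is hypothetical.

```python
import fcntl
import os
import tempfile


def bump_counter(path):
    """Add one to the integer stored in `path` and return the new value."""
    # Open read/write, creating the file on first use.
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    with os.fdopen(fd, "r+") as f:
        fcntl.flock(f, fcntl.LOCK_EX)  # serialize concurrent page loads
        text = f.read().strip()
        count = int(text) + 1 if text else 1
        f.seek(0)
        f.truncate()
        f.write(str(count))
        f.flush()
        # The lock is released when the file is closed.
    return count


if __name__ == "__main__":
    demo = os.path.join(tempfile.mkdtemp(), "count.txt")
    print(bump_counter(demo))
    print(bump_counter(demo))
```

All that is stored is a single number, which is why a counter of this kind cannot say anything about who accessed the page or when. The counter file also has to be writable by whatever process runs the script.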

Approach 4: Access counts via webbrowser customization
Since we do not want to be in the business of customizing webbrowsers, this is not an option.

A proposed method for Athena lockers
Tom Copetto has written a way to count accesses to the Admissions Department's homepage, and determine the country of origin of the access. (This does not necessarily mean that there is a person reading the page in that country, simply that web.mit.edu sent a page to a computer that appeared to be in a certain country.) Using this as a model, we should be able to come up with a method that lets the user specify one (or a few) pages that will be counted. Ideally, access to the counter data should be based on the acls associated with the page being counted. Since we have to write a custom method for counting accesses, we should make the method aware of MIT privacy issues and the shared nature of the IS-supported webservers from the very beginning. If possible, it should take advantage of any data-management systems already in place.
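
The requirements in this proposal can be sketched roughly as follows. This is not Tom's code, and I do not know how his method determines the country; guessing it from a two-letter top-level domain in the client hostname is purely an assumption for illustration. The class and method names are hypothetical, and a simple set of usernames stands in for the real acls.

```python
from collections import Counter


class PageCounter:
    """Counts accesses to a few chosen pages, grouped by apparent country."""

    def __init__(self, counted_pages, readers):
        self.counted_pages = set(counted_pages)  # pages the owner asked to count
        self.readers = set(readers)              # stand-in for the page's acl
        self.by_country = {page: Counter() for page in self.counted_pages}

    @staticmethod
    def apparent_country(hostname):
        """Guess a country from a two-letter top-level domain, else 'unknown'."""
        if "." not in hostname:
            return "unknown"
        tld = hostname.rsplit(".", 1)[-1].lower()
        return tld if len(tld) == 2 and tld.isalpha() else "unknown"

    def record(self, page, client_hostname):
        """Count one access; only the country is kept, not the hostname."""
        if page in self.counted_pages:
            self.by_country[page][self.apparent_country(client_hostname)] += 1

    def report(self, page, requester):
        """Return the per-country tally, but only to someone on the acl."""
        if requester not in self.readers:
            raise PermissionError("not on the acl for this page")
        return dict(self.by_country[page])


if __name__ == "__main__":
    pc = PageCounter(["/admissions/index.html"], ["owner"])
    pc.record("/admissions/index.html", "client.example.fr")
    pc.record("/other/page.html", "client.example.de")  # not opted in, ignored
    print(pc.report("/admissions/index.html", "owner"))
```

The design choice worth noting is that pages not named by their owner are never counted, and that the hostname is thrown away once the country has been guessed, so the stored data is less sensitive than a raw webserver log.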

Counting examples and commentary

The MIT SIPB traditional counter reports: since the counter was last reset.
The MIT SIPB daily-average counter reports: since I started counting on July 23, 1998.
digits.com isn't giving out free counters for the time being, so I can't put one here.
There are a lot of freeware access counters, but they all involve running a cgi script, thus if I want to count accesses to this page via a webserver, I need the cooperation of a webserver somewhere. Most people who run webservers have the good sense not to run random scripts found on the net on their webservers.
Tom's counter just got bumped.
Is the number of times this page passes through some webserver really of interest? Many webcrawlers, indexers and their ilk access it, reloads of the same page can count as separate accesses, and I myself cause many accesses whenever I edit the page and check that it still works. Also, webservers may send out a cached copy of this file, not this exact, precise file that I happen to be editing. Does that count as an access? What about cached copies in browsers? Oh, you can get a headache trying to figure out if the access count has any useful meaning.
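One could try to discount some of this noise before counting. The sketch below is only an illustration under several assumptions: each access is available as a (timestamp in seconds, client host, user-agent string) tuple, crawlers can be recognized by a few substrings in the user-agent, and a repeat access from the same host within half an hour is a reload. None of these heuristics is reliable, and all the names here are made up.

```python
def meaningful_accesses(events, reload_window=1800, ignored_hosts=(),
                        crawler_markers=("bot", "crawler", "spider")):
    """Count accesses, skipping likely crawlers, my own hosts, and reloads."""
    last_seen = {}  # host -> timestamp of its most recent access
    count = 0
    for timestamp, host, agent in sorted(events):
        if host in ignored_hosts:
            continue  # e.g. the machine I edit the page from
        if any(marker in agent.lower() for marker in crawler_markers):
            continue  # looks like a webcrawler or indexer
        previous = last_seen.get(host)
        last_seen[host] = timestamp
        if previous is not None and timestamp - previous < reload_window:
            continue  # treat as a reload of the same page
        count += 1
    return count


# Invented sample events; hosts and agents are hypothetical.
EVENTS = [
    (0, "a.example.com", "Mozilla/4.0"),
    (60, "a.example.com", "Mozilla/4.0"),        # reload
    (4000, "a.example.com", "Mozilla/4.0"),      # later visit
    (10, "crawl.example.net", "ExampleSpider/1.0"),
    (20, "my-workstation.example.edu", "Mozilla/4.0"),
]

if __name__ == "__main__":
    print(meaningful_accesses(EVENTS, ignored_hosts={"my-workstation.example.edu"}))
```

Even with filtering like this, copies served from a cache never reach the counter at all, so the headache remains.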
Since this page resides in AFS, it can be accessed from any webserver that is an AFS client that can reach /afs/athena.mit.edu/. In fact, it can be accessed from any webbrowser that is running on an AFS client, via the filesystem and completely bypassing any webserver. Try this, which will succeed only for the few, the proud, the AFS clients that are aware of the athena.mit.edu cell.
I'm looking for ways of counting accesses to this page that make sense, and scale in the sense that anyone in the MIT community could use the same method for counting accesses to their pages with minimal or no staff time required. Groveling through webserver access logs is not an option, and I do not care for solutions that involve giving the webserver write access to directories in /afs.

Other references

A somewhat outdated discussion of getting access statistics for Athena webpages
A statement for the beans discovery project on "Web Page Reporting"

Counter pages found on the web, July 1998

Digit Mania
digits.com
superstats.com
Fake Counter
wusage
Matt's Script Archive
How David puts counters on pages
asoftware for-money counter
cron count
EasyCounter

salemme@mit.edu
Last updated $Date: 1998/07/24 20:19:39 $ GMT