Counting or analyzing accesses to webpages
DRAFT DRAFT DRAFT
How popular is this page?
This page resides in an Athena "locker",
and like all such pages presents some challenges to those
who want to know how widely-read a particular webpage might be.
There are various ways "accesses" to webpages can be counted, but
an access does not imply that a person actually reads, understands
or reacts to a page.
And, while I might like to find out "Who's been looking at this
page, and when?", MIT privacy policies would probably not let me
find out that same
information for pages that are not mine.
I present here several approaches for counting accesses to webpages
that are widely used, but none is really appropriate for the
MIT Athena environment.
I then propose an approach which may be suitable for us.
- Approach 1: Access counts at the filesystem level
Since Athena "lockers" actually reside in /afs/..., it is possible
on an Athena workstation to get a rough count of all accesses to
the entire AFS "volume" in which a page resides.
But, this count includes all accesses from anything that happens to
touch any files or directories in the volume. For example, backups or
users running 'find' affect the count.
Thus, the count does not indicate the kind of access or who is doing
the accessing, or which particular files were accessed. The following
example shows recent accesses to the 'cwis' Athena locker:
athena% attach cwis
attach: /afs/athena.mit.edu/org/c/cwis linked to /mit/cwis for filesystem cwis
athena% fs lq /mit/cwis
Volume Name Quota Used % Used Partition
org.cwis 80000 52074 65% 87%
athena% vos ex org.cwis
org.cwis 537055786 RW 52074 K On-line
MOROS.MIT.EDU /vicepb
RWrite 537055786 ROnly 0 Backup 537055788
MaxQuota 80000 K
Creation Tue Sep 13 17:40:58 1994
Last Update Mon Jul 27 17:02:01 1998
11527 accesses in the past day (i.e., vnode references)
RWrite: 537055786 Backup: 537055788
number of sites -> 1
server MOROS.MIT.EDU partition /vicepb RW Site
athena%
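The only figure of interest in that transcript is the "accesses in the past day" line. As a rough sketch, assuming the output of `vos ex` has already been captured as text, that number could be pulled out as follows. The sample text is copied from the transcript above; the function name `daily_accesses` is my own invention.

```python
import re
from typing import Optional

# Sample text copied from the `vos ex org.cwis` transcript above.
SAMPLE = """\
org.cwis 537055786 RW 52074 K On-line
MOROS.MIT.EDU /vicepb
RWrite 537055786 ROnly 0 Backup 537055788
MaxQuota 80000 K
Creation Tue Sep 13 17:40:58 1994
Last Update Mon Jul 27 17:02:01 1998
11527 accesses in the past day (i.e., vnode references)
"""

# Matches the one line that carries the volume-wide access count.
ACCESS_LINE = re.compile(r"^\s*(\d+) accesses in the past day", re.MULTILINE)


def daily_accesses(vos_output: str) -> Optional[int]:
    """Return the volume-wide access count, or None if the line is absent."""
    match = ACCESS_LINE.search(vos_output)
    return int(match.group(1)) if match else None


print(daily_accesses(SAMPLE))
```

Note that the result is still a count for the whole volume. Nothing in this output can narrow it down to a single page.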
- Approach 2: Access counts in webserver logs
There are many free and commercial packages on the net for analyzing
and displaying data that is logged by a running webserver.
These are based on several assumptions about webservers that do not apply
to web.mit.edu, tute.mit.edu and others, since our webservers are
shared resources:
- First, the "content" does not reside on the local disk of the
webserver. It resides in /afs, thus it can be served via any
webserver which is an AFS client that can see the athena.mit.edu cell
and has sufficient access rights. Thus, the content can be served from
many webservers, and only a few are under our control. So, there's
no way we could look through "all" the webserver logs.
- Second, our main webservers are very heavily used, serving pages
from many different "websites" at MIT. Basically, any Athena locker
can be considered a website since each locker has its own set
of access control lists ("acls") which the locker maintainer can
set. Logging places a load on the webserver,
and we need to be able to turn logging on or off as necessary
for operations. Thus, we cannot guarantee that logging will always
be turned on. So, we cannot use any software that relies on
webserver access logs for its data.
- Last, our main webservers serve many clients, but webserver logging
is generally "all or nothing". Basically, if logging is turned on,
every page that gets served gets logged. Thus, our webserver logfiles
contain data from many different websites. We would have
to handle many privacy issues associated with what's in the webserver
logs.
So, in summary, analyzing webserver logs is not
appropriate for us because
we don't have control over all the webservers that serve content stored
in Athena lockers; we cannot guarantee that the webservers we manage
will always have logging turned on; and we'd have to do a lot of
work to determine and maintain privacy of the data logged.
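To make the objection concrete, here is a minimal sketch of what a log-analysis package does at its core. It assumes the log is in NCSA Common Log Format and that the first component of the URL path identifies the "website"; both are assumptions for illustration, and the hosts, paths and times in the sample lines are invented.

```python
import re
from collections import Counter

# Hypothetical log lines in Common Log Format; every value here is invented.
LOG_LINES = [
    'host1.example.com - - [27/Jul/1998:10:00:00 -0400] "GET /lockerA/index.html HTTP/1.0" 200 1043',
    'host2.example.org - - [27/Jul/1998:10:00:05 -0400] "GET /lockerB/page.html HTTP/1.0" 200 512',
    'host1.example.com - - [27/Jul/1998:10:01:00 -0400] "GET /lockerA/index.html HTTP/1.0" 200 1043',
    'host3.example.net - - [27/Jul/1998:10:02:00 -0400] "GET /lockerA/gone.html HTTP/1.0" 404 210',
    'not a log line',
]

# Captures the requested path and the three-digit status code.
REQUEST = re.compile(r'"(?:GET|HEAD|POST) (\S+)[^"]*" (\d{3}) ')


def hits_per_site(lines):
    """Count successful requests, grouped by the first path component."""
    counts = Counter()
    for line in lines:
        match = REQUEST.search(line)
        if match is None:
            continue  # skip anything that is not a request line
        path, status = match.group(1), match.group(2)
        if status != "200":
            continue  # count only pages that were actually served
        site = path.lstrip("/").split("/")[0]
        counts[site] += 1
    return counts


print(dict(hits_per_site(LOG_LINES)))
```

To count hits for one site, the script has to read every line of the log, including the client hostnames recorded for every other site. That is the privacy problem described above, and the counts are only as complete as the set of logs we can get at.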
- Approach 3: Access counts via html "counters"
This approach is demonstrated by the counter examples at the bottom of
this page. Putting a counter in a page can increase the time it
takes the page to load in a browser, since the page contains a line
of html that increments a counter somewhere.
The counter can be "invisible" on the page, or can show the count,
but the basic method is the same. These methods are simply counters;
they do not provide any methods for analyzing accesses or for maintaining
the privacy of the data.
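What happens at the other end of that line of html can be sketched in a few lines. This is not how the SIPB counters are implemented (I have not looked); it is a minimal illustration that assumes the count lives in a plain file. The function name and the file name are hypothetical.

```python
import os
import tempfile


def bump_counter(path: str) -> int:
    """Increment the integer stored in `path` and return the new value."""
    try:
        with open(path) as f:
            count = int(f.read().strip() or 0)
    except FileNotFoundError:
        count = 0  # first access: start counting from zero
    count += 1
    # Write to a temporary file and rename it into place, so that a reader
    # never sees a half-written count. There is no locking here, so two
    # simultaneous requests could still lose an update.
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write(str(count))
    os.replace(tmp, path)
    return count


# Simulate three page loads against a throwaway counter file.
with tempfile.TemporaryDirectory() as directory:
    counter_file = os.path.join(directory, "page.count")
    for _ in range(3):
        latest = bump_counter(counter_file)
    print(latest)
```

All that is stored is a single number. There is no record of who or when, which is why such counters offer nothing to analyze, and also why whatever process runs this code needs write access to wherever the count is kept.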
- Approach 4: Access counts via webbrowser customization
Since we do not want to be in the business of customizing webbrowsers,
this is not an option.
- A proposed method for Athena lockers
Tom Copetto has written a way to count accesses to the Admissions
Department's homepage, and determine the country of origin of
the access. (This does not necessarily mean that there is a person reading
the page in that country, simply that web.mit.edu sent a page to a
computer that appeared to be in a certain country.)
Using this as a model, we should be able to come up with a method
that lets the user specify one (or a few) pages that will be counted.
Ideally, access to the counter data should be based on the acls
associated with the page being counted. Since we have to write a
custom method for counting accesses, we should make the method aware
of MIT privacy issues and the shared nature of the IS-supported
webservers from the very beginning. If possible, it should
take advantage of any data-management systems already in place.
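A rough sketch of the shape such a method might take follows. It is not Tom's code. It assumes the country is guessed from a two-letter final label of the client hostname, and it stands in for the page's acl with a plain set of usernames. The class, method and page names are all hypothetical.

```python
from collections import Counter, defaultdict


class PageCounter:
    """Hypothetical opt-in counter: only pages listed by the maintainer are counted."""

    def __init__(self, counted_pages, readers):
        self.counted_pages = set(counted_pages)
        # Stand-in for the page's acl: usernames allowed to read the counts.
        self.readers = set(readers)
        self.by_country = defaultdict(Counter)

    @staticmethod
    def guess_country(hostname):
        # Assumption: a two-letter final label is a country code;
        # anything else (com, edu, a bare IP address) is "unknown".
        label = hostname.rstrip(".").rsplit(".", 1)[-1].lower()
        return label if len(label) == 2 and label.isalpha() else "unknown"

    def record(self, page, client_hostname):
        if page not in self.counted_pages:
            return False  # pages that did not opt in leave no trace
        # Only the guessed country is kept; the hostname itself is discarded.
        self.by_country[page][self.guess_country(client_hostname)] += 1
        return True

    def report(self, page, username):
        if username not in self.readers:
            raise PermissionError(f"{username} may not read counts for {page}")
        return dict(self.by_country[page])


counter = PageCounter(["/admissions/index.html"], ["maintainer"])
counter.record("/admissions/index.html", "host.example.de")
counter.record("/admissions/index.html", "other.example.de")
counter.record("/admissions/index.html", "host.example.com")
counter.record("/elsewhere/index.html", "host.example.fr")  # not counted
print(counter.report("/admissions/index.html", "maintainer"))
```

The two design points are the ones argued for above: counting is opt-in per page, and the stored data is reduced to what the maintainer asked for (a country guess) and is readable only by those the page's acl would allow.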
Counting examples and commentary
- The MIT SIPB traditional counter reports: [counter image]
since the counter was last reset.
- The MIT SIPB daily-average counter reports: [counter image]
since I started counting on July 23, 1998.
- digits.com isn't giving out free
counters for the time being, so I can't put one here.
- There are a lot of freeware access counters, but they all involve
running a cgi script. Thus, if I want to count accesses to this page via
a webserver, I need the cooperation of a webserver somewhere.
Most people who run webservers have the good sense not to run random
scripts found on the net on their webservers.
- Tom's counter just got bumped.
- Is the number of times this page passes through some webserver really of
interest? Many webcrawlers, indexers and their ilk access it, reloads of the
same page can count as separate accesses, and I myself cause many accesses
whenever I edit the page and check that it still works. Also, webservers
may send out a cached copy of this file, not this exact, precise file that
I happen to be editing. Does that count as an access? What about cached
copies in browsers? Oh, you can get a headache trying to figure out if
the access count has any useful meaning.
- Since this page resides in AFS, it can be
accessed from any webserver that is an AFS client that can reach
/afs/athena.mit.edu/. In fact, it can be accessed from any webbrowser
that is running on an AFS client, via the filesystem and completely
bypassing any webserver.
Try this, which will succeed only for the few, the proud, the AFS
clients that are aware of the athena.mit.edu cell.
- I'm looking for ways of counting accesses to this page that make
sense, and scale in the sense that anyone in the MIT community could
use the same method for counting accesses to their pages with minimal
or no staff time required. Groveling through webserver access logs is not
an option, and I do not care for solutions that involve giving the webserver
write access to directories in /afs.
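The doubts raised in the list above about crawlers and reloads suggest that a raw count needs at least some discounting. As a sketch only, assuming each access is available as a (time, client, user agent) record, that crawlers identify themselves in the user agent, and that a repeat from the same client within thirty minutes is a reload (all three are assumptions, and the threshold is arbitrary):

```python
ROBOT_HINTS = ("crawler", "spider", "robot")  # illustrative, not a complete list
RELOAD_WINDOW = 30 * 60  # seconds; an arbitrary choice


def count_distinct_visits(accesses):
    """accesses: iterable of (unix_time, client, user_agent), sorted by time."""
    last_seen = {}
    visits = 0
    for when, client, agent in accesses:
        if any(hint in agent.lower() for hint in ROBOT_HINTS):
            continue  # self-identified crawlers and indexers are not readers
        previous = last_seen.get(client)
        if previous is None or when - previous > RELOAD_WINDOW:
            visits += 1  # first access, or a return after a long gap
        last_seen[client] = when
    return visits


# Hypothetical records: one reader who reloads, one crawler, one later return.
ACCESSES = [
    (0, "a.example.com", "Mozilla/4.0"),
    (60, "a.example.com", "Mozilla/4.0"),
    (120, "b.example.com", "ExampleSpider/1.0"),
    (4000, "a.example.com", "Mozilla/4.0"),
]
print(count_distinct_visits(ACCESSES))
```

Four accesses shrink to two visits here. Cached copies and accesses made directly through the filesystem never show up in the input at all, so even the discounted number remains a rough indication.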
Other references
- A somewhat outdated discussion of getting access statistics for
Athena webpages
- A statement for the beans discovery project on "Web Page Reporting"
Counter pages found on the web, July 1998
salemme@mit.edu
Last updated
$Date: 1998/07/24 20:19:39 $ GMT