Caching for Web Authors, Designers and Webmasters

This is an informational document. Although technical in nature, it attempts to make the concepts involved understandable and applicable in real-world situations. Because of this, some aspects of the material are simplified or omitted, for the sake of clarity. If you are interested in the minutiae of the subject, please explore the References and Further Information at the end.

What’s a Web Cache? Why do people use them?

A Web Cache (sometimes called a Web proxy) is an application that sits between Web servers (or origin servers) and clients, and watches requests for HTML pages, images and files (collectively known as objects) come by, saving a copy for itself. Then, if there is another request for the same object, it will use the copy that it has, instead of asking the origin server for another copy.

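There are two main reasons that people use Web Caches: to reduce latency, because a request answered from a nearby cache does not have to travel all the way to the origin server, and to reduce bandwidth usage, because a popular object is fetched from the origin server only once and then served to many clients.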

Aren’t Web Caches bad for me? Why should I help them?

Web Caching is one of the most misunderstood technologies on the Internet. Webmasters in particular fear losing control of their site, because a Cache can ‘hide’ their users from them, making it difficult to see who’s using the site.

Unfortunately for them, even if no Web Caches were used, there are too many variables on the Internet to ensure that they’ll be able to get an accurate picture of how users see their site. If this is a big concern for you, this document will teach you how to get the statistics you need without making your site cache-unfriendly.

Another concern is that Caches can serve content that is out of date, or stale. However, this document can show you how to configure your server to control this, while making it more cacheable.

On the other hand, if you plan your site well, caches can help your Web site load faster, and save load on your server and Internet link. The difference can be dramatic; a site that is difficult to cache may take several seconds to load, while one that takes advantage of caching will seem instantaneous in comparison. Users will appreciate a fast-loading site, and will visit more often.

The fact is that Caches will be used whether you like it or not. If you don’t configure your site to be cached correctly, it will be cached using whatever defaults the Cache administrator decides upon.

Kinds of Web Caches

Browser Caches

If you examine the preferences dialog of any modern browser (like Internet Explorer or Netscape), you’ll probably notice a ‘cache’ setting. This lets you set aside a section of your computer’s hard disk to store objects that you’ve seen, just for you. The browser cache works according to fairly simple rules. It will check to make sure that the objects are fresh, usually once a session (that is, once in the current invocation of the browser).

This cache is useful when a client hits the ‘back’ button to go to a page they’ve already seen. Also, if you use the same navigation images throughout your site, they’ll be served from the browser cache almost instantaneously.

Proxy Caches

Web proxy caches work on the same principle, but on a much larger scale. Proxies serve hundreds or thousands of users in the same way; large corporations and ISPs often set them up on their firewalls.

Because proxy caches usually have a large number of users behind them, they are very good at reducing latency and bandwidth usage. That’s because popular objects are requested only once, and served to a large number of clients.

Most Proxy Caches are deployed by large companies or ISPs that want to reduce the amount of Internet bandwidth that they use. Because the cache is shared by a large number of users, there are a large number of shared hits (objects that are requested by a number of clients). Hit rates of 50% or greater are not uncommon. Proxy caches are a type of shared cache.

How Web Caches Work

All caches have a set of rules that they use to determine when to serve an object from the cache, if it’s available. Some of these rules are set in the protocols (HTTP 1.0 and 1.1), and some are set by the administrator of the cache (either the user of the browser cache, or the proxy administrator).

Generally speaking, the most common rules are:

  1. If the object’s headers tell the cache not to keep the document, it won’t.
  2. If the document is authenticated or secure, it won’t be cached.
  3. A cached object is considered fresh (that is, able to be sent to a client without checking with the origin server) if:

Fresh documents are served directly from the cache, without checking with the origin server.

  4. If an object is stale, the origin server will be asked to validate the object; that is, to tell the cache whether the copy that it has is still good. The cache does this by presenting a unique identifier (a validator) to the server. The validator is generated by the server and saved by the cache, along with the object.
    If the object has not changed, the server will send a 304 Not Modified response, and the object will be served from the cache. Otherwise, the new object will be sent to the cache and on to the client.
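The rules above can be sketched as a small decision function. This is only an illustration, not how any particular cache is implemented; the CachedObject fields and the cache_action name are hypothetical, and real caches apply many more rules.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CachedObject:
    """Hypothetical record a cache might keep for one stored object."""
    stored_at: float               # when the copy was stored (seconds since epoch)
    freshness_lifetime: float      # seconds the copy may be served without checking
    validator: Optional[str]       # e.g. an ETag or Last-Modified value, if any
    no_store: bool = False         # headers told the cache not to keep the object
    authenticated: bool = False    # object was authenticated or secure


def cache_action(obj: Optional[CachedObject], now: float) -> str:
    """Return 'serve', 'validate' or 'fetch' for a request arriving at `now`."""
    # Rules 1 and 2: nothing usable is stored, so go to the origin server.
    if obj is None or obj.no_store or obj.authenticated:
        return "fetch"
    # Rule 3: a fresh object is served without contacting the origin server.
    if now - obj.stored_at < obj.freshness_lifetime:
        return "serve"
    # Rule 4: a stale object is validated if the cache saved a validator.
    if obj.validator is not None:
        return "validate"
    return "fetch"
```

A 'validate' result corresponds to the conditional request described in the last rule: the origin server answers either 304 Not Modified or with a complete new object.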

How to Control Caches

There are several tools that Web designers and Webmasters can use to fine-tune how caches will treat their sites. It may require getting your hands a little dirty with the server configuration, but the results are worth it. For details on how to use these tools with your server, see the Implementation sections below.

Meta Tags vs. HTTP Headers

HTML authors can put tags in a document’s <HEAD> section that describe its attributes. These Meta tags are often used to mark a document as "uncacheable" or to force a document to be reloaded regularly, or expire at a certain time.

Meta tags are easy to use, but aren’t very effective. That’s because they’re usually only honored by browser caches, not proxy caches. While it may be tempting to slap a Pragma: no-cache on a home page, it won’t necessarily cause it to be kept fresh.

On the other hand, true HTTP headers give you a lot of control over how both browser caches and proxies handle your objects. They can’t be seen in the HTML, and are usually automatically generated by the Web server. However, you can control them to some degree, depending on the server you use. In the following sections, you’ll see what HTTP headers are interesting, and how to apply them to your site.

HTTP headers are sent by the server before the HTML, and only seen by the client, and any intermediate caches. Typical HTTP 1.1 headers might look like this:

HTTP/1.1 200 OK
Date: Fri, 30 Oct 1998 13:19:41 GMT
Server: Apache/1.3.3 (Unix)
Cache-Control: max-age=3600, must-revalidate
Expires: Fri, 30 Oct 1998 14:19:41 GMT
Last-Modified: Mon, 29 Jun 1998 02:28:12 GMT
ETag: "3e86-410-3596fbbc"
Accept-Ranges: bytes
Content-Length: 1040
Content-Type: text/html

The HTML document would follow these headers, separated by a blank line.

Expires Header

The Expires: HTTP header is particularly useful; it tells all caches how long an object stays fresh. After that time, caches will always check back with the origin server to see whether the document has changed.

Most Web servers allow you to set Expires: response headers in a number of ways. Commonly, they will allow setting an absolute time to expire, a time based on the last time that the client saw the object (last access time), or a time based on the last time the document changed on your server (last modification time).

Expires: headers are especially useful for making static images (like navigation bars and buttons) cacheable. Because they don’t change much, you can set an extremely long expiry time on them, making your site appear much more responsive to your users.

This header is also useful for controlling caching of a page that is regularly changed. For instance, if you update a news page once a day at 6am, you can set the object to expire at that time, so caches will know when to get a fresh copy, without users having to hit ‘reload’.

For example:
Expires: Fri, 30 Oct 1998 14:19:41 GMT
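As a rough sketch of how such a value can be produced, the following Python computes an Expires: header relative to either an access time or a modification time. The expires_header function is a hypothetical helper, and the dates are simply the ones from the example headers earlier in this document.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime


def expires_header(base: datetime, lifetime: timedelta) -> str:
    """Format an Expires: header that falls `lifetime` after `base`.

    `base` must be a datetime in UTC; HTTP dates are always given in GMT.
    """
    return "Expires: " + format_datetime(base + lifetime, usegmt=True)


# Based on access time: one hour after the response was served.
accessed = datetime(1998, 10, 30, 13, 19, 41, tzinfo=timezone.utc)
print(expires_header(accessed, timedelta(hours=1)))
# Expires: Fri, 30 Oct 1998 14:19:41 GMT

# Based on modification time: one day after the file last changed.
modified = datetime(1998, 6, 29, 2, 28, 12, tzinfo=timezone.utc)
print(expires_header(modified, timedelta(days=1)))
```

Most servers compute this for you from their configuration; a script only needs something like this when it generates its own headers.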

Cache-Control Headers

Although the Expires: header is useful, it is still somewhat limited; there are many situations where content is cacheable, but the HTTP 1.0 protocol lacks methods of telling caches what it is, or how to work with it.

HTTP 1.1 introduces a new class of headers, the Cache-Control response headers, which allow Web publishers to define how pages should be handled by caches. They include directives to declare what should be cacheable, what may be stored by caches, modifications of the expiration mechanism, and revalidation and reload controls.

Interesting Cache-Control response headers include:

For example:
Cache-Control: max-age=3600, must-revalidate

If you plan to use the Cache-Control headers, you should have a look at the excellent documentation in the HTTP 1.1 draft; see References and Further Information.
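To make the relationship between Cache-Control: max-age and Expires: concrete, here is a minimal sketch of how a cache might work out how long a response stays fresh. The function names are hypothetical, and the parser is deliberately naive (it does not handle commas inside quoted directive arguments). In HTTP 1.1, max-age takes precedence over Expires: when both are present.

```python
from email.utils import parsedate_to_datetime
from typing import Dict, Optional


def parse_cache_control(value: str) -> Dict[str, Optional[str]]:
    """Split a Cache-Control value into {directive: argument-or-None}."""
    directives: Dict[str, Optional[str]] = {}
    for part in value.split(","):
        name, sep, arg = part.strip().partition("=")
        if name:
            directives[name.lower()] = arg.strip('"') if sep else None
    return directives


def freshness_lifetime(headers: Dict[str, str]) -> Optional[int]:
    """Return the freshness lifetime in seconds, or None if none is stated."""
    cc = parse_cache_control(headers.get("Cache-Control", ""))
    max_age = cc.get("max-age")
    if max_age is not None and max_age.isdigit():
        return int(max_age)
    # Fall back to the HTTP 1.0 mechanism: Expires relative to Date.
    if "Expires" in headers and "Date" in headers:
        delta = (parsedate_to_datetime(headers["Expires"])
                 - parsedate_to_datetime(headers["Date"]))
        return max(0, int(delta.total_seconds()))
    return None


example = {
    "Date": "Fri, 30 Oct 1998 13:19:41 GMT",
    "Cache-Control": "max-age=3600, must-revalidate",
    "Expires": "Fri, 30 Oct 1998 14:19:41 GMT",
}
print(freshness_lifetime(example))  # 3600
```

With the example headers shown earlier, both mechanisms agree on one hour; if they disagreed, an HTTP 1.1 cache would follow max-age.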

Validators

In How Web Caches Work, we explained that validators are used by servers and caches to communicate when an object has changed. By using them, caches avoid having to download the entire object when they already have a copy locally, but they’re not sure if it’s still fresh.

The most common validator is the time that the document last changed, the Last-Modified time. When a cache has an object stored that includes a Last-Modified header, it can use it to ask the server if the object has changed since the last time it was seen, with an If-Modified-Since request.

HTTP 1.1 introduced a new kind of validator called the ETag. ETags are unique identifiers that are generated by the server and changed every time the object does. Because the server controls how the ETag is generated, caches can be surer that if the ETag matches when they make an If-None-Match request, the object really is the same.

Almost all caches use Last-Modified: times in determining if an object is fresh; as more HTTP/1.1 caches come online, ETag: headers will also be used.

Most modern Web servers will generate both ETag: and Last-Modified: validators for static content automatically; you won’t have to do anything. However, they don’t know enough about dynamic content (like CGI, ASP or database sites) to generate them; see Writing Cache-Aware Scripts.
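Seen from the cache’s side, validation amounts to copying the stored validators into a conditional request. The sketch below assumes the cache kept the response headers in a plain dictionary; conditional_headers is a hypothetical name.

```python
from typing import Dict


def conditional_headers(stored: Dict[str, str]) -> Dict[str, str]:
    """Build the conditional request headers for a stored response."""
    request: Dict[str, str] = {}
    # The ETag is echoed back verbatim, quotes included.
    if "ETag" in stored:
        request["If-None-Match"] = stored["ETag"]
    # The Last-Modified date is echoed back exactly as the server sent it.
    if "Last-Modified" in stored:
        request["If-Modified-Since"] = stored["Last-Modified"]
    return request


stored = {
    "Last-Modified": "Mon, 29 Jun 1998 02:28:12 GMT",
    "ETag": '"3e86-410-3596fbbc"',
}
print(conditional_headers(stored))
```

If the server answers 304 Not Modified, the cache keeps serving its copy; any other response replaces it.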

Tips for Building a Cache-Aware Site

Writing Cache-Aware Scripts

By default, CGI scripts won’t return a validator (e.g., a Last-Modified: or ETag: HTTP header). While some scripts can’t benefit from validation techniques, many can. For instance, if your script’s output changes only occasionally, based on external criteria, you can benefit from making it cache-aware. Also, many database-based scripts could benefit from these techniques.

To do this, you’ll need to make the script generate either validator (or both, preferably), and then respond to If-Modified-Since and/or If-None-Match requests with a 304 Not Modified response, when appropriate. If this isn’t possible, consider whether you can use an Expires: header for even a short amount of time.
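A minimal sketch of that logic follows, written in Python for illustration. It assumes the script already knows an ETag and a last modification time for its output (for instance, from a database timestamp); the respond function and its arguments are hypothetical, and ETag comparison is simplified to exact string matching.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime
from typing import Dict, Tuple


def respond(request_headers: Dict[str, str], etag: str,
            last_modified: datetime, body: bytes) -> Tuple[str, Dict[str, str], bytes]:
    """Return (status line, response headers, body) for a cache-aware script."""
    headers = {
        "ETag": etag,
        "Last-Modified": format_datetime(last_modified, usegmt=True),
    }
    if_none_match = request_headers.get("If-None-Match")
    if_modified_since = request_headers.get("If-Modified-Since")
    not_modified = False
    if if_none_match is not None:
        # If-None-Match takes precedence when both conditions are sent.
        tags = [tag.strip() for tag in if_none_match.split(",")]
        not_modified = "*" in tags or etag in tags
    elif if_modified_since is not None:
        try:
            since = parsedate_to_datetime(if_modified_since)
            not_modified = last_modified.replace(microsecond=0) <= since
        except (TypeError, ValueError):
            not_modified = False  # unparseable date: send the full object
    if not_modified:
        return "304 Not Modified", headers, b""
    return "200 OK", headers, body


changed = datetime(1998, 6, 29, 2, 28, 12, tzinfo=timezone.utc)
status, _, _ = respond({"If-None-Match": '"3e86-410-3596fbbc"'},
                       '"3e86-410-3596fbbc"', changed, b"<html>...</html>")
print(status)  # 304 Not Modified
```

The 304 response carries no body, which is where the saving comes from: the cache already has the object and only needed confirmation.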

Remember that if you can represent content as plain files, you should use them; the Web server takes care of generating validators and responding to conditional requests automatically, and it makes your life easier.

Frequently Asked Questions

What are the most important things to make cacheable?

While it’s a good idea to have a comprehensive approach to making your site cache-friendly, a good strategy is to identify the most popular, largest objects and work with them first.

I understand that caching is good, but I need to keep statistics on how many people visit my page!

If you must know every time a page is accessed, select ONE small object on a page, and make it uncacheable, by giving it a suitable Expires: header. For example, you could make a directory /never-cache in your document root, and reference a 1x1 transparent image from it. The Referer header will contain information about what page called it.

Be aware that even this will not give truly accurate statistics about your users, and is unfriendly to the Internet and your users; it generates unnecessary traffic, and forces people to wait for that uncached item to be downloaded. For more information about this, see On Interpreting Access Statistics in the references.

I’ve got a page that is updated often. How do I keep caches from giving my users a stale copy?

The Expires: header is the best way to do this. By setting the server to expire the document based on its modification time, you can automatically have caches mark it as stale a set amount of time after it is changed.

For example, if your site’s home page changes every day at 8am, set the Expires: header for 23 hours after the last modification time. This way, your users will always get a fresh copy of the page.

See also the Cache-Control: max-age header.

How can I see which HTTP headers are set for an object?

To see what the Expires: and Last-Modified: headers are, open the page with Netscape and select ‘page info’ from the View menu. This will give you a menu of the page and any objects (like images) associated with it, along with their details.

To see the full headers of an object, you’ll need to manually connect to the Web server. Using a Telnet client, open the Web server port. Depending on what program you use, you may need to type the port into a separate field, or you may need to connect to www.myhost.com:80 or www.myhost.com 80 (note the space). Consult your Telnet client’s documentation.

Once you’ve opened a connection to the site, type a request for the object. For instance, if you want to see the headers for http://www.myhost.com/foo.html, connect to www.myhost.com, port 80, and type:

GET /foo.html HTTP/1.1 [return]
Host: www.myhost.com [return][return]

Press the Return key every time you see [return]; make sure to press it twice at the end. This will print the headers, and then the full object. To see the headers only, substitute HEAD for GET.
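If a Telnet client isn’t handy, the same check can be scripted. The sketch below is one possible approach using Python’s standard http.client module; build_request and fetch_headers are hypothetical helper names, and www.myhost.com is the placeholder host used above.

```python
import http.client
from typing import List, Tuple


def build_request(host: str, path: str, method: str = "HEAD") -> bytes:
    """Return the raw request text you would otherwise type into Telnet."""
    return f"{method} {path} HTTP/1.1\r\nHost: {host}\r\n\r\n".encode("ascii")


def fetch_headers(host: str, path: str, port: int = 80) -> Tuple[int, List[Tuple[str, str]]]:
    """Send a HEAD request and return (status code, list of response headers)."""
    conn = http.client.HTTPConnection(host, port, timeout=10)
    try:
        conn.request("HEAD", path)
        response = conn.getresponse()
        return response.status, response.getheaders()
    finally:
        conn.close()


if __name__ == "__main__":
    # Shows the request text only; call fetch_headers() against a real host
    # to see the headers it returns.
    print(build_request("www.myhost.com", "/foo.html").decode("ascii"))
```

The blank line at the end of the request (the second "\r\n") is the scripted equivalent of pressing Return twice.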

My pages are password-protected; how do proxy caches deal with them?

By default, pages protected with HTTP authentication are marked private; they will not be cached. However, you can mark authenticated pages public with a Cache-Control: header; HTTP 1.1-compliant caches will then allow them to be cached.

If you’d like the pages to be cacheable, but still authenticated for every user, combine the Cache-Control: public and must-revalidate directives. This tells the cache that it must submit the new client’s authentication information to the origin server before releasing the object from the cache.

Whether or not this is done, it’s best to minimize use of authentication; for instance, if your images are not sensitive, put them in a separate directory and configure your server not to force authentication for it. That way, those images will be naturally cacheable.

Should I worry about security if my users access my site through a cache?

SSL pages are not cached (or decrypted) by proxy caches, so you don’t have to worry about that. However, because caches store non-SSL requests and URLs fetched through them, you should be conscious of security on unsecured sites; an unscrupulous administrator could conceivably gather information about their users.

In fact, any administrator on the network between your server and your clients could gather this type of information. One particular problem is when CGI scripts put usernames and passwords in the URL itself; this makes it trivial for others to find and use their login.

If you’re aware of the issues surrounding Web security in general, you shouldn’t have any surprises from proxy caches.

I’m looking for an integrated Web publishing solution. Which ones are cache-aware?

It varies. Generally speaking, the more complex a solution is, the more difficult it is to cache. In extreme cases, which dynamically generate all content, and don’t provide validators, it may not be cacheable at all. Speak with your vendor’s technical staff for more information.

I made my images expire a month from now, but I need to change them now! How do I make caches refresh them?

The Expires: header can’t be circumvented; unless the cache (either browser or proxy) runs out of room and has to delete the objects, they will be served from the cache until they expire.

The most effective solution is to rename the files; that way, they will be completely new objects, and loaded fresh from the origin server.
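One way to make renaming systematic is to derive part of the file name from the file’s contents, so that a changed image automatically gets a new name. This is an illustrative convention rather than anything this document prescribes; versioned_name is a hypothetical helper, and the use of a content hash is an assumption.

```python
import hashlib
from pathlib import PurePosixPath


def versioned_name(filename: str, content: bytes) -> str:
    """Insert a short content hash before the file extension.

    For example, images/nav.gif becomes images/nav.<8 hex digits>.gif.
    """
    digest = hashlib.sha256(content).hexdigest()[:8]
    path = PurePosixPath(filename)
    return str(path.with_name(f"{path.stem}.{digest}{path.suffix}"))


old = versioned_name("images/nav.gif", b"old image bytes")
new = versioned_name("images/nav.gif", b"new image bytes")
print(old != new)  # True: changed content yields a new URL
```

Pages that reference the image need to be updated to the new name, so those pages should themselves have a short expiry time or be validated regularly.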

If you want to reload an object from a specific cache, you can either force a reload while using the cache (in Netscape, holding down shift while pressing ‘reload’ will do this, by issuing a Pragma: no-cache request header), or have the cache administrator delete the object through their interface.

I run a Web Hosting service for a large number of users. How can I let them publish cache-friendly pages?

If you’re using Apache, consider allowing them to use .htaccess files, and provide appropriate documentation.

Otherwise, you can establish predetermined areas for various caching attributes in each virtual server. For instance, you could specify a directory /cache-1m that will be cached for one month after access, and a /no-cache area that will be served with headers instructing caches not to store objects from it.

Whatever you are able to do, it is best to work with your largest customers first on caching. Most of the savings (in bandwidth and in load on your servers) will be realized from high-volume sites.

A Note About HTTP

HTTP 1.1 compliance is mentioned several times in this document. As of the time it was written, the protocol is a work in progress. Because of this, it is virtually impossible for an application (whether a server, proxy or client) to be truly compliant. However, the protocol has been openly discussed for some time, and feature-frozen for enough time to allow developers to use the ideas contained in it, like Cache-Control and ETags. When HTTP 1.1 is final, expect more vendors to openly state that their applications are compliant.

Implementation Notes – Web Servers

Apache 1.3

Apache (http://www.apache.org/) uses optional modules to include headers, including both Expires: and Cache-Control. Both modules are available in the 1.2 or greater distribution.

The modules need to be built into Apache; although they are included in the distribution, they are not turned on by default. To find out if the modules are enabled in your server, find the httpd binary and run httpd -l; this should print a list of the available modules. The modules we’re looking for are mod_expires and mod_headers.

Once you have an Apache with the appropriate modules, you can use mod_expires to specify when objects should expire, either in .htaccess files or in the server’s access.conf file. You can specify expiry from either access or modification time, and apply it to a file type or as a default. See http://docs.apache.org/mod/mod_expires.html for more information, and speak with your local Apache guru if you have trouble.

To apply Cache-Control headers, you’ll need to use the mod_headers module, which allows you to specify arbitrary HTTP headers for a resource. See http://docs.apache.org/mod/mod_headers.html

Here’s an example .htaccess file that demonstrates use of some headers.

### enable mod_expires
ExpiresActive On
### Expire .gif’s 1 month from when they’re accessed
ExpiresByType image/gif A2592000
### Expire everything else 1 day from when it’s last modified
### (this uses the Alternative syntax)
ExpiresDefault "modification plus 1 day"
### Apply a Cache-Control header to index.html
<Files index.html>
Header append Cache-Control "public, must-revalidate"
</Files>
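The A2592000 value above is a base code followed by a number of seconds: A counts from the time of access and M from the time of modification. As a quick check of the arithmetic, this small sketch builds such codes from a number of days; expires_code is a hypothetical helper, not part of Apache.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def expires_code(base: str, days: int) -> str:
    """Build a mod_expires time code such as A2592000.

    `base` is "access" or "modification"; the number is in seconds.
    """
    prefixes = {"access": "A", "modification": "M"}
    return f"{prefixes[base]}{days * SECONDS_PER_DAY}"


print("ExpiresByType image/gif", expires_code("access", 30))
# ExpiresByType image/gif A2592000
```

Here one month is taken to be 30 days, which is what 2592000 seconds corresponds to.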

Netscape Enterprise 3.5

Netscape Enterprise Server (http://www.netscape.com/) does not provide any obvious way to set Expires: headers. However, it has supported HTTP 1.1 features since version 3.0. This means that HTTP 1.1 caches (proxy and browser) will be able to take advantage of Cache-Control settings you make.

To use Cache-Control headers, choose Content Management | Cache Control Directives in the administration server. Then, using the Resource Picker, choose the directory where you want to set the headers. After setting the headers, click ‘OK’. For more information, see http://developer.netscape.com/docs/manuals/enterprise/admnunix/content.htm#1006282

MS IIS 4.0

Microsoft’s Internet Information Server (http://www.microsoft.com/) makes it very easy to set headers in a somewhat flexible way. Note that this is only possible in version 4 of the server, which will run only on NT Server.

To specify headers for an area of a site, select it in the Administration Tools’ interface, and bring up its properties. After selecting the HTTP Headers tab, you should see two interesting areas: Enable Content Expiration and Custom HTTP headers. The first should be self-explanatory, and the second can be used to apply Cache-Control headers.

IIS also allows you to set HTTP headers in ASP pages. To do this, just use the properties of the Response object in your page, like this:

<% Response.Expires=1440 %>

specifying the number of minutes from last access to expire the object. Likewise, absolute expiry time can be set like this:

<% Response.ExpiresAbsolute=#May 31,1996 13:30:15 GMT# %>

Cache-Control headers can be added like this:

<% Response.CacheControl="public" %>

It is also possible to set headers from ISAPI modules; refer to MSDN for details.

Lotus Domino 4.6

Lotus’ (http://www.lotus.com/) servers are notoriously difficult to cache; they don’t provide any validators, so both browser and proxy caches can only use default mechanisms (i.e., once per session, and a few minutes of ‘fresh’ time, usually) to cache any content from them, even static images.

Even if this limitation is overcome, Notes’ habit of referring to the same object by different URLs (depending on a variety of factors) bars any measurable gains. There is also no documented way to set an Expires: header.

Because of all of this, Domino servers can seem quite slow. Version 5 of the server is in beta testing as of this writing, and claims to address some of these concerns. From preliminary testing, it appears that while some gains have been made, there still aren’t any controls available to developers to fine-tune how their pages are cached.

References and Further Information

Cache Now! Campaign

http://vancouver-webpages.com/CacheNow/
Cache Now! is a campaign to raise awareness of caching, from all perspectives. It contains more detail and tips than this document. An excellent resource.

HTTP 1.1 Specification

http://www.w3.org/Protocols/
The HTTP 1.1 spec has many extensions for making pages cacheable, and is the authoritative guide to implementing the protocol. See sections 13, 14.9, 14.21, and 14.25.

Web Caching Overview

http://www.cs.rutgers.edu/~davison/web-caching/
Another introduction to caching concepts.

On Interpreting Access Statistics

http://www.cranfield.ac.uk/docs/stats/
Jeff Goldberg’s informative paper on why you shouldn’t rely on access statistics and hit counters.

About This Document

This document is copyright © 1998 Mark Nottingham <mnot@pobox.com>. It may be freely distributed in any medium as long as the text (including this notice) is kept intact and the content is not modified, edited, added to or otherwise changed. Formatting and presentation may be modified. All trademarks within are property of their respective holders.

Although the author believes the contents to be accurate at the time of writing, he assumes no liability for them, their application or any consequences thereof. If any misrepresentations, errors or other need for clarification is found, please contact the author immediately.

The latest copy of this document can always be obtained in a variety of formats from

http://www.pobox.com/~mnot/cache_docs/


Version 0.6 – November 3, 1998 – PRE-PUBLICATION VERSION – DO NOT DISTRIBUTE