Ask SIPB - March 1, 2016

This column marks the return of Ask SIPB, last published in 2011. This week's column covers parts of MIT's policies on data retention.

What does MIT know about you?

In particular, when MIT learns something about you, what does it remember, and for how long? In short, what are MIT's data-retention policies, and how do they affect you?

Data-retention policies matter because privacy is a basic human right. Part of what privacy means is the knowledge that even the benign, ordinary aspects of living your life won't somehow come back to haunt you years later—for example, because of a change in the law, or a change in social mores, or because someone is trying to dig up dirt on you. Without the ability to experiment, to take risks, to try on different personalities and see what happens, human development is stunted. Living in a panopticon, where everything you do is recorded forever by unseen observers who may later use it against you, is fundamentally dehumanizing—by design. Reliable and reasonable data-retention policies are thus one small piece in creating a livable society.

The data-retention policies in this article are mostly (but not entirely) about data which is transactional in nature, meaning data that MIT gathers in order to do some other job, but (usually) not data you explicitly provided to MIT. (Sometimes, certain types of this transactional data are called metadata, but data is data, and so-called metadata is often the most dangerous kind.) So, for example, we are not talking here about your educational records (covered under FERPA and other laws), your medical records (covered under HIPAA and other laws), or your email (covered under ECPA and other laws).

Instead, we're looking at issues such as

Use of card keys
Surveillance cameras (public and in-dorm)
Backups, clusters, and dialups
Internet-based communication in general

Caveats

Data-retention policies are neither panaceas nor complete descriptions of the world. For example:

Don't assume that just because MIT throws away a given piece of data after a certain time, you can do something illegal in the hopes that MIT will have discarded the evidence before you get caught. For example, if there's an active investigation, MIT may be explicitly retaining any data related to that investigation—possibly for an arbitrary amount of time. The Institute could be doing this on its own initiative, or because it has received a subpoena brought by a third party, giving it no choice but to retain data and to hand it over to an outside party.
If whatever you're doing reaches outside of MIT (say, because you're talking to some Internet-based service off-campus), then that service, and all the network hops in between, may also be retaining data about what you're doing. If someone off campus gets data of yours and wants to retain it and mine it forever, MIT's policies have no bearing.
Not all of the Institute's policies are monolithic: big labs such as CSAIL and the Media Lab often have their own infrastructure and their own policies about what data is collected and how long it's retained. This means that if any of your communications start, end, or transit their networks, they may have policies that affect those communications. Also, there are several popular services which many believe are run by IS&T, but which are instead run by other groups not subject to IS&T's policies. One well-known example is SIPB's scripts.mit.edu.
Don't simply assume that anonymization services such as Tor will protect you. Traffic analysis (such as netflow data or packet captures) can often unmask such services, especially if you are one of only a small number of on-campus users at the time. (We'll discuss netflow and packet capture later on.)
MIT's published policies on data-retention are incomplete, and don't cover certain common and important cases. Further, policies can change over time, often without explicit warning. We'll discuss these below.
While MIT must comply with warrants and subpoenas, it has been known to release information without waiting for them, as was documented in an article in The Tech about MIT's handling of the Aaron Swartz case.
While this page discusses official policies (where they exist and are publicly documented), all organizations can have bad actors: people who exceed their authority and access data they shouldn't. (Some organizations do this on a widespread, corrupt, and organizationally-sanctioned scale, such as the parallel construction issues with the DEA, but typically the issue is untrustworthy individuals, such as NSA's LOVEINT revelations or problems with police using their own databases to stalk victims.) Even if evidence against you was improperly obtained, even by a known bad actor, you may still take the fall for an illegal act. (There are exceptions to the doctrine of the "fruit of the poisonous tree." Furthermore, not everything bad happens in a court of law: reputational damage or the effort it takes to mount a defense is often its own punishment.)

So let's assume you're not doing anything illegal, that your data isn't leaving campus, and is covered by MIT's general policies and not some more-specific policies of individual labs or departments. What is the Institute collecting about you, and how long is it keeping it?

Meatspace

When it comes to access to physical spaces and video surveillance, turn to the Security and Emergency Management Office (SEMO). Our email asking them to confirm that the policies posted on their website are current (and not stale or abandoned) was promptly answered by their Manager of Facilities Operations, Thomas W. Komola. In addition, a message to Housing was answered by (since departed) Dean Henry Humphreys of DSL, confirming that all card key and dorm-visitor data is kept by SEMO, not DSL, and that DSL adheres to MIT's general privacy policies. (Note that the general privacy policies page says nothing specifically about what information Housing/DSL may collect or retain; it's generic to the whole Institute.)

Card keys. SEMO's posted policies clearly state that card key data is kept for 14 days and then erased, and can be used only (a) for debugging system problems or (b) as part of a criminal investigation by the MIT Campus Police. (Left unstated, as in all privacy policies, is that any outside party with a warrant or subpoena might also be legally authorized to get this data, but only if the request is delivered in a timely fashion: wait more than two weeks, and there's no data to retrieve.) SEMO states categorically on their page that card key tracking data will not be used for active tracking of individuals or groups.

Surveillance cameras. SEMO's policy page again states 14-day retention, with no audio. This includes cameras in the dorms, as well as cameras installed elsewhere, such as outside or at ATMs.

Dorm visitors. When a visitor arrives at a dorm, they are required to check in at a desk staffed by AlliedBarton employees. Their MIT ID is scanned, or, if they don't have one, some other ID (such as a driver's license) is recorded instead. This information also goes to SEMO, not DSL, and is likewise deleted after 14 days.

Cyberspace

Information Systems & Technology (IS&T) handles most of the networking on campus, with the exception of large labs such as CSAIL, the Media Lab, and so forth, which often have their own internal infrastructure. Some of their policies are posted online, but there are also large gaps, and trying to confirm validity or fill in gaps was much less successful with IS&T than with other MIT departments.

Backups. IS&T maintains a service called CrashPlan, which allows everyone on campus to keep their files backed up. The CrashPlan service (and its parent company, Code42) sees only encrypted data and does not itself have keys to decrypt it; instead, MIT's management server holds an individual key for each user, and the backups are encrypted before the data is handed to CrashPlan's servers. This means that, if MIT somehow lost all its local copies and backups of its users' backup keys, the backed-up data would be irretrievable. MIT does not allow users to choose their own keys. This reduces the probability of end-users losing their keys and thus losing their backups (a likely scenario in a disk crash), but it also means that, were MIT to receive a subpoena for a user's backed-up data, MIT could comply and hand over everything you've backed up, which might also include credentials to non-MIT services if stored in your backed-up files. If you want to keep your data safe from such scenarios, you'll need to encrypt it before CrashPlan is asked to back it up. In other words, keep it encrypted on-disk, or decrypt to a location that you haven't asked CrashPlan to back up.

Clusters and dialups. IS&T dialups (the machines you reach if you type "ssh dialup.athena.mit.edu", so named because they used to have banks of modems connected to them) run tcpspy. This program logs all TCP connections on the machine, ten times per second, to logfiles on the local filesystem. These logs are kept for seven days. (The dialups have been targets of attacks in the past, and compromising one can allow attacking hundreds of users simultaneously; forensics after an attack may be one reason that connection information is logged.) It is unclear whether these log files are themselves copied elsewhere or backed up. They are also vulnerable to manipulation if root is compromised on the dialup, though a root compromise there could much more severely impact users directly. In addition, cluster machines log which binaries are being run, though IS&T explains that such logging is intended not to identify individual users. (Whether such identification might be made via fusion with other sources, such as netflow, isn't answered and may not have been considered.)

Networking. This covers a lot of ground. There are a series of steps by which your computer can establish communication with another, and each of those steps is subject to possible monitoring and recording. IS&T documents what it does with some of these steps, but definitely not all of them, and finding out exactly what IS&T is doing turns out to be surprisingly difficult.

Let's walk through some steps:

Step 1 of getting on the network at all is getting an IP address. Some machines on some wired connections have permanently-assigned IP addresses, but the majority use DHCP. IS&T documents that, under normal circumstances, it keeps DHCP logs for 30 days. This means that, for 30 days, IS&T—or an outside party with a subpoena—can figure out which MAC address, and hence which machine, had a particular Internet-visible IP address. For example, a copyright holder unhappy about a BitTorrent user would typically ask MIT for DHCP information to identify the machine and thus, presumably, the user, but would have to do so within 30 days unless MIT was already keeping information longer about that particular MAC or IP address for some reason. (Remember also that, if you use the MIT SECURE wireless network, you are identifying yourself via Kerberos principal and password in addition to your machine's MAC address, creating an even stronger presumption that it was you—and not just your machine—using the network at a particular time.)
Step 2 of doing much of anything on the network involves using the Domain Name System, which resolves names such as www.mit.edu to particular IP addresses, such as 104.96.184.107. Some DNS servers keep records of requests, but IS&T doesn't document whether theirs do, or how long they might keep this information around. This matters because circumstantial cases are sometimes built on whether one machine communicated with another, and logging DNS queries is one way this is done. Of course, there are many reasons why your machine might be making DNS queries that have nothing to do with your behavior: web pages often load content from dozens of other sites (trackers and ads), and operating systems often talk to dozens of other servers without notifying their users (telemetry to the mothership, or looking for updates). Finally, note that the above example of www.mit.edu demonstrates that even domain names that seem to be inside MIT may well instead be external hosts and use external DNS servers, each of which may have its own policies: the IP address above goes to Akamai, not MIT.
Step 3 of establishing communication sometimes involves using a VPN in order to allow the traffic to traverse an untrusted intermediary network; this is often true of off-campus users who are trying to use MIT's network, but this is not the only case. Users of MIT's VPN may be subject to monitoring of their connections to the VPN or where they connect to at either end of the tunnel, but again, IS&T does not document what data is retained or for how long.
Step 4, very often, involves fetching a web page. If that page is on one of IS&T's main servers (web.mit.edu or www.mit.edu), they document that their retention policy is 90 days—again modulo unusual circumstances such as an outstanding investigation. Of course, typically you're going to some other web server on campus or more likely off-campus, and IS&T's policies don't apply.
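To make the Step 1 point concrete, here is a minimal sketch of how a retention limit bounds the question "which machine had this IP address at that time?" The 30-day figure is the one IS&T documents; everything else (the Lease record, the who_had_ip function, the addresses, and the dates) is hypothetical and made up for illustration. It is not a description of how IS&T's logs are actually stored or queried.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Retention period the column reports for IS&T's DHCP logs.
DHCP_RETENTION = timedelta(days=30)


@dataclass(frozen=True)
class Lease:
    """One hypothetical DHCP log entry: which MAC held which IP, and when."""
    ip: str
    mac: str
    start: datetime
    end: datetime


def who_had_ip(leases, ip, when, now, retention=DHCP_RETENTION) -> Optional[str]:
    """Return the MAC that held `ip` at time `when`, or None if unknown.

    Entries whose lease ended more than `retention` before `now` are treated
    as already purged, which is the effect a retention limit is meant to have.
    """
    cutoff = now - retention
    for lease in leases:
        if lease.end < cutoff:
            continue  # too old: assumed deleted under the retention policy
        if lease.ip == ip and lease.start <= when < lease.end:
            return lease.mac
    return None


# Made-up log entries using documentation-only addresses (192.0.2.0/24).
leases = [
    Lease("192.0.2.10", "aa:bb:cc:00:00:01",
          datetime(2016, 1, 5, 9), datetime(2016, 1, 5, 17)),
    Lease("192.0.2.10", "aa:bb:cc:00:00:02",
          datetime(2016, 2, 20, 9), datetime(2016, 2, 20, 17)),
]
now = datetime(2016, 3, 1)

# A request about ten days ago can still be answered...
recent = who_had_ip(leases, "192.0.2.10", datetime(2016, 2, 20, 12), now)
# ...but one about eight weeks ago cannot, because that entry is gone.
stale = who_had_ip(leases, "192.0.2.10", datetime(2016, 1, 5, 12), now)
print(recent, stale)
```

The same lookup gives different answers depending only on how long the requester waited, which is the whole point of a retention limit. An active investigation, as noted in the caveats, would amount to skipping the cutoff for the records involved.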

The description above hides an elephant in the room, however. Many large networks employ intrusion detection systems. An IDS typically watches multiple points in the network, looking for unusual patterns that indicate an attack, and sends alerts to sysadmins if it finds something noteworthy. Many of them can record (and keep for arbitrary amounts of time) some proportion of the packets going by. Such netflow data may, in the extreme, include packet capture of every bit of every packet (what the NSA has at times called full take), though usually such systems record only packet headers, or headers plus the first few dozen bytes, of some subset of the packets; this keeps bandwidth and storage requirements reasonable. An early IDS from the 1990s named Network Flight Recorder demonstrates this point: it was so named in analogy to the digital flight recorders carried by aircraft to figure out why a crash occurred.

Being able to replay netflows or entire packet captures from the past is often invaluable for figuring out how an attack happened—or, in some organizations, to figure out who leaked confidential information. But such detailed recordings may also bypass other data-retention policies, because netflows and packet captures which are complete enough can render the destruction of other logs moot, such as those kept by web servers. Furthermore, such data may include far more information than a typical server log might—such as the actual contents of packets, many of which are typically unencrypted.
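The difference between flow summaries and partial packet capture is easier to see in a small sketch. The Packet type, both function names, the 32-byte cutoff, and the sample traffic below are all assumptions made up for illustration; real monitoring systems have their own formats, and nothing here describes what IS&T actually runs.

```python
from collections import defaultdict
from typing import NamedTuple


class Packet(NamedTuple):
    """A simplified, hypothetical packet: addressing fields plus payload bytes."""
    src: str
    dst: str
    sport: int
    dport: int
    proto: str
    payload: bytes


def flow_records(packets):
    """Summarize packets per (src, dst, sport, dport, proto): counts only."""
    flows = defaultdict(lambda: {"packets": 0, "bytes": 0})
    for p in packets:
        key = (p.src, p.dst, p.sport, p.dport, p.proto)
        flows[key]["packets"] += 1
        flows[key]["bytes"] += len(p.payload)
    return dict(flows)


def truncated_capture(packets, snaplen=32):
    """Keep addressing fields plus only the first `snaplen` payload bytes."""
    return [p._replace(payload=p.payload[:snaplen]) for p in packets]


# Made-up traffic: an unencrypted web request followed by more data.
packets = [
    Packet("192.0.2.10", "198.51.100.7", 50000, 80, "tcp",
           b"GET /private/page HTTP/1.1\r\nHost: example.org\r\n\r\n"),
    Packet("192.0.2.10", "198.51.100.7", 50000, 80, "tcp", b"x" * 100),
]

flows = flow_records(packets)         # who talked to whom, and how much
partial = truncated_capture(packets)  # the same, plus the start of each payload
print(flows)
print(partial[0].payload)
```

Even the flow summary alone records that one machine talked to another, and how much. Keeping just the first few dozen bytes of an unencrypted request is enough to recover the path that was fetched, which is the kind of detail a web server's own log would otherwise be the only record of.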

IS&T does not document whether it uses such an IDS or records netflows or packet captures. However, a Tech article about Aaron Swartz published in 2012 does document that IS&T did at the time (and presumably does today) record at least netflow data (packet capture was left unspecified), and that it may not wait for a subpoena or warrant to hand it over, either. Doing netflow or packet capture without notification has caused recent faculty controversy at Berkeley. We asked the (since departed) Manager of Security at IS&T, Harry Hoffman, what sorts of netflow or packet capture they might be doing and how long they might be retaining such data, but he referred the query to their Director of Communications, Sarah Korval. She has acknowledged repeated messages but has provided no information on this topic across several months of queries. In mid-January she said that IS&T is "currently working to update the data retention policies on our website," though no updates have apparently yet been made.

Conclusions

When it comes to MIT's data-retention policies, those affecting access to physical spaces and surveillance of those physical spaces are quite restrictive and well-documented. SEMO's web pages are clear and its employees are quick to answer questions about them.

On the other hand, when it comes to MIT's computational infrastructure, the picture is much more fragmented. In those areas where policies have been posted, information is generally retained much longer (IS&T keeps DHCP logs about twice as long, and web server logs about six times as long, as SEMO keeps its records; the seven-day dialup connection logs are the exception). More concerning, many important aspects aren't documented at all, and official channels appear effectively useless at either verifying what's already posted or answering questions about what's not.
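The comparison above can be checked with quick arithmetic from the retention periods reported earlier in this column. The variable names are ours, and only the two IS&T periods that exceed SEMO's are included; the seven-day dialup connection logs are left out because they are shorter than SEMO's 14 days.

```python
# Retention periods, in days, as reported earlier in this column.
SEMO_DAYS = 14
IST_DAYS = {"DHCP logs": 30, "web server logs": 90}

# How many times longer than SEMO's period each IS&T log is kept.
ratios = {name: days / SEMO_DAYS for name, days in IST_DAYS.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x SEMO's 14 days")
```

This works out to roughly 2.1 times for DHCP logs and 6.4 times for web server logs.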

In both cases, it would also be helpful for policy pages to be dated, for links to older versions to be posted (to make it possible to see what changed, and when), and for those pages to be reviewed every so often (perhaps annually) and for that review date to also be posted on the relevant pages, so it's obvious at a glance that those responsible for those systems have ensured that their posted policies match reality.

So, where does this leave you? Barring unusual circumstances—such as an ongoing investigation—you can be reasonably assured that SEMO's details of your physical movements are likely gone after two weeks. Some details of your on-campus electronic activities, when using IS&T's infrastructure, are likely gone after three months—but there are too many undocumented places where data may accumulate, without published policies about how long it may persist, to have much assurance that this is always the case. And, of course, the majority of the traces you leave online are in networks and servers that aren't managed by IS&T at all, each with their own policies. Be careful out there.

To ask us a question, send email to sipb@mit.edu. We'll try to answer you quickly, and we can address your question in our next column. You can also stop by our office in W20-557 or call us at x3-7788 if you need help. Copies of each column and pointers to additional information are posted on our website: http://www.mit.edu/~asksipb/