Ask SIPB - March 1, 2016

This column marks the return of Ask SIPB, last published in 2011. This week's column covers parts of MIT's policies on data retention.

What does MIT know about you?

In particular, when MIT learns something about you, what does it remember, and for how long? In short, what are MIT's data-retention policies, and how do they affect you?

Data-retention policies matter because privacy is a basic human right. Part of what privacy means is the knowledge that even the benign, ordinary aspects of living your life won't somehow come back to haunt you years later—for example, because of a change in the law, or a change in social mores, or because someone is trying to dig up dirt on you. Without the ability to experiment, to take risks, to try on different personalities and see what happens, human development is stunted. Living in a panopticon, where everything you do is recorded forever by unseen observers who may later use it against you, is fundamentally dehumanizing—by design. Reliable and reasonable data-retention policies are thus one small piece in creating a livable society.

The data-retention policies in this article are mostly (but not entirely) about data which is transactional in nature, meaning that they're about data MIT gathers in order to do some other job, but (usually) not data you explicitly provided to MIT. (Sometimes, certain types of this transactional data are called metadata, but data is data, and so-called metadata is often the most dangerous kind.) So, for example, we are not talking here about your educational records (covered under FERPA and other laws), your medical records (covered under HIPAA and other laws), or your email (covered under ECPA and other laws).

Instead, we're looking at issues such as

Caveats

Data-retention policies are neither panaceas nor complete descriptions of the world. For example:

So let's assume that you're not doing anything illegal, that your data isn't leaving campus, and that it is covered by MIT's general policies rather than by the more-specific policies of individual labs or departments. What is the Institute collecting about you, and how long is it keeping it?

Meatspace

When it comes to access to physical spaces and video surveillance, turn to the Security and Emergency Management Office (SEMO). An email asking SEMO to confirm that the policies posted on its website are current (and not stale or abandoned) was promptly answered by its Manager of Facilities Operations, Thomas W. Komola. In addition, a message to Housing was answered by (since departed) Dean Henry Humphreys of DSL, confirming that all card key and dorm-visitor data is kept by SEMO, not DSL, and that DSL adheres to MIT's general privacy policies. (Note that the general privacy-policy page says nothing specifically about what information Housing/DSL may collect or retain; it's generic to the whole Institute.)

Card keys. SEMO's posted policies clearly state that card key data is kept for 14 days and then erased, and can be used only for (a) debugging system problems or (b) as part of a criminal investigation by the MIT Campus Police. (Left unstated, like all privacy policies, is that any outside party with a warrant or subpoena might also be legally authorized to get this data, but only if delivered in a timely fashion—wait more than two weeks, and there's no data to retrieve.) SEMO states categorically on their page that card key tracking data will not be used for active tracking of individuals or groups.

Surveillance cameras. SEMO's policy page again states 14-day retention, with no audio. This includes cameras in the dorms, as well as cameras installed elsewhere, such as outside or at ATMs.

Dorm visitors. When a visitor arrives at a dorm, they are required to check in at a desk staffed by Allied-Barton employees. Their MIT ID is scanned, or, if they don't have one, some other ID (such as a driver's license) is recorded instead. This information also goes to SEMO, not DSL, and is likewise deleted in 14 days.
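
The 14-day window that SEMO describes for card key, camera, and visitor records amounts to a rolling purge. The following is a minimal Python sketch of that idea, not SEMO's actual system: the record layout, the function names, and the in-memory list are all assumptions made for illustration. Only the 14-day figure comes from the posted policy.

```python
from datetime import datetime, timedelta

# Retention window taken from SEMO's posted policy; everything else is hypothetical.
RETENTION = timedelta(days=14)


def purge_expired(records, now):
    """Return only the records still inside the retention window.

    Each record is assumed to be a dict with a 'timestamp' datetime.
    """
    cutoff = now - RETENTION
    return [r for r in records if r["timestamp"] > cutoff]


def is_retrievable(event_time, request_time):
    """True if a request arriving at request_time could still find the event."""
    return request_time - event_time < RETENTION


if __name__ == "__main__":
    now = datetime(2016, 3, 1, 12, 0)
    records = [
        {"door": "W20", "timestamp": now - timedelta(days=3)},
        {"door": "W20", "timestamp": now - timedelta(days=20)},
    ]
    kept = purge_expired(records, now)
    print(len(kept), "record(s) remain after the purge")
    print("20-day-old event retrievable:", is_retrievable(now - timedelta(days=20), now))
```

The second function captures the point about subpoenas: a request that arrives after the window has closed finds nothing, assuming the purge really runs as posted.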

Cyberspace

Information Systems & Technology (IS&T) handles most of the networking on campus, with the exception of large labs such as CSAIL, the Media Lab, and so forth, which often have their own internal infrastructure. Some of their policies are posted online, but there are also large gaps, and trying to confirm validity or fill in gaps was much less successful with IS&T than with other MIT departments.

Backups. IS&T maintains a service called CrashPlan, which allows everyone on campus to keep their files backed up. The CrashPlan service (and its parent company, Code42) sees only encrypted data and does not itself have keys to decrypt it. Instead, MIT's management server holds an individual key for each user, and the backups are encrypted before the data is handed to CrashPlan's servers. This means that, if MIT somehow lost all its local copies and backups of its users' backup keys, the backed-up data would be irretrievable. MIT does not allow users to choose their own keys. This reduces the probability of end users losing their keys and thus losing their backups (a likely scenario in a disk crash), but it also means that, were MIT to receive a subpoena for your backed-up data, MIT could comply and hand over everything you've backed up, which might include credentials to non-MIT services if they are stored in your backed-up files. If you want to keep your data safe from such scenarios, you'll need to encrypt it before CrashPlan is asked to back it up. In other words, keep it encrypted on disk, or decrypt to a location that you haven't asked CrashPlan to back up.
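
The last piece of advice, decrypting to a location that is not backed up, is easy to get wrong if the scratch directory happens to sit inside a backed-up folder. Below is a small Python sketch of that check. It does not talk to CrashPlan at all: the list of backup roots stands in for whatever folders you have selected for backup, and the paths and function name are hypothetical.

```python
from pathlib import PurePosixPath


def is_backed_up(path, backup_roots):
    """True if path is one of the backup roots or lies beneath one."""
    p = PurePosixPath(path)
    for root in backup_roots:
        r = PurePosixPath(root)
        if p == r or r in p.parents:
            return True
    return False


if __name__ == "__main__":
    # Hypothetical selection: the whole home directory is backed up.
    roots = ["/home/alice"]
    for candidate in ["/home/alice/secrets/plain.txt", "/tmp/scratch/plain.txt"]:
        print(candidate, "->", "backed up" if is_backed_up(candidate, roots) else "not backed up")
```

This is a purely textual comparison. It does not resolve symbolic links or ".." components, so treat it as a sanity check rather than a guarantee.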

Clusters and dialups. IS&T dialups (the machines you reach if you type "ssh dialup.athena.mit.edu", so named because they used to have banks of modems connected to them) run tcpspy. This program logs all TCP connections on the machine, ten times per second, to logfiles on the local filesystem. These logs are kept for seven days. (The dialups have been targets of attacks in the past, and compromising one can allow attacking hundreds of users simultaneously; forensics after an attack may be one reason that connection information is logged.) It is unclear whether these log files are themselves copied elsewhere or backed up. They are also vulnerable to manipulation if root is compromised on the dialup, though a root compromise there could much more severely impact users directly. In addition, cluster machines log which binaries are being run, though IS&T explains that such logging is intended not to identify individual users. (Whether such identification might be made via fusion with other sources, such as netflow, isn't answered and may not have been considered.)

Networking. This covers a lot of ground. There are a series of steps by which your computer can establish communication with another, and each of those steps is subject to possible monitoring and recording. IS&T documents what it does with some of these steps, but definitely not all of them, and finding out exactly what IS&T is doing turns out to be surprisingly difficult.

Let's walk through some steps:

The description above hides an elephant in the room, however. Many large networks employ intrusion detection systems. An IDS typically watches multiple points in the network, looking for unusual patterns that indicate an attack, and sends alerts to sysadmins if it finds something noteworthy. Many of them can record (and keep for arbitrary amounts of time) some proportion of the packets going by. Such netflow data may, in the extreme, include packet capture of every bit of every packet (what the NSA has at times called full take), though usually such systems record only packet headers, or headers plus the first few dozen bytes, of some subset of the packets; this keeps bandwidth and storage requirements reasonable. An early IDS from the 1990s named Network Flight Recorder demonstrates this point: it was named by analogy to the flight recorders carried by aircraft to figure out why a crash occurred.
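
To see why recording only headers plus a few dozen bytes keeps storage manageable, here is a rough Python sketch of capping each packet at a fixed capture length. The 64-byte limit and the packet sizes are assumptions chosen for illustration, not figures from any real IDS or from IS&T.

```python
def truncate(packet, capture_len=64):
    """Keep only the first capture_len bytes of a packet (headers plus a little payload)."""
    return packet[:capture_len]


def stored_bytes(packet_sizes, capture_len=None):
    """Total bytes stored for packets of the given sizes.

    capture_len=None models full capture; otherwise each packet is capped.
    """
    if capture_len is None:
        return sum(packet_sizes)
    return sum(min(size, capture_len) for size in packet_sizes)


if __name__ == "__main__":
    sizes = [1500, 1500, 40, 576]  # made-up packet sizes in bytes
    print("full capture:", stored_bytes(sizes), "bytes")
    print("truncated capture:", stored_bytes(sizes, capture_len=64), "bytes")
```

Even truncated records still show who talked to whom and when, which is why the retention period for them matters as much as it does for full captures.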

Being able to replay netflows or entire packet captures from the past is often invaluable for figuring out how an attack happened—or, in some organizations, to figure out who leaked confidential information. But such detailed recordings may also bypass other data-retention policies, because netflows and packet captures which are complete enough can render the destruction of other logs moot, such as those kept by web servers. Furthermore, such data may include far more information than a typical server log might—such as the actual contents of packets, many of which are typically unencrypted.

IS&T does not document whether it uses such an IDS or records netflows or packet captures. However, a Tech article about Aaron Swartz published in 2012 documents that IS&T did at the time (and presumably does today) record at least netflow data (packet capture was left unspecified), and that it may not wait for a subpoena or warrant to hand it over, either. Doing netflow or packet capture without notification has recently caused faculty controversy at Berkeley. We asked the (since departed) Manager of Security at IS&T, Harry Hoffman, what sorts of netflow or packet capture they might be doing and how long they might be retaining such data. He referred the query to their Director of Communications, Sarah Korval, who has acknowledged repeated messages but has provided no information on this topic across several months of queries. In mid-January she said that IS&T is "currently working to update the data retention policies on our website," though no updates have apparently been made yet.

Conclusions

When it comes to MIT's data-retention policies, those affecting access to physical spaces and surveillance of those physical spaces are quite restrictive and well-documented. SEMO's web pages are clear and its employees are quick to answer questions about them.

On the other hand, when it comes to MIT's computational infrastructure, the picture is much more fragmented. In those areas where policies have been posted, information is retained much longer (IS&T retains various logs for two to six times as long as SEMO does). More worrying, many important aspects aren't documented at all, and official channels appear effectively useless at either verifying what's already posted or answering questions about what's not.

In both cases, it would also be helpful for policy pages to be dated, for links to older versions to be posted (to make it possible to see what changed, and when), and for those pages to be reviewed every so often (perhaps annually) and for that review date to also be posted on the relevant pages, so it's obvious at a glance that those responsible for those systems have ensured that their posted policies match reality.

So, where does this leave you? Barring unusual circumstances—such as an ongoing investigation—you can be reasonably assured that SEMO's details of your physical movements are likely gone after two weeks. Some details of your on-campus electronic activities, when using IS&T's infrastructure, are likely gone after three months—but there are too many undocumented places where data may accumulate, without published policies about how long it may persist, to have much assurance that this is always the case. And, of course, the majority of the traces you leave online are in networks and servers that aren't managed by IS&T at all, each with their own policies. Be careful out there.


To ask us a question, send email to sipb@mit.edu. We'll try to answer you quickly, and we can address your question in our next column. You can also stop by our office in W20-557 or call us at x3-7788 if you need help. Copies of each column and pointers to additional information are posted on our website: http://www.mit.edu/~asksipb/