THE VICTIM CLOUD: 2003 – 2010
FILTERING: SCIENTISTS AND HACKERS
MAKING SPAM SCIENTIFIC, PART I
We have machines capable of reading, analyzing, judging any written text. But it is precisely the reliability of the instruments on which we must run some checks.
—Italo Calvino, If on a Winter’s Night a Traveler
If you can’t get to grips with the spammers using law, censure, and protest, can you instead get to grips with spam itself? How do you get a handle on it, and how do you make it something you can measure and quantify, talk about coherently, understand—and therefore attack? It is an etymologically restless thing, at once noun and verb, that thrives wherever we have trouble defining it clearly enough to exclude it and make precise rules and laws for it. It is the subjective nemesis of the equally subjective category of germane human interaction. How do you turn it into a material you can work on? How do you make spam an object?
These are matters of practical interest for those seeking to analyze spam and make it something they can stop. This is a story about two such groups—scientists and hackers—and how they went about drawing lines around spam, defining edges and black boxes and criteria and workflows and making it into something to which they could apply tools. This is also necessarily a story about everything left out of the lines they drew, and how in changing the shape of spam it eluded them and transformed, both as a technology and as a set of practices, into something far stranger than before: a new object with a new infrastructure behind it, produced by a new class of criminals.
Spam comes into a computer lab with as much of a halo of strangeness as a chunk of cavorite—H. G. Wells’s fantasy material that resists gravity, with which scientists fall up to the moon—and with similarly strange and innovation-demanding effects. After all, what is this human-machine, innovative-criminal, social-technological, maddening yet unstoppable thing? It’s a practice, and a communally expressed attitude, but also an artifact of sorts, something that exists in the singular and ostensive—a “spam message,” this spam—but that also demands analysis in the plural as spam, the problem, on a larger scale. How do you specify this concept, making it productive of reproducible and falsifiable results that are capable of being benchmarked and tested? Spam is constantly fluctuating; the amount you receive depends on your ISP, what filters your ISP uses, your operating system and mail application, the number and type of the accounts that you use, and even the season and the time of day. Spam may seem at first like an ideal subject for scientific testing, as you do not even need to go to the trouble of collecting fruit flies or putting up a telescope to get material—just create an email account and watch it roll in! The infrastructure of spam is so complex, however, that simply testing the result of any given email account is like proving a cure in medieval medicine: the patient may improve or decline, but it is hard to definitively link that change to what the doctor did. Clearly the necessary starting point is some kind of agreed-upon spam object, something like a spam meter or spam calorie, on which and against which things can be tested—a corpus.
A spam corpus, though, starts with the problem of privacy. Simply having a batch of agreed-upon spam messages is not good enough, because spam makes sense only in context, as something distinct from legitimate mail. If your ultimate goal is to produce something that can stop spam, you need an accurate simulation of the inbound email for a user or a group of users in which that spam is embedded. You need not just the spam but also its context of nonspam (referred to in much of the scientific literature, with a straight face, as “ham”). Spam filters improve based on increased amounts of data available for them to work with, so you need a lot of spam, and correspondingly a whole lot of legitimate mail, to get the law of large numbers on your side. Obviously, email is largely a private medium, or is at least treated like one. How do you create an accurate spam corpus, with the vitally important contextual nonspam mail, while maintaining privacy? Simply obfuscating personal details (email addresses, telephone numbers, proper names) seriously interferes with the accuracy of the results. Analyzing email and spam is all about quantifying and studying messages and words in messages, so having a corpus full of “Dear XXXXX” and “you can reach me at XXX-XXXX” is going to make any resulting tests inaccurate at best and misleading at worst for a filter meant for use outside the lab.
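The kind of naive obfuscation described above can be sketched in a few lines of Python. The patterns and placeholder strings here are illustrative assumptions, not a reconstruction of any actual scrubbing tool from the literature:

```python
import re

# Hypothetical patterns for two kinds of personal detail; a real scrubber
# would need many more (names, street addresses, account numbers, ...).
PHONE = re.compile(r"\b\d{3}[-.]\d{4}\b")
EMAIL = re.compile(r"\b[\w.]+@[\w.]+\b")

def redact(message):
    # Mask personal details with fixed placeholders before release.
    message = PHONE.sub("XXX-XXXX", message)
    message = EMAIL.sub("XXXXX", message)
    return message

print(redact("Write to ann@example.com or call 555-1234"))
# → Write to XXXXX or call XXX-XXXX
```

This makes the privacy problem vivid in miniature: a filter trained on a corpus full of "XXXXX" and "XXX-XXXX" learns token statistics that no real inbox would ever produce.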
What if you obfuscated the entire body of every message in the corpus? You could have a huge set of personal messages and spam without infringing on anyone’s privacy. This is a substitution method: “to release benchmarks each consisting of messages received by a particular user, after replacing each token by a unique number in all the messages. The mapping between tokens and numbers is not released, making it extremely difficult to recover the original messages, other than perhaps common words and phrases therein.” A token is a term from lexical analysis for its basic unit: the atomic part of a document, usually but not necessarily a word. Tokenization is what you do to a document to make it into an object for computational lexical analysis: turning the strings of characters we recognize as readers into a series of discrete objects that possess values and can be acted on by a computer algorithm (a “value” in this case being, for instance, the number of times a word appears in the text as a whole). In tokenizing texts, the human meaning of a word is pretty much irrelevant, as its value as a token in a space of other tokens is all that matters: how many there are, whether they often appear in relation to certain other tokens, and so on. It’s not as strange as it sounds, then, to preserve privacy in the spam corpus by the Borgesian strategy of substituting a unique number for each token in a message: 42187 for “Dear,” 8472 for “you,” and so on. If you keep consistency, it could work.
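The substitution method can be sketched in a few lines of Python. The tokenizer here is a deliberately simple assumption (the actual corpora used their own tokenizers); what matters is that each distinct token maps to one number, consistently across the whole corpus:

```python
import re

def tokenize(text):
    # A deliberately simple tokenizer: runs of letters, digits, apostrophes.
    return re.findall(r"[A-Za-z0-9']+", text)

def obfuscate(messages):
    """Replace each distinct token with a unique integer, consistently
    across all messages. The mapping is kept secret, never released."""
    mapping = {}  # token -> number
    obfuscated = []
    for msg in messages:
        ids = [mapping.setdefault(tok, len(mapping)) for tok in tokenize(msg)]
        obfuscated.append(" ".join(str(i) for i in ids))
    return obfuscated, mapping

msgs = ["Dear Ann, you can call me", "Dear Bob, you owe me money"]
obf, mapping = obfuscate(msgs)
# "Dear" receives the same number in both messages, so the token
# statistics a filter relies on survive the obfuscation.
```

The consistency is the point: word frequencies and co-occurrences are preserved for classification experiments, even though the messages themselves are unreadable without the secret mapping.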
However: “the loss of the original tokens still imposes restrictions; for example, it is impossible to experiment with different tokenizers.” Making an obfuscated corpus with this one map of numbers (which is not released, to prevent people from reversing the obfuscation and reading the private mail) locks all users into one version of the corpus, preventing them from trying different methods on the original messages. There are many ways of performing lexical analysis, evaluation, and parsing that could lead to otherwise-missed implications and innovations for further classification and filtering experiments. Starting from a natural language corpus of spam and nonspam messages provides a lot more experimental room to move, compared to a huge pile of integers representing messages already preprocessed in particular ways (“42187 64619 87316 73140 . . .”).
There were other approaches to building a spam corpus for scientific research, but each had its own flaws. A series of corpora was made from mailing lists, such as the Ling-Spam corpus, which collected the messages sent to a mailing list for the academic linguistics community—a moderated list on which someone reviewed and approved each message before it went out, and which was therefore free of spam. In theory, this corpus could be used to establish benchmarks for legitimate nonspam email. In practice, the material appearing on the list was far more topic-specific than the profile of any given person’s actual email—no receipts from ecommerce sites, no love letters with erotic language, no brief appointment-making back-and-forth messages. It produced over-optimistic results from classifying and filtering programs for their ability to recognize legitimate text. (Spam messages on the one side and arguments about Chomskyan grammar and linguistic recursion on the other make for a rather skewed arrangement.) The SpamAssassin corpus, gathered to test the spam filter of the same name, used posts collected from public mailing lists and emails donated by volunteers. It ran into exactly the opposite problem, with a benchmark set of legitimate text that was far more diverse than that of any given person’s email account, and it also used only those messages considered acceptable for public display by their recipients.
Early in the literature, the solution was quick and rough: the researchers would simply volunteer their own email inboxes for the experiment without releasing them as a corpus. They would hope that the results they got with their experiments could be reproduced with other people’s corpora, as do scientists using their own bodies as the experimental subjects—Pierre Curie generating a lesion on his own arm with radium, or Johann Wilhelm Ritter attaching the poles of a battery to his tongue. We can call this the “we’re all pretty much alike” approach. “As far as I know,” writes Jason D. M. Rennie in 2000 of his early email filter and classifier file, originally released in 1996 and described later in this chapter, “there are no freely available data sets for mail filtering. . . . I asked for volunteers who would be willing to have such experiments performed on their mail collection; four users (including the author) volunteered.” It’s a striking idea, and in many ways one that speaks more to the hacker sensibility than to that of the institutionalized sciences: the code runs, and it’s free, so do the experiments yourself. (It would be wonderful for an article in a journal on genomics, for instance, to say that substantive results depend on you, the reader, to fire up a sequencer and a mass spectrometer and perform the experiments yourself—and on yourself, no less.)
Obfuscation, problematic sampling and harvesting, noble volunteers for one-shot tests: the lack of “freely available data sets for mail filtering” that accurately approximate the mailing dynamics of persons and groups was clearly the problem to beat in making spam scientific. Beyond the institutional and methodological issues created by everyone having yardsticks of different lengths, there was a more subtle concern, one native to the task of filtering email and separating it into spam and nonspam. “The literature suggests that the variation in performance between different users varies much more than the variation between different classification algorithms.” There are distinctive patterns and topologies to users, their folders, the network they inhabit, and other behavior. We are not all pretty much alike, and acting as though all email corpora are created equal is like a chemist deciding to just ignore temperature. Working with an experimental base of small sets of volunteered email makes it difficult to assess differences between filters and almost impossible to analyze differences in the email activity profiles of persons and groups—differences any filter would have to take into account. Creating a scientific object with which to study spam appeared to be at an impasse.