Loot from a Scandal
In 2003, the United States Federal Energy Regulatory Commission (FERC), as part of its major investigation of price manipulation in energy markets, made public all of the data it had accumulated from the energy-trading company Enron. Initially lauded as an innovative corporate player in the energy business, Enron made a very public descent into bankruptcy amid revelations of price fixing and massive accounting fraud, a collapse that was the dominant story in business journalism in late 2001 and 2002. The Enron bankruptcy case was one of the most complex in U.S. history, and its related investigations produced an extraordinary quantity of data: the FERC’s Enron collection includes audio files of trading floor calls, extracts from internal corporate databases, 150,000 scanned documents, and a large portion of Enron’s managerial internal email—all available to the public (at varying degrees of difficulty). The FERC had thus unintentionally produced a remarkable object: the public and private mailing activities of 158 people in the upper echelons of a major corporation, frozen in place like the ruins of Pompeii for future researchers. They were not slow in coming.
The Enron collection is a remarkable and slightly terrifying thing, an artifact of the interplay of public and private in email. As a human document, it has the skeleton of a great, if pathetic, novel: a saga of nepotism, venality, arrogant posturing, office politics, stock deals, wedding contractors, and Texas strip clubs, played out over hundreds of thousands of messages. The human reader discerns a narrative built around two families, one tied by blood and the other a corporate elite tied by money, coming to ruin and allying their fortunes to the Bush/Cheney campaign. These are the veins of narrative interest embedded in the monotony of a large business. A sample line at random, from “entex transition,” December 14, 1999, 00:13, with the odd spacing and lowercase text left in as artifacts of the data extraction process: “howard will continue with his lead responsibilites [sic] within the group and be available for questions or as a backup, if necessary (thanks howard for all your hard work on the account this year ).” This relentless collection of mundane minutiae and occasional legally actionable evidence, stretching out as vast and trackless as the Gobi, was dramatically transformed as it became an object suitable for the scientific analysis of spam.
After the body of the Enron data has been secured in a researcher’s computer like the sperm whale to the side of the Pequod, it must be flensed into useful parts. First, the monstrous dataset of 619,446 messages belonging to 158 users has to be “cleaned,” removing folders and duplicate messages that are artifacts of the mail system itself and not representative of how humans would classify and sort their mail. This step cuts the number of messages to 200,399, which is still an enormous body of text. The collected data for each user is sorted chronologically and split in half to create separate sets of training and testing materials for the machines. The text is tokenized with attention to different types of data within the set: “unstructured text,” areas such as Subject and Body with free natural language; “categorical text,” well-defined fields such as “To:” and “From:”; and numerical data such as message size, number of recipients, and character counts.
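The tokenization step can be sketched roughly as follows. This is an illustrative assumption, not the actual Carnegie Mellon pipeline: the field names, the feature encoding, and the sample message are all invented for demonstration, using the “entex transition” line quoted earlier.

```python
# Hypothetical sketch of the preprocessing described above: each message is
# split into unstructured text (Subject/Body), categorical fields (To/From),
# and simple numeric features. Field names here are illustrative, not the
# actual schema of the Enron corpus as distributed.
import re

def tokenize_message(msg):
    """Turn one raw message dict into a flat feature dict."""
    features = {}
    # Unstructured text: free natural language, lowercased and word-split.
    for field in ("subject", "body"):
        for token in re.findall(r"[a-z0-9']+", msg.get(field, "").lower()):
            key = f"{field}:{token}"
            features[key] = features.get(key, 0) + 1
    # Categorical text: well-defined fields kept as whole values.
    for field in ("to", "from"):
        features[f"{field}={msg.get(field, '')}"] = 1
    # Numeric data: recipient count and character count.
    features["num_recipients"] = len(msg.get("to", "").split(","))
    features["char_count"] = len(msg.get("body", ""))
    return features

msg = {"from": "howard@enron.com", "to": "group@enron.com",
       "subject": "entex transition",
       "body": "howard will continue with his lead responsibilites"}
feats = tokenize_message(msg)
```

The point of flattening every message into such a feature dictionary is that a learning algorithm can then treat words, sender addresses, and sizes uniformly, as dimensions of a single vector.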
Then the testing and evaluation starts, with its own complex conceptual negotiations to make this process scientific. Machine learning programs are run against the training and testing sets and their output is analyzed; these results are then compared to the results from a dataset of messages volunteered from students and faculty at Carnegie Mellon. With a dataset that is cleaned and parsed for computational processing, and not too far afield from the results of volunteer data, the creation of a corpus as an epistemic object appropriate for scientific inquiry is almost complete. This process raises a further question: has the work of making the corpus, particularly resolving what counts as spam, changed it in ways that need to be taken into account experimentally? The question of what spam is, and for whom, becomes an area of community negotiation for the scientists as it was for the antispam charivari and free speech activists, for the network engineers and system administrators, and for the lawyers and the legislators.
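The protocol of splitting each user’s mail chronologically and scoring a filter against the held-out half can be sketched in a few lines. Everything below is invented for illustration, timestamps, labels, and the deliberately crude keyword rule standing in for the machine learning programs alike:

```python
# Minimal sketch of the train/test protocol described above: each user's
# mail is ordered by date and divided in half, earlier messages to train
# on, later ones to test against. The data and "classifier" are toys.
def chronological_split(messages):
    """Sort (timestamp, text, label) records and split into two halves."""
    ordered = sorted(messages, key=lambda m: m[0])
    mid = len(ordered) // 2
    return ordered[:mid], ordered[mid:]

def accuracy(predict, test_set):
    """Fraction of test messages whose predicted label matches the truth."""
    hits = sum(1 for _, text, label in test_set if predict(text) == label)
    return hits / len(test_set)

mail = [(2, "lunch friday?", "ham"), (1, "re: entex transition", "ham"),
        (4, "hot stock tip", "spam"), (3, "quarterly numbers", "ham")]
train, test = chronological_split(mail)
# A crude keyword rule stands in for a trained filter here.
score = accuracy(lambda text: "spam" if "stock" in text else "ham", test)
```

Splitting by time rather than at random matters: a filter in deployment only ever sees the past, so testing it on messages later than its training data is what makes the result repeatable and honest.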
Cormack and Lynam, authors of the recent dominant framework for a spam filtering corpus, directly address this issue of creating “repeatable (i.e., controlled and statistically valid) results” for spam analysis with an email corpus. In the process of establishing a “gold standard” for filters—the gold standard itself is an interesting artifact of iterating computational processing and human judgment—they introduce yet another refinement in the struggle to define spam: “We define spam to be ‘Unsolicited, unwanted email that was sent indiscriminately, directly or indirectly, by a sender having no current relationship with the recipient.’” With this definition and an array of filtering tools, they turn to the Enron corpus that we have looked at as processed and distributed by Carnegie Mellon: “We found it very difficult to adjudicate many messages because it was difficult to glean the relationship between the sender and the receiver. In particular, we found a preponderance of sports betting pool announcements, stock market tips, and religious bulk mail that was adjudicated as spam but in hindsight we suspect was not. We found advertising from vendors whose relationship with the recipient we found tenuous.” In other words, along with some textual problems—important parameters for testing a spam filter that were missing, such as headers to establish relationships, and the presence of attachments—the corpus reflected the ongoing problem of the very definition of spam. The email culture of academic computer science is not awash in religious bulk mail and stock tips, except insofar as those messages—especially the latter—arrive as spam. But the email culture of the upper echelons of a major corporation is quite different, particularly when that corporation is, as Enron was, both strongly Christian and strongly Texan. 
As the company began to fall apart, interoffice mail in the FERC dataset includes many injunctions to pray, promises to pray for one another, and other professions of faith. Similarly, sports, and betting on sports, are part of the conversation, as one might expect from a group of highly competitive businessmen in Houston.
In theory, a sufficiently advanced and trained spam filter would algorithmically recognize these distinctions and be as personally accurate within the Enron corporate culture as it would be for an academic, learning to stop the unsolicited stock tips for the latter while delivering the endless tide of calls for papers. However, within the Enron corpus as tested by Lynam and Cormack, the details critical for such a filter were missing. Human decisions about what qualified as spam, combined with technical constraints that made it hard to map relationships—and “a sender having no current relationship with the recipient” is one of their spam criteria, as it’s the difference between a stock tip from a boiler room spam business somewhere and one from Ken Lay down the hall—had made the corpus into an object meaningful for one kind of work but not another. It was an appropriate object for the study of automated classification, but not for spam filtering. In the end, they retrieved the original database from the FERC and essentially started over. “Construction of a gold standard for the Enron Corpus, and the tools to facilitate that construction, remains a work in progress,” they write, but: “We believe that . . . the Enron Corpus will form the basis of a larger, more representative public spam corpus than currently exists.”
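Why the sender relationship matters can be shown with a toy rule. This is not Cormack and Lynam’s gold-standard procedure, only a hypothetical sketch: a sender counts as “known” if the recipient has previously written to them, so the identical stock-tip text is scored differently from a stranger than from Ken Lay down the hall.

```python
# Toy illustration (not Cormack and Lynam's actual method) of the
# "current relationship" criterion in their spam definition.
def has_relationship(sender, outbox):
    """True if the recipient has previously written to this sender."""
    return any(sender in msg["to"] for msg in outbox)

def classify(msg, outbox, spammy_words=frozenset({"stock", "tip", "viagra"})):
    """Crude rule: spammy vocabulary from a stranger is spam; the same
    vocabulary from a known correspondent is not."""
    words = set(msg["body"].lower().split())
    if words & spammy_words and not has_relationship(msg["from"], outbox):
        return "spam"
    return "ham"

# Invented example: the recipient has mailed Ken Lay before, never the stranger.
outbox = [{"to": "ken.lay@enron.com", "body": "re: quarterly numbers"}]
tip = {"body": "hot stock tip inside"}
verdict_stranger = classify({**tip, "from": "boiler@room.example"}, outbox)
verdict_colleague = classify({**tip, "from": "ken.lay@enron.com"}, outbox)
```

The corpus as distributed stripped the headers needed to reconstruct that outbox, which is precisely why such a rule could not be tested against it.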
This is all reliable, well-cited, iterative, inscription-driven science: proper, and slow. Compared to the pace of spam, it’s very slow indeed. A faint air of tragedy hangs over some of the earlier documents in the scientific community when they explain why spam is a problem: “To show the growing magnitude of the junk E-mail problem, these 222 messages contained 45 messages (over 20% of the incoming mail) which were later deemed to be junk by the user.” That percentage, on a given day as of this writing, has tripled or quadrupled on the far side of the various email filters. A certain amount of impatience with formal, scientific antispam progress is understandable.
As it happens, networked computers host a large and thriving technical subculture self-defined by its impatience with procedural niceties, its resistance to institutionalization, its ad hoc and improvisational style, and its desire for speed and working-in-practice or “rough consensus and running code”: the hackers. They have their own ideas, and many of them, about how to deal with spam. The most significant of these ideas, one that will mesh computer science, mathematics, hacking, and eventually literature around the task of objectifying spam, began with a hacker’s failure to accurately cite a scientist.