June 18, 2013

Spam: A Shadow History of the Internet [Excerpt, Part 1]

The constant metamorphosis of junk electronic messages mirrors the evolution of the online world itself. In this chapter, follow the scientists who try to stem the flow of spam and the hackers who inevitably outwit them

By The Editors

In one sense, the history of spam is the history of the Internet. It is a history of hackers who have constantly probed the limits of the Net’s capabilities to enable the medium to do their bidding in delivering all manner of electronic junk messaging. And it is also a tale of the scientists who have fought a constantly losing battle to try to stop them and, in the course of doing so, have helped shape the evolution of the Internet as a commercial medium.

What follows is a long chapter from a new book by Finn Brunton, a professor at the University of Michigan, that recounts the ceaseless battle of spammer vs. scientist. The book, as a whole, details the entire sweep of spam history, from the early pre-commercial era (through 1995), on to the frenetic cowboy years (“Nigerian Prince” fraud and “pump-and-dump” schemes) until 2003 and the advent of spam legislation, followed by the subsequent globalization, criminalization and militarization that occurred from 2003 to 2010.

The chapter you are about to read, broken into four segments that will run today through Friday, recounts the changes in the spam ecosystem after 2003 when, to deal with software filters and the new legal prohibitions, spammers created elaborate automated networks in some of the most far-flung regions of the globe to enable them to practice their trade unhindered. It turns out as well that these networks are becoming a linchpin of the inchoate cyberwarfare waged by national governments.

As Brunton notes in the introduction, spam is a product of our society, the work of “programmers, con artists, cops, lawyers, bots and their botmasters, scientists, pill merchants, social media entrepreneurs, marketers, hackers, identity thieves, sysadmins, victims, pornographers, do-it-yourself vigilantes, government officials, and stock touts.” He also phrases it slightly differently—and even more deliciously—elsewhere in the intro: “a remarkable cast of postnational anarchists, baronial system administrators, visionary protocol designers, community-building ‘process queens,’ technolibertarian engineers, and a distributed mob of angry antispam activists.”

This book is a gem. The goings-on of the twisted personages who populate cyberpunk lit have nothing on the ingenious scheming of the spammers and the scientists dedicated to shutting them down. Read here and in days to come about this fascinatingly bizarre subterranean cyberworld. This first section deals with the elaborate, almost monastic scholarship that went into creating spam filters. (Links to previous excerpted segments will be available once new material is posted.)

TABLE OF CONTENTS

THE VICTIM CLOUD
Filtering: Scientists and Hackers

Making Spam Scientific, Part 1: Scientists building filters try to get a handle on the “etymologically restless” nature of spam by creating text corpuses for software analysis. The search is on for the equivalent of a spam calorie
Loot from a Scandal: Researchers use internal communications from the disgraced Enron to try to fashion spam filters
Making Spam Hackable: Software maven Paul Graham’s “Plan for Spam”

THE VICTIM CLOUD: 2003 – 2010
FILTERING: SCIENTISTS AND HACKERS

MAKING SPAM SCIENTIFIC, PART I

We have machines capable of reading, analyzing, judging any written text. But it is precisely the reliability of the instruments on which we must run some checks.

—Italo Calvino, If on a Winter’s Night a Traveler

If you can’t get to grips with the spammers using law, censure, and protest, can you instead get to grips with spam itself? How do you get a handle on it, and how do you make it something you can measure and quantify, talk about coherently, understand—and therefore attack? It is an etymologically restless thing, at once noun and verb, that thrives wherever we have trouble defining it clearly enough to exclude it and make precise rules and laws for it. It is the subjective nemesis of the equally subjective germane human interaction. How do you turn it into a material you can work on? How do you make spam an object?

These are matters of practical interest for those seeking to analyze spam and make it something they can stop. This is a story about two such groups—scientists and hackers—and how they went about drawing lines around spam, defining edges and black boxes and criteria and workflows and making it into something to which they could apply tools. This is also necessarily a story about everything left out of the lines they drew, and how in changing the shape of spam it eluded them and transformed, both as a technology and as a set of practices, into something far stranger than before: a new object with a new infrastructure behind it, produced by a new class of criminals.

Spam comes into a computer lab with as much of a halo of strangeness as a chunk of cavorite—H. G. Wells’s fantasy material that resists gravity, with which scientists fall up to the moon—and with similarly strange and innovation-demanding effects. After all, what is this human-machine, innovative-criminal, social-technological, maddening yet unstoppable thing? It’s a practice, and a communally expressed attitude, but also an artifact of sorts, something that exists in the singular and ostensive—a “spam message,” this spam—but that also demands analysis in the plural as spam, the problem, on a larger scale. How do you specify this concept, making it productive of reproducible and falsifiable results that are capable of being benchmarked and tested? Spam is constantly fluctuating; the amount you receive depends on your ISP, what filters your ISP uses, your operating system and mail application, the number and type of the accounts that you use, and even the season and the time of day. Spam may seem at first like an ideal subject for scientific testing, as you do not even need to go to the trouble of collecting fruit flies or putting up a telescope to get material—just create an email account and watch it roll in! The infrastructure of spam is so complex, however, that simply testing the result of any given email account is like proving a cure in medieval medicine: the patient may improve or decline, but it is hard to definitively link that change to what the doctor did. Clearly the necessary starting point is some kind of agreed-upon spam object, something like a spam meter or spam calorie, on which and against which things can be tested—a corpus.

A spam corpus, though, starts with the problem of privacy. Simply having a batch of agreed-upon spam messages is not good enough, because spam makes sense only in context, as something distinct from legitimate mail. If your ultimate goal is to produce something that can stop spam, you need an accurate simulation of the inbound email for a user or a group of users in which that spam is embedded. You need not just the spam but also its context of nonspam (referred to in much of the scientific literature, with a straight face, as “ham”). Spam filters improve based on increased amounts of data available for them to work with, so you need a lot of spam, and correspondingly a whole lot of legitimate mail, to get the law of large numbers on your side. Obviously, email is largely a private medium, or is at least treated like one. How do you create an accurate spam corpus, with the vitally important contextual nonspam mail, while maintaining privacy? Simply obfuscating personal details (email addresses, telephone numbers, proper names) seriously interferes with the accuracy of the results. Analyzing email and spam is all about quantifying and studying messages and words in messages, so having a corpus full of “Dear XXXXX” and “you can reach me at XXX-XXXX” is going to make any resulting tests inaccurate at best and misleading at worst for a filter meant for use outside the lab.

What if you obfuscated the entire body of every message in the corpus? You could have a huge set of personal messages and spam without infringing on anyone’s privacy. This is a substitution method: “to release benchmarks each consisting of messages received by a particular user, after replacing each token by a unique number in all the messages. The mapping between tokens and numbers is not released, making it extremely difficult to recover the original messages, other than perhaps common words and phrases therein.” A token is a term from lexical analysis for its basic unit: the atomic part of a document, usually but not necessarily a word. Tokenization is what you do to a document to make it into an object for computational lexical analysis: turning the strings of characters we recognize as readers into a series of discrete objects that possess values and can be acted on by a computer algorithm (a “value” in this case being, for instance, the number of times a word appears in the text as a whole). In tokenizing texts, the human meaning of a word is pretty much irrelevant, as its value as a token in a space of other tokens is all that matters: how many there are, whether they often appear in relation to certain other tokens, and so on. It’s not as strange as it sounds, then, to preserve privacy in the spam corpus by the Borgesian strategy of substituting a unique number for each token in a message: 42187 for “Dear,” 8472 for “you,” and so on. If you keep consistency, it could work.
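
As a rough illustration of the substitution method described above, the following Python sketch replaces each token with a number that stays consistent across every message. The tokenizer, the function names, and the sequential numbering are all assumptions made for the example; a released benchmark would presumably assign numbers in a way that reveals nothing about word order or frequency.

```python
import re


def tokenize(text):
    # Hypothetical tokenizer: lowercase runs of letters, digits, and apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())


def obfuscate(messages):
    """Replace every token with an integer that is the same wherever the token appears."""
    token_ids = {}   # the mapping that would be kept private, not released
    obfuscated = []
    for text in messages:
        ids = []
        for token in tokenize(text):
            if token not in token_ids:
                # Sequential numbers keep the sketch simple (an assumption).
                token_ids[token] = len(token_ids) + 1
            ids.append(token_ids[token])
        obfuscated.append(ids)
    return obfuscated, token_ids


corpus = [
    "Dear friend, you can reach me tonight",
    "Dear you, a guarantee for you",
]
released, secret_map = obfuscate(corpus)
```

Here "dear" and "you" receive the same numbers in both messages, so counts and co-occurrences survive even though the words themselves do not.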

However: “the loss of the original tokens still imposes restrictions; for example, it is impossible to experiment with different tokenizers.” Making an obfuscated corpus with this one map of numbers (which is not released, to prevent people from reversing the obfuscation and reading the private mail) locks all users into one version of the corpus, preventing them from trying different methods on the original messages. There are many ways of performing lexical analysis, evaluation, and parsing that could lead to otherwise-missed implications and innovations for further classification and filtering experiments. Starting from a natural language corpus of spam and nonspam messages provides a lot more experimental room to move, compared to a huge pile of integers representing messages already preprocessed in particular ways. (“42187 64619 87316 73140 . . .”).

There were other approaches to building a spam corpus for scientific research, but each had its own flaws. A series of corpora was made from mailing lists, such as the Ling-Spam corpus, that collected the messages sent to a mailing list for the academic linguistics community—a moderated list on which someone reviewed and approved each message before it went out, which was thus a mailing list free of spam. In theory, this corpus could be used to establish benchmarks for legitimate nonspam email. In practice, the material appearing on the list was far more topic-specific than the profile of any given person’s actual email—no receipts from ecommerce sites, no love letters with erotic language, no brief appointment-making back-and-forth messages. It produced over-optimistic results from classifying and filtering programs for their ability to recognize legitimate text. (Spam messages on the one side and arguments about Chomskyan grammar and linguistic recursion on the other makes for a rather skewed arrangement.) The SpamAssassin corpus, gathered to test the spam filter of the same name, used posts collected from public mailing lists and emails donated by volunteers. It ran into exactly the opposite problem, with a benchmark set of legitimate text that was far more diverse than that of any given person’s email account, and also used only those messages considered acceptable for public display by their recipients.

Early in the literature, the solution was quick and rough: the researchers would simply volunteer their own email inboxes for the experiment without releasing them as a corpus. They would hope that the results they got with their experiments could be reproduced with other people’s corpora, as do scientists using their own bodies as the experimental subjects—Pierre Curie generating a lesion on his own arm with radium, or Johann Wilhelm Ritter attaching the poles of a battery to his tongue. We can call this the “we’re all pretty much alike” approach. “As far as I know,” writes Jason D. M. Rennie in 2000 of his early email filter and classifier ifile, originally released in 1996 and described later in this chapter, “there are no freely available data sets for mail filtering. . . . I asked for volunteers who would be willing to have such experiments performed on their mail collection; four users (including the author) volunteered.” It’s a striking idea, and in many ways one that speaks more to the hacker sensibility than to that of the institutionalized sciences: the code runs, and it’s free, so do the experiments yourself. (It would be wonderful for an article in a journal on genomics, for instance, to say that substantive results depend on you, the reader, to fire up a sequencer and a mass spectrometer and perform the experiments yourself—and on yourself, no less.)

Obfuscation, problematic sampling and harvesting, noble volunteers for one-shot tests: the lack of “freely available data sets for mail filtering” that accurately approximate the mailing dynamics of persons and groups was clearly the problem to beat in making spam scientific. Beyond the institutional and methodological issues created by everyone having yardsticks of different lengths, there was a more subtle concern, one native to the task of filtering email and separating it into spam and nonspam. “The literature suggests that the variation in performance between different users varies much more than the variation between different classification algorithms.” There are distinctive patterns and topologies to users, their folders, the network they inhabit, and other behavior. We are not all pretty much alike, and acting as though all email corpora are created equal is like a chemist deciding to just ignore temperature. Working with an experimental base of small sets of volunteered email makes it difficult to assess differences between filters and almost impossible to analyze differences in the email activity profiles of persons and groups—differences any filter would have to take into account. Creating a scientific object with which to study spam appeared to be at an impasse.

LOOT FROM A SCANDAL

In 2003, the United States Federal Energy Regulatory Commission (FERC), as part of its major investigation of price manipulation in energy markets, made public all of the data it had accumulated from the energy-trading company Enron. Initially lauded as an innovative corporate player in the energy business, Enron’s very public descent into bankruptcy amid revelations of price fixing and massive accounting fraud was the dominant story in business journalism in late 2001 and 2002. The Enron bankruptcy case was one of the most complex in U.S. history, and its related investigations produced an extraordinary quantity of data: the FERC’s Enron collection includes audio files of trading floor calls, extracts from internal corporate databases, 150,000 scanned documents, and a large portion of Enron’s managerial internal email—all available to the public (at varying degrees of difficulty). The FERC had thus unintentionally produced a remarkable object: the public and private mailing activities of 158 people in the upper echelons of a major corporation, frozen in place like the ruins of Pompeii for future researchers. They were not slow in coming.

The Enron collection is a remarkable and slightly terrifying thing, an artifact of the interplay of public and private in email. As a human document, it has the skeleton of a great, if pathetic, novel: a saga of nepotism, venality, arrogant posturing, office politics, stock deals, wedding contractors, and Texas strip clubs, played out over hundreds of thousands of messages. The human reader discerns a narrative built around two families, one biological and tied by blood, and the other a corporate elite tied by money, coming to ruin and allying their fortunes to the Bush/Cheney campaign. These are the veins of narrative interest embedded in the monotony of a large business. A sample line at random, from “entex transition,” December 14, 1999, 00:13, with the odd spacing and lowercase text left in as artifacts of the data extraction process: “howard will continue with his lead responsibilites [sic] within the group and be available for questions or as a backup, if necessary (thanks howard for all your hard work on the account this year ).” This relentless collection of mundane minutiae and occasional legally actionable evidence, stretching out as vast and trackless as the Gobi, was dramatically transformed as it became an object suitable for the scientific analysis of spam.

After the body of the Enron data has been secured in a researcher’s computer like the sperm whale to the side of the Pequod, it must be flensed into useful parts. First, the monstrous dataset of 619,446 messages belonging to 158 users has to be “cleaned,” removing folders and duplicate messages that are artifacts of the mail system itself and not representative of how humans would classify and sort their mail. This step cuts the number of messages to 200,399, which is still an enormous body of text. The collected data for each user is sorted chronologically and split in half to create separate sets of training and testing materials for the machines. The text is tokenized with attention to different types of data within the set— “unstructured text,” areas such as Subject and Body with free natural language, “categorical text,” well-defined fields such as “To:” and “From:”, and numerical data such as message size, number of recipients, and character counts.
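
The chronological split described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the researchers' actual tooling: the (user, timestamp, text) record layout, the function name, and the sample users are all invented for the example.

```python
from collections import defaultdict


def chronological_split(messages):
    """Split each user's mail into an earlier training half and a later testing half.

    `messages` is an iterable of (user, timestamp, text) tuples; this layout
    is a hypothetical stand-in for a cleaned mail collection.
    """
    by_user = defaultdict(list)
    for user, timestamp, text in messages:
        by_user[user].append((timestamp, text))

    splits = {}
    for user, items in by_user.items():
        items.sort(key=lambda item: item[0])   # oldest message first
        half = len(items) // 2
        train = [text for _, text in items[:half]]
        test = [text for _, text in items[half:]]
        splits[user] = (train, test)
    return splits


sample = [
    ("user_a", 3, "c"), ("user_a", 1, "a"), ("user_a", 4, "d"), ("user_a", 2, "b"),
    ("user_b", 2, "y"), ("user_b", 1, "x"), ("user_b", 3, "z"),
]
splits = chronological_split(sample)
```

Splitting by time rather than at random means a classifier is always tested on mail that arrived after the mail it learned from, which is closer to how a filter meets new messages in use.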

Then the testing and evaluation starts, with its own complex conceptual negotiations to make this process scientific. Machine learning programs are run against the training and testing sets and their output is analyzed; these results are then compared to the results from a dataset of messages volunteered from students and faculty at Carnegie Mellon. With a dataset that is cleaned and parsed for computational processing, and not too far afield from the results of volunteer data, the creation of a corpus as an epistemic object appropriate for scientific inquiry is almost complete. This process raises a further question: has the work of making the corpus, particularly resolving what is considered as spam, changed it in ways that need to be taken into account experimentally? The question of what spam is, and for whom, becomes an area of community negotiation for the scientists as it was for the antispam charivari and free speech activists, for the network engineers and system administrators, and for the lawyers and the legislators.

Cormack and Lynam, authors of the recent dominant framework for a spam filtering corpus, directly address this issue of creating “repeatable (i.e., controlled and statistically valid) results” for spam analysis with an email corpus. In the process of establishing a “gold standard” for filters—the gold standard itself is an interesting artifact of iterating computational processing and human judgment—they introduce yet another refinement in the struggle to define spam: “We define spam to be ‘Unsolicited, unwanted email that was sent indiscriminately, directly or indirectly, by a sender having no current relationship with the recipient.’” With this definition and an array of filtering tools, they turn to the Enron corpus that we have looked at as processed and distributed by Carnegie Mellon: “We found it very difficult to adjudicate many messages because it was difficult to glean the relationship between the sender and the receiver. In particular, we found a preponderance of sports betting pool announcements, stock market tips, and religious bulk mail that was adjudicated as spam but in hindsight we suspect was not. We found advertising from vendors whose relationship with the recipient we found tenuous.” In other words, along with some textual problems—important parameters for testing a spam filter that were missing, such as headers to establish relationships, and the presence of attachments—the corpus reflected the ongoing problem of the very definition of spam. The email culture of academic computer science is not awash in religious bulk mail and stock tips, except insofar as those messages—especially the latter—arrive as spam. But the email culture of the upper echelons of a major corporation is quite different, particularly when that corporation is, as Enron was, both strongly Christian and strongly Texan. As the company began to fall apart, interoffice mail in the FERC dataset includes many injunctions to pray, and promises to pray for one another and other professions of faith. Similarly, sports, and betting on sports, are part of the conversation, as one might expect from a group of highly competitive businessmen in Houston.

In theory, a sufficiently advanced and trained spam filter would algorithmically recognize these distinctions and be personally accurate within the Enron corporate culture as it would for an academic, learning to stop the unsolicited stock tips for the latter while delivering the endless tide of calls for papers. However, within the Enron corpus as tested by Lynam and Cormack, the details critical for such a filter were missing. Human decisions about what qualified as spam, combined with technical constraints that made it hard to map relationships—and “a sender having no current relationship with the recipient” is one of their spam criteria, as it’s the difference between a stock tip from a boiler room spam business somewhere and one from Ken Lay down the hall—had made the corpus into an object meaningful for one kind of work but not another. It was an appropriate object for the study of automated classification, but not for spam filtering. In the end, they retrieved the original database from the FERC and essentially started over. “Construction of a gold standard for the Enron Corpus, and the tools to facilitate that construction, remains a work in progress,” they write, but: “We believe that . . . the Enron Corpus will form the basis of a larger, more representative public spam corpus than currently exists.”

This is all reliable, well-cited, iterative, inscription-driven science: proper, and slow. Compared to the pace of spam, it’s very slow indeed. A faint air of tragedy hangs over some of the earlier documents in the scientific community when they explain why spam is a problem: “To show the growing magnitude of the junk E-mail problem, these 222 messages contained 45 messages (over 20% of the incoming mail) which were later deemed to be junk by the user.” That percentage, on a given day as of this writing, has tripled or quadrupled on the far side of the various email filters. A certain amount of impatience with formal, scientific antispam progress is understandable.

As it happens, networked computers host a large and thriving technical subculture self-defined by its impatience with procedural niceties, its resistance to institutionalization, its ad hoc and improvisational style, and its desire for speed and working-in-practice or “rough consensus and running code”: the hackers. They have their own ideas, and many of them, about how to deal with spam. The most significant of these ideas, one that will mesh computer science, mathematics, hacking, and eventually literature around the task of objectifying spam, began with a hacker’s failure to accurately cite a scientist.

MAKING SPAM HACKABLE

I thought spam robots would become more sophisticated—the robots would fight antirobots and antiantirobots and eventually all the spam robots would become lawyers, and take over the world. But the filters were pretty good, so that prediction was wrong.

— Robert Laughlin

“Norbert Wiener said if you compete with slaves you become a slave, and there is something similarly degrading about competing with spammers,” wrote Paul Graham, a prominent programmer in the Lisp language and, as a success of the first dot-com bubble, one of the founders of the venture capital firm Y Combinator. His landmark essay “A Plan for Spam” is one of the most influential documents in the history of the anti-email-spam movement. The project Graham started by proposing a new form of filtering, which was rapidly taken up by many hands, is important for two reasons: first, because he won—the idea he popularized, in accidental concert with U.S. laws, effectively destroyed email spamming as it then existed, and it did so by sidestepping the social complexities and nuances spam exploited and attacking it on a simple and effective technical point. Second, because he lost, after a fashion: his pure and elegant technical attack was based on a new set of assumptions about what spam was and how spammers worked, and email spammers took advantage of those assumptions, transforming their trade and developing many of the characteristics that shape it today.

“A Plan for Spam,” the essay that launched a thousand programming projects, is a series of tactics under the banner of a strategy. The tactics include an economic rationale for antispam filters, a filter based on measuring probabilities, a trial-and-error approach to mathematics, and a hacker’s understanding that others will take the system he proposes and train and modify it for themselves—that there can be no general understanding or “gold standard” for spam as the scientists were seeking but only specific cases for particular individuals. The overarching strategy is to transfer the labor of reading and classifying spam from humans to machines—which is where Norbert Wiener comes in.

Wiener, a mathematician and polymath who coined the term “cybernetics” as we currently understand it, found much to concern him in the tight coupling of control and communication between humans and machines in the middle of the twentieth century. He worried about the delegation of control over nuclear weapons to game theory and electronic computers and about the feedback from the machines to the humans and to human society. In the 1948 introduction to his book Cybernetics, Wiener made a statement that he would return to intermittently in later studies such as the 1950 book The Human Use of Human Beings: “[Automation and cybernetic efficiencies] gives the human race a new and most effective collection of mechanical slaves to perform its labor. Such mechanical labor has most of the economic properties of slave labor, although, unlike slave labor, it does not involve the direct demoralizing effects of human cruelty. However, any labor that accepts the conditions of competition with slave labor accepts the conditions of slave labor, and is essentially slave labor. The key word of this statement is competition.”

For Wiener, to compete with slaves is to become, in some sense, a slave, like Soviet workers competing with impossible Stakhanovite goals during the Second Five-Year Plan. Graham paraphrases Wiener because he wants to present his feelings about spam as a parallel case: “One great advantage of the statistical approach is that you don't have to read so many spams. Over the past six months, I've read literally thousands of spams, and it is really kind of demoralizing. . . . To recognize individual spam features you have to try to get into the mind of the spammer, and frankly I want to spend as little time inside the minds of spammers as possible.” It’s a complaint that will recur with other researchers—to really understand the spamming process, you have to run an emulation of the spammer in your mind and in your code, and the sleaziness that entails is degrading.

The analogy with Wiener is inexact in an informative way, though, which describes Graham’s strategy in a nutshell. Graham does not have to deal with spammers by competing with them. He is not sending out spam in turn or trying to take advantage of their credulity. In no way is he being demoted to the economic status of a spammer by his work, because he is not competing with them—his machine is competing with them. He is doing exactly what Wiener predicted, though he does not say as much: he is building a system in which the spammers will be obliged to compete with machines, with mechanical readers that filter and discard with unrelenting, inhuman attention, persistence, and acuity. With his mechanical slaves, he will in turn make the business of spamming into slavery and thus unrewarding. As Wiener feared automation would end the economic and political basis for a stable, social-democratic society (“based on human values other than buying or selling,” as he put it), Graham means to end the promise of profit and benefit for small effort that spam initially offers.

The means of ending that promise of profit is Graham’s technological tactic: filtering spam by adopting a system based on naïve Bayesian statistical analysis (more about this in a moment). The one thing spammers cannot hide, Graham argued, is their text: the characteristic language of spam. They can forge return addresses and headers and send their messages through proxies and open relays and so on, but that distinctive spammy tone, the cajolery or the plea, must be there to convince the human on the far side to click the link. What he proposed was a method for turning that language into an object available to hackers, making it possible to build near-perfect personal filters that can rapidly improve in relation to spam with the help of mathematical tools.

The Bayesian probability that Graham adopted for this purpose is named for Thomas Bayes, who sketched it out in the 1750s. It can be briefly summarized by a common analogy using black and white marbles. Imagine someone new to this world seeing the first sunset of her life. Her question: will the sun rise again tomorrow? In ignorance, she defaults to a fifty-fifty chance and puts a black marble and a white marble into a bag. When the sun rises, she puts another white marble in. The probability of randomly picking white from the bag—that is, the probability of the sun rising based on her present evidence—has gone from 1 in 2 to 2 in 3. The next day, when the sun rises, she adds another marble, moving it to 3 in 4, and so on. Over time, she will approach (but never reach) certainty that the sun will rise. If, one terrible morning, the sun does not rise, she will put in a black marble and the probability will decline in proportion to the history of her observations. This system can be extended to very complex problems in which each of the marbles in the bag is itself a bag of marbles: a total probability made up of many individual, varying probabilities—which is where email, and spam, come into the picture.
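
The marble arithmetic above can be checked with a short Python sketch. The function name and its arguments are invented for illustration; the only rule it encodes is the one in the analogy, namely one white and one black marble to start, plus one marble per observation.

```python
from fractions import Fraction


def sunrise_probability(sunrises, failures=0):
    """Chance of drawing a white marble after the given observations."""
    white = 1 + sunrises   # one white marble to start, one more per sunrise
    black = 1 + failures   # one black marble to start, one more per missed sunrise
    return Fraction(white, white + black)


# Probabilities after zero, one, two, and three observed sunrises.
first_days = [sunrise_probability(day) for day in range(4)]

# One missed sunrise after two good mornings pulls the estimate back down.
after_bad_morning = sunrise_probability(2, failures=1)
```

The first three values are 1/2, 2/3, and 3/4, matching the sequence in the text, and the estimate approaches but never reaches 1 however many sunrises are recorded.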

A document, for a Bayesian filing program, is a bag—a total probability—containing many little bags. The little bags are the tokens—the words—and each starts out with one white marble and one black. You train the program by showing it that this document goes in the Mail folder, and that document in the Spam folder, and so on. As you do this with many documents, the Bayesian system creates a probability for each significant word, dropping in the marbles. If it makes a mistake, whether a “false positive” (a legitimate message marked as spam) or a “false negative” (spam marked as a legitimate message), you correct it, and the program, faced with this new evidence from your correction, slightly reweights the probabilities of the words found in that document for the future. It turns language into probabilities, creating a characteristic vocabulary both of your correspondence and of the spam you receive. It will notice that words like “madam,” “guarantee,” “sexy,” and “republic” almost never appear in legitimate mail and words like “though,” “tonight,” and “apparently” almost never appear in spam. Soon it will be able to intercept spam, bouncing it before it reaches your computer or sending it to the spam folder. The Bayesian filer becomes the Bayesian filter.
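
A minimal Python sketch of the bag-of-bags idea follows. It is a generic naive Bayes classifier written for illustration, not the formula from Graham's essay or the code of any program named in this chapter; the class name, the tokenizer, the equal weighting of spam and nonspam, and the training sentences are all assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase runs of letters and apostrophes.
    return re.findall(r"[a-z']+", text.lower())


class MarbleFilter:
    """Naive Bayes sketch: each token is a little bag that starts with two marbles."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.docs = {"spam": 0, "ham": 0}

    def train(self, text, label):
        # Count each token once per document it appears in.
        self.docs[label] += 1
        self.counts[label].update(set(tokenize(text)))

    def token_spam_probability(self, token):
        # The +1 and +2 play the role of the starting white and black marbles.
        spam_rate = (self.counts["spam"][token] + 1) / (self.docs["spam"] + 2)
        ham_rate = (self.counts["ham"][token] + 1) / (self.docs["ham"] + 2)
        return spam_rate / (spam_rate + ham_rate)

    def spam_probability(self, text):
        # Combine token evidence as summed log-odds, treating tokens as independent.
        log_odds = 0.0
        for token in set(tokenize(text)):
            p = self.token_spam_probability(token)
            log_odds += math.log(p) - math.log(1.0 - p)
        if log_odds >= 0:
            return 1.0 / (1.0 + math.exp(-log_odds))
        odds = math.exp(log_odds)
        return odds / (1.0 + odds)


mail_filter = MarbleFilter()
mail_filter.train("Dear madam, we guarantee sexy results", "spam")
mail_filter.train("Guarantee from the republic, madam", "spam")
mail_filter.train("Apparently the meeting is tonight though", "ham")
mail_filter.train("Are you free tonight", "ham")

spammy = mail_filter.spam_probability("madam, a guarantee")
friendly = mail_filter.spam_probability("though apparently tonight")
```

With this tiny training set the first message scores about 0.9 and the second about 0.08. A correction is simply another call to `train` with the right label, which shifts the marbles for every word in that message.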

This idea was hardly new when Graham posted his essay. For somewhat mysterious reasons, naïve Bayesian algorithms happen to be exceptionally good at inductive learning tasks such as classifying documents and had been studied and applied to mail before. Jason Rennie’s ifile program—the one for which he’d volunteered his own email for testing purposes—was applying naïve Bayes to filing email and discarding “junk mail” in 1996. “As e-mail use has grown,” he writes in his 1998 paper on ifile, “some regularity has come about the sort of e-mail that appears in users’ mail boxes. In particular, unsolicited e-mail, such as ‘make money fast’ schemes, chain letters and porn advertisements, is becoming all too common. Filtering out such unwanted trash is known as junk mail filtering.” That same year, five years before Graham’s essay, several applications of Bayesian filtering systems to email spam were published; the idea was clearly in the air. Why didn’t it take off, then? The answer to that question explains both what made Graham’s approach so successful and where its weakness lay in ultimately stopping spam.

Understanding that answer requires a dive into some of the technical material that underlies the project of filtering. The biggest problem a filter can have is called differential loss: filtering the “false positives” mentioned earlier, when it misidentifies real mail as spam and deletes it accordingly. Email deals with time-dependent and value-laden materials such as job offers, appointments, and personal messages needing prompt response, and the importance of messages varies wildly. The potential loss can be quite different from message to message. (Think of the difference between yet another automatic mailing list digest versus a letter from a long-lost friend or work from a client.) The possibility of legitimate email misclassified as spam and either discarded or lost to the human eye amid hundreds of actual spam messages is so appalling that it can threaten to scuttle the whole project. In fact, setting aside the reality of potential losses in money, time, and miscommunication, the psychological stress of communicating on an uncertain channel is onerous. Did he get my message, or was it filtered? Is she just ignoring me? Am I missing the question that would change my life? It puts the user in the classic epistemological bind—you don’t know what you’re missing, but you know that you don’t know—with the constant threat of lost invitations, requests, and offers. With a sufficiently high rate of false positives, email becomes a completely untenable medium constantly haunted by failures of communication.

The loss in time paid by someone manually refiling a spam message misclassified as legitimate and delivered is quite small—about four seconds, on average—especially when compared to the potential disaster and distress of false positives. Filters, therefore, all err a little to the side of tolerance. If you set the filter to be too restrictive, letting through only messages with a very high probability of being legitimate, you run the risk of unacceptably high numbers of false positives. You have to accept messages on the borderline: messages about which the filter is dubious. This is the first strike against the early Bayesian spam filters, part of the answer to Graham’s question, “If people had been onto Bayesian filtering four years ago, why wasn’t everyone using it?” Pantel and Lin’s Bayesian system SpamCop had a rate of 1.16 percent false positives to Graham’s personal rate of 0.03 percent. Small as it may seem, spammers could take advantage of that space and the anxiety it produced.
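The trade-off can be made concrete with a small sketch. Everything here is hypothetical: the scores are made up, the function names are invented, and the cost assigned to a false positive is a placeholder chosen only to show the asymmetry. The four-second figure for refiling a missed spam is the one given above.

```python
def error_counts(scored, threshold):
    """Count filter mistakes at a given spam-probability threshold.

    scored is a list of (spam_probability, is_spam) pairs; a message is
    blocked when its probability is at or above the threshold.
    Returns (false_positives, false_negatives).
    """
    false_positives = sum(1 for p, is_spam in scored if p >= threshold and not is_spam)
    false_negatives = sum(1 for p, is_spam in scored if p < threshold and is_spam)
    return false_positives, false_negatives

def seconds_lost(false_positives, false_negatives, fp_seconds, fn_seconds=4):
    """Weigh the two kinds of error; fp_seconds is a made-up stand-in
    for the much larger cost of losing a real message."""
    return false_positives * fp_seconds + false_negatives * fn_seconds

# Made-up filter scores for seven messages.
scored = [
    (0.99, True), (0.95, True), (0.70, True), (0.55, True),
    (0.60, False), (0.30, False), (0.05, False),
]

for threshold in (0.5, 0.9):
    fp, fn = error_counts(scored, threshold)
    print(threshold, fp, fn, seconds_lost(fp, fn, fp_seconds=3600))
```

With these numbers, the stricter setting of 0.5 blocks every spam but also swallows one legitimate message, while the tolerant setting of 0.9 lets two spams through and loses nothing real. Once a lost message is priced far above four seconds, the tolerant setting wins, which is why filters accept borderline mail.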

Perhaps you can narrow that rate a bit by adding in some known spam characteristics—reliable markers that will separate spam from your mail. One group of early spam filter designers tried this method, and the collection of properties built into their filter to specify spam is a fascinating artifact, because the intervening decade has rendered them almost entirely wrong. They made an assumption of stability that spammers turned into a weakness, an issue that Graham’s ever-evolving system sidesteps. The programmers chose two kinds of “domain specific properties” or things that marked spam as spam: thirty-five phrases (“FREE!,” “be over 21,” “only $” as in “only $21.99”) and twenty nonphrasal features. The language of spam has obviously changed enormously—not least as a result of Bayesian filtering, as we will see—but the twenty additional features are where its mutability comes across the most. They include the percentage of non-alphanumeric characters (such as $ and !), “attached documents (most junk E-mail does not have them),” “when a given message was received (most junk E-mail is sent at night),” and the domain type of the sender’s email address, as “junk mail is virtually never sent from .edu domains.”

The “no attachments” rule does not hold: many varieties of spam include attachments, including viruses or other malware, in the form of documents and programs (as well as more baroque attachments such as mp3 files purporting to be voicemail that act as come-ons for identity theft exploits). Neither does reliance on delivery times, which have become much more complex, with interesting diurnal cycles related to computers being turned on and off as the Earth rotates. Address spoofing (by which messages can appear to come from any given address) and commandeered academic addresses (which are usually running on university servers with high-bandwidth connections, great for moving a few million messages fast) have rendered .edu addresses fairly meaningless. (Even worse was an oversight in the SpamAssassin filter that gave any message sent in 2010 a very low score—flagging it as likely to be spam—because that year “is grossly in the future” or was, when the filter was being developed.)

These fixed filtering elements, rendered not just useless but misleading after only a few years, highlight perhaps the biggest hurdle for the scientific antispam project: it moved so slowly in relation to its quarry. Formally accredited scientists publish and are citable, with bracketed numbers directing us to references in journals; hackers, self-described, just post their work online, however half-baked or buggy, because other people will help fix it. Compared to the patient corpora-building and discussion of naïve Bayesian variants in the scientific antispam project, the hackers’ antispam initiative was almost hilariously ramshackle, cheap, fast, semifunctional, and out of control. Graham’s “plan for spam” was written before he came across extant research papers, but it addresses the problems they faced. (In a later talk, he brought up his algorithm for lazy evaluation of research papers: “Just write whatever you want and don’t cite any previous work, and indignant readers will send you references to all the papers you should have cited.”) It offers a much lower rate of potential false positives and an entirely individual approach to training the system to identify spam—one that eliminates the need for general specifications. In theory, it acts as a pure, automated reflection of spam, moving just as fast as the ingenuity of all the spammers in the world. As they try new messages, it learns about them and blocks them, and making money at spam gets harder and harder. Above all, Graham’s approach was about speed.

His essay reflects this speed. There are no equations, except as expressed in the programming language Lisp—the question is not “Does this advance the mathematical conversation?” but “Does it run, and does that make a difference?” There are no citations, though he thanks a few people at the end (the “lazy evaluation of research papers” method kicked in afterward, when his essay had been mentioned on the geek news site Slashdot). The language of process and experiment is completely different: “(There is probably room for improvement here.) . . . [B]y trial and error I’ve found that a good way to do it is to double all the numbers in good. . . . There may be room for tuning here . . . I’ve found, again by trial and error, that .4 is a good number to use.” It offers a deeply different intellectual challenge than the papers described previously, saying, in essence, “This is what I did. If you think you can improve it, open a terminal window and get to work.” It is a document published in a wizardly environment in which anyone, at least in theory, can be a peer with the right attitude and relevant technical skills—and where the review happens after the fact, in linkback, commentary and further development of the project.
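To give a flavor of what those trial-and-error numbers do, here is a rough Python sketch of a per-word score built around the two choices quoted above: doubling the counts from good mail and falling back to .4 for words without enough evidence. It is an approximation for illustration, not a transcription of Graham's Lisp; the function name, the minimum-count cutoff, and the clamping bounds are assumptions.

```python
def word_score(good_count, bad_count, n_good, n_bad,
               unknown=0.4, min_seen=5, floor=0.01, ceiling=0.99):
    """Spam probability for a single word.

    good_count / bad_count: occurrences in legitimate mail and in spam.
    n_good / n_bad: number of messages in each collection (both > 0).
    min_seen, floor and ceiling are illustrative assumptions.
    """
    g = 2 * good_count  # "double all the numbers in good" to bias against false positives
    b = bad_count
    if g + b < min_seen:
        return unknown  # too little evidence: ".4 is a good number to use"
    good_rate = min(1.0, g / n_good)
    bad_rate = min(1.0, b / n_bad)
    score = bad_rate / (good_rate + bad_rate)
    # Clamp so that no single word can ever be treated as absolute proof.
    return max(floor, min(ceiling, score))

print(word_score(good_count=0, bad_count=10, n_good=100, n_bad=100))
print(word_score(good_count=5, bad_count=10, n_good=100, n_bad=100))
print(word_score(good_count=1, bad_count=1, n_good=100, n_bad=100))
```

Doubling the good counts means a word has to appear twice as often in spam as in real mail before it reads as neutral, a tilt toward tolerance. Setting the fallback slightly below one half gives an unfamiliar word a mild benefit of the doubt.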

It was received with this same sense of urgency and hands-on involvement, starting an avalanche of critique, coding, collaboration, and commentary. One example among many took place on the Python-Dev mailing list (Python is a high-level programming language, named, as it happens, for Monty Python). The participants were discussing bogofilter, a spam filter that applied Graham’s naïve Bayes model: “Anybody up for pooling corpi (corpora?)?” writes a programmer almost immediately after Graham posted his essay. Tim Peters brings mathematical clarity to the table: “Graham pulled his formulas out of thin air, and one part of the scoring setup is quite dubious. This requires detail to understand”—which he then provides, moving through a deep problem in the application of Bayes’s theorem in Graham’s model. Open source advocate Eric Raymond replies with a possible workaround, closing his message with a question for Peters: “Oh, and do you mind if I use your algebra as part of bogofilter’s documentation?” Peters replies: “Not at all.” Within days the bogofilter project, based on Graham’s naïve Bayesian idea, was checked into a testing environment for developing software. The pace of spam’s transformations seemed to have found its match.

To really understand the impact of the naïve Bayesian model on email and on spam, multiply the Python bogofilter project’s fast-paced collaborative, communal effort out into several different programming languages and many competing projects. Graham lists a few in the FAQ he wrote following “A Plan for Spam”: Death2Spam, SpamProbe, Spammunition, Spam Bully, InboxShield, Junk-Out, Outclass, Disruptor OL, SpamTiger, JunkChief. Others come up in a separate essay: “There are now over 30 available. Apple has one, MSN has one, AOL is said to have one in beta, and you can be pretty sure Yahoo is working on one.” The Graham-promoted model of naïve Bayesian filtering remains the default approach to antispam filters to this day, though of course with many modifications, additions, and tweaks, as the Python-Dev conversation suggested. Very heavily modified and supplemented naïve Bayesian filters operate both at the level of personal email programs and the level of the very large webmail providers such as Microsoft’s Hotmail and Google’s Gmail (in concert with numerous other filtering techniques).

There are many reasons for its success. There’s context, for one: Graham’s was a chatty essay full of code examples posted by a widely read Internet entrepreneur, then linked and reposted on high-traffic news and discussion sites, rather than a technical research paper presented at an academic workshop about machine learning for text classification. With its language of “probably room for improvement here,” it was built for rapid adoption by influential groups in the hacker community, like the gang at Python-Dev. Most important, it offered a thoughtful and persuasive argument for its graceful technical hack, an argument about the social and economic structure of spam—how spam works in the most general sense and how it could be broken. This argument was adopted, implicitly and explicitly, with Graham’s filter, and, along with the problem of false positives, it formed the seam of failure that spam practices exploited in their transformation and survival.

All of Graham’s argument in the original “A Plan for Spam” hinges on two points. The first is that “the Achilles heel of the spammers is their message.” The only point on which to reliably stop spam—the one thing spammers cannot circumvent—is the words they need to make the recipient act in some way. Hence the use of Bayesian filters to analyze the text itself and to treat the very language used in the appeal to block it. The second, related point is that this block does not need to work perfectly but just very well, because the goal is not to criminalize spam, publicly shame spammers, or educate recipients, as previous projects sought to—the goal is simply to make spamming less profitable. “The spammers are businessmen,” he writes. “They send spam because it works. It works because although the response rate is abominably low . . . the cost, to them, is practically nothing. . . . Sending spam does cost the spammer something, though. So the lower we can get the response rate—whether by filtering, or by using filters to force spammers to dilute their pitches—the fewer businesses will find it worth their while to send spam.” The promise is that spam is that rare class of problem that, when ignored, actually goes away. “If we get good enough at filtering out spam, it will stop working, and the spammers will actually stop sending it.”
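The arithmetic behind that argument is simple enough to sketch. The figures below are entirely hypothetical, chosen only to show the shape of the calculation, and the function name `campaign_profit` is invented for the example.

```python
def campaign_profit(messages, response_rate, revenue_per_response,
                    cost_per_message, filtered_fraction=0.0):
    """Profit from one mailing.

    The spammer pays to send every message, but only the messages that
    get past the filters have any chance of producing a response.
    """
    delivered = messages * (1 - filtered_fraction)
    revenue = delivered * response_rate * revenue_per_response
    cost = messages * cost_per_message
    return revenue - cost

# Hypothetical campaign: a million messages, one response per hundred
# thousand delivered, $50 per response, a hundredth of a cent per message.
for filtered in (0.0, 0.5, 0.9):
    profit = campaign_profit(
        messages=1_000_000,
        response_rate=0.00001,
        revenue_per_response=50,
        cost_per_message=0.0001,
        filtered_fraction=filtered,
    )
    print(filtered, round(profit, 2))
```

With nothing filtered, this made-up campaign clears a profit. With nine messages in ten filtered, it loses money, even though the filter is far from perfect. That is the sense in which the block needs to work very well rather than perfectly.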

Graham saw legitimated and theoretically law-abiding attempts at email marketing as spam’s most dangerous wing, the leading edge of an Internet in which spammers who can afford lobbyists work with impunity and email has become one big marketing machine. Opt-in spam was one of the most popular forms of this going-legit spam activity, whose proponents argued that the recipients of their spam messages had subscribed—usually by having entered their address at a website whose terms of service allowed the site’s owners to sell addresses to marketers. Opt-in spam included courtesy instructions, at least in theory, about how to unsubscribe from the mailings. “Opt-in spammers are the ones at the more legitimate end of the spam spectrum,” Graham wrote. “The arrival of better filters is going to put an end to the fiction of opt-in, because opt-in spam is especially vulnerable to filters [due to the readily recognizable legal and unsubscribing boilerplate text that appeared in the messages]. . . . Once statistical filters are widely deployed, most opt-in spam will go right into the trash. This should flush the opt-in spammers out of their present cover of semi-legitimacy.” If people could vote with their filters as to what was spam, you would have to start actively deceiving them and engaging in less legal and more lucrative practices to make it pay. With such filters broadly deployed, “online marketers” striving for legitimacy could no longer take shelter under the vague protection of legal precedents for telemarketing and direct mail. They would either drop out of the business or become entirely criminal, with all the additional personal, social, and financial costs that change entails. You did not need to criminalize them, because you could make them criminalize themselves—and then you could sic the law on them.

“The companies at the more legitimate end of the spectrum lobby for loopholes that allow bottom-feeders to slip through too. . . . If the ‘opt-in’ spammers went away . . . [i]t would be clear to everyone where marketing ended and crime began, and there would be no lobbyists working to blur the distinction.” Graham had thus fashioned an extraordinary double bind. The legal specifications that could legitimate spam and the materials used to display a message’s compliance with a law such as CAN-SPAM—disclaimers, citations of the relevant laws, statements of compliance, and links to unsubscribe—are very regular and therefore a perfect target for Bayesian filtering. With good filters in place, the process of becoming legitimate—of complying with the laws—makes your business much less profitable. You have to become a criminal or get out.

Along with high-profile busts and legal cases in the American spam community such as those against Sanford Wallace and Alan Ralsky, Graham’s filter and its argument succeeded almost completely. Laws and filtering together wiped out the world of legitimated online marketers, the kind of people who wanted to move lots of relatively low-margin products with a straightforward pitch to the world’s email addresses, by starving them in a deep financial freeze. In the process the field was left to agile, adaptable, and far more resourceful criminals. In Graham’s plan for spam lay the seeds of its own relative undoing. In its success, it killed the profit model most vulnerable to it and initiated the transformation of spam into a new and still more problematic practice, one far harder to control.
