MAKING SPAM HACKABLE
I thought spam robots would become more sophisticated—the robots would fight antirobots and antiantirobots and eventually all the spam robots would become lawyers, and take over the world. But the filters were pretty good, so that prediction was wrong.
— Robert Laughlin
“Norbert Wiener said if you compete with slaves you become a slave, and there is something similarly degrading about competing with spammers,” wrote Paul Graham, a prominent programmer in the Lisp language and, as a success of the first dot-com bubble, one of the founders of the venture capital firm Y Combinator. His landmark essay “A Plan for Spam” is one of the most influential documents in the history of the anti-email-spam movement. The project Graham started by proposing a new form of filtering, which was rapidly taken up by many hands, is important for two reasons: first, because he won—the idea he popularized, in accidental concert with U.S. laws, effectively destroyed email spamming as it then existed, and it did so by sidestepping the social complexities and nuances spam exploited and attacking it on a simple and effective technical point. Second, because he lost, after a fashion: his pure and elegant technical attack was based on a new set of assumptions about what spam was and how spammers worked, and email spammers took advantage of those assumptions, transforming their trade and developing many of the characteristics that shape it today.
“A Plan for Spam,” the essay that launched a thousand programming projects, is a series of tactics under the banner of a strategy. The tactics include an economic rationale for antispam filters, a filter based on measuring probabilities, a trial-and-error approach to mathematics, and a hacker’s understanding that others will take the system he proposes and train and modify it for themselves—that there can be no general understanding or “gold standard” for spam as the scientists were seeking but only specific cases for particular individuals. The overarching strategy is to transfer the labor of reading and classifying spam from humans to machines—which is where Norbert Wiener comes in.
Wiener, a mathematician and polymath who coined the term “cybernetics” as we currently understand it, found much to concern him in the tight coupling of control and communication between humans and machines in the middle of the twentieth century. He worried about the delegation of control over nuclear weapons to game theory and electronic computers and about the feedback from the machines to the humans and to human society. In the 1948 introduction to his book Cybernetics, Wiener made a statement that he would return to intermittently in later studies such as 1950’s The Human Use of Human Beings: “[Automation and cybernetic efficiencies] gives the human race a new and most effective collection of mechanical slaves to perform its labor. Such mechanical labor has most of the economic properties of slave labor, although, unlike slave labor, it does not involve the direct demoralizing effects of human cruelty. However, any labor that accepts the conditions of competition with slave labor accepts the conditions of slave labor, and is essentially slave labor. The key word of this statement is competition.”
For Wiener, to compete with slaves is to become, in some sense, a slave, like Soviet workers competing with impossible Stakhanovite goals during the Second Five-Year Plan. Graham paraphrases Wiener because he wants to present his feelings about spam as a parallel case: “One great advantage of the statistical approach is that you don't have to read so many spams. Over the past six months, I've read literally thousands of spams, and it is really kind of demoralizing. . . . To recognize individual spam features you have to try to get into the mind of the spammer, and frankly I want to spend as little time inside the minds of spammers as possible.” It’s a complaint that will recur with other researchers—to really understand the spamming process, you have to run an emulation of the spammer in your mind and in your code, and the sleaziness that entails is degrading.
The analogy with Wiener is inexact in an informative way, though, which describes Graham’s strategy in a nutshell. Graham does not have to deal with spammers by competing with them. He is not sending out spam in turn or trying to take advantage of their credulity. In no way is he being demoted to the economic status of a spammer by his work, because he is not competing with them—his machine is competing with them. He is doing exactly what Wiener predicted, though he does not say as much: he is building a system in which the spammers will be obliged to compete with machines, with mechanical readers that filter and discard with unrelenting, inhuman attention, persistence, and acuity. With his mechanical slaves, he will in turn make the business of spamming into slavery and thus unrewarding. As Wiener feared automation would end the economic and political basis for a stable, social-democratic society (“based on human values other than buying or selling,” as he put it), Graham means to end the promise of profit and benefit for small effort that spam initially offers.
The means of ending that promise of profit is Graham’s technological tactic: filtering spam by adopting a system based on naïve Bayesian statistical analysis (more about this in a moment). The one thing spammers cannot hide, Graham argued, is their text: the characteristic language of spam. They can forge return addresses and headers and send their messages through proxies and open relays and so on, but that distinctive spammy tone, the cajolery or the plea, must be there to convince the human on the far side to click the link. What he proposed was a method for turning that language into an object available to hackers, making it possible to build near-perfect personal filters that can rapidly improve in relation to spam with the help of mathematical tools.
The Bayesian probability that Graham adopted for this purpose is named for Thomas Bayes, who sketched it out in the 1750s. It can be briefly summarized by a common analogy using black and white marbles. Imagine someone new to this world seeing the first sunset of her life. Her question: will the sun rise again tomorrow? In ignorance, she defaults to a fifty-fifty chance and puts a black marble and a white marble into a bag. When the sun rises, she puts another white marble in. The probability of randomly picking white from the bag—that is, the probability of the sun rising based on her present evidence—has gone from 1 in 2 to 2 in 3. The next day, when the sun rises, she adds another white marble, moving the probability to 3 in 4, and so on. Over time, she will approach (but never reach) certainty that the sun will rise. If, one terrible morning, the sun does not rise, she will put in a black marble and the probability will decline in proportion to the history of her observations. This system can be extended to very complex problems in which each of the marbles in the bag is itself a bag of marbles: a total probability made up of many individual, varying probabilities—which is where email, and spam, come into the picture.
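The marble arithmetic here is simply the rule of succession, and it can be sketched in a few lines of Python. This is an illustration of the analogy only, not code from Graham's essay; the function name is a convenience of the sketch:

```python
from fractions import Fraction

def sunrise_probability(sunrises_seen, failures_seen=0):
    """Rule-of-succession estimate: the bag starts with one white and
    one black marble; each sunrise adds a white marble, each failure
    a black one."""
    white = 1 + sunrises_seen
    total = 2 + sunrises_seen + failures_seen
    return Fraction(white, total)

print(sunrise_probability(0))  # 1/2, total ignorance
print(sunrise_probability(1))  # 2/3, after one observed sunrise
print(sunrise_probability(2))  # 3/4, creeping toward certainty
```

Each new observation nudges the estimate, and no finite run of sunrises ever quite reaches 1.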
A document, for a Bayesian filing program, is a bag—a total probability—containing many little bags. The little bags are the tokens—the words—and each starts out with one white marble and one black. You train the program by showing it that this document goes in the Mail folder, and that document in the Spam folder, and so on. As you do this with many documents, the Bayesian system creates a probability for each significant word, dropping in the marbles. If it makes a mistake, whether a “false positive” (a legitimate message marked as spam) or a “false negative” (spam marked as a legitimate message), you correct it, and the program, faced with this new evidence from your correction, slightly reweights the probabilities of the words found in that document for the future. It turns language into probabilities, creating both a characteristic vocabulary of your correspondence and of the spam you receive. It will notice that words like “madam,” “guarantee,” “sexy,” and “republic” almost never appear in legitimate mail and words like “though,” “tonight,” and “apparently” almost never appear in spam. Soon it will be able to intercept spam, bouncing it before it reaches your computer or sending it to the spam folder. The Bayesian filer becomes the Bayesian filter.
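The bag-of-bags training loop can be rendered as a toy program. Everything in the sketch below, from the class name to the whitespace tokenizer to the log-space combination, is an illustrative assumption rather than the code of any actual filter:

```python
import math
from collections import Counter

class BayesianFiler:
    """Toy version of the training loop described above: each token
    keeps spam and legitimate-mail counts, starting from one marble
    of each color (add-one smoothing)."""

    def __init__(self):
        self.spam_counts = Counter()
        self.ham_counts = Counter()

    def train(self, text, is_spam):
        # Filing a document drops a marble into each of its tokens' bags.
        bucket = self.spam_counts if is_spam else self.ham_counts
        bucket.update(text.lower().split())

    def token_prob(self, token):
        s = self.spam_counts[token] + 1   # the starting black marble
        h = self.ham_counts[token] + 1    # the starting white marble
        return s / (s + h)

    def spam_score(self, text):
        # Naive-independence combination, done in log space for stability.
        logit = sum(math.log(p) - math.log(1 - p)
                    for p in map(self.token_prob, text.lower().split()))
        return 1 / (1 + math.exp(-logit))

f = BayesianFiler()
f.train("guarantee sexy madam free", is_spam=True)
f.train("tonight though apparently dinner", is_spam=False)
print(f.spam_score("free guarantee") > 0.5)   # True
print(f.spam_score("dinner tonight") > 0.5)   # False
```

Correcting a mistake is just another call to `train`, which reweights the affected tokens exactly as the passage describes.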
This idea was hardly new when Graham posted his essay. For somewhat mysterious reasons, naïve Bayesian algorithms happen to be exceptionally good at inductive learning tasks such as classifying documents and had been studied and applied to mail before. Jason Rennie’s ifile program— the one for which he’d volunteered his own email for testing purposes— was applying naïve Bayes to filing email and discarding “junk mail” in 1996. “As e-mail use has grown,” he writes in his 1998 paper on ifile, “some regularity has come about the sort of e-mail that appears in users’ mail boxes. In particular, unsolicited e-mail, such as ‘make money fast’ schemes, chain letters and porn advertisements, is becoming all too common. Filtering out such unwanted trash is known as junk mail filtering.” That same year, five years before Graham’s essay, several applications of Bayesian filtering systems to email spam were published; the idea was clearly in the air. Why didn’t it take off, then? The answer to that question explains both what made Graham’s approach so successful and where its weakness lay in ultimately stopping spam.
Understanding that answer requires a dive into some of the technical material that underlies the project of filtering. The biggest problem a filter can have is called differential loss: filtering the “false positives” mentioned earlier, when it misidentifies real mail as spam and deletes it accordingly. Email deals with time-dependent and value-laden materials such as job offers, appointments, and personal messages needing prompt response, and the importance of messages varies wildly. The potential loss can be quite different from message to message. (Think of the difference between yet another automatic mailing list digest versus a letter from a long-lost friend or work from a client.) The possibility of legitimate email misclassified as spam and either discarded or lost to the human eye amid hundreds of actual spam messages is so appalling that it can threaten to scuttle the whole project. In fact, setting aside the reality of potential losses in money, time, and miscommunication, the psychological stress of communicating on an uncertain channel is onerous. Did he get my message, or was it filtered? Is she just ignoring me? Am I missing the question that would change my life? It puts the user in the classic epistemological bind—you don’t know what you’re missing, but you know that you don’t know—with the constant threat of lost invitations, requests, and offers. With a sufficiently high rate of false positives, email becomes a completely untenable medium constantly haunted by failures of communication.
The loss in time paid by someone manually refiling a spam message misclassified as legitimate and delivered is quite small—about four seconds, on average—especially when compared to the potential disaster and distress of false positives. Filters, therefore, all err a little to the side of tolerance. If you set the filter to be too restrictive, letting through only messages with a very high probability of being legitimate, you run the risk of unacceptably high numbers of false positives. You have to accept messages on the borderline: messages about which the filter is dubious. This is the first strike against the early Bayesian spam filters, part of the answer to Graham’s question, “If people had been onto Bayesian filtering four years ago, why wasn’t everyone using it?” Pantel and Lin’s Bayesian system SpamCop had a rate of 1.16 percent false positives to Graham’s personal rate of 0.03 percent. Small as it may seem, spammers could take advantage of that space and the anxiety it produced.
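This asymmetry of costs, and the tolerance it forces, can be made concrete in a short sketch. The four-second refiling cost comes from the text, and the 0.9 cutoff is the spam threshold Graham's own essay uses; the numeric cost assigned to a false positive is purely an illustrative assumption:

```python
COST_FALSE_NEGATIVE = 4        # seconds to refile a delivered spam (from the text)
COST_FALSE_POSITIVE = 10_000   # illustrative stand-in for a lost legitimate message

def classify(p_spam, threshold=0.9):
    # Err toward tolerance: only junk a message the filter is very sure about.
    return "spam" if p_spam > threshold else "inbox"

def expected_loss(p_spam, decision):
    # Expected cost of each decision, given the filter's belief p_spam.
    if decision == "spam":
        return (1 - p_spam) * COST_FALSE_POSITIVE
    return p_spam * COST_FALSE_NEGATIVE

# A dubious, borderline message (p = 0.8) is far cheaper to deliver than to junk.
print(expected_loss(0.8, "inbox") < expected_loss(0.8, "spam"))  # True
```

Whatever exact cost one assigns to a lost real message, so long as it dwarfs four seconds of refiling, the rational threshold stays high and borderline messages get through.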
Perhaps you can narrow that rate a bit by adding in some known spam characteristics—reliable markers that will separate spam from your mail. One group of early spam filter designers tried this method, and the collection of properties built into their filter to specify spam is a fascinating artifact, because the intervening decade has rendered them almost entirely wrong. They made an assumption of stability that spammers turned into a weakness, an issue that Graham’s ever-evolving system sidesteps. The programmers chose two kinds of “domain specific properties,” or things that marked spam as spam: thirty-five phrases (“FREE!,” “be over 21,” “only $” as in “only $21.99”) and twenty nonphrasal features. The language of spam has obviously changed enormously—not least as a result of Bayesian filtering, as we will see—but the twenty additional features are where its mutability comes across the most. They include the percentage of nonalphanumeric characters (such as $ and !), “attached documents (most junk E-mail does not have them),” “when a given message was received (most junk E-mail is sent at night),” and the domain type of the sender’s email address, as “junk mail is virtually never sent from .edu domains.”
The “no attachments” rule does not hold: many varieties of spam include attachments, including viruses or other malware, in the form of documents and programs (as well as more baroque attachments such as mp3 files purporting to be voicemail that act as come-ons for identity theft exploits). Neither does reliance on delivery times, which have become much more complex, with interesting diurnal cycles related to computers being turned on and off as the Earth rotates. Address spoofing (by which messages can appear to come from any given address) and commandeered academic addresses (which are usually running on university servers with high-bandwidth connections, great for moving a few million messages fast) have rendered .edu addresses fairly meaningless. (Even worse was an oversight in the SpamAssassin filter that gave any message sent in 2010 a very low score—flagging it as likely to be spam—because that year “is grossly in the future” or was, when the filter was being developed.)
These fixed filtering elements, rendered not just useless but misleading after only a few years, highlight perhaps the biggest hurdle for the scientific antispam project: it moved so slowly in relation to its quarry. Formally accredited scientists publish and are citable, with bracketed numbers directing us to references in journals; hackers, self-described, just post their work online, however half-baked or buggy, because other people will help fix it. Compared to the patient corpora-building and discussion of naïve Bayesian variants in the scientific antispam project, the hackers’ antispam initiative was almost hilariously ramshackle, cheap, fast, semifunctional, and out of control. Graham’s “plan for spam” was written before he came across extant research papers, but it addresses the problems they faced. (In a later talk, he brought up his algorithm for lazy evaluation of research papers: “Just write whatever you want and don’t cite any previous work, and indignant readers will send you references to all the papers you should have cited.”) It offers a much lower rate of potential false positives and an entirely individual approach to training the system to identify spam— one that eliminates the need for general specifications. In theory, it acts as a pure, automated reflection of spam, moving just as fast as the ingenuity of all the spammers in the world. As they try new messages, it learns about them and blocks them, and making money at spam gets harder and harder. Above all, Graham’s approach was about speed.
His essay reflects this speed. There are no equations, except as expressed in the programming language Lisp—the question is not “Does this advance the mathematical conversation?” but “Does it run, and does that make a difference?” There are no citations, though he thanks a few people at the end (the “lazy evaluation of research papers” method kicked in afterward, when his essay had been mentioned on the geek news site Slashdot). The language of process and experiment is completely different: “(There is probably room for improvement here.) . . . [B]y trial and error I’ve found that a good way to do it is to double all the numbers in good. . . . There may be room for tuning here . . . I’ve found, again by trial and error, that .4 is a good number to use.” It offers a deeply different intellectual challenge than the papers described previously, saying, in essence, “This is what I did. If you think you can improve it, open a terminal window and get to work.” It is a document published in a wizardly environment in which anyone, at least in theory, can be a peer with the right attitude and relevant technical skills—and where the review happens after the fact, in linkbacks, commentary, and further development of the project.
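The trial-and-error numbers Graham quotes, doubling the counts in good and defaulting to .4, come from the Lisp formulas in his essay, which translate into Python roughly as follows. This is a paraphrase of those formulas, not Graham's own code:

```python
def token_probability(bad_count, good_count, nbad, ngood):
    """Per-token spam probability, paraphrasing the essay's Lisp:
    counts from the good corpus are doubled ("by trial and error"),
    the ratio is clamped to [.01, .99], and a never-seen token
    defaults to the trial-and-error value of .4."""
    g = 2 * good_count
    if bad_count + g == 0:
        return 0.4
    b = min(1.0, bad_count / nbad)
    gr = min(1.0, g / ngood)
    return max(0.01, min(0.99, b / (gr + b)))

def combine(probs, n=15):
    """Score a message from its n most 'interesting' tokens (those
    farthest from the neutral 0.5), combined by the naive Bayes
    product rule."""
    top = sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)[:n]
    prod = comp = 1.0
    for p in top:
        prod *= p
        comp *= 1 - p
    return prod / (prod + comp)

# A token seen ten times in spam and never in good mail is near-certain spam.
print(token_probability(10, 0, nbad=100, ngood=100))  # 0.99
# A token the filter has never seen leans mildly innocent.
print(token_probability(0, 0, nbad=100, ngood=100))   # 0.4
```

The clamps, the doubling, and the .4 are exactly the sort of tuned-by-hand constants the essay cheerfully admits to, and exactly what the Python-Dev mathematicians would later pick apart.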
It was received with this same sense of urgency and hands-on involvement, starting an avalanche of critique, coding, collaboration, and commentary. One example among many took place on the Python-Dev mailing list discussion (Python is a high-level programming language, named, as it happens, for Monty Python). They were discussing bogofilter, a spam filter that applied Graham’s naïve Bayes model: “Anybody up for pooling corpi (corpora?)?” writes a programmer almost immediately after Graham posted his essay. Tim Peters brings mathematical clarity to the table: “Graham pulled his formulas out of thin air, and one part of the scoring setup is quite dubious. This requires detail to understand”—which he then provides, moving through a deep problem in the application of Bayes’s theorem in Graham’s model.27 Open source advocate Eric Raymond replies with a possible workaround, closing his message with a question for Peters: “Oh, and do you mind if I use your algebra as part of bogofilter’s documentation?” Peters replies: “Not at all.” Within days the bogofilter project, based on Graham’s naïve Bayesian idea, was checked into a testing environment for developing software. The pace of spam’s transformations seemed to have found its match.
To really understand the impact of the naïve Bayesian model on email and on spam, multiply the Python bogofilter project’s fast-paced collaborative, communal effort out into several different programming languages and many competing projects. Graham lists a few in the FAQ he wrote following “A Plan for Spam”: Death2Spam, SpamProbe, Spammunition, Spam Bully, InboxShield, Junk-Out, Outclass, Disruptor OL, SpamTiger, JunkChief. Others come up in a separate essay: “There are now over 30 available. Apple has one, MSN has one, AOL is said to have one in beta, and you can be pretty sure Yahoo is working on one.”28 The Graham-promoted model of naïve Bayesian filtering remains the default approach to antispam filters to this day, though of course with many modifications, additions, and tweaks, as the Python-Dev conversation suggested. Very heavily modified and supplemented naïve Bayesian filters operate both at the level of personal email programs and the level of the very large webmail providers such as Microsoft’s Hotmail and Google’s Gmail (in concert with numerous other filtering techniques).
There are many reasons for its success. There’s context, for one: Graham’s was a chatty essay full of code examples posted by a widely read Internet entrepreneur, then linked and reposted on high-traffic news and discussion sites, rather than a technical research paper presented at an academic workshop about machine learning for text classification. With its language of “probably room for improvement here,” it was built for rapid adoption by influential groups in the hacker community, like the gang at Python-Dev. Most important, it offered a thoughtful and persuasive argument for its graceful technical hack, an argument about the social and economic structure of spam—how spam works in the most general sense and how it could be broken. This argument was adopted, implicitly and explicitly, with Graham’s filter, and, along with the problem of false positives, it formed the seam of failure that spam practices exploited in their transformation and survival.
All of Graham’s argument in the original “A Plan for Spam” hinges on two points. The first is that “the Achilles heel of the spammers is their message.” The only point on which to reliably stop spam—the one thing spammers cannot circumvent—are the words they need to make the recipient act in some way. Hence the use of Bayesian filters to analyze the text itself and to treat the very language used in the appeal to block it. The second, related point is that this block does not need to work perfectly but just very well, because the goal is not to criminalize spam, publicly shame spammers, or educate recipients, as previous projects sought to—the goal is simply to make spamming less profitable. “The spammers are businessmen,” he writes. “They send spam because it works. It works because although the response rate is abominably low . . . the cost, to them, is practically nothing. . . . Sending spam does cost the spammer something, though. So the lower we can get the response rate—whether by filtering, or by using filters to force spammers to dilute their pitches—the fewer businesses will find it worth their while to send spam.” The promise is that spam is that rare class of problem that, when ignored, actually goes away. “If we get good enough at filtering out spam, it will stop working, and the spammers will actually stop sending it.”
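Graham's economic argument reduces to a one-line profit function. The numbers below are purely illustrative assumptions chosen to show the sign flip, not figures from the text:

```python
def spam_profit(recipients, response_rate, revenue_per_response, cost_per_message):
    # "They send spam because it works": the trade is profitable only
    # while the tiny revenue per recipient exceeds the tinier sending cost.
    return recipients * (response_rate * revenue_per_response - cost_per_message)

# Before filtering: fifteen buyers per million messages still pays.
print(spam_profit(1_000_000, 15e-6, 20.0, 5e-5) > 0)   # True
# Filters cut the response rate tenfold, and the campaign loses money.
print(spam_profit(1_000_000, 1.5e-6, 20.0, 5e-5) > 0)  # False
```

The filter never has to catch everything; it only has to push the response rate below the line where the product of the multiplication turns negative.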
Graham saw legitimated and theoretically law-abiding attempts at email marketing as spam’s most dangerous wing, the leading edge of an Internet in which spammers who can afford lobbyists work with impunity and email has become one big marketing machine. Opt-in spam was one of the most popular forms of this going-legit spam activity, whose proponents argued that the recipients of their spam messages had subscribed—usually by having entered their address at a website whose terms of service allowed the site’s owners to sell addresses to marketers. Opt-in spam included courtesy instructions, at least in theory, about how to unsubscribe from the mailings. “Opt-in spammers are the ones at the more legitimate end of the spam spectrum,” Graham wrote. “The arrival of better filters is going to put an end to the fiction of opt-in, because opt-in spam is especially vulnerable to filters [due to the readily recognizable legal and unsubscribing boilerplate text that appeared in the messages]. . . . Once statistical filters are widely deployed, most opt-in spam will go right into the trash. This should flush the opt-in spammers out of their present cover of semi-legitimacy.”30 If people could vote with their filters as to what was spam, you would have to start actively deceiving them and engaging in less legal and more lucrative practices to make it pay. With such filters broadly deployed, “online marketers” striving for legitimacy could no longer take shelter under the vague protection of legal precedents for telemarketing and direct mail. They would either drop out of the business or become entirely criminal, with all the additional personal, social, and financial costs that change entails. You did not need to criminalize them, because you could make them criminalize themselves—and then you could sic the law on them.
“The companies at the more legitimate end of the spectrum lobby for loopholes that allow bottom-feeders to slip through too. . . . If the ‘opt-in’ spammers went away . . . [i]t would be clear to everyone where marketing ended and crime began, and there would be no lobbyists working to blur the distinction.”31 Graham had thus fashioned an extraordinary double bind. The legal specifications that could legitimate spam and the materials used to display a message’s compliance with a law such as CAN-SPAM—disclaimers, citations of the relevant laws, statements of compliance, and links to unsubscribe—are very regular and therefore a perfect target for Bayesian filtering. With good filters in place, the process of becoming legitimate—of complying with the laws—makes your business much less profitable. You have to become a criminal or get out.
Along with high-profile busts and legal cases in the American spam community such as those against Sanford Wallace and Alan Ralsky, Graham’s filter and its argument succeeded almost completely. Laws and filtering together wiped out the world of legitimated online marketers, the kind of people who wanted to move lots of relatively low-margin products with a straightforward pitch to the world’s email addresses, by starving them in a deep financial freeze. In the process the field was left to agile, adaptable, and far more resourceful criminals. In Graham’s plan for spam lay the seeds of its own relative undoing. In its success, it killed the profit model most vulnerable to it and initiated the transformation of spam into a new and still more problematic practice, one far harder to control.