June 19, 2013

Spam: A Shadow History of the Internet [Excerpt, Part 2]

Software filters required that spammers get devious. Learn in this new segment about using great works of literature to circumvent the intricate filtering schemes of the best software engineers.

The arrival of filtering required spammers to switch strategy. The carpetbaggers of spam’s youth left the scene, ushering in criminal sophisticates who set to work spoofing the filters. The game had changed. As Finn Brunton recounts in his brilliant history of spam, excerpted here for a second day: “Rather than sales pitches for goods or sites, they [messages] could be used for phishing, identity theft, credit card scams, and infecting the recipient’s computer with viruses, worms, adware, and other forms of dangerous and crooked malware. A successful spam message could net many thousands of dollars, rather than $5 or $10 plus whatever the spammer might make selling off their good addresses to other spammers.” Brunton illustrates the ingenuity of this transformation by detailing the highly inventive litspam—the hijacking of entire texts of Borges or Conan Doyle to waltz past spam filtering algorithms.

Litspam was only the beginning, to be followed by splogs, content farms and more. Enter the spam underworld for a second day. A table of contents guides you through the chapter—and, if you missed it, go back and read Part One of this captivating book excerpt.

TABLE OF CONTENTS

Filtering: Scientists and Hackers [Excerpt, Part 1] Part 1 of the Spam book excerpt series
Inventing Litspam Fooling spam filters by including neutral words lifted from works of great literature
The New Suckers Weaknesses in filtering and in spam attacks persist. Meet the 15 idiots
“New Twist in Affect”: Splogging, Content Farms and Social Spam The betrothal of spam and the blog in the form of the splog confuses search engine page-rank systems

POISONING: THE REINVENTION OF SPAM

INVENTING LITSPAM

The machines in the shop roar so wildly that often I forget in the roar that I am; I am lost in the terrible tumult, my ego disappears, I am a machine. I work, and work, and work without end; I am busy, and busy, and busy at all time. For what? and for whom? I know not, I ask not! How should a machine ever come to think?

— Morris Rosenfeld, “In the Sweat-Shop” (Songs from the Ghetto, trans. Leo Wiener, Norbert Wiener’s father)

Even as the filters were being installed, the first messages began to trickle in, like this one: “Took her last look of the phantom in the glass, only the year before, however, there had been stands a china rose go buy a box of betel, dearest of the larger carnivorous dinosaurs would meet it will be handy? Does it feel better now?” Or like this: “rosily pixie sir.chalet, healer partly .fanned media viva.jests, wheat skier.given rammed bath.weeded divas boxers.” Messages by the hundreds and thousands, sometimes with a link, mostly without. It was as though some huge Dada machine, the Tzara-Bot 9000, had just come online. This was litspam, cut-up literary texts statistically reassembled to take advantage of flaws in the design and deployment of Bayesian filters.

Bayesian filters destroyed email spam as a reputable business model in three ways, each of which became a springboard for spam’s transformation. First, filtering killed off conventional spam language, the genre of the respectable sales pitch, with its textual architecture of come-on rhetoric inherited from generations of Dale Carnegie books, direct-mail letters, cold calls, and carnival barkers (“Hundreds of millions of people surf the Web today. The internet is absolutely exploding everywhere in the world. Ask yourself: ‘Am I going to profit from this?’ Give me the chance to share with you this exiting business opportunity.”). This kind of material became a liability; its elements were too likely to be caught by the statistical analysis of words that the filters performed. Second, filtering made it a lot harder to make money through sales—if far fewer of your messages were getting through, you needed much better return on a successful message, not just a small profit on a bottle of generic pharmaceuticals, to make spam a viable business. Finally, filtering enormously increased the failure rate for spammy messages. If the filter caught the vast majority of your messages, you needed to send out a lot more, and be much more creative in their construction, to turn that tiny percentage or part of a percentage into a business. It was assumed that the message-sending capacity was a reliable limit, a fixed ceiling on spam operations: in the “Plan for Spam FAQ,” Paul Graham [a filtering expert] answers the question “If filters catch most spam, won’t spammers just send more to compensate?” with “Spammers already operate at capacity.” These three developments fed each other. If the filters attacked regularity in language, noting the presence of words with a high probability of appearing in spam messages, you had to be much more creative in the spam messages you wrote, putting more effort into each attempt than before.
You would see very little profit in return for your increased effort, because fewer of your messages would get through, and you would have to put more of that profit into your infrastructure, because you would need to hugely increase the amount of spam you could send.

They also contained three points for spam’s transformation into a new and different trade, of which litspam was a harbinger. All three points of transformation hinged on the success of Graham’s ideas: the enforcement of new laws combined with filtering had eliminated the mere profit-seeking carpetbaggers and left the business to the criminals. Filters made conventional sales language and legal disclaimers into liabilities, which meant that those willing to be entirely duplicitous could move into wholly different genres of message to get past the filters and make the user take action. Hence the recommended link from a half-remembered friend (or friendly stranger), the announcement of breaking news, and, most extraordinarily, the fractured textual experiment of litspam. If filtering made it much harder to make money per number of messages, spam messages could become much more individually lucrative: rather than sales pitches for goods or sites, they could be used for phishing, identity theft, credit card scams, and infecting the recipient’s computer with viruses, worms, adware, and other forms of dangerous and crooked malware. A successful spam message could net many thousands of dollars, rather than $5 or $10 plus whatever the spammer might make selling off their good addresses to other spammers. Finally, if the new filters meant the messages failed much more often, spammers could develop radically new methods for sending out spam that cost much less and enabled them to send much, much more—methods such as botnets, which we’ll come to shortly.

In the context of the transformations that spam was going through as it became much more criminal, experimental, and massively automated, litspam provides a striking example of the move into a new kind of computationally inventive spam production. Somewhere, an algorithmic bot with a pile of text files and a mailing list made a Joycean gesture announcing spam’s modernism.

To explain litspam, recall the problem of false positives: legitimate messages misfiled as spam. You cannot make the filter too strict. You need to give it some statistical latitude, because missing a legitimate message could cost the user far more than the average of 4.4 seconds it takes to identify and discard a spam message that gets through the filter. The success or failure of a filter depends on its rate of false positives; one important message lost may be too many, and Graham argued that the reason Bayesian filters did not take off in their first appearance was false positive rates like Patel and Lin’s 1.16 percent, rather than his 0.03 percent. Implicit in his argument was the promise that other people could reproduce or at least closely approximate his percentage. A person could indeed reproduce Graham’s near-perfect rate of false positives if they were very diligent, particularly in checking the marked-as-spam folder early in the filter’s life to correct misclassifications. It also helped to receive a lot of email with a particular vocabulary, a notable lexical profile, to act as the “negative,” legitimate nonspam words. These things were true of Graham. Building this filter was a serious project of his for which he was willing to read a lot of spam messages, do quite a bit of programming, and become a public advocate; it follows that his personal filter would be very carefully maintained. Graham had a distinctive corpus to work with on the initial filter: his personal messages, with all the vocabulary specific to a programmer and specialized venture capitalist—“Lisp [the programming language] . . . is effectively a kind of password for sending email to me,” he wrote in the original “A Plan for Spam” document.
His array of legitimate words, at the opposite side of the axis from the words that signal spam (such as “madam,” “guarantee,” and “republic”), includes words like “perl [another programming language],” “scripting,” “morris,” “quite,” and “continuation.”

Other individual users, however, may have a slightly higher rate of false positives because they have a characteristic vocabulary that overlaps more with the vocabulary of spam than Graham’s does, or because their vocabulary is aggregated with that of others behind a single larger filter belonging to an organization or institution, or simply because they are lazier in their classifications or unaware that they can classify spam rather than deleting it. (The problem of characteristic vocabulary was even worse for blog comment spam messages—the kind with a link to boost Google search rankings or bring in a few customers—because the spammers, or their automated programs, can copy and cut up the words in the post itself for the spam message, making evaluation much trickier.) Users are thus not perfect, and filters can be poorly implemented and maintained, and so must be a little tolerant of the borderline messages. In this imprecision, a two-pointed strategy for email spamming took shape:

1. In theory, you could sway a filter by including a lot of neutral or acceptable words in a message along with the spammier language, edging the probability for the message over into legitimacy. Linkless gibberish messages were the test probes for this idea, sent out in countless variations to see what was bounced and what got through: “I although / actually gorged with food, were always trumpets of the war / to sound! This stillness doth dart with snake venom on itwell, / I’d have laughed.”

2. After a spam message got through, the recipient was faced with a dilemma. If the recipient deleted the message, rather than flagging it as spam, the filter would read it as legitimate, and similar messages would get through in the future. If he or she flagged it as spam, the filter, always learning, would add some more marbles to the bags of probabilities represented by significant words, slightly reweighing innocent words such as “stillness,” “wheat,” “laughed,” and so on toward the probability of spam, cumulatively increasing the likelihood of false positives. These broadcasts from Borges’s Library of Babel would be, in effect, a way of taking words hostage. “Either the spam continues to move, or say goodbye to ‘laughed.’”
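Both halves of this strategy can be seen in a toy model. The sketch below is a deliberately simplified stand-in for a Bayesian filter, not Graham's actual code: the class name `TinyFilter`, the training messages, the clamping bounds, and the plain ratio used for each word's probability are all assumptions made for illustration. Real filters weight their counts by corpus size and look only at the most telling words.

```python
from collections import Counter
from math import prod


class TinyFilter:
    """A toy word-counting filter (hypothetical, for illustration only)."""

    def __init__(self):
        self.spam = Counter()  # number of spam messages containing each word
        self.ham = Counter()   # number of legitimate messages containing each word

    def train(self, words, is_spam):
        # Count each word once per message.
        (self.spam if is_spam else self.ham).update(set(words))

    def word_prob(self, word):
        s, h = self.spam[word], self.ham[word]
        if s + h == 0:
            return 0.5  # a word never seen before sits at fifty/fifty
        # Clamp so that a single word can never force the score to 0 or 1.
        return min(0.99, max(0.01, s / (s + h)))

    def score(self, words):
        # Naive Bayes combination of the per-word probabilities.
        probs = [self.word_prob(w) for w in set(words)]
        spammy = prod(probs)
        innocent = prod(1 - p for p in probs)
        return spammy / (spammy + innocent)


f = TinyFilter()
for _ in range(9):
    f.train(["guarantee", "madam"], is_spam=True)
    f.train(["stillness", "wheat", "laughed"], is_spam=False)

pitch = ["guarantee", "madam"]
padded = pitch + ["stillness", "wheat", "laughed"]

# Point 1: acceptable words drag the score down; unseen words do nothing.
score_pitch = f.score(pitch)
score_padded = f.score(padded)
score_dictionary = f.score(pitch + ["abjure", "chimera"])

# Point 2: flagging the padded message as spam taints the innocent words.
before = f.word_prob("laughed")
f.train(padded, is_spam=True)
after = f.word_prob("laughed")

print(f"pitch alone:        {score_pitch:.3f}")
print(f"padded with prose:  {score_padded:.3f}")
print(f"padded with rarities: {score_dictionary:.3f}")
print(f"'laughed' before and after flagging: {before:.2f} -> {after:.2f}")
```

In this toy setup the bare pitch scores close to 1, the same pitch padded with words the filter trusts falls well below one half, and padding with words the filter has never seen leaves the score where it was. Flagging the padded message then nudges "laughed" toward spam, which is the hostage-taking described in the second point.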

But why use literature? Early messages show that the first experiments along these lines were built of words randomly drawn from the dictionary. This approach does not work very well, because we actually use most words very seldom. The most frequently used word in English, “the,” occurs twice as often as the second most frequent, and three times as often as the third most frequent, and so on, with the great bulk of language falling far down the curve in a very long tail. From the perspective of the filter, all those words farther out on the curve of language—“abjure,” “chimera,” “folly”—are like the bag of marbles after that first sunset, with one black marble and one white; with no prior evidence, those unused words are at fifty/fifty odds and make no difference, and one “sexy” will still flag the message as spam. What the spammer needs is natural language, alive and in use up at the front of the curve.
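The shape of that curve is easy to put in numbers. The sketch below assumes an idealized version of the relationship just described, in which the word at rank r occurs 1/r as often as the top word; the vocabulary size of 50,000 and the function name `zipf_shares` are assumptions chosen for illustration, and real text follows the pattern only roughly.

```python
def zipf_shares(vocabulary_size):
    """Share of all word occurrences for each rank under an idealized 1/rank curve."""
    weights = [1 / rank for rank in range(1, vocabulary_size + 1)]
    total = sum(weights)
    return [w / total for w in weights]


# Assumed vocabulary size, for illustration only.
shares = zipf_shares(50_000)

# The top word is twice as common as the second and three times the third.
ratio_second = shares[0] / shares[1]
ratio_third = shares[0] / shares[2]

# How much running text the 100 most common words account for.
head = sum(shares[:100])

print(f"rank 1 vs rank 2: {ratio_second:.1f}x")
print(f"rank 1 vs rank 3: {ratio_third:.1f}x")
print(f"share of text covered by the top 100 words: {head:.0%}")
```

Under these assumptions a hundred words out of fifty thousand cover a little under half of everything written, which is why a word drawn at random from the dictionary is almost certain to be one the filter has never seen.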

A huge portion of the literature in the public domain is available online as plain text files, the format most convenient to programmers: thousands and thousands and thousands of books and stories and poems. These can be algorithmically fed into the maw of a program, chopped up and reassembled and dumped into spam messages to tip the needle just over into the negative, nonspam category. Hence the bizarre stop-start rhythm of many litspam messages, with flashes of lucidity in the midst of a fugue state, like disparate strips of film haphazardly spliced together. Their sources include all the canonical texts and work in the public domain available on sites like Project Gutenberg, as well as more recondite materials. Many authors in the science fiction vein are popular with hackers, who sometimes pay them the dubious honor of scanning their books with optical character recognition software to turn the printed words into text files that can be circulated online. Neal Stephenson’s encryption thriller Cryptonomicon is one of these books, available as a full text file through a variety of sources and intermittently in the form of chunky excerpts in spam messages over the course of years. “This is a curious form of literary immortality,” Stephenson observed. “E-mail messages are preserved, haphazardly but potentially forever, and so in theory some person in the distant future could reconstruct the novel by gathering together all of these spam messages and stitching them back together. On the other hand, e-mail filters learn from their mistakes. When the Cryptonomicon spam was sent out, it must have generated an immune response in the world's spam filtering systems, inoculating them against my literary style. So this could actually cause my writing to disappear from the Internet.”
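Nobody outside the trade knows exactly how the real engines were written, but the chopping and splicing can be imitated in a few lines. In the sketch below the function `cut_up` and its parameters are hypothetical, and the four short strings stand in for whole books; a real run would presumably load entire plain text files instead.

```python
import random

# Stand-ins for full public-domain texts: four fragments quoted in this chapter.
SOURCES = [
    "Took her last look of the phantom in the glass",
    "of the larger carnivorous dinosaurs would meet",
    "stands a china rose go buy a box of betel, dearest",
    "it will be handy? Does it feel better now?",
]


def cut_up(sources, pieces=4, min_words=3, max_words=6, seed=None):
    """Splice together short runs of consecutive words lifted from random sources.

    Assumes every source has at least `min_words` words.
    """
    rng = random.Random(seed)
    fragments = []
    for _ in range(pieces):
        words = rng.choice(sources).split()
        length = rng.randint(min_words, min(max_words, len(words)))
        start = rng.randint(0, len(words) - length)
        fragments.append(" ".join(words[start:start + length]))
    return fragments


fragments = cut_up(SOURCES, seed=7)
print(" ".join(fragments))
```

Each fragment is grammatical on its own terms and every word in it is ordinary, in-use language, but nothing connects one fragment to the next, which gives the output the same stop-start rhythm as the samples above.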

The deep strangeness of litspam is best illustrated by breaking a piece of it down, dissecting one of these flowers of mechanized language. The sample that opened this section, drawn at random from one of my spam-collecting addresses, is two sentences and forty-five words and was assembled from no fewer than four interpolated sources: “Took her last look of the phantom in the glass, only the year before, however, there had been stands a china rose go buy a box of betel, dearest of the larger carnivorous dinosaurs would meet it will be handy? Does it feel better now?” “Took her last look of the phantom in the glass” is from “The Shadows,” a fairy tale by Aberdonian fantasist George MacDonald. “Only the year before, however,” and “of the larger carnivorous dinosaurs would meet” are from chapters 15 and 11 of Arthur Conan Doyle’s adventure novel The Lost World. “Stands a china rose go buy a box of betel, dearest” is from Song IV of the “Epic of Bidasari” as translated in the Orientalist Chauncey Starkweather’s Malayan Literature. And “it will be handy? Does it feel better now?” is from Sinclair Lewis’s Main Street, chapter 20. Each of these fragments is subtly distorted in different ways—punctuation is dropped and the casing of letters changed—but left otherwise unedited. It’s a completely disinterested dispatch from an automated avant-garde that spammers and their recipients built largely by accident. “I began to learn, gentlemen,” as the ape says in Kafka’s “Report to an Academy,” another awkward speaker learning language as a means of escape: “Oh yes, one learns when one has to; one learns if one wants a way out; one learns relentlessly.”

Litspam obviously does not work for human readers, aside from its occasional interesting resemblance to stochastic knockoffs of the work of Tzara or Burroughs (with a hint of Louis Zukofsky’s quotation poems, or Bern Porter’s “Founds” assembled from NASA rocket documentation). If anything, its fractured lines and phrasal salad are a sign that something’s suspiciously wrong and the message should be discarded. As with the biface, robot-readable text of web pages that tell search engine spiders one thing and human visitors another, litspam is to be read differently by different actors: the humans, with their languages, and the filters with their probabilities, like the flower whose color and fragrance we enjoy, and whose splotched ultraviolet target the bee homes in on. Litspam cuts to the heart of spam’s strange expertise. It delivers its words at the point where our experience of words, the Gricean implicature that the things said are connected in some way to other things said or to the situation at hand, bruisingly intersects the affordances of digital text. Like a negative version of the Turing test, you think you will be chatting with a person over teletype (“Will X please tell me the length of his or her hair?” as Turing suggests) and instead end up with racks of vacuum tubes or, rather, a Java program with most of English-language literature in memory: “that when some members rouen, folio 1667.anglonorman antiquities p. Had concluded his speech to the king.” We look for sense, for pattern and meaning, whether in the Kuleshov Effect—the essence of montage, with different meanings attributed to the same strip of film depending on what it’s intercut with—or the power of prophetic signals, like a spread of Tarot cards, whose rich symbolic content is full of hooks we can connect with our own current preoccupations, fears, memories, and desires. 
If there’s a spammy core to the message—a recognizable pitch, link, or come-on—we might pick out that most salient portion (perhaps clicking on this will explain this bizarre message!) and the spam will still have done its job.

Let us return to Turing, briefly, and introduce the fascinating Imitation Game, before we leave litspam and the world of robot-read/writable text. The idea of a quantifiable, machine-mediated method of describing qualities of human affect recurs in the literature of a variety of fields, including criminology, psychology, artificial intelligence, and computer science. Its applications often provide insight into the criteria by which different human states are determined—as described, for example, in Ken Alder’s fascinating work on polygraphs, or in the still understudied history of the “fruit machine,” a device that (allegedly) measured pupillary, pulse, and other responses to pornographic images, developed and deployed during the 1950s for the purpose of identifying homosexuals in the Canadian military and the Royal Canadian Mounted Police (RCMP) in order to eliminate them from service. (It is like a sexually normative nightmare version of the replicant-catching Voight-Kampff machine in Blade Runner.) Within this search for human criteria, the most famous statement—and certainly the one that has generated the most consequent literature—is the so-called Turing Test. The goal of Turing’s 1950 thought experiment (which bears repeating, as it’s widely misunderstood today) was to “replace the question [of ‘Can machines think?’] by another, which is closely related to it and is expressed in relatively unambiguous words.” Turing considered the question of machines “thinking” or not to be “too meaningless to deserve discussion,” and, quite brilliantly, turned the question around to whether people think—or rather how we can be convinced that other people think. This project took the form of a parlor game: A and B, a man and a woman, communicate with an “interrogator,” C, by some intermediary such as a messenger or a teleprinter. C knows the two only as “X” and “Y”; after communicating with them, C is to render a verdict as to which is male and which female.
A is tasked with convincing C that he, A, is female and B is male; B’s task is the same. “We now ask the question,” Turing continues, “‘What will happen when a machine takes the part of A in this game?’ Will the interrogator decide wrongly as often when the game is played like this as he does when the game is played between a man and a woman? These questions replace our original, ‘Can machines think?’”

What litspam has produced, remarkably, is a kind of parodic imitation game in which one set of algorithms is constantly trying to convince the other of their acceptable degree of salience—of being of interest and value to the humans. As Charles Stross puts it, “We have one faction that is attempting to write software that can generate messages that can pass a Turing test, and another faction that is attempting to write software that can administer an ad hoc Turing test.” In other words, what we are seeing is the product of algorithmic writers producing text for algorithmic readers to parse and block, with the end product providing a fascinatingly fractured and inorganic kind of discourse, far off even from the combinatorial literature of avant-garde movements such as the Oulipo, the “Workshop of Potential Literature.” The particular economics of spamming reward sheer volume rather than message quality, and the great technical innovations lie on the production side, building systems that act with the profligacy of a redwood, which may produce a billion seeds over the course of its lifetime, of which one may grow into a mature tree. The messages don’t improve from their lower bound unless they have to, so the result doesn’t get “better” from a human perspective—that is, more convincing or plausibly human—just stranger.

Surrealist automatic writing has its particular associative rhythm, and the Burroughsian Cut-Up depends strongly on the taste for jarring juxtapositions favored by its authors (an article from Life, a sequence from The Waste Land, one of Burroughs’s “routines” in which mandrills from Venus kill Eisenhower). Litspam text, along with early comment spam and the strange spam blogs described in the next section, is the expression of an entirely different intentionality without the connotative structure produced by a human writer. The results returned by a probabilistically manipulated search engine, or the poisoned Bayesian spew of bot-generated spam, or the jumble of links given by a bad choice filtering algorithm act in a different way than any montage. They are more akin to flipping through channels on a television, with very sharp semantic transitions between spaces—from poetry to porn, a wiki to a closed corporate page, a reputable site to a spear-phishing mock-up. (If it has a cultural parallel, aside from John Cage’s Imaginary Landscape No. 4—in which two musicians manipulate a radio’s frequency, amplitude, and timbre according to a preestablished score, with no control over what’s being broadcast—it would be Stanley Kubrick’s speculative art form of the future, which he described as “mode jerking”: sudden, severe, jolting transitions between different environments.) Consider this message from “AKfour seven,” writing via a Brazilian domain hosted on an ISP in Scranton, Pennsylvania:

I stand here today humbled by the task before [url=http://www.bawwgt.com] dofus kamas[/url], grateful for the trust you have bestowed, mindful of the sacrifices borne by our [url=http://www.bawwgt.com]cheap dofus kamas[/url]. I thank President [url=http://www.bawwgt.com]dofus power leveling[/url] for his service to [url=http://www.bawwgt.com]buy dofus kamas[/url], as well as the generosity and cooperation he has shown throughout this transition.

It’s President Obama’s inaugural address, intercut with links to a site whose business it is to sell currency (“Kamas”) and other desirables for the French online role-playing game Dofus, which features various tree people, archers, and gambling cats—and a substantial gray market for in-game currency sold for real money. It is not simply that the smallest is spliced into the biggest, major with minor, now with then. It is the use of written words under the condition of pure arbitrary utility. As digitized, searchable, copy-and-paste-ready text, it is all one continuous matter—almost shockingly atemporal and best analogically compared not to a library or a conversation but to the “polymer goo” that Harrison White uses to describe social structures, full of complex striations and from which many different shapes can be extruded depending on need.

Finally, a note on this idea of “atemporality” to close this section on litspam. The concept of atemporal media has been discussed recently in terms of digital aesthetics and music. The digitization of media moves them into a kind of continuous present of use, the way all recorded music can now occupy a single, shuffled state of immanence from wildly different points of creation. An mp3 of the antediluvian old-time musician Dock Boggs, as recorded in the Norton Hotel with a borrowed banjo in 1927, segues into the synthesizer layers of Oneohtrix Point Never, who creates music in 2010 that could pass for electrocosmic voyages on vinyl from the 1970s. Historicity becomes another stylistic element, like timbre. As Brian Eno has put it, it’s all “current” now, which brings the aesthetics of recording itself to the fore as a stylistic choice with its own content, as all sounds coexist in a permanent digital noon. Litspam, setting aside its ultimate purpose of slipping through or damaging filters to sell more porn site logins or discontinued toys, is an extraordinary form of digital atemporality. Histories and myths, poetry, instructions for pleaching the lime trees of an ornamental garden, religious exegesis, and online tax guides constitute one shape, of which a given litspam message is a probability-guided surface. “In which gravitation is a consequence of the curvature of spacetime which governs the motion of inertial objects. The South Park episode Conjoined Fetus Lady and Season 1 of Freaks and Geeks depict dodgeball as a potentially violent sport. August Anheuser Busch IV (born June 15, 1964) is the great-great-grandson of Anheuser-Busch founder Adolphus Busch, the son of former chairman, president and CEO August Busch III. Many of these are produced by hurricanes or tropical storms along the coastal plain.”

Litspam was only one new form of postfilter spam among many, of course. Graham predicted that “the spam of the future will probably look something like this: ‘Hey there. Thought you should check out the following: http://www.27meg.com/foo.’” It squeaks by the filter on neutral language but gets caught with its suspicious URL, and we have indeed seen quite a bit of that, along with the variety of old-fashioned spam that makes it past imperfectly installed and trained filters. (Filtering also created a genius for euphemism on the part of spammers. A few of many, many terms in recent spam messages for promises about the male anatomy, which almost approach poetic allusion: “your engine in pants”/“Drilling machine”/“‘in-work-condition’ tool”/“Crazy penetrator!”/“Meatstick-champion!”/“Your nighttime failure”/“Make your volcano erupt over lion”/“the thing as you deserve it” and many others). Litspam remains something remarkable and special as an unintended consequence within the unintended consequence of spam: a loop of mechanical readers and mechanical writers generating texts from within the uncanny valley identified by Masahiro Mori. It is the chance meeting of Ulysses and a telephone interchange, as strange to our eyes as a pedantic speech from an ape, a tale told by a robot.

THE NEW SUCKERS

Graham never claimed that he or anyone else could filter spam perfectly, only that the filters would work well enough to make spamming into an unprofitable business. The various flavors of Bayesian filtering did, in fact, massively curtail the delivery of spam to the world’s inboxes. ISPs have the first layers of filters between the individual mailbox and the rest of the network, and by the end of 2006, they saw spam become an estimated 85 percent of all mail traffic on the far side of their filters—a number that holds steady, give or take a few percentage points, to this day. Most people see only a minuscule portion of this total amount. A vast wave is crashing continuously against the filters, with some occasional spillover. This was exactly the plan that Graham outlined. The response rate for spam was always dismal: spammer Davis Hawke reported a decent return at two-tenths of 1 percent in the period prior to the widespread use of Bayesian filters, and those filters hugely cut down on the amount delivered. It worked, then, on its technical terms and only on those terms. Therein lay spam’s vector. The social choices embedded in and enabled by the technology became the points of failure. In retrospect, these critical points were four: two on the side of users and two on the side of spammers.

FILTERS WERE UNEVENLY DEPLOYED AND TRAINED
Some ISPs, organizations, and users will do it better than others; some possess a more distinctive vocabulary; and some are more diligent in managing the training of the system. There will be varying estimations of the cost and acceptable chance of false positives. Many users may never be aware of the need to “classify-as-spam” when sorting their incoming mail. Rates will vary, programs will become obsolete, and there will be holes, however small.

THE QUESTION OF THE “15 IDIOTS”
Graham, in the months following “A Plan for Spam,” considered the possibility that the people most susceptible to spam—the people that make it profitable—will overlap with those least likely to install filters or feel comfortable using them. Arguing that spam makes money from the “15 stupidest or most perverted” people in a million, Graham continues: “The great danger is that whatever filter is most widely deployed in the idiot market will require too much effort by the user. . . . [T]he 15 idiots are probably also the 15 users who won't bother.” His solution, such as it is, is an assumption in the “Plan for Spam FAQ”: “I suspect that people stupid enough to respond to spam will often get email through one of the free services like Yahoo Mail or Hotmail, or through big providers like AOL or Earthlink. Once word spreads that it is possible to filter out most spam, they’ll be forced to offer effective filters.”
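For a sense of scale, the two response rates quoted in this section can be set side by side. The sketch below is back-of-the-envelope arithmetic only: Hawke's two-tenths of 1 percent and Graham's 15 in a million come from the text, while the function name and the 1 percent delivery rate are assumptions picked purely to show how filtering multiplies the volume a spammer must send.

```python
def messages_per_response(response_rate, delivered_fraction=1.0):
    """Messages that must be sent, on average, to draw one response."""
    return 1 / (response_rate * delivered_fraction)


HAWKE_RATE = 0.002            # two-tenths of 1 percent, before Bayesian filters
GRAHAM_RATE = 15 / 1_000_000  # Graham's "15 idiots" in a million

# Hypothetical: suppose filters let only 1 message in 100 through.
ASSUMED_DELIVERED = 0.01

unfiltered = messages_per_response(HAWKE_RATE)
filtered = messages_per_response(HAWKE_RATE, ASSUMED_DELIVERED)
graham = messages_per_response(GRAHAM_RATE)

print(f"Hawke, no filters:      about {unfiltered:,.0f} messages per response")
print(f"Hawke, 1% delivered:    about {filtered:,.0f} messages per response")
print(f"Graham's 15 in a million: about {graham:,.0f} messages per response")
```

Whatever the true delivery rate, the relationship is linear: if a filter stops 99 of every 100 messages, a sender needs a hundred times the volume to reach the same number of takers.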

CHANGE IN THE ECONOMICS OF SPAM PRODUCTION AND DISTRIBUTION
“Spammers,” Graham averred, “already operate at capacity.” In fact, as the filters went online, the production of the spam they were trying to stop was changing. The end of the legitimated spammers in a double bind of filtering and changes to the law was one rumble in spam’s tectonic shift toward an almost wholly criminal domain. Abandoning any pretense of legitimacy freed up a great deal of technical ingenuity. The development of systems like botnets, the use of ISPs in foreign jurisdictions (in some cases owned outright by gangsters), and the increasing sophistication of the programming of spam software at once boosted the capacity of spam distribution while lowering operating costs.

LIBERATION OF SPAM INTO PURE EXPERIMENT—AND PURE FRAUD
In changing its business model, the criminalization of spam also changed its arsenal of tools and words. It no longer needed even the pretense of selling products in a way that would appear legally reputable. Strategies such as phishing and identity theft, advance-fee fraud, and virus and malware distribution meant that the profit margin was pushed back up as the cost of distribution dropped, and spam started to sound, linguistically, like many things—some of them never heard before. Free to abandon any trappings of genre, it could seek any textual shape that got past the filters, using Shakespeare the way a bacterium consumes and employs foreign DNA, making spam into a different and a weirder beast than the one Bayesian filters had been designed to stop.

These problems are related. When Graham described the “15 idiots” as “stupid” or “perverted,” he was writing, with hacker hauteur, about people responding to messages that seemed to require enormous gullibility or a great fondness for porn. But the move into full-on criminality increased the pool of potential suckers. Many people who would never respond to an ad for a multilevel marketing scheme would respond to a notice purporting to be from their bank or PayPal account. Spam could now more aggressively target the old, the confused, people using the Internet in a second language, and new users in general, putting the sting on the naïfs that naïve Bayes was supposed to protect. You no longer needed to be an idiot to be one of the fifteen idiots, and this meant that each new sucker could be worth a lot more than the old. This money in turn attracted more sophisticated and skilled talent to spam, both on the business and the programming sides—the kind of people who could build more complex litspam engines and spam distribution programs. Graham, and those who preceded and followed him in the search for a probabilistic filter, were building a brilliant hack to solve a complex and embedded problem whose elements were simultaneously technical and social at every step. The social element transformed in response to their technical intervention and altered the technical in turn.

This is only the merest introduction to the army of chattering, crude language machines (“It was so easy to imitate these people,” as Kafka’s ape says: “I could already spit on the first day”) produced by spammers. To take the measure and extent of their population, we have to turn to the world of spam blogs geared to influence search engines—built to beat a whole different order of filters, the avant-garde in the art of misdirection.

“NEW TWIST IN AFFECT”: SPLOGGING

THE POPULAR VOTE

Screams of frightened women, choked Sobs, truly communicative Tears, little brusque Laughs . . . Howls, Chokings, Encore!, Recalls, silent Tears, Threats, Recalls with additional Howls, Pounding of approbation, uttered Opinions, Wreaths, Principles, Convictions, moral Tendencies, epileptic Attacks, Childbirth, Insults, Suicides, Noises of discussions (Art-for-art’s-sake, Form and Idea), etc.

—Villiers de l’Isle-Adam, explaining some of the settings of his automatic theatrical public, “La machine à gloire,” Contes cruels, 1874

Terra’s blog is titled “Tyler tyler honored with rd Jonas E,” and subtitled “Nine State Regulators Investigating Auction Bonds, Group Says.The City of Tyler Traffic Engineering Department installed the City.” One of her posts on July 16, 2008, titled “Tyler State Trial Law Litigation Lawyer Attorney Robert M.,” opens:

Our web servers cannot find the page or file you asked for. Best choice of the month: _blood pressure.

Button to return to the previous page. Astronomers on verge of finding Earth’s twin. The cost of having a product registered is now estimated to be around million. Century, the public may demand that the federal _Bar board massachusetts overseer_ register products that are effective against bed bugs.

My mother lives in _Affect metoprolol_ side housing and has been dealing with this for about a year now.

This continues for another 1,300 words, and Terra posted three times on July 16 alone. In June, she posted 160 entries, about five a day, each from several hundred to a few thousand words. This is not her only blog, either; according to her Blogger profile, she has eleven others, with titles like “S first try boosted the team as Biko” and “Only two USB fujitsu three is always.” The bizarre stop-start rhythm of her posts makes them difficult to stop quoting. Their language has no heritage in oral speech and lacks the syntactic edges that imply beginning and ending. As with litspam messages, the jolting movement from paragraph to paragraph feels much closer to channel-surfing cable television than to any literary medium: “Oprah ends three weeks of vegan eating. Astronomers on verge of finding Earth’s twin. Seeing more people living out of their cars.” Then comes a sudden transition into the diaristic, with “I” sentences, opinions, and the rhythmic clauses characteristic of online thinking-aloud: “I don’t think it’s a numbers game, but I think whatever view you end up with, it does not have to be a majority point of view, that reasons have weight, not just adding up whoever agrees with you.” Her posts are full of links, most of which go to similar blogs: vollybllgrl’s blog “It was a powerline that brought down a Black Hawk black last night in northeast Alabama,” for example, or a post on the blog “Default title” by manning6029 with the oddly Ballardian phrase “Picture of blonde girl in Morocco is new twist in Affect.”

Terra is, of course, a robot, as are vollybllgrl, manning6029, Geriut of “The most dazed part of our democracy,” etylycigob of “The Triad Lady Knights cross country team had a kylee season,” and countless more. They are producing splogs, or spam blogs—one of the forms search engine spam took in response to the PageRank strategy of Google and its third-generation search imitators. Splogs now account for more than half of the total number of all blogs. (In comparison, second-generation nonblog spam pages, stuffed with keywords and links, form roughly 8 percent of the total of all web pages being indexed.) The patterns of data from splogs and spings (spam pings—a link signal sent by a blog, presented on the linked blog like a comment, and theoretically driving traffic and PageRank) map with striking accuracy onto the patterns of email spam, with similar spikes—around the holidays, for example—and mysterious troughs, during which some waning of the moon causes output to die down for a few days or weeks. How do they work?

As the PageRank system became more widely known and understood, Google gathered market share and other search engines began to follow its more refined ranking model. (Google’s ranking system is considerably more elaborate than the bare bones of PageRank, of course, and continues to grow in parameters and complexity to this day, but the basic outline is what search engine spammers were responding to—and that suffices to understand their methods.) A variety of strategies developed as websites with a high PageRank were transformed into kingmakers. A link from them could move a page into the top ten or top three returns of the different search sites, boosting attention and revenue. The theoretical notion of a “reputation economy” was becoming thoroughly applied here. Link trading began as a second-generation approach, along with requests for a positive mention and a clickable link, and third-generation search amplified these methods. Sites issued “Best of the Web” awards, “Top 100 Sites” awards, and so forth; awards included a badge, a little image, and a snippet of code to be copied into the winning site—a snippet that included a link to the award-giving site. The human user saw a little badge image, but the search engine spider saw an outgoing link, that is, an endorsement. New habits of use and etiquette appeared among ordinary users of the web: a comment in a blog post included the commenters’ websites along with their names, to rack up another link. Posting something without including a “via” link to the person you got it from—the “via” being an additional outbound link as a kind of thanks for using their discovery—became increasingly rude, the sign of an uncouth person.

These techniques only brushed the surface of the transformation in spam practices. PageRank tried to solve the problem of relevance and the problem of spam in one stroke by incorporating the expression of social relationships, communities, and human choices. In theory, social structures are much harder to game for spam purposes, but their robot-readable expression online is not. The second-generation-based award badges were one strategy among many. Domain flooding, for example, was the creation of tens or hundreds of sites to redirect to the target site. Link farms or “mutual admiration societies” arose: these were huge link-heavy blocs of sites, each page linking to many of the others, with their accumulated “votes” rented out. They charged for outbound links from the farm, like penurious aristocrats charging to have their renowned ancient name and reputation associated with some unknown member of the nouveau riche. In the third generation, spam began to move into the creation of its own social graph—producing the appearance, if not the reality, of its own society.
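The effect of a link farm on a link-based ranking can be illustrated with a toy calculation. What follows is a minimal sketch of a simplified, textbook-style PageRank computed by repeated iteration; it is not Google's production system, and the page names, the damping value of 0.85, and the ten-page farm are all assumptions made up for the illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration.

    `links` maps each page name to the list of pages it links to.
    The damping factor and iteration count are illustrative defaults.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its current rank among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# A tiny hypothetical web: three sites that link in a ring, plus a
# "target" page that nobody links to.
honest_web = {"a": ["b"], "b": ["c"], "c": ["a"], "target": ["a"]}

# A hypothetical ten-page link farm: each farm page links to the target
# and to one other farm page.
farm = {f"farm{i}": ["target", f"farm{(i + 1) % 10}"] for i in range(10)}

baseline = pagerank(honest_web)
boosted = pagerank({**honest_web, **farm})

print("target without farm:", round(baseline["target"], 4))
print("target with farm:   ", round(boosted["target"], 4))
```

In this toy graph the target's score rises once the farm pages point at it, even though no page outside the farm has endorsed it. That is the rented "vote" described above, reduced to arithmetic.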

Generating the expression of a nonexistent social phenomenon required the creation of much more content than previous search-spamming projects while avoiding certain telltale signs of robots at work. Old-fashioned attempts to artificially alter the link graph had signature patterns. The bulky shape of heavy cross-linking within a group of sites, all with only a few inbound links (because spam pages are lonesome), created little islands of intense self-endorsement with no outside involvement. To the right analytic tools, it’s a pattern as obvious as the newspaper ads taken out by vanity publishing houses for their new releases with the blurbs solely from friends and other writers in the same situation. Search engines could correct for these islands with modifications to the algorithm. Also, although complete web pages could be almost entirely automatically generated, they still required the purchase and maintenance of a stable of domain names and hosting plans with a service provider, which could get expensive. What was needed for third-generation search spam was a way to very rapidly generate new content, seeded with links, in a wide variety of different locations online as in a genuine community.
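One way to picture the "island" signature is as a simple ratio: of all the links pointing at a group of sites, how many come from inside the group itself? The sketch below is an assumption-laden illustration, not a description of any search engine's actual method; the function name `self_endorsement` and the sample graph are invented for the example.

```python
def self_endorsement(links, group):
    """Fraction of links into `group` that originate inside `group`.

    `links` maps each site to the list of sites it links to.
    A value near 1.0 means the group is endorsed almost only by itself.
    """
    group = set(group)
    internal = 0  # links from one group member to another
    inbound = 0   # links from outside the group into it
    for source, targets in links.items():
        for target in targets:
            if target in group:
                if source in group:
                    internal += 1
                else:
                    inbound += 1
    total = internal + inbound
    return internal / total if total else 0.0


# Hypothetical link graph: three farm sites that cross-link heavily,
# and a handful of ordinary sites that link to one another.
links = {
    "farm1": ["farm2", "farm3"],
    "farm2": ["farm1", "farm3"],
    "farm3": ["farm1", "farm2", "client"],
    "blog": ["news", "wiki"],
    "news": ["wiki", "blog"],
    "wiki": ["news"],
    "forum": ["blog", "news"],
}

farm_score = self_endorsement(links, {"farm1", "farm2", "farm3"})
organic_score = self_endorsement(links, {"blog", "news"})

print("farm group:   ", farm_score)
print("organic group:", organic_score)
```

The farm group scores as purely self-endorsed, while the ordinary pair draws most of its links from outside. A real analysis would have to find candidate groups first, which is the harder problem and is not attempted here.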

In 1999, a company called Pyra Labs launched a service called Blogger. The concept of a weblog—a chronological series of entries from newest to oldest—is pleasantly intuitive and diarylike; the concept of Blogger, as of so many related systems from Flickr to Wikipedia, was to provide people with an equally intuitive means for publishing their content. It was remotely hosted, so you did not have to own a website domain name or pay for hosting; many of its processes were automated, so you did not have to design it or do any coding behind the scenes; and it had a useful and increasingly sophisticated Application Programming Interface (API). An API is a set of requests that a web service can support from other programs—tools that programs can use to engage with the service. An API makes it easier to automate the publishing process, and on a platform like Blogger (which was acquired by Google in 2003), this automated publishing requires very little effort to manage a lot of content. You can delegate the account creation process, the choice of settings, the ratio of outbound links to content, and the posting frequency to a program. The missing piece here is the words for the blog, but words come ready-made in the form of RSS feeds.

RSS (the acronym originally stood for RDF Site Summary but has changed to mean the more explanatory Really Simple Syndication) is a format closely associated with the development of blogs; it makes new posts or other changes on a site available in forms that are easy to use. Feed readers can gather the latest entries from RSS-enabled sites, material can be forwarded to mobile devices, and a page can feature the headlines or recent posts from other sites. From the perspective of a spam blogger, this feature is like a faucet for words. Samuel Beckett once said of the Cut-Up technique of William Burroughs and Brion Gysin “That’s not writing, it’s plumbing”—a prescient remark now that we have a means of writing that really is like plumbing: lay the pipes, the tank, the cut-off valves, and then open the taps and leave the room. A splog production system will pull in RSS feeds from other blogs and news sources, chop them up and remix them according to rules, insert relevant links, and post the resulting material, hour after hour and day after day, with minimal human supervision. Terra put a new post up already, while this section was being written, titled “After it became an tyler city”: “A witness reports a nun went crazy upon realizing that the man next to her in line was the Epinephrine frontman. Ghost Town Poster Disappoints Me, Gervais. Some things, like gravity, must also be close to.” And so on, and on, and on.
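To see why a feed works like a faucet for words, it helps to look at how little effort it takes for a program to read one. The following is a minimal sketch using Python's standard XML parser on a small, made-up RSS 2.0 document; the feed contents, the `example.com` addresses, and the function name `read_items` are assumptions for illustration, and nothing is fetched from the network.

```python
import xml.etree.ElementTree as ET

# A made-up RSS 2.0 feed, embedded as a string so the example is
# self-contained.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <link>https://example.com/</link>
    <description>A made-up feed for illustration</description>
    <item>
      <title>First post</title>
      <link>https://example.com/first</link>
      <description>A short summary of the first post.</description>
    </item>
    <item>
      <title>Second post</title>
      <link>https://example.com/second</link>
      <description>A short summary of the second post.</description>
    </item>
  </channel>
</rss>"""


def read_items(feed_xml):
    """Return the title, link, and description of each item in a feed."""
    root = ET.fromstring(feed_xml)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "description": item.findtext("description", default=""),
        })
    return items


for entry in read_items(SAMPLE_FEED):
    print(entry["title"], "|", entry["description"])
```

Each post arrives already separated into labeled fields, with no page layout to strip away. The same convenience that serves feed readers and mobile devices serves any other program that wants a steady supply of fresh text.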

All splogs do not sound alike: some are built on the “excerpt” model, with fragments of about 350 characters taken whole from other blogs. These fragments are chosen from posts that are polling particularly well in Google, with good keyword metrics. Their goal is to make money through contextual advertising, in which page views and the occasional clickthrough are the best that can be hoped for. These create parasitic relationships with authors. One of the Internet’s many, many interchangeable product-review bloggers notes that being excerpted by splogs is a sign that you are choosing the right topics and words, because the splogs are copying you; if you want to attract more of them, because they offer backlinks to your site with their excerpts, “create posts with a popular keyword, like iPhone for example.” The behavior of excerpt splogs is straightforward: they are drawn to the right language like ants to honey.

Splogs built on a full content model, as Terra’s is, play a larger and more subtle game, cross-linking in their hundreds and thousands to distort the shape of the web. Each splog is assigned a set of keywords and feeds from which to pull related text. This is why Terra’s blog sounds like the product of someone with a feverish, pathological obsession with Tylers. It pulls a set of RSS feeds and headlines based around “Tyler” as the keyword, with a few others for variation; thus post after post reports from a strange universe where several cities and schools named Tyler, the director Tyler Perry, the economist Tyler Cowen (who blogs), and posts and news articles that mention Tylers are all of equal significance. With experience, one begins to see the patterns. “The TV adaptation of the big screen movie features Nicole Ari Parker, Vanessa Williams and Malinda Williams” refers to one of Perry’s projects; “The sociologist Max Weber introduced a distinction between consumer” is a broken fragment from Cowen. Interspersed with this Tyler compulsion is the jarring appearance of the functional language of web design, as in “Button to return to the previous page,” used within paragraphs of first-person sentences.

How far removed language is at this point from anything meant for humans! Terra’s blog links to other splogs, which link to still more, forming an insular community on a huge range of sites—a kind of PageRank greenhouse that is not in itself meant to be read by people. A human seeing a splog post immediately knows that something is wrong and can flag the splog to be taken down by the network administrator. Splogs of Terra’s type are not meant to interact with humans at all; they are created solely for the benefit of search engine spiders. They do not imitate individual humans—they are not the computational equivalent of “George Kaplan,” the nonexistent secret agent whose train tickets and hotel rooms in North by Northwest are meant to convey a particular human life. They only work from a distance, appearing to be groups of people, with the language and links functioning in aggregate. If splogs are like any previous technological artifact, they are akin to the “QL” sites constructed during the Second World War to mislead night bombing runs: rickety structures of pipe, wooden frames, wire netting, and lights that if seen from far enough away look like a town, with railway signals, lamps, and open doors. Taken in statistical total and algorithmic analysis, splogs resemble the patterns of a thriving community. Their posts are pitched at precisely the level of complexity the spider requires to accept their input as human, and they adapt human text for other machines to read and act on. Influence on humans is a second-order result.
