Can culture be decoded like a genome? A team from Harvard University has partnered with Google to crack the spines of 5,195,769 digitized books that span five centuries of the printed word, in the hope of giving the humanities a more quantitative research tool.

The Google Books Ngram Viewer, launched online December 16 and described in a paper in Science, lets Web users explore their own areas of interest by querying n-grams (sequences of n consecutive words, a common unit for modeling natural language).
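As a rough illustration of what an n-gram is, the sketch below splits a sentence into words and lists every run of n consecutive words. The function name `ngrams` and the whitespace tokenization are assumptions made for illustration; the article does not describe how the Ngram Viewer itself tokenizes text.

```python
from collections import Counter


def ngrams(text, n):
    """Return every run of n consecutive words in text as a tuple."""
    # Naive tokenization (an assumption): lowercase, then split on whitespace.
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


sentence = "save the world and save the country"
bigram_counts = Counter(ngrams(sentence, 2))

# The 2-gram ("save", "the") occurs twice in the sample sentence.
print(bigram_counts[("save", "the")])
```

Counting how often each such tuple appears in the books published in a given year yields the raw numbers behind a frequency chart.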

•    How engrained has Einstein really been in the cultural consciousness?

•    Has interest in evolution been on a steady increase over the past 150 years?

•    Have superheroes always been out to "save the world"?

Questions such as these have launched reams of undergraduate and graduate papers, which have traditionally required long hours searching the stacks—or JSTOR—for mentions to tally by hand and loads of close reading.

But there has been a growing movement afoot to bring more quantitative analysis to the humanities, such as using cognitive science and MRIs for English department research at Yale University, as The New York Times reported in April. Social scientists and humanities scholars have dipped their toes in the quantitative research waters via Perseus and WordHoard. And as in the physical sciences, more (and better) data can lead to more robust results. "We can think fruitfully about culture by collecting large sets of information," says Erez Lieberman Aiden, an investigator at Harvard's School of Engineering and Applied Science's Laboratory-at-Large and in the school's Society of Fellows. "Having collected the data set, we can apply very analytic and high-throughput tools to understand [it]."

The Harvard team is calling their analysis "culturomics" based on the notion that culture "is something you can study like evolution in biology," says Jean-Baptiste Michel, a postdoctoral researcher in Harvard's psychology department and in the Program for Evolutionary Dynamics, who helped lead the charge with Aiden. As a gene or phenotype changes over time, so, too, the researchers propose, do cultural sensibilities.

The tool will be "like biology in the sense that you can formulate questions that are quantitative, and you can obtain quantitative answers to them," Aiden says. But like a genome-wide association study (GWAS), the findings are often just the starting point.

What's in a word?
Many humanities scholars are meeting this and other quantitative-based approaches with a mix of excitement and trepidation. "Word frequency is a tool with enormous potential," says Nicholas Dames, associate chair of the Department of English and Comparative Literature at Columbia University. But he has reservations about the use of frequencies alone to address "more nuanced questions, particularly about semantics." 

Dames explains that words such as "nature," "professional" and "gentleman" have come to connote different things depending on time and place—"and the story of those semantic shifts is more crucial to cultural history than a qualitative index of their use-frequency. We might use 'nature' as often as we did in the 18th century, but haven't we accumulated entirely new meanings for this term that tie into all kinds of scientific and cultural changes?"

The researchers behind the Books Ngram Viewer admit it will not likely replace tried-and-true techniques of close reading—much as GWAS have not eliminated the need for basic science research and controlled clinical trials.

Despite the program's capacity to churn out neatly organized analytics at the click of a button (labeled, cheekily, "search lots of books"), Aiden maintains that "we certainly don't view this tool as an answer machine." But certainly the program can work as a question generator.

The evolution of the frequency of "evolution," for instance, reveals some unexpected nuances. It was on a general upswing until the mid-1920s, then declined gradually until around 1945 (from about 0.0035 percent of words in the measured data to about 0.0025 percent). Why the dip, and is it significant? The researchers were unsure and offer this as an example of a lead-in for further research, Michel notes.
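The percentages quoted here are relative frequencies: the number of times a word appears in a given year divided by the total number of words counted for that year. A minimal sketch of that arithmetic follows. The function names and the raw counts are hypothetical, chosen only so the results match the approximate figures in the text.

```python
def percent_of_words(match_count, total_words):
    """Share of all words in a year that are the queried word, in percent."""
    return 100.0 * match_count / total_words


def relative_change(old, new):
    """Fractional change from old to new (negative means a decline)."""
    return (new - old) / old


# Hypothetical counts: 35 occurrences per million words is 0.0035 percent.
peak = percent_of_words(35, 1_000_000)
trough = percent_of_words(25, 1_000_000)

print(f"{peak:.4f} percent -> {trough:.4f} percent")
print(f"relative change: {relative_change(peak, trough):.1%}")
```

Seen this way, the move from about 0.0035 to about 0.0025 percent is a fall of a little under 30 percent in relative terms, which is one reason the dip invites a closer look.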

The Books Ngram Viewer also can shed some light on the popularity of various people, revealing, for instance, a marked dearth of references to Jewish artist Marc Chagall in books published in Nazi Germany, suggesting widespread censorship, the researchers concluded in their paper. (For those more keen on following scientists, the frequency of "Albert Einstein" mentions surpasses that of "Charles Darwin" in the late 1960s, but both enjoy a rise in popularity from about 1975 to 2005, according to a recent search; the researchers also found that Freud ranks higher over time than Einstein or Darwin.)

The analysis tool might also provide "an interesting example of how we think in a place we don't expect," Michel says. For instance, he and his team found that the superhero's ultimate challenge, long a trope in literature, has not always been to save the world. Rather, after searching the database, they discovered that, until the two world wars, "by and large, it used to be saving the country." But during the 20th century, a more global sensitivity led to "the globalization of heroes," too, he notes.

Dames, who was not involved in the new paper, is not entirely convinced that "the method will truly catch on with humanists otherwise not inclined toward quantitative methods until truly surprising—or controversial—results are generated." So far, he has found the reported frequencies rather predictable, although that could be evidence that the method is working, he notes.

Dusting off the data
The clean lines and neat charts might raise some flags when considering the decidedly messy (and potentially musty) sources of that data. "Our method is certainly not perfect," Michel says. For example, new editions of old works or works in translation are logged under the year and language in which they were published.

Although the bulk of the books included were written in English (about 72 percent), users can also search works written in French, Spanish, German, Chinese, Russian and Hebrew. The data also becomes more robust over time, with only a few books coming from the early 1500s and billions of printed words catalogued each year by the 20th century.

Google has been partnering with university libraries, publishing houses and other organizations to acquire digital scans of as many books as possible. Michel and his colleagues selected about a third of the books that have been digitized so far (about five million of some 15 million), which represents approximately 4 percent of books ever published. "Our number-one criteria is getting the books that have high-quality metadata," Michel says. When a book's publication date is incorrectly noted in the metadata, it distorts the dataset, so volumes with erroneous information attached were excluded.
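The proportions in this paragraph can be checked with a few lines of arithmetic. The sketch below uses only figures given in the article; the variable names are illustrative, and the implied total of books ever published is an inference from rounded numbers, not a figure stated by the researchers.

```python
books_in_corpus = 5_195_769    # books analyzed, as reported in the article
books_digitized = 15_000_000   # "some 15 million" digitized so far (rounded)
share_of_all_books = 0.04      # "approximately 4 percent" of books ever published

# Fraction of the digitized books that made it into the corpus.
share_of_digitized = books_in_corpus / books_digitized

# Total number of books ever published implied by the 4 percent figure.
implied_books_ever_published = books_in_corpus / share_of_all_books

print(f"{share_of_digitized:.0%} of digitized books")
print(f"about {implied_books_ever_published / 1e6:.0f} million books ever published")
```

Under these rounded inputs, the corpus is roughly a third of the digitized books, and the 4 percent figure implies a total on the order of 130 million books.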

Even with a smaller sample of digital works, the current dataset and analysis tools for the Books Ngram Viewer took about four years to put together. Michel says it was "our personal folly when we started it—we didn't realize how long it would take." And the goal is to expand the searchable corpus—not just to more tomes, but also to magazines, newspapers, blogs and even non-text-based products, such as artwork.

Beyond types of sources, broadening the scope of the search unit would increase the value of these sorts of quantitative methods, Dames notes, adding that it will be crucial to be able to investigate things such as shifts in genre and narrative form. "This seems like it would be the necessary next frontier for quantitative work in the humanities: studies of forms larger than the single word." 

In the meantime, the researchers have encouraged the public to search away on the site, or to download the hefty data set for their own analyses. "I think it could be a source of wonder for a while," Michel says. To wit: although it peaked way back in the mid-19th century, the frequency of "procrastination" has been climbing since 2000.