HIGH-THROUGHPUT CULTURE: The new Books Ngram Viewer purports to provide quantitative analyses of cultural history via rapid analyses of more than 500 billion words published in 5.2 million books. Shown here are some of those words, sized in proportion to their frequency in the corpus. Image: AAAS/SCIENCE/WORDLE.ORG
Can culture be decoded like a genome? A team from Harvard University has partnered with Google to crack the spines of 5,195,769 digitized books spanning five centuries of the printed word, in the hope of giving the humanities a more quantitative research tool.
The Google Books Ngram Viewer, launched online December 16 and described in a paper in Science, allows Web users to query their respective areas of interest based on n-grams (a method of modeling sequences in natural language).
• How engrained has Einstein really been in the cultural consciousness?
• Has interest in evolution been on a steady increase over the past 150 years?
• Have superheroes always been out to "save the world"?
Questions such as these have launched reams of undergraduate and graduate papers, which have traditionally required long hours searching the stacks—or JSTOR—for mentions to tally by hand and loads of close reading.
But there is a growing movement to bring more quantitative analysis to the humanities, such as the use of cognitive science and MRIs in English department research at Yale University, as The New York Times reported in April. Social scientists and humanities scholars have dipped their toes in the quantitative research waters via Perseus and WordHoard. And as in the physical sciences, more (and better) data can lead to more robust results. "We can think fruitfully about culture by collecting large sets of information," says Erez Lieberman Aiden, an investigator at Harvard's School of Engineering and Applied Sciences' Laboratory-at-Large and in the school's Society of Fellows. "Having collected the data set, we can apply very analytic and high-throughput tools to understand [it]."
The Harvard team is calling their analysis "culturomics" based on the notion that culture "is something you can study like evolution in biology," says Jean-Baptiste Michel, a postdoctoral researcher in Harvard's psychology department and in the Program for Evolutionary Dynamics, who helped lead the charge with Aiden. As a gene or phenotype changes over time, so, too, the researchers propose, do cultural sensibilities.
The tool will be "like biology in the sense that you can formulate questions that are quantitative, and you can obtain quantitative answers to them," Aiden says. But as with a genome-wide association study (GWAS), the findings are often just the starting point.
What's in a word?
Many humanities scholars are meeting this and other quantitative-based approaches with a mix of excitement and trepidation. "Word frequency is a tool with enormous potential," says Nicholas Dames, associate chair of the Department of English and Comparative Literature at Columbia University. But he has reservations about the use of frequencies alone to address "more nuanced questions, particularly about semantics."
Dames explains that words such as "nature," "professional" and "gentleman" have come to connote different things depending on time and place, "and the story of those semantic shifts is more crucial to cultural history than a quantitative index of their use-frequency. We might use 'nature' as often as we did in the 18th century, but haven't we accumulated entirely new meanings for this term that tie into all kinds of scientific and cultural changes?"
The researchers behind the Books Ngram Viewer admit it will not likely replace tried-and-true techniques of close reading—much as GWAS have not eliminated the need for basic science research and controlled clinical trials.
Despite the program's capacity to churn out neatly organized analytics at the click of a button (labeled, cheekily, "search lots of books"), Aiden maintains that "we certainly don't view this tool as an answer machine." But certainly the program can work as a question generator.
The changing frequency of the word "evolution," for instance, reveals some unexpected nuances. It was on a general upswing until the mid-1920s, then declined gradually until around 1945 (from about 0.0035 percent of words in the measured data that year to about 0.0025 percent). Why the dip, and is it significant? The researchers are unsure and offer this as an example of a lead-in for further research, Michel notes.
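For readers curious about the arithmetic behind such a frequency curve, the idea can be sketched in a few lines of Python. This is only an illustration, not the researchers' method or Google's code: the function names and the two-sentence "corpus" are invented for the example, and the tokenizer simply splits on spaces, ignoring punctuation and the other cleanup a real corpus would need.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def relative_frequency(text, phrase):
    """Share of a text's n-grams that match `phrase` (case-insensitive).

    Hypothetical helper: splits on whitespace only, so punctuation
    attached to a word would prevent a match.
    """
    target = tuple(phrase.lower().split())
    grams = ngrams(text.lower().split(), len(target))
    if not grams:
        return 0.0
    return Counter(grams)[target] / len(grams)


def relative_change(before, after):
    """Fractional change from `before` to `after` (negative means a drop)."""
    return (after - before) / before


# Made-up toy corpus keyed by year; the sentences are placeholders,
# not data from the study.
toy_corpus = {
    1925: "the theory of evolution shaped the debate on evolution",
    1945: "the war shaped the debate and evolution was one topic",
}

for year, text in sorted(toy_corpus.items()):
    share = relative_frequency(text, "evolution")
    print(year, f"{share:.3f}")

# The two percentages quoted in the article: 0.0035 percent down to
# 0.0025 percent works out to a relative decline of roughly 29 percent.
print(f"{relative_change(0.0035, 0.0025):.1%}")
```

Expressing each count as a share of all words counted for that year, rather than as a raw tally, is what lets years with very different numbers of books be compared on the same chart.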