
HIGH-THROUGHPUT CULTURE: The new Books Ngram Viewer purports to provide quantitative analyses of cultural history via rapid analyses of more than 500 billion words published in 5.2 million books. Here are some of the words, sized in proportion to their frequency in the corpus.
Image: AAAS/SCIENCE/WORDLE.ORG
Can culture be decoded like a genome? A team from Harvard University has teamed up with Google to crack the spines of 5,195,769 digitized books that span five centuries of the printed word with the hopes of giving the humanities a more quantitative research tool.
The Google Books Ngram Viewer, launched online December 16 and described in a paper in Science, allows Web users to query their respective areas of interest based on n-grams (sequences of n consecutive words, a common way of modeling natural language).
• How engrained has Einstein really been in the cultural consciousness?
• Has interest in evolution been on a steady increase over the past 150 years?
• Have superheroes always been out to "save the world"?
Questions such as these have launched reams of undergraduate and graduate papers, which have traditionally required long hours searching the stacks—or JSTOR—for mentions to tally by hand and loads of close reading.
But there has been a growing movement afoot to bring more quantitative analysis to the humanities, such as using cognitive science and MRIs for English department research at Yale University, as The New York Times reported in April. Social scientists and humanities scholars have dipped their toes in the quantitative research waters via Perseus and WordHoard. And as in the physical sciences, more, and better, data can lead to more robust results. "We can think fruitfully about culture by collecting large sets of information," says Erez Lieberman Aiden, an investigator at Harvard's School of Engineering and Applied Science's Laboratory-at-Large and in the school's Society of Fellows. "Having collected the data set, we can apply very analytic and high-throughput tools to understand [it]."
The Harvard team is calling their analysis "culturomics" based on the notion that culture "is something you can study like evolution in biology," says Jean-Baptiste Michel, a postdoctoral researcher in Harvard's psychology department and in the Program for Evolutionary Dynamics, who helped lead the charge with Aiden. As a gene or phenotype changes over time, so, too, the researchers propose, do cultural sensibilities.
The tool will be "like biology in the sense that you can formulate questions that are quantitative, and you can obtain quantitative answers to them," Aiden says. But like a genome-wide association study (GWAS), the findings are often just the starting point.
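To make the idea of an n-gram query concrete, here is a minimal sketch in Python of counting three-word sequences in a scrap of text. It is not the Ngram Viewer's own code: the function name `extract_ngrams`, the simplified tokenizer (lowercasing, letters and apostrophes only) and the sample sentence are all assumptions made for illustration.

```python
import re
from collections import Counter


def extract_ngrams(text, n):
    """Return every run of n consecutive words in text, in reading order."""
    # Simplified tokenizer (an assumption): lowercase, letters and apostrophes only.
    words = re.findall(r"[a-z']+", text.lower())
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


# Toy stand-in for a corpus; not taken from the digitized books.
sample = "Superheroes save the world. Can scientists save the world too?"

# Tally how often each 3-gram appears.
trigram_counts = Counter(extract_ngrams(sample, 3))
print(trigram_counts["save the world"])
```

Run on the toy sentence, the tally for "save the world" comes to 2. A real corpus would need a more careful tokenizer and a count per publication year, but the counting step is the same in spirit.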
What's in a word?
Many humanities scholars are meeting this and other quantitative-based approaches with a mix of excitement and trepidation. "Word frequency is a tool with enormous potential," says Nicholas Dames, associate chair of the Department of English and Comparative Literature at Columbia University. But he has reservations about the use of frequencies alone to address "more nuanced questions, particularly about semantics."
Dames explains that words such as "nature," "professional" and "gentleman" have come to connote different things depending on time and place—"and the story of those semantic shifts is more crucial to cultural history than a qualitative index of their use-frequency. We might use 'nature' as often as we did in the 18th century, but haven't we accumulated entirely new meanings for this term that tie into all kinds of scientific and cultural changes?"
The researchers behind the Books Ngram Viewer admit it will not likely replace tried-and-true techniques of close reading—much as GWAS have not eliminated the need for basic science research and controlled clinical trials.
Despite the program's capacity to churn out neatly organized analytics at the click of a button (labeled, cheekily, "search lots of books"), Aiden maintains that "we certainly don't view this tool as an answer machine." But certainly the program can work as a question generator.
The evolution of the frequency of "evolution," for instance, reveals some unexpected nuances. It was on a general upswing until the mid-1920s, then declined gradually until around 1945 (from about 0.0035 percent of words in the measured data that year to about 0.0025 percent). Why the dip, and is it significant? The researchers are unsure and offer this as an example of a lead-in for further research, Michel notes.
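For a sense of scale, the arithmetic behind that dip can be sketched in a few lines of Python. Only the two percentages come from the figures quoted above. The helper names and the 35-per-million conversion are illustrative assumptions, not values from the data set.

```python
def percent_of_words(occurrences, total_words):
    """Share of all words accounted for by one word, in percent."""
    return 100.0 * occurrences / total_words


def relative_change_percent(start, end):
    """Change from start to end, as a percentage of start."""
    return 100.0 * (end - start) / start


# The two shares quoted in the article, in percent of all words.
peak_share = 0.0035
trough_share = 0.0025

# Hypothetical unit check: 35 occurrences per million words is 0.0035 percent.
per_million_check = percent_of_words(35, 1_000_000)

absolute_drop = peak_share - trough_share
relative_drop = relative_change_percent(peak_share, trough_share)
print(f"{absolute_drop:.4f} percentage points, {relative_drop:.1f}% relative change")
```

The absolute drop is tiny, about 0.001 percentage points, but in relative terms the word's share fell by roughly 29 percent. Whether that counts as significant is exactly the kind of question the tool raises rather than answers.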




13 Comments
IMO, quantification of words is not meaningful information. Does a word count indicate why the word is being used? Does use of the word evolution connote a positive or negative view of the subject? How many books were published during the Depression years? That and the readership of books that reference a subject in time are likely significant in assessing their cultural influence. What percentage of the population reads books over the years, especially since television became widely available? How many since computer games and the internet?
IMO, word counts in books, newspapers or any other written medium are of extremely limited informational value.
@jtdwyer: You're correct as far as you go, but any research system has to begin somewhere; the expectation is that sophistication will be an evolutionary phenomenon. Potentially, this could do for research what computerization did for library card stacks; but if the first step is never taken, how can the potential evolve?
@jtdwyer: I think you make valid propositions for further classification and study.
When we have a diachronic collection as extensive as this, we can then start working on polarity, demographics and other dimensions that are relevant for meaning-making. But I agree with @Geoff, you do have to start somewhere, and getting an extensive, if not entirely representative, sample of texts is always a good place to start.
IMO, until a more direct correspondence between document content, word contextual meaning and cultural influence can be established, the quantified results obtained would be of little research value.
However, such a tool could greatly enhance the ability to produce vast quantities of meaningless research!
@jtdwyer: All tools and methodologies can be used for meaningless, or more often self-serving, purposes. People do love to trust their numbers, I fully admit to that.
But luckily the entire dataset is available for download so many of us can work on it, and as the article states, issues such as genre are currently worked upon. Meanwhile, I found a graph that took a week for me to gather in 2007 in mere seconds. I think that's an improvement...
I feel quantitative does not exclude qualitative, but in good hands complements it. Personally, I love mixing both.
Meanwhile, here's a funny piece of possible meaninglessness I found -- it seems we have been omnomming for a while longer than at least I thought possible... or perhaps not:
http://ngrams.googlelabs.com/graph?content=omnom&year_start=1500&year_end=2008&corpus=0&smoothing=3
I admit I'm an ignorant, compulsive problem solver, but puzzles just infuriate me! I can only guess that this result indicates that the popular use of Latin has been in decline for a while...
Please, don't do this anymore!
By the way, do you realize that the incidence of 'omnom' in publications inversely corresponds to atmospheric CO2 levels? You may have discovered the true cause of global warming!
I really like the idea, but I don't think this tool will provide all that much insight into our past. It's very difficult to generalize "the number of times that Einstein was mentioned in 10% of the books published between 1900 and 2008" to "a meaningful understanding of the relative importance of Einstein's work to theoretical physics in the past century". Even so, I imagine that this program might provide a good stepping stone to a more in-depth analysis of socio-historical trends.
Reminds me of Isaac Asimov's "PsychoHistory" in his Foundation series books
Psychohistory sounds a better word than culturomics
...and this has anything at all to do with this article in what way?
I am rather distressed to realize that the cookie monster is just a foul plagiarist. There are numerous articles giving him credit for inventing the phrase "om nom nom" when clearly he did not. We must spread the word of this nefarious theft of idea. Does anyone know which book held the original "om nom"?
I'm sorry but you are incorrect. You pose a bunch of questions and assume there are no answers. Techniques very similar to these are used in grading the SAT scores. All of the English essays on the SATs are marked by computer. So there is meaningful information in word counting like this.