New Tool Tracks Culture through the Centuries via Google Books

The field of "culturomics" promises humanities researchers a robust quantitative tool to analyze cultural trends back to the 1500s

HIGH-THROUGHPUT CULTURE: The new Books Ngram Viewer purports to provide quantitative analyses of cultural history via rapid analyses of more than 500 billion words published in 5.2 million books. Here are some of the words, sized in proportion to their frequency in the corpus. Image: AAAS/SCIENCE/WORDLE.ORG

Can culture be decoded like a genome? Researchers at Harvard University have teamed up with Google to crack the spines of 5,195,769 digitized books that span five centuries of the printed word, in the hope of giving the humanities a more quantitative research tool.

The Google Books Ngram Viewer, launched online December 16 and described in a paper in Science, lets Web users explore their own areas of interest by searching for n-grams (sequences of n consecutive words, a basic unit for modeling natural language).

•    How ingrained has Einstein really been in the cultural consciousness?

•    Has interest in evolution been on a steady increase over the past 150 years?

•    Have superheroes always been out to "save the world"?

Questions such as these have launched reams of undergraduate and graduate papers, which have traditionally required long hours searching the stacks—or JSTOR—for mentions to tally by hand and loads of close reading.
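For readers unfamiliar with the term, the kind of tally once done by hand can be sketched in a few lines of Python. This is only an illustration of what counting n-grams means, not the Harvard and Google team's actual pipeline; the function name `ngrams`, the simple tokenizing rule and the sample sentence are all assumptions made for the example.

```python
import re
from collections import Counter


def ngrams(text, n):
    """Return every run of n consecutive words in text, lowercased.

    Tokenization here is a deliberately crude assumption: a "word" is
    any run of letters or apostrophes.
    """
    words = re.findall(r"[a-z']+", text.lower())
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


# Hypothetical sample text, invented for illustration only.
sample = "save the world and then save the world again"

# Tally how often each 3-gram occurs in the sample.
counts = Counter(ngrams(sample, 3))
print(counts["save the world"])
```

A single word is a 1-gram and a three-word phrase such as "save the world" is a 3-gram; a real corpus would apply the same counting to each year's books separately.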

But a growing movement is afoot to bring more quantitative analysis to the humanities, such as the use of cognitive science and MRIs for English department research at Yale University, as The New York Times reported in April. Social scientists and humanities scholars have dipped their toes in the quantitative research waters via Perseus and WordHoard. And, as in the physical sciences, more (and better) data can lead to more robust results. "We can think fruitfully about culture by collecting large sets of information," says Erez Lieberman Aiden, an investigator at the Laboratory-at-Large in Harvard's School of Engineering and Applied Sciences and in the university's Society of Fellows. "Having collected the data set, we can apply very analytic and high-throughput tools to understand [it]."

The Harvard team is calling their analysis "culturomics" based on the notion that culture "is something you can study like evolution in biology," says Jean-Baptiste Michel, a postdoctoral researcher in Harvard's psychology department and in the Program for Evolutionary Dynamics, who helped lead the charge with Aiden. As a gene or phenotype changes over time, so, too, the researchers propose, do cultural sensibilities.

The tool will be "like biology in the sense that you can formulate questions that are quantitative, and you can obtain quantitative answers to them," Aiden says. But like a genome-wide association study (GWAS), the findings are often just the starting point.

What's in a word?
Many humanities scholars are meeting this and other quantitative approaches with a mix of excitement and trepidation. "Word frequency is a tool with enormous potential," says Nicholas Dames, associate chair of the Department of English and Comparative Literature at Columbia University. But he has reservations about the use of frequencies alone to address "more nuanced questions, particularly about semantics."

Dames explains that words such as "nature," "professional" and "gentleman" have come to connote different things depending on time and place, "and the story of those semantic shifts is more crucial to cultural history than a [quantitative] index of their use-frequency. We might use 'nature' as often as we did in the 18th century, but haven't we accumulated entirely new meanings for this term that tie into all kinds of scientific and cultural changes?"

The researchers behind the Books Ngram Viewer acknowledge that it is unlikely to replace tried-and-true techniques of close reading, much as GWAS have not eliminated the need for basic science research and controlled clinical trials.

Despite the program's capacity to churn out neatly organized analytics at the click of a button (labeled, cheekily, "search lots of books"), Aiden maintains that "we certainly don't view this tool as an answer machine." But certainly the program can work as a question generator.

The frequency of the word "evolution," for instance, reveals some unexpected nuances. It was on a general upswing until the mid-1920s, then declined gradually until around 1945 (from about 0.0035 percent of words in the measured data that year to about 0.0025 percent). Why the dip, and is it significant? The researchers were unsure and offer this as an example of a lead for further research, Michel notes.
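The percentages quoted for "evolution" are relative frequencies: occurrences of the word divided by all words counted for that year. A minimal sketch of the arithmetic follows. The raw counts are toy numbers chosen only to reproduce the two percentages in the text, not real corpus figures, and the function names are assumptions made for the example.

```python
def percent_of_corpus(word_count, total_words):
    """Share of a year's words taken up by one word, as a percentage."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return 100.0 * word_count / total_words


def relative_change(old, new):
    """Fractional change from old to new (negative means a decline)."""
    return (new - old) / old


# Toy counts (hypothetical): 35 and 25 occurrences per million words
# correspond to the 0.0035 and 0.0025 percent cited in the article.
peak = percent_of_corpus(35, 1_000_000)
trough = percent_of_corpus(25, 1_000_000)

drop = relative_change(peak, trough)
print(f"{peak:.4f}% -> {trough:.4f}% (change: {drop:.0%})")
```

Seen this way, the move from 0.0035 to 0.0025 percent is a decline of a little under 30 percent in the word's share of the text. Whether that is significant is the open question the researchers raise.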



13 Comments

  1. jtdwyer 07:28 AM 12/17/10

    IMO, quantification of words is not meaningful information. Does a word count indicate why the word is being used? Does use of the word evolution connote a positive or negative view of the subject? How many books were published during the Depression years? That and the readership of books that reference a subject in time are likely significant in assessing their cultural influence. What percentage of the population reads books over the years, especially since television became widely available? How many since computer games and the internet?

    IMO, wordcounts in books, newspapers or any other written medium are of extremely limited informational value.

  2. Geoff 07:38 AM 12/17/10

    @jtdwyer: You're correct as far as you go, but any research system has to begin somewhere; the expectation is that sophistication will be an evolutionary phenomenon. Potentially, this could do for research what computerization did for library card stacks; but if the first step is never taken, how can the potential evolve?

  3. Susanna 02:14 PM 12/17/10

    @jtdwyer: I think you make valid propositions for further classification and study.

    When we have a diachronic collection as extensive as this, we can then start working on polarity, demographics and other dimensions that are relevant for meaning-making. But I agree with @Geoff, you do have to start somewhere, and getting an extensive, if not entirely representative, sample of texts is always a good place to start.

  4. jtdwyer in reply to Susanna 02:56 PM 12/17/10

    IMO, until a more direct correspondence between document content, word contextual meaning and cultural influence can be established, the quantified results obtained would be of little research value.

    However, such a tool could greatly enhance the ability to produce vast quantities of meaningless research!

  5. Susanna 04:04 PM 12/17/10

    @jtdwyer: All tools and methodologies can be used for meaningless, or more often self-serving, purposes. People do love to trust their numbers, I fully admit to that.

    But luckily the entire dataset is available for download so many of us can work on it, and as the article states, issues such as genre are currently worked upon. Meanwhile, I found a graph that took a week for me to gather in 2007 in mere seconds. I think that's an improvement...

    I feel quantitative does not exclude qualitative, but in good hands complements it. Personally, I love mixing both.

    Meanwhile, here's a funny piece of possible meaninglessness I found -- it seems we have been omnomming for a while longer than at least I thought possible... or perhaps not:

    http://ngrams.googlelabs.com/graph?content=omnom&year_start=1500&year_end=2008&corpus=0&smoothing=3

  6. jtdwyer 06:55 PM 12/17/10

    I admit I'm an ignorant, compulsive problem solver, but puzzles just infuriate me! I can only guess that this result indicates that the popular use of Latin has been in decline for a while...
    Please, don't do this anymore!

  7. jtdwyer in reply to Susanna 08:02 PM 12/17/10

    By the way, do you realize that the incidence of 'omnom' in publications inversely corresponds to atmospheric CO2 levels? You may have discovered the true cause of global warming!

  8. zstansfi 04:00 PM 12/18/10

    I really like the idea, but I don't think this tool will provide all that much insight into our past. It's very difficult to generalize "the number of times that Einstein was mentioned in 10% of the books published between 1900 and 2008" to "a meaningful understanding of the relative importance of Einstein's work to theoretical physics in the past century". Even so, I imagine that this program might provide a good stepping stone to a more in-depth analysis of socio-historical trends.

  9. iqsoft 04:15 PM 12/19/10

    Reminds me of Isaac Asimov's "PsychoHistory" in his Foundation series books

  10. iqsoft 04:19 PM 12/19/10

    Psychohistory sounds a better word than culturomics

  11. bucketofsquid in reply to Garudadhaja 11:20 AM 12/28/10

    ...and this has anything at all to do with this article in what way?

  12. bucketofsquid in reply to Susanna 11:24 AM 12/28/10

    I am rather distressed to realize that the Cookie Monster is just a foul plagiarist. There are numerous articles giving him credit for inventing the phrase "om nom nom" when clearly he did not. We must spread the word of this nefarious theft of idea. Does anyone know which book held the original "om nom"?

  13. cjrisi88 in reply to jtdwyer 12:55 PM 1/2/11

    I'm sorry but you are incorrect. You pose a bunch of questions and assume there are no answers. Techniques very similar to these are used in grading the SAT scores. All of the English essays on the SATs are marked by computer. So there is meaningful information in word counting like this.
