The Books Ngram Viewer can also shed some light on the popularity of various people, revealing, for instance, a marked dearth of references to Jewish artist Marc Chagall in books published in Nazi Germany, which suggests widespread censorship, the researchers concluded in their paper. (For those more keen on following scientists, the frequency of "Albert Einstein" mentions surpasses that of "Charles Darwin" in the late 1960s, but both enjoy a rise in popularity from about 1975 to 2005, according to a recent search; the researchers also found that Freud ranks higher over time than Einstein or Darwin.)
The analysis tool might also provide "an interesting example of how we think in a place we don't expect," Michel says. For instance, he and his team found that the superhero's ultimate challenge, long a trope in literature, has not always been to save the world. Rather, after searching the database they discovered that, until the two world wars, "by and large, it used to be saving the country." But during the 20th century, a more global sensitivity led to "the globalization of heroes," too, he notes.
Dames, who was not involved in the new paper, is not entirely convinced that "the method will truly catch on with humanists otherwise not inclined toward quantitative methods until truly surprising—or controversial—results are generated." So far, he has found the reported frequencies rather predictable, although that could be evidence that the method is working, he notes.
Dusting off the data
The clean lines and neat charts might raise some flags given the decidedly messy (and potentially musty) sources of the data. "Our method is certainly not perfect," Michel says. For example, new editions of old works and works in translation are logged under the year and language in which they were published, not those of the original work.
Although the bulk of the books included were written in English (about 72 percent), users can also search works written in French, Spanish, German, Chinese, Russian and Hebrew. The data also becomes more robust over time, with only a few books coming from the early 1500s and billions of printed words catalogued each year by the 20th century.
Google has been partnering with university libraries, publishing houses and other organizations to acquire digital scans of as many books as possible. Michel and his colleagues selected about a third of the books that have been digitized so far (about five million of some 15 million), a sample that represents approximately 4 percent of all books ever published. "Our number-one criteria is getting the books that have high-quality metadata," Michel says. When a book's publication date is incorrectly noted in the metadata, it distorts the dataset, so volumes with erroneous information attached were excluded.
Even with a smaller sample of digital works, the current dataset and analysis tools for the Books Ngram Viewer took about four years to put together. Michel says it was "our personal folly when we started it—we didn't realize how long it would take." And the goal is to expand the searchable corpus—not just to more tomes, but also to magazines, newspapers, blogs and even non-text-based products, such as artwork.
Beyond types of sources, broadening the scope of the search unit would increase the value of these sorts of quantitative methods, Dames notes, adding that it will be crucial to be able to investigate things such as shifts in genre and narrative form. "This seems like it would be the necessary next frontier for quantitative work in the humanities: studies of forms larger than the single word."
In the meantime, the researchers have encouraged the public to search away on the site—or download the hefty data set for their own analyses. "I think it could be a source of wonder for awhile," Michel says. To wit: although it peaked way back in the mid-19th century, the frequency of "procrastination" has been climbing since 2000.
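For readers who do download the data, a natural first step is to turn raw yearly counts into relative frequencies, because the corpus holds far more text for recent years than for early ones. The following is a minimal Python sketch of that normalization, not the researchers' own code: the function name, the dictionary layout and the numbers are all illustrative assumptions, and the toy counts are invented rather than taken from the corpus.

```python
from typing import Dict


def relative_frequency(
    match_counts: Dict[int, int], total_words: Dict[int, int]
) -> Dict[int, float]:
    """Return, for each year, the share of all catalogued words that match a phrase.

    match_counts: year -> number of times the phrase appears (hypothetical layout).
    total_words:  year -> total number of words catalogued for that year.
    """
    freqs: Dict[int, float] = {}
    for year, total in sorted(total_words.items()):
        if total <= 0:
            continue  # skip years with no catalogued text to avoid dividing by zero
        freqs[year] = match_counts.get(year, 0) / total
    return freqs


# Toy numbers, invented purely for illustration (not real corpus counts).
toy_matches = {1900: 2, 1950: 30, 2000: 90}
toy_totals = {1900: 1_000_000, 1950: 10_000_000, 2000: 20_000_000}

freqs = relative_frequency(toy_matches, toy_totals)
for year, share in freqs.items():
    print(year, f"{share:.2e}")
```

In this toy example the raw count grows 45-fold between 1900 and 2000, while the phrase's share of all words grows only 2.25-fold, which is why a chart of shares can look very different from a chart of raw counts.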



13 Comments
IMO, quantification of words is not meaningful information. Does a word count indicate why the word is being used? Does use of the word evolution connote a positive or negative view of the subject? How many books were published during the depression years? That and the readership of books that reference a subject in time are likely significant in assessing their cultural influence. What percentage of the population reads books over the years, especially since television became widely available? How many since computer games and the internet?
IMO, word counts in books, newspapers or any other written medium are of extremely limited informational value.
@jtdwyer: You're correct as far as you go, but any research system has to begin somewhere; the expectation is that sophistication will be an evolutionary phenomenon. Potentially, this could do for research what computerization did for library card stacks; but if the first step is never taken, how can the potential evolve?
@jtdwyer: I think you make valid propositions for further classification and study.
When we have a diachronic collection as extensive as this, we can then start working on polarity, demographics and other dimensions that are relevant for meaning-making. But I agree with @Geoff, you do have to start somewhere, and getting an extensive, if not entirely representative, sample of texts is always a good place to start.
IMO, until a more direct correspondence between document content, word contextual meaning and cultural influence can be established, the quantified results obtained would be of little research value.
However, such a tool could greatly enhance the ability to produce vast quantities of meaningless research!
@jtdwyer: All tools and methodologies can be used for meaningless, or more often self-serving, purposes. People do love to trust their numbers, I fully admit to that.
But luckily the entire dataset is available for download so many of us can work on it, and as the article states, issues such as genre are currently being worked on. Meanwhile, in mere seconds I reproduced a graph that took me a week to put together in 2007. I think that's an improvement...
I feel quantitative does not exclude qualitative, but in good hands complements it. Personally, I love mixing both.
Meanwhile, here's a funny piece of possible meaninglessness I found -- it seems we have been omnomming for a while longer than at least I thought possible... or perhaps not:
http://ngrams.googlelabs.com/graph?content=omnom&year_start=1500&year_end=2008&corpus=0&smoothing=3
I admit I'm an ignorant, compulsive problem solver, but puzzles just infuriate me! I can only guess that this result indicates that the popular use of Latin has been in decline for a while...
Please, don't do this anymore!
By the way, do you realize that the incidence of 'omnom' in publications inversely corresponds to atmospheric CO2 levels? You may have discovered the true cause of global warming!
I really like the idea, but I don't think this tool will provide all that much insight into our past. It's very difficult to generalize "the number of times that Einstein was mentioned in 10% of the books published between 1900 and 2008" to "a meaningful understanding of the relative importance of Einstein's work to theoretical physics in the past century". Even so, I imagine that this program might provide a good stepping stone to a more in-depth analysis of socio-historical trends.
Reminds me of Isaac Asimov's "PsychoHistory" in his Foundation series books
Psychohistory sounds a better word than culturomics
...and this has anything at all to do with this article in what way?
I am rather distressed to realize that the cookie monster is just a foul plagiarist. There are numerous articles giving him credit for inventing the phrase "om nom nom" when clearly he did not. We must spread the word of this nefarious theft of idea. Does anyone know which book held the original "om nom"?
I'm sorry but you are incorrect. You pose a bunch of questions and assume there are no answers. Techniques very similar to these are used in grading the SAT. All of the English essays on the SATs are marked by computer. So there is meaningful information in word counting like this.