The Books Ngram Viewer can also shed some light on the popularity of various people. It reveals, for instance, a marked dearth of references to Jewish artist Marc Chagall in books published in Nazi Germany, which suggests widespread censorship, the researchers concluded in their paper. (For those more keen on following scientists, the frequency of "Albert Einstein" mentions surpasses that of "Charles Darwin" in the late 1960s, but both enjoy a rise in popularity from about 1975 to 2005, according to a recent search. The researchers also found that Freud ranks higher over time than Einstein or Darwin.)
The analysis tool might also provide "an interesting example of how we think in a place we don't expect," Michel says. For instance, he and his team found that the superhero's ultimate challenge, long a trope in literature, has not always been to save the world. Rather, after searching the database, they discovered that until the two world wars, "by and large, it used to be saving the country." But during the 20th century, a more global sensitivity led to "the globalization of heroes," too, he notes.
Dames, who was not involved in the new paper, is not entirely convinced that "the method will truly catch on with humanists otherwise not inclined toward quantitative methods until truly surprising—or controversial—results are generated." So far, he has found the reported frequencies rather predictable, although that could be evidence that the method is working, he notes.
Dusting off the data
The clean lines and neat charts might raise some flags when considering the decidedly messy, and potentially musty, sources of that data. "Our method is certainly not perfect," Michel says. For example, new editions of old works and works in translation are logged under the year and language in which they are published.
Although the bulk of the books included were written in English (about 72 percent), users can also search works written in French, Spanish, German, Chinese, Russian and Hebrew. The data also becomes more robust over time, with only a few books coming from the early 1500s and billions of printed words catalogued each year by the 20th century.
Google has been partnering with university libraries, publishing houses and other organizations to acquire digital scans of as many books as possible. Michel and his colleagues selected about a third of the books that have been digitized so far (about five million of some 15 million), which represents approximately 4 percent of all books ever published. "Our number-one criteria is getting the books that have high-quality metadata," Michel says. When a book's publication date is incorrectly noted in the metadata, it distorts the dataset, so volumes with erroneous information attached were excluded.
Even with a smaller sample of digital works, the current dataset and analysis tools for the Books Ngram Viewer took about four years to put together. Michel says it was "our personal folly when we started it—we didn't realize how long it would take." And the goal is to expand the searchable corpus—not just to more tomes, but also to magazines, newspapers, blogs and even non-text-based products, such as artwork.
Beyond types of sources, broadening the scope of the search unit would increase the value of these sorts of quantitative methods, Dames notes, adding that it will be crucial to be able to investigate things such as shifts in genre and narrative form. "This seems like it would be the necessary next frontier for quantitative work in the humanities: studies of forms larger than the single word."
In the meantime, the researchers have encouraged the public to search away on the site, or to download the hefty dataset for their own analyses. "I think it could be a source of wonder for a while," Michel says. To wit: although it peaked way back in the mid-19th century, the frequency of "procrastination" has been climbing since 2000.