Researchers have been trawling Twitter for insights into the human condition since shortly after the site launched in 2006. In aggregate, the service provides a vast database of what people are doing, thinking and feeling. But the research tools at scientists’ disposal are highly imperfect. Keyword searches, for example, return many hits but offer a poor sense of overall trends.
When computer scientist James H. Martin of the University of Colorado at Boulder searched for tweets about the 2010 earthquake in Haiti, he found 14 million. “You can’t hire grad students to read them all,” he says. Researchers need a more automated approach.
One promising method is to develop programs that label the words in tweets with grammatical tags (such as subject, verb and object) and then use those tags to determine what each tweet is about. This method, a form of natural-language processing, is not a new idea, but its application to short social text is new and growing fast. “That is just a huge area right now,” Martin says.
Scientists at the Xerox-owned Palo Alto Research Center recently developed one such program. It relies on text processors, called parsers, which are typically tested on news articles. Parsers can distinguish between words and punctuation, label parts of speech and analyze a sentence’s grammatical structure. But “they don’t do as well on Twitter,” says Kyle Dent, one of the Palo Alto researchers. He and his co-author wrote hundreds of rules to account for hashtags, repeated letters (as in “pleaaaaaase”) and other linguistic features perhaps not common in the Wall Street Journal. They will present their work on August 8 at an Association for the Advancement of Artificial Intelligence conference in San Francisco.
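To make the idea of such rules concrete, here is a minimal sketch in Python of two preprocessing rules of the kind described: one collapses stretched letters and the other strips the “#” from hashtags so that a conventional parser sees ordinary words. The function name and both patterns are hypothetical illustrations, not the Palo Alto researchers' actual rules, which the article does not detail.

```python
import re

# Hypothetical rules for illustration only; not the researchers' rule set.
# A letter followed by two or more copies of itself (three or more in a row).
_STRETCHED = re.compile(r"([A-Za-z])\1{2,}")
# A '#' followed by one or more word characters.
_HASHTAG = re.compile(r"#(\w+)")


def normalize_tweet(text):
    """Return (cleaned_text, hashtags) for handing to a conventional parser."""
    # Record the hashtags before removing their '#' marker.
    hashtags = _HASHTAG.findall(text)
    # Keep each hashtag's word but drop the '#', leaving an ordinary token.
    cleaned = _HASHTAG.sub(r"\1", text)
    # Collapse each run of three or more identical letters to one letter.
    cleaned = _STRETCHED.sub(r"\1", cleaned)
    return cleaned, hashtags


cleaned, tags = normalize_tweet("pleaaaaaase help #Haiti")
print(cleaned)  # please help Haiti
print(tags)     # ['Haiti']
```

Collapsing to a single letter is a deliberately crude choice: it would mangle a stretched word whose correct spelling has a double letter, which hints at why a practical system might need hundreds of rules rather than two.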
Dent and his colleagues also tried to use their program to distinguish between rhetorical questions and those that require a response. Businesses could use such a program to find out what people are asking about their products. In a recent trial, their program classified 68 percent of 2,304 tweets correctly. “For a brand-new field, that sounds like a decent first attempt,” says Jeffrey Ellen of the Space and Naval Warfare Systems Command, which provides intelligence technology to the U.S. Navy.
Although Twitter-trawling technology is not yet ready to deploy, as a field, “it’s getting there pretty quickly,” Martin says. Once it matures, researchers should have access to an unprecedented trove of data about human behavior. For the first time in history, “watercooler talk” is recorded and publicly available, Ellen says. “A hundred years ago we just didn’t know what everybody was thinking.”
This article was originally published with the title Parsing the Twitterverse.