As opioid abuse tightens its grip on the U.S., a team of medical researchers is combing social media for clues to better understand this major public health problem. Using artificially intelligent software they developed to analyze tweets and related geographic information, the researchers found Twitter to be a particularly reliable data source for pinpointing where the situation is at its worst. With about 500 million messages posted to the microblogging site daily, such an approach could help alert local health officials so they can gather funding or other resources to tackle the issue.

Drug overdoses (most involving illegal heroin and prescription opioids) killed more than 64,000 people in the U.S. in 2016, up 21 percent from the previous year, according to the U.S. Centers for Disease Control and Prevention. Deaths attributed to the abuse of fentanyl (the painkiller blamed for pop star Prince’s overdose death last year) more than doubled from 2015 to 2016, and more sobering data is expected after health organizations have time to gather it. That lag time in collecting information is one of the major factors that make opioid abuse particularly hard to address.

The researchers wanted to know whether analyzing tweet text could help them estimate the location and relative prevalence of prescription opioid misuse as accurately as established epidemiologic studies, such as the National Survey on Drug Use and Health (NSDUH), in a fraction of the time. Traditional medical research like the NSDUH can take years to complete and publish. But the team thought Twitter messages might provide an early warning system that could prompt more immediate action, such as localized public health campaigns. “We found that our estimates agreed with [NSDUH] data, suggesting that social media can be a reliable additional source of epidemiological data regarding substance use,” says Michael Chary, a resident physician in emergency medicine at New York–Presbyterian/Queens Hospital. “We can analyze social media to canvass larger segments of the general population and potentially yield timely insights.” Chary’s research team comprised medical professionals in New York City, in New Jersey and at Brigham Young University in Utah, as well as a Brigham Young computer scientist.

Publicly searchable Twitter, in particular, offers several advantages for digital epidemiology (the study of the incidence, distribution and possible control of health threats using online data), according to the study, published recently in the Journal of Medical Toxicology. Twitter users tend to write frequent, short messages on a wide variety of topics, and they often indicate their location and other demographic information. “There’s a confessional effect,” Chary says. “People may discuss or reveal things on social media that, when directly asked, they may not. There may be a level of candor there that’s not present in the emergency room or internist’s office.”

The researchers developed custom software to analyze tweets for possible references to drug use or abuse. The software relied on AI to quickly search more than 3.6 million tweets, and to identify words and phrases—including “dope,” “percs,” “white,” “TNT” and “Captain Cody”—likely referring to opioid consumption. Further examination of the tweets revealed additional details: for example, that fentanyl can go by the term “dummies.” Codeine translates into “syrup” or “Tango and Cash.”
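To make the keyword step concrete, here is a minimal Python sketch of how tweets might be scanned for the slang terms quoted above. It is not the researchers' software, which the article does not describe in detail; the function names and the plain regular-expression matching are assumptions made for illustration only.

```python
import re

# Slang terms quoted in the article. The article only ties "dummies" to
# fentanyl and "syrup" / "Tango and Cash" to codeine; the rest are listed
# simply as likely references to opioid consumption.
SLANG_TERMS = [
    "dope", "percs", "white", "TNT", "Captain Cody",
    "dummies", "syrup", "Tango and Cash",
]

def build_pattern(terms):
    """Compile one case-insensitive pattern matching any term as a whole word."""
    # Longest terms first so multi-word phrases win over shorter alternatives.
    escaped = sorted((re.escape(t) for t in terms), key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(escaped) + r")\b", re.IGNORECASE)

PATTERN = build_pattern(SLANG_TERMS)

def find_candidate_terms(tweet):
    """Return the sorted, lowercased slang terms found in a tweet (hypothetical helper)."""
    return sorted({m.group(0).lower() for m in PATTERN.finditer(tweet)})

if __name__ == "__main__":
    print(find_candidate_terms("Ran out of percs, need Captain Cody"))
    print(find_candidate_terms("Pancakes with syrup this morning"))
```

The second example tweet still matches "syrup" even though it is about breakfast, which is why a keyword hit alone can only mark a tweet as a candidate rather than as evidence of misuse.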

Armed with this knowledge of the software’s algorithms, the team then identified word-use patterns that distinguished tweets referring specifically to drug misuse from those describing, say, breakfast (in the case of “syrup”). The tweets most likely referring to a substance abuse problem were flagged. The researchers validated the software’s ability to make that distinction by comparing word use in the flagged messages with a list of opioid-related keywords curated by a medical toxicologist and an emergency physician. The study’s findings were similar to NSDUH state-by-state estimates of prescription opioid misuse, especially among people 18 to 25 years old. The closer agreement in that age group is likely because, according to Pew Research Center, 36 percent of Twitter users are between the ages of 18 and 29.
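The idea of separating a drug-related “syrup” from a breakfast “syrup” by the surrounding words can be illustrated with a toy Python sketch. This is a deliberately simple cue-word count, not the machine-learning approach used in the study, and the cue-word lists and function names are hypothetical, chosen only to show the principle.

```python
import re

# Hypothetical cue words, picked only for illustration; a real system
# would learn such word-use patterns from data instead of hard-coding them.
DRUG_CUES = {"sip", "sipping", "high", "lean", "pills", "popped"}
FOOD_CUES = {"pancakes", "waffles", "breakfast", "maple", "brunch"}

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def looks_like_misuse(tweet, term="syrup"):
    """Flag a tweet if it contains the term and drug cues outnumber food cues."""
    tokens = tokenize(tweet)
    if term not in tokens:
        return False
    drug_score = sum(token in DRUG_CUES for token in tokens)
    food_score = sum(token in FOOD_CUES for token in tokens)
    return drug_score > food_score

if __name__ == "__main__":
    print(looks_like_misuse("Sipping syrup all night to get high"))
    print(looks_like_misuse("Pancakes with maple syrup for breakfast"))
```

A tweet with no cue words on either side is left unflagged here, which errs toward missing cases rather than raising false alarms; a real classifier would have to weigh that trade-off against the noise in the data.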

Following established medical research protocols, the researchers kept the data they collected anonymous—individual tweeters could not be identified. That worked well for the purposes of their study, although they acknowledge it would not be difficult to trace a tweet back to a particular Twitter user profile if a government or law enforcement agency wanted to conduct a similar study.

“Twitter data is high-volume and the content is short-form, brief statements [that] are easier to classify than very long and complex statements,” says Michael Gilbert, a Portland, Ore.–based epidemiologist and social media researcher not involved in Chary’s research. “The combination of the volume of data and the format of the data makes Twitter suitable for machine-learning tools. Are people talking about getting high, controlling pain or some other motivation that is underlying a common behavior? People are more likely to share certain types of information with their peers than they will with their health care providers.”

Chary and his team are not the only researchers studying opioid abuse who have used machine-learning techniques to study Twitter. A group led by Tim Mackey, director of the University of California, San Diego’s Global Health Policy Institute, examined the social media site for five months in 2015 to identify entities illegally selling prescription opioids online. Their software detected 1,778 posts marketing the sale of controlled substances—90 percent included hyperlinks to online sites for purchase. The American Journal of Public Health published their findings earlier this month.

Still, despite the familiarity and openness that Twitter offers—or perhaps because of it—the platform is not always a reliable source of data. Conversations on Twitter cover so many topics that identifying messages relevant to a particular study can be challenging. “This kind of research is still nascent,” says Nikki Adams, an assistant research scientist at the University of Maryland, College Park’s Center for Advanced Study of Language. “Tweets are short, and this does impact the quality of machine learning. There’s not much context. If you are investigating one topic, you might have a lot of noise around your data.”

Chary acknowledges Twitter’s shortcomings as a data source, including the large amount of irrelevant data that must be analyzed to get to anything meaningful as well as the demographic limitations of the platform’s user base. “This work is most useful in capturing trends,” he says. “We all agree there is a problem with opioid use. It’s very difficult to conduct these federal surveys at any scale with the frequency needed to say, ‘Over the last three months drug use in this particular location is going up. What’s going on here?’” The clues are there—what’s needed are the right tools to find them.