Social Media Posts and Online Searches Hold Vital Clues about Pandemic Spread

Such data offer valuable information and could help track the novel coronavirus—but they risk errors and raise privacy concerns

By Katherine Ellison & Knowable Magazine

Nearly a week before the World Health Organization first warned of a mysterious new respiratory disease in Wuhan, China, a team of Boston-based sleuths at the global disease monitoring system HealthMap captured digital clues about the outbreak from an online press report. That same day, December 30, ProMED, another digital disease detection group, became aware of online chatter about a pneumonia of unknown origin on China’s micro-blogging website, Weibo. As researchers later reported, newly popular keywords on the social media platform WeChat included “SARS,” “shortness of breath” and “diarrhea.”

Such alerts reveal the promise of a vast yet risky resource: the tweet-sized hints from people all over the world who report their health status and vent their fears online. Some researchers are calling on public health officials to take greater advantage of this virtual treasure chest of data, especially given the current rapid spread of the new coronavirus.

“We are on the precipice of an unprecedented opportunity to track, predict and prevent global disease burdens in the population using digital data,” Allison Aiello, an epidemiologist at the Gillings School of Global Public Health at the University of North Carolina, and two graduate students write in the 2020 Annual Review of Public Health.

“There’s incredible amounts of data on social media blogs, chatrooms and local news reports that give us clues about disease outbreaks happening on a daily basis,” John Brownstein, the chief innovation officer at Boston Children’s Hospital and Harvard Medical School, recently told CNN Headline News. Such data, which Brownstein calls “digital breadcrumbs,” are vital raw material for an emerging field of inquiry known as digital epidemiology. HealthMap, which he cofounded in 2006, is one of several leading efforts in the field.

HealthMap’s first big success came during the 2009 H1N1 (swine flu) pandemic, when it used sources that included Spanish-language online news reports to aid in early detection of an unidentified respiratory illness in Veracruz, Mexico. Five years later, it tapped the WHO Twitter feed and other sources to track the spread of the Ebola virus, which ultimately killed more than 11,000 people in West Africa.

The World Health Organization now routinely uses HealthMap, ProMED and similar systems to monitor infectious disease outbreaks and inform clinicians, officials and the public. Yet big-data disease detection is still in its infancy compared with traditional methods, and the social media ingredients, in particular, have yet to make any major contribution in predicting where and how infectious diseases may strike.

So far at least, HealthMap still doesn’t rely heavily on social media; instead, it mostly tracks reports from online news sources and governments, while including some social media posts from public health professionals. Additionally, HealthMap calls on volunteers to submit weekly data to its crowdsourced disease-tracking platform, Flu Near You. In late March, it launched a new site, Covid Near You, that focuses specifically on Covid-19 symptoms and testing.

Still, digital epidemiology’s two key advantages—speed and volume—may increasingly help health officials spot outbreaks quickly and cheaply, Brownstein and other experts believe. At the same time, the huge volume of digital data from social media also comes with sufficient challenges to accuracy and privacy to make it a “double-edged sword,” in the words of University College London e-health researcher Patty Kostkova. It’s a now-familiar story: Technological advances are racing ahead of our ability to guarantee their quality and safety.

The most immediate challenge is getting it right. “It’s actually really hard to get useful prospective data from social media,” says Northeastern University computer scientist Clark Freifeld, who cofounded HealthMap with Brownstein. One of the biggest challenges, he says, is that once a disease becomes news, most subsequent media queries and posts are reactions to that news rather than indicators of more news to come.

For example, in 2012 Google Flu Trends estimated a large spike in winter flu cases based on increased use of flu-related terms in Google searches. The actual spike turned out to be about half as high, perhaps because users’ searches reflected news of flu outbreaks rather than actual illnesses.

Red herrings are another serious problem. Researchers noted a 2007 spike in Google searches for the word “cholera.” But the cause wasn’t a disease outbreak; instead, it turned out that Oprah Winfrey had picked the novel “Love in the Time of Cholera” for her book club. While that particular case didn’t lead any public health officials astray, says Aiello, it’s a vivid example of reactive and irrelevant “noise.”

HealthMap tries to address this problem by using artificial intelligence to filter out repetition and irrelevancies. “We have a database of millions of articles and pieces of content relating to disease outbreaks,” says Freifeld. “We’ll hand-label say 100,000 examples of actual outbreaks and contrast them with things that aren’t related, like an ‘outbreak’ of home runs in the seventh inning. That’s how the system learns what’s useful and what’s not.”
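The labeling approach Freifeld describes can be illustrated with a toy text classifier. What follows is only a minimal sketch, not HealthMap's actual system: the naive Bayes method, the class name `TinyNaiveBayes`, the labels "disease" and "other," and the six example headlines are all assumptions made up for illustration, standing in for the roughly 100,000 hand-labeled examples he mentions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


class TinyNaiveBayes:
    """Hypothetical toy classifier; not HealthMap's real model."""

    def __init__(self):
        self.word_counts = {}        # label -> Counter of word frequencies
        self.doc_counts = Counter()  # label -> number of training examples
        self.vocab = set()

    def train(self, examples):
        for text, label in examples:
            self.doc_counts[label] += 1
            counts = self.word_counts.setdefault(label, Counter())
            for word in tokenize(text):
                counts[word] += 1
                self.vocab.add(word)

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.word_counts.items():
            # Log prior plus log likelihoods with add-one smoothing.
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + len(self.vocab)
            for word in tokenize(text):
                score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented hand-labeled examples (assumption, for illustration only).
LABELED = [
    ("Officials report outbreak of pneumonia cases at city hospital", "disease"),
    ("Health ministry confirms cholera outbreak after flooding", "disease"),
    ("Dozens hospitalized with fever and respiratory illness", "disease"),
    ("Outbreak of home runs in the seventh inning seals the win", "other"),
    ("Book club picks novel Love in the Time of Cholera", "other"),
    ("Team fever grips city ahead of championship game", "other"),
]

model = TinyNaiveBayes()
model.train(LABELED)
print(model.predict("Hospital confirms new pneumonia cases"))
print(model.predict("Home runs in the ninth inning"))
```

With so few examples, the toy model simply learns which words tend to accompany each label; a production system would need far more data and a more careful evaluation.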

A major reason digital breadcrumbs can lead experts astray is that they can miss a large section of the population. About 22 percent of US adults use Twitter, but it’s not a random sample. US Twitter users are predominantly wealthier, younger, better-educated and more likely to be Democrats than other Americans. What’s more, most Twitter users don’t tweet all that much: About 80 percent of the tweets from all adult US users come from the most prolific 10 percent. Twitter’s youthful profile is particularly problematic considering that older people, at least according to initial assumptions, have been at more risk of becoming seriously ill. Monitoring health through tweets could thus ignore the most vulnerable among us.

More broadly, social media is justifiably notorious for spreading falsehoods, which in the case of infectious diseases can have deadly consequences. And that, public health researchers say, is always a danger in the search for signals amid the social media noise. Public health depends on trust in public officials, but that trust can quickly erode if a government releases faulty information.

On top of its problems with accuracy, digital epidemiology may increase threats to internet users’ privacy. Unlike Europe, the United States lacks sweeping laws to protect privacy on social media. Platforms such as Google and Facebook routinely license aggregated users’ information to advertisers who can then target pitches based on search contents and “likes.” Using these kinds of data for health surveillance could multiply the risk of privacy abuses, Freifeld says, especially when public health concerns conflict with confidentiality.

Privacy advocates are already sounding the alarm about recent efforts by the White House and US Centers for Disease Control and Prevention to expand their access to Americans’ mobile-phone data to track their locations during the epidemic. Federal health officials hope to incorporate anonymous, aggregated data to follow the spread of the virus and check compliance with new “social distancing” rules.

The growing repository of online public health data became somewhat more open to the public just this month. On March 17, CrowdTangle, a social media monitoring site recently purchased by Facebook, announced it had launched a new feature to let users, including news media organizations, public health officials and researchers, track social trends across sites including Facebook, Instagram and Reddit. The company simultaneously introduced a publicly available hub of streaming, limited real-time displays of official information and social media posts concerning Covid-19 infections caused by the new coronavirus. The social media posts were culled only from public accounts, not private ones.

Voluntary reporting systems may avoid some, although not all, of the biases of run-of-the-mill digital epidemiology. Flu Near You, launched in 2011, uses an anonymous, crowdsourced model to collect data for public health officials and researchers.

A somewhat similar project is FoodBorne Chicago, a Twitter-based surveillance system that monitors complaints of foodborne illness. Based at the Chicago Department of Public Health, it tracks tweets using a machine-learning algorithm that identifies the keywords “food poison.” When local residents type those words, the site tweets back a link with a form to provide details, collecting data that might never otherwise have been reported.

For the last seven years, the CDC has dipped its toe in digital detection of diseases by managing a yearly competition known as FluSight in which researchers from academia and industry attempt to forecast the timing and intensity of the flu season. The CDC requires competitors to use some sort of digital data in their projections.

Meanwhile, researchers are increasingly excited about the potential of including data from more direct measures of wellness and illness. Smart, wearable health-tracking monitors supply a constant stream of data about heart rates, steps taken and quality of sleep.

On March 25, Scripps Research Translational Institute epidemiologist Jennifer Radin, the lead author of a recent study on the potentially “vital” role of Fitbits in disease detection, called on US adult volunteers using any kind of smartwatch or activity tracker to share their health data with researchers by downloading the MyDataHelps mobile app. The researchers hope to use the data to identify changes in resting heart rates that may signify disease, Radin told Knowable. While she acknowledged that a faster heart rate might be induced by simply watching the news, she said volunteers who aren’t feeling well may also list other symptoms on the app.
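One simple way to look for the kind of resting heart rate change Radin describes is to compare each day's reading with the wearer's own recent baseline. The sketch below is an illustration under stated assumptions, not the method used by the Scripps researchers: the function name `flag_elevated_days`, the seven-day baseline window, the threshold of two standard deviations and the sample readings are all hypothetical.

```python
from statistics import mean, stdev


def flag_elevated_days(resting_hr, baseline_days=7, threshold=2.0):
    """Return indexes of days whose resting heart rate sits well above
    the average of the preceding `baseline_days` days.

    Hypothetical rule of thumb: flag a day when its reading is more than
    `threshold` standard deviations above the recent baseline.
    """
    flagged = []
    for i in range(baseline_days, len(resting_hr)):
        window = resting_hr[i - baseline_days:i]
        baseline = mean(window)
        spread = stdev(window)
        if spread == 0:
            continue  # no variation in the baseline, so no z-score
        z_score = (resting_hr[i] - baseline) / spread
        if z_score > threshold:
            flagged.append(i)
    return flagged


# Invented daily resting heart rates in beats per minute (assumption).
readings = [60, 61, 59, 60, 62, 61, 60, 61, 68, 70]
print(flag_elevated_days(readings))  # → [8, 9]
```

As Radin's caveat suggests, a flag like this says only that a reading is unusual for that person; it cannot tell illness apart from stress, which is why self-reported symptoms are collected alongside it.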

For the past eight years, a San Francisco startup called Kinsa has been systematically collecting such real-time health data, having recently sold and given away more than 1 million internet-connected thermometers. Oregon State University scientist Benjamin Dalziel, who is collaborating on research funded by Kinsa, says the system can accurately track the flu two weeks ahead of predictions by the CDC and could potentially track Covid-19 as well. On March 18, it began posting new data from its opt-in system about clusters of “atypical fevers” on its “Health Weather Map” at www.healthweather.us.

Dalziel and Kinsa corporate leaders are certain thermometers can help during this global emergency. Using these and other types of systems to monitor symptoms in real time, Dalziel says, “is the future, however grand that sounds.... A fever is a key indicator of an acute respiratory infection. It’s measuring something directly relevant to illness. And while I think there has been stunning work done to extract information from Twitter, a thermometer reading has clearly got an advantage over a tweet.”*

Other experts are also enthusiastic about Kinsa’s progress. “Fever monitoring is a great idea given the lack of Covid-specific test kits,” says HealthMap’s Freifeld.

The coronavirus emergency is clearly speeding interest in digital epidemiology. Yet to date, Freifeld and other experts agree that the field’s promise remains more as an adjunct than a substitute for conventional surveillance.

As Aiello, in North Carolina, acknowledges, for the time being at least, “We’ll need to validate it with traditional shoe-leather data.”

This article originally appeared in Knowable Magazine, an independent journalistic endeavor from Annual Reviews.

*Editor’s Note (3/31/20): Our partners at Knowable have updated this paragraph to clarify Benjamin Dalziel’s views.
