Efforts to study the early stages of the coronavirus pandemic have received help from a surprising source. A biologist in the United States has ‘excavated’ partial SARS-CoV-2 genome sequences from the early days of the pandemic in its probable epicentre, Wuhan, China. The sequences had been deposited in a US government database but were later removed from it.

The partial genome sequences address an evolutionary conundrum about the early genetic diversity of the coronavirus SARS-CoV-2, although scientists emphasize that they do not shed light on its origins. Nor is it fully clear why researchers at Wuhan University asked for the sequences to be removed from the Sequence Read Archive (SRA), a repository for raw sequencing data maintained by the National Center for Biotechnology Information (NCBI), part of the US National Institutes of Health (NIH).

“These sequences are informative, they’re not transformative,” says Jesse Bloom, a viral evolutionary geneticist at the Fred Hutchinson Cancer Research Center in Seattle, Washington, who describes in a 22 June preprint how he recovered the sequences.

Bloom discovered the sequences after searching for genomic data from the pandemic’s early stages. A research paper from May 2020 contained a table of publicly available sequence data, which included entries Bloom had not come across. The sequences were associated with a paper in which researchers used nanopore-sequencing technology to detect SARS-CoV-2 genetic material in samples from people. That study was published in the journal Small in June 2020, having been posted on bioRxiv in March of that year.

When Bloom looked for the sequences in the SRA using the details listed in the May 2020 paper, the database returned no entries. The SRA keeps sequences in cloud storage maintained by Google, and Bloom wondered whether he could find archived versions of the sequences on cloud servers. This approach worked, and Bloom was able to recover data from 50 samples, 13 of which contained enough raw data to generate partial genome sequences.

Evolutionary mystery

The sequences help to solve an evolutionary mystery about the early stages of the pandemic, says Bloom. The earliest viral sequences from Wuhan date from December 2019 and come from individuals linked to the city’s Huanan Seafood Market, which was initially thought to be where the coronavirus first jumped from animals to people. But the seafood-market sequences are more distantly related to SARS-CoV-2’s closest relatives in bats (the most likely ultimate origin of the virus) than are later sequences, including one collected in the United States.

That was surprising, says Bloom, because you would expect that viruses from the early stages of Wuhan’s epidemic would be most closely related to SARS-CoV-2’s relatives that infect bats. The recovered sequences, which were probably collected in January and February 2020, show this to be the case — they are more closely related to the bat viruses than are the sequences from people linked to the seafood market.

This adds to a growing body of evidence, including reports of probable cases dating back to November 2019, that the first human cases of COVID-19 were not associated with the Huanan Seafood Market, say Bloom and other scientists.

“To me, it seemed like Wuhan market was one of the first super-spreading events,” says Sudhir Kumar, an evolutionary geneticist at Temple University in Philadelphia, Pennsylvania. The sequences that Bloom unearthed, he adds, suggest that SARS-CoV-2 developed extensive diversity in the early stages of the pandemic in China — including in Wuhan.

Stephen Goldstein, a virologist at the University of Utah in Salt Lake City, points out that the sequences Bloom recovered were not hidden: they are described in detail, with enough sequence information to know their evolutionary relationship to other early SARS-CoV-2 sequences, in the Small paper. “I don’t think this preprint tells us a whole lot that’s new, but it does bring to the forefront sequence data that has been publicly available, though under the radar,” Goldstein says.

Bloom says that, although the sequences were published, their removal from the SRA meant that few scientists knew about them. A report commissioned by the World Health Organization on the pandemic’s origins did not include the sequences in an evolutionary analysis of early SARS-CoV-2 data. “Nobody noticed they existed,” Bloom says.

The corresponding authors of the Small paper did not respond to questions from Nature’s news team about why they asked for the sequences to be removed from the SRA. The removal happened before the paper was published. In a statement, the NIH said it removed the data at the request of the researchers, who said they planned to submit them to another database.

Bloom — who co-authored a letter calling for a renewed investigation into the origins of the pandemic, including the possibility that the virus escaped or leaked from a lab — says his study sheds no light on the origins of the pandemic, nor on why the sequences were removed. But he hopes his efforts will encourage researchers to “think outside the box” and look to other sources, such as archival data, to glean more information from the early days of the pandemic. “There are probably more data out there,” he says.

This article is reproduced with permission and was first published on June 24, 2021.