On an airport shuttle bus to the Kavli Institute for Theoretical Physics in Santa Barbara, Calif., Chris Wiggins took a colleague’s advice and opened a Microsoft Excel spreadsheet. It had nothing to do with the talk on biopolymer physics he had been invited to give. Rather, the columns and rows of numbers that stared back at him referred to the genetic activity of budding yeast. Specifically, the numbers represented the amount of messenger RNA (mRNA) expressed by all 6,200 genes of the yeast over the course of its reproductive cycle. “It was the first time I ever saw anything like this,” Wiggins recalls of that spring day in 2002. “How do you begin to make sense of all these data?”
Instead of shrinking from this question, the 36-year-old applied mathematician and physicist at Columbia University embraced it—and now six years later he thinks he has an answer. By venturing into fields outside his own, Wiggins has dredged up tools from a branch of artificial intelligence called machine learning to model the collective protein-making activity of genes from real-world biological data. Engineers originally designed these tools in the late 1950s to predict output from input. Wiggins and his colleagues have now brought machine learning to the natural sciences and tweaked it so that it can also tell a story—one not only about input and output but also about what happens inside a model of gene regulation, the black box in between.
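The late-1950s idea the article alludes to is exemplified by the perceptron: a device that learns a rule mapping inputs to an output purely from examples, without being told the rule. The sketch below is a minimal illustration of that learning scheme; the training data (the logical OR function) and all parameter values are invented for demonstration, not drawn from the article.

```python
# A minimal perceptron: it adjusts weights whenever its prediction
# disagrees with the observed output, gradually learning the input-output
# rule from data alone. The OR-gate examples here are purely illustrative.

def train_perceptron(samples, epochs=20, lr=0.1):
    w = [0.0, 0.0]   # one weight per input
    b = 0.0          # bias term
    for _ in range(epochs):
        for x, target in samples:
            pred = 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0
            err = target - pred            # zero when prediction is right
            w[0] += lr * err * x[0]        # nudge weights toward the target
            w[1] += lr * err * x[1]
            b += lr * err
    return w, b

# Examples of an unknown rule (here, logical OR) given as (input, output).
samples = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
w, b = train_perceptron(samples)

def predict(x):
    return 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0

print([predict(x) for x, _ in samples])  # → [0, 1, 1, 1]
```

The key point for the story: nothing in the code encodes the OR rule itself; the rule is recovered from input-output pairs, which is exactly the posture Wiggins later adopts toward gene-expression data.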
The impetus for this work began in the late 1990s, when high-throughput techniques generated more mRNA expression profiles and DNA sequences than ever before, “opening up a completely different way of thinking about biological phenomena,” Wiggins says. Key among these techniques were DNA microarrays, chips that provide a panoramic view of the activity of genes and their expression levels in any cell type, simultaneously and under myriad conditions. As noisy and incomplete as the data were, biologists could now query which genes turn on or off in different cells and determine the collection of proteins that give rise to a cell’s characteristic features—healthy or diseased.
Yet predicting such gene activity requires uncovering the fundamental rules that govern it. “Over time, these rules have been locked in by cells,” says theoretical physicist Harmen Bussemaker, now an associate professor of biology at Columbia. “Evolution has kept the good stuff.”
To find these rules, scientists needed statistics to infer the interactions between genes and the proteins that regulate them and then to mathematically describe this network’s underlying structure—the dynamic pattern of gene and protein activity over time. But physicists who did not work with particles (or planets, for that matter) viewed statistics as nothing short of anathema. “If your experiment requires statistics,” British physicist Ernest Rutherford once said, “you ought to have done a better experiment.”
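One of the simplest statistical moves behind this kind of network inference is to ask whether a putative regulator’s expression profile rises and falls in step with a target gene’s. The sketch below illustrates that idea with Pearson correlation; the gene names, expression values, and threshold are all invented for illustration and are not the method or data from the article.

```python
# Toy sketch: propose regulator-target links when expression profiles are
# strongly correlated (positively or negatively) across time points.
# All profiles and the 0.8 cutoff are invented for illustration.

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Hypothetical expression levels over five time points of a cell cycle.
profiles = {
    "regulator": [1.0, 2.0, 3.0, 2.0, 1.0],
    "gene_a":    [1.1, 2.1, 2.9, 2.2, 0.9],   # tracks the regulator
    "gene_b":    [3.0, 2.0, 1.0, 2.0, 3.0],   # anti-correlated (repression?)
    "gene_c":    [1.0, 1.0, 1.0, 1.0, 1.1],   # essentially flat
}

reg = profiles["regulator"]
links = {g: pearson(reg, p) for g, p in profiles.items() if g != "regulator"}
candidates = sorted(g for g, r in links.items() if abs(r) > 0.8)
print(candidates)  # → ['gene_a', 'gene_b']
```

Real microarray data are far noisier than this toy, which is precisely why the correlational picture had to give way to the more principled statistical models the article goes on to describe.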
But in working with microarrays, “the experiment has been done without you,” Wiggins explains. “And biology doesn’t hand you a model to make sense of the data.” Even more challenging, the building blocks that make up DNA, RNA and proteins are assembled in myriad ways; moreover, subtly different rules of interaction govern their activity, making it difficult, if not impossible, to reduce their patterns of interaction to fundamental laws. Some genes and proteins are not even known. “You are trying to find something compelling about the natural world in a context where you don’t know very much,” says William Bialek, a biophysicist at Princeton University. “You’re forced to be agnostic.”
Wiggins believes that many machine-learning algorithms perform well under precisely these conditions. When working with so many unknown variables, “machine learning lets the data decide what’s worth looking at,” he says.
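A crude stand-in for that data-driven spirit is to rank genes by how much their expression actually varies across conditions, rather than deciding in advance which genes matter. The sketch below is only a toy illustration of “letting the data decide”; the gene names and values are invented, and this is not Wiggins’s algorithm.

```python
# Toy sketch of data-driven filtering: surface the genes whose expression
# varies most across conditions instead of pre-selecting them.
# All expression values are invented for illustration.

from statistics import pvariance

expression = {
    "gene_w": [5.0, 5.1, 4.9, 5.0],   # flat: likely housekeeping
    "gene_x": [1.0, 4.0, 1.0, 4.0],   # strongly condition-dependent
    "gene_y": [2.0, 2.2, 2.0, 1.8],   # mildly variable
    "gene_z": [0.5, 3.5, 6.0, 0.5],   # highly variable
}

# Most variable genes first: the data, not the analyst, picks the shortlist.
ranked = sorted(expression, key=lambda g: pvariance(expression[g]),
                reverse=True)
print(ranked)  # → ['gene_z', 'gene_x', 'gene_y', 'gene_w']
```

The ranking requires no model of gene regulation at all, which is the point: with so many unknowns, the analysis starts from what the measurements themselves make conspicuous.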