At the Kavli Institute, Wiggins began building a model of a gene regulatory network in yeast—the set of rules by which genes and regulators collectively orchestrate how vigorously DNA is transcribed into mRNA. As he worked with different algorithms, he started to attend discussions on gene regulation led by Christina Leslie, who ran the computational biology group at Columbia at the time. Leslie suggested using a specific machine-learning tool called a classifier. Say the algorithm must discriminate between pictures that have bicycles in them and pictures that do not. A classifier sifts through labeled examples and measures everything it can about them, gradually learning the decision rules that govern the grouping. From these rules, the algorithm generates a model that can determine whether or not new pictures have bikes in them. In gene regulatory networks, the learning task becomes the problem of predicting whether genes increase or decrease their protein-making activity.
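To make the bicycle example concrete, here is a minimal sketch of a classifier that learns one decision rule from labeled examples. It is an illustration only: the two measurements per picture, the numbers, and the function names `learn_stump` and `predict` are all invented for this sketch and do not come from Wiggins's or Leslie's work.

```python
from typing import List, Tuple

# (measurements for one picture, label: 1 = has a bike, 0 = no bike)
Example = Tuple[List[float], int]

def learn_stump(examples: List[Example]) -> Tuple[int, float]:
    """Find the single rule 'feature j >= threshold means bike'
    that misclassifies the fewest labeled training examples."""
    best = None  # (errors, feature index, threshold)
    n_features = len(examples[0][0])
    for j in range(n_features):
        # Try every observed value of feature j as a candidate threshold.
        for features, _ in examples:
            threshold = features[j]
            errors = sum(
                (1 if x[j] >= threshold else 0) != label
                for x, label in examples
            )
            if best is None or errors < best[0]:
                best = (errors, j, threshold)
    _, j, threshold = best
    return j, threshold

def predict(rule: Tuple[int, float], features: List[float]) -> int:
    """Apply the learned rule to a new, unlabeled picture."""
    j, threshold = rule
    return 1 if features[j] >= threshold else 0

# Hypothetical measurements: [round shapes detected, fraction of green pixels]
training = [
    ([2.0, 0.10], 1),
    ([2.0, 0.60], 1),
    ([3.0, 0.30], 1),
    ([0.0, 0.50], 0),
    ([1.0, 0.20], 0),
    ([0.0, 0.70], 0),
]

rule = learn_stump(training)
```

On this toy data the learned rule keys on the number of round shapes and ignores the green-pixel measurement, which carries no information about the label. A real classifier combines many such rules, but the principle is the same: the rules are derived from the labeled examples rather than written by hand.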
The algorithm that Wiggins and Leslie began building in the fall of 2002 was trained on the DNA sequences and mRNA levels of regulators expressed during a range of conditions in yeast: when the yeast was cold, hot, starved, and so on. Specifically, this algorithm, MEDUSA (for motif element detection using sequence agglomeration), scans every possible pairing between a set of DNA promoter sequences, called motifs, and regulators. Then, much like a child might match a list of words with their definitions by drawing a line between the two, MEDUSA finds the pairing that best improves the fit between the model and the data it tries to emulate. (Wiggins refers to these pairings as edges.) Each time MEDUSA finds a pairing, it updates the model by adding a new rule to guide its search for the next pairing. It then scores the strength of each pairing by how much the rule improves the existing model. The resulting ranking of scores enables Wiggins and his colleagues to determine which pairings matter more than others and how they collectively influence the activity of each of the yeast’s 6,200 genes. By adding one pairing at a time, MEDUSA can predict which genes ratchet up their RNA production or clamp that production down, as well as reveal the collective mechanisms that orchestrate an organism’s transcriptional logic.
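The loop of finding the best pairing, adding it as a rule, and scoring its strength can be sketched in a few lines. What follows is not MEDUSA itself. The article does not spell out the learning procedure, so the sketch assumes an AdaBoost-style scheme in which each rule abstains unless its motif and its regulator are both present. The genes, motifs, regulators, and labels are synthetic, and every name (`fires`, `fit`, `predict`, `smoothing`) is hypothetical.

```python
import math
from itertools import product

# Synthetic toy data, invented for illustration; not real yeast measurements.
GENE_MOTIFS = {             # hypothetical motifs in each gene's promoter
    "geneA": {"m1"},
    "geneB": {"m1", "m2"},
    "geneC": {"m2"},
    "geneD": set(),
    "geneE": {"m1"},
}
CONDITION_REGULATORS = {    # hypothetical regulators active in each condition
    "heat": {"r1"},
    "cold": {"r2"},
    "starved": {"r1", "r2"},
}

def fires(pair, gene, cond):
    """A (motif, regulator) pairing applies only if both are present."""
    motif, regulator = pair
    return motif in GENE_MOTIFS[gene] and regulator in CONDITION_REGULATORS[cond]

def make_examples():
    """Label each (gene, condition) +1 (up) or -1 (down) by a made-up rule."""
    truth = [("m1", "r1"), ("m2", "r2")]
    return [
        (g, c, 1 if any(fires(p, g, c) for p in truth) else -1)
        for g, c in product(GENE_MOTIFS, CONDITION_REGULATORS)
    ]

def fit(examples, rounds):
    """Greedily add one pairing per round, AdaBoost-style (an assumption)."""
    n = len(examples)
    weights = [1.0 / n] * n
    smoothing = 0.5 / n  # keeps the strength finite when a rule never errs
    motifs = sorted(set().union(*GENE_MOTIFS.values()))
    regulators = sorted(set().union(*CONDITION_REGULATORS.values()))
    model = []
    for _ in range(rounds):
        best = None
        for pair in product(motifs, regulators):  # scan every pairing
            w_up = w_down = w_abstain = 0.0
            for w, (g, c, y) in zip(weights, examples):
                if not fires(pair, g, c):
                    w_abstain += w
                elif y == 1:
                    w_up += w
                else:
                    w_down += w
            # Weighted exponential loss left over if this pairing is added.
            z = w_abstain + 2.0 * math.sqrt(w_up * w_down)
            if best is None or z < best[0]:
                best = (z, pair, w_up, w_down)
        _, pair, w_up, w_down = best
        # Strength of the pairing: positive means "up", negative means "down".
        alpha = 0.5 * math.log((w_up + smoothing) / (w_down + smoothing))
        model.append((pair, alpha))
        # Down-weight examples the new rule gets right, up-weight its mistakes.
        weights = [
            w * math.exp(-alpha * y) if fires(pair, g, c) else w
            for w, (g, c, y) in zip(weights, examples)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    return model

def predict(model, gene, cond):
    """Sum the strengths of all pairings that apply; default to 'down'."""
    score = sum(alpha for pair, alpha in model if fires(pair, gene, cond))
    return 1 if score > 0 else -1

examples = make_examples()
model = fit(examples, rounds=2)
```

On this toy data the first round selects the pairing `("m1", "r1")` and the second selects `("m2", "r2")`, which together recover the made-up rule used to label the examples. The `alpha` values play the role of the pairing strengths described above: sorting the model by their magnitude ranks the pairings by importance.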
Wiggins and his colleagues can now go much further than yeast. Recently they have shown that MEDUSA can accurately build predictive models of gene regulatory networks in higher organisms such as worms as well as in several cell lines, including those of human lymphocytes. In a cancer cell line, the team can determine which genes increase their activity when they should decrease it, and vice versa. The ultimate goal, however, is to understand their coordinated activity and infer, with statistics, which interactions lead to a diseased cell.
Although MEDUSA makes accurate predictions on test data, there is still no way to know whether it faithfully reproduces real biological networks. To find out, researchers would have to test each connection experimentally. It is also unclear how well microarray data measure expression levels, so accurate predictions may not necessarily reflect the truth. Moreover, machine learning forces researchers to formulate ad hoc hypotheses that may be biased toward their results, “so any kind of correlation in the data may be a fluke,” remarks Yoav Freund of the University of California, San Diego, who created MEDUSA’s learning algorithm.
To address these limitations, researchers must not only continue to cross disciplines but also be willing to adopt their tools. “I would say that machine learning hasn’t taken off like wildfire in the physics community,” remarks Alex Hartemink, a machine-learning expert at Duke University. “But Chris seems to be most comfortable reaching out and learning about techniques from other places. And I think we need people that are going to do that—foray out into the forest, find new resources and bring them back to the tribe and say, ‘Hey, guys, check this out—this is great stuff.’”
This article was originally published with the title “At the Edge of Life's Code.”