“Learn from your mistakes.” It’s a familiar adage, but people still tend to highlight their successes and sweep their failures under the rug, as a professor at Princeton University pointed out last week when he published his “CV of Failures,” which has since gone viral. Now, in a study published this week in Nature, a team of researchers at Haverford College in Pennsylvania has taken this idea to the next level by applying it to the scientific community. (Scientific American is part of Springer Nature.)
Even though most experiments fail, it is only the successes that are reported in scientific literature and discussed among experts. The vast majority of the data is discarded, left to collect dust in forgotten laboratory notebooks or never written down at all, making it effectively unavailable for use in further research. “Scientific literature is biased against failure,” says experimental chemist Alexander Norquist, one of the study’s lead authors. “What we want to do is extract as much information as we can from the vast number of failed reactions that usually don’t get reported.” To achieve this, the Haverford researchers used a collection of these failed or “dark” reactions to create a machine-learning model that was able to predict the success of new chemical reactions with greater accuracy than humans are able to achieve.
They started by compiling a database of nearly 4,000 chemical reactions (many of which had failed and were therefore not already digitized) performed over the past decade in Norquist’s lab. The information focuses on the synthesis of new materials—in this case solids called templated vanadium selenites, which consist of vanadium, selenium, oxygen and an organic component. They then created a machine-learning algorithm to derive patterns from that data and determine what made some experiments succeed and others fail. Usually, scientists like Norquist build up an intuition over many years about the combinations of conditions—temperature, quantity and ratio of reactants, acidity and a host of other factors—that may result in the successful formation of crystals. “But our intuition is always incomplete,” Norquist says. “There’s subtlety and nuance to differences between reactants that aren’t readily apparent.”
So the team turned to machine learning: They assigned nearly 300 properties to each reaction and then used a support vector machine, which can analyze data in high dimensions, to make predictions about which conditions would be necessary for new combinations of reactants that they then tested in the lab. The algorithm predicted conditions for the successful formation of crystals in 89 percent of these cases—compared with the researchers’ predictions, which had a 78 percent success rate.
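For readers curious about what a support vector machine does with reaction data, here is a minimal sketch in Python. It is not the Haverford team's code: it uses a simple linear SVM trained by subgradient descent, two invented descriptors instead of nearly 300, and made-up data points. The function names and all numbers are assumptions chosen for illustration.

```python
# Toy linear SVM trained by batch subgradient descent on the hinge loss.
# Everything here (features, data, hyperparameters) is invented for illustration.

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=1000):
    """Minimize lam/2 * |w|^2 + mean(hinge loss). Labels must be -1 or +1."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]  # gradient of the regularizer
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # inside the margin or misclassified
                for j in range(d):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def predict(w, b, x):
    """Return +1 (predicted success) or -1 (predicted failure)."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if score >= 0 else -1

# Hypothetical descriptors scaled to [0, 1], e.g. (temperature, acidity).
# Failed ("dark") reactions are kept as negative examples, not discarded.
X_train = [
    [0.1, 0.2], [0.2, 0.1], [0.2, 0.3], [0.3, 0.2],  # failed reactions
    [0.7, 0.8], [0.8, 0.7], [0.8, 0.9], [0.9, 0.8],  # successful reactions
]
y_train = [-1, -1, -1, -1, 1, 1, 1, 1]

w, b = train_linear_svm(X_train, y_train)
correct = sum(predict(w, b, x) == label for x, label in zip(X_train, y_train))
train_accuracy = correct / len(X_train)
print("training accuracy:", train_accuracy)
print("prediction for untested conditions:", predict(w, b, [0.85, 0.75]))
```

The point of the sketch is that the failed reactions do real work: without the negative examples, the model would have nothing to separate the successes from.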
Because the reasons for the algorithm’s decisions were not always clear, given the massive amount of data being considered, the researchers then went back to the model itself and generated a decision tree, a flowchart-like structure that shows the potential outcomes of a series of choices. Using this method, which is much easier to interpret, they were able to gain new insights and formulate hypotheses. They found, for instance, that polarizability (which measures how the distribution of charges is distorted in the presence of an electric field) was important in a way they had not anticipated based on their own lab experience. In fact, they ended up with three hypotheses about different subsets of reactants. One class of reactions containing certain organic components required the presence of vanadium in a specific oxidation state. Meanwhile, when those components had low polarizabilities, the researchers realized they had to turn their attention to the behavior of other reactants, namely sodium. Finally, for particularly large organic components, charge density played a critical role.
“The real novelty in this is the end-to-end pipeline,” says computer scientist Sorelle Friedler, another of the study’s lead authors. “The idea of taking previously considered failures, unimportant reactions, and using information contained in them to link with a machine-learning pipeline, and then trying to examine the results of the machine-learning pipeline to generate these new hypotheses.”
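The idea of distilling an opaque model into a readable decision tree can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the researchers' method: the `black_box` function is a hypothetical stand-in for a trained model, the two descriptors (named after properties mentioned above) are scaled to an invented 0-to-1 range, and the tree is a small hand-written one that splits on Gini impurity.

```python
# Surrogate decision tree: fit a small, readable tree to the predictions of
# an opaque model. The model, features and numbers are all hypothetical.

def black_box(x):
    """Stand-in for a trained model: +1 = success, -1 = failure."""
    polarizability, charge_density = x
    return 1 if polarizability + charge_density >= 0.95 else -1

def gini(labels):
    """Gini impurity of a list of -1/+1 labels."""
    if not labels:
        return 0.0
    p = sum(1 for v in labels if v == 1) / len(labels)
    return 2.0 * p * (1.0 - p)

def majority(labels):
    return 1 if sum(labels) >= 0 else -1

def build_tree(X, y, depth=0, max_depth=2):
    if depth == max_depth or len(set(y)) == 1:
        return {"leaf": majority(y)}
    best = None  # (weighted impurity, feature index, threshold)
    for j in range(len(X[0])):
        values = sorted(set(x[j] for x in X))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0  # candidate threshold between neighbors
            left = [yi for x, yi in zip(X, y) if x[j] < t]
            right = [yi for x, yi in zip(X, y) if x[j] >= t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, j, t)
    if best is None:  # no feature has two distinct values
        return {"leaf": majority(y)}
    _, j, t = best
    lo_idx = [i for i, x in enumerate(X) if x[j] < t]
    hi_idx = [i for i, x in enumerate(X) if x[j] >= t]
    return {
        "feature": j,
        "threshold": t,
        "left": build_tree([X[i] for i in lo_idx], [y[i] for i in lo_idx],
                           depth + 1, max_depth),
        "right": build_tree([X[i] for i in hi_idx], [y[i] for i in hi_idx],
                            depth + 1, max_depth),
    }

def tree_predict(node, x):
    while "leaf" not in node:
        go_left = x[node["feature"]] < node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["leaf"]

def describe(node, names, indent=""):
    """Render the tree as flowchart-like text lines."""
    if "leaf" in node:
        return [indent + ("success" if node["leaf"] == 1 else "failure")]
    test = "%s < %.2f" % (names[node["feature"]], node["threshold"])
    return ([indent + "if " + test + ":"]
            + describe(node["left"], names, indent + "  ")
            + [indent + "else:"]
            + describe(node["right"], names, indent + "  "))

# Probe the opaque model on a grid of conditions and record its answers.
grid = [[i / 10, j / 10] for i in range(10) for j in range(10)]
answers = [black_box(x) for x in grid]

tree = build_tree(grid, answers, max_depth=2)
agree = sum(tree_predict(tree, x) == a for x, a in zip(grid, answers))
fidelity = agree / len(grid)  # how often the tree matches the model

print("\n".join(describe(tree, ["polarizability", "charge_density"])))
print("fidelity to the model:", fidelity)
```

A shallow tree will not reproduce the model perfectly, which is the trade-off: some fidelity is given up in exchange for rules short enough for a chemist to read and question.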
The findings come at a time when materials research has become increasingly important. The White House launched the Materials Genome Initiative in 2011, for example, in order to accelerate the pace at which new materials are discovered and put on the market. Now, the Haverford team’s machine-learning approach may help scientists make this search much more targeted—both by optimizing synthetic processes that are already known and by creating novel solids. “Materials are at the heart of every technological advancement we can think of,” says Ram Seshadri, a materials researcher at the University of California, Santa Barbara, who did not participate in this research. “The cell phone I’m using right now—its lithium battery is full of advanced materials, made by precisely the kind of chemical syntheses described in this paper,” he notes.
Cell phones are not the only potential application of such materials. This research can be directed at anything from creating better shampoos and sunscreen lotions to manufacturing new pharmaceuticals and building better solar panels. Moreover, the researchers want to make their machine-learning approach available in other fields, both within and outside chemistry. The team has published its reactions database online so that other scientists can contribute their own data. “We’re really excited,” Friedler says. “We’re hoping this paper will spur other labs to want to work with us.” Access to such data, particularly the failures, will allow the team to make new discoveries and refine its algorithm. “This is the century of data,” says Alán Aspuru-Guzik, a professor of chemistry and chemical biology at Harvard University who was not affiliated with the study. “And this paper shows that we can do a lot of learning from failed experiments.”
“Usually science is not data-driven, it’s cause-and-effect driven. This work acknowledges that sometimes you have to go beyond causality and use data-driven approaches,” Seshadri adds. “But the wonderful thing is, the data-driven approaches themselves lead to a better understanding of causality. So the approach [the Haverford team] has taken is inevitably the approach a lot of us are going to be adopting more and more in the future.”