In a world flooded with data, figuring out where and how to store it efficiently and inexpensively becomes a larger problem every day. One of the most exotic solutions might turn out to be one of the best: archiving information in DNA molecules.
The prevailing long-term cold-storage method, which dates from the 1950s, writes data to pizza-sized reels of magnetic tape. By comparison, DNA storage is potentially less expensive, more energy-efficient and longer lasting. Studies show that DNA properly encapsulated with a salt remains stable for decades at room temperature and should last much longer in the controlled environs of a data center. DNA doesn’t require maintenance, and files stored in DNA are easily copied for negligible cost.
Even better, DNA can archive a staggering amount of information in an almost inconceivably small volume. Consider this: humanity will generate an estimated 33 zettabytes of data by 2025 (that's 33 followed by 21 zeroes). DNA storage can squeeze all that information into a ping-pong ball, with room to spare. The 74 million million bytes of information in the Library of Congress could be crammed into a DNA archive the size of a poppy seed, 6,000 times over. Split the seed in half, and you could store all of Facebook's data.
Science fiction? Hardly. DNA storage technology exists today, but to make it viable, researchers have to clear a few daunting technological hurdles around integrating different technologies. As part of a major collaboration to do that work, our team at Los Alamos National Laboratory has developed a key enabling technology for molecular storage. Our software, the Adaptive DNA Storage Codex (ADS Codex), translates data files from the binary language of zeroes and ones that computers understand into the four-letter code biology understands.
ADS Codex is a key part of the Intelligence Advanced Research Projects Activity (IARPA) Molecular Information Storage (MIST) program. MIST seeks to bring cheaper, bigger, longer-lasting storage to big-data operations in government and the private sector, with a short-term goal of writing one terabyte—a trillion bytes—and reading 10 terabytes within 24 hours at a cost of $1,000.
FROM COMPUTER CODE TO GENETIC CODE
When most people think of DNA, they think of life, not computers. But DNA is itself a four-letter code for passing along information about an organism. DNA molecules are made from four types of bases, or nucleotides, each identified by a letter: adenine (A), thymine (T), guanine (G) and cytosine (C). They are the basis of all DNA code, providing the instruction manual for building every living thing on earth.
A fairly well-understood technology, DNA synthesis has been widely used in medicine, pharmaceuticals and biofuel development, to name just a few applications. The technique organizes the bases into various arrangements indicated by specific sequences of A, C, G and T. These bases wrap in a twisted chain around each other—the familiar double helix—to form the molecule. The arrangement of these letters into sequences creates a code that tells an organism how to form.
The complete set of DNA molecules makes up the genome, the blueprint of your body. By synthesizing DNA molecules (making them from scratch), researchers have found they can specify, or write, long strings of the letters A, C, G and T and then read those sequences back. The process is analogous to how a computer stores binary information. From there, it was a short conceptual step to encoding a binary computer file into a molecule.
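To make that conceptual step concrete, here is a minimal sketch in Python. It assumes the simplest possible mapping, two bits per base (00=A, 01=C, 10=G, 11=T). That mapping and the function names `encode` and `decode` are illustrative assumptions only; they are not the scheme ADS Codex uses.

```python
# Hypothetical two-bits-per-base mapping, chosen only for illustration.
BITS_TO_BASE = {0b00: "A", 0b01: "C", 0b10: "G", 0b11: "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def encode(data: bytes) -> str:
    """Turn each byte into four bases, most significant bit pair first."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BITS_TO_BASE[(byte >> shift) & 0b11])
    return "".join(bases)


def decode(strand: str) -> bytes:
    """Inverse of encode: every four bases become one byte."""
    if len(strand) % 4 != 0:
        raise ValueError("strand length must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(strand), 4):
        byte = 0
        for base in strand[i:i + 4]:
            byte = (byte << 2) | BASE_TO_BITS[base]
        out.append(byte)
    return bytes(out)


print(encode(b"Hi"))          # CAGACGGC
print(decode("CAGACGGC"))     # b'Hi'
```

A real codec would likely do more than this, for example splitting a file across many short molecules and adding the error-handling information discussed later in this article.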
The method has been proven to work, but reading and writing the DNA-encoded files currently takes a long time. Appending a single base to DNA takes about one second. Writing an archive file at this rate could take decades, but researchers are developing faster methods, including massively parallel operations that write to many molecules at once.
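A back-of-the-envelope calculation shows why parallelism matters. The sketch below uses the one-second-per-base figure from the text. The two-bits-per-base density, the 500-megabyte file size and the million-strand parallelism are assumptions picked for illustration, not measured values.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def write_time_years(num_bytes, bits_per_base=2.0,
                     seconds_per_base=1.0, parallel_strands=1):
    """Estimate synthesis time in years.

    bits_per_base and parallel_strands are illustrative assumptions;
    seconds_per_base defaults to the roughly one second quoted above.
    """
    bases = num_bytes * 8 / bits_per_base
    seconds = bases * seconds_per_base / parallel_strands
    return seconds / SECONDS_PER_YEAR


# A hypothetical 500-megabyte archive written one base at a time:
serial = write_time_years(500_000_000)
print(f"{serial:.1f} years")  # about 63 years

# The same file spread across a million strands written at once:
parallel = write_time_years(500_000_000, parallel_strands=1_000_000)
print(f"{parallel * SECONDS_PER_YEAR / 60:.0f} minutes")  # about 33 minutes
```

Under these assumptions a single serial writer needs decades for even a modest file, while the same work divided across many molecules finishes in well under an hour.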
NOTHING LOST IN TRANSLATION
ADS Codex specifies exactly how to translate the zeroes and ones into sequences built from the four letters A, C, G and T. The Codex also handles the decoding back into binary. DNA can be synthesized by several methods, and ADS Codex can accommodate them all.
Unfortunately, compared to traditional digital systems, the error rates while writing to molecular storage with DNA synthesis are very high. These errors arise from a different source than they do in the digital world, making them trickier to correct. On a digital hard disk, binary errors occur when a zero flips to a one, or vice versa. With DNA, the problems come from insertion and deletion errors. For instance, you might be writing A-C-G-T, but sometimes you try to write A and nothing appears, so every letter after it shifts one position to the left. Other times you try to write a single A and the synthesizer produces AAA.
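The difference between the two kinds of error is easy to see in a small Python sketch. The strand and the `mismatches` helper are hypothetical, made up for this illustration: a substitution damages one position, while a single deletion throws off every position after it.

```python
def mismatches(expected: str, observed: str) -> int:
    """Count positions that differ, plus any difference in length."""
    differing = sum(a != b for a, b in zip(expected, observed))
    return differing + abs(len(expected) - len(observed))


original = "ACGTACGTACGT"

# Flip-style error: the third base is written as T instead of G.
substituted = original[:2] + "T" + original[3:]

# Deletion error: the third base never appears, so the rest shifts left.
deleted = original[:2] + original[3:]

print(mismatches(original, substituted))  # 1
print(mismatches(original, deleted))      # 10
```

A code designed to fix a few flipped symbols sees the deletion as damage to nearly the whole strand, which is why position-by-position correction handles it poorly.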
Normal error correction codes don’t work well with that kind of problem, so ADS Codex adds error detection codes that validate the data. When the software converts the data back to binary, it tests to see that the codes match. If they don’t, it removes or adds bases—letters—until the verification succeeds.
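The detect-then-repair idea can be sketched in a few lines of Python. This is not the ADS Codex algorithm, whose details are not given here. It is a simplified illustration that assumes a CRC-32 checksum as the error detection code, assumes the check value itself is read back intact, and tries only a single added or removed base.

```python
import zlib

BASES = "ACGT"


def check_code(strand: str) -> int:
    """Stand-in error detection code (CRC-32); an assumption for this sketch."""
    return zlib.crc32(strand.encode("ascii"))


def single_edit_candidates(strand: str):
    """Yield every strand reachable by removing or adding one base."""
    for i in range(len(strand)):
        yield strand[:i] + strand[i + 1:]            # remove base i
    for i in range(len(strand) + 1):
        for base in BASES:
            yield strand[:i] + base + strand[i:]     # add a base at i


def repair(read: str, expected_check: int):
    """Return a strand whose check code matches, or None if none is found."""
    if check_code(read) == expected_check:
        return read
    for candidate in single_edit_candidates(read):
        if check_code(candidate) == expected_check:
            return candidate
    return None


written = "ACGTTGCAAC"
stored_check = check_code(written)

lost_a_base = written[:3] + written[4:]        # one base failed to appear
print(repair(lost_a_base, stored_check) == written)

extra_base = written[:5] + "A" + written[5:]   # one base written twice over
print(repair(extra_base, stored_check) == written)
```

The brute-force search grows quickly with the number of errors per strand, so a production decoder would presumably need a smarter search than this one.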
We have completed version 1.0 of ADS Codex, and late this year we plan to use it to evaluate the storage and retrieval systems developed by the other MIST teams. The work fits well with Los Alamos’ history of pioneering new developments in computing as part of our national security mission. Since the 1940s, as an outcome of those computing advancements, we have amassed some of the oldest and largest stores of digital-only data. It still has tremendous value. Because we keep data forever, we’ve been at the tip of the spear for a long time when it comes to finding a cold-storage solution, but we’re not alone.
All the world’s data—all your digital photos and tweets; all the records of the global financial sector; all those satellite images of cropland, troop movements and glacial melting; all the simulations underlying so much of modern science; and so much more—have to go somewhere. The “cloud” isn’t a cloud at all. It is digital data centers in huge warehouses consuming vast amounts of electricity to store (and keep cool) trillions of millions of bytes. Costing billions of dollars to build, power and run, these data centers may struggle to remain viable as the need for data storage continues to grow exponentially.
DNA shows great promise for sating the world’s voracious appetite for data storage. The technology requires new tools and new ways of applying familiar ones. But don’t be surprised if one day the world’s most valuable archives find a new home in a poppy-seed-sized collection of molecules.
Funding for ADS Codex was provided by the Intelligence Advanced Research Projects Activity (IARPA), a research agency within the Office of the Director of National Intelligence.
This is an opinion and analysis article.