SKY'S THE LIMIT: Biotech and physical sciences are two fields beginning to gravitate skyward to take advantage of the cloud's outsourced information technology resources. Image: COURTESY OF ALEXANDER RATHS, VIA ISTOCKPHOTO.COM
Time-shared access to supercomputers or computing clusters cloistered in laboratory data rooms and university basements has helped scientists for decades with problems requiring massive amounts of number-crunching muscle. This is now changing as scientists come to rely on software and storage delivered via the Web, aka "cloud computing," as a resource for organizing and analyzing research data. Biotech and physical sciences are two fields in particular that are gravitating skyward, at least piecemeal.
The National Science Foundation (NSF) and Microsoft in April awarded about $4.5 million in funding to 13 research projects planning to use or study cloud services. As part of the funding, researchers involved in these projects will have free access for two years to cloud computing resources hosted by Microsoft and designed to deliver on-demand processing power and storage.
The winners include a project at the J. Craig Venter Institute to computationally model protein–protein interactions; University of North Carolina at Charlotte research into gene regulatory systems in single-celled organisms; and a joint effort by the University of South Carolina Research Foundation and the University of Virginia in Charlottesville to study the management of large watershed systems.
These are not the first research projects to make use of the cloud. The European Space Agency (ESA) already uses Amazon Web Services to help deliver data about the current state of the planet to scientists, governmental agencies and other organizations worldwide. This data is used for monitoring the environment, improving the accuracy of weather reporting and assisting disaster relief agencies. ESA uses Amazon's Simple Storage Service (S3), for example, to house and retrieve information, including satellite images. During peak usage, Amazon helps ESA provide images and other information to more than 50,000 users around the world, a load that can reach 30 terabytes of information at a time, according to Amazon.
Complete Genomics, a Mountain View, Calif.–based biotech that provides academic and biopharmaceutical researchers with human genomic data and analysis, likewise uses Amazon's cloud services. "Genome sequencing today is a very computationally intensive process," says Bruce Martin, the company's senior vice president of product development. As a result, the biotech firm uses a large amount of storage and computing power, some of which is in-house and some of which is housed in Amazon data centers.
Complete Genomics customers—often research scientists using genomic data to study the pathology of diseases—ship biological samples to the company. Once Complete Genomics has created the data sets its customers require, the company has Amazon deliver the results. "When we are done computing and analyzing a genome, we push that information out to Amazon's Simple Storage Service, which serves as a scalable storage location," Martin says. "Amazon copies the data onto hard drives and ships them to our customers. That's still a really cost-effective way to get data around the world."
Amazon offers Complete Genomics a practical alternative to operating an entire information technology infrastructure of its own, but the company has retained key components of its business in-house. There are certain pieces of critical infrastructure, such as the DNA sequencers, that need to be operated in-house, which led to the hybrid approach to managing information, Martin says. "We move petabytes of data per month," he adds. (A petabyte is one quadrillion bytes.) "Tens of gigabits of data per second run over our networks. Cloud services do not provide that level of throughput now, but as network technology advances the cloud could mature to meet those needs."
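To put Martin's figures in perspective, a rough calculation shows why sustained tens-of-gigabit throughput matters once petabytes are involved. The link speeds below are illustrative assumptions, not figures quoted by the company:

```python
# Back-of-envelope: time to move one petabyte at various sustained link speeds.
# The speeds chosen here are illustrative assumptions only.
PETABYTE_BITS = 8 * 10**15  # 1 petabyte = 10^15 bytes = 8 * 10^15 bits

def transfer_days(bits, gbps):
    """Days needed to move `bits` over a sustained link of `gbps` gigabits per second."""
    seconds = bits / (gbps * 10**9)
    return seconds / 86_400  # 86,400 seconds in a day

for gbps in (1, 10, 40):
    print(f"{gbps:>2} Gbps: {transfer_days(PETABYTE_BITS, gbps):.1f} days per petabyte")
```

Even at a sustained 10 gigabits per second, a single petabyte ties up the link for more than nine days, which is why shipping hard drives remains, in Martin's words, cost-effective.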
Cloud computing is not the answer in all instances, particularly in biotech, agrees Giles Day, managing director of cloud computing at Distributed Bio, a San Francisco–based informatics consultancy for pharmaceutical and biotech companies. "Let's say you're producing terabytes of data that takes a relatively short amount of time to compute," he says. "In that case, you're going to spend an awful lot of money and time shifting data into the cloud to gain a very small reward on the actual compute time."
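Day's point can be made concrete with a small calculation. The data size, uplink speed and compute time below are hypothetical, chosen only to illustrate the imbalance he describes:

```python
# Hypothetical job: 5 terabytes of raw data, a 1 gigabit-per-second uplink,
# and a cloud job that finishes in 30 minutes once the data arrives.
# All three figures are assumptions for illustration.
data_bits = 5 * 8 * 10**12   # 5 TB expressed in bits
uplink_gbps = 1              # sustained upload speed (assumed)
compute_hours = 0.5          # cloud processing time (assumed)

upload_hours = data_bits / (uplink_gbps * 10**9) / 3600
print(f"upload: {upload_hours:.1f} h, compute: {compute_hours:.1f} h")
# The upload dwarfs the computation, so renting the cluster saves little
# wall-clock time relative to the cost of moving the data.
```

Flip the ratio—a small input fanned out across a massively parallel cluster—and the economics reverse, which is the "perfect scenario" Day describes next.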
Generally speaking, Distributed Bio recommends a hybrid scenario similar to the one Complete Genomics uses, where some resources are housed in a service provider's data center whereas others are retained on the customers' own computers and servers. "The perfect scenario for using the cloud in biotech is to outsource small amounts of data into the cloud that require a massively parallel computing system for processing and then have the results of that processing returned to you," Day says. Moving large amounts of data to the cloud is difficult because it causes bandwidth bottlenecks. "You still can't break the law[s] of physics," he adds.