A given in the world of information technology is that the amount of data is only going to grow over time. But how can academics and computer scientists make sense of the mountains of information—whether astronomic calculations from a distant satellite or a study of Internet traffic—if they do not have access to a computer capable of handling such large loads?
Yahoo!, Inc., this week offered its vast computing resources to assist with academic pursuits that require a massively parallel computing environment. Parallel computing involves breaking down extremely large sets of data and distributing the pieces to different interconnected computers for simultaneous processing and analysis. Yahoo! is offering the service via a cluster of 4,000 computer processors, which it refers to as M45. The cluster runs Hadoop, an open-source distributed file system and parallel execution environment that lets its users process massive amounts of data.
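The split-process-combine idea behind parallel computing can be illustrated with a minimal Python sketch. It is only a single-machine analogy: threads stand in for the interconnected computers, and the function names (`split_into_chunks`, `chunk_total`, `parallel_total`) are hypothetical, not part of M45 or Hadoop.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(data, num_chunks):
    """Divide data into at most num_chunks contiguous pieces of similar size."""
    size = max(1, -(-len(data) // num_chunks))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def chunk_total(chunk):
    """Work done independently on one piece of the data."""
    return sum(chunk)


def parallel_total(data, num_workers=4):
    """Split the data, process the pieces concurrently, combine the results."""
    chunks = split_into_chunks(data, num_workers)
    # Assumption: worker threads stand in for separate machines in a cluster.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_totals = list(pool.map(chunk_total, chunks))
    return sum(partial_totals)


if __name__ == "__main__":
    print(parallel_total(list(range(1, 101))))  # 5050
```

On a real cluster the pieces would live on different machines, and the hard parts are moving the data and coping with machines that fail, which this sketch ignores.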
There is a demand for the ability to extract meaningful information from tremendous amounts of data gathered by computer systems across a number of different disciplines, says Randal Bryant, dean of Carnegie Mellon University's (C.M.U.) School of Computer Science in Pittsburgh.
C.M.U. this month became the first academic institution to sign up for time on Yahoo!'s M45 supercomputer cluster. Initially, about 20 of the school's researchers will use M45 to study ways to improve information retrieval, large-scale graph and computer graphics, natural-language processing and machine translation on widely distributed systems. Yahoo! also plans to make M45 available to researchers from other universities and institutions.
There are plenty of supercomputers available on college campuses—many of them at the Pittsburgh Supercomputing Center, a facility shared by C.M.U., the University of Pittsburgh and Westinghouse Electric—that can crunch numbers at blinding speeds. But these systems are not necessarily good at extracting patterns or analyzing data, Bryant says. The distributed systems such as M45 that can do this, however, are in short supply. "We have facilities here for data analysis that are 5 percent the size of what we're talking about at Yahoo!," he says, adding that C.M.U. faculty members studying natural-language translation (during which computers automatically translate audio from one spoken language to another) are "desperate for something like this."
M45 has about three terabytes (trillion bytes) of memory, 1.5 petabytes (quadrillion bytes) of disk space and a peak performance of more than 27 trillion calculations per second (27 teraflops), placing it among the world's 50 fastest supercomputers. In addition to tapping M45 to process and analyze data sets, computer scientists will use its considerable resources to improve the cluster itself. Several areas of distributed computing could be improved, among them the ability to schedule different workloads on the same network, monitor the cluster's performance, recover quickly if a node within the cluster fails, and balance the high input/output (I/O) demands across the entire cluster.
The project to make M45 available to academic institutions means that researchers will be able to work on projects at "Internet scale," says Ron Brachman, vice president of worldwide research operations for Yahoo! Research. "Our sense is that academics don't have this type of environment that can replicate this scale as well as Yahoo! and others in the industry can. This kind of computing environment potentially radically changes the types of applications you're able to experiment with."
Although the open-source Hadoop software was created two years ago by the Apache Software Foundation in Forest Hill, Md. (a nonprofit corporation that specializes in writing and managing open-source programs), Yahoo! Research has been the primary contributor of new code to Hadoop. Generally, open-source software like Hadoop is created by a programmer or group of programmers, such as Apache, and then released on the Internet for anyone to use or improve on.
Hadoop is at the core of the grid-computing infrastructure that Yahoo! uses internally, says Jay Kistler, Yahoo!'s vice president of engineering for systems, tools and services. "With the right infrastructure, you can apply thousands of processors in parallel on a job," he says.
Hadoop is an open-source version of the MapReduce software that Google created to help its developers write programs for processing and generating large data sets, Bryant says, noting that, "MapReduce is the right programming framework for these data analysis tasks." MapReduce and Hadoop automatically take care of the details of partitioning and processing data across a cluster of computers.
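To give a rough sense of the MapReduce programming model, the following Python sketch counts words across a handful of documents in three steps: map, group by key, reduce. It is a toy that runs in a single process and does not use Hadoop's or Google's actual interfaces; the function names are hypothetical, and the grouping step here is done by hand, whereas a real framework would handle it (along with partitioning the data across machines) automatically.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key; a real framework does this between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values emitted for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence over a list of documents."""
    # On a cluster, each document (or block of data) could be mapped on a
    # different machine; here the calls simply run one after another.
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data big clusters", "data analysis"]))
```

The programmer writes only the map and reduce functions. Because each map call looks at one document and each reduce call looks at one key, the work can in principle be spread over many machines without the functions themselves changing.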
Carnegie Mellon will help Yahoo! iron out any kinks in the system, which is expected to take a few more months. "It's hard to say when M45 will be thrown open to universities," Brachman says. "We want to make sure it works well and will support in a secure way the different organizations who will be using the system."