SUPER, COMPUTER: Yahoo!, Inc. is offering academia the opportunity to use a portion of the company's massively parallel M45 computing environment, which consists of 4,000 computer processors running the open-source Hadoop distributed file system and parallel execution environment. Image: iStockphoto, Copyright: Duncan Walker
A given in the world of information technology is that the amount of data is only going to grow over time. But how can academics and computer scientists make sense of the mountains of information—whether astronomic calculations from a distant satellite or a study of Internet traffic—if they do not have access to a computer capable of handling such large loads?
Yahoo!, Inc., this week offered its vast computing resources to assist with academic pursuits that require a massively parallel computing environment. Parallel computing involves breaking down extremely large sets of data and distributing the pieces to different interconnected computers for simultaneous processing and analysis. Yahoo! is offering the service via a cluster of 4,000 computer processors, which it refers to as M45, running Hadoop, an open-source distributed file system and parallel execution environment that lets its users process massive amounts of data.
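To make the idea of splitting data and processing the pieces simultaneously concrete, here is a minimal sketch in Python. It is not Hadoop code and not how M45 is programmed: the function names (`split_into_chunks`, `count_words`, `parallel_word_count`) and the sample data are hypothetical, and a pool of threads on one machine stands in for the interconnected computers of a real cluster.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(records, num_chunks):
    """Divide the records into roughly equal slices, one per worker."""
    size = max(1, -(-len(records) // num_chunks))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def count_words(chunk):
    """Each worker counts the words in its own slice only."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def parallel_word_count(records, num_workers=4):
    """Hand the chunks to the workers, then merge their partial results."""
    chunks = split_into_chunks(records, num_workers)
    # Assumption: threads stand in for the separate machines of a cluster.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_counts = list(pool.map(count_words, chunks))
    total = Counter()
    for partial in partial_counts:
        total.update(partial)  # combine the per-chunk tallies
    return total


# Hypothetical sample data, for illustration only.
sample = [
    "yahoo offers cluster",
    "cluster runs hadoop",
    "hadoop processes data",
    "data data data",
]
result = parallel_word_count(sample, num_workers=2)
```

The design point is that no worker needs to see the whole data set: each one produces a small partial tally, and only those tallies are combined at the end.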
There is a demand for the ability to extract meaningful information from tremendous amounts of data gathered by computer systems across a number of different disciplines, says Randal Bryant, dean of Carnegie Mellon University's (C.M.U.) School of Computer Science in Pittsburgh.
C.M.U. this month became the first academic institution to sign up for time on Yahoo!'s M45 supercomputer cluster. Initially, about 20 of the school's researchers will use M45 to study ways to improve information retrieval, large-scale graph and computer graphics, natural-language processing and machine translation on widely distributed systems. Yahoo! also plans to make M45 available to researchers from other universities and institutions.
There are plenty of supercomputers available on college campuses—many of them at the Pittsburgh Supercomputing Center, a facility shared by C.M.U., the University of Pittsburgh and Westinghouse Electric—that can crunch numbers at blinding speeds. But these systems are not necessarily good at extracting patterns or analyzing data, Bryant says. The distributed systems such as M45 that can do this, however, are in short supply. "We have facilities here for data analysis that are 5 percent the size of what we're talking about at Yahoo!," he says, adding that C.M.U. faculty members studying natural-language translation (during which computers automatically translate audio from one spoken language to another) are "desperate for something like this."
M45 has about three terabytes (trillion bytes) of memory, 1.5 petabytes (quadrillion bytes) of disk space and a peak performance of more than 27 trillion calculations per second (27 teraflops), placing it among the world's 50 fastest supercomputers. In addition to tapping M45 to process and analyze data sets, computer scientists will also use its considerable resources to improve the cluster itself. There are a number of areas of distributed computing that could be improved, among them the ability to schedule different workloads on the same network, monitor the cluster's performance, recover quickly if a node within the cluster fails, and balance the high input/output (I/O) demands across the entire cluster.
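For a rough sense of scale, the cluster-wide figures can be divided by the 4,000 processors to get per-processor averages. The short Python calculation below is only back-of-the-envelope arithmetic on the numbers quoted in this article. It assumes decimal units (one terabyte as 10^12 bytes) and an even spread of resources across processors, neither of which is confirmed here, and the variable names are illustrative.

```python
# Figures as quoted in the article; treated as round numbers.
PROCESSORS = 4_000
MEMORY_BYTES = 3e12      # about three terabytes
DISK_BYTES = 1.5e15      # 1.5 petabytes
PEAK_FLOPS = 27e12       # 27 teraflops (stated as "more than")

# Assumption: resources are spread evenly across all processors.
memory_per_processor_gb = MEMORY_BYTES / PROCESSORS / 1e9
disk_per_processor_gb = DISK_BYTES / PROCESSORS / 1e9
gflops_per_processor = PEAK_FLOPS / PROCESSORS / 1e9

print(f"Memory per processor: {memory_per_processor_gb:.2f} GB")
print(f"Disk per processor:   {disk_per_processor_gb:.0f} GB")
print(f"Peak per processor:   {gflops_per_processor:.2f} gigaflops")
```

Under those assumptions, each processor averages about 0.75 gigabyte of memory, 375 gigabytes of disk and 6.75 gigaflops of peak performance.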
The project to make M45 available to academic institutions means that researchers will be able to work on projects at "Internet scale," says Ron Brachman, vice president of worldwide research operations for Yahoo! Research. "Our sense is that academics don't have this type of environment that can replicate this scale as well as Yahoo! and others in the industry can. This kind of computing environment potentially radically changes the types of applications you're able to experiment with."