Although the Hadoop open-source software was created two years ago by the Apache Software Foundation in Forest Hill, Md., a nonprofit corporation that specializes in writing and managing open-source programs, Yahoo! Research has been the primary contributor of new code to Hadoop. Generally, open-source software like Hadoop is created by a programmer or group of programmers, such as Apache, and then released on the Internet for anyone to use and/or improve on.
Hadoop is at the core of Yahoo!'s grid-computing infrastructure that the company uses internally, says Jay Kistler, Yahoo!'s vice president of engineering for systems, tools and services. "With the right infrastructure, you can apply thousands of processors in parallel on a job," he says.
Hadoop is an open-source version of the MapReduce software that Google created to help its developers write programs for processing and generating large data sets, Bryant says, noting that "MapReduce is the right programming framework for these data analysis tasks." MapReduce and Hadoop automatically take care of the details of partitioning and processing data across a cluster of computers.
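To make the programming model concrete, here is a minimal single-process sketch of the word-count job commonly used to introduce MapReduce. It is an illustration only, not Hadoop's actual Java API: in a real Hadoop job, the framework itself would distribute the map tasks, the shuffle (grouping by key), and the reduce tasks across the machines of a cluster.

```python
from collections import defaultdict

def map_phase(documents):
    # Mapper: for each input document, emit a (word, 1) pair per word.
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    # Shuffle: group all emitted values by key (the word).
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    # Reducer: sum the counts for each word.
    return {key: sum(values) for key, values in grouped.items()}

docs = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(map_phase(docs))
# counts now maps each word to its total, e.g. "the" -> 3, "fox" -> 2
```

The appeal of the model is that the programmer writes only the two small functions above; the framework handles partitioning the input, scheduling the work across thousands of processors, and recovering from machine failures.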
Carnegie Mellon will help Yahoo! iron out any kinks in the system, which is expected to take a few more months. "It's hard to say when M45 will be thrown open to universities," Brachman says. "We want to make sure it works well and will support in a secure way the different organizations who will be using the system."