Although the Hadoop open-source software was created two years ago by the Apache Software Foundation in Forest Hill, Md., a nonprofit corporation that specializes in writing and managing open-source programs, Yahoo! Research has been the primary contributor of new code to Hadoop. Generally, open-source software like Hadoop is created by a programmer or group of programmers, such as Apache, and then released on the Internet for anyone to use and/or improve on.
Hadoop is at the core of Yahoo!'s grid-computing infrastructure that the company uses internally, says Jay Kistler, Yahoo!'s vice president of engineering for systems, tools and services. "With the right infrastructure, you can apply thousands of processors in parallel on a job," he says.
Hadoop is an open-source version of the MapReduce software that Google created to help its developers write programs for processing and generating large data sets, Bryant says, noting that "MapReduce is the right programming framework for these data analysis tasks." MapReduce and Hadoop automatically take care of the details of partitioning and processing data across a cluster of computers.
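To make the programming model concrete, here is a minimal single-process sketch of the word-count job commonly used to introduce MapReduce. It is an illustration only, not Hadoop's actual Java API: in a real Hadoop job, the framework itself would distribute the map tasks, the shuffle (grouping by key), and the reduce tasks across the machines of a cluster.

```python
from collections import defaultdict

def map_phase(documents):
    # Mapper: for each input document, emit a (word, 1) pair per word.
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    # Shuffle: group all emitted values by key (the word).
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    # Reducer: sum the counts for each word.
    return {key: sum(values) for key, values in grouped.items()}

docs = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(map_phase(docs))
# counts now maps each word to its total, e.g. "the" -> 3, "fox" -> 2
```

The appeal of the model is that the programmer writes only the two small functions above; the framework handles partitioning the input, scheduling the work across thousands of processors, and recovering from machine failures.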
Carnegie Mellon will help Yahoo! iron out any kinks in the system, which is expected to take a few more months. "It's hard to say when M45 will be thrown open to universities," Brachman says. "We want to make sure it works well and will support in a secure way the different organizations who will be using the system."