Even backed by this tradition, the LSST tests the limits of scientists’ data-handling abilities. It will be capable of tracking the effects of dark energy, which is thought to make up a whopping 68 percent of the total contents of the universe, and mapping the distribution of dark matter, an invisible substance that accounts for an additional 27 percent. And the telescope will cast such a wide and deep net that scientists say it is bound to snag unforeseen objects and phenomena too. But many of the tools for disentangling them from the rest of the data don’t yet exist.
Particle physics is the elder statesman of big data science. For decades, high-energy accelerators have been bashing particles together millions of times per second in hopes of generating exotic, never-before-seen particles. These facilities, such as the Large Hadron Collider (LHC) at CERN laboratory in Switzerland, generate so much data that only a tiny fraction (deemed interesting by an automatic selection process) can be kept. A network of hundreds of thousands of computers spread across 36 countries called the Worldwide LHC Computing Grid stores and processes the 25 petabytes of LHC data that were archived in a year’s worth of collisions. The work of thousands of physicists went into finding the bump in that data that last summer was deemed representative of a new subatomic particle, the Higgs boson.
CERN, the organization that operates the LHC, is sharing its wisdom by working with other research organizations “so they can benefit from the knowledge and experience that has been gathered in data acquisition, processing and storage,” said Bob Jones, head of CERN openlab, which develops new IT technologies and techniques for the LHC. Scientists at the European Space Agency, the European Molecular Biology Laboratory, other physics facilities and even collaborations in the social sciences and humanities have taken cues from the LHC on data handling, Jones said.
When the LHC turns back on in 2014 or 2015 after an upgrade, higher energies will mean more interesting collisions, and the amount of data collected will grow by a significant factor. But even though the LHC will continue to possess the biggest data set in physics, its data is much simpler than those obtained from astronomical surveys such as the Sloan Digital Sky Survey and Dark Energy Survey and — to an even greater extent — those that will be obtained from future sky surveys such as the Square Kilometer Array, a radio telescope project set to begin construction in 2016, and the LSST.
“The LHC generates a lot more data right at the beginning, but they’re only looking for certain events in that data and there’s no correlation between events in that data,” said Jeff Kantor, the LSST data management project manager. “Over time, they still build up large sets, but each one can be individually analyzed.”
In combining repeat exposures of the same cosmological objects and logging hundreds rather than a handful of attributes of each one, the LSST will have a whole new set of problems to solve. “It’s the complexity of the LSST data that’s a challenge,” Tyson said. “You’re swimming around in this 500-dimensional space.”
From color to shape, roughly 500 attributes will be recorded for every one of the 20 billion objects surveyed, and each attribute is treated as a separate dimension in the database. Merely cataloguing these attributes consistently from one exposure of a patch of the sky to the next poses a huge challenge. “In one exposure, the scene might be clear enough that you could resolve two different galaxies in the same spot, but in another one, they might be blurred together,” Kantor said. “You have to figure out if it’s one galaxy or two or N.”