To tease scientific discoveries out of the vast trove of data gathered by the LSST and other sky surveys, scientists will need to pinpoint unexpected relationships between attributes, which is extremely difficult in 500 dimensions. Finding correlations is easy with a two-dimensional data set: If two attributes are correlated, then there will be a one-dimensional curve connecting the data points on a two-dimensional plot of one attribute versus the other. But additional attributes plotted as extra dimensions obscure such curves. “Finding the unexpected in a higher-dimensional space is impossible using the human brain,” Tyson said. “We have to design future computers that can in some sense think for themselves.”
Algorithms exist for “reducing the dimensionality” of data, or finding surfaces on which the data points lie (like that 1-D curve in the 2-D plot), in order to find correlated dimensions and eliminate “nuisance” ones. For example, an algorithm might identify a 3-D surface of data points coursing through a database, indicating that three attributes, such as the type, size and rotation speed of galaxies, are related. But when swamped with petabytes of data, the algorithms take practically forever to run.
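A minimal sketch of the idea, using principal component analysis, one standard dimensionality-reduction technique (the article does not say which methods LSST scientists would use, and the catalog here is synthetic): five attributes are recorded per "galaxy," but three of them secretly track a single hidden variable, and the spectrum of singular values exposes that.

```python
import numpy as np

# Synthetic catalog: 1,000 "galaxies", 5 attributes each.
# Attributes 0-2 (say type, size, rotation speed) all track one
# hidden variable; attributes 3-4 are independent "nuisance" noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=1000)
data = np.column_stack([
    latent + 0.1 * rng.normal(size=1000),
    2.0 * latent + 0.1 * rng.normal(size=1000),
    -latent + 0.1 * rng.normal(size=1000),
    rng.normal(size=1000),
    rng.normal(size=1000),
])

# PCA via SVD: the singular-value spectrum reveals how many
# dimensions the data really occupies.
centered = data - data.mean(axis=0)
singular_values = np.linalg.svd(centered, compute_uv=False)
variance_fraction = singular_values**2 / np.sum(singular_values**2)
print(variance_fraction.round(3))
# One component dominates: the three correlated attributes collapse
# onto roughly a single direction, and the rest is nuisance.
```

With five attributes this is trivial; the article's point is that the same computation becomes punishing when the table has 500 columns and petabytes of rows.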
Identifying correlated dimensions is a fundamentally harder problem than looking for a needle in a haystack. “That’s a linear problem,” said Alex Szalay, a professor of astronomy and computer science at Johns Hopkins University. “You search through the haystack and whatever looks like a needle you throw in one bucket and you throw everything else away.” When you don’t know what correlations you’re looking for, however, you must compare each of the N pieces of hay with every other piece, which takes roughly N-squared operations.
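The gap Szalay describes can be made concrete with a toy operation count (the numbers are illustrative, not from the article):

```python
# Needle-in-a-haystack search: one pass over the data, N operations.
def linear_ops(n):
    return n

# Unknown correlations: every item compared against every other,
# N * (N - 1) / 2 pairs, i.e. roughly N-squared operations.
def pairwise_ops(n):
    return n * (n - 1) // 2

for n in (1_000, 1_000_000, 1_000_000_000):
    print(f"N = {n:>13,}: linear {linear_ops(n):,}  pairwise {pairwise_ops(n):,}")
```

At a billion items, the linear search costs a billion operations while the pairwise comparison costs about 5 × 10^17, which is why the naive approach stops being an option long before petabyte scale.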
Adding to the challenge is the fact that the amount of data is doubling every year. “Imagine we are working with an algorithm that if my data doubles, I have to do four times as much computing and then the following year, I have to do 16 times as much computing,” Szalay said. “But by next year, my computers will only be twice as fast, and in two years from today, my computers will only be four times as fast, so I’m falling farther and farther behind in my ability to do this.”
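Szalay's arithmetic checks out for any N-squared algorithm. A few lines make the compounding explicit, under his stated assumptions that data volume doubles each year while computers only double in speed:

```python
# Szalay's scenario: data N doubles every year, so an N^2 algorithm
# needs 4x the work each year, while hardware only gets 2x faster.
data = 1.0   # relative data volume
speed = 1.0  # relative computer speed
for year in range(1, 4):
    data *= 2
    speed *= 2
    work = data**2            # operations needed by an N^2 algorithm
    runtime = work / speed    # wall-clock time relative to year zero
    print(f"year {year}: work x{work:.0f}, speed x{speed:.0f}, runtime x{runtime:.0f}")
# Runtime doubles every year despite the faster hardware:
# year 1 -> x2, year 2 -> x4, year 3 -> x8.
```

The work figures (4x, then 16x) match Szalay's quote exactly; the runtime column is what he means by "falling farther and farther behind."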
A huge amount of research has gone into developing scalable algorithms, with techniques such as compressed sensing, topological analysis and the maximal information coefficient emerging as especially promising tools of big data science. But more work remains to be done before astronomers, cosmologists and physicists will be ready to fully exploit the multi-petabyte digital movie of the universe that premieres next decade. Progress is hampered by the fact that researchers in the physical sciences get scant academic credit for developing algorithms — a problem that the community widely recognizes but has yet to solve.
“It’s always been the case that the people who build the instrumentation don’t get as much credit as the people who use the instruments to do the cutting-edge science,” Connolly said. “Ten years ago, it was people who built physical instruments — the cameras that observe the sky — and today, it’s the people who build the computational instruments who don’t get enough credit. There has to be a career path for someone who wants to work on the software — because they can go get jobs at Google. So if we lose these people, it’s the science that loses.”