# A Digital Copy of the Universe, Encrypted

As physics prepares for ambitious projects like the Large Synoptic Survey Telescope, the field is seeking new methods of data-driven discovery

## Beyond N-Squared

To tease scientific discoveries out of the vast trove of data gathered by the LSST and other sky surveys, scientists will need to pinpoint unexpected relationships between attributes, which is extremely difficult in 500 dimensions. Finding correlations is easy with a two-dimensional data set: If two attributes are correlated, then there will be a one-dimensional curve connecting the data points on a two-dimensional plot of one attribute versus the other. But additional attributes plotted as extra dimensions obscure such curves. “Finding the unexpected in a higher-dimensional space is impossible using the human brain,” Tyson said. “We have to design future computers that can in some sense think for themselves.”
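To make the problem concrete, here is a minimal sketch with synthetic data (the attribute count and the "hidden" pair of indices are invented for illustration): a relationship that would leap out of a 2-D plot must instead be hunted down by testing every pair of columns.

```python
import numpy as np

rng = np.random.default_rng(0)

# 1,000 objects with 50 measured attributes; attributes 12 and 37
# are secretly related (hypothetical indices, for illustration),
# and the rest are uncorrelated "nuisance" dimensions.
n, d = 1000, 50
data = rng.normal(size=(n, d))
data[:, 37] = 2.0 * data[:, 12] + 0.1 * rng.normal(size=n)

# In 2-D you would simply plot one attribute against the other and
# see the curve; in d dimensions you must test every pair.
corr = np.corrcoef(data, rowvar=False)   # d x d correlation matrix
np.fill_diagonal(corr, 0.0)
i, j = np.unravel_index(np.abs(corr).argmax(), corr.shape)
print(i, j, corr[i, j])                  # the hidden pair stands out
```

Even this brute-force pairwise scan is only feasible because d is small here; with hundreds of attributes and nonlinear relationships, the search space explodes.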

Algorithms exist for “reducing the dimensionality” of data, or finding surfaces on which the data points lie (like that 1-D curve in the 2-D plot), in order to find correlated dimensions and eliminate “nuisance” ones. For example, an algorithm might identify a 3-D surface of data points coursing through a database, indicating that three attributes, such as the type, size and rotation speed of galaxies, are related. But when swamped with petabytes of data, the algorithms take practically forever to run.
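As a rough illustration of the idea (synthetic data, not any actual survey pipeline), principal component analysis, one of the simplest dimensionality-reduction methods, can reveal that a data set with ten measured attributes really lies on a three-dimensional surface:

```python
import numpy as np

rng = np.random.default_rng(1)

# 500 galaxies described by 10 attributes, but with only 3 underlying
# degrees of freedom (e.g. type, size, rotation speed); the other
# attributes are combinations of those three plus small noise.
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 10))
data = latent @ mixing + 0.01 * rng.normal(size=(500, 10))

# PCA via the SVD: the number of singular values carrying
# non-negligible variance estimates the dimensionality of the
# surface the points actually lie on.
centered = data - data.mean(axis=0)
s = np.linalg.svd(centered, compute_uv=False)
explained = (s**2) / (s**2).sum()
intrinsic_dim = int((explained > 1e-3).sum())   # noise falls far below this
print(intrinsic_dim)   # expected: 3 for this synthetic data
```

Real survey data is harder: the surfaces are curved rather than flat, which is why nonlinear methods (and much more computing) are needed.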

Identifying correlated dimensions is quadratically harder than looking for a needle in a haystack. “That’s a linear problem,” said Alex Szalay, a professor of astronomy and computer science at Johns Hopkins University. “You search through the haystack and whatever looks like a needle you throw in one bucket and you throw everything else away.” When you don’t know what correlations you’re looking for, however, you must compare each of the N pieces of hay with every other piece, which takes N-squared operations.
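Szalay's contrast can be sketched directly (toy data, for illustration): the needle search touches each item once, while the unknown-correlation search must visit every pair.

```python
# The "needle" search is one pass: test each item against a known
# template and keep the matches (O(N)). The unknown-correlation
# search has no template, so every pair must be compared (O(N^2)).

def find_needles(haystack, looks_like_needle):
    # O(N): one bucket for needles; everything else is thrown away.
    return [item for item in haystack if looks_like_needle(item)]

def all_pair_similarities(items, similarity):
    # O(N^2): without a template, compare every pair of items.
    out = []
    for a in range(len(items)):
        for b in range(a + 1, len(items)):
            out.append((a, b, similarity(items[a], items[b])))
    return out

hay = [3, 8, 8, 1, 8, 5]
print(find_needles(hay, lambda x: x == 8))           # -> [8, 8, 8]
pairs = all_pair_similarities(hay, lambda x, y: int(x == y))
print(len(pairs))                                    # N*(N-1)/2 = 15
```

For six items the difference is trivial; for the billions of objects in a sky survey, the N-squared pass is what becomes intractable.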

Adding to the challenge is the fact that the amount of data is doubling every year. “Imagine we are working with an algorithm that if my data doubles, I have to do four times as much computing and then the following year, I have to do 16 times as much computing,” Szalay said. “But by next year, my computers will only be twice as fast, and in two years from today, my computers will only be four times as fast, so I’m falling farther and farther behind in my ability to do this.”
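The arithmetic in Szalay's example, spelled out for an N-squared algorithm under his stated assumptions (data and hardware speed each doubling yearly):

```python
# Szalay's arithmetic: data doubles yearly, so an N^2 algorithm
# needs 4x the operations each year, while machines (by his
# assumption) only get 2x faster - so runtime doubles every year.

data_growth = 2        # data-size multiplier per year
hardware_growth = 2    # compute-speed multiplier per year

runtime = 1.0          # relative runtime today
for year in range(1, 4):
    work = (data_growth ** year) ** 2      # N^2 operations
    speed = hardware_growth ** year
    runtime = work / speed
    print(year, runtime)
# year 1: 2x slower, year 2: 4x slower, year 3: 8x slower.
```

The gap itself compounds, which is why only algorithms that scale closer to linearly in N can keep pace with the data.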

A huge amount of research has gone into developing scalable algorithms, with techniques such as compressed sensing, topological analysis and the maximal information coefficient emerging as especially promising tools of big data science. But more work remains to be done before astronomers, cosmologists and physicists will be ready to fully exploit the multi-petabyte digital movie of the universe that premieres next decade. Progress is hampered by the fact that researchers in the physical sciences get scant academic credit for developing algorithms — a problem that the community widely recognizes but has yet to solve.
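Of those techniques, compressed sensing is the easiest to show in miniature. A toy sketch (synthetic data, not an astronomy pipeline): a sparse signal is recovered from far fewer random measurements than it has dimensions, by solving an L1-minimization linear program.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(2)

# A sparse "signal": 40 attributes, only 3 of them nonzero.
n, m, k = 40, 20, 3
x_true = np.zeros(n)
x_true[rng.choice(n, size=k, replace=False)] = rng.normal(size=k)

# Take only 20 random linear measurements instead of all 40 values.
A = rng.normal(size=(m, n))
b = A @ x_true

# Basis pursuit: minimize ||x||_1 subject to A x = b, written as a
# linear program over the positive and negative parts of x.
c = np.ones(2 * n)
A_eq = np.hstack([A, -A])
res = linprog(c, A_eq=A_eq, b_eq=b, bounds=[(0, None)] * (2 * n),
              method="highs")
x_hat = res.x[:n] - res.x[n:]
print(np.max(np.abs(x_hat - x_true)))   # reconstruction error
```

The sparsity assumption is what does the work: because most entries are zero, far fewer measurements suffice, which is the appeal for instruments that cannot store or transmit everything they see.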

“It’s always been the case that the people who build the instrumentation don’t get as much credit as the people who use the instruments to do the cutting-edge science,” Connolly said. “Ten years ago, it was people who built physical instruments — the cameras that observe the sky — and today, it’s the people who build the computational instruments who don’t get enough credit. There has to be a career path for someone who wants to work on the software — because they can go get jobs at Google. So if we lose these people, it’s the science that loses.”

## Comments

1. Wayne Williamson, 03:31 PM 10/4/13

Cool endeavor. It seems like the first thing that should be done is lossless compression, i.e., for each original shot, only record what has changed in the next image of that area. I also think that the original image could probably be greatly compressed, as a good amount should just be "black".
I'm unfamiliar with how CERN captures and processes its data, but it seems to me this would be very different...

2. lwaynebuinis, 03:54 PM 10/7/13

Fractal compression showed some promise years ago. Don't know the current status!
http://en.wikipedia.org/wiki/Fractal_compression

3. Vortigon, in reply to Wayne Williamson, 11:09 AM 11/14/13

You cannot compress this type of data, since you don't know how you will eventually use it in the future. By compressing the data you destroy parts of it forever, which defeats the purpose.

Nothing in space is 'black'; even the darkest areas hold huge amounts of information vital to science.
