File-Swapping Veers into the Fast Lane

A new method for comparing files promises speedier downloads of music and movies

A new file-swapping method could speed up downloads to rates as much as three times faster than the popular service BitTorrent. The approach, outlined and demonstrated last month by computer scientists at Carnegie Mellon University, Purdue University and Intel Research, would let file-swappers seeking a specific title download bits of it from similar, but not necessarily identical, files. It works a little like an enterprising mechanic who uses spare parts from a Toyota to fix an old Ford. The idea is already drawing interest from commercial content distribution companies, along with discussion in less formal peer-to-peer communities online.

"It makes an awful lot of sense," says Andrew Parker, chief technical officer of CacheLogic, which legally distributes movie and game files online. The company has been independently researching a "very similar" concept, he adds.

With high-definition online video just around the corner, proposals for speeding downloads and easing network traffic are increasingly welcome. File-swapping networks, rife with video, games and music, can provide a real-world laboratory with lessons for the broader Net.

In their quest for speed, most modern peer-to-peer systems break files—say, a copy of The Departed—into thousands of chunks and allow these individual components to be swapped separately. This allows someone with only half a movie downloaded to serve as a secondary source for that part of the content, for example.
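The chunking idea can be sketched in a few lines of Python. This is only an illustration, not the code of any real peer-to-peer client: the fixed chunk size, the use of SHA-256 as a chunk identifier, and the function names `split_into_chunks` and `chunk_ids` are all assumptions made for the example.

```python
import hashlib

# Assumed chunk size for illustration; real clients choose their own sizes.
CHUNK_SIZE = 16 * 1024


def split_into_chunks(data: bytes, chunk_size: int = CHUNK_SIZE) -> list[bytes]:
    """Cut a file's bytes into consecutive pieces of at most chunk_size bytes."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def chunk_ids(data: bytes, chunk_size: int = CHUNK_SIZE) -> list[str]:
    """Give each chunk an identifier derived from its contents (here, SHA-256)."""
    return [hashlib.sha256(c).hexdigest() for c in split_into_chunks(data, chunk_size)]


# A peer holding only the first half of the chunks can still serve those
# chunks to others, because each one is identified and verified on its own.
movie = bytes(range(256)) * 400          # stand-in for a real file
ids = chunk_ids(movie)
half_downloaded = set(ids[: len(ids) // 2])
```

Because every chunk carries its own identifier, a downloader can check each piece independently of where it came from.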

Many files can still take days to download, however, as original sources go offline, or as a source's upstream bandwidth clogs.

Aiming to fix this problem, Carnegie Mellon's David Andersen and his colleagues reasoned that many files online today are, in fact, near-duplicates with minor differences—identical songs labeled differently, movies in different languages or different versions of the same software programs, for example.

To test this, they downloaded all the versions they could find of 26 songs and 26 movies, resulting in more than 6,000 media files. Different versions of the same song wound up sharing about 99 percent of the same content, they found, while different versions of the same movies offered an average 15 percent overlap.

To make this shared content accessible, the team created a "handprinting" system, which assigns each file a unique digital identifier based on its exact contents. Unlike more traditional digital "fingerprinting," commonly used to identify or authenticate documents, this system also allows fast comparison of a limited number of individual chunks, which can then be swapped if found to be identical.

Each handprint can be thought of as a string of digits, with different parts corresponding to different chunks of data. Thus, if The Departed's handprint were "12 14 16 18 24," and its Spanish-language version, Los Infiltrados, produced "13 15 17 18 24," the second file could be used as a source of some content. Scenes without dialogue, for example, might be identical in both language versions.
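The comparison in that example can be worked through in Python. This sketch follows only the simplified picture given here, in which a handprint is a list of per-chunk values; it is not the SET prototype's code, and the function name `shared_chunks` is hypothetical.

```python
def shared_chunks(wanted: list[str], candidate: list[str]) -> dict[int, int]:
    """Map each position in `wanted` to a position in `candidate`
    that holds a chunk with the same identifier."""
    # Remember the first position at which each identifier appears.
    where: dict[str, int] = {}
    for j, chunk_id in enumerate(candidate):
        where.setdefault(chunk_id, j)
    return {i: where[cid] for i, cid in enumerate(wanted) if cid in where}


# Handprints taken from the example in the text.
departed = "12 14 16 18 24".split()
infiltrados = "13 15 17 18 24".split()

matches = shared_chunks(departed, infiltrados)
overlap = len(matches) / len(departed)

print(matches)   # {3: 3, 4: 4}
print(overlap)   # 0.4
```

In this toy case the last two chunks match, so someone downloading The Departed could fetch those two pieces from a peer who holds only Los Infiltrados.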

Tests of the team's prototype, dubbed Similarity-Enhanced Transfer (SET), found it to be as much as three times faster than BitTorrent for songs and about 30 percent faster for movie files when drawing content from similar as well as identical files over DSL-speed connections. If many identical copies were already available, however, the advantage disappeared, making it useful for perhaps "half the content out there," Andersen says.

The concept may be difficult to add to existing file-swapping networks, because its file-splitting methods would likely make SET-enabled updates incompatible with earlier versions of today's swapping software. Nevertheless, the idea is being widely discussed on peer-to-peer forums and mailing lists. Parker says SET or something like it is "certain" to end up in CacheLogic's toolbox before long.

Andersen says he is not interested in commercializing it himself. He and his colleagues have released detailed technical specifications and prototype code, and are encouraging other developers to draw on the technique.

"I hope other people will take and freely use it," Andersen says. "I really want to see this out there and in use."
