"It makes an awful lot of sense," says Andrew Parker, chief technical officer of CacheLogic, which legally distributes movie and game files online. The company has been independently researching a "very similar" concept, he adds.
With high-definition online video just around the corner, proposals for speeding downloads and easing network traffic are increasingly welcome. File-swapping networks, rife with video, games and music, can provide a real-world laboratory with lessons for the broader Net.
In their quest for speed, most modern peer-to-peer systems break files—say, a copy of The Departed—into thousands of chunks and allow these individual components to be swapped separately. This allows someone with only half a movie downloaded to serve as a secondary source for that part of the content, for example.
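The chunking idea described above can be sketched in a few lines of Python. This is a minimal illustration, not the code of any particular peer-to-peer client: the fixed chunk size, the use of SHA-256 as a chunk identifier, and the function names `split_into_chunks`, `chunk_ids` and `servable_chunks` are all assumptions made for the example.

```python
import hashlib

# Assumption: a tiny fixed chunk size so the example is easy to follow.
# Real systems use far larger chunks.
CHUNK_SIZE = 4


def split_into_chunks(data: bytes, size: int = CHUNK_SIZE) -> list[bytes]:
    """Cut a file's bytes into consecutive pieces of at most `size` bytes."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def chunk_ids(data: bytes, size: int = CHUNK_SIZE) -> list[str]:
    """Identify each chunk by a hash of its contents (SHA-256 is an assumption)."""
    return [hashlib.sha256(c).hexdigest() for c in split_into_chunks(data, size)]


def servable_chunks(have: dict[int, bytes], wanted_ids: list[str]) -> list[int]:
    """Return the positions of chunks a partial downloader can offer to others.

    `have` maps chunk position to the bytes already downloaded. A chunk is
    offered only if its hash matches the expected identifier at that position.
    """
    return [
        i
        for i, chunk in sorted(have.items())
        if i < len(wanted_ids)
        and hashlib.sha256(chunk).hexdigest() == wanted_ids[i]
    ]


# A stand-in for a movie file, and a peer that holds only part of it.
movie = b"abcdefghij"
ids = chunk_ids(movie)
partial_peer = {0: b"abcd", 2: b"ij"}
offered = servable_chunks(partial_peer, ids)
```

Here the peer holds two of the file's three chunks, so it can already act as a secondary source for those two while it is still waiting on the rest.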
Many files can still take days to download, however, as original sources go offline, or as a source's upstream bandwidth clogs.
Aiming to fix this problem, Carnegie Mellon's David Andersen and his colleagues reasoned that many files online today are, in fact, near-duplicates with minor differences—identical songs labeled differently, movies in different languages or different versions of the same software programs, for example.
To test this, they downloaded all the versions they could find of 26 songs and 26 movies, resulting in more than 6,000 media files. Different versions of the same song wound up sharing about 99 percent of the same content, they found, while different versions of the same movies offered an average 15 percent overlap.
To make this shared content accessible, the team created a "handprinting" system that assigns each file a unique digital identifier based on its exact contents. Unlike more traditional digital "fingerprinting," commonly used to identify or authenticate documents, this system also allows fast comparison of a limited number of individual chunks, which can then be swapped if found to be identical.
Each handprint can be thought of as a string of digits, with different parts corresponding to different chunks of data. Thus, if The Departed's handprint was "12 14 16 18 24," and its Spanish language translation Los Infiltrados produced "13 15 17 18 24," the second file could be used as a source of some content. Scenes without dialogue, for example, might be identical in both language versions.
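The handprint comparison in that example can be worked through in code. The sketch below treats each handprint as a list of chunk labels and looks for labels that appear in both. It uses the article's illustrative digits and is only a simplified picture of the idea, not the SET prototype's actual matching method. The function name `shared_chunks` is hypothetical.

```python
def shared_chunks(target: list[str], candidate: list[str]) -> list[str]:
    """Return the chunk labels of `target` that `candidate` can also supply.

    Assumption: two chunks are interchangeable when their labels are equal,
    which stands in for "found to be identical" in the description above.
    """
    available = set(candidate)
    return [label for label in target if label in available]


# Handprints from the example, split into their individual chunk labels.
departed = "12 14 16 18 24".split()
infiltrados = "13 15 17 18 24".split()

common = shared_chunks(departed, infiltrados)
overlap = len(common) / len(departed)

print(common)   # ['18', '24']
print(overlap)  # 0.4
```

In this toy case two of the five chunks match, so a downloader of The Departed could fetch those two from anyone sharing Los Infiltrados and would need the original only for the other three.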
Tests of the team's prototype, dubbed Similarity-Enhanced Transfer (SET), found it to be as much as three times faster than BitTorrent for songs and about 30 percent faster for movie files when drawing content from similar as well as identical files over DSL-speed connections. If many identical copies were already available, however, the advantage disappeared, making it useful for perhaps "half the content out there," Andersen says.
The concept may be difficult to add to existing file-swapping networks, because its file-splitting methods would likely make SET-enabled updates incompatible with earlier versions of today's swapping software. Nevertheless, the idea is being widely discussed on peer-to-peer forums and mailing lists. Parker says SET or something like it is "certain" to end up in CacheLogic's toolbox before long.
Andersen says he is not interested in commercializing it himself. He and his colleagues have released detailed technical specifications and prototype code, and are encouraging other developers to draw on the technique.
"I hope other people will take and freely use it," Andersen says. "I really want to see this out there and in use."