Editor's note (11/16/15): Following the terrorist attacks in Paris on November 13 and the ensuing debate about counterterrorism efforts and encrypted communications, Scientific American is republishing the following article.
In November 2012 a 28-year-old woman plunged 15 meters from a bedroom window to the pavement in New York City, a devastating fall that left her body broken but alive. The accident was an act of both desperation and hope—the woman had climbed out of the sixth-floor window to escape a group of men who had been sexually abusing her and holding her captive for two days.
Four months ago the New York County District Attorney’s Office sent Benjamin Gaston, one of the men responsible for the woman’s ordeal, to prison for 50 years to life. A key weapon in the prosecutor’s arsenal, according to the NYDA’s Office: an experimental set of Internet search tools the U.S. Department of Defense is developing to help catch and lock up human traffickers.
Although the Defense Department and the prosecutor’s office had not publicly acknowledged using the new tools, they confirmed to Scientific American that the Defense Advanced Research Projects Agency’s (DARPA) Memex program provided advanced Internet search capabilities that helped secure the conviction. DARPA is creating Memex to scour the Internet in search of information about human trafficking, in particular advertisements used to lure victims into servitude and to promote their sexual exploitation.
Much of this information is publicly available, but it exists in the 90 percent of the so-called “deep Web” that Google, Yahoo and other popular search engines do not index. That leaves untouched a multitude of information that may not be valuable to the average Web surfer but could provide crucial information to investigators. Google would not confirm that it indexes no more than 10 percent of the Internet, a statistic that has been widely reported, but a spokesperson pointed out that the company’s focus is on whether its search results are relevant and useful in answering users' queries, not whether it has indexed 100 percent of the data on the Internet.
Much of this deep Web information is unstructured data gathered from sensors and other devices that may not reside in a database that can be scanned or “crawled” by search engines. Other deep Web data comes from temporary pages (such as advertisements for illegal sexual and similarly illicit services) that are removed before search engines can crawl them. Some areas of the deep Web are accessible only with special software such as Tor (short for The Onion Router), which allows people to share information anonymously via peer-to-peer connections rather than going through a centralized computer server. DARPA is working with 17 different teams of researchers—from both companies and universities—to craft Internet search tools as part of the Memex program that give government, military and business users new ways to analyze, organize and interact with data pulled from this larger pool of sources.
Law and order
DARPA has said very little about Memex and its use by law enforcement and prosecutors to investigate suspected criminals.
According to published reports, including one from Carnegie Mellon University, the NYDA’s Office is one of several law enforcement agencies that have used early versions of Memex software over the past year to find and prosecute human traffickers, who coerce or abduct people—typically women and children—for the purposes of exploitation, sexual or otherwise. “Memex”—a combination of the words “memory” and “index” coined by Vannevar Bush in a 1945 article for The Atlantic—currently includes eight open-source, browser-based search, analysis and data-visualization programs as well as back-end server software that performs complex computations and data analysis.
Such capabilities could become a crucial component of fighting human trafficking, a crime with low conviction rates, primarily because of strategies that traffickers use to disguise their victims’ identities. The United Nations Office on Drugs and Crime estimates there are about 2.5 million human trafficking victims worldwide at any given time, yet putting the criminals who press them into service behind bars is difficult. In its 2014 study on human trafficking the U.N. agency found that 40 percent of countries surveyed reported fewer than 10 convictions per year between 2010 and 2012. About 15 percent of the 128 countries covered in the report did not record any convictions.
Evidence of criminals peddling such services online is hard to pinpoint because of the use of temporary ads and peer-to-peer connections within the deep Web. Over a two-year time frame traffickers spent about $250 million to post more than 60 million advertisements, according to DARPA-funded research. Such a large volume of Web pages, many of which are not posted long enough to be crawled by search engines, makes it difficult for investigators to connect the dots. This is, in part, because investigators typically search for evidence of human trafficking using the same search engines that most people use to find restaurant reviews and gift ideas. Hence the Memex project.
At DARPA’s Arlington, Va., headquarters Memex program manager Christopher White provided Scientific American with a demonstration of some of the tools he and his colleagues are developing. Criminal investigations often begin with little more than a single piece of information, such as an e-mail address. White plugged a demo address into Google to show how investigators currently work. As expected, he received a page of links from the portion of the Internet that Google crawls—also referred to as the “surface Web”—prioritized by a Google algorithm attempting to deliver the most relevant information at the top. After clicking through several of these links, an investigator might find a phone number associated with the e-mail address.
Thus far, White had pulled the same information from the Internet that most people would see. But he then faced a next step all Web users confront: sifting through pages of hyperlinks with very little analytical information available to tie together different search results. Just as important as Memex’s ability to pull information from a broader swath of the Internet are its tools that can identify relationships among different pieces of data. This helps investigators create data maps used to detect spatial and temporal patterns. One example could be a hub-and-spoke visualization depicting hundreds of Web sites connected to a single sex services e-mail, phone number or worker.
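The idea behind such a hub-and-spoke map can be sketched in a few lines of code. This is an illustrative toy, not the Memex software itself; the ad records and contact numbers below are invented:

```python
from collections import defaultdict

# Invented ad records: (page, contact number) pairs of the kind an
# investigator might collect from online listings.
ads = [
    ("site-a/ad1", "555-0101"),
    ("site-b/ad7", "555-0101"),
    ("site-c/ad3", "555-0101"),
    ("site-d/ad9", "555-0199"),
]

def build_hub_and_spoke(records):
    """Group pages by the contact detail they share: each contact
    becomes a hub, and every page that reuses it becomes a spoke."""
    hubs = defaultdict(set)
    for page, contact in records:
        hubs[contact].add(page)
    return hubs

hubs = build_hub_and_spoke(ads)
# Three otherwise unrelated pages all point back to one number.
print(sorted(hubs["555-0101"]))
```

Scaled up to millions of ads, the same grouping is what lets a visualization surface the spatial and temporal patterns described above.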
White also showed how Memex can generate color-coded heat maps of different countries that show where the most sex advertisements are being posted online at any given time. These patterns and others could help reveal associations that investigators might otherwise miss, says White, who began working with DARPA in 2010 as a consultant developing data-science tools to support the U.S. military in Afghanistan.
The technology has already delivered results since DARPA began introducing Memex to select law enforcement agencies about a year ago. The NYDA says that its new Human Trafficking Response Unit now uses DARPA’s Memex search tool in every human trafficking case it pursues. Memex has played a role in generating at least 20 active sex trafficking investigations and has been applied to eight open indictments in addition to the Gaston conviction, according to the NYDA’s Office. “Memex helps us build evidence-based prosecutions, which are essential to fighting human trafficking,” says Manhattan District Attorney Cyrus R. Vance, Jr. “In these complex cases prosecutors cannot rely on traumatized victims alone to testify. We need evidence to corroborate or, in some cases, replace the need for the victim to testify.”
Different components of Memex are helping law enforcement crack down on trafficking elsewhere in the country as well. A detective in Modesto, Calif., used a specific piece of software called Traffic Jam to follow up on a tip about one particular victim from Nebraska and ended up identifying a sex trafficker who was traveling with prostitutes across the Midwest and West. The investigation culminated in his arrest. Traffic Jam, developed independently of DARPA in 2011 by Carnegie Mellon University researchers and later spun off into a company called Marinus Analytics, enabled investigators to gather evidence by quickly reviewing ads the trafficker posted for several locales.
DARPA has since awarded Carnegie Mellon a three-year, $3.6-million contract to enhance Traffic Jam’s basic search capabilities as part of Memex, with machine-learning algorithms that can analyze results in depth, according to the university. Carnegie Mellon researchers are also studying ways to apply computer vision to searches in a way that helps investigators identify images with similar elements—such as furniture from the same hotel room that appears in multiple images—even if the images themselves are not identical, says Jeff Schneider. Schneider is the project's principal investigator and a research professor in the Auton Lab at the university’s School of Computer Science, which studies statistical data mining. Furniture in a hotel room, for example, could help law enforcement identify the location of trafficking operations.
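One common way to find near-duplicate images is a perceptual hash, which summarizes a picture so that similar pictures get similar signatures. The sketch below is a generic average-hash on tiny made-up grayscale grids; it is not Carnegie Mellon’s actual method, only an illustration of the principle:

```python
def average_hash(pixels):
    """One bit per pixel: set when the pixel is brighter than the
    image's mean. Near-duplicates yield nearly identical bit lists."""
    mean = sum(pixels) / len(pixels)
    return [1 if p > mean else 0 for p in pixels]

def hamming_distance(h1, h2):
    """Count differing bits; a small distance suggests similar images."""
    return sum(a != b for a, b in zip(h1, h2))

# Two invented 4x4 grayscale "photos" of the same scene (one slightly
# brighter, as under different lighting) and one of a different scene.
img_a = [10, 10, 200, 200] * 4
img_b = [20, 20, 210, 210] * 4
img_c = [200] * 8 + [10] * 8

print(hamming_distance(average_hash(img_a), average_hash(img_b)))  # 0
print(hamming_distance(average_hash(img_a), average_hash(img_c)))  # 8
```

The near-duplicate pair matches exactly despite the lighting change, while the different scene is far apart, which is why such hashes survive the small edits that defeat exact matching.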
Vance and other law enforcement officials welcome such advances. “Technology alone won’t solve cases, but it certainly helps,” he says. “We’ve had the most success with this effort when we married traditional field intelligence with the information this tool provides.”
White agrees that DARPA’s technology is a supplement to other investigative methods, including interviews with victims. In addition to targeting human trafficking, law enforcement officials are finding that they can tap Memex to crack down on other, related crimes, including trafficking in guns and drugs.
“Bigger than most people think”
Beyond its capacity to transform law enforcement, Memex marks a major shift in Internet search technology itself, one that could someday help all of us get more useful results from our own searches. Most people see Internet searches as lists of hyperlinked results and will click on the first link 40 percent of the time, White says. But, he adds, “The Internet is bigger than most people think.”
Search engines ignore most of the unstructured data, unlinked content (Web pages without links to other pages) and temporary pages they find in the deep Web, dismissing them as unusable to the audience that search engine advertisers are trying to reach. One type of temporary page could be an advertisement for sexual services set up by human traffickers in a location on the Internet known to their customers but taken down before it can be indexed and found by law enforcement. Other temporary pages are more innocuous—for example, those consisting of data query results that change depending on the query.
One reason regular search engines ignore this subsurface data is that Web advertisers have no interest in it. Search engine companies make money based on the results they produce, White says. “We’re showing that there are other models for using the Internet, and they can be domain specific—trafficking, counterterrorism, disease response, for example,” he adds. “[It is] not just about trying to get people to click on ads.”
Google’s search engine combs through the company’s index of the Internet, which Google builds using software programs called spiders that find and catalogue Web pages. The results of a Google search consist of links to the most relevant information the company’s search engine can find within that index. Google ranks those links essentially based on each page’s popularity. Yahoo, Bing and other popular search engines function much the same way. “Most of what happens on current engines is entity search—I’m looking for information about a musician, an event or a product,” explains Tetherless World Professor of Computer and Cognitive Science James Hendler, director of Rensselaer Polytechnic Institute’s Institute for Data Exploration and Applications. “Under the existing search technology you have to guess good keywords to get the information you’re looking for. If you don’t know the right keywords or you need the search results to be placed in context, you’re left with a difficult problem.” Basically, you get either a flood of general links without a clear sense of how they relate back to your original query or a short list that does not provide you with the specific information you need.
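The keyword-matching model Hendler describes rests on an inverted index: a map from each word to the pages that contain it. A toy version, with invented pages and queries, makes the “guess the right keywords” problem concrete:

```python
# Invented stand-ins for crawled Web pages.
pages = {
    "example.com/a": "cheap flights and hotel deals",
    "example.com/b": "hotel reviews for travelers",
    "example.com/c": "guitar lessons for beginners",
}

def build_inverted_index(docs):
    """Map each word to the set of pages that contain it."""
    index = {}
    for url, text in docs.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(url)
    return index

def search(index, query):
    """Return only pages containing every keyword: miss the right
    word and the page never surfaces, as Hendler notes."""
    hits = [index.get(word, set()) for word in query.lower().split()]
    return set.intersection(*hits) if hits else set()

idx = build_inverted_index(pages)
print(sorted(search(idx, "hotel")))          # both hotel pages
print(sorted(search(idx, "hotel guitar")))   # no page has both words
```

Real engines layer ranking and synonym handling on top, but the underlying dependence on matching the user’s words is the limitation Hendler points to.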
More valuable would be something in the middle of those extremes. That’s what Memex proposes to do so well, according to Hendler, who served as a program manager and chief scientist of DARPA’s Information Systems Office from 1999 to 2001, although he is not involved with Memex.
Memex got its first test in February 2014 when White and his team worked in coordination with the New Jersey Regional Operations Intelligence Center to monitor and disrupt any spike in sex trafficking related to Super Bowl XLVIII, held at the Garden State’s MetLife Stadium. The DARPA scientists used early versions of Memex’s tools to give police a sense of the scope of the problem. More specifically, they analyzed images in advertisements for sexual services to determine whether the women in those ads had appeared in previous ads or were new, likely brought to the New York–New Jersey region specifically to meet increased demand around the big game.
This past August White rolled out Memex to additional beta testers—two district attorney's offices, a law enforcement group and a nongovernmental organization (NGO). Although White would not identify these users, he says their work in anti–human trafficking efforts spans prosecution, operations and victim outreach.
The next set of testing begins in a few weeks and will include federal and district prosecutors, regional and national law enforcement and multiple NGOs. One of the main objectives of this round is to test new image search capabilities that can analyze photos even when portions that might aid investigators—including traffickers’ faces or a television screen in the background—are obfuscated. Another goal is to try out different user interfaces and to experiment with streaming architectures that assess time-sensitive data.
White says he wants to expand user testing every quarter until he and his team have created a version of Memex that they are comfortable handing over to law enforcement agencies and prosecutors. When this handover happens, software components such as the Web crawler, machine-learning algorithms and graph-analysis tools that can scour both the surface and deep Webs will be installed locally at law enforcement agencies. The tools will be accessible through standard Web browsers that agencies and the general public already use, such as Firefox and Chrome, ensuring that investigators can access the software from any Internet-connected device.
White made several key decisions about the type of data Memex could access in an effort to steer clear of the controversy around government access to private citizen information and communications, a particularly touchy subject since Edward Snowden’s National Security Agency revelations beginning in June 2013. If something is password protected, it is not public content and Memex does not search it, according to White. “We didn’t want to do hacking,” he adds. “We didn’t want to cloud this work unnecessarily by dragging in the specter of snooping and surveillance.” White and his team are finding there is more than enough public content to challenge them as they develop their tools to aid law enforcement and prosecutors.
Such content can be found on the surface Web with which most people are familiar as well as the deep Web or the “dark Web,” the latter a subset of the unindexed deep Web that requires specialized software and algorithms to find and browse. People, such as those running the underground Silk Road cyber black market, often use the dark Web to anonymously post content that may or may not be legal.
Dark Web sites have, of course, attracted DARPA’s attention because they are good candidates for human trafficking activity. As a result, White and his team are developing a “Dark Web crawler” that explores the Tor-accessible, peer-to-peer areas of the deep Web and has thus far done much to enlighten the researchers as to the extent of dark Web activity. Whereas it was once thought to consist of 1,000 or so pages, White says there could be anywhere between 30,000 and 40,000 dark Web pages. “Just finding these pages and seeing what’s on them is a new aspect of search technology,” he says.
DARPA chose law enforcement efforts to disrupt human trafficking as a concrete cause around which it could quickly develop and deploy its new approach to searching the Internet. White is confident that Memex technology can likewise be applied to any type of investigative effort, including counterterrorism, missing persons, disease response and disaster relief.
Perhaps someday it will even provide better approaches to finding the restaurant reviews, gift ideas and other more mundane information that the vast majority of the Internet’s users crave.