In 1994 I reinvented myself. A physicist and engineer at General Atomics, I was part of an internal think tank charged with answering hard questions from any part of the company. Over the years, I worked on projects as diverse as cold fusion and Predator drones. But by the early 1990s I was collaborating frequently with biologists and geneticists. They would tell me what cool new technologies they needed to do their research; I would go try to invent them.

Around that time I heard about a new effort called the Human Genome Project. The goal was to decipher the sequence of the approximately three billion DNA bases, or code letters, in human chromosomes. I was fascinated. I happened to read an article in this magazine noting that some of the necessary technology had yet to be invented. Physicists and engineers would have to make it happen. And before I knew it, I found myself a professor at the University of Texas Southwestern Medical Center, where my scientific partner, a geneticist, and I were building one of the Human Genome Project's first research centers.

Everything was different there. My colleagues spoke a different language—medicine. I spoke physics. In physics, basic equations govern most everything. In medicine, there are no universal equations—just many observations, some piecewise understanding and a tremendous amount of jargon. I would attend seminars and write down huge lists of words I had never heard and then spend hours afterward looking them up. To read a scientific paper, I had to have a medical dictionary on hand.

Frustrated with my inability to understand any contiguous piece of text, I decided to develop software to help me. I wanted a search engine that would take a chunk of text and return references for further reading, abstracts and papers that would quickly get me up to speed on the topic at hand. It was a tough problem. Search engines for the Web were just emerging. They were fine for finding the best falafel restaurant in town, but they could not begin to digest a paragraph containing multiple interrelated concepts and point me to related readings.

With some students and postdocs, I set about studying text analytics, and together we developed a piece of software called eTBLAST (electronic Text Basic Local Alignment Search Tool). It was inspired by the software tool BLAST, used to search DNA and protein sequence databases. A query for BLAST was usually a series of 100 to 400 DNA letters and would return longer sequences that included those codes. The query for eTBLAST would be a paragraph or page—typically 100 words or more. Designing the search protocol was harder than designing software to seek a string of letters because the search engine could not merely be literal. It also had to recognize synonyms, acronyms and related ideas expressed in different words, and it had to take word order into account. In response to a query consisting of a chunk of text, eTBLAST would return a ranked list of “hits” from the database it was searching, along with a measure of the similarity between the query and each abstract found.
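
The matching algorithm inside eTBLAST is not spelled out here, but the general idea of paragraph-scale similarity search can be sketched with standard ingredients. The Python sketch below uses bag-of-words TF-IDF weighting and cosine similarity purely as a stand-in; the function names and scoring scheme are illustrative assumptions, and the real eTBLAST also handled synonyms, acronyms and word order, which this toy version does not.

```python
# A minimal sketch of paragraph-level similarity search, in the spirit of eTBLAST.
# TF-IDF plus cosine similarity is a common stand-in, not the actual eTBLAST method.
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def build_index(abstracts):
    """Count, for each term, how many abstracts contain it (document frequency)."""
    df = Counter()
    for doc in abstracts:
        df.update(set(tokenize(doc)))
    return df

def tfidf_vector(text, df, n_docs):
    """Weight each term by its frequency times inverse document frequency."""
    counts = Counter(tokenize(text))
    return {t: c * math.log((n_docs + 1) / (df.get(t, 0) + 1))
            for t, c in counts.items()}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * \
           math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def search(query, abstracts):
    """Return (similarity, index) pairs ranked against a paragraph-sized query."""
    df = build_index(abstracts)
    qv = tfidf_vector(query, df, len(abstracts))
    scored = [(cosine(qv, tfidf_vector(a, df, len(abstracts))), i)
              for i, a in enumerate(abstracts)]
    return sorted(scored, reverse=True)
```

If the query text is itself part of the corpus, it comes back as the top hit with a similarity of 1.0, which is exactly the behavior the study described below relies on.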

The obvious database to search was Medline (available from PubMed at pubmed.org), the repository, maintained by the National Library of Medicine at the National Institutes of Health, of all biological research relevant to medicine. It contains the titles and abstracts of millions of research papers from thousands of peer-reviewed journals. Medline had a search engine that was keyword-based, so a query of a few words—for example, “breast cancer genes”—would return plenty of hits, often with links to full papers. But as a newly converted biomedical researcher, I often did not even know which keywords to start my searches with.

The first versions of eTBLAST took hours to compare a paragraph of a few hundred words against Medline. But the software worked. Using eTBLAST, I could make my way through scientific papers, mastering their meaning paragraph by paragraph. I could pop a graduate student's thesis proposal in and quickly get up to speed on the pertinent literature. My research partners and I even spoke with Google about commercializing our software, only to be told it did not fit with the company's business model.

Then events took a strange turn. A couple of times I found text in student proposals that was identical to text in other, uncited papers. The students received remedial ethics training. I received a research question that would change my career: How much of the professional biomedical literature was plagiarized?

Déjà Vu

When I set out to explore this new question, the research on plagiarism in biomedicine consisted of anonymous surveys. In the most recent survey I could find, 1.4 percent of researchers admitted to having plagiarized. But the accuracy of that number depended on the honesty of the survey respondents. With eTBLAST, we could find out whether they were telling the truth.

Once we had enough student help and a sufficiently powerful computer, we randomly selected abstracts from Medline and then used them as eTBLAST queries. The computer would compare the query text with the entire contents of Medline, looking for similarities, then return a list of hits. Each hit came with a similarity score. The query was always at the top of the list—100 percent similarity. The second hit typically had a similarity score anywhere from the single digits to about 30 percent. Occasionally, though, we found that the second and sometimes third hits had scores close to 100 percent. After running a few thousand queries, we started to see that about 5 percent of queries had suspiciously high similarity scores. We reviewed those abstracts by eye to make sure the software was finding things that a human would consider similar. Then we went on to compare the full text of papers that had suspiciously similar abstracts.
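
As a concrete illustration of that workflow, the sketch below builds on the hypothetical search function above: it samples abstracts at random, queries each one against the corpus, skips the self-match at the top of the hit list and flags any runner-up whose score exceeds a threshold. The sample size and the 0.9 cutoff are placeholders for illustration, not the values used in the study.

```python
import random

def flag_suspicious(abstracts, sample_size=1000, threshold=0.9, seed=0):
    """Randomly sample abstracts, query each against the full corpus and
    flag those whose best non-self match scores above a threshold.
    Sample size and threshold are illustrative, not the study's values."""
    rng = random.Random(seed)
    sample = rng.sample(range(len(abstracts)), min(sample_size, len(abstracts)))
    flagged = []
    for i in sample:
        hits = search(abstracts[i], abstracts)  # from the sketch above
        # hits[0] is the query matching itself (~1.0); inspect the runner-up
        best_other = next(((s, j) for s, j in hits if j != i), (0.0, i))
        if best_other[0] >= threshold:
            flagged.append((i, best_other[1], best_other[0]))
    return flagged  # candidate pairs for review by eye
```

Anything flagged this way would still have to be reviewed by a human, just as the paragraph above describes.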

Soon we began to find blatant examples of plagiarism—not just recycled phrases but entire papers lifted nearly wholesale. It was disappointing, even astounding. Sure, we knew that surveys said that 1.4 percent of researchers admit to plagiarism. But it is quite a different thing to see plagiarized papers side by side with their originals. For the students in particular, the process was exciting. They felt like crime fighters, and in a sense, they were.

The next step was to scale up the computing and the analysis. To be thorough, we wanted to perform similarity searching on every entry of sufficient length in Medline—at the time, almost nine million entries averaging about 300 words each, which meant nearly nine million times nine million comparisons. The task took months and consumed a considerable amount of our lab's computing power. As the results emerged, we analyzed them and placed all the highly similar results in a database we called Déjà Vu.
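
A back-of-the-envelope calculation makes clear why the job took months. Comparing roughly nine million entries against one another means on the order of 8 x 10^13 pairwise similarity checks; the throughput figure in the snippet below is an assumed value for illustration only.

```python
# Back-of-envelope for the all-against-all Medline comparison (illustrative only).
entries = 9_000_000                  # abstracts of sufficient length
comparisons = entries * entries      # every entry compared against every other
rate = 1_000_000                     # assumed comparisons per second on one core

core_years = comparisons / rate / (3600 * 24 * 365)
print(f"{comparisons:.1e} comparisons -> about {core_years:.0f} core-years at {rate:,}/s")
```

Even at a generous million comparisons per second, that is a few core-years of single-processor work, which helps explain why the search occupied the lab's machines for months.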

Déjà Vu began to fill with pairs of highly similar Medline abstracts—about 80,000 pairs that were at least 56 percent similar. The vast majority of these pairs were highly similar for perfectly good reasons—they were updates to older papers, or meeting summaries, for example. But others were suspicious.

We submitted a paper to Nature that contained data on the frequency of plagiarism and duplicate publication (sometimes called self-plagiarism), details on the content of the Déjà Vu database and some prime examples. (Scientific American is part of Nature Publishing Group.) The editors accepted, but because we referred to some abstracts as plagiarized, the lawyers ripped the paper apart. They had an excellent point: the only people who could make a plagiarism determination were editors and ethics review boards. We could present only facts—the amount of text overlap or similarity between any two pieces of scientific literature. Eventually, with the approval of the lawyers, that is what we did.

When the Nature report came out, all hell broke loose. Journal editors were upset because it gave them extra work to do. To protect their copyright, the editors of the original papers had to insist that the plagiarized papers be retracted. The publisher of the duplicate paper, of course, was embarrassed. Scientists were angry because our results seemed to expose a flaw in peer review. But everyone grudgingly admitted that this was an important topic and a serious problem. Scientists and clinicians make critical decisions based on what they read in the literature. What did it mean if those decisions were based on tainted studies?

Ultimately we determined that 0.1 percent of professional publications were blatantly plagiarized from the work of others. (We looked only for papers that were nearly identical to one another; there must be many more instances in which small fragments of papers are plagiarized, but because our software searched only abstracts, it would not detect such things.) Some 1 percent were self-plagiarized; one author's work would appear, often nearly verbatim, in as many as five journals. If these percentages seem small, consider that some 600,000 new biomedical papers are published every year; at those rates, that works out to roughly 600 plagiarized and 6,000 self-plagiarized papers annually.

And before long, we noticed that the publishing process had begun to change. Journal editors started using eTBLAST to check their submissions. I had changed, too. I had evolved again, adding “ethics researcher” to my job description.

My Life as an Ethics Cop

The first big plagiarism study was just the beginning. Understanding the causes of plagiarism and their effects on science would require much more work. When is repeated text acceptable? When and why do scientists plagiarize? What other kinds of unethical behavior could textual analysis uncover? So we refined our software, expanded our databases and took on new studies.

Some of our subsequent work revealed unexpected nuances in the plagiarism debate. We found that in some cases, textual similarity is not only acceptable but preferred. In the methods section of a research paper, for example, where the most important consideration is reproducibility of results, unoriginal phrasing serves the important purpose of showing clearly that exactly the same protocol was used.

We also found some truly egregious ethical lapses. In a study published in Science, we took the most blatant examples of plagiarism we could find—pairs of papers in which paper B was on average 86 percent identical to paper A—and analyzed them in detail. We e-mailed annotated copies of the papers, along with confidential surveys, to the authors and editors involved with those papers. Were they aware of the similarity? Could they explain it? Ninety percent of the people we contacted responded.

Some of the authors divulged striking ethics violations. Some admitted that they had copied papers while they were reviewing them—and that they had given those papers bad reviews to block their publication. Others blamed the lapse on fictitious medical students. One author said he had plagiarized a paper as a joke. This person happened to be the vice president of the national ethics committee of his country. Unsurprisingly, most of the tainted papers in that bunch have since been retracted.

These were not the last ethics violations we would find. In early 2012 we began looking for instances of double-dipping on grants—that is, getting money from multiple government agencies to do the same work. We downloaded summaries of approximately 860,000 grants from government and private agencies, including the National Institutes of Health, the National Science Foundation, the Department of Defense, the Department of Energy and Susan G. Komen for the Cure, and subjected them to the eTBLAST treatment. The study required 800,000 times 800,000 (roughly 10^12) comparisons and supercomputer-level power.

After reviewing the 1,600 most similar grant summaries, we found that about 170 pairs had virtually identical goals, aims or hypotheses. We concluded several things: that double-dipping had been happening consistently for a long time; that it involved America's most prestigious universities; and that the resulting loss to biomedical research was as high as $200 million a year.

The Future of Scientific Publishing

A small percentage of people have always broken societal norms, and scientists are no different. In desperate times, with declining funding and increasingly intense competition for academic positions, some scientists are bound to behave badly. In fact, a recent explosion of dubious, fly-by-night journals has made scientific publishing a Wild West show. It is now easier than ever to find a place to publish your material, even if it is flagrantly plagiarized.

Text analytics gives us a good tool for policing bad behavior. But it could eventually do much more than smoke out plagiarism. It could facilitate entirely new ways of sharing research.

One intriguing idea is to adopt a Wikipedia model: to create a dynamic, electronic corpus of work on a subject that scientists continually edit and improve. Each new “publication” would consist of a contribution to the single growing body of knowledge; those redundant methods sections would become unnecessary. The Wikipedia model would be a step toward a central database of all scientific publications across all disciplines. Authors and editors could use text mining to verify the novelty of new research and to develop reliable metrics for the impact of an idea or discovery. Ideally, instead of measuring a paper's impact by the number of citations it receives, we would measure its influence on our total scientific knowledge and even on society.

At Virginia Tech, where I moved four years ago, we are struggling to keep eTBLAST running, but the software still has thousands of users. My wife and business partner, Kim Menier, and I, meanwhile, are bullish about textual analysis. We are working to apply the kind of paragraph-size similarity searching that uncovered so many instances of plagiarism to other ends, including grant management, market research and patent due diligence. Do we have the next Google on our hands? Who knows? But I speak from experience when I say that textual analysis can be truly revealing. It once proved to me that scientists could be as flawed as the rest of us.