Massive International Project Raises Questions about the Validity of Psychology Research

When 100 past studies were replicated, only 39 percent yielded the same results

Investigators across five continents reported that they were able to replicate only about 40 percent of the results from 100 previously published studies in cognitive and social psychology, in a study described today in the influential journal Science. The massive collaboration, called the Reproducibility Project: Psychology, could serve as a model for examining reproducibility of research in other fields, and a similar effort to scrutinize studies in cancer biology is already underway.

Central to the scientific method, experiments “must be reproducible,” says Gilbert Chin, a senior editor at Science. “That is, someone other than the original experimenter should be able to obtain the same findings by following the same experimental protocol.” The more readily a study can be replicated, the more trustworthy its results. But “there has been growing concern that reproducibility may be lower than expected or desired,” says corresponding author Brian Nosek, a psychology professor at the University of Virginia.

To address the problem, scientists across many disciplines established the Center for Open Science (COS) in Charlottesville, Va. The Reproducibility Project: Psychology, their first research initiative, began recruiting volunteers in 2011. They asked teams of researchers, totaling 270 collaborating authors, to choose from a pool of studies—all reflecting basic science and not requiring specialized samples or equipment—that appeared in 2008 in one of three respected psychology journals: Psychological Science; Journal of Personality and Social Psychology; and Journal of Experimental Psychology: Learning, Memory and Cognition.

Generally, the evidence was weaker on replication. The stronger the original evidence was, however, including a larger effect size, the more likely the results were to be reproduced.

Although the outcome was “somewhat disappointing,” Chin said during a teleconference to discuss the findings, he stressed that it did not necessarily speak to the validity of the theories tested or even the conclusions drawn. The scientific process involves “a continual questioning and assessment of theories and of experiments.” Even nonreproducible experiments contribute to our understanding of science by helping to rule out alternative explanations. Rather, the study suggests “we should be less confident about many of the original experimental results that were provided as empirical evidence in support of those theories.”

Speaking at the same teleconference, Alan Kraut, executive director of the Association for Psychological Science and a COS board member, made a similar point: Inevitable variations in study participants, timing, location, the skills of the research team and many other factors will always influence outcomes. “The only finding that will replicate 100 percent of the time,” Kraut noted, “is one that is likely to be trite and boring.”

The teams received set protocols and analysis plans and consulted with original study authors in order to match their study design as closely as possible. After the experiments concluded, the project coordinators aggregated the data and independently reviewed the analyses.

Study authors gauged replication success using five criteria: statistical significance and p-values (a p-value is the probability of obtaining a result at least as extreme as the one observed if there were no real effect, and a value of 0.05 or less is conventionally treated as significant); the effect size, which indicates the strength of the phenomenon tested; the subjective judgment of the replication team; and a meta-analysis of the effect sizes of all 100 experiments. They also factored in various other characteristics, among them sample size, so-called “effect surprisingness” and expertise of the original team, that could potentially affect the results.

In the final analysis they found that whereas 97 percent of the original studies reported statistically significant results (obtaining a p-value of 0.05 or less), only 36 percent of replications did. A weakness of relying on p-values, however, is that doing so treats 0.05 as a “bright line” between significant and nonsignificant results. To address this, the researchers also examined effect size. The replicated experiments fared slightly better when measured this way. In total, 47 percent of the replications showed an effect that matched the original results with 95 percent confidence, although generally the strength of the effect had decreased. Subjectively, 39 percent of the research teams deemed their replication a success.
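One way to read "matched with 95 percent confidence" is to ask whether the original effect size falls inside the 95 percent confidence interval around the replication's effect size. The sketch below illustrates that idea under stated assumptions: effect sizes are expressed as correlation coefficients, the interval is built with the standard Fisher z-transformation, and the function names and example numbers are hypothetical rather than taken from the study.

```python
import math

# Two-sided 95 percent critical value of the standard normal distribution.
Z_95 = 1.959963984540054


def fisher_ci(r, n, z_crit=Z_95):
    """Approximate confidence interval for a correlation r with sample size n.

    Uses the Fisher z-transformation: atanh(r) is roughly normal with
    standard error 1 / sqrt(n - 3).
    """
    if n <= 3:
        raise ValueError("sample size must be greater than 3")
    if not -1.0 < r < 1.0:
        raise ValueError("correlation must lie strictly between -1 and 1")
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


def original_within_replication_ci(r_original, r_replication, n_replication):
    """True if the original effect lies inside the replication's 95% interval."""
    low, high = fisher_ci(r_replication, n_replication)
    return low <= r_original <= high


# Hypothetical example: a replication with 100 participants finds r = 0.20.
print(original_within_replication_ci(0.30, 0.20, 100))  # original effect r = 0.30
print(original_within_replication_ci(0.50, 0.20, 100))  # original effect r = 0.50
```

Unlike the 0.05 threshold, this comparison uses the size of the effect and the precision of the replication instead of a single cutoff, so a replication can count as consistent with the original even when its own p-value lands just above 0.05.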

Of interest, the authors found that some types of studies were more likely to be replicated than others. Only about 25 percent of the 57 social psychology studies included in the project were successfully replicated, whereas 50 percent of the 43 cognitive psychology ones were. The social psychology studies also had weaker effect sizes. In addition, the simpler the design of the original experiment, the more reliable its results. The researchers also found that “surprising” effects were less reproducible.
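As a rough consistency check, the two subfield rates can be combined into an overall rate by weighting each by its number of studies. The sketch below uses only the rounded figures quoted in this article; the function name is hypothetical, and it is an assumption that "successfully replicated" here refers to the statistical significance criterion.

```python
def weighted_rate(groups):
    """Overall success rate from (number_of_studies, success_rate) pairs."""
    total_studies = sum(n for n, _ in groups)
    if total_studies == 0:
        raise ValueError("at least one study is required")
    expected_successes = sum(n * rate for n, rate in groups)
    return expected_successes / total_studies


# Rounded figures from the article: 57 social studies at about 25 percent,
# 43 cognitive studies at 50 percent.
overall = weighted_rate([(57, 0.25), (43, 0.50)])
print(f"Combined replication rate: {overall:.4f}")
```

Under these assumptions the combined rate comes out close to the 36 percent of replications that reached statistical significance.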

In this study the authors excluded research that called for advanced neuroimaging, possibly also excluding the very sorts of precision experiments that could have replicated more easily. But the authors note that the problem of reproducibility persists across all fields of science, perhaps in part because of publication bias. “Publication is the currency of science,” Nosek says. “To succeed, my collaborators and I need to publish regularly and in the most prestigious journals possible.” But academic journals routinely prioritize “novel, positive and tidy results,” he adds. Studies that fail to find a significant result rarely see the light of day. In addition, replications of previously published experiments, which are vitally important in moving science forward, are much less likely to survive peer review.

To change that, Science and other journals have recently published guidelines encouraging greater transparency and openness in their selection and review process, points out Marcia McNutt, editor in chief of Science. She adds that “authors and journal editors should be wary of publishing marginally significant results, as those are the ones that are less likely to reproduce.” If they lose sight of that fact, Nosek concludes, “then the published literature may become more beautiful than the reality.”
