If you want to convince the world that a fish can sense your emotions, only one statistical measure will suffice: the p-value.
The p-value is an all-purpose measure that scientists often use to determine whether or not an experimental result is “statistically significant.” Unfortunately, sometimes the test does not work as advertised, and researchers imbue an observation with great significance when in fact it might be a worthless fluke.
Say you’ve performed a scientific experiment testing a new heart attack drug against a placebo. At the end of the trial, you compare the two groups. Lo and behold, the patients who took the drug had fewer heart attacks than those who took the placebo. Success! The drug works!
Well, maybe not. There is a 50 percent chance that even if the drug is completely ineffective, patients taking it will do better than those taking the placebo. (After all, one group has to do better than the other; it’s a toss-up whether the drug group or placebo group will come up on top.)
The p-value puts a number on the effects of randomness. It is the probability of seeing an experimental outcome at least as favorable as the one you observed even if your hypothesis is wrong, that is, even if the drug does nothing. A long-standing convention in many scientific fields is that any result with a p-value below 0.05 is deemed statistically significant. An arbitrary convention, it is often the wrong one. When you compare an ineffective drug to a placebo, you will typically get a statistically significant result one time out of 20. And if you make 20 such comparisons in a scientific paper, on average, you will get one significant result with a p-value less than 0.05, even when the drug does not work.
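The one-in-20 figure can be checked with a small simulation. The sketch below is an illustration, not anything from a real trial: it assumes 500 patients per group, a 10 percent heart attack risk in both groups (so the drug does nothing), and a pooled two-proportion z-test as the significance test. The function names and all of these numbers are hypothetical choices.

```python
import math
import random


def two_proportion_p_value(events_a, n_a, events_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (events_a + events_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Every patient had the same outcome: no evidence of a difference.
        return 1.0
    z = (events_a / n_a - events_b / n_b) / se
    # P(|Z| > |z|) for a standard normal Z.
    return math.erfc(abs(z) / math.sqrt(2))


def false_positive_rate(trials=2000, n_per_group=500, risk=0.10,
                        alpha=0.05, seed=0):
    """Fraction of simulated trials of a useless drug that look significant."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Both groups share the same risk, so any difference is pure chance.
        drug = sum(rng.random() < risk for _ in range(n_per_group))
        placebo = sum(rng.random() < risk for _ in range(n_per_group))
        p = two_proportion_p_value(drug, n_per_group, placebo, n_per_group)
        if p < alpha:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    print(f"Share of null trials with p < 0.05: {false_positive_rate():.3f}")
```

Under these assumptions the printed share should land close to 0.05: roughly one useless-drug trial in 20 clears the conventional bar.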
Many scientific papers make 20 or 40 or even hundreds of comparisons. In such cases, researchers who do not adjust the standard p-value threshold of 0.05 are virtually guaranteed to find statistical significance in results that are meaningless statistical flukes. A study that ran in the February issue of the American Journal of Clinical Nutrition tested dozens of compounds and concluded that those found in blueberries lower the risk of high blood pressure, with a p-value of 0.03. But the researchers looked at so many compounds and made so many comparisons (more than 50), that it was almost a sure thing that some of the p-values in the paper would be less than 0.05 just by chance.
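How close to "a sure thing" depends on how many comparisons are made. A minimal sketch, assuming the comparisons are independent and that none of the effects is real (the function name is made up for illustration):

```python
def chance_of_at_least_one_fluke(comparisons, alpha=0.05):
    """Probability that at least one of several independent null
    comparisons comes out 'significant' at the given threshold."""
    if comparisons < 0:
        raise ValueError("comparisons must be non-negative")
    # Each comparison stays quiet with probability (1 - alpha);
    # all of them stay quiet with that probability raised to the count.
    return 1 - (1 - alpha) ** comparisons


if __name__ == "__main__":
    for k in (1, 20, 50, 100):
        print(k, round(chance_of_at_least_one_fluke(k), 3))
```

With 20 comparisons the chance of at least one fluke is about 0.64, and with 50 it is about 0.92. Real comparisons within one study are often correlated, so these figures are a rough guide rather than an exact rate.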
The same applies to a well-publicized study that a team of neuroscientists once conducted on a salmon. When they presented the fish with pictures of people expressing emotions, regions of the salmon’s brain lit up. The result was statistically significant with a p-value of less than 0.001; however, as the researchers argued, there are so many possible patterns that a statistically significant result was virtually guaranteed, so the result was totally worthless. p-value notwithstanding, there was no way that the fish could have reacted to human emotions. The salmon in the fMRI happened to be dead.
This article was originally published with the title The Mind-Reading Salmon.
16 Comments
"When they presented the fish with pictures of people expressing emotions, regions of the salmon’s brain lit up..." "...p-value notwithstanding, there was no way that the fish could have reacted to human emotions. The salmon in the fMRI happened to be dead."
A dead salmon's brain lit up when shown pictures of people expressing emotion, but at the same time it couldn't have reacted to human emotion because it is dead? Someone please explain to me how this makes any sense. I mean, was it just a malfunction of the fMRI machine?
This isn't a malfunction of the fMRI machine, it's simply the way the machine (or any measuring tool) works. Measurements include noise. Nor is it a critique of the p-value for tests of statistical significance. It *is* a critique aimed at those who confuse statistical significance with practical significance.
'"The goal of the salmon poster was to encourage the minority of researchers who report uncorrected statistics to move forward and begin using basic multiple comparisons correction in their research," says study leader Craig Bennett, a postdoctoral researcher in the Department of Psychology at the University of California, Santa Barbara.'
Notice that his critique is not a broad generalized condemnation, but aimed at politely (by means of dry humor) reminding a small number of people exactly what they need to be doing.
Look up "multiple comparisons" on Wikipedia, then read the poster in question, "Neural correlates of interspecies perspective-taking in the post-mortem Atlantic Salmon: An argument for multiple comparisons correction"
If I remember correctly, the study involving the salmon was actually designed to test the algorithms that fMRI machines use to distinguish between significant and insignificant data. The end result of the research was to show that when data sets are particularly large, the likelihood of false positives increases until it is a near certainty, hence the dead salmon with brain activity. There is a limit to the sensitivity of machines that use statistical logic to make calculations.
But the dead salmon likes me! I shall name him Joe. :) He is my special (and stinky) friend.
Understanding these concepts about hypothesis testing and p-values is something any elementary statistics course should teach and anyone doing science would be on top of. A further question might be "if the probability of a success is 0.05 and you run, say, 10 trials, what is the probability of getting at least one success?"
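For reference, that question has a short worked answer, assuming the 10 trials are independent. The chance of no success in any single trial is 1 - 0.05 = 0.95, so:

$$P(\text{at least one success}) = 1 - (1 - 0.05)^{10} = 1 - 0.95^{10} \approx 1 - 0.599 = 0.401$$

In other words, about a 40 percent chance of at least one "success" in 10 tries, even though each try has only a 5 percent chance on its own.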
Hopefully Scientific American does not think this is rocket science.
As I understand it, statistical inference and p-values were intended to reduce as much "noise" as possible from the data in order to determine what might possibly be the controlling variable or variables of some dependent variable. The results were not to be taken as "god's truth", but only as a way of narrowing down the possible independent variables. The way statistical inference seems to be used today suggests that researchers are taking their results as "god's truth" and going on to make other hypotheses based on them. This is most unfortunate.
The p-value was introduced as a useful rule of thumb, but it has taken on a life of its own in the hands of technicians acting as if they were scientists (and in the hands of scientists who are too lazy to do the math). Like the famous razor, a rule of thumb may be useful some of the time, but in the wrong hands it becomes a tool for rationalization.
Unfortunately, the definition of statistical significance in the article is incorrect. In the classical frequentist approach the p-value is the probability of the observation given that the hypothesis is CORRECT. The idea is that if the probability of an observation is sufficiently low, then we can reject the hypothesis. A level of 0.05 (5%) is arbitrary and is the result of tradition. The concept of statistical significance is indeed complex.
The p-value estimates the probability of making a type I error, i.e. it is the probability of wrongly rejecting the null hypothesis. The definition made in this article mirrors this concept perfectly. Your definition on the other hand is clearly wrong.
If we were to follow your logic, then a p-value of .02 would imply that a given result has a 2% chance of being detected given that the hypothesis is correct. It seems to me that this would make hypothesis testing very difficult.
No, this is exactly the idea of significance testing. If a result has the probability of 2% of being detected given that the hypothesis is correct, then we will reject the hypothesis if we are using a 5% significance level. On the other hand, if we are using a 1% significance level we cannot reject our hypothesis. Clearly, we have to select our significance level before doing the experiment. This is elementary statistics and can be found in any textbook of statistics.
This is why every intro stats class covers the Bonferroni correction, which tightens the significance threshold (0.05, by convention) to guard against the creeping significance of multiple comparisons.
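The Bonferroni correction itself is simple: divide the significance threshold by the number of comparisons and hold every p-value to that stricter bar. A minimal sketch with made-up function names, applied here to the blueberry result under the assumption of exactly 50 comparisons (the article says only "more than 50"):

```python
def bonferroni_threshold(comparisons, alpha=0.05):
    """Per-comparison threshold that keeps the overall chance of any
    false positive at or below alpha."""
    if comparisons < 1:
        raise ValueError("need at least one comparison")
    return alpha / comparisons


def bonferroni_significant(p_values, alpha=0.05):
    """Flag which p-values survive the Bonferroni-corrected threshold."""
    threshold = bonferroni_threshold(len(p_values), alpha)
    return [p < threshold for p in p_values]


if __name__ == "__main__":
    # Hypothetical: the blueberry p-value judged as one of 50 comparisons.
    print(bonferroni_threshold(50))        # stricter bar, roughly 0.001
    print(0.03 < bonferroni_threshold(50))  # False: 0.03 no longer counts
```

The correction is deliberately conservative, so it trades fewer false alarms for a greater risk of missing real effects.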
This places a significant responsibility on the editors of popular science magazines and web sites (such as this one). Most people do not read original papers; they rely on scientifically literate journalists to state the significant results.
We expect nonsense, propaganda and ignorance in the mass media (thank you Rupert!) but we should be able to rely on sophisticated analysis prior to publication in science media. I would be interested to hear if people think we get it.
Regarding the use of procedures such as the Bonferroni correction, it's as important to have a rationale for applying the correction as it is to have a rationale for the selection of the Type I error criterion (alpha) for individual tests. For any given sample size, there is always a trade-off between the occurrence of Type I and Type II errors -- between one's ability to avoid false rejection of the null hypothesis and to avoid failure to report the existence of (probably) real effects. Consider the following quotation from the excellent book, Statistics as Principled Argument: "When there are multiple tests within the same study or series of studies, a stylistic issue is unavoidable. As Diaconis (1985) put it, 'Multiplicity is one of the most prominent difficulties with data-analytic procedures. Roughly speaking, if enough different statistics are computed, some of them will be sure to show structure' (p. 9). In other words, random patterns will seem to contain something systematic when scrutinized in many particular ways. If you look at enough boulders, there is bound to be one that looks like a sculpted human face. Knowing this, if you apply extremely strict criteria for what is to be recognized as an intentionally carved face, you might miss the whole show on Easter Island" (Abelson, 1995, p. 70).
"If a result has the probability of 2% of being detected given that the hypothesis is correct, then we will reject the hypothesis if we are using a 5% significance level. On the other hand, if we are using a 1% significance level we cannot reject our hypothesis."
Apparently you have completely misunderstood the basic concept of "probability!"