Main focus of the article:
My comments on the article:
Carver’s paper needs to be read with care, keeping in mind his intended audience, i.e. social science researchers. He makes four suggestions on how such researchers can minimise the perceived importance of statistical significance testing in their discipline’s publications, namely:
- “Insist that ‘statistically’ be inserted in front of ‘significant’ in research reports.” (p. 288)
- “Insist that the results always be interpreted with respect to the data first, and statistical significance, second.” (p. 288)
- “Insist that attention be paid to the size of the effect, whether it is statistically significant or not.” (p. 288)
- “Insist that new journal editors present their views on statistical significance testing prior to selection.” (p. 289)
It is often the cause of much frustration, particularly for me as a statistician, that authors write that a result was significant when they mean it was important or large. Similarly, authors often use the word normal when they mean typical or usual. Certain words, like normal and significant, have a special meaning in statistics and should therefore be avoided, in their everyday senses, by researchers when writing up their reports. Carver’s suggestion to insert the word “statistically” is admirable in that it would help to avoid such frustration, and it might also remind authors to be more careful in their choice of words.
In this point, Carver suggests that statistical significance testing is somehow a corruption of the scientific method, and that authors should therefore look at what their data tell them about their research question before performing any kind of statistical test. I would disagree with Carver here: statistical significance testing, if done properly, is central to the scientific method.
If the appropriate hypothesis test is performed, the data and the hypothesis test will tell you the same thing. Since the research question determines the hypothesis to be tested and the statistical technique(s) suitable for the analysis, the assumptions required by each technique are known in advance and can be accommodated in the research design. Equally, these assumptions can, and should, be checked as part of performing the statistical analysis.
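As a rough illustration of checking assumptions as part of the analysis, the sketch below uses hypothetical data (the groups, sample sizes, and choice of tests are my own, not Carver’s): it tests normality and equal variances before choosing between a pooled-variance and a Welch two-sample t-test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Hypothetical data for two groups (assumed drawn from normal distributions)
group_a = rng.normal(loc=10.0, scale=2.0, size=30)
group_b = rng.normal(loc=11.0, scale=2.0, size=30)

# Check the normality assumption for each group (Shapiro-Wilk test)
_, p_norm_a = stats.shapiro(group_a)
_, p_norm_b = stats.shapiro(group_b)

# Check the equal-variance assumption (Levene's test)
_, p_var = stats.levene(group_a, group_b)

# Use the pooled-variance t-test only if the equal-variance assumption
# is tenable; otherwise fall back to Welch's t-test (equal_var=False)
equal_var = bool(p_var > 0.05)
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=equal_var)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```

In practice the assumption checks would inform the research design as well, not just the choice of test after the data are collected.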
Carver suggests that effect sizes should be reported regardless of whether or not a result is statistically significant. The idea is that small, trivial, but statistically significant results will be seen for what they are, and that large but statistically non-significant results will also be clearly seen. It is tempting to follow Carver’s suggestion; however, non-significant results, regardless of effect size, indicate only that there is insufficient evidence to reject the null hypothesis (at the given significance level). Effect sizes should be reported where possible, but care should be taken in interpreting them, just as when interpreting p-values.
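To illustrate the tension here, consider a small simulation sketch (made-up normal data, not anything from Carver’s paper): with a large enough sample, even a trivial effect becomes statistically significant, while a large effect in a tiny sample may well fail to reach significance.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# A trivial effect (0.05 standard deviations) with a huge sample:
# statistically significant, practically meaningless
big_a = rng.normal(0.00, 1.0, size=50_000)
big_b = rng.normal(0.05, 1.0, size=50_000)
t_big, p_big = stats.ttest_ind(big_a, big_b)

# A large effect (1 standard deviation) with a tiny sample:
# may fail to reach significance purely through lack of power
small_a = rng.normal(0.0, 1.0, size=5)
small_b = rng.normal(1.0, 1.0, size=5)
t_small, p_small = stats.ttest_ind(small_a, small_b)

print(f"n=50000, effect=0.05 SD: p = {p_big:.2e}")
print(f"n=5,     effect=1.00 SD: p = {p_small:.3f}")
```

Reporting the effect size alongside the p-value makes the first case visible for what it is; interpreting the second still requires the caution about insufficient evidence noted above.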
Carver suggests that journal editors should be selected depending on their views on statistical significance testing. To some degree I would support this. If a potential journal editor believes, erroneously, that statistical significance is some sort of guarantee of good research worthy of publication, then they should not be selected. Equally, if the potential editor believes that there is something wrong or unscientific about statistical significance testing, then they should also be avoided. Good science is not about whether or not a researcher uses statistical significance testing, but rather whether or not they follow good research practices, in an ethical manner, and interpret their data objectively and dispassionately.
Carver also provides his suggestions for replacements for statistical significance testing. He opens these suggestions with the statement that “The best research articles are those that include no tests of statistical significance” (p. 289). Naturally, as a statistician, I would dispute this; however, I would agree that there are many poor studies that use statistical tests in an effort to appear rigorous, just as there are many high-quality studies that do not use statistical tests at all. I would argue that the pressure on academics to publish, combined with the over-emphasis on publishing only statistically significant results, has flooded the literature with poor-quality studies. Add to this an explosion of publication outlets, online journals offering almost immediate publication, and a plethora of software that allows researchers to perform, and often incorrectly interpret, their own statistical analyses, and it is little wonder that Carver suggests the best articles contain no statistical tests.
Carver’s suggestions are, as he states, aimed at helping the researcher get their work published without using statistical significance tests. These suggestions are of two kinds: those for single studies and those for multiple studies. For single studies, Carver suggests that the author report a measure of effect (e.g. the difference between sample means for a two-sample t-test) and a measure of sampling error (e.g. the standard error term for the same t-test). Carver argues that by reporting the effect size and sampling error, the author can avoid giving a p-value that says nothing about whether the result is large or trivial. I would argue that a researcher should report the test statistic (e.g. the t-statistic), the sample size, the p-value, the a priori significance level (α), the estimated power of the test, the estimated effect size (if possible) and a confidence interval (if possible). Of course, this relies on both the researcher and the audience understanding what each of these values indicates, and on the author interpreting them correctly.
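A minimal sketch of this fuller reporting for a two-sample t-test, using hypothetical data; the Cohen’s d formula and the post-hoc power calculation via the noncentral t distribution are my own illustrative choices, only one of several reasonable approaches.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Hypothetical two-sample data
a = rng.normal(10.0, 2.0, size=40)
b = rng.normal(11.5, 2.0, size=40)

alpha = 0.05  # a priori significance level
n1, n2 = len(a), len(b)
t_stat, p_value = stats.ttest_ind(a, b)

# Cohen's d (effect size) using the pooled standard deviation
pooled_sd = np.sqrt(((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1))
                    / (n1 + n2 - 2))
cohens_d = (a.mean() - b.mean()) / pooled_sd

# 95% confidence interval for the difference in means
df = n1 + n2 - 2
se = pooled_sd * np.sqrt(1 / n1 + 1 / n2)
t_crit = stats.t.ppf(1 - alpha / 2, df)
diff = a.mean() - b.mean()
ci = (diff - t_crit * se, diff + t_crit * se)

# Post-hoc power estimated from the observed effect size,
# via the noncentral t distribution (two-sided test)
nc = abs(cohens_d) * np.sqrt(n1 * n2 / (n1 + n2))
power = 1 - stats.nct.cdf(t_crit, df, nc) + stats.nct.cdf(-t_crit, df, nc)

print(f"t({df}) = {t_stat:.3f}, p = {p_value:.4f}, alpha = {alpha}, n = {n1}+{n2}")
print(f"d = {cohens_d:.3f}, 95% CI for difference: ({ci[0]:.3f}, {ci[1]:.3f})")
print(f"estimated power = {power:.3f}")
```

Reporting all of these together lets the reader judge both the statistical and the practical significance of the result, which is precisely the point at issue.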
For multiple studies, Carver recommends looking at how the results are replicated across the studies. Here I would definitely agree with Carver. It is important that researchers look at how their results are replicated across other studies, and to do this they need some measure of effect size along with some idea of the observed variability in each study. Replication is key to ensuring that what we have observed in one study is an actual effect and not simply random chance. Unfortunately, replication is extremely hard to achieve without repeating your exact study, along with all its controls, on a different sample.
In his conclusion, Carver argues that “Researchers should be embarrassed any time they report a [p-value] and then claim that it shows that their results were significant” (p. 292), and that such reporting has led to fads based on trivial effects. I would disagree with Carver that researchers need be embarrassed by reporting a p-value, but they should certainly be embarrassed by reporting, and misrepresenting, statistically significant results as somehow indicating importance. There are many statistically significant results that are truly trivial and meaningless in a practical sense, and there are many non-significant results that are clinically or practically important. Statistical significance concerns the probability of observing the result obtained (or one more extreme) given that the null hypothesis is true. If the null hypothesis or the significance level is not set appropriately for the study, then the results are likely to be less practically useful than a researcher’s intuitive assessment of the data, and therein lies part of the problem.
Full citation and link to document:
Carver, R. P. (1993). The Case Against Statistical Significance Testing, Revisited. The Journal of Experimental Education, 61(4), 287–292. Retrieved from JSTOR: 20152382
CITE THIS AS:
Ovens, Matthew. “Carver (1993) The Case Against Statistical Significance Testing, Revisited.” Retrieved from YourStatsGuru.
First published 2012 | Last updated: 21 January 2018