Quotes criticizing significance testing

"After four decades of severe criticism, the ritual of null hypothesis significance testing---mechanical dichotomous decisions around a sacred .05 criterion---still persists. This article reviews the problems with this practice..." ... "What's wrong with [null hypothesis significance testing]? Well, among many other things, it does not tell us what we want to know, and we so much want to know what we want to know that, out of desperation, we nevertheless believe that it does!" (Cohen 1994)

"The statistical folkways of a more primitive past continue to dominate the local scene." (Rozeboom 1960)

"Are the effects of A and B different? They are always different---for some decimal place." (Tukey 1991, quoted in Cohen 1994)

"As a second example, consider significance tests. They are also widely overused and misused." (Cox 1977)

"The result is that non-statisticians tend to place undue reliance on single 'cookbook' techniques, and it has for example become impossible to get results published in some medical, psychological and biological journals without reporting significance values even if of doubtful validity. It is sad that students may actually be more confused and less numerate at the end of a 'service course' than they were at the beginning, and more likely to overlook a descriptive approach in favour of some inferential method which may be inappropriate or incorrectly executed." (Chatfield 1985)

"Estimates and measures of variability are more valuable than hypothesis tests." (R.M. Cormack, in discussion of Chatfield 1985)

"Statistics is intimately connected with science and technology, and few mathematicians have experience or understanding of methods of either. This I believe is what lies behind the grotesque emphasis on significance tests in statistics courses of all kinds; a mathematical apparatus has been erected with the notions of power, uniformly most powerful tests, uniformly most powerful unbiased tests, etc. etc., and this is taught to people, who, if they come away with no other notion, will remember that statistics is about significant differences. If they then become scientists they will be searching for uniformity, invariance, and repeatability, not differences; if technologists, they will primarily want to measure things and assess their accuracy, not to test theories. The apparatus on which their statistics course has been constructed is often worse than irrelevant---it is misleading about what is important in examining data and making inferences." (J.A. Nelder, in discussion of Chatfield 1985)

"Somehow there has developed a widespread belief that statistical analysis is legitimate only if it includes significance testing. This belief leads to, and is fostered by, numerous introductory statistics texts that are little more than catalogs of techniques for performing significance tests." (D.G. Altman, discussing Chatfield 1985)

"Failing to reject a null hypothesis is distinctly different from proving a null hypothesis; the difference in these interpretations is not merely a semantic point. Rather, the two interpretations can lead to quite different biological conclusions...." (Parkhurst 1985)

"In analysis, overemphasis on significance testing continues." (Preece 1982)

"The most commonly occurring weakness in the application of Fisherian methods is, I think, undue emphasis on tests of significance, and failure to recognize that in many types of experimental work, estimates of the treatment effects, together with estimates of the error to which they are subject, are the quantities of primary interest." (Yates 1964)

"The author recommends abandoning all statistical significance testing and suggests other ways of evaluating research results." ... "Another reason for the popularity of statistical significance testing is probably that the complicated mathematical procedures lend an air of scientific objectivity to conclusions." ... "Given that statistical significance testing usually involves a corrupt form of the scientific method and, at best, is of trivial scientific importance, journal editors should not require it as a necessary part of a publishable research article." (Carver 1978)

"The test of statistical significance in psychological research may be taken as an instance of a kind of essential mindlessness in the conduct of research." (Bakan 1966)

"The null hypothesis of no difference has been judged to be no longer a sound or fruitful basis for statistical investigation... Significance tests do not provide the information that scientists need, and, furthermore, they are not the most effective method for analyzing and summarizing data." (Clark 1963)

"A p value reached by classical methods is not a summary of the data. Nor does the p value attached to a result tell how strong or dependable a particular result is." (Cronbach and Snow 1977)

"What is the probability of obtaining a dead person (D) given that the person was hanged (H); that is, in symbol form, what is p(D|H)? Obviously, it will be very high, perhaps .97 or higher. Now, let us reverse the question: What is the probability that a person has been hanged (H) given that the person is dead (D); that is, what is p(H|D)? This time the probability will undoubtedly be very low, perhaps .01 or lower. No one would be likely to make the mistake of substituting the first estimate (.97) for the second (.01); that is, to accept .97 as the probability that a person has been hanged given that the person is dead. Even though this seems to be an unlikely mistake, it is exactly the kind of mistake that is made with the interpretation of statistical significance testing---by analogy, calculated estimates of p(D|H) are interpreted as if they were estimates of p(H|D), when they are clearly not the same." (Carver 1978)
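
The gap between the two conditional probabilities in Carver's analogy follows from Bayes' rule. The sketch below is only an illustration: the value .97 comes from the quote, while the base rate of hanging and the probability of death among the non-hanged are invented numbers, chosen so that the result lands near the ".01 or lower" that Carver mentions.

```python
def posterior(p_d_given_h, p_h, p_d_given_not_h):
    """Bayes' rule: P(H|D) from P(D|H), the base rate P(H), and P(D|not H)."""
    # Total probability of D over the two cases H and not-H.
    p_d = p_d_given_h * p_h + p_d_given_not_h * (1.0 - p_h)
    return p_d_given_h * p_h / p_d

p_d_given_h = 0.97       # P(dead | hanged), the figure used in the quote
p_h = 0.0001             # ASSUMED base rate of having been hanged
p_d_given_not_h = 0.01   # ASSUMED probability of being dead if not hanged

p_h_given_d = posterior(p_d_given_h, p_h, p_d_given_not_h)
print(f"P(D|H) = {p_d_given_h:.2f}, P(H|D) = {p_h_given_d:.4f}")
```

Changing the two assumed inputs changes P(H|D) while P(D|H) stays at .97, which is the point of the analogy: the first quantity cannot be read off from the second without the base rates.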

"What do we do without the tests, then? What we do without the tests has always in some measure been done in behavioral science and needs only to be done more and better: the application of imagination, common sense, and informed judgment, and the appropriate remaining research methods to achieve the scope, form, process, and purpose of scientific inference." (Morrison and Henkel 1970)

"The author believes that tests provide a poor model of most real problems, usually so poor that their objectivity is tangential and often too poor to be useful." (Pratt 1976, quoted by Yoccoz 1991)

"Overemphasis on tests of significance at the expense especially of interval estimation has long been condemned." (Cox 1977)

"The continued very extensive use of significance tests is alarming." (Cox 1986)

"In marked contrast to what is advocated by most statisticians, most evolutionary biologists and ecologists overemphasize the potential role of significance testing in their scientific practice. Biological significance should be emphasized rather than statistical significance. Furthermore, a survey of papers showed that the literature is infiltrated by an array of misconceptions about the use and interpretation of significance tests." ... "By far the most common error is to confound statistical significance with biological, scientific significance...." ... "Statements like 'the two populations are significantly different relative to parameter X (P=.004)' are found with no mention of the estimated difference. The difference is perhaps statistically significant at the level .004, but the reader has no idea if it is biologically significant." ... "Most biologists and other users of statistical methods still seem to be unaware that significance testing by itself sheds little light on the questions they are posing." (Yoccoz 1991)

"... We should not say 'we accept the null hypothesis' .... Even more important than what we say, in applied situations we should not use failure to disprove H0 as justification for taking actions that would be appropriate if H0 were true." (Parkhurst 1990)

"It is desirable to report the observed values of the test statistics and not just the p values. The quantitative results being tested, such as mean values, proportions, or correlation coefficients, should be given whether the test was significant or not." ... "Even if there is a large real effect a non-significant result is quite likely if the number of observations is small. Conversely, if the sample size is very large a statistically significant result may occur when there is only a small real effect. The statistical significance should not be taken as synonymous with clinical importance." (Altman et al. 1983)
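
Both halves of that sample-size remark can be checked with a few lines of arithmetic. The sketch below makes simplifying assumptions that are not in the quote: a two-sample z test with a known common standard deviation, equal group sizes, and an observed difference that happens to equal the true difference. All of the numbers are hypothetical.

```python
from math import erfc, sqrt

def two_sided_p(diff, sd, n_per_group):
    """Two-sided p value of a two-sample z test with known, equal sd."""
    z = abs(diff) / (sd * sqrt(2.0 / n_per_group))
    return erfc(z / sqrt(2.0))  # identical to 2 * (1 - Phi(z))

# Hypothetical cases: a large effect with few observations,
# and a tiny effect with very many observations.
p_large_effect_small_n = two_sided_p(diff=0.8, sd=1.0, n_per_group=5)
p_small_effect_large_n = two_sided_p(diff=0.02, sd=1.0, n_per_group=50_000)

print(f"difference 0.80 sd, n = 5 per group:      p = {p_large_effect_small_n:.3f}")
print(f"difference 0.02 sd, n = 50,000 per group: p = {p_small_effect_large_n:.4f}")
```

Under these assumptions the large effect is not significant at the .05 level and the tiny one is, so the p value alone says little about which difference matters. With five observations per group a t test would be the usual choice; the z test is used here only to keep the arithmetic transparent.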

"Dr. Johnstone's paper states the position over significance tests with which I am in almost complete agreement. They are widely used, yet are logically indefensible." ... "There is a firm measure of agreement amongst statisticians of all persuasions that significance tests, as inference procedures, are better replaced by estimation methods: that it is better to quote a confidence or credible interval for [the parameter] than merely a level referring only to one value of [the parameter], here 0. The reason for this preference is that estimation procedures provide more information: they tell one about reasonable alternatives and not just about reasonableness of one value." (D.V. Lindley, discussing Johnstone 1986)

"The emphasis given to formal tests of significance throughout [R.A. Fisher's] Statistical Methods...has caused scientific research workers to pay undue attention to the results of the tests of significance they perform on their data, particularly data derived from experiments, and too little to the estimates of the magnitude of the effects they are investigating." ... "The emphasis on tests of significance and the consideration of the results of each experiment in isolation, have had the unfortunate consequence that scientific workers have often regarded the execution of a test of significance on an experiment as the ultimate objective." (Yates 1951)

"...There are considerable dangers in overemphasizing the role of significance tests in the interpretation of data." (Cox 1977)

"In any particular application, graphical or other informal analysis may show that consistency or inconsistency with H0 is so clear cut that explicit calculation of p is unnecessary." (Cox 1977)

"It has been widely felt, probably for thirty years and more, that significance tests are overemphasized and often misused and that more emphasis should be put on estimation and prediction. While such a shift of emphasis does seem to be occurring, for example in medical statistics, the continued very extensive use of significance tests is on the one hand alarming and on the other evidence that they are aimed, even if imperfectly, at some widely felt need." (Cox 1986)

"The central point is that statistical significance is quite different from scientific significance and that therefore estimation ...of the magnitude of effects is in general essential regardless of whether statistically significant departure from the null hypothesis is achieved." (Cox 1977)

"Pencil and paper for construction of distributions, scatter diagrams, and run-charts to compare small groups and to detect trends, are more efficient methods of estimation than statistical inference that depends on variances and standard errors, as the simple techniques preserve the information in the original data." (Deming 1975)

"We admit with Sir Winston Churchill that it sometimes pays to admit the obvious: we do not perform an experiment to find out if two varieties of wheat or two drugs are equal. We know in advance without spending a dollar on an experiment that they are not equal. The difference between two treatments or between two areas or two groups of people, will show up as 'significantly different' if the experiment be conducted through a sufficient number of trials, even though the difference be so small that it is of no scientific or economic consequence. Likewise tests of whether the data of a survey or an experiment fit some particular curve is of no scientific or economic consequence.... With enough data no curve will fit the results of an experiment. The question that one faces in using any curve or any relationship is this: how robust are the conclusions? Would some other curve make safer predictions? Statistical significance of B/A thus conveys no knowledge, no basis for action." (Deming 1975)

"Under the usual teaching, the trusting student, to pass the course must forsake all the scientific sense that he has accumulated so far, and learn the book, mistakes and all." (Deming 1975)

"While [Edward C. Bryant] was at the University of Wyoming, someone came in from the Department of Animal Husbandry to announce to him an astounding scientific discovery---the fibres on the left side of the sheep and those on the right side are of different diameter. Dr. Bryant asked him how many fibres he had in the sample: answer, 50,000. This was a number big enough to establish significance. But what of it? Anyone would know in advance, without spending a dollar, that there is a difference between fibres of the left side and the right side of any sheep, or of n sheep combined. The question is whether the difference is of scientific importance." (Deming 1975)

In a survey of papers published in the American Economic Review, the authors found that "59% use the word 'significance' in ambiguous ways at one point meaning 'statistically significantly different from the null,' at another 'practically important' or 'greatly changing our scientific opinion,' with no distinction." (McCloskey and Ziliak 1996)

"Small wonder that students have trouble [with statistical hypothesis testing]. They may be trying to think." (Deming 1975)

In the same survey of American Economic Review papers, the authors found "despite the advice proffered in theoretical statistics, only 4 percent considered the power of their tests. One percent examined the power function." ... "And 69 percent did not report descriptive statistics---the means of the regression variables, for example---that would allow the reader to make a judgment about the economic significance of the results." (McCloskey and Ziliak 1996)

"The low and falling cost of calculation, together with a widespread though unarticulated realization that after all the significance test is not crucial to scientific questions, has meant that statistical significance has been valued at its cost. Essentially no one believes a finding of statistical significance or insignificance. This is bad for the temper of the field. My statistical significance is a 'finding'; yours is an ornamented prejudice." (McCloskey and Ziliak 1996)

"Data analysis methods in psychology still emphasize statistical significance testing, despite numerous articles demonstrating its severe deficiencies. It is now possible to use meta-analysis to show that reliance on significance testing retards the development of cumulative knowledge. The reform of teaching and practice will also require that researchers learn that the benefits that they believe flow from use of significance testing are illusory. Teachers must re-vamp their courses to bring students to understand that a) reliance on significance testing retards the growth of cumulative research knowledge; b) benefits widely believed to flow from significance testing do not in fact exist; c) significance testing methods must be replaced with point estimates and confidence intervals in individual studies and with meta-analyses and the integration of multiple studies. This reform is essential to the future progress of cumulative knowledge and psychological research." (Abstract of Schmidt 1996)

"...Consider this challenge: Can you articulate even one legitimate contribution that significance testing has made (or makes) to the research enterprise (i.e., any way in which it contributes to the development of cumulative scientific knowledge)? I believe you will not be able to do so." (Schmidt 1996)

"If the null hypothesis is not rejected, Fisher's position was that nothing could be concluded. But researchers find it hard to go to all the trouble of conducting a study only to conclude that nothing can be concluded." (Schmidt 1996)

"I believe that these false beliefs are a major cause of the addiction of researchers to significance tests. Many researchers believe that statistical significance testing confers important benefits that are in fact completely imaginary." (Schmidt 1996)

"An important part of the explanation [of continued use of significance testing] is that researchers hold false beliefs about significance testing, beliefs that tell them that significance testing offers important benefits to researchers that it in fact does not. Three of these beliefs are particularly important. The first is the false belief that the significance level of a study indicates the probability of successful replications of the study.... A second false belief widely held by researchers is that statistical significance level provides an index of the importance or size of a difference or relation.... The third false belief held by many researchers is the most devastating of all to the research enterprise. This is the belief that if a difference or relation is not statistically significant, then it is zero, or at least so small that it can safely be considered to be zero. This is the belief that if the null hypothesis is not rejected then it is to be accepted. This is the belief that a major benefit from significance tests is that they tell us whether a difference or effect is real or 'probably just occurred by chance'." (Schmidt 1996)

"If we were clairvoyant and could enter the mind of a typical researcher we might eavesdrop on the following thoughts:

Significance tests have been repeatedly criticized by methodological specialists, but I find them very useful in interpreting my research data, and I have no intention of giving them up. If my findings are not significant, then I know that they probably just occurred by chance and that the true difference is probably zero. If the result is significant, then I know I have a reliable finding. The p values from the significance tests tell me whether the relationships in my data are large enough to be important or not. I can also determine from the p value what the chances are that these findings would replicate if I conducted a new study. These are very valuable things for a researcher to know. I wish the critics of significance testing would recognize this fact.

Every one of these thoughts about the benefits of significance testing is false. I ask the reader to ponder this question: does this describe your thoughts about the significance test?" (Schmidt 1996)

"We can no longer tolerate a situation in which our upcoming generation of researchers are being trained to use discredited data analysis methods while the broader research enterprise of which they are to become a part has moved toward improved methods." (Schmidt 1996)

"Most statisticians are all too familiar with conversations [that] start:

Q: What is the purpose of your analysis?

A: I want to do a significance test.

Q: No, I mean what is the overall objective?

A (with puzzled look): I want to know if my results are significant.

And so on...." (Chatfield 1991)

"In a large majority of problems ... hypothesis testing is inappropriate: Set up the confidence interval and be done with it!" (Casella and Berger 1987)

"We have addressed a number of defenses for continued use of statistical hypothesis tests and have found inadequate reasons for continuing their use.... The display of the point estimates of the parameters, standard errors, and confidence intervals constructed by the author in making inferences on biological relevance is the most clear and meaningful approach toward the statistical analysis and its presentation." (Jones and Matloff 1986)

"Our greatest concern about reliance on statistical hypothesis testing in this aspect is not just its inadequacy in providing insight into the relevant scientific question, but, even worse, that the results can be highly misleading." (Jones and Matloff 1986)

"The purpose of this paper is severalfold. First, we attempt to convince the reader that at its worst, the results of statistical hypothesis testing can be seriously misleading, and at its best it offers no informational advantage over its alternatives; in fact it offers less." (Jones and Matloff 1986)

"One essential feature of tests is the dichotomous view of the world. There is a null hypothesis and an alternative (possibly just the complement), and which hypothesis you are in is treated as overridingly more important than where you are in it. This is often an inappropriate view." (Pratt 1976)

"[The desire for discipline] reduces the role of tests essentially to convention. Convention is useful in daily life, law, religion, and politics, but it impedes philosophy [i.e., science]." (Pratt 1976)

"Nevertheless, in many or most situations where tests are used in practice, they are not well articulated to the problem really at hand, and such virtues as they have are largely irrelevant." (Pratt 1976)

"How has the virtually barren technique of hypothesis testing come to assume such importance in the process by which we arrive at our conclusions from our data?" (Loftus 1991)

"Despite the stranglehold that hypothesis testing has on experimental psychology, I find it difficult to imagine a less insightful means of transitting from data to conclusions." (Loftus 1991)

"Whereas hypothesis testing emphasizes a very narrow question ('Do the population means fail to conform to a specific pattern?'), the use of confidence intervals emphasizes a much broader question ('What are the population means?'). Knowing what the means are, of course, implies knowing whether they fail to conform to a specific pattern, although the reverse is not true. In this sense, use of confidence intervals subsumes the process of hypothesis testing." (Loftus 1991)
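
The claim that a confidence interval subsumes the corresponding test can be illustrated for the simplest case, a difference between two means. The sketch below is not Loftus's procedure: it uses a normal approximation for both the interval and the p value, and the two samples are invented. With samples this small a t-based interval would be more appropriate; the approximation is used only to keep the example short.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def diff_ci_and_p(a, b, level=0.95):
    """Normal-approximation CI and two-sided p value for mean(a) - mean(b)."""
    diff = mean(a) - mean(b)
    se = sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))
    z_crit = NormalDist().inv_cdf(0.5 + level / 2.0)
    ci = (diff - z_crit * se, diff + z_crit * se)
    p = 2.0 * (1.0 - NormalDist().cdf(abs(diff) / se))
    return diff, ci, p

# Hypothetical measurements for two groups.
group_a = [5.1, 4.9, 5.6, 5.8, 5.3, 5.5, 5.0, 5.7]
group_b = [4.8, 4.6, 5.1, 4.9, 5.0, 4.7, 5.2, 4.5]

diff, ci, p = diff_ci_and_p(group_a, group_b)
ci_excludes_zero = ci[0] > 0.0 or ci[1] < 0.0
print(f"difference = {diff:.3f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f}), p = {p:.4f}")
print(f"CI excludes 0: {ci_excludes_zero}; p < .05: {p < 0.05}")
```

Under the same approximation, the interval excludes zero exactly when the test rejects at the matching level, so the test result can be read off the interval. The reverse does not hold: the p value gives neither the estimated difference nor its plausible range.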

"This remarkable state of affairs [overuse of significance testing] is analogous to engineers' teaching (and believing) that light consists only of waves while ignoring its particle characteristics---and losing in the process, of course, any motivation to pursue the most interesting puzzles and paradoxes in the field." (Loftus 1991)

"Most readers of this journal will recognize the limited value of hypothesis testing in the science of statistics. I am not sure that they all realize the extent to which it has become the primary tool in the religion of Statistics. Since the practitioners of that faith seem unable to cure their own folly, it is time we priests of the faith brought them around to realizing that there are more appropriate ways to get useful answers." (Salsburg 1985)

"[Researchers] pay undue attention to the results of tests of significance they perform on their data, particularly data derived from experiments, and too little to the estimates of the magnitude of the effects which they are investigating.... The emphasis on tests of significance, and the consideration of the results of each experiment in isolation, have had the unfortunate consequence that scientific workers have often regarded the execution of a test of significance on an experiment as the ultimate objective. Results are significant or not and that is the end to it." (Yates 1951)

"In the abstract of one paper we found the statement: 'The immunized ewes had a higher ovulation (P < .01) and produced more lambs than the controls (P < .01).' It would have been more to the point to comment that: 'Compared with the controls, the immunized ewes averaged 48% more ovulations and 25% more lambs were born', and to leave comment on statistical significance to the body of the paper." (Maindonald and Cox 1984)

"People have erroneous intuitions about the laws of chance. In particular, they regard a sample randomly drawn from a population as highly representative, that is, similar to the population in all essential characteristics. The prevalence of the belief and its unfortunate consequences for psychological research are illustrated by the responses of professional psychologists to a questionnaire concerning research decisions." (Tversky and Kahneman 1971)

"There are instances of research results presented in terms of probability values of 'statistical significance' alone, without noting the magnitude and importance of the relationships found. These attempts to use the probability levels of significance tests as measures of the strengths of relationships are very common and very mistaken." (Kish 1959)

"We shall marshal arguments against [significance] testing, leading to the conclusion that it be abandoned by all substantive science and not just by educational research and other social sciences which have begun to raise voices against the virtual tyranny of this branch of inference in the academic world." (Guttman 1985)

"Statistical hypothesis testing is commonly used inappropriately to analyze data, determine causality, and make decisions about significance in ecological risk assessment.... It discourages good toxicity testing and field studies, it provides less protection to ecosystems or their components that are difficult to sample or replicate, and it provides less protection when more treatments or responses are used. It provides a poor basis for decision-making because it does not generate a conclusion of no effect, it does not indicate the nature or magnitude of effects, it does not address effects at untested exposure levels, and it confounds effects and uncertainty.... Risk assessors should focus on analyzing the relationship between exposure and effects...." (Suter 1996)

"I argued that hypothesis testing is fundamentally inappropriate for ecological risk assessment, that its use has undesirable consequences for environmental protection, and that preferable alternatives exist for statistical analysis of data in ecological risk assessment. The conclusion of this paper is that ecological risk assessors should estimate risks rather than test hypotheses." (Suter 1996)

"...common usage of statistics seems to have become fossilized, mainly because of the view that standard statistics is the objective way to analyze data. Discarding this notion, and indeed embracing the need for subjectivity through Bayesian analysis, can lead to more flexible, powerful, and understandable analysis of data." (Berger and Berry 1988)

"We are better off abandoning the use of hypothesis tests entirely and concentrating on developing continuous measures of toxicity which can be used for estimation." (Salsburg 1986)

"I believe ... that hypothesis testing has been greatly overemphasized in psychology and in the other disciplines that use it. It has diverted our attention from crucial issues. Mesmerized by a single all-purpose, mechanized, 'objective' ritual in which we convert numbers into other numbers and get a yes-no answer, we have come to neglect close scrutiny of where the numbers come from." (Cohen 1990)

"... the primary product of a research inquiry is one or more measures of effect size, not p values." (Cohen 1990)

"The prevailing yes-no decision at the magic .05 level from a single research is a far cry from the use of informed judgment. Science simply doesn't work that way. A successful piece of research doesn't conclusively settle an issue, it just makes some theoretical proposition to some degree more [or less] likely." (Cohen 1990)

"... surely, God loves the .06 nearly as much as the .05." (Rosnow and Rosenthal 1989)

"One of the things I learned early on was that some things you learn aren't so." (Cohen 1990)

"When a Fisherian null hypothesis is rejected with an associated probability of, for example, .026, it is not the case that the probability that the null hypothesis is true is .026 (or less than .05, or any other value we can specify). Given our framework of probability as long-run relative frequency---as much as we might wish it to be otherwise---this result does not tell us about the truth of the null hypothesis, given the data. (For this we have to go to Bayesian or likelihood statistics, in which probability is not relative frequency but degree of belief.)" (Cohen 1990)

"Despite widespread misconceptions to the contrary, the rejection of a given null hypothesis gives us no basis for estimating the probability that a replication of the research will again result in rejecting that null hypothesis." (Cohen 1990)
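
One way to see why a rejection does not fix the chance of a successful replication is to compute that chance for several possible true effects. The sketch below assumes a two-sided z test, a replication of the same size as the original study, and a true effect expressed in standard-error units (`true_z`), a quantity that is unknown in practice. The specific values tried are illustrative only.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def replication_power(true_z, alpha=0.05):
    """Chance that a same-sized replication is significant (two-sided z test).

    true_z is the true effect divided by its standard error.
    """
    z_crit = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    upper_tail = 1.0 - STD_NORMAL.cdf(z_crit - true_z)
    lower_tail = STD_NORMAL.cdf(-z_crit - true_z)
    return upper_tail + lower_tail

z_crit = STD_NORMAL.inv_cdf(0.975)

power_if_null_true = replication_power(0.0)
power_if_effect_at_threshold = replication_power(z_crit)
power_if_effect_larger = replication_power(2.8)

for label, value in [
    ("true effect is zero", power_if_null_true),
    ("true effect sits exactly at the .05 threshold", power_if_effect_at_threshold),
    ("true effect is 2.8 standard errors", power_if_effect_larger),
]:
    print(f"{label}: P(replication significant) = {value:.3f}")
```

Even in the favourable case where the true effect equals an estimate that was just significant at .05, a same-sized replication reaches significance only about half the time under these assumptions. The chance depends on the unknown true effect, not on the original p value.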

"Of course, everyone knows that failure to reject the Fisherian null hypothesis does not warrant the conclusion that it is true. Fisher certainly knew and emphasized it, and our textbooks duly so instruct us. Yet how often do we read in the discussion and conclusions of articles now appearing in our most prestigious journals that 'there is no difference' or 'no relationship'?" (Cohen 1990)

"A little thought reveals a fact widely understood among statisticians: The null hypothesis, taken literally (and that's the only way you can take it in formal hypothesis testing), is always false in the real world.... If it is false, even to a tiny degree, it must be the case that a large enough sample will produce a significant result and lead to its rejection. So if the null hypothesis is always false, what's the big deal about rejecting it?" (Cohen 1990)