Error statistics aims to provide a philosophical foundation for the application of frequentist statistics in science. As in any frequency-based approach, error statistics adheres to what I consider to be the fundamental tenet of frequentist learning: any particular data-based inference is deemed well-warranted only to the extent that it is the result of a procedure with good sampling characteristics.
What kind of procedures are we talking about, and what characteristics of those procedures ought we to care about? The error statistical approach distinguishes itself from other frequentist frameworks (e.g., frequentist statistical decision theory) by the answer it gives to that question. Particular attention is paid to tests, by which I mean procedures that take some data and a statistical hypothesis as inputs and issue a binary pass/fail result. (As we’ll see, the testing framework easily encompasses estimation by holding the data input fixed and varying the statistical hypothesis input.) The error statistical worth of a test is related, sensibly enough, to the (in)frequency with which the test produces an erroneous conclusion, and, critically, what that error rate indicates about the capacity of the test to detect error in the case at hand. This notion is codified by the Severity Principle.
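To make the pass/fail picture concrete, here is a minimal sketch in Python; the names and types are my own illustrative choices, not part of any established framework. It renders a test as a procedure from (data, hypothesis) to a binary grade, and shows how estimation falls out by holding the data fixed and varying the hypothesis.

```python
from typing import Callable, Iterable, TypeVar

Data = TypeVar("Data")
Hypothesis = TypeVar("Hypothesis")

# A test takes data and a statistical hypothesis and returns a binary
# grade: True for "pass", False for "fail".
Test = Callable[[Data, Hypothesis], bool]

def invert_test(test: Test, data: Data,
                candidates: Iterable[Hypothesis]) -> list[Hypothesis]:
    """Estimation via testing: hold the data fixed, vary the hypothesis,
    and collect every hypothesis that earns a passing grade."""
    return [h for h in candidates if test(data, h)]
```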
Modus tollens and maximally severe tests
The most straightforward way to understand the Severity Principle (for me, anyway) is as an extension of a rule of inference of classical logic that has the delightfully baroque name modus tollendo tollens, or more simply, modus tollens.
To apply modus tollens, one starts with two premises: first, "if P, then Q"; and second, "not-Q". From these, modus tollens produces the conclusion "not-P". Modus tollens is also known as the law of contraposition because contraposition applied to the first premise yields "if not-Q, then not-P".
For the purpose of exposition, I offer this slight reformulation of modus tollens: the two premises are "if not-H, then not-P" and "P", and the conclusion is "H". Here H represents a hypothesis and P represents a passing result from some test of H. The premise "if not-H, then not-P" expresses a property of the test, to wit, that it is incapable of producing an erroneous passing grade. The premise "P" expresses the assertion that for the observed data, the test procedure has produced a passing grade. In the language of error statistics, one says that H has passed a maximally severe test.
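Since the reformulated rule is purely propositional, a tiny script can verify it exhaustively. This sketch (in Python, my own choice of presentation) checks that whenever both premises hold, the conclusion follows.

```python
from itertools import product

# Verify the reformulated modus tollens by brute force over truth values.
# Premises: "if not-H, then not-P" and "P"; conclusion: "H".
for H, P in product([False, True], repeat=2):
    premise_1 = H or (not P)  # "if not-H, then not-P" as a material conditional
    premise_2 = P
    if premise_1 and premise_2:
        assert H, "modus tollens would be invalid here"

print("Modus tollens holds in every case where both premises are true.")
```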
“Frequencification” of modus tollens
The above reformulation introduced the notion of a test into the premises, but because the first premise posited a perfect test, the whole idea of a test seemed rather superfluous. But classical logic is silent in the face of imperfect tests; and since learning from imperfect tests is, after all, possible, we seek an extension of modus tollens to imperfect tests.
The first step is to consider the opposite of a maximally severe test. Such a minimally severe test would be one that gives a passing grade to H irrespective of the data. We can “frequencify” this notion: for a minimally severe test, we have

Fr(P | not-H) = 1, or equivalently, Fr(not-P | not-H) = 0.

(I use the symbol “Fr” to be clear that I’m writing about relative frequencies.) The analogous frequencification of a maximally severe test is

Fr(not-P | not-H) = 1, together with the inequality Fr(P | H) > 0.
The equation in the above line captures only the first premise of modus tollens, so it’s a necessary but not sufficient component of the notion of a maximally severe test. To see this, notice that a procedure that always gives a failing grade to H irrespective of the data does satisfy the above equation; however, modus tollens will never apply. The inequality encodes the fact that part of what we mean by the word “test” is that it is at least possible to observe a passing grade, or to put it another way, that there must be some possible data that accords with H.
The key point is that the necessary conditions for a minimally and a maximally severe test are, respectively, Fr(not-P | not-H) = 0 and Fr(not-P | not-H) = 1. This suggests that (provided we’ve checked that Fr(P | H) > 0) we can measure the severity of a test by the value of Fr(not-P | not-H).
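As an illustration of measuring severity by Fr(not-P | not-H), here is a small simulation sketch. The particular setup — a one-sided z-test of a normal mean with known unit variance, and a specific point alternative standing in for not-H — is my own illustrative choice, not anything prescribed by the error statistical framework.

```python
import numpy as np

rng = np.random.default_rng(20240101)

# Illustrative setup (my assumptions): H says mu <= 0; the sample has
# n = 25 unit-variance normal observations; H earns a passing grade
# unless the standardized sample mean exceeds the 5% critical value.
n, z_crit = 25, 1.645

def passes_H(sample: np.ndarray) -> bool:
    z = sample.mean() * np.sqrt(n)  # known unit variance
    return z < z_crit               # not rejecting == passing grade for H

# Estimate Fr(not-P | not-H) by simulating under a specific alternative
# mu = 0.5 (one concrete way of making "not-H" operational).
trials, mu_alt = 100_000, 0.5
fails = sum(not passes_H(rng.normal(mu_alt, 1.0, n)) for _ in range(trials))
print(f"Estimated Fr(not-P | not-H at mu = {mu_alt}): {fails / trials:.3f}")
# Prints roughly 0.80: a moderately severe test against this alternative.
```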
The Severity Principle
At this point, I’m just going to quote straight from Mayo’s latest exposition of severity:
“Severity Principle. Data x0 (produced by process G) provides good evidence for hypothesis H (just) to the extent that test T severely passes H with x0.

A hypothesis H passes a severe test T with data x0 if,

(S-1) x0 accords with H, (for a suitable notion of accordance) and

(S-2) with very high [frequency], test T would have produced a result that accords less well with H than x0 does, if H were false or incorrect.”
My previous discussion was based on the idea of pass/fail testing, whereas condition (S-2) is phrased in terms of “accord[ing] less well”, the key word being “less”. Mayo’s definition does not merely introduce the notion of accordance with H; it demands a totally ordered set to express it.
To connect my exposition here with Mayo’s definition, we can now recognize that, just as school exams produce both a numerical grade and a pass/fail categorization, a statistical data-based pass/fail test will (almost always) be a dichotomization of a test statistic that makes finer distinctions, and the choice of dichotomization threshold is essentially arbitrary. Mayo’s definition of a severe test resolves this arbitrariness by placing the threshold for a passing grade right at the observed value of the test statistic. (The threshold is notionally infinitesimally below the observed value of the test statistic, so that H just barely passes. Alas, the reals contain no infinitesimals.)
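To give a preview of how placing the threshold at the observed test statistic plays out, here is a sketch under the same kind of illustrative setup as before (a one-sided normal-mean test; all specifics are my own assumptions). Since the precise meaning of “accords less well” is deferred to part two, treat this only as a preview, not a definitive recipe.

```python
import numpy as np
from scipy.stats import norm

# Illustrative setup (my assumptions): the test statistic is the
# standardized sample mean for H: mu <= 0, with n = 25 and unit variance.
n, x_bar = 25, 0.2
z_obs = x_bar * np.sqrt(n)  # observed test statistic: 1.0

def freq_worse_accordance(mu_alt: float) -> float:
    """Frequency of a result according less well with H than the observed
    one (i.e., a larger z), computed under the alternative mu = mu_alt."""
    return norm.sf(z_obs - mu_alt * np.sqrt(n))

for mu in (0.1, 0.2, 0.5):
    print(f"Fr(worse accordance with H | mu = {mu}): {freq_worse_accordance(mu):.3f}")
```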
To make the notion of the severity of a test T of H mathematically precise and operational, we’ll have to get more specific about what is meant by “accords less well”. That will be the topic of part two.