As discussed in the slide deck, when you’re designing a two-sided confidence interval procedure you have some freedom to decide, for each value of the true parameter, how much probability mass you will put above (literally above if we’re talking about the plot on slide 6) the upper limit and how much you’ll put below the lower limit. The confidence coverage property only constrains the sum of these two chunks of probability mass.

The kinds of inferences for which the SEV function was designed are one-sided, directional inferences — upper-bounding inferences and lower-bounding inferences — so there's no arbitrariness to SEV with well-behaved models in fixed sample size designs. Sequential designs introduce multiple thresholds at which alpha can be "spent", so even for a simple one-sided test there is already an element of arbitrariness that must be eliminated by recourse to an alpha-spending function, an expected sample size minimization, or some other principle that eats up the degrees of freedom left over after imposing the Type I error constraint. There is likewise arbitrariness in specifying a one-sided confidence procedure for sequential trials — as with two-sided intervals, there are multiple boundaries to specify at each possible parameter value and the confidence coverage constraint only ties up one degree of freedom.

In the last post I asserted that the conditional procedure was an exact confidence procedure. Here's the math. Let *q*_{4}(*α*, *μ*) and *q*_{100}(*α*, *μ*) be the quantile functions of the conditional distributions:

*q*_{4}(*α*, *μ*) = the *α* quantile of the distribution of X̅_{4} given X̅_{4} > 165; *q*_{100}(*α*, *μ*) = the *α* quantile of the distribution of X̅_{100} given X̅_{4} ≤ 165 (both computed under *μ*).

Then the confidence coverage of the conditional procedure is

Pr (X̅_{4} ≥ *q*_{4}(*α*, *μ*), X̅_{4} > 165; *μ*) + Pr (X̅_{100} ≥ *q*_{100}(*α*, *μ*), X̅_{4} ≤ 165; *μ*) = (1 − *α*) Pr (X̅_{4} > 165; *μ*) + (1 − *α*) Pr (X̅_{4} ≤ 165; *μ*) = 1 − *α*.

The conditional procedure had the difficulty that its inferences could contradict the Type I error rate of the design at the first look. However, we can replace the quantile functions with arbitrary functions as long as they satisfy that same equality for all values of *μ* and this will also define an exact confidence procedure. The question then becomes what principle/set of constraints/optimization objective can be used to specify a unique choice.
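The exactness claim is also easy to check by simulation. Here's a minimal sketch (my own; the choices *α* = 0.1, *μ* = 157, and the seed are arbitrary) that estimates the two conditional quantiles empirically and confirms that the total coverage of the lower-bounding conditional procedure comes out to 1 − *α*:

```python
# Sketch: simulate the design (stop at n = 4 if the mean exceeds 165,
# otherwise continue to n = 100) and check that a lower-bounding procedure
# built from the *conditional* alpha-quantiles has exact coverage.
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, alpha = 157.0, 10.0, 0.1
n_sim = 200_000

xbar4 = rng.normal(mu, sigma / np.sqrt(4), n_sim)
# xbar100 reuses the first 4 observations plus 96 fresh ones
xbar100 = 0.04 * xbar4 + 0.96 * rng.normal(mu, sigma / np.sqrt(96), n_sim)

stopped = xbar4 > 165
# alpha-quantiles of the two conditional distributions at the true mu
# (estimated here from the simulation itself)
q4 = np.quantile(xbar4[stopped], alpha)
q100 = np.quantile(xbar100[~stopped], alpha)

# the procedure covers mu exactly when the realized statistic sits at or
# above the alpha-quantile of its conditional distribution
covered = np.where(stopped, xbar4 >= q4, xbar100 >= q100)
print(covered.mean())  # close to 1 - alpha = 0.9
```

The law-of-total-probability structure of the coverage argument is visible in the last line: each branch covers with conditional probability 1 − *α*, so the mixture does too.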

This sort of procedure offers very fine control over alpha-spending at each parameter value, control that is not available via the orderings on the sample space discussed in the last post that treat values at different sample sizes as directly comparable. But the phrasing of the (S-2) criterion really strongly points to that kind of direct comparison of sample space elements, and Figure 4 of the last post shows that this is a non-starter. So, to defend the SEV approach from my critique it will be necessary to: (i) overhaul (S-2) to allow for the sort of fine control available to fully general confidence procedures, (ii) come up with a principle for uniquely identifying one particular procedure as the SEV procedure, ideally a principle that is in line with severity reasoning as it has been expounded by its proponents up to this point — oh, and (iii) satisfy PRESS, let's not forget that.

I had a different follow-up planned for my last post but I made a discovery (see title) that caused me to change course. Previously I had made the rather weak point that the SEV function had some odd properties that I didn’t think made sense for inference. Mayo’s response (on Twitter) was: “The primary purpose of the SEV requirement is to block inference as poorly warranted, & rigged exes have bad distance measures.” In this post I’ll argue that the SEV function has properties that I don’t think anyone can claim make sense for inference, and I’ll draw out the consequences of affirming the severity rationale in spite of its possession of these undesirable properties.

Here I’ll examine the SEV function in the context of a modification of Mayo’s “water plant accident” example. For the picturesque details you can follow the link; I’ll stick with the math. The model is normal with unknown mean *μ* and a standard deviation of 10. Here we are interested in testing *H*_{0}: *μ* ≤ 150 vs. *H*_{1}: *μ* > 150. Mayo looks at the test based on the mean of 100 samples, which I will call x̅_{100}. My modification is this: we’ll check the mean after collecting 4 samples, x̅_{4}, and reject the null if it’s greater than some threshold; otherwise we’ll collect the remaining 96 samples and test again using x̅_{100} as our statistic. Which threshold? It hardly matters, but let’s say the threshold for rejection of *H*_{0} and cessation of data collection at *n* = 4 is at x̅_{4} = 165, three standard errors from the null. (This “spends” 0.00135 of whatever Type I error rate we’re willing to tolerate. In Mayo’s example the Type I error rate is set to 0.022, corresponding to a threshold at x̅_{100} = 152, two standard errors from the null; in our case, to compensate for the look at *n* = 4 the threshold at *n* = 100 increases a very tiny amount — it’s 152.02. Nothing turns on these details.) I’ll also consider what happens when the design is to collect not 96 but 896 additional samples for a total of 900 before the second look.
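The threshold arithmetic can be reproduced numerically. A sketch (mine; I assume the overall Type I error target is the fixed-design rate at threshold 152, i.e. 1 − Φ(2) ≈ 0.0228, which the 0.022 figure rounds from) that solves for the second-look boundary given the first-look boundary at 165:

```python
# Solve for the second-look rejection threshold c2 such that
#   Pr(Xbar4 > 165) + Pr(Xbar4 <= 165, Xbar100 > c2) = 1 - Phi(2)
# under the null mu = 150. The joint distribution of the two means is
# bivariate normal with SEs 5 and 1 and covariance 1 (see below).
from scipy.optimize import brentq
from scipy.stats import multivariate_normal, norm

mu0, c1 = 150.0, 165.0
alpha_total = norm.sf(2.0)             # ~0.0228, the "0.022" of the example
alpha_first = norm.sf((c1 - mu0) / 5)  # 0.00135 spent at the first look

def type1_error(c2):
    joint = multivariate_normal(mean=[mu0, mu0], cov=[[25.0, 1.0], [1.0, 1.0]])
    # Pr(Xbar4 <= c1, Xbar100 > c2)
    #   = Pr(Xbar4 <= c1) - Pr(Xbar4 <= c1, Xbar100 <= c2)
    continue_reject = norm.cdf((c1 - mu0) / 5) - joint.cdf([c1, c2])
    return alpha_first + continue_reject

c2 = brentq(lambda c: type1_error(c) - alpha_total, 151.0, 154.0)
print(round(c2, 2))  # ~152.02
```

The tiny upward shift from 152 to 152.02 is the compensation for the alpha spent at the first look.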

This is an early stopping design (a.k.a. group sequential design, adaptive design); they’re common in clinical trials where it’s desirable to minimize the number of patients in the event that strong evidence has been observed. Is it a “rigged” example? It sure is. The rigging lies in the fact that in clinical trials the “looks” at the data wouldn’t be as early as this — it generally isn’t worth looking when only 1/25 of the second-look sample size has been observed, much less 1/225. By using this schedule I am deliberately amplifying problems for frequentist inference that already exist in a milder form in more typical trial designs. But we’ve been told that SEV’s primary purpose is to block poorly warranted inferences, and the question we should be asking isn’t “is it rigged?” — it’s whether SEV actually does a reasonable job even in, or perhaps especially in, setups that don’t make a lot of sense. It is, if you will, a severe test of severity reasoning. In any event, I think even a weird design like this ought to be included in the domain of severity reasoning’s application since it’s just twiddling the design parameters of a well-accepted approach.

The postulate on which I’m going to base my argument is this: when early stopping is possible but doesn’t actually occur, the sample mean is consistent for *μ* and its standard deviation decreases at the usual O(*n*^{-½}) rate. So, supposing that we haven’t stopped at the first look, the width of the interval in which the value of *μ* can be bounded ought *in all cases* to shrink to zero as we consider designs with larger and larger second-look sample size. This is not a likelihood-based notion — this is based on the conditional distribution of the estimator. When I say SEV doesn’t work, I mean that it fails to adhere to this “Precision RespEcts Sample Size” (PRESS) postulate. If you don’t accept this postulate then I invite you to simulate some data. If you *still* don’t accept this postulate then my argument that SEV doesn’t work may not be convincing to you, but you’ll find that you have to take on board some troubling consequences nevertheless.
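For anyone inclined to take up that invitation, here is a minimal simulation sketch (mine; *μ* = 163 and the seed are arbitrary choices). Conditional on not stopping at the first look, the spread of the final sample mean still shrinks as the second-look sample size grows:

```python
# Conditional on no early stop (Xbar4 <= 165), the SD of the final
# sample mean still shrinks roughly like 1/sqrt(n).
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, c1 = 163.0, 10.0, 165.0

def sd_given_no_stop(n_total, n_sim=200_000):
    xbar4 = rng.normal(mu, sigma / np.sqrt(4), n_sim)
    rest = rng.normal(mu, sigma / np.sqrt(n_total - 4), n_sim)
    xbar_final = (4 * xbar4 + (n_total - 4) * rest) / n_total
    # keep only the trials that did not stop early
    return xbar_final[xbar4 <= c1].std()

print(sd_given_no_stop(100), sd_given_no_stop(900))
# the second SD is roughly a third of the first
```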

(Just to set expectations: the rest of this post is about ten times longer than this introduction; there are equations and plots.)

Let’s recap the severity criteria (*Error Statistics*, pg 164):

A hypothesis *H* passes a severe test *T* with data **x**_{0} if,

(S-1) **x**_{0} accords with *H*, (for a suitable notion of accordance) and

(S-2) with very high probability, test *T* would have produced a result that accords less well with *H* than **x**_{0} does, if *H* were false or incorrect.

Equivalently, (S-2) can be stated:

(S-2)*: with very low probability, test *T* would have produced a result that accords as well as or better with *H* than **x**_{0} does, if *H* were false or incorrect.

In the fixed sample size design, Mayo instantiates the above severity criteria as the SEV function, thus:

SEV ( *μ* > *μ*′ ) = min_{*μ* ≤ *μ*′} Pr (*d*(**X**) < *d*(**x**_{0}); *μ*)

in which *d*(⋅) is a function defining a test statistic and SEV is to be thought of as a function of *μ*′. (This isn’t quite how she puts it — she generally writes Pr (*d*(**X**) < *d*(**x**_{0}); *μ* ≤ *μ*′) and then writes about finding the lower bound of the probability under *μ* ≤ *μ*′. It comes to the same thing.)

The word “suitable” in the phrase “a suitable notion of accordance” is doing a **lot** of work here. In the fixed sample size design it’s straightforward to translate the notion of “a result that accords less well with *μ* > *μ*′ than **x**_{0} does” to the event “d(**X**) < d(**x**_{0})”. Accordance is not so easy to nail down when the test statistic could be calculated at more than one possible sample size. Suppose we observe, say, x̅_{4} = 160 — does this accord more, less, or equally well with larger values of *μ* than, say, x̅_{100} = 151.5?

This is where I got stumped when I last looked at early stopping designs; I came up with three fairly obvious candidate notions of accordance but none of them resulted in SEV functions that made sense. This was unacceptable to me because I wanted to be certain of not constructing a straw man — at that time I sought a SEV analysis that was clearly and unassailably correct by error statistical criteria. I would have liked to use the test statistic associated with a uniformly most powerful (UMP) test but I couldn’t find such a test.

(In retrospect looking for a UMP test was pretty silly: the realized sample size and mean are jointly sufficient statistics — you can’t do better than that — and power can always be increased by pushing the early stopping threshold out to infinity, turning your early stopping design into a fixed sample size design. The whole point is to sacrifice power in return for reducing the expected sample size in a way that depends on the true effect size; hence the name “adaptive design”.)

Recently my interest in the question was rekindled by arguments about optional stopping made in Mayo’s new book. I asked Daniël Lakens for some advice on how to handle frequentist inference in such problems and he pointed me to a tutorial paper he wrote that cites *Statistical Modelling of Clinical Trials: A Unified Approach* by Proschan et al.

In sequential designs the problem of deciding on a notion of accordance arises after the trial has concluded and it’s time to compute the p-value (defined as the probability, under the null hypothesis, of a result as or more extreme than the one actually observed). In Proschan et al. I read that to make “more extreme than the one actually observed” mathematically precise one defines an ordering on the elements of the sample space such that for any two elements one can say which is more extreme (or that they are equally extreme, a circumstance which doesn’t happen in fixed sample size designs). Such an ordering directly specifies a notion of accordance that might be suitable for constructing the SEV function. Proschan et al. offer four possible orderings, and three of them were the three I had thought up previously.

The four orderings are:

- MLE-ordering, which I will denote ≺_{e} (the “e” stands for “estimator”), which calls outcomes equally extreme when the resulting maximum likelihood estimator (MLE) values are numerically equal; in this example the sample means are the MLEs, so x̅_{100} = 154 is simply equivalent to x̅_{4} = 154;
- B-ordering, the one I didn’t think of and which I won’t describe because in our case it happens to coincide with MLE-ordering;
- z-ordering, ≺_{z}, which calls outcomes equally extreme when they correspond to the same z-score; for example, x̅_{100} = 154 is equivalent to x̅_{4} = 170 since they are each four nominal standard errors away from the null;
- stagewise ordering, ≺_{s}, which calls any value of x̅_{4} more extreme than any value of x̅_{100} because the larger the value of *μ* the more likely is early stopping; at a given sample size the test statistic values are ordered as usual.
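To make stagewise ordering concrete, here's a sketch (mine) of the p-values it yields in our design, one for a trial that stops at the first look and one for a trial that continues:

```python
# Stagewise p-values under the null mu = 150 for the two-look design
# (stop at n = 4 if xbar4 > 165). Any first-look rejection counts as
# more extreme than any second-look outcome.
from scipy.stats import multivariate_normal, norm

mu0, c1 = 150.0, 165.0
joint = multivariate_normal(mean=[mu0, mu0], cov=[[25.0, 1.0], [1.0, 1.0]])

def p_stagewise_stop(xbar4):
    # stopped at the first look: same as a fixed n = 4 design
    return norm.sf((xbar4 - mu0) / 5)

def p_stagewise_continue(xbar100):
    # continued: all first-look rejections, plus larger second-look means
    no_stop_and_extreme = norm.cdf((c1 - mu0) / 5) - joint.cdf([c1, xbar100])
    return norm.sf((c1 - mu0) / 5) + no_stop_and_extreme

print(p_stagewise_stop(166.0))      # ~0.00069
print(p_stagewise_continue(170.0))  # ~0.00135
```

Note the floor in the second case: under stagewise ordering no second-look result, however extreme, can produce a p-value below the 0.00135 spent at the first look.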

Proschan et al. strongly recommend stagewise ordering; ironically, I thought of stagewise ordering first and rejected it almost immediately, for reasons that will become clear. They recommend it on three grounds:

- Using other orderings can sometimes result in a p-value that is larger than the design’s Type I error rate even when the null hypothesis has been rejected, but this can never happen using stagewise ordering.
- If the null is rejected at the first look then the p-value is the same as if that result had occurred in a fixed sample size design (I personally don’t see why this is particularly desirable, but okay).
- The other orderings can’t be used with the “alpha-spending function” approach of Lan and DeMets that permits looks at the data that are not planned in advance.

In relation to this last point they write something I find a little odd:

…the p-values for the z-score, B-value, and MLE orderings depend not only on the data observed thus far, but also on future plans. [In a worked example with five looks, even] though we stopped at the third look, we needed to know the boundaries at the fourth and fifth looks. But future look times may be unpredictable. Why should the degree of evidence observed thus far depend on the number and times of looks in the future? This violates the likelihood principle.

The entire approach described in the book violates the version of the likelihood principle I’m familiar with. It’s possible that they mean the thing I know as the sufficiency principle, which states that if two experiments yield the same value of a sufficient statistic then they yield the same evidence about the parameter. But in the sort of group-sequential designs they’re looking at, the MLE and the information fraction (a kind of generalized sample size) at stopping are jointly sufficient and the inference methods they advocate do not depend solely on the realized value of the sufficient statistic, so the sufficiency principle is also violated by their approach. So while I agree that the degree of evidence observed thus far shouldn’t depend on plans for future data collection, I can’t make out what principle they’re citing.

Onward!

We’ll need two random variables, X̅_{4} and X̅_{100}:

- they are jointly normally distributed;
- X̅_{4} has unknown mean *μ* and standard deviation 5;
- X̅_{100} has unknown mean *μ* and standard deviation 1;
- Cov(X̅_{4}, X̅_{100}) = 1.

This completely specifies their joint distribution. Note that the joint distribution is defined on the entire ℝ^{2} plane; for us the event “no early stopping, n = 100” is identical to the event “X̅_{4} ≤ 165”. I’ll use x̅_{n} to refer to a generic observed result and, well, I’m not sure what X̅_{N} is exactly but the event “X̅_{N} ≺ x̅_{n}” is the event “an outcome less extreme than the observed result x̅_{n} as defined by whichever ordering ‘≺’ is in use”. This is everything we need to define the candidate SEV functions corresponding to the different orderings; I’ll give them a subscript to indicate which ordering is in use.

It turns out that the candidate SEV functions fall afoul of PRESS when data collection isn’t stopped at the first look and the data at a subsequent look indicates that it “should have”, i.e., that the realized estimate suggests that the parameter value is actually on the “stop” side of the stopping threshold. So let us consider the outcome for which the rejection threshold was not exceeded at the first look at *n* = 4 and the final result was x̅_{100} = 170. Is this just a weird result that could effectively never occur even with the most favorable value of *μ*? No — it turns out that Pr (X̅_{4} ≤ 165, X̅_{100} > 170; *μ* = 171) = 0.086, so while such a result is a bit unusual it’s by no means inconceivable. This probability (with *μ* tuned to maximize) *increases* with increasing second-look sample size and asymptotes from below to 0.21.
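The 0.086 figure can be reproduced directly from the joint distribution given above; a sketch (the choice *μ* = 171 follows the text):

```python
# Pr(Xbar4 <= 165, Xbar100 > 170) at mu = 171: "no early stop" followed
# by a second-look mean on the "stop" side of the stopping threshold.
from scipy.stats import multivariate_normal, norm

mu = 171.0
joint = multivariate_normal(mean=[mu, mu], cov=[[25.0, 1.0], [1.0, 1.0]])
# Pr(A <= 165, B > 170) = Pr(A <= 165) - Pr(A <= 165, B <= 170)
p = norm.cdf((165 - mu) / 5) - joint.cdf([165.0, 170.0])
print(round(p, 3))  # ~0.086
```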

In stagewise ordering any observed value of x̅_{100} accords less well with larger values of *μ* than any value of X̅_{4} greater than the rejection threshold, so we have

SEV_{s} ( *μ* > *μ′* ) = min_{*μ* ≤ *μ′*} Pr (X̅_{4} ≤ 165; *μ*) ⋅ Pr (X̅_{100} < 170 | X̅_{4} ≤ 165; *μ*)

Both of these probabilities are monotonic decreasing in *μ* so the minimum is achieved when *μ* = *μ′* and we may write

SEV_{s} ( *μ* > *μ′* ) = Pr (X̅_{4} ≤ 165; *μ′* ) ⋅ Pr (X̅_{100} < 170 | X̅_{4} ≤ 165; *μ′* )

Since the two factors are probabilities, either of them is an upper bound on SEV_{s}; in particular, SEV_{s} ( *μ* > *μ′* ) can never be larger than the probability of no early stopping, Pr (X̅_{4} ≤ 165; *μ′* ).

Figure 1. Under stagewise ordering, lower-bounding inferences (i.e., inferences of the form “*μ* > *μ′* ”) are constrained by the rejection threshold.

Figure 1 shows SEV_{s }( *μ* > *μ′* ) (solid) along with the probability of no early stopping (dashed). Consider the severity for the inference “*μ* > 165”. SEV_{s }( *μ* > 165) is smaller than its upper bound, 0.5, albeit negligibly so. In the *Error Statistics* paper at the bottom of page 171 we learn that SEV less than 0.5 is considered poor warrant for an inference, so an error statistician who uses stagewise ordering finds that the severity rationale blocks her from making the lower-bounding inference “*μ* > 165”. The inference is blocked by the fact that, even supposing *μ* ≤ 165, there’s appreciable probability that data collection would have stopped early and thereby produced a result that (stagewise-)accords better with “*μ* > 165” than any value of x̅_{100} does (compare criterion (S-2)*). Take note: having looked and not stopped, she’s **permanently** blocked from inferring “*μ* > 165” irrespective of the second-look sample size of the design. It doesn’t matter if she takes the second look after collecting 100, 900, or 250,000 samples; to hold to stagewise ordering she must give up PRESS.
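The ceiling is easy to compute; a sketch (mine) for the observed x̅_{100} = 170:

```python
# SEV_s(mu > mu') for xbar100 = 170: probability, at mu = mu', of a result
# that stagewise-accords less well, i.e. {no early stop and Xbar100 < 170}.
from scipy.stats import multivariate_normal, norm

def sev_s_lower(mu_prime, xbar100=170.0, c1=165.0):
    joint = multivariate_normal(mean=[mu_prime, mu_prime],
                                cov=[[25.0, 1.0], [1.0, 1.0]])
    return joint.cdf([c1, xbar100])

def pr_no_stop(mu_prime, c1=165.0):
    return norm.cdf((c1 - mu_prime) / 5)

# at mu' = 165 the no-stop probability is exactly 0.5, and SEV_s sits
# just beneath it; no second-look sample size can push it past
print(sev_s_lower(165.0), pr_no_stop(165.0))
```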

Proschan et al. note that stagewise ordering has related consequences for confidence interval procedures. In fact, the construction of such intervals is mathematically equivalent to reading the tail area probabilities off the SEV_{s} curve, so stagewise ordering results in a bound on the lower confidence limit, and for equal-tailed intervals and a fixed budget of confidence this implies a bound on the upper confidence limit too. As a result, it’s possible for the MLE to lie outside the confidence interval, although the chances are “microscopic” for trial designs and confidence levels typically used in practice. The fact that the width of the confidence interval can end up being bounded below without respect to the realized sample size with non-negligible probability was not remarked on, likely because the phenomenon is both unusual and relatively mild when the sample sizes are closer together.

Figure 2. Under stagewise ordering, upper-bounding inferences respect realized sample size.

Speaking of interval procedures, we’ve looked at SEV_{s} for lower-bounding inferences and we should do the same for upper-bounding inferences. It turns out that with stagewise ordering we have SEV_{s }( *μ* < *μ′* ) = 1 – SEV_{s }( *μ* > *μ′* ). This relationship holds for the sorts of examples Mayo usually discusses (see for example the *Error Statistics* paper, middle of page 172) and for stagewise ordering but can fail for other orderings. Figure 2 shows SEV_{s }( *μ* < *μ′* ) for an observed mean of 170 for two designs, one with a second-look sample size of 100 (in blue) and one with a second-look sample size of 900 (in red). The upper bound does converge with increasing sample size, so that’s okay.

How does z-ordering perform? Well, x̅_{100} = 170 is twenty nominal standard errors from the null, so according to z-ordering it is as extreme as x̅_{4} = 250 and we have

SEV_{z} ( *μ* > *μ′* ) = min_{*μ* ≤ *μ′*} [ Pr (X̅_{4} ≤ 165, X̅_{100} < 170; *μ*) + Pr (165 < X̅_{4} < 250; *μ*) ]

The minimand in the expression is non-monotonic; I don’t even know if Mayo would allow orderings that lead to non-monotonicity but I have relevant points to make even if she wouldn’t, so let’s proceed.

Figure 3. Under z-ordering, upper-bounding inferences mix first-look and second-look distributions together in a bizarre way.

Figure 3 shows SEV_{z} ( *μ* > *μ′* ) (solid) and the minimand of the SEV_{z} ( *μ* > *μ′* ) expression (dashed). The minimand rises from its local minimum and stays at (effectively) one until *μ′* approaches 250, so SEV_{z} ( *μ* > *μ′* ) stays locked to the value of the minimand at the local minimum, 0.913, until the minimand descends below that value near *μ′* = 243. As we consider larger and larger second-look sample sizes the value of x̅_{4} equivalent to an observed mean of 170 at the second-look sample size goes off to infinity. In the same limit the descent of the minimand close to 170 gets sharper and the limiting value of SEV_{z} ( *μ* > *μ′* ) is 0.841; that is, we always have SEV_{z} ( *μ* > *μ′* ) > 0.841 for all values of *μ′* from 170 on out and out and out…
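The non-monotonic minimand is easy to trace numerically; a sketch (mine) for the second-look-of-100 design with observed x̅_{100} = 170:

```python
# Minimand of SEV_z(mu > mu') for observed xbar100 = 170 (z = 20,
# z-equivalent to xbar4 = 250): probability at mu of a less extreme
# z-score, i.e. {Xbar4 <= 165, Xbar100 < 170} or {165 < Xbar4 < 250}.
import numpy as np
from scipy.stats import multivariate_normal, norm

def minimand(mu):
    joint = multivariate_normal(mean=[mu, mu], cov=[[25.0, 1.0], [1.0, 1.0]])
    less_extreme_continue = joint.cdf([165.0, 170.0])
    less_extreme_stop = norm.cdf((250 - mu) / 5) - norm.cdf((165 - mu) / 5)
    return less_extreme_continue + less_extreme_stop

grid = np.linspace(150, 240, 901)
vals = np.array([minimand(m) for m in grid])
# over this range the minimum is the local dip near mu = 171; past
# roughly 243 (off this grid) the minimand descends again as
# xbar4 = 250 comes within reach
print(round(vals.min(), 3), round(grid[vals.argmin()], 1))
```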

What are we to make of this value, 0.841? Well, the *Error Statistics* paper characterizes it as “not too bad” close to the bottom of page 171. Returning to the scenario where the second-look sample size is 100, we see that an error statistician who uses z-ordering would be led to affirm that, say, the inference “*μ* > 200” has a warrant that is at least “not too bad” because there is a high probability that a z-score that accords more with *μ* ≤ 200 than does the realized z-score would have been observed, supposing *μ* ≤ 200. This violates PRESS, but I think it’s fair to say that’s the least of its problems.

(Here we have the SEV_{z} function seeming to warrant an inference it shouldn’t, but the definition of the SEV function only draws on criterion (S-2); an error statistician would probably block “*μ* > 200” due to failure of criterion (S-1). Exactly how that would go is unclear because the *Error Statistics* paper only shows applications of (S-1) to the null or alternative hypothesis of a test and not to arbitrary directional inferences. It also says that an error statistician could use likelihood to arrive at (S-1); in all, the proper application of criterion (S-1) is quite mysterious.)

But really the point I want to make with this ordering relates to upper bounding inferences, not lower bounding inferences. For those we have

SEV_{z} ( *μ* < *μ′* ) = min_{*μ* ≥ *μ′*} [ Pr (X̅_{4} ≥ 250; *μ*) + Pr (X̅_{4} ≤ 165, X̅_{100} ≥ 170; *μ*) ]

To a high degree of accuracy this is just

SEV_{z} ( *μ* < *μ′* ) ≈ Pr (X̅_{4} ≥ 250; *μ* = *μ′* ) = Φ( ( *μ′* − 250 ) / 5 )

Never mind that the z-ordering’s value of x̅_{4} equivalent to x̅_{100} = 170 doesn’t make sense. The point is that the curvature of the SEV_{z} ( *μ* < *μ′* ) function is determined by the standard error at *n* = 4. Even if we hypothesize some other ordering that slides the equivalent value of x̅_{4} down to something more reasonable this will remain true. In fact, let’s do just that.

Figure 4. Under **any** ordering that treats possible results at the first look and second look as equivalently extreme, the width of intervals between upper-bounding and lower-bounding inferences can end up insensitive to the second-look sample size.

Figure 4 shows possible SEV ( *μ* < *μ′* ) functions for hypothetical orderings that set the equivalent value of x̅_{4} to 170 (leftmost/highest), 175 (center), and 180 (rightmost/lowest) for both a second-look sample size of 100 (blue) and 900 (red). The overall width of each curve is pretty much the same no matter where we center the curve and, critically, no matter the second-look sample size. All a larger sample size accomplishes is to make the curve in the region around 170 steeper; there *will* be inferences that, at any level we might consider well-warranted, are insensitive to the second-look sample size of the design.

In Figure 4 the ordering that set the equivalent value of x̅_{4} to 170 wasn’t actually hypothetical — that’s exactly what MLE-ordering does. SEV_{e} ( *μ* < *μ′* ) doesn’t look *too* bad, does it? It’s centered around the right place, unlike the stagewise ordering version. Sure, the precision is “wrong” but it’s conservatively wrong, which might be tolerable to someone who doesn’t require PRESS.

Lest anyone be tempted by the seeming acceptability of this result, it must be pointed out that in MLE-ordering it doesn’t matter at which look you observe a particular realized mean — the SEV_{e} function depends only on the value of the observed mean. This can lead to results that illustrate exactly why Proschan et al. recommend an ordering in which the degree of evidence observed at early looks doesn’t depend on details relating to later looks.

Figure 5. Under MLE-ordering it doesn’t matter if you observe x̅_{4} = 165.7 or x̅_{100} = 165.7, but the *planned* second-look sample size does affect the inference even if early stopping occurred.

Figure 5 shows two SEV_{e} ( *μ* < *μ′* ) functions for x̅_{4} = 165.7, corresponding to two different second-look sample sizes, 100 (blue) and 900 (red). The curvature at values below 0.5 reflects the width of the sampling distribution for the first look but above 0.5 the curvature reflects the smaller width of the sampling distribution for the second look *even though the second look hasn’t and won’t take place*. An error statistician who holds to MLE-ordering and planned a second-look sample size of 100 would claim that given x̅_{4} = 165.7, “*μ* < 168” is a well-warranted inference because there’s a high probability of observing a sample mean larger than 165.7 supposing that *μ* ≥ 168. If she had instead planned a second-look sample size of 900 then the same would go for the claim “*μ* < 167”. This isn’t a violation of PRESS *per se* but I think it’s fair to say that appropriating precision from an unfulfilled plan to look at more data is behavior no one wants from their method of inference.

This early stopping design is somewhat akin to one that was used to level criticisms against Neyman-Pearson testing in years gone by: one of two measurement devices, one low precision and one high precision, is chosen at random and then the measurement is carried out. The specific criticism is that if one calculates error probabilities that take the random selection of measurement devices into account (in statistical jargon the sampling distribution is a mixture model) one ends up with a rejection region that seems inappropriate for either device, being too lenient on the null hypothesis relative to the high precision measurement and too strict on the null hypothesis relative to the low precision measurement. The p-value associated with the mixture has a corresponding affliction. Proponents of hypothesis testing address the criticism by noting that the choice of measurement is “ancillary” — its sampling distribution is free of the parameter of interest — but informative about the precision of the measurement; they argue, reasonably enough, that it’s appropriate to condition on such ancillary information when performing inference. In cases like our early stopping scenario, however, the sample size is not an ancillary random variable — its dependence on the parameter of interest is what makes adaptive trials adaptive — so a straightforward resort to conditioning is not justified.
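The two-instrument criticism is easy to reproduce numerically. A sketch (mine; the equal-probability mixture and the 0.05 level are illustrative choices) comparing the mixture critical value with the per-instrument ones:

```python
# Cox-style two-instrument example: a measurement X ~ N(mu, sigma) where
# sigma is 1 or 10 with equal probability. Find the one-sided critical
# value for the mixture at level 0.05 and compare it with the critical
# values you'd use knowing which instrument was selected.
from scipy.optimize import brentq
from scipy.stats import norm

sigmas, level = (1.0, 10.0), 0.05

def mixture_tail(c):
    return 0.5 * norm.sf(c / sigmas[0]) + 0.5 * norm.sf(c / sigmas[1])

c_mix = brentq(lambda c: mixture_tail(c) - level, 0.0, 50.0)
c_precise = sigmas[0] * norm.isf(level)    # ~1.64
c_imprecise = sigmas[1] * norm.isf(level)  # ~16.4

print(round(c_mix, 1), round(c_precise, 2), round(c_imprecise, 1))
# the mixture cutoff lands far above 1.64 yet below 16.4: lenient on the
# null for the precise device, strict for the imprecise one
```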

But you know what? Let’s try conditioning anyway and see what happens. Consider a one-sided lower-bounding confidence procedure in which for a chosen confidence level (1 – *α*), one seeks the value of *μ* such that the observed mean is at the *α* quantile of its sampling distribution conditional on whether early stopping occurred or not. As we slide the confidence level from 1 to 0 we trace out a function that looks a lot like a candidate SEV ( *μ* > *μ′* ). This *is* an exact confidence procedure but it doesn’t correspond to any simple notion of accordance over the whole sample space. Is it legitimate to use a notion of accordance (and relevant outcome space) that is sampled at random in a manner that depends on the parameter of interest? I don’t know — Mayo never addresses complicated examples.

By design, the inferences this procedure produces at the second look adhere to PRESS. They aren’t identical to the inferences that would result in a fixed sample size design so perhaps they can also be said to have adjusted the error probabilities to take selection effects into account, a property Mayo considers essential to well-warranted inference. The problem comes when we consider the inferences that result when early stopping does occur and the observed mean is close to the rejection threshold. We’re obliged to use the sampling distribution conditional on the occurrence of early stopping; otherwise the confidence coverage of the procedure taken as a whole is wrong. We find that the resulting inferences are subject to an extreme form of the first problem Proschan et al. noted for orderings other than stagewise: the conditional procedure produces an inference that contradicts the Type I error rate set in the design of the trial.

For example, when x̅_{4} = 166 the inference “*μ* > 150” has confidence level Pr (X̅_{4} < 166 | X̅_{4} > 165; *μ* = 150) = 0.49; the p-value for the corresponding test is 0.51. It’s a peculiar sort of contradiction. By using this experimental design we aimed to ensure a Type I error rate no greater than 0.022; the particular rejection threshold that was exceeded has only a 0.00135 Type I error rate associated with it. The result is more than three standard errors from the null! It seems like we ought to be able to claim “*μ* > 150” with a SEV of at least 0.978. And yet, if the severity rationale is even available on a conditional basis then, given this procedure and result, it blocks the inference: among the cases that reject on the first look there’s appreciable probability of a result that accords better with “*μ* > 150” than does x̅_{4} = 166, even supposing *μ* ≤ 150. I can’t imagine anyone would uphold this line of reasoning, but I admit that all I have here is an argument from incredulity. If anyone can argue convincingly that this or something like it is actually reasonable then I’d have to reconsider, but I think it’s fair to say such an argument would have to go well beyond any argument Mayo herself has made up to this point.
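The 0.49 figure is a one-line truncated-normal calculation; a sketch:

```python
# Confidence level of "mu > 150" given a first-look stop at xbar4 = 166:
# Pr(Xbar4 < 166 | Xbar4 > 165) at mu = 150, where Xbar4 ~ N(150, 5).
from scipy.stats import norm

mu0, c1, xbar4 = 150.0, 165.0, 166.0
se4 = 5.0
conf = ((norm.cdf((xbar4 - mu0) / se4) - norm.cdf((c1 - mu0) / se4))
        / norm.sf((c1 - mu0) / se4))
print(round(conf, 2), round(1 - conf, 2))  # ~0.49 and ~0.51
```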

I’ve demonstrated that no simple measure of accordance results in a SEV function that follows PRESS in all cases. (I say “simple” because I only looked at the most obvious candidate SEV functions. I’ve thought up some baroque possibilities but I can’t sort out an error-statistical principle that can be applied to uniquely specify a particular choice. I’ve reached the end of both my ability and motivation to steelman the SEV function and no one else has even seen the need, so I’m going forward on the assumption that this is as good as it gets. If someone generates a principled notion of accordance that addresses these problems I can revise my views.) I’ve also shown that simple conditioning on realized sample size, the most straightforward way of constructing a frequentist procedure that follows PRESS, works for second looks but can block the first-look inference for which the test was designed. In the face of this tension between SEV and PRESS, those fully committed to severity reasoning might simply deny that adherence to PRESS is a necessary property of a method of inference. I see two ways one might go about this.

First, one might simply deny that a fully general notion of accordance exists in designs with multiple looks. One might permit the use of stagewise ordering for p-value calculations but forbid calculating post-data error probabilities for hypotheses other than the null, ruling out a SEV function defined on the full parameter space. This will still leave the severity rationale available when considering p-values as in the *Error Statistics* paper at the top of page 168.

This is the sort of stance that might appeal to someone like Daniël Lakens, who is fully committed to Neyman-Pearson testing and has little use for effect size estimates. From what I understand, in his investigations effect sizes are experiment-bound and aren’t expected to generalize, so he is much more interested in whether any effect at all can be generated and discerned than with the magnitude of the effect. In his tutorial paper on sequential designs he recommends caution regarding effect size estimates, writing

…the observed effect size at the moment the study is stopped could be an overestimation of the true effect size. Although procedures to control for bias have been developed, there is still much discussion about the interpretation of such effect sizes, and studies using non-adaptive designs, followed by a meta-analysis, might be needed if an accurate effect size estimate is paramount.

The unappealing consequence of this stance is that it undermines many defences of frequentist statistics made in the *Error Statistics* paper, at least in the context of designs with multiple looks. In particular, the paper’s counter-arguments to “fallacies” or “errors” in interpreting hypothesis/significance test results (#2, #3, and #6) all rely on the existence of the SEV function; without a suitable notion of accordance, the whole defence is thrown back on the power function as an aid to the interpretation of dichotomous test results. The paper’s claim that when a statistically significant result is found, “[p]ost-data, one can go much further in determining the magnitude γ of discrepancies from the null warranted by the actual data in hand” would no longer hold, and post-data assessment of statistically insignificant results would likewise no longer be available. Our inferences would depend only on whether the test rejected the null or not, to the neglect of most of the information in the data. On this view, observing x̅_{100} = 170 justifies nothing more than the strong rejection of the null that we can get from the p-value and the inferences we can get from considering the pre-data probability of Type II error.

The staunchest advocates of the severity rationale can avoid undermining these defences by taking the second option: biting the bullet and committing to some fully general notion of accordance even in the multiple look context. At the risk of going out on a limb, I’m going to assume that in light of my discussion of z-ordering and MLE-ordering and in view of Proschan et al.’s first and third arguments for stagewise ordering (non-contradiction of p-values and Type I error rates, applicability even for unplanned interim looks), their choice would indeed be stagewise ordering. The unappealing consequence of this stance has already been described in mathematical terms; the nickel summary is that when the sheer existence of later observations is contingent on the earlier result they cannot force the SEV function past the worst case consistent with the earlier result. In figurative language we might say that each look at the data that does not trigger the end of data collection acts as a ratchet, constraining the set of possible well-warranted inferences in a way that allows free movement in only one direction.

I have to imagine that this is not the first time that a critic of frequentist statistics — perhaps Richard Royall or some other likelihood theorist — has discovered this ratcheting of inferences in sequential trials as it manifests in confidence intervals. I can also see how criticism based on this phenomenon might not be convincing to frequentists who appeal to what the *Error Statistics* paper calls the “behavioristic rationale… wherein tests are interpreted as tools for deciding “how to behave” in relation to the phenomena under test, and are justified in terms of their ability to ensure low long-run errors.” What’s novel here is that I’ve drawn out the consequences of an account that purports to delineate well-warranted inferences from poorly warranted inferences *in the specific case at hand* rather than simply ensuring low errors in the long run. The account is supposed to provide a philosophical foundation for frequentist statistical techniques in current use and, what’s more, to be an account of how people actually reason “in the wild”. If *you* collected 4 samples, observed a sample mean below your rejection threshold of 165, continued by collecting 249,996 more samples, and then observed a sample mean of 170, is this how *you* would reason?

The stagewise ordering ratchet is bad enough in the context of a single early stopping design, but there is worse to come.

A pattern of contingent, sequential investigations of an effect is a relatively small part of science, but it’s not a negligible one. Witness this tweet by statistician Zad Chow: “Thought experiment. Pretend several studies on a phenomenon (an intervention) failed to produce a statistically significant effect. You design a new study, intended to have really high power (like 90%). You find no significant effect. Is this support for no difference?” His point relates to the interpretation of statistically insignificant results; my point relates to just how natural it is to imagine studies that wouldn’t even exist had previous studies found different results. If you don’t find some rando’s tweet compelling evidence, consider what the distinguished applied statistician Stephen Senn has to say about power calculations and sample size: set the sample size to control the rate of two errors, Type I error and “the error of failing to develop a (very) interesting treatment further”. Really, the idea that researchers decide which lines of research to pursue in light of past results would ordinarily be too banal to merit notice, but it matters here.

How are frequentist statistical techniques used in science? A stylized account goes something like this: we draw a conceptual border around our Study; inside this border we design the Study to achieve a particular Type I error rate and power function, and when data have been collected we calculate our inferential statistics in isolation from anything outside of the Study. At the end we (hopefully!) publish a paper relating the findings of our Study to the scientific community at large. Other researchers may then survey the literature and make decisions about what they’re going to do on the basis of information that includes the Study — for that matter we almost certainly looked at prior Studies ourselves before conducting our own Study — but it would be infeasible to try to account for how our current choices are contingent on those findings and would have been different had those findings been different. Of course, within the context of the Study itself we had better be prepared to account for data collection that was contingent on interim findings or else our inferences will not survive an error-statistical audit.

After many Studies of an effect have been conducted, how can we synthesize the information in the literature? We employ the techniques of meta-analysis to conduct another Study, naturally. A meta-Study has as sturdy a conceptual border around it as any other Study, and of course it respects the conceptual borders around the Studies that constitute its data — Studies are treated as independent and no attempt is made to account for the possibility that the sheer existence of later Studies might be contingent on the findings of earlier Studies.

For a severity-based account of statistical inference in science to work it is **absolutely crucial** that the conceptual borders around Studies be enforced and the possibility of contingent existence of later Studies be ignored. If we *were* to acknowledge it we’d be obliged to compute error probabilities in a sequential fashion to avoid committing the sin of ignoring selection effects. Stagewise ordering would be the only practical choice here as it only requires knowing what has happened and not future plans.

Consider the implications for replication studies: an initial well-conducted (perhaps even pre-registered) study finds a result that’s just barely statistically significant at the conventional 2.5% level (Mayo treats two-sided tests as two one-sided tests). The sample size was set in accordance with Stephen Senn’s advice linked above to have high power to reject the null under “the [effect size] we would not like to miss”, so naturally this result prompts researchers to carry out a larger study that would never have happened had the initial study found no evidence of an interesting effect. The follow-on study has much higher precision and shows a smaller effect than in the initial study; the first has a 95% CI of, say, [0.03, 3.95] and the second a 95% CI of [0.46, 1.24] on the same scale; the “effect size we would not like to miss” is around 3 on this scale. Of course, the later inference is carried out ignoring its own contingent existence; data collection wasn’t stopped after the first study and the data in the subsequent study indicates that it “should have”, so the stagewise ordering ratchet binds and blocks the inference to a smaller effect. Accounting for the sequential relationship between the two investigations yields a 95% CI of [0.66, 3.92]. What a disaster! We wake with a start from this horrible nightmare and recall that these are two distinct Studies and are to be treated as unrelated, independent investigations into the same effect. What a relief!

We all knew it would come to this, didn’t we? I’ve carefully avoided the “B” word up to this point; now it’s time to really let my anti-freq flag fly.

Error statisticians face a conundrum. Why is it necessary to account for selection effects within Studies and not across Studies? Why must we carefully avoid accounting for the possibility of contingent existence of later Studies when conducting meta-analysis? We know that this is the right thing to do — it gives the “right” answer — but no error-statistical justification for doing it this way has been offered. When we actually carry the math of contingent data collection through to its logical conclusion we find that error probabilities based on sampling distributions can’t get us there, not even post-data error probabilities like SEV.

For Bayesians this is a non-puzzle. For us this sort of distinction between intra-study and inter-study inference doesn’t exist because data enter into analyses through their likelihood function; selection effects like stopping rules and contingent investigations don’t change the likelihood and so don’t change the resulting inferences. The puzzle facing Bayesians is this: why does the severity rationale sound so darn reasonable?

When Mayo describes severity reasoning qualitatively, for example when discussing the discovery of the mechanism of prion disease, it is usually in terms of an hypothesis *H* and some data **x** that are anomalous for *H*. The data are treated as discrete: either an anomaly is observed or it isn’t. According to severity reasoning, to infer not-*H*, by (S-1) we need **x** to accord with it in some sense; presumably a necessary condition for this is that **x** should at least be possible under not-*H*, Pr(**x**; not-*H*) > 0. By (S-2)* the severity for inferring not-*H* from the anomaly **x** is equal to the probability that the anomaly is not observed supposing *H* to be the case, that is, 1 – Pr(**x**; *H*), which is monotonic decreasing in Pr(**x**; *H*). A Bayesian in this situation would quantify the evidence in favour of not-*H* by the likelihood ratio, Pr(**x**; not-*H*) / Pr(**x**; *H*), and provided that the numerator Pr(**x**; not-*H*) is greater than zero this is also monotonic decreasing in Pr(**x**; *H*). Since the discussion is qualitative, severity is characterized as “high” or “low” and high severity is treated as dispositive. It immediately follows that at this rather high level of abstraction likelihood ratios and severity track one another. (To a Bayesian it can appear almost as if the severity criteria were designed to agree with likelihood ratios. They weren’t, of course; it only seems so because both severity reasoning and Bayes are designed to be consistent with logic in the limit where probabilities approach 0 or 1, at least at this level of abstraction.)

So Bayesians find little to disagree with in Mayo’s accounts of severity reasoning as she describes it being applied in science, as long as the discussion remains qualitative. The more quantitative the discussion becomes the larger the divergence, reaching a maximum with optional stopping. The optional stopping setup Mayo discusses is early stopping on steroids: consider a normal model, unknown mean, known standard deviation, and a sampling design in which data collection continues just until a nominal equal-tailed 95% CI excludes zero. From Khinchin’s law of the iterated logarithm it follows that this design will eventually stop with probability 1 even when the mean is in fact zero. A Bayesian with a flat prior for the mean has equal-tailed 95% posterior credible intervals that are numerically equal to the nominal 95% CIs that set the stopping criterion, so her credible intervals exclude zero with sampling probability 1 even supposing zero is the true mean. The result is the worst possible disagreement between Bayesian inference and a severity assessment.
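That stopping design is easy to simulate. Below is a sketch of mine, not anything from the original post: the cap of 500 draws per run and the 200 replications are arbitrary choices made so the simulation terminates, while the law of the iterated logarithm guarantees that, with no cap, every run eventually stops.

```python
import random
import math

def stops_by(n_max, mu=0.0, seed=None):
    """Collect N(mu, 1) data one point at a time, stopping as soon as the
    nominal equal-tailed 95% CI for the mean excludes zero; return the
    stopping sample size, or None if no stop occurs within n_max draws."""
    rng = random.Random(seed)
    total = 0.0
    for n in range(1, n_max + 1):
        total += rng.gauss(mu, 1.0)
        xbar = total / n
        # nominal 95% CI is xbar +/- 1.96/sqrt(n); it excludes 0 iff |z| > 1.96
        if abs(xbar) * math.sqrt(n) > 1.96:
            return n
    return None

# Even with mu = 0, a nontrivial fraction of runs reach "significance" early,
# despite the nominal 5% level of any single fixed-n interval.
runs = [stops_by(500, mu=0.0, seed=i) for i in range(200)]
frac = sum(r is not None for r in runs) / len(runs)
print(f"fraction stopping within 500 draws (mu = 0): {frac:.2f}")
```

With the cap removed the stopping fraction creeps toward 1, which is the sampling-to-a-foregone-conclusion phenomenon the paragraph above describes.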

Many Bayesians, even the most dedicated, are troubled by the effects of stopping criteria on estimates and sampling error probabilities. Other well-known Bayesians (such as Andrew Gelman, with whom Mayo sees the possibility of an error-statistical rapprochement of sorts) are untroubled by the phenomenon. My own opinion is based on the fact that the irrelevance of stopping rules, however disquieting its consequences, goes hand-in-hand with a very desirable characteristic of Bayesian updating called the martingale property: for any measurable function on the probability space the prior expectation of the posterior expectation is equal to the prior expectation. That’s rather opaque; what it means is that we can’t predict with certainty in which direction we’ll update. Still too opaque? Okay: no ratchets, guaranteed.
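The martingale property can be checked exactly in a small conjugate example. This is my own illustration, not the post’s; the Beta(2, 3) prior and the 10 trials are arbitrary, and exact rational arithmetic makes the equality an identity rather than a numerical near-miss.

```python
from fractions import Fraction
from math import comb

# Beta(a, b) prior on a binomial success probability; n Bernoulli trials.
a, b, n = 2, 3, 10

def beta_binom_pmf(k):
    """Prior predictive (beta-binomial) probability of k successes, exact:
    C(n,k) * (a)_k * (b)_{n-k} / (a+b)_n, with (x)_m the rising factorial."""
    num = Fraction(comb(n, k))
    for i in range(k):
        num *= a + i
    for i in range(n - k):
        num *= b + i
    den = Fraction(1)
    for i in range(n):
        den *= a + b + i
    return num / den

prior_mean = Fraction(a, a + b)
# Average the posterior mean (a+k)/(a+b+n) over the prior predictive for k:
expected_posterior_mean = sum(
    beta_binom_pmf(k) * Fraction(a + k, a + b + n) for k in range(n + 1)
)
print(prior_mean, expected_posterior_mean)  # both 2/5
```

Before seeing the data, the expected value of where the posterior mean will land is exactly the prior mean — no direction of update is predictable in advance, i.e., no ratchet.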

In the end, I accept this disquieting property of Bayesian inference — that under at least one possible parameter value it licenses inferences that are wrong with sampling probability 1 — in return for the internal consistency of the approach and consequent virtues such as the martingale property and adherence to PRESS. On the one hand, I can never verify that the antecedent condition for certain error due to optional stopping actually holds; on the other hand, the conditions for SEV and confidence intervals to violate PRESS are purely data-dependent and thus verifiable. I can’t accept an account that can knowingly license such nonsensical inferences.

The SEV function has failed my severe test. Under severity reasoning, passing a severe test is strong warrant for a claim, but failing a severe test might still warrant a weaker claim under a less severe standard — a student that didn’t achieve an A on a test might still have gotten a B+. But I don’t affirm the severity rationale; the way I see it, Mayo’s argument for severity makes very strong claims for it, claims so strong that pathologies like failure of PRESS ought to be ruled out altogether. Having led us astray in the sequential setting, the severity argument can’t be regarded as a reliable guide at all.

What went wrong? The introduction to Proschan et al.’s chapter on inference contains the remark, “In fixed sample size trials, the test statistic, *α*-level, p-value, and estimated size of effect flow naturally from the same theory. Group-sequential trials cleave these relationships.” The argument for the severity rationale as applied at the most concrete level, the level where data and statistical hypotheses come into direct contact, leans heavily on the unity of frequentist statistical theory for well-behaved models with fixed sample sizes. It is precisely the complications brought on by the *relevance* of stopping rules to error probabilities that reveal the deficiencies of the argument. Isn’t it ironic, don’t you think?

When I first encountered the severity rationale, its prima facie plausibility was troubling to me. Was there in fact a sound philosophical basis for frequentist statistics after all? As a committed Bayesian I naturally sought to update on the possibility that it was so. The quandary I faced was that in the simple examples Mayo uses there is numerical agreement of SEV and Bayes (under reference priors); the examples couldn’t tell me whether the facial reasonability of the severity argument as applied in the examples was due to the correctness of those arguments or due to a mathematical coincidence. Now I know: this numerical agreement is a happy accident for defenders of the severity rationale, enabling them to draw on the reasonability of Bayes-licensed inferences to defend hypothesis/significance tests from criticism. I am troubled no longer.

The first scenario concerns an investment fund that deceptively advertises portfolio picks made by the “Pickrite method”:

[Jay] Kadane[, a prominent Bayesian,] is emphasizing that Bayesian inference is *conditional* on the particular outcome. So once **x** is known and fixed, other possible outcomes that could have occurred but didn’t are irrelevant. Recall finding that Pickrite’s procedure was to build

This argument is a straw man caused by a misunderstanding — an unintentional equivocation, if that’s a thing, on the phrase “other portfolios that might have been sent to you but were not”. Now, nothing here turns on the fact that the scenario doesn’t completely specify the distribution of portfolio returns. We are told that the stocks are picked at random, so the portfolio returns are independent and identically distributed random variables; the argument would seem to continue to apply if we specify that portfolio rates of return have some particular known distribution. Mayo tells us that according to a holder of the LP, once **x** is known we’re not allowed to consider the other chances that the Pickrite method provides for finding an impressive portfolio. Suppose portfolio rates of return are known to have, say, an exponential distribution with unknown mean

But this is simply wrong. The argument overlooks the fact that the LP doesn’t forbid us from taking the data collection mechanism into account (including mechanisms of missing data) *when constructing the likelihood function itself*. We’ve been told that we were presented with just the best result from out of *k* portfolios that were built, so to construct the likelihood we take the probability density for all *k* rates of return and we integrate out the *k* – 1 unobserved rates of return that were smaller than the one we do get to see. In statistical jargon, our likelihood function arises from the probability density for the largest order statistic; the general formula for the density of an order statistic can be found here. Assuming as before that rates of return follow an exponential distribution, the correct likelihood arising in this scenario would be
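A sketch of that construction in code (my own illustration; the exponential-mean parameterization, the concrete *k*, and the observed rate of return are made up for the example):

```python
import math

def max_exponential_density(x, theta, k):
    """Density of the largest of k i.i.d. Exponential(mean theta) draws:
    k * F(x)^(k-1) * f(x), with F(x) = 1 - exp(-x/theta).
    This is the density of the top order statistic, so evaluated at the
    observed best portfolio it IS the correct likelihood: the selection of
    the maximum is baked into the probability model itself."""
    if x < 0:
        return 0.0
    F = 1.0 - math.exp(-x / theta)
    f = math.exp(-x / theta) / theta
    return k * F ** (k - 1) * f

# Hypothetical: best annual rate of return out of k = 20 portfolios built.
x_obs, k = 0.35, 20
likelihood = [max_exponential_density(x_obs, th, k) for th in (0.05, 0.1, 0.2)]
```

Nothing about conditioning on the observed outcome prevents this likelihood from reflecting the cherry-picking; the LP holder and the error statistician agree on the probability model, as the next paragraph notes.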

In fact, in this scenario an error statistician would use precisely this probability model to compute “audited” *p*-values and confidence intervals, and since the parameter being estimated is a scale parameter we would have the standard numerical agreement of frequentist confidence intervals and Bayesian credible intervals (under the usual reference prior for scale parameters).

Kadane might very well ask, “Why are you considering other portfolios that you might have been sent but were not, to reason from the one that you got?” But the portfolios he would be referring to aren’t the other portfolios in the sample that we didn’t get to see — a holder of the LP agrees that *those* portfolios need to be taken into consideration. The “other portfolios that you might have been sent but were not” are the ones that might have arisen in hypothetical replications of the whole data-generating process, that is, other best portfolios selected out of *k* of them. *Those* are the “other portfolios” that likelihood theorists and Bayesians consider irrelevant. (The misunderstanding of the referent of “other portfolios” is the unintentional equivocation.) Of course, an error statistician disagrees that they are irrelevant — they’re implicit in the *p*-value computation — but this is a separate issue.

So that disposes of the Pickrite scenario and the cherry-picking argument against the LP. The second scenario is attributed to Allan Birnbaum, a statistician who started as a likelihood theorist but later abandoned those views due to the inability of likelihoods to control error probabilities. Here’s how Mayo presents it:

A single observation is made on **X**, which can take values 1, 2, …, 100. “There are 101 possible distributions conveniently indexed by a parameter

It’s apparent that this scenario is designed to challenge the views of likelihood theorists more than those of Bayesians. Nevertheless, Mayo writes, “Allan Birnbaum gets the prize for inventing chestnuts that deeply challenge both those who do, and those who do not, hold the Likelihood Principle!” And since Bayesians do hold the LP, let’s see what challenges Birnbaum’s chestnut presents for us Bayesians.

The contention is that if we let the data determine which non-zero *θ* value to consider then we are certain to find evidence apparently pointing strongly against *θ* = 0 even if it is in fact the case that *θ* = 0. That sounds pretty bad!

First we need to say what “evidence pointing against a hypothesis” means for Bayesians. Later in the book Mayo discusses Bayesian epistemology as a school of thought within academic philosophy, including various proposed numerical measures of confirmation. We don’t need to touch on those complications here; for us it will be enough to say that the data provide evidence against a hypothesis when the posterior odds against it are higher than the prior odds.

Because we’re looking at the odds against *θ* = 0 it is helpful to first decompose the hypothesis space into *θ* = 0 and its negation *θ* ≠ 0 and then assign prior probability mass conditional on *θ* ≠ 0 to the non-zero values, call them *θ’*, that *θ* might take. Given such a decomposition, this is the odds form of Bayes’s theorem in this problem:

The ratio on the left is the posterior odds, the first ratio on the right is the prior odds, and the second ratio on the right is the update factor. In the sum in the numerator of the update factor, all of the Pr(**X** =

The factor of 100 is the likelihood ratio; perhaps unexpectedly, it can be seen that the likelihood ratio is *not* the only term in the update factor. The Pr(*θ* = **r** |

Now we can imagine all sorts of background information that might inform our prior probabilities; in this respect the statement of the problem is underspecified. Suppose nevertheless that this is all we are given; then it seems appropriate to specify a uniform conditional prior distribution, Pr(*θ* = *θ’* | *θ* ≠ 0) = 0.01. Then no matter what value the datum takes, the update factor is identically one; that is, for this prior distribution the evidence in the data is certain to neither cut against nor in favour of *θ* = 0.
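Under my reading of the setup — Pr(X = x | θ = 0) = 1/100 for every x, and Pr(X = x | θ = r) = 1 exactly when x = r, which is what the likelihood ratio of 100 above requires — the update-factor claims are easy to verify numerically (the lopsided prior below is an arbitrary example of mine):

```python
def update_factor(x, cond_prior):
    """Multiplier taking prior odds to posterior odds against theta = 0
    for datum x; cond_prior[r] = Pr(theta = r | theta != 0), r = 1..100.
    Likelihoods: Pr(X=x | theta=0) = 1/100; Pr(X=x | theta=r) = 1 iff x == r."""
    numerator = sum(cond_prior[r] * (1.0 if x == r else 0.0)
                    for r in range(1, 101))
    return numerator / (1.0 / 100)

# Uniform conditional prior: the update factor is exactly 1 for every datum.
uniform = {r: 0.01 for r in range(1, 101)}
factors = [update_factor(x, uniform) for x in range(1, 101)]

# A non-uniform conditional prior: data matching the favoured values count
# against theta = 0, but then the remaining data necessarily count in favour.
lopsided = {r: (0.02 if r <= 25 else 0.5 / 75) for r in range(1, 101)}
```

So with the uniform conditional prior no possible datum moves the odds at all, and any departure from uniformity that makes some update factors exceed one forces others below one, exactly as argued in the next paragraph.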

If we have information that justifies a non-uniform conditional prior distribution then for some values of *θ* the prior will be larger than 0.01; a corresponding datum would result in an update factor greater than one and thus be evidence against *θ* = 0. But in this situation there must be other values of *θ* for which the prior is smaller than 0.01, and a corresponding datum would result in an update factor smaller than one and thus be evidence *in favour* of *θ* = 0. So contrary to what Cox and Hinkley say holds for likelihood theorists, we Bayesians are *never* certain of finding evidence against *θ* = 0 even when it is in fact the case — the closest we get is being certain that the data provide no evidence one way or the other.

I wonder if this chestnut of Birnbaum’s poses any challenge for the severity concept…

I recommend that people who want to really understand the severity argument read the above-linked paper, but for completeness’s sake let’s have a look at how the SEV function formalizes severity arguments. (Since I’ll be discussing both frequentist and Bayesian calculations I’ll use the notation Fr ( · ; · ) for frequency distributions and Pl ( · | · ) for plausibility distributions.) The examples I’ve seen typically involve nice models and location parameters, so let’s consider an irregular model with a scale parameter. Consider a univariate uniform distribution with unknown support; suppose that the number of data points, *n*, is at least two and we aim to assess the warrant for claims about the width of the support, call it ∆, using the difference between the largest and smallest data values, *D* = *X*_{max} – *X*_{min}, as our test statistic. (This model is “irregular” in that the support of the test statistic’s sampling distribution depends on the parameter.) Starting from the joint distribution of the order statistics of the uniform distribution one can show that *D* is a pivotal statistic satisfying
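In the notation just introduced, the pivotal relation can be written as follows (my reconstruction, consistent with the Beta(*n* – 1, 2) cdf used below):

```latex
\frac{D}{\Delta} \sim \mathrm{Beta}(n-1,\, 2),
\qquad \text{i.e.} \qquad
\mathrm{Fr}\!\left( D \le d \,;\, \Delta \right)
  = F_{\mathrm{Beta}(n-1,\,2)}\!\left( \tfrac{d}{\Delta} \right),
```

so the distribution of the scaled range *D*/∆ does not depend on ∆ at all, which is what makes it pivotal.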

Severity reasoning works like this: we aim to rule out a particular way that some claim could be wrong, thereby avoiding one way of being in error in asserting the claim. The way we do that is by carrying out a test that would frequently detect such an error if the claim were in fact wrong in that particular way. We attach the error detection frequency to the claim and say that the claim has passed a severe test for the presence of the error; the error detection frequency quantifies just how severe the test was.

To cash this out in the form of a SEV function we need a notion of accordance between the test statistic and the statistical hypothesis being tested. In our case, higher observed values of *D* accord with higher hypothesized values of ∆ (and in fact, values of ∆ smaller than the observed value of *D* are strictly ruled out). SEV is a function with a subjunctive mood; we don’t necessarily carry out any particular test but instead look at all the tests we might have carried out. So: if we were to claim that ∆ > δ when ∆ ≤ δ was true then we’d have committed an error. Smaller observed values of *D* are less in accord with larger values of ∆, so we could have tested for the presence of the error by declaring it to be present if the observed value of *D* were smaller than some threshold *d*. Now, there are lots of possible thresholds *d* and also lots of ways that the claim “∆ > δ” could be wrong – one way for each value of ∆ smaller than δ – but we can finesse these issues by considering the worst possible case. In the worst case the test is just barely passed, that is, the observed value of *D* is on the test’s threshold, and ∆ takes the value that minimizes the frequency of error detection and yet still satisfies ∆ ≤ δ. (Mayo does more work to justify all of this than I’m going to do here.) Thus the severity of the test that the claim “∆ > δ” would have passed is the worst-case frequency of declaring the error to be present supposing that to actually be the case:

in which *D* is (still) a random variable and *d* is the value of *D* that was actually observed in the data at hand.

In every example of a SEV calculation I’ve seen, the minimum occurs right on the boundary — in this case, at ∆ = δ. It’s not clear to me if Mayo would insist that the SEV function can only be sensibly defined for models in which the minimum is at the boundary; that restriction seems implicit in some of the things she’s written about accordance of test statistics with parameter values. In any event it holds in this model, so we can write

What this says is that to calculate the SEV function for this model we stick (*d* / δ) into the cdf for the Beta (*n* – 1, 2) distribution and allow δ to vary. I’ve written a little web app to allow readers to explore the meaning of this equation.
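In code this is nearly a one-liner, because for integer shape parameters the Beta(*n* – 1, 2) cdf has the closed form F(t) = n t^(n−1) − (n−1) t^n, so no special functions are needed. (A sketch of the calculation above; the particular *d* and *n* values are mine.)

```python
def sev_width_exceeds(delta, d, n):
    """SEV for the claim "support width > delta", given observed range d
    from n points. The worst case sits at width = delta, so
        SEV = Pr(D <= d; width = delta) = BetaCDF(d/delta; n - 1, 2)
            = n t^(n-1) - (n-1) t^n   with t = d/delta.
    Widths smaller than d are ruled out by the data outright (SEV = 1)."""
    if delta <= d:
        return 1.0
    t = d / delta
    return n * t ** (n - 1) - (n - 1) * t ** n

d, n = 0.8, 10
# Modest claims pass with high severity; ambitious claims (large delta) don't.
print([round(sev_width_exceeds(dl, d, n), 3) for dl in (0.8, 1.0, 1.5, 3.0)])
```

Sweeping δ upward from the observed range traces out the whole SEV curve, which is what the web app mentioned below plots.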

As part of my blogging agenda I had planned to find examples of prominent Bayesians asserting Mayo’s howlers. At one point Mayo expressed disbelief to me that I would call out individuals by name as I demonstrated how the SEV function addressed their criticism, but this would have been easier for me than she thought. The reason is that in all of the examples of formal SEV calculations that I have ever seen (including the one I just did above), it’s been applied to univariate location or scale parameter problems in which the SEV calculation produces exactly the same numbers as a Bayesian analysis (using the commonly-accepted default/non-informative/objective prior, i.e., uniform for location parameters and/or for the logarithm of scale parameters; this SEV calculator web app for the normal distribution mean serves equally well as a Bayesian posterior calculator under those priors). So I wasn’t too concerned about ruffling feathers – because I’m no one of consequence, but also because a critique that goes “criticisms of frequentism are unfair because frequentists have figured out this Bayesian-posterior-looking thing” isn’t the sort of thing any Bayesian is going to find particularly cutting, no matter what argument is adduced to justify the frequentist posterior analogue. In any event, at this remove I find I lack the motivation to actually go and track down an instance of a Bayesian issuing each of the howlers, so if you are one such and you’ve failed to grapple with Mayo’s severity argument – consider yourself chided! (Although I can’t be bothered to find quotes I can name a couple of names off the top of my head: Jay Kadane and William Briggs.)

Because of this identity of numerical output in the two approaches I found it hard to say whether the SEV functions computed and plotted in Mayo and Spanos’s article and in my web app illustrate the severity argument in a way that actually supports it or if they’re just lending it an appearance of reasonability because, through a mathematical coincidence, they happen to line up with default Bayesian analyses. Or, perhaps default Bayesian analyses seem to give reasonable numbers because, through a mathematical coincidence, they happen to line up with SEV – a form of what some critics of Bayes have called “frequentist pursuit”. To help resolve this ambiguity I sought to create an intuition pump: an easy-to-analyze statistical model that could be subjected to extreme conditions in which intuition would strongly suggest what sorts of conclusions were reasonable. The outputs of the formal statistical methods — the SEV function on the one hand and a Bayesian posterior on the other — could be measured relative to these intuitions. Of course, sometimes the point of carrying out a formal analysis is to educate one’s intuition, as in the birthday problem; but in other cases one’s intuition acts to demonstrate that the formalization isn’t doing the work one intended it to do, as in the case of the integrated information theory of consciousness. (The latter link has a discussion of “paradigm cases” that is quite pertinent to my present topic.)

When I first started blogging about severity the idea I had in mind for this intuition pump was to apply an optional stopping data collection design to the usual normal distribution with known variance and unknown mean. Either one or two data points would be observed, with the second data point observed only if the first one was within some region where it would be desirable to gather more information. This kind of optional stopping design induces the same likelihood function (up to proportionality) as a fixed sample size design, but the alteration of the sample space gives rise to very different frequency properties, and this guarantees that (unlike in the fixed sample size design) the SEV function and the Bayesian posterior will not agree in general.

Now, the computation of a SEV function demands a test procedure that gives a rejection region for any Type I error rate and any one-sided alternative hypothesis; this is because to calculate SEV we need to be able to run the test procedure backward and figure out for each possible one-sided alternative hypothesis what Type I error rate would have given rise to a rejection region with the observed value of the test statistic right on the boundary. (If that seemed complicated, it’s because it is.) In the optional stopping design the test procedure would involve two rejection regions, one for each possible sample size, and a collect-more-data region; given these extra degrees of freedom in specifying the test I found myself struggling to define a procedure that I felt could not be objected to – in particular, I couldn’t handle the math needed to find a uniformly most powerful test (if one even exists in this setup). The usual tool for proving the existence of uniformly most powerful tests, the Karlin-Rubin theorem, does not apply to the very weird sample space that arises in the optional stopping design – the dimensionality of the sample space is itself a random variable. But as I worked with the model I realized that optional stopping wasn’t the only way to alter the sample space to drive a wedge between the SEV function and the Bayesian posterior. When I examined the first stage of the optional stopping design in which the collect-more-data region creates a gap in the sample space, I realized that chopping out a chunk of the sample space and just forgetting about the second data point would be enough to force the two formal statistical methods to disagree.

An instance of such a model was described in my most recent post: a normal distribution with unknown *μ* in ℝ, unit *σ*, and a gap in the sample space between -1 and 3, yielding the probability density

*f*(*x*; *μ*) = *φ*(*x* − *μ*) / [1 − (Φ(3 − *μ*) − Φ(−1 − *μ*))] for *x* ≤ −1 or *x* ≥ 3, and zero inside the gap,

where *φ* and Φ denote the standard normal density and distribution functions.

As previously mentioned, the formalization of severity involves some kind of notion of accordance between test statistic values and parameter values. For the gapped normal distribution the Karlin-Rubin theorem applies directly: a uniformly most powerful test exists and it’s a threshold test, just as in the ordinary normal model. So it seems reasonable to say that larger values of *x* accord with larger values of *μ* even with the gap in the sample space, and the SEV function is constructed for the gapped normal model just as it would be for the ordinary normal model:

SEV(*μ* > *μ*_{0}) = Pr(*X* ≤ *x*; *μ* = *μ*_{0}),

with the probability computed under the gapped sampling distribution.

It’s interesting to note that frequentist analyses such as a p-value or the SEV function will yield the same result for *x* = -1 and *x* = 3. In both these cases, for example,

This is because the data enter into the inference through tail areas of the sampling probability density, and those tail areas are the same whether the interval of integration has its edge at *x* = -1 or *x* = 3.
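This tail-area equality can be verified directly. A small sketch, reading the gapped model as a renormalized truncated normal:

```python
import numpy as np
from scipy.stats import norm

# CDF of the gapped normal (unit sigma, gap between -1 and 3),
# renormalized over the remaining sample space.
def gapped_cdf(x, mu):
    z = 1 - (norm.cdf(3 - mu) - norm.cdf(-1 - mu))  # mass outside the gap
    if x <= -1:
        return norm.cdf(x - mu) / z
    if x < 3:
        return norm.cdf(-1 - mu) / z  # no mass accumulates inside the gap
    return (norm.cdf(x - mu) - (norm.cdf(3 - mu) - norm.cdf(-1 - mu))) / z

# The lower tail area at x = -1 equals the lower tail area at x = 3
# for every mu, so tail-area quantities cannot tell the two apart.
for mu in np.linspace(-5, 5, 101):
    assert np.isclose(gapped_cdf(-1, mu), gapped_cdf(3, mu))
```

The gap carries no probability mass, so the CDF is flat across it; any statistic that depends on the data only through such tail areas is blind to which edge of the gap was observed.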

The Bayesian posterior distribution, on the other hand, involves integration over the parameter space rather than the sample space. Assuming a uniform prior for *μ*, the posterior distribution is

*p*(*μ* | *x*) = [*φ*(*x* − *μ*)/*Z*(*μ*)] / ∫ [*φ*(*x* − *μ*′)/*Z*(*μ*′)] d*μ*′, where *Z*(*μ*) = 1 − (Φ(3 − *μ*) − Φ(−1 − *μ*)),

which does not have an analytical solution. We can see right away that the Bayesian posterior will not yield the same result when *x* = -1 as it does when *x* = 3 because the data enter into the inference through the likelihood function, and *x* = -1 induces a different likelihood function than *x* = 3.

But what about that uniform prior? Does a prior exist that will enable “frequentist pursuit” and bring the Bayesian analysis and the SEV function back into alignment? To answer this question, consider the absolute value of the derivative of SEV(*μ* > *m*) with respect to the bound *m*. This is the “SEV density”, the function one would integrate over the parameter space to recover SEV. I leave it as an exercise for the reader to verify that this function cannot be written as (proportional to) the product of the likelihood function and a data-independent prior density.
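One way to work the exercise (my own sketch): SEV is the *same* function of the bound for *x* = -1 and *x* = 3, because the tail areas are equal. So if the SEV density factored as a likelihood times a data-independent prior, the likelihood ratio between the two data values would have to be constant in the parameter. It is not:

```python
import numpy as np
from scipy.stats import norm

# Likelihoods under the gapped model: L(m; x) = phi(x - m) / Z(m).
# The truncation constant Z(m) cancels in the ratio L(m; 3) / L(m; -1),
# which reduces to phi(3 - m) / phi(-1 - m) = exp(4m - 4).
m = np.linspace(-3, 3, 7)
ratio = norm.pdf(3 - m) / norm.pdf(-1 - m)
assert np.allclose(ratio, np.exp(4 * m - 4))
assert ratio.max() / ratio.min() > 1e6  # anything but constant in m
```

Since the ratio varies over many orders of magnitude in *m*, no single prior density can reconcile the one SEV density with the two different likelihood functions.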

So! I have fulfilled the promise I made in my blogging agenda to specify a simple model in which the two approaches, operating on the exact same information, must disagree. It isn’t the model I originally thought I’d have – it’s even simpler and easier to analyze. The last item on the agenda is to subject the model to extreme conditions so as to magnify the differences between the SEV function and the Bayesian approach. This web app can be used to explore the two approaches.

The default setting of the web app shows a comparison I find very striking. In this scenario *x* = -1 and Pl(*μ* > 0 | *x*) = 0.52. (Nothing here turns on the observed value being right on the boundary – we could imagine it to be slightly below the boundary, say *x* = -1.01, and the change in the scenario would be correspondingly small.) This reflects the fact that *x* = -1 induces a likelihood function that has a fairly broad peak with a maximum near *μ* = 0. That is, the probability of landing in a vanishingly small neighbourhood of the value of *x* we actually observed is high in a relative sense for values of *μ* in a broad range that extends on both sides of *μ* = 0; when we normalize and integrate over *μ* > 0 we find that we’ve captured about half of the posterior plausibility mass. On the other hand, SEV(*μ* > 0) = 0.99. The SEV function is telling us that if *μ* > 0 were false then we would very frequently – at least 99 times out of 100 – have observed values of *x* that accord less well with *μ* > 0 than the one we have in hand. But wait – severity isn’t just about the subjunctive test result; it also requires that the data “accords with” the claim being made in an absolute sense. If *μ* = 1 then *x* = -1 is a median point of the sampling distribution, so I judge that *x* = -1 does indeed accord with *μ* > 0.
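The two numbers in this comparison can be checked with a short numerical sketch (my own code, reading the gapped model as a renormalized truncated normal):

```python
import numpy as np
from scipy.stats import norm

# Probability mass remaining outside the gap (-1, 3) as a function of mu.
def trunc_mass(mu):
    return 1 - (norm.cdf(3 - mu) - norm.cdf(-1 - mu))

# SEV(mu > 0) = Pr(X <= -1; mu = 0) under the gapped model.
sev = norm.cdf(-1) / trunc_mass(0.0)

# Bayesian posterior Pr(mu > 0 | x = -1) under a uniform prior, via a
# Riemann sum over the likelihood phi(x - mu) / trunc_mass(mu).
mu = np.linspace(-10, 10, 20001)
lik = norm.pdf(-1 - mu) / trunc_mass(mu)
post = lik[mu > 0].sum() / lik.sum()

print(f"SEV(mu > 0) = {sev:.2f}, Pr(mu > 0 | x = -1) = {post:.2f}")
```

These reproduce a severity of about 0.99 alongside a posterior plausibility of roughly one half, matching the web app’s default scenario.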

I personally find my intuition rebels against the idea that “*μ* > 0” is a well-warranted claim in light of *x* = -1; it also rebels at the notion that *x* = -1 and *x* = 3 provide equally good warrant for the claim that *μ* > 0. In the end, I strictly do not care about regions of the sample space far away from the observed data. In fact, this is the reason that I stopped blogging about this – about four years ago I took one look at that plot and felt my interest in (and motivation for writing about) the severity concept drain away. Since then, this whole thing has been weighing down my mind; the only reason I’ve managed to muster the motivation to finally get it out there is because I was playing around with Shiny apps recently – they’ve got a lot better since the last time I did so, which was also about four years ago – and started thinking about the visualizations I could make to illustrate these ideas.

In this model *μ* is not quite a location parameter; when it’s far from the gap the density is effectively a normal centered at *μ* but when it’s close to the gap its shape is distorted. It becomes a half-normal at the gap boundary and then something like an extra-shallow exponential (log-quadratic instead of log-linear like an actual exponential) as *μ* moves toward the center of the gap. At *μ* = 1 the probability mass flips from one side of the gap to the other. Here’s a little web app in which you can play around with this statistical model (don’t neglect the play button under the slider on the right hand side).
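The mass flip at *μ* = 1 is easy to confirm under the same truncated-normal reading:

```python
from scipy.stats import norm

# Probability mass on each side of the gap (-1, 3) as a function of mu.
def side_masses(mu):
    z = 1 - (norm.cdf(3 - mu) - norm.cdf(-1 - mu))
    return norm.cdf(-1 - mu) / z, (1 - norm.cdf(3 - mu)) / z

# At mu = 1, the midpoint of the gap, the two sides balance exactly;
# just below and just above, the bulk sits on opposite sides.
left, right = side_masses(1.0)
assert abs(left - right) < 1e-12
assert side_masses(0.9)[0] > 0.5 > side_masses(1.1)[0]
```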

Now, the question. I ask my readers to report their gut reactions, in addition to any more considered conclusions, in the comments.

Suppose *μ* is unknown and the data is a single observation *x*. Consider two scenarios:

(i) *x* = -1 (the left boundary)

(ii) *x* = 3 (the right boundary)

For the sake of concreteness suppose our interest is in *μ* ≤ 0 vs. *μ* > 0. Should it make a difference to our inference whether we’re in scenario (i) or scenario (ii)?

The Dutch book argument in turn relies on a concept of truth. Often framed in terms of bets on a horse-race, it relies on there only being one winner, which is the case for the overwhelming majority of horse races. The Dutch book argument shows that the odds, when converted to probabilities, must sum to 1 to avoid arbitrage possibilities… If we transfer this to statistics then we have different distributions indexed by a parameter. Based on the idea of truth, only one of these can be true, just as only one horse can win, and the same Dutch book argument shows that the odds must add to 1. In other words the prior must be a probability distribution. We note that in reality none of the offered distributions will be the truth, but due to the non-callability of Bayesian bets this is not considered to be a problem. Suppose we replace the question as whether a distribution represents the truth by the question as to whether it is a good approximation. Suppose that we bet, for example, that the *N*(0, 1) distribution is an adequate approximation for the data. We quote odds for this bet, the computer programme is run, and we either win or lose. If we quote odds of 5:1 then we will probably quote the same, or very similar, odds for the *N*(10^{−6}, 1) distribution, as for the *N*(0, 1+10^{−10}) distribution and so forth. It becomes clear that these odds are not representable by a probability distribution: only one distribution can be the ‘true’ but many can be adequate approximations.

I always meant to write something about how this line of argument goes wrong, but it wasn’t a high priority. But recently Davies reiterated this argument in a comment on Professor Mayo’s blog:

You define adequacy in a precise manner, a computer programme., there [sic] are many examples in my book. The inputs are the data and the model, the output yes or no. You place your bets beforehand, run the programme and win or lose your bet. The bets are realizable. If you bet 50-50 on the *N*(0,1) being an adequate model, you will no doubt bet about 50-50 on the *N*(10^{-20},1) also being an adequate model. Your bets are not expressible by a probability measure. The sum of the odds will generally be zero or infinity. …

I tried to reply in the comment thread, but WordPress ate my attempts, so: a blog post!

I have to wonder if Professor Davies asked even one Bayesian to evaluate this argument before he published it. (*In comments, Davies replies: I have been stating the argument for about 20 years now. Many Bayesians have heard my talks but so the only response I have had was by one in Lancaster who told me he had never heard the argument before and that was it.*) Let *M* be the set of statistical models under consideration. It’s true that if I bet 50-50 on *N*(0,1) being an adequate model, I will no doubt bet very close to 50-50 on *N*(10^{-20}, 1) also being an adequate model. Does this mean that “these odds are not representable by a probability distribution”? Not at all — we just need to get the sample space right. In this setup the appropriate sample space for a probability triple is the *powerset* of *M*, because exactly one of the members of the powerset of *M* will be realized when the data become known.

For example, suppose that *M* = {*N*(0,1), *N*(10^{-20}, 1), *N*(10,1)}; then there are eight conceivable outcomes — one for each possible combination of adequacy indications — that could occur once the data become known. We can encode this sample space using the binary expansions of the numbers from 0 to 7, with each digit interpreted as an indicator variable for the statistical adequacy of one of the models in *M*. Let the leftmost bit refer to *N*(0,1), the center bit refer to *N*(10^{-20}, 1), and the rightmost bit refer to *N*(10,1). Here’s a probability measure that serves as a counterexample to the claim that “[the 50-50] bets are not expressible by a probability measure”:

Pr(001) = Pr(110) = 0.5,

Pr(000) = Pr(100) = Pr(101) = Pr(011) = Pr(010) = Pr(111) = 0.

(This is an abuse of notation, since the Pr() function takes events, that is, sets of outcomes, and not raw outcomes.) The events Davies considers are “*N*(0,1) [is] an adequate model”, which is the set {100, 101, 110, 111}, and “*N*(10^{-20},1) [is] an adequate model”, which is the set {010, 011, 110, 111}; it is trivial to see that both these events are 50-50.
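The counterexample can be spelled out in a few lines of code (bit convention as in the post):

```python
from itertools import product

# Outcomes are 3-bit strings: leftmost bit = "N(0,1) is adequate",
# center bit = "N(1e-20, 1) is adequate", rightmost = "N(10,1) is adequate".
pr = {"".join(bits): 0.0 for bits in product("01", repeat=3)}
pr["001"] = 0.5
pr["110"] = 0.5

def event_prob(event):
    # events are sets of outcomes; probability is additive over outcomes
    return sum(pr[outcome] for outcome in event)

adequate_N01 = {b for b in pr if b[0] == "1"}   # {100, 101, 110, 111}
adequate_Neps = {b for b in pr if b[1] == "1"}  # {010, 011, 110, 111}

assert sum(pr.values()) == 1.0
assert event_prob(adequate_N01) == 0.5   # 50-50 on N(0,1) being adequate
assert event_prob(adequate_Neps) == 0.5  # 50-50 on N(1e-20, 1) as well
```

The two adequacy events overlap heavily (they share outcome 110), which is exactly why both can have probability one half under a single measure that sums to 1.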

Now obviously when *M* is uncountably infinite it’s not so easy to write down probability measures on sigma-algebras of the powerset of *M*. Still, that scenario is not particularly difficult for a Bayesian to handle: if the statistical adequacy function is measurable, a prior or posterior predictive probability measure automatically induces a pushforward probability measure on any sigma-algebra of the powerset of *M*. In fact, this is precisely the approach taken in the (rather small) Bayesian literature on assessing statistical adequacy; see for example *A nonparametric assessment of model adequacy based on Kullback-Leibler divergence*. These sorts of papers typically treat statistical adequacy as a continuous quantity, but all it would take to turn it into a Davies-style yes-no Boolean variable would be to dichotomize the continuous quantity at some threshold.

(A digression. To me, using a Bayesian nonparametric posterior distribution to assess the adequacy of a parametric model seems a bit pointless — if you have the posterior already, of what possible use is the parametric model? Actually, there *is* one use that I can think of, but I was saving it to write a paper about… Oh what the heck. I’m told (by Andrew Gelman, who should know!) that in social science it’s notorious that every variable is correlated with every other variable, at least a little bit. I imagine that this makes Pearl-style causal inference a big pain — all of the causal graphs would end up totally connected, or close to. I think there may be a role for Bayesian causal graph adequacy assessment; the causal model adequacy function would quantify the loss incurred by ignoring some edges in the highly-connected causal graph. I think this approach could facilitate communication between causal inference experts, subject matter experts, and policymakers.)

*This post’s title was originally more tendentious and insulting. As Professor Davies has graciously suggested that his future work might include a reference to this post, I think it only polite that I change the title to something less argumentative.*

“Error statistics refers to a standpoint regarding both (1) a general philosophy of science and the roles probability plays in inductive inference, and (2) a cluster of statistical tools, their interpretation, and their justiﬁcation.”

In Mayo’s writings I see two interrelated notions of severity corresponding to the two items listed in the quote: (1) an informal severity notion that Mayo uses when discussing philosophy of science and specific scientific investigations, and (2) Mayo’s formalization of severity at the data analysis level.

One of my besetting flaws is a tendency to take a narrow conceptual focus to the detriment of the wider context. In the case of Severity, part one, I think I ended up making claims about severity that were wrong. I was narrowly focused on severity in sense (2) — in fact, on one specific equation within (2) — but used a mish-mash of ideas and terminology drawn from all of my readings of Mayo’s work. When read through a philosophy-of-science lens, the result is a distorted and misstated version of severity in sense (1).

As a philosopher of science, I’m a rank amateur; I’m not equipped to add anything to the conversation about severity as a philosophy of science. My topic is statistics, not philosophy, and so I want to warn readers against interpreting Severity, part one as a description of Mayo’s philosophy of science; it’s more of a wordy introduction to the formal definition of severity in sense (2).

One of the author’s results (if I could nominate one as the most important, I’d choose this one) says that if you replace your model by another one which is in an arbitrarily close neighborhood (according to the [Prokhorov] metric discussed above), the posterior expectation could be as far away as you want. Which, if you choose the right metric, means that you replace your sampling model by another one out of which typical samples *look the same*, and which therefore can be seen as as appropriate for the situation as the original one.

Note that the result is primarily about a change in the sampling model, not the prior, although it is a bit more complex than that because if you change the sampling model, you need to adapt the prior, too, which is appropriately taken into account by the authors as far as I can see.

My own reaction was rather less impressed; I tossed off,

Conclusion: Don’t define “closeness” using the TV [that is, total variation] metric or matching a finite number of moments. Use KL divergence instead.

In response to a request by Mayo, OSS wrote up a “plain jane” explanation which was posted on Error Statistics Philosophy a couple of weeks later. It confirmed Christian’s summary:

So the brittleness theorems state that for any Bayesian model there is a second one, nearly indistinguishable from the first, achieving any desired posterior value within the deterministic range of the quantity of interest.

That sounds pretty terrible!

—

This issue came up for discussion in the comments of an Error Statistics Philosophy post in late December.

MAYO: Larry: Do you know anything about current reactions to, status of, the results by Houman Owhadi, Clint Scovel and Tim Sullivan? Are they deemed relevant for practice? (I heard some people downplay the results as not of practical concern.)

LARRY: I am not aware of significant rebuttals.

COREY: I have a vague idea that any rebuttal will basically assert that the distance these authors use is too non-discriminating in some sense, so Bayes fails to distinguish “nice” distributions from nearby (according to the distance) “nasty” ones. My intuition is that these results won’t hold for relative entropy, but I don’t have the knowledge and training to develop this idea — you’d need someone like John Baez for that.

OWHADI (the O in OSS): Well, one should define what one means by “nice” and “nasty” (and preferably without invoking circular arguments).

Also, it would seem to me that the statement that TV and Prokhorov cannot be used (or are not relevant) in “classical” Bayes is a powerful result in itself. Indeed TV has not only been used in many parts of statistics but it has also been called the testing metric by Le Cam for a good reason: i.e. (writing *n* for the number of samples), Le Cam’s Lemma states that

1) For any n, if TV is close enough (as a function of n) all tests are bad.

2) Given any TV distance, with enough sample data there exists a good test.

Now concerning using closeness in Kullback–Leibler (KL) divergence rather than Prokhorov or TV, observe that (as noted in our original post) closeness in KL divergence is not something you can test with discrete data, but you can test closeness in TV or Prokhorov. In other words the statement “if the true distribution and my model are close in KL then classical Bayes behaves nicely” can be understood as “if I am given this infinite amount of information then my Bayesian estimation is good” which is precisely one issue/concern raised by our paper (brittleness under “finite” information).

Note also that, the assumption of closeness in KL divergence requires the non-singularity of the data generating distribution with respect to the Bayesian model (which could be a very strong assumption if you are trying to certify the safety of a critical system and results like the Feldman–Hajek Theorem tell us that “most” pairs of measures are mutually singular in the now popular context of stochastic PDEs).

In preparing a reply to Owhadi, I discovered a comment written by Dave Higdon on Xian’s Og a few days after OSS’s “plain jane” summary went up on Error Statistics Philosophy. He described the situation in concrete terms; this clarified for me just what it is that OSS’s brittleness theorems demonstrate. (Christian Hennig saw the issue too, but I couldn’t follow what he was saying without the example Higdon gave. And OSS are perfectly aware of it too — this post represents me catching up with the more knowledgeable folks.)

Suppose we judge a system safe provided that the probability that the random variable *X* exceeds 10 is very low. We assume that *X* has a Gaussian distribution with known variance 1 and unknown mean *μ*, the prior for which is

This prior doesn’t encode a strong opinion about the prior predictive probability of the event *X* > 10 (i.e., disaster).

Next, we learn about the safety of the system by observing a realization of *X*, and it turns out that the datum *x* is smaller than 7 and the posterior predictive probability of disaster is negligible. Good news, right?

OSS say, not so fast! They ask: suppose that our model is misspecified, and the true model is “nearby” in Prokhorov or TV metric. They show that for any datum that we can observe, the set of all nearby models includes a model that predicts disaster.

What kinds of model misspecifications do the Prokhorov and TV metrics capture? Suppose that the data space has been discretized to precision 2*ϵ*, and consider the set of models in which, for each possible observable datum *x*_{0}, the probability density is

in which χ(.) is the indicator function. For any specific value of *μ*, all of the models in the above set are within a small ball centered on the Gaussian model, where “small” is measured by either the Prokhorov or TV metric. (*How* small depends on *ϵ*.) Each model embodies an implication of the form: if no disaster is impending, then *X* will not fall within *ϵ* of *x*_{0};

by taking the contrapositive, we see that this is equivalent to: if *X* falls within *ϵ* of *x*_{0}, then disaster is impending.

Each of these “nearby” models basically modifies the Gaussian model to enable *one specific possible datum* to be a certain indicator of disaster. Thus, no matter which datum we actually end up observing, there is a “nearby” model for which both (i) typical samples are basically indistinguishable from typical samples under our assumed Gaussian model, and yet (ii) the realized datum has caused that “nearby” model, like Chicken Little, to squawk that the sky is falling.

OSS have proved a very general version of the above phenomenon: under the (weak) conditions they assume, for any given data set that we can observe, the set of all models “near” the posterior distribution contains a model that, upon observation of the realized data, goes into (the statistical model equivalent of) spasms of pants-shitting terror.
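A toy version of the construction (my own simplification for illustration; OSS’s theorems are far more general) makes the mechanism concrete: relocate the sliver of sampling probability around one particular datum into the disaster region, and the result is a model that is TV-close to the original yet treats that datum as a certain harbinger of disaster.

```python
import numpy as np
from scipy.stats import norm

# Base model: X ~ N(0, 1); "disaster" means X > 10. Perturbed model:
# delete the mass in a width-2*eps window around a chosen datum x0 and
# relocate it uniformly over (10, 11). eps and x0 are arbitrary choices.
eps, x0 = 0.005, 1.0

def base_pdf(x):
    return norm.pdf(np.asarray(x, dtype=float))

def perturbed_pdf(x):
    x = np.asarray(x, dtype=float)
    out = norm.pdf(x)
    moved = norm.cdf(x0 + eps) - norm.cdf(x0 - eps)  # mass in the window
    out[np.abs(x - x0) < eps] = 0.0     # the window becomes impossible...
    out[(x > 10) & (x < 11)] += moved   # ...except when disaster is underway
    return out

# The total variation distance is just the relocated mass: tiny.
grid = np.linspace(-8, 12, 400001)
dx = grid[1] - grid[0]
tv = 0.5 * np.sum(np.abs(perturbed_pdf(grid) - base_pdf(grid))) * dx
print(f"TV distance between base and perturbed models: {tv:.4f}")
```

Typical samples from the two models are essentially indistinguishable, yet under the perturbed model an observation near *x*_{0} is only possible in the disaster regime; repeat the construction once per possible datum and every observable outcome has a “nearby” model screaming that the sky is falling.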

There’s nothing special to Bayes here; in particular, all of the talk about the asymptotic testability of TV- and/or Prokhorov-distinct distributions is a red herring. The OSS procedure stymies learning about the system of interest because the model misspecification set is specifically constructed to allow any possible data set to be totally misleading under the assumed model. Seen in this light, OSS’s choice of article titles is rather tendentious, don’t you think? If tendentious titles are the order of the day, perhaps the first one could be called *As flies to wanton boys are we to th’ gods: Why no statistical model whatsoever is “good enough”* and the second one could be called *Prokhorov and total variation neighborhoods and paranoid psychotic breaks with reality*.


My wife says that a glance at the spine gives the impression that it reads “Badass”.
