# Everything I never got around to writing about severity

The first post on this blog described my blogging agenda. It was pinned to the top of the blog but it’s badly out of date so I’ve unpinned it. When I left off, I was about halfway through my explanation of severity and was about to embark on a description of what Mayo calls “howlers” – calumnies cast upon the classical tools of frequentist statistics that are addressed by the severity concept. The severity concept and its formal mathematical incarnation, the SEV function, are explained in Mayo and Spanos’s 2011 article Error Statistics, as are the ways in which the SEV function addresses typical criticisms (which are helpfully written in bold font in numbered and center-justified boxes, but SEV isn’t introduced until howler #2).

I recommend that people who want to really understand the severity argument read the above-linked paper, but for completeness’s sake let’s have a look at how the SEV function formalizes severity arguments. (Since I’ll be discussing both frequentist and Bayesian calculations I’ll use the notation Fr ( · ; · ) for frequency distributions and Pl ( · | · ) for plausibility distributions.) The examples I’ve seen typically involve nice models and location parameters, so let’s consider an irregular model with a scale parameter. Consider a univariate uniform distribution with unknown support; suppose that the number of data points, *n*, is at least two and we aim to assess the warrant for claims about the width of the support, call it ∆, using the difference between the largest and smallest data values, *D* = *X*_{max} – *X*_{min}, as our test statistic. (This model is “irregular” in that the support of the test statistic’s sampling distribution depends on the parameter.) Starting from the joint distribution of the order statistics of the uniform distribution one can show that *D* is a pivotal statistic satisfying

Fr(*D* / ∆ ≤ *t* ; ∆) = *F*_{Beta(*n* – 1, 2)}(*t*) for every ∆, that is, *D* / ∆ ~ Beta(*n* – 1, 2) whatever the true width may be.

Severity reasoning works like this: we aim to rule out a particular way that some claim could be wrong, thereby avoiding one way of being in error in asserting the claim. The way we do that is by carrying out a test that would frequently detect such an error if the claim were in fact wrong in that particular way. We attach the error detection frequency to the claim and say that the claim has passed a severe test for the presence of the error; the error detection frequency quantifies just how severe the test was.

To cash this out in the form of a SEV function we need a notion of accordance between the test statistic and the statistical hypothesis being tested. In our case, higher observed values of *D* accord with higher hypothesized values of ∆ (and in fact, values of ∆ smaller than the observed value of *D* are strictly ruled out). SEV is a function with a subjunctive mood; we don’t necessarily carry out any particular test but instead look at all the tests we might have carried out. So: if we were to claim that ∆ > δ when ∆ ≤ δ was true then we’d have committed an error. Smaller observed values of *D* are less in accord with larger values of ∆, so we could have tested for the presence of the error by declaring it to be present if the observed value of *D* were smaller than some threshold *d*. Now, there are lots of possible thresholds *d* and also lots of ways that the claim “∆ > δ” could be wrong – one way for each value of ∆ smaller than δ – but we can finesse these issues by considering the worst possible case. In the worst case the test is just barely passed, that is, the observed value of *D* is on the test’s threshold, and ∆ takes the value that minimizes the frequency of error detection and yet still satisfies ∆ ≤ δ. (Mayo does more work to justify all of this than I’m going to do here.) Thus the severity of the test that the claim “∆ > δ” would have passed is the worst-case frequency of declaring the error to be present supposing that to actually be the case:

in which *D* is (still) a random variable and *d* is the value of *D* that was actually observed in the data at hand.

In every example of a SEV calculation I’ve seen, the minimum occurs right on the boundary — in this case, at ∆ = δ. It’s not clear to me if Mayo would insist that the SEV function can only be sensibly defined for models in which the minimum is at the boundary; that restriction seems implicit in some of the things she’s written about accordance of test statistics with parameter values. In any event it holds in this model, so we can write

SEV(∆ > δ) = Fr(*D* ≤ *d* ; ∆ = δ) = *F*_{Beta(*n* – 1, 2)}(*d* / δ).

What this says is that to calculate the SEV function for this model we stick (*d* / δ) into the cdf for the Beta (*n* – 1, 2) distribution and allow δ to vary. I’ve written a little web app to allow readers to explore the meaning of this equation.
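For readers who prefer code to web apps, here is a minimal sketch in Python (standard library only; the observed range *d* = 0.8 and sample size *n* = 10 are made-up illustration values). It exploits the closed form of the Beta(*n* – 1, 2) cdf for integer *n*, namely *F*(*t*) = *n t*^(*n* – 1) – (*n* – 1) *t*^*n*:

```python
def sev_width_claim(d, delta, n):
    """Severity for the claim "the support width exceeds delta".

    The worst case sits on the boundary delta, where D / delta ~ Beta(n - 1, 2),
    whose cdf has the closed form F(t) = n t^(n-1) - (n - 1) t^n.
    Widths at or below the observed range d are ruled out outright.
    """
    if delta <= d:
        return 1.0
    t = d / delta
    return n * t ** (n - 1) - (n - 1) * t ** n

# Hypothetical observed range d = 0.8 from n = 10 uniform draws:
sev_curve = {delta: sev_width_claim(0.8, delta, 10)
             for delta in (0.8, 1.0, 1.5, 2.0)}
```

As δ grows the claim “∆ > δ” gets stronger and its severity falls, which is the behaviour the web app plots.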

As part of my blogging agenda I had planned to find examples of prominent Bayesians asserting Mayo’s howlers. At one point Mayo expressed disbelief to me that I would call out individuals by name as I demonstrated how the SEV function addressed their criticism, but this would have been easier for me than she thought. The reason is that in all of the examples of formal SEV calculations that I have ever seen (including the one I just did above), it’s been applied to univariate location or scale parameter problems in which the SEV calculation produces exactly the same numbers as a Bayesian analysis (using the commonly-accepted default/non-informative/objective prior, i.e., uniform for location parameters and/or for the logarithm of scale parameters; this SEV calculator web app for the normal distribution mean serves equally well as a Bayesian posterior calculator under those priors). So I wasn’t too concerned about ruffling feathers – because I’m no one of consequence, but also because a critique that goes “criticisms of frequentism are unfair because frequentists have figured out this Bayesian-posterior-looking thing” isn’t the sort of thing any Bayesian is going to find particularly cutting, no matter what argument is adduced to justify the frequentist posterior analogue. In any event, at this remove I find I lack the motivation to actually go and track down an instance of a Bayesian issuing each of the howlers, so if you are one such and you’ve failed to grapple with Mayo’s severity argument – consider yourself chided! (Although I can’t be bothered to find quotes I can name a couple of names off the top of my head: Jay Kadane and William Briggs.)

Because of this identity of numerical output in the two approaches I found it hard to say whether the SEV functions computed and plotted in Mayo and Spanos’s article and in my web app illustrate the severity argument in a way that actually supports it or whether they’re just lending it an appearance of reasonability because, through a mathematical coincidence, they happen to line up with default Bayesian analyses. Or, perhaps default Bayesian analyses seem to give reasonable numbers because, through a mathematical coincidence, they happen to line up with SEV – a form of what some critics of Bayes have called “frequentist pursuit”. To help resolve this ambiguity I sought to create an intuition pump: an easy-to-analyze statistical model that could be subjected to extreme conditions in which intuition would strongly suggest what sorts of conclusions were reasonable. The outputs of the formal statistical methods — the SEV function on the one hand and a Bayesian posterior on the other — could be measured relative to these intuitions. Of course, sometimes the point of carrying out a formal analysis is to educate one’s intuition, as in the birthday problem; but in other cases one’s intuition acts to demonstrate that the formalization isn’t doing the work one intended it to do, as in the case of the integrated information theory of consciousness. (The latter link has a discussion of “paradigm cases” that is quite pertinent to my present topic.)

When I first started blogging about severity the idea I had in mind for this intuition pump was to apply an optional stopping data collection design to the usual normal distribution with known variance and unknown mean. Either one or two data points would be observed, with the second data point observed only if the first one was within some region where it would be desirable to gather more information. This kind of optional stopping design induces the same likelihood function (up to proportionality) as a fixed sample size design, but the alteration of the sample space gives rise to very different frequency properties, and this guarantees that (unlike in the fixed sample size design) the SEV function and the Bayesian posterior will not agree in general.
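A toy simulation makes the design concrete (a sketch; the continuation region (-0.5, 0.5), the value μ = 0, and the number of runs are made-up illustration choices, not anything from my web apps):

```python
import random

random.seed(0)
A, B = -0.5, 0.5      # hypothetical "collect more data" region
MU = 0.0              # true mean; variance is known to be 1

def draw_sample():
    """One run of the optional stopping design: n ends up 1 or 2."""
    x1 = random.gauss(MU, 1.0)
    if A < x1 < B:                      # first point inconclusive
        return (x1, random.gauss(MU, 1.0))
    return (x1,)

samples = [draw_sample() for _ in range(200_000)]
# The dimensionality of the sample space is itself a random variable:
frac_two = sum(len(s) == 2 for s in samples) / len(samples)
```

Here frac_two settles near Φ(0.5) – Φ(-0.5) ≈ 0.38, and any frequency calculation over this design has to average over the two possible sample sizes – which is exactly what breaks the agreement with the fixed sample size analysis.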

Now, the computation of a SEV function demands a test procedure that gives a rejection region for any Type I error rate and any one-sided alternative hypothesis; this is because to calculate SEV we need to be able to run the test procedure backward and figure out for each possible one-sided alternative hypothesis what Type I error rate would have given rise to a rejection region with the observed value of the test statistic right on the boundary. (If that seemed complicated, it’s because it is.) In the optional stopping design the test procedure would involve two rejection regions, one for each possible sample size, and a collect-more-data region; given these extra degrees of freedom in specifying the test I found myself struggling to define a procedure that I felt could not be objected to – in particular, I couldn’t handle the math needed to find a uniformly most powerful test (if one even exists in this setup). The usual tool for proving the existence of uniformly most powerful tests, the Karlin-Rubin theorem, does not apply to the very weird sample space that arises in the optional stopping design – the dimensionality of the sample space is itself a random variable. But as I worked with the model I realized that optional stopping wasn’t the only way to alter the sample space to drive a wedge between the SEV function and the Bayesian posterior. When I examined the first stage of the optional stopping design in which the collect-more-data region creates a gap in the sample space, I realized that chopping out a chunk of the sample space and just forgetting about the second data point would be enough to force the two formal statistical methods to disagree.

An instance of such a model was described in my most recent post: a normal distribution with unknown *μ* in ℝ and unit *σ* and a gap in the sample space between -1 and 3, yielding the probability density

*p*(*x* ; *μ*) = *φ*(*x* – *μ*) / *Z*(*μ*) for *x* ≤ -1 or *x* ≥ 3, and 0 inside the gap, where *φ* and Φ are the standard normal pdf and cdf and *Z*(*μ*) = 1 – [Φ(3 – *μ*) – Φ(-1 – *μ*)] is the normalizing constant.
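A minimal sketch of this density in Python (standard library only; the gap endpoints -1 and 3 are hard-coded to match the post):

```python
from math import erf, exp, pi, sqrt

LO, HI = -1.0, 3.0  # edges of the gap in the sample space

def phi(z):
    """Standard normal pdf."""
    return exp(-z * z / 2) / sqrt(2 * pi)

def Phi(z):
    """Standard normal cdf."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def Z(mu):
    """Normalizing constant: the normal mass that survives the gap."""
    return 1.0 - (Phi(HI - mu) - Phi(LO - mu))

def gapped_pdf(x, mu):
    if LO < x < HI:
        return 0.0
    return phi(x - mu) / Z(mu)
```

For any fixed μ the function integrates to 1 over the real line, and it assigns zero density throughout the gap.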

As previously mentioned, the formalization of severity involves some kind of notion of accordance between test statistic values and parameter values. For the gapped normal distribution the Karlin-Rubin theorem applies directly: a uniformly most powerful test exists and it’s a threshold test, just as in the ordinary normal model. So it seems reasonable to say that larger values of *x* accord with larger values of *μ* even with the gap in the sample space, and the SEV function is constructed for the gapped normal model just as it would be for the ordinary normal model:

SEV(*μ* > *m*) = Fr(*X* ≤ *x* ; *μ* = *m*), where *x* is the observed value.

It’s interesting to note that frequentist analyses such as a p-value or the SEV function will yield the same result for *x* = -1 and *x* = 3. In both these cases, for example,

Fr(*X* ≥ -1 ; *μ*) = Fr(*X* ≥ 3 ; *μ*) for every value of *μ*.

This is because the data enter into the inference through tail areas of the sampling probability density, and those tail areas are the same whether the interval of integration has its edge at *x* = -1 or *x* = 3.
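That equality is easy to confirm numerically – a sketch in Python (standard library only, with the gap endpoints -1 and 3 hard-coded):

```python
from math import erf, sqrt

LO, HI = -1.0, 3.0  # gap endpoints

def Phi(z):
    """Standard normal cdf."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def Z(mu):
    return 1.0 - (Phi(HI - mu) - Phi(LO - mu))

def upper_tail(x, mu):
    """Fr(X >= x ; mu) for the gapped normal."""
    if x <= LO:
        # all mass below x survives; subtract the (removed) gap mass
        # from the ordinary normal upper tail
        kept = 1.0 - Phi(x - mu) - (Phi(HI - mu) - Phi(LO - mu))
    else:
        # x at or inside the gap: same tail as at the gap's upper edge
        kept = 1.0 - Phi(max(x, HI) - mu)
    return kept / Z(mu)
```

The Φ(-1 – μ) terms cancel in the first branch, so the tails at the two gap edges agree exactly for every μ.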

The Bayesian posterior distribution, on the other hand, involves integration over the parameter space rather than the sample space. Assuming a uniform prior for *μ*, the posterior distribution is

Pl(*μ* | *x*) = [*φ*(*x* – *μ*) / *Z*(*μ*)] / ∫ [*φ*(*x* – *μ*′) / *Z*(*μ*′)] d*μ*′, with *φ* the standard normal density and *Z*(*μ*) = 1 – [Φ(3 – *μ*) – Φ(-1 – *μ*)],

which does not have an analytical solution. We can see right away that the Bayesian posterior will not yield the same result when *x* = -1 as it does when *x* = 3 because the data enter into the inference through the likelihood function, and *x* = -1 induces a different likelihood function than *x* = 3.

But what about that uniform prior? Does a prior exist that will enable “frequentist pursuit” and bring the Bayesian analysis and the SEV function back into alignment? To answer this question, consider the absolute value of the derivative of the SEV function with respect to *m*. This is the “SEV density”, the function that one would integrate over the parameter space to recover SEV. I leave it as an exercise for the reader to verify that this function cannot be written as (proportional to) the product of the likelihood function and a data-independent prior density.
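Here is a numerical version of that exercise, as a sketch (standard library only; the SEV derivative is taken by central difference rather than anything exact). If some data-independent prior could reproduce SEV, the “implied prior” – SEV density divided by the likelihood – would be the same function of *m* whichever of *x* = -1 or *x* = 3 was observed, up to a constant factor; instead their ratio varies with *m*:

```python
from math import erf, exp, pi, sqrt

LO, HI = -1.0, 3.0

def phi(z):
    return exp(-z * z / 2) / sqrt(2 * pi)

def Phi(z):
    return 0.5 * (1 + erf(z / sqrt(2)))

def Z(mu):
    return 1.0 - (Phi(HI - mu) - Phi(LO - mu))

def sev(m):
    # SEV(mu > m) is the same function for x = -1 and x = 3:
    # all the sampling mass at or below either point sits below the gap.
    return Phi(LO - m) / Z(m)

def sev_density(m, h=1e-6):
    """|d SEV / d m| by central difference."""
    return abs(sev(m + h) - sev(m - h)) / (2 * h)

def likelihood(m, x):
    return phi(x - m) / Z(m)

def implied_prior(m, x):
    """The 'prior' that frequentist pursuit would need at observed x."""
    return sev_density(m) / likelihood(m, x)

# Constant in m only if a data-independent prior existed:
ratios = [implied_prior(m, -1.0) / implied_prior(m, 3.0) for m in (0.0, 1.0)]
```

The ratio equals φ(3 – *m*) / φ(-1 – *m*), which is far from constant, so no single prior can be factored out: the “prior” that frequentist pursuit demands here depends on the data.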

So! I have fulfilled the promise I made in my blogging agenda to specify a simple model in which the two approaches, operating on the exact same information, must disagree. It isn’t the model I originally thought I’d have – it’s even simpler and easier to analyze. The last item on the agenda is to subject the model to extreme conditions so as to magnify the differences between the SEV function and the Bayesian approach. This web app can be used to explore the two approaches.

The default setting of the web app shows a comparison I find very striking. In this scenario *x* = -1 and Pl(*μ* > 0 | *x*) = 0.52. (Nothing here turns on the observed value being right on the boundary – we could imagine it to be slightly below the boundary, say *x* = -1.01, and the change in the scenario would be correspondingly small.) This reflects the fact that *x* = -1 induces a likelihood function that has a fairly broad peak with a maximum near *μ* = 0. That is, the probability of landing in a vanishingly small neighbourhood of the value of *x* we actually observed is high in a relative sense for values of *μ* in a broad range that extends on both sides of *μ* = 0; when we normalize and integrate over *μ* > 0 we find that we’ve captured about half of the posterior plausibility mass. On the other hand, SEV(*μ* > 0) = 0.99. The SEV function is telling us that if *μ* > 0 were false then we would very frequently – at least 99 times out of 100 – have observed values of *x* that accord less well with *μ* > 0 than the one we have in hand. But wait – severity isn’t just about the subjunctive test result; it also requires that the data “accords with” the claim being made in an absolute sense. If *μ* = 1 then *x* = -1 is a median point of the sampling distribution, so I judge that *x* = -1 does indeed accord with *μ* > 0.
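Both headline numbers can be reproduced in a few lines of Python (a sketch, standard library only; the posterior integral is a brute-force Riemann sum over a wide bracket, which is plenty for two digits):

```python
from math import erf, exp, pi, sqrt

LO, HI = -1.0, 3.0
x_obs = -1.0

def phi(z):
    return exp(-z * z / 2) / sqrt(2 * pi)

def Phi(z):
    return 0.5 * (1 + erf(z / sqrt(2)))

def Z(mu):
    return 1.0 - (Phi(HI - mu) - Phi(LO - mu))

def likelihood(mu):
    return phi(x_obs - mu) / Z(mu)

# Pl(mu > 0 | x = -1) under a uniform prior, by Riemann sum over mu
step = 0.001
grid = [-15.0 + i * step for i in range(30001)]
weights = [likelihood(mu) for mu in grid]
post_pos = sum(w for mu, w in zip(grid, weights) if mu > 0) / sum(weights)

# SEV(mu > 0) at x = -1: Fr(X <= -1 ; mu = 0)
sev_pos = Phi(LO) / Z(0.0)
```

post_pos comes out near 0.52 and sev_pos near 0.99, matching the app’s default display.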

I personally find that my intuition rebels against the idea that “*μ* > 0” is a well-warranted claim in light of *x* = -1; it also rebels at the notion that *x* = -1 and *x* = 3 provide equally good warrant for the claim that *μ* > 0. In the end, I strictly do not care about regions of the sample space far away from the observed data. In fact, this is the reason that I stopped blogging about this – about four years ago I took one look at that plot and felt my interest in (and motivation for writing about) the severity concept drain away. Since then, this whole thing has been weighing on my mind; the only reason I’ve managed to muster the motivation to finally get it out there is that I was playing around with Shiny apps recently – they’ve got a lot better since the last time I did so, which was also about four years ago – and started thinking about the visualizations I could make to illustrate these ideas.

Corey,

To quote Jonathan Swift, “It is useless to attempt to ‘math’ a philosophy professor out of a thing they were never ‘mathed’ into.”

I never expected to change Mayo’s mind. This was about challenging my own intellectual commitment to Bayes as strongly as possible.

The language is a bit verbose. A better phrasing might be,

“it’s useless to math a philosopher out of a thing they never mathed into”

I quite like this example by the way. It’s a stylized version of a common practical problem. You have a physical process in which you can detect events one at a time, but when many are detected over time they form a stable (frequency) pattern. Because of technical difficulties, outcomes in certain regions can’t be detected (or detectors can’t be placed in those regions). The effect is to truncate the “sample space” as in this problem.

Even the n=1 part is realistic. Imagine detecting a single incoming missile of unknown type (stylistically represented by the unknown mu in this case) and having to make a decision on how to react to it.

Let’s hope the algorithm designer who programs a response is using the Bayesian posterior rather than SEV to make that decision, given that if x=3 is observed, the Bayesian posterior is saying “mu is probably greater than 1” while SEV is noncommittal.

I was actually thinking that if the gap were from -2 to 2 (as I was originally planning) then this would be a stylized version of the statistical significance filter for publication. (By Mayo’s lights this would probably not be legitimate even as a toy model of the filter — I imagine she’d insist on basing inference on the actual realized data and not permit an inference when the gap is not, in some sense, a natural part of the model.)

Adding a decision element to inference really heightens the contradiction. I was thinking about writing something about “what if people’s lives depended on not claiming ‘μ >0’ in error? would x = -1.01 be enough evidence?” but in the end I didn’t bother. (And even though it’s not the reason why I didn’t bother, there’s something to be said for keeping inference separate from decision.)

Bayesians have a bad reputation when it comes to criticisms:

A Frequentist would never respond to criticism by making a host of insanely trivial technical mistakes, then in a fuddle of incomprehension, simply dismiss any criticism that makes them question their philosophical assumptions. Bayesians should endeavor to follow their excellent example when responding to criticisms.

Man, I’d love to get Cox’s perspective on the gapped normal distribution model. Or Spanos’s.

After his lordship spanks the p-bashers maybe he’ll be willing to spank the SEV-bashers?

Should be easy to do since SEV is consistent with Mayo’s philosophy and is therefore a priori correct.

Your calculations don’t reflect how I think the severity concept should be used. From a frequentist perspective, I’d phrase this in terms of null hypotheses and alternatives. You’re apparently interested here in finding evidence in favour of μ > 0, or, equivalently, against H0: μ ≤ 0, so you’d need to try to reject μ ≤ 0. You haven’t exactly defined your test and I’m too lazy to do the computations myself, but surely no reasonable test would reject μ ≤ 0 based on x=-1, and therefore no frequentist (working with severity or not) would claim in this situation that x=-1 provides evidence against μ ≤ 0. An appropriate way of computing severity here would be to ask, given that we can’t reject μ ≤ 0, whether we can rule out μ > c, c being some constant. One can then figure out how large c could be so that, in case we observe x=-1, it would’ve been very likely to not observe such a small or even smaller x in case μ > c, and then state that we have severe evidence against μ > c. In case x=3 still doesn’t reject μ ≤ 0 (as said before, I’m not doing computations) one could again ask whether we have severe evidence against μ > c (or, equivalently, in favour of μ ≤ c), but c would then have to be much larger. (So the state of affairs regarding evidence is *not* the same for x=-1 and x=3.)

What severity you compute depends on the result of the test. If you reject the H0, you have evidence against it, and can ask, on top of this, whether you can also rule out values close to the H0 but not part of it. E.g., if you reject μ ≤ 0, you may wonder whether there's severe evidence even against μ ≤ c, c > 0, or equivalently, in favour of μ > c. If you don't reject H0, as here, you can only ask whether you have severe evidence against part of the alternative, i.e., against μ > c or equivalently in favour of μ ≤ c, c > 0, but you cannot have severe evidence against the H0, as you'd claim when claiming μ > 0.

I feel like Galileo standing next to a telescope saying, “Look! Look!”

The stuff Mayo writes about “discrepancy from the null” is mainly to address “howlers” about how tests treat all alternatives (and composite nulls) the same in the face of data. She uses SEV in the subjunctive mode to show what different tests would have licensed had they been performed (and then sliiiiiides on over to saying the resulting inferences are licensed even though the tests were not pre-specified). But fine, that’s severity the informal concept, not SEV the function.

Okay, let’s get our Neyman-Pearson on.

H0: μ ≤ 0

Ha: μ > 0.

Type I error rate α = 0.05

The Karlin-Rubin theorem says the UMP test is the threshold test that rejects H0 when X > x_threshold. So we need x_threshold such that

Fr(X ≤ x_threshold ; μ = 0) = 1 – α,

and if the observed value of x exceeds x_threshold then we reject H0: μ ≤ 0.

That threshold value is… wait for it… wait for it…

x_threshold = -1.028

Yes, really.
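The threshold is quick to verify (a sketch assuming the gapped model from the post; `statistics.NormalDist` in the Python standard library handles the normal quantile):

```python
from statistics import NormalDist

N = NormalDist()        # standard normal
LO, HI = -1.0, 3.0      # gap endpoints
alpha = 0.05

# Probability mass the gapped model keeps under mu = 0
Z0 = 1.0 - (N.cdf(HI) - N.cdf(LO))

# We need Fr(X <= t ; mu = 0) = 1 - alpha.  For any t <= LO all the
# normal mass below t survives the gap, so the gapped cdf there is
# Phi(t) / Z0, and inverting it is a single normal quantile:
t = N.inv_cdf((1 - alpha) * Z0)
```

Under μ = 0 about 99% of the surviving sampling mass piles up below the gap, so the upper-5% rejection region begins just below x = -1, and an observed x = -1 rejects H0: μ ≤ 0 at the 5% level.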

Nice! This should teach me to not be so stingy with my time reading and computing. OK, fair enough, the point is yours (I somehow in my mind had put the hole in the distribution only between +1 and +3, stupid).

But then I’d say indeed your intuition that this is somehow wrong is no less questionable than the test/severity result, because indeed something as big as x=-1 is very rarely observed under μ ≤ 0, and much more often under, for example, μ=1. So there is a case for μ > 0 here, isn’t there?

I’m generally wary of arguments like “so-and-so result looks weird in my intuition so must be bad”. (Apparently you wrote an earlier posting in which you asked people to vote what they think the answer should be – but aren’t we doing formal statistics partly because human intuitions with probabilities should not be trusted?)

I do address this point in the post. It’s in the middle of the wall of text; you’ll find it with a Ctrl-F for “intuition”.

I want to be clear that I’m *not* saying severity is Wrong, capital W. The point is more to write down a simple model, as close as possible to the sort of exponential family model that always has default-Bayes/frequentist agreement, in which the “correct” Neyman-Pearson procedure is inescapable and in which the contrast between the kinds of inferences the two formal methods license is very stark.

I assume there must be a bunch of literature on inference for censored/truncated location-scale models? Anyone know of some good standard references? Perhaps one needs to consider the data as partially observed/incomplete and take this into account explicitly?

Regardless, as mentioned to Corey, you could set up an essentially equivalent example without an explicit gap and instead with a discontinuous pdf. The math would work the same but perhaps the intuition is different.

In this case there is no physical reason for the gap and so d(x1, x2) → 0 while L(theta; x1) != L(theta; x2). That is: you claim different evidence based on an arbitrarily small difference in the data.

This seems more like an argument against using a likelihood-based approach when you allow discontinuous densities, than in the case where there was a gap that arose from some clear physical censoring.

An obvious (but somewhat hacky) solution in the discontinuous case is to define the likelihood as a smoothed version of the discontinuous pdf and take the limit as the amount of smoothing goes to zero. Or to average over a small neighbourhood of the data, which amounts to the same sort of thing. But this is just a consequence of densities being somewhat ‘ill-defined’ objects requiring either regularity conditions or regularisation methods.

On the other hand, this hack is possibly not available for the ‘physical gap’ case. It doesn’t make as much sense to smooth over the large physical gap. So I’m back to wondering whether there should be explicit treatment of the censoring or not?

I don’t know about truncated/censored location-scale models but one related topic is ascertainment bias and models that aim to correct for it.

Another naive thought that I keep coming back to (and please point me to good discussions if you know of them!):

Are we interested in

– Properties of collectives of estimates, OR

– Properties of estimates of collectives

?

E.g. in this case we are asked to make an estimate (or decision or whatever) based on a single sample. You could think of this as a decision problem where the goal is to make decisions in a series of n=1 cases. You want a good decision rule to follow. So this framing is about a collective of decisions (i.e. a decision rule!) and its properties.

On the other hand, another natural way to think of statistics is as estimating the characteristics (=parameters) of collectives (=distributions or datasets). In this case any estimate of a population based on n=1 is likely to be bad, no? The key question is then: can I estimate the characteristics of a population with n > 1 but n < ∞ such that, for n greater than some N0, I have some guarantee of having a good estimate?

The answer is, presumably, yes for some characteristics and given some further assumptions. In the present case we could define a series of estimates for each n, Tn, e.g. the max likelihood estimate. Now I don’t care about having a good estimate for n =1, I just care about having a good estimate for some n > 1 but less than infinity. E.g. that the estimate ‘stabilises’ after n > N0 for some N0 and so I don’t have to collect more (or infinite amounts of!) data. The central limit theorem (and more sophisticated limit theorems) help us out here.

Now, whether these problems amount to something like the same thing (as Fisher-Neyman unification advocates would argue) I don’t know. But I do know that there is a difference between a sequence of M, n = 1 estimates and an estimate based on M measurements. The following classic bad joke illustrates this:

> Three statisticians go hunting. When they see a rabbit, the first one shoots, missing it on the left. The second one shoots and misses it on the right. The third one shouts: “We’ve hit it!”

I am actually not interested (here) in either of those — I’m interested in doing the right thing in the one instance of the problem I set out.

Fair enough. But then it seems you are interested in decision theory, right? E.g. you want to make decisions following a rule which has some property like coherence or whatever.

At the most basic, I want to measure things. The parameter of a distribution labels a distribution and hence represents a distribution. A sample of n=1 is a bad measurement of a distribution to me.

Yes I can set up and solve a decision problem for an n=1 problem, but that involves a whole lot of other factors. For whatever reason, in real-life decisions I seem to often use a minimax approach rather than a most probable case.

PS Looking for good discussions I found this by Efron which is actually a pretty fascinating read (don’t think I’d read it before):

https://projecteuclid.org/download/pdf_1/euclid.aos/1176345778

(‘Maximum likelihood and decision theory’)

I saw you posted that paper on the tweet machine. I’m reading it now.