The first scenario concerns an investment fund that deceptively advertises portfolio picks made by the “Pickrite method”:

[Jay] Kadane[, a prominent Bayesian,] is emphasizing that Bayesian inference is *conditional* on the particular outcome. So once **x** is known and fixed, other possible outcomes that could have occurred but didn’t are irrelevant. Recall finding that Pickrite’s procedure was to build *k* portfolios and to report only the one with the best rate of return.

This argument is a straw man caused by a misunderstanding — an unintentional equivocation, if that’s a thing, on the phrase “other portfolios that might have been sent to you but were not”. Now, nothing here turns on the fact that the scenario doesn’t completely specify the distribution of portfolio returns. We are told that the stocks are picked at random, so the portfolio returns are independent and identically distributed random variables; the argument would seem to continue to apply if we specify that portfolio rates of return have some particular known distribution. Mayo tells us that according to a holder of the LP, once **x** is known we’re not allowed to consider the other chances that the Pickrite method provides for finding an impressive portfolio. Suppose portfolio rates of return are known to have, say, an exponential distribution with unknown mean *μ*.

But this is simply wrong. The argument overlooks the fact that the LP doesn’t forbid us from taking the data collection mechanism into account (including mechanisms of missing data) *when constructing the likelihood function itself*. We’ve been told that we were presented with just the best result from among the *k* portfolios that were built, so to construct the likelihood we take the probability density for all *k* rates of return and we integrate out the *k* – 1 unobserved rates of return that were smaller than the one we do get to see. In statistical jargon, our likelihood function arises from the probability density for the largest order statistic; the general formula for the density of an order statistic can be found here. Assuming as before that rates of return follow an exponential distribution with mean *μ*, the correct likelihood arising in this scenario would be

$$\mathcal{L}(\mu; x) = k \cdot \frac{1}{\mu} e^{-x/\mu} \left(1 - e^{-x/\mu}\right)^{k-1}.$$
In fact, in this scenario an error statistician would use precisely this probability model to compute “audited” *p*-values and confidence intervals, and since the parameter being estimated is a scale parameter we would have the standard numerical agreement of frequentist confidence intervals and Bayesian credible intervals (under the usual reference prior for scale parameters).
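To make the order-statistic construction concrete, here’s a small simulation sketch (the particular *k*, the exponential mean, and all variable names are my own illustrative choices): the distribution of the best of *k* portfolio returns should match the largest-order-statistic density *k·f(x)·F(x)^{k–1}* from which the audited likelihood is built.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
k, mean_return, n_sims = 20, 2.0, 100_000   # illustrative choices

# Pickrite builds k random portfolios and reports only the best one.
best = rng.exponential(mean_return, size=(n_sims, k)).max(axis=1)

# Density of the largest of k iid draws: k * f(x) * F(x)**(k-1).
dist = stats.expon(scale=mean_return)
hist, edges = np.histogram(best, bins=50, range=(0.0, 15.0), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
pdf_max = k * dist.pdf(centers) * dist.cdf(centers) ** (k - 1)

# The histogram of reported "best" returns matches the
# order-statistic density up to Monte Carlo noise.
max_abs_err = np.max(np.abs(hist - pdf_max))
print(max_abs_err < 0.05)  # → True
```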

Kadane might very well ask, “Why are you considering other portfolios that you might have been sent but were not, to reason from the one that you got?” But the portfolios he would be referring to aren’t the other portfolios in the sample that we didn’t get to see — a holder of the LP agrees that *those* portfolios need to be taken into consideration. The “other portfolios that you might have been sent but were not” are the ones that might have arisen in hypothetical replications of the whole data-generating process, that is, other best portfolios selected out of *k* of them. *Those* are the “other portfolios” that likelihood theorists and Bayesians consider irrelevant. (The misunderstanding of the referent of “other portfolios” is the unintentional equivocation.) Of course, an error statistician disagrees that they are irrelevant — they’re implicit in the *p*-value computation — but this is a separate issue.

So that disposes of the Pickrite scenario and the cherry-picking argument against the LP. The second scenario is attributed to Allan Birnbaum, a statistician who started as a likelihood theorist but later abandoned those views due to the inability of likelihoods to control error probabilities. Here’s how Mayo presents it:

A single observation is made on **X**, which can take values 1, 2, …, 100. “There are 101 possible distributions conveniently indexed by a parameter *θ* taking values 0, 1, …, 100. If *θ* = 0 then **X** is equally likely to take any value in 1, 2, …, 100; if *θ* = *r* for some *r* ≠ 0 then **X** = *r* with certainty.”

It’s apparent that this scenario is designed to challenge the views of likelihood theorists more than those of Bayesians. Nevertheless, Mayo writes, “Allan Birnbaum gets the prize for inventing chestnuts that deeply challenge both those who do, and those who do not, hold the Likelihood Principle!” And since Bayesians do hold the LP, let’s see what challenges Birnbaum’s chestnut presents for us Bayesians.

The contention is that if we let the data determine which non-zero *θ* value to consider then we are certain to find evidence apparently pointing strongly against *θ* = 0 even if it is in fact the case that *θ* = 0. That sounds pretty bad!

First we need to say what “evidence pointing against a hypothesis” means for Bayesians. Later in the book Mayo discusses Bayesian epistemology as a school of thought within academic philosophy, including various proposed numerical measures of confirmation. We don’t need to touch on those complications here; for us it will be enough to say that the data provide evidence against a hypothesis when the posterior odds against it are higher than the prior odds.

Because we’re looking at the odds against *θ* = 0 it is helpful to first decompose the hypothesis space into *θ* = 0 and its negation *θ* ≠ 0 and then assign prior probability mass conditional on *θ* ≠ 0 to the non-zero values, call them *θ’*, that *θ* might take. Given such a decomposition, this is the odds form of Bayes’s theorem in this problem:

$$\frac{\Pr(\theta \ne 0 \mid X = r)}{\Pr(\theta = 0 \mid X = r)} = \frac{\Pr(\theta \ne 0)}{\Pr(\theta = 0)} \times \frac{\sum_{\theta'} \Pr(X = r \mid \theta = \theta')\, \Pr(\theta = \theta' \mid \theta \ne 0)}{\Pr(X = r \mid \theta = 0)}.$$
The ratio on the left is the posterior odds, the first ratio on the right is the prior odds, and the second ratio on the right is the update factor. In the sum in the numerator of the update factor, all of the Pr(**X** = *r* | *θ* = *θ’*) terms vanish except the one with *θ’* = *r*, which is equal to one; the denominator is Pr(**X** = *r* | *θ* = 0) = 1/100. The update factor therefore reduces to 100 · Pr(*θ* = *r* | *θ* ≠ 0).

The factor of 100 is the likelihood ratio; perhaps unexpectedly, it can be seen that the likelihood ratio is *not* the only term in the update factor. The Pr(*θ* = *r* | *θ* ≠ 0) term — the conditional prior probability of the very parameter value picked out by the data — enters as well.

Now we can imagine all sorts of background information that might inform our prior probabilities; in this respect the statement of the problem is underspecified. Suppose nevertheless that this is all we are given; then it seems appropriate to specify a uniform conditional prior distribution, Pr(*θ* = *θ’* | *θ* ≠ 0) = 0.01. Then no matter what value the datum takes, the update factor is identically one; that is, for this prior distribution the data are certain to provide evidence neither against *θ* = 0 nor in its favour.

If we have information that justifies a non-uniform conditional prior distribution then for some values of *θ* the prior will be larger than 0.01; a corresponding datum would result in an update factor greater than one and thus be evidence against *θ* = 0. But in this situation there must be other values of *θ* for which the prior is smaller than 0.01, and a corresponding datum would result in an update factor smaller than one and thus be evidence *in favour* of *θ* = 0. So contrary to what Cox and Hinkley say holds for likelihood theorists, we Bayesians are *never* certain of finding evidence against *θ* = 0 even when it is in fact the case — the closest we get is being certain that the data provide no evidence one way or the other.
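Here’s a sketch of this bookkeeping (the sampling model encoded below is my reading of Birnbaum’s example as described above; the “lumpy” prior is an arbitrary illustration):

```python
# Birnbaum's chestnut: theta in {0, 1, ..., 100}; under theta = 0, X is
# uniform on {1, ..., 100}; under theta = r != 0, X = r with certainty.

def update_factor(r, cond_prior):
    """Posterior-odds update factor against theta = 0 on observing X = r.

    cond_prior[r] is Pr(theta = r | theta != 0) for r = 1..100.
    In the numerator of the update factor only theta' = r survives,
    with Pr(X = r | theta = r) = 1; the denominator is
    Pr(X = r | theta = 0) = 1/100.
    """
    return cond_prior[r] / (1 / 100)

uniform = {r: 1 / 100 for r in range(1, 101)}
print([update_factor(r, uniform) for r in (1, 50, 100)])  # → [1.0, 1.0, 1.0]

# A non-uniform conditional prior: some data are evidence against
# theta = 0 (factor > 1), others are evidence in its favour (factor < 1).
lumpy = {r: (0.02 if r <= 25 else 0.5 / 75) for r in range(1, 101)}
print(round(update_factor(10, lumpy), 3), round(update_factor(90, lumpy), 3))
```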

I wonder if this chestnut of Birnbaum’s poses any challenge for the severity concept…

I recommend that people who want to really understand the severity argument read the above-linked paper, but for completeness’s sake let’s have a look at how the SEV function formalizes severity arguments. (Since I’ll be discussing both frequentist and Bayesian calculations I’ll use the notation Fr( · ; · ) for frequency distributions and Pl( · | · ) for plausibility distributions.) The examples I’ve seen typically involve nice models and location parameters, so let’s consider an irregular model with a scale parameter. Consider a univariate uniform distribution with unknown support; suppose that the number of data points, *n*, is at least two and we aim to assess the warrant for claims about the width of the support, call it ∆, using the difference between the largest and smallest data values, *D* = *X*_{max} – *X*_{min}, as our test statistic. (This model is “irregular” in that the support of the test statistic’s sampling distribution depends on the parameter.) Starting from the joint distribution of the order statistics of the uniform distribution one can show that *D* is a pivotal statistic satisfying

$$\frac{D}{\Delta} \sim \operatorname{Beta}(n - 1,\, 2).$$
Severity reasoning works like this: we aim to rule out a particular way that some claim could be wrong, thereby avoiding one way of being in error in asserting the claim. The way we do that is by carrying out a test that would frequently detect such an error if the claim were in fact wrong in that particular way. We attach the error detection frequency to the claim and say that the claim has passed a severe test for the presence of the error; the error detection frequency quantifies just how severe the test was.

To cash this out in the form of a SEV function we need a notion of accordance between the test statistic and the statistical hypothesis being tested. In our case, higher observed values of *D* accord with higher hypothesized values of ∆ (and in fact, values of ∆ smaller than the observed value of *D* are strictly ruled out). SEV is a function with a subjunctive mood; we don’t necessarily carry out any particular test but instead look at all the tests we might have carried out. So: if we were to claim that ∆ > δ when ∆ ≤ δ was true then we’d have committed an error. Smaller observed values of *D* are less in accord with larger values of ∆, so we could have tested for the presence of the error by declaring it to be present if the observed value of *D* were smaller than some threshold *d*. Now, there are lots of possible thresholds *d* and also lots of ways that the claim “∆ > δ” could be wrong – one way for each value of ∆ smaller than δ – but we can finesse these issues by considering the worst possible case. In the worst case the test is just barely passed, that is, the observed value of *D* is on the test’s threshold, and ∆ takes the value that minimizes the frequency of error detection and yet still satisfies ∆ ≤ δ. (Mayo does more work to justify all of this than I’m going to do here.) Thus the severity of the test that the claim “∆ > δ” would have passed is the worst-case frequency of declaring the error to be present supposing that to actually be the case:

$$\mathrm{SEV}(\Delta > \delta) = \min_{\Delta \le \delta} \operatorname{Fr}(D \le d;\, \Delta),$$
in which *D* is (still) a random variable and *d* is the value of *D* that was actually observed in the data at hand.

In every example of a SEV calculation I’ve seen, the minimum occurs right on the boundary — in this case, at ∆ = δ. It’s not clear to me if Mayo would insist that the SEV function can only be sensibly defined for models in which the minimum is at the boundary; that restriction seems implicit in some of the things she’s written about accordance of test statistics with parameter values. In any event it holds in this model, so we can write

$$\mathrm{SEV}(\Delta > \delta) = \operatorname{Fr}(D \le d;\, \Delta = \delta) = F_{\operatorname{Beta}(n-1,\,2)}\!\left(\frac{d}{\delta}\right).$$
What this says is that to calculate the SEV function for this model we stick (*d* / δ) into the cdf for the Beta (*n* – 1, 2) distribution and allow δ to vary. I’ve written a little web app to allow readers to explore the meaning of this equation.
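To check the pivotal claim and the resulting SEV formula numerically, here’s a sketch (the sample size, true width, and observed *d* are all illustrative choices of mine):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, delta_true = 5, 2.0               # illustrative sample size and width

# D = X_max - X_min for n uniform draws on an interval of width Delta.
samples = rng.uniform(0.0, delta_true, size=(200_000, n))
d_sim = samples.max(axis=1) - samples.min(axis=1)

# Check the pivot D/Delta ~ Beta(n-1, 2) against the empirical CDF.
grid = np.linspace(0.05, 0.95, 10)
ecdf = np.array([(d_sim / delta_true <= g).mean() for g in grid])
pivot_err = np.abs(ecdf - stats.beta(n - 1, 2).cdf(grid)).max()
print(pivot_err < 0.01)              # → True

# SEV for the claim "Delta > delta", given the observed range d:
def sev(d, delta):
    return stats.beta(n - 1, 2).cdf(d / delta)

print(round(sev(1.5, 2.0), 3))       # → 0.633
```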

As part of my blogging agenda I had planned to find examples of prominent Bayesians asserting Mayo’s howlers. At one point Mayo expressed disbelief to me that I would call out individuals by name as I demonstrated how the SEV function addressed their criticism, but this would have been easier for me than she thought. The reason is that in all of the examples of formal SEV calculations that I have ever seen (including the one I just did above), it’s been applied to univariate location or scale parameter problems in which the SEV calculation produces exactly the same numbers as a Bayesian analysis (using the commonly-accepted default/non-informative/objective prior, i.e., uniform for location parameters and/or for the logarithm of scale parameters; this SEV calculator web app for the normal distribution mean serves equally well as a Bayesian posterior calculator under those priors). So I wasn’t too concerned about ruffling feathers – because I’m no one of consequence, but also because a critique that goes “criticisms of frequentism are unfair because frequentists have figured out this Bayesian-posterior-looking thing” isn’t the sort of thing any Bayesian is going to find particularly cutting, no matter what argument is adduced to justify the frequentist posterior analogue. In any event, at this remove I find I lack the motivation to actually go and track down an instance of a Bayesian issuing each of the howlers, so if you are one such and you’ve failed to grapple with Mayo’s severity argument – consider yourself chided! (Although I can’t be bothered to find quotes I can name a couple of names off the top of my head: Jay Kadane and William Briggs.)

Because of this identity of numerical output in the two approaches I found it hard to say whether the SEV functions computed and plotted in Mayo and Spanos’s article and in my web app illustrate the severity argument in a way that actually supports it or if they’re just lending it an appearance of reasonability because, through a mathematical coincidence, they happen to line up with default Bayesian analyses. Or, perhaps default Bayesian analyses seem to give reasonable numbers because, through a mathematical coincidence, they happen to line up with SEV – a form of what some critics of Bayes have called “frequentist pursuit”. To help resolve this ambiguity I sought to create an intuition pump: an easy-to-analyze statistical model that could be subjected to extreme conditions in which intuition would strongly suggest what sorts of conclusions were reasonable. The outputs of the formal statistical methods — the SEV function on the one hand and a Bayesian posterior on the other — could be measured relative to these intuitions. Of course, sometimes the point of carrying out a formal analysis is to educate one’s intuition, as in the birthday problem; but in other cases one’s intuition acts to demonstrate that the formalization isn’t doing the work one intended it to do, as in the case of the integrated information theory of consciousness. (The latter link has a discussion of “paradigm cases” that is quite pertinent to my present topic.)

When I first started blogging about severity the idea I had in mind for this intuition pump was to apply an optional stopping data collection design to the usual normal distribution with known variance and unknown mean. Either one or two data points would be observed, with the second data point observed only if the first one was within some region where it would be desirable to gather more information. This kind of optional stopping design induces the same likelihood function (up to proportionality) as a fixed sample size design, but the alteration of the sample space gives rise to very different frequency properties, and this guarantees that (unlike in the fixed sample size design) the SEV function and the Bayesian posterior will not agree in general.

Now, the computation of a SEV function demands a test procedure that gives a rejection region for any Type I error rate and any one-sided alternative hypothesis; this is because to calculate SEV we need to be able to run the test procedure backward and figure out for each possible one-sided alternative hypothesis what Type I error rate would have given rise to a rejection region with the observed value of the test statistic right on the boundary. (If that seemed complicated, it’s because it is.) In the optional stopping design the test procedure would involve two rejection regions, one for each possible sample size, and a collect-more-data region; given these extra degrees of freedom in specifying the test I found myself struggling to define a procedure that I felt could not be objected to – in particular, I couldn’t handle the math needed to find a uniformly most powerful test (if one even exists in this setup). The usual tool for proving the existence of uniformly most powerful tests, the Karlin-Rubin theorem, does not apply to the very weird sample space that arises in the optional stopping design – the dimensionality of the sample space is itself a random variable. But as I worked with the model I realized that optional stopping wasn’t the only way to alter the sample space to drive a wedge between the SEV function and the Bayesian posterior. When I examined the first stage of the optional stopping design in which the collect-more-data region creates a gap in the sample space, I realized that chopping out a chunk of the sample space and just forgetting about the second data point would be enough to force the two formal statistical methods to disagree.

An instance of such a model was described in my most recent post: a normal distribution with unknown *μ* in ℝ and unit *σ* and a gap in the sample space between -1 and 3, yielding the probability density

$$f(x; \mu) = \frac{\varphi(x - \mu)\,\mathbf{1}\{x \notin (-1, 3)\}}{1 - \Phi(3 - \mu) + \Phi(-1 - \mu)},$$

where *φ* and *Φ* are the standard normal density and distribution functions.
As previously mentioned, the formalization of severity involves some kind of notion of accordance between test statistic values and parameter values. For the gapped normal distribution the Karlin-Rubin theorem applies directly: a uniformly most powerful test exists and it’s a threshold test, just as in the ordinary normal model. So it seems reasonable to say that larger values of *x* accord with larger values of *μ* even with the gap in the sample space, and the SEV function is constructed for the gapped normal model just as it would be for the ordinary normal model:

$$\mathrm{SEV}(\mu > m) = \operatorname{Fr}(X \le x;\, \mu = m).$$
It’s interesting to note that frequentist analyses such as a p-value or the SEV function will yield the same result for *x* = -1 and *x* = 3. In both these cases, for example,

$$\operatorname{Fr}(X \le -1;\, \mu) = \operatorname{Fr}(X \le 3;\, \mu) \quad \text{for every } \mu.$$
This is because the data enter into the inference through tail areas of the sampling probability density, and those tail areas are the same whether the interval of integration has its edge at *x* = -1 or *x* = 3.
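A quick numerical check of this equality (the gapped-normal CDF implementation below is my own sketch of the model just described):

```python
from scipy.stats import norm

A, B = -1.0, 3.0                     # gap endpoints from the model above

def Z(mu):
    # Normalizing constant: probability mass lying outside the gap.
    return norm.cdf(A - mu) + 1.0 - norm.cdf(B - mu)

def cdf(x, mu):
    # Sampling CDF of the gapped normal, for x outside the gap.
    if x <= A:
        return norm.cdf(x - mu) / Z(mu)
    return (norm.cdf(A - mu) + norm.cdf(x - mu) - norm.cdf(B - mu)) / Z(mu)

# With the worst case on the boundary mu = 0, SEV(mu > 0) is the
# sampling CDF at the observed x under mu = 0 -- identical at both
# gap edges, because the gap itself carries no probability mass.
print(round(cdf(-1.0, 0.0), 3), round(cdf(3.0, 0.0), 3))  # → 0.992 0.992
```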

The Bayesian posterior distribution, on the other hand, involves integration over the parameter space rather than the sample space. Assuming a uniform prior for *μ*, the posterior distribution is

$$p(\mu \mid x) = \frac{f(x; \mu)}{\int_{-\infty}^{\infty} f(x; \mu')\, d\mu'},$$
which does not have an analytical solution. We can see right away that the Bayesian posterior will not yield the same result when *x* = -1 as it does when *x* = 3 because the data enter into the inference through the likelihood function, and *x* = -1 induces a different likelihood function than *x* = 3.
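A numerical sketch of the posterior computation under the uniform prior (the grid limits and step are arbitrary choices of mine; the posterior tail probability is just a normalized sum over the grid):

```python
import numpy as np
from scipy.stats import norm

A, B = -1.0, 3.0

def likelihood(x, mu):
    # Gapped-normal density at the observed x (x lies outside the gap).
    Z = norm.cdf(A - mu) + 1.0 - norm.cdf(B - mu)
    return norm.pdf(x - mu) / Z

mu = np.linspace(-30.0, 30.0, 60_001)    # step 0.001; flat prior on mu

post = {}
for x in (-1.0, 3.0):
    w = likelihood(x, mu)
    w /= w.sum()                         # normalize on the uniform grid
    post[x] = w[mu > 0.0].sum()          # Pl(mu > 0 | x)
    print(x, round(post[x], 2))
```

The two observations give very different posterior tail probabilities, as the argument above requires.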

But what about that uniform prior? Does a prior exist that will enable “frequentist pursuit” and bring the Bayesian analysis and the SEV function back into alignment? To answer this question, consider the absolute value of the derivative of the SEV function with respect to *m*. This is the “SEV density”, the function that one would integrate over the parameter space to recover SEV. I leave it as an exercise for the reader to verify that this function cannot be written as (proportional to) the product of the likelihood function and a data-independent prior density.
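For readers who would rather not do the exercise by hand, here’s a numerical sketch (the finite-difference step is arbitrary): the SEV function is the same for *x* = -1 and *x* = 3 while the likelihoods differ, so the “prior” implied by dividing the SEV density by the likelihood depends on the data.

```python
from scipy.stats import norm

A, B = -1.0, 3.0

def Z(m):
    return norm.cdf(A - m) + 1.0 - norm.cdf(B - m)

def sev(x, m):
    # SEV(mu > m): sampling CDF at the observed x under mu = m.
    if x <= A:
        return norm.cdf(x - m) / Z(m)
    return (norm.cdf(A - m) + norm.cdf(x - m) - norm.cdf(B - m)) / Z(m)

def implied_prior(x, m, h=1e-5):
    # SEV density (|d SEV / dm| by central difference) divided by the
    # likelihood. If SEV were a Bayesian posterior under some prior,
    # this ratio would be the same function of m for every x.
    dens = abs(sev(x, m + h) - sev(x, m - h)) / (2.0 * h)
    return dens / (norm.pdf(x - m) / Z(m))

# The ratio of implied "priors" for x = 3 vs x = -1 works out to
# exp(4 - 4m): manifestly data-dependent, so no data-independent
# prior can reproduce the SEV function.
for m in (0.0, 0.5, 1.0):
    print(m, round(implied_prior(3.0, m) / implied_prior(-1.0, m), 2))
```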

So! I have fulfilled the promise I made in my blogging agenda to specify a simple model in which the two approaches, operating on the exact same information, must disagree. It isn’t the model I originally thought I’d have – it’s even simpler and easier to analyze. The last item on the agenda is to subject the model to extreme conditions so as to magnify the differences between the SEV function and the Bayesian approach. This web app can be used to explore the two approaches.

The default setting of the web app shows a comparison I find very striking. In this scenario *x* = -1 and Pl(*μ* > 0 | *x*) = 0.52. (Nothing here turns on the observed value being right on the boundary – we could imagine it to be slightly below the boundary, say *x* = -1.01, and the change in the scenario would be correspondingly small.) This reflects the fact that *x* = -1 induces a likelihood function that has a fairly broad peak with a maximum near *μ* = 0. That is, the probability of landing in a vanishingly small neighbourhood of the value of *x* we actually observed is high in a relative sense for values of *μ* in a broad range that extends on both sides of *μ* = 0; when we normalize and integrate over *μ* > 0 we find that we’ve captured about half of the posterior plausibility mass. On the other hand, SEV(*μ* > 0) = 0.99. The SEV function is telling us that if *μ* > 0 were false then we would very frequently – at least 99 times out of 100 – have observed values of *x* that accord less well with *μ* > 0 than the one we have in hand. But wait – severity isn’t just about the subjunctive test result; it also requires that the data “accords with” the claim being made in an absolute sense. If *μ* = 1 then *x* = -1 is a median point of the sampling distribution, so I judge that *x* = -1 does indeed accord with *μ* > 0.

I personally find my intuition rebels against the idea that “*μ* > 0” is a well-warranted claim in light of *x* = -1; it also rebels at the notion that *x* = -1 and *x* = 3 provide equally good warrant for the claim that *μ* > 0. In the end, I strictly do not care about regions of the sample space far away from the observed data. In fact, this is the reason that I stopped blogging about this – about four years ago I took one look at that plot and felt my interest in (and motivation for writing about) the severity concept drain away. Since then, this whole thing has been weighing down my mind; the only reason I’ve managed to muster the motivation to finally get it out there is because I was playing around with Shiny apps recently – they’ve got a lot better since the last time I did so, which was also about four years ago – and started thinking about the visualizations I could make to illustrate these ideas.

In this model *μ* is not quite a location parameter; when it’s far from the gap the density is effectively a normal centered at *μ* but when it’s close to the gap its shape is distorted. It becomes a half-normal at the gap boundary and then something like an extra-shallow exponential (log-quadratic instead of log-linear like an actual exponential) as *μ* moves toward the center of the gap. At *μ* = 1 the probability mass flips from one side of the gap to the other. Here’s a little web app in which you can play around with this statistical model (don’t neglect the play button under the slider on the right hand side).

Now, the question; I ask my readers to report their gut reactions, in addition to any more considered conclusions, in the comments.

Suppose *μ* is unknown and the data is a single observation *x*. Consider two scenarios:

(i) *x* = -1 (the left boundary)
(ii) *x* = 3 (the right boundary)

For the sake of concreteness suppose our interest is in *μ* ≤ 0 vs. *μ* > 0. Should it make a difference to our inference whether we’re in scenario (i) or scenario (ii)?

The Dutch book argument in turn relies on a concept of truth. Often framed in terms of bets on a horse-race, it relies on there only being one winner, which is the case for the overwhelming majority of horse races. The Dutch book argument shows that the odds, when converted to probabilities, must sum to 1 to avoid arbitrage possibilities… If we transfer this to statistics then we have different distributions indexed by a parameter. Based on the idea of truth, only one of these can be true, just as only one horse can win, and the same Dutch book argument shows that the odds must add to 1. In other words the prior must be a probability distribution. We note that in reality none of the offered distributions will be the truth, but due to the non-callability of Bayesian bets this is not considered to be a problem. Suppose we replace the question as whether a distribution represents the truth by the question as to whether it is a good approximation. Suppose that we bet, for example, that the *N*(0, 1) distribution is an adequate approximation for the data. We quote odds for this bet, the computer programme is run, and we either win or lose. If we quote odds of 5:1 then we will probably quote the same, or very similar, odds for the *N*(10^{−6}, 1) distribution, as for the *N*(0, 1+10^{−10}) distribution and so forth. It becomes clear that these odds are not representable by a probability distribution: only one distribution can be the ‘true’ but many can be adequate approximations.

I always meant to write something about how this line of argument goes wrong, but it wasn’t a high priority. But recently Davies reiterated this argument in a comment on Professor Mayo’s blog:

You define adequacy in a precise manner, a computer programme., there [sic] are many examples in my book. The inputs are the data and the model, the output yes or no. You place your bets beforehand, run the programme and win or lose your bet. The bets are realizable. If you bet 50-50 on the *N*(0,1) being an adequate model, you will no doubt bet about 50-50 on the *N*(10^{-20},1) also being an adequate model. Your bets are not expressible by a probability measure. The sum of the odds will generally be zero or infinity. …

I tried to reply in the comment thread, but WordPress ate my attempts, so: a blog post!

I have to wonder if Professor Davies asked even one Bayesian to evaluate this argument before he published it. (*In comments, Davies replies: I have been stating the argument for about 20 years now. Many Bayesians have heard my talks but so the only response I have had was by one in Lancaster who told me he had never heard the argument before and that was it.*) Let *M* be the set of statistical models under consideration. It’s true that if I bet 50-50 on *N*(0,1) being an adequate model, I will no doubt bet very close to 50-50 on *N*(10^{-20}, 1) also being an adequate model. Does this mean that “these odds are not representable by a probability distribution”? Not at all — we just need to get the sample space right. In this setup the appropriate sample space for a probability triple is the *powerset* of *M*, because exactly one of the members of the powerset of *M* will be realized when the data become known.

For example, suppose that *M* = {*N*(0,1), *N*(10^{-20}, 1), *N*(10,1)}; then there are eight conceivable outcomes — one for each possible combination of adequacy indications — that could occur once the data become known. We can encode this sample space using the binary expansion of the numbers from 0 to 7, with each digit of the binary expansion of the integer interpreted as an indicator variable for the statistical adequacy of one of the models in *M*. Let the leftmost bit refer to *N*(0,1), the center bit refer to *N*(10^{-20}, 1), and the rightmost bit refer to *N*(10,1). Here’s a probability measure that serves as a counterexample to the claim that “[the 50-50] bets are not expressible by a probability measure”:

Pr(001) = Pr(110) = 0.5,

Pr(000) = Pr(100) = Pr(101) = Pr(011) = Pr(010) = Pr(111) = 0.

(This is an abuse of notation, since the Pr() function takes events, that is, sets of outcomes, and not raw outcomes.) The events Davies considers are “*N*(0,1) [is] an adequate model”, which is the set {100, 101, 110, 111}, and “*N*(10^{-20},1) [is] an adequate model”, which is the set {010, 011, 110, 111}; it is trivial to see that both these events are 50-50.
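The bookkeeping can be sketched in a few lines (the model labels are just strings for illustration):

```python
from itertools import product

# Outcomes are adequacy-indicator bit-strings in the order:
# leftmost bit = N(0,1), center bit = N(1e-20,1), rightmost bit = N(10,1).
outcomes = ["".join(bits) for bits in product("01", repeat=3)]

# The measure from the text: all mass on outcomes 001 and 110.
pr = {w: 0.0 for w in outcomes}
pr["001"] = pr["110"] = 0.5

def prob_adequate(i):
    # Pr(model i is adequate): total mass of outcomes with bit i set.
    return sum(p for w, p in pr.items() if w[i] == "1")

# Both N(0,1) and N(1e-20,1) come out 50-50, as required -- and the
# measure sums to one over the powerset-style sample space.
print([prob_adequate(i) for i in range(3)])  # → [0.5, 0.5, 0.5]
```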

Now obviously when *M* is uncountably infinite it’s not so easy to write down probability measures on sigma-algebras of the powerset of *M*. Still, that scenario is not particularly difficult for a Bayesian to handle: if the statistical adequacy function is measurable, a prior or posterior predictive probability measure automatically induces a pushforward probability measure on any sigma-algebra of the powerset of *M*. In fact, this is precisely the approach taken in the (rather small) Bayesian literature on assessing statistical adequacy; see for example *A nonparametric assessment of model adequacy based on Kullback-Leibler divergence*. These sorts of papers typically treat statistical adequacy as a continuous quantity, but all it would take to turn it into a Davies-style yes-no Boolean variable would be to dichotomize the continuous quantity at some threshold.

(A digression. To me, using a Bayesian nonparametric posterior distribution to assess the adequacy of a parametric model seems a bit pointless — if you have the posterior already, of what possible use is the parametric model? Actually, there *is* one use that I can think of, but I was saving it to write a paper about… Oh what the heck. I’m told (by Andrew Gelman, who should know!) that in social science it’s notorious that every variable is correlated with every other variable, at least a little bit. I imagine that this makes Pearl-style causal inference a big pain — all of the causal graphs would end up totally connected, or close to. I think there may be a role for Bayesian causal graph adequacy assessment; the causal model adequacy function would quantify the loss incurred by ignoring some edges in the highly-connected causal graph. I think this approach could facilitate communication between causal inference experts, subject matter experts, and policymakers.)

*This post’s title was originally more tendentious and insulting. As Professor Davies has graciously suggested that his future work might include a reference to this post, I think it only polite that I change the title to something less argumentative.*

“Error statistics refers to a standpoint regarding both (1) a general philosophy of science and the roles probability plays in inductive inference, and (2) a cluster of statistical tools, their interpretation, and their justiﬁcation.”

In Mayo’s writings I see two interrelated notions of severity corresponding to the two items listed in the quote: (1) an informal severity notion that Mayo uses when discussing philosophy of science and specific scientific investigations, and (2) Mayo’s formalization of severity at the data analysis level.

One of my besetting flaws is a tendency to take a narrow conceptual focus to the detriment of the wider context. In the case of Severity, part one, I think I ended up making claims about severity that were wrong. I was narrowly focused on severity in sense (2) — in fact, on one specific equation within (2) — but used a mish-mash of ideas and terminology drawn from all of my readings of Mayo’s work. When read through a philosophy-of-science lens, the result is a distorted and misstated version of severity in sense (1).

As a philosopher of science, I’m a rank amateur; I’m not equipped to add anything to the conversation about severity as a philosophy of science. My topic is statistics, not philosophy, and so I want to warn readers against interpreting Severity, part one as a description of Mayo’s philosophy of science; it’s more of a wordy introduction to the formal definition of severity in sense (2).

One of the author’s results (if I could nominate one as the most important, I’d choose this one) says that if you replace your model by another one which is in an arbitrarily close neighborhood (according to the [Prokhorov] metric discussed above), the posterior expectation could be as far away as you want. Which, if you choose the right metric, means that you replace your sampling model by another one out of which typical samples *look the same*, and which therefore can be seen as as appropriate for the situation as the original one.

Note that the result is primarily about a change in the sampling model, not the prior, although it is a bit more complex than that because if you change the sampling model, you need to adapt the prior, too, which is appropriately taken into account by the authors as far as I can see.

My own reaction was rather less impressed; I tossed off,

Conclusion: Don’t define “closeness” using the TV [that is, total variation] metric or matching a finite number of moments. Use KL divergence instead.

In response to a request by Mayo, OSS wrote up a “plain jane” explanation which was posted on Error Statistics Philosophy a couple of weeks later. It confirmed Christian’s summary:

So the brittleness theorems state that for any Bayesian model there is a second one, nearly indistinguishable from the first, achieving any desired posterior value within the deterministic range of the quantity of interest.

That sounds pretty terrible!

—

This issue came up for discussion in the comments of an Error Statistics Philosophy post in late December.

MAYO: Larry: Do you know anything about current reactions to, status of, the results by Houman Owhadi, Clint Scovel and Tim Sullivan? Are they deemed relevant for practice? (I heard some people downplay the results as not of practical concern.)

LARRY: I am not aware of significant rebuttals.

COREY: I have a vague idea that any rebuttal will basically assert that the distance these authors use is too non-discriminating in some sense, so Bayes fails to distinguish “nice” distributions from nearby (according to the distance) “nasty” ones. My intuition is that these results won’t hold for relative entropy, but I don’t have the knowledge and training to develop this idea — you’d need someone like John Baez for that.

OWHADI (the O in OSS): Well, one should define what one means by “nice” and “nasty” (and preferably without invoking circular arguments).

Also, it would seem to me that the statement that TV and Prokhorov cannot be used (or are not relevant) in “classical” Bayes is a powerful result in itself. Indeed TV has not only been used in many parts of statistics but it has also been called the testing metric by Le Cam for a good reason: i.e. (writing n for the number of samples), Le Cam’s lemma states that

1) For any n, if TV is close enough (as a function of n) all tests are bad.

2) Given any TV distance, with enough sample data there exists a good test.

Now concerning using closeness in Kullback–Leibler (KL) divergence rather than Prokhorov or TV, observe that (as noted in our original post) closeness in KL divergence is not something you can test with discrete data, but you can test closeness in TV or Prokhorov. In other words the statement “if the true distribution and my model are close in KL then classical Bayes behaves nicely” can be understood as “if I am given this infinite amount of information then my Bayesian estimation is good” which is precisely one issue/concern raised by our paper (brittleness under “finite” information).

Note also that, the assumption of closeness in KL divergence requires the non-singularity of the data generating distribution with respect to the Bayesian model (which could be a very strong assumption if you are trying to certify the safety of a critical system and results like the Feldman–Hajek Theorem tell us that “most” pairs of measures are mutually singular in the now popular context of stochastic PDEs).
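Owhadi’s point about the strength of KL closeness can be seen in a toy discrete example of my own devising (nothing like it appears in the OSS papers): a model that assigns probability zero to an outcome the true distribution can actually produce is infinitely far away in KL divergence, no matter how close it sits in TV.

```python
import math

def tv(p, q):
    # total variation distance between two discrete distributions
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def kl(p, q):
    # KL divergence D(p || q); infinite if p puts mass where q has none
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue
        if qi == 0:
            return math.inf
        total += pi * math.log(pi / qi)
    return total

p = [0.500, 0.499, 0.001]  # "true" distribution
q = [0.500, 0.500, 0.000]  # model that rules out the third outcome

print(tv(p, q))  # ≈ 0.001: the two are TV-close
print(kl(p, q))  # inf: but infinitely far apart in KL
```

The asymmetry is the whole point: no finite amount of discrete data distinguishes “KL-close” from “KL-infinitely-far” here, whereas the TV gap is directly testable.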

In preparing a reply to Owhadi, I discovered a comment written by Dave Higdon on Xian’s Og a few days after OSS’s “plain jane” summary went up on Error Statistics Philosophy. He described the situation in concrete terms; this clarified for me just what it is that OSS’s brittleness theorems demonstrate. (Christian Hennig saw the issue too, but I couldn’t follow what he was saying without the example Higdon gave. And OSS are perfectly aware of it too — this post represents me catching up with the more knowledgeable folks.)

Suppose we judge a system safe provided that the probability that the random variable *X* exceeds 10 is very low. We assume that *X* has a Gaussian distribution with known variance 1 and unknown mean *μ*, the prior for which is

This prior doesn’t encode a strong opinion about the prior predictive probability of the event *X* > 10 (i.e., disaster).

Next, we learn about the safety of the system by observing a realization of *X,* and it turns out that the datum *x* is smaller than 7 and the posterior predictive probability of disaster is negligible. Good news, right?
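Here is a sketch of that posterior predictive calculation. The specific prior and datum are illustrative stand-ins of my own choosing (a diffuse conjugate prior *μ* ~ N(0, 10²) and *x* = 5), not part of the original setup.

```python
from math import erfc, sqrt

# Toy version of the safety calculation.  Assumed, not from the post:
# diffuse conjugate prior mu ~ N(0, 10^2) and the observed datum x = 5.
prior_var = 100.0   # prior variance of mu
obs_var = 1.0       # known sampling variance of X
x = 5.0             # observed datum (comfortably below 7)

# Conjugate normal update for mu given one observation x
post_prec = 1.0 / prior_var + 1.0 / obs_var
post_var = 1.0 / post_prec
post_mean = (x / obs_var) * post_var

# Posterior predictive for a future X: N(post_mean, obs_var + post_var)
pred_sd = sqrt(obs_var + post_var)

def normal_tail(z):
    # Pr(Z > z) for a standard normal Z
    return 0.5 * erfc(z / sqrt(2.0))

p_disaster = normal_tail((10.0 - post_mean) / pred_sd)
print(p_disaster)  # about 2e-4: negligible
```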

OSS say, not so fast! They ask: suppose that our model is misspecified, and the true model is “nearby” in Prokhorov or TV metric. They show that for any datum that we can observe, the set of all nearby models includes a model that predicts disaster.

What kinds of model misspecifications do the Prokhorov and TV metrics capture? Suppose that the data space has been discretized to precision 2*ϵ*, and consider the set of models in which, for each possible observable datum *x*_{0}, the probability density is

in which χ(.) is the indicator function. For any specific value of *μ*, all of the models in the above set are within a small ball centered on the Gaussian model, where “small” is measured by either the Prokhorov or TV metric. (*How* small depends on *ϵ*.) Each model embodies an implication of the form:

not-disaster ⟹ *x* ∉ (*x*_{0} − *ϵ*, *x*_{0} + *ϵ*);

by taking the contrapositive, we see that this is equivalent to:

*x* ∈ (*x*_{0} − *ϵ*, *x*_{0} + *ϵ*) ⟹ disaster.

Each of these “nearby” models basically modifies the Gaussian model to enable *one specific possible datum* to be a certain indicator of disaster. Thus, no matter which datum we actually end up observing, there is a “nearby” model for which both (i) typical samples are basically indistinguishable from typical samples under our assumed Gaussian model, and yet (ii) the realized datum has caused that “nearby” model, like Chicken Little, to squawk that the sky is falling.
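To get a feel for how “small” these perturbations are, here is a back-of-envelope check of my own (not OSS’s general construction): if a perturbed model relocates the probability mass in the window (*x*_{0} − *ϵ*, *x*_{0} + *ϵ*) out to the disaster region, the TV distance between it and the N(*μ*, 1) model is at most the mass relocated.

```python
from math import erf, sqrt

def Phi(z):
    # standard normal CDF
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# Mass that the "nearby" model moves out of a tiny window around x0.
# mu, x0 and eps are toy values of my own choosing.
mu, x0, eps = 0.0, 1.3, 1e-4
mass_moved = Phi(x0 + eps - mu) - Phi(x0 - eps - mu)
print(mass_moved)  # about 3e-5: far inside any reasonable TV ball
```

Since *ϵ* is under our control, the perturbed model can be made as TV-close to the Gaussian as we please while still squawking on cue.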

OSS have proved a very general version of the above phenomenon: under the (weak) conditions they assume, for any given data set that we can observe, the set of all models “near” the posterior distribution contains a model that, upon observation of the realized data, goes into (the statistical model equivalent of) spasms of pants-shitting terror.

There’s nothing special to Bayes here; in particular, all of the talk about the asymptotic testability of TV- and/or Prokhorov-distinct distributions is a red herring. The OSS procedure stymies learning about the system of interest because the model misspecification set is specifically constructed to allow any possible data set to be totally misleading under the assumed model. Seen in this light, OSS’s choice of article titles is rather tendentious, don’t you think? If tendentious titles are the order of the day, perhaps the first one could be called *As flies to wanton boys are we to th’ gods: Why no statistical model whatsoever is “good enough”* and the second one could be called *Prokhorov and total variation neighborhoods and paranoid psychotic breaks with reality*.


My wife says that a glance at the spine gives the impression that it reads “Badass”.

Error statistics aims to provide a philosophical foundation for the application of frequentist statistics in science. As in any frequency-based approach, error statistics adheres to what I consider to be the fundamental tenet of frequentist learning: *any particular data-based inference is deemed well-warranted only to the extent that it is the result of a procedure with good sampling characteristics*.

What kinds of procedures are we talking about, and what characteristics of those procedures ought we to care about? The error statistical approach distinguishes itself from other frequentist frameworks (e.g., frequentist statistical decision theory) by the answer it gives to that question. Particular attention is paid to *tests*, by which I mean procedures that take some data and a statistical hypothesis as inputs and issue a binary pass/fail result. (As we’ll see, the testing framework easily encompasses estimation by holding the data input fixed and varying the statistical hypothesis input.) The error statistical worth of a test is related, sensibly enough, to the (in)frequency with which the test produces an erroneous conclusion, and, critically, to what that error rate indicates about the capacity of the test to detect error *in the case at hand*. This notion is codified by the *Severity Principle*.

The most straightforward way to understand the Severity Principle (for me, anyway) is as an extension of a rule of inference of classical logic that has the delightfully baroque name *modus tollendo tollens*, or more simply, *modus tollens*.

To apply *modus tollens*, one starts with two premises: first, “if *P*, then *Q*“; and second, “not-*Q*“. From these, *modus tollens* produces the conclusion “not-*P*“. *Modus tollens* is also known as *the law of contrapositive* because contraposition applied to the first premise yields “if not-*Q*, then not-*P*“.

For the purpose of exposition, I offer this slight reformulation of *modus tollens:* the two premises are “if not-*H*, then not-*P*” and “*P*“, and the conclusion is “*H*“. Here *H* represents an hypothesis and *P* represents a passing result from some test of *H*. The premise “if not-*H*, then not-*P*” expresses a property of the test, to wit, that it is incapable of producing an erroneous passing grade. The premise “*P*” expresses the assertion that for the observed data, the test procedure has produced a passing grade. In the language of error statistics, one says that *H* has passed a *maximally severe* test.
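The reformulated rule can be checked mechanically; a brute-force sweep of the truth table confirms that the conclusion holds in every row where both premises do.

```python
from itertools import product

def implies(a, b):
    # material implication: "if a, then b"
    return (not a) or b

# Check: whenever "if not-H then not-P" and "P" both hold, H holds.
valid = all(
    h
    for h, p in product([True, False], repeat=2)
    if implies(not h, not p) and p
)
print(valid)  # True
```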

The above reformulation introduced the notion of a test into the premises, but because the first premise posited a perfect test, the whole idea of a test seemed rather superfluous. But classical logic is silent in the face of imperfect tests; and since learning from imperfect tests is, after all, possible, we seek an extension of *modus tollens* to imperfect tests.

The first step is to consider the opposite of a maximally severe test. Such a *minimally severe test* would be one that gives a passing grade to *H* irrespective of the data. We can “frequencify” this notion: for a minimally severe test, we have

Fr(*P* | not-*H*) = 1,

or equivalently,

Fr(not-*P* | not-*H*) = 0.

(I use the symbol “Fr” to be clear that I’m writing about relative frequencies.) The analogous frequencification of a maximally severe test is

Fr(not-*P* | not-*H*) = 1.

The equation in the above line captures only the first premise of *modus tollens*, so it’s a necessary but not sufficient component of the notion of a maximally severe test. To see this, notice that a procedure that always gives a failing grade to *H* irrespective of the data does satisfy the above equation; however, *modus tollens* will never apply. The inequality Fr(*P* | *H*) > 0 encodes the fact that part of what we *mean* by the word “test” is that it is at least possible to observe a passing grade, or to put it another way, that there must be some possible data that accords with *H*.

The key point is that the necessary conditions for a minimally and a maximally severe test are, respectively, Fr(not-*P* | not-*H*) = 0 and Fr(not-*P* | not-*H*) = 1. This suggests that (provided we’ve checked that Fr(*P* | *H*) > 0) we can measure the severity of a test by the value of Fr(not-*P* | not-*H*).
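To make the frequencification concrete, here is a toy simulation of my own (not Mayo’s): both *H* and not-*H* are taken to be simple hypotheses so that the relative frequencies are well defined.

```python
import random

random.seed(1)

# Toy setup: X ~ N(mu, 1); H: mu = 1 versus not-H: mu = 0.
# The test gives H a passing grade when x > 0.5.
N = 200_000
threshold = 0.5

# Fr(P | H): frequency of a passing grade when H is true
pass_given_H = sum(random.gauss(1.0, 1.0) > threshold for _ in range(N)) / N
# Fr(not-P | not-H): frequency of a failing grade when H is false
fail_given_notH = sum(random.gauss(0.0, 1.0) <= threshold for _ in range(N)) / N

print(round(pass_given_H, 2))     # about 0.69
print(round(fail_given_notH, 2))  # the severity measure, also about 0.69
```

Moving the threshold trades the two frequencies off against each other, which is exactly the familiar size/power trade-off in different clothing.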

At this point, I’m just going to quote straight from Mayo’s latest exposition of severity:

“**Severity Principle** []. Data *x*_{0} (produced by process *G*) provides good evidence for hypothesis *H* (just) to the extent that test *T* severely passes *H* with *x*_{0}.

…

A hypothesis *H* passes a severe test *T* with data *x*_{0} if,

(S-1) *x*_{0} accords with *H*, (for a suitable notion of accordance) and

(S-2) with very high [frequency], test *T* would have produced a result

that accords less well with *H* than *x*_{0} does, if *H* were false or incorrect.”

My previous discussion was based on the idea of pass/fail testing, whereas condition (S-2) is phrased in terms of “accord[ing] less well”, the key word being “less”. Mayo’s definition does not merely introduce the notion of accordance with *H* — it demands a totally ordered set to express it.

To connect my exposition here with Mayo’s definition, we can now recognize that, just as school exams produce both a numerical grade and a pass/fail categorization, a statistical data-based pass/fail test will (almost always) be a dichotomization of a test statistic that makes finer distinctions, and the choice of dichotomization threshold is essentially arbitrary. Mayo’s definition of a severe test resolves this arbitrariness by placing the threshold for a passing grade right at the observed value of the test statistic. (The threshold is notionally infinitesimally below the observed value of the test statistic, so that *H* just barely passes. Alas, the reals contain no infinitesimals.)

To make the notion of the severity of a test *T* of *H* mathematically precise and operational, we’ll have to get more specific about what is meant by “accords less well”. That will be the topic of part two.

Recall Cox’s five postulates:

1. *Cox-plausibilities are real numbers*.
2. *Consistency with Boolean algebra*: if two claims are equal in Boolean algebra, then they have equal Cox-plausibility.
3. There exists a *conjunction function* *f* such that for any two claims *A*, *B*, and any prior information *X*,

    (*A* ∧ *B* | *X*) = *f*((*A* | *X*), (*B* | *A* ∧ *X*)).

4. There exists a *negation relation* (actually a function too). In slightly different notation than in the previous post, it is:

    (not-*A* | *X*) = *S*((*A* | *X*)).

5. The negation relation and conjunction function (and their domains) satisfy technical *regularity conditions*.

Choices other than the reals are possible: one might choose a domain with less structure, such as a lattice, or with larger dimensionality, such as the set of real intervals as in Dempster-Shafer theory. At present, it seems to me that choosing one of these other possibilities as the domain doesn’t buy us much in terms of usefulness for practical applications. That said, I could be swayed from this view by relevant evidence.

I doubt such evidence will be forthcoming. The reason is that the key property that distinguishes the reals from the other possible choices for the domain is that the reals are totally ordered. By making this choice, we ensure that the Cox-plausibilities of all the claims in whatever universe of discourse we’re considering will also be totally ordered. Conversely, if the domain is only a poset, then for at least one pair of claims, we won’t be able to pick the one with the larger plausibility (or state that they’re equally plausible).
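The comparability point can be made concrete with a trivial sketch: real-valued plausibilities always admit a comparison, while interval-valued ones (in the spirit of Dempster-Shafer) need not.

```python
# Real-valued plausibilities: any two values compare.
def compare_reals(a, b):
    return "a<b" if a < b else ("a>b" if a > b else "a=b")

# Interval-valued plausibilities (lo, hi): one dominates the other
# only when the intervals don't overlap in their interiors.
def compare_intervals(a, b):
    if a[1] <= b[0]:
        return "a<=b"
    if b[1] <= a[0]:
        return "b<=a"
    return "incomparable"

print(compare_reals(0.3, 0.7))                    # always an answer
print(compare_intervals((0.2, 0.6), (0.4, 0.9)))  # incomparable
```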

This is not intrinsically a strike against posets as accurate representations of some set of available information — I acknowledge that a lack of universal comparability may indeed be quite appropriate in some settings. Rather, my intuition is that these settings are precisely the ones for which the paucity of prior information ensures that very little is actually achievable in practice.

This postulate says that the Cox-plausibility for a given Boolean expression depends on its truth table rather than the particular symbols used in the expression. Personally I have no qualms accepting this postulate, and it would surprise me if anyone found it controversial. Perhaps something interesting could be generated without this postulate, but I feel sure that any such development will not have semantics appropriate to plausibility.

I won’t give a full argument for the necessity of this particular form, but I will offer an example of the kind of reasoning that leads to it.

Instead of the conjunction function I gave above, suppose we consider

(*A* ∧ *B* | *X*) = *f*_{?}((*A* | *X*), (*B* | *X*)).

Can this possibly serve as the functional relationship between the plausibility of a conjunction and the plausibility of the conjuncts?

It can’t. Consider the plausibility of the claim that some person (say, the seventh person you encounter tomorrow) has blue eyes. This is a conjunction of the claim that the person has a blue left eye (*L*_{B}) and the claim that the person has a blue right eye (*R*_{B}); write *R*_{G} for the claim that the person has a green right eye, and stipulate that blue and green right eyes are equally plausible a priori. Then, because eye colours almost always match,

(*L*_{B} ∧ *R*_{B} | *X*) ≈ (*L*_{B} | *X*),

(*R*_{B} | *X*) = (*R*_{G} | *X*),

and so for any sufficiently regular *f*_{?},

(*L*_{B} ∧ *R*_{G} | *X*) = *f*_{?}((*L*_{B} | *X*), (*R*_{G} | *X*)) = *f*_{?}((*L*_{B} | *X*), (*R*_{B} | *X*)) = (*L*_{B} ∧ *R*_{B} | *X*) ≈ (*L*_{B} | *X*).

But mismatched eye colours are rare:

(*L*_{B} ∧ *R*_{G} | *X*) ≈ 0,

so this functional form isn’t flexible enough to do the job we’d want it to do.
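Probability is one model of plausibility, so it furnishes a concrete check on any candidate conjunction rule that depends only on the two conjuncts’ separate plausibilities: here are two scenarios (toy numbers of my own) with identical marginals but different conjunctions, which no single function of the marginals can reproduce.

```python
# Scenario 1: A and B perfectly correlated (like matching eye colours).
pA1, pB1, pAB1 = 0.5, 0.5, 0.5
# Scenario 2: A and B independent.
pA2, pB2, pAB2 = 0.5, 0.5, 0.25

# A function of the marginals alone must return one value for the
# pair (0.5, 0.5) -- but the two conjunction plausibilities differ.
assert (pA1, pB1) == (pA2, pB2) and pAB1 != pAB2
print("no function of the marginals alone can work")
```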

In a similar way, test cases can be defined and applied to all of the possible functional forms, of which there are a small number. Only two survive: the one I gave in postulate #3 and the one obtained from it by interchanging *B *and* A*.

Curiously, the literature did not contain a full account of such a test for quite some time; the full procedure was carried out in *Whence the Laws of Probability?* by A. J. M. Garrett.

Actually, there’s no real reason to put negation and conjunction into separate postulates. The link immediately above is to a paper which uses NAND instead of conjunction and negation separately. It’s also possible to get to Cox’s theorem by considering conjunction and disjunction.

Cox’s original proof assumed that the conjunction function and negation relation were twice-differentiable; Jaynes offered a proof that assumed just differentiability. Mathematically speaking, these are quite strong assumptions; by using them, Cox and Jaynes left open the possibility of non-differentiable solutions.

Frankly, given my training as an engineer, I’m not too concerned that other, non-differentiable solutions might exist. I’d be happy to consider such solutions as they appear in the literature, but for me, the differentiable solution is enough to be going on with. But if you don’t share my cavalier attitude, be comforted, for you are not alone. Within the past 15 years or so, interest in the necessary assumptions for Cox’s theorem was revived by a paper of J. Y. Halpern purporting to give a counter-example to the theorem. The counter-example violated one of the regularity conditions Cox did in fact assume; Halpern acknowledged this and went on to argue that the necessary regularity conditions were nevertheless “not natural”.

This argument prompted a straight-up counterargument by K. S. Van Horn and also research into more natural regularity conditions. (The K. S. Van Horn paper is my go-to link when referring to Cox’s theorem in blog comments; if you like what you’ve read here, read that next.) More recently, Frank J. Tipler (of *Omega Point* infamy) and a co-author have written a paper that assumes a lot of mathematical machinery with which I am not familiar and claims to give a “trivial” proof of Cox’s theorem. And very recently, K. Knuth and J. Skilling have offered an approach they call “simple and clear” and which they claim unites and extends the approaches of Kolmogorov and Cox.

Assuming, then, that somewhere in the morass of links above there exists a set of satisfactory postulates, where does that leave us? Let me characterize Cox’s theorem in three ways, two negative and one positive.

First off, *Cox’s theorem is not a straitjacket*. Unlike other approaches to Bayesian foundations, we made no loaded claims of rational behavior, preference, or, arguably, belief. We can jump into and out of any or all joint (over parameters/hypotheses and data) prior probability distributions we care to set up, examining the consequences of each in turn.

Second, *Cox’s theorem is not a guarantee of soundness*. Just as in classical logic, nothing in it protects us from garbage-in-garbage-out. If we want to argue that the conclusions of a Bayesian analysis are well-warranted, we **must** justify the prior distributions we used in terms of the available prior information.

Finally, *Cox’s theorem is a guarantee of validity.* It justifies Bayes’ theorem as an objectively well-founded method for computing the plausibility of claims post-data given the plausibility of claims pre-data.
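To close, here is the simplest possible instance of the update that Cox’s theorem legitimizes: a discrete Bayes’ theorem calculation, with toy numbers of my own.

```python
# Posterior ∝ prior × likelihood, then normalize.
priors = {"H1": 0.5, "H2": 0.5}
likelihoods = {"H1": 0.8, "H2": 0.2}  # Pr(data | hypothesis)

unnorm = {h: priors[h] * likelihoods[h] for h in priors}
evidence = sum(unnorm.values())
posterior = {h: v / evidence for h, v in unnorm.items()}
print(posterior)  # {'H1': 0.8, 'H2': 0.2}
```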