# Bayesian statistics

About a year and a half ago, I read P. L. Davies’s interesting paper Approximating Data. One passage struck me as unusually wrong-headed (p. 195):

The Dutch book argument in turn relies on a concept of truth. Often framed in terms of bets on a horse-race, it relies on there only being one winner, which is the case for the overwhelming majority of horse races. The Dutch book argument shows that the odds, when converted to probabilities, must sum to 1 to avoid arbitrage possibilities… If we transfer this to statistics then we have different distributions indexed by a parameter. Based on the idea of truth, only one of these can be true, just as only one horse can win, and the same Dutch book argument shows that the odds must add to 1. In other words the prior must be a probability distribution. We note that in reality none of the offered distributions will be the truth, but due to the non-callability of Bayesian bets this is not considered to be a problem. Suppose we replace the question as whether a distribution represents the truth by the question as to whether it is a good approximation. Suppose that we bet, for example, that the N(0, 1) distribution is an adequate approximation for the data. We quote odds for this bet, the computer programme is run, and we either win or lose. If we quote odds of 5:1 then we will probably quote the same, or very similar, odds for the N(10^-6, 1) distribution, as for the N(0, 1+10^-10) distribution and so forth. It becomes clear that these odds are not representable by a probability distribution: only one distribution can be the ‘true’ but many can be adequate approximations.

I always meant to write something about how this line of argument goes wrong, but it wasn’t a high priority. But recently Davies reiterated this argument in a comment on Professor Mayo’s blog:

You define adequacy in a precise manner, a computer programme., there [sic] are many examples in my book. The inputs are the data and the model, the output yes or no. You place your bets beforehand, run the programme and win or lose your bet. The bets are realizable. If you bet 50-50 on the N(0,1) being an adequate model, you will no doubt bet about 50-50 on the N(10-20,1) also being an adequate model. Your bets are not expressible by a probability measure. The sum of the odds will generally be zero or infinity. …

I tried to reply in the comment thread, but WordPress ate my attempts, so: a blog post!

I have to wonder if Professor Davies asked even one Bayesian to evaluate this argument before he published it. (In comments, Davies replies: I have been stating the argument for about 20 years now. Many Bayesians have heard my talks but so the only response I have had was by one in Lancaster who told me he had never heard the argument before and that was it.) Let M be the set of statistical models under consideration. It’s true that if I bet 50-50 on N(0,1) being an adequate model, I will no doubt bet very close to 50-50 on N(10^-20, 1) also being an adequate model. Does this mean that “these odds are not representable by a probability distribution”? Not at all — we just need to get the sample space right. In this setup the appropriate sample space for a probability triple is the powerset of M, because exactly one of the members of the powerset of M will be realized when the data become known.

For example, suppose that M = {N(0,1), N(10^-20, 1), N(10,1)}; then there are eight conceivable outcomes — one for each possible combination of adequacy indications — that could occur once the data become known. We can encode this sample space using the binary expansions of the integers 0 through 7, with each bit interpreted as an indicator variable for the statistical adequacy of one of the models in M. Let the leftmost bit refer to N(0,1), the center bit refer to N(10^-20, 1), and the rightmost bit refer to N(10,1). Here’s a probability measure that serves as a counterexample to the claim that “[the 50-50] bets are not expressible by a probability measure”:

Pr(001) = Pr(110) = 0.5,

Pr(000) = Pr(100) = Pr(101) = Pr(011) = Pr(010) = Pr(111) = 0.

(This is an abuse of notation, since the Pr() function takes events, that is, sets of outcomes, and not raw outcomes.) The events Davies considers are “N(0,1) [is] an adequate model”, which is the set {100, 101, 110, 111}, and “N(10-20,1) [is] an adequate model”, which is the set {010, 011, 110, 111}; it is trivial to see that both these events are 50-50.
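To make this fully concrete, here is a short sketch in Python that enumerates the eight outcomes under the bit encoding above and checks that both adequacy events come out 50-50 under the stated measure:

```python
from itertools import product

# Outcomes: bit strings of length 3. Leftmost bit = "N(0,1) is adequate",
# center bit = "N(10^-20,1) is adequate", rightmost bit = "N(10,1) is adequate".
outcomes = [''.join(bits) for bits in product('01', repeat=3)]

# The counterexample measure: all mass on outcomes 110 and 001.
pr = {o: 0.0 for o in outcomes}
pr['110'] = 0.5
pr['001'] = 0.5

def prob(event):
    """Probability of an event, i.e. a set of outcomes."""
    return sum(pr[o] for o in event)

# "N(0,1) is an adequate model" = {100, 101, 110, 111}
event_a = {o for o in outcomes if o[0] == '1'}
# "N(10^-20,1) is an adequate model" = {010, 011, 110, 111}
event_b = {o for o in outcomes if o[1] == '1'}

print(prob(event_a), prob(event_b))  # 0.5 0.5
```

Both events get probability 0.5, the measure sums to 1, and no Dutch book can be made.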

Now obviously when M is uncountably infinite it’s not so easy to write down probability measures on sigma-algebras of the powerset of M. Still, that scenario is not particularly difficult for a Bayesian to handle: if the statistical adequacy function is measurable, a prior or posterior predictive probability measure automatically induces a pushforward probability measure on any sigma-algebra of the powerset of M. In fact, this is precisely the approach taken in the (rather small) Bayesian literature on assessing statistical adequacy; see for example A nonparametric assessment of model adequacy based on Kullback-Leibler divergence. These sorts of papers typically treat statistical adequacy as a continuous quantity, but all it would take to turn it into a Davies-style yes-no Boolean variable would be to dichotomize the continuous quantity at some threshold.
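Here is a minimal Monte Carlo sketch of that pushforward idea. The posterior over the mean and the KL-threshold definition of adequacy are both invented for illustration (they are not the construction used in the cited paper); the point is only that a single posterior automatically induces coherent probabilities for all the adequacy events at once:

```python
import random

random.seed(0)

# Invented-for-illustration posterior over the Gaussian mean: mu ~ N(0.1, 0.2^2).
draws = [random.gauss(0.1, 0.2) for _ in range(100_000)]

def kl_gauss(mu, m0):
    # KL divergence between N(mu,1) and N(m0,1) is (mu - m0)^2 / 2.
    return 0.5 * (mu - m0) ** 2

def adequate(mu, m0, threshold=0.125):
    # Dichotomize the continuous adequacy measure at an (arbitrary) threshold.
    return kl_gauss(mu, m0) < threshold

# Pushforward probabilities of two adequacy events:
p_n0 = sum(adequate(mu, 0.0) for mu in draws) / len(draws)
p_tiny = sum(adequate(mu, 1e-20) for mu in draws) / len(draws)
# The two probabilities are essentially identical -- exactly the behavior
# Davies thought no probability measure could express.
```

Nearly identical models get nearly identical adequacy probabilities, and nothing about that conflicts with the probability axioms.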

(A digression. To me, using a Bayesian nonparametric posterior distribution to assess the adequacy of a parametric model seems a bit pointless — if you have the posterior already, of what possible use is the parametric model? Actually, there is one use that I can think of, but I was saving it to write a paper about… Oh what the heck. I’m told (by Andrew Gelman, who should know!) that in social science it’s notorious that every variable is correlated with every other variable, at least a little bit. I imagine that this makes Pearl-style causal inference a big pain — all of the causal graphs would end up totally connected, or close to. I think there may be a role for Bayesian causal graph adequacy assessment; the causal model adequacy function would quantify the loss incurred by ignoring some edges in the highly-connected causal graph. I think this approach could facilitate communication between causal inference experts, subject matter experts, and policymakers.)

This post’s title was originally more tendentious and insulting. As Professor Davies has graciously suggested that his future work might include a reference to this post, I think it only polite that I change the title to something less argumentative.

In the spring of last year, a paper with the title Bayesian Brittleness: Why no Bayesian model is “good enough” was put on the arXiv. The authors (Houman Owhadi, Clint Scovel, and Tim Sullivan, henceforth OSS) later posted a followup entitled When Bayesian inference shatters. When published, this work was commented on in a number of stats blogs I follow, including Xi’an’s Og and Error Statistics Philosophy. Christian Hennig wrote up this nice nickel summary:

One of the author’s results (if I could nominate one as the most important, I’d choose this one) says that if you replace your model by another one which is in an arbitrarily close neighborhood (according to the [Prokhorov] metric discussed above), the posterior expectation could be as far away as you want. Which, if you choose the right metric, means that you replace your sampling model by another one out of which typical samples *look the same*, and which therefore can be seen as appropriate for the situation as the original one.

Note that the result is primarily about a change in the sampling model, not the prior, although it is a bit more complex than that because if you change the sampling model, you need to adapt the prior, too, which is appropriately taken into account by the authors as far as I can see.

My own reaction was rather less impressed; I tossed off,

Conclusion: Don’t define “closeness” using the TV [that is, total variation] metric or matching a finite number of moments. Use KL divergence instead.

In response to a request by Mayo, OSS wrote up a “plain jane” explanation which was posted on Error Statistics Philosophy a couple of weeks later. It confirmed Christian’s summary:

So the brittleness theorems state that for any Bayesian model there is a second one, nearly indistinguishable from the first, achieving any desired posterior value within the deterministic range of the quantity of interest.

That sounds pretty terrible!

This issue came up for discussion in the comments of an Error Statistics Philosophy post in late December.

MAYO: Larry: Do you know anything about current reactions to, status of, the results by Houman Owhadi, Clint Scovel and Tim Sullivan? Are they deemed relevant for practice? (I heard some people downplay the results as not of practical concern.)

LARRY: I am not aware of significant rebuttals.

COREY: I have a vague idea that any rebuttal will basically assert that the distance these authors use is too non-discriminating in some sense, so Bayes fails to distinguish “nice” distributions from nearby (according to the distance) “nasty” ones. My intuition is that these results won’t hold for relative entropy, but I don’t have the knowledge and training to develop this idea — you’d need someone like John Baez for that.

OWHADI (the O in OSS): Well, one should define what one means by “nice” and “nasty” (and preferably without invoking circular arguments).

Also, it would seem to me that the statement that TV and Prokhorov cannot be used (or are not relevant) in “classical” Bayes is a powerful result in itself. Indeed TV has not only been used in many parts of statistics but it has also been called the testing metric by Le Cam for a good reason: i.e. (writing n the number of samples), Le Cam’s Lemma state that
1) For any n, if TV is close enough (as a function of n) all tests are bad.
2) Given any TV distance, with enough sample data there exists a good test.

Now concerning using closeness in Kullback–Leibler (KL) divergence rather than Prokhorov or TV, observe that (as noted in our original post) closeness in KL divergence is not something you can test with discrete data, but you can test closeness in TV or Prokhorov. In other words the statement “if the true distribution and my model are close in KL then classical Bayes behaves nicely” can be understood as “if I am given this infinite amount of information then my Bayesian estimation is good” which is precisely one issue/concern raised by our paper (brittleness under “finite” information).

Note also that, the assumption of closeness in KL divergence requires the non-singularity of the data generating distribution with respect to the Bayesian model (which could be a very strong assumption if you are trying to certify the safety of a critical system and results like the Feldman–Hajek Theorem tell us that “most” pairs of measures are mutually singular in the now popular context of stochastic PDEs).

In preparing a reply to Owhadi, I discovered a comment written by Dave Higdon on Xian’s Og a few days after OSS’s “plain jane” summary went up on Error Statistics Philosophy. He described the situation in concrete terms; this clarified for me just what it is that OSS’s brittleness theorems demonstrate. (Christian Hennig saw the issue too, but I couldn’t follow what he was saying without the example Higdon gave. And OSS are perfectly aware of it too — this post represents me catching up with the more knowledgeable folks.)

Suppose we judge a system safe provided that the probability that the random variable X exceeds 10 is very low. We assume that X has a Gaussian distribution with known variance 1 and unknown mean μ, the prior for which is

$\mu\sim\mathcal{N}\left(0,10000\right).$

This prior doesn’t encode a strong opinion about the prior predictive probability of the event X > 10 (i.e., disaster).

Next, we learn about the safety of the system by observing a realization of X, and it turns out that the datum x is smaller than 7 and the posterior predictive probability of disaster is negligible. Good news, right?
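(As a quick check of that claim — with the observation value assumed for illustration — the conjugate-normal update gives the posterior predictive disaster probability in closed form:

```python
import math

PRIOR_VAR = 10000.0  # mu ~ N(0, 10000), as in the example

def disaster_probability(x, threshold=10.0):
    """Posterior predictive P(X_new > threshold) after observing one datum x."""
    # Conjugate update for a N(mu, 1) likelihood with a N(0, PRIOR_VAR) prior:
    post_var = 1.0 / (1.0 / PRIOR_VAR + 1.0)
    post_mean = post_var * x
    # Posterior predictive distribution: N(post_mean, 1 + post_var).
    z = (threshold - post_mean) / math.sqrt(1.0 + post_var)
    return 0.5 * math.erfc(z / math.sqrt(2.0))  # upper-tail Gaussian probability

p = disaster_probability(5.0)  # a datum comfortably below 7
```

With x = 5 this gives roughly 2 × 10^-4: negligible indeed.)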

OSS say, not so fast! They ask: suppose that our model is misspecified, and the true model is “nearby” in Prokhorov or TV metric. They show that for any datum that we can observe, the set of all nearby models includes a model that predicts disaster.

What kinds of model misspecifications do the Prokhorov and TV metrics capture? Suppose that the data space has been discretized to precision 2ϵ, and consider the set of models in which, for each possible observable datum x0, the probability density is

$\pi\left(x;\mu\right)\propto\begin{cases}\exp\left\{ -\frac{1}{2}\left(x-\mu\right)^{2}\right\} \times\chi\left(x\notin\left[x_{0}-\epsilon,x_{0}+\epsilon\right]\right), & \mu\le20,\\\exp\left\{ -\frac{1}{2}\left(x-\mu\right)^{2}\right\} , & \mu>20,\end{cases}$

in which χ(.) is the indicator function. For any specific value of μ, all of the models in the above set are within a small ball centered on the Gaussian model, where “small” is measured by either the Prokhorov or  TV metric. (How small depends on ϵ.)  Each model embodies an implication of the form:

$\mu\le20\Rightarrow x\notin\left[x_{0}-\epsilon,x_{0}+\epsilon\right];$

by taking the contrapositive, we see that this is equivalent to:

$x\in\left[x_{0}-\epsilon,x_{0}+\epsilon\right]\Rightarrow\mu>20.$

Each of these “nearby” models basically modifies the Gaussian model to enable one specific possible datum to be a certain indicator of disaster. Thus, no matter which datum we actually end up observing, there is a “nearby” model for which both (i) typical samples are basically indistinguishable from typical samples under our assumed Gaussian model, and yet (ii) the realized datum has caused that “nearby” model, like Chicken Little, to squawk that the sky is falling.
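How small is the ball these misspecified models sit in? For the truncated-Gaussian construction above, the TV distance to the original N(μ, 1) model equals the probability mass of the excised interval, so it shrinks with ϵ. A short sketch (parameter values assumed for illustration):

```python
import math

def tv_to_truncated(mu, x0, eps):
    """TV distance between N(mu,1) and N(mu,1) with [x0-eps, x0+eps] excised.

    If the excised interval carries probability mass m, the integrated
    absolute difference between the two densities is m inside the interval
    (where the truncated density is zero) and m outside it (from the
    renormalization), so the TV distance -- half the total -- is exactly m.
    """
    a = (x0 - eps - mu) / math.sqrt(2.0)
    b = (x0 + eps - mu) / math.sqrt(2.0)
    return 0.5 * (math.erf(b) - math.erf(a))  # mass of the excised interval

# With eps = 1e-3, the "nearby" disaster-predicting model sits within
# TV distance of about 8e-4 of the N(0,1) model.
d = tv_to_truncated(mu=0.0, x0=0.0, eps=1e-3)
```

So for any ϵ you can make the Chicken Little model as TV-close to the honest Gaussian as you like while preserving its doomsaying behavior.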

OSS have proved a very general version of the above phenomenon: under the (weak) conditions they assume, for any given data set that we can observe, the set of all models “near” the posterior distribution contains a model that, upon observation of the realized data, goes into (the statistical model equivalent of) spasms of pants-shitting terror.

There’s nothing special to Bayes here; in particular, all of the talk about the asymptotic testability of TV- and/or Prokhorov-distinct distributions is a red herring. The OSS procedure stymies learning about the system of interest because the model misspecification set is specifically constructed to allow any possible data set to be totally misleading under the assumed model. Seen in this light, OSS’s choice of article titles is rather tendentious, don’t you think? If tendentious titles are the order of the day, perhaps the first one could be called As flies to wanton boys are we to th’ gods: Why no statistical model whatsoever is “good enough”  and the second one could be called Prokhorov and total variation neighborhoods and paranoid psychotic breaks with reality.

## Why these postulates?

Recall Cox’s five postulates:

1. Cox-plausibilities are real numbers.
2. Consistency with Boolean algebra: if two claims are equal in Boolean algebra, then they have equal Cox-plausibility.
3. There exists a conjunction function f such that for any two claims A, B, and any prior information X,
$A\wedge B|X=f\left(A|X,\, B|A\wedge X\right)$.
4. There exists a negation relation (actually a function too). In slightly different notation than in the previous post, it is:
$\neg A|X=h\left(A|X\right)$.
5. The negation relation and conjunction function (and their domains) satisfy technical regularity conditions.
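For reference, the upshot of the theorem: under the regularity conditions, there is a monotone rescaling of the Cox-plausibilities to a function P taking values in [0, 1] that satisfies

$P\left(A\wedge B|X\right)=P\left(A|X\right)P\left(B|A\wedge X\right),\qquad P\left(\neg A|X\right)=1-P\left(A|X\right),$

that is, the product and sum rules of probability theory.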

(I’m frustrated with the length of this post and how much time it’s taking me to finish, so I’m splitting it into two parts.)

I subscribe to a school of thought some call “Jaynesian” after Edwin T. Jaynes. Its foundation is a theorem of Richard T. Cox, a physicist who studied electric eels, not to be confused with the eminent statistician Sir David R. Cox. Since my first project will be to engage with Professor Mayo’s  diametrically opposed views on the proper way to use (and think about the use of) statistics in science, it seems worthwhile to describe the theorem and the reasons I take it to be foundational to statistics — of the Bayesian variety, at least.

1. Cox-Jaynes foundations. In which I establish my Bayesian bona fides.
2. Mayo’s error statistics and the Severity Principle. In which I give my current understanding of error statistics. Also, first howler!
3. Howler, howler, howler. In which I show how the severity concept defeats many “howlers”:  common criticisms of frequentist approaches propagated in Bayesian articles and textbooks. Probably more than one post.
4. Two severities. In which I discuss points of contact between the severity approach and the Bayesian approach, and specify a simple model in which the two approaches, operating on the exact same information, must disagree.
5. Increasing the magnification. In which I analyze the model of the previous post, subjecting it to the most extreme conditions so as to magnify the differences between the severity approach and  the Bayesian approach. As of the time of the writing of this blogging agenda, I have not done so.  I do not know which approach, if either, will fail under high magnification. This is to be a  true test of my Bayesianism.

Not every post will contribute to the completion of the above agenda; I also plan to write posts sharing my thoughts on any interesting statistical topic I happen to run into. Posts appear below in reverse chronological order; scroll to the bottom for the first one.
