Monthly Archives: March 2015

About a year and a half ago, I read P. L. Davies’s interesting paper Approximating Data. There was one passage I read that struck me as unusually wrong-headed (pg 195):

The Dutch book argument in turn relies on a concept of truth. Often framed in terms of bets on a horse-race, it relies on there only being one winner, which is the case for the overwhelming majority of horse races. The Dutch book argument shows that the odds, when converted to probabilities, must sum to 1 to avoid arbitrage possibilities… If we transfer this to statistics then we have different distributions indexed by a parameter. Based on the idea of truth, only one of these can be true, just as only one horse can win, and the same Dutch book argument shows that the odds must add to 1. In other words the prior must be a probability distribution. We note that in reality none of the offered distributions will be the truth, but due to the non-callability of Bayesian bets this is not considered to be a problem. Suppose we replace the question as whether a distribution represents the truth by the question as to whether it is a good approximation. Suppose that we bet, for example, that the N(0, 1) distribution is an adequate approximation for the data. We quote odds for this bet, the computer programme is run, and we either win or lose. If we quote odds of 5:1 then we will probably quote the same, or very similar, odds for the N(10−6, 1) distribution, as for the N(0, 1+10−10) distribution and so forth. It becomes clear that these odds are not representable by a probability distribution: only one distribution can be the ‘true’ but many can be adequate approximations.

I always meant to write something about how this line of argument goes wrong, but it wasn’t a high priority. But recently Davies reiterated this argument in a comment on Professor Mayo’s blog:

You define adequacy in a precise manner, a computer programme., there [sic] are many examples in my book. The inputs are the data and the model, the output yes or no. You place your bets beforehand, run the programme and win or lose your bet. The bets are realizable. If you bet 50-50 on the N(0,1) being an adequate model, you will no doubt bet about 50-50 on the N(10-20,1) also being an adequate model. Your bets are not expressible by a probability measure. The sum of the odds will generally be zero or infinity. …

I tried to reply in the comment thread, but WordPress ate my attempts, so: a blog post!

I have to wonder if Professor Davies asked even one Bayesian to evaluate this argument before he published it. (In comments, Davies replies: I have been stating the argument for about 20 years now. Many Bayesians have heard my talks but so the only response I have had was by one in Lancaster who told me he had never heard the argument before and that was it.) Let M be the set of statistical models under consideration. It’s true that if I bet 50-50 on N(0,1) being an adequate model, I will no doubt bet very close to 50-50 on N(10-20, 1) also being an adequate model. Does this mean that “these odds are not representable by a probability distribution”? Not at all — we just need to get the sample space right. In this setup the appropriate sample space for a probability triple is the powerset of M, because exactly one of the members of the powerset of M will be realized when the data become known.

For example, suppose that M = {N(0,1), N(10-20, 1), N(10,1)}; then there are eight conceivable outcomes — one for each possible combination of adequacy indications — that could occur once the data become known. We can encode this sample space using the binary expansion of the numbers from 0 to 7, with each digit of the binary expansion of the integer interpreted as an indicator variable for the statistical adequacy of one of the models in M. Let the leftmost bit refer to N(0,1), the center bit refer to N(10^-20, 1), and the rightmost bit refer to N(10,1). Here’s a probability measure that serves as a counterexample to the claim that “[the 50-50] bets are not expressible by a probability measure”:

Pr(001) = Pr(110) = 0.5,

Pr(000) = Pr(100) = Pr(101) = Pr(011) = Pr(010) = Pr(111) = 0.

(This is an abuse of notation, since the Pr() function takes events, that is, sets of outcomes, and not raw outcomes.) The events Davies considers are “N(0,1) [is] an adequate model”, which is the set {100, 101, 110, 111}, and “N(10-20,1) [is] an adequate model”, which is the set {010, 011, 110, 111}; it is trivial to see that both these events are 50-50.

Now obviously when M is uncountably infinite it’s not so easy to write down probability measures on sigma-algebras of the powerset of M. Still, that scenario is not particularly difficult for a Bayesian to handle: if the statistical adequacy function is measurable, a prior or posterior predictive probability measure automatically induces a pushforward probability measure on any sigma-algebra of the powerset of M. In fact, this is precisely the approach taken in the (rather small) Bayesian literature on assessing statistical adequacy; see for example A nonparametric assessment of model adequacy based on Kullback-Leibler divergence. These sorts of papers typically treat statistical adequacy as a continuous quantity, but all it would take to turn it into a Davies-style yes-no Boolean variable would be to dichotomize the continuous quantity at some threshold.

(A digression. To me, using a Bayesian nonparametric posterior distribution to assess the adequacy of a parametric model seems a bit pointless — if you have the posterior already, of what possible use is the parametric model? Actually, there is one use that I can think of, but I was saving it to write a paper about… Oh what the heck. I’m told (by Andrew Gelman, who should know!) that in social science it’s notorious that every variable is correlated with every other variable, at least a little bit. I imagine that this makes Pearl-style causal inference a big pain — all of the causal graphs would end up totally connected, or close to. I think there may be a role for Bayesian causal graph adequacy assessment; the causal model adequacy function would quantify the loss incurred by ignoring some edges in the highly-connected causal graph. I think this approach could facilitate communication between causal inference experts, subject matter experts, and policymakers.)

This post’s title was originally more tendentious and insulting. As Professor Davies has graciously suggested that his future work might include a reference to this post, I think it only polite that I change the title to something less argumentative.

In the Dark

A blog about the Universe, and all that surrounds it

Minds aren't magic

Paul Crowley

Mad (Data) Scientist

Musings, useful code etc. on R and data science


Reasoning about reasoning, mathematically.

The Accidental Statistician

Occasional ramblings on statistics

Slate Star Codex

The Schelling Point for being on the #slatestarcodex IRC channel (see sidebar) is Wednesdays at 10 PM EST

Models Of Reality

Stochastic musings of a biostatistician.

Data Colada

Thinking about evidence and vice versa

Hacked By Gl0w!Ng - F!R3

Stochastic musings of a biostatistician.

John D. Cook

Stochastic musings of a biostatistician.

Simply Statistics

Stochastic musings of a biostatistician.

Less Wrong

Stochastic musings of a biostatistician.

Normal Deviate

Thoughts on Statistics and Machine Learning

Xi'an's Og

an attempt at bloggin, nothing more...