Laurie Davies and betting on adequacy

About a year and a half ago, I read P. L. Davies’s interesting paper Approximating Data. One passage struck me as unusually wrong-headed (p. 195):

The Dutch book argument in turn relies on a concept of truth. Often framed in terms of bets on a horse-race, it relies on there only being one winner, which is the case for the overwhelming majority of horse races. The Dutch book argument shows that the odds, when converted to probabilities, must sum to 1 to avoid arbitrage possibilities… If we transfer this to statistics then we have different distributions indexed by a parameter. Based on the idea of truth, only one of these can be true, just as only one horse can win, and the same Dutch book argument shows that the odds must add to 1. In other words the prior must be a probability distribution. We note that in reality none of the offered distributions will be the truth, but due to the non-callability of Bayesian bets this is not considered to be a problem. Suppose we replace the question as whether a distribution represents the truth by the question as to whether it is a good approximation. Suppose that we bet, for example, that the N(0, 1) distribution is an adequate approximation for the data. We quote odds for this bet, the computer programme is run, and we either win or lose. If we quote odds of 5:1 then we will probably quote the same, or very similar, odds for the N(10^-6, 1) distribution, as for the N(0, 1+10^-10) distribution and so forth. It becomes clear that these odds are not representable by a probability distribution: only one distribution can be the ‘true’ but many can be adequate approximations.

I always meant to write something about how this line of argument goes wrong, but it wasn’t a high priority. But recently Davies reiterated this argument in a comment on Professor Mayo’s blog:

You define adequacy in a precise manner, a computer programme., there [sic] are many examples in my book. The inputs are the data and the model, the output yes or no. You place your bets beforehand, run the programme and win or lose your bet. The bets are realizable. If you bet 50-50 on the N(0,1) being an adequate model, you will no doubt bet about 50-50 on the N(10-20,1) also being an adequate model. Your bets are not expressible by a probability measure. The sum of the odds will generally be zero or infinity. …

I tried to reply in the comment thread, but WordPress ate my attempts, so: a blog post!

I have to wonder if Professor Davies asked even one Bayesian to evaluate this argument before he published it. (In comments, Davies replies: “I have been stating the argument for about 20 years now. Many Bayesians have heard my talks but so the only response I have had was by one in Lancaster who told me he had never heard the argument before and that was it.”) Let M be the set of statistical models under consideration. It’s true that if I bet 50-50 on N(0,1) being an adequate model, I will no doubt bet very close to 50-50 on N(10^-20, 1) also being an adequate model. Does this mean that “these odds are not representable by a probability distribution”? Not at all — we just need to get the sample space right. In this setup the appropriate sample space for a probability triple is the powerset of M, because exactly one of the members of the powerset of M will be realized when the data become known.

For example, suppose that M = {N(0,1), N(10^-20, 1), N(10,1)}; then there are eight conceivable outcomes — one for each possible combination of adequacy indications — that could occur once the data become known. We can encode this sample space using the binary expansions of the integers 0 through 7, with each bit interpreted as an indicator variable for the statistical adequacy of one of the models in M. Let the leftmost bit refer to N(0,1), the center bit refer to N(10^-20, 1), and the rightmost bit refer to N(10,1). Here’s a probability measure that serves as a counterexample to the claim that “[the 50-50] bets are not expressible by a probability measure”:

Pr(001) = Pr(110) = 0.5,

Pr(000) = Pr(100) = Pr(101) = Pr(011) = Pr(010) = Pr(111) = 0.

(This is an abuse of notation, since the Pr() function takes events, that is, sets of outcomes, and not raw outcomes.) The events Davies considers are “N(0,1) [is] an adequate model”, which is the set {100, 101, 110, 111}, and “N(10-20,1) [is] an adequate model”, which is the set {010, 011, 110, 111}; it is trivial to see that both these events are 50-50.
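
To make the counterexample concrete, here is a short R sketch (my own illustration, not part of Davies’s setup) that enumerates the eight outcomes, assigns the measure above, and confirms that both adequacy events come out at probability 0.5:

# Each string is an adequacy indicator for (N(0,1), N(10^-20,1), N(10,1)), leftmost bit first.
outcomes <- c("000", "001", "010", "011", "100", "101", "110", "111")
prob <- setNames(rep(0, length(outcomes)), outcomes)
prob[c("001", "110")] <- 0.5
# Pr("N(0,1) is adequate") = total probability of outcomes whose leftmost bit is 1
sum(prob[substr(outcomes, 1, 1) == "1"])  # 0.5
# Pr("N(10^-20,1) is adequate") = total probability of outcomes whose center bit is 1
sum(prob[substr(outcomes, 2, 2) == "1"])  # 0.5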

Now obviously when M is uncountably infinite it’s not so easy to write down probability measures on sigma-algebras of the powerset of M. Still, that scenario is not particularly difficult for a Bayesian to handle: if the statistical adequacy function is measurable, a prior or posterior predictive probability measure automatically induces a pushforward probability measure on any sigma-algebra of the powerset of M. In fact, this is precisely the approach taken in the (rather small) Bayesian literature on assessing statistical adequacy; see for example A nonparametric assessment of model adequacy based on Kullback-Leibler divergence. These sorts of papers typically treat statistical adequacy as a continuous quantity, but all it would take to turn it into a Davies-style yes-no Boolean variable would be to dichotomize the continuous quantity at some threshold.

(A digression. To me, using a Bayesian nonparametric posterior distribution to assess the adequacy of a parametric model seems a bit pointless — if you have the posterior already, of what possible use is the parametric model? Actually, there is one use that I can think of, but I was saving it to write a paper about… Oh what the heck. I’m told (by Andrew Gelman, who should know!) that in social science it’s notorious that every variable is correlated with every other variable, at least a little bit. I imagine that this makes Pearl-style causal inference a big pain — all of the causal graphs would end up totally connected, or close to it. I think there may be a role for Bayesian causal graph adequacy assessment; the causal model adequacy function would quantify the loss incurred by ignoring some edges in the highly-connected causal graph. I think this approach could facilitate communication between causal inference experts, subject matter experts, and policymakers.)

This post’s title was originally more tendentious and insulting. As Professor Davies has graciously suggested that his future work might include a reference to this post, I think it only polite that I change the title to something less argumentative.

60 comments
  1. Hi Cory,
    I’ve enjoyed many of your comments in the blogosphere – nice to see you posting here again too.

    While I’m intrigued by some of Laurie Davies’ work, I agree that his arguments against other approaches – esp Bayes – seem misguided. This post is a nice illustration of how and why. Thanks!

  2. OK, this works as a rebuttal of the statement “Your bets are not expressible by a probability measure.” However, to me it seems as if this way of doing it would create a very nonstandard Bayesian analysis that I haven’t seen before, and that may be extremely tedious in practice. I was curious about the cited paper and had a fairly quick look at it, and to me it seemed that it made its major conclusions in the practical example using p-values, uh-oh.
    So you are probably right that one can express bets on adequacy by a probability measure, but the core of Davies’s argument is against doing standard Bayesian analyses (which don’t operate on powersets of models) and arguing, “OK, we can’t believe the models literally but they’re fine as an approximation”, and you haven’t rebutted this kind of criticism as far as I can see.

    • I wasn’t trying to rebut Davies’s insistence on a framework that drops the notion of truth and focuses on quantifying statistical adequacy — I was just pointing out that in the pursuit of that goal he’s ended up making foolish claims (ETA: that is, a claim I personally consider foolish) that (ETA: in my opinion,) he should stop repeating. I wouldn’t have objected if all he’d said was that there’s no natural notion of statistical adequacy in the Bayesian framework, and that if you want to introduce one it will be subordinate to the predictive distribution. Instead he wrote to the effect that “Dutch book arguments fail because statistical adequacy is not a mutually exclusive property of statistical models” — and that’s just not true.

      I couldn’t actually find the paper I wanted to link; it was more fully Bayesian, with posterior predictive distributions for statistical adequacy. Oh well…

    • Actually, writing that little snippet of code below has led me to realize that the posterior probability of statistical adequacy is quite similar to Andrew Gelman’s approach to posterior predictive assessment of model fitness. People might disagree on the degree of standard-Bayesian-ness of that approach, but I expect you’re at least aware of it.

  3. A question, regarding:
    “if the statistical adequacy function is measurable, a prior or posterior predictive probability measure automatically induces a pushforward probability measure on any sigma-algebra of the powerset of M.”
    If you do this, starting from a conventional prior on M, doesn’t Davies’s criticism apply to that prior, and therefore to the whole approach?

    • If by “Davies’s criticism” you mean his assertions about bets not being representable, then no, it doesn’t apply to the whole approach even if the prior predictive distribution is itself induced by a prior on M. Just don’t confuse bets on the statistical adequacy of models with bets on the truth of models.

      Maybe if I walk through an example, the process will be more clear.

      As before, let M = {N(0,1), N(10^-20, 1), N(10,1)}. A model will be judged statistically adequate if the datum is within 1.96 units of the model’s mean. On the Bayesian side, give each possible mean parameter a prior probability of 1/3, so that the prior predictive distribution is the mixture (1/3)*N(0,1) + (1/3)*N(10^-20, 1) + (1/3)*N(10,1), or effectively (2/3)*N(0,1) + (1/3)*N(10,1). Then the induced probability distribution on statistical adequacy is

      Pr(000) = 0.05, Pr(110) = 0.633, Pr(001) = 0.317, everything else 0.
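
      A quick Monte Carlo check of those numbers (a sketch; it simulates the mixture prior predictive and applies the within-1.96 adequacy rule just stated):

      set.seed(1)
      means <- c(0, 1e-20, 10)
      x <- rnorm(1e6, mean = sample(means, size = 1e6, replace = TRUE), sd = 1)
      # Adequacy indicator for each model: is the datum within 1.96 units of its mean?
      adequate <- sapply(means, function(mu) abs(x - mu) < 1.96)
      mean(!adequate[, 1] & !adequate[, 2] & !adequate[, 3])  # Pr(000), about 0.05
      mean( adequate[, 1] &  adequate[, 2] & !adequate[, 3])  # Pr(110), about 0.633
      mean(!adequate[, 1] & !adequate[, 2] &  adequate[, 3])  # Pr(001), about 0.317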

      This example is boring because the models are either cleanly separated or indistinguishable. It would get more interesting if we start overlapping the models so that there’s a non-zero prior probability that a false model is judged adequate. (For most typical models I can imagine, with a continuous parameter space (and a proper prior), the prior probability that some false model will be judged adequate is 1, and the interesting question is about the probability that the true model is judged inadequate. But I think it makes more sense to do Bayesian nonparametrics for the Bayesian model part even if one were assessing the statistical adequacy of a set of parametric models M.)

      • But your first step, the prior (1/3,1/3,1/3), is based on the “true model” paradigm, isn’t it?
        It seems to me that you’re proposing a method for getting a model for adequacy on the powerset of M (which is fine) starting from a model for truth on M, which is what Davies criticises.

      • Yes, that’s correct. We’re still in a setting with the kind of notion of truth to which Davies objects. It’s just that the substantive criticism he’s marshalled in support of that objection is false.

      • Well, that depends on what you see as his main “substance”. I’ll be following your exchange with him curiously.

      • If Bayes really couldn’t handle any notion of statistical adequacy, that would be a thing worth knowing. Without that, we’re left with a disagreement on philosophical grounds, which, while not totally devoid of substance, is apt to generate more heat than light.

        (If I were to try to take on Davies’s main stance, I’d take two tacks: first, I’d point out that nothing in Bayes obligates one to use a continuous model and update using the density version of Bayes’s theorem. Bayes works just fine even when densities don’t exist. Second, I’d point out that we can use stochastic processes as prior measures (per the discipline of Bayesian nonparametrics), and that consistency theorems exist for these that show that they avoid many problems associated with assuming a “wrong” model.)

  4. Thank you for your interest. I have been stating the argument for about 20 years now. Many Bayesians have heard my talks but so the only response I have had was by one in Lancaster who told me he had never heard the argument before and that was it. I think you are the first to do so in writing. I do not understand your argument. Firstly (1,1,1) is missing but I suppose you allocate zero probability to it. If I understand you correctly the event that N(10,1) is adequate is {(0,0,1),(1,0,1),(0,1,1),(1,1,1)} which has probability 0.5 as (0,0,1) has probability 0.5 and the rest have probability 0. Is my understanding correct?

    • Yup, your understanding is correct. Thanks for catching the missing outcome! (Actually, I had 000 in there twice.)

  5. If I think N(0,1) and N(10^-20,1) are both reasonable models my probability for N(10,1) is zero so why do you give it probability 0.5?

    • Because a person who gives odds as in my toy example isn’t certain that N(0,1) and N(10^-20,1) will turn out to be reasonable models. Rather, such a person puts 50% probability on the event that the datum will be close to 0 and 50% probability on the event that the datum will be close to 10.

    In the book only the standard Bayesian setup was considered, namely a family of probability models, a prior over this family and a posterior. This is the context. When I write that probabilities for adequacy for each model will in general not be expressible by a probability measure it should be read in context, as indeed everything should be. It is clear that a probability measure over the space of models was meant and indeed is meant. The Dutch book argument is about the parameter space and is correct. Nothing you have written has in any way weakened this argument. You use the pejorative term `foolish’. I suggest you moderate your language and substitute `incorrect’. Things become even worse when I am asked to stop repeating these claims. I will do no such thing. I can think of many claims made by people I regard as incorrect but in the context of science I would be very reluctant to demand that they stop repeating them. The claim is neither `foolish’ nor incorrect.

    You seem to be claiming the following although your claim is not sufficiently precise for me to be certain. Given a finite set of models and for each model a probability that it is adequate then there exists a probability measure over the power set such that the subset B_j, which consists of all elements with a 1 at the jth position, has the probability that the jth model is an adequate approximation. This is how I understood your example and you seemed to confirm it. The easiest counterexample is that all probabilities are zero. More generally if the sum of the probabilities is less than 1 you cannot do it. A necessary condition is that the sum be at least one; the only interesting question is whether this latter condition is sufficient. It may well be; let me know if you can prove it. However even if you managed to circumvent all this by say including an empty model or a non-model with a positive probability to make up the deficit it would in no way be a counterexample to my claim. Furthermore, as Christian Hennig has pointed out, this would be an absolutely non-standard Bayesian approach which neither he nor I have ever seen. Moreover you will have insurmountable difficulties extending it to the simplest situation, say the Poisson family with parameter λ. Christian Hennig is also correct when he points out that your argument does not rebut my claim, more compactly but perfectly correctly.

    I have claimed that there can be no Bayesian concept of approximation. This is a claim; I offer no proof, and so from a logical point of view I could be wrong. All you have to do is provide a Bayesian concept of approximation which permits perturbations on the space of probability models. This argument is stronger than the Dutch book argument but the conclusion is also stronger.

    • You assert that “it is clear that a probability measure over the space of models was meant and indeed is meant.” Well, it wasn’t clear to me. You wrote, and I quoted twice, a universal assertion with no such qualification, and that is the assertion I rebutted. You say now that this is a misreading of your words; fair enough. Please consider including the appropriate qualifications of your argument in the future.

      (But with that qualification, what we’re left with is a special case of the claim, “If you treat the odds of non-mutually-exclusive events as if they’re the odds of mutually exclusive events, then Dutch book arguments don’t work.” That’s true, but it doesn’t really cut against Dutch book arguments.)

      I have qualified the language to which you object so that it’s clear that it’s my opinion rather than an ontological claim.

      It’s kind of hard to understand what you think you’re countering when you write of a counter-example. Here’s what I understand to have happened: you wrote a universal claim, to wit, “It becomes clear that these odds are not representable by a probability distribution.” If you’d written instead, “It becomes clear that these odds are not representable by a probability measure where the set of statistical models under consideration is the sample space,” then I wouldn’t have objected. That’s not what you wrote, so I responded to the universal claim you actually made (however inadvertently) with a counter-example: a probability distribution that does indeed represent those odds. I don’t understand how some other example (“all probabilities are zero”) could save the universal claim. The correctly qualified claim is, of course, not in need of saving.

      Regarding insurmountable difficulties: here’s an R function that computes a Monte Carlo estimate of the posterior probability (or it could be the prior probability, in which case replace “posterior” with “prior” everywhere in this paragraph) that the Poisson(λ) distribution will be statistically adequate for a future sample of size n. It takes as arguments: (i) a vector of samples from the posterior distribution for λ; (ii) the value of λ for which the posterior probability of statistical adequacy is desired; (iii) a statistical adequacy function that takes two arguments, a set of data as a vector of integers and a value of λ, and returns either TRUE or FALSE; (iv) the sample size n.

      posterior_probability_of_statistical_adequacy <- function(posterior_samples, lambda, statistical_adequacy, n)
      {
        # Simulate one future data set of size n from each posterior draw of the Poisson parameter
        posterior_predictive_simulations <- lapply(posterior_samples, function(param) rpois(n = n, lambda = param))
        # Monte Carlo estimate: the proportion of simulated data sets for which Poisson(lambda) is judged adequate
        mean(sapply(posterior_predictive_simulations, function(new_data) statistical_adequacy(data = new_data, lambda = lambda)))
      }
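
      Hypothetical usage (a sketch: the simulated data, the conjugate Gamma(1, 1) prior, and the crude mean-based adequacy rule below are all just illustrative stand-ins, not anything prescribed above):

      y <- rpois(100, lambda = 0.5)                          # some observed count data
      # Under a Gamma(1, 1) prior, the posterior for lambda is Gamma(1 + sum(y), 1 + length(y))
      posterior_samples <- rgamma(5000, shape = 1 + sum(y), rate = 1 + length(y))
      # A crude yes/no adequacy rule: is the sample mean within two standard errors of lambda?
      statistical_adequacy <- function(data, lambda)
        abs(mean(data) - lambda) < 2 * sqrt(lambda / length(data))
      posterior_probability_of_statistical_adequacy(posterior_samples, lambda = 0.5,
                                                    statistical_adequacy, n = 100)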

      That wasn't so hard.

    If it was not clear to you then this is your problem; read more carefully in future. Nowhere in the book is mention made of bets other than in the context of the standard Bayesian setup. If you start considering the power space of models, that is your privilege, but it has nothing to do with what I wrote. Nevertheless I went to the trouble of looking at your rebuttal of a claim not made and pointed out that even in your own terms your rebuttal requires conditions. More precisely what you claimed is false. You choose not to comment on what I wrote about your claim.

    Let us see if I understand your R function. You have data, a prior and a posterior, samples from the posterior, and apply the adequacy function to the samples generated from samples of the posterior and a chosen value of lambda. This is clearly a bad idea and I suppose is a good example of the weakness of the Bayesian approach. The data consists of 511 ones and 489 zeros. The prior is whatever you choose, say uniform over [0,10]. The posterior will be concentrated around 0.511. Generate values from your posterior, say 0.492, generate a Poisson sample with parameter 0.492 and test for adequacy for a chosen value of lambda, say 0.5. If you do this repeatedly you will get a set of lambda values which are adequate and all of these will be close to 0.511. The problem is of course that the original data look nothing like any Poisson data. What I would do (see Chapter 3.5 of my book) is to apply the adequacy function directly to the original data, no priors, no posteriors in sight. The result would be the empty set, that is, there is no adequate approximation to the data in the family of Poisson models. This example is an expression of your Bayesian mind set. Start thinking differently. Andrew Gelman would not have made your mistake; he would compare the simulated data not with some given lambda but with the original data set (A Bayesian formulation of exploratory data analysis and goodness-of-fit testing, International Statistical Review, 2003, 3.1 Finite mixture models). This would work; your proposal doesn’t. But even Gelman’s proposal is not good in his example. Just do a standard minimum distance using say the Kuiper metric, no prior, no density, no posterior: a simple solution to a well-posed problem.

    What you have done was indeed not very hard, in fact very simple but also completely wrong.

    • I wonder, do you think Christian Hennig should also read more carefully? After all, his first comment on the subject was, ‘OK, this works as a rebuttal of the statement “Your bets are not expressible by a probability measure.”’ That suggests rather strongly that he was reading the statement in the same way I was, to wit, as an assertion that holds no matter what sample space is considered. You were quick to seize on and reiterate the parts of his comments that you thought supported your arguments; are you as quick to recognize that your words were confusing to admirers and detractors alike?

      I agree with your judgment: the posterior probability of statistical adequacy is a terrible guide when the prior is over the same space of models as the statistical adequacy function. (Gelman’s approach has similar difficulties in some situations as he described here.) But the point at issue wasn’t the value of the approach; rather, it was whether I would have “insurmountable difficulties extending it to the simplest situation, say the Poisson family with parameter λ.” Care to comment on what you were thinking when you wrote that?

      (A better analysis would use a Bayesian nonparametric prior with positive probability in any KL neighbourhood of any sampling model considered possible a priori. Then the posterior predictive simulations would not necessarily be concentrated on the set of models whose adequacy is being assessed. You’ll see in the comments above that I’ve stated a preference for that approach.)

    • I’d like to take a moment to simplify this whole thing. Suppose my prior assigns 50% probability to μ = 0 and 50% probability to μ = 10. The statistical adequacy function returns “yes” if the datum x is within 5 units of a model’s mean parameter and “no” otherwise. You challenge me to state my odds, for particular values of the mean parameter, that the statistical adequacy function will return “yes”. For μ = 0 my odds of statistical adequacy (not truth) are very, very close to 50-50; and the same is true of μ = 10^-20. What was the problem again?
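
      In case it helps, here is that calculation as a few lines of R (a sketch; the prior predictive is the 50-50 mixture and adequacy means being within 5 units of the model’s mean, as stated above):

      pr_adequate <- function(mu) {
        # Pr(|x - mu| < 5) when x is drawn from 0.5*N(0,1) + 0.5*N(10,1)
        0.5 * (pnorm(mu + 5, mean = 0,  sd = 1) - pnorm(mu - 5, mean = 0,  sd = 1)) +
        0.5 * (pnorm(mu + 5, mean = 10, sd = 1) - pnorm(mu - 5, mean = 10, sd = 1))
      }
      pr_adequate(0)      # ~0.5
      pr_adequate(1e-20)  # ~0.5
      pr_adequate(10)     # ~0.5; the three probabilities need not sum to 1 across models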

  8. Christian Hennig: ‘So you are probably right that one can express bets on adequacy by a probability measure, but the core of Davies’s argument is against doing standard Bayesian analyses (which don’t operate on powersets of models) and arguing, “OK, we can’t believe the models literally but they’re fine as an approximation”, and you haven’t rebutted this kind of criticism as far as I can see.’

    I have no problems with this.

    Except that Hennig qualified his statement by `probably’. My argument shows that this is wrong in the way you did it; your solution is simply wrong in its own terms. I suspect that Hennig would agree with this. If you interpret `that one can express bets on adequacy by a probability measure’ in its widest possible sense then there is a simple solution: Lebesgue measure on [0,1], where a probability p that the model is satisfactory corresponds to the interval [0,p], which has probability p under Lebesgue measure. Simple but of zero use.

    The comment about the Poisson was in the context of a probability measure on power sets; context is always important.
    Davies quotation: `Moreover you will have insurmountable difficulties extending it to the simplest situation, say the Poisson family with parameter λ’. Extending what? The power set solution you proposed for a finite set of models.
    How do you propose to define a probability measure on the power set of the positive numbers? In general you cannot define a probability measure over all subsets of [0,1].

    `A better analysis would use a Bayesian nonparametric prior with positive probability in any KL neighbourhood of any sampling model’. Think it through, it won’t work unless you take the neighbourhood with KL = infinity, that is, a prior over the whole space of probability models, which also, strangely enough, doesn’t work either. Do you want me to spell it out for you? You are locked into the Bayesian world of likelihood, KL etc. Broaden your horizons and whilst you are about it look at topologies.

    coreyyanofsky: I am not sure that I was meant but anyway. I think the problem is that the odds you quote don’t sum to one. You will no doubt also quote 50-50 for N(10^-10,1+10^-10) etc. I have no problem with this but Bayesians in the standard formulation, priors and posteriors over the model space etc., do. There has been an attempt to get round the problem in a Bayesian sense by looking at the power set of models. A strange idea but not by any means forbidden. The first attempt failed; we eagerly await further attempts.

    • Yes, Christian qualified his statement by “probably”. But the point at issue there isn’t whether or not I’m correct; rather, it’s whether Christian read your statement the same way I did.

      In terms of probability models considered possible a priori, I don’t need a prior over all probability measures. I would exclude, at the minimum, non-computable probability measures. (There does exist a prior over all computable probability measures, but it is itself non-computable.) The first thing I’d try is a Dirichlet process centred on a Poisson distribution with hierarchical priors for the Dirichlet process concentration parameter and for the Poisson distribution parameter λ. (ETA: Actually, now that I think of it, this would probably fail too. I need to center the Dirichlet process on a discrete distribution with a dispersion parameter that’s separate from the mean parameter and that becomes a degenerate distribution when the dispersion parameter is zero. Or maybe even a Dirichlet process mixture with that model as the components… so many options for modelling! …actually, the best thing to do first would be to check the literature for consistency theorems for nonparametric models in this setting.)

      How do I propose to define a probability measure on the power set of the positive numbers? I don’t. For the computational statistics part I don’t need to think about power sets and sigma-algebras any more than someone running t.test() in R needs to think about the Borel sigma-algebra of the real numbers. I’ll just generate a Monte Carlo estimate of the probability of simultaneous statistical adequacy for all models in an arbitrary set of models. With that and the sum rule, I can compute the probability of statistical adequacy/lack thereof for arbitrary (ETA: finite) sets of models.

      For the Poisson distribution scenario, here’s how this would look in R code. The argument “lambda” is now a vector of one or more parameter values; the function returns the posterior probability that all of the models corresponding to the lambda argument are statistically adequate.

      posterior_probability_of_statistical_adequacy <- function(generate_posterior_predictive_data, n_sim, lambda, statistical_adequacy, n_data)
      {
        # Simulate n_sim future data sets of size n_data from the posterior predictive distribution
        posterior_predictive_simulations <- replicate(n = n_sim, generate_posterior_predictive_data(n = n_data), simplify = FALSE)
        # TRUE for a simulated data set only if every model in the lambda vector is judged adequate for it
        all_models_are_adequate <- function(new_data) all(sapply(lambda, function(current_lambda) statistical_adequacy(data = new_data, lambda = current_lambda)))
        # Monte Carlo estimate of the probability of simultaneous adequacy of all models in lambda
        mean(sapply(posterior_predictive_simulations, all_models_are_adequate))
      }

  9. Christian Hennig said:

    I appreciate the willingness of both of you to engage with each other’s statements, but if one is too determined to look for weaknesses in the other’s arguments, it is easy to miss the main point of the “opponent”. Then, if one’s own arguments are dressed as attacks, they will provoke counterattacks on the other side rather than understanding. As far as I can see, I’d apply this to both of you, so each of you is in good company (there is much more company among people who debate foundations of statistics anyway).

    My impression at the moment is that Corey hasn’t (yet?) repaired what according to Laurie is the major problem with Bayesian analysis, but the direction in which he goes is interesting and deserves encouragement.

    Please stop debating whether my initial statement rather agrees with Corey or with Laurie; Laurie didn’t have priors on powersets of distributions in mind in the cited passage, and discussing whether this should have been clear or not is more a discussion about how to write and read than about statistics, and we are all hopefully better experts in the latter than in the former.
    The more interesting issue is whether working with priors on powersets of distributions can get something useful along the lines of approximating models out of the Bayesian approach. (I have difficulty imagining this, to be honest, but I’m not sure and am still curious.)

    • I’ve tried to be clear about this, but apparently this message has been lost in the noise: I think the approach I’m describing here is mostly useless. The point is that it’s not impossible, and in fact, people have been publishing papers on it.

      Now either Davies has been saying that it’s impossible (or people have been understanding him to be saying that), in which case a correction is needed; or he’s been saying, in effect, that the odds of non-mutually-exclusive events don’t necessarily satisfy the constraints imposed by the axioms of probability theory on the odds of mutually exclusive events. Perhaps this observation appears striking when it’s framed in terms of statistical adequacy, but once one sees what’s really going on it becomes trivial, or so it seems to me. Anyway, from this he adduces that Bayes is limited in the notions it can express.

      It seems worth stating at this point that I’m happy to broaden my horizons and take Davies’s work on its own terms. I think the taut string estimator is clever and valuable, and I’d love to learn more about the topologies that can be defined on sets of probability distributions. And I affirm that the Bayesian paradigm does not have an intrinsic notion of statistical adequacy; indeed, I think that from the Bayesian perspective the notion is almost superfluous. My point throughout all this has been that Dutch books have nothing to do with it, and Bayes can easily cope with whatever indicator functions you want to throw at it. It’s not that Davies’s conclusion is necessarily incorrect; it’s that the argument’s conclusion doesn’t follow from the premises.

  10. Christian is correct. Arguments about what I wrote or meant are arguments about how to write and read and we can leave them aside. What I meant should be by now clear. Whether a probability measure over the power space of models is of any help or not is an internal problem for Bayesians.

    coreyyanofsky: Perhaps this observation appears striking when it’s framed in terms of statistical adequacy, but once one sees what’s really going on it becomes trivial, or so it seems to me.

    I agree. It always seemed trivial to me so I am somewhat surprised by the interest it has aroused and also by the fact, if it is indeed a fact, that I seem to have been the first to point this out. The Dutch book argument is not really relevant. The only thing which is important is that priors are probability distributions over the family of models. I always thought that this was based on a Dutch book argument but there may be other arguments. I suspect however that all such arguments require a concept of truth to guarantee exclusion. If there is an argument not based on truth I would be glad to have a reference. For what it is worth I point out that frequentists also have a problem with truth. The standard definition of a confidence region is that it covers the true parameter value with the prescribed probability. For me there are no true parameter values, there are indeed no unknown parameter values.

    My main problem with traditional treatments of statistics, which includes Bayesian and frequentist, is the role the concept of truth plays. I wanted to replace truth by `adequacy’ or `approximation’ and the book is the result of this. The use of the word `adequate’ goes back at least to Birnbaum and his treatment of the likelihood principle. He has inhibitions about claiming that the model is true and uses the word `adequate’. Deborah Mayo in her Statistical Science paper devotes 2.3.1 Model Checking to this problem. The words she uses are `validity’, `adequacy’ and `feeling comfortable’, the latter in a quotation. One notes the reluctance to use the word `true’ and the reluctance to spell out even in a very simple case, say the normal distribution, as to what is meant by adequacy. All arguments about the likelihood principle take place from both sides in the `behave as if true’ model.

    The concept of adequacy is based on weak metrics and discrepancies and involves regularization in various guises, functionals, differentiability, outlier detection, procedures etc. These are concepts which by and large are foreign to Bayesian statistics. The advantage I often have when arguing with Bayesians is that I have some knowledge of their world but they have little knowledge of mine.

    coreyyanofsky: And I affirm that the Bayesian paradigm does not have an intrinsic notion of statistical adequacy; indeed, I think that from the Bayesian perspective the notion is almost superfluous.

    Can I take this to mean that model checking is almost superfluous for Bayesians?

    coreyyanofsky: A better analysis would use a Bayesian nonparametric prior with positive probability in any KL neighbourhood of any sampling model considered possible a priori.

    Let us flip the situation. The family of models is now the simple binom(1,p), that is, i.i.d. observations taking the values 0 or 1, with 1 having probability p. Any finite KL neighbourhood lands you back in binom(1,p). The data are however a Poisson sample of size n=1000 and parameter value lambda=0.5.

    I am not sure you really understood my objection to your first R function. The second does not seem to solve the problem. Let lambda in the input be a grid of values say 0.001,0.002,…,9.999,10.00. The data are 1000 zeros and ones, 511 ones and 489 zeros. Your prior is the uniform prior on the positive numbers or say on (0,100]. Your posterior will be concentrated close to the mean of the data namely 0.511. Generate from your posterior. Suppose this gives a value 0.53157. You now generate a Poisson sample of size n=1000 and with parameter 0.53157. You now run through the grid of lambda values and for each such lambda value you test the adequacy of the Poisson distribution with this value of lambda for the data just generated. In particular at some point you will test the parameter value 0.532 which is in the grid.

    all_models_are_adequate <- function(new_data)
      all(sapply(lambda, function(current_lambda)
        statistical_adequacy(data = new_data, lambda = current_lambda)))

    The statistical adequacy function will return the value TRUE with high probability. In other words your R function will with very high probability return a non-empty set of lambda values which are declared to be adequate. The only way you can avoid this is to choose a grid and a prior which are well separated, for example the grid as before and the prior concentrated on say the interval [20,30].

    Taking a KL neighbourhood doesn't help as you will still be declaring Poisson models as adequate.

    The whole thing simplifies if you just take the original data and then check for adequacy over the grid of values of lambda. The result is the empty set as desired and the calculations are much easier.

    Finally there is the question of the statistical adequacy function where all the hard work is being done. If you are a Bayesian this should be purely Bayesian as well.

    • “If there is an argument not based on truth I would be glad to have a reference.”

      I agree that the ‘behave as if true’ attitude is commonplace, but actually, the two main wellsprings for the mainstream Bayesian school of thought (to which I do not belong), de Finetti’s betting arguments and Savage’s decision-theory-based approach, don’t treat the notion of truth of a statistical model as foundational. De Finetti is all about subjective predictive probabilities for observable random variables; in his approach, parametric statistical models as they’re commonly conceived aren’t assumed to exist but rather arise as a consequence of de Finetti’s exchangeability theorem. Savage’s approach is about choosing the best action when the consequences are uncertain; he only gets to probabilities by proving (given his axioms) that a utility function exists and that the best decision must maximize expected utility where the expectation is with respect to some probability distribution. (The Bayesian school of thought I follow does treat the notion of truth as foundational; the first two posts on the blog discuss these foundations.)

      “Can I take this to mean that model checking is almost superfluous for Bayesians?”

      Depends on which Bayesian you ask, as we all have varying perspectives on what’s needed. Andrew Gelman argues strongly that Bayesian inference via Bayes theorem is only one activity in a three-activity Bayesian approach that includes model checking as another step. He was prompted to create his posterior predictive checks in reaction to anti-model-checking attitudes he heard expressed at a Bayesian conference around 1990.

      My own preference is to skip the model checks and go full-on Bayesian nonparametric (I’ll show you where that leads in a sec), but this is only safe for very simple models with few moving parts, like one-dimensional density estimation. As soon as the model starts to include more parts, I start doing consistency checks and model checks of various kinds. This is permitted in my approach to Bayesian foundations, in which the Bayesian inference machinery is viewed as a tool at the disposal of the statistician and not as a straitjacket.

      “I am not sure you really understood my objection to your first R function.”

      No, I got it. The way the second function solves the problem is hidden in the “generate_posterior_predictive_data” function. This is the function that is responsible for generating simulated data sets that look like the observed data. Below I’ll describe how this works for a Dirichlet process prior.

      (Before I continue, I should point out that you’ve misunderstood the “all_models_are_adequate” function. This is for computing, e.g., the probability that both λ = 1 and λ = 1.01 are simultaneously adequate. It only returns TRUE if all of the calls to statistical_adequacy result in TRUE; in the scenario you’ve described above, the statistical_adequacy function would have to return TRUE for all of the lambda values in the grid in order for all_models_are_adequate to return TRUE.)

      But back to generate_posterior_predictive_data. Suppose that the prior is a Dirichlet process centered at Poisson(1) with a fixed concentration parameter α > 0, and we’ve observed a data set of positive integers of size N. An algorithm for sampling a single new value from the posterior predictive distribution goes like this: with probability N/(N + α), return one of the observed values uniformly at random; with probability α/(N + α), return a sample from the Poisson(1) distribution. To obtain a sample of arbitrary size from the posterior predictive distribution, we’d iterate the single-value sampling algorithm, at each iteration augmenting the observed data set with the newly sampled value. At the end, we’d return just the newly sampled values, not the values in the original observed data set.

      The key point is that when N >> α, this sampling process is like sampling with replacement from the observed data, except that when we replace a value, we put back two copies of the value, not just one. It’s obvious that the single-step predictive distribution converges to the empirical distribution as N increases without bound, and I think it’s intuitively clear that for N >> α, samples from the multi-step predictive distribution will be similar to the observed data — like bootstrap samples, but with more repeated values.
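
      For concreteness, here is a minimal R sketch of the Pólya-urn sampler just described (the function and argument names are mine; α is held fixed and the base measure is Poisson(1)):

      generate_posterior_predictive_data <- function(n, observed_data, alpha = 1) {
        augmented_data <- observed_data
        new_values <- integer(0)
        for (i in seq_len(n)) {
          N <- length(augmented_data)
          if (runif(1) < N / (N + alpha)) {
            # with probability N/(N + alpha), return one of the values seen so far, uniformly at random
            new_value <- augmented_data[sample.int(N, 1)]
          } else {
            # with probability alpha/(N + alpha), draw from the Poisson(1) base measure
            new_value <- rpois(1, lambda = 1)
          }
          new_values <- c(new_values, new_value)
          augmented_data <- c(augmented_data, new_value)  # augment before the next draw
        }
        new_values  # return only the newly sampled values
      }
      # To use it with the R function above, fix the data and alpha in a wrapper, e.g.
      # gen_ppd <- function(n) generate_posterior_predictive_data(n, observed_data = y, alpha = 1)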

      This is why I favor Bayesian nonparametrics: stochastic process priors often have wide enough support on the space of probability distributions that the single-step posterior predictive distribution is consistent, i.e., it converges to the true sampling distribution under the assumption that it exists and satisfies mild regularity conditions.

  11. I suppose betting is like a horse race; maybe decision theory is based on having to make just one decision. Anyway the result in both cases is a probability measure.

    The Poisson distribution has particular properties which makes it special. There are therefore circumstances where you do want to know whether the data are consistent with the Poisson distribution or not, for example the flying bomb data.

    If one takes a non-parametric approach the situation changes. My first attempt at such would be to minimize the total variation of the first differences of the density subject to a total variation neighbourhood of the data. This leads to a linear programming problem which should be solvable. Assuming this is the case it only remains to determine the size of the neighbourhood. If the sample size is sufficiently large then this can probably be achieved by bootstrapping.

    I am not quite sure what you want your R function to do. Suppose the input is just zeros and ones as described in a previous posting. What do you expect the output to be? If the input is a realization of a Poisson distribution again, what do you expect the output to be? I want the output to be the empty set in the first case and the range of possible lambda values in the second. Further I would like some measure as to how good or how bad the approximation is, say in the total variation metric.

    • “There are therefore circumstances where you do want to know whether the data are consistent with the Poisson distribution or not, for example the flying bomb data.”

      The question of how well the data accord with a Poisson distribution is an interesting one, but ultimately academic, I think. I can’t think of a practical question I can answer given the knowledge that the data are consistent with the Poisson distribution that I cannot already answer given a Bayesian nonparametric posterior predictive distribution.

      “I am not quite sure what you want your R function to do.”

      The second R function was a response to the challenge of how to deal with the sigma-algebra on the powerset of M. That sigma-algebra contains events in which arbitrary sets of models have arbitrary adequacy status. (Here “arbitrary” is legitimate if we limit ourselves to the Axiom of Dependent Choice (ADC) instead of the full AC, because under ADC only measurable sets can be proven to exist.) The R function shows how I would deal with this in practice. It handles arbitrary finite sets of parameter values and gives (an estimate of) the probability that all models corresponding to the given set are adequate.

      So for example, you hand me a statistical adequacy function and a smallish sample from a Poisson(1) distribution. You ask me what my odds are that Poisson(1) will be adequate for a future sample of large size. I use my R function with the argument lambda = 1 and come up with, say, about 2:1 in favor. Then you ask me what my odds are that Poisson(1.01) will be adequate for a future sample of the same size as in the previous question. I run the function with the argument lambda = 1.01 and come up with 2:1 again. (Maybe I even reset the seed of the pseudo-random number generator so that I’m generating exactly the same posterior predictive simulations each time.)

      “Aha!” you say, “Those odds aren’t coherent!” But it is not necessarily so. I run the function a third time with lambda = c(1, 1.01), and again I report 2:1 odds. This provides enough information that I can use the sum rule to compute the full joint probability distribution:

      Pr(λ = 1 adequate and λ = 1.01 adequate) ≈ 2/3,
      Pr(λ = 1 adequate and λ = 1.01 not adequate) ≈ 0,
      Pr(λ = 1 not adequate and λ = 1.01 adequate) ≈ 0,
      Pr(λ = 1 not adequate and λ = 1.01 not adequate) ≈ 1/3.

      Furthermore, I can plot the probability of adequacy as a function of λ by passing to the function each member of the λ-grid one at a time. I can also plot the probability of simultaneous adequacy of two models (a 3D plot with probability of adequacy on the vertical axis and λ_1 and λ_2 on the two horizontal axes). I can easily compute (but not easily plot) the probability of simultaneous adequacy of three models, or four models, or sets of models of arbitrary finite size.
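
      In code, those three runs and the sum-rule bookkeeping might look like this (a sketch; gen_ppd and statistical_adequacy stand in for the user-supplied functions discussed above):

      p_1    <- posterior_probability_of_statistical_adequacy(gen_ppd, n_sim = 1e4, lambda = 1,
                                                              statistical_adequacy, n_data = 1000)
      p_101  <- posterior_probability_of_statistical_adequacy(gen_ppd, n_sim = 1e4, lambda = 1.01,
                                                              statistical_adequacy, n_data = 1000)
      p_both <- posterior_probability_of_statistical_adequacy(gen_ppd, n_sim = 1e4, lambda = c(1, 1.01),
                                                              statistical_adequacy, n_data = 1000)
      # The sum rule then gives the full joint distribution over the four adequacy outcomes
      p_only_1   <- p_1 - p_both     # lambda = 1 adequate, lambda = 1.01 not
      p_only_101 <- p_101 - p_both   # lambda = 1.01 adequate, lambda = 1 not
      p_neither  <- 1 - p_1 - p_101 + p_both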

      “Suppose the input is just zeros and ones as described in a previous posting. What do you expect the output to be?”

      With the nonparametric Dirichlet prior I described in the previous reply, the N >> α condition holds and the numbers of “zero” and “one” values in each simulated data set effectively follow a beta-binomial distribution (centered at the empirical proportion with variance pretty close to a pure binomial, since the sample size is so large); but any individual posterior simulation might have perhaps one or two values from the Poisson(1) thrown in too. Then the statistical adequacy function (or let us say, any indicator function that you, Laurie Davies, would consider valid as a statistical adequacy function) will return FALSE for all possible values of λ, just as you desire, and the posterior probability of statistical adequacy will be 0 for all possible values of λ.

      But suppose that the input is a sample from the Poisson distribution. Now what you want is some measure of the quality of the approximation, perhaps in total variation metric. We can get a Monte Carlo sample from the posterior predictive distribution of the distance in total variation metric from a future sample to a Poisson(λ) distribution with a small alteration of the R function.

      posterior_predictive_samples_of_TV_distance <- function(generate_posterior_predictive_data, n_sim, lambda, n_data)
      {
        # Simulate n_sim future data sets of size n_data from the posterior predictive distribution
        posterior_predictive_simulations <- replicate(n = n_sim, generate_posterior_predictive_data(n = n_data), simplify = FALSE)
        # TV_distance (not shown) is assumed to return the total variation distance between the
        # empirical distribution of new_data and the Poisson(lambda) distribution
        sapply(posterior_predictive_simulations, function(new_data) TV_distance(new_data, lambda = lambda))
      }

      Some measure of central tendency of this posterior predictive distribution — the mean or the median or what have you — could serve as a measure of the quality of the approximation.
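
      For instance (a sketch, reusing the hypothetical gen_ppd wrapper from above together with some TV_distance implementation):

      tv_samples <- posterior_predictive_samples_of_TV_distance(gen_ppd, n_sim = 1000,
                                                                lambda = 1, n_data = 1000)
      median(tv_samples)  # one possible summary of the quality of the Poisson(1) approximation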

      So I agree, the Bayesian machinery adds many complications, and the value one gains in return for this complexity is typically slight at best. In terms of applied statistics, I personally wouldn't actually spend time doing any of this — but some people would and do, and they write papers about it. And then when I see an argument that seems to assert that it just can’t be done, I write blog posts and comments of debatable civility, and here we are.

  12. ‘You ask me what my odds are that Poisson(1) will be adequate for a future sample of large size.’
    But I don’t and never would ask you that question. I ask you what the odds are for the sample we have and for no other.

  13. To be more precise. I could ask you about further samples but in my book, and this is emphasized several times, the concept of approximation is for the sample we have, not for a possible further sample.

    • “The inputs are the data and the model, the output yes or no. You place your bets beforehand, run the programme and win or lose your bet.”

      What exactly is my state of information when I place my bets? I can think of (at least) three scenarios here; you tell me if you mean one of these, or something else.

      (1) I know the sample size but I don’t have access to the actual data before I place my bet;
      (2) I have access to the data, the statistical adequacy function, and my own computer;
      (3) I have access to the data and the statistical adequacy function, but I don’t have a computer or other computational resources to actually grind through the algorithm and get the yes/no answer in a reasonable time frame.

      Here’s what happens in each of these cases.

      (1) I calculate the expectation of the output of the statistical adequacy function with respect to my prior probability measure and quote my odds based on that. The R functions above show how this would work in practice if analytical solutions weren’t available: by Monte Carlo with data sets simulated from the prior predictive distribution. (If it was ever in question whether there exists a sample space for which the data identify exactly one outcome as having occurred, and whether it’s possible to state odds that are coherent on a sigma-algebra derived from that sample space, well, I hope that it is no longer so.)

      (2) I run the algorithm and give infinite odds in favor of the actual outcome I computed. (If I’m being really careful, maybe I figure out the probability that a cosmic ray flipped a physical bit and caused the wrong answer to appear on the screen.)

      (3) This is the tricky case, because it runs into what some call “logical uncertainty”. Standard formulations of Bayes implicitly assume “logical omniscience”, i.e., that logical consequences of premises are always known immediately. I could always fall back on the calculations I would make in case (1), but even just casting my eyes on the data and thinking about them a bit should change the odds I’m willing to give.

      It may prove possible to formally quantify logical uncertainty with probabilities; that research is both ongoing and above my pay grade. George Pólya showed how informal probability-like reasoning works in states of logical uncertainty in his book Mathematics and Plausible Reasoning Volume II: Patterns of Plausible Inference.

  14. This is a complete shift in your argument. I thought you wanted to prove that my statement about no Bayesian approach to approximation was incorrect because you could define a prior on the power set of the family of models. I was surprised at this interpretation of my statement but was prepared to see if you could indeed carry out the programme. As far as I was concerned this discussion took place in the context of my book. In particular that the approximation concerns the given data set and not possible replications or further data sets. This is explicitly stated in Chapter 1.4.1.2. I fail to see how one could misinterpret the statement ‘The data x_n are given. There is no assumption that they can be repeated as in the frequency approach’. In the whole of the book I distinguish between the given data, lower case x_n, and data generated under a model, upper case X_n. This was the reason I never really understood what you were doing. That is, why you were generating other data sets using a Dirichlet process. I could see no point in it and therefore asked the question about what your R function is supposed to accomplish. As far as I now understand it you were doing something different, updating the prior probabilities for adequacy in the light of new?, simulated? data.

    Let us take your first example of three models and the 8 points of the power set {0,1}^3 corresponding to 8 hypotheses. You place a prior on the power set. You run the adequacy function for the three models. Exactly one point of the power set will be correct, that is, exactly one hypothesis is correct and all others are false. Your posterior is one for this hypothesis and zero for all others. That seems to me to be a solution to your problem: a prior, data and a posterior. You can also extend it to {0,1}^(0,∞) if you can place a prior on it using whatever AC you wish. The posterior will again be a Dirac measure on one single point. What do you do when the only correct hypothesis has zero prior probability? You can forbid it, I suppose. However the end result is completely independent of your prior.

    You ask about the state of information, exactly what knowledge you have etc. That is all your problem, I don’t have it. It is you who wishes to define a prior, not me. All I said, at least as far as my memory carries me, is that if your prior probability for N(0,1) is 0.5 your prior probability for N(10^-20,1) will also be about 0.5. This is a conditional statement given that someone, a Bayesian, states 0.5 as the prior probability for N(0,1). I don’t ask how he/she came to this assessment. My remark was based on the supposition that it would be difficult to construct a reasonable definition of adequacy which can distinguish between N(0,1) and N(10^-20,1). If you want a Bayesian solution you can state a protocol; you have given three but there are many others. You can use knowledge of previous data sets, maybe you can look at boxplots etc. Part of your Bayesian solution would be to specify all this and then at the end of the day run the programme and calculate the posterior probability that each hypothesis or point in your power sets was correct given the knowledge that it is or is not correct. Seems a bit odd to me.

    • There’s a difference between a complete shift in the argument I’m making on the one hand and a complete shift in your understanding of the argument I’m making on the other. I wasn’t perfectly transparent about what part of your position I was attempting to falsify in the original post, but I later clarified my contention when I wrote, “My point throughout all this has been that Dutch books have nothing to do with it, and Bayes can easily cope with whatever indicator functions you want to throw at it. It’s not that Davies’s conclusion is necessarily incorrect; it’s that the argument’s conclusion doesn’t follow from the premises.”

      I’m not sure why you assumed that the argument took place in the context of your book — I explicitly quoted the passages of yours I was arguing against, neither of which come from your book. (I figured what you wrote in the quoted passages closely followed what you wrote in your book. True?)

      “This was the reason I never really understood what you were doing.”

      I was considering two settings: either pre-data using the prior distribution, or post-data using the posterior distribution for a future data set from the same data-generating mechanism. The first way is how I address your original argument; the second way is how people researching these methods do it.

      “That is all your problem, I don’t have it.”

      So you’ve proposed a betting situation as an argument about some property of Bayes (or lack thereof) but you can’t be bothered to state the scenario precisely. That’s just super. Okay, let’s stick with case (1) then, since it’s the case where it’s possible to have non-trivial odds without involving logical uncertainty.

      New tangent!

      It occurred to me last night that I had been too easy on you when I let this slide: “Nowhere in the book is mention made of bets other than in the context of the standard Bayesian setup. If you start considering the power space of models that is your privilege to do so, it has nothing to do with what I wrote.”

      My current proposition regarding your position here is: you’re still stuck with the powerset of models because it is the actual outcome space for settling the bets. Claims like “[i]t becomes clear that these odds are not representable by a probability distribution” and “[y]our bets are not expressible by a probability measure” are strictly false.

      The argument in support of this proposition is: the Dutch book theorem states that a given book of betting odds is susceptible to certain loss (i.e., it is a Dutch book) if and only if it does not correspond to a probability measure. Therefore, your claims that the odds are not expressible by a probability measure are logically equivalent to a claim that there exists a set of bets at the stated odds that makes the Bayesian a sure loser. There is, of course, no such set of bets, because the odds are indeed expressible as a probability measure on the actual outcome space.

      A way to counter this argument is: construct the set of bets that I assert doesn’t exist. The proof of the Dutch book theorem is constructive; all you have to do is use it to construct the set of bets that makes the Bayesian a sure loser no matter how the assessments of statistical adequacy turn out.
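
      For whatever it’s worth, here is a toy numerical version of that argument in Python. The joint measure below is just one illustrative choice consistent with the quoted 50-50 odds: its marginal adequacy probabilities are both 0.5, and every book of bets placed at those odds has expected payoff exactly zero under it, so no choice of stakes can make the bettor a sure loser.

          # One probability measure on the powerset of {N(0,1) adequate, N(1e-20,1) adequate}.
          # Nearly all mass sits on "both adequate" or "neither adequate"; the marginal
          # adequacy probabilities are nonetheless 0.5 each, matching the quoted odds.
          joint = {(1, 1): 0.499, (1, 0): 0.001, (0, 1): 0.001, (0, 0): 0.499}
          assert abs(sum(joint.values()) - 1.0) < 1e-12

          p_A = sum(p for (a, _), p in joint.items() if a == 1)   # Pr(N(0,1) adequate)
          p_B = sum(p for (_, b), p in joint.items() if b == 1)   # Pr(N(1e-20,1) adequate)
          print(p_A, p_B)                                         # 0.5 0.5

          # A book of bets at these odds: stake s_A on "N(0,1) adequate", s_B on the other.
          # The bettor's net payoff in outcome (a, b) is s_A*(a - p_A) + s_B*(b - p_B).
          for s_A, s_B in [(1.0, 1.0), (1.0, -1.0), (-3.0, 2.0)]:
              payoffs = {o: s_A * (o[0] - p_A) + s_B * (o[1] - p_B) for o in joint}
              expected = sum(joint[o] * payoffs[o] for o in joint)
              print(s_A, s_B, round(expected, 12), payoffs)
          # The expected payoff is zero for every book, so the payoffs can never be
          # negative in every outcome: there is no Dutch book against these odds.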

  15. Although I appreciate Corey’s clever example, I actually have simpler objections. Basically I agree (I think) with many of the general ideas of Davies’ approach and emphasis, but I don’t agree that it is impossible (or even difficult?) to take this sort of approach by building on the usual approaches. I think it would be more productive to look at how people have been doing this.

    For one example I happen to have lying around, consider these excerpts from Wood (2010) ‘Statistical inference for noisy nonlinear ecological dynamic systems’:

    “This sensitivity is inherited and amplified by the joint probability density of the observable data and the process noise, rendering it useless as the basis for obtaining measures of statistical fit. Because the joint density is the basis for the fit measures used by all conventional statistical methods[2], this is a major theoretical shortcoming. The inability to make well-founded statistical inferences about biological dynamic models in the chaotic and near-chaotic regimes, other than on an ad hoc basis, leaves dynamic theory without the methods of quantitative validation that are essential tools in the rest of biological science. Here I show that this impasse can be resolved in a simple and general manner, using a method that requires only the ability to simulate the observed data on a system from the dynamic model about which inferences are required. The raw data series are reduced to phase-insensitive summary statistics…

    …Naive methods of statistical inference try to make the model reproduce the exact course of the observed data in a way that the real system itself would not do if repeated…if statistical methods are to correspond with what is scientifically meaningful, it is necessary to judge model fit using statistics that reflect what is dynamically important in the data, and to discard the details of local phase. In itself, this idea is not new [6-10]. What is new is the ability to assess the consistency of statistics of the model simulations and data without recourse to ad hoc measures of that consistency, but in a way that instead gives access to much of the machinery of likelihood-based statistical inference.”

    From what I can tell there seem to be quite a few people working in the general direction of ‘approximate models’, giving more careful attention to which aspects (functionals) of the data (cf. Davies section 11.7) to emphasise, reducing the distinction between exploratory analysis and formal inference, etc., all while using likelihood/Bayes-based approaches or their variations. Why dismiss these so quickly?

  16. Here is the second edition of my book.

    In the first edition I wrote that if someone betted 50-50 on the N(0,1) distribution being an adequate model for the data then the same person would also bet about 50-50 on the N(10^-20,1) model also being adequate. I concluded from this that this could not be represented by a Bayesian prior. By this I meant a prior on the parameter space, although this was not explicitly stated. It was pointed out to me by Corey Yanofsky in his blog that one could represent the probabilities of adequacy by a probability distribution by looking at the power set of the models. This is how it is done for three models. The triple (0,1,0) corresponds to the first and third models not being adequate but the second model being adequate. One can specify a proper prior over this space of eight triples which represents the probabilities that the models are approximations. Once the data are obtained, the three models are tested for adequacy and the result will correspond to exactly one triple. The posterior is therefore the unit mass at this one point and zero elsewhere. This is a perfect Bayesian formulation with a prior, data and posterior. It is perhaps worth pointing out that the posterior is independent of the prior. In other words the prior has no influence on the statistical analysis.

    Ok I can accept that, namely that your position was consistent and that I did not understand it. I certainly didn’t. It only became clear to me when you wrote

    “You ask me what my odds are that Poisson(1) will be adequate for a future sample of large size.”

    Did you make this up or is this a quote? If it is a quote give me the reference so that I can claim complete loss of mind when I wrote it.

    “So you’ve proposed a betting situation as an argument about some property of Bayes (or lack thereof) but you can’t be bothered to state the scenario precisely. That’s just super.”

    I am sorry. I didn’t realize that there is a Bayesian police which controls and prescribes the manner in which people decide on their priors. Is it the case that every Bayesian has to describe the scenario exactly under which he/she decided upon the prior? I suppose the first requirement is no data peeping. As I said, that is your problem not mine, and I repeat:

    “All I said was, at least as far as my memory carries me, is that if(sic) your prior probability for N(0,1) is 0.5 your prior probability for N(10^-20,1) will also be about 0.5. This is a conditional statement given that someone, a Bayesian, states 0.5 as the prior probability for N(0,1). I don’t ask how he/she came to this assessment.”

    Your manner of representing the prior no doubt has a similar property. Once more this is your problem and nothing to do with me.

    “[y]our bets are not expressible by a probability measure” are strictly false.

    Back again to context. They are not expressible by a probability measure on the space of models. I find your insistence on this point very strange. Even when I tell you what I meant you come back again. I see no point in it.

    omaclaren: I suppose that it depends on what you call quickly. I do give reasons and these are, so to speak, general ones which are based on the properties of the likelihood. They are listed in 11.1. If you accept these it seems to me that it will be very difficult to base a concept of approximation on likelihood. Let me know which ones you don’t accept. Look at Table 1.2 and give me a good likelihood-based reason for not accepting the comb model. I went to the trouble of comparing the normal and comb models for all the data sets of

    Stigler, S. M. (1977), ‘Do robust estimators work with real data? (with discussion)’, Annals of Statistics, 5(6), 1055–1098.
    The results were the same. It was not just an accident of the copper data. Most people are not even aware of the problem. Given the copper data they will choose the normal model, use likelihood and then claim that it works very well. Little do they know that for the same example it can also work very badly. Likelihood has played such a central role in statistics that it is very difficult for people to realize that you don’t need it. The statistician I admire the most is Peter Huber. Read his book “Data Analysis”: all without likelihood, with perhaps one exception where I don’t think he needs it either. The word does not appear in the index.

    • My preference is that you leave my name out unless you’re also willing to include a URL pointing to this post, with a live link in any electronic version. If you’d rather not use

      https://itschancy.wordpress.com/2015/03/28/laurie-daviess-dutch-blooper/

      you can use

      http://bit.ly/1FdhzU7

      “Did you make this up or is this a quote?”

      I made it up as part of the scenario I was describing.

      “Is it the case that every Bayesian has to describe the scenario exactly under which he/she decided upon the prior?”

      The description has to extend to at least the sample space and sigma-algebra. Sometimes this is obvious from context; other times, not. I personally consider it necessary for a prior distribution to be justified in terms of the available prior information; subjective Bayesians like Jay Kadane disagree with this position.

      “I find your insistence on this point very strange. Even when I tell you what I meant you come back again. I see no point in it.”

      There’s a man named Joe Fernwright, a pot-healer by trade, who considers probability a muddy world and one in which he does not quite care to involve himself. His argument for this stance is quite curious. In the analysis of roulette outcomes, he’s principally concerned with whether the labels ‘red’ and ‘even’ are adequate descriptions for the result of a spin. He notes that we probabilists would quote odds of 18:19 (in Europe, or 9:10 in North America) in favor of ‘red’, and we’d quote the same odds for ‘even’. It’s clear, Joe says, that these odds are not representable by a probability distribution. From this Joe concludes that probabilists are unable to cope with the question of descriptional adequacy.

      I’ve tried to point out to Joe that the set {‘red’, ‘even’} is not the appropriate sample space, and that there is a probability measure for which the sample space is the powerset of {‘red’, ‘even’} that underlies the stated odds:

      Pr(‘red’ and ‘even’) = 9/37
      Pr(not ‘red’ and ‘even’) = 9/37
      Pr(‘red’ and not ‘even’) = 9/37
      Pr(not ‘red’ and not ‘even’) = 10/37

      Joe is perfectly correct when he asserts that the odds are not representable by a probability distribution on a sample space consisting of {‘red’, ‘even’}. It’s a debatable sort of correctness, though, because when we try to settle the bets we find that {‘red’, ‘even’} wasn’t actually the relevant sample space. In any event, I cannot help but feel that Joe’s argument tells us nothing about the usefulness of the probabilistic approach to the analysis of roulette. Joe finds my insistence on this point very strange.
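
      Spelling out the arithmetic behind those four numbers (this is one measure consistent with the quoted odds, not a claim about the actual roulette layout):

          # The four-point measure above, on the powerset of {'red', 'even'}.
          pr = {("red", "even"): 9/37, ("not red", "even"): 9/37,
                ("red", "not even"): 9/37, ("not red", "not even"): 10/37}
          print(sum(pr.values()))                                # 1.0 (up to rounding)
          print(pr[("red", "even")] + pr[("red", "not even")])   # Pr('red')  = 18/37
          print(pr[("red", "even")] + pr[("not red", "even")])   # Pr('even') = 18/37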

  17. Laurie:
    Thanks for the response. I’m genuinely trying to relate your approach to others I’ve seen, not be (overly) critical.

    I doubt I’m gonna win this argument, but a few points for now. Firstly, I’m not too worried about preserving any kind of strict likelihood approach etc. I’m just wondering to what extent any existing tools/approaches can be used/improved while accepting some points you make.

    I do think for example that model checks based on (say) comparing simulated data with actual data (e.g. using predictive distributions) are necessary and that only looking at likelihood doesn’t help with this. I have also worried about the disconnect between doing the inference (getting the likelihood say) and integrating this to do a check and wondered whether this is a ‘well-posed’ thing to do or whether they should be more closely linked. Hence me looking at your work, for example.

    Re: the copper example. I won’t have access to your book again until tomorrow but I imagine I wouldn’t be able to easily come up with a purely likelihood-based objection. Again, I’m simply wondering about the various ways people have tried to address the limitations of likelihood-based approaches (or similar) and where/how ‘regularizing’ ideas enter. I’m not questioning the need for regularization in the slightest, just whether others haven’t come up with different ways of doing it.

    • Laurie:
      To be more explicit – your approach reminds me a bit of ‘Approximate Bayesian Computation’ (ABC) or ‘likelihood-free inference’ approaches (the paper I linked above is in a similar vein), which do indeed try to work with metrics defined in the ‘data space’ rather than parameter space.

      While at first this was viewed as simply approximating the ‘true’ bayesian inference when the likelihood is difficult to calculate, it can also be given an interpretation (to some extent) as incorporating an additional layer of model or measurement error, and hence modifying (regularizing?) the original likelihood. The choice of appropriate summary statistics, how the ‘tolerance’ in the approximation is chosen and how this subsequently affects the inference etc etc are all on-going research topics.

      Personally I think this may be an interesting direction to relate your approach to.

    • At the risk of diluting my other comment about ABC etc. (which I am more interested in your response to) I’ll add another. With regard to the copper data and the necessity for regularization if working with densities – you don’t seem to discuss how priors can be used for this purpose. For example you mention choosing (again, if one wants to work with densities) the normal model on the basis of the Fisher information.

      Could this not be done in principle within a Bayesian framework with a sufficiently general model class for the likelihood and a prior selecting for low Fisher information (also presumably related to some sort of max-ent idea, but I’m a bit rusty without looking this stuff up)?

      [Again, I agree however that more attention should be paid to the ‘looks like the data’ part of analysis rather than just the parameter inference stage and that the connection between the two is often tenuous/could be better]

  18. Just a comment on your last post. I’ll try to get back to you later when I have had time to look at the articles you mentioned, but this may take some time. Suppose you decide to regularize by taking a minimum Fisher information model. There are several, depending on what you minimize over. The normal model minimizes over a given variance; the Huber distributions minimize over neighbourhoods. So far so good. Let us then take this model and use, say, maximum likelihood. This will give us for the normal model the mean and standard deviation, but whatever you minimize over you will get a location-scale functional. I now ask what the operating characteristics of this functional are, that is, how it will behave for different data sets. Both the mean and the standard deviation are sensitive to outliers, particularly the standard deviation. More precisely, the finite sample breakdown point is 1/n. This is, so to speak, an analytical property of the functionals. The Huber distributions are supposedly robust but this is not true: the scale part still has a breakdown point of 1/n.

    Does this matter? It depends. I was partly responsible for the formulation of DIN 38402-45:2003-09, German standard methods for the examination of water, waste water and sludge. The problem for that sort of data is that outliers abound. If you look at Stigler’s data sets I think three of them have outliers. In this situation I take the line of trying to construct a functional with the properties I would like it to have. This requires compromises but it can be done, and the resulting functional cannot be obtained from any location-scale model. Funnily enough maximum likelihood does work if you use t-distributions, even in the multidimensional case. This is work of John Kent, David Tyler and Richard Dudley. Why this should be so is a mystery to me.

    Does the Bayesian approach help here? Not as far as I can tell, as the scale parameter will still be sensitive to outliers. One of the many problems I have with the Bayesian approach is the difficulty of analysing its properties. A nice example of this is the paper by David Donoho et al., ‘Maximum entropy and the nearly black object’. Jaynes did some important work on this topic and got good results. Why did it work? It does not help to say it works because it was a full Bayesian analysis based on maximum entropy. As the authors say in their response to the discussants, if you want to understand it you need the ideas of Ben Logan or Werner Heisenberg rather than Thomas Bayes or Edwin Jaynes. So yes, regularization is important but it does not solve all your problems, even in the one-dimensional location-scale case.

    • “Does the Bayesian approach help here”

      A Bayesian analysis of something like a contaminated error model should work fine. (A contaminated error model is a two-group mixture model; the “contamination” group, i.e., the group of outliers, is given its own explicit model.)
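
      To give a flavour of what I mean, here is a minimal sketch with made-up toy data; it is a generic two-group error model, not the specific model used in the thesis. The contaminated-error likelihood barely reacts to the gross outliers, while the plain normal likelihood is dominated by them; a Bayesian analysis would combine the former with priors on the line, the scale and the mixing weight.

          import numpy as np
          from scipy import stats

          rng = np.random.default_rng(0)

          # Toy data: a linear trend plus three gross outliers (all numbers made up).
          x = np.linspace(0.0, 10.0, 40)
          y = 2.0 - 0.3 * x + rng.normal(0.0, 0.2, size=x.size)
          y[[5, 17, 30]] += np.array([9.0, -7.5, 12.0])          # contamination

          def loglik_plain(b0, b1, sigma):
              r = y - (b0 + b1 * x)
              return np.sum(stats.norm.logpdf(r, scale=sigma))

          def loglik_contaminated(b0, b1, sigma, w=0.1, scale_out=20.0):
              """Two-group mixture: 'good' errors ~ N(0, sigma^2); outliers get a
              much wider normal component with mixing weight w."""
              r = y - (b0 + b1 * x)
              dens = (1 - w) * stats.norm.pdf(r, scale=sigma) \
                   + w * stats.norm.pdf(r, scale=scale_out * sigma)
              return np.sum(np.log(dens))

          for name, ll in [("plain normal ", loglik_plain),
                           ("contaminated ", loglik_contaminated)]:
              print(name,
                    "true line:", round(ll(2.0, -0.3, 0.2), 1),
                    " washed-out fit:", round(ll(2.0, 0.0, 2.0), 1))
          # The plain normal likelihood prefers the washed-out fit; the contaminated
          # likelihood prefers the line that tracks the uncontaminated points.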

      Have a look at figure 5-3 (pg 92) of my PhD thesis (link). These data comprise two groups: a background of points widely and heavily scattered in the plot with no trend, and a gently-downward-sloping streak of interesting data points in the middle. The goal of the analysis was to fit a regression line to the linear-looking streak of points in a way that’s robust to the background points. The resulting posterior mean regression line and a posterior credible region can be seen in figure 5-6 (pg 105). Figure 5-8 (pg 106) shows how well the approach estimates the standard deviation parameter of the normal distribution used to model the errors.

      This was the first time I used the Bayesian approach for an actual analysis of data. So, either I’m a super-genius who accomplished the herculean task of robustifying a Bayesian analysis my first time out (nope), or creating robust Bayesian analyses just isn’t that hard.

  19. This is a change of topic. I take it that the last discussion has now ended, but before moving on, some final remarks.

    The word ‘prior’ has a very specific meaning in Bayesian statistics. It means a measure, usually a probability measure, over the space of models. I know of no Bayesian paper where anything else is meant, although there may be some exceptions. When it is used in an exceptional manner I assume that the authors will state this. Thus when I use the word prior I use it in its standard sense, that is, a measure on the model space. When I assert that there can be no Bayesian prior for adequacy this means that it is not possible to express bets on adequacy using a prior over the parameter space. This I take it is accepted. You could now have posted something on your blog entitled, for example, ‘Laurie Davies and betting on adequacy’. You could have pointed out that this cannot be done with a prior in the standard sense because of the lack of exclusion: the adequacy events are not mutually exclusive. However, if one is betting, the Dutch book argument shows that it must also be possible to represent bets on anything, and in particular on adequacy, by a probability distribution. This can indeed be done by taking a prior over the power set of the models etc.

    You choose however to entitle it as you did. I do not know what blooper means but it does not seem flattering. You went further, called my argument foolish and proceeded to tell me to shut up. We have never met, you have no reason for this aggressive attitude and lack of respect. In future I will give you credit for this and point out how it can be done. I will give your blog as reference as I have no other but I will not use your title. Let’s leave it at that.

    There was a spate of Bayesian papers in the 1980s and 1990s on outliers. They seemed to take the form of n observations with k1 outliers to the left and k2 to the right, with say Poisson priors for k1 and k2. The remaining observations could for example be modelled by a normal distribution N(mu,sigma^2). I forget what parameters they used for the outliers. They then did the standard Bayesian procedure and used some data sets and perhaps even simulations to show how well it worked. This always left me feeling dissatisfied. The priors for k1 and k2 seemed ad hoc, and the modelling of the individual outliers also seemed ad hoc. The properties of the procedure were never analysed. For example, in most applications the results should be affine equivariant for any location-scale parameters around. The posteriors and the identification of the outliers should be affine invariant. Now if your results for mu are affine equivariant the maximal finite sample breakdown point is 1/2. What is the finite sample breakdown point of the procedure? Huber has shown that the median minimizes the bias in the mixture model you propose. It actually minimizes it in larger Kolmogorov neighbourhoods. So I would like to know about the bias behaviour of the procedure. We could of course replace the N(mu,sigma^2) by my standard comb model. You repeat the analysis using the comb model. Are the results the same? No, they will not be. We could even replace the N(mu,sigma^2) model by a very close one (in the Kolmogorov metric, say within 10^-10 Kolmogorov distance of the N(0,1) model) but without a Lebesgue density. Your approach then becomes complicated to say the least. So somehow one has to regularize, say by minimum Fisher information, and so justify the normal model.

    I am always suspicious of densities and mixtures. One example is given by Andrew Gelman in 3.1 “Finite mixture models” of his Bayesian EDA paper. For me the surprising thing about this example is that it is perfectly innocuous, a fifty-fifty mixture of two normals, nothing in the slightest pathological. The density based maximum likelihood fails and so he claims but I have never checked this. A Bayesian approach, also of course density based. I am sceptical about densities but even I was surprised by this example. All you have to do is work at the distribution function level and choose the parameters to minimize the Kuiper distance between the empirical and model distribution functions. This cannot go wrong and if your data happen not to be describable by the model this approach will tell you.

    That is it for today but I will come back and comment on your Ph.D. thesis. If you wish to continue this I suggest you choose another title for these postings.

    • I’m satisfied to suspend the discussion of the previous topic at this point. I’ve retitled the post per your suggestion; unfortunately I don’t know how to change the permalink URL, but you can always use the bitly link (http://bit.ly/1FdhzU7). [ETA: I sorted it out.] (“Blooper” means “A clumsy or embarrassing mistake, especially one made in public; a faux pas.”) The reason for my aggressive attitude is that I believe that the repetition of your argument is leaving people with a false understanding of what Bayes can and can’t do. As to respect: let me emphasize that my respect for you as a statistician and for the vast majority of your work is a separate affair from my respect for the specific line of argument that is the subject of the post.

    • “For me the surprising thing about this example is that it is perfectly innocuous, a fifty-fifty mixture of two normals, nothing in the slightest pathological. The density based maximum likelihood fails and so he claims but I have never checked this. A Bayesian approach, also of course density based.”

      Yeah, if you set the mean of one of the components equal to a data point then the likelihood grows without bound as the variance of the component goes to 0. (It’s akin to maximum likelihood over the mean and variance of a normal model when there’s only a single datum.) But while this is a problem for likelihood-based approaches, in the Bayesian approach with a proper prior there’s no difficulty: even if the posterior density has singularities on the boundary of the support, it’s still guaranteed to be proper. The MAP estimator is still garbage, but then, the MAP estimator isn’t strictly Bayesian (according to Christian Robert, anyway). Even better is if the prior bounds the variance parameters away from 0 strongly enough to counteract the growth of the likelihood.
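
      A quick numerical illustration of both halves of that claim, with made-up data and an inverse-gamma prior chosen only because it vanishes fast enough at zero:

          import numpy as np
          from scipy import stats

          rng = np.random.default_rng(1)
          y = rng.normal(0.0, 1.0, size=10)

          def loglik(mu1, s1, mu2=0.0, s2=1.0):
              """Log-likelihood of a 50-50 mixture of N(mu1, s1^2) and N(mu2, s2^2)."""
              dens = 0.5 * stats.norm.pdf(y, loc=mu1, scale=s1) \
                   + 0.5 * stats.norm.pdf(y, loc=mu2, scale=s2)
              return np.sum(np.log(dens))

          mu1 = y[0]                       # park one component's mean on a data point
          for s1 in [1.0, 1e-2, 1e-4, 1e-8, 1e-12]:
              ll = loglik(mu1, s1)
              lp = stats.invgamma.logpdf(s1 ** 2, a=2.0, scale=1.0)   # prior on s1^2
              print(f"s1 = {s1:7.0e}   loglik = {ll:8.2f}   loglik + log prior = {ll + lp:12.4g}")
          # For small s1 the log-likelihood grows like -log(s1), without bound, but the
          # log prior density falls off much faster, so the (proper) posterior density
          # has no singularity at s1 = 0.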

    • Re: change of direction – my bad. [Not at all! — C.]

      With that said, how about this:
      – Construct a generative model m for your data.

      – Then, use this as a latent variable model with an additional lossy measurement process (excluding outliers say) and error model. Something like Yi = Ti(m) + ei.

      One should be able to set this up in a full hierarchical Bayes framework, with priors on m and e’s parameters, etc. T corresponds to your functionals of interest and e corresponds to a tolerance for accepting models as adequate.

  20. coreyyanofsky: I was suggesting a new thread with a different heading as we are no longer talking about bets.

    I had read your Gelman quotation, I just hadn’t checked it. It was unfair to write “he claims” as he did give reasons. At the level of distribution functions the problem is perfectly well behaved. However, for some reason many people insist on differentiating, end up with a density and rename it likelihood. The differential operator is pathologically discontinuous, so they can expect problems and shouldn’t be surprised when they get them, although they usually are. The practice is so widespread and accepted that people think you weird if you don’t do it, even to the point of rejecting your papers. One way of trying to avoid problems is to regularize. This is what is often done but rarely stated. In Gelman’s example the regularization is already there, two normal distributions. The mixture is 50-50 so you don’t even have to worry about the mixing parameter. Things couldn’t be better, but if you want trouble you stand a good chance of getting it if you differentiate. And you do get trouble and you have to think of a way of repairing things and one way of doing for the Bayesian is say to put a non-zero lower limit on the scale parameters. I am not sure what the maximum likelihooders do.

    omaclaren: that is much too general for me. Give me a simple example.

    • Shifts of topic in blogpost comment threads just go with the territory; I’m happy enough to let the conversation meander. (If this were a forum or a mailing list it would make sense to start a new heading.)

      I don’t think I’m familiar with this use of the term “regularization”. The usage I’m familiar with is in the sense of regularization of (possibly ill-posed) inverse problems, e.g. Tikhonov regularization, also known to statisticians as ridge regression. Is there a quick reference you can give me that describes in more detail what you’re referring to when you use the term?

      To say that Bayesian analysis is likelihood-based is to mistake form for substance in some sense. Bayesian analysis is conditionalization-based. What we’re really after is the posterior measure, which we obtain by conditionalization of the joint prior measure; the density form of Bayes’ Theorem is just that: a theorem that characterizes the posterior measure. (“Joint” here means the prior is jointly over unobserved and observed quantities; the unobserved quantities are typically parameters, but in the de Finetti approach they’re unobserved yet observable quantities, e.g., future data.)

    • “you have to think of a way of repairing things and one way of doing for the Bayesian is say to put a non-zero lower limit on the scale parameters”

      I should also mention that when I wrote of the prior bounding the variance parameters away from 0 strongly enough to counteract the growth of the likelihood, what I had in mind was a continuous prior density that vanishes at 0 at a rate fast enough to counteract the growth of the likelihood function near 0, such that the posterior density is free of singularities. The posterior distribution would still be supported over the entire set of positive reals.

    • For a better idea of what I’m talking about I’d recommend for example:

      Wood (2010) ‘Statistical inference for noisy nonlinear ecological dynamic systems’. Nature.

      Wilkinson (2013) ‘Approximate Bayesian computation (ABC) gives exact results under the assumption of model error’. Statistical Applications in Genetics and Molecular Biology.

      Marin et al (2012) ‘Approximate Bayesian computational methods’. Statistics and Computing.

      But to try to re-phrase:
      if you think in terms of an optimisation-style setup with a fit function (statistics: likelihood) and a penalty function corresponding to additional constraints (statistics: prior) then I believe some of your objections relate to the choice of fit and some to the penalty. This is why it can be difficult to give one simple example – the focus of the objection can shift from one to another if not careful.

      To address the prior part:
      As discussed, many useful regularization methods can be re-interpreted in terms of the choice of prior – see for example Gelman’s comments to Donoho et al. You seem to agree with their response, which effectively says that while you can do various types of regularization with priors, this isn’t the best way of understanding ‘why’ these priors work. This may be fair (or may depend on taste?), but it still means there is a framework in which you can set up the regularized problem and use existing methods to solve it. To me this is valuable. I agree though that it is also good to have other mathematical or physical reasons for understanding which prior to choose and why it works.

      On the other hand some of your objections relate more directly to the choice of fit function (likelihood).

      This is where the question of whether you can modify the standard likelihood function setup to address these arises (to me). In particular can we operate more carefully in ‘data space’ to better capture the idea of ‘approximation’ or ‘adequacy’? Again, there are a couple of parts here.

      Firstly, there is the idea of adding an additional layer of error (measurement or model) to the model. If the model predicts data d, then add an error term eps before comparing to observed data d0. This is the ‘modelling the errors’ part that I think corresponds to what you and Corey mention. It amounts to something like introducing a latent variable and hence modifying the likelihood to include it. You might not like the approach but it exists.

      I’ll discuss some more modifications below, but, as shown for example by Wilkinson (2013; mentioned above), this idea of adding a measurement or model error is related to the ABC idea of replacing exact conditioning on the given data with approximate conditioning on the given data in the Bayesian framework, i.e. accept models which predict data d satisfying |d-d0| < tol for the given data d0. So this does in fact correspond to an idea of approximation in data space. Despite ABC etc. being seen as in some ways a ‘new paradigm’, the basic ideas seem to have been around for a while, depending on where you look and how explicitly you want it stated. The more novel ideas are arguably those covered next.

      You might object that this is still not a good fit function. For example, Wood (2010) illustrates how noisy this still is for a relatively simple system that nonetheless exhibits complex dynamics. He argues:

      'The problem with the conventional approaches is actually philosophical. Naive methods of statistical inference try to make the model reproduce the exact course of the observed data in a way that the real system itself would not do if repeated'.

      The solution Wood adopts is to construct a ‘synthetic likelihood’, i.e. use summary statistics (essentially the data features you emphasise in your approach) in the likelihood function instead of the full data. This is also what is most commonly used in ABC methods, and corresponds in general (assuming the statistics are not sufficient) to a ‘lossy’ filtering process. It turns out that this is actually better in many cases because (in my interpretation) it is often easier to remove unwanted noise than to try to model it in a possibly ad hoc manner.

      So now we arrive at what I mentioned above. If you let me change notation a little to make it more explicit: we have a model m with parameters theta predicting data d, i.e. m(theta) = d. However, before comparing to data we consider the key features of interest, defined as functionals T of the (predicted or observed) data. These are our summary statistics. We don’t expect the summary statistics to match exactly, but if we have chosen well then we should hope that they have a simple noise structure. This means we can compare T(m(theta)) = T(d) to T(d0) via a likelihood-style fit function. When solving via sampling we could do something simple like accept with a probability depending on the difference |T(d) – T(d0)|, as in the sketch below.
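
      To make that concrete, here is a bare-bones rejection-ABC sketch; the model, the summary statistics T and the tolerance are toy choices for illustration, not recommendations.

          import numpy as np

          rng = np.random.default_rng(2)

          # 'Observed' data from an assumed toy data-generating process.
          d0 = rng.normal(3.0, 1.0, size=100)

          def T(d):
              """Summary statistics: the data features we choose to match
              (median and interquartile range, picked only for illustration)."""
              q75, q25 = np.percentile(d, [75, 25])
              return np.array([np.median(d), q75 - q25])

          def simulate(theta, n=100):
              return rng.normal(theta, 1.0, size=n)

          tol = 0.2                                            # |T(d) - T(d0)| < tol
          prior_draws = rng.uniform(-10.0, 10.0, size=20000)   # a flat prior, for simplicity
          accepted = [th for th in prior_draws
                      if np.linalg.norm(T(simulate(th)) - T(d0)) < tol]

          # The accepted draws approximate the posterior given the summaries (up to tol).
          print(len(accepted), round(np.mean(accepted), 2), round(np.std(accepted), 2))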

      So to summarise – many standard regularization methods have an interpretation in terms of priors. While the idea of ‘adequacy’ is best formulated in ‘data space’, this can be captured to some extent by modifying the likelihood function appropriately. Furthermore, the predicted data d is then effectively another parameter from the Bayesian perspective, and when you follow this all through it can be set up as a ‘standard’ Bayesian inference problem. You might object – fairly – that this doesn’t necessarily illustrate ‘why’ all these choices work (which summary statistics do we choose? etc.) but the point is that there is now a standard language in which to express your assumptions and which comes with a collection of useful computational tools.

      • Thanks for that! It does indeed seem like a similar way of operationalising some of these sorts of ideas…though I’m not too familiar with all the details. Wood does mention the ‘method of simulated moments’ for example.

  21. I can’t give you a quick reference but I discuss various forms in my book. Tikhonov regularization is a form of regularization, as you say. If you have a linear problem Ax = y with A not invertible there are many ways of regularizing. One in vogue at the moment is the L_0 solution, that x with the fewest non-zero coefficients. This is difficult, but there is much work by Donoho, Candès and others who show that very often the minimum L_1 solution is also the L_0 solution. I have a modest contribution in this direction together with a Ph.D. student, minimizing the number of non-zero interactions in ANOVA, JASA 2012 or 2013 or something. Minimum Fisher information is another form, minimum total variation in non-parametric regression, deconvolution. Also penalized likelihood (not my favourite) but that is I suppose a special case of Tikhonov regularization. Shape regularization also comes to mind, as does regularization in multiscale problems, L_0 solutions for piecewise constant regression functions. I don’t think there is a book which covers all these topics in depth although most of them are mentioned in the book.
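
    For a concrete feel for the L_1-for-L_0 point, here is a toy comparison of Tikhonov (L_2) and L_1 regularization of an underdetermined Ax = y; the dimensions and penalty weights are illustrative only, not anyone’s recommended settings.

        import numpy as np
        from sklearn.linear_model import Lasso, Ridge

        rng = np.random.default_rng(3)

        # Ill-posed toy problem: 50 equations, 200 unknowns, only 5 non-zero coefficients.
        n, p, k = 50, 200, 5
        A = rng.normal(size=(n, p))
        x_true = np.zeros(p)
        x_true[:k] = rng.normal(0.0, 3.0, size=k)
        y = A @ x_true + rng.normal(0.0, 0.1, size=n)

        # Two regularizations of Ax = y: Tikhonov (L_2 penalty) and an L_1 penalty,
        # the latter standing in for the combinatorially hard L_0 problem.
        ridge = Ridge(alpha=1.0).fit(A, y)
        lasso = Lasso(alpha=0.05, max_iter=10000).fit(A, y)

        print("non-zero coefficients, ridge:", int(np.sum(np.abs(ridge.coef_) > 1e-6)))
        print("non-zero coefficients, lasso:", int(np.sum(np.abs(lasso.coef_) > 1e-6)))
        # Ridge keeps essentially all 200 coefficients non-zero; the L_1 fit returns a
        # sparse vector, which is the sense in which L_1 serves as a surrogate for L_0.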

    • It was the mention of minimum Fisher information as way of specifying a model that piqued my interest.

      I wouldn’t say I’m familiar with Donoho and Candès’s work on compressed sensing, but I’m aware of it. For sparse signals, Nicholas Polson and James Scott have done a lot of work on the Bayesian side; if you like assessing estimators by their risk functions, you’ll like what they’ve done. As far as I’m aware, the Bayesian state of the art for sparse signals is the horseshoe+ estimator (link). As for non-Bayesian approaches, I find that whenever I see a regularized objective function I get the urge to take its exponential and marginalize out the regularization parameter with respect to some prior…

      I wonder if you are aware of the work of Jorma Rissanen? Like you, he rejects the idea of basing statistical inference on the assumption of the truth of some model, but from the perspective of mainstream frequentist statistics his Minimum Description Length approach is a good deal more radical than yours. The definitive reference here is a massive tome by Peter Grünwald entitled The Minimum Description Length Principle (link). (I’m proud to say that I made it through the entire book even if I did skip most of the proofs.)

  22. To answer all your points would require a paper of some length. However, let us start with MDL. I know Peter Grünwald quite well. When he was writing his Ph.D. under Richard Gill we had some correspondence about MDL. We have had workshops together; for a time I was visiting Eindhoven once a week working at Eurandom and saw him quite regularly. I am now living in Berlin and the contact has broken off. In spite of our different views we always got on well with each other. I also know Jorma Rissanen but relations here are somewhat strained. He once asked what I thought of MDL and I replied that I thought it was all wrong. He didn’t take it kindly. After that I was told that he often referred to me as “that man”.

    It is a mathematical theorem that a prefix code corresponds to a probability measure (model) and vice versa. So far so good. So given data I use a probability model to encode them and send them to you. You can of course do nothing with this until you decode it, and for that you need to know my model. One possibility is that we agree about it before we start, so let us consider this case. I have the data and decide to encode it using N(3.152,1.1293^2). We meet for lunch, I tell you N(3.152,1.1293^2), we return to our workplaces and I send you the encoded data. Next time I use N(16.256,3.457^2). Same procedure. After a while one of us has the brilliant idea of not meeting for lunch, which is in terms of money and time an expensive way of reducing the description length, and instead simply tagging the values of mu and sigma on to the end. Then I get a set of data I want to encode using the chi^2 with a parameter. No problem, you say, just add it to the library and tag on the parameter as before. After some time you have a big library and we must think of encoding the models. The MDL dogma is a uniform distribution, but any sensible person would look at the frequencies with which the models are used and encode correspondingly.

    One day I ring you up and tell you I have some data which involves a function as a parameter, non-parametric regression. It is clear how to proceed. I determine the function, let us say a kernel estimate, not that I ever do this, and encode the residuals using a normal model as already discussed. But how do I encode the function? At this point the story becomes interesting. I was working on densities with Arne Kovac and we were regularizing in a Kuiper ball using the taut string. It didn’t work too well, and one day when Peter Grünwald was in Essen I asked him if MDL could help. I never got an answer, but he may simply have forgotten; I have since come to the conclusion that there is no good MDL answer. Arne Kovac hit on the idea of increments of a high order Kuiper metric and that was what we published.

    The other story is about Rissanen. He was working on wavelets and some of the wavelet coefficients had to be set to zero. The problem was how to encode this subset. The MDL dogma was to treat all subsets equally. At this time I wasn’t exactly working with Rissanen but he visited the SFB in Dortmund and there was direct contact with him. We tried it out and the result was bad. The reconstructed function looked very noisy and we sent him the result. Rissanen could not let this stand and eventually one of his coworkers came up with the “solution”. First of all you encode the size of the subset and then encode the subsets of this size. This worked but it contradicted the MDL dogma. It was like a Bayesian prior favouring small subsets.
    Non-parametric regression became important to me. In the mid 1980s I had decided I didn’t like frequentist ideas, that I didn’t like Bayes, and that I didn’t like all this talk of truth. I wrote a paper called “Data Features” which got a record number of bad referees’ reports and rejections but was eventually published in Statistica Neerlandica when Richard Gill (see above) was editor. He also got negative reports but published it in any case. He very clearly has a mind of his own. If you don’t know him, look at his web page. He is also an accomplished ornithologist, as is Frank Hampel. An early version was sent to Tukey who wrote three articles in reply; David Brillinger mentioned one of them in his Tukey obituary in AoS. Being negative does not help even if you are correct, you have to come up with something new. The something new I decided to try was non-parametric regression. Wavelets were already around but the reconstructions often did not have the “right” shape such as monotonicity. I decided to try and come up with a method that controlled the number of peaks. The result was the taut string, together with Arne Kovac. The taut string is piecewise constant and doesn’t control convexity/concavity. You can do this as well and you can make the function smooth, the limits being set by the linear programming problem that you can solve. But now you have the problem of encoding all these different functions. If you take MDL seriously then the MDL principle itself should tell you which function to take. It may well be piecewise constant. If you say however that you want something with a continuous second derivative and the right shape, you will have to overrule the MDL principle.

    Suppose we overcome all these problems and now have a huge library of models with new ones coming in daily. For small data sets it may now be cheaper in terms of description length just to send the data as they are, which is not very helpful for a statistician. At this stage the lunch option is starting to look attractive again and I think this is essentially what the MDL people do. Rissanen is also aware of the problem but I can’t give you a reference at the moment. He has written a book and it is in there somewhere. There are however further objections to the library solution. All the models are treated equally but they are not all equally complex. The uniform distribution requires two numbers a and b and the calculation of 1/(b-a). The normal distribution with two numbers mu and sigma is much more complex, requiring the calculation of the exponential function. It is clear where I am now going, namely Kolmogorov complexity, which is where Rissanen started. You forget about libraries of models and just write a computer programme, send it to the recipient who then runs it. I think this is indeed a brilliant idea and so far the only really good definition of complexity which is around. I am not sure of the history but I think others, Solomonoff and Chaitin, had the same idea even before Kolmogorov. The problem is of course which computer and language. Asymptotically it makes no difference but practically it does. There are other problems like outliers, stability, regularization, etc., but that suffices for now.

    • This is very interesting history! I think that treating everything uniformly is more Rissanen dogma than MDL dogma; as you note, other MDL researchers aren’t so insistent on this point. Grünwald’s been doing work on “luckiness functions” in an MDL setting; these allow the code to trade off good performance on some sequences for bad performance on others in a way that maintains the various MDL guarantees. (It’s actually possible to do something similar with confidence intervals; that idea goes back to Pratt in the ’60s.) And Bayesian predictive densities can be used to generate universal codes, ironically enough given Rissanen’s derogation of Bayes. I’ll bet that you could replace “[i]t was like a Bayesian prior favouring small subsets” with “it was mathematically equivalent to a Bayesian prior favouring small subsets”.

      “The reconstructed function looked very noisy and we sent him the result. Rissanen could not let this stand…”

      I always find it striking when a human brain’s evaluation of some algorithmic result is unfavorable. It makes me wonder what algorithm that brain is implementing…

      In our discussion above I actually linked to a page about Solomonoff’s work on algorithmic complexity when I wrote that I would exclude non-computable probability measures as possible models. (The default style of links on this blog is rather subtle — just a thin underline in white — so maybe it escaped your notice.) I’m rather fond of that work, in good part because Solomonoff was seeking (and found!) the ultimate Bayesian prior… (modulo the universal Turing machine, as you note).

      Richard Gill’s abstract of his paper on the Two Envelopes puzzle made me chuckle.

  23. There is a nice proof of the existence etc. of minimum Fisher information distributions in Huber, and in Huber and Ronchetti. A minimum Fisher information distribution must have an absolutely continuous density. Jaynes has put forward a case for maximum entropy. For discrete distributions this makes sense and maximum entropy distributions can be calculated. Moreover, if one makes a 1-1 transformation the entropy remains unchanged, as it should. As far as I have understood Jaynes, he emphasizes that limiting operations should only be done with great care. Now if one does a limiting operation to obtain the entropy of a Gaussian distribution the result is surely infinite. If you ignore this and simply define it by the integral analogue of the discrete case it is no longer invariant to 1-1 transformations. Is there something I have not understood?

    • Yup. Jaynes argues that in the continuous case the correct quantity is a KL-divergence-like expression where the second argument is not a probability density per se but rather characterizes the limiting operation. Here’s a little blurb on the subject.
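
      The expression in question, as I understand Jaynes, is

          H[p] = - \int p(x) \, \log \frac{p(x)}{m(x)} \, dx ,

      where m(x) is the “invariant measure” that characterizes the limiting discretization. Because p and m pick up the same Jacobian under a one-to-one change of variables, this expression is invariant in the way the naive differential entropy is not.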

      I don’t know of any good ideas about how this limiting operation is supposed to happen in real problems, so I’m personally leery of the continuous case.

  24. Rényi is good on information theory; see the appendix to his book where he does the limiting operation for the continuous case. The KL discrepancy does not have the correct invariance properties in the continuous case to deserve to be connected with information theory. That is one reason for rejecting KL in the continuous case. Another is the classical Bernoulli central limit theorem for sums of i.i.d. Bernoulli random variables. These are discrete, while the limit, the normal distribution, is continuous, so the standard KL divergence is ∞. Why this is still taken seriously is beyond me. The discrete case is different and does make some sort of sense.

    • I wonder what you consider to be the “correct” invariance properties; presumably you mean something different from the equality case of Theorem 2.5 of these notes: link. I agree with you about not taking the continuous case of Jaynes’s maximum entropy prescription seriously, but for quite different reasons, I suspect.

  25. vl said:

    Way late to the party, and I realize this isn’t the main point of the posting, but I have to comment on this:

    “in social science it’s notorious that every variable is correlated with every other variable, at least a little bit. I imagine that this makes Pearl-style causal inference a big pain — all of the causal graphs would end up totally connected, or close to”

    I’m surprised you would make this point. A consequence of Pearl-ian concepts such as “back-door paths” is how connectivity induces correlations. You only need a sparsely connected graph to obtain correlations everywhere, i.e. correlations being dense does not mean that causality is not sparse.

    • You have uncovered the shallowness of my understanding of Pearl-style causal inference. If I had had time to develop the idea, I might have figured that out eventually. I wish I had time to study up…
