Wolfgang Schwarz :: Counterexamples to Good's Theorem

Counterexamples to Good's Theorem

Posted on Tuesday, 19 Oct 2021.

Good (1967) famously "proved" that the expected utility of an informed decision is always at least as great as the expected utility of an uninformed decision. The conclusion is clearly false. Let's have a look at the proof and its presuppositions.

Suppose you can either perform one of the acts A₁…A_n now, or learn the answer to some question E and afterwards perform one of A₁…A_n. Good argues that the second option is always at least as good as the first. The supposed proof goes as follows.

We assume a Savage-style formulation of decision theory. For any act A_j, the expected utility of choosing A_j is

(1)\quad EU(A_{j}) = \sum_{i} Cr(S_{i}) U(A_{j} \land S_{i}),

where S_i ranges over a suitable partition of states. The value of choosing A_j after learning E_k (an answer to E) is

(2)\quad EU(A_{j}/E_{k}) = \sum_{i} Cr(S_{i}/E_{k}) U(A_{j} \land S_{i}).

Here we assume that the truth of E_k does not affect the value of choosing A_j in state S_i, so that U(A_j ∧ S_i ∧ E_k) = U(A_j ∧ S_i).

Since Cr(S_i) = ∑_k Cr(S_i/E_k) Cr(E_k), we can rewrite (1) as

(3)\quad EU(A_{j}) = \sum_{i}\sum_{k} Cr(S_{i}/E_{k}) Cr(E_{k}) U(A_{j} \land S_{i}).

If you choose between A₁…A_n now, you will choose an act A_j that maximises EU. Thus, where 'N' stands for choosing between A₁…A_n now,

\[\begin{align*} (4)\quad EU(N) &= \max_{j} \sum_{i}\sum_{k} Cr(S_{i}/E_{k}) Cr(E_{k}) U(A_{j} \land S_{i})\\ &= \max_{j} \sum_{k} Cr(E_{k}) EU(A_{j}/E_{k}). \end{align*} \]

What if instead you choose to first learn ('L') the answer to E? Let's assume that you will afterwards choose an act that maximises your posterior expected utility. You don't know which act that is, but we can compute its expected value by averaging over the possible learning events:

(5)\quad EU(L) = \sum_{k} Cr(E_{k}) \max_{j} EU(A_{j}/E_{k}).

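Now compare (4) and (5). EU(N) is the maximum of a weighted average; EU(L) is the corresponding average of the maxima. The average of the maxima is always at least as great as the maximum of the average: whichever act A_j maximises the average, its value EU(A_j/E_k) is, for each E_k, no greater than max_j EU(A_j/E_k). QED.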
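To see the comparison between (4) and (5) at work, here is a minimal numerical sketch in Python. The function name `good_values` and all the numbers (a two-state prior, the likelihoods of the two answers, the utilities) are made up for illustration; none of them come from Good's paper. The sketch simply computes EU(N) and EU(L) as the proof defines them.

```python
def good_values(prior, likelihood, utility):
    """Return (EU(N), EU(L)) for a Savage-style decision problem.

    prior[i]         = Cr(S_i)
    likelihood[i][k] = Cr(E_k / S_i)
    utility[j][i]    = U(A_j & S_i)
    """
    n_states = len(prior)
    n_answers = len(likelihood[0])

    # EU(N): the best prior expected utility, as in (1) and (4).
    eu_now = max(
        sum(prior[i] * row[i] for i in range(n_states)) for row in utility
    )

    # EU(L): average over the answers E_k of the best posterior
    # expected utility, as in (5).
    eu_learn = 0.0
    for k in range(n_answers):
        cr_e = sum(prior[i] * likelihood[i][k] for i in range(n_states))
        if cr_e == 0:
            continue  # an answer with zero credence contributes nothing
        posterior = [prior[i] * likelihood[i][k] / cr_e for i in range(n_states)]
        best = max(
            sum(posterior[i] * row[i] for i in range(n_states)) for row in utility
        )
        eu_learn += cr_e * best
    return eu_now, eu_learn


# Hypothetical example: two equiprobable states, two acts that each pay 1
# in "their" state, and an answer to E that is 80% reliable about the state.
prior = [0.5, 0.5]
likelihood = [[0.8, 0.2], [0.2, 0.8]]
utility = [[1.0, 0.0], [0.0, 1.0]]

eu_now, eu_learn = good_values(prior, likelihood, utility)
print(eu_now, eu_learn)  # roughly 0.5 and 0.8
```

In this toy case EU(L) exceeds EU(N). The sketch builds in the assumptions of the proof (utilities that ignore E_k, learning by conditionalisation), so it cannot be used to test whether those assumptions are satisfied in a given scenario.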

As noted, the argument assumes that the answer to E is certain to make no difference to the value of performing any act A_j in any state S_i, so that U(A_j ∧ S_i ∧ E_k) = U(A_j ∧ S_i). The argument also assumes that if you choose L then your future self is certain to follow the standard norms of Bayesian rationality: she updates by conditionalisation and maximises expected utility. Let's grant these assumptions.

Even so, the conclusion is not always true. Four counterexamples:

Crime Novel. You have a choice between reading a crime novel and reading a biography. You prefer the crime novel because you like the suspense. Before you make your choice, you have the option of finding out who the villain is in the crime novel (by reading a plot summary), which would spoil the novel for you. After getting the information, you would rather read the biography. You rationally prefer the uninformed choice between the two books over the more informed choice. (Adapted from Bradley and Steele 2016.)

Rain. You have a choice between taking a black box and taking a white box. Before you make your choice you may look through the window and check if it is raining. A reliable predictor has put $1 into the black box and $0 into the white box iff she predicted that you would look outside the window. If she predicted that you would not look outside the window, she has put $2 into the white box and $0 into the black box. You are 50% confident that you were predicted to look outside the window. You rationally prefer the uninformed choice.

Here is why (assuming CDT). If you don't look outside the window, you will take the white box, with expected payoff $1, compared to $0.50 for the black box. If you do look outside the window, you will become confident that you were predicted to look outside the window; as a result, you will take the black box, with expected payoff $0.50.
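The arithmetic can be sketched in a few lines of Python. The posterior credence of 0.9 after looking is an assumed number (the scenario only says that you become confident), and the names `expected_payoff` and `best_box` are hypothetical. On my reading of the reasoning above, each option is evaluated with your present 50% credence about the prediction, since the content of the boxes does not causally depend on what you do.

```python
# Payoffs in dollars, indexed by (box taken, what the predictor predicted).
PAYOFF = {
    ("black", "predicted_look"): 1.0,
    ("white", "predicted_look"): 0.0,
    ("black", "predicted_no_look"): 0.0,
    ("white", "predicted_no_look"): 2.0,
}
BOXES = ("black", "white")


def expected_payoff(box, cr_look):
    """Expected payoff of a box, given credence cr_look that looking was predicted."""
    return (cr_look * PAYOFF[(box, "predicted_look")]
            + (1 - cr_look) * PAYOFF[(box, "predicted_no_look")])


def best_box(cr_look):
    """The box that maximises expected payoff relative to credence cr_look."""
    return max(BOXES, key=lambda box: expected_payoff(box, cr_look))


PRIOR = 0.5                    # from the scenario
POSTERIOR_AFTER_LOOKING = 0.9  # assumption: "confident" is read as 0.9

box_if_not_looking = best_box(PRIOR)                 # white
box_if_looking = best_box(POSTERIOR_AFTER_LOOKING)   # black

# Both options are evaluated from your present (prior) point of view.
value_of_not_looking = expected_payoff(box_if_not_looking, PRIOR)
value_of_looking = expected_payoff(box_if_looking, PRIOR)
print(value_of_not_looking, value_of_looking)  # roughly 1.0 and 0.5
```

Any posterior above 2/3 would do in place of 0.9: that is the point at which the black box overtakes the white box.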

Middle Knowledge. Once again you have a choice between taking a black box and taking a white box. A psychologist has figured out what you are disposed to do in this kind of choice situation, where you have no evidence that favours one of the boxes over the other. She has put $1 into whichever box she thinks you would take, and $0 into the other box. Unbeknownst to the psychologist, I have observed what she put into the boxes. I slip you a piece of paper on which I claim to have written down the colour of the box with the $1. You are 90% confident that what I have written down is true. You rationally prefer not to read my note.

Here is why (assuming CDT). If you don't read my note, you can expect to get $1. If you do read my note, you will take the box whose colour is written on the note. Since there's a 90% chance that this box contains $1, the expected payoff is $0.90.
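As a quick check of these numbers, here is a small Python sketch. It assumes that the psychologist is perfectly accurate about your disposition (the scenario does not say so explicitly, but this is what "you can expect to get $1" suggests); the variable names are hypothetical.

```python
PRIZE = 1.0                  # dollars in the box the psychologist picked
NOTE_RELIABILITY = 0.9       # from the scenario: credence that the note is true
PSYCHOLOGIST_ACCURACY = 1.0  # assumption, not stated in the scenario

# Without the note, you take the box you are disposed to take,
# which is where the psychologist has put the money.
value_without_note = PSYCHOLOGIST_ACCURACY * PRIZE

# After reading the note, you compare the box it names with the other box.
eu_named_box = NOTE_RELIABILITY * PRIZE
eu_other_box = (1 - NOTE_RELIABILITY) * PRIZE
value_with_note = max(eu_named_box, eu_other_box)

print(value_without_note, value_with_note)  # roughly 1.0 and 0.9
```

The comparison of `eu_named_box` with `eu_other_box` is why, having read the note, you follow it, even though you would have done better by not reading it.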

Newcomb Revelation. You are facing the standard Newcomb Problem. Before you make your choice, you have the option of looking inside the opaque box. The predictor knew that you would be given this offer, and has factored your response into her prediction. EDT says that you should reject the offer.

This is only a counterexample to Good's Theorem if we assume EDT. But like CDT, EDT can be formulated in Savage's framework. We only have to stipulate that the states in a properly formulated decision problem are probabilistically independent of the acts. In Newcomb's Problem, a suitable partition of states is { prediction accurate, prediction inaccurate }. Good's "proof" does not seem to rely on a causal construal of the states.

To be fair, one might argue that this case violates the assumption that U(A_j ∧ S_i ∧ E_k) equals U(A_j ∧ S_i). But this isn't the only problem. Consider equation (5) in the above proof.

(5)\quad EU(L) = \sum_{k} Cr(E_{k}) \max_{j} EU(A_{j}/E_{k}).

In Newcomb Revelation, this says that EU(L) = Cr(full) EU(two-box/full) + Cr(empty) EU(two-box/empty), assuming that conditional on either observation, two-boxing maximises expected utility. But suppose you (as the agent in Newcomb Revelation) are convinced that you will one-box and that you will reject the offer to look inside the opaque box. Then Cr(full) is close to 1. And evidently EU(two-box/full) is $1,001,000. By (5), the expected utility of looking inside the opaque box is therefore close to $1,001,000. That's clearly wrong.
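To make the problem concrete, here is a small Python sketch with assumed numbers. The standard Newcomb payoffs ($1,000,000 in the opaque box, $1,000 in the transparent box) are presupposed; the credences 0.99 and 0.01 are illustrative assumptions, and the function names are hypothetical. The sketch contrasts what (5) delivers with what you should expect conditional on actually looking, given that the predictor has factored in your response to the offer.

```python
MILLION = 1_000_000
THOUSAND = 1_000


def payoff_two_box(box_full):
    """Payoff of two-boxing, depending on whether the opaque box is full."""
    return THOUSAND + (MILLION if box_full else 0)


def value_if_two_boxing(cr_full):
    """Cr(full) EU(two-box/full) + Cr(empty) EU(two-box/empty)."""
    return cr_full * payoff_two_box(True) + (1 - cr_full) * payoff_two_box(False)


# Assumption: you are nearly sure that you will decline the offer and one-box,
# so your unconditional credence that the box is full is high.
CR_FULL = 0.99
# Assumption: conditional on looking (and then two-boxing), the predictor
# has probably foreseen this, so the box is probably empty.
CR_FULL_GIVEN_LOOKING = 0.01

by_equation_5 = value_if_two_boxing(CR_FULL)
conditional_on_looking = value_if_two_boxing(CR_FULL_GIVEN_LOOKING)
print(round(by_equation_5), round(conditional_on_looking))  # 991000 11000
```

With these numbers, (5) values looking at roughly $991,000, whereas conditional on looking you can only expect about $11,000. The gap arises because (5) weights the answers by your unconditional credence Cr(E_k) rather than by your credence on the supposition that you look.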

More generally, equation (5) in the above proof simply isn't an application of Savage-style decision theory. It is a hand-wavy shortcut.

Skyrms (1990) argues that one can patch up Good's proof if one assumes that the states are causal dependency hypotheses, but his argument still looks hand-wavy to me, and the other counterexamples suggest that it is fallacious.

Let's see how far we can get if we use a suppositional formulation of decision theory.

Let { O_i } be a partition of "value-level propositions" as in (Lewis 1981). Intuitively, the members of this partition settle everything the agent ultimately cares about. In suppositional formulations of decision theory, the expected utility of an act A is given by

EU(A) = \sum_{i} Cr^{A}(O_{i}) V(O_{i}),

where Cr^A(O_i) is the probability of O_i on the supposition A. The relevant type of supposition might be "indicative" (yielding EDT) or "subjunctive" (yielding CDT).

Now let's evaluate the two options. First, you might choose directly between A₁…A_n, without first learning the answer to E. This is the option we called N. We assume that your future self will choose an option that maximises expected utility, and that your basic (uncentred) desires don't change. Let's also assume that these assumptions are resilient under suppositions. Thus we can assume that on the supposition that you choose N you will afterwards choose an option from A₁…A_n that maximises (posterior) expected utility. This suggests that the expected utility of N is

(1')\quad EU(N) = \max_{j} EU^{N}(A_{j}),

where EU^N(A_j) is the expected utility of A_j computed relative to Cr^N. I'll return to this assumption below. Let's stick with it for now. By definition,

(2')\quad EU^{N}(A_{j}) = \sum_{i} (Cr^{N})^{A_{j}}(O_{i}) V(O_{i}).

Since (Cr^N)^A_j(O_i) = ∑_k (Cr^N)^A_j(O_i/E_k)(Cr^N)^A_j(E_k), we can expand (2') into

\[\begin{align*} (3')\quad EU^{N}(A_{j}) &= \sum_{i}\sum_{k} (Cr^{N})^{A_{j}}(O_{i}/E_{k})(Cr^{N})^{A_{j}}(E_{k}) V(O_{i})\\ &= \sum_{k} (Cr^{N})^{A_{j}}(E_{k}) \sum_{i}(Cr^{N})^{A_{j}}(O_{i}/E_{k}) V(O_{i})\\ &= \sum_{k} (Cr^{N})^{A_{j}}(E_{k}) EU^{N}(A_{j}/E_{k}), \end{align*} \]

where EU^N(A_j/E_k) is defined as ∑_i (Cr^N)^A_j(O_i/E_k) V(O_i). Plugging this into (1'), we have

\[\begin{align*} (4')\quad EU(N) &= \max_{j} \sum_{k} (Cr^{N})^{A_{j}}(E_{k}) EU^{N}(A_{j}/E_{k})\\ &= \max_{j} \sum_{k} (Cr^{N})^{A_{j}}(E_{k}) \sum_{i}(Cr^{N})^{A_{j}}(O_{i}/E_{k}) V(O_{i}). \end{align*} \]

Alternatively, you might delay your choice between A₁…A_n until after you've learned the answer to E. To begin, we have

(5')\quad EU(L) = \sum_{i} Cr^{L}(O_{i}) V(O_{i}).

As before, probability theory allows expanding and rearranging:

\[\begin{align*} (6')\quad EU(L) &= \sum_{i}\sum_{k} Cr^{L}(O_{i}/E_{k}) Cr^{L}(E_{k}) V(O_{i})\\ &= \sum_{k} Cr^{L}(E_{k}) \sum_{i} Cr^{L}(O_{i}/E_{k}) V(O_{i}). \end{align*} \]

∑_i Cr^L(O_i/E_k) V(O_i) is the "desirability" of E_k from the perspective of Cr^L, understood as in (Jeffrey 1983). We assume that Cr^L is concentrated on worlds at which you are going to choose an act from A₁…A_n that maximises expected utility after learning the true answer to E. We also assume that the answer to E is all you learn. The desirability of E_k from the perspective of Cr^L then equals max_j EU^L(A_j / E_k), where EU^L(A_j / E_k) is the expected utility of A_j computed relative to the probability function (Cr^L)_{E_k} that comes from Cr by first supposing L and then conditioning on E_k. Plugging this into (6') yields

\[\begin{align*} (7')\quad EU(L) &= \sum_{k} Cr^{L}(E_{k}) \max_{j} EU^{L}(A_{j} / E_{k})\\ &= \sum_{k} Cr^{L}(E_{k}) \max_{j} \sum_{i} ((Cr^{L})_{E_{k}})^{A_{j}}(O_{i}) V(O_{i}). \end{align*} \]

(4') and (7') resemble (4) and (5) in Good's proof. (4') is the maximum of an average, (7') is the average of some maxima. But the subscripts and superscripts are different. To infer that L is at least as good as N, we need the following two assumptions:

\[\begin{align*} (i) \quad&(Cr^{N})^{A_{j}}(E_{k}) = Cr^{L}(E_{k}), \text{ for all }A_{j}, E_{k}.\\ (ii) \quad&(Cr^{N})^{A_{j}}(O_{i}/E_{k}) = ((Cr^{L})_{E_{k}})^{A_{j}}(O_{i}), \text{ for all }O_{i}, A_{j}, E_{k}. \end{align*} \]

Continuing to use superscripts for supposition and subscripts for conditioning, we can rewrite (ii) as

(ii)\quad ((Cr^{N})^{A_{j}})_{E_{k}}(O_{i}) = ((Cr^{L})_{E_{k}})^{A_{j}}(O_{i}),\text{ for all }O_{i}, A_{j}, E_{k}.

In EDT, the relevant kind of supposition is conditioning, so we can simplify:

\[\begin{align*} (i_{E})\quad &Cr(E_{k} / N \land A_{j}) = Cr(E_{k} / L),\text{ for all }A_{j}, E_{k}.\\ (ii_{E})\quad &Cr(O_{i} / N \land A_{j} \land E_{k}) = Cr(O_{i} / L \land A_{j} \land E_{k}),\text{ for all }O_{i}, A_{j}, E_{k}. \end{align*} \]

Condition (i_E) is violated in Newcomb Revelation. Here the probability of the opaque box being empty is low conditional on one-boxing without peeking, but it is high conditional on peeking.

CDT does not allow simplifying the two conditions, at least not without further assumptions.

(i) is fairly easy to understand. It says that the probability of the various answers E_k does not "causally" depend on your choice(s). This is violated in the Rain scenario.

(ii) is hard to understand. In normal cases, however, the order of the operations will make little difference. So we can approximately paraphrase (ii) as follows:

You are as likely to get a certain amount of utility by choosing A_j after finding out E_k as by choosing A_j without finding out E_k.

(Here 'without finding out E_k' is meant to imply, as it does in English, that E_k is true.) This condition is obviously violated in the Crime Novel case.

Unfortunately, my "proof" still relies on some further assumptions, besides the assumptions of diachronic rationality.

One assumption was smuggled into (1'):

(1')\quad EU(N) = \max_{j} EU^{N}(A_{j}).

In effect, this assumes that $EU(N) = EU(N \land \hat{A})$, where $\hat{A}$ is an act that maximises expected utility on the supposition that N. Without this assumption, I don't know how to get the proof off the ground. In EDT, the assumption is harmless, but in CDT it can fail. It fails in Middle Knowledge.

Another problematic assumption in both my "proof" and in Good's is that the possible propositions you might learn form a partition. To see why this matters, return to the Crime Novel scenario.

Let's construe the relevant states as somewhat coarse-grained "dependency hypotheses". If you plan to not learn about the plot then most of your credence goes to a state S₁ in which the act of reading the crime novel would bring about a highly desirable experience while the act of reading the biography would bring about a moderately desirable experience. If E_k is a summary of the crime novel's plot, then most of your credence conditional on E_k still goes to S₁. Your enjoyment depends on not knowing the villain, but not on who the villain is. So Cr(S₁/E_k) is high, for all relevant E_k. After you've learned E_k, however, Cr(S₁) is low. You no longer believe that reading the crime novel would be a great experience.

Since your credence in S₁ after the learning event is not equal to your prior Cr(S₁/E_k), finding out about the plot is not adequately modelled as conditioning on E_k. The problem is that if you find out about the plot, you not only learn E_k, but also that you know E_k. It is this knowledge (or belief) that breaks the connection between reading the crime novel and having a great experience. Conditional on knowing or believing E_k, your credence in S₁ is low.

Since we want to model your learning event in terms of conditioning, we have to make sure that the propositions { E_k } include everything you might learn if you chose L. In the Crime Novel case, each member of { E_k } should specify (a) a plot and (b) that you believe that this is the plot. But then { E_k } no longer forms a partition: its members are not jointly exhaustive. Every element of { E_k } now implies that you won't enjoy the novel because you think you already know the villain's identity.

There is nothing special here about the Crime Novel case. In realistic cases, the answers to E will never form a partition, if we assume that learning the answer goes by conditioning on the answer.

Bradley, Seamus, and Katie Steele. 2016. “Can Free Evidence Be Bad? Value of Information for the Imprecise Probabilist.” Philosophy of Science 83 (1): 1–28.

Good, Irving John. 1967. “On the Principle of Total Evidence.” The British Journal for the Philosophy of Science 17 (4): 319–21.

Jeffrey, Richard. 1983. The Logic of Decision. 2nd ed. Chicago: University of Chicago Press.

Lewis, David. 1981. “Causal Decision Theory.” Australasian Journal of Philosophy 59: 5–30.

Skyrms, Brian. 1990. “The Value of Knowledge.”

Comments

# Matthew Adelstein on 06 April 2023, 02:48

In the crime novel case, for example, can't we just describe it as a different act? The act in the first instance is reading a crime novel and thus discovering a new fact in a fun way. But if the experience of reading the novel is the same, learning it would not be bad.
