On Functional Decision Theory
I recently refereed Eliezer Yudkowsky and Nate Soares's "Functional Decision Theory" for a philosophy journal. My recommendation was to accept resubmission with major revisions, but since the article had already undergone a previous round of revisions and still had serious problems, the editors (understandably) decided to reject it. I normally don't publish my referee reports, but this time I'll make an exception because the authors are well-known figures from outside academia, and I want to explain why their account has a hard time gaining traction in academic philosophy. I also want to explain why I think their account is wrong, which is a separate point.
I actually don't know Nate Soares, but Eliezer Yudkowsky is a celebrity in the "rationalist" community. Many of his posts on the Less Wrong blog are gems. I also enjoyed his latest book, Inadequate Equilibria. Yudkowsky seems to be interested in almost everything, but he regards decision theory as his main area of research. I also work in decision theory, but I've always struggled with Yudkowsky's writings on this topic.
Before I explain what I found wrong with the paper, let me review the main idea and motivation behind the theory it defends.
Standard lore in decision theory is that there are situations in which it would be better to be irrational. Three examples.
Blackmail. Donald has committed an indiscretion. Stormy has found out and considers blackmailing Donald. If Donald refuses and blows Stormy's gaff, she is revealed as a blackmailer and his indiscretion becomes public; both suffer. It is better for Donald to pay hush money to Stormy. Knowing this, it is in Stormy's interest to blackmail Donald. If Donald were irrational, he would blow Stormy's gaff even though that would hurt him more than paying the hush money; knowing this, Stormy would not blackmail Donald. So Donald would be better off if he were (known to be) irrational.
Prisoner's Dilemma with a Twin. Twinky and her clone have been arrested. If they both confess, each gets a 5-year prison sentence. If both remain silent, they can't be convicted and only get a 1-year sentence for obstructing justice. If one confesses and the other remains silent, the one who confesses is set free and the other gets a 10-year sentence. Neither cares about what happens to the other. Here, confessing is the dominant act, and mutual confession is the unique Nash equilibrium. So if Twinky and her clone are rational, they'll each spend 5 years in prison. If they were irrational and remained silent, they would get away with 1 year.
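The dominance and equilibrium claims can be checked mechanically. Here is a minimal Python sketch, assuming that each year in prison counts as one unit of disutility (any utility that strictly decreases with one's own sentence gives the same verdict); the helper names are illustrative only.

```python
from itertools import product

ACTS = ("confess", "silent")

# (Twinky's act, clone's act) -> (Twinky's years, clone's years), from the example.
YEARS = {
    ("confess", "confess"): (5, 5),
    ("silent", "silent"): (1, 1),
    ("confess", "silent"): (0, 10),
    ("silent", "confess"): (10, 0),
}


def utility(own_act: str, other_act: str) -> int:
    """Purely selfish utility: minus one's own prison years."""
    own_years, _ = YEARS[(own_act, other_act)]
    return -own_years


def strictly_dominates(a: str, b: str) -> bool:
    """True if act a is better than act b whatever the other player does."""
    return all(utility(a, o) > utility(b, o) for o in ACTS)


def nash_equilibria() -> list:
    """Pure-strategy profiles where neither player gains by deviating alone.
    The game is symmetric, so the clone's utility is utility(clone_act, twinky_act)."""
    eqs = []
    for mine, theirs in product(ACTS, ACTS):
        mine_ok = all(utility(mine, theirs) >= utility(alt, theirs) for alt in ACTS)
        theirs_ok = all(utility(theirs, mine) >= utility(alt, mine) for alt in ACTS)
        if mine_ok and theirs_ok:
            eqs.append((mine, theirs))
    return eqs


print(strictly_dominates("confess", "silent"))  # True
print(nash_equilibria())  # [('confess', 'confess')]
```

Note that mutual silence, which gives each prisoner only 1 year, is not an equilibrium: either prisoner would gain by switching to confession.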
Newcomb's Problem with Transparent Boxes. A demon invites people to an experiment. Participants are placed in front of two transparent boxes. The box on the left contains a thousand dollars. The box on the right contains either a million or nothing. The participants can choose between taking both boxes (two-boxing) and taking just the box on the right (one-boxing). If the demon has predicted that a participant one-boxes, she has put a million dollars into the box on the right. If she has predicted that a participant two-boxes, she has put nothing into the box. The demon is very good at predicting, and the participants know this. Each participant is only interested in getting as much money as possible. Here, the rational choice is to take both boxes, because you are then guaranteed to get $1000 more than if you one-box. But almost all of those who irrationally take just one box end up with a million dollars, while most of those who rationally take both boxes leave with $1000.
The driving intuition behind Yudkowsky and Soares's paper is that decision theorists have been wrong about these (and other) cases: in each case, the supposedly irrational choice is actually rational. Whether a pattern of behaviour is rational, they argue, should be measured by how good it is for the agent. In Newcomb's Problem with Transparent Boxes, one-boxers fare better than two-boxers. So we should regard one-boxing as rational. Similarly for the other examples. Standard decision theories therefore get these cases wrong. We need a new theory.
Functional Decision Theory (FDT) is meant to be that theory. FDT recommends blowing the gaff in Blackmail, remaining silent in Prisoner's Dilemma with a Twin, and one-boxing in Newcomb's Problem with Transparent Boxes.
Here's how FDT works, and how it differs from the most popular form of decision theory, Causal Decision Theory (CDT). Suppose an agent faces a choice between two options A and B. According to CDT, the agent should evaluate these options in terms of their possible consequences (broadly understood). That is, the agent should consider what might happen if she were to choose A or B, and weigh the possible outcomes by their probability. In FDT, the agent should not consider what would happen if she were to choose A or B. Instead, she ought to consider what would happen if the right choice according to FDT were A or B.
Take Newcomb's Problem with Transparent Boxes. Without loss of generality, suppose you see $1000 in the left box and a million in the right box. If you were to take both boxes, you would get a million and a thousand. If you were to take just the right box, you would get a million. So Causal Decision Theory says you should take both boxes. But let's suppose you follow FDT, and you are certain that you do. You should then consider what would be the case if FDT recommended one-boxing or two-boxing. These hypotheses are not hypotheses just about your present choice. If FDT recommended two-boxing, then any FDT agent throughout history would two-box. And, crucially, the demon would (probably) have foreseen that you would two-box, so she would have put nothing into the box on the right. As a result, if FDT recommended two-boxing, you would probably end up with $1000. To be sure, you know that there's a million in the box on the right. You can see it. But according to FDT, this is irrelevant. What matters is what would be in the box relative to different assumptions about what FDT recommends.
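The contrast between the two ways of evaluating the options can be put into numbers. The Python sketch below is a toy model, not the paper's method: since the paper does not say how to compute probabilities under the counterpossible supposition, the function `fdt_style_value` simply assumes that supposing "FDT recommends X" means the demon predicted X with probability `accuracy`. Both that parameter and the helper names are hypothetical.

```python
SMALL, BIG = 1_000, 1_000_000  # left box; possible content of the right box


def payoff(act: str, right_box_full: bool) -> int:
    """Money received for an act, given what is in the right-hand box."""
    if act not in ("one-box", "two-box"):
        raise ValueError(f"unknown act: {act!r}")
    right = BIG if right_box_full else 0
    return right + (SMALL if act == "two-box" else 0)


def cdt_value(act: str, seen_full: bool) -> int:
    """CDT-style evaluation: the visible contents are held fixed."""
    return payoff(act, seen_full)


def fdt_style_value(act: str, accuracy: float) -> float:
    """Hypothetical model of supposing 'FDT recommends act': the demon then
    predicted `act` with probability `accuracy` and filled the right box
    only if she predicted one-boxing. The visible contents play no role."""
    p_full = accuracy if act == "one-box" else 1 - accuracy
    return p_full * payoff(act, True) + (1 - p_full) * payoff(act, False)


for act in ("one-box", "two-box"):
    print(act, cdt_value(act, seen_full=True), fdt_style_value(act, accuracy=0.99))
```

With accuracy 0.99, the CDT-style values favour two-boxing ($1,001,000 against $1,000,000), while the FDT-style values favour one-boxing (about $990,000 against about $11,000). In this toy model the FDT-style value of a policy also equals the average winnings of participants who always follow it, which is the sense in which one-boxers "fare better"; one-boxing comes out ahead on that measure whenever the accuracy exceeds 0.5005.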
To spell out the details, one would now need to specify how to compute the probability of various outcomes under the subjunctive supposition that FDT recommended a certain action. Yudkowsky and Soares are explicit that the supposition is to be understood as counterpossible: we need to suppose that a certain mathematical function, which in fact outputs A for input X, instead were to output B. They do not explain how to compute the probability of outcomes under such a counterpossible supposition. So we don't get any details spelled out. This is flagged as the main open question for FDT.
It is not obvious to me why Yudkowsky and Soares choose to model the relevant supposition as a mathematical falsehood. For example, why not let the supposition be: I am the kind of agent who chooses A in the present decision problem? That is an ordinary contingent (centred) proposition, since there are possible agents who do choose option A in the relevant problem. These agents may not follow FDT, but I don't see why that would matter. For some reason, Yudkowsky and Soares assume that an FDT agent is certain that she follows FDT, and this knowledge is held fixed under all counterfactual suppositions. I guess there is a reason for this assumption, but they don't tell us.
Anyway. That's the theory. What's not to like about it?
For a start, I'd say the theory gives insane recommendations in cases like Blackmail, Prisoner's Dilemma with a Twin, and Newcomb's Problem with Transparent Boxes. Suppose you have committed an indiscretion that would ruin you if it should become public. You can escape the ruin by paying $1 once to a blackmailer. Of course you should pay! FDT says you should not pay because, if you were the kind of person who doesn't pay, you likely wouldn't have been blackmailed. How is that even relevant? You are being blackmailed. Not being blackmailed isn't on the table. It's not something you can choose.
[Clarifications: (a) I assume that the blackmailer is not infallible at predicting how you will react. Even if you follow FDT you might therefore find yourself in this situation. (b) You are rationally certain that you will never find yourself again in a similar situation, so that your act is useless as a signal to potential future blackmailers.]
Admittedly, that's not much of an objection. I say you'd be insane not to pay the $1, Yudkowsky and Soares say you'd be irrational to pay. Neither of us can prove that their judgement is right from neutral premises.
What about the fact that FDT agents do better than (say) CDT agents? I admit that if this were a fact, it would be somewhat interesting. But it's not clear if it is true.
First, it depends on how success is measured. If you face the choice between submitting to blackmail and refusing to submit (in the kind of case we've discussed), you fare dramatically better if you follow CDT than if you follow FDT. If you are in Newcomb's Problem with Transparent Boxes and see a million in the right-hand box, you again fare better if you follow CDT. Likewise if you see nothing in the right-hand box.
So there's an obvious sense in which CDT agents fare better than FDT agents in the cases we've considered. But there's also a sense in which FDT agents fare better. Here we don't just compare the utilities scored in particular decision problems, but also the fact that FDT agents might face other kinds of decision problems than CDT agents. For example, FDT agents who are known as FDT agents have a lower chance of getting blackmailed and thus of facing a choice between submitting and not submitting. I agree that it makes sense to take these effects into account, at least as long as they are consequences of the agent's own decision-making dispositions. In effect, we would then ask what decision rule should be chosen by an engineer who wants to build an agent scoring the most utility across its lifetime. Even then, however, there is no guarantee that FDT would come out better. What if someone is set to punish agents who use FDT, giving them choices between bad and worse options, while CDTers are given great options? In such an environment, the engineer would be wise not to build an FDT agent.
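The engineer's comparison can be made concrete with a toy model of Blackmail. The sketch below is hypothetical throughout: the payoffs, the `accuracy` parameter and the blackmailer's rule (blackmail exactly when she predicts that the target would pay) are assumptions for illustration, not taken from the paper. It computes the ex-ante value of a fixed policy of paying and of a fixed policy of refusing.

```python
def policy_value(policy: str, accuracy: float, hush_money: float, ruin: float) -> float:
    """Ex-ante expected utility for the target of a fixed response policy.

    Hypothetical model: the blackmailer blackmails exactly when she predicts
    that the target would pay, and her prediction is correct with probability
    `accuracy`. Not being blackmailed is worth 0, paying costs `hush_money`,
    and refusing once blackmailed costs `ruin` (the indiscretion goes public).
    """
    if policy == "pay":
        p_blackmailed = accuracy        # correctly predicted to pay
        loss = hush_money
    elif policy == "refuse":
        p_blackmailed = 1 - accuracy    # wrongly predicted to pay
        loss = ruin
    else:
        raise ValueError(f"unknown policy: {policy!r}")
    return -p_blackmailed * loss


# Cheap blackmail, as in the $1 case: paying scores more even ex ante.
print(policy_value("pay", 0.99, 1, 10_000), policy_value("refuse", 0.99, 1, 10_000))
# Expensive blackmail: a known refuser now scores more.
print(policy_value("pay", 0.99, 5_000, 10_000), policy_value("refuse", 0.99, 5_000, 10_000))
```

Even in this simple setting, which policy the engineer should pick depends on the parameters of the environment, not only on the decision rule.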
Moreover, FDT does not in fact consider only consequences of the agent's own dispositions. The supposition that is used to evaluate acts is that FDT in general recommends that act, not just that the agent herself is disposed to choose the act. This leads to even stranger results.
Procreation. I wonder whether to procreate. I know for sure that doing so would make my life miserable. But I also have reason to believe that my father faced the exact same choice, and that he followed FDT. If FDT were to recommend not procreating, there's a significant probability that I wouldn't exist. I highly value existing (even miserably existing). So it would be better if FDT were to recommend procreating. So FDT says I should procreate. (Note that this (incrementally) confirms the hypothesis that my father used FDT in the same choice situation, for I know that he reached the decision to procreate.)
In Procreation, FDT agents have a much worse life than CDT agents.
[Edit: Another potential case in which it's hard to see any sense in which FDT agents do better than CDT agents is suggested in this comment below.]
All that said, I agree that there's an apparent advantage of the "irrational" choice in cases like Blackmail or Prisoner's Dilemma with a Twin, and that this raises an important issue. The examples are artificial, but structurally similar cases arguably come up a lot, and they have come up a lot in our evolutionary history. Shouldn't evolution have favoured the "irrational" choices?
Not necessarily. There is another way to design agents who refuse to submit to blackmail and who cooperate in Prisoner Dilemmas. The trick is to tweak the agents' utility function. If Twinky cares about her clone's prison sentence as much as about her own, remaining silent becomes the dominant option in Prisoner's Dilemma with a Twin. If Donald develops a strong sense of pride and would rather take Stormy down with him than submit to her blackmail, refusing to pay becomes the rational choice in Blackmail.
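The effect of such a tweak on Prisoner's Dilemma with a Twin can be checked with a small Python sketch; the `care` weight and the helper names are illustrative assumptions. With the sentences from the example, equal concern makes silence weakly dominant: if the clone stays silent, silence is strictly better for Twinky, and if the clone confesses, both of Twinky's options yield ten prison-years in total.

```python
ACTS = ("confess", "silent")

# (Twinky's act, clone's act) -> (Twinky's years, clone's years), from the example.
YEARS = {
    ("confess", "confess"): (5, 5),
    ("silent", "silent"): (1, 1),
    ("confess", "silent"): (0, 10),
    ("silent", "confess"): (10, 0),
}


def utility(own_act: str, other_act: str, care: float) -> float:
    """Minus prison years, counting the clone's years with weight `care`
    (0 = purely selfish, 1 = cares about the clone as much as about herself)."""
    own, other = YEARS[(own_act, other_act)]
    return -(own + care * other)


def weakly_dominates(a: str, b: str, care: float) -> bool:
    """a is never worse than b, and strictly better against some act of the clone."""
    diffs = [utility(a, o, care) - utility(b, o, care) for o in ACTS]
    return all(d >= 0 for d in diffs) and any(d > 0 for d in diffs)


print(weakly_dominates("silent", "confess", care=0.0))  # False: selfish Twinky
print(weakly_dominates("silent", "confess", care=1.0))  # True
```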
FDT agents rarely find themselves in Blackmail scenarios. Neither do CDT agents with a vengeful streak. If I wanted to design a successful agent for a world like ours, I would build a CDT agent who cares what happens to others. My CDT agent would still two-box in Newcomb's Problem with Transparent Boxes (or in the original Newcomb Problem). But this kind of situation practically never arises in worlds like ours.
The story I'm hinting at has been well told by others. I'd especially recommend Brian Skyrms's Evolution of the Social Contract and chapter 6 of Simon Blackburn's Ruling Passions.
So here's the upshot. Whether FDT agents fare better than CDT agents depends on the environment, on how "faring better" is measured, and on what the agents care about. Across their lifetime, purely selfish agents might do better, in a world like ours, if they followed FDT. But that doesn't persuade me that the insane recommendations of FDT are correct.
So far, I have explained why I'm not convinced by the case for FDT. I haven't explained why I didn't recommend the paper for publication. That I'm not convinced is not a reason. I'm rarely convinced by arguments I read in published papers.
The standards for deserving publication in academic philosophy are relatively simple and self-explanatory. A paper should make a significant point, it should be clearly written, it should correctly position itself in the existing literature, and it should support its main claims by coherent arguments. The paper I read sadly fell short on all these points, except the first. (It does make a significant point.)
Here, then, are some of the complaints from my referee report, lightly edited for ease of exposition. I've omitted several other complaints concerning more specific passages or notation from the paper.
A popular formulation of CDT assumes that to evaluate an option A we should consider the probability of various outcomes on the subjunctive supposition that A were chosen. That is, we should ask how probable such-and-such an outcome would be if option A were chosen. The expected utility of the option is then defined as the probability-weighted average of the utility of these outcomes. In much of their paper, Yudkowsky and Soares appear to suggest that this is exactly how expected utility is defined in FDT. The disagreement between CDT and FDT would then boil down to a disagreement about what is likely to be the case under the subjunctive supposition that an option is chosen.
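In symbols (the notation is introduced here for illustration and is not the paper's), write $Cr_A$ for the agent's credence function revised by subjunctively supposing that option A is chosen, and $U$ for the agent's utility function. With $o$ ranging over the possible outcomes, this formulation of CDT defines:

```latex
EU(A) = \sum_{o} Cr_{A}(o)\, U(o)
```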
For example, consider Newcomb's Problem with Transparent Boxes. Suppose (without loss of generality) that the right-hand box is empty. CDT says you should take both boxes because if you were to take only the right-hand box you would get nothing whereas if you were to take both boxes, you would get $1000. According to FDT (as I presented it above, and as it is presented in parts of the paper), we should ask a different question. We should ask what would be the case if FDT recommended one-boxing, and what would be the case if FDT recommended two-boxing. For much of the paper, however, Yudkowsky and Soares seem to assume that these questions coincide. That is, they suggest that you should one-box because if you were to one-box, you would get a million. The claim that you would get nothing if you were to one-box is said to be a reflection of CDT.
If that's really what Yudkowsky and Soares want to say, they should, first, clarify that FDT is a special case of CDT as conceived for example by Stalnaker, Gibbard & Harper, Sobel, and Joyce, rather than an alternative. All these parties would agree that the expected utility of an act is a matter of what would be the case if the act were chosen. (Yudkowsky and Soares might then also point out that "Causal Decision Theory" is not a good label, given that they don't think the relevant conditionals track causal dependence. John Collins has made essentially the same point.)
Second, and more importantly, I would like to see some arguments for the crucial claim about subjunctive conditionals. Return once more to the Newcomb case. Here's the right-hand box. It's empty. It's a normal box. Nothing you can do has any effect on what's in the box. The demon has tried to predict what you will do, but she could be wrong. (She has been wrong before.) Now, what would happen if you were to take that box, without taking the other one? The natural answer, by the normal rules of English, is: you would get an empty box. Yudkowsky and Soares instead maintain that the correct answer is: you would find a million in the box. Note that this is a claim about the truth-conditions of a certain sentence in English, so facts about the long-run performance of agents in decision problems don't seem relevant. (If the predictor is highly reliable, I think a "backtracking" reading can become available on which it's true that you would get a million, as Terry Horgan has pointed out. But there's still the other reading, and it's much more salient if the predictor is less reliable.)
Third, later in the paper it transpires that FDT can't possibly be understood as a special case of CDT along the lines just suggested because in some cases FDT requires assessing the expected utility of an act by looking exclusively at scenarios in which that act is not performed. For example, in Blackmail, not succumbing is supposed to be better because it decreases the chance of being blackmailed. But any conditional of the form 'if the agent were to do A, then the agent would do A' is trivially true in English.
Fourth, in other parts of the paper it is made clear that FDT does not instruct agents to suppose that a certain act were performed, but rather to suppose that FDT always were to give a certain output for a certain input.
I would recommend dropping all claims about subjunctive conditionals involving the relevant acts. The proposal should be that the expected utility of act A in decision problem P is to be evaluated by subjunctively supposing not A, but the proposition that FDT outputs A in problem P. (That's how I presented the theory above.) The proposal then wouldn't rely on implausible and unsubstantiated claims about English conditionals.
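In symbols (the notation is illustrative, not the paper's), with $Cr_{S}$ for the agent's credence function revised by subjunctively supposing a proposition S and $U$ for utility, the suggested proposal reads:

```latex
EU_{\mathrm{FDT}}(A, P) = \sum_{o} Cr_{[\mathrm{FDT}(P) = A]}(o)\, U(o)
```

Formally, the only change from the subjunctive formulation of CDT is the supposed proposition: that FDT outputs A in P, rather than that A is chosen. How $Cr_{[\mathrm{FDT}(P) = A]}$ is to be defined when that proposition is mathematically false is exactly what the paper leaves open.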
[I then listed several passages that would need to be changed if the suggestion is adopted.]
I'm worried that so little is said about how subjunctive probabilities are supposed to be revised when supposing that FDT gives a certain output for a certain decision problem. Yudkowsky and Soares insist that this is a matter of subjunctively supposing a proposition that's mathematically impossible. But as far as I know, we have no good models for supposing impossible propositions.
Here are three more specific worries.
First, mathematicians are familiar with reductio arguments, which appear to involve impossible suppositions. "Suppose there were a largest prime. Then there would be a product x of all these primes. And then x+1 would be prime. And so there would be a prime greater than all primes." What's noteworthy about these arguments is that whenever B is mathematically derivable from A, then mathematicians are prepared to accept 'if A were the case then B would be the case', even if B is an explicit contradiction. (In fact, that's where the proof usually ends: "If A were the case then a contradiction would be the case; so A is not the case.")
If that is how subjunctive supposition works, FDT is doomed. For if A is a mathematically false proposition, then anything whatsoever mathematically follows from A. (I'm ignoring the subtle difference between mathematical truth and provability, which won't help.) So then anything whatsoever would be the case on a counterpossible supposition that FDT produces a certain output for a certain decision problem. We would get 'If FDT recommended two-boxing in Newcomb's Problem, then the second box would be empty', but also 'If FDT recommended two-boxing in Newcomb's Problem, then the second box would contain a million', and 'If FDT recommended two-boxing in Newcomb's Problem, then the second box would contain a round square'.
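The entailment fact behind this worry is the principle of explosion (ex falso quodlibet). A minimal Lean 4 illustration, with propositions named for illustration only: whenever A is false, both "A implies E" and "A implies M" are provable, for arbitrary E and M (think of E as "the second box is empty" and M as "the second box contains a million"). This captures only the logical point, not any model of counterpossible supposition.

```lean
-- If A is false, A implies any proposition whatsoever, including incompatible ones.
example (A E M : Prop) (hnA : ¬A) : (A → E) ∧ (A → M) :=
  ⟨fun hA => absurd hA hnA, fun hA => absurd hA hnA⟩
```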
A second worry. Is a probability function revised by a counterpossible supposition, as employed by FDT, still a probability function? Arguably not. For presumably the revised function is still certain of elementary mathematical facts such as the Peano axioms. (If, when evaluating a relevant scenario, the agent is no longer sure whether 0=1, all bets are off.) But some such elementary facts will logically entail the negation of the supposed hypothesis. So in the revised probability function, probability 1 is not preserved under logical entailment; and then the revised function is no longer a classical probability function. (This matters, for example, because Yudkowsky and Soares claim that the representation theorem from Joyce's Foundations of Causal Decision Theory can be adapted to FDT, but Joyce's theorem assumes that the supposition preserves probabilistic coherence.)
Another worry. Subjunctive supposition is relatively well-understood for propositions about specific events at specific times. But the hypothesis that FDT yields a certain output for a certain input is explicitly not spatially and temporally limited in this way. We have no good models for how supposing such general propositions works, even for possible propositions.
The details matter. For example, assume FDT actually outputs B for problem P, and B' for a different problem P'. Under the counterpossible supposition that FDT outputs A for P, can we hold fixed that it outputs B' for P'? If not, FDT will sometimes recommend choosing a particular act because of the advantages of choosing a different act in a different kind of decision problem.
Standard decision theories are not just based on brute intuitions about particular cases, as Yudkowsky and Soares would have us believe, but also on general arguments. The most famous of these are so-called representation theorems which show that the norm of maximising expected utility can be derived from more basic constraints on rational preference (possibly together with basic constraints on rational belief). It would be nice to see which of the preference norms of CDT Yudkowsky and Soares reject. It would also be nice if they could offer a representation theorem for FDT. All that is optional and wouldn't matter too much, in my view, except that Yudkowsky and Soares claim (as I mentioned above) that the representation theorem in Joyce's Foundations of Causal Decision Theory can be adapted straightforwardly to FDT. But I doubt that it can. The claim seems to rest on the idea that FDT can be formalised just like CDT, assuming that subjunctively supposing A is equivalent to supposing that FDT recommends A. But as I've argued above, the latter supposition arguably makes an agent's subjective probability function incoherent. More obviously, in cases like Blackmail, A is plausibly false on the supposition that FDT recommends A. These two aspects already contradict the very first two points in the statement of Joyce's representation theorem, on p.229 of The Foundations of Causal Decision Theory, under 7.1.a.
Yudkowsky and Soares constantly talk about how FDT "outperforms" CDT, how FDT agents "achieve more utility", how they "win", etc. As we saw above, it is not at all obvious that this is true. It depends, in part, on how performance is measured. At one place, Yudkowsky and Soares are more specific. Here they say that "in all dilemmas where the agent's beliefs are accurate [??] and the outcome depends only on the agent's actual and counterfactual behavior in the dilemma at hand – reasonable constraints on what we should consider "fair" dilemmas – FDT performs at least as well as CDT and EDT (and often better)". OK. But how should we understand "depends on … the dilemma at hand"? First, are we talking about subjunctive or evidential dependence? If we're talking about evidential dependence, EDT will often outperform FDT. And EDTers will say that's the right standard. CDTers will agree with FDTers that subjunctive dependence is relevant, but they'll insist that the standard Newcomb Problem isn't "fair" because here the outcome (of both one-boxing and two-boxing) depends not only on the agent's behavior in the present dilemma, but also on what's in the opaque box, which is entirely outside her control. Similarly for all the other cases where FDT supposedly outperforms CDT. Now, I can vaguely see a reading of "depends on … the dilemma at hand" on which FDT agents really do achieve higher long-run utility than CDT/EDT agents in many "fair" problems (although not in all). But this is a very special and peculiar reading, tailored to FDT. We don't have any independent, non-question-begging criterion by which FDT always "outperforms" EDT and CDT across "fair" decision problems.
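How the evidential and causal readings come apart in the standard (opaque) Newcomb Problem can be seen in a toy Python sketch. The demon's `accuracy` and the agent's `credence_full` that the opaque box is full are free parameters assumed for illustration, and the function names are hypothetical.

```python
SMALL, BIG = 1_000, 1_000_000  # transparent box; possible content of the opaque box


def edt_value(act: str, accuracy: float) -> float:
    """Evidential expected utility: the act is evidence about the prediction,
    so the probability that the opaque box is full depends on the act."""
    p_full = accuracy if act == "one-box" else 1 - accuracy
    return p_full * BIG + (SMALL if act == "two-box" else 0)


def cdt_value(act: str, credence_full: float) -> float:
    """Causal expected utility: the contents are outside the agent's control,
    so the same credence that the box is full is used for both acts."""
    return credence_full * BIG + (SMALL if act == "two-box" else 0)


for c in (0.0, 0.5, 1.0):
    print(c, cdt_value("two-box", c) - cdt_value("one-box", c))  # 1000.0 each time
print(edt_value("one-box", 0.99), edt_value("two-box", 0.99))
```

On the causal reading, two-boxing comes out exactly $1000 ahead whatever the credence, because the content of the opaque box does not depend on the act. On the evidential reading, one-boxing wins whenever the demon is sufficiently accurate.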
FDT closely resembles Justin Fisher's "Disposition-Based Decision Theory" and the proposal in David Gauthier's Morals by Agreement, both of which are motivated by cases like Blackmail and Prisoner's Dilemma with a Twin. Neither is mentioned. It would be good to explain how FDT relates to these earlier proposals.
The paper goes to great lengths criticising the rivals CDT and EDT. The apparent aim is to establish that both CDT and EDT sometimes make recommendations that are clearly wrong. Unfortunately, these criticisms are largely unoriginal, superficial, or mistaken.
For example, Yudkowsky and Soares fault EDT for giving the wrong verdicts in simple medical Newcomb problems. But defenders of EDT such as Arif Ahmed and Huw Price have convincingly argued that the relevant decision problems would have to be highly unusual. Similarly, Yudkowsky and Soares cite a number of well-known cases in which CDT supposedly gives the wrong verdict, such as Arif's Dicing with Death. But again, most CDTers would not agree that CDT gets these cases wrong. (See this blog post for my response to Dicing with Death.) In general, I am not aware of any case in which I'd agree that CDT – properly spelled out – gives a problematic verdict. Likewise, I suspect Arif does not think there are any cases in which EDT goes wrong. It just isn't true that both CDT and EDT are commonly agreed to be faulty. If Yudkowsky and Soares want to argue that they are, they need to do more than revisit well-known scenarios and make bold assertions about what CDT and EDT say about them.
The criticism of CDT and EDT also contains several mistakes. For example, Yudkowsky and Soares repeatedly claim that if an EDT agent is certain that she will perform an act A, then EDT says she must perform A. I don't understand why. I guess the idea is that (1) if P(B)=0, then the evidential expected utility of B is undefined, and (2) any number is greater than undefined. But lots of people, from Kolmogoroff to Hajek, have argued against (1), and I don't know why anyone would find (2) plausible.
For another example, Yudkowsky and Soares claim that CDT (like FDT) involves evaluating logically impossible scenarios. For example, "[CDTers] are asking us to imagine the agent's physical action changing while holding fixed the behavior of the agent's decision function". Who says that? I would have thought that when we consider what would happen if you took one box in Newcomb's Problem, the scenario we're considering is one in which your decision function outputs one-boxing. We're not considering an impossible scenario in which your decision function outputs two-boxing, you have complete control over your behaviour, and yet you choose to one-box. There are many detailed formulations of CDT. Yudkowsky and Soares ignore almost all of them and only mention the comparatively sketchy theory of Pearl. But even Pearl's theory plausibly doesn't appeal to impossible propositions to evaluate ordinary options. Lewis's or Joyce's or Skyrms's certainly doesn't.
I still think the paper could probably have been published after a few rounds of major revisions. But I also understand that the editors decided to reject it. Highly ranked philosophy journals have acceptance rates of under 5%. So almost everything gets rejected. This one got rejected not because Yudkowsky and Soares are outsiders or because the paper fails to conform to obscure standards of academic philosophy, but mainly because the presentation is not nearly as clear and accurate as it could be.