## The lure of free energy

There's an exciting new theory in cognitive science. The theory began as an account of message-passing in the visual cortex, but it quickly expanded into a unified explanation of perception, action, attention, learning, homeostasis, and the very possibility of life. In its most general and ambitious form, the theory was mainly developed by Karl Friston -- see e.g. Friston 2006, Friston and Stephan 2007, Friston 2009, Friston 2010, or the Wikipedia page on the free-energy principle.

Unfortunately, Friston isn't very good at explaining what exactly the
theory says. The unifying principle at its core is called
the *free-energy principle*. It says that "any self-organizing
system that is at equilibrium with its environment must minimize its
free energy"
(Friston 2010). Both perception and action are then characterized as
serving this goal of minimizing free energy.

Let's have a closer look. The term "free energy" originally comes from thermodynamics, but Friston's usage has its home in variational Bayesian methods for approximating intractable integrals in machine learning and statistics.

The basic idea is actually quite simple. Suppose we have a probability distribution P that we would like to conditionalize on some data d. Call the target distribution P'. If the distributions are moderately complex, computing P' turns out to be infeasible. What we can do instead is focus on a restricted class of computationally simple distributions and find the distribution Q from within that class that is most similar to P'. For many purposes, Q will do as an approximation to the target distribution P'.

The similarity between Q and P' is commonly measured by the
"Kullback-Leibler divergence",
\begin{align}
\text{KL}(Q||P^\prime) &= \sum_x Q(x)\log \frac{Q(x)}{P^\prime(x)}\\
&= \sum_x Q(x)\log \frac{Q(x)}{P(x \land d)} - (-\log P(d)).
\end{align}
The first term in the second formulation is called
the *(variational) free energy* of Q (relative to P and d); the
second term, -log P(d), is called the *surprise*
or *surprisal* of d (relative to P).
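The decomposition can be checked numerically. The following is a minimal sketch under made-up assumptions: the joint prior `P_JOINT`, the state labels, and the candidate `Q` are all hypothetical and chosen only for illustration.

```python
import math

# Hypothetical joint prior P(x ∧ obs) over a hidden state x and a binary
# observation; the numbers are made up purely for illustration.
P_JOINT = {
    ("rain", "d"): 0.30, ("rain", "not_d"): 0.10,
    ("sun", "d"): 0.12, ("sun", "not_d"): 0.48,
}
STATES = ("rain", "sun")


def prob_d(p_joint):
    """P(d): marginal probability of the observed datum d."""
    return sum(p_joint[(x, "d")] for x in STATES)


def surprise(p_joint):
    """-log P(d), the surprisal of d relative to P."""
    return -math.log(prob_d(p_joint))


def posterior(p_joint):
    """P'(x) = P(x ∧ d) / P(d), i.e. P conditionalized on d."""
    p_d = prob_d(p_joint)
    return {x: p_joint[(x, "d")] / p_d for x in STATES}


def free_energy(q, p_joint):
    """sum_x Q(x) log(Q(x) / P(x ∧ d)), the variational free energy of Q."""
    return sum(q[x] * math.log(q[x] / p_joint[(x, "d")]) for x in STATES if q[x] > 0)


def kl(q, p):
    """KL(Q||P) = sum_x Q(x) log(Q(x) / P(x))."""
    return sum(q[x] * math.log(q[x] / p[x]) for x in STATES if q[x] > 0)


Q = {"rain": 0.5, "sun": 0.5}  # an arbitrary candidate approximation
lhs = kl(Q, posterior(P_JOINT))
rhs = free_energy(Q, P_JOINT) - surprise(P_JOINT)
print(f"KL(Q||P') = {lhs:.6f}, free energy - surprise = {rhs:.6f}")
```

Up to floating-point error, the two printed numbers should agree for any choice of `Q` with the same support.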

Note that Q doesn't occur in -log P(d). So when we look for the function Q that minimizes KL(Q||P'), we can equivalently look for the function Q that minimizes free energy. Given suitable restrictions on the class of functions from which Q is chosen, this can be done very efficiently by an iterative algorithm described e.g. on these useful slides by Kay Brodersen.
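To make the equivalence concrete, here is a small sketch in which Q is restricted to distributions that factorize over two binary variables. The numbers for P(x ∧ d) are hypothetical, and the brute-force grid search is only a stand-in for illustration; it is not the iterative algorithm from the slides.

```python
import itertools
import math

# Hypothetical values of P(x ∧ d) for x = (a, b), two binary hidden variables.
P_XD = {(0, 0): 0.20, (0, 1): 0.02, (1, 0): 0.03, (1, 1): 0.15}
P_D = sum(P_XD.values())                        # P(d)
P_POST = {x: p / P_D for x, p in P_XD.items()}  # P' = P conditionalized on d


def q_factorized(qa, qb):
    """Restricted class: Q(a, b) = Q(a) * Q(b), with Q(a=1)=qa and Q(b=1)=qb."""
    return {
        (a, b): (qa if a else 1 - qa) * (qb if b else 1 - qb)
        for a, b in itertools.product((0, 1), repeat=2)
    }


def free_energy(q):
    """Free energy of Q relative to P and d."""
    return sum(q[x] * math.log(q[x] / P_XD[x]) for x in q if q[x] > 0)


def kl_to_posterior(q):
    """KL(Q||P')."""
    return sum(q[x] * math.log(q[x] / P_POST[x]) for x in q if q[x] > 0)


# Brute-force search over a grid of factorized candidates.
grid = [i / 100 for i in range(1, 100)]
candidates = [(qa, qb) for qa in grid for qb in grid]
best_by_f = min(candidates, key=lambda t: free_energy(q_factorized(*t)))
best_by_kl = min(candidates, key=lambda t: kl_to_posterior(q_factorized(*t)))

print("argmin free energy:", best_by_f)
print("argmin KL(Q||P'):  ", best_by_kl)
print("remaining divergence:", kl_to_posterior(q_factorized(*best_by_f)))
```

Because the two objectives differ only by the constant -log P(d), both searches should settle on the same candidate (barring floating-point ties). The remaining divergence stays above zero here because the hypothetical posterior is correlated and so lies outside the factorized class.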

How is all this relevant to cognitive science? Well, when our brain
receives sensory input, it has to come up with a plausible hypothesis
about the state of the world. Bayesian methods are ideally suited to
that task, and there are good reasons for thinking that the
information processing in our perceptual systems does in fact follow
broadly Bayesian principles. But since the ideal Bayesian method
(conditionalization) is computationally intractable, our brain can
only implement approximations. One hypothesis is that it implements
the iterative algorithm of variational approximation. (Let's call
this *the variational hypothesis about perception*.) Minimizing
free energy then plays an important role in our perceptual
architecture.

The variational hypothesis is not the only game in town. It could turn out that our brain uses some other technique (MCMC, say) to approximate the Bayesian ideal. However, there's a sense in which our brain would still minimize free energy. For no matter how the brain arrives at its posterior probability Q, approximating the Bayesian ideal means getting close to the Bayesian posterior P', and getting close to P' means minimizing free energy.

(Here is a more complicated way of making essentially the same
point that's popular among Friston & friends. Recall from the above definition of KL(Q||P') that divergence equals free energy minus
surprise. Moving surprise to the left-hand side of the equation, we
see that free energy equals divergence plus surprise. And since the
divergence can't be negative, free energy is an upper bound on
surprise. Now Bayesian conditionalization on d turns a prior
probability P into a posterior P' such that P'(d)=1 and therefore -log
P'(d)=0. So the output of conditionalization is a distribution that,
in a sense, minimizes surprise. (Strictly speaking, what's minimized
is of course the *posterior surprise* -log P'(d), not the prior
surprise -log P(d) that figures in our formula above.) Any approximation to
conditionalization will generally yield a posterior Q that also
assigns high probability to the input d. So it will also reduce
surprise. And so, all else equal, it will reduce free energy.)
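
In symbols, writing F(Q) for the free energy of Q relative to P and d (a label introduced here only for brevity), the rearrangement in the preceding parenthesis reads:
\begin{align}
F(Q) &= \sum_x Q(x)\log \frac{Q(x)}{P(x \land d)}\\
&= \text{KL}(Q||P^\prime) + (-\log P(d))\\
&\geq -\log P(d).
\end{align}
The last step uses the fact that KL(Q||P') is never negative, and that it is zero just in case Q = P'.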

To sum up, the proposal that a perceptual system minimizes free energy can be understood either as a substantive conjecture on how the system implements approximately Bayesian inference (the variational hypothesis) or as a rather less substantive conjecture that merely says that the system (somehow or other) implements approximately Bayesian inference. It's a bit misleading to express the weaker conjecture in terms of free energy rather than (say) surprise, since non-variational approximations to conditionalization don't directly involve free energy. But let that pass.

So far, we've only looked at perception. Let's turn to action. Here the free-energy approach is often advertised as a radical departure from traditional views insofar as it no longer appeals to goals in its explanation of action. Instead, action comes out as just another method for minimizing free energy.

The basic idea seems to go roughly as follows. Suppose my internal probability function Q assigns high probability to states in which I'm having a slice of pizza, while my sensory input suggests that I'm currently not having a slice of pizza. There are two ways of bringing Q into alignment with my sensory input: (a) I could change Q so that it no longer assigns high probability to pizza states, or (b) I could grab a piece of pizza, thereby changing my sensory input so that it conforms to the pizza predictions of Q. Both (a) and (b) would lead to a state in which my (new) probability function Q' assigns high probability to my (new) sensory input d'. Compared to the present state, the sensory input will then have lower surprise. So a transition to either of these states can be seen as a reduction of free energy, in the unambitious sense of the term.
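A toy calculation makes the two routes explicit. This is only a sketch of the simple picture just described: the two-outcome sensory space and all probabilities are hypothetical, and surprise is taken to be -log Q(d), the surprise of the input relative to the agent's own probability function.

```python
import math


def surprise(q, d):
    """-log Q(d): how surprising the sensory input d is relative to Q."""
    return -math.log(q[d])


# Hypothetical starting point: Q expects pizza sensations, the input says otherwise.
Q = {"pizza": 0.9, "no_pizza": 0.1}
d = "no_pizza"
before = surprise(Q, d)

# Route (a): revise Q so that it no longer predicts pizza; the input stays the same.
Q_revised = {"pizza": 0.1, "no_pizza": 0.9}
after_a = surprise(Q_revised, d)

# Route (b): act, so that the input changes to match the old predictions of Q.
d_new = "pizza"
after_b = surprise(Q, d_new)

print(f"before: {before:.3f}, after (a): {after_a:.3f}, after (b): {after_b:.3f}")
```

With these symmetric made-up numbers, both routes lower the surprise by the same amount, which is why the simple picture cannot by itself say which one the agent should take.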

Action is thus explained as an attempt to bring one's sensory input into alignment with one's representation of the world.

This is clearly nuts. When I decide to reach out for the pizza, I
don't assign high probability to states in which I'm already eating
the slice. It is precisely my knowledge that I'm *not* eating the
slice, together with my desire to eat the slice, that explains my
reaching out.

There are at least two fundamental problems with the simple picture just outlined. One is that it makes little sense without postulating an independent source of goals or desires. Suppose it's true that I reach out for the pizza because I hallucinate (as it were) that that's what I'm doing, and I try to turn this hallucination into reality. Where does the hallucination come from? Surely it's not just a technical glitch in my perceptual system. Otherwise it would be a miraculous coincidence that I mostly hallucinate pleasant and fitness-increasing states. Some further part of my cognitive architecture must trigger the hallucinations that cause me to act. (If there's no such source, the much discussed "dark room problem" arises: why don't we efficiently minimize sensory surprise (and thereby free energy) by sitting still in a dark room until we die?)

The second problem is that efficient action requires keeping track
of *both* the actual state and the goal state. If I want to reach
out for the pizza, I'd better know where my arms are, where the pizza
is, what's in between the two, and so on. If my internal
representation of the world falsely says that the pizza is already in
my mouth, it's hard to explain how I manage to grab it from the
plate.

A closer look at Friston's papers suggests that the above rough proposal isn't quite what he has in mind. Recall that minimizing free energy can be seen as an approximate method for bringing one probability function Q close to another function P. If we think of Q as representing the system's beliefs about the present state, and P as a representation of its goals, then we have the required two components for explaining action. What's unusual is only that the goals are represented by a probability function, rather than (say) a utility function. How would that work?

Here's an idea. Given the present probability function Q, we can map any goal state A to the target function Q^A, which is Q conditionalized on A -- or perhaps on certain sensory states that would go along with A. For example, if I successfully reach out for the pizza, my belief function Q will change to a function Q^A that assigns high probability to my arm being outstretched, to seeing and feeling the pizza in my fingers, etc. Choosing an act that minimizes the difference between my belief function and Q^A is then tantamount to choosing an act that realizes my goal.

This might lead to an interesting empirical model of how actions are generated. Of course we'd need to know more about how the target function Q^A is determined. I said it comes about by (approximately?) conditionalizing Q on the goal state A, but how do we identify the relevant A? Why do I want to reach out for the pizza? Arguably the explanation is that reaching out is likely (according to Q) to lead to a more distal state in which I eat the pizza, which I desire. So to compute the proximal target probability Q^A we presumably need to encode the system's more distal goals and then use techniques from (stochastic) control theory, perhaps, to derive more immediate goals.

That version of the story looks much more plausible, and much less revolutionary, than the story outlined above. In the present version, perception and action are not two means to the same end -- minimizing free energy. The free energy that's minimized in perception is a completely different quantity from the free energy that's minimized in action. What's true is that both tasks involve mathematically similar optimization problems. But that isn't too surprising given the well-known mathematical and computational parallels between conditionalizing and maximizing expected utility.

Now you may rightly wonder what any of this has to do with the free-energy principle -- that "any self-organizing system
that is at equilibrium with its environment must minimize its free
energy". It is certainly not true that any self-organizing system at
equilibrium (whatever that means) *must* employ variational
Bayesian methods to process sensory input. Nor is it true that
it *must* encode proximate goals by a probability function. There
are other models of action and perception, including non-probabilistic
models that don't involve any kind of free energy, not even in a
derivative and unambitious sense.

Here is how Friston seems to think it all hangs together.

Consider the frequency with which biological systems of a given
type are in a physical state of a given type (where a "state" includes
extrinsic relations to the environment). We can think of these
frequencies as a probability measure P over physically possible
states. This distribution P -- call it the *population distribution* -- will
usually be concentrated on a small region in the state space. This
requires an explanation, since it isn't what normally happens to
physical systems when left on their own. In fact, biological systems
seem to make an active effort to remain within their typical
region. For example, when our body temperature gets unusually low, we
generally do things (shiver, put on more clothes, turn up the heating)
that prevent the temperature from decreasing further. One might even
say that it's a defining feature of biological systems that they
actively make sure to be in states with high P-probability, and
therefore low P-surprise. (If P is high, -log P is low.)
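
As a small illustration of P-surprise, the sketch below builds a population distribution from frequencies and computes -log P for each state. The sample of coarse body-temperature categories is entirely hypothetical.

```python
import math
from collections import Counter

# Hypothetical sample of states observed across a population,
# bucketed into coarse body-temperature categories.
observations = ["normal"] * 96 + ["low"] * 3 + ["very_low"] * 1

counts = Counter(observations)
total = sum(counts.values())

# Population distribution P: relative frequency of each state type.
P = {state: n / total for state, n in counts.items()}


def p_surprise(state):
    """-log P(state): low for typical states, high for rare ones."""
    return -math.log(P[state])


for state in ("normal", "low", "very_low"):
    print(f"{state:9s} P = {P[state]:.2f}  P-surprise = {p_surprise(state):.3f}")
```

The typical state comes out with the lowest P-surprise and the rarest state with the highest, which is all the parenthetical remark above requires.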

Now whatever mechanism ensures that a system maintains a high-P state must be sensitive to the system's actual present state. It needn't "know" the exact present state, but it might at least encode a probability measure Q over possibilities concerning the present state. The goal of the mechanism is then to bring Q close to P -- in other words, to minimize the free energy of Q relative to P.

Above we saw that action can perhaps be modelled as a process of minimizing the difference between our internal representation Q and a target distribution Q^A. As biological systems, we have to minimize the difference between Q and P. This suggests that our ultimate target distribution is none other than the population distribution P! The free energy that's minimized in action is thus essentially the same free energy that must be minimized by all biological systems.

Perception enters the picture in two ways. First, minimizing the difference between Q and P won't be conducive to the goal of maintaining high-P states unless Q is fairly accurate about the present state. Perception helps to render Q accurate. More importantly, recall that perception works by minimizing the free energy of Q relative to P', where P' is some prior probability distribution P conditionalized on sensory data. What is this prior distribution P? Maybe it's once again the population distribution P! This would mean that the free energy that's minimized in perception is, after all, closely related to the free energy that's minimized in action, and to the free energy that we have to minimize qua biological organisms.

In this picture, there aren't just abstract mathematical parallels between action, perception and homeostasis. There is a biological imperative to minimize a certain form of free energy, and both action and perception are means to this end.

That's the unified theory, as far as I can tell. (As I said, I find Friston hard to read.)

Is it a plausible theory? I don't think so. First of all, even if we accept that biological systems have to maintain states with high P-value (the free-energy principle), this provides no good reason for thinking that the mechanisms that achieve this can be usefully modelled as minimizing the free energy of an internal representation Q relative to the population distribution P. In particular, the free-energy principle lends no support to the variational hypothesis about perception, nor to any specific hypothesis about the generation of acts. In short, the free-energy principle is far too unspecific to serve as the basis of an interesting, unified theory of cognition.

Moreover, the idea that the very same type of free energy gets minimized in action and perception is extremely implausible,
as we saw above. It is also implausible that even *one* of these
quantities has any direct connection to the population distribution
P. For one thing, we often find ourselves in states whose past
population probability is zero, so our prior probability had better
not coincide with that population probability. More importantly, the
distribution that serves as the prior P in sensory updates is arguably
not fixed, but modified by earlier experience. So it can't always
coincide with the rigid population distribution.

Things get a lot worse when we look at the supposed connection between the population distribution P and an organism's ultimate goals. Let's set aside the question of whether there's a sensible model of action in which ultimate goals can be represented by a probability function. The more immediate problem is that typicality and desirability, while correlated, are by no means the same. Many insects have thousands of offspring, almost all of which die at a very early age. But they don't actively attempt to die. In the other direction, mating events are often extremely rare in the population distribution, and yet individuals actively seek them out. More generally, there is just no reason to believe that evolution would make organisms seek out states to the extent that they are common, rather than to the extent that they promote fitness.

Free energy minimization *might* play an interesting role in
perception. A computationally similar process *might* play a role
in the generation of action. But the kinds of free energy that are
minimized in the two cases would be very different. Neither of them
has any plausible connection to population distributions.

Friston's 'free energy' has nothing to do with 'thermodynamic free energy'. I wonder why he used the thermodynamic term to represent a non-thermodynamic concept. Most people, including myself, when first encountering Friston's use of the term "free energy", would think of Gibbs free energy in thermodynamics. But when you actually read his article, his "free energy" is a totally different animal. To help reduce this mental shock, shouldn't Friston provide some explanation of why he chose to use the well-known term in such an unusual way?

Is it possible that Friston was unaware of the definition of Gibbs free energy? It seems to me that Friston's use of variational "free energy", which is totally unrelated to thermodynamic "free energy", may well confuse researchers, just as Shannon's unfortunate use of the term "entropy" in his information theory, which has nothing to do with thermodynamic entropy, has been confusing investigators since the middle of the last century. I can make this strong statement because thermodynamic entropy obeys the Second Law of thermodynamics but Shannon entropy does not. Similarly, thermodynamic free energy obeys the First and Second Laws of thermodynamics, but Friston's "free energy" does not. Hence I strongly recommend that Friston consider replacing his "free energy" with a less confusing term that reflects his brain theory, in order to avoid unnecessary confusion.