Derivation of the Variational Bayes Equations

Alianna J. Maren
Themesis Technical Report TR-2019-01v6 (ajm)
Abstract

The derivation of key equations for the variational Bayes approach is well-known in certain circles. However, translating the fundamental derivations (e.g., as found in Beal’s work) to Friston’s notation is somewhat delicate. Further, the notion of using variational Bayes in the context of a system with a Markov blanket requires special attention. This Technical Report presents the derivation in detail. It further illustrates how the variational Bayes method provides a framework for a new computational engine, incorporating the 2-D cluster variation method (CVM), which provides a necessary free energy equation that can be minimized across both the external and representational systems’ states, respectively.

“Do you understand this?” she demanded. …

“It seems simple enough,” he said after a moment.

“I knew it,” she muttered, crossing her arms. “I knew it was written in male.”

Heir to the Shadows: Book 2 of the Black Jewels Trilogy

Anne Bishop (1999), p. 214 (Trade edition).

1 Introduction

A recent evolution by Friston et al. (2024) [1] illustrates how active inference can be used for multi-scale applications, which in turn prompts greater attention to the derivation of core active inference equations.

Beyond this, the recent exposition by Friston et al. (2023) [2] on the “free energy principle,” coupled with a conceptual advance by Hafner et al. (2020, rev. 2022) [3] on “Action Perception Divergence” (APD), points to the need to trace how the active inference equations have been derived, as well as adapted for expression in different circumstances.

Friston (2010, 2013) [4, 5] has proposed that free energy minimization serves as a unifying theory for describing neural dynamics, with further elaboration in Friston et al. (2015) [6]. He further suggests that statistical thermodynamics can model neuronal systems, drawing on the dynamic properties of activated neuronal ensembles [6]. This elegant and fascinating notion depends on the use of the variational Bayes approach together with the idea of a Markov blanket, which separates an internal computational (“representational”) set of units from external ones.

While this approach is certainly attractive, it is potentially difficult for many readers to follow the translation from one of the earlier presentations of variational Bayes (by Beal (2003) [7]) to the equations used by Friston (op. cit.).

The treatment offered in this work is a careful deconstruction of the ideas and formalism as they were originally articulated by Friston (op. cit.), based on the detailed derivations used by Beal [7]. As such, it serves as a Rosetta stone, not just for the ideas, but also for the meaning of variables and operators as used in different works. This is not as easy as it may seem; for example, the term $H$ could be read as enthalpy in thermodynamic treatments, while it stands in for entropy in purely information-theoretic treatments.

Thus, this Technical Report serves as a mini-tutorial, carefully delineating how the free energy equations presented in Friston (op. cit.) correspond to the detailed derivations presented in Beal (2003), which were originally presented in Feynman [8] and in Hinton and van Camp [9]. It identifies how - although the equations may seem formally identical - there are certain key differences in the presentations offered by Friston and Beal.

In what follows, we will address the meaning of each variable explicitly and note any instances of overloading (i.e., the use of the same variable to mean two things). To help in this regard, Section 2 includes a glossary of the thermodynamic variables in this paper, along with a brief description. Section 3 includes a glossary of the additional information-theoretic notions.

Also, since Blei et al. (2016, 2017) [10, 11] have offered a valuable and useful tutorial on variational inference - one that is usefully read hand-in-hand with Beal - we also address the nomenclature used by Blei et al. Section 4 offers a table that compares (in a “Rosetta stone” manner) the nomenclature used by Beal, Friston, and Blei et al.

One of the most important elements in the derivation of variational Bayes is that the fundamental free energy equation (Eqn. 2) can be expressed two different ways. We work through the derivation of the first version in Section 4, and the second in Section 5. Section 6, “Discussion,” offers a contrast-and-compare of these two different free energy expressions.

Between 2016 and 2017, Friston and colleagues shifted the notation that they used, moving from explicit depiction of an external surround ($\Psi$) to a notation where the interaction between the system that was being modeled and the surround was evidenced by action agents and sensing agents [12, 13]. (These agents were present in the earlier Friston works, but now the role of $\Psi$ became more implicit.)

The expressions used in Friston et al. (2017) became the basis for much future work, including the free energy principle exposition [2] as well as Action Perception Divergence (APD) [3]. Thus, in this (2024) revised version of this work, we introduce a new Section 7, which overviews how active inference is presented in Friston et al. (2016, 2017) [12, 13].

This work also introduces a new computational engine (Maren, 2016) [14] in the active inference context. Such an engine would make use of not only Friston’s notion of a set of computational (representational) units separated from an external system by a Markov blanket, but also follow the variational Bayes (free energy minimization) approach described by both Friston (op. cit.) and Beal (2003).

In brief, Friston proposes a computational system in which a Markov blanket separates the computational (representational) elements of the engine from external events, as shown in Figure 1. The communications between the external system elements (denoted $\tilde{\psi}$) and those of the representational system (denoted $\lambda$ or $\tilde{r}$) are mediated by two distinct layers or components of the Markov blanket: the sensing ($\tilde{s}$) elements and the action ($\tilde{a}$) ones. The distinction between active and sensory states is dictated by the definition of a Markov blanket; namely, active states influence but are not influenced by external system elements, while sensory states influence but are not influenced by the representational system.


Figure 1: Illustration of a CORTECON(R) (COntent-Retentive, TEMporally-CONnected neural network) computational engine (Maren, 2016) [14], which includes an internal latent node grid. Within this grid, the total number of active nodes is governed by an activation enthalpy ($\varepsilon_0$) and the degree of clustering is governed by an interaction enthalpy parameter ($\varepsilon_1$). The cluster variation method (CVM) is used to bring the active and non-active nodes into free energy equilibrium. A Markov blanket of sensing and active units corresponds to input and output layers (see Friston [6]). The latent node grid, or “computational layer,” can be composed as either a 1-D or 2-D CVM, for which the free energy minimum can be found either analytically (for the case where $\varepsilon_0 = 0$) or computationally (for the case where $\varepsilon_0 \neq 0$). The CVM layer comprises the internal or representational units ($\tilde{r}$), and cannot communicate with the external field (shown in two parts for visualization purposes only). However, units within the representational layer can receive inputs from the sensory units ($\tilde{s}$) and send signals to the active ($\tilde{a}$) units. The sensory units can receive inputs from external stimuli, and send signals to the representational units. The active units can receive inputs from the representational units, and send signals to the external system.
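The communication pattern described in the Figure 1 caption can be written down as a small directed graph. The sketch below is only an illustration: the node labels and the `INFLUENCES` dictionary are hypothetical names, and it encodes just the four couplings that the caption lists (external to sensory, sensory to representational, representational to active, active to external), not the full set of couplings that a Markov blanket permits in general.

```python
# Directed influences named in the Figure 1 caption (a simplification;
# the labels "psi", "s", "r", "a" are illustrative, not from any library).
INFLUENCES = {
    "psi": {"s"},   # external states drive the sensory units
    "s": {"r"},     # sensory units drive the representational units
    "r": {"a"},     # representational units drive the active units
    "a": {"psi"},   # active units act on the external system
}


def directly_coupled(x, y, graph=INFLUENCES):
    """True if x influences y, or y influences x, in a single step."""
    return y in graph.get(x, set()) or x in graph.get(y, set())


def paths(src, dst, graph=INFLUENCES, seen=()):
    """Enumerate simple directed paths from src to dst."""
    if src == dst:
        return [[dst]]
    found = []
    for nxt in sorted(graph.get(src, set())):
        if nxt not in seen:
            found += [[src] + p for p in paths(nxt, dst, graph, seen + (src,))]
    return found


# External and representational units never touch directly ...
blanket_separates = not directly_coupled("psi", "r")
# ... so every route between them passes through the blanket.
psi_to_r = paths("psi", "r")   # runs through the sensory units
r_to_psi = paths("r", "psi")   # runs through the active units
```

In this simplified graph the only route from the external units to the representational units runs through the sensory units, and the only route back runs through the active units, which is the separation the Markov blanket is meant to express.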

In the notions offered by Friston, the external system ($\tilde{\psi}$ units) and the representational system ($\tilde{r}$ units) each independently come to a free energy minimum. The activation of $\tilde{r}$ units within the representational system can be mediated by certain parameters ($\theta$), so that the representational system models the external one.

More appropriately, since we are specifying that both the external and representational systems come to free energy minima, we would state that the nature of the representational system, when it achieves free energy minimization, approximates that of the external system, which also comes to free energy minimization. The degree-of-closeness of the model approximation to the external system is mediated by the parameter(s) $\theta$.

In order to create such a computational engine, we need a formalism that will actually allow this free energy minimization to take place, in both the external and representational components of the system. Maren [14] has suggested that a free energy formulation, known as the cluster variation method (CVM), can potentially serve in such a computational engine, as is shown in Fig. 1. We develop this further in Section 8.

As a first step, we will use Friston’s framework for the variational Bayes approach, and this requires that we derive the basic variational Bayes equations.

To do this, we use the same notation as used by Friston, presented in the following Table 1. The tilde notation (with variables $\tilde{\psi}$, $\tilde{r}$, $\tilde{s}$, and $\tilde{a}$) indicates that these are “generalized” variables [6].

Table 1: Variable definitions for variational free energy equations

| Variable | Meaning |
| --- | --- |
| $\tilde{s}, \tilde{a}, \tilde{r}$ | Generalized expressions for sensory ($\tilde{s}$), active ($\tilde{a}$), and internal (or representational) ($\tilde{r}$) states |
| $\tilde{\psi}$ | States of the world (system being modeled) that cause sensory states, and which can be influenced by action |
| $f_x(\tilde{\psi},\tilde{s},\tilde{a},\tilde{r})$ | Flow of system’s states, where $x$ corresponds to external, internal (representational), or active states; see, e.g., Friston et al. (2015) [6] |

2 The Variational Free Energy

The goal of this section is to follow the Friston approach and express variational free energy as an expected energy or enthalpy minus the entropy of a variational (i.e., approximate posterior) probability density. This can be equivalently expressed as surprisal plus the reverse Kullback-Leibler (KL) divergence between the “variational density and the posterior density over external states” (p. 4, [6]), as shown in Eqn. 2. (For a discussion of the reverse K-L divergence, see Maren (2024) [15].)

(Note: Most of us would say that the formulation of Eqn. 2 is between the true posterior density over external states $q(\tilde{\psi}|\tilde{r})$ (given the internal or representational states and their Markov blanket) and the variational density $p(\tilde{\psi}|\tilde{s},\tilde{a},\tilde{r})$ (the density of the model system).)

Friston expresses the variational free energy of an ensemble as the following equations [5, 6] (where the exact notation is taken from Friston (2015) [6], Eqn. 3.2)

$$
\begin{aligned}
f_r(\tilde{s},\tilde{a},\tilde{r}) &= (Q_r - \Gamma_r)\nabla_{\tilde{r}} F \\
f_a(\tilde{s},\tilde{a},\tilde{r}) &= (Q_a - \Gamma_a)\nabla_{\tilde{a}} F
\end{aligned}
\tag{1}
$$

and

$$
\begin{aligned}
F(\tilde{s},\tilde{a},\tilde{r}) &= E_q[L(\tilde{x})] - H[q(\tilde{\psi}|\tilde{r})] \\
&= L(\tilde{s},\tilde{a},\tilde{r}) + D_{KL}[q(\tilde{\psi}|\tilde{r})\,||\,p(\tilde{\psi}|\tilde{s},\tilde{a},\tilde{r})],
\end{aligned}
\tag{2}
$$

where the variables were previously identified in Table 1.

The flow of system states, represented by Eqn. 1, is particular to each type of unit, so that $f_a(\tilde{s},\tilde{a},\tilde{r})$ is the (gradient descent) change in the set of action units ($\tilde{a}$), and $f_s(\tilde{s},\tilde{a},\tilde{r})$ is the change in the set of sensory units ($\tilde{s}$). These system states are subject to random fluctuations denoted by $\omega$. The amplitude of the random fluctuations is controlled by [the diffusion tensor] $\Gamma$, while (the set of) $Q$ are antisymmetric matrices that allow for solenoidal flow (which does not change free energy). (See further discussion in Sect. 2 of Friston (2013) [5].)
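The statement that the antisymmetric $Q$ term produces solenoidal flow that leaves free energy unchanged, while the $\Gamma$ term dissipates it, can be checked numerically. The following is a minimal sketch and not Friston's scheme: it assumes a toy quadratic $F(x) = \tfrac{1}{2}(x_1^2 + 4x_2^2)$ over a two-dimensional state, takes $\Gamma$ to be a scaled identity matrix, and ignores the random fluctuations $\omega$. All function names are hypothetical.

```python
def grad_F(x):
    """Gradient of the toy free energy F(x) = 0.5 * (x1**2 + 4 * x2**2)."""
    return [x[0], 4.0 * x[1]]


def flow(x, q=1.0, gamma=0.5):
    """Deterministic flow f(x) = (Q - Gamma) grad F(x), with
    Q = [[0, q], [-q, 0]] (antisymmetric) and Gamma = gamma * identity."""
    g = grad_F(x)
    return [q * g[1] - gamma * g[0], -q * g[0] - gamma * g[1]]


def dF_dt(x, q=1.0, gamma=0.5):
    """Rate of change of F along the flow: grad F(x) . f(x)."""
    g, f = grad_F(x), flow(x, q, gamma)
    return g[0] * f[0] + g[1] * f[1]


x0 = [1.0, -2.0]
solenoidal_only = dF_dt(x0, q=1.0, gamma=0.0)   # Q alone: F is conserved
dissipative = dF_dt(x0, q=1.0, gamma=0.5)       # gamma > 0: F decreases
```

With `gamma=0.0` the rate of change of $F$ is zero, because $\nabla F^{\top} Q \nabla F$ vanishes for any antisymmetric $Q$; with a positive `gamma` the rate equals $-\gamma\,\lVert\nabla F\rVert^2$, so the flow descends the free energy.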

This Technical Report focuses exclusively on the static Eqn. 2, and defers the dynamic Eqn. 1 to a different occasion.

Eqn. 2 expresses the variational free energy $F(\tilde{s},\tilde{a},\tilde{r})$, which is isomorphic in structure to the free energy used in statistical thermodynamics. In Eqn. 2, this term is expressed in two ways. First, it is given as the difference of the expectation of $L(\tilde{x})$ (a log-likelihood term) and the entropy of the posterior density over external states, $H[q]$. (This is the formalism that is isomorphic with statistical thermodynamics.) We will determine the exact meaning of $L(\tilde{x})$ in Section 5.

The second expression for $F(\tilde{s},\tilde{a},\tilde{r})$ is the sum of the negative log evidence (surprisal) $L(\tilde{s},\tilde{a},\tilde{r})$ and the reverse Kullback-Leibler (K-L) divergence between the external system and the model. We will determine the meaning of $L(\tilde{s},\tilde{a},\tilde{r})$ in Section 4, and identify how (and why) it is different from the $L(\tilde{x})$ used in the first expression. (As a minor note: as the reverse Kullback-Leibler divergence in this expression approaches zero, we see that the variational free energy can be identified as an evidence bound for the system.)

We further note that a primary difference between these two expressions is that in the first expression, $H[q(\tilde{\psi}|\tilde{r})]$ is an exact entropy term, whereas in the second expression, the $D_{KL}$ term is a relative entropy.
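A small discrete example can make the equivalence of the two expressions in Eqn. 2 concrete. The sketch below is an illustration under stated assumptions: the external state takes only three values, the blanket and representational states are held fixed at one observed configuration, the numbers in `p_joint` and `q` are arbitrary, and each $L$ term is read as a negative log probability. The variable names are hypothetical.

```python
import math

# p(psi, s, a, r) evaluated at one fixed (s, a, r), for three values of psi.
# These need not sum to 1: their sum is the evidence p(s, a, r).
p_joint = [0.10, 0.25, 0.05]
# An arbitrary variational density q(psi | r) over the same three values.
q = [0.5, 0.3, 0.2]

evidence = sum(p_joint)                        # p(s, a, r)
posterior = [p / evidence for p in p_joint]    # p(psi | s, a, r)

# Single starting expression: -sum_psi q * ln(p_joint / q).
F_direct = -sum(qi * math.log(pi / qi) for qi, pi in zip(q, p_joint))

# First form: expected energy E_q[-ln p(psi, s, a, r)] minus entropy H[q].
expected_energy = -sum(qi * math.log(pi) for qi, pi in zip(q, p_joint))
entropy = -sum(qi * math.log(qi) for qi in q)
F_energy_entropy = expected_energy - entropy

# Second form: surprisal -ln p(s, a, r) plus D_KL[q || posterior].
surprisal = -math.log(evidence)
kl = sum(qi * math.log(qi / pi) for qi, pi in zip(q, posterior))
F_surprisal_kl = surprisal + kl
```

The three values of $F$ agree to floating-point precision. Because the divergence is non-negative, `F_direct` is never smaller than `surprisal`, which is the sense in which the variational free energy acts as a bound.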

The following Table 2 presents a glossary of the thermodynamic terms used in this Report.

Table 2: Variable definitions for thermodynamic terms

| Variable | Meaning |
| --- | --- |
| Activation enthalpy | Enthalpy $\varepsilon_0$ associated with a single unit (node) in the “on” or “active” state (A); influences configuration variables and is set to 0 in order to achieve an analytic solution for the free energy equilibrium |
| Configuration variable(s) | Nearest neighbor, next-nearest neighbor, and triplet patterns |
| Degeneracy | Number of ways in which a configuration variable can appear |
| Enthalpy | Internal energy $H$ results from both per-unit and pairwise interactions; often denoted $H$ in thermodynamic treatments |
| Entropy | The entropy $S$ characterizes the distribution over all possible states; often denoted $S$ in thermodynamic treatments and $H$ in information theory |
| Equilibrium point | By definition, the free energy minimum for a closed system |
| Equilibrium distribution | Configuration variable values when free energy is minimized for a given $h$ |
| Ergodic distribution | Achieved when a system is allowed to evolve over a long period of time |
| Free Energy | The thermodynamic state function $F$, where $F = H - TS$; sometimes $G$ is used instead of $F$, referring to (thermodynamic) Gibbs free energy |
| h-value | A more useful expression for the interaction enthalpy parameter $\varepsilon_1$; $h = e^{2\beta\varepsilon_1}$, where $\beta = 1/(k_\beta T)$, $k_\beta$ is Boltzmann’s constant, and $T$ is temperature; $\beta$ can be set to 1 for our purposes |
| Interaction enthalpy | Between two unlike units, $\varepsilon_1$; influences configuration variables |
| Interaction enthalpy parameter | Another term for the h-value, where $h = e^{2\varepsilon_1}$ (with $\beta$ set to 1) |
| Temperature | Temperature $T$ times Boltzmann’s constant $k_\beta$ is set equal to one |

2.1 The Final Result in a Nutshell

The essence of Eqn. 2 is that we are taking a single expression, and parsing and re-organizing it to achieve two different ways of re-expressing the same thing. The expression, sometimes called the “variational free energy” (see, e.g., Friston (op. cit.)) is given as

$$
F(\tilde{s},\tilde{a},\tilde{r}) = -\int_{\psi} q(\tilde{\psi}|\tilde{r}) \ln\left(\frac{p(\tilde{\psi},\tilde{s},\tilde{a},\tilde{r})}{q(\tilde{\psi}|\tilde{r})}\right) d\psi.
\tag{3}
$$

Diagrammatically, we can see this shown in Figure 2. The tilde notation is dropped in this figure and in the immediately-following subsection, which discusses this figure.

Figure 2: Diagrammatic illustration of Eqn. 2.

2.2 Quick Summary of Key Points

In Figure 2, we saw that the initiating expression, the “variational free energy,” was being reorganized in two different ways. On the Left-Hand-Side (LHS), we saw that it led to the (negative of) a simple sum over the log-likelihood of the set of variables associated with the representational units and the Markov blanket, added to the reverse Kullback-Leibler (K-L) divergence term expressing the difference between the model $q$ and the probability distribution of the external units $\psi$ as conditioned on the internal units and the Markov blanket. We achieve this result by re-expressing the probability of the joint co-occurrence, $p(\psi,s,a,r)$, as a conditional probability, using the formulation for a Bayesian posterior.

On the Right-Hand-Side (RHS), we see that the re-organization is much simpler, and can indeed be followed simply by examining the diagram itself. (There are a few subtleties, which are addressed in the following sections.) The first term on the RHS is a weighted sum over the joint probability distribution $p(\psi,s,a,r)$. The second term is a term that looks remarkably like entropy.

The structure of the equation on the RHS, and the “entropy-like” appearance of the last term, has given rise to expressing the whole equation as “variational free energy.” This is because there is a morphological similarity between the form of the variational free energy equation and the classic thermodynamic free energy. (See Appendix A for a quick review of the basic statistical thermodynamics equations, with more details presented in Appendix B.)
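For readers who prefer to see the two reorganizations written out, the following is a compact sketch (tildes dropped, as in Figure 2). It assumes only the product rule $p(\psi,s,a,r) = p(\psi|s,a,r)\,p(s,a,r)$ and the normalization $\int_{\psi} q(\psi|r)\,d\psi = 1$; the detailed treatment, including the precise meaning of each $L$ term, is left to the sections that follow.

```latex
% Starting point: Eqn. 3 with tildes dropped.
\begin{align*}
F(s,a,r)
  &= -\int_{\psi} q(\psi|r)\,\ln\frac{p(\psi,s,a,r)}{q(\psi|r)}\,d\psi \\
% RHS of Figure 2: split the logarithm of the ratio.
  &= -\int_{\psi} q(\psi|r)\,\ln p(\psi,s,a,r)\,d\psi
     \;+\; \int_{\psi} q(\psi|r)\,\ln q(\psi|r)\,d\psi \\
  &= E_q\!\left[-\ln p(\psi,s,a,r)\right] \;-\; H\!\left[q(\psi|r)\right]. \\[1.5ex]
% LHS of Figure 2: factor the joint with the product rule first.
F(s,a,r)
  &= -\int_{\psi} q(\psi|r)\,\ln\frac{p(\psi|s,a,r)\,p(s,a,r)}{q(\psi|r)}\,d\psi \\
  &= -\ln p(s,a,r)\int_{\psi} q(\psi|r)\,d\psi
     \;+\; \int_{\psi} q(\psi|r)\,\ln\frac{q(\psi|r)}{p(\psi|s,a,r)}\,d\psi \\
  &= -\ln p(s,a,r) \;+\; D_{KL}\!\left[q(\psi|r)\,||\,p(\psi|s,a,r)\right].
\end{align*}
```

Comparing with Eqn. 2 term by term suggests reading $E_q[L(\tilde{x})]$ as the expected negative log of the joint distribution and $L(\tilde{s},\tilde{a},\tilde{r})$ as the negative log evidence.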

2.3 Various Interpretations

Various authors interpret Eqn. 2 with different notations and descriptive phrases. The purpose of this subsection is to identify a few of these interpretations, and to tease out exactly what is meant from exactly what is said. This should make it easier for those reading the source papers to understand what is actually being presented, and is a first step towards building a Rosetta stone; giving a cross-correlation between two different notations.

The specifics of this Rosetta stone are captured in Table 4, presented in Section 4.

For example, Friston (2013) presents this Report’s Eqn. 2 as his Eqn. 2.7 in [5], using the notation

$$
F(s,a,\lambda) = E_q[G(\psi,s,a,\lambda)] - H[q(\psi|\mu)].
$$

Friston refers to $E_q[G]$, saying that “[T]he last equality just shows that free energy can be expressed as the expected Gibbs energy minus the entropy of the variational density.”

However, Sengupta, Stemmler, and Friston [16] state that “$U(t) = -\ln p(s(t),\psi(t)|m)$ corresponds to an internal energy under a generative model of the world, described in terms of the density over sensory and hidden states $p(s,y|m)$.” (Author’s note: $U(t)$ corresponds to $E_q[G]$, from [16] and [5], respectively.) Moreover, Sengupta et al. state that “$F(t)$ is called free energy - by analogy with its thermodynamic homologue that is defined as internal energy minus entropy. However, it is important to note that variational free energy is not the Helmholtz free energy … it is a functional of a probability distribution over hidden (fictive) states encoded by internal states $q(y|m)$, not the probability distribution over the (physical) internal states. This is why variational free energy pertains to information about hidden states that are represented, not the internal states that represent them.”

(Author’s Note 1: For the benefit of those who wish to compare the information-theoretic approach of Beal, Friston, and others against a classic statistical thermodynamics formulation, Appendix A derives fundamental thermodynamic concepts, and Beal’s results compared with the corresponding statistical thermodynamic formalism are given in Appendix B. As stated by Sengupta et al., they are not precisely the same [16].)

(Author’s Note 2: Sengupta et al. [16] refer to a Helmholtz free energy, and Friston, writing separately, refers to a Gibbs free energy. Both Helmholtz and Gibbs free energies correspond to thermodynamic free energy formulations, and involve measurements on a physical system, i.e., temperature, pressure, and volume. The distinction between Helmholtz and Gibbs free energies disappears when we use the thermodynamic free energy formulations as a metaphor.)

It will be clear, in the succeeding derivations, that what is offered as $E_q[G]$ is not what we are familiar with as the Gibbs (or Helmholtz) free energy from statistical thermodynamics. However, Sengupta et al. offer the following explanation as a Lemma:

“Lemma: (complexity minimisation) Minimising the complexity of a conditional distribution - whose sufficient statistics are (strictly increasing functions of) some unconstrained internal variables of a thermodynamic system - minimises the Helmholtz free energy of that system.”

As proof, they suggest that we can use standard results from Bayesian statistics [7] in order to express free energy as complexity minus accuracy. They are, in fact, referring to the derivation in Beal (2003) that will be the fundamental reference in this Technical Report. They conclude that “In sum, the internal states encoding prior beliefs about hidden states of the world are those that minimise Helmholtz free energy and the complexity defined by variational free energy.”

Friston cites Beal (2003) [7] for the derivation of Eqns. 1 and 2. Blei et al. [10] also provide a useful tutorial. The following sections walk through the derivations as provided by Beal, using some of the material provided by Blei et al. to support and elucidate certain points. The goal is to make the match between the free energy equations as expressed by Beal (and in certain cases, by Blei et al.) with those used by Friston as clear and as transparent as possible.

3 Important Distinction & Clarification

Following the transition of the variational Bayes approach from Beal [7] to Friston (op. cit.), and from thence to an actual, computable model, requires some subtlety and attention to detail. The most crucial consideration at the outset is that with Beal (and with previous expositions of the variational Bayes method), both the actual data points being modeled and the model itself are expressed with regard to some underlying variables, $x_i$. Thus, we have the external (or observable, or dependent) variables $y_i = y(x_i)$, and the distribution $p_{x_i}(x_i)$.

In contrast to Beal’s description, with Friston, the external system and the representation are separated by a Markov blanket. Thus, the external system’s units, denoted $\psi$, are distinct from the representational system’s units. Friston’s approach is illustrated in Figure 3.

(Note: Beal actually does address how variational inference is performed in a system separated by a Markov blanket; see Sections 3 ff. of his work; his notation gets complex, and we don’t need to use his more detailed work in order to interpret Friston.)

Figure 3: In the variational Bayes method described by Friston, the external system, whose units are denoted by $\psi$, interacts with a separate representational system whose units are denoted by $r$. The two systems are separated by a Markov blanket composed of sensing ($s$) and action ($a$) units. (Note: for simplicity, the tilde notation is dropped from this figure.) In comparison with Beal, the distribution $q$ is of the external system, which is conditioned on the representational system; $q = q(\tilde{\psi}|\tilde{r})$. This is feasible because the external system units $\tilde{\psi}$ influence the representational units $\tilde{r}$ through the sensory units $\tilde{s}$. Conversely, the representational units $\tilde{r}$ influence the external units $\tilde{\psi}$ through the active units $\tilde{a}$.

In a beautiful and intriguing illustration, Friston applies his formulation to the emergence of a Markov blanket, beginning with “an ensemble of elemental subsystems with (heuristically speaking) Newtonian and electrochemical dynamics … One can think of these generalized states as describing the physical and electrochemical state of large macromolecules. Crucially, these states are coupled within and between the subsystems comprising an ensemble” [5]. The units comprising the Markov blanket and the internal, “representational” system emerge over time. Appropriately enough, Friston’s illustration focuses on the dynamic behaviors of the various units.

Table 3 presents a glossary of the information-theoretic terms used in this Report.

Table 3: Variable definitions for information-theoretic terms

| Variable | Meaning |
|---|---|
| Kullback-Leibler divergence | Relative entropy, or measure of how one probability distribution differs from a second one; here, the divergence between a model distribution and the actual system being modeled |
| Surprisal | Information content, or amount of information gained when a random variable is sampled |
| Variational free energy $F(\tilde{s},\tilde{a},\tilde{r})$ | Isomorphic in structure to the equation for thermodynamic free energy, this expresses the difference between the expectation for the probability distribution over a system and the entropy of a model system; it can also be expressed as the sum of the negative log evidence (surprisal) and the Kullback-Leibler divergence between the model and the external system |

3.1 Primary Distinction: Beal and Friston

Turning our attention back to the evolution of Friston’s formulation from that expressed by Beal, we investigate the correspondence between Beal’s expression of $p(y_i|\theta)$ and Friston’s use of $p(\tilde{s},\tilde{a},\tilde{r})$.

In writing $p(y_i|\theta)$, Beal is actually describing the observable variables that are being modeled. He is specifically referring to (the integration of) the joint probability distribution $p(x_i, y_i|\theta)$, where the $y_i$ are the dependent, or observable, variables, and the $x_i$ are the independent, or hidden, or latent variables. Specifically, Beal states (Sect. 2.2.1 [7]), “Consider a model with hidden variables $x$ and observed variables $y$. The parameters describing the (potentially) stochastic dependencies between variables are given by $\theta$.”

The key to understanding that Beal is describing the external (observable) system (also that which Friston refers to as being denoted by the units $\tilde{\psi}$) is that both Beal and Friston refer to the distribution of hidden variables as $p(x_i)$ (see Eqn. 2 as an example). As Friston notes (personal communication), “The link between the two formulations rests upon associating the causes $x_i$ and $\theta$ in Beal’s notation with the external states that cause data ($y_i$) in Friston’s formulation (i.e., $x_i$ and $\theta_i$ correspond to $\tilde{\psi}$), where the data in Friston’s formulation become the sensory units states ($\tilde{s}$).”

In support of this, we note that Blei et al. [10] stated that “the approach is to posit a family of approximate densities $Q$, which are defined over the set of latent variables $z = z_{1:m}$ and observations $x = x_{1:n}$ … Then, we try to find that member $q^{*}(z)$ of the set $Q$ that minimizes the Kullback-Leibler (KL) divergence of $q^{*}(z)$ with respect to the exact posterior $p(z|x)$, which represents the probability distribution of the latent variables with regard to the observables, given as

$$q^{*}(z) = \operatorname*{arg\,min}_{q(z) \in Q} \mathrm{KL}\left(q(z) \,\|\, p(z|x)\right).$$”

Note that in this explanation by Blei et al., the observable variable is denoted $x$ instead of $y$, and the independent (and hidden) variable is denoted $z$. Thus, in Beal, the observables are $y$ and the latent variables are $x$, whereas in Blei et al., the observables are $x$ and the latent variables are $z$. (For a useful cross-reference of notation across all three authors, see Table 4.)
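The minimization described by Blei et al. can be made concrete with a small discrete example. The sketch below is illustrative only: the three-state posterior, the one-parameter family `member(t)`, and the grid of candidate values are all assumptions introduced here, not quantities taken from Blei et al. or from Beal. It searches the family for the member with the smallest KL divergence to the posterior.

```python
import math

def kl(q, p):
    """KL(q || p) for discrete distributions with strictly positive entries."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p))

# Hypothetical exact posterior p(z | x) over three latent states.
posterior = [0.6, 0.3, 0.1]

def member(t):
    """Hypothetical one-parameter family Q: mass t on state 0, the rest split evenly."""
    return [t, (1 - t) / 2, (1 - t) / 2]

# Grid search over the family: t = 0.01, 0.02, ..., 0.99.
ts = [k / 100 for k in range(1, 100)]
t_star = min(ts, key=lambda t: kl(member(t), posterior))
q_star = member(t_star)  # the member of Q closest to the posterior in the KL sense
```

Because this family cannot represent the posterior exactly (it forces the last two states to share their mass equally), the minimum divergence is strictly positive; that residual is what variational methods accept in exchange for tractability.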

The important thing to note, and the thing that differentiates Friston’s formulation from Beal’s, is that with Beal, both $p$ and $q$ are applied to the same underlying hidden or independent variable(s) $x$. Essentially, $p$ represents the exact observed data, and $q$ is the model (subject to parameter fitting with $\theta$). In contrast, with Friston, we are dealing with entirely different sets of units; the external units (denoted by $\tilde{\psi}$) and the internal, representational units (denoted by $\tilde{r}$). We can express a probability distribution over each, but they are not necessarily the respective observed data points and modeling functions taken over the same underlying base set of values.

3.2 Integrating over the Model Space

As we will note during subsequent derivations (presented in Subsection 4.1), the integration (in various equations) over $q$ is shown as being with regard to the external system units, $\tilde{\psi}$. More to the point, Friston uses the notation of integrating over $\tilde{\psi}$, in order to show the comparison between his formulation and that of Beal.

Specifically, we will see that Beal uses the integration (Section 5)

$$L(\theta) \geq \sum_{i=1} \int dx_i \, q_{x_i}(x_i) \, \ln \frac{p(x_i, y_i|\theta)}{q_{x_i}(x_i)},$$

and correspondingly, Friston uses the integration (Subsection 4.3)

$$F(\tilde{s},\tilde{a},\tilde{r}) = -\int_{\psi} q(\tilde{\psi}|\tilde{r}) \ln\left(\frac{p(\tilde{\psi},\tilde{s},\tilde{a},\tilde{r})}{q(\tilde{\psi}|\tilde{r})}\right) d\psi.$$

Friston is clearly making an effort to show the correspondence between his formulation and that of Beal. Also, as he envisions it, a summation or integration of the model over the external system may be possible. (His illustration in [5] showed the evolution of the internal states and the Markov blanket from an original “primordial soup” encompassing all the units.)
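The correspondence between the two integrals can be illustrated numerically in the discrete case, where the integrals become sums. The following is a minimal sketch under stated assumptions: the joint values in `joint_xy`, the model distribution `q_model`, and the function names are hypothetical and chosen only for illustration. It evaluates a Beal-style lower bound and a Friston-style free energy on the same numbers, showing that they differ only in sign.

```python
import math

# Hypothetical joint p(x, y) for one fixed observation y, over three hidden states x.
joint_xy = [0.10, 0.25, 0.05]
log_evidence = math.log(sum(joint_xy))  # ln p(y), obtained by summing out x

def lower_bound(q, joint):
    """Beal-style bound: sum over x of q(x) ln[p(x, y) / q(x)]."""
    return sum(qx * math.log(px / qx) for qx, px in zip(q, joint))

def free_energy(q, joint):
    """Friston-style free energy: the negative of the bound above."""
    return -lower_bound(q, joint)

q_model = [0.3, 0.5, 0.2]                          # an arbitrary model distribution
q_exact = [px / sum(joint_xy) for px in joint_xy]  # the exact posterior p(x | y)

bound_model = lower_bound(q_model, joint_xy)  # lies below ln p(y)
bound_exact = lower_bound(q_exact, joint_xy)  # meets ln p(y) when q is the posterior
```

In this toy setting the bound is tight only when the model distribution equals the exact posterior; for any other choice, the free energy exceeds the negative log evidence.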

For the purposes of this Technical Report, though, we envision a system where the distribution $q$ refers strictly to the external units $\tilde{\Psi}$ as conditioned by the internal (representational) units $\tilde{r}$. Over time, the goal is to adjust the units $\tilde{r}$ so that the free energy of the distribution $q$ approximates that of the actual external system with units $\tilde{\psi}$.

Thus, the elements of our system that we’ve been considering so far consist of three things:

  1. The external system, which is composed of units $\tilde{\psi}$; we are trying to model this, and we operate under the presumption that we cannot always directly compute certain measures on this system,

  2. The internal system, which is composed of units $\tilde{r}$; at any given moment we can determine certain measures on this system, yielding $L(\tilde{s},\tilde{a},\tilde{r})$ (we are temporarily ignoring $\tilde{s}$ and $\tilde{a}$), and

  3. A distribution of the external system expressed via the internal system, $q$, where the chief distinction is that when we take an actual value for $q$, we do so with the presumption that the internal system is brought to a free energy equilibrium for a given set of parameter values $\theta$. This means that the measures for a given distribution-in-the-moment, as represented by $L$, would be adjusted to represent what they would be if the internal system were brought to equilibrium, for a specific set of $\theta$.

Thus, when it comes to the integration steps, we will consider that an integration of the distribution $q$ over $\tilde{\psi}$ will be interpreted as integrating over the distribution units themselves ($\tilde{r}$), but with consideration that the distribution $p$ will have come into a free energy equilibrium (subject to parameters $\theta$) that is a best approximation, for that set of $\theta$, to the external system.

4 The Variational Free Energy: Reverse K-L Divergence and Log-Likelihood

In Eqn. 2, the free energy is expressed in two different ways:

  1. As the difference between an enthalpy-like term and the entropy of the distribution $q$, and

  2. As the sum of a surprisal or potential (negative log evidence) and the (reverse) K-L divergence between the probability distributions of the model and the external system.

The following Table 4 presents a “Rosetta Stone” of the differing notations as used by Beal, Friston, and Blei et al.

Notation regarding use of a Markov blanket is presented for Friston only; while Beal does address Markov blankets in his work, we use here only the simpler form of his notation, and similarly use only simple notation presented by Blei et al.

Table 4: The Rosetta Stone: Beal, Friston, and Blei Notation

| Variable / Notation | Beal | Friston | Blei |
|---|---|---|---|
| Observable Variable; Dependent or “Internal States” | $y_i$ | $\lambda, \tilde{r}$ | $x_i$ |
| Hidden Variable; Independent, Latent, or “External States” | $x_i$ | $\tilde{\Psi}$ | $z_i$ |
| Markov “sensing” units | - | $\tilde{s}$ | - |
| Markov “active” units | - | $\tilde{a}$ | - |
| Model parameters | $\theta$ | $m$ | - |
| Model distribution | $q(x)$ (1) | $q(\Psi \mid \lambda)$ (2) | - |
| Observations distribution | $p(y \mid \theta)$ (3) | $p(\Psi, s, a, r \mid m)$ (4) | - |
| Variational free energy | - | $F(\tilde{s},\tilde{a},\tilde{r})$ | - |

The authors specifically identify their notation, according to the following enumerated points (corresponding to elements of Table 4):

  1. Model distribution - Beal: $q_{x_i}(x_i)$: “we use a distinct distribution $q_{x_i}(x_i)$ over the hidden variables …” (Beal, 2003, p. 47, just before Eqn. 2.12),

  2. Model distribution - Friston: $q(\Psi|\lambda)$: “ … a probability density over external states $q(\Psi|\lambda)$ that is encoded (parametrized) by internal states.” (Friston, 2013, p. 4, just before Lemma 2.1),

  3. Observations - Beal: $p(y|\theta)$: “ … [the] generative model that produces a dataset $y = \{y_1, \ldots, y_n\}$ consisting of $n$ independent and identically distributed (i.i.d.) items, generated using a set of hidden variables $x = \{x_1, \ldots, x_n\}$ such that the likelihood can be written as a function of $\theta$ …” (Beal, 2003, p. 46, Eqn. 2.9), and

  4. Observations - Friston: $p(\Psi, s, a, r|m)$: “… ergodic density $p(\Psi, s, a, r|m)$ [is] a probability density function over external $\psi \in \Psi$, sensory $s \in S$, active $a \in A$ and internal states $\lambda \in \Lambda$ for a system denoted by $m$” (Friston, 2013, p. 2, Table 1).

In this section, we focus on the second half of Eqn. 2; the equality between the “variational free energy” and the sum of the pooled negative log probabilities of sensory states (and their accompanying representational and active states) and the reverse K-L divergence. Specifically, we wish to show that

$$F(\tilde{s},\tilde{a},\tilde{r}) = L(\tilde{s},\tilde{a},\tilde{r}) + D_{KL}[q(\tilde{\psi}|\tilde{r}) \,\|\, p(\tilde{\psi}|\tilde{s},\tilde{a},\tilde{r})]. \tag{4}$$

In achieving this goal, we will also accomplish two other tasks, namely:

  1. Obtain a precise mathematical formulation for $F(\tilde{s},\tilde{a},\tilde{r})$, and

  2. Interpret this mathematical formulation in a useful manner.
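Before working through the algebra, the identity in Eqn. 4 can be sanity-checked on a toy discrete system. This is only a sketch under stated assumptions: the four joint values in `joint` (standing in for $p(\tilde{\psi},\tilde{s},\tilde{a},\tilde{r})$ at fixed blanket and internal states), the uniform model distribution `q`, and the reading of $L$ as a single negative log probability (the surprisal) are all introduced here for illustration.

```python
import math

# Hypothetical p(psi, s, a, r) for fixed (s, a, r), over four external states psi.
joint = [0.08, 0.02, 0.06, 0.04]
# Hypothetical model distribution q(psi | r) over the same four states.
q = [0.25, 0.25, 0.25, 0.25]

marginal = sum(joint)                        # p(s, a, r), summing out psi
posterior = [pj / marginal for pj in joint]  # p(psi | s, a, r)

surprisal = -math.log(marginal)              # L(s, a, r), taken here as -ln p(s, a, r)
kl_term = sum(qi * math.log(qi / pi) for qi, pi in zip(q, posterior))
free_energy = sum(qi * math.log(qi / pj) for qi, pj in zip(q, joint))

# Difference between the two sides of Eqn. 4; zero up to rounding error.
gap = free_energy - (surprisal + kl_term)
```

Since the K-L term is non-negative, the same numbers also show that the free energy is an upper bound on the surprisal.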

We begin our derivation of Eqn. 4 by first considering the definition for the reverse K-L divergence in the context of the system that we are describing (and using the notation advanced by Friston (2015) [6]).

4.1 Interpreting the Reverse K-L Divergence

For the discrete case, we write the reverse Kullback-Leibler (K-L) divergence as

$$D_{KL}[q(\tilde{\psi}|\tilde{r}) \,\|\, p(\tilde{\psi}|\tilde{s},\tilde{a},\tilde{r})] = \sum_{i=1}^{I} q(\tilde{\psi}|\tilde{r}) \ln\left(\frac{q(\tilde{\psi}|\tilde{r})}{p(\tilde{\psi}|\tilde{s},\tilde{a},\tilde{r})}\right). \tag{5}$$

For a discussion of the K-L divergence, together with the reverse K-L divergence (which is used in all generative methods, including variational inference and active inference), see Maren (2024) [15].

We briefly interpret the physical meaning of the terms in Eqn. 5. The reverse K-L divergence measures the divergence between the model-distribution $q$ of (i.e., probability distribution over) the external system, as conditioned on the representation $\tilde{r}$, against the actual distribution of the external system itself, $p(\tilde{\psi}|\tilde{s},\tilde{a},\tilde{r})$.
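To make the word “reverse” concrete, the sketch below evaluates the divergence in both directions for a pair of small discrete distributions. The three-state values for `q` and `p` are hypothetical and serve only as stand-ins for $q(\tilde{\psi}|\tilde{r})$ and $p(\tilde{\psi}|\tilde{s},\tilde{a},\tilde{r})$.

```python
import math

def kl_divergence(a, b):
    """Sum over states of a * ln(a / b); assumes strictly positive entries."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b))

q = [0.7, 0.2, 0.1]  # hypothetical model distribution q(psi | r)
p = [0.5, 0.3, 0.2]  # hypothetical actual distribution p(psi | s, a, r)

reverse_kl = kl_divergence(q, p)  # D_KL[q || p]: weighted by the model q, as in Eqn. 5
forward_kl = kl_divergence(p, q)  # D_KL[p || q]: weighted by the actual distribution p
```

The two values differ, which is why the ordering of the arguments matters: in the reverse form, states to which the model assigns little probability contribute little to the sum, whatever the actual distribution says about them.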

We note that any time we write $p(x)$, we are implicitly writing $p(x|m)$, because we are using $p$ to represent the notion of a (generative) distribution that uses a certain parameter set $\theta$.

The model-distribution $q$ is a model of the external system, $\tilde{\psi}$, which is why we write $q = q(\tilde{\psi}|\tilde{r})$. The key feature in computing $q$ is that (for the application being considered here) we take it at the equilibrium state. That is, $q$ corresponds to the equilibrium free energy of the external system, which can be computed (or approximated) if we have a suitable free energy equation. Thus, in Eqn. 5, we are looking at the divergence between the model-distribution of the system at equilibrium and the probabilities $p$ of various components of the system, potentially in a not-yet-at-equilibrium state.

The parameter(s) $\theta$ directly influence $p$, but the notation for $\theta$ is suppressed in this section.

Thus, we can read the term $q(\tilde{\psi}|\tilde{r})$ as the “probability distribution of the model of the external system $\tilde{\psi}$, which is computed based solely on the value of the representational units $\tilde{r}$ that are isolated from the external system $\tilde{\psi}$ by a Markov blanket, but these representational units are to be considered with their at-equilibrium values.”

Next, we examine the term $p(\tilde{\psi}|\tilde{s},\tilde{a},\tilde{r})$, which expresses the probability distribution of units $\tilde{\psi}$ in the external system, conditioned on the Markov blanket sensory units $\tilde{s}$ and action units $\tilde{a}$, along with the representational units $\tilde{r}$. We recall, from the design of the entire system (external plus Markov blanket plus representational units), and also from figures given in Friston [5] and Friston et al. [6], and replicated in Figure 3, that the representational units do not communicate directly with the external units. Thus, the dependence of $\tilde{\psi}$ on $\tilde{r}$ is very much an implicit relationship; one that is at a distance because the direct interactions of the units in $\tilde{\psi}$ are exclusively with $\tilde{s}$ and $\tilde{a}$. Further, the system design is that the sensory units receive inputs from the external units $\tilde{\psi}$, but do not directly influence the $\tilde{\psi}$ themselves.

Thus, the conditional relationship expressed in $p(\tilde{\psi}|\tilde{s},\tilde{a},\tilde{r})$ seems a little forced. However, it is the basis for our next steps in the derivation, and we will think of it simply as stating that the external system can indeed be influenced by the evolving values for the representational system $\tilde{r}$.

With this in mind, we go back to Eqn. 5, and interpret the Right-Hand-Side (RHS) of the equation. It states that the reverse K-L divergence is the sum, over all possible states in which the system can find itself, of the (model probability) distribution $q$ for a given specific state, multiplied by the natural log of a ratio: the model distribution $q$ for that state, divided by the actual probability $p$ for that state of the external system. In this expression, the actual distribution of the external system, $p(\tilde{\psi}|\tilde{s},\tilde{a},\tilde{r})$, is conditioned by the states of the Markov blanket and the representational system.

Thinking ahead, we consider how the relationship between the model and the external system is traditionally expressed, specifically as a reverse K-L divergence. Typically, the actual “external” system is some set of values, $p(x)$, and the model is given as $q(x)$. The (generative) distribution $p$ is specified via certain parameters $\theta$.

In our situation, though, we will be looking at both an external system and an internal model that will each, separately, come to their respective free energy equilibrium points. That is, there will not be a sum over all possible values of some distribution over $i$; there will instead be a single probability distribution $p$ and a single probability distribution $q$, after each has reached free energy minimization.

4.2 Rewriting the Bayesian Posterior Distribution

Before we rewrite the reverse K-L divergence term of Eqn. 5, we first recall how the Bayesian posterior probability density can be rewritten, as framed in Blei et al. [10].

Consider a system that has a set of observable variables $\mathbf{v} = v_{1..V}$ and a set of latent or “hidden” variables $\mathbf{w} = w_{1..W}$. In a feedforward neural network, for example, the observable variables $\mathbf{v}$ would be the values of the output layer neurons, and the latent (hidden) variables $\mathbf{w}$ would be the associated values of the hidden layer neurons.

Similarly, we can envision many other situations in which we can identify an observation that is a function of multiple input factors. Sometimes, not all of those input factors can be directly observed.

In the Bayesian formalism, the prior density of the (set of) latent variables $w$ is defined as $p(w)$. A Bayesian model relates these latent variables to the observations $v$ through the likelihood $p(v|w)$. The interpretation is straightforward; it speaks to the likelihood of observing an outcome or observable variable $v$ given the hidden variables $w$. This is the likelihood, as distinct from the prior distribution $p(w)$.

Sometimes, though, we don’t have an accurate means of establishing the values for the latent or hidden variables $w$. Thus, we use approximate inference to determine the posterior distribution, $p(w|v)$. This means that we are trying to estimate the values of the hidden variables, seeing only the values for the observable variables.

To rewrite the probability density, we first consider a system that can be described in terms of a joint density of latent variables $\mathbf{w} = w_{1..W}$ and observations $\mathbf{v} = v_{1..V}$, where the conditional density function is given as

$$p(w|v) = p(w,v)/p(v). \tag{6}$$

Conversely, we also have

$$p(w,v) = p(w|v)\,p(v). \tag{7}$$
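Eqns. 6 and 7 can be illustrated with a small joint probability table. The sketch below is a toy example under stated assumptions: the table values, with two latent states and three observable states, are hypothetical. It forms the marginal $p(v)$, computes the posterior by Eqn. 6, and rebuilds the joint by Eqn. 7.

```python
# Hypothetical joint p(w, v): rows index the latent w, columns index the observed v.
joint = [[0.10, 0.20, 0.10],
         [0.30, 0.05, 0.25]]
n_w, n_v = len(joint), len(joint[0])

# Marginal p(v): sum the joint over the latent variable w.
p_v = [sum(joint[w][v] for w in range(n_w)) for v in range(n_v)]

# Eqn. 6: p(w | v) = p(w, v) / p(v), applied column by column.
posterior = [[joint[w][v] / p_v[v] for v in range(n_v)] for w in range(n_w)]

# Eqn. 7: p(w, v) = p(w | v) p(v), which recovers the original table.
rebuilt = [[posterior[w][v] * p_v[v] for v in range(n_v)] for w in range(n_w)]
```

Each column of `posterior` sums to one, since it is a distribution over the latent variable for one fixed observation.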

4.3 Rewriting the Reverse K-L Divergence

We wish now to rewrite the probability density of the external states $\tilde{\psi}$ that is conditional on the Markov blanket and internal (representational) states, so that the probability density becomes a joint distribution.

We express the conditional distribution from Eqn. 5 in terms of the joint probability distribution $p(\tilde{\psi},\tilde{s},\tilde{a},\tilde{r})$, together with the marginal probability distribution over the Markov blanket and representational states, $p(\tilde{s},\tilde{a},\tilde{r})$:

$$p(\tilde{\psi}|\tilde{s},\tilde{a},\tilde{r}) = p(\tilde{\psi},\tilde{s},\tilde{a},\tilde{r})/p(\tilde{s},\tilde{a},\tilde{r}). \tag{8}$$

Substituting this result into Eqn. 5, we have

$$D_{KL}[q(\tilde{\psi}|\tilde{r}) \,\|\, p(\tilde{\psi}|\tilde{s},\tilde{a},\tilde{r})] = \sum_{i=1}^{I} q(\tilde{\psi}|\tilde{r}) \ln\left(\frac{q(\tilde{\psi}|\tilde{r})\,p(\tilde{s},\tilde{a},\tilde{r})}{p(\tilde{\psi},\tilde{s},\tilde{a},\tilde{r})}\right), \tag{9}$$

which we can reorganize to write as

$$D_{KL}[q(\tilde{\psi}|\tilde{r}) \,\|\, p(\tilde{\psi}|\tilde{s},\tilde{a},\tilde{r})] = \sum_{i=1}^{I} q(\tilde{\psi}|\tilde{r}) \ln\left(p(\tilde{s},\tilde{a},\tilde{r})\right) + \sum_{i=1}^{I} q(\tilde{\psi}|\tilde{r}) \ln\left(\frac{q(\tilde{\psi}|\tilde{r})}{p(\tilde{\psi},\tilde{s},\tilde{a},\tilde{r})}\right). \tag{10}$$

Following Beal [7] (Eqns. 2.32 - 2.34), we note that the sum over the model terms $q$ in the first term on the RHS comes to 1 (implicitly, there is a double summation there, and $q$ is independent of $p$), so that we have

$$\begin{aligned}D_{KL}[q(\tilde{\psi}|\tilde{r})\,||\,p(\tilde{\psi}|\tilde{s},\tilde{a},\tilde{r})]&=\sum_{i=1}^{I}\ln\left(p(\tilde{s},\tilde{a},\tilde{r})\right)\\&\quad+\sum_{i=1}^{I}q(\tilde{\psi}|\tilde{r})\ln\left(\frac{q(\tilde{\psi}|\tilde{r})}{p(\tilde{\psi},\tilde{s},\tilde{a},\tilde{r})}\right),\end{aligned}\tag{11}$$

keeping in mind that the $q$ are taken over the hidden or latent variables, and that the sum over $i=1,\ldots,I$ is taken over the $I$ units in the system being modeled.
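The collapse of the first RHS term (from Eqn. 10 to Eqn. 11) can be checked numerically on a toy discrete system. The following is a minimal sketch and not part of the original derivation: it assumes a two-state external variable, a single fixed configuration of the blanket and internal states, and hypothetical probability values. For simplicity the sum runs over the discrete values of the external state, which is an assumption of the sketch.

```python
import math

# Hypothetical joint p(psi, s, a, r) for two external states psi in {0, 1}
# and one fixed observed configuration (s, a, r). Values are illustrative only.
p_joint = [0.12, 0.28]
p_y = sum(p_joint)                        # marginal p(s, a, r)
p_post = [pj / p_y for pj in p_joint]     # posterior p(psi | s, a, r)

# Hypothetical model density q(psi | r); it must sum to 1.
q = [0.5, 0.5]

# Left-hand side of Eqn. 9: K-L divergence from q to the posterior.
kl = sum(qi * math.log(qi / pi) for qi, pi in zip(q, p_post))

# First RHS term of Eqn. 10: sum over psi of q * ln p(s, a, r).
# Because sum(q) == 1, it collapses to ln p(s, a, r), as in Eqn. 11.
first_term = sum(qi * math.log(p_y) for qi in q)

# Second RHS term: sum over psi of q * ln(q / p(psi, s, a, r)).
second_term = sum(qi * math.log(qi / pj) for qi, pj in zip(q, p_joint))

print("first term vs ln p(s,a,r):", first_term, math.log(p_y))
print("K-L vs sum of the two terms:", kl, first_term + second_term)
```

Under these assumptions, the two printed pairs should agree to floating-point precision for any valid choice of `q`.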

This is a good place to note that Friston typically writes the first term on the RHS of Eqn. 11 without the summation sign; i.e., the summation is subsumed into the notation.

For example, in Eqn. 2.8 of Friston (2013) [5] and in Eqns. 3.2 and 3.4 of Friston et al. (2015) [6], Friston uses

$$L(\tilde{s},\tilde{a},\tilde{r})=-\ln p(\tilde{s},\tilde{a},\tilde{r}),\tag{12}$$

where clearly the meaning (see Eqn. 11) is

$$L(\tilde{s},\tilde{a},\tilde{r})=-\sum_{i=1}^{I}\ln p(\tilde{s},\tilde{a},\tilde{r}).\tag{13}$$

Further evidence that Friston intends the summation (or, as suitable for the situation, an integration) is found in Friston’s expression (2013, see Eqn. 2.7) [5] for $F(\tilde{s},\tilde{a},\tilde{r})$, as

$$F(\tilde{s},\tilde{a},\tilde{r})=-\int_{\psi}q(\tilde{\psi}|\tilde{r})\ln\left(\frac{p(\tilde{\psi},\tilde{s},\tilde{a},\tilde{r}|m)}{q(\tilde{\psi}|\tilde{r})}\right)d\psi.$$

We will discuss this equation in greater context later in this Subsection.

Returning to our original line of thought, we rearrange terms in Eqn. 11 to obtain

$$\begin{aligned}\sum_{i=1}^{I}q(\tilde{\psi}|\tilde{r})\ln\left(\frac{q(\tilde{\psi}|\tilde{r})}{p(\tilde{\psi},\tilde{s},\tilde{a},\tilde{r})}\right)&=-\sum_{i=1}^{I}\ln\left(p(\tilde{s},\tilde{a},\tilde{r})\right)\\&\quad+D_{KL}[q(\tilde{\psi}|\tilde{r})\,||\,p(\tilde{\psi}|\tilde{s},\tilde{a},\tilde{r})].\end{aligned}\tag{14}$$

Adopting Friston’s notation, in which the summation (or integration, in the case of continuous variables) is subsumed, we can write

$$\begin{aligned}q(\tilde{\psi}|\tilde{r})\ln\left(\frac{q(\tilde{\psi}|\tilde{r})}{p(\tilde{\psi},\tilde{s},\tilde{a},\tilde{r})}\right)&=-\ln\left(p(\tilde{s},\tilde{a},\tilde{r})\right)\\&\quad+D_{KL}[q(\tilde{\psi}|\tilde{r})\,||\,p(\tilde{\psi}|\tilde{s},\tilde{a},\tilde{r})].\end{aligned}\tag{15}$$

Note that the following equations are being written in Friston’s style, with summation (or integration) subsumed.

We notice that Eqn. 15 is similar in form to Eqn. 4; sufficiently so that we can establish the identities for $F(\tilde{s},\tilde{a},\tilde{r})$ and $L(\tilde{s},\tilde{a},\tilde{r})$.

For the Left-Hand-Side (LHS) of Eqn. 15, we create the identity for $F(\tilde{s},\tilde{a},\tilde{r})$ as

$$\begin{aligned}F(\tilde{s},\tilde{a},\tilde{r})&=q(\tilde{\psi}|\tilde{r})\ln\left(\frac{q(\tilde{\psi}|\tilde{r})}{p(\tilde{\psi},\tilde{s},\tilde{a},\tilde{r})}\right)\\&=-q(\tilde{\psi}|\tilde{r})\ln\left(\frac{p(\tilde{\psi},\tilde{s},\tilde{a},\tilde{r})}{q(\tilde{\psi}|\tilde{r})}\right).\end{aligned}\tag{16}$$

This gives us the precise form for $F(\tilde{s},\tilde{a},\tilde{r})$, the variational free energy. We note the specific difference between this Eqn. 16 and Eqn. 5; both have the form of a reverse K-L divergence. However, in Eqn. 16, the divergence is between the distribution $q$ of the external system and the joint model-distribution of both the external system and the internal system; in Eqn. 5, the divergence is between the model distribution $q$ and the (observed) distribution of the external system as conditioned on the representational system and the Markov blanket.

We notice also (in the second part of Eqn. 16) that Friston et al. prefer to represent the variational free energy as the negative of a divergence-like term; it now compares the joint probability distribution (of the external system observation) against the model, although the multiplying factor is still that of the model distribution.

Similarly, for the first term on the RHS of Eqn. 15, we also take note of the interpretation offered by Friston (2015) [6], which gives us

$$L(\tilde{s},\tilde{a},\tilde{r})=-\ln p(\tilde{s},\tilde{a},\tilde{r}),\tag{17}$$

so that $L$ is defined as the negative of the logarithm of the probability of internal (representational) units, together with the Markov blanket units.

As another notational note, Friston actually incorporates dependence on the model parameters into this term; see Eqn. 3.1 of Friston et al. (2015) [6] and Eqn. 2.7 of Friston (2013) [5] for $F(\tilde{s},\tilde{a},\tilde{r})$, which gives

$$L(\tilde{s},\tilde{a},\tilde{r})=-\ln p(\tilde{s},\tilde{a},\tilde{r}|m).$$

If we were to substitute these two expressions into Eqn. 15, we would obtain

$$\begin{aligned}F(\tilde{s},\tilde{a},\tilde{r})&=-q(\tilde{\psi}|\tilde{r})\ln\left(\frac{p(\tilde{\psi},\tilde{s},\tilde{a},\tilde{r})}{q(\tilde{\psi}|\tilde{r})}\right)\\&=L(\tilde{s},\tilde{a},\tilde{r})+D_{KL}[q(\tilde{\psi}|\tilde{r})\,||\,p(\tilde{\psi}|\tilde{s},\tilde{a},\tilde{r})],\end{aligned}\tag{18}$$

which is identical with the second part of Eqn. 4, and also with Friston (2015), Eqn. 3.2 [6], and also with Friston (2013) Eqn. 2.8 [5].
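The decomposition in Eqn. 18 can be illustrated numerically. The sketch below is not taken from the cited sources: it assumes a two-state external variable, one fixed blanket and internal configuration, and hypothetical probability values, and the helper names `free_energy` and `kl_divergence` are introduced here only for illustration. It evaluates $F$ directly from the joint distribution and compares it with $L$ plus the K-L divergence for several candidate densities $q$.

```python
import math

def free_energy(q, p_joint):
    """F = sum over psi of q * ln(q / p(psi, s, a, r)); first line of Eqn. 18."""
    return sum(qi * math.log(qi / pj) for qi, pj in zip(q, p_joint) if qi > 0)

def kl_divergence(q, p):
    """Reverse K-L divergence D_KL[q || p] for discrete densities."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

# Hypothetical joint p(psi, s, a, r) for one fixed (s, a, r); illustrative only.
p_joint = [0.12, 0.28]
p_y = sum(p_joint)                          # p(s, a, r)
posterior = [pj / p_y for pj in p_joint]    # p(psi | s, a, r)
L = -math.log(p_y)                          # negative log-likelihood, Eqn. 17

# Two arbitrary candidate densities, plus the exact posterior.
for q in ([0.5, 0.5], [0.9, 0.1], posterior):
    F = free_energy(q, p_joint)
    gap = kl_divergence(q, posterior)
    print(f"q={q}  F={F:.6f}  L+KL={L + gap:.6f}  F-L={F - L:.6f}")
```

In this toy setting $F$ is never smaller than $L$, and the gap closes only when $q$ coincides with the posterior, which is the sense in which minimizing $F$ drives the divergence toward zero.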

Once again, it may be useful to consider the distinction between $q$ and $L$. More precisely, we need to ask ourselves exactly what it is that we mean when we speak of $p(\tilde{s},\tilde{a},\tilde{r})$. This is presumably the probability distribution of the internal (representational) and Markov blanket states. However, we have been representing the distribution of the internal and Markov blanket states as $p(x|\theta)$; that is, as a probability distribution of the observable, representational system that is encoded by the internal states $\tilde{s},\tilde{a},\tilde{r}$ together with a (set of) model parameters $\theta$.

A useful interpretation is that we may take $L(\tilde{s},\tilde{a},\tilde{r})$ to be the actual distribution of the representational system (as observed directly over its various components), and $p$ to be the computational distribution of the representation. In short, the negative log likelihood of sensory states (and active plus internal states, i.e., $L(\tilde{s},\tilde{a},\tilde{r})$) pertains to the actual state of affairs, while the free energy corresponds to the equivalent measure that would be obtained if the sensory units were caused by the latent or hidden states encoded by the internal states. By minimizing free energy, the two become close but (in general) will never be exactly the same.

The physical implication here could be that we will obtain $p$ as the probability distribution for the observations in a free energy-minimized state. In contrast, the values of specific elements in the distribution over $(\tilde{s},\tilde{a},\tilde{r})$ may, at a certain point, not be in a free energy-minimized state.

We have thus accomplished half of our goal, in deriving one of the equalities of Eqn. 2, involving the negative log-likelihood of the probability of states that are actually observed, added to the reverse K-L divergence of the probability distribution of the model with respect to the probability distribution of the external units.

Examining Eqn. 16, we note the correspondence between the expression for $F(\tilde{s},\tilde{a},\tilde{r})$ given there and the corollary expression given for the free energy of a system in Friston (2013, see Eqn. 2.7) [5], as

$$F(\tilde{s},\tilde{a},\tilde{r})=-\int_{\psi}q(\tilde{\psi}|\tilde{r})\ln\left(\frac{p(\tilde{\psi},\tilde{s},\tilde{a},\tilde{r})}{q(\tilde{\psi}|\tilde{r})}\right)d\psi.\tag{19}$$

We take note that the integration in Eqn. 19 is over the external units $\tilde{\psi}$, reinforcing our understanding that the free energy $F(\tilde{s},\tilde{a},\tilde{r})$ is really the variational free energy of the external system, with the probability distribution in the numerator of the logarithmic term being taken over the joint distribution of external units $\psi$ together with the internal or representational units $\tilde{r}$ as well as the Markov blanket units.

We notice the correspondence between Eqn. 18 above and Eqn. 2.34 of Beal [7], in that $F(\tilde{s},\tilde{a},\tilde{r})=-F(q_{x}(x),\theta)$, where the former notation for the free energy is Friston’s, and the latter is Beal’s. (A minor technical note is that the free energy of a probability distribution ($q$) is a functional, while the free energy of its sufficient statistics ($\tilde{r}$) becomes a function.) Correspondingly, Eqn. 18 is precisely the negative of Eqn. 2.34 in Beal.

Now, we wish to show the first part of Eqn. 2, which expresses the same free energy in terms of an expectation of log-likelihood of a certain term (whose precise nature will be clarified) minus the entropy of the model as applied to the external units (which will also need to be verified).

5 The Variational Free Energy: Log-Likelihood Expectation and Entropy

The previous Subsection 4.3 presented a derivation for the second half of Eqn. 2, giving an expression for the variational free energy in terms of the (negative of the) log-likelihood (over the representational system) and the K-L divergence (of the model vis-à-vis the external system). In this section, we show how the other equality expressed in Eqn. 2 can be derived, giving the free energy in terms of what Friston calls an “expected enthalpy” and an entropy term, the exact natures of which will be determined as we proceed [6].

Several sources remark that while the first term in the second line of Eqn. 2 is computable, the second term (the reverse K-L divergence) is not [7, 10]. This then motivates the expression given on the first line of the equation, with the intention of giving an alternative - and computable - formulation for the variational free energy $F$.

It is worth noting what these various sources say.

Friston et al. (2015) [6] state (p. 3, just before Lemma 3.1), with regard to the Lagrangian $L(\tilde{s},\tilde{a},\tilde{r})$, that “Although we know this Lagrangian exists, it is practically (almost) impossible to evaluate its form. However, there is an alternative formulation of equation (3.1) that allows one to describe the flow in terms of a probabilistic model of how a system thinks it should behave.” (This is given prior to Friston’s Eqn. 3.2, which is our Eqn. 2, and which motivates the need for that equation.) … “The solution to equation (3.2) implies the internal states minimize free energy rendering the divergence zero (by Gibbs inequality) … In short, the internal states will appear to engage in Bayesian inference, effectively inferring the (external) causes of sensory states. Furthermore, the active states are complicit in this inference, sampling sensory states that maximize model evidence: in other words, selecting sensations that the system expects. This is active inference, in which internal states and action minimize free energy—or maximize model evidence—in a way that is consistent with the good regulator theorem and related treatments of self-organization [49,58–61].”

Beal (2003, p. 45) [7] states that: “A more principled approach is to estimate the integral numerically by evaluating the integrand at many different $\theta$ via Monte Carlo methods. In the limit of an infinite number of samples of $\theta$ this produces an accurate result, but despite ingenious attempts to curb the curse of dimensionality in $\theta$ using methods such as Markov chain Monte Carlo, these methods remain prohibitively computationally intensive in interesting models. These methods were reviewed in the last chapter, and the bulk of this chapter concentrates on a third way of approximating the integral, using variational methods. The key to the variational method is to approximate the integral with a simpler form that is tractable, forming a lower or upper bound. The integration then translates into the implementationally simpler problem of bound optimisation: making the bound as tight as possible to the true value.”

Blei et al. (2018, p. 2) [10] state that: “For decades, the dominant paradigm for approximate inference has been MCMC [Markov chain Monte Carlo] … However, there are problems for which we cannot easily use this approach. These arise particularly when we need an approximate conditional faster than a simple MCMC algorithm can produce, such as when data sets are large or models are very complex. In these settings, variational inference provides a good alternative approach to approximate Bayesian inference. Rather than use sampling, the main idea behind variational inference is to use optimization. First, we posit a family of approximate densities $Q$. This is a set of densities over the latent variables. Then, we try to find the member of that family that minimizes the Kullback-Leibler (KL) divergence to the exact posterior … Finally, we approximate the posterior with the optimized member of the family $q^{*}(\cdot)$.”

Of these various explanations, that of Blei et al. seems the most straightforward.

We wish to minimize the [reverse] Kullback-Leibler divergence of Eqn. 2.
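To make the optimization view concrete, the following minimal sketch searches a one-parameter family of densities for the member that minimizes the free energy. It is an illustration only: the two-state external variable, the probability values, the weight `w`, and the grid search are all assumptions of this sketch rather than anything prescribed by Friston, Beal, or Blei et al. The search itself uses only the joint distribution; the posterior is computed afterwards solely to check the result.

```python
import math

# Hypothetical joint p(psi, s, a, r) for one fixed (s, a, r); illustrative only.
p_joint = [0.12, 0.28]

def free_energy(w):
    """Free energy of the two-state density q = [1 - w, w] against p_joint."""
    q = [1.0 - w, w]
    return sum(qi * math.log(qi / pj) for qi, pj in zip(q, p_joint))

# The approximating family: a grid of candidate weights strictly inside (0, 1).
candidates = [k / 1000 for k in range(1, 1000)]

# Variational step: choose the family member with the lowest free energy.
best_w = min(candidates, key=free_energy)

# For comparison only: the exact posterior weight of the second state.
posterior_w = p_joint[1] / sum(p_joint)

print("best weight found:", best_w)
print("exact posterior weight:", posterior_w)
print("F at optimum:", free_energy(best_w), " -ln p(s,a,r):", -math.log(sum(p_joint)))
```

Because the negative log-likelihood term does not depend on $q$, minimizing the free energy over the family is equivalent to minimizing the reverse K-L divergence to the posterior, at least in this toy discrete case.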

The derivation of the first part of Eqn. 2 is found in Beal (2003) [7], Eqns. 2.15 and 2.16.

For convenience, Eqn. 2 (Eqn. 3.2 in Friston et al. (2015) [6]) is presented again, as

$$\begin{aligned}F(\tilde{s},\tilde{a},\tilde{r})&=E_{q}[L(\tilde{x})]-H[q(\tilde{\psi}|\tilde{r})]\\&=L(\tilde{s},\tilde{a},\tilde{r})+D_{KL}[q(\tilde{\psi}|\tilde{r})\,||\,p(\tilde{\psi}|\tilde{s},\tilde{a},\tilde{r})].\end{aligned}$$

Our goal is to verify the first equality presented in this equation. To accomplish this, we follow a line of reasoning presented in Beal (2003) [7], who introduced a formulation for the log likelihood.

We begin with Beal’s Eqn. 2.10, given as

$$L(\theta)\equiv\ln p(y|\theta)=\sum_{i=1}^{n}\ln p(y_{i}|\theta)=\sum_{i=1}^{n}\ln\int dx_{i}\,p(x_{i},y_{i}|\theta).\tag{20}$$

Beal’s Eqns. 2.12 - 2.16 are reproduced here as

$$\begin{aligned}L(\theta)&=\sum_{i=1}^{n}\ln\int dx_{i}\,p(x_{i},y_{i}|\theta)\\&=\sum_{i=1}^{n}\ln\int dx_{i}\,q_{x_{i}}(x_{i})\,\frac{p(x_{i},y_{i}|\theta)}{q_{x_{i}}(x_{i})}\\&\geq\sum_{i=1}^{n}\int dx_{i}\,q_{x_{i}}(x_{i})\,\ln\frac{p(x_{i},y_{i}|\theta)}{q_{x_{i}}(x_{i})}\\&=\sum_{i=1}^{n}\left(\int dx_{i}\,q_{x_{i}}(x_{i})\,\ln p(x_{i},y_{i}|\theta)-\int dx_{i}\,q_{x_{i}}(x_{i})\,\ln q_{x_{i}}(x_{i})\right)\\&=\sum_{i=1}^{n}\int dx_{i}\,q_{x_{i}}(x_{i})\,\ln p(x_{i},y_{i}|\theta)-\sum_{i=1}^{n}\int dx_{i}\,q_{x_{i}}(x_{i})\,\ln q_{x_{i}}(x_{i})\\&\equiv F(q_{x_{1}}(x_{1}),\ldots,q_{x_{n}}(x_{n}),\theta).\end{aligned}\tag{21}$$

In the preceding Eqn. 21, we note that the last two lines are those that interest us; we see there a formal similarity between those terms and those in the first equality expression of Eqn. 2. Specifically, we desire to show a correspondence between Beal’s Eqns. 2.12 - 2.16, given as

$$\begin{aligned}
L(\theta) &= \sum_{i=1} \ln \int dx_{i}\, q_{x_{i}}(x_{i})\, \frac{p(x_{i},y_{i}|\theta)}{q_{x_{i}}(x_{i})} \\
&\geq \sum_{i=1}\left(\int dx_{i}\, q_{x_{i}}(x_{i})\, \ln p(x_{i},y_{i}|\theta)\right) - \int dx_{i}\, q_{x_{i}}(x_{i})\, \ln q_{x_{i}}(x_{i}) \\
&\equiv F(q_{x_{1}}(x_{1}),\ldots,q_{x_{n}}(x_{n}),\theta),
\end{aligned} \tag{22}$$

and Friston’s equation, which we presented earlier as Eqn. 2 (Eqn. 3.2 in Friston et al. (2015) [6]) and which we present again as

$$F(\tilde{s},\tilde{a},\tilde{r}) = E_{q}[L(\tilde{x})] - H[q(\tilde{\psi}|\tilde{r})]. \tag{23}$$

The “greater-than-or-equal” relation in Eqn. 22 is due to Jensen’s inequality, and is essential to one of the steps shown in Eqn. 21; this is a minor omission in Friston’s phrasing, and does not substantially impact our translation.
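The role of Jensen’s inequality in Eqn. 22 can be checked numerically. The following is a minimal sketch, not part of Beal’s or Friston’s derivations: it assumes discrete latent variables (so each integral becomes a sum) and uses randomly generated, purely hypothetical values for the joint probabilities and the variational distributions. The names `p_joint`, `q`, `n_points`, and `n_states` are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

n_points, n_states = 5, 4   # hypothetical sizes: 5 data points y_i, 4 latent states per x_i

# p_joint[i, k] stands in for p(x_i = k, y_i | theta), with y_i held at its observed value.
p_joint = rng.uniform(0.01, 0.2, size=(n_points, n_states))

# q[i, k] stands in for q_{x_i}(x_i = k); each row is normalised to sum to 1.
q = rng.uniform(0.1, 1.0, size=(n_points, n_states))
q /= q.sum(axis=1, keepdims=True)

# First line of Eqn. 22: L(theta) = sum_i ln sum_x q(x) * p(x, y_i | theta) / q(x).
log_likelihood = np.sum(np.log(np.sum(q * (p_joint / q), axis=1)))

# Lower bound: sum_i [ sum_x q ln p - sum_x q ln q ], with the integrals replaced by sums.
lower_bound = np.sum(np.sum(q * np.log(p_joint), axis=1) - np.sum(q * np.log(q), axis=1))

jensen_gap = log_likelihood - lower_bound   # non-negative by Jensen's inequality
print(f"L(theta) = {log_likelihood:.4f}, bound = {lower_bound:.4f}, gap = {jensen_gap:.4f}")
```

In this discrete setting the gap vanishes only when each row of `q` equals the corresponding posterior over the latent states; this is the equality case invoked in the next step.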

We will note, as we pursue our investigation, that Friston’s identification of F𝐹Fitalic_F is the negative of that used by Beal; this will change the direction of the inequality, but will again not impact our work.

As a precursor step, we use the equality case and take the negative of Beal’s expression from Eqn. 22, and write

$$\begin{aligned}
F(\tilde{s},\tilde{a},\tilde{r}) &= -F(q_{x_{1}}(x_{1}),\ldots,q_{x_{n}}(x_{n}),\theta) \\
&= -\sum_{i=1}\left(\int dx_{i}\, q_{x_{i}}(x_{i})\, \ln p(x_{i},y_{i}|\theta)\right) \\
&\quad - \left[-\int dx_{i}\, q_{x_{i}}(x_{i})\, \ln q_{x_{i}}(x_{i})\right].
\end{aligned} \tag{24}$$

We identify the three separate correspondences that we wish to establish, relating Friston’s terms (Eqn. 23) to Beal’s (Eqn. 24); specifically that

$$\begin{aligned}
F(\tilde{s},\tilde{a},\tilde{r}) &= -F(q_{x_{1}}(x_{1}),\ldots,q_{x_{n}}(x_{n}),\theta) \\
&= -\sum_{i=1} \ln \int dx_{i}\, q_{x_{i}}(x_{i})\, \frac{p(x_{i},y_{i}|\theta)}{q_{x_{i}}(x_{i})},
\end{aligned} \tag{25}$$

and

$$E_{q}[L(\tilde{x})] = -\sum_{i=1}\left(\int dx_{i}\, q_{x_{i}}(x_{i})\, \ln p(x_{i},y_{i}|\theta)\right), \tag{26}$$

and

$$H[q(\tilde{\psi}|\tilde{r})] = -\int dx_{i}\, q_{x_{i}}(x_{i})\, \ln q_{x_{i}}(x_{i}). \tag{27}$$

We remind ourselves (to avoid confusion for any who would be reading and comparing the original documents) that Friston’s $F(\tilde{s},\tilde{a},\tilde{r})$ is the negative of Beal’s $F(q_{x_{1}}(x_{1}),\ldots,q_{x_{n}}(x_{n}),\theta)$.
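The sign convention, and the resulting change in the direction of the bound, can be illustrated with a small numerical sketch. It assumes a single observation with a discrete latent variable; the function names `beal_bound` and `friston_free_energy` and the values in `p_joint` are hypothetical and chosen only for illustration.

```python
import numpy as np

def beal_bound(q, p_joint):
    """Beal-style F: sum_x q(x) ln[p(x, y) / q(x)], a lower bound on ln p(y)."""
    return float(np.sum(q * np.log(p_joint / q)))

def friston_free_energy(q, p_joint):
    """Friston-style F: sum_x q(x) ln[q(x) / p(x, y)], an upper bound on -ln p(y)."""
    return float(np.sum(q * np.log(q / p_joint)))

rng = np.random.default_rng(1)
p_joint = np.array([0.10, 0.05, 0.03, 0.02])   # hypothetical p(x, y) for one observed y
log_evidence = np.log(p_joint.sum())           # ln p(y)

for _ in range(3):
    q = rng.dirichlet(np.ones(4))              # a random trial distribution q(x)
    fb, ff = beal_bound(q, p_joint), friston_free_energy(q, p_joint)
    print(f"Beal F = {fb:.4f} <= ln p(y) = {log_evidence:.4f};  "
          f"Friston F = {ff:.4f} >= -ln p(y) = {-log_evidence:.4f}")
```

For every trial distribution the two quantities differ only in sign, so maximizing Beal’s bound and minimizing Friston’s free energy select the same $q$.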

We will address each of these three correspondences in the following three subsections, respectively.

5.1 Equivalence of the Variational Free Energy Expressions

In this subsection, we will show the correspondence given previously as Eqn. 25, that is

$$\begin{aligned}
F(\tilde{s},\tilde{a},\tilde{r}) &= -F(q_{x_{1}}(x_{1}),\ldots,q_{x_{n}}(x_{n}),\theta) \\
&= -\sum_{i=1} \ln \int dx_{i}\, q_{x_{i}}(x_{i})\, \frac{p(x_{i},y_{i}|\theta)}{q_{x_{i}}(x_{i})}.
\end{aligned}$$

We pause to recollect, from Subsection 4.3, Eqn. 16, the term that Friston has identified for variational free energy. We state this again here for reference as

$$\begin{aligned}
F(\tilde{s},\tilde{a},\tilde{r}) &= q(\tilde{\psi}|\tilde{r}) \ln\left(\frac{q(\tilde{\psi}|\tilde{r})}{p(\tilde{\psi},\tilde{s},\tilde{a},\tilde{r})}\right) \\
&= -q(\tilde{\psi}|\tilde{r}) \ln\left(\frac{p(\tilde{\psi},\tilde{s},\tilde{a},\tilde{r})}{q(\tilde{\psi}|\tilde{r})}\right).
\end{aligned} \tag{28}$$

We rewrite Beal’s expression from Eqn. 22 by taking the negative of all terms and dropping the inequality

$$\begin{aligned}
-L(\theta) &= -\sum_{i=1} \ln \int dx_{i}\, q_{x_{i}}(x_{i})\, \frac{p(x_{i},y_{i}|\theta)}{q_{x_{i}}(x_{i})} \\
&= -F(q_{x_{1}}(x_{1}),\ldots,q_{x_{n}}(x_{n}),\theta).
\end{aligned} \tag{29}$$

We note that there is indeed the desired resemblance. To be more clear, we are seeking the correspondence that can be expressed as

$$\begin{aligned}
& q(\tilde{\psi}|\tilde{r}) \ln\left(\frac{p(\tilde{\psi},\tilde{s},\tilde{a},\tilde{r})}{q(\tilde{\psi}|\tilde{r})}\right) \\
&= \sum_{i=1} \ln \int dx_{i}\, q_{x_{i}}(x_{i})\, \frac{p(x_{i},y_{i}|\theta)}{q_{x_{i}}(x_{i})}.
\end{aligned} \tag{30}$$

Clearly, the desired terms are present and in their appropriate order. The key differences are that Beal (on the right-hand side, or RHS) explicitly identifies the summation and integration steps, and that the Friston formalism is expanded: it includes the entire set of elements $\tilde{\psi}$, $\tilde{s}$, $\tilde{a}$, and $\tilde{r}$.

This makes sense; the entire “universe” encompassed in the Friston model is expressed via $\tilde{\psi}$, $\tilde{s}$, $\tilde{a}$, and $\tilde{r}$. Correspondingly, the “universe” modeled in Beal’s approach is the set of observable data points $y_{i}$ and the associated latent variables $x_{i}$. The parameter $\theta$ is expressly identified in Beal’s notation; it is suppressed in this particular notation by Friston, but is evident in various Friston writings (op. cit.).

We give our attention to how Beal’s expression integrates over the $q(x_{i})$. More precisely, Beal gives an integration over the hidden or latent elements $x_{i}$, and a summation over the units $y_{i}$. Friston’s approach simply specifies a distribution $q$ associated with each specific probabilistic state $p$. However, as discussed earlier, Friston’s notation subsumes the summation (or integration, as appropriate).

This is a reasonable transition, as in the system being described by Friston, we are no longer assessing the values of $p$ and $q$ over the same underlying hidden variables $x$. Rather, the $q$ corresponds to the external system, and the $p$ corresponds to the internal (representational) system, which we are bringing into alignment with the external system.

5.2 Equivalence of the Entropy Expressions

For ease in flow, we next address the equivalence of the two entropy terms, as this is relatively straightforward.

Friston identifies an entropy term $H$ (using this notation, common to information theory, rather than the more classic thermodynamic notation $S$), and we desire that it be equivalent to Beal’s term, as expressed previously in Eqn. 27, which we restate here (using Beal’s notation) as

$$H[q(\tilde{\psi}|\tilde{r})] = -\int dx\, q_{x}(x)\, \ln q_{x}(x),$$

which we can also write (using Friston’s notation) as

$$H[q(\tilde{\psi}|\tilde{r})] = -\int d(\tilde{\psi}|\tilde{r})\, q(\tilde{\psi}|\tilde{r})\, \ln q(\tilde{\psi}|\tilde{r}).$$

We recall that the fundamental definition for the entropy of a system is given (see Appendix B) as

$$S = -k \sum_{n} P_{n} \ln P_{n}, \tag{31}$$

where $P_{n}$ refers to the probability of a unit being in energy state $n$. This is a classic entropy formulation, and we see it replicated in Eqn. 27. The thing that we wish to carefully note is that in Eqn. 27, given that the entropy is being expressed as a function of $q(x_{i})$ (using Beal’s notation) or $q(\tilde{\psi}|\tilde{r})$ (using Friston’s notation), the units that are being summed (or integrated) are those in the model-distribution of the external system $\tilde{\psi}$, as conditioned on the units in the internal system $\tilde{r}$. Thus, $H$ is a function of (the model of) the external system represented by $q$.
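As a small illustration of the entropy form in Eqn. 31 and Eqn. 27, the following sketch evaluates $-k \sum_{n} P_{n} \ln P_{n}$ for a few discrete distributions. It assumes $k = 1$ (the dimensionless, information-theoretic convention) and the usual convention $0 \ln 0 = 0$; the function name `entropy` and the example distributions are hypothetical.

```python
import numpy as np

def entropy(prob, k=1.0):
    """Return -k * sum_n P_n ln P_n, skipping zero-probability states (0 ln 0 = 0)."""
    prob = np.asarray(prob, dtype=float)
    nonzero = prob[prob > 0]
    return float(-k * np.sum(nonzero * np.log(nonzero)))

uniform = np.full(4, 0.25)                       # all four states equally likely
peaked = np.array([0.97, 0.01, 0.01, 0.01])      # nearly all mass on one state
certain = np.array([1.0, 0.0, 0.0, 0.0])         # a single state with probability 1

for name, dist in [("uniform", uniform), ("peaked", peaked), ("certain", certain)]:
    print(f"H[{name}] = {entropy(dist):.4f}")
```

The uniform distribution attains the maximum value $\ln 4$, and the entropy falls to zero as the distribution collapses onto a single state.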

Very specifically, when it comes to evaluating this function, we would not need a distribution over all possible states in the model. Rather, the computational engine which this Technical Report envisions is one in which the external and internal systems separately come to free energy minima. Thus, when that minimum point is achieved, there would be a single existent value for $p$; the one which represents the free energy-minimized state for a given set of parameters $\theta$. This would then lead (via sensory and active units) to a similarly free energy-minimized $q$.

Friston (personal communication) notes that “A complementary perspective on this computational saving follows from Feynman’s original motivation; namely, that we have converted a very difficult integration problem into an easy optimization problem. Here, the optimization problem simply entails minimizing variational free energy.”
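The idea of replacing an integration with an optimization can be sketched on a toy problem. This is only an illustration under strong assumptions: a single observation, a discrete latent variable with four states, hypothetical joint probabilities `p_joint`, and a plain gradient descent on the logits of $q$ (a choice made here for simplicity, not a method taken from Friston or Beal). In such a small case the sum over states is trivial to compute directly; the point is only that minimizing the free energy recovers the same number.

```python
import numpy as np

p_joint = np.array([0.10, 0.05, 0.03, 0.02])   # hypothetical p(x, y) for one observed y

def free_energy(q):
    """Variational free energy F = sum_x q(x) ln[q(x) / p(x, y)]."""
    return float(np.sum(q * np.log(q / p_joint)))

def softmax(z):
    """Map unconstrained logits z to a normalised distribution."""
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.zeros(4)                       # logits parametrising q; start from the uniform distribution
for _ in range(2000):
    q = softmax(z)
    f = np.log(q / p_joint)
    grad = q * (f - np.sum(q * f))    # gradient of F with respect to z when q = softmax(z)
    z -= 0.5 * grad                   # gradient descent; the step size 0.5 is an arbitrary choice

q_opt = softmax(z)
f_min = free_energy(q_opt)
exact = -np.log(p_joint.sum())        # -ln p(y), which otherwise requires the sum (integral) over x

print(f"minimised F = {f_min:.6f}, -ln p(y) = {exact:.6f}")
```

At the minimum, `q_opt` coincides with the normalised `p_joint` (the posterior over the latent states), and the minimized free energy equals $-\ln p(y)$.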

5.3 Equivalence of the Enthalpy Expressions

Finally, we wish to show the equivalence between the enthalpy terms. The word “enthalpy” may be a misnomer here, but it is used in the classic thermodynamic sense, in which (see Appendix A) the free energy $F$ is equal to the enthalpy $H$ (the classic notation for enthalpy, although $U$ is sometimes used, depending on the version of free energy being described) minus the temperature $T$ times the entropy. The entropy is denoted $S$ in thermodynamics, and $H$ in this manuscript, in Friston’s work, and in most information-theoretic works.

Thus, the most classic equation in thermodynamics is

$$F = H - TS,$$

which states that the free energy is the enthalpy minus temperature times entropy, and where enthalpy is denoted $H$ and entropy is denoted $S$.

The equations presented by Beal and Friston have the same formal structure as the classic free energy equation from statistical thermodynamics, as stated in Eqn. 2. The version offered by Friston is replicated here as

$$F(\tilde{s},\tilde{a},\tilde{r}) = E_{q}[L(\tilde{x})] - H[q(\tilde{\psi}|\tilde{r})],$$

and note again that Friston uses $H$ for entropy, instead of the thermodynamic $S$.

The temperature $T$ has been absorbed in the derivations presented in this work; we are dealing with a reduced free energy (and likewise a reduced entropy and a reduced enthalpy), which are dimensionless quantities. (Note that this “reduction” also normalizes the thermodynamic variables with regard to the total number of units in the system. See Appendix A for a review of basic thermodynamics.)
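A minimal numerical sketch of a reduced free energy follows. It assumes that “reduced” means dividing $F = H - TS$ by $kT$ and working per unit, so that the reduced enthalpy is $\sum_{n} P_{n}\,\varepsilon_{n}/kT$ and the reduced entropy is $-\sum_{n} P_{n} \ln P_{n}$; the three energy levels in `reduced_energy` are hypothetical, and the Boltzmann distribution is used here only as the standard equilibrium reference.

```python
import numpy as np

def reduced_free_energy(prob, reduced_energy):
    """F / kT = sum_n P_n (eps_n / kT) + sum_n P_n ln P_n (reduced enthalpy minus reduced entropy)."""
    prob = np.asarray(prob, dtype=float)
    enthalpy = np.sum(prob * reduced_energy)      # H / kT, dimensionless
    entropy = -np.sum(prob * np.log(prob))        # S / k, dimensionless
    return float(enthalpy - entropy)

reduced_energy = np.array([0.0, 1.0, 2.5])        # hypothetical eps_n / kT for three energy states
boltzmann = np.exp(-reduced_energy)
partition = boltzmann.sum()                       # partition function Z
boltzmann = boltzmann / partition                 # equilibrium probabilities P_n

f_equilibrium = reduced_free_energy(boltzmann, reduced_energy)
f_perturbed = reduced_free_energy(np.array([0.5, 0.3, 0.2]), reduced_energy)

print(f"equilibrium F/kT = {f_equilibrium:.4f}, -ln Z = {-np.log(partition):.4f}")
print(f"perturbed   F/kT = {f_perturbed:.4f}")
```

Under these assumptions no temperature appears explicitly once the energies are expressed in units of $kT$, and the equilibrium distribution gives the lowest reduced free energy, equal to $-\ln Z$.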

We have already established the correspondence between Friston’s variational free energy $F(\tilde{s},\tilde{a},\tilde{r})$ and (the negative of) that used by Beal, and also identified it as corresponding (in position and form) to the classical free energy. We have also established the correspondence of Friston’s entropy term $H$ with (the negative of) that used by Beal, and identified it as corresponding to the classical entropy term. (In this case the actual expressions are very much aligned.)

We now seek to identify the correspondence between Friston’s enthalpy-like term, $E_{q}[L(\tilde{x})]$, and the negative of that used by Beal. Specifically, we want to establish Eqn. 26, restating it here for convenience as

$$E_{q}[L(\tilde{x})] = -\sum_{i=1}\left(\int dx_{i}\, q_{x_{i}}(x_{i})\, \ln p(x_{i},y_{i}|\theta)\right). \tag{32}$$

We also take note of the interpretation offered by Friston et al. (2015) [6], which states that $L(\tilde{x}) = -\ln p(\tilde{\psi},\tilde{s},\tilde{a},\tilde{r}|m)$ (see Lemma 3.1 and also Eqn. 3.2), or more specifically

$$L(\tilde{x}) = L(\tilde{\psi},\tilde{s},\tilde{a},\tilde{r}) = -\ln p(\tilde{\psi},\tilde{s},\tilde{a},\tilde{r}), \tag{33}$$

so that $L$ is defined as the negative of the sum of the logarithm of the joint probability of the external and internal (representational) units, together with the Markov blanket units. (The dependence on the model parameter $m$ is implicit.)

We have previously addressed the nature of $L$ in Subsection 4.3, and thus will just briefly recapitulate cogent arguments here.

First, we examine the term $\ln p(x_{i},y_{i}|\theta)$. The joint probability of the dependent variables $y_{i}$, co-occurring with the independent variables $x_{i}$, as conditioned by the model system parameters $\theta$, is consistent with Friston’s notation involving a joint probability distribution.

Second, we consider the integration over the $q_{x_{i}}(x_{i})$ times the logarithm of the probability. As noted previously, the $q$ and the $p$ address the distributions over different systems, and thus are independent (to a first order). Thus, we can separate out the integration of the $q$. The agreement with Eqn. 33 becomes self-evident, if we associate the external states with $x$, and the sensory states (plus active and internal states) with $y$.

5.4 Recapitulation and Summary

We now recast Eqn. 21 using Friston’s notation.

Further, since $L(\tilde{\psi},\tilde{s},\tilde{a},\tilde{r}) = -\ln p(\tilde{\psi},\tilde{s},\tilde{a},\tilde{r}|m)$, the signs on the terms on the RHS of Eqn. 21 have been changed throughout, along with the direction of the inequality.

A key feature in the following Eqn. 34 is that Friston is taking the integration over the units $\tilde{\psi}$ in the external system, similar to how Beal is summing over the observable units $y$. However, we will conduct our integration (or summation, which is more literally the case) over the units in the model system, as discussed in Subsection 3.2.

Specifically, Friston’s starting point is Eqn. 2.7 in [5], given as

$$\begin{aligned}
F(s,a,r) &= -\int_{\psi} d\psi\, q(\psi|r) \ln\left(\frac{p(\psi,s,a,r)}{q(\psi|r)}\right) \\
&= E_{q}[L(\psi,s,a,r)] - H[q(\psi|r)].
\end{aligned}$$

Note that the tilde notation, which indicates generalized variables, is dropped here to conform with the notation that Friston uses in [5]. Note also that, in [5], Friston writes $G$ where we write $L$.

Friston’s interpretation is that “Here, free energy is a functional of an arbitrary (variational) density $q(\psi|r)$ [$q(\psi|\lambda)$ in the original article] that is parametrized by internal states. The last equality just shows that free energy can be expressed as the expected Gibbs energy minus the entropy of the variational density.” (Friston (2013) [5], immediately after Eqn. 2.7.)

The corresponding expressions, from Eqns. 2.12 - 2.16 in Beal [7] are given as

$$\begin{aligned}
L(\theta) &\geq \sum_{i=1}^{n} \int dx_i \, q_{x_i}(x_i) \ln \frac{p(x_i,y_i|\theta)}{q_{x_i}(x_i)} \\
&= \sum_{i=1}^{n} \left( \int dx_i \, q_{x_i}(x_i) \ln p(x_i,y_i|\theta) - \int dx_i \, q_{x_i}(x_i) \ln q_{x_i}(x_i) \right) \\
&\equiv F(q_{x_1}(x_1),\ldots,q_{x_n}(x_n),\theta).
\end{aligned}$$

The full derivation, using Friston’s notation, is given as

$$\begin{aligned}
L(s,a,r) &= -\ln\left(p(s,a,r|m)\right) \\
&= -\ln \int_{\psi} d\psi \, p(\psi,s,a,r) \\
&= -\ln \int_{\psi} d\psi \, q(\psi|r) \frac{p(\psi,s,a,r)}{q(\psi|r)} \\
&\leq -\int_{\psi} d\psi \, q(\psi|r) \ln\left(\frac{p(\psi,s,a,r)}{q(\psi|r)}\right) \\
&= -\int_{\psi} d\psi \, q(\psi|r) \ln\left(p(\psi,s,a,r)\right) + \int_{\psi} d\psi \, q(\psi|r) \ln\left(q(\psi|r)\right) \\
&= E_q[G(\psi,s,a,r)] - H[q(\psi|\mu)] \\
&\equiv F(s,a,r).
\end{aligned} \tag{34}$$

A notational point is that in this equation, $G$ refers to the thermodynamic Gibbs free energy, which is being used here in a didactic manner.

As another small note, Friston shifts notation between the third-to-last and the second-to-last lines of this equation, where he expresses the results in Friston (2013) [5]. In the third-to-last line, he has the expression involving $q(\tilde{\psi}|\tilde{r})$. In the second-to-last line, he uses $q(\tilde{\psi}|\mu)$. A rationale is that after the integration (where the units that are being considered in the distribution $q$ are dependent on the actual representational units $\tilde{r}$), the dependence of $q$ on $\tilde{r}$ no longer needs to be explicitly stated. The introduction of $\mu$ simply notes that the computation for the distribution $q$ was done with reference to sufficient statistics or parameters $\mu$, which are associated with the internal states ($\mu = \tilde{r}$).

6 Discussion

Now that we’ve done a detailed derivation for both of the equalities expressed in Eqn. 2, it is useful to step back and ascertain exactly what is meant by these paired statements, which are reproduced below for convenience.

$$\begin{aligned}
F(\tilde{s},\tilde{a},\tilde{r}) &= E_q[L(\tilde{x})] - H[q(\tilde{\psi}|\tilde{r})] \\
&= L(\tilde{s},\tilde{a},\tilde{r}) + D_{KL}[q(\tilde{\psi}|\tilde{r}) \,||\, p(\tilde{\psi}|\tilde{s},\tilde{a},\tilde{r})].
\end{aligned}$$

The first expression for the variational free energy puts the influence of the external units in the first free energy term $E_q[L(\tilde{x})]$. By stating that we desire the expectation of $L(\tilde{x})$, we are pushing to identify $L$ at the point at which we “expect” the system to come to a stable state, i.e., a free energy minimum.

The influence of the “variation” or the perturbation to the system is expressed in terms of the “entropy of the variational density,” $H[q(\tilde{\psi}|\tilde{r})]$. This puts the variation or perturbation of the external units in the context of the expected values for the internal and Markov blanket units.

The second expression for the variational free energy simply identifies a free energy-like term that involves only internal and Markov blanket units: $L(\tilde{s},\tilde{a},\tilde{r})$. (In statistical terms, this is known as the marginal likelihood, having integrated out dependencies on the external states or causes of sensory states. In Bayesian statistics, this is also known as the (negative) log model evidence (see below).)

The extracted influence of the expected external units (those typically associated with a specific state of internal and Markov blanket units) is now combined with the influence of the variational (or perturbed) external units, within the reverse Kullback-Leibler divergence term, $D_{KL}[q(\tilde{\psi}|\tilde{r}) \,||\, p(\tilde{\psi}|\tilde{s},\tilde{a},\tilde{r})]$.

(In terms of Bayesian statistics, this is the divergence between the approximate and true posterior. This means that minimizing free energy is equivalent to approximate Bayesian inference.)
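The equality of the two expressions for the free energy can be checked numerically. The following is a minimal sketch, assuming a discrete external state $\psi$ with three possible values, and using arbitrary illustrative numbers for the joint $p(\psi,s,a,r)$ at one fixed $(s,a,r)$ and for $q(\psi|r)$. The function and variable names are hypothetical and are not taken from Friston or Beal.

```python
import math

def free_energy_two_ways(joint, q):
    """Evaluate both expressions for F on a discrete external state.

    joint[k] = p(psi_k, s, a, r) at one fixed (s, a, r); q[k] = q(psi_k | r).
    All numbers passed in below are illustrative assumptions.
    """
    # First form: expected energy E_q[L(x)] minus the entropy H[q].
    expected_energy = sum(qk * -math.log(pk) for qk, pk in zip(q, joint))
    entropy = -sum(qk * math.log(qk) for qk in q)
    energy_form = expected_energy - entropy

    # Second form: L(s, a, r) plus the reverse K-L divergence.
    evidence = sum(joint)                       # p(s, a, r), with psi summed out
    surprise = -math.log(evidence)              # L(s, a, r)
    posterior = [pk / evidence for pk in joint] # p(psi | s, a, r)
    kl = sum(qk * math.log(qk / post) for qk, post in zip(q, posterior))
    divergence_form = surprise + kl
    return energy_form, divergence_form, surprise, kl

joint = [0.10, 0.05, 0.25]   # assumed joint values; they sum to p(s, a, r) = 0.40
q = [0.5, 0.2, 0.3]          # assumed variational density over psi
energy_form, divergence_form, surprise, kl = free_energy_two_ways(joint, q)
print(energy_form, divergence_form, surprise, kl)
```

In this toy setting the two forms agree up to floating-point rounding, and because the divergence term is non-negative, the free energy is never smaller than $L(s,a,r)$.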

This Technical Report has, thus far, served to present in detail a derivation for the ideas behind the variational Bayes approach, and provided a detailed correlation between the notation used by one author (Beal [7]) and that used by Friston (op. cit.), in his extension of variational Bayes to a more general case, in which an external system is separated from a “representational” system by a Markov blanket of sensory and active units. These are two different ways of envisioning the variational Bayes approach in action.

6.1 Free Energy Physical Interpretation

We consider that the free energy formulation that we have been developing describes a system with external units $\psi$, together with a representational system that contains internal units that encode “latent” or “hidden” states, in terms of their sufficient statistics $\tilde{r}$, that are separated from the external system by a Markov blanket comprising sensory units $\tilde{s}$ and action units $\tilde{a}$, as illustrated in Figure 1 of [5]. In other words, internal states encode probability distributions over latent states that ‘could have’ caused the sensory states.

Eqn. 6 gives us the free energy of the system, where the elements of $F(\tilde{s},\tilde{a},\tilde{r})$ are formulated in terms of the probability distribution over $\tilde{\psi}$ in terms of $q(\tilde{\psi}|\tilde{r})$. In contrast, $L(\tilde{s},\tilde{a},\tilde{r})$ is a function strictly of the units associated with the representation, where the elements include the representational units $\tilde{r}$ along with the Markov blanket units $\tilde{s}$ and $\tilde{a}$. Finally, the reverse K-L divergence term (the final term on the RHS of Eqn. 6) expresses the divergence between the model (expressed as $q$) and the representation of the external system (expressed as the posterior distribution $p$), given the Markov blanket. (We are dropping the “tilde” notation favored by Friston et al. (op. cit.).)

6.2 Free Energy as a Lower Bound

Beal notes that $F(q_x(x),\theta)$ is a lower bound on $L(\theta)$ and is a functional of the free distributions $q_{x_i}(x_i)$ and of $\theta$ (the dependence on $y$ is left implicit). The inequality introduced in the third expression makes use of Jensen’s inequality.

Beal notes: “Defining the energy of a global configuration $(x,y)$ … the lower bound $F(q_x(x),\theta) \leq L(\theta)$ is the negative of a quantity known in statistical physics as the free energy: the expected energy under $q_x(x)$ minus the entropy of $q_x(x)$ (Feynman, 1972; Neal and Hinton, 1998).”

Beal further notes that $F(q_x(x),\theta)$ is the negative of what is known, in statistical thermodynamics, as the free energy of a system, which is the expected energy ($H$) under $q_x(x)$ minus the entropy of $q_x(x)$. Thus, when we shift to the notation of Friston (op. cit.), we reverse the signs on all of the terms on the right-hand side of Eqn. 34, as well as the direction of the inequality.
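The sign reversal can be written out explicitly. The following is a sketch for a single data point (dropping the sum over $i$), assuming the association made earlier of $x$ with the external states $\psi$, of $y$ with $(s,a,r)$, and of $q_x(x)$ with $q(\psi|r)$, and leaving the conditioning on the model implicit in the last line:

```latex
\begin{aligned}
\ln p(y|\theta) &\geq \int dx \, q_x(x) \ln \frac{p(x,y|\theta)}{q_x(x)}
  && \text{(Beal: lower bound on the log likelihood)} \\
-\ln p(y|\theta) &\leq -\int dx \, q_x(x) \ln \frac{p(x,y|\theta)}{q_x(x)}
  && \text{(multiplying through by } -1 \text{)} \\
L(s,a,r) &\leq -\int_{\psi} d\psi \, q(\psi|r) \ln \frac{p(\psi,s,a,r)}{q(\psi|r)} = F(s,a,r)
  && \text{(relabeled in Friston's notation)}
\end{aligned}
```

Read this way, Beal's lower bound on the log likelihood and Friston's upper bound on the negative log evidence are the same statement with opposite sign conventions.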

As is often noted [5, 7, 10], since $D_{KL} \geq 0$, the free energy for the model is a lower bound for the free energy of the external system. As the model is brought closer to alignment with the external system (the reverse K-L divergence decreases), the free energy of the model approaches that of the external system ($L(\tilde{s},\tilde{a},\tilde{r}) \Rightarrow F(\tilde{s},\tilde{a},\tilde{r})$).
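How the gap closes can be illustrated with a small numerical sketch. Assuming a discrete toy joint over three external states (all numbers and names here are hypothetical), the code below moves $q$ along a straight line from an arbitrary starting density to the true posterior and records the gap $F - L$, which equals the reverse K-L divergence.

```python
import math

def gap_along_path(joint, q_start, steps=5):
    """Return F - L = D_KL[q || posterior] as q is mixed toward the posterior.

    joint[k] = p(psi_k, s, a, r) at fixed (s, a, r); q_start[k] = initial q(psi_k | r).
    """
    evidence = sum(joint)
    posterior = [pk / evidence for pk in joint]
    gaps = []
    for step in range(steps + 1):
        t = step / steps                        # t = 0: q_start, t = 1: posterior
        q = [(1 - t) * qs + t * post for qs, post in zip(q_start, posterior)]
        gaps.append(sum(qk * math.log(qk / post) for qk, post in zip(q, posterior)))
    return gaps

# Assumed illustrative values, not results from the report.
gaps = gap_along_path(joint=[0.10, 0.05, 0.25], q_start=[0.5, 0.2, 0.3])
print([round(g, 4) for g in gaps])
```

The recorded gaps shrink toward zero as $q$ approaches the posterior, so the two quantities coincide only when the divergence vanishes. The straight-line path is only one convenient choice for the illustration; an actual minimization would follow whatever update rule the model uses.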

7 The Evolution of Active Inference

This paper has addressed the notational correspondence between Friston’s early work introducing active inference (op. cit.) and the notation that he followed from Beal (2003) [7], with some attention also to notation used by Blei et al. (2016) [10].

This early Friston work (approximately between 2010 and 2015) emphasized the distinction between the external environment $\Psi$ and the representation of that environment $\tilde{r}$, as mediated by sensing agents $\tilde{s}$ and action agents $\tilde{a}$.

More recently, Friston et al. have shifted notation (starting 2016 - 2017) [12, 13].

The newer formulations have been the basis for recent work on active inference, specifically the evolution of Action Perception Divergence (APD) by Hafner et al. (2020, rev. 2022) [3]. The newer notation has also been used by Friston et al. in more broadly describing the “free energy principle” [2].

Authors presenting active inference in more readily-understood forms (compared to Friston’s early works), as well as Friston himself, emphasize the role of active inference in “process theory” - that is, as a guiding framework for how intelligent systems actually accomplish desired tasks. A recent book by Parr, Pezzulo, and Friston (2022) [17] is a premier example of such, as is an excellent review by Sajid et al. (2020), presenting a contrast-and-compare between active inference and reinforcement learning [18].

Most recently, Friston et al. (2024) have put forth a “renormalization group” approach to active inference, which allows active inference to be applied to larger-scale problems [1]. This overcomes a prior drawback to using active inference, namely its restriction to relatively small-scale problems, such as was done in Cullen et al. (2016) [19], where active inference was applied to the game of Doom.

8 The Variational Free Energy in a New Computational Engine

One of the themes that consistently underlies active inference is that a given system will seek to reach a free energy equilibrium. Ideally, the external reality that we are seeking to represent, $\Psi$, undergoes its own processes that continually move it towards a free energy-minimized state, while also adapting to inputs within its own environment as well as actions from the internal representation system; both of these can affect exactly where the corresponding free energy minimum might be found.

In a like vein, the internal representation of this external system, $p(s,a,r)$ (tilde notation removed for simplicity), should likewise come to a free energy-minimized state. Again, exactly where this free energy minimum is located can change, subject to sensory inputs from the external environment, mediated by sensing agents $(s)$.

CORTECONs(R) (COntent-Retentive, TEmporally-CONnected neural networks) provide a means for actually bringing an internal representation system to a free energy-minimized state. They do this by establishing a grid of bistate nodes; that is, each node can be in state A or state B; “on” or “off.” This grid of bistate nodes can be brought to an equilibrium by minimizing a free energy equation that is more complex than usual. The interesting and distinctive characteristic of this equation is that the entropy term encompasses not only whether a given node is “on” or “off,” but also takes note of the distribution of local patterns - nearest-neighbors, next-nearest-neighbors, and triplets.

In their simplest form, we consider a CORTECON(R) only from the perspective of using it to create a 1-D or 2-D grid of nodes that can form a representation of some external system. In this simplest possible CORTECON(R) interpretation, we pay attention only to the grid free energy, and can adjust node activations (flipping nodes between “on” and “off” states) to achieve a free energy minimum.

This use, of course, is simply the most basic sort. More comprehensive CORTECON(R) implementations will allow grid nodes to play the role of latent variables, and the degree to which nodes can be made active can be a function of not only direct stimulus from an external source but also of control parameters $(\varepsilon_0, \varepsilon_1)$.

In the limited treatment that we provide in this paper, we address only the free energy equation that we used in a CORTECON(R), which is taken directly from the cluster variation method (CVM).

8.1 2-D Cluster Variation Method Overview

This Technical Report introduces how a CORTECON(R) can be used to construct the set of representational units $(r)$. We envision the formulation of a representational system whose component elements are pre-specified, and which is distinct from the external system that is being represented. Further, we envision a total system (external together with representation) in which both the external and representational systems can, and indeed do, separately achieve free energy minimization. Their ability to do this requires, of course, that a free energy equation exists for each of these respective systems.

One way in which we can have a system that allows for both free energy minimization and suitable modeling richness is to use a 2-D system constructed as a grid of bistate nodes, as was shown in the previous Fig. 1. In such a system, we can use the cluster variation method (CVM) to compute a free energy, for which the entropy term is more complex than is typically used. The theoretical basis for this was first developed by Kikuchi [20], and then jointly by Kikuchi and Brush [21]. A more recent description is provided in Maren [14].

The key measurable variables within a CVM system are the configuration variables. In addition to the simple identification of proportional numbers of “on” (A, black) and “off” (B, white) nodes, these configuration variables also account for the various kinds of nearest-neighbor ($y_i$) and next-nearest-neighbor ($w_i$) pairs, as well as the six different kinds of triplets (denoted $z_i$). (A figure in Appendix C illustrates these different configuration variables.)
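To make the idea of counting configuration variables concrete, the following is a minimal sketch and not the procedure used in the cited CVM work. It assumes a small hand-made grid of “A” and “B” nodes with wrap-around at the edges, and, purely as a stand-in, treats horizontally adjacent nodes as the pairs to be counted; the actual neighbor geometry and the degeneracy convention for unlike pairs are defined in the cited references and in Appendix C. The grid, function names, and pair labels are all hypothetical.

```python
from collections import Counter

def unit_fractions(grid):
    """Fractions of nodes in state 'A' and in state 'B'."""
    counts = Counter(cell for row in grid for cell in row)
    total = sum(counts.values())
    return {state: counts.get(state, 0) / total for state in ("A", "B")}

def horizontal_pair_fractions(grid):
    """Fractions of A-A, A-B, and B-B pairs among horizontally adjacent nodes.

    Assumption: the row wraps around, so the last node pairs with the first.
    A-B and B-A are lumped together as the unordered pair 'AB'.
    """
    counts = Counter()
    for row in grid:
        width = len(row)
        for col in range(width):
            pair = "".join(sorted((row[col], row[(col + 1) % width])))
            counts[pair] += 1
    total = sum(counts.values())
    return {pair: counts.get(pair, 0) / total for pair in ("AA", "AB", "BB")}

# Hypothetical 4 x 4 grid with equal numbers of A and B nodes.
grid = ["AABB", "ABAB", "BBAA", "BABA"]
x = unit_fractions(grid)
y = horizontal_pair_fractions(grid)
print(x, y)
```

The same counting pattern would extend to next-nearest-neighbor pairs and triplets once the relevant neighbor offsets are specified.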

In using the CVM method for describing the free energy, the equilibrium distribution of nodes is governed by two enthalpy parameters, $\varepsilon_0$ and $\varepsilon_1$. These parameters are the only “tunable” parameters available in the CVM formulation, and thus are identified with $\theta$, as used by Friston (op. cit.) and Beal [7].

For the case where the distribution of units into the two states A and B is equiprobable, the activation enthalpy parameter $\varepsilon_0$ is by definition zero. This leaves only a single “tunable” parameter, $\varepsilon_1$, the interaction enthalpy parameter. For this specific case, where the fractions of A and B nodes are equal, there is an analytic solution that provides the relative equilibrium fractions of the different configuration variables as a function of $\varepsilon_1$. Due to how the equilibrium solution for the configuration variables is expressed, it is easier to refer to a parameter that is a function of $\varepsilon_1$ than that specific value itself. Thus, we normally use the h-value, where $h = \exp(2\varepsilon_1)$.
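The relation between the h-value and the interaction enthalpy parameter is simple enough to state as a pair of helper functions. This is a minimal sketch; the function names are hypothetical, and the value $h = 1.2$ is used only because it is the h-value that appears in the illustration later in this section.

```python
import math

def h_from_epsilon1(epsilon1):
    """h = exp(2 * epsilon1), as defined in the text."""
    return math.exp(2.0 * epsilon1)

def epsilon1_from_h(h):
    """Inverse of the definition: epsilon1 = ln(h) / 2."""
    return 0.5 * math.log(h)

epsilon1 = epsilon1_from_h(1.2)
print(epsilon1, h_from_epsilon1(epsilon1))
```

Note that $\varepsilon_1 = 0$ corresponds to $h = 1$, and h-values above 1 correspond to positive $\varepsilon_1$.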

The analytic solution is not particularly accurate for larger h-values (e.g., $h > 1.6$), but it provides a starting point for computational free energy minimization, for a given system and a corresponding given h-value.

Formally speaking, to apply the free energy principle (or indeed variational Bayes to any given data), it is entirely sufficient to specify a generative model in terms of a joint distribution over data and their latent causes or, in a Markov blanket partition, sensory and external states (where sensory states are augmented with internal and active states in the Markov blanket formalism).

Appendix C describes the implicit generative model that CVM entails – to give an idea of the sort of data it can generate – and therefore explain or recognize. The following subsections present an illustration of how the 2-D CVM would actually look in application.

8.2 CVM Illustration

The following Figure 4 provides a conceptual illustration for two systems: (a) corresponds to the external ($\tilde{\psi}$) system, and (b) corresponds to the representational system ($\tilde{r}$). (This follows notation introduced by Friston [6].) The grid pattern of active (A) and inactive (B) units is suitable for modeling using a 2-D cluster variation method (CVM) free energy equation, as described in Maren [14, 22].

Neither of the two systems depicted in Fig. 4 is at equilibrium; both are hand-crafted with the intent of embodying a scale-free type of system.

Each system was constructed to have an equal number of nodes in states A and B, or “on” and “off” states. This implies that the activation enthalpy parameter $\varepsilon_0 = 0$. However, the likely value for the interaction enthalpy parameter $\varepsilon_1$ was unknown for each of the two systems. (We typically work with the h-value instead of with $\varepsilon_1$, where $h = \exp(2\varepsilon_1)$.)

Thus, our first goal was to identify a likely h-value candidate for each of the two systems. Pragmatically, we focused on the external, or $\tilde{\Psi}$, system, as it was more extensive and allowed for a richer set of at-equilibrium patterns to evolve. Our second goal was then to computationally bring that system into a free energy minimum for that particular h-value, which was determined to be approximately 1.2. (Detailed experimental results are in Maren [23].)

In actuality, we would most likely not know the h-values for the external $\tilde{\Psi}$ system. We use the plural “h-values” instead of the singular, because even if we had an equiprobability of A and B nodes, we would not necessarily have a single h-value that would characterize the system.

This is because the distribution of local configuration variables - that is, the nearest-neighbor, next-nearest-neighbor, and triplet configurations - would not likely correspond to an equilibrium state. Instead, each distinct configuration variable would have a specific h-value that would correspond with it, and there would be a range of h-values associated with a given system.

We envision how this would be considered in the active inference context, where we have an external system ($\tilde{\psi}$), and are seeking to represent it with an internal system of representation units ($\tilde{r}$). We would likely be able to sample the configuration variables at various locations, and for various degrees of granularity, for the external system. These would become the inputs ($\tilde{s}$) to the units in the representational system $\tilde{r}$.

The two systems shown in Figure 4 have similar pattern configurations, with the exception that the larger system shown in (a) has clusters that are proportionately larger than the clusters in the smaller-scale representational system (b). The natures of the patterns within each, though, are much the same. Thus, to a first order, they should have similar (reduced) free energies. More to the point, we envision that the representational system shown in (b) can be brought into alignment with that of (a), or (more specifically) be brought to a free energy minimum with the same (or a similar) h-value as that corresponding to the external system of (a).

We note, of course, that since neither system is likely to be at equilibrium, we will not initially have a single h-value for either. One task is to find an h-value that provides a “best fit” to each of the systems. To that end, we have devised a new divergence measure, the (reverse) Kikuchi-Maren divergence, which is conceptually akin to the (reverse) Kullback-Leibler divergence [37, 15].

Figure 4: Illustration of two systems, arranged so that a 2-D CVM-based free energy can be directly computed for each. (a) The external system, with units denoted $\tilde{\psi}$. (b) The representational system, showing only the representational units ($\tilde{r}$). The Markov blanket around the grid of representational units is not shown in this figure. The dark and light-shaded grey and mottled units to the upper and right edges of each system illustrate the wrap-around from the left and bottom edges, used to compute the configuration variables leading to the free energies of each system. Both systems show an approximate scale-free distribution of islands of dark (A) units in a sea of white (B) units. The systems are designed with equiprobable distribution of units into states A and B ($x_A = x_B = 0.5$), so that the (reduced) free energies of each can be computed directly, using the analytic solution provided in Maren [14, 22]. Details of the corresponding thermodynamic calculations are found in Maren [23]. The systems shown in this figure have been hand-designed to illustrate a potential scale-free configuration; they have not yet been brought into free energy minimization.

8.3 Interpreting the CVM in the Variational Bayes Framework

The variational Bayes method provides a framework for a new computational engine, and the first step towards this is illustrated in Figure 5.

Figure 5: (a) The external system $\tilde{\Psi}$ has been brought to a free energy minimum for the case where $h = 1.2$. Sampling this system provides different inputs to the representational system with units $\tilde{r}$. In reality, we would not directly know the h-values corresponding to $\tilde{\Psi}$. However, we would trust that the system $\tilde{r}$, taking its configuration values from sensing applied to the units $\tilde{\psi}$, would also be at equilibrium. Finding the h-values for $\tilde{r}$ would give us the parameters for the model $\tilde{p}$, shown in (b). In this particular case, as the full set of h-values corresponding to different configuration values still needs to be developed, the system $\tilde{p}$ was devised for illustration purposes by performing free energy minimization on $\tilde{r}$, shown in the previous Fig. 4, for $h = 1.2$. The equilibrium configuration values for $\tilde{\Psi}$ and $\tilde{p}$ are shown in (c).

As with the previous Figure 4, the larger-scale system on the left (a) corresponds to the external system. In the case of Figure 5, though, the system has been brought to a free energy minimum, for the case where $\varepsilon_0 = 0$ (since the system has been designed with equiprobable distribution of A and B nodes), and where $h = 1.2$. (Recall that $h = \exp(2\varepsilon_1)$.) The selection of $h = 1.2$ was done by first (computationally) counting the distribution of all the different nearest-neighbor and next-nearest-neighbor pairs, as well as the different triplets, and then computing their relative fractions as configuration variables.

Using the analytic solution for equilibrium values of the configuration values as functions of h, it was possible to estimate a range of possible values for h. (Since the original system of Figure 4 was not at equilibrium, the various configuration values corresponded to different analytic h-values.) For simplicity, the next step was done using $h = 1.2$, which was the h-value corresponding to the nearest-neighbor pairs for unlike nodes. (For details, see Maren [23].)
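The counting step described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: it counts horizontally adjacent pairs on a small square grid with wrap-around, whereas the 2-D CVM uses its own geometry and also counts next-nearest-neighbor pairs and triplets. The function name `pair_fractions` and the 4x4 grid are hypothetical and are not taken from the author's code.

```python
def pair_fractions(grid):
    """Return fractions of horizontally adjacent pairs on a periodic grid.

    Assumption: each row is a string of 'A' and 'B' units, and the right
    edge wraps around to the left edge. Unlike pairs (A-B and B-A) are
    pooled under the key "AB".
    """
    counts = {"AA": 0, "AB": 0, "BB": 0}
    n_cols = len(grid[0])
    for row in grid:
        for j, left in enumerate(row):
            right = row[(j + 1) % n_cols]  # wrap-around neighbor
            if left == right:
                counts[left + right] += 1
            else:
                counts["AB"] += 1
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


# Hypothetical equiprobable grid (8 A units, 8 B units).
grid = [
    "AABB",
    "ABBA",
    "BBAA",
    "BAAB",
]
fractions = pair_fractions(grid)
print(fractions)
```

In a fuller treatment, fractions such as these would be compared against the analytic equilibrium values to obtain a candidate h-value for each configuration variable.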

We can see, in Figure 5, that in both the external system (a) and the representational system (b), some of the respective units $\tilde{\psi}$ as well as the units in the representation $\tilde{p}$ have taken on different values. (The total numbers of units in states A and B remain equal in each of these systems. This allows us to apply the analytic solution as a starting point for selecting an h-value.)

For each of these systems, we see a set of configuration variables that now represent at-equilibrium values for the case where $h = 1.2$. Under normal circumstances, we would not be directly bringing the external system $\tilde{\Psi}$ to equilibrium; we would instead be sampling it with our sensory units $\tilde{s}$. These sensory units would influence the representational units $\tilde{r}$. The representational system could then, following the precepts of Action Perception Divergence [3], direct action agents $\tilde{a}$ to influence the external system $\tilde{\Psi}$.

As we bring the model of the representational system into free energy equilibrium, the configuration variables reflect what would be the case for overall equilibrium in the external system $\tilde{\Psi}$. These configuration variables describe the topography of an at-equilibrium system. Further, the equilibrium state for this set of configuration variables corresponds to a specific h-value, which here functions as a model parameter $\theta$.

In a more complete build-out of this approach, we would be able to vary both the activation enthalpy $\varepsilon_0$ and the interaction enthalpy $\varepsilon_1$ parameters, where for our purposes, $h = \exp(2\varepsilon_1)$. This means that we would be able to identify a full set of configuration variables with just the parameter set $\theta = (\varepsilon_0, \varepsilon_1)$.
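As a small worked illustration of the relation $h = \exp(2\varepsilon_1)$, the following Python sketch converts between the h-value and the interaction enthalpy parameter, and assembles the parameter pair for the case used in Figure 5 ($\varepsilon_0 = 0$, $h = 1.2$). The function names are hypothetical helpers introduced here for illustration only.

```python
import math


def epsilon1_from_h(h):
    """Invert h = exp(2 * epsilon_1) to recover epsilon_1."""
    return 0.5 * math.log(h)


def h_from_epsilon1(eps1):
    """Compute h = exp(2 * epsilon_1)."""
    return math.exp(2.0 * eps1)


eps1 = epsilon1_from_h(1.2)   # approximately 0.0912
theta = (0.0, eps1)           # (epsilon_0, epsilon_1) for the Figure 5 case
print(theta)
```

For $h = 1.2$, this gives $\varepsilon_1 = \tfrac{1}{2}\ln(1.2) \approx 0.0912$, a weak positive interaction enthalpy.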

As the phase space map for various equilibrium configuration values versus different h-values becomes worked out, it will be possible to find a corresponding set of h-values given an initial set of configuration values. It will then be possible to perform free energy minimization on the system $\tilde{r}$, using different candidate h-values together with estimates for the activation enthalpy parameter $\varepsilon_0$, to obtain a resultant model $\tilde{q}$ that provides an acceptable fit to the units in the representational system $\tilde{r}$.

This mechanism can further be used to model the external system $\tilde{\Psi}$ as it moves through different states, with various corresponding h-values. This means that we would potentially have a means for modeling evolving system dynamics over time. Moreover, the model would be encapsulated into an h-values trajectory, which would be a relatively simple $\theta$ model.

9 Conclusions

‘What?’ Female was an alien language, but he usually could translate it well enough to understand what was being said. But this [was] …”

Tangled Webs

Anne Bishop (2008), p. 159 (Hardcover edition).

This Technical Report has served three purposes:

  1. Perform a “Rosetta Stone” translation,

  2. Describe how the external and representational systems begin as separate entities, each of which can (separately) come to free energy equilibrium, and

  3. Introduce a method for system representation that could, within itself, undergo free energy minimization in order to yield a resulting model which could be described using only one or two parameters (the $\theta$ elements of the model q).

There has been substantial grumbling within the research community about how difficult it has been to read and understand Karl Friston’s various articles. (See Freed [25] as just one example.) Some of this is notational; Friston has changed his notation subtly - just enough to be difficult for (and perhaps maddening to) the reader - throughout his various articles. Yet, a growing sense that he’s presenting a very useful approach is driving more and more researchers to attempt to read his works.

This desire to understand the fundamental variational Bayes approach, and Friston’s extension to describing external systems and their corresponding representational systems (separated by a Markov blanket), is growing. The variational Bayes methods are receiving greater attention, as a next step in machine learning methods, as described by Yellapragada and Konkimalla [26]. Wainwright and Jordan (2008) have published an extensive tutorial on variational Bayes, setting it in the overarching context of graph theory [27].

Thus, the first intention of this work has been to perform a “Rosetta Stone” translation between the variational Bayes derivation as given by Beal [7] and the subsequent ones given by Friston [4, 5, 6], with particular attention to how the shifts in notation reflect a move from an envisioning where both the external and representational systems are predicated on the same underlying variables $x_i$ to one in which the external and model states can be separated. This should make Friston’s works more readable to the broader scientific community.

The second intention has been to envision how this formulation can be used for a scenario in which the external and representational systems begin as separate entities, each of which can (separately) come to free energy equilibrium. (See Friston and Frith [28] for an example cast in terms of communication and birdsong.) This is important, because we have not typically thought about how various systems (whether external or representational) need to be expressed in a way that allows free energy minimization.

As one of Friston’s key points is that free energy minimization underlies crucial processes (including brain processes), we need to have a better understanding of this premise. Some work, such as that done by a team led by Moran and published by Cullen et al. [19], has already shown the validity of this approach. That work shows how a variational Bayes approach can outperform reinforcement learning, within a specific and constrained operational environment. Those experiments indicate a promising direction for future investigation.

The third intention has been to introduce a means for representing a system, that is, the $\tilde{r}$ component of the system model, using a 2-D cluster variation method approach. This gives us a representation for which we can write a free energy equation, and thus carry out explicit free energy minimization, leading to parameter identification for the free energy-minimized state. This approach has only been briefly sketched; a fuller exposition is provided elsewhere, and further developmental work is underway.

It is possibly this last intention that will prove the most valuable over time. There is currently a paucity of useful models for which free energy minimization is an inherently appropriate method. Specifically, the well-known Ising model (which has become dearly beloved within deep learning circles) does not offer sufficient richness for more complex system modeling. The 2-D CVM approach allows both for richness in expression and a simplicity in terms of the two parameters that govern this expression.

The impediment thus far has been that the CVM approach has been theoretically obscure, and its practical capabilities so far unknown. In fact, the phase space behavior of this model - in terms of identifying how the activation and interaction enthalpy parameters impact the resulting free energy-minimized states - has not yet been fully mapped out. This is largely a computational problem, somewhat aided and abetted by (limited) analytic solutions. Work on the 2-D CVM is underway, which should make this model available for use in the near future as a means for implementing model systems that can, within their own nature, be free energy-minimized. This will introduce a new kind of modeling capability for a wide range of applications.

Acknowledgements

I am enormously indebted to Karl Friston for careful, detailed, and thoughtful reviews, together with very useful suggestions for rewording a few explanations.

Declaration of No Conflicts

The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Code Availability

The initial Python code for computing the 2-D CVM configurations, together with the corresponding entropy, enthalpy, and free energy values, is available from the author in the ajmaren GitHub repositories. See Maren (2018) [29] for the code verification and validation documentation.

More recently, there has been an effort to transition the original Python code to an object-oriented Python code set. This object-oriented code is available at the Themesis GitHub repository [30]. This code is made available under the MIT License Agreement. Themesis has developed an extensive set of YouTube code walkthrough tutorials [31].

Those who “Opt-In” with Themesis (www.themesis.com/themesis/) will receive word when code is released, along with word on new experimental results with the 2-D cluster variation method.

Appendix A: Fundamental Thermodynamic Concepts

In various commentaries, researchers note that the term “thermodynamic free energy,” as used by Friston (op. cit.), does not really correspond to a true thermodynamic free energy. Similarly, there is a difference between the enthalpy term, as computed and used in Friston’s work (and in others using the variational Bayes method), and the notion of enthalpy as it is found in statistical thermodynamics.

This Appendix briefly overviews some of the key concepts in statistical (and classical) thermodynamics, so that it is easier to compare the formalisms resulting from the variational Bayes method described in the body of this Report with the corresponding formalisms from statistical thermodynamics, as is commonly known.

Note that the thermodynamic quantities of free energy ($F$), enthalpy ($H$), and entropy ($S$) are all extensive variables; their values depend on the total number of units in a given system. However, we will work throughout with the reduced variables, the free energy ($\bar{F}$), the enthalpy ($\bar{H}$), and the entropy ($\bar{S}$), for which the previous values have been divided through by $N k_{\beta} T$, where $N$ is the total number of units in the system, $k_{\beta}$ is Boltzmann’s constant, and $T$ is the temperature. This reduces all values to dimensionless quantities which are independent of the size of the systems under consideration. For the remainder of this work, the overhead bar notation on the reduced thermodynamic variables will be dropped; all quantities are understood to be reduced.

The well-known thermodynamic equation for the free energy is given as

$$F = H - S, \tag{A-1}$$

where $F$, called the free energy, is the energy available to do work, $H$ is the enthalpy (also called the internal energy) of a system, and $S$ is the entropy. In this reduced free energy equation, we have already divided through by the (absolute) temperature $T$.

From a statistical thermodynamics perspective, in order to calculate any of these terms, we first need the probabilistic distribution of units in the system among available states. (Several of the following equations are drawn from Maren [32, 33].)

A.1 The Classic Partition Function

To introduce the statistical thermodynamics approach to describing the free energy of a system, we begin with a simple (and classical) expression for the probability of finding a system in the quantum state $i$, characterized by energy $E_i$, as

$$P_i \propto \exp(-E_i / k_{\beta} T), \tag{A-2}$$

where $k_{\beta}$ is Boltzmann’s constant and $T$ is temperature.

We note that the energy $E_i$ describes the energy of the entire system of $N$ units, and that there very well may be degeneracy in the ways in which various units can be assembled to yield a certain energy.

As is true for any probability distribution,

$$\sum_{i=1}^{N} P_i = 1. \tag{A-3}$$

In Eqn. A-2, we stated a proportionality. Now, to find the proportionality constant $c$, we state that

$$P_i = c \exp(-E_i / k_{\beta} T), \tag{A-4}$$

which gives us

$$\sum_{i=1}^{N} P_i = \sum_{i=1}^{N} c \exp(-E_i / k_{\beta} T) = c \sum_{i=1}^{N} \exp(-E_i / k_{\beta} T) = 1, \tag{A-5}$$

so that

$$c = 1 \Big/ \sum_{i=1}^{N} \exp(-E_i / k_{\beta} T). \tag{A-6}$$

This normalizing sum (the reciprocal of $c$) becomes a valuable quantity in and of itself; we refer to it as the partition function, $Q$ (referred to in some sources as $Z$), so that

$$Q = \sum_{i=1}^{N} \exp(-E_i / k_{\beta} T), \tag{A-7}$$

which allows us to phrase a distinct probability as

$$P_i = \exp(-E_i / k_{\beta} T) / Q. \tag{A-8}$$
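The normalization in Eqns. A-2 through A-8 can be illustrated with a short Python sketch. The three energy levels and the choice $k_{\beta} T = 1$ are arbitrary values assumed purely for illustration, and the function name `boltzmann_probabilities` is a hypothetical helper.

```python
import math


def boltzmann_probabilities(energies, kT):
    """Return (Q, probabilities) for a list of system-state energies.

    Q is the sum of the unnormalized weights exp(-E_i / kT) (Eqn. A-7);
    each probability is its weight divided by Q (Eqn. A-8).
    """
    weights = [math.exp(-energy / kT) for energy in energies]
    q = sum(weights)
    return q, [w / q for w in weights]


# Assumed toy energies, in units where k_beta * T = 1.
energies = [0.0, 1.0, 2.0]
Q, probs = boltzmann_probabilities(energies, kT=1.0)
print(Q, probs, sum(probs))
```

The probabilities sum to one by construction, and lower-energy states receive higher probability.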

A.2 The Classic Enthalpy Formulation

Now, we define the average energy $H$, or enthalpy, of a system to be the expectation for the energy of the system, which can be described as the average of the sum of all the energies of the system,

$$\begin{aligned}
H &= \langle\langle E_i \rangle\rangle \\
  &= \sum_{i=1}^{N} E_i P_i \\
  &= \sum_{i=1}^{N} E_i \exp(-E_i / k_{\beta} T) / Q \\
  &= \frac{1}{Q} \sum_{i=1}^{N} E_i \exp(-E_i / k_{\beta} T).
\end{aligned} \tag{A-9}$$

For simplicity, we introduce the notation that $\beta = 1 / k_{\beta} T$, so that

$$H = \frac{1}{Q} \sum_{i=1}^{N} E_i \exp(-\beta E_i). \tag{A-10}$$

Now, we notice that we can make a further simplification, by observing that

$$E_i \frac{\exp(-\beta E_i)}{Q} = -\frac{1}{Q} \frac{\partial}{\partial \beta} \exp(-\beta E_i) \tag{A-11}$$

so that

$$E_i \exp(-\beta E_i) = -\frac{\partial}{\partial \beta} \exp(-\beta E_i). \tag{A-12}$$

Thus, we can rewrite Eqn. A-10 as

$$\begin{aligned}
H &= \sum_{i=1}^{N} \frac{E_i \exp(-\beta E_i)}{Q} \\
  &= -\frac{1}{Q} \sum_{i=1}^{N} \frac{\partial}{\partial \beta} \exp(-\beta E_i) \\
  &= -\frac{1}{Q} \frac{\partial}{\partial \beta} \sum_{i=1}^{N} \exp(-\beta E_i).
\end{aligned} \tag{A-13}$$

We notice, from Eqn. A-7, that the entire summation in the last line of Eqn. A-13 is exactly $Q$ itself. Thus, we can rewrite the expression for $H$ in Eqn. A-13 as

$$H = -\frac{1}{Q} \frac{\partial}{\partial \beta} \sum_{i=1}^{N} \exp(-\beta E_i) = -\frac{1}{Q} \frac{\partial}{\partial \beta} Q. \tag{A-14}$$

We recall the expression for the derivative of a logarithmic function, that

$$\frac{\partial}{\partial x} \ln(y(x)) = \frac{1}{y} \frac{\partial y(x)}{\partial x}. \tag{A-15}$$

Thus, noticing the correspondence $Q = Q(\beta) = y(x)$, we can identify that

$$\frac{\partial}{\partial \beta} \ln(Q(\beta)) = \frac{1}{Q} \frac{\partial Q(\beta)}{\partial \beta}. \tag{A-16}$$

This allows us to rewrite Eqn. A-14 as

$$H = -\frac{1}{Q} \frac{\partial}{\partial \beta} Q = -\frac{\partial}{\partial \beta} \ln(Q(\beta)) = -\frac{\partial}{\partial \beta} \ln(Q), \tag{A-17}$$

dropping the functional dependence of $Q$ on $\beta$ in the last expression.

This gives us a very powerful means for expressing the enthalpy, or internal energy, of a system in terms of $\ln(Q)$. However, it is not a completely general expression for $H$. Instead, we note that by taking the (negative of the) derivative of $\ln(Q)$ with respect to $\beta$, we are moving to a value of $\ln(Q)$ that maximizes (minimizes) the function in terms of temperature (actually, $1/T$). This then leads us to the value of enthalpy, $H$, that helps to minimize the overall value for the free energy, $F$. The value of $H$ that would occur in this case (taking into account the impact of entropy), would then be the expected value of $H$ when the system is at equilibrium.

What we can generalize from this is not so much the specific formulation for $H$ (as a partial derivative of $\ln(Q)$ with respect to $\beta$), but rather, that we are looking for an expected value of $H$. Depending on how the system is formulated, it could be a different resulting expression.
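The equivalence between the direct expectation in Eqn. A-10 and the derivative form in Eqn. A-17 can be checked numerically. The sketch below is a sanity check under assumed toy values (three arbitrary energies and $\beta = 0.7$); it approximates the derivative with a central finite difference rather than evaluating it analytically, and the function names are hypothetical.

```python
import math


def log_partition(energies, beta):
    """ln Q(beta), with Q as in Eqn. A-7 written in terms of beta."""
    return math.log(sum(math.exp(-beta * e) for e in energies))


def mean_energy_direct(energies, beta):
    """H as the expectation (1/Q) * sum_i E_i exp(-beta E_i), Eqn. A-10."""
    weights = [math.exp(-beta * e) for e in energies]
    q = sum(weights)
    return sum(e * w for e, w in zip(energies, weights)) / q


def mean_energy_from_log_q(energies, beta, step=1e-6):
    """H as -d ln(Q)/d beta (Eqn. A-17), via a central finite difference."""
    upper = log_partition(energies, beta + step)
    lower = log_partition(energies, beta - step)
    return -(upper - lower) / (2.0 * step)


# Assumed toy values, chosen only for illustration.
energies = [0.0, 0.5, 2.0]
beta = 0.7
h_direct = mean_energy_direct(energies, beta)
h_derivative = mean_energy_from_log_q(energies, beta)
print(h_direct, h_derivative)
```

The two values agree to within the accuracy of the finite-difference approximation, which is the content of Eqn. A-17 for this toy system.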

Appendix B: Variational Free Energy: Enthalpy and Entropy

This Appendix provides more detail on the derivation of the free energy equation as given in Friston [5, 6], using the formulation established by Beal [7], and making the correspondence between the notations of each. It yields the expression of the variational free energy in terms of what Friston refers to as expected energy or enthalpy (and in one source [6], as “thermodynamic free energy”) plus an entropy term. This expected energy or enthalpy is actually the expectation for the (negative of the) log-likelihood of a certain distribution, as given in Eqn. 33, which is restated here as

$$L(\tilde{x}) = L(\tilde{\psi}, \tilde{s}, \tilde{a}, \tilde{r}) = -\ln p(\tilde{\psi}, \tilde{s}, \tilde{a}, \tilde{r}).$$

If we were to use a 2-D CVM grid to represent the external and internal (representational) states, we would take our probabilities as the actual fractional values for the different configuration variables, and we would use the entropy term from the 2-D CVM formalism (see Appendix C, and references cited therein) to suggest a structuring for the probabilities in the $L(\tilde{x})$ and entropy terms.

We will, very shortly, follow a line of argument introduced by Beal. One of the steps in his argument yields a form of the enthalpy equation that is fundamentally the same as the one used by Friston (op. cit.), and is the one that we have shown in Section 5. It will be that

$$H = \sum_{i=1} \int dx_i \, q_{x_i}(x_i) \, \ln p(x_i, y_i | \theta). \tag{B-1}$$

In the next few paragraphs, we will show that Eqn. B-1 serves analogously to Eqn. A-17. It fits the role of enthalpy in what is being described in a “free energy” formalism. However, it is not a derivative equation; it is the expectation for the natural logarithm of the probability of certain variables.
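To make the "expectation of a log-probability" reading of Eqn. B-1 concrete, here is a minimal Python sketch. It assumes a discrete hidden variable, so that the integral over $x_i$ becomes a sum, and it follows the sign convention of Eqn. B-1 as written. The joint table, the observations, and the factors $q_{x_i}$ are made-up values, and the function name `expected_log_joint` is hypothetical.

```python
import math


def expected_log_joint(observations, q_factors, joint):
    """Sum over data points of E_q[ln p(x_i, y_i | theta)].

    Assumptions: x is discrete; joint[(x, y)] holds p(x, y | theta);
    q_factors[i] maps each hidden state x to q_{x_i}(x).
    """
    total = 0.0
    for y_i, q_i in zip(observations, q_factors):
        for x, q_x in q_i.items():
            if q_x > 0.0:  # skip zero-probability states
                total += q_x * math.log(joint[(x, y_i)])
    return total


# Made-up joint distribution p(x, y | theta) over x in {0, 1}, y in {"a", "b"}.
joint = {(0, "a"): 0.3, (0, "b"): 0.2, (1, "a"): 0.1, (1, "b"): 0.4}
observations = ["a", "b"]
q_factors = [{0: 0.75, 1: 0.25}, {0: 1.0 / 3.0, 1: 2.0 / 3.0}]

H = expected_log_joint(observations, q_factors, joint)
print(H)
```

Note that no derivative appears anywhere in this computation; the quantity is simply a weighted average of log-probabilities, which is the point of the comparison with Eqn. A-17.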

By examining the correspondence between the two expressions (the variational Bayes in comparison with the thermodynamic), when we come to expressing the full free energy for the representational system (using the formulation expressed by Beal), the free energy derivation will be much more lucid.

Beal [7] (Sect. 2.2.1) introduces the scenario for parameter learning with the notion of a generative model with hidden variables x and observed variables y, where the dataset of observed variables $y = \{y_1, \ldots, y_n\}$ is generated by the set of hidden variables $x = \{x_1, \ldots, x_n\}$, and where the n items in each case are independent and identically distributed (i.i.d.). He identifies the parameters describing the (potentially) stochastic dependencies between variables as $\theta$. The probability distribution for observing $y$ (Beal, 2003, Eqn. 2.9) is given as

$$p(y | \theta) = \prod_{i=1}^{n} p(y_i | \theta) = \prod_{i=1}^{n} \int dx_i \, p(x_i, y_i | \theta). \tag{B-2}$$

This tells us that the probability distribution of the observable variables $y_i$ is conditioned on the parameters $\theta$.
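The structure of Eqn. B-2 (marginalize the hidden variable for each data point, then multiply across the i.i.d. data points) can be sketched in Python as follows. The sketch assumes a discrete hidden variable, so that each integral over $x_i$ becomes a sum; the joint table and the observations are made-up values, and `marginal_likelihood` is a hypothetical helper name.

```python
def marginal_likelihood(observations, joint):
    """Compute p(y | theta) = prod_i sum_x p(x, y_i | theta).

    Assumptions: x is discrete, and joint[(x, y)] holds p(x, y | theta)
    for fixed parameters theta.
    """
    hidden_states = sorted({x for x, _ in joint})
    likelihood = 1.0
    for y_i in observations:
        # Marginalize the hidden variable for this data point.
        p_y_i = sum(joint.get((x, y_i), 0.0) for x in hidden_states)
        # i.i.d. assumption: multiply the per-point marginals.
        likelihood *= p_y_i
    return likelihood


# Made-up joint distribution p(x, y | theta) over x in {0, 1}, y in {"a", "b"}.
joint = {(0, "a"): 0.3, (0, "b"): 0.2, (1, "a"): 0.1, (1, "b"): 0.4}
p_y = marginal_likelihood(["a", "b", "a"], joint)
print(p_y)
```

Here $p(\text{a}|\theta) = 0.4$ and $p(\text{b}|\theta) = 0.6$, so the likelihood of the three observations is $0.4 \times 0.6 \times 0.4 = 0.096$.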

We compare this with our earlier expression for the probability that a system would be in the $i^{th}$ energy state, where we stated in Eqn. A-7 that the partition function $Q$ (which served as a normalizing factor) is given as

$$Q = \sum_{i=1}^{N} \exp(-E_i / k_{\beta} T),$$

and from Eqn. A-8 that probability is given as

$$P = P_i = \exp(-E_i / k_{\beta} T) / Q.$$

The difference in the two formulations is that, in Eqn. A-8, we are dealing with the probability of finding a system in a given $i^{th}$ energy state, where there could indeed be a multiplicity of units inhabiting various components of that overall system’s energy state, and there can also be degeneracy in the units inhabiting the various states. This means that various units, indistinguishable from each other, can inhabit the energy states, so that there are multiple ways of counting the units in various states and arriving at the same overall $i^{th}$ system energy.

In contrast, the formulation given in Eqn. B-2 deals with a full set of observed variables, $y=\{y_{1},...,y_{n}\}$.

As we know from the formulation of joint probabilities for independent variables, the joint probability of observing variable $y_{1}$ in a given state $i$, $p(y_{1}=i)$, together with observing variable $y_{2}$ in its own given state $j$, $p(y_{2}=j)$, is $p(y_{1}=i)\,p(y_{2}=j)$. Thus, for a set of such variables, $y=\{y_{1},...,y_{n}\}$, we need the product of the individual variables’ probabilities, which yields the first part of Eqn. B-2:

$$p(y|\theta)=\prod_{i=1}^{n}p(y_{i}|\theta). \tag{B-3}$$

As a second step in making the correlation between Eqn. B-2 and the combination of Eqns. A-7 and A-8, we examine the formulation (from Eqn. B-2)

$$p(y_{i}|\theta)=\int dx_{i}\,p(x_{i},y_{i}|\theta). \tag{B-4}$$

We note that Eqn. B-4 essentially states that the observable variable(s) $y_{i}$ are dependent on a hidden, unobservable set of variables $x_{i}$. For each distinct value (or set of values) $y_{i}$, there are potentially multiple contributions, drawn over the complete set of $x_{i}$. The notation here (keeping to that used by Beal) is just a little ambiguous; it does not mean that the set of countable variables $x_{j}$ is exactly the same as the set of countable variables $y_{i}$. Rather, it is saying that for a given instance of a set of variables $y_{i}$, there is a corresponding set of variables $x_{i}$, $x_{i}\equiv\{X_{i}\}=\{x_{i,1},x_{i,2},...,x_{i,J}\}$, where $J$ is the total number of elements in the set of variables $\{X_{i}\}$ contributing to values for $y_{i}$.

This means that, since each of these elements contributes to the value for $y_{i}$, summation (rather than multiplication) is needed. Since the realm of variables constituting $\{X_{i}\}$ is assumed here to be large, integration rather than summation is used. This, then, gives us a plausibility argument in support of Eqn. B-2.
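The plausibility argument above can be checked on a small discrete example. The following is a minimal sketch, not part of Beal's treatment: the joint table `joint` and the observations `y_obs` are hypothetical values chosen for illustration, and the integral over $x_{i}$ is replaced by a sum because the hidden variable is discrete here.

```python
import numpy as np

# Hypothetical toy model (illustrative values only):
# hidden x takes 3 values (rows), observed y takes 2 values (columns).
# joint[x, y] plays the role of p(x, y | theta) for one fixed theta.
joint = np.array([
    [0.10, 0.20],
    [0.25, 0.05],
    [0.15, 0.25],
])

# Discrete analogue of Eqn. B-4: marginalize the hidden variable by summation.
marginal_y = joint.sum(axis=0)          # p(y_i | theta) for y_i = 0 and y_i = 1

# A small i.i.d. dataset of observed values y_1, ..., y_n (hypothetical).
y_obs = np.array([0, 1, 1, 0, 1])

# Eqn. B-3: the likelihood of the dataset is the product of per-item marginals.
likelihood = np.prod(marginal_y[y_obs])

# Taking the logarithm turns the product over items into a sum over items.
log_likelihood = np.sum(np.log(marginal_y[y_obs]))

print(marginal_y, likelihood, log_likelihood)
```

The hidden variable enters through a sum (contributions to a single $y_{i}$), while the independent data items enter through a product, which is the distinction the text draws.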

Before moving on, we need one more step. Eqn. B-1 involves more than the probability distribution of Eqn. B-2. We note that in Eqn. B-1, we have a summation rather than a product, the natural logarithm of $p(x_{i},y_{i}|\theta)$ as opposed to just a simple statement of the probability, and an additional modulating term, $q_{x_{i}}(x_{i})$.

To see how these occur, we go back to our original definition of the enthalpy, $H$, as given in Eqn. A-17;

$$H=-\frac{1}{Q}\frac{\partial}{\partial\beta}Q=-\frac{\partial}{\partial\beta}\ln(Q(\beta))=-\frac{\partial}{\partial\beta}\ln(Q).$$

Suppose that we have a probability distribution given as

$$p(y|\theta)=\prod_{i=1}^{n}p(y_{i}|\theta). \tag{B-5}$$

Then, the natural logarithm of this distribution is given as

$$\ln{p(y|\theta)}=\ln{\prod_{i=1}^{n}p(y_{i}|\theta)}=\sum_{i=1}^{n}\ln{p(y_{i}|\theta)}. \tag{B-6}$$

We do not have agreement between the formulation given in Eqn. B-1 and that given in Eqn. A-13 together with Eqn. A-14.

Moving on, we now turn our attention to the derivation for the free energy as introduced in Beal [7].

Beal introduces a formulation for the log likelihood in his Eqns. 2.12 - 2.16, reproduced here:

$$
\begin{aligned}
L(\theta) &= \sum_{i=1}\ln\int dx_{i}\,p(x_{i},y_{i}|\theta) \\
&= \sum_{i=1}\ln\int dx_{i}\,q_{x_{i}}(x_{i})\,\frac{p(x_{i},y_{i}|\theta)}{q_{x_{i}}(x_{i})} \\
&\geq \sum_{i=1}\int dx_{i}\,q_{x_{i}}(x_{i})\,\ln\frac{p(x_{i},y_{i}|\theta)}{q_{x_{i}}(x_{i})} \\
&= \sum_{i=1}\left(\int dx_{i}\,q_{x_{i}}(x_{i})\,\ln p(x_{i},y_{i}|\theta)-\int dx_{i}\,q_{x_{i}}(x_{i})\,\ln q_{x_{i}}(x_{i})\right) \\
&= \sum_{i=1}\int dx_{i}\,q_{x_{i}}(x_{i})\,\ln p(x_{i},y_{i}|\theta)-\sum_{i=1}\int dx_{i}\,q_{x_{i}}(x_{i})\,\ln q_{x_{i}}(x_{i}) \\
&\equiv F(q_{x_{1}}(x_{1}),...,q_{x_{n}}(x_{n}),\theta).
\end{aligned}
\tag{B-7}
$$

Beal notes that $F(q_{x}(x),\theta)$ is a lower bound on $L(\theta)$ and is a functional of the free distributions $q_{x_{i}}(x_{i})$ and of $\theta$ (the dependence on $y$ is left implicit). The inequality introduced in the third expression makes use of Jensen’s inequality.
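The bound can be illustrated numerically. The sketch below is an illustration under stated assumptions rather than Beal's own procedure: it uses a hypothetical discrete joint table `joint` (so integrals become sums), hypothetical observations `y_obs`, and two choices of the free distributions, a uniform one and the exact posterior. The helper names `log_likelihood` and `lower_bound` are introduced here for illustration only.

```python
import numpy as np

# Hypothetical joint p(x, y | theta): rows index hidden x, columns index observed y.
joint = np.array([
    [0.10, 0.20],
    [0.25, 0.05],
    [0.15, 0.25],
])
y_obs = np.array([0, 1, 1, 0, 1])        # hypothetical i.i.d. observations


def log_likelihood(joint, y_obs):
    """L(theta) = sum_i ln sum_x p(x, y_i | theta): first line of Eqn. B-7."""
    return float(np.sum(np.log(joint[:, y_obs].sum(axis=0))))


def lower_bound(joint, y_obs, q):
    """F = sum_i sum_x q_i(x) ln[p(x, y_i | theta) / q_i(x)]; q has shape (n, |x|)."""
    p = joint[:, y_obs].T                # p(x, y_i | theta) for each item i
    return float(np.sum(q * (np.log(p) - np.log(q))))


n, k = len(y_obs), joint.shape[0]
L = log_likelihood(joint, y_obs)

# Any strictly positive, normalized q gives F <= L by Jensen's inequality.
q_uniform = np.full((n, k), 1.0 / k)
F_uniform = lower_bound(joint, y_obs, q_uniform)

# Choosing q_i(x) = p(x | y_i, theta) makes the bound tight.
p = joint[:, y_obs].T
q_posterior = p / p.sum(axis=1, keepdims=True)
F_posterior = lower_bound(joint, y_obs, q_posterior)

print(L, F_uniform, F_posterior)
```

In this toy case the uniform choice gives a value strictly below $L(\theta)$, while the posterior choice recovers $L(\theta)$ up to floating-point error.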

Beal notes: “Defining the energy of a global configuration $(x,y)$ … the lower bound $F(q_{x}(x),\theta)\leq L(\theta)$ is the negative of a quantity known in statistical physics as the free energy: the expected energy under $q_{x}(x)$ minus the entropy of $q_{x}(x)$ (Feynman, 1972; Neal and Hinton, 1998)” [8, 9].

Thus, both $F(q_{x}(x),\theta)$ and $L(\theta)$ refer to a free energy formalization; however, $L(\theta)$ can be greater than $F(q_{x}(x),\theta)$. Alternatively, we can say that $F(q_{x}(x),\theta)$ is a lower bound on $L(\theta)$.

Essentially (in Beal’s conception; Friston’s is reversed), we’re saying that the free energy of the model is going to be lower than the free energy of the system being modeled; $L\geq F$. If we are going to improve the accuracy of the model, we will be bringing $F(q_{x}(x),\theta)$ towards $L(\theta)$.
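As a supplementary worked step (a standard identity, stated here for orientation rather than taken from the equations quoted above), the gap between $L(\theta)$ and $F$ can be written explicitly. Using $p(x_{i},y_{i}|\theta)=p(x_{i}|y_{i},\theta)\,p(y_{i}|\theta)$ and the normalization of each $q_{x_{i}}(x_{i})$, where $p(x_{i}|y_{i},\theta)$ denotes the posterior over the hidden variables and $\mathrm{KL}$ denotes the Kullback-Leibler divergence:

```latex
\begin{aligned}
F(q_{x}(x),\theta)
  &= \sum_{i=1}\int dx_{i}\,q_{x_{i}}(x_{i})\,
     \ln\frac{p(x_{i}|y_{i},\theta)\,p(y_{i}|\theta)}{q_{x_{i}}(x_{i})} \\
  &= \sum_{i=1}\ln p(y_{i}|\theta)
     -\sum_{i=1}\int dx_{i}\,q_{x_{i}}(x_{i})\,
     \ln\frac{q_{x_{i}}(x_{i})}{p(x_{i}|y_{i},\theta)} \\
  &= L(\theta)-\sum_{i=1}\mathrm{KL}\!\left[\,q_{x_{i}}(x_{i})\,\|\,p(x_{i}|y_{i},\theta)\,\right].
\end{aligned}
```

Since each divergence is non-negative, $L\geq F$, and bringing $F$ towards $L$ amounts to bringing each $q_{x_{i}}(x_{i})$ towards the corresponding posterior.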

Beal further notes that $F(q_{x}(x),\theta)$ is the negative of what is known, in statistical thermodynamics, as the free energy of a system, which is the expected energy ($H$) under $q_{x}(x)$ minus the entropy of $q_{x}(x)$. Thus, when we shift to the notation of Friston (op.cit.), we will reverse the signs on all of the terms on the right-hand-side of Eqn. B-7 (second-to-last line of the equation), leading up to Beal’s definition of $F(q_{x}(x),\theta)$.

According to this understanding, and changing the signs of the terms to deal with Beal’s use of $F(q_{x}(x),\theta)$ as the negative of the free energy, the expected energy (or enthalpy) of a system, $H$, is given as

$$H=-\sum_{i=1}\int dx_{i}\,q_{x_{i}}(x_{i})\,\ln p(x_{i},y_{i}|\theta), \tag{B-8}$$

and the entropy, $S$, is given as

$$S=-\sum_{i=1}\int dx_{i}\,q_{x_{i}}(x_{i})\,\ln q_{x_{i}}(x_{i}). \tag{B-9}$$

The definition for the entropy of a system is given as

$$S=-k_{\beta}\,\ln\Omega. \tag{B-10}$$

An alternative formulation is

$$S=-k_{\beta}\,\langle\ln P\rangle, \tag{B-11}$$

where $P$ is the probability distribution over all states, $\Omega$ is the grand partition function, and $k_{\beta}$ is Boltzmann’s constant. Thus, $S$ is (up to the factor $-k_{\beta}$) the expectation of the logarithm of the probability of the variable for which various states are possible; i.e., this correlates directly with Eqn. B-9.
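The expected-energy and entropy terms of Eqns. B-8 and B-9 can be evaluated directly for a discrete toy model. This is a minimal sketch under assumptions: the joint table, the observations, and the randomly drawn free distributions `q` are all hypothetical, integrals are replaced by sums, and $-\ln p(x_{i},y_{i}|\theta)$ plays the role of the energy.

```python
import numpy as np

# Hypothetical joint p(x, y | theta) and i.i.d. observations (illustrative values only).
joint = np.array([
    [0.10, 0.20],
    [0.25, 0.05],
    [0.15, 0.25],
])
y_obs = np.array([0, 1, 1, 0, 1])

p = joint[:, y_obs].T                     # p(x, y_i | theta), shape (n, |x|)

# Arbitrary normalized free distributions q_i(x), one row per data item.
rng = np.random.default_rng(0)
q = rng.dirichlet(np.ones(p.shape[1]), size=p.shape[0])

# Eqn. B-8: expected energy under q.
H = -np.sum(q * np.log(p))

# Eqn. B-9: entropy of q, i.e. minus the expectation of ln q under q.
S = -np.sum(q * np.log(q))

# Beal's lower bound (third line of Eqn. B-7) and the free energy H - S.
F_beal = np.sum(q * np.log(p / q))
free_energy = H - S

print(H, S, F_beal, free_energy)
```

For any such `q`, the directly computed bound equals the negative of `H - S`, which is the sign reversal described in the text.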

From Feynman (1972) [8], Eqn. 1.1, we have that

$$Q=\sum_{n}\exp(-E_{n}/kT)\equiv\exp(-F/kT). \tag{B-12}$$

From Feynman (1972) we have the definition for $F$, the Helmholtz free energy, as

$$F=-kT\ln Q=-kT\ln\left(\sum_{n}\exp(-E_{n}/kT)\right) \tag{B-13}$$

and also (Eqn. 1.3)

$$S=-k\sum_{n}P_{n}\ln P_{n}, \tag{B-14}$$

where (Eqn. 1.4)

$$P_{n}=\frac{1}{Q}\exp(-E_{n}/kT). \tag{B-15}$$

Further, the average energy per unit, $U$ (also known as the enthalpy), is given as (Eqn. 1.7)

$$U=\frac{1}{Q}\sum_{n}E_{n}\exp(-E_{n}/kT), \tag{B-16}$$

which can be rewritten as

$$U=\sum_{n}E_{n}P_{n}. \tag{B-17}$$

Essentially, Eqn. B-17 states that the enthalpy of a system (the total energy of the system) is simply the energy of each state multiplied by the probability that a unit will be in that state, summed over all of the states.
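Eqns. B-12 through B-17 can be evaluated together for a small set of energy levels. The sketch below is illustrative only: the energy levels `E`, the temperature `T`, and the choice of reduced units with `k = 1` are assumptions introduced here, not values from Feynman or from this report.

```python
import numpy as np

k = 1.0                                   # Boltzmann's constant in reduced units (assumption)
T = 2.0                                   # hypothetical temperature
E = np.array([0.0, 1.0, 1.0, 3.0])        # hypothetical energy levels E_n

boltzmann = np.exp(-E / (k * T))
Q = boltzmann.sum()                       # Eqn. B-12: partition function
P = boltzmann / Q                         # Eqn. B-15: probability of each state
F = -k * T * np.log(Q)                    # Eqn. B-13: Helmholtz free energy
S = -k * np.sum(P * np.log(P))            # Eqn. B-14: entropy
U = np.sum(E * P)                         # Eqn. B-17: average energy

print(Q, F, S, U, U - T * S)
```

Substituting Eqn. B-15 into Eqn. B-14 gives $S=U/T+k\ln Q$, so the last printed value, $U-TS$, coincides with $F$; this is the "expected energy minus entropy" structure that Beal's bound mirrors.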

This derivation by Feynman [8] is perhaps a bit more intuitive than that put forth in Appendix A, and may be preferred.

Appendix C: The Cluster Variation Method

This Appendix provides a very brief overview of the cluster variation method (CVM), specifically the 2-D CVM. In particular, it provides:

  1. Notion of the configuration variables, together with illustrations of how they appear within a 2-D CVM grid,

  2. The free energy equation for the 2-D CVM, along with the equations for the equilibrium values for the different configuration variables, which are analytically available when there is an equiprobable distribution of nodes into states A and B; i.e., $x_{1}=x_{2}=0.5$, and

  3. A very quick sketch of the actual free energy minimization protocol, which defines how an actual at-equilibrium 2-D CVM grid configuration can be obtained, together with all associated configuration variable values and thermodynamic values for the system.

The goal of this Appendix is to provide just enough information to make Section 8 more readable for those who wish to understand the potential role of a 2-D CVM system in a computational system embodying a variational Bayes approach, within the context used here of separating the representational system from the external one via a Markov blanket.

For those wishing to pursue the 2-D CVM in more depth, the most accessible starting place is provided in Maren (2016) [14]. This overviews the CVM approach for both 1-D and 2-D systems. It also suggests how, in particular, a 2-D CVM approach can potentially relate to modeling neural activation topographies, and can also fit in with descriptions of statistical mechanics in the brain.

The CVM was first developed by Kikuchi [20], and then jointly by Kikuchi and Brush [21]. Kikuchi and Brush presented an analytic solution for both the 1-D and 2-D systems for the equiprobable distribution case. They did not provide details for this analytic solution. The derivation steps for the 1-D case are in Maren (2014a) [34], and for the 2-D case are in Maren (2014b, 2019) [35, 36].

C.1 Configuration Variables

The cluster variation method, introduced by Kikuchi [20] and refined by Kikuchi and Brush [21], uses an entropy term that includes not only the distribution across simple “on” and “off” states, but also the distribution into local patterns, or configurations, as illustrated in the following Figure 6.

A 2-D CVM is characterized by a set of configuration variables, which collectively represent single unit, pairwise combination, and triplet values. The configuration variables are denoted as:

  • $x_{i}$ - Single units,

  • $y_{i}$ - Nearest-neighbor pairs,

  • $w_{i}$ - Next-nearest-neighbor pairs, and

  • $z_{i}$ - Triplets.

The degeneracy factors $\beta_{i}$ and $\gamma_{i}$ (number of ways of constructing a given configuration variable) are shown in the following Figure 6; $\beta_{2}=2$ for both $y_{2}$ and $w_{2}$, as $y_{2}$ and $w_{2}$ can be constructed as either A-B or as B-A for $y_{2}$, or as B- -A or as A- -B for $w_{2}$. All other degeneracy factors for the $y_{i}$ and $w_{i}$ configuration variables are set to 1.


Figure 6: Illustration of the configuration variables for the cluster variation method, showing the ways in which the configuration variables $y_{i}$, $w_{i}$, and $z_{i}$ can be constructed, together with their degeneracy factors $\beta_{i}$ and $\gamma_{i}$.

Notice also that within Figure 6, the triplets $z_{2}$ and $z_{5}$ have two possible configurations each: A-A-B and B-A-A for $z_{2}$, and B-B-A and A-B-B for $z_{5}$. This means that there is a degeneracy factor of 2 for each of the $z_{2}$ and $z_{5}$ triplets. This gives us $\gamma_{2}=\gamma_{5}=2$ (for the triplets), as there are two ways each for constructing the triplets $z_{2}$ and $z_{5}$. The remaining degeneracy factors for the triplet configuration variables are set to 1.
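The degeneracy factors can be collected into arrays and used to check that a set of configuration-variable fractions is properly normalized. This is a minimal sketch: the array names are introduced here for illustration, the example fractions assume (hypothetically) that every distinct spatial arrangement is equally likely, and the pair normalizations $\sum_{i}\beta_{i}y_{i}=1$ and $\sum_{i}\beta_{i}w_{i}=1$ are assumed by analogy with the triplet normalization $\sum_{i}\gamma_{i}z_{i}=1$.

```python
import numpy as np

# Degeneracy factors as described for Figure 6.
beta = np.array([1, 2, 1])               # pairs: y_1..y_3 and w_1..w_3 (beta_2 = 2)
gamma = np.array([1, 2, 1, 1, 2, 1])     # triplets: z_1..z_6 (gamma_2 = gamma_5 = 2)

# Hypothetical example: each distinct arrangement is equally likely, so each
# configuration variable holds 1 / (number of distinct arrangements).
y = np.full(3, 1.0 / beta.sum())         # nearest-neighbor pair fractions
w = np.full(3, 1.0 / beta.sum())         # next-nearest-neighbor pair fractions
z = np.full(6, 1.0 / gamma.sum())        # triplet fractions

# Degeneracy-weighted totals; each should be 1 for a normalized configuration.
pair_total = np.dot(beta, y)
next_pair_total = np.dot(beta, w)
triplet_total = np.dot(gamma, z)

print(y, z, pair_total, next_pair_total, triplet_total)
```

Weighting by the degeneracy factors is what lets a single variable such as $z_{2}$ stand for both of its mirror-image arrangements.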

C.2 Free Energy for the 2-D CVM

The analytic solution for the case where $x_{1}=x_{2}=0.5$ can be found when we are using the full interaction enthalpy term of $\varepsilon_{1}(2y_{2}-y_{1}-y_{3})$. This solution is presented in Maren (2019) [36].

The free energy equation for a 2-D CVM system, including configuration variables in the entropy term, is

$$
\begin{aligned}
\bar{G}_{2-D}=G_{2-D}/N={}&\varepsilon_{1}(-z_{1}+z_{3}+z_{4}-z_{6}) \\
&-\Bigg[2\sum_{i=1}^{3}\beta_{i}Lf(y_{i})+\sum_{i=1}^{3}\beta_{i}Lf(w_{i})-\sum_{i=1}^{2}\beta_{i}Lf(x_{i})-2\sum_{i=1}^{6}\gamma_{i}Lf(z_{i})\Bigg] \\
&+\mu\Big(1-\sum_{i=1}^{6}\gamma_{i}z_{i}\Big)+4\lambda(z_{3}+z_{5}-z_{2}-z_{4})
\end{aligned}
\tag{C-1}
$$

where $\mu$ and $\lambda$ are Lagrange multipliers, and we have set $k_{\beta}T=1$.

The single enthalpy parameter here is $\varepsilon_{1}$. The enthalpy parameter for unit activation is implicitly set to zero because the intention has been to solve the above equation analytically, which is possible only when $x_{1}=x_{2}=0.5$; that is, when the per-unit activation enthalpy parameter $\varepsilon_{0}=0$.

When we use $\varepsilon_{1}(2y_{2}-y_{1}-y_{3})=\varepsilon_{1}(-z_{1}+z_{3}+z_{4}-z_{6})$ as the enthalpy expression (as is done in the previous equation), we can obtain an analytic solution for each of the configuration variables. For example, we find the expression for $z_{3}$ in terms of $h$ as

$$z_{3}=\frac{(h-3)(h+1)}{8\left[h^{2}-6h+1\right]}.$$ (C-2)

(Note: See Maren (2019) [36] for full derivations. This corresponds with Eqn. (I.25) in Kikuchi and Brush (1967) [21].)

Similar expressions can be obtained for the remaining configuration variables.
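As a quick numerical check, Eqn. C-2 can be evaluated directly. The following is a minimal sketch only; the function name `z3_analytic` is an assumption introduced here for illustration and is not taken from the released code.

```python
def z3_analytic(h: float) -> float:
    """Evaluate Eqn. C-2: z3 = (h - 3)(h + 1) / (8 [h^2 - 6h + 1])."""
    denominator = 8.0 * (h * h - 6.0 * h + 1.0)
    if denominator == 0.0:
        raise ZeroDivisionError("Eqn. C-2 diverges at this h-value")
    return (h - 3.0) * (h + 1.0) / denominator


if __name__ == "__main__":
    # Tabulate z3 for a few h-values near h = 1.
    for h in (0.8, 1.0, 1.2, 1.5):
        print(f"h = {h:.2f}  z3 = {z3_analytic(h):.4f}")
```

At $h=1$ the expression evaluates to $z_{3}=0.125$, which is the fraction of A-B-A triplets expected when the two states are equiprobable and uncorrelated.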

The experimentally-generated results from probabilistically-generated data sets correspond to the analytic results in the neighborhood of $\varepsilon_{0}=0$. The reason that the range is so limited is that the analytic solution makes use of equivalence relations as expressed above.

The resulting solution of Eqn. C-2 clearly has a divergence in it, due to the quadratic expression in the denominator, which has two roots. We are interested in the case where $h>1$, which indicates that $\varepsilon_{1}>0$; this is the case where the interaction enthalpy favors like-near-like interactions, or some degree of gathering of similar units into clusters. We therefore expect the computational results to differ from the analytic results as $h$ approaches the divergence point.

The divergence in the analytic solution is an artifact. However, it does indicate that for larger h-values, the analytic solution will not be accurate. Thus, for high h-values, we take the analytic solution simply as a starting point, and invoke a protocol for determining the correct configuration variable values associated with a given h-value, as described in Maren [36].
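For reference, the location of the divergence follows directly from the denominator of Eqn. C-2. The short worked calculation below uses only Eqn. C-2 and the definition $h=\exp(2\varepsilon_{1})$; the labels $h_{-}$ and $h_{+}$ are introduced here for convenience, and the numerical values are rounded.

```latex
\begin{align*}
h^{2} - 6h + 1 = 0
  \;&\Longrightarrow\; h = 3 \pm 2\sqrt{2}, \\
h_{-} &= 3 - 2\sqrt{2} \approx 0.172, \qquad
h_{+} = 3 + 2\sqrt{2} \approx 5.828, \\
\varepsilon_{1}(h_{+}) &= \tfrac{1}{2}\ln h_{+}
  = \ln\!\left(1+\sqrt{2}\right) \approx 0.881 .
\end{align*}
```

For $h>1$ the relevant root is $h_{+}$. Note also that the numerator of Eqn. C-2 vanishes at $h=3$, so the analytic $z_{3}$ is already zero there and negative for $3<h<h_{+}$. This is one concrete way to see that the analytic expression cannot be taken at face value for larger h-values, well before the divergence itself is reached.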

C.3 Free Energy Minimization Protocol for the 2-D CVM

The following describes an early protocol for finding an h-value that is in proximity to the h-value that would be a “best fit” for a given 2-D CVM grid. It is included here largely for historical reasons (Maren 2021, 2019) [22, 36].

More recently, Maren (2022) has devised a new protocol that uses a new divergence measure, the Kikuchi-Maren divergence, for identifying the interaction enthalpy parameter $\varepsilon_{1}$ that provides a “best fit” for characterizing a given 2-D CVM topography. This can then be used to bring that topography to a free energy minimum [37].

An initial 2-D CVM grid, whether manually designed or randomly generated, is typically not at equilibrium. More specifically, the various configuration variable values will typically correspond to different h-values, where the h-value, or simply $h$, is defined as $h=\exp(2\varepsilon_{1})$.

Each h-value indicates a unique free energy minimum solution. Thus, we need to take the initial grid through a free energy minimization process, in order to obtain a grid where the various configuration variables all correspond to a single free energy minimum. This requires using a free energy minimization protocol.

This protocol takes a computational approach, so that given an initial 2-D CVM grid pattern, the steps are to:

  1. Obtain an initial estimate for a candidate h-value,

  2. Determine the associated thermodynamic values for that particular h-value estimate and current set of configuration values, and

  3. Adjust the node activations in order to minimize the free energy for that given h-value, yielding a new grid configuration and associated set of configuration variables and thermodynamic values.

See Maren [29] for the code verification and validation documentation. Code releases are being provided courtesy of Themesis, Inc. and are available at the Themesis GitHub repository [30].

As a preliminary step, a computer program is used to obtain the actual counts for each of the different configuration variables. The next step is to find the corresponding h-values for each of three different configuration variables: $z_{1}$, $z_{3}$, and $y_{2}$. These are selected because, taken together, they are reasonably descriptive of the grid topography:

  • $z_{1}$ - A-A-A triplets; indicates the relative fraction of A units that are included in “islands” or “land masses”; this also (indirectly) indicates the compactness of these masses,

  • $z_{3}$ - A-B-A triplets; indicates the relative fraction of A units that are involved in a “jagged” border (one that involves irregular protrusions of A into a B space), or the presence of one or more “rivers” of B units extending into landmass(es) of A units, and

  • $y_{2}$ - A-B nearest-neighbor pairs; indicates the relative extent to which the A units are distributed among the surrounding B units.
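To make the three quantities concrete, the sketch below counts them along a single flattened chain of units arranged as a ring. This is a deliberately simplified illustration and not the released 2-D code, which counts over the full grid. It assumes that consecutive pairs stand in for nearest neighbors, that consecutive triples stand in for triplets, and that the fractions are normalized so that $y_{1}+2y_{2}+y_{3}=1$. The function name and the state labels are hypothetical.

```python
from typing import Dict, Sequence

A, B = 1, 0  # state labels assumed for this sketch


def chain_config_fractions(units: Sequence[int]) -> Dict[str, float]:
    """Return z1 (A-A-A), z3 (A-B-A) and y2 (A-B) fractions for a ring of units.

    Consecutive pairs stand in for nearest neighbors and consecutive triples
    for triplets; the ring wraps around so every unit is counted equally.
    """
    n = len(units)
    if n < 3:
        raise ValueError("need at least three units")
    aaa = aba = unlike_pairs = 0
    for i in range(n):
        first, second, third = units[i], units[(i + 1) % n], units[(i + 2) % n]
        if (first, second, third) == (A, A, A):
            aaa += 1
        elif (first, second, third) == (A, B, A):
            aba += 1
        if first != second:
            unlike_pairs += 1
    return {
        "z1": aaa / n,
        "z3": aba / n,
        # A-B and B-A are two orientations of the same pair type, so the
        # per-orientation fraction y2 is half of the unlike-pair fraction.
        "y2": unlike_pairs / (2 * n),
    }
```

For a strictly alternating ring every nearest-neighbor pair is unlike and no A-A-A triplets occur, whereas a ring made up entirely of A units gives $z_{1}=1$ with $z_{3}=y_{2}=0$.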

We briefly describe the first protocol used. Initially, each of these configuration variables will correspond to a different h-value. (This corresponding h-value can be found graphically, via a look-up table, or by extrapolation, using Eqn. C-2 and similar equations.) This gives us a set of (typically) three h-values.
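As one alternative to the graphical or look-up approaches, the h-value corresponding to an observed $z_{3}$ can be found numerically. The sketch below swaps in plain bisection on Eqn. C-2; it is an illustration rather than the method used in the released code, and the function names are assumptions. The search is restricted to the branch between the lower root of the denominator and $h=3$, on which Eqn. C-2 is non-negative and strictly decreasing in $h$.

```python
import math


def z3_analytic(h: float) -> float:
    """Eqn. C-2: z3 = (h - 3)(h + 1) / (8 [h^2 - 6h + 1])."""
    return (h - 3.0) * (h + 1.0) / (8.0 * (h * h - 6.0 * h + 1.0))


def h_from_z3(z3_observed: float, tol: float = 1e-10) -> float:
    """Find h such that z3_analytic(h) matches z3_observed, by bisection."""
    if z3_observed < 0.0:
        raise ValueError("z3 is a fraction and cannot be negative")
    low = 3.0 - 2.0 * math.sqrt(2.0) + 1e-9  # just above the lower divergence
    high = 3.0                               # z3_analytic(3) is zero
    if z3_observed > z3_analytic(low):
        raise ValueError("z3 is too large for this bracket")
    while high - low > tol:
        mid = 0.5 * (low + high)
        if z3_analytic(mid) > z3_observed:
            low = mid    # analytic value still too large: move to larger h
        else:
            high = mid
    return 0.5 * (low + high)
```

The same approach applies to the other configuration variables once their analytic expressions are substituted for `z3_analytic`, provided each is monotonic over the bracket used.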

According to this initial protocol, we can determine which h-value to use in any number of ways; one reasonable choice is the mean of the h-values corresponding to the configuration variable values.

This h-value is then used to define the corresponding $\varepsilon_{1}$ in a program that will iteratively modify the grid configuration. (This is for the case where $x_{1}=x_{2}=0.5$, so that $\varepsilon_{0}=0$.)
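The averaging and conversion steps just described can be sketched as follows. The function names are hypothetical, and the conversion simply inverts the definition $h=\exp(2\varepsilon_{1})$.

```python
import math
from statistics import mean
from typing import Iterable


def eps1_from_h(h: float) -> float:
    """Invert h = exp(2 * eps1)."""
    if h <= 0.0:
        raise ValueError("h must be positive")
    return 0.5 * math.log(h)


def candidate_eps1(h_estimates: Iterable[float]) -> float:
    """Average the per-variable h estimates and convert the result to eps1."""
    return eps1_from_h(mean(h_estimates))
```

For example, `candidate_eps1([1.10, 1.25, 1.18])` would take three illustrative (made-up) h-values, one each from $z_{1}$, $z_{3}$, and $y_{2}$, and return the single $\varepsilon_{1}$ passed on to the grid-modification program.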

More specifically, the program will randomly select two units, and ascertain that the first is in the A state and the other is in the B state. (Random selection continues until one of each has been found.) The program then “swaps” the two nodes’ states; that which was A becomes B and vice versa. The program then computes the new set of configuration variables and the corresponding new thermodynamic values. If the free energy is reduced, the swap is kept; if not, it is reversed. The program continues this for a predetermined number of trials.
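The swap procedure just described can be sketched as below. This is a minimal illustration under stated assumptions, not the released Themesis code: the grid is a plain list of lists with 1 for state A and 0 for state B, and the free energy is passed in as a callable. The `unlike_neighbor_count` objective is a toy stand-in included only so that the example runs; it is not the 2-D CVM free energy.

```python
import random
from typing import Callable, List

Grid = List[List[int]]  # 1 = state A, 0 = state B (labels assumed here)


def minimize_by_swaps(
    grid: Grid,
    free_energy: Callable[[Grid], float],
    n_trials: int,
    rng: random.Random,
) -> float:
    """Swap one A unit with one B unit per trial; keep swaps that lower the free energy.

    The grid is modified in place and the final free energy is returned.
    """
    rows, cols = len(grid), len(grid[0])
    n_a = sum(map(sum, grid))
    if n_a == 0 or n_a == rows * cols:
        raise ValueError("grid must contain both A and B units")
    current = free_energy(grid)
    for _ in range(n_trials):
        # Draw random units until the first is in state A and the second in state B.
        while True:
            r1, c1 = rng.randrange(rows), rng.randrange(cols)
            r2, c2 = rng.randrange(rows), rng.randrange(cols)
            if grid[r1][c1] == 1 and grid[r2][c2] == 0:
                break
        grid[r1][c1], grid[r2][c2] = 0, 1          # swap the two states
        trial = free_energy(grid)
        if trial < current:
            current = trial                         # free energy reduced: keep the swap
        else:
            grid[r1][c1], grid[r2][c2] = 1, 0      # otherwise reverse it
    return current


def unlike_neighbor_count(grid: Grid) -> float:
    """Toy stand-in objective: number of unlike neighbor pairs (with wrap-around)."""
    rows, cols = len(grid), len(grid[0])
    total = 0
    for r in range(rows):
        for c in range(cols):
            total += grid[r][c] != grid[r][(c + 1) % cols]
            total += grid[r][c] != grid[(r + 1) % rows][c]
    return float(total)
```

Because each trial swaps one A with one B, the fractions $x_{1}$ and $x_{2}$ are conserved throughout, which is consistent with holding $\varepsilon_{0}=0$ fixed. Recomputing the full objective after every swap mirrors the simplicity of the described program; a local update over only the affected neighborhoods would be faster.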

Graphical display of the free energy values through this process shows that the free energy typically drops relatively fast, and then is impervious to future node swaps.

This is, of course, an extremely heavy-handed and simplistic approach. It does not take into account any sense of what might be accomplished by intelligently selecting candidate nodes for which a “flip” in activation would have the greatest free energy impact. Such a more sophisticated strategy would require an object-oriented approach, and the current program is simplistic in terms of node representations.

However, these initial and exploratory experiments have produced interesting topologies. For example, as shown in the body of this article, a free energy-minimized grid topography will often have “spider legs” of A units connecting various islands and landmasses of A. Studies of free energy-minimized topographies are in progress.

In the more recent protocol (Maren, 2022) [37], we similarly obtain an initial potential range of h-values. We then use the reverse Kikuchi-Maren divergence to identify the candidate h-value that provides the smallest divergence between a (figurative) system with at-equilibrium configuration variable values and the actual representational system. The h-value that provides the smallest divergence is selected and used to define a “target” system, against which the representational system is adjusted until its configuration variable values are as close as desired (within the limits of simple node-activation swapping) to the target values. This is a situation in which Action Perception Divergence (APD) (Hafner et al., 2020, rev. 2022) [3] usefully provides a means for bringing the representational system through a “phase space” defined by the two control parameters $(\varepsilon_{0},\varepsilon_{1})$ and their associated configuration variable values.

C.4 Application to a Variational Bayes Approach

The notion behind offering the 2-D CVM for the extension of variational Bayes into modeling a representational system rests on the premise that both the external and representational systems, composed of $\tilde{\psi}$ and $\tilde{r}$ units respectively, can be expressed using a 2-D CVM. The elements for the representational system would be created by sampling areas of the external system; that is, the sensory elements $\tilde{s}$ of the Markov blanket would generate elements of $\tilde{r}$.

It is reasonable that neither the external nor the representational system would be completely at free energy equilibrium. In the case of the $\tilde{\Psi}$ system, this could reasonably be due to local influences and events, and also because the system can be differentially changing in response to various inputs. In the case of the representational system with $\tilde{r}$ units, this can be attributed to creating its unit activations via sampling from the $\tilde{\psi}$ units.

The model $q$ is formed by bringing the representational system $\tilde{r}$ into free energy equilibrium. In this approach, we would actually go through a free energy minimization process, as that is literally feasible when we have a 2-D CVM grid. The resulting parameters $\theta$ are the set of $(\varepsilon_{0},\varepsilon_{1})$ values that describe the free energy-minimized resultant 2-D CVM grid. These two parameters then indicate the corresponding configuration variables, which define the nature of the grid’s topography.

While this, in itself, is somewhat interesting, the real value would lie in taking this kind of 2-D CVM grid into a computational engine. Inputs from an external source ($\tilde{s}$) can be used to generate certain activations across the grid. The free energy minimization process then modifies these activations. The resulting node activations can then be “learned” (using any number of neural network learning methods).

What makes this process potentially interesting and useful is that we can then take one more step. We can create temporal persistence of a unit’s activation, as a function of the degree to which it is central to an island or landmass of similar units. In short, we have a means for inducing temporal persistence that is dependent on a form of lateral interactions (i.e., neighborliness to units that are similarly in an active state). This means that a given unit’s activation is now a function of two factors: the typical input from an external stimulus, and a lateral interaction.

This method will be addressed more fully in subsequent works.

References

  • [1] Friston, Karl, Conor Heins, Tim Verbelen, Lancelot Da Costa, Tommaso Salvatori, Dimitrije Markovic, Alexander Tschantz, Magnus Koudahl, Christopher Buckley and Thomas Parr. 2024. “From Pixels to Planning: Scale-Free Active Inference.” arXiv:2407.20292v1 [cs.LG]. doi:10.48550/arXiv.2407.20292. (Accessed Aug. 10, 2024; available online at https://arxiv.org/pdf/2407.20292.)
  • [2] Friston, Karl, Lancelot Da Costa, Noor Sajid, Conor Heins, Kai Ueltzhöffer, Grigorios A. Pavliotis and Thomas Parr. 2023. “The Free Energy Principle Made Simpler but Not Too Simple.” arXiv:2201.06387v3 [cond-mat.stat-mech]. doi:10.48550/arXiv.2201.06387. (Accessed Aug. 10, 2024; available online at https://arxiv.org/pdf/2201.06387.)
  • [3] Hafner, Danijar, Pedro A. Ortega, Jimmy Ba, Thomas Parr, Karl Friston and Nicolas Heess. 2020, rev. 2022. “Action and Perception as Divergence Minimization.” arXiv:2009.01791v3 [cs.AI] (13 Feb 2022). doi:10.48550/arXiv.2009.01791. (Accessed Aug. 10, 2024; available online at https://arxiv.org/pdf/2009.01791.)
  • [4] Friston, K. 2010. “The Free-Energy Principle: A Unified Brain Theory?” Nat. Rev. Neurosci., 11:127–138. doi:10.1038/nrn2787.
  • [5] Friston, K. 2013. “Life as We Know It.” Journal of The Royal Society Interface, 10(86).
  • [6] Friston, K., M. Levin, B. Sengupta, and G. Pezzulo. 2015. “Knowing One’s Place: A Free-Energy Approach to Pattern Regulation.” J. R. Soc. Interface, 12:20141383. doi:10.1098/rsif.2014.1383. (Available online at: http://dx.doi.org/10.1098/rsif.2014.1383.)
  • [7] Beal, M. J. 2003. Variational Algorithms for Approximate Bayesian Inference. PhD Thesis, University College London. (PDF available online at: http://www.cse.buffalo.edu/faculty/mbeal/papers/beal03.pdf.)
  • [8] Feynman, R. P. 1972, 1998. Statistical Mechanics. (Reading, MA: Benjamin.)
  • [9] Hinton, G. E. and D. van Camp. 1993. “Keeping Neural Networks Simple by Minimizing the Description Length of Weights,” in Proc. of COLT-93:5–13. doi:10.1145/168304.168306.
  • [10] Blei, D. M., A. Kucukelbir and J. D. McAuliffe. 2016. “Variational Inference: A Review for Statisticians.” arXiv:1601.00670v4 [stat.CO], 2 Nov 2016.
  • [11] Blei, D. M., A. Kucukelbir and J. D. McAuliffe. 2017. “Variational Inference: A Review for Statisticians.” J. American Statistical Association 112(518):859-877. doi:10.1080/01621459.2017.1285773.
  • [12] Friston, Karl, Thomas FitzGerald, Francesco Rigoli, Philipp Schwartenbeck, John O’Doherty and Giovanni Pezzulo. 2016. “Active Inference and Learning.” Neuroscience & Biobehavioral Reviews 68 (September 2016): 862-879. doi:10.1016/j.neubiorev.2016.06.022. (Accessed Aug. 10, 2024; available online at https://www.sciencedirect.com/science/article/pii/S0149763416301336?via%3Dihub.)
  • [13] Friston, Karl, Thomas FitzGerald, Francesco Rigoli, Philipp Schwartenbeck and Giovanni Pezzulo. 2017. “Active Inference: A Process Theory.” Neural Comput 29(1; Jan, 2017):1-49. doi:10.1162/NECO_a_00912. (Accessed Aug. 10, 2024; available online at https://activeinference.github.io/papers/process_theory.pdf.)
  • [14] Maren, A.J. 2016. “The Cluster Variation Method: A Primer for Neuroscientists,” Brain Sciences, 6(4):44. doi:10.3390/brainsci6040044. (Available online at: https://doi.org/10.3390/brainsci6040044.)
  • [15] Maren, Alianna J. 2024. “Minding Your P’s and Q’s: Notational Variations Expressing the Kullback-Leibler Divergence.” Themesis Inc. Technical Note THM TN2024-001v1 (ajm). (Accessed Aug. 16, 2024; available online at https://themesis.com/Downloads/Kullback-Leibler-Notational-Variations-Themesis-upload-2024-08-11.pdf.)
  • [16] Sengupta, B., M. B. Stemmler, and K. J. Friston. 2013. “Information and Efficiency in the Nervous System - a Synthesis,” PLoS Comput Biol. 9(7):e1003157. doi:10.1371/journal.pcbi.1003157. PMID:23935475. PMCID:PMC3723496. Epub 2013 Jul 25.
  • [17] Parr, Thomas, Giovanni Pezzulo, and Karl J. Friston. 2022. “Active Inference: The Free Energy Principle in Mind, Brain, and Behavior.” (Cambridge, MA: The MIT Press). doi:10.7551/mitpress/12441.001.0001.
  • [18] Sajid, Noor, Philip J. Ball, Thomas Parr and Karl J. Friston. 2021. “Active Inference: Demystified and Compared.” arXiv:1909.10863v3 [cs.AI] (30 Oct 2020). (Later published in Neural Computation 33:3 (March 2021; Special Issue).) doi:10.1162/neco_a_01357 (for the journal publication).
  • [19] Cullen, M., B. Davey, K.J. Friston, and R.J. Moran. 2018. “Active Inference in OpenAI Gym: A Paradigm for Computational Investigations into Psychiatric Illness.” Biological Psychiatry CNNI 3(9, September, 2018):809-818. doi:10.1016/j.bpsc.2018.06.010.
  • [20] Kikuchi, R. 1951. “A Theory of Cooperative Phenomena.” Phys. Rev. 81:988–1003.
  • [21] Kikuchi, R. and S.G. Brush. 1967. “Improvement of the Cluster Variation Method,” J. Chem. Phys. 47:195.
  • [22] Maren, Alianna J. 2021. “The 2-D Cluster Variation Method: Topography Illustrations and Their Entropy Parameter Correlations.” Entropy. 23(3):319. doi:10.3390/e23030319. (Accessed Aug. 17, 2024; available online at https://www.mdpi.com/1099-4300/23/3/319.)
  • [23] Maren, A.J. 2019 “2-D Cluster Variation Method: Comparison of Computational and Analytic Results” (June, 2019). GitHub: github.com/ajmaren/2D-Cluster-Variation-Method; MS (TM) PPT Slidedeck: 2D-CVM-Expts-vary-eps0-and-eps1-computational-2019-06-17v4.pptx.
  • [24] Maren, Alianna J. 2022. “A Variational Approach to Parameter Estimation for Characterizing 2-D Cluster Variation Method Topographies.” Themesis Technical Report THM 2022-001 (ajm). arXiv:2209.04087v1 [cs.NE] 9 Sep 2022. doi:10.48550/arXiv.2209.04087. (Accessed Aug. 17, 2024; available online at https://arxiv.org/pdf/2209.04087.)
  • [25] Freed, P. 2010. “Research Digest,” Neuropsychoanalysis 12(1):103-106. doi:10.1080/15294145.2010.10773634; https://doi.org/10.1080/15294145.2010.10773634.
  • [26] Yellapragada, M.S. and C.P. Konkimalla. 2019. “Variational Bayes: A Report on Approaches and Applications.” arXiv:1905.10744v1 [cs.LG] 26 May 2019.
  • [27] Wainwright, M.J. and M.I. Jordan. 2008. “Graphical Models, Exponential Families, and Variational Inference.” Foundations and Trends in Machine Learning 1(1-2):1 - 305. doi:10.1561/2200000001.
  • [28] Friston, K.J. and C. Frith. 2015. “A Duet for One,” Consciousness and Cognition. 36 (Nov., 2015):390-405. doi:10.1016/j.concog.2014.12.003.
  • [29] Maren, A.J. 2018, 2019. “Free Energy Minimization Using the 2-D Cluster Variation Method: Initial Code Verification and Validation,” Themesis Technical Report 2018-001v2 (ajm). v1: 2018; v2: 2019. arXiv:1801.08113v2 [cs.NE] 25 Jun 2019.
  • [30] Themesis, Inc. GitHub Repository. Accessed Aug. 17, 2024; available online at https://github.com/Themesis1.
  • [31] Themesis, Inc. “Cluster Variation Method: Code Walkthroughs.” Themesis, Inc. YouTube Playlist. (Accessed Aug. 17, 2024; available online at https://www.youtube.com/playlist?list=PLQ7kdul7PF0cMQTWRWi1yB0zsc8Uti6J4.)
  • [32] Maren, Alianna J. 2013. “Statistical Thermodynamics: Basic Theory and Equations.” Themesis Technical Report THM TR2013-001(ajm). (Accessed Aug. 17, 2024; available online at https://www.aliannajmaren.com/Downloads/Stat_Thermo_Basic_Theory_2013-12-01.pdf.)
  • [33] Maren, A.J. Statistical Mechanics, Neural Networks, and Artificial Intelligence: The Précis, Book-in-progress. Select chapters available for download at: http://www.aliannajmaren.com/book/.
  • [34] Maren, A.J. 2014. “The Cluster Variation Method I: 1-D Single Zigzag Chain: Basic Theory, Analytic Solution and Free Energy Variable Distributions at Midpoint (x1 = x2 = 0.5),” Themesis Technical Report THM TR2014-002 (ajm). doi:10.13140/2.1.4415.6485. (Accessed Aug. 17, 2024; available online at https://www.aliannajmaren.com/Downloads/Cluster_Variation_Method_I_Basic_Theory_2014-06-25.pdf.)
  • [35] Maren, Alianna J. 2014. “The Cluster Variation Method II: 2-D Grid of Zigzag Chains: Basic Theory, Analytic Solution and Free Energy Variable Distributions at Midpoint (x1 = x2 = 0.5).” Themesis Technical Report THM TR2014-003(ajm) (July, 2014). doi:10.13140/2.1.4112.5446 (Accessed Aug. 17, 2024; available online at https://www.aliannajmaren.com/Downloads/Cluster_Variation_Method_II_Basic_Theory_2-D_2014-07-07.pdf.)
  • [36] Maren, Alianna J. 2019. “2-D Cluster Variation Method Free Energy: Fundamentals and Pragmatics.” Themesis Technical Report THM TR-2019-02v1 (ajm). arXiv:1909.09366v1 [cs.NE] 20 Sep 2019. (Accessed Aug. 17, 2024; available online at https://arxiv.org/pdf/1909.09366.)
  • [37] Maren, Alianna J. 2022. “A Variational Approach to Parameter Estimation for Characterizing 2-D Cluster Variation Method Topographies,” Themesis Technical Report THM 2022-001 (ajm). arXiv:2209.04087v1 [cs.NE]. 9 Sep 2022. doi:10.48550/arXiv.2209.04087. (Accessed Aug. 17, 2024; available online at https://arxiv.org/pdf/2209.04087.)