Information, Energy and Intelligence
Abstract
David MacKay’s work emphasized explicit assumptions and operational clarity in modeling information and inference. In games like Conway’s Life, rules are explicit and self-contained. In analytic frameworks, we often take implicit adjudicators for granted—external observers, pre-specified outcome spaces, privileged decompositions. What if we forbid such external adjudication and seek only rules that can be applied from within the system?
This talk explores the “inaccessible game,” an information-theoretic dynamical system where all rules must be internally adjudicable. Starting from three axioms characterizing information loss (Baez-Fritz-Leinster), we show how a “no-barber principle” selects marginal entropy conservation, maximum entropy dynamics, and specific substrate properties, not by assumption but by consistency requirements. We explore when the constraints imply energy-entropy equivalence in the thermodynamic limit and how entropy time becomes a distinguished clock within the framework.
This work is dedicated to the memory of David MacKay.
In Memory of David MacKay
This talk is for a memorial meeting for David MacKay, who made fundamental contributions to our understanding of the relationship between information theory, energy, and practical systems. David’s work on information theory and inference provided elegant bridges between abstract mathematical principles and real-world applications.
I first saw David speak about information theory at the 1997 Isaac Newton Institute meeting on Machine Learning and Generalisation. It was a surprise because I had been expecting him to speak about either Bayesian neural networks or Gaussian processes. But, of course, as you get to know David you realise it was unsurprising, because he’d been looking at the connections between information theory, machine learning and error correcting codes. He summarised his thinking in his book “Information Theory, Inference, and Learning Algorithms” (MacKay, 2003). It demonstrated how information-theoretic thinking can illuminate everything from error-correcting codes to neural networks to sexual reproduction. This book was my introduction to information theory, and it was available long before it was published. Regular updates were made on David’s website from the late 1990s.
He later surprised me again, when I heard that he’d shifted away from this larger work and was focussing on energy. He did so because he believed that sustainable energy was the most important challenge humanity faced. He approached the subject with the same clarity of thinking and careful reasoning, memorably underpinned by practical examples using phone chargers. “Sustainable Energy Without the Hot Air” (MacKay, 2008) was published in 2008. In a video created as part of the 2009 Cambridge Campaign he went from phone chargers to sweeping national changes in the way we use energy.
Figure: A YouTube video featuring David’s clarity of thought and the ideas behind sustainable energy from 2010.
This Talk
The work I present today on the inaccessible game is my best attempt to follow in this tradition. It tries to build on rigorous information-theoretic foundations for understanding the limitations of autonomous information systems. The hope is to use this framework to underpin our understanding of information processing systems and what their limits are. I hope that David would have appreciated both the mathematical structure and the attempt to use it to deflate unrealistic promises about superintelligence, which I see as problematic in the same way he felt conversations about phone chargers were problematic in distracting us from the real challenges of energy restructuring.
As David told us ten years ago, he was highly inspired by John Bridle telling him (as an undergraduate student) “everything’s connected.” Across his work, David taught us to ask: “What are the fundamental constraints? What do the numbers say?”
In my best attempt to respect that spirit of inquiry, this work tries to ask: What fundamental information-theoretic constraints govern intelligent systems? Can we understand these constraints as rigorously as we understand thermodynamic constraints on engines?
Playful
David was also a playful person. He enjoyed games, often rephrasing physics questions as puzzles, but also ultimate frisbee,1 or ultimate for short. One aspect of ultimate he seemed to particularly like was the “spirit of the game.” In ultimate there is no referee, no arbitration. Self-arbitration is part of the spirit of the game.
In honour of this idea, we consider a similar principle for “zero player games”: games like Conway’s Game of Life (Gardner, 1970) or Wolfram’s cellular automata (Wolfram, 1983). The principle is a conceptual constraint inspired by Russell’s paradox: it demands that the rules of our system not appeal to external adjudicators or reference points, just like ultimate.
The No-Barber Principle
In 1901 Bertrand Russell introduced a paradox: if a barber shaves everyone in the village who does not shave themselves, does the barber shave themselves? The paradox arises when a definition quantifies over a totality that includes the defining rule itself.
We propose a similar constraint for the inaccessible game: the foundational rules must not refer to anything outside themselves for adjudication or reference. Or in other words there can be no external structure. We call this the “no-barber principle.”
Without such consistency, we would require what we might call a “Munchkin provision.” In the Munchkin card game (Jackson-munchkin01?), it is acknowledged that the cards and rules may be inconsistent. Their resolution?
Any other disputes should be settled by loud arguments, with the owner of the game having the last word.
Munchkin Rules (Jackson-munchkin01?)
While this works for card games, it’s unsatisfying for foundational mathematics. We want our game to be internally consistent, not requiring an external referee to resolve paradoxes.
Figure: The Munchkin card game has both cards and rules. The game explicitly acknowledges that this can lead to inconsistencies which should be resolved by the game owner.
The no-barber principle says that admissible rules must be internally adjudicable: they depend only on quantities definable from within the system’s internal language, without requiring, for example, an external observer to define what’s distinguishable, a pre-specified outcome space or \(\sigma\)-algebra, a privileged decomposition into subsystems, an externally defined time parameter, or spatial coordinates.
Entropic Exchangeability
The no-barber principle leads to what we call entropic exchangeability: any admissible constraint or selection criterion must depend only on reduced subsystem descriptions, be invariant under relabeling of subsystems, and not presuppose access to globally distinguishable joint configurations.
This is an attempt to introduce a consistency requirement that prevents the rules from appealing to distinctions the game itself cannot represent.
What This Excludes
Many seemingly natural constraints violate the no-barber principle. For example, partial conservation, which assumes only some variables are isolated, privileges those variables. A time-varying \(C\) would require an external time parameter. Observer-relative isolation would require an external observer that cannot be defined from within the system. Probabilistic isolation requires an externally defined probability space.
Foundations: Information Loss and Entropy
The Three Axioms
Baez-Fritz-Leinster Characterization of Information Loss
Before introducing our fourth axiom, we need to understand how information loss is measured. Baez et al. (2011) showed that entropy emerges naturally from category theory as a way of measuring information loss in measure-preserving functions. They derived Shannon entropy from three axioms, without invoking probability directly.
The Three Axioms
Let \(F(f)\) denote the information lost by a process \(f\) that transforms one probability distribution to another. The three axioms constrain the functional form of \(F\).
Axiom 1: Functoriality states that given a process consisting of two stages, the amount of information lost in the whole process is the sum of the amounts lost at each stage: \[ F(f \circ g) = F(f) + F(g), \] where \(\circ\) represents composition.
Axiom 2: Convex Linearity states that if we flip a probability-\(\lambda\) coin to decide whether to do one process or another, the information lost is \(\lambda\) times the information lost by the first process plus \((1-\lambda)\) times the information lost by the second: \[ F(\lambda f \oplus (1-\lambda)g) = \lambda F(f) + (1-\lambda)F(g). \]
Axiom 3: Continuity states that if we change a process slightly, the information lost changes only slightly, i.e. \(F(f)\) is a continuous function of \(f\).
The Main Result
The main result of Baez et al. (2011) is that these three axioms uniquely determine the form of information loss. There exists a constant \(c\geq 0\) such that for any \(f: p \rightarrow q\): \[ F(f) = c(H(p) -H(q)) \] where \(F(f)\) is the information loss in process \(f: p\rightarrow q\) and \(H(\cdot)\) is the Shannon entropy measured before and after the process is applied to the system.
This provides a foundational justification for using entropy as our measure of information. It’s not just a convenient choice, it’s the unique measure that satisfies natural requirements for measuring information loss.
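To make the functoriality axiom concrete, here is a small numerical sketch (illustrative only, not part of the original development): we push a distribution through two successive coarse-graining maps and check that the entropy drops add across the stages, i.e. \(F(f \circ g) = F(f) + F(g)\) with \(c = 1\).

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def pushforward(p, mapping, n_out):
    """Push distribution p through a deterministic coarse-graining map."""
    q = np.zeros(n_out)
    for i, target in enumerate(mapping):
        q[target] += p[i]
    return q

p = np.array([0.4, 0.3, 0.2, 0.1])   # distribution over four outcomes

g = [0, 0, 1, 1]   # first stage: merge outcomes {0,1} and {2,3}
f = [0, 0]         # second stage: merge everything

q = pushforward(p, g, 2)
r = pushforward(q, f, 1)

F_g = entropy(p) - entropy(q)        # information lost in stage g
F_f = entropy(q) - entropy(r)        # information lost in stage f
F_total = entropy(p) - entropy(r)    # information lost by the composite

print(F_g + F_f, F_total)            # the two agree: functoriality with c = 1
```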
Information Isolation: Selected by No-Barber
Information Isolation: Selected, Not Assumed
In our earlier presentation (Lawrence-inaccessible25?), we introduced information isolation \(\sum_i h_i = C\) as a fourth independent axiom. But more recent work suggests this is better understood as selected by the no-barber principle rather than assumed.
Why This Selection?
Among all possible entropy-related constraints, marginal entropy conservation \(\sum_i h_i = C\) is the strongest nontrivial constraint that can be formulated without external structure.
Consider the alternatives:
Partial conservation (only some \(h_i\) constrained):
- Requires specifying which variables are isolated
- Privileges those variables over others
- Needs external criterion to select them

Time-varying \(C(t)\):
- Requires an external time parameter \(t\)
- Who defines this time?
- Violates no-barber principle

Observer-relative isolation:
- Isolated from what observer?
- Requires external reference point
- Smuggles in external structure

Probabilistic isolation (holds in expectation):
- Requires external probability space
- Who defines the measure?
- Again violates no-barber
The constraint \(\sum_i h_i = C\) (constant) has special properties:
- Exchangeable: Treats all subsystems identically (depends only on marginal entropies)
- Extensive: Scales linearly with system size
- Internal: Definable from reduced descriptions alone
- No external time: \(C\) is a constant, not time-dependent
- No external observer: Isolation is absolute, not relative
These properties make it the unique choice that satisfies both:
- Strong enough to constrain dynamics
- Weak enough to avoid external reference
Implication: Not “Just Another Axiom”
This reframing changes how we understand the game’s foundation. We’re not arbitrarily adding a fourth axiom—we’re discovering that internal adjudicability forces this specific constraint.
The three Baez-Fritz-Leinster axioms tell us information loss is measured by entropy. The no-barber principle then tells us that the only admissible global constraint on a collection of subsystems is marginal entropy conservation.
This is a stronger foundation. We’re not asking “what happens if we add this constraint?” but rather “what must be true if we forbid external reference?” The answer turns out to be marginal entropy conservation.
Constraints vs Selections
One useful clarification from the latest “no-barber” framing is to separate (i) constraints that are needed just to avoid impredicative circularity / external structure, from (ii) selections that look internally motivated but may not yet be uniquely forced. This keeps the story honest: we can say what seems necessary now, and where the remaining design degrees of freedom live.
Smuggled Outcomes: Shannon vs von Neumann
Baez-Fritz-Leinster show Shannon entropy is uniquely characterised by natural axioms. The no-barber question is subtler: does the Shannon setting already assume external structure (a labelled outcome space / \(\sigma\)-algebra) that the game cannot represent internally? This is one motivation for preferring an algebraic entropy (von Neumann) when we want the rule language to be outcome-independent.
The Inaccessible Game
With these foundations, we can now introduce the game itself.
The Inaccessible Game
We call our framework the “inaccessible game” because the system is isolated from external observation: an outside observer cannot extract or inject information, making the game’s internal state inaccessible.
Like other zero-player games, such as Conway’s Game of Life (Gardner, 1970), the system evolves according to internal rules without external interference. But unlike cellular automata where rules are chosen by design, in our game the rules emerge from an information-theoretic principle.
Why “Inaccessible?”
The game is inaccessible because our fourth axiom—information isolation—ensures that an external observer learns nothing. Recall from Baez-Fritz-Leinster that information gained from observing a process \(f: p \rightarrow q\) is measured by entropy change: \[ F(f) = H(p) - H(q). \]
But if marginal entropies are conserved (\(\sum h_i = C\)), then an external observer sees:
- Before observation: \(\sum h_i = C\)
- After observation: \(\sum h_i = C\)
- Information gained: \(\Delta(\sum h_i) = 0\)
The observer learns nothing! The system is informationally isolated from the outside world.
What Makes It a Game?
The “game” has the following characteristics:
Players: None (zero-player game)
State: A probability distribution over variables, parametrized by natural parameters \(\boldsymbol{\theta}\)
Rules: Evolve to maximize entropy production subject to the constraint \(\sum h_i = C\)
Dynamics: Emerge from information geometry (Fisher information) and the constraint structure
The game starts in a maximally correlated state (high multi-information \(I\), low joint entropy \(H\)) and evolves toward states of higher entropy. Since \(I + H = C\), this means the system breaks down correlations (\(I\) decreases) as entropy increases.
Connection to Physical Reality
Why should we care about this abstract game? Because its dynamics exhibit structure that connects to fundamental physics:
- GENERIC structure: The dynamics decompose into reversible (conservative) and irreversible (dissipative) components
- Energy-entropy equivalence: In the thermodynamic limit, our marginal entropy constraint becomes equivalent to energy conservation
- Landauer’s principle: Information erasure necessarily involves the dissipative part of the dynamics
The inaccessible game provides an information-theoretic foundation for thermodynamic principles. It suggests that physical laws might emerge from information-theoretic constraints, rather than information theory being derived from physics.
Information Dynamics
The Conservation Law
The \(I + H = C\) Structure
We have established four axioms, with the fourth axiom stating that the sum of marginal entropies is conserved, \[ \sum_{i=1}^N h_i = C. \] This conservation law is the heart of The Inaccessible Game, but to understand its dynamical implications, we need to rewrite it in a more revealing form.
Multi-Information: Measuring Correlation
The multi-information (or total correlation), introduced by Watanabe (1960), measures how much the variables in a system are correlated. It is defined as, \[ I = \sum_{i=1}^N h_i - H, \] where \(H\) is the joint entropy of the full system: \[ H = -\sum_{\mathbf{x}} p(\mathbf{x}) \log p(\mathbf{x}). \]
The multi-information has a clear interpretation:
- \(I = 0\): The variables are completely independent. The joint entropy equals the sum of marginal entropies.
- \(I > 0\): The variables are correlated. Some information is “shared” between variables, so the joint entropy is less than the sum of marginals.
- \(I\) is maximal: The variables are maximally correlated (in the extreme case, deterministically related).
Multi-information is always non-negative (\(I \geq 0\)) and measures how much knowing one variable tells you about others.
The Information Action Principle: \(I + H = C\)
Using the definition of multi-information, we can rewrite our conservation law. From \(I = \sum_{i=1}^N h_i - H\), we have: \[ \sum_{i=1}^N h_i = I + H. \] Therefore, the fourth axiom \(\sum_{i=1}^N h_i = C\) becomes: \[ I + H = C. \]
This is an information action principle. It says that multi-information plus joint entropy is conserved. This equation sits behind the dynamics of the Inaccessible Game.
This equation has the structure of an action principle in classical mechanics. In physics, total energy is conserved and splits into two parts, \[ T + V = E, \] where \(T\) is kinetic energy and \(V\) is potential energy.
The analogy for The Inaccessible Game is as follows.
- Multi-information \(I\) plays the role of potential energy. It represents “stored” correlation structure. High \(I\) means variables are tightly coupled, like a compressed spring.
- Joint entropy \(H\) plays the role of kinetic energy. It represents “dispersed” or “free” information. High \(H\) means the probability distribution is spread out, with maximal uncertainty.
Just as a classical system evolves from high potential energy to high kinetic energy (a ball rolling down a hill), the idea in the Inaccessible Game will be that the information system evolves from high correlation (high \(I\)) to high entropy (high \(H\)).
The Information Relaxation Principle
The \(I + H = C\) structure suggests a relaxation principle: systems naturally evolve from states of high correlation (high \(I\), low \(H\)) toward states of low correlation (low \(I\), high \(H\)).
Why? Our inspiration is that the second law of thermodynamics tells us that entropy increases. If we want to introduce dynamics in the game, increasing entropy provides an obvious way to do that. Since \(I + H = C\) is constant, if \(H\) increases, \(I\) must decrease. The system breaks down correlations to increase entropy.
This is analogous to how physical systems relax from non-equilibrium states (low \(T\), high \(V\)) to equilibrium (high \(T\), low \(V\)). A compressed spring releases its stored energy. A hot object in a cold room disperses its energy. In information systems, correlated structure dissipates into entropy.
Consider a simple two-variable system with binary variables \(X_1\) and \(X_2\):
High correlation state (high \(I\), low \(H\)): \[ p(X_1=0, X_2=0) = 0.5, \quad p(X_1=1, X_2=1) = 0.5 \] The variables are perfectly correlated. Marginal entropies: \(h_1 = h_2 = 1\) bit. Joint entropy: \(H = 1\) bit. Multi-information: \(I = 1 + 1 - 1 = 1\) bit.
Low correlation state (low \(I\), high \(H\)): \[ p(X_1, X_2) = 0.25 \text{ for all four combinations} \] The variables are independent. Marginal entropies: \(h_1 = h_2 = 1\) bit. Joint entropy: \(H = 2\) bits. Multi-information: \(I = 1 + 1 - 2 = 0\) bits.
The system relaxes from the first state to the second, conserving \(I + H = 2\) bits throughout. Let’s visualize this relaxation:
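The code below calls two helper functions, `relaxation_path` and `compute_binary_entropies`, that are not shown here. A minimal sketch of what they might look like, assuming the relaxation linearly interpolates the joint distribution from the perfectly correlated state to the independent one (an interpolation that keeps both marginals uniform, so \(h_1 = h_2 = 1\) bit throughout):

```python
import numpy as np

def relaxation_path(alpha):
    """Assumed interpolation: perfectly correlated joint at alpha=0,
    independent joint at alpha=1; both marginals stay uniform."""
    p00 = 0.5 * (1 - alpha) + 0.25 * alpha
    p11 = 0.5 * (1 - alpha) + 0.25 * alpha
    p01 = 0.25 * alpha
    p10 = 0.25 * alpha
    return p00, p01, p10, p11

def compute_binary_entropies(p00, p01, p10, p11):
    """Marginal entropies, joint entropy and multi-information (in bits)."""
    def ent(ps):
        ps = np.array([p for p in ps if p > 0])
        return -np.sum(ps * np.log2(ps))
    h1 = ent([p00 + p01, p10 + p11])   # marginal of X1
    h2 = ent([p00 + p10, p01 + p11])   # marginal of X2
    H = ent([p00, p01, p10, p11])      # joint entropy
    I = h1 + h2 - H                    # multi-information
    return h1, h2, H, I
```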
import numpy as np

# Generate relaxation trajectory
n_steps = 100
alphas = np.linspace(0, 1, n_steps)
h1_vals = []
h2_vals = []
H_vals = []
I_vals = []
for alpha in alphas:
p00, p01, p10, p11 = relaxation_path(alpha)
h1, h2, H, I = compute_binary_entropies(p00, p01, p10, p11)
h1_vals.append(h1)
h2_vals.append(h2)
H_vals.append(H)
I_vals.append(I)
h1_vals = np.array(h1_vals)
h2_vals = np.array(h2_vals)
H_vals = np.array(H_vals)
I_vals = np.array(I_vals)
C_vals = I_vals + H_vals  # Should be constant

Figure: Left: Multi-information \(I\) decreases as joint entropy \(H\) increases, conserving \(I + H = C\). The colored regions show how the conserved quantity splits between correlation (red) and entropy (blue). Right: Marginal entropies remain constant throughout, making the system inaccessible to external observation.
The visualisation shows the trade-off: as the system relaxes, correlation structure (multi-information) is converted into entropy. The total \(I + H = C\) remains constant (black dashed line), but the system evolves from a state dominated by correlation to one dominated by entropy.
The marginal entropies \(h_1\) and \(h_2\) stay constant throughout this evolution. An external observer measuring only marginal entropies would see no change—the system is informationally isolated, hence “inaccessible.”
Connection to Marginal Entropy Conservation
Why does this structure conserve marginal entropies? Recall that from Baez’s axioms, any change in entropy of a subsystem represents “information loss” to an external observer. If the observer learns nothing about the system (information isolation), then \[ \Delta\left(\sum_{i=1}^N h_i\right) = 0. \] The \(I + H = C\) formulation makes our dynamics clear: as the system evolves, correlations are traded for entropy. The marginal entropies remain fixed (so an external observer learns nothing), while internally the system reorganises from a correlated state to an uncorrelated state.
Importantly, information conservation doesn’t mean nothing changes, it means the changes are internal redistributions that leave marginal entropies (and hence external information) unchanged. The system is inaccessible to the outside because its dynamics preserve \(\sum h_i\) which means they preserve \(I + H\).
Why This Matters for Dynamics
The \(I + H = C\) structure immediately tells us:
Direction of evolution: Systems move from high \(I\) to high \(H\) (correlation to entropy).
Constrained dynamics: Not all paths through probability space are allowed. Only those preserving \(I + H = C\) are accessible.
Physical interpretation: The split into \(I\) (correlation/potential) and \(H\) (entropy/kinetic) gives us a sense of what’s happening; later we will parameterise the dynamics directly through the natural parameters \(\boldsymbol{\theta}\) (from Lecture 1).
Variational principle: The action-like structure hints that we can derive dynamics from a variational principle, just as Lagrangian mechanics derives equations of motion from the principle of least action.
In the next section, we’ll see how information relaxation, i.e. the tendency to move from high \(I\) to high \(H\), leads to maximum entropy production dynamics in the natural parameter space.
Information Relaxation
From Information Relaxation to Maximum Entropy Production
We’ve established the conservation law \(I + H = C\) and used it to suggest a relaxation principle: systems evolve from high correlation (high \(I\)) to high entropy (high \(H\)). Now we explore the implications of this principle by deriving the explicit dynamics.
The Direction of Time: Entropy Increases
The second law of thermodynamics tells us that entropy increases over time. In The Inaccessible Game, this means: \[ \dot{H} \geq 0. \]
Since the joint entropy \(H\) must increase (or at least not decrease), and we have the constraint \(I + H = C\), it immediately follows that: \[ \dot{I} \leq 0. \]
The multi-information must decrease, i.e. the system breaks down correlations to increase entropy.
This gives us the direction of evolution, but not yet the rate or the specific form of the dynamics. For that, we need to think about how the system can maximize the rate of entropy production while respecting the conservation constraint.
Maximum Entropy Production Principle
Among all possible dynamics that conserve marginal entropy, which path should we choose? Our answer comes from a principle that has emerged across multiple domains of physics: maximum entropy production (MEP).
The MEP principle states that, subject to constraints, systems evolve along the path of steepest entropy increase. This principle has been observed in:
- Non-equilibrium thermodynamics (Beretta, 2020; Ziegler and Wehrli, 1987)
- Fluid dynamics and turbulence
- Ecology and self-organization
- Climate dynamics
For The Inaccessible Game, MEP means: of all paths that conserve \(\sum h_i\), the system follows the one that maximises \(\dot{H}\). This choice of dynamics makes sense because it is uniquely determined at all times.
Note that this principle isn’t one of our axioms, it’s an assumption about how the relaxation dynamics should occur.
Natural Parameters and the Entropy Gradient
To make MEP concrete, we need coordinates. Recall from Lecture 1 that exponential families have natural parameters \(\boldsymbol{\theta}\) where the geometry is particularly elegant. In natural parameters, the entropy gradient has a beautiful form.
For an exponential family \(p(\mathbf{x}|\boldsymbol{\theta}) = \exp(\boldsymbol{\theta}^T T(\mathbf{x}) - \mathcal{A}(\boldsymbol{\theta}))\), the joint entropy is: \[ H(\boldsymbol{\theta}) = \mathcal{A}(\boldsymbol{\theta}) - \boldsymbol{\theta}^T \nabla \mathcal{A}(\boldsymbol{\theta}). \]
Taking the gradient with respect to \(\boldsymbol{\theta}\): \[ \nabla_{\boldsymbol{\theta}} H = -\nabla^2 \mathcal{A}(\boldsymbol{\theta})\,\boldsymbol{\theta} = -G(\boldsymbol{\theta})\boldsymbol{\theta}, \] where \(G(\boldsymbol{\theta}) = \nabla^2 \mathcal{A}(\boldsymbol{\theta})\) is the Fisher information matrix.
This gradient points in the direction of steepest entropy increase.
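As a quick numerical check (a sketch, not part of the original derivation), take a product-of-Bernoullis exponential family with log-odds as natural parameters, so \(\mathcal{A}(\boldsymbol{\theta}) = \sum_i \log(1 + e^{\theta_i})\) and \(G\) is diagonal. The entropy formula and the claimed gradient \(-G(\boldsymbol{\theta})\boldsymbol{\theta}\) can be compared against finite differences:

```python
import numpy as np

def log_partition(theta):
    return np.sum(np.log1p(np.exp(theta)))

def entropy(theta):
    # H(theta) = A(theta) - theta^T grad A(theta), with grad A = sigmoid(theta)
    mu = 1.0 / (1.0 + np.exp(-theta))
    return log_partition(theta) - theta @ mu

def fisher_information(theta):
    # For independent Bernoullis G is diagonal: sigma(theta) * (1 - sigma(theta))
    mu = 1.0 / (1.0 + np.exp(-theta))
    return np.diag(mu * (1.0 - mu))

theta = np.array([0.7, -1.2, 0.3])
analytic_grad = -fisher_information(theta) @ theta   # claimed: grad H = -G theta

eps = 1e-6
numeric_grad = np.array([
    (entropy(theta + eps * np.eye(3)[i]) - entropy(theta - eps * np.eye(3)[i])) / (2 * eps)
    for i in range(3)
])
print(np.allclose(analytic_grad, numeric_grad, atol=1e-6))   # True
```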
The MEP Dynamics
Maximum entropy production, in its simplest form, says: move in the direction of the entropy gradient. In natural parameters, this gives: \[ \dot{\boldsymbol{\theta}} = \nabla_{\boldsymbol{\theta}} H = -G(\boldsymbol{\theta})\boldsymbol{\theta}. \]
This is gradient ascent on entropy. But we must be careful: this dynamics must also preserve the constraint \(\sum h_i = C\).
For the moment, we will focus on systems where we assume that this simple gradient flow preserves marginal entropies. This would require that the Fisher information matrix \(G\) and the conservation constraint have compatible structure, both arising from the same geometric structure of the probability manifold.
Later (especially Lecture 4) we’ll see how to handle more general constraints through Lagrangian methods, where we explicitly enforce the conservation through multipliers. For now, the key insight is: MEP naturally leads to gradient flow in entropy.
Why This Is the Unique Dynamics
The MEP dynamics \(\dot{\boldsymbol{\theta}} = -G(\boldsymbol{\theta})\boldsymbol{\theta}\) has several interesting properties.
Steepest ascent: This follows the Euclidean gradient \(\nabla_{\boldsymbol{\theta}} H\) in parameter space. The Fisher information \(G\) appears because of the structure of exponential families (\(\nabla H = -G\boldsymbol{\theta}\)), not because we’re using it as a Riemannian metric. Note: the natural gradient would be \(G^{-1}\nabla H\); we are not following natural gradients here.
Maximizes entropy production: Among all dynamics preserving the constraint, this produces entropy fastest.
Conserves marginal entropies: For systems with the exchangeable structure we’ve assumed, this flow keeps \(\sum h_i\) constant.
Connects to thermodynamics: This is exactly the “steepest entropy ascent” dynamics explored e.g. in non-equilibrium thermodynamics by Beretta (Beretta, 2020).
The information relaxation principle—that systems evolve from correlation to entropy—combined with MEP, uniquely determines these dynamics.
The Information Relaxation Picture
Let’s put it all together. The Inaccessible Game dynamics can be understood as:
Initial state: High correlation, low entropy
- Multi-information \(I\) is large (variables tightly coupled)
- Joint entropy \(H\) is small (distribution is concentrated)

Evolution: Maximum entropy production
- System follows gradient flow \(\dot{\boldsymbol{\theta}} = -G(\boldsymbol{\theta})\boldsymbol{\theta}\)
- Entropy \(H\) increases at maximum rate
- Multi-information \(I\) decreases correspondingly
- Marginal entropies \(h_i\) remain constant (external observer learns nothing)

Final state: Low correlation, high entropy
- Multi-information \(I\) is small (variables nearly independent)
- Joint entropy \(H\) is large (distribution is spread out)
- Equilibrium: \(\dot{H} = 0\), maximum entropy subject to constraints
This is information relaxation: the system “relaxes” from a tense, correlated state to a loose, uncorrelated state, just as a stretched elastic relaxes to its natural length.
Connection to Physical Intuition
The information relaxation picture has a direct physical analogy. Consider a room full of gas molecules:
Initial state: All molecules are in one corner (high correlation - if you know where one molecule is, you have a good guess about others; low entropy - the distribution over positions is concentrated).
Evolution: Molecules diffuse according to the laws of thermodynamics, spreading out to fill the room. This is maximum entropy production—the fastest route to equilibrium.
Final state: Molecules are uniformly distributed throughout the room (low correlation - knowing where one molecule is tells you nothing about others; high entropy - maximum uncertainty about positions).
In both cases, we have:
- A conservation law (total energy for gas, marginal entropy for information)
- A relaxation from structured to unstructured states
- Maximum entropy production as the governing principle
The Inaccessible Game applies this same physics to abstract probability distributions.
Preview: Constrained Gradient Flow
The simple gradient flow \(\dot{\boldsymbol{\theta}} = -G(\boldsymbol{\theta})\boldsymbol{\theta}\) is the starting point, but real systems often have additional constraints beyond marginal entropy conservation. In Lecture 4, we’ll see how to incorporate these constraints using Lagrangian methods, leading to: \[ \dot{\boldsymbol{\theta}} = -G(\boldsymbol{\theta})(\boldsymbol{\theta} + \lambda \nabla C), \] where \(\lambda\) is a Lagrange multiplier and \(C\) represents the constraint.
The key insight remains: information relaxation through maximum entropy production is the fundamental principle. Constraints modify the path, but not the underlying logic.
Constrained Maximum Entropy Production
We’ve established that \(I + H = C\) must be conserved, and that the system naturally evolves toward higher entropy. But how does it evolve? What determines the specific path through information space?
Our answer comes from an information relaxation principle: of all paths that conserve \(\sum_i h_i = C\), the system follows the one that maximizes entropy production \(\dot{H}\).
Unconstrained vs Constrained Dynamics
Without the constraint, maximum entropy production would simply be gradient ascent on \(H\). For exponential families, the entropy gradient is \[ \nabla H = -G(\boldsymbol{\theta})\boldsymbol{\theta} \] where \(G = \nabla^2A\) is the Fisher information. Pure MEP would give: \[ \dot{\boldsymbol{\theta}} = \nabla H = -G(\boldsymbol{\theta})\boldsymbol{\theta}. \]
This is gradient flow on the entropy in the natural parameter coordinates. The system would flow toward the maximum entropy state at \(\boldsymbol{\theta} = \mathbf{0}\).
But we must respect \(\sum_i h_i = C\) at all times. This constraint defines a manifold in parameter space. We need to project the MEP flow onto the tangent space of this constraint manifold.
Using a Lagrangian formulation: \[ \mathscr{L}(\boldsymbol{\theta}, \nu) = -H(\boldsymbol{\theta}) + \nu\left(\sum_{i=1}^n h_i - C\right) \] where \(\nu\) is a Lagrange multiplier (note we use \(-H\) since Lagrangians are minimized by convention).
The Constrained Dynamics
The constrained dynamics become: \[ \dot{\boldsymbol{\theta}} = -G(\boldsymbol{\theta})\boldsymbol{\theta} + \nu(\tau) \mathbf{a}(\boldsymbol{\theta}) \] where \(\mathbf{a}(\boldsymbol{\theta}) = \sum_i \nabla h_i(\boldsymbol{\theta})\) is the constraint gradient and \(\tau\) is “game time” parametrising the trajectory.
The Lagrange multiplier \(\nu(\tau)\) is time-dependent, it varies along the trajectory to maintain the constraint. Since the constraint must be satisfied at all times: \[ \frac{\text{d}}{\text{d}\tau}\left(\sum_i h_i\right) = 0 \] we can use the chain rule: \[ \mathbf{a}(\boldsymbol{\theta})^\top \dot{\boldsymbol{\theta}} = 0. \]
Solving for the Lagrange Multiplier
Substituting the dynamics into the constraint maintenance condition: \[ \mathbf{a}^\top\left(-G\boldsymbol{\theta} + \nu \mathbf{a}\right) = 0 \] we can solve for \(\nu\): \[ \nu(\tau) = \frac{\mathbf{a}^\top G\boldsymbol{\theta}}{\|\mathbf{a}\|^2}. \]
This allows us to write the dynamics in projection form: \[ \dot{\boldsymbol{\theta}} = -\Pi_\parallel G\boldsymbol{\theta} \] where \[ \Pi_\parallel = \mathbf{I} - \frac{\mathbf{a}\mathbf{a}^\top}{\|\mathbf{a}\|^2} \] is the projection matrix onto the constraint tangent space.
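To see this algebra concretely, here is a small sketch with illustrative (randomly generated) quantities: for any positive-definite \(G\), state \(\boldsymbol{\theta}\) and constraint gradient \(\mathbf{a}\), the multiplier \(\nu = \mathbf{a}^\top G\boldsymbol{\theta}/\|\mathbf{a}\|^2\) keeps the flow tangent to the constraint surface, and the result matches the projection form.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative quantities, not from any particular model
n = 4
B = rng.standard_normal((n, n))
G = B @ B.T + n * np.eye(n)        # a positive-definite "Fisher information"
theta = rng.standard_normal(n)     # current natural parameters
a = rng.standard_normal(n)         # constraint gradient a = grad(sum_i h_i)

# Lagrange multiplier that maintains the constraint
nu = (a @ G @ theta) / (a @ a)
theta_dot = -G @ theta + nu * a

# Equivalent projection form
Pi = np.eye(n) - np.outer(a, a) / (a @ a)
theta_dot_proj = -Pi @ G @ theta

print(np.isclose(a @ theta_dot, 0.0))          # constraint maintained: a^T theta_dot = 0
print(np.allclose(theta_dot, theta_dot_proj))  # the two forms agree
```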
Physical Interpretation
This has a nice physical interpretation.
- \(-G\boldsymbol{\theta}\): The “natural” direction to maximise entropy
- \(\nu \mathbf{a}\): The “constraint force” that keeps the system on the manifold
- \(\nu(\tau)\): Measures how much the natural flow “wants” to leave the constraint surface
When \(\nu \approx 0\), the constraint gradient is nearly orthogonal to the flow, i.e. the system is naturally moving along the constraint surface. When \(|\nu|\) is large, the natural flow is trying to leave the surface and significant constraint force is needed to keep it on track.
As we’ll see next, this tension between the information geometry and the constraint structure is what generates the GENERIC-like decomposition into dissipative and conservative parts.
Entropy Time (Internal Clock)
A small but important design point: if we allow an external time parameter, we’ve already violated the no-barber spirit. One candidate is to parameterise trajectories by entropy production itself (an affine freedom remains: choosing units and an origin for the clock). This keeps the ordering internal and avoids appealing to an externally supplied clock.
Emergent Structure: GENERIC
What is GENERIC?
One of the most remarkable consequences of constrained maximum entropy production is the emergence of GENERIC structure—a framework from non-equilibrium thermodynamics that combines reversible and irreversible dynamics.
The GENERIC Framework
We’ve seen something emerge from lectures 5-7:
- Lecture 5: Hamiltonian/Poisson structure describes energy-conserving dynamics (antisymmetric operators)
- Lecture 6: Linearisation around equilibrium reveals structure of dynamics
- Lecture 7: Any dynamics matrix decomposes uniquely as \(M = S + A\) where \(S\) is symmetric (dissipative) and \(A\) is antisymmetric (conservative)
This emergence comes from the geometry of constrained information dynamics. The structure is not unique to information theory. It appears throughout physics whenever systems combine reversible and irreversible processes.
Historical Context: Non-Equilibrium Thermodynamics
In the 1980s-90s, researchers in non-equilibrium thermodynamics faced a challenge: How do you describe systems that are simultaneously:
- Reversible (like mechanical systems, governed by Hamilton’s equations)
- Irreversible (like thermodynamic systems, governed by entropy increase)

Examples of such systems:
- Fluid dynamics: Conservation of momentum (reversible) + viscosity (irreversible)
- Chemical reactions: Reaction kinetics (reversible) + diffusion (irreversible)
- Complex materials: Elastic deformation (reversible) + plastic flow (irreversible)
Two researchers, Miroslav Grmela and Hans Christian Öttinger, independently developed a unified framework in the 1990s that they called GENERIC: General Equation for Non-Equilibrium Reversible-Irreversible Coupling (Grmela and Öttinger, 1997; Öttinger, 2005; Öttinger and Grmela, 1997).
What Problem Does GENERIC Solve?
Classical mechanics and classical thermodynamics seemed to describe fundamentally different worlds:
Classical Mechanics (Hamiltonian)
- Time reversible: \(t \to -t\) gives valid dynamics
- Energy conserved: \(\frac{\text{d}E}{\text{d}t} = 0\)
- Phase space volume conserved (Liouville’s theorem)
- Described by antisymmetric operators (Poisson brackets)
Classical Thermodynamics
- Time irreversible: entropy always increases
- Energy dissipated as entropy increases: \(\frac{\text{d}S}{\text{d}t} \geq 0\)
- Systems evolve toward equilibrium
- Described by symmetric operators (friction, diffusion)
The Problem: Real systems do BOTH simultaneously! A pendulum with friction conserves angular momentum (reversible) while losing energy to heat (irreversible). How do you write down equations that respect both structures at once?
The GENERIC Answer: Coexistence Requires Structure
GENERIC provides the answer: reversible and irreversible processes can coexist in the same system, but only if they satisfy certain compatibility conditions. These conditions ensure
- Energy and entropy are consistently defined
- The second law of thermodynamics holds (\(\dot{S} \geq 0\))
- Conserved quantities (like energy, momentum) remain conserved
- Casimir functions (constraints) are respected
Importantly, you can’t just add reversible and irreversible parts arbitrarily; they must be coupled through degeneracy conditions that enforce thermodynamic consistency.
In typical GENERIC applications, ensuring the degeneracy conditions are satisfied is challenging, you must carefully engineer both operators to be compatible. But in our case (Lectures 1-7), the degeneracy conditions emerge automatically from the constraint geometry. We didn’t impose them, they pop out as a consequence of starting from information-theoretic axioms. When we check in Lecture 8, they’re already satisfied. This strongly suggests our axioms capture something fundamental about thermodynamic consistency.
Why “GENERIC” Matters for Information Dynamics
You might wonder: “Why do we care about a framework from non-equilibrium thermodynamics in a course on information dynamics?”
The structure that emerged from pure information theory (lectures 1-7) is identical to the structure that physicists discovered was necessary for consistent non-equilibrium thermodynamics.
This suggests something deep:
- Information dynamics is thermodynamics (we’ve known this since Shannon and Jaynes)
- But more: information dynamics is also a dynamical system with conserved quantities
- The GENERIC structure is the inevitable consequence of combining these two aspects
When we derived \(\dot{\boldsymbol{\theta}} = -G\boldsymbol{\theta} + \nu \mathbf{a}\) from information-theoretic principles, we were actually deriving a GENERIC system! The Fisher information \(G\) plays the role of the “friction” operator, and the constraint dynamics provide the “Poisson” structure.
Preview: Structure of the GENERIC Equation
In the next sections, we’ll see the full GENERIC equation: \[ \dot{x} = L(x) \nabla E(x) + M(x) \nabla S(x) \] where:
- \(L(x)\): Poisson operator (antisymmetric, describes reversible dynamics)
- \(M(x)\): Friction operator (symmetric positive semi-definite, describes irreversible dynamics)
- \(E(x)\): Energy functional (conserved by \(L\) dynamics)
- \(S(x)\): Entropy functional (increased by \(M\) dynamics)

And we’ll see how our information dynamics fit perfectly into this form, with:
- Fisher information matrix \(G\) playing the role of \(M\)
- Constraint structure providing the Poisson operator \(L\)
- Marginal entropy conservation giving us Casimir functions
The structure we built from axioms (lectures 1-4) is GENERIC. The decomposition we derived from linearization (lectures 6-7) reveals its form.
Historical note: The original GENERIC papers (Grmela and Öttinger, 1997; Öttinger and Grmela, 1997) emerged from studies of complex fluids and polymers. It has since been applied to
- Viscoelastic materials
- Multiphase flows
- Chemical reaction networks
- Biological systems
- Plasma physics
The framework is now recognized as a fundamental structure in non-equilibrium statistical mechanics. Our contribution is showing it emerges naturally from information-theoretic first principles.
The GENERIC Equation
The GENERIC framework describes the evolution of a state \(x\) (which could be a density matrix, a distribution, a field configuration, etc.) through the equation: \[ \dot{x} = L(x) \nabla E(x) + M(x) \nabla S(x), \] where
- \(x\): State of the system (lives in some state space)
- \(E(x)\): Energy functional
- \(S(x)\): Entropy functional
- \(\nabla E\), \(\nabla S\): Functional derivatives (gradients in function space)
- \(L(x)\): Poisson operator (describes reversible dynamics)
- \(M(x)\): Friction operator (describes irreversible dynamics)
This equation encodes deep structure. Let’s unpack each component.
The Poisson Operator \(L(x)\)
The Poisson operator \(L(x)\) describes the reversible (energy-conserving, time-reversible) part of the dynamics. It must
1. Be Antisymmetric: For any functionals \(F\) and \(G\), \[ \langle \nabla F, L \nabla G \rangle = -\langle \nabla G, L \nabla F \rangle \] where \(\langle \cdot, \cdot \rangle\) denotes an appropriate inner product. This ensures time reversibility: if you reverse time (\(t \to -t\)) and flip velocities, you get valid dynamics.
2. Satisfy Jacobi identity: The Poisson bracket defined by \(L\) must satisfy \[ \{F, \{G, H\}\} + \{G, \{H, F\}\} + \{H, \{F, G\}\} = 0 \] where \(\{F,G\} := \langle \nabla F, L \nabla G \rangle\). This is the condition for \(L\) to generate a Lie algebra structure (recall Lecture 5!).
3. Conserve energy: The energy must be a Casimir of \(L\): \[ L(x) \nabla E(x) = 0 \quad \Rightarrow \quad \frac{\text{d}E}{\text{d}t}\bigg|_{L \text{ only}} = 0 \] Actually, for GENERIC we require something weaker: \(\langle \nabla E, L \nabla E \rangle = 0\), which follows from antisymmetry.
This connects directly to Lecture 5: \(L\) defines a Poisson structure, and Hamiltonian flow with Hamiltonian \(E\) is given by \(\dot{x} = L \nabla E\).
The Friction Operator \(M(x)\)
The friction operator \(M(x)\) describes the irreversible (entropy-increasing, dissipative) part of the dynamics. It must
1. Be Symmetric: For any functionals \(F\) and \(G\), \[ \langle \nabla F, M \nabla G \rangle = \langle \nabla G, M \nabla F \rangle. \] This is the Onsager reciprocity relation from irreversible thermodynamics.
2. Be Positive semi-definite: For any functional \(F\), \[ \langle \nabla F, M \nabla F \rangle \geq 0. \] This ensures dissipation: entropy can only increase (or stay constant), never decrease.
3. Conserve energy: The friction must not change the energy \[ \langle \nabla E, M \nabla S \rangle = 0. \] This is the first degeneracy condition. Dissipation redistributes energy but doesn’t create or destroy it.
The positive semi-definite property ensures the second law: \[ \frac{\text{d}S}{\text{d}t}\bigg|_{M \text{ only}} = \langle \nabla S, M \nabla S \rangle \geq 0 \]
This connects to Lecture 7: The symmetric part \(S\) of our decomposition \(M = S + A\) had exactly these properties!
The Degeneracy Conditions
For the GENERIC equation to be thermodynamically consistent, \(L\) and \(M\) must satisfy two degeneracy conditions that couple the reversible and irreversible parts:
Degeneracy Condition 1 (Energy conservation by friction): \[ M(x) \nabla E(x) = 0 \] Physically: Dissipative processes cannot change the total energy, only redistribute it. This is more general than what we stated above—the friction operator must annihilate the energy gradient entirely, not just be orthogonal to it.
Degeneracy Condition 2 (Entropy conservation by Poisson dynamics): \[ L(x) \nabla S(x) = 0 \] Physically: Reversible (Hamiltonian) processes cannot change entropy. All entropy change must come from irreversible processes.
These conditions are non-trivial constraints on the operators \(L\) and \(M\). They ensure:
- The first law (energy conservation): \(\frac{\text{d}E}{\text{d}t} = \langle \nabla E, L \nabla E + M \nabla S \rangle = 0\)
- The second law (entropy increase): \(\frac{\text{d}S}{\text{d}t} = \langle \nabla S, L \nabla E + M \nabla S \rangle = \langle \nabla S, M \nabla S \rangle \geq 0\)
Without these conditions, you could have energy creation/destruction or entropy decrease, those would be violations of thermodynamics.
Casimir Functions and Constraints
Beyond energy and entropy, GENERIC systems often have additional conserved quantities called Casimir functions \(C_i(x)\). These satisfy: \[ L(x) \nabla C_i(x) = 0 \quad \text{and} \quad M(x) \nabla C_i(x) = 0 \]
Casimirs are “super-conserved,” they’re annihilated by both the reversible and irreversible parts of the dynamics. Physically, Casimirs represent fundamental constraints that cannot be changed by any process in the system.
Examples:
- Mechanics: Total momentum (in absence of external forces)
- Fluids: Circulation in ideal fluids
- Electromagnetism: Total charge
- Information dynamics: \(\sum h_i = C\) (marginal entropy conservation!)
Casimirs often arise from symmetries (Noether’s theorem, Lecture 5) or from fundamental conservation laws. They stratify the state space into symplectic leaves, invariant manifolds on which the dynamics are confined.
For information dynamics, the conservation of marginal entropy sum \(\sum h_i = C\) is our primary Casimir. The dynamics must respect this constraint at all times.
Why This Structure?
You might wonder why GENERIC has exactly this form. Why two operators? Why these specific conditions?
The answer is that this is the most general structure that allows reversible and irreversible processes to coexist while respecting:
1. Time-reversal symmetry for the reversible part
2. Second law for the irreversible part
3. Energy conservation overall
4. Additional conservation laws (Casimirs)
Grmela and Öttinger proved that any system satisfying these physical requirements must have GENERIC form. It’s not a choice, it’s a consequence of thermodynamic consistency.
In TIG this structure emerged from our information dynamics (Lectures 1-7) without imposing it. The axioms (Lecture 2) + maximum entropy dynamics (Lecture 3) + constraints (Lecture 4) produce the GENERIC structure. This suggests GENERIC is not just a modeling framework, it’s a principle that information isolated systems must obey.
A Worked Example: Damped Harmonic Oscillator
Let’s see GENERIC in action with a simple example: a harmonic oscillator with friction.
State: \(x = (q, p)\) (position and momentum)
Energy: \(E(q,p) = \frac{p^2}{2m} + \frac{1}{2}kq^2\) (kinetic + potential)
Entropy: For this simple example, we’ll use \(S = -\beta E\) where \(\beta = 1/(k_BT)\) is inverse temperature (this connects to Gibbs distribution).
Poisson operator: Standard symplectic structure from Lecture 5, \[ L = \begin{pmatrix} 0 & 1 \\ -1 & 0 \end{pmatrix}, \quad \{f,g\} = \frac{\partial f}{\partial q}\frac{\partial g}{\partial p} - \frac{\partial f}{\partial p}\frac{\partial g}{\partial q} \]
Friction operator: Simple isotropic damping, \[ M = \begin{pmatrix} 0 & 0 \\ 0 & \gamma \end{pmatrix} \] where \(\gamma > 0\) is the friction coefficient (only momentum experiences friction, not position).
The dynamics: Compute the gradients, \[ \nabla E = \begin{pmatrix} kq \\ p/m \end{pmatrix}, \quad \nabla S = -\beta \nabla E \]
Then GENERIC gives: \[ \begin{pmatrix} \dot{q} \\ \dot{p} \end{pmatrix} = L \nabla E + M \nabla S = \begin{pmatrix} 0 & 1 \\ -1 & 0 \end{pmatrix}\begin{pmatrix} kq \\ p/m \end{pmatrix} + \begin{pmatrix} 0 & 0 \\ 0 & \gamma \end{pmatrix}\begin{pmatrix} -\beta kq \\ -\beta p/m \end{pmatrix} \]
This yields: \[ \dot{q} = \frac{p}{m}, \quad \dot{p} = -kq - \gamma\beta\frac{p}{m} \]
The first equation is just velocity = momentum/mass. The second is Newton’s law with friction: \(m\ddot{q} = -kq - \gamma\beta \dot{q}\), which is exactly the damped harmonic oscillator!
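A quick numerical sketch (with illustrative values for \(m\), \(k\), \(\gamma\) and \(\beta\)) confirms this: integrating the GENERIC form with simple Euler steps reproduces the damped-oscillator equations integrated directly.

```python
import numpy as np

m, k, gamma, beta = 1.0, 2.0, 0.3, 1.0    # illustrative constants

L = np.array([[0.0, 1.0], [-1.0, 0.0]])   # Poisson operator
M = np.array([[0.0, 0.0], [0.0, gamma]])  # friction operator

def grad_E(x):
    q, p = x
    return np.array([k * q, p / m])

def grad_S(x):
    return -beta * grad_E(x)               # S = -beta * E in this example

def generic_rhs(x):
    return L @ grad_E(x) + M @ grad_S(x)

dt, n_steps = 1e-3, 5000
x = np.array([1.0, 0.0])                   # GENERIC trajectory, starting at q=1, p=0
y = np.array([1.0, 0.0])                   # direct trajectory of the damped oscillator
for _ in range(n_steps):
    x = x + dt * generic_rhs(x)
    q, p = y
    y = y + dt * np.array([p / m, -k * q - gamma * beta * p / m])

print(np.allclose(x, y))                   # the two trajectories coincide
```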
Check degeneracy:
- \(M \nabla E = (0, \gamma p/m)\) is NOT zero!
- Wait, this violates degeneracy condition 1!
Actually, for finite-dimensional systems with constant \(M\) and \(L\), the degeneracy conditions are automatically satisfied if we choose entropy correctly as \(S \propto -E\) (equilibrium condition). More generally, for complex systems we need to ensure degeneracy by construction. We’ll see how information dynamics naturally satisfies these conditions in the next section.
Summary: The GENERIC equation \(\dot{x} = L \nabla E + M \nabla S\) encodes:
- Structure: Antisymmetric \(L\) + symmetric positive semi-definite \(M\)
- Physics: Reversible dynamics (conserves energy, preserves entropy) + irreversible dynamics (conserves energy, increases entropy)
- Consistency: Degeneracy conditions couple \(L\) and \(M\) to ensure thermodynamic laws
- Generality: Covers everything from mechanics to thermodynamics to complex systems
Next, we’ll see how our information dynamics fit this framework perfectly.
Automatic Degeneracy
In standard GENERIC applications, ensuring thermodynamic consistency requires careful hand-crafting of operators. In our framework, the consistency conditions emerge automatically.
Automatic Degeneracy Conditions
A remarkable feature of the inaccessible game is that the GENERIC degeneracy conditions, which are typically difficult to impose and verify, emerge automatically from the constraint structure.
The GENERIC Degeneracy Requirements
Recall that standard GENERIC requires two degeneracy conditions for thermodynamic consistency:
Degeneracy 1: The entropy should be conserved by the reversible dynamics: \[ A(\boldsymbol{\theta})\nabla H(\boldsymbol{\theta}) = 0 \]
Degeneracy 2: The energy should be conserved by the irreversible dynamics: \[ S(\boldsymbol{\theta})\nabla E(\boldsymbol{\theta}) = 0 \]
In most GENERIC applications, constructing operators \(A\) and \(S\) that satisfy both conditions requires significant effort. You must carefully design the operators to ensure the degeneracies hold at every point in state space.
First Degeneracy: Automatic from Tangency
In our framework, the first degeneracy \(A\nabla H = 0\) holds automatically at every point on the constraint manifold. It comes from the constraint maintenance requirement: \[ \mathbf{a}^\top \dot{\boldsymbol{\theta}} = 0 \] where \(\mathbf{a} = \nabla\left(\sum_i h_i\right)\) is the constraint gradient.
This ensures that the dynamics remain tangent to the constraint surface at all times. The antisymmetric part \(A\) inherits this tangency because it generates rotations on the constraint manifold. Since rotations conserve everything (by definition \(\mathbf{z}^\top A\mathbf{z} = 0\) for antisymmetric \(A\)), and the rotations stay on the constraint surface, the first degeneracy is automatically satisfied.
Second Degeneracy: From Constraint Gradient
The second degeneracy is where our framework departs from standard GENERIC. In standard formulations, one requires \(S\nabla E = 0\) where \(E\) is thermodynamic energy. This must be verified case-by-case.
In our framework, the marginal entropy constraint \(\sum_i h_i = C\) plays the role that energy conservation plays in GENERIC. The constraint gradient \(\mathbf{a}\) defines the degeneracy direction along which the dissipative operator vanishes: \[ S\mathbf{a} = 0 \equiv S\nabla\left(\sum_i h_i\right) = 0. \]
This follows from the constraint tangency requirement: the symmetric part cannot have a component along the constraint gradient because that would violate conservation.
In Section 4 of the paper, we show that under certain conditions in the thermodynamic limit, the constraint gradient becomes asymptotically parallel to an energy gradient: \[ \nabla\left(\sum_i h_i\right) \parallel \nabla E. \]
When this equivalence holds, our automatically-derived degeneracy condition \(S\nabla(\sum_i h_i) = 0\) becomes equivalent to the standard GENERIC condition \(S\nabla E = 0\). This connects our information-theoretic framework to classical thermodynamics.
Why This Matters
The automatic emergence of degeneracy conditions is profound because:
No hand-crafting required: We don’t need to guess the form of \(A\) and \(S\) and verify they satisfy degeneracies. The structure emerges from the constrained dynamics.
Global validity: The degeneracies hold everywhere on the constraint manifold by construction, not just at specific points or in special cases.
Information-first foundation: Instead of starting with energy and thermodynamics and deriving information-theoretic properties, we start with information conservation and derive thermodynamic structure.
Fundamental principle: This suggests GENERIC is not just a modelling framework but a fundamental principle that information-isolated systems must obey.
In Grmela and Öttinger’s original work, satisfying the degeneracy conditions requires careful construction (see e.g. Chapter 4 of Öttinger (2005)). The fact that they emerge automatically in our framework suggests that marginal entropy conservation has special geometric significance for non-equilibrium dynamics.
Information Topography
The Fisher information matrix provides mathematical precision to the intuitive notion of an “information topography”—the landscape that shapes how information can flow.
Fisher Information as Conductance Tensor
In the inaccessible game, the Fisher information matrix \(G(\boldsymbol{\theta})\) plays a role analogous to conductance in electrical circuits—but with differences that make the game richer than a Kirchhoff network.
The Electrical Circuit Analogy
In a Kirchhoff electrical network, charge conservation is local and linear: \(\sum_j I_{ij} = 0\) at each node, where current flows according to Ohm’s law with fixed conductances: \[ I_{ij} = g_{ij}(V_i - V_j). \] The conductances \(g_{ij}\) are fixed parameters of the circuit. Given the conductances, the steady state can be found by solving linear equations derived from local charge conservation.
In contrast, our information conservation constraint \(\sum_{i=1}^n h_i = C\) is generally nonlocal and nonlinear. Each marginal entropy \(h_i\) requires marginalization over all other variables, making it a global functional of the entire state \(\boldsymbol{\theta}\).
Consider a multivariate Gaussian as an example. The marginal entropy is \[ h_i = \frac{1}{2}\log(2\pi e \sigma_i^2) \] where \(\sigma_i^2 = [G^{-1}]_{ii}\) is the \(i\)-th diagonal element of the inverse Fisher information. The conservation constraint becomes: \[ \sum_{i=1}^n \log([G^{-1}]_{ii}) = \text{constant}. \]
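As a minimal sketch (working directly from an illustrative covariance matrix \(\Sigma\), whose diagonal gives the marginal variances), the marginal entropies and their conserved sum can be computed as:

```python
import numpy as np

def gaussian_marginal_entropies(Sigma):
    """Marginal entropies h_i = 0.5 * log(2 pi e sigma_i^2), in nats."""
    variances = np.diag(Sigma)
    return 0.5 * np.log(2 * np.pi * np.e * variances)

# An illustrative 3-variable covariance
Sigma = np.array([[1.0, 0.6, 0.2],
                  [0.6, 1.5, 0.4],
                  [0.2, 0.4, 0.8]])

h = gaussian_marginal_entropies(Sigma)
print(h, h.sum())   # the conservation constraint fixes h.sum() along any admissible trajectory
```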
Dynamic Information Topography
Moreover, the Fisher information \(G(\boldsymbol{\theta})\) itself evolves with the state. This creates a dynamic information topography, more analogous to memristive networks than fixed resistors.
The “conductance” for information flow is given by the Fisher information: \[ G(\boldsymbol{\theta}) = \nabla^2 A(\boldsymbol{\theta}), \] which depends on the current state \(\boldsymbol{\theta}\). As the system evolves, both the “voltages” (parameters \(\boldsymbol{\theta}\)) and the “conductances” (Fisher information \(G\)) change together.
Information Channels and Bottlenecks
Despite the differences, the analogy provides intuition. The eigenvalues of \(G(\boldsymbol{\theta})\) indicate information channel capacities:
- Large eigenvalues: Low-resistance channels for information flow
- Small eigenvalues: Bottlenecks that constrain flow
- Eigenvectors: Directions of easy/hard information movement
The constrained maximum entropy production acts as a generalized Ohm’s law. Information flows “downhill” in the entropy landscape, but the rate of flow is governed by the Fisher information metric. The nonlocal conservation and emergent conductance structure create a system where information reorganizes itself through the interplay between local gradient flows and global constraints.
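A small sketch of reading the topography off a Fisher information matrix (illustrative numbers only): the eigendecomposition exposes the open channels and bottlenecks for information flow.

```python
import numpy as np

# An illustrative Fisher information matrix with one stiff and one soft direction
G = np.array([[4.0, 1.5],
              [1.5, 0.8]])

eigvals, eigvecs = np.linalg.eigh(G)

for lam, v in zip(eigvals, eigvecs.T):
    kind = "bottleneck (small eigenvalue)" if lam < 1.0 else "open channel (large eigenvalue)"
    print(f"eigenvalue {lam:.3f}: {kind}, direction {v}")
```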
Why “Topography?”
We think of \(G(\boldsymbol{\theta})\) as defining the information topography. In geography, topography describes hills, valleys, and plains that constrain how water flows. In our framework, \(G(\boldsymbol{\theta})\) describes the “shape” of the information landscape that constrains how information flows.
This formalises the metaphor from The Atomic Human (Lawrence, 2024): “In geography, the topography is the configuration of natural and man-made features in the landscape… An information topography is similar, but instead of the movement of goods, water and people, it dictates the movement of information.”
Formalising Information Topography
As noted above, The Atomic Human (Lawrence, 2024) introduced the information topography as a metaphor, but no formal mathematical definition was given. The inaccessible game provides one.
Mathematical Definition
We define the information topography of a system to be the Fisher information matrix \(G(\boldsymbol{\theta})\) viewed as a Riemannian metric on the space of probability distributions.
Formally, for an exponential family parametrised by natural parameters \(\boldsymbol{\theta}\):
Definition (Information Topography): The information topography is the pair \((G(\boldsymbol{\theta}), \mathcal{M})\) where:
- \(\mathcal{M}\) is the statistical manifold of probability distributions
- \(G(\boldsymbol{\theta}) = \nabla^2 A(\boldsymbol{\theta})\) is the Fisher information metric
This metric determines:
1. Information distances between distributions
2. Directions of information flow (geodesics)
3. Information channel capacities (eigenvalues)
4. Bottlenecks and pathways (eigenvectors)
How It Constrains Information Movement
The information topography constrains movement in three ways:
1. Metric Structure: The “distance” between two nearby distributions \(p(\boldsymbol{\theta})\) and \(p(\boldsymbol{\theta} + d\boldsymbol{\theta})\) is: \[ ds^2 = d\boldsymbol{\theta}^\top G(\boldsymbol{\theta}) d\boldsymbol{\theta} \] Moving in directions corresponding to small eigenvalues of \(G\) requires large parameter changes for small distributional changes—these are “narrow passes” in the information landscape.
2. Gradient Flow: The entropy gradient is: \[ \nabla H = -G(\boldsymbol{\theta})\boldsymbol{\theta} \] Information flows “downhill” in entropy space, but the Fisher information determines the effective slope. Regions with small eigenvalues have shallow gradients—information flows slowly.
3. Constraint Enforcement: Under conservation \(\sum h_i = C\), the constraint gradient \(\mathbf{a} = \nabla(\sum h_i)\) interacts with \(G\) to determine allowed flow directions. The dynamics become: \[ \dot{\boldsymbol{\theta}} = -\Pi_\parallel G(\boldsymbol{\theta})\boldsymbol{\theta} \] where \(\Pi_\parallel\) projects onto the constraint tangent space.
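The following is a small numerical sketch of this projected flow for a two-dimensional zero-mean Gaussian, parametrised directly by its precision matrix entries rather than the full natural-parameter machinery; gradients are taken by finite differences, so it illustrates the projection idea rather than reproducing the paper’s exact dynamics:

```python
import numpy as np

def entropies(prec):
    """Joint entropy H and marginal entropies h for a zero-mean Gaussian
    with precision matrix `prec` (entropies in nats)."""
    cov = np.linalg.inv(prec)
    n = prec.shape[0]
    H = 0.5 * np.log((2 * np.pi * np.e) ** n * np.linalg.det(cov))
    h = 0.5 * np.log(2 * np.pi * np.e * np.diag(cov))
    return H, h

def num_grad(f, x, eps=1e-6):
    """Central-difference gradient of a scalar function of a vector."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        d = np.zeros_like(x)
        d[i] = eps
        g[i] = (f(x + d) - f(x - d)) / (2 * eps)
    return g

unpack = lambda v: np.array([[v[0], v[1]], [v[1], v[2]]])  # (a, b, c) -> 2x2 precision
H_of = lambda v: entropies(unpack(v))[0]                   # joint entropy
C_of = lambda v: entropies(unpack(v))[1].sum()             # constrained quantity sum_i h_i

v = np.array([2.0, 0.8, 1.5])    # start from a correlated Gaussian
print("initial: H =", round(H_of(v), 4), " sum h_i =", round(C_of(v), 4))

step = 0.01
for _ in range(400):
    gH = num_grad(H_of, v)                          # entropy ascent direction
    a = num_grad(C_of, v)                           # constraint gradient
    v = v + step * (gH - a * (a @ gH) / (a @ a))    # project onto constraint tangent, then step

print("final:   H =", round(H_of(v), 4), " sum h_i =", round(C_of(v), 4))
# sum h_i is (approximately) conserved, while H climbs toward it as the correlation decays.
```

With small enough steps the constraint drifts only at second order; the joint entropy rises on the constraint surface by destroying correlation, exactly the reorganisation described above.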
Dynamic Topography
Unlike geographical topography, which is static, information topography is dynamic: it changes as the system evolves. As \(\boldsymbol{\theta}\) changes, so does \(G(\boldsymbol{\theta})\). This creates a feedback loop:
- Current topography \(G(\boldsymbol{\theta})\) determines information flow
- Flow changes parameters \(\boldsymbol{\theta}\)
- New parameters change topography \(G(\boldsymbol{\theta})\)
- Repeat
This dynamic restructuring is what makes information systems so rich. The landscape itself evolves as you move through it.
This formalisation gives mathematical precision to the intuitive notion from The Atomic Human that information movement is constrained by structure. The Fisher information matrix is that structure, and the inaccessible game describes how systems evolve within it.
Fisher Information as Geometry
In the previous section, we saw that for exponential families, the Fisher information matrix appears as the second derivative of the log partition function \[ G(\boldsymbol{\theta}) = \nabla^2 \mathcal{A}(\boldsymbol{\theta}) = \mathrm{Cov}_{\boldsymbol{\theta}}[T(\mathbf{x})]. \] We now develop the geometric interpretation: the Fisher information matrix defines a metric on the space of probability distributions.
The Statistical Manifold
Consider the space of all probability distributions in an exponential family, parametrized by \(\boldsymbol{\theta}\). This space forms a manifold — a smooth, curved space where each point represents a different distribution.
The Fisher information matrix \(G(\boldsymbol{\theta})\) acts as a Riemannian metric on this manifold. Think of measuring distances on a curved surface like a sphere: you need a metric to tell you how far apart two nearby points are. The Fisher information provides exactly this for the space of probability distributions, telling us how to measure “statistical distance” between distributions.
The Fisher information defines the information distance between nearby distributions. If we move from parameters \(\boldsymbol{\theta}\) to \(\boldsymbol{\theta} + \text{d}\boldsymbol{\theta}\), the infinitesimal distance in information space is \[ \text{d}s^2 = \text{d}\boldsymbol{\theta}^\top G(\boldsymbol{\theta}) \text{d}\boldsymbol{\theta}, \] with the Fisher information playing the role of the metric. Larger Fisher information means a given parameter change corresponds to a larger “information distance”: the distributions are more distinguishable.
Connection to Statistical Estimation
This geometric picture connects directly to Fisher’s original motivation. The Cramér-Rao bound states that for any unbiased estimator \(\hat{\boldsymbol{\theta}}\) of parameters \(\boldsymbol{\theta}\), \[ \text{cov}(\hat{\boldsymbol{\theta}}) \succeq G^{-1}(\boldsymbol{\theta}), \] where \(\succeq\) denotes that the left side minus the right side is positive semidefinite.
Geometrically, this means: higher Fisher information (stronger metric) implies tighter bounds on estimation. The inverse \(G^{-1}\) gives the minimum possible covariance of any unbiased estimator; it is the fundamental limit on how well we can estimate parameters from data.
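As a quick numerical illustration (a standard textbook check with illustrative numbers): for i.i.d. samples from a Gaussian with known variance, the sample mean is an unbiased estimator of the mean and its variance sits right at the Cramér-Rao bound \(\sigma^2/n\):

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma, trials = 50, 2.0, 20_000
crb = sigma**2 / n                     # Cramer-Rao bound: inverse of the Fisher information n/sigma^2

data = rng.normal(loc=0.0, scale=sigma, size=(trials, n))
estimates = data.mean(axis=1)          # maximum likelihood estimate of the mean
print("empirical variance of estimator:", round(estimates.var(), 4))
print("Cramer-Rao bound               :", round(crb, 4))
```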
The Fisher information plays two distinct but related roles:
As a metric: It defines information distance, telling us how “far apart” distributions are.
In gradient flow: Recall from the exponential family definitions that \(\nabla H = -G(\boldsymbol{\theta})\boldsymbol{\theta}\). This means entropy gradient ascent in exponential families involves the Fisher information, \[ \dot{\boldsymbol{\theta}} = \nabla H = -G(\boldsymbol{\theta})\boldsymbol{\theta}. \]
The appearance in the gradient comes from the specific structure of exponential families (where \(G = \nabla^2 \mathcal{A}\)). Together, they determine how the system flows through information space, with the geometry guiding the dynamics.
Examples Revisited
For the Gaussian distribution, we saw that \(G(\boldsymbol{\theta}) = \Sigma\). This means:
- The information metric is the covariance matrix
- The inverse \(G^{-1} = \Sigma^{-1}\) is the precision matrix
Geometrically, the information ellipsoid has the same shape as the probability ellipsoid. This direct connection between the Fisher information and covariance is special to Gaussians (and arises because we’re working in natural parameters \(\boldsymbol{\theta} = \Sigma^{-1}\boldsymbol{\mu}\)).
For a categorical distribution with \(K\) outcomes, the Fisher information has a special structure. Using the natural parameters \(\theta_k = \log \pi_k\), the Fisher information is \[ G_{ij}(\boldsymbol{\theta}) = \delta_{ij}\pi_i - \pi_i\pi_j = \begin{cases} \pi_i(1 - \pi_i) & i = j \\ -\pi_i\pi_j & i \neq j \end{cases} \]
This metric defines the probability simplex geometry. Distributions near the center of the simplex (all \(\pi_k \approx 1/K\)) have different local geometry than those near the corners (one \(\pi_k \approx 1\)). The Fisher metric captures this intrinsic curvature.
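A short check of the formula above (with illustrative probabilities): the matrix \(\delta_{ij}\pi_i - \pi_i\pi_j\) is just the covariance of the one-hot sufficient statistics, which we can confirm by sampling:

```python
import numpy as np

pi = np.array([0.5, 0.3, 0.2])           # categorical probabilities, K = 3
G = np.diag(pi) - np.outer(pi, pi)        # Fisher information: diag(pi) - pi pi^T

# Compare with the covariance of the one-hot sufficient statistics:
rng = np.random.default_rng(0)
samples = rng.choice(len(pi), size=200_000, p=pi)
onehot = np.eye(len(pi))[samples]
print(np.round(G, 3))
print(np.round(np.cov(onehot, rowvar=False), 3))  # should agree to ~2 decimals
```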
Information Geometry: The Big Picture
The Fisher information matrix is a foundational element of information geometry, a field that studies probability distributions using differential geometric tools. Key insights:
Amari’s Dually Flat Structure: Exponential families have a special property. They are “dually flat” under two different coordinate systems (natural parameters \(\boldsymbol{\theta}\) and expectation parameters \(\boldsymbol{\mu}\)). The Fisher metric connects these.
Geodesics: The shortest path between two distributions (in the information geometry sense) is a geodesic. For exponential families, geodesics have elegant forms that will connect to our least action principles.
Curvature: The curvature of the statistical manifold (measured by the Riemann curvature tensor derived from \(G\)) tells us about the intrinsic structure of the family. Exponential families have zero curvature in a certain sense—they are “flat” manifolds.
These geometric properties will be essential when we study constrained information dynamics and emergence.
Why This Matters for The Inaccessible Game
The Fisher information matrix plays three roles in our framework:
Gradient Flow Metric: It appears in the entropy gradient, determining how the system evolves through information space via \(\dot{\boldsymbol{\theta}} = -G(\boldsymbol{\theta})\boldsymbol{\theta}\).
Information Distance: It defines the metric for measuring statistical distinguishability between distributions.
Emergence Indicator: Changes in the structure of \(G\) signal the emergence of new regimes and effective descriptions.
Understanding Fisher information as geometry, not just as a statistical tool, is key for everything that follows.
Connecting Information to Energy
The Thermodynamic Limit
Perhaps the most surprising result is that our information-theoretic constraint becomes equivalent to energy conservation in appropriate limits.
Energy-Entropy Equivalence in the Thermodynamic Limit
We’ve seen that marginal entropy conservation \(\sum_i h_i = C\) leads to GENERIC-like structure. But how does this connect to traditional thermodynamics with energy conservation? The answer lies in the thermodynamic limit.
The Energy-Entropy Question
In real GENERIC systems, it’s not marginal entropy that’s conserved but extensive thermodynamic energy \(E\). Can we show that, under specific conditions, the marginal entropy constraint asymptotically singles out the same degeneracy direction as energy conservation?
In other words, does the constraint gradient \(\nabla(\sum_i h_i)\) become parallel to \(\nabla E\) in appropriate limits?
Conditions for Equivalence
The equivalence requires specific scaling properties. Using multi-information \(I = \sum_i h_i - H\), we can write: \[ \nabla\left(\sum_i h_i\right) = \nabla I + \nabla H. \]
Our requirement: Along certain directions (order parameters), the multi-information gradient must scale intensively while entropy gradients scale extensively.
Specifically, consider a macroscopic order parameter \(m\) (like magnetization in a spin system). The requirement is:
- \(\nabla_m I = \mathscr{O}(1)\) (intensive)
- \(\nabla_m H = \mathscr{O}(n)\) (extensive)
- \(\nabla_m(\sum_i h_i) = \mathscr{O}(n)\) (extensive)
where \(n\) is the number of variables.
When this scaling holds: \[ \nabla_m\left(\sum_i h_i\right) = \nabla_m H + \nabla_m I = \mathscr{O}(n) + \mathscr{O}(1). \]
In the thermodynamic limit \(n \to \infty\), the \(\mathscr{O}(1)\) correction from multi-information becomes negligible: \[ \nabla_m\left(\sum_i h_i\right) \parallel \nabla_m H. \]
Connecting to Energy
For exponential families, the entropy gradient in expectation parameters \(\boldsymbol{\mu} = \langle T(\mathbf{x})\rangle\) is: \[ \nabla_{\boldsymbol{\mu}} H = -\boldsymbol{\theta} \] where \(\boldsymbol{\theta}\) are the natural parameters.
Now define an energy functional as: \[ E(\mathbf{x}) = \boldsymbol{\alpha}^\top T(\mathbf{x}) \] where \(\boldsymbol{\alpha}\) is chosen such that \(\boldsymbol{\theta} = -\beta\boldsymbol{\alpha}\). This gives: \[ \nabla_{\boldsymbol{\mu}} E = \boldsymbol{\alpha} = -\frac{\boldsymbol{\theta}}{\beta} = \frac{\nabla_{\boldsymbol{\mu}} H}{\beta}. \]
Therefore \[ \nabla E \parallel \nabla H \parallel \nabla\left(\sum_i h_i\right) \] along the macroscopic direction in the thermodynamic limit.
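For concreteness, the identification above is just the Boltzmann form of the exponential family density under the sign convention \(\boldsymbol{\theta} = -\beta\boldsymbol{\alpha}\): \[ p(\mathbf{x}) = \exp\!\left(\boldsymbol{\theta}^\top T(\mathbf{x}) - A(\boldsymbol{\theta})\right) = \exp\!\left(-\beta\,\boldsymbol{\alpha}^\top T(\mathbf{x}) - A(\boldsymbol{\theta})\right) = \frac{1}{Z}\exp\!\left(-\beta E(\mathbf{x})\right), \] with \(E(\mathbf{x}) = \boldsymbol{\alpha}^\top T(\mathbf{x})\) and \(Z = \exp(A(\boldsymbol{\theta}))\), so that \(\beta\) plays exactly the role of an inverse temperature.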
When Does This Hold?
The equivalence requires:
Macroscopic order parameter: There exists a low-dimensional direction (like total magnetisation \(m = \sum_i x_i\)) that captures system-wide behavior
Bounded correlations: The correlation length \(\xi\) remains finite (away from criticality), ensuring \(\nabla_m I\) stays intensive
Translation invariance: Symmetry ensures marginal entropies are identical conditioned on the order parameter
Thermodynamic limit: Number of variables \(n \to \infty\)
Not all systems satisfy these conditions. Near critical points, correlations diverge and the intensive scaling breaks down. In small systems, \(\mathscr{O}(1)\) corrections matter. But for many physically relevant systems—bulk matter far from phase transitions—the equivalence holds and provides a bridge between information theory and thermodynamics.
Implications
This equivalence has several implications:
Information \(\leftrightarrow\) Thermodynamics: Energy conservation and marginal entropy conservation become equivalent statements in appropriate limits
Temperature emergence: The parameter \(\beta\) emerges as inverse temperature from the information geometry, not imposed from thermodynamics
Landauer’s principle: Can be derived from information conservation via this equivalence
Wheeler’s “it from bit”: Suggests thermodynamics might emerge from information-theoretic principles
GENERIC and Thermodynamics
GENERIC as Generalized Thermodynamics
GENERIC provides a framework that generalizes classical thermodynamics to arbitrary non-equilibrium systems. Where classical thermodynamics describes systems near equilibrium with linear response, GENERIC handles systems arbitrarily far from equilibrium with full nonlinear dynamics.
Classical thermodynamics (Clausius, Kelvin, Carnot):
- Equilibrium states
- Quasi-static processes
- Entropy maximization at equilibrium
- No dynamics, only relations between states
Linear irreversible thermodynamics (Onsager, Prigogine):
- Near-equilibrium dynamics
- Linear force-flux relations
- Onsager reciprocity
- Valid only for small deviations
GENERIC (Grmela, Öttinger):
- Arbitrary far-from-equilibrium states
- Nonlinear dynamics
- Combines reversible + irreversible
- Reduces to classical thermodynamics at equilibrium
GENERIC is the completion of thermodynamics—it describes the full dynamical evolution, not just equilibrium endpoints.
The Laws of Thermodynamics in GENERIC
GENERIC automatically encodes the fundamental laws of thermodynamics. Let’s see how:
Zeroth Law (Transitivity of equilibrium): At equilibrium, \(\dot{x} = L \nabla E + M \nabla S = 0\). If systems A and B are each in equilibrium with C, and equilibrium is defined by the same functionals \(E\) and \(S\), then A and B are in equilibrium with each other. This follows from the uniqueness of critical points.
First Law (Energy conservation): \[ \frac{\text{d}E}{\text{d}t} = \langle \nabla E, \dot{x} \rangle = \langle \nabla E, L \nabla E + M \nabla S \rangle \] Using antisymmetry of \(L\): \(\langle \nabla E, L \nabla E \rangle = 0\)
Using degeneracy condition 1: \(\langle \nabla E, M \nabla S \rangle = 0\)
Therefore: \(\frac{\text{d}E}{\text{d}t} = 0\) (energy is conserved!)
The first law is built into GENERIC structure through antisymmetry and degeneracy.
Second Law (Entropy increase): \[ \frac{\text{d}S}{\text{d}t} = \langle \nabla S, \dot{x} \rangle = \langle \nabla S, L \nabla E + M \nabla S \rangle \] Using degeneracy condition 2: \(\langle \nabla S, L \nabla E \rangle = 0\)
Using positive semi-definiteness of \(M\): \(\langle \nabla S, M \nabla S \rangle \geq 0\)
Therefore: \(\frac{\text{d}S}{\text{d}t} \geq 0\) (entropy increases!)
The second law is built into GENERIC through degeneracy and positive semi-definiteness.
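Here is a minimal numerical sketch of these two derivations, assuming only the structural conditions on \(L\) and \(M\) stated above (the construction of the operators is illustrative, not canonical): an antisymmetric \(L\) with \(L\nabla S = 0\) and a positive semi-definite \(M\) with \(M\nabla E = 0\) automatically give \(\dot{E} = 0\) and \(\dot{S} \geq 0\).

```python
import numpy as np

rng = np.random.default_rng(0)
d = 6
gE, gS = rng.normal(size=d), rng.normal(size=d)   # stand-ins for grad E and grad S at a state x

def proj_orth(v):
    """Projector onto the subspace orthogonal to v."""
    return np.eye(d) - np.outer(v, v) / (v @ v)

B, C = rng.normal(size=(d, d)), rng.normal(size=(d, d))
L = proj_orth(gS) @ (B - B.T) @ proj_orth(gS)     # antisymmetric, with L grad S = 0
M = proj_orth(gE) @ C @ C.T @ proj_orth(gE)       # positive semi-definite, with M grad E = 0

xdot = L @ gE + M @ gS                            # GENERIC vector field at x
print("dE/dt =", gE @ xdot)                       # ~0: first law
print("dS/dt =", gS @ xdot)                       # >= 0: second law
```

The same construction also illustrates the free energy result below: with \(\mathcal{F} = E - TS\), \(\dot{\mathcal{F}} = -T\langle \nabla S, M\nabla S\rangle \leq 0\).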
Third Law (Entropy vanishes at zero temperature): This is more subtle and depends on the specific form of \(S\) and \(M\), but GENERIC is compatible with quantum mechanical formulations where the third law emerges naturally.
Onsager Reciprocity Relations
One of the crowning achievements of linear irreversible thermodynamics was Onsager’s reciprocity relations (Onsager, 1931). These state that near equilibrium, the response matrix relating thermodynamic forces to fluxes is symmetric.
GENERIC provides a non-linear generalization of Onsager reciprocity through the symmetry of the friction operator \(M\).
Near equilibrium, expand \(M(x)\nabla S(x)\) to leading order: \[ M(x) \nabla S(x) \approx M(x_{\text{eq}}) \nabla^2 S(x_{\text{eq}}) (x - x_{\text{eq}}) = M_0 H_S (x - x_{\text{eq}}) \] where \(H_S\) is the Hessian of entropy at equilibrium.
The flux is \(J = M_0 H_S \delta x\) and the thermodynamic force is \(X = H_S \delta x\). The response matrix (the Onsager matrix, not to be confused with the Poisson operator \(L\) above) is: \[ J = L X \quad \text{where} \quad L = M_0 \]
Since \(M_0\) is symmetric (GENERIC requirement), we have \(L_{ij} = L_{ji}\), which is exactly Onsager reciprocity!
Key insight: Onsager reciprocity isn’t a separate postulate—it’s a consequence of the symmetric structure of the friction operator, which in turn follows from thermodynamic consistency (entropy increase).
Entropy Production
A central concept in non-equilibrium thermodynamics is entropy production—the rate at which entropy is generated by irreversible processes. In GENERIC, this has a beautiful formulation.
The total entropy rate is: \[ \frac{\text{d}S}{\text{d}t} = \langle \nabla S, L \nabla E + M \nabla S \rangle = \langle \nabla S, M \nabla S \rangle \] (using degeneracy 2: \(L \nabla S = 0\))
We can decompose this as: \[ \dot{S} = \sigma_S \geq 0 \] where \(\sigma_S = \langle \nabla S, M \nabla S \rangle\) is the entropy production rate.
Key properties:
1. Non-negative: \(\sigma_S \geq 0\) (from positive semi-definiteness of \(M\))
2. Vanishes at equilibrium: When \(\nabla S = 0\), we have \(\sigma_S = 0\)
3. Measures irreversibility: \(\sigma_S\) quantifies how far the system is from reversible dynamics
For our information dynamics: \[ \sigma_S = \langle \nabla H, \dot{\boldsymbol{\theta}} \rangle = (-G\boldsymbol{\theta})^\top(-G\boldsymbol{\theta}) = \boldsymbol{\theta}^\top G^2 \boldsymbol{\theta} \]
This is exactly the entropy production from maximum entropy dynamics (Lecture 3)! The Fisher information matrix \(G\) governs the rate of entropy increase.
Free Energy and Dissipation
In equilibrium thermodynamics, the free energy \(F = E - TS\) (where \(T\) is temperature) determines equilibrium states: systems minimize \(F\) at fixed temperature.
In GENERIC, we can define a generalized free energy functional and show that it decreases along trajectories (for isothermal processes).
For a system at temperature \(T\), define: \[ \mathcal{F} = E - T S \]
The rate of change is: \[ \frac{\text{d}\mathcal{F}}{\text{d}t} = \frac{\text{d}E}{\text{d}t} - T\frac{\text{d}S}{\text{d}t} = 0 - T \sigma_S = -T \langle \nabla S, M \nabla S \rangle \leq 0 \]
So free energy decreases! The system dissipates toward minimum free energy at equilibrium.
For information dynamics, this connects to the free energy in exponential families: \[ \mathcal{F}(\boldsymbol{\theta}) = -A(\boldsymbol{\theta}) + \boldsymbol{\theta}^\top \mathbb{E}[T(\mathbf{x})] \] where \(A(\boldsymbol{\theta})\) is the log partition function. The dynamics \(\dot{\boldsymbol{\theta}} = -G\boldsymbol{\theta}\) perform gradient descent on free energy (under constraints).
Fluctuation-Dissipation Relations
Another deep result from statistical mechanics is the fluctuation-dissipation theorem, which relates the response of a system to perturbations to its spontaneous fluctuations at equilibrium.
Near equilibrium, fluctuations in a variable \(x_i\) have variance: \[ \langle (\delta x_i)^2 \rangle \propto k_B T \]
The response to a small force \(f_j\) is: \[ \langle \delta x_i \rangle = \chi_{ij} f_j \]
The fluctuation-dissipation theorem states: \[ \chi_{ij} \propto \frac{\langle \delta x_i \delta x_j \rangle}{k_B T} \]
In GENERIC, this emerges naturally from the structure. The friction operator \(M\) governs both:
- Dissipation: How perturbations relax
- Fluctuations: Equilibrium variance (through the Gibbs distribution)
For our information dynamics, \(M = G\) (the Fisher information). In natural parameters \(G\) is exactly the covariance of the sufficient statistics, \[ G_{ij} = \text{Cov}(T_i, T_j) = \langle (\delta T_i)(\delta T_j) \rangle, \] while in the expectation parametrisation the Fisher information is its inverse.
The operator that governs dissipation and the covariance that governs equilibrium fluctuations are two faces of the same matrix—a direct manifestation of fluctuation-dissipation!
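A one-line check of this identity for a Bernoulli variable in natural parameters (purely illustrative): the second derivative of the log partition function equals the sampled variance of the sufficient statistic.

```python
import numpy as np

theta = 0.7                                   # natural parameter of a Bernoulli
p = 1.0 / (1.0 + np.exp(-theta))              # mean parameter
fisher = p * (1 - p)                          # G = A''(theta) for A(theta) = log(1 + e^theta)

rng = np.random.default_rng(2)
x = rng.random(100_000) < p                   # samples of the sufficient statistic T(x) = x
print("Fisher information:", round(fisher, 4), " sample variance of T:", round(x.var(), 4))
```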
Maximum Entropy Production Principle
The maximum entropy production principle (MEPP) states that non-equilibrium steady states are characterized by maximum entropy production rate subject to constraints. This principle is debated in thermodynamics, but GENERIC provides a framework for understanding when it applies.
From Lecture 3, we derived that unconstrained information dynamics maximize entropy production: \[ \dot{S} = \max_{\dot{\boldsymbol{\theta}}} \{\dot{S}(\dot{\boldsymbol{\theta}}) : \text{Fisher-constrained}\} \]
With constraints, the dynamics become: \[ \dot{\boldsymbol{\theta}} = -G\boldsymbol{\theta} - \nu a \]
This still maximizes entropy production on the constraint manifold: \[ \dot{S} = \max_{\dot{\boldsymbol{\theta}} : a^\top\dot{\boldsymbol{\theta}}=0} \{-\boldsymbol{\theta}^\top G \dot{\boldsymbol{\theta}}\} \]
So MEPP holds for information dynamics as a consequence of GENERIC structure + Fisher information as friction.
The general lesson: MEPP emerges when:
1. The friction operator is related to the entropy Hessian (Fisher information)
2. Constraints are properly accounted for via Lagrange multipliers
3. The system is not externally driven
GENERIC provides the mathematical framework for understanding when MEPP applies and when it doesn’t.
Connection to Non-Equilibrium Statistical Mechanics
GENERIC bridges macroscopic thermodynamics and microscopic statistical mechanics. While we’ve been working at the level of distributions and information, GENERIC can be derived from:
Microscopic foundations:
- Liouville equation for phase space density
- BBGKY hierarchy for reduced distributions
- Projection operator methods (Zwanzig-Mori)
These microscopic theories show how reversible microscopic dynamics (Hamiltonian) give rise to irreversible macroscopic dynamics (dissipation) through coarse-graining and loss of information.
The antisymmetric part \(L\) preserves the fine-grained microscopic reversibility. The symmetric part \(M\) captures the effective irreversibility from ignored microscopic degrees of freedom.
For information dynamics:
- Fine-grained: Full joint distribution \(p(\mathbf{x})\)
- Coarse-grained: Natural parameters \(\boldsymbol{\theta}\) (exponential family)
- \(L\): Preserves structure within parameter space
- \(M = G\): Dissipation from unobserved correlations
This connects our information-theoretic approach to fundamental statistical mechanics.
Summary: GENERIC generalizes classical thermodynamics to arbitrary non-equilibrium dynamics. The laws of thermodynamics, Onsager reciprocity, entropy production, fluctuation-dissipation, and maximum entropy production all emerge as consequences of GENERIC structure. For information dynamics, Fisher information plays the role of thermodynamic friction, connecting information theory to the foundations of statistical mechanics and thermodynamics. The framework we built from axioms (Lectures 1-7) is not just mathematically consistent—it’s thermodynamically fundamental.
Landauer’s Principle
With the energy-entropy equivalence established, we can derive Landauer’s principle—the fundamental limit on information erasure—from our information-theoretic framework.
Landauer’s Principle from the Inaccessible Game
The GENERIC-like structure and energy-entropy equivalence provide a natural framework for deriving Landauer’s principle (Landauer, 1961), which states that erasing information requires dissipating at least \(k_BT\log 2\) of energy per bit.
Information Erasure as a Process
Consider erasing one bit: a memory variable \(x_i \in \{0,1\}\) is reset to a standard state (say \(x_i = 0\)), destroying the stored information. From an ensemble perspective—considering many such erasure operations where the initial value is random—the marginal entropy of this variable decreases: \[ \Delta h(X_i) = -\log 2. \]
Conservation Requires Redistribution
For a system obeying \(\sum_i h_i = C\), this decrease must be compensated by increases elsewhere: \[ \sum_{j \neq i} \Delta h(X_j) = +\log 2. \]
The antisymmetric (conservative) part \(A\) of our GENERIC dynamics (the analogue of the Poisson operator \(L\) above) preserves both \(H\) and \(\sum_i h_i\). It can only shuffle entropy reversibly between variables. But such reversible redistribution doesn’t truly erase the information—it merely moves it to other variables, from which it could in principle be recovered.
True irreversible erasure requires increasing the total joint entropy \(H\) (second law) while maintaining \(\sum_i h_i = C\). Since \(I = \sum_i h_i - H\), this means decreasing multi-information: \[ \Delta I < 0. \]
This reduction of correlations is precisely what the dissipative part \(S\) (the analogue of the friction operator \(M\) above) achieves. It increases \(H\) through entropy production while the constraint forces redistribution of marginal entropies. The erasure process thus necessarily involves the dissipative dynamics, not just conservative reshuffling.
Energy Cost from Energy-Entropy Equivalence
In the thermodynamic limit with energy-entropy equivalence (Section 5 of the paper), the gradients \(\nabla(\sum_i h_i)\) and \(\nabla E\) become parallel along the order-parameter direction. Near thermal equilibrium at inverse temperature \(\beta = \tfrac{1}{k_BT}\), this implies: \[ \beta \langle E \rangle \approx \sum_i h_i + \text{const}. \]
Therefore, erasing one bit requires: \[ \Delta(\beta \langle E \rangle) \approx \Delta h(X_i) = -\log 2, \] giving an energy change: \[ \Delta \langle E \rangle \approx -\frac{\log 2}{\beta} = -k_BT \log 2. \]
Dissipation Bound
Since the system must dissipate this energy via the symmetric part \(S\), and entropy production is non-negative, we obtain Landauer’s bound \[ Q_{\text{dissipated}} \geq k_BT\log 2. \]
This derivation shows that Landauer’s principle emerges from:
1. Marginal entropy conservation \(\sum_i h_i = C\)
2. GENERIC-like structure distinguishing conservative redistribution (\(A\)) from dissipative entropy production (\(S\))
3. Energy-entropy equivalence in the thermodynamic limit
The insight is that erasure requires both redistributing marginal entropy (to maintain the constraint) and increasing total entropy \(H\) (second law), which necessarily reduces multi-information \(I\) and invokes dissipation.
The information-theoretic constraint provides the foundation, with thermodynamic energy appearing as its dual in the large-system limit. This reverses the usual derivation where information bounds follow from thermodynamics — here thermodynamic bounds follow from information theory.
Is Landauer’s Limit Related to Shannon’s Gaussian Channel Capacity?
Digital memory can be viewed as a communication channel through time - storing a bit is equivalent to transmitting information to a future moment. This perspective immediately suggests that we look for a connection between Landauer’s erasure principle and Shannon’s channel capacity. The connection might arise because both these systems are about maintaining reliable information against thermal noise.
The Landauer limit (Landauer, 1961) is the minimum amount of heat energy that is dissipated when a bit of information is erased. Conceptually it is the potential energy associated with holding a bit at a single identifiable value that is distinguishable from the background thermal noise (represented by temperature).
The Gaussian channel capacity (Shannon, 1948) represents how identifiable a signal, \(S\), is relative to the background noise, \(N\). Here we briefly explore the potential relationship between these two limits.
When we store a bit in memory, we maintain a signal that can be reliably distinguished from thermal noise, just as in a communication channel. This suggests that Landauer’s limit for erasure of one bit of information, \(E_{\text{min}} = k_BT\log 2\), and Shannon’s Gaussian channel capacity, \[ C = \frac{1}{2}\log_2\left(1 + \frac{S}{N}\right), \] might be different views of the same limit.
Landauer’s limit states that erasing one bit of information requires a minimum energy of \(E_{\text{min}} = k_BT\log 2\), of order \(k_BT\). For a communication channel operating over time \(1/B\), the signal power is \(S = EB\) and the noise power is \(N = k_BTB\). This gives us: \[ C = \frac{1}{2}\log_2\left(1 + \frac{S}{N}\right) = \frac{1}{2}\log_2\left(1 + \frac{E}{k_BT}\right) \] where the bandwidth \(B\) cancels out in the ratio.
When we operate at the scale of Landauer’s limit, setting \(E = k_BT\), we get a signal-to-noise ratio of exactly 1: \[ \frac{S}{N} = \frac{E}{k_BT} = 1 \] This yields a channel capacity of exactly half a bit per second, \[ C = \frac{1}{2}\log_2(2) = \frac{1}{2} \text{ bit/s} \]
The factor of 1/2 appears in Shannon’s formula because of Nyquist’s theorem - we need two samples per cycle at bandwidth B to represent a signal. The bandwidth \(B\) appears in both signal and noise power but cancels in their ratio, showing how Landauer’s energy-per-bit limit connects to Shannon’s bits-per-second capacity.
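The back-of-envelope numbers above are easy to reproduce (taking \(E = k_BT\) as the operating point, as in the text):

```python
import numpy as np

k_B, T = 1.380649e-23, 300.0           # Boltzmann constant (J/K), room temperature (K)
E = k_B * T                            # operate at the k_B T energy scale
snr = E / (k_B * T)                    # S/N = E / (k_B T) after the bandwidth cancels
C = 0.5 * np.log2(1 + snr)             # Shannon capacity at this signal-to-noise ratio

print("Landauer bound k_B T log 2 =", k_B * T * np.log(2), "J")
print("S/N =", snr, " capacity =", C, "bits")
```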
This connection suggests that Landauer’s limit may correspond to the energy needed to establish a signal-to-noise ratio sufficient to transmit one bit of information per second. The temperature \(T\) may set both the minimum energy scale for information erasure and the noise floor for information transmission.
Implications for Information Engines
This connection suggests that the fundamental limits on information processing may arise from the need to maintain signals above the thermal noise floor. Whether we’re erasing information (Landauer) or transmitting it (Shannon), we need to overcome the same fundamental noise threshold set by temperature.
This perspective suggests that both memory operations (erasure) and communication operations (transmission) are limited by the same physical principles. The temperature \(T\) emerges as a fundamental parameter that sets the scale for both energy requirements and information capacity.
The connection between Landauer’s limit and Shannon’s channel capacity is intriguing but remains speculative. For Landauer’s original work see Landauer (1961); for Bennett’s review and developments see Bennett (1982); and for a more recent overview connecting to developments in non-equilibrium thermodynamics see Parrondo et al. (2015).
Implications
Information-Theoretic Limits
The framework reveals fundamental constraints on information processing systems, including intelligent systems.
Information-Theoretic Limits on Intelligence
Just as the second law of thermodynamics places fundamental limits on mechanical engines, no matter how cleverly designed, information theory places fundamental limits on information engines, no matter how cleverly implemented.
What Intelligent Systems Must Do
Any intelligent system, whether biological or artificial, must perform certain fundamental operations:
- Acquire information from its environment (sensing, observation)
- Store information about the world (memory)
- Process information to make decisions (computation)
- Erase information to make room for new data (memory management)
- Act on the world using the processed information
Each of these operations has information-theoretic costs that cannot be eliminated by clever engineering.
The Landauer Bound on Computation
Landauer’s principle (Landauer, 1961) establishes that erasing one bit of information requires dissipating at least \(k_BT\log 2\) of energy as heat, where \(k_B\) is Boltzmann’s constant and \(T\) is temperature.
This isn’t an engineering limitation; it’s a fundamental consequence of the second law. To reset a bit to a standard state (say, always 0) requires reducing its entropy from 1 bit to 0 bits. That entropy must go somewhere, and it ends up as heat in the environment.
Modern computers operate billions of times above the Landauer limit due to engineering constraints. But even if we could build computers at the thermodynamic limit, consider a brain-scale computation. Lawrence (2017) reviews estimates suggesting it would require over an exaflop (\(10^{18}\) floating point operations per second) to perform a full simulation of the human brain, based on Ananthanarayanan et al. (2009). Other authors have suggested the required rate could be as low as \(10^{15}\) operations per second (Moravec, 1999; Sandberg and Bostrom, 2008).
Taking the most conservative estimate of \(10^{15}\) operations/sec for functionally relevant computation:
- \(\sim 10^{15}\) operations/sec
- Running for one year (\(\sim 3\times 10^7\) seconds)
- At room temperature (300K)
This would require at minimum (assuming one bit erasure per operation): \[ E \sim 10^{15} \times 3\times10^{7} \times 3\times10^{-21} \approx 10^2 \text{ Joules} \]
That seems small, but this is just for erasing bits. It doesn’t include the entropy production that occurs in:
- Acquiring the data
- Moving data around
- The actual computation
- Dissipation in real (non-ideal) systems
The actual human brain consumes about 20W continuously, or \(\sim 6 \times 10^8\) Joules per year—roughly \(10^6\) times the Landauer limit.
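A quick check of these numbers, using the figures quoted above (one bit erased per operation is the simplifying assumption made in the text):

```python
import numpy as np

k_B, T = 1.380649e-23, 300.0              # J/K, room temperature
landauer = k_B * T * np.log(2)            # minimum energy to erase one bit (~2.9e-21 J)

ops_per_sec = 1e15                        # conservative brain-scale estimate from the text
seconds_per_year = 3e7
E_min = ops_per_sec * seconds_per_year * landauer   # one bit erased per operation
E_brain = 20.0 * seconds_per_year                   # ~20 W sustained for a year

print(f"Landauer bound per bit       : {landauer:.2e} J")
print(f"Minimum for a year's compute : {E_min:.1e} J")
print(f"Actual brain, one year       : {E_brain:.1e} J")
print(f"Ratio                        : {E_brain / E_min:.1e}")
```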
Fisher Information Bounds on Learning
The Fisher information matrix \(G(\boldsymbol{\theta})\) sets fundamental bounds on how quickly a system can learn. The Cramér-Rao bound tells us that the variance of any unbiased estimator of parameters \(\boldsymbol{\theta}\) is bounded by: \[ \text{Var}(\hat{\boldsymbol{\theta}}) \geq G^{-1}(\boldsymbol{\theta}). \]
This means:
- You cannot extract information from data faster than the Fisher information allows
- Small eigenvalues of \(G\) correspond to directions that are hard to learn
- The information topography determined by \(G\) constrains learning dynamics
Embodiment as Necessity, Not Limitation
These constraints mean that embodiment, i.e. physical instantiation with specific constraints, is not a limitation to overcome but a feature of any information-processing system.
The Fisher information \(G(\boldsymbol{\theta})\) defines the information topography, which is determined by:
- The physical substrate (silicon, neurons, quantum systems)
- The available energy budget
- The communication bandwidth
- The thermal environment
Different substrates have different topographies, each with its own bottlenecks and channels. You cannot have intelligence without a substrate, and every substrate brings constraints.
Why Superintelligence Claims Fail
Claims about unbounded superintelligence typically ignore these constraints. They imagine intelligence as something that can be “scaled up” indefinitely, like adding more processors. But:
- Information acquisition is bounded by Fisher information—you can’t extract more information from data than the data contains
- Information storage requires physical space and energy
- Information processing has Landauer costs that scale with computation
- Information erasure is necessarily dissipative
Trying to build unbounded intelligence is like trying to build a perpetual motion machine: it violates fundamental physical principles.
This doesn’t mean AI can’t be powerful or transformative — internal combustion engines transformed the world despite thermodynamic limits. But it does mean there are hard bounds on what’s possible, and claims that ignore these bounds are as unrealistic as promises of perpetual motion.
A Thought on Intelligence
The perpetual motion analogy provides an accessible way to think about claims of unbounded intelligence.
Perpetual Motion and Superintelligence
Imagine a world in 1925 where the automobile is already transforming society, but big promises are being made for things to come. The stock market is soaring, the 1918 pandemic is forgotten. And every major automobile manufacturer is investing heavily in the promise that they will each be the first to produce a car that needs no fuel. A perpetual motion machine.
Well, of course that didn’t happen. But I sometimes wonder if what we’re seeing today 100 years later is the modern equivalent of that. In 2025 billions are being invested in promises of superintelligence and artificial general intelligence that will transform everything.
We know why perpetual motion is impossible: the second law of thermodynamics tells us that entropy always increases. So we can’t have motion without entropy production. No matter how clever the design, you cannot extract energy from nothing, and you cannot create a closed system that does useful work indefinitely without an external energy source.
How might we make an equivalent statement about the bizarre claims around superintelligence? Some inspiration comes from Maxwell’s demon, an “intelligent” entity which operates against the laws of thermodynamics. The demon is inspiring because it suggests that, for the second law to hold, there must be a relationship between the demon’s decisions and thermodynamic entropy.
One of the resolutions comes from Landauer’s principle, the notion that erasure of information requires heat dissipation. This suggests there are fundamental information-theoretic constraints on intelligent systems, just as there are thermodynamic constraints on engines.
I’ve no doubt that AI technologies will transform our world just as much as the automobile has. But I also have no doubt that the promise of superintelligence is just as silly as the promise of perpetual motion. The inaccessible game provides one way of understanding why.
Superintelligence as Perpetual Motion
Claims about imminent superintelligence or artificial general intelligence that will recursively self-improve to unbounded capability bear a striking resemblance to promises of perpetual motion machines. Both violate fundamental physical constraints.
The Thermodynamic Constraint
Perpetual motion machines fail because they violate the second law of thermodynamics. You cannot extract unlimited work from a finite system without an external energy source. Entropy must increase, energy must be conserved, and there are fundamental limits on efficiency set by temperature.
These aren’t engineering challenges to overcome with better designs, they’re fundamental constraints built into the structure of physical law.
Similarly, unbounded intelligence fails because it would require unbounded information processing. The inaccessible game shows that information processing has thermodynamic costs through:
- Landauer’s principle: Information erasure costs \(k_BT\log 2\) per bit
- Marginal entropy conservation: Information cannot be created from nothing
- Fisher information bounds: Information channel capacity is finite
- GENERIC structure: Any real process has dissipative components
The Recursive Self-Improvement Fallacy
The superintelligence narrative often invokes “recursive self-improvement”: an AI that makes itself smarter, which makes it better at making itself smarter, leading to explosive growth. This is supposed to lead to capabilities that far exceed human intelligence.
But this violates conservation of information. To “improve” requires:
- Learning: Extracting information from environment (limited by Fisher information)
- Memory: Storing that information (limited by physical substrate)
- Computation: Processing information (limited by Landauer and thermodynamics)
- Erasure: Clearing memory for new information (dissipates energy)
Each step has information-theoretic costs. You cannot recursively self-improve without limit any more than you can build a perpetual motion machine by clever arrangement of gears.
Embodiment as Thermodynamic Necessity
The inaccessible game reveals why embodiment—physical constraints—is not a limitation to be overcome but a necessary feature of any information-processing system.
The Fisher information matrix \(G(\boldsymbol{\theta})\) defines the information topography. It determines:
- How fast information can flow
- What channels are available
- What bottlenecks exist
- How much energy is needed
This topography is shaped by the physical implementation. A biological brain has different \(G\) than a silicon chip, which has different \(G\) than a quantum computer. Each physical substrate creates its own information landscape with its own constraints.
Promises of “uploading” consciousness or achieving superintelligence by removing physical constraints misunderstand the relationship between information and physics. Information processing is physical. The constraints aren’t bugs; they’re features that make information processing possible at all.
Why the Hype Persists
If the constraints are so fundamental, why do smart people keep claiming superintelligence is just around the corner? There are several reasons:
- Confusing capability with intelligence: Current AI systems can do impressive things, but that doesn’t mean they’re on a path to unbounded capability
- Ignoring thermodynamic costs: Information processing seems “free” compared to mechanical work, but Landauer’s principle shows it has real energy costs
- Mistaking scaling for fundamental progress: Making systems bigger isn’t the same as removing fundamental constraints
- Economic incentives: Billions of dollars flow toward exciting promises
Just as perpetual motion machines attracted investors in the 19th and early 20th centuries, superintelligence claims attract billions today. But the fundamental constraints haven’t changed. The idea behind this work is that information theory provides as firm a bound on intelligence as thermodynamics provides on engines.
Conclusions
We have explored what emerges when we demand internal adjudicability in an information-theoretic dynamical system. Starting from consistency requirements rather than physical assumptions, we derived marginal entropy conservation, maximum entropy production dynamics, a GENERIC-like split into conservative and dissipative parts, energy-entropy equivalence in the thermodynamic limit, and Landauer’s bound on erasure.
This reverses the usual logic. Rather than starting with thermodynamics and deriving information bounds, we start with information-theoretic consistency and derive thermodynamic structure. This suggests Wheeler’s “it from bit” vision may be realisable: physical laws emerging from information-theoretic constraints.
Broader Relevance?
Implications for Theory Construction
The inaccessible game is a mathematical framework—a formal game with precise rules. But the no-barber principle that underlies it may have broader relevance.
If we think of scientific theories as games played against nature, then perhaps ensuring that theoretical rules don’t silently rely on external adjudicators could be an interesting constraint on theory construction.
Consider how often in physics and mathematics we appeal to:
- A pre-existing space-time arena
- An external notion of simultaneity
- A privileged basis or coordinate system
- An observer who “collapses” quantum states
- A distinguished decomposition into system and environment
The no-barber principle asks: what if we couldn’t appeal to these? What structures would have to emerge internally? The inaccessible game is one exploration of this question in the domain of information dynamics.
Not a Grand Claim
We are not claiming that:
- Reality must obey the inaccessible game’s rules
- All theories must satisfy the no-barber principle
- This solves foundational problems in physics
Rather, we’re offering a principled constraint and exploring what follows. If it proves useful for understanding certain phenomena or for constructing internally consistent theories, that would be interesting. If not, the mathematical structure may still illuminate the relationship between information, geometry, and dynamics.
This is in the spirit of MacKay’s approach: make assumptions explicit, explore consequences rigorously, and see what the mathematics reveals. Whether these ideas apply beyond the formal game remains an open question.
David MacKay’s Legacy
David MacKay taught us to ask: “What are the fundamental constraints? What do the numbers actually say?” This work follows that tradition—making assumptions explicit, exploring consequences rigorously, and letting the mathematics reveal structure.
I hope that David would have appreciated the attempt to build foundations carefully, to derive rather than assume, and to use mathematical structure to illuminate real constraints. His legacy continues in work that combines technical rigour with conceptual clarity.
What This Selects
The no-barber principle, combined with information-theoretic considerations, selects rather than assumes:
- Marginal entropy conservation \(\sum_i h_i = C\) (strongest constraint without external structure)
- Von Neumann entropy (invariant, doesn’t require external outcome labeling)
- Maximum entropy production (internal ordering principle)
- Qutrit substrate (\(d_i = 3\) optimizes \(\log d_i / d_i\))
- Countably infinite system (avoids arbitrary finite \(N\))
In (Lawrence-nobarber26?) we argue that these are not ad hoc choices; they emerge from requiring internal consistency. The game’s structure is determined by what can be formulated without external reference.
Open Questions
Many questions remain:
- Can we formalize “axiomatic distinguishability” more rigorously?
- Does the Jacobi identity hold globally, or only for symmetric configurations?
- Can this framework extend to quantum systems beyond the origin?
- What other structures emerge from internal adjudicability?
- Does this constraint illuminate other areas of theory construction?
These point toward future work at the intersection of information theory, geometry, and foundations.
A common worry is Gödel-style: can any sufficiently expressive system be fully self-adjudicating? The no-barber principle is not a claim of completeness. It is a constraint on formulation: don’t quantify over distinctions the system cannot internally represent. If more external structure is needed, the demand is simply that it be made explicit.
Thanks!
For more information on these subjects and more you might want to check the following resources.
- company: Trent AI
- book: The Atomic Human
- twitter: @lawrennd
- podcast: The Talking Machines
- newspaper: Guardian Profile Page
- blog: http://inverseprobability.com
References
Have a look at the wonderful tributes to him on the Cambridge Ultimate website.