
Information, Energy and Intelligence

at Cambridge Philosophical Society - David MacKay Memorial Meeting, Cambridge University Engineering Department on Mar 27, 2026
Neil D. Lawrence, University of Cambridge

Abstract

Our fascination with AI and promises of superintelligence mirror the excitement around perpetual motion machines a century ago. Just as thermodynamics places fundamental limits on engines, information theory places fundamental limits on intelligence. This talk introduces the “inaccessible game,” an information-theoretic dynamical system built from four axioms. The game reveals how GENERIC structure—combining reversible and irreversible dynamics—emerges from information conservation, how energy and entropy become equivalent in the thermodynamic limit, and why Landauer’s principle follows naturally. These results suggest that superintelligence claims violate fundamental information-theoretic constraints, much as perpetual motion violates thermodynamics.

This work is dedicated to the memory of David MacKay, whose approach to cutting through hype with rigorous reasoning inspired this investigation.

Perpetual Motion and Superintelligence


Imagine a world in 1925 where the automobile is already transforming society, but big promises are being made for things to come. The stock market is soaring, the 1918 pandemic is forgotten. And every major automobile manufacturer is investing heavily in the promise that it will be the first to produce a car that needs no fuel. A perpetual motion machine.

Well, of course that didn’t happen. But I sometimes wonder if what we’re seeing today, 100 years later, is the modern equivalent of that. In 2025 billions are being invested in promises of superintelligence and artificial general intelligence that will transform everything.

We know why perpetual motion is impossible: the second law of thermodynamics tells us that entropy always increases. So we can’t have motion without entropy production. No matter how clever the design, you cannot extract energy from nothing, and you cannot create a closed system that does useful work indefinitely without an external energy source.

How might we make an equivalent statement for the bizarre claims around superintelligence? Some inspiration comes from Maxwell’s demon, an “intelligent” entity that appears to operate against the laws of thermodynamics. The demon is inspiring because it suggests that, for the second law to hold, there must be a relationship between the demon’s decisions and thermodynamic entropy.

One of the resolutions comes from Landauer’s principle, the notion that erasure of information requires heat dissipation. This suggests there are fundamental information-theoretic constraints on intelligent systems, just as there are thermodynamic constraints on engines.

I’ve no doubt that AI technologies will transform our world just as much as the automobile has. But I also have no doubt that the promise of superintelligence is just as silly as the promise of perpetual motion. The inaccessible game provides one way of understanding why.

In Memory of David MacKay


This talk is given at a memorial meeting for David MacKay, who made fundamental contributions to our understanding of the relationship between information theory, energy, and practical systems. David’s work on information theory and inference provided elegant bridges between abstract mathematical principles and real-world applications.

David’s book “Information Theory, Inference, and Learning Algorithms” (MacKay, 2003) demonstrated how information-theoretic thinking could illuminate everything from error-correcting codes to neural networks to thermodynamics. His later work on sustainable energy (MacKay, 2008) showed how careful quantitative reasoning about physical constraints could cut through hype and wishful thinking.

The work I present today on the inaccessible game follows in this tradition. It attempts to build rigorous information-theoretic foundations for understanding physical systems and, by extension, the limits on intelligent systems. David would have appreciated both the mathematical structure and the attempt to use it to deflate unrealistic promises about superintelligence.

As David told us ten years ago, he was highly inspired by John Bridle telling him, when he was an undergraduate student, that “everything’s connected”. In his book “Sustainable Energy - Without the Hot Air” David taught us to ask: “What are the fundamental constraints? What do the numbers actually say?” In that spirit, this work asks: what fundamental information-theoretic constraints govern intelligent systems? Can we understand these constraints as rigorously as we understand thermodynamic constraints on engines?

Information, Energy and Fundamental Limits

Information Theory and Thermodynamics


Information theory provides a mathematical framework for quantifying information. Many of information theory’s core concepts parallel those found in thermodynamics. The theory was developed by Claude Shannon, who spoke extensively with MIT’s Norbert Wiener while it was in development (Conway and Siegelman, 2005). Wiener’s own ideas about information were inspired by Willard Gibbs, one of the pioneers of the mathematical understanding of free energy and entropy. Information and energy have thus been deeply connected from the start.

Entropy

Shannon’s entropy measures the uncertainty or unpredictability of information content. This mathematical formulation is inspired by thermodynamic entropy, which describes the dispersal of energy in physical systems. Both concepts quantify the number of possible states and their probabilities.

Figure: Maxwell’s demon thought experiment illustrates the relationship between information and thermodynamics.

In thermodynamics, free energy represents the energy available to do work. A system naturally evolves to minimize its free energy, finding equilibrium between total energy and entropy. Free energy principles are also pervasive in variational methods in machine learning. They emerge from Bayesian approaches to learning and have been heavily promoted by, among others, Karl Friston as a model for the brain.

The relationship between entropy and free energy can be explored through the Legendre transform. This is most easily reviewed if we restrict ourselves to distributions in the exponential family.

Exponential Family

The exponential family has the form \[ \rho(Z) = h(Z) \exp\left(\boldsymbol{\theta}^\top T(Z) - A(\boldsymbol{\theta})\right) \] where \(h(Z)\) is the base measure, \(\boldsymbol{\theta}\) are the natural parameters, \(T(Z)\) are the sufficient statistics and \(A(\boldsymbol{\theta})\) is the log partition function. Its entropy can be computed as \[ S(Z) = A(\boldsymbol{\theta}) - \boldsymbol{\theta}^\top \nabla_\boldsymbol{\theta}A(\boldsymbol{\theta}) - E_{\rho(Z)}\left[\log h(Z)\right], \] where \(E_{\rho(Z)}[\cdot]\) is the expectation under the distribution \(\rho(Z)\).
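This duality between the log partition function and the entropy can be verified numerically. The sketch below is an illustrative example (not from the talk) using the Bernoulli distribution, whose base measure \(h(z) = 1\) makes the \(E[\log h]\) term vanish:

```python
import numpy as np

def bernoulli_entropy_via_duality(p):
    """Entropy of Bernoulli(p) via S = A(theta) - theta * dA/dtheta.

    Natural parameter theta = log(p/(1-p)), sufficient statistic T(z) = z,
    log partition A(theta) = log(1 + exp(theta)), base measure h(z) = 1.
    """
    theta = np.log(p / (1 - p))
    A = np.log1p(np.exp(theta))         # log partition function
    dA = 1.0 / (1.0 + np.exp(-theta))   # gradient of A equals E[T(z)] = p
    return A - theta * dA

def bernoulli_entropy_direct(p):
    """Shannon entropy -sum p log p, in nats."""
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

p = 0.3
assert np.isclose(bernoulli_entropy_via_duality(p), bernoulli_entropy_direct(p))
```

The same check works for any exponential-family member once \(A\) and its gradient are known in closed form.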

Available Energy

Work through Measurement

In machine learning and Bayesian inference, the Markov blanket of a variable is the set of variables that, once conditioned on, render the variable of interest independent of all the others. To introduce this idea into our information system, we first split the system into two parts, the variables, \(X\), and the memory \(M\).

The variables are the portion of the system that is stochastically evolving over time. The memory is a low-entropy partition of the system that will give us knowledge about this evolution.

We can now write the joint entropy of the system in terms of the mutual information between the variables and the memory. \[ S(Z) = S(X,M) = S(X|M) + S(M) = S(X) - I(X;M) + S(M). \] This gives us the first hint at the connection between information and energy.
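The decomposition can be illustrated with a bivariate Gaussian toy model, where every term is available in closed form (an illustrative sketch; the Gaussian form and the correlation value are assumptions for the example):

```python
import numpy as np

def gaussian_joint_entropy(cov):
    """Differential entropy (nats) of a multivariate Gaussian with covariance cov."""
    d = cov.shape[0]
    return 0.5 * np.log((2 * np.pi * np.e) ** d * np.linalg.det(cov))

rho = 0.8  # correlation between the variables X and the memory M
cov = np.array([[1.0, rho], [rho, 1.0]])

S_X = gaussian_joint_entropy(cov[:1, :1])   # marginal entropy of X
S_M = gaussian_joint_entropy(cov[1:, 1:])   # marginal entropy of M
S_XM = gaussian_joint_entropy(cov)          # joint entropy S(X, M)
I_XM = -0.5 * np.log(1 - rho ** 2)          # mutual information I(X; M)

# S(X, M) = S(X) - I(X; M) + S(M)
assert np.isclose(S_XM, S_X - I_XM + S_M)
```

The stronger the correlation between \(X\) and \(M\), the larger the mutual information term, and the more the joint entropy falls below the sum of marginals.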

If \(M\) is viewed as a measurement then the change in entropy of the system before and after measurement is given by \(S(X|M) - S(X)\), which equals \(-I(X;M)\). This implies that measurement increases the amount of available energy we can obtain from the system (Parrondo et al., 2015).

The difference in available energy is given by \[ \Delta A = A(X|M) - A(X) = I(X;M), \] where we note that the resulting system is no longer in thermodynamic equilibrium due to the low entropy of the memory.

Information-Theoretic Limits on Intelligence


Just as the second law of thermodynamics places fundamental limits on mechanical engines, no matter how cleverly designed, information theory places fundamental limits on information engines, no matter how cleverly implemented.

What Intelligent Systems Must Do

Any intelligent system, whether biological or artificial, must perform certain fundamental operations:

  1. Acquire information from its environment (sensing, observation)
  2. Store information about the world (memory)
  3. Process information to make decisions (computation)
  4. Erase information to make room for new data (memory management)
  5. Act on the world using the processed information

Each of these operations has information-theoretic costs that cannot be eliminated by clever engineering.

The Landauer Bound on Computation

Landauer’s principle (Landauer, 1961) establishes that erasing one bit of information requires dissipating at least \(k_BT\log 2\) of energy as heat, where \(k_B\) is Boltzmann’s constant and \(T\) is temperature.

This isn’t an engineering limitation; it’s a fundamental consequence of the second law. To reset a bit to a standard state (say, always 0) requires reducing its entropy from 1 bit to 0 bits. That entropy must go somewhere, and it ends up as heat in the environment.

Modern computers operate billions of times above the Landauer limit due to engineering constraints. But even if we could build computers at the thermodynamic limit, consider a brain-scale computation. Lawrence (2017) reviews estimates suggesting it would require over an exaflop (\(10^{18}\) floating point operations per second) to perform a full simulation of the human brain, based on Ananthanarayanan et al. (2009). Other authors have suggested the figure could be as low as \(10^{15}\) operations per second (Moravec, 1999; Sandberg and Bostrom, 2008).

Taking the most conservative estimate of \(10^{15}\) operations/sec for functionally relevant computation:

- \(\sim 10^{15}\) operations/sec
- Running for one year (\(\sim 3\times 10^7\) seconds)
- At room temperature (300K)

This would require at minimum (assuming one bit erasure per operation): \[ E \sim 10^{15} \times 3\times10^{7} \times 3\times10^{-21} \approx 10^2 \text{ Joules} \]

That seems small, but this is just for erasing bits. It doesn’t include the entropy production that occurs in:

- Acquiring the data
- Moving data around
- The actual computation
- Dissipation in real (non-ideal) systems

The actual human brain consumes about 20W continuously, or \(\sim 6 \times 10^8\) Joules per year—roughly \(10^6\) times the Landauer limit.
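These back-of-envelope numbers can be reproduced in a few lines (a sketch using the estimates quoted above):

```python
import math

k_B = 1.380649e-23                  # Boltzmann constant, J/K
T = 300.0                           # room temperature, K
landauer = k_B * T * math.log(2)    # ~3e-21 J per bit erased

ops_per_sec = 1e15                  # conservative brain-scale estimate
seconds_per_year = 3e7
E_min = ops_per_sec * seconds_per_year * landauer   # minimum erasure energy, ~1e2 J

brain_power = 20.0                  # W, approximate human brain consumption
E_brain = brain_power * seconds_per_year            # ~6e8 J per year

print(f"Landauer minimum: {E_min:.0f} J/year")
print(f"Brain energy:     {E_brain:.1e} J/year")
print(f"Ratio:            {E_brain / E_min:.1e}")
```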

Fisher Information Bounds on Learning

The Fisher information matrix \(G(\boldsymbol{\theta})\) sets fundamental bounds on how quickly a system can learn. The Cramér-Rao bound tells us that the variance of any unbiased estimator of parameters \(\boldsymbol{\theta}\) is bounded by: \[ \text{Var}(\hat{\boldsymbol{\theta}}) \geq G^{-1}(\boldsymbol{\theta}). \]
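As an illustration (not part of the original argument), the bound can be checked for estimating the mean of a Gaussian, where the sample mean is an efficient estimator and attains the Cramér-Rao bound:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, n, trials = 2.0, 50, 20000

# Fisher information for the mean of N(mu, sigma^2) from n i.i.d. samples
G = n / sigma ** 2
cramer_rao = 1.0 / G                 # lower bound on Var(mu_hat)

# Empirical variance of the sample-mean estimator across many trials
samples = rng.normal(loc=1.0, scale=sigma, size=(trials, n))
mu_hat = samples.mean(axis=1)
emp_var = mu_hat.var()

# The sample mean is efficient: its variance attains the bound (up to noise)
assert np.isclose(emp_var, cramer_rao, rtol=0.1)
```

No unbiased estimator can do better; a less efficient estimator would sit strictly above the bound.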

This means:

- You cannot extract information from data faster than the Fisher information allows
- Small eigenvalues of \(G\) correspond to directions that are hard to learn
- The information topography determined by \(G\) constrains learning dynamics

Embodiment as Necessity, Not Limitation

These constraints mean that embodiment, i.e. physical instantiation with specific constraints, is not a limitation to overcome but a feature of any information-processing system.

The Fisher information \(G(\boldsymbol{\theta})\) defines the information topography, which is determined by:

- The physical substrate (silicon, neurons, quantum systems)
- The available energy budget
- The communication bandwidth
- The thermal environment

Different substrates have different topographies, each with its own bottlenecks and channels. You cannot have intelligence without a substrate, and every substrate brings constraints.

Why Superintelligence Claims Fail

Claims about unbounded superintelligence typically ignore these constraints. They imagine intelligence as something that can be “scaled up” indefinitely, like adding more processors. But:

  1. Information acquisition is bounded by Fisher information—you can’t extract more information from data than the data contains
  2. Information storage requires physical space and energy
  3. Information processing has Landauer costs that scale with computation
  4. Information erasure is necessarily dissipative

Trying to build unbounded intelligence is like trying to build a perpetual motion machine: it violates fundamental physical principles.

This doesn’t mean AI can’t be powerful or transformative — internal combustion engines transformed the world despite thermodynamic limits. But it does mean there are hard bounds on what’s possible, and claims that ignore these bounds are as unrealistic as promises of perpetual motion.

The Inaccessible Game

Foundations: The Four Axioms

Baez-Fritz-Leinster Characterization of Information Loss


Before introducing our fourth axiom, we need to understand how information loss is measured. Baez et al. (2011) showed that entropy emerges naturally from category theory as a way of measuring information loss in measure-preserving functions. They derived Shannon entropy from three axioms, without invoking probability directly.

The Three Axioms

Let \(F(f)\) denote the information lost by a process \(f\) that transforms one probability distribution to another. The three axioms constrain the functional form of \(F\).

Axiom 1: Functoriality states that given a process consisting of two stages, the amount of information lost in the whole process is the sum of the amounts lost at each stage: \[ F(f \circ g) = F(f) + F(g), \] where \(\circ\) represents composition.

Axiom 2: Convex Linearity states that if we flip a probability-\(\lambda\) coin to decide whether to do one process or another, the information lost is \(\lambda\) times the information lost by the first process plus \((1-\lambda)\) times the information lost by the second: \[ F(\lambda f \oplus (1-\lambda)g) = \lambda F(f) + (1-\lambda)F(g). \]

Axiom 3: Continuity states that if we change a process slightly, the information lost changes only slightly, i.e., \(F(f)\) is a continuous function of \(f\).

The Main Result

The main result of Baez et al. (2011) is that these three axioms uniquely determine the form of information loss. There exists a constant \(c\geq 0\) such that for any \(f: p \rightarrow q\): \[ F(f) = c(H(p) -H(q)) \] where \(F(f)\) is the information loss in process \(f: p\rightarrow q\) and \(H(\cdot)\) is the Shannon entropy measured before and after the process is applied to the system.
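Additivity of information loss (Axiom 1) can be checked numerically for deterministic coarse-graining maps, where the pushforward distribution is easy to compute (an illustrative sketch; the particular distributions and maps are arbitrary choices):

```python
import numpy as np

def entropy(p):
    """Shannon entropy in nats, ignoring zero-probability outcomes."""
    p = np.asarray(p)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def pushforward(p, f, n_out):
    """Push distribution p through a deterministic map f: index -> index."""
    q = np.zeros(n_out)
    for i, pi in enumerate(p):
        q[f[i]] += pi
    return q

p = np.array([0.1, 0.2, 0.3, 0.4])
f = [0, 0, 1, 2]    # coarse-grain 4 outcomes into 3
g = [0, 0, 1]       # then 3 outcomes into 2

q = pushforward(p, f, 3)
r = pushforward(q, g, 2)

F_f = entropy(p) - entropy(q)    # information lost in the first stage
F_g = entropy(q) - entropy(r)    # information lost in the second stage
F_gf = entropy(p) - entropy(r)   # information lost in the composite

assert np.isclose(F_gf, F_f + F_g)   # Axiom 1: additivity under composition
```

Merging outcomes can only reduce entropy, so each stage's loss is non-negative, consistent with \(c \geq 0\).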

This provides a foundational justification for using entropy as our measure of information. It’s not just a convenient choice; it’s the unique measure that satisfies natural requirements for measuring information loss.

The Fourth Axiom: Information Conservation


Baez et al.’s three axioms tell us how to measure information loss. Now we add a new constraint: what if information is conserved in an isolated system?

In physics, isolated chambers conserve mass and energy. In The Inaccessible Game, we consider what it would mean for information to be conserved. This gives us the fourth axiom.

Statement of the Axiom

Axiom 4: Information Conservation

For any finite subset of \(N\) variables, the sum of marginal entropies \(\{h_i\}_{i=1}^{N}\) equals a constant \(C\): \[ \sum_{i=1}^N h_i = C. \]

This says the total amount of information in the system is conserved. Information can flow between variables, but the total remains constant.

A crucial choice: we conserve the sum of marginal entropies, not the joint entropy \(H(\mathbf{x})\).

Recall the relationship: \[ H(\mathbf{x}) = \sum_{i=1}^N h_i - I(\mathbf{x}) \] where \(I(\mathbf{x})\) is the multi-information (total correlation): \[ I(\mathbf{x}) = \sum_{i=1}^N h_i - H(\mathbf{x}) \geq 0 \]
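A small discrete example (illustrative, not from the talk) verifies this identity and the non-negativity of the multi-information:

```python
import numpy as np

def entropy(p):
    """Shannon entropy in nats of a (possibly multi-dimensional) distribution."""
    p = np.asarray(p).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

# A correlated joint distribution over two binary variables
P = np.array([[0.4, 0.1],
              [0.1, 0.4]])

H_joint = entropy(P)                                    # joint entropy H(x)
h = [entropy(P.sum(axis=1)), entropy(P.sum(axis=0))]    # marginal entropies h_i

I = sum(h) - H_joint    # multi-information (total correlation)
assert I >= 0
assert np.isclose(H_joint + I, sum(h))
```

For independent variables \(I = 0\) and the joint entropy equals the sum of marginals; correlation pushes \(H\) below \(\sum h_i\).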

If we conserve \(\sum h_i = C\), then:

- The multi-information \(I\) can change
- The joint entropy \(H\) can change
- Variables can become correlated or decorrelated
- But the “total information capacity” \(\sum h_i\) stays fixed

Why marginal? The exchangeability principle: we want a constraint that applies to any finite subset of variables. Marginal entropies allow this flexibility. Also, conserving \(\sum h_i\) (rather than joint \(H\)) allows the correlation structure \(I\) to vary, giving the system freedom to organise.

The conservation constraint is imposed in an exchangeable form across the marginal entropies. This means we can consider any finite partition of variables from a total that could be countably infinite.

The exchangeability ensures:

  1. We can focus on any finite subset of variables
  2. The constraint applies equally to all variables
  3. No variable has a “special” role
  4. The system can have infinitely many degrees of freedom

This is inspired by, but differs from, the usual notion of exchangeability in Bayesian non-parametric probability (de Finetti, Aldous, Bernardo & Smith). Here, exchangeability refers to the constraint structure, not the probability distribution itself.

Physical Interpretation

Think of the system as an isolated information chamber:

Marginal entropy \(h_i\): The “information content” or “information capacity” of variable \(i\). High marginal entropy means the variable can take on many states.

Conservation \(\sum h_i = C\): Total capacity is fixed. Like a fixed amount of “space” to store information.

Multi-information \(I\): The “structure” or “correlation” between variables. This can grow or shrink.

As the system evolves:

- Information can flow between variables (changing individual \(h_i\))
- Correlations can form or break (changing \(I\))
- But the total capacity \(\sum h_i = C\) remains constant

Since \(H + I = C\), increasing joint entropy \(H\) (second law) means decreasing multi-information \(I\). The system breaks down correlations to increase entropy. It’s like having a fixed amount of “information capacity” \(C\) that gets redistributed between joint entropy (disorder) and correlation structure.
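This redistribution can be seen in a two-variable Gaussian toy model: with unit marginal variances the marginal entropies, and hence \(\sum h_i\), are fixed, while varying the correlation trades joint entropy against multi-information (an illustrative sketch, not the game's actual dynamics):

```python
import numpy as np

def gaussian_entropy(cov):
    """Differential entropy (nats) of a Gaussian with covariance matrix cov."""
    cov = np.atleast_2d(cov)
    d = cov.shape[0]
    return 0.5 * np.log((2 * np.pi * np.e) ** d * np.linalg.det(cov))

# Two unit-variance Gaussians: each marginal entropy is fixed, so the
# capacity C = sum of marginal entropies is the same for every correlation.
C = 2 * gaussian_entropy(np.array([[1.0]]))

for rho in [0.0, 0.5, 0.9, 0.99]:
    cov = np.array([[1.0, rho], [rho, 1.0]])
    H = gaussian_entropy(cov)            # joint entropy: falls as rho grows
    I = -0.5 * np.log(1 - rho ** 2)      # multi-information: grows with rho
    assert np.isclose(H + I, C)          # H + I = C: fixed capacity redistributed
```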

Now we see why the game is “inaccessible”: an external observer cannot learn anything about the system.

Recall Baez’s result: information gained from observing a process \(f: p \rightarrow q\) is measured by the entropy change: \[ F(f) = H(p) - H(q) \]

But if marginal entropies are conserved (\(\sum h_i = C\)), then an external observer sees:

- Before observation: \(\sum h_i = C\)
- After observation: \(\sum h_i = C\)
- Information gained: \(\Delta\left(\sum h_i\right) = 0\)

The observer learns nothing! The system is informationally isolated. Internal variables can correlate and decorrelate (changing \(I\) and \(H\)), but the total marginal entropy, what an external observer can access through Baez’s measurement framework, remains constant.

This is “inaccessibility”: the system’s internal dynamics are hidden from external observation because there’s no net information flow to the outside. The conservation constraint creates an information barrier.
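As a minimal numerical illustration of the entropy-change measure, here is a short sketch (the distributions are illustrative, not from the lectures):

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy in bits, ignoring zero-probability states."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# A process f: p -> q that concentrates the distribution
p = np.array([0.25, 0.25, 0.25, 0.25])  # before: uniform over four states
q = np.array([0.5, 0.5, 0.0, 0.0])      # after: only two states remain

F = entropy_bits(p) - entropy_bits(q)   # Baez: F(f) = H(p) - H(q)
print(F)  # 1.0 bit of information gained by the observer
```

If instead the process preserved the entropies (as marginal-entropy conservation forces), \(F\) would be zero and the observer would learn nothing.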

Mathematically, this leads to dynamics constrained to a manifold \(\sum h_i = C\), which we’ll formalize with Lagrange multipliers in Lecture 3.

The Four Axioms Together

We now have four axioms for The Inaccessible Game:

  1. Functoriality (Baez): Information loss is additive across compositions
  2. Convex Linearity (Baez): Probabilistic mixing is linear
  3. Continuity (Baez): Small changes have small effects
  4. Information Conservation (New): \(\sum_{i=1}^N h_i = C\)

The first three axioms (Baez) justify measuring information by entropy differences. The fourth axiom constrains the total information. Together, they define an information-theoretic “chamber” where:

  • Information content is measured by entropy (axioms 1-3)
  • Total information is conserved (axiom 4)
  • Dynamics must respect this conservation

From these four axioms, we will derive (in Lecture 3) a constrained dynamical system that evolves to maximize entropy production subject to conservation.

The Inaccessible Game


We call our framework the “inaccessible game” because the system is isolated from external observation: an outside observer cannot extract or inject information, making the game’s internal state inaccessible.

Like other zero-player games, such as Conway’s Game of Life (Gardner, 1970), the system evolves according to internal rules without external interference. But unlike cellular automata where rules are chosen by design, in our game the rules emerge from an information-theoretic principle.

Why “Inaccessible”?

The game is inaccessible because our fourth axiom—information isolation—ensures that an external observer learns nothing. Recall from Baez-Fritz-Leinster that information gained from observing a process \(f: p \rightarrow q\) is measured by entropy change: \[ F(f) = H(p) - H(q). \]

But if marginal entropies are conserved (\(\sum h_i = C\)), then an external observer sees:

  • Before observation: \(\sum h_i = C\)
  • After observation: \(\sum h_i = C\)
  • Information gained: \(\Delta(\sum h_i) = 0\)

The observer learns nothing! The system is informationally isolated from the outside world.

What Makes It a Game?

The “game” has the following characteristics:

Players: None (zero-player game)

State: A probability distribution over variables, parametrized by natural parameters \(\boldsymbol{\theta}\)

Rules: Evolve to maximize entropy production subject to the constraint \(\sum h_i = C\)

Dynamics: Emerge from information geometry (Fisher information) and the constraint structure

The game starts in a maximally correlated state (high multi-information \(I\), low joint entropy \(H\)) and evolves toward states of higher entropy. Since \(I + H = C\), this means the system breaks down correlations (\(I\) decreases) as entropy increases.

Connection to Physical Reality

Why should we care about this abstract game? Because its dynamics exhibit structure that connects to fundamental physics:

  1. GENERIC structure: The dynamics decompose into reversible (conservative) and irreversible (dissipative) components
  2. Energy-entropy equivalence: In the thermodynamic limit, our marginal entropy constraint becomes equivalent to energy conservation
  3. Landauer’s principle: Information erasure necessarily involves the dissipative part of the dynamics

The inaccessible game provides an information-theoretic foundation for thermodynamic principles. It suggests that physical laws might emerge from information-theoretic constraints, rather than information theory being derived from physics.

Information Dynamics

The Conservation Law

The \(I + H = C\) Structure


We have established four axioms, with the fourth axiom stating that the sum of marginal entropies is conserved, \[ \sum_{i=1}^N h_i = C. \] This conservation law is the heart of The Inaccessible Game, but to understand its dynamical implications, we need to rewrite it in a more revealing form.

Multi-Information: Measuring Correlation

The multi-information (or total correlation), introduced by Watanabe (1960), measures how much the variables in a system are correlated. It is defined as, \[ I = \sum_{i=1}^N h_i - H, \] where \(H\) is the joint entropy of the full system: \[ H = -\sum_{\mathbf{x}} p(\mathbf{x}) \log p(\mathbf{x}). \]

The multi-information has a clear interpretation:

  • \(I = 0\): The variables are completely independent. The joint entropy equals the sum of marginal entropies.
  • \(I > 0\): The variables are correlated. Some information is “shared” between variables, so the joint entropy is less than the sum of marginals.
  • \(I\) is maximal: The variables are maximally correlated (in the extreme case, deterministically related).

Multi-information is always non-negative (\(I \geq 0\)) and measures how much knowing one variable tells you about others.
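The definition translates directly into code. The sketch below computes \(I = \sum_i h_i - H\) from a joint probability table for two binary variables (function names are illustrative):

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy in bits of a probability array."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def multi_information(p_joint):
    """I = sum_i h_i - H for a joint table over N discrete variables."""
    H = entropy_bits(p_joint)
    h_sum = sum(
        entropy_bits(p_joint.sum(axis=tuple(a for a in range(p_joint.ndim) if a != i)))
        for i in range(p_joint.ndim)
    )
    return h_sum - H

p_corr = np.array([[0.5, 0.0], [0.0, 0.5]])   # perfectly correlated pair
p_ind = np.full((2, 2), 0.25)                  # independent pair
print(multi_information(p_corr), multi_information(p_ind))  # 1.0 0.0
```

The correlated pair shares one full bit (\(I = 1\)), while the independent pair shares none (\(I = 0\)), matching the interpretation above.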

The Information Action Principle: \(I + H = C\)

Using the definition of multi-information, we can rewrite our conservation law. From \(I = \sum_{i=1}^N h_i - H\), we have: \[ \sum_{i=1}^N h_i = I + H. \] Therefore, the fourth axiom \(\sum_{i=1}^N h_i = C\) becomes: \[ I + H = C. \]

This is an information action principle. It says that multi-information plus joint entropy is conserved. This equation sits behind the dynamics of the Inaccessible Game.

This equation has the structure of an action principle in classical mechanics. In physics, total energy is conserved and splits into two parts, \[ T + V = E, \] where \(T\) is kinetic energy and \(V\) is potential energy.

The analogy for The Inaccessible Game is:

  • Multi-information \(I\) plays the role of potential energy. It represents “stored” correlation structure. High \(I\) means variables are tightly coupled, like a compressed spring.
  • Joint entropy \(H\) plays the role of kinetic energy. It represents “dispersed” or “free” information. High \(H\) means the probability distribution is spread out, with maximal uncertainty.

Just as a classical system evolves from high potential energy to high kinetic energy (a ball rolling down a hill), the idea in the Inaccessible Game will be that the information system evolves from high correlation (high \(I\)) to high entropy (high \(H\)).

The Information Relaxation Principle

The \(I + H = C\) structure suggests a relaxation principle: systems naturally evolve from states of high correlation (high \(I\), low \(H\)) toward states of low correlation (low \(I\), high \(H\)).

Why? Our inspiration is the second law of thermodynamics, which tells us that entropy increases. If we want to introduce dynamics into the game, increasing entropy provides an obvious way to do so. Since \(I + H = C\) is constant, if \(H\) increases, \(I\) must decrease. The system breaks down correlations to increase entropy.

This is analogous to how physical systems relax from non-equilibrium states (low \(T\), high \(V\)) to equilibrium (high \(T\), low \(V\)). A compressed spring releases its stored energy. A hot object in a cold room disperses its energy. In information systems, correlated structure dissipates into entropy.

Consider a simple two-variable system with binary variables \(X_1\) and \(X_2\):

High correlation state (high \(I\), low \(H\)): \[ p(X_1=0, X_2=0) = 0.5, \quad p(X_1=1, X_2=1) = 0.5 \] The variables are perfectly correlated. Marginal entropies: \(h_1 = h_2 = 1\) bit. Joint entropy: \(H = 1\) bit. Multi-information: \(I = 1 + 1 - 1 = 1\) bit.

Low correlation state (low \(I\), high \(H\)): \[ p(X_1, X_2) = 0.25 \text{ for all four combinations} \] The variables are independent. Marginal entropies: \(h_1 = h_2 = 1\) bit. Joint entropy: \(H = 2\) bits. Multi-information: \(I = 1 + 1 - 2 = 0\) bits.

The system relaxes from the first state to the second, conserving \(I + H = 2\) bits throughout. Let’s visualize this relaxation:

import numpy as np

def relaxation_path(alpha):
    """Path from the perfectly correlated state (alpha=0) to the
    independent state (alpha=1); the marginals stay uniform throughout.
    (Defined here so the snippet is self-contained.)"""
    p00 = p11 = 0.5 - 0.25 * alpha
    p01 = p10 = 0.25 * alpha
    return p00, p01, p10, p11

def _entropy_bits(probs):
    probs = np.array([p for p in probs if p > 0])
    return -np.sum(probs * np.log2(probs))

def compute_binary_entropies(p00, p01, p10, p11):
    """Marginal entropies h1, h2, joint entropy H, multi-information I."""
    h1 = _entropy_bits([p00 + p01, p10 + p11])
    h2 = _entropy_bits([p00 + p10, p01 + p11])
    H = _entropy_bits([p00, p01, p10, p11])
    return h1, h2, H, h1 + h2 - H

# Generate relaxation trajectory
n_steps = 100
alphas = np.linspace(0, 1, n_steps)

h1_vals = []
h2_vals = []
H_vals = []
I_vals = []

for alpha in alphas:
    p00, p01, p10, p11 = relaxation_path(alpha)
    h1, h2, H, I = compute_binary_entropies(p00, p01, p10, p11)
    h1_vals.append(h1)
    h2_vals.append(h2)
    H_vals.append(H)
    I_vals.append(I)

h1_vals = np.array(h1_vals)
h2_vals = np.array(h2_vals)
H_vals = np.array(H_vals)
I_vals = np.array(I_vals)
C_vals = I_vals + H_vals  # Should be constant (2 bits here)

Figure: Left: Multi-information \(I\) decreases as joint entropy \(H\) increases, conserving \(I + H = C\). The colored regions show how the conserved quantity splits between correlation (red) and entropy (blue). Right: Marginal entropies remain constant throughout, making the system inaccessible to external observation.

The visualisation shows the trade-off: as the system relaxes, correlation structure (multi-information) is converted into entropy. The total \(I + H = C\) remains constant (black dashed line), but the system evolves from a state dominated by correlation to one dominated by entropy.

The marginal entropies \(h_1\) and \(h_2\) stay constant throughout this evolution. An external observer measuring only marginal entropies would see no change—the system is informationally isolated, hence “inaccessible.”

Connection to Marginal Entropy Conservation

Why does this structure conserve marginal entropies? Recall that from Baez’s axioms, any change in entropy of a subsystem represents “information loss” to an external observer. If the observer learns nothing about the system (information isolation), then \[ \Delta\left(\sum_{i=1}^N h_i\right) = 0. \] The \(I + H = C\) formulation makes the dynamics clear: as the system evolves, correlations are traded for entropy. The marginal entropies remain fixed (so an external observer learns nothing), while internally the system reorganises from a correlated state to an uncorrelated state.

Importantly, information conservation doesn’t mean nothing changes; it means the changes are internal redistributions that leave marginal entropies (and hence external information) unchanged. The system is inaccessible to the outside because its dynamics preserve \(\sum h_i\), which means they preserve \(I + H\).

Why This Matters for Dynamics

The \(I + H = C\) structure immediately tells us:

  1. Direction of evolution: Systems move from high \(I\) to high \(H\) (correlation to entropy).

  2. Constrained dynamics: Not all paths through probability space are allowed. Only those preserving \(I + H = C\) are accessible.

  3. Physical interpretation: The split into \(I\) (correlation/potential) and \(H\) (entropy/kinetic) gives us a sense of what’s happening; later we will parameterise directly through natural parameters \(\boldsymbol{\theta}\) (from Lecture 1).

  4. Variational principle: The action-like structure hints that we can derive dynamics from a variational principle, just as Lagrangian mechanics derives equations of motion from the principle of least action.

In the next section, we’ll see how information relaxation, i.e. the tendency to move from high \(I\) to high \(H\), leads to maximum entropy production dynamics in the natural parameter space.

Information Relaxation

From Information Relaxation to Maximum Entropy Production


We’ve used the conservation law \(I + H = C\) to suggest a relaxation principle: systems evolve from high correlation (high \(I\)) to high entropy (high \(H\)). Now we explore the implications of this principle by deriving the explicit dynamics.

The Direction of Time: Entropy Increases

The second law of thermodynamics tells us that entropy increases over time. In The Inaccessible Game, this means: \[ \dot{H} \geq 0. \]

Since the joint entropy \(H\) must increase (or at least not decrease), and we have the constraint \(I + H = C\), it immediately follows that: \[ \dot{I} \leq 0. \]

The multi-information must decrease, i.e. the system breaks down correlations to increase entropy.

This gives us the direction of evolution, but not yet the rate or the specific form of the dynamics. For that, we need to think about how the system can maximize the rate of entropy production while respecting the conservation constraint.

Maximum Entropy Production Principle

Among all possible dynamics that conserve marginal entropy, which path should we choose? Our answer comes from a principle that has emerged across multiple domains of physics: maximum entropy production (MEP).

The MEP principle states that, subject to constraints, systems evolve along the path of steepest entropy increase. This principle has been observed in:

  • Non-equilibrium thermodynamics (Beretta, 2020; Ziegler and Wehrli, 1987)
  • Fluid dynamics and turbulence
  • Ecology and self-organization
  • Climate dynamics

For The Inaccessible Game, MEP means: of all paths that conserve \(\sum h_i\), the system follows the one that maximises \(\dot{H}\). This choice of dynamics makes sense because it is uniquely determined at all times.

Note that this principle isn’t one of our axioms; it’s an assumption about how the relaxation dynamics should occur.

Natural Parameters and the Entropy Gradient

To make MEP concrete, we need coordinates. Recall from Lecture 1 that exponential families have natural parameters \(\boldsymbol{\theta}\) where the geometry is particularly elegant. In natural parameters, the entropy gradient has a beautiful form.

For an exponential family \(p(\mathbf{x}|\boldsymbol{\theta}) = \exp(\boldsymbol{\theta}^T T(\mathbf{x}) - \mathcal{A}(\boldsymbol{\theta}))\), the joint entropy is: \[ H(\boldsymbol{\theta}) = \mathcal{A}(\boldsymbol{\theta}) - \boldsymbol{\theta}^T \nabla \mathcal{A}(\boldsymbol{\theta}). \]

Taking the gradient with respect to \(\boldsymbol{\theta}\): \[ \nabla_{\boldsymbol{\theta}} H = -\nabla^2 \mathcal{A}(\boldsymbol{\theta})\,\boldsymbol{\theta} = -G(\boldsymbol{\theta})\boldsymbol{\theta}, \] where \(G(\boldsymbol{\theta}) = \nabla^2 \mathcal{A}(\boldsymbol{\theta})\) is the Fisher information matrix.

This gradient points in the direction of steepest entropy increase.
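This identity is easy to sanity-check numerically for the simplest exponential family, a Bernoulli variable with natural parameter \(\theta\) and \(\mathcal{A}(\theta) = \log(1 + e^\theta)\). The one-dimensional sketch below compares a finite-difference gradient of \(H\) with \(-G\theta\) (the setup is illustrative, not part of the lecture's derivation):

```python
import numpy as np

def A(theta):
    """Log-partition function of the Bernoulli family."""
    return np.log1p(np.exp(theta))

def H(theta):
    """Entropy: H = A(theta) - theta * A'(theta), with A'(theta) = sigmoid(theta)."""
    mu = 1.0 / (1.0 + np.exp(-theta))
    return A(theta) - theta * mu

theta = 0.7
mu = 1.0 / (1.0 + np.exp(-theta))
G = mu * (1.0 - mu)          # Fisher information A''(theta)

eps = 1e-6
grad_numeric = (H(theta + eps) - H(theta - eps)) / (2 * eps)
grad_formula = -G * theta    # the claimed identity: dH/dtheta = -G(theta) * theta
print(grad_numeric, grad_formula)  # the two values agree
```

For \(\theta > 0\) the gradient is negative, so entropy ascent pushes \(\theta\) toward zero, the maximum entropy point of the family.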

The MEP Dynamics

Maximum entropy production, in its simplest form, says: move in the direction of the entropy gradient. In natural parameters, this gives: \[ \dot{\boldsymbol{\theta}} = \nabla_{\boldsymbol{\theta}} H = -G(\boldsymbol{\theta})\boldsymbol{\theta}. \]

This is gradient ascent on entropy. But we must be careful: this dynamics must also preserve the constraint \(\sum h_i = C\).

For the moment, we will focus on systems where this simple gradient flow preserves marginal entropies. This requires that the Fisher information matrix \(G\) and the conservation constraint have compatible structure, both arising from the same geometric structure of the probability manifold.

Later (especially Lecture 4) we’ll see how to handle more general constraints through Lagrangian methods, where we explicitly enforce the conservation through multipliers. For now, the key insight is: MEP naturally leads to gradient flow in entropy.

Why This Is the Unique Dynamics

The MEP dynamics \(\dot{\boldsymbol{\theta}} = -G(\boldsymbol{\theta})\boldsymbol{\theta}\) has several interesting properties.

  1. Steepest ascent: This follows the Euclidean gradient \(\nabla_{\boldsymbol{\theta}} H\) in parameter space. The Fisher information \(G\) appears because of the structure of exponential families (\(\nabla H = -G\boldsymbol{\theta}\)), not because we’re using it as a Riemannian metric. Note: the natural gradient would be \(G^{-1}\nabla H\); we are not following natural gradients here.

  2. Maximizes entropy production: Among all dynamics preserving the constraint, this produces entropy fastest.

  3. Conserves marginal entropies: For systems with the exchangeable structure we’ve assumed, this flow keeps \(\sum h_i\) constant.

  4. Connects to thermodynamics: This is exactly the “steepest entropy ascent” dynamics explored e.g. in non-equilibrium thermodynamics by Beretta (Beretta, 2020).

The information relaxation principle—that systems evolve from correlation to entropy—combined with MEP, uniquely determines these dynamics.

The Information Relaxation Picture

Let’s put it all together. The Inaccessible Game dynamics can be understood as:

Initial state: High correlation, low entropy

  • Multi-information \(I\) is large (variables tightly coupled)
  • Joint entropy \(H\) is small (distribution is concentrated)

Evolution: Maximum entropy production

  • System follows gradient flow \(\dot{\boldsymbol{\theta}} = -G(\boldsymbol{\theta})\boldsymbol{\theta}\)
  • Entropy \(H\) increases at maximum rate
  • Multi-information \(I\) decreases correspondingly
  • Marginal entropies \(h_i\) remain constant (external observer learns nothing)

Final state: Low correlation, high entropy

  • Multi-information \(I\) is small (variables nearly independent)
  • Joint entropy \(H\) is large (distribution is spread out)
  • Equilibrium: \(\dot{H} = 0\), maximum entropy subject to constraints

This is information relaxation: the system “relaxes” from a tense, correlated state to a loose, uncorrelated state, just as a stretched elastic relaxes to its natural length.

Connection to Physical Intuition

The information relaxation picture has a direct physical analogy. Consider a room full of gas molecules:

Initial state: All molecules are in one corner (high correlation - if you know where one molecule is, you have a good guess about others; low entropy - the distribution over positions is concentrated).

Evolution: Molecules diffuse according to the laws of thermodynamics, spreading out to fill the room. This is maximum entropy production—the fastest route to equilibrium.

Final state: Molecules are uniformly distributed throughout the room (low correlation - knowing where one molecule is tells you nothing about others; high entropy - maximum uncertainty about positions).

In both cases, we have:

  • A conservation law (total energy for gas, marginal entropy for information)
  • A relaxation from structured to unstructured states
  • Maximum entropy production as the governing principle

The Inaccessible Game applies this same physics to abstract probability distributions.

Preview: Constrained Gradient Flow

The simple gradient flow \(\dot{\boldsymbol{\theta}} = -G(\boldsymbol{\theta})\boldsymbol{\theta}\) is the starting point, but real systems often have additional constraints beyond marginal entropy conservation. In Lecture 4, we’ll see how to incorporate these constraints using Lagrangian methods, leading to: \[ \dot{\boldsymbol{\theta}} = -G(\boldsymbol{\theta})(\boldsymbol{\theta} + \lambda \nabla C), \] where \(\lambda\) is a Lagrange multiplier and \(C\) represents the constraint.

The key insight remains: information relaxation through maximum entropy production is the fundamental principle. Constraints modify the path, but not the underlying logic.

Constrained Maximum Entropy Production


We’ve established that \(I + H = C\) must be conserved, and that the system naturally evolves toward higher entropy. But how does it evolve? What determines the specific path through information space?

Our answer comes from an information relaxation principle: of all paths that conserve \(\sum_i h_i = C\), the system follows the one that maximizes entropy production \(\dot{H}\).

Unconstrained vs Constrained Dynamics

Without the constraint, maximum entropy production would simply be gradient ascent on \(H\). For exponential families, the entropy gradient is \[ \nabla H = -G(\boldsymbol{\theta})\boldsymbol{\theta} \] where \(G = \nabla^2 \mathcal{A}\) is the Fisher information. Pure MEP would give: \[ \dot{\boldsymbol{\theta}} = \nabla H = -G(\boldsymbol{\theta})\boldsymbol{\theta}. \]

This is gradient ascent on entropy in the natural parameter coordinates (not the natural-gradient flow, which would use \(G^{-1}\)). The system would flow toward the maximum entropy state at \(\boldsymbol{\theta} = \mathbf{0}\).

But we must respect \(\sum_i h_i = C\) at all times. This constraint defines a manifold in parameter space. We need to project the MEP flow onto the tangent space of this constraint manifold.

Using a Lagrangian formulation: \[ \mathscr{L}(\boldsymbol{\theta}, \nu) = -H(\boldsymbol{\theta}) + \nu\left(\sum_{i=1}^N h_i - C\right) \] where \(\nu\) is a Lagrange multiplier (note we use \(-H\) since Lagrangians are minimized by convention).

The Constrained Dynamics

The constrained dynamics become: \[ \dot{\boldsymbol{\theta}} = -G(\boldsymbol{\theta})\boldsymbol{\theta} + \nu(\tau) \mathbf{a}(\boldsymbol{\theta}) \] where \(\mathbf{a}(\boldsymbol{\theta}) = \sum_i \nabla h_i(\boldsymbol{\theta})\) is the constraint gradient and \(\tau\) is “game time” parametrising the trajectory.

The Lagrange multiplier \(\nu(\tau)\) is time-dependent: it varies along the trajectory to maintain the constraint. Since the constraint must be satisfied at all times, \[ \frac{\text{d}}{\text{d}\tau}\left(\sum_i h_i\right) = 0, \] the chain rule gives: \[ \mathbf{a}(\boldsymbol{\theta})^\top \dot{\boldsymbol{\theta}} = 0. \]

Solving for the Lagrange Multiplier

Substituting the dynamics into the constraint maintenance condition: \[ \mathbf{a}^\top\left(-G\boldsymbol{\theta} + \nu \mathbf{a}\right) = 0 \] we can solve for \(\nu\): \[ \nu(\tau) = \frac{\mathbf{a}^\top G\boldsymbol{\theta}}{\|\mathbf{a}\|^2}. \]

This allows us to write the dynamics in projection form: \[ \dot{\boldsymbol{\theta}} = -\Pi_\parallel G\boldsymbol{\theta} \] where \[ \Pi_\parallel = \mathbf{I} - \frac{\mathbf{a}\mathbf{a}^\top}{\|\mathbf{a}\|^2} \] is the projection matrix onto the constraint tangent space.
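These two forms, the Lagrange-multiplier form and the projection form, are easy to verify numerically. The sketch below uses a random symmetric positive definite stand-in for \(G\) and a random constraint gradient \(\mathbf{a}\) (toy values, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
B = rng.standard_normal((n, n))
G = B @ B.T + n * np.eye(n)          # stand-in Fisher information (SPD)
theta = rng.standard_normal(n)
a = rng.standard_normal(n)           # stand-in for the constraint gradient sum_i grad h_i

# Lagrange-multiplier form: nu chosen so the flow stays on the constraint surface
nu = (a @ G @ theta) / (a @ a)
theta_dot = -G @ theta + nu * a

# Equivalent projection form
Pi = np.eye(n) - np.outer(a, a) / (a @ a)
theta_dot_proj = -Pi @ (G @ theta)

print(np.allclose(theta_dot, theta_dot_proj))   # True: the two forms agree
print(abs(a @ theta_dot) < 1e-10)               # True: a^T theta_dot = 0 is maintained
```

The check \(\mathbf{a}^\top \dot{\boldsymbol{\theta}} = 0\) confirms that the projected flow never leaves the constraint tangent space.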

Physical Interpretation

This has a nice physical interpretation.

  • \(-G\boldsymbol{\theta}\): The “natural” direction to maximise entropy
  • \(\nu \mathbf{a}\): The “constraint force” that keeps the system on the manifold
  • \(\nu(\tau)\): Measures how much the natural flow “wants” to leave the constraint surface

When \(\nu \approx 0\), the constraint gradient is nearly orthogonal to the flow, i.e. the system is naturally moving along the constraint surface. When \(|\nu|\) is large, the natural flow is trying to leave the surface and significant constraint force is needed to keep it on track.

As we’ll see next, this tension between the information geometry and the constraint structure is what generates the GENERIC-like decomposition into dissipative and conservative parts.

Emergence of Physical Structure

GENERIC: Reversible and Irreversible Dynamics

The GENERIC Framework


We’ve seen something emerge from lectures 5-7:

  • Lecture 5: Hamiltonian/Poisson structure describes energy-conserving dynamics (antisymmetric operators)
  • Lecture 6: Linearisation around equilibrium reveals structure of dynamics
  • Lecture 7: Any dynamics matrix decomposes uniquely as \(M = S + A\) where \(S\) is symmetric (dissipative) and \(A\) is antisymmetric (conservative)

This emergence comes from the geometry of constrained information dynamics. The structure is not unique to information theory. It appears throughout physics whenever systems combine reversible and irreversible processes.
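The symmetric/antisymmetric decomposition referred to above is two lines of linear algebra; a minimal sketch:

```python
import numpy as np

def decompose(M):
    """Unique split M = S + A: S symmetric (dissipative part),
    A antisymmetric (conservative part)."""
    S = 0.5 * (M + M.T)
    A = 0.5 * (M - M.T)
    return S, A

M = np.array([[1.0, 2.0],
              [0.0, 3.0]])
S, A = decompose(M)
print(np.allclose(M, S + A))   # True: the parts sum back to M
print(np.allclose(S, S.T))     # True: symmetric part
print(np.allclose(A, -A.T))    # True: antisymmetric part
```

Uniqueness follows because any matrix equal to both its transpose and minus its transpose is zero.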

Historical Context: Non-Equilibrium Thermodynamics

In the 1980s-90s, researchers in non-equilibrium thermodynamics faced a challenge: How do you describe systems that are simultaneously:

  • Reversible (like mechanical systems, governed by Hamilton’s equations)
  • Irreversible (like thermodynamic systems, governed by entropy increase)

Examples of such systems:

  • Fluid dynamics: Conservation of momentum (reversible) + viscosity (irreversible)
  • Chemical reactions: Reaction kinetics (reversible) + diffusion (irreversible)
  • Complex materials: Elastic deformation (reversible) + plastic flow (irreversible)

Two researchers, Miroslav Grmela and Hans Christian Öttinger, independently developed a unified framework in the 1990s that they called GENERIC: General Equation for Non-Equilibrium Reversible-Irreversible Coupling (Grmela and Öttinger, 1997; Öttinger, 2005; Öttinger and Grmela, 1997).

What Problem Does GENERIC Solve?

Classical mechanics and classical thermodynamics seemed to describe fundamentally different worlds:

Classical Mechanics (Hamiltonian)

  • Time reversible: \(t \to -t\) gives valid dynamics
  • Energy conserved: \(\frac{\text{d}E}{\text{d}t} = 0\)
  • Phase space volume conserved (Liouville’s theorem)
  • Described by antisymmetric operators (Poisson brackets)

Classical Thermodynamics

  • Time irreversible: entropy always increases
  • Energy dissipated, entropy produced: \(\frac{\text{d}S}{\text{d}t} \geq 0\)
  • Systems evolve toward equilibrium
  • Described by symmetric operators (friction, diffusion)

The Problem: Real systems do BOTH simultaneously! A damped pendulum follows Hamiltonian mechanics in its swing (reversible structure) while losing energy to heat through friction (irreversible). How do you write down equations that respect both structures at once?

The GENERIC Answer: Coexistence Requires Structure

GENERIC provides the answer: reversible and irreversible processes can coexist in the same system, but only if they satisfy certain compatibility conditions. These conditions ensure

  1. Energy and entropy are consistently defined
  2. The second law of thermodynamics holds (\(\dot{S} \geq 0\))
  3. Conserved quantities (like energy, momentum) remain conserved
  4. Casimir functions (constraints) are respected

Importantly, you can’t just add reversible and irreversible parts arbitrarily; they must be coupled through degeneracy conditions that enforce thermodynamic consistency.

In typical GENERIC applications, ensuring the degeneracy conditions are satisfied is challenging: you must carefully engineer both operators to be compatible. But in our case (Lectures 1-7), the degeneracy conditions emerge automatically from the constraint geometry. We didn’t impose them; they pop out as a consequence of starting from information-theoretic axioms. When we check in Lecture 8, they’re already satisfied. This strongly suggests our axioms capture something fundamental about thermodynamic consistency.

Why “GENERIC” Matters for Information Dynamics

You might wonder: “Why do we care about a framework from non-equilibrium thermodynamics in a course on information dynamics?”

The structure that emerged from pure information theory (lectures 1-7) is identical to the structure that physicists discovered was necessary for consistent non-equilibrium thermodynamics.

This suggests something deep:

  • Information dynamics is thermodynamics (we’ve known this since Shannon and Jaynes)
  • But more: information dynamics is also a dynamical system with conserved quantities
  • The GENERIC structure is the inevitable consequence of combining these two aspects

When we derived \(\dot{\boldsymbol{\theta}} = -G\boldsymbol{\theta} + \nu \mathbf{a}\) from information-theoretic principles, we were actually deriving a GENERIC system! The Fisher information \(G\) plays the role of the “friction” operator, and the constraint dynamics provide the “Poisson” structure.

Preview: Structure of the GENERIC Equation

In the next sections, we’ll see the full GENERIC equation: \[ \dot{x} = L(x) \nabla E(x) + M(x) \nabla S(x) \] where:

  • \(L(x)\): Poisson operator (antisymmetric, describes reversible dynamics)
  • \(M(x)\): Friction operator (symmetric positive semi-definite, describes irreversible dynamics)
  • \(E(x)\): Energy functional (conserved by \(L\) dynamics)
  • \(S(x)\): Entropy functional (increased by \(M\) dynamics)

And we’ll see how our information dynamics fit perfectly into this form, with:

  • Fisher information matrix \(G\) playing the role of \(M\)
  • Constraint structure providing the Poisson operator \(L\)
  • Marginal entropy conservation giving us Casimir functions

The structure we built from axioms (lectures 1-4) is GENERIC. The decomposition we derived from linearization (lectures 6-7) reveals its form.
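To make the GENERIC form concrete, here is a standard toy instance: a damped harmonic oscillator with state \(x = (q, p, s)\), energy \(E = (q^2 + p^2)/2 + s\) and entropy \(S = s\). The operators below are a sketch constructed to satisfy the degeneracy conditions; they are illustrative, not taken from the lectures:

```python
import numpy as np

gamma = 0.1   # friction rate (illustrative)

L = np.array([[0.0, 1.0, 0.0],     # Poisson operator: antisymmetric, L @ grad_S = 0
              [-1.0, 0.0, 0.0],
              [0.0, 0.0, 0.0]])

def M_op(x):
    """Friction operator: symmetric, PSD, and M @ grad_E(x) = 0 by construction."""
    q, p, s = x
    return gamma * np.array([[0.0, 0.0, 0.0],
                             [0.0, 1.0, -p],
                             [0.0, -p, p * p]])

def grad_E(x):
    q, p, s = x
    return np.array([q, p, 1.0])

grad_S = np.array([0.0, 0.0, 1.0])

def energy(x):
    return 0.5 * (x[0]**2 + x[1]**2) + x[2]

x = np.array([1.0, 0.0, 0.0])
E0, dt = energy(x), 1e-3
for _ in range(20000):
    x = x + dt * (L @ grad_E(x) + M_op(x) @ grad_S)   # Euler step of the GENERIC flow

print(abs(energy(x) - E0) < 0.02)  # first law: E conserved up to Euler error
print(x[2] > 0)                    # second law: the entropy variable s has increased
```

The flow reproduces \(\dot{q} = p\), \(\dot{p} = -q - \gamma p\), \(\dot{s} = \gamma p^2\): mechanical energy drains into the heat variable \(s\) while the total \(E\) stays (approximately) constant and \(S = s\) only grows.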

Historical note: The original GENERIC papers (Grmela and Öttinger, 1997; Öttinger and Grmela, 1997) emerged from studies of complex fluids and polymers. It has since been applied to

  • Viscoelastic materials
  • Multiphase flows
  • Chemical reaction networks
  • Biological systems
  • Plasma physics

The framework is now recognized as a fundamental structure in non-equilibrium statistical mechanics. Our contribution is showing it emerges naturally from information-theoretic first principles.

The GENERIC Equation


The GENERIC framework describes the evolution of a state \(x\) (which could be a density matrix, a distribution, a field configuration, etc.) through the equation: \[ \dot{x} = L(x) \nabla E(x) + M(x) \nabla S(x), \] where

  • \(x\): State of the system (lives in some state space)
  • \(E(x)\): Energy functional
  • \(S(x)\): Entropy functional
  • \(\nabla E\), \(\nabla S\): Functional derivatives (gradients in function space)
  • \(L(x)\): Poisson operator (describes reversible dynamics)
  • \(M(x)\): Friction operator (describes irreversible dynamics)

This equation encodes deep structure. Let’s unpack each component.

The Poisson Operator \(L(x)\)

The Poisson operator \(L(x)\) describes the reversible (energy-conserving, time-reversible) part of the dynamics. It must

1. Be Antisymmetric: For any functionals \(F\) and \(G\), \[ \langle \nabla F, L \nabla G \rangle = -\langle \nabla G, L \nabla F \rangle \] where \(\langle \cdot, \cdot \rangle\) denotes an appropriate inner product. This ensures time reversibility: if you reverse time (\(t \to -t\)) and flip velocities, you get valid dynamics.

2. Satisfy Jacobi identity: The Poisson bracket defined by \(L\) must satisfy \[ \{F, \{G, H\}\} + \{G, \{H, F\}\} + \{H, \{F, G\}\} = 0 \] where \(\{F,G\} := \langle \nabla F, L \nabla G \rangle\). This is the condition for \(L\) to generate a Lie algebra structure (recall Lecture 5!).

3. Conserve energy: The energy must be a Casimir of \(L\): \[ L(x) \nabla E(x) = 0 \quad \Rightarrow \quad \frac{\text{d}E}{\text{d}t}\bigg|_{L \text{ only}} = 0 \] Actually, for GENERIC we require something weaker: \(\langle \nabla E, L \nabla E \rangle = 0\), which follows from antisymmetry.

This connects directly to Lecture 5: \(L\) defines a Poisson structure, and Hamiltonian flow with Hamiltonian \(E\) is given by \(\dot{x} = L \nabla E\).

The Friction Operator \(M(x)\)

The friction operator \(M(x)\) describes the irreversible (entropy-increasing, dissipative) part of the dynamics. It must

1. Be Symmetric: For any functionals \(F\) and \(G\), \[ \langle \nabla F, M \nabla G \rangle = \langle \nabla G, M \nabla F \rangle. \] This is the Onsager reciprocity relation from irreversible thermodynamics.

2. Be Positive semi-definite: For any functional \(F\), \[ \langle \nabla F, M \nabla F \rangle \geq 0. \] This ensures dissipation: entropy can only increase (or stay constant), never decrease.

3. Conserve energy: The friction must not change the energy \[ \langle \nabla E, M \nabla S \rangle = 0. \] This is the first degeneracy condition. Dissipation redistributes energy but doesn’t create or destroy it.

The positive semi-definite property ensures the second law: \[ \frac{\text{d}S}{\text{d}t}\bigg|_{M \text{ only}} = \langle \nabla S, M \nabla S \rangle \geq 0 \]

This connects to Lecture 7: The symmetric part \(S\) of our decomposition \(M = S + A\) had exactly these properties!

The Degeneracy Conditions

For the GENERIC equation to be thermodynamically consistent, \(L\) and \(M\) must satisfy two degeneracy conditions that couple the reversible and irreversible parts:

Degeneracy Condition 1 (Energy conservation by friction): \[ M(x) \nabla E(x) = 0 \] Physically: Dissipative processes cannot change the total energy, only redistribute it. This is more general than what we stated above—the friction operator must annihilate the energy gradient entirely, not just be orthogonal to it.

Degeneracy Condition 2 (Entropy conservation by Poisson dynamics): \[ L(x) \nabla S(x) = 0 \] Physically: Reversible (Hamiltonian) processes cannot change entropy. All entropy change must come from irreversible processes.

These conditions are non-trivial constraints on the operators \(L\) and \(M\). They ensure: - The first law (energy conservation): \(\frac{\text{d}E}{\text{d}t} = \langle \nabla E, L \nabla E + M \nabla S \rangle = 0\) - The second law (entropy increase): \(\frac{\text{d}S}{\text{d}t} = \langle \nabla S, L \nabla E + M \nabla S \rangle = \langle \nabla S, M \nabla S \rangle \geq 0\)

Without these conditions, energy could be created or destroyed, or entropy could decrease; either would violate thermodynamics.
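
Both laws can be checked numerically. The sketch below is illustrative rather than taken from the lectures: it builds a random antisymmetric \(L\) and symmetric positive semi-definite \(M\), enforces the two degeneracy conditions by projecting out the forbidden directions, and verifies \(\dot{E} = 0\) and \(\dot{S} \geq 0\) at a point with arbitrary gradients.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6

# Arbitrary gradients standing in for nabla E and nabla S at one point
grad_E = rng.normal(size=n)
grad_S = rng.normal(size=n)

# Antisymmetric L and symmetric positive semi-definite M
B = rng.normal(size=(n, n))
L = B - B.T               # L^T = -L
C = rng.normal(size=(n, n))
M = C.T @ C               # symmetric, PSD

def proj_out(v):
    """Projector onto the subspace orthogonal to v."""
    v = v / np.linalg.norm(v)
    return np.eye(len(v)) - np.outer(v, v)

# Enforce the degeneracies: M annihilates grad_E, L annihilates grad_S.
# Sandwiching with a symmetric projector preserves (anti)symmetry and PSD-ness.
M = proj_out(grad_E) @ M @ proj_out(grad_E)
L = proj_out(grad_S) @ L @ proj_out(grad_S)

xdot = L @ grad_E + M @ grad_S   # the GENERIC flow at this point

dE_dt = grad_E @ xdot            # first law: should vanish
dS_dt = grad_S @ xdot            # second law: should be non-negative
print(dE_dt, dS_dt)
```

The projection trick is one way of building operators that satisfy the degeneracies; later sections show they can also emerge automatically from constraint structure.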

Casimir Functions and Constraints

Beyond energy and entropy, GENERIC systems often have additional conserved quantities called Casimir functions \(C_i(x)\). These satisfy: \[ L(x) \nabla C_i(x) = 0 \quad \text{and} \quad M(x) \nabla C_i(x) = 0 \]

Casimirs are “super-conserved”: they are annihilated by both the reversible and irreversible parts of the dynamics. Physically, Casimirs represent fundamental constraints that cannot be changed by any process in the system.

Examples: - Mechanics: Total momentum (in absence of external forces) - Fluids: Circulation in ideal fluids - Electromagnetism: Total charge - Information dynamics: \(\sum h_i = C\) (marginal entropy conservation!)

Casimirs often arise from symmetries (Noether’s theorem, Lecture 5) or from fundamental conservation laws. They stratify the state space into symplectic leaves, invariant manifolds on which the dynamics are confined.

For information dynamics, the conservation of marginal entropy sum \(\sum h_i = C\) is our primary Casimir. The dynamics must respect this constraint at all times.

Why This Structure?

You might wonder why GENERIC has exactly this form. Why two operators? Why these specific conditions?

The answer is that this is the most general structure that allows reversible and irreversible processes to coexist while respecting: 1. Time-reversal symmetry for the reversible part 2. The second law for the irreversible part 3. Energy conservation overall 4. Additional conservation laws (Casimirs)

Grmela and Öttinger proved that any system satisfying these physical requirements must have GENERIC form. It is not a modelling choice; it is a consequence of thermodynamic consistency.

In TIG this structure emerged from our information dynamics (Lectures 1-7) without being imposed. The axioms (Lecture 2), maximum entropy dynamics (Lecture 3), and constraints (Lecture 4) together produce the GENERIC structure. This suggests GENERIC is not just a modelling framework; it is a principle that information-isolated systems must obey.

A Worked Example: Damped Harmonic Oscillator

Let’s see GENERIC in action with a simple example: a harmonic oscillator with friction.

State: \(x = (q, p)\) (position and momentum)

Energy: \(E(q,p) = \frac{p^2}{2m} + \frac{1}{2}kq^2\) (kinetic + potential)

Entropy: For this simple example, we’ll use \(S = -\beta E\) where \(\beta = 1/(k_BT)\) is inverse temperature (this connects to Gibbs distribution).

Poisson operator: Standard symplectic structure from Lecture 5, \[ L = \begin{pmatrix} 0 & 1 \\ -1 & 0 \end{pmatrix}, \quad \{f,g\} = \frac{\partial f}{\partial q}\frac{\partial g}{\partial p} - \frac{\partial f}{\partial p}\frac{\partial g}{\partial q} \]

Friction operator: Simple isotropic damping, \[ M = \begin{pmatrix} 0 & 0 \\ 0 & \gamma \end{pmatrix} \] where \(\gamma > 0\) is the friction coefficient (only momentum experiences friction, not position).

The dynamics: Compute the gradients, \[ \nabla E = \begin{pmatrix} kq \\ p/m \end{pmatrix}, \quad \nabla S = -\beta \nabla E \]

Then GENERIC gives: \[ \begin{pmatrix} \dot{q} \\ \dot{p} \end{pmatrix} = L \nabla E + M \nabla S = \begin{pmatrix} 0 & 1 \\ -1 & 0 \end{pmatrix}\begin{pmatrix} kq \\ p/m \end{pmatrix} + \begin{pmatrix} 0 & 0 \\ 0 & \gamma \end{pmatrix}\begin{pmatrix} -\beta kq \\ -\beta p/m \end{pmatrix} \]

This yields: \[ \dot{q} = \frac{p}{m}, \quad \dot{p} = -kq - \gamma\beta\frac{p}{m} \]

The first equation is just velocity = momentum/mass. The second is Newton’s law with friction: \(m\ddot{q} = -kq - \gamma\beta \dot{q}\), which is exactly the damped harmonic oscillator!
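A quick numerical check (with illustrative parameter values) confirms that these equations give damped oscillations whose energy decays monotonically, since \(\dot{E} = -\gamma\beta (p/m)^2 \leq 0\):

```python
import numpy as np
from scipy.integrate import odeint

m, k, gamma, beta = 1.0, 1.0, 0.5, 1.0   # illustrative values

def damped_oscillator(state, t):
    q, p = state
    return [p / m, -k * q - gamma * beta * p / m]

t = np.linspace(0, 30, 600)
traj = odeint(damped_oscillator, [1.0, 0.0], t)
q, p = traj[:, 0], traj[:, 1]

# Energy E = p^2/2m + kq^2/2 decays toward zero, never increasing
E = p**2 / (2 * m) + 0.5 * k * q**2
print(E[0], E[-1])
```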

Check degeneracy: - \(M \nabla E = (0, \gamma p/m)\) is NOT zero! - Wait, this violates degeneracy condition 1!

The resolution is that the damped oscillator on its own is not a closed GENERIC system: the energy lost to friction flows into an implicit heat bath whose variables we have not included in the state. Enlarging the state to include the bath restores the degeneracy condition exactly. More generally, for complex systems we must ensure degeneracy by construction. We’ll see how information dynamics naturally satisfies these conditions in the next section.

Summary: The GENERIC equation \(\dot{x} = L \nabla E + M \nabla S\) encodes: - Structure: Antisymmetric \(L\) + symmetric positive semi-definite \(M\) - Physics: Reversible dynamics (conserves energy, preserves entropy) + irreversible dynamics (conserves energy, increases entropy) - Consistency: Degeneracy conditions couple \(L\) and \(M\) to ensure thermodynamic laws - Generality: Covers everything from mechanics to thermodynamics to complex systems

Next, we’ll see how our information dynamics fit this framework perfectly.

Automatic Degeneracy Conditions


A remarkable feature of the inaccessible game is that the GENERIC degeneracy conditions, which are typically difficult to impose and verify, emerge automatically from the constraint structure.

The GENERIC Degeneracy Requirements

Recall that standard GENERIC requires two degeneracy conditions for thermodynamic consistency:

Degeneracy 1: The entropy should be conserved by the reversible dynamics: \[ A(\boldsymbol{\theta})\nabla H(\boldsymbol{\theta}) = 0 \]

Degeneracy 2: The energy should be conserved by the irreversible dynamics: \[ S(\boldsymbol{\theta})\nabla E(\boldsymbol{\theta}) = 0 \]

In most GENERIC applications, constructing operators \(A\) and \(S\) that satisfy both conditions requires significant effort. You must carefully design the operators to ensure the degeneracies hold at every point in state space.

First Degeneracy: Automatic from Tangency

In our framework, the first degeneracy \(A\nabla H = 0\) holds automatically at every point on the constraint manifold. It comes from the constraint maintenance requirement: \[ \mathbf{a}^\top \dot{\boldsymbol{\theta}} = 0 \] where \(\mathbf{a} = \nabla\left(\sum_i h_i\right)\) is the constraint gradient.

This ensures that the dynamics remain tangent to the constraint surface at all times. The antisymmetric part \(A\) inherits this tangency because it generates rotations on the constraint manifold. Since \(\mathbf{z}^\top A\mathbf{z} = 0\) for any antisymmetric \(A\), these rotations conserve the functional whose gradient drives them, and because they stay on the constraint surface, the first degeneracy is automatically satisfied.

Second Degeneracy: From Constraint Gradient

The second degeneracy is where our framework departs from standard GENERIC. In standard formulations, one requires \(S\nabla E = 0\) where \(E\) is thermodynamic energy. This must be verified case-by-case.

In our framework, the marginal entropy constraint \(\sum_i h_i = C\) plays the role that energy conservation plays in GENERIC. The constraint gradient \(\mathbf{a}\) defines the degeneracy direction along which the dissipative operator vanishes: \[ S\mathbf{a} = S\nabla\left(\sum_i h_i\right) = 0. \]

This follows from the constraint tangency requirement: the symmetric part cannot have a component along the constraint gradient because that would violate conservation.
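This construction is easy to make concrete: projecting any symmetric positive semi-definite operator onto the tangent space of the constraint yields a dissipative part that annihilates the constraint gradient by construction, while preserving Onsager symmetry. A small sketch with a made-up constraint gradient:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
a = rng.normal(size=n)                     # stand-in for the constraint gradient

# Projector onto the tangent space of the constraint surface
P = np.eye(n) - np.outer(a, a) / (a @ a)

C = rng.normal(size=(n, n))
S = P @ (C.T @ C) @ P                      # symmetric PSD, tangent by construction

print(np.linalg.norm(S @ a))               # S a = 0: second degeneracy holds
print(np.allclose(S, S.T))                 # Onsager symmetry preserved
```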

In Section 4 of the paper, we show that under certain conditions in the thermodynamic limit, the constraint gradient becomes asymptotically parallel to an energy gradient: \[ \nabla\left(\sum_i h_i\right) \parallel \nabla E. \]

When this equivalence holds, our automatically-derived degeneracy condition \(S\nabla(\sum_i h_i) = 0\) becomes equivalent to the standard GENERIC condition \(S\nabla E = 0\). This connects our information-theoretic framework to classical thermodynamics.

Why This Matters

The automatic emergence of degeneracy conditions is profound because:

  1. No hand-crafting required: We don’t need to guess the form of \(A\) and \(S\) and verify they satisfy degeneracies. The structure emerges from the constrained dynamics.

  2. Global validity: The degeneracies hold everywhere on the constraint manifold by construction, not just at specific points or in special cases.

  3. Information-first foundation: Instead of starting with energy and thermodynamics and deriving information-theoretic properties, we start with information conservation and derive thermodynamic structure.

  4. Fundamental principle: This suggests GENERIC is not just a modelling framework but a fundamental principle that information-isolated systems must obey.

In Grmela and Öttinger’s original work, satisfying the degeneracy conditions requires careful construction (see e.g. Chapter 4 of Öttinger (2005)). The fact that they emerge automatically in our framework suggests that marginal entropy conservation has special geometric significance for non-equilibrium dynamics.

Example: Harmonic Oscillator GENERIC Dynamics


To see the GENERIC decomposition \(M = S + A\) in action, let’s analyze a simple physical system: the harmonic oscillator with thermalisation. This demonstrates how reversible (Hamiltonian) and irreversible (dissipative) dynamics emerge from the constraint geometry.

import numpy as np
import matplotlib.pyplot as plt
from scipy.linalg import inv
from scipy.integrate import odeint

Starting from an out-of-equilibrium state (cold \(x\), hot \(p\)), the system evolves toward thermal equilibrium while maintaining the entropy constraint.
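The helper functions used below, `marginal_entropy_gaussian` and `harmonic_dynamics`, are defined in cells omitted here. A minimal stand-in, assuming the \(\theta\)'s are marginal precisions and keeping only the dissipative part of the flow (a relaxation toward equipartition that conserves \(h(X) + h(P)\) exactly), could look like:

```python
import numpy as np

def marginal_entropy_gaussian(precision):
    """Entropy of a 1-D Gaussian with the given precision (1/variance)."""
    return 0.5 * np.log(2 * np.pi * np.e / precision)

def harmonic_dynamics(theta, t, gam=1.0):
    """Sketch of the dissipative flow: drives the precisions toward
    equipartition while conserving log(theta_xx) + log(theta_pp),
    i.e. the total marginal entropy h(X) + h(P)."""
    t_xx, t_pp, t_xp = theta
    drive = np.log(t_xx) - np.log(t_pp)   # zero at equipartition
    return [-gam * t_xx * drive, gam * t_pp * drive, 0.0]
```

Under this sketch the sum of log-precisions is an exact invariant of the flow, so the entropy constraint holds to integrator precision.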

# Physical parameters
k = 1.0      # Spring constant
m = 1.0      # Mass
T = 1.0      # Temperature
beta = 1.0 / T

# Equilibrium values
theta_xx_eq = beta * k
theta_pp_eq = beta / m
theta_xp_eq = 0.0

print(f"Equilibrium: θ_xx = {theta_xx_eq:.3f}, θ_pp = {theta_pp_eq:.3f}, θ_xp = 0")

# Initial condition: OUT of equilibrium (cold x, hot p)
theta_init = np.array([2.0 * theta_xx_eq,  # Cold x (small variance)
                       0.5 * theta_pp_eq,  # Hot p (large variance)  
                       0.0])               # No correlation

print(f"Initial:     θ_xx = {theta_init[0]:.3f}, θ_pp = {theta_init[1]:.3f}, θ_xp = 0")

# Simulate
t_span = np.linspace(0, 10, 200)
theta_trajectory = odeint(harmonic_dynamics, theta_init, t_span)

# Compute entropy conservation
h_x = np.array([marginal_entropy_gaussian(th) for th in theta_trajectory[:, 0]])
h_p = np.array([marginal_entropy_gaussian(th) for th in theta_trajectory[:, 1]])
h_total = h_x + h_p

print(f"\nEntropy conservation check:")
print(f"  Initial h(X) + h(P) = {h_total[0]:.6f}")
print(f"  Final   h(X) + h(P) = {h_total[-1]:.6f}")
print(f"  Variation: {np.std(h_total):.2e}")

Figure: Harmonic oscillator GENERIC dynamics. Top left: Trajectory in parameter space from out-of-equilibrium initial state toward thermal equilibrium. Top right: Total entropy \(h(X) + h(P)\) conserved throughout evolution. Bottom left: Individual marginal entropies exchange while maintaining sum. Bottom right: Variances approach equipartition theorem values (both equal to 1.0 at equilibrium).

This demonstrates the GENERIC structure.

  • Symmetric part \(S\): Drives system toward equilibrium (equipartition)
  • Antisymmetric part \(A\): Would create oscillations (suppressed here by strong damping)
  • Constraint: Maintained exactly throughout evolution via Lagrange multiplier \(\nu\)

The system exhibits thermalisation—energy flows from the “hot” momentum degree of freedom to the “cold” position degree of freedom until equipartition is reached.

Computational Validation: Three Binary Variables


To validate our theoretical predictions, we simulate the constrained information dynamics for a system of three binary variables with pairwise interactions—a minimal Ising model or Boltzmann machine. This system is complex enough to exhibit non-trivial correlation structure while remaining computationally tractable for exact analysis.

We verify three predictions.

  1. The constraint \(\sum_i h_i = C\) is maintained during evolution
  2. The linearisation \(M = \partial F/\partial \boldsymbol{\theta}\) decomposes as \(M = S + A\)
  3. The ratio \(\|A\|/\|S\|\) varies with local geometry
import numpy as np
from scipy.integrate import odeint

We start from a “frustrated” configuration where the interaction parameters have competing signs, creating interesting dynamics.
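The helpers `marginal_entropy_n3`, `dynamics_n3`, and `compute_generic_decomposition` come from omitted cells. As a guide, the marginal entropies for three \(\pm 1\) spins with pairwise couplings can be computed exactly by enumerating all \(2^3\) states; a sketch of that helper:

```python
import itertools
import numpy as np

def marginal_entropy_n3(theta):
    """Exact marginal entropies (h_1, h_2, h_3), in nats, for three +/-1 spins
    with fields theta[0:3] and couplings theta[3:6] = (th12, th13, th23)."""
    f1, f2, f3, t12, t13, t23 = theta
    states = np.array(list(itertools.product([-1, 1], repeat=3)), dtype=float)
    log_w = (states @ np.array([f1, f2, f3])
             + t12 * states[:, 0] * states[:, 1]
             + t13 * states[:, 0] * states[:, 2]
             + t23 * states[:, 1] * states[:, 2])
    p = np.exp(log_w - log_w.max())
    p /= p.sum()                               # exact joint over the 8 states
    h = np.empty(3)
    for i in range(3):
        pi = np.clip(p[states[:, i] > 0].sum(), 1e-15, 1 - 1e-15)
        h[i] = -pi * np.log(pi) - (1 - pi) * np.log(1 - pi)
    return h
```

With all parameters zero the joint is uniform, so each marginal entropy equals \(\log 2\).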

# Initial condition: Frustrated Ising model
# Competing interactions: theta12 > 0 (ferromagnetic)
#                        theta13 < 0 (antiferromagnetic)
#                        theta23 > 0 (ferromagnetic)
h = 0.0  # No external field
theta_init = np.array([h, h, h, 1.0, -1.0, 1.0])
theta_init = theta_init / np.linalg.norm(theta_init)  # Normalize

print("Initial parameters (normalized):")
print(f"  Marginals: θ₁={theta_init[0]:.3f}, θ₂={theta_init[1]:.3f}, θ₃={theta_init[2]:.3f}")
print(f"  Interactions: θ₁₂={theta_init[3]:.3f}, θ₁₃={theta_init[4]:.3f}, θ₂₃={theta_init[5]:.3f}")

# Compute initial constraint value
h_marginals_init = marginal_entropy_n3(theta_init)
C_init = h_marginals_init.sum()
print(f"\nInitial constraint: Σᵢ hᵢ = {C_init:.4f}")

# Simulate dynamics
t_span = np.linspace(0, 5, 100)
theta_trajectory = odeint(dynamics_n3, theta_init, t_span)

# Verify constraint maintenance
h_trajectory = np.array([marginal_entropy_n3(th).sum() for th in theta_trajectory])
constraint_error = np.abs(h_trajectory - C_init)

print(f"\nConstraint maintenance:")
print(f"  Maximum deviation: {constraint_error.max():.2e}")
print(f"  RMS deviation: {np.sqrt(np.mean(constraint_error**2)):.2e}")

# Analyze GENERIC decomposition at several points
analysis_indices = [0, len(t_span)//3, 2*len(t_span)//3, -1]
regime_ratios = []

print(f"\nGENERIC decomposition analysis:")
for idx in analysis_indices:
    M, S, A = compute_generic_decomposition(theta_trajectory[idx])
    norm_S = np.linalg.norm(S, 'fro')
    norm_A = np.linalg.norm(A, 'fro')
    ratio = norm_A / norm_S if norm_S > 0 else 0
    regime_ratios.append(ratio)
    print(f"  t={t_span[idx]:.2f}: ‖A‖/‖S‖ = {ratio:.3f}")

Figure: Computational validation for 3 binary variables. Top left: Constraint \(\sum_i h_i = C\) maintained to machine precision. Top right: Individual marginal entropies exchange while sum is conserved. Bottom left: Interaction parameters evolve from frustrated initial state. Bottom right: Ratio \(\|A\|/\|S\|\) varies with local geometry—no single “regime” throughout parameter space.

Key findings:

  1. Constraint maintenance: The sum of marginal entropies is preserved to machine precision (\(\sim 10^{-12}\)), validating the constrained dynamics implementation.

  2. GENERIC structure: At every point, the linearisation \(M\) decomposes cleanly into symmetric and antisymmetric parts that satisfy the degeneracy conditions.

  3. Regime variation: The ratio \(\|A\|/\|S\|\) changes significantly during evolution, confirming there is no single “regime” that characterizes all of parameter space. The relative importance of reversible vs irreversible dynamics depends on local geometry.

  4. Physical interpretation: Starting from a frustrated configuration (competing interactions), the system evolves by:

    • Weakening the antiferromagnetic interaction \(\theta_{13}\)
    • Maintaining ferromagnetic interactions \(\theta_{12}\), \(\theta_{23}\)
    • Redistributing entropy among marginals while conserving total

Information Topography

The Fisher information matrix provides mathematical teeth to the intuitive notion of an “information topography” from The Atomic Human.

Fisher Information as Conductance Tensor


In the inaccessible game, the Fisher information matrix \(G(\boldsymbol{\theta})\) plays a role analogous to conductance in electrical circuits—but with differences that make the game richer than a Kirchhoff network.

The Electrical Circuit Analogy

In a Kirchhoff electrical network, charge conservation is local and linear: \(\sum_j I_{ij} = 0\) at each node, where current flows according to Ohm’s law with fixed conductances: \[ I_{ij} = g_{ij}(V_i - V_j). \] The conductances \(g_{ij}\) are fixed parameters of the circuit. Given the conductances, the steady state can be found by solving linear equations derived from local charge conservation.

In contrast, our information conservation constraint \(\sum_{i=1}^n h_i = C\) is generally nonlocal and nonlinear. Each marginal entropy \(h_i\) requires marginalization over all other variables, making it a global functional of the entire state \(\boldsymbol{\theta}\).

Consider a multivariate Gaussian as an example. The marginal entropy is \[ h_i = \frac{1}{2}\log(2\pi e \sigma_i^2) \] where \(\sigma_i^2 = [G^{-1}]_{ii}\) is the \(i\)-th diagonal element of the inverse Fisher information. The conservation constraint becomes: \[ \sum_{i=1}^n \log([G^{-1}]_{ii}) = \text{constant}. \]

Dynamic Information Topography

Moreover, the Fisher information \(G(\boldsymbol{\theta})\) itself evolves with the state. This creates a dynamic information topography, more analogous to memristive networks than fixed resistors.

The “conductance” for information flow is given by the Fisher information: \[ G(\boldsymbol{\theta}) = \nabla^2 A(\boldsymbol{\theta}), \] which depends on the current state \(\boldsymbol{\theta}\). As the system evolves, both the “voltages” (parameters \(\boldsymbol{\theta}\)) and the “conductances” (Fisher information \(G\)) change together.

Information Channels and Bottlenecks

Despite the differences, the analogy provides intuition. The eigenvalues of \(G(\boldsymbol{\theta})\) indicate information channel capacities:

  • Large eigenvalues: Low-resistance channels for information flow
  • Small eigenvalues: Bottlenecks that constrain flow
  • Eigenvectors: Directions of easy/hard information movement

The constrained maximum entropy production acts as a generalized Ohm’s law. Information flows “downhill” in the entropy landscape, but the rate of flow is governed by the Fisher information metric. The nonlocal conservation and emergent conductance structure create a system where information reorganizes itself through the interplay between local gradient flows and global constraints.

Why “Topography”?

We think of \(G(\boldsymbol{\theta})\) as defining the information topography. In geography, topography describes hills, valleys, and plains that constrain how water flows. In our framework, \(G(\boldsymbol{\theta})\) describes the “shape” of the information landscape that constrains how information flows.

This formalises the metaphor from The Atomic Human (Lawrence, 2024): “In geography, the topography is the configuration of natural and man-made features in the landscape… An information topography is similar, but instead of the movement of goods, water and people, it dictates the movement of information.”

Formalising Information Topography


In The Atomic Human (Lawrence, 2024), the concept of an information topography was introduced as a metaphor: “In geography, the topography is the configuration of natural and man-made features in the landscape… These questions are framed by the topography. An information topography is similar, but instead of the movement of goods, water and people, it dictates the movement of information.”

However, no formal mathematical definition was given. The inaccessible game provides one.

Mathematical Definition

We define the information topography of a system to be the Fisher information matrix \(G(\boldsymbol{\theta})\) viewed as a Riemannian metric on the space of probability distributions.

Formally, for an exponential family parametrised by natural parameters \(\boldsymbol{\theta}\):

Definition (Information Topography): The information topography is the pair \((G(\boldsymbol{\theta}), \mathcal{M})\) where: - \(\mathcal{M}\) is the statistical manifold of probability distributions - \(G(\boldsymbol{\theta}) = \nabla^2 A(\boldsymbol{\theta})\) is the Fisher information metric

This metric determines: 1. Information distances between distributions 2. Directions of information flow (geodesics) 3. Information channel capacities (eigenvalues) 4. Bottlenecks and pathways (eigenvectors)

How It Constrains Information Movement

The information topography constrains movement in three ways:

1. Metric Structure: The “distance” between two nearby distributions \(p(\boldsymbol{\theta})\) and \(p(\boldsymbol{\theta} + d\boldsymbol{\theta})\) is: \[ ds^2 = d\boldsymbol{\theta}^\top G(\boldsymbol{\theta}) d\boldsymbol{\theta} \] Moving in directions corresponding to small eigenvalues of \(G\) requires large parameter changes for small distributional changes—these are “narrow passes” in the information landscape.

2. Gradient Flow: The entropy gradient is: \[ \nabla H = -G(\boldsymbol{\theta})\boldsymbol{\theta} \] Information flows “downhill” in entropy space, but the Fisher information determines the effective slope. Regions with small eigenvalues have shallow gradients—information flows slowly.

3. Constraint Enforcement: Under conservation \(\sum h_i = C\), the constraint gradient \(\mathbf{a} = \nabla(\sum h_i)\) interacts with \(G\) to determine allowed flow directions. The dynamics become: \[ \dot{\boldsymbol{\theta}} = -\Pi_\parallel G(\boldsymbol{\theta})\boldsymbol{\theta} \] where \(\Pi_\parallel\) projects onto the constraint tangent space.
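The “narrow passes” picture from point 1 can be illustrated with a made-up two-dimensional metric: the same parameter-space step covers very different information distances along stiff and sloppy directions.

```python
import numpy as np

# A hypothetical information metric: one stiff direction (large eigenvalue),
# one "sloppy" direction (small eigenvalue, a narrow pass in the landscape)
G = np.diag([10.0, 0.1])

d_stiff = np.array([1.0, 0.0])    # unit step along the large-eigenvalue axis
d_sloppy = np.array([0.0, 1.0])   # unit step along the small-eigenvalue axis

ds2_stiff = d_stiff @ G @ d_stiff      # ds^2 = 10.0
ds2_sloppy = d_sloppy @ G @ d_sloppy   # ds^2 = 0.1
print(ds2_stiff, ds2_sloppy)
```

A step along the sloppy direction barely changes the distribution: large parameter changes are needed for small distributional changes.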

Dynamic Topography

Unlike geographical topography, which is static, information topography is dynamic: it changes as the system evolves. As \(\boldsymbol{\theta}\) changes, so does \(G(\boldsymbol{\theta})\). This creates a feedback loop:

  1. Current topography \(G(\boldsymbol{\theta})\) determines information flow
  2. Flow changes parameters \(\boldsymbol{\theta}\)
  3. New parameters change topography \(G(\boldsymbol{\theta})\)
  4. Repeat

This dynamic restructuring is what makes information systems so rich. The landscape itself evolves as you move through it.

This formalisation gives mathematical precision to the intuitive notion from The Atomic Human that information movement is constrained by structure. The Fisher information matrix is that structure, and the inaccessible game describes how systems evolve within it.

Fisher Information as Geometry


In the previous section, we saw that for exponential families, the Fisher information matrix appears as the second derivative of the log partition function \[ G(\boldsymbol{\theta}) = \nabla^2 \mathcal{A}(\boldsymbol{\theta}) = \mathrm{Cov}_{\boldsymbol{\theta}}[T(\mathbf{x})]. \] We now develop the geometric interpretation: the Fisher information matrix defines a metric on the space of probability distributions.

The Statistical Manifold

Consider the space of all probability distributions in an exponential family, parametrized by \(\boldsymbol{\theta}\). This space forms a manifold — a smooth, curved space where each point represents a different distribution.

The Fisher information matrix \(G(\boldsymbol{\theta})\) acts as a Riemannian metric on this manifold. Think of measuring distances on a curved surface like a sphere: you need a metric to tell you how far apart two nearby points are. The Fisher information provides exactly this for the space of probability distributions, telling us how to measure “statistical distance” between distributions.

The Fisher information defines the information distance between nearby distributions. If we move from parameters \(\boldsymbol{\theta}\) to \(\boldsymbol{\theta} + \text{d}\boldsymbol{\theta}\), the infinitesimal distance in information space is \[ \text{d}s^2 = \text{d}\boldsymbol{\theta}^\top G(\boldsymbol{\theta}) \text{d}\boldsymbol{\theta} \] with the Fisher information playing the role of the metric. Larger Fisher information means a given parameter change corresponds to a larger “information distance”: the distributions are more distinguishable.

Connection to Statistical Estimation

This geometric picture connects directly to Fisher’s original motivation. The Cramér-Rao bound states that for any unbiased estimator \(\hat{\boldsymbol{\theta}}\) of parameters \(\boldsymbol{\theta}\), \[ \text{cov}(\hat{\boldsymbol{\theta}}) \succeq G^{-1}(\boldsymbol{\theta}), \] where \(\succeq\) denotes that the left side minus the right side is positive semidefinite.

Geometrically, this means: higher Fisher information (stronger metric) implies tighter bounds on estimation. The inverse \(G^{-1}\) gives the minimum possible covariance of any unbiased estimator; it is the fundamental limit on how well we can estimate parameters from data.
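A quick Monte Carlo sketch (with illustrative numbers) shows the sample mean attaining the bound for a Gaussian location family, where the per-observation Fisher information is \(1/\sigma^2\):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, n, reps = 2.0, 400, 3000

# Per-observation Fisher information for the mean of N(mu, sigma^2)
G_single = 1.0 / sigma**2
crb = 1.0 / (n * G_single)       # Cramer-Rao bound on var(mu_hat) for n samples

# The sample mean is efficient: its variance attains the bound
estimates = rng.normal(0.0, sigma, size=(reps, n)).mean(axis=1)
print(estimates.var(), crb)
```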

The Fisher information plays two distinct but related roles:

  1. As a metric: It defines information distance, telling us how “far apart” distributions are.

  2. In gradient flow: Recall from the exponential family definitions that \(\nabla H = -G(\boldsymbol{\theta})\boldsymbol{\theta}\). This means entropy gradient ascent in exponential families involves the Fisher information, \[ \dot{\boldsymbol{\theta}} = \nabla H = -G(\boldsymbol{\theta})\boldsymbol{\theta}. \]

The appearance in the gradient comes from the specific structure of exponential families (where \(G = \nabla^2 \mathcal{A}\)). Together, they determine how the system flows through information space, with the geometry guiding the dynamics.

Examples Revisited

For the Gaussian distribution, we saw that \(G(\boldsymbol{\theta}) = \Sigma\). This means: - The information metric is the covariance matrix - The inverse \(G^{-1} = \Sigma^{-1}\) is the precision matrix

Geometrically, the information ellipsoid has the same shape as the probability ellipsoid. This direct connection between the Fisher information and covariance is special to Gaussians (and arises because we’re working in natural parameters \(\boldsymbol{\theta} = \Sigma^{-1}\boldsymbol{\mu}\)).

For a categorical distribution with \(K\) outcomes, the Fisher information has a special structure. Using the natural parameters \(\theta_k = \log \pi_k\), the Fisher information is \[ G_{ij}(\boldsymbol{\theta}) = \delta_{ij}\pi_i - \pi_i\pi_j = \begin{cases} \pi_i(1 - \pi_i) & i = j \\ -\pi_i\pi_j & i \neq j \end{cases} \]

This metric defines the probability simplex geometry. Distributions near the center of the simplex (all \(\pi_k \approx 1/K\)) have different local geometry than those near the corners (one \(\pi_k \approx 1\)). The Fisher metric captures this intrinsic curvature.
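The structure of this metric is easy to verify numerically. In particular, the all-ones direction is a null direction of the metric: adding the same constant to every \(\theta_k = \log \pi_k\) only rescales the normalisation and leaves the distribution unchanged.

```python
import numpy as np

pi = np.array([0.5, 0.3, 0.2])
G = np.diag(pi) - np.outer(pi, pi)   # Fisher metric for a categorical

# The uniform direction is annihilated: G @ 1 = pi - pi * sum(pi) = 0
print(G @ np.ones(3))

# G is positive semi-definite, as any metric must be
print(np.linalg.eigvalsh(G))
```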

Information Geometry: The Big Picture

The Fisher information matrix is a foundational element of information geometry, a field that studies probability distributions using differential geometric tools. Key insights:

  1. Amari’s Dually Flat Structure: Exponential families have a special property. They are “dually flat” under two different coordinate systems (natural parameters \(\boldsymbol{\theta}\) and expectation parameters \(\boldsymbol{\mu}\)). The Fisher metric connects these.

  2. Geodesics: The shortest path between two distributions (in the information geometry sense) is a geodesic. For exponential families, geodesics have elegant forms that will connect to our least action principles.

  3. Curvature: The curvature of the statistical manifold (measured by the Riemann curvature tensor derived from \(G\)) tells us about the intrinsic structure of the family. Exponential families have zero curvature in a certain sense—they are “flat” manifolds.

These geometric properties will be essential when we study constrained information dynamics and emergence.

Why This Matters for The Inaccessible Game

The Fisher information matrix plays three roles in our framework:

  1. Gradient Flow Metric: It appears in the entropy gradient, determining how the system evolves through information space via \(\dot{\boldsymbol{\theta}} = -G(\boldsymbol{\theta})\boldsymbol{\theta}\).

  2. Information Distance: It defines the metric for measuring statistical distinguishability between distributions.

  3. Emergence Indicator: Changes in the structure of \(G\) signal the emergence of new regimes and effective descriptions.

Understanding Fisher information as geometry, not just as a statistical tool, is key for everything that follows.

Connecting Information to Energy

The Thermodynamic Limit

Energy-Entropy Equivalence in the Thermodynamic Limit


We’ve seen that marginal entropy conservation \(\sum_i h_i = C\) leads to GENERIC-like structure. But how does this connect to traditional thermodynamics with energy conservation? The answer lies in the thermodynamic limit.

The Energy-Entropy Question

In real GENERIC systems, it’s not marginal entropy that’s conserved but extensive thermodynamic energy \(E\). Can we show that, under specific conditions, the marginal entropy constraint asymptotically singles out the same degeneracy direction as energy conservation?

In other words, does the constraint gradient \(\nabla(\sum_i h_i)\) become parallel to \(\nabla E\) in appropriate limits?

Conditions for Equivalence

The equivalence requires specific scaling properties. Using multi-information \(I = \sum_i h_i - H\), we can write: \[ \nabla\left(\sum_i h_i\right) = \nabla I + \nabla H. \]

Our requirement: Along certain directions (order parameters), the multi-information gradient must scale intensively while entropy gradients scale extensively.

Specifically, consider a macroscopic order parameter \(m\) (like magnetization in a spin system). The requirement is: - \(\nabla_m I = \mathscr{O}(1)\) (intensive) - \(\nabla_m H = \mathscr{O}(n)\) (extensive) - \(\nabla_m(\sum_i h_i) = \mathscr{O}(n)\) (extensive)

where \(n\) is the number of variables.

When this scaling holds: \[ \nabla_m\left(\sum_i h_i\right) = \nabla_m H + \nabla_m I = \mathscr{O}(n) + \mathscr{O}(1). \]

In the thermodynamic limit \(n \to \infty\), the \(\mathscr{O}(1)\) correction from multi-information becomes negligible: \[ \nabla_m\left(\sum_i h_i\right) \parallel \nabla_m H. \]

Connecting to Energy

For exponential families, the entropy gradient in expectation parameters \(\boldsymbol{\mu} = \langle T(\mathbf{x})\rangle\) is: \[ \nabla_{\boldsymbol{\mu}} H = \boldsymbol{\theta} \] where \(\boldsymbol{\theta}\) are the natural parameters.

Now define an energy functional as: \[ E(\mathbf{x}) = -\boldsymbol{\alpha}^\top T(\mathbf{x}) \] where \(\boldsymbol{\alpha}\) is chosen such that \(\boldsymbol{\theta} = -\beta\boldsymbol{\alpha}\). This gives: \[ \nabla_{\boldsymbol{\mu}} E = -\boldsymbol{\alpha} = \frac{\boldsymbol{\theta}}{\beta} = \frac{\nabla_{\boldsymbol{\mu}} H}{\beta}. \]

Therefore \[ \nabla E \parallel \nabla H \parallel \nabla\left(\sum_i h_i\right) \] along the macroscopic direction in the thermodynamic limit.

When Does This Hold?

The equivalence requires:

  1. Macroscopic order parameter: There exists a low-dimensional direction (like total magnetisation \(m = \sum_i x_i\)) that captures system-wide behavior

  2. Bounded correlations: The correlation length \(\xi\) remains finite (away from criticality), ensuring \(\nabla_m I\) stays intensive

  3. Translation invariance: Symmetry ensures marginal entropies are identical conditioned on the order parameter

  4. Thermodynamic limit: Number of variables \(n \to \infty\)

Not all systems satisfy these conditions. Near critical points, correlations diverge and the intensive scaling breaks down. In small systems, \(\mathscr{O}(1)\) corrections matter. But for many physically relevant systems—bulk matter far from phase transitions—the equivalence holds and provides a bridge between information theory and thermodynamics.

Implications

This equivalence has several implications:

  1. Information \(\leftrightarrow\) Thermodynamics: Energy conservation and marginal entropy conservation become equivalent statements in appropriate limits

  2. Temperature emergence: The parameter \(\beta\) emerges as inverse temperature from the information geometry, not imposed from thermodynamics

  3. Landauer’s principle: Can be derived from information conservation via this equivalence

  4. Wheeler’s “it from bit”: Suggests thermodynamics might emerge from information-theoretic principles

Analytical Validation: Curie-Weiss Model


The energy-entropy equivalence theorem from the inaccessible game predicts that in the thermodynamic limit, accessible information \(I\) should vanish, making energy and entropy equivalent. To test this rigorously, we use the Curie-Weiss model, a mean-field system where everything can be computed exactly, including across phase transitions.

The model exhibits a ferromagnetic phase transition:

  • Disordered phase (\(T > T_c\)): Magnetization \(m = 0\), spins independent
  • Ordered phase (\(T < T_c\)): Magnetization \(m \neq 0\), spins correlated

The theorem predicts \(\nabla_m I \approx 0\) in the disordered phase (equivalence holds) but \(\nabla_m I \gg 0\) in the ordered phase (equivalence fails).

import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import fsolve

We scan across temperatures from high (disordered) to low (ordered), computing the magnetization and testing whether the equivalence \(\nabla_m I \approx 0\) holds.
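The scan below calls two helper functions that are not defined in the text. A minimal sketch of possible implementations (both function bodies are assumptions, not the original code): the magnetization solves the mean-field self-consistency equation \(m = \tanh(\beta(Jm + h))\), and for \(\nabla_m I\) we use a hypothetical per-spin proxy \(I/n \approx \log 2 - H_b\!\left(\tfrac{1+m}{2}\right)\) (marginals are symmetric mixtures of the two branches while the joint concentrates on one), whose derivative is \(\operatorname{artanh}(m)\):

```python
import numpy as np
from scipy.optimize import fsolve

def magnetization_selfconsistent(beta, J, h):
    """Solve the Curie-Weiss self-consistency equation m = tanh(beta*(J*m + h)).

    Starting the solver from m0 = 1 selects the positive ordered branch below T_c;
    above T_c the only root is m = 0.
    """
    m = fsolve(lambda m: m - np.tanh(beta * (J * m + h)), x0=1.0)[0]
    return float(m)

def accessible_information_gradient(beta, J, m):
    """Hypothetical per-spin proxy for the gradient of multi-information.

    Assuming I/n ~ log 2 - H_b((1 + m)/2), differentiation gives
    dI/dm = artanh(m): zero in the disordered phase, large when ordered.
    """
    m = np.clip(m, -1 + 1e-12, 1 - 1e-12)  # keep artanh finite at |m| -> 1
    return float(np.arctanh(m))
```

With these definitions the scan below runs as written; since \(m = \tanh(\beta J m)\) at \(h = 0\), the ordered-phase gradient satisfies \(\operatorname{artanh}(m) = \beta J m\).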

# Parameters
J = 1.0  # Coupling strength
T_c = J  # Critical temperature
h = 0.0  # No external field

# Scan temperatures
temperatures = np.linspace(0.3, 2.0, 50)
betas = 1.0 / temperatures

# Compute magnetizations and gradients
magnetizations = []
gradients = []

for beta in betas:
    m = magnetization_selfconsistent(beta, J, h)
    magnetizations.append(m)
    
    # Compute gradient of accessible information
    dI_dm = accessible_information_gradient(beta, J, m)
    gradients.append(abs(dI_dm))

magnetizations = np.array(magnetizations)
gradients = np.array(gradients)

# Identify phase transition
T_transition_idx = np.argmin(np.abs(temperatures - T_c))

print(f"Critical temperature: T_c = {T_c:.2f}")
print(f"\nDisordered phase (T > T_c):")
print(f"  Temperature T = {temperatures[0]:.2f}: m = {magnetizations[0]:.4f}, |∇_m I| = {gradients[0]:.2e}")
print(f"\nOrdered phase (T < T_c):")
print(f"  Temperature T = {temperatures[-1]:.2f}: m = {magnetizations[-1]:.4f}, |∇_m I| = {gradients[-1]:.2e}")

# Verify theorem prediction
disordered_mask = temperatures > T_c
ordered_mask = temperatures < T_c

mean_gradient_disordered = gradients[disordered_mask].mean()
mean_gradient_ordered = gradients[ordered_mask].mean()

print(f"\nTheorem verification:")
print(f"  Disordered phase: <|∇_m I|> = {mean_gradient_disordered:.2e} ≈ 0 ✓")
print(f"  Ordered phase:    <|∇_m I|> = {mean_gradient_ordered:.2e} ≫ 0 ✓")

Figure: Testing energy-entropy equivalence across the Curie-Weiss phase transition. Left: Magnetization \(m\) as a function of temperature, showing ferromagnetic transition at \(T_c = J\). Right: Gradient of accessible information \(|\nabla_m I|\) is small in the disordered phase (equivalence holds) but large in the ordered phase (equivalence fails), confirming the theorem’s predictions.

Results:

The numerical experiment confirms the theorem’s predictions:

  1. Disordered phase (\(T > T_c\)):
    • Magnetization \(m \approx 0\) (no long-range order)
    • Gradient \(|\nabla_m I| \approx 0\) (energy-entropy equivalence holds)
    • System is in the “thermodynamic regime” where correlations are weak
  2. Ordered phase (\(T < T_c\)):
    • Magnetization \(m \neq 0\) (spontaneous symmetry breaking)
    • Gradient \(|\nabla_m I| \gg 0\) (equivalence fails)
    • Strong correlations violate the assumptions for equivalence
  3. Phase transition:
    • Sharp crossover at critical temperature \(T_c = J\)
    • Both magnetization and equivalence breakdown occur simultaneously
    • Validates that correlations determine when equivalence applies

This provides rigorous analytical support for the energy-entropy equivalence theorem derived from the inaccessible game axioms.

GENERIC and Thermodynamics

GENERIC as Generalized Thermodynamics


GENERIC provides a framework that generalizes classical thermodynamics to arbitrary non-equilibrium systems. Where classical thermodynamics describes systems near equilibrium with linear response, GENERIC handles systems arbitrarily far from equilibrium with full nonlinear dynamics.

Classical thermodynamics (Clausius, Kelvin, Carnot):

  • Equilibrium states
  • Quasi-static processes
  • Entropy maximization at equilibrium
  • No dynamics, only relations between states

Linear irreversible thermodynamics (Onsager, Prigogine):

  • Near-equilibrium dynamics
  • Linear force-flux relations
  • Onsager reciprocity
  • Valid only for small deviations

GENERIC (Grmela, Öttinger):

  • Arbitrary far-from-equilibrium states
  • Nonlinear dynamics
  • Combines reversible + irreversible
  • Reduces to classical thermodynamics at equilibrium

GENERIC is the completion of thermodynamics—it describes the full dynamical evolution, not just equilibrium endpoints.

The Laws of Thermodynamics in GENERIC

GENERIC automatically encodes the fundamental laws of thermodynamics. Let’s see how:

Zeroth Law (Transitivity of equilibrium): At equilibrium, \(\dot{x} = L \nabla E + M \nabla S = 0\). If systems A and B are each in equilibrium with C, and equilibrium is defined by the same functionals \(E\) and \(S\), then A and B are in equilibrium with each other. This follows from the uniqueness of critical points.

First Law (Energy conservation): \[ \frac{\text{d}E}{\text{d}t} = \langle \nabla E, \dot{x} \rangle = \langle \nabla E, L \nabla E + M \nabla S \rangle \] Using antisymmetry of \(L\): \(\langle \nabla E, L \nabla E \rangle = 0\)

Using degeneracy condition 1: \(\langle \nabla E, M \nabla S \rangle = 0\)

Therefore: \(\frac{\text{d}E}{\text{d}t} = 0\) (energy is conserved!)

The first law is built into GENERIC structure through antisymmetry and degeneracy.

Second Law (Entropy increase): \[ \frac{\text{d}S}{\text{d}t} = \langle \nabla S, \dot{x} \rangle = \langle \nabla S, L \nabla E + M \nabla S \rangle \] Using degeneracy condition 2: \(\langle \nabla S, L \nabla E \rangle = 0\)

Using positive semi-definiteness of \(M\): \(\langle \nabla S, M \nabla S \rangle \geq 0\)

Therefore: \(\frac{\text{d}S}{\text{d}t} \geq 0\) (entropy increases!)

The second law is built into GENERIC through degeneracy and positive semi-definiteness.

Third Law (Entropy vanishes at zero temperature): This is more subtle and depends on the specific form of \(S\) and \(M\), but GENERIC is compatible with quantum mechanical formulations where the third law emerges naturally.
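The first- and second-law derivations above can be checked numerically. The sketch below builds a random antisymmetric \(L\) and symmetric positive semi-definite \(M\) that satisfy the degeneracy conditions by construction (projecting out \(\nabla S\) and \(\nabla E\) respectively), then verifies \(\dot{E} = 0\) and \(\dot{S} \geq 0\):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
gE, gS = rng.normal(size=n), rng.normal(size=n)  # stand-ins for grad E, grad S

# Projectors that enforce the degeneracy conditions L grad(S) = 0, M grad(E) = 0
P_S = np.eye(n) - np.outer(gS, gS) / (gS @ gS)
P_E = np.eye(n) - np.outer(gE, gE) / (gE @ gE)

A = rng.normal(size=(n, n))
L = P_S @ (A - A.T) @ P_S        # antisymmetric, annihilates grad S
B = rng.normal(size=(n, n))
M = P_E @ (B @ B.T) @ P_E        # symmetric PSD, annihilates grad E

xdot = L @ gE + M @ gS           # GENERIC flow
dE_dt = gE @ xdot                # first law: vanishes by antisymmetry + degeneracy
dS_dt = gS @ xdot                # second law: non-negative by PSD + degeneracy
print(dE_dt, dS_dt)
```

Any choice of antisymmetric \(A\) and factorised \(BB^\top\) works; the projectors are what guarantee the two degeneracy conditions.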

Onsager Reciprocity Relations

One of the crowning achievements of linear irreversible thermodynamics was Onsager’s reciprocity relations (Onsager, 1931). These state that near equilibrium, the response matrix relating thermodynamic forces to fluxes is symmetric.

GENERIC provides a non-linear generalization of Onsager reciprocity through the symmetry of the friction operator \(M\).

Near equilibrium, expand \(M(\nabla S)\) to leading order: \[ M(x) \nabla S(x) \approx M(x_{\text{eq}}) \nabla^2 S(x_{\text{eq}}) (x - x_{\text{eq}}) = M_0 H_S (x - x_{\text{eq}}) \] where \(H_S\) is the Hessian of entropy at equilibrium.

The flux is \(J = M_0 H_S \delta x\) and the thermodynamic force is \(X = H_S \delta x\). The response matrix is: \[ J = L X \quad \text{where} \quad L = M_0 \]

Since \(M_0\) is symmetric (GENERIC requirement), we have \(L_{ij} = L_{ji}\), which is exactly Onsager reciprocity!

Key insight: Onsager reciprocity isn’t a separate postulate—it’s a consequence of the symmetric structure of the friction operator, which in turn follows from thermodynamic consistency (entropy increase).

Entropy Production

A central concept in non-equilibrium thermodynamics is entropy production—the rate at which entropy is generated by irreversible processes. In GENERIC, this has a beautiful formulation.

The total entropy rate is: \[ \frac{\text{d}S}{\text{d}t} = \langle \nabla S, L \nabla E + M \nabla S \rangle = \langle \nabla S, M \nabla S \rangle \] (using degeneracy 2: \(L \nabla S = 0\))

We can decompose this as: \[ \dot{S} = \sigma_S \geq 0 \] where \(\sigma_S = \langle \nabla S, M \nabla S \rangle\) is the entropy production rate.

Key properties:

  1. Non-negative: \(\sigma_S \geq 0\) (from positive semi-definiteness of \(M\))
  2. Vanishes at equilibrium: When \(\nabla S = 0\), we have \(\sigma_S = 0\)
  3. Measures irreversibility: \(\sigma_S\) quantifies how far the system is from reversible dynamics

For our information dynamics, the entropy gradient in natural parameters is \(\nabla_{\boldsymbol{\theta}} S = -G\boldsymbol{\theta}\) and the flow is \(\dot{\boldsymbol{\theta}} = -G\boldsymbol{\theta}\), so: \[ \sigma_S = \langle \nabla S, \dot{\boldsymbol{\theta}} \rangle = \langle -G\boldsymbol{\theta}, -G\boldsymbol{\theta} \rangle = \boldsymbol{\theta}^\top G^2 \boldsymbol{\theta} \]

This is exactly the entropy production from maximum entropy dynamics (Lecture 3)! The Fisher information matrix \(G\) governs the rate of entropy increase.

Free Energy and Dissipation

In equilibrium thermodynamics, the free energy \(F = E - TS\) (where \(T\) is temperature) determines equilibrium states: systems minimize \(F\) at fixed temperature.

In GENERIC, we can define a generalized free energy functional and show that it decreases along trajectories (for isothermal processes).

For a system at temperature \(T\), define: \[ \mathcal{F} = E - T S \]

The rate of change is: \[ \frac{\text{d}\mathcal{F}}{\text{d}t} = \frac{\text{d}E}{\text{d}t} - T\frac{\text{d}S}{\text{d}t} = 0 - T \sigma_S = -T \langle \nabla S, M \nabla S \rangle \leq 0 \]

So free energy decreases! The system dissipates toward minimum free energy at equilibrium.

For information dynamics, this connects to the free energy in exponential families: \[ \mathcal{F}(\boldsymbol{\theta}) = -A(\boldsymbol{\theta}) + \boldsymbol{\theta}^\top \mathbb{E}[T(\mathbf{x})] \] where \(A(\boldsymbol{\theta})\) is the log partition function. The dynamics \(\dot{\boldsymbol{\theta}} = -G\boldsymbol{\theta}\) perform gradient descent on free energy (under constraints).

Fluctuation-Dissipation Relations

Another deep result from statistical mechanics is the fluctuation-dissipation theorem, which relates the response of a system to perturbations to its spontaneous fluctuations at equilibrium.

Near equilibrium, fluctuations in a variable \(x_i\) have variance: \[ \langle (\delta x_i)^2 \rangle \propto k_B T \]

The response to a small force \(f_j\) is: \[ \langle \delta x_i \rangle = \chi_{ij} f_j \]

The fluctuation-dissipation theorem states: \[ \chi_{ij} \propto \frac{\langle \delta x_i \delta x_j \rangle}{k_B T} \]

In GENERIC, this emerges naturally from the structure. The friction operator \(M\) governs both: - Dissipation: How perturbations relax - Fluctuations: Equilibrium variance (through Gibbs distribution)

For our information dynamics, \(M = G\) (Fisher information), which is exactly the inverse covariance matrix for Gaussian distributions! So: \[ G_{ij}^{-1} = \text{Cov}(T_i, T_j) = \langle (\delta T_i)(\delta T_j) \rangle \]

The Fisher information (dissipation) and the covariance (fluctuations) are inverses—a direct manifestation of fluctuation-dissipation!

Maximum Entropy Production Principle

The maximum entropy production principle (MEPP) states that non-equilibrium steady states are characterized by maximum entropy production rate subject to constraints. This principle is debated in thermodynamics, but GENERIC provides a framework for understanding when it applies.

From Lecture 3, we derived that unconstrained information dynamics maximize entropy production: \[ \dot{S} = \max_{\dot{\boldsymbol{\theta}}} \{\dot{S}(\dot{\boldsymbol{\theta}}) : \text{Fisher-constrained}\} \]

With constraints, the dynamics become: \[ \dot{\boldsymbol{\theta}} = -G\boldsymbol{\theta} - \nu a \]

This still maximizes entropy production on the constraint manifold: \[ \dot{S} = \max_{\dot{\boldsymbol{\theta}} : a^\top\dot{\boldsymbol{\theta}}=0} \{-\boldsymbol{\theta}^\top G \dot{\boldsymbol{\theta}}\} \]

So MEPP holds for information dynamics as a consequence of GENERIC structure + Fisher information as friction.
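A quick numerical sketch of the constrained dynamics (assuming the entropy rate \(\dot S = -\boldsymbol{\theta}^\top G\dot{\boldsymbol{\theta}}\) from above): solving \(a^\top\dot{\boldsymbol{\theta}} = 0\) for the multiplier gives \(\nu = -a^\top G\boldsymbol{\theta}/(a^\top a)\), and the resulting production \(\dot S = |G\boldsymbol{\theta}|^2 - (a^\top G\boldsymbol{\theta})^2/|a|^2 \geq 0\) by Cauchy-Schwarz:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
B = rng.normal(size=(n, n))
G = B @ B.T                      # Fisher information: symmetric positive definite
theta = rng.normal(size=n)
a = rng.normal(size=n)           # constraint direction

# Lagrange multiplier chosen so that the constraint a^T theta_dot = 0 holds
nu = -(a @ G @ theta) / (a @ a)
theta_dot = -G @ theta - nu * a

S_dot = -theta @ G @ theta_dot   # entropy production along the constrained flow
print(a @ theta_dot, S_dot)      # constraint ~ 0; S_dot >= 0 by Cauchy-Schwarz
```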

The general lesson: MEPP emerges when: 1. The friction operator is related to the entropy Hessian (Fisher information) 2. Constraints are properly accounted for via Lagrange multipliers 3. The system is not externally driven

GENERIC provides the mathematical framework for understanding when MEPP applies and when it doesn’t.

Connection to Non-Equilibrium Statistical Mechanics

GENERIC bridges macroscopic thermodynamics and microscopic statistical mechanics. While we’ve been working at the level of distributions and information, GENERIC can be derived from:

Microscopic foundations:

  • Liouville equation for phase space density
  • BBGKY hierarchy for reduced distributions
  • Projection operator methods (Zwanzig-Mori)

These microscopic theories show how reversible microscopic dynamics (Hamiltonian) give rise to irreversible macroscopic dynamics (dissipation) through coarse-graining and loss of information.

The antisymmetric part \(L\) preserves the fine-grained microscopic reversibility. The symmetric part \(M\) captures the effective irreversibility from ignored microscopic degrees of freedom.

For information dynamics:

  • Fine-grained: Full joint distribution \(p(\mathbf{x})\)
  • Coarse-grained: Natural parameters \(\boldsymbol{\theta}\) (exponential family)
  • \(L\): Preserves structure within parameter space
  • \(M = G\): Dissipation from unobserved correlations

This connects our information-theoretic approach to fundamental stat-mech!

Summary: GENERIC generalizes classical thermodynamics to arbitrary non-equilibrium dynamics. The laws of thermodynamics, Onsager reciprocity, entropy production, fluctuation-dissipation, and maximum entropy production all emerge as consequences of GENERIC structure. For information dynamics, Fisher information plays the role of thermodynamic friction, connecting information theory to the foundations of statistical mechanics and thermodynamics. The framework we built from axioms (Lectures 1-7) is not just mathematically consistent—it’s thermodynamically fundamental.

Landauer’s Principle

One of the most fundamental results connecting information and energy is Landauer’s principle. The inaccessible game allows us to derive it from first principles.

Landauer’s Principle from the Inaccessible Game


The GENERIC-like structure and energy-entropy equivalence provide a natural framework for deriving Landauer’s principle (Landauer, 1961), which states that erasing information requires dissipating at least \(k_BT\log 2\) of energy per bit.

Information Erasure as a Process

Consider erasing one bit: a memory variable \(x_i \in \{0,1\}\) is reset to a standard state (say \(x_i = 0\)), destroying the stored information. From an ensemble perspective—considering many such erasure operations where the initial value is random—the marginal entropy of this variable decreases: \[ \Delta h(X_i) = -\log 2. \]

Conservation Requires Redistribution

For a system obeying \(\sum_i h_i = C\), this decrease must be compensated by increases elsewhere: \[ \sum_{j \neq i} \Delta h(X_j) = +\log 2. \]

The antisymmetric (conservative) part \(A\) of our GENERIC dynamics preserves both \(H\) and \(\sum_i h_i\). It can only shuffle entropy reversibly between variables. But such reversible redistribution doesn’t truly erase the information—it merely moves it to other variables, from which it could in principle be recovered.

True irreversible erasure requires increasing the total joint entropy \(H\) (second law) while maintaining \(\sum_i h_i = C\). Since \(I = \sum_i h_i - H\), this means decreasing multi-information: \[ \Delta I < 0. \]

This reduction of correlations is precisely what the dissipative part \(S\) achieves. It increases \(H\) through entropy production while the constraint forces redistribution of marginal entropies. The erasure process thus necessarily involves the dissipative dynamics, not just conservative reshuffling.

Energy Cost from Energy-Entropy Equivalence

In the thermodynamic limit with energy-entropy equivalence (Section 5 of the paper), the gradients \(\nabla(\sum_i h_i)\) and \(\nabla E\) become parallel along the order-parameter direction. Near thermal equilibrium at inverse temperature \(\beta = \tfrac{1}{k_BT}\), this implies: \[ \beta \langle E \rangle \approx \sum_i h_i + \text{const}. \]

Therefore, erasing one bit requires: \[ \Delta(\beta \langle E \rangle) \approx \Delta h(X_i) = -\log 2, \] giving an energy change: \[ \Delta \langle E \rangle \approx -\frac{\log 2}{\beta} = -k_BT \log 2. \]

Dissipation Bound

Since the system must dissipate this energy via the symmetric part \(S\), and entropy production is non-negative, we obtain Landauer’s bound \[ Q_{\text{dissipated}} \geq k_BT\log 2. \]
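As a concrete number, the bound at room temperature works out to a few zeptojoules per erased bit:

```python
import numpy as np

k_B = 1.380649e-23  # Boltzmann constant, J/K (exact in the 2019 SI)
T = 300.0           # room temperature, K

Q_min = k_B * T * np.log(2)      # Landauer bound per erased bit
print(f"Q_min = {Q_min:.3e} J")  # ~ 2.871e-21 J
```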

This derivation shows that Landauer’s principle emerges from:

  1. Marginal entropy conservation \(\sum_i h_i = C\)
  2. GENERIC-like structure distinguishing conservative redistribution (\(A\)) from dissipative entropy production (\(S\))
  3. Energy-entropy equivalence in the thermodynamic limit

The insight is that erasure requires both redistributing marginal entropy (to maintain the constraint) and increasing total entropy \(H\) (second law), which necessarily reduces multi-information \(I\) and invokes dissipation.

The information-theoretic constraint provides the foundation, with thermodynamic energy appearing as its dual in the large-system limit. This reverses the usual derivation where information bounds follow from thermodynamics — here thermodynamic bounds follow from information theory.


Digital memory can be viewed as a communication channel through time: storing a bit is equivalent to transmitting information to a future moment. This perspective immediately suggests a connection between Landauer’s erasure principle and Shannon’s channel capacity: both are about maintaining reliable information against thermal noise.

The Landauer limit (Landauer, 1961) is the minimum amount of heat energy that is dissipated when a bit of information is erased. Conceptually it is the potential energy associated with holding a bit at an identifiable single value that is distinguishable from the background thermal noise (represented by temperature).

The Gaussian channel capacity (Shannon, 1948) represents how identifiable a signal \(S\) is relative to the background noise \(N\). Here we briefly explore the potential relationship between these two quantities.

When we store a bit in memory, we maintain a signal that can be reliably distinguished from thermal noise, just as in a communication channel. This suggests that Landauer’s limit for erasure of one bit of information, \(E_{min} = k_BT\), and Shannon’s Gaussian channel capacity, \[ C = \frac{1}{2}\log_2\left(1 + \frac{S}{N}\right), \] might be different views of the same limit.

Landauer’s limit states that erasing one bit of information requires a minimum energy of \(E_{\text{min}} = k_BT\). For a communication channel operating over time \(1/B\), the signal power \(S = EB\) and noise power \(N = k_BTB\). This gives us: \[ C = \frac{1}{2}\log_2\left(1 + \frac{S}{N}\right) = \frac{1}{2}\log_2\left(1 + \frac{E}{k_BT}\right) \] where the bandwidth B cancels out in the ratio.

When we operate at Landauer’s limit, setting \(E = k_BT\), we get a signal-to-noise ratio of exactly 1: \[ \frac{S}{N} = \frac{E}{k_BT} = 1 \] This yields a channel capacity of exactly half a bit per second, \[ C = \frac{1}{2}\log_2(2) = \frac{1}{2} \text{ bit/s} \]
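The arithmetic above is easy to check; a small sketch following the text’s convention \(E_{\text{min}} = k_BT\) (so the signal-to-noise ratio is \(E/k_BT\)):

```python
import numpy as np

def gaussian_capacity_bits(E, kT):
    """Shannon capacity per sample at signal-to-noise ratio E / kT."""
    return 0.5 * np.log2(1.0 + E / kT)

# Operating at the text's Landauer-style energy E = k_B T gives SNR = 1
print(gaussian_capacity_bits(1.0, 1.0))    # 0.5

# Far above the noise floor, capacity grows only logarithmically in energy
print(gaussian_capacity_bits(1000.0, 1.0))
```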

The factor of 1/2 appears in Shannon’s formula because of Nyquist’s theorem - we need two samples per cycle at bandwidth B to represent a signal. The bandwidth \(B\) appears in both signal and noise power but cancels in their ratio, showing how Landauer’s energy-per-bit limit connects to Shannon’s bits-per-second capacity.

This connection suggests that Landauer’s limit may correspond to the energy needed to establish a signal-to-noise ratio sufficient to transmit one bit of information per second. The temperature \(T\) may set both the minimum energy scale for information erasure and the noise floor for information transmission.

Implications for Information Engines

This connection suggests that the fundamental limits on information processing may arise from the need to maintain signals above the thermal noise floor. Whether we’re erasing information (Landauer) or transmitting it (Shannon), we need to overcome the same fundamental noise threshold set by temperature.

This perspective suggests that both memory operations (erasure) and communication operations (transmission) are limited by the same physical principles. The temperature \(T\) emerges as a fundamental parameter that sets the scale for both energy requirements and information capacity.

The connection between Landauer’s limit and Shannon’s channel capacity is intriguing but remains speculative. For Landauer’s original work see Landauer (1961), for Bennett’s review and developments see Bennett (1982), and for a more recent overview connecting to developments in non-equilibrium thermodynamics see Parrondo et al. (2015).

Implications for Intelligence

Why Superintelligence is Like Perpetual Motion

Superintelligence as Perpetual Motion


Claims about imminent superintelligence or artificial general intelligence that will recursively self-improve to unbounded capability bear a striking resemblance to promises of perpetual motion machines. Both violate fundamental physical constraints.

The Thermodynamic Constraint

Perpetual motion machines fail because they violate the second law of thermodynamics. You cannot extract unlimited work from a finite system without an external energy source. Entropy must increase, energy must be conserved, and there are fundamental limits on efficiency set by temperature.

These aren’t engineering challenges to overcome with better designs, they’re fundamental constraints built into the structure of physical law.

Similarly, unbounded intelligence fails because it would require unbounded information processing. The inaccessible game shows that information processing has thermodynamic costs through:

  1. Landauer’s principle: Information erasure costs \(k_BT\log 2\) per bit
  2. Marginal entropy conservation: Information cannot be created from nothing
  3. Fisher information bounds: Information channel capacity is finite
  4. GENERIC structure: Any real process has dissipative components

The Recursive Self-Improvement Fallacy

The superintelligence narrative often invokes “recursive self-improvement”: an AI that makes itself smarter, which makes it better at making itself smarter, leading to explosive growth. This is supposed to lead to capabilities that far exceed human intelligence.

But this violates conservation of information. To “improve” requires:

  • Learning: Extracting information from environment (limited by Fisher information)
  • Memory: Storing that information (limited by physical substrate)
  • Computation: Processing information (limited by Landauer and thermodynamics)
  • Erasure: Clearing memory for new information (dissipates energy)

Each step has information-theoretic costs. You cannot recursively self-improve without limit any more than you can build a perpetual motion machine by clever arrangement of gears.

Embodiment as Thermodynamic Necessity

The inaccessible game reveals why embodiment—physical constraints—is not a limitation to be overcome but a necessary feature of any information-processing system.

The Fisher information matrix \(G(\boldsymbol{\theta})\) defines the information topography. It determines:

  • How fast information can flow
  • What channels are available
  • What bottlenecks exist
  • How much energy is needed

This topography is shaped by the physical implementation. A biological brain has different \(G\) than a silicon chip, which has different \(G\) than a quantum computer. Each physical substrate creates its own information landscape with its own constraints.

Promises of “uploading” consciousness or achieving superintelligence by removing physical constraints misunderstand the relationship between information and physics. Information processing is physical. The constraints aren’t bugs, they’re features that make information processing possible at all.

Why the Hype Persists

If the constraints are so fundamental, why do smart people keep claiming superintelligence is just around the corner? Several reasons.

  1. Confusing capability with intelligence: Current AI systems can do impressive things, but that doesn’t mean they’re on a path to unbounded capability
  2. Ignoring thermodynamic costs: Information processing seems “free” compared to mechanical work, but Landauer’s principle shows it has real energy costs
  3. Mistaking scaling for fundamental progress: Making systems bigger isn’t the same as removing fundamental constraints
  4. Economic incentives: Billions of dollars flow toward exciting promises

Just as perpetual motion machines attracted investors in the 19th and early 20th centuries, superintelligence claims attract billions today. But the fundamental constraints haven’t changed. The argument of this work is that information theory provides as firm a bound on intelligence as thermodynamics provides on engines.

The Limits of Enhancement

Transhumanism


Some researchers talk about transhumanism, releasing us from our own limitations, gaining the bandwidth of the computer. Who wouldn’t want the equivalent of billions of dollars of communication that a computer has?

But what if that would destroy the very nature of what it is to be human. What if we are defined by our limitations. What if our consciousness is born out of a need to understand and be understood by others? What if that thing that we value the most is a side effect of our limitations?

AI is a technology, it is not a human being. It doesn’t worry it is being misunderstood, it doesn’t hate us, it doesn’t love us, it doesn’t even have an opinion about us.

Any sense that it does is in that little internal model we have as we anthropomorphize it. AI doesn’t stand for anthropomorphic intelligence, it stands for artificial intelligence. Artificial in the way a plastic plant is artificial.

Of course, like any technology, that doesn’t mean it’s without its dangers. Technological advance has always brought social challenges and likely always will, but if we are to face those challenges head on, we need to acknowledge the difference between our intelligence and that which we create in our computers.

Your cat understands you better than your computer does, your cat understands me better than your computer does, and it hasn’t even met me!

Our lives are defined by our desperate need to be understood: art, music, literature, dance, sport. So many creative ways to try and communicate who we are or what we feel. The computer has no need for this.

When you hear the term Artificial Intelligence, remember that’s precisely what it is. Artificial, like that plastic plant. A plastic plant is convenient, it doesn’t need watering, it doesn’t need to be put in the light, it won’t wilt if you don’t attend to it, and it won’t outgrow the place you put it.

A plastic plant will do some of the jobs that a real plant does, but it isn’t a proper replacement, and never will be. So, it is with our artificial intelligences.

I believe that our fascination with AI is actually a projected fascination with ourselves. A sort of technological narcissism. One of the reasons that the next generation of artificial intelligence solutions excites me is because I think it will lead to a much better understanding of our own intelligence.

But with our self-idolization comes an Icarian fear of what it might mean to emulate those characteristics that we perceive as uniquely human.

Our fears of AI singularities and superintelligences come from a confused mixing of what we create and what we are.

Do not fool yourselves into thinking these computers are the same thing as us, they never will be. We are a consequence of our limitations, just as Bauby was defined by his. Or maybe limitations is the wrong word, as Bauby described there are always moments when we can explore our inner self and escape into our own imagination:

My cocoon becomes less oppressive, and my mind takes flight like a butterfly. There is so much to do. You can wander off in space or in time, set out for Tierra del Fuego or King Midas’s court. You can visit the woman you love, slide down beside her and stroke her still-sleeping face. You can build castles in Spain, steal the golden fleece, discover Atlantis, realize your childhood dreams and adult ambitions.
Enough rambling. My main task now is to compose the first of those bedridden travel notes so that I shall be ready when my publisher’s emissary arrives to take my dictation, letter by letter. In my head I churn over every sentence ten times, delete a word, add an adjective, and learn my text by heart, paragraph by paragraph.

The flower that is this book, that is this fight, can never bud from an artificial plant.

Conclusions

The inaccessible game provides an information-theoretic foundation for understanding physical systems and, by extension, intelligent systems. Starting from four axioms (three from Baez characterizing information loss, and a fourth imposing information isolation) we derive the GENERIC-like structure of the dynamics, the equivalence of energy and entropy gradients in the thermodynamic limit, and Landauer’s principle.

This reverses the usual logic where information bounds follow from thermodynamics. Here, thermodynamic structure emerges from information-theoretic principles. This suggests Wheeler’s “it from bit” vision may be realizable: physical laws emerging from information-theoretic constraints.

For intelligence, the message is clear: just as no clever arrangement of gears can create a perpetual motion machine, no clever arrangement of algorithms can create unbounded superintelligence. The constraints are fundamental, built into the structure of information itself.

David MacKay’s Legacy

David taught us to ask: “What are the fundamental constraints? What do the numbers actually say?” This work aspires to follow in that tradition. By starting with information-theoretic axioms and deriving physical structure, we can rigorously understand why certain promises, whether perpetual motion or superintelligence, are impossible.

I hope that David would have appreciated both the mathematical structure and its application to deflating hype. His legacy continues in work that uses careful reasoning to illuminate real constraints, helping us distinguish transformative but bounded progress from impossible dreams.

Open Questions

Many questions remain open:

  1. Can we prove that exponential families are necessary, not just convenient?
  2. What is the initial state of the inaccessible game (the origin where \(H=0\))?
  3. Under what conditions does the Jacobi identity hold globally?
  4. Can this framework extend to quantum systems?
  5. What are the implications for understanding biological intelligence?

These questions point toward future work connecting information theory, physics, and the nature of intelligence.

Information Engines: Intelligence as Energy Efficiency


The entropy game shows some parallels between thermodynamics and measurement. This allows us to imagine information engines, simple systems that convert information to energy. This is our first simple model of intelligence.

Measurement as a Thermodynamic Process: Information-Modified Second Law

The second law of thermodynamics was generalised to include the effect of measurement by Sagawa and Ueda (Sagawa and Ueda, 2008). They showed that the maximum extractable work from a system can be increased by \(k_BTI(X;M)\) where \(k_B\) is Boltzmann’s constant, \(T\) is temperature and \(I(X;M)\) is the information gained by making a measurement, \(M\), \[ I(X;M) = \sum_{x,m} \rho(x,m) \log \frac{\rho(x,m)}{\rho(x)\rho(m)}, \] where \(\rho(x,m)\) is the joint probability of the system and measurement (see e.g. eq 14 in Sagawa and Ueda (2008)). This can be written as \[ W_\text{ext} \leq - \Delta\mathcal{F} + k_BTI(X;M), \] where \(W_\text{ext}\) is the extractable work and it is upper bounded by the negative change in free energy, \(\Delta \mathcal{F}\), plus the energy gained from measurement, \(k_BTI(X;M)\). This is the information-modified second law.

Measurement can be seen as a thermodynamic process. In theory measurement, like computation, is reversible. In practice the process of measurement is likely to erode the free energy somewhat, but as long as the energy gained from information, \(k_BTI(X;M)\), is greater than that spent in measurement the process can be thermodynamically efficient.

The modified second law shows that the maximum additional extractable work is proportional to the information gained. So information acquisition creates extractable work potential. Thermodynamic consistency is maintained by properly accounting for information-entropy relationships.

Efficacy of Feedback Control

Sagawa and Ueda extended this relationship to provide a generalised Jarzynski equality for feedback processes (Sagawa and Ueda, 2010). The Jarzynski equality is an important result from nonequilibrium thermodynamics that relates the average work done across an ensemble to the free energy difference between initial and final states (Jarzynski, 1997), \[ \left\langle \exp\left(-\frac{W}{k_B T}\right) \right\rangle = \exp\left(-\frac{\Delta\mathcal{F}}{k_BT}\right), \] where \(W\) is the work done along an individual trajectory, the angle brackets denote an average across an ensemble of trajectories, \(\Delta\mathcal{F}\) is the change in free energy and \(k_B\) is Boltzmann’s constant. Sagawa and Ueda extended this equality to include information gain from measurement (Sagawa and Ueda, 2010), \[ \left\langle \exp\left(-\frac{W}{k_B T}\right) \exp\left(\frac{\Delta\mathcal{F}}{k_BT}\right) \exp\left(-\mathcal{I}(X;M)\right)\right\rangle = 1, \] where \(\mathcal{I}(X;M) = \log \frac{\rho(X|M)}{\rho(X)}\) is the information gain from measurement, and the mutual information is recovered \(I(X;M) = \left\langle \mathcal{I}(X;M) \right\rangle\) as the average information gain.
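The original equality can be checked numerically. A minimal sketch in natural units (\(k_BT = 1\)), assuming a Gaussian work distribution, for which Jarzynski's equality forces the mean work to exceed \(\Delta\mathcal{F}\) by the dissipated part \(\beta\sigma^2/2\):

```python
import numpy as np

rng = np.random.default_rng(0)
beta = 1.0          # 1 / (k_B T), natural units (assumption of the sketch)
dF = 2.0            # free-energy difference between end states
sigma = 1.5         # spread of the work distribution

# For a Gaussian work distribution, Jarzynski's equality pins the mean
# work at dF + beta * sigma^2 / 2 (the second law plus a fluctuation term).
W = rng.normal(loc=dF + beta * sigma**2 / 2, scale=sigma, size=2_000_000)

lhs = np.mean(np.exp(-beta * W))   # ensemble average of exp(-W / k_B T)
rhs = np.exp(-beta * dF)           # exp(-dF / k_B T)
print(lhs, rhs)                    # agree to within sampling error
```

Note that \(\langle W \rangle > \Delta\mathcal{F}\) here, consistent with the second law: the equality constrains the full distribution of work, not just its mean.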

Sagawa and Ueda also introduce an efficacy term, \(\gamma\), that captures the effect of feedback on the system. In the presence of feedback they note that \[ \left\langle \exp\left(-\frac{W}{k_B T}\right) \exp\left(\frac{\Delta\mathcal{F}}{k_BT}\right)\right\rangle = \gamma, \] where \(\gamma\) is the efficacy.
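The generalised equality can be sanity-checked on a noisy Szilard engine, a standard textbook example rather than the specific setup of the paper. With measurement error \(\epsilon\) and a feedback protocol matched to the posterior \(\rho(x|m)\), the work extracted per realisation is \(k_BT\log(2\rho(x|m))\), and the product inside the angle brackets equals one for every trajectory:

```python
import numpy as np

rng = np.random.default_rng(1)
eps = 0.1           # measurement error probability (illustrative)
N = 1_000_000

# Szilard-type engine: x is the particle's side, m a noisy reading of it.
x = rng.integers(0, 2, size=N)
flip = rng.random(N) < eps
m = np.where(flip, 1 - x, x)

# Posterior rho(x|m): 1 - eps if the reading was right, eps if wrong.
post = np.where(m == x, 1 - eps, eps)

# Feedback matched to the posterior extracts k_B T ln(2 rho(x|m)); with
# Delta F = 0 over the cycle, the work done ON the system in beta*W units is:
betaW = -np.log(2 * post)

# Stochastic information gain I = ln[rho(x|m) / rho(x)], with rho(x) = 1/2.
info = np.log(post / 0.5)

print(np.mean(np.exp(-betaW - info)))   # generalised Jarzynski: exactly 1
print(np.mean(np.exp(-betaW)))          # efficacy gamma, estimated
print(2 * ((1 - eps)**2 + eps**2))      # closed form for this protocol
```

For this particular protocol the efficacy works out to \(\gamma = 2[(1-\epsilon)^2 + \epsilon^2]\), which exceeds one whenever the measurement carries information (\(\epsilon \neq 1/2\)).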

Channel Coding Perspective on Memory

When viewing \(M\) as an information channel between past and future states, Shannon’s channel coding theorems apply (Shannon, 1948). The channel capacity \(C\) represents the maximum rate of reliable information transmission, \[ C = \max_{\rho(M)} I(X_1;M), \] and for a memory of \(n\) bits we have \[ C \leq n, \] as the mutual information is upper bounded by the entropy of \(\rho(M)\), which is at most \(n\) bits.
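As an illustration, treating each bit of an \(n\)-bit memory as an independent binary symmetric channel with flip probability \(\epsilon\) (an assumption made for this sketch) gives a capacity strictly below the \(n\)-bit ceiling:

```python
import numpy as np

def h2(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * np.log2(p) - (1 - p) * np.log2(1 - p)

# An n-bit memory modelled as n independent binary symmetric channels,
# each bit flipped with probability eps between writing and reading.
n, eps = 8, 0.05
C = n * (1 - h2(eps))
print(C)   # below n: noise pushes capacity under the n-bit ceiling
```

Only a noiseless memory (\(\epsilon = 0\)) saturates the bound \(C = n\).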

This relationship seems to align with Ashby’s Law of Requisite Variety (pg 229 Ashby (1952)), which states that a control system must have at least as much ‘variety’ as the system it aims to control. In the context of memory systems, this means that to maintain temporal correlations effectively, the memory’s state space must be at least as large as the information content it needs to preserve. This provides a lower bound on the necessary memory capacity that complements the bound we get from Shannon for channel capacity.

This helps determine the required memory size for maintaining temporal correlations, optimal coding strategies, and fundamental limits on temporal correlation preservation.

Decomposition into Past and Future

Model Approximations and Thermodynamic Efficiency

Intelligent systems must balance measurement against energy efficiency and time requirements. A perfect model of the world would require infinite computational resources and speed, so approximations are necessary. This leads to uncertainties. Thermodynamics might be thought of as the physics of uncertainty: at equilibrium, thermodynamic systems find states that minimize free energy, equivalent to maximising entropy.

Markov Blanket

To introduce some structure to the model assumption, we split \(X\) into \(X_0\) and \(X_1\), where \(X_0\) is the past and present of the system and \(X_1\) is the future. The conditional mutual information \(I(X_0;X_1|M)\) is zero if \(X_1\) and \(X_0\) are independent when conditioned on \(M\).
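This conditional independence can be checked numerically. A small sketch using an invented three-variable joint distribution with the Markov structure \(X_0 \to M \to X_1\), for which \(I(X_0;X_1|M)\) vanishes:

```python
import numpy as np

# Joint over (x0, m, x1) built with the Markov structure x0 -> m -> x1,
# so the future x1 depends on the past only through the memory m.
# (Probabilities are illustrative.)
p_x0 = np.array([0.6, 0.4])
p_m_given_x0 = np.array([[0.9, 0.1], [0.2, 0.8]])
p_x1_given_m = np.array([[0.7, 0.3], [0.1, 0.9]])

joint = np.einsum('i,ij,jk->ijk', p_x0, p_m_given_x0, p_x1_given_m)

def cond_mi(p):
    """I(X0;X1|M) for a joint array indexed (x0, m, x1), in nats."""
    total = 0.0
    for j in range(p.shape[1]):
        p_m = p[:, j, :].sum()
        for i in range(p.shape[0]):
            for k in range(p.shape[2]):
                pijk = p[i, j, k]
                if pijk > 0:
                    total += pijk * np.log(pijk * p_m /
                                           (p[i, j, :].sum() * p[:, j, k].sum()))
    return total

print(cond_mi(joint))  # ~0: M screens the past off from the future
```

Breaking the Markov structure, e.g. by letting \(X_1\) depend directly on \(X_0\), makes the quantity strictly positive, which is what makes \(I(X_0;X_1|M)\) a useful diagnostic for whether \(M\) is an adequate memory.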

At What Scales Does this Apply?

The equipartition theorem tells us that at equilibrium the average energy is \(k_BT/2\) per degree of freedom. This means that for systems that operate at “human scale” the energy involved is many orders of magnitude larger than the amount of information we can store in memory. For a car engine producing 70 kW of power at 370 Kelvin, this implies \[ \frac{2 \times 70,000}{370 \times k_B} = \frac{2 \times 70,000}{370 \times 1.380649 \times 10^{-23}} \approx 2.74 \times 10^{25} \] degrees of freedom per second. If we make a conservative assumption of one bit per degree of freedom, then the mutual information we would require in one second for comparative energy production would be around 3,400 zettabytes, implying a memory bandwidth of 3,400 zettabytes per second. In 2025 the estimate of all the data in the world stands at 149 zettabytes.
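The arithmetic can be reproduced directly:

```python
# Degrees of freedom per second needed to match a 70 kW engine at 370 K
# at k_B T / 2 of energy per degree of freedom (equipartition).
k_B = 1.380649e-23   # Boltzmann's constant, J/K
power, T = 70_000.0, 370.0

dof_per_s = 2 * power / (T * k_B)
bits = dof_per_s                  # conservative: one bit per degree of freedom
zettabytes = bits / 8 / 1e21
print(f"{dof_per_s:.2e} dof/s -> {zettabytes:.0f} ZB/s of memory bandwidth")
```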

Small-Scale Biochemical Systems and Information Processing

While macroscopic systems operate in regimes where traditional thermodynamics dominates, microscopic biological systems operate at scales where information and thermal fluctuations become critically important. Here we examine how the framework applies to molecular machines and processes that have evolved to operate efficiently at these scales.

Molecular machines like ATP synthase, kinesin motors, and the photosynthetic apparatus can be viewed as sophisticated information engines that convert energy while processing information about their environment. These systems have evolved to exploit thermal fluctuations rather than fight against them, using information processing to extract useful work.

ATP Synthase: Nature’s Rotary Engine

ATP synthase functions as a rotary molecular motor that synthesizes ATP from ADP and inorganic phosphate using a proton gradient. The system uses the proton gradient as both an energy source and an information source about the cell’s energetic state and exploits Brownian motion through a ratchet mechanism. It converts information about proton locations into mechanical rotation and ultimately chemical energy with approximately 3-4 protons required per ATP.

Estimates suggest that one synapse firing may require \(10^4\) ATP molecules, so around \(4 \times 10^4\) protons. If we take the human brain as containing around \(10^{14}\) synapses, and if we suggest each synapse only fires about once every five seconds, we would require approximately \(10^{18}\) protons per second to power the synapses in our brain. With each proton having six degrees of freedom, under these rough calculations the memory capacity distributed across the ATP synthase in our brain must be of order \(6 \times 10^{18}\) bits per second, or 750 petabytes of information per second. Of course this memory capacity would be distributed across billions of neurons, each containing hundreds or thousands of mitochondria, each of which can contain thousands of ATP synthase molecules. By composition of extremely small systems we can see it’s possible to improve efficiencies in ways that seem very impractical for a car engine.
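The chain of estimates, with the rounding used in the text, is:

```python
# Rough order-of-magnitude estimates from the text.
atp_per_spike = 1e4      # ATP molecules per synapse firing
protons_per_atp = 4      # protons per ATP (upper end of 3-4)
synapses = 1e14          # synapses in the human brain
firing_rate = 1 / 5      # firings per synapse per second

protons_exact = atp_per_spike * protons_per_atp * synapses * firing_rate
protons_per_s = 1e18     # the text rounds the product (8e17) up to ~1e18

bits_per_s = 6 * protons_per_s     # one bit per degree of freedom, 6 per proton
petabytes = bits_per_s / 8 / 1e15  # -> 750 PB/s
print(f"{protons_per_s:.0e} protons/s, {petabytes:.0f} PB/s")
```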

A quick note to clarify: here we’re referring to the information requirements for making our brain more energy efficient in its information processing, rather than the information processing capabilities of the neurons themselves!

Thanks!

For more information on these subjects and more you might want to check the following resources.

References

Ananthanarayanan, R., Esser, S.K., Simon, H.D., Modha, D.S., 2009. The cat is out of the bag: Cortical simulations with \(10^9\) neurons, \(10^{13}\) synapses, in: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis - SC ’09. https://doi.org/10.1145/1654059.1654124
Ashby, W.R., 1952. Design for a brain: The origin of adaptive behaviour. Chapman & Hall, London.
Baez, J.C., Fritz, T., Leinster, T., 2011. A characterization of entropy in terms of information loss. Entropy 13, 1945–1957. https://doi.org/10.3390/e13111945
Bennett, C.H., 1982. The thermodynamics of computation—a review. International Journal of Theoretical Physics 21, 905–940.
Beretta, G.P., 2020. The fourth law of thermodynamics: Steepest entropy ascent. Philosophical Transactions of the Royal Society A 378, 20190168. https://doi.org/10.1098/rsta.2019.0168
Conway, F., Siegelman, J., 2005. Dark hero of the information age: In search of Norbert Wiener, the father of cybernetics. Basic Books, New York.
Gardner, M., 1970. Mathematical games: The fantastic combinations of John Conway’s new solitaire game “life”. Scientific American 223, 120–123.
Grmela, M., Öttinger, H.C., 1997. Dynamics and thermodynamics of complex fluids. I. Development of a general formalism. Physical Review E 56, 6620–6632. https://doi.org/10.1103/PhysRevE.56.6620
Jarzynski, C., 1997. Nonequilibrium equality for free energy differences. Physical Review Letters 78, 2690–2693. https://doi.org/10.1103/PhysRevLett.78.2690
Landauer, R., 1961. Irreversibility and heat generation in the computing process. IBM Journal of Research and Development 5, 183–191. https://doi.org/10.1147/rd.53.0183
Lawrence, N.D., 2024. The atomic human: Understanding ourselves in the age of AI. Allen Lane.
Lawrence, N.D., 2017. Living together: Mind and machine intelligence. arXiv.
MacKay, D.J.C., 2008. Sustainable energy — without the hot air. UIT Cambridge, Cambridge, UK.
MacKay, D.J.C., 2003. Information theory, inference and learning algorithms. Cambridge University Press, Cambridge, U.K.
Moravec, H., 1999. Robot: Mere machine to transcendent mind. Oxford University Press, New York.
Onsager, L., 1931. Reciprocal relations in irreversible processes. I. Physical Review 37, 405–426. https://doi.org/10.1103/PhysRev.37.405
Öttinger, H.C., 2005. Beyond equilibrium thermodynamics. Wiley-Interscience, Hoboken, NJ. https://doi.org/10.1002/0471727903
Öttinger, H.C., Grmela, M., 1997. Dynamics and thermodynamics of complex fluids. II. Illustrations of a general formalism. Physical Review E 56, 6633–6655. https://doi.org/10.1103/PhysRevE.56.6633
Parrondo, J.M.R., Horowitz, J.M., Sagawa, T., 2015. Thermodynamics of information. Nature Physics 11, 131–139. https://doi.org/10.1038/nphys3230
Sagawa, T., Ueda, M., 2010. Generalized Jarzynski equality under nonequilibrium feedback control. Physical Review Letters 104, 090602. https://doi.org/10.1103/PhysRevLett.104.090602
Sagawa, T., Ueda, M., 2008. Second law of thermodynamics with discrete quantum feedback control. Physical Review Letters 100, 080403. https://doi.org/10.1103/PhysRevLett.100.080403
Sandberg, A., Bostrom, N., 2008. Whole brain emulation: A roadmap (Technical Report No. 2008-3). Future of Humanity Institute, Oxford University.
Shannon, C.E., 1948. A mathematical theory of communication. The Bell System Technical Journal 27, 379–423, 623–656. https://doi.org/10.1002/j.1538-7305.1948.tb01338.x
Watanabe, S., 1960. Information theoretical analysis of multivariate correlation. IBM Journal of Research and Development 4, 66–82. https://doi.org/10.1147/rd.41.0066
Ziegler, H., Wehrli, C., 1987. On a principle of maximal rate of entropy production. Journal of Non-Equilibrium Thermodynamics 12, 229. https://doi.org/10.1515/jnet.1987.12.3.229