
Information Topography

at DALI Sorrento Meeting on Apr 14, 2025
Neil D. Lawrence

Abstract

Physical landscapes are shaped by elevation, valleys, and peaks. We might expect information landscapes to be molded by entropy, precision, and capacity constraints. To explore how these ideas might manifest, we introduce Jaynes’ world, an entropy game that maximises instantaneous entropy production.

In this talk we’ll argue that this landscape has a precision/capacity trade-off that suggests the underlying configuration requires a density matrix representation.

Jaynes’ World


This game explores how structure, time, causality, and locality might emerge within a system governed solely by internal information-theoretic constraints. The hope is that it can serve as

  • A research framework for observer-free dynamics and entropy-based emergence,
  • A conceptual tool for exploring the notion of an information topography: A landscape in which information flows under constraints.

Definitions and Global Constraints

System Structure

Let \(Z = \{Z_1, Z_2, \dots, Z_n\}\) be the full set of system variables. At game turn \(t\), define a partition in which \(X(t) \subseteq Z\) are the active variables (currently contributing to entropy) and \(M(t) = Z \setminus X(t)\) are the latent or frozen variables, stored in the form of an information reservoir (Barato and Seifert, 2014; Parrondo et al., 2015).

Representation via Density Matrix

We’ll argue that the configuration space must be represented by a density matrix, \[ \rho(\boldsymbol{\theta}) = \frac{1}{Z(\boldsymbol{\theta})} \exp\left( \sum_i \theta_i H_i \right), \] where \(\boldsymbol{\theta} \in \mathbb{R}^d\) are the natural parameters, each \(H_i\) is a Hermitian operator associated with the observables and the partition function is given by \(Z(\boldsymbol{\theta}) = \mathrm{Tr}[\exp(\sum_i \theta_i H_i)]\).

From this we can see that the log-partition function, which has an interpretation as the cumulant generating function, is \[ A(\boldsymbol{\theta}) = \log Z(\boldsymbol{\theta}) \] and the von Neumann entropy is \[ S(\boldsymbol{\theta}) = A(\boldsymbol{\theta}) - \boldsymbol{\theta}^\top \nabla A(\boldsymbol{\theta}). \] We can show that the Fisher Information Matrix is \[ G_{ij}(\boldsymbol{\theta}) = \frac{\partial^2 A}{\partial \theta_i \partial \theta_j}. \]
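
These quantities can be computed numerically for a small example. The sketch below is illustrative rather than taken from the talk: it assumes a single qubit with the Pauli matrices as the generators \(H_i\), computes \(A(\boldsymbol{\theta})\) and the von Neumann entropy by eigendecomposition, checks the identity \(S = A - \boldsymbol{\theta}^\top \nabla A\), and approximates \(G(\boldsymbol{\theta})\) by finite differences.

import numpy as np

# Pauli matrices as an illustrative choice of Hermitian generators H_i for a single qubit
H = [np.array([[0, 1], [1, 0]], dtype=complex),     # sigma_x
     np.array([[0, -1j], [1j, 0]], dtype=complex),  # sigma_y
     np.array([[1, 0], [0, -1]], dtype=complex)]    # sigma_z

def log_partition(theta):
    # A(theta) = log Tr exp(sum_i theta_i H_i), via the eigenvalues of the exponent
    K = sum(t * h for t, h in zip(theta, H))
    evals = np.linalg.eigvalsh(K)
    m = evals.max()
    return m + np.log(np.sum(np.exp(evals - m)))

def density_matrix(theta):
    # rho(theta) = exp(sum_i theta_i H_i) / Z(theta)
    K = sum(t * h for t, h in zip(theta, H))
    evals, evecs = np.linalg.eigh(K)
    w = np.exp(evals - evals.max())
    w /= w.sum()
    return (evecs * w) @ evecs.conj().T

def von_neumann_entropy(theta):
    # S = -tr(rho log rho), computed from the eigenvalues of rho
    evals = np.linalg.eigvalsh(density_matrix(theta))
    evals = evals[evals > 1e-12]
    return -np.sum(evals * np.log(evals))

def fisher_information(theta, h=1e-4):
    # G_ij = d^2 A / dtheta_i dtheta_j by central finite differences
    theta = np.array(theta, dtype=float)
    d = len(theta)
    I = np.eye(d)
    G = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            G[i, j] = (log_partition(theta + h * I[i] + h * I[j])
                       - log_partition(theta + h * I[i] - h * I[j])
                       - log_partition(theta - h * I[i] + h * I[j])
                       + log_partition(theta - h * I[i] - h * I[j])) / (4 * h**2)
    return G

theta = np.array([0.3, -0.2, 0.5])
A = log_partition(theta)
S = von_neumann_entropy(theta)
eps = 1e-5
grad_A = np.array([(log_partition(theta + eps * e) - log_partition(theta - eps * e)) / (2 * eps)
                   for e in np.eye(3)])
print(f"A(theta) = {A:.4f}")
print(f"S(theta) = {S:.4f}")
print(f"A - theta^T grad A = {A - theta @ grad_A:.4f}  (should match S)")
print("G(theta) =\n", np.round(fisher_information(theta), 4))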

Entropy Capacity and Resolution

We define our system to have a maximum entropy of \(N\) bits. If the dimension \(d\) of the parameter space is fixed, this implies a minimum detectable resolution in natural parameter space, \[ \varepsilon \sim \frac{1}{2^N}, \] where changes in natural parameters smaller than \(\varepsilon\) are treated as invisible by the system. As a result, system dynamics exhibit discrete, detectable transitions between distinguishable states.

Note if the dimension \(d\) scales with \(N\) (e.g., \(d = \alpha N\) for some constant \(\alpha\)), then the resolution constraint becomes more complex. In this case, the number of distinguishable states, \((1/\varepsilon)^d\), must equal \(2^N\), which leads to \(\varepsilon = 2^{-1/\alpha}\), a constant independent of \(N\). This suggests that as the system’s entropy capacity grows, it maintains a constant resolution while exponentially increasing the number of distinguishable states.

Dual Role of Parameters and Variables

Each variable \(Z_i\) is associated with a generator \(H_i\) and a natural parameter \(\theta_i\). When we say a parameter \(\theta_i \in X(t)\), we mean that the component of the system associated with \(H_i\) is active at time \(t\) and its parameter is evolving with \(|\dot{\theta}_i| \geq \varepsilon\). This comes from the duality between variables, observables, and natural parameters that we find in exponential family representations and that we also see in a density matrix representation.

Core Axiom: Entropic Dynamics

Our core axiom is that the system evolves by steepest ascent in entropy. The gradient of the entropy with respect to the natural parameters is given by \[ \nabla_{\boldsymbol{\theta}} S = -G(\boldsymbol{\theta}) \boldsymbol{\theta}, \] and so we set \[ \frac{\text{d}\boldsymbol{\theta}}{\text{d}t} = -G(\boldsymbol{\theta}) \boldsymbol{\theta}. \]
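
This follows from the entropy expression above: differentiating \(S(\boldsymbol{\theta}) = A(\boldsymbol{\theta}) - \boldsymbol{\theta}^\top \nabla A(\boldsymbol{\theta})\) term by term gives \[ \nabla_{\boldsymbol{\theta}} S = \nabla A - \nabla A - \left(\nabla \nabla^\top A\right) \boldsymbol{\theta} = -G(\boldsymbol{\theta}) \boldsymbol{\theta}. \]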

Histogram Game


To illustrate the concept of the Jaynes’ world entropy game we’ll run a simple example using a four bin histogram. The entropy of a four bin histogram can be computed as, \[ S(p) = - \sum_{i=1}^4 p_i \log_2 p_i. \]

import numpy as np

First we write some helper code to plot the histogram and compute its entropy.
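
The helper code itself is not reproduced here; a minimal sketch of what such helpers might look like is given below (the names plot_histogram and bin_entropy are illustrative, not from the original notebook).

import numpy as np
import matplotlib.pyplot as plt

def plot_histogram(p, title=None):
    # Bar chart of the histogram probabilities
    fig, ax = plt.subplots(figsize=(5, 3))
    ax.bar(np.arange(1, len(p) + 1), p)
    ax.set_xlabel("bin")
    ax.set_ylabel("probability")
    ax.set_ylim(0, 1)
    if title is not None:
        ax.set_title(title)
    return fig, ax

def bin_entropy(p):
    # Entropy in bits, ignoring empty bins
    p = np.asarray(p)
    p = p[p > 0]
    return -(p * np.log2(p)).sum()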

We can compute the entropy of any given histogram.


# Define probabilities
p = np.zeros(4)
p[0] = 4/13
p[1] = 3/13
p[2] = 3.7/13
p[3] = 1 - p.sum()

# Safe entropy calculation
nonzero_p = p[p > 0]  # Filter out zeros
entropy = - (nonzero_p*np.log2(nonzero_p)).sum()
print(f"The entropy of the histogram is {entropy:.3f}.")

Figure: The entropy of a four bin histogram.

We can play the entropy game by starting with a histogram with all the probability mass in the first bin and then ascending the gradient of the entropy function.

Two-Bin Histogram Example

The simplest possible example of Jaynes’ World is a two-bin histogram with probabilities \(p\) and \(1-p\). This minimal system allows us to visualize the entire entropy landscape.

The natural parameter is the log odds, \(\theta = \log\frac{p}{1-p}\), and the update given by the entropy gradient is \[ \Delta \theta_{\text{steepest}} = \eta \frac{\text{d}S}{\text{d}\theta} = \eta p(1-p)(\log(1-p) - \log p). \] The Fisher information is \[ G(\theta) = p(1-p). \] This creates a dynamic where, as \(p\) approaches either 0 or 1 (the minimal entropy states), the Fisher information approaches zero, creating a “critical slowing” effect. This critical slowing is what leads to the formation of information reservoirs. Note also that in the natural gradient the update is given by multiplying the gradient by the inverse Fisher information, which would lead to a more efficient update of the form, \[ \Delta \theta_{\text{natural}} = \eta(\log(1-p) - \log p), \] however, it is this efficiency that we want our game to avoid, because it is the inefficient behaviour in the region of saddle points that leads to critical slowing and the emergence of information reservoirs.
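
To make the contrast concrete, here is a small numerical comparison of the two updates (with \(\eta = 1\)): the steepest-ascent step collapses as \(p \to 0\) while the natural-gradient step keeps growing. The snippet is illustrative and not part of the original notebook.

import numpy as np

# Steepest-ascent versus natural-gradient updates near the low-entropy extreme
for p in [0.5, 0.1, 0.01, 0.001]:
    fisher = p * (1 - p)
    natural = np.log(1 - p) - np.log(p)
    steepest = fisher * natural
    print(f"p = {p:5.3f}: steepest update = {steepest:7.4f}, natural update = {natural:7.4f}")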

import numpy as np
# Python code for gradients
p_values = np.linspace(0.000001, 0.999999, 10000)
theta_values = np.log(p_values/(1-p_values))
entropy = -p_values * np.log(p_values) - (1-p_values) * np.log(1-p_values)
fisher_info = p_values * (1-p_values)
gradient = fisher_info * (np.log(1-p_values) - np.log(p_values))

Figure: Entropy gradients of the two bin histogram against position.

This example reveals the entropy extrema at \(p = 0\), \(p = 0.5\), and \(p = 1\). At minimal entropy (\(p \approx 0\) or \(p \approx 1\)), the gradient approaches zero, creating natural information reservoirs. The dynamics slow dramatically near these points; these are the areas of critical slowing that create information reservoirs.

Gradient Ascent in Natural Parameter Space

We can visualize the entropy maximization process by performing gradient ascent in the natural parameter space \(\theta\). Starting from a low-entropy state, we follow the gradient of entropy with respect to \(\theta\) to reach the maximum entropy state.
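
The gradient ascent code below relies on helper functions theta_to_p, entropy, and entropy_gradient that are defined elsewhere in the notebook but not shown here. A minimal sketch consistent with the two-bin formulas above (natural logarithms throughout) is:

import numpy as np

def theta_to_p(theta):
    # Map the natural parameter (log odds) to the probability of the first bin
    return 1.0 / (1.0 + np.exp(-theta))

def entropy(theta):
    # Two-bin entropy S = -p log p - (1 - p) log(1 - p), as a function of theta
    p = theta_to_p(theta)
    return -p * np.log(p) - (1 - p) * np.log(1 - p)

def entropy_gradient(theta):
    # dS/dtheta = p(1 - p)(log(1 - p) - log p)
    p = theta_to_p(theta)
    return p * (1 - p) * (np.log(1 - p) - np.log(p))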

import numpy as np
# Parameters for gradient ascent
theta_initial = -9.0  # Start with low entropy 
learning_rate = 1
num_steps = 1500

# Initialize
theta_current = theta_initial
theta_history = [theta_current]
p_history = [theta_to_p(theta_current)]
entropy_history = [entropy(theta_current)]

# Perform gradient ascent in theta space
for step in range(num_steps):
    # Compute gradient
    grad = entropy_gradient(theta_current)
    
    # Update theta
    theta_current = theta_current + learning_rate * grad
    
    # Store history
    theta_history.append(theta_current)
    p_history.append(theta_to_p(theta_current))
    entropy_history.append(entropy(theta_current))
    if step % 100 == 0:
        print(f"Step {step+1}: θ = {theta_current:.4f}, p = {p_history[-1]:.4f}, Entropy = {entropy_history[-1]:.4f}")

Figure: Evolution of the two-bin histogram during gradient ascent in natural parameter space.

Figure: Entropy evolution during gradient ascent for the two-bin histogram.

Figure: Gradient ascent trajectory in the natural parameter space for the two-bin histogram.

The gradient ascent visualization shows how the system evolves in the natural parameter space \(\theta\). Starting from a negative \(\theta\) (corresponding to a low-entropy state with \(p \ll 0.5\)), the system follows the gradient of entropy with respect to \(\theta\) until it reaches \(\theta = 0\) (corresponding to \(p = 0.5\)), which is the maximum entropy state.

Note that the maximum entropy occurs at \(\theta = 0\), which corresponds to \(p = 0.5\). The gradient of entropy with respect to \(\theta\) is zero at this point, making it a stable equilibrium for the gradient ascent process.

Four Bin Histogram Entropy Game


To play the entropy game with four bins we represent the histogram parameters as a vector of length 4, \(\boldsymbol{\lambda} = [\lambda_1, \lambda_2, \lambda_3, \lambda_4]\), and define the histogram probabilities to be \(p_i = \lambda_i^2 / \sum_{j=1}^4 \lambda_j^2\).

import numpy as np

We can then ascend the gradient of the entropy function. Starting at a parameter setting where the mass is placed in the first bin, we take \(\lambda_2 = \lambda_3 = \lambda_4 = 0.01\) and \(\lambda_1 = 100\).

First to check our code we compare our numerical and analytic gradients.
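
The functions entropy, entropy_gradient, and numerical_gradient used below are defined elsewhere in the notebook (they redefine the two-bin helpers for the four-bin parameterisation). A minimal sketch consistent with the parameterisation above, with entropy measured in bits and an analytic gradient derived from \(p_i = \lambda_i^2/\sum_j \lambda_j^2\), is:

import numpy as np

def histogram(lambdas):
    # Map the lambda parameters to histogram probabilities
    return lambdas**2 / np.sum(lambdas**2)

def entropy(lambdas):
    # Histogram entropy in bits
    p = histogram(lambdas)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def entropy_gradient(lambdas):
    # Analytic gradient: dS/dlambda_k = -(2 lambda_k / sum_j lambda_j^2)(log2 p_k + S)
    s = np.sum(lambdas**2)
    p = np.maximum(lambdas**2 / s, 1e-300)  # guard against log(0)
    S = -np.sum(p * np.log2(p))
    return -(2 * lambdas / s) * (np.log2(p) + S)

def numerical_gradient(f, x, h=1e-6):
    # Central finite-difference gradient, used to check the analytic gradient
    x = np.asarray(x, dtype=float)
    grad = np.zeros_like(x)
    for i in range(x.size):
        xp, xm = x.copy(), x.copy()
        xp[i] += h
        xm[i] -= h
        grad[i] = (f(xp) - f(xm)) / (2 * h)
    return grad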

import numpy as np
# Initial parameters (lambda)
initial_lambdas = np.array([100, 0.01, 0.01, 0.01])

# Gradient check
numerical_grad = numerical_gradient(entropy, initial_lambdas)
analytical_grad = entropy_gradient(initial_lambdas)
print("Numerical Gradient:", numerical_grad)
print("Analytical Gradient:", analytical_grad)
print("Gradient Difference:", np.linalg.norm(numerical_grad - analytical_grad))  # Check if close to zero

Now we can run the steepest ascent algorithm.

import numpy as np
# Steepest ascent algorithm
lambdas = initial_lambdas.copy()

learning_rate = 1
turns = 15000
entropy_values = []
lambdas_history = []

for _ in range(turns):
    grad = entropy_gradient(lambdas)
    lambdas += learning_rate * grad # update lambda for steepest ascent
    entropy_values.append(entropy(lambdas))
    lambdas_history.append(lambdas.copy())

We can plot the histogram at a set of chosen turn numbers to see the progress of the algorithm.

Figure: Intermediate stages of the histogram entropy game. After 0, 1000, 5000, 10000 and 15000 iterations.

And we can also plot the changing entropy as a function of the number of game turns.

Figure: Four bin histogram entropy game. The plot shows the increasing entropy against the number of turns across 15000 iterations of gradient ascent.

Note that the entropy starts at a saddle point, increases rapidly, and then levels off towards the maximum entropy, with the gradient decreasing slowly in the manner of Zeno’s paradox.

Constructed Quantities and Lemmas

Variable Partition

\[ X(t) = \left\{ i \mid \left| \frac{\text{d}\theta_i}{\text{d}t} \right| \geq \varepsilon \right\}, \quad M(t) = Z \setminus X(t) \]

Fisher Information Matrix Partitioning

We partition the Fisher Information Matrix \(G(\boldsymbol{\theta})\) according to the active variables \(X(t)\) and latent information reservoir \(M(t)\): \[ G(\boldsymbol{\theta}) = \begin{bmatrix} G_{XX} & G_{XM} \\ G_{MX} & G_{MM} \end{bmatrix} \] where \(G_{XX}\) represents the information geometry within active variables, \(G_{MM}\) within the latent reservoir, and \(G_{XM} = G_{MX}^\top\) captures the cross-coupling between active and latent components. This partitioning reveals how information flows between observable dynamics and the latent structure.
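
As a concrete (classical) illustration of this block structure, not part of the original derivation, one can compute the Fisher information of the four-bin categorical model used earlier and slice it into blocks; the split into active and latent indices below is arbitrary.

import numpy as np

def fisher_information(theta):
    # For the categorical exponential family, G = diag(p) - p p^T
    p = np.exp(theta - np.max(theta))
    p /= p.sum()
    return np.diag(p) - np.outer(p, p)

theta = np.array([1.0, -0.5, -0.2, -0.3])
G = fisher_information(theta)

X_idx = [0, 1]   # active variables X(t)
M_idx = [2, 3]   # latent reservoir M(t)
G_XX = G[np.ix_(X_idx, X_idx)]
G_XM = G[np.ix_(X_idx, M_idx)]
G_MM = G[np.ix_(M_idx, M_idx)]
print("G_XX:\n", G_XX)
print("G_XM:\n", G_XM)
print("G_MM:\n", G_MM)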

Lemma 1: Form of the Minimal Entropy Configuration

The minimal-entropy state compatible with the system’s resolution constraint and regularity condition is represented by a density matrix of the exponential form, \[ \rho(\boldsymbol{\theta}_o) = \frac{1}{Z(\boldsymbol{\theta}_o)} \exp\left( \sum_i \theta_{oi} H_i \right), \] where all components \(\theta_{oi}\) are sub-threshold, \[ |\dot{\theta}_{oi}| < \varepsilon. \] This state minimizes entropy under the constraint that it remains regular, continuous, and detectable only above a resolution scale \(\varepsilon\). Its structure can be derived via a minimum-entropy analogue of Jaynes’ formalism, using the same density matrix geometry but inverted optimization.

Lemma 2: Symmetry Breaking

If \(\theta_k \in M(t)\) and \(|\dot{\theta}_k| \geq \varepsilon\), then \[ \theta_k \in X(t + \delta t). \]

Four-Bin Saddle Point Example


To illustrate saddle points and information reservoirs, we need at least a 4-bin system. This creates a 3-dimensional parameter space where we can observe genuine saddle points.

Consider a 4-bin system parameterized by natural parameters \(\theta_1\), \(\theta_2\), and \(\theta_3\) (with one constraint). A saddle point occurs where the gradient \(\nabla_\theta S = 0\), but the Hessian has mixed eigenvalues - some positive, some negative.

At these points, the eigendecomposition of the Fisher information matrix \(G(\boldsymbol{\theta})\) reveals three regimes:

  • Fast modes: large positive eigenvalues → rapid evolution
  • Slow modes: small positive eigenvalues → gradual evolution
  • Critical modes: near-zero eigenvalues → information reservoirs

The eigenvectors of \(G(\theta)\) at the saddle point determine which parameter combinations form information reservoirs.
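
The exploration code below calls several helper functions (exponential_family_entropy, entropy_gradient, project_gradient, check_gradient, and gradient_ascent_four_bin) that are defined elsewhere in the notebook but not shown here; note that entropy_gradient is redefined here in terms of the natural parameters. A minimal sketch of one consistent implementation, assuming a four-bin categorical exponential family with a sum-to-zero constraint on \(\boldsymbol{\theta}\), is:

import numpy as np

def exponential_family_entropy(theta):
    # Entropy S = A(theta) - theta . grad A for the four-bin categorical model
    A = np.log(np.sum(np.exp(theta)))   # log-partition function
    p = np.exp(theta - A)               # p_i = exp(theta_i) / Z
    return A - theta @ p

def entropy_gradient(theta):
    # Unconstrained gradient dS/dtheta = -G(theta) theta = -p * (theta - theta . p)
    A = np.log(np.sum(np.exp(theta)))
    p = np.exp(theta - A)
    return -p * (theta - theta @ p)

def project_gradient(theta, grad):
    # Project onto the sum-to-zero constraint surface
    return grad - np.mean(grad)

def check_gradient(theta, h=1e-6):
    # Compare the analytical gradient with central finite differences
    analytical = entropy_gradient(theta)
    numerical = np.zeros_like(theta)
    for i in range(len(theta)):
        tp, tm = theta.copy(), theta.copy()
        tp[i] += h
        tm[i] -= h
        numerical[i] = (exponential_family_entropy(tp) - exponential_family_entropy(tm)) / (2 * h)
    print("Analytical gradient:", analytical)
    print("Numerical gradient: ", numerical)
    return analytical, numerical

def gradient_ascent_four_bin(theta_init, steps=100, learning_rate=1.0):
    # Projected gradient ascent on the entropy, keeping sum(theta) = 0
    theta = theta_init.copy()
    theta_history = [theta.copy()]
    entropy_history = [exponential_family_entropy(theta)]
    for _ in range(steps):
        grad = project_gradient(theta, entropy_gradient(theta))
        theta = theta + learning_rate * grad
        theta = theta - np.mean(theta)   # re-impose the constraint
        theta_history.append(theta.copy())
        entropy_history.append(exponential_family_entropy(theta))
    return np.array(theta_history), np.array(entropy_history)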

import numpy as np
# Test the gradient calculation
test_theta = np.array([0.5, -0.3, 0.1, -0.3])
test_theta = test_theta - np.mean(test_theta)  # Ensure constraint is satisfied
print("Testing gradient calculation:")
analytical_grad, numerical_grad = check_gradient(test_theta)

# Verify if we're ascending or descending
entropy_before = exponential_family_entropy(test_theta)
step_size = 0.01
test_theta_after = test_theta + step_size * analytical_grad
entropy_after = exponential_family_entropy(test_theta_after)
print(f"Entropy before step: {entropy_before}")
print(f"Entropy after step: {entropy_after}")
print(f"Change in entropy: {entropy_after - entropy_before}")
if entropy_after > entropy_before:
    print("We are ascending the entropy gradient")
else:
    print("We are descending the entropy gradient")
# Initialize with asymmetric distribution (away from saddle point)
theta_init = np.array([1.0, -0.5, -0.2, -0.3])
theta_init = theta_init - np.mean(theta_init)  # Ensure constraint is satisfied

# Run gradient ascent
theta_history, entropy_history = gradient_ascent_four_bin(theta_init, steps=100, learning_rate=1.0)

# Create a grid for visualization
x = np.linspace(-2, 2, 100)
y = np.linspace(-2, 2, 100)
X, Y = np.meshgrid(x, y)

# Compute entropy at each grid point (with constraint on theta3 and theta4)
Z = np.zeros_like(X)
for i in range(X.shape[0]):
    for j in range(X.shape[1]):
        # Create full theta vector with constraint that sum is zero
        theta1, theta2 = X[i,j], Y[i,j]
        theta3 = -0.5 * (theta1 + theta2)
        theta4 = -0.5 * (theta1 + theta2)
        theta = np.array([theta1, theta2, theta3, theta4])
        Z[i,j] = exponential_family_entropy(theta)

# Compute gradient field
dX = np.zeros_like(X)
dY = np.zeros_like(Y)
for i in range(X.shape[0]):
    for j in range(X.shape[1]):
        # Create full theta vector with constraint
        theta1, theta2 = X[i,j], Y[i,j]
        theta3 = -0.5 * (theta1 + theta2)
        theta4 = -0.5 * (theta1 + theta2)
        theta = np.array([theta1, theta2, theta3, theta4])
        
        # Get full gradient and project
        grad = entropy_gradient(theta)
        proj_grad = project_gradient(theta, grad)
        
        # Store first two components
        dX[i,j] = proj_grad[0]
        dY[i,j] = proj_grad[1]

# Normalize gradient vectors for better visualization
norm = np.sqrt(dX**2 + dY**2)
# Avoid division by zero
norm = np.where(norm < 1e-10, 1e-10, norm)
dX_norm = dX / norm
dY_norm = dY / norm

# A few gradient vectors for visualization
stride = 10

Figure: Visualisation of a saddle point projected down to two dimensions.

Figure: Entropy evolution during gradient ascent on the four-bin system.

The animation of system evolution would show initial rapid movement along high-eigenvalue directions, progressive slowing in directions with low eigenvalues and formation of information reservoirs in the critically slowed directions. Parameter-capacity uncertainty emerges naturally at the saddle point.

Entropy-Time

\[ \tau(t) := S_{X(t)}(t) \]

Lemma 3: Monotonicity of Entropy-Time

\[ \tau(t_2) \geq \tau(t_1) \quad \text{for all } t_2 > t_1 \]

Corollary: Irreversibility

\(\tau(t)\) increases monotonically, preventing time-reversal globally.

Conjecture: Frieden-Analogous Extremal Flow

At points where the latent-to-active flow functional is locally extremal, the system may exhibit critical slowing, where information reservoir variables evolve slowly relative to active variables. It may be possible to separate the system entropy into an active-variable term, \(I = S[\rho_X]\), and an “intrinsic information” term, \(J = S[\rho_{X|M}]\), allowing us to construct an information analogue of B. Roy Frieden’s extreme physical information (Frieden, 1998), which allows derivation of locally valid differential equations that depend on the information topography.

Thanks!

For more information on these subjects and more you might want to check the following resources.

Appendix

Variational Derivation of the Initial Curvature Structure


We will determine constraints on the Fisher Information Matrix \(G(\boldsymbol{\theta})\) that are consistent with the system’s unfolding rules and internal information geometry. We follow Jaynes (Jaynes, 1957) in solving a variational problem that captures the allowed structure of the system’s origin (minimal entropy) state.

Hirschman Jr (1957) established a connection between entropy and the Fourier transform, showing that the entropy of a function and its Fourier transform cannot both be arbitrarily small. This result, known as the Hirschman uncertainty principle, was later strengthened by Beckner (Beckner, 1975) who derived the optimal constant in the inequality. Białynicki-Birula and Mycielski (1975) extended these ideas to derive uncertainty relations for information entropy in wave mechanics.

From these results we know that there are fundamental limits to how we express the entropy of position and its conjugate space simultaneously. These limits inspire us to focus on the von Neumann entropy so that our system respects the Hirschman uncertainty principle.

A density matrix has the form \[ \rho(\boldsymbol{\theta}) = \frac{1}{Z(\boldsymbol{\theta})} \exp\left( \sum_i \theta_i H_i \right) \] where \(Z(\boldsymbol{\theta}) = \mathrm{tr}\left[\exp\left( \sum_i \theta_i H_i \right)\right]\) and \(\boldsymbol{\theta} \in \mathbb{R}^d\), \(H_i\) are Hermitian observables.

The von Neumann entropy is given by \[ S[\rho] = -\text{tr} (\rho \log \rho). \]

We now derive the minimal entropy configuration inspired by Jaynes’s free-form variational approach. This enables us to derive the form of the density matrix directly from information-theoretic constraints (Jaynes, 1963).

Jaynesian Derivation of Minimal Entropy Configuration


Jaynes suggested that statistical mechanics problems should be treated as problems of inference. Assign the probability distribution (or density matrix) that is maximally noncommittal with respect to missing information, subject to known constraints.

While Jaynes applied this idea to derive the maximum entropy configuration given constraints, here we adapt it to derive the minimum entropy configuration, under the assumption that the system starts close to zero entropy while its entropy is bounded above by a maximum of \(N\) bits.

Let \(\rho\) be a density matrix describing the state of a system. The von Neumann entropy is, \[ S[\rho] = -\mathrm{tr}(\rho \log \rho), \] we wish to minimize \(S[\rho]\), subject to constraints that encode the resolution bounds.

In the game we assume that the system begins in a state of minimal entropy, the state cannot be a delta function (no singularities, so it must obey a resolution constraint \(\varepsilon\)) and the entropy is bounded above by \(N\) bits: \(S[\rho] \leq N\).

We apply a variational principle where we minimise \[ S[\rho] = -\mathrm{tr}(\rho \log \rho) \] subject to constraints.

Constraints

  1. The first constraint is normalization, \(\mathrm{tr}(\rho) = 1\).

  2. The resolution constraint is motivated by the entropy being constrained to satisfy \[ S[\rho] \leq N, \] with the bound saturated only when the system is at maximum entropy. This implies that the system is finite in resolution. To reflect this we introduce a resolution constraint, \(\mathrm{tr}(\rho \hat{Z}^2) \geq \epsilon^2\), and/or \(\mathrm{tr}(\rho \hat{P}^2) \geq \delta^2\), along with other optional moment or dual-space constraints.

We introduce Lagrange multipliers \(\lambda_0\), \(\lambda_z\), \(\lambda_p\) for these constraints and define the Lagrangian \[ \mathcal{L}[\rho] = -\mathrm{tr}(\rho \log \rho) + \lambda_0 (\mathrm{tr}(\rho) - 1) - \lambda_z (\mathrm{tr}(\rho \hat{Z}^2) - \epsilon^2) - \lambda_p (\mathrm{tr}(\rho \hat{P}^2) - \delta^2). \]

To find the extremum, we take the functional derivative and set it to zero, \[ \frac{\delta \mathcal{L}}{\delta \rho} = -\log \rho - 1 - \lambda_z \hat{Z}^2 - \lambda_p \hat{P}^2 + \lambda_0 = 0, \] and solving for \(\rho\) gives \[ \rho = \frac{1}{Z} \exp\left(-\lambda_z \hat{Z}^2 - \lambda_p \hat{P}^2\right), \] where the partition function (which ensures normalisation) is \[ Z = \mathrm{tr}\left[\exp\left(-\lambda_z \hat{Z}^2 - \lambda_p \hat{P}^2\right)\right]. \] This is a Gaussian state for a density matrix, which is consistent with the minimum entropy distribution under uncertainty constraints.
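
Spelling out the intermediate step: the identity operator commutes with everything, so the constant can be factored out of the exponential, \[ \log \rho = (\lambda_0 - 1)\,\mathbb{I} - \lambda_z \hat{Z}^2 - \lambda_p \hat{P}^2 \quad\Rightarrow\quad \rho = e^{\lambda_0 - 1} \exp\left(-\lambda_z \hat{Z}^2 - \lambda_p \hat{P}^2\right), \] and the normalisation \(\mathrm{tr}(\rho) = 1\) then fixes \(e^{\lambda_0 - 1} = 1/Z\).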

The Lagrange multipliers \(\lambda_z, \lambda_p\) enforce lower bounds on variance. These define the natural parameters as \(\theta_z = -\lambda_z\) and \(\theta_p = -\lambda_p\) in the exponential family form \(\rho(\boldsymbol{\theta}) \propto \exp(\boldsymbol{\theta} \cdot \mathbf{H})\). The form of \(\rho\) is a density matrix. The curvature (second derivative) of \(\log Z(\boldsymbol{\theta})\) gives the Fisher Information matrix \(G(\boldsymbol{\theta})\). Steepest ascent trajectories in \(\boldsymbol{\theta}\) space will trace the system’s entropy dynamics.

Next we compute \(G(\boldsymbol{\theta})\) from \(\log Z(\boldsymbol{\theta})\) to explore the information geometry. From this we should verify that the following conditions hold, \[ \left| \left[G(\boldsymbol{\theta}) \boldsymbol{\theta}\right]_i \right| < \varepsilon \quad \text{for all } i \] which implies that all variables remain latent at initialization.

The Hermitian observables include a non-commuting pair, \[ [H_i, H_j] \neq 0, \] which is equivalent to an uncertainty relation, \[ \mathrm{Var}(H_i) \cdot \mathrm{Var}(H_j) \geq C > 0, \] and ensures that we have bounded curvature, \[ \mathrm{tr}(G(\boldsymbol{\theta})) \geq \gamma > 0. \]

We can then use \(\varepsilon\) and \(N\) to define initial thresholds and maximum resolution and examine how variables decouple and how saddle-like regions emerge as the landscape unfolds through gradient ascent.

This constrained minimization problem yields the structure of the initial density matrix \(\rho(\boldsymbol{\theta}_o)\) and the permissible curvature geometry \(G(\boldsymbol{\theta}_o)\) and a constraint-consistent basis of observables \(\{H_i\}\) that have a quadratic form. This ensures the system begins in a regular, latent, low-entropy state.

This is the configuration from which entropy ascent and symmetry-breaking transitions emerge.

References

Barato, A.C., Seifert, U., 2014. Stochastic thermodynamics with information reservoirs. Physical Review E 90, 042150. https://doi.org/10.1103/PhysRevE.90.042150
Beckner, W., 1975. Inequalities in Fourier analysis. Annals of Mathematics 159–182. https://doi.org/10.2307/1970980
Białynicki-Birula, I., Mycielski, J., 1975. Uncertainty relations for information entropy in wave mechanics. Communications in Mathematical Physics 44, 129–132. https://doi.org/10.1007/BF01608825
Frieden, B.R., 1998. Physics from fisher information: A unification. Cambridge University Press, Cambridge, UK. https://doi.org/10.1017/CBO9780511622670
Hirschman Jr, I.I., 1957. A note on entropy. American Journal of Mathematics 79, 152–156. https://doi.org/10.2307/2372390
Jaynes, E.T., 1963. Information theory and statistical mechanics, in: Ford, K.W. (Ed.), Brandeis University Summer Institute Lectures in Theoretical Physics, Vol. 3: Statistical Physics. W. A. Benjamin, Inc., New York, pp. 181–218.
Jaynes, E.T., 1957. Information theory and statistical mechanics. Physical Review 106, 620–630. https://doi.org/10.1103/PhysRev.106.620
Parrondo, J.M.R., Horowitz, J.M., Sagawa, T., 2015. Thermodynamics of information. Nature Physics 11, 131–139. https://doi.org/10.1038/nphys3230