Introduction

This talk bridges classical and quantum statistical mechanics through the lens familiar to machine learning practitioners: exponential families and information geometry.

Rather than starting from wavefunctions and postulates, we show how quantum mechanics emerges naturally when you extend probabilistic modeling to allow observables that don’t commute. The mathematics stays remarkably similar—you still have log-partition functions, Fisher-like metrics, and natural parameters—but you need one new computational tool (Duhamel calculus) to handle matrix exponentials.

From Classical to Quantum: The Exponential Family Bridge

From Classical to Quantum Exponential Families


If you know classical exponential families and information geometry, you already have the right conceptual framework for quantum statistical mechanics. The move to quantum is not a wholesale replacement; it is a minimal extension to handle one new feature: observables that do not commute.

This section maps the familiar classical objects to their quantum counterparts, highlights exactly what breaks (and why), and previews the computational fix (Duhamel calculus).

The classical exponential family (what you already know)

In classical statistics, an exponential family has the form \[ p(\mathbf{x};\theta) = \exp\!\big(\theta^\top \mathbf{F}(\mathbf{x}) - \psi(\theta)\big), \] where:

  - \(\mathbf{F}(\mathbf{x})\in\mathbb{R}^d\) is the vector of sufficient statistics,
  - \(\theta\in\mathbb{R}^d\) is the vector of natural parameters,
  - \(\psi(\theta)=\log\int e^{\theta^\top \mathbf{F}(\mathbf{x})}\,\text{d}\mu\) is the log-partition function (the normaliser).

Key properties you rely on:

  1. Expectations from \(\psi\): \(\mathbb{E}_\theta[F_i]=\partial_i\psi(\theta)\).
  2. Fisher information metric: \(G_{ij}(\theta)=\partial_i\partial_j\psi(\theta)=\text{Cov}_\theta(F_i,F_j)\).
  3. Geometry: \(G\) is the Riemannian metric on the parameter space; it controls local linear response and natural gradient descent.
  4. Differentiating the exponential: \(\partial_i e^{\theta^\top \mathbf{F}} = F_i e^{\theta^\top \mathbf{F}}\) (scalars commute).
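
As a quick sanity check of properties 1 and 2, here is a minimal numerical sketch with the Bernoulli family (illustrative only; assumes NumPy is available, and all names are hypothetical):

```python
# Bernoulli exponential family: F(x) = x on {0, 1}, psi(theta) = log(1 + e^theta).
import numpy as np

def psi(theta):
    return np.log(1.0 + np.exp(theta))

theta, eps = 0.7, 1e-5

# Property 1: E[F] = dpsi/dtheta.
grad_psi = (psi(theta + eps) - psi(theta - eps)) / (2 * eps)
p1 = np.exp(theta - psi(theta))              # P(X = 1) under p(x; theta)
print(grad_psi, p1)                          # both ~ 0.668 (sigmoid of 0.7)

# Property 2: Var[F] = d^2 psi / dtheta^2 (Fisher information).
hess_psi = (psi(theta + eps) - 2 * psi(theta) + psi(theta - eps)) / eps**2
print(hess_psi, p1 * (1 - p1))               # both ~ 0.222
```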

The quantum exponential family (same structure, different representation)

In quantum statistical mechanics, a state is represented by a density matrix \(\rho\), which is a positive semidefinite matrix with trace 1. Observables are Hermitian matrices \(A\), and the expectation is \(\langle A\rangle=\text{tr}(\rho A)\).

A quantum exponential family has exactly the same form: \[ \rho(\theta) = \exp\!\big(K(\theta) - \psi(\theta)\big), \] where:

  - \(K(\theta)=\sum_j \theta_j F_j\) is a Hermitian matrix built from observables \(F_j\),
  - \(\psi(\theta)=\log\text{tr}\,e^{K(\theta)}\) is the log-partition function.

The key properties carry over:

  1. Expectations from \(\psi\): \(\langle F_i\rangle=\text{tr}(\rho F_i)=\partial_i \psi(\theta)\).
  2. Fisher-like metric: \(G_{ij}(\theta)=\partial_i\partial_j\psi(\theta)\) is the Bogoliubov–Kubo–Mori (BKM) metric.
  3. Geometry: \(G\) is still a Riemannian metric governing local linear response.
  4. \(\psi\) is still a scalar function of real parameters \(\theta\in\mathbb{R}^d\).
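
The same check goes through in the quantum case. A minimal sketch with two non-commuting Pauli observables (assumes NumPy and SciPy; the basis choice is purely illustrative):

```python
# Quantum exponential family with K(theta) = theta_1 X + theta_2 Z (Pauli matrices).
import numpy as np
from scipy.linalg import expm

X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
F = [X, Z]

def psi(theta):
    return np.log(np.trace(expm(theta[0] * X + theta[1] * Z)).real)

theta = np.array([0.3, -0.5])
K = theta[0] * X + theta[1] * Z
rho = expm(K) / np.trace(expm(K)).real          # rho = exp(K - psi(theta))

eps = 1e-6
for i in range(2):
    d = np.zeros(2); d[i] = eps
    grad = (psi(theta + d) - psi(theta - d)) / (2 * eps)
    print(grad, np.trace(rho @ F[i]).real)      # property 1: these agree
```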

Matrices do not commute, so \(\partial_i e^{K(\theta)} \neq F_i e^{K(\theta)}\) in general.

This is the only obstacle, but it does require new calculus.

Side-by-side comparison

Here is the structural mapping:

| Classical | Quantum |
|---|---|
| Sample space \(\mathcal{X}\) | Hilbert space \(\mathcal{H}\) (dim \(d\)) |
| Probability density \(p(\mathbf{x})\) | Density matrix \(\rho\) (\(d\times d\), positive, trace 1) |
| Random variable \(F:\mathcal{X}\to\mathbb{R}\) | Observable (Hermitian operator) \(F\) |
| Expectation \(\mathbb{E}_p[F]=\int F(\mathbf{x})p(\mathbf{x})\,\text{d}\mu\) | Expectation \(\langle F\rangle=\text{tr}(\rho F)\) |
| Exponential family \(p=e^{\theta^\top \mathbf{F}-\psi}\) | Quantum exp. family \(\rho=e^{K-\psi}\) |
| Log-partition \(\psi=\log\int e^{\theta^\top \mathbf{F}}\,\text{d}\mu\) | Log-partition \(\psi=\log\text{tr}\,e^K\) |
| Fisher metric \(G_{ij}=\partial_i\partial_j\psi\) | BKM metric \(G_{ij}=\partial_i\partial_j\psi\) |
| \(\partial_i e^{\theta^\top \mathbf{F}}=F_i e^{\theta^\top \mathbf{F}}\) (scalars) | \(\partial_i e^K=\int_0^1 e^{(1-s)K}F_i e^{sK}\,\text{d}s\) (Duhamel) |

The top six rows are structurally identical. The last row is where noncommutativity forces new calculus.

Why noncommutativity matters (and what Duhamel fixes)

In the classical case, when you differentiate the exponential in \(\psi(\theta)=\log\int e^{\theta^\top \mathbf{F}(\mathbf{x})}\,\text{d}\mu\), you can “pull down” the sufficient statistic: \[ \partial_i e^{\theta^\top \mathbf{F}} = F_i(\mathbf{x})\, e^{\theta^\top \mathbf{F}(\mathbf{x})}. \] This is because \(\theta^\top \mathbf{F}(\mathbf{x})\) is a scalar for each \(\mathbf{x}\).

In the quantum case, \(K(\theta)=\sum_j\theta_j F_j\) is a matrix, and the \(F_i\) are matrices. In general, \(F_i K \neq K F_i\) (they do not commute), so you cannot write \(\partial_i e^K = F_i e^K\).

The Duhamel formula is the correct way to differentiate a matrix exponential: \[ \frac{\partial}{\partial\theta_i} e^{K(\theta)} = \int_0^1 e^{(1-s)K(\theta)}\, F_i\, e^{sK(\theta)}\,\text{d}s. \]

You can read \(s\in[0,1]\) as “where you insert the derivative inside the exponential”. It is an ordering parameter, not a physical time.
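
Here is a numerical sketch of the Duhamel formula (assumes NumPy and SciPy; the random Hermitian matrices are illustrative toys). It also shows that the naive scalar rule fails:

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4)) + 1j * rng.normal(size=(4, 4))
K = (A + A.conj().T) / 2                     # a Hermitian K
B = rng.normal(size=(4, 4)) + 1j * rng.normal(size=(4, 4))
F = (B + B.conj().T) / 2                     # a Hermitian direction F_i

naive = F @ expm(K)                          # scalar-style "pull-down" (wrong)

eps = 1e-6                                   # finite-difference reference
fd = (expm(K + eps * F) - expm(K - eps * F)) / (2 * eps)

n = 1000                                     # midpoint rule over s in [0, 1]
s = (np.arange(n) + 0.5) / n
duhamel = sum(expm((1 - si) * K) @ F @ expm(si * K) for si in s) / n

print(np.linalg.norm(naive - fd))            # large: noncommutativity bites
print(np.linalg.norm(duhamel - fd))          # tiny: Duhamel is correct
```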

Once you have Duhamel, all the familiar exponential-family calculations go through:

  - expectations: \(\langle F_i\rangle=\partial_i\psi(\theta)\),
  - the metric: \(G_{ij}=\partial_i\partial_j\psi(\theta)\) (the BKM metric),
  - natural-gradient-style updates using \(G\).

Quantum exponential families are the same probabilistic modeling framework you already know, plus one new computational layer (Duhamel) to handle the fact that operators are matrices.

What comes next in the course

Now that you see the structural parallel, the rest of the quantum development is:

  1. Reversible dynamics (analogous to measure-preserving maps): in quantum, these are unitary transformations \(\rho\mapsto U\rho U^\dagger\), which preserve von Neumann entropy (the quantum analog of Shannon entropy).

  2. Infinitesimal reversible dynamics: these are generated by commutators \(\dot\rho=-i[H,\rho]\), which is the quantum analog of Hamiltonian flow (Poisson brackets).

  3. Computation layer: when you need to linearize dynamics or compute Fisher-like metrics, you use Duhamel/Kubo–Mori calculus to handle the derivatives of operator exponentials.

  4. Applications: quantum natural gradient, variational inference on quantum states, reversible quantum neural networks, etc.

The punchline: if you know exponential families + Fisher information + natural gradients, you already understand the quantum story—you just need Duhamel to handle the matrix algebra.

Reversible Dynamics and Unitarity

With the exponential-family structure in place, we now turn to dynamics: how states evolve in time. In classical probability, reversible transformations are measure-preserving bijections. In quantum probability, reversibility forces a different structure: unitary evolution.

This is not an arbitrary postulate—it’s the natural notion of “information-preserving transformation” in a noncommutative setting.

Quantum Preview: Reversibility Forces Unitarity


So far, the course has largely lived in a commutative (classical) setting. In a commutative setting we can talk about random variables, joint distributions, and entropies of those distributions.

The new TIG paper makes a sharper point: the “origin” configuration we want (max multi-information with zero joint entropy, i.e. globally pure but locally maximally uncertain) is impossible in the classical/Shannon setting because classical conditional entropy is nonnegative.

A clean resolution is to move to noncommutative probability: keep expectations as the primitive notion, but allow observables that do not commute. This forces a different representation of “state” and (crucially for today’s lecture) it forces a different notion of dynamics.

Expectation-first view (state = expectation functional)

In noncommutative probability, the observables are Hermitian operators (matrices) \(A\).

A state is not a point in phase space; it is an expectation functional \(\omega\) which assigns an expectation value \(\omega(A)\) to each observable \(A\).

In finite dimensions, every such state can be represented by a density matrix \(\rho\) so that \[ \omega(A) = \text{tr}(\rho A),\qquad \text{tr}(\rho)=1. \] Intuitively: the density matrix is the object that lets you take consistent expectations when there is no underlying joint sample space.

Reversible transformations (no information loss)

What does it mean for a transformation to be reversible?

In the information-loss framing, reversible means: it preserves information (equivalently, it preserves von Neumann entropy for all states).

This is a real shift from normal (commutative) probability. Classically, if you have a sample space \(\Omega\) and random variables are functions \(\Omega\to\mathbb{R}\), then a reversible transformation is (morally) an invertible relabelling \(\phi:\Omega\to\Omega\) that preserves probabilities; observables are pushed forward/pulled back by composition, and you really can think in terms of “moving points around” in \(\Omega\).

In noncommutative probability there is no underlying sample space when observables don’t commute. So “reversibility” can’t mean a pointwise relabelling of outcomes; instead it means an isomorphism of the observable algebra (a \(*\)-isomorphism) that preserves the expectation structure. In finite dimensions those algebra isomorphisms are implemented by unitary conjugation.

In noncommutative probability, the reversible transformations are exactly the ones implemented by unitary conjugation: \[ \rho\mapsto U\rho U^\dagger,\qquad U^\dagger U = I. \] This is not an additional modelling assumption; it is the structural notion of “symmetry” compatible with noncommutative probability.

Infinitesimal form: commutator dynamics

The paper’s perspective is that expectations are the primitive notion. So we should describe reversible time evolution in a way that is compatible with invariance of expectations.

Let \(\mathcal{A}\) be our observable algebra (think: matrices) and let a state be an expectation functional \(\omega:\mathcal{A}\to\mathbb{R}\).

Reversible dynamics are implemented by a one-parameter family of algebra automorphisms \(\alpha_t\) (in finite dimensions: \(\alpha_t(A)=U(t)^\dagger A U(t)\)). This is the Heisenberg-style statement: observables are transformed, and expectations are preserved by pulling the state back.

Define the time-evolved state by \[ \omega_t(A) \coloneq \omega_0 \big(\alpha_t(A)\big). \] This definition makes the invariance principle explicit: we are not changing the meaning of expectation-taking; we are composing with an automorphism of the observable algebra.

If \(\alpha_t\) is smooth, we can look at the change it induces. Define the generator \(\delta\) by \[ \frac{\text{d}}{\text{d}t}\alpha_t(A)\big\rvert_{t=0}=\delta(A). \] What does this mean concretely? In finite dimensions we can take \[ U(t)=e^{-itH}\qquad\text{with $H$ Hermitian,} \] so that \(\alpha_t(A)=U(t)^\dagger A U(t) = e^{itH} A e^{-itH}\). Differentiate this expression at \(t=0\) and you get \[ \delta(A)= i[H,A]. \] Here are the steps, written out: \[ \alpha_t(A)=e^{itH} A e^{-itH}. \] Use the product rule \[ \frac{\text{d}}{\text{d}t}\alpha_t(A) =\left(\frac{\text{d}}{\text{d}t}e^{itH}\right)Ae^{-itH} +e^{itH}A\left(\frac{\text{d}}{\text{d}t}e^{-itH}\right). \] Now use the standard derivative identities (valid for constant \(H\)): \[ \frac{\text{d}}{\text{d}t}e^{itH} = iH e^{itH},\qquad \frac{\text{d}}{\text{d}t}e^{-itH} = -iH e^{-itH}. \] Substitute them \[ \frac{\text{d}}{\text{d}t}\alpha_t(A) = iHe^{itH}Ae^{-itH}-e^{itH}AiHe^{-itH}. \] Finally evaluate at \(t=0\) (so \(e^{itH}=I\)): \[ \delta(A)=\frac{\text{d}}{\text{d}t}\alpha_t(A)\Big\rvert_{t=0} = iHA-iAH=i[H,A]. \]

This is the Heisenberg form: the observable evolves by a commutator. If you have not seen commutators before, you can read \([H,A]=HA-AH\) as “apply \(H\) then \(A\) minus apply \(A\) then \(H\)”.

Now translate this to the density-matrix representation. If \(\omega_t(A)=\text{tr}(\rho(t)A)\) represents the same expectation functional, then consistency for all observables \(A\) forces \[ \text{tr}(\dot\rho A) = \frac{\text{d}}{\text{d}t} \omega_t(A) = \omega_0 \big(\delta(A)\big)=\text{tr}(\rho i[H,A]). \] Using cyclicity of the trace, \(\text{tr}(\rho i[H,A])=\text{tr}((-i[H,\rho])A)\), so we must have \[ \dot\rho = -i[H,\rho]. \] This is exactly the paper’s point: the commutator form is the concrete finite-dimensional equation whose job is to implement the invariance of expectations under reversible evolution.
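
A small numerical sketch of both statements (assumes NumPy and SciPy; the Hamiltonian and state are random toys): the commutator equation matches a finite difference of the unitary flow, and von Neumann entropy is conserved along it.

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(1)
A = rng.normal(size=(3, 3)) + 1j * rng.normal(size=(3, 3))
H = (A + A.conj().T) / 2                       # Hermitian Hamiltonian

rho0 = np.diag(rng.dirichlet(np.ones(3))).astype(complex)   # a mixed state

def rho(t):
    U = expm(-1j * t * H)
    return U @ rho0 @ U.conj().T

eps = 1e-6                                     # check: rhodot = -i[H, rho]
rhodot = (rho(eps) - rho(-eps)) / (2 * eps)
print(np.linalg.norm(rhodot + 1j * (H @ rho0 - rho0 @ H)))   # ~0

def vn_entropy(r):                             # von Neumann entropy
    w = np.linalg.eigvalsh(r)
    w = w[w > 1e-12]
    return float(-(w * np.log(w)).sum())

print(vn_entropy(rho0), vn_entropy(rho(1.3)))  # equal: entropy preserved
```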

Commutative sanity check: diagonals recover classical probability

Fix a basis and restrict attention to the diagonal subalgebra of observables. Concretely, take observables of the form \[ A=\text{diag}(a_1,\dots,a_d) \] and restrict to diagonal density matrices \[ \rho=\text{diag}(p_1,\dots,p_d),\qquad p_i\ge 0,\ \sum_i p_i=1. \] Then the expectation functional becomes ordinary probability: \[ \omega(A)=\text{tr}(\rho A)=\sum_{i=1}^d p_i a_i. \] So “noncommutative probability” collapses to classical probability on the finite sample space \(\Omega=\{1,\dots,d\}\).

Now look at reversible transformations. If we want to stay inside the diagonal algebra, the allowed unitaries are permutations of the basis. Let \(U_\pi\) be the permutation matrix for a bijection \(\pi:\Omega\to\Omega\). Then conjugation gives \[ \alpha_\pi(A)=U_\pi^\dagger A U_\pi = \text{diag}(a_{\pi(1)},\dots,a_{\pi(d)}), \] which is literally “relabel the outcomes”.

On states, \[ \rho^\prime = U_\pi\rho U_\pi^\dagger = \text{diag}(p_{\pi^{-1}(1)},\dots,p_{\pi^{-1}(d)}), \] which is the same relabelling acting on the probability vector.

Note what disappears in the commuting restriction: if both \(H\) and \(\rho\) are diagonal, then \([H,\rho]=H \rho - \rho H=0\) and the commutator dynamics is trivial. The only reversible transformations that preserve diagonality are discrete relabellings (permutations), whereas general unitaries give you the genuinely noncommutative “rotation”, i.e. basis-change behaviour.
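
A minimal sketch of this restriction (assumes NumPy; the permutation is arbitrary): permutation unitaries relabel the probability vector, and a diagonal \(H\) gives trivial commutator dynamics.

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])
rho = np.diag(p)

pi = [2, 0, 1]                         # a bijection of {0, 1, 2}
U = np.eye(3)[:, pi]                   # permutation matrix: U e_j = e_{pi(j)}
print(np.diag(U @ rho @ U.T))          # [0.3, 0.2, 0.5]: relabelled p

H = np.diag([1.0, 2.0, 3.0])           # diagonal H: dynamics is trivial
print(np.linalg.norm(H @ rho - rho @ H))   # 0
```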

Bridge back: Poisson brackets as the commutative shadow

Why does this belong in today’s Poisson-bracket lecture?

Because Poisson brackets are the classical limit of commutators: \[ \frac{1}{i\hbar}\,[f,g] \;\;\longrightarrow\;\; \{f,g\} \] (heuristically, as \(\hbar\to 0\) and operators become classical observables).

So today’s classical message (reversible dynamics are generated by Hamiltonian flow via Poisson brackets, and they preserve entropy) is the shadow of a deeper structural message (reversible quantum dynamics are generated by commutators, and they preserve von Neumann entropy).

We will now develop Poisson brackets as a concrete, classical calculus for this reversible sector.

Computing with Quantum Exponential Families

Now that we understand the structure (exponential families + reversible dynamics), we need the computational tools to actually work with operator exponentials. This is where Duhamel and Kubo–Mori calculus come in.

These are not new physical principles; they are the calculus you need to differentiate matrix exponentials when things don’t commute.

Computation Layer: Duhamel / Kubo–Mori Calculus


Once we switch to the quantum (noncommutative) setting, we still want to do the same kinds of calculations as in the classical exponential family:

  - compute expectations as first derivatives of \(\psi\),
  - compute the Fisher-like (BKM) metric as the Hessian of \(\psi\),
  - run natural-gradient-style updates on that metric.

Noncommutativity introduces one new technical obstacle: matrices do not commute, so differentiating an operator exponential is not the same as “pulling down the derivative” as in the scalar case.

The Duhamel/Kubo–Mori formulas are the standard way to do these derivatives cleanly. The key pedagogical point is: this is computation, not additional structure. The structure came from invariance (what reversible maps must look like); Duhamel calculus is how we actually compute within that structure.

The Duhamel formula (derivative of a matrix exponential)

Let \(K(\theta)\) be a matrix depending smoothly on parameters \(\theta\), and let \(F_i=\partial K/\partial \theta_i\).

In general, \(F_i e^{K} \neq e^{K}F_i\), so we use the Duhamel formula: \[ \frac{\partial}{\partial \theta_i} e^{K(\theta)} = \int_0^1 e^{(1-s)K(\theta)}\, F_i\, e^{sK(\theta)}\,\text{d}s. \]

You can read the parameter \(s\in[0,1]\) as “how far through the exponential you insert the derivative”. It is an ordering device, not a time variable.

Kubo–Mori derivatives and the BKM metric (quantum Fisher information)

In an exponential-family chart, we write \[ \rho(\theta)=\exp\!\big(K(\theta)-\psi(\theta)\big), \qquad \psi(\theta)=\log\text{tr}\,e^{K(\theta)}. \]

The same object that plays the role of Fisher information in the classical case appears as the Hessian of \(\psi\): \[ G_{ij}(\theta)=\frac{\partial^2 \psi}{\partial \theta_i\partial \theta_j}. \] In the quantum literature this is (one form of) the Bogoliubov–Kubo–Mori (BKM) metric; it can be expressed in several equivalent ways involving operator orderings.

For us, the useful message is simple: the Hessian of \(\psi\) is still the metric, and Duhamel calculus is how you compute it when operators do not commute.
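
As a sketch, the BKM metric in this chart can be obtained by numerically differentiating \(\psi\) twice; here with the same two-Pauli toy family as before (assumes NumPy and SciPy):

```python
import numpy as np
from scipy.linalg import expm

X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

def psi(theta):
    return np.log(np.trace(expm(theta[0] * X + theta[1] * Z)).real)

def hessian(f, theta, eps=1e-4):
    d = len(theta)
    G = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            ei = np.zeros(d); ei[i] = eps
            ej = np.zeros(d); ej[j] = eps
            G[i, j] = (f(theta + ei + ej) - f(theta + ei - ej)
                       - f(theta - ei + ej) + f(theta - ei - ej)) / (4 * eps**2)
    return G

G = hessian(psi, np.array([0.3, -0.5]))    # the BKM metric in this chart
print(G)                                   # symmetric
print(np.linalg.eigvalsh(G))               # positive: a genuine metric
```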

Why Lie closure makes things tractable

The Duhamel integrals look scary until you pick a basis adapted to the problem.

If your operators \(\{F_a\}\) close under commutation (a Lie algebra), \[ [F_a,F_b]= i\sum_c f_{abc}F_c, \] then conjugating by \(e^{sK}\) stays inside the same finite-dimensional span: \[ e^{sK}F_a e^{-sK}\in \text{span}\{F_b\}. \] So the Duhamel integrals can be reduced to finite-dimensional linear algebra on coefficients (often via BCH identities).

Again: the “Lie closure” choice is an implementation trick for computation, not a new physical axiom.
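
A quick sketch of Lie closure with the Pauli algebra (assumes NumPy and SciPy): the commutators close, and conjugation by \(e^{sK}\) stays inside the (complexified) Pauli span.

```python
import numpy as np
from scipy.linalg import expm

X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

# Closure under commutation: [X, Y] = 2iZ and cyclic permutations.
print(np.allclose(X @ Y - Y @ X, 2j * Z))
print(np.allclose(Y @ Z - Z @ Y, 2j * X))
print(np.allclose(Z @ X - X @ Z, 2j * Y))

# Conjugation by e^{sK} stays inside the span of {X, Y, Z}.
K = 0.3 * X - 0.5 * Z
conj = expm(0.7 * K) @ X @ expm(-0.7 * K)
coeffs = [np.trace(conj @ P) / 2 for P in (X, Y, Z)]   # complex in general
recon = sum(c * P for c, P in zip(coeffs, (X, Y, Z)))
print(np.allclose(conj, recon))            # True: nothing outside the span
```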

A common confusion: the Duhamel parameter is not time

Quantum statistical mechanics often uses an “imaginary time” parameter in KMS formulas. It is easy to confuse:

  - the Duhamel parameter \(s\in[0,1]\), which is pure bookkeeping for where a derivative is inserted, with
  - the imaginary-time (KMS) parameter, which plays a physical role in thermal states.

In this course, we treat \(s\) as a purely computational ordering parameter. It is not a dynamical clock.

Connecting to Path Integrals and Dyson Series

The Duhamel formula might look like an isolated trick for differentiating exponentials, but it’s actually part of a broader family of techniques for handling noncommutativity. Time-ordered exponentials (Dyson series), Trotter product formulas, and path integrals all address the same fundamental issue: operator ordering matters.

This section shows how these different formulations connect, demystifying their relationship.

How Duhamel Connects to Dyson Series and Path Integrals


Students often see a lot of different-looking formulas in quantum/statistical physics:

  - Duhamel formulas for derivatives of operator exponentials,
  - time-ordered exponentials and the Dyson series,
  - Trotter product formulas and path integrals.

These are not disconnected tricks. They are different faces of the same underlying issue: operators do not commute, so the order in which they appear inside exponentials and products must be tracked explicitly.

The role of Duhamel-type formulas is to give a systematic way to move derivatives or perturbations through a noncommuting exponential.

Two “Duhamel” ideas

The word “Duhamel” gets used in two closely related contexts.

Duhamel (A), the parameter derivative. If \(K(\theta)\) is an operator and \(F_i=\partial K/\partial\theta_i\), then \[ \partial_i e^{K(\theta)} =\int_0^1 e^{(1-s)K(\theta)} F_i e^{sK(\theta)} \text{d}s. \] Here \(s\in[0,1]\) is an ordering parameter that tells you where the derivative is inserted inside the exponential.

Duhamel (B), the perturbation form. If you want to compare the exponential of a sum to the exponential of a part, e.g. \[ e^{t(A+B)} \quad \text{vs} \quad e^{tA}, \] then a Duhamel formula expresses their difference as an integral involving \(B\). This is the backbone of perturbation expansions in time evolution.

From Duhamel to Dyson (time-ordered exponential)

In quantum dynamics, you often split a Hamiltonian into “easy + perturbation”: \[ H=H_0+V. \] In the interaction picture one defines an interaction operator \[ V_I(t)=e^{i t H_0} V e^{-i t H_0}, \] and the exact propagator can be written as a time-ordered exponential \[ U(t)=\mathcal{T}\exp \left(-i\int_0^t V_I(\tau) \text{d}\tau\right). \]

Expanding the time-ordered exponential gives the Dyson series: \[ U(t)=I+\sum_{n\ge 1}(-i)^n\int_{0\le \tau_n\le\cdots\le \tau_1\le t} V_I(\tau_1)\cdots V_I(\tau_n) \text{d}\tau_1 \cdots \text{d}\tau_n. \]

Conceptually: this is “Duhamel(B)” iterated. Each integral insertion accounts for the fact that \(H_0\) and \(V\) do not commute, so you must keep track of operator order in time.
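
A numerical sketch of the first-order truncation (assumes NumPy and SciPy; \(H_0\) and \(V\) are random Hermitian toys): keeping only the first Dyson integral leaves an error that shrinks like the square of the coupling.

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(2)
def herm(n):
    A = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
    return (A + A.conj().T) / 2

H0, V, t = herm(3), herm(3), 0.5

def V_I(tau):                              # interaction-picture perturbation
    return expm(1j * tau * H0) @ V @ expm(-1j * tau * H0)

def dyson1(lam, n=400):                    # U ~ e^{-itH0} (I - i lam int V_I)
    taus = (np.arange(n) + 0.5) * t / n
    integral = sum(V_I(tau) for tau in taus) * (t / n)
    return expm(-1j * t * H0) @ (np.eye(3) - 1j * lam * integral)

for lam in (0.1, 0.05, 0.025):
    exact = expm(-1j * t * (H0 + lam * V))
    print(lam, np.linalg.norm(exact - dyson1(lam)))   # error ~ lam^2
```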

From Dyson/Trotter to path integrals (time slicing)

Path integrals can be introduced as a computational method by time-slicing (Trotterisation).

For example, in (imaginary-time) statistical mechanics one uses \[ e^{-\beta(H_0+V)} =\lim_{n\to\infty}\left(e^{-\beta H_0/n}\,e^{-\beta V/n}\right)^n \] (Trotter product formula).

In real time one has an analogous statement for \(e^{-it(H_0+V)}\).

Now insert resolutions of the identity between the factors (e.g. position eigenstates for particles, spin basis for spins). In the limit of many slices, the product becomes an integral over intermediate configurations: that limiting object is the (Euclidean or real-time) path integral.

So the path integral is another way to cope with noncommutativity: it replaces a hard operator exponential with a limit of many small steps where the ordering is explicit.
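
The Trotter step above is easy to check numerically (a sketch; assumes NumPy and SciPy, with random Hermitian toys):

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(3)
def herm(n):
    A = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
    return (A + A.conj().T) / 2

H0, V, beta = herm(4), herm(4), 1.0
exact = expm(-beta * (H0 + V))

for n in (1, 10, 100, 1000):
    step = expm(-beta * H0 / n) @ expm(-beta * V / n)
    approx = np.linalg.matrix_power(step, n)
    print(n, np.linalg.norm(approx - exact))    # error shrinks like 1/n
```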

How this relates to our Duhamel parameter \(s\in[0,1]\)

In our TIG quantum exponential-family calculations, the Duhamel integral involves an \(s\in[0,1]\) ordering parameter: \[ \partial_i e^{K}=\int_0^1 e^{(1-s)K}F_i e^{sK} \text{d}s. \]

This is not a physical time variable. It is the same kind of bookkeeping device as:

All three enforce one principle:

The Origin Paradox: Why Quantum?

A natural question remains: why do we need quantum mechanics at all? What configuration of reality requires noncommutative probability?

One compelling answer comes from considering a fundamental limit configuration: a globally pure state (zero joint entropy) with maximally uncertain local measurements (maximum multi-information). This configuration is forbidden in classical probability—Shannon entropy cannot be negative—but it is exactly the structure of quantum entanglement.

This “origin paradox” provides conceptual motivation for why nature might require the quantum framework.

The Origin Paradox: Shannon vs von Neumann


The inaccessible game wants a very specific kind of “origin” state:

  - zero joint entropy: the global state is pure, fully specified;
  - maximum multi-information: every subsystem is maximally uncertain on its own.

Informally: the system is globally fully specified (no global uncertainty), but every subsystem looks maximally uncertain on its own.

In classical probability (commutative setting), this origin is impossible. The obstruction is not a technicality — it is a structural inequality: classical conditional entropy is nonnegative.

The quantum (noncommutative) resolution is equally structural: von Neumann conditional entropy can be negative for entangled states, and those are exactly the configurations that look “locally maximally uncertain” while remaining globally pure.

Why the classical origin is impossible

Consider two variables \(X,Y\) (the same idea scales). Recall \[ I(X;Y)=H(X)+H(Y)-H(X,Y). \] If \(H(X,Y)=0\) then the joint distribution puts all mass on a single outcome \((x^\star,y^\star)\), which forces \(H(X)=H(Y)=0\) as well. So you cannot have \(H(X,Y)=0\) while keeping \(H(X)+H(Y)\) positive.

Another way to say the same thing is through conditional entropy: \[ H(X|Y)=H(X,Y)-H(Y)\ge 0. \] If \(H(X,Y)=0\) but \(H(Y)>0\), then \(H(X|Y)=-H(Y)<0\), contradicting \(H(X|Y)\ge 0\).
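
A short numerical illustration of this inequality (assumes NumPy; the joint distribution is random):

```python
import numpy as np

rng = np.random.default_rng(4)
P = rng.random((3, 3)); P /= P.sum()       # a random joint pmf p(x, y)

def H(p):
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

print(H(P.ravel()) - H(P.sum(axis=0)))     # H(X|Y) = H(X,Y) - H(Y) >= 0
```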

So the origin condition “globally pure but locally uncertain” is forbidden in the commutative/Shannon world.

von Neumann entropy changes the rules

In noncommutative probability, the joint system is described by a density matrix \(\rho_{AB}\) and subsystems by reduced density matrices \(\rho_A=\text{tr}_B(\rho_{AB})\), \(\rho_B=\text{tr}_A(\rho_{AB})\).

The von Neumann entropy is \[ \textsf{H}(\rho)=-\text{tr}(\rho\log\rho). \] We can define conditional entropy by the same formal identity: \[ \textsf{H}(A|B)=\textsf{H}(\rho_{AB})-\textsf{H}(\rho_B). \] But now \(\textsf{H}(A|B)\) can be negative. That is not a bug: it is exactly how entanglement shows up as “more-than-classical correlation”.

Worked toy example: Bell state

Take two qubits and the Bell state \[ \ket{\Phi^+}=\frac{1}{\sqrt 2}(\ket{00}+\ket{11}), \qquad \rho_{AB}=\ket{\Phi^+}\bra{\Phi^+}. \] This is a pure joint state, so \[ \textsf{H}(\rho_{AB})=0. \] But each marginal is maximally mixed: \[ \rho_A=\rho_B=\frac{1}{2}I, \qquad \textsf{H}(\rho_A)=\textsf{H}(\rho_B)=\log 2. \] Therefore \[ \textsf{H}(A|B)=\textsf{H}(AB)-\textsf{H}(B)=0-\log 2=-\log 2<0. \] This is exactly the pattern TIG wants at the origin: zero global entropy, maximally uncertain marginals, and negative conditional entropy.
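
A minimal numerical sketch of this computation (assumes NumPy; the reshape-based partial trace is one standard implementation):

```python
import numpy as np

phi = np.zeros(4); phi[0] = phi[3] = 1 / np.sqrt(2)   # (|00> + |11>) / sqrt(2)
rho_AB = np.outer(phi, phi)                           # pure joint state

def vn_entropy(r):
    w = np.linalg.eigvalsh(r)
    w = w[w > 1e-12]
    return float(-(w * np.log(w)).sum())

# Partial trace over B: index as [iA, iB, jA, jB] and sum the B indices.
rho_A = rho_AB.reshape(2, 2, 2, 2).trace(axis1=1, axis2=3)

print(vn_entropy(rho_AB))                             # 0: globally pure
print(vn_entropy(rho_A), np.log(2))                   # log 2: maximally mixed
print(vn_entropy(rho_AB) - vn_entropy(rho_A))         # H(A|B) = -log 2 < 0
```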

So the quantum/noncommutative formalism makes the origin attainable without changing the goal — only changing the structural notion of “information loss”.

Connection back to TIG

This is why the paper pivots to von Neumann entropy: the formal definitions (entropy, conditional entropy, multi-information) carry over unchanged, but conditional entropy can now be negative, which is exactly what the origin configuration requires.

From here, the next question is: once expectations are primitive, what are the reversible transformations that preserve this expectation structure? That leads into the unitary/commutator story.

Conclusion: Quantum as Extended Probabilistic Modeling

The key message for ML practitioners: quantum statistical mechanics is the same probabilistic modeling framework you already know (exponential families, log-partition functions, Fisher-like metrics, natural gradients) plus one new computational layer (Duhamel calculus) to handle observables that do not commute.

With these tools, quantum statistical mechanics becomes accessible to anyone with a solid grounding in information geometry and exponential families.

Further Directions

Topics not covered in this talk but natural extensions for ML applications: quantum natural gradient, variational inference on quantum states, and reversible quantum neural networks.

Each of these areas becomes more accessible once you see quantum mechanics as “exponential families + matrix calculus”.

Thanks!

For more information on these subjects, you might want to check the following resources.

References