\[\newcommand{\tk}[1]{} %\newcommand{\tk}[1]{\textbf{TK}: #1} \newcommand{\Amatrix}{\mathbf{A}} \newcommand{\KL}[2]{\text{KL}\left( #1\,\|\,#2 \right)} \newcommand{\Kaast}{\kernelMatrix_{\mathbf{ \ast}\mathbf{ \ast}}} \newcommand{\Kastu}{\kernelMatrix_{\mathbf{ \ast} \inducingVector}} \newcommand{\Kff}{\kernelMatrix_{\mappingFunctionVector \mappingFunctionVector}} \newcommand{\Kfu}{\kernelMatrix_{\mappingFunctionVector \inducingVector}} \newcommand{\Kuast}{\kernelMatrix_{\inducingVector \bf\ast}} \newcommand{\Kuf}{\kernelMatrix_{\inducingVector \mappingFunctionVector}} \newcommand{\Kuu}{\kernelMatrix_{\inducingVector \inducingVector}} \newcommand{\Kuui}{\Kuu^{-1}} \newcommand{\Qaast}{\mathbf{Q}_{\bf \ast \ast}} \newcommand{\Qastf}{\mathbf{Q}_{\ast \mappingFunction}} \newcommand{\Qfast}{\mathbf{Q}_{\mappingFunctionVector \bf \ast}} \newcommand{\Qff}{\mathbf{Q}_{\mappingFunctionVector \mappingFunctionVector}} \newcommand{\aMatrix}{\mathbf{A}} \newcommand{\aScalar}{a} \newcommand{\aVector}{\mathbf{a}} \newcommand{\acceleration}{a} \newcommand{\bMatrix}{\mathbf{B}} \newcommand{\bScalar}{b} \newcommand{\bVector}{\mathbf{b}} \newcommand{\basisFunc}{\phi} \newcommand{\basisFuncVector}{\boldsymbol{ \basisFunc}} \newcommand{\basisFunction}{\phi} \newcommand{\basisLocation}{\mu} \newcommand{\basisMatrix}{\boldsymbol{ \Phi}} \newcommand{\basisScalar}{\basisFunction} \newcommand{\basisVector}{\boldsymbol{ \basisFunction}} \newcommand{\activationFunction}{\phi} \newcommand{\activationMatrix}{\boldsymbol{ \Phi}} \newcommand{\activationScalar}{\basisFunction} \newcommand{\activationVector}{\boldsymbol{ \basisFunction}} \newcommand{\bigO}{\mathcal{O}} \newcommand{\binomProb}{\pi} \newcommand{\cMatrix}{\mathbf{C}} \newcommand{\cbasisMatrix}{\hat{\boldsymbol{ \Phi}}} \newcommand{\cdataMatrix}{\hat{\dataMatrix}} \newcommand{\cdataScalar}{\hat{\dataScalar}} \newcommand{\cdataVector}{\hat{\dataVector}} \newcommand{\centeredKernelMatrix}{\mathbf{ \MakeUppercase{\centeredKernelScalar}}} \newcommand{\centeredKernelScalar}{b} \newcommand{\centeredKernelVector}{\centeredKernelScalar} \newcommand{\centeringMatrix}{\mathbf{H}} \newcommand{\chiSquaredDist}[2]{\chi_{#1}^{2}\left(#2\right)} \newcommand{\chiSquaredSamp}[1]{\chi_{#1}^{2}} \newcommand{\conditionalCovariance}{\boldsymbol{ \Sigma}} \newcommand{\coregionalizationMatrix}{\mathbf{B}} \newcommand{\coregionalizationScalar}{b} \newcommand{\coregionalizationVector}{\mathbf{ \coregionalizationScalar}} \newcommand{\covDist}[2]{\text{cov}_{#2}\left(#1\right)} \newcommand{\covSamp}[1]{\text{cov}\left(#1\right)} \newcommand{\covarianceScalar}{c} \newcommand{\covarianceVector}{\mathbf{ \covarianceScalar}} \newcommand{\covarianceMatrix}{\mathbf{C}} \newcommand{\covarianceMatrixTwo}{\boldsymbol{ \Sigma}} \newcommand{\croupierScalar}{s} \newcommand{\croupierVector}{\mathbf{ \croupierScalar}} \newcommand{\croupierMatrix}{\mathbf{ \MakeUppercase{\croupierScalar}}} \newcommand{\dataDim}{p} \newcommand{\dataIndex}{i} \newcommand{\dataIndexTwo}{j} \newcommand{\dataMatrix}{\mathbf{Y}} \newcommand{\dataScalar}{y} \newcommand{\dataSet}{\mathcal{D}} \newcommand{\dataStd}{\sigma} \newcommand{\dataVector}{\mathbf{ \dataScalar}} \newcommand{\decayRate}{d} \newcommand{\degreeMatrix}{\mathbf{ \MakeUppercase{\degreeScalar}}} \newcommand{\degreeScalar}{d} \newcommand{\degreeVector}{\mathbf{ \degreeScalar}} % Already defined by latex %\newcommand{\det}[1]{\left|#1\right|} \newcommand{\diag}[1]{\text{diag}\left(#1\right)} \newcommand{\diagonalMatrix}{\mathbf{D}} 
\newcommand{\diff}[2]{\frac{\text{d}#1}{\text{d}#2}} \newcommand{\diffTwo}[2]{\frac{\text{d}^2#1}{\text{d}#2^2}} \newcommand{\displacement}{x} \newcommand{\displacementVector}{\textbf{\displacement}} \newcommand{\distanceMatrix}{\mathbf{ \MakeUppercase{\distanceScalar}}} \newcommand{\distanceScalar}{d} \newcommand{\distanceVector}{\mathbf{ \distanceScalar}} \newcommand{\eigenvaltwo}{\ell} \newcommand{\eigenvaltwoMatrix}{\mathbf{L}} \newcommand{\eigenvaltwoVector}{\mathbf{l}} \newcommand{\eigenvalue}{\lambda} \newcommand{\eigenvalueMatrix}{\boldsymbol{ \Lambda}} \newcommand{\eigenvalueVector}{\boldsymbol{ \lambda}} \newcommand{\eigenvector}{\mathbf{ \eigenvectorScalar}} \newcommand{\eigenvectorMatrix}{\mathbf{U}} \newcommand{\eigenvectorScalar}{u} \newcommand{\eigenvectwo}{\mathbf{v}} \newcommand{\eigenvectwoMatrix}{\mathbf{V}} \newcommand{\eigenvectwoScalar}{v} \newcommand{\entropy}[1]{\mathcal{H}\left(#1\right)} \newcommand{\errorFunction}{E} \newcommand{\expDist}[2]{\left<#1\right>_{#2}} \newcommand{\expSamp}[1]{\left<#1\right>} \newcommand{\expectation}[1]{\left\langle #1 \right\rangle } \newcommand{\expectationDist}[2]{\left\langle #1 \right\rangle _{#2}} \newcommand{\expectedDistanceMatrix}{\mathcal{D}} \newcommand{\eye}{\mathbf{I}} \newcommand{\fantasyDim}{r} \newcommand{\fantasyMatrix}{\mathbf{ \MakeUppercase{\fantasyScalar}}} \newcommand{\fantasyScalar}{z} \newcommand{\fantasyVector}{\mathbf{ \fantasyScalar}} \newcommand{\featureStd}{\varsigma} \newcommand{\gammaCdf}[3]{\mathcal{GAMMA CDF}\left(#1|#2,#3\right)} \newcommand{\gammaDist}[3]{\mathcal{G}\left(#1|#2,#3\right)} \newcommand{\gammaSamp}[2]{\mathcal{G}\left(#1,#2\right)} \newcommand{\gaussianDist}[3]{\mathcal{N}\left(#1|#2,#3\right)} \newcommand{\gaussianSamp}[2]{\mathcal{N}\left(#1,#2\right)} \newcommand{\given}{|} \newcommand{\half}{\frac{1}{2}} \newcommand{\heaviside}{H} \newcommand{\hiddenMatrix}{\mathbf{ \MakeUppercase{\hiddenScalar}}} \newcommand{\hiddenScalar}{h} \newcommand{\hiddenVector}{\mathbf{ \hiddenScalar}} \newcommand{\identityMatrix}{\eye} \newcommand{\inducingInputScalar}{z} \newcommand{\inducingInputVector}{\mathbf{ \inducingInputScalar}} \newcommand{\inducingInputMatrix}{\mathbf{Z}} \newcommand{\inducingScalar}{u} \newcommand{\inducingVector}{\mathbf{ \inducingScalar}} \newcommand{\inducingMatrix}{\mathbf{U}} \newcommand{\inlineDiff}[2]{\text{d}#1/\text{d}#2} \newcommand{\inputDim}{q} \newcommand{\inputMatrix}{\mathbf{X}} \newcommand{\inputScalar}{x} \newcommand{\inputSpace}{\mathcal{X}} \newcommand{\inputVals}{\inputVector} \newcommand{\inputVector}{\mathbf{ \inputScalar}} \newcommand{\iterNum}{k} \newcommand{\kernel}{\kernelScalar} \newcommand{\kernelMatrix}{\mathbf{K}} \newcommand{\kernelScalar}{k} \newcommand{\kernelVector}{\mathbf{ \kernelScalar}} \newcommand{\kff}{\kernelScalar_{\mappingFunction \mappingFunction}} \newcommand{\kfu}{\kernelVector_{\mappingFunction \inducingScalar}} \newcommand{\kuf}{\kernelVector_{\inducingScalar \mappingFunction}} \newcommand{\kuu}{\kernelVector_{\inducingScalar \inducingScalar}} \newcommand{\lagrangeMultiplier}{\lambda} \newcommand{\lagrangeMultiplierMatrix}{\boldsymbol{ \Lambda}} \newcommand{\lagrangian}{L} \newcommand{\laplacianFactor}{\mathbf{ \MakeUppercase{\laplacianFactorScalar}}} \newcommand{\laplacianFactorScalar}{m} \newcommand{\laplacianFactorVector}{\mathbf{ \laplacianFactorScalar}} \newcommand{\laplacianMatrix}{\mathbf{L}} \newcommand{\laplacianScalar}{\ell} \newcommand{\laplacianVector}{\mathbf{ \ell}} \newcommand{\latentDim}{q} 
\newcommand{\latentDistanceMatrix}{\boldsymbol{ \Delta}} \newcommand{\latentDistanceScalar}{\delta} \newcommand{\latentDistanceVector}{\boldsymbol{ \delta}} \newcommand{\latentForce}{f} \newcommand{\latentFunction}{u} \newcommand{\latentFunctionVector}{\mathbf{ \latentFunction}} \newcommand{\latentFunctionMatrix}{\mathbf{ \MakeUppercase{\latentFunction}}} \newcommand{\latentIndex}{j} \newcommand{\latentScalar}{z} \newcommand{\latentVector}{\mathbf{ \latentScalar}} \newcommand{\latentMatrix}{\mathbf{Z}} \newcommand{\learnRate}{\eta} \newcommand{\lengthScale}{\ell} \newcommand{\rbfWidth}{\ell} \newcommand{\likelihoodBound}{\mathcal{L}} \newcommand{\likelihoodFunction}{L} \newcommand{\locationScalar}{\mu} \newcommand{\locationVector}{\boldsymbol{ \locationScalar}} \newcommand{\locationMatrix}{\mathbf{M}} \newcommand{\variance}[1]{\text{var}\left( #1 \right)} \newcommand{\mappingFunction}{f} \newcommand{\mappingFunctionMatrix}{\mathbf{F}} \newcommand{\mappingFunctionTwo}{g} \newcommand{\mappingFunctionTwoMatrix}{\mathbf{G}} \newcommand{\mappingFunctionTwoVector}{\mathbf{ \mappingFunctionTwo}} \newcommand{\mappingFunctionVector}{\mathbf{ \mappingFunction}} \newcommand{\scaleScalar}{s} \newcommand{\mappingScalar}{w} \newcommand{\mappingVector}{\mathbf{ \mappingScalar}} \newcommand{\mappingMatrix}{\mathbf{W}} \newcommand{\mappingScalarTwo}{v} \newcommand{\mappingVectorTwo}{\mathbf{ \mappingScalarTwo}} \newcommand{\mappingMatrixTwo}{\mathbf{V}} \newcommand{\maxIters}{K} \newcommand{\meanMatrix}{\mathbf{M}} \newcommand{\meanScalar}{\mu} \newcommand{\meanTwoMatrix}{\mathbf{M}} \newcommand{\meanTwoScalar}{m} \newcommand{\meanTwoVector}{\mathbf{ \meanTwoScalar}} \newcommand{\meanVector}{\boldsymbol{ \meanScalar}} \newcommand{\mrnaConcentration}{m} \newcommand{\naturalFrequency}{\omega} \newcommand{\neighborhood}[1]{\mathcal{N}\left( #1 \right)} \newcommand{\neilurl}{http://inverseprobability.com/} \newcommand{\noiseMatrix}{\boldsymbol{ E}} \newcommand{\noiseScalar}{\epsilon} \newcommand{\noiseVector}{\boldsymbol{ \epsilon}} \newcommand{\norm}[1]{\left\Vert #1 \right\Vert} \newcommand{\normalizedLaplacianMatrix}{\hat{\mathbf{L}}} \newcommand{\normalizedLaplacianScalar}{\hat{\ell}} \newcommand{\normalizedLaplacianVector}{\hat{\mathbf{ \ell}}} \newcommand{\numActive}{m} \newcommand{\numBasisFunc}{m} \newcommand{\numComponents}{m} \newcommand{\numComps}{K} \newcommand{\numData}{n} \newcommand{\numFeatures}{K} \newcommand{\numHidden}{h} \newcommand{\numInducing}{m} \newcommand{\numLayers}{\ell} \newcommand{\numNeighbors}{K} \newcommand{\numSequences}{s} \newcommand{\numSuccess}{s} \newcommand{\numTasks}{m} \newcommand{\numTime}{T} \newcommand{\numTrials}{S} \newcommand{\outputIndex}{j} \newcommand{\paramVector}{\boldsymbol{ \theta}} \newcommand{\parameterMatrix}{\boldsymbol{ \Theta}} \newcommand{\parameterScalar}{\theta} \newcommand{\parameterVector}{\boldsymbol{ \parameterScalar}} \newcommand{\partDiff}[2]{\frac{\partial#1}{\partial#2}} \newcommand{\precisionScalar}{j} \newcommand{\precisionVector}{\mathbf{ \precisionScalar}} \newcommand{\precisionMatrix}{\mathbf{J}} \newcommand{\pseudotargetScalar}{\widetilde{y}} \newcommand{\pseudotargetVector}{\mathbf{ \pseudotargetScalar}} \newcommand{\pseudotargetMatrix}{\mathbf{ \widetilde{Y}}} \newcommand{\rank}[1]{\text{rank}\left(#1\right)} \newcommand{\rayleighDist}[2]{\mathcal{R}\left(#1|#2\right)} \newcommand{\rayleighSamp}[1]{\mathcal{R}\left(#1\right)} \newcommand{\responsibility}{r} \newcommand{\rotationScalar}{r} \newcommand{\rotationVector}{\mathbf{ 
\rotationScalar}} \newcommand{\rotationMatrix}{\mathbf{R}} \newcommand{\sampleCovScalar}{s} \newcommand{\sampleCovVector}{\mathbf{ \sampleCovScalar}} \newcommand{\sampleCovMatrix}{\mathbf{s}} \newcommand{\scalarProduct}[2]{\left\langle{#1},{#2}\right\rangle} \newcommand{\sign}[1]{\text{sign}\left(#1\right)} \newcommand{\sigmoid}[1]{\sigma\left(#1\right)} \newcommand{\singularvalue}{\ell} \newcommand{\singularvalueMatrix}{\mathbf{L}} \newcommand{\singularvalueVector}{\mathbf{l}} \newcommand{\sorth}{\mathbf{u}} \newcommand{\spar}{\lambda} \newcommand{\trace}[1]{\text{tr}\left(#1\right)} \newcommand{\BasalRate}{B} \newcommand{\DampingCoefficient}{C} \newcommand{\DecayRate}{D} \newcommand{\Displacement}{X} \newcommand{\LatentForce}{F} \newcommand{\Mass}{M} \newcommand{\Sensitivity}{S} \newcommand{\basalRate}{b} \newcommand{\dampingCoefficient}{c} \newcommand{\mass}{m} \newcommand{\sensitivity}{s} \newcommand{\springScalar}{\kappa} \newcommand{\springVector}{\boldsymbol{ \kappa}} \newcommand{\springMatrix}{\boldsymbol{ \mathcal{K}}} \newcommand{\tfConcentration}{p} \newcommand{\tfDecayRate}{\delta} \newcommand{\tfMrnaConcentration}{f} \newcommand{\tfVector}{\mathbf{ \tfConcentration}} \newcommand{\velocity}{v} \newcommand{\sufficientStatsScalar}{g} \newcommand{\sufficientStatsVector}{\mathbf{ \sufficientStatsScalar}} \newcommand{\sufficientStatsMatrix}{\mathbf{G}} \newcommand{\switchScalar}{s} \newcommand{\switchVector}{\mathbf{ \switchScalar}} \newcommand{\switchMatrix}{\mathbf{S}} \newcommand{\tr}[1]{\text{tr}\left(#1\right)} \newcommand{\loneNorm}[1]{\left\Vert #1 \right\Vert_1} \newcommand{\ltwoNorm}[1]{\left\Vert #1 \right\Vert_2} \newcommand{\onenorm}[1]{\left\vert#1\right\vert_1} \newcommand{\twonorm}[1]{\left\Vert #1 \right\Vert} \newcommand{\vScalar}{v} \newcommand{\vVector}{\mathbf{v}} \newcommand{\vMatrix}{\mathbf{V}} \newcommand{\varianceDist}[2]{\text{var}_{#2}\left( #1 \right)} % Already defined by latex %\newcommand{\vec}{#1:} \newcommand{\vecb}[1]{\left(#1\right):} \newcommand{\weightScalar}{w} \newcommand{\weightVector}{\mathbf{ \weightScalar}} \newcommand{\weightMatrix}{\mathbf{W}} \newcommand{\weightedAdjacencyMatrix}{\mathbf{A}} \newcommand{\weightedAdjacencyScalar}{a} \newcommand{\weightedAdjacencyVector}{\mathbf{ \weightedAdjacencyScalar}} \newcommand{\onesVector}{\mathbf{1}} \newcommand{\zerosVector}{\mathbf{0}} \]

Bayesian Regression

Neil D. Lawrence

University of Sheffield

Overdetermined System

  • With two unknowns and two observations: \[ \begin{aligned} \dataScalar_1 = & m\inputScalar_1 + c\\ \dataScalar_2 = & m\inputScalar_2 + c \end{aligned} \]
  • Additional observation leads to overdetermined system. \[ \dataScalar_3 = m\inputScalar_3 + c \]
  • This problem is solved through a noise model \(\noiseScalar \sim \gaussianSamp{0}{\dataStd^2}\) \[ \begin{aligned} \dataScalar_1 = m\inputScalar_1 + c + \noiseScalar_1\\ \dataScalar_2 = m\inputScalar_2 + c + \noiseScalar_2\\ \dataScalar_3 = m\inputScalar_3 + c + \noiseScalar_3 \end{aligned} \]

Underdetermined System

Underdetermined System

  • What about two unknowns and one observation? \[\dataScalar_1 = m\inputScalar_1 + c\]

Can compute \(m\) given \(c\). \[m = \frac{\dataScalar_1 - c}{\inputScalar_1}\]
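
As a minimal illustration of the underdetermination, the sketch below uses a made-up observation \((x_1, y_1)\): every choice of \(c\) yields a consistent \(m\).

```python
import numpy as np

# One observation, two unknowns: any value of c yields a consistent m.
# The data point (x1, y1) is made up purely for illustration.
x1, y1 = 2.0, 5.0

for c in np.linspace(-1.0, 1.0, 5):
    m = (y1 - c) / x1  # solve y1 = m*x1 + c for m, given c
    print(f"c = {c:5.2f}  ->  m = {m:5.2f}")
```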

Underdetermined System

The Bayesian Controversy: Philosophical Underpinnings

A segment from the lecture in 2012 on philosophical underpinnings.

Noise Models

  • We aren’t modeling the entire system.
  • The noise model captures the mismatch between the model and the data.
  • The Gaussian model is justified by appeal to the central limit theorem.
  • Other models are also possible (Student-\(t\) for heavy tails).
  • Maximum likelihood with Gaussian noise leads to least squares (see the sketch below).
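
A minimal numpy sketch of the last point, assuming some made-up observations: under Gaussian noise the maximum likelihood estimate of \(m\) and \(c\) is the least squares solution.

```python
import numpy as np

# Hypothetical data: noisy observations of y = m*x + c.
x = np.array([1.0, 2.0, 3.0])
y = np.array([3.1, 4.9, 7.2])

# Maximum likelihood under Gaussian noise minimises the sum of squared
# errors sum_i (y_i - m*x_i - c)^2, i.e. least squares.
Phi = np.column_stack([x, np.ones_like(x)])       # design matrix [x, 1]
(m, c), *_ = np.linalg.lstsq(Phi, y, rcond=None)  # least squares solution
print(f"m = {m:.3f}, c = {c:.3f}")
```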

Different Types of Uncertainty

  • The first type of uncertainty we consider is aleatoric uncertainty.
  • The second type of uncertainty we consider is epistemic uncertainty.

Aleatoric Uncertainty

  • This is uncertainty we couldn’t know even if we wanted to, e.g. the result of a football match before it’s played.
  • Where a sheet of paper might land on the floor.

Epistemic Uncertainty

  • This is uncertainty we could in principle know the answer to. We just haven’t observed enough yet, e.g. the result of a football match after it’s played.
  • What colour socks your lecturer is wearing.

Further Reading

  • Section 1.2.3 (pg 21–24) of Bishop (2006)

  • Sections 3.1–3.4 (pg 95–117) of Rogers and Girolami (2011)

  • Section 1.2.6 (start from just past eq 1.64 pg 30-32) of Bishop (2006)

Prior Distribution

  • Bayesian inference requires a prior on the parameters.
  • The prior represents your belief, before you see the data, about the likely value of the parameters.
  • For linear regression, consider a Gaussian prior on the intercept:

    \[c \sim \gaussianSamp{0}{\alpha_1}\]

Posterior Distribution

  • The posterior distribution is found by combining the prior with the likelihood.
  • It represents your belief, after you see the data, about the likely value of the parameters.
  • The posterior is found through Bayes’ Rule \[ p(c|\dataScalar) = \frac{p(\dataScalar|c)p(c)}{p(\dataScalar)} \]

    \[ \text{posterior} = \frac{\text{likelihood}\times \text{prior}}{\text{marginal likelihood}}. \]

Bayes Update

Stages to Derivation of the Posterior

  • Multiply likelihood by prior
    • since both are “exponentiated quadratics”, the product is also an exponentiated quadratic because \(\exp(a^2)\exp(b^2) = \exp(a^2 + b^2)\).
  • Complete the square to get the resulting density in the form of a Gaussian.
  • Recognise the mean and (co)variance of the Gaussian. This is the estimate of the posterior.

Main Trick

\[p(c) = \frac{1}{\sqrt{2\pi\alpha_1}} \exp\left(-\frac{1}{2\alpha_1}c^2\right)\] \[p(\dataVector|\inputVector, c, m, \dataStd^2) = \frac{1}{\left(2\pi\dataStd^2\right)^{\frac{\numData}{2}}} \exp\left(-\frac{1}{2\dataStd^2}\sum_{i=1}^\numData(\dataScalar_i - m\inputScalar_i - c)^2\right)\]

\[p(c| \dataVector, \inputVector, m, \dataStd^2) = \frac{p(\dataVector|\inputVector, c, m, \dataStd^2)p(c)}{p(\dataVector|\inputVector, m, \dataStd^2)}\]

\[p(c| \dataVector, \inputVector, m, \dataStd^2) = \frac{p(\dataVector|\inputVector, c, m, \dataStd^2)p(c)}{\int p(\dataVector|\inputVector, c, m, \dataStd^2)p(c) \text{d} c}\]

\[p(c| \dataVector, \inputVector, m, \dataStd^2) \propto p(\dataVector|\inputVector, c, m, \dataStd^2)p(c)\]

\[\begin{aligned} \log p(c | \dataVector, \inputVector, m, \dataStd^2) =&-\frac{1}{2\dataStd^2} \sum_{i=1}^\numData(\dataScalar_i-c - m\inputScalar_i)^2-\frac{1}{2\alpha_1} c^2 + \text{const}\\ = &-\frac{1}{2\dataStd^2}\sum_{i=1}^\numData(\dataScalar_i-m\inputScalar_i)^2 -\left(\frac{\numData}{2\dataStd^2} + \frac{1}{2\alpha_1}\right)c^2\\ & + c\frac{\sum_{i=1}^\numData(\dataScalar_i-m\inputScalar_i)}{\dataStd^2}, \end{aligned}\]

complete the square of the quadratic form to obtain \[\log p(c | \dataVector, \inputVector, m, \dataStd^2) = -\frac{1}{2\tau^2}(c - \mu)^2 +\text{const},\] where \(\tau^2 = \left(\numData\dataStd^{-2} +\alpha_1^{-1}\right)^{-1}\) and \(\mu = \frac{\tau^2}{\dataStd^2} \sum_{i=1}^\numData(\dataScalar_i-m\inputScalar_i)\).
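
The completed square can be evaluated directly. A sketch, assuming made-up data and fixed values for \(m\), \(\dataStd^2\) and \(\alpha_1\), computing \(\tau^2\) and \(\mu\) as above:

```python
import numpy as np

# Made-up data and assumed values for m, sigma^2 and alpha_1.
x = np.array([1.0, 2.0, 3.0])
y = np.array([3.1, 4.9, 7.2])
m, sigma2, alpha1 = 2.0, 0.01, 1.0
n = len(y)

# Posterior over c from completing the square:
# tau^2 = (n/sigma^2 + 1/alpha_1)^(-1),  mu = tau^2/sigma^2 * sum_i (y_i - m*x_i)
tau2 = 1.0 / (n / sigma2 + 1.0 / alpha1)
mu = tau2 / sigma2 * np.sum(y - m * x)
print(f"posterior for c: mean = {mu:.3f}, variance = {tau2:.5f}")
```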

The Joint Density

  • Really want to know the joint posterior density over the parameters \(c\) and \(m\).
  • Could now integrate out over \(m\), but it’s easier to consider the multivariate case.

Two Dimensional Gaussian

  • Consider height, \(h\) (in metres), and weight, \(w\) (in kilograms).
  • Could sample height from a distribution: \[ h \sim \gaussianSamp{1.7}{0.0225}. \]
  • And similarly weight (see the sketch below): \[ w \sim \gaussianSamp{75}{36}. \]
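
A small sampling sketch of these two independent densities; note that numpy’s normal sampler takes the standard deviation, the square root of the variances above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Independent samples: h ~ N(1.7, 0.0225), w ~ N(75, 36).
h = rng.normal(1.7, np.sqrt(0.0225), size=1000)
w = rng.normal(75.0, np.sqrt(36.0), size=1000)

print("height: mean", h.mean(), "variance", h.var())
print("weight: mean", w.mean(), "variance", w.var())
```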

Height and Weight Models

Independence Assumption

  • We assume height and weight are independent.

    \[ p(w, h) = p(w)p(h). \]

Sampling Two Dimensional Variables

Body Mass Index

  • In reality they are dependent (body mass index \(= \frac{w}{h^2}\)).
  • To deal with this dependence we introduce correlated multivariate Gaussians.

Sampling Two Dimensional Variables

Independent Gaussians

\[ p(w, h) = p(w)p(h) \]

Independent Gaussians

\[ p(w, h) = \frac{1}{\sqrt{2\pi \dataStd_1^2}\sqrt{2\pi\dataStd_2^2}} \exp\left(-\frac{1}{2}\left(\frac{(w-\meanScalar_1)^2}{\dataStd_1^2} + \frac{(h-\meanScalar_2)^2}{\dataStd_2^2}\right)\right) \]

Independent Gaussians

\[ p(w, h) = \frac{1}{\sqrt{2\pi\dataStd_1^2}\sqrt{2\pi\dataStd_2^2}} \exp\left(-\frac{1}{2}\left(\begin{bmatrix}w \\ h\end{bmatrix} - \begin{bmatrix}\meanScalar_1 \\ \meanScalar_2\end{bmatrix}\right)^\top\begin{bmatrix}\dataStd_1^2& 0\\0&\dataStd_2^2\end{bmatrix}^{-1}\left(\begin{bmatrix}w \\ h\end{bmatrix} - \begin{bmatrix}\meanScalar_1 \\ \meanScalar_2\end{bmatrix}\right)\right) \]

Independent Gaussians

\[ p(\dataVector) = \frac{1}{\left|2\pi \mathbf{D}\right|^{\frac{1}{2}}} \exp\left(-\frac{1}{2}(\dataVector - \meanVector)^\top\mathbf{D}^{-1}(\dataVector - \meanVector)\right) \]

Correlated Gaussian

Form the correlated density from the original by rotating the data space using the matrix \(\rotationMatrix\).

\[ p(\dataVector) = \frac{1}{\left|2\pi\mathbf{D}\right|^{\frac{1}{2}}} \exp\left(-\frac{1}{2}(\dataVector - \meanVector)^\top\mathbf{D}^{-1}(\dataVector - \meanVector)\right) \]

Correlated Gaussian

Form the correlated density from the original by rotating the data space using the matrix \(\rotationMatrix\).

\[ p(\dataVector) = \frac{1}{\left|2\pi\mathbf{D}\right|^{\frac{1}{2}}} \exp\left(-\frac{1}{2}(\rotationMatrix^\top\dataVector - \rotationMatrix^\top\meanVector)^\top\mathbf{D}^{-1}(\rotationMatrix^\top\dataVector - \rotationMatrix^\top\meanVector)\right) \]

Correlated Gaussian

Form the correlated density from the original by rotating the data space using the matrix \(\rotationMatrix\).

\[ p(\dataVector) = \frac{1}{\left|2\pi\mathbf{D}\right|^{\frac{1}{2}}} \exp\left(-\frac{1}{2}(\dataVector - \meanVector)^\top\rotationMatrix\mathbf{D}^{-1}\rotationMatrix^\top(\dataVector - \meanVector)\right) \] this gives the inverse covariance matrix: \[ \covarianceMatrix^{-1} = \rotationMatrix \mathbf{D}^{-1} \rotationMatrix^\top \]

Correlated Gaussian

Form the correlated density from the original by rotating the data space using the matrix \(\rotationMatrix\).

\[ p(\dataVector) = \frac{1}{\left|2\pi\covarianceMatrix\right|^{\frac{1}{2}}} \exp\left(-\frac{1}{2}(\dataVector - \meanVector)^\top\covarianceMatrix^{-1} (\dataVector - \meanVector)\right) \] this gives the covariance matrix: \[ \covarianceMatrix = \rotationMatrix \mathbf{D} \rotationMatrix^\top \]
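
A sketch of this construction, assuming a diagonal \(\mathbf{D}\) and an arbitrary two-dimensional rotation angle: samples drawn with covariance \(\rotationMatrix \mathbf{D} \rotationMatrix^\top\) show the induced correlation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Diagonal (independent) covariance and an arbitrary rotation of the space.
D = np.diag([0.0225, 36.0])
theta = np.pi / 6
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

C = R @ D @ R.T                       # correlated covariance C = R D R^T
mu = np.array([1.7, 75.0])            # mean vector (illustrative values)

samples = rng.multivariate_normal(mu, C, size=2000)
print("empirical covariance:\n", np.cov(samples.T))
```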

The Prior Density

Let’s assume that the prior density is given by a zero mean Gaussian that is independent across each of the parameters, \[ \mappingVector \sim \gaussianSamp{\zerosVector}{\alpha \eye} \] In other words, we are assuming, for the prior, that each element of the parameter vector, \(\mappingScalar_i\), was drawn from a Gaussian density as follows \[ \mappingScalar_i \sim \gaussianSamp{0}{\alpha} \] Let’s start by assigning the parameter of the prior distribution, its variance \(\alpha\).

Further Reading

  • Multivariate Gaussians: Section 2.3 up to top of pg 85 of Bishop (2006)

  • Section 3.3 up to 159 (pg 152–159) of Bishop (2006)

Revisit Olympics Data

  • Use the Bayesian approach on the Olympics data with polynomials.

  • Choose a prior \(\mappingVector \sim \gaussianSamp{\zerosVector}{\alpha \eye}\) with \(\alpha = 1\).

  • Choose noise variance \(\dataStd^2 = 0.01\)

Sampling the Prior

  • Always useful to perform a ‘sanity check’ and sample from the prior before observing the data.
  • Since \(\dataVector = \basisMatrix \mappingVector + \noiseVector\) we just need to sample \[ \mappingVector \sim \gaussianSamp{\zerosVector}{\alpha\eye} \] \[ \noiseVector \sim \gaussianSamp{\zerosVector}{\dataStd^2\eye} \] with \(\alpha=1\) and \(\dataStd^2 = 0.01\) (see the sketch below).
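
A sketch of this prior sampling, assuming a hypothetical cubic polynomial basis on some made-up inputs, with \(\alpha = 1\) and \(\dataStd^2 = 0.01\).

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, sigma2 = 1.0, 0.01

# Hypothetical inputs and polynomial basis: columns 1, x, x^2, x^3.
x = np.linspace(-1.0, 1.0, 20)
Phi = np.vander(x, 4, increasing=True)
n, num_basis = Phi.shape

# Sample from the prior: w ~ N(0, alpha I), epsilon ~ N(0, sigma^2 I).
w = rng.normal(0.0, np.sqrt(alpha), size=num_basis)
epsilon = rng.normal(0.0, np.sqrt(sigma2), size=n)
y_sample = Phi @ w + epsilon
print(y_sample[:5])
```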

Computing the Posterior

\[ p(\mappingVector | \dataVector, \inputVector, \dataStd^2) = \gaussianDist{\mappingVector}{\meanVector_\mappingScalar}{\covarianceMatrix_\mappingScalar} \] with \[ \covarianceMatrix_\mappingScalar = \left(\dataStd^{-2}\basisMatrix^\top \basisMatrix + \alpha^{-1}\eye\right)^{-1} \] and \[ \meanVector_\mappingScalar = \covarianceMatrix_\mappingScalar \dataStd^{-2}\basisMatrix^\top \dataVector \]
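
A numpy sketch of these two expressions, with made-up data and a quadratic basis standing in for \(\basisMatrix\) and \(\dataVector\).

```python
import numpy as np

def posterior_w(Phi, y, alpha, sigma2):
    """Posterior N(m_w, C_w) for w, with prior N(0, alpha I) and noise variance sigma^2."""
    num_basis = Phi.shape[1]
    C_w = np.linalg.inv(Phi.T @ Phi / sigma2 + np.eye(num_basis) / alpha)
    m_w = C_w @ Phi.T @ y / sigma2
    return m_w, C_w

# Made-up data and a quadratic basis.
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 20)
y = 1.0 + 2.0 * x + 0.1 * rng.normal(size=x.size)
Phi = np.vander(x, 3, increasing=True)
m_w, C_w = posterior_w(Phi, y, alpha=1.0, sigma2=0.01)
print("posterior mean:", m_w)
```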

Olympic Marathon Data

  • Gold medal times for Olympic Marathon since 1896.
  • Marathons before 1924 didn’t have a standardised distance.
  • Present results using pace per km.
  • In 1904 the Marathon was badly organised, leading to very slow times.
Image from Wikimedia Commons http://bit.ly/16kMKHQ

Olympic Marathon Data

Olympic Data with Bayesian Polynomials

Hold Out Validation

5-fold Cross Validation

Model Fit

  • Marginal likelihood doesn’t always increase as model order increases.
  • Bayesian model always has 2 parameters, regardless of how many basis functions (and here we didn’t even fit them).
  • The maximum likelihood model overfits as the number of parameters increases.
  • Revisit maximum likelihood solution with validation set.

Regularized Mean

  • Validation fit here based on mean solution for \(\mappingVector\) only.
  • For Bayesian solution \[ \meanVector_w = \left[\dataStd^{-2}\basisMatrix^\top\basisMatrix + \alpha^{-1}\eye\right]^{-1} \dataStd^{-2} \basisMatrix^\top \dataVector \] instead of \[ \mappingVector^* = \left[\basisMatrix^\top\basisMatrix\right]^{-1} \basisMatrix^\top \dataVector \]
  • The two are equivalent when \(\alpha \rightarrow \infty\).
  • Equivalent to a prior for \(\mappingVector\) with infinite variance.
  • In other cases \(\alpha \eye\) regularizes the system (keeps parameters smaller), as the sketch below illustrates.
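
A short comparison sketch on made-up data: as \(\alpha\) grows, the regularized mean approaches the unregularized least squares solution.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 20)
y = 1.0 + 2.0 * x + 0.1 * rng.normal(size=x.size)
Phi = np.vander(x, 3, increasing=True)
sigma2 = 0.01

def regularized_mean(alpha):
    # Bayesian posterior mean: [sigma^-2 Phi^T Phi + alpha^-1 I]^-1 sigma^-2 Phi^T y
    C_w = np.linalg.inv(Phi.T @ Phi / sigma2 + np.eye(Phi.shape[1]) / alpha)
    return C_w @ Phi.T @ y / sigma2

w_star = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)   # unregularized solution
for alpha in [0.1, 1.0, 1e6]:
    gap = np.max(np.abs(regularized_mean(alpha) - w_star))
    print(f"alpha = {alpha:g}: max difference from w* = {gap:.2e}")
```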

Sampling from the Posterior

  • Now check samples by drawing \(\mappingVector\) from the posterior.
  • Now for \(\dataVector = \basisMatrix \mappingVector + \noiseVector\) need \[ \mappingVector \sim \gaussianSamp{\meanVector_w}{\covarianceMatrix_w} \] with \(\covarianceMatrix_w = \left[\dataStd^{-2}\basisMatrix^\top \basisMatrix + \alpha^{-1}\eye\right]^{-1}\) and \(\meanVector_w =\covarianceMatrix_w \dataStd^{-2} \basisMatrix^\top \dataVector\) \[ \noiseVector \sim \gaussianSamp{\zerosVector}{\dataStd^2\eye} \] with \(\alpha=1\) and \(\dataStd^2 = 0.01\).
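
A sampling sketch, reusing the posterior computation from the earlier sketch on the same made-up basis and data.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, sigma2 = 1.0, 0.01

# Made-up data and basis, as in the earlier sketches.
x = np.linspace(-1.0, 1.0, 20)
y = 1.0 + 2.0 * x + 0.1 * rng.normal(size=x.size)
Phi = np.vander(x, 3, increasing=True)

# Posterior covariance and mean for w.
C_w = np.linalg.inv(Phi.T @ Phi / sigma2 + np.eye(Phi.shape[1]) / alpha)
m_w = C_w @ Phi.T @ y / sigma2

# Draw w from the posterior and add observation noise.
w = rng.multivariate_normal(m_w, C_w)
y_sample = Phi @ w + rng.normal(0.0, np.sqrt(sigma2), size=Phi.shape[0])
print(y_sample[:5])
```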

Marginal Likelihood

  • The marginal likelihood can also be computed; it has the form: \[ p(\dataVector|\inputMatrix, \dataStd^2, \alpha) = \frac{1}{(2\pi)^\frac{\numData}{2}\left|\kernelMatrix\right|^\frac{1}{2}} \exp\left(-\frac{1}{2} \dataVector^\top \kernelMatrix^{-1} \dataVector\right) \] where \(\kernelMatrix = \alpha \basisMatrix\basisMatrix^\top + \dataStd^2 \eye\).

  • So it is a zero mean \(\numData\)-dimensional Gaussian with covariance matrix \(\kernelMatrix\) (see the sketch below).
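
A sketch of the log marginal likelihood computation, using a log-determinant for numerical stability; the data and basis are made up.

```python
import numpy as np

def log_marginal_likelihood(Phi, y, alpha, sigma2):
    """log p(y | X, sigma^2, alpha) with K = alpha * Phi Phi^T + sigma^2 * I."""
    n = y.shape[0]
    K = alpha * Phi @ Phi.T + sigma2 * np.eye(n)
    _, logdet = np.linalg.slogdet(K)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + y @ np.linalg.solve(K, y))

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 20)
y = 1.0 + 2.0 * x + 0.1 * rng.normal(size=x.size)
Phi = np.vander(x, 3, increasing=True)
print(log_marginal_likelihood(Phi, y, alpha=1.0, sigma2=0.01))
```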

Computing the Expected Output

  • Given the posterior for the parameters, how can we compute the expected output at a given location?
  • Output of model at location \(\inputVector_i\) is given by \[ \mappingFunction(\inputVector_i; \mappingVector) = \basisVector_i^\top \mappingVector \]
  • We want the expected output under the posterior density, \(p(\mappingVector|\dataVector, \inputMatrix, \dataStd^2, \alpha)\).
  • Mean of mapping function will be given by \[ \begin{aligned} \left\langle \mappingFunction(\inputVector_i; \mappingVector)\right\rangle_{p(\mappingVector|\dataVector, \inputMatrix, \dataStd^2, \alpha)} &= \basisVector_i^\top \left\langle\mappingVector\right\rangle_{p(\mappingVector|\dataVector, \inputMatrix, \dataStd^2, \alpha)} \\ & = \basisVector_i^\top \meanVector_w \end{aligned} \]

Variance of Expected Output

  • Variance of model at location \(\inputVector_i\) is given by \[ \begin{aligned}\text{var}(\mappingFunction(\inputVector_i; \mappingVector)) &= \left\langle(\mappingFunction(\inputVector_i; \mappingVector))^2\right\rangle - \left\langle \mappingFunction(\inputVector_i; \mappingVector)\right\rangle^2 \\&= \basisVector_i^\top\left\langle\mappingVector\mappingVector^\top\right\rangle \basisVector_i - \basisVector_i^\top \left\langle\mappingVector\right\rangle\left\langle\mappingVector\right\rangle^\top \basisVector_i \\&= \basisVector_i^\top \covarianceMatrix_w\basisVector_i \end{aligned} \] where all these expectations are taken under the posterior density, \(p(\mappingVector|\dataVector, \inputMatrix, \dataStd^2, \alpha)\), as sketched below.
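
A sketch computing the predictive mean \(\basisVector_i^\top \meanVector_w\) and variance \(\basisVector_i^\top \covarianceMatrix_w \basisVector_i\) at a new input, again with made-up data and a quadratic basis.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, sigma2 = 1.0, 0.01

# Made-up data and a quadratic basis.
x = np.linspace(-1.0, 1.0, 20)
y = 1.0 + 2.0 * x + 0.1 * rng.normal(size=x.size)
Phi = np.vander(x, 3, increasing=True)

# Posterior covariance and mean for w.
C_w = np.linalg.inv(Phi.T @ Phi / sigma2 + np.eye(Phi.shape[1]) / alpha)
m_w = C_w @ Phi.T @ y / sigma2

# Basis vector at a new input location.
x_star = 0.5
phi_star = np.array([1.0, x_star, x_star**2])

mean_f = phi_star @ m_w             # <f(x_star)> = phi^T m_w
var_f = phi_star @ C_w @ phi_star   # var(f(x_star)) = phi^T C_w phi
print("mean:", mean_f, "variance:", var_f)
```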

Further Reading

  • Section 3.7–3.8 (pg 122–133) of Rogers and Girolami (2011)

  • Section 3.4 (pg 161–165) of Bishop (2006)

Thanks!

References

Bishop, C.M., 2006. Pattern recognition and machine learning. Springer.

Rogers, S., Girolami, M., 2011. A first course in machine learning. CRC Press.