
# Bayesian Regression

University of Sheffield

## Overdetermined System

• With two unknowns and two observations: \begin{aligned} \dataScalar_1 = & m\inputScalar_1 + c\\ \dataScalar_2 = & m\inputScalar_2 + c \end{aligned}
• Additional observation leads to overdetermined system. $\dataScalar_3 = m\inputScalar_3 + c$
• This problem is solved through a noise model $\noiseScalar \sim \gaussianSamp{0}{\dataStd^2}$ \begin{aligned} \dataScalar_1 = m\inputScalar_1 + c + \noiseScalar_1\\ \dataScalar_2 = m\inputScalar_2 + c + \noiseScalar_2\\ \dataScalar_3 = m\inputScalar_3 + c + \noiseScalar_3 \end{aligned}
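A minimal numpy sketch of this setup (the data values are illustrative, not from the lecture): with three noisy observations of the line, the maximum likelihood fit under the Gaussian noise model is the least squares solution.

```python
import numpy as np

# three observations of y = m*x + c + noise (illustrative values)
x = np.array([1.0, 2.0, 3.0])
y = np.array([3.1, 4.9, 7.2])

# design matrix with a column for the slope m and a column for the offset c
X = np.column_stack([x, np.ones_like(x)])

# least squares solution: with Gaussian noise this is the maximum likelihood fit
m_hat, c_hat = np.linalg.lstsq(X, y, rcond=None)[0]
print(m_hat, c_hat)
```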

## Underdetermined System

• What about two unknowns and one observation? $\dataScalar_1 = m\inputScalar_1 + c$

Can compute $m$ given $c$. $m = \frac{\dataScalar_1 - c}{\inputScalar_1}$

## The Bayesian Controversy: Philosophical Underpinnings

A segment from the 2012 lecture on philosophical underpinnings.

## Noise Models

• We aren’t modeling entire system.
• Noise model gives mismatch between model and data.
• Gaussian model justified by appeal to central limit theorem.
• Other models also possible (Student-$t$ for heavy tails).
• Maximum likelihood with Gaussian noise leads to least squares.

## Different Types of Uncertainty

• The first type of uncertainty we are assuming is aleatoric uncertainty.
• The second type of uncertainty we are assuming is epistemic uncertainty.

## Aleatoric Uncertainty

• This is uncertainty we couldn’t know even if we wanted to, e.g. the result of a football match before it’s played.
• Where a sheet of paper might land on the floor.

## Epistemic Uncertainty

• This is uncertainty we could in principle know the answer to. We just haven’t observed enough yet, e.g. the result of a football match after it’s played.
• What colour socks your lecturer is wearing.

• Section 1.2.3 (pg 21–24) of Bishop (2006)

• Sections 3.1–3.4 (pg 95–117) of Rogers and Girolami (2011)

• Section 1.2.6 (start from just past eq 1.64 pg 30-32) of Bishop (2006)

## Prior Distribution

• Bayesian inference requires a prior on the parameters.
• The prior represents your belief, before you see the data, about the likely values of the parameters.
• For linear regression, consider a Gaussian prior on the intercept:

$c \sim \gaussianSamp{0}{\alpha_1}$

## Posterior Distribution

• Posterior distribution is found by combining the prior with the likelihood.
• The posterior distribution represents your belief, after you see the data, about the likely values of the parameters.
• The posterior is found through Bayes’ Rule $p(c|\dataScalar) = \frac{p(\dataScalar|c)p(c)}{p(\dataScalar)}$

$\text{posterior} = \frac{\text{likelihood}\times \text{prior}}{\text{marginal likelihood}}.$

## Stages to Derivation of the Posterior

• Multiply likelihood by prior
• Because both are “exponentiated quadratics”, the product is also an exponentiated quadratic, since $\exp(a^2)\exp(b^2) = \exp(a^2 + b^2)$.
• Complete the square to get the resulting density in the form of a Gaussian.
• Recognise the mean and (co)variance of the Gaussian. This is the estimate of the posterior.

## Main Trick

$p(c) = \frac{1}{\sqrt{2\pi\alpha_1}} \exp\left(-\frac{1}{2\alpha_1}c^2\right)$ $p(\dataVector|\inputVector, c, m, \dataStd^2) = \frac{1}{\left(2\pi\dataStd^2\right)^{\frac{\numData}{2}}} \exp\left(-\frac{1}{2\dataStd^2}\sum_{i=1}^\numData(\dataScalar_i - m\inputScalar_i - c)^2\right)$

$p(c| \dataVector, \inputVector, m, \dataStd^2) = \frac{p(\dataVector|\inputVector, c, m, \dataStd^2)p(c)}{p(\dataVector|\inputVector, m, \dataStd^2)}$

$p(c| \dataVector, \inputVector, m, \dataStd^2) = \frac{p(\dataVector|\inputVector, c, m, \dataStd^2)p(c)}{\int p(\dataVector|\inputVector, c, m, \dataStd^2)p(c) \text{d} c}$

$p(c| \dataVector, \inputVector, m, \dataStd^2) \propto p(\dataVector|\inputVector, c, m, \dataStd^2)p(c)$

\begin{aligned} \log p(c | \dataVector, \inputVector, m, \dataStd^2) =&-\frac{1}{2\dataStd^2} \sum_{i=1}^\numData(\dataScalar_i-c - m\inputScalar_i)^2-\frac{1}{2\alpha_1} c^2 + \text{const}\\ = &-\frac{1}{2\dataStd^2}\sum_{i=1}^\numData(\dataScalar_i-m\inputScalar_i)^2 -\left(\frac{\numData}{2\dataStd^2} + \frac{1}{2\alpha_1}\right)c^2\\ & + c\frac{\sum_{i=1}^\numData(\dataScalar_i-m\inputScalar_i)}{\dataStd^2}, \end{aligned}

complete the square of the quadratic form to obtain $\log p(c | \dataVector, \inputVector, m, \dataStd^2) = -\frac{1}{2\tau^2}(c - \mu)^2 +\text{const},$ where $\tau^2 = \left(\numData\dataStd^{-2} +\alpha_1^{-1}\right)^{-1}$ and $\mu = \frac{\tau^2}{\dataStd^2} \sum_{i=1}^\numData(\dataScalar_i-m\inputScalar_i)$.
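As a numeric sketch of this result (the variable names and data values are mine, not from the lecture), the posterior mean and variance of $c$ follow directly from the expressions for $\mu$ and $\tau^2$:

```python
import numpy as np

# illustrative data and fixed values for the slope, noise variance and prior variance
x = np.array([1.0, 2.0, 3.0])
y = np.array([3.1, 4.9, 7.2])
m = 2.0          # slope, assumed known here
sigma2 = 0.5     # noise variance
alpha1 = 1.0     # prior variance on c

n = len(y)
# posterior variance and mean of c from completing the square
tau2 = 1.0 / (n / sigma2 + 1.0 / alpha1)
mu = (tau2 / sigma2) * np.sum(y - m * x)
print(mu, tau2)
```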


## The Joint Density

• Really want to know the joint posterior density over the parameters $c$ and $m$.
• Could now integrate out over $m$, but it’s easier to consider the multivariate case.

## Two Dimensional Gaussian

• Consider height, $h/m$ and weight, $w/kg$.
• Could sample height from a distribution: $h \sim \gaussianSamp{1.7}{0.0225}.$
• And similarly weight: $w \sim \gaussianSamp{75}{36}.$

## Independence Assumption

• We assume height and weight are independent.

$p(w, h) = p(w)p(h).$
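A quick sketch of what the independence assumption means for sampling (distribution parameters taken from the slide above): each joint sample is simply a height drawn on its own paired with a weight drawn on its own.

```python
import numpy as np

rng = np.random.default_rng(0)

# independent samples of height (metres) and weight (kg)
h = rng.normal(1.7, np.sqrt(0.0225), size=1000)   # variance 0.0225
w = rng.normal(75.0, np.sqrt(36.0), size=1000)    # variance 36

# under independence the joint samples are just the pairs (w_i, h_i)
samples = np.column_stack([w, h])
```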

## Body Mass Index

• In reality they are dependent: body mass index $= \frac{w}{h^2}$.
• To deal with this dependence we introduce correlated multivariate Gaussians.

## Independent Gaussians

$p(w, h) = p(w)p(h)$

## Independent Gaussians

$p(w, h) = \frac{1}{\sqrt{2\pi \dataStd_1^2}\sqrt{2\pi\dataStd_2^2}} \exp\left(-\frac{1}{2}\left(\frac{(w-\meanScalar_1)^2}{\dataStd_1^2} + \frac{(h-\meanScalar_2)^2}{\dataStd_2^2}\right)\right)$

## Independent Gaussians

$p(w, h) = \frac{1}{\sqrt{2\pi\dataStd_1^22\pi\dataStd_2^2}} \exp\left(-\frac{1}{2}\left(\begin{bmatrix}w \\ h\end{bmatrix} - \begin{bmatrix}\meanScalar_1 \\ \meanScalar_2\end{bmatrix}\right)^\top\begin{bmatrix}\dataStd_1^2& 0\\0&\dataStd_2^2\end{bmatrix}^{-1}\left(\begin{bmatrix}w \\ h\end{bmatrix} - \begin{bmatrix}\meanScalar_1 \\ \meanScalar_2\end{bmatrix}\right)\right)$

## Independent Gaussians

$p(\dataVector) = \frac{1}{\det{2\pi \mathbf{D}}^{\frac{1}{2}}} \exp\left(-\frac{1}{2}(\dataVector - \meanVector)^\top\mathbf{D}^{-1}(\dataVector - \meanVector)\right)$

## Correlated Gaussian

Form correlated from original by rotating the data space using matrix $\rotationMatrix$.

$p(\dataVector) = \frac{1}{\det{2\pi\mathbf{D}}^{\frac{1}{2}}} \exp\left(-\frac{1}{2}(\dataVector - \meanVector)^\top\mathbf{D}^{-1}(\dataVector - \meanVector)\right)$

## Correlated Gaussian

Form correlated from original by rotating the data space using matrix $\rotationMatrix$.

$p(\dataVector) = \frac{1}{\det{2\pi\mathbf{D}}^{\frac{1}{2}}} \exp\left(-\frac{1}{2}(\rotationMatrix^\top\dataVector - \rotationMatrix^\top\meanVector)^\top\mathbf{D}^{-1}(\rotationMatrix^\top\dataVector - \rotationMatrix^\top\meanVector)\right)$

## Correlated Gaussian

Form correlated from original by rotating the data space using matrix $\rotationMatrix$.

$p(\dataVector) = \frac{1}{\det{2\pi\mathbf{D}}^{\frac{1}{2}}} \exp\left(-\frac{1}{2}(\dataVector - \meanVector)^\top\rotationMatrix\mathbf{D}^{-1}\rotationMatrix^\top(\dataVector - \meanVector)\right)$ this gives a covariance matrix: $\covarianceMatrix^{-1} = \rotationMatrix \mathbf{D}^{-1} \rotationMatrix^\top$

## Correlated Gaussian

Form correlated from original by rotating the data space using matrix $\rotationMatrix$.

$p(\dataVector) = \frac{1}{\det{2\pi\covarianceMatrix}^{\frac{1}{2}}} \exp\left(-\frac{1}{2}(\dataVector - \meanVector)^\top\covarianceMatrix^{-1} (\dataVector - \meanVector)\right)$ this gives a covariance matrix: $\covarianceMatrix = \rotationMatrix \mathbf{D} \rotationMatrix^\top$
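A sketch of this construction in numpy (the rotation angle and variances are illustrative): build the covariance as $\covarianceMatrix = \rotationMatrix\mathbf{D}\rotationMatrix^\top$ and generate correlated samples by rotating independent Gaussian draws.

```python
import numpy as np

rng = np.random.default_rng(0)

# diagonal matrix of variances and a rotation matrix
D = np.diag([0.0225, 36.0])
theta = 0.4                                   # illustrative rotation angle
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

# covariance of the correlated Gaussian
C = R @ D @ R.T

# sample by rotating independent samples (equivalently use rng.multivariate_normal)
z = rng.normal(size=(1000, 2)) * np.sqrt(np.diag(D))
y = z @ R.T
print(np.cov(y.T))   # should be close to C
```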

## The Prior Density

Let’s assume that the prior density is given by a zero mean Gaussian, which is independent across each of the parameters, $\mappingVector \sim \gaussianSamp{\zerosVector}{\alpha \eye}.$ In other words, we are assuming, for the prior, that each element of the parameter vector, $\mappingScalar_i$, was drawn from a Gaussian density as follows, $\mappingScalar_i \sim \gaussianSamp{0}{\alpha}.$ Let’s start by assigning the parameter of the prior distribution, which is the variance of the prior distribution, $\alpha$.
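A minimal sketch of this step (the value of $\alpha$ and the number of basis functions are illustrative choices): set the prior variance and draw a parameter vector from $\gaussianSamp{\zerosVector}{\alpha\eye}$.

```python
import numpy as np

rng = np.random.default_rng(0)

alpha = 1.0        # variance of the prior (illustrative value)
num_basis = 4      # length of the parameter vector w (illustrative)

# draw a parameter vector from the prior w ~ N(0, alpha * I)
w = rng.normal(0.0, np.sqrt(alpha), size=num_basis)
```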

• Multivariate Gaussians: Section 2.3 up to top of pg 85 of Bishop (2006)

• Section 3.3 up to 159 (pg 152–159) of Bishop (2006)

## Revisit Olympics Data

• Use Bayesian approach on olympics data with polynomials.

• Choose a prior $\mappingVector \sim \gaussianSamp{\zerosVector}{\alpha \eye}$ with $\alpha = 1$.

• Choose noise variance $\dataStd^2 = 0.01$

## Sampling the Prior

• Always useful to perform a ‘sanity check’ and sample from the prior before observing the data.
• Since $\dataVector = \basisMatrix \mappingVector + \noiseVector$ just need to sample $\mappingVector \sim \gaussianSamp{\zerosVector}{\alpha\eye}$ and $\noiseVector \sim \gaussianSamp{\zerosVector}{\dataStd^2\eye}$ with $\alpha=1$ and $\dataStd^2 = 0.01$.
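A sketch of this sanity check (the polynomial basis and its degree are my illustrative choices; only $\alpha=1$ and $\dataStd^2=0.01$ come from the slide): draw parameter vectors from the prior and push them through the basis matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

alpha, sigma2 = 1.0, 0.01            # prior variance and noise variance from the slide
x = np.linspace(-1, 1, 100)[:, None]

# polynomial basis (the basis choice and degree are illustrative)
degree = 4
Phi = np.hstack([x**i for i in range(degree + 1)])

# sample functions from the prior: w ~ N(0, alpha I), y = Phi w + noise
prior_samples = []
for _ in range(5):
    w = rng.normal(0.0, np.sqrt(alpha), size=(Phi.shape[1], 1))
    noise = rng.normal(0.0, np.sqrt(sigma2), size=x.shape)
    prior_samples.append(Phi @ w + noise)
```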

## Computing the Posterior

$p(\mappingVector | \dataVector, \inputVector, \dataStd^2) = \gaussianDist{\mappingVector}{\meanVector_\mappingScalar}{\covarianceMatrix_\mappingScalar}$ with $\covarianceMatrix_\mappingScalar = \left(\dataStd^{-2}\basisMatrix^\top \basisMatrix + \alpha^{-1}\eye\right)^{-1}$ and $\meanVector_\mappingScalar = \covarianceMatrix_\mappingScalar \dataStd^{-2}\basisMatrix^\top \dataVector$
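These expressions translate directly into code; a minimal sketch, assuming `Phi` is the $\numData \times k$ basis matrix and `y` a column vector of targets (the function name is mine; a numerically safer version would use a Cholesky solve rather than an explicit inverse):

```python
import numpy as np

def posterior(Phi, y, alpha, sigma2):
    """Posterior mean and covariance of w for Bayesian linear regression."""
    k = Phi.shape[1]
    # C_w = (sigma^{-2} Phi^T Phi + alpha^{-1} I)^{-1}
    C_w = np.linalg.inv(Phi.T @ Phi / sigma2 + np.eye(k) / alpha)
    # mu_w = C_w sigma^{-2} Phi^T y
    mu_w = C_w @ Phi.T @ y / sigma2
    return mu_w, C_w
```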

## Olympic Marathon Data

Gold medal times for the Olympic Marathon since 1896. Marathons before 1924 didn’t have a standardised distance. Present results using pace per km. In 1904 the Marathon was badly organised, leading to very slow times. Image from Wikimedia Commons http://bit.ly/16kMKHQ

## Model Fit

• Marginal likelihood doesn’t always increase as model order increases.
• Bayesian model always has 2 parameters, regardless of how many basis functions (and here we didn’t even fit them).
• Maximum likelihood model overfits as the number of parameters increases.
• Revisit maximum likelihood solution with validation set.

## Regularized Mean

• Validation fit here based on mean solution for $\mappingVector$ only.
• For Bayesian solution $\meanVector_w = \left[\dataStd^{-2}\basisMatrix^\top\basisMatrix + \alpha^{-1}\eye\right]^{-1} \dataStd^{-2} \basisMatrix^\top \dataVector$ instead of $\mappingVector^* = \left[\basisMatrix^\top\basisMatrix\right]^{-1} \basisMatrix^\top \dataVector$
• The two are equivalent when $\alpha \rightarrow \infty$.
• Equivalent to a prior for $\mappingVector$ with infinite variance.
• In other cases $\alpha \eye$ regularizes the system (keeps parameters smaller).
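A small numeric check of this behaviour (the design matrix, targets, and variances are illustrative, not the Olympics data): as $\alpha$ grows, the regularized mean approaches the least squares solution.

```python
import numpy as np

rng = np.random.default_rng(0)

# small illustrative design matrix and targets
Phi = rng.normal(size=(20, 3))
y = Phi @ np.array([0.5, -1.0, 2.0]) + rng.normal(0.0, 0.1, size=20)
sigma2 = 0.01

def regularized_mean(alpha):
    # Bayesian posterior mean of w (regularized solution)
    k = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi / sigma2 + np.eye(k) / alpha,
                           Phi.T @ y / sigma2)

# maximum likelihood / least squares solution
w_ml = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

print(regularized_mean(1.0))   # shrunk towards zero
print(regularized_mean(1e8))   # approaches the least squares solution
print(w_ml)
```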

## Sampling from the Posterior

• Now check samples by extracting $\mappingVector$ from the posterior.
• Now for $\dataVector = \basisMatrix \mappingVector + \noiseVector$ need $\mappingVector \sim \gaussianSamp{\meanVector_w}{\covarianceMatrix_w}$ with $\covarianceMatrix_w = \left[\dataStd^{-2}\basisMatrix^\top \basisMatrix + \alpha^{-1}\eye\right]^{-1}$ and $\meanVector_w =\covarianceMatrix_w \dataStd^{-2} \basisMatrix^\top \dataVector$ $\noiseVector \sim \gaussianSamp{\zerosVector}{\dataStd^2\eye}$ with $\alpha=1$ and $\dataStd^2 = 0.01$.
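A sketch of drawing these samples (the helper name is mine; it assumes `mu_w` and `C_w` were computed as on the posterior slide above):

```python
import numpy as np

def sample_posterior_function(Phi, mu_w, C_w, sigma2, rng):
    # draw w ~ N(mu_w, C_w) and push it through the basis, adding observation noise
    w = rng.multivariate_normal(mu_w.ravel(), C_w)[:, None]
    noise = rng.normal(0.0, np.sqrt(sigma2), size=(Phi.shape[0], 1))
    return Phi @ w + noise
```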

## Marginal Likelihood

• The marginal likelihood can also be computed; it has the form: $p(\dataVector|\inputMatrix, \dataStd^2, \alpha) = \frac{1}{(2\pi)^\frac{\numData}{2}\left|\kernelMatrix\right|^\frac{1}{2}} \exp\left(-\frac{1}{2} \dataVector^\top \kernelMatrix^{-1} \dataVector\right)$ where $\kernelMatrix = \alpha \basisMatrix\basisMatrix^\top + \dataStd^2 \eye$.

• So it is a zero mean $\numData$-dimensional Gaussian with covariance matrix $\kernelMatrix$.
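A sketch of evaluating the log marginal likelihood (the function name is mine; a Cholesky factor of $\kernelMatrix$ gives both the log determinant and the solve):

```python
import numpy as np

def log_marginal_likelihood(Phi, y, alpha, sigma2):
    """Log of the zero-mean Gaussian marginal likelihood; y is an n x 1 column vector."""
    n = Phi.shape[0]
    # K = alpha * Phi Phi^T + sigma^2 I
    K = alpha * Phi @ Phi.T + sigma2 * np.eye(n)
    L = np.linalg.cholesky(K)
    Kinv_y = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * n * np.log(2 * np.pi)
            - np.sum(np.log(np.diag(L)))          # -0.5 * log|K|
            - 0.5 * float(y.T @ Kinv_y))
```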

## Computing the Expected Output

• Given the posterior for the parameters, how can we compute the expected output at a given location?
• Output of model at location $\inputVector_i$ is given by $\mappingFunction(\inputVector_i; \mappingVector) = \basisVector_i^\top \mappingVector$
• We want the expected output under the posterior density, $p(\mappingVector|\dataVector, \inputMatrix, \dataStd^2, \alpha)$.
• Mean of mapping function will be given by \begin{aligned} \left\langle \mappingFunction(\inputVector_i; \mappingVector)\right\rangle_{p(\mappingVector|\dataVector, \inputMatrix, \dataStd^2, \alpha)} &= \basisVector_i^\top \left\langle\mappingVector\right\rangle_{p(\mappingVector|\dataVector, \inputMatrix, \dataStd^2, \alpha)} \\ & = \basisVector_i^\top \meanVector_w \end{aligned}

## Variance of Expected Output

• Variance of model at location $\inputVector_i$ is given by \begin{aligned}\text{var}(\mappingFunction(\inputVector_i; \mappingVector)) &= \left\langle(\mappingFunction(\inputVector_i; \mappingVector))^2\right\rangle - \left\langle \mappingFunction(\inputVector_i; \mappingVector)\right\rangle^2 \\&= \basisVector_i^\top\left\langle\mappingVector\mappingVector^\top\right\rangle \basisVector_i - \basisVector_i^\top \left\langle\mappingVector\right\rangle\left\langle\mappingVector\right\rangle^\top \basisVector_i \\&= \basisVector_i^\top \covarianceMatrix_w\basisVector_i \end{aligned} where all these expectations are taken under the posterior density, $p(\mappingVector|\dataVector, \inputMatrix, \dataStd^2, \alpha)$.
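Both quantities come straight from the posterior mean and covariance of the weights; a minimal sketch (the function and argument names are mine, and the variance returned is for the noise-free function $\mappingFunction$, not the noisy observation):

```python
import numpy as np

def predict(phi_i, mu_w, C_w):
    # phi_i: basis vector at a test input; mu_w, C_w: posterior mean and covariance of w
    mean = float(phi_i.T @ mu_w)          # <f(x_i; w)> = phi_i^T mu_w
    var = float(phi_i.T @ C_w @ phi_i)    # var(f(x_i; w)) = phi_i^T C_w phi_i
    return mean, var
```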