With two unknowns and two observations: \[
\begin{aligned}
\dataScalar_1 = & m\inputScalar_1 + c\\
\dataScalar_2 = & m\inputScalar_2 + c
\end{aligned}
\]
An additional observation leads to an overdetermined system. \[
\dataScalar_3 = m\inputScalar_3 + c
\]
This problem is solved through a noise model \(\noiseScalar \sim \gaussianSamp{0}{\dataStd^2}\)\[
\begin{aligned}
\dataScalar_1 = m\inputScalar_1 + c + \noiseScalar_1\\
\dataScalar_2 = m\inputScalar_2 + c + \noiseScalar_2\\
\dataScalar_3 = m\inputScalar_3 + c + \noiseScalar_3
\end{aligned}
\]
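As a minimal numerical sketch (with made-up observations), the first two equations can be solved exactly for \(m\) and \(c\), while adding a third turns the problem into a least-squares fit:

```python
import numpy as np

# Hypothetical observations of y = m*x + c.
x = np.array([1.0, 3.0, 4.0])
y = np.array([3.1, 7.0, 9.2])

# Two unknowns, two observations: solve the linear system exactly.
A2 = np.stack([x[:2], np.ones(2)], axis=1)    # rows are [x_i, 1]
m_exact, c_exact = np.linalg.solve(A2, y[:2])

# A third observation makes the system overdetermined; least squares
# finds the best compromise (the maximum likelihood fit under Gaussian noise).
A3 = np.stack([x, np.ones(3)], axis=1)
(m_ls, c_ls), *_ = np.linalg.lstsq(A3, y, rcond=None)

print(m_exact, c_exact)
print(m_ls, c_ls)
```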
Underdetermined System
What about two unknowns and one observation? \[\dataScalar_1 = m\inputScalar_1 + c\]
Can compute \(m\) given \(c\). \[m = \frac{\dataScalar_1 - c}{\inputScalar_1}\]
The Bayesian Controversy: Philosophical Underpinnings
A segment from the lecture in 2012 on philosophical underpinnings.
Noise Models
We aren’t modeling the entire system.
The noise model captures the mismatch between model and data.
Gaussian model justified by appeal to central limit theorem.
Other models also possible (Student-\(t\) for heavy tails).
Maximum likelihood with Gaussian noise leads to least squares.
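A small sketch of the last point, on made-up data with an assumed fixed noise variance: the Gaussian negative log-likelihood differs from the sum of squared errors only by a scaling and an additive constant, so both objectives are minimised by the same \(m\) and \(c\).

```python
import numpy as np

def neg_log_likelihood(m, c, x, y, sigma2):
    """Negative log-likelihood under y_i = m*x_i + c + Gaussian noise of variance sigma2."""
    residuals = y - m * x - c
    n = len(y)
    return 0.5 * n * np.log(2 * np.pi * sigma2) + 0.5 * np.sum(residuals**2) / sigma2

def sum_of_squares(m, c, x, y):
    """Least-squares objective for the same model."""
    return np.sum((y - m * x - c) ** 2)

# Hypothetical data: for fixed sigma2 the two objectives are an affine
# transformation of each other, so they share the same minimiser in (m, c).
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.9, 5.1, 6.8, 9.2])
for m, c in [(2.0, 1.0), (2.1, 0.8), (1.9, 1.2)]:
    print(neg_log_likelihood(m, c, x, y, sigma2=0.01), sum_of_squares(m, c, x, y))
```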
Different Types of Uncertainty
The first type of uncertainty we are assuming is aleatoric uncertainty.
The second type of uncertainty we are assuming is epistemic uncertainty.
Aleatoric Uncertainty
This is uncertainty we couldn’t resolve even if we wanted to, e.g. the result of a football match before it’s played.
Where a sheet of paper might land on the floor.
Epistemic Uncertainty
This is uncertainty we could in principle know the answer to. We just haven’t observed enough yet, e.g. the result of a football match after it’s played.
What colour socks your lecturer is wearing.
Further Reading
Section 1.2.3 (pg 21–24) of Bishop (2006)
Sections 3.1–3.4 (pg 95–117) of Rogers and Girolami (2011)
Section 1.2.6 (start from just past eq 1.64 pg 30-32) of Bishop (2006)
Prior Distribution
Bayesian inference requires a prior on the parameters.
The prior represents your belief, before you see the data, about the likely value of the parameters.
For linear regression, consider a Gaussian prior on the intercept:
\[c \sim \gaussianSamp{0}{\alpha_1}\]
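As a quick illustration (with an assumed prior variance \(\alpha_1 = 1\)), sampling intercepts from this prior and pairing each with a fixed slope gives a family of candidate lines:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha_1 = 1.0   # assumed prior variance on the intercept
m = 2.0         # slope held fixed for the illustration

# Draw intercepts from the prior c ~ N(0, alpha_1).
c_samples = rng.normal(loc=0.0, scale=np.sqrt(alpha_1), size=5)

# Each sampled intercept defines a different line y = m*x + c.
x = np.linspace(0.0, 1.0, 5)
lines = m * x[None, :] + c_samples[:, None]
print(c_samples)
```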
Posterior Distribution
The posterior distribution is found by combining the prior with the likelihood.
The posterior distribution is your belief, after you see the data, about the likely value of the parameters.
The posterior is found through Bayes’ Rule\[
p(c|\dataScalar) = \frac{p(\dataScalar|c)p(c)}{p(\dataScalar)}
\]
\[p(c| \dataVector, \inputVector, m, \dataStd^2) = \frac{p(\dataVector|\inputVector, c, m, \dataStd^2)p(c)}{p(\dataVector|\inputVector, m, \dataStd^2)}\]
\[p(c| \dataVector, \inputVector, m, \dataStd^2) = \frac{p(\dataVector|\inputVector, c, m, \dataStd^2)p(c)}{\int p(\dataVector|\inputVector, c, m, \dataStd^2)p(c) \text{d} c}\]
\[p(c| \dataVector, \inputVector, m, \dataStd^2) \propto p(\dataVector|\inputVector, c, m, \dataStd^2)p(c)\]
Complete the square of the quadratic form to obtain \[\log p(c | \dataVector, \inputVector, m, \dataStd^2) = -\frac{1}{2\tau^2}(c - \mu)^2 +\text{const},\] where \(\tau^2 = \left(\numData\dataStd^{-2} +\alpha_1^{-1}\right)^{-1}\) and \(\mu = \frac{\tau^2}{\dataStd^2} \sum_{i=1}^\numData(\dataScalar_i-m\inputScalar_i)\).
Main Trick
\[
p(c) = \frac{1}{\sqrt{2\pi\alpha_1}} \exp\left(-\frac{1}{2\alpha_1}c^2\right)
\]\[
p(\dataVector|\inputVector, c, m, \dataStd^2) = \frac{1}{\left(2\pi\dataStd^2\right)^{\frac{\numData}{2}}} \exp\left(-\frac{1}{2\dataStd^2}\sum_{i=1}^\numData(\dataScalar_i - m\inputScalar_i - c)^2\right)
\]\[
p(c| \dataVector, \inputVector, m, \dataStd^2) = \frac{p(\dataVector|\inputVector, c, m, \dataStd^2)p(c)}{p(\dataVector|\inputVector, m, \dataStd^2)}
\]\[
p(c| \dataVector, \inputVector, m, \dataStd^2) = \frac{p(\dataVector|\inputVector, c, m, \dataStd^2)p(c)}{\int p(\dataVector|\inputVector, c, m, \dataStd^2)p(c) \text{d} c}
\]\[
p(c| \dataVector, \inputVector, m, \dataStd^2) \propto p(\dataVector|\inputVector, c, m, \dataStd^2)p(c)
\]\[
\begin{aligned}
\log p(c | \dataVector, \inputVector, m, \dataStd^2) =&-\frac{1}{2\dataStd^2} \sum_{i=1}^\numData(\dataScalar_i-c - m\inputScalar_i)^2-\frac{1}{2\alpha_1} c^2 + \text{const}\\
= &-\frac{1}{2\dataStd^2}\sum_{i=1}^\numData(\dataScalar_i-m\inputScalar_i)^2 -\left(\frac{\numData}{2\dataStd^2} + \frac{1}{2\alpha_1}\right)c^2\\
& + c\frac{\sum_{i=1}^\numData(\dataScalar_i-m\inputScalar_i)}{\dataStd^2},
\end{aligned}
\] complete the square of the quadratic form to obtain \[
\log p(c | \dataVector, \inputVector, m, \dataStd^2) = -\frac{1}{2\tau^2}(c - \mu)^2 +\text{const},
\] where \(\tau^2 = \left(\numData\dataStd^{-2} +\alpha_1^{-1}\right)^{-1}\) and \(\mu = \frac{\tau^2}{\dataStd^2} \sum_{i=1}^\numData(\dataScalar_i-m\inputScalar_i)\).
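A sketch of these formulas in code, using made-up data, a fixed slope \(m\) and assumed values for \(\dataStd^2\) and \(\alpha_1\); the grid normalisation at the end is just a numerical check that the completed square gives the right posterior mean.

```python
import numpy as np

# Hypothetical data and settings.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.9, 5.1, 6.8, 9.2])
m = 2.0          # slope held fixed (we are conditioning on it)
sigma2 = 0.01    # noise variance
alpha_1 = 1.0    # prior variance on c
n = len(y)

# Posterior over c from completing the square:
#   tau^2 = (n/sigma^2 + 1/alpha_1)^(-1),  mu = (tau^2/sigma^2) * sum_i (y_i - m*x_i)
tau2 = 1.0 / (n / sigma2 + 1.0 / alpha_1)
mu = (tau2 / sigma2) * np.sum(y - m * x)

# Numerical check: normalise the unnormalised log posterior on a grid of c values.
c_grid = np.linspace(mu - 5 * np.sqrt(tau2), mu + 5 * np.sqrt(tau2), 2001)
log_post = (-0.5 / sigma2) * np.sum((y[None, :] - m * x[None, :] - c_grid[:, None]) ** 2, axis=1) \
           - 0.5 * c_grid ** 2 / alpha_1
post = np.exp(log_post - log_post.max())
dc = c_grid[1] - c_grid[0]
post /= post.sum() * dc
print(mu, (c_grid * post).sum() * dc)   # analytic and numerical posterior means agree
```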
The Joint Density
Really want to know the joint posterior density over the parameters \(c\) and \(m\).
Could now integrate out \(m\), but it’s easier to consider the multivariate case.
Two Dimensional Gaussian
Consider height, \(h/m\) and weight, \(w/kg\).
Could sample height from a distribution: \[
h \sim \gaussianSamp{1.7}{0.0225}.
\]
And similarly weight: \[
w \sim \gaussianSamp{75}{36}.
\]
Height and Weight Models
Independence Assumption
We assume height and weight are independent.
\[
p(w, h) = p(w)p(h).
\]
Sampling Two Dimensional Variables
Body Mass Index
In reality they are dependent (body mass index \(= \frac{w}{h^2}\)).
To deal with this dependence we introduce correlated multivariate Gaussians.
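A short sampling sketch, using the means and variances above and an assumed (made-up) covariance between height and weight, to show the difference between the independent and the correlated model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Independent model: p(w, h) = p(w)p(h).
h_ind = rng.normal(1.7, np.sqrt(0.0225), size=1000)   # height in metres
w_ind = rng.normal(75.0, np.sqrt(36.0), size=1000)    # weight in kg

# Correlated model: a joint Gaussian whose off-diagonal covariance (an assumed
# value here) captures the dependence between height and weight.
mean = np.array([1.7, 75.0])
cov = np.array([[0.0225, 0.4],
                [0.4,   36.0]])
h_cor, w_cor = rng.multivariate_normal(mean, cov, size=1000).T

# The sample correlations make the difference visible.
print(np.corrcoef(h_ind, w_ind)[0, 1], np.corrcoef(h_cor, w_cor)[0, 1])
```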
Let’s assume that the prior density is given by a zero mean Gaussian, which is independent across each of the parameters, \[
\mappingVector \sim \gaussianSamp{\zerosVector}{\alpha \eye}
\] In other words, we are assuming, for the prior, that each element of the parameters vector, \(\mappingScalar_i\), was drawn from a Gaussian density as follows \[
\mappingScalar_i \sim \gaussianSamp{0}{\alpha}
\] Let’s start by assigning the parameter of the prior distribution, its variance, \(\alpha\).
Further Reading
Multivariate Gaussians: Section 2.3 up to top of pg 85 of Bishop (2006)
Section 3.3 up to 159 (pg 152–159) of Bishop (2006)
Revisit Olympics Data
Use the Bayesian approach on the Olympics data with polynomials.
Choose a prior \(\mappingVector \sim \gaussianSamp{\zerosVector}{\alpha \eye}\) with \(\alpha = 1\).
Choose noise variance \(\dataStd^2 = 0.01\).
Sampling the Prior
Always useful to perform a ‘sanity check’ and sample from the prior before observing the data.
Since \(\dataVector = \basisMatrix \mappingVector + \noiseVector\) just need to sample \[
\mappingVector \sim \gaussianSamp{\zerosVector}{\alpha\eye}
\]\[
\noiseVector \sim \gaussianSamp{\zerosVector}{\dataStd^2\eye}
\] with \(\alpha=1\) and \(\dataStd^2 = 0.01\).
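A minimal sketch of this sanity check, assuming a polynomial basis of (arbitrary) degree four built with `numpy.vander`:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, sigma2 = 1.0, 0.01
degree = 4                                       # assumed polynomial degree

x = np.linspace(-1.0, 1.0, 50)
Phi = np.vander(x, degree + 1, increasing=True)  # basis matrix, columns 1, x, x^2, ...

# Sample functions from the prior: w ~ N(0, alpha*I), y = Phi w + noise.
prior_samples = []
for _ in range(5):
    w = rng.normal(0.0, np.sqrt(alpha), size=Phi.shape[1])
    noise = rng.normal(0.0, np.sqrt(sigma2), size=len(x))
    prior_samples.append(Phi @ w + noise)
```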
The two solutions (least squares and the Bayesian posterior mean) are equivalent when \(\alpha \rightarrow \infty\).
Equivalent to a prior for \(\mappingVector\) with infinite variance.
In other cases \(\alpha \eye\) regularizes the system (keeps parameters smaller).
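A sketch of these points on made-up data, using the posterior mean formula given on the next slide: for large \(\alpha\) the Bayesian solution approaches the least-squares fit, while for moderate \(\alpha\) the prior keeps the parameters smaller.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data from a straight line, with a two-column basis [1, x].
x = np.linspace(0.0, 1.0, 20)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.1, size=len(x))
Phi = np.stack([np.ones_like(x), x], axis=1)
sigma2 = 0.01

def posterior_mean(alpha):
    # mu_w = [sigma^-2 Phi^T Phi + alpha^-1 I]^-1 sigma^-2 Phi^T y
    A = Phi.T @ Phi / sigma2 + np.eye(Phi.shape[1]) / alpha
    return np.linalg.solve(A, Phi.T @ y / sigma2)

w_ls, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(posterior_mean(1.0))    # shrunk towards zero by the prior
print(posterior_mean(1e8))    # effectively the least-squares solution
print(w_ls)
```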
Sampling from the Posterior
Now check samples by extracting \(\mappingVector\) from the posterior.
Now for \(\dataVector = \basisMatrix \mappingVector + \noiseVector\) need \[
\mappingVector \sim \gaussianSamp{\meanVector_w}{\covarianceMatrix_w}
\] with \(\covarianceMatrix_w = \left[\dataStd^{-2}\basisMatrix^\top \basisMatrix + \alpha^{-1}\eye\right]^{-1}\) and \(\meanVector_w =\covarianceMatrix_w \dataStd^{-2} \basisMatrix^\top \dataVector\)\[
\noiseVector \sim \gaussianSamp{\zerosVector}{\dataStd^2\eye}
\] with \(\alpha=1\) and \(\dataStd^2 = 0.01\).
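A sketch of sampling from this posterior, assuming made-up data and the same assumed degree-four polynomial basis as in the prior-sampling sketch above:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, sigma2 = 1.0, 0.01

# Hypothetical data and a polynomial basis.
x = np.linspace(0.0, 1.0, 30)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, np.sqrt(sigma2), size=len(x))
Phi = np.vander(x, 5, increasing=True)

# Posterior over the weights:
#   C_w = [sigma^-2 Phi^T Phi + alpha^-1 I]^-1,  mu_w = C_w sigma^-2 Phi^T y
C_w = np.linalg.inv(Phi.T @ Phi / sigma2 + np.eye(Phi.shape[1]) / alpha)
mu_w = C_w @ Phi.T @ y / sigma2

# Draw weight samples from the posterior and evaluate the corresponding functions.
w_samples = rng.multivariate_normal(mu_w, C_w, size=5)
f_samples = w_samples @ Phi.T
```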
Marginal Likelihood
The marginal likelihood can also be computed; it has the form: \[
p(\dataVector|\inputMatrix, \dataStd^2, \alpha) = \frac{1}{(2\pi)^\frac{\numData}{2}\left|\kernelMatrix\right|^\frac{1}{2}} \exp\left(-\frac{1}{2} \dataVector^\top \kernelMatrix^{-1} \dataVector\right)
\] where \(\kernelMatrix = \alpha \basisMatrix\basisMatrix^\top + \dataStd^2 \eye\).
So it is a zero mean \(\numData\)-dimensional Gaussian with covariance matrix \(\kernelMatrix\).
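A direct implementation of this density (evaluated on made-up data), useful for comparing settings of \(\alpha\) and \(\dataStd^2\); `slogdet` and `solve` keep it numerically stable without forming \(\kernelMatrix^{-1}\) explicitly.

```python
import numpy as np

def log_marginal_likelihood(Phi, y, alpha, sigma2):
    """log p(y | X, sigma^2, alpha) with K = alpha * Phi Phi^T + sigma^2 * I."""
    n = len(y)
    K = alpha * Phi @ Phi.T + sigma2 * np.eye(n)
    _, logdet = np.linalg.slogdet(K)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + y @ np.linalg.solve(K, y))

# Hypothetical use: compare two settings of alpha on the same data.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 30)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.1, size=len(x))
Phi = np.vander(x, 5, increasing=True)
print(log_marginal_likelihood(Phi, y, alpha=1.0, sigma2=0.01),
      log_marginal_likelihood(Phi, y, alpha=100.0, sigma2=0.01))
```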
Computing the Expected Output
Given the posterior for the parameters, how can we compute the expected output at a given location?
Output of model at location \(\inputVector_i\) is given by \[
\mappingFunction(\inputVector_i; \mappingVector) = \basisVector_i^\top \mappingVector
\]
We want the expected output under the posterior density, \(p(\mappingVector|\dataVector, \inputMatrix, \dataStd^2, \alpha)\).
Mean of mapping function will be given by \[
\begin{aligned} \left\langle \mappingFunction(\inputVector_i; \mappingVector)\right\rangle_{p(\mappingVector|\dataVector, \inputMatrix, \dataStd^2, \alpha)} &= \basisVector_i^\top \left\langle\mappingVector\right\rangle_{p(\mappingVector|\dataVector, \inputMatrix, \dataStd^2, \alpha)} \\ & = \basisVector_i^\top \meanVector_w \end{aligned}
\]
Variance of Expected Output
Variance of model at location \(\inputVector_i\) is given by \[
\begin{aligned}\text{var}(\mappingFunction(\inputVector_i; \mappingVector)) &= \left\langle(\mappingFunction(\inputVector_i; \mappingVector))^2\right\rangle - \left\langle \mappingFunction(\inputVector_i; \mappingVector)\right\rangle^2 \\&= \basisVector_i^\top\left\langle\mappingVector\mappingVector^\top\right\rangle \basisVector_i - \basisVector_i^\top \left\langle\mappingVector\right\rangle\left\langle\mappingVector\right\rangle^\top \basisVector_i \\&= \basisVector_i^\top \covarianceMatrix_w\basisVector_i
\end{aligned}
\] where all these expectations are taken under the posterior density, \(p(\mappingVector|\dataVector, \inputMatrix, \dataStd^2, \alpha)\).
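A small sketch combining the last two slides, assuming `mu_w` and `C_w` have been computed as in the posterior-sampling sketch above (hypothetical setup); the variance line evaluates \(\basisVector_i^\top \covarianceMatrix_w \basisVector_i\) for every test location at once.

```python
import numpy as np

def predict(Phi_star, mu_w, C_w, sigma2=None):
    """Mean and variance of f(x_*) = phi(x_*)^T w under the posterior over w.

    If sigma2 is given, the noise variance is added to give the variance of y_*.
    """
    mean = Phi_star @ mu_w
    var = np.sum((Phi_star @ C_w) * Phi_star, axis=1)   # phi_i^T C_w phi_i for each row
    if sigma2 is not None:
        var = var + sigma2
    return mean, var

# Example: predictions on a grid of test inputs with a degree-four polynomial
# basis; mu_w and C_w come from the posterior computation sketched earlier.
x_star = np.linspace(0.0, 1.0, 100)
Phi_star = np.vander(x_star, 5, increasing=True)
# mean_star, var_star = predict(Phi_star, mu_w, C_w, sigma2=0.01)
```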
Further Reading
Section 3.7–3.8 (pg 122–133) of Rogers and Girolami (2011)