data : observations, could be actively or passively acquired (meta-data).
model : assumptions, based on previous experience (other data! transfer learning etc), or beliefs about the regularities of the universe. Inductive bias.
prediction : an action to be taken or a categorization or a quality score.
point 1: \(\inputScalar = 1\), \(\dataScalar=3\)\[
3 = m + c
\]
point 2: \(\inputScalar = 3\), \(\dataScalar=1\)\[
1 = 3m + c
\]
point 3: \(\inputScalar = 2\), \(\dataScalar=2.5\)
\[2.5 = 2m + c\]
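A quick check shows why no exact solution exists: subtracting point 2 from point 1 gives \[
3 - 1 = (m + c) - (3m + c) = -2m,
\] so \(m = -1\) and \(c = 4\); but point 3 would then require \(2m + c = 2\), which conflicts with the observed \(2.5\).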
\(\dataScalar = m\inputScalar + c + \noiseScalar\)
point 1: \(\inputScalar = 1\), \(\dataScalar=3\)\[
3 = m + c + \noiseScalar_1
\]
point 2: \(\inputScalar = 3\), \(\dataScalar=1\)\[
1 = 3m + c + \noiseScalar_2
\]
point 3: \(\inputScalar = 2\), \(\dataScalar=2.5\)\[
2.5 = 2m + c + \noiseScalar_3
\]
A Probabilistic Process
Set the mean of the Gaussian to be a function. \[p
\left(\dataScalar_i|\inputScalar_i\right)=\frac{1}{\sqrt{2\pi\dataStd^2}}\exp \left(-\frac{\left(\dataScalar_i-\mappingFunction\left(\inputScalar_i\right)\right)^{2}}{2\dataStd^2}\right).
\]
The Gaussian PDF with \({\meanScalar}=1.7\) and variance \({\dataStd}^2=0.0225\). Mean shown as cyan line. It could represent the heights of a population of students.
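A minimal sketch of evaluating this density in code (assuming numpy; the mean and variance are those of the height example, with the function-valued mean held fixed):

```python
import numpy as np

def gaussian_pdf(y, mu, sigma2):
    """Evaluate the univariate Gaussian density at y."""
    return np.exp(-(y - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

# Height example above: mean 1.7 m, variance 0.0225 m^2 (standard deviation 0.15 m).
print(gaussian_pdf(1.7, mu=1.7, sigma2=0.0225))  # density at the mean
print(gaussian_pdf(2.0, mu=1.7, sigma2=0.0225))  # a 2 m student is far less probable
```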
Predict a real value, \(\dataScalar_i\), given some inputs \(\inputVector_i\).
Predict quality of meat given spectral measurements (Tecator data).
Radiocarbon dating, the C14 calibration curve: predict age given quantity of C14 isotope.
Predict quality of different Go or Backgammon moves given expert rated training data.
Underdetermined System
What about two unknowns and one observation? \[\dataScalar_1 = m\inputScalar_1 + c\]
Can compute \(m\) given \(c\). \[m = \frac{\dataScalar_1 - c}{\inputScalar_1}\]
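A minimal sketch of this underdetermined behaviour, using point 1 from above as the single observation (the spread assumed for \(c\) is illustrative): every draw of \(c\) fixes a compatible slope \(m\).

```python
import numpy as np

x1, y1 = 1.0, 3.0                             # point 1 from the slides
rng = np.random.default_rng(0)
c_samples = rng.normal(0.0, 2.0, size=5)      # assumed spread of possible offsets c
m_samples = (y1 - c_samples) / x1             # each c gives a slope through the point
for c, m in zip(c_samples, m_samples):
    print(f"c = {c:5.2f}  ->  m = {m:5.2f}")  # many (m, c) pairs fit one observation
```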
Underdetermined System
Overdetermined System
With two unknowns and two observations: \[\begin{aligned}
\dataScalar_1 = & m\inputScalar_1 + c\\
\dataScalar_2 = & m\inputScalar_2 + c
\end{aligned}\]
A third observation, \(\dataScalar_3 = m\inputScalar_3 + c\), makes the system overdetermined: no single straight line passes exactly through all three points. This problem is solved through a noise model \(\noiseScalar \sim \gaussianSamp{0}{\dataStd^2}\)\[\begin{aligned}
\dataScalar_1 = m\inputScalar_1 + c + \noiseScalar_1\\
\dataScalar_2 = m\inputScalar_2 + c + \noiseScalar_2\\
\dataScalar_3 = m\inputScalar_3 + c + \noiseScalar_3
\end{aligned}\]
Noise Models
We aren’t modeling the entire system.
Noise model gives mismatch between model and data.
Gaussian model justified by appeal to central limit theorem.
Other models also possible (Student-\(t\) for heavy tails).
Maximum likelihood with Gaussian noise leads to least squares.
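A minimal sketch of that equivalence (assuming numpy; the data are the three points above): the least-squares solution is the maximum-likelihood estimate of \(m\) and \(c\) under Gaussian noise.

```python
import numpy as np

x = np.array([1.0, 3.0, 2.0])   # inputs of the three points above
y = np.array([3.0, 1.0, 2.5])   # targets

# Design matrix: one column for the slope m, a column of ones for the offset c.
Phi = np.stack([x, np.ones_like(x)], axis=1)

# Least squares solution = maximum likelihood under Gaussian noise.
(m, c), residual, *_ = np.linalg.lstsq(Phi, y, rcond=None)
sigma2 = residual[0] / len(y)   # maximum likelihood estimate of the noise variance
print(m, c, sigma2)
```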
Probability for Under- and Overdetermined
To deal with the overdetermined system we introduced a probability distribution for the ‘variable’, \(\noiseScalar_i\).
For the underdetermined system we introduced a probability distribution for the ‘parameter’, \(c\).
This is known as a Bayesian treatment.
Different Types of Uncertainty
The first type of uncertainty we are assuming is aleatoric uncertainty.
The second type of uncertainty we are assuming is epistemic uncertainty.
Aleatoric Uncertainty
This is uncertainty we couldn’t know even if we wanted to, e.g. the result of a football match before it’s played.
Where a sheet of paper might land on the floor.
Epistemic Uncertainty
This is uncertainty we could in principle know the answer to. We just haven’t observed enough yet, e.g. the result of a football match after it’s played.
What colour socks your lecturer is wearing.
Bayesian Regression
Prior Distribution
Bayesian inference requires a prior on the parameters.
The prior represents your belief, before you see the data, about the likely value of the parameters.
For linear regression, consider a Gaussian prior on the intercept:
\[c \sim \gaussianSamp{0}{\alpha_1}\]
Posterior Distribution
Posterior distribution is found by combining the prior with the likelihood.
The posterior distribution represents your belief, after you see the data, about the likely value of the parameters.
The posterior is found through Bayes’ Rule\[
p(c|\dataScalar) = \frac{p(\dataScalar|c)p(c)}{p(\dataScalar)}
\]
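A minimal sketch of this update for the offset \(c\) alone (the slope, noise variance and prior variance \(\alpha_1\) below are assumed illustrative values): a Gaussian prior combined with a Gaussian likelihood gives a Gaussian posterior in closed form.

```python
import numpy as np

x = np.array([1.0, 3.0, 2.0])   # inputs of the three points above
y = np.array([3.0, 1.0, 2.5])   # targets

m = -1.0                        # slope held fixed for illustration (assumed)
sigma2 = 0.05                   # noise variance (assumed)
alpha1 = 1.0                    # prior variance: c ~ N(0, alpha1)

residual = y - m * x            # the part of the data the offset must explain
post_var = 1.0 / (1.0 / alpha1 + len(x) / sigma2)
post_mean = post_var * residual.sum() / sigma2
print(post_mean, post_var)      # posterior p(c | y) is Gaussian with these moments
```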
Now use a multivariate Gaussian prior: \[p(\weightVector) = \frac{1}{\left(2\pi \alpha\right)^\frac{\dataDim}{2}} \exp \left(-\frac{1}{2\alpha} \weightVector^\top \weightVector\right)\]
Two Dimensional Gaussian Distribution
Two Dimensional Gaussian
Consider height, \(h/m\) and weight, \(w/kg\).
Could sample height from a distribution: \[
h \sim \gaussianSamp{1.7}{0.0225}.
\]
And similarly weight: \[
w \sim \gaussianSamp{75}{36}.
\]
Height and Weight Models
Gaussian distributions for height and weight.
Independence Assumption
We assume height and weight are independent.
\[
p(w, h) = p(w)p(h).
\]
Sampling Two Dimensional Variables
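A minimal sketch of sampling under the independence assumption above (assuming numpy; the means and variances are those given for height and weight):

```python
import numpy as np

rng = np.random.default_rng(0)
h = rng.normal(1.7, np.sqrt(0.0225), size=1000)   # heights in metres
w = rng.normal(75.0, np.sqrt(36.0), size=1000)    # weights in kilograms
# Under independence a joint sample is simply the pair (h_i, w_i).
print(np.corrcoef(h, w)[0, 1])                    # close to zero by construction
```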
Body Mass Index
In reality they are dependent: consider the body mass index, \(\frac{w}{h^2}\).
To deal with this dependence we introduce correlated multivariate Gaussians.
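A minimal sketch of such a correlated two-dimensional Gaussian for height and weight (the marginal means and variances are those above; the off-diagonal covariance is an assumed illustrative value):

```python
import numpy as np

mean = np.array([1.7, 75.0])            # [height in m, weight in kg]
cov = np.array([[0.0225, 0.4],          # marginal variances from the slides,
                [0.4,    36.0]])        # off-diagonal covariance is assumed
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(mean, cov, size=1000)
h, w = samples[:, 0], samples[:, 1]
print(np.corrcoef(h, w)[0, 1])          # positive correlation implied by the covariance
```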
Gaussian processes are initially of interest because:
1. linear Gaussian models are easier to deal with;
2. even the parameters within the process can be handled by considering a particular limit.
Multivariate Gaussian Properties
If \[
\dataVector = \mappingMatrix \inputVector + \noiseVector,
\] with \(\inputVector \sim \gaussianSamp{\meanVector}{\covarianceMatrix}\) and \(\noiseVector \sim \gaussianSamp{\mathbf{0}}{\covarianceMatrixTwo}\),
Then \[
\dataVector \sim \gaussianSamp{\mappingMatrix\meanVector}{\mappingMatrix\covarianceMatrix\mappingMatrix^\top + \covarianceMatrixTwo}.
\] If \(\covarianceMatrixTwo=\dataStd^2\eye\), this is Probabilistic Principal Component Analysis (Tipping and Bishop, 1999), because we integrated out the inputs (or latent variables they would be called in that case).
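A minimal Monte Carlo sketch of this property (the mapping matrix, mean, covariance and noise variance below are assumed example values): the empirical mean and covariance of the sampled \(\dataVector\) should match \(\mappingMatrix\meanVector\) and \(\mappingMatrix\covarianceMatrix\mappingMatrix^\top + \dataStd^2\eye\).

```python
import numpy as np

rng = np.random.default_rng(0)
W = np.array([[1.0, 2.0],       # assumed mapping matrix
              [0.5, -1.0],
              [0.0, 3.0]])
mu = np.array([1.0, -2.0])      # assumed mean of x
C = np.array([[1.0, 0.3],       # assumed covariance of x
              [0.3, 0.5]])
sigma2 = 0.1                    # assumed noise variance

x = rng.multivariate_normal(mu, C, size=200_000)
eps = rng.normal(0.0, np.sqrt(sigma2), size=(200_000, 3))
y = x @ W.T + eps               # y = W x + noise, sample by sample

print(y.mean(axis=0), W @ mu)                          # empirical vs analytic mean
print(np.cov(y.T).round(2))                            # empirical covariance
print((W @ C @ W.T + sigma2 * np.eye(3)).round(2))     # analytic covariance
```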
Non-linear on Inputs
Set each activation function computed at each data point to be formed by inner products of the rows of the design matrix.
Gaussian Process
Instead of making i.i.d. assumptions about our density over each data point, \(\dataScalar_i\),
we make a joint Gaussian assumption over our data.
The covariance matrix is now a function of both the parameters of the activation function, \(\mappingMatrix_1\), and the input variables, \(\inputMatrix\).
Arises from integrating out \(\mappingVector^{(2)}\).
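A minimal sketch of where such a covariance arises (a one-hidden-layer network with fixed, randomly drawn first-layer weights is assumed; integrating out the output weights \(\mappingVector^{(2)}\), given prior variance \(\alpha\), leaves a joint Gaussian over function values with covariance \(\alpha\boldsymbol{\Phi}\boldsymbol{\Phi}^\top\), where \(\boldsymbol{\Phi}\) holds the activations):

```python
import numpy as np

rng = np.random.default_rng(2)
X = np.linspace(-2.0, 2.0, 50)[:, None]   # inputs (assumed grid)
W1 = rng.normal(size=(1, 20))             # first-layer weights, held fixed (assumed)
b1 = rng.normal(size=(1, 20))             # first-layer biases (assumed)
Phi = np.tanh(X @ W1 + b1)                # activations at every data point

alpha = 1.0                               # prior variance of the output weights
K = alpha * Phi @ Phi.T                   # covariance after integrating them out
f = rng.multivariate_normal(np.zeros(50), K + 1e-6 * np.eye(50))  # one joint sample
print(K.shape, f.shape)                   # (50, 50) (50,)
```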
Basis Functions
Can be very complex, such as deep kernels (Cho and Saul, 2009), or we could even put a convolutional neural network inside.
Viewing a neural network in this way is also what allows us to perform sensible batch normalizations (Ioffe and Szegedy, 2015).
Distributions over Functions
Sampling a Function
Multi-variate Gaussians
We will consider a Gaussian with a particular structure of covariance matrix.
Generate a single sample from this 25 dimensional Gaussian density, \[
\mappingFunctionVector=\left[\mappingFunction_{1},\mappingFunction_{2},\dots,\mappingFunction_{25}\right].
\]
We will plot these points against their index.
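A minimal sketch of generating such a sample (an exponentiated quadratic covariance with an assumed length scale, plus a small jitter term, is used):

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 25)           # locations indexing f_1, ..., f_25
lengthscale = 0.3                         # assumed value
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * lengthscale ** 2))
K += 1e-6 * np.eye(25)                    # jitter for numerical stability
rng = np.random.default_rng(1)
f = rng.multivariate_normal(np.zeros(25), K)   # one 25-dimensional sample
print(f.round(2))                         # plotted against the index this traces a smooth curve
```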
Gaussian Distribution Sample
Prediction of \(\mappingFunction_{2}\) from \(\mappingFunction_{1}\)
Uluru
Prediction with Correlated Gaussians
Prediction of \(\mappingFunction_2\) from \(\mappingFunction_1\) requires conditional density.
Conditional density is also Gaussian. \[
p(\mappingFunction_2|\mappingFunction_1) = \gaussianDist{\mappingFunction_2}{\frac{\kernelScalar_{1, 2}}{\kernelScalar_{1, 1}}\mappingFunction_1}{ \kernelScalar_{2, 2} - \frac{\kernelScalar_{1,2}^2}{\kernelScalar_{1,1}}}
\] where covariance of joint density is given by \[
\kernelMatrix = \begin{bmatrix} \kernelScalar_{1, 1} & \kernelScalar_{1, 2}\\ \kernelScalar_{2, 1} & \kernelScalar_{2, 2}\end{bmatrix}.
\]
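A minimal sketch of this conditional prediction (the covariance entries and the observed value of \(\mappingFunction_1\) are assumed illustrative numbers):

```python
k11, k12, k22 = 1.0, 0.9, 1.0    # entries of the 2x2 joint covariance (assumed)
f1 = -0.5                        # observed value of f_1 (assumed)

cond_mean = (k12 / k11) * f1     # mean of p(f_2 | f_1)
cond_var = k22 - k12 ** 2 / k11  # variance of p(f_2 | f_1)
print(cond_mean, cond_var)       # -0.45, 0.19
```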
Prediction of \(\mappingFunction_{8}\) from \(\mappingFunction_{1}\)
For \(\numBasisFunc\) basis functions with centres equally spaced by \(\Delta\locationScalar\) across \([a, b]\), this implies \[
b-a = \Delta\locationScalar (\numBasisFunc -1)
\] and therefore \[
\numBasisFunc = \frac{b-a}{\Delta \locationScalar} + 1
\]
Take the limit as \(\Delta\locationScalar\rightarrow 0\) so \(\numBasisFunc \rightarrow \infty\), where we have used \(a + k\cdot\Delta\locationScalar\rightarrow \locationScalar\).
Now take the limit as \(a\rightarrow -\infty\) and \(b\rightarrow \infty\): \[\kernelScalar\left(\inputScalar_i,\inputScalar_j\right) = \alpha\exp\left(
-\frac{\left(\inputScalar_i-\inputScalar_j\right)^2}{4\rbfWidth^2}\right),\] where \(\alpha=\alpha^\prime \sqrt{\pi\rbfWidth^2}\).
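A minimal numerical sketch of this limit (grid range, basis width and \(\alpha^\prime\) below are assumed values): summing over a dense grid of RBF basis centres, scaled by \(\Delta\locationScalar\), approaches the analytic exponentiated quadratic covariance.

```python
import numpy as np

ell = 0.5                             # basis width (assumed)
alpha_prime = 1.0                     # scale per basis function (assumed)
a, b, n_basis = -20.0, 20.0, 4001     # wide, dense grid of basis centres (assumed)
centres = np.linspace(a, b, n_basis)
delta = centres[1] - centres[0]       # spacing between neighbouring centres

def phi(x):
    """RBF basis functions centred at each grid point, evaluated at scalar x."""
    return np.exp(-(x - centres) ** 2 / (2 * ell ** 2))

xi, xj = 0.3, 1.1
finite_sum = alpha_prime * delta * np.sum(phi(xi) * phi(xj))
analytic = alpha_prime * np.sqrt(np.pi * ell ** 2) * np.exp(-(xi - xj) ** 2 / (4 * ell ** 2))
print(finite_sum, analytic)           # the two values agree closely
```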
Infinite Feature Space
An RBF model with infinite basis functions is a Gaussian process.
The covariance function is the exponentiated quadratic. \[\kernelScalar\left(\inputScalar_i,\inputScalar_j\right) = \alpha \exp\left(
-\frac{\left(\inputScalar_i-\inputScalar_j\right)^2}{4\rbfWidth^2}\right).\]
Infinite Feature Space
An RBF model with infinite basis functions is a Gaussian process.
The covariance function is the exponentiated quadratic (squared exponential).
Note: the functional forms of the covariance function and the basis functions are similar.
Cho, Y., Saul, L.K., 2009. Kernel methods for deep learning, in: Bengio, Y., Schuurmans, D., Lafferty, J.D., Williams, C.K.I., Culotta, A. (Eds.), Advances in Neural Information Processing Systems 22. Curran Associates, Inc., pp. 342–350.
Ioffe, S., Szegedy, C., 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift, in: Bach, F., Blei, D. (Eds.), Proceedings of the 32nd International Conference on Machine Learning, Proceedings of Machine Learning Research. PMLR, Lille, France, pp. 448–456.
Tipping, M.E., Bishop, C.M., 1999. Probabilistic principal component analysis. Journal of the Royal Statistical Society, Series B 61, 611–622. https://doi.org/10.1111/1467-9868.00196