at Gaussian Process Summer School, Sheffield on Sep 3, 2018 [jupyter][google colab][reveal]
Neil D. Lawrence, Amazon Cambridge and University of Sheffield

#### Abstract

In this talk we introduce Gaussian process models. Motivating the representation of uncertainty through probability distributions we review Laplace's approach to understanding uncertainty and how uncertainty in functions can be represented through a multivariate Gaussian density.

.

.

Rasmussen and Williams (2006) is still one of the most important references on Gaussian process models. It is available freely online.

# What is Machine Learning? 

What is machine learning? At its most basic level machine learning is a combination of

$$\text{data} + \text{model} \xrightarrow{\text{compute}} \text{prediction}$$

where data is our observations. They can be actively or passively acquired (meta-data). The model contains our assumptions, based on previous experience. That experience can be other data, it can come from transfer learning, or it can merely be our beliefs about the regularities of the universe. In humans our models include our inductive biases. The prediction is an action to be taken or a categorization or a quality score. The reason that machine learning has become a mainstay of artificial intelligence is the importance of predictions in artificial intelligence. The data and the model are combined through computation.

In practice we normally perform machine learning using two functions. To combine data with a model we typically make use of:

a prediction function a function which is used to make the predictions. It includes our beliefs about the regularities of the universe, our assumptions about how the world works, e.g. smoothness, spatial similarities, temporal similarities.

an objective function a function which defines the cost of misprediction. Typically it includes knowledge about the world's generating processes (probabilistic objectives) or the costs we pay for mispredictions (empiricial risk minimization).

The combination of data and model through the prediction function and the objectie function leads to a learning algorithm. The class of prediction functions and objective functions we can make use of is restricted by the algorithms they lead to. If the prediction function or the objective function are too complex, then it can be difficult to find an appropriate learning algorithm. Much of the acdemic field of machine learning is the quest for new learning algorithms that allow us to bring different types of models and data together.

A useful reference for state of the art in machine learning is the UK Royal Society Report, Machine Learning: Power and Promise of Computers that Learn by Example.

You can also check my blog post on "What is Machine Learning?"

## Olympic Marathon Data 

 Gold medal times for Olympic Marathon since 1896. Marathons before 1924 didn’t have a standardised distance. Present results using pace per km. In 1904 Marathon was badly organised leading to very slow times. Image from Wikimedia Commons http://bit.ly/16kMKHQ

The first thing we will do is load a standard data set for regression modelling. The data consists of the pace of Olympic Gold Medal Marathon winners for the Olympics from 1896 to present. First we load in the data and plot.

import numpy as np
import pods
data = pods.datasets.olympic_marathon_men()
x = data['X']
y = data['Y']

offset = y.mean()
scale = np.sqrt(y.var())
import matplotlib.pyplot as plt
import teaching_plots as plot
import mlai

xlim = (1875,2030)
ylim = (2.5, 6.5)
yhat = (y-offset)/scale

fig, ax = plt.subplots(figsize=plot.big_wide_figsize)
_ = ax.plot(x, y, 'r.',markersize=10)
ax.set_xlabel('year', fontsize=20)
ax.set_ylabel('pace min/km', fontsize=20)
ax.set_xlim(xlim)
ax.set_ylim(ylim)

mlai.write_figure(figure=fig,
filename='../slides/diagrams/datasets/olympic-marathon.svg',
transparent=True,
frameon=True)

Things to notice about the data include the outlier in 1904, in this year, the olympics was in St Louis, USA. Organizational problems and challenges with dust kicked up by the cars following the race meant that participants got lost, and only very few participants completed.

More recent years see more consistently quick marathons.

## Overdetermined System 

The challenge with a linear model is that it has two unknowns, m, and c. Observing data allows us to write down a system of simultaneous linear equations. So, for example if we observe two data points, the first with the input value, $\inputScalar_1 = 1$ and the output value, $\dataScalar_1 =3$ and a second data point, $\inputScalar = 3$, $\dataScalar=1$, then we can write two simultaneous linear equations of the form.

point 1: $\inputScalar = 1$, $\dataScalar=3$
3 = m + c
point 2: $\inputScalar = 3$, $\dataScalar=1$
1 = 3m + c

The solution to these two simultaneous equations can be represented graphically as

The challenge comes when a third data point is observed and it doesn't naturally fit on the straight line.

point 3: $\inputScalar = 2$, $\dataScalar=2.5$
2.5 = 2m + c

Now there are three candidate lines, each consistent with our data.

This is known as an overdetermined system because there are more data than we need to determine our parameters. The problem arises because the model is a simplification of the real world, and the data we observe is therefore inconsistent with our model.

pods.notebook.display_plots('over_determined_system{samp:0>3}.svg',
directory='../slides/diagrams/ml',
samp=IntSlider(1,1,7,1))

The solution was proposed by Pierre-Simon Laplace. His idea was to accept that the model was an incomplete representation of the real world, and the manner in which it was incomplete is unknown. His idea was that such unknowns could be dealt with through probability.

### Pierre-Simon Laplace 

<iframe frameborder=”0” scrolling=”no” style=”border:0px” src=”http://books.google.co.uk/books?id=1YQPAAAAQAAJ&pg=PR17-IA2&output=embed” , width=”700” height=”500”> </iframe>

Famously, Laplace considered the idea of a deterministic Universe, one in which the model is known, or as the below translation refers to it, "an intelligence which could comprehend all the forces by which nature is animated". He speculates on an "intelligence" that can submit this vast data to analysis and propsoses that such an entity would be able to predict the future.

Given for one instant an intelligence which could comprehend all the forces by which nature is animated and the respective situation of the beings who compose it---an intelligence sufficiently vast to submit these data to analysis---it would embrace in the same formulate the movements of the greatest bodies of the universe and those of the lightest atom; for it, nothing would be uncertain and the future, as the past, would be present in its eyes.

This notion is known as Laplace's demon or Laplace's superman.

Unfortunately, most analyses of his ideas stop at that point, whereas his real point is that such a notion is unreachable. Not so much superman as strawman. Just three pages later in the "Philosophical Essay on Probabilities" (Laplace 1814), Laplace goes on to observe:

The curve described by a simple molecule of air or vapor is regulated in a manner just as certain as the planetary orbits; the only difference between them is that which comes from our ignorance.

Probability is relative, in part to this ignorance, in part to our knowledge.

<iframe frameborder=”0” scrolling=”no” style=”border:0px” src=”http://books.google.co.uk/books?id=1YQPAAAAQAAJ&pg=PR17-IA4&output=embed” , width=”700” height=”500”> </iframe>

In other words, we can never make use of the idealistic deterministc Universe due to our ignorance about the world, Laplace's suggestion, and focus in this essay is that we turn to probability to deal with this uncertainty. This is also our inspiration for using probabilit in machine learning.

The "forces by which nature is animated" is our model, the "situation of beings that compose it" is our data and the "intelligence sufficiently vast enough to submit these data to analysis" is our compute. The fly in the ointment is our ignorance about these aspects. And probability is the tool we use to incorporate this ignorance leading to uncertainty or doubt in our predictions.

Laplace's concept was that the reason that the data doesn't match up to the model is because of unconsidered factors, and that these might be well represented through probability densities. He tackles the challenge of the unknown factors by adding a variable, $\noiseScalar$, that represents the unknown. In modern parlance we would call this a latent variable. But in the context Laplace uses it, the variable is so common that it has other names such as a "slack" variable or the noise in the system.

point 1: $\inputScalar = 1$, $\dataScalar=3$
$$3 = m + c + \noiseScalar_1$$
point 2: $\inputScalar = 3$, $\dataScalar=1$
$$1 = 3m + c + \noiseScalar_2$$
point 3: $\inputScalar = 2$, $\dataScalar=2.5$
$$2.5 = 2m + c + \noiseScalar_3$$

Laplace's trick has converted the overdetermined system into an underdetermined system. He has now added three variables, $\{\noiseScalar_i\}_{i=1}^3$, which represent the unknown corruptions of the real world. Laplace's idea is that we should represent that unknown corruption with a probability distribution.

## A Probabilistic Process

However, it was left to an admirer of Gauss to develop a practical probability density for that purpose. It was Carl Friederich Gauss who suggested that the Gaussian density (which at the time was unnamed!) should be used to represent this error.

The result is a noisy function, a function which has a deterministic part, and a stochastic part. This type of function is sometimes known as a probabilistic or stochastic process, to distinguish it from a deterministic process.

## The Gaussian Density 

The Gaussian density is perhaps the most commonly used probability density. It is defined by a mean, $\meanScalar$, and a variance, $\dataStd^2$. The variance is taken to be the square of the standard deviation, $\dataStd$.

\begin{align} p(\dataScalar| \meanScalar, \dataStd^2) & = \frac{1}{\sqrt{2\pi\dataStd^2}}\exp\left(-\frac{(\dataScalar - \meanScalar)^2}{2\dataStd^2}\right)\\& \buildrel\triangle\over = \gaussianDist{\dataScalar}{\meanScalar}{\dataStd^2} \end{align}

## Two Important Gaussian Properties 

The Gaussian density has many important properties, but for the moment we'll review two of them.

## Sum of Gaussians

If we assume that a variable, $\dataScalar_i$, is sampled from a Gaussian density,

$$\dataScalar_i \sim \gaussianSamp{\meanScalar_i}{\sigma_i^2}$$

Then we can show that the sum of a set of variables, each drawn independently from such a density is also distributed as Gaussian. The mean of the resulting density is the sum of the means, and the variance is the sum of the variances,

$$\sum_{i=1}^{\numData} \dataScalar_i \sim \gaussianSamp{\sum_{i=1}^\numData \meanScalar_i}{\sum_{i=1}^\numData \sigma_i^2}$$

Since we are very familiar with the Gaussian density and its properties, it is not immediately apparent how unusual this is. Most random variables, when you add them together, change the family of density they are drawn from. For example, the Gaussian is exceptional in this regard. Indeed, other random variables, if they are independently drawn and summed together tend to a Gaussian density. That is the central limit theorem which is a major justification for the use of a Gaussian density.

## Scaling a Gaussian

Less unusual is the scaling property of a Gaussian density. If a variable, $\dataScalar$, is sampled from a Gaussian density,

$$\dataScalar \sim \gaussianSamp{\meanScalar}{\sigma^2}$$
and we choose to scale that variable by a deterministic value, $\mappingScalar$, then the scaled variable is distributed as

$$\mappingScalar \dataScalar \sim \gaussianSamp{\mappingScalar\meanScalar}{\mappingScalar^2 \sigma^2}.$$
Unlike the summing properties, where adding two or more random variables independently sampled from a family of densitites typically brings the summed variable outside that family, scaling many densities leaves the distribution of that variable in the same family of densities. Indeed, many densities include a scale parameter (e.g. the Gamma density) which is purely for this purpose. In the Gaussian the standard deviation, $\dataStd$, is the scale parameter. To see why this makes sense, let's consider,
$$z \sim \gaussianSamp{0}{1},$$
then if we scale by $\dataStd$ so we have, $\dataScalar=\dataStd z$, we can write,
$$\dataScalar =\dataStd z \sim \gaussianSamp{0}{\dataStd^2}$$

## Regression Examples 

Regression involves predicting a real value, $\dataScalar_i$, given an input vector, $\inputVector_i$. For example, the Tecator data involves predicting the quality of meat given spectral measurements. Or in radiocarbon dating, the C14 calibration curve maps from radiocarbon age to age measured through a back-trace of tree rings. Regression has also been used to predict the quality of board game moves given expert rated training data.

# Underdetermined System 

What about the situation where you have more parameters than data in your simultaneous equation? This is known as an underdetermined system. In fact this set up is in some sense easier to solve, because we don't need to think about introducing a slack variable (although it might make a lot of sense from a modelling perspective to do so).

The way Laplace proposed resolving an overdetermined system, was to introduce slack variables, $\noiseScalar_i$, which needed to be estimated for each point. The slack variable represented the difference between our actual prediction and the true observation. This is known as the residual. By introducing the slack variable we now have an additional n variables to estimate, one for each data point, $\{\noiseScalar_i\}$. This actually turns the overdetermined system into an underdetermined system. Introduction of n variables, plus the original m and c gives us $\numData+2$ parameters to be estimated from n observations, which actually makes the system underdetermined. However, we then made a probabilistic assumption about the slack variables, we assumed that the slack variables were distributed according to a probability density. And for the moment we have been assuming that density was the Gaussian,
$$\noiseScalar_i \sim \gaussianSamp{0}{\dataStd^2},$$
with zero mean and variance $\dataStd^2$.

The follow up question is whether we can do the same thing with the parameters. If we have two parameters and only one unknown can we place a probability distribution over the parameters, as we did with the slack variables? The answer is yes.

## Underdetermined System

import pods
from ipywidgets import IntSlider
pods.notebook.display_plots('under_determined_system{samp:0>3}.svg',
directory='../slides/diagrams/ml', samp=IntSlider(0, 0, 10, 1))

Classically, there are two types of uncertainty that we consider. The first is known as aleatoric uncertainty. This is uncertainty we couldn't resolve even if we wanted to. An example, would be the result of a football match before it's played, or where a sheet of paper lands on the floor.

The second is known as epistemic uncertainty. This is uncertainty that we could, in principle, resolve. We just haven't yet made the observation. For example, the result of a football match after it is played, or the color of socks that a lecturer is wearing.

Note, that there isn't a clean difference between the two. It is arguable, that if we knew enough about a football match, or the physics of a falling sheet of paper then we might be able to resolve the uncertainty. The reason we can't is because chaotic behaviour means that a very small change in any of the initial conditions we would need to resolve can have a large change in downstream effects. By this argument, the only truly aleatoric uncertainty might be quantum uncertainty. However, in practice the distinction is often applied.

In classical statistics, the frequentist approach only treats aleatoric uncertainty with probability. The key philosophical difference in the Bayesian approach is to treat any unknowns through probability. This approach was formally justified seperately by Cox (1946) and Finetti (1937).

The term Bayesian was a mocking term promoted by Fisher, it comes from the use, by Bayes, of a billiard table formulation to justify the Bernoulli distribution. Bayes considers a ball landing uniform at random between two sides of a billiard table. He then considers the outcome of the Bernoulli as being whether a second ball comes to rest to the right or left of the original. In this way, the parameter of his Bernoulli distribution is a stochastic variable (the uncertainty in the parameter is aleatoric). In contrast, when Bernoulli formulates the distribution he considers a bag of red and black balls. The parameter of his Bernoulli is the ratio of red balls to total balls, a deterministic variable.

Note how this relates to Laplace's demon. Laplace describes the deterministic universe ("... for it nothing would be uncertain and the future, as the past, would be present in its eyes"), but acknowledges the impossibility of achieving this in practice, (" ... the curve described by a simple molecule of air or vapor is regulated in a manner just as certain as the planetary orbits; the only difference between them is that which comes from our ignorance. Probability is relative in part to this ignorance, in part to our knowledge ...)

## Prior Distribution 

The tradition in Bayesian inference is to place a probability density over the parameters of interest in your model. This choice is made regardless of whether you generally believe those parameters to be stochastic or deterministic in origin. In other words, to a Bayesian, the modelling treatment does not differentiate between epistemic and aleatoric uncertainty. For linear regression we could consider the following Gaussian prior on the intercept parameter,
$$c \sim \gaussianSamp{0}{\alpha_1}$$
where α1 is the variance of the prior distribution, its mean being zero.

## Posterior Distribution

The prior distribution is combined with the likelihood of the data given the parameters $p(\dataScalar|c)$ to give the posterior via Bayes' rule,
$$p(c|\dataScalar) = \frac{p(\dataScalar|c)p(c)}{p(\dataScalar)}$$
where $p(\dataScalar)$ is the marginal probability of the data, obtained through integration over the joint density, $p(\dataScalar, c)=p(\dataScalar|c)p(c)$. Overall the equation can be summarized as,
$$\text{posterior} = \frac{\text{likelihood}\times \text{prior}}{\text{marginal likelihood}}.$$

from ipywidgets import IntSlider
import pods
pods.notebook.display_plots('dem_gaussian{stage:0>2}.svg',
diagrams='../slides/diagrams/ml',
stage=IntSlider(1, 1, 3, 1))

Another way of seeing what's going on is to note that the numerator of Bayes' rule merely multiplies the likelihood by the prior. The denominator, is not a function of c. So the functional form is entirely determined by the multiplication of prior and likelihood. This has the effect of ensuring that the posterior only has probability mass in regions where both the prior and the likelihood have probability mass.

The marginal likelihood, $p(\dataScalar)$, operates to ensure that the distribution is normalised.

For the Gaussian case, the normalisation of the posterior can be performed analytically. This is because both the prior and the likelihood have the form of an exponentiated quadratic,
exp(a2)exp(b2)=exp(a2 + b2),
and the properties of the exponential mean that the product of two exponentiated quadratics is also an exponentiated quadratic. That implies that the posterior is also Gaussian, because a normalized exponentiated quadratic is a Gaussian distribution.1

For general Bayesian inference, over more than one parameter, we need multivariate priors. For example, consider the multivariate linear regression where an observation, $\dataScalar_i$ is related to a vector of features, $\inputVector_{i, :}$, through a vector of parameters, $\weightVector$,
$$\dataScalar_i = \sum_j \weightScalar_j \inputScalar_{i, j} + \noiseScalar_i,$$
or in vector notation,
$$\dataScalar_i = \weightVector^\top \inputVector_{i, :} + \noiseScalar_i.$$
Here we've dropped the intercpet for convenience, it can be reintroduced by augmenting the feature vector, $\inputVector_{i, :}$, with a constant valued feature.

This motivates the need for a multivariate Gaussian density.

## Multivariate Regression Likelihood 

• Noise corrupted data point
$$\dataScalar_i = \weightVector^\top \inputVector_{i, :} + {\noiseScalar}_i$$
• Multivariate regression likelihood:
$$p(\dataVector| \inputMatrix, \weightVector) = \frac{1}{\left(2\pi {\dataStd}^2\right)^{\numData/2}} \exp\left(-\frac{1}{2{\dataStd}^2}\sum_{i=1}^{\numData}\left(\dataScalar_i - \weightVector^\top \inputVector_{i, :}\right)^2\right)$$
• Now use a multivariate Gaussian prior:
$$p(\weightVector) = \frac{1}{\left(2\pi \alpha\right)^\frac{\dataDim}{2}} \exp \left(-\frac{1}{2\alpha} \weightVector^\top \weightVector\right)$$

## Two Dimensional Gaussian 

Consider the distribution of height (in meters) of an adult male human population. We will approximate the marginal density of heights as a Gaussian density with mean given by 1.7m and a standard deviation of 0.15m, implying a variance of $\dataStd^2=0.0225$,
$$p(h) \sim \gaussianSamp{1.7}{0.0225}.$$
Similarly, we assume that weights of the population are distributed a Gaussian density with a mean of 75kg and a standard deviation of 6kg (implying a variance of 36),
$$p(w) \sim \gaussianSamp{75}{36}.$$

## Independence Assumption

First of all, we make an independence assumption, we assume that height and weight are independent. The definition of probabilistic independence is that the joint density, p(w, h), factorizes into its marginal densities,
p(w, h)=p(w)p(h).
Given this assumption we can sample from the joint distribution by independently sampling weights and heights.

import pods
from ipywidgets import IntSlider
pods.notebook.display_plots('independent_height_weight{fig:0>3}.svg',
directory='../slides/diagrams/ml',
fig=IntSlider(0, 0, 7, 1))

In reality height and weight are not independent. Taller people tend on average to be heavier, and heavier people are likely to be taller. This is reflected by the body mass index. A ratio suggested by one of the fathers of statistics, Adolphe Quetelet. Quetelet was interested in the notion of the average man and collected various statistics about people. He defined the BMI to be,
$$\text{BMI} = \frac{w}{h^2}$$
To deal with this dependence we now introduce the notion of correlation to the multivariate Gaussian density.

## Sampling Two Dimensional Variables 

import pods
from ipywidgets import IntSlider

pods.notebook.display_plots('correlated_height_weight{fig:0>3}.svg',
directory='../slides/diagrams/ml',
fig=IntSlider(0, 0, 7, 1))

## Independent Gaussians 

p(w, h)=p(w)p(h)

$$p(w, h) = \frac{1}{\sqrt{2\pi \dataStd_1^2}\sqrt{2\pi\dataStd_2^2}} \exp\left(-\frac{1}{2}\left(\frac{(w-\meanScalar_1)^2}{\dataStd_1^2} + \frac{(h-\meanScalar_2)^2}{\dataStd_2^2}\right)\right)$$

$$p(w, h) = \frac{1}{\sqrt{2\pi\dataStd_1^22\pi\dataStd_2^2}} \exp\left(-\frac{1}{2}\left(\begin{bmatrix}w \\ h\end{bmatrix} - \begin{bmatrix}\meanScalar_1 \\ \meanScalar_2\end{bmatrix}\right)^\top\begin{bmatrix}\dataStd_1^2& 0\\0&\dataStd_2^2\end{bmatrix}^{-1}\left(\begin{bmatrix}w \\ h\end{bmatrix} - \begin{bmatrix}\meanScalar_1 \\ \meanScalar_2\end{bmatrix}\right)\right)$$

$$p(\dataVector) = \frac{1}{\det{2\pi \mathbf{D}}^{\frac{1}{2}}} \exp\left(-\frac{1}{2}(\dataVector - \meanVector)^\top\mathbf{D}^{-1}(\dataVector - \meanVector)\right)$$

## Correlated Gaussian

Form correlated from original by rotating the data space using matrix $\rotationMatrix$.

$$p(\dataVector) = \frac{1}{\det{2\pi\mathbf{D}}^{\frac{1}{2}}} \exp\left(-\frac{1}{2}(\dataVector - \meanVector)^\top\mathbf{D}^{-1}(\dataVector - \meanVector)\right)$$

$$p(\dataVector) = \frac{1}{\det{2\pi\mathbf{D}}^{\frac{1}{2}}} \exp\left(-\frac{1}{2}(\rotationMatrix^\top\dataVector - \rotationMatrix^\top\meanVector)^\top\mathbf{D}^{-1}(\rotationMatrix^\top\dataVector - \rotationMatrix^\top\meanVector)\right)$$

$$p(\dataVector) = \frac{1}{\det{2\pi\mathbf{D}}^{\frac{1}{2}}} \exp\left(-\frac{1}{2}(\dataVector - \meanVector)^\top\rotationMatrix\mathbf{D}^{-1}\rotationMatrix^\top(\dataVector - \meanVector)\right)$$
this gives a covariance matrix:
$$\covarianceMatrix^{-1} = \rotationMatrix \mathbf{D}^{-1} \rotationMatrix^\top$$

$$p(\dataVector) = \frac{1}{\det{2\pi\covarianceMatrix}^{\frac{1}{2}}} \exp\left(-\frac{1}{2}(\dataVector - \meanVector)^\top\covarianceMatrix^{-1} (\dataVector - \meanVector)\right)$$
this gives a covariance matrix:
$$\covarianceMatrix = \rotationMatrix \mathbf{D} \rotationMatrix^\top$$

Let's first of all review the properties of the multivariate Gaussian distribution that make linear Gaussian models easier to deal with. We'll return to the, perhaps surprising, result on the parameters within the nonlinearity, $\parameterVector$, shortly.

To work with linear Gaussian models, to find the marginal likelihood all you need to know is the following rules. If
$$\dataVector = \mappingMatrix \inputVector + \noiseVector,$$
where $\dataVector$, $\inputVector$ and $\noiseVector$ are vectors and we assume that $\inputVector$ and $\noiseVector$ are drawn from multivariate Gaussians,
\begin{align} \inputVector & \sim \gaussianSamp{\meanVector}{\covarianceMatrix}\\ \noiseVector & \sim \gaussianSamp{\zerosVector}{\covarianceMatrixTwo} \end{align}
then we know that $\dataVector$ is also drawn from a multivariate Gaussian with,
$$\dataVector \sim \gaussianSamp{\mappingMatrix\meanVector}{\mappingMatrix\covarianceMatrix\mappingMatrix^\top + \covarianceMatrixTwo}.$$

With apprioriately defined covariance, $\covarianceMatrixTwo$, this is actually the marginal likelihood for Factor Analysis, or Probabilistic Principal Component Analysis (Tipping and Bishop 1999), because we integrated out the inputs (or latent variables they would be called in that case).

However, we are focussing on what happens in models which are non-linear in the inputs, whereas the above would be linear in the inputs. To consider these, we introduce a matrix, called the design matrix. We set each activation function computed at each data point to be
$$\activationScalar_{i,j} = \activationScalar(\mappingVector^{(1)}_{j}, \inputVector_{i})$$
and define the matrix of activations (known as the design matrix in statistics) to be,
$$\activationMatrix = \begin{bmatrix} \activationScalar_{1, 1} & \activationScalar_{1, 2} & \dots & \activationScalar_{1, \numHidden} \\ \activationScalar_{1, 2} & \activationScalar_{1, 2} & \dots & \activationScalar_{1, \numData} \\ \vdots & \vdots & \ddots & \vdots \\ \activationScalar_{\numData, 1} & \activationScalar_{\numData, 2} & \dots & \activationScalar_{\numData, \numHidden} \end{bmatrix}.$$
By convention this matrix always has $\numData$ rows and $\numHidden$ columns, now if we define the vector of all noise corruptions, $\noiseVector = \left[\noiseScalar_1, \dots \noiseScalar_\numData\right]^\top$.

If we define the prior distribution over the vector $\mappingVector$ to be Gaussian,
$$\mappingVector \sim \gaussianSamp{\zerosVector}{\alpha\eye},$$

then we can use rules of multivariate Gaussians to see that,
$$\dataVector \sim \gaussianSamp{\zerosVector}{\alpha \activationMatrix \activationMatrix^\top + \dataStd^2 \eye}.$$

In other words, our training data is distributed as a multivariate Gaussian, with zero mean and a covariance given by
$$\kernelMatrix = \alpha \activationMatrix \activationMatrix^\top + \dataStd^2 \eye.$$

This is an $\numData \times \numData$ size matrix. Its elements are in the form of a function. The maths shows that any element, index by i and j, is a function only of inputs associated with data points i and j, $\dataVector_i$, $\dataVector_j$. $\kernel_{i,j} = \kernel\left(\inputVector_i, \inputVector_j\right)$

If we look at the portion of this function associated only with $\mappingFunction(\cdot)$, i.e. we remove the noise, then we can write down the covariance associated with our neural network,
$$\kernel_\mappingFunction\left(\inputVector_i, \inputVector_j\right) = \alpha \activationVector\left(\mappingMatrix_1, \inputVector_i\right)^\top \activationVector\left(\mappingMatrix_1, \inputVector_j\right)$$
so the elements of the covariance or kernel matrix are formed by inner products of the rows of the design matrix.

## Gaussian Process 

This is the essence of a Gaussian process. Instead of making assumptions about our density over each data point, $\dataScalar_i$ as i.i.d. we make a joint Gaussian assumption over our data. The covariance matrix is now a function of both the parameters of the activation function, $\mappingMatrixTwo$, and the input variables, $\inputMatrix$. This comes about through integrating out the parameters of the model, $\mappingVector$.

## Basis Functions

We can basically put anything inside the basis functions, and many people do. These can be deep kernels (Cho and Saul 2009) or we can learn the parameters of a convolutional neural network inside there.

Viewing a neural network in this way is also what allows us to beform sensible batch normalizations (Ioffe and Szegedy 2015).

## Bayesian Inference by Rejection Sampling 

One view of Bayesian inference is to assume we are given a mechanism for generating samples, where we assume that mechanism is representing on accurate view on the way we believe the world works.

This mechanism is known as our prior belief.

We combine our prior belief with our observations of the real world by discarding all those samples that are inconsistent with our prior. The likelihood defines mathematically what we mean by inconsistent with the prior. The higher the noise level in the likelihood, the looser the notion of consistent.

The samples that remain are considered to be samples from the posterior.

This approach to Bayesian inference is closely related to two sampling techniques known as rejection sampling and importance sampling. It is realized in practice in an approach known as approximate Bayesian computation (ABC) or likelihood-free inference.

In practice, the algorithm is often too slow to be practical, because most samples will be inconsistent with the data and as a result the mechanism has to be operated many times to obtain a few posterior samples.

However, in the Gaussian process case, when the likelihood also assumes Gaussian noise, we can operate this mechanism mathematically, and obtain the posterior density analytically. This is the benefit of Gaussian processes.

import pods
from ipywidgets import IntSlider
pods.notebook.display_plots('gp_rejection_sample{sample:0>3}.png',
directory='../slides/diagrams/gp',
sample=IntSlider(1,1,5,1))

{

## Sampling a Function 

We will consider a Gaussian distribution with a particular structure of covariance matrix. We will generate one sample from a 25-dimensional Gaussian density.
$$\mappingFunctionVector=\left[\mappingFunction_{1},\mappingFunction_{2}\dots \mappingFunction_{25}\right].$$
in the figure below we plot these data on the y-axis against their indices on the x-axis.

import pods
from ipywidgets import IntSlider
pods.notebook.display_plots('two_point_sample{sample:0>3}.svg', '../slides/diagrams/gp', sample=IntSlider(0, 0, 8, 1))
import pods
from ipywidgets import IntSlider
pods.notebook.display_plots('two_point_sample{sample:0>3}.svg',
'../slides/diagrams/gp',
sample=IntSlider(9, 9, 12, 1))

{

## Uluru 

When viewing these contour plots, I sometimes find it helpful to think of Uluru, the prominent rock formation in Australia. The rock rises above the surface of the plane, just like a probability density rising above the zero line. The rock is three dimensional, but when we view Uluru from the classical position, we are looking at one side of it. This is equivalent to viewing the marginal density.

The joint density can be viewed from above, using contours. The conditional density is equivalent to slicing the rock. Uluru is a holy rock, so this has to be an imaginary slice. Imagine we cut down a vertical plane orthogonal to our view point (e.g. coming across our view point). This would give a profile of the rock, which when renormalized, would give us the conditional distribution, the value of conditioning would be the location of the slice in the direction we are facing.

## Prediction with Correlated Gaussians

Of course in practice, rather than manipulating mountains physically, the advantage of the Gaussian density is that we can perform these manipulations mathematically.

Prediction of $\mappingFunction_2$ given $\mappingFunction_1$ requires the conditional density, $p(\mappingFunction_2|\mappingFunction_1)$.Another remarkable property of the Gaussian density is that this conditional distribution is also guaranteed to be a Gaussian density. It has the form,
$$p(\mappingFunction_2|\mappingFunction_1) = \gaussianDist{\mappingFunction_2}{\frac{\kernelScalar_{1, 2}}{\kernelScalar_{1, 1}}\mappingFunction_1}{ \kernelScalar_{2, 2} - \frac{\kernelScalar_{1,2}^2}{\kernelScalar_{1,1}}}$$
where we have assumed that the covariance of the original joint density was given by
$$\kernelMatrix = \begin{bmatrix} \kernelScalar_{1, 1} & \kernelScalar_{1, 2}\\ \kernelScalar_{2, 1} & \kernelScalar_{2, 2}.\end{bmatrix}$$

Using these formulae we can determine the conditional density for any of the elements of our vector $\mappingFunctionVector$. For example, the variable $\mappingFunction_8$ is less correlated with $\mappingFunction_1$ than $\mappingFunction_2$. If we consider this variable we see the conditional density is more diffuse.

import pods
from ipywidgets import IntSlider
pods.notebook.display_plots('two_point_sample{sample:0>3}.svg',
'../slides/diagrams/gp',
sample=IntSlider(13, 13, 17, 1))

## Where Did This Covariance Matrix Come From? 

$$k(\inputVector, \inputVector^\prime) = \alpha \exp\left(-\frac{\left\Vert \inputVector - \inputVector^\prime\right\Vert^2_2}{2\lengthScale^2}\right)$$

import pods
from ipywidgets import IntSlider
pods.notebook.display_plots('computing_eq_three_covariance{sample:0>3}.svg',
directory='../slides/diagrams/kern',
sample=IntSlider(0, 0, 16, 1))
Figure: Entrywise fill in of the covariance matrix from the covariance function.
import pods
from ipywidgets import IntSlider
pods.notebook.display_plots('computing_eq_four_covariance{sample:0>3}.svg',
directory='../slides/diagrams/kern',
sample=IntSlider(0, 0, 27, 1))
Figure: Entrywise fill in of the covariance matrix from the covariance function.
import pods
from ipywidgets import IntSlider
pods.notebook.display_plots('computing_eq_three_2_covariance{sample:0>3}.svg',
directory='../slides/diagrams/kern',
sample=IntSlider(0, 0, 16, 1))
Figure: Entrywise fill in of the covariance matrix from the covariance function.

## Polynomial Covariance 

$$\kernelScalar(\inputVector, \inputVector^\prime) = \alpha(w \inputVector^\top\inputVector^\prime + b)^d$$

## Brownian Covariance 

Brownian motion is also a Gaussian process. It follows a Gaussian random walk, with diffusion occuring at each time point driven by a Gaussian input. This implies it is both Markov and Gaussian. The covariance function for Brownian motion has the form
$$\kernelScalar(t, t^\prime)=\alpha \min(t, t^\prime)$$

$$\kernelScalar(t, t^\prime)=\alpha \min(t, t^\prime)$$

## Periodic Covariance 

$$\kernelScalar(\inputVector, \inputVector^\prime) = \alpha\exp\left(\frac{-2\sin(\pi rw)^2}{\lengthScale^2}\right)$$

## Degenerate Covariance Functions 

Any linear basis function can also be incorporated into a covariance function. For example, an RBF network is a type of neural network with a set of radial basis functions. Meaning, the basis funciton is radially symmetric. These basis functions take the form,
$$\basisFunction_k(\inputScalar) = \exp\left(-\frac{\ltwoNorm{\inputScalar-\meanScalar_k}^{2}}{\lengthScale^{2}}\right).$$
Given a set of parameters,
$$\meanVector = \begin{bmatrix} -1 \\ 0 \\ 1\end{bmatrix},$$
we can construct the corresponding covariance function, which has the form,
$$\kernelScalar\left(\inputVals,\inputVals^{\prime}\right)=\alpha\basisVector(\inputVals)^\top \basisVector(\inputVals^\prime).$$

## Basis Function Covariance 

The fixed basis function covariance just comes from the properties of a multivariate Gaussian, if we decide
$$\mappingFunctionVector=\basisMatrix\mappingVector$$
and then we assume
$$\mappingVector \sim \gaussianSamp{\zerosVector}{\alpha\eye}$$
then it follows from the properties of a multivariate Gaussian that
$$\mappingFunctionVector \sim \gaussianSamp{\zerosVector}{\alpha\basisMatrix\basisMatrix^\top}$$
meaning that the vector of observations from the function is jointly distributed as a Gaussian process and the covariance matrix is $\kernelMatrix = \alpha\basisMatrix \basisMatrix^\top$, each element of the covariance matrix can then be found as the inner product between two rows of the basis funciton matrix.

$$\kernel(\inputVector, \inputVector^\prime) = \basisVector(\inputVector)^\top \basisVector(\inputVector^\prime)$$

## Selecting Number and Location of Basis

In practice for a basis function model we need to choose both 1. the location of the basis functions 2. the number of basis functions

One very clever of finessing this problem is to choose to have infinite basis functions and place them everywhere. To show how this is possible, we will consider a one dimensional system, $\inputScalar$, which should give the intuition of how to do this. However, these ideas also extend to multidimensional systems as shown in, for example, Williams (n.d.) and Neal (1994).

We consider a one dimensional set up with exponentiated quadratic basis functions,
$$\basisFunction_k(\inputScalar_i) = \exp\left(\frac{\ltwoNorm{\inputScalar_i - \locationScalar_k}^2}{2\rbfWidth^2}\right)$$

To place these basis functions, we first define the basis function centers in terms of a starting point on the left of our input, a, and a finishing point, b. The gap between basis is given by $\Delta\locationScalar$. The location of each basis is then given by
$$\locationScalar_k = a+\Delta\locationScalar\cdot (k-1).$$
The covariance function can then be given as
$$\kernelScalar\left(\inputScalar_i,\inputScalar_j\right) = \sum_{k=1}^\numBasisFunc \basisFunction_k(\inputScalar_i)\basisFunction_k(\inputScalar_j)$$

\begin{aligned} \kernelScalar\left(\inputScalar_i,\inputScalar_j\right) = &\alpha^\prime\Delta\locationScalar \sum_{k=1}^{\numBasisFunc} \exp\Bigg( -\frac{\inputScalar_i^2 + \inputScalar_j^2}{2\rbfWidth^2}\\ & - \frac{2\left(a+\Delta\locationScalar\cdot (k-1)\right) \left(\inputScalar_i+\inputScalar_j\right) + 2\left(a+\Delta\locationScalar \cdot (k-1)\right)^2}{2\rbfWidth^2} \Bigg) \end{aligned}
where we've also scaled the variance of the process by $\Delta\locationScalar$.

A consequence of our definition is that the first and last basis function locations are given by
$$\locationScalar_1=a \ \text{and}\ \locationScalar_\numBasisFunc=b \ \text{so}\ b= a+ \Delta\locationScalar\cdot(\numBasisFunc-1)$$
This implies that the distance between b and a is given by
$$b-a = \Delta\locationScalar (\numBasisFunc -1)$$
and since the basis functions are separated by $\Delta\locationScalar$ the number of basis functions is given by
$$\numBasisFunc = \frac{b-a}{\Delta \locationScalar} + 1$$
The next step is to take the limit as $\Delta\locationScalar\rightarrow 0$ so $\numBasisFunc \rightarrow \infty$ where we have used $a + k\cdot\Delta\locationScalar\rightarrow \locationScalar$.

Performing the integration gives
\begin{aligned} \kernelScalar(\inputScalar_i,&\inputScalar_j) = \alpha^\prime \sqrt{\pi\rbfWidth^2} \exp\left( -\frac{\left(\inputScalar_i-\inputScalar_j\right)^2}{4\rbfWidth^2}\right)\\ &\times \frac{1}{2}\left[\text{erf}\left(\frac{\left(b - \frac{1}{2}\left(\inputScalar_i + \inputScalar_j\right)\right)}{\rbfWidth} \right)- \text{erf}\left(\frac{\left(a - \frac{1}{2}\left(\inputScalar_i + \inputScalar_j\right)\right)}{\rbfWidth} \right)\right], \end{aligned}
Now we take the limit as a → −∞ and b → ∞
$$\kernelScalar\left(\inputScalar_i,\inputScalar_j\right) = \alpha\exp\left( -\frac{\left(\inputScalar_i-\inputScalar_j\right)^2}{4\rbfWidth^2}\right).$$
where $\alpha=\alpha^\prime \sqrt{\pi\rbfWidth^2}$.

In conclusion, an RBF model with infinite basis functions is a Gaussian process with the exponentiated quadratic covariance function
$$\kernelScalar\left(\inputScalar_i,\inputScalar_j\right) = \alpha \exp\left( -\frac{\left(\inputScalar_i-\inputScalar_j\right)^2}{4\rbfWidth^2}\right).$$

Note that while the functional form of the basis function and the covariance function are similar, in general if we repeated this analysis for other basis functions the covariance function will have a very different form. For example the error function, erf(⋅), results in an $\asin(\cdot)$ form. See Williams (n.d.) for more details.

## MLP Covariance 

The multi-layer perceptron (MLP) covariance, also known as the neural network covariance or the arcsin covariance, is derived by considering the infinite limit of a neural network.

$$\kernelScalar(\inputVector, \inputVector^\prime) = \alpha \arcsin\left(\frac{w \inputVector^\top \inputVector^\prime + b}{\sqrt{\left(w \inputVector^\top \inputVector + b + 1\right)\left(w \left.\inputVector^\prime\right.^\top \inputVector^\prime + b + 1\right)}}\right)$$

## RELU Covariance 

$$\kernelScalar(\inputVector, \inputVector^\prime) = \alpha \arcsin\left(\frac{w \inputVector^\top \inputVector^\prime + b} {\sqrt{\left(w \inputVector^\top \inputVector + b + 1\right) \left(w \left.\inputVector^\prime\right.^\top \inputVector^\prime + b + 1\right)}}\right)$$

## Sinc Covariance 

Another approach to developing covariance function exploits Bochner's theorem Bochner (1959). Bochner's theorem tells us that any positve filter in Fourier space implies has an associated Gaussian process with a stationary covariance function. The covariance function is the inverse Fourier transform of the filter applied in Fourier space.

For example, in signal processing, band limitations are commonly applied as an assumption. For example, we may believe that no frequency above w = 2 exists in the signal. This is equivalent to a rectangle function being applied as a the filter in Fourier space.

The inverse Fourier transform of the rectangle function is the sinc(⋅) function. So the sinc is a valid covariance function, and it represents band limited signals.

Note that other covariance functions we've introduced can also be interpreted in this way. For example, the exponentiated quadratic covariance function can be Fourier transformed to see what the implied filter in Fourier space is. The Fourier transform of the exponentiated quadratic is an exponentiated quadratic, so the standard EQ-covariance implies a EQ filter in Fourier space.

# References

Bochner, Salomon. 1959. Lectures on Fourier Integrals. Princeton University Press. http://books.google.co.uk/books?id=-vU02QewWK8C.

Cho, Youngmin, and Lawrence K. Saul. 2009. “Kernel Methods for Deep Learning.” In Advances in Neural Information Processing Systems 22, edited by Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams, and A. Culotta, 342–50. Curran Associates, Inc. http://papers.nips.cc/paper/3628-kernel-methods-for-deep-learning.pdf.

Cox, R. T. 1946. “Probability, Frequency and Reasonable Expectation.” American Journal of Physics 14 (1): 1–13.

Finetti, Bruno de. 1937. “La Prévision: Ses Lois Logiques, Ses Sources Subjectives.” Annales de L’Institut Henri Poincaré 7: 1–68.

Ioffe, Sergey, and Christian Szegedy. 2015. “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.” In Proceedings of the 32nd International Conference on Machine Learning, edited by Francis Bach and David Blei, 37:448–56. Proceedings of Machine Learning Research. Lille, France: PMLR. http://proceedings.mlr.press/v37/ioffe15.html.

Laplace, Pierre Simon. 1814. Essai Philosophique Sur Les Probabilités. 2nd ed. Paris: Courcier.

1. Note not all exponentiated quadratics can be normalized, to do so, the coefficient associated with the variable squared, $\dataScalar^2$, must be strictly positive.