# Deep Gaussian Processes: A Motivation and Introduction

Heilbronn Data Science Seminar, Jean Golding Institute

## The Fourth Industrial Revolution

• The first to be named before it happened.

## What is Machine Learning?

$\text{data} + \text{model} \stackrel{\text{compute}}{\rightarrow} \text{prediction}$

• data : observations, could be actively or passively acquired (meta-data).
• model : assumptions, based on previous experience (other data! transfer learning etc), or beliefs about the regularities of the universe. Inductive bias.
• prediction : an action to be taken or a categorization or a quality score.

## What is Machine Learning?

$\text{data} + \text{model} \stackrel{\text{compute}}{\rightarrow} \text{prediction}$

• To combine data with a model we need:
• a prediction function $f(\cdot)$ that encodes our beliefs about the regularities of the universe
• an objective function $E(\cdot)$ that defines the cost of misprediction.

## What does Machine Learning do?

• Automation scales by codifying processes and automating them.
• Need:
• Interconnected components
• Compatible components
• Early examples:
• cf Colt 45, Ford Model T

## Codify Through Mathematical Functions

• How does machine learning work?
• Jumper (jersey/sweater) purchase with logistic regression

$\text{odds} = \frac{p(\text{bought})}{p(\text{not bought})}$

$\log \text{odds} = \beta_0 + \beta_1 \text{age} + \beta_2 \text{latitude}.$
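The two formulas above can be sketched directly in code. The coefficients below are hypothetical, chosen only to illustrate the mapping from features to log odds and back to a probability:

```python
import numpy as np

def log_odds(age, latitude, beta):
    """Linear predictor: the log odds of a jumper purchase."""
    b0, b1, b2 = beta
    return b0 + b1 * age + b2 * latitude

def prob_bought(age, latitude, beta):
    """Invert the log odds with the logistic sigmoid to recover p(bought)."""
    return 1.0 / (1.0 + np.exp(-log_odds(age, latitude, beta)))

# Hypothetical coefficients: purchases rise with age and (colder) latitude.
beta = (-5.0, 0.02, 0.08)
p = prob_bought(40, 52.0, beta)   # a 40-year-old at latitude 52
```

Note that `np.log(p / (1 - p))` recovers the linear predictor, which is exactly the log-odds relationship in the equation above.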

## Codify Through Mathematical Functions

• How does machine learning work?
• Jumper (jersey/sweater) purchase with logistic regression

$p(\text{bought}) = \sigma\left(\beta_0 + \beta_1 \text{age} + \beta_2 \text{latitude}\right).$

## Codify Through Mathematical Functions

• How does machine learning work?
• Jumper (jersey/sweater) purchase with logistic regression

$p(\text{bought}) = \sigma\left(\boldsymbol{\beta}^\top \mathbf{ x}\right).$

## Codify Through Mathematical Functions

• How does machine learning work?
• Jumper (jersey/sweater) purchase with logistic regression

$y= f\left(\mathbf{ x}, \boldsymbol{\beta}\right).$

We call $f(\cdot)$ the prediction function.

## Fit to Data

• Use an objective function

$E(\boldsymbol{\beta}, \mathbf{Y}, \mathbf{X})$

• E.g. least squares $E(\boldsymbol{\beta}, \mathbf{Y}, \mathbf{X}) = \sum_{i=1}^n\left(y_i - f(\mathbf{ x}_i, \boldsymbol{\beta})\right)^2.$
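A minimal sketch of the least-squares objective on toy data (the data and prediction function here are illustrative, not from the talk):

```python
import numpy as np

def f(X, beta):
    """Linear prediction function; X carries a column of ones for the bias."""
    return X @ beta

def objective(beta, y, X):
    """Least-squares objective E(beta, Y, X): sum of squared residuals."""
    return np.sum((y - f(X, beta)) ** 2)

# Toy data generated from y = 1 + 2x, so the optimum drives E to zero.
X = np.column_stack([np.ones(4), np.arange(4.0)])
y = 1.0 + 2.0 * np.arange(4.0)
beta_star, *_ = np.linalg.lstsq(X, y, rcond=None)  # minimizes the objective
```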

## Two Components

• Prediction function, $f(\cdot)$
• Objective function, $E(\cdot)$

## Deep Learning

• These are interpretable models: vital for disease modeling etc.

• Modern machine learning methods are less interpretable

• Example: face recognition

Outline of the DeepFace architecture. A front-end of a single convolution-pooling-convolution filtering on the rectified input, followed by three locally-connected layers and two fully-connected layers. Color illustrates feature maps produced at each layer. The net includes more than 120 million parameters, where more than 95% come from the locally- and fully-connected layers. Source: DeepFace (Taigman et al., 2014)

## Mathematically

\begin{align*} \mathbf{ h}_{1} &= \phi\left(\mathbf{W}_1 \mathbf{ x}\right)\\ \mathbf{ h}_{2} &= \phi\left(\mathbf{W}_2\mathbf{ h}_{1}\right)\\ \mathbf{ h}_{3} &= \phi\left(\mathbf{W}_3 \mathbf{ h}_{2}\right)\\ f&= \mathbf{ w}_4 ^\top\mathbf{ h}_{3} \end{align*}

## AlphaGo

## Sedolian Void

## Uncertainty

• Uncertainty in prediction arises from:
• scarcity of training data and
• mismatch between the set of prediction functions we choose and all possible prediction functions.
• There are also uncertainties in the objective; we leave those for another day.

## Structure of Priors

MacKay: NeurIPS Tutorial 1997 “Have we thrown out the baby with the bathwater?” (Published as MacKay, n.d.)

## Overfitting

• Potential problem: if the number of nodes in two adjacent layers is large, the corresponding $\mathbf{W}$ is also very large, and there is the potential to overfit.

• Proposed solution: “dropout.”

• Alternative solution: parameterize $\mathbf{W}$ with its SVD. $\mathbf{W}= \mathbf{U}\boldsymbol{ \Lambda}\mathbf{V}^\top$ or $\mathbf{W}= \mathbf{U}\mathbf{V}^\top$ where if $\mathbf{W}\in \Re^{k_1\times k_2}$ then $\mathbf{U}\in \Re^{k_1\times q}$ and $\mathbf{V}\in \Re^{k_2\times q}$, i.e. we have a low rank matrix factorization for the weights.
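The saving from the factorization is easy to see by counting entries. A small sketch with illustrative sizes (the widths and rank below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)

k1, k2, q = 200, 300, 10   # layer widths and bottleneck rank (illustrative)
U = rng.standard_normal((k1, q))
V = rng.standard_normal((k2, q))
W = U @ V.T                # the low-rank parameterization W = U V^T

full_params = k1 * k2            # 60,000 entries in an unconstrained W
factored_params = q * (k1 + k2)  # 5,000 entries in U and V together
```

The product $\mathbf{U}\mathbf{V}^\top$ has rank at most $q$, so the network is forced through a $q$-dimensional bottleneck between the two layers.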

## Mathematically

The network can now be written mathematically as \begin{align} \mathbf{ z}_{1} &= \mathbf{V}^\top_1 \mathbf{ x}\\ \mathbf{ h}_{1} &= \phi\left(\mathbf{U}_1 \mathbf{ z}_{1}\right)\\ \mathbf{ z}_{2} &= \mathbf{V}^\top_2 \mathbf{ h}_{1}\\ \mathbf{ h}_{2} &= \phi\left(\mathbf{U}_2 \mathbf{ z}_{2}\right)\\ \mathbf{ z}_{3} &= \mathbf{V}^\top_3 \mathbf{ h}_{2}\\ \mathbf{ h}_{3} &= \phi\left(\mathbf{U}_3 \mathbf{ z}_{3}\right)\\ \mathbf{ y}&= \mathbf{ w}_4^\top\mathbf{ h}_{3}. \end{align}

## A Cascade of Neural Networks

\begin{align} \mathbf{ z}_{1} &= \mathbf{V}^\top_1 \mathbf{ x}\\ \mathbf{ z}_{2} &= \mathbf{V}^\top_2 \phi\left(\mathbf{U}_1 \mathbf{ z}_{1}\right)\\ \mathbf{ z}_{3} &= \mathbf{V}^\top_3 \phi\left(\mathbf{U}_2 \mathbf{ z}_{2}\right)\\ \mathbf{ y}&= \mathbf{ w}_4 ^\top \mathbf{ z}_{3} \end{align}

• Replace each neural network with a Gaussian process \begin{align} \mathbf{ z}_{1} &= \mathbf{ f}_1\left(\mathbf{ x}\right)\\ \mathbf{ z}_{2} &= \mathbf{ f}_2\left(\mathbf{ z}_{1}\right)\\ \mathbf{ z}_{3} &= \mathbf{ f}_3\left(\mathbf{ z}_{2}\right)\\ \mathbf{ y}&= \mathbf{ f}_4\left(\mathbf{ z}_{3}\right) \end{align}

• This is equivalent to placing a prior over the parameters and taking the width of each layer to infinity.

## Stochastic Process Composition

$\mathbf{ y}= \mathbf{ f}_4\left(\mathbf{ f}_3\left(\mathbf{ f}_2\left(\mathbf{ f}_1\left(\mathbf{ x}\right)\right)\right)\right)$

## Mathematically

• Composite multivariate function

$\mathbf{g}(\mathbf{ x})=\mathbf{ f}_5(\mathbf{ f}_4(\mathbf{ f}_3(\mathbf{ f}_2(\mathbf{ f}_1(\mathbf{ x}))))).$

## Equivalent to Markov Chain

• Composite multivariate function $p(\mathbf{ y}|\mathbf{ x})= p(\mathbf{ y}|\mathbf{ f}_5)p(\mathbf{ f}_5|\mathbf{ f}_4)p(\mathbf{ f}_4|\mathbf{ f}_3)p(\mathbf{ f}_3|\mathbf{ f}_2)p(\mathbf{ f}_2|\mathbf{ f}_1)p(\mathbf{ f}_1|\mathbf{ x})$

## Why Composition?

• Gaussian processes give priors over functions.

• Elegant properties:

• e.g. Derivatives of process are also Gaussian distributed (if they exist).
• For particular covariance functions they are ‘universal approximators,’ i.e. all functions can have support under the prior.

• Gaussian derivatives might ring alarm bells.

• E.g. a priori they don’t believe in function ‘jumps.’

## Stochastic Process Composition

• From a process perspective: process composition.

• A (new?) way of constructing more complex processes based on simpler components.

## Deep Gaussian Processes

Damianou (2015)

[Graphical equation from Damianou (2015): the output expressed as $f(\cdot)$ of the input, with images standing in for the variables.]

## Modern Review

• A Unifying Framework for Gaussian Process Pseudo-Point Approximations using Power Expectation Propagation Bui et al. (2017)

• Deep Gaussian Processes and Variational Propagation of Uncertainty Damianou (2015)

## GPy: A Gaussian Process Framework in Python

https://github.com/SheffieldML/GPy

## GPy: A Gaussian Process Framework in Python

• Wide availability of libraries, ‘modern’ scripting language.
• Allows us to set projects to undergraduates in Comp Sci that use GPs.
• Available through GitHub https://github.com/SheffieldML/GPy
• Reproducible Research with Jupyter Notebook.

## Features

• Probabilistic-style programming (specify the model, not the algorithm).
• Non-Gaussian likelihoods.
• Multivariate outputs.
• Dimensionality reduction.
• Approximations for large data sets.
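GPy is not assumed to be installed here, so the sketch below reproduces in plain numpy the core computation its `GPRegression` model automates (here with fixed hyperparameters; GPy additionally handles the likelihood, hyperparameter optimization, and approximations):

```python
import numpy as np

def rbf(X1, X2, lengthscale=0.3, variance=1.0):
    """Exponentiated quadratic covariance (the role of GPy.kern.RBF)."""
    sq = (X1[:, None] - X2[None, :]) ** 2
    return variance * np.exp(-0.5 * sq / lengthscale**2)

def posterior_mean(X, y, Xnew, noise=1e-4):
    """GP regression posterior mean with fixed hyperparameters."""
    K = rbf(X, X) + noise * np.eye(len(X))
    return rbf(Xnew, X) @ np.linalg.solve(K, y)

X = np.linspace(0.0, 1.0, 8)
y = np.sin(2.0 * np.pi * X)
mu = posterior_mean(X, y, X)   # near-interpolation at the training inputs
```

In GPy the equivalent model is specified, not coded: build a kernel, hand it data, and call `optimize()` — the probabilistic-style programming referred to above.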

## Olympic Marathon Data

Gold medal times for the Olympic Marathon since 1896. Marathons before 1924 didn’t have a standardized distance, so results are presented using pace per km. In 1904 the Marathon was badly organized, leading to very slow times. Image from Wikimedia Commons http://bit.ly/16kMKHQ

## Alan Turing

## Probability Winning Olympics?

• He was a formidable Marathon runner.
• In 1946 he ran a time 2 hours 46 minutes.
• That’s a pace of 3.95 min/km.
• What is the probability he would have won an Olympics if one had been held in 1946?

## Deep GP Fit

• Can a Deep Gaussian process help?

• Deep GP is one GP feeding into another.

## Deep NNs as Point Estimates for Deep GPs

Dutordoir et al. (2021)

## Spherical Harmonics