Heilbronn Data Science Seminar, Jean Golding Institute
\[ \text{data} + \text{model} \stackrel{\text{compute}}{\rightarrow} \text{prediction}\]
\[ \text{odds} = \frac{p(\text{bought})}{p(\text{not bought})} \]
\[ \log \text{odds} = \beta_0 + \beta_1 \text{age} + \beta_2 \text{latitude}.\]
\[ p(\text{bought}) = \sigma\left(\beta_0 + \beta_1 \text{age} + \beta_2 \text{latitude}\right).\]
\[ p(\text{bought}) = \sigma\left(\boldsymbol{\beta}^\top \mathbf{ x}\right).\]
\[ y= f\left(\mathbf{ x}, \boldsymbol{\beta}\right).\]
We call \(f(\cdot)\) the prediction function.
\[E(\boldsymbol{\beta}, \mathbf{Y}, \mathbf{X})\]
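A minimal sketch (not from the talk) of this prediction function and objective in NumPy; the feature columns and data values below are invented purely for illustration.

```python
import numpy as np

def sigmoid(a):
    """Logistic link function: sigma(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def predict(beta, X):
    """Prediction function f(x, beta): probability of 'bought'."""
    return sigmoid(X @ beta)

def objective(beta, y, X):
    """Objective E(beta, y, X): negative log likelihood of the labels."""
    p = predict(beta, X)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Illustrative data: columns are [1 (bias), age, latitude].
X = np.array([[1.0, 23.0, 51.5],
              [1.0, 54.0, 48.1],
              [1.0, 37.0, 55.9]])
y = np.array([0.0, 1.0, 1.0])
beta = np.zeros(3)
print(predict(beta, X), objective(beta, y, X))
```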
These are interpretable models: vital for applications such as disease modelling.
Modern machine learning methods are less interpretable.
Example: face recognition
Outline of the DeepFace architecture. A front-end of a single convolution-pooling-convolution filtering on the rectified input, followed by three locally-connected layers and two fully-connected layers. Colors illustrate feature maps produced at each layer. The net includes more than 120 million parameters, where more than 95% come from the locally- and fully-connected layers.
\[ \begin{align*} \mathbf{ h}_{1} &= \phi\left(\mathbf{W}_1 \mathbf{ x}\right)\\ \mathbf{ h}_{2} &= \phi\left(\mathbf{W}_2\mathbf{ h}_{1}\right)\\ \mathbf{ h}_{3} &= \phi\left(\mathbf{W}_3 \mathbf{ h}_{2}\right)\\ f&= \mathbf{ w}_4 ^\top\mathbf{ h}_{3} \end{align*} \]
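A minimal sketch of this forward pass, assuming a ReLU choice for the basis function \(\phi\) and randomly initialised weights; the layer widths are illustrative only.

```python
import numpy as np

def phi(a):
    """Basis/activation function; ReLU chosen here for illustration."""
    return np.maximum(a, 0.0)

rng = np.random.default_rng(0)
d, k1, k2, k3 = 10, 100, 100, 100          # illustrative layer widths
W1 = rng.standard_normal((k1, d))
W2 = rng.standard_normal((k2, k1))
W3 = rng.standard_normal((k3, k2))
w4 = rng.standard_normal(k3)

x = rng.standard_normal(d)
h1 = phi(W1 @ x)
h2 = phi(W2 @ h1)
h3 = phi(W3 @ h2)
f = w4 @ h3
print(f)
```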
MacKay: NeurIPS Tutorial 1997 “Have we thrown out the baby with the bathwater?” (Published as MacKay, n.d.)
Potential problem: if the number of nodes in two adjacent layers is large, the corresponding \(\mathbf{W}\) is also very large and there is the potential to overfit.
Proposed solution: “dropout.”
Alternative solution: parameterize \(\mathbf{W}\) with its SVD. \[ \mathbf{W}= \mathbf{U}\boldsymbol{ \Lambda}\mathbf{V}^\top \] or \[ \mathbf{W}= \mathbf{U}\mathbf{V}^\top \] where if \(\mathbf{W}\in \Re^{k_1\times k_2}\) then \(\mathbf{U}\in \Re^{k_1\times q}\) and \(\mathbf{V}\in \Re^{k_2\times q}\), i.e. we have a low rank matrix factorization for the weights.
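As an illustrative sketch (not the talk's code), a weight matrix can be replaced by a rank-\(q\) factorization obtained by truncating its SVD; the sizes and the bottleneck dimension \(q\) below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
k1, k2, q = 100, 100, 10                    # illustrative sizes, q << k1, k2

W = rng.standard_normal((k1, k2))
U_full, s, Vt = np.linalg.svd(W, full_matrices=False)

# Keep only the leading q singular directions: W is approximated by U Lambda V^T.
U = U_full[:, :q]
Lam = np.diag(s[:q])
V = Vt[:q, :].T
W_lowrank = U @ Lam @ V.T

print(W.size, U.size + q + V.size)          # parameter counts: k1*k2 vs (k1 + k2)*q + q
```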
The network can now be written mathematically as \[ \begin{align} \mathbf{ z}_{1} &= \mathbf{V}^\top_1 \mathbf{ x}\\ \mathbf{ h}_{1} &= \phi\left(\mathbf{U}_1 \mathbf{ z}_{1}\right)\\ \mathbf{ z}_{2} &= \mathbf{V}^\top_2 \mathbf{ h}_{1}\\ \mathbf{ h}_{2} &= \phi\left(\mathbf{U}_2 \mathbf{ z}_{2}\right)\\ \mathbf{ z}_{3} &= \mathbf{V}^\top_3 \mathbf{ h}_{2}\\ \mathbf{ h}_{3} &= \phi\left(\mathbf{U}_3 \mathbf{ z}_{3}\right)\\ \mathbf{ y}&= \mathbf{ w}_4^\top\mathbf{ h}_{3}. \end{align} \]
Eliminating the hidden units \(\mathbf{ h}_i\), this can be written more compactly as \[ \begin{align} \mathbf{ z}_{1} &= \mathbf{V}^\top_1 \mathbf{ x}\\ \mathbf{ z}_{2} &= \mathbf{V}^\top_2 \phi\left(\mathbf{U}_1 \mathbf{ z}_{1}\right)\\ \mathbf{ z}_{3} &= \mathbf{V}^\top_3 \phi\left(\mathbf{U}_2 \mathbf{ z}_{2}\right)\\ \mathbf{ y}&= \mathbf{ w}_4 ^\top \mathbf{ z}_{3} \end{align} \]
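Under this factorization the forward pass introduces the low-dimensional bottleneck variables \(\mathbf{ z}_i\) explicitly; a sketch with assumed dimensions, a ReLU \(\phi\), and random parameters.

```python
import numpy as np

def phi(a):
    return np.maximum(a, 0.0)               # illustrative choice of activation

rng = np.random.default_rng(2)
d, k, q = 10, 100, 5                        # illustrative widths and bottleneck size
V1, U1 = rng.standard_normal((d, q)), rng.standard_normal((k, q))
V2, U2 = rng.standard_normal((k, q)), rng.standard_normal((k, q))
V3, U3 = rng.standard_normal((k, q)), rng.standard_normal((k, q))
w4 = rng.standard_normal(q)

x = rng.standard_normal(d)
z1 = V1.T @ x
z2 = V2.T @ phi(U1 @ z1)
z3 = V3.T @ phi(U2 @ z2)
y = w4 @ z3
print(y)
```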
Replace each neural network with a Gaussian process \[ \begin{align} \mathbf{ z}_{1} &= \mathbf{ f}_1\left(\mathbf{ x}\right)\\ \mathbf{ z}_{2} &= \mathbf{ f}_2\left(\mathbf{ z}_{1}\right)\\ \mathbf{ z}_{3} &= \mathbf{ f}_3\left(\mathbf{ z}_{2}\right)\\ \mathbf{ y}&= \mathbf{ f}_4\left(\mathbf{ z}_{3}\right) \end{align} \]
Equivalent to placing a prior over the parameters and taking the width of each layer to infinity.
\[\mathbf{ y}= \mathbf{ f}_4\left(\mathbf{ f}_3\left(\mathbf{ f}_2\left(\mathbf{ f}_1\left(\mathbf{ x}\right)\right)\right)\right)\]
Composite multivariate function
\[ \mathbf{g}(\mathbf{ x})=\mathbf{ f}_5(\mathbf{ f}_4(\mathbf{ f}_3(\mathbf{ f}_2(\mathbf{ f}_1(\mathbf{ x}))))). \]
Gaussian processes give priors over functions.
Elegant properties:
For particular covariance functions they are ‘universal approximators,’ i.e. all functions can have support under the prior.
Gaussian derivatives might ring alarm bells.
E.g. a priori they don’t believe in function ‘jumps.’
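A sketch of drawing functions from a GP prior with an exponentiated quadratic (RBF) covariance; the samples come out smooth, with no jumps. The kernel and lengthscale choices are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(X, X2, lengthscale=1.0, variance=1.0):
    """Exponentiated quadratic covariance function."""
    d2 = (X[:, None] - X2[None, :]) ** 2
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

rng = np.random.default_rng(3)
x = np.linspace(-3.0, 3.0, 200)
K = rbf_kernel(x, x) + 1e-8 * np.eye(len(x))    # jitter for numerical stability
L = np.linalg.cholesky(K)

samples = L @ rng.standard_normal((len(x), 5))  # five draws from the prior
print(samples.shape)                            # each column is a smooth function of x
```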
From a process perspective: process composition.
A (new?) way of constructing more complex processes based on simpler components.
Damianou (2015)
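Process composition can be illustrated by feeding a draw from one GP through another, mirroring the layered equations above; a minimal sketch with a single input and output dimension and an exponentiated quadratic covariance, all choices illustrative.

```python
import numpy as np

def rbf_kernel(X, X2, lengthscale=1.0):
    return np.exp(-0.5 * (X[:, None] - X2[None, :]) ** 2 / lengthscale**2)

def sample_gp(x, rng):
    """Draw one function-value vector from a GP prior evaluated at the points x."""
    K = rbf_kernel(x, x) + 1e-8 * np.eye(len(x))
    return np.linalg.cholesky(K) @ rng.standard_normal(len(x))

rng = np.random.default_rng(4)
x = np.linspace(-3.0, 3.0, 200)

z1 = sample_gp(x, rng)        # z1 = f1(x)
z2 = sample_gp(z1, rng)       # z2 = f2(z1)
z3 = sample_gp(z2, rng)       # z3 = f3(z2)
y = sample_gp(z3, rng)        # y  = f4(z3)
print(y.shape)
```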
A Unifying Framework for Gaussian Process Pseudo-Point Approximations using Power Expectation Propagation (Bui et al., 2017)
Deep Gaussian Processes and Variational Propagation of Uncertainty (Damianou, 2015)
Can a Deep Gaussian process help?
A deep GP is one GP feeding into another.
twitter: @lawrennd
podcast: The Talking Machines
newspaper: Guardian Profile Page
blog posts: