Deep Gaussian Processes: A Motivation and Introduction

Heilbronn Data Science Seminar, Jean Golding Institute

The Fourth Industrial Revolution

• The first industrial revolution to be named before it happened.

What is Machine Learning?

$\text{data} + \text{model} \stackrel{\text{compute}}{\rightarrow} \text{prediction}$

• data : observations, which could be actively or passively acquired (meta-data).
• model : assumptions, based on previous experience (other data! transfer learning, etc.) or on beliefs about the regularities of the universe. Inductive bias.
• prediction : an action to be taken, a categorization, or a quality score.

What is Machine Learning?

$\text{data} + \text{model} \stackrel{\text{compute}}{\rightarrow} \text{prediction}$

• To combine data with a model we need:
• a prediction function $f(\cdot)$ that encodes our beliefs about the regularities of the universe
• an objective function $E(\cdot)$ that defines the cost of misprediction.

What does Machine Learning do?

• Automation scales by codifying processes and automating them.
• Need:
• Interconnected components
• Compatible components
• Early examples:
• cf. the Colt 45 and the Ford Model T

Codify Through Mathematical Functions

• How does machine learning work?
• Jumper (jersey/sweater) purchase with logistic regression

$\text{odds} = \frac{p(\text{bought})}{p(\text{not bought})}$

$\log \text{odds} = \beta_0 + \beta_1 \text{age} + \beta_2 \text{latitude}.$

Codify Through Mathematical Functions

• How does machine learning work?
• Jumper (jersey/sweater) purchase with logistic regression

$p(\text{bought}) = \sigma\left(\beta_0 + \beta_1 \text{age} + \beta_2 \text{latitude}\right).$

Codify Through Mathematical Functions

• How does machine learning work?
• Jumper (jersey/sweater) purchase with logistic regression

$p(\text{bought}) = \sigma\left(\boldsymbol{\beta}^\top \mathbf{ x}\right).$

Codify Through Mathematical Functions

• How does machine learning work?
• Jumper (jersey/sweater) purchase with logistic regression

$y= f\left(\mathbf{ x}, \boldsymbol{\beta}\right).$

We call $f(\cdot)$ the prediction function.
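The jumper-purchase model above can be sketched in a few lines of NumPy. The coefficient values and customer features here are made-up numbers, purely for illustration:

```python
import numpy as np

def sigmoid(z):
    """Logistic link: maps log odds to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict(X, beta):
    """Prediction function f(x, beta) = sigma(beta^T x).

    X    : (n, d) design matrix, first column of ones for the bias beta_0
    beta : (d,) coefficients, e.g. [beta_0, beta_age, beta_latitude]
    """
    return sigmoid(X @ beta)

# Hypothetical coefficients and two customers (bias, age, latitude):
beta = np.array([-3.0, 0.02, 0.05])
X = np.array([[1.0, 35.0, 51.5],   # 35 years old, latitude 51.5
              [1.0, 20.0, 10.0]])  # 20 years old, latitude 10.0
p_bought = predict(X, beta)        # probability of purchase per customer
```

The older customer at high latitude gets the larger purchase probability under these illustrative coefficients, as the positive age and latitude weights suggest.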

Fit to Data

• Use an objective function

$E(\boldsymbol{\beta}, \mathbf{Y}, \mathbf{X})$

• E.g. least squares $E(\boldsymbol{\beta}, \mathbf{Y}, \mathbf{X}) = \sum_{i=1}^n\left(y_i - f(\mathbf{ x}_i, \boldsymbol{\beta})\right)^2.$
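A minimal sketch of the least-squares objective for a linear prediction function, on synthetic data. Fitting means minimizing $E$; for least squares the minimizer solves the normal equations, here via `np.linalg.lstsq`:

```python
import numpy as np

def prediction(X, beta):
    # linear prediction function f(x, beta) = beta^T x
    return X @ beta

def objective(beta, y, X):
    # least-squares objective E(beta, y, X) = sum_i (y_i - f(x_i, beta))^2
    return np.sum((y - prediction(X, beta)) ** 2)

# synthetic, noise-free data so the true coefficients are recoverable
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
beta_true = np.array([1.5, -0.5])
y = prediction(X, beta_true)

# minimize E: solve the normal equations X^T X beta = X^T y
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
```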

Two Components

• Prediction function, $f(\cdot)$
• Objective function, $E(\cdot)$

Deep Learning

• These are interpretable models: vital for disease modeling etc.

• Modern machine learning methods are less interpretable

• Example: face recognition

Outline of the DeepFace architecture. A front-end of a single convolution-pooling-convolution filtering on the rectified input, followed by three locally-connected layers and two fully-connected layers. Color illustrates feature maps produced at each layer. The net includes more than 120 million parameters, of which more than 95% come from the locally- and fully-connected layers.

Source: DeepFace (Taigman et al., 2014)

Mathematically

\begin{align*} \mathbf{ h}_{1} &= \phi\left(\mathbf{W}_1 \mathbf{ x}\right)\\ \mathbf{ h}_{2} &= \phi\left(\mathbf{W}_2\mathbf{ h}_{1}\right)\\ \mathbf{ h}_{3} &= \phi\left(\mathbf{W}_3 \mathbf{ h}_{2}\right)\\ f&= \mathbf{ w}_4 ^\top\mathbf{ h}_{3} \end{align*}
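The four equations above translate directly into code. This sketch uses tanh as the example basis function $\phi$ and arbitrary layer widths; the random weights stand in for trained ones:

```python
import numpy as np

def phi(z):
    # example basis/activation function
    return np.tanh(z)

rng = np.random.default_rng(0)
# illustrative layer widths: input 10, hidden layers 20, 15, 8
W1 = rng.normal(size=(20, 10))
W2 = rng.normal(size=(15, 20))
W3 = rng.normal(size=(8, 15))
w4 = rng.normal(size=8)

def f(x):
    h1 = phi(W1 @ x)       # h_1 = phi(W_1 x)
    h2 = phi(W2 @ h1)      # h_2 = phi(W_2 h_1)
    h3 = phi(W3 @ h2)      # h_3 = phi(W_3 h_2)
    return w4 @ h3         # f = w_4^T h_3

x = rng.normal(size=10)
y = f(x)                   # a scalar output
```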

Uncertainty

• Uncertainty in prediction arises from:
• scarcity of training data and
• mismatch between the set of prediction functions we choose and all possible prediction functions.
• There are also uncertainties in the objective; we leave those for another day.

Structure of Priors

MacKay: NeurIPS Tutorial 1997 “Have we thrown out the baby with the bathwater?” (Published as MacKay, n.d.)

Overfitting

• Potential problem: if the number of nodes in two adjacent layers is large, the corresponding weight matrix $\mathbf{W}$ is also very large, and there is the potential to overfit.

• Proposed solution: “dropout.”

• Alternative solution: parameterize $\mathbf{W}$ with its SVD, $\mathbf{W}= \mathbf{U}\boldsymbol{ \Lambda}\mathbf{V}^\top$ or $\mathbf{W}= \mathbf{U}\mathbf{V}^\top$, where if $\mathbf{W}\in \Re^{k_1\times k_2}$ then $\mathbf{U}\in \Re^{k_1\times q}$ and $\mathbf{V}\in \Re^{k_2\times q}$, i.e. we have a low-rank matrix factorization for the weights.
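A quick sketch of the parameter saving from the $\mathbf{W} = \mathbf{U}\mathbf{V}^\top$ factorization. The widths and rank here are chosen only for illustration:

```python
import numpy as np

k1, k2, q = 100, 80, 5        # layer widths and bottleneck rank (illustrative)

rng = np.random.default_rng(0)
U = rng.normal(size=(k1, q))
V = rng.normal(size=(k2, q))
W = U @ V.T                    # W = U V^T has rank at most q

full_params = k1 * k2          # 8000 parameters for an unconstrained W
lowrank_params = q * (k1 + k2) # 900 parameters for the factorized form
```

The factorization constrains $\mathbf{W}$ to rank $q$, trading expressiveness for far fewer parameters when $q \ll \min(k_1, k_2)$.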

Mathematically

The network can now be written mathematically as \begin{align} \mathbf{ z}_{1} &= \mathbf{V}^\top_1 \mathbf{ x}\\ \mathbf{ h}_{1} &= \phi\left(\mathbf{U}_1 \mathbf{ z}_{1}\right)\\ \mathbf{ z}_{2} &= \mathbf{V}^\top_2 \mathbf{ h}_{1}\\ \mathbf{ h}_{2} &= \phi\left(\mathbf{U}_2 \mathbf{ z}_{2}\right)\\ \mathbf{ z}_{3} &= \mathbf{V}^\top_3 \mathbf{ h}_{2}\\ \mathbf{ h}_{3} &= \phi\left(\mathbf{U}_3 \mathbf{ z}_{3}\right)\\ \mathbf{ y}&= \mathbf{ w}_4^\top\mathbf{ h}_{3}. \end{align}

\begin{align} \mathbf{ z}_{1} &= \mathbf{V}^\top_1 \mathbf{ x}\\ \mathbf{ z}_{2} &= \mathbf{V}^\top_2 \phi\left(\mathbf{U}_1 \mathbf{ z}_{1}\right)\\ \mathbf{ z}_{3} &= \mathbf{V}^\top_3 \phi\left(\mathbf{U}_2 \mathbf{ z}_{2}\right)\\ \mathbf{ y}&= \mathbf{ w}_4 ^\top \mathbf{ z}_{3} \end{align}

• Replace each stage of the neural network with a Gaussian process \begin{align} \mathbf{ z}_{1} &= \mathbf{ f}_1\left(\mathbf{ x}\right)\\ \mathbf{ z}_{2} &= \mathbf{ f}_2\left(\mathbf{ z}_{1}\right)\\ \mathbf{ z}_{3} &= \mathbf{ f}_3\left(\mathbf{ z}_{2}\right)\\ \mathbf{ y}&= \mathbf{ f}_4\left(\mathbf{ z}_{3}\right) \end{align}

• Equivalent to placing a prior over the parameters and taking the width of each layer to infinity.
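A sample from a deep GP prior can be drawn by composing ordinary GP samples, feeding each layer's output in as the next layer's input. A minimal sketch with three layers, one-dimensional values at each layer, and an RBF covariance (all choices illustrative):

```python
import numpy as np

def rbf_kernel(x, lengthscale=1.0):
    # x: (n, 1) inputs; squared-exponential covariance K(x, x)
    d2 = (x - x.T) ** 2
    return np.exp(-0.5 * d2 / lengthscale**2)

def gp_sample(x, rng, jitter=1e-6):
    # draw one function sample f(x) ~ N(0, K(x, x)) at the inputs x
    K = rbf_kernel(x) + jitter * np.eye(len(x))
    return rng.multivariate_normal(np.zeros(len(x)), K)[:, None]

rng = np.random.default_rng(1)
x = np.linspace(-2, 2, 50)[:, None]
z1 = gp_sample(x, rng)    # z_1 = f_1(x)
z2 = gp_sample(z1, rng)   # z_2 = f_2(z_1): one GP's output is the next GP's input
y = gp_sample(z2, rng)    # y   = f_3(z_2)
```

Even though each layer is a smooth Gaussian sample, the composition can produce sharper, more non-Gaussian behaviour than a single GP.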

Stochastic Process Composition

$\mathbf{ y}= \mathbf{ f}_4\left(\mathbf{ f}_3\left(\mathbf{ f}_2\left(\mathbf{ f}_1\left(\mathbf{ x}\right)\right)\right)\right)$

Mathematically

• Composite multivariate function

$\mathbf{g}(\mathbf{ x})=\mathbf{ f}_5(\mathbf{ f}_4(\mathbf{ f}_3(\mathbf{ f}_2(\mathbf{ f}_1(\mathbf{ x}))))).$

Equivalent to Markov Chain

• Composite multivariate function $p(\mathbf{ y}|\mathbf{ x})= p(\mathbf{ y}|\mathbf{ f}_5)p(\mathbf{ f}_5|\mathbf{ f}_4)p(\mathbf{ f}_4|\mathbf{ f}_3)p(\mathbf{ f}_3|\mathbf{ f}_2)p(\mathbf{ f}_2|\mathbf{ f}_1)p(\mathbf{ f}_1|\mathbf{ x})$
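Strictly, the intermediate function values are unobserved, so prediction requires marginalizing them:

$$p(\mathbf{ y}|\mathbf{ x}) = \int p(\mathbf{ y}|\mathbf{ f}_5)\,p(\mathbf{ f}_5|\mathbf{ f}_4)\,p(\mathbf{ f}_4|\mathbf{ f}_3)\,p(\mathbf{ f}_3|\mathbf{ f}_2)\,p(\mathbf{ f}_2|\mathbf{ f}_1)\,p(\mathbf{ f}_1|\mathbf{ x})\,\text{d}\mathbf{ f}_1\cdots\text{d}\mathbf{ f}_5.$$

This integral is intractable in general, which motivates variational treatments such as Damianou (2015).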

Why Composition?

• Gaussian processes give priors over functions.

• Elegant properties:

• e.g. Derivatives of process are also Gaussian distributed (if they exist).
• For particular covariance functions they are ‘universal approximators,’ i.e. all functions can have support under the prior.

• Gaussian derivatives might ring alarm bells.

• E.g. a priori they don’t believe in function ‘jumps.’

Stochastic Process Composition

• From a process perspective: process composition.

• A (new?) way of constructing more complex processes based on simpler components.

Damianou (2015)

Modern Review

• A Unifying Framework for Gaussian Process Pseudo-Point Approximations using Power Expectation Propagation Bui et al. (2017)

• Deep Gaussian Processes and Variational Propagation of Uncertainty Damianou (2015)

GPy: A Gaussian Process Framework in Python

https://github.com/SheffieldML/GPy

GPy: A Gaussian Process Framework in Python

• Wide availability of libraries, ‘modern’ scripting language.
• Allows us to set undergraduate Computer Science projects that use GPs.
• Available through GitHub https://github.com/SheffieldML/GPy
• Reproducible Research with Jupyter Notebook.

Features

• Probabilistic-style programming (specify the model, not the algorithm).
• Non-Gaussian likelihoods.
• Multivariate outputs.
• Dimensionality reduction.
• Approximations for large data sets.
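As an illustration of the kind of model GPy fits, here is GP regression written out in plain NumPy on toy data (in GPy itself this is roughly `GPy.models.GPRegression(X, y[:, None], GPy.kern.RBF(1))` followed by `m.optimize()`):

```python
import numpy as np

def rbf(X1, X2, variance=1.0, lengthscale=1.0):
    # squared-exponential covariance between two sets of 1-D inputs
    d2 = (X1[:, None, 0] - X2[None, :, 0]) ** 2
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

# toy data: noisy observations of a smooth function
rng = np.random.default_rng(0)
X = np.linspace(0, 5, 30)[:, None]
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=30)

noise = 0.1 ** 2
K = rbf(X, X) + noise * np.eye(30)

# posterior mean at test points: K_*^T (K + sigma^2 I)^{-1} y
Xtest = np.linspace(0, 5, 100)[:, None]
Kstar = rbf(X, Xtest)
mean = Kstar.T @ np.linalg.solve(K, y)
```

In practice the kernel hyperparameters (variance, lengthscale, noise) are fixed here, whereas GPy would optimize them against the marginal likelihood.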

Olympic Marathon Data

Gold medal times for the Olympic Marathon since 1896. Marathons before 1924 didn't have a standardized distance. We present results using pace per km. In 1904 the Marathon was badly organized, leading to very slow times. Image from Wikimedia Commons http://bit.ly/16kMKHQ

Probability Winning Olympics?

• He was a formidable Marathon runner.
• In 1946 he ran a time of 2 hours 46 minutes.
• That’s a pace of 3.95 min/km.
• What is the probability he would have won an Olympics if one had been held in 1946?
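The quoted pace is easy to check, taking the marathon distance as approximately 42 km:

```python
# Check the quoted pace (marathon distance taken as ~42 km).
time_minutes = 2 * 60 + 46       # 2 hours 46 minutes = 166 minutes
distance_km = 42.0               # official distance is 42.195 km
pace = time_minutes / distance_km
print(round(pace, 2))            # prints 3.95
```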

Deep GP Fit

• Can a Deep Gaussian process help?

• Deep GP is one GP feeding into another.

Deep NNs as Point Estimates for Deep GPs

Dutordoir et al. (2021)

References

Bui, T.D., Yan, J., Turner, R.E., 2017. A unifying framework for Gaussian process pseudo-point approximations using power expectation propagation. Journal of Machine Learning Research 18, 1–72.
Damianou, A., 2015. Deep Gaussian processes and variational propagation of uncertainty (PhD thesis). University of Sheffield.
Dunlop, M.M., Girolami, M.A., Stuart, A.M., Teckentrup, A.L., n.d. How deep are deep Gaussian processes? Journal of Machine Learning Research 19, 1–46.
Dutordoir, V., Hensman, J., Wilk, M. van der, Ek, C.H., Ghahramani, Z., Durrande, N., 2021. Deep neural networks as point estimates for deep Gaussian processes, in: Advances in Neural Information Processing Systems.
MacKay, D.J.C., n.d. Introduction to Gaussian processes. pp. 133–166.
Rätsch, G., Onoda, T., Müller, K.-R., 2001. Soft margins for AdaBoost. Machine Learning 42, 287–320.
Taigman, Y., Yang, M., Ranzato, M., Wolf, L., 2014. DeepFace: Closing the gap to human-level performance in face verification, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. https://doi.org/10.1109/CVPR.2014.220