Deep Gaussian Processes: A Motivation and Introduction

Neil D. Lawrence

Heilbronn Data Science Seminar, Jean Golding Institute


The Fourth Industrial Revolution

  • The first to be named before it happened.

What is Machine Learning?

What is Machine Learning?

\[ \text{data} + \text{model} \stackrel{\text{compute}}{\rightarrow} \text{prediction}\]

  • data : observations, could be actively or passively acquired (meta-data).
  • model : assumptions, based on previous experience (other data! transfer learning etc), or beliefs about the regularities of the universe. Inductive bias.
  • prediction : an action to be taken or a categorization or a quality score.

What is Machine Learning?

\[\text{data} + \text{model} \stackrel{\text{compute}}{\rightarrow} \text{prediction}\]

  • To combine data with a model need:
  • a prediction function \(f(\cdot)\) includes our beliefs about the regularities of the universe
  • an objective function \(E(\cdot)\) defines the cost of misprediction.

What does Machine Learning do?

  • Automation scales by codifying processes and automating them.
  • Need:
    • Interconnected components
    • Compatible components
  • Early examples:
    • cf Colt 45, Ford Model T

Codify Through Mathematical Functions

  • How does machine learning work?
  • Jumper (jersey/sweater) purchase with logistic regression

\[ \text{odds} = \frac{p(\text{bought})}{p(\text{not bought})} \]

\[ \log \text{odds} = \beta_0 + \beta_1 \text{age} + \beta_2 \text{latitude}.\]

Codify Through Mathematical Functions

  • How does machine learning work?
  • Jumper (jersey/sweater) purchase with logistic regression

\[ p(\text{bought}) = \sigma\left(\beta_0 + \beta_1 \text{age} + \beta_2 \text{latitude}\right).\]

Codify Through Mathematical Functions

  • How does machine learning work?
  • Jumper (jersey/sweater) purchase with logistic regression

\[ p(\text{bought}) = \sigma\left(\boldsymbol{\beta}^\top \mathbf{ x}\right).\]

Codify Through Mathematical Functions

  • How does machine learning work?
  • Jumper (jersey/sweater) purchase with logistic regression

\[ y= f\left(\mathbf{ x}, \boldsymbol{\beta}\right).\]

We call \(f(\cdot)\) the prediction function.

Fit to Data

  • Use an objective function

\[E(\boldsymbol{\beta}, \mathbf{Y}, \mathbf{X})\]

  • E.g. least squares \[E(\boldsymbol{\beta}, \mathbf{Y}, \mathbf{X}) = \sum_{i=1}^n\left(y_i - f(\mathbf{ x}_i, \boldsymbol{\beta})\right)^2.\]

Two Components

  • Prediction function, \(f(\cdot)\)
  • Objective function, \(E(\cdot)\)

Deep Learning

Deep Learning

  • These are interpretable models: vital for disease modeling etc.

  • Modern machine learning methods are less interpretable

  • Example: face recognition

Outline of the DeepFace architecture. A front-end of a single convolution-pooling-convolution filtering on the rectified input, followed by three locally-connected layers and two fully-connected layers. Color illustrates feature maps produced at each layer. The net includes more than 120 million parameters, where more than 95% come from the local and fully connected.

Source: DeepFace (Taigman et al., 2014)

Deep Neural Network

Deep Neural Network


\[ \begin{align*} \mathbf{ h}_{1} &= \phi\left(\mathbf{W}_1 \mathbf{ x}\right)\\ \mathbf{ h}_{2} &= \phi\left(\mathbf{W}_2\mathbf{ h}_{1}\right)\\ \mathbf{ h}_{3} &= \phi\left(\mathbf{W}_3 \mathbf{ h}_{2}\right)\\ f&= \mathbf{ w}_4 ^\top\mathbf{ h}_{3} \end{align*} \]


Sedolian Void

Uber ATG


  • Uncertainty in prediction arises from:
  • scarcity of training data and
  • mismatch between the set of prediction functions we choose and all possible prediction functions.
  • Also uncertainties in objective, leave those for another day.

Structure of Priors

MacKay: NeurIPS Tutorial 1997 “Have we thrown out the baby with the bathwater?” (Published as MacKay, n.d.)

Deep Neural Network


  • Potential problem: if number of nodes in two adjacent layers is big, corresponding \(\mathbf{W}\) is also very big and there is the potential to overfit.

  • Proposed solution: “dropout.”

  • Alternative solution: parameterize \(\mathbf{W}\) with its SVD. \[ \mathbf{W}= \mathbf{U}\boldsymbol{ \Lambda}\mathbf{V}^\top \] or \[ \mathbf{W}= \mathbf{U}\mathbf{V}^\top \] where if \(\mathbf{W}\in \Re^{k_1\times k_2}\) then \(\mathbf{U}\in \Re^{k_1\times q}\) and \(\mathbf{V}\in \Re^{k_2\times q}\), i.e. we have a low rank matrix factorization for the weights.

Low Rank Approximation

Bottleneck Layers in Deep Neural Networks

Deep Neural Network


The network can now be written mathematically as \[ \begin{align} \mathbf{ z}_{1} &= \mathbf{V}^\top_1 \mathbf{ x}\\ \mathbf{ h}_{1} &= \phi\left(\mathbf{U}_1 \mathbf{ z}_{1}\right)\\ \mathbf{ z}_{2} &= \mathbf{V}^\top_2 \mathbf{ h}_{1}\\ \mathbf{ h}_{2} &= \phi\left(\mathbf{U}_2 \mathbf{ z}_{2}\right)\\ \mathbf{ z}_{3} &= \mathbf{V}^\top_3 \mathbf{ h}_{2}\\ \mathbf{ h}_{3} &= \phi\left(\mathbf{U}_3 \mathbf{ z}_{3}\right)\\ \mathbf{ y}&= \mathbf{ w}_4^\top\mathbf{ h}_{3}. \end{align} \]

A Cascade of Neural Networks

\[ \begin{align} \mathbf{ z}_{1} &= \mathbf{V}^\top_1 \mathbf{ x}\\ \mathbf{ z}_{2} &= \mathbf{V}^\top_2 \phi\left(\mathbf{U}_1 \mathbf{ z}_{1}\right)\\ \mathbf{ z}_{3} &= \mathbf{V}^\top_3 \phi\left(\mathbf{U}_2 \mathbf{ z}_{2}\right)\\ \mathbf{ y}&= \mathbf{ w}_4 ^\top \mathbf{ z}_{3} \end{align} \]

Cascade of Gaussian Processes

  • Replace each neural network with a Gaussian process \[ \begin{align} \mathbf{ z}_{1} &= \mathbf{ f}_1\left(\mathbf{ x}\right)\\ \mathbf{ z}_{2} &= \mathbf{ f}_2\left(\mathbf{ z}_{1}\right)\\ \mathbf{ z}_{3} &= \mathbf{ f}_3\left(\mathbf{ z}_{2}\right)\\ \mathbf{ y}&= \mathbf{ f}_4\left(\mathbf{ z}_{3}\right) \end{align} \]

  • Equivalent to prior over parameters, take width of each layer to infinity.

Stochastic Process Composition

\[\mathbf{ y}= \mathbf{ f}_4\left(\mathbf{ f}_3\left(\mathbf{ f}_2\left(\mathbf{ f}_1\left(\mathbf{ x}\right)\right)\right)\right)\]


  • Composite multivariate function

    \[ \mathbf{g}(\mathbf{ x})=\mathbf{ f}_5(\mathbf{ f}_4(\mathbf{ f}_3(\mathbf{ f}_2(\mathbf{ f}_1(\mathbf{ x}))))). \]

Equivalent to Markov Chain

  • Composite multivariate function \[ p(\mathbf{ y}|\mathbf{ x})= p(\mathbf{ y}|\mathbf{ f}_5)p(\mathbf{ f}_5|\mathbf{ f}_4)p(\mathbf{ f}_4|\mathbf{ f}_3)p(\mathbf{ f}_3|\mathbf{ f}_2)p(\mathbf{ f}_2|\mathbf{ f}_1)p(\mathbf{ f}_1|\mathbf{ x}) \]

Why Composition?

  • Gaussian processes give priors over functions.

  • Elegant properties:

    • e.g. Derivatives of process are also Gaussian distributed (if they exist).
  • For particular covariance functions they are ‘universal approximators,’ i.e. all functions can have support under the prior.

  • Gaussian derivatives might ring alarm bells.

  • E.g. a priori they don’t believe in function ‘jumps.’

Stochastic Process Composition

  • From a process perspective: process composition.

  • A (new?) way of constructing more complex processes based on simpler components.

Deep Gaussian Processes

Andreas Damianou

Damianou (2015)


Modern Review

  • A Unifying Framework for Gaussian Process Pseudo-Point Approximations using Power Expectation Propagation Bui et al. (2017)

  • Deep Gaussian Processes and Variational Propagation of Uncertainty Damianou (2015)

GPy: A Gaussian Process Framework in Python

GPy: A Gaussian Process Framework in Python

  • BSD Licensed software base.
  • Wide availability of libraries, ‘modern’ scripting language.
  • Allows us to set projects to undergraduates in Comp Sci that use GPs.
  • Available through GitHub
  • Reproducible Research with Jupyter Notebook.


  • Probabilistic-style programming (specify the model, not the algorithm).
  • Non-Gaussian likelihoods.
  • Multivariate outputs.
  • Dimensionality reduction.
  • Approximations for large data sets.

Olympic Marathon Data

  • Gold medal times for Olympic Marathon since 1896.
  • Marathons before 1924 didn’t have a standardized distance.
  • Present results using pace per km.
  • In 1904 Marathon was badly organized leading to very slow times.
Image from Wikimedia Commons

Olympic Marathon Data

Alan Turing

Probability Winning Olympics?

  • He was a formidable Marathon runner.
  • In 1946 he ran a time 2 hours 46 minutes.
    • That’s a pace of 3.95 min/km.
  • What is the probability he would have won an Olympics if one had been held in 1946?

Gaussian Process Fit

Olympic Marathon Data GP

Deep GP Fit

  • Can a Deep Gaussian process help?

  • Deep GP is one GP feeding into another.

Olympic Marathon Data Deep GP

Olympic Marathon Data Deep GP

Olympic Marathon Data Latent 1

Olympic Marathon Data Latent 2

Olympic Marathon Pinball Plot

Step Function Data

Step Function Data GP

Step Function Data Deep GP

Step Function Data Deep GP

Step Function Data Latent 1

Step Function Data Latent 2

Step Function Data Latent 3

Step Function Data Latent 4

Step Function Pinball Plot

Motorcycle Helmet Data

Motorcycle Helmet Data GP

Motorcycle Helmet Data Deep GP

Motorcycle Helmet Data Deep GP

Motorcycle Helmet Data Latent 1

Motorcycle Helmet Data Latent 2

Motorcycle Helmet Pinball Plot

Subsample of the MNIST Data

Fitting a Deep GP to a the MNIST Digits Subsample

Zhenwen Dai Andreas Damianou

Deep NNs as Point Estimates for Deep GPs

Vincent Dutordoir
Dutordoir et al. (2021)

ReLU as a Spherical Basis

Soft Plus as a Stationary Spherical

Spherical Harmonics

Predictions on Banana Data

Predictions on Banana Data

Predictions on Banana Data

Deep Health



Bui, T.D., Yan, J., Turner, R.E., 2017. A unifying framework for Gaussian process pseudo-point approximations using power expectation propagation. Journal of Machine Learning Research 18, 1–72.
Damianou, A., 2015. Deep Gaussian processes and variational propagation of uncertainty (PhD thesis). University of Sheffield.
Dunlop, M.M., Girolami, M.A., Stuart, A.M., Teckentrup, A.L., n.d. How deep are deep Gaussian processes? Journal of Machine Learning Research 19, 1–46.
Dutordoir, V., Hensman, J., Wilk, M. van der, Ek, C.H., Ghahramani, Z., Durrande, N., 2021. Deep neural networks as point estimates for deep Gaussian processes, in: Advances in Neural Information Processing Systems.
MacKay, D.J.C., n.d. Introduction to Gaussian processes. pp. 133–166.
Rätsch, G., Onoda, T., Müller, K.-R., 2001. Soft margins for AdaBoost. Machine Learning 42, 287–320.
Taigman, Y., Yang, M., Ranzato, M., Wolf, L., 2014. DeepFace: Closing the gap to human-level performance in face verification, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition.