\[\newcommand{\Amatrix}{\mathbf{A}} \newcommand{\KL}[2]{\text{KL}\left( #1\,\|\,#2 \right)} \newcommand{\Kaast}{\kernelMatrix_{\mathbf{ \ast}\mathbf{ \ast}}} \newcommand{\Kastu}{\kernelMatrix_{\mathbf{ \ast} \inducingVector}} \newcommand{\Kff}{\kernelMatrix_{\mappingFunctionVector \mappingFunctionVector}} \newcommand{\Kfu}{\kernelMatrix_{\mappingFunctionVector \inducingVector}} \newcommand{\Kuast}{\kernelMatrix_{\inducingVector \bf\ast}} \newcommand{\Kuf}{\kernelMatrix_{\inducingVector \mappingFunctionVector}} \newcommand{\Kuu}{\kernelMatrix_{\inducingVector \inducingVector}} \newcommand{\Kuui}{\Kuu^{-1}} \newcommand{\Qaast}{\mathbf{Q}_{\bf \ast \ast}} \newcommand{\Qastf}{\mathbf{Q}_{\ast \mappingFunction}} \newcommand{\Qfast}{\mathbf{Q}_{\mappingFunctionVector \bf \ast}} \newcommand{\Qff}{\mathbf{Q}_{\mappingFunctionVector \mappingFunctionVector}} \newcommand{\aMatrix}{\mathbf{A}} \newcommand{\aScalar}{a} \newcommand{\aVector}{\mathbf{a}} \newcommand{\acceleration}{a} \newcommand{\bMatrix}{\mathbf{B}} \newcommand{\bScalar}{b} \newcommand{\bVector}{\mathbf{b}} \newcommand{\basisFunc}{\phi} \newcommand{\basisFuncVector}{\boldsymbol{ \basisFunc}} \newcommand{\basisFunction}{\phi} \newcommand{\basisLocation}{\mu} \newcommand{\basisMatrix}{\boldsymbol{ \Phi}} \newcommand{\basisScalar}{\basisFunction} \newcommand{\basisVector}{\boldsymbol{ \basisFunction}} \newcommand{\activationFunction}{\phi} \newcommand{\activationMatrix}{\boldsymbol{ \Phi}} \newcommand{\activationScalar}{\basisFunction} \newcommand{\activationVector}{\boldsymbol{ \basisFunction}} \newcommand{\bigO}{\mathcal{O}} \newcommand{\binomProb}{\pi} \newcommand{\cMatrix}{\mathbf{C}} \newcommand{\cbasisMatrix}{\hat{\boldsymbol{ \Phi}}} \newcommand{\cdataMatrix}{\hat{\dataMatrix}} \newcommand{\cdataScalar}{\hat{\dataScalar}} \newcommand{\cdataVector}{\hat{\dataVector}} \newcommand{\centeredKernelMatrix}{\mathbf{ \MakeUppercase{\centeredKernelScalar}}} \newcommand{\centeredKernelScalar}{b} \newcommand{\centeredKernelVector}{\centeredKernelScalar} \newcommand{\centeringMatrix}{\mathbf{H}} \newcommand{\chiSquaredDist}[2]{\chi_{#1}^{2}\left(#2\right)} \newcommand{\chiSquaredSamp}[1]{\chi_{#1}^{2}} \newcommand{\conditionalCovariance}{\boldsymbol{ \Sigma}} \newcommand{\coregionalizationMatrix}{\mathbf{B}} \newcommand{\coregionalizationScalar}{b} \newcommand{\coregionalizationVector}{\mathbf{ \coregionalizationScalar}} \newcommand{\covDist}[2]{\text{cov}_{#2}\left(#1\right)} \newcommand{\covSamp}[1]{\text{cov}\left(#1\right)} \newcommand{\covarianceScalar}{c} \newcommand{\covarianceVector}{\mathbf{ \covarianceScalar}} \newcommand{\covarianceMatrix}{\mathbf{C}} \newcommand{\covarianceMatrixTwo}{\boldsymbol{ \Sigma}} \newcommand{\croupierScalar}{s} \newcommand{\croupierVector}{\mathbf{ \croupierScalar}} \newcommand{\croupierMatrix}{\mathbf{ \MakeUppercase{\croupierScalar}}} \newcommand{\dataDim}{p} \newcommand{\dataIndex}{i} \newcommand{\dataIndexTwo}{j} \newcommand{\dataMatrix}{\mathbf{Y}} \newcommand{\dataScalar}{y} \newcommand{\dataSet}{\mathcal{D}} \newcommand{\dataStd}{\sigma} \newcommand{\dataVector}{\mathbf{ \dataScalar}} \newcommand{\decayRate}{d} \newcommand{\degreeMatrix}{\mathbf{ \MakeUppercase{\degreeScalar}}} \newcommand{\degreeScalar}{d} \newcommand{\degreeVector}{\mathbf{ \degreeScalar}} % Already defined by latex %\newcommand{\det}[1]{\left|#1\right|} \newcommand{\diag}[1]{\text{diag}\left(#1\right)} \newcommand{\diagonalMatrix}{\mathbf{D}} \newcommand{\diff}[2]{\frac{\text{d}#1}{\text{d}#2}} 
\newcommand{\diffTwo}[2]{\frac{\text{d}^2#1}{\text{d}#2^2}} \newcommand{\displacement}{x} \newcommand{\displacementVector}{\textbf{\displacement}} \newcommand{\distanceMatrix}{\mathbf{ \MakeUppercase{\distanceScalar}}} \newcommand{\distanceScalar}{d} \newcommand{\distanceVector}{\mathbf{ \distanceScalar}} \newcommand{\eigenvaltwo}{\ell} \newcommand{\eigenvaltwoMatrix}{\mathbf{L}} \newcommand{\eigenvaltwoVector}{\mathbf{l}} \newcommand{\eigenvalue}{\lambda} \newcommand{\eigenvalueMatrix}{\boldsymbol{ \Lambda}} \newcommand{\eigenvalueVector}{\boldsymbol{ \lambda}} \newcommand{\eigenvector}{\mathbf{ \eigenvectorScalar}} \newcommand{\eigenvectorMatrix}{\mathbf{U}} \newcommand{\eigenvectorScalar}{u} \newcommand{\eigenvectwo}{\mathbf{v}} \newcommand{\eigenvectwoMatrix}{\mathbf{V}} \newcommand{\eigenvectwoScalar}{v} \newcommand{\entropy}[1]{\mathcal{H}\left(#1\right)} \newcommand{\errorFunction}{E} \newcommand{\expDist}[2]{\left<#1\right>_{#2}} \newcommand{\expSamp}[1]{\left<#1\right>} \newcommand{\expectation}[1]{\left\langle #1 \right\rangle } \newcommand{\expectationDist}[2]{\left\langle #1 \right\rangle _{#2}} \newcommand{\expectedDistanceMatrix}{\mathcal{D}} \newcommand{\eye}{\mathbf{I}} \newcommand{\fantasyDim}{r} \newcommand{\fantasyMatrix}{\mathbf{ \MakeUppercase{\fantasyScalar}}} \newcommand{\fantasyScalar}{z} \newcommand{\fantasyVector}{\mathbf{ \fantasyScalar}} \newcommand{\featureStd}{\varsigma} \newcommand{\gammaCdf}[3]{\mathcal{GAMMA CDF}\left(#1|#2,#3\right)} \newcommand{\gammaDist}[3]{\mathcal{G}\left(#1|#2,#3\right)} \newcommand{\gammaSamp}[2]{\mathcal{G}\left(#1,#2\right)} \newcommand{\gaussianDist}[3]{\mathcal{N}\left(#1|#2,#3\right)} \newcommand{\gaussianSamp}[2]{\mathcal{N}\left(#1,#2\right)} \newcommand{\given}{|} \newcommand{\half}{\frac{1}{2}} \newcommand{\heaviside}{H} \newcommand{\hiddenMatrix}{\mathbf{ \MakeUppercase{\hiddenScalar}}} \newcommand{\hiddenScalar}{h} \newcommand{\hiddenVector}{\mathbf{ \hiddenScalar}} \newcommand{\identityMatrix}{\eye} \newcommand{\inducingInputScalar}{z} \newcommand{\inducingInputVector}{\mathbf{ \inducingInputScalar}} \newcommand{\inducingInputMatrix}{\mathbf{Z}} \newcommand{\inducingScalar}{u} \newcommand{\inducingVector}{\mathbf{ \inducingScalar}} \newcommand{\inducingMatrix}{\mathbf{U}} \newcommand{\inlineDiff}[2]{\text{d}#1/\text{d}#2} \newcommand{\inputDim}{q} \newcommand{\inputMatrix}{\mathbf{X}} \newcommand{\inputScalar}{x} \newcommand{\inputSpace}{\mathcal{X}} \newcommand{\inputVals}{\inputVector} \newcommand{\inputVector}{\mathbf{ \inputScalar}} \newcommand{\iterNum}{k} \newcommand{\kernel}{\kernelScalar} \newcommand{\kernelMatrix}{\mathbf{K}} \newcommand{\kernelScalar}{k} \newcommand{\kernelVector}{\mathbf{ \kernelScalar}} \newcommand{\kff}{\kernelScalar_{\mappingFunction \mappingFunction}} \newcommand{\kfu}{\kernelVector_{\mappingFunction \inducingScalar}} \newcommand{\kuf}{\kernelVector_{\inducingScalar \mappingFunction}} \newcommand{\kuu}{\kernelVector_{\inducingScalar \inducingScalar}} \newcommand{\lagrangeMultiplier}{\lambda} \newcommand{\lagrangeMultiplierMatrix}{\boldsymbol{ \Lambda}} \newcommand{\lagrangian}{L} \newcommand{\laplacianFactor}{\mathbf{ \MakeUppercase{\laplacianFactorScalar}}} \newcommand{\laplacianFactorScalar}{m} \newcommand{\laplacianFactorVector}{\mathbf{ \laplacianFactorScalar}} \newcommand{\laplacianMatrix}{\mathbf{L}} \newcommand{\laplacianScalar}{\ell} \newcommand{\laplacianVector}{\mathbf{ \ell}} \newcommand{\latentDim}{q} \newcommand{\latentDistanceMatrix}{\boldsymbol{ \Delta}} 
\newcommand{\latentDistanceScalar}{\delta} \newcommand{\latentDistanceVector}{\boldsymbol{ \delta}} \newcommand{\latentForce}{f} \newcommand{\latentFunction}{u} \newcommand{\latentFunctionVector}{\mathbf{ \latentFunction}} \newcommand{\latentFunctionMatrix}{\mathbf{ \MakeUppercase{\latentFunction}}} \newcommand{\latentIndex}{j} \newcommand{\latentScalar}{z} \newcommand{\latentVector}{\mathbf{ \latentScalar}} \newcommand{\latentMatrix}{\mathbf{Z}} \newcommand{\learnRate}{\eta} \newcommand{\lengthScale}{\ell} \newcommand{\rbfWidth}{\ell} \newcommand{\likelihoodBound}{\mathcal{L}} \newcommand{\likelihoodFunction}{L} \newcommand{\locationScalar}{\mu} \newcommand{\locationVector}{\boldsymbol{ \locationScalar}} \newcommand{\locationMatrix}{\mathbf{M}} \newcommand{\variance}[1]{\text{var}\left( #1 \right)} \newcommand{\mappingFunction}{f} \newcommand{\mappingFunctionMatrix}{\mathbf{F}} \newcommand{\mappingFunctionTwo}{g} \newcommand{\mappingFunctionTwoMatrix}{\mathbf{G}} \newcommand{\mappingFunctionTwoVector}{\mathbf{ \mappingFunctionTwo}} \newcommand{\mappingFunctionVector}{\mathbf{ \mappingFunction}} \newcommand{\scaleScalar}{s} \newcommand{\mappingScalar}{w} \newcommand{\mappingVector}{\mathbf{ \mappingScalar}} \newcommand{\mappingMatrix}{\mathbf{W}} \newcommand{\mappingScalarTwo}{v} \newcommand{\mappingVectorTwo}{\mathbf{ \mappingScalarTwo}} \newcommand{\mappingMatrixTwo}{\mathbf{V}} \newcommand{\maxIters}{K} \newcommand{\meanMatrix}{\mathbf{M}} \newcommand{\meanScalar}{\mu} \newcommand{\meanTwoMatrix}{\mathbf{M}} \newcommand{\meanTwoScalar}{m} \newcommand{\meanTwoVector}{\mathbf{ \meanTwoScalar}} \newcommand{\meanVector}{\boldsymbol{ \meanScalar}} \newcommand{\mrnaConcentration}{m} \newcommand{\naturalFrequency}{\omega} \newcommand{\neighborhood}[1]{\mathcal{N}\left( #1 \right)} \newcommand{\neilurl}{http://inverseprobability.com/} \newcommand{\noiseMatrix}{\boldsymbol{ E}} \newcommand{\noiseScalar}{\epsilon} \newcommand{\noiseVector}{\boldsymbol{ \epsilon}} \newcommand{\norm}[1]{\left\Vert #1 \right\Vert} \newcommand{\normalizedLaplacianMatrix}{\hat{\mathbf{L}}} \newcommand{\normalizedLaplacianScalar}{\hat{\ell}} \newcommand{\normalizedLaplacianVector}{\hat{\mathbf{ \ell}}} \newcommand{\numActive}{m} \newcommand{\numBasisFunc}{m} \newcommand{\numComponents}{m} \newcommand{\numComps}{K} \newcommand{\numData}{n} \newcommand{\numFeatures}{K} \newcommand{\numHidden}{h} \newcommand{\numInducing}{m} \newcommand{\numLayers}{\ell} \newcommand{\numNeighbors}{K} \newcommand{\numSequences}{s} \newcommand{\numSuccess}{s} \newcommand{\numTasks}{m} \newcommand{\numTime}{T} \newcommand{\numTrials}{S} \newcommand{\outputIndex}{j} \newcommand{\paramVector}{\boldsymbol{ \theta}} \newcommand{\parameterMatrix}{\boldsymbol{ \Theta}} \newcommand{\parameterScalar}{\theta} \newcommand{\parameterVector}{\boldsymbol{ \parameterScalar}} \newcommand{\partDiff}[2]{\frac{\partial#1}{\partial#2}} \newcommand{\precisionScalar}{j} \newcommand{\precisionVector}{\mathbf{ \precisionScalar}} \newcommand{\precisionMatrix}{\mathbf{J}} \newcommand{\pseudotargetScalar}{\widetilde{y}} \newcommand{\pseudotargetVector}{\mathbf{ \pseudotargetScalar}} \newcommand{\pseudotargetMatrix}{\mathbf{ \widetilde{Y}}} \newcommand{\rank}[1]{\text{rank}\left(#1\right)} \newcommand{\rayleighDist}[2]{\mathcal{R}\left(#1|#2\right)} \newcommand{\rayleighSamp}[1]{\mathcal{R}\left(#1\right)} \newcommand{\responsibility}{r} \newcommand{\rotationScalar}{r} \newcommand{\rotationVector}{\mathbf{ \rotationScalar}} \newcommand{\rotationMatrix}{\mathbf{R}} 
\newcommand{\sampleCovScalar}{s} \newcommand{\sampleCovVector}{\mathbf{ \sampleCovScalar}} \newcommand{\sampleCovMatrix}{\mathbf{s}} \newcommand{\scalarProduct}[2]{\left\langle{#1},{#2}\right\rangle} \newcommand{\sign}[1]{\text{sign}\left(#1\right)} \newcommand{\sigmoid}[1]{\sigma\left(#1\right)} \newcommand{\singularvalue}{\ell} \newcommand{\singularvalueMatrix}{\mathbf{L}} \newcommand{\singularvalueVector}{\mathbf{l}} \newcommand{\sorth}{\mathbf{u}} \newcommand{\spar}{\lambda} \newcommand{\trace}[1]{\text{tr}\left(#1\right)} \newcommand{\BasalRate}{B} \newcommand{\DampingCoefficient}{C} \newcommand{\DecayRate}{D} \newcommand{\Displacement}{X} \newcommand{\LatentForce}{F} \newcommand{\Mass}{M} \newcommand{\Sensitivity}{S} \newcommand{\basalRate}{b} \newcommand{\dampingCoefficient}{c} \newcommand{\mass}{m} \newcommand{\sensitivity}{s} \newcommand{\springScalar}{\kappa} \newcommand{\springVector}{\boldsymbol{ \kappa}} \newcommand{\springMatrix}{\boldsymbol{ \mathcal{K}}} \newcommand{\tfConcentration}{p} \newcommand{\tfDecayRate}{\delta} \newcommand{\tfMrnaConcentration}{f} \newcommand{\tfVector}{\mathbf{ \tfConcentration}} \newcommand{\velocity}{v} \newcommand{\sufficientStatsScalar}{g} \newcommand{\sufficientStatsVector}{\mathbf{ \sufficientStatsScalar}} \newcommand{\sufficientStatsMatrix}{\mathbf{G}} \newcommand{\switchScalar}{s} \newcommand{\switchVector}{\mathbf{ \switchScalar}} \newcommand{\switchMatrix}{\mathbf{S}} \newcommand{\tr}[1]{\text{tr}\left(#1\right)} \newcommand{\loneNorm}[1]{\left\Vert #1 \right\Vert_1} \newcommand{\ltwoNorm}[1]{\left\Vert #1 \right\Vert_2} \newcommand{\onenorm}[1]{\left\vert#1\right\vert_1} \newcommand{\twonorm}[1]{\left\Vert #1 \right\Vert} \newcommand{\vScalar}{v} \newcommand{\vVector}{\mathbf{v}} \newcommand{\vMatrix}{\mathbf{V}} \newcommand{\varianceDist}[2]{\text{var}_{#2}\left( #1 \right)} % Already defined by latex %\newcommand{\vec}{#1:} \newcommand{\vecb}[1]{\left(#1\right):} \newcommand{\weightScalar}{w} \newcommand{\weightVector}{\mathbf{ \weightScalar}} \newcommand{\weightMatrix}{\mathbf{W}} \newcommand{\weightedAdjacencyMatrix}{\mathbf{A}} \newcommand{\weightedAdjacencyScalar}{a} \newcommand{\weightedAdjacencyVector}{\mathbf{ \weightedAdjacencyScalar}} \newcommand{\onesVector}{\mathbf{1}} \newcommand{\zerosVector}{\mathbf{0}} \]

Deep Gaussian Processes

Neil D. Lawrence

MLSS, Stellenbosch, South Africa


Approximations

Low Rank Motivation

  • Inference in a GP has the following demands:
    • Complexity: \(\bigO(\numData^3)\)
    • Storage: \(\bigO(\numData^2)\)
  • Inference in a low rank GP has the following demands:
    • Complexity: \(\bigO(\numData\numInducing^2)\)
    • Storage: \(\bigO(\numData\numInducing)\)

where \(\numInducing\) is a user-chosen parameter; a sketch of the resulting cost saving is given below.

Snelson and Ghahramani (n.d.), Quiñonero Candela and Rasmussen (2005), Lawrence (n.d.), Titsias (n.d.), Bui et al. (2017)
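To make the cost saving concrete, here is a minimal NumPy sketch (not code from the talk) of a Nyström/DTC-style predictive mean: only \(\numInducing \times \numInducing\) systems are ever solved, so the dominant costs are \(\bigO(\numData\numInducing^2)\) time and \(\bigO(\numData\numInducing)\) storage. The kernel, toy data and inducing input locations are all illustrative.

import numpy as np

def rbf(X, Z, lengthscale=1.0, variance=1.0):
    # exponentiated quadratic covariance between rows of X and Z
    sqdist = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * sqdist / lengthscale**2)

rng = np.random.default_rng(0)
n, m = 2000, 30                              # n data points, m inducing points
X = rng.uniform(-3, 3, (n, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)
Z = np.linspace(-3, 3, m)[:, None]           # inducing input locations
noise = 0.1 ** 2

Kuu = rbf(Z, Z) + 1e-8 * np.eye(m)           # m x m
Kuf = rbf(Z, X)                              # m x n, the largest object stored

# Every solve involves m x m matrices only; the n x n covariance never appears.
A = Kuu + Kuf @ Kuf.T / noise
Xstar = np.linspace(-3, 3, 5)[:, None]
f_star = rbf(Xstar, Z) @ np.linalg.solve(A, Kuf @ y) / noise
print(f_star)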

Variational Compression

  • Inducing variables are a compression of the real observations.
  • They are like pseudo-data. They can be in space of \(\mappingFunctionVector\) or a space that is related through a linear operator (Álvarez et al., 2010) — e.g. a gradient or convolution.

Variational Compression II

  • Introduce inducing variables.
  • Compress information into the inducing variables and avoid the need to store all the data.
  • Allow for scaling, e.g. stochastic variational inference Hensman et al. (n.d.) or parallelization Gal et al. (n.d.), Dai et al. (2014), Seeger et al. (2017).

Nonparametric Gaussian Processes

  • We’ve seen how we go from parametric to non-parametric.

  • The limit implies infinite dimensional \(\mappingVector\).

  • Gaussian processes are generally non-parametric: combine data with covariance function to get model.

  • This representation cannot be summarized by a parameter vector of a fixed size.

The Parametric Bottleneck

  • Parametric models have a representation that does not respond to increasing training set size.

  • Bayesian posterior distributions over parameters contain the information about the training data.

    • Use Bayes’ rule from training data, \(p\left(\mappingVector|\dataVector, \inputMatrix\right)\),

    • Make predictions on test data \[p\left(\dataScalar_*|\inputMatrix_*, \dataVector, \inputMatrix\right) = \int p\left(\dataScalar_*|\mappingVector,\inputMatrix_*\right)p\left(\mappingVector|\dataVector, \inputMatrix\right)\text{d}\mappingVector.\]

  • \(\mappingVector\) becomes a bottleneck for information about the training set to pass to the test set.

  • Solution: increase \(\numBasisFunc\) so that the bottleneck is so large that it no longer presents a problem.

  • How big is big enough for \(\numBasisFunc\)? Non-parametrics says \(\numBasisFunc \rightarrow \infty\).

The Parametric Bottleneck

  • Now no longer possible to manipulate the model through the standard parametric form.
  • However, it is possible to express parametric models as GPs: \[\kernelScalar\left(\inputVector_i,\inputVector_j\right)=\basisFunction_:\left(\inputVector_i\right)^\top\basisFunction_:\left(\inputVector_j\right).\]
  • These are known as degenerate covariance matrices.
  • Their rank is at most \(\numBasisFunc\); non-parametric models have full-rank covariance matrices.
  • Most well known is the “linear kernel”, \(\kernelScalar(\inputVector_i, \inputVector_j) = \inputVector_i^\top\inputVector_j\).

Making Predictions

  • For non-parametrics, prediction at new points \(\mappingFunctionVector_*\) is made by conditioning on \(\mappingFunctionVector\) in the joint distribution.
  • In GPs this involves combining the training data with the covariance function and the mean function.
  • Parametric models are a special case where the conditional prediction can be summarized in a fixed number of parameters.
  • Complexity of parametric model remains fixed regardless of the size of our training data set.
  • For a non-parametric model the required number of parameters grows with the size of the training data.

Augment Variable Space

  • Augment variable space with inducing observations, \(\inducingVector\) \[ \begin{bmatrix} \mappingFunctionVector\\ \inducingVector \end{bmatrix} \sim \gaussianSamp{\zerosVector}{\kernelMatrix} \] with \[ \kernelMatrix = \begin{bmatrix} \Kff & \Kfu \\ \Kuf & \Kuu \end{bmatrix} \]

Joint Density

\[ p(\mappingFunctionVector, \inducingVector) = p(\mappingFunctionVector| \inducingVector) p(\inducingVector) \] to augment our model \[ \dataScalar(\inputVector) = \mappingFunction(\inputVector) + \noiseScalar, \] giving \[ p(\dataVector) = \int p(\dataVector|\mappingFunctionVector) p(\mappingFunctionVector) \text{d}\mappingFunctionVector, \] where for the independent case we have \(p(\dataVector | \mappingFunctionVector) = \prod_{i=1}^\numData p(\dataScalar_i|\mappingFunction_i)\).

Auxiliary Variables

\[ p(\dataVector) = \int p(\dataVector|\mappingFunctionVector) p(\mappingFunctionVector|\inducingVector) p(\inducingVector) \text{d}\inducingVector \text{d}\mappingFunctionVector. \] Integrating over \(\mappingFunctionVector\) \[ p(\dataVector) = \int p(\dataVector|\inducingVector) p(\inducingVector) \text{d}\inducingVector. \]

Parametric Comparison

\[ \dataScalar(\inputVector) = \weightVector^\top\basisVector(\inputVector) + \noiseScalar \]

\[ p(\dataVector) = \int p(\dataVector|\weightVector) p(\weightVector) \text{d} \weightVector \]

\[ p(\dataVector^*|\dataVector) = \int p(\dataVector^*|\weightVector) p(\weightVector|\dataVector) \text{d} \weightVector \]

New Form

\[ p(\dataVector^*|\dataVector) = \int p(\dataVector^*|\inducingVector) p(\inducingVector|\dataVector) \text{d} \inducingVector \]

  • But \(\inducingVector\) is not a parameter.

  • Unfortunately, computing \(p(\dataVector|\inducingVector)\) is intractable.

Variational Bound on \(p(\dataVector |\inducingVector)\)

\[ \begin{aligned} \log p(\dataVector|\inducingVector) & = \log \int p(\dataVector|\mappingFunctionVector) p(\mappingFunctionVector|\inducingVector) \text{d}\mappingFunctionVector\\ & = \int q(\mappingFunctionVector) \log \frac{p(\dataVector|\mappingFunctionVector) p(\mappingFunctionVector|\inducingVector)}{q(\mappingFunctionVector)}\text{d}\mappingFunctionVector + \KL{q(\mappingFunctionVector)}{p(\mappingFunctionVector|\dataVector, \inducingVector)}. \end{aligned} \]

Choose form for \(q(\cdot)\)

  • Set \(q(\mappingFunctionVector)=p(\mappingFunctionVector|\inducingVector)\), \[ \log p(\dataVector|\inducingVector) \geq \int p(\mappingFunctionVector|\inducingVector) \log p(\dataVector|\mappingFunctionVector)\text{d}\mappingFunctionVector. \] \[ p(\dataVector|\inducingVector) \geq \exp \int p(\mappingFunctionVector|\inducingVector) \log p(\dataVector|\mappingFunctionVector)\text{d}\mappingFunctionVector. \]
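For a Gaussian likelihood with noise variance \(\dataStd^2\) this expectation is available in closed form, giving the familiar sparse-GP lower bound (written here in the notation above): \[ \log p(\dataVector|\inducingVector) \geq \log \gaussianDist{\dataVector}{\Kfu\Kuui\inducingVector}{\dataStd^{2}\eye} - \frac{1}{2\dataStd^{2}}\trace{\Kff - \Kfu\Kuui\Kuf}. \]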

Optimal Compression in Inducing Variables

  • Maximizing the lower bound minimizes the KL divergence (information gain): \[ \KL{p(\mappingFunctionVector|\inducingVector)}{p(\mappingFunctionVector|\dataVector, \inducingVector)} = \int p(\mappingFunctionVector|\inducingVector) \log \frac{p(\mappingFunctionVector|\inducingVector)}{p(\mappingFunctionVector|\dataVector, \inducingVector)}\text{d}\mappingFunctionVector \]

  • This is minimized when all the information about \(\dataVector\) is already stored in \(\inducingVector\).

  • The bound seeks an optimal compression from the information gain perspective.
  • If \(\inducingVector = \mappingFunctionVector\) the bound is exact (\(\mappingFunctionVector\) \(d\)-separates \(\dataVector\) from \(\inducingVector\)).

Choice of Inducing Variables

  • We are free to choose whatever heuristics we like for the inducing variables.
  • We can quantify which heuristics perform better by checking the lower bound.

\[ \begin{bmatrix} \mappingFunctionVector\\ \inducingVector \end{bmatrix} \sim \gaussianSamp{\zerosVector}{\kernelMatrix} \] with \[ \kernelMatrix = \begin{bmatrix} \Kff & \Kfu \\ \Kuf & \Kuu \end{bmatrix} \]

Variational Compression

  • Inducing variables are a compression of the real observations.
  • They are like pseudo-data. They can be in space of \(\mappingFunctionVector\) or a space that is related through a linear operator (Álvarez et al., 2010) — e.g. a gradient or convolution.

Variational Compression II

  • Resulting algorithms reduce computational complexity.
  • Also allow deployment of more standard scaling techniques.
  • E.g. Stochastic variational inference Hoffman et al. (2012)
  • Allow for scaling, e.g. stochastic variational inference Hensman et al. (n.d.) or parallelization Gal et al. (n.d.), Dai et al. (2014), Seeger et al. (2017).

Full Gaussian Process Fit

Inducing Variable Fit

Inducing Variable Param Optimize

Inducing Variable Full Optimize

Eight Optimized Inducing Variables

Full Gaussian Process Fit

Leads to Other Approximations …

  • Let’s be explicit about storing the approximate posterior of \(\inducingVector\), \(q(\inducingVector)\).
  • Now we have \[p(\dataVector^*|\dataVector) = \int p(\dataVector^*| \inducingVector) q(\inducingVector | \dataVector) \text{d} \inducingVector\]

Leads to Other Approximations …

  • Inducing variables look a lot like regular parameters.
  • But: their dimensionality does not need to be set at design time.
  • They can be modified arbitrarily at run time without affecting the model likelihood.
  • They only affect the quality of the compression and the lower bound.

In GPs for Big Data

  • Exploit the resulting factorization … \[p(\dataVector^*|\dataVector) = \int p(\dataVector^*| \inducingVector) q(\inducingVector | \dataVector) \text{d} \inducingVector\]
  • The distribution now factorizes: \[p(\dataVector^*|\dataVector) = \int \prod_{i=1}^{\numData^*}p(\dataScalar^*_i| \inducingVector) q(\inducingVector | \dataVector) \text{d} \inducingVector\]
  • This factorization can be exploited for stochastic variational inference (Hoffman et al., 2012).
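As a rough sketch of how the factorization is used, assume an RBF kernel with unit variance and a Gaussian \(q(\inducingVector)\) with mean m_u and covariance S_u; this mirrors the per-point terms of the Hensman et al. style bound, but it is not the authors' code and the names are illustrative. Because the bound is a sum over data points given \(q(\inducingVector)\), rescaling a random minibatch by \(\numData\) over the batch size gives an unbiased estimate, which is what stochastic variational inference optimizes (the KL term on \(q(\inducingVector)\) is global and added once per bound evaluation).

import numpy as np

def rbf(X, Z, lengthscale=1.0, variance=1.0):
    sqdist = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * sqdist / lengthscale**2)

def minibatch_bound_terms(Xb, yb, Z, m_u, S_u, noise, n_total):
    # Unbiased minibatch estimate of the data-dependent part of the bound.
    M = Z.shape[0]
    Kuu = rbf(Z, Z) + 1e-8 * np.eye(M)
    Kub = rbf(Z, Xb)                                      # M x batch
    A = np.linalg.solve(Kuu, Kub)                         # columns a_i = Kuu^{-1} k_i
    mean_f = A.T @ m_u                                    # E[f_i] under q
    kff_diag = np.full(len(yb), 1.0)                      # rbf(x, x) with unit variance
    var_gap = kff_diag - np.einsum('ij,ij->j', Kub, A)    # k_ii - k_i^T Kuu^{-1} k_i
    quad = np.einsum('ij,jk,ki->i', A.T, S_u, A)          # a_i^T S a_i
    log_lik = -0.5 * np.log(2 * np.pi * noise) - 0.5 * (yb - mean_f) ** 2 / noise
    per_point = log_lik - 0.5 * (var_gap + quad) / noise
    return n_total / len(yb) * per_point.sum()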

Nonparametrics for Very Large Data Sets

Modern data availability

Nonparametrics for Very Large Data Sets

Proxy for index of deprivation?

Nonparametrics for Very Large Data Sets

Actually index of deprivation is a proxy for this …

(Hensman et al., n.d.)
http://auai.org/uai2013/prints/papers/244.pdf


Modern Review

  • A Unifying Framework for Gaussian Process Pseudo-Point Approximations using Power Expectation Propagation Bui et al. (2017)

  • Deep Gaussian Processes and Variational Propagation of Uncertainty Damianou (2015)

Structure of Priors

MacKay: NeurIPS Tutorial 1997 “Have we thrown out the baby with the bathwater?” (Published as MacKay, n.d.)

Deep Neural Network

Deep Neural Network

Mathematically

\[ \begin{align} \hiddenVector_{1} &= \basisFunction\left(\mappingMatrix_1 \inputVector\right)\\ \hiddenVector_{2} &= \basisFunction\left(\mappingMatrix_2\hiddenVector_{1}\right)\\ \hiddenVector_{3} &= \basisFunction\left(\mappingMatrix_3 \hiddenVector_{2}\right)\\ \dataVector &= \mappingVector_4 ^\top\hiddenVector_{3} \end{align} \]
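A minimal NumPy sketch of this composition; the layer widths, the tanh basis function and the random weights are arbitrary choices, purely for illustration.

import numpy as np

def phi(x):
    # basis/activation function; tanh chosen here only for illustration
    return np.tanh(x)

rng = np.random.default_rng(0)
dims = [10, 100, 50, 20]                     # input width and three hidden widths
W1 = rng.standard_normal((dims[1], dims[0]))
W2 = rng.standard_normal((dims[2], dims[1]))
W3 = rng.standard_normal((dims[3], dims[2]))
w4 = rng.standard_normal(dims[3])

x = rng.standard_normal(dims[0])
h1 = phi(W1 @ x)
h2 = phi(W2 @ h1)
h3 = phi(W3 @ h2)
y = w4 @ h3                                  # final layer is linear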

Overfitting

  • Potential problem: if the number of nodes in two adjacent layers is large, the corresponding \(\mappingMatrix\) is also very large and there is the potential to overfit.

  • Proposed solution: “dropout”.

  • Alternative solution: parameterize \(\mappingMatrix\) with its SVD. \[ \mappingMatrix = \eigenvectorMatrix\eigenvalueMatrix\eigenvectwoMatrix^\top \] or \[ \mappingMatrix = \eigenvectorMatrix\eigenvectwoMatrix^\top \] where if \(\mappingMatrix \in \Re^{k_1\times k_2}\) then \(\eigenvectorMatrix\in \Re^{k_1\times q}\) and \(\eigenvectwoMatrix \in \Re^{k_2\times q}\), i.e. we have a low rank matrix factorization for the weights.
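A quick NumPy illustration of the parameter saving from the factorization \(\mappingMatrix = \eigenvectorMatrix\eigenvectwoMatrix^\top\); the widths and bottleneck size here are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
k1, k2, q = 1000, 800, 30                    # layer widths and bottleneck size
U = rng.standard_normal((k1, q))
V = rng.standard_normal((k2, q))
W = U @ V.T                                  # rank-q weight matrix
print(k1 * k2, q * (k1 + k2))                # 800000 full parameters vs 54000 factored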

Low Rank Approximation

Deep Neural Network

Deep Neural Network

Mathematically

The network can now be written mathematically as \[ \begin{align} \latentVector_{1} &= \eigenvectwoMatrix^\top_1 \inputVector\\ \hiddenVector_{1} &= \basisFunction\left(\eigenvectorMatrix_1 \latentVector_{1}\right)\\ \latentVector_{2} &= \eigenvectwoMatrix^\top_2 \hiddenVector_{1}\\ \hiddenVector_{2} &= \basisFunction\left(\eigenvectorMatrix_2 \latentVector_{2}\right)\\ \latentVector_{3} &= \eigenvectwoMatrix^\top_3 \hiddenVector_{2}\\ \hiddenVector_{3} &= \basisFunction\left(\eigenvectorMatrix_3 \latentVector_{3}\right)\\ \dataVector &= \mappingVector_4^\top\hiddenVector_{3}. \end{align} \]

A Cascade of Neural Networks

\[ \begin{align} \latentVector_{1} &= \eigenvectwoMatrix^\top_1 \inputVector\\ \latentVector_{2} &= \eigenvectwoMatrix^\top_2 \basisFunction\left(\eigenvectorMatrix_1 \latentVector_{1}\right)\\ \latentVector_{3} &= \eigenvectwoMatrix^\top_3 \basisFunction\left(\eigenvectorMatrix_2 \latentVector_{2}\right)\\ \dataVector &= \mappingVector_4 ^\top \latentVector_{3} \end{align} \]

Cascade of Gaussian Processes

  • Replace each neural network with a Gaussian process \[ \begin{align} \latentVector_{1} &= \mappingFunctionVector_1\left(\inputVector\right)\\ \latentVector_{2} &= \mappingFunctionVector_2\left(\latentVector_{1}\right)\\ \latentVector_{3} &= \mappingFunctionVector_3\left(\latentVector_{2}\right)\\ \dataVector &= \mappingFunctionVector_4\left(\latentVector_{3}\right) \end{align} \]

  • Equivalent to a prior over the parameters when the width of each layer is taken to infinity; a sketch of sampling from this cascade follows below.
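One way to build intuition for the composed prior is to sample each layer from a GP and feed the sample forward. A minimal NumPy sketch with one-dimensional layers and an RBF covariance; this illustrates the prior only, posterior inference needs the variational machinery discussed earlier.

import numpy as np

def rbf(X, Z, lengthscale=1.0, variance=1.0):
    sqdist = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * sqdist / lengthscale**2)

def sample_gp_layer(X, rng, lengthscale=1.0):
    # one draw of f(X) from a zero-mean GP prior at inputs X
    K = rbf(X, X, lengthscale) + 1e-8 * np.eye(len(X))
    return rng.multivariate_normal(np.zeros(len(X)), K)

rng = np.random.default_rng(2)
x = np.linspace(-2, 2, 200)[:, None]
z1 = sample_gp_layer(x, rng)[:, None]        # each layer's output feeds the next
z2 = sample_gp_layer(z1, rng)[:, None]
z3 = sample_gp_layer(z2, rng)[:, None]
y = sample_gp_layer(z3, rng)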

Outline of the DeepFace architecture. A front-end of a single convolution-pooling-convolution filtering on the rectified input, followed by three locally-connected layers and two fully-connected layers. Color illustrates feature maps produced at each layer. The net includes more than 120 million parameters, where more than 95% come from the local and fully connected layers.

Source: DeepFace (Taigman et al., 2014)

Mathematically

  • Composite multivariate function

\[ \mathbf{g}(\inputVector)=\mappingFunctionVector_5(\mappingFunctionVector_4(\mappingFunctionVector_3(\mappingFunctionVector_2(\mappingFunctionVector_1(\inputVector))))). \]

Equivalent to Markov Chain

  • Composite multivariate function \[ p(\dataVector|\inputVector)= p(\dataVector|\mappingFunctionVector_5)p(\mappingFunctionVector_5|\mappingFunctionVector_4)p(\mappingFunctionVector_4|\mappingFunctionVector_3)p(\mappingFunctionVector_3|\mappingFunctionVector_2)p(\mappingFunctionVector_2|\mappingFunctionVector_1)p(\mappingFunctionVector_1|\inputVector) \]

Why Deep?

  • Gaussian processes give priors over functions.

  • Elegant properties:
    • e.g. derivatives of the process are also Gaussian distributed (if they exist).

  • For particular covariance functions they are ‘universal approximators’, i.e. all functions can have support under the prior.

  • Gaussian derivatives might ring alarm bells.

  • E.g. a priori they don’t believe in function ‘jumps’.

Stochastic Process Composition

  • From a process perspective: process composition.

  • A (new?) way of constructing more complex processes based on simpler components.

Difficulty for Probabilistic Approaches

  • Propagate a probability distribution through a non-linear mapping.

  • Normalisation of distribution becomes intractable.


Deep Gaussian Processes

  • Deep architectures allow abstraction of features (Bengio, 2009; Hinton and Osindero, 2006; Salakhutdinov and Murray, n.d.)
  • We use a variational approach to stack GP models.

Stacked PCA

Stacked GP

Analysis of Deep GPs

  • In Avoiding pathologies in very deep networks, Duvenaud et al. (2014) show that the derivative distribution of the process becomes more heavy tailed as the number of layers increases.

  • In How Deep Are Deep Gaussian Processes?, Dunlop et al. (n.d.) perform a theoretical analysis, made possible by the conditional Gaussian Markov property.

GPy: A Gaussian Process Framework in Python

https://github.com/SheffieldML/GPy

GPy: A Gaussian Process Framework in Python

  • BSD Licensed software base.
  • Wide availability of libraries, ‘modern’ scripting language.
  • Allows us to set projects that use GPs to undergraduates in Computer Science.
  • Available through GitHub https://github.com/SheffieldML/GPy
  • Reproducible Research with Jupyter Notebook.

Features

  • Probabilistic-style programming (specify the model, not the algorithm).
  • Non-Gaussian likelihoods.
  • Multivariate outputs.
  • Dimensionality reduction.
  • Approximations for large data sets.
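A minimal usage sketch, assuming the standard GPy regression interfaces; the kernel choices and toy data are illustrative and not taken from the talk.

import numpy as np
import GPy

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, (50, 1))
Y = np.sin(X) + 0.1 * rng.standard_normal((50, 1))

# Full GP regression: specify the model, the framework handles optimization
# of the marginal likelihood.
model = GPy.models.GPRegression(X, Y, GPy.kern.RBF(input_dim=1))
model.optimize()
mean, var = model.predict(np.linspace(0, 10, 5)[:, None])

# Sparse (inducing variable) approximation for larger data sets.
sparse = GPy.models.SparseGPRegression(X, Y, kernel=GPy.kern.RBF(1), num_inducing=10)
sparse.optimize()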

Olympic Marathon Data

  • Gold medal times for Olympic Marathon since 1896.
  • Marathons before 1924 didn’t have a standardised distance.
  • Present results using pace per km.
  • In 1904 the Marathon was badly organised, leading to very slow times.
Image from Wikimedia Commons http://bit.ly/16kMKHQ

Olympic Marathon Data

Alan Turing

Probability Winning Olympics?

  • He was a formidable Marathon runner.
  • In 1946 he ran a time of 2 hours 46 minutes.
    • That’s a pace of 3.95 min/km.
  • What is the probability he would have won an Olympics if one had been held in 1946?

Olympic Marathon Data GP

Deep GP Fit

  • Can a deep Gaussian process help?

  • A deep GP is one GP feeding into another.

Olympic Marathon Data Deep GP

Olympic Marathon Data Deep GP

Olympic Marathon Data Latent 1

Olympic Marathon Data Latent 2

Olympic Marathon Pinball Plot

Della Gatta Gene Data

  • Gene expression levels in the form of a time series from Della Gatta et al. (2008).

Della Gatta Gene Data

Gene Expression Example

  • We want to detect whether a gene is expressed or not, so we fit a GP to each gene (Kalaitzis and Lawrence, 2011).

http://www.biomedcentral.com/1471-2105/12/180

TP53 Gene Data GP

TP53 Gene Data GP

TP53 Gene Data GP

Multiple Optima

TP53 Gene Data Deep GP

TP53 Gene Data Deep GP

TP53 Gene Data Latent 1

TP53 Gene Data Latent 2

TP53 Gene Pinball Plot

Step Function Data

Step Function Data GP

Step Function Data Deep GP

Step Function Data Deep GP

Step Function Data Latent 1

Step Function Data Latent 2

Step Function Data Latent 3


Step Function Data Latent 4

Step Function Pinball Plot

Motorcycle Helmet Data

Motorcycle Helmet Data GP

Motorcycle Helmet Data Deep GP

Motorcycle Helmet Data Deep GP

Motorcycle Helmet Data Latent 1

Motorcycle Helmet Data Latent 2

Motorcycle Helmet Pinball Plot

Robot Wireless Ground Truth

Robot WiFi Data

Robot WiFi Data GP

Robot WiFi Data Deep GP

Robot WiFi Data Deep GP

Robot WiFi Data Latent Space

Motion Capture

  • ‘High five’ data.
  • Model learns structure between two interacting subjects.

Shared LVM

Thanks to: Zhenwen Dai and Neil D. Lawrence

Deep Health

From NIPS 2017

  • Gaussian process based nonlinear latent structure discovery in multivariate spike train data Wu et al. (2017)
  • Doubly Stochastic Variational Inference for Deep Gaussian Processes Salimbeni and Deisenroth (2017)
  • Deep Multi-task Gaussian Processes for Survival Analysis with Competing Risks Alaa and van der Schaar (2017)
  • Counterfactual Gaussian Processes for Reliable Decision-making and What-if Reasoning Schulam and Saria (2017)

Some Other Works

  • Deep Survival Analysis Ranganath et al. (2016)
  • Recurrent Gaussian Processes Mattos et al. (2015)
  • Gaussian Process Based Approaches for Survival Analysis Saul (2016)

Data Driven

  • Machine Learning: Replicate Processes through direct use of data.
  • Aim to emulate cognitive processes through the use of data.
  • Use data to provide new approaches in control and optimization that should allow for emulation of human motor skills.

Process Emulation

  • Key idea: emulate the process as a mathematical function.
  • Each function has a set of parameters which control its behavior.
  • Learning is the process of changing these parameters to change the shape of the function
  • Choice of which class of mathematical functions we use is a vital component of our model.

Emukit Playground

  • Work by Adam Hirst (Software Engineering Intern) and Cliff McCollum.

  • Tutorial on emulation.

Emukit Playground

Emukit Playground

Uncertainty Quantification

  • Deep nets are a powerful approach to images, speech and language.
  • Proposal: Deep GPs may also be a great approach, but better to deploy according to natural strengths.

Uncertainty Quantification

  • Probabilistic numerics, surrogate modelling, emulation, and UQ.
  • Not a fan of AI as a term.
  • But we are faced with increasing amounts of algorithmic decision making.

ML and Decision Making

  • When trading off decisions: compute or acquire data?
  • There is a critical need for uncertainty.

Uncertainty Quantification

Uncertainty quantification (UQ) is the science of quantitative characterization and reduction of uncertainties in both computational and real world applications. It tries to determine how likely certain outcomes are if some aspects of the system are not exactly known.

  • Interaction between physical and virtual worlds of major interest.

Contrast

  • Simulation in reinforcement learning.
  • Known as data augmentation.
  • Newer, similar in spirit, but typically ignores uncertainty.

Example: Formula One Racing

  • Designing an F1 Car requires CFD, Wind Tunnel, Track Testing etc.

  • How to combine them?

Mountain Car Simulator

Car Dynamics

\[\inputVector_{t+1} = \mappingFunction(\inputVector_{t},\textbf{u}_{t})\]

where \(\textbf{u}_t\) is the action force and \(\inputVector_t = (p_t, v_t)\) is the vehicle state (position and velocity).

Policy

  • Assume policy is linear with parameters \(\boldsymbol{\theta}\)

\[\pi(\inputVector,\theta)= \theta_0 + \theta_p p + \theta_v v.\]
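A minimal sketch of this linear controller; the parameter values in the example call are arbitrary.

import numpy as np

def policy(state, theta):
    # action = theta_0 + theta_p * position + theta_v * velocity
    position, velocity = state
    return theta[0] + theta[1] * position + theta[2] * velocity

print(policy((-0.5, 0.01), np.array([0.0, 1.2, -3.0])))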

Emulate the Mountain Car

  • The goal is to find \(\theta\) such that

\[\theta^* = \arg\max_{\theta} R_T(\theta).\]

  • The reward is computed as 100 for reaching the target, minus the squared sum of the actions.

Random Linear Controller

Best Controller after 50 Iterations of Bayesian Optimization

Data Efficient Emulation

  • Standard Bayesian optimization ignored the dynamics of the car.

  • For more data efficiency, first emulate the dynamics.

  • Then do Bayesian optimization of the emulator.

  • Use a Gaussian process to model \[\Delta v_{t+1} = v_{t+1} - v_{t}\] and \[\Delta p_{t+1} = p_{t+1} - p_{t}\]

  • Two processes, one with mean \(v_{t}\) and one with mean \(p_{t}\); a sketch follows below.
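A sketch of the two emulators, assuming GPy and that training triples of state, action and next state have already been collected from the simulator; the kernel choice and helper names such as build_emulators are illustrative.

import numpy as np
import GPy

def build_emulators(states, actions, next_states):
    X = np.hstack([states, actions])               # inputs: (p_t, v_t, u_t)
    dV = next_states[:, 1:2] - states[:, 1:2]      # Delta v targets
    dP = next_states[:, 0:1] - states[:, 0:1]      # Delta p targets
    gp_v = GPy.models.GPRegression(X, dV, GPy.kern.RBF(input_dim=3, ARD=True))
    gp_p = GPy.models.GPRegression(X, dP, GPy.kern.RBF(input_dim=3, ARD=True))
    gp_v.optimize()
    gp_p.optimize()
    return gp_p, gp_v

def emulate_step(gp_p, gp_v, state, action):
    # predict the next state by adding the predicted deltas to p_t and v_t
    x = np.hstack([state, action])[None, :]
    dp, _ = gp_p.predict(x)
    dv, _ = gp_v.predict(x)
    return np.array([state[0] + dp[0, 0], state[1] + dv[0, 0]])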

Emulator Training

  • Used 500 randomly selected points to train emulators.

  • Can make the process more efficient through experimental design.

Comparison of Emulation and Simulation

Data Efficiency

  • Our emulator used only 500 calls to the simulator.

  • Optimizing the simulator directly required 37,500 calls to the simulator.

Best Controller using Emulator of Dynamics

500 calls to the simulator vs 37,500 calls to the simulator

\[\mappingFunction_i\left(\inputVector\right) = \rho\mappingFunction_{i-1}\left(\inputVector\right) + \delta_i\left(\inputVector \right)\]

Multi-Fidelity Emulation

\[\mappingFunction_i\left(\inputVector\right) = \mappingFunctionTwo_{i}\left(\mappingFunction_{i-1}\left(\inputVector\right)\right) + \delta_i\left(\inputVector \right),\]

Best Controller with Multi-Fidelity Emulator

250 observations of the high-fidelity simulator and 250 of the low-fidelity simulator
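A rough sketch of the nonlinear multi-fidelity construction above, assuming GPy: fit a GP to the low-fidelity data, then fit the high-fidelity GP on inputs augmented with the low-fidelity prediction. Emukit provides proper multi-fidelity models; this mean-only version is purely illustrative.

import numpy as np
import GPy

def fit_multifidelity(X_low, y_low, X_high, y_high):
    gp_low = GPy.models.GPRegression(X_low, y_low, GPy.kern.RBF(X_low.shape[1]))
    gp_low.optimize()

    # augment high-fidelity inputs with the low-fidelity GP's mean prediction
    f_low_at_high, _ = gp_low.predict(X_high)
    X_aug = np.hstack([X_high, f_low_at_high])
    gp_high = GPy.models.GPRegression(X_aug, y_high, GPy.kern.RBF(X_aug.shape[1]))
    gp_high.optimize()
    return gp_low, gp_high

def predict_high(gp_low, gp_high, X_new):
    f_low, _ = gp_low.predict(X_new)
    return gp_high.predict(np.hstack([X_new, f_low]))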

Emukit

  • Work by Javier Gonzalez, Andrei Paleyes, Mark Pullin, Maren Mahsereci, Alex Gessner, Aaron Klein.
  • Available on Github
  • Example sensitivity notebook.

Emukit Software

  • Multi-fidelity emulation: build surrogate models for multiple sources of information;
  • Bayesian optimisation: optimise physical experiments and tune parameters of ML algorithms;
  • Experimental design/Active learning: design experiments and perform active learning with ML models;
  • Sensitivity analysis: analyse the influence of inputs on the outputs;
  • Bayesian quadrature: compute integrals of functions that are expensive to evaluate.

MXFusion: Modular Probabilistic Programming on MXNet

https://github.com/amzn/MXFusion

MXFusion

  • Work by Eric Meissner and Zhenwen Dai.
  • Probabilistic programming.
  • Available on Github

MXFusion

  • Targeted at challenges we face in emulation.
  • Composition of Gaussian processes (deep GPs).
  • Combining GPs with neural networks.
  • Example PPCA Tutorial.

Why another framework?

  • Existing libraries had either:
    • probabilistic modelling with rich, flexible models and universal inference, or
    • specialized, efficient inference over a subset of models.

We needed both.

Key Requirements

  • Integration with deep learning
  • Flexibility
  • Scalability
  • Specialized inference and models support
    • Bayesian Deep Learning methods
    • Rapid prototyping and software re-use
    • GPUs, specialized inference methods

Modularity

  • Specialized Inference
  • Composability (tinkerability)
    • Better leveraging of expert knowledge

What does it look like?

Modelling

Inference

Modelling

Directed Graphs

  • Variable
  • Function
  • Distribution

Example

from mxfusion import Model, Variable  # imports assumed to follow the MXFusion tutorial layout
from mxfusion.components.variables import PositiveTransformation
from mxfusion.components.distributions import Normal

m = Model()
m.mu = Variable()
m.s = Variable(transformation=PositiveTransformation())  # constrained to be positive
m.Y = Normal.define_variable(mean=m.mu, variance=m.s)

3 primary components in modeling

  • Variable
  • Distribution
  • Function

2 primary methods for models

  • log_pdf
  • draw_samples

Inference: Two Classes

  • Variational Inference
  • MCMC Sampling (soon)

Built on MXNet Gluon (imperative code, not a static graph).

Example

from mxfusion.inference import GradBasedInference, MAP  # import path assumed

infr = GradBasedInference(inference_algorithm=MAP(model=m, observed=[m.Y]))
infr.run(Y=data)  # 'data' holds the observations bound to m.Y

Modules

  • Model + Inference together form building blocks.
    • Just doing modular modelling with universal inference doesn’t really scale; we need specialized inference methods for specialized modelling objects like non-parametrics.

Long term Aim

  • Simulate/Emulate the components of the system.
    • Validate with real world using multifidelity.
    • Interpret system using e.g. sensitivity analysis.
  • Perform end to end learning to optimize.
    • Maintain interpretability.

Acknowledgments

Stefanos Eleftheriadis, John Bronskill, Hugh Salimbeni, Rich Turner, Zhenwen Dai, Javier Gonzalez, Andreas Damianou, Mark Pullin, Eric Meissner.

Thanks!

References

Alaa, A.M., van der Schaar, M., 2017. Deep multi-task Gaussian processes for survival analysis with competing risks, in: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (Eds.), Advances in Neural Information Processing Systems 30. Curran Associates, Inc., pp. 2326–2334.

Álvarez, M.A., Luengo, D., Titsias, M.K., Lawrence, N.D., 2010. Efficient multioutput Gaussian processes through variational inducing kernels, in:. pp. 25–32.

Bengio, Y., 2009. Learning Deep Architectures for AI. Found. Trends Mach. Learn. 2, 1–127. https://doi.org/10.1561/2200000006

Bui, T.D., Yan, J., Turner, R.E., 2017. A unifying framework for Gaussian process pseudo-point approximations using power expectation propagation. Journal of Machine Learning Research 18, 1–72.

Dai, Z., Damianou, A., Hensman, J., Lawrence, N.D., 2014. Gaussian process models with parallelization and GPU acceleration.

Damianou, A., 2015. Deep Gaussian processes and variational propagation of uncertainty (PhD thesis). University of Sheffield.

Della Gatta, G., Bansal, M., Ambesi-Impiombato, A., Antonini, D., Missero, C., Bernardo, D. di, 2008. Direct targets of the trp63 transcription factor revealed by a combination of gene expression profiling and reverse engineering. Genome Research 18, 939–948. https://doi.org/10.1101/gr.073601.107

Dunlop, M.M., Girolami, M.A., Stuart, A.M., Teckentrup, A.L., n.d. How deep are deep Gaussian processes? Journal of Machine Learning Research 19, 1–46.

Duvenaud, D., Rippel, O., Adams, R., Ghahramani, Z., 2014. Avoiding pathologies in very deep networks, in:.

Gal, Y., Wilk, M. van der, Rasmussen, C.E., n.d. Distributed variational inference in sparse Gaussian process regression and latent variable models, in:.

Hensman, J., Fusi, N., Lawrence, N.D., n.d. Gaussian processes for big data, in:.

Hinton, G.E., Osindero, S., 2006. A fast learning algorithm for deep belief nets. Neural Computation 18, 2006.

Hoffman, M., Blei, D.M., Wang, C., Paisley, J., 2012. Stochastic variational inference, arXiv preprint arXiv:1206.7051.

Kalaitzis, A.A., Lawrence, N.D., 2011. A simple approach to ranking differentially expressed gene expression time courses through Gaussian process regression. BMC Bioinformatics 12. https://doi.org/10.1186/1471-2105-12-180

Lawrence, N.D., n.d. Learning for larger datasets with the Gaussian process latent variable model, in:. pp. 243–250.

MacKay, D.J.C., n.d. Introduction to Gaussian processes, in:. pp. 133–166.

Mattos, C.L.C., Dai, Z., Damianou, A.C., Forth, J., Barreto, G.A., Lawrence, N.D., 2015. Recurrent Gaussian processes. CoRR abs/1511.06644.

Quiñonero Candela, J., Rasmussen, C.E., 2005. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research 6, 1939–1959.

Ranganath, R., Perotte, A., Elhadad, N., Blei, D., 2016. Deep survival analysis, in: Doshi-Velez, F., Fackler, J., Kale, D., Wallace, B., Wiens, J. (Eds.), Proceedings of the 1st Machine Learning for Healthcare Conference, Proceedings of Machine Learning Research. PMLR, Children’s Hospital LA, Los Angeles, CA, USA, pp. 101–114.

Salakhutdinov, R., Murray, I., n.d. On the quantitative analysis of deep belief networks, in:. pp. 872–879.

Salimbeni, H., Deisenroth, M., 2017. Doubly stochastic variational inference for deep Gaussian processes, in: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (Eds.), Advances in Neural Information Processing Systems 30. Curran Associates, Inc., pp. 4591–4602.

Saul, A.D., 2016. Gaussian process based approaches for survival analysis (PhD thesis). University of Sheffield.

Schulam, P., Saria, S., 2017. Counterfactual Gaussian processes for reliable decision-making and what-if reasoning, in: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (Eds.), Advances in Neural Information Processing Systems 30. Curran Associates, Inc., pp. 1696–1706.

Seeger, M.W., Hetzel, A., Dai, Z., Lawrence, N.D., 2017. Auto-differentiating linear algebra. CoRR abs/1710.08717.

Snelson, E., Ghahramani, Z., n.d. Sparse Gaussian processes using pseudo-inputs, in:.

Taigman, Y., Yang, M., Ranzato, M., Wolf, L., 2014. DeepFace: Closing the gap to human-level performance in face verification, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. https://doi.org/10.1109/CVPR.2014.220

Titsias, M.K., n.d. Variational learning of inducing variables in sparse Gaussian processes, in:. pp. 567–574.

Wu, A., Roy, N.G., Keeley, S., Pillow, J.W., 2017. Gaussian process based nonlinear latent structure discovery in multivariate spike train data, in: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (Eds.), Advances in Neural Information Processing Systems 30. Curran Associates, Inc., pp. 3499–3508.