Meta-Modelling and Deploying ML Software

Neil D. Lawrence

The Mathematics of Deep Learning and Data Science

Introduction

An Intelligent System

Joint work with M. Milo

Deep Learning

Outline of the DeepFace architecture. A front-end of a single convolution-pooling-convolution filtering on the rectified input, followed by three locally-connected layers and two fully-connected layers. Color illustrates the feature maps produced at each layer. The net includes more than 120 million parameters, of which more than 95% come from the locally- and fully-connected layers.

Source: DeepFace (Taigman et al., 2014)

Deep Freeze

Motto

Solve Supply Chain, then solve everything else.

Statistical Emulation

Emulation

Uncertainty Quantification

  • Deep nets are a powerful approach to images, speech, and language.
  • Proposal: deep GPs may also be a great approach, but they are better deployed according to their natural strengths.

Uncertainty Quantification

  • Probabilistic numerics, surrogate modelling, emulation, and UQ.
  • Not a fan of AI as a term.
  • But we are faced with increasing amounts of algorithmic decision making.

ML and Decision Making

  • When trading off decisions: compute or acquire data?
  • There is a critical need for uncertainty.

Uncertainty Quantification

Uncertainty quantification (UQ) is the science of quantitative characterization and reduction of uncertainties in both computational and real world applications. It tries to determine how likely certain outcomes are if some aspects of the system are not exactly known.

  • Interaction between physical and virtual worlds of major interest.

Contrast

  • Simulation in reinforcement learning.
  • Known as data augmentation.
  • Newer and similar in spirit, but it typically ignores uncertainty.

Example: Formula One Racing

  • Designing an F1 car requires CFD, wind tunnel, track testing, etc.

  • How to combine them?

Mountain Car Simulator

Car Dynamics

\[ \mathbf{ x}_{t+1} = f(\mathbf{ x}_{t},\textbf{u}_{t}) \] where \(\textbf{u}_t\) is the action force and \(\mathbf{ x}_t = (p_t, v_t)\) is the vehicle state, with position \(p_t\) and velocity \(v_t\).

Policy

  • Assume the policy is linear with parameters \(\boldsymbol{\theta}\): \[ \pi(\mathbf{ x},\boldsymbol{\theta})= \theta_0 + \theta_p p + \theta_v v. \]
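
As a concrete sketch, the dynamics step and the linear policy are a few lines of Python. The constants in f below are illustrative assumptions, not the talk's exact simulator.

import numpy as np

def policy(x, theta):
    # Linear policy: pi(x, theta) = theta_0 + theta_p p + theta_v v.
    p, v = x
    return theta[0] + theta[1] * p + theta[2] * v

def f(x, u):
    # Illustrative mountain-car dynamics x_{t+1} = f(x_t, u_t).
    p, v = x
    v_next = v + 0.0015 * u - 0.0025 * np.cos(3 * p)  # assumed constants
    return np.array([p + v_next, v_next])

theta = np.array([0.0, 0.5, -0.5])  # example parameter setting
x = np.array([-0.5, 0.0])           # start at rest in the valley
for t in range(100):
    x = f(x, policy(x, theta))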

Emulate the Mountain Car

  • Goal is to find \(\boldsymbol{\theta}\) such that \[ \boldsymbol{\theta}^* = \arg\max_{\boldsymbol{\theta}} R_T(\boldsymbol{\theta}). \]
  • Reward is computed as 100 for reaching the target, minus the squared sum of actions taken.
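
A minimal sketch of this optimization using Emukit's Bayesian optimization loop follows. Here run_episode is a hypothetical helper that rolls the simulator forward under the linear policy and returns the episode reward, and the parameter bounds are assumptions.

import numpy as np
import GPy
from emukit.core import ParameterSpace, ContinuousParameter
from emukit.bayesian_optimization.loops import BayesianOptimizationLoop
from emukit.model_wrappers import GPyModelWrapper

def negative_reward(thetas):
    # Emukit minimizes, so return -R_T(theta) for each row of thetas.
    return np.array([[-run_episode(theta)] for theta in thetas])

space = ParameterSpace([ContinuousParameter('theta_%d' % i, -1., 1.)
                        for i in range(3)])  # assumed bounds
X_init = np.random.uniform(-1., 1., (5, 3))
model = GPyModelWrapper(GPy.models.GPRegression(X_init, negative_reward(X_init)))
loop = BayesianOptimizationLoop(model=model, space=space)
loop.run_loop(negative_reward, 50)  # 50 iterations, as on the slide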

Random Linear Controller

Best Controller after 50 Iterations of Bayesian Optimization

Data Efficient Emulation

  • Standard Bayesian optimization ignored the dynamics of the car.
  • For more data efficiency, first emulate the dynamics.
  • Then do Bayesian optimization of the emulator.

\[ \mathbf{ x}_{t+1} =g(\mathbf{ x}_{t},\textbf{u}_{t}) \]

  • Use a Gaussian process to model \[ \Delta v_{t+1} = v_{t+1} - v_{t} \] and \[ \Delta p_{t+1} = p_{t+1} - p_{t} \]
  • Two processes, one with mean \(v_{t}\), the other with mean \(p_{t}\).
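
A hedged sketch of the two emulators with GPy, assuming X_train is an (N, 3) array of (p_t, v_t, u_t) triples from the simulator and dp, dv are (N, 1) arrays of the observed position and velocity changes:

import numpy as np
import GPy

model_v = GPy.models.GPRegression(X_train, dv, GPy.kern.RBF(input_dim=3))
model_p = GPy.models.GPRegression(X_train, dp, GPy.kern.RBF(input_dim=3))
model_v.optimize()
model_p.optimize()

def emulate_step(p, v, u):
    # One emulated step: add the predicted mean changes to the state.
    x = np.array([[p, v, u]])
    dv_mean, _ = model_v.predict(x)
    dp_mean, _ = model_p.predict(x)
    return p + dp_mean[0, 0], v + dv_mean[0, 0]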

Emulator Training

  • Used 500 randomly selected points to train the emulators.
  • Can make the process more efficient through experimental design, as sketched below.
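
For instance, Emukit's experimental design loop can pick informative simulator queries rather than random ones. In this sketch the bounds are illustrative, model_v is the velocity emulator above, and dynamics_fn is a hypothetical wrapper that evaluates the simulator at the requested (p, v, u) points:

from emukit.core import ParameterSpace, ContinuousParameter
from emukit.experimental_design import ExperimentalDesignLoop
from emukit.model_wrappers import GPyModelWrapper

design_space = ParameterSpace([ContinuousParameter('p', -1.2, 0.6),
                               ContinuousParameter('v', -0.07, 0.07),
                               ContinuousParameter('u', -1., 1.)])
loop = ExperimentalDesignLoop(space=design_space,
                              model=GPyModelWrapper(model_v))
loop.run_loop(dynamics_fn, 100)  # adaptively chosen simulator calls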

Comparison of Emulation and Simulation

Data Efficiency

  • Our emulator used only 500 calls to the simulator.
  • Optimizing the simulator directly required 37,500 calls to the simulator.

Best Controller using Emulator of Dynamics

Mountain Car: Multi-Fidelity Emulation

The fidelities can be linked linearly, with the Gaussian process \(\delta_i\) modelling the discrepancy between fidelity \(i-1\) and fidelity \(i\): \[ f_i\left(\mathbf{ x}\right) = \rho f_{i-1}\left(\mathbf{ x}\right) + \delta_i\left(\mathbf{ x}\right), \]

or through a nonlinear mapping \(g_{i}\) of the lower-fidelity output: \[ f_i\left(\mathbf{ x}\right) = g_{i}\left(f_{i-1}\left(\mathbf{ x}\right)\right) + \delta_i\left(\mathbf{ x}\right). \]
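
The linear version is available directly in Emukit. A sketch, assuming X_low, Y_low and X_high, Y_high hold the low- and high-fidelity observations:

import GPy
from emukit.multi_fidelity.convert_lists_to_array import convert_xy_lists_to_arrays
from emukit.multi_fidelity.kernels import LinearMultiFidelityKernel
from emukit.multi_fidelity.models import GPyLinearMultiFidelityModel

# Stack the fidelities into single arrays with a fidelity-index column.
X, Y = convert_xy_lists_to_arrays([X_low, X_high], [Y_low, Y_high])
kernel = LinearMultiFidelityKernel([GPy.kern.RBF(input_dim=3),
                                    GPy.kern.RBF(input_dim=3)])
model = GPyLinearMultiFidelityModel(X, Y, kernel, n_fidelities=2)
model.optimize()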

Building the Multi-Fidelity Emulation

import GPyOpt
from GPyOpt.experiment_design.random_design import RandomDesign

# design_space, model and acquisition_optimizer are defined earlier.
n_initial_points = 25
random_design = RandomDesign(design_space)
initial_design = random_design.get_samples(n_initial_points)
acquisition = GPyOpt.acquisitions.AcquisitionEI(model, design_space, optimizer=acquisition_optimizer)
evaluator = GPyOpt.core.evaluators.Sequential(acquisition)

Best Controller with Multi-Fidelity Emulator

250 observations of the high-fidelity simulator and 250 of the low-fidelity simulator.

Emukit Playground

Leah Hirst and Cliff McCollum

Emukit

Javier Gonzalez

Javier Gonzalez, Andrei Paleyes, Mark Pullin, Maren Mahsereci

Modular Design

Introduce your own surrogate models.

from emukit.model_wrappers import GPyModelWrapper

To build your own model, see this notebook.

from emukit.model_wrappers import YourModelWrapperHere
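
The import above is a placeholder. Emukit talks to any surrogate through its IModel interface, so a wrapper needs only a handful of methods; in this sketch, predict_mean_var and fit are hypothetical methods of your own model.

from emukit.core.interfaces import IModel

class YourModelWrapperHere(IModel):
    def __init__(self, your_model, X, Y):
        self.your_model, self._X, self._Y = your_model, X, Y

    def predict(self, X_new):
        # Return per-point predictive means and variances.
        return self.your_model.predict_mean_var(X_new)

    def set_data(self, X, Y):
        self._X, self._Y = X, Y

    def optimize(self):
        self.your_model.fit(self._X, self._Y)

    @property
    def X(self):
        return self._X

    @property
    def Y(self):
        return self._Y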

For monitoring systems in production, emulation needn't just be about simulator models. What we envisage is that even data-driven models could be emulated. This is important for understanding system behaviour and how the different components are interconnected. This drives the notion of the information dynamics of the machine learning system. What is the effect of one particular intervention in the wider system? One way of answering this is through emulation. But it requires that our machine learning models (and our simulators) are deployed in an environment where emulation can be automatically deployed. The resulting system would allow us to monitor the downstream effects of individual decision making on the wider system.

Deep Gaussian Processes

Bottleneck Layers in Deep Neural Networks

Deep Neural Network

Mathematically

The network can now be written mathematically as \[ \begin{align} \mathbf{ z}_{1} &= \mathbf{V}^\top_1 \mathbf{ x}\\ \mathbf{ h}_{1} &= \phi\left(\mathbf{U}_1 \mathbf{ z}_{1}\right)\\ \mathbf{ z}_{2} &= \mathbf{V}^\top_2 \mathbf{ h}_{1}\\ \mathbf{ h}_{2} &= \phi\left(\mathbf{U}_2 \mathbf{ z}_{2}\right)\\ \mathbf{ z}_{3} &= \mathbf{V}^\top_3 \mathbf{ h}_{2}\\ \mathbf{ h}_{3} &= \phi\left(\mathbf{U}_3 \mathbf{ z}_{3}\right)\\ \mathbf{ y}&= \mathbf{ w}_4^\top\mathbf{ h}_{3}. \end{align} \]

A Cascade of Neural Networks

\[ \begin{align} \mathbf{ z}_{1} &= \mathbf{V}^\top_1 \mathbf{ x}\\ \mathbf{ z}_{2} &= \mathbf{V}^\top_2 \phi\left(\mathbf{U}_1 \mathbf{ z}_{1}\right)\\ \mathbf{ z}_{3} &= \mathbf{V}^\top_3 \phi\left(\mathbf{U}_2 \mathbf{ z}_{2}\right)\\ \mathbf{ y}&= \mathbf{ w}_4 ^\top \phi\left(\mathbf{U}_3 \mathbf{ z}_{3}\right) \end{align} \]
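
As a quick illustration, the cascade is just alternating matrix products and basis functions; the widths and the tanh basis below are arbitrary choices.

import numpy as np

rng = np.random.default_rng(0)
d = [10, 5, 5, 5]  # widths of x, z1, z2, z3
V = [rng.standard_normal((d[i], d[i + 1])) for i in range(3)]
U = [rng.standard_normal((d[i + 1], d[i + 1])) for i in range(3)]
w4 = rng.standard_normal(d[3])

def cascade(x, phi=np.tanh):
    z = V[0].T @ x              # z1 = V1' x
    z = V[1].T @ phi(U[0] @ z)  # z2 = V2' phi(U1 z1)
    z = V[2].T @ phi(U[1] @ z)  # z3 = V3' phi(U2 z2)
    return w4 @ phi(U[2] @ z)   # y = w4' phi(U3 z3)

y = cascade(rng.standard_normal(10))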

Cascade of Gaussian Processes

  • Replace each neural network with a Gaussian process \[ \begin{align} \mathbf{ z}_{1} &= \mathbf{ f}_1\left(\mathbf{ x}\right)\\ \mathbf{ z}_{2} &= \mathbf{ f}_2\left(\mathbf{ z}_{1}\right)\\ \mathbf{ z}_{3} &= \mathbf{ f}_3\left(\mathbf{ z}_{2}\right)\\ \mathbf{ y}&= \mathbf{ f}_4\left(\mathbf{ z}_{3}\right) \end{align} \]

  • Equivalent to a prior over the parameters, taking the width of each layer to infinity.
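
One way to see the construction: draw a sample path from each process and feed it into the next. This toy numpy sketch composes draws from the prior with an exponentiated quadratic covariance; it is an illustration, not the variational training used in practice.

import numpy as np

def cov(Z, lengthscale=1.0):
    # Exponentiated quadratic covariance on a column of inputs.
    return np.exp(-0.5 * (Z - Z.T) ** 2 / lengthscale ** 2)

rng = np.random.default_rng(1)
z = np.linspace(-3., 3., 200)[:, None]  # inputs x
for layer in range(4):  # compose draws from f1, f2, f3, f4
    K = cov(z) + 1e-8 * np.eye(len(z))  # jitter for stability
    z = rng.multivariate_normal(np.zeros(len(z)), K)[:, None]
y = z  # one sample path from the four-layer deep GP prior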

Olympic Marathon Data

  • Gold medal times for Olympic Marathon since 1896.
  • Marathons before 1924 didn’t have a standardised distance.
  • Present results using pace per km.
  • In 1904 the Marathon was badly organised, leading to very slow times.
Image from Wikimedia Commons http://bit.ly/16kMKHQ

Olympic Marathon Data

Alan Turing

Probability Winning Olympics?

  • He was a formidable marathon runner.
  • In 1946 he ran a time of 2 hours 46 minutes.
    • That’s a pace of 3.95 min/km.
  • What is the probability he would have won an Olympics if one had been held in 1946?
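
A back-of-envelope version of the calculation: given the Gaussian process's predictive mean and standard deviation for the winning pace in 1946, the answer is one minus a Gaussian CDF. The numbers below are hypothetical placeholders, not the fitted model's output.

from scipy.stats import norm

mean_pace, std_pace = 3.5, 0.2  # hypothetical GP prediction (min/km)
turing_pace = 3.95              # 2 hours 46 minutes over the marathon
# Turing wins if the field's winning pace is slower than his.
p_win = 1 - norm.cdf(turing_pace, loc=mean_pace, scale=std_pace)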

Gaussian Process Fit

Olympic Marathon Data GP

Deep GP Fit

  • Can a Deep Gaussian process help?

  • Deep GP is one GP feeding into another.

Olympic Marathon Data Deep GP

Olympic Marathon Data Deep GP

Olympic Marathon Data Latent 1

Olympic Marathon Data Latent 2

Olympic Marathon Pinball Plot

MXFusion: Modular Probabilistic Programming on MXNet

https://github.com/amzn/MXFusion

MXFusion

  • Work by Eric Meissner and Zhenwen Dai.
  • Probabilistic programming.
  • Available on GitHub.

Conclusion

  • ML deployed in interacting systems.
  • Meta-modelling fits statistical models to existing mechanistic models.
  • Leads to speed and interpretability improvements.
  • Deep GPs are a flexible approach to meta-modelling.

Thanks!

References

Taigman, Y., Yang, M., Ranzato, M., Wolf, L., 2014. DeepFace: Closing the gap to human-level performance in face verification, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. https://doi.org/10.1109/CVPR.2014.220