\[\newcommand{\Amatrix}{\mathbf{A}} \newcommand{\KL}[2]{\text{KL}\left( #1\,\|\,#2 \right)} \newcommand{\Kaast}{\kernelMatrix_{\mathbf{ \ast}\mathbf{ \ast}}} \newcommand{\Kastu}{\kernelMatrix_{\mathbf{ \ast} \inducingVector}} \newcommand{\Kff}{\kernelMatrix_{\mappingFunctionVector \mappingFunctionVector}} \newcommand{\Kfu}{\kernelMatrix_{\mappingFunctionVector \inducingVector}} \newcommand{\Kuast}{\kernelMatrix_{\inducingVector \bf\ast}} \newcommand{\Kuf}{\kernelMatrix_{\inducingVector \mappingFunctionVector}} \newcommand{\Kuu}{\kernelMatrix_{\inducingVector \inducingVector}} \newcommand{\Kuui}{\Kuu^{-1}} \newcommand{\Qaast}{\mathbf{Q}_{\bf \ast \ast}} \newcommand{\Qastf}{\mathbf{Q}_{\ast \mappingFunction}} \newcommand{\Qfast}{\mathbf{Q}_{\mappingFunctionVector \bf \ast}} \newcommand{\Qff}{\mathbf{Q}_{\mappingFunctionVector \mappingFunctionVector}} \newcommand{\aMatrix}{\mathbf{A}} \newcommand{\aScalar}{a} \newcommand{\aVector}{\mathbf{a}} \newcommand{\acceleration}{a} \newcommand{\bMatrix}{\mathbf{B}} \newcommand{\bScalar}{b} \newcommand{\bVector}{\mathbf{b}} \newcommand{\basisFunc}{\phi} \newcommand{\basisFuncVector}{\boldsymbol{ \basisFunc}} \newcommand{\basisFunction}{\phi} \newcommand{\basisLocation}{\mu} \newcommand{\basisMatrix}{\boldsymbol{ \Phi}} \newcommand{\basisScalar}{\basisFunction} \newcommand{\basisVector}{\boldsymbol{ \basisFunction}} \newcommand{\activationFunction}{\phi} \newcommand{\activationMatrix}{\boldsymbol{ \Phi}} \newcommand{\activationScalar}{\basisFunction} \newcommand{\activationVector}{\boldsymbol{ \basisFunction}} \newcommand{\bigO}{\mathcal{O}} \newcommand{\binomProb}{\pi} \newcommand{\cMatrix}{\mathbf{C}} \newcommand{\cbasisMatrix}{\hat{\boldsymbol{ \Phi}}} \newcommand{\cdataMatrix}{\hat{\dataMatrix}} \newcommand{\cdataScalar}{\hat{\dataScalar}} \newcommand{\cdataVector}{\hat{\dataVector}} \newcommand{\centeredKernelMatrix}{\mathbf{ \MakeUppercase{\centeredKernelScalar}}} \newcommand{\centeredKernelScalar}{b} \newcommand{\centeredKernelVector}{\centeredKernelScalar} \newcommand{\centeringMatrix}{\mathbf{H}} \newcommand{\chiSquaredDist}[2]{\chi_{#1}^{2}\left(#2\right)} \newcommand{\chiSquaredSamp}[1]{\chi_{#1}^{2}} \newcommand{\conditionalCovariance}{\boldsymbol{ \Sigma}} \newcommand{\coregionalizationMatrix}{\mathbf{B}} \newcommand{\coregionalizationScalar}{b} \newcommand{\coregionalizationVector}{\mathbf{ \coregionalizationScalar}} \newcommand{\covDist}[2]{\text{cov}_{#2}\left(#1\right)} \newcommand{\covSamp}[1]{\text{cov}\left(#1\right)} \newcommand{\covarianceScalar}{c} \newcommand{\covarianceVector}{\mathbf{ \covarianceScalar}} \newcommand{\covarianceMatrix}{\mathbf{C}} \newcommand{\covarianceMatrixTwo}{\boldsymbol{ \Sigma}} \newcommand{\croupierScalar}{s} \newcommand{\croupierVector}{\mathbf{ \croupierScalar}} \newcommand{\croupierMatrix}{\mathbf{ \MakeUppercase{\croupierScalar}}} \newcommand{\dataDim}{p} \newcommand{\dataIndex}{i} \newcommand{\dataIndexTwo}{j} \newcommand{\dataMatrix}{\mathbf{Y}} \newcommand{\dataScalar}{y} \newcommand{\dataSet}{\mathcal{D}} \newcommand{\dataStd}{\sigma} \newcommand{\dataVector}{\mathbf{ \dataScalar}} \newcommand{\decayRate}{d} \newcommand{\degreeMatrix}{\mathbf{ \MakeUppercase{\degreeScalar}}} \newcommand{\degreeScalar}{d} \newcommand{\degreeVector}{\mathbf{ \degreeScalar}} % Already defined by latex %\newcommand{\det}[1]{\left|#1\right|} \newcommand{\diag}[1]{\text{diag}\left(#1\right)} \newcommand{\diagonalMatrix}{\mathbf{D}} \newcommand{\diff}[2]{\frac{\text{d}#1}{\text{d}#2}} 
\newcommand{\diffTwo}[2]{\frac{\text{d}^2#1}{\text{d}#2^2}} \newcommand{\displacement}{x} \newcommand{\displacementVector}{\textbf{\displacement}} \newcommand{\distanceMatrix}{\mathbf{ \MakeUppercase{\distanceScalar}}} \newcommand{\distanceScalar}{d} \newcommand{\distanceVector}{\mathbf{ \distanceScalar}} \newcommand{\eigenvaltwo}{\ell} \newcommand{\eigenvaltwoMatrix}{\mathbf{L}} \newcommand{\eigenvaltwoVector}{\mathbf{l}} \newcommand{\eigenvalue}{\lambda} \newcommand{\eigenvalueMatrix}{\boldsymbol{ \Lambda}} \newcommand{\eigenvalueVector}{\boldsymbol{ \lambda}} \newcommand{\eigenvector}{\mathbf{ \eigenvectorScalar}} \newcommand{\eigenvectorMatrix}{\mathbf{U}} \newcommand{\eigenvectorScalar}{u} \newcommand{\eigenvectwo}{\mathbf{v}} \newcommand{\eigenvectwoMatrix}{\mathbf{V}} \newcommand{\eigenvectwoScalar}{v} \newcommand{\entropy}[1]{\mathcal{H}\left(#1\right)} \newcommand{\errorFunction}{E} \newcommand{\expDist}[2]{\left<#1\right>_{#2}} \newcommand{\expSamp}[1]{\left<#1\right>} \newcommand{\expectation}[1]{\left\langle #1 \right\rangle } \newcommand{\expectationDist}[2]{\left\langle #1 \right\rangle _{#2}} \newcommand{\expectedDistanceMatrix}{\mathcal{D}} \newcommand{\eye}{\mathbf{I}} \newcommand{\fantasyDim}{r} \newcommand{\fantasyMatrix}{\mathbf{ \MakeUppercase{\fantasyScalar}}} \newcommand{\fantasyScalar}{z} \newcommand{\fantasyVector}{\mathbf{ \fantasyScalar}} \newcommand{\featureStd}{\varsigma} \newcommand{\gammaCdf}[3]{\mathcal{GAMMA CDF}\left(#1|#2,#3\right)} \newcommand{\gammaDist}[3]{\mathcal{G}\left(#1|#2,#3\right)} \newcommand{\gammaSamp}[2]{\mathcal{G}\left(#1,#2\right)} \newcommand{\gaussianDist}[3]{\mathcal{N}\left(#1|#2,#3\right)} \newcommand{\gaussianSamp}[2]{\mathcal{N}\left(#1,#2\right)} \newcommand{\given}{|} \newcommand{\half}{\frac{1}{2}} \newcommand{\heaviside}{H} \newcommand{\hiddenMatrix}{\mathbf{ \MakeUppercase{\hiddenScalar}}} \newcommand{\hiddenScalar}{h} \newcommand{\hiddenVector}{\mathbf{ \hiddenScalar}} \newcommand{\identityMatrix}{\eye} \newcommand{\inducingInputScalar}{z} \newcommand{\inducingInputVector}{\mathbf{ \inducingInputScalar}} \newcommand{\inducingInputMatrix}{\mathbf{Z}} \newcommand{\inducingScalar}{u} \newcommand{\inducingVector}{\mathbf{ \inducingScalar}} \newcommand{\inducingMatrix}{\mathbf{U}} \newcommand{\inlineDiff}[2]{\text{d}#1/\text{d}#2} \newcommand{\inputDim}{q} \newcommand{\inputMatrix}{\mathbf{X}} \newcommand{\inputScalar}{x} \newcommand{\inputSpace}{\mathcal{X}} \newcommand{\inputVals}{\inputVector} \newcommand{\inputVector}{\mathbf{ \inputScalar}} \newcommand{\iterNum}{k} \newcommand{\kernel}{\kernelScalar} \newcommand{\kernelMatrix}{\mathbf{K}} \newcommand{\kernelScalar}{k} \newcommand{\kernelVector}{\mathbf{ \kernelScalar}} \newcommand{\kff}{\kernelScalar_{\mappingFunction \mappingFunction}} \newcommand{\kfu}{\kernelVector_{\mappingFunction \inducingScalar}} \newcommand{\kuf}{\kernelVector_{\inducingScalar \mappingFunction}} \newcommand{\kuu}{\kernelVector_{\inducingScalar \inducingScalar}} \newcommand{\lagrangeMultiplier}{\lambda} \newcommand{\lagrangeMultiplierMatrix}{\boldsymbol{ \Lambda}} \newcommand{\lagrangian}{L} \newcommand{\laplacianFactor}{\mathbf{ \MakeUppercase{\laplacianFactorScalar}}} \newcommand{\laplacianFactorScalar}{m} \newcommand{\laplacianFactorVector}{\mathbf{ \laplacianFactorScalar}} \newcommand{\laplacianMatrix}{\mathbf{L}} \newcommand{\laplacianScalar}{\ell} \newcommand{\laplacianVector}{\mathbf{ \ell}} \newcommand{\latentDim}{q} \newcommand{\latentDistanceMatrix}{\boldsymbol{ \Delta}} 
\newcommand{\latentDistanceScalar}{\delta} \newcommand{\latentDistanceVector}{\boldsymbol{ \delta}} \newcommand{\latentForce}{f} \newcommand{\latentFunction}{u} \newcommand{\latentFunctionVector}{\mathbf{ \latentFunction}} \newcommand{\latentFunctionMatrix}{\mathbf{ \MakeUppercase{\latentFunction}}} \newcommand{\latentIndex}{j} \newcommand{\latentScalar}{z} \newcommand{\latentVector}{\mathbf{ \latentScalar}} \newcommand{\latentMatrix}{\mathbf{Z}} \newcommand{\learnRate}{\eta} \newcommand{\lengthScale}{\ell} \newcommand{\rbfWidth}{\ell} \newcommand{\likelihoodBound}{\mathcal{L}} \newcommand{\likelihoodFunction}{L} \newcommand{\locationScalar}{\mu} \newcommand{\locationVector}{\boldsymbol{ \locationScalar}} \newcommand{\locationMatrix}{\mathbf{M}} \newcommand{\variance}[1]{\text{var}\left( #1 \right)} \newcommand{\mappingFunction}{f} \newcommand{\mappingFunctionMatrix}{\mathbf{F}} \newcommand{\mappingFunctionTwo}{g} \newcommand{\mappingFunctionTwoMatrix}{\mathbf{G}} \newcommand{\mappingFunctionTwoVector}{\mathbf{ \mappingFunctionTwo}} \newcommand{\mappingFunctionVector}{\mathbf{ \mappingFunction}} \newcommand{\scaleScalar}{s} \newcommand{\mappingScalar}{w} \newcommand{\mappingVector}{\mathbf{ \mappingScalar}} \newcommand{\mappingMatrix}{\mathbf{W}} \newcommand{\mappingScalarTwo}{v} \newcommand{\mappingVectorTwo}{\mathbf{ \mappingScalarTwo}} \newcommand{\mappingMatrixTwo}{\mathbf{V}} \newcommand{\maxIters}{K} \newcommand{\meanMatrix}{\mathbf{M}} \newcommand{\meanScalar}{\mu} \newcommand{\meanTwoMatrix}{\mathbf{M}} \newcommand{\meanTwoScalar}{m} \newcommand{\meanTwoVector}{\mathbf{ \meanTwoScalar}} \newcommand{\meanVector}{\boldsymbol{ \meanScalar}} \newcommand{\mrnaConcentration}{m} \newcommand{\naturalFrequency}{\omega} \newcommand{\neighborhood}[1]{\mathcal{N}\left( #1 \right)} \newcommand{\neilurl}{http://inverseprobability.com/} \newcommand{\noiseMatrix}{\boldsymbol{ E}} \newcommand{\noiseScalar}{\epsilon} \newcommand{\noiseVector}{\boldsymbol{ \epsilon}} \newcommand{\norm}[1]{\left\Vert #1 \right\Vert} \newcommand{\normalizedLaplacianMatrix}{\hat{\mathbf{L}}} \newcommand{\normalizedLaplacianScalar}{\hat{\ell}} \newcommand{\normalizedLaplacianVector}{\hat{\mathbf{ \ell}}} \newcommand{\numActive}{m} \newcommand{\numBasisFunc}{m} \newcommand{\numComponents}{m} \newcommand{\numComps}{K} \newcommand{\numData}{n} \newcommand{\numFeatures}{K} \newcommand{\numHidden}{h} \newcommand{\numInducing}{m} \newcommand{\numLayers}{\ell} \newcommand{\numNeighbors}{K} \newcommand{\numSequences}{s} \newcommand{\numSuccess}{s} \newcommand{\numTasks}{m} \newcommand{\numTime}{T} \newcommand{\numTrials}{S} \newcommand{\outputIndex}{j} \newcommand{\paramVector}{\boldsymbol{ \theta}} \newcommand{\parameterMatrix}{\boldsymbol{ \Theta}} \newcommand{\parameterScalar}{\theta} \newcommand{\parameterVector}{\boldsymbol{ \parameterScalar}} \newcommand{\partDiff}[2]{\frac{\partial#1}{\partial#2}} \newcommand{\precisionScalar}{j} \newcommand{\precisionVector}{\mathbf{ \precisionScalar}} \newcommand{\precisionMatrix}{\mathbf{J}} \newcommand{\pseudotargetScalar}{\widetilde{y}} \newcommand{\pseudotargetVector}{\mathbf{ \pseudotargetScalar}} \newcommand{\pseudotargetMatrix}{\mathbf{ \widetilde{Y}}} \newcommand{\rank}[1]{\text{rank}\left(#1\right)} \newcommand{\rayleighDist}[2]{\mathcal{R}\left(#1|#2\right)} \newcommand{\rayleighSamp}[1]{\mathcal{R}\left(#1\right)} \newcommand{\responsibility}{r} \newcommand{\rotationScalar}{r} \newcommand{\rotationVector}{\mathbf{ \rotationScalar}} \newcommand{\rotationMatrix}{\mathbf{R}} 
\newcommand{\sampleCovScalar}{s} \newcommand{\sampleCovVector}{\mathbf{ \sampleCovScalar}} \newcommand{\sampleCovMatrix}{\mathbf{s}} \newcommand{\scalarProduct}[2]{\left\langle{#1},{#2}\right\rangle} \newcommand{\sign}[1]{\text{sign}\left(#1\right)} \newcommand{\sigmoid}[1]{\sigma\left(#1\right)} \newcommand{\singularvalue}{\ell} \newcommand{\singularvalueMatrix}{\mathbf{L}} \newcommand{\singularvalueVector}{\mathbf{l}} \newcommand{\sorth}{\mathbf{u}} \newcommand{\spar}{\lambda} \newcommand{\trace}[1]{\text{tr}\left(#1\right)} \newcommand{\BasalRate}{B} \newcommand{\DampingCoefficient}{C} \newcommand{\DecayRate}{D} \newcommand{\Displacement}{X} \newcommand{\LatentForce}{F} \newcommand{\Mass}{M} \newcommand{\Sensitivity}{S} \newcommand{\basalRate}{b} \newcommand{\dampingCoefficient}{c} \newcommand{\mass}{m} \newcommand{\sensitivity}{s} \newcommand{\springScalar}{\kappa} \newcommand{\springVector}{\boldsymbol{ \kappa}} \newcommand{\springMatrix}{\boldsymbol{ \mathcal{K}}} \newcommand{\tfConcentration}{p} \newcommand{\tfDecayRate}{\delta} \newcommand{\tfMrnaConcentration}{f} \newcommand{\tfVector}{\mathbf{ \tfConcentration}} \newcommand{\velocity}{v} \newcommand{\sufficientStatsScalar}{g} \newcommand{\sufficientStatsVector}{\mathbf{ \sufficientStatsScalar}} \newcommand{\sufficientStatsMatrix}{\mathbf{G}} \newcommand{\switchScalar}{s} \newcommand{\switchVector}{\mathbf{ \switchScalar}} \newcommand{\switchMatrix}{\mathbf{S}} \newcommand{\tr}[1]{\text{tr}\left(#1\right)} \newcommand{\loneNorm}[1]{\left\Vert #1 \right\Vert_1} \newcommand{\ltwoNorm}[1]{\left\Vert #1 \right\Vert_2} \newcommand{\onenorm}[1]{\left\vert#1\right\vert_1} \newcommand{\twonorm}[1]{\left\Vert #1 \right\Vert} \newcommand{\vScalar}{v} \newcommand{\vVector}{\mathbf{v}} \newcommand{\vMatrix}{\mathbf{V}} \newcommand{\varianceDist}[2]{\text{var}_{#2}\left( #1 \right)} % Already defined by latex %\newcommand{\vec}{#1:} \newcommand{\vecb}[1]{\left(#1\right):} \newcommand{\weightScalar}{w} \newcommand{\weightVector}{\mathbf{ \weightScalar}} \newcommand{\weightMatrix}{\mathbf{W}} \newcommand{\weightedAdjacencyMatrix}{\mathbf{A}} \newcommand{\weightedAdjacencyScalar}{a} \newcommand{\weightedAdjacencyVector}{\mathbf{ \weightedAdjacencyScalar}} \newcommand{\onesVector}{\mathbf{1}} \newcommand{\zerosVector}{\mathbf{0}} \]

Machine Learning and the Physical World

Neil D. Lawrence

Center for Statistics and Machine Learning, Princeton

Watt’s Steam Engine which made Steam Power Efficient and Practical

\[ \text{data} + \text{model} \xrightarrow{\text{compute}} \text{prediction}\]

Efficiency

  • Economies driven by ‘production’.
  • Greater production comes with better efficiency.
    • E.g. moving from gathering food to settled agriculture.
  • In the modern era one approach to becoming more efficient is automation of processes.
    • E.g. manufacturing production lines

Physical Processes

  • Manufacturing processes consist of production lines and robotic automation.
  • Logistics can also be decomposed into the supply chain processes.
  • Efficiency can be improved by automation.

Goods and Information

  • For modern society: management of flow of goods and information.
  • Flow of information is highly automated.
  • Processing of data is decomposed into stages in computer code.

Intervention

  • For all cases: manufacturing, logistics, data management
  • The pipeline requires human intervention from an operator.
  • Interventions create bottlenecks and slow the process.
  • Machine learning is a key technology in automating these manual stages.

Long Grass

  • Easy-to-replicate interventions have already been dealt with.
  • Components that still require human intervention are the knottier problems.
  • These are difficult to decompose into stages that could then be further automated.
  • These components are ‘process-atoms’.
  • These are the “long grass” regions of technology.

Nature of Challenge

  • In manufacturing or logistics settings the atoms are flexible manual skills.
    • Requires emulation of a human’s motor skills.
  • In information processing: our flexible cognitive skills.
    • Our ability to mentally process an image or some text.

Worked Example: Delivery Drones

Artificial Intelligence and Data Science

  • AI aims to equip computers with human capabilities
    • Image understanding
    • Computer vision
    • Speech recognition
    • Natural language understanding
    • Machine translation

Supervised Learning for AI

  • Dominant approach today:
    • Generate large labelled data set from humans.
    • Use supervised learning to emulate that data.
      • E.g. ImageNet (Russakovsky et al., 2015)
  • Significant advances due to deep learning
    • E.g. Alexa, Amazon Go

Data Science

  • Arises from happenstance data.
  • Differs from statistics in that the question comes after data collection.

The Gap

  • There is a gap between the world of data science and AI.
  • The mapping of the virtual onto the physical world.
  • E.g. Causal understanding.

Machine Learning in Supply Chain

  • Supply chain: Large Automated Decision Making Network
  • Amazon’s supply chain: Possibly the world’s largest ‘AI’
  • Major Challenge:
    • We have a mechanistic understanding of supply chain.
    • Machine learning is a data driven technology.

Uncertainty Quantification

Data Driven

  • Machine Learning: Replicate Processes through direct use of data.
  • Aim to emulate cognitive processes through the use of data.
  • Use data to provide new approaches in control and optimization that should allow for emulation of human motor skills.

Process Emulation

  • Key idea: emulate the process as a mathematical function.
  • Each function has a set of parameters which control its behavior.
  • Learning is the process of changing these parameters to change the shape of the function (see the sketch after this list).
  • Choice of which class of mathematical functions we use is a vital component of our model.
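A minimal sketch of the idea, assuming a one-dimensional process, a quadratic function class and gradient descent on a squared error; all names and numbers here are illustrative rather than taken from the talk.

import numpy as np

# Illustrative data from a process we would like to emulate.
x = np.linspace(-1, 1, 20)
y = 0.5 * x**2 - 0.2 * x + 0.1 * np.random.randn(20)

# Choice of function class: quadratics f(x; theta).
def f(x, theta):
    return theta[0] + theta[1] * x + theta[2] * x**2

# Learning: adjust the parameters to change the shape of the function.
theta = np.zeros(3)
for _ in range(1000):
    residual = f(x, theta) - y
    grad = np.stack([np.ones_like(x), x, x**2]) @ residual / len(x)
    theta -= 0.1 * grad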

Emukit Playground

  • Work by Adam Hirst, Software Engineering Intern, and Cliff McCollum.

  • Tutorial on emulation.


Uncertainty Quantification

  • Deep nets are a powerful approach to images, speech, and language.
  • Proposal: Deep GPs may also be a great approach, but better to deploy according to natural strengths.

Uncertainty Quantification

  • Probabilistic numerics, surrogate modelling, emulation, and UQ.
  • Not a fan of AI as a term.
  • But we are faced with increasing amounts of algorithmic decision making.

ML and Decision Making

  • When trading off decisions: compute or acquire data?
  • There is a critical need for uncertainty.

Uncertainty Quantification

Uncertainty quantification (UQ) is the science of quantitative characterization and reduction of uncertainties in both computational and real world applications. It tries to determine how likely certain outcomes are if some aspects of the system are not exactly known.

  • Interaction between physical and virtual worlds of major interest.

Contrast

  • Simulation in reinforcement learning.
  • Known as data augmentation.
  • Newer, similar in spirit, but typically ignores uncertainty.

Example: Formula One Racing

  • Designing an F1 Car requires CFD, Wind Tunnel, Track Testing etc.

  • How to combine them?

Mountain Car Simulator

Car Dynamics

\[\inputVector_{t+1} = \mappingFunction(\inputVector_{t},\textbf{u}_{t})\]

where \(\textbf{u}_t\) is the action force and \(\inputVector_t = (p_t, v_t)\) is the vehicle state (position and velocity).
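As a concrete sketch, a transition function of this form using the constants from OpenAI Gym's continuous mountain car (the exact constants are an assumption about the simulator used here):

import numpy as np

def dynamics(state, u, power=0.0015, gravity=0.0025):
    # One simulator step: state is (position p, velocity v), u is the action force.
    p, v = state
    v = v + power * np.clip(u, -1.0, 1.0) - gravity * np.cos(3 * p)
    v = np.clip(v, -0.07, 0.07)
    p = np.clip(p + v, -1.2, 0.6)
    return np.array([p, v])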

Policy

  • Assume policy is linear with parameters \(\boldsymbol{\theta}\)

\[\pi(\inputVector,\theta)= \theta_0 + \theta_p p + \theta_v v.\]
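As a sketch, the linear policy is just an affine function of the state; it is reused by the rollout sketch further below.

def policy(state, theta):
    # pi(x; theta) = theta_0 + theta_p * p + theta_v * v
    p, v = state
    return theta[0] + theta[1] * p + theta[2] * v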

Emulate the Mountain Car

  • Goal is to find \(\theta\) such that

\[\theta^* = \arg\max_{\theta} R_T(\theta).\]

  • Reward is computed as 100 for reaching the target, minus the squared sum of actions (see the sketch below).
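A sketch of the corresponding episode return \(R_T(\theta)\), reusing the dynamics and policy sketches above; the start state, horizon, target position and the exact form of the action penalty are assumptions.

def episode_return(theta, state=(-0.5, 0.0), horizon=500, target=0.45):
    # Roll the policy out through the dynamics, rewarding 100 for reaching the
    # target and penalising the actions taken (assumed: sum of squared actions).
    penalty = 0.0
    for _ in range(horizon):
        u = policy(state, theta)
        penalty += u**2
        state = dynamics(state, u)
        if state[0] >= target:
            return 100.0 - penalty
    return -penalty

Bayesian optimization then searches over \(\theta\) using only evaluations of this return.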

Random Linear Controller

Best Controller after 50 Iterations of Bayesian Optimization

Data Efficient Emulation

  • For standard Bayesian optimization we ignored the dynamics of the car.

  • For more data efficiency, first emulate the dynamics.

  • Then do Bayesian optimization of the emulator.

  • Use a Gaussian process to model \[\Delta v_{t+1} = v_{t+1} - v_{t}\] and \[\Delta p_{t+1} = p_{t+1} - p_{t}\]

  • Two processes, one with mean \(v_{t}\) and one with mean \(p_{t}\) (see the GPy sketch below).
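A minimal emulation sketch with GPy (not necessarily the exact model used in the experiments): two independent GPs take the state and action \((p_t, v_t, u_t)\) as input and predict the velocity and position changes; the arrays below are placeholders for data gathered from the simulator.

import numpy as np
import GPy

X = np.random.randn(500, 3)    # placeholder rows of (p_t, v_t, u_t)
dv = np.random.randn(500, 1)   # placeholder v_{t+1} - v_t
dp = np.random.randn(500, 1)   # placeholder p_{t+1} - p_t

emulator_v = GPy.models.GPRegression(X, dv, GPy.kern.RBF(input_dim=3, ARD=True))
emulator_p = GPy.models.GPRegression(X, dp, GPy.kern.RBF(input_dim=3, ARD=True))
emulator_v.optimize()
emulator_p.optimize()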

Emulator Training

  • Used 500 randomly selected points to train emulators.

  • Can make the process more efficient through experimental design.

Comparison of Emulation and Simulation

Data Efficiency

  • Our emulator used only 500 calls to the simulator.

  • Optimizing the simulator directly required 37,500 calls to the simulator.

Best Controller using Emulator of Dynamics

500 calls to the simulator vs 37,500 calls to the simulator

\[\mappingFunction_i\left(\inputVector\right) = \rho\mappingFunction_{i-1}\left(\inputVector\right) + \delta_i\left(\inputVector \right)\]

Multi-Fidelity Emulation

\[\mappingFunction_i\left(\inputVector\right) = \mappingFunctionTwo_{i}\left(\mappingFunction_{i-1}\left(\inputVector\right)\right) + \delta_i\left(\inputVector \right),\]
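For the first (linear) form, a sketch using Emukit's linear multi-fidelity model; the module paths reflect my understanding of the Emukit API, and the data arrays are placeholders for the low- and high-fidelity simulator outputs.

import numpy as np
import GPy
from emukit.multi_fidelity.convert_lists_to_array import convert_xy_lists_to_arrays
from emukit.multi_fidelity.kernels import LinearMultiFidelityKernel
from emukit.multi_fidelity.models import GPyLinearMultiFidelityModel

x_low, y_low = np.random.rand(250, 1), np.random.rand(250, 1)    # low-fidelity data
x_high, y_high = np.random.rand(250, 1), np.random.rand(250, 1)  # high-fidelity data

X, Y = convert_xy_lists_to_arrays([x_low, x_high], [y_low, y_high])
kernel = LinearMultiFidelityKernel([GPy.kern.RBF(1), GPy.kern.RBF(1)])
model = GPyLinearMultiFidelityModel(X, Y, kernel, n_fidelities=2)
model.optimize()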

Best Controller with Multi-Fidelity Emulator

250 observations of the high-fidelity simulator and 250 of the low-fidelity simulator

Emukit

Emukit Software

  • Multi-fidelity emulation: build surrogate models for multiple sources of information;
  • Bayesian optimisation: optimise physical experiments and tune parameters of ML algorithms (see the sketch below);
  • Experimental design/Active learning: design experiments and perform active learning with ML models;
  • Sensitivity analysis: analyse the influence of inputs on the outputs;
  • Bayesian quadrature: compute integrals of functions that are expensive to evaluate.
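A sketch of the Bayesian optimisation loop mentioned above, wrapping a GPy surrogate; the objective, parameter ranges and number of iterations are illustrative, and the module paths reflect my understanding of the Emukit API.

import numpy as np
import GPy
from emukit.core import ParameterSpace, ContinuousParameter
from emukit.model_wrappers import GPyModelWrapper
from emukit.bayesian_optimization.loops import BayesianOptimizationLoop

def objective(X):                      # stand-in for an expensive experiment
    return np.sin(3 * X) + X**2

space = ParameterSpace([ContinuousParameter('x', -1.0, 2.0)])
X_init = np.random.uniform(-1.0, 2.0, (5, 1))
model = GPyModelWrapper(GPy.models.GPRegression(X_init, objective(X_init)))

loop = BayesianOptimizationLoop(model=model, space=space)
loop.run_loop(objective, 10)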

MxFusion

  • Work by Eric Meissner and Zhenwen Dai.
  • Probabilistic programming.
  • Available on GitHub

MxFusion

  • Targeted at challenges we face in emulation.
  • Composition of Gaussian processes (Deep GPs)
  • Combining GPs with neural networks.
  • Example PPCA Tutorial.

Why another framework?

Existing libraries had either:

  • Probabilistic modelling with rich, flexible models and universal inference, or
  • Specialized, efficient inference over a subset of models.

We needed both

Key Requirements

  • Integration with deep learning
  • Flexibility
  • Scalability
  • Specialized inference and model support
    • Bayesian Deep Learning methods
    • Rapid prototyping and software re-use
    • GPUs, specialized inference methods

Modularity

  • Specialized Inference
  • Composability (tinkerability)
    • Better leveraging of expert knowledge

What does it look like?

Modelling

Inference

Modelling

Directed Graphs

  • Variable
  • Function
  • Distribution

Example

from mxfusion import Model, Variable
from mxfusion.components.variables import PositiveTransformation
from mxfusion.components.distributions import Normal

m = Model()
m.mu = Variable()
m.s = Variable(transformation=PositiveTransformation())
m.Y = Normal.define_variable(mean=m.mu, variance=m.s)

3 primary components in modeling

  • Variable
  • Distribution
  • Function

2 primary methods for models

  • log_pdf
  • draw_samples

Inference: Two Classes

  • Variational Inference
  • MCMC Sampling (soon)

Built on MXNet Gluon (imperative code, not a static graph).

Example

from mxfusion.inference import GradBasedInference, MAP

infr = GradBasedInference(inference_algorithm=MAP(model=m, observed=[m.Y]))
infr.run(Y=data)

Modules

  • Model + Inference together form building blocks.
    • Modular modelling with universal inference alone doesn't really scale; specialized modelling objects such as non-parametrics need specialized inference methods.

Enhancements MXFusion Brings

  • Use Monte Carlo integration instead of moment estimation (see the sketch after this list)
  • Use automatic differentiation
  • A flexible interface for Gaussian processes that makes it trivial to switch to sparse or stochastic variational inference
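A generic sketch of the first point: estimating an expectation under the (approximate) posterior by Monte Carlo samples rather than by a closed-form moment calculation. Plain NumPy, not MXFusion's internals.

import numpy as np

def mc_expectation(fn, mean, std, num_samples=1000):
    # Estimate E_{q(z)}[fn(z)] for q = N(mean, std^2) by sampling.
    z = mean + std * np.random.randn(num_samples)
    return np.mean(fn(z))

# Example: a nonlinearity whose moments have no simple closed form.
print(mc_expectation(np.tanh, mean=0.5, std=0.2))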

Preparation

pip install mxnet mxfusion gym

Set the global configuration.

Example: Pendulum

Pendulum

Fit the Dynamics Model

  • Dynamics:

\[ p(\dataScalar_{t+1}|\dataScalar_t, a_t) \]

Define and fit the model
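A sketch of how training pairs for such a model can be assembled from recorded trajectories; predicting the change in state (as in the mountain car example) is an assumption, and the array names are illustrative.

import numpy as np

def dynamics_training_data(states, actions):
    # Build inputs (y_t, a_t) and targets y_{t+1} - y_t from one trajectory.
    X = np.hstack([states[:-1], actions[:-1]])
    Y = states[1:] - states[:-1]
    return X, Y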

Policy

  • Make use of a neural network with one hidden layer (see the sketch below)
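A sketch of such a policy network in MXNet Gluon (the framework MXFusion builds on); the hidden width and activation functions are assumptions.

import mxnet as mx
from mxnet.gluon import nn

policy_net = nn.HybridSequential()
policy_net.add(nn.Dense(25, activation='tanh'))   # single hidden layer, width assumed
policy_net.add(nn.Dense(1, activation='tanh'))    # action output squashed to [-1, 1]
policy_net.initialize(mx.init.Xavier())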

Obtaining the policy gradients

The Loop

After First Episode

Policy after the first episode (random exploration):

After Fifth Episode

Policy after the 5th episode:

Contribute!

https://github.com/amzn/mxfusion

Future plans

  • Deep GPs (implemented, not yet merged)
  • MCMC Methods
  • Time series models (RGPs)

Long term Aim

  • Simulate/Emulate the components of the system.
    • Validate with real world using multifidelity.
    • Interpret system using e.g. sensitivity analysis.
  • Perform end to end learning to optimize.
    • Maintain interpretability.

References

Deisenroth, M.P., Rasmussen, C.E., 2011. PILCO: A model-based and data-efficient approach to policy search, in: Proceedings of the 28th International Conference on Machine Learning (ICML).

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L., 2015. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115, 211–252. https://doi.org/10.1007/s11263-015-0816-y