[edit]

Machine Learning and the Physical World

$$\newcommand{\tk}[1]{} %\newcommand{\tk}[1]{\textbf{TK}: #1} \newcommand{\Amatrix}{\mathbf{A}} \newcommand{\KL}[2]{\text{KL}\left( #1\,\|\,#2 \right)} \newcommand{\Kaast}{\kernelMatrix_{\mathbf{ \ast}\mathbf{ \ast}}} \newcommand{\Kastu}{\kernelMatrix_{\mathbf{ \ast} \inducingVector}} \newcommand{\Kff}{\kernelMatrix_{\mappingFunctionVector \mappingFunctionVector}} \newcommand{\Kfu}{\kernelMatrix_{\mappingFunctionVector \inducingVector}} \newcommand{\Kuast}{\kernelMatrix_{\inducingVector \bf\ast}} \newcommand{\Kuf}{\kernelMatrix_{\inducingVector \mappingFunctionVector}} \newcommand{\Kuu}{\kernelMatrix_{\inducingVector \inducingVector}} \newcommand{\Kuui}{\Kuu^{-1}} \newcommand{\Qaast}{\mathbf{Q}_{\bf \ast \ast}} \newcommand{\Qastf}{\mathbf{Q}_{\ast \mappingFunction}} \newcommand{\Qfast}{\mathbf{Q}_{\mappingFunctionVector \bf \ast}} \newcommand{\Qff}{\mathbf{Q}_{\mappingFunctionVector \mappingFunctionVector}} \newcommand{\aMatrix}{\mathbf{A}} \newcommand{\aScalar}{a} \newcommand{\aVector}{\mathbf{a}} \newcommand{\acceleration}{a} \newcommand{\bMatrix}{\mathbf{B}} \newcommand{\bScalar}{b} \newcommand{\bVector}{\mathbf{b}} \newcommand{\basisFunc}{\phi} \newcommand{\basisFuncVector}{\boldsymbol{ \basisFunc}} \newcommand{\basisFunction}{\phi} \newcommand{\basisLocation}{\mu} \newcommand{\basisMatrix}{\boldsymbol{ \Phi}} \newcommand{\basisScalar}{\basisFunction} \newcommand{\basisVector}{\boldsymbol{ \basisFunction}} \newcommand{\activationFunction}{\phi} \newcommand{\activationMatrix}{\boldsymbol{ \Phi}} \newcommand{\activationScalar}{\basisFunction} \newcommand{\activationVector}{\boldsymbol{ \basisFunction}} \newcommand{\bigO}{\mathcal{O}} \newcommand{\binomProb}{\pi} \newcommand{\cMatrix}{\mathbf{C}} \newcommand{\cbasisMatrix}{\hat{\boldsymbol{ \Phi}}} \newcommand{\cdataMatrix}{\hat{\dataMatrix}} \newcommand{\cdataScalar}{\hat{\dataScalar}} \newcommand{\cdataVector}{\hat{\dataVector}} \newcommand{\centeredKernelMatrix}{\mathbf{ \MakeUppercase{\centeredKernelScalar}}} \newcommand{\centeredKernelScalar}{b} \newcommand{\centeredKernelVector}{\centeredKernelScalar} \newcommand{\centeringMatrix}{\mathbf{H}} \newcommand{\chiSquaredDist}[2]{\chi_{#1}^{2}\left(#2\right)} \newcommand{\chiSquaredSamp}[1]{\chi_{#1}^{2}} \newcommand{\conditionalCovariance}{\boldsymbol{ \Sigma}} \newcommand{\coregionalizationMatrix}{\mathbf{B}} \newcommand{\coregionalizationScalar}{b} \newcommand{\coregionalizationVector}{\mathbf{ \coregionalizationScalar}} \newcommand{\covDist}[2]{\text{cov}_{#2}\left(#1\right)} \newcommand{\covSamp}[1]{\text{cov}\left(#1\right)} \newcommand{\covarianceScalar}{c} \newcommand{\covarianceVector}{\mathbf{ \covarianceScalar}} \newcommand{\covarianceMatrix}{\mathbf{C}} \newcommand{\covarianceMatrixTwo}{\boldsymbol{ \Sigma}} \newcommand{\croupierScalar}{s} \newcommand{\croupierVector}{\mathbf{ \croupierScalar}} \newcommand{\croupierMatrix}{\mathbf{ \MakeUppercase{\croupierScalar}}} \newcommand{\dataDim}{p} \newcommand{\dataIndex}{i} \newcommand{\dataIndexTwo}{j} \newcommand{\dataMatrix}{\mathbf{Y}} \newcommand{\dataScalar}{y} \newcommand{\dataSet}{\mathcal{D}} \newcommand{\dataStd}{\sigma} \newcommand{\dataVector}{\mathbf{ \dataScalar}} \newcommand{\decayRate}{d} \newcommand{\degreeMatrix}{\mathbf{ \MakeUppercase{\degreeScalar}}} \newcommand{\degreeScalar}{d} \newcommand{\degreeVector}{\mathbf{ \degreeScalar}} % Already defined by latex %\newcommand{\det}[1]{\left|#1\right|} \newcommand{\diag}[1]{\text{diag}\left(#1\right)} \newcommand{\diagonalMatrix}{\mathbf{D}} \newcommand{\diff}[2]{\frac{\text{d}#1}{\text{d}#2}} \newcommand{\diffTwo}[2]{\frac{\text{d}^2#1}{\text{d}#2^2}} \newcommand{\displacement}{x} \newcommand{\displacementVector}{\textbf{\displacement}} \newcommand{\distanceMatrix}{\mathbf{ \MakeUppercase{\distanceScalar}}} \newcommand{\distanceScalar}{d} \newcommand{\distanceVector}{\mathbf{ \distanceScalar}} \newcommand{\eigenvaltwo}{\ell} \newcommand{\eigenvaltwoMatrix}{\mathbf{L}} \newcommand{\eigenvaltwoVector}{\mathbf{l}} \newcommand{\eigenvalue}{\lambda} \newcommand{\eigenvalueMatrix}{\boldsymbol{ \Lambda}} \newcommand{\eigenvalueVector}{\boldsymbol{ \lambda}} \newcommand{\eigenvector}{\mathbf{ \eigenvectorScalar}} \newcommand{\eigenvectorMatrix}{\mathbf{U}} \newcommand{\eigenvectorScalar}{u} \newcommand{\eigenvectwo}{\mathbf{v}} \newcommand{\eigenvectwoMatrix}{\mathbf{V}} \newcommand{\eigenvectwoScalar}{v} \newcommand{\entropy}[1]{\mathcal{H}\left(#1\right)} \newcommand{\errorFunction}{E} \newcommand{\expDist}[2]{\left<#1\right>_{#2}} \newcommand{\expSamp}[1]{\left<#1\right>} \newcommand{\expectation}[1]{\left\langle #1 \right\rangle } \newcommand{\expectationDist}[2]{\left\langle #1 \right\rangle _{#2}} \newcommand{\expectedDistanceMatrix}{\mathcal{D}} \newcommand{\eye}{\mathbf{I}} \newcommand{\fantasyDim}{r} \newcommand{\fantasyMatrix}{\mathbf{ \MakeUppercase{\fantasyScalar}}} \newcommand{\fantasyScalar}{z} \newcommand{\fantasyVector}{\mathbf{ \fantasyScalar}} \newcommand{\featureStd}{\varsigma} \newcommand{\gammaCdf}[3]{\mathcal{GAMMA CDF}\left(#1|#2,#3\right)} \newcommand{\gammaDist}[3]{\mathcal{G}\left(#1|#2,#3\right)} \newcommand{\gammaSamp}[2]{\mathcal{G}\left(#1,#2\right)} \newcommand{\gaussianDist}[3]{\mathcal{N}\left(#1|#2,#3\right)} \newcommand{\gaussianSamp}[2]{\mathcal{N}\left(#1,#2\right)} \newcommand{\given}{|} \newcommand{\half}{\frac{1}{2}} \newcommand{\heaviside}{H} \newcommand{\hiddenMatrix}{\mathbf{ \MakeUppercase{\hiddenScalar}}} \newcommand{\hiddenScalar}{h} \newcommand{\hiddenVector}{\mathbf{ \hiddenScalar}} \newcommand{\identityMatrix}{\eye} \newcommand{\inducingInputScalar}{z} \newcommand{\inducingInputVector}{\mathbf{ \inducingInputScalar}} \newcommand{\inducingInputMatrix}{\mathbf{Z}} \newcommand{\inducingScalar}{u} \newcommand{\inducingVector}{\mathbf{ \inducingScalar}} \newcommand{\inducingMatrix}{\mathbf{U}} \newcommand{\inlineDiff}[2]{\text{d}#1/\text{d}#2} \newcommand{\inputDim}{q} \newcommand{\inputMatrix}{\mathbf{X}} \newcommand{\inputScalar}{x} \newcommand{\inputSpace}{\mathcal{X}} \newcommand{\inputVals}{\inputVector} \newcommand{\inputVector}{\mathbf{ \inputScalar}} \newcommand{\iterNum}{k} \newcommand{\kernel}{\kernelScalar} \newcommand{\kernelMatrix}{\mathbf{K}} \newcommand{\kernelScalar}{k} \newcommand{\kernelVector}{\mathbf{ \kernelScalar}} \newcommand{\kff}{\kernelScalar_{\mappingFunction \mappingFunction}} \newcommand{\kfu}{\kernelVector_{\mappingFunction \inducingScalar}} \newcommand{\kuf}{\kernelVector_{\inducingScalar \mappingFunction}} \newcommand{\kuu}{\kernelVector_{\inducingScalar \inducingScalar}} \newcommand{\lagrangeMultiplier}{\lambda} \newcommand{\lagrangeMultiplierMatrix}{\boldsymbol{ \Lambda}} \newcommand{\lagrangian}{L} \newcommand{\laplacianFactor}{\mathbf{ \MakeUppercase{\laplacianFactorScalar}}} \newcommand{\laplacianFactorScalar}{m} \newcommand{\laplacianFactorVector}{\mathbf{ \laplacianFactorScalar}} \newcommand{\laplacianMatrix}{\mathbf{L}} \newcommand{\laplacianScalar}{\ell} \newcommand{\laplacianVector}{\mathbf{ \ell}} \newcommand{\latentDim}{q} \newcommand{\latentDistanceMatrix}{\boldsymbol{ \Delta}} \newcommand{\latentDistanceScalar}{\delta} \newcommand{\latentDistanceVector}{\boldsymbol{ \delta}} \newcommand{\latentForce}{f} \newcommand{\latentFunction}{u} \newcommand{\latentFunctionVector}{\mathbf{ \latentFunction}} \newcommand{\latentFunctionMatrix}{\mathbf{ \MakeUppercase{\latentFunction}}} \newcommand{\latentIndex}{j} \newcommand{\latentScalar}{z} \newcommand{\latentVector}{\mathbf{ \latentScalar}} \newcommand{\latentMatrix}{\mathbf{Z}} \newcommand{\learnRate}{\eta} \newcommand{\lengthScale}{\ell} \newcommand{\rbfWidth}{\ell} \newcommand{\likelihoodBound}{\mathcal{L}} \newcommand{\likelihoodFunction}{L} \newcommand{\locationScalar}{\mu} \newcommand{\locationVector}{\boldsymbol{ \locationScalar}} \newcommand{\locationMatrix}{\mathbf{M}} \newcommand{\variance}[1]{\text{var}\left( #1 \right)} \newcommand{\mappingFunction}{f} \newcommand{\mappingFunctionMatrix}{\mathbf{F}} \newcommand{\mappingFunctionTwo}{g} \newcommand{\mappingFunctionTwoMatrix}{\mathbf{G}} \newcommand{\mappingFunctionTwoVector}{\mathbf{ \mappingFunctionTwo}} \newcommand{\mappingFunctionVector}{\mathbf{ \mappingFunction}} \newcommand{\scaleScalar}{s} \newcommand{\mappingScalar}{w} \newcommand{\mappingVector}{\mathbf{ \mappingScalar}} \newcommand{\mappingMatrix}{\mathbf{W}} \newcommand{\mappingScalarTwo}{v} \newcommand{\mappingVectorTwo}{\mathbf{ \mappingScalarTwo}} \newcommand{\mappingMatrixTwo}{\mathbf{V}} \newcommand{\maxIters}{K} \newcommand{\meanMatrix}{\mathbf{M}} \newcommand{\meanScalar}{\mu} \newcommand{\meanTwoMatrix}{\mathbf{M}} \newcommand{\meanTwoScalar}{m} \newcommand{\meanTwoVector}{\mathbf{ \meanTwoScalar}} \newcommand{\meanVector}{\boldsymbol{ \meanScalar}} \newcommand{\mrnaConcentration}{m} \newcommand{\naturalFrequency}{\omega} \newcommand{\neighborhood}[1]{\mathcal{N}\left( #1 \right)} \newcommand{\neilurl}{http://inverseprobability.com/} \newcommand{\noiseMatrix}{\boldsymbol{ E}} \newcommand{\noiseScalar}{\epsilon} \newcommand{\noiseVector}{\boldsymbol{ \epsilon}} \newcommand{\norm}[1]{\left\Vert #1 \right\Vert} \newcommand{\normalizedLaplacianMatrix}{\hat{\mathbf{L}}} \newcommand{\normalizedLaplacianScalar}{\hat{\ell}} \newcommand{\normalizedLaplacianVector}{\hat{\mathbf{ \ell}}} \newcommand{\numActive}{m} \newcommand{\numBasisFunc}{m} \newcommand{\numComponents}{m} \newcommand{\numComps}{K} \newcommand{\numData}{n} \newcommand{\numFeatures}{K} \newcommand{\numHidden}{h} \newcommand{\numInducing}{m} \newcommand{\numLayers}{\ell} \newcommand{\numNeighbors}{K} \newcommand{\numSequences}{s} \newcommand{\numSuccess}{s} \newcommand{\numTasks}{m} \newcommand{\numTime}{T} \newcommand{\numTrials}{S} \newcommand{\outputIndex}{j} \newcommand{\paramVector}{\boldsymbol{ \theta}} \newcommand{\parameterMatrix}{\boldsymbol{ \Theta}} \newcommand{\parameterScalar}{\theta} \newcommand{\parameterVector}{\boldsymbol{ \parameterScalar}} \newcommand{\partDiff}[2]{\frac{\partial#1}{\partial#2}} \newcommand{\precisionScalar}{j} \newcommand{\precisionVector}{\mathbf{ \precisionScalar}} \newcommand{\precisionMatrix}{\mathbf{J}} \newcommand{\pseudotargetScalar}{\widetilde{y}} \newcommand{\pseudotargetVector}{\mathbf{ \pseudotargetScalar}} \newcommand{\pseudotargetMatrix}{\mathbf{ \widetilde{Y}}} \newcommand{\rank}[1]{\text{rank}\left(#1\right)} \newcommand{\rayleighDist}[2]{\mathcal{R}\left(#1|#2\right)} \newcommand{\rayleighSamp}[1]{\mathcal{R}\left(#1\right)} \newcommand{\responsibility}{r} \newcommand{\rotationScalar}{r} \newcommand{\rotationVector}{\mathbf{ \rotationScalar}} \newcommand{\rotationMatrix}{\mathbf{R}} \newcommand{\sampleCovScalar}{s} \newcommand{\sampleCovVector}{\mathbf{ \sampleCovScalar}} \newcommand{\sampleCovMatrix}{\mathbf{s}} \newcommand{\scalarProduct}[2]{\left\langle{#1},{#2}\right\rangle} \newcommand{\sign}[1]{\text{sign}\left(#1\right)} \newcommand{\sigmoid}[1]{\sigma\left(#1\right)} \newcommand{\singularvalue}{\ell} \newcommand{\singularvalueMatrix}{\mathbf{L}} \newcommand{\singularvalueVector}{\mathbf{l}} \newcommand{\sorth}{\mathbf{u}} \newcommand{\spar}{\lambda} \newcommand{\trace}[1]{\text{tr}\left(#1\right)} \newcommand{\BasalRate}{B} \newcommand{\DampingCoefficient}{C} \newcommand{\DecayRate}{D} \newcommand{\Displacement}{X} \newcommand{\LatentForce}{F} \newcommand{\Mass}{M} \newcommand{\Sensitivity}{S} \newcommand{\basalRate}{b} \newcommand{\dampingCoefficient}{c} \newcommand{\mass}{m} \newcommand{\sensitivity}{s} \newcommand{\springScalar}{\kappa} \newcommand{\springVector}{\boldsymbol{ \kappa}} \newcommand{\springMatrix}{\boldsymbol{ \mathcal{K}}} \newcommand{\tfConcentration}{p} \newcommand{\tfDecayRate}{\delta} \newcommand{\tfMrnaConcentration}{f} \newcommand{\tfVector}{\mathbf{ \tfConcentration}} \newcommand{\velocity}{v} \newcommand{\sufficientStatsScalar}{g} \newcommand{\sufficientStatsVector}{\mathbf{ \sufficientStatsScalar}} \newcommand{\sufficientStatsMatrix}{\mathbf{G}} \newcommand{\switchScalar}{s} \newcommand{\switchVector}{\mathbf{ \switchScalar}} \newcommand{\switchMatrix}{\mathbf{S}} \newcommand{\tr}[1]{\text{tr}\left(#1\right)} \newcommand{\loneNorm}[1]{\left\Vert #1 \right\Vert_1} \newcommand{\ltwoNorm}[1]{\left\Vert #1 \right\Vert_2} \newcommand{\onenorm}[1]{\left\vert#1\right\vert_1} \newcommand{\twonorm}[1]{\left\Vert #1 \right\Vert} \newcommand{\vScalar}{v} \newcommand{\vVector}{\mathbf{v}} \newcommand{\vMatrix}{\mathbf{V}} \newcommand{\varianceDist}[2]{\text{var}_{#2}\left( #1 \right)} % Already defined by latex %\newcommand{\vec}{#1:} \newcommand{\vecb}[1]{\left(#1\right):} \newcommand{\weightScalar}{w} \newcommand{\weightVector}{\mathbf{ \weightScalar}} \newcommand{\weightMatrix}{\mathbf{W}} \newcommand{\weightedAdjacencyMatrix}{\mathbf{A}} \newcommand{\weightedAdjacencyScalar}{a} \newcommand{\weightedAdjacencyVector}{\mathbf{ \weightedAdjacencyScalar}} \newcommand{\onesVector}{\mathbf{1}} \newcommand{\zerosVector}{\mathbf{0}} $$
at Center for Statistics and Machine Learning, Princeton on Dec 10, 2018 [jupyter][google colab][reveal]
Neil D. Lawrence, Amazon Cambridge and University of Sheffield

Abstract

Machine learning is a data driven endeavour, but real world systems are physical and mechanistic. In this talk we will review approaches to integrating machine learning with real world systems. Our focus will be on emulation (otherwise known as surrogate modeling).

$$\newcommand{\tk}[1]{} %\newcommand{\tk}[1]{\textbf{TK}: #1} \newcommand{\Amatrix}{\mathbf{A}} \newcommand{\KL}[2]{\text{KL}\left( #1\,\|\,#2 \right)} \newcommand{\Kaast}{\kernelMatrix_{\mathbf{ \ast}\mathbf{ \ast}}} \newcommand{\Kastu}{\kernelMatrix_{\mathbf{ \ast} \inducingVector}} \newcommand{\Kff}{\kernelMatrix_{\mappingFunctionVector \mappingFunctionVector}} \newcommand{\Kfu}{\kernelMatrix_{\mappingFunctionVector \inducingVector}} \newcommand{\Kuast}{\kernelMatrix_{\inducingVector \bf\ast}} \newcommand{\Kuf}{\kernelMatrix_{\inducingVector \mappingFunctionVector}} \newcommand{\Kuu}{\kernelMatrix_{\inducingVector \inducingVector}} \newcommand{\Kuui}{\Kuu^{-1}} \newcommand{\Qaast}{\mathbf{Q}_{\bf \ast \ast}} \newcommand{\Qastf}{\mathbf{Q}_{\ast \mappingFunction}} \newcommand{\Qfast}{\mathbf{Q}_{\mappingFunctionVector \bf \ast}} \newcommand{\Qff}{\mathbf{Q}_{\mappingFunctionVector \mappingFunctionVector}} \newcommand{\aMatrix}{\mathbf{A}} \newcommand{\aScalar}{a} \newcommand{\aVector}{\mathbf{a}} \newcommand{\acceleration}{a} \newcommand{\bMatrix}{\mathbf{B}} \newcommand{\bScalar}{b} \newcommand{\bVector}{\mathbf{b}} \newcommand{\basisFunc}{\phi} \newcommand{\basisFuncVector}{\boldsymbol{ \basisFunc}} \newcommand{\basisFunction}{\phi} \newcommand{\basisLocation}{\mu} \newcommand{\basisMatrix}{\boldsymbol{ \Phi}} \newcommand{\basisScalar}{\basisFunction} \newcommand{\basisVector}{\boldsymbol{ \basisFunction}} \newcommand{\activationFunction}{\phi} \newcommand{\activationMatrix}{\boldsymbol{ \Phi}} \newcommand{\activationScalar}{\basisFunction} \newcommand{\activationVector}{\boldsymbol{ \basisFunction}} \newcommand{\bigO}{\mathcal{O}} \newcommand{\binomProb}{\pi} \newcommand{\cMatrix}{\mathbf{C}} \newcommand{\cbasisMatrix}{\hat{\boldsymbol{ \Phi}}} \newcommand{\cdataMatrix}{\hat{\dataMatrix}} \newcommand{\cdataScalar}{\hat{\dataScalar}} \newcommand{\cdataVector}{\hat{\dataVector}} \newcommand{\centeredKernelMatrix}{\mathbf{ \MakeUppercase{\centeredKernelScalar}}} \newcommand{\centeredKernelScalar}{b} \newcommand{\centeredKernelVector}{\centeredKernelScalar} \newcommand{\centeringMatrix}{\mathbf{H}} \newcommand{\chiSquaredDist}[2]{\chi_{#1}^{2}\left(#2\right)} \newcommand{\chiSquaredSamp}[1]{\chi_{#1}^{2}} \newcommand{\conditionalCovariance}{\boldsymbol{ \Sigma}} \newcommand{\coregionalizationMatrix}{\mathbf{B}} \newcommand{\coregionalizationScalar}{b} \newcommand{\coregionalizationVector}{\mathbf{ \coregionalizationScalar}} \newcommand{\covDist}[2]{\text{cov}_{#2}\left(#1\right)} \newcommand{\covSamp}[1]{\text{cov}\left(#1\right)} \newcommand{\covarianceScalar}{c} \newcommand{\covarianceVector}{\mathbf{ \covarianceScalar}} \newcommand{\covarianceMatrix}{\mathbf{C}} \newcommand{\covarianceMatrixTwo}{\boldsymbol{ \Sigma}} \newcommand{\croupierScalar}{s} \newcommand{\croupierVector}{\mathbf{ \croupierScalar}} \newcommand{\croupierMatrix}{\mathbf{ \MakeUppercase{\croupierScalar}}} \newcommand{\dataDim}{p} \newcommand{\dataIndex}{i} \newcommand{\dataIndexTwo}{j} \newcommand{\dataMatrix}{\mathbf{Y}} \newcommand{\dataScalar}{y} \newcommand{\dataSet}{\mathcal{D}} \newcommand{\dataStd}{\sigma} \newcommand{\dataVector}{\mathbf{ \dataScalar}} \newcommand{\decayRate}{d} \newcommand{\degreeMatrix}{\mathbf{ \MakeUppercase{\degreeScalar}}} \newcommand{\degreeScalar}{d} \newcommand{\degreeVector}{\mathbf{ \degreeScalar}} % Already defined by latex %\newcommand{\det}[1]{\left|#1\right|} \newcommand{\diag}[1]{\text{diag}\left(#1\right)} \newcommand{\diagonalMatrix}{\mathbf{D}} \newcommand{\diff}[2]{\frac{\text{d}#1}{\text{d}#2}} \newcommand{\diffTwo}[2]{\frac{\text{d}^2#1}{\text{d}#2^2}} \newcommand{\displacement}{x} \newcommand{\displacementVector}{\textbf{\displacement}} \newcommand{\distanceMatrix}{\mathbf{ \MakeUppercase{\distanceScalar}}} \newcommand{\distanceScalar}{d} \newcommand{\distanceVector}{\mathbf{ \distanceScalar}} \newcommand{\eigenvaltwo}{\ell} \newcommand{\eigenvaltwoMatrix}{\mathbf{L}} \newcommand{\eigenvaltwoVector}{\mathbf{l}} \newcommand{\eigenvalue}{\lambda} \newcommand{\eigenvalueMatrix}{\boldsymbol{ \Lambda}} \newcommand{\eigenvalueVector}{\boldsymbol{ \lambda}} \newcommand{\eigenvector}{\mathbf{ \eigenvectorScalar}} \newcommand{\eigenvectorMatrix}{\mathbf{U}} \newcommand{\eigenvectorScalar}{u} \newcommand{\eigenvectwo}{\mathbf{v}} \newcommand{\eigenvectwoMatrix}{\mathbf{V}} \newcommand{\eigenvectwoScalar}{v} \newcommand{\entropy}[1]{\mathcal{H}\left(#1\right)} \newcommand{\errorFunction}{E} \newcommand{\expDist}[2]{\left<#1\right>_{#2}} \newcommand{\expSamp}[1]{\left<#1\right>} \newcommand{\expectation}[1]{\left\langle #1 \right\rangle } \newcommand{\expectationDist}[2]{\left\langle #1 \right\rangle _{#2}} \newcommand{\expectedDistanceMatrix}{\mathcal{D}} \newcommand{\eye}{\mathbf{I}} \newcommand{\fantasyDim}{r} \newcommand{\fantasyMatrix}{\mathbf{ \MakeUppercase{\fantasyScalar}}} \newcommand{\fantasyScalar}{z} \newcommand{\fantasyVector}{\mathbf{ \fantasyScalar}} \newcommand{\featureStd}{\varsigma} \newcommand{\gammaCdf}[3]{\mathcal{GAMMA CDF}\left(#1|#2,#3\right)} \newcommand{\gammaDist}[3]{\mathcal{G}\left(#1|#2,#3\right)} \newcommand{\gammaSamp}[2]{\mathcal{G}\left(#1,#2\right)} \newcommand{\gaussianDist}[3]{\mathcal{N}\left(#1|#2,#3\right)} \newcommand{\gaussianSamp}[2]{\mathcal{N}\left(#1,#2\right)} \newcommand{\given}{|} \newcommand{\half}{\frac{1}{2}} \newcommand{\heaviside}{H} \newcommand{\hiddenMatrix}{\mathbf{ \MakeUppercase{\hiddenScalar}}} \newcommand{\hiddenScalar}{h} \newcommand{\hiddenVector}{\mathbf{ \hiddenScalar}} \newcommand{\identityMatrix}{\eye} \newcommand{\inducingInputScalar}{z} \newcommand{\inducingInputVector}{\mathbf{ \inducingInputScalar}} \newcommand{\inducingInputMatrix}{\mathbf{Z}} \newcommand{\inducingScalar}{u} \newcommand{\inducingVector}{\mathbf{ \inducingScalar}} \newcommand{\inducingMatrix}{\mathbf{U}} \newcommand{\inlineDiff}[2]{\text{d}#1/\text{d}#2} \newcommand{\inputDim}{q} \newcommand{\inputMatrix}{\mathbf{X}} \newcommand{\inputScalar}{x} \newcommand{\inputSpace}{\mathcal{X}} \newcommand{\inputVals}{\inputVector} \newcommand{\inputVector}{\mathbf{ \inputScalar}} \newcommand{\iterNum}{k} \newcommand{\kernel}{\kernelScalar} \newcommand{\kernelMatrix}{\mathbf{K}} \newcommand{\kernelScalar}{k} \newcommand{\kernelVector}{\mathbf{ \kernelScalar}} \newcommand{\kff}{\kernelScalar_{\mappingFunction \mappingFunction}} \newcommand{\kfu}{\kernelVector_{\mappingFunction \inducingScalar}} \newcommand{\kuf}{\kernelVector_{\inducingScalar \mappingFunction}} \newcommand{\kuu}{\kernelVector_{\inducingScalar \inducingScalar}} \newcommand{\lagrangeMultiplier}{\lambda} \newcommand{\lagrangeMultiplierMatrix}{\boldsymbol{ \Lambda}} \newcommand{\lagrangian}{L} \newcommand{\laplacianFactor}{\mathbf{ \MakeUppercase{\laplacianFactorScalar}}} \newcommand{\laplacianFactorScalar}{m} \newcommand{\laplacianFactorVector}{\mathbf{ \laplacianFactorScalar}} \newcommand{\laplacianMatrix}{\mathbf{L}} \newcommand{\laplacianScalar}{\ell} \newcommand{\laplacianVector}{\mathbf{ \ell}} \newcommand{\latentDim}{q} \newcommand{\latentDistanceMatrix}{\boldsymbol{ \Delta}} \newcommand{\latentDistanceScalar}{\delta} \newcommand{\latentDistanceVector}{\boldsymbol{ \delta}} \newcommand{\latentForce}{f} \newcommand{\latentFunction}{u} \newcommand{\latentFunctionVector}{\mathbf{ \latentFunction}} \newcommand{\latentFunctionMatrix}{\mathbf{ \MakeUppercase{\latentFunction}}} \newcommand{\latentIndex}{j} \newcommand{\latentScalar}{z} \newcommand{\latentVector}{\mathbf{ \latentScalar}} \newcommand{\latentMatrix}{\mathbf{Z}} \newcommand{\learnRate}{\eta} \newcommand{\lengthScale}{\ell} \newcommand{\rbfWidth}{\ell} \newcommand{\likelihoodBound}{\mathcal{L}} \newcommand{\likelihoodFunction}{L} \newcommand{\locationScalar}{\mu} \newcommand{\locationVector}{\boldsymbol{ \locationScalar}} \newcommand{\locationMatrix}{\mathbf{M}} \newcommand{\variance}[1]{\text{var}\left( #1 \right)} \newcommand{\mappingFunction}{f} \newcommand{\mappingFunctionMatrix}{\mathbf{F}} \newcommand{\mappingFunctionTwo}{g} \newcommand{\mappingFunctionTwoMatrix}{\mathbf{G}} \newcommand{\mappingFunctionTwoVector}{\mathbf{ \mappingFunctionTwo}} \newcommand{\mappingFunctionVector}{\mathbf{ \mappingFunction}} \newcommand{\scaleScalar}{s} \newcommand{\mappingScalar}{w} \newcommand{\mappingVector}{\mathbf{ \mappingScalar}} \newcommand{\mappingMatrix}{\mathbf{W}} \newcommand{\mappingScalarTwo}{v} \newcommand{\mappingVectorTwo}{\mathbf{ \mappingScalarTwo}} \newcommand{\mappingMatrixTwo}{\mathbf{V}} \newcommand{\maxIters}{K} \newcommand{\meanMatrix}{\mathbf{M}} \newcommand{\meanScalar}{\mu} \newcommand{\meanTwoMatrix}{\mathbf{M}} \newcommand{\meanTwoScalar}{m} \newcommand{\meanTwoVector}{\mathbf{ \meanTwoScalar}} \newcommand{\meanVector}{\boldsymbol{ \meanScalar}} \newcommand{\mrnaConcentration}{m} \newcommand{\naturalFrequency}{\omega} \newcommand{\neighborhood}[1]{\mathcal{N}\left( #1 \right)} \newcommand{\neilurl}{http://inverseprobability.com/} \newcommand{\noiseMatrix}{\boldsymbol{ E}} \newcommand{\noiseScalar}{\epsilon} \newcommand{\noiseVector}{\boldsymbol{ \epsilon}} \newcommand{\norm}[1]{\left\Vert #1 \right\Vert} \newcommand{\normalizedLaplacianMatrix}{\hat{\mathbf{L}}} \newcommand{\normalizedLaplacianScalar}{\hat{\ell}} \newcommand{\normalizedLaplacianVector}{\hat{\mathbf{ \ell}}} \newcommand{\numActive}{m} \newcommand{\numBasisFunc}{m} \newcommand{\numComponents}{m} \newcommand{\numComps}{K} \newcommand{\numData}{n} \newcommand{\numFeatures}{K} \newcommand{\numHidden}{h} \newcommand{\numInducing}{m} \newcommand{\numLayers}{\ell} \newcommand{\numNeighbors}{K} \newcommand{\numSequences}{s} \newcommand{\numSuccess}{s} \newcommand{\numTasks}{m} \newcommand{\numTime}{T} \newcommand{\numTrials}{S} \newcommand{\outputIndex}{j} \newcommand{\paramVector}{\boldsymbol{ \theta}} \newcommand{\parameterMatrix}{\boldsymbol{ \Theta}} \newcommand{\parameterScalar}{\theta} \newcommand{\parameterVector}{\boldsymbol{ \parameterScalar}} \newcommand{\partDiff}[2]{\frac{\partial#1}{\partial#2}} \newcommand{\precisionScalar}{j} \newcommand{\precisionVector}{\mathbf{ \precisionScalar}} \newcommand{\precisionMatrix}{\mathbf{J}} \newcommand{\pseudotargetScalar}{\widetilde{y}} \newcommand{\pseudotargetVector}{\mathbf{ \pseudotargetScalar}} \newcommand{\pseudotargetMatrix}{\mathbf{ \widetilde{Y}}} \newcommand{\rank}[1]{\text{rank}\left(#1\right)} \newcommand{\rayleighDist}[2]{\mathcal{R}\left(#1|#2\right)} \newcommand{\rayleighSamp}[1]{\mathcal{R}\left(#1\right)} \newcommand{\responsibility}{r} \newcommand{\rotationScalar}{r} \newcommand{\rotationVector}{\mathbf{ \rotationScalar}} \newcommand{\rotationMatrix}{\mathbf{R}} \newcommand{\sampleCovScalar}{s} \newcommand{\sampleCovVector}{\mathbf{ \sampleCovScalar}} \newcommand{\sampleCovMatrix}{\mathbf{s}} \newcommand{\scalarProduct}[2]{\left\langle{#1},{#2}\right\rangle} \newcommand{\sign}[1]{\text{sign}\left(#1\right)} \newcommand{\sigmoid}[1]{\sigma\left(#1\right)} \newcommand{\singularvalue}{\ell} \newcommand{\singularvalueMatrix}{\mathbf{L}} \newcommand{\singularvalueVector}{\mathbf{l}} \newcommand{\sorth}{\mathbf{u}} \newcommand{\spar}{\lambda} \newcommand{\trace}[1]{\text{tr}\left(#1\right)} \newcommand{\BasalRate}{B} \newcommand{\DampingCoefficient}{C} \newcommand{\DecayRate}{D} \newcommand{\Displacement}{X} \newcommand{\LatentForce}{F} \newcommand{\Mass}{M} \newcommand{\Sensitivity}{S} \newcommand{\basalRate}{b} \newcommand{\dampingCoefficient}{c} \newcommand{\mass}{m} \newcommand{\sensitivity}{s} \newcommand{\springScalar}{\kappa} \newcommand{\springVector}{\boldsymbol{ \kappa}} \newcommand{\springMatrix}{\boldsymbol{ \mathcal{K}}} \newcommand{\tfConcentration}{p} \newcommand{\tfDecayRate}{\delta} \newcommand{\tfMrnaConcentration}{f} \newcommand{\tfVector}{\mathbf{ \tfConcentration}} \newcommand{\velocity}{v} \newcommand{\sufficientStatsScalar}{g} \newcommand{\sufficientStatsVector}{\mathbf{ \sufficientStatsScalar}} \newcommand{\sufficientStatsMatrix}{\mathbf{G}} \newcommand{\switchScalar}{s} \newcommand{\switchVector}{\mathbf{ \switchScalar}} \newcommand{\switchMatrix}{\mathbf{S}} \newcommand{\tr}[1]{\text{tr}\left(#1\right)} \newcommand{\loneNorm}[1]{\left\Vert #1 \right\Vert_1} \newcommand{\ltwoNorm}[1]{\left\Vert #1 \right\Vert_2} \newcommand{\onenorm}[1]{\left\vert#1\right\vert_1} \newcommand{\twonorm}[1]{\left\Vert #1 \right\Vert} \newcommand{\vScalar}{v} \newcommand{\vVector}{\mathbf{v}} \newcommand{\vMatrix}{\mathbf{V}} \newcommand{\varianceDist}[2]{\text{var}_{#2}\left( #1 \right)} % Already defined by latex %\newcommand{\vec}{#1:} \newcommand{\vecb}[1]{\left(#1\right):} \newcommand{\weightScalar}{w} \newcommand{\weightVector}{\mathbf{ \weightScalar}} \newcommand{\weightMatrix}{\mathbf{W}} \newcommand{\weightedAdjacencyMatrix}{\mathbf{A}} \newcommand{\weightedAdjacencyScalar}{a} \newcommand{\weightedAdjacencyVector}{\mathbf{ \weightedAdjacencyScalar}} \newcommand{\onesVector}{\mathbf{1}} \newcommand{\zerosVector}{\mathbf{0}} $$

The Centrifugal Governor [edit]

Figure: Centrifugal governor as held by “Science” on Holborn Viaduct

Boulton and Watt’s Steam Engine [edit]

Figure: Watt’s Steam Engine which made Steam Power Efficient and Practical.

James Watt’s steam engine contained an early machine learning device. In the same way that modern systems are component based, his engine was composed of components. One of which is a speed regulator sometimes known as Watt’s governor. The two balls in the center of the image, when spun fast, rise, and through a linkage mechanism.

The centrifugal governor was made famous by Boulton and Watt when it was deployed in the steam engine. Studying stability in the governor is the main subject of James Clerk Maxwell’s paper on the theoretical analysis of governors (Maxwell 1867). This paper is a founding paper of control theory. In an acknowledgment of its influence, Wiener used the name cybernetics to describe the field of control and communication in animals and the machine (Wiener 1948). Cybernetics is the Greek word for governor, which comes from the latin for helmsman.

A governor is one of the simplest artificial intelligence systems. It senses the speed of an engine, and acts to change the position of the valve on the engine to slow it down.

Although it’s a mechanical system a governor can be seen as automating a role that a human would have traditionally played. It is an early example of artificial intelligence.

The centrifugal governor has several parameters, the weight of the balls used, the length of the linkages and the limits on the balls movement.

Two principle differences exist between the centrifugal governor and artificial intelligence systems of today.

  1. The centrifugal governor is a physical system and it is an integral part of a wider physical system that it regulates (the engine).
  2. The parameters of the governor were set by hand, our modern artificial intelligence systems have their parameters set by data.

Figure: The centrifugal governor, an early example of a decision making system. The parameters of the governor include the lengths of the linkages (which effect how far the throttle opens in response to movement in the balls), the weight of the balls (which effects inertia) and the limits of to which the balls can rise.

This has the basic components of sense and act that we expect in an intelligent system, and this system saved the need for a human operator to manually adjust the system in the case of overspeed. Overspeed has the potential to destroy an engine, so the governor operates as a safety device.

The first wave of automation did bring about sabotoage as a worker’s response. But if machinery was sabotaged, for example, if the linkage between sensor (the spinning balls) and action (the valve closure) was broken, this would be obvious to the engine operator at start up time. The machine could be repaired before operation.

What is Machine Learning? [edit]

Machine learning allows us to extract knowledge from data to form a prediction.


$$\text{data} + \text{model} \xrightarrow{\text{compute}} \text{prediction}$$

A machine learning prediction is made by combining a model with data to form the prediction. The manner in which this is done gives us the machine learning algorithm.

Machine learning models are mathematical models which make weak assumptions about data, e.g. smoothness assumptions. By combining these assumptions with the data, we observe we can interpolate between data points or, occasionally, extrapolate into the future.

Machine learning is a technology which strongly overlaps with the methodology of statistics. From a historical/philosophical view point, machine learning differs from statistics in that the focus in the machine learning community has been primarily on accuracy of prediction, whereas the focus in statistics is typically on the interpretability of a model and/or validating a hypothesis through data collection.

The rapid increase in the availability of compute and data has led to the increased prominence of machine learning. This prominence is surfacing in two different but overlapping domains: data science and artificial intelligence.

From Model to Decision [edit]

The real challenge, however, is end-to-end decision making. Taking information from the environment and using it to drive decision making to achieve goals.

Figure: Amazon’s proposed drone delivery system.

Artificial Intelligence and Data Science [edit]

Artificial intelligence has the objective of endowing computers with human-like intelligent capabilities. For example, understanding an image (computer vision) or the contents of some speech (speech recognition), the meaning of a sentence (natural language processing) or the translation of a sentence (machine translation).

Supervised Learning for AI

The machine learning approach to artificial intelligence is to collect and annotate a large data set from humans. The problem is characterized by input data (e.g. a particular image) and a label (e.g. is there a car in the image yes/no). The machine learning algorithm fits a mathematical function (I call this the prediction function) to map from the input image to the label. The parameters of the prediction function are set by minimizing an error between the function’s predictions and the true data. This mathematical function that encapsulates this error is known as the objective function.

This approach to machine learning is known as supervised learning. Various approaches to supervised learning use different prediction functions, objective functions or different optimization algorithms to fit them.

For example, deep learning makes use of neural networks to form the predictions. A neural network is a particular type of mathematical function that allows the algorithm designer to introduce invariances into the function.

An invariance is an important way of including prior understanding in a machine learning model. For example, in an image, a car is still a car regardless of whether it’s in the upper left or lower right corner of the image. This is known as translation invariance. A neural network encodes translation invariance in convolutional layers. Convolutional neural networks are widely used in image recognition tasks.

An alternative structure is known as a recurrent neural network (RNN). RNNs neural networks encode temporal structure. They use auto regressive connections in their hidden layers, they can be seen as time series models which have non-linear auto-regressive basis functions. They are widely used in speech recognition and machine translation.

Machine learning has been deployed in Speech Recognition (e.g. Alexa, deep neural networks, convolutional neural networks for speech recognition), in computer vision (e.g. Amazon Go, convolutional neural networks for person recognition and pose detection).

The field of data science is related to AI, but philosophically different. It arises because we are increasingly creating large amounts of data through happenstance rather than active collection. In the modern era data is laid down by almost all our activities. The objective of data science is to extract insights from this data.

Classically, in the field of statistics, data analysis proceeds by assuming that the question (or scientific hypothesis) comes before the data is created. E.g., if I want to determine the effectiveness of a particular drug, I perform a design for my data collection. I use foundational approaches such as randomization to account for confounders. This made a lot of sense in an era where data had to be actively collected. The reduction in cost of data collection and storage now means that many data sets are available which weren’t collected with a particular question in mind. This is a challenge because bias in the way data was acquired can corrupt the insights we derive. We can perform randomized control trials (or A/B tests) to verify our conclusions, but the opportunity is to use data science techniques to better guide our question selection or even answer a question without the expense of a full randomized control trial (referred to as A/B testing in modern internet parlance).

  • There is a gap between the world of data science and AI.
  • The mapping of the virtual onto the physical world.
  • E.g. Causal understanding.

Supply Chain [edit]

Figure: Packhorse Bridge under Burbage Edge. This packhorse route climbs steeply out of Hathersage and heads towards Sheffield. Packhorses were the main route for transporting goods across the Peak District. The high cost of transport is one driver of the ‘smith’ model, where there is a local skilled person responsible for assembling or creating goods (e.g. a blacksmith).

On Sunday mornings in Sheffield, I often used to run across Packhorse Bridge in Burbage valley. The bridge is part of an ancient network of trails crossing the Pennines that, before Turnpike roads arrived in the 18th century, was the main way in which goods were moved. Given that the moors around Sheffield were home to sand quarries, tin mines, lead mines and the villages in the Derwent valley were known for nail and pin manufacture, this wasn’t simply movement of agricultural goods, but it was the infrastructure for industrial transport.

The profession of leading the horses was known as a Jagger and leading out of the village of Hathersage is Jagger’s Lane, a trail that headed underneath Stanage Edge and into Sheffield.

The movement of goods from regions of supply to areas of demand is fundamental to our society. The physical infrastructure of supply chain has evolved a great deal over the last 300 years.

Cromford [edit]

Figure: Richard Arkwright is regarded of the founder of the modern factory system. Factories exploit distribution networks to centralize production of goods. Arkwright located his factory in Cromford due to proximity to Nottingham Weavers (his market) and availability of water power from the tributaries of the Derwent river. When he first arrived there was almost no transportation network. Over the following 200 years The Cromford Canal (1790s), a Turnpike (now the A6, 1816-18) and the High Peak Railway (now closed, 1820s) were all constructed to improve transportation access as the factory blossomed.

Richard Arkwright is known as the father of the modern factory system. In 1771 he set up a Mill for spinning cotton yarn in the village of Cromford, in the Derwent Valley. The Derwent valley is relatively inaccessible. Raw cotton arrived in Liverpool from the US and India. It needed to be transported on packhorse across the bridleways of the Pennines. But Cromford was a good location due to proximity to Nottingham, where weavers where consuming the finished thread, and the availability of water power from small tributaries of the Derwent river for Arkwright’s water frames which automated the production of yarn from raw cotton.

By 1794 the Cromford Canal was opened to bring coal in to Cromford and give better transport to Nottingham. The construction of the canals was driven by the need to improve the transport infrastructure, facilitating the movement of goods across the UK. Canals, roads and railways were initially constructed by the economic need for moving goods. To improve supply chain.

The A6 now does pass through Cromford, but at the time he moved there there was merely a track. The High Peak Railway was opened in 1832, it is now converted to the High Peak Trail, but it remains the highest railway built in Britain.

Cooper (1991)

Containerization [edit]

Figure: The container is one of the major drivers of globalization, and arguably the largest agent of social change in the last 100 years. It reduces the cost of transportation, significantly changing the appropriate topology of distribution networks. The container makes it possible to ship goods halfway around the world for cheaper than it costs to process those goods, leading to an extended distribution topology.

Containerization has had a dramatic effect on global economics, placing many people in the developing world at the end of the supply chain.

Figure: Wild Alaskan Cod, being solid in the Pacific Northwest, that is a product of China. It is cheaper to ship the deep frozen fish thousands of kilometers for processing than to process locally.

For example, you can buy Wild Alaskan Cod fished from Alaska, processed in China, sold in North America. This is driven by the low cost of transport for frozen cod vs the higher relative cost of cod processing in the US versus China. Similarly, Scottish prawns are also processed in China for sale in the UK.

This effect on cost of transport vs cost of processing is the main driver of the topology of the modern supply chain and the associated effect of globalization. If transport is much cheaper than processing, then processing will tend to agglomerate in places where processing costs can be minimized.

Large scale global economic change has principally been driven by changes in the technology that drives supply chain.

Supply chain is a large-scale automated decision making network. Our aim is to make decisions not only based on our models of customer behavior (as observed through data), but also by accounting for the structure of our fulfilment center, and delivery network.

Many of the most important questions in supply chain take the form of counterfactuals. E.g. “What would happen if we opened a manufacturing facility in Cambridge?” A counter factual is a question that implies a mechanistic understanding of a system. It goes beyond simple smoothness assumptions or translation invariants. It requires a physical, or mechanistic understanding of the supply chain network. For this reason, the type of models we deploy in supply chain often involve simulations or more mechanistic understanding of the network.

In supply chain Machine Learning alone is not enough, we need to bridge between models that contain real mechanisms and models that are entirely data driven.

This is challenging, because as we introduce more mechanism to the models we use, it becomes harder to develop efficient algorithms to match those models to data.

Machine learning aims to replicate processes through the direct use of data. When deployed in the domain of ‘artificial intelligence’, the processes that it is replicating, or emulating, are cognitive processes.

The first trick in machine learning is to convert the process itself into a mathematical function. That function has a set of parameters which control its behaviour. What we call learning is the adaption of these parameters to change the behavior of the function. The choice of mathematical function we use is a vital component of the model.

Emukit Playground [edit]

Emukit playground is a software toolkit for exploring the use of statistical emulation as a tool. It was built by Adam Hirst, during his software engineering internship at Amazon and supervised by Cliff McCollum.

Figure: Emukit playground is a tutorial for understanding the simulation/emulation relationship. https://amzn.github.io/emukit-playground/

Figure: Tutorial on Bayesian optimization of the number of taxis deployed from Emukit playground. https://amzn.github.io/emukit-playground/#!/learn/bayesian_optimization

You can explore Bayesian optimization of a taxi simulation.

Uncertainty Quantification [edit]

Uncertainty quantification (UQ) is the science of quantitative characterization and reduction of uncertainties in both computational and real world applications. It tries to determine how likely certain outcomes are if some aspects of the system are not exactly known.

We will to illustrate different concepts of Uncertainty Quantification (UQ) and the role that Gaussian processes play in this field. Based on a simple simulator of a car moving between a valley and a mountain, we are going to illustrate the following concepts:

  • Systems emulation. Many real world decisions are based on simulations that can be computationally very demanding. We will show how simulators can be replaced by emulators: Gaussian process models fitted on a few simulations that can be used to replace the simulator. Emulators are cheap to compute, fast to run, and always provide ways to quantify the uncertainty of how precise they are compared the original simulator.

  • Emulators in optimization problems. We will show how emulators can be used to optimize black-box functions that are expensive to evaluate. This field is also called Bayesian Optimization and has gained an increasing relevance in machine learning as emulators can be used to optimize computer simulations (and machine learning algorithms) quite efficiently.

  • Multi-fidelity emulation methods. In many scenarios we have simulators of different quality about the same measure of interest. In these cases the goal is to merge all sources of information under the same model so the final emulator is cheaper and more accurate than an emulator fitted only using data from the most accurate and expensive simulator.

Mountain Car Simulator [edit]

To illustrate the above mentioned concepts we we use the mountain car simulator. This simulator is widely used in machine learning to test reinforcement learning algorithms. The goal is to define a control policy on a car whose objective is to climb a mountain. Graphically, the problem looks as follows:

Figure: The mountain car simulation from the Open AI gym.

The goal is to define a sequence of actions (push the car right or left with certain intensity) to make the car reach the flag after a number T of time steps.

At each time step t, the car is characterized by a vector $\inputVector_{t} = (p_t,v_t)$ of states which are respectively the the position and velocity of the car at time t. For a sequence of states (an episode), the dynamics of the car is given by


$$\inputVector_{t+1} = \mappingFunction(\inputVector_{t},\textbf{u}_{t})$$

where ut is the value of an action force, which in this example corresponds to push car to the left (negative value) or to the right (positive value). The actions across a full episode are represented in a policy $\textbf{u}_{t} = \pi(\inputVector_{t},\theta)$ that acts according to the current state of the car and some parameters θ. In the following examples we will assume that the policy is linear which allows us to write $\pi(\inputVector_{t},\theta)$ as


$$\pi(\inputVector,\theta)= \theta_0 + \theta_p p + \theta_vv.$$

For t = 1, …, T now given some initial state $\inputVector_{0}$ and some some values of each ut, we can simulate the full dynamics of the car for a full episode using Gym. The values of ut are fully determined by the parameters of the linear controller.

After each episode of length T is complete, a reward function RT(θ) is computed. In the mountain car example the reward is computed as 100 for reaching the target of the hill on the right hand side, minus the squared sum of actions (a real negative to push to the left and a real positive to push to the right) from start to goal. Note that our reward depend on θ as we make it dependent on the parameters of the linear controller.

Emulate the Mountain Car

Our goal in this section is to find the parameters θ of the linear controller such that


θ* = argmaxθRT(θ).

In this section, we directly use Bayesian optimization to solve this problem. We will use GPyOpt so we first define the objective function:

For each set of parameter values of the linear controller we can run an episode of the simulator (that we fix to have a horizon of T = 500) to generate the reward. Using as input the parameters of the controller and as outputs the rewards we can build a Gaussian process emulator of the reward.

We start defining the input space, which is three-dimensional:

Now we initizialize a Gaussian process emulator.

In Bayesian optimization an acquisition function is used to balance exploration and exploitation to evaluate new locations close to the optimum of the objective. In this notebook we select the expected improvement (EI). For further details have a look to the review paper of Shahriari et al (2015).

To initalize the model we start sampling some initial points (25) for the linear controler randomly.

Before we start any optimization, lets have a look to the behavior of the car with the first of these initial points that we have selected randomly.

Figure: Random linear controller for the Mountain car. It fails to move the car to the top of the mountain.

As we can see the random linear controller does not manage to push the car to the top of the mountain. Now, let’s optimize the regret using Bayesian optimization and the emulator for the reward. We try 50 new parameters chosen by the EI.

Now we visualize the result for the best controller that we have found with Bayesian optimization.

Figure: Mountain car simulator trained using Bayesian optimization and the simulator of the dynamics. Fifty iterations of Bayesian optimization are used to optimize the controler.

he car can now make it to the top of the mountain! Emulating the reward function and using the EI helped as to find a linear controller that solves the problem.

Data Efficient Emulation [edit]

In the previous section we solved the mountain car problem by directly emulating the reward but no considerations about the dynamics $\inputVector_{t+1} = \mappingFunction(\inputVector_{t},\textbf{u}_{t})$ of the system were made. Note that we had to run 75 episodes of 500 steps each to solve the problem, which required to call the simulator 500 × 75 = 37500 times. In this section we will show how it is possible to reduce this number by building an emulator for f that can later be used to directly optimize the control.

The inputs of the model for the dynamics are the velocity, the position and the value of the control so create this space accordingly.

The outputs are the velocity and the position. Indeed our model will capture the change in position and velocity on time. That is, we will model


Δvt + 1 = vt + 1 − vt


Δxt + 1 = pt + 1 − pt

with Gaussian processes with prior mean vt and pt respectively. As a covariance function, we use a Matern52. We need therefore two models to capture the full dynamics of the system.

Next, we sample some input parameters and use the simulator to compute the outputs. Note that in this case we are not running the full episodes, we are just using the simulator to compute $\inputVector_{t+1}$ given $\inputVector_{t}$ and ut.

In general we might use much smarter strategies to design our emulation of the simulator. For example, we could use the variance of the predictive distributions of the models to collect points using uncertainty sampling, which will give us a better coverage of the space. For simplicity, we move ahead with the 500 randomly selected points.

Now that we have a data set, we can update the emulators for the location and the velocity.

We can now have a look to how the emulator and the simulator match. First, we show a contour plot of the car aceleration for each pair of can position and velocity. You can use the bar bellow to play with the values of the controler to compare the emulator and the simulator.

We can see how the emulator is doing a fairly good job approximating the simulator. On the edges, however, it struggles to captures the dynamics of the simulator.

Given some input parameters of the linear controlling, how do the dynamics of the emulator and simulator match? In the following figure we show the position and velocity of the car for the 500 time steps of an episode in which the parameters of the linear controller have been fixed beforehand. The value of the input control is also shown.

Figure: Comparison between the mountain car simulator and the emulator.

We now make explicit use of the emulator, using it to replace the simulator and optimize the linear controller. Note that in this optimization, we don’t need to query the simulator anymore as we can reproduce the full dynamics of an episode using the emulator. For illustrative purposes, in this example we fix the initial location of the car.

We define the objective reward function in terms of the simulator.

And as before, we use Bayesian optimization to find the best possible linear controller.

The design space is the three continuous variables that make up the linear controller.

We set the acquisition function to be expected improvement using GPyOpt.

Figure: Mountain car controller learnt through emulation. Here 500 calls to the simulator are used to fit the controller rather than 37,500 calls to the simulator required in the standard learning.

And the problem is again solved, but in this case we have replaced the simulator of the car dynamics by a Gaussian process emulator that we learned by calling the simulator only 500 times. Compared to the 37500 calls that we needed when applying Bayesian optimization directly on the simulator this is a great gain.

Multi-Fidelity Emulation [edit]

In some scenarios we have simulators of the same environment that have different fidelities, that is that reflect with different level of accuracy the dynamics of the real world. Running simulations of the different fidelities also have a different cost: hight fidelity simulations are more expensive the cheaper ones. If we have access to these simulators we can combine high and low fidelity simulations under the same model.

So let’s assume that we have two simulators of the mountain car dynamics, one of high fidelity (the one we have used) and another one of low fidelity. The traditional approach to this form of multi-fidelity emulation is to assume that


$$\mappingFunction_i\left(\inputVector\right) = \rho\mappingFunction_{i-1}\left(\inputVector\right) + \delta_i\left(\inputVector \right)$$

where $\mappingFunction_{i-1}\left(\inputVector\right)$ is a low fidelity simulation of the problem of interest and $\mappingFunction_i\left(\inputVector\right)$ is a higher fidelity simulation. The function $\delta_i\left(\inputVector \right)$ represents the difference between the lower and higher fidelity simulation, which is considered additive. The additive form of this covariance means that if $\mappingFunction_{0}\left(\inputVector\right)$ and $\left\{\delta_i\left(\inputVector \right)\right\}_{i=1}^m$ are all Gaussian processes, then the process over all fidelities of simuation will be a joint Gaussian process.

But with Deep Gaussian processes we can consider the form


$$\mappingFunction_i\left(\inputVector\right) = \mappingFunctionTwo_{i}\left(\mappingFunction_{i-1}\left(\inputVector\right)\right) + \delta_i\left(\inputVector \right),$$

where the low fidelity representation is non linearly transformed by $\mappingFunctionTwo(\cdot)$ before use in the process. This is the approach taken in Perdikaris et al. (2017). But once we accept that these models can be composed, a highly flexible framework can emerge. A key point is that the data enters the model at different levels, and represents different aspects. For example these correspond to the two fidelities of the mountain car simulator.

We start by sampling both of them at 250 random input locations.

Next, we evaluate the high and low fidelity simualtors at those locations.

It is time to build the multi-fidelity model for both the position and the velocity.

As we did in the previous section we use the emulator to optimize the simulator. In this case we use the high fidelity output of the emulator.

And we optimize using Bayesian optimzation.

Best Controller with Multi-Fidelity Emulator

Figure: Mountain car learnt with multi-fidelity model. Here 250 observations of the high fidelity simulator and 250 observations of the low fidelity simulator are used to learn the controller.

And problem solved! We see how the problem is also solved with 250 observations of the high fidelity simulator and 250 of the low fidelity simulator.

Emukit [edit]

Figure: The Emukit software is a set of software tools for emulation and surrogate modeling. https://amzn.github.io/emukit/

The aim is to provide a suite where different approaches to emulation are assimilated under one roof. The current version of Emukit includes multi-fidelity emulation for build surrogate models when data is obtained from multiple information sources that have different fidelity and/or cost; Bayesian optimisation for optimising physical experiments and tune parameters of machine learning algorithms or other computational simulations; experimental design and active learning: design the most informative experiments and perform active learning with machine learning models; sensitivity analysis: analyse the influence of inputs on the outputs of a given system; and Bayesian quadrature: efficiently compute the integrals of functions that are expensive to evaluate.

MXFusion: Modular Probabilistic Programming on MXNet [edit]

One challenge for practitioners in Gaussian processes, is flexible software that allows the construction of the relevant GP modle. With this in mind, the Amazon Cambridge team has developed MXFusion. It is a modular probabilistic programming language focussed on efficient implementation of hybrid GP-neural network models, but with additional probabilistic programming capabilities.

Figure: MXFusion is a probabilistic programming language targeted specifically at Gaussian process models and combining them with probaiblistic neural network. It is available through the MIT license and we welcome contributions throguh the Github repository https://github.com/amzn/MXFusion.

We developed the framework for greater ease of transitioning models from ‘science’ to ‘production’, our aim was to have code that could be created by scientists, but deployed in our systems through solutions such as AWS SageMaker.

  • Work by Eric Meissner and Zhenwen Dai.
  • Probabilistic programming.
  • Available on Github

Figure: The MXFusion software.

MxFusion

Why another framework?

Key Requirements

Specialized inference methods + models, without requiring users to reimplement nor understand them every time. Leverage expert knowledge. Efficient inference, flexible framework. Existing frameworks either did one or the other: flexible, or efficient.

What does it look like?

Modelling

Inference

  • Variable
  • Distribution
  • Function

  • log_pdf
  • draw_samples

  • Variational Inference
  • MCMC Sampling (soon) Built on MXNet Gluon (imperative code, not static graph)

  • Model + Inference together form building blocks.
    • Just doing modular modeling with universal inference doesn’t really scale, need specialized inference methods for specialized modelling objects like non-parametrics.

PILCO: A Model-based Policy Search [edit]

Common reinforcement learning methods suffer from data inefficiency, which can be a issue in real world applications where gathering sufficiently large amounts of data pose economic issues and may be impossible. propose a model-based policy search method known as PILCO in part to address this issue. PILCO uses a Gaussian process (GP) for learning the dynamics of the environment and optimizes a parametric policy function using the learned dynamics model.

We construct an implementation of PILCO using MXFusion. This implementation follows the main idea of PILCO and has a few enhancements in addition to the published implementation. The enhancements are as follows: * Use Monte Carlo integration instead of moment estimation. We approximate the expected reward using Monte Carlo integration instead of the proposed moment estimation approach. This removes the bias in the expected reward computation and enables a wide range of choices of kernels and policy functions. In the original work, only RBF and linear kernel and only linear and RBF network policy can be used. * Use automatic differentiation. Thanks to automatic differentiation, no gradient derivation is needed. * A unified interface of Gaussian process. MXFusion provides an unified inferface of GP modules. We allows us to easily switch among plan GP, variational sparse GP and stocastic variational GP implementations.

This notebook depends on MXNet, MXFusion and Open AI Gym. These packages can be installed into your Python environment by running the following commands.

Set the global configuration.

We use the inverted pendulum swingup problem as an example. We use the Pendulum-v0 environment in Open AI Gym. The task is to swing the pendulum up and balance it at the inverted position. This is a classical control problem and is known to be unsolvable with a linear controller.

To solve this problem with PILCO, we need three components:

  • Execute a policy in an real environment (an Open AI Gym simulator in this example) and collect data.
  • Fit a GP model as the model for the dynamics of the environment.
  • Optimize the policy given the dynamics model learned from all the data that have been collected so far.

The overall PILCO algorithm is to iterate the above three steps until a policy that can solve the problem is found.

Execute the Environment

The state of the pendulum environment is a 3D vector. The first two dimensions are the 2D location of the end point of the pendulum. The third dimension encodes the angular speed of the pendulum. The action space is a 1D vector in [-2, 2].

We write a helper function for executing the environment with a given policy.

We first apply a random policy and see how the environment reacts. The random policy uniformly samples in the space of action.

The animation is generated with the following commands:

Pendulum

Figure: Random policy for the control of the pendulum.

The dynamics model of pendulum can be written as
$$ p(\dataScalar_{t+1}|\dataScalar_t, a_t) $$
where $\dataScalar_t$ is the state vector at the time t and at is the action taken at the time t.

PILCO uses a Gaussian process to model the above dynamics.

In this example, we do a maximum likelihood estimate for the model hyper- parameters. In MXFusion, Gaussian process regression model is available as a module, which includes a dediated inference algorithm.

Define and fit the model

Policy

PILCO computes the expected reward of a policy given the dynamics model. First, we need to define the parametric form of the policy. In this example, we use a neural network with one hidden layer. As the action space is [-2, 2], we apply a tanh transformation and multiply the come with two. This enforces the returned actions stay within the range.

We define a neural network with one hidden layer and and output constrained between [-2,2] for the policy.

To compute the expected reward, we also need to define a reward function. This reward function is defined by us according to the task. The main component is the height of the pendulum. We also penalize the force and the angular momentum.

The expected reward function can be written as
$$ R = \mathbb{E}_{p(\dataScalar_T, \ldots, \dataScalar_0)}\left(\sum_{t=0}^\top r(\dataScalar_t)\right) $$
where r( ⋅ ) is the reward function, $p(\dataScalar_T, \ldots, \dataScalar_0)$ is the joint distribution when applying the policy to the dynamics model:
$$ p(\dataScalar_T, \ldots, \dataScalar_0) = p(\dataScalar_0) \prod_{t=1}^\top p(\dataScalar_t|\dataScalar_{t-1}, a_{t-1}), $$
where $a_{t-1} = \pi(\dataScalar_{t-1})$ is the action taken at the time t − 1, which is the outcome of the policy π( ⋅ ).

The expected reward function is implemented as follows.

Obtaining the policy gradients

We optimize the policy with respect to the expected reward by using a gradient optimizer.

The Loop

We need to define a function that provides random initial states.

Policy after the first episode (random exploration):

Figure: PILCO policy for control of the animation after first episode (using random exploration).

Policy after the 5th episode:

Figure: PILCO policy for control of the animation after the fifth episode.

https://github.com/amzn/mxfusion

Long term Aim

  • Simulate/Emulate the components of the system.
    • Validate with real world using multifidelity.
    • Interpret system using e.g. sensitivity analysis.
  • Perform end to end learning to optimize.
    • Maintain interpretability.

References

Cooper, Brian. 1991. Transformation of a Valley: Derbyshire Derwent. Scarthin Books.

Maxwell, James Clerk. 1867. “On Governors.” Proceedings of the Royal Society of London 16. The Royal Society: 270–83. http://www.jstor.org/stable/112510.

Perdikaris, P., M. Raissi, A. Damianou, N. D. Lawrence, and G. E. Karniadakis. 2017. “Nonlinear Information Fusion Algorithms for Data-Efficient Multi-Fidelity Modelling.” Proceedings of the Royal Society of London A: Mathematical, Physical and Engineering Sciences 473 (2198). The Royal Society. https://doi.org/10.1098/rspa.2016.0751.

Wiener, Norbert. 1948. Cybernetics: Control and Communication in the Animal and the Machine. Cambridge, MA: MIT Press.