Neil Lawrence's Talks
http://inverseprobability.com/talks/
Sat, 13 Oct 2018 06:34:45 +0000

<h1 id="data-science-and-digital-systems">Data Science and Digital Systems</h1>
<div style="display:none">
$$\newcommand{\Amatrix}{\mathbf{A}}
\newcommand{\KL}[2]{\text{KL}\left( #1\,\|\,#2 \right)}
\newcommand{\Kaast}{\kernelMatrix_{\mathbf{ \ast}\mathbf{ \ast}}}
\newcommand{\Kastu}{\kernelMatrix_{\mathbf{ \ast} \inducingVector}}
\newcommand{\Kff}{\kernelMatrix_{\mappingFunctionVector \mappingFunctionVector}}
\newcommand{\Kfu}{\kernelMatrix_{\mappingFunctionVector \inducingVector}}
\newcommand{\Kuast}{\kernelMatrix_{\inducingVector \bf\ast}}
\newcommand{\Kuf}{\kernelMatrix_{\inducingVector \mappingFunctionVector}}
\newcommand{\Kuu}{\kernelMatrix_{\inducingVector \inducingVector}}
\newcommand{\Kuui}{\Kuu^{-1}}
\newcommand{\Qaast}{\mathbf{Q}_{\bf \ast \ast}}
\newcommand{\Qastf}{\mathbf{Q}_{\ast \mappingFunction}}
\newcommand{\Qfast}{\mathbf{Q}_{\mappingFunctionVector \bf \ast}}
\newcommand{\Qff}{\mathbf{Q}_{\mappingFunctionVector \mappingFunctionVector}}
\newcommand{\aMatrix}{\mathbf{A}}
\newcommand{\aScalar}{a}
\newcommand{\aVector}{\mathbf{a}}
\newcommand{\acceleration}{a}
\newcommand{\bMatrix}{\mathbf{B}}
\newcommand{\bScalar}{b}
\newcommand{\bVector}{\mathbf{b}}
\newcommand{\basisFunc}{\phi}
\newcommand{\basisFuncVector}{\boldsymbol{ \basisFunc}}
\newcommand{\basisFunction}{\phi}
\newcommand{\basisLocation}{\mu}
\newcommand{\basisMatrix}{\boldsymbol{ \Phi}}
\newcommand{\basisScalar}{\basisFunction}
\newcommand{\basisVector}{\boldsymbol{ \basisFunction}}
\newcommand{\activationFunction}{\phi}
\newcommand{\activationMatrix}{\boldsymbol{ \Phi}}
\newcommand{\activationScalar}{\basisFunction}
\newcommand{\activationVector}{\boldsymbol{ \basisFunction}}
\newcommand{\bigO}{\mathcal{O}}
\newcommand{\binomProb}{\pi}
\newcommand{\cMatrix}{\mathbf{C}}
\newcommand{\cbasisMatrix}{\hat{\boldsymbol{ \Phi}}}
\newcommand{\cdataMatrix}{\hat{\dataMatrix}}
\newcommand{\cdataScalar}{\hat{\dataScalar}}
\newcommand{\cdataVector}{\hat{\dataVector}}
\newcommand{\centeredKernelMatrix}{\mathbf{ \MakeUppercase{\centeredKernelScalar}}}
\newcommand{\centeredKernelScalar}{b}
\newcommand{\centeredKernelVector}{\centeredKernelScalar}
\newcommand{\centeringMatrix}{\mathbf{H}}
\newcommand{\chiSquaredDist}[2]{\chi_{#1}^{2}\left(#2\right)}
\newcommand{\chiSquaredSamp}[1]{\chi_{#1}^{2}}
\newcommand{\conditionalCovariance}{\boldsymbol{ \Sigma}}
\newcommand{\coregionalizationMatrix}{\mathbf{B}}
\newcommand{\coregionalizationScalar}{b}
\newcommand{\coregionalizationVector}{\mathbf{ \coregionalizationScalar}}
\newcommand{\covDist}[2]{\text{cov}_{#2}\left(#1\right)}
\newcommand{\covSamp}[1]{\text{cov}\left(#1\right)}
\newcommand{\covarianceScalar}{c}
\newcommand{\covarianceVector}{\mathbf{ \covarianceScalar}}
\newcommand{\covarianceMatrix}{\mathbf{C}}
\newcommand{\covarianceMatrixTwo}{\boldsymbol{ \Sigma}}
\newcommand{\croupierScalar}{s}
\newcommand{\croupierVector}{\mathbf{ \croupierScalar}}
\newcommand{\croupierMatrix}{\mathbf{ \MakeUppercase{\croupierScalar}}}
\newcommand{\dataDim}{p}
\newcommand{\dataIndex}{i}
\newcommand{\dataIndexTwo}{j}
\newcommand{\dataMatrix}{\mathbf{Y}}
\newcommand{\dataScalar}{y}
\newcommand{\dataSet}{\mathcal{D}}
\newcommand{\dataStd}{\sigma}
\newcommand{\dataVector}{\mathbf{ \dataScalar}}
\newcommand{\decayRate}{d}
\newcommand{\degreeMatrix}{\mathbf{ \MakeUppercase{\degreeScalar}}}
\newcommand{\degreeScalar}{d}
\newcommand{\degreeVector}{\mathbf{ \degreeScalar}}
% Already defined by latex
%\newcommand{\det}[1]{\left|#1\right|}
\newcommand{\diag}[1]{\text{diag}\left(#1\right)}
\newcommand{\diagonalMatrix}{\mathbf{D}}
\newcommand{\diff}[2]{\frac{\text{d}#1}{\text{d}#2}}
\newcommand{\diffTwo}[2]{\frac{\text{d}^2#1}{\text{d}#2^2}}
\newcommand{\displacement}{x}
\newcommand{\displacementVector}{\textbf{\displacement}}
\newcommand{\distanceMatrix}{\mathbf{ \MakeUppercase{\distanceScalar}}}
\newcommand{\distanceScalar}{d}
\newcommand{\distanceVector}{\mathbf{ \distanceScalar}}
\newcommand{\eigenvaltwo}{\ell}
\newcommand{\eigenvaltwoMatrix}{\mathbf{L}}
\newcommand{\eigenvaltwoVector}{\mathbf{l}}
\newcommand{\eigenvalue}{\lambda}
\newcommand{\eigenvalueMatrix}{\boldsymbol{ \Lambda}}
\newcommand{\eigenvalueVector}{\boldsymbol{ \lambda}}
\newcommand{\eigenvector}{\mathbf{ \eigenvectorScalar}}
\newcommand{\eigenvectorMatrix}{\mathbf{U}}
\newcommand{\eigenvectorScalar}{u}
\newcommand{\eigenvectwo}{\mathbf{v}}
\newcommand{\eigenvectwoMatrix}{\mathbf{V}}
\newcommand{\eigenvectwoScalar}{v}
\newcommand{\entropy}[1]{\mathcal{H}\left(#1\right)}
\newcommand{\errorFunction}{E}
\newcommand{\expDist}[2]{\left<#1\right>_{#2}}
\newcommand{\expSamp}[1]{\left<#1\right>}
\newcommand{\expectation}[1]{\left\langle #1 \right\rangle }
\newcommand{\expectationDist}[2]{\left\langle #1 \right\rangle _{#2}}
\newcommand{\expectedDistanceMatrix}{\mathcal{D}}
\newcommand{\eye}{\mathbf{I}}
\newcommand{\fantasyDim}{r}
\newcommand{\fantasyMatrix}{\mathbf{ \MakeUppercase{\fantasyScalar}}}
\newcommand{\fantasyScalar}{z}
\newcommand{\fantasyVector}{\mathbf{ \fantasyScalar}}
\newcommand{\featureStd}{\varsigma}
\newcommand{\gammaCdf}[3]{\mathcal{GAMMA CDF}\left(#1|#2,#3\right)}
\newcommand{\gammaDist}[3]{\mathcal{G}\left(#1|#2,#3\right)}
\newcommand{\gammaSamp}[2]{\mathcal{G}\left(#1,#2\right)}
\newcommand{\gaussianDist}[3]{\mathcal{N}\left(#1|#2,#3\right)}
\newcommand{\gaussianSamp}[2]{\mathcal{N}\left(#1,#2\right)}
\newcommand{\given}{|}
\newcommand{\half}{\frac{1}{2}}
\newcommand{\heaviside}{H}
\newcommand{\hiddenMatrix}{\mathbf{ \MakeUppercase{\hiddenScalar}}}
\newcommand{\hiddenScalar}{h}
\newcommand{\hiddenVector}{\mathbf{ \hiddenScalar}}
\newcommand{\identityMatrix}{\eye}
\newcommand{\inducingInputScalar}{z}
\newcommand{\inducingInputVector}{\mathbf{ \inducingInputScalar}}
\newcommand{\inducingInputMatrix}{\mathbf{Z}}
\newcommand{\inducingScalar}{u}
\newcommand{\inducingVector}{\mathbf{ \inducingScalar}}
\newcommand{\inducingMatrix}{\mathbf{U}}
\newcommand{\inlineDiff}[2]{\text{d}#1/\text{d}#2}
\newcommand{\inputDim}{q}
\newcommand{\inputMatrix}{\mathbf{X}}
\newcommand{\inputScalar}{x}
\newcommand{\inputSpace}{\mathcal{X}}
\newcommand{\inputVals}{\inputVector}
\newcommand{\inputVector}{\mathbf{ \inputScalar}}
\newcommand{\iterNum}{k}
\newcommand{\kernel}{\kernelScalar}
\newcommand{\kernelMatrix}{\mathbf{K}}
\newcommand{\kernelScalar}{k}
\newcommand{\kernelVector}{\mathbf{ \kernelScalar}}
\newcommand{\kff}{\kernelScalar_{\mappingFunction \mappingFunction}}
\newcommand{\kfu}{\kernelVector_{\mappingFunction \inducingScalar}}
\newcommand{\kuf}{\kernelVector_{\inducingScalar \mappingFunction}}
\newcommand{\kuu}{\kernelVector_{\inducingScalar \inducingScalar}}
\newcommand{\lagrangeMultiplier}{\lambda}
\newcommand{\lagrangeMultiplierMatrix}{\boldsymbol{ \Lambda}}
\newcommand{\lagrangian}{L}
\newcommand{\laplacianFactor}{\mathbf{ \MakeUppercase{\laplacianFactorScalar}}}
\newcommand{\laplacianFactorScalar}{m}
\newcommand{\laplacianFactorVector}{\mathbf{ \laplacianFactorScalar}}
\newcommand{\laplacianMatrix}{\mathbf{L}}
\newcommand{\laplacianScalar}{\ell}
\newcommand{\laplacianVector}{\mathbf{ \ell}}
\newcommand{\latentDim}{q}
\newcommand{\latentDistanceMatrix}{\boldsymbol{ \Delta}}
\newcommand{\latentDistanceScalar}{\delta}
\newcommand{\latentDistanceVector}{\boldsymbol{ \delta}}
\newcommand{\latentForce}{f}
\newcommand{\latentFunction}{u}
\newcommand{\latentFunctionVector}{\mathbf{ \latentFunction}}
\newcommand{\latentFunctionMatrix}{\mathbf{ \MakeUppercase{\latentFunction}}}
\newcommand{\latentIndex}{j}
\newcommand{\latentScalar}{z}
\newcommand{\latentVector}{\mathbf{ \latentScalar}}
\newcommand{\latentMatrix}{\mathbf{Z}}
\newcommand{\learnRate}{\eta}
\newcommand{\lengthScale}{\ell}
\newcommand{\rbfWidth}{\ell}
\newcommand{\likelihoodBound}{\mathcal{L}}
\newcommand{\likelihoodFunction}{L}
\newcommand{\locationScalar}{\mu}
\newcommand{\locationVector}{\boldsymbol{ \locationScalar}}
\newcommand{\locationMatrix}{\mathbf{M}}
\newcommand{\variance}[1]{\text{var}\left( #1 \right)}
\newcommand{\mappingFunction}{f}
\newcommand{\mappingFunctionMatrix}{\mathbf{F}}
\newcommand{\mappingFunctionTwo}{g}
\newcommand{\mappingFunctionTwoMatrix}{\mathbf{G}}
\newcommand{\mappingFunctionTwoVector}{\mathbf{ \mappingFunctionTwo}}
\newcommand{\mappingFunctionVector}{\mathbf{ \mappingFunction}}
\newcommand{\scaleScalar}{s}
\newcommand{\mappingScalar}{w}
\newcommand{\mappingVector}{\mathbf{ \mappingScalar}}
\newcommand{\mappingMatrix}{\mathbf{W}}
\newcommand{\mappingScalarTwo}{v}
\newcommand{\mappingVectorTwo}{\mathbf{ \mappingScalarTwo}}
\newcommand{\mappingMatrixTwo}{\mathbf{V}}
\newcommand{\maxIters}{K}
\newcommand{\meanMatrix}{\mathbf{M}}
\newcommand{\meanScalar}{\mu}
\newcommand{\meanTwoMatrix}{\mathbf{M}}
\newcommand{\meanTwoScalar}{m}
\newcommand{\meanTwoVector}{\mathbf{ \meanTwoScalar}}
\newcommand{\meanVector}{\boldsymbol{ \meanScalar}}
\newcommand{\mrnaConcentration}{m}
\newcommand{\naturalFrequency}{\omega}
\newcommand{\neighborhood}[1]{\mathcal{N}\left( #1 \right)}
\newcommand{\neilurl}{http://inverseprobability.com/}
\newcommand{\noiseMatrix}{\boldsymbol{ E}}
\newcommand{\noiseScalar}{\epsilon}
\newcommand{\noiseVector}{\boldsymbol{ \epsilon}}
\newcommand{\norm}[1]{\left\Vert #1 \right\Vert}
\newcommand{\normalizedLaplacianMatrix}{\hat{\mathbf{L}}}
\newcommand{\normalizedLaplacianScalar}{\hat{\ell}}
\newcommand{\normalizedLaplacianVector}{\hat{\mathbf{ \ell}}}
\newcommand{\numActive}{m}
\newcommand{\numBasisFunc}{m}
\newcommand{\numComponents}{m}
\newcommand{\numComps}{K}
\newcommand{\numData}{n}
\newcommand{\numFeatures}{K}
\newcommand{\numHidden}{h}
\newcommand{\numInducing}{m}
\newcommand{\numLayers}{\ell}
\newcommand{\numNeighbors}{K}
\newcommand{\numSequences}{s}
\newcommand{\numSuccess}{s}
\newcommand{\numTasks}{m}
\newcommand{\numTime}{T}
\newcommand{\numTrials}{S}
\newcommand{\outputIndex}{j}
\newcommand{\paramVector}{\boldsymbol{ \theta}}
\newcommand{\parameterMatrix}{\boldsymbol{ \Theta}}
\newcommand{\parameterScalar}{\theta}
\newcommand{\parameterVector}{\boldsymbol{ \parameterScalar}}
\newcommand{\partDiff}[2]{\frac{\partial#1}{\partial#2}}
\newcommand{\precisionScalar}{j}
\newcommand{\precisionVector}{\mathbf{ \precisionScalar}}
\newcommand{\precisionMatrix}{\mathbf{J}}
\newcommand{\pseudotargetScalar}{\widetilde{y}}
\newcommand{\pseudotargetVector}{\mathbf{ \pseudotargetScalar}}
\newcommand{\pseudotargetMatrix}{\mathbf{ \widetilde{Y}}}
\newcommand{\rank}[1]{\text{rank}\left(#1\right)}
\newcommand{\rayleighDist}[2]{\mathcal{R}\left(#1|#2\right)}
\newcommand{\rayleighSamp}[1]{\mathcal{R}\left(#1\right)}
\newcommand{\responsibility}{r}
\newcommand{\rotationScalar}{r}
\newcommand{\rotationVector}{\mathbf{ \rotationScalar}}
\newcommand{\rotationMatrix}{\mathbf{R}}
\newcommand{\sampleCovScalar}{s}
\newcommand{\sampleCovVector}{\mathbf{ \sampleCovScalar}}
\newcommand{\sampleCovMatrix}{\mathbf{s}}
\newcommand{\scalarProduct}[2]{\left\langle{#1},{#2}\right\rangle}
\newcommand{\sign}[1]{\text{sign}\left(#1\right)}
\newcommand{\sigmoid}[1]{\sigma\left(#1\right)}
\newcommand{\singularvalue}{\ell}
\newcommand{\singularvalueMatrix}{\mathbf{L}}
\newcommand{\singularvalueVector}{\mathbf{l}}
\newcommand{\sorth}{\mathbf{u}}
\newcommand{\spar}{\lambda}
\newcommand{\trace}[1]{\text{tr}\left(#1\right)}
\newcommand{\BasalRate}{B}
\newcommand{\DampingCoefficient}{C}
\newcommand{\DecayRate}{D}
\newcommand{\Displacement}{X}
\newcommand{\LatentForce}{F}
\newcommand{\Mass}{M}
\newcommand{\Sensitivity}{S}
\newcommand{\basalRate}{b}
\newcommand{\dampingCoefficient}{c}
\newcommand{\mass}{m}
\newcommand{\sensitivity}{s}
\newcommand{\springScalar}{\kappa}
\newcommand{\springVector}{\boldsymbol{ \kappa}}
\newcommand{\springMatrix}{\boldsymbol{ \mathcal{K}}}
\newcommand{\tfConcentration}{p}
\newcommand{\tfDecayRate}{\delta}
\newcommand{\tfMrnaConcentration}{f}
\newcommand{\tfVector}{\mathbf{ \tfConcentration}}
\newcommand{\velocity}{v}
\newcommand{\sufficientStatsScalar}{g}
\newcommand{\sufficientStatsVector}{\mathbf{ \sufficientStatsScalar}}
\newcommand{\sufficientStatsMatrix}{\mathbf{G}}
\newcommand{\switchScalar}{s}
\newcommand{\switchVector}{\mathbf{ \switchScalar}}
\newcommand{\switchMatrix}{\mathbf{S}}
\newcommand{\tr}[1]{\text{tr}\left(#1\right)}
\newcommand{\loneNorm}[1]{\left\Vert #1 \right\Vert_1}
\newcommand{\ltwoNorm}[1]{\left\Vert #1 \right\Vert_2}
\newcommand{\onenorm}[1]{\left\vert#1\right\vert_1}
\newcommand{\twonorm}[1]{\left\Vert #1 \right\Vert}
\newcommand{\vScalar}{v}
\newcommand{\vVector}{\mathbf{v}}
\newcommand{\vMatrix}{\mathbf{V}}
\newcommand{\varianceDist}[2]{\text{var}_{#2}\left( #1 \right)}
% Already defined by latex
%\newcommand{\vec}{#1:}
\newcommand{\vecb}[1]{\left(#1\right):}
\newcommand{\weightScalar}{w}
\newcommand{\weightVector}{\mathbf{ \weightScalar}}
\newcommand{\weightMatrix}{\mathbf{W}}
\newcommand{\weightedAdjacencyMatrix}{\mathbf{A}}
\newcommand{\weightedAdjacencyScalar}{a}
\newcommand{\weightedAdjacencyVector}{\mathbf{ \weightedAdjacencyScalar}}
\newcommand{\onesVector}{\mathbf{1}}
\newcommand{\zerosVector}{\mathbf{0}}
$$
</div>
<h2 id="the-gartner-hype-cycle">The Gartner Hype Cycle</h2>
<p><img class="negate" src="../slides/diagrams/Gartner_Hype_Cycle.png" width="70%" height="auto" align="center" style="background:none; border:none; box-shadow:none; display:block; margin-left:auto; margin-right:auto"></p>
<p>The <a href="https://en.wikipedia.org/wiki/Hype_cycle">Gartner Hype Cycle</a> tries to assess where an idea is in terms of maturity and adoption. It splits the evolution of technology into a technological trigger, a peak of expectations followed by a trough of disillusionment and a final ascension into a useful technology. It looks rather like a classical control response to a final set point.</p>
<object class="svgplot" data="../slides/diagrams/data-science/ai-bd-dm-dl-ml-google-trends004.svg">
</object>
<center>
<em>Google trends for different technological terms on the hype cycle. </em>
</center>
<p>Google trends gives us insight into how far along various technological terms are on the hype cycle.</p>
<div style="text-align:center">
<img class="rotateimg90" src="../slides/diagrams/2017-10-12 16.47.34.jpg" width="50%" height="auto" align="center" style="background:none; border:none; box-shadow:none; display:block; margin-left:auto; margin-right:auto">
</div>
<div style="text-align:center">
<img class="center" src="../slides/diagrams/SteamEngine_Boulton&Watt_1784_neg.png" width="50%" height="auto" align="" style="background:none; border:none; box-shadow:none; display:block; margin-left:auto; margin-right:auto">
</div>
<p>Machine learning allows us to extract knowledge from data to form a prediction.</p>
<p><span class="math display">\[ \text{data} + \text{model} \xrightarrow{\text{compute}} \text{prediction}\]</span></p>
<p>A machine learning prediction is made by combining a model with data to form the prediction. The manner in which this is done gives us the machine learning <em>algorithm</em>.</p>
<p>Machine learning models are <em>mathematical models</em> which make weak assumptions about data, e.g. smoothness assumptions. By combining these assumptions with the data we observe we can interpolate between data points or, occasionally, extrapolate into the future.</p>
<p>Machine learning is a technology which strongly overlaps with the methodology of statistics. From a historical/philosophical view point, machine learning differs from statistics in that the focus in the machine learning community has been primarily on accuracy of prediction, whereas the focus in statistics is typically on the interpretability of a model and/or validating a hypothesis through data collection.</p>
<p>The rapid increase in the availability of compute and data has led to the increased prominence of machine learning. This prominence is surfacing in two different, but overlapping domains: data science and artificial intelligence.</p>
<h3 id="artificial-intelligence-and-data-science">Artificial Intelligence and Data Science</h3>
<p>Artificial intelligence has the objective of endowing computers with human-like intelligent capabilities. For example, understanding an image (computer vision) or the contents of some speech (speech recognition), the meaning of a sentence (natural language processing) or the translation of a sentence (machine translation).</p>
<h3 id="supervised-learning-for-ai">Supervised Learning for AI</h3>
<p>The machine learning approach to artificial intelligence is to collect and annotate a large data set from humans. The problem is characterized by input data (e.g. a particular image) and a label (e.g. is there a car in the image yes/no). The machine learning algorithm fits a mathematical function (I call this the <em>prediction function</em>) to map from the input image to the label. The parameters of the prediction function are set by minimizing an error between the function’s predictions and the true data. This mathematical function that encapsulates this error is known as the <em>objective function</em>.</p>
<p>This approach to machine learning is known as <em>supervised learning</em>. Various approaches to supervised learning use different prediction functions, objective functions or different optimization algorithms to fit them.</p>
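<p>The fitting loop described above can be sketched in a few lines. This is a minimal illustration rather than any particular production system: it assumes a linear prediction function, a squared-error objective and plain gradient descent, with synthetic data standing in for human-annotated labels.</p>

```python
import numpy as np

# Hypothetical annotated data: inputs x and labels y
# (here a noisy linear relationship with slope 2.0, offset 0.5).
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
y = 2.0 * x + 0.5 + rng.normal(0, 0.1, size=50)

# Prediction function f(x; w, b) = w*x + b and a squared-error
# objective function measuring the error of its predictions.
def objective(w, b):
    return np.mean((w * x + b - y) ** 2)

# Set the parameters by minimizing the objective with gradient descent.
w, b = 0.0, 0.0
learning_rate = 0.1
for _ in range(500):
    residual = w * x + b - y
    w -= learning_rate * np.mean(2 * residual * x)
    b -= learning_rate * np.mean(2 * residual)

# w and b now recover roughly the slope and offset used to generate the data.
```

<p>In practice the prediction function is far more flexible (e.g. a neural network) and the optimizer more sophisticated, but the three ingredients, prediction function, objective function and optimization algorithm, are the same.</p>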
<p>For example, <em>deep learning</em> makes use of <em>neural networks</em> to form the predictions. A neural network is a particular type of mathematical function that allows the algorithm designer to introduce invariances into the function.</p>
<p>An invariance is an important way of including prior understanding in a machine learning model. For example, in an image, a car is still a car regardless of whether it’s in the upper left or lower right corner of the image. This is known as translation invariance. A neural network encodes translation invariance in <em>convolutional layers</em>. Convolutional neural networks are widely used in image recognition tasks.</p>
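<p>Translation invariance can be seen in a toy one-dimensional example. The sketch below (the signal and filter values are illustrative assumptions) shows that a convolutional filter with shared weights responds to a shifted input with a correspondingly shifted output, and that pooling over positions removes the dependence on position entirely.</p>

```python
import numpy as np

# A 1-d "image" containing a feature (a bump), and the same image shifted.
signal = np.zeros(10)
signal[2] = 1.0
shifted = np.roll(signal, 3)

# A small convolutional filter applied at every position (shared weights).
kernel = np.array([0.5, 1.0, 0.5])
response = np.convolve(signal, kernel, mode='same')
response_shifted = np.convolve(shifted, kernel, mode='same')

# Shifting the input shifts the response by the same amount (equivariance)...
assert np.allclose(np.roll(response, 3), response_shifted)
# ...and taking the maximum over positions (pooling) is fully invariant:
# the filter's strongest response is the same wherever the feature sits.
assert response.max() == response_shifted.max()
```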
<p>An alternative structure is known as a recurrent neural network (RNN). RNNs encode temporal structure through auto-regressive connections in their hidden layers; they can be seen as time series models with non-linear auto-regressive basis functions. They are widely used in speech recognition and machine translation.</p>
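<p>The recurrent update can be sketched as a single non-linear auto-regressive step. The dimensions and random weights below are illustrative assumptions; a trained RNN would learn these weights from data.</p>

```python
import numpy as np

def rnn_step(h, x, W_h, W_x, b):
    # One recurrent update: the new hidden state is a non-linear (tanh)
    # function of the previous hidden state and the current input,
    # i.e. a non-linear auto-regressive basis over the sequence.
    return np.tanh(W_h @ h + W_x @ x + b)

rng = np.random.default_rng(0)
hidden_dim, input_dim = 4, 2
W_h = rng.normal(0, 0.5, (hidden_dim, hidden_dim))
W_x = rng.normal(0, 0.5, (hidden_dim, input_dim))
b = np.zeros(hidden_dim)

# Process a short sequence: the hidden state carries temporal context
# forward from one time step to the next.
h = np.zeros(hidden_dim)
for t in range(5):
    x_t = rng.normal(size=input_dim)
    h = rnn_step(h, x_t, W_h, W_x, b)
```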
<p>Machine learning has been deployed in Speech Recognition (e.g. Alexa, deep neural networks, convolutional neural networks for speech recognition), in computer vision (e.g. Amazon Go, convolutional neural networks for person recognition and pose detection).</p>
<p>The field of data science is related to AI, but philosophically different. It arises because we are increasingly creating large amounts of data through <em>happenstance</em> rather than active collection. In the modern era data is laid down by almost all our activities. The objective of data science is to extract insights from this data.</p>
<p>Classically, in the field of statistics, data analysis proceeds by assuming that the question (or scientific hypothesis) comes before the data is created. E.g., if I want to determine the effectiveness of a particular drug I perform a <em>design</em> for my data collection. I use foundational approaches such as randomization to account for confounders. This made a lot of sense in an era where data had to be actively collected. The reduction in cost of data collection and storage now means that many data sets are available which weren’t collected with a particular question in mind. This is a challenge because bias in the way data was acquired can corrupt the insights we derive. We can perform randomized control trials (or A/B tests) to verify our conclusions, but the opportunity is to use data science techniques to better guide our question selection or even answer a question without the expense of a full randomized control trial (referred to as A/B testing in modern internet parlance).</p>
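<p>As a sketch of the randomized comparison mentioned above, the following simulates an A/B test with illustrative conversion rates and sample sizes, then applies a two-proportion z-test. Because users are randomized between the two groups, a difference in rates is not corrupted by confounders in the way a happenstance data set can be.</p>

```python
import numpy as np

# Hypothetical A/B test: users are randomized to control (A) or
# treatment (B); rates and sample size are illustrative assumptions.
rng = np.random.default_rng(1)
n = 5000
conversions_a = rng.binomial(1, 0.10, n)  # control: 10% conversion
conversions_b = rng.binomial(1, 0.12, n)  # treatment: 12% conversion

p_a, p_b = conversions_a.mean(), conversions_b.mean()

# Two-proportion z-test for the difference in conversion rates.
p_pool = (conversions_a.sum() + conversions_b.sum()) / (2 * n)
se = np.sqrt(p_pool * (1 - p_pool) * 2 / n)
z = (p_b - p_a) / se
print(f"lift = {p_b - p_a:.3f}, z = {z:.2f}")
```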
<h3 id="machine-learning-in-supply-chain">Machine Learning in Supply Chain</h3>
<p>Supply chain is a large scale automated decision making network. Our aim is to make decisions not only based on our models of customer behavior (as observed through data), but also by accounting for the structure of our fulfilment center, and delivery network.</p>
<p>Many of the most important questions in supply chain take the form of counterfactuals, e.g. “What would happen if we opened a manufacturing facility in Cambridge?” A counterfactual is a question that implies a mechanistic understanding of our systems. It goes beyond simple smoothness assumptions or translation invariances. It requires a physical, or <em>mechanistic</em>, understanding of the supply chain network. For this reason the type of models we deploy in supply chain often involve simulations or a more mechanistic understanding of the network.</p>
<p>In supply chain, machine learning alone is not enough: we need to bridge between models that contain real mechanisms and models that are entirely data driven.</p>
<p>This is challenging, because as we introduce more mechanism to the models we use, it becomes harder to develop efficient algorithms to match those models to data.</p>
<h3 id="operations-research-control-econometrics-statistics-and-machine-learning">Operations Research, Control, Econometrics, Statistics and Machine Learning</h3>
<p>Different academic fields are born in different eras, driven by different motivations and arrive at different solutions.</p>
<p>The separation between these fields can almost become tribal, and from one perspective this can be very helpful. Each tribe can agree on a common language, a common set of goals and a shared understanding of the approach they’ve chosen for those goals. This ensures that best practice can be developed and shared, and as a result quality standards rise.</p>
<p>This is the nature of our <em>professions</em>. Medics, lawyers, engineers and accountants all have a system of shared best practice that they deploy efficiently in the resolution of a roughly standardized set of problems (a broken leg, defending a libel trial, bridging a river, ensuring finances are correct).</p>
<p>Control, statistics, economics and operations research are all established professions. Techniques are established, often at undergraduate level, and graduation to the profession is regulated by professional bodies. This system works well as long as the problems we face are easily categorized and mapped onto the existing set of known problems.</p>
<p>However, at another level our separate professions of OR, statistics and control engineering are just different views on the same problem. Just as any tribe of humans needs to eat and sleep, so do these professions depend on data, modelling, optimization and decision-making.</p>
<p>We are doing something that has never been done before, optimizing and evolving very large scale automated decision making networks. The ambition to scale and automate in a <em>data driven</em> manner means that a tribal approach to problem solving can hinder our progress. Any tribe of hunter gatherers would struggle to understand the operation of a modern city. Similarly, supply chain needs to develop cross-functional skill sets to address the modern problems we face, not the problems that were formulated in the past.</p>
<p>Many of the challenges we face are at the interface between our tribal expertise. We have particular cost functions we are trying to minimize (an expertise of OR) but we have large scale feedbacks in our system (an expertise of control). We also want our systems to be adaptive to changing circumstances, to perform the best action given the data available (an expertise of machine learning and statistics).</p>
<p>Taking the tribal analogy further, we could imagine each of our professions as a separate tribe of hunter-gatherers, each with particular expertise (e.g. fishing, deer hunting, trapping). Each of these tribes has its own approach to eating to survive, just as each of our localized professions has its own approach to modelling. But in this analogy, the technological landscapes we face are not wildernesses, they are emerging metropolises. Our new task is to feed our population through a budding network of supermarkets. While we may be sourcing our food in the same way, this requires new types of thinking that don't belong in the pure domain of any of our existing tribes.</p>
<p>For our biggest challenges, focusing on the differences between these fields is unhelpful; we should instead consider their strengths and how they overlap. Fundamentally, all these fields are focused on taking the right action given the information available to us. They need to work in <em>synergy</em> for us to make progress.</p>
<p><strong>Recommendation</strong>: We should be aware of the limitations of a single tribal view of any of our problem sets. Where our modelling is dominated by one perspective (e.g. economics, OR, control, ML) we should ensure cross-fertilization of ideas occurs through scientific review and team rotation mechanisms that embed our scientists (for a short period) in different teams across our organizations.</p>
<h3 id="challenges-for-machine-learning-in-general">Challenges for Machine Learning in General</h3>
<p>We can characterize the challenges for integrating machine learning within our systems as the three Ds: design, data and deployment.</p>
<p>The first two components, <em>design</em> and <em>data</em>, are interlinked, but we will first outline the design challenge. Below we will mainly focus on <em>supervised learning</em> because this is arguably the technology that is best understood within machine learning.</p>
<h3 id="design">Design</h3>
<p>Machine learning is not magical pixie dust; we cannot simply automate all decisions through data. We are constrained by our data (see below) and the models we use.<a href="#fn1" class="footnoteRef" id="fnref1"><sup>1</sup></a> Machine learning models are relatively simple function mappings that include characteristics such as smoothness. With some famous exceptions, e.g. speech and image data, inputs are constrained in the form of vectors and the model consists of a mathematically well behaved function. This means that some careful thought has to be put into selecting the right sub-process to automate with machine learning.</p>
<p>Any repetitive task is a candidate for automation, but many of the repetitive tasks we perform as humans are more complex than any individual algorithm can replace. The selection of which task to automate becomes critical and has downstream effects on our overall system design.</p>
<h3 id="pigeonholing">Pigeonholing</h3>
<div style="text-align:center">
<img class="" src="../slides/diagrams/TooManyPigeons.jpg" width="60%" height="auto" align="" style="background:none; border:none; box-shadow:none; display:block; margin-left:auto; margin-right:auto">
</div>
<center>
<em>The machine learning systems design process calls for separating a complex task into decomposable separate entities. A process we can think of as </em><a href="https://en.wikipedia.org/wiki/Pigeonholing">pigeonholing</a><em>. </em>
</center>
<p>Some aspects to take into account are</p>
<ol style="list-style-type: decimal">
<li>Can we refine the decision we need to a set of repetitive tasks where input information and output decision/value is well defined?</li>
<li>Can we represent the sub-task we’ve defined with a mathematical mapping?</li>
</ol>
<p>The design for the second task may involve massaging of the problem: feature selection or adaptation. It may also involve filtering out exception cases (perhaps through a pre-classification).</p>
<p>All else being equal, we’d like to keep our models simple and interpretable. If we can convert a complex mapping to a linear mapping through clever selection of sub-task and features, this is a big win.</p>
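<p>As a minimal sketch of this idea, consider a hypothetical sub-task where a quantity depends non-linearly on raw inputs, but becomes exactly linear once we engineer the right feature. The shipping-cost task and its functional form below are illustrative assumptions, not from the text.</p>

```python
# Sketch: a hand-crafted feature can turn a non-linear mapping into a
# linear one. The "shipping cost" task is an illustrative assumption.

def raw_features(package):
    # Cost depends on volume, a non-linear function of these raw inputs.
    return [package["length"], package["width"], package["height"]]

def engineered_features(package):
    # With volume as an explicit feature, a linear model suffices.
    volume = package["length"] * package["width"] * package["height"]
    return [volume]

def linear_model(features, weights, bias):
    return sum(w * x for w, x in zip(weights, features)) + bias

package = {"length": 2.0, "width": 3.0, "height": 4.0}
# Suppose the true cost is 0.5 * volume + 1.0; a linear model on the
# engineered feature recovers it exactly.
cost = linear_model(engineered_features(package), weights=[0.5], bias=1.0)
print(cost)  # 13.0
```

<p>The win is not only accuracy: the linear model's weights remain directly interpretable, which matters for the adaptability arguments below.</p>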
<p>For example, Facebook have <em>feature engineers</em>, individuals whose main role is to design features they think might be useful for one of their tasks (e.g. newsfeed ranking, or ad matching). Facebook have a training/testing pipeline called <a href="https://www.facebook.com/Engineering/posts/fblearner-flow-is-a-machine-learning-platform-capable-of-easily-reusing-algorith/10154077833317200/">FBLearner</a>. Facebook have predefined the sub-tasks they are interested in, and they are tightly connected to their business model. A challenge for companies that have a more diversified portfolio of activities is the identification of the most appropriate sub-task. A potential solution to feature and model selection is known as <em>auto ML</em>. Or we can think of it as using Machine Learning to assist Machine Learning. It’s also called meta-learning. Learning about learning. The input to the ML algorithm is a machine learning task, the output is a proposed model to solve the task.</p>
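<p>The auto ML idea described above, where the input is a learning task and the output is a proposed model, can be caricatured in a few lines. The candidate models, hold-out scheme and data below are illustrative assumptions, a toy stand-in for real auto ML systems.</p>

```python
# Toy sketch of the auto-ML idea: the input is a learning task (a dataset),
# the output is a proposed model. Candidates and data are illustrative.

def fit_mean(xs, ys):
    # Baseline candidate: always predict the training mean.
    m = sum(ys) / len(ys)
    return lambda x: m

def fit_linear(xs, ys):
    # Ordinary least squares for y = a*x + b in one dimension.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return lambda x: a * x + b

def propose_model(xs, ys, candidates):
    # Hold out the last third of the data and propose the candidate
    # with the lowest held-out squared error.
    split = 2 * len(xs) // 3
    best, best_err = None, float("inf")
    for name, fit in candidates:
        model = fit(xs[:split], ys[:split])
        err = sum((model(x) - y) ** 2 for x, y in zip(xs[split:], ys[split:]))
        if err < best_err:
            best, best_err = name, err
    return best

xs = list(range(12))
ys = [2 * x + 1 for x in xs]  # linear ground truth
proposal = propose_model(xs, ys, [("mean", fit_mean), ("linear", fit_linear)])
print(proposal)  # linear
```

<p>Real systems search over far richer model and feature spaces, but the meta-learning shape is the same: a learning task goes in, a proposed model comes out.</p>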
<p>One trap that is easy to fall into is placing too much emphasis on the type of model we have deployed rather than on the appropriateness of the task decomposition we have chosen.</p>
<p><strong>Recommendation</strong>: Conditioned on task decomposition, we should automate the process of model improvement. Model updates should not be discussed in management meetings, they should be deployed and updated as a matter of course. Further details below on model deployment, but model updating needs to be considered at design time.</p>
<div style="text-align:center">
<img class="" src="../slides/diagrams/ai/chicken-and-egg.jpg" width="" height="auto" align="" style="background:none; border:none; box-shadow:none; display:block; margin-left:auto; margin-right:auto">
</div>
<center>
<em>The answer to the question of which comes first, the chicken or the egg, is simple: they co-evolve <span class="citation">(Popper, 1963)</span>. Similarly, when we place components together in a complex machine learning system, they will tend to co-evolve and compensate for one another. </em>
</center>
<p>To form modern decision making systems, many components are interlinked. We decompose our complex decision making into individual tasks, but the performance of each component is dependent on those upstream of it.</p>
<p>This naturally leads to co-evolution of systems, upstream errors can be compensated by downstream corrections.</p>
<p>To embrace this characteristic, end-to-end training could be considered. Why produce the best forecast by metrics when we can just produce the best forecast for our systems? End-to-end training can lead to improvements in performance, but it would also damage our system's decomposability and interpretability, and perhaps its adaptability.</p>
<p>The less human interpretable our systems are, the harder they are to adapt to different circumstances or diagnose when there's a challenge. The trade-off between interpretability and performance is a constant tension which we should always retain in our minds when performing our system design.</p>
<h3 id="data">Data</h3>
<p>It is difficult to overstate the importance of data. It is half of the equation for machine learning, but is often utterly neglected. I speculate that there are two reasons for this. Firstly, data cleaning is perceived as tedious. It doesn’t seem to consist of the same intellectual challenges that are inherent in constructing complex mathematical models and implementing them in code. Secondly, data cleaning is highly complex, it requires a deep understanding of how machine learning systems operate and good intuitions about the data itself, the domain from which data is drawn (e.g. Supply Chain) and what downstream problems might be caused by poor data quality.</p>
<p>A consequence of these two reasons is that data cleaning seems difficult to formulate into a readily teachable set of principles. As a result it is heavily neglected in courses on machine learning and data science. Despite data being half the equation, most University courses spend little to no time on its challenges.</p>
<p>Anecdotally, when talking to data modelling scientists, most say they spend 80% of their time acquiring and cleaning data. This is precipitating what I refer to as the “data crisis”. This is an analogy with software. The “software crisis” was the phenomenon of the inability to deliver software solutions due to the increasing complexity of implementation. There was no single-shot solution for the software crisis; it involved better practice (scrum, test-orientated development, sprints, code review), improved programming paradigms (object-orientated, functional) and better tools (CVS, then SVN, then git).</p>
<p>However, these challenges aren't new, they are merely taking a different form. From the computer's perspective software <em>is</em> data. The first wave of the data crisis was known as the <em>software crisis</em>.</p>
<h3 id="the-software-crisis">The Software Crisis</h3>
<blockquote>
<p>The major cause of the software crisis is that the machines have become several orders of magnitude more powerful! To put it quite bluntly: as long as there were no machines, programming was no problem at all; when we had a few weak computers, programming became a mild problem, and now we have gigantic computers, programming has become an equally gigantic problem.</p>
<p>Edsger Dijkstra (1930-2002), The Humble Programmer</p>
</blockquote>
<p>In the late sixties early software programmers made note of the increasing costs of software development and termed the challenges associated with it as the "<a href="https://en.wikipedia.org/wiki/Software_crisis">Software Crisis</a>". Edsger Dijkstra referred to the crisis in his 1972 Turing Award winner's address.</p>
<h3 id="the-mordern-data-crisis">The Modern Data Crisis</h3>
<blockquote>
<p>The major cause of the data crisis is that machines have become more interconnected than ever before. Data access is therefore cheap, but data quality is often poor. What we need is cheap high quality data. That implies that we develop processes for improving and verifying data quality that are efficient.</p>
<p>There would seem to be two ways for improving efficiency. Firstly, we should not duplicate work. Secondly, where possible we should automate work.</p>
</blockquote>
<p>What I term "The Data Crisis" is the modern equivalent of this problem. The quantity of modern data, the lack of attention paid to data as it is initially "laid down", and the costs of data cleaning are bringing about a crisis in data-driven decision making.</p>
<p>Just as with software, the crisis is most correctly addressed by 'scaling' the manner in which we process our data. Duplication of work occurs because the value of data cleaning is not correctly recognised in management decision making processes. Automation of work is increasingly possible through techniques in "artificial intelligence", but this will also require better management of the data science pipeline so that data about data science (meta-data science) can be correctly assimilated and processed. The Alan Turing institute has a program focussed on this area, <a href="https://www.turing.ac.uk/research_projects/artificial-intelligence-data-analytics/">AI for Data Analytics</a>.</p>
<p>Data is the new software, and the data crisis is already upon us. It is driven by the cost of cleaning data, the paucity of tools for monitoring and maintaining our deployments, and the challenge of tracking the provenance of our models (e.g. with respect to the data they’re trained on).</p>
<p>Three principal changes need to occur in response. They are cultural and infrastructural.</p>
<p>First of all, to excel in data driven decision making we need to move from a <em>software first</em> paradigm to a <em>data first</em> paradigm. That means refocusing on data as the product. Software is the intermediary to producing the data, and its quality standards must be maintained, but not at the expense of the data we are producing. Data cleaning and maintenance need to be prized as highly as software debugging and maintenance. Instead of <em>software</em> as a service, we should refocus around <em>data</em> as a service. This first change is a cultural change in which our teams think about their outputs in terms of data. Instead of decomposing our systems around the software components, we need to decompose them around the data generating and consuming components.<a href="#fn2" class="footnoteRef" id="fnref2"><sup>2</sup></a> Software first is only an intermediate step on the way to becoming <em>data first</em>. It is a necessary, but not a sufficient condition for efficient machine learning systems design and deployment.</p>
<p>Secondly, we need to improve our language around data quality. We cannot assess the costs of improving data quality unless we generate a language around what data quality means. Data Readiness Levels are an assessment of data quality that is based on the usage to which data is put.</p>
<p>Thirdly, we need to improve our mental model of the separation of data science from applied science. A common trap in our thinking around data is to see data science (and data engineering, data preparation) as a sub-set of the software engineer’s or applied scientist’s skill set. As a result we recruit and deploy the wrong type of resource. Data preparation and question formulation is superficially similar to both because of the need for programming skills, but the day to day problems faced are very different.</p>
<p><strong>Recommendation</strong>: Build a shared understanding of the language of data readiness levels for use in planning documents and costing of data cleaning and the benefits of reusing cleaned data.</p>
<h2 id="data-readiness-levels">Data Readiness Levels</h2>
<p><a href="http://inverseprobability.com/2017/01/12/data-readiness-levels">Data Readiness Levels</a> <span class="citation">(Lawrence, 2017)</span> are an attempt to develop a language around data quality that can bridge the gap between technical solutions and decision makers such as managers and project planners. They are inspired by Technology Readiness Levels, which attempt to quantify the readiness of technologies for deployment.</p>
<p>Data-readiness describes, at its coarsest level, three separate stages of data graduation.</p>
<ul>
<li><p>Grade C - accessibility</p></li>
<li><p>Grade B - validity</p></li>
<li><p>Grade A - usability</p></li>
</ul>
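<p>One lightweight way to embed this vocabulary in planning and tooling is to tag each dataset with its grade. The class names and fields below are illustrative assumptions, not part of the Data Readiness Levels proposal itself.</p>

```python
# Sketch: tagging datasets with their data readiness grade so the shared
# vocabulary can be used in planning tools. Names are illustrative.
from dataclasses import dataclass
from enum import Enum

class Readiness(Enum):
    C = "accessibility"  # can we get at the data at all?
    B = "validity"       # does the data represent what it purports to?
    A = "usability"      # is the data fit for a specific task?

@dataclass
class Dataset:
    name: str
    grade: Readiness

    def ready_for_modelling(self):
        # Only Grade A data should meet a learning algorithm.
        return self.grade is Readiness.A

orders = Dataset("order-history", Readiness.B)
print(orders.ready_for_modelling())  # False
```
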
<h3 id="accessibility-grade-c">Accessibility: Grade C</h3>
<p>The first grade refers to the accessibility of data. Most data science practitioners will be used to working with data-providers who, perhaps having had little experience of data-science before, state that they "have the data". More often than not, they have not verified this. A convenient term for this is "Hearsay Data", someone has <em>heard</em> that they have the data so they <em>say</em> they have it. This is the lowest grade of data readiness.</p>
<p>Progressing through Grade C involves ensuring that this data is accessible. Not just in terms of digital accessibility, but also for regulatory, ethical and intellectual property reasons.</p>
<h3 id="validity-grade-b">Validity: Grade B</h3>
<p>Data transits from Grade C to Grade B once we can begin digital analysis on the computer. Once the challenges of access to the data have been resolved, we can make the data available either via API, or for direct loading into analysis software (such as Python, R, Matlab, Mathematica or SPSS). Once this has occurred the data is at B4 level. Grade B involves the <em>validity</em> of the data. Does the data really represent what it purports to? There are challenges such as missing values, outliers, record duplication. Each of these needs to be investigated.</p>
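<p>The validity checks named above (missing values, outliers, record duplication) are exactly the kind of work that can be scripted and documented for reuse. The record format and thresholds in this sketch are illustrative assumptions.</p>

```python
# Sketch of Grade B validity checks: missing values, record duplication and
# a simple outlier flag. The record format is an illustrative assumption.

def validity_report(records, field, outlier_threshold=3.0):
    values = [r[field] for r in records if r.get(field) is not None]
    n_missing = len(records) - len(values)
    # Count exact duplicate records by hashing their sorted items.
    n_duplicates = len(records) - len({tuple(sorted(r.items())) for r in records})
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    outliers = [v for v in values
                if std > 0 and abs(v - mean) / std > outlier_threshold]
    return {"missing": n_missing, "duplicates": n_duplicates,
            "outliers": outliers}

records = [
    {"id": 1, "delivery_days": 2},
    {"id": 2, "delivery_days": 3},
    {"id": 3, "delivery_days": None},
    {"id": 3, "delivery_days": None},  # duplicated record
]
report = validity_report(records, "delivery_days")
print(report)
```

<p>Documenting reports like this alongside the data is what makes the Grade B labour reusable by other teams.</p>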
<p>Grades B and C are important because, if the work done in these grades is well documented, it can be reused in other projects. Reuse of this labour is key to reducing the costs of data-driven automated decision making. There is a strong overlap between the work required in this grade and the statistical field of <a href="https://en.wikipedia.org/wiki/Exploratory_data_analysis"><em>exploratory data analysis</em></a> <span class="citation">(Tukey, 1977)</span>.</p>
<p>The need for Grade B emerges due to the fundamental change in the availability of data. Classically, the scientific question came first, and the data came later. This is still the approach in a randomized control trial, e.g. in A/B testing or clinical trials for drugs. Today data is being laid down by happenstance, and the question we wish to ask about the data often comes after the data has been created. The Grade B of data readiness ensures thought can be put into data quality <em>before</em> the question is defined. It is this work that is reusable across multiple teams. It is these processes that the team which is <em>standing up</em> the data must deliver.</p>
<h3 id="usability-grade-a">Usability: Grade A</h3>
<p>Once the validity of the data is determined, the data set can be considered for use in a particular task. This stage of data readiness is more akin to what machine learning scientists are used to doing in Universities. Bringing an algorithm to bear on a well understood data set.</p>
<p>In Grade A we are concerned about the utility of the data given a particular task. Grade A may involve additional data collection (experimental design in statistics) to ensure that the task is fulfilled.</p>
<p>This is the stage where the data and the model are brought together, so expertise in learning algorithms and their application is key. Further ethical considerations, such as the fairness of the resulting predictions are required at this stage. At the end of this stage a prototype model is ready for deployment.</p>
<p>Deployment and maintenance of machine learning models in production is another important issue, to which Data Readiness Levels are only part of the solution.</p>
<h3 id="recursive-effects">Recursive Effects</h3>
<p>To find out more, or to contribute ideas go to <a href="http://data-readiness.org" class="uri">http://data-readiness.org</a></p>
<p>Throughout the data preparation pipeline, it is important to have close interaction between data scientists and application domain experts. Decisions on data preparation taken outside the context of application have dangerous downstream consequences. This provides an additional burden on the data scientist as they are required for each project, but it should also be seen as a learning and familiarization exercise for the domain expert. Long term, just as biologists have found it necessary to assimilate the skills of the bioinformatician to be effective in their science, most domains will also require a familiarity with the nature of data driven decision making and its application. Working closely with data-scientists on data preparation is one way to begin this sharing of best practice.</p>
<p>The processes involved in Grade C and B are often badly taught in courses on data science. Perhaps not due to a lack of interest in the areas, but maybe more due to a lack of access to real world examples where data quality is poor.</p>
<p>These stages of data science are also ridden with ambiguity. In the long term they could do with more formalization, and automation, but best practice needs to be understood by a wider community before that can happen.</p>
<h3 id="combining-data-and-systems-design">Combining Data and Systems Design</h3>
<p>One analogy I find helpful for understanding the depth of change we need is the following. Imagine as an engineer, you find a USB stick on the ground. And for some reason you <em>know</em> that on that USB stick is a particular API call that will enable you to make a significant positive difference on a business problem. However, you also know on that USB stick there is potentially malicious code. The most secure thing to do would be to <em>not</em> introduce this code into your production system. But what if your manager told you to do so, how would you go about incorporating this code base?</p>
<p>The answer is <em>very</em> carefully. You would have to engage in a process more akin to debugging than regular software engineering. As you understood the code base, for your work to be reproducible, you should be documenting it, not just what you discovered, but how you discovered it. In the end, you typically find a single API call that is the one that most benefits your system. But more thought has been placed into this line of code than any line of code you have written before.</p>
<p>Even then, when your API code is introduced into your production system, it needs to be deployed in an environment that monitors it. We cannot rely on an individual’s decision making to ensure the quality of all our systems. We need to create an environment that includes quality controls, checks and bounds, tests, all designed to ensure that assumptions made about this foreign code base are remaining valid.</p>
<p>This situation is akin to what we are doing when we incorporate data in our production systems. When we are consuming data from others, we cannot assume that it has been produced in alignment with our goals for our own systems. Worst case, it may have been adversarially produced. A further challenge is that data is dynamic. So, in effect, the code on the USB stick is evolving over time.</p>
<p>Anecdotally, resolving a machine learning challenge requires 80% of the resource to be focused on the data and perhaps 20% to be focused on the model. But many companies are too keen to employ machine learning engineers who focus on the models, not the data.</p>
<div style="text-align:center">
<img class="" src="../slides/diagrams/data-science/water-bridge-hill-transport-arch-calm-544448-pxhere.com.jpg" width="80%" height="auto" align="" style="background:none; border:none; box-shadow:none; display:block; margin-left:auto; margin-right:auto">
</div>
<center>
<em>A reservoir of data has more value if the data is consumable. The data crisis can only be addressed if we focus on outputs rather than inputs. </em>
</center>
<div style="text-align:center">
<img class="" src="../slides/diagrams/data-science/1024px-Lake_District_picture.JPG" width="80%" height="auto" align="" style="background:none; border:none; box-shadow:none; display:block; margin-left:auto; margin-right:auto">
</div>
<center>
<em>For a data first architecture we need to clean our data at source, rather than individually cleaning data for each task. This involves a shift of focus from our inputs to our outputs. </em>
</center>
<p><strong>Recommendation</strong>: We need to share best practice around data deployment across our teams. We should make best use of our processes where applicable, but we need to develop them to become <em>data first</em> organizations. Data needs to be cleaned at <em>output</em> not at <em>input</em>.</p>
<h3 id="deployment">Deployment</h3>
<h3 id="continuous-deployment">Continuous Deployment</h3>
<p>Once the design is complete, the model code needs to be deployed.</p>
<p>To extend our USB stick analogy further, how would we deploy that code if we thought it was likely to evolve in production? This is what data does. We cannot assume that the conditions under which we trained our model will be retained as we move forward, indeed the only constant we have is change.</p>
<p>This means that when any data dependent model is deployed into production, it requires <em>continuous monitoring</em> to ensure the assumptions of design have not been invalidated. Software changes are qualified through testing, in particular a regression test ensures that existing functionality is not broken by change. Since data is continually evolving, machine learning systems require continual regression testing: oversight by systems that ensure their existing functionality has not been broken as the world evolves around them. Unfortunately, standards around ML model deployment have not yet been developed. The modern world of continuous deployment does rely on testing, but it does not recognize the continuous evolution of the world around us.</p>
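<p>A minimal sketch of such continual regression testing is a monitor that compares incoming feature statistics against the training-time baseline and raises an alert when the world drifts. The statistics and threshold below are illustrative choices, not a prescribed standard.</p>

```python
# Sketch of continuous monitoring for a deployed model: compare a feature's
# batch statistics to the training baseline and alert on drift.
# The z-score statistic and threshold are illustrative assumptions.

class DriftMonitor:
    def __init__(self, baseline_mean, baseline_std, threshold=3.0):
        self.mean = baseline_mean
        self.std = baseline_std
        self.threshold = threshold

    def check(self, batch):
        # Alert if the batch mean sits many baseline deviations from
        # what the model saw at training time.
        batch_mean = sum(batch) / len(batch)
        z = abs(batch_mean - self.mean) / self.std
        return z > self.threshold

monitor = DriftMonitor(baseline_mean=10.0, baseline_std=2.0)
print(monitor.check([9.5, 10.2, 10.1]))   # in-distribution batch
print(monitor.check([25.0, 26.0, 24.5]))  # drifted batch
```

<p>A 'hypervisor' of the kind recommended below would consume alerts like these and decide whether retraining or restructuring is required.</p>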
<p>If the world has changed around our decision making ecosystem, how are we alerted to those changes?</p>
<p><strong>Recommendation</strong>: We establish best practice around model deployment. We need to shift our culture from standing up a software service, to standing up a <em>data service</em>. Data as a Service would involve continual monitoring of our deployed models in production. This would be regulated by 'hypervisor' systems<a href="#fn3" class="footnoteRef" id="fnref3"><sup>3</sup></a> that understand the context in which models are deployed and recognize when circumstance has changed and models need retraining or restructuring.</p>
<p><strong>Recommendation</strong>: We should consider a major re-architecting of systems around our services. In particular we should scope the use of a <em>streaming architecture</em> (such as Apache Kafka) that ensures data persistence and enables asynchronous operation of our systems.<a href="#fn4" class="footnoteRef" id="fnref4"><sup>4</sup></a> This would enable the provision of QC streams, and real time dash boards as well as hypervisors.</p>
<p>Importantly a streaming architecture implies the services we build are <em>stateless</em>, internal state is deployed on streams alongside external state. This allows for rapid assessment of other services' data.</p>
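<p>To illustrate what statelessness on a stream means in practice, here is a pure-Python stand-in for a Kafka-style log: the service holds no internal state, and any consumer (this service or another) can rebuild the current state by replaying the persisted events. The inventory example is an illustrative assumption.</p>

```python
# Sketch of a stateless streaming service: internal state lives on the
# event log, not in the service, so it can be rebuilt by replaying the
# stream. A plain list stands in for a Kafka topic.

def apply_event(state, event):
    # Pure state-transition function: inventory levels keyed by item.
    item, delta = event
    new_state = dict(state)
    new_state[item] = new_state.get(item, 0) + delta
    return new_state

def replay(stream):
    # Any consumer can recover the current state from the persisted
    # stream, enabling rapid assessment of other services' data.
    state = {}
    for event in stream:
        state = apply_event(state, event)
    return state

stream = [("widget", 5), ("gadget", 3), ("widget", -2)]
print(replay(stream))  # {'widget': 3, 'gadget': 3}
```

<p>Because the transition function is pure and the log is the source of truth, QC streams, dashboards and hypervisors can all derive their own views from the same events.</p>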
<h3 id="outlook-for-machine-learning">Outlook for Machine Learning</h3>
<p>Machine learning has risen to prominence as an approach to <em>scaling</em> our activities. For us to continue to automate in the manner we have over the last two decades, we need to make more use of computer-based automation. Machine learning is allowing us to automate processes that were out of reach before.</p>
<h3 id="conclusion">Conclusion</h3>
<p>We operate in a technologically evolving environment. Machine learning is becoming a key component in our decision-making capabilities, our intelligence and strategic command. However, just as technology drove changes in battlefield strategy, from the stalemate of the First World War, to the tank-dominated Blitzkrieg of the Second, to the asymmetric warfare of the present, our technology, tactics and strategies are also constantly evolving. Machine learning is part of that evolving solution, but the main challenge is not to become so fixated on the tactics of today that we miss the evolution of strategy that the technology is suggesting.</p>
<h3 id="references" class="unnumbered">References</h3>
<div id="refs" class="references">
<div id="ref-Lawrence:drl17">
<p>Lawrence, N.D., 2017. Data readiness levels. arXiv.</p>
</div>
<div id="ref-Popper:conjectures63">
<p>Popper, K.R., 1963. Conjectures and refutations: The growth of scientific knowledge. Routledge, London.</p>
</div>
<div id="ref-Tukey:exploratory77">
<p>Tukey, J.W., 1977. Exploratory data analysis. Addison-Wesley.</p>
</div>
</div>
<div class="footnotes">
<hr />
<ol>
<li id="fn1"><p>We can also become constrained by our tribal thinking, just as each of the other groups can.<a href="#fnref1">↩</a></p></li>
<li id="fn2"><p>This is related to machine learning and technical debt, although we are framing the solution here rather than the problem.<a href="#fnref2">↩</a></p></li>
<li id="fn3"><p>Emulation is one approach to forming such a hypervisor, because we can build emulators that operate at the meta level, not on the systems directly, but how they interact. Or emulators that monitor a simulation to ensure performance does not change dramatically. However, they are not the only approach. Using real time dashboards, anomaly detection and classical statistics are also applicable in this domain.<a href="#fnref3">↩</a></p></li>
<li id="fn4"><p>The Cambridge team has been exploring this area. We have a reference architecture, and are also considering how such a system could/should be extended for incorporation of simulation models.<a href="#fnref4">↩</a></p></li>
</ol>
</div>
Tue, 19 Feb 2019 00:00:00 +0000
http://inverseprobability.com/talks/notes/data-science-and-digital-systems.html
http://inverseprobability.com/talks/notes/data-science-and-digital-systems.htmlnotesData Science and Digital Systems<div style="display:none">
$$\newcommand{\Amatrix}{\mathbf{A}}
\newcommand{\KL}[2]{\text{KL}\left( #1\,\|\,#2 \right)}
\newcommand{\Kaast}{\kernelMatrix_{\mathbf{ \ast}\mathbf{ \ast}}}
\newcommand{\Kastu}{\kernelMatrix_{\mathbf{ \ast} \inducingVector}}
\newcommand{\Kff}{\kernelMatrix_{\mappingFunctionVector \mappingFunctionVector}}
\newcommand{\Kfu}{\kernelMatrix_{\mappingFunctionVector \inducingVector}}
\newcommand{\Kuast}{\kernelMatrix_{\inducingVector \bf\ast}}
\newcommand{\Kuf}{\kernelMatrix_{\inducingVector \mappingFunctionVector}}
\newcommand{\Kuu}{\kernelMatrix_{\inducingVector \inducingVector}}
\newcommand{\Kuui}{\Kuu^{-1}}
\newcommand{\Qaast}{\mathbf{Q}_{\bf \ast \ast}}
\newcommand{\Qastf}{\mathbf{Q}_{\ast \mappingFunction}}
\newcommand{\Qfast}{\mathbf{Q}_{\mappingFunctionVector \bf \ast}}
\newcommand{\Qff}{\mathbf{Q}_{\mappingFunctionVector \mappingFunctionVector}}
\newcommand{\aMatrix}{\mathbf{A}}
\newcommand{\aScalar}{a}
\newcommand{\aVector}{\mathbf{a}}
\newcommand{\acceleration}{a}
\newcommand{\bMatrix}{\mathbf{B}}
\newcommand{\bScalar}{b}
\newcommand{\bVector}{\mathbf{b}}
\newcommand{\basisFunc}{\phi}
\newcommand{\basisFuncVector}{\boldsymbol{ \basisFunc}}
\newcommand{\basisFunction}{\phi}
\newcommand{\basisLocation}{\mu}
\newcommand{\basisMatrix}{\boldsymbol{ \Phi}}
\newcommand{\basisScalar}{\basisFunction}
\newcommand{\basisVector}{\boldsymbol{ \basisFunction}}
\newcommand{\activationFunction}{\phi}
\newcommand{\activationMatrix}{\boldsymbol{ \Phi}}
\newcommand{\activationScalar}{\basisFunction}
\newcommand{\activationVector}{\boldsymbol{ \basisFunction}}
\newcommand{\bigO}{\mathcal{O}}
\newcommand{\binomProb}{\pi}
\newcommand{\cMatrix}{\mathbf{C}}
\newcommand{\cbasisMatrix}{\hat{\boldsymbol{ \Phi}}}
\newcommand{\cdataMatrix}{\hat{\dataMatrix}}
\newcommand{\cdataScalar}{\hat{\dataScalar}}
\newcommand{\cdataVector}{\hat{\dataVector}}
\newcommand{\centeredKernelMatrix}{\mathbf{ \MakeUppercase{\centeredKernelScalar}}}
\newcommand{\centeredKernelScalar}{b}
\newcommand{\centeredKernelVector}{\centeredKernelScalar}
\newcommand{\centeringMatrix}{\mathbf{H}}
\newcommand{\chiSquaredDist}[2]{\chi_{#1}^{2}\left(#2\right)}
\newcommand{\chiSquaredSamp}[1]{\chi_{#1}^{2}}
\newcommand{\conditionalCovariance}{\boldsymbol{ \Sigma}}
\newcommand{\coregionalizationMatrix}{\mathbf{B}}
\newcommand{\coregionalizationScalar}{b}
\newcommand{\coregionalizationVector}{\mathbf{ \coregionalizationScalar}}
\newcommand{\covDist}[2]{\text{cov}_{#2}\left(#1\right)}
\newcommand{\covSamp}[1]{\text{cov}\left(#1\right)}
\newcommand{\covarianceScalar}{c}
\newcommand{\covarianceVector}{\mathbf{ \covarianceScalar}}
\newcommand{\covarianceMatrix}{\mathbf{C}}
\newcommand{\covarianceMatrixTwo}{\boldsymbol{ \Sigma}}
\newcommand{\croupierScalar}{s}
\newcommand{\croupierVector}{\mathbf{ \croupierScalar}}
\newcommand{\croupierMatrix}{\mathbf{ \MakeUppercase{\croupierScalar}}}
\newcommand{\dataDim}{p}
\newcommand{\dataIndex}{i}
\newcommand{\dataIndexTwo}{j}
\newcommand{\dataMatrix}{\mathbf{Y}}
\newcommand{\dataScalar}{y}
\newcommand{\dataSet}{\mathcal{D}}
\newcommand{\dataStd}{\sigma}
\newcommand{\dataVector}{\mathbf{ \dataScalar}}
\newcommand{\decayRate}{d}
\newcommand{\degreeMatrix}{\mathbf{ \MakeUppercase{\degreeScalar}}}
\newcommand{\degreeScalar}{d}
\newcommand{\degreeVector}{\mathbf{ \degreeScalar}}
% Already defined by latex
%\newcommand{\det}[1]{\left|#1\right|}
\newcommand{\diag}[1]{\text{diag}\left(#1\right)}
\newcommand{\diagonalMatrix}{\mathbf{D}}
\newcommand{\diff}[2]{\frac{\text{d}#1}{\text{d}#2}}
\newcommand{\diffTwo}[2]{\frac{\text{d}^2#1}{\text{d}#2^2}}
\newcommand{\displacement}{x}
\newcommand{\displacementVector}{\textbf{\displacement}}
\newcommand{\distanceMatrix}{\mathbf{ \MakeUppercase{\distanceScalar}}}
\newcommand{\distanceScalar}{d}
\newcommand{\distanceVector}{\mathbf{ \distanceScalar}}
\newcommand{\eigenvaltwo}{\ell}
\newcommand{\eigenvaltwoMatrix}{\mathbf{L}}
\newcommand{\eigenvaltwoVector}{\mathbf{l}}
\newcommand{\eigenvalue}{\lambda}
\newcommand{\eigenvalueMatrix}{\boldsymbol{ \Lambda}}
\newcommand{\eigenvalueVector}{\boldsymbol{ \lambda}}
\newcommand{\eigenvector}{\mathbf{ \eigenvectorScalar}}
\newcommand{\eigenvectorMatrix}{\mathbf{U}}
\newcommand{\eigenvectorScalar}{u}
\newcommand{\eigenvectwo}{\mathbf{v}}
\newcommand{\eigenvectwoMatrix}{\mathbf{V}}
\newcommand{\eigenvectwoScalar}{v}
\newcommand{\entropy}[1]{\mathcal{H}\left(#1\right)}
\newcommand{\errorFunction}{E}
\newcommand{\expDist}[2]{\left<#1\right>_{#2}}
\newcommand{\expSamp}[1]{\left<#1\right>}
\newcommand{\expectation}[1]{\left\langle #1 \right\rangle }
\newcommand{\expectationDist}[2]{\left\langle #1 \right\rangle _{#2}}
\newcommand{\expectedDistanceMatrix}{\mathcal{D}}
\newcommand{\eye}{\mathbf{I}}
\newcommand{\fantasyDim}{r}
\newcommand{\fantasyMatrix}{\mathbf{ \MakeUppercase{\fantasyScalar}}}
\newcommand{\fantasyScalar}{z}
\newcommand{\fantasyVector}{\mathbf{ \fantasyScalar}}
\newcommand{\featureStd}{\varsigma}
\newcommand{\gammaCdf}[3]{\mathcal{GAMMA CDF}\left(#1|#2,#3\right)}
\newcommand{\gammaDist}[3]{\mathcal{G}\left(#1|#2,#3\right)}
\newcommand{\gammaSamp}[2]{\mathcal{G}\left(#1,#2\right)}
\newcommand{\gaussianDist}[3]{\mathcal{N}\left(#1|#2,#3\right)}
\newcommand{\gaussianSamp}[2]{\mathcal{N}\left(#1,#2\right)}
\newcommand{\given}{|}
\newcommand{\half}{\frac{1}{2}}
\newcommand{\heaviside}{H}
\newcommand{\hiddenMatrix}{\mathbf{ \MakeUppercase{\hiddenScalar}}}
\newcommand{\hiddenScalar}{h}
\newcommand{\hiddenVector}{\mathbf{ \hiddenScalar}}
\newcommand{\identityMatrix}{\eye}
\newcommand{\inducingInputScalar}{z}
\newcommand{\inducingInputVector}{\mathbf{ \inducingInputScalar}}
\newcommand{\inducingInputMatrix}{\mathbf{Z}}
\newcommand{\inducingScalar}{u}
\newcommand{\inducingVector}{\mathbf{ \inducingScalar}}
\newcommand{\inducingMatrix}{\mathbf{U}}
\newcommand{\inlineDiff}[2]{\text{d}#1/\text{d}#2}
\newcommand{\inputDim}{q}
\newcommand{\inputMatrix}{\mathbf{X}}
\newcommand{\inputScalar}{x}
\newcommand{\inputSpace}{\mathcal{X}}
\newcommand{\inputVals}{\inputVector}
\newcommand{\inputVector}{\mathbf{ \inputScalar}}
\newcommand{\iterNum}{k}
\newcommand{\kernel}{\kernelScalar}
\newcommand{\kernelMatrix}{\mathbf{K}}
\newcommand{\kernelScalar}{k}
\newcommand{\kernelVector}{\mathbf{ \kernelScalar}}
\newcommand{\kff}{\kernelScalar_{\mappingFunction \mappingFunction}}
\newcommand{\kfu}{\kernelVector_{\mappingFunction \inducingScalar}}
\newcommand{\kuf}{\kernelVector_{\inducingScalar \mappingFunction}}
\newcommand{\kuu}{\kernelVector_{\inducingScalar \inducingScalar}}
\newcommand{\lagrangeMultiplier}{\lambda}
\newcommand{\lagrangeMultiplierMatrix}{\boldsymbol{ \Lambda}}
\newcommand{\lagrangian}{L}
\newcommand{\laplacianFactor}{\mathbf{ \MakeUppercase{\laplacianFactorScalar}}}
\newcommand{\laplacianFactorScalar}{m}
\newcommand{\laplacianFactorVector}{\mathbf{ \laplacianFactorScalar}}
\newcommand{\laplacianMatrix}{\mathbf{L}}
\newcommand{\laplacianScalar}{\ell}
\newcommand{\laplacianVector}{\mathbf{ \ell}}
\newcommand{\latentDim}{q}
\newcommand{\latentDistanceMatrix}{\boldsymbol{ \Delta}}
\newcommand{\latentDistanceScalar}{\delta}
\newcommand{\latentDistanceVector}{\boldsymbol{ \delta}}
\newcommand{\latentForce}{f}
\newcommand{\latentFunction}{u}
\newcommand{\latentFunctionVector}{\mathbf{ \latentFunction}}
\newcommand{\latentFunctionMatrix}{\mathbf{ \MakeUppercase{\latentFunction}}}
\newcommand{\latentIndex}{j}
\newcommand{\latentScalar}{z}
\newcommand{\latentVector}{\mathbf{ \latentScalar}}
\newcommand{\latentMatrix}{\mathbf{Z}}
\newcommand{\learnRate}{\eta}
\newcommand{\lengthScale}{\ell}
\newcommand{\rbfWidth}{\ell}
\newcommand{\likelihoodBound}{\mathcal{L}}
\newcommand{\likelihoodFunction}{L}
\newcommand{\locationScalar}{\mu}
\newcommand{\locationVector}{\boldsymbol{ \locationScalar}}
\newcommand{\locationMatrix}{\mathbf{M}}
\newcommand{\variance}[1]{\text{var}\left( #1 \right)}
\newcommand{\mappingFunction}{f}
\newcommand{\mappingFunctionMatrix}{\mathbf{F}}
\newcommand{\mappingFunctionTwo}{g}
\newcommand{\mappingFunctionTwoMatrix}{\mathbf{G}}
\newcommand{\mappingFunctionTwoVector}{\mathbf{ \mappingFunctionTwo}}
\newcommand{\mappingFunctionVector}{\mathbf{ \mappingFunction}}
\newcommand{\scaleScalar}{s}
\newcommand{\mappingScalar}{w}
\newcommand{\mappingVector}{\mathbf{ \mappingScalar}}
\newcommand{\mappingMatrix}{\mathbf{W}}
\newcommand{\mappingScalarTwo}{v}
\newcommand{\mappingVectorTwo}{\mathbf{ \mappingScalarTwo}}
\newcommand{\mappingMatrixTwo}{\mathbf{V}}
\newcommand{\maxIters}{K}
\newcommand{\meanMatrix}{\mathbf{M}}
\newcommand{\meanScalar}{\mu}
\newcommand{\meanTwoMatrix}{\mathbf{M}}
\newcommand{\meanTwoScalar}{m}
\newcommand{\meanTwoVector}{\mathbf{ \meanTwoScalar}}
\newcommand{\meanVector}{\boldsymbol{ \meanScalar}}
\newcommand{\mrnaConcentration}{m}
\newcommand{\naturalFrequency}{\omega}
\newcommand{\neighborhood}[1]{\mathcal{N}\left( #1 \right)}
\newcommand{\neilurl}{http://inverseprobability.com/}
\newcommand{\noiseMatrix}{\boldsymbol{ E}}
\newcommand{\noiseScalar}{\epsilon}
\newcommand{\noiseVector}{\boldsymbol{ \epsilon}}
\newcommand{\norm}[1]{\left\Vert #1 \right\Vert}
\newcommand{\normalizedLaplacianMatrix}{\hat{\mathbf{L}}}
\newcommand{\normalizedLaplacianScalar}{\hat{\ell}}
\newcommand{\normalizedLaplacianVector}{\hat{\mathbf{ \ell}}}
\newcommand{\numActive}{m}
\newcommand{\numBasisFunc}{m}
\newcommand{\numComponents}{m}
\newcommand{\numComps}{K}
\newcommand{\numData}{n}
\newcommand{\numFeatures}{K}
\newcommand{\numHidden}{h}
\newcommand{\numInducing}{m}
\newcommand{\numLayers}{\ell}
\newcommand{\numNeighbors}{K}
\newcommand{\numSequences}{s}
\newcommand{\numSuccess}{s}
\newcommand{\numTasks}{m}
\newcommand{\numTime}{T}
\newcommand{\numTrials}{S}
\newcommand{\outputIndex}{j}
\newcommand{\paramVector}{\boldsymbol{ \theta}}
\newcommand{\parameterMatrix}{\boldsymbol{ \Theta}}
\newcommand{\parameterScalar}{\theta}
\newcommand{\parameterVector}{\boldsymbol{ \parameterScalar}}
\newcommand{\partDiff}[2]{\frac{\partial#1}{\partial#2}}
\newcommand{\precisionScalar}{j}
\newcommand{\precisionVector}{\mathbf{ \precisionScalar}}
\newcommand{\precisionMatrix}{\mathbf{J}}
\newcommand{\pseudotargetScalar}{\widetilde{y}}
\newcommand{\pseudotargetVector}{\mathbf{ \pseudotargetScalar}}
\newcommand{\pseudotargetMatrix}{\mathbf{ \widetilde{Y}}}
\newcommand{\rank}[1]{\text{rank}\left(#1\right)}
\newcommand{\rayleighDist}[2]{\mathcal{R}\left(#1|#2\right)}
\newcommand{\rayleighSamp}[1]{\mathcal{R}\left(#1\right)}
\newcommand{\responsibility}{r}
\newcommand{\rotationScalar}{r}
\newcommand{\rotationVector}{\mathbf{ \rotationScalar}}
\newcommand{\rotationMatrix}{\mathbf{R}}
\newcommand{\sampleCovScalar}{s}
\newcommand{\sampleCovVector}{\mathbf{ \sampleCovScalar}}
\newcommand{\sampleCovMatrix}{\mathbf{s}}
\newcommand{\scalarProduct}[2]{\left\langle{#1},{#2}\right\rangle}
\newcommand{\sign}[1]{\text{sign}\left(#1\right)}
\newcommand{\sigmoid}[1]{\sigma\left(#1\right)}
\newcommand{\singularvalue}{\ell}
\newcommand{\singularvalueMatrix}{\mathbf{L}}
\newcommand{\singularvalueVector}{\mathbf{l}}
\newcommand{\sorth}{\mathbf{u}}
\newcommand{\spar}{\lambda}
\newcommand{\trace}[1]{\text{tr}\left(#1\right)}
\newcommand{\BasalRate}{B}
\newcommand{\DampingCoefficient}{C}
\newcommand{\DecayRate}{D}
\newcommand{\Displacement}{X}
\newcommand{\LatentForce}{F}
\newcommand{\Mass}{M}
\newcommand{\Sensitivity}{S}
\newcommand{\basalRate}{b}
\newcommand{\dampingCoefficient}{c}
\newcommand{\mass}{m}
\newcommand{\sensitivity}{s}
\newcommand{\springScalar}{\kappa}
\newcommand{\springVector}{\boldsymbol{ \kappa}}
\newcommand{\springMatrix}{\boldsymbol{ \mathcal{K}}}
\newcommand{\tfConcentration}{p}
\newcommand{\tfDecayRate}{\delta}
\newcommand{\tfMrnaConcentration}{f}
\newcommand{\tfVector}{\mathbf{ \tfConcentration}}
\newcommand{\velocity}{v}
\newcommand{\sufficientStatsScalar}{g}
\newcommand{\sufficientStatsVector}{\mathbf{ \sufficientStatsScalar}}
\newcommand{\sufficientStatsMatrix}{\mathbf{G}}
\newcommand{\switchScalar}{s}
\newcommand{\switchVector}{\mathbf{ \switchScalar}}
\newcommand{\switchMatrix}{\mathbf{S}}
\newcommand{\tr}[1]{\text{tr}\left(#1\right)}
\newcommand{\loneNorm}[1]{\left\Vert #1 \right\Vert_1}
\newcommand{\ltwoNorm}[1]{\left\Vert #1 \right\Vert_2}
\newcommand{\onenorm}[1]{\left\vert#1\right\vert_1}
\newcommand{\twonorm}[1]{\left\Vert #1 \right\Vert}
\newcommand{\vScalar}{v}
\newcommand{\vVector}{\mathbf{v}}
\newcommand{\vMatrix}{\mathbf{V}}
\newcommand{\varianceDist}[2]{\text{var}_{#2}\left( #1 \right)}
% Already defined by latex
%\newcommand{\vec}{#1:}
\newcommand{\vecb}[1]{\left(#1\right):}
\newcommand{\weightScalar}{w}
\newcommand{\weightVector}{\mathbf{ \weightScalar}}
\newcommand{\weightMatrix}{\mathbf{W}}
\newcommand{\weightedAdjacencyMatrix}{\mathbf{A}}
\newcommand{\weightedAdjacencyScalar}{a}
\newcommand{\weightedAdjacencyVector}{\mathbf{ \weightedAdjacencyScalar}}
\newcommand{\onesVector}{\mathbf{1}}
\newcommand{\zerosVector}{\mathbf{0}}
$$
</div>
<p><!--% not ipynb--></p>
<h2 id="the-gartner-hype-cycle">The Gartner Hype Cycle</h2>
<p><img class="negate" src="../slides/diagrams/Gartner_Hype_Cycle.png" width="70%" height="auto" align="center" style="background:none; border:none; box-shadow:none; display:block; margin-left:auto; margin-right:auto"></p>
<p>The <a href="https://en.wikipedia.org/wiki/Hype_cycle">Gartner Hype Cycle</a> tries to assess where an idea is in terms of maturity and adoption. It splits the evolution of a technology into a technological trigger, a peak of inflated expectations, followed by a trough of disillusionment and a final ascension into a useful technology. It looks rather like a classical control response settling to a final set point.</p>
<object class="svgplot" data="../slides/diagrams/data-science/ai-bd-dm-dl-ml-google-trends004.svg">
</object>
<center>
<em>Google trends for different technological terms on the hype cycle. </em>
</center>
<p>Google trends gives us insight into how far along various technological terms are on the hype cycle.</p>
<div style="text-align:center">
<img class="rotateimg90" src="../slides/diagrams/2017-10-12 16.47.34.jpg" width="50%" height="auto" align="center" style="background:none; border:none; box-shadow:none; display:block; margin-left:auto; margin-right:auto">
</div>
<div style="text-align:center">
<img class="center" src="../slides/diagrams/SteamEngine_Boulton&amp;Watt_1784_neg.png" width="50%" height="auto" align="" style="background:none; border:none; box-shadow:none; display:block; margin-left:auto; margin-right:auto">
</div>
<p>Machine learning allows us to extract knowledge from data to form a prediction.</p>
<p><span class="math display">\[ \text{data} + \text{model} \xrightarrow{\text{compute}} \text{prediction}\]</span></p>
<p>A machine learning prediction is made by combining a model with data to form the prediction. The manner in which this is done gives us the machine learning <em>algorithm</em>.</p>
<p>Machine learning models are <em>mathematical models</em> which make weak assumptions about data, e.g. smoothness assumptions. By combining these assumptions with the data we observe we can interpolate between data points or, occasionally, extrapolate into the future.</p>
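<p>The data + model &#x2192; prediction pattern can be made concrete with a minimal sketch. This is an illustrative example, not from the talk: the data values and the polynomial smoothness assumption are invented for the purpose of illustration.</p>

```python
import numpy as np

# data: noisy observations of an underlying smooth process (roughly y = x**2)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.1, 0.9, 4.2, 8.8, 16.1])

# model: a weak smoothness assumption, here "y is a low-order polynomial in x"
coeffs = np.polyfit(x, y, deg=2)

# compute: combine data and model to form a prediction at an unseen input,
# interpolating between the observed data points
prediction = np.polyval(coeffs, 2.5)
```

<p>Swapping the polynomial for a different function class changes the assumptions, and hence the algorithm, but the data + model &#x2192; prediction structure is unchanged.</p>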
<p>Machine learning is a technology which strongly overlaps with the methodology of statistics. From a historical/philosophical view point, machine learning differs from statistics in that the focus in the machine learning community has been primarily on accuracy of prediction, whereas the focus in statistics is typically on the interpretability of a model and/or validating a hypothesis through data collection.</p>
<p>The rapid increase in the availability of compute and data has led to the increased prominence of machine learning. This prominence is surfacing in two different, but overlapping domains: data science and artificial intelligence.</p>
<h3 id="artificial-intelligence-and-data-science">Artificial Intelligence and Data Science</h3>
<p>Artificial intelligence has the objective of endowing computers with human-like intelligent capabilities. For example, understanding an image (computer vision) or the contents of some speech (speech recognition), the meaning of a sentence (natural language processing) or the translation of a sentence (machine translation).</p>
<h3 id="supervised-learning-for-ai">Supervised Learning for AI</h3>
<p>The machine learning approach to artificial intelligence is to collect and annotate a large data set from humans. The problem is characterized by input data (e.g. a particular image) and a label (e.g. is there a car in the image, yes/no). The machine learning algorithm fits a mathematical function (I call this the <em>prediction function</em>) to map from the input image to the label. The parameters of the prediction function are set by minimizing an error between the function’s predictions and the true data. The mathematical function that encapsulates this error is known as the <em>objective function</em>.</p>
<p>This approach to machine learning is known as <em>supervised learning</em>. Various approaches to supervised learning use different prediction functions, objective functions or different optimization algorithms to fit them.</p>
<p>For example, <em>deep learning</em> makes use of <em>neural networks</em> to form the predictions. A neural network is a particular type of mathematical function that allows the algorithm designer to introduce invariances into the function.</p>
<p>An invariance is an important way of including prior understanding in a machine learning model. For example, in an image, a car is still a car regardless of whether it’s in the upper left or lower right corner of the image. This is known as translation invariance. A neural network encodes translation invariance in <em>convolutional layers</em>. Convolutional neural networks are widely used in image recognition tasks.</p>
<p>An alternative structure is known as a recurrent neural network (RNN). RNNs encode temporal structure. Because they use auto-regressive connections in their hidden layers, they can be seen as time series models with non-linear auto-regressive basis functions. They are widely used in speech recognition and machine translation.</p>
<p>Machine learning has been deployed in speech recognition (e.g. Alexa, which uses deep and convolutional neural networks for speech recognition) and in computer vision (e.g. Amazon Go, which uses convolutional neural networks for person recognition and pose detection).</p>
<p>The field of data science is related to AI, but philosophically different. It arises because we are increasingly creating large amounts of data through <em>happenstance</em> rather than active collection. In the modern era data is laid down by almost all our activities. The objective of data science is to extract insights from this data.</p>
<p>Classically, in the field of statistics, data analysis proceeds by assuming that the question (or scientific hypothesis) comes before the data is created. E.g., if I want to determine the effectiveness of a particular drug I perform a <em>design</em> for my data collection. I use foundational approaches such as randomization to account for confounders. This made a lot of sense in an era where data had to be actively collected. The reduction in cost of data collection and storage now means that many data sets are available which weren’t collected with a particular question in mind. This is a challenge because bias in the way data was acquired can corrupt the insights we derive. We can perform randomized controlled trials (referred to as A/B tests in modern internet parlance) to verify our conclusions, but the opportunity is to use data science techniques to better guide our question selection, or even to answer a question without the expense of a full trial.</p>
<h3 id="machine-learning-in-supply-chain">Machine Learning in Supply Chain</h3>
<p>Supply chain is a large scale automated decision making network. Our aim is to make decisions not only based on our models of customer behavior (as observed through data), but also by accounting for the structure of our fulfilment center, and delivery network.</p>
<p>Many of the most important questions in supply chain take the form of counterfactuals. E.g. “What would happen if we opened a manufacturing facility in Cambridge?” A counterfactual is a question that implies a mechanistic understanding of our systems. It goes beyond simple smoothness assumptions or translation invariances. It requires a physical, or <em>mechanistic</em>, understanding of the supply chain network. For this reason the type of models we deploy in supply chain often involve simulations or a more mechanistic understanding of the network.</p>
<p>In supply chain, machine learning alone is not enough: we need to bridge between models that contain real mechanisms and models that are entirely data driven.</p>
<p>This is challenging, because as we introduce more mechanism to the models we use, it becomes harder to develop efficient algorithms to match those models to data.</p>
<h3 id="operations-research-control-econometrics-statistics-and-machine-learning">Operations Research, Control, Econometrics, Statistics and Machine Learning</h3>
<p>Different academic fields are born in different eras, driven by different motivations and arrive at different solutions.</p>
<p>The separation between these fields can almost become tribal, and from one perspective this can be very helpful. Each tribe can agree on a common language, a common set of goals and a shared understanding of the approach they’ve chosen for those goals. This ensures that best practice can be developed and shared and as a result quality standards rise.</p>
<p>This is the nature of our <em>professions</em>. Medics, lawyers, engineers and accountants all have a system of shared best practice that they deploy efficiently in the resolution of a roughly standardized set of problems (a broken leg, defending a libel trial, bridging a river, ensuring finances are correct).</p>
<p>Control, statistics, economics and operations research are all established professions. Techniques are established, often at undergraduate level, and graduation to the profession is regulated by professional bodies. This system works well as long as the problems we face are easily categorized and mapped onto the existing set of known problems.</p>
<p>However, at another level our separate professions of OR, statistics and control engineering are just different views on the same problem. Just as any tribe of humans needs to eat and sleep, so do these professions depend on data, modelling, optimization and decision-making.</p>
<p>We are doing something that has never been done before, optimizing and evolving very large scale automated decision making networks. The ambition to scale and automate in a <em>data driven</em> manner means that a tribal approach to problem solving can hinder our progress. Any tribe of hunter-gatherers would struggle to understand the operation of a modern city. Similarly, supply chain needs to develop cross-functional skill sets to address the modern problems we face, not the problems that were formulated in the past.</p>
<p>Many of the challenges we face are at the interface between our tribal expertise. We have particular cost functions we are trying to minimize (an expertise of OR) but we have large scale feedbacks in our system (an expertise of control). We also want our systems to be adaptive to changing circumstances, to perform the best action given the data available (an expertise of machine learning and statistics).</p>
<p>Taking the tribal analogy further, we could imagine each of our professions as a separate tribe of hunter-gatherers, each with particular expertise (e.g. fishing, deer hunting, trapping). Each of these tribes has its own approach to eating to survive, just as each of our localized professions has its own approach to modelling. But in this analogy, the technological landscapes we face are not wildernesses, they are emerging metropolises. Our new task is to feed our population through a budding network of supermarkets. While we may be sourcing our food in the same way, this requires new types of thinking that don't belong in the pure domain of any of our existing tribes.</p>
<p>For our biggest challenges, focusing on the differences between these fields is unhelpful; we should consider their strengths and how they overlap. Fundamentally all these fields are focused on taking the right action given the information available to us. They need to work in <em>synergy</em> for us to make progress.</p>
<p><strong>Recommendation</strong>: We should be aware of the limitations of a single tribal view of any of our problem sets. Where our modelling is dominated by one perspective (e.g. economics, OR, control, ML) we should ensure cross fertilization of ideas occurs through scientific review and team rotation mechanisms that embed our scientists (for a short period) in different teams across our organizations.</p>
<h3 id="challenges-for-machine-learning-in-general">Challenges for Machine Learning in General</h3>
<p>We can characterize the challenges for integrating machine learning within our systems as the three Ds. Design, Data and Deployment.</p>
<p>The first two components, <em>design</em> and <em>data</em>, are interlinked, but we will first outline the design challenge. Below we will mainly focus on <em>supervised learning</em> because this is arguably the technology that is best understood within machine learning.</p>
<h3 id="design">Design</h3>
<p>Machine learning is not magical pixie dust, we cannot simply automate all decisions through data. We are constrained by our data (see below) and the models we use.<a href="#fn1" class="footnoteRef" id="fnref1"><sup>1</sup></a> Machine learning models are relatively simple function mappings that include characteristics such as smoothness. With some famous exceptions, e.g. speech and image data, inputs are constrained in the form of vectors and the model consists of a mathematically well behaved function. This means that some careful thought has to be put into choosing the right sub-process to automate with machine learning.</p>
<p>Any repetitive task is a candidate for automation, but many of the repetitive tasks we perform as humans are more complex than any individual algorithm can replace. The selection of which task to automate becomes critical and has downstream effects on our overall system design.</p>
<h3 id="pigeonholing">Pigeonholing</h3>
<div style="text-align:center">
<img class="" src="../slides/diagrams/TooManyPigeons.jpg" width="60%" height="auto" align="" style="background:none; border:none; box-shadow:none; display:block; margin-left:auto; margin-right:auto">
</div>
<center>
<em>The machine learning systems design process calls for decomposing a complex task into separate entities, a process we can think of as </em><a href="https://en.wikipedia.org/wiki/Pigeonholing">pigeonholing</a><em>.</em>
</center>
<p>Some aspects to take into account are</p>
<ol style="list-style-type: decimal">
<li>Can we refine the decision we need to a set of repetitive tasks where input information and output decision/value is well defined?</li>
<li>Can we represent the sub-task we’ve defined with a mathematical mapping?</li>
</ol>
<p>The design for the second task may involve massaging of the problem: feature selection or adaptation. It may also involve filtering out exception cases (perhaps through a pre-classification).</p>
<p>All else being equal, we’d like to keep our models simple and interpretable. If we can convert a complex mapping to a linear mapping through clever selection of sub-task and features, this is a big win.</p>
<p>For example, Facebook have <em>feature engineers</em>, individuals whose main role is to design features they think might be useful for one of their tasks (e.g. newsfeed ranking, or ad matching), and a training/testing pipeline called <a href="https://www.facebook.com/Engineering/posts/fblearner-flow-is-a-machine-learning-platform-capable-of-easily-reusing-algorith/10154077833317200/">FBLearner</a>. Facebook have predefined the sub-tasks they are interested in, and these are tightly connected to their business model. A challenge for companies that have a more diversified portfolio of activities is the identification of the most appropriate sub-task. A potential solution to feature and model selection is known as <em>AutoML</em>: using machine learning to assist machine learning. It is also called meta-learning, learning about learning. The input to the ML algorithm is a machine learning task, the output is a proposed model to solve the task.</p>
<p>One trap that is easy to fall into is placing too much emphasis on the type of model we have deployed rather than on the appropriateness of the task decomposition we have chosen.</p>
<p><strong>Recommendation</strong>: Conditioned on task decomposition, we should automate the process of model improvement. Model updates should not be discussed in management meetings, they should be deployed and updated as a matter of course. Further details below on model deployment, but model updating needs to be considered at design time.</p>
<div style="text-align:center">
<img class="" src="../slides/diagrams/ai/chicken-and-egg.jpg" width="" height="auto" align="" style="background:none; border:none; box-shadow:none; display:block; margin-left:auto; margin-right:auto">
</div>
<center>
<em>The answer to the question of which comes first, the chicken or the egg, is simple: they co-evolve <span class="citation">(Popper, 1963)</span>. Similarly, when we place components together in a complex machine learning system, they will tend to co-evolve and compensate for one another. </em>
</center>
<p>To form modern decision making systems, many components are interlinked. We decompose our complex decision making into individual tasks, but the performance of each component is dependent on those upstream of it.</p>
<p>This naturally leads to co-evolution of systems: upstream errors can be compensated for by downstream corrections.</p>
<p>To embrace this characteristic, end-to-end training could be considered. Why produce the best forecast by metrics when we can just produce the best forecast for our systems? End-to-end training can lead to improvements in performance, but it would also damage our system’s decomposability and its interpretability, and perhaps its adaptability.</p>
<p>The less human-interpretable our systems are, the harder they are to adapt to different circumstances or to diagnose when there’s a challenge. The trade-off between interpretability and performance is a constant tension which we should always keep in mind when designing our systems.</p>
<h3 id="data">Data</h3>
<p>It is difficult to overstate the importance of data. It is half of the equation for machine learning, but is often utterly neglected. I speculate that there are two reasons for this. Firstly, data cleaning is perceived as tedious. It doesn’t seem to consist of the same intellectual challenges that are inherent in constructing complex mathematical models and implementing them in code. Secondly, data cleaning is highly complex; it requires a deep understanding of how machine learning systems operate and good intuitions about the data itself, the domain from which data is drawn (e.g. Supply Chain) and what downstream problems might be caused by poor data quality.</p>
<p>A consequence of these two reasons is that data cleaning seems difficult to formulate into a readily teachable set of principles. As a result it is heavily neglected in courses on machine learning and data science. Despite data being half the equation, most university courses spend little to no time on its challenges.</p>
<p>Anecdotally, when talking to data scientists, most say they spend 80% of their time acquiring and cleaning data. This is precipitating what I refer to as the “data crisis”. This is an analogy with software. The “software crisis” was the inability to deliver software solutions due to the increasing complexity of implementation. There was no single-shot solution for the software crisis; it involved better practice (scrum, test-oriented development, sprints, code review), improved programming paradigms (object-oriented, functional) and better tools (CVS, then SVN, then git).</p>
<p>However, these challenges aren't new, they are merely taking a different form. From the computer's perspective software <em>is</em> data. The first wave of the data crisis was known as the <em>software crisis</em>.</p>
<h3 id="the-software-crisis">The Software Crisis</h3>
<blockquote>
<p>The major cause of the software crisis is that the machines have become several orders of magnitude more powerful! To put it quite bluntly: as long as there were no machines, programming was no problem at all; when we had a few weak computers, programming became a mild problem, and now we have gigantic computers, programming has become an equally gigantic problem.</p>
<p>Edsger Dijkstra (1930-2002), The Humble Programmer</p>
</blockquote>
<p>In the late sixties, early software programmers noted the increasing costs of software development and termed the associated challenges the "<a href="https://en.wikipedia.org/wiki/Software_crisis">Software Crisis</a>". Edsger Dijkstra referred to the crisis in his 1972 Turing Award winner's address.</p>
<h3 id="the-mordern-data-crisis">The Modern Data Crisis</h3>
<blockquote>
<p>The major cause of the data crisis is that machines have become more interconnected than ever before. Data access is therefore cheap, but data quality is often poor. What we need is cheap high quality data. That implies that we develop processes for improving and verifying data quality that are efficient.</p>
<p>There would seem to be two ways for improving efficiency. Firstly, we should not duplicate work. Secondly, where possible we should automate work.</p>
</blockquote>
<p>What I term "The Data Crisis" is the modern equivalent of this problem. The quantity of modern data, the lack of attention paid to data as it is initially "laid down", and the costs of data cleaning are bringing about a crisis in data-driven decision making.</p>
<p>Just as with software, the crisis is most correctly addressed by 'scaling' the manner in which we process our data. Duplication of work occurs because the value of data cleaning is not correctly recognised in management decision making processes. Automation of work is increasingly possible through techniques in "artificial intelligence", but this will also require better management of the data science pipeline so that data about data science (meta-data science) can be correctly assimilated and processed. The Alan Turing Institute has a program focussed on this area, <a href="https://www.turing.ac.uk/research_projects/artificial-intelligence-data-analytics/">AI for Data Analytics</a>.</p>
<p>Data is the new software, and the data crisis is already upon us. It is driven by the cost of cleaning data, the paucity of tools for monitoring and maintaining our deployments, and the difficulty of tracking the provenance of our models (e.g. with respect to the data they’re trained on).</p>
<p>Three principal changes need to occur in response. They are cultural and infrastructural.</p>
<p>First of all, to excel in data driven decision making we need to move from a <em>software first</em> paradigm to a <em>data first</em> paradigm. That means refocusing on data as the product. Software is the intermediary to producing the data, and its quality standards must be maintained, but not at the expense of the data we are producing. Data cleaning and maintenance need to be prized as highly as software debugging and maintenance. Instead of <em>software</em> as a service, we should refocus around <em>data</em> as a service. This first change is a cultural change in which our teams think about their outputs in terms of data. Instead of decomposing our systems around the software components, we need to decompose them around the data generating and consuming components.<a href="#fn2" class="footnoteRef" id="fnref2"><sup>2</sup></a> Software first is only an intermediate step on the way to becoming <em>data first</em>. It is a necessary, but not sufficient, condition for efficient machine learning systems design and deployment.</p>
<p>Secondly, we need to improve our language around data quality. We cannot assess the costs of improving data quality unless we generate a language around what data quality means. Data Readiness Levels are an assessment of data quality that is based on the usage to which data is put.</p>
<p>Thirdly, we need to improve our mental model of the separation of data science from applied science. A common trap in our thinking around data is to see data science (and data engineering, data preparation) as a sub-set of the software engineer’s or applied scientist’s skill set. As a result we recruit and deploy the wrong type of resource. Data preparation and question formulation is superficially similar to both because of the need for programming skills, but the day to day problems faced are very different.</p>
<p><strong>Recommendation</strong>: Build a shared understanding of the language of data readiness levels for use in planning documents and costing of data cleaning and the benefits of reusing cleaned data.</p>
<h2 id="data-readiness-levels">Data Readiness Levels</h2>
<p><a href="http://inverseprobability.com/2017/01/12/data-readiness-levels">Data Readiness Levels</a> <span class="citation">(Lawrence, 2017)</span> are an attempt to develop a language around data quality that can bridge the gap between technical solutions and decision makers such as managers and project planners. They are inspired by Technology Readiness Levels, which attempt to quantify the readiness of technologies for deployment.</p>
<p>Data-readiness describes, at its coarsest level, three separate stages of data graduation.</p>
<ul>
<li><p>Grade C - accessibility</p></li>
<li><p>Grade B - validity</p></li>
<li><p>Grade A - usability</p></li>
</ul>
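<p>These three coarse grades can be made concrete as a small data structure. A minimal sketch, assuming we track only the A/B/C bands (finer sub-levels, such as the B4 level mentioned below, are omitted); the helper function name is illustrative:</p>

```python
# Sketch of recording data readiness for a data set; grades follow the
# three bands described in the text. All other names are illustrative.
from enum import Enum

class DataReadiness(Enum):
    C = "accessibility"  # is the data really there, and may we use it?
    B = "validity"       # does the data represent what it purports to?
    A = "usability"      # is the data fit for this particular task?

def can_start_modelling(grade: DataReadiness) -> bool:
    """Model building should only begin once validity work is complete."""
    return grade is DataReadiness.A

print(can_start_modelling(DataReadiness.B))  # False
print(can_start_modelling(DataReadiness.A))  # True
```

One benefit of even this crude encoding is that readiness becomes something that can be stated in planning documents and checked in code, rather than asserted verbally.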
<h3 id="accessibility-grade-c">Accessibility: Grade C</h3>
<p>The first grade refers to the accessibility of data. Most data science practitioners will be used to working with data-providers who, perhaps having had little experience of data-science before, state that they "have the data". More often than not, they have not verified this. A convenient term for this is "Hearsay Data", someone has <em>heard</em> that they have the data so they <em>say</em> they have it. This is the lowest grade of data readiness.</p>
<p>Progressing through Grade C involves ensuring that this data is accessible: not just in terms of digital accessibility, but also for regulatory, ethical and intellectual property reasons.</p>
<h3 id="validity-grade-b">Validity: Grade B</h3>
<p>Data transits from Grade C to Grade B once we can begin digital analysis on the computer. Once the challenges of access to the data have been resolved, we can make the data available either via API, or for direct loading into analysis software (such as Python, R, Matlab, Mathematica or SPSS). Once this has occurred the data is at B4 level. Grade B involves the <em>validity</em> of the data. Does the data really represent what it purports to? There are challenges such as missing values, outliers, record duplication. Each of these needs to be investigated.</p>
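<p>The validity checks named above (missing values, outliers, record duplication) can be sketched as a simple report. The record format, the duplicate rule and the median-based outlier threshold are illustrative choices, not a prescription:</p>

```python
# Sketch of Grade-B validity checks on a list of dict records.
# The field names, data and thresholds are all illustrative.
from statistics import median

def validity_report(records, field):
    """Count missing values and duplicate records; flag crude outliers."""
    values = [r[field] for r in records if r.get(field) is not None]
    missing = len(records) - len(values)
    # Exact-duplicate records, detected via a canonical tuple form.
    duplicates = len(records) - len({tuple(sorted(r.items())) for r in records})
    # Median-absolute-deviation rule: robust to the outliers themselves.
    med = median(values)
    mad = median(abs(v - med) for v in values) or 1.0
    outliers = [v for v in values if abs(v - med) / mad > 10]
    return {"missing": missing, "duplicates": duplicates, "outliers": outliers}

records = [
    {"id": 1, "weight": 70.0},
    {"id": 2, "weight": None},   # missing value
    {"id": 1, "weight": 70.0},   # duplicated record
    {"id": 3, "weight": 68.0},
    {"id": 4, "weight": 900.0},  # implausible weight
]
print(validity_report(records, "weight"))
```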
<p>Grades B and C are important because, if the work done in these grades is documented well, it can be reused in other projects. Reuse of this labour is key to reducing the costs of data-driven automated decision making. There is a strong overlap between the work required in this grade and the statistical field of <a href="https://en.wikipedia.org/wiki/Exploratory_data_analysis"><em>exploratory data analysis</em></a> <span class="citation">(Tukey, 1977)</span>.</p>
<p>The need for Grade B emerges due to the fundamental change in the availability of data. Classically, the scientific question came first, and the data came later. This is still the approach in a randomized controlled trial, e.g. in A/B testing or clinical trials for drugs. Today data is being laid down by happenstance, and the question we wish to ask about the data often comes after the data has been created. The Grade B of data readiness ensures thought can be put into data quality <em>before</em> the question is defined. It is this work that is reusable across multiple teams. It is these processes that the team which is <em>standing up</em> the data must deliver.</p>
<h3 id="usability-grade-a">Usability: Grade A</h3>
<p>Once the validity of the data is determined, the data set can be considered for use in a particular task. This stage of data readiness is more akin to what machine learning scientists are used to doing in universities: bringing an algorithm to bear on a well-understood data set.</p>
<p>In Grade A we are concerned about the utility of the data given a particular task. Grade A may involve additional data collection (experimental design in statistics) to ensure that the task is fulfilled.</p>
<p>This is the stage where the data and the model are brought together, so expertise in learning algorithms and their application is key. Further ethical considerations, such as the fairness of the resulting predictions are required at this stage. At the end of this stage a prototype model is ready for deployment.</p>
<p>Deployment and maintenance of machine learning models in production is another important issue, for which Data Readiness Levels are only part of the solution.</p>
<h3 id="recursive-effects">Recursive Effects</h3>
<p>To find out more, or to contribute ideas, go to <a href="http://data-readiness.org" class="uri">http://data-readiness.org</a></p>
<p>Throughout the data preparation pipeline, it is important to have close interaction between data scientists and application domain experts. Decisions on data preparation taken outside the context of application have dangerous downstream consequences. This provides an additional burden on the data scientist as they are required for each project, but it should also be seen as a learning and familiarization exercise for the domain expert. Long term, just as biologists have found it necessary to assimilate the skills of the bioinformatician to be effective in their science, most domains will also require a familiarity with the nature of data driven decision making and its application. Working closely with data-scientists on data preparation is one way to begin this sharing of best practice.</p>
<p>The processes involved in Grades C and B are often badly taught in courses on data science. This is perhaps not due to a lack of interest in the area, but more due to a lack of access to real-world examples where data quality is poor.</p>
<p>These stages of data science are also ridden with ambiguity. In the long term they could do with more formalization, and automation, but best practice needs to be understood by a wider community before that can happen.</p>
<h3 id="combining-data-and-systems-design">Combining Data and Systems Design</h3>
<p>One analogy I find helpful for understanding the depth of change we need is the following. Imagine, as an engineer, you find a USB stick on the ground, and for some reason you <em>know</em> that on that USB stick is a particular API call that will enable you to make a significant positive difference on a business problem. However, you also know that on that USB stick there is potentially malicious code. The most secure thing to do would be to <em>not</em> introduce this code into your production system. But what if your manager told you to do so? How would you go about incorporating this code base?</p>
<p>The answer is <em>very</em> carefully. You would have to engage in a process more akin to debugging than regular software engineering. As you understood the code base, for your work to be reproducible, you should be documenting it, not just what you discovered, but how you discovered it. In the end, you typically find a single API call that is the one that most benefits your system. But more thought has been placed into this line of code than any line of code you have written before.</p>
<p>Even then, when your API code is introduced into your production system, it needs to be deployed in an environment that monitors it. We cannot rely on an individual’s decision making to ensure the quality of all our systems. We need to create an environment that includes quality controls, checks and bounds, tests, all designed to ensure that assumptions made about this foreign code base are remaining valid.</p>
<p>This situation is akin to what we are doing when we incorporate data in our production systems. When we are consuming data from others, we cannot assume that it has been produced in alignment with our goals for our own systems. Worst case, it may have been adversarially produced. A further challenge is that data is dynamic. So, in effect, the code on the USB stick is evolving over time.</p>
<p>Anecdotally, resolving a machine learning challenge requires 80% of the resource to be focused on the data and perhaps 20% to be focused on the model. But many companies are too keen to employ machine learning engineers who focus on the models, not the data.</p>
<div style="text-align:center">
<img class="" src="../slides/diagrams/data-science/water-bridge-hill-transport-arch-calm-544448-pxhere.com.jpg" width="80%" height="auto" align="" style="background:none; border:none; box-shadow:none; display:block; margin-left:auto; margin-right:auto">
</div>
<center>
<em>A reservoir of data has more value if the data is consumable. The data crisis can only be addressed if we focus on outputs rather than inputs. </em>
</center>
<div style="text-align:center">
<img class="" src="../slides/diagrams/data-science/1024px-Lake_District_picture.JPG" width="80%" height="auto" align="" style="background:none; border:none; box-shadow:none; display:block; margin-left:auto; margin-right:auto">
</div>
<center>
<em>For a data first architecture we need to clean our data at source, rather than individually cleaning data for each task. This involves a shift of focus from our inputs to our outputs. </em>
</center>
<p><strong>Recommendation</strong>: We need to share best practice around data deployment across our teams. We should make best use of our processes where applicable, but we need to develop them to become <em>data first</em> organizations. Data needs to be cleaned at <em>output</em> not at <em>input</em>.</p>
<h3 id="deployment">Deployment</h3>
<h3 id="continuous-deployment">Continuous Deployment</h3>
<p>Once the design is complete, the model code needs to be deployed.</p>
<p>To extend our USB stick analogy further, how would we deploy that code if we thought it was likely to evolve in production? This is what data does. We cannot assume that the conditions under which we trained our model will be retained as we move forward, indeed the only constant we have is change.</p>
<p>This means that when any data dependent model is deployed into production, it requires <em>continuous monitoring</em> to ensure the assumptions of design have not been invalidated. Software changes are qualified through testing, in particular a regression test ensures that existing functionality is not broken by change. Since data is continually evolving, machine learning systems require continual regression testing: oversight by systems that ensure their existing functionality has not been broken as the world evolves around them. Unfortunately, standards around ML model deployment have not yet been developed. The modern world of continuous deployment does rely on testing, but it does not recognize the continuous evolution of the world around us.</p>
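<p>What might such continual regression testing look like? A minimal sketch, assuming we monitor a single feature and flag drift with a simple z-test on its mean; the threshold of three standard errors and all names are illustrative choices:</p>

```python
# Sketch of a continual 'regression test' for a deployed model: compare
# the live feature distribution against the one seen at training time
# and alert when it drifts. Threshold and data are illustrative.
from statistics import mean, stdev

def drift_alert(train_values, live_values, threshold=3.0):
    """Flag when the live mean drifts from the training distribution."""
    mu, sd = mean(train_values), stdev(train_values)
    standard_error = sd / len(live_values) ** 0.5
    shift = abs(mean(live_values) - mu) / standard_error
    return shift > threshold

train   = [10.0 + 0.1 * (i % 7) for i in range(100)]  # training-time feature
stable  = [10.0 + 0.1 * (i % 7) for i in range(50)]   # world unchanged
shifted = [12.0 + 0.1 * (i % 7) for i in range(50)]   # world has moved

print(drift_alert(train, stable))   # False: assumptions still hold
print(drift_alert(train, shifted))  # True: retraining may be needed
```

In a real deployment this check would run continuously against the live stream, and the alert would feed the monitoring systems described in the recommendations below.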
<p>If the world has changed around our decision making ecosystem, how are we alerted to those changes?</p>
<p><strong>Recommendation</strong>: We establish best practice around model deployment. We need to shift our culture from standing up a software service, to standing up a <em>data service</em>. Data as a Service would involve continual monitoring of our deployed models in production. This would be regulated by 'hypervisor' systems<a href="#fn3" class="footnoteRef" id="fnref3"><sup>3</sup></a> that understand the context in which models are deployed and recognize when circumstance has changed and models need retraining or restructuring.</p>
<p><strong>Recommendation</strong>: We should consider a major re-architecting of systems around our services. In particular we should scope the use of a <em>streaming architecture</em> (such as Apache Kafka) that ensures data persistence and enables asynchronous operation of our systems.<a href="#fn4" class="footnoteRef" id="fnref4"><sup>4</sup></a> This would enable the provision of QC streams, and real time dash boards as well as hypervisors.</p>
<p>Importantly, a streaming architecture implies that the services we build are <em>stateless</em>; internal state is deployed on streams alongside external state. This allows for rapid assessment of other services’ data.</p>
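<p>The stateless idea can be illustrated without any streaming infrastructure. In this sketch, plain Python lists stand in for Kafka topics; the service publishes every state update onto a stream rather than keeping it private, so QC streams, dashboards or a hypervisor could consume the state stream directly. All names are illustrative:</p>

```python
# Sketch of a stateless streaming service: internal state is published
# onto a stream alongside the outputs. Lists stand in for Kafka topics.

def forecast_service(order_stream, state_stream, forecast_stream):
    """Consume orders; publish a running-mean forecast and every state update."""
    state = {"count": 0, "total": 0.0}
    for order in order_stream:
        state = {"count": state["count"] + 1,
                 "total": state["total"] + order}
        state_stream.append(state)  # internal state is externally visible
        forecast_stream.append(state["total"] / state["count"])

orders, states, forecasts = [5.0, 7.0, 9.0], [], []
forecast_service(orders, states, forecasts)
print(forecasts)    # [5.0, 6.0, 7.0]
print(states[-1])   # other services can inspect the state stream
```

Because the state lives on a stream, the service can be restarted by replaying the stream, and a monitoring service can audit its internal state without any privileged access.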
<h3 id="outlook-for-machine-learning">Outlook for Machine Learning</h3>
<p>Machine learning has risen to prominence as an approach to <em>scaling</em> our activities. For us to continue to automate in the manner we have over the last two decades, we need to make more use of computer-based automation. Machine learning is allowing us to automate processes that were out of reach before.</p>
<h3 id="conclusion">Conclusion</h3>
<p>We operate in a technologically evolving environment. Machine learning is becoming a key component in our decision making capabilities, our intelligence and strategic command. Historically, technology has driven changes in battlefield strategy: from the stalemate of the First World War, to the tank-dominated Blitzkrieg of the Second, to the asymmetric warfare of the present day. Our technology, tactics and strategies are constantly evolving. Machine learning is part of that evolution, but the main challenge is not to become so fixated on the tactics of today that we miss the evolution of strategy that the technology is suggesting.</p>
<h3 id="references" class="unnumbered">References</h3>
<div id="refs" class="references">
<div id="ref-Lawrence:drl17">
<p>Lawrence, N.D., 2017. Data readiness levels. arXiv.</p>
</div>
<div id="ref-Popper:conjectures63">
<p>Popper, K.R., 1963. Conjectures and refutations: The growth of scientific knowledge. Routledge, London.</p>
</div>
<div id="ref-Tukey:exploratory77">
<p>Tukey, J.W., 1977. Exploratory data analysis. Addison-Wesley.</p>
</div>
</div>
<div class="footnotes">
<hr />
<ol>
<li id="fn1"><p>We can also become constrained by our tribal thinking, just as each of the other groups can.<a href="#fnref1">↩</a></p></li>
<li id="fn2"><p>This is related to machine learning and technical debt, although we are framing the solution here rather than the problem.<a href="#fnref2">↩</a></p></li>
<li id="fn3"><p>Emulation is one approach to forming such a hypervisor, because we can build emulators that operate at the meta level, not on the systems directly, but how they interact. Or emulators that monitor a simulation to ensure performance does not change dramatically. However, they are not the only approach. Using real time dashboards, anomaly detection and classical statistics are also applicable in this domain.<a href="#fnref3">↩</a></p></li>
<li id="fn4"><p>The Cambridge team has been exploring this area. We have a reference architecture, and are also considering how such a system could/should be extended for incorporation of simulation models.<a href="#fnref4">↩</a></p></li>
</ol>
</div>
Fri, 30 Nov 2018 00:00:00 +0000
http://inverseprobability.com/talks/notes/data-science-and-digital-systems.html
http://inverseprobability.com/talks/notes/data-science-and-digital-systems.htmlnotesNatural and Artificial Intelligence<div style="display:none">
$$\newcommand{\Amatrix}{\mathbf{A}}
\newcommand{\KL}[2]{\text{KL}\left( #1\,\|\,#2 \right)}
\newcommand{\Kaast}{\kernelMatrix_{\mathbf{ \ast}\mathbf{ \ast}}}
\newcommand{\Kastu}{\kernelMatrix_{\mathbf{ \ast} \inducingVector}}
\newcommand{\Kff}{\kernelMatrix_{\mappingFunctionVector \mappingFunctionVector}}
\newcommand{\Kfu}{\kernelMatrix_{\mappingFunctionVector \inducingVector}}
\newcommand{\Kuast}{\kernelMatrix_{\inducingVector \bf\ast}}
\newcommand{\Kuf}{\kernelMatrix_{\inducingVector \mappingFunctionVector}}
\newcommand{\Kuu}{\kernelMatrix_{\inducingVector \inducingVector}}
\newcommand{\Kuui}{\Kuu^{-1}}
\newcommand{\Qaast}{\mathbf{Q}_{\bf \ast \ast}}
\newcommand{\Qastf}{\mathbf{Q}_{\ast \mappingFunction}}
\newcommand{\Qfast}{\mathbf{Q}_{\mappingFunctionVector \bf \ast}}
\newcommand{\Qff}{\mathbf{Q}_{\mappingFunctionVector \mappingFunctionVector}}
\newcommand{\aMatrix}{\mathbf{A}}
\newcommand{\aScalar}{a}
\newcommand{\aVector}{\mathbf{a}}
\newcommand{\acceleration}{a}
\newcommand{\bMatrix}{\mathbf{B}}
\newcommand{\bScalar}{b}
\newcommand{\bVector}{\mathbf{b}}
\newcommand{\basisFunc}{\phi}
\newcommand{\basisFuncVector}{\boldsymbol{ \basisFunc}}
\newcommand{\basisFunction}{\phi}
\newcommand{\basisLocation}{\mu}
\newcommand{\basisMatrix}{\boldsymbol{ \Phi}}
\newcommand{\basisScalar}{\basisFunction}
\newcommand{\basisVector}{\boldsymbol{ \basisFunction}}
\newcommand{\activationFunction}{\phi}
\newcommand{\activationMatrix}{\boldsymbol{ \Phi}}
\newcommand{\activationScalar}{\basisFunction}
\newcommand{\activationVector}{\boldsymbol{ \basisFunction}}
\newcommand{\bigO}{\mathcal{O}}
\newcommand{\binomProb}{\pi}
\newcommand{\cMatrix}{\mathbf{C}}
\newcommand{\cbasisMatrix}{\hat{\boldsymbol{ \Phi}}}
\newcommand{\cdataMatrix}{\hat{\dataMatrix}}
\newcommand{\cdataScalar}{\hat{\dataScalar}}
\newcommand{\cdataVector}{\hat{\dataVector}}
\newcommand{\centeredKernelMatrix}{\mathbf{ \MakeUppercase{\centeredKernelScalar}}}
\newcommand{\centeredKernelScalar}{b}
\newcommand{\centeredKernelVector}{\centeredKernelScalar}
\newcommand{\centeringMatrix}{\mathbf{H}}
\newcommand{\chiSquaredDist}[2]{\chi_{#1}^{2}\left(#2\right)}
\newcommand{\chiSquaredSamp}[1]{\chi_{#1}^{2}}
\newcommand{\conditionalCovariance}{\boldsymbol{ \Sigma}}
\newcommand{\coregionalizationMatrix}{\mathbf{B}}
\newcommand{\coregionalizationScalar}{b}
\newcommand{\coregionalizationVector}{\mathbf{ \coregionalizationScalar}}
\newcommand{\covDist}[2]{\text{cov}_{#2}\left(#1\right)}
\newcommand{\covSamp}[1]{\text{cov}\left(#1\right)}
\newcommand{\covarianceScalar}{c}
\newcommand{\covarianceVector}{\mathbf{ \covarianceScalar}}
\newcommand{\covarianceMatrix}{\mathbf{C}}
\newcommand{\covarianceMatrixTwo}{\boldsymbol{ \Sigma}}
\newcommand{\croupierScalar}{s}
\newcommand{\croupierVector}{\mathbf{ \croupierScalar}}
\newcommand{\croupierMatrix}{\mathbf{ \MakeUppercase{\croupierScalar}}}
\newcommand{\dataDim}{p}
\newcommand{\dataIndex}{i}
\newcommand{\dataIndexTwo}{j}
\newcommand{\dataMatrix}{\mathbf{Y}}
\newcommand{\dataScalar}{y}
\newcommand{\dataSet}{\mathcal{D}}
\newcommand{\dataStd}{\sigma}
\newcommand{\dataVector}{\mathbf{ \dataScalar}}
\newcommand{\decayRate}{d}
\newcommand{\degreeMatrix}{\mathbf{ \MakeUppercase{\degreeScalar}}}
\newcommand{\degreeScalar}{d}
\newcommand{\degreeVector}{\mathbf{ \degreeScalar}}
% Already defined by latex
%\newcommand{\det}[1]{\left|#1\right|}
\newcommand{\diag}[1]{\text{diag}\left(#1\right)}
\newcommand{\diagonalMatrix}{\mathbf{D}}
\newcommand{\diff}[2]{\frac{\text{d}#1}{\text{d}#2}}
\newcommand{\diffTwo}[2]{\frac{\text{d}^2#1}{\text{d}#2^2}}
\newcommand{\displacement}{x}
\newcommand{\displacementVector}{\textbf{\displacement}}
\newcommand{\distanceMatrix}{\mathbf{ \MakeUppercase{\distanceScalar}}}
\newcommand{\distanceScalar}{d}
\newcommand{\distanceVector}{\mathbf{ \distanceScalar}}
\newcommand{\eigenvaltwo}{\ell}
\newcommand{\eigenvaltwoMatrix}{\mathbf{L}}
\newcommand{\eigenvaltwoVector}{\mathbf{l}}
\newcommand{\eigenvalue}{\lambda}
\newcommand{\eigenvalueMatrix}{\boldsymbol{ \Lambda}}
\newcommand{\eigenvalueVector}{\boldsymbol{ \lambda}}
\newcommand{\eigenvector}{\mathbf{ \eigenvectorScalar}}
\newcommand{\eigenvectorMatrix}{\mathbf{U}}
\newcommand{\eigenvectorScalar}{u}
\newcommand{\eigenvectwo}{\mathbf{v}}
\newcommand{\eigenvectwoMatrix}{\mathbf{V}}
\newcommand{\eigenvectwoScalar}{v}
\newcommand{\entropy}[1]{\mathcal{H}\left(#1\right)}
\newcommand{\errorFunction}{E}
\newcommand{\expDist}[2]{\left<#1\right>_{#2}}
\newcommand{\expSamp}[1]{\left<#1\right>}
\newcommand{\expectation}[1]{\left\langle #1 \right\rangle }
\newcommand{\expectationDist}[2]{\left\langle #1 \right\rangle _{#2}}
\newcommand{\expectedDistanceMatrix}{\mathcal{D}}
\newcommand{\eye}{\mathbf{I}}
\newcommand{\fantasyDim}{r}
\newcommand{\fantasyMatrix}{\mathbf{ \MakeUppercase{\fantasyScalar}}}
\newcommand{\fantasyScalar}{z}
\newcommand{\fantasyVector}{\mathbf{ \fantasyScalar}}
\newcommand{\featureStd}{\varsigma}
\newcommand{\gammaCdf}[3]{\mathcal{GAMMA CDF}\left(#1|#2,#3\right)}
\newcommand{\gammaDist}[3]{\mathcal{G}\left(#1|#2,#3\right)}
\newcommand{\gammaSamp}[2]{\mathcal{G}\left(#1,#2\right)}
\newcommand{\gaussianDist}[3]{\mathcal{N}\left(#1|#2,#3\right)}
\newcommand{\gaussianSamp}[2]{\mathcal{N}\left(#1,#2\right)}
\newcommand{\given}{|}
\newcommand{\half}{\frac{1}{2}}
\newcommand{\heaviside}{H}
\newcommand{\hiddenMatrix}{\mathbf{ \MakeUppercase{\hiddenScalar}}}
\newcommand{\hiddenScalar}{h}
\newcommand{\hiddenVector}{\mathbf{ \hiddenScalar}}
\newcommand{\identityMatrix}{\eye}
\newcommand{\inducingInputScalar}{z}
\newcommand{\inducingInputVector}{\mathbf{ \inducingInputScalar}}
\newcommand{\inducingInputMatrix}{\mathbf{Z}}
\newcommand{\inducingScalar}{u}
\newcommand{\inducingVector}{\mathbf{ \inducingScalar}}
\newcommand{\inducingMatrix}{\mathbf{U}}
\newcommand{\inlineDiff}[2]{\text{d}#1/\text{d}#2}
\newcommand{\inputDim}{q}
\newcommand{\inputMatrix}{\mathbf{X}}
\newcommand{\inputScalar}{x}
\newcommand{\inputSpace}{\mathcal{X}}
\newcommand{\inputVals}{\inputVector}
\newcommand{\inputVector}{\mathbf{ \inputScalar}}
\newcommand{\iterNum}{k}
\newcommand{\kernel}{\kernelScalar}
\newcommand{\kernelMatrix}{\mathbf{K}}
\newcommand{\kernelScalar}{k}
\newcommand{\kernelVector}{\mathbf{ \kernelScalar}}
\newcommand{\kff}{\kernelScalar_{\mappingFunction \mappingFunction}}
\newcommand{\kfu}{\kernelVector_{\mappingFunction \inducingScalar}}
\newcommand{\kuf}{\kernelVector_{\inducingScalar \mappingFunction}}
\newcommand{\kuu}{\kernelVector_{\inducingScalar \inducingScalar}}
\newcommand{\lagrangeMultiplier}{\lambda}
\newcommand{\lagrangeMultiplierMatrix}{\boldsymbol{ \Lambda}}
\newcommand{\lagrangian}{L}
\newcommand{\laplacianFactor}{\mathbf{ \MakeUppercase{\laplacianFactorScalar}}}
\newcommand{\laplacianFactorScalar}{m}
\newcommand{\laplacianFactorVector}{\mathbf{ \laplacianFactorScalar}}
\newcommand{\laplacianMatrix}{\mathbf{L}}
\newcommand{\laplacianScalar}{\ell}
\newcommand{\laplacianVector}{\mathbf{ \ell}}
\newcommand{\latentDim}{q}
\newcommand{\latentDistanceMatrix}{\boldsymbol{ \Delta}}
\newcommand{\latentDistanceScalar}{\delta}
\newcommand{\latentDistanceVector}{\boldsymbol{ \delta}}
\newcommand{\latentForce}{f}
\newcommand{\latentFunction}{u}
\newcommand{\latentFunctionVector}{\mathbf{ \latentFunction}}
\newcommand{\latentFunctionMatrix}{\mathbf{ \MakeUppercase{\latentFunction}}}
\newcommand{\latentIndex}{j}
\newcommand{\latentScalar}{z}
\newcommand{\latentVector}{\mathbf{ \latentScalar}}
\newcommand{\latentMatrix}{\mathbf{Z}}
\newcommand{\learnRate}{\eta}
\newcommand{\lengthScale}{\ell}
\newcommand{\rbfWidth}{\ell}
\newcommand{\likelihoodBound}{\mathcal{L}}
\newcommand{\likelihoodFunction}{L}
\newcommand{\locationScalar}{\mu}
\newcommand{\locationVector}{\boldsymbol{ \locationScalar}}
\newcommand{\locationMatrix}{\mathbf{M}}
\newcommand{\variance}[1]{\text{var}\left( #1 \right)}
\newcommand{\mappingFunction}{f}
\newcommand{\mappingFunctionMatrix}{\mathbf{F}}
\newcommand{\mappingFunctionTwo}{g}
\newcommand{\mappingFunctionTwoMatrix}{\mathbf{G}}
\newcommand{\mappingFunctionTwoVector}{\mathbf{ \mappingFunctionTwo}}
\newcommand{\mappingFunctionVector}{\mathbf{ \mappingFunction}}
\newcommand{\scaleScalar}{s}
\newcommand{\mappingScalar}{w}
\newcommand{\mappingVector}{\mathbf{ \mappingScalar}}
\newcommand{\mappingMatrix}{\mathbf{W}}
\newcommand{\mappingScalarTwo}{v}
\newcommand{\mappingVectorTwo}{\mathbf{ \mappingScalarTwo}}
\newcommand{\mappingMatrixTwo}{\mathbf{V}}
\newcommand{\maxIters}{K}
\newcommand{\meanMatrix}{\mathbf{M}}
\newcommand{\meanScalar}{\mu}
\newcommand{\meanTwoMatrix}{\mathbf{M}}
\newcommand{\meanTwoScalar}{m}
\newcommand{\meanTwoVector}{\mathbf{ \meanTwoScalar}}
\newcommand{\meanVector}{\boldsymbol{ \meanScalar}}
\newcommand{\mrnaConcentration}{m}
\newcommand{\naturalFrequency}{\omega}
\newcommand{\neighborhood}[1]{\mathcal{N}\left( #1 \right)}
\newcommand{\neilurl}{http://inverseprobability.com/}
\newcommand{\noiseMatrix}{\boldsymbol{ E}}
\newcommand{\noiseScalar}{\epsilon}
\newcommand{\noiseVector}{\boldsymbol{ \epsilon}}
\newcommand{\norm}[1]{\left\Vert #1 \right\Vert}
\newcommand{\normalizedLaplacianMatrix}{\hat{\mathbf{L}}}
\newcommand{\normalizedLaplacianScalar}{\hat{\ell}}
\newcommand{\normalizedLaplacianVector}{\hat{\mathbf{ \ell}}}
\newcommand{\numActive}{m}
\newcommand{\numBasisFunc}{m}
\newcommand{\numComponents}{m}
\newcommand{\numComps}{K}
\newcommand{\numData}{n}
\newcommand{\numFeatures}{K}
\newcommand{\numHidden}{h}
\newcommand{\numInducing}{m}
\newcommand{\numLayers}{\ell}
\newcommand{\numNeighbors}{K}
\newcommand{\numSequences}{s}
\newcommand{\numSuccess}{s}
\newcommand{\numTasks}{m}
\newcommand{\numTime}{T}
\newcommand{\numTrials}{S}
\newcommand{\outputIndex}{j}
\newcommand{\paramVector}{\boldsymbol{ \theta}}
\newcommand{\parameterMatrix}{\boldsymbol{ \Theta}}
\newcommand{\parameterScalar}{\theta}
\newcommand{\parameterVector}{\boldsymbol{ \parameterScalar}}
\newcommand{\partDiff}[2]{\frac{\partial#1}{\partial#2}}
\newcommand{\precisionScalar}{j}
\newcommand{\precisionVector}{\mathbf{ \precisionScalar}}
\newcommand{\precisionMatrix}{\mathbf{J}}
\newcommand{\pseudotargetScalar}{\widetilde{y}}
\newcommand{\pseudotargetVector}{\mathbf{ \pseudotargetScalar}}
\newcommand{\pseudotargetMatrix}{\mathbf{ \widetilde{Y}}}
\newcommand{\rank}[1]{\text{rank}\left(#1\right)}
\newcommand{\rayleighDist}[2]{\mathcal{R}\left(#1|#2\right)}
\newcommand{\rayleighSamp}[1]{\mathcal{R}\left(#1\right)}
\newcommand{\responsibility}{r}
\newcommand{\rotationScalar}{r}
\newcommand{\rotationVector}{\mathbf{ \rotationScalar}}
\newcommand{\rotationMatrix}{\mathbf{R}}
\newcommand{\sampleCovScalar}{s}
\newcommand{\sampleCovVector}{\mathbf{ \sampleCovScalar}}
\newcommand{\sampleCovMatrix}{\mathbf{s}}
\newcommand{\scalarProduct}[2]{\left\langle{#1},{#2}\right\rangle}
\newcommand{\sign}[1]{\text{sign}\left(#1\right)}
\newcommand{\sigmoid}[1]{\sigma\left(#1\right)}
\newcommand{\singularvalue}{\ell}
\newcommand{\singularvalueMatrix}{\mathbf{L}}
\newcommand{\singularvalueVector}{\mathbf{l}}
\newcommand{\sorth}{\mathbf{u}}
\newcommand{\spar}{\lambda}
\newcommand{\trace}[1]{\text{tr}\left(#1\right)}
\newcommand{\BasalRate}{B}
\newcommand{\DampingCoefficient}{C}
\newcommand{\DecayRate}{D}
\newcommand{\Displacement}{X}
\newcommand{\LatentForce}{F}
\newcommand{\Mass}{M}
\newcommand{\Sensitivity}{S}
\newcommand{\basalRate}{b}
\newcommand{\dampingCoefficient}{c}
\newcommand{\mass}{m}
\newcommand{\sensitivity}{s}
\newcommand{\springScalar}{\kappa}
\newcommand{\springVector}{\boldsymbol{ \kappa}}
\newcommand{\springMatrix}{\boldsymbol{ \mathcal{K}}}
\newcommand{\tfConcentration}{p}
\newcommand{\tfDecayRate}{\delta}
\newcommand{\tfMrnaConcentration}{f}
\newcommand{\tfVector}{\mathbf{ \tfConcentration}}
\newcommand{\velocity}{v}
\newcommand{\sufficientStatsScalar}{g}
\newcommand{\sufficientStatsVector}{\mathbf{ \sufficientStatsScalar}}
\newcommand{\sufficientStatsMatrix}{\mathbf{G}}
\newcommand{\switchScalar}{s}
\newcommand{\switchVector}{\mathbf{ \switchScalar}}
\newcommand{\switchMatrix}{\mathbf{S}}
\newcommand{\tr}[1]{\text{tr}\left(#1\right)}
\newcommand{\loneNorm}[1]{\left\Vert #1 \right\Vert_1}
\newcommand{\ltwoNorm}[1]{\left\Vert #1 \right\Vert_2}
\newcommand{\onenorm}[1]{\left\vert#1\right\vert_1}
\newcommand{\twonorm}[1]{\left\Vert #1 \right\Vert}
\newcommand{\vScalar}{v}
\newcommand{\vVector}{\mathbf{v}}
\newcommand{\vMatrix}{\mathbf{V}}
\newcommand{\varianceDist}[2]{\text{var}_{#2}\left( #1 \right)}
% Already defined by latex
%\newcommand{\vec}{#1:}
\newcommand{\vecb}[1]{\left(#1\right):}
\newcommand{\weightScalar}{w}
\newcommand{\weightVector}{\mathbf{ \weightScalar}}
\newcommand{\weightMatrix}{\mathbf{W}}
\newcommand{\weightedAdjacencyMatrix}{\mathbf{A}}
\newcommand{\weightedAdjacencyScalar}{a}
\newcommand{\weightedAdjacencyVector}{\mathbf{ \weightedAdjacencyScalar}}
\newcommand{\onesVector}{\mathbf{1}}
\newcommand{\zerosVector}{\mathbf{0}}
$$
</div>
<div style="text-align:center;width:40%">
<img class="" src="../slides/diagrams/the-diving-bell-and-the-butterfly.jpg" width="" height="auto" align="" style="background:none; border:none; box-shadow:none; display:block; margin-left:auto; margin-right:auto">
</div>
<p>The Diving Bell and the Butterfly is the autobiography of Jean-Dominique Bauby.</p>
<p>In 1995, when he was editor-in-chief of the French Elle magazine, he suffered a stroke, which destroyed his brainstem. He became almost totally physically paralyzed, but was still mentally active. He acquired what is known as locked-in syndrome.</p>
<p>Incredibly, Bauby wrote his memoir after he became paralyzed.</p>
<p>His left eye was the only muscle he could voluntarily move, and he wrote the entire book by winking it.</p>
<p>How could he do that? Well, first, they set up a mechanism where he could scan across letters and blink at the letter he wanted to use. In this way, he was able to write each letter.</p>
<p>It took him 10 months of four hours a day to write the book. Each word took two minutes to write.</p>
<p>Imagine doing all that thinking, but so little speaking, having all those thoughts and so little ability to communicate.</p>
<p>The idea behind this talk is that we are all in that situation. While not as extreme as for Bauby, we each have a somewhat locked-in intelligence.</p>
<table>
<tr>
<td>
</td>
<td align="center">
<div class="fragment" data-fragment-index="6">
<p><img src="../slides/diagrams/IBM_Blue_Gene_P_supercomputer.jpg"
width="40%" style="background:none; border:none; box-shadow:none;"
align="center"></p>
</div>
</td>
<td align="center">
<div class="fragment" data-fragment-index="4">
<p><img src="../slides/diagrams/ClaudeShannon_MFO3807.jpg"
width="150%" style="background:none; border:none;
box-shadow:none;" align="center"></p>
</div>
</td>
<td align="center">
<div class="fragment" data-fragment-index="1">
<p><img src="../slides/diagrams/Jean-Dominique_Bauby.jpg"
width="150%" style="background:none; border:none;
box-shadow:none;" align="center"></p>
</div>
</td>
</tr>
<tr>
<td>
<div class="fragment" data-fragment-index="2">
<p>bits/min</p>
</div>
</td>
<td align="center">
<div class="fragment" data-fragment-index="6">
<p>billions</p>
</div>
</td>
<td align="center">
<div class="fragment" data-fragment-index="5">
<p>2000</p>
</div>
</td>
<td align="center">
<div class="fragment" data-fragment-index="3">
<p>6</p>
</div>
</td>
</tr>
<tr>
<td>
<div class="fragment" data-fragment-index="7">
<p>billion<br>calculations/s</p>
</div>
</td>
<td align="center">
<div class="fragment" data-fragment-index="9">
<p>~100</p>
</div>
</td>
<td align="center">
<div class="fragment" data-fragment-index="8">
<p>a billion</p>
</div>
</td>
<td align="center">
<div class="fragment" data-fragment-index="8">
<p>a billion</p>
</div>
</td>
</tr>
<tr>
<td>
<div class="fragment" data-fragment-index="10">
<p>embodiment</p>
</div>
</td>
<td align="center">
<div class="fragment" data-fragment-index="11">
<p>20 minutes</p>
</div>
</td>
<td align="center">
<div class="fragment" data-fragment-index="12">
<p>5 billion years</p>
</div>
</td>
<td align="center">
<div class="fragment" data-fragment-index="12">
<p>15 trillion years</p>
</div>
</td>
</tr>
</table>
<p>Let me explain what I mean. Claude Shannon introduced a mathematical concept of information for the purposes of understanding telephone exchanges.</p>
<p>Information has many meanings, but mathematically, Shannon defined a bit of information to be the amount of information you get from tossing a coin.</p>
<p>If I toss a coin, and look at it, I know the answer. You don't. But if I now tell you the answer I communicate to you 1 bit of information. Shannon defined this as the fundamental unit of information.</p>
<p>If I toss the coin twice, and tell you the result of both tosses, I give you two bits of information. Information is additive.</p>
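<p>Shannon's definition can be written down directly. As a minimal sketch (the function name here is mine, not Shannon's): the information in an outcome of probability <em>p</em> is −log₂ <em>p</em> bits, so one fair toss yields one bit and two independent tosses yield two.</p>

```python
import math

def information_bits(p: float) -> float:
    """Shannon information (surprisal), in bits, of an outcome with probability p."""
    return -math.log2(p)

# One fair coin toss: the outcome has probability 1/2, giving 1 bit.
print(information_bits(1 / 2))   # 1.0

# Two independent tosses: joint probability 1/4, giving 2 bits.
# The joint surprisal is the sum of the individual ones: information is additive.
print(information_bits(1 / 4))   # 2.0
```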
<p>Shannon also estimated the average information associated with the English language. He estimated that the average information in an English word is around 12 bits, equivalent to twelve coin tosses.</p>
<p>So every two minutes Bauby was able to communicate 12 bits, or six bits per minute.</p>
<p>This is the information transfer rate he was limited to, the rate at which he could communicate.</p>
<p>Compare this to me, talking now. The average TEDx speaker talks at around 160 words per minute. That's 320 times faster than Bauby, or around 2000 bits per minute. Two thousand coin tosses per minute.</p>
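<p>The arithmetic behind these rates, as a quick sketch (all figures are the talk's rough estimates):</p>

```python
BITS_PER_WORD = 12            # Shannon's rough estimate for English

# Bauby: one word every two minutes.
bauby_rate = BITS_PER_WORD / 2        # bits per minute
print(bauby_rate)                     # 6.0

# A typical speaker: ~160 words per minute.
speaker_rate = 160 * BITS_PER_WORD    # bits per minute
print(speaker_rate)                   # 1920, i.e. around 2000

print(speaker_rate / bauby_rate)      # 320.0 times faster
```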
<p>But, just think how much thought Bauby was putting into every sentence. Imagine how carefully chosen each of his words was. Because he was communication-constrained, he could put more thought into each of his words, into thinking about his audience.</p>
<p>So, his intelligence became locked in. He thinks as fast as any of us, but can communicate slower. Like the tree falling in the woods with no one there to hear it, his intelligence is embedded inside him.</p>
<p>Two thousand coin tosses per minute sounds pretty impressive, but this talk is not just about us, it’s about our computers, and the type of intelligence we are creating within them.</p>
<p>So how does two thousand compare to our digital companions? When computers talk to each other, they do so with billions of coin tosses per minute.</p>
<p>Let’s imagine for a moment, that instead of talking about communication of information, we are actually talking about money. Bauby would have 6 dollars. I would have 2000 dollars, and my computer has billions of dollars.</p>
<p>The internet has interconnected computers and equipped them with extremely high transfer rates.</p>
<p>However, by our very best estimates, computers actually think slower than us.</p>
<p>How can that be? You might ask, computers calculate much faster than me. That’s true, but underlying your conscious thoughts there are a lot of calculations going on.</p>
<p>Each thought involves many thousands, millions or billions of calculations. How many exactly, we don’t know yet, because we don’t know how the brain turns calculations into thoughts.</p>
<p>Our best estimates suggest that to simulate your brain a computer would have to be as large as the UK Met Office machine here in Exeter. That’s a 250 million pound machine, the fastest in the UK. It can do 16 billion billion calculations per second.</p>
<p>It simulates the weather across the world every day; that’s how much power we think we need to simulate our brains.</p>
<p>So, in terms of our computational power we are extraordinary, but in terms of our ability to explain ourselves, just like Bauby, we are locked in.</p>
<p>For a typical computer, to communicate everything it computes in one second, it would only take it a couple of minutes. For us to do the same would take 15 billion years.</p>
<p>If intelligence is fundamentally about the processing and sharing of information, this gives us a fundamental constraint on human intelligence that dictates its nature.</p>
<p>I call this ratio between the time it takes to compute something and the time it takes to say it the embodiment factor, because it reflects how embodied our cognition is.</p>
<p>If it takes you two minutes to say the thing you have thought in a second, then you are a computer. If it takes you 15 billion years, then you are a human.</p>
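<p>A back-of-the-envelope sketch of the embodiment factor, assuming roughly one bit per calculation. The 16 billion billion calculations per second is the Met Office figure above; the computer's figures (around a hundred billion calculations per second, billions of bits per minute, with 6 billion as an assumed round number) are the rough orders of magnitude from the table.</p>

```python
SECONDS_PER_YEAR = 60 * 60 * 24 * 365

def seconds_to_communicate(calcs_per_sec: float, bits_per_sec: float) -> float:
    """Time to communicate one second's worth of computation,
    assuming roughly one bit per calculation."""
    return calcs_per_sec / bits_per_sec

# A human: brain ~16 billion billion calculations/s, speech ~2000 bits/minute.
human = seconds_to_communicate(16e18, 2000 / 60)
print(human / SECONDS_PER_YEAR / 1e9)   # roughly 15 (billion years)

# A computer: ~100 billion calculations/s, billions of bits/minute
# (6e9 bits/minute assumed).  Of order the table's "20 minutes".
computer = seconds_to_communicate(100e9, 6e9 / 60)
print(computer / 60)                    # roughly 17 (minutes)
```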
<div style="text-align:center;width:70%">
<img class="" src="../slides/diagrams/Lotus_49-2.JPG" width="" height="auto" align="" style="background:none; border:none; box-shadow:none; display:block; margin-left:auto; margin-right:auto">
</div>
<p>If we think of ourselves as vehicles, then we are massively overpowered. Our ability to generate derived information from raw fuel is extraordinary. Intellectually we have formula one engines.</p>
<div style="text-align:center;width:70%">
<img class="" src="../slides/diagrams/640px-Marcel_Renault_1903.jpg" width="" height="auto" align="" style="background:none; border:none; box-shadow:none; display:block; margin-left:auto; margin-right:auto">
</div>
<p>But, if you think about our ability to make use of those thoughts, to deploy them on the track, we are F1 cars with bicycle wheels.</p>
<p>Just think of the control a driver would have to have to deploy such power through such a narrow channel of traction. That is the beauty and the skill of the human mind.</p>
<div style="width:70%;text-align:center">
<img class="" src="../slides/diagrams/Caleb_McDuff_WIX_Silence_Racing_livery.jpg" width="" height="auto" align="" style="background:none; border:none; box-shadow:none; display:block; margin-left:auto; margin-right:auto">
</div>
<p>In contrast, our computers are more like go-karts. Underpowered, but with well-matched tires. More efficient, but somehow less extraordinary, less beautiful.</p>
<object class="svgplot " align data="../slides/diagrams/anne-bob-conversation006.svg" style>
</object>
<center>
<em>Conversation relies on internal models of other individuals. </em>
</center>
<object class="svgplot " align data="../slides/diagrams/anne-bob-conversation007.svg" style>
</object>
<center>
<em>Misunderstanding of context and who we are talking to leads to arguments. </em>
</center>
<p>The consequences of this mismatch between power and delivery are to be seen all around us. Because, just as driving an F1 car with bicycle wheels would be a fine art, so is the process of communication between humans.</p>
<p>If I have a thought and I wish to communicate it, I first of all need to have a model of what you think. I should think before I speak. When I speak, you may react. You have a model of who I am and what I was trying to say, and why I chose to say what I said. Now we begin this dance, where we are each trying to better understand each other and what we are saying. When it works, it is beautiful, but when misdeployed, just like a badly driven F1 car, there is a horrible crash, an argument.</p>
<div style="text-align:center">
<object class="svgplot " align data="../slides/diagrams/hilbert-info-growth.svg" style>
</object>
</div>
<div style="text-align:center;width:60%">
<img class="" src="../slides/diagrams/Classic_baby_shoes.jpg" width="" height="auto" align="" style="background:none; border:none; box-shadow:none; display:block; margin-left:auto; margin-right:auto">
</div>
<center>
<em>For sale: baby shoes, never worn. </em>
</center>
<p>But this is a very different kind of intelligence from ours. A computer cannot understand the depth of Ernest Hemingway's apocryphal six-word novel, "For sale: baby shoes, never worn", because it isn't equipped with the ability to model the complexity of humanity that underlies that statement.</p>
<div style="text-align:center">
<object class="svgplot " align data="../slides/diagrams/anne-computer-conversation.svg" style>
</object>
</div>
<p>Similarly, we find it difficult to comprehend how computers are making decisions, because they do so with more data than we can possibly imagine.</p>
<p>In many respects, this is not a problem; it's a good thing. Computers and humans are good at different things. But when we interact with a computer, when it acts in a different way to us, we need to remember why.</p>
<p>Just as the first step to getting along with other humans is understanding other humans, so it needs to be with getting along with our computers.</p>
<p>Embodiment factors explain why, at the same time, computers are so impressive in simulating our weather, but so poor at predicting our moods. Our complexity is greater than that of our weather, and each of us is tuned to read and respond to one another.</p>
<p>Their intelligence is different. It is based on very large quantities of data that we cannot absorb. Our computers don’t have a complex internal model of who we are. They don’t understand the human condition. They are not tuned to respond to us as we are to each other.</p>
<p>Embodiment factors encapsulate a profound thing about the nature of humans. Our locked-in intelligence means that we are striving to communicate, so we put a lot of thought into whatever we’re communicating with. And if we’re communicating with something complex, we naturally anthropomorphize it.</p>
<p>We give our dogs, our cats and our cars human motivations. We do the same with our computers. We anthropomorphize them. We assume that they have the same objectives as us and the same constraints. They don’t.</p>
<p>This means that, when we worry about artificial intelligence, we worry about the wrong things. We fear computers that behave like more powerful versions of ourselves, which we would struggle to outcompete.</p>
<p>In reality, the challenge is that our computers cannot be human enough. They cannot understand us with the depth we understand one another. They drop below our cognitive radar and operate outside our mental models.</p>
<p>The real danger is that computers don’t anthropomorphize. They’ll make decisions in isolation from us without our supervision, because they can’t communicate truly and deeply with us.</p>
<p>Some researchers talk about transhumanism, releasing us from our own limitations, gaining the bandwidth of the computer. Who wouldn’t want the equivalent of billions of dollars of communication that a computer has?</p>
<p>But what if that would destroy the very nature of what it is to be human? What if we are defined by our limitations? What if our consciousness is born out of a need to understand and be understood by others? What if the thing that we value most is a side effect of our limitations?</p>
<p>AI is a technology, it is not a human being. It doesn’t worry it is being misunderstood, it doesn’t hate us, it doesn’t love us, it doesn’t even have an opinion about us.</p>
<p>Any sense that it does is in that little internal model we have as we anthropomorphize it. AI doesn’t stand for anthropomorphic intelligence, it stands for artificial intelligence. Artificial in the way a plastic plant is artificial.</p>
<p>Of course, like any technology, that doesn’t mean it’s without its dangers. Technological advance has always brought social challenges and likely always will, but if we are to face those challenges head on, we need to acknowledge the difference between our intelligence and that which we create in our computers.</p>
<p>Your cat understands you better than your computer does, your cat understands me better than your computer does, and it hasn’t even met me!</p>
<p>Our lives are defined by our desperate need to be understood: art, music, literature, dance, sport. So many creative ways to try and communicate who we are or what we feel. The computer has no need for this.</p>
<p>When you hear the term Artificial Intelligence, remember that’s precisely what it is. Artificial, like that plastic plant. A plastic plant is convenient, it doesn’t need watering, it doesn’t need to be put in the light, it won’t wilt if you don’t attend to it, and it won’t outgrow the place you put it.</p>
<p>A plastic plant will do some of the jobs that a real plant does, but it isn’t a proper replacement, and never will be. So, it is with our artificial intelligences.</p>
<p>I believe that our fascination with AI is actually a projected fascination with ourselves. A sort of technological narcissism. One of the reasons that the next generation of artificial intelligence solutions excites me is because I think it will lead to a much better understanding of our own intelligence.</p>
<p>But with our self-idolization comes an Icarian fear of what it might mean to emulate those characteristics that we perceive as uniquely human.</p>
<p>Our fears of AI singularities and superintelligences come from a confused mixing of what we create and what we are.</p>
<p>Do not fool yourselves into thinking these computers are the same thing as us; they never will be. We are a consequence of our limitations, just as Bauby was defined by his. Or maybe limitations is the wrong word; as Bauby described, there are always moments when we can explore our inner selves and escape into our own imagination:</p>
<blockquote>
<p>My cocoon becomes less oppressive, and my mind takes flight like a butterfly. There is so much to do. You can wander off in space or in time, set out for Tierra del Fuego or King Midas’s court. You can visit the woman you love, slide down beside her and stroke her still-sleeping face. You can build castles in Spain, steal the golden fleece, discover Atlantis, realize your childhood dreams and adult ambitions.<br />
Enough rambling. My main task now is to compose the first of those bedridden travel notes so that I shall be ready when my publisher’s emissary arrives to take my dictation, letter by letter. In my head I churn over every sentence ten times, delete a word, add an adjective, and learn my text by heart, paragraph by paragraph.</p>
</blockquote>
<p>The flower that is this book, that is this fight, can never bud from an artificial plant.</p>
<ul>
<li>twitter: @lawrennd</li>
<li>blog: <a href="http://inverseprobability.com/blog.html">http://inverseprobability.com</a></li>
<li>podcast: <a href="http://thetalkingmachines.com" class="uri">http://thetalkingmachines.com</a></li>
<li><a href="http://inverseprobability.com/2018/02/06/natural-and-artificial-intelligence">Natural vs Artificial Intelligence</a></li>
</ul>
Thu, 18 Oct 2018 00:00:00 +0000
http://inverseprobability.com/talks/notes/natural-and-artificial-intelligence.html
fAIth<div style="display:none">
$$
\newcommand{\onesVector}{\mathbf{1}}
\newcommand{\zerosVector}{\mathbf{0}}
$$
</div>
<h2 id="the-gartner-hype-cycle">The Gartner Hype Cycle</h2>
<p><img class="negate" src="../slides/diagrams/Gartner_Hype_Cycle.png" width="70%" height="auto" align="center" style="background:none; border:none; box-shadow:none; display:block; margin-left:auto; margin-right:auto"></p>
<p>The <a href="https://en.wikipedia.org/wiki/Hype_cycle">Gartner Hype Cycle</a> tries to assess where an idea is in terms of maturity and adoption. It splits the evolution of a technology into a technological trigger, a peak of inflated expectations followed by a trough of disillusionment, and a final ascension into a useful technology. It looks rather like a classical control response to a final set point.</p>
<object class="svgplot " align data="../slides/diagrams/data-science/ai-bd-dm-dl-ml-google-trends004.svg" style>
</object>
<center>
<em>Google trends for different technological terms on the hype cycle. </em>
</center>
<p>Google trends gives us insight into how far along various technological terms are on the hype cycle.</p>
<h3 id="lies-and-damned-lies">Lies and Damned Lies</h3>
<blockquote>
<p>There are three types of lies: lies, damned lies and statistics</p>
<p>Benjamin Disraeli 1804-1881</p>
</blockquote>
<p>The quote "lies, damned lies and statistics" was credited to Benjamin Disraeli by Mark Twain in his autobiography. It characterizes the idea that statistics can be made to prove anything. But Disraeli died in 1881 and Mark Twain died in 1910, so the attribution is, at best, secondhand. The important breakthrough in overcoming our tendency to overinterpret data came with the formalization of the field through the development of <em>mathematical statistics</em>.</p>
<h3 id="mathematical-statistics"><em>Mathematical</em> Statistics</h3>
<img class="" src="../slides/diagrams/Portrait_of_Karl_Pearson.jpg" width="30%" height="auto" align="center" style="background:none; border:none; box-shadow:none; display:block; margin-left:auto; margin-right:auto">
<center>
<em>'Founded' by Karl Pearson (1857-1936) </em>
</center>
<p><a href="https://en.wikipedia.org/wiki/Karl_Pearson">Karl Pearson</a> (1857-1936), <a href="https://en.wikipedia.org/wiki/Ronald_Fisher">Ronald Fisher</a> (1890-1962) and others considered the question of what conclusions can truly be drawn from data. Their mathematical studies act as a restraint on our tendency to over-interpret and see patterns where there are none. They introduced concepts such as randomized control trials that form a mainstay of our decision making today, from government, to clinicians, to the large scale A/B testing that determines the nature of the web interfaces we interact with on social media and shopping.</p>
<p>Today the statement "There are three types of lies: lies, damned lies and 'big data'" may be more apt. We are revisiting many of the mistakes made in interpreting data from the 19th century. Big data is laid down by happenstance, rather than actively collected with a particular question in mind. That means it needs to be treated with care when conclusions are being drawn. For data science to succeed it needs the same form of rigour that Pearson and Fisher brought to statistics; a "mathematical data science" is needed.</p>
<h2 id="what-is-machine-learning">What is Machine Learning?</h2>
<p>What is machine learning? At its most basic level machine learning is a combination of</p>
<p><span class="math display">\[\text{data} + \text{model} \xrightarrow{\text{compute}} \text{prediction}\]</span></p>
<p>where <em>data</em> is our observations. They can be actively or passively acquired (meta-data). The <em>model</em> contains our assumptions, based on previous experience. That experience can be other data, it can come from transfer learning, or it can merely be our beliefs about the regularities of the universe. In humans our models include our inductive biases. The <em>prediction</em> is an action to be taken or a categorization or a quality score. The reason that machine learning has become a mainstay of artificial intelligence is the importance of predictions in artificial intelligence. The data and the model are combined through computation.</p>
<p>In practice we normally perform machine learning using two functions. To combine data with a model we typically make use of:</p>
<p><strong>a prediction function</strong> a function which is used to make the predictions. It includes our beliefs about the regularities of the universe, our assumptions about how the world works, e.g. smoothness, spatial similarities, temporal similarities.</p>
<p><strong>an objective function</strong> a function which defines the cost of misprediction. Typically it includes knowledge about the world's generating processes (probabilistic objectives) or the costs we pay for mispredictions (empirical risk minimization).</p>
<p>The combination of data and model through the prediction function and the objective function leads to a <em>learning algorithm</em>. The class of prediction functions and objective functions we can make use of is restricted by the algorithms they lead to. If the prediction function or the objective function are too complex, then it can be difficult to find an appropriate learning algorithm. Much of the academic field of machine learning is the quest for new learning algorithms that allow us to bring different types of models and data together.</p>
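<p>A minimal sketch of this combination, not taken from the talk: a linear prediction function and a squared-error objective, combined with data through gradient descent to give a learning algorithm. The data here are synthetic and all figures are illustrative assumptions.</p>

```python
import numpy as np

# data: synthetic observations of inputs x and outputs y
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
y = 2.0 * x + 0.5 + rng.normal(0, 0.1, size=50)

# prediction function: encodes our model assumptions (here, linearity)
def prediction(w, b, x):
    return w * x + b

# objective function: the cost of misprediction (empirical risk
# under a squared-error loss)
def objective(w, b, x, y):
    return np.mean((y - prediction(w, b, x)) ** 2)

# learning algorithm: gradient descent on the objective
w, b = 0.0, 0.0
for _ in range(500):
    err = prediction(w, b, x) - y
    w -= 0.1 * np.mean(2 * err * x)
    b -= 0.1 * np.mean(2 * err)
```

<p>A more complex prediction function (say, a deep network) or objective (say, a full probabilistic model) makes the corresponding learning algorithm harder to derive, which is the restriction described above.</p>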
<p>A useful reference for state of the art in machine learning is the UK Royal Society Report, <a href="https://royalsociety.org/~/media/policy/projects/machine-learning/publications/machine-learning-report.pdf">Machine Learning: Power and Promise of Computers that Learn by Example</a>.</p>
<p>You can also check my blog post on <a href="http://inverseprobability.com/2017/07/17/what-is-machine-learning">"What is Machine Learning?"</a></p>
<h3 id="artificial-intelligence-and-data-science">Artificial Intelligence and Data Science</h3>
<p>Machine learning technologies have been the driver of two related, but distinct disciplines. The first is <em>data science</em>. Data science is an emerging field that arises from the fact that we now collect so much data by happenstance, rather than by <em>experimental design</em>. Classical statistics is the science of drawing conclusions from data, and to do so statistical experiments are carefully designed. In the modern era we collect so much data that there's a desire to draw inferences directly from the data.</p>
<p>As well as machine learning, the field of data science draws from statistics, cloud computing, data storage (e.g. streaming data), visualization and data mining.</p>
<p>In contrast, artificial intelligence technologies typically focus on emulating some form of human behaviour, such as understanding an image, or some speech, or translating text from one form to another. The recent advances in artificial intelligence have come from machine learning providing the automation. But in contrast to data science, in artificial intelligence the data is normally collected with the specific task in mind. In this sense it has strong relations to classical statistics.</p>
<p>Classically artificial intelligence worried more about <em>logic</em> and <em>planning</em> and focussed less on data driven decision making. Modern machine learning owes more to the field of <em>Cybernetics</em> <span class="citation">(Wiener, 1948)</span> than artificial intelligence. Related fields include <em>robotics</em>, <em>speech recognition</em>, <em>language understanding</em> and <em>computer vision</em>.</p>
<p>There are strong overlaps between the fields; the wide availability of data by happenstance makes it easier to collect data for designing AI systems. These relations are coming through the wide availability of sensing technologies that are interconnected by cellular networks, WiFi and the internet. This phenomenon is sometimes known as the <em>Internet of Things</em>, but this feels like a dangerous misnomer. We must never forget that we are interconnecting people, not things.</p>
<h2 id="natural-and-artificial-intelligence-embodiment-factors">Natural and Artificial Intelligence: Embodiment Factors</h2>
<table>
<tr>
<td>
</td>
<td align="center">
<img class="" src="../slides/diagrams/IBM_Blue_Gene_P_supercomputer.jpg" width="40%" height="auto" align="center" style="background:none; border:none; box-shadow:none; display:block; margin-left:auto; margin-right:auto">
</td>
<td align="center">
<img class="" src="../slides/diagrams/ClaudeShannon_MFO3807.jpg" width="25%" height="auto" align="center" style="background:none; border:none; box-shadow:none; display:block; margin-left:auto; margin-right:auto">
</td>
</tr>
<tr>
<td>
compute
</td>
<td align="center">
<span class="math display">\[\approx 100 \text{ gigaflops}\]</span>
</td>
<td align="center">
<span class="math display">\[\approx 16 \text{ petaflops}\]</span>
</td>
</tr>
<tr>
<td>
communicate
</td>
<td align="center">
<span class="math display">\[1 \text{ gigabit/s}\]</span>
</td>
<td align="center">
<span class="math display">\[100 \text{ bit/s}\]</span>
</td>
</tr>
<tr>
<td>
(compute/communicate)
</td>
<td align="center">
<span class="math display">\[10^{4}\]</span>
</td>
<td align="center">
<span class="math display">\[10^{14}\]</span>
</td>
</tr>
</table>
<p>There is a fundamental limit placed on our intelligence based on our ability to communicate. Claude Shannon founded the field of information theory. The clever part of this theory is it allows us to separate our measurement of information from what the information pertains to<a href="#fn1" class="footnoteRef" id="fnref1"><sup>1</sup></a>.</p>
<p>Shannon measured information in bits. One bit of information is the amount of information I pass to you when I give you the result of a coin toss. Shannon was also interested in the amount of information in the English language. He estimated that on average a word in the English language contains 12 bits of information.</p>
<p>Given typical speaking rates, that gives us an estimate of our ability to communicate of around 100 bits per second <span class="citation">(Reed and Durlach, 1998)</span>. Computers on the other hand can communicate much more rapidly. Current wired network speeds are around a billion bits per second, ten million times faster.</p>
<p>When it comes to compute though, our best estimates indicate our computers are slower. A typical modern computer can perform around 100 billion floating point operations per second, each floating point operation involving a 64 bit number. So the computer is processing around 6,400 billion bits per second.</p>
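<p>These back-of-envelope figures are easy to check. A small sketch using the rough numbers quoted in the text (they are estimates, not precise measurements):</p>

```python
import math

# one bit of information: the Shannon entropy of a fair coin toss
coin_bits = -sum(p * math.log2(p) for p in (0.5, 0.5))  # 1.0

# communication rates: human speech vs a wired network
human_comm = 100        # bits per second (Reed and Durlach, 1998)
machine_comm = 1e9      # around a billion bits per second
speed_up = machine_comm / human_comm  # ten million times faster

# computation as a bit rate: 100 gigaflops over 64-bit numbers
machine_compute = 100e9 * 64  # about 6,400 billion bits per second
ratio = machine_compute / machine_comm  # roughly 10^4
```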
<p>It's difficult to get similar estimates for humans, but by some estimates the amount of compute we would require to <em>simulate</em> a human brain is equivalent to that in the UK's fastest computer <span class="citation">(Ananthanarayanan et al., 2009)</span>, the Met Office machine in Exeter, which in 2018 ranks as the 11th fastest computer in the world. That machine simulates the world's weather each morning, and then simulates the world's climate in the afternoon. It is a 16 petaflop machine, processing around a million trillion bits per second.</p>
<div style="text-align:center">
<img class="" src="../slides/diagrams/Lotus_49-2.JPG" width="70%" height="auto" align="" style="background:none; border:none; box-shadow:none; display:block; margin-left:auto; margin-right:auto">
</div>
<p>So when it comes to our ability to compute we are extraordinary, not the compute in our conscious mind, but the underlying neuron firings that underpin our consciousness and our subconsciousness as well as our motor control etc. By analogy I sometimes like to think of us as a Formula One engine. But in terms of our ability to deploy that computation in actual use, to share the results of what we have inferred, we are very limited. So when you imagine the F1 car that represents the psyche, think of an F1 car with bicycle wheels.</p>
<div style="text-align:center">
<img class="" src="../slides/diagrams/640px-Marcel_Renault_1903.jpg" width="70%" height="auto" align="" style="background:none; border:none; box-shadow:none; display:block; margin-left:auto; margin-right:auto">
</div>
<p>In contrast, our computers have less computational power, but they can communicate far more fluidly. They are more like a go-kart, less well powered, but with tires that allow them to deploy that power.</p>
<div style="text-align:center">
<img class="" src="../slides/diagrams/Caleb_McDuff_WIX_Silence_Racing_livery.jpg" width="70%" height="auto" align="" style="background:none; border:none; box-shadow:none; display:block; margin-left:auto; margin-right:auto">
</div>
<p>For humans, that means much of our computation should be dedicated to considering <em>what</em> we should compute. To do that efficiently we need to model the world around us. The most complex thing in the world around us is other humans. So it is no surprise that we model them. We second guess what their intentions are, and our communication is only necessary when they are departing from how we model them. Naturally, for this to work well, we need to understand those we work closely with. So it is no surprise that social communication, social bonding, forms so much of a part of our use of our limited bandwidth.</p>
<p>There is a second effect here, our need to anthropomorphise objects around us. Our tendency to model our fellow humans extends to when we interact with other entities in our environment. To our pets as well as inanimate objects around us, such as computers or even our cars. This tendency to overinterpret could be a consequence of our limited ability to communicate.</p>
<p>For more details see this paper <a href="https://arxiv.org/abs/1705.07996">"Living Together: Mind and Machine Intelligence"</a>, and this <a href="http://inverseprobability.com/talks/lawrence-tedx17/living-together.html">TEDx talk</a>.</p>
<h2 id="evolved-relationship-with-information">Evolved Relationship with Information</h2>
<p>The high bandwidth of computers has resulted in a close relationship between the computer and data. Large amounts of information can flow between the two. The degree to which the computer is mediating our relationship with data means that we should consider it an intermediary.</p>
<p>Originally our low bandwidth relationship with data was affected by two characteristics. Firstly, our tendency to over-interpret, driven by our need to extract as much knowledge from our low bandwidth information channel as possible. Secondly, our improved understanding of the domain of <em>mathematical</em> statistics and how our cognitive biases can mislead us.</p>
<p>With this new set up there is a potential for assimilating far more information via the computer, but the computer can present this to us in various ways. If its motives are not aligned with ours then it can misrepresent the information. This needn't be nefarious; it can simply be a result of the computer pursuing a different objective from us. For example, if the computer is aiming to maximize our interaction time, that may be a different objective from ours, which may be to summarize information in a representative manner in the <em>shortest</em> possible length of time.</p>
<p>For example, for me it was a common experience to pick up my telephone with the intention of checking when my next appointment was, but to soon find myself distracted by another application on the phone, and end up reading something on the internet. By the time I'd finished reading, I would often have forgotten the reason I picked up my phone in the first place.</p>
<p>There are great benefits to be had from the huge amount of information we can unlock from this evolved relationship between us and data. In biology, large scale data sharing has been driven by a revolution in genomic, transcriptomic and epigenomic measurement. The improved inferences that can be drawn through summarizing data by computer have fundamentally changed the nature of biological science. Now this phenomenon is also influencing us in our daily lives as data measured by <em>happenstance</em> is increasingly used to characterize us.</p>
<p>Better mediation of this flow actually requires a better understanding of human-computer interaction. This in turn involves understanding our own intelligence better, what its cognitive biases are and how these might mislead us.</p>
<p>For further thoughts see <a href="https://www.theguardian.com/media-network/2015/jul/23/data-driven-economy-marketing">this Guardian article</a> from 2015 on marketing in the internet era and <a href="http://inverseprobability.com/2015/12/04/what-kind-of-ai">this blog post</a> on System Zero.</p>
<object class="svgplot " align data="../slides/diagrams/data-science/information-flow003.svg" style>
</object>
<center>
<em>New direction of information flow, information is reaching us mediated by the computer </em>
</center>
<h3 id="societal-effects">Societal Effects</h3>
<p>We have already seen the effects of this changed dynamic in biology and computational biology. Improved sensorics have led to the new domains of transcriptomics, epigenomics, and 'rich phenomics' as well as considerably augmenting our capabilities in genomics.</p>
<p>Biologists have had to become data-savvy, they require a rich understanding of the available data resources and need to assimilate existing data sets in their hypothesis generation as well as their experimental design. Modern biology has become a far more quantitative science, but the quantitativeness has required new methods developed in the domains of <em>computational biology</em> and <em>bioinformatics</em>.</p>
<p>There is also great promise for personalized health, but in health the wide data-sharing that has underpinned success in the computational biology community is much harder to carry out.</p>
<p>We can expect to see these phenomena reflected in wider society. Particularly as we make use of more automated decision making based only on data.</p>
<p>The main phenomenon we see across the board is the shift in dynamic from the direct pathway between human and data, as traditionally mediated by classical statistics, to a new flow of information via the computer. This change of dynamics gives us the modern and emerging domain of <em>data science</em>.</p>
<h3 id="human-communication">Human Communication</h3>
<p>For human conversation to work, we require an internal model of who we are speaking to. We model each other, and combine our sense of who they are, who they think we are, and what has been said. This is our approach to dealing with the limited bandwidth connection we have. Empathy and understanding of intent. Mental dispositional concepts are used to augment our limited communication bandwidth.</p>
<p>Fritz Heider referred to the important point of a conversation as being that they are happenings that are "<em>psychologically represented</em> in each of the participants" (his emphasis) <span class="citation">(Heider, 1958)</span>.</p>
<object class="svgplot " align data="../slides/diagrams/anne-bob-conversation006.svg" style>
</object>
<center>
<em>Conversation relies on internal models of other individuals. </em>
</center>
<object class="svgplot " align data="../slides/diagrams/anne-bob-conversation007.svg" style>
</object>
<center>
<em>Misunderstanding of context and who we are talking to leads to arguments. </em>
</center>
<h3 id="machine-learning-and-narratives">Machine Learning and Narratives</h3>
<p><img class="" src="../slides/diagrams/Classic_baby_shoes.jpg" width="60%" height="auto" align="center" style="background:none; border:none; box-shadow:none; display:block; margin-left:auto; margin-right:auto"></p>
<center>
<em>For sale: baby shoes, never worn.</em>
</center>
<p>Consider the six word novel, apocryphally credited to Ernest Hemingway, "For sale: baby shoes, never worn". To understand what that means to a human, you need a great deal of additional context. Context that is not directly accessible to a machine, which lacks both the evolved and the contextual understanding of our own condition needed to realize the implication of the advert and what that implication means emotionally to the previous owner.</p>
<p><a href="https://www.youtube.com/watch?v=8FIEZXMUM2I&t=7"><img src="https://img.youtube.com/vi/8FIEZXMUM2I/0.jpg" /></a></p>
<p><a href="https://en.wikipedia.org/wiki/Fritz_Heider">Fritz Heider</a> and <a href="https://en.wikipedia.org/wiki/Marianne_Simmel">Marianne Simmel</a> performed experiments with animated shapes in 1944 <span class="citation">(Heider and Simmel, 1944)</span>. Our interpretation of these objects as showing motives and even emotion is a combination of our desire for narrative, a need for understanding of each other, and our ability to empathise. At one level, these are crudely drawn objects, but in another key way, the animator has communicated a story through simple facets such as their relative motions, their sizes and their actions. We apply our psychological representations to these faceless shapes in an effort to interpret their actions.</p>
<h3 id="faith-and-ai">Faith and AI</h3>
<p>There would seem to be at least three ways in which artificial intelligence and religion interconnect.</p>
<ol style="list-style-type: decimal">
<li>Artificial Intelligence as Cartoon Religion</li>
<li>Artificial Intelligence and Introspection</li>
<li>Independence of thought and Control: A Systemic Catch 22</li>
</ol>
<h3 id="singulariansm-ai-as-cartoon-religion">Singularism: AI as Cartoon Religion</h3>
<p>The first parallels one can find between artificial intelligence and religion come in somewhat of a cartoon doomsday scenario form. The publicly hyped fears of superintelligence and the singularity can equally be placed within the framework of the simpler questions that religion can try to answer. The parallels are</p>
<ol style="list-style-type: decimal">
<li>Superintelligence as god</li>
<li>Demi-god status achievable through transhumanism</li>
<li>Immortality through uploading the connectome</li>
<li>The day of judgement as the "singularity"</li>
</ol>
<p>The notion of an ultra-intelligence is similar to the notion of an interventionist god, with omniscience of the past, the present and the future. This notion was described by Pierre Simon Laplace.</p>
<p><img class="" src="../slides/diagrams/ml/Pierre-Simon_Laplace.png" width="30%" height="auto" align="center" style="background:none; border:none; box-shadow:none; display:block; margin-left:auto; margin-right:auto"></p>
<p><a href="https://play.google.com/books/reader?id=1YQPAAAAQAAJ&pg=PR17-IA2"><img src="../slides/diagrams/books/1YQPAAAAQAAJ-PR17-IA2.png" /></a></p>
<p>Famously, Laplace considered the idea of a deterministic Universe, one in which the model is <em>known</em>, or as the below translation refers to it, "an intelligence which could comprehend all the forces by which nature is animated". He speculates on an "intelligence" that can submit this vast data to analysis and proposes that such an entity would be able to predict the future.</p>
<blockquote>
<p>Given for one instant an intelligence which could comprehend all the forces by which nature is animated and the respective situation of the beings who compose it---an intelligence sufficiently vast to submit these data to analysis---it would embrace in the same formula the movements of the greatest bodies of the universe and those of the lightest atom; for it, nothing would be uncertain and the future, as the past, would be present in its eyes.</p>
</blockquote>
<p>This notion is known as <em>Laplace's demon</em> or <em>Laplace's superman</em>.</p>
<p><img class="" src="../slides/diagrams/laplacesDeterminismEnglish.png" width="60%" height="auto" align="center" style="background:none; border:none; box-shadow:none; display:block; margin-left:auto; margin-right:auto"></p>
<p>Unfortunately, most analyses of his ideas stop at that point, whereas his real point is that such a notion is unreachable. Not so much <em>superman</em> as <em>strawman</em>. Just three pages later in the "Philosophical Essay on Probabilities" <span class="citation">(Laplace, 1814)</span>, Laplace goes on to observe:</p>
<blockquote>
<p>The curve described by a simple molecule of air or vapor is regulated in a manner just as certain as the planetary orbits; the only difference between them is that which comes from our ignorance.</p>
<p>Probability is relative, in part to this ignorance, in part to our knowledge.</p>
</blockquote>
<p><a href="https://play.google.com/books/reader?id=1YQPAAAAQAAJ&pg=PR17-IA4"><img src="../slides/diagrams/books/1YQPAAAAQAAJ-PR17-IA4.png" /></a></p>
<p><img class="" src="../slides/diagrams/philosophicaless00lapliala.png" width="60%" height="auto" align="center" style="background:none; border:none; box-shadow:none; display:block; margin-left:auto; margin-right:auto"></p>
<p>In other words, we can never make use of the idealistic deterministic Universe due to our ignorance about the world. Laplace's suggestion, and his focus in this essay, is that we turn to probability to deal with this uncertainty. This is also our inspiration for using probability in machine learning.</p>
<p>The "forces by which nature is animated" is our <em>model</em>, the "situation of beings that compose it" is our <em>data</em> and the "intelligence sufficiently vast enough to submit these data to analysis" is our compute. The fly in the ointment is our <em>ignorance</em> about these aspects. And <em>probability</em> is the tool we use to incorporate this ignorance leading to uncertainty or <em>doubt</em> in our predictions.</p>
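<p>As a hedged illustration of how probability incorporates this ignorance (a standard conjugate-Gaussian sketch, not a method from the talk): given a model for how observations arise and a prior expressing our ignorance, the prediction comes with a variance, that is, with doubt. The true value, noise level and prior width below are all invented for the example.</p>

```python
import math
import random

random.seed(0)

# "situation of the beings": noisy observations of an unknown quantity
truth = 1.5
data = [truth + random.gauss(0, 0.5) for _ in range(10)]

# "forces by which nature is animated": a Gaussian model with known
# noise variance, plus a broad Gaussian prior over the unknown mean
noise_var, prior_var = 0.5 ** 2, 10.0

# "submit these data to analysis": the posterior is again Gaussian
n = len(data)
post_var = 1.0 / (n / noise_var + 1.0 / prior_var)
post_mean = post_var * sum(data) / noise_var

# the prediction expresses doubt: a mean plus a standard deviation
print(post_mean, math.sqrt(post_var))
```

<p>More data shrinks the posterior variance, but it never reaches zero: the doubt remains, which is exactly Laplace's point about our ignorance.</p>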
<p>The notion of Superintelligence in, e.g., Nick Bostrom's book <span class="citation">(Bostrom, 2014)</span>, is that of an infallible omniscience. A major narrative of the book is that the challenge of Superintelligence is to constrain the power of such an entity. In practice, this narrative is strongly related to Laplace's "straw superman". No such intelligence could exist due to our ignorance; in practice any real intelligence must express <em>doubt</em>.</p>
<p>Elon Musk has proposed that the only way to defeat the inevitable omniscience would be to augment ourselves with machine like capabilities. Ray Kurzweil has pushed the notion of developing ourselves by augmenting our existing cortex with direct connection to the internet.</p>
<p>Within Silicon Valley there is a particular obsession with 'uploading'. Their idea is that once the brain is connected, we can achieve immortality by continuing to exist digitally in an artificial environment of our own creation while our physical body is left behind. They want to remove the individual bandwidth limitation we place on ourselves.</p>
<p>But embodiment factors <span class="citation">(Lawrence, 2017)</span> imply that we are defined by our limitations. Removing the bandwidth limitation removes what it means to be human.</p>
<p>In Singularism, doomsday is the 'technological singularity', the moment at which computers rapidly outstrip our capabilities and take over our world. The high priests are the scientists, and the aim is to bring about the latter while restraining the former.</p>
<p><em>Singularism</em> is to religion what <em>scientology</em> is to science. Scientology is religion expressing itself as science and Singularism is science expressing itself as religion.</p>
<p>For further reading see <a href="http://inverseprobability.com/2016/05/09/machine-learning-futures-5">this post on Singularism</a> as well as this <a href="http://www.academia.edu/15037984/Singularitarians_AItheists_and_Why_the_Problem_with_Artificial_Intelligence_is_H.A.L._Humanity_At_Large_not_HAL">paper by Luciano Floridi</a> and this <a href="http://inverseprobability.com/2016/05/09/machine-learning-futures-6">review of Superintelligence</a> <span class="citation">(Bostrom, 2014)</span>.</p>
<h3 id="artificial-intelligence-and-introspection">Artificial Intelligence and Introspection</h3>
<p>Ignoring the cartoon view of religion we've outlined above, and focussing more on how religion can bring strength to people in their day-to-day living, religious environments provide a place to self-reflect and meditate on our existence and the wider cosmos. How are we here? What is our role? What makes us different?</p>
<p>Creating machine intelligences characterizes the manner in which we are different; it helps us understand what is special about us rather than the machine.</p>
<p>I have in the past argued strongly against the term artificial intelligence but there is a sense in which it is a good term. If we think of artificial plants, then we have the right sense in which we are creating an artificial intelligence. An artificial plant is fundamentally different from a real plant, but can appear similar, or from a distance identical. However, a creator of an artificial plant gains a greater appreciation for the complexity of a natural plant.</p>
<p>In a similar way, we might expect that attempts to emulate human intelligence would lead to a deeper appreciation of that intelligence. This type of reflection on who we are has a lot in common with many of the (to me) most endearing characteristics of religion.</p>
<h3 id="the-digital-catch-22">The Digital Catch 22</h3>
<img class="" src="../slides/diagrams/ai/Catch22.jpg" width="" height="auto" align="center" style="background:none; border:none; box-shadow:none; display:block; margin-left:auto; margin-right:auto">
<center>
<em>A digital Catch 22: for systems to watch over us they have to watch us. </em>
</center>
<p>A final parallel between the world of AI and that of religion is the conundrums they raise for us. In particular the tradeoffs between a paternalistic society and individual freedoms. Two models for artificial intelligence that may be realistic are the "Big Brother" and the "Big Mother" models.</p>
<p>Big Brother refers to the surveillance society and the control of populations that can be achieved with a greater understanding of the individual self. A perceptual understanding of the individual that could conceivably be better than the individual's own self-perception. This scenario was most famously explored by George Orwell, but it also came into being in Communist East Germany, where it is estimated that one in 66 citizens acted as an informant <span class="citation">(<em>Stasi</em>, 1999)</span>.</p>
<table>
<tr>
<td width="50%">
<img class="" src="../slides/diagrams/ai/Cropped-big-brother-is-watching-1984.png" width="" height="auto" align="center" style="background:none; border:none; box-shadow:none; display:block; margin-left:auto; margin-right:auto">
</td>
<td width="50%">
<img class="" src="../slides/diagrams/ai/548px-Plakat_Mutti_is_Watching_You.png" width="80%" height="auto" align="center" style="background:none; border:none; box-shadow:none; display:block; margin-left:auto; margin-right:auto">
</td>
</tr>
</table>
<center>
<em>The Big Brother to Big Mother dilemma. As computers help us they constrain us, leading to a form of dys-utopia. </em>
</center>
<p>But for a system to <em>watch over</em> us it first has to <em>watch us</em>. So the same understanding of the individual is also necessary for the "Big Mother" scenario, where intelligent agents provide for us in the manner in which our parents did when we were young. Both scenarios are disempowering in terms of individual liberties. In a metaphorical sense, this could be seen as a return to Eden, a surrendering of individual liberties for a perceived paradise. But those individual liberties are also what we value. There is a tension between the desire to create a perfect environment, where no evil exists, and our individual liberty. Our society chooses a balance between the pros and cons that attempts to sustain a diversity of perspectives and beliefs. Even if it were possible to use AI to organize society in such a way that particular malevolent behaviours were prevented, doing so may come at the cost of the individual freedom we enjoy. These are difficult trade-offs, and they exist both when explaining the nature of religious belief and when considering the nature of either the dystopian Big Brother or the "dys-utopian" Big Mother view of AI.</p>
<img class="" src="../slides/diagrams/ai/1024px-Thomas_Cole_The_Garden_of_Eden_Amon_Carter_Museum.jpg" width="" height="auto" align="center" style="background:none; border:none; box-shadow:none; display:block; margin-left:auto; margin-right:auto">
<center>
<em>The Garden of Eden by Thomas Cole </em>
</center>
<h3 id="conclusion">Conclusion</h3>
<p>We've provided an overview of the advances in artificial intelligence from the perspective of machine learning, and tried to give a sense of how machine learning models operate to learn about us.</p>
<p>We've highlighted a quintessential difference between humans and computers: the embodiment factor, the relatively restricted ability of humans to communicate themselves when compared to computers. We explored how this has affected our evolved relationship with data and the relationship between the human and narrative.</p>
<p>Finally, we explored three parallels between faith and AI, in particular the cartoon nature of religion based on technological promises of the singularity and AI. A more sophisticated relationship occurs when we see the way in which, as artificial intelligences invade our notion of personhood, we will need to introspect about who we are and what we want to be, a characteristic shared with many religions. The final parallel was in the emergent questions of AI: "Should we build an artificial intelligence to eliminate war?" has a strong parallel with the question "Why does God allow war?" War is a consequence of human choices. Building such a system would likely severely restrict our freedoms to make choices, and there is a tension between how much we wish those freedoms to be impinged upon versus the potential lives that could be saved.</p>
<h3 id="thanks">Thanks!</h3>
<ul>
<li>twitter: @lawrennd</li>
<li>blog: <a href="http://inverseprobability.com/blog.html">http://inverseprobability.com</a></li>
</ul>
<div id="refs" class="references">
<div id="ref-Ananthanarayanan-cat09">
<p>Ananthanarayanan, R., Esser, S.K., Simon, H.D., Modha, D.S., 2009. The cat is out of the bag: Cortical simulations with <span class="math inline">\(10^9\)</span> neurons, <span class="math inline">\(10^{13}\)</span> synapses, in: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis - Sc ’09. <a href="https://doi.org/10.1145/1654059.1654124" class="uri">https://doi.org/10.1145/1654059.1654124</a></p>
</div>
<div id="ref-Bostrom-superintelligence14">
<p>Bostrom, N., 2014. Superintelligence: Paths, dangers, strategies, 1st ed. Oxford University Press, Oxford, UK.</p>
</div>
<div id="ref-Heider:interpersonal58">
<p>Heider, F., 1958. The psychology of interpersonal relations. John Wiley.</p>
</div>
<div id="ref-Heider:experimental44">
<p>Heider, F., Simmel, M., 1944. An experimental study of apparent behavior. The American Journal of Psychology 57, 243–259.</p>
</div>
<div id="ref-Laplace:essai14">
<p>Laplace, P.S., 1814. Essai philosophique sur les probabilités, 2nd ed. Courcier, Paris.</p>
</div>
<div id="ref-Lawrence:embodiment17">
<p>Lawrence, N.D., 2017. Living together: Mind and machine intelligence. arXiv.</p>
</div>
<div id="ref-Reed-information98">
<p>Reed, C., Durlach, N.I., 1998. Note on information transfer rates in human communication. Presence: Teleoperators and Virtual Environments 7, 509–518. <a href="https://doi.org/10.1162/105474698565893" class="uri">https://doi.org/10.1162/105474698565893</a></p>
</div>
<div id="ref-Koehler-stasi99">
<p>Stasi: The untold story of the East German secret police, 1999.</p>
</div>
<div id="ref-Wiener:cybernetics48">
<p>Wiener, N., 1948. Cybernetics: Control and communication in the animal and the machine. MIT Press, Cambridge, MA.</p>
</div>
</div>
<div class="footnotes">
<hr />
<ol>
<li id="fn1"><p>the challenge of understanding what information pertains to is known as knowledge representation.<a href="#fnref1">↩</a></p></li>
</ol>
</div>
Wed, 12 Sep 2018 00:00:00 +0000
http://inverseprobability.com/talks/notes/faith.html
http://inverseprobability.com/talks/notes/faith.htmlnotesData Science and the Professions<div style="display:none">
$$\newcommand{\Amatrix}{\mathbf{A}}
\newcommand{\KL}[2]{\text{KL}\left( #1\,\|\,#2 \right)}
\newcommand{\Kaast}{\kernelMatrix_{\mathbf{ \ast}\mathbf{ \ast}}}
\newcommand{\Kastu}{\kernelMatrix_{\mathbf{ \ast} \inducingVector}}
\newcommand{\Kff}{\kernelMatrix_{\mappingFunctionVector \mappingFunctionVector}}
\newcommand{\Kfu}{\kernelMatrix_{\mappingFunctionVector \inducingVector}}
\newcommand{\Kuast}{\kernelMatrix_{\inducingVector \bf\ast}}
\newcommand{\Kuf}{\kernelMatrix_{\inducingVector \mappingFunctionVector}}
\newcommand{\Kuu}{\kernelMatrix_{\inducingVector \inducingVector}}
\newcommand{\Kuui}{\Kuu^{-1}}
\newcommand{\Qaast}{\mathbf{Q}_{\bf \ast \ast}}
\newcommand{\Qastf}{\mathbf{Q}_{\ast \mappingFunction}}
\newcommand{\Qfast}{\mathbf{Q}_{\mappingFunctionVector \bf \ast}}
\newcommand{\Qff}{\mathbf{Q}_{\mappingFunctionVector \mappingFunctionVector}}
\newcommand{\aMatrix}{\mathbf{A}}
\newcommand{\aScalar}{a}
\newcommand{\aVector}{\mathbf{a}}
\newcommand{\acceleration}{a}
\newcommand{\bMatrix}{\mathbf{B}}
\newcommand{\bScalar}{b}
\newcommand{\bVector}{\mathbf{b}}
\newcommand{\basisFunc}{\phi}
\newcommand{\basisFuncVector}{\boldsymbol{ \basisFunc}}
\newcommand{\basisFunction}{\phi}
\newcommand{\basisLocation}{\mu}
\newcommand{\basisMatrix}{\boldsymbol{ \Phi}}
\newcommand{\basisScalar}{\basisFunction}
\newcommand{\basisVector}{\boldsymbol{ \basisFunction}}
\newcommand{\activationFunction}{\phi}
\newcommand{\activationMatrix}{\boldsymbol{ \Phi}}
\newcommand{\activationScalar}{\basisFunction}
\newcommand{\activationVector}{\boldsymbol{ \basisFunction}}
\newcommand{\bigO}{\mathcal{O}}
\newcommand{\binomProb}{\pi}
\newcommand{\cMatrix}{\mathbf{C}}
\newcommand{\cbasisMatrix}{\hat{\boldsymbol{ \Phi}}}
\newcommand{\cdataMatrix}{\hat{\dataMatrix}}
\newcommand{\cdataScalar}{\hat{\dataScalar}}
\newcommand{\cdataVector}{\hat{\dataVector}}
\newcommand{\centeredKernelMatrix}{\mathbf{ \MakeUppercase{\centeredKernelScalar}}}
\newcommand{\centeredKernelScalar}{b}
\newcommand{\centeredKernelVector}{\centeredKernelScalar}
\newcommand{\centeringMatrix}{\mathbf{H}}
\newcommand{\chiSquaredDist}[2]{\chi_{#1}^{2}\left(#2\right)}
\newcommand{\chiSquaredSamp}[1]{\chi_{#1}^{2}}
\newcommand{\conditionalCovariance}{\boldsymbol{ \Sigma}}
\newcommand{\coregionalizationMatrix}{\mathbf{B}}
\newcommand{\coregionalizationScalar}{b}
\newcommand{\coregionalizationVector}{\mathbf{ \coregionalizationScalar}}
\newcommand{\covDist}[2]{\text{cov}_{#2}\left(#1\right)}
\newcommand{\covSamp}[1]{\text{cov}\left(#1\right)}
\newcommand{\covarianceScalar}{c}
\newcommand{\covarianceVector}{\mathbf{ \covarianceScalar}}
\newcommand{\covarianceMatrix}{\mathbf{C}}
\newcommand{\covarianceMatrixTwo}{\boldsymbol{ \Sigma}}
\newcommand{\croupierScalar}{s}
\newcommand{\croupierVector}{\mathbf{ \croupierScalar}}
\newcommand{\croupierMatrix}{\mathbf{ \MakeUppercase{\croupierScalar}}}
\newcommand{\dataDim}{p}
\newcommand{\dataIndex}{i}
\newcommand{\dataIndexTwo}{j}
\newcommand{\dataMatrix}{\mathbf{Y}}
\newcommand{\dataScalar}{y}
\newcommand{\dataSet}{\mathcal{D}}
\newcommand{\dataStd}{\sigma}
\newcommand{\dataVector}{\mathbf{ \dataScalar}}
\newcommand{\decayRate}{d}
\newcommand{\degreeMatrix}{\mathbf{ \MakeUppercase{\degreeScalar}}}
\newcommand{\degreeScalar}{d}
\newcommand{\degreeVector}{\mathbf{ \degreeScalar}}
% Already defined by latex
%\newcommand{\det}[1]{\left|#1\right|}
\newcommand{\diag}[1]{\text{diag}\left(#1\right)}
\newcommand{\diagonalMatrix}{\mathbf{D}}
\newcommand{\diff}[2]{\frac{\text{d}#1}{\text{d}#2}}
\newcommand{\diffTwo}[2]{\frac{\text{d}^2#1}{\text{d}#2^2}}
\newcommand{\displacement}{x}
\newcommand{\displacementVector}{\textbf{\displacement}}
\newcommand{\distanceMatrix}{\mathbf{ \MakeUppercase{\distanceScalar}}}
\newcommand{\distanceScalar}{d}
\newcommand{\distanceVector}{\mathbf{ \distanceScalar}}
\newcommand{\eigenvaltwo}{\ell}
\newcommand{\eigenvaltwoMatrix}{\mathbf{L}}
\newcommand{\eigenvaltwoVector}{\mathbf{l}}
\newcommand{\eigenvalue}{\lambda}
\newcommand{\eigenvalueMatrix}{\boldsymbol{ \Lambda}}
\newcommand{\eigenvalueVector}{\boldsymbol{ \lambda}}
\newcommand{\eigenvector}{\mathbf{ \eigenvectorScalar}}
\newcommand{\eigenvectorMatrix}{\mathbf{U}}
\newcommand{\eigenvectorScalar}{u}
\newcommand{\eigenvectwo}{\mathbf{v}}
\newcommand{\eigenvectwoMatrix}{\mathbf{V}}
\newcommand{\eigenvectwoScalar}{v}
\newcommand{\entropy}[1]{\mathcal{H}\left(#1\right)}
\newcommand{\errorFunction}{E}
\newcommand{\expDist}[2]{\left<#1\right>_{#2}}
\newcommand{\expSamp}[1]{\left<#1\right>}
\newcommand{\expectation}[1]{\left\langle #1 \right\rangle }
\newcommand{\expectationDist}[2]{\left\langle #1 \right\rangle _{#2}}
\newcommand{\expectedDistanceMatrix}{\mathcal{D}}
\newcommand{\eye}{\mathbf{I}}
\newcommand{\fantasyDim}{r}
\newcommand{\fantasyMatrix}{\mathbf{ \MakeUppercase{\fantasyScalar}}}
\newcommand{\fantasyScalar}{z}
\newcommand{\fantasyVector}{\mathbf{ \fantasyScalar}}
\newcommand{\featureStd}{\varsigma}
\newcommand{\gammaCdf}[3]{\mathcal{GAMMA CDF}\left(#1|#2,#3\right)}
\newcommand{\gammaDist}[3]{\mathcal{G}\left(#1|#2,#3\right)}
\newcommand{\gammaSamp}[2]{\mathcal{G}\left(#1,#2\right)}
\newcommand{\gaussianDist}[3]{\mathcal{N}\left(#1|#2,#3\right)}
\newcommand{\gaussianSamp}[2]{\mathcal{N}\left(#1,#2\right)}
\newcommand{\given}{|}
\newcommand{\half}{\frac{1}{2}}
\newcommand{\heaviside}{H}
\newcommand{\hiddenMatrix}{\mathbf{ \MakeUppercase{\hiddenScalar}}}
\newcommand{\hiddenScalar}{h}
\newcommand{\hiddenVector}{\mathbf{ \hiddenScalar}}
\newcommand{\identityMatrix}{\eye}
\newcommand{\inducingInputScalar}{z}
\newcommand{\inducingInputVector}{\mathbf{ \inducingInputScalar}}
\newcommand{\inducingInputMatrix}{\mathbf{Z}}
\newcommand{\inducingScalar}{u}
\newcommand{\inducingVector}{\mathbf{ \inducingScalar}}
\newcommand{\inducingMatrix}{\mathbf{U}}
\newcommand{\inlineDiff}[2]{\text{d}#1/\text{d}#2}
\newcommand{\inputDim}{q}
\newcommand{\inputMatrix}{\mathbf{X}}
\newcommand{\inputScalar}{x}
\newcommand{\inputSpace}{\mathcal{X}}
\newcommand{\inputVals}{\inputVector}
\newcommand{\inputVector}{\mathbf{ \inputScalar}}
\newcommand{\iterNum}{k}
\newcommand{\kernel}{\kernelScalar}
\newcommand{\kernelMatrix}{\mathbf{K}}
\newcommand{\kernelScalar}{k}
\newcommand{\kernelVector}{\mathbf{ \kernelScalar}}
\newcommand{\kff}{\kernelScalar_{\mappingFunction \mappingFunction}}
\newcommand{\kfu}{\kernelVector_{\mappingFunction \inducingScalar}}
\newcommand{\kuf}{\kernelVector_{\inducingScalar \mappingFunction}}
\newcommand{\kuu}{\kernelVector_{\inducingScalar \inducingScalar}}
\newcommand{\lagrangeMultiplier}{\lambda}
\newcommand{\lagrangeMultiplierMatrix}{\boldsymbol{ \Lambda}}
\newcommand{\lagrangian}{L}
\newcommand{\laplacianFactor}{\mathbf{ \MakeUppercase{\laplacianFactorScalar}}}
\newcommand{\laplacianFactorScalar}{m}
\newcommand{\laplacianFactorVector}{\mathbf{ \laplacianFactorScalar}}
\newcommand{\laplacianMatrix}{\mathbf{L}}
\newcommand{\laplacianScalar}{\ell}
\newcommand{\laplacianVector}{\mathbf{ \ell}}
\newcommand{\latentDim}{q}
\newcommand{\latentDistanceMatrix}{\boldsymbol{ \Delta}}
\newcommand{\latentDistanceScalar}{\delta}
\newcommand{\latentDistanceVector}{\boldsymbol{ \delta}}
\newcommand{\latentForce}{f}
\newcommand{\latentFunction}{u}
\newcommand{\latentFunctionVector}{\mathbf{ \latentFunction}}
\newcommand{\latentFunctionMatrix}{\mathbf{ \MakeUppercase{\latentFunction}}}
\newcommand{\latentIndex}{j}
\newcommand{\latentScalar}{z}
\newcommand{\latentVector}{\mathbf{ \latentScalar}}
\newcommand{\latentMatrix}{\mathbf{Z}}
\newcommand{\learnRate}{\eta}
\newcommand{\lengthScale}{\ell}
\newcommand{\rbfWidth}{\ell}
\newcommand{\likelihoodBound}{\mathcal{L}}
\newcommand{\likelihoodFunction}{L}
\newcommand{\locationScalar}{\mu}
\newcommand{\locationVector}{\boldsymbol{ \locationScalar}}
\newcommand{\locationMatrix}{\mathbf{M}}
\newcommand{\variance}[1]{\text{var}\left( #1 \right)}
\newcommand{\mappingFunction}{f}
\newcommand{\mappingFunctionMatrix}{\mathbf{F}}
\newcommand{\mappingFunctionTwo}{g}
\newcommand{\mappingFunctionTwoMatrix}{\mathbf{G}}
\newcommand{\mappingFunctionTwoVector}{\mathbf{ \mappingFunctionTwo}}
\newcommand{\mappingFunctionVector}{\mathbf{ \mappingFunction}}
\newcommand{\scaleScalar}{s}
\newcommand{\mappingScalar}{w}
\newcommand{\mappingVector}{\mathbf{ \mappingScalar}}
\newcommand{\mappingMatrix}{\mathbf{W}}
\newcommand{\mappingScalarTwo}{v}
\newcommand{\mappingVectorTwo}{\mathbf{ \mappingScalarTwo}}
\newcommand{\mappingMatrixTwo}{\mathbf{V}}
\newcommand{\maxIters}{K}
\newcommand{\meanMatrix}{\mathbf{M}}
\newcommand{\meanScalar}{\mu}
\newcommand{\meanTwoMatrix}{\mathbf{M}}
\newcommand{\meanTwoScalar}{m}
\newcommand{\meanTwoVector}{\mathbf{ \meanTwoScalar}}
\newcommand{\meanVector}{\boldsymbol{ \meanScalar}}
\newcommand{\mrnaConcentration}{m}
\newcommand{\naturalFrequency}{\omega}
\newcommand{\neighborhood}[1]{\mathcal{N}\left( #1 \right)}
\newcommand{\neilurl}{http://inverseprobability.com/}
\newcommand{\noiseMatrix}{\boldsymbol{ E}}
\newcommand{\noiseScalar}{\epsilon}
\newcommand{\noiseVector}{\boldsymbol{ \epsilon}}
\newcommand{\norm}[1]{\left\Vert #1 \right\Vert}
\newcommand{\normalizedLaplacianMatrix}{\hat{\mathbf{L}}}
\newcommand{\normalizedLaplacianScalar}{\hat{\ell}}
\newcommand{\normalizedLaplacianVector}{\hat{\mathbf{ \ell}}}
\newcommand{\numActive}{m}
\newcommand{\numBasisFunc}{m}
\newcommand{\numComponents}{m}
\newcommand{\numComps}{K}
\newcommand{\numData}{n}
\newcommand{\numFeatures}{K}
\newcommand{\numHidden}{h}
\newcommand{\numInducing}{m}
\newcommand{\numLayers}{\ell}
\newcommand{\numNeighbors}{K}
\newcommand{\numSequences}{s}
\newcommand{\numSuccess}{s}
\newcommand{\numTasks}{m}
\newcommand{\numTime}{T}
\newcommand{\numTrials}{S}
\newcommand{\outputIndex}{j}
\newcommand{\paramVector}{\boldsymbol{ \theta}}
\newcommand{\parameterMatrix}{\boldsymbol{ \Theta}}
\newcommand{\parameterScalar}{\theta}
\newcommand{\parameterVector}{\boldsymbol{ \parameterScalar}}
\newcommand{\partDiff}[2]{\frac{\partial#1}{\partial#2}}
\newcommand{\precisionScalar}{j}
\newcommand{\precisionVector}{\mathbf{ \precisionScalar}}
\newcommand{\precisionMatrix}{\mathbf{J}}
\newcommand{\pseudotargetScalar}{\widetilde{y}}
\newcommand{\pseudotargetVector}{\mathbf{ \pseudotargetScalar}}
\newcommand{\pseudotargetMatrix}{\mathbf{ \widetilde{Y}}}
\newcommand{\rank}[1]{\text{rank}\left(#1\right)}
\newcommand{\rayleighDist}[2]{\mathcal{R}\left(#1|#2\right)}
\newcommand{\rayleighSamp}[1]{\mathcal{R}\left(#1\right)}
\newcommand{\responsibility}{r}
\newcommand{\rotationScalar}{r}
\newcommand{\rotationVector}{\mathbf{ \rotationScalar}}
\newcommand{\rotationMatrix}{\mathbf{R}}
\newcommand{\sampleCovScalar}{s}
\newcommand{\sampleCovVector}{\mathbf{ \sampleCovScalar}}
\newcommand{\sampleCovMatrix}{\mathbf{s}}
\newcommand{\scalarProduct}[2]{\left\langle{#1},{#2}\right\rangle}
\newcommand{\sign}[1]{\text{sign}\left(#1\right)}
\newcommand{\sigmoid}[1]{\sigma\left(#1\right)}
\newcommand{\singularvalue}{\ell}
\newcommand{\singularvalueMatrix}{\mathbf{L}}
\newcommand{\singularvalueVector}{\mathbf{l}}
\newcommand{\sorth}{\mathbf{u}}
\newcommand{\spar}{\lambda}
\newcommand{\trace}[1]{\text{tr}\left(#1\right)}
\newcommand{\BasalRate}{B}
\newcommand{\DampingCoefficient}{C}
\newcommand{\DecayRate}{D}
\newcommand{\Displacement}{X}
\newcommand{\LatentForce}{F}
\newcommand{\Mass}{M}
\newcommand{\Sensitivity}{S}
\newcommand{\basalRate}{b}
\newcommand{\dampingCoefficient}{c}
\newcommand{\mass}{m}
\newcommand{\sensitivity}{s}
\newcommand{\springScalar}{\kappa}
\newcommand{\springVector}{\boldsymbol{ \kappa}}
\newcommand{\springMatrix}{\boldsymbol{ \mathcal{K}}}
\newcommand{\tfConcentration}{p}
\newcommand{\tfDecayRate}{\delta}
\newcommand{\tfMrnaConcentration}{f}
\newcommand{\tfVector}{\mathbf{ \tfConcentration}}
\newcommand{\velocity}{v}
\newcommand{\sufficientStatsScalar}{g}
\newcommand{\sufficientStatsVector}{\mathbf{ \sufficientStatsScalar}}
\newcommand{\sufficientStatsMatrix}{\mathbf{G}}
\newcommand{\switchScalar}{s}
\newcommand{\switchVector}{\mathbf{ \switchScalar}}
\newcommand{\switchMatrix}{\mathbf{S}}
\newcommand{\tr}[1]{\text{tr}\left(#1\right)}
\newcommand{\loneNorm}[1]{\left\Vert #1 \right\Vert_1}
\newcommand{\ltwoNorm}[1]{\left\Vert #1 \right\Vert_2}
\newcommand{\onenorm}[1]{\left\vert#1\right\vert_1}
\newcommand{\twonorm}[1]{\left\Vert #1 \right\Vert}
\newcommand{\vScalar}{v}
\newcommand{\vVector}{\mathbf{v}}
\newcommand{\vMatrix}{\mathbf{V}}
\newcommand{\varianceDist}[2]{\text{var}_{#2}\left( #1 \right)}
% Already defined by latex
%\newcommand{\vec}{#1:}
\newcommand{\vecb}[1]{\left(#1\right):}
\newcommand{\weightScalar}{w}
\newcommand{\weightVector}{\mathbf{ \weightScalar}}
\newcommand{\weightMatrix}{\mathbf{W}}
\newcommand{\weightedAdjacencyMatrix}{\mathbf{A}}
\newcommand{\weightedAdjacencyScalar}{a}
\newcommand{\weightedAdjacencyVector}{\mathbf{ \weightedAdjacencyScalar}}
\newcommand{\onesVector}{\mathbf{1}}
\newcommand{\zerosVector}{\mathbf{0}}
$$
</div>
<h2 id="what-is-machine-learning">What is Machine Learning?</h2>
<p>What is machine learning? At its most basic level machine learning is a combination of</p>
<p><span class="math display">\[ \text{data} + \text{model} \xrightarrow{\text{compute}} \text{prediction}\]</span></p>
<p>where <em>data</em> is our observations. They can be actively or passively acquired (meta-data). The <em>model</em> contains our assumptions, based on previous experience. That experience can be other data, it can come from transfer learning, or it can merely be our beliefs about the regularities of the universe. In humans our models include our inductive biases. The <em>prediction</em> is an action to be taken or a categorization or a quality score. The reason that machine learning has become a mainstay of artificial intelligence is the importance of predictions in artificial intelligence. The data and the model are combined through computation.</p>
<p>In practice we normally perform machine learning using two functions. To combine data with a model we typically make use of:</p>
<p><strong>a prediction function</strong> a function which is used to make the predictions. It includes our beliefs about the regularities of the universe, our assumptions about how the world works, e.g. smoothness, spatial similarities, temporal similarities.</p>
<p><strong>an objective function</strong> a function which defines the cost of misprediction. Typically it includes knowledge about the world's generating processes (probabilistic objectives) or the costs we pay for mispredictions (empirical risk minimization).</p>
<p>The combination of data and model through the prediction function and the objective function leads to a <em>learning algorithm</em>. The class of prediction functions and objective functions we can make use of is restricted by the algorithms they lead to. If the prediction function or the objective function are too complex, then it can be difficult to find an appropriate learning algorithm. Much of the academic field of machine learning is the quest for new learning algorithms that allow us to bring different types of models and data together.</p>
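<p>As a concrete sketch of how these two functions combine into a learning algorithm, the toy example below fits a straight line by stochastic gradient descent. Everything here (the data-generating rule, the learning rate, the number of passes) is invented for illustration, not taken from the text:</p>

```python
import random

# Toy data, generated from a hypothetical rule y = 2x + 0.5 plus noise.
random.seed(0)
x = [random.uniform(-1, 1) for _ in range(200)]
y = [2.0 * xi + 0.5 + random.gauss(0, 0.1) for xi in x]

def prediction_function(xi, w, b):
    # Our assumption about the world: outputs vary linearly with inputs.
    return w * xi + b

def objective_function(w, b):
    # The cost of misprediction: here, the sum of squared errors.
    return sum((yi - prediction_function(xi, w, b)) ** 2
               for xi, yi in zip(x, y))

# The learning algorithm the two functions lead to:
# stochastic gradient descent on the squared error.
w, b = 0.0, 0.0
learn_rate = 0.01
for _ in range(100):
    for xi, yi in zip(x, y):
        residual = yi - prediction_function(xi, w, b)
        w += learn_rate * residual * xi
        b += learn_rate * residual
```

<p>After the loop, <code>w</code> and <code>b</code> land close to the generating values of 2.0 and 0.5. Swapping in a different prediction function or objective function yields a different learning algorithm, which is exactly the design space described above.</p>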
<p>A useful reference for the state of the art in machine learning is the UK Royal Society report, <a href="https://royalsociety.org/~/media/policy/projects/machine-learning/publications/machine-learning-report.pdf">Machine Learning: Power and Promise of Computers that Learn by Example</a>.</p>
<p>You can also check my blog post on <a href="http://inverseprobability.com/2017/07/17/what-is-machine-learning">"What is Machine Learning?"</a></p>
<h3 id="artificial-intelligence-and-data-science">Artificial Intelligence and Data Science</h3>
<p>Machine learning technologies have been the driver of two related, but distinct disciplines. The first is <em>data science</em>. Data science is an emerging field that arises from the fact that we now collect so much data by happenstance, rather than by <em>experimental design</em>. Classical statistics is the science of drawing conclusions from data, and to do so statistical experiments are carefully designed. In the modern era we collect so much data that there's a desire to draw inferences directly from the data.</p>
<p>As well as machine learning, the field of data science draws from statistics, cloud computing, data storage (e.g. streaming data), visualization and data mining.</p>
<p>In contrast, artificial intelligence technologies typically focus on emulating some form of human behaviour, such as understanding an image, or some speech, or translating text from one form to another. The recent advances in artificial intelligence have come from machine learning providing the automation. But in contrast to data science, in artificial intelligence the data is normally collected with the specific task in mind. In this sense it has strong relations to classical statistics.</p>
<p>Classically artificial intelligence worried more about <em>logic</em> and <em>planning</em> and focussed less on data driven decision making. Modern machine learning owes more to the field of <em>Cybernetics</em> <span class="citation">(Wiener, 1948)</span> than artificial intelligence. Related fields include <em>robotics</em>, <em>speech recognition</em>, <em>language understanding</em> and <em>computer vision</em>.</p>
<p>There are strong overlaps between the fields: the wide availability of data by happenstance makes it easier to collect data for designing AI systems. These relations are strengthened by the wide availability of sensing technologies that are interconnected by cellular networks, WiFi and the internet. This phenomenon is sometimes known as the <em>Internet of Things</em>, but this feels like a dangerous misnomer. We must never forget that we are interconnecting people, not things.</p>
<h2 id="natural-and-artificial-intelligence-embodiment-factors">Natural and Artificial Intelligence: Embodiment Factors</h2>
<table>
<tr>
<td>
</td>
<td align="center">
<img class="" src="../slides/diagrams/IBM_Blue_Gene_P_supercomputer.jpg" width="40%" align="center" style="background:none; border:none; box-shadow:none;">
</td>
<td align="center">
<img class="" src="../slides/diagrams/ClaudeShannon_MFO3807.jpg" width="25%" align="center" style="background:none; border:none; box-shadow:none;">
</td>
</tr>
<tr>
<td>
compute
</td>
<td align="center">
<span class="math display">\[\approx 100 \text{ gigaflops}\]</span>
</td>
<td align="center">
<span class="math display">\[\approx 16 \text{ petaflops}\]</span>
</td>
</tr>
<tr>
<td>
communicate
</td>
<td align="center">
<span class="math display">\[1 \text{ gigabit/s}\]</span>
</td>
<td align="center">
<span class="math display">\[100 \text{ bit/s}\]</span>
</td>
</tr>
<tr>
<td>
(compute/communicate)
</td>
<td align="center">
<span class="math display">\[10^{4}\]</span>
</td>
<td align="center">
<span class="math display">\[10^{14}\]</span>
</td>
</tr>
</table>
<p>There is a fundamental limit placed on our intelligence based on our ability to communicate. Claude Shannon founded the field of information theory. The clever part of this theory is it allows us to separate our measurement of information from what the information pertains to<a href="#fn1" class="footnoteRef" id="fnref1"><sup>1</sup></a>.</p>
<p>Shannon measured information in bits. One bit of information is the amount of information I pass to you when I give you the result of a coin toss. Shannon was also interested in the amount of information in the English language. He estimated that on average a word in the English language contains 12 bits of information.</p>
<p>Given typical speaking rates, that gives us an estimate of our ability to communicate of around 100 bits per second <span class="citation">(Reed and Durlach, 1998)</span>. Computers on the other hand can communicate much more rapidly. Current wired network speeds are around a billion bits per second, ten million times faster.</p>
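<p>The back-of-envelope arithmetic can be made explicit. The speaking rate below is an assumed, illustrative figure; it gives the same order of magnitude as the roughly 100 bits per second quoted above:</p>

```python
import math

# One bit: the information gained from the result of a fair coin toss.
coin_toss_bits = -math.log2(0.5)   # = 1.0

# Shannon's estimate of the information content of an English word.
bits_per_word = 12

# An assumed typical speaking rate (illustrative figure, not from the text).
words_per_minute = 150

bits_per_second = bits_per_word * words_per_minute / 60
# Tens of bits per second: the same order of magnitude as the
# ~100 bit/s figure used for human communication above.
```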
<p>When it comes to compute though, our best estimates indicate our computers are slower. A typical modern computer can perform around 100 billion floating point operations per second, each floating point operation involving a 64-bit number. So the computer is processing around 6,400 billion bits per second.</p>
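<p>The corresponding calculation for the machine, following the figures in the paragraph above:</p>

```python
flops = 100e9        # ~100 billion floating point operations per second
bits_per_flop = 64   # each operation involves a 64-bit number
network = 1e9        # ~1 gigabit/s wired network speed

compute_bits_per_second = flops * bits_per_flop
# 6.4e12: the "6,400 billion bits per second" figure above.

compute_to_communicate = compute_bits_per_second / network
# ~6,400: the machine computes faster than it communicates, but the
# gap is tiny compared to the human case.
```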
<p>It's difficult to get similar estimates for humans, but by some estimates the amount of compute we would require to <em>simulate</em> a human brain is equivalent to that in the UK's fastest computer <span class="citation">(Ananthanarayanan et al., 2009)</span>, the Met Office machine in Exeter, which in 2018 ranks as the 11th fastest computer in the world. That machine simulates the world's weather each morning, and then simulates the world's climate. It is a 16 petaflop machine, processing around 1,000 <em>trillion</em> bits per second.</p>
<p>So when it comes to our ability to compute we are extraordinary, not compute in our conscious mind, but the underlying neuron firings that underpin our consciousness, our subconsciousness as well as our motor control etc. By analogy I sometimes like to think of us as a Formula One engine. But in terms of our ability to deploy that computation in actual use, to share the results of what we have inferred, we are very limited. So when you imagine the F1 car that represents a psyche, think of an F1 car with bicycle wheels.</p>
<p><img class="" src="../slides/diagrams/640px-Marcel_Renault_1903.jpg" width="70%" align="center" style="background:none; border:none; box-shadow:none;"></p>
<p>In contrast, our computers have less computational power, but they can communicate far more fluidly. They are more like a go-kart, less well powered, but with tires that allow them to deploy that power.</p>
<p><img class="" src="../slides/diagrams/Caleb_McDuff_WIX_Silence_Racing_livery.jpg" width="70%" align="center" style="background:none; border:none; box-shadow:none;"></p>
<p>For humans, that means much of our computation should be dedicated to considering <em>what</em> we should compute. To do that efficiently we need to model the world around us. The most complex thing in the world around us is other humans. So it is no surprise that we model them. We second guess what their intentions are, and our communication is only necessary when they are departing from how we model them. Naturally, for this to work well, we need to understand those we work closely with. So it is no surprise that social communication, social bonding, forms so much of a part of our use of our limited bandwidth.</p>
<p>There is a second effect here, our need to anthropomorphise objects around us. Our tendency to model our fellow humans extends to our interactions with other entities in our environment: to our pets, as well as to inanimate objects around us, such as computers or even our cars. This tendency to over-interpret could be a consequence of our limited ability to communicate.</p>
<p>For more details see this paper <a href="https://arxiv.org/abs/1705.07996">"Living Together: Mind and Machine Intelligence"</a>, and this <a href="http://inverseprobability.com/talks/lawrence-tedx17/living-together.html">TEDx talk</a>.</p>
<h2 id="evolved-relationship-with-information">Evolved Relationship with Information</h2>
<p>The high bandwidth of computers has resulted in a close relationship between the computer and data. Large amounts of information can flow between the two. The degree to which the computer is mediating our relationship with data means that we should consider it an intermediary.</p>
<p>Originally our low bandwidth relationship with data was affected by two characteristics. Firstly, by our tendency to over-interpret, driven by our need to extract as much knowledge from our low bandwidth information channel as possible. Secondly, by our improved understanding of the domain of <em>mathematical</em> statistics and how our cognitive biases can mislead us.</p>
<p>With this new set up there is a potential for assimilating far more information via the computer, but the computer can present this to us in various ways. If its motives are not aligned with ours, then it can misrepresent the information. This needn't be nefarious; it can simply be a result of the computer pursuing a different objective from ours. For example, if the computer is aiming to maximize our interaction time, that may be a different objective from ours, which may be to summarize information in a representative manner in the <em>shortest</em> possible length of time.</p>
<p>For example, for me it was a common experience to pick up my telephone with the intention of checking when my next appointment was, but to soon find myself distracted by another application on the phone, and end up reading something on the internet. By the time I'd finished reading, I would often have forgotten the reason I picked up my phone in the first place.</p>
<p>There are great benefits to be had from the huge amount of information we can unlock from this evolved relationship between us and data. In biology, large scale data sharing has been driven by a revolution in genomic, transcriptomic and epigenomic measurement. The improved inferences that can be drawn through summarizing data by computer have fundamentally changed the nature of biological science, and now this phenomenon is also influencing us in our daily lives as data measured by <em>happenstance</em> is increasingly used to characterize us.</p>
<p>Better mediation of this flow actually requires a better understanding of human-computer interaction. This in turn involves understanding our own intelligence better, what its cognitive biases are and how these might mislead us.</p>
<p>For further thoughts see <a href="https://www.theguardian.com/media-network/2015/jul/23/data-driven-economy-marketing">this Guardian article</a> from 2015 on marketing in the internet era and <a href="http://inverseprobability.com/2015/12/04/what-kind-of-ai">this blog post</a> on System Zero.</p>
<object class="svgplot" align data="../slides/diagrams/data-science/information-flow003.svg">
</object>
<center>
<em>New direction of information flow, information is reaching us mediated by the computer </em>
</center>
<h3 id="what-does-machine-learning-do">What does Machine Learning do?</h3>
<p>Any process of automation allows us to scale what we do by codifying a process in some way that makes it efficient and repeatable. Machine learning automates by emulating human (or other) actions found in data. It codifies those actions in the form of a mathematical function that is learnt by a computer. If we can create these mathematical functions in ways in which they can interconnect, then we can also build systems.</p>
<p>Machine learning works by codifying a prediction of interest into a mathematical function. For example, we can try to predict the probability that a customer wants to buy a jersey given knowledge of their age and the latitude where they live. The technique known as logistic regression estimates the odds that someone will buy a jumper as a linear weighted sum of the features of interest.</p>
<p><span class="math display">\[ \text{odds} = \frac{p(\text{bought})}{p(\text{not bought})} \]</span> <span class="math display">\[ \log \text{odds} = \beta_0 + \beta_1 \text{age} + \beta_2 \text{latitude}\]</span></p>
<p>Here <span class="math inline">\(\beta_0\)</span>, <span class="math inline">\(\beta_1\)</span> and <span class="math inline">\(\beta_2\)</span> are the parameters of the model. If <span class="math inline">\(\beta_1\)</span> and <span class="math inline">\(\beta_2\)</span> are both positive, then the log-odds that someone will buy a jumper increase with increasing latitude and age, so the further north you are and the older you are the more likely you are to buy a jumper. The parameter <span class="math inline">\(\beta_0\)</span> is an offset parameter, and gives the log-odds of buying a jumper at zero age and on the equator. It is likely to be negative, indicating that the purchase is odds-against. This is actually a classical statistical model, and models like logistic regression are widely used to estimate probabilities, from ad-click prediction to risk of disease.</p>
<p>This is called a generalized linear model; we can also think of it as estimating the <em>probability</em> of a purchase as a nonlinear function of the features (age, latitude) and the parameters (the <span class="math inline">\(\beta\)</span> values). The function is known as the <em>sigmoid</em> or <a href="https://en.wikipedia.org/wiki/Logistic_regression">logistic function</a>, thus the name <em>logistic</em> regression.</p>
<p><span class="math display">\[ p(\text{bought}) = \sigmoid{\beta_0 + \beta_1 \text{age} + \beta_2 \text{latitude}}\]</span></p>
<p>In the case where we have <em>features</em> to help us predict, we sometimes denote such features as a vector, <span class="math inline">\(\inputVector\)</span>, and we then use an inner product between the features and the parameters, <span class="math inline">\(\boldsymbol{\beta}^\top \inputVector = \beta_1 \inputScalar_1 + \beta_2 \inputScalar_2 + \beta_3 \inputScalar_3 ...\)</span>, to represent the argument of the sigmoid.</p>
<p><span class="math display">\[ p(\text{bought}) = \sigmoid{\boldsymbol{\beta}^\top \inputVector}\]</span></p>
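<p>As a minimal sketch of this model (the function and the default <span class="math inline">\(\beta\)</span> values below are my own illustration, not fitted to any data), the purchase probability can be computed directly:</p>

```python
import math

def sigmoid(z):
    """Logistic function: maps log-odds onto a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def p_bought(age, latitude, beta0=-3.0, beta1=0.02, beta2=0.05):
    """Probability that a customer buys a jumper under the logistic model.

    The default beta values are hypothetical, chosen only to illustrate
    positive age and latitude effects alongside a negative offset.
    """
    log_odds = beta0 + beta1 * age + beta2 * latitude
    return sigmoid(log_odds)
```

<p>With <span class="math inline">\(\beta_1\)</span> and <span class="math inline">\(\beta_2\)</span> positive the probability increases with age and latitude, and the negative offset makes the purchase odds-against at zero age on the equator.</p>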
<p>More generally, we aim to predict some aspect of our data, <span class="math inline">\(\dataScalar\)</span>, by relating it through a mathematical function, <span class="math inline">\(\mappingFunction(\cdot)\)</span>, to the parameters, <span class="math inline">\(\boldsymbol{\beta}\)</span> and the data, <span class="math inline">\(\inputVector\)</span>.</p>
<p><span class="math display">\[ \dataScalar = \mappingFunction\left(\inputVector, \boldsymbol{\beta}\right)\]</span></p>
<p>We call <span class="math inline">\(\mappingFunction(\cdot)\)</span> the <em>prediction function</em>.</p>
<p>To obtain the fit to data, we use a separate function called the <em>objective function</em> that gives us a mathematical representation of the difference between our predictions and the real data.</p>
<p><span class="math display">\[\errorFunction(\boldsymbol{\beta}, \dataMatrix, \inputMatrix)\]</span> A commonly used example (for instance in a regression problem) is least squares, <span class="math display">\[\errorFunction(\boldsymbol{\beta}, \dataMatrix, \inputMatrix) = \sum_{i=1}^\numData \left(\dataScalar_i - \mappingFunction(\inputVector_i, \boldsymbol{\beta})\right)^2.\]</span></p>
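<p>For a linear prediction function of a single feature, the least squares objective can be sketched as follows (the function name is my own):</p>

```python
def sum_of_squares(beta, x, y):
    """Least squares objective for a linear prediction function
    f(x, beta) = beta[0] + beta[1] * x: the sum of squared differences
    between the predictions and the observed data."""
    predict = lambda x_i: beta[0] + beta[1] * x_i
    return sum((y_i - predict(x_i)) ** 2 for x_i, y_i in zip(x, y))
```

<p>Fitting the model then amounts to choosing the parameters that make this objective as small as possible.</p>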
<p>If a linear prediction function is combined with the least squares objective function then that gives us a classical <em>linear regression</em>, another classical statistical model. Statistics often focusses on linear models because it makes interpretation of the model easier. Interpretation is key in statistics because the aim is normally to validate questions by analysis of data. Machine learning has typically focussed more on the prediction function itself and worried less about the interpretation of parameters, which are normally denoted by <span class="math inline">\(\mathbf{w}\)</span> instead of <span class="math inline">\(\boldsymbol{\beta}\)</span>. As a result <em>non-linear</em> functions are explored more often as they tend to improve quality of predictions but at the expense of interpretability.</p>
<ul>
<li><p>These are interpretable models: vital for applications such as disease prediction</p></li>
<li><p>Modern machine learning methods are less interpretable</p></li>
<li><p>Example: face recognition</p></li>
</ul>
<img class="" src="../slides/diagrams/deepface_neg.png" width="100%" align="center" style="background:none; border:none; box-shadow:none;">
<center>
<em>The DeepFace architecture <span class="citation">(Taigman et al., 2014)</span>, visualized through colors to represent the functional mappings at each layer. There are 120 million parameters in the model. </em>
</center>
<p>The DeepFace architecture <span class="citation">(Taigman et al., 2014)</span> consists of layers that deal with <em>translation</em> and <em>rotational</em> invariances. These layers are followed by three locally-connected layers and two fully-connected layers. Color illustrates feature maps produced at each layer. The net includes more than 120 million parameters, where more than 95% come from the local and fully connected layers.</p>
<img class="" src="../slides/diagrams/576px-Early_Pinball.jpg" width="50%" align="center" style="background:none; border:none; box-shadow:none;">
<center>
<em>Deep learning models are composition of simple functions. We can think of a pinball machine as an analogy. Each layer of pins corresponds to one of the layers of functions in the model. Input data is represented by the location of the ball from left to right when it is dropped in from the top. Output class comes from the position of the ball as it leaves the pins at the bottom. </em>
</center>
<p>We can think of what these models are doing as being similar to early pin ball machines. In a neural network, we input a number (or numbers), whereas in pinball, we input a ball. The location of the ball on the left-right axis can be thought of as the number. As the ball falls through the machine, each layer of pins can be thought of as a different layer of neurons. Each layer acts to move the ball from left to right.</p>
<p>In a pinball machine, when the ball gets to the bottom it might fall into a hole defining a score, in a neural network, that is equivalent to the decision: a classification of the input object.</p>
<p>An image has more than one number associated with it, so it's like playing pinball in a <em>hyper-space</em>.</p>
<object class="svgplot" align data="../slides/diagrams/pinball001.svg">
</object>
<center>
<em>At initialization, the pins, which represent the parameters of the function, aren't in the right place to bring the balls to the correct decisions. </em>
</center>
<object class="svgplot" align data="../slides/diagrams/pinball002.svg">
</object>
<center>
<em>After learning the pins are now in the right place to bring the balls to the correct decisions. </em>
</center>
<p>Learning involves moving all the pins into the right positions, so that the ball falls in the right place. But moving all these pins in hyperspace can be difficult. In a hyperspace you have to put a lot of data through the machine to explore the positions of all the pins. Adversarial learning reflects the fact that a ball moved a small distance can lead to a very different result.</p>
<p>Probabilistic methods explore more of the space by considering a range of possible paths for the ball through the machine.</p>
<h3 id="data-science">Data Science</h3>
<ul>
<li>Industrial Revolution 4.0?</li>
<li><em>Industrial Revolution</em> (1760-1840) term coined by Arnold Toynbee, late 19th century.</li>
<li>Maybe: But this one is dominated by <em>data</em> not <em>capital</em></li>
<li>That presents <em>challenges</em> and <em>opportunities</em></li>
</ul>
<p>compare <a href="https://www.theguardian.com/media-network/2015/mar/05/digital-oligarchy-algorithms-personal-data">digital oligarchy</a> vs <a href="https://www.theguardian.com/media-network/2015/aug/25/africa-benefit-data-science-information">how Africa can benefit from the data revolution</a></p>
<ul>
<li>Apple vs Nokia: How you handle disruption.</li>
</ul>
<p>Disruptive technologies take time to assimilate, and best practices, as well as the pitfalls of new technologies, take time to share. Historically, new technologies led to new professions. <a href="https://en.wikipedia.org/wiki/Isambard_Kingdom_Brunel">Isambard Kingdom Brunel</a> (born 1806) was a leading innovator in civil, mechanical and naval engineering. Each of these fields has its own professional institution, founded in 1818, 1847, and 1860 respectively.</p>
<p><a href="https://en.wikipedia.org/wiki/Nikola_Tesla">Nikola Tesla</a>, born in 1856, developed the modern approach to electrical distribution. The American Institute of Electrical Engineers was founded in 1884; the UK equivalent was founded in 1871.</p>
<p><a href="https://en.wikipedia.org/wiki/William_Shockley">William Shockley Jr</a>, born 1910, led the group that developed the transistor and was referred to as "the man who brought silicon to Silicon Valley". In 1963 the American Institute of Electrical Engineers merged with the Institute of Radio Engineers to form the Institute of Electrical and Electronics Engineers.</p>
<p><a href="https://en.wikipedia.org/wiki/Watts_Humphrey">Watts S. Humphrey</a>, born 1927, was known as the "father of software quality"; in the 1980s he founded a program aimed at understanding and managing the software process. The British Computer Society was founded in 1956.</p>
<p>Why the need for these professions? Much of it is about codification of best practice and developing trust between the public and practitioners. These fundamental characteristics of the professions are shared with the oldest professions (Medicine, Law) as well as the newest (Information Technology).</p>
<p>So where are we today? My best guess is that we are somewhere equivalent to the 1980s for software engineering. In terms of professional deployment we have a basic understanding of the equivalent of "programming", but much less understanding of <em>machine learning systems design</em> and <em>data infrastructure</em>: how the components we have developed interoperate in a reliable and accountable manner. Best practice is still evolving, but perhaps isn't being shared widely enough.</p>
<p>One problem is that the art of data science is superficially similar to regular software engineering, although in practice it is rather different. Modern software engineering practice operates to generate code which is well tested as it is written; agile programming techniques provide the appropriate degree of flexibility for the individual programmers alongside sufficient formalization and testing. These techniques have evolved from an overly restrictive formalization that was proposed in the early days of software engineering.</p>
<p>While data science involves programming, it is different in the following way. Most of the work in data science involves understanding the data and the appropriate manipulations to apply to extract knowledge from it. The eventual number of lines of code required to extract that knowledge is often very small, but the amount of thought and attention that needs to be applied to each line is much greater than for a traditional line of software code. Testing of those lines is also of a different nature: provisions have to be made for evolving data environments. Development work is often done on a static snapshot of data, but deployment is made in a live environment where the nature of the data changes. Quality control involves checking for degradation in performance arising from unanticipated changes in data quality. It may also need to check for regulatory conformity. For example, in the UK the General Data Protection Regulation stipulates standards of explainability and fairness that may need to be monitored. These concerns do not affect traditional software deployments.</p>
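<p>Such quality control might be sketched as a simple drift check against the development snapshot. The helper below is hypothetical, an illustration of the kind of monitoring required rather than any standard tool:</p>

```python
import statistics

def drift_alert(train_values, live_values, threshold=3.0):
    """Flag when the mean of a live data batch drifts away from the
    training snapshot, using a z-score style test on the batch mean."""
    mu = statistics.mean(train_values)
    sigma = statistics.stdev(train_values) or 1e-12  # guard zero spread
    n = len(live_values)
    z = abs(statistics.mean(live_values) - mu) / (sigma / n ** 0.5)
    return z > threshold
```

<p>In a live deployment a check of this kind would run on each incoming batch of data, triggering investigation rather than allowing silent degradation.</p>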
<p>Others are also pointing out these challenges, <a href="https://medium.com/@karpathy/software-2-0-a64152b37c35">this post</a> from Andrej Karpathy (now head of AI at Tesla) covers the notion of "Software 2.0". Google researchers have highlighted the challenges of "Technical Debt" in machine learning <span class="citation">(Sculley et al., 2015)</span>. Researchers at Berkeley have characterized the systems challenges associated with machine learning <span class="citation">(Stoica et al., 2017)</span>.</p>
<p>Data science is not only about technical expertise and analysis of data, we need to also generate a culture of decision making that acknowledges the true challenges in data-driven automated decision making. In particular, a focus on algorithms has neglected the importance of data in driving decisions. The quality of data is paramount in that poor quality data will inevitably lead to poor quality decisions. Anecdotally most data scientists will suggest that 80% of their time is spent on data clean up, and only 20% on actually modelling.</p>
<h3 id="the-software-crisis">The Software Crisis</h3>
<blockquote>
<p>The major cause of the software crisis is that the machines have become several orders of magnitude more powerful! To put it quite bluntly: as long as there were no machines, programming was no problem at all; when we had a few weak computers, programming became a mild problem, and now we have gigantic computers, programming has become an equally gigantic problem.</p>
<p>Edsger Dijkstra (1930-2002), The Humble Programmer</p>
</blockquote>
<p>In the late sixties early software programmers made note of the increasing costs of software development and termed the challenges associated with it as the "<a href="https://en.wikipedia.org/wiki/Software_crisis">Software Crisis</a>". Edsger Dijkstra referred to the crisis in his 1972 Turing Award winner's address.</p>
<h3 id="the-data-crisis">The Data Crisis</h3>
<blockquote>
<p>The major cause of the data crisis is that machines have become more interconnected than ever before. Data access is therefore cheap, but data quality is often poor. What we need is cheap high quality data. That implies that we develop processes for improving and verifying data quality that are efficient.</p>
<p>There would seem to be two ways for improving efficiency. Firstly, we should not duplicate work. Secondly, where possible we should automate work.</p>
</blockquote>
<p>What I term "The Data Crisis" is the modern equivalent of this problem. The quantity of modern data, the lack of attention paid to data as it is initially "laid down" and the costs of data cleaning are bringing about a crisis in data-driven decision making. Just as with software, the crisis is most correctly addressed by 'scaling' the manner in which we process our data. Duplication of work occurs because the value of data cleaning is not correctly recognised in management decision making processes. Automation of work is increasingly possible through techniques in "artificial intelligence", but this will also require better management of the data science pipeline so that data about data science (meta-data science) can be correctly assimilated and processed. The Alan Turing Institute has a program focussed on this area, <a href="https://www.turing.ac.uk/research_projects/artificial-intelligence-data-analytics/">AI for Data Analytics</a>.</p>
<ul>
<li>Society is becoming harder to monitor</li>
<li><p>Individual is becoming easier to monitor</p></li>
<li><p>social media monitoring for 'hate speech' can be easily turned to political dissent monitoring</p></li>
<li><p>can become more sinister when the target of the marketing is well understood and the (digital) environment of the target is also well controlled</p></li>
<li><p>What does it mean if a computer can predict our individual behavior better than we ourselves can?</p></li>
<li>Potential for explicit and implicit discrimination on the basis of race, religion, sexuality, health status</li>
<li>All prohibited under European law, but can pass unawares, or be implicit</li>
<li><p>GDPR: General Data Protection Regulation</p></li>
</ul>
<h3 id="discrimination" data-transition="none">Discrimination</h3>
<ul>
<li><p>Potential for explicit and implicit discrimination on the basis of race, religion, sexuality, health status</p></li>
<li><p>All prohibited under European law, but can pass unawares, or be implicit</p></li>
<li><p>GDPR: Good Data Practice Rules</p></li>
<li>Credit scoring, insurance, medical treatment</li>
<li>What if certain sectors of society are under-represented in our analysis?</li>
<li><p>What if Silicon Valley develops everything for us?</p></li>
</ul>
<p><img class="" src="../slides/diagrams/woman-tends-house-in-village-of-uganda-africa.jpg" width="50%" align="" style="background:none; border:none; box-shadow:none;"></p>
<ul>
<li>Work to ensure individual retains control of their own data</li>
<li>We accept privacy in our real lives; we need to accept it in our digital lives</li>
<li>Control of persona and ability to project</li>
<li>Need better technological solutions: trust and algorithms.</li>
</ul>
<h2 id="how-the-gdpr-may-help">How the GDPR May Help</h2>
<p>Early reactions to the General Data Protection Regulation by companies seem to have been fairly wary, but if we view the principles outlined in the GDPR as good practice, rather than regulation, it feels like companies can only improve their internal data ecosystems by conforming to it. For this reason, I like to think of the initials as standing for "Good Data Practice Rules" rather than General Data Protection Regulation. In particular, the term "data protection" is a misnomer, and indeed the earliest <a href="https://en.wikipedia.org/wiki/Convention_for_the_protection_of_individuals_with_regard_to_automatic_processing_of_personal_data">European data protection convention</a> (from 1981) refers to the protection of <em>individuals</em> with regard to the automatic processing of personal data, which is a much better sense of the term.</p>
<p>If we think of the legislation as protecting individuals, and instead of viewing it as regulation we view it as "Wouldn't it be good if ...", then, for example in respect of the <a href="https://en.wikipedia.org/wiki/Right_to_explanation">"right to an explanation"</a>, we might say: "Wouldn't it be good if we could explain why our automated decision making system made a particular decision?" That seems like good practice for an organization's automated decision making systems.</p>
<p>Similarly with the data minimization principles. Retaining the minimum amount of personal data needed to drive decisions could well lead to <em>better</em> decision making, as it causes us to be intentional about which data is used, rather than following the sloppier thinking that "more is better" encourages. Particularly when we consider that, to be truly useful, data has to be cleaned and maintained.</p>
<p>If GDPR is truly reflecting the interests of individuals, then it is also reflecting the interests of consumers, patients, users etc, each of whom make use of these systems. For any company that is customer facing, or any service that prides itself on the quality of its delivery to those individuals, "good data practice" should become part of the DNA of the organization.</p>
<h3 id="section" data-transition="None"></h3>
<p><img src="../slides/diagrams/20160609_132315.jpg"
align="center" width="70%" style="background:none; border:none;
box-shadow:none;" class="rotateimg90"></p>
<h3 id="section-1" data-transition="None"></h3>
<p><img src="../slides/diagrams/20160609_132338.jpg"
align="center" width="70%" style="background:none; border:none;
box-shadow:none;" class="rotateimg90"></p>
<h2 id="data-trusts">Data Trusts</h2>
<p>The machine learning solutions we depend on to drive automated decision making are dependent on data. But with regard to personal data there are important issues of privacy. Data sharing brings benefits, but it also exposes our digital selves: from the use of social media data for targeted advertising to influence us, to the use of genetic data to identify criminals or natural family members. Control of our virtual selves maps onto control of our actual selves.</p>
<p>The feudal system that is implied by current data protection legislation has significant power asymmetries at its heart: the data controller has a duty of care over the data subject, but the data subject may only discover failings in that duty of care when it is too late. Data controllers may also have conflicting motivations; often their primary motivation is <em>not</em> the data subject, who is merely one consideration in their wider agenda.</p>
<p><a href="https://www.theguardian.com/media-network/2016/jun/03/data-trusts-privacy-fears-feudalism-democracy">Data Trusts</a> <span class="citation">(Edwards, 2004,<span class="citation">Lawrence (2016)</span>)</span> are a potential solution to this problem. They are inspired by the <em>land societies</em> that formed in the 19th century to bring democratic representation to the growing middle classes. A land society was a mutual organisation where resources were pooled for the common good.</p>
<p>A Data Trust would be a legal entity in which the trustees' responsibility is entirely to the members of the trust, so that the motivation of the data controllers is aligned only with the data subjects. How data is handled would be subject to the terms under which the trust was convened. The success of an individual trust would be contingent on it satisfying its members with an appropriate balance of individual privacy against the benefits of data sharing.</p>
<p>Formation of Data Trusts became the number one recommendation of the Hall-Pesenti report on AI, but the manner in which they are formed will have a significant impact on their utility. It feels important to have a diversity of approaches, and yet it also feels important that any individual trust should be large enough to be taken seriously in representing the views of its members in wider negotiations.</p>
<ul>
<li><p>Reusability of Data</p></li>
<li><p>Deployment of Machine Learning Systems</p></li>
</ul>
<h2 id="data-readiness-levels">Data Readiness Levels</h2>
<p><a href="http://inverseprobability.com/2017/01/12/data-readiness-levels">Data Readiness Levels</a> <span class="citation">(Lawrence, 2017)</span> are an attempt to develop a language around data quality that can bridge the gap between technical solutions and decision makers such as managers and project planners. They are inspired by Technology Readiness Levels, which attempt to quantify the readiness of technologies for deployment.</p>
<p>Data-readiness describes, at its coarsest level, three separate stages of data graduation.</p>
<ul>
<li><p>Grade C - accessibility</p></li>
<li><p>Grade B - validity</p></li>
<li><p>Grade A - usability</p></li>
</ul>
<h3 id="accessibility-grade-c">Accessibility: Grade C</h3>
<p>The first grade refers to the accessibility of data. Most data science practitioners will be used to working with data-providers who, perhaps having had little experience of data-science before, state that they "have the data". More often than not, they have not verified this. A convenient term for this is "Hearsay Data", someone has <em>heard</em> that they have the data so they <em>say</em> they have it. This is the lowest grade of data readiness.</p>
<p>Progressing through Grade C involves ensuring that this data is accessible. Not just in terms of digital accessibility, but also for regulatory, ethical and intellectual property reasons.</p>
<h3 id="validity-grade-b">Validity: Grade B</h3>
<p>Data transits from Grade C to Grade B once we can begin digital analysis on the computer. Once the challenges of access to the data have been resolved, we can make the data available either via an API, or for direct loading into analysis software (such as Python, R, Matlab, Mathematica or SPSS). Once this has occurred the data is at the B4 level. Grade B involves the <em>validity</em> of the data. Does the data really represent what it purports to? There are challenges such as missing values, outliers and record duplication. Each of these needs to be investigated.</p>
<p>Grades B and C are important because, if the work done in these grades is documented well, it can be reused in other projects. Reuse of this labour is key to reducing the costs of data-driven automated decision making. There is a strong overlap between the work required in this grade and the statistical field of <a href="https://en.wikipedia.org/wiki/Exploratory_data_analysis"><em>exploratory data analysis</em></a> <span class="citation">(Tukey, 1977)</span>.</p>
<h3 id="usability-grade-a">Usability: Grade A</h3>
<p>Once the validity of the data is determined, the data set can be considered for use in a particular task. This stage of data readiness is more akin to what machine learning scientists are used to doing in universities: bringing an algorithm to bear on a well-understood data set.</p>
<p>In Grade A we are concerned about the utility of the data given a particular task. Grade A may involve additional data collection (experimental design in statistics) to ensure that the task is fulfilled.</p>
<p>This is the stage where the data and the model are brought together, so expertise in learning algorithms and their application is key. Further ethical considerations, such as the fairness of the resulting predictions are required at this stage. At the end of this stage a prototype model is ready for deployment.</p>
<p>Deployment and maintenance of machine learning models in production is another important issue, for which Data Readiness Levels are only part of the solution.</p>
<p>To find out more, or to contribute ideas go to <a href="http://data-readiness.org" class="uri">http://data-readiness.org</a></p>
<p>Throughout the data preparation pipeline, it is important to have close interaction between data scientists and application domain experts. Decisions on data preparation taken outside the context of application have dangerous downstream consequences. This provides an additional burden on the data scientist as they are required for each project, but it should also be seen as a learning and familiarization exercise for the domain expert. Long term, just as biologists have found it necessary to assimilate the skills of the bioinformatician to be effective in their science, most domains will also require a familiarity with the nature of data driven decision making and its application. Working closely with data-scientists on data preparation is one way to begin this sharing of best practice.</p>
<p>The processes involved in Grade C and B are often badly taught in courses on data science. Perhaps not due to a lack of interest in the areas, but maybe more due to a lack of access to real world examples where data quality is poor.</p>
<p>These stages of data science are also ridden with ambiguity. In the long term they could do with more formalization, and automation, but best practice needs to be understood by a wider community before that can happen.</p>
<h2 id="assessing-the-organizations-readiness">Assessing the Organization's Readiness</h2>
<p>Assessing the readiness of data for analysis is one action that can be taken, but assessing the teams that need to assimilate the information in the data is the other side of the coin. With this in mind both <a href="https://medium.com/@damoncivin/the-joel-test-for-data-readiness-4882aae64753">Damon Civin</a> and <a href="https://blog.dominodatalab.com/joel-test-data-science/">Nick Elprin</a> have independently proposed the idea of a "Data Joel Test". A "<a href="https://www.joelonsoftware.com/2000/08/09/the-joel-test-12-steps-to-better-code/">Joel Test</a>" is a short questionnaire to establish the ability of a team to handle software engineering tasks. It is designed as a rough and ready capability assessment. A "Data Joel Test" is similar, but for assessing the capability of a team in performing data science.</p>
<h3 id="artificial-intelligence" class="slide:" data-transition="none">Artificial Intelligence</h3>
<ul>
<li><p>Challenges in deploying AI.</p></li>
<li><p>Currently this is in the form of "machine learning systems"</p></li>
</ul>
<h3 id="internet-of-people" class="slide:" data-transition="none">Internet of People</h3>
<ul>
<li><p>Fog computing: barrier between cloud and device blurring.</p>
<ul>
<li>Computing on the Edge</li>
</ul></li>
<li><p>Complex feedback between algorithm and implementation</p></li>
</ul>
<h3 id="deploying-ml-in-real-world-machine-learning-systems-design" class="slide:" data-transition="none">Deploying ML in Real World: Machine Learning Systems Design</h3>
<ul>
<li><p>Major new challenge for systems designers.</p></li>
<li><p>Internet of Intelligence but currently:</p>
<ul>
<li>AI systems are <em>fragile</em></li>
</ul></li>
</ul>
<h2 id="machine-learning-system-design">Machine Learning System Design</h2>
<p>The way we are deploying artificial intelligence systems in practice is to build up systems of machine learning components. To build a machine learning system, we decompose the task into parts, each of which we can emulate with ML methods. These parts are typically independently constructed and verified. For example, in a driverless car we can decompose the tasks into components such as "pedestrian detection" and "road line detection". Each of these components can be constructed with, for example, an independent classifier. We can then superimpose a logic on top. For example, "Follow the road line unless you detect a pedestrian in the road".</p>
<p>This allows for verification of car performance, as long as we can verify the individual components. However, it also implies that the AI systems we deploy are <em>fragile</em>.</p>
<p>Our intelligent systems are composed by "pigeonholing" each individual task, then substituting a machine learning model for it.</p>
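<p>Such a superimposed logic might be sketched as follows; the inputs stand in for the outputs of independently trained components, and all names and thresholds here are hypothetical:</p>

```python
def control_decision(pedestrian_prob, line_offset, threshold=0.9):
    """Hand-written rule on top of two ML components:
    'follow the road line unless you detect a pedestrian in the road'.

    pedestrian_prob: output of a pedestrian-detection classifier in [0, 1].
    line_offset: signed offset from the road line (negative = left of line).
    """
    if pedestrian_prob > threshold:
        return "emergency_stop"
    # Otherwise steer back towards the line found by the road-line component.
    if line_offset > 0:
        return "steer_left"
    if line_offset < 0:
        return "steer_right"
    return "continue"
```

<p>Note how a small error in either component can flip the decision discontinuously, which is one source of the fragility of such systems.</p>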
<h3 id="rapid-reimplementation">Rapid Reimplementation</h3>
<p>This is also the classical approach to automation, but in traditional automation we also ensure the <em>environment</em> in which the system operates becomes controlled. For example, trains run on railway lines, fast cars run on motorways, goods are manufactured in a controlled factory environment.</p>
<p>The difference with modern automated decision making systems is our intention is to deploy them in the <em>uncontrolled</em> environment that makes up our own world.</p>
<p>This exposes us either to unforeseen circumstances or to adversarial action. And yet it is unclear whether our intelligent systems are capable of adapting to this.</p>
<p>We become exposed to mischief and adversaries. Adversaries may intentionally wish to take over the artificial intelligence system, and mischief is the constant practice of many in our society. Simply watching a 10 year old interact with a voice agent such as Alexa or Siri shows that they are delighted when they can make the "intelligent" agent seem foolish.</p>
<img class="rotateimg90" src="../slides/diagrams/2017-10-12 16.47.34.jpg" width="40%" align="center" style="background:none; border:none; box-shadow:none;">
<center>
<em>Watt's Governor as held by "Science" on Holborn Viaduct</em>
</center>
<img class="center" src="../slides/diagrams/SteamEngine_Boulton&Watt_1784_neg.png" width="50%" align="" style="background:none; border:none; box-shadow:none;">
<center>
<em>Watt's Steam Engine which made Steam Power Efficient and Practical</em>
</center>
<p>One of the first automated decision making systems was Watt's governor, as held by "Science" on Holborn Viaduct. Watt's governor was a key component in his steam engine. It sensed increases in the engine's speed and closed the steam valve to prevent the engine from overspeeding and destroying itself. Until the invention of this device, it was a human job to do this.</p>
<p>The formal study of governors and other feedback control devices was begun by <a href="https://en.wikipedia.org/wiki/James_Clerk_Maxwell">James Clerk Maxwell</a>, the Scottish physicist. This field became the foundation of our modern techniques of artificial intelligence through Norbert Wiener's book <em>Cybernetics</em> <span class="citation">(Wiener, 1948)</span>. Cybernetics is the Greek word for governor, which in itself simply means helmsman in English.</p>
<p>The recent WannaCry virus that had a wide impact on our health services ecosystem was exploiting a security flaw in Windows systems that was first exploited by a virus called Stuxnet.</p>
<p>Stuxnet was a virus designed to infect the Iranian nuclear program's Uranium enrichment centrifuges. A centrifuge is prevented from overspeed by a controller, just like Watt's governor. Only now it is implemented in control logic, in this case on a Siemens PLC controller.</p>
<p>Stuxnet infected these controllers and took over the response signal in the centrifuge, fooling the system into thinking that no overspeed was occurring. As a result, the centrifuges destroyed themselves through spinning too fast.</p>
<p>This is equivalent to detaching Watt's governor from the steam engine. Such sabotage would be easily recognized by a steam engine operator. The challenge for the operators of the Iranian Uranium centrifuges was that the sabotage was occurring inside the electronics.</p>
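The governor's feedback principle, and what its removal does, can be sketched in a few lines (a toy simulation of my own, not from the talk; the constants are illustrative): with the speed signal connected, the valve throttles back as speed rises and the engine settles at a safe equilibrium; detach the feedback, as Stuxnet effectively did, and the speed runs away.

```python
def simulate(feedback=True, steps=50):
    """Toy engine: a proportional 'governor' closes the valve as speed rises."""
    speed = 0.0
    for _ in range(steps):
        # the governor senses the speed and throttles the steam valve;
        # without feedback the valve is simply left fully open
        valve = max(0.0, 1.0 - 0.1 * speed) if feedback else 1.0
        speed += valve * 1.0   # steam accelerates the engine
        speed *= 0.98          # frictional losses
    return speed

print(simulate(feedback=True))   # settles near a safe operating speed
print(simulate(feedback=False))  # grows far beyond it without the governor
```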
<p>That is the effect of an adversary on an intelligent system, but even without adversaries, the mischief of a 10 year old can confuse our AIs.</p>
<p><a href="https://www.youtube.com/watch?v=1y2UKz47gew&t="><img src="https://img.youtube.com/vi/1y2UKz47gew/0.jpg" /></a></p>
<p>Asking Siri "What is a trillion to the power of a thousand minus one?" leads to a 30 minute response consisting of only 9s. I found this out because my nine year old grabbed my phone and did it. The only way to stop Siri was to force closure. This is an interesting example of a system feature that's <em>not</em> a bug, in fact it requires clever processing from Wolfram Alpha. But it's an unexpected result from the system performing correctly.</p>
<p>This challenge of facing a circumstance that was unenvisaged in design but has consequences in deployment becomes far larger when the environment is uncontrolled. Or, in the extreme case, where the actions of the intelligent system affect the wider environment and change it.</p>
<p>These unforeseen circumstances are likely to lead to the need for much more efficient turn-around and update for our intelligent systems. Whether we are correcting for security flaws (which <em>are</em> bugs) or for unenvisaged circumstantial challenges (an issue I'm referring to as <em>peppercorns</em>), rapid deployment of system updates is required. For example, Apple have "fixed" the problem of Siri returning long numbers.</p>
<p>The challenge is particularly acute because of the <em>scale</em> at which we can deploy AI solutions. This means when something does go wrong, it may be going wrong in billions of households simultaneously.</p>
<p>See also <a href="http://inverseprobability.com/2018/02/06/natural-and-artificial-intelligence">this blog on the differences between natural and artificial intelligence</a> and this paper <a href="http://inverseprobability.com/2017/11/15/decision-making">on the need for diversity in decision making</a>.</p>
<ul>
<li><p>Artificial Intelligence and Data Science are fundamentally different.</p></li>
<li><p>In one you are dealing with data collected by happenstance.</p></li>
<li><p>In the other you are trying to build systems in the real world, often by actively collecting data.</p></li>
<li><p>Our approaches to systems design are building powerful machines that will be deployed in evolving environments.</p></li>
<li>twitter: @lawrennd</li>
<li><p>blog: <a href="http://inverseprobability.com/blog.html">http://inverseprobability.com</a></p></li>
</ul>
<div id="refs" class="references">
<div id="ref-Ananthanarayanan-cat09">
<p>Ananthanarayanan, R., Esser, S.K., Simon, H.D., Modha, D.S., 2009. The cat is out of the bag: Cortical simulations with <span class="math inline">\(10^9\)</span> neurons, <span class="math inline">\(10^{13}\)</span> synapses, in: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis - Sc ’09. <a href="https://doi.org/10.1145/1654059.1654124" class="uri">https://doi.org/10.1145/1654059.1654124</a></p>
</div>
<div id="ref-Edwards:privacy04">
<p>Edwards, L., 2004. The problem with privacy. International Review of Law Computers & Technology 18, 263–294.</p>
</div>
<div id="ref-Lawrence:drl17">
<p>Lawrence, N.D., 2017. Data readiness levels. arXiv.</p>
</div>
<div id="ref-Lawrence:trusts16">
<p>Lawrence, N.D., 2016. Data trusts could allay our privacy fears.</p>
</div>
<div id="ref-Reed-information98">
<p>Reed, C., Durlach, N.I., 1998. Note on information transfer rates in human communication. Presence Teleoperators & Virtual Environments 7, 509–518. <a href="https://doi.org/10.1162/105474698565893" class="uri">https://doi.org/10.1162/105474698565893</a></p>
</div>
<div id="ref-Sculley:debt15">
<p>Sculley, D., Holt, G., Golovin, D., Davydov, E., Phillips, T., Ebner, D., Chaudhary, V., Young, M., Crespo, J.-F., Dennison, D., 2015. Hidden technical debt in machine learning systems, in: Cortes, C., Lawrence, N.D., Lee, D.D., Sugiyama, M., Garnett, R. (Eds.), Advances in Neural Information Processing Systems 28. Curran Associates, Inc., pp. 2503–2511.</p>
</div>
<div id="ref-Stoica:systemsml17">
<p>Stoica, I., Song, D., Popa, R.A., Patterson, D.A., Mahoney, M.W., Katz, R.H., Joseph, A.D., Jordan, M., Hellerstein, J.M., Gonzalez, J., Goldberg, K., Ghodsi, A., Culler, D.E., Abbeel, P., 2017. A berkeley view of systems challenges for ai (No. UCB/EECS-2017-159). EECS Department, University of California, Berkeley.</p>
</div>
<div id="ref-Taigman:deepface14">
<p>Taigman, Y., Yang, M., Ranzato, M., Wolf, L., 2014. DeepFace: Closing the gap to human-level performance in face verification, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. <a href="https://doi.org/10.1109/CVPR.2014.220" class="uri">https://doi.org/10.1109/CVPR.2014.220</a></p>
</div>
<div id="ref-Tukey:exploratory77">
<p>Tukey, J.W., 1977. Exploratory data analysis. Addison-Wesley.</p>
</div>
<div id="ref-Wiener:cybernetics48">
<p>Wiener, N., 1948. Cybernetics: Control and communication in the animal and the machine. MIT Press, Cambridge, MA.</p>
</div>
</div>
<div class="footnotes">
<hr />
<ol>
<li id="fn1"><p>the challenge of understanding what information pertains to is known as knowledge representation.<a href="#fnref1">↩</a></p></li>
</ol>
</div>
Wed, 05 Sep 2018 00:00:00 +0000
http://inverseprobability.com/talks/notes/data-science-and-the-professions.html
Introduction to Gaussian Processes<div style="display:none">
$$\newcommand{\Amatrix}{\mathbf{A}}
\newcommand{\KL}[2]{\text{KL}\left( #1\,\|\,#2 \right)}
\newcommand{\Kaast}{\kernelMatrix_{\mathbf{ \ast}\mathbf{ \ast}}}
\newcommand{\Kastu}{\kernelMatrix_{\mathbf{ \ast} \inducingVector}}
\newcommand{\Kff}{\kernelMatrix_{\mappingFunctionVector \mappingFunctionVector}}
\newcommand{\Kfu}{\kernelMatrix_{\mappingFunctionVector \inducingVector}}
\newcommand{\Kuast}{\kernelMatrix_{\inducingVector \bf\ast}}
\newcommand{\Kuf}{\kernelMatrix_{\inducingVector \mappingFunctionVector}}
\newcommand{\Kuu}{\kernelMatrix_{\inducingVector \inducingVector}}
\newcommand{\Kuui}{\Kuu^{-1}}
\newcommand{\Qaast}{\mathbf{Q}_{\bf \ast \ast}}
\newcommand{\Qastf}{\mathbf{Q}_{\ast \mappingFunction}}
\newcommand{\Qfast}{\mathbf{Q}_{\mappingFunctionVector \bf \ast}}
\newcommand{\Qff}{\mathbf{Q}_{\mappingFunctionVector \mappingFunctionVector}}
\newcommand{\aMatrix}{\mathbf{A}}
\newcommand{\aScalar}{a}
\newcommand{\aVector}{\mathbf{a}}
\newcommand{\acceleration}{a}
\newcommand{\bMatrix}{\mathbf{B}}
\newcommand{\bScalar}{b}
\newcommand{\bVector}{\mathbf{b}}
\newcommand{\basisFunc}{\phi}
\newcommand{\basisFuncVector}{\boldsymbol{ \basisFunc}}
\newcommand{\basisFunction}{\phi}
\newcommand{\basisLocation}{\mu}
\newcommand{\basisMatrix}{\boldsymbol{ \Phi}}
\newcommand{\basisScalar}{\basisFunction}
\newcommand{\basisVector}{\boldsymbol{ \basisFunction}}
\newcommand{\activationFunction}{\phi}
\newcommand{\activationMatrix}{\boldsymbol{ \Phi}}
\newcommand{\activationScalar}{\basisFunction}
\newcommand{\activationVector}{\boldsymbol{ \basisFunction}}
\newcommand{\bigO}{\mathcal{O}}
\newcommand{\binomProb}{\pi}
\newcommand{\cMatrix}{\mathbf{C}}
\newcommand{\cbasisMatrix}{\hat{\boldsymbol{ \Phi}}}
\newcommand{\cdataMatrix}{\hat{\dataMatrix}}
\newcommand{\cdataScalar}{\hat{\dataScalar}}
\newcommand{\cdataVector}{\hat{\dataVector}}
\newcommand{\centeredKernelMatrix}{\mathbf{ \MakeUppercase{\centeredKernelScalar}}}
\newcommand{\centeredKernelScalar}{b}
\newcommand{\centeredKernelVector}{\centeredKernelScalar}
\newcommand{\centeringMatrix}{\mathbf{H}}
\newcommand{\chiSquaredDist}[2]{\chi_{#1}^{2}\left(#2\right)}
\newcommand{\chiSquaredSamp}[1]{\chi_{#1}^{2}}
\newcommand{\conditionalCovariance}{\boldsymbol{ \Sigma}}
\newcommand{\coregionalizationMatrix}{\mathbf{B}}
\newcommand{\coregionalizationScalar}{b}
\newcommand{\coregionalizationVector}{\mathbf{ \coregionalizationScalar}}
\newcommand{\covDist}[2]{\text{cov}_{#2}\left(#1\right)}
\newcommand{\covSamp}[1]{\text{cov}\left(#1\right)}
\newcommand{\covarianceScalar}{c}
\newcommand{\covarianceVector}{\mathbf{ \covarianceScalar}}
\newcommand{\covarianceMatrix}{\mathbf{C}}
\newcommand{\covarianceMatrixTwo}{\boldsymbol{ \Sigma}}
\newcommand{\croupierScalar}{s}
\newcommand{\croupierVector}{\mathbf{ \croupierScalar}}
\newcommand{\croupierMatrix}{\mathbf{ \MakeUppercase{\croupierScalar}}}
\newcommand{\dataDim}{p}
\newcommand{\dataIndex}{i}
\newcommand{\dataIndexTwo}{j}
\newcommand{\dataMatrix}{\mathbf{Y}}
\newcommand{\dataScalar}{y}
\newcommand{\dataSet}{\mathcal{D}}
\newcommand{\dataStd}{\sigma}
\newcommand{\dataVector}{\mathbf{ \dataScalar}}
\newcommand{\decayRate}{d}
\newcommand{\degreeMatrix}{\mathbf{ \MakeUppercase{\degreeScalar}}}
\newcommand{\degreeScalar}{d}
\newcommand{\degreeVector}{\mathbf{ \degreeScalar}}
% Already defined by latex
%\newcommand{\det}[1]{\left|#1\right|}
\newcommand{\diag}[1]{\text{diag}\left(#1\right)}
\newcommand{\diagonalMatrix}{\mathbf{D}}
\newcommand{\diff}[2]{\frac{\text{d}#1}{\text{d}#2}}
\newcommand{\diffTwo}[2]{\frac{\text{d}^2#1}{\text{d}#2^2}}
\newcommand{\displacement}{x}
\newcommand{\displacementVector}{\textbf{\displacement}}
\newcommand{\distanceMatrix}{\mathbf{ \MakeUppercase{\distanceScalar}}}
\newcommand{\distanceScalar}{d}
\newcommand{\distanceVector}{\mathbf{ \distanceScalar}}
\newcommand{\eigenvaltwo}{\ell}
\newcommand{\eigenvaltwoMatrix}{\mathbf{L}}
\newcommand{\eigenvaltwoVector}{\mathbf{l}}
\newcommand{\eigenvalue}{\lambda}
\newcommand{\eigenvalueMatrix}{\boldsymbol{ \Lambda}}
\newcommand{\eigenvalueVector}{\boldsymbol{ \lambda}}
\newcommand{\eigenvector}{\mathbf{ \eigenvectorScalar}}
\newcommand{\eigenvectorMatrix}{\mathbf{U}}
\newcommand{\eigenvectorScalar}{u}
\newcommand{\eigenvectwo}{\mathbf{v}}
\newcommand{\eigenvectwoMatrix}{\mathbf{V}}
\newcommand{\eigenvectwoScalar}{v}
\newcommand{\entropy}[1]{\mathcal{H}\left(#1\right)}
\newcommand{\errorFunction}{E}
\newcommand{\expDist}[2]{\left<#1\right>_{#2}}
\newcommand{\expSamp}[1]{\left<#1\right>}
\newcommand{\expectation}[1]{\left\langle #1 \right\rangle }
\newcommand{\expectationDist}[2]{\left\langle #1 \right\rangle _{#2}}
\newcommand{\expectedDistanceMatrix}{\mathcal{D}}
\newcommand{\eye}{\mathbf{I}}
\newcommand{\fantasyDim}{r}
\newcommand{\fantasyMatrix}{\mathbf{ \MakeUppercase{\fantasyScalar}}}
\newcommand{\fantasyScalar}{z}
\newcommand{\fantasyVector}{\mathbf{ \fantasyScalar}}
\newcommand{\featureStd}{\varsigma}
\newcommand{\gammaCdf}[3]{\mathcal{GAMMA CDF}\left(#1|#2,#3\right)}
\newcommand{\gammaDist}[3]{\mathcal{G}\left(#1|#2,#3\right)}
\newcommand{\gammaSamp}[2]{\mathcal{G}\left(#1,#2\right)}
\newcommand{\gaussianDist}[3]{\mathcal{N}\left(#1|#2,#3\right)}
\newcommand{\gaussianSamp}[2]{\mathcal{N}\left(#1,#2\right)}
\newcommand{\given}{|}
\newcommand{\half}{\frac{1}{2}}
\newcommand{\heaviside}{H}
\newcommand{\hiddenMatrix}{\mathbf{ \MakeUppercase{\hiddenScalar}}}
\newcommand{\hiddenScalar}{h}
\newcommand{\hiddenVector}{\mathbf{ \hiddenScalar}}
\newcommand{\identityMatrix}{\eye}
\newcommand{\inducingInputScalar}{z}
\newcommand{\inducingInputVector}{\mathbf{ \inducingInputScalar}}
\newcommand{\inducingInputMatrix}{\mathbf{Z}}
\newcommand{\inducingScalar}{u}
\newcommand{\inducingVector}{\mathbf{ \inducingScalar}}
\newcommand{\inducingMatrix}{\mathbf{U}}
\newcommand{\inlineDiff}[2]{\text{d}#1/\text{d}#2}
\newcommand{\inputDim}{q}
\newcommand{\inputMatrix}{\mathbf{X}}
\newcommand{\inputScalar}{x}
\newcommand{\inputSpace}{\mathcal{X}}
\newcommand{\inputVals}{\inputVector}
\newcommand{\inputVector}{\mathbf{ \inputScalar}}
\newcommand{\iterNum}{k}
\newcommand{\kernel}{\kernelScalar}
\newcommand{\kernelMatrix}{\mathbf{K}}
\newcommand{\kernelScalar}{k}
\newcommand{\kernelVector}{\mathbf{ \kernelScalar}}
\newcommand{\kff}{\kernelScalar_{\mappingFunction \mappingFunction}}
\newcommand{\kfu}{\kernelVector_{\mappingFunction \inducingScalar}}
\newcommand{\kuf}{\kernelVector_{\inducingScalar \mappingFunction}}
\newcommand{\kuu}{\kernelVector_{\inducingScalar \inducingScalar}}
\newcommand{\lagrangeMultiplier}{\lambda}
\newcommand{\lagrangeMultiplierMatrix}{\boldsymbol{ \Lambda}}
\newcommand{\lagrangian}{L}
\newcommand{\laplacianFactor}{\mathbf{ \MakeUppercase{\laplacianFactorScalar}}}
\newcommand{\laplacianFactorScalar}{m}
\newcommand{\laplacianFactorVector}{\mathbf{ \laplacianFactorScalar}}
\newcommand{\laplacianMatrix}{\mathbf{L}}
\newcommand{\laplacianScalar}{\ell}
\newcommand{\laplacianVector}{\mathbf{ \ell}}
\newcommand{\latentDim}{q}
\newcommand{\latentDistanceMatrix}{\boldsymbol{ \Delta}}
\newcommand{\latentDistanceScalar}{\delta}
\newcommand{\latentDistanceVector}{\boldsymbol{ \delta}}
\newcommand{\latentForce}{f}
\newcommand{\latentFunction}{u}
\newcommand{\latentFunctionVector}{\mathbf{ \latentFunction}}
\newcommand{\latentFunctionMatrix}{\mathbf{ \MakeUppercase{\latentFunction}}}
\newcommand{\latentIndex}{j}
\newcommand{\latentScalar}{z}
\newcommand{\latentVector}{\mathbf{ \latentScalar}}
\newcommand{\latentMatrix}{\mathbf{Z}}
\newcommand{\learnRate}{\eta}
\newcommand{\lengthScale}{\ell}
\newcommand{\rbfWidth}{\ell}
\newcommand{\likelihoodBound}{\mathcal{L}}
\newcommand{\likelihoodFunction}{L}
\newcommand{\locationScalar}{\mu}
\newcommand{\locationVector}{\boldsymbol{ \locationScalar}}
\newcommand{\locationMatrix}{\mathbf{M}}
\newcommand{\variance}[1]{\text{var}\left( #1 \right)}
\newcommand{\mappingFunction}{f}
\newcommand{\mappingFunctionMatrix}{\mathbf{F}}
\newcommand{\mappingFunctionTwo}{g}
\newcommand{\mappingFunctionTwoMatrix}{\mathbf{G}}
\newcommand{\mappingFunctionTwoVector}{\mathbf{ \mappingFunctionTwo}}
\newcommand{\mappingFunctionVector}{\mathbf{ \mappingFunction}}
\newcommand{\scaleScalar}{s}
\newcommand{\mappingScalar}{w}
\newcommand{\mappingVector}{\mathbf{ \mappingScalar}}
\newcommand{\mappingMatrix}{\mathbf{W}}
\newcommand{\mappingScalarTwo}{v}
\newcommand{\mappingVectorTwo}{\mathbf{ \mappingScalarTwo}}
\newcommand{\mappingMatrixTwo}{\mathbf{V}}
\newcommand{\maxIters}{K}
\newcommand{\meanMatrix}{\mathbf{M}}
\newcommand{\meanScalar}{\mu}
\newcommand{\meanTwoMatrix}{\mathbf{M}}
\newcommand{\meanTwoScalar}{m}
\newcommand{\meanTwoVector}{\mathbf{ \meanTwoScalar}}
\newcommand{\meanVector}{\boldsymbol{ \meanScalar}}
\newcommand{\mrnaConcentration}{m}
\newcommand{\naturalFrequency}{\omega}
\newcommand{\neighborhood}[1]{\mathcal{N}\left( #1 \right)}
\newcommand{\neilurl}{http://inverseprobability.com/}
\newcommand{\noiseMatrix}{\boldsymbol{ E}}
\newcommand{\noiseScalar}{\epsilon}
\newcommand{\noiseVector}{\boldsymbol{ \epsilon}}
\newcommand{\norm}[1]{\left\Vert #1 \right\Vert}
\newcommand{\normalizedLaplacianMatrix}{\hat{\mathbf{L}}}
\newcommand{\normalizedLaplacianScalar}{\hat{\ell}}
\newcommand{\normalizedLaplacianVector}{\hat{\mathbf{ \ell}}}
\newcommand{\numActive}{m}
\newcommand{\numBasisFunc}{m}
\newcommand{\numComponents}{m}
\newcommand{\numComps}{K}
\newcommand{\numData}{n}
\newcommand{\numFeatures}{K}
\newcommand{\numHidden}{h}
\newcommand{\numInducing}{m}
\newcommand{\numLayers}{\ell}
\newcommand{\numNeighbors}{K}
\newcommand{\numSequences}{s}
\newcommand{\numSuccess}{s}
\newcommand{\numTasks}{m}
\newcommand{\numTime}{T}
\newcommand{\numTrials}{S}
\newcommand{\outputIndex}{j}
\newcommand{\paramVector}{\boldsymbol{ \theta}}
\newcommand{\parameterMatrix}{\boldsymbol{ \Theta}}
\newcommand{\parameterScalar}{\theta}
\newcommand{\parameterVector}{\boldsymbol{ \parameterScalar}}
\newcommand{\partDiff}[2]{\frac{\partial#1}{\partial#2}}
\newcommand{\precisionScalar}{j}
\newcommand{\precisionVector}{\mathbf{ \precisionScalar}}
\newcommand{\precisionMatrix}{\mathbf{J}}
\newcommand{\pseudotargetScalar}{\widetilde{y}}
\newcommand{\pseudotargetVector}{\mathbf{ \pseudotargetScalar}}
\newcommand{\pseudotargetMatrix}{\mathbf{ \widetilde{Y}}}
\newcommand{\rank}[1]{\text{rank}\left(#1\right)}
\newcommand{\rayleighDist}[2]{\mathcal{R}\left(#1|#2\right)}
\newcommand{\rayleighSamp}[1]{\mathcal{R}\left(#1\right)}
\newcommand{\responsibility}{r}
\newcommand{\rotationScalar}{r}
\newcommand{\rotationVector}{\mathbf{ \rotationScalar}}
\newcommand{\rotationMatrix}{\mathbf{R}}
\newcommand{\sampleCovScalar}{s}
\newcommand{\sampleCovVector}{\mathbf{ \sampleCovScalar}}
\newcommand{\sampleCovMatrix}{\mathbf{s}}
\newcommand{\scalarProduct}[2]{\left\langle{#1},{#2}\right\rangle}
\newcommand{\sign}[1]{\text{sign}\left(#1\right)}
\newcommand{\sigmoid}[1]{\sigma\left(#1\right)}
\newcommand{\singularvalue}{\ell}
\newcommand{\singularvalueMatrix}{\mathbf{L}}
\newcommand{\singularvalueVector}{\mathbf{l}}
\newcommand{\sorth}{\mathbf{u}}
\newcommand{\spar}{\lambda}
\newcommand{\trace}[1]{\text{tr}\left(#1\right)}
\newcommand{\BasalRate}{B}
\newcommand{\DampingCoefficient}{C}
\newcommand{\DecayRate}{D}
\newcommand{\Displacement}{X}
\newcommand{\LatentForce}{F}
\newcommand{\Mass}{M}
\newcommand{\Sensitivity}{S}
\newcommand{\basalRate}{b}
\newcommand{\dampingCoefficient}{c}
\newcommand{\mass}{m}
\newcommand{\sensitivity}{s}
\newcommand{\springScalar}{\kappa}
\newcommand{\springVector}{\boldsymbol{ \kappa}}
\newcommand{\springMatrix}{\boldsymbol{ \mathcal{K}}}
\newcommand{\tfConcentration}{p}
\newcommand{\tfDecayRate}{\delta}
\newcommand{\tfMrnaConcentration}{f}
\newcommand{\tfVector}{\mathbf{ \tfConcentration}}
\newcommand{\velocity}{v}
\newcommand{\sufficientStatsScalar}{g}
\newcommand{\sufficientStatsVector}{\mathbf{ \sufficientStatsScalar}}
\newcommand{\sufficientStatsMatrix}{\mathbf{G}}
\newcommand{\switchScalar}{s}
\newcommand{\switchVector}{\mathbf{ \switchScalar}}
\newcommand{\switchMatrix}{\mathbf{S}}
\newcommand{\tr}[1]{\text{tr}\left(#1\right)}
\newcommand{\loneNorm}[1]{\left\Vert #1 \right\Vert_1}
\newcommand{\ltwoNorm}[1]{\left\Vert #1 \right\Vert_2}
\newcommand{\onenorm}[1]{\left\vert#1\right\vert_1}
\newcommand{\twonorm}[1]{\left\Vert #1 \right\Vert}
\newcommand{\vScalar}{v}
\newcommand{\vVector}{\mathbf{v}}
\newcommand{\vMatrix}{\mathbf{V}}
\newcommand{\varianceDist}[2]{\text{var}_{#2}\left( #1 \right)}
% Already defined by latex
%\newcommand{\vec}{#1:}
\newcommand{\vecb}[1]{\left(#1\right):}
\newcommand{\weightScalar}{w}
\newcommand{\weightVector}{\mathbf{ \weightScalar}}
\newcommand{\weightMatrix}{\mathbf{W}}
\newcommand{\weightedAdjacencyMatrix}{\mathbf{A}}
\newcommand{\weightedAdjacencyScalar}{a}
\newcommand{\weightedAdjacencyVector}{\mathbf{ \weightedAdjacencyScalar}}
\newcommand{\onesVector}{\mathbf{1}}
\newcommand{\zerosVector}{\mathbf{0}}
$$
</div>
<p><img class="" src="../slides/diagrams/gp/rasmussen-williams-book.jpg" width="50%" align="" style="background:none; border:none; box-shadow:none;"></p>
<p><span class="citation">Rasmussen and Williams (2006)</span> is still one of the most important references on Gaussian process models. It is available freely online.</p>
<h2 id="what-is-machine-learning">What is Machine Learning?</h2>
<p>What is machine learning? At its most basic level machine learning is a combination of</p>
<p><span class="math display">\[ \text{data} + \text{model} \xrightarrow{\text{compute}} \text{prediction}\]</span></p>
<p>where <em>data</em> is our observations. They can be actively or passively acquired (meta-data). The <em>model</em> contains our assumptions, based on previous experience. That experience can be other data, it can come from transfer learning, or it can merely be our beliefs about the regularities of the universe. In humans our models include our inductive biases. The <em>prediction</em> is an action to be taken or a categorization or a quality score. The reason that machine learning has become a mainstay of artificial intelligence is the importance of predictions in artificial intelligence. The data and the model are combined through computation.</p>
<p>In practice we normally perform machine learning using two functions. To combine data with a model we typically make use of:</p>
<p><strong>a prediction function</strong> a function which is used to make the predictions. It includes our beliefs about the regularities of the universe, our assumptions about how the world works, e.g. smoothness, spatial similarities, temporal similarities.</p>
<p><strong>an objective function</strong> a function which defines the cost of misprediction. Typically it includes knowledge about the world's generating processes (probabilistic objectives) or the costs we pay for mispredictions (empirical risk minimization).</p>
<p>The combination of data and model through the prediction function and the objective function leads to a <em>learning algorithm</em>. The class of prediction functions and objective functions we can make use of is restricted by the algorithms they lead to. If the prediction function or the objective function are too complex, then it can be difficult to find an appropriate learning algorithm. Much of the academic field of machine learning is the quest for new learning algorithms that allow us to bring different types of models and data together.</p>
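As a minimal sketch of these two functions (a toy example of my own, not from the notes): a linear prediction function, a squared-error objective, and a crude grid search standing in for the learning algorithm that combines them. The data points and grid ranges are illustrative assumptions.

```python
import numpy as np

def prediction(x, m, c):
    """Prediction function: our assumption that y varies linearly with x."""
    return m * x + c

def objective(m, c, x, y):
    """Objective function: the cost of misprediction (squared error)."""
    return np.sum((y - prediction(x, m, c)) ** 2)

x = np.array([1.0, 3.0, 2.0])
y = np.array([3.0, 1.0, 2.5])

# a crude "learning algorithm": pick the parameters with the lowest cost
grid = [(m, c) for m in np.linspace(-2, 0, 41) for c in np.linspace(2, 6, 81)]
m_best, c_best = min(grid, key=lambda p: objective(p[0], p[1], x, y))
print(m_best, c_best)
```

In practice the grid search would be replaced by an optimizer suited to the objective, which is exactly the restriction described above: the prediction and objective functions must lead to a tractable algorithm.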
<p>A useful reference for state of the art in machine learning is the UK Royal Society Report, <a href="https://royalsociety.org/~/media/policy/projects/machine-learning/publications/machine-learning-report.pdf">Machine Learning: Power and Promise of Computers that Learn by Example</a>.</p>
<p>You can also check my blog post on <a href="http://inverseprobability.com/2017/07/17/what-is-machine-learning">"What is Machine Learning?"</a></p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">import</span> numpy <span class="im">as</span> np
<span class="im">import</span> matplotlib.pyplot <span class="im">as</span> plt
<span class="im">import</span> pods
<span class="im">import</span> teaching_plots <span class="im">as</span> plot
<span class="im">import</span> mlai</code></pre></div>
<h3 id="olympic-marathon-data">Olympic Marathon Data</h3>
<p>The first thing we will do is load a standard data set for regression modelling. The data consists of the pace of Olympic Gold Medal Marathon winners for the Olympics from 1896 to present. First we load in the data and plot.</p>
<h3 id="olympic-marathon-data-1">Olympic Marathon Data</h3>
<table>
<tr>
<td width="70%">
<ul>
<li><p>Gold medal times for Olympic Marathon since 1896.</p></li>
<li><p>Marathons before 1924 didn’t have a standardised distance.</p></li>
<li><p>Present results using pace per km.</p></li>
<li>In 1904 Marathon was badly organised leading to very slow times.</li>
</ul>
</td>
<td width="30%">
<img src="../slides/diagrams/Stephen_Kiprotich.jpg" alt="image" /> <small>Image from Wikimedia Commons <a href="http://bit.ly/16kMKHQ" class="uri">http://bit.ly/16kMKHQ</a></small>
</td>
</tr>
</table>
<object class="svgplot" align data="../slides/diagrams/datasets/olympic-marathon.svg">
</object>
<p>Things to notice about the data include the outlier in 1904; in that year the Olympics was in St Louis, USA. Organizational problems and challenges with dust kicked up by the cars following the race meant that participants got lost, and only a very few completed.</p>
<p>More recent years see more consistently quick marathons.</p>
<h3 id="overdetermined-system">Overdetermined System</h3>
<p>The challenge with a linear model is that it has two unknowns, <span class="math inline">\(m\)</span> and <span class="math inline">\(c\)</span>. Observing data allows us to write down a system of simultaneous linear equations. So, for example, if we observe two data points, the first with the input value <span class="math inline">\(\inputScalar_1 = 1\)</span> and the output value <span class="math inline">\(\dataScalar_1 =3\)</span>, and a second data point, <span class="math inline">\(\inputScalar = 3\)</span>, <span class="math inline">\(\dataScalar=1\)</span>, then we can write two simultaneous linear equations of the form:</p>
<p>point 1: <span class="math inline">\(\inputScalar = 1\)</span>, <span class="math inline">\(\dataScalar=3\)</span> <span class="math display">\[3 = m + c\]</span> point 2: <span class="math inline">\(\inputScalar = 3\)</span>, <span class="math inline">\(\dataScalar=1\)</span> <span class="math display">\[1 = 3m + c\]</span></p>
<p>The solution to these two simultaneous equations can be represented graphically as</p>
<object class="svgplot" align data="../slides/diagrams/ml/over_determined_system003.svg">
</object>
<center>
<em>The solution of two linear equations represented as the fit of a straight line through two data points</em>
</center>
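We can verify the graphical solution numerically; a quick sketch (not part of the original notes) using NumPy to solve the two simultaneous equations:

```python
import numpy as np

# each row encodes one equation: y = x*m + 1*c
A = np.array([[1.0, 1.0],   # point 1: 3 = 1*m + c
              [3.0, 1.0]])  # point 2: 1 = 3*m + c
y = np.array([3.0, 1.0])
m, c = np.linalg.solve(A, y)
print(m, c)  # m = -1.0, c = 4.0
```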
<p>The challenge comes when a third data point is observed and it doesn't naturally fit on the straight line.</p>
<p>point 3: <span class="math inline">\(\inputScalar = 2\)</span>, <span class="math inline">\(\dataScalar=2.5\)</span> <span class="math display">\[2.5 = 2m + c\]</span></p>
<object class="svgplot" align data="../slides/diagrams/ml/over_determined_system004.svg">
</object>
<center>
<em>A third observation of data is inconsistent with the solution dictated by the first two observations </em>
</center>
<p>Now there are three candidate lines, each consistent with our data.</p>
<object class="svgplot" align data="../slides/diagrams/ml/over_determined_system007.svg">
</object>
<center>
<em>Three solutions to the problem, each consistent with two points of the three observations </em>
</center>
<p>This is known as an <em>overdetermined</em> system because there are more data than we need to determine our parameters. The problem arises because the model is a simplification of the real world, and the data we observe is therefore inconsistent with our model.</p>
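The three candidate lines can be computed directly: each pair of the three observations yields its own exact solution for <em>(m, c)</em>, and the three solutions disagree. A sketch (not in the original notes):

```python
import numpy as np
from itertools import combinations

points = [(1.0, 3.0), (3.0, 1.0), (2.0, 2.5)]
# each pair of points gives an exactly determined 2x2 system for (m, c)
for (x1, y1), (x2, y2) in combinations(points, 2):
    A = np.array([[x1, 1.0], [x2, 1.0]])
    m, c = np.linalg.solve(A, np.array([y1, y2]))
    print(f"m = {m:.2f}, c = {c:.2f}")
```

No single pair's line passes through the third point, which is the overdetermination in concrete form.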
<p>The solution was proposed by Pierre-Simon Laplace. His idea was to accept that the model was an incomplete representation of the real world, and the manner in which it was incomplete is <em>unknown</em>. His idea was that such unknowns could be dealt with through probability.</p>
<img class="" src="../slides/diagrams/ml/Pierre-Simon_Laplace.png" width="30%" align="center" style="background:none; border:none; box-shadow:none;"> {
<center>
<em>Pierre Simon Laplace </em>
</center>
<p><a href="https://play.google.com/books/reader?id=1YQPAAAAQAAJ&pg=PR17-IA2"><img src="../slides/diagrams/books/1YQPAAAAQAAJ-PR17-IA2.png" /></a></p>
<p>Famously, Laplace considered the idea of a deterministic Universe, one in which the model is <em>known</em>, or as the below translation refers to it, "an intelligence which could comprehend all the forces by which nature is animated". He speculates on an "intelligence" that can submit this vast data to analysis and proposes that such an entity would be able to predict the future.</p>
<blockquote>
<p>Given for one instant an intelligence which could comprehend all the forces by which nature is animated and the respective situation of the beings who compose it---an intelligence sufficiently vast to submit these data to analysis---it would embrace in the same formula the movements of the greatest bodies of the universe and those of the lightest atom; for it, nothing would be uncertain and the future, as the past, would be present in its eyes.</p>
</blockquote>
<p>This notion is known as <em>Laplace's demon</em> or <em>Laplace's superman</em>.</p>
<p><img class="" src="../slides/diagrams/laplacesDeterminismEnglish.png" width="60%" align="center" style="background:none; border:none; box-shadow:none;"></p>
<p>Unfortunately, most analyses of his ideas stop at that point, whereas his real point is that such a notion is unreachable. Not so much <em>superman</em> as <em>strawman</em>. Just three pages later in the "Philosophical Essay on Probabilities" <span class="citation">(Laplace, 1814)</span>, Laplace goes on to observe:</p>
<blockquote>
<p>The curve described by a simple molecule of air or vapor is regulated in a manner just as certain as the planetary orbits; the only difference between them is that which comes from our ignorance.</p>
<p>Probability is relative, in part to this ignorance, in part to our knowledge.</p>
</blockquote>
<p><a href="https://play.google.com/books/reader?id=1YQPAAAAQAAJ&pg=PR17-IA4"><img src="../slides/diagrams/books/1YQPAAAAQAAJ-PR17-IA4.png" /></a></p>
<p><img class="" src="../slides/diagrams/philosophicaless00lapliala.png" width="60%" align="center" style="background:none; border:none; box-shadow:none;"></p>
<p>In other words, we can never make use of the idealistic deterministic Universe because of our ignorance about the world. Laplace's suggestion, and his focus in this essay, is that we turn to probability to deal with this uncertainty. This is also our inspiration for using probability in machine learning.</p>
<p>The "forces by which nature is animated" is our <em>model</em>, the "situation of beings that compose it" is our <em>data</em> and the "intelligence sufficiently vast enough to submit these data to analysis" is our compute. The fly in the ointment is our <em>ignorance</em> about these aspects. And <em>probability</em> is the tool we use to incorporate this ignorance leading to uncertainty or <em>doubt</em> in our predictions.</p>
<p>Laplace's concept was that the reason that the data doesn't match up to the model is because of unconsidered factors, and that these might be well represented through probability densities. He tackles the challenge of the unknown factors by adding a variable, <span class="math inline">\(\noiseScalar\)</span>, that represents the unknown. In modern parlance we would call this a <em>latent</em> variable. But in the context Laplace uses it, the variable is so common that it has other names such as a "slack" variable or the <em>noise</em> in the system.</p>
<p>point 1: <span class="math inline">\(\inputScalar = 1\)</span>, <span class="math inline">\(\dataScalar=3\)</span> <span class="math display">\[
3 = m + c + \noiseScalar_1
\]</span> point 2: <span class="math inline">\(\inputScalar = 3\)</span>, <span class="math inline">\(\dataScalar=1\)</span> <span class="math display">\[
1 = 3m + c + \noiseScalar_2
\]</span> point 3: <span class="math inline">\(\inputScalar = 2\)</span>, <span class="math inline">\(\dataScalar=2.5\)</span> <span class="math display">\[
2.5 = 2m + c + \noiseScalar_3
\]</span></p>
<p>Laplace's trick has converted the <em>overdetermined</em> system into an <em>underdetermined</em> system. He has now added three variables, <span class="math inline">\(\{\noiseScalar_i\}_{i=1}^3\)</span>, which represent the unknown corruptions of the real world. Laplace's idea is that we should represent that unknown corruption with a <em>probability distribution</em>.</p>
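<p>As a concrete check, the three equations above can be solved in the least squares sense, which picks the slope and intercept minimising the summed squares of the slack variables. A minimal numpy sketch, using only the three points from the text (the variable names are ours):</p>

```python
import numpy as np

# The three observations from the text: inputs x and targets y.
x = np.array([1.0, 3.0, 2.0])
y = np.array([3.0, 1.0, 2.5])

# Design matrix for the model y = m*x + c: one column for m, one for c.
Phi = np.column_stack([x, np.ones_like(x)])

# Least squares chooses m and c to minimise the summed squared slack variables.
theta, _, _, _ = np.linalg.lstsq(Phi, y, rcond=None)
m, c = theta

# The slack (noise) variables are the residuals at the solution.
epsilon = y - Phi @ theta
```

Because the model includes an intercept, the residuals sum to zero at the least squares solution.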
<h3 id="a-probabilistic-process">A Probabilistic Process</h3>
<p>However, it was left to an admirer of Laplace to develop a practical probability density for that purpose. It was Carl Friedrich Gauss who suggested that the <em>Gaussian</em> density (which at the time was unnamed!) should be used to represent this error.</p>
<p>The result is a <em>noisy</em> function, a function which has a deterministic part, and a stochastic part. This type of function is sometimes known as a probabilistic or stochastic process, to distinguish it from a deterministic process.</p>
<h3 id="the-gaussian-density">The Gaussian Density</h3>
<p>The Gaussian density is perhaps the most commonly used probability density. It is defined by a <em>mean</em>, <span class="math inline">\(\meanScalar\)</span>, and a <em>variance</em>, <span class="math inline">\(\dataStd^2\)</span>. The variance is taken to be the square of the <em>standard deviation</em>, <span class="math inline">\(\dataStd\)</span>.</p>
<p><span class="math display">\[\begin{align}
p(\dataScalar| \meanScalar, \dataStd^2) & = \frac{1}{\sqrt{2\pi\dataStd^2}}\exp\left(-\frac{(\dataScalar - \meanScalar)^2}{2\dataStd^2}\right)\\& \buildrel\triangle\over = \gaussianDist{\dataScalar}{\meanScalar}{\dataStd^2}
\end{align}\]</span></p>
<object class="svgplot" align data="../slides/diagrams/ml/gaussian_of_height.svg">
</object>
<center>
<em>The Gaussian PDF with <span class="math inline">\({\meanScalar}=1.7\)</span> and variance <span class="math inline">\({\dataStd}^2=0.0225\)</span>. Mean shown as red line. It could represent the heights of a population of students. </em>
</center>
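<p>The density is straightforward to evaluate directly from the formula. A short numpy sketch, using the mean and variance from the figure (the function name is ours):</p>

```python
import numpy as np

def gaussian_pdf(y, mu, sigma2):
    """The Gaussian density with mean mu and variance sigma2."""
    return np.exp(-(y - mu)**2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

# The height example from the figure: mean 1.7 m, variance 0.0225.
mu, sigma2 = 1.7, 0.0225
peak = gaussian_pdf(mu, mu, sigma2)   # the density is highest at the mean

# Numerically confirm the density integrates to (almost exactly) one.
grid = np.linspace(mu - 1.0, mu + 1.0, 20001)
area = gaussian_pdf(grid, mu, sigma2).sum() * (grid[1] - grid[0])
```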
<h3 id="two-important-gaussian-properties">Two Important Gaussian Properties</h3>
<p>The Gaussian density has many important properties, but for the moment we'll review two of them.</p>
<h3 id="sum-of-gaussians">Sum of Gaussians</h3>
<p>If we assume that a variable, <span class="math inline">\(\dataScalar_i\)</span>, is sampled from a Gaussian density,</p>
<p><span class="math display">\[\dataScalar_i \sim \gaussianSamp{\meanScalar_i}{\sigma_i^2}\]</span></p>
<p>then we can show that the sum of a set of variables, each drawn independently from such a density, is also distributed as Gaussian. The mean of the resulting density is the sum of the means, and the variance is the sum of the variances,</p>
<p><span class="math display">\[\sum_{i=1}^{\numData} \dataScalar_i \sim \gaussianSamp{\sum_{i=1}^\numData \meanScalar_i}{\sum_{i=1}^\numData \sigma_i^2}\]</span></p>
<p>Since we are very familiar with the Gaussian density and its properties, it is not immediately apparent how unusual this is. Most random variables, when you add them together, change the family of density they are drawn from; the Gaussian is exceptional in this regard. Indeed, other random variables, if they are independently drawn and summed together, tend to a Gaussian density. That is the <a href="https://en.wikipedia.org/wiki/Central_limit_theorem"><em>central limit theorem</em></a>, which is a major justification for the use of a Gaussian density.</p>
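<p>The sum property is easy to verify empirically. A small Monte Carlo sketch (the means and variances are arbitrary choices for illustration):</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Three independent Gaussians with different means and variances.
means = np.array([1.0, -2.0, 0.5])
variances = np.array([0.5, 2.0, 1.0])

# Sample each variable many times and sum across the three.
samples = rng.normal(means, np.sqrt(variances), size=(100000, 3))
total = samples.sum(axis=1)

# The sum should have mean sum(means) = -0.5 and variance sum(variances) = 3.5.
```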
<h3 id="scaling-a-gaussian">Scaling a Gaussian</h3>
<p>Less unusual is the <em>scaling</em> property of a Gaussian density. If a variable, <span class="math inline">\(\dataScalar\)</span>, is sampled from a Gaussian density,</p>
<p><span class="math display">\[\dataScalar \sim \gaussianSamp{\meanScalar}{\sigma^2}\]</span> and we choose to scale that variable by a <em>deterministic</em> value, <span class="math inline">\(\mappingScalar\)</span>, then the <em>scaled variable</em> is distributed as</p>
<p><span class="math display">\[\mappingScalar \dataScalar \sim \gaussianSamp{\mappingScalar\meanScalar}{\mappingScalar^2 \sigma^2}.\]</span> Unlike the summing property, where adding two or more random variables independently sampled from a family of densities typically brings the summed variable <em>outside</em> that family, scaling leaves the scaled variable in the same <em>family</em> of densities. Indeed, many densities include a <em>scale</em> parameter (e.g. the <a href="https://en.wikipedia.org/wiki/Gamma_distribution">Gamma density</a>) which is purely for this purpose. In the Gaussian the standard deviation, <span class="math inline">\(\dataStd\)</span>, is the scale parameter. To see why this makes sense, let's consider <span class="math display">\[z \sim \gaussianSamp{0}{1},\]</span> then if we scale by <span class="math inline">\(\dataStd\)</span> so that we have <span class="math inline">\(\dataScalar=\dataStd z\)</span>, we can write <span class="math display">\[\dataScalar =\dataStd z \sim \gaussianSamp{0}{\dataStd^2}.\]</span></p>
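<p>The scaling property can be checked the same way. A short sketch with arbitrary illustrative values:</p>

```python
import numpy as np

rng = np.random.default_rng(1)

mu, sigma2, w = 2.0, 0.25, 3.0
y = rng.normal(mu, np.sqrt(sigma2), size=200000)

# Scaling by the deterministic value w scales the mean by w and the
# variance by w**2: w*y ~ N(w*mu, w**2 * sigma2) = N(6.0, 2.25).
scaled = w * y
```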
<h3 id="regression-examples">Regression Examples</h3>
<p>Regression involves predicting a real value, <span class="math inline">\(\dataScalar_i\)</span>, given an input vector, <span class="math inline">\(\inputVector_i\)</span>. For example, the Tecator data involves predicting the quality of meat given spectral measurements. Or in radiocarbon dating, the C14 calibration curve maps from radiocarbon age to age measured through a back-trace of tree rings. Regression has also been used to predict the quality of board game moves given expert rated training data.</p>
<h2 id="underdetermined-system">Underdetermined System</h2>
<p>What about the situation where you have more parameters than data in your system of simultaneous equations? This is known as an <em>underdetermined</em> system. In fact, this setup is in some sense <em>easier</em> to solve, because we don't need to think about introducing a slack variable (although it might make a lot of sense from a <em>modelling</em> perspective to do so).</p>
<p>The way Laplace proposed resolving an overdetermined system was to introduce slack variables, <span class="math inline">\(\noiseScalar_i\)</span>, which needed to be estimated for each point. The slack variable represented the difference between our actual prediction and the true observation; this is known as the <em>residual</em>. By introducing the slack variables we have an additional <span class="math inline">\(\numData\)</span> variables to estimate, one for each data point, <span class="math inline">\(\{\noiseScalar_i\}\)</span>. Together with the original <span class="math inline">\(m\)</span> and <span class="math inline">\(c\)</span>, this gives us <span class="math inline">\(\numData+2\)</span> parameters to be estimated from <span class="math inline">\(\numData\)</span> observations, which turns the overdetermined system into an <em>underdetermined</em> one. However, we then made a probabilistic assumption about the slack variables: we assumed that they were distributed according to a probability density. And for the moment we have been assuming that density was the Gaussian, <span class="math display">\[\noiseScalar_i \sim \gaussianSamp{0}{\dataStd^2},\]</span> with zero mean and variance <span class="math inline">\(\dataStd^2\)</span>.</p>
<p>The follow-up question is whether we can do the same thing with the parameters. If we have two parameters and only one observation, can we place a probability distribution over the parameters, as we did with the slack variables? The answer is yes.</p>
<h3 id="underdetermined-system-1">Underdetermined System</h3>
<object class="svgplot" align data="../slides/diagrams/ml/under_determined_system009.svg">
</object>
<center>
<em>Fit underdetermined system by considering uncertainty </em>
</center>
<p>Classically, there are two types of uncertainty that we consider. The first is known as <em>aleatoric</em> uncertainty. This is uncertainty we couldn't resolve even if we wanted to. An example would be the result of a football match before it's played, or where a sheet of paper lands on the floor.</p>
<p>The second is known as <em>epistemic</em> uncertainty. This is uncertainty that we could, in principle, resolve. We just haven't yet made the observation. For example, the result of a football match <em>after</em> it is played, or the color of socks that a lecturer is wearing.</p>
<p>Note that there isn't a clean difference between the two. It is arguable that if we knew enough about a football match, or the physics of a falling sheet of paper, then we might be able to resolve the uncertainty. The reason we can't is that <em>chaotic</em> behaviour means that a very small change in any of the initial conditions we would need to resolve can have a large effect downstream. By this argument, the only truly aleatoric uncertainty might be quantum uncertainty. However, in practice the distinction is often applied.</p>
<p>In classical statistics, the frequentist approach only treats <em>aleatoric</em> uncertainty with probability. The key philosophical difference in the <em>Bayesian</em> approach is to treat any unknowns through probability. This approach was formally justified separately by <span class="citation">Cox (1946)</span> and <span class="citation">Finetti (1937)</span>.</p>
<p>The term Bayesian was a mocking term promoted by Fisher; it comes from the use, by Bayes, of a billiard table formulation to justify the Bernoulli distribution. Bayes considers a ball landing uniformly at random between two sides of a billiard table. He then considers the outcome of the Bernoulli as being whether a second ball comes to rest to the right or left of the original. In this way, the parameter of his Bernoulli distribution is a <em>stochastic variable</em> (the uncertainty in the parameter is aleatoric). In contrast, when Bernoulli formulates the distribution he considers a bag of red and black balls. The parameter of his Bernoulli is the ratio of red balls to total balls, a deterministic variable.</p>
<p>Note how this relates to Laplace's demon. Laplace describes the deterministic universe ("... for it nothing would be uncertain and the future, as the past, would be present in its eyes"), but acknowledges the impossibility of achieving this in practice ("... the curve described by a simple molecule of air or vapor is regulated in a manner just as certain as the planetary orbits; the only difference between them is that which comes from our ignorance. <em>Probability</em> is relative in part to this ignorance, in part to our knowledge ...").</p>
<h3 id="prior-distribution">Prior Distribution</h3>
<p>The tradition in Bayesian inference is to place a probability density over the parameters of interest in your model. This choice is made regardless of whether you generally believe those parameters to be stochastic or deterministic in origin. In other words, to a Bayesian, the modelling treatment does not differentiate between epistemic and aleatoric uncertainty. For linear regression we could consider the following Gaussian prior on the intercept parameter, <span class="math display">\[c \sim \gaussianSamp{0}{\alpha_1}\]</span> where <span class="math inline">\(\alpha_1\)</span> is the variance of the prior distribution, its mean being zero.</p>
<h3 id="posterior-distribution">Posterior Distribution</h3>
<p>The prior distribution is combined with the likelihood of the data given the parameters <span class="math inline">\(p(\dataScalar|c)\)</span> to give the posterior via <em>Bayes' rule</em>, <span class="math display">\[
p(c|\dataScalar) = \frac{p(\dataScalar|c)p(c)}{p(\dataScalar)}
\]</span> where <span class="math inline">\(p(\dataScalar)\)</span> is the marginal probability of the data, obtained through integration over the joint density, <span class="math inline">\(p(\dataScalar, c)=p(\dataScalar|c)p(c)\)</span>. Overall the equation can be summarized as, <span class="math display">\[
\text{posterior} = \frac{\text{likelihood}\times \text{prior}}{\text{marginal likelihood}}.
\]</span></p>
<object class="svgplot" align data="../slides/diagrams/ml/dem_gaussian003.svg">
</object>
<center>
<em>Combining a Gaussian likelihood with a Gaussian prior to form a Gaussian posterior </em>
</center>
<p>Another way of seeing what's going on is to note that the numerator of Bayes' rule merely multiplies the likelihood by the prior. The denominator, is not a function of <span class="math inline">\(c\)</span>. So the functional form is entirely determined by the multiplication of prior and likelihood. This has the effect of ensuring that the posterior only has probability mass in regions where both the prior and the likelihood have probability mass.</p>
<p>The marginal likelihood, <span class="math inline">\(p(\dataScalar)\)</span>, operates to ensure that the distribution is normalised.</p>
<p>For the Gaussian case, the normalisation of the posterior can be performed analytically. This is because both the prior and the likelihood have the form of an <em>exponentiated quadratic</em>, <span class="math display">\[
\exp(a^2)\exp(b^2) = \exp(a^2 + b^2),
\]</span> and the properties of the exponential mean that the product of two exponentiated quadratics is also an exponentiated quadratic. That implies that the posterior is also Gaussian, because a normalized exponentiated quadratic is a Gaussian distribution.<a href="#fn1" class="footnoteRef" id="fnref1"><sup>1</sup></a></p>
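<p>To make this concrete, consider a deliberately simplified model in which the intercept is the only parameter, <span class="math inline">\(\dataScalar_i = c + \noiseScalar_i\)</span>, with known noise variance. Multiplying the Gaussian prior by the Gaussian likelihood then gives the posterior in closed form; the numbers below are illustrative assumptions:</p>

```python
import numpy as np

# A deliberately simple model, y_i = c + noise_i, so c is the only parameter.
y = np.array([3.0, 1.0, 2.5])
sigma2 = 1.0   # assumed (known) noise variance
alpha1 = 4.0   # prior variance: c ~ N(0, alpha1)

# Multiplying exponentiated quadratics: the precisions add,
# and the posterior mean is the precision-weighted data fit.
posterior_precision = 1.0 / alpha1 + len(y) / sigma2
posterior_var = 1.0 / posterior_precision
posterior_mean = posterior_var * y.sum() / sigma2
```

Note how the posterior variance is necessarily smaller than the prior variance: the likelihood can only add precision.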
<p>For general Bayesian inference, over more than one parameter, we need <em>multivariate priors</em>. For example, consider the multivariate linear regression where an observation, <span class="math inline">\(\dataScalar_i\)</span> is related to a vector of features, <span class="math inline">\(\inputVector_{i, :}\)</span>, through a vector of parameters, <span class="math inline">\(\weightVector\)</span>, <span class="math display">\[\dataScalar_i = \sum_j \weightScalar_j \inputScalar_{i, j} + \noiseScalar_i,\]</span> or in vector notation, <span class="math display">\[\dataScalar_i = \weightVector^\top \inputVector_{i, :} + \noiseScalar_i.\]</span> Here we've dropped the intercept for convenience; it can be reintroduced by augmenting the feature vector, <span class="math inline">\(\inputVector_{i, :}\)</span>, with a constant valued feature.</p>
<p>This motivates the need for a <em>multivariate</em> Gaussian density.</p>
<h3 id="multivariate-regression-likelihood">Multivariate Regression Likelihood</h3>
<ul>
<li>Noise corrupted data point <span class="math display">\[\dataScalar_i = \weightVector^\top \inputVector_{i, :} + {\noiseScalar}_i\]</span></li>
</ul>
<div class="incremental">
<ul>
<li>Multivariate regression likelihood: <span class="math display">\[p(\dataVector| \inputMatrix, \weightVector) = \frac{1}{\left(2\pi {\dataStd}^2\right)^{\numData/2}} \exp\left(-\frac{1}{2{\dataStd}^2}\sum_{i=1}^{\numData}\left(\dataScalar_i - \weightVector^\top \inputVector_{i, :}\right)^2\right)\]</span></li>
</ul>
</div>
<div class="incremental">
<ul>
<li>Now use a <em>multivariate</em> Gaussian prior: <span class="math display">\[p(\weightVector) = \frac{1}{\left(2\pi \alpha\right)^\frac{\dataDim}{2}} \exp \left(-\frac{1}{2\alpha} \weightVector^\top \weightVector\right)\]</span></li>
</ul>
</div>
<h3 id="two-dimensional-gaussian">Two Dimensional Gaussian</h3>
<p>Consider the distribution of height (in meters) of an adult male human population. We will approximate the marginal density of heights as a Gaussian density with mean given by <span class="math inline">\(1.7\text{m}\)</span> and a standard deviation of <span class="math inline">\(0.15\text{m}\)</span>, implying a variance of <span class="math inline">\(\dataStd^2=0.0225\)</span>, <span class="math display">\[
p(h) \sim \gaussianSamp{1.7}{0.0225}.
\]</span> Similarly, we assume that the weights of the population are distributed as a Gaussian density with a mean of <span class="math inline">\(75\text{ kg}\)</span> and a standard deviation of <span class="math inline">\(6\text{ kg}\)</span> (implying a variance of 36), <span class="math display">\[
p(w) \sim \gaussianSamp{75}{36}.
\]</span></p>
<object class="svgplot" align data="../slides/diagrams/ml/height_weight_gaussian.svg">
</object>
<center>
<em>Gaussian distributions for height and weight. </em>
</center>
<h3 id="independence-assumption">Independence Assumption</h3>
<p>First of all, we make an independence assumption, we assume that height and weight are independent. The definition of probabilistic independence is that the joint density, <span class="math inline">\(p(w, h)\)</span>, factorizes into its marginal densities, <span class="math display">\[
p(w, h) = p(w)p(h).
\]</span> Given this assumption we can sample from the joint distribution by independently sampling weights and heights.</p>
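<p>A short sketch of this independent sampling, using the marginal densities above; under independence the sample covariance between the two variables should be close to zero:</p>

```python
import numpy as np

rng = np.random.default_rng(2)

# Independent sampling from the two marginals defined above.
n = 50000
h = rng.normal(1.7, 0.15, size=n)   # heights ~ N(1.7, 0.0225)
w = rng.normal(75.0, 6.0, size=n)   # weights ~ N(75, 36)

# Under independence the sample covariance between h and w is near zero.
cov_hw = np.cov(h, w)[0, 1]
```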
<object class="svgplot" align data="../slides/diagrams/ml/independent_height_weight007.svg">
</object>
<center>
<em>Samples from independent Gaussian variables that might represent heights and weights. </em>
</center>
<p>In reality height and weight are <em>not</em> independent. Taller people tend on average to be heavier, and heavier people are likely to be taller. This is reflected by the <em>body mass index</em>, a ratio suggested by one of the fathers of statistics, Adolphe Quetelet. Quetelet was interested in the notion of the <em>average man</em> and collected various statistics about people. He defined the BMI to be, <span class="math display">\[
\text{BMI} = \frac{w}{h^2}.
\]</span> To deal with this dependence we now introduce the notion of <em>correlation</em> to the multivariate Gaussian density.</p>
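<p>The ratio itself is simple to compute; for example, at the population means used above:</p>

```python
def bmi(weight_kg, height_m):
    """Quetelet's body mass index: weight divided by height squared."""
    return weight_kg / height_m**2

# At the population means used above, 75 kg and 1.7 m.
population_mean_bmi = bmi(75.0, 1.7)
```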
<h3 id="sampling-two-dimensional-variables">Sampling Two Dimensional Variables</h3>
<object class="svgplot" align data="../slides/diagrams/ml/correlated_height_weight007.svg">
</object>
<center>
<em>Samples from </em>correlated<em> Gaussian variables that might represent heights and weights. </em>
</center>
<h3 id="independent-gaussians">Independent Gaussians</h3>
<p><span class="math display">\[
p(w, h) = p(w)p(h)
\]</span></p>
<p><span class="math display">\[
p(w, h) = \frac{1}{\sqrt{2\pi \dataStd_1^2}\sqrt{2\pi\dataStd_2^2}} \exp\left(-\frac{1}{2}\left(\frac{(w-\meanScalar_1)^2}{\dataStd_1^2} + \frac{(h-\meanScalar_2)^2}{\dataStd_2^2}\right)\right)
\]</span></p>
<p><span class="math display">\[
p(w, h) = \frac{1}{\sqrt{2\pi\dataStd_1^22\pi\dataStd_2^2}} \exp\left(-\frac{1}{2}\left(\begin{bmatrix}w \\ h\end{bmatrix} - \begin{bmatrix}\meanScalar_1 \\ \meanScalar_2\end{bmatrix}\right)^\top\begin{bmatrix}\dataStd_1^2& 0\\0&\dataStd_2^2\end{bmatrix}^{-1}\left(\begin{bmatrix}w \\ h\end{bmatrix} - \begin{bmatrix}\meanScalar_1 \\ \meanScalar_2\end{bmatrix}\right)\right)
\]</span></p>
<p><span class="math display">\[
p(\dataVector) = \frac{1}{\det{2\pi \mathbf{D}}^{\frac{1}{2}}} \exp\left(-\frac{1}{2}(\dataVector - \meanVector)^\top\mathbf{D}^{-1}(\dataVector - \meanVector)\right)
\]</span></p>
<h3 id="correlated-gaussian">Correlated Gaussian</h3>
<p>Form the correlated density from the original independent one by rotating the data space using the matrix <span class="math inline">\(\rotationMatrix\)</span>.</p>
<p><span class="math display">\[
p(\dataVector) = \frac{1}{\det{2\pi\mathbf{D}}^{\frac{1}{2}}} \exp\left(-\frac{1}{2}(\dataVector - \meanVector)^\top\mathbf{D}^{-1}(\dataVector - \meanVector)\right)
\]</span></p>
<p><span class="math display">\[
p(\dataVector) = \frac{1}{\det{2\pi\mathbf{D}}^{\frac{1}{2}}} \exp\left(-\frac{1}{2}(\rotationMatrix^\top\dataVector - \rotationMatrix^\top\meanVector)^\top\mathbf{D}^{-1}(\rotationMatrix^\top\dataVector - \rotationMatrix^\top\meanVector)\right)
\]</span></p>
<p><span class="math display">\[
p(\dataVector) = \frac{1}{\det{2\pi\mathbf{D}}^{\frac{1}{2}}} \exp\left(-\frac{1}{2}(\dataVector - \meanVector)^\top\rotationMatrix\mathbf{D}^{-1}\rotationMatrix^\top(\dataVector - \meanVector)\right)
\]</span> this gives a covariance matrix: <span class="math display">\[
\covarianceMatrix^{-1} = \rotationMatrix \mathbf{D}^{-1} \rotationMatrix^\top
\]</span></p>
<p><span class="math display">\[
p(\dataVector) = \frac{1}{\det{2\pi\covarianceMatrix}^{\frac{1}{2}}} \exp\left(-\frac{1}{2}(\dataVector - \meanVector)^\top\covarianceMatrix^{-1} (\dataVector - \meanVector)\right)
\]</span> this gives a covariance matrix: <span class="math display">\[
\covarianceMatrix = \rotationMatrix \mathbf{D} \rotationMatrix^\top
\]</span></p>
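<p>We can check these identities numerically. A sketch with an arbitrary rotation angle and diagonal variances:</p>

```python
import numpy as np

# A rotation matrix R and an axis-aligned (diagonal) covariance D.
theta = 0.6
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
D = np.diag([4.0, 0.25])

# Rotating the independent density gives the correlated covariance,
# C = R D R^T, with matching inverse C^{-1} = R D^{-1} R^T.
C = R @ D @ R.T
C_inv = R @ np.linalg.inv(D) @ R.T
```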
<p>Let's first of all review the properties of the multivariate Gaussian distribution that make linear Gaussian models easier to deal with. We'll return to the, perhaps surprising, result on the parameters within the nonlinearity, <span class="math inline">\(\parameterVector\)</span>, shortly.</p>
<p>To work with linear Gaussian models, to find the marginal likelihood all you need to know is the following rules. If <span class="math display">\[
\dataVector = \mappingMatrix \inputVector + \noiseVector,
\]</span> where <span class="math inline">\(\dataVector\)</span>, <span class="math inline">\(\inputVector\)</span> and <span class="math inline">\(\noiseVector\)</span> are vectors and we assume that <span class="math inline">\(\inputVector\)</span> and <span class="math inline">\(\noiseVector\)</span> are drawn from multivariate Gaussians, <span class="math display">\[\begin{align}
\inputVector & \sim \gaussianSamp{\meanVector}{\covarianceMatrix}\\
\noiseVector & \sim \gaussianSamp{\zerosVector}{\covarianceMatrixTwo}
\end{align}\]</span> then we know that <span class="math inline">\(\dataVector\)</span> is also drawn from a multivariate Gaussian with, <span class="math display">\[
\dataVector \sim \gaussianSamp{\mappingMatrix\meanVector}{\mappingMatrix\covarianceMatrix\mappingMatrix^\top + \covarianceMatrixTwo}.
\]</span></p>
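<p>This rule is easy to confirm by sampling. A sketch with small, arbitrary choices of the map, mean and covariances:</p>

```python
import numpy as np

rng = np.random.default_rng(3)

W = np.array([[1.0, 2.0],
              [0.5, -1.0],
              [0.0, 1.5]])   # maps 2-d x to 3-d y
mu = np.array([1.0, -1.0])
C = np.array([[1.0, 0.3],
              [0.3, 0.5]])
Sigma = 0.1 * np.eye(3)

# The rule: y = W x + noise is Gaussian with these moments.
mean_y = W @ mu
cov_y = W @ C @ W.T + Sigma

# Empirical check by sampling x and noise and forming y.
x = rng.multivariate_normal(mu, C, size=200000)
noise = rng.multivariate_normal(np.zeros(3), Sigma, size=200000)
y = x @ W.T + noise
```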
<p>With appropriately defined covariance, <span class="math inline">\(\covarianceMatrixTwo\)</span>, this is actually the marginal likelihood for Factor Analysis, or Probabilistic Principal Component Analysis <span class="citation">(Tipping and Bishop, 1999)</span>, because we integrated out the inputs (or <em>latent</em> variables, as they would be called in that case).</p>
<p>However, we are focussing on what happens in models which are non-linear in the inputs, whereas the above would be <em>linear</em> in the inputs. To consider these, we introduce a matrix, called the design matrix. We set each activation function computed at each data point to be <span class="math display">\[
\activationScalar_{i,j} = \activationScalar(\mappingVector^{(1)}_{j}, \inputVector_{i})
\]</span> and define the matrix of activations (known as the <em>design matrix</em> in statistics) to be, <span class="math display">\[
\activationMatrix =
\begin{bmatrix}
\activationScalar_{1, 1} & \activationScalar_{1, 2} & \dots & \activationScalar_{1, \numHidden} \\
\activationScalar_{2, 1} & \activationScalar_{2, 2} & \dots & \activationScalar_{2, \numHidden} \\
\vdots & \vdots & \ddots & \vdots \\
\activationScalar_{\numData, 1} & \activationScalar_{\numData, 2} & \dots & \activationScalar_{\numData, \numHidden}
\end{bmatrix}.
\]</span> By convention this matrix always has <span class="math inline">\(\numData\)</span> rows and <span class="math inline">\(\numHidden\)</span> columns. We also define the vector of all noise corruptions, <span class="math inline">\(\noiseVector = \left[\noiseScalar_1, \dots, \noiseScalar_\numData\right]^\top\)</span>.</p>
<p>If we define the prior distribution over the vector <span class="math inline">\(\mappingVector\)</span> to be Gaussian, <span class="math display">\[
\mappingVector \sim \gaussianSamp{\zerosVector}{\alpha\eye},
\]</span></p>
<p>then we can use rules of multivariate Gaussians to see that, <span class="math display">\[
\dataVector \sim \gaussianSamp{\zerosVector}{\alpha \activationMatrix \activationMatrix^\top + \dataStd^2 \eye}.
\]</span></p>
<p>In other words, our training data is distributed as a multivariate Gaussian, with zero mean and a covariance given by <span class="math display">\[
\kernelMatrix = \alpha \activationMatrix \activationMatrix^\top + \dataStd^2 \eye.
\]</span></p>
<p>This is an <span class="math inline">\(\numData \times \numData\)</span> size matrix. Its elements are in the form of a function. The maths shows that any element, indexed by <span class="math inline">\(i\)</span> and <span class="math inline">\(j\)</span>, is a function <em>only</em> of the inputs associated with data points <span class="math inline">\(i\)</span> and <span class="math inline">\(j\)</span>, <span class="math inline">\(\inputVector_i\)</span>, <span class="math inline">\(\inputVector_j\)</span>: <span class="math inline">\(\kernel_{i,j} = \kernel\left(\inputVector_i, \inputVector_j\right)\)</span>.</p>
<p>If we look at the portion of this function associated only with <span class="math inline">\(\mappingFunction(\cdot)\)</span>, i.e. we remove the noise, then we can write down the covariance associated with our neural network, <span class="math display">\[
\kernel_\mappingFunction\left(\inputVector_i, \inputVector_j\right) = \alpha \activationVector\left(\mappingMatrix_1, \inputVector_i\right)^\top \activationVector\left(\mappingMatrix_1, \inputVector_j\right)
\]</span> so the elements of the covariance or <em>kernel</em> matrix are formed by inner products of the rows of the <em>design matrix</em>.</p>
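<p>A sketch of this construction, using a hypothetical ReLU activation for the basis functions (any nonlinearity would do):</p>

```python
import numpy as np

rng = np.random.default_rng(4)

def relu(z):
    """A hypothetical choice of activation for the basis functions."""
    return np.maximum(z, 0.0)

n, d, h = 5, 2, 10
X = rng.normal(size=(n, d))      # inputs
W1 = rng.normal(size=(d, h))     # first-layer weights
alpha, sigma2 = 1.0, 0.01

Phi = relu(X @ W1)               # the n x h design matrix

# Covariance of the training data after integrating out the output weights.
K = alpha * Phi @ Phi.T + sigma2 * np.eye(n)

# Each element depends only on the rows of Phi for points i and j.
k_01 = alpha * Phi[0] @ Phi[1]
```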
<h3 id="gaussian-process">Gaussian Process</h3>
<p>This is the essence of a Gaussian process. Instead of making i.i.d. assumptions about our density over each data point, <span class="math inline">\(\dataScalar_i\)</span>, we make a joint Gaussian assumption over our data. The covariance matrix is now a function of both the parameters of the activation function, <span class="math inline">\(\mappingMatrixTwo\)</span>, and the input variables, <span class="math inline">\(\inputMatrix\)</span>. This comes about through integrating out the parameters of the model, <span class="math inline">\(\mappingVector\)</span>.</p>
<h3 id="basis-functions">Basis Functions</h3>
<p>We can basically put anything inside the basis functions, and many people do. These can be deep kernels <span class="citation">(Cho and Saul, 2009)</span> or we can learn the parameters of a convolutional neural network inside there.</p>
<p>Viewing a neural network in this way is also what allows us to perform sensible <em>batch</em> normalizations <span class="citation">(Ioffe and Szegedy, 2015)</span>.</p>
<h3 id="bayesian-inference-by-rejection-sampling">Bayesian Inference by Rejection Sampling</h3>
<p>One view of Bayesian inference is to assume we are given a mechanism for generating samples, where we assume that mechanism represents an accurate view of the way we believe the world works.</p>
<p>This mechanism is known as our <em>prior</em> belief.</p>
<p>We combine our prior belief with our observations of the real world by discarding all those samples that are inconsistent with our prior. The <em>likelihood</em> defines mathematically what we mean by inconsistent with the prior. The higher the noise level in the likelihood, the looser the notion of consistent.</p>
<p>The samples that remain are considered to be samples from the <em>posterior</em>.</p>
<p>This approach to Bayesian inference is closely related to two sampling techniques known as <em>rejection sampling</em> and <em>importance sampling</em>. It is realized in practice in an approach known as <em>approximate Bayesian computation</em> (ABC) or likelihood-free inference.</p>
<p>In practice, the algorithm is often too slow to be practical, because most samples will be inconsistent with the data and as a result the mechanism has to be operated many times to obtain a few posterior samples.</p>
<p>However, in the Gaussian process case, when the likelihood also assumes Gaussian noise, we can operate this mechanism mathematically, and obtain the posterior density <em>analytically</em>. This is the benefit of Gaussian processes.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">import</span> numpy <span class="im">as</span> np
<span class="im">from</span> mlai <span class="im">import</span> compute_kernel
<span class="im">from</span> mlai <span class="im">import</span> exponentiated_quadratic</code></pre></div>
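<p>A minimal sketch of this rejection mechanism. To keep it self-contained we define a stand-in exponentiated quadratic covariance rather than relying on mlai's, and the observations are hypothetical:</p>

```python
import numpy as np

rng = np.random.default_rng(5)

def eq_cov(X, X2):
    """Exponentiated quadratic covariance (a stand-in for mlai's version)."""
    r2 = (X[:, None, 0] - X2[None, :, 0])**2
    return np.exp(-0.5 * r2)

# Two observed points we want prior samples to be consistent with.
x_obs = np.array([[0.0], [1.0]])
y_obs = np.array([0.5, -0.3])
noise_std = 0.3

# Draw functions from the GP prior at the observed inputs and keep
# only those landing within two noise standard deviations of the data.
K = eq_cov(x_obs, x_obs) + 1e-8 * np.eye(2)
prior_samples = rng.multivariate_normal(np.zeros(2), K, size=20000)
keep = np.all(np.abs(prior_samples - y_obs) < 2 * noise_std, axis=1)
posterior_samples = prior_samples[keep]
```

Most draws are rejected, which is exactly why the analytic route is preferred when it is available.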
<object class="svgplot" align data="../slides/diagrams/gp/gp_rejection_sample003.svg">
</object>
<object class="svgplot" align data="../slides/diagrams/gp/gp_rejection_sample004.svg">
</object>
<object class="svgplot" align data="../slides/diagrams/gp/gp_rejection_sample005.svg">
</object>
<center>
<em>One view of Bayesian inference is we have a machine for generating samples (the </em>prior<em>), and we discard all samples inconsistent with our data, leaving the samples of interest (the </em>posterior<em>). The Gaussian process allows us to do this analytically. </em>
</center>
<h3 id="sampling-a-function">Sampling a Function</h3>
<p>We will consider a Gaussian distribution with a particular structure of covariance matrix. We will generate <em>one</em> sample from a 25-dimensional Gaussian density, <span class="math display">\[
\mappingFunctionVector=\left[\mappingFunction_{1},\mappingFunction_{2},\dots,\mappingFunction_{25}\right].
\]</span> In the figure below we plot these data on the <span class="math inline">\(y\)</span>-axis against their <em>indices</em> on the <span class="math inline">\(x\)</span>-axis.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">from</span> mlai <span class="im">import</span> Kernel
<span class="im">from</span> mlai <span class="im">import</span> polynomial_cov
<span class="im">from</span> mlai <span class="im">import</span> exponentiated_quadratic</code></pre></div>
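<p>A sketch of how such a sample can be generated, again with a stand-in exponentiated quadratic covariance (the lengthscale of 4 is an arbitrary choice):</p>

```python
import numpy as np

rng = np.random.default_rng(6)

def eq_cov(x, x2, lengthscale=4.0):
    """Exponentiated quadratic covariance (a stand-in for mlai's version)."""
    return np.exp(-0.5 * (x[:, None] - x2[None, :])**2 / lengthscale**2)

# The 25 indices play the role of the input locations.
index = np.arange(25, dtype=float)
K = eq_cov(index, index) + 1e-8 * np.eye(25)   # jitter for numerical stability

# One draw from the 25-dimensional Gaussian: a smoothly varying "function".
f = rng.multivariate_normal(np.zeros(25), K)
```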
<object class="svgplot" align data="../slides/diagrams/gp/two_point_sample008.svg">
</object>
<center>
<em>A 25 dimensional correlated random variable (values plotted against index) </em>
</center>
<object class="svgplot" align data="../slides/diagrams/gp/two_point_sample012.svg">
</object>
<center>
<em>The joint Gaussian over <span class="math inline">\(\mappingFunction_1\)</span> and <span class="math inline">\(\mappingFunction_2\)</span> along with the conditional distribution of <span class="math inline">\(\mappingFunction_2\)</span> given <span class="math inline">\(\mappingFunction_1\)</span> </em>
</center>
<h3 id="uluru">Uluru</h3>
<p><img class="" src="../slides/diagrams/gp/799px-Uluru_Panorama.jpg" width="" align="center" style="background:none; border:none; box-shadow:none;"> When viewing these contour plots, I sometimes find it helpful to think of Uluru, the prominent rock formation in Australia. The rock rises above the surface of the plain, just like a probability density rising above the zero line. The rock is three dimensional, but when we view Uluru from the classical position, we are looking at one side of it. This is equivalent to viewing the marginal density.</p>
<p>The joint density can be viewed from above, using contours. The conditional density is equivalent to <em>slicing</em> the rock. Uluru is a sacred rock, so this has to be an imaginary slice. Imagine cutting down through the rock with a vertical plane orthogonal to our viewpoint (i.e. one that cuts across our line of sight). This would give a profile of the rock which, when renormalized, gives us the conditional distribution; the value we condition on determines the location of the slice along the direction we are facing.</p>
<h3 id="prediction-with-correlated-gaussians">Prediction with Correlated Gaussians</h3>
<p>Of course in practice, rather than manipulating mountains physically, the advantage of the Gaussian density is that we can perform these manipulations mathematically.</p>
<p>Prediction of <span class="math inline">\(\mappingFunction_2\)</span> given <span class="math inline">\(\mappingFunction_1\)</span> requires the <em>conditional density</em>, <span class="math inline">\(p(\mappingFunction_2|\mappingFunction_1)\)</span>. Another remarkable property of the Gaussian density is that this conditional distribution is <em>also</em> guaranteed to be a Gaussian density. It has the form, <span class="math display">\[
p(\mappingFunction_2|\mappingFunction_1) = \gaussianDist{\mappingFunction_2}{\frac{\kernelScalar_{1, 2}}{\kernelScalar_{1, 1}}\mappingFunction_1}{ \kernelScalar_{2, 2} - \frac{\kernelScalar_{1,2}^2}{\kernelScalar_{1,1}}}
\]</span>where we have assumed that the covariance of the original joint density was given by <span class="math display">\[
\kernelMatrix = \begin{bmatrix} \kernelScalar_{1, 1} & \kernelScalar_{1, 2}\\ \kernelScalar_{2, 1} & \kernelScalar_{2, 2}\end{bmatrix}.
\]</span></p>
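<p>These conditional formulae are straightforward to evaluate numerically. The following is a minimal numpy sketch with illustrative covariance values (it is not code from the talk):</p>

```python
import numpy as np

# Illustrative joint covariance of f_1 and f_2 (values assumed for the example).
K = np.array([[1.0, 0.9],
              [0.9, 1.0]])

f_1 = 0.5  # the value we condition on

# Conditional mean and variance of p(f_2 | f_1), following the formula above.
cond_mean = K[0, 1] / K[0, 0] * f_1
cond_var = K[1, 1] - K[0, 1]**2 / K[0, 0]
```

<p>The stronger the correlation <span class="math inline">\(\kernelScalar_{1,2}\)</span>, the smaller the conditional variance becomes.</p>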
<p>Using these formulae we can determine the conditional density for any of the elements of our vector <span class="math inline">\(\mappingFunctionVector\)</span>. For example, the variable <span class="math inline">\(\mappingFunction_8\)</span> is less correlated with <span class="math inline">\(\mappingFunction_1\)</span> than <span class="math inline">\(\mappingFunction_2\)</span> is. If we consider this variable we see that the conditional density is more diffuse.</p>
<object class="svgplot" align data="../slides/diagrams/gp/two_point_sample013.svg">
</object>
<object class="svgplot" align data="../slides/diagrams/gp/two_point_sample017.svg">
</object>
<center>
<em>The joint Gaussian over <span class="math inline">\(\mappingFunction_1\)</span> and <span class="math inline">\(\mappingFunction_8\)</span> along with the conditional distribution of <span class="math inline">\(\mappingFunction_8\)</span> given <span class="math inline">\(\mappingFunction_1\)</span> </em>
</center>
<h3 id="where-did-this-covariance-matrix-come-from">Where Did This Covariance Matrix Come From?</h3>
<p><span class="math display">\[
k(\inputVector, \inputVector^\prime) = \alpha \exp\left(-\frac{\left\Vert \inputVector - \inputVector^\prime\right\Vert^2_2}{2\lengthScale^2}\right)\]</span></p>
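<p>The entrywise fill shown in the following figures can be reproduced directly. This is a numpy sketch of the exponentiated quadratic above, not mlai's <code>exponentiated_quadratic</code> itself:</p>

```python
import numpy as np

def eq_cov(x, x_prime, alpha=1.0, lengthscale=1.0):
    """Exponentiated quadratic covariance between two input points."""
    r2 = np.sum((np.asarray(x) - np.asarray(x_prime))**2)
    return alpha * np.exp(-r2 / (2 * lengthscale**2))

# Fill the covariance matrix entry by entry from the covariance function.
X = np.array([[-1.0], [0.0], [1.5]])
K = np.array([[eq_cov(xi, xj) for xj in X] for xi in X])
```

<p>The resulting matrix is symmetric with unit diagonal (for <span class="math inline">\(\alpha = 1\)</span>), and entries decay as the inputs move apart.</p>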
<object class="svgplot" align data="../slides/diagrams/kern/computing_eq_three_covariance016.svg">
</object>
<center>
<em>Entrywise fill in of the covariance matrix from the covariance function. </em>
</center>
<object class="svgplot" align data="../slides/diagrams/kern/computing_eq_four_covariance027.svg">
</object>
<center>
<em>Entrywise fill in of the covariance matrix from the covariance function. </em>
</center>
<object class="svgplot" align data="../slides/diagrams/kern/computing_eq_three_2_covariance016.svg">
</object>
<center>
<em>Entrywise fill in of the covariance matrix from the covariance function. </em>
</center>
<h3 id="polynomial-covariance">Polynomial Covariance</h3>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">from</span> mlai <span class="im">import</span> polynomial_cov</code></pre></div>
<center>
<span class="math display">\[\kernelScalar(\inputVector, \inputVector^\prime) = \alpha(w \inputVector^\top\inputVector^\prime + b)^d\]</span>
</center>
<br>
<table>
<tr>
<td width="45%">
<object class align data="../slides/diagrams/kern/polynomial_covariance.svg">
</object>
<td width="45%">
<img class="negate" src="../slides/diagrams/kern/polynomial_covariance.gif" width="100%" align="center" style="background:none; border:none; box-shadow:none;">
</td>
</tr>
</table>
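<p>A numpy sketch of the polynomial covariance above (the argument names here are illustrative, not necessarily mlai's exact <code>polynomial_cov</code> signature):</p>

```python
import numpy as np

def polynomial_cov(x, x_prime, alpha=1.0, w=1.0, b=1.0, degree=2):
    """Polynomial covariance k(x, x') = alpha * (w x^T x' + b)^d."""
    return alpha * (w * np.dot(x, x_prime) + b)**degree

k = polynomial_cov(np.array([1.0, 2.0]), np.array([3.0, 4.0]))
```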
<h3 id="brownian-covariance">Brownian Covariance</h3>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">from</span> mlai <span class="im">import</span> brownian_cov</code></pre></div>
<p>Brownian motion is also a Gaussian process. It follows a Gaussian random walk, with diffusion occurring at each time point driven by a Gaussian input. This implies it is both Markov and Gaussian. The covariance function for Brownian motion has the form <span class="math display">\[
\kernelScalar(t, t^\prime) = \alpha \min(t, t^\prime)
\]</span></p>
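<p>A sketch of the Brownian covariance and a sample path drawn from the implied Gaussian process (the function name mirrors mlai's <code>brownian_cov</code> but the signature is assumed; a small jitter term is added for numerical stability):</p>

```python
import numpy as np

def brownian_cov(t, t_prime, alpha=1.0):
    """Brownian motion covariance k(t, t') = alpha * min(t, t')."""
    return alpha * np.minimum(t, t_prime)

# Build the covariance over a grid of times and draw one sample path.
t = np.linspace(0.01, 2.0, 100)
K = brownian_cov(t[:, None], t[None, :])
rng = np.random.default_rng(0)
sample = rng.multivariate_normal(np.zeros(len(t)), K + 1e-9 * np.eye(len(t)))
```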
<!--<table><tr><td width="50%">
<object class="svgplot" align="" data="../slides/diagrams/kern/brownian_covariance.svg"></object>
</td><td width="50%">
<iframe src="../slides/diagrams/kern/brownian_covariance.html" width="512" height="384" allowtransparency="true" frameborder="0">
</iframe>
</td></tr></table>
<center>*The covariance of Brownian motion, and some samples from the covariance showing the functional form. *</center>-->
<table>
<tr>
<td width="45%">
<object class align data="../slides/diagrams/kern/brownian_covariance.svg">
</object>
</td>
<td width="45%">
<img class="negate" src="../slides/diagrams/kern/brownian_covariance.gif" width="100%" align="center" style="background:none; border:none; box-shadow:none;">
</td>
</tr>
</table>
<h3 id="periodic-covariance">Periodic Covariance</h3>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">from</span> mlai <span class="im">import</span> periodic_cov</code></pre></div>
<center>
<span class="math display">\[\kernelScalar(\inputVector, \inputVector^\prime) = \alpha\exp\left(\frac{-2\sin(\pi rw)^2}{\lengthScale^2}\right)\]</span>
</center>
<br>
<table>
<tr>
<td width="45%">
<object class align data="../slides/diagrams/kern/periodic_covariance.svg">
</object>
<td width="45%">
<img class="negate" src="../slides/diagrams/kern/periodic_covariance.gif" width="100%" align="center" style="background:none; border:none; box-shadow:none;">
</td>
</tr>
</table>
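<p>A numpy sketch of the periodic covariance, reading <span class="math inline">\(r\)</span> as the distance between the inputs and <span class="math inline">\(w\)</span> as the frequency (this reading is an assumption; the names do not necessarily match mlai's <code>periodic_cov</code>):</p>

```python
import numpy as np

def periodic_cov(x, x_prime, alpha=1.0, w=1.0, lengthscale=1.0):
    # r is taken to be the distance between the inputs (assumption).
    r = np.abs(x - x_prime)
    return alpha * np.exp(-2 * np.sin(np.pi * r * w)**2 / lengthscale**2)
```

<p>Inputs a whole period apart are perfectly correlated, while inputs half a period apart are the least correlated.</p>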
<p>Any fixed set of basis functions can also be incorporated into a covariance function. For example, an RBF network is a type of neural network built from a set of radial basis functions, meaning each basis function is radially symmetric. These basis functions take the form, <span class="math display">\[
\basisFunction_k(\inputScalar) = \exp\left(-\frac{\ltwoNorm{\inputScalar-\meanScalar_k}^{2}}{\lengthScale^{2}}\right).
\]</span> Given a set of parameters, <span class="math display">\[
\meanVector = \begin{bmatrix} -1 \\ 0 \\ 1\end{bmatrix},
\]</span> we can construct the corresponding covariance function, which has the form, <span class="math display">\[
\kernelScalar\left(\inputVals,\inputVals^{\prime}\right)=\alpha\basisVector(\inputVals)^\top \basisVector(\inputVals^\prime).
\]</span></p>
<h3 id="basis-function-covariance">Basis Function Covariance</h3>
<p>The fixed basis function covariance just comes from the properties of a multivariate Gaussian. If we define <span class="math display">\[
\mappingFunctionVector=\basisMatrix\mappingVector
\]</span> and then we assume <span class="math display">\[
\mappingVector \sim \gaussianSamp{\zerosVector}{\alpha\eye}
\]</span> then it follows from the properties of a multivariate Gaussian that <span class="math display">\[
\mappingFunctionVector \sim \gaussianSamp{\zerosVector}{\alpha\basisMatrix\basisMatrix^\top}
\]</span> meaning that the vector of function values is jointly distributed as a Gaussian process with covariance matrix <span class="math inline">\(\kernelMatrix = \alpha\basisMatrix \basisMatrix^\top\)</span>. Each element of the covariance matrix can then be found as the inner product between two rows of the basis function matrix.</p>
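<p>This construction is easy to check numerically. Below is a minimal numpy sketch using the three basis centres given above (the helper <code>radial</code> and its arguments are illustrative, not mlai's exact function):</p>

```python
import numpy as np

def radial(X, centres, lengthscale=1.0):
    """Radially symmetric (RBF) basis functions evaluated at inputs X."""
    return np.exp(-(X[:, None] - centres[None, :])**2 / lengthscale**2)

X = np.linspace(-2.0, 2.0, 5)
mu = np.array([-1.0, 0.0, 1.0])   # basis centres from the text
alpha = 1.0
Phi = radial(X, mu)               # the basis function matrix
K = alpha * Phi @ Phi.T           # K = alpha * Phi Phi^T
```

<p>Each entry of <span class="math inline">\(\kernelMatrix\)</span> is the inner product of the corresponding two rows of the basis function matrix, scaled by <span class="math inline">\(\alpha\)</span>.</p>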
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">from</span> mlai <span class="im">import</span> basis_cov</code></pre></div>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">from</span> mlai <span class="im">import</span> radial</code></pre></div>
<center>
<span class="math display">\[\kernel(\inputVector, \inputVector^\prime) = \basisVector(\inputVector)^\top \basisVector(\inputVector^\prime)\]</span>
</center>
<br>
<table>
<tr>
<td width="45%">
<object class align data="../slides/diagrams/kern/basis_covariance.svg">
</object>
<td width="45%">
<img class="negate" src="../slides/diagrams/kern/basis_covariance.gif" width="100%" align="center" style="background:none; border:none; box-shadow:none;">
</td>
</tr>
</table>
<center>
<em>A covariance function based on a non-linear basis given by <span class="math inline">\(\basisVector(\inputVector)\)</span>. </em>
</center>
<h3 id="selecting-number-and-location-of-basis">Selecting Number and Location of Basis</h3>
<p>In practice for a basis function model we need to choose both:</p>
<ol>
<li>the location of the basis functions,</li>
<li>the number of basis functions.</li>
</ol>
<p>One very clever way of finessing this problem is to choose to have <em>infinite</em> basis functions and place them <em>everywhere</em>. To show how this is possible, we will consider a one dimensional system, <span class="math inline">\(\inputScalar\)</span>, which should give the intuition of how to do this. However, these ideas also extend to multidimensional systems as shown in, for example, <span class="citation">Williams (n.d.)</span> and <span class="citation">Neal (1994)</span>.</p>
<p>We consider a one dimensional set up with exponentiated quadratic basis functions, <span class="math display">\[
\basisFunction_k(\inputScalar_i) = \exp\left(-\frac{\ltwoNorm{\inputScalar_i - \locationScalar_k}^2}{2\rbfWidth^2}\right)
\]</span></p>
<p>To place these basis functions, we first define the basis function centers in terms of a starting point on the left of our input, <span class="math inline">\(a\)</span>, and a finishing point, <span class="math inline">\(b\)</span>. The gap between basis is given by <span class="math inline">\(\Delta\locationScalar\)</span>. The location of each basis is then given by <span class="math display">\[\locationScalar_k = a+\Delta\locationScalar\cdot (k-1).\]</span> The covariance function can then be given as <span class="math display">\[
\kernelScalar\left(\inputScalar_i,\inputScalar_j\right) = \sum_{k=1}^\numBasisFunc \basisFunction_k(\inputScalar_i)\basisFunction_k(\inputScalar_j)
\]</span> <span class="math display">\[\begin{aligned}
\kernelScalar\left(\inputScalar_i,\inputScalar_j\right) = &\alpha^\prime\Delta\locationScalar \sum_{k=1}^{\numBasisFunc} \exp\Bigg(
-\frac{\inputScalar_i^2 + \inputScalar_j^2}{2\rbfWidth^2}\\
& + \frac{2\left(a+\Delta\locationScalar\cdot (k-1)\right)
\left(\inputScalar_i+\inputScalar_j\right) - 2\left(a+\Delta\locationScalar \cdot (k-1)\right)^2}{2\rbfWidth^2} \Bigg)
\end{aligned}\]</span> where we've also scaled the variance of the process by <span class="math inline">\(\Delta\locationScalar\)</span>.</p>
<p>A consequence of our definition is that the first and last basis function locations are given by <span class="math display">\[
\locationScalar_1=a \ \text{and}\ \locationScalar_\numBasisFunc=b \ \text{so}\ b= a+ \Delta\locationScalar\cdot(\numBasisFunc-1)
\]</span> This implies that the distance between <span class="math inline">\(b\)</span> and <span class="math inline">\(a\)</span> is given by <span class="math display">\[
b-a = \Delta\locationScalar (\numBasisFunc -1)
\]</span> and since the basis functions are separated by <span class="math inline">\(\Delta\locationScalar\)</span> the number of basis functions is given by <span class="math display">\[
\numBasisFunc = \frac{b-a}{\Delta \locationScalar} + 1
\]</span> The next step is to take the limit as <span class="math inline">\(\Delta\locationScalar\rightarrow 0\)</span> so that <span class="math inline">\(\numBasisFunc \rightarrow \infty\)</span>; the sum over basis locations then becomes an integral, where we have used <span class="math inline">\(a + \Delta\locationScalar\cdot(k-1)\rightarrow \locationScalar\)</span>.</p>
<p>Performing the integration gives <span class="math display">\[\begin{aligned}
\kernelScalar(\inputScalar_i,&\inputScalar_j) = \alpha^\prime \sqrt{\pi\rbfWidth^2}
\exp\left( -\frac{\left(\inputScalar_i-\inputScalar_j\right)^2}{4\rbfWidth^2}\right)\\ &\times
\frac{1}{2}\left[\text{erf}\left(\frac{\left(b - \frac{1}{2}\left(\inputScalar_i +
\inputScalar_j\right)\right)}{\rbfWidth} \right)-
\text{erf}\left(\frac{\left(a - \frac{1}{2}\left(\inputScalar_i +
\inputScalar_j\right)\right)}{\rbfWidth} \right)\right],
\end{aligned}\]</span>Now we take the limit as <span class="math inline">\(a\rightarrow -\infty\)</span> and <span class="math inline">\(b\rightarrow \infty\)</span> <span class="math display">\[\kernelScalar\left(\inputScalar_i,\inputScalar_j\right) = \alpha\exp\left(
-\frac{\left(\inputScalar_i-\inputScalar_j\right)^2}{4\rbfWidth^2}\right).\]</span> where <span class="math inline">\(\alpha=\alpha^\prime \sqrt{\pi\rbfWidth^2}\)</span>.</p>
<p>In conclusion, an RBF model with infinite basis functions is a Gaussian process with the exponentiated quadratic covariance function <span class="math display">\[\kernelScalar\left(\inputScalar_i,\inputScalar_j\right) = \alpha \exp\left(
-\frac{\left(\inputScalar_i-\inputScalar_j\right)^2}{4\rbfWidth^2}\right).\]</span></p>
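<p>The agreement between the finite sum of basis functions and this limiting covariance can be verified numerically. Here is a sketch with <span class="math inline">\(\alpha^\prime = 1\)</span> and a wide, densely spaced grid of basis locations standing in for the infinite limit:</p>

```python
import numpy as np

ell = 1.0            # basis width
a, b = -10.0, 10.0   # wide enough to approximate the infinite limit
dmu = 0.01           # small gap between basis locations
mu = np.arange(a, b, dmu)

def phi(x):
    """Exponentiated quadratic basis functions centred on the grid mu."""
    return np.exp(-(x - mu)**2 / (2 * ell**2))

x_i, x_j = 0.3, -0.7
finite_sum = dmu * np.sum(phi(x_i) * phi(x_j))  # scaled sum over bases
limit = np.sqrt(np.pi * ell**2) * np.exp(-(x_i - x_j)**2 / (4 * ell**2))
# finite_sum and limit should agree closely
```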
<p>Note that while the functional form of the basis function and the covariance function are similar here, in general if we repeat this analysis for other basis functions the covariance function will have a very different form. For example the error function, <span class="math inline">\(\text{erf}(\cdot)\)</span>, results in an <span class="math inline">\(\arcsin(\cdot)\)</span> form. See <span class="citation">Williams (n.d.)</span> for more details.</p>
<h3 id="mlp-covariance">MLP Covariance</h3>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">from</span> mlai <span class="im">import</span> mlp_cov</code></pre></div>
<p>The multi-layer perceptron (MLP) covariance, also known as the neural network covariance or the arcsin covariance, is derived by considering the infinite limit of a neural network.</p>
<center>
<span class="math display">\[\kernelScalar(\inputVector, \inputVector^\prime) = \alpha \arcsin\left(\frac{w \inputVector^\top \inputVector^\prime + b}{\sqrt{\left(w \inputVector^\top \inputVector + b + 1\right)\left(w \left.\inputVector^\prime\right.^\top \inputVector^\prime + b + 1\right)}}\right)\]</span>
</center>
<br>
<table>
<tr>
<td width="45%">
<object class align data="../slides/diagrams/kern/mlp_covariance.svg">
</object>
<td width="45%">
<img class="negate" src="../slides/diagrams/kern/mlp_covariance.gif" width="100%" align="center" style="background:none; border:none; box-shadow:none;">
</td>
</tr>
</table>
<center>
<em>The multi-layer perceptron covariance function. This is derived by considering the infinite limit of a neural network with probit activation functions. </em>
</center>
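<p>A scalar-input numpy sketch of the MLP covariance formula above (parameter names are illustrative, not mlai's exact <code>mlp_cov</code> signature):</p>

```python
import numpy as np

def mlp_cov(x, x_prime, alpha=1.0, w=1.0, b=1.0):
    """Arcsin (MLP) covariance for scalar inputs."""
    num = w * x * x_prime + b
    den = np.sqrt((w * x**2 + b + 1.0) * (w * x_prime**2 + b + 1.0))
    return alpha * np.arcsin(num / den)
```

<p>Because the argument of <span class="math inline">\(\arcsin\)</span> lies in <span class="math inline">\([-1, 1]\)</span>, the covariance is bounded, in contrast to the polynomial covariance.</p>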
<!--include{_kern/includes/relu-covariance.md}-->
<h3 id="sinc-covariance">Sinc Covariance</h3>
<p>Another approach to developing covariance functions exploits Bochner's theorem <span class="citation">Bochner (1959)</span>. Bochner's theorem tells us that any positive filter in Fourier space has an associated Gaussian process with a stationary covariance function. The covariance function is the <em>inverse Fourier transform</em> of the filter applied in Fourier space.</p>
<p>For example, in signal processing, <em>band limitations</em> are commonly applied as an assumption: we may believe that no frequency above <span class="math inline">\(w=2\)</span> exists in the signal. This is equivalent to applying a rectangle function as the filter in Fourier space.</p>
<p>The inverse Fourier transform of the rectangle function is the <span class="math inline">\(\text{sinc}(\cdot)\)</span> function. So the sinc is a valid covariance function, and it represents <em>band limited</em> signals.</p>
<p>Note that other covariance functions we've introduced can also be interpreted in this way. For example, the exponentiated quadratic covariance function can be Fourier transformed to see what the implied filter in Fourier space is. The Fourier transform of the exponentiated quadratic is itself an exponentiated quadratic, so the standard EQ covariance implies an EQ filter in Fourier space.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">from</span> mlai <span class="im">import</span> sinc_cov</code></pre></div>
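<p>A sketch of the sinc covariance using numpy's normalized sinc, <span class="math inline">\(\text{sinc}(t) = \sin(\pi t)/(\pi t)\)</span> (the parameterisation in terms of the cut-off frequency <span class="math inline">\(w\)</span> is an assumption, not necessarily mlai's <code>sinc_cov</code>):</p>

```python
import numpy as np

def sinc_cov(x, x_prime, alpha=1.0, w=2.0):
    """Sinc covariance: inverse Fourier transform of a rectangular
    band-limiting filter with cut-off frequency w."""
    r = x - x_prime
    return alpha * np.sinc(w * r)  # np.sinc(t) = sin(pi t) / (pi t)
```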
<div id="refs" class="references">
<div id="ref-Bochner:book59">
<p>Bochner, S., 1959. Lectures on fourier integrals. Princeton University Press.</p>
</div>
<div id="ref-Cho:deep09">
<p>Cho, Y., Saul, L.K., 2009. Kernel methods for deep learning, in: Bengio, Y., Schuurmans, D., Lafferty, J.D., Williams, C.K.I., Culotta, A. (Eds.), Advances in Neural Information Processing Systems 22. Curran Associates, Inc., pp. 342–350.</p>
</div>
<div id="ref-Cox:probability46">
<p>Cox, R.T., 1946. Probability, frequency and reasonable expectation. American Journal of Physics 14, 1–13.</p>
</div>
<div id="ref-deFinetti:prevision37">
<p>Finetti, B. de, 1937. La prévision: Ses lois logiques, ses sources subjectives. Annales de l’Institut Henri Poincaré 7, 1–68.</p>
</div>
<div id="ref-Ioffe:batch15">
<p>Ioffe, S., Szegedy, C., 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift, in: Bach, F., Blei, D. (Eds.), Proceedings of the 32nd International Conference on Machine Learning, Proceedings of Machine Learning Research. PMLR, Lille, France, pp. 448–456.</p>
</div>
<div id="ref-Laplace:essai14">
<p>Laplace, P.S., 1814. Essai philosophique sur les probabilités, 2nd ed. Courcier, Paris.</p>
</div>
<div id="ref-Neal:thesis94">
<p>Neal, R.M., 1994. Bayesian learning for neural networks (PhD thesis). University of Toronto, Canada.</p>
</div>
<div id="ref-Rasmussen:book06">
<p>Rasmussen, C.E., Williams, C.K.I., 2006. Gaussian processes for machine learning. MIT Press, Cambridge, MA.</p>
</div>
<div id="ref-Tipping:probpca99">
<p>Tipping, M.E., Bishop, C.M., 1999. Probabilistic principal component analysis. Journal of the Royal Statistical Society, B 61, 611–622. <a href="https://doi.org/10.1111/1467-9868.00196" class="uri">https://doi.org/10.1111/1467-9868.00196</a></p>
</div>
<div id="ref-Williams:infinite96">
<p>Williams, C.K.I., n.d. Computing with infinite networks.</p>
</div>
</div>
<div class="footnotes">
<hr />
<ol>
<li id="fn1"><p>Note that not all exponentiated quadratics can be normalized; to do so, the coefficient associated with the squared variable, <span class="math inline">\(\dataScalar^2\)</span>, must be strictly positive.<a href="#fnref1">↩</a></p></li>
</ol>
</div>
Mon, 03 Sep 2018 00:00:00 +0000
http://inverseprobability.com/talks/notes/gpss-session-1.html
http://inverseprobability.com/talks/notes/gpss-session-1.htmlnotesProbabilistic Machine Learning<div style="display:none">
$$\newcommand{\Amatrix}{\mathbf{A}}
\newcommand{\KL}[2]{\text{KL}\left( #1\,\|\,#2 \right)}
\newcommand{\Kaast}{\kernelMatrix_{\mathbf{ \ast}\mathbf{ \ast}}}
\newcommand{\Kastu}{\kernelMatrix_{\mathbf{ \ast} \inducingVector}}
\newcommand{\Kff}{\kernelMatrix_{\mappingFunctionVector \mappingFunctionVector}}
\newcommand{\Kfu}{\kernelMatrix_{\mappingFunctionVector \inducingVector}}
\newcommand{\Kuast}{\kernelMatrix_{\inducingVector \bf\ast}}
\newcommand{\Kuf}{\kernelMatrix_{\inducingVector \mappingFunctionVector}}
\newcommand{\Kuu}{\kernelMatrix_{\inducingVector \inducingVector}}
\newcommand{\Kuui}{\Kuu^{-1}}
\newcommand{\Qaast}{\mathbf{Q}_{\bf \ast \ast}}
\newcommand{\Qastf}{\mathbf{Q}_{\ast \mappingFunction}}
\newcommand{\Qfast}{\mathbf{Q}_{\mappingFunctionVector \bf \ast}}
\newcommand{\Qff}{\mathbf{Q}_{\mappingFunctionVector \mappingFunctionVector}}
\newcommand{\aMatrix}{\mathbf{A}}
\newcommand{\aScalar}{a}
\newcommand{\aVector}{\mathbf{a}}
\newcommand{\acceleration}{a}
\newcommand{\bMatrix}{\mathbf{B}}
\newcommand{\bScalar}{b}
\newcommand{\bVector}{\mathbf{b}}
\newcommand{\basisFunc}{\phi}
\newcommand{\basisFuncVector}{\boldsymbol{ \basisFunc}}
\newcommand{\basisFunction}{\phi}
\newcommand{\basisLocation}{\mu}
\newcommand{\basisMatrix}{\boldsymbol{ \Phi}}
\newcommand{\basisScalar}{\basisFunction}
\newcommand{\basisVector}{\boldsymbol{ \basisFunction}}
\newcommand{\activationFunction}{\phi}
\newcommand{\activationMatrix}{\boldsymbol{ \Phi}}
\newcommand{\activationScalar}{\basisFunction}
\newcommand{\activationVector}{\boldsymbol{ \basisFunction}}
\newcommand{\bigO}{\mathcal{O}}
\newcommand{\binomProb}{\pi}
\newcommand{\cMatrix}{\mathbf{C}}
\newcommand{\cbasisMatrix}{\hat{\boldsymbol{ \Phi}}}
\newcommand{\cdataMatrix}{\hat{\dataMatrix}}
\newcommand{\cdataScalar}{\hat{\dataScalar}}
\newcommand{\cdataVector}{\hat{\dataVector}}
\newcommand{\centeredKernelMatrix}{\mathbf{ \MakeUppercase{\centeredKernelScalar}}}
\newcommand{\centeredKernelScalar}{b}
\newcommand{\centeredKernelVector}{\centeredKernelScalar}
\newcommand{\centeringMatrix}{\mathbf{H}}
\newcommand{\chiSquaredDist}[2]{\chi_{#1}^{2}\left(#2\right)}
\newcommand{\chiSquaredSamp}[1]{\chi_{#1}^{2}}
\newcommand{\conditionalCovariance}{\boldsymbol{ \Sigma}}
\newcommand{\coregionalizationMatrix}{\mathbf{B}}
\newcommand{\coregionalizationScalar}{b}
\newcommand{\coregionalizationVector}{\mathbf{ \coregionalizationScalar}}
\newcommand{\covDist}[2]{\text{cov}_{#2}\left(#1\right)}
\newcommand{\covSamp}[1]{\text{cov}\left(#1\right)}
\newcommand{\covarianceScalar}{c}
\newcommand{\covarianceVector}{\mathbf{ \covarianceScalar}}
\newcommand{\covarianceMatrix}{\mathbf{C}}
\newcommand{\covarianceMatrixTwo}{\boldsymbol{ \Sigma}}
\newcommand{\croupierScalar}{s}
\newcommand{\croupierVector}{\mathbf{ \croupierScalar}}
\newcommand{\croupierMatrix}{\mathbf{ \MakeUppercase{\croupierScalar}}}
\newcommand{\dataDim}{p}
\newcommand{\dataIndex}{i}
\newcommand{\dataIndexTwo}{j}
\newcommand{\dataMatrix}{\mathbf{Y}}
\newcommand{\dataScalar}{y}
\newcommand{\dataSet}{\mathcal{D}}
\newcommand{\dataStd}{\sigma}
\newcommand{\dataVector}{\mathbf{ \dataScalar}}
\newcommand{\decayRate}{d}
\newcommand{\degreeMatrix}{\mathbf{ \MakeUppercase{\degreeScalar}}}
\newcommand{\degreeScalar}{d}
\newcommand{\degreeVector}{\mathbf{ \degreeScalar}}
% Already defined by latex
%\newcommand{\det}[1]{\left|#1\right|}
\newcommand{\diag}[1]{\text{diag}\left(#1\right)}
\newcommand{\diagonalMatrix}{\mathbf{D}}
\newcommand{\diff}[2]{\frac{\text{d}#1}{\text{d}#2}}
\newcommand{\diffTwo}[2]{\frac{\text{d}^2#1}{\text{d}#2^2}}
\newcommand{\displacement}{x}
\newcommand{\displacementVector}{\textbf{\displacement}}
\newcommand{\distanceMatrix}{\mathbf{ \MakeUppercase{\distanceScalar}}}
\newcommand{\distanceScalar}{d}
\newcommand{\distanceVector}{\mathbf{ \distanceScalar}}
\newcommand{\eigenvaltwo}{\ell}
\newcommand{\eigenvaltwoMatrix}{\mathbf{L}}
\newcommand{\eigenvaltwoVector}{\mathbf{l}}
\newcommand{\eigenvalue}{\lambda}
\newcommand{\eigenvalueMatrix}{\boldsymbol{ \Lambda}}
\newcommand{\eigenvalueVector}{\boldsymbol{ \lambda}}
\newcommand{\eigenvector}{\mathbf{ \eigenvectorScalar}}
\newcommand{\eigenvectorMatrix}{\mathbf{U}}
\newcommand{\eigenvectorScalar}{u}
\newcommand{\eigenvectwo}{\mathbf{v}}
\newcommand{\eigenvectwoMatrix}{\mathbf{V}}
\newcommand{\eigenvectwoScalar}{v}
\newcommand{\entropy}[1]{\mathcal{H}\left(#1\right)}
\newcommand{\errorFunction}{E}
\newcommand{\expDist}[2]{\left<#1\right>_{#2}}
\newcommand{\expSamp}[1]{\left<#1\right>}
\newcommand{\expectation}[1]{\left\langle #1 \right\rangle }
\newcommand{\expectationDist}[2]{\left\langle #1 \right\rangle _{#2}}
\newcommand{\expectedDistanceMatrix}{\mathcal{D}}
\newcommand{\eye}{\mathbf{I}}
\newcommand{\fantasyDim}{r}
\newcommand{\fantasyMatrix}{\mathbf{ \MakeUppercase{\fantasyScalar}}}
\newcommand{\fantasyScalar}{z}
\newcommand{\fantasyVector}{\mathbf{ \fantasyScalar}}
\newcommand{\featureStd}{\varsigma}
\newcommand{\gammaCdf}[3]{\mathcal{GAMMA CDF}\left(#1|#2,#3\right)}
\newcommand{\gammaDist}[3]{\mathcal{G}\left(#1|#2,#3\right)}
\newcommand{\gammaSamp}[2]{\mathcal{G}\left(#1,#2\right)}
\newcommand{\gaussianDist}[3]{\mathcal{N}\left(#1|#2,#3\right)}
\newcommand{\gaussianSamp}[2]{\mathcal{N}\left(#1,#2\right)}
\newcommand{\given}{|}
\newcommand{\half}{\frac{1}{2}}
\newcommand{\heaviside}{H}
\newcommand{\hiddenMatrix}{\mathbf{ \MakeUppercase{\hiddenScalar}}}
\newcommand{\hiddenScalar}{h}
\newcommand{\hiddenVector}{\mathbf{ \hiddenScalar}}
\newcommand{\identityMatrix}{\eye}
\newcommand{\inducingInputScalar}{z}
\newcommand{\inducingInputVector}{\mathbf{ \inducingInputScalar}}
\newcommand{\inducingInputMatrix}{\mathbf{Z}}
\newcommand{\inducingScalar}{u}
\newcommand{\inducingVector}{\mathbf{ \inducingScalar}}
\newcommand{\inducingMatrix}{\mathbf{U}}
\newcommand{\inlineDiff}[2]{\text{d}#1/\text{d}#2}
\newcommand{\inputDim}{q}
\newcommand{\inputMatrix}{\mathbf{X}}
\newcommand{\inputScalar}{x}
\newcommand{\inputSpace}{\mathcal{X}}
\newcommand{\inputVals}{\inputVector}
\newcommand{\inputVector}{\mathbf{ \inputScalar}}
\newcommand{\iterNum}{k}
\newcommand{\kernel}{\kernelScalar}
\newcommand{\kernelMatrix}{\mathbf{K}}
\newcommand{\kernelScalar}{k}
\newcommand{\kernelVector}{\mathbf{ \kernelScalar}}
\newcommand{\kff}{\kernelScalar_{\mappingFunction \mappingFunction}}
\newcommand{\kfu}{\kernelVector_{\mappingFunction \inducingScalar}}
\newcommand{\kuf}{\kernelVector_{\inducingScalar \mappingFunction}}
\newcommand{\kuu}{\kernelVector_{\inducingScalar \inducingScalar}}
\newcommand{\lagrangeMultiplier}{\lambda}
\newcommand{\lagrangeMultiplierMatrix}{\boldsymbol{ \Lambda}}
\newcommand{\lagrangian}{L}
\newcommand{\laplacianFactor}{\mathbf{ \MakeUppercase{\laplacianFactorScalar}}}
\newcommand{\laplacianFactorScalar}{m}
\newcommand{\laplacianFactorVector}{\mathbf{ \laplacianFactorScalar}}
\newcommand{\laplacianMatrix}{\mathbf{L}}
\newcommand{\laplacianScalar}{\ell}
\newcommand{\laplacianVector}{\mathbf{ \ell}}
\newcommand{\latentDim}{q}
\newcommand{\latentDistanceMatrix}{\boldsymbol{ \Delta}}
\newcommand{\latentDistanceScalar}{\delta}
\newcommand{\latentDistanceVector}{\boldsymbol{ \delta}}
\newcommand{\latentForce}{f}
\newcommand{\latentFunction}{u}
\newcommand{\latentFunctionVector}{\mathbf{ \latentFunction}}
\newcommand{\latentFunctionMatrix}{\mathbf{ \MakeUppercase{\latentFunction}}}
\newcommand{\latentIndex}{j}
\newcommand{\latentScalar}{z}
\newcommand{\latentVector}{\mathbf{ \latentScalar}}
\newcommand{\latentMatrix}{\mathbf{Z}}
\newcommand{\learnRate}{\eta}
\newcommand{\lengthScale}{\ell}
\newcommand{\rbfWidth}{\ell}
\newcommand{\likelihoodBound}{\mathcal{L}}
\newcommand{\likelihoodFunction}{L}
\newcommand{\locationScalar}{\mu}
\newcommand{\locationVector}{\boldsymbol{ \locationScalar}}
\newcommand{\locationMatrix}{\mathbf{M}}
\newcommand{\variance}[1]{\text{var}\left( #1 \right)}
\newcommand{\mappingFunction}{f}
\newcommand{\mappingFunctionMatrix}{\mathbf{F}}
\newcommand{\mappingFunctionTwo}{g}
\newcommand{\mappingFunctionTwoMatrix}{\mathbf{G}}
\newcommand{\mappingFunctionTwoVector}{\mathbf{ \mappingFunctionTwo}}
\newcommand{\mappingFunctionVector}{\mathbf{ \mappingFunction}}
\newcommand{\scaleScalar}{s}
\newcommand{\mappingScalar}{w}
\newcommand{\mappingVector}{\mathbf{ \mappingScalar}}
\newcommand{\mappingMatrix}{\mathbf{W}}
\newcommand{\mappingScalarTwo}{v}
\newcommand{\mappingVectorTwo}{\mathbf{ \mappingScalarTwo}}
\newcommand{\mappingMatrixTwo}{\mathbf{V}}
\newcommand{\maxIters}{K}
\newcommand{\meanMatrix}{\mathbf{M}}
\newcommand{\meanScalar}{\mu}
\newcommand{\meanTwoMatrix}{\mathbf{M}}
\newcommand{\meanTwoScalar}{m}
\newcommand{\meanTwoVector}{\mathbf{ \meanTwoScalar}}
\newcommand{\meanVector}{\boldsymbol{ \meanScalar}}
\newcommand{\mrnaConcentration}{m}
\newcommand{\naturalFrequency}{\omega}
\newcommand{\neighborhood}[1]{\mathcal{N}\left( #1 \right)}
\newcommand{\neilurl}{http://inverseprobability.com/}
\newcommand{\noiseMatrix}{\boldsymbol{ E}}
\newcommand{\noiseScalar}{\epsilon}
\newcommand{\noiseVector}{\boldsymbol{ \epsilon}}
\newcommand{\norm}[1]{\left\Vert #1 \right\Vert}
\newcommand{\normalizedLaplacianMatrix}{\hat{\mathbf{L}}}
\newcommand{\normalizedLaplacianScalar}{\hat{\ell}}
\newcommand{\normalizedLaplacianVector}{\hat{\mathbf{ \ell}}}
\newcommand{\numActive}{m}
\newcommand{\numBasisFunc}{m}
\newcommand{\numComponents}{m}
\newcommand{\numComps}{K}
\newcommand{\numData}{n}
\newcommand{\numFeatures}{K}
\newcommand{\numHidden}{h}
\newcommand{\numInducing}{m}
\newcommand{\numLayers}{\ell}
\newcommand{\numNeighbors}{K}
\newcommand{\numSequences}{s}
\newcommand{\numSuccess}{s}
\newcommand{\numTasks}{m}
\newcommand{\numTime}{T}
\newcommand{\numTrials}{S}
\newcommand{\outputIndex}{j}
\newcommand{\paramVector}{\boldsymbol{ \theta}}
\newcommand{\parameterMatrix}{\boldsymbol{ \Theta}}
\newcommand{\parameterScalar}{\theta}
\newcommand{\parameterVector}{\boldsymbol{ \parameterScalar}}
\newcommand{\partDiff}[2]{\frac{\partial#1}{\partial#2}}
\newcommand{\precisionScalar}{j}
\newcommand{\precisionVector}{\mathbf{ \precisionScalar}}
\newcommand{\precisionMatrix}{\mathbf{J}}
\newcommand{\pseudotargetScalar}{\widetilde{y}}
\newcommand{\pseudotargetVector}{\mathbf{ \pseudotargetScalar}}
\newcommand{\pseudotargetMatrix}{\mathbf{ \widetilde{Y}}}
\newcommand{\rank}[1]{\text{rank}\left(#1\right)}
\newcommand{\rayleighDist}[2]{\mathcal{R}\left(#1|#2\right)}
\newcommand{\rayleighSamp}[1]{\mathcal{R}\left(#1\right)}
\newcommand{\responsibility}{r}
\newcommand{\rotationScalar}{r}
\newcommand{\rotationVector}{\mathbf{ \rotationScalar}}
\newcommand{\rotationMatrix}{\mathbf{R}}
\newcommand{\sampleCovScalar}{s}
\newcommand{\sampleCovVector}{\mathbf{ \sampleCovScalar}}
\newcommand{\sampleCovMatrix}{\mathbf{s}}
\newcommand{\scalarProduct}[2]{\left\langle{#1},{#2}\right\rangle}
\newcommand{\sign}[1]{\text{sign}\left(#1\right)}
\newcommand{\sigmoid}[1]{\sigma\left(#1\right)}
\newcommand{\singularvalue}{\ell}
\newcommand{\singularvalueMatrix}{\mathbf{L}}
\newcommand{\singularvalueVector}{\mathbf{l}}
\newcommand{\sorth}{\mathbf{u}}
\newcommand{\spar}{\lambda}
\newcommand{\trace}[1]{\text{tr}\left(#1\right)}
\newcommand{\BasalRate}{B}
\newcommand{\DampingCoefficient}{C}
\newcommand{\DecayRate}{D}
\newcommand{\Displacement}{X}
\newcommand{\LatentForce}{F}
\newcommand{\Mass}{M}
\newcommand{\Sensitivity}{S}
\newcommand{\basalRate}{b}
\newcommand{\dampingCoefficient}{c}
\newcommand{\mass}{m}
\newcommand{\sensitivity}{s}
\newcommand{\springScalar}{\kappa}
\newcommand{\springVector}{\boldsymbol{ \kappa}}
\newcommand{\springMatrix}{\boldsymbol{ \mathcal{K}}}
\newcommand{\tfConcentration}{p}
\newcommand{\tfDecayRate}{\delta}
\newcommand{\tfMrnaConcentration}{f}
\newcommand{\tfVector}{\mathbf{ \tfConcentration}}
\newcommand{\velocity}{v}
\newcommand{\sufficientStatsScalar}{g}
\newcommand{\sufficientStatsVector}{\mathbf{ \sufficientStatsScalar}}
\newcommand{\sufficientStatsMatrix}{\mathbf{G}}
\newcommand{\switchScalar}{s}
\newcommand{\switchVector}{\mathbf{ \switchScalar}}
\newcommand{\switchMatrix}{\mathbf{S}}
\newcommand{\tr}[1]{\text{tr}\left(#1\right)}
\newcommand{\loneNorm}[1]{\left\Vert #1 \right\Vert_1}
\newcommand{\ltwoNorm}[1]{\left\Vert #1 \right\Vert_2}
\newcommand{\onenorm}[1]{\left\vert#1\right\vert_1}
\newcommand{\twonorm}[1]{\left\Vert #1 \right\Vert}
\newcommand{\vScalar}{v}
\newcommand{\vVector}{\mathbf{v}}
\newcommand{\vMatrix}{\mathbf{V}}
\newcommand{\varianceDist}[2]{\text{var}_{#2}\left( #1 \right)}
% Already defined by latex
%\newcommand{\vec}{#1:}
\newcommand{\vecb}[1]{\left(#1\right):}
\newcommand{\weightScalar}{w}
\newcommand{\weightVector}{\mathbf{ \weightScalar}}
\newcommand{\weightMatrix}{\mathbf{W}}
\newcommand{\weightedAdjacencyMatrix}{\mathbf{A}}
\newcommand{\weightedAdjacencyScalar}{a}
\newcommand{\weightedAdjacencyVector}{\mathbf{ \weightedAdjacencyScalar}}
\newcommand{\onesVector}{\mathbf{1}}
\newcommand{\zerosVector}{\mathbf{0}}
$$
</div>
<p><!--% not ipynb--></p>
<h2 id="what-is-machine-learning">What is Machine Learning?</h2>
<p>What is machine learning? At its most basic level machine learning is a combination of</p>
<p><span class="math display">\[ \text{data} + \text{model} \xrightarrow{\text{compute}} \text{prediction}\]</span></p>
<p>where <em>data</em> is our observations. They can be actively or passively acquired (meta-data). The <em>model</em> contains our assumptions, based on previous experience. That experience can be other data, it can come from transfer learning, or it can merely be our beliefs about the regularities of the universe. In humans our models include our inductive biases. The <em>prediction</em> is an action to be taken or a categorization or a quality score. The reason that machine learning has become a mainstay of artificial intelligence is the importance of predictions in artificial intelligence. The data and the model are combined through computation.</p>
<p>In practice we normally perform machine learning using two functions. To combine data with a model we typically make use of:</p>
<p><strong>a prediction function</strong> a function which is used to make the predictions. It includes our beliefs about the regularities of the universe, our assumptions about how the world works, e.g. smoothness, spatial similarities, temporal similarities.</p>
<p><strong>an objective function</strong> a function which defines the cost of misprediction. Typically it includes knowledge about the world's generating processes (probabilistic objectives) or the costs we pay for mispredictions (empirical risk minimization).</p>
<p>The combination of data and model through the prediction function and the objective function leads to a <em>learning algorithm</em>. The class of prediction functions and objective functions we can make use of is restricted by the algorithms they lead to. If the prediction function or the objective function are too complex, then it can be difficult to find an appropriate learning algorithm. Much of the academic field of machine learning is the quest for new learning algorithms that allow us to bring different types of models and data together.</p>
<p>A useful reference for state of the art in machine learning is the UK Royal Society Report, <a href="https://royalsociety.org/~/media/policy/projects/machine-learning/publications/machine-learning-report.pdf">Machine Learning: Power and Promise of Computers that Learn by Example</a>.</p>
<p>You can also check my blog post on <a href="http://inverseprobability.com/2017/07/17/what-is-machine-learning">"What is Machine Learning?"</a></p>
<h2 id="probabilities">Probabilities</h2>
<p>We are now going to do some simple review of probabilities and use this review to explore some aspects of our data.</p>
<p>A probability distribution expresses uncertainty about the outcome of an event. We often encode this uncertainty in a variable. So if we are considering the outcome of an event, <span class="math inline">\(Y\)</span>, to be a coin toss, then we might consider <span class="math inline">\(Y=1\)</span> to be heads and <span class="math inline">\(Y=0\)</span> to be tails. We represent the probability of a given outcome with the notation: <span class="math display">\[
P(Y=1) = 0.5
\]</span> The first rule of probability is that the probability must normalize. The sum of the probability of all events must equal 1. So if the probability of heads (<span class="math inline">\(Y=1\)</span>) is 0.5, then the probability of tails (the only other possible outcome) is given by <span class="math display">\[
P(Y=0) = 1-P(Y=1) = 0.5
\]</span></p>
<p>Probabilities are often defined as the limit of the ratio between the number of positive outcomes (e.g. <em>heads</em>) given the number of trials. If the number of positive outcomes for event <span class="math inline">\(y\)</span> is denoted by <span class="math inline">\(n\)</span> and the number of trials is denoted by <span class="math inline">\(N\)</span> then this gives the ratio <span class="math display">\[
P(Y=y) = \lim_{N\rightarrow
\infty}\frac{n_y}{N}.
\]</span> In practice we never get to observe an event infinite times, so rather than considering this we often use the following estimate <span class="math display">\[
P(Y=y) \approx \frac{n_y}{N}.
\]</span></p>
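<p>As a quick sketch (not part of the original notebook), we can check this frequency estimate by simulating coin tosses with <code>numpy</code>; the seed and sample size here are arbitrary illustrative choices:</p>

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so the simulation is reproducible
N = 100000                       # number of simulated coin tosses
tosses = rng.integers(0, 2, size=N)  # 1 = heads, 0 = tails

# empirical estimate: P(Y=1) is approximately n_y / N
p_heads = tosses.sum() / N
print(p_heads)
```

<p>As <code>N</code> grows the estimate concentrates around the true probability of 0.5, illustrating the limit definition above.</p>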
<h2 id="movie-body-count-example">Movie Body Count Example</h2>
<p>There is a crisis in the movie industry: deaths are occurring on a massive scale. In every feature film the body count is mounting up. But what is the cause of all these deaths? Let's try to investigate.</p>
<p>For our first example of data science, we take inspiration from work by <a href="http://www.theswarmlab.com/r-vs-python-round-2/">researchers at NJIT</a>. The researchers were comparing the qualities of Python with R (my brief thoughts on the subject are available in a Google+ post here: https://plus.google.com/116220678599902155344/posts/5iKyqcrNN68). They put together a database of results from the "Internet Movie Database" and the <a href="http://www.moviebodycounts.com/">Movie Body Count</a> website which will allow us to do some preliminary investigation.</p>
<p>We will make use of data that has already been 'scraped' from the <a href="http://www.moviebodycounts.com/">Movie Body Count</a> website. Code and the data is available at <a href="https://github.com/sjmgarnier/R-vs-%20Python/tree/master/Deadliest%20movies%20scrape/code">a github repository</a>. Git is a version control system and github is a website that hosts code that can be accessed through git. By sharing the code publicly through github, the authors are licensing the code publicly and allowing you to access and edit it. As well as accessing the code via github you can also <a href="https://github.com/sjmgarnier/R-vs-Python/archive/master.zip">download the zip file</a>.</p>
<p>For ease of use we've packaged this data set in the <code>pods</code> library</p>
<h3 id="pods"><code>pods</code></h3>
<p>The <code>pods</code> library (Python open data science) is a library for supporting open data science. It allows you to load in various data sets and provides tools to help teaching in the notebook.</p>
<p>To install pods you can use pip:</p>
<p><code>pip install pods</code></p>
<p>The code is also available on github: <a href="https://github.com/sods/ods" class="uri">https://github.com/sods/ods</a></p>
<p>Once <code>pods</code> is installed, it can be imported in the usual manner.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">import</span> pods</code></pre></div>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">data <span class="op">=</span> pods.datasets.movie_body_count()[<span class="st">'Y'</span>]
data.head()</code></pre></div>
<p>Once it is loaded in, the data can be summarized using the <code>describe</code> method in pandas.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">data.describe()</code></pre></div>
<p>In Jupyter and the Jupyter notebook it is possible to see a list of all possible functions and attributes by typing the name of the object followed by <code>.&lt;Tab&gt;</code>. For example, in the above case if we type <code>data.&lt;Tab&gt;</code> it shows the columns available (these are attributes in pandas data frames), such as <code>Body_Count</code>, and also functions, such as <code>.describe()</code>.</p>
<p>For functions we can also see the documentation about the function by following the name with a question mark. This will open a box with documentation at the bottom which can be closed with the x button.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">data.describe?</code></pre></div>
<p>The film deaths data is stored in an object known as a 'data frame'. Data frames come from the statistical family of programming languages based on <code>S</code>, the most widely used of which is <a href="http://en.wikipedia.org/wiki/R_(programming_language)"><code>R</code></a>. The data frame gives us a convenient object for manipulating data. The describe method summarizes which columns there are in the data frame and gives us counts, means, standard deviations and percentiles for the values in those columns. To access a column directly we can write</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="bu">print</span>(data[<span class="st">'Year'</span>])
<span class="co">#print(data['Body_Count'])</span></code></pre></div>
<p>This shows the number of deaths per film across the years. We can plot the data as follows.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="co"># this ensures the plot appears in the web browser</span>
<span class="op">%</span>matplotlib inline
<span class="im">import</span> matplotlib.pyplot <span class="im">as</span> plt <span class="co"># this imports the plotting library in python</span></code></pre></div>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">plt.plot(data[<span class="st">'Year'</span>], data[<span class="st">'Body_Count'</span>], <span class="st">'rx'</span>)</code></pre></div>
<p>You may be curious what the arguments we give to <code>plt.plot</code> are for. Now is the perfect time to look at the documentation.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">plt.plot?</code></pre></div>
<p>We immediately note that some films have a lot of deaths, which prevents us from seeing the detail of the main body of films. First let's identify the films with the most deaths.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">data[data[<span class="st">'Body_Count'</span>]<span class="op">></span><span class="dv">200</span>]</code></pre></div>
<p>Here we are using the command <code>data['Body_Count']>200</code> to index the films in the pandas data frame which have over 200 deaths. The result of this command on its own is a data series of <code>True</code> and <code>False</code> values. However, when it is passed to the <code>data</code> data frame it returns a new data frame which contains only those rows for which the data series is <code>True</code>. We can also sort the result. To sort the result by the values in the <code>Body_Count</code> column in <em>descending</em> order we use the <code>sort_values</code> command.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">data[data[<span class="st">'Body_Count'</span>]<span class="op">></span><span class="dv">200</span>].sort_values(by<span class="op">=</span><span class="st">'Body_Count'</span>, ascending<span class="op">=</span><span class="va">False</span>)</code></pre></div>
<p>We now see that 'Lord of the Rings' is a large outlier with a very large number of kills. We can try to determine how much of an outlier by histogramming the data.</p>
<h3 id="plotting-the-data">Plotting the Data</h3>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">data[<span class="st">'Body_Count'</span>].hist(bins<span class="op">=</span><span class="dv">20</span>) <span class="co"># histogram the data with 20 bins.</span>
plt.title(<span class="st">'Histogram of Film Kill Count'</span>)</code></pre></div>
<h3 id="question-2">Question 2</h3>
<p>Read on the internet about the following python libraries: <code>numpy</code>, <code>matplotlib</code>, <code>scipy</code> and <code>pandas</code>. What functionality does each provide Python? What is the <code>pylab</code> library and how does it relate to the other libraries?</p>
<p><em>10 marks</em></p>
<h3 id="write-your-answer-to-question-2-here">Write your answer to Question 2 here</h3>
<p>We could try to remove these outliers, but another approach would be to plot the logarithm of the counts against the year.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">plt.plot(data[<span class="st">'Year'</span>], data[<span class="st">'Body_Count'</span>], <span class="st">'rx'</span>)
ax <span class="op">=</span> plt.gca() <span class="co"># obtain a handle to the current axis</span>
ax.set_yscale(<span class="st">'log'</span>) <span class="co"># use a logarithmic death scale</span>
<span class="co"># give the plot some titles and labels</span>
plt.title(<span class="st">'Film Deaths against Year'</span>)
plt.ylabel(<span class="st">'deaths'</span>)
plt.xlabel(<span class="st">'year'</span>)</code></pre></div>
<p>Note a few things. We are interacting with our data. In particular, we are replotting the data according to what we have learned so far. We are using the programming language as a <em>scripting</em> language to give the computer one command or another, and then the next command we enter is dependent on the result of the previous. This is a very different paradigm to classical software engineering. In classical software engineering we normally write many lines of code (entire object classes or functions) before compiling the code and running it. Our approach is more similar to the approach we take whilst debugging. Historically, researchers interacted with data using a <em>console</em>: a command line window which allowed command entry. The notebook format we are using is slightly different. Each of the code entry boxes acts like a separate console window. We can move up and down the notebook and run each part in a different order. The <em>state</em> of the program is always as we left it after running the previous part.</p>
<p>Let's use the sum rule to compute the approximate probability that a film from the movie body count website has over 40 deaths.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">deaths <span class="op">=</span> (data.Body_Count<span class="op">></span><span class="dv">40</span>).<span class="bu">sum</span>() <span class="co"># number of positive outcomes (in sum True counts as 1, False counts as 0)</span>
total_films <span class="op">=</span> data.Body_Count.count()
prob_death <span class="op">=</span> <span class="bu">float</span>(deaths)<span class="op">/</span><span class="bu">float</span>(total_films)
<span class="bu">print</span>(<span class="st">"Probability of deaths being greater than 40 is:"</span>, prob_death)</code></pre></div>
<h3 id="question-4">Question 4</h3>
<p>We now have an estimate of the probability a film has greater than 40 deaths. The estimate seems quite high. What could be wrong with the estimate? Do you think any film you go to in the cinema has this probability of having greater than 40 deaths?</p>
<p>Why did we have to use <code>float</code> around our counts of deaths and total films? What would the answer have been if we hadn't used the <code>float</code> command? If we were using Python 3 would we have this problem?</p>
<p><em>20 marks</em></p>
<h3 id="write-your-answer-to-question-4-here">Write your answer to Question 4 here</h3>
<h2 id="conditioning">Conditioning</h2>
<p>When predicting whether a coin turns up heads or tails, we might think that this event is <em>independent</em> of the year or time of day. If we include an observation such as time, then in a probability this is known as <em>conditioning</em>. We use the notation <span class="math inline">\(P(Y=y|T=t)\)</span> to condition the outcome on a second variable (in this case time). Or, often, as a shorthand we use <span class="math inline">\(P(y|t)\)</span> to represent this distribution (the <span class="math inline">\(Y=\)</span> and <span class="math inline">\(T=\)</span> being implicit). Because we don't believe a coin toss depends on time, we might write <span class="math display">\[
P(y|t) =
P(y).
\]</span> However, we might believe that the number of deaths is dependent on the year. For this we can try estimating <span class="math inline">\(P(Y>40 | T=2000)\)</span> and compare the result, for example, to <span class="math inline">\(P(Y>40|T=2002)\)</span> using our empirical estimate of the probability.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="cf">for</span> year <span class="kw">in</span> [<span class="dv">2000</span>, <span class="dv">2002</span>]:
deaths <span class="op">=</span> (data.Body_Count[data.Year<span class="op">==</span>year]<span class="op">></span><span class="dv">40</span>).<span class="bu">sum</span>()
total_films <span class="op">=</span> (data.Year<span class="op">==</span>year).<span class="bu">sum</span>()
prob_death <span class="op">=</span> <span class="bu">float</span>(deaths)<span class="op">/</span><span class="bu">float</span>(total_films)
<span class="bu">print</span>(<span class="st">"Probability of deaths being greater than 40 in year"</span>, year, <span class="st">"is:"</span>, prob_death)</code></pre></div>
<h3 id="question-5">Question 5</h3>
<p>Compute the probability for the number of deaths being over 40 for each year we have in our <code>data</code> data frame. Store the result in a <code>numpy</code> array and plot the probabilities against the years using the <code>plot</code> command from <code>matplotlib</code>. Do you think the estimate we have created of <span class="math inline">\(P(y|t)\)</span> is a good estimate? Write your code and your written answers in the box below.</p>
<p><em>20 marks</em></p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="co"># Write code for your answer to Question 5 in this box</span>
<span class="co"># provide the answers so that the code runs correctly otherwise you will lose marks!</span>
</code></pre></div>
<h4 id="notes-for-question">Notes for Question</h4>
<p>Make sure the plot is included in <em>this</em> notebook file (the <code>IPython</code> magic command <code>%matplotlib inline</code> we ran above will do that for you, it only needs to be run once per file).</p>
<table>
<thead>
<tr class="header">
<th>Terminology</th>
<th>Mathematical notation</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>joint</td>
<td><span class="math inline">\(P(X=x, Y=y)\)</span></td>
<td>prob. that X=x <em>and</em> Y=y</td>
</tr>
<tr class="even">
<td>marginal</td>
<td><span class="math inline">\(P(X=x)\)</span></td>
<td>prob. that X=x <em>regardless of</em> Y</td>
</tr>
<tr class="odd">
<td>conditional</td>
<td><span class="math inline">\(P(X=x\vert Y=y)\)</span></td>
<td>prob. that X=x <em>given that</em> Y=y</td>
</tr>
</tbody>
</table>
<center>
The different basic probability distributions.
</center>
<h3 id="a-pictorial-definition-of-probability">A Pictorial Definition of Probability</h3>
<object class="svgplot" align data="../slides/diagrams/mlai/prob_diagram.svg">
</object>
<p><span align="right">Inspired by lectures from Christopher Bishop</span></p>
<h3 id="definition-of-probability-distributions.">Definition of probability distributions.</h3>
<table>
<colgroup>
<col width="20%" />
<col width="46%" />
<col width="33%" />
</colgroup>
<thead>
<tr class="header">
<th>Terminology</th>
<th>Definition</th>
<th>Probability Notation</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Joint Probability</td>
<td><span class="math inline">\(\lim_{N\rightarrow\infty}\frac{n_{X=3,Y=4}}{N}\)</span></td>
<td><span class="math inline">\(P\left(X=3,Y=4\right)\)</span></td>
</tr>
<tr class="even">
<td>Marginal Probability</td>
<td><span class="math inline">\(\lim_{N\rightarrow\infty}\frac{n_{X=5}}{N}\)</span></td>
<td><span class="math inline">\(P\left(X=5\right)\)</span></td>
</tr>
<tr class="odd">
<td>Conditional Probability</td>
<td><span class="math inline">\(\lim_{N\rightarrow\infty}\frac{n_{X=3,Y=4}}{n_{Y=4}}\)</span></td>
<td><span class="math inline">\(P\left(X=3\vert Y=4\right)\)</span></td>
</tr>
</tbody>
</table>
<h3 id="notational-details">Notational Details</h3>
<p>Typically we should write out <span class="math inline">\(P\left(X=x,Y=y\right)\)</span>, but in practice we often shorten this to <span class="math inline">\(P\left(x,y\right)\)</span>. This looks very much like we might write a multivariate function, <em>e.g.</em> <span class="math display">\[
f\left(x,y\right)=\frac{x}{y},
\]</span> but for a multivariate function <span class="math display">\[
f\left(x,y\right)\neq f\left(y,x\right).
\]</span> However, <span class="math display">\[
P\left(x,y\right)=P\left(y,x\right)
\]</span> because <span class="math display">\[
P\left(X=x,Y=y\right)=P\left(Y=y,X=x\right).
\]</span> Sometimes I think of this as akin to the way in Python we can write 'keyword arguments' in functions. If we use keyword arguments, the ordering of arguments doesn't matter.</p>
<p>We've now introduced conditioning and independence to the notion of probability and computed some conditional probabilities on a practical example. The scatter plot of deaths vs year that we created above can be seen as a <em>joint</em> probability distribution. We represent a joint probability using the notation <span class="math inline">\(P(Y=y, T=t)\)</span> or <span class="math inline">\(P(y, t)\)</span> for short. Computing a joint probability is equivalent to answering the simultaneous questions: what's the probability that the number of deaths was over 40 and the year was 2002? Or any other question that may occur to us. Again we can easily use pandas to ask such questions.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">year <span class="op">=</span> <span class="dv">2002</span>
deaths <span class="op">=</span> (data.Body_Count[data.Year<span class="op">==</span>year]<span class="op">></span><span class="dv">40</span>).<span class="bu">sum</span>()
total_films <span class="op">=</span> data.Body_Count.count() <span class="co"># this is total number of films</span>
prob_death <span class="op">=</span> <span class="bu">float</span>(deaths)<span class="op">/</span><span class="bu">float</span>(total_films)
<span class="bu">print</span>(<span class="st">"Probability of deaths being greater than 40 and year being"</span>, year, <span class="st">"is:"</span>, prob_death)</code></pre></div>
<h3 id="the-product-rule">The Product Rule</h3>
<p>This number is the joint probability, <span class="math inline">\(P(Y, T)\)</span>, which is much <em>smaller</em> than the conditional probability. The number can never be bigger than the conditional probability because it is computed using the <em>product rule</em>, <span class="math display">\[
p(Y=y, X=x) = p(Y=y|X=x)p(X=x)
\]</span> and <span class="math display">\[p(X=x)\]</span> is a probability, which is less than or equal to 1, ensuring the joint distribution is typically smaller than the conditional distribution.</p>
<p>The product rule is a <em>fundamental</em> rule of probability, and you must remember it! It gives the relationship between the two questions: 1) What's the probability that a film was made in 2002 and has over 40 deaths? and 2) What's the probability that a film has over 40 deaths given that it was made in 2002?</p>
<p>In our shorter notation we can write the product rule as <span class="math display">\[
p(y, x) = p(y|x)p(x)
\]</span> We can see the relation working in practice for our data above by computing the different values for <span class="math inline">\(t=2002\)</span>.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">p_x <span class="op">=</span> <span class="bu">float</span>((data.Year<span class="op">==</span><span class="dv">2002</span>).<span class="bu">sum</span>())<span class="op">/</span><span class="bu">float</span>(data.Body_Count.count())
p_y_given_x <span class="op">=</span> <span class="bu">float</span>((data.Body_Count[data.Year<span class="op">==</span><span class="dv">2002</span>]<span class="op">></span><span class="dv">40</span>).<span class="bu">sum</span>())<span class="op">/</span><span class="bu">float</span>((data.Year<span class="op">==</span><span class="dv">2002</span>).<span class="bu">sum</span>())
p_y_and_x <span class="op">=</span> <span class="bu">float</span>((data.Body_Count[data.Year<span class="op">==</span><span class="dv">2002</span>]<span class="op">></span><span class="dv">40</span>).<span class="bu">sum</span>())<span class="op">/</span><span class="bu">float</span>(data.Body_Count.count())
<span class="bu">print</span>(<span class="st">"P(x) is"</span>, p_x)
<span class="bu">print</span>(<span class="st">"P(y|x) is"</span>, p_y_given_x)
<span class="bu">print</span>(<span class="st">"P(y,x) is"</span>, p_y_and_x)</code></pre></div>
<h3 id="the-sum-rule">The Sum Rule</h3>
<p>The other <em>fundamental rule</em> of probability is the <em>sum rule</em>. This tells us how to get a <em>marginal</em> distribution from the joint distribution. Simply put, it says that we need to sum across the value we'd like to remove. <span class="math display">\[
P(Y=y) = \sum_{x} P(Y=y, X=x)
\]</span> Or in our shortened notation <span class="math display">\[
P(y) = \sum_{x} P(y, x)
\]</span></p>
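<p>As a minimal sketch of the sum rule (using small illustrative numbers, not the film data), we can store a joint distribution as a table and marginalize by summing along one axis:</p>

```python
import numpy as np

# a small synthetic joint distribution P(y, x): rows index y, columns index x
# (illustrative numbers only, chosen so the table normalizes to 1)
P_joint = np.array([[0.10, 0.20, 0.10],
                    [0.15, 0.25, 0.20]])
assert np.isclose(P_joint.sum(), 1.0)  # probabilities must normalize

# sum rule: marginalize x out by summing across the columns
P_y = P_joint.sum(axis=1)
print(P_y)  # [0.4, 0.6]
```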
<h3 id="question-6">Question 6</h3>
<p>Write code that computes <span class="math inline">\(P(y)\)</span> by adding <span class="math inline">\(P(y, x)\)</span> for all values of <span class="math inline">\(x\)</span>.</p>
<p><em>10 marks</em></p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="co"># Write code for your answer to Question 6 in this box</span>
<span class="co"># provide the answers so that the code runs correctly otherwise you will lose marks!</span>
</code></pre></div>
<h3 id="bayes-rule">Bayes’ Rule</h3>
<p>Bayes' rule is a very simple rule; it's hardly worth the name of a rule at all. It follows directly from the product rule of probability. Because <span class="math inline">\(P(y, x) = P(y|x)P(x)\)</span> and by symmetry <span class="math inline">\(P(y,x)=P(x,y)=P(x|y)P(y)\)</span>, equating these two expressions and dividing through by <span class="math inline">\(P(y)\)</span> gives <span class="math display">\[
P(x|y) =
\frac{P(y|x)P(x)}{P(y)}
\]</span> which is known as Bayes' rule (or Bayes's rule, it depends how you choose to pronounce it). It's not difficult to derive, and its importance is more to do with the semantic operation that it enables. Each of these probability distributions represents the answer to a question we have about the world. Bayes' rule (via the product rule) tells us how to <em>invert</em> the probability.</p>
<h3 id="probabilities-for-extracting-information-from-data">Probabilities for Extracting Information from Data</h3>
<p>What use is all this probability in data science? Let's think about how we might use the probabilities to do some decision making. Let's look at the information about the movies.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">data.columns</code></pre></div>
<h3 id="question-7">Question 7</h3>
<p>Now we see we have several additional features including the quality rating (<code>IMDB_Rating</code>). Let's assume we want to predict the rating given the other information in the database. How would we go about doing it?</p>
<p>Using what you've learnt about joint, conditional and marginal probabilities, as well as the sum and product rule, how would you formulate the question you want to answer in terms of probabilities? Should you be using a joint or a conditional distribution? If it's conditional, what should the distribution be over, and what should it be conditioned on?</p>
<p><em>20 marks</em></p>
<h3 id="write-your-answer-to-question-7-here">Write your answer to Question 7 here</h3>
<h3 id="probabilistic-modelling">Probabilistic Modelling</h3>
<p>This Bayesian approach is designed to deal with uncertainty arising from fitting our prediction function to the data we have, a reduced data set.</p>
<p>The Bayesian approach can be derived from a broader understanding of what our objective is. If we accept that we can jointly represent all things that happen in the world with a probability distribution, then we can interrogate that probability to make predictions. So, if we are interested in predictions, <span class="math inline">\(\dataScalar_*\)</span>, at future input locations of interest, <span class="math inline">\(\inputVector_*\)</span>, given previous training data, <span class="math inline">\(\dataVector\)</span>, and corresponding inputs, <span class="math inline">\(\inputMatrix\)</span>, then we are really interrogating the following probability density, <span class="math display">\[
p(\dataScalar_*|\dataVector, \inputMatrix, \inputVector_*).
\]</span> There is nothing controversial here; as long as you accept that you have a good joint model of the world around you that relates test data to training data, <span class="math inline">\(p(\dataScalar_*, \dataVector, \inputMatrix, \inputVector_*)\)</span>, then this conditional distribution can be recovered through the standard rules of probability (<span class="math inline">\(\text{data} + \text{model} \rightarrow \text{prediction}\)</span>).</p>
<p>We can construct this joint density through the use of the following decomposition: <span class="math display">\[
p(\dataScalar_*|\dataVector, \inputMatrix, \inputVector_*) = \int p(\dataScalar_*|\inputVector_*, \parameterVector) p(\parameterVector | \dataVector, \inputMatrix) \text{d} \parameterVector
\]</span></p>
<p>where, for convenience, we are assuming <em>all</em> the parameters of the model are now represented by <span class="math inline">\(\parameterVector\)</span> (which contains <span class="math inline">\(\mappingMatrix\)</span> and <span class="math inline">\(\mappingMatrixTwo\)</span>) and <span class="math inline">\(p(\parameterVector | \dataVector, \inputMatrix)\)</span> is recognised as the posterior density of the parameters given data and <span class="math inline">\(p(\dataScalar_*|\inputVector_*, \parameterVector)\)</span> is the <em>likelihood</em> of an individual test data point given the parameters.</p>
<p>The likelihood of the data is normally assumed to factorize independently across data points given the parameters, <span class="math display">\[
p(\dataVector|\inputMatrix, \mappingMatrix) = \prod_{i=1}^\numData p(\dataScalar_i|\inputVector_i, \mappingMatrix),\]</span></p>
<p>and if that is so, it is easy to extend our predictions across all future, potential, locations, <span class="math display">\[
p(\dataVector_*|\dataVector, \inputMatrix, \inputMatrix_*) = \int p(\dataVector_*|\inputMatrix_*, \parameterVector) p(\parameterVector | \dataVector, \inputMatrix) \text{d} \parameterVector.
\]</span></p>
<p>The likelihood is also where the <em>prediction function</em> is incorporated. For example in the regression case, we consider an objective based around the Gaussian density, <span class="math display">\[
p(\dataScalar_i | \mappingFunction(\inputVector_i)) = \frac{1}{\sqrt{2\pi \dataStd^2}} \exp\left(-\frac{\left(\dataScalar_i - \mappingFunction(\inputVector_i)\right)^2}{2\dataStd^2}\right)
\]</span></p>
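<p>The Gaussian likelihood above can be evaluated directly in code; a minimal sketch (the function name and values are illustrative):</p>

```python
import numpy as np

def gaussian_likelihood(y, f, sigma2):
    """Density of observation y given prediction f(x) and noise variance sigma2."""
    return np.exp(-(y - f)**2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

# the density is highest when the observation matches the prediction
p_match = gaussian_likelihood(1.0, 1.0, 0.5)
p_off = gaussian_likelihood(1.0, 3.0, 0.5)
```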
<p>In short, that is the classical approach to probabilistic inference, and all approaches to Bayesian neural networks fall within this path. For a deep probabilistic model, we can simply take this one stage further and place a probability distribution over the input locations, <span class="math display">\[
p(\dataVector_*|\dataVector) = \int p(\dataVector_*|\inputMatrix_*, \parameterVector) p(\parameterVector | \dataVector, \inputMatrix) p(\inputMatrix) p(\inputMatrix_*) \text{d} \parameterVector \text{d} \inputMatrix \text{d}\inputMatrix_*
\]</span> and we have <em>unsupervised learning</em> (from where we can get deep generative models).</p>
<h3 id="graphical-models">Graphical Models</h3>
<p>One way of representing a joint distribution is to consider conditional dependencies between data. Conditional dependencies allow us to factorize the distribution. For example, a Markov chain is a factorization of a distribution into components that represent the conditional relationships between points that are neighboring, often in time or space. It can be decomposed in the following form. <span class="math display">\[p(\dataVector) = p(\dataScalar_\numData | \dataScalar_{\numData-1}) p(\dataScalar_{\numData-1}|\dataScalar_{\numData-2}) \dots p(\dataScalar_{2} | \dataScalar_{1}) p(\dataScalar_{1})\]</span></p>
<object class="svgplot" align data="../slides/diagrams/ml/markov.svg">
</object>
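<p>A Markov chain of this kind is straightforward to sample from, because each point depends only on its immediate predecessor. A minimal sketch, assuming a Gaussian random walk with illustrative parameters:</p>

```python
import numpy as np

def sample_markov_chain(n, sigma=1.0, seed=0):
    """Sample y_1, ..., y_n where p(y_i | y_{i-1}) is Gaussian centred on y_{i-1}."""
    rng = np.random.default_rng(seed)
    y = np.zeros(n)
    for i in range(1, n):
        # each point depends only on the neighbouring (previous) point
        y[i] = y[i - 1] + sigma * rng.standard_normal()
    return y

chain = sample_markov_chain(100)
```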
<p>By specifying conditional independencies we can reduce the parameterization required for our data: instead of directly specifying the parameters of the joint distribution, we can specify the parameters of each conditional independently. This can also give an advantage in terms of interpretability. Understanding a conditional independence structure gives a structured understanding of data. If developed correctly, according to causal methodology, it can even inform how we should intervene in the system to drive a desired result <span class="citation">(Pearl, 1995)</span>.</p>
<p>However, a challenge arises when the data becomes more complex. Consider the graphical model shown below, used to predict the perioperative risk of <em>C. difficile</em> infection following colon surgery <span class="citation">(Steele et al., 2012)</span>.</p>
<p><img class="negate" src="../slides/diagrams/bayes-net-diagnosis.png" width="40%" align="center" style="background:none; border:none; box-shadow:none;"></p>
<p>To capture the complexity of the interrelationships in the data the graph becomes more complex, and less interpretable.</p>
<p>Machine learning problems normally involve a prediction function and an objective function. So far in the course we've mainly focussed on the case where the prediction function was over the real numbers, so the codomain of the functions, <span class="math inline">\(\mappingFunction(\inputMatrix)\)</span> was the real numbers or sometimes real vectors. The classification problem consists of predicting whether or not a particular example is a member of a particular class. So we may want to know if a particular image represents a digit 6 or if a particular user will click on a given advert. These are classification problems, and they require us to map to <em>yes</em> or <em>no</em> answers. That makes them naturally discrete mappings.</p>
<p>In classification we are given an input vector, <span class="math inline">\(\inputVector\)</span>, and an associated label, <span class="math inline">\(\dataScalar\)</span> which either takes the value <span class="math inline">\(-1\)</span> to represent <em>no</em> or <span class="math inline">\(1\)</span> to represent <em>yes</em>.</p>
<ul>
<li>Classifying hand-written digits from binary images (automatic zip code reading)</li>
<li>Detecting faces in images (e.g. digital cameras).</li>
<li>Who a detected face belongs to (e.g. Picasa, Facebook, DeepFace, GaussianFace)</li>
<li>Classifying type of cancer given gene expression data.</li>
<li>Categorization of document types (different types of news article on the internet)</li>
</ul>
<p>Our focus has been on models where the objective function is inspired by a probabilistic analysis of the problem. In particular we've argued that we answer questions about the data set by placing probability distributions over the various quantities of interest. For the case of binary classification this will normally involve introducing probability distributions for discrete variables. Such probability distributions are in some senses easier than those for continuous variables; in particular we can represent a probability distribution over <span class="math inline">\(\dataScalar\)</span>, where <span class="math inline">\(\dataScalar\)</span> is binary, with one value. If we specify the probability that <span class="math inline">\(\dataScalar=1\)</span> with a number that is between 0 and 1, i.e. let's say that <span class="math inline">\(P(\dataScalar=1) = \pi\)</span> (here we don't mean <span class="math inline">\(\pi\)</span> the number, we are setting <span class="math inline">\(\pi\)</span> to be a variable), then we can specify the probability distribution through a table.</p>
<table>
<thead>
<tr class="header">
<th align="center"><span class="math inline">\(\dataScalar\)</span></th>
<th align="center">0</th>
<th align="center">1</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td align="center"><span class="math inline">\(P(\dataScalar)\)</span></td>
<td align="center"><span class="math inline">\((1-\pi)\)</span></td>
<td align="center"><span class="math inline">\(\pi\)</span></td>
</tr>
</tbody>
</table>
<p>Mathematically we can use a trick to implement this same table. We can use the value <span class="math inline">\(\dataScalar\)</span> as a mathematical switch and write that <span class="math display">\[
P(\dataScalar) = \pi^\dataScalar (1-\pi)^{(1-\dataScalar)}
\]</span> where our probability distribution is now written as a function of <span class="math inline">\(\dataScalar\)</span>. This probability distribution is known as the <a href="http://en.wikipedia.org/wiki/Bernoulli_distribution">Bernoulli distribution</a>. The Bernoulli distribution is a clever trick for mathematically switching between two probabilities. If we were to write it as code, it would be better described as</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="kw">def</span> bernoulli(y_i, pi):
    <span class="cf">if</span> y_i <span class="op">==</span> <span class="dv">1</span>:
        <span class="cf">return</span> pi
    <span class="cf">else</span>:
        <span class="cf">return</span> <span class="dv">1</span><span class="op">-</span>pi</code></pre></div>
<p>If we insert <span class="math inline">\(\dataScalar=1\)</span> then the function is equal to <span class="math inline">\(\pi\)</span>, and if we insert <span class="math inline">\(\dataScalar=0\)</span> then the function is equal to <span class="math inline">\(1-\pi\)</span>. So the function recreates the table for the distribution given above.</p>
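<p>The same table can also be implemented with the mathematical switch written out directly; a minimal sketch, equivalent to the <code>if</code>/<code>else</code> version above:</p>

```python
def bernoulli(y_i, pi):
    # y_i acts as a mathematical switch: pi**1 * (1-pi)**0 or pi**0 * (1-pi)**1
    return pi**y_i * (1 - pi)**(1 - y_i)

p_one = bernoulli(1, 0.3)   # probability of y = 1
p_zero = bernoulli(0, 0.3)  # probability of y = 0
```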
<p>The probability distribution is named for <a href="http://en.wikipedia.org/wiki/Jacob_Bernoulli">Jacob Bernoulli</a>, the Swiss mathematician. In his book <em>Ars Conjectandi</em> he considered the distribution and the result of a number of 'trials' under the Bernoulli distribution to form the <em>binomial</em> distribution. Below is the page where he considers Pascal's triangle in forming combinations of the Bernoulli distribution to realise the binomial distribution for the outcome of positive trials.</p>
<p><a href="https://play.google.com/books/reader?id=CF4UAAAAQAAJ&pg=PA87"><img src="../slides/diagrams/books/CF4UAAAAQAAJ-PA87.png" /></a></p>
<object class="svgplot" align data="../slides/diagrams/ml/bernoulli-urn.svg">
</object>
<p>Thomas Bayes also described the Bernoulli distribution, only he didn't refer to Jacob Bernoulli's work, so he didn't call it by that name. He described the distribution in terms of a table (think of a <em>billiard table</em>) and two balls. Bayes suggests that each ball can be rolled across the table such that it comes to rest at a position that is <em>uniformly distributed</em> between the sides of the table.</p>
<p>Let's assume that the first ball is rolled, and that it comes to rest at a position that is <span class="math inline">\(\pi\)</span> times the width of the table from the left hand side.</p>
<p>Now, we roll the second ball. We are interested if the second ball ends up on the left side (+ve result) or the right side (-ve result) of the first ball. We use the Bernoulli distribution to determine this.</p>
<p>For this reason in Bayes's distribution there is considered to be <em>aleatoric</em> uncertainty about the distribution parameter.</p>
<object class="svgplot" align data="../slides/diagrams/ml/bayes-billiard009.svg">
</object>
<h3 id="maximum-likelihood-in-the-bernoulli-distribution">Maximum Likelihood in the Bernoulli Distribution</h3>
<p>Maximum likelihood in the Bernoulli distribution is straightforward. Let's assume we have data, <span class="math inline">\(\dataVector\)</span> which consists of a vector of binary values of length <span class="math inline">\(\numData\)</span>. If we assume each value was sampled independently from the Bernoulli distribution, conditioned on the parameter <span class="math inline">\(\pi\)</span> then our joint probability density has the form <span class="math display">\[
p(\dataVector|\pi) = \prod_{i=1}^{\numData} \pi^{\dataScalar_i} (1-\pi)^{1-\dataScalar_i}.
\]</span> As normal in maximum likelihood we consider the negative log likelihood as our objective, <span class="math display">\[\begin{align*}
\errorFunction(\pi)& = -\log p(\dataVector|\pi)\\
& = -\sum_{i=1}^{\numData} \dataScalar_i \log \pi - \sum_{i=1}^{\numData} (1-\dataScalar_i) \log(1-\pi),
\end{align*}\]</span></p>
<p>and we can derive the gradient with respect to the parameter <span class="math inline">\(\pi\)</span>. <span class="math display">\[\frac{\text{d}\errorFunction(\pi)}{\text{d}\pi} = -\frac{\sum_{i=1}^{\numData} \dataScalar_i}{\pi} + \frac{\sum_{i=1}^{\numData} (1-\dataScalar_i)}{1-\pi},\]</span></p>
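<p>This derivative can be sanity-checked numerically with a finite-difference approximation; a sketch using made-up data:</p>

```python
import numpy as np

def neg_log_likelihood(pi, y):
    """Negative log likelihood of Bernoulli data y given parameter pi."""
    return -(y * np.log(pi) + (1 - y) * np.log(1 - pi)).sum()

def gradient(pi, y):
    """Derivative of the negative log likelihood with respect to pi."""
    return -y.sum() / pi + (1 - y).sum() / (1 - pi)

y = np.array([1.0, 0.0, 1.0, 1.0, 0.0])  # illustrative binary data
pi = 0.4
eps = 1e-6
# central finite difference should agree with the analytic gradient
numeric = (neg_log_likelihood(pi + eps, y) - neg_log_likelihood(pi - eps, y)) / (2 * eps)
```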
<p>and as normal we look for a stationary point for the log likelihood by setting this derivative to zero, <span class="math display">\[0 = -\frac{\sum_{i=1}^{\numData} \dataScalar_i}{\pi} + \frac{\sum_{i=1}^{\numData} (1-\dataScalar_i)}{1-\pi},\]</span> rearranging we form <span class="math display">\[(1-\pi)\sum_{i=1}^{\numData} \dataScalar_i = \pi\sum_{i=1}^{\numData} (1-\dataScalar_i),\]</span> which implies <span class="math display">\[\sum_{i=1}^{\numData} \dataScalar_i = \pi\left(\sum_{i=1}^{\numData} (1-\dataScalar_i) + \sum_{i=1}^{\numData} \dataScalar_i\right),\]</span></p>
<p>and now we recognise that <span class="math inline">\(\sum_{i=1}^{\numData} (1-\dataScalar_i) + \sum_{i=1}^{\numData} \dataScalar_i = \numData\)</span> so we have <span class="math display">\[\pi = \frac{\sum_{i=1}^{\numData} \dataScalar_i}{\numData}\]</span></p>
<p>so in other words we estimate the probability associated with the Bernoulli by setting it to the number of observed positives, divided by the total length of <span class="math inline">\(\dataVector\)</span>. This makes intuitive sense. If I asked you to estimate the probability of a coin being heads, and you tossed the coin 100 times, and recovered 47 heads, then the estimate of the probability of heads should be <span class="math inline">\(\frac{47}{100}\)</span>.</p>
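<p>The coin example can be checked directly in code; a minimal sketch:</p>

```python
import numpy as np

# 100 tosses with 47 heads, as in the coin example
y = np.concatenate([np.ones(47), np.zeros(53)])
pi_hat = y.sum() / len(y)  # maximum likelihood estimate of the Bernoulli parameter
```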
<h3 id="exercise">Exercise</h3>
<p>Show that the maximum likelihood solution we have found is a <em>minimum</em> for our objective.</p>
<h3 id="write-your-answer-to-exercise-here">Write your answer to Exercise here</h3>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="co"># Use this box for any code you need for the exercise</span>
</code></pre></div>
<p><span class="math display">\[
\text{posterior} =
\frac{\text{likelihood}\times\text{prior}}{\text{marginal likelihood}}
\]</span></p>
<p>Four components:</p>
<ol style="list-style-type: decimal">
<li>Prior distribution</li>
<li>Likelihood</li>
<li>Posterior distribution</li>
<li>Marginal likelihood</li>
</ol>
<h3 id="naive-bayes-classifiers">Naive Bayes Classifiers</h3>
<p>In probabilistic machine learning we place probability distributions (or densities) over all the variables of interest, our first classification algorithm will do just that. We will consider how to form a classification by making assumptions about the <em>joint</em> density of our observations. We need to make assumptions to reduce the number of parameters we need to optimise.</p>
<p>In the ideal world, given label data <span class="math inline">\(\dataVector\)</span> and the inputs <span class="math inline">\(\inputMatrix\)</span> we should be able to specify the joint density of all potential values of <span class="math inline">\(\dataVector\)</span> and <span class="math inline">\(\inputMatrix\)</span>, <span class="math inline">\(p(\dataVector, \inputMatrix)\)</span>. If <span class="math inline">\(\inputMatrix\)</span> and <span class="math inline">\(\dataVector\)</span> are our training data, and we can somehow extend our density to incorporate future test data (by augmenting <span class="math inline">\(\dataVector\)</span> with a new observation <span class="math inline">\(\dataScalar^*\)</span> and <span class="math inline">\(\inputMatrix\)</span> with the corresponding inputs, <span class="math inline">\(\inputVector^*\)</span>), then we can answer any given question about a future test point <span class="math inline">\(\dataScalar^*\)</span> given its covariates <span class="math inline">\(\inputVector^*\)</span> by conditioning on the training variables to recover, <span class="math display">\[
p(\dataScalar^*|\inputMatrix, \dataVector, \inputVector^*),
\]</span></p>
<p>We can compute this distribution using the product and sum rules. However, to specify this density we must give the probability associated with all possible combinations of <span class="math inline">\(\dataVector\)</span> and <span class="math inline">\(\inputMatrix\)</span>. There are <span class="math inline">\(2^{\numData}\)</span> possible combinations for the vector <span class="math inline">\(\dataVector\)</span> and the probability for each of these combinations must be jointly specified along with the joint density of the matrix <span class="math inline">\(\inputMatrix\)</span>, as well as being able to <em>extend</em> the density for any chosen test location <span class="math inline">\(\inputVector^*\)</span>.</p>
<p>In naive Bayes we make certain simplifying assumptions that allow us to perform all of the above in practice.</p>
<h3 id="data-conditional-independence">Data Conditional Independence</h3>
<p>If we are given model parameters <span class="math inline">\(\paramVector\)</span> we assume that conditioned on all these parameters that all data points in the model are independent. In other words we have, <span class="math display">\[
p(\dataScalar^*, \inputVector^*, \dataVector, \inputMatrix|\paramVector) = p(\dataScalar^*, \inputVector^*|\paramVector)\prod_{i=1}^{\numData} p(\dataScalar_i, \inputVector_i | \paramVector).
\]</span> This is a conditional independence assumption because we are not assuming our data are purely independent. If we were to assume that, then there would be nothing to learn about our test data given our training data. We are assuming that they are independent <em>given</em> our parameters, <span class="math inline">\(\paramVector\)</span>. We made similar assumptions for regression, where our parameter set included <span class="math inline">\(\mappingVector\)</span> and <span class="math inline">\(\dataStd^2\)</span>. Given those parameters we assumed that the density over <span class="math inline">\(\dataVector, \dataScalar^*\)</span> was <em>independent</em>. Here we are going a little further with that assumption because we are assuming the <em>joint</em> density of <span class="math inline">\(\dataVector\)</span> and <span class="math inline">\(\inputMatrix\)</span> is independent across the data given the parameters.</p>
<p>Computing the posterior distribution in this case becomes easier; this is known as the 'Bayes classifier'.</p>
<h3 id="feature-conditional-independence">Feature Conditional Independence</h3>
<p><span class="math display">\[
p(\inputVector_i | \dataScalar_i, \paramVector) = \prod_{j=1}^{\dataDim} p(\inputScalar_{i,j}|\dataScalar_i, \paramVector)
\]</span> where <span class="math inline">\(\dataDim\)</span> is the dimensionality of our inputs.</p>
<p>The assumption that is particular to naive Bayes is to now consider that the <em>features</em> are also conditionally independent, but not only given the parameters. We assume that the features are independent given the parameters <em>and</em> the label. So for each data point we have <span class="math display">\[p(\inputVector_i | \dataScalar_i, \paramVector) = \prod_{j=1}^{\dataDim} p(\inputScalar_{i,j}|\dataScalar_i,\paramVector)\]</span> where <span class="math inline">\(\dataDim\)</span> is the dimensionality of our inputs.</p>
<h3 id="marginal-density-for-datascalar_i">Marginal Density for <span class="math inline">\(\dataScalar_i\)</span></h3>
<p><span class="math display">\[
p(\inputScalar_{i,j},\dataScalar_i| \paramVector) = p(\inputScalar_{i,j}|\dataScalar_i, \paramVector)p(\dataScalar_i).
\]</span></p>
<p>We now have nearly all of the components we need to specify the full joint density. However, the feature conditional independence doesn't yet give us the joint density <span class="math inline">\(p(\dataScalar_i, \inputVector_i)\)</span> which is required to substitute in to our data conditional independence to give us the full density. To recover the joint density given the conditional distribution of each feature, <span class="math inline">\(p(\inputScalar_{i,j}|\dataScalar_i, \paramVector)\)</span>, we need to make use of the product rule and combine it with a marginal density for <span class="math inline">\(\dataScalar_i\)</span>,</p>
<p><span class="math display">\[p(\inputScalar_{i,j},\dataScalar_i| \paramVector) = p(\inputScalar_{i,j}|\dataScalar_i, \paramVector)p(\dataScalar_i).\]</span> Because <span class="math inline">\(\dataScalar_i\)</span> is binary the <em>Bernoulli</em> density makes a suitable choice for our prior over <span class="math inline">\(\dataScalar_i\)</span>, <span class="math display">\[p(\dataScalar_i|\pi) = \pi^{\dataScalar_i} (1-\pi)^{1-\dataScalar_i}\]</span> where <span class="math inline">\(\pi\)</span> now has the interpretation as being the <em>prior</em> probability that the classification should be positive.</p>
<h3 id="joint-density-for-naive-bayes">Joint Density for Naive Bayes</h3>
<p>This allows us to write down the full joint density of the training data, <span class="math display">\[
p(\dataVector, \inputMatrix|\paramVector, \pi) = \prod_{i=1}^{\numData} \prod_{j=1}^{\dataDim} p(\inputScalar_{i,j}|\dataScalar_i, \paramVector)p(\dataScalar_i|\pi)
\]</span></p>
<p>which can now be fit by maximum likelihood. As normal we form our objective as the negative log likelihood,</p>
<p><span class="math display">\[\begin{align*}
\errorFunction(\paramVector, \pi)& = -\log p(\dataVector, \inputMatrix|\paramVector, \pi) \\ &= -\sum_{i=1}^{\numData} \sum_{j=1}^{\dataDim} \log p(\inputScalar_{i, j}|\dataScalar_i, \paramVector) - \sum_{i=1}^{\numData} \log p(\dataScalar_i|\pi),
\end{align*}\]</span> which we note <em>decomposes</em> into two objective functions, one which is dependent on <span class="math inline">\(\pi\)</span> alone and one which is dependent on <span class="math inline">\(\paramVector\)</span> alone so we have, <span class="math display">\[
\errorFunction(\pi, \paramVector) = \errorFunction(\paramVector) + \errorFunction(\pi).
\]</span> Since the two objective functions are separately dependent on the parameters <span class="math inline">\(\pi\)</span> and <span class="math inline">\(\paramVector\)</span> we can minimize them independently. Firstly, minimizing the Bernoulli likelihood over the labels we have, <span class="math display">\[
\errorFunction(\pi) = -\sum_{i=1}^{\numData}\log p(\dataScalar_i|\pi) = -\sum_{i=1}^{\numData} \dataScalar_i \log \pi - \sum_{i=1}^{\numData} (1-\dataScalar_i) \log (1-\pi)
\]</span> which we already minimized above recovering <span class="math display">\[
\pi = \frac{\sum_{i=1}^{\numData} \dataScalar_i}{\numData}.
\]</span></p>
<p>We now need to minimize the objective associated with the conditional distributions for the features, <span class="math display">\[
\errorFunction(\paramVector) = -\sum_{i=1}^{\numData} \sum_{j=1}^{\dataDim} \log p(\inputScalar_{i, j} |\dataScalar_i, \paramVector),
\]</span> which necessarily implies making some assumptions about the form of the conditional distributions. The right assumption will depend on the nature of our input data. For example, if we have an input which is real valued, we could use a Gaussian density and we could allow the mean and variance of the Gaussian to be different according to whether the class was positive or negative and according to which feature we were measuring. That would give us the form, <span class="math display">\[
p(\inputScalar_{i, j} | \dataScalar_i,\paramVector) = \frac{1}{\sqrt{2\pi \dataStd_{\dataScalar_i,j}^2}} \exp \left(-\frac{(\inputScalar_{i,j} - \mu_{\dataScalar_i, j})^2}{2\dataStd_{\dataScalar_i,j}^2}\right),
\]</span> where <span class="math inline">\(\dataStd_{1, j}^2\)</span> is the variance of the density for the <span class="math inline">\(j\)</span>th output and the class <span class="math inline">\(\dataScalar_i=1\)</span> and <span class="math inline">\(\dataStd_{0, j}^2\)</span> is the variance if the class is 0. The means can vary similarly. Our parameters, <span class="math inline">\(\paramVector\)</span> would consist of all the means and all the variances for the different dimensions.</p>
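<p>The maximum likelihood fit of these class conditional Gaussians reduces to per-class, per-feature empirical means and variances; a minimal sketch with hypothetical data:</p>

```python
import numpy as np

def fit_class_conditional_gaussians(X, y):
    """Maximum likelihood Gaussian fit per feature and per class.

    X: (n, d) array of real-valued features; y: (n,) array of binary labels.
    Returns a dict mapping class label to per-feature means and variances."""
    params = {}
    for label in (0, 1):
        X_c = X[y == label]
        params[label] = {
            'mu': X_c.mean(axis=0),      # empirical mean per feature
            'sigma2': X_c.var(axis=0),   # empirical (maximum likelihood) variance
        }
    return params

# hypothetical data: two features, two points per class
X = np.array([[0.0, 1.0], [2.0, 3.0], [10.0, 11.0], [12.0, 13.0]])
y = np.array([0, 0, 1, 1])
params = fit_class_conditional_gaussians(X, y)
```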
<h3 id="movie-body-count-data">Movie Body Count Data</h3>
<p>First we will load in the movie body count data. Our aim will be to predict whether a movie is rated R or not given the attributes in the data. We will predict on the basis of year, body count and movie genre. The genres in the CSV file are stored as a list in the following form:</p>
<pre><code>Biography|Action|Sci-Fi</code></pre>
<p>First we have to do a little work to extract this form and turn it into a vector of binary values. Let's first load in and remind ourselves of the data.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">import</span> pods</code></pre></div>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">data <span class="op">=</span> pods.datasets.movie_body_count()[<span class="st">'Y'</span>]
data.head()</code></pre></div>
<p>Now we will convert this data into a form which we can use as inputs <code>X</code>, and labels <code>y</code>.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">import</span> pandas <span class="im">as</span> pd
<span class="im">import</span> numpy <span class="im">as</span> np</code></pre></div>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">X <span class="op">=</span> data[[<span class="st">'Year'</span>, <span class="st">'Body_Count'</span>]].copy()
y <span class="op">=</span> data[<span class="st">'MPAA_Rating'</span>]<span class="op">==</span><span class="st">'R'</span> <span class="co"># set label to be positive for R rated films.</span>
<span class="co"># Split the pipe-separated genre strings and stack into a series of single genres</span>
s <span class="op">=</span> data[<span class="st">'Genre'</span>].str.split(<span class="st">'|'</span>).<span class="bu">apply</span>(pd.Series).stack()
s.index <span class="op">=</span> s.index.droplevel(<span class="op">-</span><span class="dv">1</span>) <span class="co"># to line up with df's index</span>
<span class="co"># Extract from the series the unique list of genres.</span>
genres <span class="op">=</span> s.unique()
<span class="co"># For each genre extract the indices where it is present and add a column to X</span>
<span class="cf">for</span> genre <span class="kw">in</span> genres:
    index <span class="op">=</span> s[s<span class="op">==</span>genre].index.tolist()
    X.loc[:, genre] <span class="op">=</span> <span class="fl">0.0</span>
    X.loc[index, genre] <span class="op">=</span> <span class="fl">1.0</span></code></pre></div>
<p>This has given us a new data frame <code>X</code> which contains the different genres in different columns.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">X.describe()</code></pre></div>
<p>We can now specify the naive Bayes model. For the genres we want to model the data as Bernoulli distributed, and for the year and body count we want to model the data as Gaussian distributed. Below we set up two data frames to contain the parameters, with a column for each feature and a row for each parameter.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="co"># assume each feature is either binary or real.</span>
<span class="co"># binary features get Bernoulli parameters, real features get Gaussian parameters</span>
binary_columns <span class="op">=</span> genres
real_columns <span class="op">=</span> [<span class="st">'Year'</span>, <span class="st">'Body_Count'</span>]
Bernoulli <span class="op">=</span> pd.DataFrame(data<span class="op">=</span>np.zeros((<span class="dv">2</span>,<span class="bu">len</span>(binary_columns))), columns<span class="op">=</span>binary_columns, index<span class="op">=</span>[<span class="st">'theta_0'</span>, <span class="st">'theta_1'</span>])
Gaussian <span class="op">=</span> pd.DataFrame(data<span class="op">=</span>np.zeros((<span class="dv">4</span>,<span class="bu">len</span>(real_columns))), columns<span class="op">=</span>real_columns, index<span class="op">=</span>[<span class="st">'mu_0'</span>, <span class="st">'sigma2_0'</span>, <span class="st">'mu_1'</span>, <span class="st">'sigma2_1'</span>])</code></pre></div>
<p>Now that we have the data in a form ready for analysis, let's separate it into training and test sets.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">num_train <span class="op">=</span> <span class="dv">200</span>
indices <span class="op">=</span> np.random.permutation(X.shape[<span class="dv">0</span>])
train_indices <span class="op">=</span> indices[:num_train]
test_indices <span class="op">=</span> indices[num_train:]
X_train <span class="op">=</span> X.loc[train_indices]
y_train <span class="op">=</span> y.loc[train_indices]
X_test <span class="op">=</span> X.loc[test_indices]
y_test <span class="op">=</span> y.loc[test_indices]</code></pre></div>
<p>And we can now train the model. For each feature we can make the fit independently. For binary data the fit is given by counting the number of positives, which gives us the maximum likelihood solution for the Bernoulli. For real-valued data it is given by computing the empirical mean and variance of the data, which gives us the maximum likelihood solution for the Gaussian.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="cf">for</span> column <span class="kw">in</span> X_train:
    <span class="cf">if</span> column <span class="kw">in</span> Gaussian:
        Gaussian.loc[<span class="st">'mu_0'</span>, column] <span class="op">=</span> X_train[column][<span class="op">~</span>y_train].mean()
        Gaussian.loc[<span class="st">'mu_1'</span>, column] <span class="op">=</span> X_train[column][y_train].mean()
        Gaussian.loc[<span class="st">'sigma2_0'</span>, column] <span class="op">=</span> X_train[column][<span class="op">~</span>y_train].var(ddof<span class="op">=</span><span class="dv">0</span>)
        Gaussian.loc[<span class="st">'sigma2_1'</span>, column] <span class="op">=</span> X_train[column][y_train].var(ddof<span class="op">=</span><span class="dv">0</span>)
    <span class="cf">if</span> column <span class="kw">in</span> Bernoulli:
        Bernoulli.loc[<span class="st">'theta_0'</span>, column] <span class="op">=</span> X_train[column][<span class="op">~</span>y_train].<span class="bu">sum</span>()<span class="op">/</span>(<span class="op">~</span>y_train).<span class="bu">sum</span>()
        Bernoulli.loc[<span class="st">'theta_1'</span>, column] <span class="op">=</span> X_train[column][y_train].<span class="bu">sum</span>()<span class="op">/</span>y_train.<span class="bu">sum</span>()</code></pre></div>
<p>We can examine the nature of the distributions we've fitted to the model by looking at the entries in these data frames.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">Bernoulli</code></pre></div>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">Gaussian</code></pre></div>
<p>The final model parameter is the prior probability of the positive class, <span class="math inline">\(\pi\)</span>, which is computed by maximum likelihood.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">prior <span class="op">=</span> <span class="bu">float</span>(y_train.<span class="bu">sum</span>())<span class="op">/</span><span class="bu">len</span>(y_train)</code></pre></div>
<h3 id="making-predictions">Making Predictions</h3>
<p>Naive Bayes has given us the class conditional densities: <span class="math inline">\(p(\inputVector_i | \dataScalar_i, \paramVector)\)</span>. To make predictions with these densities we need to form the distribution given by <span class="math display">\[
P(\dataScalar^*| \dataVector, \inputMatrix, \inputVector^*, \paramVector)
\]</span> This can be computed by using the product rule. We know that <span class="math display">\[
P(\dataScalar^*| \dataVector, \inputMatrix, \inputVector^*, \paramVector)p(\dataVector, \inputMatrix, \inputVector^*|\paramVector) = p(\dataScalar^*, \dataVector, \inputMatrix, \inputVector^*| \paramVector)
\]</span> implying that <span class="math display">\[
P(\dataScalar^*| \dataVector, \inputMatrix, \inputVector^*, \paramVector) = \frac{p(\dataScalar^*, \dataVector, \inputMatrix, \inputVector^*| \paramVector)}{p(\dataVector, \inputMatrix, \inputVector^*|\paramVector)}
\]</span> and we've already defined <span class="math inline">\(p(\dataScalar^*, \dataVector, \inputMatrix, \inputVector^*| \paramVector)\)</span> using our conditional independence assumptions above <span class="math display">\[
p(\dataScalar^*, \dataVector, \inputMatrix, \inputVector^*| \paramVector) = \prod_{j=1}^{\dataDim} p(\inputScalar^*_{j}|\dataScalar^*, \paramVector)p(\dataScalar^*|\pi)\prod_{i=1}^{\numData} \prod_{j=1}^{\dataDim} p(\inputScalar_{i,j}|\dataScalar_i, \paramVector)p(\dataScalar_i|\pi)
\]</span> The other required density is <span class="math display">\[
p(\dataVector, \inputMatrix, \inputVector^*|\paramVector)
\]</span> which can be found from <span class="math display">\[p(\dataScalar^*, \dataVector, \inputMatrix, \inputVector^*| \paramVector)\]</span> using the <em>sum rule</em> of probability, <span class="math display">\[
p(\dataVector, \inputMatrix, \inputVector^*|\paramVector) = \sum_{\dataScalar^*=0}^1 p(\dataScalar^*, \dataVector, \inputMatrix, \inputVector^*| \paramVector).
\]</span> Because of our independence assumptions that is simply equal to <span class="math display">\[
p(\dataVector, \inputMatrix, \inputVector^*| \paramVector) = \sum_{\dataScalar^*=0}^1 \prod_{j=1}^{\dataDim} p(\inputScalar^*_{j}|\dataScalar^*, \paramVector)p(\dataScalar^*|\pi)\prod_{i=1}^{\numData} \prod_{j=1}^{\dataDim} p(\inputScalar_{i,j}|\dataScalar_i, \paramVector)p(\dataScalar_i|\pi).
\]</span> Substituting both forms in to recover our distribution over the test label conditioned on the training data we have, <span class="math display">\[
P(\dataScalar^*| \dataVector, \inputMatrix, \inputVector^*, \paramVector) = \frac{\prod_{j=1}^{\dataDim} p(\inputScalar^*_{j}|\dataScalar^*, \paramVector)p(\dataScalar^*|\pi)\prod_{i=1}^{\numData} \prod_{j=1}^{\dataDim} p(\inputScalar_{i,j}|\dataScalar_i, \paramVector)p(\dataScalar_i|\pi)}{\sum_{\dataScalar^*=0}^1 \prod_{j=1}^{\dataDim} p(\inputScalar^*_{j}|\dataScalar^*, \paramVector)p(\dataScalar^*|\pi)\prod_{i=1}^{\numData} \prod_{j=1}^{\dataDim} p(\inputScalar_{i,j}|\dataScalar_i, \paramVector)p(\dataScalar_i|\pi)}
\]</span> and we notice that all the terms associated with the training data cancel: the test prediction is <em>conditionally independent</em> of the training data <em>given</em> the parameters. This is a result of our conditional independence assumptions over the data points. <span class="math display">\[
p(\dataScalar^*| \inputVector^*, \paramVector) = \frac{\prod_{j=1}^{\dataDim} p(\inputScalar^*_{j}|\dataScalar^*,
\paramVector)p(\dataScalar^*|\pi)}{\sum_{\dataScalar^*=0}^1 \prod_{j=1}^{\dataDim} p(\inputScalar^*_{j}|\dataScalar^*, \paramVector)p(\dataScalar^*|\pi)}
\]</span> This formula is also fairly straightforward to implement. First we implement the log probabilities for the Gaussian density.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="kw">def</span> log_gaussian(x, mu, sigma2):
<span class="cf">return</span> <span class="op">-</span><span class="fl">0.5</span><span class="op">*</span> np.log(<span class="dv">2</span><span class="op">*</span>np.pi<span class="op">*</span>sigma2)<span class="op">-</span>((x<span class="op">-</span>mu)<span class="op">**</span><span class="dv">2</span>)<span class="op">/</span>(<span class="dv">2</span><span class="op">*</span>sigma2)</code></pre></div>
<p>Now for any test point we compute the joint distribution of the Gaussian features by <em>summing</em> their log probabilities. Working in log space can be a considerable advantage over computing the probabilities directly: as the number of features we include goes up, because all the probabilities are less than 1, the joint probability will become smaller and smaller, and may be difficult to represent accurately (or even underflow). Working in log space can ameliorate this problem. We can also compute the log probability for the Bernoulli distribution.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="kw">def</span> log_bernoulli(x, theta):
<span class="cf">return</span> x<span class="op">*</span>np.log(theta) <span class="op">+</span> (<span class="dv">1</span><span class="op">-</span>x)<span class="op">*</span>np.log(<span class="dv">1</span><span class="op">-</span>theta)</code></pre></div>
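<p>The underflow problem is easy to demonstrate directly. The following standalone sketch (using hypothetical feature probabilities rather than the film data) shows the direct product vanishing while the log-space sum stays well behaved.</p>

```python
import numpy as np

# 500 hypothetical features, each with probability 0.01: the product is
# 1e-1000, far below the smallest positive float64 (about 1e-308).
probs = np.full(500, 0.01)

direct_product = np.prod(probs)   # underflows to exactly 0.0
log_sum = np.sum(np.log(probs))   # 500 * log(0.01), about -2302.6: no problem

print(direct_product, log_sum)
```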
<h3 id="laplace-smoothing">Laplace Smoothing</h3>
<p>Before we proceed, let's just pause and think for a moment what will happen if <code>theta</code> here is either zero or one. This will result in <span class="math inline">\(\log 0 = -\infty\)</span> and cause numerical problems. This definitely can happen in practice. If some of the features are rare or very common across the data set then the maximum likelihood solution could find values of zero or one respectively. Such values are problematic because they cause posterior probabilities of class membership of either one or zero. In practice we deal with this using <em>Laplace smoothing</em> (which actually has an interpretation as a Bayesian fit of the Bernoulli distribution). Laplace used an example of the sun rising each day, and a wish to predict the sun rise the following day, to describe his idea of smoothing, which can be found at the bottom of the following page from Laplace's 'Essai Philosophique ...'</p>
<p><a href="https://play.google.com/books/reader?id=1YQPAAAAQAAJ&pg=PA16"><img src="../slides/diagrams/books/1YQPAAAAQAAJ-PA16.png" /></a></p>
<p>Laplace suggests that when computing the probability of an event where a success or failure is rare (he uses an example of the sun rising across the last 5,000 years or 1,826,213 days) that even though only successes have been observed (in the sun rising case) that the odds for tomorrow shouldn't be given as <span class="math display">\[
\frac{1,826,213}{1,826,213} = 1
\]</span> but rather by adding one to the numerator and two to the denominator, <span class="math display">\[
\frac{1,826,213 + 1}{1,826,213 + 2} = 0.99999945.
\]</span> This technique is sometimes called a 'pseudocount technique' because it has an intepretation of assuming some observations before you start, it's as if instead of observing <span class="math inline">\(\sum_{i}\dataScalar_i\)</span> successes you have an additional success, <span class="math inline">\(\sum_{i}\dataScalar_i + 1\)</span> and instead of having observed <span class="math inline">\(n\)</span> events you've observed <span class="math inline">\(\numData + 2\)</span>. So we can think of Laplace's idea saying (before we start) that we have 'two observations worth of belief, that the odds are 50/50', because before we start (i.e. when <span class="math inline">\(\numData=0\)</span>) our estimate is 0.5, yet because the effective <span class="math inline">\(n\)</span> is only 2, this estimate is quickly overwhelmed by data. Laplace used ideas like this a lot, and it is known as his 'principle of insufficient reason'. His idea was that in the absence of knowledge (i.e. before we start) we should assume that all possible outcomes are equally likely. This idea has a modern counterpart, known as the <a href="http://en.wikipedia.org/wiki/Principle_of_maximum_entropy">principle of maximum entropy</a>. A lot of the theory of this approach was developed by <a href="http://en.wikipedia.org/wiki/Edwin_Thompson_Jaynes">Ed Jaynes</a>, who according to his erstwhile collaborator and friend, John Skilling, learnt French as an undergraduate by reading the works of Laplace. Although John also related that Jaynes's spoken French was not up to the standard of his scientific French. For me Ed Jaynes's work very much carries on the tradition of Laplace into the modern era, in particular his focus on Bayesian approaches. I'm very proud to have met those that knew and worked with him. 
It turns out that Laplace's idea also has a Bayesian interpretation (as Laplace understood): it comes from assuming a particular prior density for the parameter <span class="math inline">\(\pi\)</span>, but we won't explore that interpretation for the moment, and merely choose to estimate the probability as, <span class="math display">\[
\pi = \frac{\sum_{i=1}^{\numData} \dataScalar_i + 1}{\numData + 2}
\]</span> to prevent problems with certainty causing numerical issues and misclassifications. Let's refit the Bernoulli features now.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="co"># fit the Bernoulli with Laplace smoothing.</span>
<span class="cf">for</span> column <span class="kw">in</span> X_train:
<span class="cf">if</span> column <span class="kw">in</span> Bernoulli:
Bernoulli[column][<span class="st">'theta_0'</span>] <span class="op">=</span> (X_train[column][<span class="op">~</span>y_train].<span class="bu">sum</span>() <span class="op">+</span> <span class="dv">1</span>)<span class="op">/</span>((<span class="op">~</span>y_train).<span class="bu">sum</span>() <span class="op">+</span> <span class="dv">2</span>)
Bernoulli[column][<span class="st">'theta_1'</span>] <span class="op">=</span> (X_train[column][y_train].<span class="bu">sum</span>() <span class="op">+</span> <span class="dv">1</span>)<span class="op">/</span>((y_train).<span class="bu">sum</span>() <span class="op">+</span> <span class="dv">2</span>)</code></pre></div>
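<p>As a standalone illustration of the pseudocount idea (the helper name and the zero-count example below are hypothetical, not part of the fitted model), we can check Laplace's sunrise number and see that a never-observed feature no longer produces a zero probability.</p>

```python
def laplace_estimate(successes, n):
    """Laplace-smoothed probability estimate: add one success and two trials."""
    return (successes + 1.0) / (n + 2.0)

# Laplace's sunrise example: 1,826,213 sunrises in 1,826,213 days.
print(laplace_estimate(1826213, 1826213))  # 0.99999945..., never exactly 1

# A feature never seen 'on' in a class no longer gives a zero probability.
print(laplace_estimate(0, 10))             # 1/12, rather than 0

# Before any data at all, the estimate is the 50/50 prior belief.
print(laplace_estimate(0, 0))              # 0.5
```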
<p>That places us in a position to write the prediction function.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="kw">def</span> predict(X_test, Gaussian, Bernoulli, prior):
log_positive <span class="op">=</span> pd.Series(data <span class="op">=</span> np.zeros(X_test.shape[<span class="dv">0</span>]), index<span class="op">=</span>X_test.index)
log_negative <span class="op">=</span> pd.Series(data <span class="op">=</span> np.zeros(X_test.shape[<span class="dv">0</span>]), index<span class="op">=</span>X_test.index)
<span class="cf">for</span> column <span class="kw">in</span> X_test.columns:
<span class="cf">if</span> column <span class="kw">in</span> Gaussian:
log_positive <span class="op">+=</span> log_gaussian(X_test[column], Gaussian[column][<span class="st">'mu_1'</span>], Gaussian[column][<span class="st">'sigma2_1'</span>])
log_negative <span class="op">+=</span> log_gaussian(X_test[column], Gaussian[column][<span class="st">'mu_0'</span>], Gaussian[column][<span class="st">'sigma2_0'</span>])
<span class="cf">elif</span> column <span class="kw">in</span> Bernoulli:
log_positive <span class="op">+=</span> log_bernoulli(X_test[column], Bernoulli[column][<span class="st">'theta_1'</span>])
log_negative <span class="op">+=</span> log_bernoulli(X_test[column], Bernoulli[column][<span class="st">'theta_0'</span>])
<span class="cf">return</span> np.exp(log_positive <span class="op">+</span> np.log(prior))<span class="op">/</span>(np.exp(log_positive <span class="op">+</span> np.log(prior)) <span class="op">+</span> np.exp(log_negative <span class="op">+</span> np.log(<span class="dv">1</span><span class="op">-</span>prior)))</code></pre></div>
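<p>One caveat: although the per-feature sums are accumulated in log space, the final line exponentiates the raw log scores, which can itself underflow when there are many features. A common remedy, sketched below as a hypothetical helper (not part of the notes' code), is to subtract the larger of the two log scores before exponentiating; the shift cancels in the ratio, and at least one of the exponentials is then <code>exp(0) = 1</code>.</p>

```python
import numpy as np

def posterior_from_log_scores(log_positive, log_negative, prior):
    """Numerically stable version of the final ratio in predict():
    shift both log scores by their elementwise maximum before exponentiating,
    so neither exponential can underflow to zero."""
    log_pos = log_positive + np.log(prior)
    log_neg = log_negative + np.log(1 - prior)
    shift = np.maximum(log_pos, log_neg)
    pos = np.exp(log_pos - shift)
    neg = np.exp(log_neg - shift)
    return pos / (pos + neg)

# With scores this negative the unshifted ratio would be 0/0 (NaN);
# the shifted computation returns the correct posterior.
print(posterior_from_log_scores(np.array([-1000.0]), np.array([-1001.0]), 0.5))
```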
<p>Now we are in a position to make the predictions for the test data.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">p_y <span class="op">=</span> predict(X_test, Gaussian, Bernoulli, prior)</code></pre></div>
<p>We can test the quality of the predictions in the following way. Firstly, we can threshold our probabilities at 0.5, allocating points with greater than 50% probability of membership of the positive class to the positive class. We can then compare to the true values, and see how many of these values we got correct. This is our total number correct.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">correct <span class="op">=</span> y_test.eq(p_y<span class="op">></span><span class="fl">0.5</span>)
total_correct <span class="op">=</span> <span class="bu">sum</span>(correct)
<span class="bu">print</span>(<span class="st">"Total correct"</span>, total_correct, <span class="st">" out of "</span>, <span class="bu">len</span>(y_test), <span class="st">"which is"</span>, <span class="fl">100.</span><span class="op">*</span><span class="bu">float</span>(total_correct)<span class="op">/</span><span class="bu">len</span>(y_test), <span class="st">"%"</span>)</code></pre></div>
<p>We can also now plot the <a href="http://en.wikipedia.org/wiki/Confusion_matrix">confusion matrix</a>. A confusion matrix tells us where we are making mistakes. Along the diagonal it stores the <em>true positives</em>, the points that were positive class that we classified correctly, and the <em>true negatives</em>, the points that were negative class and that we classified correctly. The off diagonal terms contain the false positives and the false negatives. Along the rows of the matrix we place the actual class, and along the columns we place our predicted class.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">confusion_matrix <span class="op">=</span> pd.DataFrame(data<span class="op">=</span>np.zeros((<span class="dv">2</span>,<span class="dv">2</span>)),
columns<span class="op">=</span>[<span class="st">'predicted not R-rated'</span>, <span class="st">'predicted R-rated'</span>],
index <span class="op">=</span>[<span class="st">'actual not R-rated'</span>,<span class="st">'actual R-rated'</span>])
confusion_matrix.loc[<span class="st">'actual R-rated'</span>, <span class="st">'predicted R-rated'</span>] <span class="op">=</span> (y_test <span class="op">&</span> (p_y<span class="op">></span><span class="fl">0.5</span>)).<span class="bu">sum</span>()
confusion_matrix.loc[<span class="st">'actual not R-rated'</span>, <span class="st">'predicted R-rated'</span>] <span class="op">=</span> (<span class="op">~</span>y_test <span class="op">&</span> (p_y<span class="op">></span><span class="fl">0.5</span>)).<span class="bu">sum</span>()
confusion_matrix.loc[<span class="st">'actual R-rated'</span>, <span class="st">'predicted not R-rated'</span>] <span class="op">=</span> (y_test <span class="op">&</span> <span class="op">~</span>(p_y<span class="op">></span><span class="fl">0.5</span>)).<span class="bu">sum</span>()
confusion_matrix.loc[<span class="st">'actual not R-rated'</span>, <span class="st">'predicted not R-rated'</span>] <span class="op">=</span> (<span class="op">~</span>y_test <span class="op">&</span> <span class="op">~</span>(p_y<span class="op">></span><span class="fl">0.5</span>)).<span class="bu">sum</span>()
confusion_matrix</code></pre></div>
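<p>From the four entries of the confusion matrix we can read off standard summary scores. A standalone sketch with hypothetical counts (the values below are illustrative, not computed from the film data):</p>

```python
# Hypothetical counts standing in for the four confusion matrix entries:
# true positives, false positives, false negatives, true negatives.
TP, FP, FN, TN = 40, 10, 5, 45

accuracy = (TP + TN) / (TP + TN + FP + FN)  # fraction of all points correct
precision = TP / (TP + FP)                  # how often a positive call is right
recall = TP / (TP + FN)                     # how many true positives we catch

print(accuracy, precision, recall)
```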
<h3 id="exercise-1">Exercise</h3>
<p>How can you improve your classification? Are all the features equally valid? Are some features more helpful than others? What happens if you remove features that appear to be less helpful? How might you select such features?</p>
<h3 id="write-your-answer-to-exercise-here-1">Write your answer to Exercise here</h3>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="co"># Use this box for any code you need for the exercise</span>
</code></pre></div>
<h3 id="exercise-2">Exercise</h3>
<p>We have decided to classify positive if the probability of an R rating is greater than 0.5. This has led us to accidentally classify some films as 'safe for children' when they aren't in actuality. Imagine you wish to ensure that the film is safe for children. With your test set how low do you have to set the threshold to avoid all the false negatives (i.e. films where you said it wasn't R-rated, but in actuality it was)?</p>
<h3 id="write-your-answer-to-exercise-here-2">Write your answer to Exercise here</h3>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="co"># Use this box for any code you need for the exercise</span>
</code></pre></div>
<h3 id="exercise-3">Exercise</h3>
<p>Write down the negative log likelihood of the Gaussian density over a vector of variables <span class="math inline">\(\inputVector\)</span>. Assume independence between each variable. Minimize this objective to obtain the maximum likelihood solutions of the form, <span class="math display">\[
\mu = \frac{\sum_{i=1}^{\numData} \inputScalar_i}{\numData}
\]</span> <span class="math display">\[
\dataStd^2 = \frac{\sum_{i=1}^{\numData} (\inputScalar_i - \mu)^2}{\numData}
\]</span></p>
<h3 id="write-your-answer-to-exercise-here-3">Write your answer to Exercise here</h3>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="co"># Use this box for any code you need for the exercise</span>
</code></pre></div>
<p>If the input data was <em>binary</em> then we could also make use of the Bernoulli distribution for the features. For that case we would have the form, <span class="math display">\[
p(\inputScalar_{i, j} | \dataScalar_i,\paramVector) = \theta_{\dataScalar_i, j}^{\inputScalar_{i, j}}(1-\theta_{\dataScalar_i, j})^{(1-\inputScalar_{i,j})},
\]</span> where <span class="math inline">\(\theta_{1, j}\)</span> is the probability that the <span class="math inline">\(j\)</span>th feature is on if <span class="math inline">\(\dataScalar_i\)</span> is 1.</p>
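<p>The Bernoulli form above is a one-liner to evaluate, and its maximum likelihood estimate of <span class="math inline">\(\theta\)</span> is simply the fraction of times the feature is on within the class. A standalone sketch with hypothetical binary data (the helper name is illustrative):</p>

```python
import numpy as np

def bernoulli_prob(x, theta):
    """p(x | theta) = theta^x (1 - theta)^(1 - x) for binary x."""
    return theta**x * (1 - theta)**(1 - x)

x = np.array([1, 0, 1, 1, 0, 1])   # hypothetical binary feature values
theta_ml = x.mean()                # maximum likelihood estimate of theta

print(theta_ml)                    # 4/6
print(bernoulli_prob(1, theta_ml)) # equals theta_ml
print(bernoulli_prob(0, theta_ml)) # equals 1 - theta_ml
```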
<p>In either case, maximum likelihood fitting would proceed in the same way. The objective has the form, <span class="math display">\[
\errorFunction(\paramVector) = -\sum_{j=1}^{\dataDim} \sum_{i=1}^{\numData} \log p(\inputScalar_{i,j} |\dataScalar_i, \paramVector),
\]</span> and if, as above, the parameters of the distributions are specific to each feature vector (we had means and variances for each continuous feature, and a probability for each binary feature) then we can use the fact that these parameters separate into disjoint subsets across the features to write, <span class="math display">\[
\begin{align*}
\errorFunction(\paramVector) &= -\sum_{j=1}^{\dataDim} \sum_{i=1}^{\numData} \log
p(\inputScalar_{i,j} |\dataScalar_i, \paramVector_j)\\
&= \sum_{j=1}^{\dataDim}
\errorFunction(\paramVector_j),
\end{align*}
\]</span> which means we can minimize our objective on each feature independently.</p>
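<p>We can check this decomposition numerically. The sketch below uses a hypothetical random design matrix (not the film data): the joint Gaussian negative log likelihood evaluated over all features at once matches the sum of the per-feature objectives.</p>

```python
import numpy as np

# Hypothetical data: 20 points, 3 continuous features.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))

def gaussian_nll(x, mu, sigma2):
    """Negative log likelihood of a single feature under a Gaussian."""
    return np.sum(0.5*np.log(2*np.pi*sigma2) + (x - mu)**2/(2*sigma2))

# Fit each feature independently: mean and (biased) variance per column.
per_feature = [gaussian_nll(X[:, j], X[:, j].mean(), X[:, j].var())
               for j in range(X.shape[1])]

# The joint objective, evaluated over the whole matrix at once...
mu = X.mean(axis=0)
sigma2 = X.var(axis=0)
total_joint = np.sum(0.5*np.log(2*np.pi*sigma2) + (X - mu)**2/(2*sigma2))

# ...equals the sum of the per-feature objectives.
print(total_joint, sum(per_feature))
```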
<p>These characteristics mean that naive Bayes scales very well with big data. To fit the model we consider each feature in turn: we select the positive class and fit parameters for that class, then we select the negative class and fit parameters for that class. We have code below.</p>
<h3 id="naive-bayes-summary">Naive Bayes Summary</h3>
<p>Naive Bayes is making very simple assumptions about the data. In particular it models the full <em>joint</em> probability of the data set, <span class="math inline">\(p(\dataVector, \inputMatrix | \paramVector, \pi)\)</span>, by making very strong factorization assumptions that are unlikely to be true in practice. The data conditional independence assumption is common, and relies on a rich parameter vector to absorb all the information in the training data. The additional assumption of naive Bayes is that features are conditionally independent given the class label <span class="math inline">\(\dataScalar_i\)</span> (and the parameter vector, <span class="math inline">\(\paramVector\)</span>). This is quite a strong assumption. However, it causes the objective function to decompose into parts which can be independently fitted to the different feature vectors, meaning it is very easy to fit the model to large data. It is also clear how we should handle <em>streaming</em> data and <em>missing</em> data. This means that the model can be run 'live', adapting parameters and information as it arrives. Indeed, the model is even capable of dealing with new <em>features</em> that might arrive at run time. Such is the strength of modeling the joint probability density. However, the factorization assumption that allows us to do this efficiently is very strong and may lead to poor decision boundaries in practice.</p>
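<p>The streaming-data point is easy to make concrete: because the Bernoulli parameters are just (smoothed) counts, each new observation is a constant-time update. A minimal standalone sketch (the class name is hypothetical, not part of the notes' code):</p>

```python
# Streaming update of a Laplace-smoothed Bernoulli parameter: the sufficient
# statistics are counts, so the model can be run 'live' on arriving data.
class StreamingBernoulli:
    def __init__(self):
        self.successes = 0
        self.n = 0

    def update(self, x):
        """Absorb one binary observation."""
        self.successes += x
        self.n += 1

    @property
    def theta(self):
        """Laplace-smoothed estimate; starts at 0.5 before any data."""
        return (self.successes + 1) / (self.n + 2)

model = StreamingBernoulli()
print(model.theta)           # 0.5 before any data arrives
for x in [1, 1, 0, 1]:
    model.update(x)
print(model.theta)           # (3 + 1) / (4 + 2) = 2/3
```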
<h3 id="other-reading">Other Reading</h3>
<ul>
<li>Chapter 5 of <span class="citation">Rogers and Girolami (2011)</span> up to pg 179 (Section 5.1, and 5.2 up to 5.2.2).</li>
</ul>
<h3 id="references" class="unnumbered">References</h3>
<div id="refs" class="references">
<div id="ref-Pearl:causality95">
<p>Pearl, J., 1995. From Bayesian networks to causal networks, in: Gammerman, A. (Ed.), Probabilistic Reasoning and Bayesian Belief Networks. Alfred Waller, pp. 1–31.</p>
</div>
<div id="ref-Rogers:book11">
<p>Rogers, S., Girolami, M., 2011. A first course in machine learning. CRC Press.</p>
</div>
<div id="ref-Steele:predictive12">
<p>Steele, S., Bilchik, A., Eberhardt, J., Kalina, P., Nissan, A., Johnson, E., Avital, I., Stojadinovic, A., 2012. Using machine-learned Bayesian belief networks to predict perioperative risk of clostridium difficile infection following colon surgery. Interact J Med Res 1, e6. <a href="https://doi.org/10.2196/ijmr.2131" class="uri">https://doi.org/10.2196/ijmr.2131</a></p>
</div>
</div>
Sat, 25 Aug 2018 00:00:00 +0000
http://inverseprobability.com/talks/notes/probabilistic-machine-learning.html
Bayesian Methods<div style="display:none">
$$\newcommand{\Amatrix}{\mathbf{A}}
\newcommand{\KL}[2]{\text{KL}\left( #1\,\|\,#2 \right)}
\newcommand{\Kaast}{\kernelMatrix_{\mathbf{ \ast}\mathbf{ \ast}}}
\newcommand{\Kastu}{\kernelMatrix_{\mathbf{ \ast} \inducingVector}}
\newcommand{\Kff}{\kernelMatrix_{\mappingFunctionVector \mappingFunctionVector}}
\newcommand{\Kfu}{\kernelMatrix_{\mappingFunctionVector \inducingVector}}
\newcommand{\Kuast}{\kernelMatrix_{\inducingVector \bf\ast}}
\newcommand{\Kuf}{\kernelMatrix_{\inducingVector \mappingFunctionVector}}
\newcommand{\Kuu}{\kernelMatrix_{\inducingVector \inducingVector}}
\newcommand{\Kuui}{\Kuu^{-1}}
\newcommand{\Qaast}{\mathbf{Q}_{\bf \ast \ast}}
\newcommand{\Qastf}{\mathbf{Q}_{\ast \mappingFunction}}
\newcommand{\Qfast}{\mathbf{Q}_{\mappingFunctionVector \bf \ast}}
\newcommand{\Qff}{\mathbf{Q}_{\mappingFunctionVector \mappingFunctionVector}}
\newcommand{\aMatrix}{\mathbf{A}}
\newcommand{\aScalar}{a}
\newcommand{\aVector}{\mathbf{a}}
\newcommand{\acceleration}{a}
\newcommand{\bMatrix}{\mathbf{B}}
\newcommand{\bScalar}{b}
\newcommand{\bVector}{\mathbf{b}}
\newcommand{\basisFunc}{\phi}
\newcommand{\basisFuncVector}{\boldsymbol{ \basisFunc}}
\newcommand{\basisFunction}{\phi}
\newcommand{\basisLocation}{\mu}
\newcommand{\basisMatrix}{\boldsymbol{ \Phi}}
\newcommand{\basisScalar}{\basisFunction}
\newcommand{\basisVector}{\boldsymbol{ \basisFunction}}
\newcommand{\activationFunction}{\phi}
\newcommand{\activationMatrix}{\boldsymbol{ \Phi}}
\newcommand{\activationScalar}{\basisFunction}
\newcommand{\activationVector}{\boldsymbol{ \basisFunction}}
\newcommand{\bigO}{\mathcal{O}}
\newcommand{\binomProb}{\pi}
\newcommand{\cMatrix}{\mathbf{C}}
\newcommand{\cbasisMatrix}{\hat{\boldsymbol{ \Phi}}}
\newcommand{\cdataMatrix}{\hat{\dataMatrix}}
\newcommand{\cdataScalar}{\hat{\dataScalar}}
\newcommand{\cdataVector}{\hat{\dataVector}}
\newcommand{\centeredKernelMatrix}{\mathbf{ \MakeUppercase{\centeredKernelScalar}}}
\newcommand{\centeredKernelScalar}{b}
\newcommand{\centeredKernelVector}{\centeredKernelScalar}
\newcommand{\centeringMatrix}{\mathbf{H}}
\newcommand{\chiSquaredDist}[2]{\chi_{#1}^{2}\left(#2\right)}
\newcommand{\chiSquaredSamp}[1]{\chi_{#1}^{2}}
\newcommand{\conditionalCovariance}{\boldsymbol{ \Sigma}}
\newcommand{\coregionalizationMatrix}{\mathbf{B}}
\newcommand{\coregionalizationScalar}{b}
\newcommand{\coregionalizationVector}{\mathbf{ \coregionalizationScalar}}
\newcommand{\covDist}[2]{\text{cov}_{#2}\left(#1\right)}
\newcommand{\covSamp}[1]{\text{cov}\left(#1\right)}
\newcommand{\covarianceScalar}{c}
\newcommand{\covarianceVector}{\mathbf{ \covarianceScalar}}
\newcommand{\covarianceMatrix}{\mathbf{C}}
\newcommand{\covarianceMatrixTwo}{\boldsymbol{ \Sigma}}
\newcommand{\croupierScalar}{s}
\newcommand{\croupierVector}{\mathbf{ \croupierScalar}}
\newcommand{\croupierMatrix}{\mathbf{ \MakeUppercase{\croupierScalar}}}
\newcommand{\dataDim}{p}
\newcommand{\dataIndex}{i}
\newcommand{\dataIndexTwo}{j}
\newcommand{\dataMatrix}{\mathbf{Y}}
\newcommand{\dataScalar}{y}
\newcommand{\dataSet}{\mathcal{D}}
\newcommand{\dataStd}{\sigma}
\newcommand{\dataVector}{\mathbf{ \dataScalar}}
\newcommand{\decayRate}{d}
\newcommand{\degreeMatrix}{\mathbf{ \MakeUppercase{\degreeScalar}}}
\newcommand{\degreeScalar}{d}
\newcommand{\degreeVector}{\mathbf{ \degreeScalar}}
% Already defined by latex
%\newcommand{\det}[1]{\left|#1\right|}
\newcommand{\diag}[1]{\text{diag}\left(#1\right)}
\newcommand{\diagonalMatrix}{\mathbf{D}}
\newcommand{\diff}[2]{\frac{\text{d}#1}{\text{d}#2}}
\newcommand{\diffTwo}[2]{\frac{\text{d}^2#1}{\text{d}#2^2}}
\newcommand{\displacement}{x}
\newcommand{\displacementVector}{\textbf{\displacement}}
\newcommand{\distanceMatrix}{\mathbf{ \MakeUppercase{\distanceScalar}}}
\newcommand{\distanceScalar}{d}
\newcommand{\distanceVector}{\mathbf{ \distanceScalar}}
\newcommand{\eigenvaltwo}{\ell}
\newcommand{\eigenvaltwoMatrix}{\mathbf{L}}
\newcommand{\eigenvaltwoVector}{\mathbf{l}}
\newcommand{\eigenvalue}{\lambda}
\newcommand{\eigenvalueMatrix}{\boldsymbol{ \Lambda}}
\newcommand{\eigenvalueVector}{\boldsymbol{ \lambda}}
\newcommand{\eigenvector}{\mathbf{ \eigenvectorScalar}}
\newcommand{\eigenvectorMatrix}{\mathbf{U}}
\newcommand{\eigenvectorScalar}{u}
\newcommand{\eigenvectwo}{\mathbf{v}}
\newcommand{\eigenvectwoMatrix}{\mathbf{V}}
\newcommand{\eigenvectwoScalar}{v}
\newcommand{\entropy}[1]{\mathcal{H}\left(#1\right)}
\newcommand{\errorFunction}{E}
\newcommand{\expDist}[2]{\left<#1\right>_{#2}}
\newcommand{\expSamp}[1]{\left<#1\right>}
\newcommand{\expectation}[1]{\left\langle #1 \right\rangle }
\newcommand{\expectationDist}[2]{\left\langle #1 \right\rangle _{#2}}
\newcommand{\expectedDistanceMatrix}{\mathcal{D}}
\newcommand{\eye}{\mathbf{I}}
\newcommand{\fantasyDim}{r}
\newcommand{\fantasyMatrix}{\mathbf{ \MakeUppercase{\fantasyScalar}}}
\newcommand{\fantasyScalar}{z}
\newcommand{\fantasyVector}{\mathbf{ \fantasyScalar}}
\newcommand{\featureStd}{\varsigma}
\newcommand{\gammaCdf}[3]{\mathcal{GAMMA CDF}\left(#1|#2,#3\right)}
\newcommand{\gammaDist}[3]{\mathcal{G}\left(#1|#2,#3\right)}
\newcommand{\gammaSamp}[2]{\mathcal{G}\left(#1,#2\right)}
\newcommand{\gaussianDist}[3]{\mathcal{N}\left(#1|#2,#3\right)}
\newcommand{\gaussianSamp}[2]{\mathcal{N}\left(#1,#2\right)}
\newcommand{\given}{|}
\newcommand{\half}{\frac{1}{2}}
\newcommand{\heaviside}{H}
\newcommand{\hiddenMatrix}{\mathbf{ \MakeUppercase{\hiddenScalar}}}
\newcommand{\hiddenScalar}{h}
\newcommand{\hiddenVector}{\mathbf{ \hiddenScalar}}
\newcommand{\identityMatrix}{\eye}
\newcommand{\inducingInputScalar}{z}
\newcommand{\inducingInputVector}{\mathbf{ \inducingInputScalar}}
\newcommand{\inducingInputMatrix}{\mathbf{Z}}
\newcommand{\inducingScalar}{u}
\newcommand{\inducingVector}{\mathbf{ \inducingScalar}}
\newcommand{\inducingMatrix}{\mathbf{U}}
\newcommand{\inlineDiff}[2]{\text{d}#1/\text{d}#2}
\newcommand{\inputDim}{q}
\newcommand{\inputMatrix}{\mathbf{X}}
\newcommand{\inputScalar}{x}
\newcommand{\inputSpace}{\mathcal{X}}
\newcommand{\inputVals}{\inputVector}
\newcommand{\inputVector}{\mathbf{ \inputScalar}}
\newcommand{\iterNum}{k}
\newcommand{\kernel}{\kernelScalar}
\newcommand{\kernelMatrix}{\mathbf{K}}
\newcommand{\kernelScalar}{k}
\newcommand{\kernelVector}{\mathbf{ \kernelScalar}}
\newcommand{\kff}{\kernelScalar_{\mappingFunction \mappingFunction}}
\newcommand{\kfu}{\kernelVector_{\mappingFunction \inducingScalar}}
\newcommand{\kuf}{\kernelVector_{\inducingScalar \mappingFunction}}
\newcommand{\kuu}{\kernelVector_{\inducingScalar \inducingScalar}}
\newcommand{\lagrangeMultiplier}{\lambda}
\newcommand{\lagrangeMultiplierMatrix}{\boldsymbol{ \Lambda}}
\newcommand{\lagrangian}{L}
\newcommand{\laplacianFactor}{\mathbf{ \MakeUppercase{\laplacianFactorScalar}}}
\newcommand{\laplacianFactorScalar}{m}
\newcommand{\laplacianFactorVector}{\mathbf{ \laplacianFactorScalar}}
\newcommand{\laplacianMatrix}{\mathbf{L}}
\newcommand{\laplacianScalar}{\ell}
\newcommand{\laplacianVector}{\mathbf{ \ell}}
\newcommand{\latentDim}{q}
\newcommand{\latentDistanceMatrix}{\boldsymbol{ \Delta}}
\newcommand{\latentDistanceScalar}{\delta}
\newcommand{\latentDistanceVector}{\boldsymbol{ \delta}}
\newcommand{\latentForce}{f}
\newcommand{\latentFunction}{u}
\newcommand{\latentFunctionVector}{\mathbf{ \latentFunction}}
\newcommand{\latentFunctionMatrix}{\mathbf{ \MakeUppercase{\latentFunction}}}
\newcommand{\latentIndex}{j}
\newcommand{\latentScalar}{z}
\newcommand{\latentVector}{\mathbf{ \latentScalar}}
\newcommand{\latentMatrix}{\mathbf{Z}}
\newcommand{\learnRate}{\eta}
\newcommand{\lengthScale}{\ell}
\newcommand{\rbfWidth}{\ell}
\newcommand{\likelihoodBound}{\mathcal{L}}
\newcommand{\likelihoodFunction}{L}
\newcommand{\locationScalar}{\mu}
\newcommand{\locationVector}{\boldsymbol{ \locationScalar}}
\newcommand{\locationMatrix}{\mathbf{M}}
\newcommand{\variance}[1]{\text{var}\left( #1 \right)}
\newcommand{\mappingFunction}{f}
\newcommand{\mappingFunctionMatrix}{\mathbf{F}}
\newcommand{\mappingFunctionTwo}{g}
\newcommand{\mappingFunctionTwoMatrix}{\mathbf{G}}
\newcommand{\mappingFunctionTwoVector}{\mathbf{ \mappingFunctionTwo}}
\newcommand{\mappingFunctionVector}{\mathbf{ \mappingFunction}}
\newcommand{\scaleScalar}{s}
\newcommand{\mappingScalar}{w}
\newcommand{\mappingVector}{\mathbf{ \mappingScalar}}
\newcommand{\mappingMatrix}{\mathbf{W}}
\newcommand{\mappingScalarTwo}{v}
\newcommand{\mappingVectorTwo}{\mathbf{ \mappingScalarTwo}}
\newcommand{\mappingMatrixTwo}{\mathbf{V}}
\newcommand{\maxIters}{K}
\newcommand{\meanMatrix}{\mathbf{M}}
\newcommand{\meanScalar}{\mu}
\newcommand{\meanTwoMatrix}{\mathbf{M}}
\newcommand{\meanTwoScalar}{m}
\newcommand{\meanTwoVector}{\mathbf{ \meanTwoScalar}}
\newcommand{\meanVector}{\boldsymbol{ \meanScalar}}
\newcommand{\mrnaConcentration}{m}
\newcommand{\naturalFrequency}{\omega}
\newcommand{\neighborhood}[1]{\mathcal{N}\left( #1 \right)}
\newcommand{\neilurl}{http://inverseprobability.com/}
\newcommand{\noiseMatrix}{\boldsymbol{ E}}
\newcommand{\noiseScalar}{\epsilon}
\newcommand{\noiseVector}{\boldsymbol{ \epsilon}}
\newcommand{\norm}[1]{\left\Vert #1 \right\Vert}
\newcommand{\normalizedLaplacianMatrix}{\hat{\mathbf{L}}}
\newcommand{\normalizedLaplacianScalar}{\hat{\ell}}
\newcommand{\normalizedLaplacianVector}{\hat{\mathbf{ \ell}}}
\newcommand{\numActive}{m}
\newcommand{\numBasisFunc}{m}
\newcommand{\numComponents}{m}
\newcommand{\numComps}{K}
\newcommand{\numData}{n}
\newcommand{\numFeatures}{K}
\newcommand{\numHidden}{h}
\newcommand{\numInducing}{m}
\newcommand{\numLayers}{\ell}
\newcommand{\numNeighbors}{K}
\newcommand{\numSequences}{s}
\newcommand{\numSuccess}{s}
\newcommand{\numTasks}{m}
\newcommand{\numTime}{T}
\newcommand{\numTrials}{S}
\newcommand{\outputIndex}{j}
\newcommand{\paramVector}{\boldsymbol{ \theta}}
\newcommand{\parameterMatrix}{\boldsymbol{ \Theta}}
\newcommand{\parameterScalar}{\theta}
\newcommand{\parameterVector}{\boldsymbol{ \parameterScalar}}
\newcommand{\partDiff}[2]{\frac{\partial#1}{\partial#2}}
\newcommand{\precisionScalar}{j}
\newcommand{\precisionVector}{\mathbf{ \precisionScalar}}
\newcommand{\precisionMatrix}{\mathbf{J}}
\newcommand{\pseudotargetScalar}{\widetilde{y}}
\newcommand{\pseudotargetVector}{\mathbf{ \pseudotargetScalar}}
\newcommand{\pseudotargetMatrix}{\mathbf{ \widetilde{Y}}}
\newcommand{\rank}[1]{\text{rank}\left(#1\right)}
\newcommand{\rayleighDist}[2]{\mathcal{R}\left(#1|#2\right)}
\newcommand{\rayleighSamp}[1]{\mathcal{R}\left(#1\right)}
\newcommand{\responsibility}{r}
\newcommand{\rotationScalar}{r}
\newcommand{\rotationVector}{\mathbf{ \rotationScalar}}
\newcommand{\rotationMatrix}{\mathbf{R}}
\newcommand{\sampleCovScalar}{s}
\newcommand{\sampleCovVector}{\mathbf{ \sampleCovScalar}}
\newcommand{\sampleCovMatrix}{\mathbf{s}}
\newcommand{\scalarProduct}[2]{\left\langle{#1},{#2}\right\rangle}
\newcommand{\sign}[1]{\text{sign}\left(#1\right)}
\newcommand{\sigmoid}[1]{\sigma\left(#1\right)}
\newcommand{\singularvalue}{\ell}
\newcommand{\singularvalueMatrix}{\mathbf{L}}
\newcommand{\singularvalueVector}{\mathbf{l}}
\newcommand{\sorth}{\mathbf{u}}
\newcommand{\spar}{\lambda}
\newcommand{\trace}[1]{\text{tr}\left(#1\right)}
\newcommand{\BasalRate}{B}
\newcommand{\DampingCoefficient}{C}
\newcommand{\DecayRate}{D}
\newcommand{\Displacement}{X}
\newcommand{\LatentForce}{F}
\newcommand{\Mass}{M}
\newcommand{\Sensitivity}{S}
\newcommand{\basalRate}{b}
\newcommand{\dampingCoefficient}{c}
\newcommand{\mass}{m}
\newcommand{\sensitivity}{s}
\newcommand{\springScalar}{\kappa}
\newcommand{\springVector}{\boldsymbol{ \kappa}}
\newcommand{\springMatrix}{\boldsymbol{ \mathcal{K}}}
\newcommand{\tfConcentration}{p}
\newcommand{\tfDecayRate}{\delta}
\newcommand{\tfMrnaConcentration}{f}
\newcommand{\tfVector}{\mathbf{ \tfConcentration}}
\newcommand{\velocity}{v}
\newcommand{\sufficientStatsScalar}{g}
\newcommand{\sufficientStatsVector}{\mathbf{ \sufficientStatsScalar}}
\newcommand{\sufficientStatsMatrix}{\mathbf{G}}
\newcommand{\switchScalar}{s}
\newcommand{\switchVector}{\mathbf{ \switchScalar}}
\newcommand{\switchMatrix}{\mathbf{S}}
\newcommand{\tr}[1]{\text{tr}\left(#1\right)}
\newcommand{\loneNorm}[1]{\left\Vert #1 \right\Vert_1}
\newcommand{\ltwoNorm}[1]{\left\Vert #1 \right\Vert_2}
\newcommand{\onenorm}[1]{\left\vert#1\right\vert_1}
\newcommand{\twonorm}[1]{\left\Vert #1 \right\Vert}
\newcommand{\vScalar}{v}
\newcommand{\vVector}{\mathbf{v}}
\newcommand{\vMatrix}{\mathbf{V}}
\newcommand{\varianceDist}[2]{\text{var}_{#2}\left( #1 \right)}
% Already defined by latex
%\newcommand{\vec}{#1:}
\newcommand{\vecb}[1]{\left(#1\right):}
\newcommand{\weightScalar}{w}
\newcommand{\weightVector}{\mathbf{ \weightScalar}}
\newcommand{\weightMatrix}{\mathbf{W}}
\newcommand{\weightedAdjacencyMatrix}{\mathbf{A}}
\newcommand{\weightedAdjacencyScalar}{a}
\newcommand{\weightedAdjacencyVector}{\mathbf{ \weightedAdjacencyScalar}}
\newcommand{\onesVector}{\mathbf{1}}
\newcommand{\zerosVector}{\mathbf{0}}
$$
</div>
<h2 id="what-is-machine-learning">What is Machine Learning?</h2>
<p>What is machine learning? At its most basic level machine learning is a combination of</p>
<p><span class="math display">\[ \text{data} + \text{model} \xrightarrow{\text{compute}} \text{prediction}\]</span></p>
<p>where <em>data</em> is our observations. They can be actively or passively acquired (meta-data). The <em>model</em> contains our assumptions, based on previous experience. That experience can be other data, it can come from transfer learning, or it can merely be our beliefs about the regularities of the universe. In humans our models include our inductive biases. The <em>prediction</em> is an action to be taken or a categorization or a quality score. Machine learning has become a mainstay of artificial intelligence because of the central role predictions play in intelligent systems. The data and the model are combined through computation.</p>
<p>In practice we normally perform machine learning using two functions. To combine data with a model we typically make use of:</p>
<p><strong>a prediction function</strong> a function which is used to make the predictions. It includes our beliefs about the regularities of the universe, our assumptions about how the world works, e.g. smoothness, spatial similarities, temporal similarities.</p>
<p><strong>an objective function</strong> a function which defines the cost of misprediction. Typically it includes knowledge about the world's generating processes (probabilistic objectives) or the costs we pay for mispredictions (empirical risk minimization).</p>
<p>The combination of data and model through the prediction function and the objective function leads to a <em>learning algorithm</em>. The class of prediction functions and objective functions we can make use of is restricted by the algorithms they lead to. If the prediction function or the objective function are too complex, then it can be difficult to find an appropriate learning algorithm. Much of the academic field of machine learning is the quest for new learning algorithms that allow us to bring different types of models and data together.</p>
<p>A useful reference for state of the art in machine learning is the UK Royal Society Report, <a href="https://royalsociety.org/~/media/policy/projects/machine-learning/publications/machine-learning-report.pdf">Machine Learning: Power and Promise of Computers that Learn by Example</a>.</p>
<p>You can also check my blog post on <a href="http://inverseprobability.com/2017/07/17/what-is-machine-learning">"What is Machine Learning?"</a></p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">import</span> numpy <span class="im">as</span> np
<span class="im">import</span> matplotlib.pyplot <span class="im">as</span> plt
<span class="im">import</span> pods
<span class="im">import</span> teaching_plots <span class="im">as</span> plot
<span class="im">import</span> mlai</code></pre></div>
<h3 id="olympic-marathon-data">Olympic Marathon Data</h3>
<p>The first thing we will do is load a standard data set for regression modelling. The data consists of the pace of Olympic Gold Medal Marathon winners for the Olympics from 1896 to present. First we load in the data and plot.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">data <span class="op">=</span> pods.datasets.olympic_marathon_men()
x <span class="op">=</span> data[<span class="st">'X'</span>]
y <span class="op">=</span> data[<span class="st">'Y'</span>]
offset <span class="op">=</span> y.mean()
scale <span class="op">=</span> np.sqrt(y.var())
xlim <span class="op">=</span> (<span class="dv">1875</span>,<span class="dv">2030</span>)
ylim <span class="op">=</span> (<span class="fl">2.5</span>, <span class="fl">6.5</span>)
yhat <span class="op">=</span> (y<span class="op">-</span>offset)<span class="op">/</span>scale
fig, ax <span class="op">=</span> plt.subplots(figsize<span class="op">=</span>plot.big_wide_figsize)
_ <span class="op">=</span> ax.plot(x, y, <span class="st">'r.'</span>,markersize<span class="op">=</span><span class="dv">10</span>)
ax.set_xlabel(<span class="st">'year'</span>, fontsize<span class="op">=</span><span class="dv">20</span>)
ax.set_ylabel(<span class="st">'pace min/km'</span>, fontsize<span class="op">=</span><span class="dv">20</span>)
ax.set_xlim(xlim)
ax.set_ylim(ylim)
mlai.write_figure(figure<span class="op">=</span>fig, filename<span class="op">=</span><span class="st">'../slides/diagrams/datasets/olympic-marathon.svg'</span>, transparent<span class="op">=</span><span class="va">True</span>, frameon<span class="op">=</span><span class="va">True</span>)</code></pre></div>
<h3 id="olympic-marathon-data-1">Olympic Marathon Data</h3>
<table>
<tr>
<td width="70%">
<ul>
<li><p>Gold medal times for Olympic Marathon since 1896.</p></li>
<li><p>Marathons before 1924 didn’t have a standardised distance.</p></li>
<li><p>Present results using pace per km.</p></li>
<li>In 1904 Marathon was badly organised leading to very slow times.</li>
</ul>
</td>
<td width="30%">
<img src="../slides/diagrams/Stephen_Kiprotich.jpg" alt="image" /> <small>Image from Wikimedia Commons <a href="http://bit.ly/16kMKHQ" class="uri">http://bit.ly/16kMKHQ</a></small>
</td>
</tr>
</table>
<object class="svgplot" align data="../slides/diagrams/datasets/olympic-marathon.svg">
</object>
<p>Things to notice about the data include the outlier in 1904. In that year the Olympics was held in St Louis, USA. Organizational problems, and challenges with dust kicked up by the cars following the race, meant that participants got lost and only a few completed the course.</p>
<p>More recent years see more consistently quick marathons.</p>
<h3 id="regression-linear-releationship">Regression: Linear Relationship</h3>
<p>For many their first encounter with what might be termed a machine learning method is fitting a straight line. A straight line is characterized by two parameters, the scale, <span class="math inline">\(m\)</span>, and the offset <span class="math inline">\(c\)</span>.</p>
<p><span class="math display">\[\dataScalar_i = m \inputScalar_i + c\]</span></p>
<p>For the olympic marathon example <span class="math inline">\(\dataScalar_i\)</span> is the winning pace and it is given as a function of the year, which is represented by <span class="math inline">\(\inputScalar_i\)</span>. For the olympics example we can interpret these two parameters of the prediction function: the scale <span class="math inline">\(m\)</span> is the rate of improvement of the olympic marathon pace on a yearly basis, and <span class="math inline">\(c\)</span> is the winning pace as estimated at year 0.</p>
<h2 id="overdetermined-system">Overdetermined System</h2>
<p>The challenge with a linear model is that it has two unknowns, <span class="math inline">\(m\)</span>, and <span class="math inline">\(c\)</span>. Observing data allows us to write down a system of simultaneous linear equations. So, for example if we observe two data points, the first with the input value, <span class="math inline">\(\inputScalar_1 = 1\)</span> and the output value, <span class="math inline">\(\dataScalar_1 =3\)</span> and a second data point, <span class="math inline">\(\inputScalar = 3\)</span>, <span class="math inline">\(\dataScalar=1\)</span>, then we can write two simultaneous linear equations of the form.</p>
<p>point 1: <span class="math inline">\(\inputScalar = 1\)</span>, <span class="math inline">\(\dataScalar=3\)</span> <span class="math display">\[3 = m + c\]</span> point 2: <span class="math inline">\(\inputScalar = 3\)</span>, <span class="math inline">\(\dataScalar=1\)</span> <span class="math display">\[1 = 3m + c\]</span></p>
<p>The solution to these two simultaneous equations can be represented graphically as</p>
<object class="svgplot" align data="../slides/diagrams/ml/over_determined_system003.svg">
</object>
<center>
<em>The solution of two linear equations represented as the fit of a straight line through two data points</em>
</center>
<p>The challenge comes when a third data point is observed and it doesn't naturally fit on the straight line.</p>
<p>point 3: <span class="math inline">\(\inputScalar = 2\)</span>, <span class="math inline">\(\dataScalar=2.5\)</span> <span class="math display">\[2.5 = 2m + c\]</span></p>
<object class="svgplot" align data="../slides/diagrams/ml/over_determined_system004.svg">
</object>
<center>
<em>A third observation of data is inconsistent with the solution dictated by the first two observations </em>
</center>
<p>Now there are three candidate lines, each consistent with a different pair of our observations.</p>
<object class="svgplot" align data="../slides/diagrams/ml/over_determined_system007.svg">
</object>
<center>
<em>Three solutions to the problem, each consistent with two of the three observations</em>
</center>
<p>This is known as an <em>overdetermined</em> system because there are more data than we need to determine our parameters. The problem arises because the model is a simplification of the real world, and the data we observe is therefore inconsistent with our model.</p>
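<p>We can see this inconsistency numerically; the following sketch (not part of the original notebook) fits the line through the first two points and checks the third:</p>

```python
import numpy as np

# All three observations stacked as A @ [m, c] = b.
A = np.array([[1.0, 1.0],
              [3.0, 1.0],
              [2.0, 1.0]])
b = np.array([3.0, 1.0, 2.5])

# Solve using only the first two observations: m = -1, c = 4.
m, c = np.linalg.solve(A[:2], b[:2])

# The third equation is not satisfied: 2m + c = 2, not 2.5.
residual = b[2] - (m * A[2, 0] + c)
print(residual)  # 0.5
```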
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">import</span> teaching_plots <span class="im">as</span> plot</code></pre></div>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">plot.over_determined_system(diagrams<span class="op">=</span><span class="st">'../slides/diagrams/ml'</span>)</code></pre></div>
<p>The solution was proposed by Pierre-Simon Laplace. His idea was to accept that the model was an incomplete representation of the real world, and the manner in which it was incomplete is <em>unknown</em>. His idea was that such unknowns could be dealt with through probability.</p>
<p><img class="" src="../slides/diagrams/ml/Pierre-Simon_Laplace.png" width="30%" align="" style="background:none; border:none; box-shadow:none;"></p>
<p>Famously, Laplace considered the idea of a deterministic Universe, one in which the model is <em>known</em>, or as the below translation refers to it, "an intelligence which could comprehend all the forces by which nature is animated". He speculates on an "intelligence" that can submit this vast data to analysis and proposes that such an entity would be able to predict the future.</p>
<blockquote>
<p>Given for one instant an intelligence which could comprehend all the forces by which nature is animated and the respective situation of the beings who compose it---an intelligence sufficiently vast to submit these data to analysis---it would embrace in the same formula the movements of the greatest bodies of the universe and those of the lightest atom; for it, nothing would be uncertain and the future, as the past, would be present in its eyes.</p>
</blockquote>
<p>This notion is known as <em>Laplace's demon</em> or <em>Laplace's superman</em>.</p>
<p>Unfortunately, most analyses of his ideas stop at that point, whereas his real point is that such a notion is unreachable. Not so much <em>superman</em> as <em>strawman</em>. Just three pages later in the "Philosophical Essay on Probabilities" <span class="citation">(Laplace, 1814)</span>, Laplace goes on to observe:</p>
<blockquote>
<p>The curve described by a simple molecule of air or vapor is regulated in a manner just as certain as the planetary orbits; the only difference between them is that which comes from our ignorance.</p>
<p>Probability is relative, in part to this ignorance, in part to our knowledge.</p>
</blockquote>
<p>In other words, we can never utilize the idealistic deterministic Universe due to our ignorance about the world. Laplace's suggestion, and the focus of this essay, is that we turn to probability to deal with this uncertainty. This is also our inspiration for using probability in machine learning.</p>
<p>The "forces by which nature is animated" is our <em>model</em>, the "situation of beings that compose it" is our <em>data</em> and the "intelligence sufficiently vast enough to submit these data to analysis" is our compute. The fly in the ointment is our <em>ignorance</em> about these aspects. And <em>probability</em> is the tool we use to incorporate this ignorance leading to uncertainty or <em>doubt</em> in our predictions.</p>
<p>Laplace's concept was that the reason that the data doesn't match up to the model is because of unconsidered factors, and that these might be well represented through probability densities. He tackles the challenge of the unknown factors by adding a variable, <span class="math inline">\(\noiseScalar\)</span>, that represents the unknown. In modern parlance we would call this a <em>latent</em> variable. But in the context Laplace uses it, the variable is so common that it has other names such as a "slack" variable or the <em>noise</em> in the system.</p>
<p>point 1: <span class="math inline">\(\inputScalar = 1\)</span>, <span class="math inline">\(\dataScalar=3\)</span> <span class="math display">\[
3 = m + c + \noiseScalar_1
\]</span> point 2: <span class="math inline">\(\inputScalar = 3\)</span>, <span class="math inline">\(\dataScalar=1\)</span> <span class="math display">\[
1 = 3m + c + \noiseScalar_2
\]</span> point 3: <span class="math inline">\(\inputScalar = 2\)</span>, <span class="math inline">\(\dataScalar=2.5\)</span> <span class="math display">\[
2.5 = 2m + c + \noiseScalar_3
\]</span></p>
<p>Laplace's trick has converted the <em>overdetermined</em> system into an <em>underdetermined</em> system. He has now added three variables, <span class="math inline">\(\{\noiseScalar_i\}_{i=1}^3\)</span>, which represent the unknown corruptions of the real world. Laplace's idea is that we should represent that unknown corruption with a <em>probability distribution</em>.</p>
<h3 id="a-probabilistic-process">A Probabilistic Process</h3>
<p>However, it was left to an admirer of Laplace to develop a practical probability density for that purpose. It was Carl Friedrich Gauss who suggested that the <em>Gaussian</em> density (which at the time was unnamed!) should be used to represent this error.</p>
<p>The result is a <em>noisy</em> function, a function which has a deterministic part, and a stochastic part. This type of function is sometimes known as a probabilistic or stochastic process, to distinguish it from a deterministic process.</p>
<h3 id="the-gaussian-density">The Gaussian Density</h3>
<p>The Gaussian density is perhaps the most commonly used probability density. It is defined by a <em>mean</em>, <span class="math inline">\(\meanScalar\)</span>, and a <em>variance</em>, <span class="math inline">\(\dataStd^2\)</span>. The variance is taken to be the square of the <em>standard deviation</em>, <span class="math inline">\(\dataStd\)</span>.</p>
<p><span class="math display">\[\begin{align}
p(\dataScalar| \meanScalar, \dataStd^2) & = \frac{1}{\sqrt{2\pi\dataStd^2}}\exp\left(-\frac{(\dataScalar - \meanScalar)^2}{2\dataStd^2}\right)\\& \buildrel\triangle\over = \gaussianDist{\dataScalar}{\meanScalar}{\dataStd^2}
\end{align}\]</span></p>
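<p>The density above is straightforward to evaluate directly; here is a minimal sketch (the function name is our own, not from the talk's <code>mlai</code> utilities):</p>

```python
import numpy as np

def gaussian_pdf(y, mu, sigma2):
    """Evaluate the Gaussian density with mean mu and variance sigma2 at y."""
    return np.exp(-(y - mu)**2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

# At the mean the density reaches its maximum, 1/sqrt(2*pi*sigma2).
# Here mu = 1.7 and sigma2 = 0.0225, matching the height example below.
print(gaussian_pdf(1.7, 1.7, 0.0225))  # approximately 2.66
```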
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">import</span> teaching_plots <span class="im">as</span> plot</code></pre></div>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">plot.gaussian_of_height(diagrams<span class="op">=</span><span class="st">'../../slides/diagrams/ml'</span>)</code></pre></div>
<object class="svgplot" align data="../slides/diagrams/ml/gaussian_of_height.svg">
</object>
<center>
<em>The Gaussian PDF with <span class="math inline">\({\meanScalar}=1.7\)</span> and variance <span class="math inline">\({\dataStd}^2=0.0225\)</span>. Mean shown as cyan line. It could represent the heights of a population of students. </em>
</center>
<h3 id="two-important-gaussian-properties">Two Important Gaussian Properties</h3>
<p>The Gaussian density has many important properties, but for the moment we'll review two of them.</p>
<h3 id="sum-of-gaussians">Sum of Gaussians</h3>
<p>If we assume that a variable, <span class="math inline">\(\dataScalar_i\)</span>, is sampled from a Gaussian density,</p>
<p><span class="math display">\[\dataScalar_i \sim \gaussianSamp{\meanScalar_i}{\sigma_i^2}\]</span></p>
<p>Then we can show that the sum of a set of variables, each drawn independently from such a density is also distributed as Gaussian. The mean of the resulting density is the sum of the means, and the variance is the sum of the variances,</p>
<p><span class="math display">\[\sum_{i=1}^{\numData} \dataScalar_i \sim \gaussianSamp{\sum_{i=1}^\numData \meanScalar_i}{\sum_{i=1}^\numData \sigma_i^2}\]</span></p>
<p>Since we are very familiar with the Gaussian density and its properties, it is not immediately apparent how unusual this is. Most random variables, when you add them together, change the family of density they are drawn from; the Gaussian is exceptional in this regard. Indeed, other random variables, if they are independently drawn and summed together, tend to a Gaussian density. That is the <a href="https://en.wikipedia.org/wiki/Central_limit_theorem"><em>central limit theorem</em></a>, which is a major justification for the use of a Gaussian density.</p>
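<p>The summing property is easy to verify by sampling; in this minimal check (our own, with arbitrary means and variances) the empirical moments of the sum match the sums of the individual moments:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
means = np.array([1.0, 2.0, 3.0])
variances = np.array([0.5, 1.0, 1.5])

# Draw each variable independently, then sum across the three of them.
# The sums should be Gaussian with mean 6.0 and variance 3.0.
samples = rng.normal(means, np.sqrt(variances), size=(100000, 3)).sum(axis=1)
print(samples.mean(), samples.var())  # close to 6.0 and 3.0
```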
<h3 id="scaling-a-gaussian">Scaling a Gaussian</h3>
<p>Less unusual is the <em>scaling</em> property of a Gaussian density. If a variable, <span class="math inline">\(\dataScalar\)</span>, is sampled from a Gaussian density,</p>
<p><span class="math display">\[\dataScalar \sim \gaussianSamp{\meanScalar}{\sigma^2}\]</span> and we choose to scale that variable by a <em>deterministic</em> value, <span class="math inline">\(\mappingScalar\)</span>, then the <em>scaled variable</em> is distributed as</p>
<p><span class="math display">\[\mappingScalar \dataScalar \sim \gaussianSamp{\mappingScalar\meanScalar}{\mappingScalar^2 \sigma^2}.\]</span> Unlike the summing property, where adding two or more random variables independently sampled from a family of densities typically brings the summed variable <em>outside</em> that family, scaling many densities leaves the distribution of that variable in the same <em>family</em> of densities. Indeed, many densities include a <em>scale</em> parameter (e.g. the <a href="https://en.wikipedia.org/wiki/Gamma_distribution">Gamma density</a>) which is purely for this purpose. In the Gaussian the standard deviation, <span class="math inline">\(\dataStd\)</span>, is the scale parameter. To see why this makes sense, let's consider, <span class="math display">\[z \sim \gaussianSamp{0}{1},\]</span> then if we scale by <span class="math inline">\(\dataStd\)</span> so we have, <span class="math inline">\(\dataScalar=\dataStd z\)</span>, we can write, <span class="math display">\[\dataScalar =\dataStd z \sim \gaussianSamp{0}{\dataStd^2}\]</span></p>
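<p>The scaling property can be checked the same way; in this sketch the scale <code>w</code> is an arbitrary choice of ours:</p>

```python
import numpy as np

rng = np.random.default_rng(1)
w = 2.5  # deterministic scaling factor

# z ~ N(0, 1), so w*z should be distributed as N(0, w**2).
z = rng.standard_normal(200000)
scaled = w * z
print(scaled.var())  # close to w**2 = 6.25
```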
<h2 id="laplaces-idea">Laplace's Idea</h2>
<h3 id="a-probabilistic-process-1">A Probabilistic Process</h3>
<p>Laplace had the idea to augment the observations by noise, that is equivalent to considering a probability density whose mean is given by the <em>prediction function</em> <span class="math display">\[p\left(\dataScalar_i|\inputScalar_i\right)=\frac{1}{\sqrt{2\pi\dataStd^2}}\exp\left(-\frac{\left(\dataScalar_i-f\left(\inputScalar_i\right)\right)^{2}}{2\dataStd^2}\right).\]</span></p>
<p>This is known as a <em>stochastic process</em>. It is a function that is corrupted by noise. Laplace didn't suggest the Gaussian density for that purpose, that was an innovation from Carl Friedrich Gauss, which is what gives the Gaussian density its name.</p>
<h3 id="height-as-a-function-of-weight">Height as a Function of Weight</h3>
<p>Start with the standard Gaussian, parameterized by its mean and variance.</p>
<p>Make the mean a linear function of an <em>input</em>.</p>
<p>This leads to a regression model. <span class="math display">\[
\begin{align*}
\dataScalar_i=&\mappingFunction\left(\inputScalar_i\right)+\noiseScalar_i,\\
\noiseScalar_i \sim & \gaussianSamp{0}{\dataStd^2}.
\end{align*}
\]</span></p>
<p>Assume <span class="math inline">\(\dataScalar_i\)</span> is height and <span class="math inline">\(\inputScalar_i\)</span> is weight.</p>
<p>Likelihood of an individual data point <span class="math display">\[
p\left(\dataScalar_i|\inputScalar_i,m,c\right)=\frac{1}{\sqrt{2\pi \dataStd^2}}\exp\left(-\frac{\left(\dataScalar_i-m\inputScalar_i-c\right)^{2}}{2\dataStd^2}\right).
\]</span> Parameters are gradient, <span class="math inline">\(m\)</span>, offset, <span class="math inline">\(c\)</span> of the function and noise variance <span class="math inline">\(\dataStd^2\)</span>.</p>
<h3 id="data-set-likelihood">Data Set Likelihood</h3>
<p>If the noise, <span class="math inline">\(\noiseScalar_i\)</span>, is sampled independently for each data point, then each data point is independent (given <span class="math inline">\(m\)</span> and <span class="math inline">\(c\)</span>). For <em>independent</em> variables: <span class="math display">\[
p(\dataVector) = \prod_{i=1}^\numData p(\dataScalar_i)
\]</span> <span class="math display">\[
p(\dataVector|\inputVector, m, c) = \prod_{i=1}^\numData p(\dataScalar_i|\inputScalar_i, m, c)
\]</span></p>
<h3 id="for-gaussian">For Gaussian</h3>
<p>i.i.d. assumption <span class="math display">\[
p(\dataVector|\inputVector, m, c) = \prod_{i=1}^\numData \frac{1}{\sqrt{2\pi \dataStd^2}}\exp \left(-\frac{\left(\dataScalar_i- m\inputScalar_i-c\right)^{2}}{2\dataStd^2}\right).
\]</span> <span class="math display">\[
p(\dataVector|\inputVector, m, c) = \frac{1}{\left(2\pi \dataStd^2\right)^{\frac{\numData}{2}}}\exp\left(-\frac{\sum_{i=1}^\numData\left(\dataScalar_i-m\inputScalar_i-c\right)^{2}}{2\dataStd^2}\right).
\]</span></p>
<h3 id="log-likelihood-function">Log Likelihood Function</h3>
<ul>
<li>Normally work with the log likelihood: <span class="math display">\[
L(m,c,\dataStd^{2})=-\frac{\numData}{2}\log 2\pi -\frac{\numData}{2}\log \dataStd^2 -\sum_{i=1}^{\numData}\frac{\left(\dataScalar_i-m\inputScalar_i-c\right)^{2}}{2\dataStd^2}.
\]</span></li>
</ul>
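<p>The log likelihood above translates directly into code; this helper (our own sketch, not from the lab's <code>mlai</code> module) evaluates it term by term:</p>

```python
import numpy as np

def log_likelihood(m, c, sigma2, x, y):
    """Log likelihood of the linear model with Gaussian noise."""
    n = len(y)
    residuals = y - (m * x + c)
    return (-0.5 * n * np.log(2 * np.pi)
            - 0.5 * n * np.log(sigma2)
            - np.sum(residuals**2) / (2 * sigma2))

# Toy data lying exactly on the line y = 2x + 1: the residual term
# vanishes and only the normalization terms remain.
x = np.array([1.0, 2.0, 3.0])
y = 2.0 * x + 1.0
print(log_likelihood(2.0, 1.0, 1.0, x, y))  # -1.5 * log(2*pi)
```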
<h3 id="consistency-of-maximum-likelihood">Consistency of Maximum Likelihood</h3>
<ul>
<li>If data was really generated according to probability we specified.</li>
<li>Correct parameters will be recovered in limit as <span class="math inline">\(\numData \rightarrow \infty\)</span>.</li>
<li>This can be proven through sample based approximations (law of large numbers) of "KL divergences".</li>
<li>Mainstay of classical statistics.</li>
</ul>
<h3 id="probabilistic-interpretation-of-the-error-function">Probabilistic Interpretation of the Error Function</h3>
<ul>
<li>Probabilistic Interpretation for Error Function is Negative Log Likelihood.</li>
<li><em>Minimizing</em> error function is equivalent to <em>maximizing</em> log likelihood.</li>
<li>Maximizing <em>log likelihood</em> is equivalent to maximizing the <em>likelihood</em> because <span class="math inline">\(\log\)</span> is monotonic.</li>
<li>Probabilistic interpretation: Minimizing error function is equivalent to maximum likelihood with respect to parameters.</li>
</ul>
<h3 id="error-function">Error Function</h3>
<ul>
<li>The negative log likelihood gives us the error function <span class="math display">\[\errorFunction(m,c,\dataStd^{2})=\frac{\numData}{2}\log \dataStd^2+\frac{1}{2\dataStd^2}\sum _{i=1}^{\numData}\left(\dataScalar_i-m\inputScalar_i-c\right)^{2}.\]</span></li>
<li>Learning proceeds by minimizing this error function for the data set provided.</li>
</ul>
<h3 id="connection-sum-of-squares-error">Connection: Sum of Squares Error</h3>
<ul>
<li>Ignoring terms which don’t depend on <span class="math inline">\(m\)</span> and <span class="math inline">\(c\)</span> gives <span class="math display">\[\errorFunction(m, c) \propto \sum_{i=1}^\numData (\dataScalar_i - \mappingFunction(\inputScalar_i))^2\]</span> where <span class="math inline">\(\mappingFunction(\inputScalar_i) = m\inputScalar_i + c\)</span>.</li>
<li>This is known as the <em>sum of squares</em> error function.</li>
<li>Commonly used and is closely associated with the Gaussian likelihood.</li>
</ul>
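<p>As a small illustration (using the three points from the overdetermined example above, with a helper name of our own choosing), the sum of squares error can be evaluated directly:</p>

```python
import numpy as np

def sum_of_squares(m, c, x, y):
    """Sum of squares error for the line y = m*x + c."""
    return np.sum((y - (m * x + c))**2)

# The three observations from the overdetermined system.
x = np.array([1.0, 3.0, 2.0])
y = np.array([3.0, 1.0, 2.5])

# The line through points 1 and 2 misses point 3 by 0.5.
print(sum_of_squares(-1.0, 4.0, x, y))  # 0.25
```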
<h3 id="reminder">Reminder</h3>
<ul>
<li>Two functions involved:
<ul>
<li><em>Prediction function</em>: <span class="math inline">\(\mappingFunction(\inputScalar_i)\)</span></li>
<li>Error, or <em>Objective function</em>: <span class="math inline">\(\errorFunction(m, c)\)</span></li>
</ul></li>
<li>Error function depends on parameters through prediction function.</li>
</ul>
<h3 id="mathematical-interpretation">Mathematical Interpretation</h3>
<ul>
<li>What is the mathematical interpretation?</li>
<li>There is a cost function.
<ul>
<li>It expresses mismatch between your prediction and reality. <span class="math display">\[
\errorFunction(m, c)=\sum_{i=1}^\numData \left(\dataScalar_i - m\inputScalar_i-c\right)^2
\]</span></li>
<li>This is known as the sum of squares error.</li>
</ul></li>
</ul>
<h2 id="sum-of-squares-error">Sum of Squares Error</h2>
<p>Minimizing the sum of squares error was first proposed by <a href="http://en.wikipedia.org/wiki/Adrien-Marie_Legendre">Legendre</a> in 1805. His book, which was on the determination of the orbits of comets, is available on Google Books, and the relevant page is shown below.</p>
<p><a href="https://play.google.com/books/reader?id=spcAAAAAMAAJ&pg=PA72"><img src="../slides/diagrams/books/spcAAAAAMAAJ-72.png" /></a></p>
<p>Of course, the main text is in French, but the key part we are interested in can be roughly translated as</p>
<blockquote>
<p>In most investigations in which the object is to deduce from observational measurements the most accurate results they can offer, one is almost always led to a system of equations of the form <span class="math display">\[E = a + bx + cy + fz + etc .\]</span> in which <span class="math inline">\(a\)</span>, <span class="math inline">\(b\)</span>, <span class="math inline">\(c\)</span>, <span class="math inline">\(f\)</span> etc. are known coefficients and <span class="math inline">\(x\)</span>, <span class="math inline">\(y\)</span>, <span class="math inline">\(z\)</span> etc. are unknowns which must be determined by the condition that the value of <span class="math inline">\(E\)</span> is reduced, for each equation, to a quantity which is either zero or very small.</p>
</blockquote>
<p>He continues</p>
<blockquote>
Of all the principles that can be proposed for this purpose, I think there is none more general, more exact, or easier to apply than the one we have used in the preceding research, which consists of making the sum of the squares of the errors a minimum. By this means a kind of equilibrium is established among the errors which, since it prevents the extremes from dominating, is well suited to revealing the state of the system which most nearly approaches the truth. The sum of the squares of the errors <span class="math inline">\(E^2 + \left.E^\prime\right.^2 + \left.E^{\prime\prime}\right.^2 + etc\)</span> being
<span class="math display">\[\begin{align*} &(a + bx + cy + fz + etc)^2 \\
+ &(a^\prime +
b^\prime x + c^\prime y + f^\prime z + etc ) ^2\\
+ &(a^{\prime\prime} +
b^{\prime\prime}x + c^{\prime\prime}y + f^{\prime\prime}z + etc )^2 \\
+ & etc
\end{align*}\]</span>
<p>if we wanted a minimum, by varying x alone, we will have the equation ...</p>
</blockquote>
<p>This is the earliest known printed version of the problem of least squares. The notation, however, is a little awkward for modern eyes. In particular, Legendre doesn't make use of the sum sign, <span class="math display">\[
\sum_{i=1}^3 z_i = z_1
+ z_2 + z_3
\]</span> nor does he make use of the inner product.</p>
<p>In our notation, if we were to do linear regression, we would need to substitute: <span class="math display">\[\begin{align*}
a &\leftarrow \dataScalar_1-c, \\ a^\prime &\leftarrow \dataScalar_2-c,\\ a^{\prime\prime} &\leftarrow
\dataScalar_3 -c,\\
\text{etc.}
\end{align*}\]</span> to introduce the data observations <span class="math inline">\(\{\dataScalar_i\}_{i=1}^{\numData}\)</span> alongside <span class="math inline">\(c\)</span>, the offset. We would then introduce the input locations <span class="math display">\[\begin{align*}
b & \leftarrow \inputScalar_1,\\
b^\prime & \leftarrow \inputScalar_2,\\
b^{\prime\prime} & \leftarrow \inputScalar_3\\
\text{etc.}
\end{align*}\]</span> and finally the gradient of the function <span class="math display">\[x \leftarrow -m.\]</span> The remaining of Legendre's coefficients (his <span class="math inline">\(c\)</span> and <span class="math inline">\(f\)</span>) would then be zero. That would give us <span class="math display">\[\begin{align*} &(\dataScalar_1 -
(m\inputScalar_1+c))^2 \\
+ &(\dataScalar_2 -(m\inputScalar_2 + c))^2\\
+ &(\dataScalar_3 -(m\inputScalar_3 + c))^2 \\
+ &
\text{etc.}
\end{align*}\]</span> which we would write in the modern notation for sums as <span class="math display">\[
\sum_{i=1}^\numData (\dataScalar_i-(m\inputScalar_i + c))^2
\]</span> which is recognised as the sum of squares error for a linear regression.</p>
<p>This shows the advantage of modern <a href="http://en.wikipedia.org/wiki/Summation">summation operator</a>, <span class="math inline">\(\sum\)</span>, in keeping our mathematical notation compact. Whilst it may look more complicated the first time you see it, understanding the mathematical rules that go around it, allows us to go much further with the notation.</p>
<p>Inner products (or <a href="http://en.wikipedia.org/wiki/Dot_product">dot products</a>) are similar. They allow us to write <span class="math display">\[
\sum_{i=1}^q u_i v_i
\]</span> in a more compact notation, <span class="math inline">\(\mathbf{u}\cdot\mathbf{v}.\)</span></p>
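<p>In code the equivalence between the explicit sum and the inner product is immediate; a one-line check (with arbitrary example vectors):</p>

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])

# The explicit sum and the inner product are the same number.
print(np.sum(u * v), np.dot(u, v))  # both 32.0
```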
<p>Here we are using bold face to represent vectors, and we assume that the individual elements of a vector <span class="math inline">\(\mathbf{z}\)</span> are given as a series of scalars <span class="math display">\[
\mathbf{z} = \begin{bmatrix} z_1\\ z_2\\ \vdots\\ z_\numData
\end{bmatrix}
\]</span> which are each indexed by their position in the vector.</p>
<h2 id="linear-algebra">Linear Algebra</h2>
<p>Linear algebra provides a very similar role, when we introduce <a href="http://en.wikipedia.org/wiki/Linear_algebra">linear algebra</a>, it is because we are faced with a large number of addition and multiplication operations. These operations need to be done together and would be very tedious to write down as a group. So the first reason we reach for linear algebra is for a more compact representation of our mathematical formulae.</p>
<h3 id="running-example-olympic-marathons">Running Example: Olympic Marathons</h3>
<p>Now we will load in the Olympic marathon data. These are the winning times for the men's Olympic marathon from the first modern Olympics in 1896 up until the London 2012 Olympics.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">data <span class="op">=</span> pods.datasets.olympic_marathon_men()
x <span class="op">=</span> data[<span class="st">'X'</span>]
y <span class="op">=</span> data[<span class="st">'Y'</span>]</code></pre></div>
<p>You can see what these values are by typing:</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="bu">print</span>(x)
<span class="bu">print</span>(y)</code></pre></div>
<p>Note that they are not <code>pandas</code> data frames for this example, they are just arrays of dimensionality <span class="math inline">\(\numData\times 1\)</span>, where <span class="math inline">\(\numData\)</span> is the number of data.</p>
<p>The aim of this lab is to have you coding linear regression in python. We will do it in two ways, once using iterative updates (coordinate descent) and then using linear algebra. The linear algebra approach will not only work much better, it is also easy to extend to multiple-input linear regression and <em>non-linear</em> regression using basis functions.</p>
<h3 id="plotting-the-data">Plotting the Data</h3>
<p>You can make a plot of <span class="math inline">\(\dataScalar\)</span> vs <span class="math inline">\(\inputScalar\)</span> with the following command:</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="op">%</span>matplotlib inline
<span class="im">import</span> matplotlib.pyplot <span class="im">as</span> plt</code></pre></div>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">plt.plot(x, y, <span class="st">'rx'</span>)
plt.xlabel(<span class="st">'year'</span>)
plt.ylabel(<span class="st">'pace in min/km'</span>)</code></pre></div>
<h3 id="maximum-likelihood-iterative-solution">Maximum Likelihood: Iterative Solution</h3>
<p>Now we will take the maximum likelihood approach we derived in the lecture to fit a line, <span class="math inline">\(\dataScalar_i=m\inputScalar_i + c\)</span>, to the data you've plotted. We are trying to minimize the error function: <span class="math display">\[
\errorFunction(m, c) = \sum_{i=1}^\numData(\dataScalar_i-m\inputScalar_i-c)^2
\]</span> with respect to <span class="math inline">\(m\)</span> and <span class="math inline">\(c\)</span>. We can start with an initial guess for <span class="math inline">\(m\)</span>,</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">m <span class="op">=</span> <span class="op">-</span><span class="fl">0.4</span>
c <span class="op">=</span> <span class="dv">80</span></code></pre></div>
<p>Then we use the maximum likelihood update to find an estimate for the offset, <span class="math inline">\(c\)</span>.</p>
<h3 id="coordinate-descent">Coordinate Descent</h3>
<p>In the movie recommender system example, we minimised the objective function with steepest-descent-based gradient methods. Our updates required us to compute the gradient at our current position, then to update the parameters by moving in the direction of steepest descent. This time, we will take another approach, known as <em>coordinate descent</em>. In coordinate descent, we move one parameter at a time, and ideally we design an algorithm that at each step moves that parameter to its minimum value.</p>
<p>To find the minimum, we look for the point in the curve where the gradient is zero. This can be found by taking the gradient of <span class="math inline">\(\errorFunction(m,c)\)</span> with respect to the parameter.</p>
<h4 id="update-for-offset">Update for Offset</h4>
<p>Let's consider the parameter <span class="math inline">\(c\)</span> first. The gradient goes nicely through the summation operator, and we obtain <span class="math display">\[
\frac{\text{d}\errorFunction(m,c)}{\text{d}c} = -\sum_{i=1}^\numData 2(\dataScalar_i-m\inputScalar_i-c).
\]</span> Now we want the point that is a minimum. A minimum is an example of a <a href="http://en.wikipedia.org/wiki/Stationary_point"><em>stationary point</em></a>; the stationary points are those points of the function where the gradient is zero. They are found by solving the equation <span class="math inline">\(\frac{\text{d}\errorFunction(m,c)}{\text{d}c} = 0\)</span>. Substituting into our gradient, we obtain the following equation, <span class="math display">\[
0 = -\sum_{i=1}^\numData 2(\dataScalar_i-m\inputScalar_i-c)
\]</span> which can be reorganised as follows, <span class="math display">\[
c^* = \frac{\sum_{i=1}^\numData(\dataScalar_i-m^*\inputScalar_i)}{\numData}.
\]</span> The fact that the stationary point is easily extracted in this manner implies that the solution is <em>unique</em>: there is only one stationary point for this system. To determine what type of stationary point we have found, we compute the <em>second derivative</em>, <span class="math display">\[
\frac{\text{d}^2\errorFunction(m,c)}{\text{d}c^2} = 2\numData.
\]</span> The second derivative is positive, which implies that we have found a minimum of the function. This means that setting <span class="math inline">\(c\)</span> in this way takes us to the lowest point along that axis.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="co"># set c to the minimum</span>
c <span class="op">=</span> (y <span class="op">-</span> m<span class="op">*</span>x).mean()
<span class="bu">print</span>(c)</code></pre></div>
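We can check numerically that this update zeroes the gradient with respect to c. A self-contained sketch (the x, y values below are made-up stand-ins for the lab's Olympic data):

```python
import numpy as np

# made-up stand-ins for the lab's x and y
x = np.array([1896., 1900., 1904., 1908.])
y = np.array([4.5, 4.4, 4.3, 4.2])
m = -0.4

# coordinate-descent update for the offset
c = (y - m*x).mean()

# gradient of the error with respect to c at the new value
c_grad = -2*(y - m*x - c).sum()
print(c_grad)  # ~0: c sits at the minimum along its axis
```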
<h3 id="update-for-slope">Update for Slope</h3>
<p>Now that the offset is set to its minimum value, the next step in coordinate descent is to optimise another parameter. Only one further parameter remains: the slope of the system.</p>
<p>Turning our attention to the slope, we once again perform the same set of computations to find the minimum. We end up with an update equation of the following form.</p>
<p><span class="math display">\[m^* = \frac{\sum_{i=1}^\numData (\dataScalar_i - c)\inputScalar_i}{\sum_{i=1}^\numData \inputScalar_i^2}\]</span></p>
<p>Communication of mathematics is an essential skill in data science. In a moment, you will be asked to rederive the equation above. Before we do that, however, we will briefly review how to write mathematics in the notebook.</p>
<h3 id="latex-for-maths"><span class="math inline">\(\LaTeX\)</span> for Maths</h3>
<p>These cells use <a href="http://en.wikipedia.org/wiki/Markdown">Markdown format</a>. You can include maths in your markdown using <a href="http://en.wikipedia.org/wiki/LaTeX"><span class="math inline">\(\LaTeX\)</span> syntax</a>, all you have to do is write your answer inside dollar signs, as follows:</p>
<p>To write a fraction, we write <code>$\frac{a}{b}$</code>, and it will display like this <span class="math inline">\(\frac{a}{b}\)</span>. To write a subscript we write <code>$a_b$</code> which will appear as <span class="math inline">\(a_b\)</span>. To write a superscript (for example in a polynomial) we write <code>$a^b$</code> which will appear as <span class="math inline">\(a^b\)</span>. There are lots of other macros as well, for example we can write Greek letters such as <code>$\alpha, \beta, \gamma$</code> rendering as <span class="math inline">\(\alpha, \beta, \gamma\)</span>. And we can write sum and integral signs as <code>$\sum$</code> and <code>$\int$</code>.</p>
<p>You can combine many of these operations together for composing expressions.</p>
<h3 id="question-1">Question 1</h3>
<p>Convert the following python code expressions into <span class="math inline">\(\LaTeX\)</span>, writing your answers below. In each case write your answer as a single equality (i.e. your maths should only contain one expression, not several lines of expressions). For the purposes of your <span class="math inline">\(\LaTeX\)</span> please assume that <code>x</code> and <code>w</code> are <span class="math inline">\(n\)</span> dimensional vectors.</p>
<p><code>(a) f = x.sum()</code></p>
<p><code>(b) m = x.mean()</code></p>
<p><code>(c) g = (x*w).sum()</code></p>
<p><em>15 marks</em></p>
<h3 id="write-your-answer-to-question-1-here">Write your answer to Question 1 here</h3>
<h3 id="fixed-point-updates">Fixed Point Updates</h3>
<p><span align="left">Worked example.</span> <span class="math display">\[
\begin{aligned}
c^{*}=&\frac{\sum
_{i=1}^{\numData}\left(\dataScalar_i-m^{*}\inputScalar_i\right)}{\numData},\\
m^{*}=&\frac{\sum
_{i=1}^{\numData}\inputScalar_i\left(\dataScalar_i-c^{*}\right)}{\sum _{i=1}^{\numData}\inputScalar_i^{2}},\\
\left.\dataStd^2\right.^{*}=&\frac{\sum
_{i=1}^{\numData}\left(\dataScalar_i-m^{*}\inputScalar_i-c^{*}\right)^{2}}{\numData}
\end{aligned}
\]</span></p>
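These fixed point updates can be coded directly and iterated until they stabilise. A sketch on synthetic data (the data values and initial slope here are made up for illustration):

```python
import numpy as np

# synthetic data with known slope 1.4 and offset -3.1
np.random.seed(0)
x = np.linspace(0., 1., 20)
y = 1.4*x - 3.1 + np.random.normal(scale=0.1, size=20)

m = 0.0  # made-up initial guess for the slope
for _ in range(100):
    c = (y - m*x).sum()/y.shape[0]     # fixed point update for c
    m = ((y - c)*x).sum()/(x*x).sum()  # fixed point update for m

# noise variance estimate once m and c have converged
sigma2 = ((y - m*x - c)**2).sum()/y.shape[0]
print(m, c, sigma2)
```

With enough iterations m and c settle close to the values used to generate the data.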
<h3 id="gradient-with-respect-to-the-slope">Gradient With Respect to the Slope</h3>
<p>Now that you've had a little training in writing maths with <span class="math inline">\(\LaTeX\)</span>, we will be able to use it to answer questions. The next thing we are going to do is a little differentiation practice.</p>
<h3 id="question-2">Question 2</h3>
<p>Derive the the gradient of the objective function with respect to the slope, <span class="math inline">\(m\)</span>. Rearrange it to show that the update equation written above does find the stationary points of the objective function. By computing its derivative show that it's a minimum.</p>
<p><em>20 marks</em></p>
<h3 id="write-your-answer-to-question-2-here">Write your answer to Question 2 here</h3>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">m <span class="op">=</span> ((y <span class="op">-</span> c)<span class="op">*</span>x).<span class="bu">sum</span>()<span class="op">/</span>(x<span class="op">**</span><span class="dv">2</span>).<span class="bu">sum</span>()
<span class="bu">print</span>(m)</code></pre></div>
<p>We can have a look at how good our fit is by computing the prediction across the input space. First create a vector of 'test points',</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">import</span> numpy <span class="im">as</span> np
x_test <span class="op">=</span> np.linspace(<span class="dv">1890</span>, <span class="dv">2020</span>, <span class="dv">130</span>)[:, <span class="va">None</span>]</code></pre></div>
<p>Now use this vector to compute some test predictions,</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">f_test <span class="op">=</span> m<span class="op">*</span>x_test <span class="op">+</span> c</code></pre></div>
<p>Now plot those test predictions with a blue line on the same plot as the data,</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">plt.plot(x_test, f_test, <span class="st">'b-'</span>)
plt.plot(x, y, <span class="st">'rx'</span>)</code></pre></div>
<p>The fit isn't very good. We need to iterate between these parameter updates in a loop to improve the fit; we have to do this several times,</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="cf">for</span> i <span class="kw">in</span> np.arange(<span class="dv">10</span>):
m <span class="op">=</span> ((y <span class="op">-</span> c)<span class="op">*</span>x).<span class="bu">sum</span>()<span class="op">/</span>(x<span class="op">*</span>x).<span class="bu">sum</span>()
c <span class="op">=</span> (y<span class="op">-</span>m<span class="op">*</span>x).<span class="bu">sum</span>()<span class="op">/</span>y.shape[<span class="dv">0</span>]
<span class="bu">print</span>(m)
<span class="bu">print</span>(c)</code></pre></div>
<p>And let's try plotting the result again</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">f_test <span class="op">=</span> m<span class="op">*</span>x_test <span class="op">+</span> c
plt.plot(x_test, f_test, <span class="st">'b-'</span>)
plt.plot(x, y, <span class="st">'rx'</span>)</code></pre></div>
<p>Clearly we need more iterations than 10! In the next question you will add more iterations and report on the error as optimisation proceeds.</p>
<h3 id="question-3">Question 3</h3>
<p>There is a problem here: we seem to need many iterations to get to a good solution. Let's explore what's going on. Write code which alternates between updates of <code>c</code> and <code>m</code>. Include the following features in your code.</p>
<ol style="list-style-type: decimal">
<li>Initialise with <code>m=-0.4</code> and <code>c=80</code>.</li>
<li>Every 10 iterations compute the value of the objective function for the training data and print it to the screen (you'll find hints on this in <a href="./week2.ipynb">the lab from last week</a>).</li>
<li>Cause the code to stop running when the change in error over 10 iterations is smaller than <span class="math inline">\(1\times10^{-4}\)</span>. This is known as a stopping criterion.</li>
</ol>
<p>Why do we need so many iterations to get to the solution?</p>
<p><em>25 marks</em></p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="co"># Write code for your answer to Question 3 in this box</span>
</code></pre></div>
<h3 id="important-concepts-not-covered">Important Concepts Not Covered</h3>
<ul>
<li>Other optimization methods:
<ul>
<li>Second order methods, conjugate gradient, quasi-Newton and Newton.</li>
</ul></li>
<li>Effective heuristics such as momentum.</li>
<li>Local vs global solutions.</li>
</ul>
<h3 id="reading">Reading</h3>
<ul>
<li>Section 1.1-1.2 of <span class="citation">Rogers and Girolami (2011)</span> for fitting linear models.</li>
<li>Section 1.2.5 of <span class="citation">Bishop (2006)</span> up to equation 1.65.</li>
</ul>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">import</span> numpy <span class="im">as</span> np
<span class="im">import</span> matplotlib.pyplot <span class="im">as</span> plt
<span class="im">import</span> mlai</code></pre></div>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">x <span class="op">=</span> np.random.normal(size<span class="op">=</span><span class="dv">4</span>)</code></pre></div>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">m_true <span class="op">=</span> <span class="fl">1.4</span>
c_true <span class="op">=</span> <span class="op">-</span><span class="fl">3.1</span></code></pre></div>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">y <span class="op">=</span> m_true<span class="op">*</span>x<span class="op">+</span>c_true</code></pre></div>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">plt.plot(x, y, <span class="st">'r.'</span>, markersize<span class="op">=</span><span class="dv">10</span>) <span class="co"># plot data as red dots</span>
plt.xlim([<span class="op">-</span><span class="dv">3</span>, <span class="dv">3</span>])
mlai.write_figure(filename<span class="op">=</span><span class="st">"../slides/diagrams/ml/regression.svg"</span>, transparent<span class="op">=</span><span class="va">True</span>)</code></pre></div>
<object class="svgplot" data="../slides/diagrams/ml/regression.svg">
</object>
<h3 id="noise-corrupted-plot">Noise Corrupted Plot</h3>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">noise <span class="op">=</span> np.random.normal(scale<span class="op">=</span><span class="fl">0.5</span>, size<span class="op">=</span><span class="dv">4</span>) <span class="co"># standard deviation of the noise is 0.5</span>
y <span class="op">=</span> m_true<span class="op">*</span>x <span class="op">+</span> c_true <span class="op">+</span> noise
plt.plot(x, y, <span class="st">'r.'</span>, markersize<span class="op">=</span><span class="dv">10</span>)
plt.xlim([<span class="op">-</span><span class="dv">3</span>, <span class="dv">3</span>])
mlai.write_figure(filename<span class="op">=</span><span class="st">"../slides/diagrams/ml/regression_noise.svg"</span>, transparent<span class="op">=</span><span class="va">True</span>)</code></pre></div>
<object class="svgplot" data="../slides/diagrams/ml/regression_noise.svg">
</object>
<h3 id="contour-plot-of-error-function">Contour Plot of Error Function</h3>
<ul>
<li>Visualise the error function surface, create vectors of values.</li>
</ul>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="co"># create an array of linearly separated values around m_true</span>
m_vals <span class="op">=</span> np.linspace(m_true<span class="op">-</span><span class="dv">3</span>, m_true<span class="op">+</span><span class="dv">3</span>, <span class="dv">100</span>)
<span class="co"># create an array of linearly separated values around c_true</span>
c_vals <span class="op">=</span> np.linspace(c_true<span class="op">-</span><span class="dv">3</span>, c_true<span class="op">+</span><span class="dv">3</span>, <span class="dv">100</span>)</code></pre></div>
<ul>
<li>create a grid of values to evaluate the error function in 2D.</li>
</ul>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">m_grid, c_grid <span class="op">=</span> np.meshgrid(m_vals, c_vals)</code></pre></div>
<ul>
<li>compute the error function at each combination of <span class="math inline">\(c\)</span> and <span class="math inline">\(m\)</span>.</li>
</ul>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">E_grid <span class="op">=</span> np.zeros((<span class="dv">100</span>, <span class="dv">100</span>))
<span class="cf">for</span> i <span class="kw">in</span> <span class="bu">range</span>(<span class="dv">100</span>):
<span class="cf">for</span> j <span class="kw">in</span> <span class="bu">range</span>(<span class="dv">100</span>):
E_grid[i, j] <span class="op">=</span> ((y <span class="op">-</span> m_grid[i, j]<span class="op">*</span>x <span class="op">-</span> c_grid[i, j])<span class="op">**</span><span class="dv">2</span>).<span class="bu">sum</span>()</code></pre></div>
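The double loop works, but the same grid can be computed in one vectorized step with numpy broadcasting. A self-contained sketch (the data and grids below are stand-ins for the variables defined above):

```python
import numpy as np

# stand-ins for the x, y, m_grid and c_grid defined earlier
x = np.random.normal(size=4)
y = 1.4*x - 3.1
m_vals = np.linspace(1.4 - 3, 1.4 + 3, 100)
c_vals = np.linspace(-3.1 - 3, -3.1 + 3, 100)
m_grid, c_grid = np.meshgrid(m_vals, c_vals)

# broadcast the data against the 100x100 grids and sum over the data axis
E_grid = ((y[None, None, :]
           - m_grid[:, :, None]*x[None, None, :]
           - c_grid[:, :, None])**2).sum(axis=2)
print(E_grid.shape)  # (100, 100)
```

Broadcasting trades a little memory (a 100×100×4 intermediate array) for a large speed-up over the explicit double loop.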
<h3 id="contour-plot-of-error">Contour Plot of Error</h3>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="op">%</span>load <span class="op">-</span>s regression_contour teaching_plots.py</code></pre></div>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">f, ax <span class="op">=</span> plt.subplots(figsize<span class="op">=</span>(<span class="dv">5</span>,<span class="dv">5</span>))
regression_contour(f, ax, m_vals, c_vals, E_grid)
mlai.write_figure(filename<span class="op">=</span><span class="st">'../slides/diagrams/ml/regression_contour.svg'</span>)</code></pre></div>
<object class="svgplot" data="../slides/diagrams/ml/regression_contour.svg">
</object>
<h3 id="steepest-descent">Steepest Descent</h3>
<h3 id="algorithm">Algorithm</h3>
<ul>
<li>We start with a guess for <span class="math inline">\(m\)</span> and <span class="math inline">\(c\)</span>.</li>
</ul>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">m_star <span class="op">=</span> <span class="fl">0.0</span>
c_star <span class="op">=</span> <span class="op">-</span><span class="fl">5.0</span></code></pre></div>
<h3 id="offset-gradient">Offset Gradient</h3>
<ul>
<li>Now we need to compute the gradient of the error function, firstly with respect to <span class="math inline">\(c\)</span>,</li>
</ul>
<p><span class="math display">\[\frac{\text{d}\errorFunction(m, c)}{\text{d} c} =
-2\sum_{i=1}^\numData (\dataScalar_i - m\inputScalar_i - c)\]</span></p>
<ul>
<li>This is computed in python as follows</li>
</ul>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">c_grad <span class="op">=</span> <span class="op">-</span><span class="dv">2</span><span class="op">*</span>(y<span class="op">-</span>m_star<span class="op">*</span>x <span class="op">-</span> c_star).<span class="bu">sum</span>()
<span class="bu">print</span>(<span class="st">"Gradient with respect to c is "</span>, c_grad)</code></pre></div>
<h3 id="deriving-the-gradient">Deriving the Gradient</h3>
<p>To see how the gradient was derived, first note that the <span class="math inline">\(c\)</span> appears in every term in the sum. So we are just differentiating <span class="math inline">\((\dataScalar_i - m\inputScalar_i - c)^2\)</span> for each term in the sum. The gradient of this term with respect to <span class="math inline">\(c\)</span> is simply the gradient of the outer quadratic, multiplied by the gradient with respect to <span class="math inline">\(c\)</span> of the part inside the quadratic. The gradient of a quadratic is two times the argument of the quadratic, and the gradient of the inside linear term is just minus one. This is true for all terms in the sum, so we are left with the sum in the gradient.</p>
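A useful habit when deriving gradients like this is to check the result against a finite difference approximation. A sketch with made-up data and parameter values:

```python
import numpy as np

# made-up data and parameters for the check
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.5, 5.2, 7.9])
m, c = 2.0, 1.0

def error(m, c):
    return ((y - m*x - c)**2).sum()

# analytic gradient with respect to c, as derived above
c_grad = -2*(y - m*x - c).sum()

# central finite difference approximation
eps = 1e-6
c_grad_fd = (error(m, c + eps) - error(m, c - eps))/(2*eps)

print(c_grad, c_grad_fd)  # the two should agree closely
```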
<h3 id="slope-gradient">Slope Gradient</h3>
<p>The gradient with respect to <span class="math inline">\(m\)</span> is similar, but now the gradient of the quadratic's argument is <span class="math inline">\(-\inputScalar_i\)</span> so the gradient with respect to <span class="math inline">\(m\)</span> is</p>
<p><span class="math display">\[\frac{\text{d}\errorFunction(m, c)}{\text{d} m} = -2\sum_{i=1}^\numData \inputScalar_i(\dataScalar_i - m\inputScalar_i -
c)\]</span></p>
<p>which can be implemented in python (numpy) as</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">m_grad <span class="op">=</span> <span class="op">-</span><span class="dv">2</span><span class="op">*</span>(x<span class="op">*</span>(y<span class="op">-</span>m_star<span class="op">*</span>x <span class="op">-</span> c_star)).<span class="bu">sum</span>()
<span class="bu">print</span>(<span class="st">"Gradient with respect to m is "</span>, m_grad)</code></pre></div>
<h3 id="update-equations">Update Equations</h3>
<ul>
<li>Now we have gradients with respect to <span class="math inline">\(m\)</span> and <span class="math inline">\(c\)</span>.</li>
<li>We can update our initial guesses for <span class="math inline">\(m\)</span> and <span class="math inline">\(c\)</span> using the gradient.</li>
<li>We don't want to just subtract the gradient from <span class="math inline">\(m\)</span> and <span class="math inline">\(c\)</span>,</li>
<li>We need to take a <em>small</em> step in the gradient direction.</li>
<li>Otherwise we might overshoot the minimum.</li>
<li>We want to follow the gradient to get to the minimum; the gradient changes as we move.</li>
</ul>
<h3 id="move-in-direction-of-gradient">Move in Direction of Gradient</h3>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">import</span> teaching_plots <span class="im">as</span> plot</code></pre></div>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">f, ax <span class="op">=</span> plt.subplots(figsize<span class="op">=</span>plot.big_figsize)
plot.regression_contour(f, ax, m_vals, c_vals, E_grid)
ax.plot(m_star, c_star, <span class="st">'g*'</span>, markersize<span class="op">=</span><span class="dv">20</span>)
ax.arrow(m_star, c_star, <span class="op">-</span>m_grad<span class="op">*</span><span class="fl">0.1</span>, <span class="op">-</span>c_grad<span class="op">*</span><span class="fl">0.1</span>, head_width<span class="op">=</span><span class="fl">0.2</span>)
mlai.write_figure(filename<span class="op">=</span><span class="st">'../slides/diagrams/ml/regression_contour_step001.svg'</span>, transparent<span class="op">=</span><span class="va">True</span>)</code></pre></div>
<object class="svgplot" data="../slides/diagrams/ml/regression_contour_step001.svg">
</object>
<h3 id="update-equations-1">Update Equations</h3>
<ul>
<li><p>The step size has already been introduced; it's again known as the learning rate and is denoted by <span class="math inline">\(\learnRate\)</span>. <span class="math display">\[
c_\text{new}\leftarrow c_{\text{old}} - \learnRate \frac{\text{d}\errorFunction(m, c)}{\text{d}c}
\]</span></p></li>
<li>gives us an update for our estimate of <span class="math inline">\(c\)</span> (which in the code we've been calling <code>c_star</code> to represent a common way of writing a parameter estimate, <span class="math inline">\(c^*\)</span>) and <span class="math display">\[
m_\text{new} \leftarrow m_{\text{old}} - \learnRate \frac{\text{d}\errorFunction(m, c)}{\text{d}m}
\]</span></li>
<li><p>Giving us an update for <span class="math inline">\(m\)</span>.</p></li>
</ul>
<h3 id="update-code">Update Code</h3>
<ul>
<li>These updates can be coded as</li>
</ul>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="bu">print</span>(<span class="st">"Original m was"</span>, m_star, <span class="st">"and original c was"</span>, c_star)
learn_rate <span class="op">=</span> <span class="fl">0.01</span>
c_star <span class="op">=</span> c_star <span class="op">-</span> learn_rate<span class="op">*</span>c_grad
m_star <span class="op">=</span> m_star <span class="op">-</span> learn_rate<span class="op">*</span>m_grad
<span class="bu">print</span>(<span class="st">"New m is"</span>, m_star, <span class="st">"and new c is"</span>, c_star)</code></pre></div>
<h2 id="iterating-updates">Iterating Updates</h2>
<ul>
<li>Fit model by descending gradient.</li>
</ul>
<h3 id="gradient-descent-algorithm">Gradient Descent Algorithm</h3>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">num_plots <span class="op">=</span> plot.regression_contour_fit(x, y, diagrams<span class="op">=</span><span class="st">'../slides/diagrams/ml'</span>)</code></pre></div>
<object class="svgplot" data="../slides/diagrams/ml/regression_contour_fit028.svg">
</object>
<center>
<em>Gradient descent for linear regression</em>
</center>
<h3 id="stochastic-gradient-descent">Stochastic Gradient Descent</h3>
<ul>
<li>If <span class="math inline">\(\numData\)</span> is small, gradient descent is fine.</li>
<li>But sometimes (e.g. on the internet) <span class="math inline">\(\numData\)</span> could be a billion.</li>
<li>Stochastic gradient descent is more similar to perceptron.</li>
<li>Look at the gradient of one data point at a time rather than summing across <em>all</em> data points.</li>
<li>This gives a stochastic estimate of gradient.</li>
</ul>
<h3 id="stochastic-gradient-descent-1">Stochastic Gradient Descent</h3>
<ul>
<li>The real gradient with respect to <span class="math inline">\(m\)</span> is given by</li>
</ul>
<p><span class="math display">\[\frac{\text{d}\errorFunction(m, c)}{\text{d} m} = -2\sum_{i=1}^\numData \inputScalar_i(\dataScalar_i -
m\inputScalar_i - c)\]</span></p>
<p>but it has <span class="math inline">\(\numData\)</span> terms in the sum. Substituting in the gradient we can see that the full update is of the form</p>
<p><span class="math display">\[m_\text{new} \leftarrow
m_\text{old} + 2\learnRate \left[\inputScalar_1 (\dataScalar_1 - m_\text{old}\inputScalar_1 - c_\text{old}) + \inputScalar_2 (\dataScalar_2 - m_\text{old}\inputScalar_2 - c_\text{old}) + \dots + \inputScalar_n (\dataScalar_n - m_\text{old}\inputScalar_n - c_\text{old})\right]\]</span></p>
<p>This could be split up into lots of individual updates <span class="math display">\[m_1 \leftarrow m_\text{old} + 2\learnRate \left[\inputScalar_1 (\dataScalar_1 - m_\text{old}\inputScalar_1 -
c_\text{old})\right]\]</span> <span class="math display">\[m_2 \leftarrow m_1 + 2\learnRate \left[\inputScalar_2 (\dataScalar_2 -
m_\text{old}\inputScalar_2 - c_\text{old})\right]\]</span> <span class="math display">\[m_3 \leftarrow m_2 + 2\learnRate
\left[\dots\right]\]</span> <span class="math display">\[m_n \leftarrow m_{n-1} + 2\learnRate \left[\inputScalar_n (\dataScalar_n -
m_\text{old}\inputScalar_n - c_\text{old})\right]\]</span></p>
<p>which would lead to the same final update.</p>
<h3 id="updating-c-and-m">Updating <span class="math inline">\(c\)</span> and <span class="math inline">\(m\)</span></h3>
<ul>
<li>In the sum we don't change the <span class="math inline">\(m\)</span> and <span class="math inline">\(c\)</span> we use for computing the gradient term at each update.</li>
<li>In stochastic gradient descent we <em>do</em> change them.</li>
<li>This means it's not quite the same as steepest descent.</li>
<li>But we can present each data point in a random order, like we did for the perceptron.</li>
<li>This makes the algorithm suitable for large scale web use (this domain is now known as 'Big Data') and algorithms like this are widely used by Google, Microsoft, Amazon, Twitter and Facebook.</li>
</ul>
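A full pass of stochastic gradient descent over randomly ordered data might look as follows. This is a sketch on synthetic data; the variable names mirror the lab's but the values are made up:

```python
import numpy as np

# synthetic data with known slope 1.4 and offset -3.1
np.random.seed(1)
x = np.random.normal(size=50)
y = 1.4*x - 3.1 + np.random.normal(scale=0.1, size=50)

m_star, c_star = 0.0, 0.0
learn_rate = 0.01

for epoch in range(100):
    # present the data points in a fresh random order each pass
    for i in np.random.permutation(x.shape[0]):
        m_star += 2*learn_rate*x[i]*(y[i] - m_star*x[i] - c_star)
        c_star += 2*learn_rate*(y[i] - m_star*x[i] - c_star)
print(m_star, c_star)
```

With a constant learning rate the parameters don't settle exactly on the minimum, but fluctuate in a small region around it.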
<h3 id="stochastic-gradient-descent-2">Stochastic Gradient Descent</h3>
<ul>
<li>Or more accurately, since the data is normally presented in a random order, we can just write <span class="math display">\[
m_\text{new} = m_\text{old} + 2\learnRate\left[\inputScalar_i (\dataScalar_i - m_\text{old}\inputScalar_i - c_\text{old})\right]
\]</span></li>
</ul>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="co"># choose a random point for the update </span>
i <span class="op">=</span> np.random.randint(x.shape[<span class="dv">0</span>]<span class="op">-</span><span class="dv">1</span>)
<span class="co"># update m</span>
m_star <span class="op">=</span> m_star <span class="op">+</span> <span class="dv">2</span><span class="op">*</span>learn_rate<span class="op">*</span>(x[i]<span class="op">*</span>(y[i]<span class="op">-</span>m_star<span class="op">*</span>x[i] <span class="op">-</span> c_star))
<span class="co"># update c</span>
c_star <span class="op">=</span> c_star <span class="op">+</span> <span class="dv">2</span><span class="op">*</span>learn_rate<span class="op">*</span>(y[i]<span class="op">-</span>m_star<span class="op">*</span>x[i] <span class="op">-</span> c_star)</code></pre></div>
<h3 id="sgd-for-linear-regression">SGD for Linear Regression</h3>
<p>Putting it all together in an algorithm, we can do stochastic gradient descent for our regression data.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">num_plots <span class="op">=</span> plot.regression_contour_sgd(x, y, diagrams<span class="op">=</span><span class="st">'../slides/diagrams/ml'</span>)</code></pre></div>
<object class="svgplot" data="../slides/diagrams/ml/regression_sgd_contour_fit058.svg">
</object>
<center>
<em>Stochastic gradient descent for linear regression </em>
</center>
<h3 id="reflection-on-linear-regression-and-supervised-learning">Reflection on Linear Regression and Supervised Learning</h3>
<p>Think about:</p>
<ol style="list-style-type: decimal">
<li><p>What effect does the learning rate have in the optimization? What's the effect of making it too small, what's the effect of making it too big? Do you get the same result for both stochastic and steepest gradient descent?</p></li>
<li><p>The stochastic gradient descent doesn't help very much for such a small data set. Its real advantage comes when there are many data points; you'll see this in the lab.</p></li>
</ol>
<h2 id="multiple-input-solution-with-linear-algebra">Multiple Input Solution with Linear Algebra</h2>
<p>You've now seen how slow it can be to perform coordinate descent on a system. Another approach to solving the system (which is not always possible, particularly in <em>non-linear</em> systems) is to go direct to the minimum. To do this we need to introduce <em>linear algebra</em>. We will represent all our errors and functions in the form of linear algebra. As we mentioned above, linear algebra is just a shorthand for performing lots of multiplications and additions simultaneously. What does it have to do with our system then? Well the first thing to note is that the linear function we were trying to fit has the following form: <span class="math display">\[
\mappingFunction(x) = mx + c
\]</span> the classical form for a straight line. From a linear algebraic perspective we are looking for multiplications and additions. We are also looking to separate our parameters from our data. The data are the <em>givens</em>: in French the word for data, données, literally translates as <em>givens</em>. That's convenient, because we don't need to change the data; what we need to change are the parameters (or variables) of the model. In this function the data comes in through <span class="math inline">\(x\)</span>, and the parameters are <span class="math inline">\(m\)</span> and <span class="math inline">\(c\)</span>.</p>
<p>What we'd like to create is a vector of parameters and a vector of data. Then we could represent the system with vectors that represent the data, and vectors that represent the parameters.</p>
<p>We look to turn the multiplications and additions into a linear algebraic form. We have one multiplication (<span class="math inline">\(m\times x\)</span>) and one addition (<span class="math inline">\(mx + c\)</span>). But we can turn this into an inner product by writing it in the following way, <span class="math display">\[
\mappingFunction(x) = m \times x +
c \times 1,
\]</span> in other words we've extracted the unit value, from the offset, <span class="math inline">\(c\)</span>. We can think of this unit value like an extra item of data, because it is always given to us, and it is always set to 1 (unlike regular data, which is likely to vary!). We can therefore write each input data location, <span class="math inline">\(\inputVector\)</span>, as a vector <span class="math display">\[
\inputVector = \begin{bmatrix} 1\\ x\end{bmatrix}.
\]</span></p>
<p>Now we choose to also turn our parameters into a vector. The parameter vector will be defined to contain <span class="math display">\[
\mappingVector = \begin{bmatrix} c \\ m\end{bmatrix}
\]</span> because if we now take the inner product between these two vectors we recover <span class="math display">\[
\inputVector\cdot\mappingVector = 1 \times c + x \times m = mx + c
\]</span> In <code>numpy</code> we can define this vector as follows</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="co"># define the vector w</span>
w <span class="op">=</span> np.zeros(shape<span class="op">=</span>(<span class="dv">2</span>, <span class="dv">1</span>))
w[<span class="dv">0</span>] <span class="op">=</span> c <span class="co"># first element is the offset, matching the 1 in x</span>
w[<span class="dv">1</span>] <span class="op">=</span> m</code></pre></div>
<p>This gives us the equivalence between the original operation and an operation in vector space. Whilst the notation here isn't a lot shorter, the beauty is that we will be able to add as many features as we like and still keep the same representation. In general, we are now moving to a system where each of our predictions is given by an inner product. When we want to represent an inner product in linear algebra, we tend to do it with the transpose operation, so since we have <span class="math inline">\(\mathbf{a}\cdot\mathbf{b} = \mathbf{a}^\top\mathbf{b}\)</span> we can write <span class="math display">\[
\mappingFunction(\inputVector_i) = \inputVector_i^\top\mappingVector.
\]</span> Where we've assumed that each data point, <span class="math inline">\(\inputVector_i\)</span>, is now written by appending a 1 onto the original vector <span class="math display">\[
\inputVector_i = \begin{bmatrix}
1 \\
\inputScalar_i
\end{bmatrix}
\]</span></p>
<h2 id="design-matrix">Design Matrix</h2>
<p>We can do this for the entire data set to form a <a href="http://en.wikipedia.org/wiki/Design_matrix"><em>design matrix</em></a> <span class="math inline">\(\inputMatrix\)</span>,</p>
<p><span class="math display">\[\inputMatrix
= \begin{bmatrix}
\inputVector_1^\top \\\
\inputVector_2^\top \\\
\vdots \\\
\inputVector_\numData^\top
\end{bmatrix} = \begin{bmatrix}
1 & \inputScalar_1 \\\
1 & \inputScalar_2 \\\
\vdots
& \vdots \\\
1 & \inputScalar_\numData
\end{bmatrix},\]</span></p>
<p>which in <code>numpy</code> can be done with the following commands:</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">X <span class="op">=</span> np.hstack((np.ones_like(x), x))
<span class="bu">print</span>(X)</code></pre></div>
<h3 id="writing-the-objective-with-linear-algebra">Writing the Objective with Linear Algebra</h3>
<p>When we think of the objective function, we can think of it in terms of errors, where each error is defined in a similar way to what it was in Legendre's day, <span class="math inline">\(\dataScalar_i - \mappingFunction(\inputVector_i)\)</span>; in statistics these errors are also sometimes called <a href="http://en.wikipedia.org/wiki/Errors_and_residuals_in_statistics"><em>residuals</em></a>. So we can think of the objective and the prediction function as two separate parts. First we have, <span class="math display">\[
\errorFunction(\mappingVector) = \sum_{i=1}^\numData (\dataScalar_i - \mappingFunction(\inputVector_i; \mappingVector))^2,
\]</span> where we've made the function <span class="math inline">\(\mappingFunction(\cdot)\)</span>'s dependence on the parameters <span class="math inline">\(\mappingVector\)</span> explicit in this equation. Then we have the definition of the function itself, <span class="math display">\[
\mappingFunction(\inputVector_i; \mappingVector) = \inputVector_i^\top \mappingVector.
\]</span> Let's look again at these two equations and see if we can identify any inner products. The first equation is a sum of squares, which is promising. Any sum of squares can be represented by an inner product, <span class="math display">\[
a = \sum_{i=1}^{k} b^2_i = \mathbf{b}^\top\mathbf{b},
\]</span> so if we wish to represent <span class="math inline">\(\errorFunction(\mappingVector)\)</span> in this way, all we need to do is convert the sum operator to an inner product. We can get a vector from that sum operator by placing both <span class="math inline">\(\dataScalar_i\)</span> and <span class="math inline">\(\mappingFunction(\inputVector_i; \mappingVector)\)</span> into vectors, which we do by defining <span class="math display">\[
\dataVector = \begin{bmatrix}\dataScalar_1\\ \dataScalar_2\\ \vdots \\ \dataScalar_\numData\end{bmatrix}
\]</span> and defining <span class="math display">\[
\mappingFunctionVector(\inputMatrix; \mappingVector) = \begin{bmatrix}\mappingFunction(\inputVector_1; \mappingVector)\\ \mappingFunction(\inputVector_2; \mappingVector)\\ \vdots \\ \mappingFunction(\inputVector_\numData; \mappingVector)\end{bmatrix}.
\]</span> The second of these is actually a vector-valued function. This term may appear intimidating, but the idea is straightforward. A vector valued function is simply a vector whose elements are themselves defined as <em>functions</em>, i.e. it is a vector of functions, rather than a vector of scalars. The idea is so straightforward, that we are going to ignore it for the moment, and barely use it in the derivation. But it will reappear later when we introduce <em>basis functions</em>. So we will, for the moment, ignore the dependence of <span class="math inline">\(\mappingFunctionVector\)</span> on <span class="math inline">\(\mappingVector\)</span> and <span class="math inline">\(\inputMatrix\)</span> and simply summarise it by a vector of numbers <span class="math display">\[
\mappingFunctionVector = \begin{bmatrix}\mappingFunction_1\\\mappingFunction_2\\
\vdots \\ \mappingFunction_\numData\end{bmatrix}.
\]</span> This allows us to write our objective in the following, linear algebraic form, <span class="math display">\[
\errorFunction(\mappingVector) = (\dataVector - \mappingFunctionVector)^\top(\dataVector - \mappingFunctionVector)
\]</span> from the rules of inner products. But what of our matrix <span class="math inline">\(\inputMatrix\)</span> of input data? At this point, we need to dust off <a href="http://en.wikipedia.org/wiki/Matrix_multiplication"><em>matrix-vector multiplication</em></a>. Matrix multiplication is simply a convenient way of performing many inner products together, and it's exactly what we need to summarise the operation <span class="math display">\[
f_i = \inputVector_i^\top\mappingVector.
\]</span> This operation tells us that each element of the vector <span class="math inline">\(\mappingFunctionVector\)</span> (our vector valued function) is given by an inner product between <span class="math inline">\(\inputVector_i\)</span> and <span class="math inline">\(\mappingVector\)</span>. In other words it is a series of inner products. Let's look at the definition of matrix multiplication; it takes the form <span class="math display">\[
\mathbf{c} = \mathbf{B}\mathbf{a}
\]</span> where <span class="math inline">\(\mathbf{c}\)</span> might be a <span class="math inline">\(k\)</span> dimensional vector (which we can interpret as a <span class="math inline">\(k\times 1\)</span> dimensional matrix), and <span class="math inline">\(\mathbf{B}\)</span> is a <span class="math inline">\(k\times k\)</span> dimensional matrix and <span class="math inline">\(\mathbf{a}\)</span> is a <span class="math inline">\(k\)</span> dimensional vector (<span class="math inline">\(k\times 1\)</span> dimensional matrix).</p>
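As a quick numerical check of the sum-of-squares identity above, we can compare the explicit sum with the inner product in <code>numpy</code> (a small sketch with illustrative values):

```python
import numpy as np

# verify that a sum of squares is an inner product: sum_i b_i^2 = b^T b
b = np.array([1.0, -2.0, 3.0])
sum_of_squares = np.sum(b**2)
inner_product = np.dot(b, b)
print(sum_of_squares, inner_product)  # both give 14.0
```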
<p>The result of this multiplication is of the form <span class="math display">\[
\begin{bmatrix}c_1\\c_2 \\ \vdots \\
c_k\end{bmatrix} =
\begin{bmatrix} b_{1,1} & b_{1, 2} & \dots & b_{1, k} \\
b_{2, 1} & b_{2, 2} & \dots & b_{2, k} \\
\vdots & \vdots & \ddots & \vdots \\
b_{k, 1} & b_{k, 2} & \dots & b_{k, k} \end{bmatrix} \begin{bmatrix}a_1\\a_2 \\
\vdots\\ a_k\end{bmatrix} = \begin{bmatrix} b_{1, 1}a_1 + b_{1, 2}a_2 + \dots +
b_{1, k}a_k\\
b_{2, 1}a_1 + b_{2, 2}a_2 + \dots + b_{2, k}a_k \\
\vdots\\
b_{k, 1}a_1 + b_{k, 2}a_2 + \dots + b_{k, k}a_k\end{bmatrix}
\]</span> so we see that each element of the result, <span class="math inline">\(\mathbf{c}\)</span>, is simply the inner product between each <em>row</em> of <span class="math inline">\(\mathbf{B}\)</span> and the vector <span class="math inline">\(\mathbf{a}\)</span>. Because we have defined each element of <span class="math inline">\(\mappingFunctionVector\)</span> to be given by the inner product between each <em>row</em> of the design matrix and the vector <span class="math inline">\(\mappingVector\)</span> we can now write the full operation in one matrix multiplication, <span class="math display">\[
\mappingFunctionVector = \inputMatrix\mappingVector.
\]</span></p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">f <span class="op">=</span> np.dot(X, w) <span class="co"># np.dot does matrix multiplication in python</span></code></pre></div>
<p>Combining this result with our objective function, <span class="math display">\[
\errorFunction(\mappingVector) = (\dataVector - \mappingFunctionVector)^\top(\dataVector - \mappingFunctionVector)
\]</span> we find we have defined the <em>model</em> with two equations. One equation tells us the form of our predictive function and how it depends on its parameters, the other tells us the form of our objective function.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">resid <span class="op">=</span> (y<span class="op">-</span>f)
E <span class="op">=</span> np.dot(resid.T, resid) <span class="co"># matrix multiplication on a single vector is equivalent to a dot product.</span>
<span class="bu">print</span>(<span class="st">"Error function is:"</span>, E)</code></pre></div>
<h3 id="question-4">Question 4</h3>
<p>The prediction for our movie recommender system had the form <span class="math display">\[
f_{i,j} = \mathbf{u}_i^\top \mathbf{v}_j
\]</span> and the objective function was then <span class="math display">\[
E = \sum_{i,j} s_{i,j}(\dataScalar_{i,j} - f_{i, j})^2
\]</span> Try writing this down in matrix and vector form. For each variable and parameter, think carefully about whether it should be represented as a matrix or a vector. Do as many of the terms as you can. Use <span class="math inline">\(\LaTeX\)</span> to give your answers and give the <em>dimensions</em> of any matrices you create.</p>
<p><em>20 marks</em></p>
<h3 id="write-your-answer-to-question-4-here">Write your answer to Question 4 here</h3>
<h2 id="objective-optimisation">Objective Optimisation</h2>
<p>Our <em>model</em> has now been defined with two equations, the prediction function and the objective function. Next we will use multivariate calculus to define an <em>algorithm</em> to fit the model. The separation between model and algorithm is important and is often overlooked. Our model contains a function that shows how it will be used for prediction, and a function that describes the objective function we need to optimise to obtain a good set of parameters.</p>
<p>The linear regression model we have described is still the same as the one we fitted above with a coordinate ascent algorithm. We have only changed the notation to express the same model in matrix and vector form. However, we will now fit this model with a different algorithm, one that is much faster. It is such a widely used algorithm that from the end user's perspective it doesn't even look like an algorithm, it just appears to be a single operation (or function). However, underneath the computer calls an algorithm to find the solution. Further, the algorithm we obtain is very widely used, and because of this it turns out to be highly optimised.</p>
<p>Once again we are going to find the minimum of our objective by seeking its <em>stationary points</em>. However, the stationary points of a multivariate function are a little bit more complex to find. Once again we need to find the point at which the derivative is zero, but now we need to use <em>multivariate calculus</em> to find it. This involves learning a few additional rules of differentiation (that allow you to do the derivatives of a function with respect to a vector), but in the end it makes things quite a bit easier. We define vectorial derivatives as follows, <span class="math display">\[
\frac{\text{d}\errorFunction(\mappingVector)}{\text{d}\mappingVector} =
\begin{bmatrix}\frac{\text{d}\errorFunction(\mappingVector)}{\text{d}\mappingScalar_1}\\\frac{\text{d}\errorFunction(\mappingVector)}{\text{d}\mappingScalar_2}\end{bmatrix}.
\]</span> where <span class="math inline">\(\frac{\text{d}\errorFunction(\mappingVector)}{\text{d}\mappingScalar_1}\)</span> is the <a href="http://en.wikipedia.org/wiki/Partial_derivative">partial derivative</a> of the error function with respect to <span class="math inline">\(\mappingScalar_1\)</span>.</p>
<p>Differentiation through multiplications and additions is relatively straightforward, and since linear algebra is just multiplication and addition, its rules of differentiation are quite straightforward too, but slightly more complex than regular derivatives.</p>
<h3 id="multivariate-derivatives">Multivariate Derivatives</h3>
<p>We will need two rules of multivariate or <em>matrix</em> differentiation. The first is differentiation of an inner product. By remembering that the inner product is made up of multiplication and addition, we can hope that its derivative is quite straightforward, and so it proves to be. We can start by thinking about the definition of the inner product, <span class="math display">\[
\mathbf{a}^\top\mathbf{z} = \sum_{i} a_i
z_i,
\]</span> which if we were to take the derivative with respect to <span class="math inline">\(z_k\)</span> would simply return the gradient of the one term in the sum for which the derivative was non zero, that of <span class="math inline">\(a_k\)</span>, so we know that <span class="math display">\[
\frac{\text{d}}{\text{d}z_k} \mathbf{a}^\top \mathbf{z} = a_k
\]</span> and by our definition of multivariate derivatives we can simply stack all the partial derivatives of this form in a vector to obtain the result that <span class="math display">\[
\frac{\text{d}}{\text{d}\mathbf{z}}
\mathbf{a}^\top \mathbf{z} = \mathbf{a}.
\]</span> The second rule that's required is differentiation of a 'matrix quadratic'. A scalar quadratic in <span class="math inline">\(z\)</span> with coefficient <span class="math inline">\(c\)</span> has the form <span class="math inline">\(cz^2\)</span>. If <span class="math inline">\(\mathbf{z}\)</span> is a <span class="math inline">\(k\times 1\)</span> vector and <span class="math inline">\(\mathbf{C}\)</span> is a <span class="math inline">\(k \times k\)</span> <em>matrix</em> of coefficients then the matrix quadratic form is written as <span class="math inline">\(\mathbf{z}^\top \mathbf{C}\mathbf{z}\)</span>, which is itself a <em>scalar</em> quantity, but it is a function of a <em>vector</em>.</p>
<h4 id="matching-dimensions-in-matrix-multiplications">Matching Dimensions in Matrix Multiplications</h4>
<p>There's a trick for telling that it's a scalar result. When you are doing maths with matrices, it's always worth pausing to perform a quick sanity check on the dimensions. Matrix multiplication only works when the dimensions match. To be precise, the 'inner' dimensions of the matrices must match. What is the inner dimension? If we multiply two matrices <span class="math inline">\(\mathbf{A}\)</span> and <span class="math inline">\(\mathbf{B}\)</span>, the first of which has <span class="math inline">\(k\)</span> rows and <span class="math inline">\(\ell\)</span> columns and the second of which has <span class="math inline">\(p\)</span> rows and <span class="math inline">\(q\)</span> columns, then we can check whether the multiplication works by writing the dimensionalities next to each other, <span class="math display">\[
\mathbf{A} \mathbf{B} \rightarrow (k \times
\underbrace{\ell)(p}_\text{inner dimensions} \times q) \rightarrow (k\times q).
\]</span> The inner dimensions are the two inside dimensions, <span class="math inline">\(\ell\)</span> and <span class="math inline">\(p\)</span>. The multiplication will only work if <span class="math inline">\(\ell=p\)</span>. The result of the multiplication will then be a <span class="math inline">\(k\times q\)</span> matrix: this dimensionality comes from the 'outer dimensions'. Note that matrix multiplication is not <a href="http://en.wikipedia.org/wiki/Commutative_property"><em>commutative</em></a>. If you change the order of the multiplication, <span class="math display">\[
\mathbf{B} \mathbf{A} \rightarrow (p \times \underbrace{q)(k}_\text{inner dimensions} \times \ell) \rightarrow (p \times \ell),
\]</span> firstly it may no longer even work, because now the condition is that <span class="math inline">\(q=k\)</span>, and secondly the result will generally have a different dimensionality. An exception is when the matrices are square (i.e. they have the same number of rows as columns): then both orders of multiplication are defined and give results of the same size, although the two products will still generally differ. A special class of matrix we will make use of is the <em>symmetric</em> matrices. A symmetric matrix is one for which <span class="math inline">\(\mathbf{A}=\mathbf{A}^\top\)</span>, or equivalently, <span class="math inline">\(a_{i,j} = a_{j,i}\)</span> for all <span class="math inline">\(i\)</span> and <span class="math inline">\(j\)</span>.</p>
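The dimension-matching rule is easy to check numerically. A small sketch with hypothetical shapes (<span class="math inline">\(k=2\)</span>, <span class="math inline">\(\ell=p=3\)</span>, <span class="math inline">\(q=4\)</span>):

```python
import numpy as np

A = np.ones((2, 3))  # k=2 rows, ell=3 columns
B = np.ones((3, 4))  # p=3 rows, q=4 columns

# the inner dimensions match (ell = p = 3), so the product is defined
C = np.dot(A, B)
print(C.shape)  # the outer dimensions: (2, 4)

# reversed, the inner dimensions are q=4 and k=2, so the product fails
try:
    np.dot(B, A)
    reversed_works = True
except ValueError:
    reversed_works = False
print(reversed_works)  # False
```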
<p>You will need to get used to working with matrices and vectors when applying and developing new machine learning techniques. You should have come across them before, but you may not have used them as extensively as we will now do in this course. You should get used to using this trick to check your work and ensure you know what the dimension of an output matrix should be. For our matrix quadratic form, it turns out that we can see it as a special type of inner product. <span class="math display">\[
\mathbf{z}^\top\mathbf{C}\mathbf{z} \rightarrow (1\times
\underbrace{k) (k}_\text{inner dimensions}\times k) (k\times 1) \rightarrow
\mathbf{b}^\top\mathbf{z}
\]</span> where <span class="math inline">\(\mathbf{b} = \mathbf{C}\mathbf{z}\)</span> so therefore the result is a scalar, <span class="math display">\[
\mathbf{b}^\top\mathbf{z} \rightarrow
(1\times \underbrace{k) (k}_\text{inner dimensions}\times 1) \rightarrow
(1\times 1)
\]</span> where a <span class="math inline">\((1\times 1)\)</span> matrix is recognised as a scalar.</p>
<p>This implies that we should be able to differentiate this form, and indeed the rule for its differentiation is slightly more complex than the inner product, but still quite simple, <span class="math display">\[
\frac{\text{d}}{\text{d}\mathbf{z}}
\mathbf{z}^\top\mathbf{C}\mathbf{z}= \mathbf{C}\mathbf{z} + \mathbf{C}^\top
\mathbf{z}.
\]</span> Note that in the special case where <span class="math inline">\(\mathbf{C}\)</span> is symmetric then we have <span class="math inline">\(\mathbf{C} = \mathbf{C}^\top\)</span> and the derivative simplifies to <span class="math display">\[
\frac{\text{d}}{\text{d}\mathbf{z}} \mathbf{z}^\top\mathbf{C}\mathbf{z}=
2\mathbf{C}\mathbf{z}.
\]</span></p>
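Both differentiation rules can be sanity-checked with finite differences. A sketch with small hypothetical vectors and a (deliberately non-symmetric) coefficient matrix:

```python
import numpy as np

def finite_difference_grad(f, z, eps=1e-6):
    """Approximate df/dz for a scalar function of a vector by central differences."""
    grad = np.zeros_like(z)
    for k in range(len(z)):
        dz = np.zeros_like(z)
        dz[k] = eps
        grad[k] = (f(z + dz) - f(z - dz)) / (2*eps)
    return grad

a = np.array([1.0, 2.0, -0.5])
C = np.array([[2.0, 1.0], [0.0, 3.0]])  # non-symmetric, to exercise the general rule
z3 = np.array([0.3, -1.2, 0.7])
z2 = np.array([0.5, -0.25])

# rule 1: d/dz a^T z = a
g_inner = finite_difference_grad(lambda z: np.dot(a, z), z3)

# rule 2: d/dz z^T C z = C z + C^T z
g_quad = finite_difference_grad(lambda z: np.dot(z, np.dot(C, z)), z2)

print(np.allclose(g_inner, a, atol=1e-4))
print(np.allclose(g_quad, np.dot(C, z2) + np.dot(C.T, z2), atol=1e-4))
```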
<h3 id="differentiate-the-objective">Differentiate the Objective</h3>
<p>First, we need to compute the full objective by substituting our prediction function into the objective function to obtain the objective in terms of <span class="math inline">\(\mappingVector\)</span>. Doing this we obtain <span class="math display">\[
\errorFunction(\mappingVector)= (\dataVector - \inputMatrix\mappingVector)^\top (\dataVector - \inputMatrix\mappingVector).
\]</span> We now need to differentiate this <em>quadratic form</em> to find the minimum. We differentiate with respect to the <em>vector</em> <span class="math inline">\(\mappingVector\)</span>. But before we do that, we'll expand the brackets in the quadratic form to obtain a series of scalar terms. The rules for bracket expansion across the vectors are similar to those for the scalar system giving, <span class="math display">\[
(\mathbf{a} - \mathbf{b})^\top
(\mathbf{c} - \mathbf{d}) = \mathbf{a}^\top \mathbf{c} - \mathbf{a}^\top
\mathbf{d} - \mathbf{b}^\top \mathbf{c} + \mathbf{b}^\top \mathbf{d}
\]</span> which substituting for <span class="math inline">\(\mathbf{a} = \mathbf{c} = \dataVector\)</span> and <span class="math inline">\(\mathbf{b}=\mathbf{d} = \inputMatrix\mappingVector\)</span> gives <span class="math display">\[
\errorFunction(\mappingVector)=
\dataVector^\top\dataVector - 2\dataVector^\top\inputMatrix\mappingVector +
\mappingVector^\top\inputMatrix^\top\inputMatrix\mappingVector
\]</span> where we used the fact that <span class="math inline">\(\dataVector^\top\inputMatrix\mappingVector=\mappingVector^\top\inputMatrix^\top\dataVector\)</span>. Now we can use our rules of differentiation to compute the derivative of this form, which is, <span class="math display">\[
\frac{\text{d}}{\text{d}\mappingVector}\errorFunction(\mappingVector)=- 2\inputMatrix^\top \dataVector +
2\inputMatrix^\top\inputMatrix\mappingVector,
\]</span> where we have exploited the fact that <span class="math inline">\(\inputMatrix^\top\inputMatrix\)</span> is symmetric to obtain this result.</p>
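We can check this gradient numerically against the objective itself. A sketch on small synthetic data (named `X_check`, `y_check` so as not to overwrite the notebook's variables):

```python
import numpy as np

# small synthetic check (hypothetical data)
np.random.seed(1)
X_check = np.hstack([np.ones((10, 1)), np.random.rand(10, 1)])
y_check = np.random.rand(10, 1)
w_check = np.random.randn(2, 1)

def error(w):
    resid = y_check - np.dot(X_check, w)
    return float(np.dot(resid.T, resid))

# analytic gradient: -2 X^T y + 2 X^T X w
grad_analytic = (-2*np.dot(X_check.T, y_check)
                 + 2*np.dot(X_check.T, np.dot(X_check, w_check)))

# central finite differences, one parameter at a time
eps = 1e-6
grad_numeric = np.zeros_like(w_check)
for k in range(len(w_check)):
    dw = np.zeros_like(w_check)
    dw[k] = eps
    grad_numeric[k] = (error(w_check + dw) - error(w_check - dw))/(2*eps)

print(np.allclose(grad_analytic, grad_numeric, atol=1e-4))
```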
<h3 id="question-5">Question 5</h3>
<p>Use the equivalence between our vector and our matrix formulations of linear regression, alongside our definition of vector derivates, to match the gradients we've computed directly for <span class="math inline">\(\frac{\text{d}\errorFunction(c, m)}{\text{d}c}\)</span> and <span class="math inline">\(\frac{\text{d}\errorFunction(c, m)}{\text{d}m}\)</span> to those for <span class="math inline">\(\frac{\text{d}\errorFunction(\mappingVector)}{\text{d}\mappingVector}\)</span>.</p>
<p><em>20 marks</em></p>
<h3 id="write-your-answer-to-question-5-here">Write your answer to Question 5 here</h3>
<h2 id="update-equation-for-global-optimum">Update Equation for Global Optimum</h2>
<p>Once again, we need to find the minimum of our objective function. Using our objective for multiple input regression we can now minimise with respect to our parameter vector <span class="math inline">\(\mappingVector\)</span>. Firstly, just as in the single input case, we seek stationary points by finding parameter vectors for which the gradients are zero, <span class="math display">\[
\mathbf{0}=- 2\inputMatrix^\top
\dataVector + 2\inputMatrix^\top\inputMatrix\mappingVector,
\]</span> where <span class="math inline">\(\mathbf{0}\)</span> is a <em>vector</em> of zeros. Rearranging this equation we find the solution to be <span class="math display">\[
\mappingVector = \left[\inputMatrix^\top \inputMatrix\right]^{-1} \inputMatrix^\top
\dataVector
\]</span> where <span class="math inline">\(\mathbf{A}^{-1}\)</span> denotes <a href="http://en.wikipedia.org/wiki/Invertible_matrix"><em>matrix inverse</em></a>.</p>
<h3 id="solving-the-multivariate-system">Solving the Multivariate System</h3>
<p>The solution for <span class="math inline">\(\mappingVector\)</span> is given in terms of a matrix inverse, but computation of a matrix inverse requires, in itself, an algorithm to resolve it. You'll know this if you had to invert, by hand, a <span class="math inline">\(3\times 3\)</span> matrix in high school. From a numerical stability perspective, it is also best not to compute the matrix inverse directly, but rather to ask the computer to <em>solve</em> the system of linear equations given by <span class="math display">\[\inputMatrix^\top\inputMatrix \mappingVector = \inputMatrix^\top\dataVector\]</span> for <span class="math inline">\(\mappingVector\)</span>. This can be done in <code>numpy</code> using the command</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">np.linalg.solve?</code></pre></div>
<p>so we can obtain the solution using</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">w <span class="op">=</span> np.linalg.solve(np.dot(X.T, X), np.dot(X.T, y))
<span class="bu">print</span>(w)</code></pre></div>
<p>We can map it back to the linear regression and plot the fit as follows</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">m <span class="op">=</span> w[<span class="dv">1</span>]<span class="op">;</span> c<span class="op">=</span>w[<span class="dv">0</span>]
f_test <span class="op">=</span> m<span class="op">*</span>x_test <span class="op">+</span> c
<span class="bu">print</span>(m)
<span class="bu">print</span>(c)
plt.plot(x_test, f_test, <span class="st">'b-'</span>)
plt.plot(x, y, <span class="st">'rx'</span>)</code></pre></div>
<h3 id="multivariate-linear-regression">Multivariate Linear Regression</h3>
<p>A major advantage of the new system is that we can build a linear regression on a multivariate system. The matrix calculus didn't specify what the length of the vector <span class="math inline">\(\inputVector\)</span> should be, or equivalently the size of the design matrix.</p>
<h3 id="movie-body-count-data">Movie Body Count Data</h3>
<p>Let's consider the movie body count data.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">data <span class="op">=</span> pods.datasets.movie_body_count()
movies <span class="op">=</span> data[<span class="st">'Y'</span>]</code></pre></div>
<p>Let's remind ourselves of the features we've been provided with.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="bu">print</span>(<span class="st">', '</span>.join(movies.columns))</code></pre></div>
<p>Now we will build a design matrix based on the numeric features Year, Body_Count and Length_Minutes, in an effort to predict the rating. We build the design matrix as follows:</p>
<h3 id="relation-to-single-input-system">Relation to Single Input System</h3>
<p>Bias as an additional feature.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">select_features <span class="op">=</span> [<span class="st">'Year'</span>, <span class="st">'Body_Count'</span>, <span class="st">'Length_Minutes'</span>]
X <span class="op">=</span> movies[select_features].copy() <span class="co"># copy so adding a column doesn't modify a slice of the original frame</span>
X[<span class="st">'Eins'</span>] <span class="op">=</span> <span class="dv">1</span> <span class="co"># add a column for the offset</span>
y <span class="op">=</span> movies[[<span class="st">'IMDB_Rating'</span>]]</code></pre></div>
<p>Now let's perform a linear regression. But this time, we will create a pandas data frame for the result so we can store it in a form that we can visualise easily.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">import</span> pandas <span class="im">as</span> pd
w <span class="op">=</span> pd.DataFrame(data<span class="op">=</span>np.linalg.solve(np.dot(X.T, X), np.dot(X.T, y)), <span class="co"># solve linear regression here</span>
index <span class="op">=</span> X.columns, <span class="co"># columns of X become rows of w</span>
columns<span class="op">=</span>[<span class="st">'regression_coefficient'</span>]) <span class="co"># the column of X is the value of regression coefficient</span></code></pre></div>
<p>We can check the residuals to see how good our estimates are</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">(y <span class="op">-</span> np.dot(X, w)).hist()</code></pre></div>
<p>This shows our model <em>hasn't</em> yet done a great job of representing the data, because the spread of the residuals is large. We can check what the rating is dominated by in terms of regression coefficients.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">w</code></pre></div>
<p>We have to be a little careful about interpretation because our input values live on different scales, but it looks like we are dominated by the bias, with a small negative effect for later films (bear in mind the years are large numbers, so this effect is probably larger than it looks) and a positive effect for length. So it looks like long, earlier films generally do better, but the residuals are so high that we probably haven't modelled the system very well.</p>
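One way to make the coefficients more comparable is to standardise each input column to zero mean and unit standard deviation before solving. Here's a sketch on synthetic data standing in for the movie features (the values are illustrative only, not the real data set):

```python
import numpy as np
import pandas as pd

# synthetic stand-in for the movie features (illustrative values only)
np.random.seed(2)
X_films = pd.DataFrame({'Year': np.random.randint(1960, 2010, size=50),
                        'Body_Count': np.random.randint(0, 300, size=50),
                        'Length_Minutes': np.random.randint(80, 180, size=50)})
y_films = np.random.rand(50, 1)*10  # stand-in ratings

# standardise each feature to zero mean, unit standard deviation
X_std = (X_films - X_films.mean())/X_films.std()
X_std['Eins'] = 1  # offset column, as before

w_std = pd.DataFrame(np.linalg.solve(np.dot(X_std.T, X_std),
                                     np.dot(X_std.T, y_films)),
                     index=X_std.columns,
                     columns=['regression_coefficient'])
print(w_std)
```

After standardising, each coefficient reflects the effect of a one-standard-deviation change in its feature, so the magnitudes are directly comparable; note also that with zero-mean features the 'Eins' coefficient is simply the mean of the targets.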
<p><a href="https://www.youtube.com/watch?v=ui-uNlFHoms&t="><img src="https://img.youtube.com/vi/ui-uNlFHoms/0.jpg" /></a></p>
<p><a href="https://www.youtube.com/watch?v=78YNphT90-k&t="><img src="https://img.youtube.com/vi/78YNphT90-k/0.jpg" /></a></p>
<h3 id="solution-with-qr-decomposition">Solution with QR Decomposition</h3>
<p>Performing a solve instead of a matrix inverse is the more numerically stable approach, but we can do even better. A <a href="http://en.wikipedia.org/wiki/QR_decomposition">QR-decomposition</a> factorises a matrix into the product of an orthogonal matrix, <span class="math inline">\(\mathbf{Q}\)</span>, for which <span class="math inline">\(\mathbf{Q}^\top \mathbf{Q} = \eye\)</span>, and an upper triangular matrix, <span class="math inline">\(\mathbf{R}\)</span>. Substituting <span class="math inline">\(\inputMatrix = \mathbf{Q}\mathbf{R}\)</span> into the normal equations gives <span class="math display">\[
\inputMatrix^\top \inputMatrix \mappingVector =
\inputMatrix^\top \dataVector
\]</span> <span class="math display">\[
(\mathbf{Q}\mathbf{R})^\top
(\mathbf{Q}\mathbf{R})\mappingVector = (\mathbf{Q}\mathbf{R})^\top
\dataVector
\]</span> <span class="math display">\[
\mathbf{R}^\top (\mathbf{Q}^\top \mathbf{Q}) \mathbf{R}
\mappingVector = \mathbf{R}^\top \mathbf{Q}^\top \dataVector
\]</span> <span class="math display">\[
\mathbf{R}^\top \mathbf{R} \mappingVector = \mathbf{R}^\top \mathbf{Q}^\top
\dataVector
\]</span> <span class="math display">\[
\mathbf{R} \mappingVector = \mathbf{Q}^\top \dataVector
\]</span> This is a more numerically stable solution because it removes the need to compute <span class="math inline">\(\inputMatrix^\top\inputMatrix\)</span> as an intermediate. Computing <span class="math inline">\(\inputMatrix^\top\inputMatrix\)</span> is a bad idea because it involves squaring all the elements of <span class="math inline">\(\inputMatrix\)</span> and thereby potentially reducing the numerical precision with which we can represent the solution. Operating on <span class="math inline">\(\inputMatrix\)</span> directly preserves the numerical precision of the model.</p>
<p>This can be seen more clearly when we begin to work with <em>basis functions</em> in the next session. Some systems that can be resolved with the QR decomposition cannot be resolved by using <code>solve</code> on the normal equations directly.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">import numpy as np
import pandas as pd
import scipy.linalg

Q, R = np.linalg.qr(X)
w = scipy.linalg.solve_triangular(R, np.dot(Q.T, y))
w = pd.DataFrame(w, index=X.columns)
w</code></pre></div>
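<p>As a sketch (not part of the original lecture code) of why forming <span class="math inline">\(\inputMatrix^\top\inputMatrix\)</span> is harmful: the condition number of <span class="math inline">\(\inputMatrix^\top\inputMatrix\)</span> is roughly the square of that of <span class="math inline">\(\inputMatrix\)</span>, while the QR route never forms the product. The design matrix and targets below are illustrative choices.</p>

```python
import numpy as np

# Illustrative, mildly ill-conditioned polynomial design matrix.
x = np.linspace(0, 1, 50)
X = np.vander(x, 6)
y = np.sin(4 * x)

# Forming X^T X roughly squares the condition number of the problem.
print(np.linalg.cond(X))        # moderate
print(np.linalg.cond(X.T @ X))  # roughly its square

# The QR route operates on X directly: R beta = Q^T y.
Q, R = np.linalg.qr(X)
w_qr = np.linalg.solve(R, Q.T @ y)

# It agrees with the SVD-based least squares solver, which is also stable.
w_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]
```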
<h3 id="reading-1">Reading</h3>
<ul>
<li>Section 1.3 of <span class="citation">Rogers and Girolami (2011)</span> for Matrix &amp; Vector Review.</li>
</ul>
<h3 id="basis-functions">Basis Functions</h3>
<p>Here's the idea: instead of working directly on the original input space, <span class="math inline">\(\inputVector\)</span>, we build models in a new space, <span class="math inline">\(\basisVector(\inputVector)\)</span>, where <span class="math inline">\(\basisVector(\cdot)\)</span> is a <em>vector-valued</em> function defined on the input space <span class="math inline">\(\inputVector\)</span>.</p>
<h3 id="quadratic-basis">Quadratic Basis</h3>
<p>Remember that a <em>vector-valued function</em> is just a vector that contains functions instead of values. Here's an example for a one-dimensional input space, <span class="math inline">\(x\)</span>, being projected to a <em>quadratic</em> basis. First we consider each basis function in turn; we can think of the elements of our vector as being indexed so that we have <span class="math display">\[
\begin{align*}
\basisFunc_1(\inputScalar) = 1, \\
\basisFunc_2(\inputScalar) = \inputScalar, \\
\basisFunc_3(\inputScalar) = \inputScalar^2.
\end{align*}
\]</span> Now we can consider them together by placing them in a vector, <span class="math display">\[
\basisVector(\inputScalar) = \begin{bmatrix} 1\\ \inputScalar \\ \inputScalar^2\end{bmatrix}.
\]</span> For the vector-valued function, we have simply collected the different functions together in the same vector making them notationally easier to deal with in our mathematics.</p>
<p>When we consider the vector-valued function for each data point, then we place all the data into a matrix. The result is a matrix valued function, <span class="math display">\[
\basisMatrix(\inputVector) =
\begin{bmatrix} 1 & \inputScalar_1 &
\inputScalar_1^2 \\
1 & \inputScalar_2 & \inputScalar_2^2\\
\vdots & \vdots & \vdots \\
1 & \inputScalar_n & \inputScalar_n^2
\end{bmatrix}
\]</span> where we are still in the one dimensional input setting so <span class="math inline">\(\inputVector\)</span> here represents a vector of our inputs with <span class="math inline">\(\numData\)</span> elements.</p>
<p>Let's try constructing such a matrix for a set of inputs. First of all, we create a function that returns the matrix-valued function.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">import</span> numpy <span class="im">as</span> np</code></pre></div>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="kw">def</span> quadratic(x, <span class="op">**</span>kwargs):
<span class="co">"""Take in a vector of input values and return the design matrix associated </span>
<span class="co"> with the basis functions."""</span>
<span class="cf">return</span> np.hstack([np.ones((x.shape[<span class="dv">0</span>], <span class="dv">1</span>)), x, x<span class="op">**</span><span class="dv">2</span>])</code></pre></div>
<h3 id="functions-derived-from-quadratic-basis">Functions Derived from Quadratic Basis</h3>
<p><span class="math display">\[
\mappingFunction(\inputScalar) = {\color{cyan}\mappingScalar_0} + {\color{green}\mappingScalar_1 \inputScalar} + {\color{yellow}\mappingScalar_2 \inputScalar^2}
\]</span></p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">import</span> matplotlib.pyplot <span class="im">as</span> plt
<span class="im">import</span> teaching_plots <span class="im">as</span> plot</code></pre></div>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">f, ax <span class="op">=</span> plt.subplots(figsize<span class="op">=</span>plot.big_wide_figsize)
loc <span class="op">=</span>[[<span class="dv">0</span>, <span class="fl">1.4</span>,],
[<span class="dv">0</span>, <span class="op">-</span><span class="fl">0.7</span>],
[<span class="fl">0.75</span>, <span class="op">-</span><span class="fl">0.2</span>]]
text <span class="op">=</span>[<span class="st">'$\phi(x) = 1$'</span>,
<span class="st">'$\phi(x) = x$'</span>,
<span class="st">'$\phi(x) = x^2$'</span>]
plot.basis(quadratic, x_min<span class="op">=-</span><span class="fl">1.3</span>, x_max<span class="op">=</span><span class="fl">1.3</span>,
fig<span class="op">=</span>f, ax<span class="op">=</span>ax, loc<span class="op">=</span>loc, text<span class="op">=</span>text,
diagrams<span class="op">=</span><span class="st">'../slides/diagrams/ml'</span>)</code></pre></div>
<object class="svgplot" align data="../slides/diagrams/ml/quadratic_basis002.svg">
</object>
<p>This function takes in an <span class="math inline">\(\numData \times 1\)</span> dimensional vector and returns an <span class="math inline">\(\numData \times 3\)</span> dimensional <em>design matrix</em> containing the basis functions. We can plot those basis functions against their inputs as follows.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="co"># first let's generate some inputs</span>
n <span class="op">=</span> <span class="dv">100</span>
x <span class="op">=</span> np.zeros((n, <span class="dv">1</span>)) <span class="co"># create a data set of zeros</span>
x[:, <span class="dv">0</span>] <span class="op">=</span> np.linspace(<span class="op">-</span><span class="dv">1</span>, <span class="dv">1</span>, n) <span class="co"># fill it with values between -1 and 1</span>
Phi <span class="op">=</span> quadratic(x)
fig, ax <span class="op">=</span> plt.subplots(figsize<span class="op">=</span>plot.big_wide_figsize)
ax.set_ylim([<span class="op">-</span><span class="fl">1.2</span>, <span class="fl">1.2</span>]) <span class="co"># set y limits to ensure basis functions show.</span>
ax.plot(x[:,<span class="dv">0</span>], Phi[:, <span class="dv">0</span>], <span class="st">'r-'</span>, label <span class="op">=</span> <span class="st">'$\phi=1$'</span>, linewidth<span class="op">=</span><span class="dv">3</span>)
ax.plot(x[:,<span class="dv">0</span>], Phi[:, <span class="dv">1</span>], <span class="st">'g-'</span>, label <span class="op">=</span> <span class="st">'$\phi=x$'</span>, linewidth<span class="op">=</span><span class="dv">3</span>)
ax.plot(x[:,<span class="dv">0</span>], Phi[:, <span class="dv">2</span>], <span class="st">'b-'</span>, label <span class="op">=</span> <span class="st">'$\phi=x^2$'</span>, linewidth<span class="op">=</span><span class="dv">3</span>)
ax.legend(loc<span class="op">=</span><span class="st">'lower right'</span>)
_ <span class="op">=</span> ax.set_title(<span class="st">'Quadratic Basis Functions'</span>)</code></pre></div>
<p>The actual function we observe is then made up of a sum of these functions. This is the reason for the name basis. The term <em>basis</em> means 'the underlying support or foundation for an idea, argument, or process', and in this context they form the underlying support for our prediction function. Our prediction function can only be composed of a weighted linear sum of our basis functions.</p>
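<p>As a small illustration (with hypothetical weights), the prediction function is just a matrix-vector product between the design matrix of basis functions and the weight vector.</p>

```python
import numpy as np

# Quadratic design matrix for a few inputs.
x = np.linspace(-1, 1, 5)[:, None]
Phi = np.hstack([np.ones_like(x), x, x**2])

# Hypothetical weights: f(x) = 0.5 - x + 2x^2.
w = np.array([0.5, -1.0, 2.0])
f = Phi @ w
print(f)  # evaluates 0.5 - x + 2x^2 at each input
```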
<h3 id="quadratic-functions">Quadratic Functions</h3>
<object class="svgplot" align data="../slides/diagrams/ml/quadratic_function002.svg">
</object>
<h3 id="polynomial-fits-to-olympic-data">Polynomial Fits to Olympic Data</h3>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">import</span> numpy <span class="im">as</span> np
<span class="im">from</span> matplotlib <span class="im">import</span> pyplot <span class="im">as</span> plt
<span class="im">import</span> teaching_plots <span class="im">as</span> plot
<span class="im">import</span> mlai
<span class="im">import</span> pods</code></pre></div>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">basis <span class="op">=</span> mlai.polynomial
data <span class="op">=</span> pods.datasets.olympic_marathon_men()
x <span class="op">=</span> data[<span class="st">'X'</span>]
y <span class="op">=</span> data[<span class="st">'Y'</span>]
xlim <span class="op">=</span> [<span class="dv">1892</span>, <span class="dv">2020</span>]
basis<span class="op">=</span>mlai.basis(mlai.polynomial, number<span class="op">=</span><span class="dv">1</span>, data_limits<span class="op">=</span>xlim)</code></pre></div>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">plot.rmse_fit(x, y, param_name<span class="op">=</span><span class="st">'number'</span>, param_range<span class="op">=</span>(<span class="dv">1</span>, <span class="dv">27</span>),
model<span class="op">=</span>mlai.LM,
basis<span class="op">=</span>basis,
xlim<span class="op">=</span>xlim, objective_ylim<span class="op">=</span>[<span class="dv">0</span>, <span class="fl">0.8</span>],
diagrams<span class="op">=</span><span class="st">'../slides/diagrams/ml'</span>)</code></pre></div>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">from</span> ipywidgets <span class="im">import</span> IntSlider</code></pre></div>
<object class="svgplot" align data="../slides/diagrams/ml/olympic_LM_polynomial_num_basis026.svg">
</object>
<h2 id="underdetermined-system">Underdetermined System</h2>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">import</span> teaching_plots <span class="im">as</span> plot</code></pre></div>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">plot.under_determined_system(diagrams<span class="op">=</span><span class="st">'../slides/diagrams/ml'</span>)</code></pre></div>
<p>What about the situation where you have more parameters than data in your simultaneous equation? This is known as an <em>underdetermined</em> system. In fact this set up is in some sense <em>easier</em> to solve, because we don't need to think about introducing a slack variable (although it might make a lot of sense from a <em>modelling</em> perspective to do so).</p>
<p>The way Laplace proposed resolving an overdetermined system was to introduce slack variables, <span class="math inline">\(\noiseScalar_i\)</span>, which needed to be estimated for each point. The slack variable represents the difference between our prediction and the true observation; it is known as the <em>residual</em>. By introducing the slack variables we have an additional <span class="math inline">\(n\)</span> variables to estimate, one for each data point, <span class="math inline">\(\{\noiseScalar_i\}\)</span>. Together with the original <span class="math inline">\(m\)</span> and <span class="math inline">\(c\)</span>, this gives <span class="math inline">\(\numData+2\)</span> parameters to be estimated from only <span class="math inline">\(n\)</span> observations, which turns the overdetermined system into an <em>underdetermined</em> one. However, we then made a probabilistic assumption about the slack variables: we assumed they were distributed according to a probability density, and for the moment we have been assuming that density is the Gaussian, <span class="math display">\[\noiseScalar_i \sim \gaussianSamp{0}{\dataStd^2},\]</span> with zero mean and variance <span class="math inline">\(\dataStd^2\)</span>.</p>
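<p>A minimal sketch of this view with synthetic data (the slope, offset and noise level below are illustrative assumptions): fit <span class="math inline">\(m\)</span> and <span class="math inline">\(c\)</span> by least squares and inspect the slack variables, i.e. the residuals.</p>

```python
import numpy as np

# Synthetic data: true m = 2, c = 1, Gaussian noise with std 0.5.
rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, 50)

# Least squares estimate of m and c.
A = np.vstack([x, np.ones_like(x)]).T
(m, c), *_ = np.linalg.lstsq(A, y, rcond=None)

# The slack variables are the residuals epsilon_i = y_i - m*x_i - c.
epsilon = y - m * x - c
print(epsilon.mean())  # essentially zero by construction
print(epsilon.std())   # close to the noise standard deviation
```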
<p>The follow-up question is whether we can do the same thing with the parameters. If we have two parameters and only one observation, can we place a probability distribution over the parameters, as we did with the slack variables? The answer is yes.</p>
<h3 id="underdetermined-system-1">Underdetermined System</h3>
<object class="svgplot" align data="../slides/diagrams/ml/under_determined_system009.svg">
</object>
<center>
<em>Fit underdetermined system by considering uncertainty </em>
</center>
<h3 id="alan-turing">Alan Turing</h3>
<table>
<tr>
<td width="50%">
<img class="" src="../slides/diagrams/turing-times.gif" width="" align="center" style="background:none; border:none; box-shadow:none;">
</td>
<td width="50%">
<img class="" src="../slides/diagrams/turing-run.jpg" width="" align="center" style="background:none; border:none; box-shadow:none;">
</td>
</tr>
</table>
<center>
<em>Alan Turing: in 1946 he was only 11 minutes slower than the winner of the 1948 games. Would he have won a hypothetical games held in 1946? Source: <a href="http://www.turing.org.uk/scrapbook/run.html">Alan Turing Internet Scrapbook</a>.</em>
</center>
<p>If we had to summarise the objectives of machine learning in one word, a very good candidate for that word would be <em>generalization</em>. What is generalization? From a human perspective it might be summarised as the ability to take lessons learned in one domain and apply them to another domain. If we accept the definition given in the first session for machine learning, <span class="math display">\[
\text{data} + \text{model} \xrightarrow{\text{compute}} \text{prediction}
\]</span> then we see that without a model we can't generalise: we only have data. Data is fine for answering very specific questions, like "Who won the Olympic Marathon in 2012?", because we have that answer stored; however, we are not given the answer to many other questions. For example, Alan Turing was a formidable marathon runner: in 1946 he ran a time of 2 hours 46 minutes (just under four minutes per kilometer, faster than I, and most of the other <a href="http://www.parkrun.org.uk/sheffieldhallam/">Endcliffe Park Run</a> runners, can do 5 km). What is the probability he would have won an Olympics if one had been held in 1946?</p>
<p>To answer this question we need to generalize, but before we formalize the concept of generalization let's introduce some formal representation of what it means to generalize in machine learning.</p>
<object class="svgplot" align data="../slides/diagrams/ml/dem_gaussian003.svg">
</object>
<center>
<em>Combining a Gaussian likelihood with a Gaussian prior to form a Gaussian posterior </em>
</center>
<h3 id="main-trick">Main Trick</h3>
<p><span class="math display">\[p(c) = \frac{1}{\sqrt{2\pi\alpha_1}} \exp\left(-\frac{1}{2\alpha_1}c^2\right)\]</span> <span class="math display">\[p(\dataVector|\inputVector, c, m, \dataStd^2) = \frac{1}{\left(2\pi\dataStd^2\right)^{\frac{\numData}{2}}} \exp\left(-\frac{1}{2\dataStd^2}\sum_{i=1}^\numData(\dataScalar_i - m\inputScalar_i - c)^2\right)\]</span></p>
<h3 id="section"></h3>
<p><span class="math display">\[p(c| \dataVector, \inputVector, m, \dataStd^2) = \frac{p(\dataVector|\inputVector, c, m, \dataStd^2)p(c)}{p(\dataVector|\inputVector, m, \dataStd^2)}\]</span></p>
<p><span class="math display">\[p(c| \dataVector, \inputVector, m, \dataStd^2) = \frac{p(\dataVector|\inputVector, c, m, \dataStd^2)p(c)}{\int p(\dataVector|\inputVector, c, m, \dataStd^2)p(c) \text{d} c}\]</span></p>
<h3 id="section-1"></h3>
<p><span class="math display">\[p(c| \dataVector, \inputVector, m, \dataStd^2) \propto p(\dataVector|\inputVector, c, m, \dataStd^2)p(c)\]</span></p>
<p><span class="math display">\[\begin{aligned}
\log p(c | \dataVector, \inputVector, m, \dataStd^2) =&-\frac{1}{2\dataStd^2} \sum_{i=1}^\numData(\dataScalar_i-c - m\inputScalar_i)^2-\frac{1}{2\alpha_1} c^2 + \text{const}\\
= &-\frac{1}{2\dataStd^2}\sum_{i=1}^\numData(\dataScalar_i-m\inputScalar_i)^2 -\left(\frac{\numData}{2\dataStd^2} + \frac{1}{2\alpha_1}\right)c^2\\
& + c\frac{\sum_{i=1}^\numData(\dataScalar_i-m\inputScalar_i)}{\dataStd^2},
\end{aligned}\]</span></p>
<h3 id="section-2"></h3>
<p>We complete the square of the quadratic form to obtain <span class="math display">\[\log p(c | \dataVector, \inputVector, m, \dataStd^2) = -\frac{1}{2\tau^2}(c - \mu)^2 +\text{const},\]</span> where <span class="math inline">\(\tau^2 = \left(\numData\dataStd^{-2} +\alpha_1^{-1}\right)^{-1}\)</span> and <span class="math inline">\(\mu = \frac{\tau^2}{\dataStd^2} \sum_{i=1}^\numData(\dataScalar_i-m\inputScalar_i)\)</span>.</p>
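<p>As a numerical check of this algebra (the data, slope <span class="math inline">\(m\)</span>, noise variance and prior variance below are hypothetical choices): the log posterior is quadratic in <span class="math inline">\(c\)</span>, so fitting a degree-two polynomial to it over a grid should recover the same <span class="math inline">\(\mu\)</span> and <span class="math inline">\(\tau^2\)</span>.</p>

```python
import numpy as np

# Hypothetical data and fixed slope m; sigma2 is the noise variance,
# alpha1 the prior variance on the offset c.
rng = np.random.default_rng(0)
n, m, sigma2, alpha1 = 20, 0.5, 0.25, 4.0
x = rng.uniform(-1, 1, n)
y = m * x + 1.0 + rng.normal(0, np.sqrt(sigma2), n)

# Closed-form posterior from completing the square.
tau2 = 1.0 / (n / sigma2 + 1 / alpha1)
mu = tau2 / sigma2 * np.sum(y - m * x)

# The log posterior is quadratic in c, so a quadratic fit on a grid
# recovers the same posterior mean and variance.
c_grid = np.linspace(mu - 3, mu + 3, 200)
log_post = (-0.5 / sigma2 * np.array([np.sum((y - m * x - c)**2) for c in c_grid])
            - 0.5 / alpha1 * c_grid**2)
a, b, _ = np.polyfit(c_grid, log_post, 2)
print(mu, -b / (2 * a))    # posterior mean, two ways
print(tau2, -1 / (2 * a))  # posterior variance, two ways
```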
<h3 id="two-dimensional-gaussian">Two Dimensional Gaussian</h3>
<ul>
<li>Consider height, <span class="math inline">\(h/m\)</span> and weight, <span class="math inline">\(w/kg\)</span>.</li>
<li>Could sample height from a distribution: <span class="math display">\[
h \sim \gaussianSamp{1.7}{0.0225}
\]</span></li>
<li>And similarly weight: <span class="math display">\[
w \sim \gaussianSamp{75}{36}
\]</span></li>
</ul>
<object class="svgplot" align data="../slides/diagrams/ml/independent_height_weight007.svg">
</object>
<center>
<em>Samples from independent Gaussian variables that might represent heights and weights. </em>
</center>
<h3 id="independence-assumption">Independence Assumption</h3>
<ul>
<li><p>This assumes height and weight are independent. <span class="math display">\[p(h, w) = p(h)p(w)\]</span></p></li>
<li><p>In reality they are dependent (body mass index) <span class="math inline">\(= \frac{w}{h^2}\)</span>.</p></li>
</ul>
<object class="svgplot" align data="../slides/diagrams/ml/correlated_height_weight007.svg">
</object>
<center>
<em>Samples from correlated Gaussian variables that might represent heights and weights. </em>
</center>
<h3 id="independent-gaussians">Independent Gaussians</h3>
<p><span class="math display">\[
p(w, h) = p(w)p(h)
\]</span></p>
<p><span class="math display">\[
p(w, h) = \frac{1}{\sqrt{2\pi \dataStd_1^2}\sqrt{2\pi\dataStd_2^2}} \exp\left(-\frac{1}{2}\left(\frac{(w-\meanScalar_1)^2}{\dataStd_1^2} + \frac{(h-\meanScalar_2)^2}{\dataStd_2^2}\right)\right)
\]</span></p>
<p><span class="math display">\[
p(w, h) = \frac{1}{\sqrt{2\pi\dataStd_1^22\pi\dataStd_2^2}} \exp\left(-\frac{1}{2}\left(\begin{bmatrix}w \\ h\end{bmatrix} - \begin{bmatrix}\meanScalar_1 \\ \meanScalar_2\end{bmatrix}\right)^\top\begin{bmatrix}\dataStd_1^2& 0\\0&\dataStd_2^2\end{bmatrix}^{-1}\left(\begin{bmatrix}w \\ h\end{bmatrix} - \begin{bmatrix}\meanScalar_1 \\ \meanScalar_2\end{bmatrix}\right)\right)
\]</span></p>
<p><span class="math display">\[
p(\dataVector) = \frac{1}{\det{2\pi \mathbf{D}}^{\frac{1}{2}}} \exp\left(-\frac{1}{2}(\dataVector - \meanVector)^\top\mathbf{D}^{-1}(\dataVector - \meanVector)\right)
\]</span></p>
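<p>We can check numerically (using <code>scipy.stats</code>, with the hypothetical height and weight parameters from above) that the product of the two univariate densities equals the bivariate density with diagonal covariance <span class="math inline">\(\mathbf{D}\)</span>.</p>

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

# Hypothetical means and variances for weight (kg) and height (m).
mu = np.array([75.0, 1.7])
var = np.array([36.0, 0.0225])
point = np.array([80.0, 1.8])

# Product of the two independent univariate densities...
p_indep = (norm.pdf(point[0], mu[0], np.sqrt(var[0]))
           * norm.pdf(point[1], mu[1], np.sqrt(var[1])))

# ...equals the bivariate density with diagonal covariance D.
p_joint = multivariate_normal.pdf(point, mean=mu, cov=np.diag(var))
print(p_indep, p_joint)  # the same value
```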
<h3 id="correlated-gaussian">Correlated Gaussian</h3>
<p>Form the correlated Gaussian from the original by rotating the data space using a rotation matrix <span class="math inline">\(\rotationMatrix\)</span>.</p>
<p><span class="math display">\[
p(\dataVector) = \frac{1}{\det{2\pi\mathbf{D}}^{\frac{1}{2}}} \exp\left(-\frac{1}{2}(\dataVector - \meanVector)^\top\mathbf{D}^{-1}(\dataVector - \meanVector)\right)
\]</span></p>
<p><span class="math display">\[
p(\dataVector) = \frac{1}{\det{2\pi\mathbf{D}}^{\frac{1}{2}}} \exp\left(-\frac{1}{2}(\rotationMatrix^\top\dataVector - \rotationMatrix^\top\meanVector)^\top\mathbf{D}^{-1}(\rotationMatrix^\top\dataVector - \rotationMatrix^\top\meanVector)\right)
\]</span></p>
<p><span class="math display">\[
p(\dataVector) = \frac{1}{\det{2\pi\mathbf{D}}^{\frac{1}{2}}} \exp\left(-\frac{1}{2}(\dataVector - \meanVector)^\top\rotationMatrix\mathbf{D}^{-1}\rotationMatrix^\top(\dataVector - \meanVector)\right)
\]</span> this gives a covariance matrix: <span class="math display">\[
\covarianceMatrix^{-1} = \rotationMatrix \mathbf{D}^{-1} \rotationMatrix^\top
\]</span></p>
<p><span class="math display">\[
p(\dataVector) = \frac{1}{\det{2\pi\covarianceMatrix}^{\frac{1}{2}}} \exp\left(-\frac{1}{2}(\dataVector - \meanVector)^\top\covarianceMatrix^{-1} (\dataVector - \meanVector)\right)
\]</span> this gives a covariance matrix: <span class="math display">\[
\covarianceMatrix = \rotationMatrix \mathbf{D} \rotationMatrix^\top
\]</span></p>
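<p>A short numerical sketch of this construction (the rotation angle and the diagonal variances are hypothetical choices): build <span class="math inline">\(\covarianceMatrix = \rotationMatrix \mathbf{D} \rotationMatrix^\top\)</span>, confirm the precision matrix relation, and draw correlated samples.</p>

```python
import numpy as np

# Hypothetical rotation angle and diagonal variances.
theta = 0.4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
D = np.diag([36.0, 0.0225])

# Full covariance from rotating the diagonal one.
C = R @ D @ R.T

# The precision matrix is R D^{-1} R^T, as in the derivation.
print(np.allclose(np.linalg.inv(C), R @ np.linalg.inv(D) @ R.T))

# Samples drawn with covariance C are correlated.
rng = np.random.default_rng(1)
samples = rng.multivariate_normal([75.0, 1.7], C, size=100000)
print(np.cov(samples.T))  # close to C
```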
<h3 id="generating-from-the-model">Generating from the Model</h3>
<p>A very important aspect of probabilistic modelling is to <em>sample</em> from your model to see what type of assumptions you are making about your data. In this case that involves a two-stage process.</p>
<ol style="list-style-type: decimal">
<li>Sample a candidate parameter vector from the prior.</li>
<li>Place the candidate parameter vector in the likelihood and sample functions conditioned on that candidate vector.</li>
<li>Repeat to try and characterise the type of functions you are generating.</li>
</ol>
<p>Given a prior variance (as defined above) we can now sample from the prior distribution and combine with a basis set to see what assumptions we are making about the functions <em>a priori</em> (i.e. before we've seen the data). Firstly we compute the basis function matrix. We will do it both for our training data, and for a range of prediction locations (<code>x_pred</code>).</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">import</span> numpy <span class="im">as</span> np
<span class="im">import</span> pods</code></pre></div>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">data <span class="op">=</span> pods.datasets.olympic_marathon_men()
x <span class="op">=</span> data[<span class="st">'X'</span>]
y <span class="op">=</span> data[<span class="st">'Y'</span>]
num_data <span class="op">=</span> x.shape[<span class="dv">0</span>]
num_pred_data <span class="op">=</span> <span class="dv">100</span> <span class="co"># how many points to use for plotting predictions</span>
x_pred <span class="op">=</span> np.linspace(<span class="dv">1890</span>, <span class="dv">2016</span>, num_pred_data)[:, <span class="va">None</span>] <span class="co"># input locations for predictions</span></code></pre></div>
<p>Now let's build the basis matrices. We define the polynomial basis as follows.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="kw">def</span> polynomial(x, num_basis<span class="op">=</span><span class="dv">2</span>, loc<span class="op">=</span><span class="dv">0</span>., scale<span class="op">=</span><span class="dv">1</span>.):
degree<span class="op">=</span>num_basis<span class="op">-</span><span class="dv">1</span>
degrees <span class="op">=</span> np.arange(degree<span class="op">+</span><span class="dv">1</span>)
<span class="cf">return</span> ((x<span class="op">-</span>loc)<span class="op">/</span>scale)<span class="op">**</span>degrees</code></pre></div>
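<p>A quick check of the polynomial basis (the function is repeated in condensed form so the snippet is self-contained; the inputs, <code>loc</code> and <code>scale</code> values are illustrative):</p>

```python
import numpy as np

# Condensed repeat of the polynomial basis defined above.
def polynomial(x, num_basis=2, loc=0., scale=1.):
    degrees = np.arange(num_basis)
    return ((x - loc) / scale)**degrees

x = np.array([[1950.], [1980.]])
Phi = polynomial(x, num_basis=3, loc=1950., scale=100.)
print(Phi)  # columns are 1, (x - 1950)/100 and its square
```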
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">import</span> mlai</code></pre></div>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">loc<span class="op">=</span><span class="dv">1950</span>
scale<span class="op">=</span><span class="dv">1</span>
degree<span class="op">=</span><span class="dv">4</span>
basis <span class="op">=</span> mlai.basis(polynomial, number<span class="op">=</span>degree<span class="op">+</span><span class="dv">1</span>, loc<span class="op">=</span>loc, scale<span class="op">=</span>scale)
Phi_pred <span class="op">=</span> basis.Phi(x_pred)
Phi <span class="op">=</span> basis.Phi(x)</code></pre></div>
<h3 id="sampling-from-the-prior">Sampling from the Prior</h3>
<p>Now we will sample from the prior to produce a vector <span class="math inline">\(\mappingVector\)</span> and use it to plot a function which is representative of our belief <em>before</em> we fit the data. To do this we are going to use the properties of the Gaussian density and a sample from a <em>standard normal</em> using the function <code>np.random.normal</code>.</p>
<h3 id="scaling-gaussian-distributed-variables">Scaling Gaussian-distributed Variables</h3>
<p>First, let's consider the case where we have one data point and one feature in our basis set. In other words, <span class="math inline">\(\mappingFunctionVector\)</span> would be a scalar, <span class="math inline">\(\mappingVector\)</span> would be a scalar and <span class="math inline">\(\basisMatrix\)</span> would be a scalar. In this case we have <span class="math display">\[
\mappingFunction = \basisScalar \mappingScalar
\]</span> If <span class="math inline">\(\mappingScalar\)</span> is drawn from a normal density, <span class="math display">\[
\mappingScalar \sim \gaussianSamp{\meanScalar_\mappingScalar}{c_\mappingScalar}
\]</span> and <span class="math inline">\(\basisScalar\)</span> is a scalar value which we are given, then properties of the Gaussian density tell us that <span class="math display">\[
\basisScalar \mappingScalar \sim \gaussianSamp{\basisScalar\meanScalar_\mappingScalar}{\basisScalar^2c_\mappingScalar}
\]</span> Let's test this out numerically. First we will draw 200 samples from a standard normal,</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">w_vec <span class="op">=</span> np.random.normal(size<span class="op">=</span><span class="dv">200</span>)</code></pre></div>
<p>We can compute the mean of these samples and their variance</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="bu">print</span>(<span class="st">'w sample mean is '</span>, w_vec.mean())
<span class="bu">print</span>(<span class="st">'w sample variance is '</span>, w_vec.var())</code></pre></div>
<p>These are close to zero (the mean) and one (the variance) as you'd expect. Now compute the mean and variance of the scaled version,</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">phi <span class="op">=</span> <span class="dv">7</span>
f_vec <span class="op">=</span> phi<span class="op">*</span>w_vec
<span class="bu">print</span>(<span class="st">'True mean should be phi*0 = 0.'</span>)
<span class="bu">print</span>(<span class="st">'True variance should be phi*phi*1 = '</span>, phi<span class="op">*</span>phi)
<span class="bu">print</span>(<span class="st">'f sample mean is '</span>, f_vec.mean())
<span class="bu">print</span>(<span class="st">'f sample variance is '</span>, f_vec.var())</code></pre></div>
<p>If you increase the number of samples then you will see that the sample mean and the sample variance begin to converge towards the true mean and the true variance. Adding an offset to a sample from <code>np.random.normal</code> will change the mean, so if you want to sample from a Gaussian with mean <code>mu</code> and standard deviation <code>sigma</code>, one way of doing it is to sample from the standard normal and then scale and shift the result. To sample a set of <span class="math inline">\(\mappingScalar\)</span> from a Gaussian with mean <span class="math inline">\(\meanScalar\)</span> and variance <span class="math inline">\(\alpha\)</span>, <span class="math display">\[\mappingScalar \sim \gaussianSamp{\meanScalar}{\alpha},\]</span> we can simply scale and offset samples from the <em>standard normal</em>.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">mu <span class="op">=</span> <span class="dv">4</span> <span class="co"># mean of the distribution</span>
alpha <span class="op">=</span> <span class="dv">2</span> <span class="co"># variance of the distribution</span>
w_vec <span class="op">=</span> np.random.normal(size<span class="op">=</span><span class="dv">200</span>)<span class="op">*</span>np.sqrt(alpha) <span class="op">+</span> mu
<span class="bu">print</span>(<span class="st">'w sample mean is '</span>, w_vec.mean())
<span class="bu">print</span>(<span class="st">'w sample variance is '</span>, w_vec.var())</code></pre></div>
<p>Here the <code>np.sqrt</code> is necessary because we need to multiply by the standard deviation and we specified the variance as <code>alpha</code>. So scaling and offsetting a Gaussian-distributed variable keeps the variable Gaussian, but it affects the mean and variance of the resulting variable.</p>
<p>To get an idea of the overall shape of the resulting distribution, let's do the same thing with a histogram of the results.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">import</span> matplotlib.pyplot <span class="im">as</span> plt
<span class="im">import</span> teaching_plots <span class="im">as</span> plot</code></pre></div>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="co"># First the standard normal</span>
z_vec <span class="op">=</span> np.random.normal(size<span class="op">=</span><span class="dv">1000</span>) <span class="co"># by convention, in statistics, z is often used to denote samples from the standard normal</span>
w_vec <span class="op">=</span> z_vec<span class="op">*</span>np.sqrt(alpha) <span class="op">+</span> mu
<span class="co"># plot normalized histogram of w, and then normalized histogram of z on top</span>
fig, ax <span class="op">=</span> plt.subplots(figsize<span class="op">=</span>plot.big_wide_figsize)
ax.hist(w_vec, bins<span class="op">=</span><span class="dv">30</span>, density<span class="op">=</span><span class="va">True</span>)
ax.hist(z_vec, bins<span class="op">=</span><span class="dv">30</span>, density<span class="op">=</span><span class="va">True</span>)
_ <span class="op">=</span> ax.legend((<span class="st">'$w$'</span>, <span class="st">'$z$'</span>))</code></pre></div>
<p>Now re-run this histogram with 100,000 samples and check that both histograms look qualitatively Gaussian.</p>
<h3 id="sampling-from-the-prior-1">Sampling from the Prior</h3>
<p>Let's use this way of constructing samples from a Gaussian to check what functions look like <em>a priori</em>. The process will be as follows. First, we sample a <span class="math inline">\(K\)</span>-dimensional random vector from <code>np.random.normal</code>. Then we scale it by <span class="math inline">\(\sqrt{\alpha}\)</span> to obtain a prior sample of <span class="math inline">\(\mappingVector\)</span>.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">K <span class="op">=</span> degree <span class="op">+</span> <span class="dv">1</span>
z_vec <span class="op">=</span> np.random.normal(size<span class="op">=</span>K)
w_sample <span class="op">=</span> z_vec<span class="op">*</span>np.sqrt(alpha)
<span class="bu">print</span>(w_sample)</code></pre></div>
<p>Now we can combine our sample from the prior with the basis functions to create a function,</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">f_sample <span class="op">=</span> np.dot(Phi_pred,w_sample)
fig, ax <span class="op">=</span> plt.subplots(figsize<span class="op">=</span>plot.big_wide_figsize)
_ <span class="op">=</span> ax.plot(x_pred.flatten(), f_sample.flatten(), <span class="st">'r-'</span>, linewidth<span class="op">=</span><span class="dv">3</span>)</code></pre></div>
<p>This shows the recurring problem with the polynomial basis (note the scale on the left hand side!). Our prior allows relatively large coefficients for the basis functions associated with high polynomial degrees. Because we are operating with input values of around 2000, this leads to output functions with very large values. The fix we have used for this before is to rescale our data before we apply the polynomial basis to it. Above, we set the scale of the basis to 1. Here let's set it to 100 and try again.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">scale <span class="op">=</span> <span class="dv">100</span>.
basis <span class="op">=</span> mlai.basis(polynomial, number<span class="op">=</span>degree<span class="op">+</span><span class="dv">1</span>, loc<span class="op">=</span>loc, scale<span class="op">=</span>scale)
Phi_pred <span class="op">=</span> basis.Phi(x_pred)
Phi <span class="op">=</span> basis.Phi(x)</code></pre></div>
<p>With the basis functions recomputed at the new scale, we can plot our sample function again,</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">f_sample <span class="op">=</span> np.dot(Phi_pred, w_sample)
fig, ax <span class="op">=</span> plt.subplots(figsize<span class="op">=</span>plot.big_wide_figsize)
_ <span class="op">=</span> ax.plot(x_pred.flatten(), f_sample.flatten(), <span class="st">'r-'</span>, linewidth<span class="op">=</span><span class="dv">3</span>)</code></pre></div>
<p>Now let's loop through some samples and plot various functions as samples from this system,</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">num_samples <span class="op">=</span> <span class="dv">10</span>
K <span class="op">=</span> degree<span class="op">+</span><span class="dv">1</span>
fig, ax <span class="op">=</span> plt.subplots(figsize<span class="op">=</span>plot.big_wide_figsize)
<span class="cf">for</span> i <span class="kw">in</span> <span class="bu">range</span>(num_samples):
z_vec <span class="op">=</span> np.random.normal(size<span class="op">=</span>K)
w_sample <span class="op">=</span> z_vec<span class="op">*</span>np.sqrt(alpha)
f_sample <span class="op">=</span> np.dot(Phi_pred,w_sample)
_ <span class="op">=</span> ax.plot(x_pred.flatten(), f_sample.flatten(), linewidth<span class="op">=</span><span class="dv">2</span>)</code></pre></div>
<p>The predictions for the mean output can now be computed. We want the expected value of the predictions under the posterior distribution. In matrix form, the predictions can be computed as <span class="math display">\[
\mappingFunctionVector = \basisMatrix \mappingVector.
\]</span> This involves a matrix multiplication between a fixed matrix <span class="math inline">\(\basisMatrix\)</span> and a vector that is drawn from a distribution, <span class="math inline">\(\mappingVector\)</span>. Because <span class="math inline">\(\mappingVector\)</span> is drawn from a distribution, this implies that <span class="math inline">\(\mappingFunctionVector\)</span> should also be drawn from a distribution. There are two distributions we are interested in though. We have just been sampling from the <em>prior</em> distribution to see what sort of functions we get <em>before</em> looking at the data. In Bayesian inference, we need to compute the <em>posterior</em> distribution and sample from that density.</p>
<h3 id="computing-the-posterior">Computing the Posterior</h3>
<p>We will now attempt to compute the <em>posterior distribution</em>. In the lecture we went through the maths that allows us to compute the posterior distribution for <span class="math inline">\(\mappingVector\)</span>. This distribution is also Gaussian, <span class="math display">\[
p(\mappingVector | \dataVector, \inputVector, \dataStd^2) = \gaussianDist{\mappingVector}{\meanVector_\mappingScalar}{\covarianceMatrix_\mappingScalar}
\]</span> with covariance, <span class="math inline">\(\covarianceMatrix_\mappingScalar\)</span>, given by <span class="math display">\[
\covarianceMatrix_\mappingScalar = \left(\dataStd^{-2}\basisMatrix^\top \basisMatrix + \alpha^{-1}\eye\right)^{-1}
\]</span> whilst the mean is given by <span class="math display">\[
\meanVector_\mappingScalar = \covarianceMatrix_\mappingScalar \dataStd^{-2}\basisMatrix^\top \dataVector
\]</span> Let's compute the posterior covariance and mean, then we'll sample from these densities to have a look at the posterior belief about <span class="math inline">\(\mappingVector\)</span> once the data has been accounted for. Remember, the process of Bayesian inference involves combining the prior, <span class="math inline">\(p(\mappingVector)\)</span> with the likelihood, <span class="math inline">\(p(\dataVector|\inputVector, \mappingVector)\)</span> to form the posterior, <span class="math inline">\(p(\mappingVector | \dataVector, \inputVector)\)</span> through Bayes' rule, <span class="math display">\[
p(\mappingVector|\dataVector, \inputVector) = \frac{p(\dataVector|\inputVector, \mappingVector)p(\mappingVector)}{p(\dataVector)}
\]</span> We've looked at the samples for our function <span class="math inline">\(\mappingFunctionVector = \basisMatrix\mappingVector\)</span>, which forms the mean of the Gaussian likelihood, under the prior distribution. I.e. we've sampled from <span class="math inline">\(p(\mappingVector)\)</span> and multiplied the result by the basis matrix. Now we will sample from the posterior density, <span class="math inline">\(p(\mappingVector|\dataVector, \inputVector)\)</span>, and check that the new samples do correspond to the data, i.e. we want to check that the updated distribution includes information from the data set. First we need to compute the posterior mean and <em>covariance</em>.</p>
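<p>As a check of these two formulae, here is a minimal sketch that computes the posterior covariance and mean for a <em>synthetic</em> basis matrix and targets. The <code>Phi</code>, <code>y</code>, <code>alpha</code> and <code>sigma2</code> below are stand-ins, not the notebook's Olympic data:</p>

```python
import numpy as np

# Synthetic stand-ins for the notebook's basis matrix and targets.
np.random.seed(0)
Phi = np.random.randn(20, 4)   # 20 data points, 4 basis functions
y = np.random.randn(20, 1)     # targets
alpha = 1.0                    # prior variance of the weights
sigma2 = 0.01                  # noise variance

# Posterior covariance: (sigma^{-2} Phi^T Phi + alpha^{-1} I)^{-1}
w_cov = np.linalg.inv(Phi.T @ Phi / sigma2 + np.eye(Phi.shape[1]) / alpha)
# Posterior mean: C_w sigma^{-2} Phi^T y
w_mean = w_cov @ Phi.T @ y / sigma2
print(w_mean.flatten())
```

The posterior covariance is symmetric and positive definite, so samples can then be drawn with <code>np.random.multivariate_normal</code>.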
<h3 id="bayesian-inference-in-the-univariate-case">Bayesian Inference in the Univariate Case</h3>
<p>This video talks about Bayesian inference across the single parameter, the offset <span class="math inline">\(c\)</span>, illustrating how the prior and the likelihood combine in one dimension to form a posterior.</p>
<p><a href="https://www.youtube.com/watch?v=AvlnFnvFw_0&t=15"><img src="https://img.youtube.com/vi/AvlnFnvFw_0/0.jpg" /></a></p>
<h3 id="multivariate-bayesian-inference">Multivariate Bayesian Inference</h3>
<p>This section of the lecture talks about how we extend the idea of Bayesian inference to the multivariate case. It goes through the multivariate Gaussian and how to complete the square in the linear algebra, as we do below.</p>
<p><a href="https://www.youtube.com/watch?v=Os1iqgpelPw&t=1362"><img src="https://img.youtube.com/vi/Os1iqgpelPw/0.jpg" /></a></p>
<p>The lecture informs us that the posterior density for <span class="math inline">\(\mappingVector\)</span> is given by a Gaussian density with covariance <span class="math display">\[
\covarianceMatrix_w = \left(\dataStd^{-2}\basisMatrix^\top \basisMatrix + \alpha^{-1}\eye\right)^{-1}
\]</span> and mean <span class="math display">\[
\meanVector_w = \covarianceMatrix_w\dataStd^{-2}\basisMatrix^\top \dataVector.
\]</span></p>
<h3 id="question-1-1">Question 1</h3>
<p>Compute the covariance for <span class="math inline">\(\mappingVector\)</span> given the training data; call the resulting variable <code>w_cov</code>. Compute the mean for <span class="math inline">\(\mappingVector\)</span> given the training data; call the resulting variable <code>w_mean</code>. Assume that <span class="math inline">\(\dataStd^2 = 0.01\)</span>.</p>
<p><em>10 marks</em></p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="co"># Write code for your answer to Question 1 in this box</span>
sigma2 <span class="op">=</span>
w_cov <span class="op">=</span>
w_mean <span class="op">=</span>
</code></pre></div>
<h3 id="olympic-data-with-bayesian-polynomials">Olympic Data with Bayesian Polynomials</h3>
<p>The quality of the Bayesian polynomial fit to the Olympic marathon data can be assessed through the negative marginal log likelihood.</p>
<object class="svgplot" data="../slides/diagrams/ml/olympic_BLM_polynomial_number026.svg">
</object>
<center>
<em>Bayesian fit with 26th degree polynomial and negative marginal log likelihood. </em>
</center>
<h3 id="hold-out-validation">Hold Out Validation</h3>
<p>For the polynomial fit, we will now look at <em>hold out</em> validation, where we are holding out some of the most recent points. This tests the ability of our model to <em>extrapolate</em>.</p>
<object class="svgplot" data="../slides/diagrams/ml/olympic_val_BLM_polynomial_number026.svg">
</object>
<center>
<em>Bayesian fit with 26th degree polynomial and hold out validation scores. </em>
</center>
<h3 id="fold-cross-validation">5-fold Cross Validation</h3>
<p>Five fold cross validation tests the ability of the model to <em>interpolate</em>.</p>
<object class="svgplot" data="../slides/diagrams/ml/olympic_5cv05_BLM_polynomial_number026.svg">
</object>
<center>
<em>Bayesian fit with 26th degree polynomial and five fold cross validation scores. </em>
</center>
<h3 id="marginal-likelihood">Marginal Likelihood</h3>
<ul>
<li><p>The marginal likelihood can also be computed; it has the form: <span class="math display">\[
p(\dataVector|\inputMatrix, \dataStd^2, \alpha) = \frac{1}{(2\pi)^\frac{n}{2}\left|\kernelMatrix\right|^\frac{1}{2}} \exp\left(-\frac{1}{2} \dataVector^\top \kernelMatrix^{-1} \dataVector\right)
\]</span> where <span class="math inline">\(\kernelMatrix = \alpha \basisMatrix\basisMatrix^\top + \dataStd^2 \eye\)</span>.</p></li>
<li><p>So it is a zero mean <span class="math inline">\(\numData\)</span>-dimensional Gaussian with covariance matrix <span class="math inline">\(\kernelMatrix\)</span>.</p></li>
</ul>
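<p>This density can be evaluated directly. The sketch below, again with made-up <code>Phi</code>, <code>y</code>, <code>alpha</code> and <code>sigma2</code> standing in for the notebook's variables, computes the log of the marginal likelihood:</p>

```python
import numpy as np

np.random.seed(1)
Phi = np.random.randn(10, 3)   # made-up basis matrix
y = np.random.randn(10, 1)     # made-up targets
alpha, sigma2 = 1.0, 0.01

# K = alpha Phi Phi^T + sigma^2 I
n = Phi.shape[0]
K = alpha * Phi @ Phi.T + sigma2 * np.eye(n)

# log p(y|X, sigma^2, alpha) = -n/2 log(2 pi) - 1/2 log|K| - 1/2 y^T K^{-1} y
sign, logdet = np.linalg.slogdet(K)
log_marginal = -0.5 * (n * np.log(2 * np.pi) + logdet
                       + float(y.T @ np.linalg.solve(K, y)))
print(log_marginal)
```

Using <code>slogdet</code> and <code>solve</code> rather than forming the determinant and the inverse explicitly keeps the computation numerically stable.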
<h3 id="references" class="unnumbered">References</h3>
<div id="refs" class="references">
<div id="ref-Bishop:book06">
<p>Bishop, C.M., 2006. Pattern recognition and machine learning. springer.</p>
</div>
<div id="ref-Laplace:essai14">
<p>Laplace, P.S., 1814. Essai philosophique sur les probabilités, 2nd ed. Courcier, Paris.</p>
</div>
<div id="ref-Rogers:book11">
<p>Rogers, S., Girolami, M., 2011. A first course in machine learning. CRC Press.</p>
</div>
</div>
Mon, 04 Jun 2018 00:00:00 +0000
http://inverseprobability.com/talks/notes/bayesian-methods.html
Faith and AI<div style="display:none">
$$\newcommand{\Amatrix}{\mathbf{A}}
\newcommand{\KL}[2]{\text{KL}\left( #1\,\|\,#2 \right)}
\newcommand{\Kaast}{\kernelMatrix_{\mathbf{ \ast}\mathbf{ \ast}}}
\newcommand{\Kastu}{\kernelMatrix_{\mathbf{ \ast} \inducingVector}}
\newcommand{\Kff}{\kernelMatrix_{\mappingFunctionVector \mappingFunctionVector}}
\newcommand{\Kfu}{\kernelMatrix_{\mappingFunctionVector \inducingVector}}
\newcommand{\Kuast}{\kernelMatrix_{\inducingVector \bf\ast}}
\newcommand{\Kuf}{\kernelMatrix_{\inducingVector \mappingFunctionVector}}
\newcommand{\Kuu}{\kernelMatrix_{\inducingVector \inducingVector}}
\newcommand{\Kuui}{\Kuu^{-1}}
\newcommand{\Qaast}{\mathbf{Q}_{\bf \ast \ast}}
\newcommand{\Qastf}{\mathbf{Q}_{\ast \mappingFunction}}
\newcommand{\Qfast}{\mathbf{Q}_{\mappingFunctionVector \bf \ast}}
\newcommand{\Qff}{\mathbf{Q}_{\mappingFunctionVector \mappingFunctionVector}}
\newcommand{\aMatrix}{\mathbf{A}}
\newcommand{\aScalar}{a}
\newcommand{\aVector}{\mathbf{a}}
\newcommand{\acceleration}{a}
\newcommand{\bMatrix}{\mathbf{B}}
\newcommand{\bScalar}{b}
\newcommand{\bVector}{\mathbf{b}}
\newcommand{\basisFunc}{\phi}
\newcommand{\basisFuncVector}{\boldsymbol{ \basisFunc}}
\newcommand{\basisFunction}{\phi}
\newcommand{\basisLocation}{\mu}
\newcommand{\basisMatrix}{\boldsymbol{ \Phi}}
\newcommand{\basisScalar}{\basisFunction}
\newcommand{\basisVector}{\boldsymbol{ \basisFunction}}
\newcommand{\activationFunction}{\phi}
\newcommand{\activationMatrix}{\boldsymbol{ \Phi}}
\newcommand{\activationScalar}{\basisFunction}
\newcommand{\activationVector}{\boldsymbol{ \basisFunction}}
\newcommand{\bigO}{\mathcal{O}}
\newcommand{\binomProb}{\pi}
\newcommand{\cMatrix}{\mathbf{C}}
\newcommand{\cbasisMatrix}{\hat{\boldsymbol{ \Phi}}}
\newcommand{\cdataMatrix}{\hat{\dataMatrix}}
\newcommand{\cdataScalar}{\hat{\dataScalar}}
\newcommand{\cdataVector}{\hat{\dataVector}}
\newcommand{\centeredKernelMatrix}{\mathbf{ \MakeUppercase{\centeredKernelScalar}}}
\newcommand{\centeredKernelScalar}{b}
\newcommand{\centeredKernelVector}{\centeredKernelScalar}
\newcommand{\centeringMatrix}{\mathbf{H}}
\newcommand{\chiSquaredDist}[2]{\chi_{#1}^{2}\left(#2\right)}
\newcommand{\chiSquaredSamp}[1]{\chi_{#1}^{2}}
\newcommand{\conditionalCovariance}{\boldsymbol{ \Sigma}}
\newcommand{\coregionalizationMatrix}{\mathbf{B}}
\newcommand{\coregionalizationScalar}{b}
\newcommand{\coregionalizationVector}{\mathbf{ \coregionalizationScalar}}
\newcommand{\covDist}[2]{\text{cov}_{#2}\left(#1\right)}
\newcommand{\covSamp}[1]{\text{cov}\left(#1\right)}
\newcommand{\covarianceScalar}{c}
\newcommand{\covarianceVector}{\mathbf{ \covarianceScalar}}
\newcommand{\covarianceMatrix}{\mathbf{C}}
\newcommand{\covarianceMatrixTwo}{\boldsymbol{ \Sigma}}
\newcommand{\croupierScalar}{s}
\newcommand{\croupierVector}{\mathbf{ \croupierScalar}}
\newcommand{\croupierMatrix}{\mathbf{ \MakeUppercase{\croupierScalar}}}
\newcommand{\dataDim}{p}
\newcommand{\dataIndex}{i}
\newcommand{\dataIndexTwo}{j}
\newcommand{\dataMatrix}{\mathbf{Y}}
\newcommand{\dataScalar}{y}
\newcommand{\dataSet}{\mathcal{D}}
\newcommand{\dataStd}{\sigma}
\newcommand{\dataVector}{\mathbf{ \dataScalar}}
\newcommand{\decayRate}{d}
\newcommand{\degreeMatrix}{\mathbf{ \MakeUppercase{\degreeScalar}}}
\newcommand{\degreeScalar}{d}
\newcommand{\degreeVector}{\mathbf{ \degreeScalar}}
% Already defined by latex
%\newcommand{\det}[1]{\left|#1\right|}
\newcommand{\diag}[1]{\text{diag}\left(#1\right)}
\newcommand{\diagonalMatrix}{\mathbf{D}}
\newcommand{\diff}[2]{\frac{\text{d}#1}{\text{d}#2}}
\newcommand{\diffTwo}[2]{\frac{\text{d}^2#1}{\text{d}#2^2}}
\newcommand{\displacement}{x}
\newcommand{\displacementVector}{\textbf{\displacement}}
\newcommand{\distanceMatrix}{\mathbf{ \MakeUppercase{\distanceScalar}}}
\newcommand{\distanceScalar}{d}
\newcommand{\distanceVector}{\mathbf{ \distanceScalar}}
\newcommand{\eigenvaltwo}{\ell}
\newcommand{\eigenvaltwoMatrix}{\mathbf{L}}
\newcommand{\eigenvaltwoVector}{\mathbf{l}}
\newcommand{\eigenvalue}{\lambda}
\newcommand{\eigenvalueMatrix}{\boldsymbol{ \Lambda}}
\newcommand{\eigenvalueVector}{\boldsymbol{ \lambda}}
\newcommand{\eigenvector}{\mathbf{ \eigenvectorScalar}}
\newcommand{\eigenvectorMatrix}{\mathbf{U}}
\newcommand{\eigenvectorScalar}{u}
\newcommand{\eigenvectwo}{\mathbf{v}}
\newcommand{\eigenvectwoMatrix}{\mathbf{V}}
\newcommand{\eigenvectwoScalar}{v}
\newcommand{\entropy}[1]{\mathcal{H}\left(#1\right)}
\newcommand{\errorFunction}{E}
\newcommand{\expDist}[2]{\left<#1\right>_{#2}}
\newcommand{\expSamp}[1]{\left<#1\right>}
\newcommand{\expectation}[1]{\left\langle #1 \right\rangle }
\newcommand{\expectationDist}[2]{\left\langle #1 \right\rangle _{#2}}
\newcommand{\expectedDistanceMatrix}{\mathcal{D}}
\newcommand{\eye}{\mathbf{I}}
\newcommand{\fantasyDim}{r}
\newcommand{\fantasyMatrix}{\mathbf{ \MakeUppercase{\fantasyScalar}}}
\newcommand{\fantasyScalar}{z}
\newcommand{\fantasyVector}{\mathbf{ \fantasyScalar}}
\newcommand{\featureStd}{\varsigma}
\newcommand{\gammaCdf}[3]{\mathcal{GAMMA CDF}\left(#1|#2,#3\right)}
\newcommand{\gammaDist}[3]{\mathcal{G}\left(#1|#2,#3\right)}
\newcommand{\gammaSamp}[2]{\mathcal{G}\left(#1,#2\right)}
\newcommand{\gaussianDist}[3]{\mathcal{N}\left(#1|#2,#3\right)}
\newcommand{\gaussianSamp}[2]{\mathcal{N}\left(#1,#2\right)}
\newcommand{\given}{|}
\newcommand{\half}{\frac{1}{2}}
\newcommand{\heaviside}{H}
\newcommand{\hiddenMatrix}{\mathbf{ \MakeUppercase{\hiddenScalar}}}
\newcommand{\hiddenScalar}{h}
\newcommand{\hiddenVector}{\mathbf{ \hiddenScalar}}
\newcommand{\identityMatrix}{\eye}
\newcommand{\inducingInputScalar}{z}
\newcommand{\inducingInputVector}{\mathbf{ \inducingInputScalar}}
\newcommand{\inducingInputMatrix}{\mathbf{Z}}
\newcommand{\inducingScalar}{u}
\newcommand{\inducingVector}{\mathbf{ \inducingScalar}}
\newcommand{\inducingMatrix}{\mathbf{U}}
\newcommand{\inlineDiff}[2]{\text{d}#1/\text{d}#2}
\newcommand{\inputDim}{q}
\newcommand{\inputMatrix}{\mathbf{X}}
\newcommand{\inputScalar}{x}
\newcommand{\inputSpace}{\mathcal{X}}
\newcommand{\inputVals}{\inputVector}
\newcommand{\inputVector}{\mathbf{ \inputScalar}}
\newcommand{\iterNum}{k}
\newcommand{\kernel}{\kernelScalar}
\newcommand{\kernelMatrix}{\mathbf{K}}
\newcommand{\kernelScalar}{k}
\newcommand{\kernelVector}{\mathbf{ \kernelScalar}}
\newcommand{\kff}{\kernelScalar_{\mappingFunction \mappingFunction}}
\newcommand{\kfu}{\kernelVector_{\mappingFunction \inducingScalar}}
\newcommand{\kuf}{\kernelVector_{\inducingScalar \mappingFunction}}
\newcommand{\kuu}{\kernelVector_{\inducingScalar \inducingScalar}}
\newcommand{\lagrangeMultiplier}{\lambda}
\newcommand{\lagrangeMultiplierMatrix}{\boldsymbol{ \Lambda}}
\newcommand{\lagrangian}{L}
\newcommand{\laplacianFactor}{\mathbf{ \MakeUppercase{\laplacianFactorScalar}}}
\newcommand{\laplacianFactorScalar}{m}
\newcommand{\laplacianFactorVector}{\mathbf{ \laplacianFactorScalar}}
\newcommand{\laplacianMatrix}{\mathbf{L}}
\newcommand{\laplacianScalar}{\ell}
\newcommand{\laplacianVector}{\mathbf{ \ell}}
\newcommand{\latentDim}{q}
\newcommand{\latentDistanceMatrix}{\boldsymbol{ \Delta}}
\newcommand{\latentDistanceScalar}{\delta}
\newcommand{\latentDistanceVector}{\boldsymbol{ \delta}}
\newcommand{\latentForce}{f}
\newcommand{\latentFunction}{u}
\newcommand{\latentFunctionVector}{\mathbf{ \latentFunction}}
\newcommand{\latentFunctionMatrix}{\mathbf{ \MakeUppercase{\latentFunction}}}
\newcommand{\latentIndex}{j}
\newcommand{\latentScalar}{z}
\newcommand{\latentVector}{\mathbf{ \latentScalar}}
\newcommand{\latentMatrix}{\mathbf{Z}}
\newcommand{\learnRate}{\eta}
\newcommand{\lengthScale}{\ell}
\newcommand{\rbfWidth}{\ell}
\newcommand{\likelihoodBound}{\mathcal{L}}
\newcommand{\likelihoodFunction}{L}
\newcommand{\locationScalar}{\mu}
\newcommand{\locationVector}{\boldsymbol{ \locationScalar}}
\newcommand{\locationMatrix}{\mathbf{M}}
\newcommand{\variance}[1]{\text{var}\left( #1 \right)}
\newcommand{\mappingFunction}{f}
\newcommand{\mappingFunctionMatrix}{\mathbf{F}}
\newcommand{\mappingFunctionTwo}{g}
\newcommand{\mappingFunctionTwoMatrix}{\mathbf{G}}
\newcommand{\mappingFunctionTwoVector}{\mathbf{ \mappingFunctionTwo}}
\newcommand{\mappingFunctionVector}{\mathbf{ \mappingFunction}}
\newcommand{\scaleScalar}{s}
\newcommand{\mappingScalar}{w}
\newcommand{\mappingVector}{\mathbf{ \mappingScalar}}
\newcommand{\mappingMatrix}{\mathbf{W}}
\newcommand{\mappingScalarTwo}{v}
\newcommand{\mappingVectorTwo}{\mathbf{ \mappingScalarTwo}}
\newcommand{\mappingMatrixTwo}{\mathbf{V}}
\newcommand{\maxIters}{K}
\newcommand{\meanMatrix}{\mathbf{M}}
\newcommand{\meanScalar}{\mu}
\newcommand{\meanTwoMatrix}{\mathbf{M}}
\newcommand{\meanTwoScalar}{m}
\newcommand{\meanTwoVector}{\mathbf{ \meanTwoScalar}}
\newcommand{\meanVector}{\boldsymbol{ \meanScalar}}
\newcommand{\mrnaConcentration}{m}
\newcommand{\naturalFrequency}{\omega}
\newcommand{\neighborhood}[1]{\mathcal{N}\left( #1 \right)}
\newcommand{\neilurl}{http://inverseprobability.com/}
\newcommand{\noiseMatrix}{\boldsymbol{ E}}
\newcommand{\noiseScalar}{\epsilon}
\newcommand{\noiseVector}{\boldsymbol{ \epsilon}}
\newcommand{\norm}[1]{\left\Vert #1 \right\Vert}
\newcommand{\normalizedLaplacianMatrix}{\hat{\mathbf{L}}}
\newcommand{\normalizedLaplacianScalar}{\hat{\ell}}
\newcommand{\normalizedLaplacianVector}{\hat{\mathbf{ \ell}}}
\newcommand{\numActive}{m}
\newcommand{\numBasisFunc}{m}
\newcommand{\numComponents}{m}
\newcommand{\numComps}{K}
\newcommand{\numData}{n}
\newcommand{\numFeatures}{K}
\newcommand{\numHidden}{h}
\newcommand{\numInducing}{m}
\newcommand{\numLayers}{\ell}
\newcommand{\numNeighbors}{K}
\newcommand{\numSequences}{s}
\newcommand{\numSuccess}{s}
\newcommand{\numTasks}{m}
\newcommand{\numTime}{T}
\newcommand{\numTrials}{S}
\newcommand{\outputIndex}{j}
\newcommand{\paramVector}{\boldsymbol{ \theta}}
\newcommand{\parameterMatrix}{\boldsymbol{ \Theta}}
\newcommand{\parameterScalar}{\theta}
\newcommand{\parameterVector}{\boldsymbol{ \parameterScalar}}
\newcommand{\partDiff}[2]{\frac{\partial#1}{\partial#2}}
\newcommand{\precisionScalar}{j}
\newcommand{\precisionVector}{\mathbf{ \precisionScalar}}
\newcommand{\precisionMatrix}{\mathbf{J}}
\newcommand{\pseudotargetScalar}{\widetilde{y}}
\newcommand{\pseudotargetVector}{\mathbf{ \pseudotargetScalar}}
\newcommand{\pseudotargetMatrix}{\mathbf{ \widetilde{Y}}}
\newcommand{\rank}[1]{\text{rank}\left(#1\right)}
\newcommand{\rayleighDist}[2]{\mathcal{R}\left(#1|#2\right)}
\newcommand{\rayleighSamp}[1]{\mathcal{R}\left(#1\right)}
\newcommand{\responsibility}{r}
\newcommand{\rotationScalar}{r}
\newcommand{\rotationVector}{\mathbf{ \rotationScalar}}
\newcommand{\rotationMatrix}{\mathbf{R}}
\newcommand{\sampleCovScalar}{s}
\newcommand{\sampleCovVector}{\mathbf{ \sampleCovScalar}}
\newcommand{\sampleCovMatrix}{\mathbf{s}}
\newcommand{\scalarProduct}[2]{\left\langle{#1},{#2}\right\rangle}
\newcommand{\sign}[1]{\text{sign}\left(#1\right)}
\newcommand{\sigmoid}[1]{\sigma\left(#1\right)}
\newcommand{\singularvalue}{\ell}
\newcommand{\singularvalueMatrix}{\mathbf{L}}
\newcommand{\singularvalueVector}{\mathbf{l}}
\newcommand{\sorth}{\mathbf{u}}
\newcommand{\spar}{\lambda}
\newcommand{\trace}[1]{\text{tr}\left(#1\right)}
\newcommand{\BasalRate}{B}
\newcommand{\DampingCoefficient}{C}
\newcommand{\DecayRate}{D}
\newcommand{\Displacement}{X}
\newcommand{\LatentForce}{F}
\newcommand{\Mass}{M}
\newcommand{\Sensitivity}{S}
\newcommand{\basalRate}{b}
\newcommand{\dampingCoefficient}{c}
\newcommand{\mass}{m}
\newcommand{\sensitivity}{s}
\newcommand{\springScalar}{\kappa}
\newcommand{\springVector}{\boldsymbol{ \kappa}}
\newcommand{\springMatrix}{\boldsymbol{ \mathcal{K}}}
\newcommand{\tfConcentration}{p}
\newcommand{\tfDecayRate}{\delta}
\newcommand{\tfMrnaConcentration}{f}
\newcommand{\tfVector}{\mathbf{ \tfConcentration}}
\newcommand{\velocity}{v}
\newcommand{\sufficientStatsScalar}{g}
\newcommand{\sufficientStatsVector}{\mathbf{ \sufficientStatsScalar}}
\newcommand{\sufficientStatsMatrix}{\mathbf{G}}
\newcommand{\switchScalar}{s}
\newcommand{\switchVector}{\mathbf{ \switchScalar}}
\newcommand{\switchMatrix}{\mathbf{S}}
\newcommand{\tr}[1]{\text{tr}\left(#1\right)}
\newcommand{\loneNorm}[1]{\left\Vert #1 \right\Vert_1}
\newcommand{\ltwoNorm}[1]{\left\Vert #1 \right\Vert_2}
\newcommand{\onenorm}[1]{\left\vert#1\right\vert_1}
\newcommand{\twonorm}[1]{\left\Vert #1 \right\Vert}
\newcommand{\vScalar}{v}
\newcommand{\vVector}{\mathbf{v}}
\newcommand{\vMatrix}{\mathbf{V}}
\newcommand{\varianceDist}[2]{\text{var}_{#2}\left( #1 \right)}
% Already defined by latex
%\newcommand{\vec}{#1:}
\newcommand{\vecb}[1]{\left(#1\right):}
\newcommand{\weightScalar}{w}
\newcommand{\weightVector}{\mathbf{ \weightScalar}}
\newcommand{\weightMatrix}{\mathbf{W}}
\newcommand{\weightedAdjacencyMatrix}{\mathbf{A}}
\newcommand{\weightedAdjacencyScalar}{a}
\newcommand{\weightedAdjacencyVector}{\mathbf{ \weightedAdjacencyScalar}}
\newcommand{\onesVector}{\mathbf{1}}
\newcommand{\zerosVector}{\mathbf{0}}
$$
</div>
<h2 id="what-is-machine-learning">What is Machine Learning?</h2>
<p>What is machine learning? At its most basic level machine learning is a combination of</p>
<p><span class="math display">\[ \text{data} + \text{model} \xrightarrow{\text{compute}} \text{prediction}\]</span></p>
<p>where <em>data</em> is our observations. They can be actively or passively acquired (meta-data). The <em>model</em> contains our assumptions, based on previous experience. That experience can be other data, it can come from transfer learning, or it can merely be our beliefs about the regularities of the universe. In humans our models include our inductive biases. The <em>prediction</em> is an action to be taken or a categorization or a quality score. The reason that machine learning has become a mainstay of artificial intelligence is the importance of predictions in artificial intelligence. The data and the model are combined through computation.</p>
<p>In practice we normally perform machine learning using two functions. To combine data with a model we typically make use of:</p>
<p><strong>a prediction function</strong>: a function which is used to make the predictions. It includes our beliefs about the regularities of the universe, our assumptions about how the world works, e.g. smoothness, spatial similarities, temporal similarities.</p>
<p><strong>an objective function</strong>: a function which defines the cost of misprediction. Typically it includes knowledge about the world's generating processes (probabilistic objectives) or the costs we pay for mispredictions (empirical risk minimization).</p>
<p>The combination of data and model through the prediction function and the objective function leads to a <em>learning algorithm</em>. The class of prediction functions and objective functions we can make use of is restricted by the algorithms they lead to. If the prediction function or the objective function are too complex, then it can be difficult to find an appropriate learning algorithm. Much of the academic field of machine learning is the quest for new learning algorithms that allow us to bring different types of models and data together.</p>
<p>A useful reference for state of the art in machine learning is the UK Royal Society Report, <a href="https://royalsociety.org/~/media/policy/projects/machine-learning/publications/machine-learning-report.pdf">Machine Learning: Power and Promise of Computers that Learn by Example</a>.</p>
<p>You can also check my blog post on <a href="http://inverseprobability.com/2017/07/17/what-is-machine-learning">"What is Machine Learning?"</a></p>
<h3 id="artificial-intelligence-and-data-science">Artificial Intelligence and Data Science</h3>
<p>Machine learning technologies have been the driver of two related, but distinct disciplines. The first is <em>data science</em>. Data science is an emerging field that arises from the fact that we now collect so much data by happenstance, rather than by <em>experimental design</em>. Classical statistics is the science of drawing conclusions from data, and to do so statistical experiments are carefully designed. In the modern era we collect so much data that there's a desire to draw inferences directly from the data.</p>
<p>As well as machine learning, the field of data science draws from statistics, cloud computing, data storage (e.g. streaming data), visualization and data mining.</p>
<p>In contrast, artificial intelligence technologies typically focus on emulating some form of human behaviour, such as understanding an image, or some speech, or translating text from one form to another. The recent advances in artificial intelligence have come from machine learning providing the automation. But in contrast to data science, in artificial intelligence the data is normally collected with the specific task in mind. In this sense it has strong relations to classical statistics.</p>
<p>Classically artificial intelligence worried more about <em>logic</em> and <em>planning</em> and focussed less on data driven decision making. Modern machine learning owes more to the field of <em>Cybernetics</em> <span class="citation">(Wiener, 1948)</span> than artificial intelligence. Related fields include <em>robotics</em>, <em>speech recognition</em>, <em>language understanding</em> and <em>computer vision</em>.</p>
<p>There are strong overlaps between the fields; the wide availability of data by happenstance makes it easier to collect data for designing AI systems. These relations are coming through wide availability of sensing technologies that are interconnected by cellular networks, WiFi and the internet. This phenomenon is sometimes known as the <em>Internet of Things</em>, but this feels like a dangerous misnomer. We must never forget that we are interconnecting people, not things.</p>
<h3 id="what-does-machine-learning-do">What does Machine Learning do?</h3>
<p>Any process of automation allows us to scale what we do by codifying a process in some way that makes it efficient and repeatable. Machine learning automates by emulating human (or other) actions found in data. Machine learning codifies in the form of a mathematical function that is learnt by a computer. If we can create these mathematical functions in ways in which they can interconnect, then we can also build systems.</p>
<p>Machine learning works through codifying a prediction of interest into a mathematical function. For example, we can try and predict the probability that a customer wants to buy a jersey given knowledge of their age, and the latitude where they live. The technique known as logistic regression estimates the odds that someone will buy a jumper as a linear weighted sum of the features of interest.</p>
<p><span class="math display">\[ \text{odds} = \frac{p(\text{bought})}{p(\text{not bought})} \]</span> <span class="math display">\[ \log \text{odds} = \beta_0 + \beta_1 \text{age} + \beta_2 \text{latitude}\]</span></p>
<p>Here <span class="math inline">\(\beta_0\)</span>, <span class="math inline">\(\beta_1\)</span> and <span class="math inline">\(\beta_2\)</span> are the parameters of the model. If <span class="math inline">\(\beta_1\)</span> and <span class="math inline">\(\beta_2\)</span> are both positive, then the log-odds that someone will buy a jumper increase with increasing latitude and age, so the further north you are and the older you are the more likely you are to buy a jumper. The parameter <span class="math inline">\(\beta_0\)</span> is an offset parameter, and gives the log-odds of buying a jumper at zero age and on the equator. It is likely to be negative, indicating that the purchase is odds-against. This is actually a classical statistical model, and models like logistic regression are widely used to estimate probabilities from ad-click prediction to risk of disease.</p>
<p>This is called a generalized linear model; we can also think of it as estimating the <em>probability</em> of a purchase as a nonlinear function of the features (age, latitude) and the parameters (the <span class="math inline">\(\beta\)</span> values). The function is known as the <em>sigmoid</em> or <a href="https://en.wikipedia.org/wiki/Logistic_regression">logistic function</a>, thus the name <em>logistic</em> regression.</p>
<p><span class="math display">\[ p(\text{bought}) = \sigmoid{\beta_0 + \beta_1 \text{age} + \beta_2 \text{latitude}}\]</span></p>
<p>In the case where we have <em>features</em> to help us predict, we sometimes denote such features as a vector, <span class="math inline">\(\inputVector\)</span>, and we then use an inner product between the features and the parameters, <span class="math inline">\(\boldsymbol{\beta}^\top \inputVector = \beta_1 \inputScalar_1 + \beta_2 \inputScalar_2 + \beta_3 \inputScalar_3 ...\)</span>, to represent the argument of the sigmoid.</p>
<p><span class="math display">\[ p(\text{bought}) = \sigmoid{\boldsymbol{\beta}^\top \inputVector}\]</span></p>
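<p>As a sketch of this calculation, here is the jumper example in a few lines of Python. The coefficient values are invented purely for illustration:</p>

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps log-odds to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Invented coefficients, for illustration only.
beta_0 = -3.0   # offset: log-odds at zero age on the equator
beta_1 = 0.05   # change in log-odds per year of age
beta_2 = 0.02   # change in log-odds per degree of latitude

age, latitude = 40.0, 52.0  # e.g. a forty year old in Cambridge
log_odds = beta_0 + beta_1 * age + beta_2 * latitude
p_bought = sigmoid(log_odds)
print(p_bought)
```

With many features the same computation becomes an inner product, <code>sigmoid(X @ beta)</code>, as in the vector form above.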
<p>More generally, we aim to predict some aspect of our data, <span class="math inline">\(\dataScalar\)</span>, by relating it through a mathematical function, <span class="math inline">\(\mappingFunction(\cdot)\)</span>, to the parameters, <span class="math inline">\(\boldsymbol{\beta}\)</span> and the data, <span class="math inline">\(\inputVector\)</span>.</p>
<p><span class="math display">\[ \dataScalar = \mappingFunction\left(\inputVector, \boldsymbol{\beta}\right)\]</span></p>
<p>We call <span class="math inline">\(\mappingFunction(\cdot)\)</span> the <em>prediction function</em></p>
<p>To obtain the fit to data, we use a separate function called the <em>objective function</em> that gives us a mathematical representation of the difference between our predictions and the real data.</p>
<p><span class="math display">\[\errorFunction(\boldsymbol{\beta}, \dataMatrix, \inputMatrix)\]</span> A commonly used example (for example in a regression problem) is least squares, <span class="math display">\[\errorFunction(\boldsymbol{\beta}, \dataMatrix, \inputMatrix) = \sum_{i=1}^\numData \left(\dataScalar_i - \mappingFunction(\inputVector_i, \boldsymbol{\beta})\right)^2.\]</span></p>
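<p>The prediction function and objective function can be made concrete in a few lines. Here is a minimal sketch of the linear prediction function with the least squares objective, using synthetic data in place of real observations:</p>

```python
import numpy as np

def prediction(X, beta):
    """Linear prediction function: f(x, beta) = x^T beta for each row of X."""
    return X @ beta

def sum_of_squares(beta, y, X):
    """Least squares objective: the sum of squared residuals."""
    residuals = y - prediction(X, beta)
    return float(residuals.T @ residuals)

# Synthetic data for illustration.
np.random.seed(2)
X = np.random.randn(50, 2)
beta_true = np.array([[1.0], [-2.0]])
y = prediction(X, beta_true) + 0.1 * np.random.randn(50, 1)

# The objective is small near the generating parameters, large away from them.
print(sum_of_squares(beta_true, y, X))
print(sum_of_squares(np.zeros((2, 1)), y, X))
```

A learning algorithm then amounts to adjusting <code>beta</code> to reduce this objective.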
<p>If a linear prediction function is combined with the least squares objective function then that gives us a classical <em>linear regression</em>, another classical statistical model. Statistics often focusses on linear models because it makes interpretation of the model easier. Interpretation is key in statistics because the aim is normally to validate questions by analysis of data. Machine learning has typically focussed more on the prediction function itself and worried less about the interpretation of parameters, which are normally denoted by <span class="math inline">\(\mathbf{w}\)</span> instead of <span class="math inline">\(\boldsymbol{\beta}\)</span>. As a result <em>non-linear</em> functions are explored more often as they tend to improve quality of predictions but at the expense of interpretability.</p>
<img class="" src="../slides/diagrams/deepface_neg.png" width="100%" align="center" style="background:none; border:none; box-shadow:none;">
<center>
<em>The DeepFace architecture <span class="citation">(Taigman et al., 2014)</span>, visualized through colors to represent the functional mappings at each layer. There are 120 million parameters in the model. </em>
</center>
<p>The DeepFace architecture <span class="citation">(Taigman et al., 2014)</span> consists of layers that deal with <em>translation</em> and <em>rotational</em> invariances. These layers are followed by three locally-connected layers and two fully-connected layers. Color illustrates feature maps produced at each layer. The net includes more than 120 million parameters, where more than 95% come from the local and fully connected layers.</p>
<img class="" src="../slides/diagrams/576px-Early_Pinball.jpg" width="50%" align="center" style="background:none; border:none; box-shadow:none;">
<center>
<em>Deep learning models are composition of simple functions. We can think of a pinball machine as an analogy. Each layer of pins corresponds to one of the layers of functions in the model. Input data is represented by the location of the ball from left to right when it is dropped in from the top. Output class comes from the position of the ball as it leaves the pins at the bottom. </em>
</center>
<p>We can think of what these models are doing as being similar to early pinball machines. In a neural network, we input a number (or numbers), whereas in pinball, we input a ball. The location of the ball on the left-right axis can be thought of as the number. As the ball falls through the machine, each layer of pins can be thought of as a different layer of neurons. Each layer acts to move the ball from left to right.</p>
<p>In a pinball machine, when the ball gets to the bottom it might fall into a hole defining a score, in a neural network, that is equivalent to the decision: a classification of the input object.</p>
<p>An image has more than one number associated with it, so it's like playing pinball in a <em>hyper-space</em>.</p>
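<p>The composition of simple functions that the pinball analogy describes can be sketched directly (a toy two-layer network with arbitrary made-up weights, purely illustrative and not any particular architecture):</p>

```python
import numpy as np

# Each 'layer of pins' is a simple function: a linear map
# followed by a nonlinearity. The model is their composition.
rng = np.random.default_rng(1)
W1, b1 = rng.standard_normal((3, 1)), np.zeros(3)
W2, b2 = rng.standard_normal((1, 3)), np.zeros(1)

def layer1(x):
    return np.tanh(W1 @ x + b1)   # first layer of 'pins'

def layer2(h):
    return W2 @ h + b2            # second layer of 'pins'

def network(x):
    return layer2(layer1(x))      # composition: f(x) = f2(f1(x))

# The input 'ball position' is a number; the output is the
# network's decision for that input.
print(network(np.array([0.5])))
```

Deep models simply stack more such layers; learning adjusts the weights (the pin positions) so that inputs land at the right outputs.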
<object class="svgplot" align data="../slides/diagrams/pinball001.svg">
</object>
<center>
<em>At initialization, the pins, which represent the parameters of the function, aren't in the right place to bring the balls to the correct decisions. </em>
</center>
<object class="svgplot" align data="../slides/diagrams/pinball002.svg">
</object>
<center>
<em>After learning the pins are now in the right place to bring the balls to the correct decisions. </em>
</center>
<p>Learning involves moving all the pins to be in the right position, so that the ball falls in the right place. But moving all these pins in hyperspace can be difficult: you have to put a lot of data through the machine to explore the positions of all the pins. Adversarial learning reflects the fact that a ball can be moved a small distance and lead to a very different result.</p>
<p>Probabilistic methods explore more of the space by considering a range of possible paths for the ball through the machine.</p>
<h2 id="natural-and-artificial-intelligence-embodiment-factors">Natural and Artificial Intelligence: Embodiment Factors</h2>
<table>
<tr>
<td>
</td>
<td align="center">
<img class="" src="../slides/diagrams/IBM_Blue_Gene_P_supercomputer.jpg" width="40%" align="center" style="background:none; border:none; box-shadow:none;">
</td>
<td align="center">
<img class="" src="../slides/diagrams/ClaudeShannon_MFO3807.jpg" width="25%" align="center" style="background:none; border:none; box-shadow:none;">
</td>
</tr>
<tr>
<td>
compute
</td>
<td align="center">
<span class="math display">\[\approx 100 \text{ gigaflops}\]</span>
</td>
<td align="center">
<span class="math display">\[\approx 16 \text{ petaflops}\]</span>
</td>
</tr>
<tr>
<td>
communicate
</td>
<td align="center">
<span class="math display">\[1 \text{ gigabit/s}\]</span>
</td>
<td align="center">
<span class="math display">\[100 \text{ bit/s}\]</span>
</td>
</tr>
<tr>
<td>
(compute/communicate)
</td>
<td align="center">
<span class="math display">\[10^{4}\]</span>
</td>
<td align="center">
<span class="math display">\[10^{14}\]</span>
</td>
</tr>
</table>
<p>There is a fundamental limit placed on our intelligence based on our ability to communicate. Claude Shannon founded the field of information theory. The clever part of this theory is it allows us to separate our measurement of information from what the information pertains to<a href="#fn1" class="footnoteRef" id="fnref1"><sup>1</sup></a>.</p>
<p>Shannon measured information in bits. One bit of information is the amount of information I pass to you when I give you the result of a coin toss. Shannon was also interested in the amount of information in the English language. He estimated that on average a word in the English language contains 12 bits of information.</p>
<p>Given typical speaking rates, that gives us an estimate of our ability to communicate of around 100 bits per second <span class="citation">(Reed and Durlach, 1998)</span>. Computers on the other hand can communicate much more rapidly. Current wired network speeds are around a billion bits per second, ten million times faster.</p>
<p>When it comes to compute though, our best estimates indicate that our computers are slower. A typical modern computer can perform around 100 billion floating point operations per second, and each floating point operation involves a 64 bit number. So the computer is processing around 6,400 billion bits per second.</p>
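<p>The back-of-envelope arithmetic above can be checked directly; the quantities are the rough, order-of-magnitude estimates quoted in the text, not measurements:</p>

```python
# Rough estimates from the text above (orders of magnitude only).
computer_flops = 100e9   # ~100 billion floating point operations/second
bits_per_op = 64         # each operation involves a 64 bit number
network_rate = 1e9       # ~1 gigabit/s wired network connection

# Bits processed per second: 100e9 * 64 = 6.4e12,
# i.e. the 6,400 billion bits per second quoted above.
computer_bits = computer_flops * bits_per_op
print(computer_bits)

# Compute/communicate ratio for the computer: ~6.4e3, order 10^4.
print(computer_bits / network_rate)
```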
<p>It's difficult to get similar estimates for humans, but by some estimates the amount of compute we would require to <em>simulate</em> a human brain is equivalent to that in the UK's fastest computer <span class="citation">(Ananthanarayanan et al., 2009)</span>, the Met Office machine in Exeter, which in 2018 ranks as the 11th fastest computer in the world. That machine simulates the world's weather each morning, and then simulates the world's climate. It is a 16 petaflop machine, processing around 1,000 <em>trillion</em> bits per second.</p>
<p>So when it comes to our ability to compute we are extraordinary: not the computation in our conscious mind, but the underlying neuron firings that underpin our consciousness, our subconsciousness, as well as our motor control etc. By analogy I sometimes like to think of us as a Formula One engine. But in terms of our ability to deploy that computation in actual use, to share the results of what we have inferred, we are very limited. So when you imagine the F1 car that represents a psyche, think of an F1 car with bicycle wheels.</p>
<p><img class="" src="../slides/diagrams/640px-Marcel_Renault_1903.jpg" width="70%" align="center" style="background:none; border:none; box-shadow:none;"></p>
<p>In contrast, our computers have less computational power, but they can communicate far more fluidly. They are more like a go-kart, less well powered, but with tires that allow them to deploy that power.</p>
<p><img class="" src="../slides/diagrams/Caleb_McDuff_WIX_Silence_Racing_livery.jpg" width="70%" align="center" style="background:none; border:none; box-shadow:none;"></p>
<p>For humans, that means much of our computation should be dedicated to considering <em>what</em> we should compute. To do that efficiently we need to model the world around us. The most complex thing in the world around us is other humans. So it is no surprise that we model them. We second guess what their intentions are, and our communication is only necessary when they are departing from how we model them. Naturally, for this to work well, we need to understand those we work closely with. So it is no surprise that social communication, social bonding, forms so much of a part of our use of our limited bandwidth.</p>
<p>There is a second effect here: our need to anthropomorphise objects around us. Our tendency to model our fellow humans extends to our interactions with other entities in our environment: to our pets, as well as to inanimate objects around us, such as computers or even our cars. This tendency to overinterpret could be a consequence of our limited ability to communicate.</p>
<p>For more details see this paper <a href="https://arxiv.org/abs/1705.07996">"Living Together: Mind and Machine Intelligence"</a>, and this <a href="http://inverseprobability.com/talks/lawrence-tedx17/living-together.html">TEDx talk</a>.</p>
<h2 id="evolved-relationship-with-information">Evolved Relationship with Information</h2>
<p>The high bandwidth of computers has resulted in a close relationship between the computer and data. Large amounts of information can flow between the two. The degree to which the computer is mediating our relationship with data means that we should consider it an intermediary.</p>
<p>Originally our low bandwidth relationship with data was affected by two characteristics. Firstly, our tendency to over-interpret, driven by our need to extract as much knowledge as possible from our low bandwidth information channel. Secondly, our improved understanding of the domain of <em>mathematical</em> statistics and of how our cognitive biases can mislead us.</p>
<p>With this new setup there is a potential for assimilating far more information via the computer, but the computer can present this to us in various ways. If its motives are not aligned with ours then it can misrepresent the information. This needn't be nefarious; it can arise simply as a result of the computer pursuing a different objective from ours. For example, if the computer is aiming to maximize our interaction time, that may differ from our own objective, which might be to have information summarized in a representative manner in the <em>shortest</em> possible length of time.</p>
<p>For example, for me it was a common experience to pick up my telephone with the intention of checking when my next appointment was, but to soon find myself distracted by another application on the phone, and end up reading something on the internet. By the time I'd finished reading, I would often have forgotten the reason I picked up my phone in the first place.</p>
<p>There are great benefits to be had from the huge amount of information we can unlock from this evolved relationship between us and data. In biology, large scale data sharing has been driven by a revolution in genomic, transcriptomic and epigenomic measurement. The improved inferences that can be drawn through summarizing data by computer have fundamentally changed the nature of biological science. Now this phenomenon is also influencing us in our daily lives, as data measured by <em>happenstance</em> is increasingly used to characterize us.</p>
<p>Better mediation of this flow actually requires a better understanding of human-computer interaction. This in turn involves understanding our own intelligence better, what its cognitive biases are and how these might mislead us.</p>
<p>For further thoughts see <a href="https://www.theguardian.com/media-network/2015/jul/23/data-driven-economy-marketing">this Guardian article</a> from 2015 on marketing in the internet era and <a href="http://inverseprobability.com/2015/12/04/what-kind-of-ai">this blog post</a> on System Zero.</p>
<object class="svgplot" align data="../slides/diagrams/data-science/information-flow003.svg">
</object>
<center>
<em>New direction of information flow: information now reaches us mediated by the computer. </em>
</center>
<h3 id="societal-effects">Societal Effects</h3>
<p>We have already seen the effects of this changed dynamic in biology and computational biology. Improved sensorics have led to the new domains of transcriptomics, epigenomics, and 'rich phenomics' as well as considerably augmenting our capabilities in genomics.</p>
<p>Biologists have had to become data-savvy: they require a rich understanding of the available data resources and need to assimilate existing data sets both in their hypothesis generation and in their experimental design. Modern biology has become a far more quantitative science, but this shift has required new methods developed in the domains of <em>computational biology</em> and <em>bioinformatics</em>.</p>
<p>There is also great promise for personalized health, but in health the wide data-sharing that has underpinned success in the computational biology community is much harder to carry out.</p>
<p>We can expect to see these phenomena reflected in wider society. Particularly as we make use of more automated decision making based only on data.</p>
<p>The main phenomenon we see across the board is the shift in dynamic from the direct pathway between human and data, as traditionally mediated by classical statistics, to a new flow of information via the computer. This change of dynamics gives us the modern and emerging domain of <em>data science</em>.</p>
<h2 id="human-communication">Human Communication</h2>
<p>For human conversation to work, we require an internal model of who we are speaking to. We model each other, and combine our sense of who they are, who they think we are, and what has been said. This is our approach to dealing with the limited bandwidth connection we have. Empathy and understanding of intent. Mental dispositional concepts are used to augment our limited communication bandwidth.</p>
<p>Fritz Heider referred to the important point of a conversation as being that they are happenings that are "<em>psychologically represented</em> in each of the participants" (his emphasis) <span class="citation">(Heider, 1958)</span>.</p>
<h3 id="machine-learning-and-narratives">Machine Learning and Narratives</h3>
<p><img class="" src="../slides/diagrams/Classic_baby_shoes.jpg" width="60%" align="center" style="background:none; border:none; box-shadow:none;"></p>
<center>
<em>For sale: baby shoes, never worn.</em>
</center>
<p>Consider the six word novel, apocryphally credited to Ernest Hemingway, "For sale: baby shoes, never worn". To understand what that means to a human, you need a great deal of additional context. That context is not directly accessible to a machine, which lacks both the evolved and contextual understanding of our own condition needed to realize both the implication of the advert and what that implication means emotionally to the previous owner.</p>
<p><a href="https://www.youtube.com/watch?v=8FIEZXMUM2I&t=7"><img src="https://img.youtube.com/vi/8FIEZXMUM2I/0.jpg" /></a></p>
<p><a href="https://en.wikipedia.org/wiki/Fritz_Heider">Fritz Heider</a> and <a href="https://en.wikipedia.org/wiki/Marianne_Simmel">Marianne Simmel</a>'s experiments with animated shapes from 1944 <span class="citation">(Heider and Simmel, 1944)</span>. Our interpretation of these objects as showing motives and even emotion is a combination of our desire for narrative, a need for understanding of each other, and our ability to empathise. At one level, these are crudely drawn objects, but in another key way, the animator has communicated a story through simple facets such as their relative motions, their sizes and their actions. We apply our psychological representations to these faceless shapes in an effort to interpret their actions.</p>
<h3 id="faith-and-ai">Faith and AI</h3>
<p>There would seem to be at least three ways in which artificial intelligence and religion interconnect.</p>
<ol style="list-style-type: decimal">
<li>Artificial Intelligence as Cartoon Religion</li>
<li>Artificial Intelligence and Introspection</li>
<li>Independence of thought and Control: A Systemic Catch 22</li>
</ol>
<h3 id="singulariansm-ai-as-cartoon-religion">Singularism: AI as Cartoon Religion</h3>
<p>The first parallels one can find between artificial intelligence and religion come in somewhat of a cartoon doomsday scenario form. The publicly hyped fears of superintelligence and singularity can equally be placed within the framework of the simpler questions that religion can try to answer. The parallels are</p>
<ol style="list-style-type: decimal">
<li>Superintelligence as god</li>
<li>Demi-god status achievable through transhumanism</li>
<li>Immortality through uploading the connectome</li>
<li>The day of judgement as the "singularity"</li>
</ol>
<p>The notion of an ultra-intelligence is similar to the notion of an interventionist god, with omniscience of the past, the present and the future. This notion was described by Pierre Simon Laplace.</p>
<p><img class="" src="../slides/diagrams/ml/Pierre-Simon_Laplace.png" width="30%" align="center" style="background:none; border:none; box-shadow:none;"></p>
<p>Famously, Laplace considered the idea of a deterministic Universe, one in which the model is <em>known</em>, or as the translation below refers to it, "an intelligence which could comprehend all the forces by which nature is animated". He speculates on an "intelligence" that can submit this vast data to analysis and proposes that such an entity would be able to predict the future.</p>
<blockquote>
<p>Given for one instant an intelligence which could comprehend all the forces by which nature is animated and the respective situation of the beings who compose it---an intelligence sufficiently vast to submit these data to analysis---it would embrace in the same formula the movements of the greatest bodies of the universe and those of the lightest atom; for it, nothing would be uncertain and the future, as the past, would be present in its eyes.</p>
</blockquote>
<p>This notion is known as <em>Laplace's demon</em> or <em>Laplace's superman</em>.</p>
<p>Unfortunately, most analyses of his ideas stop at that point, whereas his real point is that such a notion is unreachable. Not so much <em>superman</em> as <em>strawman</em>. Just three pages later in the "Philosophical Essay on Probabilities" <span class="citation">(Laplace, 1814)</span>, Laplace goes on to observe:</p>
<blockquote>
<p>The curve described by a simple molecule of air or vapor is regulated in a manner just as certain as the planetary orbits; the only difference between them is that which comes from our ignorance.</p>
<p>Probability is relative, in part to this ignorance, in part to our knowledge.</p>
</blockquote>
<p>In other words, we can never make use of the idealized deterministic Universe due to our ignorance about the world. Laplace's suggestion, and the focus of his essay, is that we turn to probability to deal with this uncertainty. This is also our inspiration for using probability in machine learning.</p>
<p>The "forces by which nature is animated" is our <em>model</em>, the "situation of beings that compose it" is our <em>data</em> and the "intelligence sufficiently vast enough to submit these data to analysis" is our compute. The fly in the ointment is our <em>ignorance</em> about these aspects. And <em>probability</em> is the tool we use to incorporate this ignorance leading to uncertainty or <em>doubt</em> in our predictions.</p>
<p>The notion of Superintelligence in, e.g., Nick Bostrom's book <span class="citation">(Bostrom, 2014)</span>, is that of an infallible omniscience. A major narrative of the book is that the challenge of Superintelligence is to constrain the power of such an entity. In practice, this narrative is strongly related to Laplace's "straw superman". No such intelligence could exist due to our ignorance; in practice any real intelligence must express <em>doubt</em>.</p>
<p>Elon Musk has proposed that the only way to defeat the inevitable omniscience would be to augment ourselves with machine like capabilities. Ray Kurzweil has pushed the notion of developing ourselves by augmenting our existing cortex with direct connection to the internet.</p>
<p>Within Silicon Valley there is a particular obsession with 'uploading': once the brain is connected, we can achieve immortality by continuing to exist digitally in an artificial environment of our own creation while our physical body is left behind us.</p>
<p>In this scenario, doomsday is the 'technological singularity', the moment at which computers rapidly outstrip our capabilities and take over our world. The high priests are the scientists, and the aim is to bring about the singularity while restraining its destructive potential.</p>
<p><em>Singularism</em> is to religion what <em>scientology</em> is to science. Scientology is religion expressing itself as science and Singularism is science expressing itself as religion.</p>
<p>For further reading see <a href="http://inverseprobability.com/2016/05/09/machine-learning-futures-5">this post on Singularism</a> as well as this <a href="http://www.academia.edu/15037984/Singularitarians_AItheists_and_Why_the_Problem_with_Artificial_Intelligence_is_H.A.L._Humanity_At_Large_not_HAL">paper by Luciano Floridi</a> and this <a href="http://inverseprobability.com/2016/05/09/machine-learning-futures-6">review of Superintelligence</a> <span class="citation">(Bostrom, 2014)</span>.</p>
<h3 id="artificial-intelligence-and-introspection">Artificial Intelligence and Introspection</h3>
<p>Ignoring the cartoon view of religion we've outlined above and focussing more on how religion can bring strength to people in their day-to-day living, religious environments provide a place to self-reflect and meditate on our existence and the wider cosmos. How are we here? What is our role? What makes us different?</p>
<p>Creating machine intelligences clarifies the manner in which we are different, and helps us understand what is special about us rather than the machine.</p>
<p>I have in the past argued strongly against the term artificial intelligence but there is a sense in which it is a good term. If we think of artificial plants, then we have the right sense in which we are creating an artificial intelligence. An artificial plant is fundamentally different from a real plant, but can appear similar, or from a distance identical. However, a creator of an artificial plant gains a greater appreciation for the complexity of a natural plant.</p>
<p>In a similar way, we might expect that attempts to emulate human intelligence would lead to a deeper appreciation of that intelligence. This type of reflection on who we are has a lot in common with many of the (to me) most endearing characteristics of religion.</p>
<h3 id="the-cosmic-catch-22">The Cosmic Catch 22</h3>
<p>A final parallel between the world of AI and that of religion is the conundrums they raise for us. In particular the tradeoffs between a paternalistic society and individual freedoms. Two models for artificial intelligence that may be realistic are the "Big Brother" and the "Big Mother" models.</p>
<p>Big Brother refers to the surveillance society and the control of populations that can be achieved with a greater understanding of the individual self. A perceptual understanding of the individual that could conceivably be better than the individual's own self-perception. This scenario was most famously explored by George Orwell, but also came into being in Communist East Germany, where it is estimated that one in 66 citizens acted as an informant <span class="citation">(<em>Stasi</em>, 1999)</span>.</p>
<p>The same understanding of the individual is also necessary for the "Big Mother" scenario, where intelligent agents provide for us in the manner in which our parents did when we were young. Both scenarios are disempowering in terms of individual liberties. In a metaphorical sense, this could be seen as a return to Eden: a surrendering of individual liberties for a perceived paradise. But those individual liberties are also what we value. There is a tension between the desire to create a perfect environment, where no evil exists, and our individual liberty. Our society chooses a balance between the pros and cons that attempts to sustain a diversity of perspectives and beliefs. Even if it were possible to use AI to organize society in such a way that particular malevolent behaviours were prevented, doing so may come at the cost of the individual freedom we enjoy. These are difficult trade-offs, and they exist both when explaining the nature of religious belief and when considering the nature of either the dystopian Big Brother or the "dys-utopian" Big Mother view of AI.</p>
<h3 id="conclusion">Conclusion</h3>
<p>We've provided an overview of the advances in artificial intelligence from the perspective of machine learning, and tried to give a sense of how machine learning models operate to learn about us.</p>
<p>We've highlighted a quintessential difference between humans and computers: the embodiment factor, the relatively restricted ability of humans to communicate themselves when compared to computers. We explored how this has affected our evolved relationship with data and the relationship between the human and narrative.</p>
<p>Finally, we explored three parallels between faith and AI, in particular the cartoon nature of religion based on technological promises of the singularity and AI. A more sophisticated relationship occurs when we see the way in which, as artificial intelligences invade our notion of personhood, we will need to introspect about who we are and what we want to be, a characteristic shared with many religions. The final parallel was in the emergent questions of AI: "Should we build an artificial intelligence to eliminate war?" has a strong parallel with the question "Why does God allow war?". War is a consequence of human choices. Building such a system would likely severely restrict our freedoms to make choices, and there is a tension between how much we wish those freedoms to be impinged versus the potential lives that could be saved.</p>
<h3 id="thanks">Thanks!</h3>
<ul>
<li>twitter: @lawrennd</li>
<li>blog: <a href="http://inverseprobability.com/blog.html">http://inverseprobability.com</a></li>
</ul>
<div id="refs" class="references">
<div id="ref-Ananthanarayanan-cat09">
<p>Ananthanarayanan, R., Esser, S.K., Simon, H.D., Modha, D.S., 2009. The cat is out of the bag: Cortical simulations with <span class="math inline">\(10^9\)</span> neurons, <span class="math inline">\(10^{13}\)</span> synapses, in: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis - SC ’09. <a href="https://doi.org/10.1145/1654059.1654124" class="uri">https://doi.org/10.1145/1654059.1654124</a></p>
</div>
<div id="ref-Bostrom-superintelligence14">
<p>Bostrom, N., 2014. Superintelligence: Paths, dangers, strategies, 1st ed. Oxford University Press, Oxford, UK.</p>
</div>
<div id="ref-Heider:interpersonal58">
<p>Heider, F., 1958. The psychology of interpersonal relations. John Wiley.</p>
</div>
<div id="ref-Heider:experimental44">
<p>Heider, F., Simmel, M., 1944. An experimental study of apparent behavior. The American Journal of Psychology 57, 243–259.</p>
</div>
<div id="ref-Laplace:essai14">
<p>Laplace, P.S., 1814. Essai philosophique sur les probabilités, 2nd ed. Courcier, Paris.</p>
</div>
<div id="ref-Reed-information98">
<p>Reed, C., Durlach, N.I., 1998. Note on information transfer rates in human communication. Presence: Teleoperators and Virtual Environments 7, 509–518. <a href="https://doi.org/10.1162/105474698565893" class="uri">https://doi.org/10.1162/105474698565893</a></p>
</div>
<div id="ref-Koehler-stasi99">
<p>Stasi: The untold story of the East German secret police, 1999.</p>
</div>
<div id="ref-Taigman:deepface14">
<p>Taigman, Y., Yang, M., Ranzato, M., Wolf, L., 2014. DeepFace: Closing the gap to human-level performance in face verification, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. <a href="https://doi.org/10.1109/CVPR.2014.220" class="uri">https://doi.org/10.1109/CVPR.2014.220</a></p>
</div>
<div id="ref-Wiener:cybernetics48">
<p>Wiener, N., 1948. Cybernetics: Control and communication in the animal and the machine. MIT Press, Cambridge, MA.</p>
</div>
</div>
<div class="footnotes">
<hr />
<ol>
<li id="fn1"><p>The challenge of understanding what information pertains to is known as knowledge representation.<a href="#fnref1">↩</a></p></li>
</ol>
</div>
Thu, 31 May 2018 00:00:00 +0000
http://inverseprobability.com/talks/notes/faith-and-ai.html
http://inverseprobability.com/talks/notes/faith-and-ai.htmlnotesUncertainty in Loss Functions<div style="display:none">
$$\newcommand{\Amatrix}{\mathbf{A}}
\newcommand{\KL}[2]{\text{KL}\left( #1\,\|\,#2 \right)}
\newcommand{\Kaast}{\kernelMatrix_{\mathbf{ \ast}\mathbf{ \ast}}}
\newcommand{\Kastu}{\kernelMatrix_{\mathbf{ \ast} \inducingVector}}
\newcommand{\Kff}{\kernelMatrix_{\mappingFunctionVector \mappingFunctionVector}}
\newcommand{\Kfu}{\kernelMatrix_{\mappingFunctionVector \inducingVector}}
\newcommand{\Kuast}{\kernelMatrix_{\inducingVector \bf\ast}}
\newcommand{\Kuf}{\kernelMatrix_{\inducingVector \mappingFunctionVector}}
\newcommand{\Kuu}{\kernelMatrix_{\inducingVector \inducingVector}}
\newcommand{\Kuui}{\Kuu^{-1}}
\newcommand{\Qaast}{\mathbf{Q}_{\bf \ast \ast}}
\newcommand{\Qastf}{\mathbf{Q}_{\ast \mappingFunction}}
\newcommand{\Qfast}{\mathbf{Q}_{\mappingFunctionVector \bf \ast}}
\newcommand{\Qff}{\mathbf{Q}_{\mappingFunctionVector \mappingFunctionVector}}
\newcommand{\aMatrix}{\mathbf{A}}
\newcommand{\aScalar}{a}
\newcommand{\aVector}{\mathbf{a}}
\newcommand{\acceleration}{a}
\newcommand{\bMatrix}{\mathbf{B}}
\newcommand{\bScalar}{b}
\newcommand{\bVector}{\mathbf{b}}
\newcommand{\basisFunc}{\phi}
\newcommand{\basisFuncVector}{\boldsymbol{ \basisFunc}}
\newcommand{\basisFunction}{\phi}
\newcommand{\basisLocation}{\mu}
\newcommand{\basisMatrix}{\boldsymbol{ \Phi}}
\newcommand{\basisScalar}{\basisFunction}
\newcommand{\basisVector}{\boldsymbol{ \basisFunction}}
\newcommand{\activationFunction}{\phi}
\newcommand{\activationMatrix}{\boldsymbol{ \Phi}}
\newcommand{\activationScalar}{\basisFunction}
\newcommand{\activationVector}{\boldsymbol{ \basisFunction}}
\newcommand{\bigO}{\mathcal{O}}
\newcommand{\binomProb}{\pi}
\newcommand{\cMatrix}{\mathbf{C}}
\newcommand{\cbasisMatrix}{\hat{\boldsymbol{ \Phi}}}
\newcommand{\cdataMatrix}{\hat{\dataMatrix}}
\newcommand{\cdataScalar}{\hat{\dataScalar}}
\newcommand{\cdataVector}{\hat{\dataVector}}
\newcommand{\centeredKernelMatrix}{\mathbf{ \MakeUppercase{\centeredKernelScalar}}}
\newcommand{\centeredKernelScalar}{b}
\newcommand{\centeredKernelVector}{\centeredKernelScalar}
\newcommand{\centeringMatrix}{\mathbf{H}}
\newcommand{\chiSquaredDist}[2]{\chi_{#1}^{2}\left(#2\right)}
\newcommand{\chiSquaredSamp}[1]{\chi_{#1}^{2}}
\newcommand{\conditionalCovariance}{\boldsymbol{ \Sigma}}
\newcommand{\coregionalizationMatrix}{\mathbf{B}}
\newcommand{\coregionalizationScalar}{b}
\newcommand{\coregionalizationVector}{\mathbf{ \coregionalizationScalar}}
\newcommand{\covDist}[2]{\text{cov}_{#2}\left(#1\right)}
\newcommand{\covSamp}[1]{\text{cov}\left(#1\right)}
\newcommand{\covarianceScalar}{c}
\newcommand{\covarianceVector}{\mathbf{ \covarianceScalar}}
\newcommand{\covarianceMatrix}{\mathbf{C}}
\newcommand{\covarianceMatrixTwo}{\boldsymbol{ \Sigma}}
\newcommand{\croupierScalar}{s}
\newcommand{\croupierVector}{\mathbf{ \croupierScalar}}
\newcommand{\croupierMatrix}{\mathbf{ \MakeUppercase{\croupierScalar}}}
\newcommand{\dataDim}{p}
\newcommand{\dataIndex}{i}
\newcommand{\dataIndexTwo}{j}
\newcommand{\dataMatrix}{\mathbf{Y}}
\newcommand{\dataScalar}{y}
\newcommand{\dataSet}{\mathcal{D}}
\newcommand{\dataStd}{\sigma}
\newcommand{\dataVector}{\mathbf{ \dataScalar}}
\newcommand{\decayRate}{d}
\newcommand{\degreeMatrix}{\mathbf{ \MakeUppercase{\degreeScalar}}}
\newcommand{\degreeScalar}{d}
\newcommand{\degreeVector}{\mathbf{ \degreeScalar}}
% Already defined by latex
%\newcommand{\det}[1]{\left|#1\right|}
\newcommand{\diag}[1]{\text{diag}\left(#1\right)}
\newcommand{\diagonalMatrix}{\mathbf{D}}
\newcommand{\diff}[2]{\frac{\text{d}#1}{\text{d}#2}}
\newcommand{\diffTwo}[2]{\frac{\text{d}^2#1}{\text{d}#2^2}}
\newcommand{\displacement}{x}
\newcommand{\displacementVector}{\textbf{\displacement}}
\newcommand{\distanceMatrix}{\mathbf{ \MakeUppercase{\distanceScalar}}}
\newcommand{\distanceScalar}{d}
\newcommand{\distanceVector}{\mathbf{ \distanceScalar}}
\newcommand{\eigenvaltwo}{\ell}
\newcommand{\eigenvaltwoMatrix}{\mathbf{L}}
\newcommand{\eigenvaltwoVector}{\mathbf{l}}
\newcommand{\eigenvalue}{\lambda}
\newcommand{\eigenvalueMatrix}{\boldsymbol{ \Lambda}}
\newcommand{\eigenvalueVector}{\boldsymbol{ \lambda}}
\newcommand{\eigenvector}{\mathbf{ \eigenvectorScalar}}
\newcommand{\eigenvectorMatrix}{\mathbf{U}}
\newcommand{\eigenvectorScalar}{u}
\newcommand{\eigenvectwo}{\mathbf{v}}
\newcommand{\eigenvectwoMatrix}{\mathbf{V}}
\newcommand{\eigenvectwoScalar}{v}
\newcommand{\entropy}[1]{\mathcal{H}\left(#1\right)}
\newcommand{\errorFunction}{E}
\newcommand{\expDist}[2]{\left<#1\right>_{#2}}
\newcommand{\expSamp}[1]{\left<#1\right>}
\newcommand{\expectation}[1]{\left\langle #1 \right\rangle }
\newcommand{\expectationDist}[2]{\left\langle #1 \right\rangle _{#2}}
\newcommand{\expectedDistanceMatrix}{\mathcal{D}}
\newcommand{\eye}{\mathbf{I}}
\newcommand{\fantasyDim}{r}
\newcommand{\fantasyMatrix}{\mathbf{ \MakeUppercase{\fantasyScalar}}}
\newcommand{\fantasyScalar}{z}
\newcommand{\fantasyVector}{\mathbf{ \fantasyScalar}}
\newcommand{\featureStd}{\varsigma}
\newcommand{\gammaCdf}[3]{\mathcal{GAMMA CDF}\left(#1|#2,#3\right)}
\newcommand{\gammaDist}[3]{\mathcal{G}\left(#1|#2,#3\right)}
\newcommand{\gammaSamp}[2]{\mathcal{G}\left(#1,#2\right)}
\newcommand{\gaussianDist}[3]{\mathcal{N}\left(#1|#2,#3\right)}
\newcommand{\gaussianSamp}[2]{\mathcal{N}\left(#1,#2\right)}
\newcommand{\given}{|}
\newcommand{\half}{\frac{1}{2}}
\newcommand{\heaviside}{H}
\newcommand{\hiddenMatrix}{\mathbf{ \MakeUppercase{\hiddenScalar}}}
\newcommand{\hiddenScalar}{h}
\newcommand{\hiddenVector}{\mathbf{ \hiddenScalar}}
\newcommand{\identityMatrix}{\eye}
\newcommand{\inducingInputScalar}{z}
\newcommand{\inducingInputVector}{\mathbf{ \inducingInputScalar}}
\newcommand{\inducingInputMatrix}{\mathbf{Z}}
\newcommand{\inducingScalar}{u}
\newcommand{\inducingVector}{\mathbf{ \inducingScalar}}
\newcommand{\inducingMatrix}{\mathbf{U}}
\newcommand{\inlineDiff}[2]{\text{d}#1/\text{d}#2}
\newcommand{\inputDim}{q}
\newcommand{\inputMatrix}{\mathbf{X}}
\newcommand{\inputScalar}{x}
\newcommand{\inputSpace}{\mathcal{X}}
\newcommand{\inputVals}{\inputVector}
\newcommand{\inputVector}{\mathbf{ \inputScalar}}
\newcommand{\iterNum}{k}
\newcommand{\kernel}{\kernelScalar}
\newcommand{\kernelMatrix}{\mathbf{K}}
\newcommand{\kernelScalar}{k}
\newcommand{\kernelVector}{\mathbf{ \kernelScalar}}
\newcommand{\kff}{\kernelScalar_{\mappingFunction \mappingFunction}}
\newcommand{\kfu}{\kernelVector_{\mappingFunction \inducingScalar}}
\newcommand{\kuf}{\kernelVector_{\inducingScalar \mappingFunction}}
\newcommand{\kuu}{\kernelVector_{\inducingScalar \inducingScalar}}
\newcommand{\lagrangeMultiplier}{\lambda}
\newcommand{\lagrangeMultiplierMatrix}{\boldsymbol{ \Lambda}}
\newcommand{\lagrangian}{L}
\newcommand{\laplacianFactor}{\mathbf{ \MakeUppercase{\laplacianFactorScalar}}}
\newcommand{\laplacianFactorScalar}{m}
\newcommand{\laplacianFactorVector}{\mathbf{ \laplacianFactorScalar}}
\newcommand{\laplacianMatrix}{\mathbf{L}}
\newcommand{\laplacianScalar}{\ell}
\newcommand{\laplacianVector}{\mathbf{ \ell}}
\newcommand{\latentDim}{q}
\newcommand{\latentDistanceMatrix}{\boldsymbol{ \Delta}}
\newcommand{\latentDistanceScalar}{\delta}
\newcommand{\latentDistanceVector}{\boldsymbol{ \delta}}
\newcommand{\latentForce}{f}
\newcommand{\latentFunction}{u}
\newcommand{\latentFunctionVector}{\mathbf{ \latentFunction}}
\newcommand{\latentFunctionMatrix}{\mathbf{ \MakeUppercase{\latentFunction}}}
\newcommand{\latentIndex}{j}
\newcommand{\latentScalar}{z}
\newcommand{\latentVector}{\mathbf{ \latentScalar}}
\newcommand{\latentMatrix}{\mathbf{Z}}
\newcommand{\learnRate}{\eta}
\newcommand{\lengthScale}{\ell}
\newcommand{\rbfWidth}{\ell}
\newcommand{\likelihoodBound}{\mathcal{L}}
\newcommand{\likelihoodFunction}{L}
\newcommand{\locationScalar}{\mu}
\newcommand{\locationVector}{\boldsymbol{ \locationScalar}}
\newcommand{\locationMatrix}{\mathbf{M}}
\newcommand{\variance}[1]{\text{var}\left( #1 \right)}
\newcommand{\mappingFunction}{f}
\newcommand{\mappingFunctionMatrix}{\mathbf{F}}
\newcommand{\mappingFunctionTwo}{g}
\newcommand{\mappingFunctionTwoMatrix}{\mathbf{G}}
\newcommand{\mappingFunctionTwoVector}{\mathbf{ \mappingFunctionTwo}}
\newcommand{\mappingFunctionVector}{\mathbf{ \mappingFunction}}
\newcommand{\scaleScalar}{s}
\newcommand{\mappingScalar}{w}
\newcommand{\mappingVector}{\mathbf{ \mappingScalar}}
\newcommand{\mappingMatrix}{\mathbf{W}}
\newcommand{\mappingScalarTwo}{v}
\newcommand{\mappingVectorTwo}{\mathbf{ \mappingScalarTwo}}
\newcommand{\mappingMatrixTwo}{\mathbf{V}}
\newcommand{\maxIters}{K}
\newcommand{\meanMatrix}{\mathbf{M}}
\newcommand{\meanScalar}{\mu}
\newcommand{\meanTwoMatrix}{\mathbf{M}}
\newcommand{\meanTwoScalar}{m}
\newcommand{\meanTwoVector}{\mathbf{ \meanTwoScalar}}
\newcommand{\meanVector}{\boldsymbol{ \meanScalar}}
\newcommand{\mrnaConcentration}{m}
\newcommand{\naturalFrequency}{\omega}
\newcommand{\neighborhood}[1]{\mathcal{N}\left( #1 \right)}
\newcommand{\neilurl}{http://inverseprobability.com/}
\newcommand{\noiseMatrix}{\boldsymbol{ E}}
\newcommand{\noiseScalar}{\epsilon}
\newcommand{\noiseVector}{\boldsymbol{ \epsilon}}
\newcommand{\norm}[1]{\left\Vert #1 \right\Vert}
\newcommand{\normalizedLaplacianMatrix}{\hat{\mathbf{L}}}
\newcommand{\normalizedLaplacianScalar}{\hat{\ell}}
\newcommand{\normalizedLaplacianVector}{\hat{\mathbf{ \ell}}}
\newcommand{\numActive}{m}
\newcommand{\numBasisFunc}{m}
\newcommand{\numComponents}{m}
\newcommand{\numComps}{K}
\newcommand{\numData}{n}
\newcommand{\numFeatures}{K}
\newcommand{\numHidden}{h}
\newcommand{\numInducing}{m}
\newcommand{\numLayers}{\ell}
\newcommand{\numNeighbors}{K}
\newcommand{\numSequences}{s}
\newcommand{\numSuccess}{s}
\newcommand{\numTasks}{m}
\newcommand{\numTime}{T}
\newcommand{\numTrials}{S}
\newcommand{\outputIndex}{j}
\newcommand{\paramVector}{\boldsymbol{ \theta}}
\newcommand{\parameterMatrix}{\boldsymbol{ \Theta}}
\newcommand{\parameterScalar}{\theta}
\newcommand{\parameterVector}{\boldsymbol{ \parameterScalar}}
\newcommand{\partDiff}[2]{\frac{\partial#1}{\partial#2}}
\newcommand{\precisionScalar}{j}
\newcommand{\precisionVector}{\mathbf{ \precisionScalar}}
\newcommand{\precisionMatrix}{\mathbf{J}}
\newcommand{\pseudotargetScalar}{\widetilde{y}}
\newcommand{\pseudotargetVector}{\mathbf{ \pseudotargetScalar}}
\newcommand{\pseudotargetMatrix}{\mathbf{ \widetilde{Y}}}
\newcommand{\rank}[1]{\text{rank}\left(#1\right)}
\newcommand{\rayleighDist}[2]{\mathcal{R}\left(#1|#2\right)}
\newcommand{\rayleighSamp}[1]{\mathcal{R}\left(#1\right)}
\newcommand{\responsibility}{r}
\newcommand{\rotationScalar}{r}
\newcommand{\rotationVector}{\mathbf{ \rotationScalar}}
\newcommand{\rotationMatrix}{\mathbf{R}}
\newcommand{\sampleCovScalar}{s}
\newcommand{\sampleCovVector}{\mathbf{ \sampleCovScalar}}
\newcommand{\sampleCovMatrix}{\mathbf{s}}
\newcommand{\scalarProduct}[2]{\left\langle{#1},{#2}\right\rangle}
\newcommand{\sign}[1]{\text{sign}\left(#1\right)}
\newcommand{\sigmoid}[1]{\sigma\left(#1\right)}
\newcommand{\singularvalue}{\ell}
\newcommand{\singularvalueMatrix}{\mathbf{L}}
\newcommand{\singularvalueVector}{\mathbf{l}}
\newcommand{\sorth}{\mathbf{u}}
\newcommand{\spar}{\lambda}
\newcommand{\trace}[1]{\text{tr}\left(#1\right)}
\newcommand{\BasalRate}{B}
\newcommand{\DampingCoefficient}{C}
\newcommand{\DecayRate}{D}
\newcommand{\Displacement}{X}
\newcommand{\LatentForce}{F}
\newcommand{\Mass}{M}
\newcommand{\Sensitivity}{S}
\newcommand{\basalRate}{b}
\newcommand{\dampingCoefficient}{c}
\newcommand{\mass}{m}
\newcommand{\sensitivity}{s}
\newcommand{\springScalar}{\kappa}
\newcommand{\springVector}{\boldsymbol{ \kappa}}
\newcommand{\springMatrix}{\boldsymbol{ \mathcal{K}}}
\newcommand{\tfConcentration}{p}
\newcommand{\tfDecayRate}{\delta}
\newcommand{\tfMrnaConcentration}{f}
\newcommand{\tfVector}{\mathbf{ \tfConcentration}}
\newcommand{\velocity}{v}
\newcommand{\sufficientStatsScalar}{g}
\newcommand{\sufficientStatsVector}{\mathbf{ \sufficientStatsScalar}}
\newcommand{\sufficientStatsMatrix}{\mathbf{G}}
\newcommand{\switchScalar}{s}
\newcommand{\switchVector}{\mathbf{ \switchScalar}}
\newcommand{\switchMatrix}{\mathbf{S}}
\newcommand{\tr}[1]{\text{tr}\left(#1\right)}
\newcommand{\loneNorm}[1]{\left\Vert #1 \right\Vert_1}
\newcommand{\ltwoNorm}[1]{\left\Vert #1 \right\Vert_2}
\newcommand{\onenorm}[1]{\left\vert#1\right\vert_1}
\newcommand{\twonorm}[1]{\left\Vert #1 \right\Vert}
\newcommand{\vScalar}{v}
\newcommand{\vVector}{\mathbf{v}}
\newcommand{\vMatrix}{\mathbf{V}}
\newcommand{\varianceDist}[2]{\text{var}_{#2}\left( #1 \right)}
% Already defined by latex
%\newcommand{\vec}{#1:}
\newcommand{\vecb}[1]{\left(#1\right):}
\newcommand{\weightScalar}{w}
\newcommand{\weightVector}{\mathbf{ \weightScalar}}
\newcommand{\weightMatrix}{\mathbf{W}}
\newcommand{\weightedAdjacencyMatrix}{\mathbf{A}}
\newcommand{\weightedAdjacencyScalar}{a}
\newcommand{\weightedAdjacencyVector}{\mathbf{ \weightedAdjacencyScalar}}
\newcommand{\onesVector}{\mathbf{1}}
\newcommand{\zerosVector}{\mathbf{0}}
$$
</div>
<h2 id="what-is-machine-learning">What is Machine Learning?</h2>
<p>What is machine learning? At its most basic level machine learning is a combination of</p>
<p><span class="math display">\[ \text{data} + \text{model} \xrightarrow{\text{compute}} \text{prediction}\]</span></p>
<p>where <em>data</em> is our observations. They can be actively or passively acquired (meta-data). The <em>model</em> contains our assumptions, based on previous experience. That experience can be other data, it can come from transfer learning, or it can merely be our beliefs about the regularities of the universe. In humans our models include our inductive biases. The <em>prediction</em> is an action to be taken or a categorization or a quality score. The reason that machine learning has become a mainstay of artificial intelligence is the importance of predictions in artificial intelligence. The data and the model are combined through computation.</p>
<p>In practice we normally perform machine learning using two functions. To combine data with a model we typically make use of:</p>
<p><strong>a prediction function</strong>: a function which is used to make the predictions. It includes our beliefs about the regularities of the universe, our assumptions about how the world works, e.g. smoothness, spatial similarities, temporal similarities.</p>
<p><strong>an objective function</strong>: a function which defines the cost of misprediction. Typically it includes knowledge about the world's generating processes (probabilistic objectives) or the costs we pay for mispredictions (empirical risk minimization).</p>
<p>The combination of data and model through the prediction function and the objective function leads to a <em>learning algorithm</em>. The class of prediction functions and objective functions we can make use of is restricted by the algorithms they lead to. If the prediction function or the objective function are too complex, then it can be difficult to find an appropriate learning algorithm. Much of the academic field of machine learning is the quest for new learning algorithms that allow us to bring different types of models and data together.</p>
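<p>As a deliberately minimal sketch of these two functions (the names here are illustrative, not from any library used later in these notes), a linear prediction function can be paired with a squared-loss objective, and the "compute" step is then ordinary least squares:</p>

```python
import numpy as np

# A minimal illustrative sketch (function names are hypothetical):
# a linear prediction function paired with a squared-loss objective.
def prediction_function(X, w):
    """Our modelling assumption: predictions are linear in the inputs."""
    return X @ w

def objective_function(y, f):
    """The cost of misprediction: empirical risk with squared loss."""
    return np.sum((y - f)**2)

# data + model --compute--> prediction: here the compute is least squares.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])  # column of ones for a bias
y = np.array([0.5, 1.5, 2.5])
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(objective_function(y, prediction_function(X, w)))
```

<p>Changing either function (a different basis, a different loss) changes which learning algorithms are available, which is the point made above.</p>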
<p>A useful reference for state of the art in machine learning is the UK Royal Society Report, <a href="https://royalsociety.org/~/media/policy/projects/machine-learning/publications/machine-learning-report.pdf">Machine Learning: Power and Promise of Computers that Learn by Example</a>.</p>
<p>You can also check my blog post on <a href="http://inverseprobability.com/2017/07/17/what-is-machine-learning">"What is Machine Learning?"</a></p>
<h2 id="artificial-vs-natural-systems">Artificial vs Natural Systems</h2>
<h3 id="natural-systems-are-evolved">Natural Systems are Evolved</h3>
<blockquote>
<p>Survival of the fittest</p>
<p><a href="https://en.wikipedia.org/wiki/Herbert_Spencer">Herbert Spencer</a>, 1864</p>
</blockquote>
<p>Darwin never said "Survival of the Fittest"; he talked about evolution by natural selection.</p>
<p>Evolution is better described as "non-survival of the non-fit". You don't have to be the fittest to survive, you just need to avoid the pitfalls of life. This is the first priority.</p>
<p>A mistake we make in the design of our systems is to equate fitness with the objective function, and to assume it is known and static. In practice, a real environment would have an evolving fitness function which would be unknown at any given time.</p>
<p>Uncertainty in models is handled by Bayesian inference; here we consider uncertainty arising in loss functions.</p>
<p>Consider a loss function which decomposes across individual observations, <span class="math inline">\(\dataScalar_{k,j}\)</span>, each of which is dependent on some set of features, <span class="math inline">\(\inputVector_k\)</span>.</p>
<p><span class="math display">\[
\errorFunction(\dataVector, \inputMatrix) = \sum_{k}\sum_{j}
L(\dataScalar_{k,j}, \inputVector_k)
\]</span> Assume that the loss function depends on the features through some mapping function, <span class="math inline">\(\mappingFunction_j(\cdot)\)</span> which we call the <em>prediction function</em>.</p>
<p><span class="math display">\[
\errorFunction(\dataVector, \inputMatrix) = \sum_{k}\sum_{j} L(\dataScalar_{k,j},
\mappingFunction_j(\inputVector_k))
\]</span> Without loss of generality, we can move the output index into the inputs, so we have <span class="math inline">\(\inputVector_i =\left[\inputVector_k \quad j\right]\)</span>, and we set <span class="math inline">\(\dataScalar_i = \dataScalar_{k, j}\)</span>. So we have</p>
<p><span class="math display">\[
\errorFunction(\dataVector, \inputMatrix) = \sum_{i} L(\dataScalar_i, \mappingFunction(\inputVector_i))
\]</span> Bayesian inference considers uncertainty in <span class="math inline">\(\mappingFunction\)</span>, often through parameterizing it, <span class="math inline">\(\mappingFunction(\inputVector; \parameterVector)\)</span>, and considering a <em>prior</em> distribution for the parameters, <span class="math inline">\(p(\parameterVector)\)</span>; this in turn implies a distribution over functions, <span class="math inline">\(p(\mappingFunction)\)</span>. Process models, such as Gaussian processes, specify this distribution, known as a process, directly.</p>
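<p>A short illustrative sketch (an addition here, not part of the original derivation) of how a prior over parameters implies a distribution over functions: draw parameter vectors from a Gaussian prior and push them through a polynomial basis, and each draw is a sampled function.</p>

```python
import numpy as np

# Illustrative sketch: a Gaussian prior p(theta) over the parameters of a
# linear-in-the-parameters model induces a distribution over functions p(f).
rng = np.random.default_rng(1)
x = np.linspace(-1.0, 1.0, 50)
Phi = np.column_stack([x**d for d in range(4)])    # polynomial basis matrix
alpha = 1.0                                        # prior variance (assumed value)
W = rng.normal(0.0, np.sqrt(alpha), size=(4, 10))  # ten draws of theta
F = Phi @ W                                        # ten sampled functions, one per column
print(F.shape)
```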
<p>Bayesian inference proceeds by specifying a <em>likelihood</em> which relates the data, <span class="math inline">\(\dataScalar\)</span>, to the parameters. Here we choose not to do this, but instead we only consider the <em>loss</em> function for our objective. The loss is the cost we pay for a misclassification.</p>
<p>The <em>risk function</em> is the expectation of the loss under the distribution of the data. Here we are using the framework of <em>empirical risk</em> minimization, because we have a sample based approximation. The new expectation we are considering is around the loss function itself, not the uncertainty in the data.</p>
<p>The loss function and the log likelihood may take a mathematically similar form but they are philosophically very different. The log likelihood assumes something about the <em>generating</em> function of the data, whereas the loss function assumes something about the cost we pay. Importantly the loss function in Bayesian inference only normally enters at the point of decision.</p>
<p>The key idea in Bayesian inference is that the probabilistic inference can be performed <em>without</em> knowing the loss because if the model is correct, then the form of the loss function is irrelevant when performing inference. In practice, however, for real data sets the model is almost never correct.</p>
<p>Some of the maths below looks similar to the maths we can find in Bayesian methods, in particular variational Bayes, but that is merely a consequence of the availability of analytical mathematics. There are only particular ways of developing tractable algorithms; one route involves linear algebra. However, the similarity of the mathematics belies a difference in interpretation. It is similar to travelling a road (e.g. Ermine Street) in a wild landscape. We travel together because that is where efficient progress is to be made, but in practice our destinations (Lincoln, York) may be different.</p>
<h3 id="introduce-uncertainty">Introduce Uncertainty</h3>
<p>To introduce uncertainty we consider a weighted version of the loss function, introducing positive weights, <span class="math inline">\(\left\{ \scaleScalar_i\right\}_{i=1}^\numData\)</span>. <span class="math display">\[
\errorFunction(\dataVector, \inputMatrix) = \sum_{i}
\scaleScalar_i L(\dataScalar_i, \mappingFunction(\inputVector_i))
\]</span> We now assume that these weights are drawn from a distribution, <span class="math inline">\(q(\scaleScalar)\)</span>. Instead of looking to minimize the loss directly, we look at the expected loss under this distribution.</p>
<p><span class="math display">\[
\begin{align*}
\errorFunction(\dataVector, \inputMatrix) = & \sum_{i}\expectationDist{\scaleScalar_i L(\dataScalar_i, \mappingFunction(\inputVector_i))}{q(\scaleScalar)} \\
= & \sum_{i}\expectationDist{\scaleScalar_i }{q(\scaleScalar)}L(\dataScalar_i, \mappingFunction(\inputVector_i))
\end{align*}
\]</span> We will assume that our process, <span class="math inline">\(q(\scaleScalar)\)</span>, can depend on a variety of inputs such as <span class="math inline">\(\dataVector\)</span>, <span class="math inline">\(\inputMatrix\)</span> and time, <span class="math inline">\(t\)</span>.</p>
<h3 id="principle-of-maximum-entropy">Principle of Maximum Entropy</h3>
<p>To maximize uncertainty in <span class="math inline">\(q(\scaleScalar)\)</span> we maximize its entropy. Following Jaynes' formalism of maximum entropy, in the continuous space we do this with respect to an invariant measure, <span class="math display">\[
H(\scaleScalar)= - \int q(\scaleScalar) \log \frac{q(\scaleScalar)}{m(\scaleScalar)} \text{d}\scaleScalar
\]</span> and, since we are minimizing the loss, we balance the two by subtracting the entropy from the weighted loss to form <span class="math display">\[
\begin{align*}
\errorFunction = & \beta\sum_{i}\expectationDist{\scaleScalar_i }{q(\scaleScalar)}L(\dataScalar_i, \mappingFunction(\inputVector_i)) - H(\scaleScalar)\\
&= \beta\sum_{i}\expectationDist{\scaleScalar_i }{q(\scaleScalar)}L(\dataScalar_i, \mappingFunction(\inputVector_i)) + \int q(\scaleScalar) \log \frac{q(\scaleScalar)}{m(\scaleScalar)}\text{d}\scaleScalar
\end{align*}
\]</span> where <span class="math inline">\(\beta\)</span> serves to weight the relative contribution of the entropy term and the loss term.</p>
<p>We can now minimize this modified loss with respect to the density <span class="math inline">\(q(\scaleScalar)\)</span>, the freeform optimization over this term leads to <span class="math display">\[
\begin{align*}
q(\scaleScalar) \propto & \exp\left(- \beta \sum_{i=1}^\numData \scaleScalar_i L(\dataScalar_i, \mappingFunction(\inputVector_i)) \right) m(\scaleScalar)\\
\propto & \prod_{i=1}^\numData \exp\left(- \beta \scaleScalar_i L(\dataScalar_i, \mappingFunction(\inputVector_i)) \right) m(\scaleScalar)
\end{align*}
\]</span></p>
<h3 id="example">Example</h3>
<p>Assume <span class="math display">\[
m(\scaleScalar) = \prod_i \lambda\exp\left(-\lambda\scaleScalar_i\right)
\]</span> which is the distribution with the maximum entropy for a given mean, <span class="math inline">\(\scaleScalar\)</span>. Then we have <span class="math display">\[
q(\scaleScalar) = \prod_i q(\scaleScalar_i)
\]</span> <span class="math display">\[
q(\scaleScalar_i) = \left(\lambda+\beta L_i\right) \exp\left(-(\lambda+\beta L_i) \scaleScalar_i\right)
\]</span> and we can compute <span class="math display">\[
\expectationDist{\scaleScalar_i}{q(\scaleScalar)} =
\frac{1}{\lambda + \beta L_i}
\]</span></p>
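<p>As a quick numerical sanity check (an addition here, with illustrative parameter values), <span class="math inline">\(q(\scaleScalar_i)\)</span> is an exponential distribution with rate <span class="math inline">\(\lambda + \beta L_i\)</span>, so the sample mean of draws from it should match the expectation above.</p>

```python
import numpy as np

# Sanity check (illustrative values for lambda, beta, L_i): the freeform
# solution q(s_i) is exponential with rate lambda + beta*L_i, whose mean
# is 1/(lambda + beta*L_i).
rng = np.random.default_rng(0)
lam, beta, L_i = 1.0, 2.0, 0.5     # assumed example values
rate = lam + beta * L_i
samples = rng.exponential(scale=1.0 / rate, size=200_000)
print(samples.mean(), 1.0 / rate)  # the two values should agree closely
```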
<h3 id="coordinate-descent">Coordinate Descent</h3>
<p>We can minimize with respect to <span class="math inline">\(q(\scaleScalar)\)</span> recovering, <span class="math display">\[
q(\scaleScalar_i) = \left(\lambda+\beta L_i\right) \exp\left(-(\lambda+\beta L_i) \scaleScalar_i\right)
\]</span> allowing us to compute the expectation of <span class="math inline">\(\scaleScalar\)</span>, <span class="math display">\[
\expectationDist{\scaleScalar_i}{q(\scaleScalar_i)} = \frac{1}{\lambda+\beta
L_i}
\]</span> then, we can minimize our expected loss with respect to <span class="math inline">\(\mappingFunction(\cdot)\)</span> <span class="math display">\[
\beta \sum_{i=1}^\numData \expectationDist{\scaleScalar_i}{q(\scaleScalar_i)} L(\dataScalar_i, \mappingFunction(\inputVector_i))
\]</span> If the loss is the <em>squared loss</em>, then this is recognised as a <em>reweighted least squares algorithm</em>. However, the loss can be of any form as long as <span class="math inline">\(q(\scaleScalar)\)</span> defined above exists.</p>
<p>In addition to the above, in our example below, we update <span class="math inline">\(\beta\)</span> to normalize the expected loss to be <span class="math inline">\(\numData\)</span> at each iteration, so we have <span class="math display">\[
\beta = \frac{\numData}{\sum_{i=1}^\numData \expectationDist{\scaleScalar_i}{q(\scaleScalar_i)} L(\dataScalar_i, \mappingFunction(\inputVector_i))}
\]</span></p>
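<p>The coordinate descent updates above can be sketched as follows. This is a self-contained NumPy version for the squared loss, with hypothetical function names; it is not the <code>mlai</code>-based implementation used below, but it performs the same alternation between the weight update and the <span class="math inline">\(q(\scaleScalar)\)</span> update.</p>

```python
import numpy as np

# Sketch of the coordinate descent described above, for squared loss.
# Function and variable names are illustrative, not from mlai.
def update_weights(Phi, y, s):
    """Weighted least squares: minimize sum_i s_i (y_i - phi_i^T w)^2."""
    sqrt_s = np.sqrt(s)
    w, *_ = np.linalg.lstsq(Phi * sqrt_s[:, None], y * sqrt_s, rcond=None)
    return w

def coordinate_descent(Phi, y, lam=1.0, beta=1.0, iters=30):
    s = np.ones(len(y))
    for _ in range(iters):
        w = update_weights(Phi, y, s)       # minimize expected loss wrt f
        losses = (y - Phi @ w)**2           # per-point loss L_i
        s = 1.0 / (lam + beta * losses)     # <s_i> under q(s_i)
        beta = len(y) / np.sum(s * losses)  # normalize expected loss to n
    return w, s, beta
```

<p>On the Olympic data this loop plays the role of the <code>fit</code> method of the <code>LML</code> class defined below.</p>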
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">import</span> numpy <span class="im">as</span> np
<span class="im">import</span> matplotlib.pyplot <span class="im">as</span> plt
<span class="im">import</span> pods
<span class="im">import</span> teaching_plots <span class="im">as</span> plot
<span class="im">import</span> mlai</code></pre></div>
<h3 id="olympic-marathon-data">Olympic Marathon Data</h3>
<p>The first thing we will do is load a standard data set for regression modelling. The data consists of the pace of Olympic Gold Medal Marathon winners for the Olympics from 1896 to present. First we load in the data and plot.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">data <span class="op">=</span> pods.datasets.olympic_marathon_men()
x <span class="op">=</span> data[<span class="st">'X'</span>]
y <span class="op">=</span> data[<span class="st">'Y'</span>]
offset <span class="op">=</span> y.mean()
scale <span class="op">=</span> np.sqrt(y.var())
xlim <span class="op">=</span> (<span class="dv">1875</span>,<span class="dv">2030</span>)
ylim <span class="op">=</span> (<span class="fl">2.5</span>, <span class="fl">6.5</span>)
yhat <span class="op">=</span> (y<span class="op">-</span>offset)<span class="op">/</span>scale
fig, ax <span class="op">=</span> plt.subplots(figsize<span class="op">=</span>plot.big_wide_figsize)
_ <span class="op">=</span> ax.plot(x, y, <span class="st">'r.'</span>,markersize<span class="op">=</span><span class="dv">10</span>)
ax.set_xlabel(<span class="st">'year'</span>, fontsize<span class="op">=</span><span class="dv">20</span>)
ax.set_ylabel(<span class="st">'pace min/km'</span>, fontsize<span class="op">=</span><span class="dv">20</span>)
ax.set_xlim(xlim)
ax.set_ylim(ylim)
mlai.write_figure(figure<span class="op">=</span>fig, filename<span class="op">=</span><span class="st">'../slides/diagrams/datasets/olympic-marathon.svg'</span>, transparent<span class="op">=</span><span class="va">True</span>, frameon<span class="op">=</span><span class="va">True</span>)</code></pre></div>
<h3 id="olympic-marathon-data-1">Olympic Marathon Data</h3>
<table>
<tr>
<td width="70%">
<ul>
<li><p>Gold medal times for Olympic Marathon since 1896.</p></li>
<li><p>Marathons before 1924 didn’t have a standardised distance.</p></li>
<li><p>Present results using pace per km.</p></li>
<li>In 1904 the Marathon was badly organised, leading to very slow times.</li>
</ul>
</td>
<td width="30%">
<img src="../slides/diagrams/Stephen_Kiprotich.jpg" alt="image" /> <small>Image from Wikimedia Commons <a href="http://bit.ly/16kMKHQ" class="uri">http://bit.ly/16kMKHQ</a></small>
</td>
</tr>
</table>
<object class="svgplot" align data="../slides/diagrams/datasets/olympic-marathon.svg">
</object>
<p>Things to notice about the data include the outlier in 1904; in that year the Olympics was held in St Louis, USA. Organizational problems, and the dust kicked up by cars following the race, meant that participants got lost and only a very few completed.</p>
<p>More recent years see more consistently quick marathons.</p>
<h3 id="example-linear-regression">Example: Linear Regression</h3>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">import</span> mlai
<span class="im">import</span> numpy <span class="im">as</span> np
<span class="im">import</span> scipy <span class="im">as</span> sp</code></pre></div>
<p>Create a weighted linear regression class, inheriting from the <code>mlai.LM</code> class.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="kw">class</span> LML(mlai.LM):
<span class="co">"""Linear model with evolving loss</span>
<span class="co"> :param X: input values</span>
<span class="co"> :type X: numpy.ndarray</span>
<span class="co"> :param y: target values</span>
<span class="co"> :type y: numpy.ndarray</span>
<span class="co">    :param basis: basis function</span>
<span class="co">    :type basis: function</span>
<span class="co">    :param beta: weight of the loss function</span>
<span class="co">    :type beta: float</span>
<span class="co">    :param lambd: rate parameter of the exponential measure</span>
<span class="co">    :type lambd: float"""</span>
<span class="kw">def</span> <span class="fu">__init__</span>(<span class="va">self</span>, X, y, basis<span class="op">=</span><span class="va">None</span>, beta<span class="op">=</span><span class="fl">1.0</span>, lambd<span class="op">=</span><span class="fl">1.0</span>):
<span class="co">"Initialise"</span>
<span class="cf">if</span> basis <span class="kw">is</span> <span class="va">None</span>:
basis <span class="op">=</span> mlai.basis(mlai.polynomial, number<span class="op">=</span><span class="dv">2</span>)
mlai.LM.<span class="fu">__init__</span>(<span class="va">self</span>, X, y, basis)
<span class="va">self</span>.s <span class="op">=</span> np.ones((<span class="va">self</span>.num_data, <span class="dv">1</span>))<span class="co">#np.random.rand(self.num_data, 1)>0.5</span>
<span class="va">self</span>.update_w()
<span class="va">self</span>.sigma2 <span class="op">=</span> <span class="dv">1</span><span class="op">/</span>beta
<span class="va">self</span>.beta <span class="op">=</span> beta
<span class="va">self</span>.name <span class="op">=</span> <span class="st">'LML_'</span><span class="op">+</span>basis.function.<span class="va">__name__</span>
<span class="va">self</span>.objective_name <span class="op">=</span> <span class="st">'Weighted Sum of Square Training Error'</span>
<span class="va">self</span>.lambd <span class="op">=</span> lambd
<span class="kw">def</span> update_QR(<span class="va">self</span>):
<span class="co">"Perform the QR decomposition on the basis matrix."</span>
<span class="va">self</span>.Q, <span class="va">self</span>.R <span class="op">=</span> np.linalg.qr(<span class="va">self</span>.Phi<span class="op">*</span>np.sqrt(<span class="va">self</span>.s))
<span class="kw">def</span> fit(<span class="va">self</span>):
<span class="co">"""Minimize the objective function with respect to the parameters"""</span>
<span class="cf">for</span> i <span class="kw">in</span> <span class="bu">range</span>(<span class="dv">30</span>):
<span class="va">self</span>.update_w() <span class="co"># In the linear regression class</span>
<span class="va">self</span>.update_s()
<span class="kw">def</span> update_w(<span class="va">self</span>):
<span class="va">self</span>.update_QR()
<span class="va">self</span>.w_star <span class="op">=</span> sp.linalg.solve_triangular(<span class="va">self</span>.R, np.dot(<span class="va">self</span>.Q.T, <span class="va">self</span>.y<span class="op">*</span>np.sqrt(<span class="va">self</span>.s)))
<span class="va">self</span>.update_losses()
<span class="kw">def</span> predict(<span class="va">self</span>, X):
<span class="co">"""Return the result of the prediction function."""</span>
<span class="cf">return</span> np.dot(<span class="va">self</span>.basis.Phi(X), <span class="va">self</span>.w_star), <span class="va">None</span>
<span class="kw">def</span> update_s(<span class="va">self</span>):
<span class="co">"""Update the weights"""</span>
<span class="va">self</span>.s <span class="op">=</span> <span class="dv">1</span><span class="op">/</span>(<span class="va">self</span>.lambd <span class="op">+</span> <span class="va">self</span>.beta<span class="op">*</span><span class="va">self</span>.losses)
<span class="kw">def</span> update_losses(<span class="va">self</span>):
<span class="co">"""Compute the loss functions for each data point."""</span>
<span class="va">self</span>.update_f()
<span class="va">self</span>.losses <span class="op">=</span> ((<span class="va">self</span>.y<span class="op">-</span><span class="va">self</span>.f)<span class="op">**</span><span class="dv">2</span>)
<span class="va">self</span>.beta <span class="op">=</span> <span class="dv">1</span><span class="op">/</span>(<span class="va">self</span>.losses<span class="op">*</span><span class="va">self</span>.s).mean()
<span class="kw">def</span> objective(<span class="va">self</span>):
<span class="co">"""Compute the objective function."""</span>
<span class="va">self</span>.update_losses()
<span class="cf">return</span> (<span class="va">self</span>.losses<span class="op">*</span><span class="va">self</span>.s).<span class="bu">sum</span>()</code></pre></div>
<p>Set up a linear model (polynomial with two basis functions).</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">num_basis<span class="op">=</span><span class="dv">2</span>
data_limits<span class="op">=</span>[<span class="dv">1890</span>, <span class="dv">2020</span>]
basis <span class="op">=</span> mlai.basis(mlai.polynomial, num_basis, data_limits<span class="op">=</span>data_limits)
model <span class="op">=</span> LML(x, y, basis<span class="op">=</span>basis, lambd<span class="op">=</span><span class="dv">1</span>, beta<span class="op">=</span><span class="dv">1</span>)
model2 <span class="op">=</span> mlai.LM(x, y, basis<span class="op">=</span>basis)</code></pre></div>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">model.fit()
model2.fit()</code></pre></div>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">import</span> matplotlib.pyplot <span class="im">as</span> plt</code></pre></div>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">x_test <span class="op">=</span> np.linspace(data_limits[<span class="dv">0</span>], data_limits[<span class="dv">1</span>], <span class="dv">130</span>)[:, <span class="va">None</span>]
f_test, f_var <span class="op">=</span> model.predict(x_test)
f2_test, f2_var <span class="op">=</span> model2.predict(x_test)</code></pre></div>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">import</span> teaching_plots <span class="im">as</span> plot
<span class="im">from</span> matplotlib <span class="im">import</span> rc, rcParams
rcParams.update({<span class="st">'font.size'</span>: <span class="dv">22</span>})
rc(<span class="st">'text'</span>, usetex<span class="op">=</span><span class="va">True</span>)</code></pre></div>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">fig, ax <span class="op">=</span> plt.subplots(figsize<span class="op">=</span>plot.big_wide_figsize)
ax.plot(x_test, f2_test, linewidth<span class="op">=</span><span class="dv">3</span>, color<span class="op">=</span><span class="st">'r'</span>)
ax.plot(x, y, <span class="st">'g.'</span>, markersize<span class="op">=</span><span class="dv">10</span>)
ax.set_xlim(data_limits[<span class="dv">0</span>], data_limits[<span class="dv">1</span>])
ax.set_xlabel(<span class="st">'year'</span>)
ax.set_ylabel(<span class="st">'pace min/km'</span>)
_ <span class="op">=</span> ax.set_ylim(<span class="dv">2</span>, <span class="dv">6</span>)
mlai.write_figure(<span class="st">'../slides/diagrams/ml/olympic-loss-linear-regression000.svg'</span>, transparent<span class="op">=</span><span class="va">True</span>)
ax.plot(x_test, f_test, linewidth<span class="op">=</span><span class="dv">3</span>, color<span class="op">=</span><span class="st">'b'</span>)
ax.plot(x, y, <span class="st">'g.'</span>, markersize<span class="op">=</span><span class="dv">10</span>)
ax2 <span class="op">=</span> ax.twinx()
ax2.bar(x.flatten(), model.s.flatten(), width<span class="op">=</span><span class="dv">2</span>, color<span class="op">=</span><span class="st">'b'</span>)
ax2.set_ylim(<span class="dv">0</span>, <span class="dv">4</span>)
ax2.set_yticks([<span class="dv">0</span>, <span class="dv">1</span>, <span class="dv">2</span>])
ax2.set_ylabel(<span class="st">r'$\langle s_i \rangle$'</span>)
mlai.write_figure(<span class="st">'../slides/diagrams/ml/olympic-loss-linear-regression001.svg'</span>, transparent<span class="op">=</span><span class="va">True</span>)</code></pre></div>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">import</span> pods
pods.notebook.display_plots(<span class="st">'olympic-loss-linear-regression</span><span class="sc">{number:0>3}</span><span class="st">.svg'</span>,
directory<span class="op">=</span><span class="st">'../slides/diagrams/ml'</span>, number<span class="op">=</span>(<span class="dv">0</span>, <span class="dv">1</span>))</code></pre></div>
<object class="svgplot" data="../slides/diagrams/ml/olympic-loss-linear-regression001.svg">
</object>
<center>
<em>Linear regression for the standard quadratic loss in red and the probabilistically weighted loss in blue.</em>
</center>
<h3 id="parameter-uncertainty">Parameter Uncertainty</h3>
<p>Classical Bayesian inference is concerned with parameter uncertainty, which equates to uncertainty in the <em>prediction function</em>, <span class="math inline">\(\mappingFunction(\inputVector)\)</span>. The prediction function normally provides either an estimate of the value of <span class="math inline">\(\dataScalar\)</span> or a probability density over <span class="math inline">\(\dataScalar\)</span>.</p>
<p>Uncertainty in the prediction function can arise through uncertainty in our loss function, but also through uncertainty in parameters in the classical Bayesian sense. The full maximum entropy formalism would now be <span class="math display">\[
\sum_{i=1}^\numData \expectationDist{\beta \scaleScalar_i L(\dataScalar_i,
\mappingFunction(\inputVector_i))}{q(\scaleScalar, \mappingFunction)} + \int
q(\scaleScalar, \mappingFunction) \log \frac{q(\scaleScalar,
\mappingFunction)}{m(\scaleScalar)m(\mappingFunction)}\text{d}\scaleScalar
\text{d}\mappingFunction
\]</span></p>
<p><span class="math display">\[
q(\mappingFunction, \scaleScalar) \propto
\prod_{i=1}^\numData \exp\left(- \beta \scaleScalar_i L(\dataScalar_i,
\mappingFunction(\inputVector_i)) \right) m(\scaleScalar)m(\mappingFunction)
\]</span></p>
<h3 id="approximation">Approximation</h3>
<ul>
<li><p>Generally intractable, so assume: <span class="math display">\[
q(\mappingFunction, \scaleScalar) = q(\mappingFunction)q(\scaleScalar)
\]</span></p></li>
<li><p>Entropy maximization proceeds as before but with <span class="math display">\[
q(\scaleScalar) \propto
\prod_{i=1}^\numData \exp\left(- \beta \scaleScalar_i \expectationDist{L(\dataScalar_i,
\mappingFunction(\inputVector_i))}{q(\mappingFunction)} \right) m(\scaleScalar)
\]</span> and <span class="math display">\[
q(\mappingFunction) \propto
\prod_{i=1}^\numData \exp\left(- \beta \expectationDist{\scaleScalar_i}{q(\scaleScalar)} L(\dataScalar_i,
\mappingFunction(\inputVector_i)) \right) m(\mappingFunction)
\]</span></p></li>
<li><p>Can now proceed with iteration between <span class="math inline">\(q(\scaleScalar)\)</span>, <span class="math inline">\(q(\mappingFunction)\)</span></p></li>
</ul>
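<p>Stripped of the basis-function machinery used below, this alternation can be sketched directly: for a squared loss and a linear-in-parameters model, the <span class="math inline">\(q(\mappingFunction)\)</span> update is a weighted least squares solve and the <span class="math inline">\(q(\scaleScalar)\)</span> update is the expected-scale formula for the exponential measure. The following is a minimal sketch on synthetic data (the data set and the values of <span class="math inline">\(\beta\)</span> and <span class="math inline">\(\lambda\)</span> are illustrative, not the Olympics example):</p>

```python
import numpy as np

# Synthetic data: linear model with one outlier (illustrative values)
rng = np.random.default_rng(0)
X = np.hstack([np.ones((20, 1)), rng.uniform(-1, 1, size=(20, 1))])
y = X @ np.array([[1.0], [2.0]]) + 0.1 * rng.standard_normal((20, 1))
y[0] += 5.0  # an outlier the expected scales <s_i> should down-weight

beta, lambd = 1.0, 1.0
s = np.ones((20, 1))  # expected scales <s_i>, initialised to one
for _ in range(30):
    # q(f) update: least squares weighted by the expected scales
    W = np.diagflat(s)
    w = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    # q(s) update for an exponential measure m(s): <s_i> = 1/(lambda + beta L_i)
    losses = (y - X @ w) ** 2
    s = 1.0 / (lambd + beta * losses)
```

<p>On this data the outlier receives the smallest expected scale, so the fitted slope stays close to the generating value, while well-fit points keep scales near <span class="math inline">\(1/\lambda\)</span>.</p>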
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="kw">class</span> BLML(mlai.BLM):
<span class="co">"""Bayesian linear model with evolving loss</span>
<span class="co">    :param X: input values</span>
<span class="co">    :type X: numpy.ndarray</span>
<span class="co">    :param y: target values</span>
<span class="co">    :type y: numpy.ndarray</span>
<span class="co">    :param basis: basis function</span>
<span class="co">    :type basis: function</span>
<span class="co">    :param alpha: scale of the prior over the weights</span>
<span class="co">    :type alpha: float</span>
<span class="co">    :param beta: weight of the loss function</span>
<span class="co">    :type beta: float</span>
<span class="co">    :param lambd: rate of the exponential measure over the scales</span>
<span class="co">    :type lambd: float"""</span>
<span class="kw">def</span> <span class="fu">__init__</span>(<span class="va">self</span>, X, y, basis<span class="op">=</span><span class="va">None</span>, alpha<span class="op">=</span><span class="fl">1.0</span>, beta<span class="op">=</span><span class="fl">1.0</span>, lambd<span class="op">=</span><span class="fl">1.0</span>):
<span class="co">"Initialise"</span>
<span class="cf">if</span> basis <span class="kw">is</span> <span class="va">None</span>:
basis <span class="op">=</span> mlai.basis(mlai.polynomial, number<span class="op">=</span><span class="dv">2</span>)
mlai.BLM.<span class="fu">__init__</span>(<span class="va">self</span>, X, y, basis<span class="op">=</span>basis, alpha<span class="op">=</span>alpha, sigma2<span class="op">=</span><span class="dv">1</span><span class="op">/</span>beta)
<span class="va">self</span>.s <span class="op">=</span> np.ones((<span class="va">self</span>.num_data, <span class="dv">1</span>))<span class="co">#np.random.rand(self.num_data, 1)>0.5 </span>
<span class="va">self</span>.update_w()
<span class="va">self</span>.beta <span class="op">=</span> beta
<span class="va">self</span>.name <span class="op">=</span> <span class="st">'BLML_'</span><span class="op">+</span>basis.function.<span class="va">__name__</span>
<span class="va">self</span>.objective_name <span class="op">=</span> <span class="st">'Weighted Sum of Square Training Error'</span>
<span class="va">self</span>.lambd <span class="op">=</span> lambd
<span class="kw">def</span> update_QR(<span class="va">self</span>):
<span class="co">"Perform the QR decomposition on the basis matrix."</span>
<span class="va">self</span>.Q, <span class="va">self</span>.R <span class="op">=</span> np.linalg.qr(np.vstack([<span class="va">self</span>.Phi<span class="op">*</span>np.sqrt(<span class="va">self</span>.s), np.sqrt(<span class="va">self</span>.sigma2<span class="op">/</span><span class="va">self</span>.alpha)<span class="op">*</span>np.eye(<span class="va">self</span>.basis.number)]))
<span class="kw">def</span> fit(<span class="va">self</span>):
<span class="co">"""Minimize the objective function with respect to the parameters"""</span>
<span class="cf">for</span> i <span class="kw">in</span> <span class="bu">range</span>(<span class="dv">30</span>):
<span class="va">self</span>.update_w()
<span class="va">self</span>.update_s()
<span class="kw">def</span> update_w(<span class="va">self</span>):
<span class="va">self</span>.update_QR()
<span class="va">self</span>.QTy <span class="op">=</span> np.dot(<span class="va">self</span>.Q[:<span class="va">self</span>.y.shape[<span class="dv">0</span>], :].T, <span class="va">self</span>.y<span class="op">*</span>np.sqrt(<span class="va">self</span>.s))
<span class="va">self</span>.mu_w <span class="op">=</span> sp.linalg.solve_triangular(<span class="va">self</span>.R, <span class="va">self</span>.QTy)
<span class="va">self</span>.RTinv <span class="op">=</span> sp.linalg.solve_triangular(<span class="va">self</span>.R, np.eye(<span class="va">self</span>.R.shape[<span class="dv">0</span>]), trans<span class="op">=</span><span class="st">'T'</span>)
<span class="va">self</span>.C_w <span class="op">=</span> np.dot(<span class="va">self</span>.RTinv, <span class="va">self</span>.RTinv.T)
<span class="va">self</span>.update_losses()
<span class="kw">def</span> update_s(<span class="va">self</span>):
<span class="co">"""Update the weights"""</span>
<span class="va">self</span>.s <span class="op">=</span> <span class="dv">1</span><span class="op">/</span>(<span class="va">self</span>.lambd <span class="op">+</span> <span class="va">self</span>.beta<span class="op">*</span><span class="va">self</span>.losses)
<span class="kw">def</span> update_losses(<span class="va">self</span>):
<span class="co">"""Compute the loss functions for each data point."""</span>
<span class="va">self</span>.update_f()
<span class="va">self</span>.losses <span class="op">=</span> ((<span class="va">self</span>.y<span class="op">-</span><span class="va">self</span>.f_bar)<span class="op">**</span><span class="dv">2</span>) <span class="op">+</span> <span class="va">self</span>.f_cov[:, np.newaxis]
<span class="va">self</span>.beta <span class="op">=</span> <span class="dv">1</span><span class="op">/</span>(<span class="va">self</span>.losses<span class="op">*</span><span class="va">self</span>.s).mean()
<span class="va">self</span>.sigma2<span class="op">=</span><span class="dv">1</span><span class="op">/</span><span class="va">self</span>.beta
</code></pre></div>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">model <span class="op">=</span> BLML(x, y, basis<span class="op">=</span>basis, alpha<span class="op">=</span><span class="dv">1000</span>, lambd<span class="op">=</span><span class="dv">1</span>, beta<span class="op">=</span><span class="dv">1</span>)
model2 <span class="op">=</span> mlai.BLM(x, y, basis<span class="op">=</span>basis, alpha<span class="op">=</span><span class="dv">1000</span>, sigma2<span class="op">=</span><span class="dv">1</span>)</code></pre></div>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">model.fit()
model2.fit()</code></pre></div>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">x_test <span class="op">=</span> np.linspace(data_limits[<span class="dv">0</span>], data_limits[<span class="dv">1</span>], <span class="dv">130</span>)[:, <span class="va">None</span>]
f_test, f_var <span class="op">=</span> model.predict(x_test)
f2_test, f2_var <span class="op">=</span> model2.predict(x_test)</code></pre></div>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">import</span> gp_tutorial</code></pre></div>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">fig, ax <span class="op">=</span> plt.subplots(figsize<span class="op">=</span>plot.big_wide_figsize)
<span class="im">from</span> matplotlib <span class="im">import</span> rc, rcParams
rcParams.update({<span class="st">'font.size'</span>: <span class="dv">22</span>})
rc(<span class="st">'text'</span>, usetex<span class="op">=</span><span class="va">True</span>)
gp_tutorial.gpplot(x_test, f2_test, f2_test <span class="op">-</span> <span class="dv">2</span><span class="op">*</span>np.sqrt(f2_var), f2_test <span class="op">+</span> <span class="dv">2</span><span class="op">*</span>np.sqrt(f2_var), ax<span class="op">=</span>ax, edgecol<span class="op">=</span><span class="st">'r'</span>, fillcol<span class="op">=</span><span class="st">'#CC3300'</span>)
ax.plot(x, y, <span class="st">'g.'</span>, markersize<span class="op">=</span><span class="dv">10</span>)
ax.set_xlim(data_limits[<span class="dv">0</span>], data_limits[<span class="dv">1</span>])
ax.set_xlabel(<span class="st">'year'</span>)
ax.set_ylabel(<span class="st">'pace min/km'</span>)
_ <span class="op">=</span> ax.set_ylim(<span class="dv">2</span>, <span class="dv">6</span>)
mlai.write_figure(<span class="st">'../slides/diagrams/ml/olympic-loss-bayes-linear-regression000.svg'</span>, transparent<span class="op">=</span><span class="va">True</span>)
gp_tutorial.gpplot(x_test, f_test, f_test <span class="op">-</span> <span class="dv">2</span><span class="op">*</span>np.sqrt(f_var), f_test <span class="op">+</span> <span class="dv">2</span><span class="op">*</span>np.sqrt(f_var), ax<span class="op">=</span>ax, edgecol<span class="op">=</span><span class="st">'b'</span>, fillcol<span class="op">=</span><span class="st">'#0033CC'</span>)
<span class="co">#ax.plot(x_test, f_test, linewidth=3, color='b')</span>
ax.plot(x, y, <span class="st">'g.'</span>, markersize<span class="op">=</span><span class="dv">10</span>)
ax2 <span class="op">=</span> ax.twinx()
ax2.bar(x.flatten(), model.s.flatten(), width<span class="op">=</span><span class="dv">2</span>, color<span class="op">=</span><span class="st">'b'</span>)
ax2.set_ylim(<span class="dv">0</span>, <span class="fl">0.2</span>)
ax2.set_yticks([<span class="dv">0</span>, <span class="fl">0.1</span>, <span class="fl">0.2</span>])
ax2.set_ylabel(<span class="st">r'$\langle s_i \rangle$'</span>)
mlai.write_figure(<span class="st">'../slides/diagrams/ml/olympic-loss-bayes-linear-regression001.svg'</span>, transparent<span class="op">=</span><span class="va">True</span>)</code></pre></div>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">import</span> pods
pods.notebook.display_plots(<span class="st">'olympic-loss-bayes-linear-regression</span><span class="sc">{number:0>3}</span><span class="st">.svg'</span>,
directory<span class="op">=</span><span class="st">'../slides/diagrams/ml'</span>, number<span class="op">=</span>(<span class="dv">0</span>, <span class="dv">1</span>))</code></pre></div>
<object class="svgplot" data="../slides/diagrams/ml/olympic-loss-bayes-linear-regression001.svg">
</object>
<center>
<em>Probabilistic linear regression for the standard quadratic loss in red and the probabilistically weighted loss in blue.</em>
</center>
<h3 id="correlated-scales">Correlated Scales</h3>
<p>Going beyond independence between weights, we now consider <span class="math inline">\(m(\vScalar)\)</span> to be a Gaussian process, and scale by the <em>square</em> of <span class="math inline">\(\vScalar\)</span>, <span class="math inline">\(\scaleScalar=\vScalar^2\)</span> <span class="math display">\[
\vScalar \sim \mathcal{GP}\left(\meanScalar(\inputVector), \kernel(\inputVector, \inputVector^\prime)\right)
\]</span></p>
<p><span class="math display">\[
q(\vScalar) \propto
\prod_{i=1}^\numData \exp\left(- \beta \vScalar_i^2 L(\dataScalar_i,
\mappingFunction(\inputVector_i)) \right)
\exp\left(-\frac{1}{2}(\vVector-\meanVector)^\top \kernelMatrix^{-1}
(\vVector-\meanVector)\right)
\]</span> where <span class="math inline">\(\kernelMatrix\)</span> is the covariance of the process made up of elements taken from the covariance function, <span class="math inline">\(\kernelScalar(\inputVector, t, \dataVector; \inputVector^\prime, t^\prime, \dataVector^\prime)\)</span> so <span class="math inline">\(q(\vScalar)\)</span> itself is Gaussian with covariance <span class="math display">\[
\covarianceMatrix = \left(\beta\mathbf{L} + \kernelMatrix^{-1}\right)^{-1}
\]</span> and mean <span class="math display">\[
\meanTwoVector = \beta\covarianceMatrix\mathbf{L}\meanVector
\]</span> where <span class="math inline">\(\mathbf{L}\)</span> is a matrix containing the loss functions, <span class="math inline">\(L(\dataScalar_i, \mappingFunction(\inputVector_i))\)</span> along its diagonal elements with zeros elsewhere.</p>
<p>The update is given by <span class="math display">\[
\expectationDist{\vScalar_i^2}{q(\vScalar)} = \meanTwoScalar_i^2 +
\covarianceScalar_{i, i}.
\]</span> To compare with the earlier case: if the mean of the measure <span class="math inline">\(m(\vScalar)\)</span> were zero and the prior covariance spherical, <span class="math inline">\(\kernelMatrix=\lambda^{-1}\eye\)</span>, then this would equate to the update <span class="math display">\[
\expectationDist{\vScalar_i^2}{q(\vScalar)} = \frac{1}{\lambda + \beta L_i}
\]</span> which is the same as we had before for the exponential prior over <span class="math inline">\(\scaleScalar\)</span>.</p>
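<p>This reduction is easy to verify numerically: with a zero prior mean and <span class="math inline">\(\kernelMatrix = \lambda^{-1}\eye\)</span>, the covariance update gives <span class="math inline">\(\expectationDist{\vScalar_i^2}{q(\vScalar)} = 1/(\lambda + \beta L_i)\)</span>. A small check, with illustrative values for <span class="math inline">\(\beta\)</span>, <span class="math inline">\(\lambda\)</span> and the losses:</p>

```python
import numpy as np

beta, lambd = 2.0, 0.5
L = np.diag([0.1, 1.0, 4.0])            # losses L_i on the diagonal, zeros elsewhere
K = np.eye(3) / lambd                   # spherical prior covariance, zero prior mean
Sigma = np.linalg.inv(beta * L + np.linalg.inv(K))   # (beta L + K^{-1})^{-1}
mu_post = beta * Sigma @ L @ np.zeros((3, 1))        # zero, since the prior mean is zero
Ev2 = mu_post.flatten() ** 2 + np.diag(Sigma)        # <v_i^2> = mean_i^2 + Sigma_ii
print(np.allclose(Ev2, 1.0 / (lambd + beta * np.diag(L))))  # True
```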
<h3 id="conditioning-the-measure">Conditioning the Measure</h3>
<p>Now that we have defined a process over <span class="math inline">\(\vScalar\)</span>, we can define a region in which we are certain that we would like the weights to be high. For example, if we were looking to have a test point at location <span class="math inline">\(\inputVector_\ast\)</span>, we could update our measure to be a Gaussian process conditioned on an observation of <span class="math inline">\(\vScalar_\ast\)</span> set appropriately at <span class="math inline">\(\inputVector_\ast\)</span>. In this case we have,</p>
<p><span class="math display">\[
\kernelMatrix^\prime = \kernelMatrix - \frac{\kernelVector_\ast\kernelVector^\top_\ast}{\kernelScalar_{*,*}}
\]</span> and <span class="math display">\[
\meanVector^\prime = \meanVector + \frac{\kernelVector_\ast}{\kernelScalar_{*,*}}
(\vScalar_\ast-\meanScalar)
\]</span> where <span class="math inline">\(\kernelVector_\ast\)</span> is the vector computed through the covariance function between the training data <span class="math inline">\(\inputMatrix\)</span> and the proposed point that we are conditioning the scale upon, <span class="math inline">\(\inputVector_\ast\)</span>, and <span class="math inline">\(\kernelScalar_{*,*}\)</span> is the covariance function evaluated at <span class="math inline">\(\inputVector_\ast\)</span>. Now the updated mean and covariance can be used in the maximum entropy formulation as before. <span class="math display">\[
q(\vScalar) \propto \prod_{i=1}^\numData \exp\left(-
\beta \vScalar_i^2 L(\dataScalar_i, \mappingFunction(\inputVector_i)) \right)
\exp\left(-\frac{1}{2}(\vVector-\meanVector^\prime)^\top
\left.\kernelMatrix^\prime\right.^{-1} (\vVector-\meanVector^\prime)\right)
\]</span></p>
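<p>These conditioning formulas can be checked in isolation before they appear inside the class below: after conditioning on <span class="math inline">\(\vScalar_\ast\)</span> at <span class="math inline">\(\inputVector_\ast\)</span>, the updated mean at <span class="math inline">\(\inputVector_\ast\)</span> equals <span class="math inline">\(\vScalar_\ast\)</span> and the updated variance there collapses to zero. This sketch uses a hand-rolled exponentiated quadratic covariance standing in for mlai's <code>eq_cov</code>; the lengthscale and input values are illustrative:</p>

```python
import numpy as np

def eq_cov(X, X2, lengthscale=20.0, variance=1.0):
    """Exponentiated quadratic covariance between two sets of 1-d inputs."""
    d2 = (X[:, None, 0] - X2[None, :, 0]) ** 2
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

X = np.linspace(1890, 2016, 10)[:, None]       # training inputs
x_star = np.array([[2020.0]])                  # point where the scale is pinned
v_star, mu0 = 1.0, 0.0

Xa = np.vstack([X, x_star])                    # evaluate the process at X and x_star
K = eq_cov(Xa, Xa)
k_star = eq_cov(Xa, x_star)                    # k_*
k_ss = eq_cov(x_star, x_star)                  # k_{*,*} (a 1x1 array)
K_prime = K - k_star @ k_star.T / k_ss         # K' = K - k_* k_*^T / k_{*,*}
mu_prime = mu0 + (k_star / k_ss) * (v_star - mu0)

print(mu_prime[-1, 0], K_prime[-1, -1])        # mean pinned to v_star, variance 0 there
```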
<p>We will consider the same data set as above. We first create a Gaussian process model for the update.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="kw">class</span> GPL(mlai.GP):
<span class="kw">def</span> <span class="fu">__init__</span>(<span class="va">self</span>, X, losses, kernel, beta<span class="op">=</span><span class="fl">1.0</span>, mu<span class="op">=</span><span class="fl">0.0</span>, X_star<span class="op">=</span><span class="va">None</span>, v_star<span class="op">=</span><span class="va">None</span>):
<span class="co"># Bring together locations</span>
<span class="va">self</span>.kernel <span class="op">=</span> kernel
<span class="va">self</span>.K <span class="op">=</span> <span class="va">self</span>.kernel.K(X)
<span class="va">self</span>.mu <span class="op">=</span> np.ones((X.shape[<span class="dv">0</span>],<span class="dv">1</span>))<span class="op">*</span>mu
<span class="va">self</span>.beta <span class="op">=</span> beta
<span class="cf">if</span> X_star <span class="kw">is</span> <span class="kw">not</span> <span class="va">None</span>:
kstar <span class="op">=</span> kernel.K(X, X_star)
kstarstar <span class="op">=</span> kernel.K(X_star, X_star)
kstarstarInv <span class="op">=</span> np.linalg.inv(kstarstar)
kskssInv <span class="op">=</span> np.dot(kstar, kstarstarInv)
<span class="va">self</span>.K <span class="op">-=</span> np.dot(kskssInv,kstar.T)
<span class="cf">if</span> v_star <span class="kw">is</span> <span class="kw">not</span> <span class="va">None</span>:
<span class="va">self</span>.mu <span class="op">=</span> kskssInv<span class="op">*</span>(v_star<span class="op">-</span><span class="va">self</span>.mu)<span class="op">+</span><span class="va">self</span>.mu
Xaug <span class="op">=</span> np.vstack((X, X_star))
<span class="cf">else</span>:
                <span class="cf">raise</span> <span class="pp">ValueError</span>(<span class="st">"v_star must be provided when X_star is given"</span>)
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="kw">class</span> BLMLGP(BLML):
<span class="kw">def</span> <span class="fu">__init__</span>(<span class="va">self</span>, X, y, basis<span class="op">=</span><span class="va">None</span>, kernel<span class="op">=</span><span class="va">None</span>, beta<span class="op">=</span><span class="fl">1.0</span>, mu<span class="op">=</span><span class="fl">0.0</span>, alpha<span class="op">=</span><span class="fl">1.0</span>, X_star<span class="op">=</span><span class="va">None</span>, v_star<span class="op">=</span><span class="va">None</span>):
BLML.<span class="fu">__init__</span>(<span class="va">self</span>, X, y, basis<span class="op">=</span>basis, alpha<span class="op">=</span>alpha, beta<span class="op">=</span>beta, lambd<span class="op">=</span><span class="va">None</span>)
<span class="va">self</span>.gp_model<span class="op">=</span>GPL(<span class="va">self</span>.X, <span class="va">self</span>.losses, kernel<span class="op">=</span>kernel, beta<span class="op">=</span>beta, mu<span class="op">=</span>mu, X_star<span class="op">=</span>X_star, v_star<span class="op">=</span>v_star)
<span class="kw">def</span> update_s(<span class="va">self</span>):
<span class="co">"""Update the weights"""</span>
<span class="va">self</span>.gp_model.C <span class="op">=</span> sp.linalg.inv(sp.linalg.inv(<span class="va">self</span>.gp_model.K<span class="op">+</span>np.eye(<span class="va">self</span>.X.shape[<span class="dv">0</span>])<span class="op">*</span><span class="fl">1e-6</span>) <span class="op">+</span> <span class="va">self</span>.beta<span class="op">*</span>np.diag(<span class="va">self</span>.losses.flatten()))
<span class="va">self</span>.gp_model.diagC <span class="op">=</span> np.diag(<span class="va">self</span>.gp_model.C)[:, np.newaxis]
<span class="va">self</span>.gp_model.f <span class="op">=</span> <span class="va">self</span>.gp_model.beta<span class="op">*</span>np.dot(np.dot(<span class="va">self</span>.gp_model.C,np.diag(<span class="va">self</span>.losses.flatten())),<span class="va">self</span>.gp_model.mu) <span class="op">+</span><span class="va">self</span>.gp_model.mu
<span class="co">#f, v = self.gp_model.K self.gp_model.predict(self.X)</span>
<span class="va">self</span>.s <span class="op">=</span> <span class="va">self</span>.gp_model.f<span class="op">*</span><span class="va">self</span>.gp_model.f <span class="op">+</span> <span class="va">self</span>.gp_model.diagC <span class="co"># + 1.0/(self.losses*self.gp_model.beta)</span></code></pre></div>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">model <span class="op">=</span> BLMLGP(x, y,
basis<span class="op">=</span>basis,
kernel<span class="op">=</span>mlai.kernel(mlai.eq_cov, lengthscale<span class="op">=</span><span class="dv">20</span>, variance<span class="op">=</span><span class="fl">1.0</span>),
mu<span class="op">=</span><span class="fl">0.0</span>,
beta<span class="op">=</span><span class="fl">1.0</span>,
alpha<span class="op">=</span><span class="dv">1000</span>,
X_star<span class="op">=</span>np.asarray([[<span class="dv">2020</span>]]),
v_star<span class="op">=</span>np.asarray([[<span class="dv">1</span>]]))</code></pre></div>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">model.fit()</code></pre></div>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">f_test, f_var <span class="op">=</span> model.predict(x_test)</code></pre></div>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">fig, ax <span class="op">=</span> plt.subplots(figsize<span class="op">=</span>plot.big_wide_figsize)
ax.cla()
<span class="im">from</span> matplotlib <span class="im">import</span> rc, rcParams
rcParams.update({<span class="st">'font.size'</span>: <span class="dv">22</span>})
rc(<span class="st">'text'</span>, usetex<span class="op">=</span><span class="va">True</span>)
gp_tutorial.gpplot(x_test, f2_test, f2_test <span class="op">-</span> <span class="dv">2</span><span class="op">*</span>np.sqrt(f2_var), f2_test <span class="op">+</span> <span class="dv">2</span><span class="op">*</span>np.sqrt(f2_var), ax<span class="op">=</span>ax, edgecol<span class="op">=</span><span class="st">'r'</span>, fillcol<span class="op">=</span><span class="st">'#CC3300'</span>)
ax.plot(x, y, <span class="st">'g.'</span>, markersize<span class="op">=</span><span class="dv">10</span>)
ax.set_xlim(data_limits[<span class="dv">0</span>], data_limits[<span class="dv">1</span>])
ax.set_xlabel(<span class="st">'year'</span>)
ax.set_ylabel(<span class="st">'pace min/km'</span>)
_ <span class="op">=</span> ax.set_ylim(<span class="dv">2</span>, <span class="dv">6</span>)
mlai.write_figure(<span class="st">'../slides/diagrams/ml/olympic-gp-loss-bayes-linear-regression000.svg'</span>, transparent<span class="op">=</span><span class="va">True</span>)
gp_tutorial.gpplot(x_test, f_test, f_test <span class="op">-</span> <span class="dv">2</span><span class="op">*</span>np.sqrt(f_var), f_test <span class="op">+</span> <span class="dv">2</span><span class="op">*</span>np.sqrt(f_var), ax<span class="op">=</span>ax, edgecol<span class="op">=</span><span class="st">'b'</span>, fillcol<span class="op">=</span><span class="st">'#0033CC'</span>)
<span class="co">#ax.plot(x_test, f_test, linewidth=3, color='b')</span>
ax.plot(x, y, <span class="st">'g.'</span>, markersize<span class="op">=</span><span class="dv">10</span>)
ax2 <span class="op">=</span> ax.twinx()
ax2.bar(x.flatten(), model.s.flatten(), width<span class="op">=</span><span class="dv">2</span>, color<span class="op">=</span><span class="st">'b'</span>)
ax2.set_ylim(<span class="dv">0</span>, <span class="dv">3</span>)
ax2.set_yticks([<span class="dv">0</span>, <span class="fl">0.5</span>, <span class="dv">1</span>])
ax2.set_ylabel(<span class="st">r'$\langle s_i \rangle$'</span>)
mlai.write_figure(<span class="st">'../slides/diagrams/ml/olympic-gp-loss-bayes-linear-regression001.svg'</span>, transparent<span class="op">=</span><span class="va">True</span>)</code></pre></div>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">import</span> pods</code></pre></div>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">pods.notebook.display_plots(<span class="st">'olympic-gp-loss-bayes-linear-regression</span><span class="sc">{number:0>3}</span><span class="st">.svg'</span>,
directory<span class="op">=</span><span class="st">'../slides/diagrams/ml'</span>, number<span class="op">=</span>(<span class="dv">0</span>, <span class="dv">1</span>))</code></pre></div>
<object class="svgplot" data="../slides/diagrams/ml/olympic-gp-loss-bayes-linear-regression001.svg">
</object>
<center>
<em>Probabilistic linear regression for the standard quadratic loss in red and the probabilistically weighted loss with a Gaussian process measure in blue.</em>
</center>
<p>Finally, we attempt to show the joint uncertainty by first sampling from the loss function weights density, <span class="math inline">\(q(\scaleScalar)\)</span>.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">fig, ax <span class="op">=</span> plt.subplots(figsize<span class="op">=</span>plot.big_wide_figsize)
num_samps<span class="op">=</span><span class="dv">100</span>
samps<span class="op">=</span>np.random.multivariate_normal(model.gp_model.f.flatten(), model.gp_model.C, size<span class="op">=</span>num_samps).T<span class="op">**</span><span class="dv">2</span>
ax.plot(x, samps, <span class="st">'-x'</span>, markersize<span class="op">=</span><span class="dv">10</span>, linewidth<span class="op">=</span><span class="dv">2</span>)
ax.set_xlim(data_limits[<span class="dv">0</span>], data_limits[<span class="dv">1</span>])
ax.set_xlabel(<span class="st">'year'</span>)
_ <span class="op">=</span> ax.set_ylabel(<span class="st">'$s_i$'</span>)
mlai.write_figure(<span class="st">'../slides/diagrams/ml/olympic-gp-loss-samples.svg'</span>, transparent<span class="op">=</span><span class="va">True</span>)</code></pre></div>
<object class="svgplot" data="../slides/diagrams/ml/olympic-gp-loss-samples.svg">
</object>
<center>
<em>Samples of loss weightings from the density <span class="math inline">\(q(\scaleScalar)\)</span>. </em>
</center>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">fig, ax <span class="op">=</span> plt.subplots(figsize<span class="op">=</span>plot.big_wide_figsize)
ax.plot(x, y, <span class="st">'r.'</span>, markersize<span class="op">=</span><span class="dv">10</span>)
ax.set_xlim(data_limits[<span class="dv">0</span>], data_limits[<span class="dv">1</span>])
ax.set_ylim(<span class="dv">2</span>, <span class="dv">6</span>)
ax.set_xlabel(<span class="st">'year'</span>)
ax.set_ylabel(<span class="st">'pace min/km'</span>)
gp_tutorial.gpplot(x_test, f_test, f_test <span class="op">-</span> <span class="dv">2</span><span class="op">*</span>np.sqrt(f_var), f_test <span class="op">+</span> <span class="dv">2</span><span class="op">*</span>np.sqrt(f_var), ax<span class="op">=</span>ax, edgecol<span class="op">=</span><span class="st">'b'</span>, fillcol<span class="op">=</span><span class="st">'#0033CC'</span>)
mlai.write_figure(<span class="st">'../slides/diagrams/ml/olympic-gp-loss-bayes-linear-regression-and-samples000.svg'</span>, transparent<span class="op">=</span><span class="va">True</span>)
allsamps <span class="op">=</span> []
<span class="cf">for</span> i <span class="kw">in</span> <span class="bu">range</span>(samps.shape[<span class="dv">1</span>]):
model.s <span class="op">=</span> samps[:, i:i<span class="op">+</span><span class="dv">1</span>]
model.update_w()
f_bar, f_cov <span class="op">=</span>model.predict(x_test, full_cov<span class="op">=</span><span class="va">True</span>)
f_samp <span class="op">=</span> np.random.multivariate_normal(f_bar.flatten(), f_cov, size<span class="op">=</span><span class="dv">10</span>).T
ax.plot(x_test, f_samp, linewidth<span class="op">=</span><span class="fl">0.5</span>, color<span class="op">=</span><span class="st">'k'</span>)
allsamps<span class="op">+=</span><span class="bu">list</span>(f_samp[<span class="op">-</span><span class="dv">1</span>, :])
mlai.write_figure(<span class="st">'../slides/diagrams/ml/olympic-gp-loss-bayes-linear-regression-and-samples001.svg'</span>, transparent<span class="op">=</span><span class="va">True</span>)</code></pre></div>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">import</span> pods
pods.notebook.display_plots(<span class="st">'olympic-gp-loss-bayes-linear-regression-and-samples</span><span class="sc">{number:0>3}</span><span class="st">.svg'</span>,
directory<span class="op">=</span><span class="st">'../slides/diagrams/ml'</span>, number<span class="op">=</span>(<span class="dv">0</span>, <span class="dv">1</span>))</code></pre></div>
<object class="svgplot" data="../slides/diagrams/ml/olympic-gp-loss-bayes-linear-regression-and-samples001.svg">
</object>
<center>
<em>Samples from the joint density of loss weightings and regression weights show the full distribution of function predictions. </em>
</center>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">fig, ax <span class="op">=</span> plt.subplots(figsize<span class="op">=</span>plot.big_figsize)
ax.hist(np.asarray(allsamps), bins<span class="op">=</span><span class="dv">30</span>, density<span class="op">=</span><span class="va">True</span>)
ax.set_xlabel(<span class="st">'pace min/km'</span>)
mlai.write_figure(<span class="st">'../slides/diagrams/ml/olympic-gp-loss-histogram-2020.svg'</span>, transparent<span class="op">=</span><span class="va">True</span>)</code></pre></div>
<object class="svgplot" data="../slides/diagrams/ml/olympic-gp-loss-histogram-2020.svg">
</object>
<center>
<em>Histogram of samples from the year 2020, where the weight of the loss function was pinned to ensure that the model focussed its predictions on this region for test data. </em>
</center>
<h3 id="conclusions">Conclusions</h3>
<ul>
<li>Maximum Entropy Framework for uncertainty in
<ul>
<li>Loss functions</li>
<li>Prediction functions</li>
</ul></li>
</ul>
<h3 id="thanks">Thanks!</h3>
<ul>
<li>twitter: @lawrennd</li>
<li>blog: <a href="http://inverseprobability.com/blog.html">http://inverseprobability.com</a></li>
</ul>
Tue, 29 May 2018 00:00:00 +0000
http://inverseprobability.com/talks/notes/uncertainty-in-loss-functions.html
http://inverseprobability.com/talks/notes/uncertainty-in-loss-functions.htmlnotes