Neil Lawrence's Talks: talks given by Neil Lawrence
http://inverseprobability.com/talks/
Tue, 21 Aug 2018 12:32:55 +0000 (Jekyll v3.7.3)
Probabilistic Machine Learning
<div style="display:none">
$$\newcommand{\Amatrix}{\mathbf{A}}
\newcommand{\KL}[2]{\text{KL}\left( #1\,\|\,#2 \right)}
\newcommand{\Kaast}{\kernelMatrix_{\mathbf{ \ast}\mathbf{ \ast}}}
\newcommand{\Kastu}{\kernelMatrix_{\mathbf{ \ast} \inducingVector}}
\newcommand{\Kff}{\kernelMatrix_{\mappingFunctionVector \mappingFunctionVector}}
\newcommand{\Kfu}{\kernelMatrix_{\mappingFunctionVector \inducingVector}}
\newcommand{\Kuast}{\kernelMatrix_{\inducingVector \bf\ast}}
\newcommand{\Kuf}{\kernelMatrix_{\inducingVector \mappingFunctionVector}}
\newcommand{\Kuu}{\kernelMatrix_{\inducingVector \inducingVector}}
\newcommand{\Kuui}{\Kuu^{-1}}
\newcommand{\Qaast}{\mathbf{Q}_{\bf \ast \ast}}
\newcommand{\Qastf}{\mathbf{Q}_{\ast \mappingFunction}}
\newcommand{\Qfast}{\mathbf{Q}_{\mappingFunctionVector \bf \ast}}
\newcommand{\Qff}{\mathbf{Q}_{\mappingFunctionVector \mappingFunctionVector}}
\newcommand{\aMatrix}{\mathbf{A}}
\newcommand{\aScalar}{a}
\newcommand{\aVector}{\mathbf{a}}
\newcommand{\acceleration}{a}
\newcommand{\bMatrix}{\mathbf{B}}
\newcommand{\bScalar}{b}
\newcommand{\bVector}{\mathbf{b}}
\newcommand{\basisFunc}{\phi}
\newcommand{\basisFuncVector}{\boldsymbol{ \basisFunc}}
\newcommand{\basisFunction}{\phi}
\newcommand{\basisLocation}{\mu}
\newcommand{\basisMatrix}{\boldsymbol{ \Phi}}
\newcommand{\basisScalar}{\basisFunction}
\newcommand{\basisVector}{\boldsymbol{ \basisFunction}}
\newcommand{\activationFunction}{\phi}
\newcommand{\activationMatrix}{\boldsymbol{ \Phi}}
\newcommand{\activationScalar}{\basisFunction}
\newcommand{\activationVector}{\boldsymbol{ \basisFunction}}
\newcommand{\bigO}{\mathcal{O}}
\newcommand{\binomProb}{\pi}
\newcommand{\cMatrix}{\mathbf{C}}
\newcommand{\cbasisMatrix}{\hat{\boldsymbol{ \Phi}}}
\newcommand{\cdataMatrix}{\hat{\dataMatrix}}
\newcommand{\cdataScalar}{\hat{\dataScalar}}
\newcommand{\cdataVector}{\hat{\dataVector}}
\newcommand{\centeredKernelMatrix}{\mathbf{ \MakeUppercase{\centeredKernelScalar}}}
\newcommand{\centeredKernelScalar}{b}
\newcommand{\centeredKernelVector}{\centeredKernelScalar}
\newcommand{\centeringMatrix}{\mathbf{H}}
\newcommand{\chiSquaredDist}[2]{\chi_{#1}^{2}\left(#2\right)}
\newcommand{\chiSquaredSamp}[1]{\chi_{#1}^{2}}
\newcommand{\conditionalCovariance}{\boldsymbol{ \Sigma}}
\newcommand{\coregionalizationMatrix}{\mathbf{B}}
\newcommand{\coregionalizationScalar}{b}
\newcommand{\coregionalizationVector}{\mathbf{ \coregionalizationScalar}}
\newcommand{\covDist}[2]{\text{cov}_{#2}\left(#1\right)}
\newcommand{\covSamp}[1]{\text{cov}\left(#1\right)}
\newcommand{\covarianceScalar}{c}
\newcommand{\covarianceVector}{\mathbf{ \covarianceScalar}}
\newcommand{\covarianceMatrix}{\mathbf{C}}
\newcommand{\covarianceMatrixTwo}{\boldsymbol{ \Sigma}}
\newcommand{\croupierScalar}{s}
\newcommand{\croupierVector}{\mathbf{ \croupierScalar}}
\newcommand{\croupierMatrix}{\mathbf{ \MakeUppercase{\croupierScalar}}}
\newcommand{\dataDim}{p}
\newcommand{\dataIndex}{i}
\newcommand{\dataIndexTwo}{j}
\newcommand{\dataMatrix}{\mathbf{Y}}
\newcommand{\dataScalar}{y}
\newcommand{\dataSet}{\mathcal{D}}
\newcommand{\dataStd}{\sigma}
\newcommand{\dataVector}{\mathbf{ \dataScalar}}
\newcommand{\decayRate}{d}
\newcommand{\degreeMatrix}{\mathbf{ \MakeUppercase{\degreeScalar}}}
\newcommand{\degreeScalar}{d}
\newcommand{\degreeVector}{\mathbf{ \degreeScalar}}
% Already defined by latex
%\newcommand{\det}[1]{\left|#1\right|}
\newcommand{\diag}[1]{\text{diag}\left(#1\right)}
\newcommand{\diagonalMatrix}{\mathbf{D}}
\newcommand{\diff}[2]{\frac{\text{d}#1}{\text{d}#2}}
\newcommand{\diffTwo}[2]{\frac{\text{d}^2#1}{\text{d}#2^2}}
\newcommand{\displacement}{x}
\newcommand{\displacementVector}{\textbf{\displacement}}
\newcommand{\distanceMatrix}{\mathbf{ \MakeUppercase{\distanceScalar}}}
\newcommand{\distanceScalar}{d}
\newcommand{\distanceVector}{\mathbf{ \distanceScalar}}
\newcommand{\eigenvaltwo}{\ell}
\newcommand{\eigenvaltwoMatrix}{\mathbf{L}}
\newcommand{\eigenvaltwoVector}{\mathbf{l}}
\newcommand{\eigenvalue}{\lambda}
\newcommand{\eigenvalueMatrix}{\boldsymbol{ \Lambda}}
\newcommand{\eigenvalueVector}{\boldsymbol{ \lambda}}
\newcommand{\eigenvector}{\mathbf{ \eigenvectorScalar}}
\newcommand{\eigenvectorMatrix}{\mathbf{U}}
\newcommand{\eigenvectorScalar}{u}
\newcommand{\eigenvectwo}{\mathbf{v}}
\newcommand{\eigenvectwoMatrix}{\mathbf{V}}
\newcommand{\eigenvectwoScalar}{v}
\newcommand{\entropy}[1]{\mathcal{H}\left(#1\right)}
\newcommand{\errorFunction}{E}
\newcommand{\expDist}[2]{\left<#1\right>_{#2}}
\newcommand{\expSamp}[1]{\left<#1\right>}
\newcommand{\expectation}[1]{\left\langle #1 \right\rangle }
\newcommand{\expectationDist}[2]{\left\langle #1 \right\rangle _{#2}}
\newcommand{\expectedDistanceMatrix}{\mathcal{D}}
\newcommand{\eye}{\mathbf{I}}
\newcommand{\fantasyDim}{r}
\newcommand{\fantasyMatrix}{\mathbf{ \MakeUppercase{\fantasyScalar}}}
\newcommand{\fantasyScalar}{z}
\newcommand{\fantasyVector}{\mathbf{ \fantasyScalar}}
\newcommand{\featureStd}{\varsigma}
\newcommand{\gammaCdf}[3]{\mathcal{GAMMA CDF}\left(#1|#2,#3\right)}
\newcommand{\gammaDist}[3]{\mathcal{G}\left(#1|#2,#3\right)}
\newcommand{\gammaSamp}[2]{\mathcal{G}\left(#1,#2\right)}
\newcommand{\gaussianDist}[3]{\mathcal{N}\left(#1|#2,#3\right)}
\newcommand{\gaussianSamp}[2]{\mathcal{N}\left(#1,#2\right)}
\newcommand{\given}{|}
\newcommand{\half}{\frac{1}{2}}
\newcommand{\heaviside}{H}
\newcommand{\hiddenMatrix}{\mathbf{ \MakeUppercase{\hiddenScalar}}}
\newcommand{\hiddenScalar}{h}
\newcommand{\hiddenVector}{\mathbf{ \hiddenScalar}}
\newcommand{\identityMatrix}{\eye}
\newcommand{\inducingInputScalar}{z}
\newcommand{\inducingInputVector}{\mathbf{ \inducingInputScalar}}
\newcommand{\inducingInputMatrix}{\mathbf{Z}}
\newcommand{\inducingScalar}{u}
\newcommand{\inducingVector}{\mathbf{ \inducingScalar}}
\newcommand{\inducingMatrix}{\mathbf{U}}
\newcommand{\inlineDiff}[2]{\text{d}#1/\text{d}#2}
\newcommand{\inputDim}{q}
\newcommand{\inputMatrix}{\mathbf{X}}
\newcommand{\inputScalar}{x}
\newcommand{\inputSpace}{\mathcal{X}}
\newcommand{\inputVals}{\inputVector}
\newcommand{\inputVector}{\mathbf{ \inputScalar}}
\newcommand{\iterNum}{k}
\newcommand{\kernel}{\kernelScalar}
\newcommand{\kernelMatrix}{\mathbf{K}}
\newcommand{\kernelScalar}{k}
\newcommand{\kernelVector}{\mathbf{ \kernelScalar}}
\newcommand{\kff}{\kernelScalar_{\mappingFunction \mappingFunction}}
\newcommand{\kfu}{\kernelVector_{\mappingFunction \inducingScalar}}
\newcommand{\kuf}{\kernelVector_{\inducingScalar \mappingFunction}}
\newcommand{\kuu}{\kernelVector_{\inducingScalar \inducingScalar}}
\newcommand{\lagrangeMultiplier}{\lambda}
\newcommand{\lagrangeMultiplierMatrix}{\boldsymbol{ \Lambda}}
\newcommand{\lagrangian}{L}
\newcommand{\laplacianFactor}{\mathbf{ \MakeUppercase{\laplacianFactorScalar}}}
\newcommand{\laplacianFactorScalar}{m}
\newcommand{\laplacianFactorVector}{\mathbf{ \laplacianFactorScalar}}
\newcommand{\laplacianMatrix}{\mathbf{L}}
\newcommand{\laplacianScalar}{\ell}
\newcommand{\laplacianVector}{\mathbf{ \ell}}
\newcommand{\latentDim}{q}
\newcommand{\latentDistanceMatrix}{\boldsymbol{ \Delta}}
\newcommand{\latentDistanceScalar}{\delta}
\newcommand{\latentDistanceVector}{\boldsymbol{ \delta}}
\newcommand{\latentForce}{f}
\newcommand{\latentFunction}{u}
\newcommand{\latentFunctionVector}{\mathbf{ \latentFunction}}
\newcommand{\latentFunctionMatrix}{\mathbf{ \MakeUppercase{\latentFunction}}}
\newcommand{\latentIndex}{j}
\newcommand{\latentScalar}{z}
\newcommand{\latentVector}{\mathbf{ \latentScalar}}
\newcommand{\latentMatrix}{\mathbf{Z}}
\newcommand{\learnRate}{\eta}
\newcommand{\lengthScale}{\ell}
\newcommand{\rbfWidth}{\ell}
\newcommand{\likelihoodBound}{\mathcal{L}}
\newcommand{\likelihoodFunction}{L}
\newcommand{\locationScalar}{\mu}
\newcommand{\locationVector}{\boldsymbol{ \locationScalar}}
\newcommand{\locationMatrix}{\mathbf{M}}
\newcommand{\variance}[1]{\text{var}\left( #1 \right)}
\newcommand{\mappingFunction}{f}
\newcommand{\mappingFunctionMatrix}{\mathbf{F}}
\newcommand{\mappingFunctionTwo}{g}
\newcommand{\mappingFunctionTwoMatrix}{\mathbf{G}}
\newcommand{\mappingFunctionTwoVector}{\mathbf{ \mappingFunctionTwo}}
\newcommand{\mappingFunctionVector}{\mathbf{ \mappingFunction}}
\newcommand{\scaleScalar}{s}
\newcommand{\mappingScalar}{w}
\newcommand{\mappingVector}{\mathbf{ \mappingScalar}}
\newcommand{\mappingMatrix}{\mathbf{W}}
\newcommand{\mappingScalarTwo}{v}
\newcommand{\mappingVectorTwo}{\mathbf{ \mappingScalarTwo}}
\newcommand{\mappingMatrixTwo}{\mathbf{V}}
\newcommand{\maxIters}{K}
\newcommand{\meanMatrix}{\mathbf{M}}
\newcommand{\meanScalar}{\mu}
\newcommand{\meanTwoMatrix}{\mathbf{M}}
\newcommand{\meanTwoScalar}{m}
\newcommand{\meanTwoVector}{\mathbf{ \meanTwoScalar}}
\newcommand{\meanVector}{\boldsymbol{ \meanScalar}}
\newcommand{\mrnaConcentration}{m}
\newcommand{\naturalFrequency}{\omega}
\newcommand{\neighborhood}[1]{\mathcal{N}\left( #1 \right)}
\newcommand{\neilurl}{http://inverseprobability.com/}
\newcommand{\noiseMatrix}{\boldsymbol{ E}}
\newcommand{\noiseScalar}{\epsilon}
\newcommand{\noiseVector}{\boldsymbol{ \epsilon}}
\newcommand{\norm}[1]{\left\Vert #1 \right\Vert}
\newcommand{\normalizedLaplacianMatrix}{\hat{\mathbf{L}}}
\newcommand{\normalizedLaplacianScalar}{\hat{\ell}}
\newcommand{\normalizedLaplacianVector}{\hat{\mathbf{ \ell}}}
\newcommand{\numActive}{m}
\newcommand{\numBasisFunc}{m}
\newcommand{\numComponents}{m}
\newcommand{\numComps}{K}
\newcommand{\numData}{n}
\newcommand{\numFeatures}{K}
\newcommand{\numHidden}{h}
\newcommand{\numInducing}{m}
\newcommand{\numLayers}{\ell}
\newcommand{\numNeighbors}{K}
\newcommand{\numSequences}{s}
\newcommand{\numSuccess}{s}
\newcommand{\numTasks}{m}
\newcommand{\numTime}{T}
\newcommand{\numTrials}{S}
\newcommand{\outputIndex}{j}
\newcommand{\paramVector}{\boldsymbol{ \theta}}
\newcommand{\parameterMatrix}{\boldsymbol{ \Theta}}
\newcommand{\parameterScalar}{\theta}
\newcommand{\parameterVector}{\boldsymbol{ \parameterScalar}}
\newcommand{\partDiff}[2]{\frac{\partial#1}{\partial#2}}
\newcommand{\precisionScalar}{j}
\newcommand{\precisionVector}{\mathbf{ \precisionScalar}}
\newcommand{\precisionMatrix}{\mathbf{J}}
\newcommand{\pseudotargetScalar}{\widetilde{y}}
\newcommand{\pseudotargetVector}{\mathbf{ \pseudotargetScalar}}
\newcommand{\pseudotargetMatrix}{\mathbf{ \widetilde{Y}}}
\newcommand{\rank}[1]{\text{rank}\left(#1\right)}
\newcommand{\rayleighDist}[2]{\mathcal{R}\left(#1|#2\right)}
\newcommand{\rayleighSamp}[1]{\mathcal{R}\left(#1\right)}
\newcommand{\responsibility}{r}
\newcommand{\rotationScalar}{r}
\newcommand{\rotationVector}{\mathbf{ \rotationScalar}}
\newcommand{\rotationMatrix}{\mathbf{R}}
\newcommand{\sampleCovScalar}{s}
\newcommand{\sampleCovVector}{\mathbf{ \sampleCovScalar}}
\newcommand{\sampleCovMatrix}{\mathbf{s}}
\newcommand{\scalarProduct}[2]{\left\langle{#1},{#2}\right\rangle}
\newcommand{\sign}[1]{\text{sign}\left(#1\right)}
\newcommand{\sigmoid}[1]{\sigma\left(#1\right)}
\newcommand{\singularvalue}{\ell}
\newcommand{\singularvalueMatrix}{\mathbf{L}}
\newcommand{\singularvalueVector}{\mathbf{l}}
\newcommand{\sorth}{\mathbf{u}}
\newcommand{\spar}{\lambda}
\newcommand{\trace}[1]{\text{tr}\left(#1\right)}
\newcommand{\BasalRate}{B}
\newcommand{\DampingCoefficient}{C}
\newcommand{\DecayRate}{D}
\newcommand{\Displacement}{X}
\newcommand{\LatentForce}{F}
\newcommand{\Mass}{M}
\newcommand{\Sensitivity}{S}
\newcommand{\basalRate}{b}
\newcommand{\dampingCoefficient}{c}
\newcommand{\mass}{m}
\newcommand{\sensitivity}{s}
\newcommand{\springScalar}{\kappa}
\newcommand{\springVector}{\boldsymbol{ \kappa}}
\newcommand{\springMatrix}{\boldsymbol{ \mathcal{K}}}
\newcommand{\tfConcentration}{p}
\newcommand{\tfDecayRate}{\delta}
\newcommand{\tfMrnaConcentration}{f}
\newcommand{\tfVector}{\mathbf{ \tfConcentration}}
\newcommand{\velocity}{v}
\newcommand{\sufficientStatsScalar}{g}
\newcommand{\sufficientStatsVector}{\mathbf{ \sufficientStatsScalar}}
\newcommand{\sufficientStatsMatrix}{\mathbf{G}}
\newcommand{\switchScalar}{s}
\newcommand{\switchVector}{\mathbf{ \switchScalar}}
\newcommand{\switchMatrix}{\mathbf{S}}
\newcommand{\tr}[1]{\text{tr}\left(#1\right)}
\newcommand{\loneNorm}[1]{\left\Vert #1 \right\Vert_1}
\newcommand{\ltwoNorm}[1]{\left\Vert #1 \right\Vert_2}
\newcommand{\onenorm}[1]{\left\vert#1\right\vert_1}
\newcommand{\twonorm}[1]{\left\Vert #1 \right\Vert}
\newcommand{\vScalar}{v}
\newcommand{\vVector}{\mathbf{v}}
\newcommand{\vMatrix}{\mathbf{V}}
\newcommand{\varianceDist}[2]{\text{var}_{#2}\left( #1 \right)}
% Already defined by latex
%\newcommand{\vec}{#1:}
\newcommand{\vecb}[1]{\left(#1\right):}
\newcommand{\weightScalar}{w}
\newcommand{\weightVector}{\mathbf{ \weightScalar}}
\newcommand{\weightMatrix}{\mathbf{W}}
\newcommand{\weightedAdjacencyMatrix}{\mathbf{A}}
\newcommand{\weightedAdjacencyScalar}{a}
\newcommand{\weightedAdjacencyVector}{\mathbf{ \weightedAdjacencyScalar}}
\newcommand{\onesVector}{\mathbf{1}}
\newcommand{\zerosVector}{\mathbf{0}}
$$
</div>
<h2 id="what-is-machine-learning">What is Machine Learning?</h2>
<p>What is machine learning? At its most basic level, machine learning is a combination of</p>
<p><span class="math display">\[ \text{data} + \text{model} \xrightarrow{\text{compute}} \text{prediction}\]</span></p>
<p>where <em>data</em> is our observations. They can be actively or passively acquired (meta-data). The <em>model</em> contains our assumptions, based on previous experience. That experience can be other data, it can come from transfer learning, or it can merely be our beliefs about the regularities of the universe. In humans our models include our inductive biases. The <em>prediction</em> is an action to be taken or a categorization or a quality score. The reason that machine learning has become a mainstay of artificial intelligence is the importance of predictions in artificial intelligence. The data and the model are combined through computation.</p>
<p>In practice we normally perform machine learning using two functions. To combine data with a model we typically make use of:</p>
<p><strong>a prediction function</strong>: a function used to make the predictions. It includes our beliefs about the regularities of the universe, our assumptions about how the world works, e.g. smoothness, spatial similarities, temporal similarities.</p>
<p><strong>an objective function</strong>: a function which defines the cost of misprediction. Typically it includes knowledge about the world's generating processes (probabilistic objectives) or the costs we pay for mispredictions (empirical risk minimization).</p>
<p>The combination of data and model through the prediction function and the objective function leads to a <em>learning algorithm</em>. The class of prediction functions and objective functions we can make use of is restricted by the algorithms they lead to. If the prediction function or the objective function is too complex, then it can be difficult to find an appropriate learning algorithm. Much of the academic field of machine learning is the quest for new learning algorithms that allow us to bring different types of models and data together.</p>
<p>A useful reference for state of the art in machine learning is the UK Royal Society Report, <a href="https://royalsociety.org/~/media/policy/projects/machine-learning/publications/machine-learning-report.pdf">Machine Learning: Power and Promise of Computers that Learn by Example</a>.</p>
<p>You can also check my blog post on <a href="http://inverseprobability.com/2017/07/17/what-is-machine-learning">"What is Machine Learning?"</a></p>
<h2 id="probabilities">Probabilities</h2>
<p>We are now going to do some simple review of probabilities and use this review to explore some aspects of our data.</p>
<p>A probability distribution expresses uncertainty about the outcome of an event. We often encode this uncertainty in a variable. So if we are considering the outcome of an event, <span class="math inline">\(Y\)</span>, to be a coin toss, then we might consider <span class="math inline">\(Y=1\)</span> to be heads and <span class="math inline">\(Y=0\)</span> to be tails. We represent the probability of a given outcome with the notation: <span class="math display">\[
P(Y=1) = 0.5
\]</span> The first rule of probability is that the probability must normalize. The sum of the probability of all events must equal 1. So if the probability of heads (<span class="math inline">\(Y=1\)</span>) is 0.5, then the probability of tails (the only other possible outcome) is given by <span class="math display">\[
P(Y=0) = 1-P(Y=1) = 0.5
\]</span></p>
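<p>We can check both the normalization rule and the frequency interpretation numerically. The following short simulation (a sketch added for illustration, not part of the original lecture code) tosses a fair coin many times and confirms that the empirical frequencies of heads and tails are each close to 0.5 and sum exactly to one.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100000
tosses = rng.integers(0, 2, size=N)  # Y=1 is heads, Y=0 is tails

p_heads = (tosses == 1).sum() / N    # empirical estimate of P(Y=1)
p_tails = (tosses == 0).sum() / N    # empirical estimate of P(Y=0)

print("P(Y=1) approx:", p_heads)
print("P(Y=0) approx:", p_tails)
# the two counts partition the N trials, so the estimates sum to 1 exactly
print("Sum:", p_heads + p_tails)
```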
<p>Probabilities are often defined as the limit of the ratio of the number of positive outcomes (e.g. <em>heads</em>) to the number of trials. If the number of positive outcomes for event <span class="math inline">\(y\)</span> is denoted by <span class="math inline">\(n_y\)</span> and the number of trials is denoted by <span class="math inline">\(N\)</span> then this gives the ratio <span class="math display">\[
P(Y=y) = \lim_{N\rightarrow
\infty}\frac{n_y}{N}.
\]</span> In practice we never get to observe an event infinite times, so rather than considering this we often use the following estimate <span class="math display">\[
P(Y=y) \approx \frac{n_y}{N}.
\]</span> Let's use this rule to compute the approximate probability that a film from the movie body count website has over 40 deaths.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">deaths <span class="op">=</span> (film_deaths.Body_Count<span class="op">></span><span class="dv">40</span>).<span class="bu">sum</span>() <span class="co"># number of positive outcomes (in sum True counts as 1, False counts as 0)</span>
total_films <span class="op">=</span> film_deaths.Body_Count.count()
prob_death <span class="op">=</span> <span class="bu">float</span>(deaths)<span class="op">/</span><span class="bu">float</span>(total_films)
<span class="bu">print</span>(<span class="st">"Probability of deaths being greater than 40 is:"</span>, prob_death)</code></pre></div>
<h3 id="question-4">Question 4</h3>
<p>We now have an estimate of the probability a film has greater than 40 deaths. The estimate seems quite high. What could be wrong with the estimate? Do you think any film you go to in the cinema has this probability of having greater than 40 deaths?</p>
<p>Why did we have to use <code>float</code> around our counts of deaths and total films? What would the answer have been if we hadn't used the <code>float</code> command? If we were using Python 3 would we have this problem?</p>
<p><em>20 marks</em></p>
<h3 id="write-your-answer-to-question-4-here">Write your answer to Question 4 here</h3>
<h2 id="conditioning">Conditioning</h2>
<p>When predicting whether a coin turns up heads or tails, we might think that this event is <em>independent</em> of the year or time of day. If we include an observation such as time, then in a probability this is known as <em>conditioning</em>. We use this notation, <span class="math inline">\(P(Y=y|T=t)\)</span>, to condition the outcome on a second variable (in this case time). Or, often, as a shorthand we use <span class="math inline">\(P(y|t)\)</span> to represent this distribution (the <span class="math inline">\(Y=\)</span> and <span class="math inline">\(T=\)</span> being implicit). Because we don't believe a coin toss depends on time, we might write that <span class="math display">\[
P(y|t) =
p(y).
\]</span> However, we might believe that the number of deaths is dependent on the year. For this we can try estimating <span class="math inline">\(P(Y>40 | T=2000)\)</span> and compare the result, for example to <span class="math inline">\(P(Y>40|2002)\)</span> using our empirical estimate of the probability.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="cf">for</span> year <span class="kw">in</span> [<span class="dv">2000</span>, <span class="dv">2002</span>]:
deaths <span class="op">=</span> (film_deaths.Body_Count[film_deaths.Year<span class="op">==</span>year]<span class="op">></span><span class="dv">40</span>).<span class="bu">sum</span>()
total_films <span class="op">=</span> (film_deaths.Year<span class="op">==</span>year).<span class="bu">sum</span>()
prob_death <span class="op">=</span> <span class="bu">float</span>(deaths)<span class="op">/</span><span class="bu">float</span>(total_films)
    <span class="bu">print</span>(<span class="st">"Probability of deaths being greater than 40 in year"</span>, year, <span class="st">"is:"</span>, prob_death)</code></pre></div>
<h3 id="question-5">Question 5</h3>
<p>Compute the probability for the number of deaths being over 40 for each year we have in our <code>film_deaths</code> data frame. Store the result in a <code>numpy</code> array and plot the probabilities against the years using the <code>plot</code> command from <code>matplotlib</code>. Do you think the estimate we have created of <span class="math inline">\(P(y|t)\)</span> is a good estimate? Write your code and your written answers in the box below.</p>
<p><em>20 marks</em></p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="co"># Write code for your answer to Question 5 in this box</span>
<span class="co"># provide the answers so that the code runs correctly otherwise you will lose marks!</span>
</code></pre></div>
<h4 id="question-5-answer-text">Question 5 Answer Text</h4>
<p>Write your answer to the question in this box.</p>
<h4 id="notes-for-question-5">Notes for Question 5</h4>
<p>Make sure the plot is included in <em>this</em> notebook file (the <code>IPython</code> magic command <code>%matplotlib inline</code> we ran above will do that for you, it only needs to be run once per file).</p>
<table>
<thead>
<tr class="header">
<th>Terminology</th>
<th>Mathematical notation</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>joint</td>
<td><span class="math inline">\(P(X=x, Y=y)\)</span></td>
<td>prob. that X=x <em>and</em> Y=y</td>
</tr>
<tr class="even">
<td>marginal</td>
<td><span class="math inline">\(P(X=x)\)</span></td>
<td>prob. that X=x <em>regardless of</em> Y</td>
</tr>
<tr class="odd">
<td>conditional</td>
<td><span class="math inline">\(P(X=x\vert Y=y)\)</span></td>
<td>prob. that X=x <em>given that</em> Y=y</td>
</tr>
</tbody>
</table>
<center>
The different basic probability distributions.
</center>
<h3 id="a-pictorial-definition-of-probability">A Pictorial Definition of Probability</h3>
<object class="svgplot" data="../slides/diagrams/mlai/prob_diagram.svg">
</object>
<p><span align="right">Inspired by lectures from Christopher Bishop</span></p>
<h3 id="definition-of-probability-distributions.">Definition of probability distributions.</h3>
<table>
<thead>
<tr class="header">
<th>Terminology</th>
<th>Definition</th>
<th>Probability Notation</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Joint Probability</td>
<td><span class="math inline">\(\lim_{N\rightarrow\infty}\frac{n_{X=3,Y=4}}{N}\)</span></td>
<td><span class="math inline">\(P\left(X=3,Y=4\right)\)</span></td>
</tr>
<tr class="even">
<td>Marginal Probability</td>
<td><span class="math inline">\(\lim_{N\rightarrow\infty}\frac{n_{X=5}}{N}\)</span></td>
<td><span class="math inline">\(P\left(X=5\right)\)</span></td>
</tr>
<tr class="odd">
<td>Conditional Probability</td>
<td><span class="math inline">\(\lim_{N\rightarrow\infty}\frac{n_{X=3,Y=4}}{n_{Y=4}}\)</span></td>
<td><span class="math inline">\(P\left(X=3\vert Y=4\right)\)</span></td>
</tr>
</tbody>
</table>
<h3 id="notational-details">Notational Details</h3>
<p>Typically we should write out <span class="math inline">\(P\left(X=x,Y=y\right)\)</span>, but in practice we often shorten this to <span class="math inline">\(P\left(x,y\right)\)</span>. This looks very much like we might write a multivariate function, <em>e.g.</em> <span class="math display">\[
f\left(x,y\right)=\frac{x}{y},
\]</span> but for a multivariate function <span class="math display">\[
f\left(x,y\right)\neq f\left(y,x\right).
\]</span> However, <span class="math display">\[
P\left(x,y\right)=P\left(y,x\right)
\]</span> because <span class="math display">\[
P\left(X=x,Y=y\right)=P\left(Y=y,X=x\right).
\]</span> Sometimes I think of this as akin to the way in Python we can write 'keyword arguments' in functions. If we use keyword arguments, the ordering of arguments doesn't matter.</p>
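<p>To make the keyword-argument analogy concrete, here is a small illustrative sketch (the joint probability table is hypothetical): the function returns the same value whichever order the keyword arguments are supplied in, just as <span class="math inline">\(P(x, y) = P(y, x)\)</span>.</p>

```python
# A hypothetical joint probability table over two binary variables.
# With keyword arguments the call order doesn't matter, just as the
# ordering of the random variables in a joint probability doesn't matter.
def prob(X=None, Y=None):
    table = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}
    return table[(X, Y)]

print(prob(X=1, Y=0))  # -> 0.3
print(prob(Y=0, X=1))  # same value with the arguments reordered
```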
<p>We've now introduced conditioning and independence to the notion of probability and computed some conditional probabilities on a practical example. The scatter plot of deaths vs year that we created above can be seen as a <em>joint</em> probability distribution. We represent a joint probability using the notation <span class="math inline">\(P(Y=y, T=t)\)</span> or <span class="math inline">\(P(y, t)\)</span> for short. Computing a joint probability is equivalent to answering the simultaneous questions: what's the probability that the number of deaths was over 40 <em>and</em> the year was 2002? Or any other question that may occur to us. Again we can easily use pandas to ask such questions.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">year <span class="op">=</span> <span class="dv">2000</span>
deaths <span class="op">=</span> (film_deaths.Body_Count[film_deaths.Year<span class="op">==</span>year]<span class="op">></span><span class="dv">40</span>).<span class="bu">sum</span>()
total_films <span class="op">=</span> film_deaths.Body_Count.count() <span class="co"># this is total number of films</span>
prob_death <span class="op">=</span> <span class="bu">float</span>(deaths)<span class="op">/</span><span class="bu">float</span>(total_films)
<span class="bu">print</span>(<span class="st">"Probability of deaths being greater than 40 and year being"</span>, year, <span class="st">"is:"</span>, prob_death)</code></pre></div>
<h3 id="the-product-rule">The Product Rule</h3>
<p>This number is the joint probability, <span class="math inline">\(P(Y, T)\)</span>, which is much <em>smaller</em> than the conditional probability. The number can never be bigger than the conditional probability because it is computed using the <em>product rule</em>. <span class="math display">\[
p(Y=y, X=x) = p(Y=y|X=x)p(X=x)
\]</span> and <span class="math display">\[p(X=x)\]</span> is a probability, which is less than or equal to 1, ensuring the joint distribution is never larger than the conditional distribution.</p>
<p>The product rule is a <em>fundamental</em> rule of probability, and you must remember it! It gives the relationship between the two questions: 1) What's the probability that a film was made in 2002 and has over 40 deaths? and 2) What's the probability that a film has over 40 deaths given that it was made in 2002?</p>
<p>In our shorter notation we can write the product rule as <span class="math display">\[
p(y, x) = p(y|x)p(x)
\]</span> We can see the relation working in practice for our data above by computing the different values for <span class="math inline">\(t=2002\)</span>.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">p_x <span class="op">=</span> <span class="bu">float</span>((film_deaths.Year<span class="op">==</span><span class="dv">2002</span>).<span class="bu">sum</span>())<span class="op">/</span><span class="bu">float</span>(film_deaths.Body_Count.count())
p_y_given_x <span class="op">=</span> <span class="bu">float</span>((film_deaths.Body_Count[film_deaths.Year<span class="op">==</span><span class="dv">2002</span>]<span class="op">></span><span class="dv">40</span>).<span class="bu">sum</span>())<span class="op">/</span><span class="bu">float</span>((film_deaths.Year<span class="op">==</span><span class="dv">2002</span>).<span class="bu">sum</span>())
p_y_and_x <span class="op">=</span> <span class="bu">float</span>((film_deaths.Body_Count[film_deaths.Year<span class="op">==</span><span class="dv">2002</span>]<span class="op">></span><span class="dv">40</span>).<span class="bu">sum</span>())<span class="op">/</span><span class="bu">float</span>(film_deaths.Body_Count.count())
<span class="bu">print</span>(<span class="st">"P(x) is"</span>, p_x)
<span class="bu">print</span>(<span class="st">"P(y|x) is"</span>, p_y_given_x)
<span class="bu">print</span>(<span class="st">"P(y,x) is"</span>, p_y_and_x)</code></pre></div>
<h3 id="the-sum-rule">The Sum Rule</h3>
<p>The other <em>fundamental rule</em> of probability is the <em>sum rule</em>. This tells us how to get a <em>marginal</em> distribution from the joint distribution. Simply put, it says that we need to sum across the values of the variable we'd like to remove. <span class="math display">\[
P(Y=y) = \sum_{x} P(Y=y, X=x)
\]</span> Or in our shortened notation <span class="math display">\[
P(y) = \sum_{x} P(y, x)
\]</span></p>
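<p>As an illustration of the sum rule (using a small hypothetical table of counts rather than the <code>film_deaths</code> data frame), we can compute a marginal by summing the joint probabilities over the variable we want to remove:</p>

```python
import pandas as pd

# hypothetical joint counts: rows are outcomes y, columns are values of x
counts = pd.DataFrame({2000: [30, 10], 2002: [25, 15]},
                      index=["y<=40", "y>40"])
N = counts.values.sum()

joint = counts / N            # P(y, x)
marginal = joint.sum(axis=1)  # sum rule: P(y) = sum over x of P(y, x)

print(marginal)
print("Check normalization:", marginal.sum())
```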
<h3 id="question-6">Question 6</h3>
<p>Write code that computes <span class="math inline">\(P(y)\)</span> by adding <span class="math inline">\(P(y, x)\)</span> for all values of <span class="math inline">\(x\)</span>.</p>
<p><em>10 marks</em></p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="co"># Write code for your answer to Question 6 in this box</span>
<span class="co"># provide the answers so that the code runs correctly otherwise you will lose marks!</span>
</code></pre></div>
<h3 id="bayes-rule">Bayes’ Rule</h3>
<p>Bayes' rule is a very simple rule; it's hardly worth the name of a rule at all. It follows directly from the product rule of probability. Because <span class="math inline">\(P(y, x) = P(y|x)P(x)\)</span> and by symmetry <span class="math inline">\(P(y,x)=P(x,y)=P(x|y)P(y)\)</span>, equating these two expressions and dividing through by <span class="math inline">\(P(y)\)</span> gives <span class="math display">\[
P(x|y) =
\frac{P(y|x)P(x)}{P(y)}
\]</span> which is known as Bayes' rule (or Bayes's rule, it depends how you choose to pronounce it). It's not difficult to derive, and its importance is more to do with the semantic operation that it enables. Each of these probability distributions represents the answer to a question we have about the world. Bayes' rule (via the product rule) tells us how to <em>invert</em> the probability.</p>
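<p>We can verify Bayes' rule numerically on a small hypothetical joint distribution over two binary variables: inverting <span class="math inline">\(P(y|x)\)</span> through the rule recovers (up to floating point) the conditional <span class="math inline">\(P(x|y)\)</span> computed directly from the joint.</p>

```python
# hypothetical joint distribution P(x, y) over two binary variables
joint = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}

def p_x(x):
    # marginal P(x) via the sum rule
    return sum(p for (xi, yi), p in joint.items() if xi == x)

def p_y(y):
    # marginal P(y) via the sum rule
    return sum(p for (xi, yi), p in joint.items() if yi == y)

def p_y_given_x(y, x):
    # conditional from the product rule: P(y|x) = P(x, y) / P(x)
    return joint[(x, y)] / p_x(x)

def p_x_given_y(x, y):
    # conditional computed directly from the joint
    return joint[(x, y)] / p_y(y)

# Bayes' rule: invert P(y|x) into P(x|y)
x, y = 1, 0
bayes = p_y_given_x(y, x) * p_x(x) / p_y(y)
print(bayes, p_x_given_y(x, y))  # the two agree up to floating point
```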
<h3 id="probabilities-for-extracting-information-from-data">Probabilities for Extracting Information from Data</h3>
<p>What use is all this probability in data science? Let's think about how we might use the probabilities to do some decision making. Let's load up a little more information about the movies.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">import</span> pandas <span class="im">as</span> pd</code></pre></div>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">movies <span class="op">=</span> pd.read_csv(<span class="st">'./R-vs-Python-master/Deadliest movies scrape/code/film-death-counts-Python.csv'</span>)
movies.columns</code></pre></div>
<h3 id="question-7">Question 7</h3>
<p>Now we see we have several additional features including the quality rating (<code>IMDB_Rating</code>). Let's assume we want to predict the rating given the other information in the data base. How would we go about doing it?</p>
<p>Using what you've learnt about joint, conditional and marginal probabilities, as well as the sum and product rule, how would you formulate the question you want to answer in terms of probabilities? Should you be using a joint or a conditional distribution? If it's conditional, what should the distribution be over, and what should it be conditioned on?</p>
<p><em>20 marks</em></p>
<h3 id="write-your-answer-to-question-7-here">Write your answer to Question 7 here</h3>
<h3 id="probabilistic-modelling">Probabilistic Modelling</h3>
<p>This Bayesian approach is designed to deal with the uncertainty that arises from fitting our prediction function to a limited data set.</p>
<p>The Bayesian approach can be derived from a broader understanding of what our objective is. If we accept that we can jointly represent all things that happen in the world with a probability distribution, then we can interrogate that probability to make predictions. So, if we are interested in predictions, <span class="math inline">\(\dataScalar_*\)</span>, at future input locations of interest, <span class="math inline">\(\inputVector_*\)</span>, given previously observed training data, <span class="math inline">\(\dataVector\)</span>, and corresponding inputs, <span class="math inline">\(\inputMatrix\)</span>, then we are really interrogating the following probability density, <span class="math display">\[
p(\dataScalar_*|\dataVector, \inputMatrix, \inputVector_*),
\]</span> There is nothing controversial here: as long as you accept that you have a good joint model of the world that relates test data to training data, <span class="math inline">\(p(\dataScalar_*, \dataVector, \inputMatrix, \inputVector_*)\)</span>, then this conditional distribution can be recovered through the standard rules of probability (<span class="math inline">\(\text{data} + \text{model} \rightarrow \text{prediction}\)</span>).</p>
<p>We can construct this joint density through the use of the following decomposition: <span class="math display">\[
p(\dataScalar_*|\dataVector, \inputMatrix, \inputVector_*) = \int p(\dataScalar_*|\inputVector_*, \parameterVector) p(\parameterVector | \dataVector, \inputMatrix) \text{d} \parameterVector
\]</span></p>
<p>where, for convenience, we are assuming <em>all</em> the parameters of the model are now represented by <span class="math inline">\(\parameterVector\)</span> (which contains <span class="math inline">\(\mappingMatrix\)</span> and <span class="math inline">\(\mappingMatrixTwo\)</span>) and <span class="math inline">\(p(\parameterVector | \dataVector, \inputMatrix)\)</span> is recognised as the posterior density of the parameters given data and <span class="math inline">\(p(\dataScalar_*|\inputVector_*, \parameterVector)\)</span> is the <em>likelihood</em> of an individual test data point given the parameters.</p>
<p>The likelihood of the data is normally assumed to factorize independently across the data points given the parameters, <span class="math display">\[
p(\dataVector|\inputMatrix, \parameterVector) = \prod_{i=1}^\numData p(\dataScalar_i|\inputVector_i, \parameterVector),\]</span></p>
<p>and if that is so, it is easy to extend our predictions across all future, potential, locations, <span class="math display">\[
p(\dataVector_*|\dataVector, \inputMatrix, \inputMatrix_*) = \int p(\dataVector_*|\inputMatrix_*, \parameterVector) p(\parameterVector | \dataVector, \inputMatrix) \text{d} \parameterVector.
\]</span></p>
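<p>This predictive integral rarely has a closed form; a standard numerical sketch (using a hypothetical one-parameter model, not anything from the lecture) approximates it by averaging the likelihood over samples from the posterior:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical posterior over a single scalar parameter theta: Gaussian.
theta_samples = rng.normal(loc=1.0, scale=0.1, size=10000)

def likelihood(y_star, theta, noise_var=0.25):
    # p(y*|x*, theta) for a toy model whose prediction is simply theta.
    return np.exp(-(y_star - theta)**2 / (2 * noise_var)) / np.sqrt(2 * np.pi * noise_var)

# p(y*|y, X, x*) ~= (1/S) sum_s p(y*|x*, theta_s): Monte Carlo marginalization.
p_pred = likelihood(1.2, theta_samples).mean()
```

<p>The average over posterior samples plays the role of the integral over <span class="math inline">\(\parameterVector\)</span> above.</p>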
<p>The likelihood is also where the <em>prediction function</em> is incorporated. For example in the regression case, we consider an objective based around the Gaussian density, <span class="math display">\[
p(\dataScalar_i | \mappingFunction(\inputVector_i)) = \frac{1}{\sqrt{2\pi \dataStd^2}} \exp\left(-\frac{\left(\dataScalar_i - \mappingFunction(\inputVector_i)\right)^2}{2\dataStd^2}\right)
\]</span></p>
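<p>As a sketch in code (a direct transcription of the Gaussian density above, nothing more), the likelihood of a single observation is:</p>

```python
import numpy as np

def gaussian_likelihood(y, f, sigma2):
    # p(y_i | f(x_i)): Gaussian with mean f(x_i) and variance sigma2.
    return np.exp(-(y - f)**2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

print(gaussian_likelihood(0.0, 0.0, 1.0))  # 1/sqrt(2*pi) ≈ 0.3989
```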
<p>In short, that is the classical approach to probabilistic inference, and all approaches to Bayesian neural networks fall within this path. For a deep probabilistic model, we can simply take this one stage further and place a probability distribution over the input locations, <span class="math display">\[
p(\dataVector_*|\dataVector) = \int p(\dataVector_*|\inputMatrix_*, \parameterVector) p(\parameterVector | \dataVector, \inputMatrix) p(\inputMatrix) p(\inputMatrix_*) \text{d} \parameterVector \text{d} \inputMatrix \text{d}\inputMatrix_*
\]</span> and we have <em>unsupervised learning</em> (from where we can get deep generative models).</p>
<h3 id="graphical-models">Graphical Models</h3>
<p>One way of representing a joint distribution is to consider conditional dependencies between data. Conditional dependencies allow us to factorize the distribution. For example, a first-order Markov chain factorizes the distribution into components that represent the conditional relationships between neighboring points, often in time or space. It can be decomposed in the following form, <span class="math display">\[p(\dataVector) = p(\dataScalar_\numData | \dataScalar_{\numData-1}) p(\dataScalar_{\numData-1}|\dataScalar_{\numData-2}) \dots p(\dataScalar_{2} | \dataScalar_{1}) p(\dataScalar_{1}).\]</span></p>
<object class="svgplot" align data="../slides/diagrams/ml/markov.svg">
</object>
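<p>The factorization can be used directly for sampling and for evaluating joint probabilities. Here is a minimal sketch for a binary first-order Markov chain (the transition probabilities are made up for illustration):</p>

```python
import numpy as np

rng = np.random.default_rng(1)

p_initial = 0.5            # p(y_1 = 1)
p_one = {0: 0.1, 1: 0.8}   # p(y_i = 1 | y_{i-1})

def sample_chain(n):
    # Draw y_1 from the initial distribution, then each y_i given y_{i-1}.
    y = [int(rng.random() < p_initial)]
    for _ in range(n - 1):
        y.append(int(rng.random() < p_one[y[-1]]))
    return y

def log_prob(y):
    # log p(y) = log p(y_1) + sum_i log p(y_i | y_{i-1})
    lp = np.log(p_initial if y[0] == 1 else 1 - p_initial)
    for prev, cur in zip(y[:-1], y[1:]):
        lp += np.log(p_one[prev] if cur == 1 else 1 - p_one[prev])
    return lp

chain = sample_chain(10)
```

<p>The joint probability of the whole chain is specified by just three numbers here, rather than the <span class="math inline">\(2^{\numData}-1\)</span> required for a full joint table over binary variables.</p>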
<p>By specifying conditional independencies we can reduce the parameterization required for our data: instead of directly specifying the parameters of the joint distribution, we can specify each set of parameters of the conditionals independently. This can also give an advantage in terms of interpretability. Understanding a conditional independence structure gives a structured understanding of data. If developed correctly, according to causal methodology, it can even inform how we should intervene in the system to drive a desired result <span class="citation">(Pearl, 1995)</span>.</p>
<p>However, a challenge arises when the data become more complex. Consider the graphical model shown below, used to predict the perioperative risk of <em>C. difficile</em> infection following colon surgery <span class="citation">(Steele et al., 2012)</span>.</p>
<p><img class="negate" src="../slides/diagrams/bayes-net-diagnosis.png" width="40%" align="center" style="background:none; border:none; box-shadow:none;"></p>
<p>To capture the complexity of the interrelationships in the data the graph becomes more complex, and less interpretable.</p>
<p>Machine learning problems normally involve a prediction function and an objective function. So far in the course we've mainly focussed on the case where the prediction function was over the real numbers, so the codomain of the function, <span class="math inline">\(\mappingFunction(\inputMatrix)\)</span>, was the real numbers or sometimes real vectors. The classification problem consists of predicting whether or not a particular example is a member of a particular class. So we may want to know if a particular image represents a digit 6, or if a particular user will click on a given advert. These are classification problems, and they require us to map to <em>yes</em> or <em>no</em> answers. That makes them naturally discrete mappings.</p>
<p>In classification we are given an input vector, <span class="math inline">\(\inputVector\)</span>, and an associated label, <span class="math inline">\(\dataScalar\)</span> which either takes the value <span class="math inline">\(-1\)</span> to represent <em>no</em> or <span class="math inline">\(1\)</span> to represent <em>yes</em>.</p>
<ul>
<li>Classifying hand written digits from binary images (automatic zip code reading)</li>
<li>Detecting faces in images (e.g. digital cameras).</li>
<li>Who a detected face belongs to (e.g. Picasa, Facebook, DeepFace, GaussianFace)</li>
<li>Classifying type of cancer given gene expression data.</li>
<li>Categorization of document types (different types of news article on the internet)</li>
</ul>
<p>Our focus has been on models where the objective function is inspired by a probabilistic analysis of the problem. In particular we've argued that we answer questions about the data set by placing probability distributions over the various quantities of interest. For the case of binary classification this will normally involve introducing probability distributions for discrete variables. Such probability distributions are in some senses easier than those for continuous variables; in particular we can represent a probability distribution over <span class="math inline">\(\dataScalar\)</span>, where <span class="math inline">\(\dataScalar\)</span> is binary, with a single value. If we specify the probability that <span class="math inline">\(\dataScalar=1\)</span> with a number between 0 and 1, say <span class="math inline">\(P(\dataScalar=1) = \pi\)</span> (here we don't mean <span class="math inline">\(\pi\)</span> the number, we are setting <span class="math inline">\(\pi\)</span> to be a variable), then we can specify the probability distribution through a table.</p>
<table>
<thead>
<tr><th style="text-align: center;"><span class="math inline">\(\dataScalar\)</span></th><th style="text-align: center;">0</th><th style="text-align: center;">1</th></tr>
</thead>
<tbody>
<tr><td style="text-align: center;"><span class="math inline">\(P(\dataScalar)\)</span></td><td style="text-align: center;"><span class="math inline">\((1-\pi)\)</span></td><td style="text-align: center;"><span class="math inline">\(\pi\)</span></td></tr>
</tbody>
</table>
<p>Mathematically we can use a trick to implement this same table. We can use the value <span class="math inline">\(\dataScalar\)</span> as a mathematical switch and write that <span class="math display">\[
P(\dataScalar) = \pi^\dataScalar (1-\pi)^{(1-\dataScalar)}
\]</span> where our probability distribution is now written as a function of <span class="math inline">\(\dataScalar\)</span>. This probability distribution is known as the <a href="http://en.wikipedia.org/wiki/Bernoulli_distribution">Bernoulli distribution</a>. The Bernoulli distribution is a clever trick for mathematically switching between two probabilities; if we were to write it as code it would be better described as</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="kw">def</span> bernoulli(y_i, pi):
<span class="cf">if</span> y_i <span class="op">==</span> <span class="dv">1</span>:
<span class="cf">return</span> pi
<span class="cf">else</span>:
<span class="cf">return</span> <span class="dv">1</span><span class="op">-</span>pi</code></pre></div>
<p>If we insert <span class="math inline">\(\dataScalar=1\)</span> then the function is equal to <span class="math inline">\(\pi\)</span>, and if we insert <span class="math inline">\(\dataScalar=0\)</span> then the function is equal to <span class="math inline">\(1-\pi\)</span>. So the function recreates the table for the distribution given above.</p>
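<p>The switch can equally be written as a direct transcription of the formula, which also works elementwise on arrays (a small check, not from the notes):</p>

```python
def bernoulli(y, pi):
    # pi**y * (1 - pi)**(1 - y): equals pi when y = 1, and 1 - pi when y = 0.
    return pi**y * (1 - pi)**(1 - y)

print(bernoulli(1, 0.3), bernoulli(0, 0.3))  # 0.3 0.7
```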
<p>The probability distribution is named for <a href="http://en.wikipedia.org/wiki/Jacob_Bernoulli">Jacob Bernoulli</a>, the Swiss mathematician. In his book <em>Ars Conjectandi</em> he considered the distribution and the result of a number of 'trials' under the Bernoulli distribution to form the <em>binomial</em> distribution. Below is the page where he considers Pascal's triangle in forming combinations of the Bernoulli distribution to realise the binomial distribution for the number of positive trials.</p>
<p><a href="https://play.google.com/books/reader?id=CF4UAAAAQAAJ&pg=PA87"><img src="../slides/diagrams/books/CF4UAAAAQAAJ-PA87.png" /></a></p>
<object class="svgplot" align data="../slides/diagrams/ml/bernoulli-urn.svg">
</object>
<p>Thomas Bayes also described the Bernoulli distribution, only he didn't refer to Jacob Bernoulli's work, so he didn't call it by that name. He described the distribution in terms of a table (think of a <em>billiard table</em>) and two balls. Bayes suggests that each ball can be rolled across the table such that it comes to rest at a position that is <em>uniformly distributed</em> between the sides of the table.</p>
<p>Let's assume that the first ball is rolled, and that it comes to rest at a position that is <span class="math inline">\(\pi\)</span> times the width of the table from the left hand side.</p>
<p>Now, we roll the second ball. We are interested if the second ball ends up on the left side (+ve result) or the right side (-ve result) of the first ball. We use the Bernoulli distribution to determine this.</p>
<p>For this reason, in Bayes's setup the distribution parameter itself is subject to <em>aleatoric</em> uncertainty: it is generated by a random physical process, the roll of the first ball.</p>
<object class="svgplot" align data="../slides/diagrams/ml/bayes-billiard009.svg">
</object>
<h3 id="maximum-likelihood-in-the-bernoulli-distribution">Maximum Likelihood in the Bernoulli Distribution</h3>
<p>Maximum likelihood in the Bernoulli distribution is straightforward. Let's assume we have data, <span class="math inline">\(\dataVector\)</span>, which consists of a vector of binary values of length <span class="math inline">\(\numData\)</span>. If we assume each value was sampled independently from the Bernoulli distribution, conditioned on the parameter <span class="math inline">\(\pi\)</span>, then our joint probability density has the form <span class="math display">\[
p(\dataVector|\pi) = \prod_{i=1}^{\numData} \pi^{\dataScalar_i} (1-\pi)^{1-\dataScalar_i}.
\]</span> As normal in maximum likelihood we consider the negative log likelihood as our objective, <span class="math display">\[\begin{align*}
\errorFunction(\pi)& = -\log p(\dataVector|\pi)\\
& = -\sum_{i=1}^{\numData} \dataScalar_i \log \pi - \sum_{i=1}^{\numData} (1-\dataScalar_i) \log(1-\pi),
\end{align*}\]</span></p>
<p>and we can derive the gradient with respect to the parameter <span class="math inline">\(\pi\)</span>. <span class="math display">\[\frac{\text{d}\errorFunction(\pi)}{\text{d}\pi} = -\frac{\sum_{i=1}^{\numData} \dataScalar_i}{\pi} + \frac{\sum_{i=1}^{\numData} (1-\dataScalar_i)}{1-\pi},\]</span></p>
<p>and as normal we look for a stationary point for the log likelihood by setting this derivative to zero, <span class="math display">\[0 = -\frac{\sum_{i=1}^{\numData} \dataScalar_i}{\pi} + \frac{\sum_{i=1}^{\numData} (1-\dataScalar_i)}{1-\pi},\]</span> rearranging we form <span class="math display">\[(1-\pi)\sum_{i=1}^{\numData} \dataScalar_i = \pi\sum_{i=1}^{\numData} (1-\dataScalar_i),\]</span> which implies <span class="math display">\[\sum_{i=1}^{\numData} \dataScalar_i = \pi\left(\sum_{i=1}^{\numData} (1-\dataScalar_i) + \sum_{i=1}^{\numData} \dataScalar_i\right),\]</span></p>
<p>and now we recognise that <span class="math inline">\(\sum_{i=1}^{\numData} (1-\dataScalar_i) + \sum_{i=1}^{\numData} \dataScalar_i = \numData\)</span> so we have <span class="math display">\[\pi = \frac{\sum_{i=1}^{\numData} \dataScalar_i}{\numData}\]</span></p>
<p>so in other words we estimate the probability associated with the Bernoulli by setting it to the number of observed positives divided by the total length of <span class="math inline">\(\dataVector\)</span>. This makes intuitive sense. If I asked you to estimate the probability of a coin being heads, and you tossed the coin 100 times and recovered 47 heads, then the estimate of the probability of heads should be <span class="math inline">\(\frac{47}{100}\)</span>.</p>
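<p>In code the estimate is a single line; here we reproduce the coin example from the text (47 heads in 100 tosses):</p>

```python
import numpy as np

# 100 coin tosses, of which 47 came up heads (1) and 53 tails (0).
y = np.zeros(100)
y[:47] = 1.0

# Maximum likelihood for the Bernoulli: sum of positives over the total count.
pi_hat = y.sum() / len(y)
print(pi_hat)  # 0.47
```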
<h3 id="exercise">Exercise</h3>
<p>Show that the maximum likelihood solution we have found is a <em>minimum</em> for our objective.</p>
<h3 id="write-your-answer-to-exercise-here">Write your answer to Exercise here</h3>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="co"># Use this box for any code you need for the exercise</span>
</code></pre></div>
<p><span class="math display">\[
\text{posterior} =
\frac{\text{likelihood}\times\text{prior}}{\text{marginal likelihood}}
\]</span></p>
<p>Four components:</p>
<ol style="list-style-type: decimal">
<li>Prior distribution</li>
<li>Likelihood</li>
<li>Posterior distribution</li>
<li>Marginal likelihood</li>
</ol>
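<p>For the Bernoulli likelihood these four components can all be written in closed form if we assume a conjugate Beta prior over <span class="math inline">\(\pi\)</span>. This goes beyond the maximum likelihood treatment above; the sketch below uses made-up observations:</p>

```python
from math import lgamma, exp

def log_beta(a, b):
    # log B(a, b) = log Gamma(a) + log Gamma(b) - log Gamma(a + b)
    return lgamma(a) + lgamma(b) - lgamma(a + b)

# Prior: pi ~ Beta(a, b) (a = b = 1 is uniform over [0, 1]).
a, b = 1.0, 1.0
# Likelihood: independent Bernoulli observations (made-up data).
y = [1, 0, 1, 1, 0, 1]
k, n = sum(y), len(y)

# Posterior: by conjugacy, pi | y ~ Beta(a + k, b + n - k).
post_a, post_b = a + k, b + n - k

# Marginal likelihood: p(y) = B(a + k, b + n - k) / B(a, b).
marginal_likelihood = exp(log_beta(post_a, post_b) - log_beta(a, b))
```

<p>The marginal likelihood is what normalizes the product of likelihood and prior to give the posterior.</p>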
<h3 id="naive-bayes-classifiers">Naive Bayes Classifiers</h3>
<p>In probabilistic machine learning we place probability distributions (or densities) over all the variables of interest, our first classification algorithm will do just that. We will consider how to form a classification by making assumptions about the <em>joint</em> density of our observations. We need to make assumptions to reduce the number of parameters we need to optimise.</p>
<p>In the ideal world, given label data <span class="math inline">\(\dataVector\)</span> and the inputs <span class="math inline">\(\inputMatrix\)</span> we should be able to specify the joint density of all potential values of <span class="math inline">\(\dataVector\)</span> and <span class="math inline">\(\inputMatrix\)</span>, <span class="math inline">\(p(\dataVector, \inputMatrix)\)</span>. If <span class="math inline">\(\inputMatrix\)</span> and <span class="math inline">\(\dataVector\)</span> are our training data, and we can somehow extend our density to incorporate future test data (by augmenting <span class="math inline">\(\dataVector\)</span> with a new observation <span class="math inline">\(\dataScalar^*\)</span> and <span class="math inline">\(\inputMatrix\)</span> with the corresponding inputs, <span class="math inline">\(\inputVector^*\)</span>), then we can answer any given question about a future test point <span class="math inline">\(\dataScalar^*\)</span> given its covariates <span class="math inline">\(\inputVector^*\)</span> by conditioning on the training variables to recover, <span class="math display">\[
p(\dataScalar^*|\inputMatrix, \dataVector, \inputVector^*),
\]</span></p>
<p>We can compute this distribution using the product and sum rules. However, to specify this density we must give the probability associated with all possible combinations of <span class="math inline">\(\dataVector\)</span> and <span class="math inline">\(\inputMatrix\)</span>. There are <span class="math inline">\(2^{\numData}\)</span> possible combinations for the vector <span class="math inline">\(\dataVector\)</span> and the probability for each of these combinations must be jointly specified along with the joint density of the matrix <span class="math inline">\(\inputMatrix\)</span>, as well as being able to <em>extend</em> the density for any chosen test location <span class="math inline">\(\inputVector^*\)</span>.</p>
<p>In naive Bayes we make certain simplifying assumptions that allow us to perform all of the above in practice.</p>
<h3 id="data-conditional-independence">Data Conditional Independence</h3>
<p>If we are given model parameters <span class="math inline">\(\paramVector\)</span> we assume that conditioned on all these parameters that all data points in the model are independent. In other words we have, <span class="math display">\[
p(\dataScalar^*, \inputVector^*, \dataVector, \inputMatrix|\paramVector) = p(\dataScalar^*, \inputVector^*|\paramVector)\prod_{i=1}^{\numData} p(\dataScalar_i, \inputVector_i | \paramVector).
\]</span> This is a conditional independence assumption because we are not assuming our data are purely independent. If we were to assume that, then there would be nothing to learn about our test data given our training data. We are assuming that they are independent <em>given</em> our parameters, <span class="math inline">\(\paramVector\)</span>. We made similar assumptions for regression, where our parameter set included <span class="math inline">\(\mappingVector\)</span> and <span class="math inline">\(\dataStd^2\)</span>. Given those parameters we assumed that the density over <span class="math inline">\(\dataVector, \dataScalar^*\)</span> was <em>independent</em>. Here we are going a little further with that assumption because we are assuming the <em>joint</em> density of <span class="math inline">\(\dataVector\)</span> and <span class="math inline">\(\inputMatrix\)</span> is independent across the data given the parameters.</p>
<p>Computing the posterior distribution in this case becomes easier; this approach is known as the 'Bayes classifier'.</p>
<h3 id="feature-conditional-independence">Feature Conditional Independence</h3>
<p>The assumption that is particular to naive Bayes is to now consider that the <em>features</em> are also conditionally independent, but not only given the parameters. We assume that the features are independent given the parameters <em>and</em> the label. So for each data point we have <span class="math display">\[p(\inputVector_i | \dataScalar_i, \paramVector) = \prod_{j=1}^{\dataDim} p(\inputScalar_{i,j}|\dataScalar_i,\paramVector)\]</span> where <span class="math inline">\(\dataDim\)</span> is the dimensionality of our inputs.</p>
<h3 id="marginal-density-for-datascalar_i">Marginal Density for <span class="math inline">\(\dataScalar_i\)</span></h3>
<p>We now have nearly all of the components we need to specify the full joint density. However, the feature conditional independence doesn't yet give us the joint density over <span class="math inline">\(p(\dataScalar_i, \inputVector_i)\)</span>, which is required to substitute in to our data conditional independence to give us the full density. To recover the joint density given the conditional distribution of each feature, <span class="math inline">\(p(\inputScalar_{i,j}|\dataScalar_i, \paramVector)\)</span>, we need to make use of the product rule and combine it with a marginal density for <span class="math inline">\(\dataScalar_i\)</span>,</p>
<p><span class="math display">\[p(\inputScalar_{i,j},\dataScalar_i| \paramVector) = p(\inputScalar_{i,j}|\dataScalar_i, \paramVector)p(\dataScalar_i).\]</span> Because <span class="math inline">\(\dataScalar_i\)</span> is binary the <em>Bernoulli</em> density makes a suitable choice for our prior over <span class="math inline">\(\dataScalar_i\)</span>, <span class="math display">\[p(\dataScalar_i|\pi) = \pi^{\dataScalar_i} (1-\pi)^{1-\dataScalar_i}\]</span> where <span class="math inline">\(\pi\)</span> now has the interpretation as being the <em>prior</em> probability that the classification should be positive.</p>
<h3 id="joint-density-for-naive-bayes">Joint Density for Naive Bayes</h3>
<p>This allows us to write down the full joint density of the training data, <span class="math display">\[
p(\dataVector, \inputMatrix|\paramVector, \pi) = \prod_{i=1}^{\numData} \prod_{j=1}^{\dataDim} p(\inputScalar_{i,j}|\dataScalar_i, \paramVector)p(\dataScalar_i|\pi)
\]</span></p>
<p>which can now be fit by maximum likelihood. As normal we form our objective as the negative log likelihood, <span class="math display">\[
\errorFunction(\paramVector, \pi) = -\log p(\dataVector, \inputMatrix|\paramVector, \pi) = -\sum_{i=1}^{\numData} \sum_{j=1}^{\dataDim} \log p(\inputScalar_{i, j}|\dataScalar_i, \paramVector) - \sum_{i=1}^{\numData} \log p(\dataScalar_i|\pi),
\]</span> which we note <em>decomposes</em> into two objective functions, one which is dependent on <span class="math inline">\(\pi\)</span> alone and one which is dependent on <span class="math inline">\(\paramVector\)</span> alone so we have, <span class="math display">\[
\errorFunction(\pi, \paramVector) = \errorFunction(\paramVector) + \errorFunction(\pi).
\]</span> Since the two objective functions are separately dependent on the parameters <span class="math inline">\(\pi\)</span> and <span class="math inline">\(\paramVector\)</span> we can minimize them independently. Firstly, minimizing the Bernoulli likelihood over the labels we have, <span class="math display">\[
\errorFunction(\pi) = -\sum_{i=1}^{\numData}\log p(\dataScalar_i|\pi) = -\sum_{i=1}^{\numData} \dataScalar_i \log \pi - \sum_{i=1}^{\numData} (1-\dataScalar_i) \log (1-\pi)
\]</span> which we already minimized above recovering <span class="math display">\[
\pi = \frac{\sum_{i=1}^{\numData} \dataScalar_i}{\numData}.
\]</span></p>
<p>We now need to minimize the objective associated with the conditional distributions for the features, <span class="math display">\[
\errorFunction(\paramVector) = -\sum_{i=1}^{\numData} \sum_{j=1}^{\dataDim} \log p(\inputScalar_{i, j} |\dataScalar_i, \paramVector),
\]</span> which necessarily implies making some assumptions about the form of the conditional distributions. The right assumption will depend on the nature of our input data. For example, if we have an input which is real valued, we could use a Gaussian density and we could allow the mean and variance of the Gaussian to be different according to whether the class was positive or negative and according to which feature we were measuring. That would give us the form, <span class="math display">\[
p(\inputScalar_{i, j} | \dataScalar_i,\paramVector) = \frac{1}{\sqrt{2\pi \dataStd_{\dataScalar_i,j}^2}} \exp \left(-\frac{(\inputScalar_{i,j} - \mu_{\dataScalar_i, j})^2}{2\dataStd_{\dataScalar_i,j}^2}\right),
\]</span> where <span class="math inline">\(\dataStd_{1, j}^2\)</span> is the variance of the density for the <span class="math inline">\(j\)</span>th output and the class <span class="math inline">\(\dataScalar_i=1\)</span> and <span class="math inline">\(\dataStd_{0, j}^2\)</span> is the variance if the class is 0. The means can vary similarly. Our parameters, <span class="math inline">\(\paramVector\)</span> would consist of all the means and all the variances for the different dimensions.</p>
<h3 id="movie-body-count-data">Movie Body Count Data</h3>
<p>First we will load in the movie body count data. Our aim will be to predict whether a movie is rated R or not given the attributes in the data. We will predict on the basis of year, body count and movie genre. The genres in the CSV file are stored as a list in the following form:</p>
<pre><code>Biography|Action|Sci-Fi</code></pre>
<p>First we have to do a little work to extract this form and turn it into a vector of binary values. Let's first load in and remind ourselves of the data.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">import</span> pods</code></pre></div>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">data <span class="op">=</span> pods.datasets.movie_body_count()[<span class="st">'Y'</span>]
data.head()</code></pre></div>
<p>Now we will convert this data into a form which we can use as inputs <code>X</code>, and labels <code>y</code>.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">import</span> pandas <span class="im">as</span> pd
<span class="im">import</span> numpy <span class="im">as</span> np</code></pre></div>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">X <span class="op">=</span> data[[<span class="st">'Year'</span>, <span class="st">'Body_Count'</span>]]
y <span class="op">=</span> data[<span class="st">'MPAA_Rating'</span>]<span class="op">==</span><span class="st">'R'</span> <span class="co"># set label to be positive for R rated films.</span>
<span class="co"># Create series of movie genres with the relevant index</span>
s <span class="op">=</span> data[<span class="st">'Genre'</span>].<span class="bu">str</span>.split(<span class="st">'|'</span>).<span class="bu">apply</span>(pd.Series, <span class="dv">1</span>).stack()
s.index <span class="op">=</span> s.index.droplevel(<span class="op">-</span><span class="dv">1</span>) <span class="co"># to line up with df's index</span>
<span class="co"># Extract from the series the unique list of genres.</span>
genres <span class="op">=</span> s.unique()
<span class="co"># For each genre extract the indices where it is present and add a column to X</span>
<span class="cf">for</span> genre <span class="kw">in</span> genres:
    index <span class="op">=</span> s[s<span class="op">==</span>genre].index.tolist()
    X[genre] <span class="op">=</span> np.zeros(X.shape[<span class="dv">0</span>])
    <span class="co"># use .loc rather than chained indexing so the assignment takes effect</span>
    X.loc[index, genre] <span class="op">=</span> <span class="fl">1.0</span></code></pre></div>
<p>This has given us a new data frame <code>X</code> which contains the different genres in different columns.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">X.describe()</code></pre></div>
<p>We can now specify the naive Bayes model. For the genres we want to model the data as Bernoulli distributed, and for the year and body count we want to model the data as Gaussian distributed. We set up two data frames to contain the parameters for the rows and the columns below.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="co"># assume each feature is either binary or real valued.</span>
<span class="co"># these lists record which columns are binary and which are real.</span>
binary_columns <span class="op">=</span> genres
real_columns <span class="op">=</span> [<span class="st">'Year'</span>, <span class="st">'Body_Count'</span>]
Bernoulli <span class="op">=</span> pd.DataFrame(data<span class="op">=</span>np.zeros((<span class="dv">2</span>,<span class="bu">len</span>(binary_columns))), columns<span class="op">=</span>binary_columns, index<span class="op">=</span>[<span class="st">'theta_0'</span>, <span class="st">'theta_1'</span>])
Gaussian <span class="op">=</span> pd.DataFrame(data<span class="op">=</span>np.zeros((<span class="dv">4</span>,<span class="bu">len</span>(real_columns))), columns<span class="op">=</span>real_columns, index<span class="op">=</span>[<span class="st">'mu_0'</span>, <span class="st">'sigma2_0'</span>, <span class="st">'mu_1'</span>, <span class="st">'sigma2_1'</span>])</code></pre></div>
<p>Now we have the data in a form ready for analysis, let's construct our data matrix.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">num_train <span class="op">=</span> <span class="dv">200</span>
indices <span class="op">=</span> np.random.permutation(X.shape[<span class="dv">0</span>])
train_indices <span class="op">=</span> indices[:num_train]
test_indices <span class="op">=</span> indices[num_train:]
X_train <span class="op">=</span> X.loc[train_indices]
y_train <span class="op">=</span> y.loc[train_indices]
X_test <span class="op">=</span> X.loc[test_indices]
y_test <span class="op">=</span> y.loc[test_indices]</code></pre></div>
<p>And we can now train the model. For each feature we can make the fit independently. The fit is given by either counting the number of positives (for binary data) which gives us the maximum likelihood solution for the Bernoulli. Or by computing the empirical mean and variance of the data for the Gaussian, which also gives us the maximum likelihood solution.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="co"># index by the *training* labels, y_train, so the rows line up with X_train</span>
<span class="cf">for</span> column <span class="kw">in</span> X_train:
    <span class="cf">if</span> column <span class="kw">in</span> Gaussian:
        Gaussian.loc[<span class="st">'mu_0'</span>, column] <span class="op">=</span> X_train[column][<span class="op">~</span>y_train].mean()
        Gaussian.loc[<span class="st">'mu_1'</span>, column] <span class="op">=</span> X_train[column][y_train].mean()
        Gaussian.loc[<span class="st">'sigma2_0'</span>, column] <span class="op">=</span> X_train[column][<span class="op">~</span>y_train].var(ddof<span class="op">=</span><span class="dv">0</span>)
        Gaussian.loc[<span class="st">'sigma2_1'</span>, column] <span class="op">=</span> X_train[column][y_train].var(ddof<span class="op">=</span><span class="dv">0</span>)
    <span class="cf">if</span> column <span class="kw">in</span> Bernoulli:
        Bernoulli.loc[<span class="st">'theta_0'</span>, column] <span class="op">=</span> X_train[column][<span class="op">~</span>y_train].<span class="bu">sum</span>()<span class="op">/</span>(<span class="op">~</span>y_train).<span class="bu">sum</span>()
        Bernoulli.loc[<span class="st">'theta_1'</span>, column] <span class="op">=</span> X_train[column][y_train].<span class="bu">sum</span>()<span class="op">/</span>(y_train).<span class="bu">sum</span>()</code></pre></div>
<p>We can examine the nature of the distributions we've fitted to the model by looking at the entries in these data frames.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">Bernoulli</code></pre></div>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">Gaussian</code></pre></div>
<p>The final model parameter is the prior probability of the positive class, <span class="math inline">\(\pi\)</span>, which is computed by maximum likelihood.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">prior <span class="op">=</span> <span class="bu">float</span>(y_train.<span class="bu">sum</span>())<span class="op">/</span><span class="bu">len</span>(y_train)</code></pre></div>
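<p>The prior is just the fraction of positive examples among the training labels. A toy check with hypothetical labels:</p>

```python
import numpy as np

# Hypothetical training labels: 3 positives out of 5 examples.
y_train_toy = np.array([True, False, True, True, False])
prior_toy = float(y_train_toy.sum()) / len(y_train_toy)  # fraction of positives
```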
<ul>
<li>We know that <span class="math display">\[
P(\dataScalar^*| \dataVector, \inputMatrix, \inputVector^*, \paramVector)p(\dataVector,\inputMatrix, \inputVector^*|\paramVector) = p(\dataScalar^*, \dataVector, \inputMatrix,\inputVector^*| \paramVector)
\]</span></li>
<li><p>This implies <span class="math display">\[
P(\dataScalar^*| \dataVector, \inputMatrix, \inputVector^*, \paramVector) = \frac{p(\dataScalar^*, \dataVector, \inputMatrix, \inputVector^*| \paramVector)}{p(\dataVector, \inputMatrix, \inputVector^*|\paramVector)}
\]</span></p></li>
<li>From conditional independence assumptions <span class="math display">\[
p(\dataScalar^*, \dataVector, \inputMatrix, \inputVector^*| \paramVector) = \prod_{j=1}^{\dataDim} p(\inputScalar^*_{j}|\dataScalar^*, \paramVector)p(\dataScalar^*|\pi)\prod_{i=1}^{\numData} \prod_{j=1}^{\dataDim} p(\inputScalar_{i,j}|\dataScalar_i, \paramVector)p(\dataScalar_i|\pi)
\]</span></li>
<li>We also need <span class="math display">\[
p(\dataVector, \inputMatrix, \inputVector^*|\paramVector)\]</span> which can be found from <span class="math display">\[p(\dataScalar^*, \dataVector, \inputMatrix, \inputVector^*| \paramVector)
\]</span></li>
<li><p>Using the <em>sum rule</em> of probability, <span class="math display">\[
p(\dataVector, \inputMatrix, \inputVector^*|\paramVector) = \sum_{\dataScalar^*=0}^1 p(\dataScalar^*, \dataVector, \inputMatrix, \inputVector^*| \paramVector).
\]</span></p></li>
<li>From independence assumptions <span class="math display">\[
p(\dataVector, \inputMatrix, \inputVector^*| \paramVector) = \sum_{\dataScalar^*=0}^1 \prod_{j=1}^{\dataDim} p(\inputScalar^*_{j}|\dataScalar^*, \paramVector)p(\dataScalar^*|\pi)\prod_{i=1}^{\numData} \prod_{j=1}^{\dataDim} p(\inputScalar_{i,j}|\dataScalar_i, \paramVector)p(\dataScalar_i|\pi).
\]</span></li>
<li><p>Substitute both forms to recover, <span class="math display">\[
P(\dataScalar^*| \dataVector, \inputMatrix, \inputVector^*, \paramVector) = \frac{\prod_{j=1}^{\dataDim} p(\inputScalar^*_{j}|\dataScalar^*, \paramVector)p(\dataScalar^*|\pi)\prod_{i=1}^{\numData} \prod_{j=1}^{\dataDim} p(\inputScalar_{i,j}|\dataScalar_i, \paramVector)p(\dataScalar_i|\pi)}{\sum_{\dataScalar^*=0}^1 \prod_{j=1}^{\dataDim} p(\inputScalar^*_{j}|\dataScalar^*, \paramVector)p(\dataScalar^*|\pi)\prod_{i=1}^{\numData} \prod_{j=1}^{\dataDim} p(\inputScalar_{i,j}|\dataScalar_i, \paramVector)p(\dataScalar_i|\pi)}
\]</span></p></li>
<li>Note training data terms cancel. <span class="math display">\[
p(\dataScalar^*| \inputVector^*, \paramVector) = \frac{\prod_{j=1}^{\dataDim} p(\inputScalar^*_{j}|\dataScalar^*, \paramVector)p(\dataScalar^*|\pi)}{\sum_{\dataScalar^*=0}^1 \prod_{j=1}^{\dataDim} p(\inputScalar^*_{j}|\dataScalar^*, \paramVector)p(\dataScalar^*|\pi)}
\]</span></li>
<li><p>This formula is also fairly straightforward to implement for different class conditional distributions.</p></li>
</ul>
<h3 id="making-predictions">Making Predictions</h3>
<p>Naive Bayes has given us the class conditional densities: <span class="math inline">\(p(\inputVector_i | \dataScalar_i, \paramVector)\)</span>. To make predictions with these densities we need to form the distribution given by <span class="math display">\[
P(\dataScalar^*| \dataVector, \inputMatrix, \inputVector^*, \paramVector)
\]</span> This can be computed by using the product rule. We know that <span class="math display">\[
P(\dataScalar^*| \dataVector, \inputMatrix, \inputVector^*, \paramVector)p(\dataVector, \inputMatrix, \inputVector^*|\paramVector) = p(\dataScalar^*, \dataVector, \inputMatrix, \inputVector^*| \paramVector)
\]</span> implying that <span class="math display">\[
P(\dataScalar^*| \dataVector, \inputMatrix, \inputVector^*, \paramVector) = \frac{p(\dataScalar^*, \dataVector, \inputMatrix, \inputVector^*| \paramVector)}{p(\dataVector, \inputMatrix, \inputVector^*|\paramVector)}
\]</span> and we've already defined <span class="math inline">\(p(\dataScalar^*, \dataVector, \inputMatrix, \inputVector^*| \paramVector)\)</span> using our conditional independence assumptions above <span class="math display">\[
p(\dataScalar^*, \dataVector, \inputMatrix, \inputVector^*| \paramVector) = \prod_{j=1}^{\dataDim} p(\inputScalar^*_{j}|\dataScalar^*, \paramVector)p(\dataScalar^*|\pi)\prod_{i=1}^{\numData} \prod_{j=1}^{\dataDim} p(\inputScalar_{i,j}|\dataScalar_i, \paramVector)p(\dataScalar_i|\pi)
\]</span> The other required density is <span class="math display">\[
p(\dataVector, \inputMatrix, \inputVector^*|\paramVector)
\]</span> which can be found from <span class="math display">\[p(\dataScalar^*, \dataVector, \inputMatrix, \inputVector^*| \paramVector)\]</span> using the <em>sum rule</em> of probability, <span class="math display">\[
p(\dataVector, \inputMatrix, \inputVector^*|\paramVector) = \sum_{\dataScalar^*=0}^1 p(\dataScalar^*, \dataVector, \inputMatrix, \inputVector^*| \paramVector).
\]</span> Because of our independence assumptions that is simply equal to <span class="math display">\[
p(\dataVector, \inputMatrix, \inputVector^*| \paramVector) = \sum_{\dataScalar^*=0}^1 \prod_{j=1}^{\dataDim} p(\inputScalar^*_{j}|\dataScalar^*, \paramVector)p(\dataScalar^*|\pi)\prod_{i=1}^{\numData} \prod_{j=1}^{\dataDim} p(\inputScalar_{i,j}|\dataScalar_i, \paramVector)p(\dataScalar_i|\pi).
\]</span> Substituting both forms in to recover our distribution over the test label conditioned on the training data we have, <span class="math display">\[
P(\dataScalar^*| \dataVector, \inputMatrix, \inputVector^*, \paramVector) = \frac{\prod_{j=1}^{\dataDim} p(\inputScalar^*_{j}|\dataScalar^*, \paramVector)p(\dataScalar^*|\pi)\prod_{i=1}^{\numData} \prod_{j=1}^{\dataDim} p(\inputScalar_{i,j}|\dataScalar_i, \paramVector)p(\dataScalar_i|\pi)}{\sum_{\dataScalar^*=0}^1 \prod_{j=1}^{\dataDim} p(\inputScalar^*_{j}|\dataScalar^*, \paramVector)p(\dataScalar^*|\pi)\prod_{i=1}^{\numData} \prod_{j=1}^{\dataDim} p(\inputScalar_{i,j}|\dataScalar_i, \paramVector)p(\dataScalar_i|\pi)}
\]</span> and we notice that all the terms associated with the training data actually cancel, the test prediction is <em>conditionally independent</em> of the training data <em>given</em> the parameters. This is a result of our conditional independence assumptions over the data points. <span class="math display">\[
p(\dataScalar^*| \inputVector^*, \paramVector) = \frac{\prod_{j=1}^{\dataDim} p(\inputScalar^*_{j}|\dataScalar^*,
\paramVector)p(\dataScalar^*|\pi)}{\sum_{\dataScalar^*=0}^1 \prod_{j=1}^{\dataDim} p(\inputScalar^*_{j}|\dataScalar^*, \paramVector)p(\dataScalar^*|\pi)}
\]</span> This formula is also fairly straightforward to implement. First we implement the log probabilities for the Gaussian density.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="kw">def</span> log_gaussian(x, mu, sigma2):
    <span class="cf">return</span> <span class="op">-</span><span class="fl">0.5</span><span class="op">*</span>np.log(<span class="dv">2</span><span class="op">*</span>np.pi<span class="op">*</span>sigma2) <span class="op">-</span> ((x<span class="op">-</span>mu)<span class="op">**</span><span class="dv">2</span>)<span class="op">/</span>(<span class="dv">2</span><span class="op">*</span>sigma2)</code></pre></div>
<p>Now for any test point we compute the joint distribution of the Gaussian features by <em>summing</em> their log probabilities. Working in log space can be a considerable advantage over computing the probabilities directly: as the number of features we include goes up, because all the probabilities are less than 1, the joint probability will become smaller and smaller, and may be difficult to represent accurately (or even underflow). Working in log space can ameliorate this problem. We can also compute the log probability for the Bernoulli distribution.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="kw">def</span> log_bernoulli(x, theta):
    <span class="cf">return</span> x<span class="op">*</span>np.log(theta) <span class="op">+</span> (<span class="dv">1</span><span class="op">-</span>x)<span class="op">*</span>np.log(<span class="dv">1</span><span class="op">-</span>theta)</code></pre></div>
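<p>To see why working in log space matters, consider a sketch with illustrative numbers: the direct product of many per-feature probabilities underflows to zero, while the sum of the log probabilities stays perfectly representable.</p>

```python
import numpy as np

# 2000 features each with probability 0.5 is an illustrative extreme:
probs = np.full(2000, 0.5)
direct = np.prod(probs)            # 0.5**2000 ~ 1e-602: underflows to 0.0
log_joint = np.sum(np.log(probs))  # -2000*log(2): easily representable
```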
<h3 id="laplace-smoothing">Laplace Smoothing</h3>
<p>Before we proceed, let's just pause and think for a moment about what will happen if <code>theta</code> here is either zero or one. This will result in <span class="math inline">\(\log 0 = -\infty\)</span> and cause numerical problems. This definitely can happen in practice. If some of the features are rare or very common across the data set then the maximum likelihood solution could find values of zero or one respectively. Such values are problematic because they cause posterior probabilities of class membership of either one or zero. In practice we deal with this using <em>Laplace smoothing</em> (which actually has an interpretation as a Bayesian fit of the Bernoulli distribution). Laplace used the example of the sun rising each day, and a wish to predict the sun rise the following day, to describe his idea of smoothing, which can be found at the bottom of the following page from Laplace's 'Essai Philosophique ...'</p>
<p><a href="https://play.google.com/books/reader?id=1YQPAAAAQAAJ&pg=PA16"><img src="../slides/diagrams/books/1YQPAAAAQAAJ-PA16.png" /></a></p>
<p>Laplace suggests that when computing the probability of an event where a success or failure is rare (he uses the example of the sun rising across the last 5,000 years, or 1,826,213 days), even though only successes have been observed (in the sun rising case), the odds for tomorrow shouldn't be given as <span class="math display">\[
\frac{1,826,213}{1,826,213} = 1
\]</span> but rather by adding one to the numerator and two to the denominator, <span class="math display">\[
\frac{1,826,213 + 1}{1,826,213 + 2} = 0.99999945.
\]</span> This technique is sometimes called a 'pseudocount technique' because it has an interpretation of assuming some observations before you start: it's as if instead of observing <span class="math inline">\(\sum_{i}\dataScalar_i\)</span> successes you have an additional success, <span class="math inline">\(\sum_{i}\dataScalar_i + 1\)</span>, and instead of having observed <span class="math inline">\(\numData\)</span> events you've observed <span class="math inline">\(\numData + 2\)</span>. So we can think of Laplace's idea as saying (before we start) that we have 'two observations worth of belief, that the odds are 50/50', because before we start (i.e. when <span class="math inline">\(\numData=0\)</span>) our estimate is 0.5, yet because the effective <span class="math inline">\(\numData\)</span> is only 2, this estimate is quickly overwhelmed by data. Laplace used ideas like this a lot, and it is known as his 'principle of insufficient reason'. His idea was that in the absence of knowledge (i.e. before we start) we should assume that all possible outcomes are equally likely. This idea has a modern counterpart, known as the <a href="http://en.wikipedia.org/wiki/Principle_of_maximum_entropy">principle of maximum entropy</a>. A lot of the theory of this approach was developed by <a href="http://en.wikipedia.org/wiki/Edwin_Thompson_Jaynes">Ed Jaynes</a>, who according to his erstwhile collaborator and friend, John Skilling, learnt French as an undergraduate by reading the works of Laplace. Although John also related that Jaynes's spoken French was not up to the standard of his scientific French. For me Ed Jaynes's work very much carries on the tradition of Laplace into the modern era, in particular his focus on Bayesian approaches. I'm very proud to have met those that knew and worked with him. 
It turns out that Laplace's idea also has a Bayesian interpretation (as Laplace understood): it comes from assuming a particular prior density for the parameter <span class="math inline">\(\pi\)</span>, but we won't explore that interpretation for the moment, and merely choose to estimate the probability as, <span class="math display">\[
\pi = \frac{\sum_{i=1}^{\numData} \dataScalar_i + 1}{\numData + 2}
\]</span> to prevent problems with certainty causing numerical issues and misclassifications. Let's refit the Bernoulli features now.</p>
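<p>A toy illustration of the effect of the smoothing, with hypothetical counts:</p>

```python
# Hypothetical counts: a binary feature never observed 'on' among
# five negative-class training examples.
successes, n = 0, 5
theta_mle = successes / n                    # 0.0: gives log(0) = -inf downstream
theta_smoothed = (successes + 1) / (n + 2)   # 1/7: never exactly 0 or 1
```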
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="co"># fit the Bernoulli with Laplace smoothing.</span>
<span class="cf">for</span> column <span class="kw">in</span> X_train:
    <span class="cf">if</span> column <span class="kw">in</span> Bernoulli:
        Bernoulli.loc[<span class="st">'theta_0'</span>, column] <span class="op">=</span> (X_train[column][<span class="op">~</span>y_train].<span class="bu">sum</span>() <span class="op">+</span> <span class="dv">1</span>)<span class="op">/</span>((<span class="op">~</span>y_train).<span class="bu">sum</span>() <span class="op">+</span> <span class="dv">2</span>)
        Bernoulli.loc[<span class="st">'theta_1'</span>, column] <span class="op">=</span> (X_train[column][y_train].<span class="bu">sum</span>() <span class="op">+</span> <span class="dv">1</span>)<span class="op">/</span>((y_train).<span class="bu">sum</span>() <span class="op">+</span> <span class="dv">2</span>)</code></pre></div>
<p>That places us in a position to write the prediction function.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="kw">def</span> predict(X_test, Gaussian, Bernoulli, prior):
    log_positive <span class="op">=</span> pd.Series(data<span class="op">=</span>np.zeros(X_test.shape[<span class="dv">0</span>]), index<span class="op">=</span>X_test.index)
    log_negative <span class="op">=</span> pd.Series(data<span class="op">=</span>np.zeros(X_test.shape[<span class="dv">0</span>]), index<span class="op">=</span>X_test.index)
    <span class="cf">for</span> column <span class="kw">in</span> X_test.columns:
        <span class="cf">if</span> column <span class="kw">in</span> Gaussian:
            log_positive <span class="op">+=</span> log_gaussian(X_test[column], Gaussian[column][<span class="st">'mu_1'</span>], Gaussian[column][<span class="st">'sigma2_1'</span>])
            log_negative <span class="op">+=</span> log_gaussian(X_test[column], Gaussian[column][<span class="st">'mu_0'</span>], Gaussian[column][<span class="st">'sigma2_0'</span>])
        <span class="cf">elif</span> column <span class="kw">in</span> Bernoulli:
            log_positive <span class="op">+=</span> log_bernoulli(X_test[column], Bernoulli[column][<span class="st">'theta_1'</span>])
            log_negative <span class="op">+=</span> log_bernoulli(X_test[column], Bernoulli[column][<span class="st">'theta_0'</span>])
    <span class="cf">return</span> np.exp(log_positive <span class="op">+</span> np.log(prior))<span class="op">/</span>(np.exp(log_positive <span class="op">+</span> np.log(prior)) <span class="op">+</span> np.exp(log_negative <span class="op">+</span> np.log(<span class="dv">1</span><span class="op">-</span>prior)))</code></pre></div>
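<p>Note that the last line exponentiates the log probabilities directly, which can itself underflow when both class scores are very negative. A sketch of a more stable final step uses the log-sum-exp trick; the name <code>stable_posterior</code> and its arguments are assumptions mirroring the quantities in the function above, not part of the original code:</p>

```python
import numpy as np

def stable_posterior(log_positive, log_negative, prior):
    """Posterior of the positive class from per-class log likelihoods.

    Subtracts the elementwise maximum before exponentiating (log-sum-exp
    trick) so at least one exponent is zero and nothing underflows.
    """
    lp = log_positive + np.log(prior)
    ln = log_negative + np.log(1 - prior)
    m = np.maximum(lp, ln)
    return np.exp(lp - m) / (np.exp(lp - m) + np.exp(ln - m))
```

For example, with log scores around -1000 the naive ratio becomes 0/0, while this version still returns the correct posterior.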
<p>Now we are in a position to make the predictions for the test data.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">p_y <span class="op">=</span> predict(X_test, Gaussian, Bernoulli, prior)</code></pre></div>
<p>We can test the quality of the predictions in the following way. Firstly, we can threshold our probabilities at 0.5, allocating points with greater than 50% probability of membership of the positive class to the positive class. We can then compare to the true values, and see how many of these values we got correct. This is our total number correct.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">correct <span class="op">=</span> y_test <span class="op">&</span> p_y<span class="op">></span><span class="fl">0.5</span>
total_correct <span class="op">=</span> <span class="bu">sum</span>(correct)
<span class="bu">print</span>(<span class="st">"Total correct"</span>, total_correct, <span class="st">" out of "</span>, <span class="bu">len</span>(y_test), <span class="st">"which is"</span>, <span class="bu">float</span>(total_correct)<span class="op">/</span><span class="bu">len</span>(y_test), <span class="st">"%"</span>)</code></pre></div>
<p>We can also now plot the <a href="http://en.wikipedia.org/wiki/Confusion_matrix">confusion matrix</a>. A confusion matrix tells us where we are making mistakes. Along the diagonal it stores the <em>true positives</em>, the points that were positive class that we classified correctly, and the <em>true negatives</em>, the points that were negative class and that we classified correctly. The off diagonal terms contain the false positives and the false negatives. Along the rows of the matrix we place the actual class, and along the columns we place our predicted class.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">confusion_matrix <span class="op">=</span> pd.DataFrame(data<span class="op">=</span>np.zeros((<span class="dv">2</span>,<span class="dv">2</span>)),
columns<span class="op">=</span>[<span class="st">'predicted R-rated'</span>, <span class="st">'predicted not R-rated'</span>],
index <span class="op">=</span>[<span class="st">'actual R-rated'</span>, <span class="st">'actual not R-rated'</span>])
confusion_matrix[<span class="st">'predicted R-rated'</span>][<span class="st">'actual R-rated'</span>] <span class="op">=</span> (y_test <span class="op">&</span> p_y<span class="op">></span><span class="fl">0.5</span>).<span class="bu">sum</span>()
confusion_matrix[<span class="st">'predicted R-rated'</span>][<span class="st">'actual not R-rated'</span>] <span class="op">=</span> (<span class="op">~</span>y_test <span class="op">&</span> p_y<span class="op">></span><span class="fl">0.5</span>).<span class="bu">sum</span>()
confusion_matrix[<span class="st">'predicted not R-rated'</span>][<span class="st">'actual R-rated'</span>] <span class="op">=</span> (y_test <span class="op">&</span> <span class="op">~</span>(p_y<span class="op">></span><span class="fl">0.5</span>)).<span class="bu">sum</span>()
confusion_matrix[<span class="st">'predicted not R-rated'</span>][<span class="st">'actual not R-rated'</span>] <span class="op">=</span> (<span class="op">~</span>y_test <span class="op">&</span> <span class="op">~</span>(p_y<span class="op">></span><span class="fl">0.5</span>)).<span class="bu">sum</span>()
confusion_matrix</code></pre></div>
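<p>A tiny self-contained check of this layout, using hypothetical actual and predicted labels rather than the film data:</p>

```python
import numpy as np
import pandas as pd

# Hypothetical labels: 2 actual positives, 3 actual negatives.
actual = np.array([True, True, False, False, False])
pred = np.array([True, False, True, False, False])

cm = pd.DataFrame(0, index=['actual +', 'actual -'],
                  columns=['predicted +', 'predicted -'])
cm.loc['actual +', 'predicted +'] = int((actual & pred).sum())    # true positives
cm.loc['actual +', 'predicted -'] = int((actual & ~pred).sum())   # false negatives
cm.loc['actual -', 'predicted +'] = int((~actual & pred).sum())   # false positives
cm.loc['actual -', 'predicted -'] = int((~actual & ~pred).sum())  # true negatives
```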
<h3 id="exercise-1">Exercise</h3>
<p>How can you improve your classification? Are all the features equally valid? Are some features more helpful than others? What happens if you remove features that appear to be less helpful? How might you select such features?</p>
<h3 id="write-your-answer-to-exercise-here-1">Write your answer to Exercise here</h3>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="co"># Use this box for any code you need for the exercise</span>
</code></pre></div>
<h3 id="exercise-2">Exercise</h3>
<p>We have decided to classify a film as positive if its probability of an R rating is greater than 0.5. This has led us to accidentally classify some films as 'safe for children' when in actuality they aren't. Imagine you wish to ensure that a film is safe for children. With your test set, how low do you have to set the threshold to avoid all the false negatives (i.e. films where you said they weren't R-rated, but in actuality they were)?</p>
<h3 id="write-your-answer-to-exercise-here-2">Write your answer to Exercise here</h3>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="co"># Use this box for any code you need for the exercise</span>
</code></pre></div>
<h3 id="exercise-3">Exercise</h3>
<p>Write down the negative log likelihood of the Gaussian density over a vector of variables <span class="math inline">\(\inputVector\)</span>. Assume independence between each variable. Minimize this objective to obtain the maximum likelihood solution of the form. <span class="math display">\[
\mu = \frac{\sum_{i=1}^{\numData} \inputScalar_i}{\numData}
\]</span> <span class="math display">\[
\dataStd^2 = \frac{\sum_{i=1}^{\numData} (\inputScalar_i - \mu)^2}{\numData}
\]</span></p>
<h3 id="write-your-answer-to-exercise-here-3">Write your answer to Exercise here</h3>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="co"># Use this box for any code you need for the exercise</span>
</code></pre></div>
<p>If the input data was <em>binary</em> then we could also make use of the Bernoulli distribution for the features. For that case we would have the form, <span class="math display">\[
p(\inputScalar_{i, j} | \dataScalar_i,\paramVector) = \theta_{\dataScalar_i, j}^{\inputScalar_{i, j}}(1-\theta_{\dataScalar_i, j})^{(1-\inputScalar_{i,j})},
\]</span> where <span class="math inline">\(\theta_{1, j}\)</span> is the probability that the <span class="math inline">\(j\)</span>th feature is on if <span class="math inline">\(\dataScalar_i\)</span> is 1.</p>
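<p>This likelihood is straightforward to evaluate; a minimal sketch of the formula (the function name is illustrative):</p>

```python
def bernoulli_pmf(x, theta):
    # p(x | theta) = theta^x * (1 - theta)^(1 - x), for x in {0, 1}
    return theta**x * (1.0 - theta)**(1 - x)
```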
<p>In either case, maximum likelihood fitting would proceed in the same way. The objective has the form, <span class="math display">\[
\errorFunction(\paramVector) = -\sum_{j=1}^{\dataDim} \sum_{i=1}^{\numData} \log p(\inputScalar_{i,j} |\dataScalar_i, \paramVector),
\]</span> and if, as above, the parameters of the distributions are specific to each feature vector (we had means and variances for each continuous feature, and a probability for each binary feature) then we can use the fact that these parameters separate into disjoint subsets across the features to write, <span class="math display">\[
\begin{align*}
\errorFunction(\paramVector) &= -\sum_{j=1}^{\dataDim} \sum_{i=1}^{\numData} \log
p(\inputScalar_{i,j} |\dataScalar_i, \paramVector_j)\\
&= \sum_{j=1}^{\dataDim}
\errorFunction(\paramVector_j),
\end{align*}
\]</span> which means we can minimize our objective on each feature independently.</p>
<p>These characteristics mean that naive Bayes scales very well to large data sets. To fit the model we consider each feature in turn: we select the examples of the positive class and fit parameters for that class, then we select the examples of the negative class and fit parameters for that class, exactly as in the code above.</p>
<h3 id="naive-bayes-summary">Naive Bayes Summary</h3>
<p>Naive Bayes makes very simple assumptions about the data. In particular it models the full <em>joint</em> probability of the data set, <span class="math inline">\(p(\dataVector, \inputMatrix | \paramVector, \pi)\)</span>, by making very strong factorization assumptions that are unlikely to be true in practice. The data conditional independence assumption is common, and relies on a rich parameter vector to absorb all the information in the training data. The additional assumption of naive Bayes is that the features are conditionally independent given the class label <span class="math inline">\(\dataScalar_i\)</span> (and the parameter vector, <span class="math inline">\(\paramVector\)</span>). This is quite a strong assumption. However, it causes the objective function to decompose into parts which can be independently fitted to the different feature vectors, meaning it is very easy to fit the model to large data. It is also clear how we should handle <em>streaming</em> data and <em>missing</em> data. This means that the model can be run 'live', adapting parameters and information as they arrive. Such is the strength of modeling the joint probability density. Indeed, the model is even capable of dealing with new <em>features</em> that might arrive at run time. However, the factorization assumption that allows us to do this efficiently is very strong and may lead to poor decision boundaries in practice.</p>
<h3 id="other-reading">Other Reading</h3>
<ul>
<li>Chapter 5 of <span class="citation">Rogers and Girolami (2011)</span> up to pg 179 (Section 5.1, and 5.2 up to 5.2.2).</li>
</ul>
<h3 id="references" class="unnumbered">References</h3>
<div id="refs" class="references">
<div id="ref-Pearl:causality95">
<p>Pearl, J., 1995. From Bayesian networks to causal networks, in: Gammerman, A. (Ed.), Probabilistic Reasoning and Bayesian Belief Networks. Alfred Waller, pp. 1–31.</p>
</div>
<div id="ref-Rogers:book11">
<p>Rogers, S., Girolami, M., 2011. A first course in machine learning. CRC Press.</p>
</div>
<div id="ref-Steele:predictive12">
<p>Steele, S., Bilchik, A., Eberhardt, J., Kalina, P., Nissan, A., Johnson, E., Avital, I., Stojadinovic, A., 2012. Using machine-learned Bayesian belief networks to predict perioperative risk of clostridium difficile infection following colon surgery. Interact J Med Res 1, e6. <a href="https://doi.org/10.2196/ijmr.2131" class="uri">https://doi.org/10.2196/ijmr.2131</a></p>
</div>
</div>
Sat, 25 Aug 2018 00:00:00 +0000
http://inverseprobability.com/talks/notes/probabilistic-machine-learning.html
http://inverseprobability.com/talks/notes/probabilistic-machine-learning.htmlnotesBayesian Methods<div style="display:none">
$$\newcommand{\Amatrix}{\mathbf{A}}
\newcommand{\KL}[2]{\text{KL}\left( #1\,\|\,#2 \right)}
\newcommand{\Kaast}{\kernelMatrix_{\mathbf{ \ast}\mathbf{ \ast}}}
\newcommand{\Kastu}{\kernelMatrix_{\mathbf{ \ast} \inducingVector}}
\newcommand{\Kff}{\kernelMatrix_{\mappingFunctionVector \mappingFunctionVector}}
\newcommand{\Kfu}{\kernelMatrix_{\mappingFunctionVector \inducingVector}}
\newcommand{\Kuast}{\kernelMatrix_{\inducingVector \bf\ast}}
\newcommand{\Kuf}{\kernelMatrix_{\inducingVector \mappingFunctionVector}}
\newcommand{\Kuu}{\kernelMatrix_{\inducingVector \inducingVector}}
\newcommand{\Kuui}{\Kuu^{-1}}
\newcommand{\Qaast}{\mathbf{Q}_{\bf \ast \ast}}
\newcommand{\Qastf}{\mathbf{Q}_{\ast \mappingFunction}}
\newcommand{\Qfast}{\mathbf{Q}_{\mappingFunctionVector \bf \ast}}
\newcommand{\Qff}{\mathbf{Q}_{\mappingFunctionVector \mappingFunctionVector}}
\newcommand{\aMatrix}{\mathbf{A}}
\newcommand{\aScalar}{a}
\newcommand{\aVector}{\mathbf{a}}
\newcommand{\acceleration}{a}
\newcommand{\bMatrix}{\mathbf{B}}
\newcommand{\bScalar}{b}
\newcommand{\bVector}{\mathbf{b}}
\newcommand{\basisFunc}{\phi}
\newcommand{\basisFuncVector}{\boldsymbol{ \basisFunc}}
\newcommand{\basisFunction}{\phi}
\newcommand{\basisLocation}{\mu}
\newcommand{\basisMatrix}{\boldsymbol{ \Phi}}
\newcommand{\basisScalar}{\basisFunction}
\newcommand{\basisVector}{\boldsymbol{ \basisFunction}}
\newcommand{\activationFunction}{\phi}
\newcommand{\activationMatrix}{\boldsymbol{ \Phi}}
\newcommand{\activationScalar}{\basisFunction}
\newcommand{\activationVector}{\boldsymbol{ \basisFunction}}
\newcommand{\bigO}{\mathcal{O}}
\newcommand{\binomProb}{\pi}
\newcommand{\cMatrix}{\mathbf{C}}
\newcommand{\cbasisMatrix}{\hat{\boldsymbol{ \Phi}}}
\newcommand{\cdataMatrix}{\hat{\dataMatrix}}
\newcommand{\cdataScalar}{\hat{\dataScalar}}
\newcommand{\cdataVector}{\hat{\dataVector}}
\newcommand{\centeredKernelMatrix}{\mathbf{ \MakeUppercase{\centeredKernelScalar}}}
\newcommand{\centeredKernelScalar}{b}
\newcommand{\centeredKernelVector}{\centeredKernelScalar}
\newcommand{\centeringMatrix}{\mathbf{H}}
\newcommand{\chiSquaredDist}[2]{\chi_{#1}^{2}\left(#2\right)}
\newcommand{\chiSquaredSamp}[1]{\chi_{#1}^{2}}
\newcommand{\conditionalCovariance}{\boldsymbol{ \Sigma}}
\newcommand{\coregionalizationMatrix}{\mathbf{B}}
\newcommand{\coregionalizationScalar}{b}
\newcommand{\coregionalizationVector}{\mathbf{ \coregionalizationScalar}}
\newcommand{\covDist}[2]{\text{cov}_{#2}\left(#1\right)}
\newcommand{\covSamp}[1]{\text{cov}\left(#1\right)}
\newcommand{\covarianceScalar}{c}
\newcommand{\covarianceVector}{\mathbf{ \covarianceScalar}}
\newcommand{\covarianceMatrix}{\mathbf{C}}
\newcommand{\covarianceMatrixTwo}{\boldsymbol{ \Sigma}}
\newcommand{\croupierScalar}{s}
\newcommand{\croupierVector}{\mathbf{ \croupierScalar}}
\newcommand{\croupierMatrix}{\mathbf{ \MakeUppercase{\croupierScalar}}}
\newcommand{\dataDim}{p}
\newcommand{\dataIndex}{i}
\newcommand{\dataIndexTwo}{j}
\newcommand{\dataMatrix}{\mathbf{Y}}
\newcommand{\dataScalar}{y}
\newcommand{\dataSet}{\mathcal{D}}
\newcommand{\dataStd}{\sigma}
\newcommand{\dataVector}{\mathbf{ \dataScalar}}
\newcommand{\decayRate}{d}
\newcommand{\degreeMatrix}{\mathbf{ \MakeUppercase{\degreeScalar}}}
\newcommand{\degreeScalar}{d}
\newcommand{\degreeVector}{\mathbf{ \degreeScalar}}
% Already defined by latex
%\newcommand{\det}[1]{\left|#1\right|}
\newcommand{\diag}[1]{\text{diag}\left(#1\right)}
\newcommand{\diagonalMatrix}{\mathbf{D}}
\newcommand{\diff}[2]{\frac{\text{d}#1}{\text{d}#2}}
\newcommand{\diffTwo}[2]{\frac{\text{d}^2#1}{\text{d}#2^2}}
\newcommand{\displacement}{x}
\newcommand{\displacementVector}{\textbf{\displacement}}
\newcommand{\distanceMatrix}{\mathbf{ \MakeUppercase{\distanceScalar}}}
\newcommand{\distanceScalar}{d}
\newcommand{\distanceVector}{\mathbf{ \distanceScalar}}
\newcommand{\eigenvaltwo}{\ell}
\newcommand{\eigenvaltwoMatrix}{\mathbf{L}}
\newcommand{\eigenvaltwoVector}{\mathbf{l}}
\newcommand{\eigenvalue}{\lambda}
\newcommand{\eigenvalueMatrix}{\boldsymbol{ \Lambda}}
\newcommand{\eigenvalueVector}{\boldsymbol{ \lambda}}
\newcommand{\eigenvector}{\mathbf{ \eigenvectorScalar}}
\newcommand{\eigenvectorMatrix}{\mathbf{U}}
\newcommand{\eigenvectorScalar}{u}
\newcommand{\eigenvectwo}{\mathbf{v}}
\newcommand{\eigenvectwoMatrix}{\mathbf{V}}
\newcommand{\eigenvectwoScalar}{v}
\newcommand{\entropy}[1]{\mathcal{H}\left(#1\right)}
\newcommand{\errorFunction}{E}
\newcommand{\expDist}[2]{\left<#1\right>_{#2}}
\newcommand{\expSamp}[1]{\left<#1\right>}
\newcommand{\expectation}[1]{\left\langle #1 \right\rangle }
\newcommand{\expectationDist}[2]{\left\langle #1 \right\rangle _{#2}}
\newcommand{\expectedDistanceMatrix}{\mathcal{D}}
\newcommand{\eye}{\mathbf{I}}
\newcommand{\fantasyDim}{r}
\newcommand{\fantasyMatrix}{\mathbf{ \MakeUppercase{\fantasyScalar}}}
\newcommand{\fantasyScalar}{z}
\newcommand{\fantasyVector}{\mathbf{ \fantasyScalar}}
\newcommand{\featureStd}{\varsigma}
\newcommand{\gammaCdf}[3]{\mathcal{GAMMA CDF}\left(#1|#2,#3\right)}
\newcommand{\gammaDist}[3]{\mathcal{G}\left(#1|#2,#3\right)}
\newcommand{\gammaSamp}[2]{\mathcal{G}\left(#1,#2\right)}
\newcommand{\gaussianDist}[3]{\mathcal{N}\left(#1|#2,#3\right)}
\newcommand{\gaussianSamp}[2]{\mathcal{N}\left(#1,#2\right)}
\newcommand{\given}{|}
\newcommand{\half}{\frac{1}{2}}
\newcommand{\heaviside}{H}
\newcommand{\hiddenMatrix}{\mathbf{ \MakeUppercase{\hiddenScalar}}}
\newcommand{\hiddenScalar}{h}
\newcommand{\hiddenVector}{\mathbf{ \hiddenScalar}}
\newcommand{\identityMatrix}{\eye}
\newcommand{\inducingInputScalar}{z}
\newcommand{\inducingInputVector}{\mathbf{ \inducingInputScalar}}
\newcommand{\inducingInputMatrix}{\mathbf{Z}}
\newcommand{\inducingScalar}{u}
\newcommand{\inducingVector}{\mathbf{ \inducingScalar}}
\newcommand{\inducingMatrix}{\mathbf{U}}
\newcommand{\inlineDiff}[2]{\text{d}#1/\text{d}#2}
\newcommand{\inputDim}{q}
\newcommand{\inputMatrix}{\mathbf{X}}
\newcommand{\inputScalar}{x}
\newcommand{\inputSpace}{\mathcal{X}}
\newcommand{\inputVals}{\inputVector}
\newcommand{\inputVector}{\mathbf{ \inputScalar}}
\newcommand{\iterNum}{k}
\newcommand{\kernel}{\kernelScalar}
\newcommand{\kernelMatrix}{\mathbf{K}}
\newcommand{\kernelScalar}{k}
\newcommand{\kernelVector}{\mathbf{ \kernelScalar}}
\newcommand{\kff}{\kernelScalar_{\mappingFunction \mappingFunction}}
\newcommand{\kfu}{\kernelVector_{\mappingFunction \inducingScalar}}
\newcommand{\kuf}{\kernelVector_{\inducingScalar \mappingFunction}}
\newcommand{\kuu}{\kernelVector_{\inducingScalar \inducingScalar}}
\newcommand{\lagrangeMultiplier}{\lambda}
\newcommand{\lagrangeMultiplierMatrix}{\boldsymbol{ \Lambda}}
\newcommand{\lagrangian}{L}
\newcommand{\laplacianFactor}{\mathbf{ \MakeUppercase{\laplacianFactorScalar}}}
\newcommand{\laplacianFactorScalar}{m}
\newcommand{\laplacianFactorVector}{\mathbf{ \laplacianFactorScalar}}
\newcommand{\laplacianMatrix}{\mathbf{L}}
\newcommand{\laplacianScalar}{\ell}
\newcommand{\laplacianVector}{\mathbf{ \ell}}
\newcommand{\latentDim}{q}
\newcommand{\latentDistanceMatrix}{\boldsymbol{ \Delta}}
\newcommand{\latentDistanceScalar}{\delta}
\newcommand{\latentDistanceVector}{\boldsymbol{ \delta}}
\newcommand{\latentForce}{f}
\newcommand{\latentFunction}{u}
\newcommand{\latentFunctionVector}{\mathbf{ \latentFunction}}
\newcommand{\latentFunctionMatrix}{\mathbf{ \MakeUppercase{\latentFunction}}}
\newcommand{\latentIndex}{j}
\newcommand{\latentScalar}{z}
\newcommand{\latentVector}{\mathbf{ \latentScalar}}
\newcommand{\latentMatrix}{\mathbf{Z}}
\newcommand{\learnRate}{\eta}
\newcommand{\lengthScale}{\ell}
\newcommand{\rbfWidth}{\ell}
\newcommand{\likelihoodBound}{\mathcal{L}}
\newcommand{\likelihoodFunction}{L}
\newcommand{\locationScalar}{\mu}
\newcommand{\locationVector}{\boldsymbol{ \locationScalar}}
\newcommand{\locationMatrix}{\mathbf{M}}
\newcommand{\variance}[1]{\text{var}\left( #1 \right)}
\newcommand{\mappingFunction}{f}
\newcommand{\mappingFunctionMatrix}{\mathbf{F}}
\newcommand{\mappingFunctionTwo}{g}
\newcommand{\mappingFunctionTwoMatrix}{\mathbf{G}}
\newcommand{\mappingFunctionTwoVector}{\mathbf{ \mappingFunctionTwo}}
\newcommand{\mappingFunctionVector}{\mathbf{ \mappingFunction}}
\newcommand{\scaleScalar}{s}
\newcommand{\mappingScalar}{w}
\newcommand{\mappingVector}{\mathbf{ \mappingScalar}}
\newcommand{\mappingMatrix}{\mathbf{W}}
\newcommand{\mappingScalarTwo}{v}
\newcommand{\mappingVectorTwo}{\mathbf{ \mappingScalarTwo}}
\newcommand{\mappingMatrixTwo}{\mathbf{V}}
\newcommand{\maxIters}{K}
\newcommand{\meanMatrix}{\mathbf{M}}
\newcommand{\meanScalar}{\mu}
\newcommand{\meanTwoMatrix}{\mathbf{M}}
\newcommand{\meanTwoScalar}{m}
\newcommand{\meanTwoVector}{\mathbf{ \meanTwoScalar}}
\newcommand{\meanVector}{\boldsymbol{ \meanScalar}}
\newcommand{\mrnaConcentration}{m}
\newcommand{\naturalFrequency}{\omega}
\newcommand{\neighborhood}[1]{\mathcal{N}\left( #1 \right)}
\newcommand{\neilurl}{http://inverseprobability.com/}
\newcommand{\noiseMatrix}{\boldsymbol{ E}}
\newcommand{\noiseScalar}{\epsilon}
\newcommand{\noiseVector}{\boldsymbol{ \epsilon}}
\newcommand{\norm}[1]{\left\Vert #1 \right\Vert}
\newcommand{\normalizedLaplacianMatrix}{\hat{\mathbf{L}}}
\newcommand{\normalizedLaplacianScalar}{\hat{\ell}}
\newcommand{\normalizedLaplacianVector}{\hat{\mathbf{ \ell}}}
\newcommand{\numActive}{m}
\newcommand{\numBasisFunc}{m}
\newcommand{\numComponents}{m}
\newcommand{\numComps}{K}
\newcommand{\numData}{n}
\newcommand{\numFeatures}{K}
\newcommand{\numHidden}{h}
\newcommand{\numInducing}{m}
\newcommand{\numLayers}{\ell}
\newcommand{\numNeighbors}{K}
\newcommand{\numSequences}{s}
\newcommand{\numSuccess}{s}
\newcommand{\numTasks}{m}
\newcommand{\numTime}{T}
\newcommand{\numTrials}{S}
\newcommand{\outputIndex}{j}
\newcommand{\paramVector}{\boldsymbol{ \theta}}
\newcommand{\parameterMatrix}{\boldsymbol{ \Theta}}
\newcommand{\parameterScalar}{\theta}
\newcommand{\parameterVector}{\boldsymbol{ \parameterScalar}}
\newcommand{\partDiff}[2]{\frac{\partial#1}{\partial#2}}
\newcommand{\precisionScalar}{j}
\newcommand{\precisionVector}{\mathbf{ \precisionScalar}}
\newcommand{\precisionMatrix}{\mathbf{J}}
\newcommand{\pseudotargetScalar}{\widetilde{y}}
\newcommand{\pseudotargetVector}{\mathbf{ \pseudotargetScalar}}
\newcommand{\pseudotargetMatrix}{\mathbf{ \widetilde{Y}}}
\newcommand{\rank}[1]{\text{rank}\left(#1\right)}
\newcommand{\rayleighDist}[2]{\mathcal{R}\left(#1|#2\right)}
\newcommand{\rayleighSamp}[1]{\mathcal{R}\left(#1\right)}
\newcommand{\responsibility}{r}
\newcommand{\rotationScalar}{r}
\newcommand{\rotationVector}{\mathbf{ \rotationScalar}}
\newcommand{\rotationMatrix}{\mathbf{R}}
\newcommand{\sampleCovScalar}{s}
\newcommand{\sampleCovVector}{\mathbf{ \sampleCovScalar}}
\newcommand{\sampleCovMatrix}{\mathbf{s}}
\newcommand{\scalarProduct}[2]{\left\langle{#1},{#2}\right\rangle}
\newcommand{\sign}[1]{\text{sign}\left(#1\right)}
\newcommand{\sigmoid}[1]{\sigma\left(#1\right)}
\newcommand{\singularvalue}{\ell}
\newcommand{\singularvalueMatrix}{\mathbf{L}}
\newcommand{\singularvalueVector}{\mathbf{l}}
\newcommand{\sorth}{\mathbf{u}}
\newcommand{\spar}{\lambda}
\newcommand{\trace}[1]{\text{tr}\left(#1\right)}
\newcommand{\BasalRate}{B}
\newcommand{\DampingCoefficient}{C}
\newcommand{\DecayRate}{D}
\newcommand{\Displacement}{X}
\newcommand{\LatentForce}{F}
\newcommand{\Mass}{M}
\newcommand{\Sensitivity}{S}
\newcommand{\basalRate}{b}
\newcommand{\dampingCoefficient}{c}
\newcommand{\mass}{m}
\newcommand{\sensitivity}{s}
\newcommand{\springScalar}{\kappa}
\newcommand{\springVector}{\boldsymbol{ \kappa}}
\newcommand{\springMatrix}{\boldsymbol{ \mathcal{K}}}
\newcommand{\tfConcentration}{p}
\newcommand{\tfDecayRate}{\delta}
\newcommand{\tfMrnaConcentration}{f}
\newcommand{\tfVector}{\mathbf{ \tfConcentration}}
\newcommand{\velocity}{v}
\newcommand{\sufficientStatsScalar}{g}
\newcommand{\sufficientStatsVector}{\mathbf{ \sufficientStatsScalar}}
\newcommand{\sufficientStatsMatrix}{\mathbf{G}}
\newcommand{\switchScalar}{s}
\newcommand{\switchVector}{\mathbf{ \switchScalar}}
\newcommand{\switchMatrix}{\mathbf{S}}
\newcommand{\tr}[1]{\text{tr}\left(#1\right)}
\newcommand{\loneNorm}[1]{\left\Vert #1 \right\Vert_1}
\newcommand{\ltwoNorm}[1]{\left\Vert #1 \right\Vert_2}
\newcommand{\onenorm}[1]{\left\vert#1\right\vert_1}
\newcommand{\twonorm}[1]{\left\Vert #1 \right\Vert}
\newcommand{\vScalar}{v}
\newcommand{\vVector}{\mathbf{v}}
\newcommand{\vMatrix}{\mathbf{V}}
\newcommand{\varianceDist}[2]{\text{var}_{#2}\left( #1 \right)}
% Already defined by latex
%\newcommand{\vec}{#1:}
\newcommand{\vecb}[1]{\left(#1\right):}
\newcommand{\weightScalar}{w}
\newcommand{\weightVector}{\mathbf{ \weightScalar}}
\newcommand{\weightMatrix}{\mathbf{W}}
\newcommand{\weightedAdjacencyMatrix}{\mathbf{A}}
\newcommand{\weightedAdjacencyScalar}{a}
\newcommand{\weightedAdjacencyVector}{\mathbf{ \weightedAdjacencyScalar}}
\newcommand{\onesVector}{\mathbf{1}}
\newcommand{\zerosVector}{\mathbf{0}}
$$
</div>
<h2 id="what-is-machine-learning">What is Machine Learning?</h2>
<p>What is machine learning? At its most basic level machine learning is a combination of</p>
<p><span class="math display">\[ \text{data} + \text{model} \xrightarrow{\text{compute}} \text{prediction}\]</span></p>
<p>where <em>data</em> is our observations. They can be actively or passively acquired (meta-data). The <em>model</em> contains our assumptions, based on previous experience. That experience can be other data, it can come from transfer learning, or it can merely be our beliefs about the regularities of the universe. In humans our models include our inductive biases. The <em>prediction</em> is an action to be taken or a categorization or a quality score. Machine learning has become a mainstay of artificial intelligence because of the importance of predictions in artificial intelligence. The data and the model are combined through computation.</p>
<p>In practice we normally perform machine learning using two functions. To combine data with a model we typically make use of:</p>
<p><strong>a prediction function</strong> a function which is used to make the predictions. It includes our beliefs about the regularities of the universe, our assumptions about how the world works, e.g. smoothness, spatial similarities, temporal similarities.</p>
<p><strong>an objective function</strong> a function which defines the cost of misprediction. Typically it includes knowledge about the world's generating processes (probabilistic objectives) or the costs we pay for mispredictions (empirical risk minimization).</p>
<p>The combination of data and model through the prediction function and the objective function leads to a <em>learning algorithm</em>. The class of prediction functions and objective functions we can make use of is restricted by the algorithms they lead to. If the prediction function or the objective function are too complex, then it can be difficult to find an appropriate learning algorithm. Much of the academic field of machine learning is the quest for new learning algorithms that allow us to bring different types of models and data together.</p>
<p>A useful reference for state of the art in machine learning is the UK Royal Society Report, <a href="https://royalsociety.org/~/media/policy/projects/machine-learning/publications/machine-learning-report.pdf">Machine Learning: Power and Promise of Computers that Learn by Example</a>.</p>
<p>You can also check my blog post on <a href="http://inverseprobability.com/2017/07/17/what-is-machine-learning">"What is Machine Learning?"</a></p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">import</span> numpy <span class="im">as</span> np
<span class="im">import</span> matplotlib.pyplot <span class="im">as</span> plt
<span class="im">import</span> pods
<span class="im">import</span> teaching_plots <span class="im">as</span> plot
<span class="im">import</span> mlai</code></pre></div>
<h3 id="olympic-marathon-data">Olympic Marathon Data</h3>
<p>The first thing we will do is load a standard data set for regression modelling. The data consists of the pace of Olympic Gold Medal Marathon winners for the Olympics from 1896 to present. First we load in the data and plot it.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">data <span class="op">=</span> pods.datasets.olympic_marathon_men()
x <span class="op">=</span> data[<span class="st">'X'</span>]
y <span class="op">=</span> data[<span class="st">'Y'</span>]
offset <span class="op">=</span> y.mean()
scale <span class="op">=</span> np.sqrt(y.var())
xlim <span class="op">=</span> (<span class="dv">1875</span>,<span class="dv">2030</span>)
ylim <span class="op">=</span> (<span class="fl">2.5</span>, <span class="fl">6.5</span>)
yhat <span class="op">=</span> (y<span class="op">-</span>offset)<span class="op">/</span>scale
fig, ax <span class="op">=</span> plt.subplots(figsize<span class="op">=</span>plot.big_wide_figsize)
_ <span class="op">=</span> ax.plot(x, y, <span class="st">'r.'</span>,markersize<span class="op">=</span><span class="dv">10</span>)
ax.set_xlabel(<span class="st">'year'</span>, fontsize<span class="op">=</span><span class="dv">20</span>)
ax.set_ylabel(<span class="st">'pace min/km'</span>, fontsize<span class="op">=</span><span class="dv">20</span>)
ax.set_xlim(xlim)
ax.set_ylim(ylim)
mlai.write_figure(figure<span class="op">=</span>fig, filename<span class="op">=</span><span class="st">'../slides/diagrams/datasets/olympic-marathon.svg'</span>, transparent<span class="op">=</span><span class="va">True</span>, frameon<span class="op">=</span><span class="va">True</span>)</code></pre></div>
<h3 id="olympic-marathon-data-1">Olympic Marathon Data</h3>
<table>
<tr>
<td width="70%">
<ul>
<li><p>Gold medal times for Olympic Marathon since 1896.</p></li>
<li><p>Marathons before 1924 didn’t have a standardised distance.</p></li>
<li><p>Present results using pace per km.</p></li>
<li>In 1904 Marathon was badly organised leading to very slow times.</li>
</ul>
</td>
<td width="30%">
<img src="../slides/diagrams/Stephen_Kiprotich.jpg" alt="image" /> <small>Image from Wikimedia Commons <a href="http://bit.ly/16kMKHQ" class="uri">http://bit.ly/16kMKHQ</a></small>
</td>
</tr>
</table>
<object class="svgplot" align data="../slides/diagrams/datasets/olympic-marathon.svg">
</object>
<p>Things to notice about the data include the outlier in 1904: in that year the Olympics was held in St Louis, USA. Organizational problems and challenges with dust kicked up by the cars following the race meant that participants got lost, and only a very few completed the course.</p>
<p>More recent years see more consistently quick marathons.</p>
<h3 id="regression-linear-releationship">Regression: Linear Relationship</h3>
<p>For many their first encounter with what might be termed a machine learning method is fitting a straight line. A straight line is characterized by two parameters, the scale, <span class="math inline">\(m\)</span>, and the offset <span class="math inline">\(c\)</span>.</p>
<p><span class="math display">\[\dataScalar_i = m \inputScalar_i + c\]</span></p>
<p>For the Olympic marathon example <span class="math inline">\(\dataScalar_i\)</span> is the winning pace, and it is given as a function of the year, which is represented by <span class="math inline">\(\inputScalar_i\)</span>. For the Olympics example we can interpret the two parameters of the prediction function: the scale <span class="math inline">\(m\)</span> is the rate of improvement of the Olympic marathon pace on a yearly basis, and <span class="math inline">\(c\)</span> is the winning pace as estimated at year 0.</p>
<h2 id="overdetermined-system">Overdetermined System</h2>
<p>The challenge with a linear model is that it has two unknowns, <span class="math inline">\(m\)</span>, and <span class="math inline">\(c\)</span>. Observing data allows us to write down a system of simultaneous linear equations. So, for example if we observe two data points, the first with the input value, <span class="math inline">\(\inputScalar_1 = 1\)</span> and the output value, <span class="math inline">\(\dataScalar_1 =3\)</span> and a second data point, <span class="math inline">\(\inputScalar_2 = 3\)</span>, <span class="math inline">\(\dataScalar_2=1\)</span>, then we can write two simultaneous linear equations of the form.</p>
<p>point 1: <span class="math inline">\(\inputScalar = 1\)</span>, <span class="math inline">\(\dataScalar=3\)</span> <span class="math display">\[3 = m + c\]</span> point 2: <span class="math inline">\(\inputScalar = 3\)</span>, <span class="math inline">\(\dataScalar=1\)</span> <span class="math display">\[1 = 3m + c\]</span></p>
<p>The solution to these two simultaneous equations can be represented graphically as</p>
<object class="svgplot" align data="../slides/diagrams/ml/over_determined_system003.svg">
</object>
<center>
<em>The solution of two linear equations represented as the fit of a straight line through two data points</em>
</center>
<p>The challenge comes when a third data point is observed and it doesn't naturally fit on the straight line.</p>
<p>point 3: <span class="math inline">\(\inputScalar = 2\)</span>, <span class="math inline">\(\dataScalar=2.5\)</span> <span class="math display">\[2.5 = 2m + c\]</span></p>
<object class="svgplot" align data="../slides/diagrams/ml/over_determined_system004.svg">
</object>
<center>
<em>A third observation of data is inconsistent with the solution dictated by the first two observations </em>
</center>
<p>Now there are three candidate lines, each consistent with a different pair of the three observations.</p>
<object class="svgplot" align data="../slides/diagrams/ml/over_determined_system007.svg">
</object>
<center>
<em>Three solutions to the problem, each consistent with two points of the three observations </em>
</center>
<p>This is known as an <em>overdetermined</em> system because there are more data than we need to determine our parameters. The problem arises because the model is a simplification of the real world, and the data we observe is therefore inconsistent with our model.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">import</span> teaching_plots <span class="im">as</span> plot</code></pre></div>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">plot.over_determined_system(diagrams<span class="op">=</span><span class="st">'../slides/diagrams/ml'</span>)</code></pre></div>
<p>The solution was proposed by Pierre-Simon Laplace. His idea was to accept that the model was an incomplete representation of the real world, and the manner in which it was incomplete is <em>unknown</em>. His idea was that such unknowns could be dealt with through probability.</p>
<p><img class="" src="../slides/diagrams/ml/Pierre-Simon_Laplace.png" width="30%" align="" style="background:none; border:none; box-shadow:none;"></p>
<p>Famously, Laplace considered the idea of a deterministic Universe, one in which the model is <em>known</em>, or as the below translation refers to it, "an intelligence which could comprehend all the forces by which nature is animated". He speculates on an "intelligence" that can submit this vast data to analysis and proposes that such an entity would be able to predict the future.</p>
<blockquote>
<p>Given for one instant an intelligence which could comprehend all the forces by which nature is animated and the respective situation of the beings who compose it---an intelligence sufficiently vast to submit these data to analysis---it would embrace in the same formula the movements of the greatest bodies of the universe and those of the lightest atom; for it, nothing would be uncertain and the future, as the past, would be present in its eyes.</p>
</blockquote>
<p>This notion is known as <em>Laplace's demon</em> or <em>Laplace's superman</em>.</p>
<p>Unfortunately, most analyses of his ideas stop at that point, whereas his real point is that such a notion is unreachable. Not so much <em>superman</em> as <em>strawman</em>. Just three pages later in the "Philosophical Essay on Probabilities" <span class="citation">(Laplace, 1814)</span>, Laplace goes on to observe:</p>
<blockquote>
<p>The curve described by a simple molecule of air or vapor is regulated in a manner just as certain as the planetary orbits; the only difference between them is that which comes from our ignorance.</p>
<p>Probability is relative, in part to this ignorance, in part to our knowledge.</p>
</blockquote>
<p>In other words, we can never make use of the idealistic deterministic Universe because of our ignorance about the world. Laplace's suggestion, and the focus of this essay, is that we turn to probability to deal with this uncertainty. This is also our inspiration for using probability in machine learning.</p>
<p>The "forces by which nature is animated" is our <em>model</em>, the "situation of beings that compose it" is our <em>data</em> and the "intelligence sufficiently vast enough to submit these data to analysis" is our compute. The fly in the ointment is our <em>ignorance</em> about these aspects. And <em>probability</em> is the tool we use to incorporate this ignorance leading to uncertainty or <em>doubt</em> in our predictions.</p>
<p>Laplace's concept was that the reason that the data doesn't match up to the model is because of unconsidered factors, and that these might be well represented through probability densities. He tackles the challenge of the unknown factors by adding a variable, <span class="math inline">\(\noiseScalar\)</span>, that represents the unknown. In modern parlance we would call this a <em>latent</em> variable. But in the context Laplace uses it, the variable is so common that it has other names such as a "slack" variable or the <em>noise</em> in the system.</p>
<p>point 1: <span class="math inline">\(\inputScalar = 1\)</span>, <span class="math inline">\(\dataScalar=3\)</span> <span class="math display">\[
3 = m + c + \noiseScalar_1
\]</span> point 2: <span class="math inline">\(\inputScalar = 3\)</span>, <span class="math inline">\(\dataScalar=1\)</span> <span class="math display">\[
1 = 3m + c + \noiseScalar_2
\]</span> point 3: <span class="math inline">\(\inputScalar = 2\)</span>, <span class="math inline">\(\dataScalar=2.5\)</span> <span class="math display">\[
2.5 = 2m + c + \noiseScalar_3
\]</span></p>
<p>Laplace's trick has converted the <em>overdetermined</em> system into an <em>underdetermined</em> system. He has now added three variables, <span class="math inline">\(\{\noiseScalar_i\}_{i=1}^3\)</span>, which represent the unknown corruptions of the real world. Laplace's idea is that we should represent that unknown corruption with a <em>probability distribution</em>.</p>
<h3 id="a-probabilistic-process">A Probabilistic Process</h3>
<p>However, it was left to an admirer of Laplace to develop a practical probability density for that purpose. It was Carl Friedrich Gauss who suggested that the <em>Gaussian</em> density (which at the time was unnamed!) should be used to represent this error.</p>
<p>The result is a <em>noisy</em> function, a function which has a deterministic part, and a stochastic part. This type of function is sometimes known as a probabilistic or stochastic process, to distinguish it from a deterministic process.</p>
<h3 id="the-gaussian-density">The Gaussian Density</h3>
<p>The Gaussian density is perhaps the most commonly used probability density. It is defined by a <em>mean</em>, <span class="math inline">\(\meanScalar\)</span>, and a <em>variance</em>, <span class="math inline">\(\dataStd^2\)</span>. The variance is taken to be the square of the <em>standard deviation</em>, <span class="math inline">\(\dataStd\)</span>.</p>
<p><span class="math display">\[\begin{align}
p(\dataScalar| \meanScalar, \dataStd^2) & = \frac{1}{\sqrt{2\pi\dataStd^2}}\exp\left(-\frac{(\dataScalar - \meanScalar)^2}{2\dataStd^2}\right)\\& \buildrel\triangle\over = \gaussianDist{\dataScalar}{\meanScalar}{\dataStd^2}
\end{align}\]</span></p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">import</span> teaching_plots <span class="im">as</span> plot</code></pre></div>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">plot.gaussian_of_height(diagrams<span class="op">=</span><span class="st">'../../slides/diagrams/ml'</span>)</code></pre></div>
<object class="svgplot" align data="../slides/diagrams/ml/gaussian_of_height.svg">
</object>
<center>
<em>The Gaussian PDF with <span class="math inline">\({\meanScalar}=1.7\)</span> and variance <span class="math inline">\({\dataStd}^2=0.0225\)</span>. Mean shown as cyan line. It could represent the heights of a population of students. </em>
</center>
<h3 id="two-important-gaussian-properties">Two Important Gaussian Properties</h3>
<p>The Gaussian density has many important properties, but for the moment we'll review two of them.</p>
<h3 id="sum-of-gaussians">Sum of Gaussians</h3>
<p>If we assume that a variable, <span class="math inline">\(\dataScalar_i\)</span>, is sampled from a Gaussian density,</p>
<p><span class="math display">\[\dataScalar_i \sim \gaussianSamp{\meanScalar_i}{\sigma_i^2}\]</span></p>
<p>Then we can show that the sum of a set of variables, each drawn independently from such a density is also distributed as Gaussian. The mean of the resulting density is the sum of the means, and the variance is the sum of the variances,</p>
<p><span class="math display">\[\sum_{i=1}^{\numData} \dataScalar_i \sim \gaussianSamp{\sum_{i=1}^\numData \meanScalar_i}{\sum_{i=1}^\numData \sigma_i^2}\]</span></p>
<p>Since we are very familiar with the Gaussian density and its properties, it is not immediately apparent how unusual this is. Most random variables change the family of density they are drawn from when you add them together; the Gaussian is exceptional in this regard. Indeed, other random variables, if they are independently drawn and summed together, tend to a Gaussian density. That is the <a href="https://en.wikipedia.org/wiki/Central_limit_theorem"><em>central limit theorem</em></a>, which is a major justification for the use of a Gaussian density.</p>
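<p>The summing property is easy to verify empirically. The following sketch (not from the original notes; the means and variances are chosen arbitrarily) draws independent samples from three different Gaussians and checks that the empirical mean and variance of their sum match the sums of the individual means and variances:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
means = np.array([1.0, -2.0, 0.5])
variances = np.array([0.5, 1.5, 2.0])

# Draw independent samples from each Gaussian and sum across the three.
samples = rng.normal(means, np.sqrt(variances), size=(100000, 3)).sum(axis=1)

print(samples.mean())  # close to the sum of the means: -0.5
print(samples.var())   # close to the sum of the variances: 4.0
```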
<h3 id="scaling-a-gaussian">Scaling a Gaussian</h3>
<p>Less unusual is the <em>scaling</em> property of a Gaussian density. If a variable, <span class="math inline">\(\dataScalar\)</span>, is sampled from a Gaussian density,</p>
<p><span class="math display">\[\dataScalar \sim \gaussianSamp{\meanScalar}{\sigma^2}\]</span> and we choose to scale that variable by a <em>deterministic</em> value, <span class="math inline">\(\mappingScalar\)</span>, then the <em>scaled variable</em> is distributed as</p>
<p><span class="math display">\[\mappingScalar \dataScalar \sim \gaussianSamp{\mappingScalar\meanScalar}{\mappingScalar^2 \sigma^2}.\]</span> Unlike the summing property, where adding two or more random variables independently sampled from a family of densities typically takes the summed variable <em>outside</em> that family, for many densities scaling leaves the distribution of the variable within the same <em>family</em>. Indeed, many densities include a <em>scale</em> parameter (e.g. the <a href="https://en.wikipedia.org/wiki/Gamma_distribution">Gamma density</a>) precisely for this purpose. In the Gaussian the standard deviation, <span class="math inline">\(\dataStd\)</span>, is the scale parameter. To see why this makes sense, let's consider <span class="math display">\[z \sim \gaussianSamp{0}{1},\]</span> then if we scale by <span class="math inline">\(\dataStd\)</span> so that <span class="math inline">\(\dataScalar=\dataStd z\)</span>, we can write <span class="math display">\[\dataScalar =\dataStd z \sim \gaussianSamp{0}{\dataStd^2}.\]</span></p>
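<p>The scaling property can be checked in the same way (again a sketch, not from the original notes): samples from a standard Gaussian, multiplied by a deterministic scale, have variance equal to the square of that scale.</p>

```python
import numpy as np

rng = np.random.default_rng(1)
z = rng.normal(0.0, 1.0, size=100000)  # z ~ N(0, 1)
sigma = 2.5
y = sigma * z  # scaled variable, distributed as N(0, sigma**2)
print(y.var())  # close to sigma**2 = 6.25
```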
<h2 id="laplaces-idea">Laplace's Idea</h2>
<h3 id="a-probabilistic-process-1">A Probabilistic Process</h3>
<p>Laplace had the idea to augment the observations by noise, that is equivalent to considering a probability density whose mean is given by the <em>prediction function</em> <span class="math display">\[p\left(\dataScalar_i|\inputScalar_i\right)=\frac{1}{\sqrt{2\pi\dataStd^2}}\exp\left(-\frac{\left(\dataScalar_i-f\left(\inputScalar_i\right)\right)^{2}}{2\dataStd^2}\right).\]</span></p>
<p>This is known as a <em>stochastic process</em>: a function that is corrupted by noise. Laplace didn't suggest the Gaussian density for that purpose; that was an innovation from Carl Friedrich Gauss, which is what gives the Gaussian density its name.</p>
<h3 id="height-as-a-function-of-weight">Height as a Function of Weight</h3>
<p>We start with the standard Gaussian, parameterized by its mean and variance, and make the mean a linear function of an <em>input</em>.</p>
<p>This leads to a regression model. <span class="math display">\[
\begin{align*}
\dataScalar_i=&\mappingFunction\left(\inputScalar_i\right)+\noiseScalar_i,\\
\noiseScalar_i \sim & \gaussianSamp{0}{\dataStd^2}.
\end{align*}
\]</span></p>
<p>Assume <span class="math inline">\(\dataScalar_i\)</span> is height and <span class="math inline">\(\inputScalar_i\)</span> is weight.</p>
<p>Likelihood of an individual data point <span class="math display">\[
p\left(\dataScalar_i|\inputScalar_i,m,c\right)=\frac{1}{\sqrt{2\pi \dataStd^2}}\exp\left(-\frac{\left(\dataScalar_i-m\inputScalar_i-c\right)^{2}}{2\dataStd^2}\right).
\]</span> Parameters are gradient, <span class="math inline">\(m\)</span>, offset, <span class="math inline">\(c\)</span> of the function and noise variance <span class="math inline">\(\dataStd^2\)</span>.</p>
<h3 id="data-set-likelihood">Data Set Likelihood</h3>
<p>If the noise, <span class="math inline">\(\epsilon_i\)</span>, is sampled independently for each data point, then each data point is independent (given <span class="math inline">\(m\)</span> and <span class="math inline">\(c\)</span>). For <em>independent</em> variables: <span class="math display">\[
p(\dataVector) = \prod_{i=1}^\numData p(\dataScalar_i)
\]</span> <span class="math display">\[
p(\dataVector|\inputVector, m, c) = \prod_{i=1}^\numData p(\dataScalar_i|\inputScalar_i, m, c)
\]</span></p>
<h3 id="for-gaussian">For Gaussian</h3>
<p>Under the i.i.d. assumption <span class="math display">\[
p(\dataVector|\inputVector, m, c) = \prod_{i=1}^\numData \frac{1}{\sqrt{2\pi \dataStd^2}}\exp \left(-\frac{\left(\dataScalar_i- m\inputScalar_i-c\right)^{2}}{2\dataStd^2}\right).
\]</span> <span class="math display">\[
p(\dataVector|\inputVector, m, c) = \frac{1}{\left(2\pi \dataStd^2\right)^{\frac{\numData}{2}}}\exp\left(-\frac{\sum_{i=1}^\numData\left(\dataScalar_i-m\inputScalar_i-c\right)^{2}}{2\dataStd^2}\right).
\]</span></p>
<h3 id="log-likelihood-function">Log Likelihood Function</h3>
<ul>
<li>Normally work with the log likelihood: <span class="math display">\[
L(m,c,\dataStd^{2})=-\frac{\numData}{2}\log 2\pi -\frac{\numData}{2}\log \dataStd^2 -\sum_{i=1}^{\numData}\frac{\left(\dataScalar_i-m\inputScalar_i-c\right)^{2}}{2\dataStd^2}.
\]</span></li>
</ul>
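<p>As an illustrative sketch (the function name and parameter values are our own, not from the original notes), the log likelihood above can be computed directly for the three data points of the overdetermined system:</p>

```python
import numpy as np

def log_likelihood(m, c, sigma2, x, y):
    """Log likelihood of a straight-line model with Gaussian noise."""
    n = len(y)
    residuals = y - m*x - c
    return (-0.5*n*np.log(2*np.pi)
            - 0.5*n*np.log(sigma2)
            - np.sum(residuals**2)/(2*sigma2))

# The three observations from the overdetermined system above.
x = np.array([1., 3., 2.])
y = np.array([3., 1., 2.5])
print(log_likelihood(-1.0, 4.0, 0.5, x, y))
```

<p>Higher values of the log likelihood correspond to parameter settings that explain the data better.</p>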
<h3 id="consistency-of-maximum-likelihood">Consistency of Maximum Likelihood</h3>
<ul>
<li>If the data were really generated according to the probability density we specified.</li>
<li>Correct parameters will be recovered in limit as <span class="math inline">\(\numData \rightarrow \infty\)</span>.</li>
<li>This can be proven through sample based approximations (law of large numbers) of "KL divergences".</li>
<li>Mainstay of classical statistics.</li>
</ul>
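<p>Consistency can be illustrated with a small simulation (a sketch with parameter values of our own choosing): generate data from a known line with Gaussian noise and watch the maximum likelihood estimates approach the true values as <span class="math inline">\(\numData\)</span> grows.</p>

```python
import numpy as np

rng = np.random.default_rng(2)
m_true, c_true, sigma = 2.0, -1.0, 0.5

for n in [10, 100, 10000]:
    x = rng.uniform(0.0, 1.0, n)
    y = m_true*x + c_true + rng.normal(0.0, sigma, n)
    # With Gaussian noise, maximum likelihood for m and c coincides
    # with a least squares fit.
    m_hat, c_hat = np.polyfit(x, y, 1)
    print(n, m_hat, c_hat)
```

<p>As <span class="math inline">\(\numData\)</span> increases, the estimates settle ever closer to the generating parameters.</p>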
<h3 id="probabilistic-interpretation-of-the-error-function">Probabilistic Interpretation of the Error Function</h3>
<ul>
<li>The probabilistic interpretation of the error function is as the negative log likelihood.</li>
<li><em>Minimizing</em> error function is equivalent to <em>maximizing</em> log likelihood.</li>
<li>Maximizing <em>log likelihood</em> is equivalent to maximizing the <em>likelihood</em> because <span class="math inline">\(\log\)</span> is monotonic.</li>
<li>Probabilistic interpretation: Minimizing error function is equivalent to maximum likelihood with respect to parameters.</li>
</ul>
<h3 id="error-function">Error Function</h3>
<ul>
<li>Taking the negative log likelihood as the error function leads to <span class="math display">\[\errorFunction(m,c,\dataStd^{2})=\frac{\numData}{2}\log \dataStd^2+\frac{1}{2\dataStd^2}\sum _{i=1}^{\numData}\left(\dataScalar_i-m\inputScalar_i-c\right)^{2}.\]</span></li>
<li>Learning proceeds by minimizing this error function for the data set provided.</li>
</ul>
<h3 id="connection-sum-of-squares-error">Connection: Sum of Squares Error</h3>
<ul>
<li>Ignoring terms which don’t depend on <span class="math inline">\(m\)</span> and <span class="math inline">\(c\)</span> gives <span class="math display">\[\errorFunction(m, c) \propto \sum_{i=1}^\numData (\dataScalar_i - \mappingFunction(\inputScalar_i))^2\]</span> where <span class="math inline">\(\mappingFunction(\inputScalar_i) = m\inputScalar_i + c\)</span>.</li>
<li>This is known as the <em>sum of squares</em> error function.</li>
<li>Commonly used and is closely associated with the Gaussian likelihood.</li>
</ul>
<h3 id="reminder">Reminder</h3>
<ul>
<li>Two functions involved:
<ul>
<li><em>Prediction function</em>: <span class="math inline">\(\mappingFunction(\inputScalar_i)\)</span></li>
<li>Error, or <em>Objective function</em>: <span class="math inline">\(\errorFunction(m, c)\)</span></li>
</ul></li>
<li>Error function depends on parameters through prediction function.</li>
</ul>
<h3 id="mathematical-interpretation">Mathematical Interpretation</h3>
<ul>
<li>What is the mathematical interpretation?</li>
<li>There is a cost function.
<ul>
<li>It expresses mismatch between your prediction and reality. <span class="math display">\[
\errorFunction(m, c)=\sum_{i=1}^\numData \left(\dataScalar_i - m\inputScalar_i-c\right)^2
\]</span></li>
<li>This is known as the sum of squares error.</li>
</ul></li>
</ul>
<h2 id="sum-of-squares-error">Sum of Squares Error</h2>
<p>Minimizing the sum of squares error was first proposed by <a href="http://en.wikipedia.org/wiki/Adrien-Marie_Legendre">Legendre</a> in 1805. His book, which was on the orbits of comets, is available on Google Books, and we can take a look at the relevant page below.</p>
<p><a href="https://play.google.com/books/reader?id=spcAAAAAMAAJ&pg=PA72"><img src="../slides/diagrams/books/spcAAAAAMAAJ-72.png" /></a></p>
<p>Of course, the main text is in French, but the key part we are interested in can be roughly translated as</p>
<blockquote>
<p>In most investigations in which the object is to deduce the most accurate possible results from observational measurements, we are almost always led to a system of equations of the form <span class="math display">\[E = a + bx + cy + fz + etc .\]</span> in which <span class="math inline">\(a\)</span>, <span class="math inline">\(b\)</span>, <span class="math inline">\(c\)</span>, <span class="math inline">\(f\)</span> etc are the known coefficients and <span class="math inline">\(x\)</span>, <span class="math inline">\(y\)</span>, <span class="math inline">\(z\)</span> etc are unknowns which must be determined by the condition that each value of <span class="math inline">\(E\)</span> is reduced either to zero or to a very small quantity.</p>
</blockquote>
<p>He continues</p>
<blockquote>
Of all the principles that can be proposed for this purpose, I think there is none more general, more exact, or easier to apply than the one we have used in the preceding research: making the sum of the squares of the errors a minimum. By this means a kind of equilibrium is established among the errors which, by preventing the extremes from dominating, is well suited to revealing the state of the system that most nearly approaches the truth. The sum of the squares of the errors <span class="math inline">\(E^2 + \left.E^\prime\right.^2 + \left.E^{\prime\prime}\right.^2 + etc\)</span> being
<span class="math display">\[\begin{align*} &(a + bx + cy + fz + etc)^2 \\
+ &(a^\prime +
b^\prime x + c^\prime y + f^\prime z + etc ) ^2\\
+ &(a^{\prime\prime} +
b^{\prime\prime}x + c^{\prime\prime}y + f^{\prime\prime}z + etc )^2 \\
+ & etc
\end{align*}\]</span>
<p>if we seek the minimum, by varying <span class="math inline">\(x\)</span> alone, we obtain the equation ...</p>
</blockquote>
<p>This is the earliest known printed version of the problem of least squares. The notation, however, is a little awkward for modern eyes. In particular, Legendre doesn't make use of the sum sign, <span class="math display">\[
\sum_{i=1}^3 z_i = z_1
+ z_2 + z_3
\]</span> nor does he make use of the inner product.</p>
<p>In our notation, if we were to do linear regression, we would need to substitute: <span class="math display">\[\begin{align*}
a &\leftarrow \dataScalar_1-c, \\ a^\prime &\leftarrow \dataScalar_2-c,\\ a^{\prime\prime} &\leftarrow
\dataScalar_3 -c,\\
\text{etc.}
\end{align*}\]</span> to introduce the data observations <span class="math inline">\(\{\dataScalar_i\}_{i=1}^{\numData}\)</span> alongside <span class="math inline">\(c\)</span>, the offset. We would then introduce the input locations <span class="math display">\[\begin{align*}
b & \leftarrow \inputScalar_1,\\
b^\prime & \leftarrow \inputScalar_2,\\
b^{\prime\prime} & \leftarrow \inputScalar_3\\
\text{etc.}
\end{align*}\]</span> and finally the gradient of the function <span class="math display">\[x \leftarrow -m.\]</span> The remaining coefficients (<span class="math inline">\(c\)</span> and <span class="math inline">\(f\)</span>) would then be zero. That would give us <span class="math display">\[\begin{align*} &(\dataScalar_1 -
(m\inputScalar_1+c))^2 \\
+ &(\dataScalar_2 -(m\inputScalar_2 + c))^2\\
+ &(\dataScalar_3 -(m\inputScalar_3 + c))^2 \\
+ &
\text{etc.}
\end{align*}\]</span> which we would write in the modern notation for sums as <span class="math display">\[
\sum_{i=1}^\numData (\dataScalar_i-(m\inputScalar_i + c))^2
\]</span> which is recognised as the sum of squares error for a linear regression.</p>
<p>This shows the advantage of the modern <a href="http://en.wikipedia.org/wiki/Summation">summation operator</a>, <span class="math inline">\(\sum\)</span>, in keeping our mathematical notation compact. Whilst it may look more complicated the first time you see it, understanding the mathematical rules that go with it allows us to go much further with the notation.</p>
<p>Inner products (or <a href="http://en.wikipedia.org/wiki/Dot_product">dot products</a>) are similar. They allow us to write <span class="math display">\[
\sum_{i=1}^q u_i v_i
\]</span> in a more compact notation, <span class="math inline">\(\mathbf{u}\cdot\mathbf{v}.\)</span></p>
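<p>For instance, in numpy the two forms give the same answer (a sketch with made-up vectors):</p>

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])

# sum form: sum over u_i v_i, written out as a loop
total = sum(u[i]*v[i] for i in range(len(u)))

# compact inner product form
inner = np.dot(u, v)  # equivalently u @ v
```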
<p>Here we are using bold face to represent vectors, and we assume that the individual elements of a vector <span class="math inline">\(\mathbf{z}\)</span> are given as a series of scalars <span class="math display">\[
\mathbf{z} = \begin{bmatrix} z_1\\ z_2\\ \vdots\\ z_\numData
\end{bmatrix}
\]</span> which are each indexed by their position in the vector.</p>
<h2 id="linear-algebra">Linear Algebra</h2>
<p>Linear algebra provides a very similar role, when we introduce <a href="http://en.wikipedia.org/wiki/Linear_algebra">linear algebra</a>, it is because we are faced with a large number of addition and multiplication operations. These operations need to be done together and would be very tedious to write down as a group. So the first reason we reach for linear algebra is for a more compact representation of our mathematical formulae.</p>
<h3 id="running-example-olympic-marathons">Running Example: Olympic Marathons</h3>
<p>Now we will load in the Olympic marathon data. These are the men's Olympic marathon pace times from the first modern Olympics in 1896 up until the London 2012 Olympics.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">import</span> pods
data <span class="op">=</span> pods.datasets.olympic_marathon_men()
x <span class="op">=</span> data[<span class="st">'X'</span>]
y <span class="op">=</span> data[<span class="st">'Y'</span>]</code></pre></div>
<p>You can see what these values are by typing:</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="bu">print</span>(x)
<span class="bu">print</span>(y)</code></pre></div>
<p>Note that for this example they are not <code>pandas</code> data frames; they are just arrays of dimensionality <span class="math inline">\(\numData\times 1\)</span>, where <span class="math inline">\(\numData\)</span> is the number of data points.</p>
<p>The aim of this lab is to have you coding linear regression in python. We will do it in two ways, once using iterative updates (coordinate descent) and then using linear algebra. The linear algebra approach will not only work much better, it is also easy to extend to multiple input linear regression and <em>non-linear</em> regression using basis functions.</p>
<h3 id="plotting-the-data">Plotting the Data</h3>
<p>You can make a plot of <span class="math inline">\(\dataScalar\)</span> vs <span class="math inline">\(\inputScalar\)</span> with the following command:</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="op">%</span>matplotlib inline
<span class="im">import</span> matplotlib.pyplot <span class="im">as</span> plt</code></pre></div>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">plt.plot(x, y, <span class="st">'rx'</span>)
plt.xlabel(<span class="st">'year'</span>)
plt.ylabel(<span class="st">'pace in min/km'</span>)</code></pre></div>
<h3 id="maximum-likelihood-iterative-solution">Maximum Likelihood: Iterative Solution</h3>
<p>Now we will take the maximum likelihood approach we derived in the lecture to fit a line, <span class="math inline">\(\dataScalar_i=m\inputScalar_i + c\)</span>, to the data you've plotted. We are trying to minimize the error function: <span class="math display">\[
\errorFunction(m, c) = \sum_{i=1}^\numData(\dataScalar_i-m\inputScalar_i-c)^2
\]</span> with respect to <span class="math inline">\(m\)</span> and <span class="math inline">\(c\)</span>. We can start with initial guesses for <span class="math inline">\(m\)</span> and <span class="math inline">\(c\)</span>,</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">m <span class="op">=</span> <span class="op">-</span><span class="fl">0.4</span>
c <span class="op">=</span> <span class="dv">80</span></code></pre></div>
<p>Then we use the maximum likelihood update to find an estimate for the offset, <span class="math inline">\(c\)</span>.</p>
<h3 id="coordinate-descent">Coordinate Descent</h3>
<p>In the movie recommender system example, we minimised the objective function using gradient-based steepest descent methods. Our updates required us to compute the gradient at our current position, then to move the parameters in the direction of steepest descent. This time, we will take another approach, known as <em>coordinate descent</em>. In coordinate descent, we move one parameter at a time, and ideally we design the update so that each step moves that individual parameter to its minimum value.</p>
<p>To find the minimum, we look for the point in the curve where the gradient is zero. This can be found by taking the gradient of <span class="math inline">\(\errorFunction(m,c)\)</span> with respect to the parameter.</p>
<h4 id="update-for-offset">Update for Offset</h4>
<p>Let's consider the parameter <span class="math inline">\(c\)</span> first. The gradient goes nicely through the summation operator, and we obtain <span class="math display">\[
\frac{\text{d}\errorFunction(m,c)}{\text{d}c} = -\sum_{i=1}^\numData 2(\dataScalar_i-m\inputScalar_i-c).
\]</span> Now we want the point that is a minimum. A minimum is an example of a <a href="http://en.wikipedia.org/wiki/Stationary_point"><em>stationary point</em></a>; the stationary points are those points of the function where the gradient is zero. They are found by solving the equation <span class="math inline">\(\frac{\text{d}\errorFunction(m,c)}{\text{d}c} = 0\)</span>. Setting our gradient to zero, we obtain the following equation, <span class="math display">\[
0 = -\sum_{i=1}^\numData 2(\dataScalar_i-m\inputScalar_i-c)
\]</span> which can be reorganised as follows, <span class="math display">\[
c^* = \frac{\sum_{i=1}^\numData(\dataScalar_i-m^*\inputScalar_i)}{\numData}.
\]</span> The fact that the stationary point can be extracted in closed form in this manner implies that the solution is <em>unique</em>. There is only one stationary point for this system. Traditionally, to determine the type of stationary point we have encountered, we compute the <em>second derivative</em>, <span class="math display">\[
\frac{\text{d}^2\errorFunction(m,c)}{\text{d}c^2} = 2\numData.
\]</span> The second derivative is positive, which implies that we have found a minimum of the function. This means that setting <span class="math inline">\(c\)</span> in this way will take us to the lowest point along that axis.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="co"># set c to the minimum</span>
c <span class="op">=</span> (y <span class="op">-</span> m<span class="op">*</span>x).mean()
<span class="bu">print</span>(c)</code></pre></div>
<h3 id="update-for-slope">Update for Slope</h3>
<p>Now that we have the offset set to its minimum value, the next step in coordinate descent is to optimise another parameter. Only one further parameter remains: the slope of the system.</p>
<p>Now we can turn our attention to the slope. We once again perform the same set of computations to find the minimum. We end up with an update equation of the following form.</p>
<p><span class="math display">\[m^* = \frac{\sum_{i=1}^\numData (\dataScalar_i - c)\inputScalar_i}{\sum_{i=1}^\numData \inputScalar_i^2}\]</span></p>
<p>Communication of mathematics is an essential skill in data science. In a moment, you will be asked to rederive the equation above. Before we do that, however, we will briefly review how to write mathematics in the notebook.</p>
<h3 id="latex-for-maths"><span class="math inline">\(\LaTeX\)</span> for Maths</h3>
<p>These cells use <a href="http://en.wikipedia.org/wiki/Markdown">Markdown format</a>. You can include maths in your markdown using <a href="http://en.wikipedia.org/wiki/LaTeX"><span class="math inline">\(\LaTeX\)</span> syntax</a>, all you have to do is write your answer inside dollar signs, as follows:</p>
<p>To write a fraction, we write <code>$\frac{a}{b}$</code>, and it will display like this <span class="math inline">\(\frac{a}{b}\)</span>. To write a subscript we write <code>$a_b$</code>, which will appear as <span class="math inline">\(a_b\)</span>. To write a superscript (for example in a polynomial) we write <code>$a^b$</code>, which will appear as <span class="math inline">\(a^b\)</span>. There are lots of other macros as well; for example, we can write Greek letters such as <code>$\alpha, \beta, \gamma$</code>, rendering as <span class="math inline">\(\alpha, \beta, \gamma\)</span>. And we can write sum and integral signs as <code>$\sum$</code> and <code>$\int$</code>.</p>
<p>You can combine many of these operations together for composing expressions.</p>
<h3 id="question-1">Question 1</h3>
<p>Convert the following python code expressions into <span class="math inline">\(\LaTeX\)</span>, writing your answers below. In each case write your answer as a single equality (i.e. your maths should only contain one expression, not several lines of expressions). For the purposes of your <span class="math inline">\(\LaTeX\)</span> please assume that <code>x</code> and <code>w</code> are <span class="math inline">\(n\)</span> dimensional vectors.</p>
<p><code>(a) f = x.sum()</code></p>
<p><code>(b) m = x.mean()</code></p>
<p><code>(c) g = (x*w).sum()</code></p>
<p><em>15 marks</em></p>
<h3 id="write-your-answer-to-question-1-here">Write your answer to Question 1 here</h3>
<h3 id="fixed-point-updates">Fixed Point Updates</h3>
<p><span align="left">Worked example.</span> <span class="math display">\[
\begin{aligned}
c^{*}=&\frac{\sum
_{i=1}^{\numData}\left(\dataScalar_i-m^{*}\inputScalar_i\right)}{\numData},\\
m^{*}=&\frac{\sum
_{i=1}^{\numData}\inputScalar_i\left(\dataScalar_i-c^{*}\right)}{\sum _{i=1}^{\numData}\inputScalar_i^{2}},\\
\left.\dataStd^2\right.^{*}=&\frac{\sum
_{i=1}^{\numData}\left(\dataScalar_i-m^{*}\inputScalar_i-c^{*}\right)^{2}}{\numData}
\end{aligned}
\]</span></p>
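<p>A minimal sketch of iterating these fixed point updates on illustrative noise-free data (the function name and data below are my own, not part of the lab):</p>

```python
import numpy as np

def fixed_point_fit(x, y, m=0.0, c=0.0, num_iters=100):
    """Alternate the fixed point updates for c and m, then estimate sigma^2."""
    n = len(y)
    for _ in range(num_iters):
        c = (y - m*x).sum()/n                # update for the offset
        m = (x*(y - c)).sum()/(x*x).sum()    # update for the slope
    sigma2 = ((y - m*x - c)**2).sum()/n      # noise variance at the final fit
    return m, c, sigma2

# noise-free data on the line y = 1.4x - 3.1
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.4*x - 3.1
m, c, sigma2 = fixed_point_fit(x, y)
```

<p>With noise-free data the iterates converge to the generating slope and offset, and the residual variance estimate goes to zero.</p>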
<h3 id="gradient-with-respect-to-the-slope">Gradient With Respect to the Slope</h3>
<p>Now that you've had a little training in writing maths with <span class="math inline">\(\LaTeX\)</span>, we will be able to use it to answer questions. The next thing we are going to do is a little differentiation practice.</p>
<h3 id="question-2">Question 2</h3>
<p>Derive the the gradient of the objective function with respect to the slope, <span class="math inline">\(m\)</span>. Rearrange it to show that the update equation written above does find the stationary points of the objective function. By computing its derivative show that it's a minimum.</p>
<p><em>20 marks</em></p>
<h3 id="write-your-answer-to-question-2-here">Write your answer to Question 2 here</h3>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">m <span class="op">=</span> ((y <span class="op">-</span> c)<span class="op">*</span>x).<span class="bu">sum</span>()<span class="op">/</span>(x<span class="op">**</span><span class="dv">2</span>).<span class="bu">sum</span>()
<span class="bu">print</span>(m)</code></pre></div>
<p>We can have a look at how good our fit is by computing the prediction across the input space. First create a vector of 'test points',</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">import</span> numpy <span class="im">as</span> np
x_test <span class="op">=</span> np.linspace(<span class="dv">1890</span>, <span class="dv">2020</span>, <span class="dv">130</span>)[:, <span class="va">None</span>]</code></pre></div>
<p>Now use this vector to compute some test predictions,</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">f_test <span class="op">=</span> m<span class="op">*</span>x_test <span class="op">+</span> c</code></pre></div>
<p>Now plot those test predictions with a blue line on the same plot as the data,</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">plt.plot(x_test, f_test, <span class="st">'b-'</span>)
plt.plot(x, y, <span class="st">'rx'</span>)</code></pre></div>
<p>The fit isn't very good. We need to iterate between these parameter updates in a loop to improve the fit; we have to do this several times,</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="cf">for</span> i <span class="kw">in</span> np.arange(<span class="dv">10</span>):
m <span class="op">=</span> ((y <span class="op">-</span> c)<span class="op">*</span>x).<span class="bu">sum</span>()<span class="op">/</span>(x<span class="op">*</span>x).<span class="bu">sum</span>()
c <span class="op">=</span> (y<span class="op">-</span>m<span class="op">*</span>x).<span class="bu">sum</span>()<span class="op">/</span>y.shape[<span class="dv">0</span>]
<span class="bu">print</span>(m)
<span class="bu">print</span>(c)</code></pre></div>
<p>And let's try plotting the result again</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">f_test <span class="op">=</span> m<span class="op">*</span>x_test <span class="op">+</span> c
plt.plot(x_test, f_test, <span class="st">'b-'</span>)
plt.plot(x, y, <span class="st">'rx'</span>)</code></pre></div>
<p>Clearly we need more iterations than 10! In the next question you will add more iterations and report on the error as optimisation proceeds.</p>
<h3 id="question-3">Question 3</h3>
<p>There is a problem here: we seem to need many iterations to get to a good solution. Let's explore what's going on. Write code which alternates between updates of <code>c</code> and <code>m</code>. Include the following features in your code.</p>
<ol style="list-style-type: decimal">
<li>Initialise with <code>m=-0.4</code> and <code>c=80</code>.</li>
<li>Every 10 iterations compute the value of the objective function for the training data and print it to the screen (you'll find hints on this in <a href="./week2.ipynb">the lab from last week</a>).</li>
<li>Cause the code to stop running when the error change over less than 10 iterations is smaller than <span class="math inline">\(1\times10^{-4}\)</span>. This is known as a stopping criterion.</li>
</ol>
<p>Why do we need so many iterations to get to the solution?</p>
<p><em>25 marks</em></p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="co"># Write code for your answer to Question 3 in this box</span>
</code></pre></div>
<h3 id="important-concepts-not-covered">Important Concepts Not Covered</h3>
<ul>
<li>Other optimization methods:
<ul>
<li>Second order methods, conjugate gradient, quasi-Newton and Newton.</li>
</ul></li>
<li>Effective heuristics such as momentum.</li>
<li>Local vs global solutions.</li>
</ul>
<h3 id="reading">Reading</h3>
<ul>
<li>Section 1.1-1.2 of <span class="citation">Rogers and Girolami (2011)</span> for fitting linear models.</li>
<li>Section 1.2.5 of <span class="citation">Bishop (2006)</span> up to equation 1.65.</li>
</ul>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">import</span> numpy <span class="im">as</span> np
<span class="im">import</span> matplotlib.pyplot <span class="im">as</span> plt
<span class="im">import</span> mlai</code></pre></div>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">x <span class="op">=</span> np.random.normal(size<span class="op">=</span><span class="dv">4</span>)</code></pre></div>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">m_true <span class="op">=</span> <span class="fl">1.4</span>
c_true <span class="op">=</span> <span class="op">-</span><span class="fl">3.1</span></code></pre></div>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">y <span class="op">=</span> m_true<span class="op">*</span>x<span class="op">+</span>c_true</code></pre></div>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">plt.plot(x, y, <span class="st">'r.'</span>, markersize<span class="op">=</span><span class="dv">10</span>) <span class="co"># plot data as red dots</span>
plt.xlim([<span class="op">-</span><span class="dv">3</span>, <span class="dv">3</span>])
mlai.write_figure(filename<span class="op">=</span><span class="st">"../slides/diagrams/ml/regression.svg"</span>, transparent<span class="op">=</span><span class="va">True</span>)</code></pre></div>
<object class="svgplot" align data="../slides/diagrams/ml/regression.svg">
</object>
<h3 id="noise-corrupted-plot">Noise Corrupted Plot</h3>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">noise <span class="op">=</span> np.random.normal(scale<span class="op">=</span><span class="fl">0.5</span>, size<span class="op">=</span><span class="dv">4</span>) <span class="co"># standard deviation of the noise is 0.5</span>
y <span class="op">=</span> m_true<span class="op">*</span>x <span class="op">+</span> c_true <span class="op">+</span> noise
plt.plot(x, y, <span class="st">'r.'</span>, markersize<span class="op">=</span><span class="dv">10</span>)
plt.xlim([<span class="op">-</span><span class="dv">3</span>, <span class="dv">3</span>])
mlai.write_figure(filename<span class="op">=</span><span class="st">"../slides/diagrams/ml/regression_noise.svg"</span>, transparent<span class="op">=</span><span class="va">True</span>)</code></pre></div>
<object class="svgplot" align data="../slides/diagrams/ml/regression_noise.svg">
</object>
<h3 id="contour-plot-of-error-function">Contour Plot of Error Function</h3>
<ul>
<li>Visualise the error function surface, create vectors of values.</li>
</ul>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="co"># create an array of linearly spaced values around m_true</span>
m_vals <span class="op">=</span> np.linspace(m_true<span class="op">-</span><span class="dv">3</span>, m_true<span class="op">+</span><span class="dv">3</span>, <span class="dv">100</span>)
<span class="co"># create an array of linearly spaced values around c_true</span>
c_vals <span class="op">=</span> np.linspace(c_true<span class="op">-</span><span class="dv">3</span>, c_true<span class="op">+</span><span class="dv">3</span>, <span class="dv">100</span>)</code></pre></div>
<ul>
<li>create a grid of values to evaluate the error function in 2D.</li>
</ul>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">m_grid, c_grid <span class="op">=</span> np.meshgrid(m_vals, c_vals)</code></pre></div>
<ul>
<li>compute the error function at each combination of <span class="math inline">\(c\)</span> and <span class="math inline">\(m\)</span>.</li>
</ul>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">E_grid <span class="op">=</span> np.zeros((<span class="dv">100</span>, <span class="dv">100</span>))
<span class="cf">for</span> i <span class="kw">in</span> <span class="bu">range</span>(<span class="dv">100</span>):
<span class="cf">for</span> j <span class="kw">in</span> <span class="bu">range</span>(<span class="dv">100</span>):
E_grid[i, j] <span class="op">=</span> ((y <span class="op">-</span> m_grid[i, j]<span class="op">*</span>x <span class="op">-</span> c_grid[i, j])<span class="op">**</span><span class="dv">2</span>).<span class="bu">sum</span>()</code></pre></div>
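<p>As an aside, the double loop above can be replaced with a single broadcast operation over a trailing data axis. Here is a sketch using stand-in data and grids in place of the notebook's variables:</p>

```python
import numpy as np

# stand-in data and grids, mirroring the notebook's setup
x = np.random.normal(size=4)
y = 1.4*x - 3.1
m_vals = np.linspace(1.4 - 3, 1.4 + 3, 100)
c_vals = np.linspace(-3.1 - 3, -3.1 + 3, 100)
m_grid, c_grid = np.meshgrid(m_vals, c_vals)

# loop version, as in the notebook
E_loop = np.zeros((100, 100))
for i in range(100):
    for j in range(100):
        E_loop[i, j] = ((y - m_grid[i, j]*x - c_grid[i, j])**2).sum()

# broadcast version: a trailing axis lets every grid cell see the whole data vector
E_grid = ((y - m_grid[..., None]*x - c_grid[..., None])**2).sum(axis=-1)
```

<p>The broadcast version computes the same surface without the explicit double loop, which matters once the grid gets fine.</p>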
<h3 id="contour-plot-of-error">Contour Plot of Error</h3>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="op">%</span>load <span class="op">-</span>s regression_contour teaching_plots.py</code></pre></div>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">f, ax <span class="op">=</span> plt.subplots(figsize<span class="op">=</span>(<span class="dv">5</span>,<span class="dv">5</span>))
regression_contour(f, ax, m_vals, c_vals, E_grid)
mlai.write_figure(filename<span class="op">=</span><span class="st">'../slides/diagrams/ml/regression_contour.svg'</span>)</code></pre></div>
<object class="svgplot" align data="../slides/diagrams/ml/regression_contour.svg">
</object>
<h3 id="steepest-descent">Steepest Descent</h3>
<h3 id="algorithm">Algorithm</h3>
<ul>
<li>We start with a guess for <span class="math inline">\(m\)</span> and <span class="math inline">\(c\)</span>.</li>
</ul>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">m_star <span class="op">=</span> <span class="fl">0.0</span>
c_star <span class="op">=</span> <span class="op">-</span><span class="fl">5.0</span></code></pre></div>
<h3 id="offset-gradient">Offset Gradient</h3>
<ul>
<li>Now we need to compute the gradient of the error function, firstly with respect to <span class="math inline">\(c\)</span>,</li>
</ul>
<p><span class="math display">\[\frac{\text{d}\errorFunction(m, c)}{\text{d} c} =
-2\sum_{i=1}^\numData (\dataScalar_i - m\inputScalar_i - c)\]</span></p>
<ul>
<li>This is computed in python as follows</li>
</ul>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">c_grad <span class="op">=</span> <span class="op">-</span><span class="dv">2</span><span class="op">*</span>(y<span class="op">-</span>m_star<span class="op">*</span>x <span class="op">-</span> c_star).<span class="bu">sum</span>()
<span class="bu">print</span>(<span class="st">"Gradient with respect to c is "</span>, c_grad)</code></pre></div>
<h3 id="deriving-the-gradient">Deriving the Gradient</h3>
<p>To see how the gradient was derived, first note that <span class="math inline">\(c\)</span> appears in every term in the sum. So we are just differentiating <span class="math inline">\((\dataScalar_i - m\inputScalar_i - c)^2\)</span> for each term in the sum. The gradient of this term with respect to <span class="math inline">\(c\)</span> is simply the gradient of the outer quadratic, multiplied by the gradient with respect to <span class="math inline">\(c\)</span> of the part inside the quadratic. The gradient of a quadratic is two times its argument, and the gradient of the inside linear term is just minus one. This is true for all terms in the sum, so we are left with the sum in the gradient.</p>
<h3 id="slope-gradient">Slope Gradient</h3>
<p>The gradient with respect to <span class="math inline">\(m\)</span> is similar, but now the gradient of the quadratic's argument is <span class="math inline">\(-\inputScalar_i\)</span>, so the gradient with respect to <span class="math inline">\(m\)</span> is</p>
<p><span class="math display">\[\frac{\text{d}\errorFunction(m, c)}{\text{d} m} = -2\sum_{i=1}^\numData \inputScalar_i(\dataScalar_i - m\inputScalar_i -
c)\]</span></p>
<p>which can be implemented in python (numpy) as</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">m_grad <span class="op">=</span> <span class="op">-</span><span class="dv">2</span><span class="op">*</span>(x<span class="op">*</span>(y<span class="op">-</span>m_star<span class="op">*</span>x <span class="op">-</span> c_star)).<span class="bu">sum</span>()
<span class="bu">print</span>(<span class="st">"Gradient with respect to m is "</span>, m_grad)</code></pre></div>
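<p>These analytic gradients can be sanity-checked against central finite differences; below is a self-contained sketch with made-up data and the same starting guesses for <code>m_star</code> and <code>c_star</code> as above:</p>

```python
import numpy as np

np.random.seed(1)
x = np.random.normal(size=10)
y = 1.4*x - 3.1 + np.random.normal(scale=0.5, size=10)
m_star, c_star = 0.0, -5.0

def E(m, c):
    """Sum of squares error at parameters (m, c)."""
    return ((y - m*x - c)**2).sum()

# analytic gradients
c_grad = -2*(y - m_star*x - c_star).sum()
m_grad = -2*(x*(y - m_star*x - c_star)).sum()

# central finite-difference approximations
eps = 1e-6
c_grad_fd = (E(m_star, c_star + eps) - E(m_star, c_star - eps))/(2*eps)
m_grad_fd = (E(m_star + eps, c_star) - E(m_star - eps, c_star))/(2*eps)
```

<p>If the derivation is right, the two estimates agree to several decimal places; this check is cheap insurance whenever you derive a gradient by hand.</p>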
<h3 id="update-equations">Update Equations</h3>
<ul>
<li>Now we have gradients with respect to <span class="math inline">\(m\)</span> and <span class="math inline">\(c\)</span>.</li>
<li>Can update our initial guesses for <span class="math inline">\(m\)</span> and <span class="math inline">\(c\)</span> using the gradient.</li>
<li>We don't want to just subtract the gradient from <span class="math inline">\(m\)</span> and <span class="math inline">\(c\)</span>,</li>
<li>We need to take a <em>small</em> step in the gradient direction.</li>
<li>Otherwise we might overshoot the minimum.</li>
<li>We want to follow the gradient to get to the minimum; the gradient changes as we move.</li>
</ul>
<h3 id="move-in-direction-of-gradient">Move in Direction of Gradient</h3>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">import</span> teaching_plots <span class="im">as</span> plot</code></pre></div>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">f, ax <span class="op">=</span> plt.subplots(figsize<span class="op">=</span>plot.big_figsize)
plot.regression_contour(f, ax, m_vals, c_vals, E_grid)
ax.plot(m_star, c_star, <span class="st">'g*'</span>, markersize<span class="op">=</span><span class="dv">20</span>)
ax.arrow(m_star, c_star, <span class="op">-</span>m_grad<span class="op">*</span><span class="fl">0.1</span>, <span class="op">-</span>c_grad<span class="op">*</span><span class="fl">0.1</span>, head_width<span class="op">=</span><span class="fl">0.2</span>)
mlai.write_figure(filename<span class="op">=</span><span class="st">'../slides/diagrams/ml/regression_contour_step001.svg'</span>, transparent<span class="op">=</span><span class="va">True</span>)</code></pre></div>
<object class="svgplot" align data="../slides/diagrams/ml/regression_contour_step001.svg">
</object>
<h3 id="update-equations-1">Update Equations</h3>
<ul>
<li><p>The step size has already been introduced; it's again known as the learning rate and is denoted by <span class="math inline">\(\learnRate\)</span>. <span class="math display">\[
c_\text{new}\leftarrow c_{\text{old}} - \learnRate \frac{\text{d}\errorFunction(m, c)}{\text{d}c}
\]</span></p></li>
<li>gives us an update for our estimate of <span class="math inline">\(c\)</span> (which in the code we've been calling <code>c_star</code> to represent a common way of writing a parameter estimate, <span class="math inline">\(c^*\)</span>) and <span class="math display">\[
m_\text{new} \leftarrow m_{\text{old}} - \learnRate \frac{\text{d}\errorFunction(m, c)}{\text{d}m}
\]</span></li>
<li><p>Giving us an update for <span class="math inline">\(m\)</span>.</p></li>
</ul>
<h3 id="update-code">Update Code</h3>
<ul>
<li>These updates can be coded as</li>
</ul>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="bu">print</span>(<span class="st">"Original m was"</span>, m_star, <span class="st">"and original c was"</span>, c_star)
learn_rate <span class="op">=</span> <span class="fl">0.01</span>
c_star <span class="op">=</span> c_star <span class="op">-</span> learn_rate<span class="op">*</span>c_grad
m_star <span class="op">=</span> m_star <span class="op">-</span> learn_rate<span class="op">*</span>m_grad
<span class="bu">print</span>(<span class="st">"New m is"</span>, m_star, <span class="st">"and new c is"</span>, c_star)</code></pre></div>
<h2 id="iterating-updates">Iterating Updates</h2>
<ul>
<li>Fit model by descending gradient.</li>
</ul>
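<p>The whole procedure can be sketched as a simple loop. This is a minimal sketch, with synthetic data standing in for the notebook's <code>x</code> and <code>y</code> (the plotting utilities are omitted):</p>

```python
import numpy as np

# synthetic regression data standing in for the notebook's x and y
np.random.seed(42)
x = np.random.rand(27, 1) * 10
y = 1.4 * x + 2.3 + 0.1 * np.random.randn(27, 1)

# initial guesses for the slope and offset
m_star, c_star = 0.0, -5.0
learn_rate = 5e-4  # small enough to keep the iteration stable on this data

for iteration in range(10000):
    # gradients of the sum-of-squares error with respect to m and c
    m_grad = -2 * (x * (y - m_star * x - c_star)).sum()
    c_grad = -2 * (y - m_star * x - c_star).sum()
    # descend the gradient
    m_star -= learn_rate * m_grad
    c_star -= learn_rate * c_grad

print(m_star, c_star)  # approaches the generating slope and offset
```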
<h3 id="gradient-descent-algorithm">Gradient Descent Algorithm</h3>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">num_plots <span class="op">=</span> plot.regression_contour_fit(x, y, diagrams<span class="op">=</span><span class="st">'../slides/diagrams/ml'</span>)</code></pre></div>
<object class="svgplot" align data="../slides/diagrams/ml/regression_contour_fit028.svg">
</object>
<center>
<em>Gradient descent for linear regression </em>
</center>
<h3 id="stochastic-gradient-descent">Stochastic Gradient Descent</h3>
<ul>
<li>If <span class="math inline">\(\numData\)</span> is small, gradient descent is fine.</li>
<li>But sometimes (e.g. on the internet) <span class="math inline">\(\numData\)</span> could be a billion.</li>
<li>Stochastic gradient descent is more similar to perceptron.</li>
<li>Look at the gradient of one data point at a time rather than summing across <em>all</em> data points.</li>
<li>This gives a stochastic estimate of gradient.</li>
</ul>
<h3 id="stochastic-gradient-descent-1">Stochastic Gradient Descent</h3>
<ul>
<li>The real gradient with respect to <span class="math inline">\(m\)</span> is given by</li>
</ul>
<p><span class="math display">\[\frac{\text{d}\errorFunction(m, c)}{\text{d} m} = -2\sum_{i=1}^\numData \inputScalar_i(\dataScalar_i -
m\inputScalar_i - c)\]</span></p>
<p>but it has <span class="math inline">\(\numData\)</span> terms in the sum. Substituting in the gradient we can see that the full update is of the form</p>
<p><span class="math display">\[m_\text{new} \leftarrow
m_\text{old} + 2\learnRate \left[\inputScalar_1 (\dataScalar_1 - m_\text{old}\inputScalar_1 - c_\text{old}) + (\inputScalar_2 (\dataScalar_2 - m_\text{old}\inputScalar_2 - c_\text{old}) + \dots + (\inputScalar_n (\dataScalar_n - m_\text{old}\inputScalar_n - c_\text{old})\right]\]</span></p>
<p>This could be split up into lots of individual updates <span class="math display">\[m_1 \leftarrow m_\text{old} + 2\learnRate \left[\inputScalar_1 (\dataScalar_1 - m_\text{old}\inputScalar_1 -
c_\text{old})\right]\]</span> <span class="math display">\[m_2 \leftarrow m_1 + 2\learnRate \left[\inputScalar_2 (\dataScalar_2 -
m_\text{old}\inputScalar_2 - c_\text{old})\right]\]</span> <span class="math display">\[m_3 \leftarrow m_2 + 2\learnRate
\left[\dots\right]\]</span> <span class="math display">\[m_n \leftarrow m_{n-1} + 2\learnRate \left[\inputScalar_n (\dataScalar_n -
m_\text{old}\inputScalar_n - c_\text{old})\right]\]</span></p>
<p>which would lead to the same final update.</p>
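<p>This equivalence is easy to check numerically. The sketch below, with placeholder synthetic data, applies the individual updates (each using the gradient evaluated at <span class="math inline">\(m_\text{old}\)</span> and <span class="math inline">\(c_\text{old}\)</span>) and compares the result to the single batch update:</p>

```python
import numpy as np

# small synthetic data set standing in for the notebook's x and y
np.random.seed(0)
x = np.random.rand(10)
y = 2.0 * x + 1.0 + 0.05 * np.random.randn(10)

m_old, c_old = 0.5, 0.0
learn_rate = 0.01

# full batch update: sum every point's gradient contribution at once
m_batch = m_old + 2 * learn_rate * np.sum(x * (y - m_old * x - c_old))

# the same update split into n individual steps, each evaluated at m_old, c_old
m_split = m_old
for i in range(len(x)):
    m_split = m_split + 2 * learn_rate * x[i] * (y[i] - m_old * x[i] - c_old)

print(np.isclose(m_batch, m_split))  # the two routes agree
```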
<h3 id="updating-c-and-m">Updating <span class="math inline">\(c\)</span> and <span class="math inline">\(m\)</span></h3>
<ul>
<li>In the sum we don't change the values of <span class="math inline">\(m\)</span> and <span class="math inline">\(c\)</span> that we use for computing the gradient term at each update.</li>
<li>In stochastic gradient descent we <em>do</em> change them.</li>
<li>This means it's not quite the same as steepest descent.</li>
<li>But we can present each data point in a random order, like we did for the perceptron.</li>
<li>This makes the algorithm suitable for large scale web use (this domain has recently become known as 'Big Data') and algorithms like this are widely used by Google, Microsoft, Amazon, Twitter and Facebook.</li>
</ul>
<h3 id="stochastic-gradient-descent-2">Stochastic Gradient Descent</h3>
<ul>
<li>Or more accurately, since the data is normally presented in a random order, we can just write <span class="math display">\[
m_\text{new} = m_\text{old} + 2\learnRate\left[\inputScalar_i (\dataScalar_i - m_\text{old}\inputScalar_i - c_\text{old})\right]
\]</span></li>
</ul>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="co"># choose a random point for the update </span>
i <span class="op">=</span> np.random.randint(x.shape[<span class="dv">0</span>]<span class="op">-</span><span class="dv">1</span>)
<span class="co"># update m</span>
m_star <span class="op">=</span> m_star <span class="op">+</span> <span class="dv">2</span><span class="op">*</span>learn_rate<span class="op">*</span>(x[i]<span class="op">*</span>(y[i]<span class="op">-</span>m_star<span class="op">*</span>x[i] <span class="op">-</span> c_star))
<span class="co"># update c</span>
c_star <span class="op">=</span> c_star <span class="op">+</span> <span class="dv">2</span><span class="op">*</span>learn_rate<span class="op">*</span>(y[i]<span class="op">-</span>m_star<span class="op">*</span>x[i] <span class="op">-</span> c_star)</code></pre></div>
<h3 id="sgd-for-linear-regression">SGD for Linear Regression</h3>
<p>Putting it all together in an algorithm, we can do stochastic gradient descent for our regression data.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">num_plots <span class="op">=</span> plot.regression_contour_sgd(x, y, diagrams<span class="op">=</span><span class="st">'../slides/diagrams/ml'</span>)</code></pre></div>
<object class="svgplot" align data="../slides/diagrams/ml/regression_sgd_contour_fit058.svg">
</object>
<center>
<em>Stochastic gradient descent for linear regression </em>
</center>
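<p>An explicit version of the loop underlying the plot above might look as follows; this is a sketch with synthetic data standing in for the notebook's <code>x</code> and <code>y</code>:</p>

```python
import numpy as np

# synthetic data in place of the notebook's x and y
np.random.seed(1)
x = np.random.rand(30)
y = 1.4 * x + 2.3 + 0.05 * np.random.randn(30)

m_star, c_star = 0.0, -5.0
learn_rate = 0.01

for epoch in range(200):
    # present the data points in a fresh random order on each pass
    for i in np.random.permutation(x.shape[0]):
        # single-point stochastic updates, changing m and c as we go
        m_star = m_star + 2 * learn_rate * x[i] * (y[i] - m_star * x[i] - c_star)
        c_star = c_star + 2 * learn_rate * (y[i] - m_star * x[i] - c_star)

print(m_star, c_star)
```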
<h3 id="reflection-on-linear-regression-and-supervised-learning">Reflection on Linear Regression and Supervised Learning</h3>
<p>Think about:</p>
<ol style="list-style-type: decimal">
<li><p>What effect does the learning rate have in the optimization? What's the effect of making it too small, and what's the effect of making it too big? Do you get the same result for both stochastic and steepest gradient descent?</p></li>
<li><p>Stochastic gradient descent doesn't help very much for such a small data set. Its real advantage comes when there are many data points; you'll see this in the lab.</p></li>
</ol>
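<p>One way to begin exploring the first question is to run the batch algorithm at several learning rates and compare the final error. This is an illustrative sketch with synthetic stand-in data; the specific learning rates are just examples:</p>

```python
import numpy as np

# synthetic data standing in for the notebook's x and y
np.random.seed(2)
x = np.random.rand(20)
y = 1.4 * x + 2.3 + 0.05 * np.random.randn(20)

def final_error(learn_rate, num_steps=500):
    """Run batch gradient descent and return the final sum-of-squares error."""
    m, c = 0.0, 0.0
    for _ in range(num_steps):
        m_grad = -2 * np.sum(x * (y - m * x - c))
        c_grad = -2 * np.sum(y - m * x - c)
        m -= learn_rate * m_grad
        c -= learn_rate * c_grad
    return np.sum((y - m * x - c) ** 2)

# a moderate step size converges; a step size that is far too large diverges
# (numpy will warn about overflow in the divergent run)
print(final_error(0.01), final_error(0.1))
```

A learning rate that is too small also converges, but only after many more iterations.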
<h2 id="multiple-input-solution-with-linear-algebra">Multiple Input Solution with Linear Algebra</h2>
<p>You've now seen how slow it can be to perform a coordinate ascent on a system. Another approach to solving the system (which is not always possible, particularly in <em>non-linear</em> systems) is to go directly to the minimum. To do this we need to introduce <em>linear algebra</em>. We will represent all our errors and functions in the form of linear algebra. As we mentioned above, linear algebra is just a shorthand for performing lots of multiplications and additions simultaneously. What does it have to do with our system then? Well, the first thing to note is that the linear function we were trying to fit has the following form: <span class="math display">\[
\mappingFunction(x) = mx + c
\]</span> the classical form for a straight line. From a linear algebraic perspective we are looking for multiplications and additions. We are also looking to separate our parameters from our data. Remember that the data is the <em>givens</em>; in French the word for data, données, literally translates to <em>givens</em>. That's convenient, because we don't need to change the data; what we need to change are the parameters (or variables) of the model. In this function the data comes in through <span class="math inline">\(x\)</span>, and the parameters are <span class="math inline">\(m\)</span> and <span class="math inline">\(c\)</span>.</p>
<p>What we'd like to create is a vector of parameters and a vector of data. Then we could represent the system with vectors that represent the data, and vectors that represent the parameters.</p>
<p>We look to turn the multiplications and additions into a linear algebraic form; we have one multiplication (<span class="math inline">\(m\times x\)</span>) and one addition (<span class="math inline">\(mx + c\)</span>). But we can turn this into an inner product by writing it in the following way, <span class="math display">\[
\mappingFunction(x) = m \times x +
c \times 1,
\]</span> in other words we've extracted the unit value, from the offset, <span class="math inline">\(c\)</span>. We can think of this unit value like an extra item of data, because it is always given to us, and it is always set to 1 (unlike regular data, which is likely to vary!). We can therefore write each input data location, <span class="math inline">\(\inputVector\)</span>, as a vector <span class="math display">\[
\inputVector = \begin{bmatrix} 1\\ x\end{bmatrix}.
\]</span></p>
<p>Now we choose to also turn our parameters into a vector. The parameter vector will be defined to contain <span class="math display">\[
\mappingVector = \begin{bmatrix} c \\ m\end{bmatrix}
\]</span> because if we now take the inner product between these two vectors we recover <span class="math display">\[
\inputVector\cdot\mappingVector = 1 \times c + x \times m = mx + c
\]</span> In <code>numpy</code> we can define this vector as follows</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="co"># define the vector w</span>
w <span class="op">=</span> np.zeros(shape<span class="op">=</span>(<span class="dv">2</span>, <span class="dv">1</span>))
w[<span class="dv">0</span>] <span class="op">=</span> m
w[<span class="dv">1</span>] <span class="op">=</span> c</code></pre></div>
<p>This gives us the equivalence between the original operation and an operation in vector space. Whilst the notation here isn't a lot shorter, the beauty is that we will be able to add as many features as we like and still keep the same representation. In general, we are now moving to a system where each of our predictions is given by an inner product. When we want to represent an inner product in linear algebra, we tend to do it with the transpose operation, so since we have <span class="math inline">\(\mathbf{a}\cdot\mathbf{b} = \mathbf{a}^\top\mathbf{b}\)</span> we can write <span class="math display">\[
\mappingFunction(\inputVector_i) = \inputVector_i^\top\mappingVector.
\]</span> Where we've assumed that each data point, <span class="math inline">\(\inputVector_i\)</span>, is now written by appending a 1 onto the original vector <span class="math display">\[
\inputVector_i = \begin{bmatrix}
1 \\
\inputScalar_i
\end{bmatrix}
\]</span></p>
<h2 id="design-matrix">Design Matrix</h2>
<p>We can do this for the entire data set to form a <a href="http://en.wikipedia.org/wiki/Design_matrix"><em>design matrix</em></a> <span class="math inline">\(\inputMatrix\)</span>,</p>
<p><span class="math display">\[\inputMatrix
= \begin{bmatrix}
\inputVector_1^\top \\\
\inputVector_2^\top \\\
\vdots \\\
\inputVector_\numData^\top
\end{bmatrix} = \begin{bmatrix}
1 & \inputScalar_1 \\\
1 & \inputScalar_2 \\\
\vdots
& \vdots \\\
1 & \inputScalar_\numData
\end{bmatrix},\]</span></p>
<p>which in <code>numpy</code> can be done with the following commands:</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">X <span class="op">=</span> np.hstack((np.ones_like(x), x))
<span class="bu">print</span>(X)</code></pre></div>
<h3 id="writing-the-objective-with-linear-algebra">Writing the Objective with Linear Algebra</h3>
<p>When we think of the objective function, we can think of it in terms of errors, where the error is defined in a similar way to how it was in Legendre's day, <span class="math inline">\(\dataScalar_i - \mappingFunction(\inputVector_i)\)</span>; in statistics these errors are also sometimes called <a href="http://en.wikipedia.org/wiki/Errors_and_residuals_in_statistics"><em>residuals</em></a>. So we can think of the objective and the prediction function as two separate parts. First we have, <span class="math display">\[
\errorFunction(\mappingVector) = \sum_{i=1}^\numData (\dataScalar_i - \mappingFunction(\inputVector_i; \mappingVector))^2,
\]</span> where we've made the function <span class="math inline">\(\mappingFunction(\cdot)\)</span>'s dependence on the parameters <span class="math inline">\(\mappingVector\)</span> explicit in this equation. Then we have the definition of the function itself, <span class="math display">\[
\mappingFunction(\inputVector_i; \mappingVector) = \inputVector_i^\top \mappingVector.
\]</span> Let's look again at these two equations and see if we can identify any inner products. The first equation is a sum of squares, which is promising. Any sum of squares can be represented by an inner product, <span class="math display">\[
a = \sum_{i=1}^{k} b^2_i = \mathbf{b}^\top\mathbf{b},
\]</span> so if we wish to represent <span class="math inline">\(\errorFunction(\mappingVector)\)</span> in this way, all we need to do is convert the sum operator to an inner product. We can get a vector from that sum operator by placing both <span class="math inline">\(\dataScalar_i\)</span> and <span class="math inline">\(\mappingFunction(\inputVector_i; \mappingVector)\)</span> into vectors, which we do by defining <span class="math display">\[
\dataVector = \begin{bmatrix}\dataScalar_1\\ \dataScalar_2\\ \vdots \\ \dataScalar_\numData\end{bmatrix}
\]</span> and defining <span class="math display">\[
\mappingFunctionVector(\inputMatrix; \mappingVector) = \begin{bmatrix}\mappingFunction(\inputVector_1; \mappingVector)\\ \mappingFunction(\inputVector_2; \mappingVector)\\ \vdots \\ \mappingFunction(\inputVector_\numData; \mappingVector)\end{bmatrix}.
\]</span> The second of these is actually a vector-valued function. This term may appear intimidating, but the idea is straightforward. A vector valued function is simply a vector whose elements are themselves defined as <em>functions</em>, i.e. it is a vector of functions, rather than a vector of scalars. The idea is so straightforward, that we are going to ignore it for the moment, and barely use it in the derivation. But it will reappear later when we introduce <em>basis functions</em>. So we will, for the moment, ignore the dependence of <span class="math inline">\(\mappingFunctionVector\)</span> on <span class="math inline">\(\mappingVector\)</span> and <span class="math inline">\(\inputMatrix\)</span> and simply summarise it by a vector of numbers <span class="math display">\[
\mappingFunctionVector = \begin{bmatrix}\mappingFunction_1\\\mappingFunction_2\\
\vdots \\ \mappingFunction_\numData\end{bmatrix}.
\]</span> This allows us to write our objective in the following, linear algebraic form, <span class="math display">\[
\errorFunction(\mappingVector) = (\dataVector - \mappingFunctionVector)^\top(\dataVector - \mappingFunctionVector)
\]</span> from the rules of inner products. But what of our matrix <span class="math inline">\(\inputMatrix\)</span> of input data? At this point, we need to dust off <a href="http://en.wikipedia.org/wiki/Matrix_multiplication"><em>matrix-vector multiplication</em></a>. Matrix multiplication is simply a convenient way of performing many inner products together, and it's exactly what we need to summarise the operation <span class="math display">\[
f_i = \inputVector_i^\top\mappingVector.
\]</span> This operation tells us that each element of the vector <span class="math inline">\(\mappingFunctionVector\)</span> (our vector valued function) is given by an inner product between <span class="math inline">\(\inputVector_i\)</span> and <span class="math inline">\(\mappingVector\)</span>. In other words it is a series of inner products. Let's look at the definition of matrix multiplication, it takes the form <span class="math display">\[
\mathbf{c} = \mathbf{B}\mathbf{a}
\]</span> where <span class="math inline">\(\mathbf{c}\)</span> might be a <span class="math inline">\(k\)</span> dimensional vector (which we can intepret as a <span class="math inline">\(k\times 1\)</span> dimensional matrix), and <span class="math inline">\(\mathbf{B}\)</span> is a <span class="math inline">\(k\times k\)</span> dimensional matrix and <span class="math inline">\(\mathbf{a}\)</span> is a <span class="math inline">\(k\)</span> dimensional vector (<span class="math inline">\(k\times 1\)</span> dimensional matrix).</p>
<p>The result of this multiplication is of the form <span class="math display">\[
\begin{bmatrix}c_1\\c_2 \\ \vdots \\
c_k\end{bmatrix} =
\begin{bmatrix} b_{1,1} & b_{1, 2} & \dots & b_{1, k} \\
b_{2, 1} & b_{2, 2} & \dots & b_{2, k} \\
\vdots & \vdots & \ddots & \vdots \\
b_{k, 1} & b_{k, 2} & \dots & b_{k, k} \end{bmatrix} \begin{bmatrix}a_1\\a_2 \\
\vdots\\ a_k\end{bmatrix} = \begin{bmatrix} b_{1, 1}a_1 + b_{1, 2}a_2 + \dots +
b_{1, k}a_k\\
b_{2, 1}a_1 + b_{2, 2}a_2 + \dots + b_{2, k}a_k \\
\vdots\\
b_{k, 1}a_1 + b_{k, 2}a_2 + \dots + b_{k, k}a_k\end{bmatrix}
\]</span> so we see that each element of the result, <span class="math inline">\(\mathbf{c}\)</span>, is simply the inner product between the corresponding <em>row</em> of <span class="math inline">\(\mathbf{B}\)</span> and the vector <span class="math inline">\(\mathbf{a}\)</span>. Because we have defined each element of <span class="math inline">\(\mappingFunctionVector\)</span> to be given by the inner product between each <em>row</em> of the design matrix and the vector <span class="math inline">\(\mappingVector\)</span>, we can now write the full operation in one matrix multiplication, <span class="math display">\[
\mappingFunctionVector = \inputMatrix\mappingVector.
\]</span></p>
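<p>The row-by-row picture of matrix multiplication can be checked directly in <code>numpy</code>; a small sketch with arbitrary values:</p>

```python
import numpy as np

np.random.seed(3)
B = np.random.randn(4, 4)
a = np.random.randn(4, 1)

# the full matrix multiplication in one call
c = B @ a

# the same result built one inner product at a time, row by row
c_rows = np.array([[np.dot(B[i, :], a[:, 0])] for i in range(4)])

print(np.allclose(c, c_rows))
```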
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">f <span class="op">=</span> np.dot(X, w) <span class="co"># np.dot does matrix multiplication in python</span></code></pre></div>
<p>Combining this result with our objective function, <span class="math display">\[
\errorFunction(\mappingVector) = (\dataVector - \mappingFunctionVector)^\top(\dataVector - \mappingFunctionVector)
\]</span> we find we have defined the <em>model</em> with two equations. One equation tells us the form of our predictive function and how it depends on its parameters, the other tells us the form of our objective function.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">resid <span class="op">=</span> (y<span class="op">-</span>f)
E <span class="op">=</span> np.dot(resid.T, resid) <span class="co"># matrix multiplication on a single vector is equivalent to a dot product.</span>
<span class="bu">print</span>(<span class="st">"Error function is:"</span>, E)</code></pre></div>
<h3 id="question-4">Question 4</h3>
<p>The prediction for our movie recommender system had the form <span class="math display">\[
f_{i,j} = \mathbf{u}_i^\top \mathbf{v}_j
\]</span> and the objective function was then <span class="math display">\[
E = \sum_{i,j} s_{i,j}(\dataScalar_{i,j} - f_{i, j})^2
\]</span> Try writing this down in matrix and vector form. How many of the terms can you do? For each variable and parameter carefully think about whether it should be represented as a matrix or vector. Do as many of the terms as you can. Use <span class="math inline">\(\LaTeX\)</span> to give your answers and give the <em>dimensions</em> of any matrices you create.</p>
<p><em>20 marks</em></p>
<h3 id="write-your-answer-to-question-4-here">Write your answer to Question 4 here</h3>
<h2 id="objective-optimisation">Objective Optimisation</h2>
<p>Our <em>model</em> has now been defined with two equations, the prediction function and the objective function. Next we will use multivariate calculus to define an <em>algorithm</em> to fit the model. The separation between model and algorithm is important and is often overlooked. Our model contains a function that shows how it will be used for prediction, and a function that describes the objective function we need to optimise to obtain a good set of parameters.</p>
<p>The linear regression model we have described is still the same as the one we fitted above with a coordinate ascent algorithm. We have only played with the notation to obtain the same model in a matrix and vector notation. However, we will now fit this model with a different algorithm, one that is much faster. It is such a widely used algorithm that from the end user's perspective it doesn't even look like an algorithm, it just appears to be a single operation (or function). However, underneath the computer calls an algorithm to find the solution. Further, the algorithm we obtain is very widely used, and because of this it turns out to be highly optimised.</p>
<p>Once again we are going to try and find the minimum of our objective by finding its <em>stationary points</em>. However, the stationary points of a multivariate function are a little bit more complex to find. Once again we need to find the point at which the derivative is zero, but now we need to use <em>multivariate calculus</em> to find it. This involves learning a few additional rules of differentiation (that allow you to do the derivatives of a function with respect to a vector), but in the end it makes things quite a bit easier. We define vectorial derivatives as follows, <span class="math display">\[
\frac{\text{d}\errorFunction(\mappingVector)}{\text{d}\mappingVector} =
\begin{bmatrix}\frac{\text{d}\errorFunction(\mappingVector)}{\text{d}\mappingScalar_1}\\\frac{\text{d}\errorFunction(\mappingVector)}{\text{d}\mappingScalar_2}\end{bmatrix}.
\]</span> where <span class="math inline">\(\frac{\text{d}\errorFunction(\mappingVector)}{\text{d}\mappingScalar_1}\)</span> is the <a href="http://en.wikipedia.org/wiki/Partial_derivative">partial derivative</a> of the error function with respect to <span class="math inline">\(\mappingScalar_1\)</span>.</p>
<p>Differentiation through multiplications and additions is relatively straightforward, and since linear algebra is just multiplication and addition, its rules of differentiation are quite straightforward too, but slightly more complex than regular derivatives.</p>
<h3 id="multivariate-derivatives">Multivariate Derivatives</h3>
<p>We will need two rules of multivariate or <em>matrix</em> differentiation. The first is differentiation of an inner product. By remembering that the inner product is made up of multiplication and addition, we can hope that its derivative is quite straightforward, and so it proves to be. We can start by thinking about the definition of the inner product, <span class="math display">\[
\mathbf{a}^\top\mathbf{z} = \sum_{i} a_i
z_i,
\]</span> which if we were to take the derivative with respect to <span class="math inline">\(z_k\)</span> would simply return the gradient of the one term in the sum for which the derivative was non zero, that of <span class="math inline">\(a_k\)</span>, so we know that <span class="math display">\[
\frac{\text{d}}{\text{d}z_k} \mathbf{a}^\top \mathbf{z} = a_k
\]</span> and by our definition of multivariate derivatives we can simply stack all the partial derivatives of this form in a vector to obtain the result that <span class="math display">\[
\frac{\text{d}}{\text{d}\mathbf{z}}
\mathbf{a}^\top \mathbf{z} = \mathbf{a}.
\]</span> The second rule that's required is differentiation of a 'matrix quadratic'. A scalar quadratic in <span class="math inline">\(z\)</span> with coefficient <span class="math inline">\(c\)</span> has the form <span class="math inline">\(cz^2\)</span>. If <span class="math inline">\(\mathbf{z}\)</span> is a <span class="math inline">\(k\times 1\)</span> vector and <span class="math inline">\(\mathbf{C}\)</span> is a <span class="math inline">\(k \times k\)</span> <em>matrix</em> of coefficients then the matrix quadratic form is written as <span class="math inline">\(\mathbf{z}^\top \mathbf{C}\mathbf{z}\)</span>, which is itself a <em>scalar</em> quantity, but it is a function of a <em>vector</em>.</p>
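<p>The inner product rule can be verified with central finite differences; a small sketch with arbitrary vectors:</p>

```python
import numpy as np

np.random.seed(4)
a = np.random.randn(5)
z = np.random.randn(5)

# analytic derivative of a^T z with respect to z is simply a
analytic = a

# check each partial derivative with a central finite difference
eps = 1e-6
numeric = np.zeros(5)
for k in range(5):
    z_plus, z_minus = z.copy(), z.copy()
    z_plus[k] += eps
    z_minus[k] -= eps
    numeric[k] = (np.dot(a, z_plus) - np.dot(a, z_minus)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-5))
```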
<h4 id="matching-dimensions-in-matrix-multiplications">Matching Dimensions in Matrix Multiplications</h4>
<p>There's a trick for telling that it's a scalar result. When you are doing maths with matrices, it's always worth pausing to perform a quick sanity check on the dimensions. Matrix multplication only works when the dimensions match. To be precise, the 'inner' dimension of the matrix must match. What is the inner dimension. If we multiply two matrices <span class="math inline">\(\mathbf{A}\)</span> and <span class="math inline">\(\mathbf{B}\)</span>, the first of which has <span class="math inline">\(k\)</span> rows and <span class="math inline">\(\ell\)</span> columns and the second of which has <span class="math inline">\(p\)</span> rows and <span class="math inline">\(q\)</span> columns, then we can check whether the multiplication works by writing the dimensionalities next to each other, <span class="math display">\[
\mathbf{A} \mathbf{B} \rightarrow (k \times
\underbrace{\ell)(p}_\text{inner dimensions} \times q) \rightarrow (k\times q).
\]</span> The inner dimensions are the two inside dimensions, <span class="math inline">\(\ell\)</span> and <span class="math inline">\(p\)</span>. The multiplication will only work if <span class="math inline">\(\ell=p\)</span>. The result of the multiplication will then be a <span class="math inline">\(k\times q\)</span> matrix: this dimensionality comes from the 'outer dimensions'. Note that matrix multiplication is not <a href="http://en.wikipedia.org/wiki/Commutative_property"><em>commutative</em></a>: if you change the order of the multiplication, <span class="math display">\[
\mathbf{B} \mathbf{A} \rightarrow (\ell \times \underbrace{k)(q}_\text{inner dimensions} \times p) \rightarrow (\ell \times p).
\]</span> firstly it may no longer even work, because now the condition is that <span class="math inline">\(k=q\)</span>, and secondly the result could be of a different dimensionality. An exception is if the matrices are square matrices (i.e. same number of rows as columns) and they are both <em>symmetric</em>. A symmetric matrix is one for which <span class="math inline">\(\mathbf{A}=\mathbf{A}^\top\)</span>, or equivalently, <span class="math inline">\(a_{i,j} = a_{j,i}\)</span> for all <span class="math inline">\(i\)</span> and <span class="math inline">\(j\)</span>.</p>
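<p>The dimension check mirrors what <code>numpy</code> enforces at run time; a quick sketch:</p>

```python
import numpy as np

A = np.ones((3, 4))  # k=3 rows, ell=4 columns
B = np.ones((4, 2))  # p=4 rows, q=2 columns

# inner dimensions match (4 == 4), so A @ B works and the result is (3, 2)
product_shape = (A @ B).shape
print(product_shape)

# reversing the order fails: the inner dimensions are now 2 and 3
try:
    B @ A
    mismatch_raised = False
except ValueError:
    mismatch_raised = True
print(mismatch_raised)
```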
<p>You will need to get used to working with matrices and vectors when applying and developing new machine learning techniques. You should have come across them before, but you may not have used them as extensively as we will now do in this course. You should get used to using this trick to check your work and ensure you know what the dimension of an output matrix should be. For our matrix quadratic form, it turns out that we can see it as a special type of inner product. <span class="math display">\[
\mathbf{z}^\top\mathbf{C}\mathbf{z} \rightarrow (1\times
\underbrace{k) (k}_\text{inner dimensions}\times k) (k\times 1) \rightarrow
\mathbf{b}^\top\mathbf{z}
\]</span> where <span class="math inline">\(\mathbf{b} = \mathbf{C}\mathbf{z}\)</span> so therefore the result is a scalar, <span class="math display">\[
\mathbf{b}^\top\mathbf{z} \rightarrow
(1\times \underbrace{k) (k}_\text{inner dimensions}\times 1) \rightarrow
(1\times 1)
\]</span> where a <span class="math inline">\((1\times 1)\)</span> matrix is recognised as a scalar.</p>
<p>This implies that we should be able to differentiate this form, and indeed the rule for its differentiation is slightly more complex than the inner product, but still quite simple, <span class="math display">\[
\frac{\text{d}}{\text{d}\mathbf{z}}
\mathbf{z}^\top\mathbf{C}\mathbf{z}= \mathbf{C}\mathbf{z} + \mathbf{C}^\top
\mathbf{z}.
\]</span> Note that in the special case where <span class="math inline">\(\mathbf{C}\)</span> is symmetric then we have <span class="math inline">\(\mathbf{C} = \mathbf{C}^\top\)</span> and the derivative simplifies to <span class="math display">\[
\frac{\text{d}}{\text{d}\mathbf{z}} \mathbf{z}^\top\mathbf{C}\mathbf{z}=
2\mathbf{C}\mathbf{z}.
\]</span></p>
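<p>As with the inner product rule, the matrix quadratic rule can be checked against finite differences; a sketch with an arbitrary non-symmetric <span class="math inline">\(\mathbf{C}\)</span>:</p>

```python
import numpy as np

np.random.seed(5)
k = 4
C = np.random.randn(k, k)  # a general (non-symmetric) coefficient matrix
z = np.random.randn(k)

# analytic derivative of z^T C z: C z + C^T z
analytic = C @ z + C.T @ z

# central finite differences as a check
eps = 1e-6
numeric = np.zeros(k)
for i in range(k):
    z_plus, z_minus = z.copy(), z.copy()
    z_plus[i] += eps
    z_minus[i] -= eps
    numeric[i] = (z_plus @ C @ z_plus - z_minus @ C @ z_minus) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-4))
```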
<h3 id="differentiate-the-objective">Differentiate the Objective</h3>
<p>First, we need to compute the full objective by substituting our prediction function into the objective function to obtain the objective in terms of <span class="math inline">\(\mappingVector\)</span>. Doing this we obtain <span class="math display">\[
\errorFunction(\mappingVector)= (\dataVector - \inputMatrix\mappingVector)^\top (\dataVector - \inputMatrix\mappingVector).
\]</span> We now need to differentiate this <em>quadratic form</em> to find the minimum. We differentiate with respect to the <em>vector</em> <span class="math inline">\(\mappingVector\)</span>. But before we do that, we'll expand the brackets in the quadratic form to obtain a series of scalar terms. The rules for bracket expansion across the vectors are similar to those for the scalar system giving, <span class="math display">\[
(\mathbf{a} - \mathbf{b})^\top
(\mathbf{c} - \mathbf{d}) = \mathbf{a}^\top \mathbf{c} - \mathbf{a}^\top
\mathbf{d} - \mathbf{b}^\top \mathbf{c} + \mathbf{b}^\top \mathbf{d}
\]</span> which substituting for <span class="math inline">\(\mathbf{a} = \mathbf{c} = \dataVector\)</span> and <span class="math inline">\(\mathbf{b}=\mathbf{d} = \inputMatrix\mappingVector\)</span> gives <span class="math display">\[
\errorFunction(\mappingVector)=
\dataVector^\top\dataVector - 2\dataVector^\top\inputMatrix\mappingVector +
\mappingVector^\top\inputMatrix^\top\inputMatrix\mappingVector
\]</span> where we used the fact that <span class="math inline">\(\dataVector^\top\inputMatrix\mappingVector=\mappingVector^\top\inputMatrix^\top\dataVector\)</span>. Now we can use our rules of differentiation to compute the derivative of this form, which is, <span class="math display">\[
\frac{\text{d}}{\text{d}\mappingVector}\errorFunction(\mappingVector)=- 2\inputMatrix^\top \dataVector +
2\inputMatrix^\top\inputMatrix\mappingVector,
\]</span> where we have exploited the fact that <span class="math inline">\(\inputMatrix^\top\inputMatrix\)</span> is symmetric to obtain this result.</p>
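<p>It is good practice to verify a derivative like this numerically before using it. The sketch below builds a small random design matrix and compares the analytic gradient to central finite differences:</p>

```python
import numpy as np

np.random.seed(6)
n = 15
X = np.hstack([np.ones((n, 1)), np.random.rand(n, 1)])  # design matrix [1, x]
y = X @ np.array([[2.3], [1.4]]) + 0.05 * np.random.randn(n, 1)
w = np.random.randn(2, 1)

# analytic gradient: -2 X^T y + 2 X^T X w
analytic = -2 * X.T @ y + 2 * X.T @ X @ w

def error(w):
    """Sum-of-squares error as a scalar."""
    resid = y - X @ w
    return float(resid.T @ resid)

# finite-difference check of each component of the gradient
eps = 1e-6
numeric = np.zeros((2, 1))
for i in range(2):
    w_plus, w_minus = w.copy(), w.copy()
    w_plus[i] += eps
    w_minus[i] -= eps
    numeric[i] = (error(w_plus) - error(w_minus)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-4))
```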
<h3 id="question-5">Question 5</h3>
<p>Use the equivalence between our vector and our matrix formulations of linear regression, alongside our definition of vector derivatives, to match the gradients we've computed directly for <span class="math inline">\(\frac{\text{d}\errorFunction(c, m)}{\text{d}c}\)</span> and <span class="math inline">\(\frac{\text{d}\errorFunction(c, m)}{\text{d}m}\)</span> to those for <span class="math inline">\(\frac{\text{d}\errorFunction(\mappingVector)}{\text{d}\mappingVector}\)</span>.</p>
<p><em>20 marks</em></p>
<h3 id="write-your-answer-to-question-5-here">Write your answer to Question 5 here</h3>
<h2 id="update-equation-for-global-optimum">Update Equation for Global Optimum</h2>
<p>Once again, we need to find the minimum of our objective function. Using our objective for multiple input regression we can now minimize for our parameter vector <span class="math inline">\(\mappingVector\)</span>. Firstly, just as in the single input case, we seek stationary points by finding parameter vectors that solve for when the gradients are zero, <span class="math display">\[
\mathbf{0}=- 2\inputMatrix^\top
\dataVector + 2\inputMatrix^\top\inputMatrix\mappingVector,
\]</span> where <span class="math inline">\(\mathbf{0}\)</span> is a <em>vector</em> of zeros. Rearranging this equation we find the solution to be <span class="math display">\[
\mappingVector = \left[\inputMatrix^\top \inputMatrix\right]^{-1} \inputMatrix^\top
\dataVector
\]</span> where <span class="math inline">\(\mathbf{A}^{-1}\)</span> denotes <a href="http://en.wikipedia.org/wiki/Invertible_matrix"><em>matrix inverse</em></a>.</p>
<h3 id="solving-the-multivariate-system">Solving the Multivariate System</h3>
<p>The solution for <span class="math inline">\(\mappingVector\)</span> is given in terms of a matrix inverse, but computation of a matrix inverse requires, in itself, an algorithm to resolve it. You'll know this if you had to invert, by hand, a <span class="math inline">\(3\times 3\)</span> matrix in high school. From a numerical stability perspective, it is also best not to compute the matrix inverse directly, but rather to ask the computer to <em>solve</em> the system of linear equations given by <span class="math display">\[\inputMatrix^\top\inputMatrix \mappingVector = \inputMatrix^\top\dataVector\]</span> for <span class="math inline">\(\mappingVector\)</span>. This can be done in <code>numpy</code> using the command</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">np.linalg.solve?</code></pre></div>
<p>so we can obtain the solution using</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">w <span class="op">=</span> np.linalg.solve(np.dot(X.T, X), np.dot(X.T, y))
<span class="bu">print</span>(w)</code></pre></div>
<p>We can map it back to the linear regression and plot the fit as follows.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">m <span class="op">=</span> w[<span class="dv">1</span>]<span class="op">;</span> c<span class="op">=</span>w[<span class="dv">0</span>]
f_test <span class="op">=</span> m<span class="op">*</span>x_test <span class="op">+</span> c
<span class="bu">print</span>(m)
<span class="bu">print</span>(c)
plt.plot(x_test, f_test, <span class="st">'b-'</span>)
plt.plot(x, y, <span class="st">'rx'</span>)</code></pre></div>
<h3 id="multivariate-linear-regression">Multivariate Linear Regression</h3>
<p>A major advantage of the new system is that we can build a linear regression on a multivariate system. The matrix calculus didn't specify what the length of the vector <span class="math inline">\(\inputVector\)</span> should be, or equivalently the size of the design matrix.</p>
<h3 id="movie-body-count-data">Movie Body Count Data</h3>
<p>Let's consider the movie body count data.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">data <span class="op">=</span> pods.datasets.movie_body_count()
movies <span class="op">=</span> data[<span class="st">'Y'</span>]</code></pre></div>
<p>Let's remind ourselves of the features we've been provided with.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="bu">print</span>(<span class="st">', '</span>.join(movies.columns))</code></pre></div>
<p>Now we will build a design matrix based on the numeric features, Year, Body_Count and Length_Minutes, in an effort to predict the rating. We build the design matrix as follows:</p>
<h3 id="relation-to-single-input-system">Relation to Single Input System</h3>
<p>Bias as an additional feature.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">select_features <span class="op">=</span> [<span class="st">'Year'</span>, <span class="st">'Body_Count'</span>, <span class="st">'Length_Minutes'</span>]
X <span class="op">=</span> movies[select_features].copy() <span class="co"># copy so we don't modify the original data frame</span>
X[<span class="st">'Eins'</span>] <span class="op">=</span> <span class="dv">1</span> <span class="co"># add a column for the offset</span>
y <span class="op">=</span> movies[[<span class="st">'IMDB_Rating'</span>]]</code></pre></div>
<p>Now let's perform a linear regression. But this time, we will create a pandas data frame for the result so we can store it in a form that we can visualise easily.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">import</span> pandas <span class="im">as</span> pd
w <span class="op">=</span> pd.DataFrame(data<span class="op">=</span>np.linalg.solve(np.dot(X.T, X), np.dot(X.T, y)), <span class="co"># solve linear regression here</span>
index <span class="op">=</span> X.columns, <span class="co"># columns of X become rows of w</span>
columns<span class="op">=</span>[<span class="st">'regression_coefficient'</span>]) <span class="co"># the column of X is the value of regression coefficient</span></code></pre></div>
<p>We can check the residuals to see how good our estimates are.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">(y <span class="op">-</span> np.dot(X, w)).hist()</code></pre></div>
<p>Which shows our model <em>hasn't</em> yet done a great job of representing the data, because the spread of residuals is large. We can check what the rating is dominated by in terms of regression coefficients.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">w</code></pre></div>
<p>We have to be a little careful about interpretation because our input values live on different scales, but it looks like the fit is dominated by the bias, with a small negative effect for later films (bear in mind the year values are large, so this effect is probably larger than it looks) and a positive effect for length. So it looks like long, earlier films generally do better, but the residuals are so large that we probably haven't modelled the system very well.</p>
<p><a href="https://www.youtube.com/watch?v=ui-uNlFHoms&t="><img src="https://img.youtube.com/vi/ui-uNlFHoms/0.jpg" /></a></p>
<p><a href="https://www.youtube.com/watch?v=78YNphT90-k&t="><img src="https://img.youtube.com/vi/78YNphT90-k/0.jpg" /></a></p>
<h3 id="solution-with-qr-decomposition">Solution with QR Decomposition</h3>
<p>Performing a solve instead of a matrix inverse is the more numerically stable approach, but we can do even better. A <a href="http://en.wikipedia.org/wiki/QR_decomposition">QR-decomposition</a> of a matrix factorises it into the product of an orthogonal matrix, <span class="math inline">\(\mathbf{Q}\)</span>, for which <span class="math inline">\(\mathbf{Q}^\top \mathbf{Q} = \eye\)</span>, and an upper triangular matrix, <span class="math inline">\(\mathbf{R}\)</span>. <span class="math display">\[
\inputMatrix^\top \inputMatrix \boldsymbol{\beta} =
\inputMatrix^\top \dataVector
\]</span> <span class="math display">\[
(\mathbf{Q}\mathbf{R})^\top
(\mathbf{Q}\mathbf{R})\boldsymbol{\beta} = (\mathbf{Q}\mathbf{R})^\top
\dataVector
\]</span> <span class="math display">\[
\mathbf{R}^\top (\mathbf{Q}^\top \mathbf{Q}) \mathbf{R}
\boldsymbol{\beta} = \mathbf{R}^\top \mathbf{Q}^\top \dataVector
\]</span> <span class="math display">\[
\mathbf{R}^\top \mathbf{R} \boldsymbol{\beta} = \mathbf{R}^\top \mathbf{Q}^\top
\dataVector
\]</span> <span class="math display">\[
\mathbf{R} \boldsymbol{\beta} = \mathbf{Q}^\top \dataVector
\]</span> This is a more numerically stable solution because it removes the need to compute <span class="math inline">\(\inputMatrix^\top\inputMatrix\)</span> as an intermediate. Computing <span class="math inline">\(\inputMatrix^\top\inputMatrix\)</span> is a bad idea because it involves squaring all the elements of <span class="math inline">\(\inputMatrix\)</span> and thereby potentially reducing the numerical precision with which we can represent the solution. Operating on <span class="math inline">\(\inputMatrix\)</span> directly preserves the numerical precision of the model.</p>
<p>This will become particularly apparent when we begin to work with <em>basis functions</em> in the next session. Some systems that can be resolved with the QR decomposition cannot be resolved by using solve directly.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">import</span> scipy <span class="im">as</span> sp
Q, R <span class="op">=</span> np.linalg.qr(X)
w <span class="op">=</span> sp.linalg.solve_triangular(R, np.dot(Q.T, y))
w <span class="op">=</span> pd.DataFrame(w, index<span class="op">=</span>X.columns)
w</code></pre></div>
<h3 id="reading-1">Reading</h3>
<ul>
<li>Section 1.3 of <span class="citation">Rogers and Girolami (2011)</span> for Matrix &amp; Vector Review.</li>
</ul>
<h3 id="basis-functions">Basis Functions</h3>
<p>Here's the idea: instead of working directly on the original input space, <span class="math inline">\(\inputVector\)</span>, we build models in a new space, <span class="math inline">\(\basisVector(\inputVector)\)</span>, where <span class="math inline">\(\basisVector(\cdot)\)</span> is a <em>vector-valued</em> function defined on the input space, <span class="math inline">\(\inputVector\)</span>.</p>
<h3 id="quadratic-basis">Quadratic Basis</h3>
<p>Remember that a <em>vector-valued function</em> is just a vector that contains functions instead of values. Here's an example for a one dimensional input space, <span class="math inline">\(x\)</span>, being projected to a <em>quadratic</em> basis. First we consider each basis function in turn; we can think of the elements of our vector as being indexed so that we have <span class="math display">\[
\begin{align*}
\basisFunc_1(\inputScalar) = 1, \\
\basisFunc_2(\inputScalar) = x, \\
\basisFunc_3(\inputScalar) = \inputScalar^2.
\end{align*}
\]</span> Now we can consider them together by placing them in a vector, <span class="math display">\[
\basisVector(\inputScalar) = \begin{bmatrix} 1\\ x \\ \inputScalar^2\end{bmatrix}.
\]</span> For the vector-valued function, we have simply collected the different functions together in the same vector making them notationally easier to deal with in our mathematics.</p>
<p>When we consider the vector-valued function for each data point, then we place all the data into a matrix. The result is a matrix valued function, <span class="math display">\[
\basisMatrix(\inputVector) =
\begin{bmatrix} 1 & \inputScalar_1 &
\inputScalar_1^2 \\
1 & \inputScalar_2 & \inputScalar_2^2\\
\vdots & \vdots & \vdots \\
1 & \inputScalar_n & \inputScalar_n^2
\end{bmatrix}
\]</span> where we are still in the one dimensional input setting so <span class="math inline">\(\inputVector\)</span> here represents a vector of our inputs with <span class="math inline">\(\numData\)</span> elements.</p>
<p>Let's try constructing such a matrix for a set of inputs. First of all, we create a function that returns the matrix valued function</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">import</span> numpy <span class="im">as</span> np</code></pre></div>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="kw">def</span> quadratic(x, <span class="op">**</span>kwargs):
<span class="co">"""Take in a vector of input values and return the design matrix associated </span>
<span class="co"> with the basis functions."""</span>
<span class="cf">return</span> np.hstack([np.ones((x.shape[<span class="dv">0</span>], <span class="dv">1</span>)), x, x<span class="op">**</span><span class="dv">2</span>])</code></pre></div>
<h3 id="functions-derived-from-quadratic-basis">Functions Derived from Quadratic Basis</h3>
<p><span class="math display">\[
\mappingFunction(\inputScalar) = {\color{cyan}\mappingScalar_0} + {\color{green}\mappingScalar_1 \inputScalar} + {\color{yellow}\mappingScalar_2 \inputScalar^2}
\]</span></p>
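<p>As a quick check of this weighted sum (a minimal sketch: the weight values are arbitrary, and the <code>quadratic()</code> basis from above is inlined so the snippet stands alone), we can evaluate the prediction function by multiplying the design matrix by a weight vector:</p>

```python
import numpy as np

def quadratic(x):
    """Design matrix for the quadratic basis: columns 1, x, x**2."""
    return np.hstack([np.ones((x.shape[0], 1)), x, x**2])

# arbitrary illustrative weights w_0, w_1, w_2 (not fitted to any data)
w = np.array([[0.5], [-1.0], [2.0]])
x = np.linspace(-1, 1, 5)[:, None]
f = quadratic(x) @ w  # f(x) = w_0 + w_1*x + w_2*x^2
# check against direct evaluation of the polynomial
print(np.allclose(f, 0.5 - 1.0*x + 2.0*x**2))
```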
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">import</span> matplotlib.pyplot <span class="im">as</span> plt
<span class="im">import</span> teaching_plots <span class="im">as</span> plot</code></pre></div>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">f, ax <span class="op">=</span> plt.subplots(figsize<span class="op">=</span>plot.big_wide_figsize)
loc <span class="op">=</span>[[<span class="dv">0</span>, <span class="fl">1.4</span>,],
[<span class="dv">0</span>, <span class="op">-</span><span class="fl">0.7</span>],
[<span class="fl">0.75</span>, <span class="op">-</span><span class="fl">0.2</span>]]
text <span class="op">=</span>[<span class="st">'$\phi(x) = 1$'</span>,
<span class="st">'$\phi(x) = x$'</span>,
<span class="st">'$\phi(x) = x^2$'</span>]
plot.basis(quadratic, x_min<span class="op">=-</span><span class="fl">1.3</span>, x_max<span class="op">=</span><span class="fl">1.3</span>,
fig<span class="op">=</span>f, ax<span class="op">=</span>ax, loc<span class="op">=</span>loc, text<span class="op">=</span>text,
diagrams<span class="op">=</span><span class="st">'../slides/diagrams/ml'</span>)</code></pre></div>
<object class="svgplot" align data="../slides/diagrams/ml/quadratic_basis002.svg">
</object>
<p>This function takes in an <span class="math inline">\(\numData \times 1\)</span> dimensional vector and returns an <span class="math inline">\(\numData \times 3\)</span> dimensional <em>design matrix</em> containing the basis functions. We can plot those basis functions against their inputs as follows.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="co"># first let's generate some inputs</span>
n <span class="op">=</span> <span class="dv">100</span>
x <span class="op">=</span> np.zeros((n, <span class="dv">1</span>)) <span class="co"># create a data set of zeros</span>
x[:, <span class="dv">0</span>] <span class="op">=</span> np.linspace(<span class="op">-</span><span class="dv">1</span>, <span class="dv">1</span>, n) <span class="co"># fill it with values between -1 and 1</span>
Phi <span class="op">=</span> quadratic(x)
fig, ax <span class="op">=</span> plt.subplots(figsize<span class="op">=</span>plot.big_wide_figsize)
ax.set_ylim([<span class="op">-</span><span class="fl">1.2</span>, <span class="fl">1.2</span>]) <span class="co"># set y limits to ensure basis functions show.</span>
ax.plot(x[:,<span class="dv">0</span>], Phi[:, <span class="dv">0</span>], <span class="st">'r-'</span>, label <span class="op">=</span> <span class="st">'$\phi=1$'</span>, linewidth<span class="op">=</span><span class="dv">3</span>)
ax.plot(x[:,<span class="dv">0</span>], Phi[:, <span class="dv">1</span>], <span class="st">'g-'</span>, label <span class="op">=</span> <span class="st">'$\phi=x$'</span>, linewidth<span class="op">=</span><span class="dv">3</span>)
ax.plot(x[:,<span class="dv">0</span>], Phi[:, <span class="dv">2</span>], <span class="st">'b-'</span>, label <span class="op">=</span> <span class="st">'$\phi=x^2$'</span>, linewidth<span class="op">=</span><span class="dv">3</span>)
ax.legend(loc<span class="op">=</span><span class="st">'lower right'</span>)
_ <span class="op">=</span> ax.set_title(<span class="st">'Quadratic Basis Functions'</span>)</code></pre></div>
<p>The actual function we observe is then made up of a sum of these functions. This is the reason for the name basis. The term <em>basis</em> means 'the underlying support or foundation for an idea, argument, or process', and in this context they form the underlying support for our prediction function. Our prediction function can only be composed of a weighted linear sum of our basis functions.</p>
<h3 id="quadratic-functions">Quadratic Functions</h3>
<object class="svgplot" align data="../slides/diagrams/ml/quadratic_function002.svg">
</object>
<h3 id="polynomial-fits-to-olympic-data">Polynomial Fits to Olympic Data</h3>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">import</span> numpy <span class="im">as</span> np
<span class="im">from</span> matplotlib <span class="im">import</span> pyplot <span class="im">as</span> plt
<span class="im">import</span> teaching_plots <span class="im">as</span> plot
<span class="im">import</span> mlai
<span class="im">import</span> pods</code></pre></div>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">basis <span class="op">=</span> mlai.polynomial
data <span class="op">=</span> pods.datasets.olympic_marathon_men()
x <span class="op">=</span> data[<span class="st">'X'</span>]
y <span class="op">=</span> data[<span class="st">'Y'</span>]
xlim <span class="op">=</span> [<span class="dv">1892</span>, <span class="dv">2020</span>]
basis<span class="op">=</span>mlai.basis(mlai.polynomial, number<span class="op">=</span><span class="dv">1</span>, data_limits<span class="op">=</span>xlim)</code></pre></div>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">plot.rmse_fit(x, y, param_name<span class="op">=</span><span class="st">'number'</span>, param_range<span class="op">=</span>(<span class="dv">1</span>, <span class="dv">27</span>),
model<span class="op">=</span>mlai.LM,
basis<span class="op">=</span>basis,
xlim<span class="op">=</span>xlim, objective_ylim<span class="op">=</span>[<span class="dv">0</span>, <span class="fl">0.8</span>],
diagrams<span class="op">=</span><span class="st">'../slides/diagrams/ml'</span>)</code></pre></div>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">from</span> ipywidgets <span class="im">import</span> IntSlider</code></pre></div>
<object class="svgplot" align data="../slides/diagrams/ml/olympic_LM_polynomial_num_basis026.svg">
</object>
<h2 id="underdetermined-system">Underdetermined System</h2>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">import</span> teaching_plots <span class="im">as</span> plot</code></pre></div>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">plot.under_determined_system(diagrams<span class="op">=</span><span class="st">'../slides/diagrams/ml'</span>)</code></pre></div>
<p>What about the situation where you have more parameters than data in your simultaneous equations? This is known as an <em>underdetermined</em> system. In fact this setup is in some sense <em>easier</em> to solve, because we don't need to think about introducing a slack variable (although it might make a lot of sense from a <em>modelling</em> perspective to do so).</p>
<p>The way Laplace proposed resolving an overdetermined system was to introduce slack variables, <span class="math inline">\(\noiseScalar_i\)</span>, which needed to be estimated for each point. The slack variable represented the difference between our actual prediction and the true observation. This is known as the <em>residual</em>. By introducing the slack variables we have an additional <span class="math inline">\(n\)</span> variables to estimate, one for each data point, <span class="math inline">\(\{\noiseScalar_i\}\)</span>. Introducing these <span class="math inline">\(n\)</span> variables, plus the original <span class="math inline">\(m\)</span> and <span class="math inline">\(c\)</span>, gives us <span class="math inline">\(\numData+2\)</span> parameters to be estimated from <span class="math inline">\(n\)</span> observations, which turns the overdetermined system into an <em>underdetermined</em> one. However, we then made a probabilistic assumption about the slack variables: we assumed that they were distributed according to a probability density. And for the moment we have been assuming that density was the Gaussian, <span class="math display">\[\noiseScalar_i \sim \gaussianSamp{0}{\dataStd^2},\]</span> with zero mean and variance <span class="math inline">\(\dataStd^2\)</span>.</p>
<p>The follow-up question is whether we can do the same thing with the parameters. If we have two parameters and only one observation, can we place a probability distribution over the parameters, as we did with the slack variables? The answer is yes.</p>
<h3 id="underdetermined-system-1">Underdetermined System</h3>
<object class="svgplot" align data="../slides/diagrams/ml/under_determined_system009.svg">
</object>
<center>
<em>Fit underdetermined system by considering uncertainty </em>
</center>
<h3 id="alan-turing">Alan Turing</h3>
<table>
<tr>
<td width="50%">
<img class="" src="../slides/diagrams/turing-times.gif" width="" align="center" style="background:none; border:none; box-shadow:none;">
</td>
<td width="50%">
<img class="" src="../slides/diagrams/turing-run.jpg" width="" align="center" style="background:none; border:none; box-shadow:none;">
</td>
</tr>
</table>
<center>
<em>Alan Turing: in 1946 he was only 11 minutes slower than the winner of the 1948 games. Would he have won a hypothetical games held in 1946? Source: <a href="http://www.turing.org.uk/scrapbook/run.html">Alan Turing Internet Scrapbook</a>.</em>
</center>
<p>If we had to summarise the objectives of machine learning in one word, a very good candidate for that word would be <em>generalization</em>. What is generalization? From a human perspective it might be summarised as the ability to take lessons learned in one domain and apply them to another domain. If we accept the definition given in the first session for machine learning, <span class="math display">\[
\text{data} + \text{model} \xrightarrow{\text{compute}} \text{prediction}
\]</span> then we see that without a model we can't generalise: we only have data. Data is fine for answering very specific questions, like "Who won the Olympic Marathon in 2012?", because we have that answer stored. However, we are not given the answer to many other questions. For example, Alan Turing was a formidable marathon runner; in 1946 he ran a time of 2 hours 46 minutes (just under four minutes per kilometer, faster than I and most of the other <a href="http://www.parkrun.org.uk/sheffieldhallam/">Endcliffe Park Run</a> runners can do 5 km). What is the probability he would have won an Olympics if one had been held in 1946?</p>
<p>To answer this question we need to generalize, but before we formalize the concept of generalization let's introduce some formal representation of what it means to generalize in machine learning.</p>
<object class="svgplot" align data="../slides/diagrams/ml/dem_gaussian003.svg">
</object>
<center>
<em>Combining a Gaussian likelihood with a Gaussian prior to form a Gaussian posterior </em>
</center>
<h3 id="main-trick">Main Trick</h3>
<p><span class="math display">\[p(c) = \frac{1}{\sqrt{2\pi\alpha_1}} \exp\left(-\frac{1}{2\alpha_1}c^2\right)\]</span> <span class="math display">\[p(\dataVector|\inputVector, c, m, \dataStd^2) = \frac{1}{\left(2\pi\dataStd^2\right)^{\frac{\numData}{2}}} \exp\left(-\frac{1}{2\dataStd^2}\sum_{i=1}^\numData(\dataScalar_i - m\inputScalar_i - c)^2\right)\]</span></p>
<h3 id="section"></h3>
<p><span class="math display">\[p(c| \dataVector, \inputVector, m, \dataStd^2) = \frac{p(\dataVector|\inputVector, c, m, \dataStd^2)p(c)}{p(\dataVector|\inputVector, m, \dataStd^2)}\]</span></p>
<p><span class="math display">\[p(c| \dataVector, \inputVector, m, \dataStd^2) = \frac{p(\dataVector|\inputVector, c, m, \dataStd^2)p(c)}{\int p(\dataVector|\inputVector, c, m, \dataStd^2)p(c) \text{d} c}\]</span></p>
<h3 id="section-1"></h3>
<p><span class="math display">\[p(c| \dataVector, \inputVector, m, \dataStd^2) \propto p(\dataVector|\inputVector, c, m, \dataStd^2)p(c)\]</span></p>
<p><span class="math display">\[\begin{aligned}
\log p(c | \dataVector, \inputVector, m, \dataStd^2) =&-\frac{1}{2\dataStd^2} \sum_{i=1}^\numData(\dataScalar_i-c - m\inputScalar_i)^2-\frac{1}{2\alpha_1} c^2 + \text{const}\\
= &-\frac{1}{2\dataStd^2}\sum_{i=1}^\numData(\dataScalar_i-m\inputScalar_i)^2 -\left(\frac{\numData}{2\dataStd^2} + \frac{1}{2\alpha_1}\right)c^2\\
& + c\frac{\sum_{i=1}^\numData(\dataScalar_i-m\inputScalar_i)}{\dataStd^2},
\end{aligned}\]</span></p>
<h3 id="section-2"></h3>
<p>complete the square of the quadratic form to obtain <span class="math display">\[\log p(c | \dataVector, \inputVector, m, \dataStd^2) = -\frac{1}{2\tau^2}(c - \mu)^2 +\text{const},\]</span> where <span class="math inline">\(\tau^2 = \left(\numData\dataStd^{-2} +\alpha_1^{-1}\right)^{-1}\)</span> and <span class="math inline">\(\mu = \frac{\tau^2}{\dataStd^2} \sum_{i=1}^\numData(\dataScalar_i-m\inputScalar_i)\)</span>.</p>
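<p>These update equations are easy to verify numerically. Below is a minimal sketch with made-up values for <code>x</code>, <code>y</code>, <code>m</code>, the noise variance and the prior variance (none of them come from the data above):</p>

```python
import numpy as np

# illustrative values, not taken from the notebook's data
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.1, 3.9, 6.2])
m = 2.0        # slope, assumed known
sigma2 = 0.25  # noise variance sigma^2
alpha1 = 1.0   # prior variance on c

n = x.shape[0]
tau2 = 1.0/(n/sigma2 + 1.0/alpha1)  # posterior variance tau^2
mu = (tau2/sigma2)*np.sum(y - m*x)  # posterior mean mu
print(tau2, mu)
```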
<h3 id="two-dimensional-gaussian">Two Dimensional Gaussian</h3>
<ul>
<li>Consider height, <span class="math inline">\(h/m\)</span> and weight, <span class="math inline">\(w/kg\)</span>.</li>
<li>Could sample height from a distribution: <span class="math display">\[
h \sim \gaussianSamp{1.7}{0.0225}
\]</span></li>
<li>And similarly weight: <span class="math display">\[
w \sim \gaussianSamp{75}{36}
\]</span></li>
</ul>
<object class="svgplot" align data="../slides/diagrams/ml/independent_height_weight007.svg">
</object>
<center>
<em>Samples from independent Gaussian variables that might represent heights and weights. </em>
</center>
<h3 id="independence-assumption">Independence Assumption</h3>
<ul>
<li><p>This assumes height and weight are independent. <span class="math display">\[p(h, w) = p(h)p(w)\]</span></p></li>
<li><p>In reality they are dependent (body mass index) <span class="math inline">\(= \frac{w}{h^2}\)</span>.</p></li>
</ul>
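<p>Sampling under the independence assumption is straightforward (a sketch using the illustrative means and variances from the bullets above; note that NumPy's samplers take standard deviations, so we pass the square roots of the variances):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10000
h = rng.normal(1.7, np.sqrt(0.0225), size=n)  # heights in m, std 0.15
w = rng.normal(75.0, np.sqrt(36.0), size=n)   # weights in kg, std 6
# under p(h, w) = p(h)p(w) the sample correlation should be near zero
print(np.corrcoef(h, w)[0, 1])
```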
<object class="svgplot" align data="../slides/diagrams/ml/correlated_height_weight007.svg">
</object>
<center>
<em>Samples from correlated Gaussian variables that might represent heights and weights. </em>
</center>
<h3 id="independent-gaussians">Independent Gaussians</h3>
<p><span class="math display">\[
p(w, h) = p(w)p(h)
\]</span></p>
<p><span class="math display">\[
p(w, h) = \frac{1}{\sqrt{2\pi \dataStd_1^2}\sqrt{2\pi\dataStd_2^2}} \exp\left(-\frac{1}{2}\left(\frac{(w-\meanScalar_1)^2}{\dataStd_1^2} + \frac{(h-\meanScalar_2)^2}{\dataStd_2^2}\right)\right)
\]</span></p>
<p><span class="math display">\[
p(w, h) = \frac{1}{\sqrt{2\pi\dataStd_1^22\pi\dataStd_2^2}} \exp\left(-\frac{1}{2}\left(\begin{bmatrix}w \\ h\end{bmatrix} - \begin{bmatrix}\meanScalar_1 \\ \meanScalar_2\end{bmatrix}\right)^\top\begin{bmatrix}\dataStd_1^2& 0\\0&\dataStd_2^2\end{bmatrix}^{-1}\left(\begin{bmatrix}w \\ h\end{bmatrix} - \begin{bmatrix}\meanScalar_1 \\ \meanScalar_2\end{bmatrix}\right)\right)
\]</span></p>
<p><span class="math display">\[
p(\dataVector) = \frac{1}{\det{2\pi \mathbf{D}}^{\frac{1}{2}}} \exp\left(-\frac{1}{2}(\dataVector - \meanVector)^\top\mathbf{D}^{-1}(\dataVector - \meanVector)\right)
\]</span></p>
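<p>We can confirm that the matrix form agrees with the product of the two univariate densities by evaluating both at an arbitrary point (a sketch reusing the illustrative means and variances from above):</p>

```python
import numpy as np

mu = np.array([75.0, 1.7])   # means of w and h
D = np.diag([36.0, 0.0225])  # diagonal covariance matrix
v = np.array([80.0, 1.8])    # arbitrary point (w, h) at which to evaluate

def gauss(x, m, s2):
    """Univariate Gaussian density with mean m and variance s2."""
    return np.exp(-0.5*(x - m)**2/s2)/np.sqrt(2*np.pi*s2)

# product of the two univariate densities
p_product = gauss(v[0], mu[0], D[0, 0])*gauss(v[1], mu[1], D[1, 1])
# matrix form of the same density
diff = v - mu
p_matrix = np.exp(-0.5*diff @ np.linalg.inv(D) @ diff)/np.sqrt(np.linalg.det(2*np.pi*D))
print(np.isclose(p_product, p_matrix))
```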
<h3 id="correlated-gaussian">Correlated Gaussian</h3>
<p>Form correlated from original by rotating the data space using matrix <span class="math inline">\(\rotationMatrix\)</span>.</p>
<p><span class="math display">\[
p(\dataVector) = \frac{1}{\det{2\pi\mathbf{D}}^{\frac{1}{2}}} \exp\left(-\frac{1}{2}(\dataVector - \meanVector)^\top\mathbf{D}^{-1}(\dataVector - \meanVector)\right)
\]</span></p>
<p><span class="math display">\[
p(\dataVector) = \frac{1}{\det{2\pi\mathbf{D}}^{\frac{1}{2}}} \exp\left(-\frac{1}{2}(\rotationMatrix^\top\dataVector - \rotationMatrix^\top\meanVector)^\top\mathbf{D}^{-1}(\rotationMatrix^\top\dataVector - \rotationMatrix^\top\meanVector)\right)
\]</span></p>
<p><span class="math display">\[
p(\dataVector) = \frac{1}{\det{2\pi\mathbf{D}}^{\frac{1}{2}}} \exp\left(-\frac{1}{2}(\dataVector - \meanVector)^\top\rotationMatrix\mathbf{D}^{-1}\rotationMatrix^\top(\dataVector - \meanVector)\right)
\]</span> this gives a covariance matrix: <span class="math display">\[
\covarianceMatrix^{-1} = \rotationMatrix \mathbf{D}^{-1} \rotationMatrix^\top
\]</span></p>
<p><span class="math display">\[
p(\dataVector) = \frac{1}{\det{2\pi\covarianceMatrix}^{\frac{1}{2}}} \exp\left(-\frac{1}{2}(\dataVector - \meanVector)^\top\covarianceMatrix^{-1} (\dataVector - \meanVector)\right)
\]</span> this gives a covariance matrix: <span class="math display">\[
\covarianceMatrix = \rotationMatrix \mathbf{D} \rotationMatrix^\top
\]</span></p>
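<p>To check the rotation argument numerically (a minimal sketch; the rotation angle and the diagonal variances are arbitrary):</p>

```python
import numpy as np

theta = 0.3
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])  # orthogonal: R R^T = I
D = np.diag([4.0, 0.25])                         # diagonal covariance
C = R @ D @ R.T                                  # full covariance C = R D R^T
print(np.allclose(R @ R.T, np.eye(2)))
# the inverse covariance matches R D^{-1} R^T, as in the derivation above
print(np.allclose(np.linalg.inv(C), R @ np.linalg.inv(D) @ R.T))
```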
<h3 id="generating-from-the-model">Generating from the Model</h3>
<p>A very important aspect of probabilistic modelling is to <em>sample</em> from your model to see what type of assumptions you are making about your data. In this case that involves a two-stage process.</p>
<ol style="list-style-type: decimal">
<li>Sample a candidate parameter vector from the prior.</li>
<li>Place the candidate parameter vector in the likelihood and sample functions conditioned on that candidate vector.</li>
<li>Repeat to try and characterise the type of functions you are generating.</li>
</ol>
<p>Given a prior variance (as defined above) we can now sample from the prior distribution and combine with a basis set to see what assumptions we are making about the functions <em>a priori</em> (i.e. before we've seen the data). Firstly we compute the basis function matrix. We will do it both for our training data, and for a range of prediction locations (<code>x_pred</code>).</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">import</span> numpy <span class="im">as</span> np
<span class="im">import</span> pods</code></pre></div>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">data <span class="op">=</span> pods.datasets.olympic_marathon_men()
x <span class="op">=</span> data[<span class="st">'X'</span>]
y <span class="op">=</span> data[<span class="st">'Y'</span>]
num_data <span class="op">=</span> x.shape[<span class="dv">0</span>]
num_pred_data <span class="op">=</span> <span class="dv">100</span> <span class="co"># how many points to use for plotting predictions</span>
x_pred <span class="op">=</span> np.linspace(<span class="dv">1890</span>, <span class="dv">2016</span>, num_pred_data)[:, <span class="va">None</span>] <span class="co"># input locations for predictions</span></code></pre></div>
<p>Now let's build the basis matrices. We define the polynomial basis as follows.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="kw">def</span> polynomial(x, num_basis<span class="op">=</span><span class="dv">2</span>, loc<span class="op">=</span><span class="dv">0</span>., scale<span class="op">=</span><span class="dv">1</span>.):
degree<span class="op">=</span>num_basis<span class="op">-</span><span class="dv">1</span>
degrees <span class="op">=</span> np.arange(degree<span class="op">+</span><span class="dv">1</span>)
<span class="cf">return</span> ((x<span class="op">-</span>loc)<span class="op">/</span>scale)<span class="op">**</span>degrees</code></pre></div>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">import</span> mlai</code></pre></div>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">loc<span class="op">=</span><span class="dv">1950</span>
scale<span class="op">=</span><span class="dv">1</span>
degree<span class="op">=</span><span class="dv">4</span>
basis <span class="op">=</span> mlai.basis(polynomial, number<span class="op">=</span>degree<span class="op">+</span><span class="dv">1</span>, loc<span class="op">=</span>loc, scale<span class="op">=</span>scale)
Phi_pred <span class="op">=</span> basis.Phi(x_pred)
Phi <span class="op">=</span> basis.Phi(x)</code></pre></div>
<h3 id="sampling-from-the-prior">Sampling from the Prior</h3>
<p>Now we will sample from the prior to produce a vector <span class="math inline">\(\mappingVector\)</span> and use it to plot a function which is representative of our belief <em>before</em> we fit the data. To do this we are going to use the properties of the Gaussian density and a sample from a <em>standard normal</em> using the function <code>np.random.normal</code>.</p>
<h3 id="scaling-gaussian-distributed-variables">Scaling Gaussian-distributed Variables</h3>
<p>First, let's consider the case where we have one data point and one feature in our basis set. In other words, <span class="math inline">\(\mappingFunctionVector\)</span>, <span class="math inline">\(\mappingVector\)</span> and <span class="math inline">\(\basisMatrix\)</span> would all be scalars. In this case we have <span class="math display">\[
\mappingFunction = \basisScalar \mappingScalar
\]</span> If <span class="math inline">\(\mappingScalar\)</span> is drawn from a normal density, <span class="math display">\[
\mappingScalar \sim \gaussianSamp{\meanScalar_\mappingScalar}{c_\mappingScalar}
\]</span> and <span class="math inline">\(\basisScalar\)</span> is a scalar value which we are given, then properties of the Gaussian density tell us that <span class="math display">\[
\basisScalar \mappingScalar \sim \gaussianSamp{\basisScalar\meanScalar_\mappingScalar}{\basisScalar^2c_\mappingScalar}
\]</span> Let's test this out numerically. First we will draw 200 samples from a standard normal,</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">w_vec <span class="op">=</span> np.random.normal(size<span class="op">=</span><span class="dv">200</span>)</code></pre></div>
<p>We can compute the mean of these samples and their variance</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="bu">print</span>(<span class="st">'w sample mean is '</span>, w_vec.mean())
<span class="bu">print</span>(<span class="st">'w sample variance is '</span>, w_vec.var())</code></pre></div>
<p>These are close to zero (the mean) and one (the variance) as you'd expect. Now compute the mean and variance of the scaled version,</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">phi <span class="op">=</span> <span class="dv">7</span>
f_vec <span class="op">=</span> phi<span class="op">*</span>w_vec
<span class="bu">print</span>(<span class="st">'True mean should be phi*0 = 0.'</span>)
<span class="bu">print</span>(<span class="st">'True variance should be phi*phi*1 = '</span>, phi<span class="op">*</span>phi)
<span class="bu">print</span>(<span class="st">'f sample mean is '</span>, f_vec.mean())
<span class="bu">print</span>(<span class="st">'f sample variance is '</span>, f_vec.var())</code></pre></div>
<p>If you increase the number of samples, you will see the sample mean and the sample variance converge towards the true mean and the true variance. Adding an offset to a sample from <code>np.random.normal</code> changes the mean. So to sample from a Gaussian with mean <code>mu</code> and standard deviation <code>sigma</code>, one approach is to sample from the standard normal and then scale and shift the result. In other words, to sample a set of <span class="math inline">\(\mappingScalar\)</span> from a Gaussian with mean <span class="math inline">\(\meanScalar\)</span> and variance <span class="math inline">\(\alpha\)</span>, <span class="math display">\[\mappingScalar \sim \gaussianSamp{\meanScalar}{\alpha}\]</span> we simply scale and offset samples from the <em>standard normal</em>.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">mu <span class="op">=</span> <span class="dv">4</span> <span class="co"># mean of the distribution</span>
alpha <span class="op">=</span> <span class="dv">2</span> <span class="co"># variance of the distribution</span>
w_vec <span class="op">=</span> np.random.normal(size<span class="op">=</span><span class="dv">200</span>)<span class="op">*</span>np.sqrt(alpha) <span class="op">+</span> mu
<span class="bu">print</span>(<span class="st">'w sample mean is '</span>, w_vec.mean())
<span class="bu">print</span>(<span class="st">'w sample variance is '</span>, w_vec.var())</code></pre></div>
<p>Here the <code>np.sqrt</code> is necessary because we need to multiply by the standard deviation, but we specified the variance as <code>alpha</code>. So scaling and offsetting a Gaussian-distributed variable keeps the variable Gaussian, but it affects the mean and variance of the resulting variable.</p>
<p>To get an idea of the overall shape of the resulting distribution, let's do the same thing with a histogram of the results.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">import</span> matplotlib.pyplot <span class="im">as</span> plt
<span class="im">import</span> teaching_plots <span class="im">as</span> plot</code></pre></div>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="co"># First the standard normal</span>
z_vec <span class="op">=</span> np.random.normal(size<span class="op">=</span><span class="dv">1000</span>) <span class="co"># by convention, in statistics, z is often used to denote samples from the standard normal</span>
w_vec <span class="op">=</span> z_vec<span class="op">*</span>np.sqrt(alpha) <span class="op">+</span> mu
<span class="co"># plot normalized histogram of w, and then normalized histogram of z on top</span>
fig, ax <span class="op">=</span> plt.subplots(figsize<span class="op">=</span>plot.big_wide_figsize)
ax.hist(w_vec, bins<span class="op">=</span><span class="dv">30</span>, density<span class="op">=</span><span class="va">True</span>)
ax.hist(z_vec, bins<span class="op">=</span><span class="dv">30</span>, density<span class="op">=</span><span class="va">True</span>)
_ <span class="op">=</span> ax.legend((<span class="st">'$w$'</span>, <span class="st">'$z$'</span>))</code></pre></div>
<p>Now re-run this histogram with 100,000 samples and check that both histograms look qualitatively Gaussian.</p>
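<p>Such a re-run might look like the following. This is a minimal sketch using plain matplotlib defaults rather than the <code>teaching_plots</code> figure size used above; <code>mu</code> and <code>alpha</code> are set explicitly so the snippet is self-contained.</p>

```python
import numpy as np
import matplotlib.pyplot as plt

mu = 4.     # mean of the shifted Gaussian
alpha = 2.  # variance of the shifted Gaussian

# 100,000 samples instead of 1,000
z_vec = np.random.normal(size=100000)
w_vec = z_vec*np.sqrt(alpha) + mu

# normalized histogram of w, with the standard normal z on top
fig, ax = plt.subplots()
ax.hist(w_vec, bins=30, density=True)
ax.hist(z_vec, bins=30, density=True)
_ = ax.legend(('$w$', '$z$'))
```

<p>With this many samples, both histograms should appear smoothly bell shaped, and the sample mean and variance of <code>w_vec</code> should be very close to <code>mu</code> and <code>alpha</code>.</p>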
<h3 id="sampling-from-the-prior-1">Sampling from the Prior</h3>
<p>Let's use this way of constructing samples from a Gaussian to check what functions look like <em>a priori</em>. The process will be as follows. First, we sample a <span class="math inline">\(K\)</span>-dimensional random vector from <code>np.random.normal</code>. Then we scale it by <span class="math inline">\(\sqrt{\alpha}\)</span> to obtain a prior sample of <span class="math inline">\(\mappingVector\)</span>.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">K <span class="op">=</span> degree <span class="op">+</span> <span class="dv">1</span>
z_vec <span class="op">=</span> np.random.normal(size<span class="op">=</span>K)
w_sample <span class="op">=</span> z_vec<span class="op">*</span>np.sqrt(alpha)
<span class="bu">print</span>(w_sample)</code></pre></div>
<p>Now we can combine our sample from the prior with the basis functions to create a function,</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">f_sample <span class="op">=</span> np.dot(Phi_pred,w_sample)
fig, ax <span class="op">=</span> plt.subplots(figsize<span class="op">=</span>plot.big_wide_figsize)
_ <span class="op">=</span> ax.plot(x_pred.flatten(), f_sample.flatten(), <span class="st">'r-'</span>, linewidth<span class="op">=</span><span class="dv">3</span>)</code></pre></div>
<p>This shows the recurring problem with the polynomial basis (note the scale on the left hand side!). Our prior allows relatively large coefficients for the basis functions associated with high polynomial degrees. Because we are operating with input values of around 2000, this leads to output functions with very large values. The fix we have used for this before is to rescale our data before we apply the polynomial basis to it. Above, we set the scale of the basis to 1. Here let's set it to 100 and try again.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">scale <span class="op">=</span> <span class="dv">100</span>.
basis <span class="op">=</span> mlai.basis(polynomial, number<span class="op">=</span>degree<span class="op">+</span><span class="dv">1</span>, loc<span class="op">=</span>loc, scale<span class="op">=</span>scale)
Phi_pred <span class="op">=</span> basis.Phi(x_pred)
Phi <span class="op">=</span> basis.Phi(x)</code></pre></div>
<p>With the rescaled basis in place, we can recompute the sample function,</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">f_sample <span class="op">=</span> np.dot(Phi_pred, w_sample)
fig, ax <span class="op">=</span> plt.subplots(figsize<span class="op">=</span>plot.big_wide_figsize)
_ <span class="op">=</span> ax.plot(x_pred.flatten(), f_sample.flatten(), <span class="st">'r-'</span>, linewidth<span class="op">=</span><span class="dv">3</span>)</code></pre></div>
<p>Now let's loop through some samples and plot various functions as samples from this system,</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">num_samples <span class="op">=</span> <span class="dv">10</span>
K <span class="op">=</span> degree<span class="op">+</span><span class="dv">1</span>
fig, ax <span class="op">=</span> plt.subplots(figsize<span class="op">=</span>plot.big_wide_figsize)
<span class="cf">for</span> i <span class="kw">in</span> <span class="bu">range</span>(num_samples):
z_vec <span class="op">=</span> np.random.normal(size<span class="op">=</span>K)
w_sample <span class="op">=</span> z_vec<span class="op">*</span>np.sqrt(alpha)
f_sample <span class="op">=</span> np.dot(Phi_pred,w_sample)
_ <span class="op">=</span> ax.plot(x_pred.flatten(), f_sample.flatten(), linewidth<span class="op">=</span><span class="dv">2</span>)</code></pre></div>
<p>The predictions for the mean output can now be computed. We want the expected value of the predictions under the posterior distribution. In matrix form, the predictions can be computed as <span class="math display">\[
\mappingFunctionVector = \basisMatrix \mappingVector.
\]</span> This involves a matrix multiplication between a fixed matrix <span class="math inline">\(\basisMatrix\)</span> and a vector <span class="math inline">\(\mappingVector\)</span> that is drawn from a distribution. Because <span class="math inline">\(\mappingVector\)</span> is drawn from a distribution, this implies that <span class="math inline">\(\mappingFunctionVector\)</span> is also drawn from a distribution. There are two distributions we are interested in, though. So far we have been sampling from the <em>prior</em> distribution to see what sort of functions we get <em>before</em> looking at the data. In Bayesian inference, we need to compute the <em>posterior</em> distribution and sample from that density.</p>
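<p>We can verify numerically that multiplying prior samples of <span class="math inline">\(\mappingVector\)</span> by a fixed matrix produces Gaussian outputs with covariance <span class="math inline">\(\alpha\basisMatrix\basisMatrix^\top\)</span>. This is a minimal sketch: the small random matrix below is a toy stand-in for the real basis matrix <code>Phi_pred</code>, which would slot in the same way.</p>

```python
import numpy as np

alpha = 2.                         # prior variance on the weights
rng = np.random.default_rng(0)
Phi = rng.standard_normal((5, 3))  # toy stand-in for a basis matrix

# Draw many prior weight vectors w ~ N(0, alpha I) and map them through Phi
num_samples = 100000
W = rng.standard_normal((num_samples, 3))*np.sqrt(alpha)
F = W @ Phi.T                      # each row is one sample of f = Phi w

# The sample covariance of f should approach alpha * Phi Phi^T
emp_cov = np.cov(F, rowvar=False)
true_cov = alpha*Phi @ Phi.T
print(np.abs(emp_cov - true_cov).max())
```

<p>The maximum absolute difference between the empirical and theoretical covariances shrinks as the number of samples grows, consistent with <span class="math inline">\(\mappingFunctionVector\)</span> being Gaussian with covariance <span class="math inline">\(\alpha\basisMatrix\basisMatrix^\top\)</span>.</p>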
<h3 id="computing-the-posterior">Computing the Posterior</h3>
<p>We will now attempt to compute the <em>posterior distribution</em>. In the lecture we went through the maths that allows us to compute the posterior distribution for <span class="math inline">\(\mappingVector\)</span>. This distribution is also Gaussian, <span class="math display">\[
p(\mappingVector | \dataVector, \inputVector, \dataStd^2) = \gaussianDist{\mappingVector}{\meanVector_\mappingScalar}{\covarianceMatrix_\mappingScalar}
\]</span> with covariance, <span class="math inline">\(\covarianceMatrix_\mappingScalar\)</span>, given by <span class="math display">\[
\covarianceMatrix_\mappingScalar = \left(\dataStd^{-2}\basisMatrix^\top \basisMatrix + \alpha^{-1}\eye\right)^{-1}
\]</span> whilst the mean is given by <span class="math display">\[
\meanVector_\mappingScalar = \covarianceMatrix_\mappingScalar \dataStd^{-2}\basisMatrix^\top \dataVector
\]</span> Let's compute the posterior covariance and mean, then we'll sample from these densities to have a look at the posterior belief about <span class="math inline">\(\mappingVector\)</span> once the data has been accounted for. Remember, the process of Bayesian inference involves combining the prior, <span class="math inline">\(p(\mappingVector)\)</span> with the likelihood, <span class="math inline">\(p(\dataVector|\inputVector, \mappingVector)\)</span> to form the posterior, <span class="math inline">\(p(\mappingVector | \dataVector, \inputVector)\)</span> through Bayes' rule, <span class="math display">\[
p(\mappingVector|\dataVector, \inputVector) = \frac{p(\dataVector|\inputVector, \mappingVector)p(\mappingVector)}{p(\dataVector)}
\]</span> We've looked at the samples for our function <span class="math inline">\(\mappingFunctionVector = \basisMatrix\mappingVector\)</span>, which forms the mean of the Gaussian likelihood, under the prior distribution. I.e. we've sampled from <span class="math inline">\(p(\mappingVector)\)</span> and multiplied the result by the basis matrix. Now we will sample from the posterior density, <span class="math inline">\(p(\mappingVector|\dataVector, \inputVector)\)</span>, and check that the new samples do correspond to the data, i.e. we want to check that the updated distribution includes information from the data set. First we need to compute the posterior mean and <em>covariance</em>.</p>
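<p>Once a mean vector and covariance matrix are in hand, samples can be drawn by scaling and shifting standard normal vectors, just as in the univariate case, but with a matrix square root such as the Cholesky factor playing the role of the standard deviation. A minimal sketch, using a hypothetical two-dimensional mean and covariance purely for illustration:</p>

```python
import numpy as np

# hypothetical posterior mean and covariance, for illustration only
w_mean = np.array([0.5, -1.0])
w_cov = np.array([[0.2, 0.05],
                  [0.05, 0.1]])

# Cholesky factor R satisfies R R^T = w_cov
R = np.linalg.cholesky(w_cov)

# w = mean + R z has mean w_mean and covariance R R^T = w_cov
z = np.random.normal(size=(2, 10000))
w_samples = w_mean[:, None] + R @ z

print(np.cov(w_samples))
```

<p>The printed sample covariance should be close to <code>w_cov</code>; the same construction applies directly once the real posterior mean and covariance have been computed.</p>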
<h3 id="bayesian-inference-in-the-univariate-case">Bayesian Inference in the Univariate Case</h3>
<p>This video talks about Bayesian inference across the single parameter, the offset <span class="math inline">\(c\)</span>, illustrating how the prior and the likelihood combine in one dimension to form a posterior.</p>
<p><a href="https://www.youtube.com/watch?v=AvlnFnvFw_0&t=15"><img src="https://img.youtube.com/vi/AvlnFnvFw_0/0.jpg" /></a></p>
<h3 id="multivariate-bayesian-inference">Multivariate Bayesian Inference</h3>
<p>This section of the lecture talks about how we extend the idea of Bayesian inference to the multivariate case. It goes through the multivariate Gaussian and how to complete the square in the linear algebra, arriving at the result given below.</p>
<p><a href="https://www.youtube.com/watch?v=Os1iqgpelPw&t=1362"><img src="https://img.youtube.com/vi/Os1iqgpelPw/0.jpg" /></a></p>
<p>The lecture informs us that the posterior density for <span class="math inline">\(\mappingVector\)</span> is given by a Gaussian density with covariance <span class="math display">\[
\covarianceMatrix_w = \left(\dataStd^{-2}\basisMatrix^\top \basisMatrix + \alpha^{-1}\eye\right)^{-1}
\]</span> and mean <span class="math display">\[
\meanVector_w = \covarianceMatrix_w\dataStd^{-2}\basisMatrix^\top \dataVector.
\]</span></p>
<h3 id="question-1-1">Question 1</h3>
<p>Compute the covariance for <span class="math inline">\(\mappingVector\)</span> given the training data; call the resulting variable <code>w_cov</code>. Compute the mean for <span class="math inline">\(\mappingVector\)</span> given the training data; call the resulting variable <code>w_mean</code>. Assume that <span class="math inline">\(\dataStd^2 = 0.01\)</span>.</p>
<p><em>10 marks</em></p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="co"># Write code for your answer to Question 1 in this box</span>
sigma2 <span class="op">=</span>
w_cov <span class="op">=</span>
w_mean <span class="op">=</span>
</code></pre></div>
<h3 id="olympic-data-with-bayesian-polynomials">Olympic Data with Bayesian Polynomials</h3>
<p>We can now fit the Bayesian polynomial model to the olympic marathon data and evaluate the negative marginal log likelihood.</p>
<object class="svgplot" data="../slides/diagrams/ml/olympic_BLM_polynomial_number026.svg">
</object>
<center>
<em>Bayesian fit with 26th degree polynomial and negative marginal log likelihood. </em>
</center>
<h3 id="hold-out-validation">Hold Out Validation</h3>
<p>For the polynomial fit, we will now look at <em>hold out</em> validation, where we hold out some of the most recent points. This tests the ability of our model to <em>extrapolate</em>.</p>
<object class="svgplot" data="../slides/diagrams/ml/olympic_val_BLM_polynomial_number026.svg">
</object>
<center>
<em>Bayesian fit with 26th degree polynomial and hold out validation scores. </em>
</center>
<h3 id="fold-cross-validation">5-fold Cross Validation</h3>
<p>Five fold cross validation tests the ability of the model to <em>interpolate</em>.</p>
<object class="svgplot" data="../slides/diagrams/ml/olympic_5cv05_BLM_polynomial_number026.svg">
</object>
<center>
<em>Bayesian fit with 26th degree polynomial and five fold cross validation scores. </em>
</center>
<h3 id="marginal-likelihood">Marginal Likelihood</h3>
<ul>
<li><p>The marginal likelihood can also be computed; it has the form: <span class="math display">\[
p(\dataVector|\inputMatrix, \dataStd^2, \alpha) = \frac{1}{(2\pi)^\frac{n}{2}\left|\kernelMatrix\right|^\frac{1}{2}} \exp\left(-\frac{1}{2} \dataVector^\top \kernelMatrix^{-1} \dataVector\right)
\]</span> where <span class="math inline">\(\kernelMatrix = \alpha \basisMatrix\basisMatrix^\top + \dataStd^2 \eye\)</span>.</p></li>
<li><p>So it is a zero mean <span class="math inline">\(\numData\)</span>-dimensional Gaussian with covariance matrix <span class="math inline">\(\kernelMatrix\)</span>.</p></li>
</ul>
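<p>A sketch of how this formula might be evaluated in code. The basis matrix and targets below are made-up toy values standing in for the real <code>Phi</code> and <code>y</code>, which would slot in directly; <code>slogdet</code> is used to keep the log determinant numerically stable.</p>

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 10, 3
Phi = rng.standard_normal((n, m))   # toy stand-in for the basis matrix
y = rng.standard_normal((n, 1))     # toy stand-in for the targets
alpha, sigma2 = 1.0, 0.01

# K = alpha Phi Phi^T + sigma^2 I
K = alpha*Phi @ Phi.T + sigma2*np.eye(n)

# log of the Gaussian density:
# -n/2 log(2 pi) - 1/2 log|K| - 1/2 y^T K^{-1} y
sign, logdetK = np.linalg.slogdet(K)
log_marginal = -0.5*(n*np.log(2*np.pi) + logdetK
                     + float(y.T @ np.linalg.solve(K, y)))
print(log_marginal)
```

<p>Maximising this quantity with respect to <code>alpha</code> and <code>sigma2</code> gives one route to setting the hyperparameters.</p>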
<h3 id="references" class="unnumbered">References</h3>
<div id="refs" class="references">
<div id="ref-Bishop:book06">
<p>Bishop, C.M., 2006. Pattern recognition and machine learning. Springer.</p>
</div>
<div id="ref-Laplace:essai14">
<p>Laplace, P.S., 1814. Essai philosophique sur les probabilités, 2nd ed. Courcier, Paris.</p>
</div>
<div id="ref-Rogers:book11">
<p>Rogers, S., Girolami, M., 2011. A first course in machine learning. CRC Press.</p>
</div>
</div>
Mon, 04 Jun 2018 00:00:00 +0000
http://inverseprobability.com/talks/notes/bayesian-methods.html
http://inverseprobability.com/talks/notes/bayesian-methods.htmlnotesFaith and AI<div style="display:none">
$$\newcommand{\Amatrix}{\mathbf{A}}
\newcommand{\KL}[2]{\text{KL}\left( #1\,\|\,#2 \right)}
\newcommand{\Kaast}{\kernelMatrix_{\mathbf{ \ast}\mathbf{ \ast}}}
\newcommand{\Kastu}{\kernelMatrix_{\mathbf{ \ast} \inducingVector}}
\newcommand{\Kff}{\kernelMatrix_{\mappingFunctionVector \mappingFunctionVector}}
\newcommand{\Kfu}{\kernelMatrix_{\mappingFunctionVector \inducingVector}}
\newcommand{\Kuast}{\kernelMatrix_{\inducingVector \bf\ast}}
\newcommand{\Kuf}{\kernelMatrix_{\inducingVector \mappingFunctionVector}}
\newcommand{\Kuu}{\kernelMatrix_{\inducingVector \inducingVector}}
\newcommand{\Kuui}{\Kuu^{-1}}
\newcommand{\Qaast}{\mathbf{Q}_{\bf \ast \ast}}
\newcommand{\Qastf}{\mathbf{Q}_{\ast \mappingFunction}}
\newcommand{\Qfast}{\mathbf{Q}_{\mappingFunctionVector \bf \ast}}
\newcommand{\Qff}{\mathbf{Q}_{\mappingFunctionVector \mappingFunctionVector}}
\newcommand{\aMatrix}{\mathbf{A}}
\newcommand{\aScalar}{a}
\newcommand{\aVector}{\mathbf{a}}
\newcommand{\acceleration}{a}
\newcommand{\bMatrix}{\mathbf{B}}
\newcommand{\bScalar}{b}
\newcommand{\bVector}{\mathbf{b}}
\newcommand{\basisFunc}{\phi}
\newcommand{\basisFuncVector}{\boldsymbol{ \basisFunc}}
\newcommand{\basisFunction}{\phi}
\newcommand{\basisLocation}{\mu}
\newcommand{\basisMatrix}{\boldsymbol{ \Phi}}
\newcommand{\basisScalar}{\basisFunction}
\newcommand{\basisVector}{\boldsymbol{ \basisFunction}}
\newcommand{\activationFunction}{\phi}
\newcommand{\activationMatrix}{\boldsymbol{ \Phi}}
\newcommand{\activationScalar}{\basisFunction}
\newcommand{\activationVector}{\boldsymbol{ \basisFunction}}
\newcommand{\bigO}{\mathcal{O}}
\newcommand{\binomProb}{\pi}
\newcommand{\cMatrix}{\mathbf{C}}
\newcommand{\cbasisMatrix}{\hat{\boldsymbol{ \Phi}}}
\newcommand{\cdataMatrix}{\hat{\dataMatrix}}
\newcommand{\cdataScalar}{\hat{\dataScalar}}
\newcommand{\cdataVector}{\hat{\dataVector}}
\newcommand{\centeredKernelMatrix}{\mathbf{ \MakeUppercase{\centeredKernelScalar}}}
\newcommand{\centeredKernelScalar}{b}
\newcommand{\centeredKernelVector}{\centeredKernelScalar}
\newcommand{\centeringMatrix}{\mathbf{H}}
\newcommand{\chiSquaredDist}[2]{\chi_{#1}^{2}\left(#2\right)}
\newcommand{\chiSquaredSamp}[1]{\chi_{#1}^{2}}
\newcommand{\conditionalCovariance}{\boldsymbol{ \Sigma}}
\newcommand{\coregionalizationMatrix}{\mathbf{B}}
\newcommand{\coregionalizationScalar}{b}
\newcommand{\coregionalizationVector}{\mathbf{ \coregionalizationScalar}}
\newcommand{\covDist}[2]{\text{cov}_{#2}\left(#1\right)}
\newcommand{\covSamp}[1]{\text{cov}\left(#1\right)}
\newcommand{\covarianceScalar}{c}
\newcommand{\covarianceVector}{\mathbf{ \covarianceScalar}}
\newcommand{\covarianceMatrix}{\mathbf{C}}
\newcommand{\covarianceMatrixTwo}{\boldsymbol{ \Sigma}}
\newcommand{\croupierScalar}{s}
\newcommand{\croupierVector}{\mathbf{ \croupierScalar}}
\newcommand{\croupierMatrix}{\mathbf{ \MakeUppercase{\croupierScalar}}}
\newcommand{\dataDim}{p}
\newcommand{\dataIndex}{i}
\newcommand{\dataIndexTwo}{j}
\newcommand{\dataMatrix}{\mathbf{Y}}
\newcommand{\dataScalar}{y}
\newcommand{\dataSet}{\mathcal{D}}
\newcommand{\dataStd}{\sigma}
\newcommand{\dataVector}{\mathbf{ \dataScalar}}
\newcommand{\decayRate}{d}
\newcommand{\degreeMatrix}{\mathbf{ \MakeUppercase{\degreeScalar}}}
\newcommand{\degreeScalar}{d}
\newcommand{\degreeVector}{\mathbf{ \degreeScalar}}
% Already defined by latex
%\newcommand{\det}[1]{\left|#1\right|}
\newcommand{\diag}[1]{\text{diag}\left(#1\right)}
\newcommand{\diagonalMatrix}{\mathbf{D}}
\newcommand{\diff}[2]{\frac{\text{d}#1}{\text{d}#2}}
\newcommand{\diffTwo}[2]{\frac{\text{d}^2#1}{\text{d}#2^2}}
\newcommand{\displacement}{x}
\newcommand{\displacementVector}{\textbf{\displacement}}
\newcommand{\distanceMatrix}{\mathbf{ \MakeUppercase{\distanceScalar}}}
\newcommand{\distanceScalar}{d}
\newcommand{\distanceVector}{\mathbf{ \distanceScalar}}
\newcommand{\eigenvaltwo}{\ell}
\newcommand{\eigenvaltwoMatrix}{\mathbf{L}}
\newcommand{\eigenvaltwoVector}{\mathbf{l}}
\newcommand{\eigenvalue}{\lambda}
\newcommand{\eigenvalueMatrix}{\boldsymbol{ \Lambda}}
\newcommand{\eigenvalueVector}{\boldsymbol{ \lambda}}
\newcommand{\eigenvector}{\mathbf{ \eigenvectorScalar}}
\newcommand{\eigenvectorMatrix}{\mathbf{U}}
\newcommand{\eigenvectorScalar}{u}
\newcommand{\eigenvectwo}{\mathbf{v}}
\newcommand{\eigenvectwoMatrix}{\mathbf{V}}
\newcommand{\eigenvectwoScalar}{v}
\newcommand{\entropy}[1]{\mathcal{H}\left(#1\right)}
\newcommand{\errorFunction}{E}
\newcommand{\expDist}[2]{\left<#1\right>_{#2}}
\newcommand{\expSamp}[1]{\left<#1\right>}
\newcommand{\expectation}[1]{\left\langle #1 \right\rangle }
\newcommand{\expectationDist}[2]{\left\langle #1 \right\rangle _{#2}}
\newcommand{\expectedDistanceMatrix}{\mathcal{D}}
\newcommand{\eye}{\mathbf{I}}
\newcommand{\fantasyDim}{r}
\newcommand{\fantasyMatrix}{\mathbf{ \MakeUppercase{\fantasyScalar}}}
\newcommand{\fantasyScalar}{z}
\newcommand{\fantasyVector}{\mathbf{ \fantasyScalar}}
\newcommand{\featureStd}{\varsigma}
\newcommand{\gammaCdf}[3]{\mathcal{GAMMA CDF}\left(#1|#2,#3\right)}
\newcommand{\gammaDist}[3]{\mathcal{G}\left(#1|#2,#3\right)}
\newcommand{\gammaSamp}[2]{\mathcal{G}\left(#1,#2\right)}
\newcommand{\gaussianDist}[3]{\mathcal{N}\left(#1|#2,#3\right)}
\newcommand{\gaussianSamp}[2]{\mathcal{N}\left(#1,#2\right)}
\newcommand{\given}{|}
\newcommand{\half}{\frac{1}{2}}
\newcommand{\heaviside}{H}
\newcommand{\hiddenMatrix}{\mathbf{ \MakeUppercase{\hiddenScalar}}}
\newcommand{\hiddenScalar}{h}
\newcommand{\hiddenVector}{\mathbf{ \hiddenScalar}}
\newcommand{\identityMatrix}{\eye}
\newcommand{\inducingInputScalar}{z}
\newcommand{\inducingInputVector}{\mathbf{ \inducingInputScalar}}
\newcommand{\inducingInputMatrix}{\mathbf{Z}}
\newcommand{\inducingScalar}{u}
\newcommand{\inducingVector}{\mathbf{ \inducingScalar}}
\newcommand{\inducingMatrix}{\mathbf{U}}
\newcommand{\inlineDiff}[2]{\text{d}#1/\text{d}#2}
\newcommand{\inputDim}{q}
\newcommand{\inputMatrix}{\mathbf{X}}
\newcommand{\inputScalar}{x}
\newcommand{\inputSpace}{\mathcal{X}}
\newcommand{\inputVals}{\inputVector}
\newcommand{\inputVector}{\mathbf{ \inputScalar}}
\newcommand{\iterNum}{k}
\newcommand{\kernel}{\kernelScalar}
\newcommand{\kernelMatrix}{\mathbf{K}}
\newcommand{\kernelScalar}{k}
\newcommand{\kernelVector}{\mathbf{ \kernelScalar}}
\newcommand{\kff}{\kernelScalar_{\mappingFunction \mappingFunction}}
\newcommand{\kfu}{\kernelVector_{\mappingFunction \inducingScalar}}
\newcommand{\kuf}{\kernelVector_{\inducingScalar \mappingFunction}}
\newcommand{\kuu}{\kernelVector_{\inducingScalar \inducingScalar}}
\newcommand{\lagrangeMultiplier}{\lambda}
\newcommand{\lagrangeMultiplierMatrix}{\boldsymbol{ \Lambda}}
\newcommand{\lagrangian}{L}
\newcommand{\laplacianFactor}{\mathbf{ \MakeUppercase{\laplacianFactorScalar}}}
\newcommand{\laplacianFactorScalar}{m}
\newcommand{\laplacianFactorVector}{\mathbf{ \laplacianFactorScalar}}
\newcommand{\laplacianMatrix}{\mathbf{L}}
\newcommand{\laplacianScalar}{\ell}
\newcommand{\laplacianVector}{\mathbf{ \ell}}
\newcommand{\latentDim}{q}
\newcommand{\latentDistanceMatrix}{\boldsymbol{ \Delta}}
\newcommand{\latentDistanceScalar}{\delta}
\newcommand{\latentDistanceVector}{\boldsymbol{ \delta}}
\newcommand{\latentForce}{f}
\newcommand{\latentFunction}{u}
\newcommand{\latentFunctionVector}{\mathbf{ \latentFunction}}
\newcommand{\latentFunctionMatrix}{\mathbf{ \MakeUppercase{\latentFunction}}}
\newcommand{\latentIndex}{j}
\newcommand{\latentScalar}{z}
\newcommand{\latentVector}{\mathbf{ \latentScalar}}
\newcommand{\latentMatrix}{\mathbf{Z}}
\newcommand{\learnRate}{\eta}
\newcommand{\lengthScale}{\ell}
\newcommand{\rbfWidth}{\ell}
\newcommand{\likelihoodBound}{\mathcal{L}}
\newcommand{\likelihoodFunction}{L}
\newcommand{\locationScalar}{\mu}
\newcommand{\locationVector}{\boldsymbol{ \locationScalar}}
\newcommand{\locationMatrix}{\mathbf{M}}
\newcommand{\variance}[1]{\text{var}\left( #1 \right)}
\newcommand{\mappingFunction}{f}
\newcommand{\mappingFunctionMatrix}{\mathbf{F}}
\newcommand{\mappingFunctionTwo}{g}
\newcommand{\mappingFunctionTwoMatrix}{\mathbf{G}}
\newcommand{\mappingFunctionTwoVector}{\mathbf{ \mappingFunctionTwo}}
\newcommand{\mappingFunctionVector}{\mathbf{ \mappingFunction}}
\newcommand{\scaleScalar}{s}
\newcommand{\mappingScalar}{w}
\newcommand{\mappingVector}{\mathbf{ \mappingScalar}}
\newcommand{\mappingMatrix}{\mathbf{W}}
\newcommand{\mappingScalarTwo}{v}
\newcommand{\mappingVectorTwo}{\mathbf{ \mappingScalarTwo}}
\newcommand{\mappingMatrixTwo}{\mathbf{V}}
\newcommand{\maxIters}{K}
\newcommand{\meanMatrix}{\mathbf{M}}
\newcommand{\meanScalar}{\mu}
\newcommand{\meanTwoMatrix}{\mathbf{M}}
\newcommand{\meanTwoScalar}{m}
\newcommand{\meanTwoVector}{\mathbf{ \meanTwoScalar}}
\newcommand{\meanVector}{\boldsymbol{ \meanScalar}}
\newcommand{\mrnaConcentration}{m}
\newcommand{\naturalFrequency}{\omega}
\newcommand{\neighborhood}[1]{\mathcal{N}\left( #1 \right)}
\newcommand{\neilurl}{http://inverseprobability.com/}
\newcommand{\noiseMatrix}{\boldsymbol{ E}}
\newcommand{\noiseScalar}{\epsilon}
\newcommand{\noiseVector}{\boldsymbol{ \epsilon}}
\newcommand{\norm}[1]{\left\Vert #1 \right\Vert}
\newcommand{\normalizedLaplacianMatrix}{\hat{\mathbf{L}}}
\newcommand{\normalizedLaplacianScalar}{\hat{\ell}}
\newcommand{\normalizedLaplacianVector}{\hat{\mathbf{ \ell}}}
\newcommand{\numActive}{m}
\newcommand{\numBasisFunc}{m}
\newcommand{\numComponents}{m}
\newcommand{\numComps}{K}
\newcommand{\numData}{n}
\newcommand{\numFeatures}{K}
\newcommand{\numHidden}{h}
\newcommand{\numInducing}{m}
\newcommand{\numLayers}{\ell}
\newcommand{\numNeighbors}{K}
\newcommand{\numSequences}{s}
\newcommand{\numSuccess}{s}
\newcommand{\numTasks}{m}
\newcommand{\numTime}{T}
\newcommand{\numTrials}{S}
\newcommand{\outputIndex}{j}
\newcommand{\paramVector}{\boldsymbol{ \theta}}
\newcommand{\parameterMatrix}{\boldsymbol{ \Theta}}
\newcommand{\parameterScalar}{\theta}
\newcommand{\parameterVector}{\boldsymbol{ \parameterScalar}}
\newcommand{\partDiff}[2]{\frac{\partial#1}{\partial#2}}
\newcommand{\precisionScalar}{j}
\newcommand{\precisionVector}{\mathbf{ \precisionScalar}}
\newcommand{\precisionMatrix}{\mathbf{J}}
\newcommand{\pseudotargetScalar}{\widetilde{y}}
\newcommand{\pseudotargetVector}{\mathbf{ \pseudotargetScalar}}
\newcommand{\pseudotargetMatrix}{\mathbf{ \widetilde{Y}}}
\newcommand{\rank}[1]{\text{rank}\left(#1\right)}
\newcommand{\rayleighDist}[2]{\mathcal{R}\left(#1|#2\right)}
\newcommand{\rayleighSamp}[1]{\mathcal{R}\left(#1\right)}
\newcommand{\responsibility}{r}
\newcommand{\rotationScalar}{r}
\newcommand{\rotationVector}{\mathbf{ \rotationScalar}}
\newcommand{\rotationMatrix}{\mathbf{R}}
\newcommand{\sampleCovScalar}{s}
\newcommand{\sampleCovVector}{\mathbf{ \sampleCovScalar}}
\newcommand{\sampleCovMatrix}{\mathbf{s}}
\newcommand{\scalarProduct}[2]{\left\langle{#1},{#2}\right\rangle}
\newcommand{\sign}[1]{\text{sign}\left(#1\right)}
\newcommand{\sigmoid}[1]{\sigma\left(#1\right)}
\newcommand{\singularvalue}{\ell}
\newcommand{\singularvalueMatrix}{\mathbf{L}}
\newcommand{\singularvalueVector}{\mathbf{l}}
\newcommand{\sorth}{\mathbf{u}}
\newcommand{\spar}{\lambda}
\newcommand{\trace}[1]{\text{tr}\left(#1\right)}
\newcommand{\BasalRate}{B}
\newcommand{\DampingCoefficient}{C}
\newcommand{\DecayRate}{D}
\newcommand{\Displacement}{X}
\newcommand{\LatentForce}{F}
\newcommand{\Mass}{M}
\newcommand{\Sensitivity}{S}
\newcommand{\basalRate}{b}
\newcommand{\dampingCoefficient}{c}
\newcommand{\mass}{m}
\newcommand{\sensitivity}{s}
\newcommand{\springScalar}{\kappa}
\newcommand{\springVector}{\boldsymbol{ \kappa}}
\newcommand{\springMatrix}{\boldsymbol{ \mathcal{K}}}
\newcommand{\tfConcentration}{p}
\newcommand{\tfDecayRate}{\delta}
\newcommand{\tfMrnaConcentration}{f}
\newcommand{\tfVector}{\mathbf{ \tfConcentration}}
\newcommand{\velocity}{v}
\newcommand{\sufficientStatsScalar}{g}
\newcommand{\sufficientStatsVector}{\mathbf{ \sufficientStatsScalar}}
\newcommand{\sufficientStatsMatrix}{\mathbf{G}}
\newcommand{\switchScalar}{s}
\newcommand{\switchVector}{\mathbf{ \switchScalar}}
\newcommand{\switchMatrix}{\mathbf{S}}
\newcommand{\tr}[1]{\text{tr}\left(#1\right)}
\newcommand{\loneNorm}[1]{\left\Vert #1 \right\Vert_1}
\newcommand{\ltwoNorm}[1]{\left\Vert #1 \right\Vert_2}
\newcommand{\onenorm}[1]{\left\vert#1\right\vert_1}
\newcommand{\twonorm}[1]{\left\Vert #1 \right\Vert}
\newcommand{\vScalar}{v}
\newcommand{\vVector}{\mathbf{v}}
\newcommand{\vMatrix}{\mathbf{V}}
\newcommand{\varianceDist}[2]{\text{var}_{#2}\left( #1 \right)}
% Already defined by latex
%\newcommand{\vec}{#1:}
\newcommand{\vecb}[1]{\left(#1\right):}
\newcommand{\weightScalar}{w}
\newcommand{\weightVector}{\mathbf{ \weightScalar}}
\newcommand{\weightMatrix}{\mathbf{W}}
\newcommand{\weightedAdjacencyMatrix}{\mathbf{A}}
\newcommand{\weightedAdjacencyScalar}{a}
\newcommand{\weightedAdjacencyVector}{\mathbf{ \weightedAdjacencyScalar}}
\newcommand{\onesVector}{\mathbf{1}}
\newcommand{\zerosVector}{\mathbf{0}}
$$
</div>
<h2 id="what-is-machine-learning">What is Machine Learning?</h2>
<p>What is machine learning? At its most basic level machine learning is a combination of</p>
<p><span class="math display">\[ \text{data} + \text{model} \xrightarrow{\text{compute}} \text{prediction}\]</span></p>
<p>where <em>data</em> is our observations. They can be actively or passively acquired (meta-data). The <em>model</em> contains our assumptions, based on previous experience. That experience can be other data, it can come from transfer learning, or it can merely be our beliefs about the regularities of the universe. In humans our models include our inductive biases. The <em>prediction</em> is an action to be taken or a categorization or a quality score. The reason that machine learning has become a mainstay of artificial intelligence is the importance of predictions in artificial intelligence. The data and the model are combined through computation.</p>
<p>In practice we normally perform machine learning using two functions. To combine data with a model we typically make use of:</p>
<p><strong>a prediction function</strong> a function which is used to make the predictions. It includes our beliefs about the regularities of the universe, our assumptions about how the world works, e.g. smoothness, spatial similarities, temporal similarities.</p>
<p><strong>an objective function</strong> a function which defines the cost of misprediction. Typically it includes knowledge about the world's generating processes (probabilistic objectives) or the costs we pay for mispredictions (empirical risk minimization).</p>
<p>The combination of data and model through the prediction function and the objective function leads to a <em>learning algorithm</em>. The class of prediction functions and objective functions we can make use of is restricted by the algorithms they lead to. If the prediction function or the objective function are too complex, then it can be difficult to find an appropriate learning algorithm. Much of the academic field of machine learning is the quest for new learning algorithms that allow us to bring different types of models and data together.</p>
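<p>As a minimal illustration of these two functions, consider a linear model with a sum-of-squares objective. This is a sketch with made-up data, not a definitive implementation; the function names are purely illustrative.</p>

```python
import numpy as np

def prediction_function(w, x):
    """Linear prediction: our assumption that outputs vary linearly with inputs."""
    return w[0]*x + w[1]

def objective_function(w, x, y):
    """Sum of squared errors: the cost we assign to mispredictions."""
    return ((y - prediction_function(w, x))**2).sum()

# made-up data generated from a known line, for illustration
x = np.linspace(0, 1, 20)
y = 2*x + 1

# a learning algorithm's job is to find the w that minimises the objective
w_good, w_bad = np.array([2., 1.]), np.array([0., 0.])
print(objective_function(w_good, x, y), objective_function(w_bad, x, y))
```

<p>The first objective value is zero because <code>w_good</code> recovers the generating line exactly; a learning algorithm would search the space of <code>w</code> for such a minimiser.</p>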
<p>A useful reference for state of the art in machine learning is the UK Royal Society Report, <a href="https://royalsociety.org/~/media/policy/projects/machine-learning/publications/machine-learning-report.pdf">Machine Learning: Power and Promise of Computers that Learn by Example</a>.</p>
<p>You can also check my blog post on <a href="http://inverseprobability.com/2017/07/17/what-is-machine-learning">"What is Machine Learning?"</a></p>
<h3 id="artificial-intelligence-and-data-science">Artificial Intelligence and Data Science</h3>
<p>Machine learning technologies have been the driver of two related, but distinct disciplines. The first is <em>data science</em>. Data science is an emerging field that arises from the fact that we now collect so much data by happenstance, rather than by <em>experimental design</em>. Classical statistics is the science of drawing conclusions from data, and to do so statistical experiments are carefully designed. In the modern era we collect so much data that there's a desire to draw inferences directly from the data.</p>
<p>As well as machine learning, the field of data science draws from statistics, cloud computing, data storage (e.g. streaming data), visualization and data mining.</p>
<p>In contrast, artificial intelligence technologies typically focus on emulating some form of human behaviour, such as understanding an image, or some speech, or translating text from one form to another. The recent advances in artificial intelligence have come from machine learning providing the automation. But in contrast to data science, in artificial intelligence the data is normally collected with the specific task in mind. In this sense it has strong relations to classical statistics.</p>
<p>Classically artificial intelligence worried more about <em>logic</em> and <em>planning</em> and focussed less on data driven decision making. Modern machine learning owes more to the field of <em>Cybernetics</em> <span class="citation">(Wiener, 1948)</span> than artificial intelligence. Related fields include <em>robotics</em>, <em>speech recognition</em>, <em>language understanding</em> and <em>computer vision</em>.</p>
<p>There are strong overlaps between the fields: the wide availability of data by happenstance makes it easier to collect data for designing AI systems. These relations are coming through the wide availability of sensing technologies that are interconnected by cellular networks, WiFi and the internet. This phenomenon is sometimes known as the <em>Internet of Things</em>, but this feels like a dangerous misnomer. We must never forget that we are interconnecting people, not things.</p>
<h3 id="what-does-machine-learning-do">What does Machine Learning do?</h3>
<p>Any process of automation allows us to scale what we do by codifying a process in some way that makes it efficient and repeatable. Machine learning automates by emulating human (or other) actions found in data. Machine learning codifies in the form of a mathematical function that is learnt by a computer. If we can create these mathematical functions in ways in which they can interconnect, then we can also build systems.</p>
<p>Machine learning works through codifying a prediction of interest into a mathematical function. For example, we can try to predict the probability that a customer wants to buy a jumper given knowledge of their age, and the latitude where they live. The technique known as logistic regression estimates the log odds that someone will buy a jumper as a linear weighted sum of the features of interest.</p>
<p><span class="math display">\[ \text{odds} = \frac{p(\text{bought})}{p(\text{not bought})} \]</span> <span class="math display">\[ \log \text{odds} = \beta_0 + \beta_1 \text{age} + \beta_2 \text{latitude}\]</span></p>
<p>Here <span class="math inline">\(\beta_0\)</span>, <span class="math inline">\(\beta_1\)</span> and <span class="math inline">\(\beta_2\)</span> are the parameters of the model. If <span class="math inline">\(\beta_1\)</span> and <span class="math inline">\(\beta_2\)</span> are both positive, then the log-odds that someone will buy a jumper increase with increasing latitude and age, so the further north you are and the older you are the more likely you are to buy a jumper. The parameter <span class="math inline">\(\beta_0\)</span> is an offset parameter, and gives the log-odds of buying a jumper at zero age and on the equator. It is likely to be negative, indicating that the purchase is odds-against. This is actually a classical statistical model, and models like logistic regression are widely used to estimate probabilities from ad-click prediction to risk of disease.</p>
<p>This is called a generalized linear model. We can also think of it as estimating the <em>probability</em> of a purchase as a nonlinear function of the features (age, latitude) and the parameters (the <span class="math inline">\(\beta\)</span> values). The function is known as the <em>sigmoid</em> or <a href="https://en.wikipedia.org/wiki/Logistic_regression">logistic function</a>, thus the name <em>logistic</em> regression.</p>
<p><span class="math display">\[ p(\text{bought}) = \sigmoid{\beta_0 + \beta_1 \text{age} + \beta_2 \text{latitude}}\]</span></p>
<p>In the case where we have <em>features</em> to help us predict, we sometimes denote such features as a vector, <span class="math inline">\(\inputVector\)</span>, and we then use an inner product between the features and the parameters, <span class="math inline">\(\boldsymbol{\beta}^\top \inputVector = \beta_1 \inputScalar_1 + \beta_2 \inputScalar_2 + \beta_3 \inputScalar_3 ...\)</span>, to represent the argument of the sigmoid.</p>
<p><span class="math display">\[ p(\text{bought}) = \sigmoid{\boldsymbol{\beta}^\top \inputVector}\]</span></p>
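<p>As a concrete sketch of the mechanics, the purchase probability can be computed directly from the log-odds. The parameter values below are entirely hypothetical, chosen only to illustrate the calculation, not fitted to any data:</p>

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps log-odds to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters: beta_0 (offset), beta_1 (age), beta_2 (latitude).
beta = np.array([-4.0, 0.02, 0.05])

# One customer's features: a constant 1 for the offset, age 60, latitude 52.
x = np.array([1.0, 60.0, 52.0])

# p(bought) = sigmoid(beta^T x)
p_bought = sigmoid(beta @ x)
```

<p>With these made-up values the log-odds come to <span class="math inline">\(-0.2\)</span>, giving a purchase probability a little under a half; flipping the sign of the offset would tip the odds the other way.</p>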
<p>More generally, we aim to predict some aspect of our data, <span class="math inline">\(\dataScalar\)</span>, by relating it through a mathematical function, <span class="math inline">\(\mappingFunction(\cdot)\)</span>, to the parameters, <span class="math inline">\(\boldsymbol{\beta}\)</span> and the data, <span class="math inline">\(\inputVector\)</span>.</p>
<p><span class="math display">\[ \dataScalar = \mappingFunction\left(\inputVector, \boldsymbol{\beta}\right)\]</span></p>
<p>We call <span class="math inline">\(\mappingFunction(\cdot)\)</span> the <em>prediction function</em>.</p>
<p>To obtain the fit to data, we use a separate function called the <em>objective function</em> that gives us a mathematical representation of the difference between our predictions and the real data.</p>
<p><span class="math display">\[\errorFunction(\boldsymbol{\beta}, \dataMatrix, \inputMatrix)\]</span> A commonly used example (for instance in a regression problem) is least squares, <span class="math display">\[\errorFunction(\boldsymbol{\beta}, \dataMatrix, \inputMatrix) = \sum_{i=1}^\numData \left(\dataScalar_i - \mappingFunction(\inputVector_i, \boldsymbol{\beta})\right)^2.\]</span></p>
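<p>The prediction function and least squares objective can be sketched in a few lines. The toy data below is hypothetical, intended only to show how the objective scores a given parameter setting:</p>

```python
import numpy as np

def prediction(X, beta):
    """Linear prediction function: f(x, beta) = beta^T x for each row of X."""
    return X @ beta

def least_squares_error(beta, y, X):
    """Objective function: sum of squared mispredictions."""
    return np.sum((y - prediction(X, beta)) ** 2)

# Toy data: three observations with an explicit offset column of ones.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
y = np.array([0.1, 1.1, 1.9])
beta = np.array([0.0, 1.0])

E = least_squares_error(beta, y, X)
```

<p>A learning algorithm then amounts to adjusting <code>beta</code> to drive this error down, whether by a closed-form solution or by gradient steps.</p>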
<p>If a linear prediction function is combined with the least squares objective function then that gives us a classical <em>linear regression</em>, another classical statistical model. Statistics often focusses on linear models because it makes interpretation of the model easier. Interpretation is key in statistics because the aim is normally to validate questions by analysis of data. Machine learning has typically focussed more on the prediction function itself and worried less about the interpretation of parameters, which are normally denoted by <span class="math inline">\(\mathbf{w}\)</span> instead of <span class="math inline">\(\boldsymbol{\beta}\)</span>. As a result <em>non-linear</em> functions are explored more often as they tend to improve quality of predictions but at the expense of interpretability.</p>
<img class="" src="../slides/diagrams/deepface_neg.png" width="100%" align="center" style="background:none; border:none; box-shadow:none;">
<center>
<em>The DeepFace architecture <span class="citation">(Taigman et al., 2014)</span>, visualized through colors to represent the functional mappings at each layer. There are 120 million parameters in the model. </em>
</center>
<p>The DeepFace architecture <span class="citation">(Taigman et al., 2014)</span> consists of layers that deal with <em>translation</em> and <em>rotational</em> invariances. These layers are followed by three locally-connected layers and two fully-connected layers. Color illustrates feature maps produced at each layer. The net includes more than 120 million parameters, where more than 95% come from the local and fully connected layers.</p>
<img class="" src="../slides/diagrams/576px-Early_Pinball.jpg" width="50%" align="center" style="background:none; border:none; box-shadow:none;">
<center>
<em>Deep learning models are composition of simple functions. We can think of a pinball machine as an analogy. Each layer of pins corresponds to one of the layers of functions in the model. Input data is represented by the location of the ball from left to right when it is dropped in from the top. Output class comes from the position of the ball as it leaves the pins at the bottom. </em>
</center>
<p>We can think of what these models are doing as being similar to early pin ball machines. In a neural network, we input a number (or numbers), whereas in pinball, we input a ball. The location of the ball on the left-right axis can be thought of as the number. As the ball falls through the machine, each layer of pins can be thought of as a different layer of neurons. Each layer acts to move the ball from left to right.</p>
<p>In a pinball machine, when the ball gets to the bottom it might fall into a hole defining a score, in a neural network, that is equivalent to the decision: a classification of the input object.</p>
<p>An image has more than one number associated with it, so it's like playing pinball in a <em>hyper-space</em>.</p>
<object class="svgplot" align data="../slides/diagrams/pinball001.svg">
</object>
<center>
<em>At initialization, the pins, which represent the parameters of the function, aren't in the right place to bring the balls to the correct decisions. </em>
</center>
<object class="svgplot" align data="../slides/diagrams/pinball002.svg">
</object>
<center>
<em>After learning the pins are now in the right place to bring the balls to the correct decisions. </em>
</center>
<p>Learning involves moving all the pins to be in the right position, so that the ball falls in the right place. But moving all these pins in hyperspace can be difficult. In a hyperspace you have to put a lot of data through the machine to explore the positions of all the pins. Adversarial learning reflects the fact that a ball can be moved a small distance and lead to a very different result.</p>
<p>Probabilistic methods explore more of the space by considering a range of possible paths for the ball through the machine.</p>
<h2 id="natural-and-artificial-intelligence-embodiment-factors">Natural and Artificial Intelligence: Embodiment Factors</h2>
<table>
<tr>
<td>
</td>
<td align="center">
<img class="" src="../slides/diagrams/IBM_Blue_Gene_P_supercomputer.jpg" width="40%" align="center" style="background:none; border:none; box-shadow:none;">
</td>
<td align="center">
<img class="" src="../slides/diagrams/ClaudeShannon_MFO3807.jpg" width="25%" align="center" style="background:none; border:none; box-shadow:none;">
</td>
</tr>
<tr>
<td>
compute
</td>
<td align="center">
<span class="math display">\[\approx 100 \text{ gigaflops}\]</span>
</td>
<td align="center">
<span class="math display">\[\approx 16 \text{ petaflops}\]</span>
</td>
</tr>
<tr>
<td>
communicate
</td>
<td align="center">
<span class="math display">\[1 \text{ gigabit/s}\]</span>
</td>
<td align="center">
<span class="math display">\[100 \text{ bit/s}\]</span>
</td>
</tr>
<tr>
<td>
(compute/communicate)
</td>
<td align="center">
<span class="math display">\[10^{4}\]</span>
</td>
<td align="center">
<span class="math display">\[10^{14}\]</span>
</td>
</tr>
</table>
<p>There is a fundamental limit placed on our intelligence based on our ability to communicate. Claude Shannon founded the field of information theory. The clever part of this theory is it allows us to separate our measurement of information from what the information pertains to<a href="#fn1" class="footnoteRef" id="fnref1"><sup>1</sup></a>.</p>
<p>Shannon measured information in bits. One bit of information is the amount of information I pass to you when I give you the result of a coin toss. Shannon was also interested in the amount of information in the English language. He estimated that on average a word in the English language contains 12 bits of information.</p>
<p>Given typical speaking rates, that gives us an estimate of our ability to communicate of around 100 bits per second <span class="citation">(Reed and Durlach, 1998)</span>. Computers on the other hand can communicate much more rapidly. Current wired network speeds are around a billion bits per second, ten million times faster.</p>
<p>When it comes to compute though, our best estimates indicate our computers are slower. A typical modern computer can perform around 100 billion floating point operations per second, and each floating point operation involves a 64 bit number. So the computer is processing around 6,400 billion bits per second.</p>
<p>It's difficult to get similar estimates for humans, but by some estimates the amount of compute we would require to <em>simulate</em> a human brain is equivalent to that in the UK's fastest computer <span class="citation">(Ananthanarayanan et al., 2009)</span>, the Met Office machine in Exeter, which in 2018 ranks as the 11th fastest computer in the world. That machine simulates the world's weather each morning, and then simulates the world's climate. It is a 16 petaflop machine, processing around 1,000 <em>trillion</em> bits per second.</p>
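<p>The compute/communicate ratios above can be sketched with back-of-envelope arithmetic. All figures are the order-of-magnitude estimates quoted in the text, not measurements, and the 64 bits-per-operation conversion is an assumption:</p>

```python
# Back-of-envelope embodiment-factor arithmetic (order of magnitude only).
BITS_PER_FLOP = 64

computer_flops = 100e9   # ~100 gigaflops for a typical machine
computer_comm = 1e9      # ~1 gigabit/s wired network

human_flops = 16e15      # ~16 petaflops to simulate a brain
human_comm = 100.0       # ~100 bits/s of speech

computer_ratio = computer_flops * BITS_PER_FLOP / computer_comm
human_ratio = human_flops * BITS_PER_FLOP / human_comm
```

<p>However the flops-to-bits conversion is done, the human ratio exceeds the computer's by many orders of magnitude: that gulf is the embodiment factor.</p>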
<p>So when it comes to our ability to compute we are extraordinary, not the compute in our conscious mind, but the underlying neuron firings that underpin our consciousness and our subconsciousness as well as our motor control etc. By analogy I sometimes like to think of us as a Formula One engine. But in terms of our ability to deploy that computation in actual use, to share the results of what we have inferred, we are very limited. So when you imagine the F1 car that represents a psyche, think of an F1 car with bicycle wheels.</p>
<p><img class="" src="../slides/diagrams/640px-Marcel_Renault_1903.jpg" width="70%" align="center" style="background:none; border:none; box-shadow:none;"></p>
<p>In contrast, our computers have less computational power, but they can communicate far more fluidly. They are more like a go-kart, less well powered, but with tires that allow them to deploy that power.</p>
<p><img class="" src="../slides/diagrams/Caleb_McDuff_WIX_Silence_Racing_livery.jpg" width="70%" align="center" style="background:none; border:none; box-shadow:none;"></p>
<p>For humans, that means much of our computation should be dedicated to considering <em>what</em> we should compute. To do that efficiently we need to model the world around us. The most complex thing in the world around us is other humans. So it is no surprise that we model them. We second guess what their intentions are, and our communication is only necessary when they are departing from how we model them. Naturally, for this to work well, we need to understand those we work closely with. So it is no surprise that social communication, social bonding, forms so much of a part of our use of our limited bandwidth.</p>
<p>There is a second effect here: our need to anthropomorphise objects around us. Our tendency to model our fellow humans extends to our interactions with other entities in our environment: our pets, as well as inanimate objects around us, such as computers or even our cars. This tendency to overinterpret could be a consequence of our limited ability to communicate.</p>
<p>For more details see this paper <a href="https://arxiv.org/abs/1705.07996">"Living Together: Mind and Machine Intelligence"</a>, and this <a href="http://inverseprobability.com/talks/lawrence-tedx17/living-together.html">TEDx talk</a>.</p>
<h2 id="evolved-relationship-with-information">Evolved Relationship with Information</h2>
<p>The high bandwidth of computers has resulted in a close relationship between the computer and data. Large amounts of information can flow between the two. The degree to which the computer is mediating our relationship with data means that we should consider it an intermediary.</p>
<p>Originally our low bandwidth relationship with data was affected by two characteristics. Firstly, our tendency to over-interpret, driven by our need to extract as much knowledge from our low bandwidth information channel as possible. Secondly, our improved understanding of the domain of <em>mathematical</em> statistics and how our cognitive biases can mislead us.</p>
<p>With this new set up there is a potential for assimilating far more information via the computer, but the computer can present this to us in various ways. If its motives are not aligned with ours then it can misrepresent the information. This needn't be nefarious; it can simply be a result of the computer pursuing a different objective from us. For example, the computer may be aiming to maximize our interaction time, while our objective may be to summarize information in a representative manner in the <em>shortest</em> possible length of time.</p>
<p>For example, for me it was a common experience to pick up my telephone with the intention of checking when my next appointment was, but to soon find myself distracted by another application on the phone, and end up reading something on the internet. By the time I'd finished reading, I would often have forgotten the reason I picked up my phone in the first place.</p>
<p>There are great benefits to be had from the huge amount of information we can unlock from this evolved relationship between us and data. In biology, large scale data sharing has been driven by a revolution in genomic, transcriptomic and epigenomic measurement. The improved inferences that can be drawn through summarizing data by computer have fundamentally changed the nature of biological science. Now this phenomenon is also influencing us in our daily lives as data measured by <em>happenstance</em> is increasingly used to characterize us.</p>
<p>Better mediation of this flow actually requires a better understanding of human-computer interaction. This in turn involves understanding our own intelligence better, what its cognitive biases are and how these might mislead us.</p>
<p>For further thoughts see <a href="https://www.theguardian.com/media-network/2015/jul/23/data-driven-economy-marketing">this Guardian article</a> from 2015 on marketing in the internet era and <a href="http://inverseprobability.com/2015/12/04/what-kind-of-ai">this blog post</a> on System Zero.</p>
<object class="svgplot" align data="../slides/diagrams/data-science/information-flow003.svg">
</object>
<center>
<em>New direction of information flow, information is reaching us mediated by the computer </em>
</center>
<h3 id="societal-effects">Societal Effects</h3>
<p>We have already seen the effects of this changed dynamic in biology and computational biology. Improved sensorics have led to the new domains of transcriptomics, epigenomics, and 'rich phenomics' as well as considerably augmenting our capabilities in genomics.</p>
<p>Biologists have had to become data-savvy: they require a rich understanding of the available data resources and need to assimilate existing data sets in their hypothesis generation as well as their experimental design. Modern biology has become a far more quantitative science, but this shift has required new methods developed in the domains of <em>computational biology</em> and <em>bioinformatics</em>.</p>
<p>There is also great promise for personalized health, but in health the wide data-sharing that has underpinned success in the computational biology community is much harder to carry out.</p>
<p>We can expect to see these phenomena reflected in wider society. Particularly as we make use of more automated decision making based only on data.</p>
<p>The main phenomenon we see across the board is the shift in dynamic from the direct pathway between human and data, as traditionally mediated by classical statistics, to a new flow of information via the computer. This change of dynamics gives us the modern and emerging domain of <em>data science</em>.</p>
<h2 id="human-communication">Human Communication</h2>
<p>For human conversation to work, we require an internal model of who we are speaking to. We model each other, and combine our sense of who they are, who they think we are, and what has been said. This is our approach to dealing with the limited bandwidth connection we have. Empathy and understanding of intent. Mental dispositional concepts are used to augment our limited communication bandwidth.</p>
<p>Fritz Heider referred to the important point of a conversation as being that they are happenings that are "<em>psychologically represented</em> in each of the participants" (his emphasis) <span class="citation">(Heider, 1958)</span>.</p>
<h3 id="machine-learning-and-narratives">Machine Learning and Narratives</h3>
<p><img class="" src="../slides/diagrams/Classic_baby_shoes.jpg" width="60%" align="center" style="background:none; border:none; box-shadow:none;"></p>
<center>
<em>For sale: baby shoes, never worn.</em>
</center>
<p>Consider the six word novel, apocryphally credited to Ernest Hemingway, "For sale: baby shoes, never worn". To understand what that means to a human, you need a great deal of additional context. Context that is not directly accessible to a machine that lacks both the evolved and contextual understanding of our own condition needed to realize the implication of the advert and what that implication means emotionally to the previous owner.</p>
<p><a href="https://www.youtube.com/watch?v=8FIEZXMUM2I&t=7"><img src="https://img.youtube.com/vi/8FIEZXMUM2I/0.jpg" /></a></p>
<p><a href="https://en.wikipedia.org/wiki/Fritz_Heider">Fritz Heider</a> and <a href="https://en.wikipedia.org/wiki/Marianne_Simmel">Marianne Simmel</a>'s experiments with animated shapes from 1944 <span class="citation">(Heider and Simmel, 1944)</span>. Our interpretation of these objects as showing motives and even emotion is a combination of our desire for narrative, a need for understanding of each other, and our ability to empathise. At one level, these are crudely drawn objects, but in another key way, the animator has communicated a story through simple facets such as their relative motions, their sizes and their actions. We apply our psychological representations to these faceless shapes in an effort to interpret their actions.</p>
<h3 id="faith-and-ai">Faith and AI</h3>
<p>There would seem to be at least three ways in which artificial intelligence and religion interconnect.</p>
<ol style="list-style-type: decimal">
<li>Artificial Intelligence as Cartoon Religion</li>
<li>Artificial Intelligence and Introspection</li>
<li>Independence of thought and Control: A Systemic Catch 22</li>
</ol>
<h3 id="singulariansm-ai-as-cartoon-religion">Singularism: AI as Cartoon Religion</h3>
<p>The first parallels one can find between artificial intelligence and religion come in the form of a somewhat cartoonish doomsday scenario. The publicly hyped fears of superintelligence and singularity can equally be placed within the framework of the simpler questions that religion can try to answer. The parallels are</p>
<ol style="list-style-type: decimal">
<li>Superintelligence as god</li>
<li>Demi-god status achievable through transhumanism</li>
<li>Immortality through uploading the connectome</li>
<li>The day of judgement as the "singularity"</li>
</ol>
<p>The notion of an ultra-intelligence is similar to the notion of an interventionist god, with omniscience across the past, the present and the future. This notion was described by Pierre-Simon Laplace.</p>
<p><img class="" src="../slides/diagrams/ml/Pierre-Simon_Laplace.png" width="30%" align="center" style="background:none; border:none; box-shadow:none;"></p>
<p>Famously, Laplace considered the idea of a deterministic Universe, one in which the model is <em>known</em>, or, as the translation below refers to it, "an intelligence which could comprehend all the forces by which nature is animated". He speculates on an "intelligence" that can submit this vast data to analysis and proposes that such an entity would be able to predict the future.</p>
<blockquote>
<p>Given for one instant an intelligence which could comprehend all the forces by which nature is animated and the respective situation of the beings who compose it---an intelligence sufficiently vast to submit these data to analysis---it would embrace in the same formula the movements of the greatest bodies of the universe and those of the lightest atom; for it, nothing would be uncertain and the future, as the past, would be present in its eyes.</p>
</blockquote>
<p>This notion is known as <em>Laplace's demon</em> or <em>Laplace's superman</em>.</p>
<p>Unfortunately, most analyses of his ideas stop at that point, whereas his real point is that such a notion is unreachable. Not so much <em>superman</em> as <em>strawman</em>. Just three pages later in the "Philosophical Essay on Probabilities" <span class="citation">(Laplace, 1814)</span>, Laplace goes on to observe:</p>
<blockquote>
<p>The curve described by a simple molecule of air or vapor is regulated in a manner just as certain as the planetary orbits; the only difference between them is that which comes from our ignorance.</p>
<p>Probability is relative, in part to this ignorance, in part to our knowledge.</p>
</blockquote>
<p>In other words, we can never utilize the idealistic deterministic Universe due to our ignorance about the world. Laplace's suggestion, and his focus in this essay, is that we turn to probability to deal with this uncertainty. This is also our inspiration for using probability in machine learning.</p>
<p>The "forces by which nature is animated" is our <em>model</em>, the "situation of beings that compose it" is our <em>data</em> and the "intelligence sufficiently vast enough to submit these data to analysis" is our compute. The fly in the ointment is our <em>ignorance</em> about these aspects. And <em>probability</em> is the tool we use to incorporate this ignorance leading to uncertainty or <em>doubt</em> in our predictions.</p>
<p>The notion of Superintelligence in, e.g., Nick Bostrom's book <span class="citation">(Bostrom, 2014)</span>, is that of an infallible omniscience. A major narrative of the book is that the challenge of Superintelligence is to constrain the power of such an entity. In practice, this narrative is strongly related to Laplace's "straw superman". No such intelligence could exist due to our ignorance; in practice any real intelligence must express <em>doubt</em>.</p>
<p>Elon Musk has proposed that the only way to defeat the inevitable omniscience would be to augment ourselves with machine like capabilities. Ray Kurzweil has pushed the notion of developing ourselves by augmenting our existing cortex with direct connection to the internet.</p>
<p>Within Silicon Valley there is a particular obsession with 'uploading': once the brain is connected, we can achieve immortality by continuing to exist digitally in an artificial environment of our own creation while our physical body is left behind us.</p>
<p>In this scenario, doomsday is the 'technological singularity', the moment at which computers rapidly outstrip our capabilities and take over our world. The high priests are the scientists, and the aim is to bring about the latter while restraining the former.</p>
<p><em>Singularism</em> is to religion what <em>scientology</em> is to science. Scientology is religion expressing itself as science and Singularism is science expressing itself as religion.</p>
<p>For further reading see <a href="http://inverseprobability.com/2016/05/09/machine-learning-futures-5">this post on Singularism</a> as well as this <a href="http://www.academia.edu/15037984/Singularitarians_AItheists_and_Why_the_Problem_with_Artificial_Intelligence_is_H.A.L._Humanity_At_Large_not_HAL">paper by Luciano Floridi</a> and this <a href="http://inverseprobability.com/2016/05/09/machine-learning-futures-6">review of Superintelligence</a> <span class="citation">(Bostrom, 2014)</span>.</p>
<h3 id="artificial-intelligence-and-introspection">Artificial Intelligence and Introspection</h3>
<p>Ignoring the cartoon view of religion we've outlined above and focussing more on how religion can bring strength to people in their day-to-day living, religious environments offer a place to self-reflect and meditate on our existence and the wider cosmos. How are we here? What is our role? What makes us different?</p>
<p>Creating machine intelligences characterizes the manner in which we are different, and helps us understand what is special about us rather than the machine.</p>
<p>I have in the past argued strongly against the term artificial intelligence but there is a sense in which it is a good term. If we think of artificial plants, then we have the right sense in which we are creating an artificial intelligence. An artificial plant is fundamentally different from a real plant, but can appear similar, or from a distance identical. However, a creator of an artificial plant gains a greater appreciation for the complexity of a natural plant.</p>
<p>In a similar way, we might expect that attempts to emulate human intelligence would lead to a deeper appreciation of that intelligence. This type of reflection on who we are has a lot in common with many of the (to me) most endearing characteristics of religion.</p>
<h3 id="the-cosmic-catch-22">The Cosmic Catch 22</h3>
<p>A final parallel between the world of AI and that of religion is the conundrums they raise for us. In particular the tradeoffs between a paternalistic society and individual freedoms. Two models for artificial intelligence that may be realistic are the "Big Brother" and the "Big Mother" models.</p>
<p>Big Brother refers to the surveillance society and the control of populations that can be achieved with a greater understanding of the individual self. A perceptual understanding of the individual that could conceivably be better than the individual's self-perception. This scenario was most famously explored by George Orwell, but also came into being in Communist East Germany, where it is estimated that one in 66 citizens acted as an informant <span class="citation">(<em>Stasi</em>, 1999)</span>.</p>
<p>The same understanding of the individual is also necessary for the "Big Mother" scenario, where intelligent agents provide for us in the manner in which our parents did for us when we were young. Both scenarios are disempowering in terms of individual liberties. In a metaphorical sense, this could be seen as a return to Eden, a surrendering of individual liberties for a perceived paradise. But those individual liberties are also what we value. There is a tension between a desire to create the perfect environment, where no evil exists, and our individual liberty. Our society chooses a balance between the pros and cons that attempts to sustain a diversity of perspectives and beliefs. Even if it were possible to use AI to organize society in such a way that particular malevolent behaviours were prevented, doing so may come at the cost of the individual freedom we enjoy. These are difficult trade-offs, and they exist both when explaining the nature of religious belief and when considering the nature of either the dystopian Big Brother or the "dys-utopian" Big Mother view of AI.</p>
<h3 id="conclusion">Conclusion</h3>
<p>We've provided an overview of the advances in artificial intelligence from the perspective of machine learning, and tried to give a sense of how machine learning models operate to learn about us.</p>
<p>We've highlighted a quintessential difference between humans and computers: the embodiment factor, the relatively restricted ability of humans to communicate compared to computers. We explored how this has affected our evolved relationship with data and the relationship between the human and narrative.</p>
<p>Finally, we explored three parallels between faith and AI, in particular the cartoon nature of religion based on technological promises of the singularity and AI. A more sophisticated relationship occurs when we see the way in which, as artificial intelligences invade our notion of personhood, we will need to introspect about who we are and what we want to be, a characteristic shared with many religions. The final parallel was in the emergent questions of AI: "Should we build an artificial intelligence to eliminate war?" has a strong parallel with the question "Why does God allow war?". War is a consequence of human choices. Building such a system would likely severely restrict our freedoms to make choices, and there is a tension between how much we wish those freedoms to be impinged upon versus the potential lives that could be saved.</p>
<h3 id="thanks">Thanks!</h3>
<ul>
<li>twitter: @lawrennd</li>
<li>blog: <a href="http://inverseprobability.com/blog.html">http://inverseprobability.com</a></li>
</ul>
<div id="refs" class="references">
<div id="ref-Ananthanarayanan-cat09">
<p>Ananthanarayanan, R., Esser, S.K., Simon, H.D., Modha, D.S., 2009. The cat is out of the bag: Cortical simulations with <span class="math inline">\(10^9\)</span> neurons, <span class="math inline">\(10^{13}\)</span> synapses, in: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis - SC ’09. <a href="https://doi.org/10.1145/1654059.1654124" class="uri">https://doi.org/10.1145/1654059.1654124</a></p>
</div>
<div id="ref-Bostrom-superintelligence14">
<p>Bostrom, N., 2014. Superintelligence: Paths, dangers, strategies, 1st ed. Oxford University Press, Oxford, UK.</p>
</div>
<div id="ref-Heider:interpersonal58">
<p>Heider, F., 1958. The psychology of interpersonal relations. John Wiley.</p>
</div>
<div id="ref-Heider:experimental44">
<p>Heider, F., Simmel, M., 1944. An experimental study of apparent behavior. The American Journal of Psychology 57, 243–259.</p>
</div>
<div id="ref-Laplace:essai14">
<p>Laplace, P.S., 1814. Essai philosophique sur les probabilités, 2nd ed. Courcier, Paris.</p>
</div>
<div id="ref-Reed-information98">
<p>Reed, C., Durlach, N.I., 1998. Note on information transfer rates in human communication. Presence: Teleoperators &amp; Virtual Environments 7, 509–518. <a href="https://doi.org/10.1162/105474698565893" class="uri">https://doi.org/10.1162/105474698565893</a></p>
</div>
<div id="ref-Koehler-stasi99">
<p>Koehler, J.O., 1999. Stasi: The untold story of the East German secret police. Westview Press, Boulder, CO.</p>
</div>
<div id="ref-Taigman:deepface14">
<p>Taigman, Y., Yang, M., Ranzato, M., Wolf, L., 2014. DeepFace: Closing the gap to human-level performance in face verification, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. <a href="https://doi.org/10.1109/CVPR.2014.220" class="uri">https://doi.org/10.1109/CVPR.2014.220</a></p>
</div>
<div id="ref-Wiener:cybernetics48">
<p>Wiener, N., 1948. Cybernetics: Control and communication in the animal and the machine. MIT Press, Cambridge, MA.</p>
</div>
</div>
<div class="footnotes">
<hr />
<ol>
<li id="fn1"><p>The challenge of understanding what information pertains to is known as knowledge representation.<a href="#fnref1">↩</a></p></li>
</ol>
</div>
Thu, 31 May 2018 00:00:00 +0000
http://inverseprobability.com/talks/notes/faith-and-ai.html
http://inverseprobability.com/talks/notes/faith-and-ai.htmlnotesUncertainty in Loss Functions<div style="display:none">
$$\newcommand{\Amatrix}{\mathbf{A}}
\newcommand{\KL}[2]{\text{KL}\left( #1\,\|\,#2 \right)}
\newcommand{\Kaast}{\kernelMatrix_{\mathbf{ \ast}\mathbf{ \ast}}}
\newcommand{\Kastu}{\kernelMatrix_{\mathbf{ \ast} \inducingVector}}
\newcommand{\Kff}{\kernelMatrix_{\mappingFunctionVector \mappingFunctionVector}}
\newcommand{\Kfu}{\kernelMatrix_{\mappingFunctionVector \inducingVector}}
\newcommand{\Kuast}{\kernelMatrix_{\inducingVector \bf\ast}}
\newcommand{\Kuf}{\kernelMatrix_{\inducingVector \mappingFunctionVector}}
\newcommand{\Kuu}{\kernelMatrix_{\inducingVector \inducingVector}}
\newcommand{\Kuui}{\Kuu^{-1}}
\newcommand{\Qaast}{\mathbf{Q}_{\bf \ast \ast}}
\newcommand{\Qastf}{\mathbf{Q}_{\ast \mappingFunction}}
\newcommand{\Qfast}{\mathbf{Q}_{\mappingFunctionVector \bf \ast}}
\newcommand{\Qff}{\mathbf{Q}_{\mappingFunctionVector \mappingFunctionVector}}
\newcommand{\aMatrix}{\mathbf{A}}
\newcommand{\aScalar}{a}
\newcommand{\aVector}{\mathbf{a}}
\newcommand{\acceleration}{a}
\newcommand{\bMatrix}{\mathbf{B}}
\newcommand{\bScalar}{b}
\newcommand{\bVector}{\mathbf{b}}
\newcommand{\basisFunc}{\phi}
\newcommand{\basisFuncVector}{\boldsymbol{ \basisFunc}}
\newcommand{\basisFunction}{\phi}
\newcommand{\basisLocation}{\mu}
\newcommand{\basisMatrix}{\boldsymbol{ \Phi}}
\newcommand{\basisScalar}{\basisFunction}
\newcommand{\basisVector}{\boldsymbol{ \basisFunction}}
\newcommand{\activationFunction}{\phi}
\newcommand{\activationMatrix}{\boldsymbol{ \Phi}}
\newcommand{\activationScalar}{\basisFunction}
\newcommand{\activationVector}{\boldsymbol{ \basisFunction}}
\newcommand{\bigO}{\mathcal{O}}
\newcommand{\binomProb}{\pi}
\newcommand{\cMatrix}{\mathbf{C}}
\newcommand{\cbasisMatrix}{\hat{\boldsymbol{ \Phi}}}
\newcommand{\cdataMatrix}{\hat{\dataMatrix}}
\newcommand{\cdataScalar}{\hat{\dataScalar}}
\newcommand{\cdataVector}{\hat{\dataVector}}
\newcommand{\centeredKernelMatrix}{\mathbf{ \MakeUppercase{\centeredKernelScalar}}}
\newcommand{\centeredKernelScalar}{b}
\newcommand{\centeredKernelVector}{\centeredKernelScalar}
\newcommand{\centeringMatrix}{\mathbf{H}}
\newcommand{\chiSquaredDist}[2]{\chi_{#1}^{2}\left(#2\right)}
\newcommand{\chiSquaredSamp}[1]{\chi_{#1}^{2}}
\newcommand{\conditionalCovariance}{\boldsymbol{ \Sigma}}
\newcommand{\coregionalizationMatrix}{\mathbf{B}}
\newcommand{\coregionalizationScalar}{b}
\newcommand{\coregionalizationVector}{\mathbf{ \coregionalizationScalar}}
\newcommand{\covDist}[2]{\text{cov}_{#2}\left(#1\right)}
\newcommand{\covSamp}[1]{\text{cov}\left(#1\right)}
\newcommand{\covarianceScalar}{c}
\newcommand{\covarianceVector}{\mathbf{ \covarianceScalar}}
\newcommand{\covarianceMatrix}{\mathbf{C}}
\newcommand{\covarianceMatrixTwo}{\boldsymbol{ \Sigma}}
\newcommand{\croupierScalar}{s}
\newcommand{\croupierVector}{\mathbf{ \croupierScalar}}
\newcommand{\croupierMatrix}{\mathbf{ \MakeUppercase{\croupierScalar}}}
\newcommand{\dataDim}{p}
\newcommand{\dataIndex}{i}
\newcommand{\dataIndexTwo}{j}
\newcommand{\dataMatrix}{\mathbf{Y}}
\newcommand{\dataScalar}{y}
\newcommand{\dataSet}{\mathcal{D}}
\newcommand{\dataStd}{\sigma}
\newcommand{\dataVector}{\mathbf{ \dataScalar}}
\newcommand{\decayRate}{d}
\newcommand{\degreeMatrix}{\mathbf{ \MakeUppercase{\degreeScalar}}}
\newcommand{\degreeScalar}{d}
\newcommand{\degreeVector}{\mathbf{ \degreeScalar}}
% Already defined by latex
%\newcommand{\det}[1]{\left|#1\right|}
\newcommand{\diag}[1]{\text{diag}\left(#1\right)}
\newcommand{\diagonalMatrix}{\mathbf{D}}
\newcommand{\diff}[2]{\frac{\text{d}#1}{\text{d}#2}}
\newcommand{\diffTwo}[2]{\frac{\text{d}^2#1}{\text{d}#2^2}}
\newcommand{\displacement}{x}
\newcommand{\displacementVector}{\textbf{\displacement}}
\newcommand{\distanceMatrix}{\mathbf{ \MakeUppercase{\distanceScalar}}}
\newcommand{\distanceScalar}{d}
\newcommand{\distanceVector}{\mathbf{ \distanceScalar}}
\newcommand{\eigenvaltwo}{\ell}
\newcommand{\eigenvaltwoMatrix}{\mathbf{L}}
\newcommand{\eigenvaltwoVector}{\mathbf{l}}
\newcommand{\eigenvalue}{\lambda}
\newcommand{\eigenvalueMatrix}{\boldsymbol{ \Lambda}}
\newcommand{\eigenvalueVector}{\boldsymbol{ \lambda}}
\newcommand{\eigenvector}{\mathbf{ \eigenvectorScalar}}
\newcommand{\eigenvectorMatrix}{\mathbf{U}}
\newcommand{\eigenvectorScalar}{u}
\newcommand{\eigenvectwo}{\mathbf{v}}
\newcommand{\eigenvectwoMatrix}{\mathbf{V}}
\newcommand{\eigenvectwoScalar}{v}
\newcommand{\entropy}[1]{\mathcal{H}\left(#1\right)}
\newcommand{\errorFunction}{E}
\newcommand{\expDist}[2]{\left<#1\right>_{#2}}
\newcommand{\expSamp}[1]{\left<#1\right>}
\newcommand{\expectation}[1]{\left\langle #1 \right\rangle }
\newcommand{\expectationDist}[2]{\left\langle #1 \right\rangle _{#2}}
\newcommand{\expectedDistanceMatrix}{\mathcal{D}}
\newcommand{\eye}{\mathbf{I}}
\newcommand{\fantasyDim}{r}
\newcommand{\fantasyMatrix}{\mathbf{ \MakeUppercase{\fantasyScalar}}}
\newcommand{\fantasyScalar}{z}
\newcommand{\fantasyVector}{\mathbf{ \fantasyScalar}}
\newcommand{\featureStd}{\varsigma}
\newcommand{\gammaCdf}[3]{\mathcal{GAMMA CDF}\left(#1|#2,#3\right)}
\newcommand{\gammaDist}[3]{\mathcal{G}\left(#1|#2,#3\right)}
\newcommand{\gammaSamp}[2]{\mathcal{G}\left(#1,#2\right)}
\newcommand{\gaussianDist}[3]{\mathcal{N}\left(#1|#2,#3\right)}
\newcommand{\gaussianSamp}[2]{\mathcal{N}\left(#1,#2\right)}
\newcommand{\given}{|}
\newcommand{\half}{\frac{1}{2}}
\newcommand{\heaviside}{H}
\newcommand{\hiddenMatrix}{\mathbf{ \MakeUppercase{\hiddenScalar}}}
\newcommand{\hiddenScalar}{h}
\newcommand{\hiddenVector}{\mathbf{ \hiddenScalar}}
\newcommand{\identityMatrix}{\eye}
\newcommand{\inducingInputScalar}{z}
\newcommand{\inducingInputVector}{\mathbf{ \inducingInputScalar}}
\newcommand{\inducingInputMatrix}{\mathbf{Z}}
\newcommand{\inducingScalar}{u}
\newcommand{\inducingVector}{\mathbf{ \inducingScalar}}
\newcommand{\inducingMatrix}{\mathbf{U}}
\newcommand{\inlineDiff}[2]{\text{d}#1/\text{d}#2}
\newcommand{\inputDim}{q}
\newcommand{\inputMatrix}{\mathbf{X}}
\newcommand{\inputScalar}{x}
\newcommand{\inputSpace}{\mathcal{X}}
\newcommand{\inputVals}{\inputVector}
\newcommand{\inputVector}{\mathbf{ \inputScalar}}
\newcommand{\iterNum}{k}
\newcommand{\kernel}{\kernelScalar}
\newcommand{\kernelMatrix}{\mathbf{K}}
\newcommand{\kernelScalar}{k}
\newcommand{\kernelVector}{\mathbf{ \kernelScalar}}
\newcommand{\kff}{\kernelScalar_{\mappingFunction \mappingFunction}}
\newcommand{\kfu}{\kernelVector_{\mappingFunction \inducingScalar}}
\newcommand{\kuf}{\kernelVector_{\inducingScalar \mappingFunction}}
\newcommand{\kuu}{\kernelVector_{\inducingScalar \inducingScalar}}
\newcommand{\lagrangeMultiplier}{\lambda}
\newcommand{\lagrangeMultiplierMatrix}{\boldsymbol{ \Lambda}}
\newcommand{\lagrangian}{L}
\newcommand{\laplacianFactor}{\mathbf{ \MakeUppercase{\laplacianFactorScalar}}}
\newcommand{\laplacianFactorScalar}{m}
\newcommand{\laplacianFactorVector}{\mathbf{ \laplacianFactorScalar}}
\newcommand{\laplacianMatrix}{\mathbf{L}}
\newcommand{\laplacianScalar}{\ell}
\newcommand{\laplacianVector}{\mathbf{ \ell}}
\newcommand{\latentDim}{q}
\newcommand{\latentDistanceMatrix}{\boldsymbol{ \Delta}}
\newcommand{\latentDistanceScalar}{\delta}
\newcommand{\latentDistanceVector}{\boldsymbol{ \delta}}
\newcommand{\latentForce}{f}
\newcommand{\latentFunction}{u}
\newcommand{\latentFunctionVector}{\mathbf{ \latentFunction}}
\newcommand{\latentFunctionMatrix}{\mathbf{ \MakeUppercase{\latentFunction}}}
\newcommand{\latentIndex}{j}
\newcommand{\latentScalar}{z}
\newcommand{\latentVector}{\mathbf{ \latentScalar}}
\newcommand{\latentMatrix}{\mathbf{Z}}
\newcommand{\learnRate}{\eta}
\newcommand{\lengthScale}{\ell}
\newcommand{\rbfWidth}{\ell}
\newcommand{\likelihoodBound}{\mathcal{L}}
\newcommand{\likelihoodFunction}{L}
\newcommand{\locationScalar}{\mu}
\newcommand{\locationVector}{\boldsymbol{ \locationScalar}}
\newcommand{\locationMatrix}{\mathbf{M}}
\newcommand{\variance}[1]{\text{var}\left( #1 \right)}
\newcommand{\mappingFunction}{f}
\newcommand{\mappingFunctionMatrix}{\mathbf{F}}
\newcommand{\mappingFunctionTwo}{g}
\newcommand{\mappingFunctionTwoMatrix}{\mathbf{G}}
\newcommand{\mappingFunctionTwoVector}{\mathbf{ \mappingFunctionTwo}}
\newcommand{\mappingFunctionVector}{\mathbf{ \mappingFunction}}
\newcommand{\scaleScalar}{s}
\newcommand{\mappingScalar}{w}
\newcommand{\mappingVector}{\mathbf{ \mappingScalar}}
\newcommand{\mappingMatrix}{\mathbf{W}}
\newcommand{\mappingScalarTwo}{v}
\newcommand{\mappingVectorTwo}{\mathbf{ \mappingScalarTwo}}
\newcommand{\mappingMatrixTwo}{\mathbf{V}}
\newcommand{\maxIters}{K}
\newcommand{\meanMatrix}{\mathbf{M}}
\newcommand{\meanScalar}{\mu}
\newcommand{\meanTwoMatrix}{\mathbf{M}}
\newcommand{\meanTwoScalar}{m}
\newcommand{\meanTwoVector}{\mathbf{ \meanTwoScalar}}
\newcommand{\meanVector}{\boldsymbol{ \meanScalar}}
\newcommand{\mrnaConcentration}{m}
\newcommand{\naturalFrequency}{\omega}
\newcommand{\neighborhood}[1]{\mathcal{N}\left( #1 \right)}
\newcommand{\neilurl}{http://inverseprobability.com/}
\newcommand{\noiseMatrix}{\boldsymbol{ E}}
\newcommand{\noiseScalar}{\epsilon}
\newcommand{\noiseVector}{\boldsymbol{ \epsilon}}
\newcommand{\norm}[1]{\left\Vert #1 \right\Vert}
\newcommand{\normalizedLaplacianMatrix}{\hat{\mathbf{L}}}
\newcommand{\normalizedLaplacianScalar}{\hat{\ell}}
\newcommand{\normalizedLaplacianVector}{\hat{\mathbf{ \ell}}}
\newcommand{\numActive}{m}
\newcommand{\numBasisFunc}{m}
\newcommand{\numComponents}{m}
\newcommand{\numComps}{K}
\newcommand{\numData}{n}
\newcommand{\numFeatures}{K}
\newcommand{\numHidden}{h}
\newcommand{\numInducing}{m}
\newcommand{\numLayers}{\ell}
\newcommand{\numNeighbors}{K}
\newcommand{\numSequences}{s}
\newcommand{\numSuccess}{s}
\newcommand{\numTasks}{m}
\newcommand{\numTime}{T}
\newcommand{\numTrials}{S}
\newcommand{\outputIndex}{j}
\newcommand{\paramVector}{\boldsymbol{ \theta}}
\newcommand{\parameterMatrix}{\boldsymbol{ \Theta}}
\newcommand{\parameterScalar}{\theta}
\newcommand{\parameterVector}{\boldsymbol{ \parameterScalar}}
\newcommand{\partDiff}[2]{\frac{\partial#1}{\partial#2}}
\newcommand{\precisionScalar}{j}
\newcommand{\precisionVector}{\mathbf{ \precisionScalar}}
\newcommand{\precisionMatrix}{\mathbf{J}}
\newcommand{\pseudotargetScalar}{\widetilde{y}}
\newcommand{\pseudotargetVector}{\mathbf{ \pseudotargetScalar}}
\newcommand{\pseudotargetMatrix}{\mathbf{ \widetilde{Y}}}
\newcommand{\rank}[1]{\text{rank}\left(#1\right)}
\newcommand{\rayleighDist}[2]{\mathcal{R}\left(#1|#2\right)}
\newcommand{\rayleighSamp}[1]{\mathcal{R}\left(#1\right)}
\newcommand{\responsibility}{r}
\newcommand{\rotationScalar}{r}
\newcommand{\rotationVector}{\mathbf{ \rotationScalar}}
\newcommand{\rotationMatrix}{\mathbf{R}}
\newcommand{\sampleCovScalar}{s}
\newcommand{\sampleCovVector}{\mathbf{ \sampleCovScalar}}
\newcommand{\sampleCovMatrix}{\mathbf{s}}
\newcommand{\scalarProduct}[2]{\left\langle{#1},{#2}\right\rangle}
\newcommand{\sign}[1]{\text{sign}\left(#1\right)}
\newcommand{\sigmoid}[1]{\sigma\left(#1\right)}
\newcommand{\singularvalue}{\ell}
\newcommand{\singularvalueMatrix}{\mathbf{L}}
\newcommand{\singularvalueVector}{\mathbf{l}}
\newcommand{\sorth}{\mathbf{u}}
\newcommand{\spar}{\lambda}
\newcommand{\trace}[1]{\text{tr}\left(#1\right)}
\newcommand{\BasalRate}{B}
\newcommand{\DampingCoefficient}{C}
\newcommand{\DecayRate}{D}
\newcommand{\Displacement}{X}
\newcommand{\LatentForce}{F}
\newcommand{\Mass}{M}
\newcommand{\Sensitivity}{S}
\newcommand{\basalRate}{b}
\newcommand{\dampingCoefficient}{c}
\newcommand{\mass}{m}
\newcommand{\sensitivity}{s}
\newcommand{\springScalar}{\kappa}
\newcommand{\springVector}{\boldsymbol{ \kappa}}
\newcommand{\springMatrix}{\boldsymbol{ \mathcal{K}}}
\newcommand{\tfConcentration}{p}
\newcommand{\tfDecayRate}{\delta}
\newcommand{\tfMrnaConcentration}{f}
\newcommand{\tfVector}{\mathbf{ \tfConcentration}}
\newcommand{\velocity}{v}
\newcommand{\sufficientStatsScalar}{g}
\newcommand{\sufficientStatsVector}{\mathbf{ \sufficientStatsScalar}}
\newcommand{\sufficientStatsMatrix}{\mathbf{G}}
\newcommand{\switchScalar}{s}
\newcommand{\switchVector}{\mathbf{ \switchScalar}}
\newcommand{\switchMatrix}{\mathbf{S}}
\newcommand{\tr}[1]{\text{tr}\left(#1\right)}
\newcommand{\loneNorm}[1]{\left\Vert #1 \right\Vert_1}
\newcommand{\ltwoNorm}[1]{\left\Vert #1 \right\Vert_2}
\newcommand{\onenorm}[1]{\left\vert#1\right\vert_1}
\newcommand{\twonorm}[1]{\left\Vert #1 \right\Vert}
\newcommand{\vScalar}{v}
\newcommand{\vVector}{\mathbf{v}}
\newcommand{\vMatrix}{\mathbf{V}}
\newcommand{\varianceDist}[2]{\text{var}_{#2}\left( #1 \right)}
% Already defined by latex
%\newcommand{\vec}{#1:}
\newcommand{\vecb}[1]{\left(#1\right):}
\newcommand{\weightScalar}{w}
\newcommand{\weightVector}{\mathbf{ \weightScalar}}
\newcommand{\weightMatrix}{\mathbf{W}}
\newcommand{\weightedAdjacencyMatrix}{\mathbf{A}}
\newcommand{\weightedAdjacencyScalar}{a}
\newcommand{\weightedAdjacencyVector}{\mathbf{ \weightedAdjacencyScalar}}
\newcommand{\onesVector}{\mathbf{1}}
\newcommand{\zerosVector}{\mathbf{0}}
$$
</div>
<h2 id="what-is-machine-learning">What is Machine Learning?</h2>
<p>What is machine learning? At its most basic level machine learning is a combination of</p>
<p><span class="math display">\[ \text{data} + \text{model} \xrightarrow{\text{compute}} \text{prediction}\]</span></p>
<p>where <em>data</em> is our observations. They can be actively or passively acquired (meta-data). The <em>model</em> contains our assumptions, based on previous experience. That experience can be other data, it can come from transfer learning, or it can merely be our beliefs about the regularities of the universe. In humans our models include our inductive biases. The <em>prediction</em> is an action to be taken or a categorization or a quality score. The reason that machine learning has become a mainstay of artificial intelligence is the importance of predictions in artificial intelligence. The data and the model are combined through computation.</p>
<p>In practice we normally perform machine learning using two functions. To combine data with a model we typically make use of:</p>
<p><strong>a prediction function</strong> a function which is used to make the predictions. It includes our beliefs about the regularities of the universe, our assumptions about how the world works, e.g. smoothness, spatial similarities, temporal similarities.</p>
<p><strong>an objective function</strong> a function which defines the cost of misprediction. Typically it includes knowledge about the world's generating processes (probabilistic objectives) or the costs we pay for mispredictions (empirical risk minimization).</p>
<p>The combination of data and model through the prediction function and the objective function leads to a <em>learning algorithm</em>. The class of prediction functions and objective functions we can make use of is restricted by the algorithms they lead to. If the prediction function or the objective function are too complex, then it can be difficult to find an appropriate learning algorithm. Much of the academic field of machine learning is the quest for new learning algorithms that allow us to bring different types of models and data together.</p>
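<p>As a minimal sketch of how these two functions combine, consider a linear prediction function with a squared-error objective. The function names and data below are illustrative, not from any particular library.</p>

```python
import numpy as np

# A minimal sketch of the two functions described above, assuming a
# linear prediction function and an empirical-risk (squared error)
# objective. Names and data are illustrative.

def prediction_function(x, w):
    """Encodes an assumption about regularity: output varies linearly with x."""
    return w[0] + w[1] * x

def objective_function(y, y_pred):
    """Cost of misprediction: mean squared error."""
    return np.mean((y - y_pred) ** 2)

# Combining data and model through compute yields a learning algorithm;
# for this choice the minimizer is ordinary least squares.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 2.9, 5.1, 7.2])
w = np.polyfit(x, y, deg=1)[::-1]  # [intercept, slope]
print(objective_function(y, prediction_function(x, w)))
```

<p>Changing either function, e.g. replacing the squared error with an absolute error, changes the learning algorithm that results.</p>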
<p>A useful reference for state of the art in machine learning is the UK Royal Society Report, <a href="https://royalsociety.org/~/media/policy/projects/machine-learning/publications/machine-learning-report.pdf">Machine Learning: Power and Promise of Computers that Learn by Example</a>.</p>
<p>You can also check my blog post on <a href="http://inverseprobability.com/2017/07/17/what-is-machine-learning">"What is Machine Learning?"</a></p>
<h2 id="artificial-vs-natural-systems">Artificial vs Natural Systems</h2>
<h3 id="natural-systems-are-evolved">Natural Systems are Evolved</h3>
<blockquote>
<p>Survival of the fittest</p>
<p><a href="https://en.wikipedia.org/wiki/Herbert_Spencer">Herbert Spencer</a>, 1864</p>
</blockquote>
<p>Darwin never said "survival of the fittest"; he talked about evolution by natural selection.</p>
<p>Evolution is better described as "non-survival of the non-fit". You don't have to be the fittest to survive, you just need to avoid the pitfalls of life. This is the first priority.</p>
<p>A mistake we make in the design of our systems is to equate fitness with the objective function, and to assume it is known and static. In practice, a real environment would have an evolving fitness function which would be unknown at any given time.</p>
<p>Uncertainty in models is handled by Bayesian inference; here we consider uncertainty arising in loss functions.</p>
<p>Consider a loss function which decomposes across individual observations, <span class="math inline">\(\dataScalar_{k,j}\)</span>, each of which is dependent on some set of features, <span class="math inline">\(\inputVector_k\)</span>.</p>
<p><span class="math display">\[
\errorFunction(\dataVector, \inputMatrix) = \sum_{k}\sum_{j}
L(\dataScalar_{k,j}, \inputVector_k)
\]</span> Assume that the loss function depends on the features through some mapping function, <span class="math inline">\(\mappingFunction_j(\cdot)\)</span> which we call the <em>prediction function</em>.</p>
<p><span class="math display">\[
\errorFunction(\dataVector, \inputMatrix) = \sum_{k}\sum_{j} L(\dataScalar_{k,j},
\mappingFunction_j(\inputVector_k))
\]</span> without loss of generality, we can move the index to the inputs, so we have <span class="math inline">\(\inputVector_i =\left[\inputVector \quad j\right]\)</span>, and we set <span class="math inline">\(\dataScalar_i = \dataScalar_{k, j}\)</span>. So we have</p>
<p><span class="math display">\[
\errorFunction(\dataVector, \inputMatrix) = \sum_{i} L(\dataScalar_i, \mappingFunction(\inputVector_i))
\]</span> Bayesian inference considers uncertainty in <span class="math inline">\(\mappingFunction\)</span>, often through parameterizing it, <span class="math inline">\(\mappingFunction(\inputVector; \parameterVector)\)</span>, and considering a <em>prior</em> distribution for the parameters, <span class="math inline">\(p(\parameterVector)\)</span>, this in turn implies a distribution over functions, <span class="math inline">\(p(\mappingFunction)\)</span>. Process models, such as Gaussian processes specify this distribution, known as a process, directly.</p>
<p>Bayesian inference proceeds by specifying a <em>likelihood</em> which relates the data, <span class="math inline">\(\dataScalar\)</span>, to the parameters. Here we choose not to do this, but instead we only consider the <em>loss</em> function for our objective. The loss is the cost we pay for a misclassification.</p>
<p>The <em>risk function</em> is the expectation of the loss under the distribution of the data. Here we are using the framework of <em>empirical risk</em> minimization, because we have a sample based approximation. The new expectation we are considering is around the loss function itself, not the uncertainty in the data.</p>
<p>The loss function and the log likelihood may take a mathematically similar form, but they are philosophically very different. The log likelihood assumes something about the <em>generating</em> function of the data, whereas the loss function assumes something about the cost we pay. Importantly, in Bayesian inference the loss function normally enters only at the point of decision.</p>
<p>The key idea in Bayesian inference is that the probabilistic inference can be performed <em>without</em> knowing the loss, because if the model is correct, then the form of the loss function is irrelevant when performing inference. In practice, however, for real data sets the model is almost never correct.</p>
<p>Some of the maths below looks similar to the maths we find in Bayesian methods, in particular variational Bayes, but that is merely a consequence of the availability of analytical mathematics. There are only particular ways of developing tractable algorithms; one route involves linear algebra. However, the similarity of the mathematics belies a difference in interpretation. It is similar to travelling a road (e.g. Ermine Street) in a wild landscape. We travel together because that is where efficient progress is to be made, but in practice our destinations (Lincoln, York) may be different.</p>
<h3 id="introduce-uncertainty">Introduce Uncertainty</h3>
<p>To introduce uncertainty we consider a weighted version of the loss function, we introduce positive weights, <span class="math inline">\(\left\{ \scaleScalar_i\right\}_{i=1}^\numData\)</span>. <span class="math display">\[
\errorFunction(\dataVector, \inputMatrix) = \sum_{i}
\scaleScalar_i L(\dataScalar_i, \mappingFunction(\inputVector_i))
\]</span> We now assume that these weights are drawn from a distribution, <span class="math inline">\(q(\scaleScalar)\)</span>. Instead of looking to minimize the loss directly, we look at the expected loss under this distribution.</p>
<p><span class="math display">\[
\begin{align*}
\errorFunction(\dataVector, \inputMatrix) = & \sum_{i}\expectationDist{\scaleScalar_i L(\dataScalar_i, \mappingFunction(\inputVector_i))}{q(\scaleScalar)} \\
& =\sum_{i}\expectationDist{\scaleScalar_i }{q(\scaleScalar)}L(\dataScalar_i, \mappingFunction(\inputVector_i))
\end{align*}
\]</span> We will assume that our process, <span class="math inline">\(q(\scaleScalar)\)</span> can depend on a variety of inputs such as <span class="math inline">\(\dataVector\)</span>, <span class="math inline">\(\inputMatrix\)</span> and time, <span class="math inline">\(t\)</span>.</p>
<h3 id="principle-of-maximum-entropy">Principle of Maximum Entropy</h3>
<p>To maximize uncertainty in <span class="math inline">\(q(\scaleScalar)\)</span> we maximize its entropy. Following Jaynes' formalism of maximum entropy, in the continuous space we do this with respect to an invariant measure, <span class="math display">\[
H(\scaleScalar)= - \int q(\scaleScalar) \log \frac{q(\scaleScalar)}{m(\scaleScalar)} \text{d}\scaleScalar
\]</span> and since we minimize the loss, we balance this by adding in this term to form <span class="math display">\[
\begin{align*}
\errorFunction = & \beta\sum_{i}\expectationDist{\scaleScalar_i }{q(\scaleScalar)}L(\dataScalar_i, \mappingFunction(\inputVector_i)) - H(\scaleScalar)\\
&= \beta\sum_{i}\expectationDist{\scaleScalar_i }{q(\scaleScalar)}L(\dataScalar_i, \mappingFunction(\inputVector_i)) + \int q(\scaleScalar) \log \frac{q(\scaleScalar)}{m(\scaleScalar)}\text{d}\scaleScalar
\end{align*}
\]</span> where <span class="math inline">\(\beta\)</span> serves to weight the relative contribution of the entropy term and the loss term.</p>
<p>We can now minimize this modified loss with respect to the density <span class="math inline">\(q(\scaleScalar)\)</span>, the freeform optimization over this term leads to <span class="math display">\[
\begin{align*}
q(\scaleScalar) \propto & \exp\left(- \beta \sum_{i=1}^\numData \scaleScalar_i L(\dataScalar_i, \mappingFunction(\inputVector_i)) \right) m(\scaleScalar)\\
\propto & \prod_{i=1}^\numData \exp\left(- \beta \scaleScalar_i L(\dataScalar_i, \mappingFunction(\inputVector_i)) \right) m(\scaleScalar)
\end{align*}
\]</span></p>
<h3 id="example">Example</h3>
<p>Assume <span class="math display">\[
m(\scaleScalar) = \prod_i \lambda\exp\left(-\lambda\scaleScalar_i\right)
\]</span> which is the distribution with the maximum entropy for a given mean, <span class="math inline">\(\scaleScalar\)</span>. Then we have <span class="math display">\[
q(\scaleScalar) = \prod_i q(\scaleScalar_i)
\]</span> <span class="math display">\[
q(\scaleScalar_i) = \left(\lambda+\beta L_i\right) \exp\left(-(\lambda+\beta L_i) \scaleScalar_i\right)
\]</span> and we can compute <span class="math display">\[
\expectationDist{\scaleScalar_i}{q(\scaleScalar)} =
\frac{1}{\lambda + \beta L_i}
\]</span></p>
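<p>The expected weights above can be computed directly. This sketch assumes <span class="math inline">\(\lambda\)</span> and <span class="math inline">\(\beta\)</span> are given hyperparameters and takes a vector of per-point losses; larger losses yield smaller expected weights.</p>

```python
import numpy as np

# Expected scale factors under q(s_i), following the exponential
# measure example above. lam and beta are assumed hyperparameters.

def expected_scale(L, lam=1.0, beta=1.0):
    """E[s_i] = 1 / (lam + beta * L_i): high-loss points are down-weighted."""
    return 1.0 / (lam + beta * L)

losses = np.array([0.1, 1.0, 10.0])
print(expected_scale(losses))  # decreasing in the loss
```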
<h3 id="coordinate-descent">Coordinate Descent</h3>
<p>We can minimize with respect to <span class="math inline">\(q(\scaleScalar)\)</span> recovering, <span class="math display">\[
q(\scaleScalar_i) = \left(\lambda+\beta L_i\right) \exp\left(-(\lambda+\beta L_i) \scaleScalar_i\right)
\]</span> allowing us to compute the expectation of <span class="math inline">\(\scaleScalar\)</span>, <span class="math display">\[
\expectationDist{\scaleScalar_i}{q(\scaleScalar_i)} = \frac{1}{\lambda+\beta
L_i}
\]</span> then, we can minimize our expected loss with respect to <span class="math inline">\(\mappingFunction(\cdot)\)</span> <span class="math display">\[
\beta \sum_{i=1}^\numData \expectationDist{\scaleScalar_i}{q(\scaleScalar_i)} L(\dataScalar_i, \mappingFunction(\inputVector_i))
\]</span> If the loss is the <em>squared loss</em>, then this is recognised as a <em>reweighted least squares algorithm</em>. However, the loss can be of any form as long as <span class="math inline">\(q(\scaleScalar)\)</span> defined above exists.</p>
<p>In addition to the above, in our example below, we updated <span class="math inline">\(\beta\)</span> to normalize the expected loss to be <span class="math inline">\(\numData\)</span> at each iteration, so we have <span class="math display">\[
\beta = \frac{\numData}{\sum_{i=1}^\numData \expectationDist{\scaleScalar_i}{q(\scaleScalar_i)} L(\dataScalar_i, \mappingFunction(\inputVector_i))}
\]</span></p>
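<p>The coordinate descent above can be sketched for the squared loss as follows: alternate between computing the expected weights under <span class="math inline">\(q(\scaleScalar)\)</span>, solving the resulting reweighted least squares problem, and renormalizing <span class="math inline">\(\beta\)</span>. The variable names and the simple linear basis here are illustrative assumptions, not a definitive implementation.</p>

```python
import numpy as np

# Sketch of the coordinate-descent updates above for the squared loss:
# (i) expected weights E[s_i] = 1/(lam + beta*L_i), (ii) weighted least
# squares for the linear map, (iii) renormalization of beta so the
# expected loss equals the number of data points. Illustrative only.

def fit_weighted(X, y, lam=1.0, beta=1.0, iters=30):
    Phi = np.hstack([np.ones_like(X), X])     # linear basis [1, x]
    w = np.zeros((Phi.shape[1], 1))
    for _ in range(iters):
        L = (y - Phi @ w) ** 2                # per-point squared losses
        s = 1.0 / (lam + beta * L)            # expected weights under q(s)
        sqrt_s = np.sqrt(s)
        # weighted least squares: minimize sum_i s_i (y_i - phi_i^T w)^2
        w = np.linalg.lstsq(Phi * sqrt_s, y * sqrt_s, rcond=None)[0]
        L = (y - Phi @ w) ** 2
        beta = len(y) / np.sum(L / (lam + beta * L))
    return w, s

# An outlier receives a small expected weight, giving a robust fit.
X = np.linspace(0, 1, 20)[:, None]
y = 2 * X + 1 + 0.05 * np.random.default_rng(0).standard_normal(X.shape)
y[0] = 10.0                                   # outlier
w, s = fit_weighted(X, y)
```

<p>For the squared loss this is a reweighted least squares algorithm, as noted above; other losses can be substituted provided the implied <span class="math inline">\(q(\scaleScalar)\)</span> exists.</p>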
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">import</span> numpy <span class="im">as</span> np
<span class="im">import</span> matplotlib.pyplot <span class="im">as</span> plt
<span class="im">import</span> pods
<span class="im">import</span> teaching_plots <span class="im">as</span> plot
<span class="im">import</span> mlai</code></pre></div>
<h3 id="olympic-marathon-data">Olympic Marathon Data</h3>
<p>The first thing we will do is load a standard data set for regression modelling. The data consists of the pace of Olympic Gold Medal Marathon winners for the Olympics from 1896 to present. First we load in the data and plot.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">data <span class="op">=</span> pods.datasets.olympic_marathon_men()
x <span class="op">=</span> data[<span class="st">'X'</span>]
y <span class="op">=</span> data[<span class="st">'Y'</span>]
offset <span class="op">=</span> y.mean()
scale <span class="op">=</span> np.sqrt(y.var())
xlim <span class="op">=</span> (<span class="dv">1875</span>,<span class="dv">2030</span>)
ylim <span class="op">=</span> (<span class="fl">2.5</span>, <span class="fl">6.5</span>)
yhat <span class="op">=</span> (y<span class="op">-</span>offset)<span class="op">/</span>scale
fig, ax <span class="op">=</span> plt.subplots(figsize<span class="op">=</span>plot.big_wide_figsize)
_ <span class="op">=</span> ax.plot(x, y, <span class="st">'r.'</span>,markersize<span class="op">=</span><span class="dv">10</span>)
ax.set_xlabel(<span class="st">'year'</span>, fontsize<span class="op">=</span><span class="dv">20</span>)
ax.set_ylabel(<span class="st">'pace min/km'</span>, fontsize<span class="op">=</span><span class="dv">20</span>)
ax.set_xlim(xlim)
ax.set_ylim(ylim)
mlai.write_figure(figure<span class="op">=</span>fig, filename<span class="op">=</span><span class="st">'../slides/diagrams/datasets/olympic-marathon.svg'</span>, transparent<span class="op">=</span><span class="va">True</span>, frameon<span class="op">=</span><span class="va">True</span>)</code></pre></div>
<h3 id="olympic-marathon-data-1">Olympic Marathon Data</h3>
<table>
<tr>
<td width="70%">
<ul>
<li><p>Gold medal times for Olympic Marathon since 1896.</p></li>
<li><p>Marathons before 1924 didn’t have a standardised distance.</p></li>
<li><p>Present results using pace per km.</p></li>
<li>In 1904 Marathon was badly organised leading to very slow times.</li>
</ul>
</td>
<td width="30%">
<img src="../slides/diagrams/Stephen_Kiprotich.jpg" alt="image" /> <small>Image from Wikimedia Commons <a href="http://bit.ly/16kMKHQ" class="uri">http://bit.ly/16kMKHQ</a></small>
</td>
</tr>
</table>
<object class="svgplot" align data="../slides/diagrams/datasets/olympic-marathon.svg">
</object>
<p>Things to notice about the data include the outlier in 1904; in this year the Olympics were held in St. Louis, USA. Organizational problems, and challenges with dust kicked up by cars following the race, meant that participants got lost and only a few completed.</p>
<p>More recent years see more consistently quick marathons.</p>
<h3 id="example-linear-regression">Example: Linear Regression</h3>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">import</span> mlai
<span class="im">import</span> numpy <span class="im">as</span> np
<span class="im">import</span> scipy <span class="im">as</span> sp</code></pre></div>
<p>Create a weighted linear regression class, inheriting from the <code>mlai.LM</code> class.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="kw">class</span> LML(mlai.LM):
<span class="co">"""Linear model with evolving loss</span>
<span class="co"> :param X: input values</span>
<span class="co"> :type X: numpy.ndarray</span>
<span class="co"> :param y: target values</span>
<span class="co"> :type y: numpy.ndarray</span>
<span class="co"> :param basis: basis function </span>
<span class="co"> :param type: function</span>
<span class="co"> :param beta: weight of the loss function</span>
<span class="co"> :type beta: float"""</span>
<span class="kw">def</span> <span class="fu">__init__</span>(<span class="va">self</span>, X, y, basis<span class="op">=</span><span class="va">None</span>, beta<span class="op">=</span><span class="fl">1.0</span>, lambd<span class="op">=</span><span class="fl">1.0</span>):
<span class="co">"Initialise"</span>
<span class="cf">if</span> basis <span class="kw">is</span> <span class="va">None</span>:
basis <span class="op">=</span> mlai.basis(mlai.polynomial, number<span class="op">=</span><span class="dv">2</span>)
mlai.LM.<span class="fu">__init__</span>(<span class="va">self</span>, X, y, basis)
<span class="va">self</span>.s <span class="op">=</span> np.ones((<span class="va">self</span>.num_data, <span class="dv">1</span>))<span class="co">#np.random.rand(self.num_data, 1)>0.5</span>
<span class="va">self</span>.update_w()
<span class="va">self</span>.sigma2 <span class="op">=</span> <span class="dv">1</span><span class="op">/</span>beta
<span class="va">self</span>.beta <span class="op">=</span> beta
<span class="va">self</span>.name <span class="op">=</span> <span class="st">'LML_'</span><span class="op">+</span>basis.function.<span class="va">__name__</span>
<span class="va">self</span>.objective_name <span class="op">=</span> <span class="st">'Weighted Sum of Square Training Error'</span>
<span class="va">self</span>.lambd <span class="op">=</span> lambd
<span class="kw">def</span> update_QR(<span class="va">self</span>):
<span class="co">"Perform the QR decomposition on the basis matrix."</span>
<span class="va">self</span>.Q, <span class="va">self</span>.R <span class="op">=</span> np.linalg.qr(<span class="va">self</span>.Phi<span class="op">*</span>np.sqrt(<span class="va">self</span>.s))
<span class="kw">def</span> fit(<span class="va">self</span>):
<span class="co">"""Minimize the objective function with respect to the parameters"""</span>
<span class="cf">for</span> i <span class="kw">in</span> <span class="bu">range</span>(<span class="dv">30</span>):
<span class="va">self</span>.update_w() <span class="co"># As in the linear regression class</span>
<span class="va">self</span>.update_s()
<span class="kw">def</span> update_w(<span class="va">self</span>):
<span class="va">self</span>.update_QR()
<span class="va">self</span>.w_star <span class="op">=</span> sp.linalg.solve_triangular(<span class="va">self</span>.R, np.dot(<span class="va">self</span>.Q.T, <span class="va">self</span>.y<span class="op">*</span>np.sqrt(<span class="va">self</span>.s)))
<span class="va">self</span>.update_losses()
<span class="kw">def</span> predict(<span class="va">self</span>, X):
<span class="co">"""Return the result of the prediction function."""</span>
<span class="cf">return</span> np.dot(<span class="va">self</span>.basis.Phi(X), <span class="va">self</span>.w_star), <span class="va">None</span>
<span class="kw">def</span> update_s(<span class="va">self</span>):
<span class="co">"""Update the expected scales for each data point."""</span>
<span class="va">self</span>.s <span class="op">=</span> <span class="dv">1</span><span class="op">/</span>(<span class="va">self</span>.lambd <span class="op">+</span> <span class="va">self</span>.beta<span class="op">*</span><span class="va">self</span>.losses)
<span class="kw">def</span> update_losses(<span class="va">self</span>):
<span class="co">"""Compute the loss functions for each data point."""</span>
<span class="va">self</span>.update_f()
<span class="va">self</span>.losses <span class="op">=</span> ((<span class="va">self</span>.y<span class="op">-</span><span class="va">self</span>.f)<span class="op">**</span><span class="dv">2</span>)
<span class="va">self</span>.beta <span class="op">=</span> <span class="dv">1</span><span class="op">/</span>(<span class="va">self</span>.losses<span class="op">*</span><span class="va">self</span>.s).mean()
<span class="kw">def</span> objective(<span class="va">self</span>):
<span class="co">"""Compute the objective function."""</span>
<span class="va">self</span>.update_losses()
<span class="cf">return</span> (<span class="va">self</span>.losses<span class="op">*</span><span class="va">self</span>.s).<span class="bu">sum</span>()</code></pre></div>
<p>Set up a linear model (polynomial with two basis functions).</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">num_basis<span class="op">=</span><span class="dv">2</span>
data_limits<span class="op">=</span>[<span class="dv">1890</span>, <span class="dv">2020</span>]
basis <span class="op">=</span> mlai.basis(mlai.polynomial, num_basis, data_limits<span class="op">=</span>data_limits)
model <span class="op">=</span> LML(x, y, basis<span class="op">=</span>basis, lambd<span class="op">=</span><span class="dv">1</span>, beta<span class="op">=</span><span class="dv">1</span>)
model2 <span class="op">=</span> mlai.LM(x, y, basis<span class="op">=</span>basis)</code></pre></div>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">model.fit()
model2.fit()</code></pre></div>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">import</span> matplotlib.pyplot <span class="im">as</span> plt</code></pre></div>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">x_test <span class="op">=</span> np.linspace(data_limits[<span class="dv">0</span>], data_limits[<span class="dv">1</span>], <span class="dv">130</span>)[:, <span class="va">None</span>]
f_test, f_var <span class="op">=</span> model.predict(x_test)
f2_test, f2_var <span class="op">=</span> model2.predict(x_test)</code></pre></div>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">import</span> teaching_plots <span class="im">as</span> plot
<span class="im">from</span> matplotlib <span class="im">import</span> rc, rcParams
rcParams.update({<span class="st">'font.size'</span>: <span class="dv">22</span>})
rc(<span class="st">'text'</span>, usetex<span class="op">=</span><span class="va">True</span>)</code></pre></div>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">fig, ax <span class="op">=</span> plt.subplots(figsize<span class="op">=</span>plot.big_wide_figsize)
ax.plot(x_test, f2_test, linewidth<span class="op">=</span><span class="dv">3</span>, color<span class="op">=</span><span class="st">'r'</span>)
ax.plot(x, y, <span class="st">'g.'</span>, markersize<span class="op">=</span><span class="dv">10</span>)
ax.set_xlim(data_limits[<span class="dv">0</span>], data_limits[<span class="dv">1</span>])
ax.set_xlabel(<span class="st">'year'</span>)
ax.set_ylabel(<span class="st">'pace min/km'</span>)
_ <span class="op">=</span> ax.set_ylim(<span class="dv">2</span>, <span class="dv">6</span>)
mlai.write_figure(<span class="st">'../slides/diagrams/ml/olympic-loss-linear-regression000.svg'</span>, transparent<span class="op">=</span><span class="va">True</span>)
ax.plot(x_test, f_test, linewidth<span class="op">=</span><span class="dv">3</span>, color<span class="op">=</span><span class="st">'b'</span>)
ax.plot(x, y, <span class="st">'g.'</span>, markersize<span class="op">=</span><span class="dv">10</span>)
ax2 <span class="op">=</span> ax.twinx()
ax2.bar(x.flatten(), model.s.flatten(), width<span class="op">=</span><span class="dv">2</span>, color<span class="op">=</span><span class="st">'b'</span>)
ax2.set_ylim(<span class="dv">0</span>, <span class="dv">4</span>)
ax2.set_yticks([<span class="dv">0</span>, <span class="dv">1</span>, <span class="dv">2</span>])
ax2.set_ylabel(<span class="st">'$\langle s_i </span><span class="ch">\\</span><span class="st">rangle$'</span>)
mlai.write_figure(<span class="st">'../slides/diagrams/ml/olympic-loss-linear-regression001.svg'</span>, transparent<span class="op">=</span><span class="va">True</span>)</code></pre></div>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">import</span> pods
pods.notebook.display_plots(<span class="st">'olympic-loss-linear-regression</span><span class="sc">{number:0>3}</span><span class="st">.svg'</span>,
directory<span class="op">=</span><span class="st">'../slides/diagrams/ml'</span>, number<span class="op">=</span>(<span class="dv">0</span>, <span class="dv">1</span>))</code></pre></div>
<object class="svgplot" data="../slides/diagrams/ml/olympic-loss-linear-regression001.svg">
</object>
<center>
<em>Linear regression for the standard quadratic loss in red and the probabilistically weighted loss in blue.</em>
</center>
<h3 id="parameter-uncertainty">Parameter Uncertainty</h3>
<p>Classical Bayesian inference is concerned with parameter uncertainty, which equates to uncertainty in the <em>prediction function</em>, <span class="math inline">\(\mappingFunction(\inputVector)\)</span>. The prediction function normally either provides an estimate of the value of <span class="math inline">\(\dataScalar\)</span> or constructs a probability density for <span class="math inline">\(\dataScalar\)</span>.</p>
<p>Uncertainty in the prediction function can arise through uncertainty in our loss function, but also through uncertainty in parameters in the classical Bayesian sense. The full maximum entropy formalism would now be <span class="math display">\[
\expectationDist{\beta \scaleScalar_i L(\dataScalar_i,
\mappingFunction(\inputVector_i))}{q(\scaleScalar, \mappingFunction)} + \int
q(\scaleScalar, \mappingFunction) \log \frac{q(\scaleScalar,
\mappingFunction)}{m(\scaleScalar)m(\mappingFunction)}\text{d}\scaleScalar
\text{d}\mappingFunction
\]</span></p>
<p><span class="math display">\[
q(\mappingFunction, \scaleScalar) \propto
\prod_{i=1}^\numData \exp\left(- \beta \scaleScalar_i L(\dataScalar_i,
\mappingFunction(\inputVector_i)) \right) m(\scaleScalar)m(\mappingFunction)
\]</span></p>
<h3 id="approximation">Approximation</h3>
<ul>
<li><p>Generally intractable, so assume: <span class="math display">\[
q(\mappingFunction, \scaleScalar) = q(\mappingFunction)q(\scaleScalar)
\]</span></p></li>
<li><p>Entropy maximization proceeds as before but with <span class="math display">\[
q(\scaleScalar) \propto
\prod_{i=1}^\numData \exp\left(- \beta \scaleScalar_i \expectationDist{L(\dataScalar_i,
\mappingFunction(\inputVector_i))}{q(\mappingFunction)} \right) m(\scaleScalar)
\]</span> and <span class="math display">\[
q(\mappingFunction) \propto
\prod_{i=1}^\numData \exp\left(- \beta \expectationDist{\scaleScalar_i}{q(\scaleScalar)} L(\dataScalar_i,
\mappingFunction(\inputVector_i)) \right) m(\mappingFunction)
\]</span></p></li>
<li><p>Can now proceed with iteration between <span class="math inline">\(q(\scaleScalar)\)</span>, <span class="math inline">\(q(\mappingFunction)\)</span></p></li>
</ul>
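<p>A minimal sketch of this alternation (illustrative only, not the <code>mlai</code> classes used below: a one-parameter model <span class="math inline">\(\mappingFunction(\inputScalar) = w\inputScalar\)</span>, squared loss, and <span class="math inline">\(\beta\)</span> held fixed) shows how the update <span class="math inline">\(\langle\scaleScalar_i\rangle = 1/(\lambda + \beta L_i)\)</span> drives down the scale of an outlying point:</p>

```python
# Alternate between fitting q(f) (here a point estimate of w by weighted
# least squares) and updating the expected scales <s_i> = 1/(lambda + beta*L_i).
# Synthetic data: roughly y = x, with the final point an outlier.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.1, 1.9, 3.2, 4.0, 9.0]
beta, lambd = 1.0, 1.0
s = [1.0] * len(xs)                      # initial expected scales

for _ in range(30):
    # weighted least squares solution for w given the current scales
    w = sum(si * xi * yi for si, xi, yi in zip(s, xs, ys)) \
        / sum(si * xi * xi for si, xi in zip(s, xs))
    # per-point squared losses, then the scale update
    losses = [(yi - w * xi) ** 2 for xi, yi in zip(xs, ys)]
    s = [1.0 / (lambd + beta * Li) for Li in losses]

print(w, s)   # the outlier's expected scale ends up well below the others'
```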
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="kw">class</span> BLML(mlai.BLM):
<span class="co">"""Bayesian Linear model with evolving loss</span>
<span class="co"> :param X: input values</span>
<span class="co"> :type X: numpy.ndarray</span>
<span class="co"> :param y: target values</span>
<span class="co"> :type y: numpy.ndarray</span>
<span class="co"> :param basis: basis function </span>
<span class="co"> :type basis: function</span>
<span class="co"> :param beta: weight of the loss function</span>
<span class="co"> :type beta: float"""</span>
<span class="kw">def</span> <span class="fu">__init__</span>(<span class="va">self</span>, X, y, basis<span class="op">=</span><span class="va">None</span>, alpha<span class="op">=</span><span class="fl">1.0</span>, beta<span class="op">=</span><span class="fl">1.0</span>, lambd<span class="op">=</span><span class="fl">1.0</span>):
<span class="co">"Initialise"</span>
<span class="cf">if</span> basis <span class="kw">is</span> <span class="va">None</span>:
basis <span class="op">=</span> mlai.basis(mlai.polynomial, number<span class="op">=</span><span class="dv">2</span>)
mlai.BLM.<span class="fu">__init__</span>(<span class="va">self</span>, X, y, basis<span class="op">=</span>basis, alpha<span class="op">=</span>alpha, sigma2<span class="op">=</span><span class="dv">1</span><span class="op">/</span>beta)
<span class="va">self</span>.s <span class="op">=</span> np.ones((<span class="va">self</span>.num_data, <span class="dv">1</span>))<span class="co">#np.random.rand(self.num_data, 1)>0.5 </span>
<span class="va">self</span>.update_w()
<span class="va">self</span>.beta <span class="op">=</span> beta
<span class="va">self</span>.name <span class="op">=</span> <span class="st">'BLML_'</span><span class="op">+</span>basis.function.<span class="va">__name__</span>
<span class="va">self</span>.objective_name <span class="op">=</span> <span class="st">'Weighted Sum of Square Training Error'</span>
<span class="va">self</span>.lambd <span class="op">=</span> lambd
<span class="kw">def</span> update_QR(<span class="va">self</span>):
<span class="co">"Perform the QR decomposition on the basis matrix."</span>
<span class="va">self</span>.Q, <span class="va">self</span>.R <span class="op">=</span> np.linalg.qr(np.vstack([<span class="va">self</span>.Phi<span class="op">*</span>np.sqrt(<span class="va">self</span>.s), np.sqrt(<span class="va">self</span>.sigma2<span class="op">/</span><span class="va">self</span>.alpha)<span class="op">*</span>np.eye(<span class="va">self</span>.basis.number)]))
<span class="kw">def</span> fit(<span class="va">self</span>):
<span class="co">"""Minimize the objective function with respect to the parameters"""</span>
<span class="cf">for</span> i <span class="kw">in</span> <span class="bu">range</span>(<span class="dv">30</span>):
<span class="va">self</span>.update_w()
<span class="va">self</span>.update_s()
<span class="kw">def</span> update_w(<span class="va">self</span>):
<span class="va">self</span>.update_QR()
<span class="va">self</span>.QTy <span class="op">=</span> np.dot(<span class="va">self</span>.Q[:<span class="va">self</span>.y.shape[<span class="dv">0</span>], :].T, <span class="va">self</span>.y<span class="op">*</span>np.sqrt(<span class="va">self</span>.s))
<span class="va">self</span>.mu_w <span class="op">=</span> sp.linalg.solve_triangular(<span class="va">self</span>.R, <span class="va">self</span>.QTy)
<span class="va">self</span>.RTinv <span class="op">=</span> sp.linalg.solve_triangular(<span class="va">self</span>.R, np.eye(<span class="va">self</span>.R.shape[<span class="dv">0</span>]), trans<span class="op">=</span><span class="st">'T'</span>)
<span class="va">self</span>.C_w <span class="op">=</span> np.dot(<span class="va">self</span>.RTinv, <span class="va">self</span>.RTinv.T)
<span class="va">self</span>.update_losses()
<span class="kw">def</span> update_s(<span class="va">self</span>):
<span class="co">"""Update the expected scales for each data point."""</span>
<span class="va">self</span>.s <span class="op">=</span> <span class="dv">1</span><span class="op">/</span>(<span class="va">self</span>.lambd <span class="op">+</span> <span class="va">self</span>.beta<span class="op">*</span><span class="va">self</span>.losses)
<span class="kw">def</span> update_losses(<span class="va">self</span>):
<span class="co">"""Compute the loss functions for each data point."""</span>
<span class="va">self</span>.update_f()
<span class="va">self</span>.losses <span class="op">=</span> ((<span class="va">self</span>.y<span class="op">-</span><span class="va">self</span>.f_bar)<span class="op">**</span><span class="dv">2</span>) <span class="op">+</span> <span class="va">self</span>.f_cov[:, np.newaxis]
<span class="va">self</span>.beta <span class="op">=</span> <span class="dv">1</span><span class="op">/</span>(<span class="va">self</span>.losses<span class="op">*</span><span class="va">self</span>.s).mean()
<span class="va">self</span>.sigma2<span class="op">=</span><span class="dv">1</span><span class="op">/</span><span class="va">self</span>.beta
</code></pre></div>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">model <span class="op">=</span> BLML(x, y, basis<span class="op">=</span>basis, alpha<span class="op">=</span><span class="dv">1000</span>, lambd<span class="op">=</span><span class="dv">1</span>, beta<span class="op">=</span><span class="dv">1</span>)
model2 <span class="op">=</span> mlai.BLM(x, y, basis<span class="op">=</span>basis, alpha<span class="op">=</span><span class="dv">1000</span>, sigma2<span class="op">=</span><span class="dv">1</span>)</code></pre></div>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">model.fit()
model2.fit()</code></pre></div>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">x_test <span class="op">=</span> np.linspace(data_limits[<span class="dv">0</span>], data_limits[<span class="dv">1</span>], <span class="dv">130</span>)[:, <span class="va">None</span>]
f_test, f_var <span class="op">=</span> model.predict(x_test)
f2_test, f2_var <span class="op">=</span> model2.predict(x_test)</code></pre></div>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">import</span> gp_tutorial</code></pre></div>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">fig, ax <span class="op">=</span> plt.subplots(figsize<span class="op">=</span>plot.big_wide_figsize)
<span class="im">from</span> matplotlib <span class="im">import</span> rc, rcParams
rcParams.update({<span class="st">'font.size'</span>: <span class="dv">22</span>})
rc(<span class="st">'text'</span>, usetex<span class="op">=</span><span class="va">True</span>)
gp_tutorial.gpplot(x_test, f2_test, f2_test <span class="op">-</span> <span class="dv">2</span><span class="op">*</span>np.sqrt(f2_var), f2_test <span class="op">+</span> <span class="dv">2</span><span class="op">*</span>np.sqrt(f2_var), ax<span class="op">=</span>ax, edgecol<span class="op">=</span><span class="st">'r'</span>, fillcol<span class="op">=</span><span class="st">'#CC3300'</span>)
ax.plot(x, y, <span class="st">'g.'</span>, markersize<span class="op">=</span><span class="dv">10</span>)
ax.set_xlim(data_limits[<span class="dv">0</span>], data_limits[<span class="dv">1</span>])
ax.set_xlabel(<span class="st">'year'</span>)
ax.set_ylabel(<span class="st">'pace min/km'</span>)
_ <span class="op">=</span> ax.set_ylim(<span class="dv">2</span>, <span class="dv">6</span>)
mlai.write_figure(<span class="st">'../slides/diagrams/ml/olympic-loss-bayes-linear-regression000.svg'</span>, transparent<span class="op">=</span><span class="va">True</span>)
gp_tutorial.gpplot(x_test, f_test, f_test <span class="op">-</span> <span class="dv">2</span><span class="op">*</span>np.sqrt(f_var), f_test <span class="op">+</span> <span class="dv">2</span><span class="op">*</span>np.sqrt(f_var), ax<span class="op">=</span>ax, edgecol<span class="op">=</span><span class="st">'b'</span>, fillcol<span class="op">=</span><span class="st">'#0033CC'</span>)
<span class="co">#ax.plot(x_test, f_test, linewidth=3, color='b')</span>
ax.plot(x, y, <span class="st">'g.'</span>, markersize<span class="op">=</span><span class="dv">10</span>)
ax2 <span class="op">=</span> ax.twinx()
ax2.bar(x.flatten(), model.s.flatten(), width<span class="op">=</span><span class="dv">2</span>, color<span class="op">=</span><span class="st">'b'</span>)
ax2.set_ylim(<span class="dv">0</span>, <span class="fl">0.2</span>)
ax2.set_yticks([<span class="dv">0</span>, <span class="fl">0.1</span>, <span class="fl">0.2</span>])
ax2.set_ylabel(<span class="st">'$\langle s_i </span><span class="ch">\\</span><span class="st">rangle$'</span>)
mlai.write_figure(<span class="st">'../slides/diagrams/ml/olympic-loss-bayes-linear-regression001.svg'</span>, transparent<span class="op">=</span><span class="va">True</span>)</code></pre></div>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">import</span> pods
pods.notebook.display_plots(<span class="st">'olympic-loss-bayes-linear-regression</span><span class="sc">{number:0>3}</span><span class="st">.svg'</span>,
directory<span class="op">=</span><span class="st">'../slides/diagrams/ml'</span>, number<span class="op">=</span>(<span class="dv">0</span>, <span class="dv">1</span>))</code></pre></div>
<object class="svgplot" data="../slides/diagrams/ml/olympic-loss-bayes-linear-regression001.svg">
</object>
<center>
<em>Probabilistic linear regression for the standard quadratic loss in red and the probabilistically weighted loss in blue.</em>
</center>
<h3 id="correlated-scales">Correlated Scales</h3>
<p>Going beyond independence between the scales, we now consider <span class="math inline">\(m(\vScalar)\)</span> to be a Gaussian process, and scale by the <em>square</em> of <span class="math inline">\(\vScalar\)</span>, <span class="math inline">\(\scaleScalar=\vScalar^2\)</span>, <span class="math display">\[
\vScalar \sim \mathcal{GP}\left(\meanScalar(\inputVector), \kernel(\inputVector, \inputVector^\prime)\right)
\]</span></p>
<p><span class="math display">\[
q(\vScalar) \propto
\prod_{i=1}^\numData \exp\left(- \beta \vScalar_i^2 L(\dataScalar_i,
\mappingFunction(\inputVector_i)) \right)
\exp\left(-\frac{1}{2}(\vVector-\meanVector)^\top \kernelMatrix^{-1}
(\vVector-\meanVector)\right)
\]</span> where <span class="math inline">\(\kernelMatrix\)</span> is the covariance of the process made up of elements taken from the covariance function, <span class="math inline">\(\kernelScalar(\inputVector, t, \dataVector; \inputVector^\prime, t^\prime, \dataVector^\prime)\)</span> so <span class="math inline">\(q(\vScalar)\)</span> itself is Gaussian with covariance <span class="math display">\[
\covarianceMatrix = \left(\beta\mathbf{L} + \kernelMatrix^{-1}\right)^{-1}
\]</span> and mean <span class="math display">\[
\meanTwoVector = \beta\covarianceMatrix\mathbf{L}\meanVector
\]</span> where <span class="math inline">\(\mathbf{L}\)</span> is a matrix containing the loss functions, <span class="math inline">\(L(\dataScalar_i, \mappingFunction(\inputVector_i))\)</span> along its diagonal elements with zeros elsewhere.</p>
<p>The update is given by <span class="math display">\[
\expectationDist{\vScalar_i^2}{q(\vScalar)} = \meanTwoScalar_i^2 +
\covarianceScalar_{i, i}.
\]</span> To compare with before: if the mean of the measure <span class="math inline">\(m(\vScalar)\)</span> were zero and the prior covariance were spherical, <span class="math inline">\(\kernelMatrix=\lambda^{-1}\eye\)</span>, then this would equate to the update <span class="math display">\[
\expectationDist{\vScalar_i^2}{q(\vScalar)} = \frac{1}{\lambda + \beta L_i}
\]</span> which is the same as we had before for the exponential prior over <span class="math inline">\(\scaleScalar\)</span>.</p>
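<p>We can check this equivalence numerically (an illustrative sketch with made-up losses, not part of the lecture code): with a spherical prior, the covariance <span class="math inline">\(\covarianceMatrix = (\beta\mathbf{L} + \kernelMatrix^{-1})^{-1}\)</span> and mean <span class="math inline">\(\meanTwoVector = \beta\covarianceMatrix\mathbf{L}\meanVector\)</span> give back the exponential-prior update.</p>

```python
import numpy as np

# Correlated-scales update: q(v) is Gaussian with covariance
# C = (beta*L + K^{-1})^{-1} and mean mu2 = beta*C*L*mu_prior.
# With a spherical prior K = (1/lambd)*I and zero prior mean, the
# expectation <v_i^2> = mu2_i^2 + C_ii reduces to 1/(lambd + beta*L_i).
beta, lambd = 1.0, 2.0
losses = np.array([0.1, 0.5, 2.0])          # example per-point losses L_i
L = np.diag(losses)
K = np.eye(3) / lambd                        # spherical prior covariance
mu_prior = np.zeros((3, 1))

C = np.linalg.inv(beta * L + np.linalg.inv(K))
mu2 = beta * C @ L @ mu_prior
s = (mu2 ** 2 + np.diag(C)[:, None]).flatten()   # <v_i^2>

print(np.allclose(s, 1.0 / (lambd + beta * losses)))  # True
```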
<h3 id="conditioning-the-measure">Conditioning the Measure</h3>
<p>Now that we have defined a process over <span class="math inline">\(\vScalar\)</span>, we can define a region in which we are certain that we would like the scales to be high. For example, if we were looking to have a test point at location <span class="math inline">\(\inputVector_\ast\)</span>, we could update our measure to be a Gaussian process that is conditioned on the observation of <span class="math inline">\(\vScalar_\ast\)</span> set appropriately at <span class="math inline">\(\inputVector_\ast\)</span>. In this case we have,</p>
<p><span class="math display">\[
\kernelMatrix^\prime = \kernelMatrix - \frac{\kernelVector_\ast\kernelVector^\top_\ast}{\kernelScalar_{*,*}}
\]</span> and <span class="math display">\[
\meanVector^\prime = \meanVector + \frac{\kernelVector_\ast}{\kernelScalar_{*,*}}
(\vScalar_\ast-\meanScalar)
\]</span> where <span class="math inline">\(\kernelVector_\ast\)</span> is the vector computed through the covariance function between the training data <span class="math inline">\(\inputMatrix\)</span> and the proposed point that we are conditioning the scale upon, <span class="math inline">\(\inputVector_\ast\)</span>, and <span class="math inline">\(\kernelScalar_{*,*}\)</span> is the covariance function computed for <span class="math inline">\(\inputVector_\ast\)</span>. The updated mean and covariance can now be used in the maximum entropy formulation as before: <span class="math display">\[
q(\vScalar) \propto \prod_{i=1}^\numData \exp\left(-
\beta \vScalar_i^2 L(\dataScalar_i, \mappingFunction(\inputVector_i)) \right)
\exp\left(-\frac{1}{2}(\vVector-\meanVector^\prime)^\top
\left.\kernelMatrix^\prime\right.^{-1} (\vVector-\meanVector^\prime)\right)
\]</span></p>
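<p>The conditioning step can be sketched directly with numpy (an assumed squared exponential covariance stands in for <code>mlai.eq_cov</code>; the years and the conditioning point are hypothetical):</p>

```python
import numpy as np

# Condition the GP measure m(v) on an observation v(x_*) = v_* at a test
# location, following K' = K - k_* k_*^T / k_** and
# mu' = mu + (k_* / k_**)(v_* - mu).
def eq_cov(X, X2, lengthscale=20.0, variance=1.0):
    """Squared exponential covariance between two sets of 1-D inputs."""
    d = X - X2.T
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

X = np.array([[1980.0], [2000.0], [2012.0]])   # training inputs (years)
X_star = np.array([[2020.0]])                  # point to condition upon
v_star, mu0 = 1.0, 0.0

K = eq_cov(X, X)
k_star = eq_cov(X, X_star)                      # cross covariances
k_ss = eq_cov(X_star, X_star)

K_prime = K - (k_star @ k_star.T) / k_ss            # updated covariance
mu_prime = mu0 + (k_star / k_ss) * (v_star - mu0)   # updated mean
print(mu_prime.flatten())   # means rise towards v_star near 2020
```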
<p>We will consider the same data set as above. We first create a Gaussian process model for the update.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="kw">class</span> GPL(mlai.GP):
<span class="kw">def</span> <span class="fu">__init__</span>(<span class="va">self</span>, X, losses, kernel, beta<span class="op">=</span><span class="fl">1.0</span>, mu<span class="op">=</span><span class="fl">0.0</span>, X_star<span class="op">=</span><span class="va">None</span>, v_star<span class="op">=</span><span class="va">None</span>):
<span class="co"># Bring together locations</span>
<span class="va">self</span>.kernel <span class="op">=</span> kernel
<span class="va">self</span>.K <span class="op">=</span> <span class="va">self</span>.kernel.K(X)
<span class="va">self</span>.mu <span class="op">=</span> np.ones((X.shape[<span class="dv">0</span>],<span class="dv">1</span>))<span class="op">*</span>mu
<span class="va">self</span>.beta <span class="op">=</span> beta
<span class="cf">if</span> X_star <span class="kw">is</span> <span class="kw">not</span> <span class="va">None</span>:
kstar <span class="op">=</span> kernel.K(X, X_star)
kstarstar <span class="op">=</span> kernel.K(X_star, X_star)
kstarstarInv <span class="op">=</span> np.linalg.inv(kstarstar)
kskssInv <span class="op">=</span> np.dot(kstar, kstarstarInv)
<span class="va">self</span>.K <span class="op">-=</span> np.dot(kskssInv,kstar.T)
<span class="cf">if</span> v_star <span class="kw">is</span> <span class="kw">not</span> <span class="va">None</span>:
<span class="va">self</span>.mu <span class="op">=</span> kskssInv<span class="op">*</span>(v_star<span class="op">-</span><span class="va">self</span>.mu)<span class="op">+</span><span class="va">self</span>.mu
Xaug <span class="op">=</span> np.vstack((X, X_star))
<span class="cf">else</span>:
<span class="cf">raise</span> <span class="pp">ValueError</span>(<span class="st">"v_star must be provided when X_star is given"</span>)</code></pre></div>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="kw">class</span> BLMLGP(BLML):
<span class="kw">def</span> <span class="fu">__init__</span>(<span class="va">self</span>, X, y, basis<span class="op">=</span><span class="va">None</span>, kernel<span class="op">=</span><span class="va">None</span>, beta<span class="op">=</span><span class="fl">1.0</span>, mu<span class="op">=</span><span class="fl">0.0</span>, alpha<span class="op">=</span><span class="fl">1.0</span>, X_star<span class="op">=</span><span class="va">None</span>, v_star<span class="op">=</span><span class="va">None</span>):
BLML.<span class="fu">__init__</span>(<span class="va">self</span>, X, y, basis<span class="op">=</span>basis, alpha<span class="op">=</span>alpha, beta<span class="op">=</span>beta, lambd<span class="op">=</span><span class="va">None</span>)
<span class="va">self</span>.gp_model<span class="op">=</span>GPL(<span class="va">self</span>.X, <span class="va">self</span>.losses, kernel<span class="op">=</span>kernel, beta<span class="op">=</span>beta, mu<span class="op">=</span>mu, X_star<span class="op">=</span>X_star, v_star<span class="op">=</span>v_star)
<span class="kw">def</span> update_s(<span class="va">self</span>):
<span class="co">"""Update the expected scales for each data point."""</span>
<span class="va">self</span>.gp_model.C <span class="op">=</span> sp.linalg.inv(sp.linalg.inv(<span class="va">self</span>.gp_model.K<span class="op">+</span>np.eye(<span class="va">self</span>.X.shape[<span class="dv">0</span>])<span class="op">*</span><span class="fl">1e-6</span>) <span class="op">+</span> <span class="va">self</span>.beta<span class="op">*</span>np.diag(<span class="va">self</span>.losses.flatten()))
<span class="va">self</span>.gp_model.diagC <span class="op">=</span> np.diag(<span class="va">self</span>.gp_model.C)[:, np.newaxis]
<span class="va">self</span>.gp_model.f <span class="op">=</span> <span class="va">self</span>.gp_model.beta<span class="op">*</span>np.dot(np.dot(<span class="va">self</span>.gp_model.C,np.diag(<span class="va">self</span>.losses.flatten())),<span class="va">self</span>.gp_model.mu) <span class="op">+</span><span class="va">self</span>.gp_model.mu
<span class="co">#f, v = self.gp_model.K self.gp_model.predict(self.X)</span>
<span class="va">self</span>.s <span class="op">=</span> <span class="va">self</span>.gp_model.f<span class="op">*</span><span class="va">self</span>.gp_model.f <span class="op">+</span> <span class="va">self</span>.gp_model.diagC <span class="co"># + 1.0/(self.losses*self.gp_model.beta)</span></code></pre></div>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">model <span class="op">=</span> BLMLGP(x, y,
basis<span class="op">=</span>basis,
kernel<span class="op">=</span>mlai.kernel(mlai.eq_cov, lengthscale<span class="op">=</span><span class="dv">20</span>, variance<span class="op">=</span><span class="fl">1.0</span>),
mu<span class="op">=</span><span class="fl">0.0</span>,
beta<span class="op">=</span><span class="fl">1.0</span>,
alpha<span class="op">=</span><span class="dv">1000</span>,
X_star<span class="op">=</span>np.asarray([[<span class="dv">2020</span>]]),
v_star<span class="op">=</span>np.asarray([[<span class="dv">1</span>]]))</code></pre></div>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">model.fit()</code></pre></div>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">f_test, f_var <span class="op">=</span> model.predict(x_test)</code></pre></div>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">fig, ax <span class="op">=</span> plt.subplots(figsize<span class="op">=</span>plot.big_wide_figsize)
ax.cla()
<span class="im">from</span> matplotlib <span class="im">import</span> rc, rcParams
rcParams.update({<span class="st">'font.size'</span>: <span class="dv">22</span>})
rc(<span class="st">'text'</span>, usetex<span class="op">=</span><span class="va">True</span>)
gp_tutorial.gpplot(x_test, f2_test, f2_test <span class="op">-</span> <span class="dv">2</span><span class="op">*</span>np.sqrt(f2_var), f2_test <span class="op">+</span> <span class="dv">2</span><span class="op">*</span>np.sqrt(f2_var), ax<span class="op">=</span>ax, edgecol<span class="op">=</span><span class="st">'r'</span>, fillcol<span class="op">=</span><span class="st">'#CC3300'</span>)
ax.plot(x, y, <span class="st">'g.'</span>, markersize<span class="op">=</span><span class="dv">10</span>)
ax.set_xlim(data_limits[<span class="dv">0</span>], data_limits[<span class="dv">1</span>])
ax.set_xlabel(<span class="st">'year'</span>)
ax.set_ylabel(<span class="st">'pace min/km'</span>)
_ <span class="op">=</span> ax.set_ylim(<span class="dv">2</span>, <span class="dv">6</span>)
mlai.write_figure(<span class="st">'../slides/diagrams/ml/olympic-gp-loss-bayes-linear-regression000.svg'</span>, transparent<span class="op">=</span><span class="va">True</span>)
gp_tutorial.gpplot(x_test, f_test, f_test <span class="op">-</span> <span class="dv">2</span><span class="op">*</span>np.sqrt(f_var), f_test <span class="op">+</span> <span class="dv">2</span><span class="op">*</span>np.sqrt(f_var), ax<span class="op">=</span>ax, edgecol<span class="op">=</span><span class="st">'b'</span>, fillcol<span class="op">=</span><span class="st">'#0033CC'</span>)
<span class="co">#ax.plot(x_test, f_test, linewidth=3, color='b')</span>
ax.plot(x, y, <span class="st">'g.'</span>, markersize<span class="op">=</span><span class="dv">10</span>)
ax2 <span class="op">=</span> ax.twinx()
ax2.bar(x.flatten(), model.s.flatten(), width<span class="op">=</span><span class="dv">2</span>, color<span class="op">=</span><span class="st">'b'</span>)
ax2.set_ylim(<span class="dv">0</span>, <span class="dv">3</span>)
ax2.set_yticks([<span class="dv">0</span>, <span class="fl">0.5</span>, <span class="dv">1</span>])
ax2.set_ylabel(<span class="st">'$\langle s_i </span><span class="ch">\\</span><span class="st">rangle$'</span>)
mlai.write_figure(<span class="st">'../slides/diagrams/ml/olympic-gp-loss-bayes-linear-regression001.svg'</span>, transparent<span class="op">=</span><span class="va">True</span>)</code></pre></div>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">import</span> pods</code></pre></div>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">pods.notebook.display_plots(<span class="st">'olympic-gp-loss-bayes-linear-regression</span><span class="sc">{number:0>3}</span><span class="st">.svg'</span>,
directory<span class="op">=</span><span class="st">'../slides/diagrams/ml'</span>, number<span class="op">=</span>(<span class="dv">0</span>, <span class="dv">1</span>))</code></pre></div>
<object class="svgplot" align data="../slides/diagrams/ml/olympic-gp-loss-bayes-linear-regression001.svg">
</object>
<center>
<em>Probabilistic linear regression for the standard quadratic loss in red and the probabilistically weighted loss with a Gaussian process measure in blue.</em>
</center>
<p>Finally, we make an attempt to show the joint uncertainty by first of all sampling from the loss function weights density, <span class="math inline">\(q(\scaleScalar)\)</span>.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">fig, ax <span class="op">=</span> plt.subplots(figsize<span class="op">=</span>plot.big_wide_figsize)
num_samps<span class="op">=</span><span class="dv">10</span>
samps<span class="op">=</span>np.random.multivariate_normal(model.gp_model.f.flatten(), model.gp_model.C, size<span class="op">=</span>num_samps).T<span class="op">**</span><span class="dv">2</span>
ax.plot(x, samps, <span class="st">'-x'</span>, markersize<span class="op">=</span><span class="dv">10</span>, linewidth<span class="op">=</span><span class="dv">2</span>)
ax.set_xlim(data_limits[<span class="dv">0</span>], data_limits[<span class="dv">1</span>])
ax.set_xlabel(<span class="st">'year'</span>)
_ <span class="op">=</span> ax.set_ylabel(<span class="st">'$s_i$'</span>)
mlai.write_figure(<span class="st">'../slides/diagrams/ml/olympic-gp-loss-samples.svg'</span>, transparent<span class="op">=</span><span class="va">True</span>)</code></pre></div>
<object class="svgplot" align data="../slides/diagrams/ml/olympic-gp-loss-samples.svg">
</object>
<center>
<em>Samples of loss weightings from the density <span class="math inline">\(q(\scaleScalar)\)</span>. </em>
</center>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">fig, ax <span class="op">=</span> plt.subplots(figsize<span class="op">=</span>plot.big_wide_figsize)
ax.plot(x, y, <span class="st">'r.'</span>, markersize<span class="op">=</span><span class="dv">10</span>)
ax.set_xlim(data_limits[<span class="dv">0</span>], data_limits[<span class="dv">1</span>])
ax.set_ylim(<span class="dv">2</span>, <span class="dv">6</span>)
ax.set_xlabel(<span class="st">'year'</span>)
ax.set_ylabel(<span class="st">'pace min/km'</span>)
gp_tutorial.gpplot(x_test, f_test, f_test <span class="op">-</span> <span class="dv">2</span><span class="op">*</span>np.sqrt(f_var), f_test <span class="op">+</span> <span class="dv">2</span><span class="op">*</span>np.sqrt(f_var), ax<span class="op">=</span>ax, edgecol<span class="op">=</span><span class="st">'b'</span>, fillcol<span class="op">=</span><span class="st">'#0033CC'</span>)
mlai.write_figure(<span class="st">'../slides/diagrams/ml/olympic-gp-loss-bayes-linear-regression-and-samples000.svg'</span>, transparent<span class="op">=</span><span class="va">True</span>)
allsamps <span class="op">=</span> []
<span class="cf">for</span> i <span class="kw">in</span> <span class="bu">range</span>(samps.shape[<span class="dv">1</span>]):
model.s <span class="op">=</span> samps[:, i:i<span class="op">+</span><span class="dv">1</span>]
model.update_w()
f_bar, f_cov <span class="op">=</span>model.predict(x_test, full_cov<span class="op">=</span><span class="va">True</span>)
f_samp <span class="op">=</span> np.random.multivariate_normal(f_bar.flatten(), f_cov, size<span class="op">=</span><span class="dv">10</span>).T
ax.plot(x_test, f_samp, linewidth<span class="op">=</span><span class="fl">0.5</span>, color<span class="op">=</span><span class="st">'k'</span>)
allsamps<span class="op">+=</span><span class="bu">list</span>(f_samp[<span class="op">-</span><span class="dv">1</span>, :])
mlai.write_figure(<span class="st">'../slides/diagrams/ml/olympic-gp-loss-bayes-linear-regression-and-samples001.svg'</span>, transparent<span class="op">=</span><span class="va">True</span>)</code></pre></div>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">import</span> pods
pods.notebook.display_plots(<span class="st">'olympic-gp-loss-bayes-linear-regression-and-samples</span><span class="sc">{number:0>3}</span><span class="st">.svg'</span>,
directory<span class="op">=</span><span class="st">'../slides/diagrams/ml'</span>, number<span class="op">=</span>(<span class="dv">0</span>, <span class="dv">1</span>))</code></pre></div>
<object class="svgplot" align data="../slides/diagrams/ml/olympic-gp-loss-bayes-linear-regression-and-samples001.svg">
</object>
<center>
<em>Samples from the joint density of loss weightings and regression weights show the full distribution of function predictions. </em>
</center>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">fig, ax <span class="op">=</span> plt.subplots(figsize<span class="op">=</span>plot.big_figsize)
ax.hist(np.asarray(allsamps), bins<span class="op">=</span><span class="dv">30</span>, density<span class="op">=</span><span class="va">True</span>)
ax.set_xlabel(<span class="st">'pace min/km'</span>)
mlai.write_figure(<span class="st">'../slides/diagrams/ml/olympic-gp-loss-histogram-2020.svg'</span>, transparent<span class="op">=</span><span class="va">True</span>)</code></pre></div>
<object class="svgplot" align data="../slides/diagrams/ml/olympic-gp-loss-histogram-2020.svg">
</object>
<center>
<em>Histogram of samples from the year 2020, where the weight of the loss function was pinned to ensure that the model focussed its predictions on this region for test data. </em>
</center>
<h3 id="conclusions">Conclusions</h3>
<ul>
<li>Maximum Entropy Framework for uncertainty in
<ul>
<li>Loss functions</li>
<li>Prediction functions</li>
</ul></li>
</ul>
<h3 id="thanks">Thanks!</h3>
<ul>
<li>twitter: @lawrennd</li>
<li>blog: <a href="http://inverseprobability.com/blog.html">http://inverseprobability.com</a></li>
</ul>
Tue, 29 May 2018 00:00:00 +0000
http://inverseprobability.com/talks/notes/uncertainty-in-loss-functions.html
Outlook for AI and Machine Learning<script type="math/tex; mode=display">\newcommand{\numData}{n}
\newcommand{\errorFunction}{E}
\newcommand{\mappingFunction}{f}
\newcommand{\sigmoid}[1]{\sigma\left(#1\right)}
\newcommand{\inputScalar}{x}
\newcommand{\inputVector}{\mathbf{x}}
\newcommand{\inputMatrix}{\mathbf{X}}
\newcommand{\dataScalar}{y}
\newcommand{\dataVector}{\mathbf{y}}
\newcommand{\dataMatrix}{\mathbf{Y}}</script>
<p>The aim of this presentation is to give a sense of the current situation in machine learning and artificial intelligence, as well as some perspective on the immediate outlook for the field.</p>
<h2 id="the-gartner-hype-cycle">The Gartner Hype Cycle</h2>
<p><img class="negate" src="http://inverseprobability.com/talks/slides/diagrams/Gartner_Hype_Cycle.png" width="70%" align="center" style="background:none; border:none; box-shadow:none;" /></p>
<p>The <a href="https://en.wikipedia.org/wiki/Hype_cycle">Gartner Hype Cycle</a> tries to assess where an idea is in terms of maturity and adoption. It splits the evolution of technology into a technological trigger, a peak of expectations followed by a trough of disillusionment and a final ascension into a useful technology. It looks rather like a classical control response to a final set point.</p>
<object class="svgplot" align="" data="http://inverseprobability.com/talks/slides/diagrams/data-science/bd-ds-iot-ml-google-trends003.svg"></object>
<center><i>Google Trends data for different search terms in an attempt to assess their position on the "hype cycle"</i></center>
<p>Google trends gives us insight into how far along various technological terms are on the hype cycle.</p>
<h2 id="what-is-machine-learning">What is Machine Learning?</h2>
<p>What is machine learning? At its most basic level machine learning is a combination of</p>
<script type="math/tex; mode=display">\text{data} + \text{model} \xrightarrow{\text{compute}} \text{prediction}</script>
<p>where <em>data</em> is our observations. They can be actively or passively
acquired (meta-data). The <em>model</em> contains our assumptions, based on
previous experience. That experience can be other data, it can come
from transfer learning, or it can merely be our beliefs about the
regularities of the universe. In humans our models include our
inductive biases. The <em>prediction</em> is an action to be taken or a
categorization or a quality score. The reason that machine learning
has become a mainstay of artificial intelligence is the importance of
predictions in artificial intelligence. The data and the model are combined through computation.</p>
<p>In practice we normally perform machine learning using two functions. To combine data with a model we typically make use of:</p>
<p><strong>a prediction function</strong> a function which is used to make the predictions. It includes our beliefs about the regularities of the universe, our assumptions about how the world works, e.g. smoothness, spatial similarities, temporal similarities.</p>
<p><strong>an objective function</strong> a function which defines the cost of misprediction. Typically it includes knowledge about the world’s generating processes (probabilistic objectives) or the costs we pay for mispredictions (empiricial risk minimization).</p>
<p>The combination of data and model through the prediction function and the objective function leads to a <em>learning algorithm</em>. The class of prediction functions and objective functions we can make use of is restricted by the algorithms they lead to. If the prediction function or the objective function are too complex, then it can be difficult to find an appropriate learning algorithm. Much of the academic field of machine learning is the quest for new learning algorithms that allow us to bring different types of models and data together.</p>
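<p>As a minimal sketch (illustrative code, not from the talk), the two functions and the learning algorithm they lead to can be written in a few lines: a linear prediction function, a sum-of-squared-errors objective, and gradient descent on that objective. The data is synthetic and invented purely for illustration.</p>

```python
import numpy as np

# Synthetic data, invented for illustration.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=20)
y = 0.5 + 2.0 * x + 0.1 * rng.standard_normal(20)

def prediction_function(x, beta):
    # Our assumptions about the world: here, a linear relationship.
    return beta[0] + beta[1] * x

def objective_function(beta, x, y):
    # The cost of misprediction: squared error.
    return np.sum((y - prediction_function(x, beta)) ** 2)

# The learning algorithm the two functions lead to: gradient descent.
beta = np.zeros(2)
for _ in range(500):
    residual = y - prediction_function(x, beta)
    grad = -2.0 * np.array([residual.sum(), (residual * x).sum()])
    beta -= 0.01 * grad
```

<p>After fitting, <code>beta</code> is close to the generating values (0.5, 2.0); swapping in a different prediction function or objective function changes the learning algorithm that results.</p>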
<p>A useful reference for state of the art in machine learning is the UK Royal Society Report, <a href="https://royalsociety.org/~/media/policy/projects/machine-learning/publications/machine-learning-report.pdf">Machine Learning: Power and Promise of Computers that Learn by Example</a>.</p>
<p>You can also check my blog post on <a href="http://inverseprobability.com/2017/07/17/what-is-machine-learning">“What is Machine Learning?”</a></p>
<h2 id="natural-and-artificial-intelligence-embodiment-factors">Natural and Artificial Intelligence: Embodiment Factors</h2>
<table>
<tr>
<td></td>
<td align="center"><img class="" src="http://inverseprobability.com/talks/slides/diagrams/IBM_Blue_Gene_P_supercomputer.jpg" width="40%" align="center" style="background:none; border:none; box-shadow:none;" /></td>
<td align="center"><img class="" src="http://inverseprobability.com/talks/slides/diagrams/ClaudeShannon_MFO3807.jpg" width="25%" align="center" style="background:none; border:none; box-shadow:none;" /></td>
</tr>
<tr>
<td>compute</td>
<td align="center">$$\approx 10 \text{ gigaflops}$$</td><td align="center">$$\approx 14 \text{ teraflops}$$</td>
</tr>
<tr>
<td>communicate</td>
<td align="center">$$\approx 1 \text{ gigabit/s}$$</td>
<td align="center">$$\approx 100 \text{ bit/s}$$</td>
</tr>
<tr>
<td>(compute/communicate)</td>
<td align="center">$$10$$</td>
<td align="center">$$\approx 10^{13}$$</td>
</tr>
</table>
<p>There is a fundamental limit placed on our intelligence based on our ability to communicate. Claude Shannon founded the field of information theory. The clever part of this theory is it allows us to separate our measurement of information from what the information pertains to<sup id="fnref:knowledge-representation"><a href="#fn:knowledge-representation" class="footnote">1</a></sup>.</p>
<p>Shannon measured information in bits. One bit of information is the amount of information I pass to you when I give you the result of a coin toss. Shannon was also interested in the amount of information in the English language. He estimated that on average a word in the English language contains 12 bits of information.</p>
<p>Given typical speaking rates, that gives us an estimate of our ability to communicate of around 100 bits per second. Computers on the other hand can communicate much more rapidly. Current wired network speeds are around a billion bits per second, ten million times faster.</p>
<p>When it comes to compute though, our best estimates indicate our computers are slower. A typical modern computer can perform around 2 billion floating point operations per second, where each floating point operation involves a 64 bit number. So the computer is processing around 120 billion bits per second.</p>
<p>It’s difficult to get similar estimates for humans, but by some estimates the amount of compute we would require to <em>simulate</em> a human brain is equivalent to that in the UK’s fastest computer, the Met Office machine in Exeter, which in 2018 ranks as the 11th fastest computer in the world. That machine simulates the world’s weather each morning, and then simulates the world’s climate. It is a 16 petaflop machine.</p>
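<p>The arithmetic behind these embodiment estimates can be laid out explicitly. The numbers below are the rough orders of magnitude used in the text and the table above, following the table’s loose convention of dividing operations per second by bits per second; they are illustrative rather than measurements.</p>

```python
# Rough orders of magnitude from the text; illustrative only.
computer_compute = 10e9      # ~10 gigaflops for a typical machine
computer_communicate = 1e9   # ~1 gigabit/s wired network connection

human_communicate = 100      # ~100 bits/s of speech
human_compute = 16e15        # ~16 petaflops to *simulate* a brain (Met Office machine)

# "Embodiment factor": how much compute sits behind each unit of communication.
computer_embodiment = computer_compute / computer_communicate  # 10
human_embodiment = human_compute / human_communicate           # ~1.6e14
```

<p>The computer’s ratio is around 10, while the human ratio is thirteen or fourteen orders of magnitude larger, which is the disparity the analogy that follows tries to capture.</p>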
<p>So when it comes to our ability to compute we are extraordinary, not compute in our conscious mind, but the underlying neuron firings that underpin both our consciousness and our subconsciousness, as well as our motor control etc. By analogy I sometimes like to think of us as a Formula One engine. But in terms of our ability to deploy that computation in actual use, to share the results of what we have inferred, we are very limited. So when you imagine the F1 car that represents a psyche, think of an F1 car with bicycle wheels.</p>
<p>In contrast, our computers have less computational power, but they can communicate far more fluidly. They are more like a go-kart, less well powered, but with tires that allow them to deploy that power.</p>
<p>For humans, that means much of our computation should be dedicated to considering <em>what</em> we should compute. To do that efficiently we need to model the world around us. The most complex thing in the world around us is other humans. So it is no surprise that we model them. We second guess what their intentions are, and our communication is only necessary when they are departing from how we model them. Naturally, for this to work well, we need to understand those we work closely with. So it is no surprise that social communication, social bonding, forms so much of a part of our use of our limited bandwidth.</p>
<p>There is a second effect here, our need to anthropomorphise objects around us. Our tendency to model our fellow humans extends to when we interact with other entities in our environment. To our pets as well as inanimate objects around us, such as computers or even our cars. This tendency to overinterpret could be a consequence of our limited ability to communicate.</p>
<p>For more details see this paper <a href="https://arxiv.org/abs/1705.07996">“Living Together: Mind and Machine Intelligence”</a>, and this <a href="http://inverseprobability.com/talks/lawrence-tedx17/living-together.html">TEDx talk</a>.</p>
<h2 id="evolved-relationship-with-information">Evolved Relationship with Information</h2>
<object class="svgplot" align="" data="http://inverseprobability.com/talks/slides/diagrams/data-science/information-flow003.svg"></object>
<p>The high bandwidth of computers has resulted in a close relationship between the computer and data. Large amounts of information can flow between the two. The degree to which the computer is mediating our relationship with data means that we should consider it an intermediary.</p>
<p>Originally, our low bandwidth relationship with data was shaped by two characteristics. Firstly, by our tendency to over-interpret, driven by our need to extract as much knowledge from our low bandwidth information channel as possible. Secondly, by our improved understanding of the domain of <em>mathematical</em> statistics and how our cognitive biases can mislead us.</p>
<p>With this new set up there is a potential for assimilating far more information via the computer, but the computer can present this to us in various ways. If its motives are not aligned with ours then it can misrepresent the information. This needn’t be nefarious; it can simply be a result of the computer pursuing a different objective from us. For example, if the computer is aiming to maximize our interaction time, that may be a different objective from ours, which may be to summarize information in a representative manner in the <em>shortest</em> possible length of time.</p>
<p>For example, for me it was a common experience to pick up my telephone with the intention of checking when my next appointment was, but to soon find myself distracted by another application on the phone, and end up reading something on the internet. By the time I’d finished reading, I would often have forgotten the reason I picked up my phone in the first place.</p>
<p>We can benefit enormously from the very large amount of information we can now obtain through this evolved relationship between us and data. Biology has already benefited from large scale data sharing and the improved inferences that can be drawn through summarizing data by computer. That has underpinned the revolution in computational biology. But in our daily lives this phenomenon is affecting us in ways which we haven’t envisaged.</p>
<p>Better mediation of this flow actually requires a better understanding of human-computer interaction. This in turn involves understanding our own intelligence better, what its cognitive biases are and how these might mislead us.</p>
<p>For further thoughts see <a href="https://www.theguardian.com/media-network/2015/jul/23/data-driven-economy-marketing">this Guardian article</a> on marketing in the internet era and <a href="http://inverseprobability.com/2015/12/04/what-kind-of-ai">this blog post</a> on System Zero.</p>
<h3 id="societal-effects">Societal Effects</h3>
<p>We have already seen the effects of this changed dynamic in biology and computational biology. Improved sensorics have led to the new domains of transcriptomics, epigenomics, and ‘rich phenomics’ as well as considerably augmenting our capabilities in genomics.</p>
<p>Biologists have had to become data-savvy, they require a rich understanding of the available data resources and need to assimilate existing data sets in their hypothesis generation as well as their experimental design. Modern biology has become a far more quantitative science, but the quantitativeness has required new methods developed in the domains of <em>computational biology</em> and <em>bioinformatics</em>.</p>
<p>There is also great promise for personalized health, but in health the wide data-sharing that has underpinned success in the computational biology community is much harder to carry out.</p>
<p>We can expect to see these phenomena reflected in wider society. Particularly as we make use of more automated decision making based only on data.</p>
<p>The main phenomenon we see across the board is the shift in dynamic from the direct pathway between human and data, as traditionally mediated by classical statistics, to a new flow of information via the computer. This change of dynamics gives us the modern and emerging domain of <em>data science</em>.</p>
<h2 id="human-communication">Human Communication</h2>
<p>For human conversation to work, we require an internal model of who we are speaking to. We model each other, and combine our sense of who they are, who they think we are, and what has been said. This is our approach to dealing with the limited bandwidth connection we have. Empathy and understanding of intent. Mental dispositional concepts are used to augment our limited communication bandwidth.</p>
<p>Fritz Heider referred to the important point of a conversation as being that they are happenings that are “<em>psychologically represented</em> in each of the participants” (his emphasis) [@Heider:interpersonal58]</p>
<h3 id="machine-learning-and-narratives">Machine Learning and Narratives</h3>
<p><img class="" src="http://inverseprobability.com/talks/slides/diagrams/Classic_baby_shoes.jpg" width="60%" align="" style="background:none; border:none; box-shadow:none;" /></p>
<center><i>For sale: baby shoes, never worn.</i></center>
<p>Consider the six word novel, apocryphally credited to Ernest Hemingway, “For sale: baby shoes, never worn”. To understand what that means to a human, you need a great deal of additional context. Context that is not directly accessible to a machine that lacks both the evolved and contextual understanding of our own condition to realize both the implication of the advert and what that implication means emotionally to the previous owner.</p>
<p><a href="https://www.youtube.com/watch?v=8FIEZXMUM2I&t=7"><img src="https://img.youtube.com/vi/8FIEZXMUM2I/0.jpg" alt="" /></a></p>
<p><a href="https://en.wikipedia.org/wiki/Fritz_Heider">Fritz Heider</a> and <a href="https://en.wikipedia.org/wiki/Marianne_Simmel">Marianne Simmel</a>’s experiments with animated shapes from 1944 [@Heider:experimental44]. Our interpretation of these objects as showing motives and even emotion is a combination of our desire for narrative, a need for understanding of each other, and our ability to empathise. At one level, these are crudely drawn objects, but in another key way, the animator has communicated a story through simple facets such as their relative motions, their sizes and their actions. We apply our psychological representations to these faceless shapes in an effort to interpret their actions.</p>
<blockquote>
<p>There are three types of lies: lies, damned lies and statistics</p>
<p>Benjamin Disraeli 1804-1881</p>
</blockquote>
<p>The quote “lies, damned lies and statistics” was credited to Benjamin Disraeli by Mark Twain in his autobiography. It characterizes the idea that statistics can be made to prove anything. But Disraeli died in 1881 and Mark Twain died in 1910, so the attribution is at best unverifiable. The important breakthrough in overcoming our tendency to overinterpret data came with the formalization of the field through the development of <em>mathematical statistics</em>.</p>
<h3 id="mathematical-statistics"><em>Mathematical</em> Statistics</h3>
<p><img class="" src="http://inverseprobability.com/talks/slides/diagrams/Portrait_of_Karl_Pearson.jpg" width="30%" align="" style="background:none; border:none; box-shadow:none;" /></p>
<p><a href="https://en.wikipedia.org/wiki/Karl_Pearson">Karl Pearson</a> (1857-1936), <a href="https://en.wikipedia.org/wiki/Ronald_Fisher">Ronald Fisher</a> (1890-1962) and others considered the question of what conclusions can truly be drawn from data. Their mathematical studies act as a restraint on our tendency to over-interpret and see patterns where there are none. They introduced concepts such as randomized control trials that form a mainstay of our decision making today, from government, to clinicians, to the large scale A/B testing that determines the nature of the web interfaces we interact with on social media and shopping.</p>
<p>Today the statement “There are three types of lies: lies, damned lies and ‘big data’” may be more apt. We are revisiting many of the mistakes made in interpreting data from the 19th century. Big data is laid down by happenstance, rather than actively collected with a particular question in mind. That means it needs to be treated with care when conclusions are being drawn. For data science to succeed it needs the same form of rigour that Pearson and Fisher brought to statistics: a “mathematical data science” is needed.</p>
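<p>That rigour can be made concrete. The sketch below (with invented click-through counts, purely illustrative) runs a classical two-proportion z-test of the kind that underpins large scale A/B testing, rather than simply declaring the variant with more clicks the winner.</p>

```python
from math import erf, sqrt

# Invented click-through counts for two variants of a web page.
clicks_a, n_a = 120, 1000  # variant A: 12.0% click-through
clicks_b, n_b = 150, 1000  # variant B: 15.0% click-through

p_a, p_b = clicks_a / n_a, clicks_b / n_b
# Pooled proportion under the null hypothesis of no difference.
p_pool = (clicks_a + clicks_b) / (n_a + n_b)
se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (p_b - p_a) / se

# Two-sided p-value from the standard normal CDF.
p_value = 2.0 * (1.0 - 0.5 * (1.0 + erf(abs(z) / sqrt(2.0))))
```

<p>With these counts the p-value lands right at the conventional 5% threshold; an eyeballed comparison of the raw percentages offers no such restraint against over-interpretation.</p>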
<h3 id="artificial-intelligence-and-data-science">Artificial Intelligence and Data Science</h3>
<p>Machine learning technologies have been the driver of two related, but distinct disciplines. The first is <em>data science</em>. Data science is an emerging field that arises from the fact that we now collect so much data by happenstance, rather than by <em>experimental design</em>. Classical statistics is the science of drawing conclusions from data, and to do so statistical experiments are carefully designed. In the modern era we collect so much data that there’s a desire to draw inferences directly from the data.</p>
<p>As well as machine learning, the field of data science draws from statistics, cloud computing, data storage (e.g. streaming data), visualization and data mining.</p>
<p>In contrast, artificial intelligence technologies typically focus on emulating some form of human behaviour, such as understanding an image, or some speech, or translating text from one form to another. The recent advances in artificial intelligence have come from machine learning providing the automation. But in contrast to data science, in artificial intelligence the data is normally collected with the specific task in mind. In this sense it has strong relations to classical statistics.</p>
<p>Classically artificial intelligence worried more about <em>logic</em> and <em>planning</em> and focussed less on data driven decision making. Modern machine learning owes more to the field of <em>Cybernetics</em> [@Wiener:cybernetics48] than artificial intelligence. Related fields include <em>robotics</em>, <em>speech recognition</em>, <em>language understanding</em> and <em>computer vision</em>.</p>
<p>There are strong overlaps between the fields; the wide availability of data by happenstance makes it easier to collect data for designing AI systems. These relations are coming through wide availability of sensing technologies that are interconnected by cellular networks, WiFi and the internet. This phenomenon is sometimes known as the <em>Internet of Things</em>, but this feels like a dangerous misnomer. We must never forget that we are interconnecting people, not things.</p>
<h3 id="what-does-machine-learning-do">What does Machine Learning do?</h3>
<p>Any process of automation allows us to scale what we do by codifying a process in some way that makes it efficient and repeatable. Machine learning automates by emulating human (or other) actions found in data. Machine learning codifies these actions in the form of a mathematical function that is learnt by a computer. If we can create these mathematical functions in ways in which they can interconnect, then we can also build systems.</p>
<p>Machine learning works through codifying a prediction of interest into a mathematical function. For example, we can try and predict the probability that a customer wants to buy a jumper given knowledge of their age and the latitude where they live. The technique known as logistic regression estimates the odds that someone will buy a jumper as a linear weighted sum of the features of interest.</p>
<p><script type="math/tex">\text{odds} = \frac{p(\text{bought})}{p(\text{not bought})}</script>
<script type="math/tex">\log \text{odds} = \beta_0 + \beta_1 \text{age} + \beta_2 \text{latitude}</script></p>
<p>Here $\beta_0$, $\beta_1$ and $\beta_2$ are the parameters of the model. If $\beta_1$ and $\beta_2$ are both positive, then the log-odds that someone will buy a jumper increase with increasing latitude and age, so the further north you are and the older you are the more likely you are to buy a jumper. The parameter $\beta_0$ is an offset parameter, and gives the log-odds of buying a jumper at zero age and on the equator. It is likely to be negative, indicating that the purchase is odds-against. This is actually a classical statistical model, and models like logistic regression are widely used to estimate probabilities from ad-click prediction to risk of disease.</p>
<p>This is called a generalized linear model; we can also think of it as estimating the <em>probability</em> of a purchase as a nonlinear function of the features (age, latitude) and the parameters (the $\beta$ values). The function is known as the <em>sigmoid</em> or <a href="https://en.wikipedia.org/wiki/Logistic_regression">logistic function</a>, thus the name <em>logistic</em> regression.</p>
<script type="math/tex; mode=display">p(\text{bought}) = \sigmoid{\beta_0 + \beta_1 \text{age} + \beta_2 \text{latitude}}</script>
<p>In the case where we have <em>features</em> to help us predict, we sometimes denote such features as a vector, $\inputVector$, and we then use an inner product between the features and the parameters, $\boldsymbol{\beta}^\top \inputVector = \beta_1 \inputScalar_1 + \beta_2 \inputScalar_2 + \beta_3 \inputScalar_3 …$, to represent the argument of the sigmoid.</p>
<script type="math/tex; mode=display">p(\text{bought}) = \sigmoid{\boldsymbol{\beta}^\top \inputVector}</script>
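<p>As a concrete sketch of the model above, with parameter values invented purely for illustration:</p>

```python
import numpy as np

def sigmoid(z):
    # The logistic function: maps log-odds onto a probability in (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative parameters, not fitted to any real data.
beta_0, beta_1, beta_2 = -4.0, 0.02, 0.05

def p_bought(age, latitude):
    log_odds = beta_0 + beta_1 * age + beta_2 * latitude
    return sigmoid(log_odds)

young_equatorial = p_bought(age=20, latitude=0.0)   # low probability
older_northern = p_bought(age=60, latitude=55.0)    # higher probability
```

<p>With positive β<sub>1</sub> and β<sub>2</sub> the probability of a purchase rises with age and latitude, while the negative offset β<sub>0</sub> keeps the baseline purchase odds-against.</p>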
<p>More generally, we aim to predict some aspect of our data, $\dataScalar$, by relating it through a mathematical function, $\mappingFunction(\cdot)$, to the parameters, $\boldsymbol{\beta}$ and the data, $\inputVector$.</p>
<script type="math/tex; mode=display">\dataScalar = \mappingFunction\left(\inputVector, \boldsymbol{\beta}\right)</script>
<p>We call $\mappingFunction(\cdot)$ the <em>prediction function</em></p>
<p>To obtain the fit to data, we use a separate function called the <em>objective function</em> that gives us a mathematical representation of the difference between our predictions and the real data.</p>
<p><script type="math/tex">\errorFunction(\boldsymbol{\beta}, \dataMatrix, \inputMatrix)</script>
A commonly used example (for instance in a regression problem) is least squares,
<script type="math/tex">\errorFunction(\boldsymbol{\beta}, \dataMatrix, \inputMatrix) = \sum_{i=1}^\numData \left(\dataScalar_i - \mappingFunction(\inputVector_i, \boldsymbol{\beta})\right)^2.</script></p>
<p>If a linear prediction function is combined with the least squares objective function then that gives us classical <em>linear regression</em>, another classical statistical model. Statistics often focusses on linear models because it makes interpretation of the model easier. Interpretation is key in statistics because the aim is normally to validate questions by analysis of data. Machine learning has typically focussed more on the prediction function itself and worried less about the interpretation of parameters, which are normally denoted by $\mathbf{w}$ instead of $\boldsymbol{\beta}$. As a result <em>non-linear</em> functions are explored more often as they tend to improve quality of predictions but at the expense of interpretability.</p>
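<p>The combination can be sketched directly (synthetic data, invented for illustration): a linear prediction function with the least squares objective has a closed-form solution, the normal equations, which <code>numpy</code> solves for us.</p>

```python
import numpy as np

# Synthetic data: 30 points from a known linear model plus noise.
rng = np.random.default_rng(42)
X = np.hstack([np.ones((30, 1)), rng.uniform(0, 10, size=(30, 1))])
true_w = np.array([1.0, 3.0])
y = X @ true_w + 0.5 * rng.standard_normal(30)

# Minimising E(w) = sum_i (y_i - f(x_i, w))^2 for a linear f gives the
# normal equations, w = (X^T X)^{-1} X^T y, solved stably by lstsq.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
```

<p>The recovered <code>w</code> lies close to the generating weights (1.0, 3.0). Replacing the linear prediction function with a non-linear one generally forfeits this closed form, trading interpretability for predictive power as described above.</p>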
<h3 id="deep-learning">Deep Learning</h3>
<ul>
<li>
<p>These are interpretable models: vital for disease etc.</p>
</li>
<li>
<p>Modern machine learning methods are less interpretable</p>
</li>
<li>
<p>Example: face recognition</p>
</li>
</ul>
<p><small>Outline of the DeepFace architecture. A front-end of a single convolution-pooling-convolution filtering on the rectified input, followed by three locally-connected layers and two fully-connected layers. Color illustrates feature maps produced at each layer. The net includes more than 120 million parameters, where more than 95% come from the local and fully connected layers.</small></p>
<p><img class="" src="http://inverseprobability.com/talks/slides/diagrams/deepface_neg.png" width="100%" align="" style="background:none; border:none; box-shadow:none;" /></p>
<p align="right">
<small>Source: DeepFace</small></p>
<p><img class="" src="http://inverseprobability.com/talks/slides/diagrams/576px-Early_Pinball.jpg" width="50%" align="" style="background:none; border:none; box-shadow:none;" /></p>
<p>We can think of what these models are doing as being similar to early pin ball machines. In a neural network, we input a number (or numbers), whereas in pinball, we input a ball. The location of the ball on the left-right axis can be thought of as the number. As the ball falls through the machine, each layer of pins can be thought of as a different layer of neurons. Each layer acts to move the ball from left to right.</p>
<p>In a pinball machine, when the ball gets to the bottom it might fall into a hole defining a score, in a neural network, that is equivalent to the decision: a classification of the input object.</p>
<p>An image has more than one number associated with it, so it’s like playing pinball in a <em>hyper-space</em>.</p>
<p>At initialization, the pins aren’t in the right place to bring the ball to the correct decision.</p>
<p>Learning involves moving all the pins to be in the right position, so that the ball falls in the right place. But moving all these pins in hyperspace can be difficult. In hyperspace you have to put a lot of data through the machine to explore the positions of all the pins. Adversarial learning reflects the fact that a ball can be moved a small distance and lead to a very different result.</p>
<p>Probabilistic methods explore more of the space by considering a range of possible paths for the ball through the machine.</p>
<h3 id="uncertainty-and-learning">Uncertainty and Learning</h3>
<ul>
<li>
<p>In this “vanilla” form these machines “don’t know when they don’t know”.</p>
</li>
<li>
<p>Doubt is vital in real world decision making.</p>
</li>
<li>
<p>Incorporating this in systems has been a long-term focus of my technical research.</p>
</li>
</ul>
<h3 id="comparison-with-human-learning--embodiment">Comparison with Human Learning & Embodiment</h3>
<ul>
<li>The emulation of intelligence does not exhibit all the meta-modelling humans perform.</li>
</ul>
<h3 id="data-science">Data Science</h3>
<ul>
<li>Industrial Revolution 4.0?</li>
<li><em>Industrial Revolution</em> (1760-1840) term coined by Arnold Toynbee,
late 19th century.</li>
<li>Maybe: But this one is dominated by <em>data</em> not <em>capital</em></li>
<li>That presents <em>challenges</em> and <em>opportunities</em></li>
</ul>
<p>compare <a href="https://www.theguardian.com/media-network/2015/mar/05/digital-oligarchy-algorithms-personal-data">digital oligarchy</a> vs <a href="https://www.theguardian.com/media-network/2015/aug/25/africa-benefit-data-science-information">how Africa can benefit from the data revolution</a></p>
<p>Disruptive technologies take time to assimilate, and best practices, as well as the pitfalls of new technologies, take time to share. Historically, new technologies led to new professions. <a href="https://en.wikipedia.org/wiki/Isambard_Kingdom_Brunel">Isambard Kingdom Brunel</a> (born 1806) was a leading innovator in civil, mechanical and naval engineering. Each of these fields has its own professional institution, founded in 1818, 1847, and 1860 respectively.</p>
<p><a href="https://en.wikipedia.org/wiki/Nikola_Tesla">Nikola Tesla</a>, born in 1856, developed the modern approach to electrical distribution. The American Institute of Electrical Engineers was founded in 1884; the UK equivalent was founded in 1871.</p>
<p><a href="https://en.wikipedia.org/wiki/William_Shockley">William Shockley Jr</a>, born 1910, led the group that developed the transistor and was referred to as “the man who brought silicon to Silicon Valley”. In 1963 the American Institute of Electrical Engineers merged with the Institute of Radio Engineers to form the Institute of Electrical and Electronics Engineers.</p>
<p><a href="https://en.wikipedia.org/wiki/Watts_Humphrey">Watts S. Humphrey</a>, born 1927, was known as the “father of software quality”; in the 1980s he founded a program aimed at understanding and managing the software process. The British Computer Society was founded in 1956.</p>
<p>Why the need for these professions? Much of it is about codification of best practice and developing trust between the public and practitioners. These fundamental characteristics of the professions are shared with the oldest professions (Medicine, Law) as well as the newest (Information Technology).</p>
<p>So where are we today? My best guess is that we are somewhere equivalent to the 1980s for software engineering. In terms of professional deployment we have a basic understanding of the equivalent of “programming”, but much less understanding of <em>machine learning systems design</em> and <em>data infrastructure</em>: how the components we have developed interoperate in a reliable and accountable manner. Best practice is still evolving, but perhaps isn’t being shared widely enough.</p>
<p>One problem is that the art of data science is superficially similar to regular software engineering, although in practice it is rather different. Modern software engineering practice generates code that is well tested as it is written; agile programming techniques provide the appropriate degree of flexibility for individual programmers alongside sufficient formalization and testing. These techniques have evolved from the overly restrictive formalization proposed in the early days of software engineering.</p>
<p>While data science involves programming, it is different in the following way. Most of the work in data science involves understanding the data and the appropriate manipulations to apply to extract knowledge from it. The eventual number of lines of code required to extract that knowledge is often very small, but the amount of thought and attention that needs to be applied to each line is much greater than for a traditional line of software code. Testing of those lines is also of a different nature: provisions have to be made for evolving data environments. Development work is often done on a static snapshot of data, but deployment is made in a live environment where the nature of the data changes. Quality control involves checking for degradation in performance arising from unanticipated changes in data quality. It may also need to check for regulatory conformity. For example, in the UK the General Data Protection Regulation stipulates standards of explainability and fairness that may need to be monitored. These concerns do not affect traditional software deployments.</p>
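<p>As a toy illustration of the quality control point above (the function name, data and tolerance are all invented for this sketch), a deployed pipeline might compare live data against the static development snapshot:</p>

```python
import numpy as np

def drift_alert(snapshot, live, tolerance=0.5):
    # Flag when a feature's mean in live data has wandered more than
    # `tolerance` snapshot standard deviations from the snapshot mean.
    shift = abs(live.mean() - snapshot.mean()) / snapshot.std(ddof=1)
    return shift > tolerance

snapshot = np.array([9.8, 10.1, 10.0, 9.9, 10.2])  # static development data
live_ok = np.array([10.0, 9.9, 10.1])              # resembles the snapshot
live_bad = np.array([12.1, 11.8, 12.3])            # environment has changed
print(drift_alert(snapshot, live_ok), drift_alert(snapshot, live_bad))
```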
<p>Others are also pointing out these challenges, <a href="https://medium.com/@karpathy/software-2-0-a64152b37c35">this post</a> from Andrej Karpathy (now head of AI at Tesla) covers the notion of “Software 2.0”. Google researchers have highlighted the challenges of “Technical Debt” in machine learning [@Sculley:debt15]. Researchers at Berkeley have characterized the systems challenges associated with machine learning [@Stoica:systemsml17].</p>
<p>Data science is not only about technical expertise and analysis of data, we need to also generate a culture of decision making that acknowledges the true challenges in data-driven automated decision making. In particular, a focus on algorithms has neglected the importance of data in driving decisions. The quality of data is paramount in that poor quality data will inevitably lead to poor quality decisions. Anecdotally most data scientists will suggest that 80% of their time is spent on data clean up, and only 20% on actually modelling.</p>
<h3 id="the-software-crisis">The Software Crisis</h3>
<blockquote>
<p>The major cause of the software crisis is that the machines have
become several orders of magnitude more powerful! To put it quite
bluntly: as long as there were no machines, programming was no problem
at all; when we had a few weak computers, programming became a mild
problem, and now we have gigantic computers, programming has become an
equally gigantic problem.</p>
<p>Edsger Dijkstra (1930-2002), The Humble Programmer</p>
</blockquote>
<p>In the late sixties, software programmers noted the increasing costs of software development and termed the associated challenges the “<a href="https://en.wikipedia.org/wiki/Software_crisis">Software Crisis</a>”. Edsger Dijkstra referred to the crisis in his 1972 Turing Award winner’s address.</p>
<h3 id="the-data-crisis">The Data Crisis</h3>
<blockquote>
<p>The major cause of the data crisis is that machines have become more
interconnected than ever before. Data access is therefore cheap, but
data quality is often poor. What we need is cheap high quality
data. That implies that we develop processes for improving and
verifying data quality that are efficient.</p>
<p>There would seem to be two ways for improving efficiency. Firstly, we
should not duplicate work. Secondly, where possible we should automate
work.</p>
</blockquote>
<p>What I term “The Data Crisis” is the modern equivalent of this problem. The quantity of modern data, and the lack of attention paid to data as it is initially “laid down” and the costs of data cleaning are bringing about a crisis in data-driven decision making. Just as with software, the crisis is most correctly addressed by ‘scaling’ the manner in which we process our data. Duplication of work occurs because the value of data cleaning is not correctly recognised in management decision making processes. Automation of work is increasingly possible through techniques in “artificial intelligence”, but this will also require better management of the data science pipeline so that data about data science (meta-data science) can be correctly assimilated and processed. The Alan Turing institute has a program focussed on this area, <a href="https://www.turing.ac.uk/research_projects/artificial-intelligence-data-analytics/">AI for Data Analytics</a>.</p>
<p><img class="" src="http://inverseprobability.com/talks/slides/diagrams/Medievalplowingwoodcut.jpg" width="" align="" style="background:none; border:none; box-shadow:none;" /></p>
<p>Our current information infrastructure bears a close relation with <em>feudal systems</em> of government. In the feudal system a lord had a duty of care over his serfs and vassals, a duty to protect subjects. But in practice there was a power asymmetry. In feudal days protection was against Viking raiders; today, it is against information raiders. However, when there is an information leak, when there is a failure, it is already too late. Alternatively, our data could be publicly shared in an information commons, akin to the common land of the medieval village. But just as commons were subject to overgrazing and poor management, much of our data cannot be managed in this way, particularly personal, sensitive data.</p>
<p>I explored this idea further in <a href="https://www.theguardian.com/media-network/2015/nov/16/information-barons-threaten-autonomy-privacy-online">this Guardian Op-Ed from 2015</a>.</p>
<h2 id="public-use-of-data-for-public-good">Public Use of Data for Public Good</h2>
<p>Since machine learning methods are so dependent on data, understanding public attitudes to the use of their data is key to developing machine learning methods that maintain the trust of the public. Nowhere are the benefits of machine learning more profound, and the potential pitfalls more catastrophic, than in the use of machine learning on health data.</p>
<p>The promise is for methods that take a personalized perspective on our individual health, but health data is some of the most sensitive data available to us. This is recognised both by the public and by regulation.</p>
<p>With this in mind The Wellcome Trust launched a report on <a href="https://wellcome.ac.uk/news/understanding-patient-data-launches-today">“Understanding Patient Data”</a> authored by Nicola Perrin, driven by the National Data Guardian’s recommendations.</p>
<p>From this report we know that patients trust universities and hospitals more than they trust commercial entities and insurers. However, there are a number of different ways in which data can be mishandled; it is not only the intent of the data controllers that affects our data security.</p>
<p>Consider, for example, the recent WannaCry attack, which demonstrated the unpreparedness of much of the NHS IT infrastructure for malware exploiting a vulnerability that was well known to the security community. The key point is that the public trust the <em>intent</em> of academics and medical professionals, but actual <em>capability</em> could be at variance with the intent.</p>
<p><img class="" src="http://inverseprobability.com/talks/slides/diagrams/health/bush-pilot-grant-mcconachie.jpg" width="60%" align="" style="background:none; border:none; box-shadow:none;" /></p>
<center><i>Bush Pilot Grant McConachie</i></center>
<p>The situation is somewhat reminiscent of early aviation. This is where we are with our data science capabilities. By analogy, the engine of the plane is our data security infrastructure, the basic required technology to make us safe. The pilot is the health professional performing data analytics. The nature of the job of early pilots, and indeed today’s <em>bush pilots</em> (who fly to remote places), included a need to understand the mechanics of the engine. Just so, a health data scientist today needs to deal with security of the infrastructure as well as the nature of the analysis.</p>
<p><img class="" src="http://inverseprobability.com/talks/slides/diagrams/health/British_Airways_at_SFO.jpg" width="50%" align="" style="background:none; border:none; box-shadow:none;" /></p>
<center><i>British Airways 747 at SFO</i></center>
<p>I suspect most passengers would find it disconcerting if the pilot of a 747 was seen working on the engine shortly before a flight. As aviation has become more widespread, there is now a separation of responsibilities between pilots and mechanics. Indeed, Rolls-Royce maintain ownership of their engines today, and merely lease them to the aircraft company. The responsibility for maintenance of the engine lies entirely with Rolls-Royce, yet the pilot is responsible for the safety of the aircraft and its passengers.</p>
<p>We need to develop a modern data infrastructure which separates the need for security of infrastructure from the decision making of the data analyst.</p>
<p>This separation of responsibility according to expertise needs to be emulated when considering health data infrastructure. This resolves the <em>intent-capability</em> dilemma, by ensuring a separation of responsibilities to those that are best placed to address the issues.</p>
<h3 id="propagation-of-best-practice">Propagation of Best Practice</h3>
<p>We must also be careful to maintain openness in this new generation of digital solutions for patient care. Matthew Syed’s book, “Black Box Thinking” [@Syed:blackbox15], emphasizes the importance of surfacing errors as a route to learning and improved process. Taking aviation as an example, and contrasting it with the culture in medicine, Matthew relates the story of <a href="https://chfg.org/trustees/martin-bromiley/">Martin Bromiley</a>, an airline pilot whose wife died during a routine hospital procedure, and his efforts to improve the culture of safety in medicine. The motivation for the book is the difference in culture between aviation and medicine in how errors are acknowledged and dealt with. We must ensure that these high standards of oversight apply to the era of data-driven automated decision making.</p>
<p>In particular, while there is much to be gained by involving commercial companies, if the process by which they are drawing inference about patient condition is hidden (for example, due to commercial confidentiality), this may prevent us from understanding errors in diagnosis or treatment. This would be a retrograde step. It may be that health device certification needs modification or reform for data-driven automated decision making, but we need a spirit of transparency around how these systems are deriving their inferences to ensure best practice.</p>
<h2 id="data-trusts">Data Trusts</h2>
<p>The machine learning solutions we are dependent on to drive automated decision making are dependent on data. But with regard to personal data there are important issues of privacy. Data sharing brings benefits, but also exposes our digital selves. From the use of social media data for targeted advertising to influence us, to the use of genetic data to identify criminals, or natural family members. Control of our virtual selves maps on to control of our actual selves.</p>
<p>The feudal system that is implied by current data protection legislation has significant power asymmetries at its heart: the data controller has a duty of care over the data subject, but the data subject may only discover failings in that duty of care when it is too late. Data controllers may also have conflicting motivations; often their primary motivation is <em>not</em> towards the data subject, who is merely one consideration in a wider agenda.</p>
<p>I proposed <a href="https://www.theguardian.com/media-network/2016/jun/03/data-trusts-privacy-fears-feudalism-democracy">Data Trusts</a> as a solution to this problem, inspired by the <em>land societies</em> that formed in the 19th century to bring democratic representation to the growing middle classes. A land society was a mutual organisation where resources were pooled for the common good.</p>
<p>A Data Trust would be a legal entity where the trustees’ responsibility is entirely to the members of the trust, so that the motivation of the data controllers is aligned only with the data subjects. How data is handled would be subject to the terms under which the trust was convened. The success of an individual trust would be contingent on it satisfying its members with an appropriate balance of individual privacy against the benefits of data sharing.</p>
<p>Formation of Data Trusts became the number one recommendation of the Hall-Pesenti report on AI, but the manner in which this is done will have a significant impact on their utility. It feels important to have a diversity of approaches, and yet it feels important that any individual trust would be large enough to be taken seriously in representing the views of its members in wider negotiations.</p>
<h3 id="thanks">Thanks!</h3>
<ul>
<li>twitter: @lawrennd</li>
<li>blog: <a href="http://inverseprobability.com/blog.html">http://inverseprobability.com</a></li>
<li><a href="https://medium.com/@mijordan3/artificial-intelligence-the-revolution-hasnt-happened-yet-5e1d5812e1e7">Mike Jordan’s Medium Post</a></li>
</ul>
<div class="footnotes">
<ol>
<li id="fn:knowledge-representation">
<p>the challenge of understanding what it pertains to is known as knowledge representation. <a href="#fnref:knowledge-representation" class="reversefootnote">&#8617;</a></p>
</li>
</ol>
</div>
Fri, 11 May 2018 00:00:00 +0000
http://inverseprobability.com/talks/outlook-for-uk-ai-and-ml.html
http://inverseprobability.com/talks/outlook-for-uk-ai-and-ml.html
Towards Machine Learning Systems Design
<h3 id="what-is-machine-learning">What is Machine Learning?</h3>
<p>What is machine learning? At its most basic level machine learning is a combination of</p>
<script type="math/tex; mode=display">\text{data} + \text{model} \xrightarrow{\text{compute}} \text{prediction}</script>
<p>where <em>data</em> is our observations. They can be actively or passively
acquired (meta-data). The <em>model</em> contains our assumptions, based on
previous experience. That experience can be other data, it can come
from transfer learning, or it can merely be our beliefs about the
regularities of the universe. In humans our models include our
inductive biases. The <em>prediction</em> is an action to be taken or a
categorization or a quality score. The reason that machine learning
has become a mainstay of artificial intelligence is the importance of
predictions in artificial intelligence. The data and the model are combined through computation.</p>
<p>In practice we normally perform machine learning using two functions. To combine data with a model we typically make use of:</p>
<p><strong>a prediction function</strong> a function which is used to make the predictions. It includes our beliefs about the regularities of the universe, our assumptions about how the world works, e.g. smoothness, spatial similarities, temporal similarities.</p>
<p><strong>an objective function</strong> a function which defines the cost of misprediction. Typically it includes knowledge about the world’s generating processes (probabilistic objectives) or the costs we pay for mispredictions (empirical risk minimization).</p>
<p>The combination of data and model through the prediction function and the objective function leads to a <em>learning algorithm</em>. The class of prediction functions and objective functions we can make use of is restricted by the algorithms they lead to. If the prediction function or the objective function are too complex, then it can be difficult to find an appropriate learning algorithm. Much of the academic field of machine learning is the quest for new learning algorithms that allow us to bring different types of models and data together.</p>
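<p>A hedged sketch of the simplest case: combining a linear prediction function with the least squares objective yields gradient descent as one possible learning algorithm (the data and step size here are invented):</p>

```python
import numpy as np

def prediction(X, w):
    return X @ w  # prediction function

def objective(w, y, X):
    return np.sum((y - prediction(X, w)) ** 2)  # least squares

def learn(y, X, lr=0.01, steps=500):
    # The learning algorithm resulting from this prediction function
    # and objective function: plain gradient descent.
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = -2.0 * X.T @ (y - prediction(X, w))
        w -= lr * grad
    return w

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])
w = learn(y, X)
print(w, objective(w, y, X))  # w approaches intercept 1, slope 2
```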
<p>A useful reference for state of the art in machine learning is the UK Royal Society Report, <a href="https://royalsociety.org/~/media/policy/projects/machine-learning/publications/machine-learning-report.pdf">Machine Learning: Power and Promise of Computers that Learn by Example</a>.</p>
<p>You can also check my blog post on <a href="http://inverseprobability.com/2017/07/17/what-is-machine-learning">“What is Machine Learning?”</a></p>
<p>Machine learning technologies have been the driver of two related, but distinct disciplines. The first is <em>data science</em>. Data science is an emerging field that arises from the fact that we now collect so much data by happenstance, rather than by <em>experimental design</em>. Classical statistics is the science of drawing conclusions from data, and to do so statistical experiments are carefully designed. In the modern era we collect so much data that there’s a desire to draw inferences directly from the data.</p>
<p>As well as machine learning, the field of data science draws from statistics, cloud computing, data storage (e.g. streaming data), visualization and data mining.</p>
<p>In contrast, artificial intelligence technologies typically focus on emulating some form of human behaviour, such as understanding an image, or some speech, or translating text from one form to another. The recent advances in artificial intelligence have come from machine learning providing the automation. But in contrast to data science, in artificial intelligence the data is normally collected with the specific task in mind. In this sense it has relations to classical statistics.</p>
<p>Classically artificial intelligence worried more about <em>logic</em> and <em>planning</em> and focussed less on data driven decision making. Modern machine learning owes more to the field of <em>Cybernetics</em> than artificial intelligence. Related fields include <em>robotics</em>, <em>speech recognition</em>, <em>language understanding</em> and <em>computer vision</em>.</p>
<p>There are strong overlaps between the fields; the wide availability of data by happenstance makes it easier to collect data for designing AI systems. These relations are coming through the wide availability of sensing technologies that are interconnected by cellular networks, WiFi and the internet. This phenomenon is sometimes known as the <em>Internet of Things</em>, but this feels like a dangerous misnomer. We must never forget that we are interconnecting people, not things.</p>
<h3 id="what-does-machine-learning-do">What does Machine Learning do?</h3>
<ul>
<li>ML Automates through Data
<ul>
<li><em>Strongly</em> related to statistics.</li>
<li>Field underpins revolution in <em>data science</em> and <em>AI</em></li>
</ul>
</li>
<li>With AI:
<ul>
<li><em>logic</em>, <em>robotics</em>, <em>computer vision</em>, <em>speech</em></li>
</ul>
</li>
<li>With Data Science:
<ul>
<li><em>databases</em>, <em>data mining</em>, <em>statistics</em>, <em>visualization</em></li>
</ul>
</li>
</ul>
<h3 id="embodiment-factors">“Embodiment Factors”</h3>
<table>
<tr>
<td></td>
<td align="center"><img class="" src="./slides/diagrams/IBM_Blue_Gene_P_supercomputer.jpg" width="40%" align="center" style="background:none; border:none; box-shadow:none;" /></td>
<td align="center"><img class="" src="./slides/diagrams/ClaudeShannon_MFO3807.jpg" width="25%" align="center" style="background:none; border:none; box-shadow:none;" /></td>
</tr>
<tr>
<td>compute</td>
<td align="center">$$\approx 10 \text{ gigaflops}$$</td><td align="center">$$\approx 1000 \text{ teraflops}$$</td>
</tr>
<tr>
<td>communicate</td>
<td align="center">$$\approx 1 \text{ gigabit/s}$$</td>
<td align="center">$$\approx 100 \text{ bit/s}$$</td>
</tr>
<tr>
<td>(compute/communicate)</td>
<td align="center">$$10$$</td>
<td align="center">$$\approx 10^{13}$$</td>
</tr>
</table>
<p>There is a fundamental limit placed on our intelligence based on our ability to communicate. Claude Shannon founded the field of information theory. The clever part of this theory is it allows us to separate our measurement of information from what the information pertains to<sup id="fnref:knowledge-representation"><a href="#fn:knowledge-representation" class="footnote">1</a></sup>.</p>
<p>Shannon measured information in bits. One bit of information is the amount of information I pass to you when I give you the result of a coin toss. Shannon was also interested in the amount of information in the English language. He estimated that on average a word in the English language contains 12 bits of information.</p>
<p>Given typical speaking rates, that gives us an estimate of our ability to communicate of around 100 bits per second. Computers on the other hand can communicate much more rapidly. Current wired network speeds are around a billion bits per second, ten million times faster.</p>
<p>When it comes to compute though, our best estimates indicate our computers are slower. A typical modern computer can perform around 2 billion floating point operations per second, and each floating point operation involves a 64 bit number, so the computer is processing around 128 billion bits per second.</p>
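<p>The back-of-envelope arithmetic behind these numbers (all figures are the rough estimates from the text, not measurements):</p>

```python
# Rough estimates from the text above.
flops = 2e9               # floating point operations per second
bits_per_op = 64          # each operation handles a 64 bit number
machine_process = flops * bits_per_op  # bits/s the computer processes
machine_comm = 1e9        # bits/s, wired network speed
human_comm = 100.0        # bits/s, spoken-language estimate
print(machine_process)              # ~128 billion bits per second
print(machine_comm / human_comm)    # network ~10 million times faster
```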
<p>It’s difficult to get similar estimates for humans, but by some estimates the amount of compute we would require to <em>simulate</em> a human brain is equivalent to that in the UK’s fastest computer, the Met Office machine in Exeter, which in 2018 ranks as the 11th fastest computer in the world. That machine simulates the world’s weather each morning, and then simulates the world’s climate. It is a 16 petaflop machine.</p>
<p>So when it comes to our ability to compute we are extraordinary: not the compute in our conscious mind, but the underlying neuron firings that underpin our consciousness, our subconsciousness and our motor control. By analogy I sometimes like to think of us as a Formula One engine. But in terms of our ability to deploy that computation in actual use, to share the results of what we have inferred, we are very limited. So when you imagine the F1 car that represents the psyche, think of an F1 car with bicycle wheels.</p>
<p>In contrast, our computers have less computational power, but they can communicate far more fluidly. They are more like a go-kart: less well powered, but with tires that allow them to deploy that power.</p>
<p>For humans, that means much of our computation should be dedicated to considering <em>what</em> we should compute. To do that efficiently we need to model the world around us. The most complex thing in the world around us is other humans. So it is no surprise that we model them. We second guess what their intentions are, and our communication is only necessary when they are departing from how we model them. Naturally, for this to work well, we need to understand those we work closely with. So it is no surprise that social communication, social bonding, forms so much of a part of our use of our limited bandwidth.</p>
<p>There is a second effect here, our need to anthropomorphise objects around us. Our tendency to model our fellow humans extends to when we interact with other entities in our environment. To our pets as well as inanimate objects around us, such as computers or even our cars. This tendency to overinterpret could be a consequence of our limited ability to communicate.</p>
<p>For more details see this paper <a href="https://arxiv.org/abs/1705.07996">“Living Together: Mind and Machine Intelligence”</a>, and this <a href="http://inverseprobability.com/talks/lawrence-tedx17/living-together.html">TEDx talk</a>.</p>
<h3 id="evolved-relationship">Evolved Relationship</h3>
<object class="svgplot" align="" data="./slides/diagrams/data-science/information-flow003.svg"></object>
<p>The high bandwidth of computers has resulted in a close relationship between the computer and data. Large amounts of information can flow between the two. The degree to which the computer is mediating our relationship with data means that we should consider it an intermediary.</p>
<p>Originally our low bandwidth relationship with data was affected by two characteristics. Firstly, our tendency to over-interpret, driven by our need to extract as much knowledge as possible from our low bandwidth information channel. Secondly, our improved understanding of the domain of <em>mathematical</em> statistics and how our cognitive biases can mislead us.</p>
<p>With this new set up there is a potential for assimilating far more information via the computer, but the computer can present this to us in various ways. If its motives are not aligned with ours then it can misrepresent the information. This needn’t be nefarious; it can simply be a result of the computer pursuing a different objective from us. For example, the computer may be aiming to maximize our interaction time, while our objective may be to have information summarized in a representative manner in the <em>shortest</em> possible length of time.</p>
<p>For example, for me it was a common experience to pick up my telephone with the intention of checking when my next appointment was, but to soon find myself distracted by another application on the phone, and end up reading something on the internet. By the time I’d finished reading, I would often have forgotten the reason I picked up my phone in the first place.</p>
<p>We can benefit enormously from the very large amount of information we can now obtain through this evolved relationship between us and data. Biology has already benefited from large scale data sharing and the improved inferences that can be drawn through summarizing data by computer. That has underpinned the revolution in computational biology. But in our daily lives this phenomenon is affecting us in ways which we haven’t envisaged.</p>
<p>Better mediation of this flow actually requires a better understanding of human-computer interaction. This in turn involves understanding our own intelligence better, what its cognitive biases are and how these might mislead us.</p>
<p>For further thoughts see <a href="https://www.theguardian.com/media-network/2015/jul/23/data-driven-economy-marketing">this Guardian article</a> on marketing in the internet era and <a href="http://inverseprobability.com/2015/12/04/what-kind-of-ai">this blog post</a> on System Zero.</p>
<h3 id="what-does-machine-learning-do-1">What does Machine Learning do?</h3>
<p>Any process of automation allows us to scale what we do by codifying a process in some way that makes it efficient and repeatable. Machine learning automates by emulating human (or other) actions found in data. It codifies those actions in the form of a mathematical function that is learnt by a computer. If we can create these mathematical functions in ways in which they can interconnect, then we can also build systems.</p>
<h3 id="codify-through-mathematical-functions">Codify Through Mathematical Functions</h3>
<ul>
<li>How does machine learning work?</li>
<li>Jumper (jersey/sweater) purchase with logistic regression
<script type="math/tex">\text{odds} = \frac{\text{bought}}{\text{not bought}}</script>
<script type="math/tex">\log \text{odds} = \beta_0 + \beta_1 \text{age} + \beta_2 \text{latitude}</script></li>
</ul>
<h3 id="codify-through-mathematical-functions-1">Codify Through Mathematical Functions</h3>
<ul>
<li>How does machine learning work?</li>
<li>Jumper (jersey/sweater) purchase with logistic regression
<script type="math/tex">p(\text{bought}) = \mappingFunction\left(\beta_0 + \beta_1 \text{age} + \beta_2 \text{latitude}\right)</script></li>
</ul>
<h3 id="codify-through-mathematical-functions-2">Codify Through Mathematical Functions</h3>
<ul>
<li>How does machine learning work?</li>
<li>Jumper (jersey/sweater) purchase with logistic regression
<script type="math/tex">p(\text{bought}) = \mappingFunction\left(\boldsymbol{\beta}^\top \inputVector\right)</script></li>
</ul>
<p>We call $\mappingFunction(\cdot)$ the <em>prediction function</em></p>
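<p>As a minimal sketch of this prediction function, with made-up coefficients rather than fitted values, the logistic regression maps the linear score through a sigmoid:</p>
<pre><code class="language-{.python}">import numpy as np

def sigmoid(z):
    """Logistic link: maps log odds to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative coefficients only -- a real model would learn these from data.
beta = np.array([-4.0, 0.02, 0.1])   # beta_0, beta_1 (age), beta_2 (latitude)
x = np.array([1.0, 35.0, 53.0])      # bias term, age 35, latitude 53
p_bought = sigmoid(beta @ x)         # p(bought) = f(beta^T x)
</code></pre>
<p>Here the odds-to-probability conversion is the only moving part; the coefficients and input are hypothetical.</p>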
<h3 id="fit-to-data">Fit to Data</h3>
<ul>
<li>
<p>Use an objective function
<script type="math/tex">\errorFunction(\boldsymbol{\beta}, \dataMatrix, \inputMatrix)</script></p>
</li>
<li>
<p>E.g. least squares
<script type="math/tex">\errorFunction(\boldsymbol{\beta}) = \sum_{i=1}^\numData \left(\dataScalar_i - \mappingFunction(\inputVector_i)\right)^2</script></p>
</li>
</ul>
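<p>The least squares objective above can be evaluated directly. A sketch with a toy linear prediction function and illustrative data (not the marathon data):</p>
<pre><code class="language-{.python}">import numpy as np

def prediction(X, beta):
    """Linear prediction function f(x) = beta^T x for each row of X."""
    return X @ beta

def sum_squares_error(beta, y, X):
    """Least squares objective E(beta, y, X): sum of squared residuals."""
    return np.sum((y - prediction(X, beta))**2)

# Toy data: a bias column plus one input, illustrative only.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 2.0, 3.0])
beta = np.array([1.0, 1.0])            # a perfect fit for this toy data
error = sum_squares_error(beta, y, X)  # 0.0
</code></pre>
<p>Fitting means choosing $\boldsymbol{\beta}$ to minimize this objective, here done by inspection for the toy data.</p>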
<h3 id="two-components">Two Components</h3>
<ul>
<li>Prediction function, $\mappingFunction(\cdot)$</li>
<li>Objective function, $\errorFunction(\cdot)$</li>
</ul>
<h3 id="deep-learning">Deep Learning</h3>
<ul>
<li>
<p>These are interpretable models: vital for disease etc.</p>
</li>
<li>
<p>Modern machine learning methods are less interpretable</p>
</li>
<li>
<p>Example: face recognition</p>
</li>
</ul>
<p><small>Outline of the DeepFace architecture. A front-end of a single convolution-pooling-convolution filtering on the rectified input, followed by three locally-connected layers and two fully-connected layers. Color illustrates feature maps produced at each layer. The net includes more than 120 million parameters, where more than 95% come from the locally and fully connected layers.</small></p>
<p><img class="" src="./slides/diagrams/deepface_neg.png" width="100%" align="" style="background:none; border:none; box-shadow:none;" /></p>
<p align="right">
<small>Source: DeepFace</small></p>
<p><img class="" src="./slides/diagrams/576px-Early_Pinball.jpg" width="50%" align="" style="background:none; border:none; box-shadow:none;" /></p>
<p>We can think of what these models are doing as being similar to early pin ball machines. In a neural network, we input a number (or numbers), whereas in pinball, we input a ball. The location of the ball on the left-right axis can be thought of as the number. As the ball falls through the machine, each layer of pins can be thought of as a different layer of neurons. Each layer acts to move the ball from left to right.</p>
<p>In a pinball machine, when the ball gets to the bottom it might fall into a hole defining a score; in a neural network, that is equivalent to the decision: a classification of the input object.</p>
<p>An image has more than one number associated with it, so it’s like playing pinball in a <em>hyper-space</em>.</p>
<p>At initialization, the pins aren’t in the right place to bring the ball to the correct decision.</p>
<p>Learning involves moving all the pins to be in the right position, so that the ball falls in the right place. But moving all these pins in hyperspace can be difficult. In a hyperspace you have to put a lot of data through the machine to explore the positions of all the pins. Adversarial learning reflects the fact that a ball can be moved a small distance and lead to a very different result.</p>
<p>Probabilistic methods explore more of the space by considering a range of possible paths for the ball through the machine.</p>
<pre><code class="language-{.python}">import numpy as np
import teaching_plots as plot
</code></pre>
<pre><code class="language-{.python}">%load -s compute_kernel mlai.py
</code></pre>
<pre><code class="language-{.python}">%load -s eq_cov mlai.py
</code></pre>
<pre><code class="language-{.python}">np.random.seed(10)
plot.rejection_samples(compute_kernel, kernel=eq_cov,
                       lengthscale=0.25, diagrams='./slides/diagrams/gp')
</code></pre>
<pre><code class="language-{.python}">import pods
import mlai  # used below for mlai.write_figure
import matplotlib.pyplot as plt
</code></pre>
<h3 id="olympic-marathon-data">Olympic Marathon Data</h3>
<p>The first thing we will do is load a standard data set for regression modelling. The data consists of the pace of Olympic Gold Medal Marathon winners for the Olympics from 1896 to the present. First we load in the data and plot.</p>
<pre><code class="language-{.python}">data = pods.datasets.olympic_marathon_men()
x = data['X']
y = data['Y']
offset = y.mean()
scale = np.sqrt(y.var())
xlim = (1875,2030)
ylim = (2.5, 6.5)
yhat = (y-offset)/scale
fig, ax = plt.subplots(figsize=plot.big_wide_figsize)
_ = ax.plot(x, y, 'r.',markersize=10)
ax.set_xlabel('year', fontsize=20)
ax.set_ylabel('pace min/km', fontsize=20)
ax.set_xlim(xlim)
ax.set_ylim(ylim)
mlai.write_figure(figure=fig, filename='./slides/diagrams/datasets/olympic-marathon.svg', transparent=True, frameon=True)
</code></pre>
<h3 id="olympic-marathon-data-1">Olympic Marathon Data</h3>
<table><tr><td width="70%">
<ul>
<li>Gold medal times for Olympic Marathon since 1896.</li>
<li>Marathons before 1924 didn’t have a standardised distance.</li>
<li>Present results using pace per km.</li>
<li>In 1904 the Marathon was badly organised, leading to very slow times.</li>
</ul>
</td><td width="30%">
<img src="./slides/diagrams/Stephen_Kiprotich.jpg" />
<small>Image from Wikimedia Commons <a href="http://bit.ly/16kMKHQ">http://bit.ly/16kMKHQ</a></small>
</td></tr></table>
<object class="svgplot" align="" data="./slides/diagrams/ml/olympic_marathon.svg"></object>
<p>Things to notice about the data include the outlier in 1904; in this year the Olympics was held in St Louis, USA. Organizational problems and challenges with dust kicked up by the cars following the race meant that participants got lost, and only very few completed the course.</p>
<p>More recent years see more consistently quick marathons.</p>
<p>Our first objective will be to perform a Gaussian process fit to the data. We’ll do this using the <a href="https://github.com/SheffieldML/GPy">GPy software</a>.</p>
<pre><code class="language-{.python}">import GPy

m_full = GPy.models.GPRegression(x,yhat)
_ = m_full.optimize() # Optimize parameters of covariance function
</code></pre>
<p>The first command sets up the model, then</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>m_full.optimize()
</code></pre></div></div>
<p>optimizes the parameters of the covariance function and the noise level of the model. Once the fit is complete, we’ll try creating some test points, and computing the output of the GP model in terms of the mean and standard deviation of the posterior functions between 1870 and 2030. We plot the mean function and the standard deviation at 200 locations. We can obtain the predictions using</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>y_mean, y_var = m_full.predict(xt)
</code></pre></div></div>
<pre><code class="language-{.python}">xt = np.linspace(1870,2030,200)[:,np.newaxis]
yt_mean, yt_var = m_full.predict(xt)
yt_sd=np.sqrt(yt_var)
</code></pre>
<p>Now we plot the results using the helper function in <code class="highlighter-rouge">teaching_plots</code>.</p>
<pre><code class="language-{.python}">import teaching_plots as plot
</code></pre>
<pre><code class="language-{.python}">fig, ax = plt.subplots(figsize=plot.big_wide_figsize)
plot.model_output(m_full, scale=scale, offset=offset, ax=ax, xlabel='year', ylabel='pace min/km', fontsize=20, portion=0.2)
ax.set_xlim(xlim)
ax.set_ylim(ylim)
mlai.write_figure(figure=fig,
                  filename='./slides/diagrams/gp/olympic-marathon-gp.svg',
                  transparent=True, frameon=True)
</code></pre>
<object class="svgplot" align="" data="./slides/diagrams/gp/olympic-marathon-gp.svg"></object>
<h3 id="fit-quality">Fit Quality</h3>
<p>In the fit we see that the error bars (coming mainly from the noise variance) are quite large. This is likely due to the outlier point in 1904; ignoring that point, we can see that a tighter fit is obtained. To see this we make a version of the model, <code class="highlighter-rouge">m_clean</code>, where that point is removed.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>x_clean=np.vstack((x[0:2, :], x[3:, :]))
y_clean=np.vstack((y[0:2, :], y[3:, :]))
m_clean = GPy.models.GPRegression(x_clean,y_clean)
_ = m_clean.optimize()
</code></pre></div></div>
<p>Data is fine for answering very specific questions, like “Who won the Olympic Marathon in 2012?”, because we have that answer stored, however, we are not given the answer to many other questions. For example, Alan Turing was a formidable marathon runner; in 1946 he ran a time of 2 hours 46 minutes (just under four minutes per kilometer, faster than I and most of the other <a href="http://www.parkrun.org.uk/sheffieldhallam/">Endcliffe Park Run</a> runners can manage over 5 km). What is the probability he would have won an Olympics if one had been held in 1946?</p>
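<p>One way to pose that question against a fitted model is to ask how much posterior probability mass lies above Turing’s pace. A hedged sketch, where the posterior mean and standard deviation at 1946 are assumed values standing in for what the fitted model would actually predict:</p>
<pre><code class="language-{.python}">from math import erf, sqrt

def norm_cdf(x, mu, sigma):
    """Cumulative distribution function of a Gaussian N(mu, sigma^2)."""
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))

# Turing's 1946 marathon: 2 hours 46 minutes over 42.195 km.
turing_pace = (2 * 60 + 46) / 42.195   # pace in min/km, just under 4

# Assumed posterior for the winning pace in 1946, purely for illustration;
# the real values would come from the fitted model's prediction at 1946.
mu, sigma = 3.5, 0.25
p_turing_wins = 1.0 - norm_cdf(turing_pace, mu, sigma)
</code></pre>
<p>Under a Gaussian posterior this reduces the question to a tail probability; the numbers here are illustrative, not the model’s answer.</p>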
<p>{<img class="" src="./slides/diagrams/turing_run.jpg" width="40%" align="" style="background:none; border:none; box-shadow:none;" />}{<img class="50%" src="./slides/diagrams/turing-times.gif" width="50%" align="" style="background:none; border:none; box-shadow:none;" /></p>
<p><em>Alan Turing, in 1946 he was only 11 minutes slower than the winner of the 1948 games. Would he have won a hypothetical games held in 1946? Source: <a href="http://www.turing.org.uk/scrapbook/run.html">Alan Turing Internet Scrapbook</a>.</em></p>
<h3 id="deep-gp-fit">Deep GP Fit</h3>
<p>Let’s see if a deep Gaussian process can help here. We will construct a deep Gaussian process with one hidden layer (i.e. one Gaussian process feeding into another).</p>
<p>Build a Deep GP with an additional hidden layer (one dimensional) to fit the model.</p>
<pre><code class="language-{.python}">import deepgp  # deep GP implementation built on GPy

hidden = 1
m = deepgp.DeepGP([y.shape[1],hidden,x.shape[1]],Y=yhat, X=x, inits=['PCA','PCA'],
                  kernels=[GPy.kern.RBF(hidden,ARD=True),
                           GPy.kern.RBF(x.shape[1],ARD=True)], # the kernels for each layer
                  num_inducing=50, back_constraint=False)
</code></pre>
<p>Deep Gaussian process models can also require some thought in initialization. Here we choose to start by setting the noise variance to be one percent of the data variance.</p>
<p>Optimization requires moving the variational parameters in the hidden layer, which represent the mean and variance of the expected values in that layer. All those values can be scaled up, resulting only in a downscaling of the output of the first GP and a downscaling of the input length scale to the second GP. It therefore makes sense to first fix the scales of the covariance function in each of the GPs.</p>
<p>Sometimes deep Gaussian processes can find a local minimum which involves increasing the noise level of one or more of the GPs. This often occurs because it allows a minimum in the KL divergence term in the lower bound on the likelihood. To avoid this minimum we habitually train for some iterations with the likelihood variance (the noise on the output of the GP) fixed to some lower value.</p>
<p>Let’s create a helper function to initialize the models we use in the notebook.</p>
<pre><code class="language-{.python}">def initialize(self, noise_factor=0.01, linear_factor=1):
    """Helper function for deep model initialization."""
    self.obslayer.likelihood.variance = self.Y.var()*noise_factor
    for layer in self.layers:
        if type(layer.X) is GPy.core.parameterization.variational.NormalPosterior:
            if layer.kern.ARD:
                var = layer.X.mean.var(0)
            else:
                var = layer.X.mean.var()
        else:
            if layer.kern.ARD:
                var = layer.X.var(0)
            else:
                var = layer.X.var()
        # Average 0.5 upcrossings in four standard deviations.
        layer.kern.lengthscale = linear_factor*np.sqrt(layer.kern.input_dim)*2*4*np.sqrt(var)/(2*np.pi)
# Bind the new method to the Deep GP object.
deepgp.DeepGP.initialize=initialize
</code></pre>
<pre><code class="language-{.python}"># Call the initialization
m.initialize()
</code></pre>
<p>Now optimize the model. The first stage of optimization is working on variational parameters and lengthscales only.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>m.optimize(messages=False,max_iters=100)
</code></pre></div></div>
<p>Now we remove the constraints on the scale of the covariance functions associated with each GP and optimize again.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>for layer in m.layers:
    pass #layer.kern.variance.constrain_positive(warning=False)
m.obslayer.kern.variance.constrain_positive(warning=False)
m.optimize(messages=False,max_iters=100)
</code></pre></div></div>
<p>Finally, we allow the noise variance to change and optimize for a large number of iterations.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>for layer in m.layers:
    layer.likelihood.variance.constrain_positive(warning=False)
m.optimize(messages=True,max_iters=10000)
</code></pre></div></div>
<p>For our optimization process we define a new function.</p>
<pre><code class="language-{.python}">def staged_optimize(self, iters=(1000,1000,10000), messages=(False, False, True)):
    """Optimize with parameters constrained and then with parameters released"""
    for layer in self.layers:
        # Fix the scale of each of the covariance functions.
        layer.kern.variance.fix(warning=False)
        layer.kern.lengthscale.fix(warning=False)
        # Fix the variance of the noise in each layer.
        layer.likelihood.variance.fix(warning=False)
    self.optimize(messages=messages[0],max_iters=iters[0])

    for layer in self.layers:
        layer.kern.lengthscale.constrain_positive(warning=False)
    self.obslayer.kern.variance.constrain_positive(warning=False)
    self.optimize(messages=messages[1],max_iters=iters[1])

    for layer in self.layers:
        layer.kern.variance.constrain_positive(warning=False)
        layer.likelihood.variance.constrain_positive(warning=False)
    self.optimize(messages=messages[2],max_iters=iters[2])
# Bind the new method to the Deep GP object.
deepgp.DeepGP.staged_optimize=staged_optimize
</code></pre>
<pre><code class="language-{.python}">m.staged_optimize(messages=(True,True,True))
</code></pre>
<h3 id="plot-the-prediction">Plot the prediction</h3>
<p>The prediction of the deep GP can be extracted in a similar way to the normal GP. In this case, though, it is an approximation to the true distribution, because the true distribution is not Gaussian.</p>
<pre><code class="language-{.python}">fig, ax = plt.subplots(figsize=plot.big_wide_figsize)
plot.model_output(m, scale=scale, offset=offset, ax=ax, xlabel='year', ylabel='pace min/km',
                  fontsize=20, portion=0.2)
ax.set_xlim(xlim)
ax.set_ylim(ylim)
mlai.write_figure(figure=fig, filename='./slides/diagrams/deepgp/olympic-marathon-deep-gp.svg',
                  transparent=True, frameon=True)
</code></pre>
<h3 id="olympic-marathon-data-deep-gp">Olympic Marathon Data Deep GP</h3>
<object class="svgplot" align="" data="./slides/diagrams/deepgp/olympic-marathon-deep-gp.svg"></object>
<pre><code class="language-{.python}">def posterior_sample(self, X, **kwargs):
    """Give a sample from the posterior of the deep GP."""
    Z = X
    for i, layer in enumerate(reversed(self.layers)):
        Z = layer.posterior_samples(Z, size=1, **kwargs)[:, :, 0]
    return Z
deepgp.DeepGP.posterior_sample = posterior_sample
</code></pre>
<pre><code class="language-{.python}">fig, ax = plt.subplots(figsize=plot.big_wide_figsize)
plot.model_sample(m, scale=scale, offset=offset, samps=10, ax=ax,
                  xlabel='year', ylabel='pace min/km', portion = 0.225)
ax.set_xlim(xlim)
ax.set_ylim(ylim)
mlai.write_figure(figure=fig, filename='./slides/diagrams/deepgp/olympic-marathon-deep-gp-samples.svg',
                  transparent=True, frameon=True)
</code></pre>
<h3 id="olympic-marathon-data-deep-gp-data-transitionnone">Olympic Marathon Data Deep GP</h3>
<object class="svgplot" align="" data="./slides/diagrams/deepgp/olympic-marathon-deep-gp-samples.svg"></object>
<h3 id="fitted-gp-for-each-layer">Fitted GP for each layer</h3>
<p>Now we explore the GPs the model has used to fit each layer. First of all, we look at the hidden layer.</p>
<pre><code class="language-{.python}">import os
import mlai
from mlai import gpplot  # GP credible-region plotting helper from mlai.py

def visualize(self, scale=1.0, offset=0.0, xlabel='input', ylabel='output',
              xlim=None, ylim=None, fontsize=20, portion=0.2, dataset=None,
              diagrams='./diagrams'):
    """Visualize the layers in a deep GP with one-d input and output."""
    depth = len(self.layers)
    if dataset is None:
        fname = 'deep-gp-layer'
    else:
        fname = dataset + '-deep-gp-layer'
    filename = os.path.join(diagrams, fname)
    last_name = xlabel
    last_x = self.X
    for i, layer in enumerate(reversed(self.layers)):
        if i>0:
            plt.plot(last_x, layer.X.mean, 'r.',markersize=10)
            last_x=layer.X.mean
            ax=plt.gca()
            name = 'layer ' + str(i)
            plt.xlabel(last_name, fontsize=fontsize)
            plt.ylabel(name, fontsize=fontsize)
            last_name=name
            mlai.write_figure(filename=filename + '-' + str(i-1) + '.svg',
                              transparent=True, frameon=True)
        if i==0 and xlim is not None:
            xt = plot.pred_range(np.array(xlim), portion=0.0)
        elif i>0:
            xt = plot.pred_range(np.array(next_lim), portion=0.0)
        else:
            xt = plot.pred_range(last_x, portion=portion)
        yt_mean, yt_var = layer.predict(xt)
        if layer==self.obslayer:
            yt_mean = yt_mean*scale + offset
            yt_var *= scale*scale
        yt_sd = np.sqrt(yt_var)
        gpplot(xt,yt_mean,yt_mean-2*yt_sd,yt_mean+2*yt_sd)
        ax = plt.gca()
        if i>0:
            ax.set_xlim(next_lim)
        elif xlim is not None:
            ax.set_xlim(xlim)
        next_lim = plt.gca().get_ylim()
    plt.plot(last_x, self.Y*scale + offset, 'r.',markersize=10)
    plt.xlabel(last_name, fontsize=fontsize)
    plt.ylabel(ylabel, fontsize=fontsize)
    mlai.write_figure(filename=filename + '-' + str(i) + '.svg',
                      transparent=True, frameon=True)
    if ylim is not None:
        ax=plt.gca()
        ax.set_ylim(ylim)
# Bind the new method to the Deep GP object.
deepgp.DeepGP.visualize=visualize
</code></pre>
<pre><code class="language-{.python}">m.visualize(scale=scale, offset=offset, xlabel='year',
            ylabel='pace min/km', xlim=xlim, ylim=ylim,
            dataset='olympic-marathon',
            diagrams='./slides/diagrams/deepgp')
</code></pre>
<pre><code class="language-{.python}">def scale_data(x, portion):
    scale = (x.max()-x.min())/(1-2*portion)
    offset = x.min() - portion*scale
    return (x-offset)/scale, scale, offset

def visualize_pinball(self, ax=None, scale=1.0, offset=0.0, xlabel='input', ylabel='output',
                      xlim=None, ylim=None, fontsize=20, portion=0.2, points=50, vertical=True):
    """Visualize the layers in a deep GP with one-d input and output."""
    if ax is None:
        fig, ax = plt.subplots(figsize=plot.big_wide_figsize)
    depth = len(self.layers)
    last_name = xlabel
    last_x = self.X

    # Recover input and output scales from output plot
    plot.model_output(self, scale=scale, offset=offset, ax=ax,
                      xlabel=xlabel, ylabel=ylabel,
                      fontsize=fontsize, portion=portion)
    xlim=ax.get_xlim()
    xticks=ax.get_xticks()
    xtick_labels=ax.get_xticklabels().copy()
    ylim=ax.get_ylim()
    yticks=ax.get_yticks()
    ytick_labels=ax.get_yticklabels().copy()

    # Clear axes and start again
    ax.cla()
    if vertical:
        ax.set_xlim((0, 1))
        ax.invert_yaxis()
        ax.set_ylim((depth, 0))
    else:
        ax.set_ylim((0, 1))
        ax.set_xlim((0, depth))
    ax.set_axis_off()

    def pinball(x, y, y_std, color_scale=None,
                layer=0, depth=1, ax=None,
                alpha=1.0, portion=0.0, vertical=True):
        scaledx, xscale, xoffset = scale_data(x, portion=portion)
        scaledy, yscale, yoffset = scale_data(y, portion=portion)
        y_std /= yscale

        # Check whether data is anti-correlated on output
        if np.dot((scaledx-0.5).T, (scaledy-0.5))<0:
            scaledy=1-scaledy
            flip=-1
        else:
            flip=1

        if color_scale is not None:
            color_scale, _, _=scale_data(color_scale, portion=0)
        scaledy = (1-alpha)*scaledx + alpha*scaledy

        def color_value(x, cmap=None, width=None, centers=None):
            """Return color as a function of position along x axis"""
            if cmap is None:
                cmap = np.asarray([[1, 0, 0], [1, 1, 0], [0, 1, 0]])
            ncenters = cmap.shape[0]
            if centers is None:
                centers = np.linspace(0+0.5/ncenters, 1-0.5/ncenters, ncenters)
            if width is None:
                width = 0.25/ncenters
            r = (x-centers)/width
            weights = np.exp(-0.5*r*r).flatten()
            weights /= weights.sum()
            weights = weights[:, np.newaxis]
            return np.dot(cmap.T, weights).flatten()

        for i in range(x.shape[0]):
            if color_scale is not None:
                color = color_value(color_scale[i])
            else:
                color=(1, 0, 0)
            x_plot = np.asarray((scaledx[i], scaledy[i])).flatten()
            y_plot = np.asarray((layer, layer+alpha)).flatten()
            if vertical:
                ax.plot(x_plot, y_plot, color=color, alpha=0.5, linewidth=3)
                ax.plot(x_plot, y_plot, color='k', alpha=0.5, linewidth=0.5)
            else:
                ax.plot(y_plot, x_plot, color=color, alpha=0.5, linewidth=3)
                ax.plot(y_plot, x_plot, color='k', alpha=0.5, linewidth=0.5)

            # Plot error bars that increase as sqrt of distance from start.
            std_points = 50
            stdy = np.linspace(0, alpha, std_points)
            stdx = np.sqrt(stdy)*y_std[i]
            stdy += layer
            mean_vals = np.linspace(scaledx[i], scaledy[i], std_points)
            upper = mean_vals+stdx
            lower = mean_vals-stdx
            fillcolor=color
            x_errorbars=np.hstack((upper,lower[::-1]))
            y_errorbars=np.hstack((stdy,stdy[::-1]))
            # Marker to show end point
            if vertical:
                ax.fill(x_errorbars,y_errorbars,
                        color=fillcolor, alpha=0.1)
                ax.plot(scaledy[i], layer+alpha, '.',markersize=10, color=color, alpha=0.5)
            else:
                ax.fill(y_errorbars,x_errorbars,
                        color=fillcolor, alpha=0.1)
                ax.plot(layer+alpha, scaledy[i], '.',markersize=10, color=color, alpha=0.5)
        return flip

    # Whether final axis is flipped
    flip = 1
    first_x=last_x
    for i, layer in enumerate(reversed(self.layers)):
        if i==0:
            xt = plot.pred_range(last_x, portion=portion, points=points)
            color_scale=xt
        yt_mean, yt_var = layer.predict(xt)
        if layer==self.obslayer:
            yt_mean = yt_mean*scale + offset
            yt_var *= scale*scale
        yt_sd = np.sqrt(yt_var)
        flip = flip*pinball(xt,yt_mean,yt_sd,color_scale,portion=portion,
                            layer=i, ax=ax, depth=depth, vertical=vertical)
        xt = yt_mean

    # Make room for axis labels
    if vertical:
        ax.set_ylim((2.1, -0.1))
        ax.set_xlim((-0.02, 1.02))
        ax.set_yticks(range(depth,0,-1))
    else:
        ax.set_xlim((-0.1, 2.1))
        ax.set_ylim((-0.02, 1.02))
        ax.set_xticks(range(0, depth))

    def draw_axis(ax, scale=1.0, offset=0.0, level=0.0, flip=1,
                  label=None, up=False, nticks=10, ticklength=0.05,
                  vertical=True,
                  fontsize=20):
        def clean_gap(gap):
            nsf = np.log10(gap)
            if nsf>0:
                nsf = np.ceil(nsf)
            else:
                nsf = np.floor(nsf)
            lower_gaps = np.asarray([0.005, 0.01, 0.02, 0.03, 0.04, 0.05,
                                     0.1, 0.25, 0.5,
                                     1, 1.5, 2, 2.5, 3, 4, 5, 10, 25, 50, 100])
            upper_gaps = np.asarray([1, 2, 3, 4, 5, 10])
            if nsf>2 or nsf<-2:
                d = np.abs(gap-upper_gaps*10**nsf)
                ind = np.argmin(d)
                return upper_gaps[ind]*10**nsf
            else:
                d = np.abs(gap-lower_gaps)
                ind = np.argmin(d)
                return lower_gaps[ind]

        tickgap = clean_gap(scale/(nticks-1))
        nticks = int(scale/tickgap) + 1
        tickstart = np.round(offset/tickgap)*tickgap
        ticklabels = np.asarray(range(0, nticks))*tickgap + tickstart
        ticks = (ticklabels-offset)/scale
        axargs = {'color':'k', 'linewidth':1}

        if not up:
            ticklength=-ticklength
        tickspot = np.linspace(0, 1, nticks)
        if flip < 0:
            ticks = 1-ticks
        for tick, ticklabel in zip(ticks, ticklabels):
            if vertical:
                ax.plot([tick, tick], [level, level-ticklength], **axargs)
                ax.text(tick, level-ticklength*2, ticklabel, horizontalalignment='center',
                        fontsize=fontsize/2)
                ax.text(0.5, level-5*ticklength, label, horizontalalignment='center', fontsize=fontsize)
            else:
                ax.plot([level, level-ticklength], [tick, tick], **axargs)
                ax.text(level-ticklength*2, tick, ticklabel, horizontalalignment='center',
                        fontsize=fontsize/2)
                ax.text(level-5*ticklength, 0.5, label, horizontalalignment='center', fontsize=fontsize)

        if vertical:
            xlim = list(ax.get_xlim())
            if ticks.min()<xlim[0]:
                xlim[0] = ticks.min()
            if ticks.max()>xlim[1]:
                xlim[1] = ticks.max()
            ax.set_xlim(xlim)
            ax.plot([ticks.min(), ticks.max()], [level, level], **axargs)
        else:
            ylim = list(ax.get_ylim())
            if ticks.min()<ylim[0]:
                ylim[0] = ticks.min()
            if ticks.max()>ylim[1]:
                ylim[1] = ticks.max()
            ax.set_ylim(ylim)
            ax.plot([level, level], [ticks.min(), ticks.max()], **axargs)

    _, xscale, xoffset = scale_data(first_x, portion=portion)
    _, yscale, yoffset = scale_data(yt_mean, portion=portion)
    draw_axis(ax=ax, scale=xscale, offset=xoffset, level=0.0, label=xlabel,
              up=True, vertical=vertical)
    draw_axis(ax=ax, scale=yscale, offset=yoffset,
              flip=flip, level=depth, label=ylabel, up=False, vertical=vertical)
# Bind the new method to the Deep GP object.
deepgp.DeepGP.visualize_pinball=visualize_pinball
</code></pre>
<pre><code class="language-{.python}">fig, ax = plt.subplots(figsize=plot.big_wide_figsize)
m.visualize_pinball(ax=ax, scale=scale, offset=offset, points=30, portion=0.1,
                    xlabel='year', ylabel='pace km/min', vertical=True)
mlai.write_figure(figure=fig, filename='./slides/diagrams/deepgp/olympic-marathon-deep-gp-pinball.svg',
                  transparent=True, frameon=True)
</code></pre>
<h3 id="olympic-marathon-pinball-plot">Olympic Marathon Pinball Plot</h3>
<object class="svgplot" align="" data="./slides/diagrams/deepgp/olympic-marathon-deep-gp-pinball.svg"></object>
<p>The pinball plot shows the flow of any input ball through the deep Gaussian process. In a pinball plot a series of vertical parallel lines would indicate a purely linear function. For the Olympic marathon data we can see the first layer begins to shift the input towards the right. Note it also does so with some uncertainty (indicated by the shaded backgrounds). The second layer has less uncertainty, but bunches the inputs more strongly to the right. This input layer of uncertainty, followed by a layer that pushes inputs to the right, is what gives the heteroscedastic noise.</p>
<!-- ### Data Science
* Industrial Revolution 4.0?
* *Industrial Revolution* (1760-1840) term coined by Arnold Toynbee,
late 19th century.
* Maybe: But this one is dominated by *data* not *capital*
* That presents *challenges* and *opportunities*
compare [digital oligarchy](https://www.theguardian.com/media-network/2015/mar/05/digital-oligarchy-algorithms-personal-data) vs [how Africa can benefit from the data revolution](https://www.theguardian.com/media-network/2015/aug/25/africa-benefit-data-science-information)
* Apple vs Nokia: How you handle disruption.
Disruptive technologies take time to assimilate, and best practices, as well as the pitfalls of new technologies take time to share. Historically, new technologies led to new professions. [Isambard Kingdom Brunel](https://en.wikipedia.org/wiki/Isambard_Kingdom_Brunel) (born 1806) was a leading innovator in civil, mechanical and naval engineering. Each of these has its own professional institutions founded in 1818, 1847, and 1860 respectively.
[Nikola Tesla](https://en.wikipedia.org/wiki/Nikola_Tesla) developed the modern approach to electrical distribution, he was born in 1856 and the American Instiute for Electrical Engineers was founded in 1884, the UK equivalent was founded in 1871.
[William Schockley Jr](https://en.wikipedia.org/wiki/William_Shockley), born 1910, led the group that developed the transistor, referred to as "the man who brought silicon to Silicon Valley", in 1963 the American Institute for Electical Engineers merged with the Institute of Radio Engineers to form the Institute of Electrical and Electronic Engineers.
[Watts S. Humphrey](https://en.wikipedia.org/wiki/Watts_Humphrey), born 1927, was known as the "father of software quality", in the 1980s he founded a program aimed at understanding and managing the software process. The British Computer Society was founded in 1956.
Why the need for these professions? Much of it is about codification of best practice and developing trust between the public and practitioners. These fundamental characteristics of the professions are shared with the oldest professions (Medicine, Law) as well as the newest (Information Technology).
So where are we today? My best guess is we are somewhere equivalent to the 1980s for Software Engineering. In terms of professional deployment we have a basic understanding of the equivalent of "programming" but much less understanding of *machine learning systems design* and *data infrastructure*. How the components we ahve developed interoperate together in a reliable and accountable manner. Best practice is still evolving, but perhaps isn't being shared widely enough.
One problem is that the art of data science is superficially similar to regularl software engineering. Although in practice it is rather different. Modern software engineering practice operates to generate code which is well tested as it is written, agile programming techniques provide the appropriate degree of flexibility for the individual programmers alongside sufficient formalization and testing. These techniques have evolved from an overly restrictive formalization that was proposed in the early days of software engineering.
While data science involves programming, it is different in the following way. Most of the work in data science involves understanding the data and the appropriate manipulations to apply to extract knowledge from the data. The eventual number of lines of code that are required to extract that knowledge are often very few, but the amount of thought and attention that needs to be applied to each line is much more than a traditional line of software code. Testing of those lines is also of a different nature, provisions have to be made for evolving data environments. Any development work is often done on a static snapshot of data, but deployment is made in a live environment where the nature of data changes. Quality control involves checking for degradation in performance arising form unanticipated changes in data quality. It may also need to check for regulatory conformity. For example, in the UK the General Data Protection Regulation stipulates standards of explainability and fairness that may need to be monitored. These concerns do not affect traditional software deployments.
Others are also pointing out these challenges, [this post](https://medium.com/@karpathy/software-2-0-a64152b37c35) from Andrej Karpathy (now head of AI at Tesla) covers the notion of "Software 2.0". Google researchers have highlighted the challenges of "Technical Debt" in machine learning [@Sculley:debt15]. Researchers at Berkeley have characterized the systems challenges associated with machine learning [@Stoica:systemsml17].
-->
<!-- Data science is not only about technical expertise and analysis of data, we need to also generate a culture of decision making that acknowledges the true challenges in data-driven automated decision making. In particular, a focus on algorithms has neglected the importance of data in driving decisions. The quality of data is paramount in that poor quality data will inevitably lead to poor quality decisions. Anecdotally most data scientists will suggest that 80% of their time is spent on data clean up, and only 20% on actually modelling.
### The Software Crisis
>The major cause of the software crisis is that the machines have
>become several orders of magnitude more powerful! To put it quite
>bluntly: as long as there were no machines, programming was no problem
>at all; when we had a few weak computers, programming became a mild
>problem, and now we have gigantic computers, programming has become an
>equally gigantic problem.
>
> Edsger Dijkstra (1930-2002), The Humble Programmer
In the late sixties early software programmers made note of the increasing costs of software development and termed the associated challenges the "[Software Crisis](https://en.wikipedia.org/wiki/Software_crisis)". Edsger Dijkstra referred to the crisis in his 1972 Turing Award winner's address.
### The Data Crisis
>The major cause of the data crisis is that machines have become more
>interconnected than ever before. Data access is therefore cheap, but
>data quality is often poor. What we need is cheap high quality
>data. That implies that we develop processes for improving and
>verifying data quality that are efficient.
>
>There would seem to be two ways for improving efficiency. Firstly, we
>should not duplicate work. Secondly, where possible we should automate
>work.
What I term "The Data Crisis" is the modern equivalent of this problem. The quantity of modern data, the lack of attention paid to data as it is initially "laid down", and the costs of data cleaning are bringing about a crisis in data-driven decision making. Just as with software, the crisis is most correctly addressed by 'scaling' the manner in which we process our data. Duplication of work occurs because the value of data cleaning is not correctly recognised in management decision making processes. Automation of work is increasingly possible through techniques in "artificial intelligence", but this will also require better management of the data science pipeline so that data about data science (meta-data science) can be correctly assimilated and processed. The Alan Turing Institute has a programme focussed on this area, [AI for Data Analytics](https://www.turing.ac.uk/research_projects/artificial-intelligence-data-analytics/).
-->
<!-- ### Rest of this Talk: Two Areas of Focus -->
<!-- * Reusability of Data -->
<!-- * Deployment of Machine Learning Systems -->
<!-- ### Rest of this Talk: Two Areas of Focus -->
<!-- * <s>Reusability of Data</s> -->
<!-- * Deployment of Machine Learning Systems -->
<!--### Data Readiness Levels
[<img class="" src="./slides/diagrams/data-science/data-readiness-levels.png" width="" align="" style="background:none; border:none; box-shadow:none;">](https://arxiv.org/pdf/1705.02245.pdf)
[Data Readiness Levels](http://inverseprobability.com/2017/01/12/data-readiness-levels)
Data readiness levels are an attempt to develop a language around data quality that can bridge the gap between technical solutions and decision makers such as managers and project planners. They are inspired by Technology Readiness Levels, which attempt to quantify the readiness of technologies for deployment.
### Three Grades of Data Readiness:
Data-readiness describes, at its coarsest level, three separate stages of data graduation.
* Grade C - accessibility
* Grade B - validity
* Grade A - usability
### Accessibility: Grade C
The first grade refers to the accessibility of data. Most data science practitioners will be used to working with data-providers who, perhaps having had little experience of data-science before, state that they "have the data". More often than not, they have not verified this. A convenient term for this is "Hearsay Data", someone has *heard* that they have the data so they *say* they have it. This is the lowest grade of data readiness.
Progressing through Grade C involves ensuring that this data is accessible. Not just in terms of digital accessibility, but also for regulatory, ethical and intellectual property reasons.
### Validity: Grade B
Data transits from Grade C to Grade B once we can begin digital analysis on the computer. Once the challenges of access to the data have been resolved, we can make the data available either via API, or for direct loading into analysis software (such as Python, R, Matlab, Mathematica or SPSS). Once this has occurred the data is at B4 level. Grade B involves the *validity* of the data. Does the data really represent what it purports to? There are challenges such as missing values, outliers and record duplication. Each of these needs to be investigated.
Grades B and C are important because, if the work done in these grades is documented well, it can be reused in other projects. Reuse of this labour is key to reducing the costs of data-driven automated decision making. There is a strong overlap between the work required in these grades and the statistical field of [*exploratory data analysis*](https://en.wikipedia.org/wiki/Exploratory_data_analysis) [@Tukey:exploratory77].
### Usability: Grade A
Once the validity of the data is determined, the data set can be considered for use in a particular task. This stage of data readiness is more akin to what machine learning scientists are used to doing in Universities. Bringing an algorithm to bear on a well understood data set.
In Grade A we are concerned about the utility of the data given a particular task. Grade A may involve additional data collection (experimental design in statistics) to ensure that the task is fulfilled.
This is the stage where the data and the model are brought together, so expertise in learning algorithms and their application is key. Further ethical considerations, such as the fairness of the resulting predictions are required at this stage. At the end of this stage a prototype model is ready for deployment.
Deployment and maintenance of machine learning models in production is another important issue which Data Readiness Levels are only a part of the solution for.
To find out more, or to contribute ideas go to <http://data-readiness.org>
Throughout the data preparation pipeline, it is important to have close interaction between data scientists and application domain experts. Decisions on data preparation taken outside the context of application have dangerous downstream consequences. This provides an additional burden on the data scientist as they are required for each project, but it should also be seen as a learning and familiarization exercise for the domain expert. Long term, just as biologists have found it necessary to assimilate the skills of the bioinformatician to be effective in their science, most domains will also require a familiarity with the nature of data driven decision making and its application. Working closely with data-scientists on data preparation is one way to begin this sharing of best practice.
The processes involved in Grade C and B are often badly taught in courses on data science. Perhaps not due to a lack of interest in the areas, but maybe more due to a lack of access to real world examples where data quality is poor.
These stages of data science are also ridden with ambiguity. In the long term they could do with more formalization, and automation, but best practice needs to be understood by a wider community before that can happen.
-->
<h3 id="artificial-intelligence">Artificial Intelligence</h3>
<ul>
<li>
<p>Challenges in deploying AI.</p>
</li>
<li>
<p>Currently this is in the form of “machine learning systems”</p>
</li>
</ul>
<h3 id="internet-of-people">Internet of People</h3>
<ul>
<li>Fog computing: barrier between cloud and device blurring.
<ul>
<li>Computing on the Edge</li>
</ul>
</li>
<li>Complex feedback between algorithm and implementation</li>
</ul>
<h3 id="deploying-ml-in-real-world-machine-learning-systems-design">Deploying ML in Real World: Machine Learning Systems Design</h3>
<ul>
<li>
<p>Major new challenge for systems designers.</p>
</li>
<li>
<p>Internet of Intelligence but currently:</p>
<ul>
<li>AI systems are <em>fragile</em></li>
</ul>
</li>
</ul>
<h3 id="fragility-of-ai-systems">Fragility of AI Systems</h3>
<h3 id="pigeonholing">Pigeonholing</h3>
<p><img class="" src="./slides/diagrams/TooManyPigeons.jpg" width="60%" align="" style="background:none; border:none; box-shadow:none;" /></p>
<p>The way we are deploying artificial intelligence systems in practice is to build up systems of machine learning components. To build a machine learning system, we decompose the task into parts which we can emulate with ML methods. Each of these parts can be, typically, independently constructed and verified. For example, in a driverless car we can decompose the tasks into components such as “pedestrian detection” and “road line detection”. Each of these components can be constructed with, for example, an independent classifier. We can then superimpose a logic on top. For example, “Follow the road line unless you detect a pedestrian in the road”.</p>
<p>This allows for verification of car performance, as long as we can verify the individual components. However, it also implies that the AI systems we deploy are <em>fragile</em>.</p>
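<p>As an illustration, the decomposition above can be sketched in a few lines. This is a hypothetical toy, not a real driving stack: the detector functions below stand in for independently trained classifiers, and the logic layer is simply superimposed on top of them.</p>

```python
# Hypothetical sketch of a decomposed ML system: two independent "components"
# with a hand-written logic layer on top. The detectors are stand-ins for
# trained classifiers, not real models.

def pedestrian_detected(frame):
    # placeholder for an independently trained pedestrian classifier
    return frame.get("pedestrian", False)

def road_line_position(frame):
    # placeholder for a lane-detection component (offset from centre)
    return frame.get("line", 0.0)

def decide(frame):
    """Superimposed logic: follow the road line unless a pedestrian is detected."""
    if pedestrian_detected(frame):
        return "brake"
    return f"steer:{road_line_position(frame):.2f}"

print(decide({"pedestrian": True}))  # the pedestrian rule overrides steering
print(decide({"line": -0.3}))        # otherwise follow the line
```

Each component can be verified in isolation, but the composed system is only as robust as the hand-written logic that glues the components together.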
<h3 id="rapid-reimplementation">Rapid Reimplementation</h3>
<h3 id="early-ai">Early AI</h3>
<p><img class="rotateimg90" src="./slides/diagrams/2017-10-12 16.47.34.jpg" width="40%" align="" style="background:none; border:none; box-shadow:none;" /></p>
<h3 id="machine-learning-systems-design">Machine Learning Systems Design</h3>
<p><img class="" src="./slides/diagrams/SteamEngine_Boulton&Watt_1784_neg.png" width="50%" align="" style="background:none; border:none; box-shadow:none;" /></p>
<h3 id="adversaries">Adversaries</h3>
<ul>
<li>Stuxnet</li>
<li>Mischievous-Adversarial</li>
</ul>
<h3 id="turnaround-and-update">Turnaround And Update</h3>
<ul>
<li>There is a massive need for turn around and update</li>
<li>A redeploy of the entire system.
<ul>
<li>This involves changing the way we design and deploy.</li>
</ul>
</li>
<li>Interface between security engineering and machine learning.</li>
</ul>
<h3 id="peppercorns">Peppercorns</h3>
<ul>
<li>A new name for system failures which aren’t bugs.</li>
<li>Difference between finding a fly in your soup vs a peppercorn in
your soup.</li>
</ul>
<h3 id="uncertainty-quantification">Uncertainty Quantification</h3>
<ul>
<li>
<p>Deep nets are powerful approach to images, speech, language.</p>
</li>
<li>
<p>Proposal: Deep GPs may also be a great approach, but better to deploy according to natural strengths.</p>
</li>
</ul>
<h3 id="uncertainty-quantification-1">Uncertainty Quantification</h3>
<ul>
<li>
<p>Probabilistic numerics, surrogate modelling, emulation, and UQ.</p>
</li>
<li>
<p>Not a fan of AI as a term.</p>
</li>
<li>
<p>But we are faced with increasing amounts of <em>algorithmic decision making</em>.</p>
</li>
</ul>
<h3 id="ml-and-decision-making">ML and Decision Making</h3>
<ul>
<li>
<p>When trading off decisions: compute or acquire data?</p>
</li>
<li>
<p>There is a critical need for uncertainty.</p>
</li>
</ul>
<h3 id="uncertainty-quantification-2">Uncertainty Quantification</h3>
<blockquote>
<p>Uncertainty quantification (UQ) is the science of quantitative characterization and reduction of uncertainties in both computational and real world applications. It tries to determine how likely certain outcomes are if some aspects of the system are not exactly known.</p>
</blockquote>
<ul>
<li>Interaction between physical and virtual worlds of major interest for Amazon.</li>
</ul>
<p>We will illustrate different concepts of <a href="https://en.wikipedia.org/wiki/Uncertainty_quantification">Uncertainty Quantification</a> (UQ) and the role that Gaussian processes play in this field. Based on a simple simulator of a car moving between a valley and a mountain, we are going to illustrate the following concepts:</p>
<ul>
<li>
<p><strong>Systems emulation</strong>. Many real world decisions are based on simulations that can be computationally very demanding. We will show how simulators can be replaced by <em>emulators</em>: Gaussian process models fitted on a few simulations that can be used to replace the <em>simulator</em>. Emulators are cheap to compute, fast to run, and always provide ways to quantify how precise they are compared to the original simulator.</p>
</li>
<li>
<p><strong>Emulators in optimization problems</strong>. We will show how emulators can be used to optimize black-box functions that are expensive to evaluate. This field is also called Bayesian Optimization and has gained an increasing relevance in machine learning as emulators can be used to optimize computer simulations (and machine learning algorithms) quite efficiently.</p>
</li>
<li>
<p><strong>Multi-fidelity emulation methods</strong>. In many scenarios we have simulators of different quality about the same measure of interest. In these cases the goal is to merge all sources of information under the same model so the final emulator is cheaper and more accurate than an emulator fitted only using data from the most accurate and expensive simulator.</p>
</li>
</ul>
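<p>To make the emulation idea concrete, here is a minimal sketch, assuming a cheap toy "simulator" and a hand-rolled RBF-kernel Gaussian process regression in numpy. A real emulator (e.g. built with GPy or GPyOpt) would also learn its kernel hyperparameters from the data.</p>

```python
import numpy as np

# Minimal emulation sketch: fit a GP on a handful of "expensive" simulator
# runs and use its predictive mean/variance in place of the simulator.

def simulator(x):
    """Stand-in for an expensive simulation."""
    return np.sin(3 * x)

def rbf(a, b, lengthscale=0.5):
    return np.exp(-0.5 * (a[:, None] - b[None, :])**2 / lengthscale**2)

X = np.linspace(0, 2, 6)   # a few expensive simulator calls
y = simulator(X)

def emulate(x_star, noise=1e-6):
    """GP predictive mean and variance at x_star, conditioned on (X, y)."""
    K = rbf(X, X) + noise * np.eye(len(X))
    K_s = rbf(np.atleast_1d(x_star), X)
    mean = K_s @ np.linalg.solve(K, y)
    # k(x, x) = 1 for the RBF kernel, so the predictive variance is:
    var = 1.0 - np.einsum('ij,ji->i', K_s, np.linalg.solve(K, K_s.T))
    return mean, var

mean, var = emulate(np.array([0.5, 1.5]))  # cheap predictions with uncertainty
```

The predictive variance is what distinguishes an emulator from a plain interpolator: it quantifies how far the emulator can be trusted away from the simulations it was fitted on.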
<h3 id="example-formula-one-racing">Example: Formula One Racing</h3>
<ul>
<li>
<p>Designing an F1 Car requires CFD, Wind Tunnel, Track Testing etc.</p>
</li>
<li>
<p>How to combine them?</p>
</li>
</ul>
<h3 id="mountain-car-simulator">Mountain Car Simulator</h3>
<p>To illustrate the above mentioned concepts we use the <a href="https://github.com/openai/gym/wiki/MountainCarContinuous-v0">mountain car simulator</a>. This simulator is widely used in machine learning to test reinforcement learning algorithms. The goal is to define a control policy on a car whose objective is to climb a mountain. Graphically, the problem looks as follows:</p>
<p><img class="" src="./slides/diagrams/uq/mountaincar.png" width="" align="" style="background:none; border:none; box-shadow:none;" /></p>
<p>The goal is to define a sequence of actions (push the car right or left with certain intensity) to make the car reach the flag after a number $T$ of time steps.</p>
<p>At each time step $t$, the car is characterized by a vector $\inputVector_{t} = (p_t,v_t)$ of states which are respectively the position and velocity of the car at time $t$. For a sequence of states (an episode), the dynamics of the car are given by</p>
<script type="math/tex; mode=display">\inputVector_{t+1} = \mappingFunction(\inputVector_{t},\textbf{u}_{t})</script>
<p>where $\textbf{u}_{t}$ is the value of an action force, which in this example corresponds to pushing the car to the left (negative value) or to the right (positive value). The actions across a full episode are represented in a policy $\textbf{u}_{t} = \pi(\inputVector_{t},\theta)$ that acts according to the current state of the car and some parameters $\theta$. In the following examples we will assume that the policy is linear, which allows us to write $\pi(\inputVector_{t},\theta)$ as</p>
<script type="math/tex; mode=display">\pi(\inputVector,\theta)= \theta_0 + \theta_p p + \theta_vv.</script>
<p>For $t=1,\dots,T$, given some initial state $\inputVector_{0}$ and some values of each $\textbf{u}_{t}$, we can <strong>simulate</strong> the full dynamics of the car for a full episode using <a href="https://gym.openai.com/envs/">Gym</a>. The values of $\textbf{u}_{t}$ are fully determined by the parameters of the linear controller.</p>
<p>After each episode of length $T$ is complete, a reward function $R_{T}(\theta)$ is computed. In the mountain car example the reward is 100 for reaching the target on the hill on the right hand side, minus the squared sum of actions (a negative real to push to the left and a positive real to push to the right) from start to goal. Note that our reward depends on $\theta$ as we make it dependent on the parameters of the linear controller.</p>
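<p>The linear policy and the episode reward described above can be sketched as follows. This is an illustrative reading of the description, not the Gym implementation; in particular, "squared sum of actions" is read here as the sum of squared actions.</p>

```python
import numpy as np

# Sketch of the linear controller pi(x, theta) = theta_0 + theta_p * p + theta_v * v
# and the episode reward R_T(theta). Both are illustrative, not Gym's code.

def policy(state, theta):
    """Linear controller: u = theta_0 + theta_p * p + theta_v * v."""
    p, v = state
    theta_0, theta_p, theta_v = theta
    return theta_0 + theta_p * p + theta_v * v

def episode_reward(actions, reached_goal):
    """100 for reaching the flag, minus the sum of squared actions."""
    actions = np.asarray(actions, dtype=float)
    return (100.0 if reached_goal else 0.0) - np.sum(actions**2)

u = policy((0.5, 0.02), (0.1, 1.0, 10.0))           # one control value
r = episode_reward([0.5, -0.5], reached_goal=True)  # episode score
```

The action penalty means that among controllers that reach the flag, the optimizer prefers those that push as little as possible.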
<h3 id="emulate-the-mountain-car">Emulate the Mountain Car</h3>
<pre><code class="language-{.python}">import gym
</code></pre>
<pre><code class="language-{.python}">env = gym.make('MountainCarContinuous-v0')
</code></pre>
<p>Our goal in this section is to find the parameters $\theta$ of the linear controller such that</p>
<script type="math/tex; mode=display">\theta^* = \arg \max_{\theta} R_T(\theta).</script>
<p>In this section, we directly use Bayesian optimization to solve this problem. We will use <a href="https://sheffieldml.github.io/GPyOpt/">GPyOpt</a> so we first define the objective function:</p>
<pre><code class="language-{.python}">import mountain_car as mc
import GPyOpt
</code></pre>
<pre><code class="language-{.python}">obj_func = lambda x: mc.run_simulation(env, x)[0]
objective = GPyOpt.core.task.SingleObjective(obj_func)
</code></pre>
<p>For each set of parameter values of the linear controller we can run an episode of the simulator (that we fix to have a horizon of $T=500$) to generate the reward. Using as input the parameters of the controller and as outputs the rewards we can build a Gaussian process emulator of the reward.</p>
<p>We start by defining the input space, which is three-dimensional:</p>
<pre><code class="language-{.python}">## --- We define the input space of the emulator
space = [{'name':'position_parameter', 'type':'continuous', 'domain':(-1.2, +1)},
         {'name':'velocity_parameter', 'type':'continuous', 'domain':(-1/0.07, +1/0.07)},
         {'name':'constant', 'type':'continuous', 'domain':(-1, +1)}]
design_space = GPyOpt.Design_space(space=space)
</code></pre>
<p>Now we initialize a Gaussian process emulator.</p>
<pre><code class="language-{.python}">model = GPyOpt.models.GPModel(optimize_restarts=5, verbose=False, exact_feval=True, ARD=True)
</code></pre>
<p>In Bayesian optimization an acquisition function is used to balance exploration and exploitation while evaluating new locations close to the optimum of the objective. In this notebook we select the expected improvement (EI). For further details have a look at the review paper of <a href="http://www.cs.ox.ac.uk/people/nando.defreitas/publications/BayesOptLoop.pdf">Shahriari et al (2015)</a>.</p>
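<p>For reference, the EI acquisition for a maximization problem can be computed directly from the model's predictive mean and standard deviation at a candidate point. This sketch follows the standard formula from the literature; it is not GPyOpt's internal implementation.</p>

```python
import math

# Expected improvement for maximization: EI(x) = (mu - f*) * Phi(z) + sigma * phi(z),
# with z = (mu - f*) / sigma, where f* is the best observed value.

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)

def norm_cdf(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def expected_improvement(mean, std, best_so_far):
    if std == 0:
        return 0.0  # no uncertainty left: nothing to explore here
    z = (mean - best_so_far) / std
    return (mean - best_so_far) * norm_cdf(z) + std * norm_pdf(z)

# EI is tiny where the model is confident and below the incumbent,
# larger where the mean is high or the uncertainty is large.
low = expected_improvement(mean=0.0, std=0.1, best_so_far=1.0)
high = expected_improvement(mean=1.2, std=0.5, best_so_far=1.0)
```

At each iteration the optimizer evaluates the objective at the candidate maximizing EI, which is what trades off exploring uncertain regions against exploiting the current best estimate.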
<pre><code class="language-{.python}">aquisition_optimizer = GPyOpt.optimization.AcquisitionOptimizer(design_space)
acquisition = GPyOpt.acquisitions.AcquisitionEI(model, design_space, optimizer=aquisition_optimizer)
evaluator = GPyOpt.core.evaluators.Sequential(acquisition) # Collect points sequentially, no parallelization.
</code></pre>
<p>To initialize the model we start by randomly sampling some initial points (25) for the linear controller.</p>
<pre><code class="language-{.python}">from GPyOpt.experiment_design.random_design import RandomDesign
</code></pre>
<pre><code class="language-{.python}">n_initial_points = 25
random_design = RandomDesign(design_space)
initial_design = random_design.get_samples(n_initial_points)
</code></pre>
<p>Before we start any optimization, let's have a look at the behavior of the car with the first of these initial points, which we have selected randomly.</p>
<pre><code class="language-{.python}">import numpy as np
</code></pre>
<pre><code class="language-{.python}">random_controller = initial_design[0,:]
_, _, _, frames = mc.run_simulation(env, np.atleast_2d(random_controller), render=True)
anim=mc.animate_frames(frames, 'Random linear controller')
</code></pre>
<pre><code class="language-{.python}">from IPython.core.display import HTML
</code></pre>
<pre><code class="language-{.python}">mc.save_frames(frames,
               diagrams='./slides/diagrams/uq',
               filename='mountain_car_random.html')
</code></pre>
<iframe src="./slides/diagrams/uq/mountain_car_random.html" width="1024" height="768" allowtransparency="true" frameborder="0">
</iframe>
<p>As we can see, the random linear controller does not manage to push the car to the top of the mountain. Now, let's use Bayesian optimization with the emulator of the reward to optimize the controller. We try 50 new parameter settings chosen by the EI.</p>
<pre><code class="language-{.python}">max_iter = 50
bo = GPyOpt.methods.ModularBayesianOptimization(model, design_space, objective, acquisition, evaluator, initial_design)
bo.run_optimization(max_iter = max_iter )
</code></pre>
<p>Now we visualize the result for the best controller that we have found with Bayesian optimization.</p>
<pre><code class="language-{.python}">_, _, _, frames = mc.run_simulation(env, np.atleast_2d(bo.x_opt), render=True)
anim=mc.animate_frames(frames, 'Best controller after 50 iterations of Bayesian optimization')
</code></pre>
<pre><code class="language-{.python}">mc.save_frames(frames,
               diagrams='./slides/diagrams/uq',
               filename='mountain_car_simulated.html')
</code></pre>
<iframe src="./slides/diagrams/uq/mountain_car_simulated.html" width="1024" height="768" allowtransparency="true" frameborder="0">
</iframe>
<p>The car can now make it to the top of the mountain! Emulating the reward function and using the EI helped us find a linear controller that solves the problem.</p>
<h3 id="data-efficient-emulation">Data Efficient Emulation</h3>
<p>In the previous section we solved the mountain car problem by directly emulating the reward, without considering the dynamics $\inputVector_{t+1} = \mappingFunction(\inputVector_{t},\textbf{u}_{t})$ of the system. Note that we had to run 75 episodes of 500 steps each to solve the problem, which required calling the simulator $500\times 75 = 37500$ times. In this section we will show how it is possible to reduce this number by building an emulator for $f$ that can later be used to directly optimize the control.</p>
<p>The inputs of the model for the dynamics are the velocity, the position and the value of the control, so we create this space accordingly.</p>
<pre><code class="language-{.python}">import gym
</code></pre>
<pre><code class="language-{.python}">env = gym.make('MountainCarContinuous-v0')
</code></pre>
<pre><code class="language-{.python}">import GPyOpt
</code></pre>
<pre><code class="language-{.python}">space_dynamics = [{'name':'position', 'type':'continuous', 'domain':[-1.2, +0.6]},
                  {'name':'velocity', 'type':'continuous', 'domain':[-0.07, +0.07]},
                  {'name':'action', 'type':'continuous', 'domain':[-1, +1]}]
design_space_dynamics = GPyOpt.Design_space(space=space_dynamics)
</code></pre>
<p>The outputs are the velocity and the position. Indeed, our model will capture the change in position and velocity over time. That is, we will model</p>
<script type="math/tex; mode=display">\Delta v_{t+1} = v_{t+1} - v_{t}</script>
<script type="math/tex; mode=display">\Delta p_{t+1} = p_{t+1} - p_{t}</script>
<p>with Gaussian processes with prior means $v_{t}$ and $p_{t}$ respectively. As a covariance function we use a Matérn 5/2 kernel. We therefore need two models to capture the full dynamics of the system.</p>
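<p>Concretely, the targets for the two models are one-step differences. A sketch with made-up transition data:</p>

```python
import numpy as np

# Hypothetical sketch of forming the delta targets for the two dynamics
# emulators from a batch of one-step transitions. `states` holds (p_t, v_t)
# and `next_states` holds (p_{t+1}, v_{t+1}); both are made-up example data.

states = np.array([[-0.50, 0.00],
                   [-0.40, 0.01]])
next_states = np.array([[-0.49, 0.01],
                        [-0.38, 0.02]])

delta_p = next_states[:, 0] - states[:, 0]  # Δp_{t+1} = p_{t+1} - p_t
delta_v = next_states[:, 1] - states[:, 1]  # Δv_{t+1} = v_{t+1} - v_t
```

Modelling the deltas (rather than the raw next state) means each GP only has to learn the typically small correction to the identity map, which is an easier regression problem.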
<pre><code class="language-{.python}">position_model = GPyOpt.models.GPModel(optimize_restarts=5, verbose=False, exact_feval=True, ARD=True)
velocity_model = GPyOpt.models.GPModel(optimize_restarts=5, verbose=False, exact_feval=True, ARD=True)
</code></pre>
<p>Next, we sample some input parameters and use the simulator to compute the outputs. Note that in this case we are not running full episodes; we are just using the simulator to compute $\inputVector_{t+1}$ given $\inputVector_{t}$ and $\textbf{u}_{t}$.</p>
<pre><code class="language-{.python}">import numpy as np
from GPyOpt.experiment_design.random_design import RandomDesign
import mountain_car as mc
</code></pre>
<pre><code class="language-{.python}">### --- Random locations of the inputs
n_initial_points = 500
random_design_dynamics = RandomDesign(design_space_dynamics)
initial_design_dynamics = random_design_dynamics.get_samples(n_initial_points)
</code></pre>
<pre><code class="language-{.python}">### --- Simulation of the (normalized) outputs
y = np.zeros((initial_design_dynamics.shape[0], 2))
for i in range(initial_design_dynamics.shape[0]):
    y[i, :] = mc.simulation(initial_design_dynamics[i, :])

# Normalize the data from the simulation
y_normalisation = np.std(y, axis=0)
y_normalised = y/y_normalisation
</code></pre>
<p>In general we might use much smarter strategies to design the training set for our emulator of the simulator. For example, we could use the variance of the predictive distributions of the models to collect points via uncertainty sampling, which would give us better coverage of the space. For simplicity, we move ahead with the 500 randomly selected points.</p>
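<p>Uncertainty sampling itself is simple to sketch: among a set of candidates, query the simulator where the emulator's predictive variance is largest. The variance function below is a made-up stand-in for a GP model's predictive variance.</p>

```python
import numpy as np

# Sketch of uncertainty sampling: pick the candidate input with the largest
# predictive variance under the emulator, query the simulator there, and
# refit. `predictive_variance` is an illustrative toy, not a real GP output.

def predictive_variance(x):
    # toy surrogate: variance grows away from the previously sampled region
    return np.sum(x**2, axis=1)

candidates = np.array([[0.1, 0.0],
                       [0.5, 0.5],
                       [-0.9, 0.2]])
next_query = candidates[np.argmax(predictive_variance(candidates))]
```

Iterating this loop concentrates simulator calls in the regions the emulator knows least about, rather than spreading them uniformly at random.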
<p>Now that we have a data set, we can update the emulators for the location and the velocity.</p>
<pre><code class="language-{.python}">position_model.updateModel(initial_design_dynamics, y[:, [0]], None, None)
velocity_model.updateModel(initial_design_dynamics, y[:, [1]], None, None)
</code></pre>
<p>We can now have a look at how the emulator and the simulator match. First, we show a contour plot of the car acceleration for each pair of car position and velocity. You can use the bar below to play with the values of the controller to compare the emulator and the simulator.</p>
<pre><code class="language-{.python}">from IPython.html.widgets import interact
</code></pre>
<pre><code class="language-{.python}">control = mc.plot_control(velocity_model)
interact(control.plot_slices, control=(-1, 1, 0.05))
</code></pre>
<p>We can see how the emulator is doing a fairly good job approximating the simulator. On the edges, however, it struggles to capture the dynamics of the simulator.</p>
<p>Given some input parameters of the linear controller, how do the dynamics of the emulator and simulator match? In the following figure we show the position and velocity of the car for the 500 time steps of an episode in which the parameters of the linear controller have been fixed beforehand. The value of the input control is also shown.</p>
<pre><code class="language-{.python}">controller_gains = np.atleast_2d([0, .6, 1])  # change the values of the linear controller to observe the trajectories.
</code></pre>
<pre><code class="language-{.python}">mc.emu_sim_comparison(env, controller_gains, [position_model, velocity_model],
                      max_steps=500, diagrams='./slides/diagrams/uq')
</code></pre>
<object class="svgplot" align="" data="./slides/diagrams/uq/emu_sim_comparison.svg"></object>
<p>We now make explicit use of the emulator, using it to replace the simulator and optimize the linear controller. Note that in this optimization, we don’t need to query the simulator anymore as we can reproduce the full dynamics of an episode using the emulator. For illustrative purposes, in this example we fix the initial location of the car.</p>
<p>We define the objective reward function in terms of the emulator.</p>
<pre><code class="language-{.python}">### --- Optimize control parameters with emulator
car_initial_location = np.asarray([-0.58912799, 0])
### --- Reward objective function using the emulator
obj_func_emulator = lambda x: mc.run_emulation([position_model, velocity_model], x, car_initial_location)[0]
objective_emulator = GPyOpt.core.task.SingleObjective(obj_func_emulator)
</code></pre>
<p>And as before, we use Bayesian optimization to find the best possible linear controller.</p>
<pre><code class="language-{.python}">### --- Elements of the optimization that will use the multi-fidelity emulator
model = GPyOpt.models.GPModel(optimize_restarts=5, verbose=False, exact_feval=True, ARD=True)
</code></pre>
<p>The design space is the three continuous variables that make up the linear controller.</p>
<pre><code class="language-{.python}">space = [{'name':'linear_1', 'type':'continuous', 'domain':(-1/1.2, +1)},
         {'name':'linear_2', 'type':'continuous', 'domain':(-1/0.07, +1/0.07)},
         {'name':'constant', 'type':'continuous', 'domain':(-1, +1)}]
design_space = GPyOpt.Design_space(space=space)
aquisition_optimizer = GPyOpt.optimization.AcquisitionOptimizer(design_space)
random_design = RandomDesign(design_space)
initial_design = random_design.get_samples(25)
</code></pre>
<p>We set the acquisition function to be expected improvement using <code class="highlighter-rouge">GPyOpt</code>.</p>
<pre><code class="language-{.python}">acquisition = GPyOpt.acquisitions.AcquisitionEI(model, design_space, optimizer=aquisition_optimizer)
evaluator = GPyOpt.core.evaluators.Sequential(acquisition)
</code></pre>
<pre><code class="language-{.python}">bo_emulator = GPyOpt.methods.ModularBayesianOptimization(model, design_space, objective_emulator, acquisition, evaluator, initial_design)
bo_emulator.run_optimization(max_iter=50)
</code></pre>
<pre><code class="language-{.python}">_, _, _, frames = mc.run_simulation(env, np.atleast_2d(bo_emulator.x_opt), render=True)
anim=mc.animate_frames(frames, 'Best controller using the emulator of the dynamics')
</code></pre>
<pre><code class="language-{.python}">from IPython.core.display import HTML
</code></pre>
<pre><code class="language-{.python}">mc.save_frames(frames,
               diagrams='./slides/diagrams/uq',
               filename='mountain_car_emulated.html')
</code></pre>
<iframe src="./slides/diagrams/uq/mountain_car_emulated.html" width="1024" height="768" allowtransparency="true" frameborder="0">
</iframe>
<p>And the problem is again solved, but in this case we have replaced the simulator of the car dynamics by a Gaussian process emulator that we learned by calling the simulator only 500 times. Compared to the 37500 calls that we needed when applying Bayesian optimization directly on the simulator this is a great gain.</p>
<p>In some scenarios we have simulators of the same environment with different fidelities, that is, they reflect the dynamics of the real world with different levels of accuracy. Running simulations at different fidelities also has a different cost: high fidelity simulations are more expensive than cheaper ones. If we have access to these simulators we can combine high and low fidelity simulations under the same model.</p>
<p>So let’s assume that we have two simulators of the mountain car dynamics, one of high fidelity (the one we have used) and another one of low fidelity. The traditional approach to this form of multi-fidelity emulation is to assume that</p>
<script type="math/tex; mode=display">\mappingFunction_i\left(\inputVector\right) = \rho\mappingFunction_{i-1}\left(\inputVector\right) + \delta_i\left(\inputVector \right)</script>
<p>where $\mappingFunction_{i-1}\left(\inputVector\right)$ is a low fidelity simulation of the problem of interest and $\mappingFunction_i\left(\inputVector\right)$ is a higher fidelity simulation. The function $\delta_i\left(\inputVector \right)$ represents the difference between the lower and higher fidelity simulations, which is considered additive. The additive form of this covariance means that if $\mappingFunction_{0}\left(\inputVector\right)$ and $\left\{\delta_i\left(\inputVector \right)\right\}_{i=1}^m$ are all Gaussian processes, then the process over all fidelities of simulation will be a joint Gaussian process.</p>
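<p>The linear multi-fidelity construction can be sketched with toy functions standing in for the two simulators; the point is only the additive form, not the mountain car dynamics.</p>

```python
import numpy as np

# Sketch of the linear multi-fidelity ("autoregressive") construction:
# the high fidelity function is a scaled low fidelity function plus an
# additive discrepancy. f_low and delta are illustrative toys, not the
# mountain car simulators.

rho = 0.8

def f_low(x):
    return np.sin(x)                  # cheap, biased simulator

def delta(x):
    return 0.1 * np.cos(x)            # additive discrepancy term

def f_high(x):
    return rho * f_low(x) + delta(x)  # f_i = rho * f_{i-1} + delta_i

x = np.linspace(0, 1, 5)
high = f_high(x)
```

Because both components are Gaussian processes in the real model, a few expensive high fidelity samples suffice to pin down $\rho$ and $\delta_i$, while the bulk of the structure is learned from cheap low fidelity runs.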
<p>But with Deep Gaussian processes we can consider the form</p>
<script type="math/tex; mode=display">\mappingFunction_i\left(\inputVector\right) = \mappingFunctionTwo_{i}\left(\mappingFunction_{i-1}\left(\inputVector\right)\right) + \delta_i\left(\inputVector \right),</script>
<p>where the low fidelity representation is nonlinearly transformed by $\mappingFunctionTwo(\cdot)$ before use in the process. This is the approach taken in @Perdikaris:multifidelity17. But once we accept that these models can be composed, a highly flexible framework emerges. A key point is that the data enters the model at different levels and represents different aspects. For example, these correspond to the two fidelities of the mountain car simulator.</p>
<p>We start by sampling both of them at 250 random input locations.</p>
<pre><code class="language-{.python}">import gym
</code></pre>
<pre><code class="language-{.python}">env = gym.make('MountainCarContinuous-v0')
</code></pre>
<pre><code class="language-{.python}">import GPyOpt
</code></pre>
<pre><code class="language-{.python}">### --- Collect points from low and high fidelity simulator --- ###
space = GPyOpt.Design_space([
    {'name':'position', 'type':'continuous', 'domain':(-1.2, +1)},
    {'name':'velocity', 'type':'continuous', 'domain':(-0.07, +0.07)},
    {'name':'action', 'type':'continuous', 'domain':(-1, +1)}])
n_points = 250
random_design = GPyOpt.experiment_design.RandomDesign(space)
x_random = random_design.get_samples(n_points)
</code></pre>
<p>Next, we evaluate the high and low fidelity simulators at those locations.</p>
<pre><code class="language-python">import numpy as np
import mountain_car as mc
</code></pre>
<pre><code class="language-python">d_position_hf = np.zeros((n_points, 1))
d_velocity_hf = np.zeros((n_points, 1))
d_position_lf = np.zeros((n_points, 1))
d_velocity_lf = np.zeros((n_points, 1))
# --- Collect high fidelity points
for i in range(n_points):
    d_position_hf[i], d_velocity_hf[i] = mc.simulation(x_random[i, :])
# --- Collect low fidelity points
for i in range(n_points):
    d_position_lf[i], d_velocity_lf[i] = mc.low_cost_simulation(x_random[i, :])
</code></pre>
<p>It is time to build the multi-fidelity model for both the position and the velocity.</p>
<p>As we did in the previous section, we use the emulator to optimize the simulator. In this case we use the high fidelity output of the emulator.</p>
<pre><code class="language-python">### --- Optimize controller parameters
obj_func = lambda x: mc.run_simulation(env, x)[0]
obj_func_emulator = lambda x: mc.run_emulation([position_model, velocity_model], x, car_initial_location)[0]
objective_multifidelity = GPyOpt.core.task.SingleObjective(obj_func)
</code></pre>
<p>And we optimize using Bayesian optimization.</p>
<pre><code class="language-python">from GPyOpt.experiment_design.random_design import RandomDesign
</code></pre>
<pre><code class="language-python">model = GPyOpt.models.GPModel(optimize_restarts=5, verbose=False, exact_feval=True, ARD=True)
space = [{'name': 'linear_1', 'type': 'continuous', 'domain': (-1/1.2, +1)},
         {'name': 'linear_2', 'type': 'continuous', 'domain': (-1/0.07, +1/0.07)},
         {'name': 'constant', 'type': 'continuous', 'domain': (-1, +1)}]
design_space = GPyOpt.Design_space(space=space)
acquisition_optimizer = GPyOpt.optimization.AcquisitionOptimizer(design_space)
n_initial_points = 25
random_design = RandomDesign(design_space)
initial_design = random_design.get_samples(n_initial_points)
acquisition = GPyOpt.acquisitions.AcquisitionEI(model, design_space, optimizer=acquisition_optimizer)
evaluator = GPyOpt.core.evaluators.Sequential(acquisition)
</code></pre>
<pre><code class="language-python">bo_multifidelity = GPyOpt.methods.ModularBayesianOptimization(model, design_space, objective_multifidelity,
                                                              acquisition, evaluator, initial_design)
bo_multifidelity.run_optimization(max_iter=50)
</code></pre>
<pre><code class="language-python">_, _, _, frames = mc.run_simulation(env, np.atleast_2d(bo_multifidelity.x_opt), render=True)
anim = mc.animate_frames(frames, 'Best controller with multi-fidelity emulator')
</code></pre>
<pre><code class="language-python">from IPython.core.display import HTML
</code></pre>
<pre><code class="language-python">mc.save_frames(frames,
               diagrams='./slides/diagrams/uq',
               filename='mountain_car_multi_fidelity.html')
</code></pre>
<iframe src="./slides/diagrams/uq/mountain_car_multi_fidelity.html" width="1024" height="768" allowtransparency="true" frameborder="0">
</iframe>
<p>And the problem is solved! We see that the problem is also solved here with just 250 observations of the high fidelity simulator and 250 of the low fidelity simulator.</p>
<h3 id="conclusion">Conclusion</h3>
<ul>
<li>
<p>Artificial Intelligence and Data Science are fundamentally different.</p>
</li>
<li>
<p>In one you are dealing with data collected by happenstance.</p>
</li>
<li>
<p>In the other you are trying to build systems in the real world, often by actively collecting data.</p>
</li>
<li>
<p>Our approaches to systems design are building powerful machines that
will be deployed in evolving environments.</p>
</li>
</ul>
<h3 id="thanks">Thanks!</h3>
<ul>
<li>twitter: @lawrennd</li>
<li>blog: <a href="http://inverseprobability.com/blog.html">http://inverseprobability.com</a></li>
<li><a href="http://inverseprobability.com/2018/02/06/natural-and-artificial-intelligence">Natural vs Artificial Intelligence</a></li>
<li><a href="https://medium.com/@mijordan3/artificial-intelligence-the-revolution-hasnt-happened-yet-5e1d5812e1e7">Mike Jordan’s Medium Post</a></li>
</ul>
<div class="footnotes">
<ol>
<li id="fn:knowledge-representation">
<p>the challenge of understanding what it pertains to is known as knowledge representation. <a href="#fnref:knowledge-representation" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
Wed, 02 May 2018 00:00:00 +0000
http://inverseprobability.com/talks/towards-machine-learning-systems-design.html