Introduction to Machine Learning and Data Science

Data Science in Africa Summer School

Makerere University, Kampala, Uganda

Neil D. Lawrence

27th June 2016

Text

@Rogers:book11

Another Text

@Bishop:book06

What is Machine Learning?

data + model = prediction

  • data : observations, could be actively or passively acquired (meta-data).

  • model : assumptions, based on previous experience (other data! transfer learning etc), or beliefs about the regularities of the universe. Inductive bias.

  • prediction : an action to be taken or a categorization or a quality score.

Fitting Data

  • data
In [3]:
import numpy as np

# Create some data
x = np.array([1, 3])
y = np.array([3, 1])
  • model $$y=mx + c$$

Model Fitting

$$m = \frac{y_2 - y_1}{x_2 - x_1}, \qquad c = y_1 - m x_1$$
In [4]:
xvals = np.linspace(0, 5, 2);

m = (y[1]-y[0])/(x[1]-x[0]);
c = y[0]-m*x[0];

yvals = m*xvals+c;
In [5]:
%matplotlib inline
import matplotlib.pyplot as plt

xvals = np.linspace(0, 5, 2);

m = (y[1]-y[0])/(x[1]-x[0]);
c = y[0]-m*x[0];

yvals = m*xvals+c;

ylim = np.array([0, 5])
xlim = np.array([0, 5])

f, ax = plt.subplots(1,1,figsize=(5,5))
a = ax.plot(xvals, yvals, '-', linewidth=3);

ax.set_xlim(xlim)
ax.set_ylim(ylim)

plt.xlabel('$x$', fontsize=30)
plt.ylabel('$y$',fontsize=30)
plt.text(4, 4, '$y=mx+c$',  horizontalalignment='center', verticalalignment='bottom', fontsize=30)
plt.savefig('diagrams/straight_line1.svg')
ctext = ax.text(0.15, c+0.15, '$c$',  horizontalalignment='center', verticalalignment='bottom', fontsize=20)
xl = np.array([1.5, 2.5])
yl = xl*m + c;
mhand = ax.plot([xl[0], xl[1]], [yl.min(), yl.min()], color=[0, 0, 0])
mhand2 = ax.plot([xl.min(), xl.min()], [yl[0], yl[1]], color=[0, 0, 0])
mtext = ax.text(xl.mean(), yl.min()-0.2, '$m$',  horizontalalignment='center', verticalalignment='bottom',fontsize=20);
plt.savefig('diagrams/straight_line2.svg')

a2 = ax.plot(x, y, '.', markersize=20, linewidth=3, color=[1, 0, 0])
plt.savefig('diagrams/straight_line3.svg')

xs = 2
ys = m*xs + c + 0.3

ast = ax.plot(xs, ys, '.', markersize=20, linewidth=3, color=[0, 1, 0])
plt.savefig('diagrams/straight_line4.svg')


m = (y[1]-ys)/(x[1]-xs);
c = ys-m*xs;
yvals = m*xvals+c;

for i in a:
    i.set_visible(False)
for i in mhand:
    i.set_visible(False)
for i in mhand2:
    i.set_visible(False)
mtext.set_visible(False)
ctext.set_visible(False)
a3 = ax.plot(xvals, yvals, '-', linewidth=2, color=[0, 0, 1])
for i in ast:
    i.set_color([1, 0, 0])
plt.savefig('diagrams/straight_line5.svg')

m = (ys-y[0])/(xs-x[0])
c = y[0]-m*x[0]
yvals = m*xvals+c

for i in a3:
    i.set_visible(False)
a4 = ax.plot(xvals, yvals, '-', linewidth=2, color=[0, 0, 1]);
for i in ast:
    i.set_color([1, 0, 0])
plt.savefig('diagrams/straight_line6.svg')
for i in a:
    i.set_visible(True)
for i in a3:
    i.set_visible(True)
plt.savefig('diagrams/straight_line7.svg')
In [6]:
import pods
pods.notebook.display_plots('straight_line{plot}.svg', 
                            directory='./diagrams', plot=(1, 7))

$y = mx + c$

point 1: $x = 1$, $y=3$ $$3 = m + c$$

point 2: $x = 3$, $y=1$ $$1 = 3m + c$$

point 3: $x = 2$, $y=2.5$ $$2.5 = 2m + c$$

$y = mx + c + \epsilon$

point 1: $x = 1$, $y=3$ $$3 = m + c + \epsilon_1$$

point 2: $x = 3$, $y=1$ $$1 = 3m + c + \epsilon_2$$

point 3: $x = 2$, $y=2.5$ $$2.5 = 2m + c + \epsilon_3$$

Regression Examples

  • Predict a real value, $y_i$ given some inputs $x_i$.

  • Predict quality of meat given spectral measurements (Tecator data).

  • Radiocarbon dating, the C14 calibration curve: predict age given quantity of C14 isotope.

  • Predict quality of different Go or Backgammon moves given expert rated training data.

Olympic 100m Data

  • Gold medal times for Olympic 100 m runners since 1896.

Image from Wikimedia Commons: http://bit.ly/191adDC

Olympic 100m Data

In [3]:
data = pods.datasets.olympic_100m_men()
f, ax = plt.subplots(figsize=(7,7))
ax.plot(data['X'], data['Y'], 'ro', markersize=10)
Out[3]:
[<matplotlib.lines.Line2D at 0x109316208>]

Olympic Marathon Data

Image from Wikimedia Commons: http://bit.ly/16kMKHQ

Olympic Marathon Data

  • Gold medal times for Olympic Marathon since 1896.

  • Marathons before 1924 didn’t have a standardised distance.

  • Present results using pace per km.

  • In 1904 the Marathon was badly organised, leading to very slow times.

Olympic Marathon Data

In [4]:
data = pods.datasets.olympic_marathon_men()
f, ax = plt.subplots(figsize=(7,7))
ax.plot(data['X'], data['Y'], 'ro',markersize=10)
Out[4]:
[<matplotlib.lines.Line2D at 0x114702ef0>]

What is Machine Learning?

$$ \text{data} + \text{model} = \text{prediction}$$
  • $\text{data}$ : observations, could be actively or passively acquired (meta-data).

  • $\text{model}$ : assumptions, based on previous experience (other data! transfer learning etc), or beliefs about the regularities of the universe. Inductive bias.

  • $\text{prediction}$ : an action to be taken or a categorization or a quality score.

Regression: Linear Relationship

$$y_i = m x_i + c$$
  • $y_i$ : winning time/pace.

  • $x_i$ : year of Olympics.

  • $m$ : rate of improvement over time.

  • $c$ : winning time at year 0.

Overdetermined System

$y = mx + c$

point 1: $x = 1$, $y=3$ $$3 = m + c$$

point 2: $x = 3$, $y=1$ $$1 = 3m + c$$

point 3: $x = 2$, $y=2.5$ $$2.5 = 2m + c$$

$y = mx + c + \epsilon$

point 1: $x = 1$, $y=3$ $$3 = m + c + \epsilon_1$$

point 2: $x = 3$, $y=1$ $$1 = 3m + c + \epsilon_2$$

point 3: $x = 2$, $y=2.5$ $$2.5 = 2m + c + \epsilon_3$$
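
With three equations and only two unknowns the system is overdetermined; once the noise term $\epsilon_i$ is included we can instead look for the $m$ and $c$ that fit all three points best. A minimal sketch of that best fit using numpy's least squares routine (the names x3, y3, m_fit, c_fit are just illustrative):
In [ ]:
# Sketch: least-squares fit of m and c to the three points above.
# The design matrix has a column of inputs (for m) and a column of ones (for c).
x3 = np.array([1., 3., 2.])
y3 = np.array([3., 1., 2.5])
A = np.column_stack([x3, np.ones_like(x3)])
(m_fit, c_fit), res, rank, sv = np.linalg.lstsq(A, y3)
m_fit, c_fit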

The Gaussian Density

  • Perhaps the most common probability density. \begin{align} p(y| \mu, \sigma^2) & = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(y - \mu)^2}{2\sigma^2}\right)\\ & \buildrel\triangle\over = \mathcal{N}(y|\mu, \sigma^2) \end{align}
  • The Gaussian density.

Gaussian Density

The Gaussian PDF with $\mu=1.7$ and variance $\sigma^2= 0.0225$. Mean shown as red line. It could represent the heights of a population of students.

Gaussian Density

$$ \mathcal{N}(y|\mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y-\mu)^2}{2\sigma^2}\right) $$

$\sigma^2$ is the variance of the density and $\mu$ is the mean.
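
As a small illustration (assuming the $\mu=1.7$ and $\sigma^2=0.0225$ values quoted above), the density can be evaluated and plotted directly:
In [ ]:
# Sketch: evaluate and plot the Gaussian density with mu=1.7, sigma^2=0.0225.
mu = 1.7
sigma2 = 0.0225
ygrid = np.linspace(1.2, 2.2, 200)
pdf = 1./np.sqrt(2*np.pi*sigma2)*np.exp(-(ygrid-mu)**2/(2*sigma2))
f, ax = plt.subplots(figsize=(5, 3))
ax.plot(ygrid, pdf, linewidth=2)
ax.axvline(mu, color='r')  # mean shown as a red line
plt.xlabel('$y$', fontsize=20)
plt.ylabel('$p(y)$', fontsize=20)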

Laplace's Idea

A Probabilistic Process

  • Set the mean of the Gaussian to be a function. $$p\left(y_i|x_i\right)=\frac{1}{\sqrt{2\pi\sigma^2}}\exp \left(-\frac{\left(y_i-f\left(x_i\right)\right)^{2}}{2\sigma^2}\right).$$

  • This gives us a ‘noisy function’.

  • This is known as a stochastic process.

Height as a Function of Weight

  • The standard Gaussian is parameterized by its mean and variance.

  • Make the mean a linear function of an input.

  • This leads to a regression model. \begin{align*} y_i=&f\left(x_i\right)+\epsilon_i,\\ \epsilon_i \sim &\mathcal{N}(0, \sigma^2). \end{align*}

  • Assume $y_i$ is height and $x_i$ is weight.

Data Point Likelihood

  • Likelihood of an individual data point $$p\left(y_i|x_i,m,c\right)=\frac{1}{\sqrt{2\pi \sigma^2}}\exp \left(-\frac{\left(y_i-mx_i-c\right)^{2}}{2\sigma^2}\right).$$

  • Parameters are gradient, $m$, offset, $c$ of the function and noise variance $\sigma^2$.

Data Set Likelihood

  • If the noise, $\epsilon_i$, is sampled independently for each data point, then each data point is independent (given $m$ and $c$).

  • For independent variables: $$p(\mathbf{y}) = \prod_{i=1}^n p(y_i)$$ $$p(\mathbf{y}|\mathbf{x}, m, c) = \prod_{i=1}^n p(y_i|x_i, m, c)$$

For Gaussian

  • i.i.d. assumption

    $$p(\mathbf{y}|\mathbf{x}, m, c) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi \sigma^2}}\exp \left(-\frac{\left(y_i-mx_i-c\right)^{2}}{2\sigma^2}\right).$$ $$p(\mathbf{y}|\mathbf{x}, m, c) = \frac{1}{\left(2\pi \sigma^2\right)^{\frac{n}{2}}}\exp \left(-\frac{\sum_{i=1}^n\left(y_i-mx_i-c\right)^{2}}{2\sigma^2}\right).$$

Log Likelihood Function

  • Normally work with the log likelihood: $$L(m,c,\sigma^{2})=-\frac{n}{2}\log 2\pi -\frac{n}{2}\log \sigma^2 -\sum _{i=1}^{n}\frac{\left(y_i-mx_i-c\right)^{2}}{2\sigma^2}.$$
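
A minimal sketch of this log likelihood as a Python function; x_data and y_data are illustrative names for numpy arrays of inputs and targets:
In [ ]:
# Sketch: the log likelihood L(m, c, sigma^2) for arrays of inputs and targets.
def log_likelihood(m, c, sigma2, x_data, y_data):
    n = y_data.size
    resid = y_data - m*x_data - c
    return (-0.5*n*np.log(2*np.pi)
            - 0.5*n*np.log(sigma2)
            - np.sum(resid**2)/(2.*sigma2))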

Error Function

  • The negative log likelihood is our error function; dropping the constant term gives $$E(m,c,\sigma^{2})=\frac{n}{2}\log \sigma^2 +\frac{1}{2\sigma^2}\sum _{i=1}^{n}\left(y_i-mx_i-c\right)^{2}.$$

  • Learning proceeds by minimizing this error function for the data set provided.

Connection: Sum of Squares Error

  • Ignoring terms which don’t depend on $m$ and $c$ gives $$E(m, c) \propto \sum_{i=1}^n (y_i - f(x_i))^2$$ where $f(x_i) = mx_i + c$.

  • This is known as the sum of squares error function.

  • It is commonly used and closely associated with the Gaussian likelihood.

Reminder

  • Two functions involved:
    • Prediction function: $f(x_i)$
    • Error, or Objective function: $E(m, c)$
  • Error function depends on parameters through prediction function.

Mathematical Interpretation

  • What is the mathematical interpretation?

    • There is a cost function.

    • It expresses mismatch between your prediction and reality. $$E(m, c)=\sum_{i=1}^n \left(y_i - mx_i -c\right)^2$$

    • This is known as the sum of squares error.

Nonlinear Regression

  • Problem with Linear Regression—$\mathbf{x}$ may not be linearly related to $\mathbf{y}$.

  • Potential solution: create a feature space: define $\phi(\mathbf{x})$ where $\phi(\cdot)$ is a nonlinear function of $\mathbf{x}$.

  • Model for target is a linear combination of these nonlinear functions $$f(\mathbf{x}) = \sum_{j=1}^k w_j \phi_j(\mathbf{x})$$

Quadratic Basis

  • Basis functions can be global. E.g. quadratic basis: $$\boldsymbol{\phi} = [1, x, x^2]$$
In [6]:
pods.notebook.display_plots('polynomial_basis{num_basis}.svg', directory='./diagrams', num_basis=(1,3))
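
A minimal sketch of how such a basis matrix could be computed with numpy (the function name quadratic_basis is just illustrative, not part of pods):
In [ ]:
# Sketch: quadratic basis matrix Phi with columns [1, x, x^2].
def quadratic_basis(x_in):
    return np.column_stack([np.ones_like(x_in), x_in, x_in**2])

quadratic_basis(np.linspace(-1., 1., 5))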

Functions Derived from Quadratic Basis

$$f(x) = {\color{\redColor}w_0} + {\color{\magentaColor}w_1x} + {\color{\blueColor}w_2 x^2}$$
In [7]:
pods.notebook.display_plots('polynomial_function{func_num}.svg', directory='./diagrams', func_num=(1,3))

Radial Basis Functions

  • Or they can be local. E.g. radial (or Gaussian) basis $$\phi_j(x) = \exp\left(-\frac{(x-\mu_j)^2}{\ell^2}\right)$$
In [8]:
pods.notebook.display_plots('radial_basis{num_basis}.svg', directory='./diagrams', num_basis=(1,3))
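
A corresponding sketch for the radial basis, with illustrative centres $\mu_j$ at $-1, 0, 1$ and lengthscale $\ell$:
In [ ]:
# Sketch: radial (Gaussian) basis matrix; each column is one basis function.
def radial_basis(x_in, centres, ell=1.):
    return np.exp(-(x_in[:, None] - centres[None, :])**2/ell**2)

radial_basis(np.linspace(-2., 2., 9), centres=np.array([-1., 0., 1.]))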

Functions Derived from Radial Basis

$$f(x) = {\color{\redColor}w_1 e^{-2(x+1)^2}} + {\color{\magentaColor}w_2e^{-2x^2}} + {\color{\blueColor}w_3 e^{-2(x-1)^2}}$$
In [10]:
pods.notebook.display_plots('radial_function{func_num}.svg', directory='./diagrams', func_num=(1,3))

Basis Function Models

  • The prediction function is now defined as $$f(\mathbf{x}_i) = \sum_{j=1}^m w_j \phi_{i, j}$$

Vector Notation

  • Write in vector notation, $$f(\mathbf{x}_i) = \mathbf{w}^\top \boldsymbol{\phi}_i$$

Log Likelihood for Basis Function Model

  • The likelihood of a single data point is $$p\left(y_i|x_i\right)=\frac{1}{\sqrt{2\pi\sigma^2}}\exp
    \left(-\frac{\left(y_i-\mathbf{w}^{\top}\boldsymbol{\phi}_i\right)^{2}}{2\sigma^2}\right).$$

Log Likelihood for Basis Function Model

  • Leading to a log likelihood for the data set of $$L(\mathbf{w},\sigma^2)= -\frac{n}{2}\log \sigma^2
      -\frac{n}{2}\log 2\pi -\frac{\sum
        _{i=1}^{n}\left(y_i-\mathbf{w}^{\top}\boldsymbol{\phi}_i\right)^{2}}{2\sigma^2}.$$

Objective Function

  • And a corresponding objective function of the form $$E(\mathbf{w},\sigma^2)= \frac{n}{2}\log
        \sigma^2 + \frac{\sum
          _{i=1}^{n}\left(y_i-\mathbf{w}^{\top}\boldsymbol{\phi}_i\right)^{2}}{2\sigma^2}.$$
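
Minimising this objective with respect to $\mathbf{w}$ gives the least-squares solution of the normal equations, and the maximum likelihood $\sigma^2$ is the mean squared residual. A hedged sketch, assuming a basis matrix Phi (for example from the quadratic_basis sketch above) and a target vector y_target:
In [ ]:
# Sketch: minimise E(w, sigma^2) for a fixed basis matrix Phi and targets y_target.
def fit_basis_model(Phi, y_target):
    # Normal equations: (Phi' Phi) w = Phi' y
    w = np.linalg.solve(np.dot(Phi.T, Phi), np.dot(Phi.T, y_target))
    resid = y_target - np.dot(Phi, w)
    sigma2 = np.mean(resid**2)  # maximum likelihood noise variance
    return w, sigma2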

Polynomial Fits to Olympic Data

In [11]:
pods.notebook.display_plots('olympic_LM_polynomial{num_basis}.svg', directory='./diagrams', num_basis=(1,7))

Polynomial Fits to Olympic Data

In [15]:
pods.notebook.display_plots('olympic_LM_polynomial{num_basis}.svg', 
                            directory='./diagrams', num_basis=(1, max_basis))

Overfitting

  • As we increase the number of basis functions we obtain a better 'fit' to the data.
  • How will the model perform on previously unseen data?
  • Let's consider predicting the future.
In [18]:
pods.notebook.display_plots('olympic_val_LM_polynomial{num_basis}.svg', 
                            directory='./diagrams', num_basis=(1, max_basis))

Extrapolation

  • Here we are predicting beyond the region where the model has learnt.
  • This is known as extrapolation.
  • Extrapolation is predicting into the future here, but could be:
    • Predicting back to the unseen past (pre 1896)
    • Spatial prediction (e.g. Cholera rates outside Manchester given rates inside Manchester).

Alan Turing

  • He was a formidable Marathon runner.
  • In 1946 he ran a time of 2 hours and 46 minutes.
  • What is the probability he would have won an Olympics if one had been held in 1946?
    Alan Turing running in 1946
    *Alan Turing: in 1946 he was only 11 minutes slower than the winner of the 1948 games. Would he have won a hypothetical games held in 1946? Source: [Alan Turing Internet Scrapbook](http://www.turing.org.uk/scrapbook/run.html).*

Interpolation

  • Predicting the winning time for the 1946 Olympics is interpolation.
  • This is because we have times from 1936 and 1948.
  • If we want a model for interpolation how can we test it?
  • One trick is to sample the validation set from throughout the data set.
In [20]:
pods.notebook.display_plots('olympic_val_inter_LM_polynomial{num_basis}.svg', 
                            directory='./diagrams', num_basis=(1, max_basis))

Choice of Validation Set

  • The choice of validation set should reflect how you will use the model in practice.
  • For extrapolation into the future we tried validating with data from the future.
  • For interpolation we chose the validation set from throughout the data set.
  • For different validation sets we could get different results.

Leave One Out Error

  • Take training set and remove one point.
  • Train on the remaining data.
  • Compute the error on the point you removed (which wasn't in the training data).
  • Do this for each point in the training set in turn.
  • Average the resulting error.
  • This is the leave one out error.
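
The recipe above as a short sketch; basis_fn and fit_basis_model are the illustrative helpers sketched earlier, not part of pods:
In [ ]:
# Sketch: leave one out error for a basis function model.
def leave_one_out_error(x_all, y_all, basis_fn):
    errors = []
    for i in range(len(y_all)):
        train = np.arange(len(y_all)) != i          # hold out point i
        w, _ = fit_basis_model(basis_fn(x_all[train]), y_all[train])
        pred = np.dot(basis_fn(x_all[[i]]), w)[0]   # predict the held-out point
        errors.append((y_all[i] - pred)**2)
    return np.mean(errors)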
In [22]:
pods.notebook.display_plots('olympic_loo{part}_inter_LM_polynomial{num_basis}.svg', 
                            directory='./diagrams', num_basis=(1, max_basis), part=(0,len(partitions)))

$k$ Fold Cross Validation

  • Leave one out error can be very time consuming.
  • Need to train your algorithm $n$ times.
  • An alternative: $k$ fold cross validation.
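
A matching sketch for $k$ fold cross validation, again using the illustrative helpers from above:
In [ ]:
# Sketch: k fold cross validation error; each fold is held out once.
def k_fold_error(x_all, y_all, basis_fn, k=5):
    folds = np.array_split(np.random.permutation(len(y_all)), k)
    errors = []
    for fold in folds:
        train = np.setdiff1d(np.arange(len(y_all)), fold)
        w, _ = fit_basis_model(basis_fn(x_all[train]), y_all[train])
        preds = np.dot(basis_fn(x_all[fold]), w)
        errors.append(np.mean((y_all[fold] - preds)**2))
    return np.mean(errors)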
In [24]:
pods.notebook.display_plots('olympic_5cv{part}_inter_LM_polynomial{num_basis}.svg', 
                            directory='./diagrams', num_basis=(1, max_basis), part=(0,5))

Reading

  • Sections 1.3-1.5 of @Rogers:book11.