jupyter nbconvert --to slides --reveal-prefix "http://cdn.jsdelivr.net/reveal.js/2.6.2"
import pods
from matplotlib import pyplot as plt
%matplotlib inline
Predict a real value, $y_i$, given some inputs, $x_i$.
Predict quality of meat given spectral measurements (Tecator data).
Radiocarbon dating, the C14 calibration curve: predict age given quantity of C14 isotope.
Predict quality of different Go or Backgammon moves given expert rated training data.
Image from Wikimedia Commons http://bit.ly/191adDC
data = pods.datasets.olympic_100m_men()
f, ax = plt.subplots(figsize=(7,7))
ax.plot(data['X'], data['Y'], 'rx', markersize=10)
Gold medal times for Olympic Marathon since 1896.
Marathons before 1924 didn’t have a standardised distance.
Present results using pace per km.
In 1904 the Marathon was badly organised, leading to very slow times.
Image from Wikimedia Commons http://bit.ly/16kMKHQ
data = pods.datasets.olympic_marathon_men()
f, ax = plt.subplots(figsize=(7,7))
ax.plot(data['X'], data['Y'], 'rx',markersize=10)
$\text{data}$ : observations, could be actively or passively acquired (meta-data).
$\text{model}$ : assumptions, based on previous experience (other data! transfer learning etc), or beliefs about the regularities of the universe. Inductive bias.
$\text{prediction}$ : an action to be taken or a categorization or a quality score.
$y_i$ : winning time/pace.
$x_i$ : year of Olympics.
$m$ : rate of improvement over time.
$c$ : winning time at year 0.
point 1: $x = 1$, $y=3$ $$3 = m + c$$ point 2: $x = 3$, $y=1$ $$1 = 3m + c$$ point 3: $x = 2$, $y=2.5$ $$2.5 = 2m + c$$
point 1: $x = 1$, $y=3$ $$3 = m + c + \epsilon_1$$
point 2: $x = 3$, $y=1$ $$1 = 3m + c + \epsilon_2$$
point 3: $x = 2$, $y=2.5$ $$2.5 = 2m + c + \epsilon_3$$
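As a quick sketch (not part of the original slides), the first two equations can be solved exactly, and the third point then shows why the noise term is needed:

import numpy as np
# Solve the first two (noise-free) equations, 3 = m + c and 1 = 3m + c.
A = np.array([[1., 1.],
              [3., 1.]])
b = np.array([3., 1.])
m, c = np.linalg.solve(A, b)
print(m, c)                    # m = -1.0, c = 4.0
# The third point, x = 2, y = 2.5, is not satisfied exactly;
# the discrepancy is what the noise term epsilon_3 absorbs.
print(2.5 - (m*2 + c))         # 0.5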
import numpy as np
h = np.linspace(0, 2.5, 1000)
sigma2 = 0.0225  # variance of the density
mu = 1.7         # mean of the density
# Gaussian density p(h | mu, sigma^2); the exponent divides by the variance.
p = 1./np.sqrt(2*np.pi*sigma2)*np.exp(-(h-mu)**2/(2*sigma2))
f2, ax2 = plt.subplots(figsize=(7, 3.5))
ax2.plot(h, p, 'b-', linewidth=3)
ylim = (0, 3)
ax2.vlines(mu, ylim[0], ylim[1], colors='r', linewidth=3)
ax2.set_ylim(ylim)
ax2.set_xlim(1.4, 2.0)
ax2.set_xlabel('$h/m$', fontsize=20)
ax2.set_ylabel(r'$p(h|\mu, \sigma^2)$', fontsize=20)
f2.savefig('./diagrams/gaussianOfHeight.svg')
The Gaussian PDF with $\mu=1.7$ and variance $\sigma^2= 0.0225$. Mean shown as red line. It could represent the heights of a population of students.
$\sigma^2$ is the variance of the density and $\mu$ is the mean.
Sum of Gaussians
Scaling a Gaussian
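Both properties (means and variances add under summation of independent Gaussians; scaling by $w$ scales the mean by $w$ and the variance by $w^2$) can be illustrated numerically. A small sketch, not from the original notebook:

import numpy as np
rng = np.random.default_rng(0)
n = 100000
# Sum of independent Gaussians: means add, variances add.
y1 = rng.normal(1.0, np.sqrt(0.5), n)    # N(1, 0.5)
y2 = rng.normal(2.0, np.sqrt(1.5), n)    # N(2, 1.5)
print((y1 + y2).mean(), (y1 + y2).var()) # approx 3.0 and 2.0
# Scaling a Gaussian: mean scales by w, variance by w^2.
w = 3.0
print((w*y1).mean(), (w*y1).var())       # approx 3.0 and 4.5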
The standard Gaussian is parametrized by its mean and variance.
Make the mean a linear function of an input.
This leads to a regression model. \begin{align*} y_i = & f\left(x_i\right)+\epsilon_i,\\
\epsilon_i \sim & \mathcal{N}(0, \sigma^2). \end{align*}
Assume $y_i$ is height and $x_i$ is weight.
Likelihood of an individual data point $$p\left(y_i|x_i,m,c\right)=\frac{1}{\sqrt{2\pi \sigma^2}}\exp \left(-\frac{\left(y_i-mx_i-c\right)^{2}}{2\sigma^2}\right).$$
Parameters are the gradient, $m$, and offset, $c$, of the function, and the noise variance $\sigma^2$.
The noise, $\epsilon_i$, is sampled independently for each data point.
As a result, each data point is independent (given $m$ and $c$).
For independent variables: $$p(\mathbf{y}) = \prod_{i=1}^n p(y_i)$$ $$p(\mathbf{y}|\mathbf{x}, m, c) = \prod_{i=1}^n p(y_i|x_i, m, c)$$ $$p(\mathbf{y}|\mathbf{x}, m, c) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi \sigma^2}}\exp \left(-\frac{\left(y_i-mx_i-c\right)^{2}}{2\sigma^2}\right)$$ $$p(\mathbf{y}|\mathbf{x}, m, c) = \frac{1}{\left(2\pi \sigma^2\right)^{\frac{n}{2}}}\exp \left(-\frac{\sum_{i=1}^n\left(y_i-mx_i-c\right)^{2}}{2\sigma^2}\right).$$
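A small numerical sketch (with arbitrary toy values for $m$, $c$ and $\sigma^2$) confirming that the product of the individual densities matches the collapsed single-exponential form:

import numpy as np
x = np.array([1., 2., 3.])
y = np.array([3., 2.5, 1.])
m, c, sigma2 = -1., 4., 0.5            # arbitrary illustrative values
# Product of the individual Gaussian densities ...
p_prod = np.prod(1./np.sqrt(2*np.pi*sigma2)
                 * np.exp(-(y - m*x - c)**2/(2*sigma2)))
# ... equals the collapsed form with the sum inside a single exponential.
p_joint = (2*np.pi*sigma2)**(-y.size/2.) \
          * np.exp(-np.sum((y - m*x - c)**2)/(2*sigma2))
print(np.isclose(p_prod, p_joint))     # True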
If the data were really generated according to the probability density we specified:
The correct parameters will be recovered in the limit as $n \rightarrow \infty$.
This can be proven through sample-based approximations (law of large numbers) of “KL divergences”.
Mainstay of classical statistics.
Probabilistic interpretation of the error function: it is the negative log likelihood.
Minimizing error function is equivalent to maximizing log likelihood.
Maximizing log likelihood is equivalent to maximizing the likelihood because $\log$ is monotonic.
Probabilistic interpretation: Minimizing error function is equivalent to maximum likelihood with respect to parameters.
Negative log likelihood is the error function leading to an error function $$E(m,c,\sigma^{2})=\frac{n}{2}\log \sigma^2 +\frac{1}{2\sigma^2}\sum _{i=1}^{n}\left(y_i-mx_i-c\right)^{2}.$$
Learning proceeds by minimizing this error function for the data set provided.
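As a sketch, the error function can be written directly in code and evaluated for candidate parameters (assuming data['X'] and data['Y'] are the arrays plotted earlier; the parameter values below are arbitrary guesses):

import numpy as np
def error_function(m, c, sigma2, x, y):
    # Negative log likelihood E(m, c, sigma^2), dropping the constant
    # (n/2) log 2*pi term which does not affect the minimiser.
    n = y.size
    residuals = y - m*x - c
    return n/2.*np.log(sigma2) + np.sum(residuals**2)/(2.*sigma2)

x = data['X'].flatten()
y = data['Y'].flatten()
print(error_function(m=-0.013, c=28., sigma2=0.05, x=x, y=y))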
Ignoring terms which don’t depend on $m$ and $c$ gives $$E(m, c) \propto \sum_{i=1}^n (y_i - f(x_i))^2$$ where $f(x_i) = mx_i + c$.
This is known as the sum of squares error function.
It is commonly used and is closely associated with the Gaussian likelihood.
What is the mathematical interpretation?
There is a cost function.
It expresses mismatch between your prediction and reality. $$E(m, c)=\sum_{i=1}^n \left(y_i - mx_i -c\right)^2$$
This is known as the sum of squares error.
Learning is minimization of the cost function.
At the minima the gradient is zero.
Coordinate descent: find the gradient in each coordinate and set it to zero. $$\frac{\text{d}E(m)}{\text{d}m} = -2\sum_{i=1}^n x_i\left(y_i - m x_i - c \right)$$
$$0 = -2\sum_{i=1}^n x_i\left(y_i - m x_i - c \right)$$
$$0 = -2\sum_{i=1}^n x_iy_i + 2\sum_{i=1}^n m x_i^2 + 2\sum_{i=1}^n cx_i$$
$$m = \frac{\sum_{i=1}^n \left(y_i - c\right)x_i}{\sum_{i=1}^n x_i^2}$$
Learning is minimization of the cost function.
At the minima the gradient is zero.
Coordinate descent: find the gradient in each coordinate and set it to zero. $$\frac{\text{d}E(c)}{\text{d}c} = -2\sum_{i=1}^n \left(y_i - m x_i - c \right)$$
$$0 = -2\sum_{i=1}^n \left(y_i - m x_i - c \right)$$
$$0 = -2\sum_{i=1}^n y_i + 2\sum_{i=1}^n m x_i + 2nc$$
$$c = \frac{\sum_{i=1}^n \left(y_i - mx_i\right)}{n}$$
Worked example. $$\begin{aligned} c^{*}=&\frac{\sum _{i=1}^{n}\left(y_i-m^{*}x_i\right)}{n},\\ m^{*}=&\frac{\sum _{i=1}^{n}x_i\left(y_i-c^{*}\right)}{\sum _{i=1}^{n}x_i^{2}},\\ \left.\sigma^2\right.^{*}=&\frac{\sum _{i=1}^{n}\left(y_i-m^{*}x_i-c^{*}\right)^{2}}{n} \end{aligned}$$
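A sketch of the resulting coordinate-descent loop, alternating the two fixed-point updates. It is run on the three toy points used earlier; for inputs like raw Olympic years the $x$ values should be centred or rescaled first, otherwise convergence is very slow:

import numpy as np
x = np.array([1., 2., 3.])
y = np.array([3., 2.5, 1.])
m, c = 0., y.mean()                       # simple starting point
for _ in range(100):
    c = np.sum(y - m*x)/y.size            # update c holding m fixed
    m = np.sum(x*(y - c))/np.sum(x**2)    # update m holding c fixed
sigma2 = np.sum((y - m*x - c)**2)/y.size
print(m, c, sigma2)                       # approx -1.0, 4.17, 0.056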
Other optimization methods:
Second order methods, conjugate gradient, quasi-Newton and Newton.
Effective heuristics such as momentum.
Local vs global solutions.
Multivariate functions involve more than one input.
Height might be a function of weight and gender.
There could be other contributory factors.
Place these factors in a feature vector $\mathbf{x}_i$.
Linear function is now defined as $$f(\mathbf{x}_i) = \sum_{j=1}^p w_j x_{i, j} + c$$
Write in vector notation, $$f(\mathbf{x}_i) = \mathbf{w}^\top \mathbf{x}_i + c$$
Can absorb $c$ into $\mathbf{w}$ by assuming extra input $x_0$ which is always 1. $$f(\mathbf{x}_i) = \mathbf{w}^\top \mathbf{x}_i$$
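A sketch of absorbing the offset in code: prepend a column of ones to the design matrix so that the first element of $\mathbf{w}$ plays the role of $c$ (the feature values here are random, purely for illustration):

import numpy as np
rng = np.random.default_rng(0)
X_raw = rng.standard_normal((5, 2))                   # n x p features
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])  # n x (p+1), first column all ones
w = np.array([4., -1., 0.5])                          # w[0] plays the role of c
print(X @ w)                                          # f(x_i) = w^T x_i for every row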
The likelihood of a single data point is $$p\left(y_i|x_i\right)=\frac{1}{\sqrt{2\pi\sigma^2}}\exp \left(-\frac{\left(y_i-\mathbf{w}^{\top}\mathbf{x}_i\right)^{2}}{2\sigma^2}\right).$$
Leading to a log likelihood for the data set of $$L(\mathbf{w},\sigma^2)= -\frac{n}{2}\log \sigma^2 -\frac{n}{2}\log 2\pi -\frac{\sum_{i=1}^{n}\left(y_i-\mathbf{w}^{\top}\mathbf{x}_i\right)^{2}}{2\sigma^2}.$$
And a corresponding error function of $$E(\mathbf{w},\sigma^2)= \frac{n}{2}\log \sigma^2 + \frac{\sum_{i=1}^{n}\left(y_i-\mathbf{w}^{\top}\mathbf{x}_i\right)^{2}}{2\sigma^2}.$$
We will need some multivariate calculus.
For now some simple multivariate differentiation: $$\frac{\text{d}{\mathbf{a}^{\top}}{\mathbf{w}}}{\text{d}\mathbf{w}}=\mathbf{a}$$ and $$\frac{\text{d}\mathbf{w}^{\top}\mathbf{A}\mathbf{w}}{\text{d}\mathbf{w}}=\left(\mathbf{A}+\mathbf{A}^{\top}\right)\mathbf{w}$$ or if $\mathbf{A}$ is symmetric (i.e. $\mathbf{A}=\mathbf{A}^{\top}$) $$\frac{\text{d}\mathbf{w}^{\top}\mathbf{A}\mathbf{w}}{\text{d}\mathbf{w}}=2\mathbf{A}\mathbf{w}.$$
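These identities can be sanity-checked with a quick finite-difference sketch:

import numpy as np
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
w = rng.standard_normal(3)
eps = 1e-6
# Numerical gradient of w^T A w, one coordinate at a time.
grad_numeric = np.array([
    ((w + eps*e) @ A @ (w + eps*e) - (w - eps*e) @ A @ (w - eps*e))/(2*eps)
    for e in np.eye(3)])
grad_analytic = (A + A.T) @ w
print(np.allclose(grad_numeric, grad_analytic))   # True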
Differentiating with respect to the vector $\mathbf{w}$ (writing $\beta = \frac{1}{\sigma^2}$ for the noise precision) we obtain $$\frac{\partial L\left(\mathbf{w},\beta \right)}{\partial \mathbf{w}}=\beta \sum _{i=1}^{n}\mathbf{x}_iy_i-\beta \left[\sum _{i=1}^{n}\mathbf{x}_i\mathbf{x}_i^{\top}\right]\mathbf{w}.$$ Leading to $$\mathbf{w}^{*}=\left[\sum _{i=1}^{n}\mathbf{x}_i\mathbf{x}_i^{\top}\right]^{-1}\sum _{i=1}^{n}\mathbf{x}_iy_i.$$ Rewrite in matrix notation: $$\sum _{i=1}^{n}\mathbf{x}_i\mathbf{x}_i^\top = \mathbf{X}^\top \mathbf{X}$$ $$\sum _{i=1}^{n}\mathbf{x}_iy_i = \mathbf{X}^\top \mathbf{y}$$
Update for $\mathbf{w}^{*}$. $$\mathbf{w}^{*} = \left(\mathbf{X}^\top \mathbf{X}\right)^{-1} \mathbf{X}^\top \mathbf{y}$$
The equation for $\left.\sigma^2\right.^{*}$ may also be found $$\left.\sigma^2\right.^{{*}}=\frac{\sum _{i=1}^{n}\left(y_i-\left.\mathbf{w}^{*}\right.^{\top}\mathbf{x}_i\right)^{2}}{n}.$$
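A sketch of the multivariate solution in code, solving the normal equations with np.linalg.solve rather than forming the inverse explicitly (X is assumed to already include the column of ones, as in the earlier sketch):

import numpy as np
def fit_multivariate(X, y):
    # Maximum likelihood w* and sigma^2* for the multivariate linear model.
    w_star = np.linalg.solve(X.T @ X, X.T @ y)   # solves (X^T X) w = X^T y
    residuals = y - X @ w_star
    sigma2_star = np.sum(residuals**2)/y.size
    return w_star, sigma2_star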