Sample-based approximation: replace the true expectation with a sum over samples. \[
\int \mappingFunction(z) p(z) \text{d}z\approx \frac{1}{s}\sum_{i=1}^s \mappingFunction(z_i).
\]
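As a quick check, here is a minimal sketch of this sample-based approximation in Python, using \(\mappingFunction(z) = z^2\) under a standard normal density as a hypothetical example (the true value of the integral is 1).

```python
import numpy as np

# Hypothetical example: approximate E[f(z)] for f(z) = z**2 with z ~ N(0, 1).
# The true value of the integral is 1.
rng = np.random.default_rng(0)
s = 10_000
z = rng.standard_normal(s)   # samples z_i drawn from p(z)
estimate = np.mean(z**2)     # (1/s) * sum_i f(z_i)
print(estimate)              # approaches 1 as s grows
```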
This allows us to approximate the true integral with a sum \[
R(\mappingVector) \approx \frac{1}{\numData}\sum_{i=1}^{\numData} L(\dataScalar_i, \inputScalar_i, \mappingVector).
\]
Empirical Risk Minimization
If the loss is the squared loss, \[
L(\dataScalar, \inputScalar, \mappingVector) = (\dataScalar-\mappingVector^\top\boldsymbol{\phi}(\inputScalar))^2,
\]
this recovers the empirical risk \[
R(\mappingVector) \approx \frac{1}{\numData} \sum_{i=1}^{\numData}
(\dataScalar_i - \mappingVector^\top \boldsymbol{\phi}(\inputScalar_i))^2.
\]
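A minimal sketch of this empirical risk in Python, assuming a design matrix `Phi` whose \(i\)th row contains the basis-function values \(\boldsymbol{\phi}(\inputScalar_i)\); the names here are illustrative, not from the original.

```python
import numpy as np

def empirical_risk(w, Phi, y):
    """Empirical risk under the squared loss (mean squared error).

    w   : (k,) parameter vector
    Phi : (n, k) design matrix, row i is phi(x_i)
    y   : (n,) observed targets
    """
    residuals = y - Phi @ w       # y_i - w^T phi(x_i)
    return np.mean(residuals**2)  # average squared loss over the data
```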
Estimating Risk through Validation
Validation
Olympic Marathon Data
Gold medal times for Olympic Marathon since 1896.
Marathons before 1924 didn’t have a standardised distance.
Present results using pace per km.
In 1904 the Marathon was badly organised, leading to very slow times.
Next we'll consider a quadratic fit and compute the training error for the two fits, as sketched below.
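A sketch of how the fits and their training errors might be computed, assuming 1-D arrays `x` (Olympic year) and `y` (pace in minutes per km) have already been loaded; the helper names and the fixed rescaling of the year are illustrative choices.

```python
import numpy as np

def poly_design(x, degree):
    """Polynomial basis 1, x, x**2, ..., x**degree (year rescaled for numerical stability)."""
    x = (x - 1950) / 100.0
    return np.vander(x, degree + 1, increasing=True)

def training_error(x, y, degree):
    """Least-squares fit of the given degree and its training error (empirical risk)."""
    Phi = poly_design(x, degree)
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return np.mean((y - Phi @ w)**2)

# Training error for the linear (degree 1) and quadratic (degree 2) fits:
# training_error(x, y, 1), training_error(x, y, 2)
```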
Polynomial Fits to Olympics Data
Hold Out Validation on Olympic Marathon Data
Overfitting
As we increase the number of basis functions we obtain a better ‘fit’ to the training data.
How will the model perform on previously unseen data?
Let’s consider predicting the future.
Future Prediction: Extrapolation
Extrapolation
Here we are predicting beyond where the model has learnt.
This is known as extrapolation.
Extrapolation is predicting into the future here, but could be:
Predicting back to the unseen past (pre 1896)
Spatial prediction (e.g. cholera rates outside Manchester given rates inside Manchester).
Interpolation
Predicting the winning time for the 1946 Olympics is interpolation.
This is because we have times from 1936 and 1948.
If we want a model for interpolation, how can we test it?
One trick is to sample the validation set from throughout the data set.
Future Prediction: Interpolation
Choice of Validation Set
The choice of validation set should reflect how you will use the model in practice.
For extrapolation into the future we tried validating with data from the future.
For interpolation we sampled the validation set from throughout the data.
For different validation sets we could get different results.
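A sketch of the two hold-out choices, with illustrative names: hold back the most recent points to test extrapolation, or a random subset from throughout the data to test interpolation.

```python
import numpy as np

def holdout_split(x, mode="interpolation", frac=0.2, seed=0):
    """Return (train_idx, val_idx) for a hold-out validation split.

    mode="extrapolation": validate on the latest frac of points (the 'future').
    mode="interpolation": validate on a random frac of points from throughout the data.
    """
    n = len(x)
    n_val = max(1, int(frac * n))
    if mode == "extrapolation":
        val_idx = np.argsort(x)[-n_val:]          # most recent points
    else:
        rng = np.random.default_rng(seed)
        val_idx = rng.choice(n, size=n_val, replace=False)
    train_idx = np.setdiff1d(np.arange(n), val_idx)
    return train_idx, val_idx
```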
Leave One Out Validation
Leave One Out Error
Take the training set and remove one point.
Train on the remaining data.
Compute the error on the point you removed (which wasn’t in the training data).
Do this for each point in the training set in turn.
Average the resulting errors.
This is the leave one out error.
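A minimal sketch of this procedure for a linear-in-the-basis model, assuming a precomputed design matrix `Phi` (one row of basis-function values per data point) and targets `y`; the function name is illustrative.

```python
import numpy as np

def leave_one_out_error(Phi, y):
    """Leave one out error for a least-squares fit with design matrix Phi."""
    n = len(y)
    errors = np.empty(n)
    for i in range(n):
        train = np.delete(np.arange(n), i)                         # remove one point
        w, *_ = np.linalg.lstsq(Phi[train], y[train], rcond=None)  # train on the rest
        errors[i] = (y[i] - Phi[i] @ w)**2                         # error on the held-out point
    return errors.mean()                                           # average over all points
```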
Leave One Out Validation
Figure: leave one out validation fits, varying the fold and the number of basis functions.
\(k\)-fold Cross Validation
Leave one out error can be very time-consuming.
Need to train your algorithm \(\numData\) times.
An alternative: \(k\)-fold cross validation.
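A sketch of \(k\)-fold cross validation in the same setting (design matrix `Phi`, targets `y`); the model is trained only \(k\) times rather than \(\numData\) times.

```python
import numpy as np

def k_fold_error(Phi, y, k=5, seed=0):
    """k-fold cross validation error for a least-squares fit with design matrix Phi."""
    n = len(y)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), k)   # k roughly equal, shuffled folds
    errors = []
    for fold in folds:
        train = np.setdiff1d(np.arange(n), fold)                   # everything but this fold
        w, *_ = np.linalg.lstsq(Phi[train], y[train], rcond=None)  # train on k-1 folds
        errors.append(np.mean((y[fold] - Phi[fold] @ w)**2))       # validate on held-out fold
    return np.mean(errors)                                         # average over folds
```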
\(k\)-fold Cross Validation
Figure: \(k\)-fold cross validation fits, varying the fold and the number of basis functions.
Bias Variance Decomposition
Expected test error, averaged over different draws of the training data sampled from \(\Pr(\inputVector, \dataScalar)\), \[\mathbb{E}\left[ \left(\dataScalar - \mappingFunction^*(\inputVector)\right)^2 \right].\] Decompose as \[\mathbb{E}\left[ \left(\dataScalar - \mappingFunction^*(\inputVector)\right)^2 \right] = \text{bias}\left[\mappingFunction^*(\inputVector)\right]^2 + \text{variance}\left[\mappingFunction^*(\inputVector)\right] + \sigma^2.\]
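A short derivation sketch, under the common assumption (not stated explicitly above) that \(\dataScalar = \mappingFunction(\inputVector) + \epsilon\) with zero-mean noise of variance \(\sigma^2\), independent of the training set used to produce \(\mappingFunction^*\): \[
\mathbb{E}\left[\left(\dataScalar - \mappingFunction^*(\inputVector)\right)^2\right]
= \left(\mathbb{E}\left[\mappingFunction^*(\inputVector)\right] - \mappingFunction(\inputVector)\right)^2
+ \mathbb{E}\left[\left(\mappingFunction^*(\inputVector) - \mathbb{E}\left[\mappingFunction^*(\inputVector)\right]\right)^2\right]
+ \sigma^2,
\] where the cross term vanishes because the noise has zero mean and is independent of \(\mappingFunction^*\).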
Bias
Given by \[\text{bias}\left[\mappingFunction^*(\inputVector)\right] =
\mathbb{E}\left[\mappingFunction^*(\inputVector)\right] - \mappingFunction(\inputVector).\]
Error due to bias comes from a model that’s too simple.
Variance
Given by \[\text{variance}\left[\mappingFunction^*(\inputVector)\right] = \mathbb{E}\left[\left(\mappingFunction^*(\inputVector) - \mathbb{E}\left[\mappingFunction^*(\inputVector)\right]\right)^2\right].\]
Slight variations in the training set cause changes in the prediction. Error due to variance comes from a model that’s too complex.
Figure: simple models on the left, complex models on the right.
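A hypothetical simulation sketch of the decomposition: repeatedly draw training sets from a known generative model, refit, and estimate the squared bias and variance of the prediction at a single test input (all names and settings here are illustrative).

```python
import numpy as np

# Hypothetical setup: true function f(x) = sin(x), noise standard deviation sigma.
rng = np.random.default_rng(0)
f = np.sin
sigma, n_train, n_repeats, degree = 0.2, 20, 500, 3
x_test = 1.5

predictions = np.empty(n_repeats)
for r in range(n_repeats):
    # Draw a fresh training set and refit the model f* (a degree-3 polynomial).
    x = rng.uniform(0.0, np.pi, n_train)
    y = f(x) + sigma * rng.standard_normal(n_train)
    w = np.polyfit(x, y, degree)
    predictions[r] = np.polyval(w, x_test)

bias_sq = (predictions.mean() - f(x_test))**2    # bias[f*(x)]^2
variance = predictions.var()                     # variance[f*(x)]
print(bias_sq, variance, bias_sq + variance + sigma**2)  # last term estimates expected test error
```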