\(\mappingFunction(\cdot)\) is a scalar function with vector inputs,
\(\activationVector(\cdot)\) is a vector function with vector inputs.
dimensionality of the vector function is known as the number of hidden units, or the number of neurons.
elements of \(\activationVector(\cdot)\) are the activation functions of the neural network
elements of \(\mappingMatrix_{1}\) are the parameters of the activation functions.
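A minimal sketch of this composition, assuming a single hidden layer with tanh activations. The names `activation`, `prediction`, `W1` and `w2` are illustrative stand-ins for \(\activationVector(\cdot)\), \(\mappingFunction(\cdot)\) and the two sets of parameters; the choice of tanh is an assumption, not something fixed by the notes.

```python
import numpy as np

def activation(X, W1):
    """Vector function: one tanh hidden unit per column of W1 (tanh is an assumption)."""
    return np.tanh(X @ W1)

def prediction(X, W1, w2):
    """Scalar function: weighted sum of the hidden unit outputs for each row of X."""
    return activation(X, W1) @ w2

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))    # 5 data points, 3 input features
W1 = rng.normal(size=(3, 4))   # parameters of the activations, 4 hidden units
w2 = rng.normal(size=4)        # output parameters

Phi = activation(X, W1)        # shape (5, 4): one row of activations per data point
f = prediction(X, W1, w2)      # shape (5,): one scalar prediction per data point
```

The number of columns of `W1` sets the number of hidden units, so the dimensionality of the vector function is controlled by the shape of that matrix.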
Relations with Classical Statistics
In statistics activation functions are known as basis functions.
A statistician would think of this as a linear model: the predictions are not linear in the inputs, but the model is linear in the parameters.
\(\mappingMatrix_{1}\) are static parameters.
Adaptive Basis Functions
In machine learning we optimize \(\mappingMatrix_{1}\) as well as \(\mappingMatrix_{2}\) (which would normally be denoted in statistics by \(\boldsymbol{\beta}\)).
Machine Learning
observe a system in practice
emulate its behavior with mathematics.
Design challenge: where to put mathematical function.
We are given a data set containing ‘inputs’, \(\inputMatrix\) and ‘targets’, \(\dataVector\).
Each data point consists of an input vector \(\inputVector_i\) and a class label, \(\dataScalar_i\).
For binary classification assume \(\dataScalar_i\) should be either \(1\) (yes) or \(-1\) (no).
Input vector can be thought of as features.
Discrete Probability
Algorithms based on prediction function and objective function.
For regression the codomain of the functions, \(f(\inputMatrix)\) was the real numbers or sometimes real vectors.
In classification we are given an input vector, \(\inputVector\), and an associated label, \(\dataScalar\) which either takes the value \(-1\) or \(1\).
Classification
Inputs, \(\inputVector\), mapped to a label, \(\dataScalar\), through a function \(\mappingFunction(\cdot)\) dependent on parameters, \(\weightVector\), \[
\dataScalar = \mappingFunction(\inputVector; \weightVector).
\]
\(\mappingFunction(\cdot)\) is known as the prediction function.
Classification Examples
Classifying handwritten digits from binary images (automatic zip code reading)
Detecting faces in images (e.g. digital cameras).
Who a detected face belongs to (e.g. Facebook, DeepFace)
Classifying type of cancer given gene expression data.
Categorization of document types (different types of news article on the internet)
Perceptron
Simple classification with the perceptron algorithm.
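As a rough illustration, the sketch below implements a standard textbook form of the perceptron update for labels in \(\{-1, 1\}\). It cycles through the data in order, which may differ from the variant used in the lecture, and the function name, learning rate and toy data are all assumptions made for the example.

```python
import numpy as np

def perceptron(X, y, max_epochs=100, learn_rate=1.0):
    """Textbook perceptron updates for labels y in {-1, +1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_epochs):
        errors = 0
        for x_i, y_i in zip(X, y):
            if y_i * (x_i @ w + b) <= 0:   # misclassified, or on the boundary
                w += learn_rate * y_i * x_i
                b += learn_rate * y_i
                errors += 1
        if errors == 0:                    # a full pass with no mistakes
            break
    return w, b

# Toy, linearly separable data (hypothetical).
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -1.0], [-1.0, -3.0]])
y = np.array([1, 1, -1, -1])
w, b = perceptron(X, y)
predictions = np.sign(X @ w + b)
```

The weights only change when a point is misclassified, so on separable data the loop stops once every point lies on the correct side of the boundary.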
Logistic Regression and GLMs
Modelling entire density allows any question to be answered (also missing data).
Comes at the possible expense of strong assumptions about data generation distribution.
In regression we model probability of \(\dataScalar_i |\inputVector_i\) directly.
Allows less flexibility in the question, but more flexibility in the model assumptions.
Can do this not just for regression, but classification.
Framework is known as generalized linear models.
Log Odds
model the log-odds with the basis functions.
odds are defined as the ratio of the probability of a positive outcome, to the probability of a negative outcome.
Probability is between zero and one, odds are: \[ \frac{\pi}{1-\pi} \]
Odds are between \(0\) and \(\infty\).
Logarithm of odds maps them to \(-\infty\) to \(\infty\).
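A small numerical check of these ranges, using a few illustrative probabilities (the values are chosen only for the example):

```python
import numpy as np

pi = np.array([0.1, 0.5, 0.9])   # probabilities of a positive outcome
odds = pi / (1 - pi)             # lies between 0 and infinity
log_odds = np.log(odds)          # lies between -infinity and infinity

for p, o, lo in zip(pi, odds, log_odds):
    print(f"pi={p:.1f}  odds={o:.3f}  log-odds={lo:+.3f}")
```

A probability of one half gives odds of one and log-odds of zero, and complementary probabilities give log-odds of equal size and opposite sign.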
Logit Link Function
The Logit function, \[g^{-1}(\pi_i) = \log\frac{\pi_i}{1-\pi_i}.\] This function is known as a link function.
For a standard regression we take, \[f(\inputVector_i) = \mappingVector^\top \basisVector(\inputVector_i),\]
For classification we perform a logistic regression. \[\log \frac{\pi_i}{1-\pi_i} = \mappingVector^\top \basisVector(\inputVector_i)\]
Inverse Link Function
We have defined the link function as taking the form \(g^{-1}(\cdot)\) implying that the inverse link function is given by \(g(\cdot)\). Since we have defined, \[
g^{-1}(\pi(\inputVector)) = \mappingVector^\top\basisVector(\inputVector)
\] we can write \(\pi\) in terms of the inverse link function, \(g(\cdot)\) as \[
\pi(\inputVector) = g(\mappingVector^\top\basisVector(\inputVector)).
\]
Logistic function
Logistic (or sigmoid) squashes real line to between 0 & 1. Sometimes also called a ‘squashing function’.
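The relationship between the link and inverse link functions can be checked numerically. In this sketch `logit` plays the role of \(g^{-1}(\cdot)\) and `logistic` the role of \(g(\cdot)\); both function names are chosen for the example.

```python
import numpy as np

def logit(pi):
    """Link function: log-odds of a probability."""
    return np.log(pi / (1 - pi))

def logistic(f):
    """Inverse link function: squashes the real line to (0, 1)."""
    return 1 / (1 + np.exp(-f))

f = np.linspace(-5, 5, 11)
pi = logistic(f)          # every value lies strictly between 0 and 1
recovered = logit(pi)     # applying the link recovers the original f
```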
Basis Function
Prediction Function
Can now write \(\pi\) as a function of the input and the parameter vector as, \[\pi(\inputVector,\mappingVector) = \frac{1}{1+
\exp\left(-\mappingVector^\top \basisVector(\inputVector)\right)}.\]
Compute the output of a standard linear basis function composition (\(\mappingVector^\top \basisVector(\inputVector)\), as we did for linear regression)
Apply the inverse link function, \(g(\mappingVector^\top \basisVector(\inputVector))\).
Use this value in a Bernoulli distribution to form the likelihood.
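The three steps above can be sketched as follows. The quadratic basis, the parameter values and the data are hypothetical, and the outcomes are coded as 0 or 1 so that they fit the Bernoulli form.

```python
import numpy as np

def basis(x):
    """Hypothetical basis for scalar inputs: [1, x, x^2]."""
    return np.stack([np.ones_like(x), x, x**2], axis=1)

def inverse_link(f):
    """Logistic function g(f)."""
    return 1 / (1 + np.exp(-f))

x = np.array([-1.0, 0.0, 2.0])
y = np.array([0, 1, 1])              # outcomes coded as 0/1
w = np.array([0.5, 1.0, -0.25])      # illustrative parameter values

f = basis(x) @ w                             # step 1: linear basis function composition
pi = inverse_link(f)                         # step 2: apply the inverse link
likelihood = pi**y * (1 - pi)**(1 - y)       # step 3: Bernoulli probability of each outcome
```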
Bernoulli Reminder
From last time, with \(\dataScalar_i\) coded as \(0\) or \(1\), \[P(\dataScalar_i|\mappingVector, \inputVector_i) = \pi_i^{\dataScalar_i} (1-\pi_i)^{1-\dataScalar_i}\]
Probability of positive outcome for the \(i\)th data point \[\pi_i = g\left(\mappingVector^\top \basisVector(\inputVector_i)\right),\] where \(g(\cdot)\) is the inverse link function
Objective function of the form \[\begin{align*}
E(\mappingVector) = & - \sum_{i=1}^\numData \dataScalar_i \log
g\left(\mappingVector^\top \basisVector(\inputVector_i)\right) \\& -
\sum_{i=1}^\numData(1-\dataScalar_i)\log \left(1-g\left(\mappingVector^\top
\basisVector(\inputVector_i)\right)\right).
\end{align*}\]
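A minimal sketch of this objective, assuming the basis functions have already been evaluated into a design matrix `Phi` (one row per data point) and the labels are coded as 0 or 1. The function name and the toy data are illustrative only.

```python
import numpy as np

def objective(w, Phi, y):
    """Negative log-likelihood E(w) for labels y in {0, 1}."""
    pi = 1 / (1 + np.exp(-(Phi @ w)))
    return -np.sum(y * np.log(pi)) - np.sum((1 - y) * np.log(1 - pi))

# Toy design matrix: a constant basis and a linear basis (hypothetical).
Phi = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0, 0, 1, 1])

E_zero = objective(np.zeros(2), Phi, y)            # every pi_i is 0.5
E_good = objective(np.array([0.0, 2.0]), Phi, y)   # slope separates the classes
```

With all parameters at zero each point contributes \(\log 2\) to the objective; parameters that push \(\pi_i\) towards the observed labels give a lower value.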
Also need gradient of inverse link function wrt parameters. \[\begin{align*}
g(\mappingFunction_i) &= \frac{1}{1+\exp(-\mappingFunction_i)}\\
&=(1+\exp(-\mappingFunction_i))^{-1}
\end{align*}\] and the gradient can be computed as \[\begin{align*}
\frac{\text{d}g(\mappingFunction_i)}{\text{d} \mappingFunction_i} & =
\exp(-\mappingFunction_i)(1+\exp(-\mappingFunction_i))^{-2}\\
& = \frac{1}{1+\exp(-\mappingFunction_i)}
\frac{\exp(-\mappingFunction_i)}{1+\exp(-\mappingFunction_i)} \\
& = g(\mappingFunction_i) (1-g(\mappingFunction_i))
\end{align*}\]
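The closed form \(g(1-g)\) can be sanity-checked against a central finite difference. This is only a numerical check of the derivation above; the step size `eps` is an arbitrary choice.

```python
import numpy as np

def g(f):
    """Logistic inverse link function."""
    return 1 / (1 + np.exp(-f))

def g_grad(f):
    """Analytic gradient g(f) (1 - g(f))."""
    return g(f) * (1 - g(f))

f = np.linspace(-4, 4, 9)
eps = 1e-6
numeric = (g(f + eps) - g(f - eps)) / (2 * eps)   # central difference
max_error = np.max(np.abs(numeric - g_grad(f)))
```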
Similarly to matrix factorization, for large data stochastic gradient descent (the Robbins-Monro optimization procedure; Robbins and Monro, 1951) works well.
Batch Gradient Descent
Stochastic Gradient Descent
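The two procedures differ only in how much data each update uses. Applying the chain rule to the objective with the gradient of the inverse link gives \(-\sum_i (\dataScalar_i - \pi_i)\basisVector(\inputVector_i)\) for the full gradient. The sketch below contrasts one update from that sum with one update from a single term, on synthetic data; the learning rates, iteration counts and data-generating slope are assumptions made for the example.

```python
import numpy as np

def g(f):
    return 1 / (1 + np.exp(-f))

def batch_step(w, Phi, y, learn_rate):
    """One update using the gradient summed over all data points."""
    grad = -Phi.T @ (y - g(Phi @ w))
    return w - learn_rate * grad

def stochastic_step(w, phi_i, y_i, learn_rate):
    """One update using the gradient from a single data point."""
    grad = -(y_i - g(phi_i @ w)) * phi_i
    return w - learn_rate * grad

# Synthetic data: labels drawn with log-odds 2x (hypothetical).
rng = np.random.default_rng(1)
x = rng.normal(size=200)
Phi = np.stack([np.ones_like(x), x], axis=1)
y = (rng.uniform(size=200) < g(2.0 * x)).astype(float)

w_batch = np.zeros(2)
for _ in range(200):
    w_batch = batch_step(w_batch, Phi, y, learn_rate=0.01)

w_sgd = np.zeros(2)
for epoch in range(20):
    for i in rng.permutation(200):       # visit points in random order
        w_sgd = stochastic_step(w_sgd, Phi[i], y[i], learn_rate=0.05)
```

Each stochastic update touches one row of `Phi`, which is what makes the approach attractive when the data set is too large to sum over at every step.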
Regression
Classification is discrete output.
Regression is a continuous output.
Regression Examples
Predict a real value, \(\dataScalar_i\) given some inputs \(\inputVector_i\).
Predict quality of meat given spectral measurements (Tecator data).
Radiocarbon dating, the C14 calibration curve: predict age given quantity of C14 isotope.
Predict quality of different Go or Backgammon moves given expert rated training data.
Supervised Learning Challenges
choosing which features, \(\inputVector\), are relevant in the prediction
defining the appropriate class of function, \(\mappingFunction(\cdot)\).
selecting the right parameters, \(\weightVector\).
Feature Selection
Olympic prediction example only using year to predict pace.
What else could we use?
Can use feature selection algorithms
Applications
rank search results, what adverts to show, newsfeed ranking
Features: number of likes, image present, friendship relationship
Class of Function, \(\mappingFunction(\cdot)\)
Mapping characteristic between \(\inputVector\) and \(\dataScalar\)?
e.g. a cat remains a cat regardless of location (translation), size (scale) or upside-down (rotation and reflection).
Deep Learning
These are interpretable models: vital for disease modeling etc.
Modern machine learning methods are less interpretable
Example: face recognition
DeepFace
Outline of the DeepFace architecture. A front-end of a single convolution-pooling-convolution filtering on the rectified input, followed by three locally-connected layers and two fully-connected layers. Color illustrates feature maps produced at each layer. The net includes more than 120 million parameters, where more than 95% come from the local and fully connected layers.
Source: DeepFace (Taigman et al., 2014)
Deep Learning as Pinball
Encoding Knowledge
Encoding invariance is encoding knowledge
Unspecified invariances must be learned
Learning may require a lot more data.
Less data efficient
Choosing Prediction Function
Any function, e.g. polynomials for the Olympic data \[
\mappingFunction(\inputScalar) = \weightScalar_0 + \weightScalar_1 \inputScalar+ \weightScalar_2 \inputScalar^2 + \weightScalar_3 \inputScalar^3 + \weightScalar_4 \inputScalar^4.
\]
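A short sketch of evaluating such a polynomial by building one basis column per power of the input. The parameter values are illustrative and are not fitted to the Olympic data.

```python
import numpy as np

def polynomial(x, w):
    """f(x) = w[0] + w[1] x + ... + w[K] x^K, with K = len(w) - 1."""
    Phi = np.stack([x**k for k in range(len(w))], axis=1)   # one column per power
    return Phi @ w

x = np.array([0.0, 1.0, 2.0])
w = np.array([1.0, 0.5, 0.0, 0.0, 0.25])   # illustrative values for w_0 ... w_4
f = polynomial(x, w)
```

Written this way the function is linear in the parameters `w`, even though it is nonlinear in the input `x`.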
Parameter Estimation: Objective Functions
After choosing features and function class we need parameters.
Estimate \(\weightVector\) by specifying an objective function.
Given \(\numData\) inputs, \(\inputVector_1\), \(\inputVector_2\), \(\inputVector_3\), \(\dots\), \(\inputVector_\numData\)
And labels \(\dataScalar_1\), \(\dataScalar_2\), \(\dataScalar_3\), \(\dots\), \(\dataScalar_\numData\).
Sometimes label is cheap e.g. Newsfeed ranking
Often it is very expensive.
Manual labour
Annotation
Human annotators
E.g. ImageNet was annotated using Amazon’s Mechanical Turk. (AI?)
Without humans no AI.
Not real intelligence, emulated
Annotation
Some tasks easier to annotate than others.
Sometimes annotation requires an experiment (Tecator data)
Annotation
Even for easy tasks there will be problems.
E.g. humans extrapolate the context of an image.
Quality of ML is very sensitive to data.
Investing in processes and tools is vital.
Misrepresentation and Bias
Bias can appear in the model and the data
Data needs to be carefully collected
E.g. face detectors trained on Europeans tested in Africa.
Generalization and Overfitting
How does the model perform on previously unseen data?
Validation and Model Selection
Selecting model at the validation step
Difficult Trap
Vital that you avoid test data in training.
Validation data is different from test data.
Hold Out Validation on Olympic Marathon Data
Overfitting
As we increase the number of basis functions we obtain a better ‘fit’ to the data.
How will the model perform on previously unseen data?
Let’s consider predicting the future.
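The first point can be illustrated with a sketch on synthetic data (not the Olympic data): fitting polynomials of increasing degree by least squares, the training error cannot go up because each larger basis contains the smaller one. The data-generating function and noise level are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 20)
y = np.sin(3 * x) + 0.1 * rng.normal(size=20)   # synthetic training data

def train_error(degree):
    """Mean squared training error of a least-squares polynomial fit."""
    Phi = np.vander(x, degree + 1)               # degree + 1 basis functions
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return np.mean((y - Phi @ w) ** 2)

degrees = (1, 3, 5, 7)
errors = [train_error(d) for d in degrees]
```

A falling training error says nothing on its own about performance on unseen data, which is why the question above needs a separate validation set.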
Future Prediction: Extrapolation
Extrapolation
Here we are predicting beyond where the model has learnt.
This is known as extrapolation.
Extrapolation is predicting into the future here, but could be:
Predicting back to the unseen past (pre 1892)
Spatial prediction (e.g. Cholera rates outside Manchester given rates inside Manchester).
Interpolation
Predicting the winning time for the 1946 Olympics is interpolation.
This is because we have times from 1936 and 1948.
If we want a model for interpolation how can we test it?
One trick is to sample the validation set from throughout the data set.
Future Prediction: Interpolation
Choice of Validation Set
The choice of validation set should reflect how you will use the model in practice.
For extrapolation into the future we tried validating with data from the future.
For interpolation we chose the validation set from throughout the data.
For different validation sets we could get different results.
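The two choices of validation set can be sketched as index splits. The function names, the data set size of 27 and the validation size of 5 are hypothetical; the data are assumed to be ordered in time.

```python
import numpy as np

def extrapolation_split(n, n_val):
    """Hold out the last n_val points (e.g. the most recent years)."""
    idx = np.arange(n)
    return idx[:-n_val], idx[-n_val:]

def interpolation_split(n, n_val, seed=0):
    """Hold out n_val points sampled from throughout the data set."""
    rng = np.random.default_rng(seed)
    val = np.sort(rng.choice(n, size=n_val, replace=False))
    train = np.setdiff1d(np.arange(n), val)
    return train, val

train_e, val_e = extrapolation_split(27, 5)
train_i, val_i = interpolation_split(27, 5)
```

Changing `seed` gives a different interpolation split, which is one way to see how the result depends on the validation set chosen.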
Bias Variance Decomposition
Expected test error for different variations of the training data sampled from \(\Pr(\inputVector, \dataScalar)\), \[\mathbb{E}\left[ \left(\dataScalar - \mappingFunction^*(\inputVector)\right)^2 \right].\] Here \(\mappingFunction^*(\cdot)\) is the fitted model and \(\mappingFunction(\cdot)\) is the true underlying function. Decompose as \[\mathbb{E}\left[ \left(\dataScalar - \mappingFunction^*(\inputVector)\right)^2 \right] = \text{bias}\left[\mappingFunction^*(\inputVector)\right]^2 + \text{variance}\left[\mappingFunction^*(\inputVector)\right] +\sigma^2,\] where \(\sigma^2\) is the variance of the observation noise.
Bias
Given by \[\text{bias}\left[\mappingFunction^*(\inputVector)\right] =
\mathbb{E}\left[\mappingFunction^*(\inputVector)\right] - \mappingFunction(\inputVector)\]
Error due to bias comes from a model that’s too simple.
Variance
Given by \[\text{variance}\left[\mappingFunction^*(\inputVector)\right] = \mathbb{E}\left[\left(\mappingFunction^*(\inputVector) - \mathbb{E}\left[\mappingFunction^*(\inputVector)\right]\right)^2\right]\]
Slight variations in the training set cause changes in the prediction. Error due to variance is error in the model due to an overly complex model.
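Bias and variance can be estimated by simulation: refit the model on many resampled training sets and look at the spread of its prediction at one test input. The sketch below does this for a simple (degree 1) and a complex (degree 7) polynomial; the true function, noise level, test input and number of repeats are all assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    """Hypothetical true function generating the data."""
    return np.sin(3 * x)

x_train = np.linspace(-1, 1, 15)
x_test = 0.3
noise_std = 0.2

def predictions(degree, n_repeats=500):
    """Predictions at x_test from models fitted to resampled training sets."""
    preds = np.empty(n_repeats)
    for r in range(n_repeats):
        y = true_f(x_train) + noise_std * rng.normal(size=x_train.size)
        coeffs = np.polyfit(x_train, y, degree)
        preds[r] = np.polyval(coeffs, x_test)
    return preds

bias, variance = {}, {}
for degree in (1, 7):
    p = predictions(degree)
    bias[degree] = p.mean() - true_f(x_test)   # expected prediction minus truth
    variance[degree] = p.var()                 # spread across training sets
```

In this setup the straight line has the larger bias and the degree 7 polynomial the larger variance, matching the two sources of error described above.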
Figure: simple models on left, complex models on right
Andrade-Pacheco, R., Mubangizi, M., Quinn, J., Lawrence, N.D., 2014. Consistent mapping of government malaria records across a changing territory delimitation. Malaria Journal 13. https://doi.org/10.1186/1475-2875-13-S1-P5
Gelman, A., Carlin, J.B., Stern, H.S., Rubin, D.B., 2013. Bayesian data analysis, 3rd ed. Chapman & Hall.
McCulloch, W.S., Pitts, W., 1943. A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics 5, 115–133.
Mubangizi, M., Andrade-Pacheco, R., Smith, M.T., Quinn, J., Lawrence, N.D., 2014. Malaria surveillance with multiple data sources using Gaussian process models, in: 1st International Conference on the Use of Mobile ICT in Africa.
Robbins, H., Monro, S., 1951. A stochastic approximation method. Annals of Mathematical Statistics 22, 400–407.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L., 2015. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115, 211–252. https://doi.org/10.1007/s11263-015-0816-y
Taigman, Y., Yang, M., Ranzato, M., Wolf, L., 2014. DeepFace: Closing the gap to human-level performance in face verification, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. https://doi.org/10.1109/CVPR.2014.220