
# What is Machine Learning?

Data Science Africa Summer School, Addis Ababa, Ethiopia

## Introduction

Data Science Africa is a bottom-up initiative for capacity building in data science, machine learning and AI on the African continent.

## Example: Prediction of Malaria Incidence in Uganda

• Work with Ricardo Andrade-Pacheco, John Quinn and Martin Mubangizi (Makerere University, Uganda)
• See AI-DEV Group.

## Malaria Prediction in Uganda

(Andrade-Pacheco et al., 2014; Mubangizi et al., 2014)

## Rise of Machine Learning

• Driven by data and computation
• Fundamentally dependent on models

$\text{data} + \text{model} \stackrel{\text{compute}}{\rightarrow} \text{prediction}$

## Machine Learning in Supply Chain

• Supply chain: Large Automated Decision Making Network
• Major Challenge:
• We have a mechanistic understanding of supply chain.
• Machine learning is a data driven technology.

## For Africa

• Infrastructure dominated by information.

## Data Driven

• Machine Learning: Replicate Processes through direct use of data.
• Aim to emulate cognitive processes through the use of data.
• Use data to provide new approaches in control and optimization that should allow for emulation of human motor skills.

## Process Emulation

• Key idea: emulate the process as a mathematical function.
• Each function has a set of parameters which control its behaviour.
• Learning is the process of changing these parameters to change the shape of the function
• Choice of which class of mathematical functions we use is a vital component of our model.

## Olympic Marathon Data

Figure: Gold medal times for the Olympic marathon since 1896. Marathons before 1924 did not have a standardised distance, so results are presented as pace per km. The 1904 marathon was badly organised, leading to very slow times. Image from Wikimedia Commons http://bit.ly/16kMKHQ

## What does Machine Learning do?

• Automation scales by codifying processes and automating them.
• Need:
• Interconnected components
• Compatible components
• Early examples:
• cf Colt 45, Ford Model T

## Codify Through Mathematical Functions

• How does machine learning work?
• Jumper (jersey/sweater) purchase with logistic regression

$\text{odds} = \frac{p(\text{bought})}{p(\text{not bought})}$

$\log \text{odds} = \beta_0 + \beta_1 \text{age} + \beta_2 \text{latitude}.$

## Codify Through Mathematical Functions

• How does machine learning work?
• Jumper (jersey/sweater) purchase with logistic regression

$p(\text{bought}) = \sigmoid{\beta_0 + \beta_1 \text{age} + \beta_2 \text{latitude}}.$
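
As a minimal sketch of the formula above, the snippet below evaluates the purchase probability in Python. The coefficient values are made up for illustration and are not fitted to any data; the function names `sigmoid` and `p_bought` are hypothetical.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def p_bought(age, latitude, beta):
    """Probability of purchase under the logistic regression model."""
    beta_0, beta_1, beta_2 = beta
    log_odds = beta_0 + beta_1 * age + beta_2 * latitude
    return sigmoid(log_odds)

# Illustrative coefficients only (assumed, not estimated from data).
beta = (-3.0, 0.02, 0.04)
p = p_bought(age=40, latitude=53.0, beta=beta)

# Converting the probability back to odds recovers the linear log odds.
odds = p / (1.0 - p)
```

Taking `math.log(odds)` returns the linear combination of the inputs, which is why the two slide formulations describe the same model.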

## Codify Through Mathematical Functions

• How does machine learning work?
• Jumper (jersey/sweater) purchase with logistic regression

$p(\text{bought}) = \sigmoid{\boldsymbol{\beta}^\top \inputVector}.$

## Codify Through Mathematical Functions

• How does machine learning work?
• Jumper (jersey/sweater) purchase with logistic regression

$\dataScalar = \mappingFunction\left(\inputVector, \boldsymbol{\beta}\right).$

We call $\mappingFunction(\cdot)$ the prediction function.

## Fit to Data

• Use an objective function

$\errorFunction(\boldsymbol{\beta}, \dataMatrix, \inputMatrix)$

• E.g. least squares $\errorFunction(\boldsymbol{\beta}, \dataMatrix, \inputMatrix) = \sum_{i=1}^\numData \left(\dataScalar_i - \mappingFunction(\inputVector_i, \boldsymbol{\beta})\right)^2.$
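
The least squares objective can be sketched in a few lines of NumPy. This assumes a linear prediction function and a tiny synthetic data set; the names `prediction` and `least_squares` are hypothetical.

```python
import numpy as np

def prediction(X, beta):
    """Assumed linear prediction function, applied to each row of X."""
    return X @ beta

def least_squares(beta, y, X):
    """Sum of squared differences between observations and predictions."""
    residuals = y - prediction(X, beta)
    return float(np.sum(residuals ** 2))

# Synthetic example: the first column of X is a constant for the offset.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])

error_good = least_squares(np.array([1.0, 2.0]), y, X)  # parameters that fit exactly
error_bad = least_squares(np.array([0.0, 0.0]), y, X)   # predicts zero everywhere
```

Parameters that reproduce the data give a lower objective value, which is what fitting exploits.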

## Two Components

• Prediction function, $\mappingFunction(\cdot)$
• Objective function, $\errorFunction(\cdot)$

$\text{data} + \text{model} \stackrel{\text{compute}}{\rightarrow} \text{prediction}$

## From Model to Decision

 $\text{data} + \text{model} \stackrel{\text{compute}}{\rightarrow} \text{prediction}$

## Artificial Intelligence and Data Science

• AI aims to equip computers with human capabilities
• Image understanding
• Computer vision
• Speech recognition
• Natural language understanding
• Machine translation

## Supervised Learning for AI

• Dominant approach today:
• Generate large labelled data set from humans.
• Use supervised learning to emulate that data.
• E.g. ImageNet (Russakovsky et al., 2015)
• Significant advances due to deep learning
• E.g. Alexa, Amazon Go

## Data Science

• Arises from happenstance data.
• Differs from statistics in that the question comes after data collection.

## Neural Networks and Prediction Functions

• adaptive non-linear function models inspired by simple neuron models (McCulloch and Pitts, 1943)
• have become popular because of their ability to model data.
• can be composed to form highly complex functions
• start by focussing on one hidden layer

## Prediction Function of One Hidden Layer

$\mappingFunction(\inputVector) = \left.\mappingVector^{(2)}\right.^\top \activationVector(\mappingMatrix_{1}, \inputVector)$

$\mappingFunction(\cdot)$ is a scalar function with vector inputs,

$\activationVector(\cdot)$ is a vector function with vector inputs.

• dimensionality of the vector function is known as the number of hidden units, or the number of neurons.

• elements of $\activationVector(\cdot)$ are the activation functions of the neural network

• elements of $\mappingMatrix_{1}$ are the parameters of the activation functions.
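
The one hidden layer prediction function might be sketched as below. The choice of `tanh` as the activation, the layer sizes and the random parameter values are all assumptions made for illustration.

```python
import numpy as np

def activations(W1, x):
    """Vector of hidden unit outputs; tanh is an assumed activation function."""
    return np.tanh(W1 @ x)

def predict(x, W1, w2):
    """Scalar prediction: inner product of output weights and activations."""
    return float(w2 @ activations(W1, x))

rng = np.random.default_rng(0)
num_inputs, num_hidden = 3, 5
W1 = rng.normal(size=(num_hidden, num_inputs))  # parameters of the activation functions
w2 = rng.normal(size=num_hidden)                # output layer weights
x = np.array([0.5, -1.0, 2.0])

f = predict(x, W1, w2)
```

Here `num_hidden` is the number of hidden units, i.e. the dimensionality of the vector returned by `activations`.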

## Relations with Classical Statistics

• In statistics activation functions are known as basis functions.

• Statisticians would think of this as a linear model: not linear in its predictions, but linear in the parameters.

• In that view the parameters of the basis functions, $\mappingMatrix_{1}$, are static.

• In machine learning we optimize $\mappingMatrix_{1}$ as well as $\mappingVector^{(2)}$ (which would normally be denoted in statistics by $\boldsymbol{\beta}$).

## Machine Learning

1. observe a system in practice
2. emulate its behavior with mathematics.
• Design challenge: where to put mathematical function.
• Where it’s placed leads to different ML domains.

## Types of Machine Learning

1. Supervised learning
2. Unsupervised learning
3. Reinforcement learning

## Supervised Learning

• Widely deployed
• Particularly in classification.
• Input is e.g. image
• Output is class label (e.g. dog or cat).

## Classification

• Wake word classification (Global Pulse Project).
• Breakthrough in 2012 with ImageNet result of Alex Krizhevsky, Ilya Sutskever and Geoff Hinton

• We are given a data set containing ‘inputs’, $\inputMatrix$ and ‘targets’, $\dataVector$.
• Each data point consists of an input vector $\inputVector_i$ and a class label, $\dataScalar_i$.
• For binary classification assume $\dataScalar_i$ should be either $1$ (yes) or $-1$ (no).
• Input vector can be thought of as features.

## Discrete Probability

• Algorithms based on prediction function and objective function.
• For regression the codomain of the functions, $f(\inputMatrix)$ was the real numbers or sometimes real vectors.
• In classification we are given an input vector, $\inputVector$, and an associated label, $\dataScalar$ which either takes the value $-1$ or $1$.

## Classification

• Inputs, $\inputVector$, mapped to a label, $\dataScalar$, through a function $\mappingFunction(\cdot)$ dependent on parameters, $\weightVector$, $\dataScalar = \mappingFunction(\inputVector; \weightVector).$
• $\mappingFunction(\cdot)$ is known as the prediction function.

## Classification Examples

• Classifying handwritten digits from binary images (automatic zip code reading)
• Detecting faces in images (e.g. digital cameras).
• Who a detected face belongs to (e.g. Facebook, DeepFace)
• Classifying type of cancer given gene expression data.
• Categorization of document types (different types of news article on the internet)

## Perceptron

Simple classification with the perceptron algorithm.
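
A minimal sketch of the perceptron update rule is given below, assuming labels in $\{-1, 1\}$ and a small synthetic, linearly separable data set. The function name and its arguments are hypothetical; this is one common form of the algorithm rather than a definitive implementation.

```python
import numpy as np

def perceptron(X, y, max_epochs=100, seed=0):
    """Train a perceptron on inputs X (one row per point) and labels y in {-1, +1}."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for i in rng.permutation(len(y)):
            # A point is misclassified if the sign of the score disagrees with its label.
            if y[i] * (X[i] @ w + b) <= 0:
                w += y[i] * X[i]  # move the decision boundary towards the point
                b += y[i]
                mistakes += 1
        if mistakes == 0:  # every training point is classified correctly
            break
    return w, b

# Small linearly separable synthetic data set.
X = np.array([[2.0, 1.0], [3.0, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])

w, b = perceptron(X, y)
predictions = np.sign(X @ w + b)
```

The loop only stops early when the data are linearly separable; otherwise it runs for `max_epochs` passes.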

## Logistic Regression and GLMs

• Modelling entire density allows any question to be answered (also missing data).
• Comes at the possible expense of strong assumptions about data generation distribution.
• In regression we model probability of $\dataScalar_i |\inputVector_i$ directly.
• Allows less flexibility in the question, but more flexibility in the model assumptions.
• Can do this not just for regression, but classification.
• Framework is known as generalized linear models.

## Log Odds

• model the log-odds with the basis functions.
• odds are defined as the ratio of the probability of a positive outcome, to the probability of a negative outcome.
• Probability is between zero and one, odds are: $\frac{\pi}{1-\pi}$
• Odds are between $0$ and $\infty$.
• Logarithm of odds maps them to $-\infty$ to $\infty$.
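
As a worked example, with the value $\pi = 0.8$ chosen purely for illustration:

$\text{odds} = \frac{0.8}{1 - 0.8} = 4, \qquad \log \text{odds} = \log 4 \approx 1.39.$

A probability of $\pi = 0.5$ gives odds of $1$ and log odds of $0$, so positive log odds favour the positive outcome and negative log odds favour the negative one.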

## Logistic function

• Logistic (or sigmoid) squashes real line to between 0 & 1. Sometimes also called a ‘squashing function’.

## Prediction Function

• Can now write $\pi$ as a function of the input and the parameter vector as, $\pi(\inputVector,\mappingVector) = \frac{1}{1+ \exp\left(-\mappingVector^\top \basisVector(\inputVector)\right)}.$
• Compute the output of a standard linear basis function composition ($\mappingVector^\top \basisVector(\inputVector)$, as we did for linear regression)
• Apply the inverse link function, $g(\mappingVector^\top \basisVector(\inputVector))$.
• Use this value in a Bernoulli distribution to form the likelihood.

## Bernoulli Reminder

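• From last time, for a label $\dataScalar_i \in \{0, 1\}$: $P(\dataScalar_i|\mappingVector, \inputVector_i) = \pi_i^{\dataScalar_i} (1-\pi_i)^{1-\dataScalar_i}$

• The exponents are a trick for switching between probabilities: when $\dataScalar_i = 1$ the expression gives $\pi_i$, and when $\dataScalar_i = 0$ it gives $1-\pi_i$.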

## Maximum Likelihood

• Conditional independence of data: $P(\dataVector|\mappingVector, \inputMatrix) = \prod_{i=1}^\numData P(\dataScalar_i|\mappingVector, \inputVector_i).$

## Log Likelihood

\begin{align*} \log P(\dataVector|\mappingVector, \inputMatrix) = & \sum_{i=1}^\numData \log P(\dataScalar_i|\mappingVector, \inputVector_i) \\ = &\sum_{i=1}^\numData \dataScalar_i \log \pi_i \\ & + \sum_{i=1}^\numData (1-\dataScalar_i)\log (1-\pi_i) \end{align*}

## Objective Function

• Probability of positive outcome for the $i$th data point $\pi_i = g\left(\mappingVector^\top \basisVector(\inputVector_i)\right),$ where $g(\cdot)$ is the inverse link function
• Objective function of the form \begin{align*} E(\mappingVector) = & - \sum_{i=1}^\numData \dataScalar_i \log g\left(\mappingVector^\top \basisVector(\inputVector_i)\right) \\& - \sum_{i=1}^\numData(1-\dataScalar_i)\log \left(1-g\left(\mappingVector^\top \basisVector(\inputVector_i)\right)\right). \end{align*}

## Minimize Objective

• Gradient with respect to $\mappingVector$, writing $\mappingFunction_i = \mappingVector^\top \basisVector(\inputVector_i)$: \begin{align*} \frac{\text{d}E(\mappingVector)}{\text{d}\mappingVector} = & -\sum_{i=1}^\numData \frac{\dataScalar_i}{g\left(\mappingVector^\top \basisVector(\inputVector_i)\right)}\frac{\text{d}g(\mappingFunction_i)}{\text{d}\mappingFunction_i} \basisVector(\inputVector_i) \\ & + \sum_{i=1}^\numData \frac{1-\dataScalar_i}{1-g\left(\mappingVector^\top \basisVector(\inputVector_i)\right)}\frac{\text{d}g(\mappingFunction_i)}{\text{d}\mappingFunction_i} \basisVector(\inputVector_i) \end{align*}

• For the logistic function, $\frac{\text{d}g(\mappingFunction_i)}{\text{d}\mappingFunction_i} = g(\mappingFunction_i)\left(1-g(\mappingFunction_i)\right)$, so the gradient simplifies to \begin{align*} \frac{\text{d}E(\mappingVector)}{\text{d}\mappingVector} = & -\sum_{i=1}^\numData \dataScalar_i\left(1-g\left(\mappingVector^\top \basisVector(\inputVector_i)\right)\right) \basisVector(\inputVector_i) \\ & + \sum_{i=1}^\numData (1-\dataScalar_i)\left(g\left(\mappingVector^\top \basisVector(\inputVector_i)\right)\right) \basisVector(\inputVector_i). \end{align*}
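
One way to check a gradient expression like this is to compare it against finite differences. The sketch below assumes labels in $\{0, 1\}$, a logistic inverse link, and a randomly generated design matrix `Phi` whose rows stand in for the basis vectors; all names and values are illustrative.

```python
import numpy as np

def sigmoid(f):
    """Logistic inverse link function."""
    return 1.0 / (1.0 + np.exp(-f))

def objective(w, Phi, y):
    """Negative log likelihood for labels y in {0, 1}."""
    g = sigmoid(Phi @ w)
    return -np.sum(y * np.log(g) + (1 - y) * np.log(1 - g))

def gradient(w, Phi, y):
    """Gradient of the objective, using dg/df = g(1 - g) for the logistic function."""
    g = sigmoid(Phi @ w)
    return -Phi.T @ (y * (1 - g)) + Phi.T @ ((1 - y) * g)

# Synthetic problem: 20 data points, 3 basis functions.
rng = np.random.default_rng(1)
Phi = rng.normal(size=(20, 3))
y = (rng.uniform(size=20) < 0.5).astype(float)
w = rng.normal(size=3)

# Central finite differences along each coordinate direction.
eps = 1e-6
numerical = np.array([
    (objective(w + eps * e, Phi, y) - objective(w - eps * e, Phi, y)) / (2 * eps)
    for e in np.eye(3)
])
analytic = gradient(w, Phi, y)
```

The two terms of the analytic gradient combine to `Phi.T @ (g - y)`, which is the form usually used in practice.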

## Optimization of the Function

• Can’t find a stationary point of the objective function analytically.
• Optimization has to proceed by numerical methods.
• Similarly to matrix factorization, for large data stochastic gradient descent (the Robbins-Monro optimization procedure; Robbins and Monro, 1951) works well.
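
A minimal sketch of stochastic gradient descent for this objective is shown below. It assumes labels in $\{0, 1\}$, synthetic data, and a fixed learning rate for simplicity (the Robbins-Monro conditions call for a decreasing step size); the function and parameter names are hypothetical.

```python
import numpy as np

def sigmoid(f):
    return 1.0 / (1.0 + np.exp(-f))

def sgd_logistic(Phi, y, learning_rate=0.1, num_epochs=200, seed=0):
    """Fit logistic regression weights by visiting one data point at a time."""
    rng = np.random.default_rng(seed)
    w = np.zeros(Phi.shape[1])
    for _ in range(num_epochs):
        for i in rng.permutation(len(y)):
            g_i = sigmoid(Phi[i] @ w)
            # Gradient of the i-th term of the objective is (g_i - y_i) * phi_i.
            w -= learning_rate * (g_i - y[i]) * Phi[i]
    return w

# Synthetic data drawn from a logistic model with assumed weights.
rng = np.random.default_rng(2)
x = rng.uniform(-3, 3, size=200)
Phi = np.column_stack([np.ones_like(x), x])  # basis: constant and linear term
true_w = np.array([0.5, 2.0])
y = (rng.uniform(size=200) < sigmoid(Phi @ true_w)).astype(float)

w = sgd_logistic(Phi, y)
accuracy = np.mean((sigmoid(Phi @ w) > 0.5) == (y == 1))
```

Each update uses a single data point, so the cost per step does not grow with the size of the data set.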

## Regression

• Classification is discrete output.
• Regression is a continuous output.

## Regression Examples

• Predict a real value, $\dataScalar_i$ given some inputs $\inputVector_i$.
• Predict quality of meat given spectral measurements (Tecator data).
• Radiocarbon dating, the C14 calibration curve: predict age given quantity of C14 isotope.
• Predict quality of different Go or Backgammon moves given expert rated training data.

## Supervised Learning Challenges

1. choosing which features, $\inputVector$, are relevant in the prediction
2. defining the appropriate class of function, $\mappingFunction(\cdot)$.
3. selecting the right parameters, $\weightVector$.

## Feature Selection

• Olympic prediction example only using year to predict pace.
• What else could we use?
• Can use feature selection algorithms

## Applications

• rank search results, what adverts to show, newsfeed ranking
• Features: number of likes, image present, friendship relationship

## Class of Function, $\mappingFunction(\cdot)$

• Mapping characteristic between $\inputVector$ and $\dataScalar$?
• smooth (similar inputs lead to similar outputs).
• linear function.
• In forecasting, periodic

## Gelman Book

Gelman et al. (2013)

## Class of Function: Neural Networks

• ImageNet: convolutional neural network
• Convolutional neural network introduces invariances

## Class of Function: Invariances

• An invariance is a transformation of the input that should leave the output unchanged
• e.g. a cat remains a cat regardless of location (translation), size (scale) or upside-down (rotation and reflection).

## Deep Learning

• The models above (e.g. linear and generalized linear models) are interpretable: vital for disease modeling etc.

• Modern machine learning methods are less interpretable

• Example: face recognition

## DeepFace

Outline of the DeepFace architecture. A front-end of a single convolution-pooling-convolution filtering on the rectified input, followed by three locally-connected layers and two fully-connected layers. Color illustrates feature maps produced at each layer. The net includes more than 120 million parameters, where more than 95% come from the local and fully connected layers.

Source: DeepFace (Taigman et al., 2014)

## Encoding Knowledge

• Encoding invariance is encoding knowledge
• Unspecified invariances must be learned
• Learning may require a lot more data.
• Less data efficient

## Choosing Prediction Function

• Any function, e.g. polynomials for the Olympic data: $\mappingFunction(\inputScalar) = \weightScalar_0 + \weightScalar_1 \inputScalar+ \weightScalar_2 \inputScalar^2 + \weightScalar_3 \inputScalar^3 + \weightScalar_4 \inputScalar^4.$
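
A fourth order polynomial like this can be fitted by least squares, as sketched below. The data here are synthetic and noise free (not the Olympic marathon data), and the helper names are hypothetical. Inputs are scaled to $[-1, 1]$ because raising raw years to the fourth power leads to a badly conditioned problem.

```python
import numpy as np

def polynomial_basis(x, degree):
    """Columns 1, x, x^2, ..., x^degree for a vector of scalar inputs."""
    return np.vander(x, degree + 1, increasing=True)

def fit_polynomial(x, y, degree):
    """Least squares estimate of the polynomial weights."""
    Phi = polynomial_basis(x, degree)
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return w

def predict(x, w):
    """Evaluate the polynomial with weights w at inputs x."""
    return polynomial_basis(x, len(w) - 1) @ w

# Synthetic inputs rescaled to [-1, 1]; NOT the Olympic marathon data.
x = np.linspace(-1.0, 1.0, 30)
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = predict(x, true_w)

w = fit_polynomial(x, y, degree=4)
```

Because the synthetic targets contain no noise, the fit recovers the weights used to generate them.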

## Parameter Estimation: Objective Functions

• After choosing features and function class we need parameters.
• Estimate $\weightVector$ by specifying an objective function.

## Labels and Squared Error

• Label comes from supervisor or annotator.
• For regression squared error, $\errorFunction(\weightVector) = \sum_{i=1}^\numData (\dataScalar_i - \mappingFunction(\inputVector_i; \weightVector))^2$

## Data Provision

• Given $\numData$ inputs, $\inputVector_1$, $\inputVector_2$, $\inputVector_3$, $\dots$, $\inputVector_\numData$
• And labels $\dataScalar_1$, $\dataScalar_2$, $\dataScalar_3$, $\dots$, $\dataScalar_\numData$.
• Sometimes label is cheap e.g. Newsfeed ranking
• Often it is very expensive.
• Manual labour

## Annotation

• Human annotators
• E.g. in ImageNet annotated using Amazon’s Mechanical Turk. (AI?)
• Without humans no AI.
• Not real intelligence, emulated

## Annotation

• Some tasks easier to annotate than others.
• Sometimes annotation requires an experiment (Tecator data)

## Annotation

• Even for easy tasks there will be problems.
• E.g. humans extrapolate the context of an image.
• Quality of ML is very sensitive to data.
• Investing in processes and tools is vital.

## Misrepresentation and Bias

• Bias can appear in the model and the data
• Data needs to be carefully collected
• E.g. face detectors trained on Europeans tested in Africa.

## Generalization and Overfitting

• How does the model perform on previously unseen data?

## Validation and Model Selection

• Selecting model at the validation step

## Difficult Trap

• Vital that you avoid test data in training.
• Validation data is different from test data.

## Overfitting

• As we increase the number of basis functions we obtain a better ‘fit’ to the data.
• How will the model perform on previously unseen data?
• Let’s consider predicting the future.

## Extrapolation

• Here we are predicting beyond where the model has been trained.
• This is known as extrapolation.
• Extrapolation is predicting into the future here, but could be:
• Predicting back to the unseen past (pre 1896)
• Spatial prediction (e.g. Cholera rates outside Manchester given rates inside Manchester).

## Interpolation

• Predicting the winning time for a hypothetical 1946 Olympics is interpolation.
• This is because we have times from 1936 and 1948.
• If we want a model for interpolation how can we test it?
• One trick is to sample the validation set from throughout the data set.

## Choice of Validation Set

• The choice of validation set should reflect how you will use the model in practice.
• For extrapolation into the future we tried validating with data from the future.
• For interpolation we chose validation set from data.
• For different validation sets we could get different results.
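
The two ways of choosing a validation set might be sketched as follows. The data are assumed to be sorted by time, and the sizes used (20 points, 5 held out) are arbitrary illustrative values; the function names are hypothetical.

```python
import numpy as np

def extrapolation_split(num_data, num_validation):
    """Hold out the most recent points (data assumed sorted by time)."""
    indices = np.arange(num_data)
    return indices[:-num_validation], indices[-num_validation:]

def interpolation_split(num_data, num_validation, seed=0):
    """Hold out points sampled at random from throughout the data set."""
    rng = np.random.default_rng(seed)
    permuted = rng.permutation(num_data)
    validation = np.sort(permuted[:num_validation])
    training = np.sort(permuted[num_validation:])
    return training, validation

train_ex, val_ex = extrapolation_split(num_data=20, num_validation=5)
train_in, val_in = interpolation_split(num_data=20, num_validation=5)
```

The first split tests prediction into the future; the second tests prediction between observed points.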

## Bias Variance Decomposition

Expected test error for different variations of the training data sampled from $\Pr(\inputVector, \dataScalar)$:

$\mathbb{E}\left[ \left(\dataScalar - \mappingFunction^*(\inputVector)\right)^2 \right]$

Decompose as

$\mathbb{E}\left[ \left(\dataScalar - \mappingFunction^*(\inputVector)\right)^2 \right] = \text{bias}\left[\mappingFunction^*(\inputVector)\right]^2 + \text{variance}\left[\mappingFunction^*(\inputVector)\right] +\sigma^2$

Here $\mappingFunction^*(\cdot)$ is the fitted model, $\mappingFunction(\cdot)$ is the true underlying function and $\sigma^2$ is the variance of the noise.

## Bias

• Given by $\text{bias}\left[\mappingFunction^*(\inputVector)\right] = \mathbb{E}\left[\mappingFunction^*(\inputVector)\right] - \mappingFunction(\inputVector)$
• Error due to bias comes from a model that’s too simple.

## Variance

• Given by $\text{variance}\left[\mappingFunction^*(\inputVector)\right] = \mathbb{E}\left[\left(\mappingFunction^*(\inputVector) - \mathbb{E}\left[\mappingFunction^*(\inputVector)\right]\right)^2\right]$
• Slight variations in the training set cause changes in the prediction. Error due to variance is error in the model due to an overly complex model.

Figure: simple models on left complex models on right
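
The decomposition can be estimated by simulation, as in the sketch below. It assumes a synthetic sinusoidal true function, Gaussian noise and polynomial models of two different degrees; none of these choices come from the lecture, and the function names are hypothetical.

```python
import numpy as np

def true_function(x):
    """Assumed underlying function used to generate synthetic data."""
    return np.sin(2 * np.pi * x)

def bias_variance(degree, num_repeats=200, num_data=20, noise_std=0.3, seed=0):
    """Estimate squared bias and variance of a polynomial fit over resampled training sets."""
    rng = np.random.default_rng(seed)
    x_test = np.linspace(0.05, 0.95, 50)
    predictions = np.empty((num_repeats, len(x_test)))
    for r in range(num_repeats):
        # Draw a fresh training set and refit the model.
        x = rng.uniform(0.0, 1.0, size=num_data)
        y = true_function(x) + noise_std * rng.normal(size=num_data)
        coefficients = np.polynomial.polynomial.polyfit(x, y, degree)
        predictions[r] = np.polynomial.polynomial.polyval(x_test, coefficients)
    mean_prediction = predictions.mean(axis=0)
    bias_squared = np.mean((mean_prediction - true_function(x_test)) ** 2)
    variance = np.mean(predictions.var(axis=0))
    return bias_squared, variance

bias_simple, var_simple = bias_variance(degree=1)    # simple model
bias_complex, var_complex = bias_variance(degree=7)  # complex model
```

Under these assumptions the linear model shows the larger bias and the degree 7 polynomial the larger variance.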

## Overfitting

Alex Ihler on Polynomials and Overfitting

## References

Andrade-Pacheco, R., Mubangizi, M., Quinn, J., Lawrence, N.D., 2014. Consistent mapping of government malaria records across a changing territory delimitation. Malaria Journal 13. https://doi.org/10.1186/1475-2875-13-S1-P5

Gelman, A., Carlin, J.B., Stern, H.S., Rubin, D.B., 2013. Bayesian data analysis, 3rd ed. Chapman & Hall.

McCulloch, W.S., Pitts, W., 1943. A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics 5, 115–133.

Mubangizi, M., Andrade-Pacheco, R., Smith, M.T., Quinn, J., Lawrence, N.D., 2014. Malaria surveillance with multiple data sources using Gaussian process models, in: 1st International Conference on the Use of Mobile ICT in Africa.

Robbins, H., Monro, S., 1951. A stochastic approximation method. Annals of Mathematical Statistics 22, 400–407.

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L., 2015. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115, 211–252. https://doi.org/10.1007/s11263-015-0816-y

Taigman, Y., Yang, M., Ranzato, M., Wolf, L., 2014. DeepFace: Closing the gap to human-level performance in face verification, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. https://doi.org/10.1109/CVPR.2014.220