[edit]
What is Machine Learning?
Abstract
In this talk we will introduce the fundamental ideas in machine learning. We’ll develop our exposition around the ideas of prediction function and the objective function. We don’t so much focus on the derivation of particular algorithms, but more the general principles involved to give an idea of the machine learning landscape.
Introduction
Data Science Africa [edit]
Data Science Africa is a bottom up initiative for capacity building in data science, machine learning and artificial intelligence on the African continent.
As of 2019 there have been five workshops and five schools, located in Nyeri, Kenya (twice); Kampala, Uganda; Arusha, Tanzania; Abuja, Nigeria; Addis Ababa, Ethiopia and Accra, Ghana. The next event is scheduled for June 2020 in Kampala, Uganda.
The main notion is endtoend data science. For example, going from data collection in the farmer’s field to decision making in the Ministry of Agriculture. Or going from malaria disease counts in health centers to medicine distribution.
The philosophy is laid out in (Lawrence 2015). The key idea is that the modern information infrastructure presents new solutions to old problems. Modes of development change because less capital investment is required to take advantage of this infrastructure. The philosophy is that local capacity building is the right way to leverage these challenges in addressing data science problems in the African context.
Data Science Africa is now a nongovermental organization registered in Kenya. The organising board of the meeting is entirely made up of scientists and academics based on the African continent.
Example: Prediction of Malaria Incidence in Uganda [edit]
As an example of using Gaussian process models within the full pipeline from data to decsion, we’ll consider the prediction of Malaria incidence in Uganda. For the purposes of this study malaria reports come in two forms, HMIS reports from health centres and Sentinel data, which is curated by the WHO. There are limited sentinel sites and many HMIS sites.
The work is from Ricardo Andrade Pacheco’s PhD thesis, completed in collaboration with John Quinn and Martin Mubangizi (AndradePacheco et al. 2014; Mubangizi et al. 2014). John and Martin were initally from the AIDEV group from the University of Makerere in Kampala and more latterly they were based at UN Global Pulse in Kampala.
Malaria data is spatial data. Uganda is split into districts, and health reports can be found for each district. This suggests that models such as conditional random fields could be used for spatial modelling, but there are two complexities with this. First of all, occasionally districts split into two. Secondly, sentinel sites are a specific location within a district, such as Nagongera which is a sentinel site based in the Tororo district.
(AndradePacheco et al. 2014; Mubangizi et al. 2014)
The common standard for collecting health data on the African continent is from the Health management information systems (HMIS). However, this data suffers from missing values (Gething et al. 2006) and diagnosis of diseases like typhoid and malaria may be confounded.
World Health Organization Sentinel Surveillance systems are set up “when highquality data are needed about a particular disease that cannot be obtained through a passive system”. Several sentinel sites give accurate assessment of malaria disease levels in Uganda, including a site in Nagongera.
In collaboration with the AI Research Group at Makerere we chose to investigate whether Gaussian process models could be used to assimilate information from these two different sources of disease informaton. Further, we were interested in whether local information on rainfall and temperature could be used to improve malaria estimates.
The aim of the project was to use WHO Sentinel sites, alongside rainfall and temperature, to improve predictions from HMIS data of levels of malaria.
Early Warning Systems
Health monitoring system for the Kabarole district. Here we have fitted the reports with a Gaussian process with an additive covariance function. It has two components, one is a long time scale component (in red above) the other is a short time scale component (in blue).
Monitoring proceeds by considering two aspects of the curve. Is the blue line (the short term report signal) above the red (which represents the long term trend? If so we have higher than expected reports. If this is the case and the gradient is still positive (i.e. reports are going up) we encode this with a red color. If it is the case and the gradient of the blue line is negative (i.e. reports are going down) we encode this with an amber color. Conversely, if the blue line is below the red and decreasing, we color green. On the other hand if it is below red but increasing, we color yellow.
This gives us an early warning system for disease. Red is a bad situation getting worse, amber is bad, but improving. Green is good and getting better and yellow good but degrading.
Finally, there is a gray region which represents when the scale of the effect is small.
These colors can now be observed directly on a spatial map of the districts to give an immediate impression of the current status of the disease across the country.
Machine Learning
This talk is a general introduction to machine learning, we will highlight the technical challenges and the current solutions. We will give an overview of what is machine learning and why it is important.
Rise of Machine Learning
Machine learning is the combination of data and models, through computation, to make predictions.
$$
\text{data} + \text{model} \xrightarrow{\text{compute}} \text{prediction}
$$
Data Revolution
Machine learning has risen in prominence due to the rise in data availability, and its interconnection with computers. The high bandwidth connection between data and computer leads to a new interaction between us and data via the computer. It is that channel that is being mediated by machine learning techniques.
Supply Chain [edit]
On Sunday mornings in Sheffield, I often used to run across Packhorse Bridge in Burbage valley. The bridge is part of an ancient network of trails crossing the Pennines that, before Turnpike roads arrived in the 18th century, was the main way in which goods were moved. Given that the moors around Sheffield were home to sand quarries, tin mines, lead mines and the villages in the Derwent valley were known for nail and pin manufacture, this wasn’t simply movement of agricultural goods, but it was the infrastructure for industrial transport.
The profession of leading the horses was known as a Jagger and leading out of the village of Hathersage is Jagger’s Lane, a trail that headed underneath Stanage Edge and into Sheffield.
The movement of goods from regions of supply to areas of demand is fundamental to our society. The physical infrastructure of supply chain has evolved a great deal over the last 300 years.
Cromford [edit]
Richard Arkwright is known as the father of the modern factory system. In 1771 he set up a Mill for spinning cotton yarn in the village of Cromford, in the Derwent Valley. The Derwent valley is relatively inaccessible. Raw cotton arrived in Liverpool from the US and India. It needed to be transported on packhorse across the bridleways of the Pennines. But Cromford was a good location due to proximity to Nottingham, where weavers where consuming the finished thread, and the availability of water power from small tributaries of the Derwent river for Arkwright’s water frames which automated the production of yarn from raw cotton.
By 1794 the Cromford Canal was opened to bring coal in to Cromford and give better transport to Nottingham. The construction of the canals was driven by the need to improve the transport infrastructure, facilitating the movement of goods across the UK. Canals, roads and railways were initially constructed by the economic need for moving goods. To improve supply chain.
The A6 now does pass through Cromford, but at the time he moved there there was merely a track. The High Peak Railway was opened in 1832, it is now converted to the High Peak Trail, but it remains the highest railway built in Britain.
Cooper (1991)
Containerization [edit]
Containerization has had a dramatic effect on global economics, placing many people in the developing world at the end of the supply chain.


For example, you can buy Wild Alaskan Cod fished from Alaska, processed in China, sold in North America. This is driven by the low cost of transport for frozen cod vs the higher relative cost of cod processing in the US versus China. Similarly, Scottish prawns are also processed in China for sale in the UK.
This effect on cost of transport vs cost of processing is the main driver of the topology of the modern supply chain and the associated effect of globalization. If transport is much cheaper than processing, then processing will tend to agglomerate in places where processing costs can be minimized.
Large scale global economic change has principally been driven by changes in the technology that drives supply chain.
Supply chain is a largescale automated decision making network. Our aim is to make decisions not only based on our models of customer behavior (as observed through data), but also by accounting for the structure of our fulfilment center, and delivery network.
Many of the most important questions in supply chain take the form of counterfactuals. E.g. “What would happen if we opened a manufacturing facility in Cambridge?” A counter factual is a question that implies a mechanistic understanding of a system. It goes beyond simple smoothness assumptions or translation invariants. It requires a physical, or mechanistic understanding of the supply chain network. For this reason, the type of models we deploy in supply chain often involve simulations or more mechanistic understanding of the network.
In supply chain Machine Learning alone is not enough, we need to bridge between models that contain real mechanisms and models that are entirely data driven.
This is challenging, because as we introduce more mechanism to the models we use, it becomes harder to develop efficient algorithms to match those models to data.
For Africa [edit]
There is a large opportunity because infrastructures around automation are moving from physical infrastructure towards information infrastructures. How can African countries benefit from a modern information infrastructure? The aim of Data Science Africa is to answer this question, with the answers coming from the attendees.
Machine learning aims to replicate processes through the direct use of data. When deployed in the domain of ‘artificial intelligence’, the processes that it is replicating, or emulating, are cognitive processes.
The first trick in machine learning is to convert the process itself into a mathematical function. That function has a set of parameters which control its behaviour. What we call learning is the adaption of these parameters to change the behavior of the function. The choice of mathematical function we use is a vital component of the model.
Nigerian NMIS Data [edit]
As an example data set we will use Nigerian NMIS Health Facility data from openAFRICA. It can be found here https://africaopendata.org/dataset/nigerianmishealthfacilitydata2014
Taking from the information on the site,
The Nigeria MDG (Millennium Development Goals) Information System – NMIS health facility data is collected by the Office of the Senior Special Assistant to the President on the Millennium Development Goals (OSSAPMDGs) in partner with the Sustainable Engineering Lab at Columbia University. A rigorous, georeferenced baseline facility inventory across Nigeria is created spanning from 2009 to 2011 with an additional survey effort to increase coverage in 2014, to build Nigeria’s first nationwide inventory of health facility. The database includes 34,139 health facilities info in Nigeria.
The goal of this database is to make the data collected available to planners, government officials, and the public, to be used to make strategic decisions for planning relevant interventions.
For data inquiry, please contact Ms. Funlola Osinupebi, Performance Monitoring & Communications, Advisory Power Team, Office of the Vice President at funlola.osinupebi@aptovp.org
To learn more, please visit http://csd.columbia.edu/2014/03/10/thenigeriamdginformationsystemnmistakesopendatafurther/
Suggested citation: Nigeria NMIS facility database (2014), the Office of the Senior Special Assistant to the President on the Millennium Development Goals (OSSAPMDGs) & Columbia University
Once it is loaded in the data can be summarized using the describe
method in pandas.
In python and jupyter notebook it is possible to see a list of all possible functions and attributes by typing the name of the object followed by .<Tab>
for example in the above case if we type data.<Tab>
it show the columns available (these are attributes in pandas dataframes) such as num_nurses_fulltime
, and also functions, such as .describe()
.
For functions we can also see the documentation about the function by following the name with a question mark. This will open a box with documentation at the bottom which can be closed with the x button.
The NMIS facility data is stored in an object known as a ‘data frame’. Data frames come from the statistical family of programming languages based on S
, the most widely used of which is R
. The data frame gives us a convenient object for manipulating data. The describe method summarizes which columns there are in the data frame and gives us counts, means, standard deviations and percentiles for the values in those columns. To access a column directly we can write
This shows the number of doctors per facility, number of nurses and number of community health workers (CHEWS). We can plot the number of doctors against the number of nurses as follows.
# this ensures the plot appears in the web browser
%matplotlib inline
import matplotlib.pyplot as plt # this imports the plotting library in python
You may be curious what the arguments we give to plt.plot
are for, now is the perfect time to look at the documentation
We immediately note that some facilities have a lot of nurses, which prevent’s us seeing the detail of the main number of facilities. First lets identify the facilities with the most nurses.
Here we are using the command data['num_nurses_fulltime']>100
to index the facilities in the pandas data frame which have over 100 nurses. To sort them in order we can also use the sort
command. The result of this command on its own is a data Series
of True
and False
values. However, when it is passed to the data
data frame it returns a new data frame which contains only those values for which the data series is True
. We can also sort the result. To sort the result by the values in the num_nurses_fulltime
column in descending order we use the following command.
We now see that the ‘University of Calabar Teaching Hospital’ is a large outlier with 513 nurses. We can try and determine how much of an outlier by histograming the data.
Plotting the Data
data['num_nurses_fulltime'].hist(bins=20) # histogram the data with 20 bins.
plt.title('Histogram of Number of Nurses')
We can’t see very much here. Two things are happening. There are so many facilities with zero or one nurse that we don’t see the histogram for hospitals with many nurses. We can try more bins and using a log scale on the yaxis.
data['num_nurses_fulltime'].hist(bins=100) # histogram the data with 20 bins.
plt.title('Histogram of Number of Nurses')
ax = plt.gca()
ax.set_yscale('log')
Let’s try and see how the number of nurses relates to the number of doctors.
fig, ax = plt.subplots(figsize=(10, 7))
ax.plot(data['num_doctors_fulltime'], data['num_nurses_fulltime'], 'rx')
ax.set_xscale('log') # use a logarithmic x scale
ax.set_yscale('log') # use a logarithmic Y scale
# give the plot some titles and labels
plt.title('Number of Nurses against Number of Doctors')
plt.ylabel('number of nurses')
plt.xlabel('number of doctors')
Note a few things. We are interacting with our data. In particular, we are replotting the data according to what we have learned so far. We are using the progamming language as a scripting language to give the computer one command or another, and then the next command we enter is dependent on the result of the previous. This is a very different paradigm to classical software engineering. In classical software engineering we normally write many lines of code (entire object classes or functions) before compiling the code and running it. Our approach is more similar to the approach we take whilst debugging. Historically, researchers interacted with data using a console. A command line window which allowed command entry. The notebook format we are using is slightly different. Each of the code entry boxes acts like a separate console window. We can move up and down the notebook and run each part in a different order. The state of the program is always as we left it after running the previous part.
What does Machine Learning do? [edit]
Any process of automation allows us to scale what we do by codifying a process in some way that makes it efficient and repeatable. Machine learning automates by emulating human (or other actions) found in data. Machine learning codifies in the form of a mathematical function that is learnt by a computer. If we can create these mathematical functions in ways in which they can interconnect, then we can also build systems.
Machine learning works through codifing a prediction of interest into a mathematical function. For example, we can try and predict the probability that a customer wants to by a jersey given knowledge of their age, and the latitude where they live. The technique known as logistic regression estimates the odds that someone will by a jumper as a linear weighted sum of the features of interest.
$$ \text{odds} = \frac{p(\text{bought})}{p(\text{not bought})} $$
log odds = β_{0} + β_{1}age + β_{2}latitude.
Here β_{0}, β_{1} and β_{2} are the parameters of the model. If β_{1} and β_{2} are both positive, then the logodds that someone will buy a jumper increase with increasing latitude and age, so the further north you are and the older you are the more likely you are to buy a jumper. The parameter β_{0} is an offset parameter, and gives the logodds of buying a jumper at zero age and on the equator. It is likely to be negative^{1} indicating that the purchase is oddsagainst. This is actually a classical statistical model, and models like logistic regression are widely used to estimate probabilities from adclick prediction to risk of disease.
This is called a generalized linear model, we can also think of it as estimating the probability of a purchase as a nonlinear function of the features (age, lattitude) and the parameters (the β values). The function is known as the sigmoid or logistic function, thus the name logistic regression.
$$ p(\text{bought}) = \sigmoid{\beta_0 + \beta_1 \text{age} + \beta_2 \text{latitude}}.$$
In the case where we have features to help us predict, we sometimes denote such features as a vector, $\inputVector$, and we then use an inner product between the features and the parameters, $\boldsymbol{\beta}^\top \inputVector = \beta_1 \inputScalar_1 + \beta_2 \inputScalar_2 + \beta_3 \inputScalar_3 ...$, to represent the argument of the sigmoid.
$$ p(\text{bought}) = \sigmoid{\boldsymbol{\beta}^\top \inputVector}.$$
More generally, we aim to predict some aspect of our data, $\dataScalar$, by relating it through a mathematical function, $\mappingFunction(\cdot)$, to the parameters, β and the data, $\inputVector$.
$$ \dataScalar = \mappingFunction\left(\inputVector, \boldsymbol{\beta}\right).$$
We call $\mappingFunction(\cdot)$ the prediction function.
To obtain the fit to data, we use a separate function called the objective function that gives us a mathematical representation of the difference between our predictions and the real data.
$$\errorFunction(\boldsymbol{\beta}, \dataMatrix, \inputMatrix)$$
A commonly used examples (for example in a regression problem) is least squares,
$$\errorFunction(\boldsymbol{\beta}, \dataMatrix, \inputMatrix) = \sum_{i=1}^\numData \left(\dataScalar_i  \mappingFunction(\inputVector_i, \boldsymbol{\beta})\right)^2.$$
If a linear prediction function is combined with the least squares objective function then that gives us a classical linear regression, another classical statistical model. Statistics often focusses on linear models because it makes interpretation of the model easier. Interpretation is key in statistics because the aim is normally to validate questions by analysis of data. Machine learning has typically focussed more on the prediction function itself and worried less about the interpretation of parameters, which are normally denoted by w instead of β. As a result nonlinear functions are explored more often as they tend to improve quality of predictions but at the expense of interpretability.
What is Machine Learning? [edit]
Machine learning allows us to extract knowledge from data to form a prediction.
$$\text{data} + \text{model} \xrightarrow{\text{compute}} \text{prediction}$$
A machine learning prediction is made by combining a model with data to form the prediction. The manner in which this is done gives us the machine learning algorithm.
Machine learning models are mathematical models which make weak assumptions about data, e.g. smoothness assumptions. By combining these assumptions with the data, we observe we can interpolate between data points or, occasionally, extrapolate into the future.
Machine learning is a technology which strongly overlaps with the methodology of statistics. From a historical/philosophical view point, machine learning differs from statistics in that the focus in the machine learning community has been primarily on accuracy of prediction, whereas the focus in statistics is typically on the interpretability of a model and/or validating a hypothesis through data collection.
The rapid increase in the availability of compute and data has led to the increased prominence of machine learning. This prominence is surfacing in two different but overlapping domains: data science and artificial intelligence.
From Model to Decision [edit]
The real challenge, however, is endtoend decision making. Taking information from the environment and using it to drive decision making to achieve goals.
Artificial Intelligence and Data Science [edit]
Artificial intelligence has the objective of endowing computers with humanlike intelligent capabilities. For example, understanding an image (computer vision) or the contents of some speech (speech recognition), the meaning of a sentence (natural language processing) or the translation of a sentence (machine translation).
Supervised Learning for AI
The machine learning approach to artificial intelligence is to collect and annotate a large data set from humans. The problem is characterized by input data (e.g. a particular image) and a label (e.g. is there a car in the image yes/no). The machine learning algorithm fits a mathematical function (I call this the prediction function) to map from the input image to the label. The parameters of the prediction function are set by minimizing an error between the function’s predictions and the true data. This mathematical function that encapsulates this error is known as the objective function.
This approach to machine learning is known as supervised learning. Various approaches to supervised learning use different prediction functions, objective functions or different optimization algorithms to fit them.
For example, deep learning makes use of neural networks to form the predictions. A neural network is a particular type of mathematical function that allows the algorithm designer to introduce invariances into the function.
An invariance is an important way of including prior understanding in a machine learning model. For example, in an image, a car is still a car regardless of whether it’s in the upper left or lower right corner of the image. This is known as translation invariance. A neural network encodes translation invariance in convolutional layers. Convolutional neural networks are widely used in image recognition tasks.
An alternative structure is known as a recurrent neural network (RNN). RNNs neural networks encode temporal structure. They use auto regressive connections in their hidden layers, they can be seen as time series models which have nonlinear autoregressive basis functions. They are widely used in speech recognition and machine translation.
Machine learning has been deployed in Speech Recognition (e.g. Alexa, deep neural networks, convolutional neural networks for speech recognition), in computer vision (e.g. Amazon Go, convolutional neural networks for person recognition and pose detection).
The field of data science is related to AI, but philosophically different. It arises because we are increasingly creating large amounts of data through happenstance rather than active collection. In the modern era data is laid down by almost all our activities. The objective of data science is to extract insights from this data.
Classically, in the field of statistics, data analysis proceeds by assuming that the question (or scientific hypothesis) comes before the data is created. E.g., if I want to determine the effectiveness of a particular drug, I perform a design for my data collection. I use foundational approaches such as randomization to account for confounders. This made a lot of sense in an era where data had to be actively collected. The reduction in cost of data collection and storage now means that many data sets are available which weren’t collected with a particular question in mind. This is a challenge because bias in the way data was acquired can corrupt the insights we derive. We can perform randomized control trials (or A/B tests) to verify our conclusions, but the opportunity is to use data science techniques to better guide our question selection or even answer a question without the expense of a full randomized control trial (referred to as A/B testing in modern internet parlance).
Neural Networks and Prediction Functions [edit]
Neural networks are adaptive nonlinear function models. Originally, they were studied (by McCulloch and Pitts (McCulloch and Pitts 1943)) as simple models for neurons, but over the last decade they have become popular because they are a flexible approach to modelling complex data. A particular characteristic of neural network models is that they can be composed to form highly complex functions which encode many of our expectations of the real world. They allow us to encode our assumptions about how the world works.
We will return to composition later, but for the moment, let’s focus on a one hidden layer neural network. We are interested in the prediction function, so we’ll ignore the objective function (which is often called an error function) for the moment, and just describe the mathematical object of interest
$$
\mappingFunction(\inputVector) = \mappingMatrix^\top \activationVector(\mappingMatrixTwo, \inputVector)
$$
Where in this case $\mappingFunction(\cdot)$ is a scalar function with vector inputs, and $\activationVector(\cdot)$ is a vector function with vector inputs. The dimensionality of the vector function is known as the number of hidden units, or the number of neurons. The elements of this vector function are known as the activation function of the neural network and $\mappingMatrixTwo$ are the parameters of the activation functions.
Relations with Classical Statistics
In statistics activation functions are traditionally known as basis functions. And we would think of this as a linear model. It’s doesn’t make linear predictions, but it’s linear because in statistics estimation focuses on the parameters, $\mappingMatrix$, not the parameters, $\mappingMatrixTwo$. The linear model terminology refers to the fact that the model is linear in the parameters, but it is not linear in the data unless the activation functions are chosen to be linear.
Adaptive Basis Functions
The first difference in the (early) neural network literature to the classical statistical literature is the decision to optimize these parameters, $\mappingMatrixTwo$, as well as the parameters, $\mappingMatrix$ (which would normally be denoted in statistics by β)^{2}.
Machine Learning
The key idea in machine learning is to observe the system in practice, and then emulate its behavior with mathematics. That leads to a design challenge as to where to place the mathematical function. The placement of the mathematical function leads to the different domains of machine learning.
 Supervised learning
 Unsupervised learning
 Reinforcement learning
Supervised learning is one of the most widely deployed machine learning technologies, and a particular domain of success has been classification. Classification is the process of taking an input (which might be an image) and categorizing it into one of a number of different classes (e.g. dog or cat). This simple idea underpins a lot of machine learning. By scanning across the image we can also determine where the animal is in the image.
Introduction to Classification [edit]
Classification is perhaps the technique most closely assocated with machine learning. In the speech based agents, ondevice classifiers are used to determine when the wake word is used. A wake word is a word that wakes up the device. For the Amazon Echo it is “Alexa”, for Siri it is “Hey Siri”. Once the wake word detected with a classifier, the speech can be uploaded to the cloud for full processing, the speech recognition stages.
This isn’t just useful for intelligent agents, the UN global pulse project on public discussion on radio also uses wake word detection for recording radio conversations.
A major breakthrough in image classification came in 2012 with the ImageNet result of Alex Krizhevsky, Ilya Sutskever and Geoff Hinton from the University of Toronto. ImageNet is a large data base of 14 million images with many thousands of classes. The data is used in a communitywide challenge for object categorization. Krizhevsky et al used convolutional neural networks to outperform all previous approaches on the challenge. They formed a company which was purchased shortly after by Google. This challenge, known as object categorisation, was a major obstacle for practical computer vision systems. Modern object categorization systems are close to human performance.
Machine learning problems normally involve a prediction function and an objective function. Regression is the case where the prediction function iss over the real numbers, so the codomain of the functions, $\mappingFunction(\inputMatrix)$ was the real numbers or sometimes real vectors. The classification problem consists of predicting whether or not a particular example is a member of a particular class. So we may want to know if a particular image represents a digit 6 or if a particular user will click on a given advert. These are classification problems, and they require us to map to yes or no answers. That makes them naturally discrete mappings.
In classification we are given an input vector, $\inputVector$, and an associated label, $\dataScalar$ which either takes the value − 1 to represent no or 1 to represent yes.
In supervised learning the inputs, $\inputVector$, are mapped to a label, $\dataScalar$, through a function $\mappingFunction(\cdot)$ that is dependent on a set of parameters, $\weightVector$,
$$
\dataScalar = \mappingFunction(\inputVector; \weightVector).
$$
The function $\mappingFunction(\cdot)$ is known as the prediction function. The key challenges are (1) choosing which features, $\inputVector$, are relevant in the prediction, (2) defining the appropriate class of function, $\mappingFunction(\cdot)$, to use and (3) selecting the right parameters, $\weightVector$.
Classification Examples [edit]
 Classifiying hand written digits from binary images (automatic zip code reading)
 Detecting faces in images (e.g. digital cameras).
 Who a detected face belongs to (e.g. Facebook, DeepFace)
 Classifying type of cancer given gene expression data.
 Categorization of document types (different types of news article on the internet)
Logistic Regression [edit]
A logistic regression is an approach to classification which extends the linear basis function models we’ve already explored. Rather than modeling the output of the function directly the assumption is that we model the logodds with the basis functions.
The odds are defined as the ratio of the probability of a positive outcome, to the probability of a negative outcome. If the probability of a positive outcome is denoted by π, then the odds are computed as $\frac{\pi}{1\pi}$. Odds are widely used by bookmakers in gambling, although a bookmakers odds won’t normalise: i.e. if you look at the equivalent probabilities, and sum over the probability of all outcomes the bookmakers are considering, then you won’t get one. This is how the bookmaker makes a profit. Because a probability is always between zero and one, the odds are always between 0 and ∞. If the positive outcome is unlikely the odds are close to zero, if it is very likely then the odds become close to infinite. Taking the logarithm of the odds maps the odds from the positive half space to being across the entire real line. Odds that were between 0 and 1 (where the negative outcome was more likely) are mapped to the range between − ∞ and 0. Odds that are greater than 1 are mapped to the range between 0 and ∞. Considering the log odds therefore takes a number between 0 and 1 (the probability of positive outcome) and maps it to the entire real line. The function that does this is known as the logit function, $g^{1}(p_i) = \log\frac{p_i}{1p_i}$. This function is known as a link function.
For a standard regression we take,
$$
\mappingFunction(\inputVector) = \mappingVector^\top
\basisVector(\inputVector),
$$
if we want to perform classification we perform a logistic regression.
$$
\log \frac{\pi}{(1\pi)} = \mappingVector^\top
\basisVector(\inputVector)
$$
where the odds ratio between the positive class and the negative class is given by
$$
\frac{\pi}{(1\pi)}
$$
The odds can never be negative, but can take any value from 0 to ∞. We have defined the link function as taking the form g^{ − 1}( ⋅ ) implying that the inverse link function is given by g( ⋅ ). Since we have defined,
$$
g^{1}(\pi) =
\mappingVector^\top \basisVector(\inputVector)
$$
we can write π in terms of the inverse link function, g( ⋅ ) as
$$
\pi = g(\mappingVector^\top
\basisVector(\inputVector)).
$$
Basis Function
We’ll define our prediction, objective and gradient functions below. But before we start, we need to define a basis function for our model. Let’s start with the linear basis.
Prediction Function
Now we have the basis function let’s define the prediction function.
def predict(w, x, basis=linear, **kwargs):
"Generates the prediction function and the basis matrix."
Phi = basis(x, **kwargs)
f = np.dot(Phi, w)
return 1./(1+np.exp(f)), Phi
This inverse of the link function is known as the logistic (thus the name logistic regression) or sometimes it is called the sigmoid function. For a particular value of the input to the link function, $\mappingFunction_i = \mappingVector^\top \basisVector(\inputVector_i)$ we can plot the value of the inverse link function as below.
Sigmoid Function [edit]
The function has this characeristic ‘s’shape (from where the term sigmoid, as in sigma, comes from). It also takes the input from the entire real line and ‘squashes’ it into an output that is between zero and one. For this reason it is sometimes also called a ‘squashing function’.
By replacing the inverse link with the sigmoid we can write π as a function of the input and the parameter vector as,
$$
\pi(\inputVector,\mappingVector) = \frac{1}{1+\exp\left(\mappingVector^\top \basisVector(\inputVector)\right)}.
$$
The process for logistic regression is as follows. Compute the output of a standard linear basis function composition ($\mappingVector^\top \basisVector(\inputVector)$, as we did for linear regression) and then apply the inverse link function, $g(\mappingVector^\top \basisVector(\inputVector))$. In logistic regression this involves squashing it with the logistic (or sigmoid) function. Use this value, which now has an interpretation as a probability in a Bernoulli distribution to form the likelihood. Then we can assume conditional independence of each data point given the parameters and develop a likelihod for the entire data set.
As we discussed last time, the Bernoulli likelihood is of the form,
$$
P(\dataScalar_i\mappingVector, \inputVector) =
\pi_i^{\dataScalar_i} (1\pi_i)^{1\dataScalar_i}
$$
which we can think of as clever trick for mathematically switching between two probabilities if we were to write it as code it would be better described as
but writing it mathematically makes it easier to write our objective function within a single mathematical equation.
Maximum Likelihood
To obtain the parameters of the model, we need to maximize the likelihood, or minimize the objective function, normally taken to be the negative log likelihood. With a data conditional independence assumption the likelihood has the form,
$$
P(\dataVector\mappingVector,
\inputMatrix) = \prod_{i=1}^\numData P(\dataScalar_i\mappingVector, \inputVector_i).
$$
which can be written as a log likelihood in the form
$$
\log P(\dataVector\mappingVector,
\inputMatrix) = \sum_{i=1}^\numData \log P(\dataScalar_i\mappingVector, \inputVector_i) = \sum_{i=1}^\numData
\dataScalar_i \log \pi_i + \sum_{i=1}^\numData (1\dataScalar_i)\log (1\pi_i)
$$
and if we take the probability of positive outcome for the ith data point to be given by
$$
\pi_i = g\left(\mappingVector^\top \basisVector(\inputVector_i)\right),
$$
where g( ⋅ ) is the inverse link function, then this leads to an objective function of the form,
$$
E(\mappingVector) =  \sum_{i=1}^\numData \dataScalar_i \log
g\left(\mappingVector^\top \basisVector(\inputVector_i)\right) 
\sum_{i=1}^\numData(1\dataScalar_i)\log \left(1g\left(\mappingVector^\top
\basisVector(\inputVector_i)\right)\right).
$$
def objective(g, y):
"Computes the objective function."
labs = np.asarray(y, dtype=float).flatten()
posind = np.where(labs==1)
negind = np.where(labs==0)
return np.log(g[posind, :]).sum()  np.log(1g[negind, :]).sum()
As normal, we would like to minimize this objective. This can be done by differentiating with respect to the parameters of our prediction function, $\pi(\inputVector;\mappingVector)$, for optimisation. The gradient of the likelihood with respect to $\pi(\inputVector;\mappingVector)$ is of the form,
$$
\frac{\text{d}E(\mappingVector)}{\text{d}\mappingVector} = \sum_{i=1}^\numData
\frac{\dataScalar_i}{g\left(\mappingVector^\top
\basisVector(\inputVector)\right)}\frac{\text{d}g(\mappingFunction_i)}{\text{d}\mappingFunction_i}
\basisVector(\inputVector_i) + \sum_{i=1}^\numData
\frac{1\dataScalar_i}{1g\left(\mappingVector^\top
\basisVector(\inputVector)\right)}\frac{\text{d}g(\mappingFunction_i)}{\text{d}\mappingFunction_i}
\basisVector(\inputVector_i)
$$
where we used the chain rule to develop the derivative in terms of $\frac{\text{d}g(\mappingFunction_i)}{\text{d}\mappingFunction_i}$, which is the gradient of the inverse link function (in our case the gradient of the sigmoid function).
So the objective function now depends on the gradient of the inverse link function, as well as the likelihood depends on the gradient of the inverse link function, as well as the gradient of the log likelihood, and naturally the gradient of the argument of the inverse link function with respect to the parameters, which is simply $\basisVector(\inputVector_i)$.
The only missing term is the gradient of the inverse link function. For the sigmoid squashing function we have,
$$\begin{align*}
g(\mappingFunction_i) &= \frac{1}{1+\exp(\mappingFunction_i)}\\
&=(1+\exp(\mappingFunction_i))^{1}
\end{align*}$$
and the gradient can be computed as
$$\begin{align*}
\frac{\text{d}g(\mappingFunction_i)}{\text{d} \mappingFunction_i} & =
\exp(\mappingFunction_i)(1+\exp(\mappingFunction_i))^{2}\\
& = \frac{1}{1+\exp(\mappingFunction_i)}
\frac{\exp(\mappingFunction_i)}{1+\exp(\mappingFunction_i)} \\
& = g(\mappingFunction_i) (1g(\mappingFunction_i))
\end{align*}$$
so the full gradient can be written down as
$$
\frac{\text{d}E(\mappingVector)}{\text{d}\mappingVector} = \sum_{i=1}^\numData
\dataScalar_i\left(1g\left(\mappingVector^\top \basisVector(\inputVector)\right)\right)
\basisVector(\inputVector_i) + \sum_{i=1}^\numData
(1\dataScalar_i)\left(g\left(\mappingVector^\top \basisVector(\inputVector)\right)\right)
\basisVector(\inputVector_i).
$$
def gradient(g, Phi, y):
"Generates the gradient of the parameter vector."
labs = np.asarray(y, dtype=float).flatten()
posind = np.where(labs==1)
dw = (Phi[posind]*(1g[posind])).sum(0)
negind = np.where(labs==0 )
dw += (Phi[negind]*g[negind]).sum(0)
return dw[:, None]
Optimization of the Function
Reorganizing the gradient to find a stationary point of the function with respect to the parameters $\mappingVector$ turns out to be impossible. Optimization has to proceed by numerical methods. Options include the multidimensional variant of Newton’s method or gradient based optimization methods like we used for optimizing matrix factorization for the movie recommender system. We recall from matrix factorization that, for large data, stochastic gradient descent or the Robbins Munro (Robbins and Monro 1951) optimization procedure worked best for function minimization.
Nigerian NMIS Data [edit]
First we will load in the Nigerian NMIS health data. Our aim will be to predict whether a center has maternal health delivery services given the attributes in the data. We will predict of the number of nurses, the number of doctors, location etc.
Let’s first remind ourselves of the data.
Now we will convert this data into a form which we can use as inputs X
, and labels y
.
data = data[~pd.isnull(data['maternal_health_delivery_services'])]
data = data.dropna() # Remove entries with missing values
X = data[['emergency_transport',
'num_chews_fulltime',
'phcn_electricity',
'child_health_measles_immun_calc',
'num_nurses_fulltime',
'num_doctors_fulltime',
'improved_water_supply',
'improved_sanitation',
'antenatal_care_yn',
'family_planning_yn',
'malaria_treatment_artemisinin',
'latitude',
'longitude']].copy()
y = data['maternal_health_delivery_services']==True # set label to be whether there's a maternal health delivery service
# Create series of health center types with the relevant index
s = data['facility_type_display'].apply(pd.Series, 1).stack()
s.index = s.index.droplevel(1) # to line up with df's index
# Extract from the series the unique list of types.
types = s.unique()
# For each type extract the indices where it is present and add a column to X
type_names = []
for htype in types:
index = s[s==htype].index.tolist()
type_col=htype.replace(' ', '_').replace('/','').lower()
type_names.append(type_col)
X.loc[:, type_col] = 0.0
X.loc[index, type_col] = 1.0
This has given us a new data frame X
which contains the different facility types in different columns.
Batch Gradient Descent [edit]
We will need to define some initial random values for our vector and then minimize the objective by descending the gradient.
# Separate train and test
indices = np.random.permutation(X.shape[0])
num_train = np.ceil(X.shape[0]/2)r
train_indices = indices[:num_train]
test_indices = indices[num_train:]
X_train = X.iloc[train_indices]
y_train = y.iloc[train_indices]==True
X_test = X.iloc[test_indices]
y_test = y.iloc[test_indices]==True
# gradient descent algorithm
w = np.random.normal(size=(X.shape[1]+1, 1), scale = 0.001)
eta = 1e9
iters = 10000
for i in range(iters):
g, Phi = predict(w, X_train, linear)
w = eta*gradient(g, Phi, y_train) + 0.001*w
if not i % 100:
print("Iter", i, "Objective", objective(g, y_train))
Let’s look at the weights and how they relate to the inputs.
What does the magnitude of the weight vectors tell you about the different parameters and their influence on outcome? Are the weights of roughly the same size, if not, how might you fix this?
Stochastic Gradient Descent
Regression [edit]
Classification is the case where our prediction function gives a discrete valued output, normally associated with a ‘class’. Regression is an alternative approach where the aim is to predict a continuous output.
The name is a historical accident, it would be better to call regression ‘curve fitting’, or even split it into two parts ‘interpolation’, which is the practice of predicting a function value between existing data, and ‘extrapolation’, which is the practice of predicting a function value beyond the regime where we have data.
Regression Examples [edit]
Regression involves predicting a real value, $\dataScalar_i$, given an input vector, $\inputVector_i$. For example, the Tecator data involves predicting the quality of meat given spectral measurements. Or in radiocarbon dating, the C14 calibration curve maps from radiocarbon age to age measured through a backtrace of tree rings. Regression has also been used to predict the quality of board game moves given expert rated training data.
Olympic Marathon Data


The first thing we will do is load a standard data set for regression modelling. The data consists of the pace of Olympic Gold Medal Marathon winners for the Olympics from 1896 to present. First we load in the data and plot.
data = pods.datasets.olympic_marathon_men()
x = data['X']
y = data['Y']
offset = y.mean()
scale = np.sqrt(y.var())
xlim = (1875,2030)
ylim = (2.5, 6.5)
yhat = (yoffset)/scale
fig, ax = plt.subplots(figsize=plot.big_wide_figsize)
_ = ax.plot(x, y, 'r.',markersize=10)
ax.set_xlabel('year', fontsize=20)
ax.set_ylabel('pace min/km', fontsize=20)
ax.set_xlim(xlim)
ax.set_ylim(ylim)
mlai.write_figure(figure=fig,
filename='../slides/diagrams/datasets/olympicmarathon.svg',
transparent=True,
frameon=True)
Things to notice about the data include the outlier in 1904, in this year, the olympics was in St Louis, USA. Organizational problems and challenges with dust kicked up by the cars following the race meant that participants got lost, and only very few participants completed.
More recent years see more consistently quick marathons.
Polynomial Fits to Olympic Data [edit]
basis = mlai.polynomial
data = pods.datasets.olympic_marathon_men()
x = data['X']
y = data['Y']
xlim = [1892, 2020]
basis=mlai.Basis(mlai.polynomial, number=1, data_limits=xlim)
pods.notebook.display_plots('olympic_LM_polynomial_number{num_basis:0>3}.svg',
directory='../slides/diagrams/ml',
num_basis=IntSlider(1,1,27,1))
Supervised Learning Challenges [edit]
There are three principal challenges in constructing a problem for supervised learning.
 choosing which features, $\inputVector$, are relevant in the prediction
 defining the appropriate class of function, $\mappingFunction(\cdot)$.
 selecting the right parameters, $\weightVector$.
Feature Selection [edit]
Feature selection is a critical stage in the algorithm design process. In the Olympic prediction example above we’re only using time to predict the the pace of the runners. In practice we might also want to use characteristics of the course: how hilly it is, what the temperature was when the race was run. In 1904 the runners actually got lost during the race. Should we include ‘lost’ as a feature? It would certainly help explain the particularly slow time in 1904. The features we select should be ones we expect to correlate with the prediction. In statistics, these features are even called predictors which highlights their role in developing the prediction function. For Facebook newsfeed, we might use features that include how close your friendship is with the poster, or how often you react to that poster, or whether a photo is included in the post.
Sometimes we use feature selection algorithms, algorithms that automate the process of finding the features that we need. Classification is often used to rank search results, to decide which adverts to serve or, at Facebook, to determine what appears at the top of your newsfeed. In the Facebook example features might include how many likes a post has had, whether it has an image in it, whether you regularly interact with the friend who has posted. A good newsfeed ranking algorithm is critical to Facebook’s success, just as good ad serving choice is critical to Google’s success. These algorithms are in turn highly dependent on the feature sets used. Facebook in particular has made heavy investments in machine learning pipelines for evaluation of the feature utility.
Class of Function, $\mappingFunction(\cdot)$ [edit]
By class of function we mean, what are the characteristics of the mapping between x and y. Often, we might choose it to be a smooth function. Sometimes we will choose it to be a linear function. If the prediction is a forecast, for example the demand of a particular product, then the function would need some periodic components to reflect seasonal or weekly effects.
Analysis of US Birth Rates [edit]
There’s a nice analysis of US birth rates by Gaussian processes with additive covariances in Gelman et al. (2013). A combination of covariance functions are used to take account of weekly and yearly trends. The analysis is summarized on the cover of the book.


In the ImageNet challenge the input, $\inputVector$, was in the form of an image. And the form of the prediction function was a convolutional neural network (more on this later). A convolutional neural network introduces invariances into the function that are particular to image classification. An invariance is a transformation of the input that we don’t want to affect the output. For example, a cat in an image is still a cat no matter where it’s located in the image (translation). The cat is also a cat regardless of how large it is (scale), or whether it’s upsidedown (rotation). Convolutional neural networks encode these invariances: scale invariance, rotation invariance and translation invariance; in the mathematical function.
Encoding invariance in the prediction function is like encoding knowledge in the model. If we don’t specify these invariances, then the model must learn them. This will require a lot more data to achieve the same performance, making the model less data efficient. Note that one invariance that is not encoded in a convolutional network is invariance to camera type. As a result, practitioners need to be careful to ensure that their training data is representative of the type of cameras that will be used when the model is deployed.
In general the prediction function could be any set of parameterized functions. In the Olympic marathon data example above we used a polynomial fit,
$$
\mappingFunction(\inputScalar) = \weightScalar_0 + \weightScalar_1 \inputScalar+ \weightScalar_2 \inputScalar^2 + \weightScalar_3 \inputScalar^3 + \weightScalar_4 \inputScalar^4.
$$
The Olympic example is also a supervised learning challenge. But it is a regression problem. A regression problem is one where the output is a continuous value (such as the pace in the marathon). In classification the output is constrained to be discrete. For example, classifying whether or not an image contains a dog implies the output is binary. An early example of a regression problem used in machine learning was the Tecator data, where the fat, water and protein content of meat samples was predicted as a function of the absorption of infrared light.
Class of Function: Neural Networks [edit]
One class of function that has become popular recently is neural network functions, in particular deep neural networks. The ImageNet challenge uses convolutional neural networks which introduce a translation invariance to the prediction function.
It’s impressive that only this additional invariance is enough to improve performance so much, particularly when we know that rotational invariances and scale invariances are also applicable for object detection in images.
Deep Learning [edit]
Classical statistical models and simple machine learning models have a great deal in common. The main difference between the fields is philosophical. Machine learning practitioners are typically more concerned with the quality of prediciton (e.g. measured by ROC curve) while statisticians tend to focus more on the interpretability of the model and the validity of any decisions drawn from that interpretation. For example, a statistical model may be used to validate whether a large scale intervention (such as the mass provision of mosquito nets) has had a long term effect on disease (such as malaria). In this case one of the covariates is likely to be the provision level of nets in a particular region. The response variable would be the rate of malaria disease in the region. The parmaeter, β_{1} associated with that covariate will demonstrate a positive or negative effect which would be validated in answering the question. The focus in statistics would be less on the accuracy of the response variable and more on the validity of the interpretation of the effect variable, β_{1}.
A machine learning practitioner on the other hand would typically denote the parameter w_{1}, instead of β_{1} and would only be interested in the output of the prediction function, $\mappingFunction(\cdot)$ rather than the parameter itself. The general formalism of the prediction function allows for nonlinear models. In machine learning, the emphasis on prediction over interpretability means that nonlinear models are often used. The parameters, w, are a means to an end (good prediction) rather than an end in themselves (interpretable).
DeepFace [edit]
The DeepFace architecture (Taigman et al. 2014) consists of layers that deal with translation and rotational invariances. These layers are followed by three locallyconnected layers and two fullyconnected layers. Color illustrates feature maps produced at each layer. The neural network includes more than 120 million parameters, where more than 95% come from the local and fully connected layers.
Deep Learning as Pinball [edit]
Sometimes deep learning models are described as being like the brain, or too complex to understand, but one analogy I find useful to help the gist of these models is to think of them as being similar to early pin ball machines.
In a deep neural network, we input a number (or numbers), whereas in pinball, we input a ball.
Think of the location of the ball on the leftright axis as a single number. Our simple pinball machine can only take one number at a time. As the ball falls through the machine, each layer of pins can be thought of as a different layer of ‘neurons’. Each layer acts to move the ball from left to right.
In a pinball machine, when the ball gets to the bottom it might fall into a hole defining a score, in a neural network, that is equivalent to the decision: a classification of the input object.
An image has more than one number associated with it, so it is like playing pinball in a hyperspace.
pods.notebook.display_plots('pinball{sample:0>3}.svg',
'../slides/diagrams',
sample=IntSlider(1, 1, 2, 1))
Learning involves moving all the pins to be in the correct position, so that the ball ends up in the right place when it’s fallen through the machine. But moving all these pins in hyperspace can be difficult.
In a hyperspace you have to put a lot of data through the machine for to explore the positions of all the pins. Even when you feed many millions of data points through the machine, there are likely to be regions in the hyperspace where no ball has passed. When future test data passes through the machine in a new route unusual things can happen.
Adversarial examples exploit this high dimensional space. If you have access to the pinball machine, you can use gradient methods to find a position for the ball in the hyper space where the image looks like one thing, but will be classified as another.
Probabilistic methods explore more of the space by considering a range of possible paths for the ball through the machine. This helps to make them more data efficient and gives some robustness to adversarial examples.
Encoding Knowledge
Knowledge that is not encoded in the prediction function must be learned through data. So any unspecified invariance (such as rotational or scale invariances) must be learned through the data. This means that learning would require a lot more data than otherwise would be necessary and results in less data efficient algorithms.
The choice of predication funciton and invariances is therefore a critical stage in designing your machine learning algorithm. Unfortunately many invariances are nontrivial to incorporate and many machine learning algorithms focus on simpler concepts such as linearity or smoothness.
Parameter Estimation: Objective Functions [edit]
Once we have a set of features, and the class of functions we use is determined, we need to find the parameters of the model.
The parameters of the model, $\weightVector$, are estimated by specifying an objective function. The objective function specifies the quality of the match between the prediction function and the training data. In supervised learning the objective function incorporates both the input data (in the ImageNet data the image, in the Olympic marathon data the year of the marathon) and a label.
The label is where the term supervised learning comes from. The idea being that a supervisor, or annotator, has already looked at the data and given it labels. For regression problem, a typical objective function is the squared error,
$$
\errorFunction(\weightVector) = \sum_{i=1}^\numData (\dataScalar_i  \mappingFunction(\inputVector_i))^2
$$
where the data is provided to us as a set of n inputs, $\inputVector_1$, $\inputVector_2$, $\inputVector_3$, …, $\inputVector_n$ each one with an associated label, $\dataScalar_1$, $\dataScalar_2$, $\dataScalar_3$, …, $\dataScalar_\numData$. Sometimes the label is cheap to acquire. For example, in Newsfeed ranking Facebook are acquiring a label each time a user clicks on a post in their Newsfeed. Similarly, in adclick prediction labels are obtained whenever an advert is clicked. More generally though, we have to employ human annotators to label the data. For example, ImageNet, the breakthrough deep learning result was annotated using Amazon’s Mechanical Turk. Without such large scale human input, we would not have the breakthrough results on image categorization we have today.
Some tasks are easier to annotate than others. For example, in the Tecator data, to acquire the actual values of water, protein and fat content in the meat samples further experiments may be required. It is not simply a matter of human labelling. Even if the task is easy for humans to solve there can be problems. For example, humans will extrapolate the context of an image. A colleague mentioned once to me a challenge where humans were labelling images as containing swimming pools, even though none was visible, because they could infer there must be a pool nearby, perhaps because there are kids wearing bathing suits. But there is no swimming pool in the image for the computer to find. The quality of any machine learning solution is very sensitive to the quality of annotated data we have. Investing in processes and tools to improve annotation of data is therefore priority for improving the quality of machine learning solutions.
There can also be significant problems with misrepresentation in the data set. If data isn’t collected carefully, then it can reflect biases about the population that we don’t want our models to have. For example, if we design a face detector using Californians may not perform well when deployed in Kampala, Uganda.
Generalization and Overfitting [edit]
Once a supervised learning system is trained it can be placed in a sequential pipeline to automate a process that used to be done manually.
Supervised learning is one of the dominant approaches to learning. But the cost and time associated with labeling data is a major bottleneck for deploying machine learning systems. The process for creating training data requires significant human intervention. For example, internationalization of a speech recognition system would require large speech corpora in new languages.
An important distinction in machine learning is the separation between training data and test data (or production data). Training data is the data that was used to find the model parameters. Test data (or production data) is the data that is used with the live system. The ability of a machine learning system to predict well on production systems given only its training data is known as its generalization ability. This is the system’s ability to predict in areas where it hasn’t previously seen data.
Hold Out Validation on Olympic Marathon Data [edit]
pods.notebook.display_plots('olympic_val_extra_LM_polynomial_number{num_basis:0>3}.svg',
directory='../slides/diagrams/ml',
num_basis=IntSlider(1, 1, max_basis, 1))
Extrapolation
Interpolation
pods.notebook.display_plots('olympic_val_inter_LM_polynomial_number{num_basis:0>3}.svg',
directory='../slides/diagrams/ml',
num_basis=IntSlider(1, 1, max_basis, 1))
Choice of Validation Set
Hold Out Data
You have a conclusion as to which model fits best under the training error, but how do the two models perform in terms of validation? In this section we consider hold out validation. In hold out validation we remove a portion of the training data for validating the model on. The remaining data is used for fitting the model (training). Because this is a time series prediction, it makes sense for us to hold out data at the end of the time series. This means that we are validating on future predictions. We will hold out data from after 1980 and fit the model to the data before 1980.
# select indices of data to 'hold out'
indices_hold_out = np.flatnonzero(x>1980)
# Create a training set
x_train = np.delete(x, indices_hold_out, axis=0)
y_train = np.delete(y, indices_hold_out, axis=0)
# Create a hold out set
x_valid = np.take(x, indices_hold_out, axis=0)
y_valid = np.take(y, indices_hold_out, axis=0)
Richer Basis Set
Now we have an approach for deciding which model to retain, we can consider the entire family of polynomial bases, with arbitrary degrees.
Bias Variance Decomposition [edit]
Expected test error for different variations of the training data sampled from, $\Pr(\dataVector, \dataScalar)$
$$\mathbb{E}\left[ \left(\dataScalar  \mappingFunction^*(\dataVector)\right)^2 \right]$$
Decompose as
$$\mathbb{E}\left[ \left(\dataScalar  \mappingFunction(\dataVector)\right)^2 \right] = \text{bias}\left[\mappingFunction^*(\dataVector)\right]^2 + \text{variance}\left[\mappingFunction^*(\dataVector)\right] +\sigma^2$$
 Given by
$$\text{bias}\left[\mappingFunction^*(\dataVector)\right] = \mathbb{E}\left[\mappingFunction^*(\dataVector)\right] * \mappingFunction(\dataVector)$$ Error due to bias comes from a model that’s too simple.
 Given by
$$\text{variance}\left[\mappingFunction^*(\dataVector)\right] = \mathbb{E}\left[\left(\mappingFunction^*(\dataVector)  \mathbb{E}\left[\mappingFunction^*(\dataVector)\right]\right)^2\right]$$ Slight variations in the training set cause changes in the prediction. Error due to variance is error in the model due to an overly complex model.
Bias vs Variance Error Plots [edit]
Helper function for sampling data from two different classes.
def create_data(per_cluster=30):
"""Create a randomly sampled data set
:param per_cluster: number of points in each cluster
"""
X = []
y = []
scale = 3
prec = 1/(scale*scale)
pos_mean = [[1, 0],[0,0.5],[1,0]]
pos_cov = [[prec, 0.], [0., prec]]
neg_mean = [[0, 0.5],[0,0.5],[0,0.5]]
neg_cov = [[prec, 0.], [0., prec]]
for mean in pos_mean:
X.append(np.random.multivariate_normal(mean=mean, cov=pos_cov, size=per_class))
y.append(np.ones((per_class, 1)))
for mean in neg_mean:
X.append(np.random.multivariate_normal(mean=mean, cov=neg_cov, size=per_class))
y.append(np.zeros((per_class, 1)))
return np.vstack(X), np.vstack(y).flatten()
Helper function for plotting the decision boundary of the SVM.
def plot_contours(ax, cl, xx, yy, **params):
"""Plot the decision boundaries for a classifier.
:param ax: matplotlib axes object
:param cl: a classifier
:param xx: meshgrid ndarray
:param yy: meshgrid ndarray
:param params: dictionary of params to pass to contourf, optional
"""
Z = cl.decision_function(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
# Plot decision boundary and regions
out = ax.contour(xx, yy, Z,
levels=[1., 0., 1],
colors='black',
linestyles=['dashed', 'solid', 'dashed'])
out = ax.contourf(xx, yy, Z,
levels=[Z.min(), 0, Z.max()],
colors=[[0.5, 1.0, 0.5], [1.0, 0.5, 0.5]])
return out
def decision_boundary_plot(models, X, y, axs, filename, titles, xlim, ylim):
"""Plot a decision boundary on the given axes
:param axs: the axes to plot on.
:param models: the SVM models to plot
:param titles: the titles for each axis
:param X: input training data
:param y: target training data"""
for ax in axs.flatten():
ax.clear()
X0, X1 = X[:, 0], X[:, 1]
if xlim is None:
xlim = [X0.min()1, X0.max()+1]
if ylim is None:
ylim = [X1.min()1, X1.max()+1]
xx, yy = np.meshgrid(np.arange(xlim[0], xlim[1], 0.02),
np.arange(ylim[0], ylim[1], 0.02))
for cl, title, ax in zip(models, titles, axs.flatten()):
plot_contours(ax, cl, xx, yy,
cmap=plt.cm.coolwarm, alpha=0.8)
ax.plot(X0[y==1], X1[y==1], 'r.', markersize=10)
ax.plot(X0[y==0], X1[y==0], 'g.', markersize=10)
ax.set_xlim(xlim)
ax.set_ylim(ylim)
ax.set_xticks(())
ax.set_yticks(())
ax.set_title(title)
mlai.write_figure(os.path.join(filename),
figure=fig,
transparent=True)
return xlim, ylim
import matplotlib
font = {'family' : 'sans',
'weight' : 'bold',
'size' : 22}
matplotlib.rc('font', **font)
import matplotlib.pyplot as plt
# Create an instance of SVM and fit the data.
C = 100.0 # SVM regularization parameter
gammas = [0.001, 0.01, 0.1, 1]
per_class=30
num_samps = 20
# Setup 2x2 grid for plotting.
fig, ax = plt.subplots(1, 4, figsize=(10,3))
xlim=None
ylim=None
for samp in range(num_samps):
X, y=create_data(per_class)
models = []
titles = []
for gamma in gammas:
models.append(svm.SVC(kernel='rbf', gamma=gamma, C=C))
titles.append('$\gamma={}$'.format(gamma))
models = (cl.fit(X, y) for cl in models)
xlim, ylim = decision_boundary_plot(models, X, y,
axs=ax,
filename='../slides/diagrams/ml/biasvariance{samp:0>3}.svg'.format(samp=samp),
titles=titles,
xlim=xlim,
ylim=ylim)
pods.notebook.display_plots('biasvariance{samp:0>3}.svg',
directory='../slides/diagrams/ml',
samp=IntSlider(0,0,10,1))
Overfitting
We can easily develop a simple prediction function that reconstructs the training data exactly, you can just use a look up table. But how would the lookup table predict between the training data, where examples haven’t been seen before? The choice of the class of prediction functions is critical in ensuring that the model generalizes well.
The generalization error is normally estimated by applying the objective function to a set of data that the model wasn’t trained on, the test data. To ensure good performance we normally want a model that gives us a low generalization error. If we weren’t sure of the right prediction function to use, then we could try 1,000 different prediction functions. Then we could use the one that gives us the lowest error on the test data. But you have to be careful. Selecting a model in this way is like a further stage of training where you are using the test data in the training.^{3} So when this is done, the data used for this is not known as test data, it is known as validation data. And the associated error is the validation error. Using the validation error for model selection is a standard machine learning technique, but it can be misleading about the final generalization error. Almost all machine learning practitioners know not to use the test data in your training procedure, but sometimes people forget that when validation data is used for model selection that validation error cannot be used as an unbiased estimate of the generalization performance.
Olympic Data with Bayesian Polynomials [edit]
Five fold cross validation tests the ability of the model to interpolate.
pods.notebook.display_plots('olympic_BLM_polynomial_number{num_basis:0>3}.svg',
directory='../slides/diagrams/ml/',
num_basis=IntSlider(1, 1, 27, 1))
Hold Out Validation
For the polynomial fit, we will now look at hold out validation, where we are holding out some of the most recent points. This tests the abilit of our model to extrapolate.
pods.notebook.display_plots('olympic_val_BLM_polynomial_number{num_basis:0>3}.svg',
directory='../slides/diagrams/ml',
num_basis=IntSlider(1, 1, 27, 1))
5fold Cross Validation
Five fold cross validation tests the ability of the model to interpolate.
pods.notebook.display_plots('olympic_5cv{part:0>2}_BLM_polynomial_number{num_basis:0>3}.svg',
directory='../slides/diagrams/ml',
part=(0, 5),
num_basis=IntSlider(1, 1, 27, 1))
Unsupervised Learning [edit]
In unsupervised learning you have data, $\inputVector$, but no labels $\dataScalar$. The aim in unsupervised learning is to extract structure from data. The type of structure you are interested in is dependent on the broader context of the task. In supervised learning that context is very much driven by the labels. Supervised learning algorithms try and focus on the aspects of the data which are relevant to predicting the labels. But in unsupervised learning there are no labels.
Context
Humans can easily sort a number of objects into objects that share similar characteristics. We easily categorize animals or vehicles. But if the data is very large this is too slow. Even for smaller data, it may be that it is presented in a form that is unintelligible for humans. We are good at dealing with high dimensional data when it’s presented in images, but if it’s presented as a series of numbers, we find it hard to interpret. In unsupervised learning we want the computer to do the sorting for us. For example, an ecommerce company might need an algorithm that can go through its entire list of products and automatically sort them into groups such that similar products are located together.
Discrete vs Continuous
Supervised learning is broadly divided into classification: i.e. wake word classification in the Amazon Echo, and regression, e.g. shelf life prediction for perishable goods. Similarly, unsupervised learning can be broadly split into methods that cluster the data (i.e. provide a discrete label) and methods that represent the data as a continuous value.
Clustering [edit]
Clustering methods associate each data point with a different label. Unlike in classification the label is not provided by a human annotator. It is allocated by the computer. Clustering is quite intuitive for humans, we do it naturally with our observations of the real world. For example, we cluster animals into different groups. If we encounter a new animal, we can immediately assign it to a group: bird, mammal, insect. These are certainly labels that can be provided by humans, but they were also originally invented by humans. With clustering we want the computer to recreate that process of inventing the label.
Unsupervised learning enables computers to form similar categorizations on data that is too large scale for us to process. When the Greek philosopher, Plato, was thinking about ideas, he considered the concept of the Platonic ideal. The Platonic ideal bird is the bird that is most birdlike or the chair that is most chairlike. In some sense, the task in clustering is to define different clusters, by finding their Platonic ideal (known as the cluster center) and allocate each data point to the relevant cluster center. So, allocate each animal to the class defined by its nearest cluster center.
To perform clustering on a computer we need to define a notion of either similarity or distance between the objects and their Platonic ideal, the cluster center. We normally assume that our objects are represented by vectors of data, $\inputVector_i$. Similarly, we represent our cluster center for category j by a vector $\meanVector_j$. This vector contains the ideal features of a bird, a chair, or whatever category j is. In clustering we can either think in terms of similarity of the objects, or distances. We want objects that are similar to each other to cluster together. We want objects that are distant from each other to cluster apart.
This requires us to formalize our notion of similarity or distance. Let’s focus on distances. A definition of distance between an object, i, and the cluster center of class j is a function of two vectors, the data point, $\inputVector_i$ and the cluster center, $\meanVector_j$,
$$
d_{ij} = f(\inputVector_i, \meanVector_j).
$$
Our objective is then to find cluster centers that are close to as many data points as possible. For example, we might want to cluster customers into their different tastes. We could represent each customer by the products they’ve purchased in the past. This could be a binary vector $\inputVector_i$. We can then define a distance between the cluster center and the customer.
Squared Distance
A commonly used distance is the squared distance,
$$
\distanceScalar_{ij} = (\inputVector_i  \meanVector_j)^2.
$$
The squared distance comes up a lot in machine learning. In unsupervised learning it was used to measure dissimilarity between predictions and observed data. Here its being used to measure the dissimilarity between a cluster center and the data.
Once we have decided on the distance or similarity function, we can decide a number of cluster centers, K. We find their location by allocating each center to a subset of the points and minimizing the sum of the squared errors,
$$
\errorFunction(\meanMatrix) = \sum_{i \in \mathbf{i}_j} (\inputVector_i  \meanVector_j)^2
$$
where the notation i_{j} represents all the indices of each data point which has been allocated to the jth cluster represented by the center $\meanVector_j$.
kMeans Clustering
One approach to minimizing this objective function is known as kmeans clustering. It is simple and relatively quick to implement, but it is an initialization sensitive algorithm. Initialization is the process of choosing an initial set of parameters before optimization. For kmeans clustering you need to choose an initial set of centers. In kmeans clustering your final set of clusters is very sensitive to the initial choice of centers. For more technical details on kmeans clustering you can watch a video of Alex Ihler introducing the algorithm here.
kMeans Clustering
Hierarchical Clustering
Other approaches to clustering involve forming taxonomies of the cluster centers, like humans apply to animals, to form trees. You can learn more about agglomerative clustering in this video from Alex Ihler.
Phylogenetic Trees
Indeed, one application of machine learning techniques is performing a hierarchical clustering based on genetic data, i.e. the actual contents of the genome. If we do this across a number of species then we can produce a phylogeny. The phylogeny aims to represent the actual evolution of the species and some phylogenies even estimate the timing of the common ancestor between two species^{4}. Similar methods are used to estimate the origin of viruses like AIDS or Bird flu which mutate very quickly. Determining the origin of viruses can be important in containing or treating outbreaks.
Product Clustering
An ecommerce company could apply hierarchical clustering to all its products. That would give a phylogeny of products. Each cluster of products would be split into subclusters of products until we got down to individual products. For example, we might expect a high level split to be Electronics/Clothing. Of course, a challenge with these treelike structures is that many products belong in more than one parent cluster: for example running shoes should be in more than one group, they are ‘sporting goods’ and they are ‘apparel’. A tree structure doesn’t allow this allocation.
Hierarchical Clustering Challenge
Our own psychological grouping capabilities are studied as a domain of cognitive science. Researchers like Josh Tenenbaum have developed algorithms that decompose data in more complex ways, but they can normally only be applied to smaller data sets.
Dimensionality Reduction [edit]
Dimensionality reduction methods compress the data by replacing the original data with a reduced number of continuous variables. One way of thinking of these methods is to imagine a marionette.
The position of each body part of a marionette could be thought of as our data, $\inputVector_i$. So, each data point consists of the 3D coordinates of all the different body parts of the marionette. Let’s say there are 13 different body parts (2 each of feet, knees, hips, hands, elbows, shoulders, one head). Each body part has an x, y, z position in Cartesian coordinates. So that’s 39 numbers associated with each observation.
The movement of these 39 parts is determined by the puppeteer via strings. Let’s assume it’s a very simple puppet, with just one stick to control it. The puppeteer can move the stick up and down, left and right. And they can twist it. This gives three parameters in the puppeteers control. This implies that the 39 variables we see moving are controlled by only 3 variables. These 3 variables are often called the hidden or latent variables.
Dimensionality reduction assumes something similar for real world data. It assumes that the data we observe is generated from some lower dimensional underlying process. It then seeks to recover the values associated with this low dimensional process.
Examples in Social Sciences
Dimensionality reduction techniques underpin a lot of psychological scoring tests such as IQ tests or personality tests. An IQ test can involve several hundred questions, potentially giving a rich, high dimensional, characterization of some aspects of your intelligence. It is then summarized by a single number. Similarly, the MyersBriggs personality test involves answering questions about preferences which are reduced to a set of numbers reflecting personality.
These tests are assuming that our intelligence is implicitly onedimensional and that our personality is implicitly four dimensional. Other examples include political belief which is typically represented on a left to right scale. A onedimensional distillation of an entire philosophy about how a country should be run. Our own leadership principles imply that our decisions have a fourteendimensional space underlying them. Each decision could be characterized by judging to what extent it embodies each of the principles.
Political belief, personality, intelligence, leadership. None of these exist as a directly measurable quantity in the real world, rather they are inferred based on measurables. Dimensionality reduction is the process of allowing the computer to automatically find such underlying dimensions. This automatically allowing us to characterize each data point according to those explanatory variables. Each of these characteristics can be scored, and individuals can then be turned into vectors.
This doesn’t only apply to individuals, in recent years work on language modeling has taken a similar approach to words. The word2vec algorithm performed a dimensionality reduction on words, now you can take any word and map it to a latent space where similar words exhibit similar characteristics. A personality space for words.
Principal Component Analysis
Principal component analysis (PCA) is arguably the queen of dimensionality reduction techniques. PCA was developed as an approach to dimensionality reduction in 1930s by Hotelling as a method for the social sciences. In Hotelling’s formulation of PCA it was assumed that any data point, x could be represented as a weighted sum of the latent factors of interest, so that Hotelling described prediction functions (like in regression and classification above), only the regression is now multiple output. And instead of predicting a label, y_{i}, we now try and force the regression to predict the observed feature vector, $\dataVector_i$. So, for example, on an IQ test we would try and predict subject i’s answer to the jth question with the following function
$$
\dataScalar_{ij} = \mappingFunction_j(\latentScalar_i; \weightVector).
$$
Here z_{i} would be the IQ of subject i and $\mappingFunction_j(\cdot)$ would be a function representing the relationship between the subject’s IQ and their score on the answer to question j. This function is the same for all subjects, but the subject’s IQ is assumed to differ leading to different scores for each subject.
Hotelling’s PCA
In Hotelling’s formulation he assumed that the function was a linear function. This idea is taken from a wider field known as factor analysis, so Hotelling described the challenge as
$$
\mappingFunction_j(\latentScalar_i; \weightVector) = \weightScalar_j \latentScalar_i
$$
so the answer to the jth question is predicted to be a scaling of the subject’s IQ. The scale factor is given by $\weightScalar_j$. If there are more latent dimensions then a matrix of parameters, $\weightMatrix$ is used, for example if there were two latent dimensions, we’d have
$$
\mappingFunction_j(\mathbf{\latentScalar}_i; \weightMatrix) = \weightScalar_{1j} \latentScalar_{1i} + \weightScalar_{2j} \latentScalar_{2i}
$$
where, if this were a personality test, then $\latentScalar_{1i}$ might represent the spectrum over a subject’s extrovert/introvert and $\latentScalar_{2i}$ might represent where the subject was on the rational/perceptual scale. The function would make a prediction about the subjects answer to a particular question on the test (e.g. preference for office job vs preference for outdoor job). In factor analysis the parameters $\weightMatrix$ are known as the factor loadings and in PCA they are known as the principal components.
Parameters
Fitting the model involves finding estimates for the loadings, $\weightMatrix$, and latent variables, $\latentMatrix$. There are different approaches including least squares. The least squares approach is used, for example, in recommender systems. In recommender systems this method is called matrix factorization. The customer characteristics, $\dataVector_i$ is the customer rating for each different product (or item) and the latent variables can be seen as a space of customer preferences. In the recommender system case, the loadings matrix also has an interpretation as product similarities.^{5} Recommender systems have a particular characteristic in that most of the entries of the vector $\dataVector_i$ are missing most of the time.
In PCA and factor analysis the unknown latent factors are dealt with through a probability distribution. They are each assumed to be drawn from a zero mean, unit variance normal distribution. This leaves the factor loadings to be estimated. For PCA the maximum likelihood solution for the factor loadings can be shown to be given by the eigenvalue decomposition of the data covariance matrix. This is algorithmically simple and convenient, although slow to compute for very large data sets with many features and many subjects. The eigenvalue problem can also be derived from many other starting points: e.g. the directions of maximum variance in the data or finding a latent space that best preserves interpoint distances between the data, or the optimal linear compression of the data given a linear reconstruction. These many and varied justifications for the eigenvalue decomposition may account for the popularity of PCA. Indeed, there is even an interpretation for Google’s original PageRank algorithm (which computed the smallest eigenvector of the internet’s linkage matrix) as seeking the dominant principal component of the web.^{6}
Characterizing users according to past buying behavior and combining this with characteristics about products, is key to making good recommendations and returning useful search results. Further advances can be made if we understand the context of a particular session. For example, if a user is buying Christmas presents and searches for a dress, then it could be the case that the user is willing to spend a little more on the dress than in normal circumstances. Characterizing these effects requires more data and more complex algorithms. However, in domains such a search we are normally constrained by the speed with which we need to return results. Accounting for each of these factors while returning results with acceptable latency is a particular challenge.
Reinforcement Learning [edit]
The final domain of learning we will review is known as reinforcement learning. The domain of reinforcement learning is one that many researchers seem to believe is offering a route to general intelligence. The idea of general intelligence is to develop algorithms that are adaptable to many different circumstances. Supervised learning algorithms are designed to resolve particular challenges. Data is annotated with those challenges in mind. Unsupervised attempts to build representations without any context. But normally the algorithm designer has an understanding of what the broader objective is and designs the algorithms accordingly (for example, characterizing users). In reinforcement learning some context is given, in the form of a reward, but the reward is normally delayed. There may have been many actions that affected the outcome, but which actions had a positive effect and which a negative effect?
“Reward”
In reinforcement learning some context is given, in the form of a reward. But it is often delayed
Credit allocation problem: many actions that affected the outcome, but which actions had a positive effect and which a negative effect?
One issue for many companies is that the best way of testing the customer experience, A/B testing, prioritizes short term reward. The internet is currently being driven by short term rewards which make it distracting in the short term, but perhaps less useful in the long term. Clickbait is an example, but there are more subtle effects. The success of Facebook is driven by its ability to draw us in when likely we should be doing something else. This is driven by large scale A/B testing.
One open question is how to drive nonvisual interfaces through equivalents to A/B testing. Speech interfaces, such as those used in intelligent agents, are less amenable to A/B testing when determining the quality of the interface. Improving interaction with them is therefore less exact science than the visual interface. Data efficient reinforcement learning methods are likely to be key to improving these agent’s ability to interact with the user and understand intent. However, they are not yet mature enough to be deployed in this application.
Game Play
An area where reinforcement learning methods have been deployed with high profile success is game play. In game play the reward is delayed to the end of the game, and it comes in the form of victory or defeat. A significant advantage of game play as an application area is that, through simulation of the game, it is possible to generate as much data as is required to solve the problem. For this reason, many of the recent advances in reinforcement learning have occurred with methods that are not data efficient.
The company DeepMind is set up around reinforcement learning as an approach to general intelligence. All their most wellknown achievements are centered around artificial intelligence in game play. In reinforcement learning a decision made at any given time have a downstream effect on the result. Whether the effect if beneficial or not is unknown until a future moment.
We can think of reinforcement learning as providing a label, but the label is associated with a series of data involving a number of decisions taken. Each decision was taken given the understanding of game play at any given moment. Understanding which of these decisions was important in victory or defeat is a hard problem.
In machine learning the process of understanding which decisions were beneficial and which were detrimental is known as the credit allocation problem. You wish to reward decisions that led to success to encourage them, but punish decisions that lead to failure.
Broadly speaking, DeepMind uses an approach to Machine Learning where there are two mathematical functions at work. One determines the action to be taken at any given moment, the other estimates the quality of the board position at any given time. These are respectively known as the policy network and the value network.^{7} DeepMind made use of convolutional neural networks for both these models.
AlphaGo
The ancient Chinese game of Go was considered a challenge for artificial intelligence for two reasons. Firstly, the game tree has a very high branching factor. The game tree is a discrete representation of the game. Every node in the game tree is associated with a board position. You can move through the game tree by making legal a move on the board to change the position. In Go, there are so many legal moves that the game tree increases exponentially. This challenge in Go was addressed by using stochastic game tree search. Rather than exploring the game tree exhaustively they explored it randomly.
Secondly, evaluating the quality of any given board position was deemed to be very hard.^{8} The value function determines for each player whether they are winning or losing. Skilled Go players can assess a board position, but they do it by instinct, by intuition. Just as early AI researchers struggled to give rules for detecting cancer, it is challenging to give rules to assess a Go board. The machine learning approach that AlphaGo took is to train a value function network to make this assessment.
The approach that DeepMind took to conquering Go is a modelfree approach known as Qlearning.^{9} The modelfree approach refers to the fact that they don’t directly include a model of how the world evolves in the reinforcement learning algorithm. They make extensive use of the game tree, but they don’t model how it evolves. They do model the expected reward of each position in the game tree (the value function) but that is not the same as modeling how the game will proceed.
Reinforcement Learning and Classical Control
An alternative approach to reinforcement learning is to use a prediction function to suggest how the world will evolve in response to your actions. To predict how the game tree will evolve. You can then use this prediction to indirectly infer the expected reward associated with any action. This is known as modelbased reinforcement learning.
This modelbased approach is also closer to a control system. A classical control system is one where you give the system a set point. For example, a thermostat in the house. You set the temperature and the boiler switches off when it reaches it. Optimal control is about getting the house to the right temperature as quickly as possible. Classical control is widely used in robotic control and flight control.
One interesting crossover between classical control and machine learning arises because classical optimal control can be seen as a form of modelbased reinforcement learning. One where the reward is recovered when the set point is reached. In control engineering the prediction function is known as the transfer function. The process of fitting the transfer function in control is known as system identification.
There is some exciting work emerging at the interface between the areas of control and reinforcement learning. Results at this interface could be very important for improving the quality of robotic and drone control.
Optimization Methods
As we implied above, reinforcement learning can also used to improve user experience. In that case the reward is gained when the user buys a product from us. This makes it closely allied to the area of optimization. Optimization of our user interfaces can be seen as a reinforcement learning task, but more commonly it is thought about separately in the domains of Bayesian optimization or bandit learning.
We use optimization in machine learning to find the parameters of our models. We can do that because we have a mathematical representation of our objective function as a direct function of the parameters.
Examples in this form of optimization include, what is the best user interface for presenting adverts? What is the best design for a front wing for an F1 racing car? Which product should I return top of the list in response to this user’s search?
Bayesian optimization arises when we can’t directly relate the parameters in the system of interest to our objective through a mathematical function. For example, what is the mathematical function that relates a user’s experience to the probability that they will buy a product?
Bayesian Optimization
One approach to these problems is to use machine learning methods to develop a surrogate model for the optimization task. The surrogate model is a prediction function that attempts to recreate the process we are finding hard to model. We try to simultaneously fit the surrogate model and optimize the process.
Surrogate Models
Bayesian optimization methods use a surrogate model (normally a specific form of regression model). They use this to predict how the real system will perform. The surrogate model makes a prediction (with an estimate of the uncertainty) of what the response will be to any given input. Parameters to test are chosen by considering this prediction. Similar to reinforcement learning, this can be viewed as a modelbased approach because the surrogate model can be seen as a model of the real world. In bandit methods strategies are determined without turning to a model to motivate them. They are model free methods.
ModelBased and Model Free: Performance
Because of their different philosophies, if a class of prediction functions is chosen, then a modelbased approach might have better average case performance. At least in terms of data efficiency. A model free approach may well have better worstcase performance though, because it makes less assumptions about the nature of the data. To put it another way, making assumptions about the data is helpful if they are right: and if the model is sensible they’ll be right on average. However, it is unhelpful if the model is wrong. Indeed, it could be actively damaging. Since we can’t usually guarantee the model is absolutely right, the worstcase performance of a modelbased approach would be poor.
We have introduced a range of machine learning approaches by focusing on their use of mathematical functions to replace manually coded systems of rules. The important characteristic of machine learning is that the form of these functions, as dictated by their parameters, is determined by acquiring data from the real world.
Deployment [edit]
The methods we have introduced are roughly speaking introduced in order of difficulty of deployment. While supervised learning is more involved in terms of collection of data, it is the most straightforward method to deploy once that data is recovered. For this reason, a major focus with supervised learning should always be on maintaining data quality, increasing the efficiency and accountability^{10} of the data collection pipeline and the quality of features used.
You can also check my blog post on Data Readiness Levels. and my blog post on The 3Ds of Machine Learning Systems Design..
Where to Deploy?
In relation to what AI can and can’t do today Andrew Ng is quoted as saying:
If a typical person can do a mental task with less than one second of thought, we can probably automate it using AI either now or in the near future.^{11} Andrew Ng
Is this Right?
I would broadly agree with this quote but only in the context of supervised learning. If a human expert takes around that amount of time, then it’s also likely we can acquire the data necessary to build a supervised learning algorithm that can emulate that human’s response.
The picture with regard to unsupervised learning and reinforcement learning is more clouded.
One observation is that for supervised learning we seem to be moving beyond the era where very deep machine learning expertise is required to deploy methods. A solid understanding of machine learning (say to Masters level) is certainly required, but the quality of the final result is likely more dependent on domain expertise and the quality of the data and the information processing pipeline. This seems part of a wider trend where some of the big successes in machine learning are moving rapidly from the domain of science to that of engineering.^{12}
You can check my blog post on New Directions in Kernels and Gaussian Processes..
So if we can only emulate tasks that humans take around a second to do, how are we managing to deliver on self driving cars? The answer is that we are constructing engineered systems from subcomponents, each of which is a machine learning subsystem. But they are tied together as a component based system in line with our traditional engineering approach. This has an advantage that each component in the system can be verified before its inclusion. This is important for debugging and safety. But in practice we can expect these systems to be very brittle. A human adapts the way in which they drive the car across their lifetime. A human can react to other road users. In extreme situations, such as a car jacking, a human can set to one side normal patterns of behavior, and purposely crash their car to draw attention to the situation.
Supervised machine learning solutions are normally trained offline. They do not adapt when deployed because this makes them less verifiable. But this compounds the brittleness of our solutions. By deploying our solutions we actually change the environment in which they operate. Therefore, it’s important that they can be quickly updated to reflect changing circumstances. This updating happens offline. For a complex mechanical system, such as a delivery drone, extensive testing of the system may be required when any component is updated. It is therefore imperative that these data processing pipelines are well documented so that they can be redeployed on demand.
In practice there can be challenges with the false dichotomy between reproducibility and performance. It is likely that most of our data scientists are caring less about their ability to redeploy their pipelines and only about their ability to produce an algorithm that achieves a particular performance. A key question is how reproducible is that process? There is a false dichotomy because ensuring reproducibility will typically improve performance as it will make it easier to run a rigorous set of explorative experiments. A worry is that, currently, we do not have a way to quantify the scale of this potential problem within companies.
Model Choice
Common to all machine learning methods is the initial choice of useful classes of functions. The deep learning revolution is associated with a particular class of mathematical functions that is proving very successful in what were seen to be challenging domains: speech, vision, language. This has meant that significant advances in problems that have been seen as hard have occurred in artificial intelligence.
References
AndradePacheco, Ricardo, Martin Mubangizi, John Quinn, and Neil D. Lawrence. 2014. “Consistent Mapping of Government Malaria Records Across a Changing Territory Delimitation.” Malaria Journal 13 (Suppl 1). https://doi.org/10.1186/1475287513S1P5.
Cooper, Brian. 1991. Transformation of a Valley: Derbyshire Derwent. Scarthin Books.
Gelman, Andrew, John B. Carlin, Hal S. Stern, and Donald B. Rubin. 2013. Bayesian Data Analysis. 3rd ed. Chapman; Hall.
Gething, Peter W., Abdisalan M. Noor, Priscilla W. Gikandi, Esther A. A. Ogara, Simon I. Hay, Mark S. Nixon, Robert W. Snow, and Peter M. Atkinson. 2006. “Improving Imperfect Data from Health Management Information Systems in Africa Using Space–Time Geostatistics.” PLoS Medicine 3 (6). Public Library of Science. https://doi.org/10.1371/journal.pmed.0030271.
Lawrence, Neil D. 2015. “How Africa Can Benefit from the Data Revolution.” The Guardian Media & Tech Network. https://www.theguardian.com/medianetwork/2015/aug/25/africabenefitdatascienceinformation.
McCulloch, Warren S., and Walter Pitts. 1943. “A Logical Calculus of the Ideas Immanent in Nervous Activity.” Bulletin of Mathematical Biophysics 5: 115–33.
Mubangizi, Martin, Ricardo AndradePacheco, Michael Thomas Smith, John Quinn, and Neil D. Lawrence. 2014. “Malaria Surveillance with Multiple Data Sources Using Gaussian Process Models.” In 1st International Conference on the Use of Mobile ICT in Africa.
Robbins, H., and S. Monro. 1951. “A Stochastic Approximation Method.” Annals of Mathematical Statistics 22: 400–407.
Taigman, Yaniv, Ming Yang, Marc’Aurelio Ranzato, and Lior Wolf. 2014. “DeepFace: Closing the Gap to HumanLevel Performance in Face Verification.” In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. https://doi.org/10.1109/CVPR.2014.220.
The logarithm of a number less than one is negative, for a number greater than one the logarithm is positive. So if odds are greater than evens (oddson) the logodds are positive, if the odds are less than evens (oddsagainst) the logodds will be negative.↩
In classical statistics we often interpret these parameters, β, whereas in machine learning we are normally more interested in the result of the prediction, and less in the prediction. Although this is changing with more need for accountability. In honour of this I normally use β when I care about the value of these parameters, and $\mappingVector$ when I care more about the quality of the prediction.↩
Using the test data in your training procedure is a major error in any machine learning procedure. It is extremely dangerous as it gives a misleading assessment of the model performance. The Baidu ImageNet scandal was an example of a team competing in the ImageNet challenge which did this. The team had announced via the publication preprint server Arxiv that they had a worldleading performance on the ImageNet challenge. This was reported in the mainstream media. Two weeks later the challenge organizers revealed that the team had created multiple accounts for checking their test performance more times than was permitted by the challenge rules. This was then reported as “AI’s first doping scandal”. The team lead was fired by Baidu.↩
These models are quite a lot more complex than the simple clustering we describe here. They represent a common ancestor through a cluster center that is then allowed to evolve over time through a mutation rate. The time of separation between different species is estimated via these mutation rates.↩
One way of thinking about this is to flip the model on its side. Instead of thinking about the ith subject and the jth characteristic. Assume that each product is the subject. So, the jth item is thought of as the subject, and each item’s characteristic is given by the rating from a particular user. In this case symmetries in the model show that the matrix $\weightMatrix$ can now be seen as a matrix of latent variables and the matrix $\latentMatrix$ can be seen as factor loadings. So, you can think of the method as simultaneously doing a dimensionality reduction on the products and the users. Recommender systems also use other approaches, some of them based on similarity measures. In a similarity measurebased recommender system the rating prediction is given by looking for similar products in the user profile and scoring the new product with a score that is a weighted sum of those products.↩
The interpretation requires you to think of the web as a series of web pages in a high dimensional space where distances between web pages are computed by moving along the links (in either direction). The PageRank is the onedimensional space that best preserves those distances in the sense of an L1 norm. The interpretation works because the smallest eigenvalue of the linkage matrix is the largest eigenvalue of the inverse of the linkage matrix. The inverse linkage matrix (which would be impossible to compute) embeds similarities between pages according to how far apart they are via a random walk along the linkage matrix.↩
The approach was described early on in the history of machine learning by Chris Watkins, during his PhD thesis in the 1980s. It is known as Qlearning. It’s recent success in the games domain is driven by the use of deep learning for the policy and value functions as well as the use of fast compute to generate and process very large quantities of data. In its standard form it is not seen as a very dataefficient approach.↩
The situation in chess is much easier, firstly the number of possible moves at any time is about an order of magnitude lower, meaning the game tree doesn’t grow as quickly. Secondly, in chess, there are well defined value functions. For example, a value function could be based on adding together the points that are associated with each piece.↩
The approach was described early on in the history of machine learning by Chris Watkins, during his PhD thesis in the 1980s. It is known as Qlearning. It’s recent success in the games domain is driven by the use of deep learning for the policy and value functions as well as the use of fast compute to generate and process very large quantities of data. In its standard form it is not seen as a very dataefficient approach.↩
To try and better embody the state of data readiness in organizations I’ve been proposing “Data Readiness Levels”. More needs to be done in this area to improve the efficiency of the data science pipeline.↩
The quote can be found in the Harvard Business Review Article “What Artificial Intelligence Can and Can’t Do Right Now”.↩
This trend was very clear at the moment, I spoke about it at a recent Dagstuhl workshop on new directions for kernel methods and Gaussian processes.↩