at University of Sheffield on Sep 29, 2015 [jupyter][google colab][reveal]
Neil D. Lawrence, University of Sheffield

Abstract

In this first session we will introduce machine learning, review probability and begin familiarization with the Jupyter notebook, python and pandas.

Course Text 

}

Welcome to the Machine Learning and Adaptive Intelligence course. In this course we will introduce the basic concepts of machine learning and data science. In particular we will look at tools and techniques that describe how to model. An integrated part of that is how we approach data with the computer. We are choosing to do that with the tool you see in front of you: the Jupyter Notebook.

The notebook provides us with a way of interacting with the data that allows us to give the computer instructions and explore the nature of a data set. It is different to normal coding, but it is related. In this course you will, through intensive practical sessions and labs, develop your understanding of the interaction between data and computers. The first thing we are going to do is ask you to forget a bit about what you think about normal programming, or ‘classical software engineering’. Classical software engineering demands a large amount of design and testing. In data analysis, testing remains very important, but the design is often evolving. The design evolves through a process known as exploratory data analysis. You will learn some of the techniques of exploratory data analysis in this course.

A particular difference between classical software engineering and data analysis is the way in which programs are run. Classically we spend a deal of time working with a text editor, writing code. Compilations are done on a regular basis and aspects of the code are tested (perhaps with unit tests).

Data analysis is more like coding in a debugger. In a debugger (particularly a visual debugger) you interact with the data stored in the memory of the computer to try and understand what is happening in the computer, you need to understand exactly what your bug is: you often have a fixed idea of what the program is trying to do, you are just struggling to find out why it isn’t doing it.

Naturally, debugging is an important part of data analysis also, but in some sense it can be seen as its entire premise. You load in a data set into a computer that you don’t understand, your entire objective is to understand the data. This is best done by interogating the data to visualise it or summarize it, just like in a power visual debugger. However, for data science the requirements for visualization and summarization are far greater than in a regular program. When the data is well understood, the actual number of lines of your program may well be very few (particularly if you disregard commands that load in the data and commands which plot your results). If a powerful data science library is available, you may be able to summarize your code with just two or three lines, but the amount of intellectual energy that is expended on writing those three lines is far greater than in standard code.

Assumed Knowledge 

Linear Algebra, Probability and Differential Calculus

We will be assuming that you have good background in maths. In particular we will be making use of linear algrebra (matrix operations including inverse, inner products, determinant etc), probability (sum rule of probability, product rule of probability), and the calculus of variation, e.g. differentiation and integration. A new concept for the course is multivariate differentiation and integration. This combines linear algebra and differential calculus. These techniques are vital in understanding probability distributions over high dimensional distributions.

Choice of Language 

In this course we will be using Python for our programming language. A prerequisite of attending this course is that you have learnt at least one programming language in the past. It is not our objective to teach you python. At Level 4 and Masters we expect our students to be able pick up a language as they go. If you have not experienced python before it may be worth your while spending some time understanding the language. There are resources available for you to do this here that are based on the standard console. An introduction to the Jupyter notebook (formerly known as the IPython notebook) is available here.

Choice of Environment

We are working in the Jupyter notebook (formerly known as the IPython notebook). It provides an environment for interacting with data in a natural way which is reproducible. We will be learning how to make use of the notebook throughout the course. The notebook allows us to combine code with descriptions, interactive visualizations, plots etc. In fact it allows us to do many of the things we need for data science. Notebooks can also be easily shared through the internet for ease of communication of ideas. The box this text is written in is a markdown box. Below we have a code box.

print("This is the Jupyter notebook")
print("It provides a platform for:")
words = ['Open', 'Data', 'Science']
from random import shuffle
for i in range(3):
shuffle(words)
print(' '.join(words))

Have a play with the code in the above box. Think about the following questions: what is the difference between CTRL-enter and SHIFT-enter in running the code? What does the command shuffle do? Can you find out by typing shuffle? in a code box? Once you’ve had a play with the code we can load in some data using the pandas library for data analysis.

What is Machine Learning? 

What is machine learning? At its most basic level machine learning is a combination of

$$\text{data} + \text{model} \xrightarrow{\text{compute}} \text{prediction}$$

where data is our observations. They can be actively or passively acquired (meta-data). The model contains our assumptions, based on previous experience. That experience can be other data, it can come from transfer learning, or it can merely be our beliefs about the regularities of the universe. In humans our models include our inductive biases. The prediction is an action to be taken or a categorization or a quality score. The reason that machine learning has become a mainstay of artificial intelligence is the importance of predictions in artificial intelligence. The data and the model are combined through computation.

In practice we normally perform machine learning using two functions. To combine data with a model we typically make use of:

a prediction function a function which is used to make the predictions. It includes our beliefs about the regularities of the universe, our assumptions about how the world works, e.g. smoothness, spatial similarities, temporal similarities.

an objective function a function which defines the cost of misprediction. Typically it includes knowledge about the world’s generating processes (probabilistic objectives) or the costs we pay for mispredictions (empiricial risk minimization).

The combination of data and model through the prediction function and the objectie function leads to a learning algorithm. The class of prediction functions and objective functions we can make use of is restricted by the algorithms they lead to. If the prediction function or the objective function are too complex, then it can be difficult to find an appropriate learning algorithm. Much of the acdemic field of machine learning is the quest for new learning algorithms that allow us to bring different types of models and data together.

A useful reference for state of the art in machine learning is the UK Royal Society Report, Machine Learning: Power and Promise of Computers that Learn by Example.

You can also check my post blog post on What is Machine Learning?..

Overdetermined System 

The challenge with a linear model is that it has two unknowns, m, and c. Observing data allows us to write down a system of simultaneous linear equations. So, for example if we observe two data points, the first with the input value, $\inputScalar_1 = 1$ and the output value, $\dataScalar_1 =3$ and a second data point, $\inputScalar = 3$, $\dataScalar=1$, then we can write two simultaneous linear equations of the form.

point 1: $\inputScalar = 1$, $\dataScalar=3$
3 = m + c
point 2: $\inputScalar = 3$, $\dataScalar=1$
1 = 3m + c

The solution to these two simultaneous equations can be represented graphically as

The challenge comes when a third data point is observed and it doesn’t naturally fit on the straight line.

point 3: $\inputScalar = 2$, $\dataScalar=2.5$
2.5 = 2m + c

Now there are three candidate lines, each consistent with our data.

This is known as an overdetermined system because there are more data than we need to determine our parameters. The problem arises because the model is a simplification of the real world, and the data we observe is therefore inconsistent with our model.

pods.notebook.display_plots('over_determined_system{samp:0>3}.svg',
directory='../slides/diagrams/ml',
samp=IntSlider(1,1,7,1))

The solution was proposed by Pierre-Simon Laplace. His idea was to accept that the model was an incomplete representation of the real world, and the manner in which it was incomplete is unknown. His idea was that such unknowns could be dealt with through probability.

Pierre-Simon Laplace 

Famously, Laplace considered the idea of a deterministic Universe, one in which the model is known, or as the below translation refers to it, “an intelligence which could comprehend all the forces by which nature is animated”. He speculates on an “intelligence” that can submit this vast data to analysis and propsoses that such an entity would be able to predict the future.

Given for one instant an intelligence which could comprehend all the forces by which nature is animated and the respective situation of the beings who compose it—an intelligence sufficiently vast to submit these data to analysis—it would embrace in the same formulate the movements of the greatest bodies of the universe and those of the lightest atom; for it, nothing would be uncertain and the future, as the past, would be present in its eyes.

This notion is known as Laplace’s demon or Laplace’s superman.

Unfortunately, most analyses of his ideas stop at that point, whereas his real point is that such a notion is unreachable. Not so much superman as strawman. Just three pages later in the “Philosophical Essay on Probabilities” (Laplace 1814), Laplace goes on to observe:

The curve described by a simple molecule of air or vapor is regulated in a manner just as certain as the planetary orbits; the only difference between them is that which comes from our ignorance.

Probability is relative, in part to this ignorance, in part to our knowledge.

In other words, we can never make use of the idealistic deterministc Universe due to our ignorance about the world, Laplace’s suggestion, and focus in this essay is that we turn to probability to deal with this uncertainty. This is also our inspiration for using probability in machine learning.

The “forces by which nature is animated” is our model, the “situation of beings that compose it” is our data and the “intelligence sufficiently vast enough to submit these data to analysis” is our compute. The fly in the ointment is our ignorance about these aspects. And probability is the tool we use to incorporate this ignorance leading to uncertainty or doubt in our predictions.

Laplace’s concept was that the reason that the data doesn’t match up to the model is because of unconsidered factors, and that these might be well represented through probability densities. He tackles the challenge of the unknown factors by adding a variable, $\noiseScalar$, that represents the unknown. In modern parlance we would call this a latent variable. But in the context Laplace uses it, the variable is so common that it has other names such as a “slack” variable or the noise in the system.

point 1: $\inputScalar = 1$, $\dataScalar=3$
$$3 = m + c + \noiseScalar_1$$
point 2: $\inputScalar = 3$, $\dataScalar=1$
$$1 = 3m + c + \noiseScalar_2$$
point 3: $\inputScalar = 2$, $\dataScalar=2.5$
$$2.5 = 2m + c + \noiseScalar_3$$

Laplace’s trick has converted the overdetermined system into an underdetermined system. He has now added three variables, $\{\noiseScalar_i\}_{i=1}^3$, which represent the unknown corruptions of the real world. Laplace’s idea is that we should represent that unknown corruption with a probability distribution.

A Probabilistic Process 

However, it was left to an admirer of Gauss to develop a practical probability density for that purpose. It was Carl Friederich Gauss who suggested that the Gaussian density (which at the time was unnamed!) should be used to represent this error.

The result is a noisy function, a function which has a deterministic part, and a stochastic part. This type of function is sometimes known as a probabilistic or stochastic process, to distinguish it from a deterministic process.

Nigerian NMIS Data 

As an example data set we will use Nigerian NMIS Health Facility data from openAFRICA. It can be found here https://africaopendata.org/dataset/nigeria-nmis-health-facility-data-2014

Taking from the information on the site,

The Nigeria MDG (Millennium Development Goals) Information System – NMIS health facility data is collected by the Office of the Senior Special Assistant to the President on the Millennium Development Goals (OSSAP-MDGs) in partner with the Sustainable Engineering Lab at Columbia University. A rigorous, geo-referenced baseline facility inventory across Nigeria is created spanning from 2009 to 2011 with an additional survey effort to increase coverage in 2014, to build Nigeria’s first nation-wide inventory of health facility. The database includes 34,139 health facilities info in Nigeria.

The goal of this database is to make the data collected available to planners, government officials, and the public, to be used to make strategic decisions for planning relevant interventions.

Suggested citation: Nigeria NMIS facility database (2014), the Office of the Senior Special Assistant to the President on the Millennium Development Goals (OSSAP-MDGs) & Columbia University

import urllib.request
urllib.request.urlretrieve('https://energydata.info/dataset/f85d1796-e7f2-4630-be84-79420174e3bd/resource/6e640a13-cab4-457b-b9e6-0336051bac27/download/healthmopupandbaselinenmisfacility.csv', 'healthmopupandbaselinenmisfacility.csv')
import pandas as pd
data = pd.read_csv('healthmopupandbaselinenmisfacility.csv')

Once it is loaded in the data can be summarized using the describe method in pandas.

data.describe()

In python and jupyter notebook it is possible to see a list of all possible functions and attributes by typing the name of the object followed by .<Tab> for example in the above case if we type data.<Tab> it show the columns available (these are attributes in pandas dataframes) such as num_nurses_fulltime, and also functions, such as .describe().

For functions we can also see the documentation about the function by following the name with a question mark. This will open a box with documentation at the bottom which can be closed with the x button.

data.describe?

The NMIS facility data is stored in an object known as a ‘data frame’. Data frames come from the statistical family of programming languages based on S, the most widely used of which is R. The data frame gives us a convenient object for manipulating data. The describe method summarizes which columns there are in the data frame and gives us counts, means, standard deviations and percentiles for the values in those columns. To access a column directly we can write

print(data['num_doctors_fulltime'])
#print(data['num_nurses_fulltime'])

This shows the number of doctors per facility, number of nurses and number of community health workers (CHEWS). We can plot the number of doctors against the number of nurses as follows.

# this ensures the plot appears in the web browser
%matplotlib inline
import matplotlib.pyplot as plt # this imports the plotting library in python
_ = plt.plot(data['num_doctors_fulltime'], data['num_nurses_fulltime'], 'rx')

You may be curious what the arguments we give to plt.plot are for, now is the perfect time to look at the documentation

plt.plot?

We immediately note that some facilities have a lot of nurses, which prevent’s us seeing the detail of the main number of facilities. First lets identify the facilities with the most nurses.

data[data['num_nurses_fulltime']>100]

Here we are using the command data['num_nurses_fulltime']>100 to index the facilities in the pandas data frame which have over 100 nurses. To sort them in order we can also use the sort command. The result of this command on its own is a data Series of True and False values. However, when it is passed to the data data frame it returns a new data frame which contains only those values for which the data series is True. We can also sort the result. To sort the result by the values in the num_nurses_fulltime column in descending order we use the following command.

data[data['num_nurses_fulltime']>100].sort_values(by='num_nurses_fulltime', ascending=False)

We now see that the ‘University of Calabar Teaching Hospital’ is a large outlier with 513 nurses. We can try and determine how much of an outlier by histograming the data.

Plotting the Data

data['num_nurses_fulltime'].hist(bins=20) # histogram the data with 20 bins.
plt.title('Histogram of Number of Nurses')

We can’t see very much here. Two things are happening. There are so many facilities with zero or one nurse that we don’t see the histogram for hospitals with many nurses. We can try more bins and using a log scale on the y-axis.

data['num_nurses_fulltime'].hist(bins=100) # histogram the data with 20 bins.
plt.title('Histogram of Number of Nurses')
ax = plt.gca()
ax.set_yscale('log')

Let’s try and see how the number of nurses relates to the number of doctors.

fig, ax = plt.subplots(figsize=(10, 7))
ax.plot(data['num_doctors_fulltime'], data['num_nurses_fulltime'], 'rx')
ax.set_xscale('log') # use a logarithmic x scale
ax.set_yscale('log') # use a logarithmic Y scale
# give the plot some titles and labels
plt.title('Number of Nurses against Number of Doctors')
plt.ylabel('number of nurses')
plt.xlabel('number of doctors')

Note a few things. We are interacting with our data. In particular, we are replotting the data according to what we have learned so far. We are using the progamming language as a scripting language to give the computer one command or another, and then the next command we enter is dependent on the result of the previous. This is a very different paradigm to classical software engineering. In classical software engineering we normally write many lines of code (entire object classes or functions) before compiling the code and running it. Our approach is more similar to the approach we take whilst debugging. Historically, researchers interacted with data using a console. A command line window which allowed command entry. The notebook format we are using is slightly different. Each of the code entry boxes acts like a separate console window. We can move up and down the notebook and run each part in a different order. The state of the program is always as we left it after running the previous part.

Probabilities 

We are now going to do some simple review of probabilities and use this review to explore some aspects of our data.

A probability distribution expresses uncertainty about the outcome of an event. We often encode this uncertainty in a variable. So if we are considering the outcome of an event, Y, to be a coin toss, then we might consider Y = 1 to be heads and Y = 0 to be tails. We represent the probability of a given outcome with the notation:
P(Y = 1) = 0.5
The first rule of probability is that the probability must normalize. The sum of the probability of all events must equal 1. So if the probability of heads (Y = 1) is 0.5, then the probability of tails (the only other possible outcome) is given by
P(Y = 0) = 1 − P(Y = 1) = 0.5

Probabilities are often defined as the limit of the ratio between the number of positive outcomes (e.g. heads) given the number of trials. If the number of positive outcomes for event y is denoted by n and the number of trials is denoted by N then this gives the ratio
$$P(Y=y) = \lim_{N\rightarrow \infty}\frac{n_y}{N}.$$
In practice we never get to observe an event infinite times, so rather than considering this we often use the following estimate
$$P(Y=y) \approx \frac{n_y}{N}.$$

Probability and the NMIS Data 

Let’s use the sum rule to compute the estimate the probability that a facility has more than two nurses.

large = (data.num_nurses_fulltime>2).sum()  # number of positive outcomes (in sum True counts as 1, False counts as 0)
total_facilities = data.num_nurses_fulltime.count()

prob_large = float(large)/float(total_facilities)
print("Probability of number of nurses being greather than 2 is:", prob_large)

Conditioning

When predicting whether a coin turns up head or tails, we might think that this event is independent of the year or time of day. If we include an observation such as time, then in a probability this is known as condtioning. We use this notation, P(Y = y|X = x), to condition the outcome on a second variable (in this case the number of doctors). Or, often, for a shorthand we use P(y|x) to represent this distribution (the Y= and X= being implicit). If two variables are independent then we find that
P(y|x) = p(y).
However, we might believe that the number of nurses is dependent on the number of doctors. For this we can try estimating P(Y > 2|X > 1) and compare the result, for example to P(Y > 2|X ≤ 1) using our empirical estimate of the probability.

large = ((data.num_nurses_fulltime>2) & (data.num_doctors_fulltime>1)).sum()
total_large_doctors = (data.num_doctors_fulltime>1).sum()
prob_both_large = large/total_large_doctors
print("Probability of number of nurses being greater than 2 given number of doctors is greater than 1 is:", prob_both_large)

Notes for Question

Make sure the plot is included in this notebook file (the IPython magic command %matplotlib inline we ran above will do that for you, it only needs to be run once per file).

Terminology Mathematical notation Description
joint P(X = x, Y = y) prob. that X=x and Y=y
marginal P(X = x) prob. that X=x regardless of Y
conditional P(X = x|Y = y) prob. that X=x given that Y=y
The different basic probability distributions.

A Pictorial Definition of Probability 

Inspired by lectures from Christopher Bishop

Definition of probability distributions

Terminology Definition Probability Notation
Joint Probability $\lim_{N\rightarrow\infty}\frac{n_{X=3,Y=4}}{N}$ P(X=3,Y=4)
Marginal Probability $\lim_{N\rightarrow\infty}\frac{n_{X=5}}{N}$ P(X=5)
Conditional Probability $\lim_{N\rightarrow\infty}\frac{n_{X=3,Y=4}}{n_{Y=4}}$ P(X=3|Y=4)

Notational Details

Typically we should write out P(X=x,Y=y), but in practice we often shorten this to P(x,y). This looks very much like we might write a multivariate function, e.g.
$$f\left(x,y\right)=\frac{x}{y},$$
but for a multivariate function
f(x,y) ≠ f(y,x).
However,
P(x,y) = P(y,x)
because
P(X=x,Y=y) = P(Y=y,X=x).
Sometimes I think of this as akin to the way in Python we can write ‘keyword arguments’ in functions. If we use keyword arguments, the ordering of arguments doesn’t matter.

We’ve now introduced conditioning and independence to the notion of probability and computed some conditional probabilities on a practical example The scatter plot of deaths vs year that we created above can be seen as a joint probability distribution. We represent a joint probability using the notation P(Y = y, X = x) or P(y, x) for short. Computing a joint probability is equivalent to answering the simultaneous questions, what’s the probability that the number of nurses was over 2 and the number of doctors was 1? Or any other question that may occur to us. Again we can easily use pandas to ask such questions.

num_doctors = 1
large = (data.num_nurses_fulltime[data.num_doctors_fulltime==num_doctors]>2).sum()
total_facilities = data.num_nurses_fulltime.count() # this is total number of films
prob_large = float(large)/float(total_facilities)
print("Probability of nurses being greater than 2 and number of doctors being", num_doctors, "is:", prob_large)

The Product Rule

This number is the joint probability, P(Y, X) which is much smaller than the conditional probability. The number can never be bigger than the conditional probabililty because it is computed using the product rule.
p(Y = y, X = x) = p(Y = y|X = x)p(X = x)
and
p(X = x)
is a probability distribution, which is equal or less than 1, ensuring the joint distribution is typically smaller than the conditional distribution.

The product rule is a fundamental rule of probability, and you must remember it! It gives the relationship between the two questions: 1) What’s the probability that a facility has over two nurses and one doctor? and 2) What’s the probability that a facility has over two nurses given that it has one doctor?

In our shorter notation we can write the product rule as
p(y, x) = p(y|x)p(x)
We can see the relation working in practice for our data above by computing the different values for x = 1.

num_doctors=1
num_nurses=2
p_x = float((data.num_doctors_fulltime==num_doctors).sum())/float(data.num_nurses_fulltime.count())
p_y_given_x = float((data.num_nurses_fulltime[data.num_doctors_fulltime==num_doctors]>num_nurses).sum())/float((data.num_doctors_fulltime==num_doctors).sum())
p_y_and_x = float((data.num_nurses_fulltime[data.num_doctors_fulltime==num_doctors]>num_nurses).sum())/float(data.num_nurses_fulltime.count())

print("P(x) is", p_x)
print("P(y|x) is", p_y_given_x)
print("P(y,x) is", p_y_and_x)

The Sum Rule

The other fundamental rule of probability is the sum rule this tells us how to get a marginal distribution from the joint distribution. Simply put it says that we need to sum across the value we’d like to remove.
P(Y = y) = ∑xP(Y = y, X = x)
Or in our shortened notation
P(y) = ∑xP(y, x)

Bayes’ Rule

Bayes rule is a very simple rule, it’s hardly worth the name of a rule at all. It follows directly from the product rule of probability. Because P(y, x) = P(y|x)P(x) and by symmetry P(y, x) = P(x, y) = P(x|y)P(y) then by equating these two equations and dividing through by P(y) we have
$$P(x|y) = \frac{P(y|x)P(x)}{P(y)}$$
which is known as Bayes’ rule (or Bayes’s rule, it depends how you choose to pronounce it). It’s not difficult to derive, and its importance is more to do with the semantic operation that it enables. Each of these probability distributions represents the answer to a question we have about the world. Bayes rule (via the product rule) tells us how to invert the probability.

Probabilities for Extracting Information from Data 

What use is all this probability in data science? Let’s think about how we might use the probabilities to do some decision making. Let’s look at the information data.

data.columns

Assignment Questions

The questions in the above lab sheet need to be answered and handed in before 09:00 on 7th October 2014 (i.e. before next lecture). The hand should be done via file upload through MOLE.

• See probability review at end of slides for reminders.
• For other material in read:
• If you are unfamiliar with probabilities you should complete the following exercises:

References

Bishop, Christopher M. 2006. Pattern Recognition and Machine Learning. springer.

Laplace, Pierre Simon. 1814. Essai Philosophique Sur Les Probabilités. 2nd ed. Paris: Courcier.

Rogers, Simon, and Mark Girolami. 2011. A First Course in Machine Learning. CRC Press.