Probabilistic Classification: Naive Bayes

Neil D. Lawrence

2015-11-24

University of Sheffield

Review

Last time: Looked at unsupervised learning.
Introduced latent variables, dimensionality reduction and clustering.
This time: Classification with Naive Bayes

Introduction to Classification

Classification

Wake word classification (Global Pulse Project).
Breakthrough in 2012 with ImageNet result of Alex Krizhevsky, Ilya Sutskever and Geoff Hinton
We are given a data set containing ‘inputs’, \(\inputMatrix\) and ‘targets’, \(\dataVector\).
Each data point consists of an input vector \(\inputVector_i\) and a class label, \(\dataScalar_i\).
For binary classification assume \(\dataScalar_i\) should be either \(1\) (yes) or \(-1\) (no).
Input vector can be thought of as features.

Discrete Probability

Algorithms based on prediction function and objective function.
For regression the codomain of the functions, \(f(\inputMatrix)\) was the real numbers or sometimes real vectors.
In classification we are given an input vector, \(\inputVector\), and an associated label, \(\dataScalar\) which either takes the value \(-1\) or \(1\).

Classification

Inputs, \(\inputVector\), mapped to a label, \(\dataScalar\), through a function \(\mappingFunction(\cdot)\) dependent on parameters, \(\weightVector\), \[ \dataScalar = \mappingFunction(\inputVector; \weightVector). \]
\(\mappingFunction(\cdot)\) is known as the prediction function.

Classification Examples

Classifiying hand written digits from binary images (automatic zip code reading)
Detecting faces in images (e.g. digital cameras).
Who a detected face belongs to (e.g. Facebook, DeepFace)
Classifying type of cancer given gene expression data.
Categorization of document types (different types of news article on the internet)

Reminder on the Term “Bayesian”

We use Bayes’ rule to invert probabilities in the Bayesian approach.
- Bayesian is not named after Bayes’ rule (v. common confusion).
- The term Bayesian refers to the treatment of the parameters as stochastic variables.
- Proposed by Laplace (1774) and Bayes (1763) independently.
- For early statisticians this was very controversial (Fisher et al).

Reminder on the Term “Bayesian”

The use of Bayes’ rule does not imply you are being Bayesian.
- It is just an application of the product rule of probability.

Bernoulli Distribution

Binary classification: need a probability distribution for discrete variables.
Discrete probability is in some ways easier: \(P(\dataScalar=1) = \pi\) & specify distribution as a table.
Instead of \(\dataScalar=-1\) for negative class we take \(\dataScalar=0\).

\(\dataScalar\)	0	1
\(P(\dataScalar)\)	\((1-\pi)\)	\(\pi\)

This is the Bernoulli distribution.

Mathematical Switch

The Bernoulli distribution \[ P(\dataScalar) = \pi^\dataScalar (1-\pi)^{(1-\dataScalar)} \]
Is a clever trick for switching probabilities, as code it would be

def bernoulli(y_i, pi):
    if y_i == 1:
        return pi
    else:
        return 1-pi

Jacob Bernoulli’s Bernoulli

Bernoulli described the Bernoulli distribution in terms of an ‘urn’ filled with balls.
There are red and black balls. There is a fixed number of balls in the urn.
The portion of red balls is given by \(\pi\).
For this reason in Bernoulli’s distribution there is epistemic uncertainty about the distribution parameter.

Jacob Bernoulli’s Bernoulli

Thomas Bayes’s Bernoulli

Bayes described the Bernoulli distribution (he didn’t call it that!) in terms of a table and two balls.
Each ball is rolled so it comes to rest at a uniform distribution across the table.
The first ball comes to rest at a position that is a \(\pi\) times the width of table.
After placing the first ball you consider whether a second would land to the left or the right.
For this reason in Bayes’s distribution there is considered to be aleatoric uncertainty about the distribution parameter.

Thomas Bayes’ Bernoulli

Maximum Likelihood in the Bernoulli

Assume data, \(\dataVector\) is binary vector length \(\numData\).
Assume each value was sampled independently from the Bernoulli distribution, given probability \(\pi\) \[ p(\dataVector|\pi) = \prod_{i=1}^{\numData} \pi^{\dataScalar_i} (1-\pi)^{1-\dataScalar_i}. \]

Negative Log Likelihood

Minimize the negative log likelihood \[\begin{align*} \errorFunction(\pi)& = -\log p(\dataVector|\pi)\\ & = -\sum_{i=1}^{\numData} \dataScalar_i \log \pi - \sum_{i=1}^{\numData} (1-\dataScalar_i) \log(1-\pi), \end{align*}\]
Take gradient with respect to the parameter \(\pi\). \[\frac{\text{d}\errorFunction(\pi)}{\text{d}\pi} = -\frac{\sum_{i=1}^{\numData} \dataScalar_i}{\pi} + \frac{\sum_{i=1}^{\numData} (1-\dataScalar_i)}{1-\pi},\]

Fixed Point

Stationary point: set derivative to zero \[0 = -\frac{\sum_{i=1}^{\numData} \dataScalar_i}{\pi} + \frac{\sum_{i=1}^{\numData} (1-\dataScalar_i)}{1-\pi},\]
Rearrange to form \[(1-\pi)\sum_{i=1}^{\numData} \dataScalar_i = \pi\sum_{i=1}^{\numData} (1-\dataScalar_i),\]
Giving \[\sum_{i=1}^{\numData} \dataScalar_i = \pi\left(\sum_{i=1}^{\numData} (1-\dataScalar_i) + \sum_{i=1}^{\numData} \dataScalar_i\right),\]

Solution

Recognise that \(\sum_{i=1}^{\numData} (1-\dataScalar_i) + \sum_{i=1}^{\numData} \dataScalar_i = n\) so we have \[\pi = \frac{\sum_{i=1}^{\numData} \dataScalar_i}{\numData}\]
Estimate the probability associated with the Bernoulli by setting it to the number of observed positives, divided by the total length of \(\dataScalar\).
Makes intiutive sense.
What’s your best guess of probability for coin toss is heads when you get 47 heads from 100 tosses?

Bayes’ Rule Reminder

\[ \text{posterior} = \frac{\text{likelihood}\times\text{prior}}{\text{marginal likelihood}} \]

Four components:

Prior distribution
Likelihood
Posterior distribution
Marginal likelihood

Naive Bayes Classifiers

Probabilistic Machine Learning: place probability distributions (or densities) over all the variables of interest.
In naive Bayes this is exactly what we do.
Form a classification algorithm by modelling the joint density of our observations.
Need to make assumption about joint density.

Assumptions about Density

Make assumptions to reduce the number of parameters we need to optimise.
Given label data \(\dataVector\) and the inputs \(\inputMatrix\) could specify joint density of all potential values of \(\dataVector\) and \(\inputMatrix\), \(p(\dataVector, \inputMatrix)\).
If \(\inputMatrix\) and \(\dataVector\) are training data.
If \(\inputVector^*\) is a test input and \(\dataScalar^*\) a test location we want \[ p(\dataScalar^*|\inputMatrix, \dataVector, \inputVector^*), \]

Answer from Rules of Probability

Compute this distribution using the product and sum rules.
Need the probability associated with all possible combinations of \(\dataVector\) and \(\inputMatrix\).
There are \(2^{\numData}\) possible combinations for the vector \(\dataVector\)
Probability for each of these combinations must be jointly specified along with the joint density of the matrix \(\inputMatrix\),
Also need to extend the density for any chosen test location \(\inputVector^*\).

Naive Bayes Assumptions

In naive Bayes we make certain simplifying assumptions that allow us to perform all of the above in practice.

Data Conditional Independence
Feature conditional independence
Marginal density for \(\dataScalar\).

Data Conditional Independence

Given model parameters \(\paramVector\) we assume that all data points in the model are independent. \[ p(\dataScalar^*, \inputVector^*, \dataVector, \inputMatrix|\paramVector) = p(\dataScalar^*, \inputVector^*|\paramVector)\prod_{i=1}^{\numData} p(\dataScalar_i, \inputVector_i | \paramVector). \]
This is a conditional independence assumption.
We also make similar assumptions for regression (where \(\paramVector = \left\{\mappingVector,\dataStd^2\right\}\)).
Here we assume joint density of \(\dataVector\) and \(\inputMatrix\) is independent across the data given the parameters.

Bayes Classifier

Computing posterior distribution in this case becomes easier, this is known as the ‘Bayes classifier’.

Feature Conditional Independence

Particular to naive Bayes: assume features are also conditionally independent, given param and the label. \[p(\inputVector_i | \dataScalar_i, \paramVector) = \prod_{j=1}^{\dataDim} p(\inputScalar_{i,j}|\dataScalar_i,\paramVector)\] where \(\dataDim\) is the dimensionality of our inputs.
This is known as the naive Bayes assumption.
Bayes classifier + feature conditional independence.

Marginal Density for \(\dataScalar_i\)

To specify the joint distribution we also need the marginal for \(p(\dataScalar_i)\) \[p(\inputScalar_{i,j},\dataScalar_i| \paramVector) = p(\inputScalar_{i,j}|\dataScalar_i, \paramVector)p(\dataScalar_i).\]
Because \(\dataScalar_i\) is binary the Bernoulli density makes a suitable choice for our prior over \(\dataScalar_i\), \[p(\dataScalar_i|\pi) = \pi^{\dataScalar_i} (1-\pi)^{1-\dataScalar_i}\] where \(\pi\) now has the interpretation as being the prior probability that the classification should be positive.

Joint Density for Naive Bayes

This allows us to write down the full joint density of the training data, \[ p(\dataVector, \inputMatrix|\paramVector, \pi) = \prod_{i=1}^{\numData} \prod_{j=1}^{\dataDim} p(\inputScalar_{i,j}|\dataScalar_i, \paramVector)p(\dataScalar_i|\pi) \] which can now be fit by maximum likelihood.

Objective Function

\[\begin{align*} \errorFunction(\paramVector, \pi)& = -\log p(\dataVector, \inputMatrix|\paramVector, \pi) \\ &= -\sum_{i=1}^{\numData} \sum_{j=1}^{\dataDim} \log p(\inputScalar_{i, j}|\dataScalar_i, \paramVector) - \sum_{i=1}^{\numData} \log p(\dataScalar_i|\pi), \end{align*}\]

Maximum Likelihood

Fit Prior

We can minimize prior. For Bernoulli likelihood over the labels we have, \[\begin{align*} \errorFunction(\pi) & = - \sum_{i=1}^{\numData}\log p(\dataScalar_i|\pi)\\ & = -\sum_{i=1}^{\numData} \dataScalar_i \log \pi - \sum_{i=1}^{\numData} (1-\dataScalar_i) \log (1-\pi) \end{align*}\]
Solution from above is \[ \pi = \frac{\sum_{i=1}^{\numData} \dataScalar_i}{\numData}. \]

Fit Conditional

Minimize conditional distribution: \[ \errorFunction(\paramVector) = -\sum_{i=1}^{\numData} \sum_{j=1}^{\dataDim} \log p(\inputScalar_{i, j} |\dataScalar_i, \paramVector), \]
Implies making an assumption about it’s form.
The right assumption will depend on the data.
E.g. for real valued data, use a Gaussian \[ p(\inputScalar_{i, j} | \dataScalar_i,\paramVector) = \frac{1}{\sqrt{2\pi \dataStd_{\dataScalar_i,j}^2}} \exp \left(-\frac{(\inputScalar_{i,j} - \mu_{\dataScalar_i, j})^2}{\dataStd_{\dataScalar_i,j}^2}\right), \]

The distributions show the parameters of the independent class conditional probabilities for no maternity services. It is a Bernoulli distribution with the parameter, \(\pi\), given by (theta_0) for the facilities without maternity services and theta_1 for the facilities with maternity services. The parameters whow that, facilities with maternity services also are more likely to have other services such as grid electricity, emergency transport, immunization programs etc.

The naive Bayes assumption says that the joint probability for these services is given by the product of each of these Bernoulli distributions.

We have modelled the numbers in our table with a Gaussian density. Since several of these numbers are counts, a more appropriate distribution might be the Poisson distribution. But here we can see that the average number of nurses, healthworkers and doctors is higher in the facilities with maternal services (mu_1) than those without maternal services (mu_0). There is also a small difference between the mean latitude and longitudes. However, the standard deviation which would be given by the square root of the variance parameters (sigma_0 and sigma_1) is large, implying that a difference in latitude and longitude may be due to sampling error. To be sure more analysis would be required.

Compute Posterior for Test Point Label

We know that \[ P(\dataScalar^*| \dataVector, \inputMatrix, \inputVector^*, \paramVector)p(\dataVector,\inputMatrix, \inputVector^*|\paramVector) = p(\dataScalar*, \dataVector, \inputMatrix,\inputVector^*| \paramVector) \]
This implies \[ P(\dataScalar^*| \dataVector, \inputMatrix, \inputVector^*, \paramVector) = \frac{p(\dataScalar*, \dataVector, \inputMatrix, \inputVector^*| \paramVector)}{p(\dataVector, \inputMatrix, \inputVector^*|\paramVector)} \]

Compute Posterior for Test Point Label

From conditional independence assumptions \[ p(\dataScalar^*, \dataVector, \inputMatrix, \inputVector^*| \paramVector) = \prod_{j=1}^{\dataDim} p(\inputScalar^*_{j}|\dataScalar^*, \paramVector)p(\dataScalar^*|\pi)\prod_{i=1}^{\numData} \prod_{j=1}^{\dataDim} p(\inputScalar_{i,j}|\dataScalar_i, \paramVector)p(\dataScalar_i|\pi) \]
We also need \[ p(\dataVector, \inputMatrix, \inputVector^*|\paramVector)\] which can be found from \[p(\dataScalar^*, \dataVector, \inputMatrix, \inputVector^*| \paramVector) \]
Using the sum rule of probability, \[ p(\dataVector, \inputMatrix, \inputVector^*|\paramVector) = \sum_{\dataScalar^*=0}^1 p(\dataScalar^*, \dataVector, \inputMatrix, \inputVector^*| \paramVector). \]

Independence Assumptions

From independence assumptions \[ p(\dataVector, \inputMatrix, \inputVector^*| \paramVector) = \sum_{\dataScalar^*=0}^1 \prod_{j=1}^{\dataDim} p(\inputScalar^*_{j}|\dataScalar^*_i, \paramVector)p(\dataScalar^*|\pi)\prod_{i=1}^{\numData} \prod_{j=1}^{\dataDim} p(\inputScalar_{i,j}|\dataScalar_i, \paramVector)p(\dataScalar_i|\pi). \]
Substitute both forms to recover, \[ P(\dataScalar^*| \dataVector, \inputMatrix, \inputVector^*, \paramVector) = \frac{\prod_{j=1}^{\dataDim} p(\inputScalar^*_{j}|\dataScalar^*_i, \paramVector)p(\dataScalar^*|\pi)\prod_{i=1}^{\numData} \prod_{j=1}^{\dataDim} p(\inputScalar_{i,j}|\dataScalar_i, \paramVector)p(\dataScalar_i|\pi)}{\sum_{\dataScalar^*=0}^1 \prod_{j=1}^{\dataDim} p(\inputScalar^*_{j}|\dataScalar^*_i, \paramVector)p(\dataScalar^*|\pi)\prod_{i=1}^{\numData} \prod_{j=1}^{\dataDim} p(\inputScalar_{i,j}|\dataScalar_i, \paramVector)p(\dataScalar_i|\pi)} \]

Cancelation

Note training data terms cancel. \[ p(\dataScalar^*| \inputVector^*, \paramVector) = \frac{\prod_{j=1}^{\dataDim} p(\inputScalar^*_{j}|\dataScalar^*_i, \paramVector)p(\dataScalar^*|\pi)}{\sum_{\dataScalar^*=0}^1 \prod_{j=1}^{\dataDim} p(\inputScalar^*_{j}|\dataScalar^*_i, \paramVector)p(\dataScalar^*|\pi)} \]
This formula is also fairly straightforward to implement for different class conditional distributions.

Laplace Smoothing

Pseudo Counts

\[ \pi = \frac{\sum_{i=1}^{\numData} \dataScalar_i + 1}{\numData + 2} \]

Naive Bayes Summary

Model full joint distribution of data, \(p(\dataVector, \inputMatrix | \paramVector, \pi)\)
Make conditional independence assumptions about the data.
- feature conditional independence
- data conditional independence
Fast to implement, works on very large data.
Despite simple assumptions can perform better than expected.

Thanks!

References

Bayes, T., 1763. An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society 53, 370–418. https://doi.org/10.1098/rstl.1763.0053

Laplace, P.S., 1774. Mémoire sur la probabilité des causes par les évènemens, in: Mémoires de Mathèmatique et de Physique, Presentés à lAcadémie Royale Des Sciences, Par Divers Savans, & Lù Dans Ses Assemblées 6. pp. 621–656.

Rogers, S., Girolami, M., 2011. A first course in machine learning. CRC Press.

Probabilistic Classification: Naive Bayes

Review

Introduction to Classification

Classification

Discrete Probability

Classification

Classification Examples

Reminder on the Term “Bayesian”

Reminder on the Term “Bayesian”

Bernoulli Distribution

Mathematical Switch

Jacob Bernoulli’s Bernoulli

Jacob Bernoulli’s Bernoulli

Thomas Bayes’s Bernoulli

Thomas Bayes’ Bernoulli

Maximum Likelihood in the Bernoulli

Negative Log Likelihood

Fixed Point

Solution

Bayes’ Rule Reminder

Naive Bayes Classifiers

Assumptions about Density

Answer from Rules of Probability

Naive Bayes Assumptions

Data Conditional Independence

Bayes Classifier

Feature Conditional Independence

Marginal Density for \(\dataScalar_i\)

Joint Density for Naive Bayes

Objective Function

Maximum Likelihood

Fit Prior

Fit Conditional

Compute Posterior for Test Point Label

Compute Posterior for Test Point Label

Independence Assumptions

Cancelation

Laplace Smoothing

Pseudo Counts

Naive Bayes Summary

Further Reading

Thanks!

References