
# Probabilistic Classification: Naive Bayes

University of Sheffield

## Review

• Last time: Looked at unsupervised learning.
• Introduced latent variables, dimensionality reduction and clustering.
• This time: Classification with Naive Bayes

## Classification

• Wake word classification (Global Pulse Project).
• Breakthrough in 2012 with ImageNet result of Alex Krizhevsky, Ilya Sutskever and Geoff Hinton

• We are given a data set containing ‘inputs’, $\inputMatrix$ and ‘targets’, $\dataVector$.
• Each data point consists of an input vector $\inputVector_i$ and a class label, $\dataScalar_i$.
• For binary classification assume $\dataScalar_i$ should be either $1$ (yes) or $-1$ (no).
• Input vector can be thought of as features.

## Discrete Probability

• Algorithms based on prediction function and objective function.
• For regression the codomain of the functions, $f(\inputMatrix)$ was the real numbers or sometimes real vectors.
• In classification we are given an input vector, $\inputVector$, and an associated label, $\dataScalar$ which either takes the value $-1$ or $1$.

## Classification

• Inputs, $\inputVector$, mapped to a label, $\dataScalar$, through a function $\mappingFunction(\cdot)$ dependent on parameters, $\weightVector$, $\dataScalar = \mappingFunction(\inputVector; \weightVector).$
• $\mappingFunction(\cdot)$ is known as the prediction function.

## Classification Examples

• Classifying handwritten digits from binary images (automatic zip code reading)
• Detecting faces in images (e.g. digital cameras).
• Who a detected face belongs to (e.g. Facebook, DeepFace)
• Classifying type of cancer given gene expression data.
• Categorization of document types (different types of news article on the internet)

## Reminder on the Term “Bayesian”

• We use Bayes’ rule to invert probabilities in the Bayesian approach.
• Bayesian is not named after Bayes’ rule (v. common confusion).
• The term Bayesian refers to the treatment of the parameters as stochastic variables.
• Proposed by Laplace (1774) and Bayes (1763) independently.
• For early statisticians this was very controversial (Fisher et al).

## Reminder on the Term “Bayesian”

• The use of Bayes’ rule does not imply you are being Bayesian.
• It is just an application of the product rule of probability.

## Bernoulli Distribution

• Binary classification: need a probability distribution for discrete variables.
• Discrete probability is in some ways easier: $P(\dataScalar=1) = \pi$ & specify distribution as a table.
• Instead of $\dataScalar=-1$ for negative class we take $\dataScalar=0$.

| $\dataScalar$ | 0 | 1 |
|---|---|---|
| $P(\dataScalar)$ | $(1-\pi)$ | $\pi$ |

This is the Bernoulli distribution.

## Mathematical Switch

• The Bernoulli distribution, $P(\dataScalar) = \pi^\dataScalar (1-\pi)^{(1-\dataScalar)}$, is a clever trick for switching probabilities. As code it would be

    def bernoulli(y_i, pi):
        # The exponents in the formula act as a switch between pi and 1 - pi.
        if y_i == 1:
            return pi
        else:
            return 1 - pi
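
The branch and the closed-form expression should agree for both label values. A minimal sketch to check this, where the function names `bernoulli_power` and `bernoulli_branch` and the value `0.3` are chosen purely for illustration:

```python
def bernoulli_power(y_i, pi):
    """Closed form pi**y_i * (1 - pi)**(1 - y_i); the exponents do the switching."""
    return pi**y_i * (1 - pi)**(1 - y_i)


def bernoulli_branch(y_i, pi):
    """The same distribution written with an explicit if/else."""
    return pi if y_i == 1 else 1 - pi


# Compare the two forms for both possible labels (pi = 0.3 is an arbitrary choice).
for y_i in (0, 1):
    print(y_i, bernoulli_power(y_i, 0.3), bernoulli_branch(y_i, 0.3))
```

When $\dataScalar=1$ the second factor has exponent zero and drops out, and when $\dataScalar=0$ the first factor does.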

## Jacob Bernoulli’s Bernoulli

• Bernoulli described the Bernoulli distribution in terms of an ‘urn’ filled with balls.
• There are red and black balls. There is a fixed number of balls in the urn.
• The portion of red balls is given by $\pi$.
• For this reason in Bernoulli’s distribution there is epistemic uncertainty about the distribution parameter.

## Thomas Bayes’s Bernoulli

• Bayes described the Bernoulli distribution (he didn’t call it that!) in terms of a table and two balls.
• Each ball is rolled so that it comes to rest at a position uniformly distributed across the table.
• The first ball comes to rest at a position that is $\pi$ times the width of the table.
• After placing the first ball you consider whether a second would land to the left or the right.
• For this reason in Bayes’s distribution there is considered to be aleatoric uncertainty about the distribution parameter.

## Maximum Likelihood in the Bernoulli

• Assume data, $\dataVector$ is binary vector length $\numData$.
• Assume each value was sampled independently from the Bernoulli distribution, given probability $\pi$ $p(\dataVector|\pi) = \prod_{i=1}^{\numData} \pi^{\dataScalar_i} (1-\pi)^{1-\dataScalar_i}.$

## Negative Log Likelihood

• Minimize the negative log likelihood \begin{align*} \errorFunction(\pi)& = -\log p(\dataVector|\pi)\\ & = -\sum_{i=1}^{\numData} \dataScalar_i \log \pi - \sum_{i=1}^{\numData} (1-\dataScalar_i) \log(1-\pi), \end{align*}
• Take gradient with respect to the parameter $\pi$. $\frac{\text{d}\errorFunction(\pi)}{\text{d}\pi} = -\frac{\sum_{i=1}^{\numData} \dataScalar_i}{\pi} + \frac{\sum_{i=1}^{\numData} (1-\dataScalar_i)}{1-\pi},$
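
One way to gain confidence in the gradient is to compare it with a finite difference of the negative log likelihood. The sketch below assumes a small made-up label vector; the function names and the test value `pi = 0.4` are illustrative choices, not part of the lecture.

```python
import math


def neg_log_likelihood(y, pi):
    """E(pi) = -sum_i y_i log(pi) - sum_i (1 - y_i) log(1 - pi)."""
    n_pos = sum(y)
    n_neg = len(y) - n_pos
    return -n_pos * math.log(pi) - n_neg * math.log(1 - pi)


def gradient(y, pi):
    """dE/dpi = -sum_i y_i / pi + sum_i (1 - y_i) / (1 - pi)."""
    n_pos = sum(y)
    n_neg = len(y) - n_pos
    return -n_pos / pi + n_neg / (1 - pi)


y = [1, 0, 0, 1, 1, 0, 1, 1]  # made-up labels, for illustration only
pi, eps = 0.4, 1e-6

# Central finite difference approximation to the derivative at pi.
numeric = (neg_log_likelihood(y, pi + eps) - neg_log_likelihood(y, pi - eps)) / (2 * eps)
print(gradient(y, pi), numeric)
```

The two printed numbers should agree to several decimal places.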

## Fixed Point

• Stationary point: set derivative to zero $0 = -\frac{\sum_{i=1}^{\numData} \dataScalar_i}{\pi} + \frac{\sum_{i=1}^{\numData} (1-\dataScalar_i)}{1-\pi},$

• Rearrange to form $(1-\pi)\sum_{i=1}^{\numData} \dataScalar_i = \pi\sum_{i=1}^{\numData} (1-\dataScalar_i),$

• Giving $\sum_{i=1}^{\numData} \dataScalar_i = \pi\left(\sum_{i=1}^{\numData} (1-\dataScalar_i) + \sum_{i=1}^{\numData} \dataScalar_i\right),$

## Solution

• Recognise that $\sum_{i=1}^{\numData} (1-\dataScalar_i) + \sum_{i=1}^{\numData} \dataScalar_i = \numData$, so we have $\pi = \frac{\sum_{i=1}^{\numData} \dataScalar_i}{\numData}$

• Estimate the probability associated with the Bernoulli by setting it to the number of observed positives divided by the total length of $\dataVector$.
• Makes intuitive sense.
• What’s your best guess of the probability that a coin toss comes up heads when you get 47 heads from 100 tosses?
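
The coin toss question can be checked numerically. This sketch computes the closed-form estimate and compares it with a brute-force search over a grid of candidate values; the grid resolution of 0.001 is an arbitrary assumption.

```python
import math


def neg_log_likelihood(y, pi):
    """Bernoulli negative log likelihood for a binary label list y."""
    n_pos = sum(y)
    return -n_pos * math.log(pi) - (len(y) - n_pos) * math.log(1 - pi)


y = [1] * 47 + [0] * 53      # 47 heads and 53 tails
pi_ml = sum(y) / len(y)      # closed-form maximum likelihood solution

# Brute-force check: evaluate the objective on a grid strictly inside (0, 1).
grid = [k / 1000 for k in range(1, 1000)]
pi_grid = min(grid, key=lambda p: neg_log_likelihood(y, p))
print(pi_ml, pi_grid)
```

Both approaches should land on the observed fraction of heads.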

## Bayes’ Rule Reminder

$\text{posterior} = \frac{\text{likelihood}\times\text{prior}}{\text{marginal likelihood}}$

Four components:

1. Prior distribution
2. Likelihood
3. Posterior distribution
4. Marginal likelihood
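
As a small worked example of the four components, consider a binary label and a single observed input. All numbers below are made up for illustration.

```python
# Prior P(y) over the two classes (made-up numbers).
prior = {1: 0.3, 0: 0.7}

# Likelihood P(x | y) of one particular observed x under each class (made-up numbers).
likelihood = {1: 0.8, 0: 0.1}

# Marginal likelihood P(x): sum rule over the label.
marginal = sum(likelihood[y] * prior[y] for y in (0, 1))

# Posterior P(y | x): likelihood times prior, divided by the marginal likelihood.
posterior = {y: likelihood[y] * prior[y] / marginal for y in (0, 1)}
print(marginal, posterior)
```

Here the marginal likelihood is $0.8\times 0.3 + 0.1\times 0.7 = 0.31$, so the posterior probability of the positive class is $0.24/0.31 \approx 0.77$.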

## Naive Bayes Classifiers

• Probabilistic Machine Learning: place probability distributions (or densities) over all the variables of interest.
• In naive Bayes this is exactly what we do.

• Form a classification algorithm by modelling the joint density of our observations.

• Need to make assumption about joint density.

• Make assumptions to reduce the number of parameters we need to optimise.
• Given label data $\dataVector$ and the inputs $\inputMatrix$ could specify joint density of all potential values of $\dataVector$ and $\inputMatrix$, $p(\dataVector, \inputMatrix)$.
• Here $\inputMatrix$ and $\dataVector$ are the training data.
• If $\inputVector^*$ is a test input and $\dataScalar^*$ the corresponding test label, we want $p(\dataScalar^*|\inputMatrix, \dataVector, \inputVector^*).$

## Answer from Rules of Probability

• Compute this distribution using the product and sum rules.
• Need the probability associated with all possible combinations of $\dataVector$ and $\inputMatrix$.
• There are $2^{\numData}$ possible combinations for the vector $\dataVector$
• Probability for each of these combinations must be jointly specified along with the joint density of the matrix $\inputMatrix$,
• Also need to extend the density for any chosen test location $\inputVector^*$.

## Naive Bayes Assumptions

• In naive Bayes we make certain simplifying assumptions that allow us to perform all of the above in practice.
1. Data Conditional Independence
2. Feature conditional independence
3. Marginal density for $\dataScalar$.

## Data Conditional Independence

• Given model parameters $\paramVector$ we assume that all data points in the model are independent. $p(\dataScalar^*, \inputVector^*, \dataVector, \inputMatrix|\paramVector) = p(\dataScalar^*, \inputVector^*|\paramVector)\prod_{i=1}^{\numData} p(\dataScalar_i, \inputVector_i | \paramVector).$

• This is a conditional independence assumption.

• We also make similar assumptions for regression (where $\paramVector = \left\{\mappingVector,\dataStd^2\right\}$).

• Here we assume joint density of $\dataVector$ and $\inputMatrix$ is independent across the data given the parameters.

## Bayes Classifier

Computing the posterior distribution in this case becomes easier. This is known as the ‘Bayes classifier’.

## Feature Conditional Independence

• Particular to naive Bayes: assume features are also conditionally independent, given the parameters and the label, $p(\inputVector_i | \dataScalar_i, \paramVector) = \prod_{j=1}^{\dataDim} p(\inputScalar_{i,j}|\dataScalar_i,\paramVector)$ where $\dataDim$ is the dimensionality of our inputs.
• This is known as the naive Bayes assumption.
• Bayes classifier + feature conditional independence.
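
To illustrate the factorisation, the sketch below assumes binary features with one Bernoulli parameter per feature for a given class. The parameter values and the name `class_conditional` are hypothetical.

```python
def class_conditional(x, theta):
    """p(x | y) = prod_j p(x_j | y) for binary features.

    theta[j] is the assumed probability that feature j equals 1 for this class.
    """
    p = 1.0
    for x_j, theta_j in zip(x, theta):
        p *= theta_j if x_j == 1 else 1 - theta_j
    return p


theta_1 = [0.9, 0.6, 0.2]   # hypothetical P(x_j = 1 | y = 1)
x = [1, 0, 1]
print(class_conditional(x, theta_1))   # 0.9 * (1 - 0.6) * 0.2

# Parameters needed per class: a full table over p binary features versus the factorised model.
p = len(x)
print(2**p - 1, p)
```

The second line of output shows why the assumption helps: a full joint table over the features grows exponentially with $\dataDim$, while the factorised model grows linearly.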

## Marginal Density for $\dataScalar_i$

• To specify the joint distribution we also need the marginal $p(\dataScalar_i)$, $p(\inputScalar_{i,j},\dataScalar_i| \paramVector) = p(\inputScalar_{i,j}|\dataScalar_i, \paramVector)p(\dataScalar_i).$

• Because $\dataScalar_i$ is binary the Bernoulli density makes a suitable choice for our prior over $\dataScalar_i$, $p(\dataScalar_i|\pi) = \pi^{\dataScalar_i} (1-\pi)^{1-\dataScalar_i}$ where $\pi$ now has the interpretation as being the prior probability that the classification should be positive.

## Joint Density for Naive Bayes

• This allows us to write down the full joint density of the training data, $p(\dataVector, \inputMatrix|\paramVector, \pi) = \prod_{i=1}^{\numData} p(\dataScalar_i|\pi) \prod_{j=1}^{\dataDim} p(\inputScalar_{i,j}|\dataScalar_i, \paramVector)$ which can now be fit by maximum likelihood.

## Objective Function

\begin{align*} \errorFunction(\paramVector, \pi)& = -\log p(\dataVector, \inputMatrix|\paramVector, \pi) \\ &= -\sum_{i=1}^{\numData} \sum_{j=1}^{\dataDim} \log p(\inputScalar_{i, j}|\dataScalar_i, \paramVector) - \sum_{i=1}^{\numData} \log p(\dataScalar_i|\pi), \end{align*}
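
The objective splits into a term that depends only on $\paramVector$ and a term that depends only on $\pi$, so each can be minimised separately. The sketch below evaluates the two terms for a tiny made-up data set, assuming Bernoulli class conditionals for binary features; the parameter values are hypothetical.

```python
import math


def log_bernoulli(v, p):
    """Log probability of a binary value v under a Bernoulli with parameter p."""
    return math.log(p) if v == 1 else math.log(1 - p)


def objective(X, y, theta, pi):
    """Return the conditional term and the prior term of E(theta, pi) separately."""
    conditional = -sum(log_bernoulli(x_ij, theta[y_i][j])
                       for x_i, y_i in zip(X, y)
                       for j, x_ij in enumerate(x_i))
    prior = -sum(log_bernoulli(y_i, pi) for y_i in y)
    return conditional, prior


X = [[1, 0], [1, 1], [0, 0], [0, 1]]     # made-up binary features
y = [1, 1, 0, 0]                         # made-up labels
theta = {0: [0.2, 0.5], 1: [0.8, 0.5]}   # hypothetical P(x_j = 1 | y)
conditional, prior = objective(X, y, theta, pi=0.5)
print(conditional, prior, conditional + prior)
```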

## Fit Prior

• We can fit the prior by minimizing its negative log likelihood. For Bernoulli likelihood over the labels we have, \begin{align*} \errorFunction(\pi) & = - \sum_{i=1}^{\numData}\log p(\dataScalar_i|\pi)\\ & = -\sum_{i=1}^{\numData} \dataScalar_i \log \pi - \sum_{i=1}^{\numData} (1-\dataScalar_i) \log (1-\pi) \end{align*}
• Solution from above is $\pi = \frac{\sum_{i=1}^{\numData} \dataScalar_i}{\numData}.$

## Fit Conditional

• Minimize the negative log likelihood of the conditional distribution: $\errorFunction(\paramVector) = -\sum_{i=1}^{\numData} \sum_{j=1}^{\dataDim} \log p(\inputScalar_{i, j} |\dataScalar_i, \paramVector),$
• Implies making an assumption about its form.
• The right assumption will depend on the data.
• E.g. for real valued data, use a Gaussian $p(\inputScalar_{i, j} | \dataScalar_i,\paramVector) = \frac{1}{\sqrt{2\pi \dataStd_{\dataScalar_i,j}^2}} \exp \left(-\frac{(\inputScalar_{i,j} - \mu_{\dataScalar_i, j})^2}{2\dataStd_{\dataScalar_i,j}^2}\right),$ where the $\pi$ inside the square root is the constant $3.14159\ldots$, not the Bernoulli parameter.
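
For the Gaussian choice, the standard maximum likelihood solution is the per-class sample mean and the per-class (biased) sample variance of each feature. A minimal sketch on made-up data follows; the name `fit_gaussian` and the data values are illustrative assumptions.

```python
def fit_gaussian(X, y):
    """Maximum likelihood mean and variance for each class and each feature.

    Returns a dict mapping class label -> (list of means, list of variances).
    """
    params = {}
    for c in sorted(set(y)):
        rows = [x_i for x_i, y_i in zip(X, y) if y_i == c]
        n_c = len(rows)
        columns = list(zip(*rows))                      # one tuple per feature
        mu = [sum(col) / n_c for col in columns]
        var = [sum((v - m) ** 2 for v in col) / n_c     # divide by n_c, not n_c - 1
               for col, m in zip(columns, mu)]
        params[c] = (mu, var)
    return params


X = [[1.0, 2.0], [3.0, 2.5], [6.0, 0.0], [8.0, 1.0]]   # made-up real valued inputs
y = [0, 0, 1, 1]                                        # made-up labels
params = fit_gaussian(X, y)
print(params)
```

In practice a feature with zero variance within a class would need special handling, which this sketch does not attempt.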

The distributions show the parameters of the independent class conditional probabilities. Each is a Bernoulli distribution with the parameter, $\pi$, given by theta_0 for the facilities without maternity services and theta_1 for the facilities with maternity services. The parameters show that facilities with maternity services are also more likely to have other services such as grid electricity, emergency transport, immunization programs etc.

The naive Bayes assumption says that the joint probability for these services is given by the product of each of these Bernoulli distributions.

We have modelled the numbers in our table with a Gaussian density. Since several of these numbers are counts, a more appropriate distribution might be the Poisson distribution. But here we can see that the average number of nurses, health workers and doctors is higher in the facilities with maternal services (mu_1) than those without maternal services (mu_0). There is also a small difference between the mean latitudes and longitudes. However, the standard deviation, which would be given by the square root of the variance parameters (sigma_0 and sigma_1), is large, implying that a difference in latitude and longitude may be due to sampling error. To be sure, more analysis would be required.

## Compute Posterior for Test Point Label

• We know that $P(\dataScalar^*| \dataVector, \inputMatrix, \inputVector^*, \paramVector)p(\dataVector,\inputMatrix, \inputVector^*|\paramVector) = p(\dataScalar^*, \dataVector, \inputMatrix,\inputVector^*| \paramVector)$
• This implies $P(\dataScalar^*| \dataVector, \inputMatrix, \inputVector^*, \paramVector) = \frac{p(\dataScalar^*, \dataVector, \inputMatrix, \inputVector^*| \paramVector)}{p(\dataVector, \inputMatrix, \inputVector^*|\paramVector)}$

## Compute Posterior for Test Point Label

• From conditional independence assumptions $p(\dataScalar^*, \dataVector, \inputMatrix, \inputVector^*| \paramVector) = p(\dataScalar^*|\pi)\prod_{j=1}^{\dataDim} p(\inputScalar^*_{j}|\dataScalar^*, \paramVector)\prod_{i=1}^{\numData} p(\dataScalar_i|\pi)\prod_{j=1}^{\dataDim} p(\inputScalar_{i,j}|\dataScalar_i, \paramVector)$
• We also need $p(\dataVector, \inputMatrix, \inputVector^*|\paramVector)$ which can be found from $p(\dataScalar^*, \dataVector, \inputMatrix, \inputVector^*| \paramVector)$
• Using the sum rule of probability, $p(\dataVector, \inputMatrix, \inputVector^*|\paramVector) = \sum_{\dataScalar^*=0}^1 p(\dataScalar^*, \dataVector, \inputMatrix, \inputVector^*| \paramVector).$

## Independence Assumptions

• From independence assumptions $p(\dataVector, \inputMatrix, \inputVector^*| \paramVector) = \sum_{\dataScalar^*=0}^1 p(\dataScalar^*|\pi)\prod_{j=1}^{\dataDim} p(\inputScalar^*_{j}|\dataScalar^*, \paramVector)\prod_{i=1}^{\numData} p(\dataScalar_i|\pi)\prod_{j=1}^{\dataDim} p(\inputScalar_{i,j}|\dataScalar_i, \paramVector).$
• Substitute both forms to recover, $P(\dataScalar^*| \dataVector, \inputMatrix, \inputVector^*, \paramVector) = \frac{p(\dataScalar^*|\pi)\prod_{j=1}^{\dataDim} p(\inputScalar^*_{j}|\dataScalar^*, \paramVector)\prod_{i=1}^{\numData} p(\dataScalar_i|\pi)\prod_{j=1}^{\dataDim} p(\inputScalar_{i,j}|\dataScalar_i, \paramVector)}{\sum_{\dataScalar^*=0}^1 p(\dataScalar^*|\pi)\prod_{j=1}^{\dataDim} p(\inputScalar^*_{j}|\dataScalar^*, \paramVector)\prod_{i=1}^{\numData} p(\dataScalar_i|\pi)\prod_{j=1}^{\dataDim} p(\inputScalar_{i,j}|\dataScalar_i, \paramVector)}$

## Cancellation

• Note training data terms cancel. $p(\dataScalar^*| \inputVector^*, \paramVector) = \frac{p(\dataScalar^*|\pi)\prod_{j=1}^{\dataDim} p(\inputScalar^*_{j}|\dataScalar^*, \paramVector)}{\sum_{\dataScalar^*=0}^1 p(\dataScalar^*|\pi)\prod_{j=1}^{\dataDim} p(\inputScalar^*_{j}|\dataScalar^*, \paramVector)}$
• This formula is also fairly straightforward to implement for different class conditional distributions.
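
As one possible implementation of this formula, the sketch below assumes Gaussian class conditionals and works with log probabilities to avoid numerical underflow when $\dataDim$ is large. The parameter values and function names are hypothetical, not taken from the lecture.

```python
import math


def log_gaussian(x, mu, var):
    """Log density of a univariate Gaussian with mean mu and variance var."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)


def posterior_positive(x_star, params, pi):
    """P(y* = 1 | x*) for a two-class naive Bayes model.

    params maps class label -> (list of means, list of variances).
    pi is the prior probability of the positive class.
    """
    log_joint = {}
    for c, prior_c in ((0, 1 - pi), (1, pi)):
        mu, var = params[c]
        log_joint[c] = math.log(prior_c) + sum(
            log_gaussian(x_j, m_j, v_j) for x_j, m_j, v_j in zip(x_star, mu, var))
    # Normalise over y* in {0, 1}; subtracting the maximum keeps exp() well behaved.
    top = max(log_joint.values())
    norm = sum(math.exp(v - top) for v in log_joint.values())
    return math.exp(log_joint[1] - top) / norm


# Hypothetical fitted parameters: class -> (means, variances).
params = {0: ([2.0, 2.25], [1.0, 1.0]), 1: ([7.0, 0.5], [1.0, 1.0])}
near_class_1 = posterior_positive([6.5, 0.4], params, pi=0.5)
midpoint = posterior_positive([4.5, 1.375], params, pi=0.5)
print(near_class_1, midpoint)
```

A test point close to the positive class mean gets a posterior near one, while a point exactly halfway between the two class means (with equal variances and equal priors) gets a posterior of one half.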

## Pseudo Counts

$\pi = \frac{\sum_{i=1}^{\numData} \dataScalar_i + 1}{\numData + 2}$
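
The pseudo count estimate behaves as if one extra positive and one extra negative had been observed. A minimal sketch of why this can matter, using a made-up label vector with no positives: the maximum likelihood estimate is exactly zero, which would make $\log \pi$ undefined, while the pseudo count estimate stays strictly between zero and one.

```python
def pi_ml(y):
    """Maximum likelihood estimate: fraction of positives."""
    return sum(y) / len(y)


def pi_pseudo(y):
    """Estimate with one pseudo-observation added to each class."""
    return (sum(y) + 1) / (len(y) + 2)


y = [0, 0, 0]                     # made-up data with no positives observed
print(pi_ml(y), pi_pseudo(y))     # 0.0 0.2
```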

## Naive Bayes Summary

• Model full joint distribution of data, $p(\dataVector, \inputMatrix | \paramVector, \pi)$
• Make conditional independence assumptions about the data.
• feature conditional independence
• data conditional independence
• Fast to implement, works on very large data.
• Despite simple assumptions can perform better than expected.

• Chapter 5 up to pg 179 (Section 5.1, and 5.2 up to 5.2.2) of Rogers and Girolami (2011)

## References

Bayes, T., 1763. An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society 53, 370–418. https://doi.org/10.1098/rstl.1763.0053

Laplace, P.S., 1774. Mémoire sur la probabilité des causes par les évènemens, in: Mémoires de Mathématique et de Physique, Presentés à l’Académie Royale Des Sciences, Par Divers Savans, & Lûs Dans Ses Assemblées 6. pp. 621–656.

Rogers, S., Girolami, M., 2011. A first course in machine learning. CRC Press.