at University of Sheffield on Dec 1, 2015
Neil D. Lawrence, University of Sheffield

#### Abstract

Naive Bayes assumptions allow us to specify class conditional densities through assuming that the data are conditionally independent given parameters. A logistic regression is an approach to classification which extends the linear basis function models we’ve already explored. Rather than modeling the output of the function directly, the assumption is that we model the log-odds with the basis functions.

## Review

The naive Bayes assumption allowed us to specify a class conditional density, $p(\inputVector_i|\dataScalar_i, \parameterVector)$, through assuming that the features were conditionally independent given the label. Combined with our assumption that the data points are conditionally independent given the parameters, $\parameterVector$, this allowed us to specify a joint density over the entire data set, $p(\dataVector, \inputMatrix)$. We argued that modeling the joint density is a powerful approach because we can answer any particular question we have about the data through the sum rule and the product rule of probability. We can condition on the training data and query the value of an unseen test point. If we have missing data, then we can integrate over the missing point (marginalise) and obtain our best prediction despite the absence of some of the features for a point. However, it comes at the cost of a particular modeling assumption. Namely, to make modeling practical we assumed that the features were conditionally independent given the label. In other words, for any given point, if we know its class, then its features will be independent. This is a very strong assumption. For example, if we were classifying the sex of an individual given their height and weight, naive Bayes would assume that if we knew their sex, then the height and weight would be independent. This is clearly wrong: the dependence between height and weight is not dictated only by the sex of an individual; there is a natural correlation between them.

Modeling the entire joint density allows us to deal with different questions that we may not have envisaged at model design time. It contrasts with the approach we took for regression where we specifically chose to model the conditional density for the target values, $\dataVector$, given the input locations, $\inputMatrix$. That density, $p(\dataVector|\inputMatrix)$, effectively assumes that the question we’ll be asked at run time is known. In particular, we expect to be asked about the value of the function, $\dataScalar^*$, given a particular input location, $\inputVector^*$. We don’t expect to be asked about the value of an input given a particular observation. That would require placing an additional prior over the input location for each point, $p(\inputVector_i)$. Of course, it’s possible to conceive of a model like this, and indeed we have proceeded that way before. However, if we know we will always have all the inputs at run time, it may make sense to directly model the conditional density, $p(\dataVector|\inputMatrix)$.

## Logistic Regression 

A logistic regression is an approach to classification which extends the linear basis function models we’ve already explored. Rather than modeling the output of the function directly, the assumption is that we model the log-odds with the basis functions.

The odds are defined as the ratio of the probability of a positive outcome to the probability of a negative outcome. If the probability of a positive outcome is denoted by $\pi$, then the odds are computed as $\frac{\pi}{1-\pi}$. Odds are widely used by bookmakers in gambling, although a bookmaker’s odds won’t normalise: i.e. if you look at the equivalent probabilities, and sum over the probability of all outcomes the bookmakers are considering, then you won’t get one. This is how the bookmaker makes a profit. Because a probability is always between zero and one, the odds are always between 0 and $\infty$. If the positive outcome is unlikely the odds are close to zero; if it is very likely then the odds become very large. Taking the logarithm of the odds maps the odds from the positive half line to the entire real line. Odds that were between 0 and 1 (where the negative outcome was more likely) are mapped to the range between $-\infty$ and 0. Odds that are greater than 1 are mapped to the range between 0 and $\infty$. Considering the log odds therefore takes a number between 0 and 1 (the probability of positive outcome) and maps it to the entire real line. The function that does this is known as the logit function, $g^{-1}(\pi_i) = \log\frac{\pi_i}{1-\pi_i}$. This function is known as a link function.
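As a small numerical illustration of the mapping from probability to odds to log-odds, here is a minimal sketch. The function names `odds` and `logit` are chosen here for illustration and are not taken from the original notes.

```python
import numpy as np

def odds(pi):
    """Ratio of the probability of a positive outcome to that of a negative one."""
    return pi / (1.0 - pi)

def logit(pi):
    """Log-odds: maps a probability in (0, 1) to the whole real line."""
    return np.log(odds(pi))

probabilities = np.array([0.1, 0.5, 0.9])
print(odds(probabilities))   # below 1, exactly 1, above 1
print(logit(probabilities))  # negative, zero, positive
```

Note the symmetry: swapping the roles of the positive and negative outcome inverts the odds and flips the sign of the log-odds.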

For a standard regression we take,
$$\mappingFunction(\inputVector) = \mappingVector^\top \basisVector(\inputVector),$$
whereas if we want to perform classification we perform a logistic regression,
$$\log \frac{\pi}{(1-\pi)} = \mappingVector^\top \basisVector(\inputVector)$$
where the odds between the positive class and the negative class are given by
$$\frac{\pi}{(1-\pi)}$$
The odds can never be negative, but can take any value from 0 to $\infty$. We have defined the link function as taking the form $g^{-1}(\cdot)$, implying that the inverse link function is given by $g(\cdot)$. Since we have defined,
$$g^{-1}(\pi) = \mappingVector^\top \basisVector(\inputVector)$$
we can write $\pi$ in terms of the inverse link function, $g(\cdot)$, as
$$\pi = g(\mappingVector^\top \basisVector(\inputVector)).$$

## Basis Function

We’ll define our prediction, objective and gradient functions below. But before we start, we need to define a basis function for our model. Let’s start with the linear basis.
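A minimal sketch of a linear basis follows. It assumes the inputs arrive as a two-dimensional NumPy array with one row per data point, and it prepends a column of ones so that the model has a bias term. The name `linear` is an illustrative choice.

```python
import numpy as np

def linear(x):
    """Linear basis: a column of ones (bias) followed by the raw inputs.

    x is assumed to be a 2-D array with one row per data point.
    """
    return np.hstack([np.ones((x.shape[0], 1)), x])

x = np.array([[0.5, -1.0], [2.0, 3.0]])
Phi = linear(x)  # shape (2, 3): bias column plus two input features
```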

## Prediction Function

Now that we have the basis function, let’s define the prediction function.
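One way the prediction function might look is sketched below. It computes $\mappingVector^\top \basisVector(\inputVector_i)$ for every data point and passes the result through the logistic function. The names `linear` and `predict`, and the choice to return the basis matrix alongside the probabilities, are assumptions made for this sketch.

```python
import numpy as np

def linear(x):
    """Linear basis: bias column followed by the raw inputs."""
    return np.hstack([np.ones((x.shape[0], 1)), x])

def predict(w, x, basis=linear):
    """Return the probability of the positive class and the basis matrix."""
    Phi = basis(x)
    f = Phi @ w                      # linear combination, one value per data point
    return 1.0 / (1.0 + np.exp(-f)), Phi

x = np.array([[-2.0], [0.0], [2.0]])
w = np.array([[0.0], [1.5]])         # hypothetical weights: zero bias, slope 1.5
pi, Phi = predict(w, x)
```

Returning `Phi` as well is a convenience, since the gradient of the objective reuses it.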

The inverse of the link function is known as the logistic function (thus the name logistic regression), or sometimes it is called the sigmoid function. For a particular value of the input to the inverse link function, $\mappingFunction_i = \mappingVector^\top \basisVector(\inputVector_i)$, we can plot the value of the inverse link function.

### Sigmoid Function 

The function has this characteristic ‘s’-shape (which is where the term sigmoid, as in sigma, comes from). It also takes the input from the entire real line and ‘squashes’ it into an output that is between zero and one. For this reason it is sometimes also called a ‘squashing function’.

By replacing the inverse link with the sigmoid we can write $\pi$ as a function of the input and the parameter vector as,
$$\pi(\inputVector,\mappingVector) = \frac{1}{1+\exp\left(-\mappingVector^\top \basisVector(\inputVector)\right)}.$$
The process for logistic regression is as follows. Compute the output of a standard linear basis function composition ($\mappingVector^\top \basisVector(\inputVector)$, as we did for linear regression) and then apply the inverse link function, $g(\mappingVector^\top \basisVector(\inputVector))$. In logistic regression this involves squashing it with the logistic (or sigmoid) function. Use this value, which now has an interpretation as a probability in a Bernoulli distribution, to form the likelihood. Then we can assume conditional independence of each data point given the parameters and develop a likelihood for the entire data set.

As we discussed last time, the Bernoulli likelihood is of the form,
$$P(\dataScalar_i|\mappingVector, \inputVector_i) = \pi_i^{\dataScalar_i} (1-\pi_i)^{1-\dataScalar_i}$$
which we can think of as a clever trick for mathematically switching between two probabilities. If we were to write it as code, it would be better described as a conditional that returns $\pi_i$ when $\dataScalar_i = 1$ and $1-\pi_i$ otherwise, but writing it mathematically makes it easier to write our objective function within a single mathematical equation.
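The switching behaviour of the Bernoulli likelihood can be sketched in code as a simple conditional. The function names below are illustrative; the second function is the exponent form from the equation, included only to show that the two agree.

```python
def bernoulli(y, pi):
    """Probability of a binary label y when the positive class has probability pi."""
    if y == 1:
        return pi
    else:
        return 1 - pi

def bernoulli_power_form(y, pi):
    """The same quantity written with the exponent trick used in the equation."""
    return pi**y * (1 - pi)**(1 - y)
```

For $\dataScalar_i = 1$ the second factor has exponent zero and drops out; for $\dataScalar_i = 0$ the first factor does.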

## Maximum Likelihood

To obtain the parameters of the model, we need to maximize the likelihood, or minimize the objective function, normally taken to be the negative log likelihood. With a data conditional independence assumption the likelihood has the form,
$$P(\dataVector|\mappingVector, \inputMatrix) = \prod_{i=1}^\numData P(\dataScalar_i|\mappingVector, \inputVector_i),$$
which can be written as a log likelihood in the form
$$\log P(\dataVector|\mappingVector, \inputMatrix) = \sum_{i=1}^\numData \log P(\dataScalar_i|\mappingVector, \inputVector_i) = \sum_{i=1}^\numData \dataScalar_i \log \pi_i + \sum_{i=1}^\numData (1-\dataScalar_i)\log (1-\pi_i)$$
and if we take the probability of positive outcome for the $i$th data point to be given by
$$\pi_i = g\left(\mappingVector^\top \basisVector(\inputVector_i)\right),$$
where $g(\cdot)$ is the inverse link function, then this leads to an objective function of the form,
$$E(\mappingVector) = - \sum_{i=1}^\numData \dataScalar_i \log g\left(\mappingVector^\top \basisVector(\inputVector_i)\right) - \sum_{i=1}^\numData(1-\dataScalar_i)\log \left(1-g\left(\mappingVector^\top \basisVector(\inputVector_i)\right)\right).$$
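The objective can be sketched directly from this equation. The version below takes the predicted probabilities $\pi_i$ and the labels as arrays; the name `objective` and the flattening of the inputs are choices made for this sketch.

```python
import numpy as np

def objective(pi, y):
    """Negative log likelihood for labels y in {0, 1} and predicted probabilities pi."""
    pi = np.asarray(pi, dtype=float).ravel()
    y = np.asarray(y).ravel()
    return -np.sum(y * np.log(pi) + (1 - y) * np.log(1 - pi))

y = np.array([1, 0, 1])
pi = np.array([0.9, 0.2, 0.6])
E = objective(pi, y)  # equals -(log 0.9 + log 0.8 + log 0.6)
```

This direct form breaks down if any $\pi_i$ is exactly 0 or 1, because the logarithm is then infinite, so a practical implementation would usually guard against that.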

As normal, we would like to minimize this objective. This can be done by differentiating with respect to the parameters of our prediction function, $\pi(\inputVector;\mappingVector)$, for optimisation. The gradient of the objective with respect to the parameters, $\mappingVector$, is of the form,
$$\frac{\text{d}E(\mappingVector)}{\text{d}\mappingVector} = -\sum_{i=1}^\numData \frac{\dataScalar_i}{g\left(\mappingVector^\top \basisVector(\inputVector_i)\right)}\frac{\text{d}g(\mappingFunction_i)}{\text{d}\mappingFunction_i} \basisVector(\inputVector_i) + \sum_{i=1}^\numData \frac{1-\dataScalar_i}{1-g\left(\mappingVector^\top \basisVector(\inputVector_i)\right)}\frac{\text{d}g(\mappingFunction_i)}{\text{d}\mappingFunction_i} \basisVector(\inputVector_i)$$
where we used the chain rule to develop the derivative in terms of $\frac{\text{d}g(\mappingFunction_i)}{\text{d}\mappingFunction_i}$, which is the gradient of the inverse link function (in our case the gradient of the sigmoid function).

So the gradient of the objective function depends on the gradient of the inverse link function, on the gradient of the log likelihood with respect to $\pi_i$, and naturally on the gradient of the argument of the inverse link function with respect to the parameters, which is simply $\basisVector(\inputVector_i)$.

The only missing term is the gradient of the inverse link function. For the sigmoid squashing function we have,
\begin{align*} g(\mappingFunction_i) &= \frac{1}{1+\exp(-\mappingFunction_i)}\\ &=(1+\exp(-\mappingFunction_i))^{-1} \end{align*}
and the gradient can be computed as
\begin{align*} \frac{\text{d}g(\mappingFunction_i)}{\text{d} \mappingFunction_i} & = \exp(-\mappingFunction_i)(1+\exp(-\mappingFunction_i))^{-2}\\ & = \frac{1}{1+\exp(-\mappingFunction_i)} \frac{\exp(-\mappingFunction_i)}{1+\exp(-\mappingFunction_i)} \\ & = g(\mappingFunction_i) (1-g(\mappingFunction_i)) \end{align*}
so the full gradient can be written down as
$$\frac{\text{d}E(\mappingVector)}{\text{d}\mappingVector} = -\sum_{i=1}^\numData \dataScalar_i\left(1-g\left(\mappingVector^\top \basisVector(\inputVector_i)\right)\right) \basisVector(\inputVector_i) + \sum_{i=1}^\numData (1-\dataScalar_i)\left(g\left(\mappingVector^\top \basisVector(\inputVector_i)\right)\right) \basisVector(\inputVector_i).$$
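A useful sanity check on a derivation like this is to compare the analytic gradient with a finite-difference estimate. The sketch below does so on synthetic data; the random data, the function names and the step size `eps` are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(f):
    return 1.0 / (1.0 + np.exp(-f))

def objective(w, Phi, y):
    """Negative log likelihood for weights w, basis matrix Phi and labels y."""
    pi = sigmoid(Phi @ w)
    return -np.sum(y * np.log(pi) + (1 - y) * np.log(1 - pi))

def gradient(w, Phi, y):
    """Gradient of the negative log likelihood, following the two sums in the text."""
    pi = sigmoid(Phi @ w)
    return -Phi.T @ (y * (1 - pi)) + Phi.T @ ((1 - y) * pi)

rng = np.random.default_rng(0)
Phi = np.hstack([np.ones((20, 1)), rng.normal(size=(20, 2))])
y = (rng.uniform(size=20) < 0.5).astype(float)
w = rng.normal(size=3)

# Central finite differences as an independent check on the analytic gradient.
eps = 1e-6
numeric = np.array([
    (objective(w + eps * e, Phi, y) - objective(w - eps * e, Phi, y)) / (2 * eps)
    for e in np.eye(3)
])
analytic = gradient(w, Phi, y)
```

Expanding the two sums shows that they collapse to $\sum_{i=1}^\numData (\pi_i - \dataScalar_i)\basisVector(\inputVector_i)$, which is the form most implementations use.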

## Optimization of the Function

Setting the gradient to zero and rearranging to find a stationary point of the function with respect to the parameters $\mappingVector$ does not give a closed-form solution. Optimization has to proceed by numerical methods. Options include the multidimensional variant of Newton’s method or gradient based optimization methods like we used for optimizing matrix factorization for the movie recommender system. We recall from matrix factorization that, for large data, stochastic gradient descent or the Robbins-Monro (Robbins and Monro 1951) optimization procedure worked best for function minimization.
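To make the numerical approach concrete, here is a minimal sketch of batch gradient descent on synthetic data. Note that it is plain batch descent, not the stochastic variant mentioned above, and that it descends the mean rather than the sum of the per-point terms so that the learning rate does not depend on the data set size. The data, the learning rate and the iteration count are assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(f):
    return 1.0 / (1.0 + np.exp(-f))

def fit_gradient_descent(Phi, y, learn_rate=0.5, iters=2000, seed=0):
    """Batch gradient descent on the mean negative log likelihood."""
    rng = np.random.default_rng(seed)
    w = 0.01 * rng.normal(size=Phi.shape[1])   # small random initial weights
    for _ in range(iters):
        grad = Phi.T @ (sigmoid(Phi @ w) - y) / len(y)
        w -= learn_rate * grad
    return w

# Synthetic data (an assumption for illustration): labels drawn from a known model.
rng = np.random.default_rng(1)
x = rng.normal(size=(500, 1))
Phi = np.hstack([np.ones((500, 1)), x])
true_w = np.array([-0.5, 2.0])
y = (rng.uniform(size=500) < sigmoid(Phi @ true_w)).astype(float)

w = fit_gradient_descent(Phi, y)
```

A stochastic version would replace the full sum in `grad` with the contribution of a single data point, or a small batch, at each step.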

## Olivetti Glasses Data

Let’s classify images with logistic regression. We’ll look at a data set of individuals with glasses. The data can be loaded from pods.

We will need to define some initial random values for our parameter vector and then minimize the objective by descending the gradient.

Let’s look at the weights and how they relate to the inputs.

What does the magnitude of the weight vectors tell you about the different parameters and their influence on outcome? Are the weights of roughly the same size? If not, how might you fix this?

## Going Further: Optimization

Other optimization techniques for generalized linear models include Newton’s method, which requires you to compute the Hessian, or second derivative, of the objective function.

Methods that are based on gradients only include L-BFGS and conjugate gradients. Can you find these in Python? Are they suitable for very large data sets?
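As one possible starting point, SciPy’s `scipy.optimize.minimize` offers both a limited-memory BFGS variant (`method="L-BFGS-B"`) and nonlinear conjugate gradients (`method="CG"`). The sketch below applies them to the logistic regression objective on synthetic data; the data and the helper name `objective_and_gradient` are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import minimize

def objective_and_gradient(w, Phi, y):
    """Negative log likelihood and its gradient for logistic regression."""
    f = Phi @ w
    # log(1 + exp(f)) computed stably with logaddexp
    E = np.sum(np.logaddexp(0.0, f) - y * f)
    pi = 1.0 / (1.0 + np.exp(-f))
    return E, Phi.T @ (pi - y)

# Synthetic data (an assumption for illustration).
rng = np.random.default_rng(0)
Phi = np.hstack([np.ones((200, 1)), rng.normal(size=(200, 2))])
generating_w = np.array([0.3, 1.0, -1.0])
y = (rng.uniform(size=200) < 1.0 / (1.0 + np.exp(-Phi @ generating_w))).astype(float)

w0 = np.zeros(3)
results = {
    method: minimize(objective_and_gradient, w0, args=(Phi, y), jac=True, method=method)
    for method in ("L-BFGS-B", "CG")
}
```

Passing `jac=True` tells `minimize` that the function returns the objective and its gradient together. Both methods evaluate the objective over the whole data set at every step, which is worth bearing in mind when thinking about very large data.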

## Other GLMs

We’ve introduced the formalism for generalized linear models. Have a think about how you might model count data using the Poisson distribution and a log link function for the rate, $\lambda(\inputVector)$. If you want a data set you can try `pods.datasets.google_trends()` for some count data.
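One possible way into this exercise is sketched below, under the assumption that the log of the rate is modeled as $\log \lambda(\inputVector_i) = \mappingVector^\top \basisVector(\inputVector_i)$. The negative log likelihood is then, up to terms that do not depend on $\mappingVector$, $\sum_{i=1}^\numData \left(\lambda_i - \dataScalar_i \log \lambda_i\right)$. The synthetic counts, step size and function names are illustrative assumptions, not part of the original notes.

```python
import numpy as np

def poisson_objective(w, Phi, y):
    """Negative log likelihood up to the constant sum of log(y_i!) terms."""
    f = Phi @ w                 # log of the rate, since the link is the logarithm
    return np.sum(np.exp(f) - y * f)

def poisson_gradient(w, Phi, y):
    """Gradient of the objective: sum over (rate_i - y_i) * phi_i."""
    rate = np.exp(Phi @ w)
    return Phi.T @ (rate - y)

# Synthetic count data (an assumption for illustration).
rng = np.random.default_rng(0)
Phi = np.hstack([np.ones((300, 1)), rng.uniform(-1, 1, size=(300, 1))])
y = rng.poisson(np.exp(Phi @ np.array([1.0, 0.5])))

# Batch gradient descent on the mean objective.
w = np.zeros(2)
for _ in range(5000):
    w -= 0.05 * poisson_gradient(w, Phi, y) / len(y)
```

The gradient has the same "prediction minus target, times basis" structure as in the logistic case, which is a general feature of this family of models.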

# References

Robbins, H., and S. Monro. 1951. “A Stochastic Approximation Method.” Annals of Mathematical Statistics 22: 400–407.