import numpy as np
import pandas as pd
import pods
import matplotlib.pyplot as plt
%matplotlib inline
The logit function, $$g^{-1}(\pi_i) = \log\frac{\pi_i}{1-\pi_i},$$ is known as a link function.
For a standard regression we take, $$f(\mathbf{x}_i) = \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_i),$$
We have defined the link function as taking the form $g^{-1}(\cdot)$ implying that the inverse link function is given by $g(\cdot)$. Since we have defined, $$ g^{-1}(\pi(\mathbf{x})) = \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}) $$ we can write $\pi$ in terms of the inverse link function, $g(\cdot)$ as $$ \pi(\mathbf{x}) = g(\mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x})). $$
fig, ax = plt.subplots(figsize=(14,7))
f = np.linspace(-8, 8, 100)  # range of latent function values f_i
g = 1/(1+np.exp(-f))         # logistic sigmoid, the inverse link g(f_i)
ax.plot(f, g, 'r-')
ax.set_title('Logistic Function', fontsize=20)
ax.set_xlabel('$f_i$', fontsize=20)
ax.set_ylabel('$g_i$', fontsize=20)
plt.savefig('./diagrams/logistic.svg')
We can now write $\pi$ as a function of the input and the parameter vector as $$\pi(\mathbf{x},\mathbf{w}) = \frac{1}{1+ \exp\left(-\mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x})\right)}.$$
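A minimal sketch of this composition in code (the basis function `phi` is a placeholder for whatever feature mapping is in use, not something defined in these notes):

def sigmoid(f):
    # inverse link g: map a real-valued score to a probability in (0, 1)
    return 1/(1+np.exp(-f))

def pi(x, w, phi):
    # pi(x, w) = g(w' phi(x)): compose the linear model with the inverse link
    return sigmoid(np.dot(w, phi(x)))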
Compute the output of a standard linear basis function composition ($\mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x})$, as we did for linear regression)
Apply the inverse link function, $g(\mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}))$.
Use this value in a Bernoulli distribution to form the likelihood.
From last time $$P(y_i|\mathbf{w}, \mathbf{x}_i) = \pi_i^{y_i} (1-\pi_i)^{1-y_i}$$
Trick for switching between probabilities
def bernoulli(y, pi):
    # Bernoulli probability: pi if y == 1, otherwise 1 - pi
    if y == 1:
        return pi
    else:
        return 1-pi
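The same trick can be written without the branch, using the exponent form of the likelihood above; this sketch also works elementwise on numpy arrays of labels:

def bernoulli_vec(y, pi):
    # pi**y * (1-pi)**(1-y) selects pi when y = 1 and 1 - pi when y = 0
    return pi**y * (1-pi)**(1-y)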
Probability of positive outcome for the $i$th data point $$\pi_i = g\left(\mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_i)\right),$$ where $g(\cdot)$ is the inverse link function
Objective function of the form \begin{align} E(\mathbf{w}) = & -\sum_{i=1}^n y_i \log g\left(\mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_i)\right) \\ & -\sum_{i=1}^n (1-y_i)\log \left(1-g\left(\mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_i)\right)\right). \end{align}
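A direct transcription of this objective (a sketch for clarity, without safeguards against $\log 0$; the names `Phi`, `y` and `w` are illustrative, with `Phi` a design matrix whose rows are $\boldsymbol{\phi}(\mathbf{x}_i)^\top$):

def objective(w, Phi, y):
    # negative log likelihood E(w) for logistic regression
    g = 1/(1+np.exp(-Phi @ w))  # pi_i = g(w' phi(x_i)) for every data point
    return -np.sum(y*np.log(g) + (1-y)*np.log(1-g))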
Can't find a stationary point of the objective function analytically.
Optimization has to proceed by numerical methods.
Similarly to matrix factorization, for large data stochastic gradient descent (the Robbins-Monro optimization procedure) works well.
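A minimal stochastic gradient descent sketch for this objective (the per-point gradient of $E(\mathbf{w})$ is $\left(g(\mathbf{w}^\top\boldsymbol{\phi}(\mathbf{x}_i)) - y_i\right)\boldsymbol{\phi}(\mathbf{x}_i)$; the learning rate and epoch count here are illustrative, not tuned):

def sgd_logistic(Phi, y, learn_rate=0.01, epochs=10, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(Phi.shape[1])
    for epoch in range(epochs):
        for i in rng.permutation(len(y)):        # visit points in random order
            g_i = 1/(1+np.exp(-Phi[i] @ w))      # pi_i for the current point
            w -= learn_rate*(g_i - y[i])*Phi[i]  # step down the per-point gradient
    return w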
This approach is used in many internet companies.
Example: ad matching for Facebook.
Logistic regression is used in combination with decision trees.
Logistic regression is part of a family known as generalized linear models.
They all take the form $$g^{-1}(f(\mathbf{x}_i)) = \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_i).$$
As another example let's look at Poisson regression.
The Poisson distribution is used for 'count data'. For non-negative integers, $y$, $$P(y) = \frac{\lambda^y}{y!}\exp(-\lambda)$$
Here $\lambda$ is a rate parameter that can be thought of as the number of arrivals per unit time.
Poisson distributions can be used for disease count data, e.g. the number of incidences of malaria in a district.
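As a sanity check, the formula above can be transcribed directly (a sketch; the helper name `poisson_pmf` is mine, and its output should match `scipy.stats.poisson.pmf` as used below):

from scipy.special import factorial

def poisson_pmf(y, lam):
    # direct transcription of P(y) = lambda**y / y! * exp(-lambda)
    return lam**y/factorial(y)*np.exp(-lam)

print(poisson_pmf(3, 1.0))  # same value as poisson.pmf(3, mu=1.) below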
from scipy.stats import poisson
fig, ax = plt.subplots(figsize=(14,7))
y = np.arange(0, 16)         # support of the Poisson: non-negative integers
p1 = poisson.pmf(y, mu=1.)   # scipy calls the rate parameter lambda 'mu'
p3 = poisson.pmf(y, mu=3.)
p10 = poisson.pmf(y, mu=10.)
ax.plot(y, p1, 'r.-', markersize=20, label=r'$\lambda=1$')
ax.plot(y, p3, 'g.-', markersize=20, label=r'$\lambda=3$')
ax.plot(y, p10, 'b.-', markersize=20, label=r'$\lambda=10$')
ax.set_title('Poisson Distribution', fontsize=20)
ax.set_xlabel('$y_i$', fontsize=20)
ax.set_ylabel('$p(y_i)$', fontsize=20)
ax.legend(fontsize=20)
plt.savefig('./diagrams/poisson.svg')
In Poisson regression we make the rate a function of space/time, $$\log \lambda(\mathbf{x}, t) = \mathbf{w}_x^\top \boldsymbol{\phi}_x(\mathbf{x}) + \mathbf{w}_t^\top \boldsymbol{\phi}_t(t).$$
This is known as a log-linear or log-additive model.
The link function is a logarithm.
We can rewrite such a function as $$\log \lambda(\mathbf{x}, t) = f_x(\mathbf{x}) + f_t(t)$$
Be careful though ... a log additive model is really multiplicative. $$\log \lambda(\mathbf{x}, t) = f_x(\mathbf{x}) + f_t(t)$$
Becomes $$\lambda(\mathbf{x}, t) = \exp(f_x(\mathbf{x}) + f_t(t))$$
Which is equivalent to $$\lambda(\mathbf{x}, t) = \exp(f_x(\mathbf{x}))\exp(f_t(t))$$
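A one-line numerical check of this identity (the values standing in for $f_x(\mathbf{x})$ and $f_t(t)$ are arbitrary):

fx, ft = 0.5, -1.2  # arbitrary values standing in for f_x(x) and f_t(t)
print(np.exp(fx + ft), np.exp(fx)*np.exp(ft))  # identical: the sum becomes a product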
Link functions can be deceptive in this way.