import pods
import mlai
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
The Bernoulli distribution over a binary variable $y$ assigns probability $\pi$ to $y=1$ and $1-\pi$ to $y=0$:

| $y$ | 0 | 1 |
|---|---|---|
| $P(y)$ | $(1-\pi)$ | $\pi$ |
def bernoulli(y_i, pi):
    """Evaluate the Bernoulli probability mass of the binary value y_i given parameter pi."""
    if y_i == 1:
        return pi
    else:
        return 1-pi
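As a quick sanity check, we can evaluate the mass the function assigns to each outcome and, assuming independent observations, the likelihood of a small illustrative sample (the values of y below are made up purely for the example):

print(bernoulli(1, 0.3))  # probability of 'success' under pi = 0.3
print(bernoulli(0, 0.3))  # probability of 'failure', 1 - 0.3 = 0.7

# Likelihood of a small, made-up sample of independent draws under pi = 0.3.
y = np.array([1, 0, 0, 1, 0])
print(np.prod([bernoulli(y_i, 0.3) for y_i in y]))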
from matplotlib.patches import Circle

fig, ax = plt.subplots(figsize=(7,7))
ax.plot([0, 0, 1, 1], [1, 0, 0, 1], linewidth=3, color=[0,0,0])
ax.set_axis_off()
ax.set_aspect('equal')
black_prob = 0.3
ball_radius = 0.1
ax.set_xlim([0, 1])
ax.set_ylim([0, 1])
rows = 4
cols = round(1/ball_radius)
last_row_cols = 3
for row in range(rows):
    if row == rows-1:
        cols = last_row_cols
    for col in range(cols):
        # Centre of each ball on a regular grid inside the urn.
        ball_x = col*2*ball_radius + ball_radius
        ball_y = row*2*ball_radius + ball_radius
        # Colour each ball black with probability black_prob, otherwise red.
        if np.random.rand() < black_prob:
            ball_color = [0, 0, 0]
        else:
            ball_color = [1, 0, 0]
        handle = Circle((ball_x, ball_y), ball_radius, fill=True, color=ball_color)
        ax.add_patch(handle)
# Bayes billiard: a black ball is dropped uniformly at random onto the table,
# fixing an unknown probability. Further balls are then dropped and we record
# only whether each lands to the left or right of the black ball.
fig, ax = plt.subplots(figsize=(7,7))
ax.plot([0, 0, 1, 1, 0], [0, 1, 1, 0, 0], linewidth=3, color=[0,0,0])
ax.set_xlim([0, 1])
ax.set_ylim([0, 1])
ax.set_aspect('equal')
ax.set_axis_off()
ball_radius = 0.1

# The black ball's horizontal position sets the unknown probability.
ball_x = np.random.rand()
ball_y = 0.5
ax.add_patch(Circle((ball_x, ball_y), ball_radius, fill=True, color=[0, 0, 0]))

# Mark the dividing line defined by the black ball.
ax.plot([ball_x, ball_x], [0, 1], linestyle=':', linewidth=3, color=[0,0,0])

# Drop seven further balls; each lands to the left of the line with
# probability given by the black ball's position.
for red_x in np.random.rand(7):
    ax.add_patch(Circle((red_x, ball_y), ball_radius, fill=True, color=[1, 0, 0]))
Stationary point: set the derivative of the negative log likelihood to zero, $$0 = -\frac{\sum_{i=1}^{n} y_i}{\pi} + \frac{\sum_{i=1}^n (1-y_i)}{1-\pi},$$
which we can rearrange to give $$(1-\pi)\sum_{i=1}^{n} y_i = \pi\sum_{i=1}^n (1-y_i),$$
giving $$\sum_{i=1}^{n} y_i = \pi\left(\sum_{i=1}^n (1-y_i) + \sum_{i=1}^{n} y_i\right).$$ Since $\sum_{i=1}^n (1-y_i) + \sum_{i=1}^{n} y_i = n$, the maximum likelihood solution is $$\hat{\pi} = \frac{1}{n}\sum_{i=1}^{n} y_i,$$ the proportion of positive labels in the data.
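As a quick numerical check of this result, using synthetic labels generated here purely for illustration, the maximum likelihood estimate is simply the proportion of ones in the sample:

# Synthetic, illustrative data: 1000 independent Bernoulli draws with true pi = 0.3.
true_pi = 0.3
y_synth = (np.random.rand(1000) < true_pi)*1.0
# Maximum likelihood estimate: sum of the labels divided by n, i.e. the sample mean.
pi_ml = y_synth.sum()/y_synth.shape[0]
print(true_pi, pi_ml)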
Compute this distribution using the product and sum rules.
Need the probability associated with all possible combinations of $\mathbf{y}$ and $\mathbf{X}$.
There are $2^n$ possible combinations for the vector $\mathbf{y}$.
The probability of each of these combinations must be jointly specified along with the joint density of the matrix $\mathbf{X}$.
In naive Bayes we make certain simplifying assumptions that allow us to perform all of the above in practice.
Data Conditional Independence
Given the model parameters $\boldsymbol{\theta}$ and $\pi$, we assume the data points are independent, $$ p(\mathbf{y}, \mathbf{X}|\boldsymbol{\theta}, \pi) = \prod_{i=1}^n p(y_i, \mathbf{x}_i|\boldsymbol{\theta}, \pi), $$ and naive Bayes additionally assumes that the features are conditionally independent given the label, $$ p(\mathbf{x}_i|y_i, \boldsymbol{\theta}) = \prod_{j=1}^p p(x_{i,j}|y_i, \boldsymbol{\theta}). $$ Computing the posterior distribution in this case becomes easier; this is known as the 'Bayes classifier'.
This allows us to write down the full joint density of the training data, $$ p(\mathbf{y}, \mathbf{X}|\boldsymbol{\theta}, \pi) = \prod_{i=1}^n \prod_{j=1}^p p(x_{i,j}|y_i, \boldsymbol{\theta})p(y_i|\pi) $$ which can now be fit by maximum likelihood.
As usual, we form our objective as the negative log likelihood, $$ E(\boldsymbol{\theta}, \pi) = -\log p(\mathbf{y}, \mathbf{X}|\boldsymbol{\theta}, \pi) = -\sum_{i=1}^n \sum_{j=1}^p \log p(x_{i, j}|y_i, \boldsymbol{\theta}) - \sum_{i=1}^n \log p(y_i|\pi), $$ which decomposes into two objective functions, one depending on $\pi$ alone and one depending on $\boldsymbol{\theta}$ alone, so we have $$ E(\pi, \boldsymbol{\theta}) = E(\boldsymbol{\theta}) + E(\pi). $$
Minimize the conditional distribution term: $$ E(\boldsymbol{\theta}) = -\sum_{i=1}^n \sum_{j=1}^p \log p(x_{i, j} |y_i, \boldsymbol{\theta}), $$
which implies making an assumption about its form, for example a Gaussian density for continuous features or a Bernoulli distribution for binary features.
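As a minimal sketch of what this fitting step can look like, suppose (an assumption for illustration, not part of the text above) that each class conditional density is Gaussian, so that $\boldsymbol{\theta}$ contains a per-class, per-feature mean and variance. The hypothetical helper below fits these by maximum likelihood; because of the factorization of $E(\boldsymbol{\theta})$, each feature within each class is fitted independently:

# Sketch only: maximum likelihood fit of naive Bayes with Gaussian class
# conditionals. X is an n x p numpy array of features, y an n-vector of
# binary labels; both names are illustrative.
def fit_naive_bayes(X, y):
    theta = {}
    for label in (0, 1):
        X_label = X[y == label]
        # Each feature is fitted independently within the class: the
        # factorized objective E(theta) separates over (class, feature) pairs.
        theta[label] = {'mean': X_label.mean(axis=0),
                        'variance': X_label.var(axis=0)}
    # pi is fitted by the Bernoulli maximum likelihood solution derived above.
    pi = y.mean()
    return theta, pi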
Naive Bayes has given us the class conditional densities: $p(\mathbf{x}_i | y_i, \boldsymbol{\theta})$. To make predictions with these densities we need to form the distribution given by $$ P(y^*| \mathbf{y}, \mathbf{X}, \mathbf{x}^*, \boldsymbol{\theta}) $$
From the product rule this can be written as $$ P(y^*| \mathbf{y}, \mathbf{X}, \mathbf{x}^*, \boldsymbol{\theta}) = \frac{p(y^*, \mathbf{y}, \mathbf{X}, \mathbf{x}^*| \boldsymbol{\theta})}{p(\mathbf{y}, \mathbf{X}, \mathbf{x}^*|\boldsymbol{\theta})}. $$

Naive Bayes makes very simple assumptions about the data: it models the full joint probability of the data set, $p(\mathbf{y}, \mathbf{X} | \boldsymbol{\theta}, \pi)$, through very strong factorization assumptions that are unlikely to be true in practice. The data conditional independence assumption is common, and relies on a rich parameter vector to absorb all the information in the training data. The additional assumption of naive Bayes is that the features are conditionally independent given the class label $y_i$ (and the parameter vector, $\boldsymbol{\theta}$). This is quite a strong assumption. However, it causes the objective function to decompose into parts which can be independently fitted to the different features, meaning it is very easy to fit the model to large data sets.

It is also clear how we should handle streaming data and missing data, so the model can be run 'live', adapting its parameters as new information arrives. Indeed, the model is even capable of dealing with new features that might arrive at run time. Such is the strength of modeling the joint probability density. However, the factorization assumption that allows us to do this efficiently is very strong and may lead to poor decision boundaries in practice.
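To make the prediction step concrete, here is a matching sketch, continuing the hypothetical Gaussian class conditional model and the fit_naive_bayes helper above, that evaluates the joint density for both settings of $y^*$ and normalizes to obtain the posterior probability of the positive class:

# Sketch only: posterior probability that y* = 1 for a test point x_star,
# using the hypothetical Gaussian class conditional parameters fitted above.
def gaussian_density(x, mean, variance):
    return np.exp(-0.5*(x - mean)**2/variance)/np.sqrt(2*np.pi*variance)

def predict_naive_bayes(x_star, theta, pi):
    prior = {0: 1 - pi, 1: pi}
    joint = {}
    for label in (0, 1):
        # Joint density of the test point and label: product over the
        # feature-wise class conditionals times the prior over the label.
        joint[label] = prior[label]*np.prod(
            gaussian_density(x_star, theta[label]['mean'], theta[label]['variance']))
    # Normalize by the sum over both labels (the marginal of x_star).
    return joint[1]/(joint[0] + joint[1])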