MLAI Week 6: Bayesian Regression

Neil D. Lawrence

3rd November 2015

In [31]:
import pods
import mlai
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

Two Simultaneous Equations

A system of two simultaneous equations with two unknowns can be solved exactly for the slope $m$ and the intercept $c$.

How do we deal with three simultaneous equations with only two unknowns?

$$\begin{aligned} y_1 = & mx_1 + c\\ y_2 = & mx_2 + c \end{aligned}$$

$$\begin{aligned} y_1-y_2 = & m(x_1 - x_2) \end{aligned}$$

$$\begin{aligned} \frac{y_1-y_2}{x_1 - x_2} = & m \end{aligned}$$

$$\begin{aligned} m & =\frac{y_2-y_1}{x_2 - x_1}\\ c & = y_1 - m x_1 \end{aligned}$$

$$\begin{aligned} y_1 = & mx_1 + c\\ y_2 = & mx_2 + c\\ y_3 = & mx_3 + c \end{aligned}$$
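
As a quick numerical check of the two-equation solution above (the particular values of $x_1, y_1, x_2, y_2$ below are purely illustrative and separate from the notebook's own variables):

# check that m = (y2 - y1)/(x2 - x1) and c = y1 - m*x1 reproduce both observations
x1, y1 = 1., 3.
x2, y2 = 2., 5.
m = (y2 - y1)/(x2 - x1)
c = y1 - m*x1
print(m*x1 + c - y1, m*x2 + c - y2)  # both residuals should be zero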
In [32]:
x = 1.
y = 3.
fig, ax = plt.subplots(figsize=(7, 7))
ax.plot(x, y, 'o', markersize=10, linewidth=3, color=[1., 0., 0.])
ax.set_xticks([0, 1, 2, 3])
ax.set_yticks([0, 1, 2, 3, 4, 5])
ylim = [0, 5]
xlim = [0, 3]
ax.set_ylim(ylim)
ax.set_xlim(xlim)
ax.set_xlabel('$x$', fontsize=20)
ax.set_ylabel('$y$', fontsize=20)
fig.savefig('./diagrams/one_point0.svg')

xvals = np.linspace(0, 3, 2)[:, None]
count=0
for i in range(100):
    # sample an intercept c ~ N(0, 4) and compute the slope through the point (x, y)
    c = np.random.normal(size=(1,1))*2
    m = (y - c)/x
    yvals = m*xvals+c
    ax.plot(xvals, yvals, '-', linewidth=2, color=[0., 0., 1.])
    if i < 9 or i == 99:  # save the first nine lines and the final, fully populated plot
        count += 1
        fig.savefig('./diagrams/one_point' + str(count) + '.svg')

Underdetermined System

  • What about two unknowns and one observation? $$y_1 = mx_1 + c$$
In [50]:
pods.notebook.display_plots('one_point{samp}.svg', directory='./diagrams', samp=(0, 10))

Underdetermined System

  • Can compute $m$ given $c$. $$m = \frac{y_1 -c}{x_1}$$

Underdetermined System

  • Can compute $m$ given $c$.

Assume $$c \sim \mathcal{N}(0, 4),$$ then sample values of $c$ and compute the corresponding $m$ (as in the plots above).

Overdetermined System

  • With two unknowns and two observations: $$\begin{aligned} y_1 = & mx_1 + c\\ y_2 = & mx_2 + c \end{aligned}$$
  • Additional observation leads to overdetermined system. $$y_3 = mx_3 + c$$

  • This problem is solved through a noise model $\epsilon \sim \mathcal{N}(0,\sigma^2)$ $$\begin{aligned} y_1 = & mx_1 + c + \epsilon_1\\ y_2 = & mx_2 + c + \epsilon_2\\ y_3 = & mx_3 + c + \epsilon_3 \end{aligned}$$

Noise Models

  • We aren’t modelling the entire system.

  • Noise model gives mismatch between model and data.

  • Gaussian model justified by appeal to central limit theorem.

  • Other models also possible (Student-$t$ for heavy tails).

  • Maximum likelihood with Gaussian noise leads to least squares.
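
To see why the last point holds, write down the negative log likelihood of the Gaussian noise model: $$-\log p(\mathbf{y}|\mathbf{x}, m, c, \sigma^2) = \frac{n}{2}\log\left(2\pi\sigma^2\right) + \frac{1}{2\sigma^2}\sum_{i=1}^n\left(y_i - mx_i - c\right)^2,$$ so for fixed $\sigma^2$, maximising the likelihood over $m$ and $c$ is equivalent to minimising the sum of squared errors.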

Different Types of Uncertainty

  • The first type of uncertainty we are assuming is aleatoric uncertainty.

  • The second type of uncertainty we are assuming is epistemic uncertainty.

Aleatoric Uncertainty

  • This is uncertainty we couldn’t resolve even if we wanted to, e.g. the result of a football match before it’s played.

  • Where a sheet of paper might land on the floor.

Epistemic Uncertainty

  • This is uncertainty we could in principle know the answer to, we just haven’t observed it yet, e.g. the result of a football match after it’s played.

  • What colour socks your lecturer is wearing.

Reading

  • @Bishop:book06 Section 1.2.3 (pg 21–24).

  • @Bishop:book06 Section 1.2.6 (start from just past eq 1.64 pg 30-32).

  • @Rogers:book11 use an example of a coin toss to introduce Bayesian inference in Chapter 3, Sections 3.1–3.4 (pg 95–117), although you also need the beta density, which we haven’t yet discussed. This is also the example that @Laplace:memoire74 used.

Prior Distribution

  • Bayesian inference requires a prior on the parameters.

  • The prior represents your belief, before you see the data, about the likely value of the parameters.

  • For linear regression, consider a Gaussian prior on the intercept: $$c \sim \mathcal{N}(0, \alpha_1)$$

Posterior Distribution

  • The posterior distribution is found by combining the prior with the likelihood.

  • The posterior distribution is your belief, after you see the data, about the likely value of the parameters.

  • The posterior is found through Bayes’ Rule $$p(c|y) = \frac{p(y|c)p(c)}{p(y)}$$

Bayes Update

In [34]:
fig, ax = plt.subplots(figsize=(7,7))
num_points = 1000
x_max = 4
x_min = -3

y = np.array([[1.]])
prior_mean = np.array([[0.]])
prior_var = np.array([[.1]])

noise = mlai.Gaussian(offset=np.array([0.6]), scale=np.array(np.sqrt(0.05)))


f = np.linspace(x_min, x_max, num_points)[:, None]
ln_prior_curve = -0.5*(np.log(2*np.pi*prior_var) + (f-prior_mean)*(f-prior_mean)/prior_var)
ln_likelihood_curve = np.zeros(ln_prior_curve.shape)
for i in range(num_points):
    ln_likelihood_curve[i] = noise.log_likelihood(f[i][None, :], 
                                                  np.array([[np.finfo(float).eps]]), 
                                                  y)
ln_marginal_likelihood = noise.log_likelihood(prior_mean, prior_var, y);

prior_curve = np.exp(ln_prior_curve) 
likelihood_curve = np.exp(ln_likelihood_curve)
marginal_curve = np.exp(ln_marginal_likelihood)

ln_posterior_curve = ln_likelihood_curve + ln_prior_curve - ln_marginal_likelihood
posterior_curve = np.exp(ln_posterior_curve)

g, dlnZ_dvs = noise.grad_vals(prior_mean, prior_var, y)

nu = g*g - 2*dlnZ_dvs

approx_var = prior_var - prior_var*prior_var*nu
approx_mean = prior_mean + prior_var*g

ln_approx_curve = -0.5*np.log(2*np.pi*approx_var)-0.5*(f-approx_mean)*(f-approx_mean)/approx_var

approx_curve = np.exp(ln_approx_curve)
xlim = [x_min, x_max]
ylim = [0, np.vstack([approx_curve, likelihood_curve, 
                      posterior_curve, prior_curve]).max()*1.1]

fig, ax = plt.subplots(figsize=(7,7))

ax.set_xlim(xlim)
ax.set_ylim(ylim)
ax.set_yticks([0, 1, 2, 3, 4, 5])

ax.vlines(xlim[0], ylim[0], ylim[1], color=[0., 0., 0.]) 
ax.hlines(ylim[0], xlim[0], xlim[1], color=[0., 0., 0.]) 

ax.plot(f, prior_curve, color=[1, 0., 0.], linewidth=3)
ax.text(3.5, 2, '$p(c) = \mathcal{N}(c|0, \\alpha_1)$', horizontalalignment='center') 
plt.savefig('./diagrams/dem_gaussian1.svg')

ax.plot(f, likelihood_curve, color=[0, 0, 1], linewidth=3)
ax.text(3.5, 1.5,'$p(y|m, c, x, \\sigma^2)=\mathcal{N}(y|mx+c,\\sigma^2)$', horizontalalignment='center') 
plt.savefig('./diagrams/dem_gaussian2.svg')

ax.plot(f, posterior_curve, color=[1, 0, 1], linewidth=3)
ax.text(3.5, 1, '$p(c|y, m, x, \\sigma^2)=$', horizontalalignment='center') 
ax.text(3.5, 0.75, '$\mathcal{N}\\left(c|\\frac{y-mx}{1+\\sigma^2/\\alpha_1},(\\sigma^{-2}+\\alpha_1^{-1})^{-1}\\right)$', horizontalalignment='center') 
plt.savefig('./diagrams/dem_gaussian3.svg')
In [51]:
pods.notebook.display_plots('dem_gaussian{stage}.svg', './diagrams/', stage=(1, 3))

Stages to Derivation of the Posterior

  • Multiply likelihood by prior

    • both are "exponentiated quadratics", so their product is also an exponentiated quadratic, because $$\exp(a^2)\exp(b^2) = \exp(a^2 + b^2)$$
  • Complete the square to get the resulting density in the form of a Gaussian.

  • Recognise the mean and (co)variance of the Gaussian. This is the estimate of the posterior.

Main Trick

$$p(c) = \frac{1}{\sqrt{2\pi\alpha_1}} \exp\left(-\frac{1}{2\alpha_1}c^2\right)$$

$$p(\mathbf{y}|\mathbf{x}, c, m, \sigma^2) = \frac{1}{\left(2\pi\sigma^2\right)^{\frac{n}{2}}} \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n(y_i - mx_i - c)^2\right)$$

$$p(c| \mathbf{y}, \mathbf{x}, m, \sigma^2) = \frac{p(\mathbf{y}|\mathbf{x}, c, m, \sigma^2)p(c)}{p(\mathbf{y}|\mathbf{x}, m, \sigma^2)}$$

$$p(c| \mathbf{y}, \mathbf{x}, m, \sigma^2) = \frac{p(\mathbf{y}|\mathbf{x}, c, m, \sigma^2)p(c)}{\int p(\mathbf{y}|\mathbf{x}, c, m, \sigma^2)p(c) \text{d} c}$$

$$p(c| \mathbf{y}, \mathbf{x}, m, \sigma^2) \propto p(\mathbf{y}|\mathbf{x}, c, m, \sigma^2)p(c)$$

$$\begin{aligned} \log p(c | \mathbf{y}, \mathbf{x}, m, \sigma^2) =&-\frac{1}{2\sigma^2} \sum_{i=1}^n(y_i-c - mx_i)^2-\frac{1}{2\alpha_1} c^2 + \text{const}\\ = &-\frac{1}{2\sigma^2}\sum_{i=1}^n(y_i-mx_i)^2 -\left(\frac{n}{2\sigma^2} + \frac{1}{2\alpha_1}\right)c^2\\ & + c\frac{\sum_{i=1}^n(y_i-mx_i)}{\sigma^2}, \end{aligned}$$

complete the square of the quadratic form to obtain $$\log p(c | \mathbf{y}, \mathbf{x}, m, \sigma^2) = -\frac{1}{2\tau^2}(c - \mu)^2 +\text{const},$$ where $\tau^2 = \left(n\sigma^{-2} +\alpha_1^{-1}\right)^{-1}$ and $\mu = \frac{\tau^2}{\sigma^2} \sum_{i=1}^n(y_i-mx_i)$.
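
A minimal numerical check of this result (the values of $m$, $\sigma^2$, $\alpha_1$ and the tiny data set below are purely illustrative):

import numpy as np

# illustrative slope, noise variance, prior variance and data
m, sigma2, alpha1 = 0.5, 0.1, 4.0
x_check = np.array([0., 1., 2.])
y_check = np.array([1.1, 1.4, 2.2])
n = x_check.shape[0]

# closed-form posterior for c from completing the square
tau2 = 1./(n/sigma2 + 1./alpha1)
mu = tau2/sigma2*np.sum(y_check - m*x_check)

# brute-force check: evaluate the unnormalised log posterior on a grid of c values
c_grid = np.linspace(-3., 3., 2001)
dc = c_grid[1] - c_grid[0]
log_post = (-0.5/sigma2*((y_check[None, :] - m*x_check[None, :] - c_grid[:, None])**2).sum(axis=1)
            - 0.5/alpha1*c_grid**2)
post = np.exp(log_post - log_post.max())
post /= post.sum()*dc
grid_mean = (c_grid*post).sum()*dc
grid_var = ((c_grid - grid_mean)**2*post).sum()*dc
print(mu, grid_mean)    # means should agree closely
print(tau2, grid_var)   # variances should agree closely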

The Joint Density

  • Really want to know the joint posterior density over the parameters $c$ and $m$.

  • Could now integrate out over $m$, but it’s easier to consider the multivariate case.

Height and Weight Models

In [37]:
def plot_height(ax, h, ph):
    ax.plot(h, ph, '-', color=[1, 0, 0], linewidth=3)
    ax.set_xticks([1.25, 1.7, 2.15])
    ax.set_yticks([1, 2, 3])
    ax.set_xlabel('$h/m$', fontsize=20)
    ax.set_ylabel('$p(h)$', fontsize=20)

    ylim = ax.get_ylim()
    xlim = ax.get_xlim()
    ax.vlines(xlim[0], ylim[0], ylim[1], color='k')
    ax.hlines(ylim[0], xlim[0], xlim[1], color='k')

def plot_weight(ax, w, pw):
    ax.plot(w, pw, '-', color=[0, 0, 1.], linewidth=3)
    ax.set_xticks([55, 75, 95])
    ax.set_yticks([0.02, 0.04, 0.06])
    ax.set_xlabel('$w/kg$', fontsize=20)
    ax.set_ylabel('$p(w)$', fontsize=20)

    ylim = ax.get_ylim()
    xlim = ax.get_xlim()
    ax.vlines(xlim[0], ylim[0], ylim[1], color='k')
    ax.hlines(ylim[0], xlim[0], xlim[1], color='k')

fig, ax = plt.subplots(1, 2, figsize=(10,5))

muh = 1.7
varh = 0.0225
muw = 75
varw = 36
tau = 2*np.pi
h = np.linspace(1.25, 2.15, 100)[:, None]
ph = 1/np.sqrt(tau*varh)*np.exp(-1/(2*varh)*(h - muh)**2)
plot_height(ax[0], h, ph)

w = np.linspace(55, 95, 100)[:, None]
pw = 1/np.sqrt(tau*varw)*np.exp(-1/(2*varw)*(w - muw)**2)
plot_weight(ax[1], w, pw)
#ax.set_box('off')
fig.savefig('./diagrams/height_weight_gaussian.svg')

Sampling Two Dimensional Variables

In [38]:
num_samps = 20

fig, axs = plt.subplots(2, 4, figsize=(10, 5))
for a in axs.flatten():
    a.set_axis_off()
ax=[]
ax.append(plt.subplot2grid((2,4), (0,0), colspan=2, rowspan=2))
ax.append(plt.subplot2grid((2,4), (0,3)))
ax.append(plt.subplot2grid((2,4), (1,3)))

ax[0].plot(muh, muw, 'x', color=[1., 0., 1.], markersize=5., linewidth=3)
theta = np.linspace(0, tau, 100)
xel = np.sin(theta)*np.sqrt(varh) + muh
yel = np.cos(theta)*np.sqrt(varw) + muw
ax[0].plot(xel, yel, '-', color=[1., 0., 1.], linewidth=3)
ax[0].set_xlim([h.min(), h.max()])
ax[0].set_ylim([w.min()+10, w.max()-10])
ax[0].set_yticks([65, 75, 85])
ax[0].set_xticks([1.25, 1.7, 2.15])
ax[0].set_xlabel('$h/m$', fontsize=20)
ax[0].set_ylabel('$w/kg$', fontsize=20)

ylim = ax[0].get_ylim()
xlim = ax[0].get_xlim()
ax[0].vlines(xlim[0], ylim[0], ylim[1], color=[0.,0.,0.])
ax[0].hlines(ylim[0], xlim[0], xlim[1], color=[0., 0., 0.])

plot_height(ax[1], h, ph)
plot_weight(ax[2], w, pw)
count = 0
for i in range(num_samps):
    hval = np.random.normal(size=(1,1))*np.sqrt(varh) + muh
    wval = np.random.normal(size=(1,1))*np.sqrt(varw) + muw
    a1 = ax[1].plot(hval, 0.1, marker='o', linewidth=3, color=[1., 0., 0.])
    plt.savefig('./diagrams/independent_height_weight' + str(count) + '.png')
    count+=1
    a2 = ax[2].plot(wval, 0.002, marker='o', linewidth=3, color=[1., 0., 0.])
    plt.savefig('./diagrams/independent_height_weight' + str(count) + '.png')
    count+=1
    a0 = ax[0].plot(hval, wval, marker='o', linewidth=3, color=[1., 0., 0.])
    plt.savefig('./diagrams/independent_height_weight' + str(count) + '.png')
    count+=1
    
    a0[0].set(color=[0.,0.,0.])
    a1[0].set(color=[0.,0.,0.])
    a2[0].set(color=[0.,0.,0.])
    plt.savefig('./diagrams/independent_height_weight' + str(count) + '.png')
    count+=1
In [52]:
pods.notebook.display_plots('independent_height_weight{fig}.png', './diagrams/', fig=(0, 79))

Independence Assumption

  • This assumes height and weight are independent. $$ p(h, w) = p(h)p(w) $$
  • In reality they are dependent: consider body mass index $= \frac{w}{h^2}$.

Sampling Two Dimensional Variables

In [40]:
fig, axs = plt.subplots(2, 4, figsize=(10, 5))
for a in axs.flatten():
    a.set_axis_off()
ax=[]
ax.append(plt.subplot2grid((2,4), (0,0), colspan=2, rowspan=2))
ax.append(plt.subplot2grid((2,4), (0,3)))
ax.append(plt.subplot2grid((2,4), (1,3)))

covMat = np.asarray([[1, 0.995], [0.995, 1]])
fact = np.asarray([[np.sqrt(varh), 0], [0, np.sqrt(varw)]])
covMat = np.dot(np.dot(fact,covMat), fact)
_, R = np.linalg.eig(covMat)

ax[0].plot(muh, muw, 'x', color=[1., 0., 1.], markersize=5, linewidth=3)
theta = np.linspace(0, tau, 100)
xel = np.sin(theta)*np.sqrt(varh)
yel = np.cos(theta)*np.sqrt(varw)
vals = np.dot(R,np.vstack([xel, yel]))
ax[0].plot(vals[0, :]+muh, vals[1, :]+muw, '-', color=[1., 0., 1.], linewidth=3)
ax[0].set_xlim([h.min(), h.max()])
ax[0].set_ylim([w.min()+10, w.max()-10])
ax[0].set_yticks([65, 75, 85])
ax[0].set_xticks([1.25, 1.7, 2.15])
ax[0].set_xlabel('$h/m$', fontsize=20)
ax[0].set_ylabel('$w/kg$', fontsize=20)

plot_height(ax[1], h, ph)
plot_weight(ax[2], w, pw)
count = 0
for i in range(num_samps):
    vec_s = np.dot(np.dot(R,fact),np.random.normal(size=(2,1)))
    hval = vec_s[0] + muh
    wval = vec_s[1] + muw
    a1 = ax[1].plot(hval, 0.1, marker='o', linewidth=3, color=[1., 0., 0.])
    plt.savefig('./diagrams/correlated_height_weight' + str(count) + '.png')
    count+=1
    a2 = ax[2].plot(wval, 0.002, marker='o', linewidth=3, color=[1., 0., 0.])
    plt.savefig('./diagrams/correlated_height_weight' + str(count) + '.png')
    count+=1
    a0 = ax[0].plot(hval, wval, marker='o', linewidth=3, color=[1., 0., 0.])
    plt.savefig('./diagrams/correlated_height_weight' + str(count) + '.png')
    count+=1
    
    a0[0].set(color=[0.,0.,0.])
    a1[0].set(color=[0.,0.,0.])
    a2[0].set(color=[0.,0.,0.])
    plt.savefig('./diagrams/correlated_height_weight' + str(count) + '.png')
    count+=1
In [41]:
pods.notebook.display_plots('correlated_height_weight{fig}.png', './diagrams/', fig=(0, 79))

Independent Gaussians

$$p(w, h) = p(w)p(h)$$

$$p(w, h) = \frac{1}{\sqrt{2\pi \sigma_1^2}\sqrt{2\pi\sigma_2^2}} \exp\left(-\frac{1}{2}\left(\frac{(w-\mu_1)^2}{\sigma_1^2} + \frac{(h-\mu_2)^2}{\sigma_2^2}\right)\right)$$

Independent Gaussians

$$p(w, h) = \frac{1}{\sqrt{2\pi\sigma_1^2 2\pi\sigma_2^2}} \exp\left(-\frac{1}{2}\left(\begin{bmatrix}w \\ h\end{bmatrix} - \begin{bmatrix}\mu_1 \\ \mu_2\end{bmatrix}\right)^\top\begin{bmatrix}\sigma_1^2& 0\\0&\sigma_2^2\end{bmatrix}^{-1}\left(\begin{bmatrix}w \\ h\end{bmatrix} - \begin{bmatrix}\mu_1 \\ \mu_2\end{bmatrix}\right)\right)$$

Independent Gaussians

$$p(\mathbf{y}) = \frac{1}{\left|2\pi \mathbf{D}\right|^{\frac{1}{2}}} \exp\left(-\frac{1}{2}(\mathbf{y} - \boldsymbol{\mu})^\top\mathbf{D}^{-1}(\mathbf{y} - \boldsymbol{\mu})\right)$$

Correlated Gaussian

Form the correlated density from the original by rotating the data space using a matrix $\mathbf{R}$.

$$p(\mathbf{y}) = \frac{1}{\left|2\pi\mathbf{D}\right|^{\frac{1}{2}}} \exp\left(-\frac{1}{2}(\mathbf{y} - \boldsymbol{\mu})^\top\mathbf{D}^{-1}(\mathbf{y} - \boldsymbol{\mu})\right)$$

Correlated Gaussian

$$p(\mathbf{y}) = \frac{1}{\left|2\pi\mathbf{D}\right|^{\frac{1}{2}}} \exp\left(-\frac{1}{2}(\mathbf{R}^\top\mathbf{y} - \mathbf{R}^\top\boldsymbol{\mu})^\top\mathbf{D}^{-1}(\mathbf{R}^\top\mathbf{y} - \mathbf{R}^\top\boldsymbol{\mu})\right)$$

Correlated Gaussian

$$p(\mathbf{y}) = \frac{1}{\left|2\pi\mathbf{D}\right|^{\frac{1}{2}}} \exp\left(-\frac{1}{2}(\mathbf{y} - \boldsymbol{\mu})^\top\mathbf{R}\mathbf{D}^{-1}\mathbf{R}^\top(\mathbf{y} - \boldsymbol{\mu})\right)$$

this gives an inverse covariance (precision) matrix: $$\mathbf{C}^{-1} = \mathbf{R} \mathbf{D}^{-1} \mathbf{R}^\top$$

Correlated Gaussian

$$p(\mathbf{y}) = \frac{1}{\left|{2\pi\mathbf{C}}\right|^{\frac{1}{2}}} \exp\left(-\frac{1}{2}(\mathbf{y} - \boldsymbol{\mu})^\top\mathbf{C}^{-1} (\mathbf{y} - \boldsymbol{\mu})\right)$$

this gives a covariance matrix: $$\mathbf{C} = \mathbf{R} \mathbf{D} \mathbf{R}^\top$$
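
A short sketch of this construction, using the diagonal variances from the height/weight example above and an arbitrary rotation matrix $\mathbf{R}$ (the rotation angle is purely illustrative):

import numpy as np

# diagonal variances (height, weight) and an illustrative rotation matrix R
D = np.diag([0.0225, 36.])
theta_rot = 0.3
R = np.array([[np.cos(theta_rot), -np.sin(theta_rot)],
              [np.sin(theta_rot),  np.cos(theta_rot)]])

# covariance of the correlated Gaussian: C = R D R^T
C = np.dot(np.dot(R, D), R.T)

# sample by rotating independent samples: y = mu + R D^{1/2} z with z ~ N(0, I)
mu = np.array([[1.7], [75.]])
z = np.random.normal(size=(2, 10000))
samples = mu + np.dot(np.dot(R, np.sqrt(D)), z)
print(np.cov(samples))  # empirical covariance should be close to C
print(C)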

Reading

  • Section 2.3 of @Bishop:book06 up to top of pg 85 (multivariate Gaussians).

  • Section 3.3 of @Bishop:book06 up to 159 (pg 152–159).

Revisit Olympics Data

  • Use the Bayesian approach on the Olympic marathon data with polynomials.

  • Choose a prior $\mathbf{w} \sim \mathcal{N}(\mathbf{0},\alpha \mathbf{I})$ with $\alpha = 1$.

  • Choose noise variance $\sigma^2 = 0.01$

Sampling the Prior

  • Always useful to perform a ‘sanity check’ and sample from the prior before observing the data.

  • Since $\mathbf{y} = \boldsymbol{\Phi} \mathbf{w} + \boldsymbol{\epsilon}$ we just need to sample $$\mathbf{w} \sim \mathcal{N}(\mathbf{0},\alpha\mathbf{I})$$ $$\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0},\sigma^2\mathbf{I})$$ with $\alpha=1$ and $\sigma^2 = 0.01$ (see the sketch below).
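
Below is a minimal sketch of sampling from the prior. It builds a polynomial design matrix directly with numpy over an arbitrary input range and degree, rather than through the mlai basis helper used later, whose exact interface isn't shown here.

import numpy as np
import matplotlib.pyplot as plt

alpha = 1.
sigma2 = 0.01
degree = 4  # illustrative polynomial degree

x_pred = np.linspace(-1., 1., 100)[:, None]
Phi = np.hstack([x_pred**i for i in range(degree + 1)])  # polynomial design matrix

fig, ax = plt.subplots(figsize=(7, 7))
for i in range(10):
    w = np.random.normal(size=(Phi.shape[1], 1))*np.sqrt(alpha)  # w ~ N(0, alpha I)
    f_samp = np.dot(Phi, w)
    y_samp = f_samp + np.random.normal(size=f_samp.shape)*np.sqrt(sigma2)  # add noise
    ax.plot(x_pred, y_samp, '-', linewidth=2)
ax.set_xlabel('$x$', fontsize=20)
ax.set_ylabel('$y$', fontsize=20)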

In [42]:
import mlai
import pods
import numpy as np
import matplotlib.pyplot as plt

basis = mlai.polynomial

data = pods.datasets.olympic_marathon_men()
x = data['X']
y = data['Y']
num_data = x.shape[0]

data_limits = [1892, 2020]

max_basis = y.shape[0]
In [43]:
f, ax = plt.subplots(1, 2, figsize=(10,5))

ll = np.array([np.nan]*(max_basis))
sum_squares = np.array([np.nan]*(max_basis))

for num_basis in range(1,max_basis+1):
    
    model= mlai.BLM(x, y, alpha=1, sigma2=0.04, 
                    basis=basis, num_basis=num_basis, data_limits=data_limits)
    model.fit()
    sum_squares[num_basis-1] = model.objective()/num_data 
    ll[num_basis-1] = model.log_likelihood()
    mlai.plot_marathon_fit(model=model, data_limits=data_limits, 
                           objective=-ll, objective_ylim=[17, 40], 
                           title='Marginal Likelihood',
                           fig=f, ax=ax)

Olympic Data with Bayesian Polynomials

In [44]:
pods.notebook.display_plots('olympic_BLM_polynomial{num_basis}.svg', 
                            directory='./diagrams', num_basis=(1, max_basis))
In [45]:
f, ax = plt.subplots(1, 2, figsize=(10,5))

# Set up the validation set: train on the first val_start points, hold out the rest
val_start = 20
x = data['X'][:val_start, :]
x_val = data['X'][val_start:, :]
y = data['Y'][:val_start, :]
y_val = data['Y'][val_start:, :]
num_val_data = x_val.shape[0]

ll = np.array([np.nan]*(max_basis))
ss = np.array([np.nan]*(max_basis))
ss_val = np.array([np.nan]*(max_basis))
for num_basis in range(1,max_basis+1):
    
    model= mlai.BLM(x, y, basis=basis, num_basis=num_basis, alpha=1, sigma2=0.04, data_limits=data_limits)
    model.fit()
    ss[num_basis-1] = model.objective()
    f_val, _ = model.predict(x_val)
    ss_val[num_basis-1] = ((y_val-f_val)**2).mean() 
    ll[num_basis-1] = model.log_likelihood()
    mlai.plot_marathon_fit(model=model, data_limits=data_limits, 
                           objective=np.sqrt(ss_val), objective_ylim=[0.1, 0.6],
                           fig=f, ax=ax, prefix='olympic_val',
                           title='Hold Out Validation',
                           x_val=x_val, y_val=y_val)

Hold Out Validation

In [46]:
import pods
pods.notebook.display_plots('olympic_val_BLM_polynomial{num_basis}.svg', 
                            directory='./diagrams', num_basis=(1, 27))
In [47]:
f, ax = plt.subplots(1, 2, figsize=(10,5))

num_parts = 5
partitions = []
ind = list(np.random.permutation(num_data))
start = 0
for part in range(num_parts):
    end = round((float(num_data)/num_parts)*(part+1))
    train_ind = ind[:start]
    train_ind.extend(ind[end:])
    val_ind = ind[start:end]
    partitions.append((train_ind, val_ind))
    start = end   

ll = np.array([np.nan]*(max_basis))
ss = np.array([np.nan]*(max_basis))
ss_val = np.array([np.nan]*(max_basis))
for num_basis in range(1,max_basis+1):
    ss_val_temp = 0.
    for part, (train_ind, val_ind) in enumerate(partitions):
        x = data['X'][train_ind, :]
        x_val = data['X'][val_ind, :]
        y = data['Y'][train_ind, :]
        y_val = data['Y'][val_ind, :]
        num_val_data = x_val.shape[0]

        model= mlai.BLM(x, y, alpha=1, sigma2=0.04, basis=basis, num_basis=num_basis, data_limits=data_limits)
        model.fit()
        ss[num_basis-1] = model.objective()
        f_val, _ = model.predict(x_val)
        ss_val_temp += ((y_val-f_val)**2).mean() 
        mlai.plot_marathon_fit(model=model, data_limits=data_limits, 
                                objective=np.sqrt(ss_val), objective_ylim=[0.2,0.6],
                                fig=f, ax=ax, prefix='olympic_' + str(num_parts) + 'cv' + str(part) + '_inter',
                                title='5-fold Cross Validation',
                                x_val=x_val, y_val=y_val)
    ss_val[num_basis-1] = ss_val_temp/(num_parts)
    ax[1].cla()
    mlai.plot_marathon_fit(model=model, data_limits=data_limits, 
                            objective=np.sqrt(ss_val), objective_ylim=[0.2,0.6],
                            fig=f, ax=ax, prefix='olympic_' + str(num_parts) + 'cv5_inter',
                            title='5-fold Cross Validation',
                            x_val=x_val, y_val=y_val)

5-fold Cross Validation

In [48]:
pods.notebook.display_plots('olympic_5cv{part}_inter_BLM_polynomial{num_basis}.svg', 
                            directory='./diagrams', part=(0, 5), num_basis=(1, max_basis))

Model Fit

  • Marginal likelihood doesn’t always increase as model order increases.

  • The Bayesian model always has 2 parameters ($\alpha$ and $\sigma^2$), regardless of how many basis functions (and here we didn’t even fit them).

  • The maximum likelihood model overfits as the number of parameters increases.

  • Revisit maximum likelihood solution with validation set.

Regularized Mean

  • The validation fit here is based on the mean solution for $\mathbf{w}$ only.

  • For Bayesian solution $$\boldsymbol{\mu}_w = \left[\sigma^{-2}\boldsymbol{\Phi}^\top\boldsymbol{\Phi} + \alpha^{-1}\mathbf{I}\right]^{-1} \sigma^{-2} \boldsymbol{\Phi}^\top \mathbf{y}$$ instead of $$\mathbf{w}^* = \left[\boldsymbol{\Phi}^\top\boldsymbol{\Phi}\right]^{-1} \boldsymbol{\Phi}^\top \mathbf{y}$$

  • The two are equivalent when $\alpha \rightarrow \infty$.

  • Equivalent to a prior for $\mathbf{w}$ with infinite variance.

  • In other cases $\alpha \mathbf{I}$ regularizes the system (keeps the parameters smaller), as the sketch after this list illustrates.
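
A minimal numerical comparison of the two solutions on synthetic data (the data, basis size and hyperparameter values below are purely illustrative):

import numpy as np

# illustrative synthetic data and polynomial design matrix
np.random.seed(0)
x_syn = np.linspace(0., 1., 10)[:, None]
y_syn = np.sin(2*np.pi*x_syn) + np.random.normal(size=x_syn.shape)*0.1
Phi = np.hstack([x_syn**i for i in range(8)])

alpha = 1.
sigma2 = 0.01

# maximum likelihood (least squares) solution
w_star = np.linalg.solve(np.dot(Phi.T, Phi), np.dot(Phi.T, y_syn))

# regularized solution (posterior mean)
A = np.dot(Phi.T, Phi)/sigma2 + np.eye(Phi.shape[1])/alpha
mu_w = np.linalg.solve(A, np.dot(Phi.T, y_syn)/sigma2)

print(np.linalg.norm(w_star), np.linalg.norm(mu_w))  # regularization shrinks the weight norm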

Sampling the Posterior

  • Now check samples by extracting $\mathbf{w}$ from the posterior.

  • Now for $\mathbf{y} = \boldsymbol{\Phi} \mathbf{w} + \boldsymbol{\epsilon}$ we need $$\mathbf{w} \sim \mathcal{N}(\boldsymbol{\mu}_w,\mathbf{C}_w)$$ with $\mathbf{C}_w = \left[\sigma^{-2}\boldsymbol{\Phi}^\top \boldsymbol{\Phi} + \alpha^{-1} \mathbf{I}\right]^{-1}$ and $\boldsymbol{\mu}_w =\mathbf{C}_w \sigma^{-2} \boldsymbol{\Phi}^\top \mathbf{y}$ $$\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0},\sigma^2\mathbf{I})$$ with $\alpha=1$ and $\sigma^2 = 0.01$.
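
A short sketch of drawing such samples, reusing the same kind of illustrative design matrix as above (not the lecture's mlai basis):

import numpy as np

# illustrative data and design matrix
np.random.seed(1)
x_syn = np.linspace(0., 1., 10)[:, None]
y_syn = np.sin(2*np.pi*x_syn) + np.random.normal(size=x_syn.shape)*0.1
Phi = np.hstack([x_syn**i for i in range(5)])

alpha = 1.
sigma2 = 0.01

# posterior covariance and mean of w
C_w = np.linalg.inv(np.dot(Phi.T, Phi)/sigma2 + np.eye(Phi.shape[1])/alpha)
mu_w = np.dot(C_w, np.dot(Phi.T, y_syn))/sigma2

# draw samples w ~ N(mu_w, C_w) via the Cholesky factor of C_w
L = np.linalg.cholesky(C_w)
w_samples = mu_w + np.dot(L, np.random.normal(size=(Phi.shape[1], 5)))
print(w_samples.shape)  # one column per posterior sample

Sampling through the Cholesky factor is one standard choice; np.random.multivariate_normal would do the same job.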

Marginal Likelihood

  • The marginal likelihood can also be computed, it has the form: $$p(\mathbf{y}|\mathbf{X}, \sigma^2, \alpha) = \frac{1}{(2\pi)^\frac{n}{2}\left|\mathbf{K}\right|^\frac{1}{2}} \exp\left(-\frac{1}{2} \mathbf{y}^\top \mathbf{K}^{-1} \mathbf{y}\right)$$ where $\mathbf{K} = \alpha \boldsymbol{\Phi}\boldsymbol{\Phi}^\top + \sigma^2 \mathbf{I}$.

  • So it is a zero mean $n$-dimensional Gaussian with covariance matrix $\mathbf{K}$.
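
A sketch of evaluating this expression directly, with the same kind of illustrative design matrix (in the lecture the quantity is accessed through model.log_likelihood()):

import numpy as np

# illustrative data and design matrix
np.random.seed(2)
x_syn = np.linspace(0., 1., 10)[:, None]
y_syn = np.sin(2*np.pi*x_syn) + np.random.normal(size=x_syn.shape)*0.1
Phi = np.hstack([x_syn**i for i in range(5)])

alpha = 1.
sigma2 = 0.01
n = y_syn.shape[0]

# K = alpha Phi Phi^T + sigma^2 I
K = alpha*np.dot(Phi, Phi.T) + sigma2*np.eye(n)
sign, logdetK = np.linalg.slogdet(K)
log_marginal = (-0.5*n*np.log(2*np.pi) - 0.5*logdetK
                - 0.5*np.dot(y_syn.T, np.linalg.solve(K, y_syn)))
print(log_marginal[0, 0])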

Computing the Expected Output

  • Given the posterior for the parameters, how can we compute the expected output at a given location?

  • Output of model at location $\mathbf{x}_i$ is given by $$f(\mathbf{x}_i; \mathbf{w}) = \boldsymbol{\phi}_i^\top \mathbf{w}$$

  • We want the expected output under the posterior density, $p(\mathbf{w}|\mathbf{y}, \mathbf{X}, \sigma^2, \alpha)$.

  • Mean of the mapping function will be given by $$\begin{aligned} \left\langle f(\mathbf{x}_i; \mathbf{w})\right\rangle_{p(\mathbf{w}|\mathbf{y}, \mathbf{X}, \sigma^2, \alpha)} &= \boldsymbol{\phi}_i^\top \left\langle\mathbf{w}\right\rangle_{p(\mathbf{w}|\mathbf{y}, \mathbf{X}, \sigma^2, \alpha)} \\ &= \boldsymbol{\phi}_i^\top \boldsymbol{\mu}_w \end{aligned}$$

Variance of Expected Output

  • Variance of the model at location $\mathbf{x}_i$ is given by $$\begin{aligned} \text{var}(f(\mathbf{x}_i; \mathbf{w})) &= \left\langle(f(\mathbf{x}_i; \mathbf{w}))^2\right\rangle - \left\langle f(\mathbf{x}_i; \mathbf{w})\right\rangle^2 \\&= \boldsymbol{\phi}_i^\top \left\langle\mathbf{w}\mathbf{w}^\top\right\rangle \boldsymbol{\phi}_i - \boldsymbol{\phi}_i^\top \left\langle\mathbf{w}\right\rangle\left\langle\mathbf{w}\right\rangle^\top \boldsymbol{\phi}_i \\&= \boldsymbol{\phi}_i^\top \mathbf{C}_w\boldsymbol{\phi}_i \end{aligned}$$ where all these expectations are taken under the posterior density, $p(\mathbf{w}|\mathbf{y}, \mathbf{X}, \sigma^2, \alpha)$.
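
Continuing the same illustrative setup, the posterior mean and variance of the function at new inputs can be computed directly from $\boldsymbol{\mu}_w$ and $\mathbf{C}_w$:

import numpy as np

# illustrative data, design matrix and posterior over w (as in the earlier sketches)
np.random.seed(3)
x_syn = np.linspace(0., 1., 10)[:, None]
y_syn = np.sin(2*np.pi*x_syn) + np.random.normal(size=x_syn.shape)*0.1
Phi = np.hstack([x_syn**i for i in range(5)])

alpha = 1.
sigma2 = 0.01
C_w = np.linalg.inv(np.dot(Phi.T, Phi)/sigma2 + np.eye(Phi.shape[1])/alpha)
mu_w = np.dot(C_w, np.dot(Phi.T, y_syn))/sigma2

# predictions at new inputs: mean phi_i^T mu_w and variance phi_i^T C_w phi_i
x_pred = np.linspace(0., 1., 50)[:, None]
Phi_pred = np.hstack([x_pred**i for i in range(5)])
f_mean = np.dot(Phi_pred, mu_w)
f_var = np.sum(np.dot(Phi_pred, C_w)*Phi_pred, axis=1)[:, None]
print(f_mean.shape, f_var.shape)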

Reading

  • Section 3.7–3.8 of @Rogers:book11 (pg 122–133).

  • Section 3.4 of @Bishop:book06 (pg 161–165).