
# Basis Functions

University of Sheffield

## Nonlinear Regression

• Problem with linear regression: $\inputVector$ may not be linearly related to $\dataVector$.
• Potential solution: create a feature space: define $\basisFunc(\inputVector)$ where $\basisFunc(\cdot)$ is a nonlinear function of $\inputVector$.
• Model for target is a linear combination of these nonlinear functions $\mappingFunction(\inputVector) = \sum_{j=1}^\numBasisFunc \mappingScalar_j \basisFunc_j(\inputVector)$
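
A minimal sketch of this idea in NumPy (the basis, weights and input value here are illustrative, not from the lecture):

```python
import numpy as np

def predict(x, w, basis):
    """f(x) = sum_j w_j phi_j(x): linear in the weights w, non-linear in x."""
    return np.dot(w, basis(x))

# Illustrative quadratic basis phi(x) = [1, x, x^2] and weights.
quadratic = lambda x: np.array([1.0, x, x**2])
w = np.array([0.5, -1.0, 2.0])
print(predict(0.3, w, quadratic))  # 0.5 - 0.3 + 2*0.09 = 0.38
```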

## Non-linear in the Inputs

The standard linear model is linear in the inputs, $\mappingFunction(\inputVector) = \mappingVector^\top \inputVector$. Replacing the inputs with basis functions, $\mappingFunction(\inputVector) = \mappingVector^\top \basisVector(\inputVector)$, keeps the model linear in the parameters while making it non-linear in $\inputVector$.

## Basis Functions

• Basis functions can be global. E.g. quadratic basis: $\basisVector = [1, \inputScalar, \inputScalar^2]$

\begin{align*} \basisFunc_1(\inputScalar) = 1, \\ \basisFunc_2(\inputScalar) = \inputScalar, \\ \basisFunc_3(\inputScalar) = \inputScalar^2. \end{align*}

$\basisVector(\inputScalar) = \begin{bmatrix} 1\\ \inputScalar \\ \inputScalar^2\end{bmatrix}.$

## Matrix Valued Function

$\basisMatrix(\inputVector) = \begin{bmatrix} 1 & \inputScalar_1 & \inputScalar_1^2 \\ 1 & \inputScalar_2 & \inputScalar_2^2\\ \vdots & \vdots & \vdots \\ 1 & \inputScalar_{\numData} & \inputScalar_{\numData}^2 \end{bmatrix}$
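
In code, row $i$ of this matrix is the basis vector evaluated at $\inputScalar_i$. A NumPy sketch for the quadratic basis:

```python
import numpy as np

def quadratic_design_matrix(x):
    """Row i is [1, x_i, x_i^2]; returns an n x 3 matrix."""
    x = np.asarray(x, dtype=float)
    return np.column_stack([np.ones_like(x), x, x**2])

Phi = quadratic_design_matrix([1.0, 2.0, 3.0])
# [[1. 1. 1.]
#  [1. 2. 4.]
#  [1. 3. 9.]]
```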

## Functions Derived from Quadratic Basis

$\mappingFunction(\inputScalar) = {\color{cyan}\mappingScalar_0} + {\color{green}\mappingScalar_1 \inputScalar} + {\color{yellow}\mappingScalar_2 \inputScalar^2}$


## Different Bases

• Polynomial basis: $\basisFunc_j(\inputScalar_i) = \inputScalar_i^j$
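
A sketch of the corresponding basis matrix in NumPy (the number of basis functions is a free choice):

```python
import numpy as np

def polynomial_basis(x, num_basis):
    """Column j is x^j for j = 0, ..., num_basis - 1."""
    x = np.asarray(x, dtype=float)
    return np.column_stack([x**j for j in range(num_basis)])
```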

## Functions Derived from Polynomial Basis

$\mappingFunction(\inputScalar) = {\color{cyan}\mappingScalar_0} + {\color{green}\mappingScalar_1 \inputScalar} + {\color{yellow}\mappingScalar_2 \inputScalar^2} + {\color{magenta}\mappingScalar_3 \inputScalar^3} + {\color{red}\mappingScalar_4 \inputScalar^4}$

## Different Bases

• Basis functions can also be local, e.g. the radial (or Gaussian) basis $\basisFunc_j(\inputScalar) = \exp\left(-\frac{(\inputScalar-\mu_j)^2}{\lengthScale^2}\right)$ where $\mu_j$ is the centre and $\lengthScale$ the width.
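
A NumPy sketch of this basis (centres and length scale are free parameters; the values below are illustrative):

```python
import numpy as np

def radial_basis(x, centres, length_scale=1.0):
    """Column j is exp(-(x - mu_j)^2 / ell^2)."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    mu = np.asarray(centres, dtype=float).reshape(1, -1)
    return np.exp(-(x - mu)**2 / length_scale**2)

Phi = radial_basis(np.linspace(-2, 2, 5), centres=[-1.0, 0.0, 1.0])
```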

## Functions Derived from Radial Basis

$\mappingFunction(\inputScalar) = {\color{cyan}\mappingScalar_1 e^{-2(\inputScalar+1)^2}} + {\color{green}\mappingScalar_2e^{-2\inputScalar^2}} + {\color{yellow}\mappingScalar_3 e^{-2(\inputScalar-1)^2}}$

## Functions Derived from Relu Basis

$\mappingFunction(\inputScalar) = {\color{cyan}\mappingScalar_0} + {\color{green}\mappingScalar_1 \inputScalar H(\inputScalar+1.0)} + {\color{yellow}\mappingScalar_2 \inputScalar H(\inputScalar+0.33)} + {\color{magenta}\mappingScalar_3 \inputScalar H(\inputScalar-0.33)} + {\color{red}\mappingScalar_4 \inputScalar H(\inputScalar-1.0)}$ where $H(\cdot)$ is the Heaviside step function.
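
A sketch of this basis in NumPy. Note the slide's form is $\inputScalar H(\inputScalar - t)$ rather than the hinge $\max(0, \inputScalar - t)$; the thresholds below follow the formula above:

```python
import numpy as np

def relu_basis(x, thresholds=(-1.0, -0.33, 0.33, 1.0)):
    """Columns [1, x*H(x - t_1), ..., x*H(x - t_k)], H the Heaviside step."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    t = np.asarray(thresholds).reshape(1, -1)
    hinges = x * np.heaviside(x - t, 0.0)
    return np.column_stack([np.ones(x.shape[0]), hinges])
```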

## Functions Derived from Tanh Basis

$\mappingFunction(\inputScalar) = {\color{cyan}\mappingScalar_0} + {\color{green}\mappingScalar_1 \tanh(\inputScalar + 1)} + {\color{yellow}\mappingScalar_2 \tanh(\inputScalar - 1)}$

## Fourier Basis

• $\mappingFunction(\inputScalar) = \mappingScalar_0 + \mappingScalar_1 \sin(\inputScalar) + \mappingScalar_2 \cos(\inputScalar) + \mappingScalar_3 \sin(2\inputScalar) + \mappingScalar_4 \cos(2\inputScalar)$, i.e. the basis functions are $1, \sin(\inputScalar), \cos(\inputScalar), \sin(2\inputScalar), \cos(2\inputScalar), \ldots$
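
A NumPy sketch, where the number of frequencies is a free choice:

```python
import numpy as np

def fourier_basis(x, num_frequencies=2):
    """Columns [1, sin(x), cos(x), sin(2x), cos(2x), ...]."""
    x = np.asarray(x, dtype=float)
    cols = [np.ones_like(x)]
    for k in range(1, num_frequencies + 1):
        cols.append(np.sin(k * x))
        cols.append(np.cos(k * x))
    return np.column_stack(cols)
```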

## Functions Derived from Fourier Basis

$\mappingFunction(\inputScalar) = {\color{cyan}\mappingScalar_0} + {\color{green}\mappingScalar_1 \sin(\inputScalar)} + {\color{yellow}\mappingScalar_2 \cos(\inputScalar)} + {\color{magenta}\mappingScalar_3 \sin(2\inputScalar)} + {\color{red}\mappingScalar_4 \cos(2\inputScalar)}$

## Fitting to Data

Now we consider how these basis functions can be adjusted to fit a particular data set. We return to the Olympic marathon data from last time. First we scale the outputs to have zero mean and unit variance.
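
A minimal sketch of that preprocessing step (the target values below are placeholders, not the actual marathon data):

```python
import numpy as np

y = np.array([4.5, 3.9, 3.6, 3.2, 3.0])  # placeholder targets

y_mean, y_std = y.mean(), y.std()
y_scaled = (y - y_mean) / y_std           # zero mean, unit variance
# Map predictions back with: y_pred = y_scaled_pred * y_std + y_mean
```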

## Basis Function Models

• The prediction function is now defined as $\mappingFunction(\inputVector_i) = \sum_{j=1}^\numBasisFunc \mappingScalar_j \basisFunc_{i, j}$ where $\basisFunc_{i, j} = \basisFunc_j(\inputVector_i)$.

## Vector Notation

• Writing in vector notation, $\mappingFunction(\inputVector_i) = \mappingVector^\top \basisVector_i$ where $\basisVector_i = \basisVector(\inputVector_i)$.
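
In code, stacking each $\basisVector_i^\top$ as row $i$ of a matrix turns all $\numData$ predictions into one matrix-vector product:

```python
import numpy as np

Phi = np.array([[1.0, 1.0, 1.0],
                [1.0, 2.0, 4.0],
                [1.0, 3.0, 9.0]])  # quadratic basis at x = 1, 2, 3
w = np.array([0.5, -1.0, 2.0])
f = Phi @ w                         # f_i = w^T phi_i, every point at once
```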

## Log Likelihood for Basis Function Model

• Assuming Gaussian noise, the likelihood of a single data point is $p\left(\dataScalar_i|\inputScalar_i\right)=\frac{1}{\sqrt{2\pi\dataStd^2}}\exp\left(-\frac{\left(\dataScalar_i-\mappingVector^{\top}\basisVector_i\right)^{2}}{2\dataStd^2}\right).$

## Log Likelihood for Basis Function Model

• Leading to a log likelihood for the data set of $L(\mappingVector,\dataStd^2)= -\frac{\numData}{2}\log \dataStd^2-\frac{\numData}{2}\log 2\pi -\frac{\sum_{i=1}^{\numData}\left(\dataScalar_i-\mappingVector^{\top}\basisVector_i\right)^{2}}{2\dataStd^2}.$

## Objective Function

• Negating and dropping the constant gives a corresponding objective function of the form $\errorFunction(\mappingVector,\dataStd^2)= \frac{\numData}{2}\log\dataStd^2 + \frac{\sum_{i=1}^{\numData}\left(\dataScalar_i-\mappingVector^{\top}\basisVector_i\right)^{2}}{2\dataStd^2}.$
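
A direct transcription of this objective in NumPy (a sketch; `Phi` and `y` are assumed to be the basis matrix and target vector):

```python
import numpy as np

def objective(w, sigma2, Phi, y):
    """E(w, sigma^2) = (n/2) log sigma^2 + sum_i (y_i - w^T phi_i)^2 / (2 sigma^2)."""
    n = y.shape[0]
    residuals = y - Phi @ w
    return 0.5 * n * np.log(sigma2) + 0.5 * np.sum(residuals**2) / sigma2
```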

## Expand the Brackets

\begin{align} \errorFunction(\mappingVector,\dataStd^2) = &\frac{\numData}{2}\log \dataStd^2 + \frac{1}{2\dataStd^2}\sum_{i=1}^{\numData}\dataScalar_i^{2}-\frac{1}{\dataStd^2}\sum_{i=1}^{\numData}\dataScalar_i\mappingVector^{\top}\basisVector_i\\ &+\frac{1}{2\dataStd^2}\sum_{i=1}^{\numData}\mappingVector^{\top}\basisVector_i\basisVector_i^{\top}\mappingVector+\text{const}. \end{align}

## Expand the Brackets

\begin{align} \errorFunction(\mappingVector, \dataStd^2) = & \frac{\numData}{2}\log \dataStd^2 + \frac{1}{2\dataStd^2}\sum_{i=1}^{\numData}\dataScalar_i^{2}-\frac{1}{\dataStd^2} \mappingVector^\top\sum_{i=1}^{\numData}\basisVector_i \dataScalar_i\\ & +\frac{1}{2\dataStd^2}\mappingVector^{\top}\left[\sum_{i=1}^{\numData}\basisVector_i\basisVector_i^{\top}\right]\mappingVector+\text{const}.\end{align}

## Design Matrices

• Design matrix notation: $\basisMatrix = \begin{bmatrix} \mathbf{1} & \inputVector & \inputVector^2\end{bmatrix}$ so that $\basisMatrix \in \Re^{\numData \times \numBasisFunc}$ (here $\numBasisFunc = 3$).

## Multivariate Derivatives Reminder

• We will need some multivariate calculus. $\frac{\text{d}\mathbf{a}^{\top}\mappingVector}{\text{d}\mappingVector}=\mathbf{a}$ and $\frac{\text{d}\mappingVector^{\top}\mathbf{A}\mappingVector}{\text{d}\mappingVector}=\left(\mathbf{A}+\mathbf{A}^{\top}\right)\mappingVector$ or if $\mathbf{A}$ is symmetric (i.e. $\mathbf{A}=\mathbf{A}^{\top}$) $\frac{\text{d}\mappingVector^{\top}\mathbf{A}\mappingVector}{\text{d}\mappingVector}=2\mathbf{A}\mappingVector.$

## Differentiate

Differentiating with respect to the vector $\mappingVector$ we obtain $\frac{\text{d} E\left(\mappingVector,\dataStd^2 \right)}{\text{d}\mappingVector}=-\frac{1}{\dataStd^2} \sum_{i=1}^{\numData}\basisVector_i\dataScalar_i+\frac{1}{\dataStd^2} \left[\sum_{i=1}^{\numData}\basisVector_i\basisVector_i^{\top}\right]\mappingVector$ Setting this derivative to zero leads to $\mappingVector^{*}=\left[\sum_{i=1}^{\numData}\basisVector_i\basisVector_i^{\top}\right]^{-1}\sum_{i=1}^{\numData}\basisVector_i\dataScalar_i,$

## Matrix Notation

Rewrite in matrix notation: $\sum_{i=1}^{\numData}\basisVector_i\basisVector_i^\top = \basisMatrix^\top \basisMatrix$ and $\sum_{i=1}^{\numData}\basisVector_i\dataScalar_i = \basisMatrix^\top \dataVector.$

## Update Equations

• Update for $\mappingVector^{*}$ $\mappingVector^{*} = \left(\basisMatrix^\top \basisMatrix\right)^{-1} \basisMatrix^\top \dataVector$
• The equation for $\left.\dataStd^2\right.^{*}$ may also be found $\left.\dataStd^2\right.^{{*}}=\frac{\sum_{i=1}^{\numData}\left(\dataScalar_i-\left.\mappingVector^{*}\right.^{\top}\basisVector_i\right)^{2}}{\numData}.$
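
Both updates in NumPy (a sketch; the linear solve avoids forming the inverse explicitly, see the next slide):

```python
import numpy as np

def fit(Phi, y):
    """Maximum likelihood estimates w* and sigma^2* for the basis function model."""
    w_star = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)  # solves (Phi^T Phi) w = Phi^T y
    sigma2_star = np.mean((y - Phi @ w_star)**2)
    return w_star, sigma2_star
```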

## Avoid Direct Inverse

• E.g. Solve for $\mappingVector$ $\left(\basisMatrix^\top \basisMatrix\right)\mappingVector = \basisMatrix^\top \dataVector$
• See `np.linalg.solve`
• In practice use $\mathbf{Q}\mathbf{R}$ decomposition (see lab class notes).
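
For instance (illustrative data; both lines agree in exact arithmetic, but the solve is the better-conditioned computation):

```python
import numpy as np

x = np.linspace(-1, 1, 20)
Phi = np.column_stack([np.ones_like(x), x, x**2])
y = np.sin(x)                       # illustrative targets

A, b = Phi.T @ Phi, Phi.T @ y
w_inv = np.linalg.inv(A) @ b        # explicit inverse: avoid
w_solve = np.linalg.solve(A, b)     # preferred: solve A w = b directly
```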

## Non-linear but Linear in the Parameters

• Model is non-linear, but linear in parameters $\mappingFunction(\inputVector) = \mappingVector^\top \basisVector(\inputVector)$
• $\inputVector$ is inside the non-linearity, but $\mappingVector$ stays outside: $\mappingFunction(\inputVector) = \mappingVector^\top \basisVector(\inputVector; \boldsymbol{\theta}).$

## Reading

• Section 1.4 of Rogers and Girolami (2011)

• Chapter 1, pg 1-6 of Bishop (2006)

• Chapter 3, Section 3.1 up to pg 143 of Bishop (2006)

## Use of QR Decomposition for Numerical Stability

Forming $\basisMatrix^\top \basisMatrix$ squares the condition number of $\basisMatrix$, so solving the normal equations directly can lose precision. The QR decomposition $\basisMatrix = \mathbf{Q}\mathbf{R}$, with $\mathbf{Q}$ having orthonormal columns and $\mathbf{R}$ upper triangular, avoids this: substituting into $\basisMatrix^\top \basisMatrix \mappingVector = \basisMatrix^\top \dataVector$ gives $\mathbf{R}^\top \mathbf{Q}^\top \mathbf{Q} \mathbf{R} \mappingVector = \mathbf{R}^\top \mathbf{Q}^\top \dataVector$, which reduces to $\mathbf{R} \mappingVector = \mathbf{Q}^\top \dataVector$.
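
A sketch with a polynomial basis (the data and degree below are illustrative):

```python
import numpy as np

x = np.linspace(-1, 1, 50)
y = np.sin(np.pi * x)                            # illustrative targets

Phi = np.column_stack([x**j for j in range(6)])  # degree-5 polynomial basis
Q, R = np.linalg.qr(Phi)                         # Phi = Q R
w = np.linalg.solve(R, Q.T @ y)                  # R w = Q^T y

# Same solution as solve(Phi.T @ Phi, Phi.T @ y), but without
# squaring the condition number of Phi.
```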

## References

Bishop, C.M., 2006. Pattern Recognition and Machine Learning. Springer.

Rogers, S., Girolami, M., 2011. A First Course in Machine Learning. CRC Press.