Uncertainty in Loss Functions
Neil D. Lawrence
2018-05-29
IWCV 2018, Modena, Italy
What is Machine Learning?
\[ \text{data} + \text{model} \xrightarrow{\text{compute}} \text{prediction}\]
data: observations, could be actively or passively acquired (meta-data).
model: assumptions, based on previous experience (other data! transfer learning etc.), or beliefs about the regularities of the universe. Inductive bias.
prediction: an action to be taken or a categorization or a quality score.
What is Machine Learning?
\[\text{data} + \text{model} \xrightarrow{\text{compute}} \text{prediction}\]
To combine data with a model need:
a prediction function \(\mappingFunction(\cdot)\) includes our beliefs about the regularities of the universe
an objective function \(\errorFunction(\cdot)\) defines the cost of misprediction.
Artificial vs Natural Systems
Consider natural intelligence, or natural systems.
Contrast between an artificial system and a natural system.
The key difference between the two is that artificial systems are designed whereas natural systems are evolved.
Natural Systems are Evolved
Survival of the fittest
?
Natural Systems are Evolved
Survival of the fittest
Herbert Spencer, 1864
Natural Systems are Evolved
Non-survival of the non-fit
Mistake we Make
Equating fitness with the objective function.
Assuming a static environment and a known objective.
Classical Loss Function
\[
\errorFunction(\dataVector, \inputMatrix) = \sum_{k}\sum_{j}
L(\dataScalar_{k,j}, \inputVector_k)
\]
Dependent on a prediction function \[
\errorFunction(\dataVector, \inputMatrix) = \sum_{k}\sum_{j} L(\dataScalar_{k,j},
\mappingFunction_j(\inputVector_k))
\]
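A minimal sketch of this double sum, assuming a linear prediction function and a squared-error loss (the names and data below are illustrative):

```python
import numpy as np

def squared_loss(y, f):
    """Pointwise loss L(y, f)."""
    return (y - f) ** 2

def prediction(X, W):
    """Linear prediction functions f_j(x_k), one column per output j."""
    return X @ W

def classical_error(Y, X, W):
    """Sum the loss over data points k and outputs j."""
    return np.sum(squared_loss(Y, prediction(X, W)))

# illustrative usage
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))
W = rng.normal(size=(2, 3))
Y = prediction(X, W) + 0.1 * rng.normal(size=(10, 3))
print(classical_error(Y, X, W))
```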
Introduce Uncertainty
Introduce \(\left\{ \scaleScalar_i\right\}_{i=1}^\numData\). \[
\errorFunction(\dataVector, \inputMatrix) = \sum_{i}
\scaleScalar_i L(\dataScalar_i, \mappingFunction(\inputVector_i))
\]
Assume \(\scaleScalar \sim q(\scaleScalar)\) \[
\begin{align*}
\errorFunction(\dataVector, \inputMatrix) = & \sum_{i}\expectationDist{\scaleScalar_i L(\dataScalar_i, \mappingFunction(\inputVector_i))}{q(\scaleScalar)} \\
= & \sum_{i}\expectationDist{\scaleScalar_i}{q(\scaleScalar)} L(\dataScalar_i, \mappingFunction(\inputVector_i))
\end{align*}
\]
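Because the loss does not depend on the scale, it factors out of the expectation; a quick numerical illustration, with an exponential \(q(\scaleScalar)\) chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
loss_i = 1.7                                          # L(y_i, f(x_i)) is fixed given the data
scales = rng.exponential(scale=0.5, size=1_000_000)   # samples of s_i from q(s)

# E[s_i L_i] equals E[s_i] L_i because the loss does not depend on the scale
print(np.mean(scales * loss_i), np.mean(scales) * loss_i)
```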
Principle of Maximum Entropy
Maximum entropy principle: define the entropy of \(q\) relative to a base measure \(m(\scaleScalar)\) \[
H(\scaleScalar)= - \int q(\scaleScalar) \log \frac{q(\scaleScalar)}{m(\scaleScalar)} \text{d}\scaleScalar
\]
Minimize loss minus Entropy \[
\begin{align*}
\errorFunction = & \beta\sum_{i}\expectationDist{\scaleScalar_i}{q(\scaleScalar)} L(\dataScalar_i, \mappingFunction(\inputVector_i)) - H(\scaleScalar)\\
= & \beta\sum_{i}\expectationDist{\scaleScalar_i}{q(\scaleScalar)} L(\dataScalar_i, \mappingFunction(\inputVector_i)) + \int q(\scaleScalar) \log \frac{q(\scaleScalar)}{m(\scaleScalar)}\text{d}\scaleScalar
\end{align*}
\]
Coordinate Descent
Optimize wrt \(q(\cdot)\) \[
q(\scaleScalar_i) = \left(\lambda+\beta L_i\right) \exp\left(-(\lambda+\beta L_i) \scaleScalar_i\right)
\]
Update expectation \[
\expectationDist{\scaleScalar_i}{q(\scaleScalar_i)} = \frac{1}{\lambda+\beta
L_i}
\]
Minimize expected loss \[
\beta \sum_{i=1}^\numData \expectationDist{\scaleScalar_i}{q(\scaleScalar_i)} L(\dataScalar_i, \mappingFunction(\inputVector_i))
\]
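A minimal sketch of the resulting loop, assuming a squared-error loss and a linear predictor fitted by weighted least squares; \(\lambda\), \(\beta\) and all names are illustrative choices:

```python
import numpy as np

def fit_weighted_linear(X, y, w):
    """Weighted least squares: minimise sum_i w_i (y_i - x_i^T theta)^2."""
    Xw = X * w[:, None]
    return np.linalg.solve(X.T @ Xw, Xw.T @ y)

def coordinate_descent(X, y, beta=1.0, lam=1.0, num_iters=20):
    expected_s = np.ones(X.shape[0])              # E[s_i] under q(s_i)
    for _ in range(num_iters):
        # minimise the expected loss wrt the prediction function
        theta = fit_weighted_linear(X, y, expected_s)
        # update q(s_i), exponential with rate lambda + beta L_i,
        # so the expectation becomes 1 / (lambda + beta L_i)
        losses = (y - X @ theta) ** 2
        expected_s = 1.0 / (lam + beta * losses)
    return theta, expected_s

# illustrative usage: a few corrupted points get down-weighted
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(30), rng.normal(size=30)])
y = X @ np.array([1.0, 2.0]) + 0.1 * rng.normal(size=30)
y[::10] += 5.0
theta, expected_s = coordinate_descent(X, y)
print(theta, expected_s[:3])
```

Points with large loss receive a small expected scale, so they are progressively down-weighted.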
Olympic Marathon Data
Gold medal times for Olympic Marathon since 1896.
Marathons before 1924 didn’t have a standardised distance.
Present results using pace per km.
In 1904 the Marathon was badly organised, leading to very slow times.
Image from Wikimedia Commons http://bit.ly/16kMKHQ
Example: Linear Regression
Linear Regression on Olympic Data
Linear Regression on Olympic Data
Parameter Uncertainty
In Bayesian inference we consider parameter uncertainty
\[
\expectationDist{\beta \scaleScalar_i L(\dataScalar_i,
\mappingFunction(\inputVector_i))}{q(\scaleScalar, \mappingFunction)} + \int
q(\scaleScalar, \mappingFunction) \log \frac{q(\scaleScalar,
\mappingFunction)}{m(\scaleScalar)m(\mappingFunction)}\text{d}\scaleScalar
\text{d}\mappingFunction
\] Implying \[
q(\mappingFunction, \scaleScalar) \propto
\prod_{i=1}^\numData \exp\left(- \beta \scaleScalar_i L(\dataScalar_i,
\mappingFunction(\inputVector_i)) \right) m(\scaleScalar)m(\mappingFunction)
\]
Approximation
Generally intractable, so assume: \[
q(\mappingFunction, \scaleScalar) = q(\mappingFunction)q(\scaleScalar)
\]
Approximation II
Entropy maximization proceeds as before but with \[
q(\scaleScalar) \propto
\prod_{i=1}^\numData \exp\left(- \beta \scaleScalar_i \expectationDist{L(\dataScalar_i,
\mappingFunction(\inputVector_i))}{q(\mappingFunction)} \right) m(\scaleScalar)
\] and \[
q(\mappingFunction) \propto
\prod_{i=1}^\numData \exp\left(- \beta \expectationDist{\scaleScalar_i}{q(\scaleScalar)} L(\dataScalar_i,
\mappingFunction(\inputVector_i)) \right) m(\mappingFunction)
\]
Approximation III
We can now proceed by iterating between \(q(\scaleScalar)\) and \(q(\mappingFunction)\).
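A minimal sketch of this iteration, assuming a squared-error loss, a Bayesian linear predictor with a Gaussian prior \(m(\mappingFunction)\) and an exponential measure over the scales; the factor of 2 in the per-point precision comes from matching the exponentiated loss to a Gaussian, and all names are illustrative:

```python
import numpy as np

def variational_iteration(X, y, alpha=1.0, beta=1.0, lam=1.0, num_iters=20):
    """Iterate between q(s) and q(f) for a Bayesian linear predictor."""
    num_data, num_feats = X.shape
    expected_s = np.ones(num_data)                 # E[s_i] under q(s)
    for _ in range(num_iters):
        # q(theta) is Gaussian: exp(-beta sum_i E[s_i] L_i) times a N(0, 1/alpha) prior
        W = np.diag(2.0 * beta * expected_s)
        precision = alpha * np.eye(num_feats) + X.T @ W @ X
        covariance = np.linalg.inv(precision)
        mean = covariance @ (X.T @ W @ y)
        # expected squared loss under q(theta): (y_i - x_i^T m)^2 + x_i^T S x_i
        expected_loss = (y - X @ mean) ** 2 + np.sum((X @ covariance) * X, axis=1)
        # q(s_i) is exponential with rate lambda + beta * expected loss
        expected_s = 1.0 / (lam + beta * expected_loss)
    return mean, covariance, expected_s

# illustrative usage
rng = np.random.default_rng(3)
X = np.column_stack([np.ones(30), rng.normal(size=30)])
y = X @ np.array([1.0, 2.0]) + 0.1 * rng.normal(size=30)
mean, covariance, expected_s = variational_iteration(X, y)
print(mean)
```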
Probabilistic Linear Regression on Olympic Data
Probabilistic Linear Regression on Olympic Data
Expectation Update
Taking \(\scaleScalar_i = \vScalar_i^2\) with a Gaussian measure over \(\vScalar\), the update is given by \[
\expectationDist{\vScalar_i^2}{q(\vScalar)} = \meanTwoScalar_i^2 +
\covarianceScalar_{i, i}.
\]
If the measure's covariance is spherical and its mean is zero \[
\expectationDist{\vScalar_i^2}{q(\vScalar)} = \frac{1}{\lambda + \beta L_i}
\] which is the same as we had before for the exponential prior over \(\scaleScalar\).
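A quick numerical check of the Gaussian identity \(\expectationDist{\vScalar_i^2}{q(\vScalar)} = \meanTwoScalar_i^2 + \covarianceScalar_{i,i}\), with illustrative numbers:

```python
import numpy as np

rng = np.random.default_rng(2)
mean, variance = 0.7, 0.3                          # m_i and C_{i,i}
samples = rng.normal(mean, np.sqrt(variance), size=1_000_000)
print(np.mean(samples**2), mean**2 + variance)     # the two agree closely
```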
Conditioning the Measure
Condition the measure to place high weight in the test region \[
\kernelMatrix^\prime = \kernelMatrix - \frac{\kernelVector_\ast\kernelVector^\top_\ast}{\kernelScalar_{*,*}}
\] and \[
\meanVector^\prime = \meanVector + \frac{\kernelVector_\ast}{\kernelScalar_{*,*}}
(\vScalar_\ast-\meanScalar)
\]
As the covariance becomes small, this becomes LOESS regression.
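A minimal sketch of this conditioning step, assuming an exponentiated quadratic kernel defines the measure's covariance; the kernel, test location and conditioned value are illustrative:

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    """Exponentiated quadratic covariance between two sets of inputs."""
    sqdist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sqdist / lengthscale**2)

def condition_measure(X, x_star, v_star, mu, mu_star, lengthscale=1.0):
    """Condition the measure's mean and covariance on a value at the test location."""
    K = rbf_kernel(X, X, lengthscale)
    k_star = rbf_kernel(X, x_star[None, :], lengthscale)[:, 0]
    k_ss = rbf_kernel(x_star[None, :], x_star[None, :], lengthscale)[0, 0]
    K_prime = K - np.outer(k_star, k_star) / k_ss     # K' = K - k* k*^T / k**
    mu_prime = mu + k_star / k_ss * (v_star - mu_star)
    return mu_prime, K_prime

# illustrative usage
X = np.linspace(0, 1, 5)[:, None]
x_star = np.array([0.5])
mu = np.zeros(5)
mu_prime, K_prime = condition_measure(X, x_star, v_star=2.0, mu=mu, mu_star=0.0)
print(mu_prime)
```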
Joint Uncertainty
Joint Samples from Regression
Joint Samples from Regression
Conclusions
Maximum entropy framework for uncertainty in:
Loss functions
Prediction functions