# Cloaking Functions: Differential Privacy with Gaussian Processes

CD-Make 2017 Keynote, Reggio Calabria, Italy

## Embodiment Factors

 bits/min billions 2,000 billioncalculations/s ~100 a billion embodiment 20 minutes 5 billion years

## Evolved Relationship

Rasmussen and Williams (2006)

## Differential Privacy, summary

• We want to protect a user from a linkage attack…

…while still performing inference over the whole group.

• Making a dataset private is more than just erasing names.

Narayanan and Felten (2014);Ohm (2010);Barth-Jones (2012)

• To achieve a level of privacy one needs to add randomness to the data.

• This is a fundamental feature of differential privacy.

See The Algorithmic Foundations of Differential Privacy by Dwork and Roth (2014) for a rigorous introduction to the framework.

## Differential Privacy for Gaussian Processes

We have a dataset in which the inputs, $\mathbf{X}$, are public. The outputs, $\mathbf{ y}$, we want to keep private.

Data consists of the heights and weights of 287 women from a census of the !Kung (Howell, 1967)

## Vectors and Functions

Hall et al. (2013) showed that one can ensure that a version of $f$, function $\tilde{f}$ is $(\varepsilon, \delta)$-differentially private by adding a scaled sample from a GP prior.

## Applied to Gaussian Processes

• We applied this method to the GP posterior.

• The covariance of the posterior only depends on the inputs, $\mathbf{X}$. So we can compute this without applying DP.

• The mean function, $f_D(\mathbf{ x}_*)$, does depend on $\mathbf{ y}$. $f_D(\mathbf{ x}_*) = \mathbf{ k}(x_*, \mathbf{X}) \mathbf{K}^{-1} \mathbf{ y}$

• We are interested in finding

$|| f_D(\mathbf{ x}_*) - f_{D^\prime}(\mathbf{ x}_*) ||_H^2$

…how much the mean function (in RKHS) can change due to a change in $\mathbf{ y}$.

## Applied to Gaussian Processes

• Using the representer theorem, we can write $|| f_D(\mathbf{ x}_*) - f_{D^\prime}(\mathbf{ x}_*) ||_H^2$

as:

$\Big|\Big|\sum_{i=1}^nk(\mathbf{ x}_*,\mathbf{ x}_i) \left(\alpha_i - \alpha^\prime_i\right)\Big|\Big|_H^2$

where $\boldsymbol{\alpha} - \boldsymbol{\alpha}^\prime = \mathbf{K}^{-1} \left(\mathbf{ y}- \mathbf{ y}^\prime \right)$

• L2 Norm

$\Big|\Big|\sum_{i=1}^nk(\mathbf{ x}_*,\mathbf{ x}_i) \left(\alpha_i - \alpha^\prime_i\right)\Big|\Big|_H^2$

where $\boldsymbol{\alpha} - \boldsymbol{\alpha}^\prime = \mathbf{K}^{-1} \left(\mathbf{ y}- \mathbf{ y}^\prime \right)$

• We constrain the kernel: $-1\leq k(\cdot,\cdot) \leq 1$ and we only allow one element of $\mathbf{ y}$ and $\mathbf{ y}^\prime$ to differ (by at most $d$).

• So only one column of $\mathbf{K}^{-1}$ will be involved in the change of mean (which we are summing over).

• The distance above can then be shown to be no greater than $d\;||\mathbf{K}^{-1}||_\infty$

## Applied to Gaussian Processes

This ‘works’ in that it allows DP predictions…but to avoid too much noise, the value of $\varepsilon$ is too large (here it is 100)

EQ kernel, $\ell= 25$ years, $\Delta=100$cm

## Inducing Inputs

Using sparse methods (i.e. inducing inputs) can help reduce the sensitivity a little. We’ll see more on this later.

## Cloaking

• So far we’ve made the whole posterior mean function private…

…what if we just concentrate on making particular predictions private?

## Effect of perturbation

• Standard approach: sample the noise is from the GP’s prior.

• Not necessarily the most ‘efficient’ covariance to use.

## Cloaking

Left: Function change. Right: test point change

## Cloaking

Left: Function change. Right: test point change

## Cloaking

Left: Function change. Right: test point change

## Cloaking

Left: Function change. Right: test point change

## Cloaking

Left: Function change. Right: test point change

## Cloaking

Left: Function change. Right: test point change

## DP Vectors

• Hall et al. (2013) also presented a bound on vectors.

• Find a bound ($\Delta$) on the scale of the output change, in term of its Mahalanobis distance (wrt the added noise covariance).

$\sup_{D \sim {D^\prime}} ||\mathbf{M}^{-1/2} (\mathbf{ y}_* - \mathbf{ y}_{*}^\prime)||_2 \leq \Delta$

• We use this to scale the noise we add:

$\frac{\text{c}(\delta)\Delta}{\varepsilon} \mathcal{N}_d(0,\mathbf{M})$

We get to pick $\mathbf{M}$

## Cloaking

• Intuitively we want to construct $\mathbf{M}$ so that it has greatest covariance in those directions most affected by changes in training points, so that it will be most able to mask those changes.

• The change in posterior mean predictions is,

$\mathbf{ y}_* - \mathbf{ y}^\prime_* = \mathbf{K}_{*f} \mathbf{K}^{-1} (\mathbf{ y}-\mathbf{ y}^\prime)$

• Effect of perturbing each training point on each test point is represented in the cloaking matrix,

$\mathbf{C} = \mathbf{K}_{*f} \mathbf{K}^{-1}$

## Cloaking

• We assume we are protecting only one training input’s change, by at most $d$.

• So $\mathbf{ y}-\mathbf{ y}^\prime$ will be all zeros except for one element, $i$.

• So the change in test points will be (at most)

$\mathbf{ y}_*^\prime - \mathbf{ y}_* = d \mathbf{C}_{:i}$

• We’re able to write the earlier bound as,

$d^2 \sup_{i} \mathbf{c}_i^\top \mathbf{M}^{-1} \mathbf{c}_i \leq\Delta$

where $\mathbf{c}_i \triangleq \mathbf{C}_{:i}$

## Cloaking

• Dealing with $d$ elsewhere and setting $\Delta = 1$ (thus $0 \leq \mathbf{c}_i^\top \mathbf{M}^{-1} \mathbf{c}_i \leq 1$) and minimise $\log |\mathbf{M}|$ (minimises the partial entropy).

• Using Lagrange multipliers and gradient descent, we find $\mathbf{M} = \sum_i{\lambda_i \mathbf{c}_i \mathbf{c}_i^\top}$

## Cloaking: Results

The noise added by this method is now practical.

EQ kernel, $l = 25$ years, $\Delta=100$cm, $\varepsilon=1$

## Cloaking: Results

It also has some interesting features;

• Less noise where data is concentrated
• Least noise far from any data
• Most noise just outside data

## Citibike

• Tested on 4D citibike dataset (predicting journey durations from start/finish station locations).

• The method appears to achieve lower noise than binning alternatives (for reasonable $\varepsilon$).

## Citibike

lengthscale in degrees, values above, journey duration (in seconds)

## Cloaking and Inducing Inputs

• Outliers poorly predicted.

• Too much noise around data ‘edges.’

• Use inducing inputs to reduce the sensitivity to these outliers.

## Results

• For 1D !Kung, RMSE improved from $15.0 \pm 2.0 \text{cm}$ to $11.1 \pm 0.8 \text{cm}$

Use Age and Weight to predict Height

• For 2D !Kung, RMSE improved from $22.8 \pm 1.9 \text{cm}$ to $8.8 \pm 0.6 \text{cm}$

Note that the uncertainty across cross-validation runs smaller. 2D version benefits from data’s 1D manifold.

## Conclusions

• Summary We have developed an improved method for performing differentially private regression.

• Future work Multiple outputs, GP classification, DP Optimising hyperparameters, Making the inputs private.

• Thanks Funders: EPSRC; Colleagues: Michael T. Smith, Mauricio, Max.

• Recruiting Deep Probabilistic Models: 2 year postdoc (tinyurl.com/shefpostdoc)

## References

Barth-Jones, D.C., 2012. The ’re-identification’ of governor william weld’s medical information: A critical re-examination of health data identification risks and privacy protections, then and now. Then and Now.
Dwork, C., Roth, A., 2014. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science 9, 211–407. https://doi.org/10.1561/0400000042
Hall, R., Rinaldo, A., Wasserman, L., 2013. Differential privacy for functions and functional data. Journal of Machine Learning Research 14, 703–727.
Howell, N., 1967. Data from a partial census of the !kung san, dobe. 1967-1969.
Lawrence, N.D., 2017. Living together: Mind and machine intelligence. arXiv.
Narayanan, A., Felten, E.W., 2014. No silver bullet: De-identification still doesn’t work.
Ohm, P., 2010. Broken promises of privacy: Responding to the surprising failure of anonymization. UCLA Law Review 57, 1701.
Rasmussen, C.E., Williams, C.K.I., 2006. Gaussian processes for machine learning. mit, Cambridge, MA.