at on Aug 30, 2017 [jupyter][google colab][reveal]
Neil D. Lawrence, Amazon Cambridge and University of Sheffield

#### Abstract

Processing of personally sensitive information should respect an individual's privacy. One promising framework is Differential Privacy (DP). In this talk I'll present work led by Michael Smith at the University of Sheffield on the use of cloaking functions to make Gaussian process (GP) predictions differentially private. Gaussian process models are flexible models with particular advantages in handling missing and noisy data. Our hope is that advances in DP for GPs will make it easier to 'learn without looking', i.e. gain the advantages of prediction from patient data without impinging on their privacy. Joint work with Michael T. Smith, Max Zwiessele and Mauricio Alvarez

.

.

## Embodiment Factors 

 compute ≈100 gigaflops ≈16 petaflops communicate 1 gigbit/s 100 bit/s (compute/communicate) 104 1014

There is a fundamental limit placed on our intelligence based on our ability to communicate. Claude Shannon founded the field of information theory. The clever part of this theory is it allows us to separate our measurement of information from what the information pertains to1.

Shannon measured information in bits. One bit of information is the amount of information I pass to you when I give you the result of a coin toss. Shannon was also interested in the amount of information in the English language. He estimated that on average a word in the English language contains 12 bits of information.

Given typical speaking rates, that gives us an estimate of our ability to communicate of around 100 bits per second (Reed and Durlach 1998). Computers on the other hand can communicate much more rapidly. Current wired network speeds are around a billion bits per second, ten million times faster.

When it comes to compute though, our best estimates indicate our computers are slower. A typical modern computer can process make around 100 billion floating point operations per second, each floating point operation involves a 64 bit number. So the computer is processing around 6,400 billion bits per second.

It's difficult to get similar estimates for humans, but by some estimates the amount of compute we would require to simulate a human brain is equivalent to that in the UK's fastest computer (Ananthanarayanan et al. 2009), the MET office machine in Exeter, which in 2018 ranks as the 11th fastest computer in the world. That machine simulates the world's weather each morning, and then simulates the world's climate in the afternoon. It is a 16 petaflop machine, processing around 1,000 trillion bits per second.

So when it comes to our ability to compute we are extraordinary, not compute in our conscious mind, but the underlying neuron firings that underpin both our consciousness, our subconsciousness as well as our motor control etc.

If we think of ourselves as vehicles, then we are massively overpowered. Our ability to generate derived information from raw fuel is extraordinary. Intellectually we have formula one engines.

But in terms of our ability to deploy that computation in actual use, to share the results of what we have inferred, we are very limited. So when you imagine the F1 car that represents a psyche, think of an F1 car with bicycle wheels.

Just think of the control a driver would have to have to deploy such power through such a narrow channel of traction. That is the beauty and the skill of the human mind.

In contrast, our computers are more like go-karts. Underpowered, but with well-matched tires. They can communicate far more fluidly. They are more efficient, but somehow less extraordinary, less beautiful.

For humans, that means much of our computation should be dedicated to considering what we should compute. To do that efficiently we need to model the world around us. The most complex thing in the world around us is other humans. So it is no surprise that we model them. We second guess what their intentions are, and our communication is only necessary when they are departing from how we model them. Naturally, for this to work well, we need to understand those we work closely with. So it is no surprise that social communication, social bonding, forms so much of a part of our use of our limited bandwidth.

There is a second effect here, our need to anthropomorphise objects around us. Our tendency to model our fellow humans extends to when we interact with other entities in our environment. To our pets as well as inanimate objects around us, such as computers or even our cars. This tendency to over interpret could be a consequence of our limited ability to communicate.

For more details see this paper "Living Together: Mind and Machine Intelligence", and this TEDx talk.

# Evolved Relationship with Information 

The high bandwidth of computers has resulted in a close relationship between the computer and data. Large amounts of information can flow between the two. The degree to which the computer is mediating our relationship with data means that we should consider it an intermediary.

Originaly our low bandwith relationship with data was affected by two characteristics. Firstly, our tendency to over-interpret driven by our need to extract as much knowledge from our low bandwidth information channel as possible. Secondly, by our improved understanding of the domain of mathematical statistics and how our cognitive biases can mislead us.

With this new set up there is a potential for assimilating far more information via the computer, but the computer can present this to us in various ways. If it's motives are not aligned with ours then it can misrepresent the information. This needn't be nefarious it can be simply as a result of the computer pursuing a different objective from us. For example, if the computer is aiming to maximize our interaction time that may be a different objective from ours which may be to summarize information in a representative manner in the shortest possible length of time.

For example, for me, it was a common experience to pick up my telephone with the intention of checking when my next appointment was, but to soon find myself distracted by another application on the phone, and end up reading something on the internet. By the time I'd finished reading, I would often have forgotten the reason I picked up my phone in the first place.

There are great benefits to be had from the huge amount of information we can unlock from this evolved relationship between us and data. In biology, large scale data sharing has been driven by a revolution in genomic, transcriptomic and epigenomic measurement. The improved inferences that that can be drawn through summarizing data by computer have fundamentally changed the nature of biological science, now this phenomenon is also infuencing us in our daily lives as data measured by happenstance is increasingly used to characterize us.

Better mediation of this flow actually requires a better understanding of human-computer interaction. This in turn involves understanding our own intelligence better, what its cognitive biases are and how these might mislead us.

For further thoughts see this Guardian article from 2015 on marketing in the internet era.

You can also check my blog post on "System Zero"

### Bandwidth Constrained Conversations

import pods
from ipywidgets import IntSlider
pods.notebook.display_plots('anne-bob{sample:0>3}.svg',
'../slides/diagrams',  sample=IntSlider(0, 0, 7, 1))

Embodiment factors imply that, in our communication between humans, what is not said is, perhaps, more important than what is said. To communicate with each other we need to have a model of who each of us are.

To aid this, in society, we are required to perform roles. Whether as a parent, a teacher, an employee or a boss. Each of these roles requires that we conform to certain standards of behaviour to facilitate communication between ourselves.

Control of self is vitally important to these communications.

The high availability of data available to humans undermines human-to-human communication channels by providing new routes to undermining our control of self.

Rasmussen and Williams (2006) is still one of the most important references on Gaussian process models. It is available freely online.

## Bayesian Inference by Rejection Sampling 

One view of Bayesian inference is to assume we are given a mechanism for generating samples, where we assume that mechanism is representing on accurate view on the way we believe the world works.

This mechanism is known as our prior belief.

We combine our prior belief with our observations of the real world by discarding all those samples that are inconsistent with our prior. The likelihood defines mathematically what we mean by inconsistent with the prior. The higher the noise level in the likelihood, the looser the notion of consistent.

The samples that remain are considered to be samples from the posterior.

This approach to Bayesian inference is closely related to two sampling techniques known as rejection sampling and importance sampling. It is realized in practice in an approach known as approximate Bayesian computation (ABC) or likelihood-free inference.

In practice, the algorithm is often too slow to be practical, because most samples will be inconsistent with the data and as a result the mechanism has to be operated many times to obtain a few posterior samples.

However, in the Gaussian process case, when the likelihood also assumes Gaussian noise, we can operate this mechanism mathematically, and obtain the posterior density analytically. This is the benefit of Gaussian processes.

import pods
from ipywidgets import IntSlider
pods.notebook.display_plots('gp_rejection_sample{sample:0>3}.png',
directory='../slides/diagrams/gp',
sample=IntSlider(1,1,5,1))

{

• We want to protect a user from a linkage attack...

...while still performing inference over the whole group.

• Making a dataset private is more than just erasing names.

Narayanan and Felten (2014);Ohm (2010);(???)

• To achieve a level of privacy one needs to add randomness to the data.

• This is a fundamental feature of differential privacy.

See The Algorithmic Foundations of Differential Privacy by Dwork and Roth (2014) for a rigorous introduction to the framework.

## Differential Privacy for Gaussian Processes 

We have a dataset in which the inputs, $\inputMatrix$, are public. The outputs, $\dataVector$, we want to keep private.

Data consists of the heights and weights of 287 women from a census of the !Kung (Howell 1967)

Hall, Rinaldo, and Wasserman (2013) showed that one can ensure that a version of $\mappingFunction$, function $\tilde{f}$ is (ε, δ)-differentially private by adding a scaled sample from a GP prior.

• We applied this method to the GP posterior.

• The covariance of the posterior only depends on the inputs, $\inputMatrix$. So we can compute this without applying DP.

• The mean function, $\mappingFunction_D(\inputVector_*)$, does depend on $\dataVector$.
$$\mappingFunction_D(\inputVector_*) = \kernelVector(x_*, \inputMatrix) \kernelMatrix^{-1} \dataVector$$

• We are interested in finding

$$|| \mappingFunction_D(\inputVector_*) - \mappingFunction_{D^\prime}(\inputVector_*) ||_H^2$$

...how much the mean function (in RKHS) can change due to a change in $\dataVector$.

• Using the representer theorem, we can write
$$|| \mappingFunction_D(\inputVector_*) - \mappingFunction_{D^\prime}(\inputVector_*) ||_H^2$$

as:

$$\Big|\Big|\sum_{i=1}^\numData \kernelScalar(\inputVector_*,\inputVector_i) \left(\alpha_i - \alpha^\prime_i\right)\Big|\Big|_H^2$$

where $\boldsymbol{\alpha} - \boldsymbol{\alpha}^\prime = \kernelMatrix^{-1} \left(\dataVector - \dataVector^\prime \right)$

• L2 Norm

$$\Big|\Big|\sum_{i=1}^\numData \kernelScalar(\inputVector_*,\inputVector_i) \left(\alpha_i - \alpha^\prime_i\right)\Big|\Big|_H^2$$

where $\boldsymbol{\alpha} - \boldsymbol{\alpha}^\prime = \kernelMatrix^{-1} \left(\dataVector - \dataVector^\prime \right)$

• We constrain the kernel: $-1\leq \kernelScalar(\cdot,\cdot) \leq 1$ and we only allow one element of $\dataVector$ and $\dataVector^\prime$ to differ (by at most d).

• So only one column of $\kernelMatrix^{-1}$ will be involved in the change of mean (which we are summing over).

• The distance above can then be shown to be no greater than $d\;||\kernelMatrix^{-1}||_\infty$

This 'works' in that it allows DP predictions...but to avoid too much noise, the value of ε is too large (here it is 100)

EQ kernel, $\lengthScale = 25$ years, Δ = 100cm

Using sparse methods (i.e. inducing inputs) can help reduce the sensitivity a little. We'll see more on this later.

## Cloaking 

• So far we've made the whole posterior mean function private...

...what if we just concentrate on making particular predictions private?

• Standard approach: sample the noise is from the GP's prior.

• Not necessarily the most 'efficient' covariance to use.

Left: Function change. Right: test point change

Left: Function change. Right: test point change

Left: Function change. Right: test point change

Left: Function change. Right: test point change

Left: Function change. Right: test point change

Left: Function change. Right: test point change

• Hall et al. (2013) also presented a bound on vectors.

• Find a bound (Δ) on the scale of the output change, in term of its Mahalanobis distance (wrt the added noise covariance).

$$\sup_{D \sim {D^\prime}} ||\mathbf{M}^{-1/2} (\dataVector_* - \dataVector_{*}^\prime)||_2 \leq \Delta$$

• We use this to scale the noise we add:

$$\frac{\text{c}(\delta)\Delta}{\varepsilon} \mathcal{N}_d(0,\mathbf{M})$$

We get to pick M

• Intuitively we want to construct M so that it has greatest covariance in those directions most affected by changes in training points, so that it will be most able to mask those changes.

• The change in posterior mean predictions is,

$$\dataVector_* - \dataVector^\prime_* = \kernelMatrix_{*f} \kernelMatrix^{-1} (\dataVector-\dataVector^\prime)$$

• Effect of perturbing each training point on each test point is represented in the cloaking matrix,

$$\mathbf{C} = \kernelMatrix_{*f} \kernelMatrix^{-1}$$

• We assume we are protecting only one training input's change, by at most d.

• So $\dataVector-\dataVector^\prime$ will be all zeros except for one element, i.
• So the change in test points will be (at most)

$$\dataVector_*^\prime - \dataVector_* = d \mathbf{C}_{:i}$$

• We're able to write the earlier bound as,

d2supiciM−1ci ≤ Δ

where ci ≜ C:i

• Dealing with d elsewhere and setting Δ = 1 (thus 0 ≤ ciM−1ci ≤ 1) and minimise log|M| (minimises the partial entropy).

• Using Lagrange multipliers and gradient descent, we find
M = ∑iλicici

The noise added by this method is now practical.

EQ kernel, l = 25 years, Δ = 100cm, ε = 1

It also has some interesting features;

• Less noise where data is concentrated
• Least noise far from any data
• Most noise just outside data
• Tested on 4D citibike dataset (predicting journey durations from start/finish station locations).

• The method appears to achieve lower noise than binning alternatives (for reasonable ε).

• Outliers poorly predicted.

• Too much noise around data 'edges'.

• Use inducing inputs to reduce the sensitivity to these outliers.

• For 1D !Kung, RMSE improved from 15.0 ± 2.0cm to 11.1 ± 0.8cm

Use Age and Weight to predict Height

• For 2D !Kung, RMSE improved from 22.8 ± 1.9cm to 8.8 ± 0.6cm

Note that the uncertainty across cross-validation runs smaller. 2D version benefits from data's 1D manifold.

• Summary We have developed an improved method for performing differentially private regression.

• Future work Multiple outputs, GP classification, DP Optimising hyperparameters, Making the inputs private.

• Thanks Funders: EPSRC; Colleagues: Michael T. Smith, Mauricio, Max.

• Recruiting Deep Probabilistic Models: 2 year postdoc (tinyurl.com/shefpostdoc)

• Images used: BostonGlobe: Mass Mutual, Weld. Harvard: Sweeney. Rich on flickr: Sheffield skyline.

# References

Ananthanarayanan, Rajagopal, Steven K. Esser, Horst D. Simon, and Dharmendra S. Modha. 2009. “The Cat Is Out of the Bag: Cortical Simulations with 109 Neurons, 1013 Synapses.” In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis - Sc ’09. doi:10.1145/1654059.1654124.

Dwork, Cynthia, and Aaron Roth. 2014. “The Algorithmic Foundations of Differential Privacy.” Foundations and Trends in Theoretical Computer Science 9 (3-4): 211–407.

Hall, Rob, Alessandro Rinaldo, and Larry Wasserman. 2013. “Differential Privacy for Functions and Functional Data.” Journal of Machine Learning Research 14 (1): 703–27. http://jmlr.csail.mit.edu/papers/v14/hall13a.html.

Howell, Nancy. 1967. “Data from a Partial Census of the !Kung San, Dobe. 1967-1969.” https://public.tableau.com/profile/john.marriott#!/vizhome/kung-san/Attributes.

Narayanan, Arvind, and Edward W. Felten. 2014. “No Silver Bullet: De-Identification Still Doesn’t Work.”

Ohm, Paul. 2010. “Broken Promises of Privacy: Responding to the Surprising Failure of Anonymization.” UCLA Law Review 57: 1701.

Rasmussen, Carl Edward, and Christopher K. I. Williams. 2006. Gaussian Processes for Machine Learning. Cambridge, MA: mit.

Reed, Charlotte, and Nathaniel I. Durlach. 1998. “Note on Information Transfer Rates in Human Communication.” Presence Teleoperators & Virtual Environments 7 (5): 509–18. doi:10.1162/105474698565893.

1. the challenge of understanding what information pertains to is known as knowledge representation.