# Personalized Health: Challenges in Data Science

There are three types of lies: lies, damned lies and statistics

Benjamin Disraeli 1804-1881

### Mathematical Statistics

• ‘Founded’ by Karl Pearson (1857-1936)

There are three types of lies: lies, damned lies and ‘big data’

Neil Lawrence 1972-?

### ‘Mathematical Data Science’

• ‘Founded’ by ? (?-?)

### Background: Big Data

• The pervasiveness of data brings forward particular challenges.

• Those challenges are most sharply in focus for personalized health.

• Particular opportunities, in challenging areas such as mental health.

### “Embodiment Factors”

| | Machine | Human |
|---|---|---|
| compute | ~10 gigaflops | ~1000 teraflops? |
| communicate | ~1 gigabit/s | ~100 bit/s |
| embodiment (compute/communicate) | 10 | ~$$10^{13}$$ |
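The embodiment factor is just the ratio of compute rate to communication rate. A minimal sketch of the arithmetic, using the order-of-magnitude figures above (the variable and function names are illustrative, not from the talk):

```python
# Rough, order-of-magnitude figures quoted on the slide.
machine_compute = 10e9        # ~10 gigaflops
machine_communicate = 1e9     # ~1 gigabit/s
human_compute = 1000e12       # ~1000 teraflops (a guess, hence the "?")
human_communicate = 100.0     # ~100 bit/s


def embodiment_factor(compute, communicate):
    """Ratio of compute rate to communication rate."""
    return compute / communicate


machine_factor = embodiment_factor(machine_compute, machine_communicate)
human_factor = embodiment_factor(human_compute, human_communicate)
print(f"machine: {machine_factor:.0e}, human: {human_factor:.0e}")
```

The gap of roughly twelve orders of magnitude is the point of the slide: a human can compute far more than they can communicate.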

### Effects

• This phenomenon has already revolutionised biology.

• Large scale data acquisition and distribution.

• Transcriptomics, genomics, epigenomics, ‘rich phenomics’.

• Great promise for personalized health.

### Societal Effects

• Automated decision making within the computer based only on the data.

• A requirement to better understand our own subjective biases to ensure that the human to computer interface formulates the correct conclusions from the data.

• Particularly important where treatments are being prescribed.

• But what is a treatment in the modern era: interventions could be far more subtle.

### Societal Effects

• Shift in dynamic from the direct pathway between human and data to indirect pathway between human and data via the computer

• This change of dynamics gives us the modern and emerging domain of data science

### Challenges

1. Paradoxes of the Data Society

2. Quantifying the Value of Data

3. Privacy, loss of control, marginalization

• Able to quantify to a greater and greater degree the actions of individuals

• But less able to characterize society

• As we measure more, we understand less

### What?

• Perhaps greater preponderance of data is making society itself more complex

• Therefore traditional approaches to measurement are failing

• Curate’s egg of a society: it is only ‘measured in parts’

### Wood or Tree

• Can either see a wood or a tree.

### Examples

• Election polls (UK 2015 elections, EU referendum, US 2016 elections)

• Clinical trials vs personalized medicine: obtaining statistical power where interventions are subtle (e.g. social media)

### Also need

• More classical statistics!
• Like the ‘paperless office’
• A better characterization of human (see later)

• Larger studies (100,000 Genomes)
• Combined with complex models: algorithmic challenges

### Quantifying the Value of Data

There’s a sea of data, but most of it is undrinkable

We require data-desalination before it can be consumed!

### Data

• 90% of our time is spent on validation and integration (Leo Anthony Celi)
• “The Dirty Work We Don’t Want to Think About” (Eric Xing)
• “Voodoo to get it decompressed” (Francisco Giminez?)
• In health care clinicians collect the data and often control the direction of research through guardianship of data.

### Value

• How do we measure value in the data economy?
• How do we encourage data workers: curation and management
• Incentivization for sharing and production.
• Quantifying the value in the contribution of each actor.
$$p_0$$ $$s_0$$ $$a_0$$ $$r_0$$

### Credit Allocation

• Direct work on data generates an enormous amount of ‘value’ in the data economy, but this is unaccounted for in the economy

• Hard because data is difficult to ‘embody’

• Value of shared data: Wellcome Trust 2010 Joint Statement (from the “Foggy Bottom” meeting)

### Solutions

• Encourage greater interaction between application domains and data scientists

• Encourage visualization of data

• Implications for incentivization schemes

### Privacy, Loss of Control and Marginalization

• Society is becoming harder to monitor

• Individual is becoming easier to monitor

### Hate Speech or Political Dissent?

• Social media monitoring for ‘hate speech’ can easily be turned to monitoring political dissent

### Marketing

• Can become more sinister when the target of the marketing is well understood and the (digital) environment of the target is also well controlled

### Free Will

• What does it mean if a computer can predict our individual behavior better than we ourselves can?

### Discrimination

• Potential for explicit and implicit discrimination on the basis of race, religion, sexuality, health status

• All prohibited under European law, but can pass unawares, or be implicit

• GDPR: General Data Protection Regulation

• GDPR: Good Data Practice Rules

### Marginalization

• Credit scoring, insurance, medical treatment
• What if certain sectors of society are under-represented in our analysis?
• What if Silicon Valley develops everything for us?

### Amelioration

• Work to ensure individual retains control of their own data
• We accept privacy in our real lives; we need to accept it in our digital lives
• Control of persona and ability to project
• Need better technological solutions: trust and algorithms.

### Key Object

• Covariance function, $$\mathbf{K}$$

• Determines properties of samples.

• Function of $${\bf X}$$, $k_{i,j} = k({\bf x}_i, {\bf x}_j)$

### Linear Algebra

• Posterior mean

$f_D({\bf x}_*) = \mathbf{k}({\bf x}_*, {\bf X}) \mathbf{K}^{-1} \mathbf{y}$

• Posterior covariance $\mathbf{C}_* = \mathbf{K}_{*,*} - \mathbf{K}_{*,\mathbf{f}} \mathbf{K}^{-1} \mathbf{K}_{\mathbf{f}, *}$

### Linear Algebra

• Posterior mean

$f_D({\bf x}_*) = \mathbf{k}({\bf x}_*, {\bf X}) \boldsymbol{\alpha}$

where $$\boldsymbol{\alpha} = \mathbf{K}^{-1} \mathbf{y}$$

• Posterior covariance $\mathbf{C}_* = \mathbf{K}_{*,*} - \mathbf{K}_{*,\mathbf{f}} \mathbf{K}^{-1} \mathbf{K}_{\mathbf{f}, *}$
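The two posterior formulas can be sketched in a few lines of NumPy. This is an illustrative sketch, not the implementation behind the slides: the exponentiated quadratic kernel, the hyperparameter values, the toy data, and the assumption that $$\mathbf{K}$$ includes the observation noise variance on its diagonal are all choices made here.

```python
import numpy as np


def eq_kernel(A, B, lengthscale=1.0):
    """Exponentiated quadratic covariance between the rows of A and B."""
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq_dist / lengthscale**2)


def gp_posterior(X, y, X_star, noise_var=0.1, lengthscale=1.0):
    """Posterior mean and covariance at X_star (noise_var on K's diagonal is assumed)."""
    K = eq_kernel(X, X, lengthscale) + noise_var * np.eye(len(X))
    K_sf = eq_kernel(X_star, X, lengthscale)
    K_ss = eq_kernel(X_star, X_star, lengthscale)
    L = np.linalg.cholesky(K)                             # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # alpha = K^{-1} y
    mean = K_sf @ alpha                                   # k(x_*, X) alpha
    V = np.linalg.solve(L, K_sf.T)                        # L^{-1} K_{f,*}
    cov = K_ss - V.T @ V                                  # K_** - K_*f K^{-1} K_f*
    return mean, cov


# Toy data, purely for illustration.
X = np.linspace(0.0, 10.0, 8)[:, None]
y = np.sin(X[:, 0])
X_star = np.linspace(0.0, 10.0, 5)[:, None]
mean, cov = gp_posterior(X, y, X_star)
```

Note that `cov` is computed without touching `y`, which is the property the differential privacy slides rely on.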

### Differential Privacy, summary

• We want to protect a user from a linkage attack…

…while still performing inference over the whole group.

• Making a dataset private is more than just erasing names.

• To achieve a level of privacy one needs to add randomness to the data.

• This is a fundamental feature of differential privacy.

See The Algorithmic Foundations of Differential Privacy by Dwork and Roth for a rigorous introduction to the framework.

### Differential Privacy for Gaussian Processes

We have a dataset in which the inputs, $$\mathbf{X}$$, are public. The outputs, $$\mathbf{y}$$, we want to keep private.

The data consists of the heights and weights of 287 women from a census of the !Kung.

### Vectors and Functions

Hall et al. (2013) showed that one can release a version $$\tilde{f}$$ of a function $$f$$ that is $$(\varepsilon, \delta)$$-differentially private by adding a scaled sample from a GP prior.

### Applied to Gaussian Processes

• We applied this method to the GP posterior.

• The covariance of the posterior only depends on the inputs, $$\mathbf{X}$$. So we can compute this without applying DP.

• The mean function, $$f_D(\mathbf{x_*})$$, does depend on $$\mathbf{y}$$: $f_D(\mathbf{x_*}) = \mathbf{k}(\mathbf{x_*}, \mathbf{X}) \mathbf{K}^{-1} \mathbf{y}$

• We are interested in finding

$|| f_D(\mathbf{x_*}) - f_{D^\prime}(\mathbf{x_*}) ||_H^2$

…how much the mean function (in RKHS) can change due to a change in $$\mathbf{y}$$.

### Applied to Gaussian Processes

• Using the representer theorem, we can write $|| f_D(\mathbf{x_*}) - f_{D^\prime}(\mathbf{x_*}) ||_H^2$

as:

$\Big|\Big|\sum_{i=1}^n k(\mathbf{x_*},\mathbf{x}_i) \left(\alpha_i - \alpha^\prime_i\right)\Big|\Big|_H^2$

where $$\boldsymbol{\alpha} - \boldsymbol{\alpha}^\prime = \mathbf{K}^{-1} \left(\mathbf{y} - \mathbf{y}^\prime \right)$$

• L2 Norm

$\Big|\Big|\sum_{i=1}^n k(\mathbf{x_*},\mathbf{x}_i) \left(\alpha_i - \alpha^\prime_i\right)\Big|\Big|_H^2$

where $$\boldsymbol{\alpha} - \boldsymbol{\alpha}^\prime = \mathbf{K}^{-1} \left(\mathbf{y} - \mathbf{y}^\prime \right)$$

• We constrain the kernel: $$-1\leq k(\cdot,\cdot) \leq 1$$ and we only allow one element of $$\mathbf{y}$$ and $$\mathbf{y}'$$ to differ (by at most $$d$$).

• So only one column of $$\mathbf{K}^{-1}$$ will be involved in the change of mean (which we are summing over).

• The distance above can then be shown to be no greater than $$d\;||\mathbf{K}^{-1}||_\infty$$
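A small numerical check of the argument in these bullets: perturb each training output by $$d$$ in turn and compare the largest resulting change in the predicted mean with $$d\;||\mathbf{K}^{-1}||_\infty$$. The inputs, lengthscale and noise variance below are made up for illustration, and $$\mathbf{K}$$ is assumed to include the noise variance on its diagonal. The check covers the pointwise change in the mean at a grid of test inputs, not the RKHS norm itself.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0.0, 10.0, size=12))[:, None]   # hypothetical 1D inputs
X_star = np.linspace(0.0, 10.0, 50)[:, None]            # test inputs


def eq_kernel(A, B, lengthscale=2.0):
    """EQ kernel for 1D inputs; values lie in (0, 1], so |k| <= 1 holds."""
    return np.exp(-0.5 * (A - B.T) ** 2 / lengthscale**2)


K = eq_kernel(X, X) + 0.5 * np.eye(len(X))   # noise variance 0.5 (assumed)
K_inv = np.linalg.inv(K)

d = 1.0                                          # largest allowed change in one y_i
bound = d * np.linalg.norm(K_inv, ord=np.inf)    # max absolute row sum of K^{-1}

# Column i of K_*f K^{-1} is the change in the mean when y_i moves by one unit.
changes = np.abs(d * eq_kernel(X_star, X) @ K_inv)
worst = changes.max()
print(f"worst observed change {worst:.3f}, bound {bound:.3f}")
```

The bound is usually loose, which is one reason the noise needed by this approach is large.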

### Applied to Gaussian Processes

This ‘works’ in that it allows DP predictions… but to avoid adding too much noise, the value of $$\varepsilon$$ has to be too large (here it is 100).

EQ kernel, $$\ell = 25$$ years, $$\Delta=100$$cm

### Inducing Inputs

Using sparse methods (i.e. inducing inputs) can help reduce the sensitivity a little. We’ll see more on this later.

### Cloaking

• So far we’ve made the whole posterior mean function private…

…what if we just concentrate on making particular predictions private?

### Effect of perturbation

• Standard approach: sample the noise from the GP’s prior.

• Not necessarily the most ‘efficient’ covariance to use.

### Cloaking

Left: Function change. Right: test point change

### DP Vectors

• Hall et al. (2013) also presented a bound on vectors.

• Find a bound ($$\Delta$$) on the scale of the output change, in terms of its Mahalanobis distance (with respect to the added noise covariance).

$\sup_{D \sim {D'}} ||\mathbf{M}^{-1/2} (\mathbf{y}_* - \mathbf{y}_{*}')||_2 \leq \Delta$

• We use this to scale the noise we add:

$\frac{\text{c}(\delta)\Delta}{\varepsilon} \mathcal{N}_d(0,\mathbf{M})$

We get to pick $$\mathbf{M}$$
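A sketch of how the scaled noise might be drawn. The constant $$c(\delta) = \sqrt{2\log(2/\delta)}$$ is taken here as an assumption, following the form given by Hall et al. (2013); the covariance `M`, the predictions `y_star` and the privacy parameters are hypothetical values chosen only to make the example run.

```python
import numpy as np


def noise_scale(sens, epsilon, delta):
    """Scale c(delta) * Delta / epsilon, with c(delta) = sqrt(2 log(2 / delta)) assumed."""
    c = np.sqrt(2.0 * np.log(2.0 / delta))
    return c * sens / epsilon


rng = np.random.default_rng(0)
M = np.array([[1.0, 0.5],
              [0.5, 2.0]])              # noise covariance we get to pick (hypothetical)
y_star = np.array([150.0, 160.0])       # non-private predictions (hypothetical)

scale = noise_scale(sens=1.0, epsilon=1.0, delta=0.01)
noise = scale * rng.multivariate_normal(np.zeros(len(y_star)), M)
y_private = y_star + noise              # the released, perturbed predictions
```

Smaller $$\varepsilon$$ or $$\delta$$ means a larger scale, so stronger privacy costs more noise.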

### Cloaking

• Intuitively we want to construct $$\mathbf{M}$$ so that it has greatest covariance in those directions most affected by changes in training points, so that it will be most able to mask those changes.

• The change in posterior mean predictions is,

$\mathbf{y}_* - \mathbf{y}'_* = \mathbf{K}_{*f} \mathbf{K}^{-1} (\mathbf{y}-\mathbf{y}')$

• Effect of perturbing each training point on each test point is represented in the cloaking matrix,

$\mathbf{C} = \mathbf{K}_{*f} \mathbf{K}^{-1}$
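The cloaking matrix can be computed directly from the kernel matrices. The sketch below uses made-up 1D inputs and an EQ kernel, and assumes $$\mathbf{K}$$ includes a noise variance on its diagonal; it also confirms that changing one training output moves the test predictions along the corresponding column of $$\mathbf{C}$$.

```python
import numpy as np


def eq_kernel(A, B, lengthscale=2.0):
    """EQ kernel for 1D inputs stored as column vectors."""
    return np.exp(-0.5 * (A - B.T) ** 2 / lengthscale**2)


X = np.linspace(0.0, 10.0, 6)[:, None]           # training inputs (hypothetical)
X_star = np.array([[2.5], [5.0], [12.0]])        # test inputs (hypothetical)

K = eq_kernel(X, X) + 0.1 * np.eye(len(X))       # noise variance 0.1 (assumed)
K_sf = eq_kernel(X_star, X)
C = np.linalg.solve(K, K_sf.T).T                 # C = K_*f K^{-1}, using K = K^T

# Change a single training output and see how the test predictions move.
y = np.sin(X[:, 0])
y_prime = y.copy()
y_prime[3] += 0.7
change = C @ (y - y_prime)                       # y_* - y'_*
```

Each column of `C` says how strongly one training point influences each test prediction.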

### Cloaking

• We assume we are protecting a change in only one training point’s output, by at most $$d$$.

• So $$\mathbf{y}-\mathbf{y}'$$ will be all zeros except for one element, $$i$$.

• So the change in test points will be (at most)

$\mathbf{y}_* - \mathbf{y}_*' = d \mathbf{C}_{:i}$

• We’re able to write the earlier bound as,

$d^2 \sup_{i} \mathbf{c}_i^\top \mathbf{M}^{-1} \mathbf{c}_i \leq\Delta^2$

where $$\mathbf{c}_i \triangleq \mathbf{C}_{:i}$$

### Cloaking

• Dealing with $$d$$ elsewhere and setting $$\Delta = 1$$ (thus $$0 \leq \mathbf{c}_i^\top \mathbf{M}^{-1} \mathbf{c}_i \leq 1$$), we minimise $$\log |\mathbf{M}|$$ (which minimises the partial entropy).

• Using Lagrange multipliers and gradient descent, we find

$\mathbf{M} = \sum_i{\lambda_i \mathbf{c}_i \mathbf{c}_i^\top}$
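One way to find the weights $$\lambda_i$$ numerically is sketched below. Note that it swaps in a simple multiplicative fixed-point update (familiar from the optimal experimental design literature) for the gradient descent mentioned on the slide, followed by a final rescaling so that the constraint holds exactly. The function name, the iteration count and the random matrix standing in for $$\mathbf{C}$$ are all assumptions made for illustration.

```python
import numpy as np


def cloaking_covariance(C, iters=500):
    """Approximate M = sum_i lam_i c_i c_i^T with c_i^T M^{-1} c_i <= 1.

    C has shape (p, n): p test points, n training points; column i is c_i.
    Assumes n >= p and that C has full row rank, so M is invertible.
    """
    p, n = C.shape
    lam = np.full(n, p / n)                       # start from equal weights
    for _ in range(iters):
        M = (C * lam) @ C.T                       # sum_i lam_i c_i c_i^T
        q = np.einsum("ij,ij->j", C, np.linalg.solve(M, C))   # c_i^T M^{-1} c_i
        lam = lam * q                             # grow weights where q_i > 1
    M = (C * lam) @ C.T
    q = np.einsum("ij,ij->j", C, np.linalg.solve(M, C))
    s = q.max()                                   # rescale so max_i q_i = 1
    return s * M, s * lam


rng = np.random.default_rng(1)
C = rng.normal(size=(2, 6))                       # stand-in cloaking matrix
M, lam = cloaking_covariance(C)
q = np.einsum("ij,ij->j", C, np.linalg.solve(M, C))
print(q.round(3))                                 # every entry should be at most 1
```

Training points whose $$\lambda_i$$ shrinks towards zero are those whose influence is already masked by the noise needed for other points.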

### Cloaking: Results

The noise added by this method is now practical.

EQ kernel, $$\ell = 25$$ years, $$\Delta=100$$cm, $$\varepsilon=1$$

### Cloaking: Results

It also has some interesting features:

• Less noise where data is concentrated
• Least noise far from any data
• Most noise just outside data

### Citibike

• Tested on 4D citibike dataset (predicting journey durations from start/finish station locations).

• The method appears to achieve lower noise than binning alternatives (for reasonable $$\varepsilon$$).

### Citibike

lengthscale in degrees, values above, journey duration (in seconds)

### Cloaking and Inducing Inputs

• Outliers poorly predicted.

• Too much noise around data ‘edges’.

• Use inducing inputs to reduce the sensitivity to these outliers.

### Results

• For 1D !Kung, RMSE improved from $$15.0 \pm 2.0 \text{cm}$$ to $$11.1 \pm 0.8 \text{cm}$$

Use Age and Weight to predict Height

• For 2D !Kung, RMSE improved from $$22.8 \pm 1.9 \text{cm}$$ to $$8.8 \pm 0.6 \text{cm}$$

Note that the uncertainty across cross-validation runs is smaller. The 2D version benefits from the data’s 1D manifold.

### Awareness

• Need to increase awareness of the pitfalls among researchers
• Need to ensure that technological solutions are being delivered not merely for the few (#FirstWorldProblems)
• Address a wider set of challenges that the greater part of the world’s population is facing

### Conclusion

• Data science offers a great deal of promise for personalized health
• There are challenges and pitfalls
• It is incumbent on us to avoid them
• Need new ways of thinking!
• Mathematical Data Science

Many solutions rely on education and awareness