@lawrennd
inverseprobability.com
There are three types of lies: lies, damned lies and statistics
??
There are three types of lies: lies, damned lies and statistics
Benjamin Disraeli
There are three types of lies: lies, damned lies and statistics
Benjamin Disraeli 1804-1881
There are three types of lies: lies, damned lies and ‘big data’
Neil Lawrence 1972-?
Data is a pervasive phenomenon that affects all aspects of our activities
Data diffusiveness is both a challenge and an opportunity
| | computer | human |
|---|---|---|
| compute | ~10 gigaflops | ~1000 teraflops? |
| communicate | ~1 gigabit/s | ~100 bit/s |
| embodiment (compute/communicate) | 10 | ~\(10^{13}\) |
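The embodiment factor is just the ratio of the two rows above. A minimal sketch of the arithmetic, using the order-of-magnitude estimates from the table (these are assumptions, not measurements):

```python
# Order-of-magnitude estimates from the table above (assumed, not measured).
computer_compute = 10e9        # ~10 gigaflops
computer_communicate = 1e9     # ~1 gigabit/s
human_compute = 1000e12        # ~1000 teraflops (rough estimate for the brain)
human_communicate = 100.0      # ~100 bit/s (speech, reading)

# Embodiment factor: compute available per bit communicated.
print(computer_compute / computer_communicate)   # ~10
print(human_compute / human_communicate)         # ~1e13
```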
This phenomenon has already revolutionised biology.
Large scale data acquisition and distribution.
Transcriptomics, genomics, epigenomics, ‘rich phenomics’.
Automated decision making within the computer based only on the data.
A requirement to better understand our own subjective biases to ensure that the human to computer interface formulates the correct conclusions from the data.
Shift in dynamic from the direct pathway between human and data to an indirect pathway between human and data via the computer
This change of dynamics gives us the modern and emerging domain of data science
Paradoxes of the Data Society
Quantifying the Value of Data
Privacy, loss of control, marginalization
Able to quantify to a greater and greater degree the actions of individuals
But less able to characterize society
As we measure more, we understand less
Perhaps the greater preponderance of data is making society itself more complex
Therefore traditional approaches to measurement are failing
Curate’s egg of a society: it is only ‘measured in parts’
Election polls (UK 2015 elections, EU referendum, US 2016 elections)
Clinical trial and personalized medicine
Social media memes
Filter bubbles and echo chambers
\[ \mathbf{Y} = \begin{bmatrix} y_{1, 1} & y_{1, 2} &\dots & y_{1,p}\\ y_{2, 1} & y_{2, 2} &\dots & y_{2,p}\\ \vdots & \vdots &\dots & \vdots\\ y_{n, 1} & y_{n, 2} &\dots & y_{n,p} \end{bmatrix} \in \Re^{n\times p} \]
\[ \mathbf{Y} = \begin{bmatrix} \mathbf{y}^\top_{1, :} \\ \mathbf{y}^\top_{2, :} \\ \vdots \\ \mathbf{y}^\top_{n, :} \end{bmatrix} \in \Re^{n\times p} \]
\[ \mathbf{Y} = \begin{bmatrix} \mathbf{y}_{:, 1} & \mathbf{y}_{:, 2} & \dots & \mathbf{y}_{:, p} \end{bmatrix} \in \Re^{n\times p} \]
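A minimal sketch in NumPy of these three equivalent views of the same design matrix (the sizes n and p here are arbitrary examples):

```python
import numpy as np

n, p = 6, 3                      # n data points, p features (arbitrary example sizes)
Y = np.random.randn(n, p)        # data matrix Y in R^{n x p}

y_row = Y[1, :]    # y_{2,:}: one data point (a row of Y), length p
y_col = Y[:, 2]    # y_{:,3}: one feature across all data points (a column of Y), length n

assert Y.shape == (n, p) and y_row.shape == (p,) and y_col.shape == (n,)
```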
\[p(\mathbf{Y}|\boldsymbol{\theta}) = \prod_{i=1}^n p(\mathbf{y}_{i, :}|\boldsymbol{\theta})\]
\[\log p(\mathbf{Y}|\boldsymbol{\theta}) = \sum_{i=1}^n \log p(\mathbf{y}_{i, :}|\boldsymbol{\theta})\]
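As a sketch of this factorised likelihood, assuming an illustrative Gaussian model with parameters \(\boldsymbol{\theta} = (\boldsymbol{\mu}, \boldsymbol{\Sigma})\), the log-likelihood is just the sum of per-row log densities:

```python
import numpy as np
from scipy.stats import multivariate_normal

n, p = 100, 2
Y = np.random.randn(n, p)

# theta = (mu, Sigma), an illustrative parametric model
mu, Sigma = np.zeros(p), np.eye(p)

# log p(Y|theta) = sum_i log p(y_{i,:}|theta) under the independence assumption
log_lik = sum(multivariate_normal.logpdf(Y[i, :], mean=mu, cov=Sigma) for i in range(n))

# equivalently, vectorised over rows
assert np.isclose(log_lik, multivariate_normal.logpdf(Y, mean=mu, cov=Sigma).sum())
```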
Typically \(\boldsymbol{\theta} \in \Re^{\mathcal{O}(p)}\)
Consistency relies on a large-sample approximation of the KL divergence
\[ \text{KL}(P(\mathbf{Y})|| p(\mathbf{Y}|\boldsymbol{\theta}))\]
Minimization is equivalent to maximization of likelihood.
A foundation stone of classical statistics.
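A small numerical illustration of that equivalence, with an assumed unit-variance Gaussian family for \(p(\mathbf{y}|\boldsymbol{\theta})\): only the cross-entropy term of the KL divergence depends on \(\boldsymbol{\theta}\), so minimising the KL divergence and maximising the expected log-likelihood pick out the same parameter; with finite samples the expectation becomes the sample average, which is maximum likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.0, size=10_000)   # samples from the 'true' P(Y)

# Model family: unit-variance Gaussian with unknown mean theta.
# KL(P || p_theta) = E_P[log P(y)] - E_P[log p_theta(y)]; only the second term
# depends on theta, so minimising the KL maximises the expected log-likelihood,
# which the sample average approximates for large n.
thetas = np.linspace(0.0, 4.0, 401)
avg_log_lik = np.array([np.mean(-0.5 * (y - t) ** 2) for t in thetas])  # up to constants

theta_ml = thetas[np.argmax(avg_log_lik)]
print(theta_ml, y.mean())   # both close to the true mean 2.0
```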
For large \(p\) the parameters are badly determined.
Large \(p\) small \(n\) problem.
Easily dealt with through definition: simply factorize across features instead of data points.
\[p(\mathbf{Y}|\boldsymbol{\theta}) = \prod_{j=1}^p p(\mathbf{y}_{:, j}|\boldsymbol{\theta})\]
\[\log p(\mathbf{Y}|\boldsymbol{\theta}) = \sum_{j=1}^p \log p(\mathbf{y}_{:, j}|\boldsymbol{\theta})\]
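A sketch of this alternative factorisation, with an assumed model in which each column (feature) is an independent draw from an \(n\)-dimensional Gaussian whose covariance plays the role of \(\boldsymbol{\theta}\):

```python
import numpy as np
from scipy.stats import multivariate_normal

n, p = 10, 500                    # large p, small n regime
Y = np.random.randn(n, p)

# Assumed model: each column (feature) is an independent draw from an
# n-dimensional Gaussian with covariance K (the shared parameters theta).
K = np.eye(n) + 0.5 * np.ones((n, n))

# log p(Y|theta) = sum_j log p(y_{:,j}|theta), a product over features
log_lik = multivariate_normal.logpdf(Y.T, mean=np.zeros(n), cov=K).sum()
print(log_lik)
```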
Modern measurement deals with depth (many subjects) … or breadth (lots of detail about a subject).
Massively missing data.
Classical bias towards tables.
Streaming data.
\[ \mathbf{Y} = \begin{bmatrix} y_{1, 1} & y_{1, 2} &\dots & y_{1,p}\\ y_{2, 1} & y_{2, 2} &\dots & y_{2,p}\\ \vdots & \vdots &\dots & \vdots\\ y_{n, 1} & y_{n, 2} &\dots & y_{n,p} \end{bmatrix} \in \Re^{n\times p} \]
\[y_\mathbf{x}\]
where \(\mathbf{x}\) might include time, spatial location …
Streaming data. Joint model of past, \(\mathbf{y}\), and future, \(\mathbf{y}_*\)
\[p(\mathbf{y}, \mathbf{y}_*)\]
Prediction through:
\[p(\mathbf{y}_*|\mathbf{y})\]
\(n_*\) is the length of \(\mathbf{y}_*\).
Consistent if \(p(\mathbf{y}|n_*) = p(\mathbf{y})\)
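A minimal sketch of prediction by conditioning, assuming the joint distribution \(p(\mathbf{y}, \mathbf{y}_*)\) is Gaussian (as in a Gaussian process) with a covariance built from an assumed RBF kernel over time stamps:

```python
import numpy as np

def rbf(X1, X2, lengthscale=1.0, variance=1.0):
    """Assumed RBF covariance between input locations (e.g. time stamps)."""
    d2 = (X1[:, None] - X2[None, :]) ** 2
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

x = np.linspace(0, 5, 20)          # past input locations
x_star = np.linspace(5, 7, 8)      # future input locations
y = np.sin(x) + 0.1 * np.random.randn(len(x))

noise = 0.01
K = rbf(x, x) + noise * np.eye(len(x))    # covariance of the past
K_s = rbf(x, x_star)                      # cross-covariance past/future
K_ss = rbf(x_star, x_star)                # covariance of the future

# p(y_*|y): Gaussian conditioning of the joint p(y, y_*)
mean_star = K_s.T @ np.linalg.solve(K, y)
cov_star = K_ss - K_s.T @ np.linalg.solve(K, K_s)
```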
Design-time problem: the parametric bottleneck. Everything the data tell us about the future must be summarized by the posterior \[p(\boldsymbol{\theta} | \mathbf{y})\]
Streaming data could turn out to be more complex than we imagine.
Despite our large interconnected brains, we only have finite storage.
Similar for digital computers. So we need to assume that we can only store a finite number of things about the data \(\mathbf{y}\).
This pushes us back towards parametric models.
Choose to go a different way.
Introduce a set of auxiliary variables, \(\mathbf{u}\), which are \(m\) in length.
They are like “artificial data”.
Used to induce a distribution: \(q(\mathbf{u}|\mathbf{y})\)
Introduce a variable set which is finite dimensional. \[ p(\mathbf{y}_*|\mathbf{y}) \approx \int p(\mathbf{y}_*|\mathbf{u}) q(\mathbf{u}|\mathbf{y}) \text{d}\mathbf{u} \]
But dimensionality of \(\mathbf{u}\) can be changed to improve approximation.
\[p(\mathbf{y})\]
\[p(\mathbf{y})=\int p(\mathbf{y}|\mathbf{f})p(\mathbf{f})\text{d}\mathbf{f}\]
\[p(\mathbf{y})=\int \int p(\mathbf{y}|\mathbf{f})p(\mathbf{f}|\mathbf{u})\text{d}\mathbf{f}p(\mathbf{u})\text{d}\mathbf{u}\]
\[p(\mathbf{y}|\mathbf{u})=\int p(\mathbf{y}|\mathbf{f})p(\mathbf{f}|\mathbf{u})\text{d}\mathbf{f}\]
The conditional
\[p(\mathbf{y}|\mathbf{u})\]
now plays the role of the parametric likelihood
\[p(\mathbf{y}|\boldsymbol{\theta})\]
with \(\mathbf{u}\) standing in for the parameters \(\boldsymbol{\theta}\).
Replace true \(p(\mathbf{u}|\mathbf{y})\) with approximation \(q(\mathbf{u}|\mathbf{y})\).
Minimize KL divergence between approximation and truth.
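A sketch of the inducing-variable approximation under several assumptions: an RBF kernel, Gaussian noise, and one simple choice of \(q(\mathbf{u}|\mathbf{y})\) (the projected-process/DTC form); the inducing inputs and all numbers are made up for illustration. The key property is that \(m\) can be increased to improve the approximation without changing the model.

```python
import numpy as np

def rbf(X1, X2, lengthscale=1.0, variance=1.0):
    """Assumed RBF covariance between input locations."""
    d2 = (X1[:, None] - X2[None, :]) ** 2
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 10, 200))              # n = 200 observed inputs
y = np.sin(x) + 0.1 * rng.standard_normal(200)
x_star = np.linspace(0, 10, 50)                   # test inputs
z = np.linspace(0, 10, 15)                        # m = 15 inducing inputs (m << n)
sigma2 = 0.01                                     # assumed noise variance

Kuu = rbf(z, z) + 1e-8 * np.eye(len(z))
Kuf = rbf(z, x)
Kus = rbf(z, x_star)

# q(u|y): a Gaussian over the m inducing variables (projected-process/DTC choice)
Sigma = Kuu + Kuf @ Kuf.T / sigma2
m_u = Kuu @ np.linalg.solve(Sigma, Kuf @ y) / sigma2   # mean of q(u|y)
S_u = Kuu @ np.linalg.solve(Sigma, Kuu)                # covariance of q(u|y)

# p(y_*|y) ~= int p(y_*|u) q(u|y) du: mean and covariance of the approximation
A = np.linalg.solve(Kuu, Kus)                          # K_uu^{-1} K_us
mean_star = A.T @ m_u
cov_star = rbf(x_star, x_star) - A.T @ (Kuu - S_u) @ A + sigma2 * np.eye(len(x_star))
```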
A better characterization of the human (see later)
There’s a sea of data, but most of it is undrinkable
We require data-desalination before it can be consumed!
Direct work on data generates an enormous amount of ‘value’ in the data economy, but this is unaccounted for in the economy
Hard because data is difficult to ‘embody’
Value of shared data: Wellcome Trust 2010 Joint Statement (from the “Foggy Bottom” meeting)
Encourage greater interaction between application domains and data scientists
Encourage visualization of data
Adoption of ‘data readiness levels’
Implications for incentivization schemes
Society is becoming harder to monitor
Individual is becoming easier to monitor
Potential for explicit and implicit discrimination on the basis of race, religion, sexuality, health status
All prohibited under European law, but such discrimination can occur unnoticed, or be implicit
Control of persona and ability to project
Need better technological solutions: trust and algorithms.
Many solutions rely on education and awareness