Neil D. Lawrence
There are three types of lies: lies, damned lies and statistics
Benjamin Disraeli 1804-1881
There are three types of lies: lies, damned lies and 'big data'
Neil Lawrence 1972-?
\[ \text{data} + \text{model} \rightarrow \text{prediction}\]
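As a toy illustration (my own sketch, not from the talk), the "data + model → prediction" formula can be made concrete with a least-squares straight-line fit in pure Python: the observations are the data, the fitted line is the model, and evaluating the line at a new input is the prediction.

```python
# data: observations of a roughly linear relationship
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]

# model: slope and intercept fitted by ordinary least squares
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
intercept = mean_y - slope * mean_x

# prediction: apply the fitted model to an unseen input
prediction = slope * 5.0 + intercept
```

The numbers here are invented; the point is only the shape of the pipeline: data in, model fitted, prediction out.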
The same formula underlies two different domains:
Data Science: arises from the fact that we now capture data by happenstance.
Artificial Intelligence: emulation of human behaviour.
With AI: logic, robotics, computer vision, speech
With Data Science: databases, data mining, statistics, visualization
The pervasiveness of data brings forward particular challenges.
Those challenges are most sharply in focus for personalized health.
There are particular opportunities in challenging areas such as mental health.
| | computer | human |
|---|---|---|
| compute | ~10 gigaflops | ~1000 teraflops? |
| communicate | ~1 gigabit/s | ~100 bit/s |
| compute/communicate | 10 | ~10¹³ |
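The ratios in the last row follow directly from the rough figures above; a back-of-envelope check (order-of-magnitude assumptions only) confirms the arithmetic:

```python
# Rough orders of magnitude, as on the slide
computer_flops = 10e9    # ~10 gigaflops of compute
computer_bits = 1e9      # ~1 gigabit/s of communication
human_flops = 1000e12    # ~1000 teraflops of compute
human_bits = 100.0       # ~100 bit/s of communication

# Compute/communicate ratio for each
computer_ratio = computer_flops / computer_bits  # 10
human_ratio = human_flops / human_bits           # 1e13
```

The human compute/communicate ratio is about twelve orders of magnitude larger than the computer's: humans have vast compute locked behind a very narrow communication channel.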
This phenomenon has already revolutionised biology.
Large scale data acquisition and distribution.
Transcriptomics, genomics, epigenomics, 'rich phenomics'.
Great promise for personalized health.
Automated decision making within the computer based only on the data.
A requirement to better understand our own subjective biases to ensure that the human to computer interface formulates the correct conclusions from the data.
Particularly important where treatments are being prescribed.
But what is a treatment in the modern era: interventions could be far more subtle.
There is a shift in dynamic from the direct pathway between human and data to an indirect pathway between human and data via the computer.
This change of dynamics gives us the modern and emerging domain of data science.
Election polls (UK 2015 elections, EU referendum, US 2016 elections)
Clinical trials vs personalized medicine: Obtaining statistical power where interventions are subtle. e.g. social media
A better characterization of the human (see later)
The major cause of the software crisis is that the machines have become several orders of magnitude more powerful! To put it quite bluntly: as long as there were no machines, programming was no problem at all; when we had a few weak computers, programming became a mild problem, and now we have gigantic computers, programming has become an equally gigantic problem.
Edsger Dijkstra, The Humble Programmer
The major cause of the data crisis is that machines have become more interconnected than ever before. Data access is therefore cheap, but data quality is often poor. What we need is cheap high quality data. That implies that we develop processes for improving and verifying data quality that are efficient.
There would seem to be two ways for improving efficiency. Firstly, we should not duplicate work. Secondly, where possible we should automate work.
Me
There's a sea of data, but most of it is undrinkable
We require data-desalination before it can be consumed!
Grade C - accessibility
Grade B - validity
Grade A - usability