Machine Learning and the Data Science Process

2017-03-30

OxWaSP, Berlin

ARM Data Science Conference

Neil D. Lawrence

Amazon and University of Sheffield

@lawrennd inverseprobability.com

The Gartner Hype Cycle

[Figure: Gartner Hype Cycle]

Machine Learning

\[ \text{data} + \text{model} \xrightarrow{\text{compute}} \text{prediction}\]
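As a minimal sketch of the formula above, the snippet below treats a handful of toy points as the data, a straight line as the model, and a least-squares fit as the compute step that yields a prediction. The values and the function names `fit_line` and `predict` are invented for illustration; they are not from the talk.

```python
def fit_line(xs, ys):
    """Ordinary least squares for the model y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(params, x):
    """Apply the fitted line to a new input."""
    slope, intercept = params
    return slope * x + intercept


# data: toy values, assumed for illustration only
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

# model + compute: fit the line to the data
params = fit_line(xs, ys)

# prediction: evaluate the fitted model at an unseen input
prediction = predict(params, 4.0)
```

Changing the data changes the prediction even when the model and the compute stay fixed, which is one way to read the emphasis on data in the slides that follow.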

Machine Learning

Normal ML (& stats?) focus: model
In the real world, we need more focus on: data
This is the motivation for data science

Background: Big Data

The pervasiveness of data brings forward particular challenges.
Emerging themes: Devolving compute onto device.
Data preprocessing: Internet of Intelligence.

Embodiment Factors


	computer	human
compute	\[\approx 100 \text{ gigaflops}\]	\[\approx 16 \text{ petaflops}\]
communicate	\[1 \text{ gigabit/s}\]	\[100 \text{ bit/s}\]
(compute/communicate)	\[10^{4}\]	\[10^{14}\]

See “Living Together: Mind and Machine Intelligence” Lawrence (2017a)

Evolved Relationship

Effects

This phenomenon has already revolutionised biology.
Large scale data acquisition and distribution.
What does it mean for IoT?

Internet of People

Fog computing: barrier between cloud and device blurring.
- Computing on the Edge
Stuxnet: Adversarial and Security implications for intelligent systems.
Complex feedback between algorithm and implementation

Challenges

Paradoxes of the Data Society
Quantifying the Value of Data
Privacy, loss of control, marginalisation
Deploying AI in the real world.

Challenges

~~Paradoxes of the Data Society~~
Quantifying the Value of Data
~~Privacy, loss of control, marginalisation~~
Deploying AI in the real world.

Quantifying the Value of Data

There’s a sea of data, but most of it is undrinkable

We require data-desalination before it can be consumed!

Data — Quotes from NIPS Workshop on ML for Healthcare

“90% of our time is spent on validation and integration” (Leo Anthony Celi)
“The Dirty Work We Don’t Want to Think About” (Eric Xing)
“Voodoo to get it decompressed” (Francisco Giminez)
In health care clinicians collect the data and often control the direction of research through guardianship of data.

Value

How do we measure value in the data economy?
How do we encourage data workers in curation and management?
- Incentivization for sharing and production.
- Quantifying the value in the contribution of each actor.

Data Readiness Levels

Data Readiness Levels (Lawrence, 2017b): https://arxiv.org/pdf/1705.02245.pdf

Three Grades of Data Readiness

Grade C - accessibility
- Transition: data becomes electronically available
Grade B - validity
- Transition: pose a question to the data.
Grade A - usability

Accessibility: Grade C

Hearsay data.
Availability: is it actually being recorded?
Are there privacy or legal constraints on the accessibility of the recorded data? Have ethical constraints been alleviated?
Format: log books, PDF …
Are there limitations on access due to topology (e.g. it’s distributed across a number of devices)?
At the end of Grade C data is ready to be loaded into analysis software (R, SPSS, Matlab, Python, Mathematica)

Validity: Grade B

faithfulness and representation
visualisations.
exploratory data analysis
noise characterisation.

Grade B Checks

Missing values.
Schema alignment, record linkage, data fusion
Example:
- Was a column or columns accidentally perturbed (e.g. through a sort operation that missed one or more columns)? Or was a gene name accidentally converted to a date?
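Two of these checks can be sketched in a few lines. The snippet below counts missing values per column and flags gene-column entries that look like dates, as can happen when a spreadsheet silently renders a symbol such as SEPT2 as 2-Sep. The function name `grade_b_checks`, the set of missing-value markers, the date pattern and the sample rows are all assumptions made for illustration, not part of the data readiness proposal.

```python
import re

# Values such as "1-Mar" or "2-Sep": the form a gene symbol may take after
# an accidental conversion to a date (pattern assumed for illustration).
DATE_LIKE = re.compile(
    r"^\d{1,2}-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)$"
)

# Strings treated as missing values (an assumed, non-exhaustive set).
MISSING = {"", "NA", "NaN", "null"}


def grade_b_checks(rows, gene_column):
    """Return (missing value counts per column, date-like gene entries)."""
    missing = {}
    suspicious = []
    for row in rows:
        for key, value in row.items():
            if value is None or str(value).strip() in MISSING:
                missing[key] = missing.get(key, 0) + 1
        gene = row.get(gene_column)
        if gene is not None and DATE_LIKE.match(str(gene).strip()):
            suspicious.append(gene)
    return missing, suspicious


# Hypothetical records, as they might look after loading a CSV file.
rows = [
    {"gene": "TP53", "expression": "1.2"},
    {"gene": "2-Sep", "expression": ""},
    {"gene": "1-Mar", "expression": "NA"},
]

missing_counts, suspicious_genes = grade_b_checks(rows, "gene")
```

Checks like these only surface candidates. Deciding whether a flagged value is really corrupted still needs someone who knows how the data was collected.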

Grade B Transition

At the end of Grade B, the data is ready for a task or question to be defined.
Compare with classical statistics:
- Classically: the question comes first, the data comes later.
- Today: the data comes first, the question comes later.

Data First

In a data-first company, teams own their data quality issues at least as far as Grade B1.

Usability: Grade A

The usability of data
- Grade A is about data in context.
Consider appropriateness of a given data set to answer a particular question or to be subject to a particular analysis.

Recursive Effects

Grade A may also require:
- data integration
- active collection of new data.
- rebalancing of data to ensure fairness
- annotation of data by human experts
- revisiting the collection (and running through the appropriate stages again)
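Rebalancing can take many forms. One simple possibility, sketched below, is random oversampling: records from smaller groups are duplicated at random until every group is as large as the largest one. The function name `oversample_to_balance`, the `group` field and the sample records are hypothetical, and equal group sizes are at best one ingredient of fairness, not a guarantee of it.

```python
import random
from collections import Counter, defaultdict


def oversample_to_balance(records, group_key, seed=0):
    """Duplicate records at random so every group matches the largest group."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    groups = defaultdict(list)
    for record in records:
        groups[record[group_key]].append(record)

    target = max(len(members) for members in groups.values())
    balanced = []
    for members in groups.values():
        balanced.extend(members)
        # Draw extra records with replacement until the group reaches target.
        extra = target - len(members)
        balanced.extend(rng.choice(members) for _ in range(extra))
    return balanced


# Hypothetical records: group "a" is three times as common as group "b".
records = [
    {"id": 1, "group": "a"},
    {"id": 2, "group": "a"},
    {"id": 3, "group": "a"},
    {"id": 4, "group": "b"},
]

balanced = oversample_to_balance(records, "group")
group_sizes = Counter(record["group"] for record in balanced)
```

Oversampling only repeats what was already collected, which is why the list above also includes active collection of new data and revisiting the collection itself.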

A1 Data

A1 data is ready to make available for challenges or AutoML platforms.

Contribute!

http://data-readiness.org

Also …

Encourage greater interaction between application domains and data scientists
Encourage visualization of data

Machine Learning Systems Design

Fragility of AI Systems

AI systems are built componentwise from ML capabilities.
Each capability is independently constructed and verified.
- Pedestrian detection
- Road line detection
Important for verification purposes.

Rapid Reimplementation

Whole systems are being deployed.
But they change their environment.
They experience evolved adversarial behaviour.

Turnaround And Update

There is a massive need for turnaround and update.
A redeploy of the entire system.
- This involves changing the way we design and deploy.
Early Example: Stuxnet.

Peppercorns

A new name for system failures which aren’t bugs.
Difference between finding a fly in your soup vs a peppercorn in your soup.

Conclusion

Data science offers a great deal of promise for personalized health
There are challenges and pitfalls
It is incumbent on us to avoid them

Many solutions rely on education and awareness

There are particular challenges around the Internet of Intelligence.

Thanks!

twitter: @lawrennd
blog: http://inverseprobability.com

Lawrence, N.D., 2017a. Living together: Mind and machine intelligence. arXiv.

Lawrence, N.D., 2017b. Data readiness levels. arXiv.
