Challenges for Delivering Machine Learning in Health

Neil D. Lawrence


Deep Learning in Healthcare Summit 2017


Amazon and University of Sheffield


Gartner Hype Cycle

Background: Big Data

  • The pervasiveness of data brings forward particular challenges.

  • Those challenges are most sharply in focus for personalized health.

  • There are particular opportunities in challenging areas such as mental health.

Evolved Relationship


  • This phenomenon has already revolutionised biology.

  • Large scale data acquisition and distribution.

  • Transcriptomics, genomics, epigenomics, ‘rich phenomics’.

  • Great promise for personalized health.


  1. Paradoxes of the Data Society

  2. Quantifying the Value of Data

  3. Privacy, loss of control, marginalisation

Breadth vs Depth Paradox

  • Able to quantify to a greater and greater degree the actions of individuals

  • But less able to characterize society

  • As we measure more, we understand less


  • Perhaps greater preponderance of data is making society itself more complex

  • Therefore traditional approaches to measurement are failing

  • Curate’s egg of a society: it is only ‘measured in parts’

Wood or Tree

  • Can either see a wood or a tree.


  • Election polls (UK 2015 elections, EU referendum, US 2016 elections)

  • Clinical trials vs personalized medicine: obtaining statistical power where interventions are subtle, e.g. social media
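
To see why subtle interventions make statistical power hard to obtain, consider a standard textbook approximation (added here for illustration, not taken from the talk). It assumes a two-arm comparison of means with true effect \(\delta\), common standard deviation \(\sigma\), two-sided significance level \(\alpha\) and power \(1-\beta\). The required number of subjects per arm is then roughly

\[
n \approx \frac{2\,\left(z_{1-\alpha/2} + z_{1-\beta}\right)^{2}\,\sigma^{2}}{\delta^{2}}
\]

where \(z_q\) is the \(q\)-quantile of the standard normal distribution. Because \(n\) scales as \(1/\delta^{2}\), halving the effect size quadruples the number of subjects needed, which is one way to read the tension between population-level trials and personalized interventions.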

Large \(p\), Large \(n\)

  • For large \(p\) the parameters are badly determined.

  • Large \(p\) small \(n\) problem.

    • Easily dealt with through definition.
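
A minimal sketch of what "badly determined" can mean when \(p > n\), assuming a plain linear model (the numbers are invented for illustration and the model choice is an assumption, not something stated in the talk). With fewer observations than parameters, there are directions in parameter space that the data cannot see, so very different parameter vectors give identical predictions.

```python
# Toy design matrix: n = 2 observations (rows), p = 3 features (columns).
X = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]


def predict(X, w):
    """Linear predictions X @ w, written out in plain Python."""
    return [sum(x_ij * w_j for x_ij, w_j in zip(row, w)) for row in X]


# A direction the data cannot see: every row of X is orthogonal to v.
v = [1.0, -2.0, 1.0]

w_a = [0.5, 1.0, -1.0]
# Move a long way along v to get a very different parameter vector.
w_b = [w_j + 10.0 * v_j for w_j, v_j in zip(w_a, v)]

pred_a = predict(X, w_a)
pred_b = predict(X, w_b)
print(pred_a, pred_b)
```

Both parameter vectors fit the two observations equally well, so the data alone cannot choose between them. Some extra assumption, such as a prior or a regulariser, is needed to pick one.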

Breadth vs Depth

  • Modern measurement deals with depth (many subjects) … or breadth (lots of detail about each subject).

  • But what about
    • \(p\approx n\)?
    • Stratification of populations: batch effects etc.
  • Challenge around combination of data sets.
    • E.g. multi-task learning
    • Massively missing data

Also need

  • More classical statistics!
    • Like the ‘paperless office’
  • A better characterization of humans (see later)

  • Larger studies (e.g. 100,000 Genomes)
    • Combined with complex models: algorithmic challenges

Quantifying the Value of Data

There’s a sea of data, but most of it is undrinkable

We require data-desalination before it can be consumed!

Data — Quotes from NIPS Workshop on ML for Healthcare

  • 90% of our time is spent on validation and integration (Leo Anthony Celi)
  • “The Dirty Work We Don’t Want to Think About” (Eric Xing)
  • “Voodoo to get it decompressed” (Francisco Giminez)
  • In health care clinicians collect the data and often control the direction of research through guardianship of data.


  • How do we measure value in the data economy?
  • How do we encourage data workers in curation and management?
    • Incentivization for sharing and production.
    • Quantifying the value in the contribution of each actor.

Embodiment: Data Readiness Levels

  • Three Bands of Data Readiness:

  • Band C - accessibility

  • Band B - validity

  • Band A - usability

Accessibility: Band C

  • Hearsay data.
  • Availability: is it actually being recorded?
  • Privacy or legal constraints on the accessibility of the recorded data: have ethical constraints been alleviated?
  • Format: log books, PDF …
  • Limitations on access due to topology (e.g. it’s distributed across a number of devices).

Validity: Band B

  • Faithfulness and representation.
  • Visualisations.
  • Noise characterisation.
  • Missing values.
  • Example: was a column (or columns) accidentally perturbed, e.g. through a sort operation that missed one or more columns? Or was a gene name accidentally converted to a date?
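
Two of the Band B checks above, missing values and gene names converted to dates, can be sketched in a few lines. This is an illustrative sketch only: the function name `validity_report`, the column name `gene` and the example rows are hypothetical. The date pattern assumes the spreadsheet-style conversion in which symbols such as SEPT2 or MARCH1 end up as strings like "2-Sep" or "1-Mar". It does not detect a column perturbed by a partial sort, which generally needs a comparison against an independent copy of the data.

```python
import re

# Spreadsheet-style date strings such as "2-Sep" or "1-Mar".
DATE_LIKE = re.compile(
    r"^\d{1,2}-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)$",
    re.IGNORECASE,
)


def validity_report(rows, gene_column="gene"):
    """Count missing cells and list gene names that look like dates."""
    missing = 0
    suspicious = []
    for row in rows:
        for value in row.values():
            # Treat None and empty/whitespace-only strings as missing.
            if value is None or str(value).strip() == "":
                missing += 1
        name = row.get(gene_column)
        if name is not None and DATE_LIKE.match(str(name).strip()):
            suspicious.append(name)
    return {"missing_cells": missing, "date_like_gene_names": suspicious}


# Hypothetical rows, invented for illustration only.
rows = [
    {"gene": "BRCA1", "expression": 2.3},
    {"gene": "2-Sep", "expression": 1.1},
    {"gene": "TP53", "expression": None},
    {"gene": "1-Mar", "expression": ""},
]
report = validity_report(rows)
print(report)
```

A report like this only flags candidates for a human to inspect. Deciding whether a flagged value is really corrupted remains a judgement for someone who knows the data.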

Usability: Band A

  • The usability of data
  • Band A is about data in context.
  • Consider appropriateness of a given data set to answer a particular question or to be subject to a particular analysis.

Recursive Effects

  • Band A may also require
    • active collection of new data.
    • annotation of data by human experts
    • revisiting the collection (and running through the appropriate stages again)

Also …

  • Encourage greater interaction between application domains and data scientists

  • Encourage visualization of data

  • Incentivise the delivery of data.

Privacy, Loss of Control and Marginalization

  • Society is becoming harder to monitor

  • Individual is becoming easier to monitor


  • Marketing can become more sinister when the target of the marketing is well understood and the (digital) environment of the target is also so well controlled

  • Potential for explicit and implicit discrimination on the basis of race, religion, sexuality, health status

  • All prohibited under European law, but can pass unawares, or be implicit


  • Credit scoring, insurance, medical treatment
  • What if certain sectors of society are under-represented in our analysis?
  • What if Silicon Valley develops everything for us?

Digital Revolution and Inequality?


  • Work to ensure individual retains control of their own data
  • We accept privacy in our real lives; we need to accept it in our digital lives too
  • Control of persona and ability to project

  • Need better technological solutions: trust and algorithms.


  • Data science offers a great deal of promise for personalized health
  • There are challenges and pitfalls
  • It is incumbent on us to avoid them

Many solutions rely on education and awareness