Access, Assess and Address: A Pipeline for (Automated?) Data Science

Neil D. Lawrence

ECML Workshop on Automating Data Science

There are three types of lies: lies, damned lies and statistics

Benjamin Disraeli 1804-1881

There are three types of lies: lies, damned lies and ‘big data’

Neil Lawrence 1972-?

Mathematical Statistics

‘Mathematical Data Science’

DELVE

What is Machine Learning?

\[ \text{data} + \text{model} \stackrel{\text{compute}}{\rightarrow} \text{prediction}\]
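As a minimal sketch of the relation above, the snippet below takes a small data set, assumes a straight-line model, and uses least squares as the compute step to arrive at a prediction. The numbers, the choice of a linear model, and the function names `fit_line` and `predict` are assumptions made for illustration only; they are not from the talk.

```python
# Illustrative sketch of "data + model --compute--> prediction".
# The data values and the linear model are assumptions for this example.

def fit_line(xs, ys):
    """Compute step: estimate slope and intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(params, x):
    """Apply the fitted model to a new input."""
    slope, intercept = params
    return slope * x + intercept


# Data: four observations that happen to lie on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

# Model + compute: fit the line to the data.
params = fit_line(xs, ys)

# Prediction: evaluate the fitted model at an unseen input.
prediction = predict(params, 4.0)
print(prediction)
```

Changing either ingredient (different data, or a different model family) changes the prediction, which is the point of writing the relation as a sum of the two.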

The Big Data Paradox

  • We collect more data, but we understand less.

Wood or Tree

Big Model Paradox

  • Add complexity to the model to make it realistic.
  • Move the model “beyond human intuition”.
  • But the model still falls well short of the mark in representing reality.

Increasing Need for Human Judgment

Diane Coyle

The domain of human judgment is increasing.

How these firms use knowledge. How do they generate ideas?

Data as a Convener

  • Data allows externalisation of cognition.
  • Even when the data does not exist, we can ask: what data would we want?

Delve

Delve Reports

  1. Facemasks 4th May 2020 (The DELVE Initiative, 2020a)
  2. Test, Trace, Isolate 27th May 2020 (The DELVE Initiative, 2020b)
  3. Nosocomial Infections 6th July 2020 (The DELVE Initiative, 2020c)
  4. Schools 24th July 2020 (The DELVE Initiative, 2020d)
  5. Economics 14th August 2020 (The DELVE Initiative, 2020e)
  6. Vaccines 1st October 2020 (The DELVE Initiative, 2020f)
  7. Data 24th November 2020 (The DELVE Initiative, 2020g)

Delve Data Report

  • Surveillance data situation.
    • REACT Study (Imperial)
    • ONS Coronavirus (COVID-19) Infection Survey
    • RECOVERY Trial (Dexamethasone)
  • Happenstance data.
    • Our report’s focus (The DELVE Initiative, 2020g)

Delve Data Report: Recommendations

  • Update statutory objective of ONS to accommodate happenstance data.
  • ONS and ICO to collaborate on a data “driving licence” to standardise access processes.
  • Interdisciplinary pathfinder projects across government, business and academia
    • Nowcasting of economic metrics
    • Movement of populations (mobile phone data).

The Three Esses Framework

  • Access
  • Assess
  • Address

CRISP-DM

More generally, a data scientist is someone who knows how to extract meaning from and interpret data, which requires both tools and methods from statistics and machine learning, as well as being human. She spends a lot of time in the process of collecting, cleaning, and munging data, because data is never clean. This process requires persistence, statistics, and software engineering skills—skills that are also necessary for understanding biases in the data, and for debugging logging output from code.

Cathy O’Neil and Rachel Schutt (O’Neil and Schutt, 2013)

Experiment, Analyze, Design

A Vision

We don’t know what science we’ll want to do in five years’ time, but we won’t want slower experiments, we won’t want more expensive experiments and we won’t want a narrower selection of experiments.

What do we want?

  • Faster, cheaper and more diverse experiments.
  • Better ecosystems for experimentation.
  • Data oriented architectures.
  • Data maturity assessments.
  • Data readiness levels.

Ride Sharing: Service Oriented

Ride Sharing: Data Oriented

Ride Sharing: Hypothetical

Access

Bagonza Jimmy Kinyonyi and Michael T. Smith

Access Automation

  • Digital Transformation
  • Post-Digital Transformation

Assess

  • Only things you can do without knowing the “question.”
    • This ensures assess is reusable across tasks.
  • Driven by happenstance data.

Case Study: Text Mining for Covid Misinformation

Joyce Nakatumba-Nabende

Automating Assess

  • Automated schema detection
  • Automated data type detection (Valera and Ghahramani, 2017)
  • The Automatic Statistician (Lloyd et al., 2014)
  • AI for Data Analytics (Nazábal et al., 2020)
  • Joyce’s case study also gives us POS tagging for new languages.
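To make the idea of question-independent assessment concrete, here is a deliberately naive sketch of column type detection over tabular data. It is a rule-based heuristic written for illustration and is not the probabilistic method of Valera and Ghahramani (2017); the function names, the type labels, the "at most half the values are distinct" rule for categoricals, and the example rows are all assumptions.

```python
# Naive, rule-based column type detection: an illustrative heuristic only.

def infer_type(values):
    """Guess a column type from its string values, ignoring missing entries."""
    present = [v.strip() for v in values if v is not None and v.strip() != ""]
    if not present:
        return "empty"

    def all_parse(cast):
        # True if every non-missing value can be converted by `cast`.
        for v in present:
            try:
                cast(v)
            except ValueError:
                return False
        return True

    if all(v.lower() in {"true", "false"} for v in present):
        return "boolean"
    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "real"
    # Assumed rule of thumb: few distinct values suggests a categorical column.
    if len(set(present)) <= max(1, len(present) // 2):
        return "categorical"
    return "text"


def infer_schema(rows):
    """rows: list of dicts mapping column name to string value."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return {name: infer_type(vals) for name, vals in columns.items()}


# Hypothetical example rows; the empty string marks a missing value.
rows = [
    {"age": "34", "temp": "36.6", "region": "north", "note": "mild cough"},
    {"age": "51", "temp": "38.1", "region": "south", "note": "fever since Monday"},
    {"age": "29", "temp": "", "region": "north", "note": "no symptoms"},
    {"age": "42", "temp": "37.0", "region": "south", "note": "lost sense of smell"},
]
schema = infer_schema(rows)
print(schema)
```

Nothing here depends on the question being asked of the data, which is what makes this kind of step reusable across tasks.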

AI for Data Analytics

Address

  • Address the question.
  • Now we bring the context in.
  • Could require:
    • Confirmatory data analysis
    • An ML prediction model
    • Visualisation through a dashboard
    • An Excel spreadsheet

Automating Address

  • Auto ML
  • Automatic Statistician
  • Automatic Visualization
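At its simplest, automating the address stage can be read as searching over candidate models and keeping the one that scores best on held-out data. The sketch below does this for three toy regressors using a single train/validation split. It is a minimal illustration under stated assumptions, not how any particular AutoML system works: the candidate models, the function names, and the synthetic data are all hypothetical.

```python
# Minimal model-selection sketch: fit each candidate, score it on held-out
# data, keep the best. All names and data are assumptions for illustration.

def fit_mean(xs, ys):
    """Baseline: always predict the training mean."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_linear(xs, ys):
    """Straight line fitted by ordinary least squares."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + slope * (x - mx)


def fit_nearest(xs, ys):
    """1-nearest-neighbour: return the target of the closest training input."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def mse(model, xs, ys):
    """Mean squared error of a fitted model on (xs, ys)."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def select_model(candidates, train, valid):
    """Fit every candidate on train, score on valid, return best name and scores."""
    scores = {}
    for name, fit in candidates.items():
        model = fit(*train)
        scores[name] = mse(model, *valid)
    best = min(scores, key=scores.get)
    return best, scores


# Synthetic data lying on y = 3x - 2, split into training and validation sets.
train = ([0.0, 2.0, 4.0, 6.0, 8.0], [-2.0, 4.0, 10.0, 16.0, 22.0])
valid = ([1.0, 3.0, 5.0, 7.0], [1.0, 7.0, 13.0, 19.0])

candidates = {"mean": fit_mean, "linear": fit_linear, "nearest": fit_nearest}
best, scores = select_model(candidates, train, valid)
print(best, scores)
```

The search itself is mechanical; deciding which candidates belong in the list, and whether validation error is the right criterion for the question, still requires the context that the address stage brings in.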

AutoML

Conclusions

  • Bandwidth constraints of humans
  • Big Data Paradox
  • Big Model Paradox
  • Access, Assess, Address

Thanks!

References

Lloyd, J.R., Duvenaud, D., Grosse, R., Tenenbaum, J.B., Ghahramani, Z., 2014. Automatic construction and natural-language description of nonparametric regression models, in: AAAI.
Nazábal, A., Williams, C.K.I., Colavizza, G., Smith, C.R., Williams, A., 2020. Data engineering for data analytics: A classification of the issues, and case studies.
O’Neil, C., Schutt, R., 2013. Doing data science: Straight talk from the frontline. O’Reilly.
The DELVE Initiative, 2020g. Data readiness: Lessons from an emergency. The Royal Society.
The DELVE Initiative, 2020e. Economic aspects of the COVID-19 crisis in the UK. The Royal Society.
The DELVE Initiative, 2020a. Face masks for the general public. The Royal Society.
The DELVE Initiative, 2020c. Scoping report on hospital and health care acquisition of COVID-19 and its control. The Royal Society.
The DELVE Initiative, 2020d. Balancing the risks of pupils returning to schools. The Royal Society.
The DELVE Initiative, 2020b. Test, trace, isolate. The Royal Society.
The DELVE Initiative, 2020f. SARS-CoV-2 vaccine development & implementation; scenarios, options, key decisions. The Royal Society.
Valera, I., Ghahramani, Z., 2017. Automatic discovery of the statistical types of variables in a dataset, in: Precup, D., Teh, Y.W. (Eds.), Proceedings of the 34th International Conference on Machine Learning, Proceedings of Machine Learning Research. PMLR, pp. 3521–3529.