# Data Science: Is it Time for Professionalisation?

#### Amazon Research Cambridge and University of Sheffield

@lawrennd inverseprobability.com

### What is Machine Learning?

$\text{data} + \text{model} \rightarrow \text{prediction}$

### Machine Learning as the Driver ...

... of two different domains

1. Data Science: arises from the fact that we now capture data by happenstance.

2. Artificial Intelligence: emulation of human behaviour.

### What does Machine Learning do?

• ML Automates through Data

• Strongly related to statistics.

• Field underpins revolution in data science and AI

• With AI: logic, robotics, computer vision, speech

• With Data Science: databases, data mining, statistics, visualization

### "Embodiment Factors"

|  | machine | human |
|---|---|---|
| compute | ~10 gigaflops | ~1000 teraflops? |
| communicate | ~1 gigabit/s | ~100 bit/s |
| embodiment (compute/communicate) | 10 | ~$10^{13}$ |
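The embodiment factor is just the ratio of compute rate to communication rate. The sketch below reproduces that arithmetic from the slide's rough figures, reading the first figure in each pair as a machine and the second as a human (that reading, and the function name `embodiment_factor`, are assumptions made for illustration).

```python
# Rough order-of-magnitude estimates taken from the slide, not measurements.
GIGA = 1e9
TERA = 1e12


def embodiment_factor(compute_flops: float, communicate_bits_per_s: float) -> float:
    """Ratio of compute rate (flops) to communication rate (bit/s)."""
    return compute_flops / communicate_bits_per_s


# Assumed reading: first figure of each pair is the machine, second the human.
machine = embodiment_factor(10 * GIGA, 1 * GIGA)
human = embodiment_factor(1000 * TERA, 100)

print(f"machine: {machine:.0e}")
print(f"human:   {human:.0e}")
```

On these figures the human ratio is about twelve orders of magnitude larger than the machine one.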

### What does Machine Learning do?

• We scale by codifying processes and automating them.

• Ensure components are compatible (Whitworth threads)

• Then interconnect them as efficiently as possible.

• cf Colt 45, Ford Model T

### Codify Through Mathematical Functions

• How does machine learning work?

• Jumper (jersey/sweater) purchase with logistic regression

$\text{odds} = \frac{\text{bought}}{\text{not bought}}$

$\log \text{odds} = \beta_0 + \beta_1 \text{age} + \beta_2 \text{latitude}$

$p(\text{bought}) = f\left(\beta_0 + \beta_1 \text{age} + \beta_2 \text{latitude}\right)$
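A short derivation of where $f$ comes from, reading the odds as the ratio of the probability of buying to the probability of not buying. The symbols $\pi$ and $z$ are shorthand introduced here, not notation from the talk.

$\pi = p(\text{bought}), \quad z = \beta_0 + \beta_1 \text{age} + \beta_2 \text{latitude}$

$\log \frac{\pi}{1 - \pi} = z \;\Rightarrow\; \frac{\pi}{1 - \pi} = \exp(z) \;\Rightarrow\; \pi = \frac{\exp(z)}{1 + \exp(z)} = \frac{1}{1 + \exp(-z)}$

So $f(z) = \frac{1}{1 + \exp(-z)}$, the logistic function: it maps any real-valued log odds to a probability between 0 and 1.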

$p(\text{bought}) = f\left(\boldsymbol{\beta}^\top \mathbf{x}\right)$
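A minimal sketch of the vector form in plain Python, taking $f$ to be the logistic function that logistic regression implies. The coefficient values are hypothetical and chosen only for illustration (they are not fitted to any data), and the function names are assumptions.

```python
import math


def logistic(z: float) -> float:
    """Map log odds z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def p_bought(beta: list[float], x: list[float]) -> float:
    """p(bought) = f(beta^T x); x[0] is fixed at 1 so beta[0] plays the role of beta_0."""
    if len(beta) != len(x):
        raise ValueError("beta and x must have the same length")
    z = sum(b * xi for b, xi in zip(beta, x))
    return logistic(z)


# Hypothetical coefficients: beta_0, beta_1 (age), beta_2 (latitude).
beta = [-4.0, 0.02, 0.06]
# Feature vector: constant 1, age in years, latitude in degrees.
x = [1.0, 40.0, 53.4]

p = p_bought(beta, x)
odds = p / (1.0 - p)
print(f"p(bought) = {p:.3f}, odds = {odds:.3f}")
```

Taking the log of `odds` recovers the linear score, which is what makes each coefficient readable as a change in log odds per unit of its feature.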

### Deep Learning

• These are interpretable models: vital for disease etc.

• Modern machine learning methods are less interpretable

• Example: face recognition

Outline of the DeepFace architecture. A front-end of a single convolution-pooling-convolution filtering on the rectified input, followed by three locally-connected layers and two fully-connected layers. Color illustrates feature maps produced at each layer. The net includes more than 120 million parameters, where more than 95% come from the local and fully connected layers.

Source: DeepFace

### Data Science

• Industrial Revolution 4.0?

• Industrial Revolution (1760-1840) term coined by Arnold Toynbee, late 19th century.

• Maybe: But this one is dominated by data not capital

• That presents challenges and opportunities

• Apple vs Nokia: How you handle disruption.

### A Time for Professionalisation?

• New technologies historically led to new professions:

• Brunel (born 1806): Civil, mechanical, naval

• Tesla (born 1856): Electrical and power

• William Shockley (born 1910): Electronic

• Watts S. Humphrey (born 1927): Software

### Why?

• Codification of best practice.

• Developing trust

### Where are we?

• Perhaps around the 1980s of programming.

• We understand if, for, procedures

• But we don't share best practice.

• Let's avoid the over-formalisation of software engineering.

### The Software Crisis

The major cause of the software crisis is that the machines have become several orders of magnitude more powerful! To put it quite bluntly: as long as there were no machines, programming was no problem at all; when we had a few weak computers, programming became a mild problem, and now we have gigantic computers, programming has become an equally gigantic problem.

Edsger Dijkstra, The Humble Programmer

### The Data Crisis

The major cause of the data crisis is that machines have become more interconnected than ever before. Data access is therefore cheap, but data quality is often poor. What we need is cheap high quality data. That implies that we develop processes for improving and verifying data quality that are efficient.

There would seem to be two ways for improving efficiency. Firstly, we should not duplicate work. Secondly, where possible we should automate work.

Me

### Rest of this Talk: Two Areas of Focus

• Reusability of Data

• Deployment of Machine Learning Systems

### Quantifying the Value of Data

There's a sea of data, but most of it is undrinkable

We require data-desalination before it can be consumed!

### Data Quotes

• 90% of our time is spent on validation and integration (Leo Anthony Celi)
• "The Dirty Work We Don't Want to Think About" (Eric Xing)
• "Voodoo to get it decompressed" (Francisco Giminez)
• Getting money from management for data collection and annotation can be a total nightmare.

### Value

• How do we measure value in the data economy?

• How do we encourage data workers: curation and management?

• Incentivization for sharing and production.

• Quantifying the value in the contribution of each actor.

### Accessibility: Grade C

• Hearsay data.
• Availability: is it actually being recorded?
• Privacy or legal constraints on the accessibility of the recorded data: have ethical constraints been alleviated?
• Format: log books, PDF ...
• Limitations on access due to topology (e.g. it's distributed across a number of devices).
• At the end of Grade C, data is ready to be loaded into analysis software (R, SPSS, Matlab, Python, Mathematica).

### Validity: Grade B

• Faithfulness and representation.
• Visualisations.
• Exploratory data analysis.
• Noise characterisation.
• Missing values.
• Schema alignment, record linkage, data fusion?
• Example: was a column or columns accidentally perturbed (e.g. through a sort operation that missed one or more columns)? Or was a gene name accidentally converted to a date?
• At the end of Grade B, data is ready for defining a candidate question and its context, and for loading into OpenML.

### Usability: Grade A

• The usability of data.
• Consider the appropriateness of a given data set to answer a particular question or to be subject to a particular analysis.
• Data integration?
• At the end of Grade A, data is ready for data platforms such as RAMP and Kaggle, or for defining a task in OpenML.
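Some of the Grade B checks listed above lend themselves to automation. The sketch below counts missing values per column and flags gene names that look as if a spreadsheet converted them to dates (for example `SEPT2` becoming `2-Sep`). The function name `grade_b_checks`, the record layout and the regular expression are illustrative assumptions, not part of any data readiness standard.

```python
import re

# Matches strings such as "2-Sep" or "1-Mar": the form a month-like gene symbol
# can take after a spreadsheet converts it to a date. Illustrative pattern only.
DATE_LIKE = re.compile(
    r"^\d{1,2}-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)$",
    re.IGNORECASE,
)


def grade_b_checks(rows, gene_column):
    """Return simple validity findings for a list of dict records."""
    findings = {"missing": {}, "date_like_genes": []}
    for i, row in enumerate(rows):
        # Count empty or None entries per column.
        for col, value in row.items():
            if value is None or str(value).strip() == "":
                findings["missing"][col] = findings["missing"].get(col, 0) + 1
        # Flag gene names that look like dates.
        gene = row.get(gene_column)
        if gene is not None and DATE_LIKE.match(str(gene).strip()):
            findings["date_like_genes"].append((i, gene))
    return findings


# Hypothetical records for illustration.
rows = [
    {"gene": "BRCA1", "expression": 2.3},
    {"gene": "2-Sep", "expression": 1.1},   # likely SEPT2 converted to a date
    {"gene": "TP53", "expression": None},   # missing value
]
report = grade_b_checks(rows, "gene")
print(report)
```

Checks like these are cheap to share and rerun, which fits the two efficiency routes named earlier: avoid duplicating work, and automate where possible.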

### Recursive Effects

• Grade A may also require:

• active collection of new data.

• rebalancing of data to ensure fairness

• annotation of data by human experts

• revisiting the collection (and running through the appropriate stages again)

### Also ...

• Encourage greater interaction between application domains and data scientists

• Encourage visualization of data

• Incentivise the delivery of data.

• Analogies: for software engineers, describe data science as debugging.

• Data Joel Tests

### Artificial Intelligence

• Challenges in deploying AI.

• Currently this is in the form of "machine learning systems"

### Internet of People

• Fog computing: barrier between cloud and device blurring.

• Complex feedback between algorithm and implementation

### Deploying ML in Real World: Machine Learning Systems Design

• Major new challenge for systems designers.

• Internet of Intelligence but currently:

• AI systems are fragile

### Fragility of AI Systems

• They are built componentwise from ML capabilities.

• Each capability is independently constructed and verified.

• Pedestrian detection
• Important for verification purposes.

### Rapid Reimplementation

• Whole systems are being deployed.

• But they change their environment.

• They experience evolved adversarial behaviour.

• Stuxnet

### Turnaround And Update

• There is a massive need for turnaround and update.

• A redeploy of the entire system.

• This involves changing the way we design and deploy.
• Interface between security engineering and machine learning.

### Peppercorns

• A new name for system failures which aren't bugs.

• Difference between finding a fly in your soup vs a peppercorn in your soup.

### Conclusion

• Artificial Intelligence and Data Science are fundamentally different.

• In one you are dealing with data collected by happenstance.

• In the other you are trying to build systems in the real world, often by actively collecting data.

• Our approaches to systems design are building powerful machines that will be deployed in evolving environments.