
# Real World Machine Learning Challenges

Astra Zeneca Data Science Meetup

### What is Machine Learning?

$\text{data} + \text{model} \xrightarrow{\text{compute}} \text{prediction}$

• data : observations, could be actively or passively acquired (meta-data).
• model : assumptions, based on previous experience (other data! transfer learning etc), or beliefs about the regularities of the universe. Inductive bias.
• prediction : an action to be taken or a categorization or a quality score.

• To combine data with a model we need:
• a prediction function $f(\cdot)$, which includes our beliefs about the regularities of the universe
• an objective function $E(\cdot)$, which defines the cost of misprediction.
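
As a minimal sketch of how the two pieces fit together, the Python below pairs a prediction function with an objective function and uses a closed-form least-squares fit as the "compute" step. The linear form, the squared-error cost, and all function names are assumptions chosen for illustration; the slides do not fix a particular model.

```python
from typing import Callable, Sequence


def make_linear_prediction(m: float, c: float) -> Callable[[float], float]:
    """Prediction function: x -> m*x + c. Linearity is the assumed regularity."""
    def f(x: float) -> float:
        return m * x + c
    return f


def squared_error(f: Callable[[float], float],
                  xs: Sequence[float], ys: Sequence[float]) -> float:
    """Objective function: summed squared cost of misprediction."""
    return sum((y - f(x)) ** 2 for x, y in zip(xs, ys))


def fit_least_squares(xs: Sequence[float],
                      ys: Sequence[float]) -> Callable[[float], float]:
    """The 'compute' step: pick m and c that minimise the squared error."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    m = sxy / sxx
    c = y_mean - m * x_mean
    return make_linear_prediction(m, c)


# data + model --compute--> prediction
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]   # toy observations lying on y = 2x + 1
f = fit_least_squares(xs, ys)
prediction = f(4.0)
print(prediction, squared_error(f, xs, ys))
```

Swapping in a different objective (for example, absolute error) changes what counts as a good prediction without touching the data, which is why the two functions are kept separate here.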

### Machine Learning

• Driver of two different domains:
1. Data Science: arises from the fact that we now capture data by happenstance.
2. Artificial Intelligence: emulation of human behaviour.
• Connection: Internet of Things

### Machine Learning

• Driver of two different domains:
1. Data Science: arises from the fact that we now capture data by happenstance.
2. Artificial Intelligence: emulation of human behaviour.
• Connection: Internet of People

Convention for the Protection of Individuals with regard to Automatic Processing of Personal Data (1981/1/28)

### What does Machine Learning do?

• ML Automates through Data
• Strongly related to statistics.
• Field underpins revolution in data science and AI
• With AI:
• logic, robotics, computer vision, speech
• With Data Science:
• databases, data mining, statistics, visualization

Data Science Africa is a bottom-up initiative for capacity building in data science, machine learning and AI on the African continent.

### Example: Prediction of Malaria Incidence in Uganda

• Work with Ricardo Andrade Pacheco, John Quinn and Martin Mubangizi (Makerere University, Uganda)
• See AI-DEV Group.

### Malaria Prediction in Uganda

(Andrade-Pacheco et al., 2014; Mubangizi et al., 2014)

### Machine Learning in Supply Chain

• Supply chain: Large Automated Decision Making Network
• Major Challenge:
• We have a mechanistic understanding of supply chain.
• Machine learning is a data driven technology.

### SafeBoda

With road accidents set to match HIV/AIDS as the highest cause of death in low/middle income countries by 2030, SafeBoda’s aim is to modernise informal transportation and ensure safe access to mobility.

### Motto

Solve Supply Chain, then solve everything else.

### Project Description

It used to be true that computers only did what we programmed them to do, but today AI systems are learning from our data. This introduces new problems in how these systems respond to their environment.

We need to better monitor how data is influencing decision making and take corrective action as required.

### Aim

• Scale safe and reliable AI solutions.
• Move from Auto ML to Auto AI
• Bayesian Optimisation to Bayesian System Optimisation

### Inclusive Project

There is no way that the team we’re building will be able to deliver on this agenda alone, so please join us in addressing these challenges!

### The Three Ds of Machine Learning Systems Design

• Three primary challenges of Machine Learning Systems Design.
1. Decomposition
2. Data
3. Deployment

### Decomposition

• ML is not Magical Pixie Dust.
• It cannot be sprinkled thoughtlessly.
• We cannot simply automate all decisions through data

### Decomposition

We are constrained by:

1. Our data.
2. The models.

• Careful thought needs to be put into sub-processes of task.
• Any repetitive task is a candidate for automation.

### Pigeonholing

1. Can we decompose the decision we need to make into repetitive sub-tasks where inputs and outputs are well defined?
2. Are those repetitive sub-tasks well represented by a mathematical mapping?

### A Trap

• Over emphasis on the type of model we’re deploying.
• Under emphasis on the appropriateness of the task decomposition.

### Co-evolution

• Absolute decomposition is impossible.
• If we deploy a weak component in one place, downstream system will compensate.
• Systems co-evolve … there is no simple solution
• Trade off between performance and decomposability.
• Need to monitor deployment

### Data

• Hard to overstate its importance.
• Half the equation of $\text{data} + \text{model}$.
• Often utterly neglected.

### Data Neglect

• Arises for two reasons.
1. Data cleaning is perceived as tedious.
2. Data cleaning is complex.

### Data Cleaning

• Seems difficult to formulate into readily teachable principles.
• Heavily neglected in data science, statistics and ML courses.
• In practice, most data scientists report spending around 80% of their time on data cleaning.

### The Software Crisis

The major cause of the software crisis is that the machines have become several orders of magnitude more powerful! To put it quite bluntly: as long as there were no machines, programming was no problem at all; when we had a few weak computers, programming became a mild problem, and now we have gigantic computers, programming has become an equally gigantic problem.

Edsger Dijkstra (1930-2002), The Humble Programmer

### The Data Crisis

The major cause of the data crisis is that machines have become more interconnected than ever before. Data access is therefore cheap, but data quality is often poor. What we need is cheap high-quality data. That implies that we develop processes for improving and verifying data quality that are efficient.

There would seem to be two ways for improving efficiency. Firstly, we should not duplicate work. Secondly, where possible we should automate work.

Me

https://arxiv.org/pdf/1705.02245.pdf Data Readiness Levels (Lawrence, 2017)

• Transition: data becomes electronically available
• Transition: pose a question to the data.

### 80/20 in Data Science

• Anecdotally for a given challenge
• 80% of time is spent on data wrangling.
• 20% of time spent on modelling.
• Many companies employ ML Engineers focussing on models not data.

### Premise

Our machine learning is based on a software systems view that is 20 years out of date.

### Continuous Deployment

• Deployment of modeling code.
• Data dependent models in production need continuous monitoring.
• Continuous monitoring implies statistical tests rather than classic software tests.

### Continuous Monitoring

• Continuous deployment:
• We’ve changed the code, we should test the effect.
• Continuous Monitoring:
• The world around us is changing, we should monitor the effect.
• Update our notions of testing: progression testing
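
One way to read "statistical tests rather than classic software tests" is sketched below: compare a reference sample (captured at deployment) with a live sample and flag a change when a two-sample Kolmogorov-Smirnov statistic exceeds its large-sample critical value. The choice of this particular test, the function names, and the toy integer data are assumptions for illustration, not something the slides prescribe.

```python
import math
from bisect import bisect_right
from typing import Sequence


def ks_statistic(reference: Sequence[float], live: Sequence[float]) -> float:
    """Largest gap between the empirical CDFs of the two samples."""
    ref = sorted(reference)
    liv = sorted(live)
    gap = 0.0
    for v in ref + liv:  # the largest gap is attained at an observed value
        cdf_ref = bisect_right(ref, v) / len(ref)
        cdf_liv = bisect_right(liv, v) / len(liv)
        gap = max(gap, abs(cdf_ref - cdf_liv))
    return gap


def ks_threshold(n: int, m: int, alpha: float = 0.01) -> float:
    """Large-sample critical value: sqrt(-ln(alpha/2)/2) * sqrt((n+m)/(n*m))."""
    return math.sqrt(-0.5 * math.log(alpha / 2.0)) * math.sqrt((n + m) / (n * m))


def has_shifted(reference: Sequence[float], live: Sequence[float],
                alpha: float = 0.01) -> bool:
    """True when the live sample looks drawn from a different distribution."""
    return ks_statistic(reference, live) > ks_threshold(len(reference), len(live), alpha)


# Toy feature values: the live sample is the reference moved up by 50 units.
reference_sample = list(range(100))
live_sample = [x + 50 for x in reference_sample]

print(has_shifted(reference_sample, reference_sample))  # same data: no alarm
print(has_shifted(reference_sample, live_sample))       # shifted data: alarm
```

Unlike a unit test, this check has a false-alarm rate (`alpha`) that has to be chosen, which is part of what changes when testing moves from code to data.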

### Data Oriented Architectures

• Convert data to a first-class citizen.
• View system as operations on data streams.
• Expose data operations in a programmatic way.

### Data Orientated Architectures

• Historically we’ve been software first
• A necessary but not sufficient condition for data first
• Move from service orientated architectures to data orientated architectures

### Streaming System

• Operate on rows rather than columns.
• Lead to stateless logic: persistence handled by system.
• Example: Apache Kafka + Apache Flink

### Streaming Architectures

• AWS Kinesis, Apache Kafka
• Nodes in the architecture are stateless
• They persist through storing state on streams
• This brings the data inside out
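
The idea of stateless nodes that persist by storing state on streams can be sketched in plain Python. The `Stream` class below is a hypothetical in-memory append-only log standing in for a Kafka topic or Kinesis stream; it is not the API of either system, and the running-total task is chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class Stream:
    """Hypothetical append-only log standing in for a topic or stream."""
    records: List[Any] = field(default_factory=list)

    def append(self, record: Any) -> None:
        self.records.append(record)

    def latest(self, default: Any = None) -> Any:
        return self.records[-1] if self.records else default


def running_total_node(row: float, state: Stream) -> float:
    """Stateless logic: state is read from, and written back to, a stream."""
    total = state.latest(default=0.0) + row
    state.append(total)
    return total


prices = Stream([3.0, 4.0, 5.0])
totals = Stream()

for row in prices.records:          # process one row at a time
    running_total_node(row, totals)

# A restarted node needs no memory of its own: it resumes from the stream.
resumed = running_total_node(1.0, totals)
print(totals.records, resumed)
```

Because the function holds nothing between calls, every intermediate value is visible on `totals`, which is one reading of bringing the data "inside out".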

### Join

```
stream.join(otherStream)
    .where(<KeySelector>)
    .equalTo(<KeySelector>)
    .window(<WindowAssigner>)
    .apply(<JoinFunction>)
```
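
To make the semantics of that join template concrete, here is a small Python emulation over finite lists: two key selectors, a tumbling window, and a join function. It is a sketch under stated assumptions, not Flink code; `tumbling_window_join`, the event layout, and the symbols `ABC` and `XYZ` are all hypothetical.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, Hashable, List, Tuple

Event = Tuple[int, Dict[str, Any]]  # (timestamp in ms, payload)


def tumbling_window_join(
    left: List[Event],
    right: List[Event],
    left_key: Callable[[Dict[str, Any]], Hashable],
    right_key: Callable[[Dict[str, Any]], Hashable],
    window_ms: int,
    join_fn: Callable[[Dict[str, Any], Dict[str, Any]], Any],
) -> List[Any]:
    """Pair events that share a key and fall in the same tumbling window."""
    buckets = defaultdict(list)
    for ts, payload in right:
        buckets[(ts // window_ms, right_key(payload))].append(payload)
    results = []
    for ts, payload in left:
        for match in buckets.get((ts // window_ms, left_key(payload)), []):
            results.append(join_fn(payload, match))
    return results


prices = [(100, {"symbol": "ABC", "price": 10.0}),
          (1200, {"symbol": "ABC", "price": 11.0})]
orders = [(400, {"symbol": "ABC", "qty": 5}),
          (450, {"symbol": "XYZ", "qty": 1}),
          (1900, {"symbol": "ABC", "qty": 2})]

order_values = tumbling_window_join(
    prices, orders,
    left_key=lambda p: p["symbol"],      # plays the role of .where(...)
    right_key=lambda o: o["symbol"],     # plays the role of .equalTo(...)
    window_ms=1000,                      # plays the role of .window(...)
    join_fn=lambda p, o: p["price"] * o["qty"],  # plays the role of .apply(...)
)
print(order_values)
```

A real streaming engine evaluates the same logic incrementally as windows close, rather than over complete lists as done here.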

### Milan

• Data Oriented Programming Language and runtime.
• DSL embedded in Scala converts to an intermediate language.
• Intermediate language for compilation on different platforms (currently Flink)

https://github.com/amzn/milan

• Stream of prices with millisecond updates.
• Trades are required on a millisecond timescale.

### Hypothetical Streams

• Real stream — share prices
• derived hypothetical stream — share prices in future.
• Hypothetical constrained by
• input constraints.
• decision functional
• computational requirements (latency)
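
A hypothetical stream can be thought of as a declaration: which real stream it derives from, how far ahead it refers, how quickly it must be computed, and what fills in its values. The sketch below is one possible reading of that idea; the field names, the `derive` helper, and the linear-extrapolation placeholder model are assumptions, and none of this is Milan's API.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class HypotheticalStream:
    """Declaration of a derived stream; all field names are illustrative."""
    source: str                 # the real stream it is derived from
    horizon_steps: int          # how far into the future it refers
    max_latency_s: float        # computational requirement per value
    model: Callable[[Sequence[float], int], float]  # fills in the values


def linear_extrapolation(history: Sequence[float], horizon_steps: int) -> float:
    """Placeholder model: continue the most recent step-to-step trend."""
    if len(history) < 2:
        return history[-1]
    return history[-1] + horizon_steps * (history[-1] - history[-2])


def derive(decl: HypotheticalStream, real_values: Sequence[float]) -> List[float]:
    """Emit one hypothetical value per real value, enforcing the latency budget."""
    derived = []
    for t in range(1, len(real_values) + 1):
        start = time.perf_counter()
        value = decl.model(real_values[:t], decl.horizon_steps)
        if time.perf_counter() - start > decl.max_latency_s:
            raise RuntimeError("latency budget exceeded for " + decl.source)
        derived.append(value)
    return derived


future_prices = HypotheticalStream(
    source="share_prices",
    horizon_steps=2,
    max_latency_s=1.0,          # generous budget for this toy example
    model=linear_extrapolation,
)
predicted = derive(future_prices, [10.0, 11.0, 13.0])
print(predicted)
```

Keeping the model as a declared field means it could be replaced, by hand or automatically, without changing the code that consumes the derived stream.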

• Modelling is now required.
• But modelling is declared in the ecosystem.
• If it’s manual, warnings can be used
• calibration, fairness, dataset shift
• Opens door to Auto AI.
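
As an illustration of the kind of warning a declared model makes possible, the sketch below checks calibration: it bins predicted probabilities, compares each bin's mean prediction with the observed frequency, and warns when the largest gap exceeds a tolerance. The binning scheme, the tolerance, and the function names are assumptions; fairness and dataset-shift checks would need their own tests.

```python
import warnings
from typing import List, Sequence, Tuple


def calibration_gap(probabilities: Sequence[float], outcomes: Sequence[int],
                    n_bins: int = 5) -> float:
    """Largest gap, over non-empty bins, between mean prediction and observed rate."""
    bins: List[List[Tuple[float, int]]] = [[] for _ in range(n_bins)]
    for p, y in zip(probabilities, outcomes):
        index = min(int(p * n_bins), n_bins - 1)   # p == 1.0 goes in the last bin
        bins[index].append((p, y))
    gap = 0.0
    for members in bins:
        if members:
            mean_p = sum(p for p, _ in members) / len(members)
            rate = sum(y for _, y in members) / len(members)
            gap = max(gap, abs(mean_p - rate))
    return gap


def check_calibration(probabilities: Sequence[float], outcomes: Sequence[int],
                      tolerance: float = 0.1) -> float:
    """Return the calibration gap, warning if it exceeds the tolerance."""
    gap = calibration_gap(probabilities, outcomes)
    if gap > tolerance:
        warnings.warn(f"calibration gap {gap:.2f} exceeds tolerance {tolerance:.2f}")
    return gap


# An overconfident toy model: predicts 0.9 but is right only half the time.
with warnings.catch_warnings(record=True) as raised:
    warnings.simplefilter("always")
    overconfident_gap = check_calibration([0.9] * 10, [1] * 5 + [0] * 5)

print(overconfident_gap, [str(w.message) for w in raised])
```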

### Information Dynamics

• Potential for information feedback loops.
• Hypothetical streams are instantiated.
• The nature of the hypothesis (e.g. a price prediction) can affect reality.
• Leads to information dynamics, similar to dynamics of governors.
• See e.g. Closed Loop Data Science at Glasgow.

### Conclusion

• Challenges in decomposition, data and model deployment for ML.
• Data oriented architectures and data first thinking are the solution.
• Data oriented programming creates systems that are ready to deploy.
• Opens the door to AutoAI and information dynamics analysis.

### Auto AI

• Auto ML is great but not sufficient
• Interacting components in an ML system
• Identify problems, and automatically deploy solutions

### References

Andrade-Pacheco, R., Mubangizi, M., Quinn, J., Lawrence, N.D., 2014. Consistent mapping of government malaria records across a changing territory delimitation. Malaria Journal 13. https://doi.org/10.1186/1475-2875-13-S1-P5

Lawrence, N.D., 2017. Data readiness levels. arXiv:1705.02245.

Mubangizi, M., Andrade-Pacheco, R., Smith, M.T., Quinn, J., Lawrence, N.D., 2014. Malaria surveillance with multiple data sources using Gaussian process models, in: 1st International Conference on the Use of Mobile ICT in Africa.