at Data Science Africa, Abuja on Nov 15, 2018 [jupyter][reveal]
Neil D. Lawrence, Amazon Cambridge and University of Sheffield

#### Abstract

Machine learning solutions, in particular those based on deep learning methods, form an underpinning of the current revolution in “artificial intelligence” that has dominated popular press headlines and is having a significant influence on the wider tech agenda. In this talk I will give an overview of where we are now with machine learning solutions, and what challenges we face both in the near and far future. These include practical application of existing algorithms in the face of the need to explain decision making, mechanisms for improving the quality and availability of data, dealing with large unstructured datasets.

.

.

# Introduction

## Lies and Damned Lies 

There are three types of lies: lies, damned lies and statistics

Benjamin Disraeli 1804-1881

Benjamin Disraeli said1 that there three types of lies: lies, damned lies and statistics. Disraeli died in 1881, 30 years before the first academic department of applied statistics was founded at UCL. If Disraeli were alive today, it is likely that he'd rephrase his quote:

There are three types of lies, lies damned lies and big data.

Why? Because the challenges of understanding and interpreting big data today are similar to those that Disraeli faced in governing an empire through statistics in the latter part of the 19th century.

The quote lies, damned lies and statistics was credited to Benjamin Disraeli by Mark Twain in his autobiography. It characterizes the idea that statistic can be made to prove anything. But Disraeli died in 1881 and Mark Twain died in 1910. The important breakthrough in overcoming our tendency to overinterpet data came with the formalization of the field through the development of mathematical statistics.

Data has an elusive quality, it promises so much but can deliver little, it can mislead and misrepresent. To harness it, it must be tamed. In Disraeli's time during the second half of the 19th century, numbers and data were being accumulated, the social sciences were being developed. There was a large scale collection of data for the purposes of government.

The modern 'big data era' is on the verge of delivering the same sense of frustration that Disraeli experienced, the early promise of big data as a panacea is evolving to demands for delivery. For me, personally, peak-hype coincided with an email I received inviting collaboration on a project to deploy "Big Data and Internet of Things in an Industry 4.0 environment". Further questioning revealed that the actual project was optimization of the efficiency of a manufacturing production line, a far more tangible and realizable goal.

The antidote to this verbage is found in increasing awareness. When dealing with data the first trap to avoid is the games of buzzword bingo that we are wont to play. The first goal is to quantify what challenges can be addressed and what techniques are required. Behind the hype fundamentals are changing. The phenomenon is about the increasing access we have to data. The manner in which customers information is recorded and processes are codified and digitized with little overhead. Internet of things is about the increasing number of cheap sensors that can be easily interconnected through our modern network structures. But businesses are about making money, and these phenomena need to be recast in those terms before their value can be realized.

## Mathematical Statistics

Karl Pearson (1857-1936), Ronald Fisher (1890-1962) and others considered the question of what conclusions can truly be drawn from data. Their mathematical studies act as a restraint on our tendency to over-interpret and see patterns where there are none. They introduced concepts such as randomized control trials that form a mainstay of the our decision making today, from government, to clinicians to large scale A/B testing that determines the nature of the web interfaces we interact with on social media and shopping.

Their movement did the most to put statistics to rights, to eradicate the 'damned lies'. It was known as 'mathematical statistics'. Today I believe we should look to the emerging field of data science to provide the same role. Data science is an amalgam of statistics, data mining, computer systems, databases, computation, machine learning and artificial intelligence. Spread across these fields are the tools we need to realize data's potential. For many businesses this might be thought of as the challenge of 'converting bits into atoms'. Bits: the data stored on computer, atoms: the physical manifestation of what we do; the transfer of goods, the delivery of service. From fungible to tangible. When solving a challenge through data there are a series of obstacles that need to be addressed.

Firstly, data awareness: what data you have and where its stored. Sometimes this includes changing your conception of what data is and how it can be obtained. From automated production lines to apps on employee smart phones. Often data is locked away: manual log books, confidential data, personal data. For increasing awareness an internal audit can help. The website data.gov.uk hosts data made available by the UK government. To create this website the government's departments went through an audit of what data they each hold and what data they could make available. Similarly, within private buisnesses this type of audit could be useful for understanding their internal digital landscape: after all the key to any successful campaign is a good map.

Secondly, availability. How well are the data sources interconnected? How well curated are they? The curse of Disraeli was associated with unreliable data and unreliable statistics. The misrepresentations this leads to are worse than the absence of data as they give a false sense of confidence to decision making. Understanding how to avoid these pitfalls involves an improved sense of data and its value, one that needs to permeate the organization.

The final challenge is analysis, the accumulation of the necessary expertise to digest what the data tells us. Data requires intepretation, and interpretation requires experience. Analysis is providing a bottleneck due to a skill shortage, a skill shortage made more acute by the fact that, ideally, analysis should be carried out by individuals not only skilled in data science but also equipped with the domain knowledge to understand the implications in a given application, and to see opportunities for improvements in efficiency.

## 'Mathematical Data Science'

As a term 'big data' promises much and delivers little, to get true value from data, it needs to be curated and evaluated. The three stages of awareness, availability and analysis provide a broad framework through which organizations should be assessing the potential in the data they hold. Hand waving about big data solutions will not do, it will only lead to self-deception. The castles we build on our data landscapes must be based on firm foundations, process and scientific analysis. If we do things right, those are the foundations that will be provided by the new field of data science.

Today the statement "There are three types of lies: lies, damned lies and 'big data'" may be more apt. We are revisiting many of the mistakes made in interpreting data from the 19th century. Big data is laid down by happenstance, rather than actively collected with a particular question in mind. That means it needs to be treated with care when conclusions are being drawn. For data science to succede it needs the same form of rigour that Pearson and Fisher brought to statistics, a "mathematical data science" is needed.

You can also check my blog post on "Lies, Damned Lies and Big Data"

## Embodiment Factors 

 compute ≈100 gigaflops ≈16 petaflops communicate 1 gigbit/s 100 bit/s (compute/communicate) 104 1014

There is a fundamental limit placed on our intelligence based on our ability to communicate. Claude Shannon founded the field of information theory. The clever part of this theory is it allows us to separate our measurement of information from what the information pertains to2.

Shannon measured information in bits. One bit of information is the amount of information I pass to you when I give you the result of a coin toss. Shannon was also interested in the amount of information in the English language. He estimated that on average a word in the English language contains 12 bits of information.

Given typical speaking rates, that gives us an estimate of our ability to communicate of around 100 bits per second (Reed and Durlach 1998). Computers on the other hand can communicate much more rapidly. Current wired network speeds are around a billion bits per second, ten million times faster.

When it comes to compute though, our best estimates indicate our computers are slower. A typical modern computer can process make around 100 billion floating point operations per second, each floating point operation involves a 64 bit number. So the computer is processing around 6,400 billion bits per second.

It's difficult to get similar estimates for humans, but by some estimates the amount of compute we would require to simulate a human brain is equivalent to that in the UK's fastest computer (Ananthanarayanan et al. 2009), the MET office machine in Exeter, which in 2018 ranks as the 11th fastest computer in the world. That machine simulates the world's weather each morning, and then simulates the world's climate in the afternoon. It is a 16 petaflop machine, processing around 1,000 trillion bits per second.

So when it comes to our ability to compute we are extraordinary, not compute in our conscious mind, but the underlying neuron firings that underpin both our consciousness, our subconsciousness as well as our motor control etc.

If we think of ourselves as vehicles, then we are massively overpowered. Our ability to generate derived information from raw fuel is extraordinary. Intellectually we have formula one engines.

But in terms of our ability to deploy that computation in actual use, to share the results of what we have inferred, we are very limited. So when you imagine the F1 car that represents a psyche, think of an F1 car with bicycle wheels.

Just think of the control a driver would have to have to deploy such power through such a narrow channel of traction. That is the beauty and the skill of the human mind.

In contrast, our computers are more like go-karts. Underpowered, but with well-matched tires. They can communicate far more fluidly. They are more efficient, but somehow less extraordinary, less beautiful.

For humans, that means much of our computation should be dedicated to considering what we should compute. To do that efficiently we need to model the world around us. The most complex thing in the world around us is other humans. So it is no surprise that we model them. We second guess what their intentions are, and our communication is only necessary when they are departing from how we model them. Naturally, for this to work well, we need to understand those we work closely with. So it is no surprise that social communication, social bonding, forms so much of a part of our use of our limited bandwidth.

There is a second effect here, our need to anthropomorphise objects around us. Our tendency to model our fellow humans extends to when we interact with other entities in our environment. To our pets as well as inanimate objects around us, such as computers or even our cars. This tendency to over interpret could be a consequence of our limited ability to communicate.

For more details see this paper "Living Together: Mind and Machine Intelligence", and this TEDx talk.

# Evolved Relationship with Information 

The high bandwidth of computers has resulted in a close relationship between the computer and data. Large amounts of information can flow between the two. The degree to which the computer is mediating our relationship with data means that we should consider it an intermediary.

Originaly our low bandwith relationship with data was affected by two characteristics. Firstly, our tendency to over-interpret driven by our need to extract as much knowledge from our low bandwidth information channel as possible. Secondly, by our improved understanding of the domain of mathematical statistics and how our cognitive biases can mislead us.

With this new set up there is a potential for assimilating far more information via the computer, but the computer can present this to us in various ways. If it's motives are not aligned with ours then it can misrepresent the information. This needn't be nefarious it can be simply as a result of the computer pursuing a different objective from us. For example, if the computer is aiming to maximize our interaction time that may be a different objective from ours which may be to summarize information in a representative manner in the shortest possible length of time.

For example, for me, it was a common experience to pick up my telephone with the intention of checking when my next appointment was, but to soon find myself distracted by another application on the phone, and end up reading something on the internet. By the time I'd finished reading, I would often have forgotten the reason I picked up my phone in the first place.

There are great benefits to be had from the huge amount of information we can unlock from this evolved relationship between us and data. In biology, large scale data sharing has been driven by a revolution in genomic, transcriptomic and epigenomic measurement. The improved inferences that that can be drawn through summarizing data by computer have fundamentally changed the nature of biological science, now this phenomenon is also infuencing us in our daily lives as data measured by happenstance is increasingly used to characterize us.

Better mediation of this flow actually requires a better understanding of human-computer interaction. This in turn involves understanding our own intelligence better, what its cognitive biases are and how these might mislead us.

For further thoughts see this Guardian article from 2015 on marketing in the internet era.

You can also check my blog post on "System Zero"

## Societal Effects 

We have already seen the effects of this changed dynamic in biology and computational biology. Improved sensorics have led to the new domains of transcriptomics, epigenomics, and 'rich phenomics' as well as considerably augmenting our capabilities in genomics.

Biologists have had to become data-savvy, they require a rich understanding of the available data resources and need to assimilate existing data sets in their hypothesis generation as well as their experimental design. Modern biology has become a far more quantitative science, but the quantitativeness has required new methods developed in the domains of computational biology and bioinformatics.

There is also great promise for personalized health, but in health the wide data-sharing that has underpinned success in the computational biology community is much harder to cary out.

We can expect to see these phenomena reflected in wider society. Particularly as we make use of more automated decision making based only on data. This is leading to a requirement to better understand our own subjective biases to ensure that the human to computer interface allows domain experts to assimilate data driven conclusions in a well calibrated manner. This is particularly important where medical treatments are being prescribed. It also offers potential for different kinds of medical intervention. More subtle interventions are possible when the digital environment is able to respond to users in an bespoke manner. This has particular implications for treatment of mental health conditions.

The main phenomenon we see across the board is the shift in dynamic from the direct pathway between human and data, as traditionally mediated by classical statistcs, to a new flow of information via the computer. This change of dynamics gives us the modern and emerging domain of data science, where the interactions between human and data are mediated by the machine.

## What does Machine Learning do? 

Any process of automation allows us to scale what we do by codifying a process in some way that makes it efficient and repeatable. Machine learning automates by emulating human (or other actions) found in data. Machine learning codifies in the form of a mathematical function that is learnt by a computer. If we can create these mathematical functions in ways in which they can interconnect, then we can also build systems.

Machine learning works through codifing a prediction of interest into a mathematical function. For example, we can try and predict the probability that a customer wants to by a jersey given knowledge of their age, and the latitude where they live. The technique known as logistic regression estimates the odds that someone will by a jumper as a linear weighted sum of the features of interest.

$$\text{odds} = \frac{p(\text{bought})}{p(\text{not bought})}$$

logodds = β0 + β1age + β2latitude.
Here β0, β1 and β2 are the parameters of the model. If β1 and β2 are both positive, then the log-odds that someone will buy a jumper increase with increasing latitude and age, so the further north you are and the older you are the more likely you are to buy a jumper. The parameter β0 is an offset parameter, and gives the log-odds of buying a jumper at zero age and on the equator. It is likely to be negative3 indicating that the purchase is odds-against. This is actually a classical statistical model, and models like logistic regression are widely used to estimate probabilities from ad-click prediction to risk of disease.

This is called a generalized linear model, we can also think of it as estimating the probability of a purchase as a nonlinear function of the features (age, lattitude) and the parameters (the β values). The function is known as the sigmoid or logistic function, thus the name logistic regression.

$$p(\text{bought}) = \sigmoid{\beta_0 + \beta_1 \text{age} + \beta_2 \text{latitude}}.$$
In the case where we have features to help us predict, we sometimes denote such features as a vector, $\inputVector$, and we then use an inner product between the features and the parameters, $\boldsymbol{\beta}^\top \inputVector = \beta_1 \inputScalar_1 + \beta_2 \inputScalar_2 + \beta_3 \inputScalar_3 ...$, to represent the argument of the sigmoid.

$$p(\text{bought}) = \sigmoid{\boldsymbol{\beta}^\top \inputVector}.$$
More generally, we aim to predict some aspect of our data, $\dataScalar$, by relating it through a mathematical function, $\mappingFunction(\cdot)$, to the parameters, β and the data, $\inputVector$.

$$\dataScalar = \mappingFunction\left(\inputVector, \boldsymbol{\beta}\right).$$
We call $\mappingFunction(\cdot)$ the prediction function.

To obtain the fit to data, we use a separate function called the objective function that gives us a mathematical representation of the difference between our predictions and the real data.

$$\errorFunction(\boldsymbol{\beta}, \dataMatrix, \inputMatrix)$$
A commonly used examples (for example in a regression problem) is least squares,
$$\errorFunction(\boldsymbol{\beta}, \dataMatrix, \inputMatrix) = \sum_{i=1}^\numData \left(\dataScalar_i - \mappingFunction(\inputVector_i, \boldsymbol{\beta})\right)^2.$$

If a linear prediction function is combined with the least squares objective function then that gives us a classical linear regression, another classical statistical model. Statistics often focusses on linear models because it makes interpretation of the model easier. Interpretation is key in statistics because the aim is normally to validate questions by analysis of data. Machine learning has typically focussed more on the prediction function itself and worried less about the interpretation of parameters, which are normally denoted by w instead of β. As a result non-linear functions are explored more often as they tend to improve quality of predictions but at the expense of interpretability.

## What is Machine Learning? 

Machine learning allows us to extract knowledge from data to form a prediction.

$$\text{data} + \text{model} \xrightarrow{\text{compute}} \text{prediction}$$

A machine learning prediction is made by combining a model with data to form the prediction. The manner in which this is done gives us the machine learning algorithm.

Machine learning models are mathematical models which make weak assumptions about data, e.g. smoothness assumptions. By combining these assumptions with the data we observe we can interpolate between data points or, occasionally, extrapolate into the future.

Machine learning is a technology which strongly overlaps with the methodology of statistics. From a historical/philosophical view point, machine learning differs from statistics in that the focus in the machine learning community has been primarily on accuracy of prediction, whereas the focus in statistics is typically on the interpretability of a model and/or validating a hypothesis through data collection.

The rapid increase in the availability of compute and data has led to the increased prominence of machine learning. This prominence is surfacing in two different, but overlapping domains: data science and artificial intelligence.

The real challenge, however, is end-to-end decision making. Taking information from the enviroment and using it to drive decision making to achieve goals.

## Machine Learning in Supply Chain 

Containerization has had a dramatic effect on global economics, placing many people in the developing world at the end of the supply chain.

For example, you can buy Wild Alaskan Cod fished from Alaska, processed in China, sold in North America. This is driven by the low cost of transport for frozen cod vs the higher relative cost of cod processing in the US versus China. Similarly, Scottish prawns are also processed in China for sale in the UK.

Supply chain is a large scale automated decision making network. Our aim is to make decisions not only based on our models of customer behavior (as observed through data), but also by accounting for the structure of our fulfilment center, and delivery network.

Many of the most important questions in supply chain take the form of counterfactuals. E.g. “What would happen if we opened a manufacturing facility in Cambridge?” A counter factual is a question that implies a mechanistic understanding of a system. It goes beyond simple smoothness assumptions or translation invariants. It requires a physical, or mechanistic understanding of the supply chain network. For this reason the type of models we deploy in supply chain often involve simulations or more mechanistic understanding of the network.

In supply chain Machine Learning alone is not enough, we need to bridge between models that contain real mechanisms and models that are entirely data driven.

This is challenging, because as we introduce more mechanism to the models we use, it becomes harder to develop efficient algorithms to match those models to data.

## Operations Research, Control, Econometrics, Statistics and Machine Learning 

data + model is not new, it dates back to Laplace and Gauss. Gauss fitted the orbit of Ceres using Keplers laws of planetary motion to generate his basis functions, and Laplace's insights on the error function and uncertainty (Stigler 1999). Different fields such as Operations Research, Control, Econometrics, Statistics, Machine Learning and now Data Science and AI all rely on data + model. Under a Popperian view of science, and equating experiment to data, one could argue that all science has data + model underpinning it.

Different academic fields are born in different eras, driven by different motivations and arrive at different solutions. For example, both Operations Research and Control emerged from the Second World War. Operations Research, the science of decision making, driven by the need for improved logistics and supply chain. Control emerged from cybernetics, a field that was driven in the by researchers who had been involved in radar and decryption (Wiener 1948; Husband, Holland, and Wheeler 2008). The UK artificial intelligence community had similar origins (Copeland 2006).

The separation between these fields has almost become tribal, and from one perspective this can be very helpful. Each tribe can agree on a common language, a common set of goals and a shared understanding of the approach they’ve chose for those goals. This ensures that best practice can be developed and shared and as a result quality standards rise.

This is the nature of our professions. Medics, lawyers, engineers and accountants all have a system of shared best practice that they deploy efficiently in the resolution of a roughly standardized set of problems where they deploy (broken leg, defending a libel trial, bridging a river, ensuring finances are correct).

Control, statistics, economics, operations research are all established professions. Techniques are established, often at undergraduate level, and graduation to the profession is regulated by professional bodies. This system works well as long as the problems we are easily categorized and mapped onto the existing set of known problems.

However, at another level our separate professions of OR, statistics and control engineering are just different views on the same problem. Just as any tribe of humans need to eat and sleep, so do these professions depend on data, modelling, optimization and decision-making.

We are doing something that has never been done before, optimizing and evolving very large scale automated decision making networks. The ambition to scale and automate in a data driven manner means that a tribal approach to problem solving can hinder our progress. Any tribe of hunter gatherers would struggle to understand the operation of a modern city. Similarly, supply chain needs to develop cross-functional skill sets to address the modern problems we face, not the problems that were formulated in the past.

Many of the challenges we face are at the interface between our tribal expertize. We have particular cost functions we are trying to minimize (an expertise of OR) but we have large scale feedbacks in our system (an expertise of control). We also want our systems to be adaptive to changing circumstances, to perform the best action given the data available (an expertise of machine learning and statistics).

Taking the tribal analogy further, we could imagine each of our professions as a separate tribe of hunter-gathers, each with particular expertise (e.g. fishing, deer hunting, trapping). Each of these tribes has their own approach to eating to survive, just as each of our localized professions has its own approach to modelling. But in this analogy, the technological landscapes we face are not wildernesses, they are emerging metropolises. Our new task is to feed our population through a budding network of supermarkets. While we may be sourcing our food in the same way, this requires new types of thinking that don't belong in the pure domain of any of our existing tribes.

For our biggest challenges, focusing on the differences between these fields is unhelpful, we should consider their strengths and how they overlap. Fundamentally all these fields are focused on taking the right action given the information available to us. They need to work in synergy for us to make progress.

While there is some discomfort in talking across field boundaries, it is critical to disconforming our current beliefs and generating the new techniques we need to address the challenges before us.

Recommendation: We should be aware of the limitations of a single tribal view of any of our problem sets. Where our modelling is dominated by one perspective (e.g. economics, OR, control, ML) we should ensure cross fertilization of ideas occurs through scientific review and team rotation mechanisms that embed our scientists (for a short period) in different teams across our organizations.

## Challenges 

The field of data science is rapidly evolving. Different practitioners from different domains have their own perspectives. In this post we identify three broad challenges that are emerging. Challenges which have not been addressed in the traditional sub-domains of data science. The challenges have social implications but require technological advance for their solutions.

The first challenge we'd like to highlight is the unusual paradoxes of the data society. It is too early to determine whether these paradoxes are fundmental or transient. Evidence for them is still somewhat anecdotal, but they seem worthy of further attention.

We are now able to quantify to a greater and greater degree the actions of individuals in society, and this might lead us to believe that social science, politics, economics are becoming quantifiable. We are able to get a far richer characterization of the world around us. Paradoxically it seems that as we measure more, we understand less.

How could this be possible? It may be that the greater preponderance of data is making society itself more complex. Therefore traditional approaches to measurement (e.g. polling by random sub sampling) are becoming harder, for example due to more complex batch effects, a greater stratification of society where it is more difficult to weigh the various sub-populations correctly.

The end result is that we have a Curate's egg of a society: it is only 'measured in parts'. Whether by examination of social media or through polling we no longer obtain the overall picture that can be necessary to obtain the depth of understanding we require.

One example of this phenomenon is the 2015 UK election which polls had as a tie and yet in practice was won by the Conservative party with a seven point advantage. A post-election poll which was truly randomized suggested that this lead was measurable, but pre-election polls are conducted on line and via phone. These approaches can under represent certain sectors. The challenge is that the truly randomized poll is expensive and time consuming. In practice on line and phone polls are usually weighted to reflect the fact that they are not truly randomized, but in a rapidly evolving society the correct weights may move faster than they can be tracked.

Another example is clinical trials. Once again they are the preserve of randomized studies to verify the efficacy of the drug. But now, rather than population becoming more stratified, it is the more personalized nature of the drugs we wish to test. A targeted drug which has efficacy in a sub-population may be harder to test due to difficulty in recruiting the sub-population, the benefit of the drug is also for a smaller sub-group, so expense of drug trials increases.

There are other less clear cut manifestations of this phenomenon. We seem to rely increasingly on social media as a news source, or as a indicator of opinion on a particular subject. But it is beholden to the whims of a vocal minority.

Similar to the way we required more paper when we first developed the computer, the solution is more classical statistics. We need to do more work to verify the tentative conclusions we produce so that we know that our new methodologies are effective.

As we increase the amount of data we acquire, we seem to be able to get better at characterizing the actions of individuals, predicting how they will behave. But we seem, somehow, to be becoming less capable at understanding society. Somehow it seems that as we measure more, we understand less.

That seems counter-intuitive. But perhaps the preponderance of data is making society itself, or the way we measure society, somehow more complex. And in turn, this means that traditional approaches to measurement are failing. So when we realize we are getting better at characterising individuals, perhaps we are only measuring society in parts.

Classical approaches to data analysis made use of many subjects to achieve statistical power. Traditionally, we measure a few things about many people. For example cardiac disease risks can be based on a limited number of factors inmany patients (such as whether the patient smokes, blood pressure, cholesterol levels etc). Because, traditionally, data matrices are stored with individuals in rows and features in columns4, we refer to this as depth of measurement. In statistics this is sometimes known as the large p, small n domain because traditionally p is used to denote the number of features we know about an individual and n is used to denote the number of individuals.

The data-revolution is giving us access to far more detail about each individual, this is leading to a breadth of coverage. This characteristic first came to prominence in computational biology and genomics where we became able to record information about mutations and transcription in millions of genes. So p became very large, but due to expense of measurement, the number of patients recorded, n, was relatively small. But we now see this increasingly for other domains. With an increasing number of sensors on our wrists or in our mobile phones, we are characterizing indivdiuals in unprecedented detail. This domain can also be effectively dealt with by modifying the models that are used for the data.

So we can know an individual extremely well, or we can know a population well. The saying "Can't see the wood for the trees", means we are distracted by the individual trees in a forest, and can't see the wider context. This seems appropriate for what may be going on here. We are becoming distracted by the information on the individual and we can't see the wider context of the data.

We know that a rigorous, randomized, study would characterize that forest well, but it seems we are unwilling to invest the money required to do that and the proxies we are using are no longer effective, perhaps because of shifting patterns of behaviour driven by the rapidly evolving digital world.

Further, it's likely that we are interested in strata within our data set. Equivalent to the structure within the forest: a clearing, a transition between types of tree, a shift in the nature of the undergrowth.

## Examples

Examples exhibiting this phenomenon include recent elections, which have proven difficult to predict. Including, the UK 2015 elections, the EU referendum, the US 2016 elections and the UK 2017 elections. In each case individuals may have taken actions on the back of polls that showed one thing or another but turned out to be inaccurate. Indeed, the only accurate pre-election poll for the UK 2017 election, the YouGov poll, was not a traditional poll, it contains a new type of statistical model called Multilevel Regression and Poststratification (MRP) (Gelman and Hill 2006).

Another example is stratified medicine. If a therapy is effective only in a sub-type of a disease, then statistical power can be lost across the whole population, particularly when that sub-type is a minority. But characterization of that sub-type is difficult. For example, new cancer immunotherapy treatments can have a dramatic effect, leading to almost total elimination of the cancer in some patients, but characterizing this sub-population is hard. This also makes it hard to develop clinical trials that prove the efficacy of the drugs.

A final example is our measurement of our economy, which increasingly may not capture where value is being generated. This is characterized by the changing nature of work, and the way individuals contribute towards society. For example, the open source community has driven the backbone of the majority of operating system software we use today, as well as cloud compute. But this value is difficult to measure as it was contributed by volunteers, not by a traditional corporate structure. Data itself may be driving this change, because the value of data accumulates in a similar way to the value of capital. The movement of data in the economy, and the value it generates is also hard to measure, and it seems there may be a large class of "have nots", in terms of those industries whose productivity has suffered relative to the top performers. The so-called productivity gap may not just be due to skills and infrastructure, but also due to data-skills and data-infrastructure.

## Challenges

The nature of the digital society has a closed loop feedback on itself. This is characterized by social media memes, which focus attention on particular issues very quickly. A good example being the photograph of Aylan Kurdi, the young Syrian boy found drowned on a Turkish beach. This photograph had a dramatic effect on attitudes towards immigration, more than the statistics that were showing that thousands were dieing in the Mediterranean each month (see this report by the University of Sheffield's Social Media Lab). Similarly, the changed dynamics of our social circles. Filter bubbles, where our searches and/or newsfeed has been personalized to things that algorithms already know we like. Echo chambers, where we interact mainly with people we agree with and our opinions aren't challenged. Each of these is changing the dynamic of society, and yet there is a strong temptation to use digital media for surveying information.

## Solutions

The solutions to these challenges come in three flavours. Firstly, there is a need for more data. In particular data that is actively acquired to cover the gaps in our knowledge. We also need more use of classical statistical techniques, and a wider understanding of what they involve. This situation reminds me somewhat of the idea of the 'paperless office'. The innovative research at Xerox PARC that brought us the Graphical User Interface, so prevalent today, was driven by the realization, in the 1970s that eventually offices would stop using paper. Xerox focussed research on what that office would look like as it was a perceived threat to their business. The paperless office may still come, but in practice computers brought about a significant increase in the need for paper due to the additional amounts of information that they caused to be summarized or generated. In a similar way, the world of big data is driving a need for more experimental design and more classical statistics. Any perception of the automated computer algorithm that drives all before it is at least as far away as the paperless office was in the 1970s.

We also need a better social, cognitive and biological understanding of humans and how we and our social structures respond to these interventions. Over time some of the measurables will likely stabilize, but it is not yet clear which ones.

# Quantifying the Value of Data 

The situation is reminiscent of a thirsty castaway, set adrift. There is a sea of data, but it is not fit to drink. We need some form of data desalination before it can be consumed. But like real desalination, this is a non trivial process, particularly if we want to achieve it at scale.

There's a sea of data, but most of it is undrinkable.

We require data-desalination before it can be consumed!

I spoke about the challenges in data science at the NIPS 2016 Workshop on Machine Learning for Health. NIPS mainly focuses on machine learning methodologies, and many of the speakers were doing so. But before my talk, I listened to some of the other speakers talk about the challenges they had with data preparation.

• 90% of our time is spent on validation and integration (Leo Anthony Celi)
• "The Dirty Work We Don't Want to Think About" (Eric Xing)
• "Voodoo to get it decompressed" (Francisco Giminez)

A further challenge in healthcare is that the data is collected by clinicians, often at great inconvenience to both themselves and the patient, but the control of the data is sometimes used to steer the direction of research.

The fact that we put so much effort into processing the data, but so little into allocating credit for this work is a major challenge for realizing the benefit in the data we have.

This type of work is somewhat thankless, with the exception of the clinicians' control of the data, which probably takes things too far, those that collate and correct data sets gain little credit. In the domain of reinforcement learning the aim is to take a series of actions to achieve a stated goal and gain a reward. The credit assignment problem is the challenge in the learning algorithm of distributing credit to each of the actions which brought about the reward. We also experience this problem in society, we use proxies such as monetary reward to incentivise intermediate steps in our economy. Modern society functions because we agree to make basic expenditure on infrastructure, such as roads, which we all make use of. Our data-society is not sufficiently mature to be correctly crediting and rewarding those that undertake this work.

We need to properly incetivize the sharing and production of clean data sets, we need to correctly quantify the value in the contribution of each actor, otherwise there won't be enough clean data to satiate the thirst of our decision making processes.

import pods
pods.notebook.display_plots('pomdp{samp:0>3}.svg',
directory='../slides/diagrams/', samp=(1, 4))

The value of shared data infrastructures in computational biology was recognized by the 2010 joint statement from the Wellcome Trust and other funders of research at the "Foggy Bottom" meeting. They recognised three key benefits to sharing of health data:

• faster progress in improving health
• better value for money
• higher quality science

But incentivising sharing requires incentivising collection and collation of data, and the associated credit allocation models.

Data Readiness Levels (Lawrence 2017) are an attempt to develop a language around data quality that can bridge the gap between technical solutions and decision makers such as managers and project planners. The are inspired by Technology Readiness Levels which attempt to quantify the readiness of technologies for deployment.b

Data-readiness describes, at its coarsest level, three separate stages of data graduation.

• Transition: data becomes electronically available
• Transition: pose a question to the data.

The important definitions are at the transition. The move from Grade C data to Grade B data is delimited by the electronic availability of the data. The move from Grade B to Grade A data is delimited by posing a question or task to the data (Lawrence 2017).

The first grade refers to the accessibility of data. Most data science practitioners will be used to working with data-providers who, perhaps having had little experience of data-science before, state that they "have the data". More often than not, they have not verified this. A convenient term for this is "Hearsay Data", someone has heard that they have the data so they say they have it. This is the lowest grade of data readiness.

Progressing through Grade C involves ensuring that this data is accessible. Not just in terms of digital accessiblity, but also for regulatory, ethical and intellectual property reasons.

Data transits from Grade C to Grade B once we can begin digital analysis on the computer. Once the challenges of access to the data have been resolved, we can make the data available either via API, or for direct loading into analysis software (such as Python, R, Matlab, Mathematica or SPSS). Once this has occured the data is at B4 level. Grade B involves the validity of the data. Does the data really represent what it purports to? There are challenges such as missing values, outliers, record duplication. Each of these needs to be investigated.

Grade B and C are important as if the work done in these grades is documented well, it can be reused in other projects. Reuse of this labour is key to reducing the costs of data-driven automated decision making. There is a strong overlap between the work required in this grade and the statistical field of exploratory data analysis (Tukey 1977).

The need for Grade B emerges due to the fundamental change in the availability of data. Classically, the scientific question came first, and the data came later. This is still the approach in a randomized control trial, e.g. in A/B testing or clinical trials for drugs. Today data is being laid down by happenstance, and the question we wish to ask about the data often comes after the data has been created. The Grade B of data readiness ensures thought can be put into data quality before the question is defined. It is this work that is reusable across multiple teams. It is these processes that the team which is standing up the data must deliver.

Once the validity of the data is determined, the data set can be considered for use in a particular task. This stage of data readiness is more akin to what machine learning scientists are used to doing in Universities. Bringing an algorithm to bear on a well understood data set.

This is the stage where the data and the model are brought together, so expertise in learning algorithms and their application is key. Further ethical considerations, such as the fairness of the resulting predictions are required at this stage. At the end of this stage a prototype model is ready for deployment.

Deployment and maintenance of machine learning models in production is another important issue which Data Readiness Levels are only a part of the solution for.

## Recursive Effects

To find out more, or to contribute ideas go to http://data-readiness.org

You can also check my blog post on "Data Readiness Levels"

Throughout the data preparation pipeline, it is important to have close interaction between data scientists and application domain experts. Decisions on data preparation taken outside the context of application have dangerous downstream consequences. This provides an additional burden on the data scientist as they are required for each project, but it should also be seen as a learning and familiarization exercise for the domain expert. Long term, just as biologists have found it necessary to assimilate the skills of the bioinformatician to be effective in their science, most domains will also require a familiarity with the nature of data driven decision making and its application. Working closely with data-scientists on data preparation is one way to begin this sharing of best practice.

The processes involved in Grade C and B are often badly taught in courses on data science. Perhaps not due to a lack of interest in the areas, but maybe more due to a lack of access to real world examples where data quality is poor.

These stages of data science are also ridden with ambiguity. In the long term they could do with more formalization, and automation, but best practice needs to be understood by a wider community before that can happen.

# Assessing the Organizations Readiness 

Assessing the readiness of data for analysis is one action that can be taken, but assessing teams that need to assimilate the information in the data is the other side of the coin. With this in mind both Damon Civin and Nick Elprin have independently proposed the idea of a "Data Joel Test". A "Joel Test" is a short questionaire to establish the ability of a team to handle software engineering tasks. It is designed as a rough and ready capability assessment. A "Data Joel Test" is similar, but for assessing the capability of a team in performing data science.

## Privacy, Loss of Control and Marginalization 

Society is becoming harder to monitor, but the individual is becoming easier to monitor. Social media monitoring for 'hate speech' can easily be turned to monitoring of political dissent. Marketing becomes more sinister when the target of the marketing is so well understood and the digital environment of the target is so well controlled.

### Bandwidth Constrained Conversations

import pods
from ipywidgets import IntSlider
pods.notebook.display_plots('anne-bob-conversation{sample:0>3}.svg',
'../slides/diagrams',  sample=IntSlider(0, 0, 7, 1))

Embodiment factors imply that, in our communication between humans, what is not said is, perhaps, more important than what is said. To communicate with each other we need to have a model of who each of us are.

To aid this, in society, we are required to perform roles. Whether as a parent, a teacher, an employee or a boss. Each of these roles requires that we conform to certain standards of behaviour to facilitate communication between ourselves.

Control of self is vitally important to these communications.

The high availability of data available to humans undermines human-to-human communication channels by providing new routes to undermining our control of self.

The consequences between this mismatch of power and delivery are to be seen all around us. Because, just as driving an F1 car with bicycle wheels would be a fine art, so is the process of communication between humans.

If I have a thought and I wish to communicate it, I first of all need to have a model of what you think. I should think before I speak. When I speak, you may react. You have a model of who I am and what I was trying to say, and why I chose to say what I said. Now we begin this dance, where we are each trying to better understand each other and what we are saying. When it works, it is beautiful, but when misdeployed, just like a badly driven F1 car, there is a horrible crash, an argument.

What does it mean for our free will if a computer can predict our individual behavior better than we ourselves can?

There is potential for both explicit and implicit discrimination on the basis of race, religion, sexuality or health status. All of these are prohibited under European law, but can pass unawares or be implicit.

The GDPR is the General Data Protection Regulation, but a better name for it would simpl by Good Data Practice Rules. It covers how to deal with discrimination which has a consequential effect on the individual. For example, entrance to University, access to loans or insurance. But the new phenomenon is dealing with a series of inconsequential decisions that taken together have a consequential effect.

Statistics as a community is also focussed on the single consequential effect of an analysis (efficacy of drugs, or distribution of Mosquito nets). Associated with happenstance data is happenstance decision making.

These algorithms behind these decisions are developed in a particular context. The so-called Silicon Valley bubble. But they are deployed across the world. To address this, a key challenge is capacity building in contexts which are remote from the Western norm.

## Data Science Africa 

Data Science Africa is a bottom up initiative for capacity building in data science, machine learning and artificial intelligence on the African continent.

As of 2019 there have been five workshops and five schools, located in Nyeri, Kenya (twice); Kampala, Uganda; Arusha, Tanzania and Abuja, Nigeria. In 2019 there are two events scheduled in Addis Ababa, Ethiopia and Accra, Ghana.

The main notion is end-to-end data science. For example, going from data collection in the farmer's field to decision making in the Ministry of Agriculture. Or going from malaria disease counts in health centers, to medicine distribution.

The philosophy is laid out in (Lawrence 2015). The key idea is that the modern information infrastructure presents new solutions to old problems. Modes of development change because less capital investment is required to take advantage of this infrastructure. The philosophy is that local capacity building is the right way to leverage these challenges in addressing data science problems in the African context.

Data Science Africa is now a non-govermental organization registered in Kenya. The organising board of the meeting is entirely made up of scientists and academics based on the African continent.

## Example: Prediction of Malaria Incidence in Uganda 

As an example of using Gaussian process models within the full pipeline from data to decsion, we'll consider the prediction of Malaria incidence in Uganda. For the purposes of this study malaria reports come in two forms, HMIS reports from health centres and Sentinel data, which is curated by the WHO. There are limited sentinel sites and many HMIS sites.

The work is from Ricardo Andrade Pacheco's PhD thesis, completed in collaboration with John Quinn and Martin Mubangizi (Andrade-Pacheco et al. 2014; Mubangizi et al. 2014). John and Martin were initally from the AI-DEV group from the University of Makerere in Kampala and more latterly they were based at UN Global Pulse in Kampala.

Malaria data is spatial data. Uganda is split into districts, and health reports can be found for each district. This suggests that models such as conditional random fields could be used for spatial modelling, but there are two complexities with this. First of all, occasionally districts split into two. Secondly, sentinel sites are a specific location within a district, such as Nagongera which is a sentinel site based in the Tororo district.

(Andrade-Pacheco et al. 2014; Mubangizi et al. 2014)

The common standard for collecting health data on the African continent is from the Health management information systems (HMIS). However, this data suffers from missing values (Gething et al. 2006) and diagnosis of diseases like typhoid and malaria may be confounded.

World Health Organization Sentinel Surveillance systems are set up "when high-quality data are needed about a particular disease that cannot be obtained through a passive system". Several sentinel sites give accurate assessment of malaria disease levels in Uganda, including a site in Nagongera.

In collaboration with the AI Research Group at Makerere we chose to investigate whether Gaussian process models could be used to assimilate information from these two different sources of disease informaton. Further, we were interested in whether local information on rainfall and temperature could be used to improve malaria estimates.

The aim of the project was to use WHO Sentinel sites, alongside rainfall and temperature, to improve predictions from HMIS data of levels of malaria.

## Early Warning Systems

Health monitoring system for the Kabarole district. Here we have fitted the reports with a Gaussian process with an additive covariance function. It has two components, one is a long time scale component (in red above) the other is a short time scale component (in blue).

Monitoring proceeds by considering two aspects of the curve. Is the blue line (the short term report signal) above the red (which represents the long term trend? If so we have higher than expected reports. If this is the case and the gradient is still positive (i.e. reports are going up) we encode this with a red color. If it is the case and the gradient of the blue line is negative (i.e. reports are going down) we encode this with an amber color. Conversely, if the blue line is below the red and decreasing, we color green. On the other hand if it is below red but increasing, we color yellow.

This gives us an early warning system for disease. Red is a bad situation getting worse, amber is bad, but improving. Green is good and getting better and yellow good but degrading.

Finally, there is a gray region which represents when the scale of the effect is small.

These colors can now be observed directly on a spatial map of the districts to give an immediate impression of the current status of the disease across the country.

# Data Trusts 

The machine learning solutions we are dependent on to drive automated decision making are dependent on data. But with regard to personal data there are important issues of privacy. Data sharing brings benefits, but also exposes our digital selves. From the use of social media data for targeted advertising to influence us, to the use of genetic data to identify criminals, or natural family members. Control of our virtual selves maps on to control of our actual selves.

The fuedal system that is implied by current data protection legislation has signficant power asymmetries at its heart, in that the data controller has a duty of care over the data subject, but the data subject may only discover failings in that duty of care when it's too late. Data controllers also may have conflicting motivations, and often their primary motivation is not towards the data-subject, but that is a consideration in their wider agenda.

Data Trusts (Edwards 2004; Lawrence 2016; Delacroix and Lawrence 2018) are a potential solution to this problem. Inspired by land societies that formed in the 19th century to bring democratic representation to the growing middle classes. A land society was a mutual organisation where resources were pooled for the common good.

A Data Trust would be a legal entity where the trustees responsibility was entirely to the members of the trust. So the motivation of the data-controllers is aligned only with the data-subjects. How data is handled would be subject to the terms under which the trust was convened. The success of an individual trust would be contingent on it satisfying its members with appropriate balancing of individual privacy with the benefits of data sharing.

Formation of Data Trusts became the number one recommendation of the Hall-Presenti report on AI, but the manner in which this is done will have a significant impact on their utility. It feels important to have a diversity of approaches, and yet it feels important that any individual trust would be large enough to be taken seriously in representing the views of its members in wider negotiations.

Addressing challenges in privacy, loss of control and marginalization includes ensuring that the individual retains control of their own data. We accept privacy in our real loves, we need to accept it in our digital persona. This is vital for our control of persona and our ability to project ourselves.

Fairness goes hand in hand with privacy to protect the individual. Regulations like the GDPR date from a time where the main worry was consequential decision making but today we also face problems from accumulation of inconsequential decisions leading to a resulting consequential effect.

Capacity building in different contexts, empowering domain experts to solve their own problems, is one aspect to the solution. A further proposal is the use of data trusts to reintroduce control of personal data for the individual.

You can also check my blog post on "Three Data Science Challenges"

# The Three Ds of Machine Learning Systems Design 

We can characterize the challenges for integrating machine learning within our systems as the three Ds. Decomposition, Data and Deployment.

You can also check my blog post on "The 3Ds of Machine Learning Systems Design"

The first two components decomposition and data are interlinked, but we will first outline the decomposition challenge. Below we will mainly focus on supervised learning because this is arguably the technology that is best understood within machine learning.

## Decomposition 

Machine learning is not magical pixie dust, we cannot simply automate all decisions through data. We are constrained by our data (see below) and the models we use.5 Machine learning models are relatively simple function mappings that include characteristics such as smoothness. With some famous exceptions, e.g. speech and image data, inputs are constrained in the form of vectors and the model consists of a mathematically well behaved function. This means that some careful thought has to be put in to the right sub-process to automate with machine learning. This is the challenge of decomposition of the machine learning system.

Any repetitive task is a candidate for automation, but many of the repetitive tasks we perform as humans are more complex than any individual algorithm can replace. The selection of which task to automate becomes critical and has downstream effects on our overall system design.

### Pigeonholing

Some aspects to take into account are

1. Can we refine the decision we need to a set of repetitive tasks where input information and output decision/value is well defined?
2. Can we represent each sub-task we’ve defined with a mathematical mapping?

The representation necessary for the second aspect may involve massaging of the problem: feature selection or adaptation. It may also involve filtering out exception cases (perhaps through a pre-classification).

All else being equal, we’d like to keep our models simple and interpretable. If we can convert a complex mapping to a linear mapping through clever selection of sub-tasks and features this is a big win.

For example, Facebook have feature engineers, individuals whose main role is to design features they think might be useful for one of their tasks (e.g. newsfeed ranking, or ad matching). Facebook have a training/testing pipeline called FBLearner. Facebook have predefined the sub-tasks they are interested in, and they are tightly connected to their business model.

It is easier for Facebook to do this because their business model is heavily focused on user interaction. A challenge for companies that have a more diversified portfolio of activities driving their business is the identification of the most appropriate sub-task. A potential solution to feature and model selection is known as AutoML (Feurer et al., n.d.). Or we can think of it as using Machine Learning to assist Machine Learning. It’s also called meta-learning. Learning about learning. The input to the ML algorithm is a machine learning task, the output is a proposed model to solve the task.

One trap that is easy to fall in is too much emphasis on the type of model we have deployed rather than the appropriateness of the task decomposition we have chosen.

Recommendation: Conditioned on task decomposition, we should automate the process of model improvement. Model updates should not be discussed in management meetings, they should be deployed and updated as a matter of course. Further details below on model deployment, but model updating needs to be considered at design time. This is the domain of AutoML.

To form modern decision making systems, many components are interlinked. We decompose our complex decision making into individual tasks, but the performance of each component is dependent on those upstream of it.

This naturally leads to co-evolution of systems, upstream errors can be compensated by downstream corrections.

To embrace this characteristic, end-to-end training could be considered. Why produce the best forecast by metrics when we can just produce the best forecast for our systems? End-to-end training can lead to improvements in performance, but it would also damage our systems decomposability and its interpretability, and perhaps its adaptability.

The less human interpretable our systems are, the harder they are to adapt to different circumstances or diagnose when there's a challenge. The trade-off between interpretability and performance is a constant tension which we should always retain in our minds when performing our system design.

## Data 

It is difficult to overstate the importance of data. It is half of the equation for machine learning, but is often utterly neglected. We can speculate that there are two reasons for this. Firstly, data cleaning is perceived as tedious. It doesn’t seem to consist of the same intellectual challenges that are inherent in constructing complex mathematical models and implementing them in code. Secondly, data cleaning is highly complex, it requires a deep understanding of how machine learning systems operate and good intuitions about the data itself, the domain from which data is drawn (e.g. Supply Chain) and what downstream problems might be caused by poor data quality.

A consequence of these two reasons, data cleaning seems difficult to formulate into a readily teachable set of principles. As a result it is heavily neglected in courses on machine learning and data science. Despite data being half the equation, most University courses spend little to no time on its challenges.

Anecdotally, talking to data modelling scientists. Most say they spend 80% of their time acquiring and cleaning data. This is precipitating what I refer to as the “data crisis”. This is an analogy with software. The “software crisis” was the phenomenon of inability to deliver software solutions due to increasing complexity of implementation. There was no single shot solution for the software crisis, it involved better practice (scrum, test orientated development, sprints, code review), improved programming paradigms (object orientated, functional) and better tools (CVS, then SVN, then git).

## The Data Crisis 

Anecdotally, talking to data modelling scientists. Most say they spend 80% of their time acquiring and cleaning data. This is precipitating what I refer to as the “data crisis”. This is an analogy with software. The “software crisis” was the phenomenon of inability to deliver software solutions due to increasing complexity of implementation. There was no single shot solution for the software crisis, it involved better practice (scrum, test orientated development, sprints, code review), improved programming paradigms (object orientated, functional) and better tools (CVS, then SVN, then git).

However, these challenges aren't new, they are merely taking a different form. From the computer's perspective software is data. The first wave of the data crisis was known as the software crisis.

### The Software Crisis

In the late sixties early software programmers made note of the increasing costs of software development and termed the challenges associated with it as the "Software Crisis". Edsger Dijkstra referred to the crisis in his 1972 Turing Award winner's address.

The major cause of the software crisis is that the machines have become several orders of magnitude more powerful! To put it quite bluntly: as long as there were no machines, programming was no problem at all; when we had a few weak computers, programming became a mild problem, and now we have gigantic computers, programming has become an equally gigantic problem.

Edsger Dijkstra (1930-2002), The Humble Programmer

The major cause of the data crisis is that machines have become more interconnected than ever before. Data access is therefore cheap, but data quality is often poor. What we need is cheap high quality data. That implies that we develop processes for improving and verifying data quality that are efficient.

There would seem to be two ways for improving efficiency. Firstly, we should not duplicate work. Secondly, where possible we should automate work.

What I term "The Data Crisis" is the modern equivalent of this problem. The quantity of modern data, and the lack of attention paid to data as it is initially "laid down" and the costs of data cleaning are bringing about a crisis in data-driven decision making. This crisis is at the core of the challenge of technical debt in machine learning (Sculley et al. 2015).

Just as with software, the crisis is most correctly addressed by 'scaling' the manner in which we process our data. Duplication of work occurs because the value of data cleaning is not correctly recognised in management decision making processes. Automation of work is increasingly possible through techniques in "artificial intelligence", but this will also require better management of the data science pipeline so that data about data science (meta-data science) can be correctly assimilated and processed. The Alan Turing institute has a program focussed on this area, AI for Data Analytics.

Data is the new software, and the data crisis is already upon us. It is driven by the cost of cleaning data, the paucity of tools for monitoring and maintaining our deployments, the provenance of our models (e.g. with respect to the data they’re trained on).

Three principal changes need to occur in response. They are cultural and infrastructural.

First of all, to excel in data driven decision making we need to move from a software first paradigm to a data first paradigm. That means refocusing on data as the product. Software is the intermediary to producing the data, and its quality standards must be maintained, but not at the expense of the data we are producing. Data cleaning and maintenance need to be prized as highly as software debugging and maintenance. Instead of software as a service, we should refocus around data as a service. This first change is a cultural change in which our teams think about their outputs in terms of data. Instead of decomposing our systems around the software components, we need to decompose them around the data generating and consuming components.6 Software first is only an intermediate step on the way to be coming data first. It is a necessary, but not a sufficient condition for efficient machine learning systems design and deployment. We must move from software orientated architecture to a data orientated architecture.

### Data Quality

Secondly, we need to improve our language around data quality. We cannot assess the costs of improving data quality unless we generate a language around what data quality means. Data Readiness Levels7 are an assessment of data quality that is based on the usage to which data is put.

Recommendation: Build a shared understanding of the language of data readiness levels for use in planning documents and costing of data cleaning and the benefits of reusing cleaned data.

### Move Beyond Software Engineering to Data Engineering

Thirdly, we need to improve our mental model of the separation of data science from applied science. A common trap in our thinking around data is to see data science (and data engineering, data preparation) as a sub-set of the software engineer’s or applied scientist’s skill set. As a result we recruit and deploy the wrong type of resource. Data preparation and question formulation is superficially similar to both because of the need for programming skills, but the day to day problems faced are very different.

## Data Science as Debugging 

One challenge for existing information technology professionals is realizing the extent to which a software ecosystem based on data differs from a classical ecosystem. In particular, by ingesting data we bring unknowns/uncontrollables into our decision making system. This presents opportunity for adversarial exploitation and unforeseen operation.

You can also check my blog post on "Data Science as Debugging"

Starting with the analysis of a data set, the nature of data science is somewhat difference from classical software engineering.

One analogy I find helpful for understanding the depth of change we need is the following. Imagine as a software engineer, you find a USB stick on the ground. And for some reason you know that on that USB stick is a particular API call that will enable you to make a significant positive difference on a business problem. You don't know which of the many library functions on the USB stick are the ones that will help. And it could be that some of those library functions will hinder, perhaps because they are just inappropriate or perhaps because they have been placed there malisciously. The most secure thing to do would be to not introduce this code into your production system at all. But what if your manager told you to do so, how would you go about incorporating this code base?

The answer is very carefully. You would have to engage in a process more akin to debugging than regular software engineering. As you understood the code base, for your work to be reproducible, you should be documenting it, not just what you discovered, but how you discovered it. In the end, you typically find a single API call that is the one that most benefits your system. But more thought has been placed into this line of code than any line of code you have written before.

An enormous amount of debugging would be required. As the nature of the code base is understood, software tests to verify it also need to be constructed. At the end of all your work, the lines of software you write to actually interact with the software on the USB stick are likely to be minimal. But more thought would be put into those lines than perhaps any other lines of code in the system.

Even then, when your API code is introduced into your production system, it needs to be deployed in an environment that monitors it. We cannot rely on an individual’s decision making to ensure the quality of all our systems. We need to create an environment that includes quality controls, checks and bounds, tests, all designed to ensure that assumptions made about this foreign code base are remaining valid.

This situation is akin to what we are doing when we incorporate data in our production systems. When we are consuming data from others, we cannot assume that it has been produced in alignment with our goals for our own systems. Worst case, it may have been adversarialy produced. A further challenge is that data is dynamic. So, in effect, the code on the USB stick is evolving over time.

It might see that this process is easy to formalize now, we simply need to check what the formal software engineering process is for debugging, because that is the current software engineering activity that data science is closest to. But when we look for a formalization of debugging we find that there is none. Indeed, modern software engineering mainly focusses on ensuring that code is written without bugs in the first place.

Recommendation: Anecdotally, resolving a machine learning challenge requires 80% of the resource to be focused on the data and perhaps 20% to be focused on the model. But many companies are too keen to employ machine learning engineers who focus on the models, not the data. We should change our hiring priorities and training. Universities cannot provide the understanding of how to data-wrangle. Companies must fill this gap.

}

Recommendation: We need to share best practice around data deployment across our teams. We should make best use of our processes where applicable, but we need to develop them to become data first organizations. Data needs to be cleaned at output not at input.

## Deployment 

Much of the academic machine learning systems point of view is based on a software systems point of view that is around 20 years out of date. In particular we build machine learning models on fixed training data sets, and we test them on stationary test data sets.

In practice modern software systems involve continuous deployment of models into an ever evolving world of data. These changes are indicated in the software world by greater availaiblity of technologies like streaming technologies.

### Continuous Deployment

Once the decomposition is understood, the data is sourced and the models are created, the model code needs to be deployed.

To extend the USB stick analogy further, how would we deploy that code if we thought it was likely to evolve in production? This is what datadoes. We cannot assume that the conditions under which we trained our model will be retained as we move forward, indeed the only constant we have is change.

This means that when any data dependent model is deployed into production, it requires continuous monitoring to ensure the assumptions of design have not been invalidated. Software changes are qualified through testing, in particular a regression test ensures that existing functionality is not broken by change. Since data is continually evolving, machine learning systems require 'continual regression testing': oversight by systems that ensure their existing functionality has not been broken as the world evolves around them. An approach we refer to as progression testing. Unfortunately, standards around ML model deployment yet been developed. The modern world of continuous deployment does rely on testing, but it does not recognize the continuous evolution of the world around us.

Progression tests are likely to be statistical tests in contrast to classical software tests. The should be monitoring model performance and quality measures. They could also monitor conformance to standardized fairness measures.

If the world has changed around our decision making ecosystem, how are we alerted to those changes?

Recommendation: We establish best practice around model deployment. We need to shift our culture from standing up a software service, to standing up a data as a service. Data as a Service would involve continual monitoring of our deployed models in production. This would be regulated by 'hypervisor' systems8 that understand the context in which models are deployed and recognize when circumstance has changed and models need retraining or restructuring.

Recommendation: We should consider a major re-architecting of systems around our services. In particular we should scope the use of a streaming architecture (such as Apache Kafka) that ensures data persistence and enables asynchronous operation of our systems.9 This would enable the provision of QC streams, and real time dash boards as well as hypervisors..

Importantly a streaming architecture implies the services we build are stateless, internal state is deployed on streams alongside external state. This allows for rapid assessment of other services' data.

## GPy: A Gaussian Process Framework in Python 

GPy is a BSD licensed software code base for implementing Gaussian process models in python. This allows GPs to be combined with a wide variety of software libraries.

The software itself is available on GitHub and the team welcomes contributions.

The aim for GPy is to be a probabilistic-style programming language, i.e. you specify the model rather than the algorithm. As well as a large range of covariance functions the software allows for non-Gaussian likelihoods, multivariate outputs, dimensionality reduction and approximations for larger data sets.

## Emukit 

The aim is to provide a suite where different approaches to emulation are assimilated under one roof. The current version of Emukit includes multi-fidelity emulation for build surrogate models when data is obtained from multiple information sources that have different fidelity and/or cost; Bayesian optimisation for optimising physical experiments and tune parameters of machine learning algorithms or other computational simulations; experimental design and active learning: design the most informative experiments and perform active learning with machine learning models; sensitivity analysis: analyse the influence of inputs on the outputs of a given system; and Bayesian quadrature: efficiently compute the integrals of functions that are expensive to evaluate.

## Outlook for Machine Learning

Machine learning has risen to prominence as an approach to scaling our activities. For us to continue to automate in the manner we have over the last two decades, we need to make more use of computer-based automation. Machine learning is allowing us to automate processes that were out of reach before.

## Conclusion

We operate in a technologically evolving environment. Machine learning is becoming a key component in our decision making capabilities. But the evolving nature of data driven systems means that new approaches to model deployment are necessary. We have characterized three parts of the machine learning systems design process. Decomposition of the problem into separate tasks that are addressable with a machine learning solution. Collection and curation of appropriate data. Verificaction of data quality through data readiness levels. Using progression testing in our deployments. Continuously updating models as appropriate to ensure performance and quality is maintained.

# References

Ananthanarayanan, Rajagopal, Steven K. Esser, Horst D. Simon, and Dharmendra S. Modha. 2009. “The Cat Is Out of the Bag: Cortical Simulations with 109 Neurons, 1013 Synapses.” In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis - Sc ’09. doi:10.1145/1654059.1654124.

Andrade-Pacheco, Ricardo, Martin Mubangizi, John Quinn, and Neil D. Lawrence. 2014. “Consistent Mapping of Government Malaria Records Across a Changing Territory Delimitation.” Malaria Journal 13 (Suppl 1). doi:10.1186/1475-2875-13-S1-P5.

Copeland, B. Jack, ed. 2006. Colossus: The Secrets of Bletchley Park’s Code-Breaking Computers. Oxford University Press.

Delacroix, Sylvie, and Neil D. Lawrence. 2018. “Disturbing the ‘One Size Fits All’ Approach to Data Governance: Bottom-up Data Trusts.” SSRN. doi:10.2139/ssrn.3265315.

Edwards, Lilian. 2004. “The Problem with Privacy.” International Review of Law Computers & Technology 18 (3): 263–94.

Feurer, Matthias, Aaron Klein, Katharina Eggensperger, Jost Tobias Springenberg, Manuel Blum, and Frank Hutter. n.d. “Efficient and Robust Automated Machine Learning.” In Advances in Neural Information Processing Systems.

Gelman, Andrew, and Jennifer Hill. 2006. Data Analysis Using Regression and Multilevel/Hierarchical Models. Analytical Methods for Social Research. Cambridge, UK: Cambridge University Press. doi:10.1017/CBO9780511790942.

Gething, Peter W., Abdisalan M. Noor, Priscilla W. Gikandi, Esther A. A. Ogara, Simon I. Hay, Mark S. Nixon, Robert W. Snow, and Peter M. Atkinson. 2006. “Improving Imperfect Data from Health Management Information Systems in Africa Using Space–Time Geostatistics.” PLoS Medicine 3 (6). Public Library of Science. doi:10.1371/journal.pmed.0030271.

Husband, Phil, Owen Holland, and Michael Wheeler, eds. 2008. The Mechanical Mind in History. mit.

Lawrence, Neil D. 2015. “How Africa Can Benefit from the Data Revolution.” The Guardian Media & Tech Network. https://www.theguardian.com/media-network/2015/aug/25/africa-benefit-data-science-information.

———. 2016. “Data Trusts Could Allay Our Privacy Fears.” The Guardian Media & Tech Network. https://www.theguardian.com/media-network/2016/jun/03/data-trusts-privacy-fears-feudalism-democracy.

———. 2017. “Data Readiness Levels.” arXiv.

Mubangizi, Martin, Ricardo Andrade-Pacheco, Michael Thomas Smith, John Quinn, and Neil D. Lawrence. 2014. “Malaria Surveillance with Multiple Data Sources Using Gaussian Process Models.” In 1st International Conference on the Use of Mobile Ict in Africa.

Popper, Karl R. 1963. Conjectures and Refutations: The Growth of Scientific Knowledge. London: Routledge.

Reed, Charlotte, and Nathaniel I. Durlach. 1998. “Note on Information Transfer Rates in Human Communication.” Presence Teleoperators & Virtual Environments 7 (5): 509–18. doi:10.1162/105474698565893.

Sculley, D., Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-François Crespo, and Dan Dennison. 2015. “Hidden Technical Debt in Machine Learning Systems.” In Advances in Neural Information Processing Systems 28, edited by Corinna Cortes, Neil D. Lawrence, Daniel D. Lee, Masashi Sugiyama, and Roman Garnett, 2503–11. Curran Associates, Inc. http://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf.

Stigler, Stephen M. 1999. Statistics on the Table: The History of Statistical Concepts and Methods. Cambridge, MA: harvard.

Tukey, John W. 1977. Exploratory Data Analysis. Addison-Wesley.

Wiener, Norbert. 1948. Cybernetics: Control and Communication in the Animal and the Machine. Cambridge, MA: MIT Press.

1. the challenge of understanding what information pertains to is known as knowledge representation.

2. The logarithm of a number less than one is negative, for a number greater than one the logarithm is positive. So if odds are greater than evens (odds-on) the log-odds are positive, if the odds are less than evens (odds-against) the log-odds will be negative.

3. In statistics this is known as a design matrix, representing the design of a study. But in databases, one might think of each patient being in a row, or record of the database.

4. We can also become constrained by our tribal thinking, just as each of the other groups can.

5. This is related to challenges of machine learning and technical debt (Sculley et al. 2015), although we are trying to frame the solution here rather than the problem.

6. Data Readiness Levels (Lawrence 2017) are an attempt to develop a language around data quality that can bridge the gap between technical solutions and decision makers such as managers and project planners. The are inspired by Technology Readiness Levels which attempt to quantify the readiness of technologies for deployment.

7. Emulation, or surrogate modelling, is one very promising approach to forming such a hypervisor. Emulators are models we fit to other models, often simulations, but the could also be other machine learning models. These models operate at the meta-level, not on the systems directly. This means they can be used to model how the sub-systems interact. As well as emulators we should consider real time dash boards, anomaly detection, mutlivariate analysis, data visualization and classical statistical approaches for hypervision of our deployed systems.

8. These approaches are one area of focus for my own team's reasearch. A data first architecture is a prerequisite for efficient deployment of machine learning systems.