The 3Ds of Machine Learning Systems Design
There is a lot of talk about the fourth industrial revolution centered around AI. If we are at the start of the fourth industrial we also have the unusual honour of being the first to name our revolution before it’s occurred.
The technology that has driven the revolution in AI is machine learning. And when it comes to capitalising on the new generation of deployed machine learning solutions there are practical difficulties we must address.
In 1987 the economist Robert Solow quipped "You can see the computer age everywehere but in the productivity statistics". Thirty years later, we could equally apply that quip to the era of artificial intelligence.
From my perspective, the current era is merely the continuation of the information revolution. A revolution that was triggered by the wide availability of the silicon chip. But whether we are in the midst of a new revolution, or this is just the continuation of an existing revolution, it feels important to characterize the challenges of deploying our innovation and consider what the solutions may be.
There is no doubt that new technologies based around machine learning have opened opportunities to create new businesses. When home computers were introduced there were also new opportunities in software publishing, computer games and a magazine industry around it. The Solow paradox arose because despite this visible activity these innovations take time to percolate through to existing businesses.
Brownfield and Greenfield Innovation
Understanding how to make the best use of new technology takes time. I call this approach, brownfield innovation. In the construction industry, a brownfield site is land with pre-existing infrastructure on it, whereas a greenfield site is where construction can start afresh.
The term brownfield innovation arises by analogy. Brownfield innovation is when you are innovating in a space where there is pre-existing infrastructure. This is the challenge of introducing artificial intelligence to existing businesses. Greenfield innovation, is innovating in areas where no pre-existing infrastructure exists.
One way we can make it easier to benefit from machine learning in both greenfield and brownfield innovation is to better characterise the steps of machine learning systems design. Just as software systems design required new thinking, so does machine learning systems design.
In this post we characterise the process for machine learning systems design, convering some of the issues we face, with the 3D process: Decomposition1, Data and Deployment.
We will consider each component in turn, although there is interplace between components. Particularly between the task decomposition and the data availability. We will first outline the decomposition challenge.
One of the most succesful machine learning approaches has been supervised learning. So we will mainly focus on supervised learning because this is also, arguably, the technology that is best understood within machine learning.
Machine learning is not magical pixie dust, we cannot simply automate all decisions through data. We are constrained by our data (see below) and the models we use. Machine learning models are relatively simple function mappings that include characteristics such as smoothness. With some famous exceptions, e.g. speech and image data, inputs are constrained in the form of vectors and the model consists of a mathematically well behaved function. This means that some careful thought has to be put in to the right sub-process to automate with machine learning. This is the challenge of decomposition of the machine learning system.
Any repetitive task is a candidate for automation, but many of the repetitive tasks we perform as humans are more complex than any individual algorithm can replace. The selection of which task to automate becomes critical and has downstream effects on our overall system design.
The machine learning systems design process calls for separating a complex task into decomposable separate entities. A process we can think of as pigeonholing.
Some aspects to take into account are
- Can we refine the decision we need to a set of repetitive tasks where input information and output decision/value is well defined?
- Can we represent each sub-task we’ve defined with a mathematical mapping?
The representation necessary for the second aspect may involve massaging of the problem: feature selection or adaptation. It may also involve filtering out exception cases (perhaps through a pre-classification).
All else being equal, we’d like to keep our models simple and interpretable. If we can convert a complex mapping to a linear mapping through clever selection of sub-tasks and features this is a big win.
For example, Facebook have feature engineers, individuals whose main role is to design features they think might be useful for one of their tasks (e.g. newsfeed ranking, or ad matching). Facebook have a training/testing pipeline called FBLearner. Facebook have predefined the sub-tasks they are interested in, and they are tightly connected to their business model.
It is easier for Facebook to do this because their business model is heavily focused on user interaction. A challenge for companies that have a more diversified portfolio of activities driving their business is the identification of the most appropriate sub-task. A potential solution to feature and model selection is known as AutoML . Or we can think of it as using Machine Learning to assist Machine Learning. It’s also called meta-learning. Learning about learning. The input to the ML algorithm is a machine learning task, the output is a proposed model to solve the task.
One trap that is easy to fall in is too much emphasis on the type of model we have deployed rather than the appropriateness of the task decomposition we have chosen.
Recommendation: Conditioned on task decomposition, we should automate the process of model improvement. Model updates should not be discussed in management meetings, they should be deployed and updated as a matter of course. Further details below on model deployment, but model updating needs to be considered at design time. This is the domain of AutoML.
The answer to the question which comes first, the chicken or the egg is simple, they co-evolve . Similarly, when we place components together in a complex machine learning system, they will tend to co-evolve and compensate for one another.
To form modern decision making systems, many components are interlinked. We decompose our complex decision making into individual tasks, but the performance of each component is dependent on those upstream of it.
This naturally leads to co-evolution of systems, upstream errors can be compensated by downstream corrections.
To embrace this characteristic, end-to-end training could be considered. Why train individual systems by localized metrics when we can just optimize, globally, end-to-end for final system performance? End to end training can lead to improvements in performance, but it could also damage our systems decomposability, its interpretability, and perhaps its adaptability.
The less human interpretable our systems are, the harder they are to adapt to different circumstances or diagnose when there's a challenge. The trade-off between interpretability and performance is a constant tension which we should always retain in our minds when performing our system design.
It is difficult to overstate the importance of data. It is half of the equation for machine learning, but is often utterly neglected. We can speculate that there are two reasons for this. Firstly, data cleaning is perceived as tedious. It doesn’t seem to consist of the same intellectual challenges that are inherent in constructing complex mathematical models and implementing them in code. Secondly, data cleaning is highly complex, it requires a deep understanding of how machine learning systems operate and good intuitions about the data itself, the domain from which data is drawn (e.g. Supply Chain) and what downstream problems might be caused by poor data quality.
A consequence of these two reasons, data cleaning seems difficult to formulate into a readily teachable set of principles. As a result it is heavily neglected in courses on machine learning and data science. Despite data being half the equation, most University courses spend little to no time on its challenges.
However, these challenges aren't new, they are merely taking a different form.
Anecdotally, talking to data modelling scientists, most say they spend 80% of their time acquiring and cleaning data. The “software crisis” was the inability to deliver software solutions due to increasing complexity of implementation. There was no single shot solution for the software crisis, it involved better practice (scrum, test orientated development, sprints, code review), improved programming paradigms (object orientated, functional) and better tools (CVS, then SVN, then git).
The Software Crisis
The major cause of the software crisis is that the machines have become several orders of magnitude more powerful! To put it quite bluntly: as long as there were no machines, programming was no problem at all; when we had a few weak computers, programming became a mild problem, and now we have gigantic computers, programming has become an equally gigantic problem.
Edsger Dijkstra (1930-2002), The Humble Programmer
In the late sixties early software programmers made note of the increasing costs of software development and termed the challenges associated with it as the "Software Crisis". Edsger Dijkstra referred to the crisis in his 1972 Turing Award winner's address.
The Data Crisis
Modern software is driven by data. That is the essence of machine learning. We can therefore see the software crisis as merely the first wave of a (perhaps) larger phenomenon, "the data crisis".
We can update Dijkstra's quote for the modern era.
The major cause of the data crisis is that machines have become more interconnected than ever before. Data access is therefore cheap, but data quality is often poor. What we need is cheap high quality data. That implies that we develop processes for improving and verifying data quality that are efficient.
There would seem to be two ways for improving efficiency. Firstly, we should not duplicate work. Secondly, where possible we should automate work.
The quantity of modern data, and the lack of attention paid to data as it is initially "laid down" and the costs of data cleaning are bringing about a crisis in data-driven decision making. This crisis is at the core of the challenge of technical debt in machine learning .
Just as with software, the crisis is most correctly addressed by 'scaling' the manner in which we process our data. Duplication of work occurs because the value of data cleaning is not correctly recognised in management decision making processes. Automation of work is increasingly possible through techniques in "artificial intelligence", but this will also require better management of the data science pipeline so that data about data science (meta-data science) can be correctly assimilated and processed. The Alan Turing institute has a program focussed on this area, AI for Data Analytics.
Data is the new software, and the data crisis is already upon us. It is driven by the cost of cleaning data, the paucity of tools for monitoring and maintaining our deployments, the provenance of our models (e.g. with respect to the data they’re trained on).
Three principal changes need to occur in response. They are cultural and infrastructural.
The Data First Paradigm
First of all, to excel in data driven decision making we need to move from a software first paradigm to a data first paradigm. That means refocusing on data as the product. Software is the intermediary to producing the data, and its quality standards must be maintained, but not at the expense of the data we are producing. Data cleaning and maintenance need to be prized as highly as software debugging and maintenance. Instead of software as a service, we should refocus around data as a service. This first change is a cultural change in which our teams think about their outputs in terms of data. Instead of decomposing our systems around the software components, we need to decompose them around the data generating and consuming components.2 Software first is only an intermediate step on the way to be coming data first. It is a necessary, but not a sufficient condition for efficient machine learning systems design and deployment. We must move from software orientated architecture to a data orientated architecture.
Secondly, we need to improve our language around data quality. We cannot assess the costs of improving data quality unless we generate a language around what data quality means. Data Readiness Levels3 are an assessment of data quality that is based on the usage to which data is put.
Move Beyond Software Engineering to Data Engineering
Thirdly, we need to improve our mental model of the separation of data science from applied science. A common trap in current thinking around data is to see data science (and data engineering, data preparation) as a sub-set of the software engineer’s or applied scientist’s skill set. As a result we recruit and deploy the wrong type of resource. Data preparation and question formulation is superficially similar to both because of the need for programming skills, but the day to day problems faced are very different.
Recommendation: Build a shared understanding of the language of data readiness levels for use in planning documents, the costing of data cleaning, and the benefits of reusing cleaned data.
Combining Data and Systems Design
One analogy I find helpful for understanding the depth of change we need is the following. Imagine as a software engineer, you find a USB stick on the ground. And for some reason you know that on that USB stick is a particular API call that will enable you to make a significant positive difference on a business problem. However, you also know on that USB stick there is potentially malicious code. The most secure thing to do would be to not introduce this code into your production system. But what if your manager told you to do so, how would you go about incorporating this code base?
The answer is very carefully. You would have to engage in a process more akin to debugging than regular software engineering. As you understood the code base, for your work to be reproducible, you should be documenting it, not just what you discovered, but how you discovered it. In the end, you typically find a single API call that is the one that most benefits your system. But more thought has been placed into this line of code than any line of code you have written before.
Even then, when your API code is introduced into your production system, it needs to be deployed in an environment that monitors it. We cannot rely on an individual’s decision making to ensure the quality of all our systems. We need to create an environment that includes quality controls, checks and bounds, tests, all designed to ensure that assumptions made about this foreign code base are remaining valid.
This situation is akin to what we are doing when we incorporate data in our production systems. When we are consuming data from others, we cannot assume that it has been produced in alignment with our goals for our own systems. Worst case, it may have been adversarialy produced. A further challenge is that data is dynamic. So, in effect, the code on the USB stick is evolving over time.
Anecdotally, resolving a machine learning challenge requires 80% of the resource to be focused on the data and perhaps 20% to be focused on the model. But many companies are too keen to employ machine learning engineers who focus on the models, not the data.
A reservoir of data has more value if the data is consumable. The data crisis can only be addressed if we focus on outputs rather than inputs.
For a data first architecture we need to clean our data at source, rather than individually cleaning data for each task. This involves a shift of focus from our inputs to our outputs. We should provide data streams that are consumable by many teams without purification.
Recommendation: We need to share best practice around data deployment across our teams. We should make best use of our processes where applicable, but we need to develop them to become data first organizations. Data needs to be cleaned at output not at input.
Once the decomposition is understood, the data is sourced and the models are created, the model code needs to be deployed.
To extend our USB stick analogy further, how would we deploy that code if we thought it was likely to evolve in production? This is what datadoes. We cannot assume that the conditions under which we trained our model will be retained as we move forward, indeed the only constant we have is change.
This means that when any data dependent model is deployed into production, it requires continuous monitoring to ensure the assumptions of design have not been invalidated. Software changes are qualified through testing, in particular a regression test ensures that existing functionality is not broken by change. Since data is continually evolving, machine learning systems require 'continual regression testing': oversight by systems that ensure their existing functionality has not been broken as the world evolves around them. An approach we refer to as progression testing. Unfortunately, standards around ML model deployment yet been developed. The modern world of continuous deployment does rely on testing, but it does not recognize the continuous evolution of the world around us.
Recommendation: We establish best practice around model deployment. We need to shift our culture from standing up a software service, to standing up a data as a service. Data as a Service would involve continual monitoring of our deployed models in production. This would be regulated by 'hypervisor' systems4 that understand the context in which models are deployed and recognize when circumstance has changed and models need retraining or restructuring.
Recommendation: We should consider a major re-architecting of systems around our services. In particular we should scope the use of a streaming architecture (such as Apache Kafka) that ensures data persistence and enables asynchronous operation of our systems.5 This would enable the provision of QC streams, and real time dash boards as well as hypervisors.
Importantly a streaming architecture implies the services we build are stateless, internal state is deployed on streams alongside external state. This allows for rapid assessment of other services' data.
Outlook for Machine Learning
Machine learning has risen to prominence as an approach to scaling our activities. For us to continue to automate in the manner we have over the last two decades, we need to make more use of computer-based automation. Machine learning is allowing us to automate processes that were out of reach before.
We operate in a technologically evolving environment. Machine learning is becoming a key component in our decision making capabilities. But the evolving nature of data driven systems means that new approaches to model deployment are necessary. We have characterized three parts of the machine learning systems design process. Decomposition of the problem into separate tasks that are addressable with a machine learning solution. Collection and curation of appropriate data. Verificaction of data quality through data readiness levels. Using progression testing in our deployments. Continuously updating models as appropriate to ensure performance and quality is maintained.
 M. Feurer, A. Klein, K. Eggensperger, J. T. Springenberg, M. Blum, and F. Hutter, “Efficient and robust automated machine learning,” in Advances in neural information processing systems.
 K. R. Popper, Conjectures and refutations: The growth of scientific knowledge. London: Routledge, 1963.
 D. Sculley et al., “Hidden technical debt in machine learning systems,” in Advances in neural information processing systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, Eds. Curran Associates, Inc., 2015, pp. 2503–2511.
 N. D. Lawrence, “Data readiness levels,” arXiv, May 2017.
In earlier versions of the Three D process I've referred to this as the design stage, but decomposition feels more appropriate for what the stage involves and that preserves the word design for the overall process of machine learning systems design.↩
This is related to challenges of machine learning and technical debt , although we are trying to frame the solution here rather than the problem.↩
Data Readiness Levels  are an attempt to develop a language around data quality that can bridge the gap between technical solutions and decision makers such as managers and project planners. The are inspired by Technology Readiness Levels which attempt to quantify the readiness of technologies for deployment.↩
Emulation, or surrogate modelling, is one very promising approach to forming such a hypervisor. Emulators are models we fit to other models, often simulations, but the could also be other machine learning modles. These models operate at the meta-level, not on the systems directly. This means they can be used to model how the sub-systems interact. As well as emulators we shoulc consider real time dash boards, anomaly detection, mutlivariate analysis, data visualization and classical statistical approaches for hypervision of our deployed systems.↩
These approaches are one area of focus for my own team's reasearch. A data first architecture is a prerequisite for efficient deployment of machine learning systems.↩