at Huawei-Cambridge Workshop on Sep 15, 2020 [reveal]
Neil D. Lawrence, University of Cambridge

#### Abstract

As machine learning becomes more widely deployed, it is important that we understand what we have deployed. There has been a lot of focus in machine learning research on the fairness and interpretability of individual models, but less attention paid to how this fits into a wider machine learning system. In this talk I’ll motivate the importance of fair, interpretable and transparent machine learning systems. I’ll outline the challenges and highlight some of the directions we are considering to address these challenges. This work is sponsored by an Alan Turing Institute Senior AI Fellowship.

# The Great AI Fallacy

There is a lot of variation in the use of the term artificial intelligence. I’m sometimes asked to define it, but depending on whether you’re speaking to a member of the public, a fellow machine learning researcher, or someone from the business community, the sense of the term differs.

However, underlying its use I’ve detected one disturbing trend. A trend I’m beginining to think of as "The Great AI Fallacy".

The fallacy is associated with an implicit promise that is embedded in many statements about Artificial Intelligence. Artificial Intelligence, as it currently exists, is merely a form of automated decision making. The implicit promise of Artificial Intelligence is that it will be the first wave of automation where the machine adapts to the human, rather than the human adapting to the machine.

How else can we explain the suspension of sensible business judgment that is accompanying the hype surrounding AI?

This fallacy is particularly pernicious because there are serious benefits to society in deploying this new wave of data-driven automated decision making. But the AI Fallacy is causing us to suspend our calibrated skepticism that is needed to deploy these systems safely and efficiently.

The problem is compounded because many of the techniques that we’re speaking of were originally developed in academic laboratories in isolation from real-world deployment.

# Intellectual Debt

In computer systems the concept of technical debt has been surfaced by authors including Sculley et al. (2015). It is an important concept, that I think is somewhat hidden from the academic community, because it is a phenomenon that occurs when a computer software system is deployed.

<img class="img-button" src="/talks/assets/images/Magnify_Large.svg" style="width:1.5ex">

Figure: The components of a putative automated buying system

## Monolithic System

<img class="img-button" src="/talks/assets/images/Magnify_Large.svg" style="width:1.5ex">

Figure: A potential path of models in a machine learning system.

## Service Oriented Architecture

<img class="img-button" src="/talks/assets/images/Magnify_Large.svg" style="width:1.5ex">

Figure: A potential path of models in a machine learning system.

## … to Banking

<img class="img-button" src="/talks/assets/images/Magnify_Large.svg" style="width:1.5ex">

Figure: A potential path of models in a machine learning system.

## FIT Models to FIT Systems

Zittrain points out the challenge around the lack of interpretability of individual ML models as the origin of intellectual debt. In machine learning I refer to work in this area as fairness, interpretability and transparency or FIT models. To an extent I agree with Zittrain, but if we understand the context and purpose of the decision making, I believe this is readily put right by the correct monitoring and retraining regime around the model. A concept I refer to as "progression testing". Indeed, the best teams do this at the moment, and their failure to do it feels more of a matter of technical debt rather than intellectual, because arguably it is a maintenance task rather than an explanation task. After all, we have good statistical tools for interpreting individual models and decisions when we have the context. We can linearise around the operating point, we can perform counterfactual tests on the model. We can build empirical validation sets that explore fairness or accuracy of the model.

So, this is where, my understanding of intellectual debt in ML systems departs, I believe from John Zittrain’s. The long-term challenge is not in the individual model. We have excellent statistical tools for validating what any individual model, the long-term challenge is the complex interaction between different components in the decomposed system, where the original intent of each component has been forgotten (except perhaps by Lancelot) and each service has been repurposed. We need to move from FIT models to FIT systems.

How to address these challenges? With collaborators I’ve been working towards a solution that contains broadly two parts. The first part is what we refer to as "Data-Oriented Architectures". The second part is "meta modelling", machine learning techniques that help us model the models.

# A Technology

## Statistical Emulation

<img class="img-button" src="/talks/assets/images/Magnify_Large.svg" style="width:1.5ex">

Figure: Real world systems consiste of simulators, that capture our domain knowledge about how our systems operate. Different simulators run at different speeds and granularities.

In many real world systems, decisions are made through simulating the environment. Simulations may operate at different granularities. For example, simulations are used in weather forecasts and climate forecasts. The UK Met office uses the same code for both, but operates climate simulations one at greater spatial and temporal resolutions.

<img class="img-button" src="/talks/assets/images/Magnify_Large.svg" style="width:1.5ex">

Figure: A statistical emulator is a system that reconstructs the simulation with a statistical model.

A statistical emulator is a data-driven model that learns about the underlying simulation. Importantly, learns with uncertainty, so it ‘knows what it doesn’t know’. In practice, we can call the emulator in place of the simulator. If the emulator ‘doesn’t know’, it can call the simulator for the answer.

<img class="img-button" src="/talks/assets/images/Magnify_Large.svg" style="width:1.5ex">

Figure: A statistical emulator is a system that reconstructs the simulation with a statistical model. As well as reconstructing the simulation, a statistical emulator can be used to correlate with the real world.

<img class="img-button" src="/talks/assets/images/Magnify_Large.svg" style="width:1.5ex">

Figure: In modern machine learning system design, the emulator may also consider the output of ML models (for monitoring bias or accuracy) and Operations Research models..

As well as reconstructing an individual simulator, the emulator can calibrate the simulation to the real world, by monitoring differences between the simulator and real data. This allows the emulator to characterise where the simulation can be relied on, i.e. we can validate the simulator.

Similarly, the emulator can adjudicate between simulations. This is known as multi-fidelity emulation. The emulator characterizes which emulations perform well where.

If all this modelling is done with judiscious handling of the uncertainty, the computational doubt, then the emulator can assist in desciding what experiment should be run next to aid a decision: should we run a simulator, in which case which one, or should we attempt to acquire data from a real world intervention.

# A Solution

## Data-oriented Architectures

Data-oriented architectures aim to address the rat’s nest that is the current interaction between the services in a service-oriented architecture. It does this by introducing data-oriented programming. The data-oriented programming language tracks the movement of data between each service.

Service-oriented programming style is a necessary, but not sufficient approach to data-oriented programming. Data-oriented programming is not only about the individual services, but how they are connected. Which service is calling which and where the flow of the data through the system occurs?

If each service has its inputs and outputs declared on a wider ecosystem, then we can programmatically determine which inputs effect which decisions. This programmatic discovery is vital because as systems are built compositionally, the actual inputs that affect a final decision may not be known to any of the software engineers who are maintaining the system.

## Milan

<img class="img-button" src="/talks/assets/images/Magnify_Large.svg" style="width:1.5ex">

Figure: The Milan Software has a general purpose stream algebra at its core, the Milan IL.

At Amazon my team built a data-oriented programming language which is now available through BSD license. The language is called Milan. The team was led by Tom Borchert, quoting from Tom’s blog on Milan:

Milan has three components:

1. A general-purpose stream algebra that encodes relationships between data streams (the Milan Intermediate Language or Milan IL)

2. A Scala library for building programs in that algebra.

3. A compiler that takes programs expressed in Milan IL and produces a Flink application that executes the program.

Component (2) can be extended to support interfaces in additional languages, and component (3) can be extended to support additional runtime targets. Considering just the multiple interfaces and the multiple runtimes, Milan looks a lot like the much more mature Apache Beam. The difference lies in (1), Milan’s general-purpose stream algebra.

It is through the general-purpose stream algebra that we hope to make significant inroads on the intellectual debt challenge.

The stream algebra defines the relationship between different machine learning components in the wider software architecture. Composition of multiple services cannot occur without a signature existing within the stream algebra. The Milan IL becomes the key information structure that is required to reason about the wider software system.

## Context

This deals with the challenges that arise through the death of the programmer because we can now see the context around each service. This allows us to design the relevant validation checks to ensure that accuracy and fairness are maintained. By recompiling the algebra to focus on a particular decision within the system we can also derive new statistical tests to validate performance. These are the checks that we refer to as progression testing. The loss of programmer control means that we can no longer rely on software tests written at design time, we must have the capability to deploy new (statistical) tests after deployment as the uses to which each service is placed extend to previously un-envisaged domains.

## Stateless Services

Importantly, Milan does not place onerous constraints on the builders of individual machine learning models (or other components). Standard modelling frameworks can be used. The main constraint is that any code that is not visible to the ecosystem does not maintain or store global state. This condition implies that the parameters of any machine learning model need to also be declared as an input to the model within the Milan IL.

## Meta Modelling

Where does machine learning come in? The strategy I propose is that the Milan IL is integrated with meta-modelling approaches to assist in the explanation of the decision-making framework. At their simplest these approaches may be novelty detection algorithms on the data streams that are emerging from a given service. This is a form of progression testing. But we can go much further. By knowing the training data, the inputs and outputs of the individual services in the software ecosystem, we can build meta-models that test for fairness, accuracy not just of individual system components, but short or long cascades of decision making. Through the use of the Milan IL algebra all these tests could be automatically deployed. The focus of machine learning is on the models-that-model-the-models. The meta-models.

In Amazon, our own focus was on the use of statistical emulators, sometimes known as surrogate models, for fulfilling this task. The work we were putting into this route is available through another software package, Emukit, a framework for decision making under uncertainty. With collaborators my current focus for addressing these issues is a form of fusion of Emukit and Milan (Milemukit??). But the nature of this fusion requires testing on real world problem sets. A task we hope to carry out in close collaboration with colleagues at Data Science Africa.

## Deep Emulation

<img class="img-button" src="/talks/assets/images/Magnify_Large.svg" style="width:1.5ex">

Figure: A potential path of models in a machine learning system.

As a solution we can use of emulators. When constructing an ML system, software engineers, ML engineers, economists and operations researchers are explicitly defining relationships between variables of interest in the system. That implicitly defines a joint distribution, $p(\dataVector^*, \dataVector)$. In a decomposable system any sub-component may be defined as $p(\dataVector_\mathbf{i}|\dataVector_\mathbf{j})$ where $\dataVector_\mathbf{i}$ and $\dataVector_\mathbf{j}$ represent sub-sets of the full set of variables $\left\{\dataVector^*, \dataVector \right\}$. In those cases where the relationship is deterministic, the probability density would collapse to a vector-valued deterministic function, $\mappingFunctionVector_\mathbf{i}\left(\dataVector_\mathbf{j}\right)$.

Inter-variable relationships could be defined by, for example a neural network (machine learning), an integer program (operational research), or a simulation (supply chain). This makes probabilistic inference in this joint density for real world systems is either very hard or impossible.

Emulation is a form of meta-modelling: we construct a model of the model. We can define the joint density of an emulator as $s(\dataVector*, \dataVector)$, but if this probability density is to be an accurate representation of our system, it is likely to be prohibitively complex. Current practice is to design an emulator to deal with a specific question. This is done by fitting an ML model to a simulation from the the appropriate conditional distribution, $p(\dataVector_\mathbf{i}|\dataVector_\mathbf{j})$, which is intractable. The emulator provides an approximated answer of the form $s(\dataVector_\mathbf{i}|\dataVector_\mathbf{j})$. Critically, an emulator should incorporate its uncertainty about its approximation. So the emulator answer will be less certain than direct access to the conditional $p(\dataVector_i|\dataVector_j)$, but it may be sufficiently confident to act upon. Careful design of emulators to answer a given question leads to efficient diagnostics and understanding of the system. But in a complex interacting system an exponentially increasing number of questions can be asked. This calls for a system of automated construction of emulators which selects the right structure and redeploys the emulator as necessary. Rapid redeployment of emulators could exploit pre-existing emulators through transfer learning.

Automatically deploying these families of emulators for full system understanding is highly ambitious. It requires advances in engineering infrastructure, emulation and Bayesian optimization. However, the intermediate steps of developing this architecture also allow for automated monitoring of system accuracy and fairness. This facilitates AutoML on a component-wise basis which we can see as a simple implementation of AutoAI. The proposal is structured so that despite its technical ambition there is a smooth ramp of benefits to be derived across the programme of work.

In Applied Mathematics, the field studying these techniques is known as uncertainty quantification. The new challenge is the automation of emulator creation on demand to answer questions of interest and facilitate the system design, i.e. AutoAI through BSO.

At design stage, any particular AI task could be decomposed in multiple ways. Bayesian system optimization will assist both in determining the large-scale system design through exploring different decompositions and in refinement of the deployed system.

So far, most work on emulators has focussed on emulating a single component. Automated deployment and maintenance of ML systems requires networks of emulators that can be deployed and redeployed on demand depending on the particular question of interest. Therefore, the technical innovations we require are in the mathematical composition of emulator models (Damianou and Lawrence 2013; Perdikaris et al. 2017). Different chains of emulators will need to be rapidly composed to make predictions of downstream performance. This requires rapid retraining of emulators and propagation of uncertainty through the emulation pipeline a process we call deep emulation.

Recomposing the ML system requires structural learning of the network. By parameterizing covariance functions appropriately this can be done through Gaussian processes (e.g. (Damianou et al., n.d.)), but one could also consider Bayesian neural networks and other generative models, e.g. Generative Adversarial Networks (Goodfellow et al. 2014).

<img class="img-button" src="/talks/assets/images/Magnify_Large.svg" style="width:1.5ex">

Figure: A potential path of models in a machine learning system.

<img class="img-button" src="/talks/assets/images/Magnify_Large.svg" style="width:1.5ex">

Figure: A potential path of models in a machine learning system.

<img class="img-button" src="/talks/assets/images/Magnify_Large.svg" style="width:1.5ex">

Figure: A potential path of models in a machine learning system.

# Conclusion

## Thanks!

For more information on these subjects and more you might want to check the following resources.

# References

Damianou, Andreas, Carl Henrik Ek, Michalis K. Titsias, and Neil D. Lawrence. n.d. "Manifold Relevance Determination." In.

Damianou, Andreas, and Neil D. Lawrence. 2013. "Deep Gaussian Processes." In, 31:207–15.

Goodfellow, Ian, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. "Generative Adversarial Nets." In Advances in Neural Information Processing Systems 27, edited by Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, 2672–80. Curran Associates, Inc.

Perdikaris, Paris, Maziar Raissi, Andreas Damianou, Neil D. Lawrence, and George Em Karnidakis. 2017. "Nonlinear Information Fusion Algorithms for Data-Efficient Multi-Fidelity Modelling." Proc. R. Soc. A 473 (20160751). https://doi.org/10.1098/rspa.2016.0751.

Sculley, D., Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-François Crespo, and Dan Dennison. 2015. "Hidden Technical Debt in Machine Learning Systems." In Advances in Neural Information Processing Systems 28, edited by Corinna Cortes, Neil D. Lawrence, Daniel D. Lee, Masashi Sugiyama, and Roman Garnett, 2503–11. Curran Associates, Inc. http://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf.