Interpretable End-to-End Learning
Abstract
Practical artificial intelligence systems can be seen as algorithmic decision makers. The fractal nature of decision making implies that this involves interacting systems of components where decisions are made multiple times across different time frames. This affects the decomposability of an artificial intelligence system. Classical systems design relies on decomposability for efficient maintenance and deployment of machine learning systems; in this talk we consider the challenges of optimizing and maintaining such systems.
Introduction
The fourth industrial revolution bears the particular hallmark of being the first revolution that has been named before it has happened. This is particularly unfortunate, because it is not in fact an industrial revolution at all. Nor is it necessarily a distinct phenomenon. It is part of a revolution in information, one that goes back to digitisation and the invention of the silicon chip.
Or to put it more precisely, it is a revolution in how information can affect the physical world: the interchange between information and the physical world.
Amazon’s New Delivery Drone
One example is autonomous vehicles, both those we intend to operate on the ground and also those in the air.
The drone highlights one of the important changes that is driving the innovation from machine learning, the interaction between the physical world and the information world.
Supply Chain
On Sunday mornings in Sheffield, I often used to run across Packhorse Bridge in Burbage valley. The bridge is part of an ancient network of trails crossing the Pennines that, before Turnpike roads arrived in the 18th century, was the main way in which goods were moved. Given that the moors around Sheffield were home to sand quarries, tin mines, lead mines and the villages in the Derwent valley were known for nail and pin manufacture, this wasn’t simply movement of agricultural goods, but it was the infrastructure for industrial transport.
Those who led the horses were known as Jaggers, and leading out of the village of Hathersage is Jagger’s Lane, a trail that headed underneath Stanage Edge and into Sheffield.
The movement of goods from regions of supply to areas of demand is fundamental to our society. The physical infrastructure of supply chain has evolved a great deal over the last 300 years.
Cromford
Richard Arkwright is known as the father of the modern factory system. In 1771 he set up a mill for spinning cotton yarn in the village of Cromford, in the Derwent Valley. The Derwent valley is relatively inaccessible. Raw cotton arrived in Liverpool from the US and India and needed to be transported on packhorse across the bridleways of the Pennines. But Cromford was a good location due to its proximity to Nottingham, where weavers were consuming the finished thread, and the availability of water power from small tributaries of the Derwent river for Arkwright’s water frames, which automated the production of yarn from raw cotton.
By 1794 the Cromford Canal was opened to bring coal into Cromford and give better transport to Nottingham. The construction of the canals was driven by the need to improve the transport infrastructure, facilitating the movement of goods across the UK. Canals, roads and railways were initially constructed out of the economic need to move goods: to improve the supply chain.
The A6 now passes through Cromford, but at the time Arkwright moved there, there was merely a track. The High Peak Railway was opened in 1832; it has since been converted into the High Peak Trail, but it was the highest railway built in Britain.
Cooper (1991)
Containerization
Containerization has had a dramatic effect on global economics, placing many people in the developing world at the end of the supply chain.
For example, you can buy Wild Alaskan Cod fished from Alaska, processed in China, sold in North America. This is driven by the low cost of transport for frozen cod vs the higher relative cost of cod processing in the US versus China. Similarly, Scottish prawns are also processed in China for sale in the UK.
This effect on cost of transport vs cost of processing is the main driver of the topology of the modern supply chain and the associated effect of globalization. If transport is much cheaper than processing, then processing will tend to agglomerate in places where processing costs can be minimized.
Large scale global economic change has principally been driven by changes in the technology that drives supply chain.
Supply chain is a large-scale automated decision making network. Our aim is to make decisions not only based on our models of customer behavior (as observed through data), but also by accounting for the structure of our fulfilment center, and delivery network.
Many of the most important questions in supply chain take the form of counterfactuals. E.g. “What would happen if we opened a manufacturing facility in Cambridge?” A counterfactual is a question that implies a mechanistic understanding of a system. It goes beyond simple smoothness assumptions or translation invariances. It requires a physical, or mechanistic, understanding of the supply chain network. For this reason, the type of models we deploy in supply chain often involve simulations or a more mechanistic understanding of the network.
In supply chain, machine learning alone is not enough: we need to bridge between models that contain real mechanisms and models that are entirely data driven.
This is challenging, because as we introduce more mechanism to the models we use, it becomes harder to develop efficient algorithms to match those models to data.
Many of the examples that motivate intelligent decision making are based around the challenge of moving goods/energy/compute/water/medicines/drivers/people from where they are to where they need to be. In other words, matching supply with demand. That led me to a motto I developed while working in Amazon’s supply chain.
Solve Supply Chain, then solve everything else.
End-to-End: Environment and Decision
From Model to Decision
The real challenge, however, is end-to-end decision making. Taking information from the environment and using it to drive decision making to achieve goals.
We don’t know what science we’ll want to do in 5 years time, but we won’t want slower experiments, we won’t want more expensive experiments and we won’t want a narrower selection of experiments.
- Faster, cheaper and more diverse experiments.
- Better ecosystems for experimentation.
- Data oriented architectures.
Data Oriented Architectures
In a streaming architecture we shift from management of services, to management of data streams. Instead of worrying about availability of the services we shift to worrying about the quality of the data those services are producing.
Historically we’ve been software first. This is a necessary but insufficient condition for being data first. We need to move from software-as-a-service to data-as-a-service, from service oriented architectures to data oriented architectures.
Streaming System
Characteristics of a streaming system include a move from pull updates to push updates, i.e. the computation is driven by a change in the input data rather than the service calling for input data when it decides to run a computation. Streaming systems operate on ‘rows’ of the data rather than ‘columns’. This is because the full column isn’t normally available as it changes over time. As an important design principle, the services themselves are stateless, they take their state from the streaming ecosystem. This ensures the inputs and outputs of given computations are easy to declare. As a result, persistence of the data is also handled by the streaming ecosystem and decisions around data retention or recomputation can be taken at the systems level rather than the component level.
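As a rough sketch of this pattern, the toy Python below (hypothetical names, no real streaming framework) shows a push-driven, stateless handler: the arrival of a row triggers the computation, the handler holds no internal state, and its output is published back onto the stream ecosystem.

# Toy sketch of a push-driven, stateless stream handler (hypothetical names).
# State lives in the streams themselves; the service only transforms rows.
from collections import defaultdict

class StreamBus:
    """Stand-in for a streaming ecosystem (e.g. Kafka topics)."""
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, topic, row):
        # Push update: the arrival of a row drives the computation.
        for handler in self.subscribers[topic]:
            handler(row)

bus = StreamBus()

def double_price(row):
    # Stateless transformation of a single row; the result goes back on the bus.
    bus.publish("price-doubled", {"t": row["t"], "price": 2 * row["price"]})

bus.subscribe("price", double_price)
bus.subscribe("price-doubled", print)   # a downstream consumer

bus.publish("price", {"t": 0, "price": 400.0})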
Recommendation: We should consider a major re-architecting of systems around our services. In particular we should scope the use of a streaming architecture (such as Apache Kafka) that ensures data persistence and enables asynchronous operation of our systems.1 This would enable the provision of QC streams and real-time dashboards, as well as hypervisors.
Importantly, a streaming architecture implies that the services we build are stateless: internal state is deployed on streams alongside external state. This allows for rapid assessment of other services’ data.
Apache Flink
Apache Flink is a stream processing framework and a foundation for event driven processing, providing a high throughput, low latency framework that operates on dataflows.
Data storage is handled by other systems such as Apache Kafka or AWS Kinesis.
stream.join(otherStream)
.where(<KeySelector>)
.equalTo(<KeySelector>)
.window(<WindowAssigner>)
.apply(<JoinFunction>)
Apache Flink allows operations on streams. For example, the join operation above. In a traditional database management system, this join operation may be written in SQL and called on demand. In a streaming ecosystem, computations occur as and when the streams update.
The join is handled by the ecosystem surrounding the business logic.
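The snippet above shows the shape of a Flink join rather than runnable code. As a loose illustration of the same idea in plain Python (not the Flink API; the events and window size are made up), a keyed, windowed join matches events from two streams whose keys agree and whose timestamps fall in the same window.

# Plain-Python illustration (not the Flink API) of a keyed, windowed join.
from collections import defaultdict

WINDOW = 10  # window length in seconds (illustrative)

def window_of(t):
    return t // WINDOW

def windowed_join(stream, other_stream, key, join_fn):
    # Bucket the second stream by (key, window), then match the first against it.
    buckets = defaultdict(list)
    for event in other_stream:
        buckets[(key(event), window_of(event["t"]))].append(event)
    for event in stream:
        for match in buckets[(key(event), window_of(event["t"]))]:
            yield join_fn(event, match)

orders = [{"t": 3, "id": "a", "qty": 2}]
quotes = [{"t": 5, "id": "a", "price": 400.0}]
joined = windowed_join(orders, quotes, key=lambda e: e["id"],
                       join_fn=lambda o, q: {**o, **q})
print(list(joined))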
Milan
Milan is a data-oriented programming language and runtime infrastructure.
The Milan language is a DSL embedded in Scala. The output is an intermediate language that can be compiled to run on different target platforms. Currently there exists a single compiler that produces Flink applications.
The Milan runtime infrastructure compiles and runs Milan applications on a Flink cluster.
Trading System
As a simple example, we’ll consider high frequency trading. Anne wishes to build an automated share trading system. She has access to a high frequency trading platform which provides prices and allows trades at millisecond intervals.
Let’s assume that price trading data is available as a data stream. But the price now is not the only information that Anne needs; she also needs an estimate of the price in the future.
# Generate an artificial trading stream: a random-walk price series starting at 400.
import numpy as np
import pandas as pd

days = pd.date_range(start='21/5/2017', end='21/5/2020')
z = np.random.randn(len(days))       # daily price moves
x = z.cumsum() + 400                 # random walk around 400
prices = pd.Series(x, index=days)

# Everything from 21/5/2019 onwards plays the role of the (unseen) future.
hypothetical = prices.loc['21/5/2019':]
real = prices.copy()
real.loc['21/5/2019':] = np.nan
Hypothetical Streams
We’ll call the future price a hypothetical stream.
A hypothetical stream is a desired stream of information which cannot be directly accessed. The lack of direct access may be because the events happen in the future, or there may be some latency between the event and the availability of the data.
Any hypothetical stream will only be provided as a prediction, ideally with an error bar.
The nature of the hypothetical Anne needs depends on her decision-making process. In Anne’s case it will depend on the period over which she expects her returns. In MDOP Anne specifies a hypothetical that is derived from the pricing stream.
It is not the price stream directly, but Anne looks for future predictions from the price stream, perhaps for price in T days’ time.
At this stage, this stream is merely typed as a hypothetical.
There are constraints on the hypothetical. They include: the input information, the upper limit of latency between input and prediction, and the decision Anne needs to make (how far ahead she needs the prediction, and what her upside and downside risks are). These three constraints mean that we can only recover an approximation to the hypothetical.
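As a minimal sketch, continuing the artificial stream generated above and assuming the same random-walk behaviour, the T-day-ahead hypothetical can be provided as a prediction with an error bar: the last observed price, with a standard deviation that grows as the square root of the horizon.

# Sketch: the hypothetical as a prediction with an error bar, assuming the
# random-walk prices generated above ('real' from the earlier snippet).
import numpy as np

def price_hypothetical(prices, T):
    """Return (mean, std) for the price T days after the last observation."""
    daily_moves = prices.diff().dropna()
    sigma = daily_moves.std()            # daily volatility
    return prices.iloc[-1], sigma * np.sqrt(T)

mean, std = price_hypothetical(real.dropna(), T=5)
print(f"price in 5 days: {mean:.1f} +/- {2 * std:.1f}")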
Hypothetical Advantage
What is the advantage to defining things in this way? By defining, clearly, the two streams as real and hypothetical variants of each other, we now enable automation of the deployment and any redeployment process. The hypothetical can be instantiated against the real, and design criteria can be constantly evaluated triggering retraining when necessary.
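A minimal sketch of that evaluation loop, continuing the example above: the hypothetical is replayed against the realised prices and a retraining trigger fires when a (made-up) design criterion on the error is violated.

# Sketch: score the hypothetical against realised data and trigger retraining
# when the design criterion is violated. The threshold is illustrative and the
# retraining step is just a placeholder print.
import numpy as np

def monitor(hypothetical_fn, realised, T, max_rmse):
    errors = []
    for i in range(T, len(realised)):
        history = realised.iloc[:i - T + 1]           # data available T days earlier
        prediction, _ = hypothetical_fn(history, T)   # what we would have predicted
        errors.append(prediction - realised.iloc[i])
    rmse = np.sqrt(np.mean(np.square(errors)))
    if rmse > max_rmse:
        print(f"RMSE {rmse:.2f} exceeds {max_rmse}: trigger retraining")
    return rmse

monitor(price_hypothetical, real.dropna(), T=5, max_rmse=10.0)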
SafeBoda
SafeBoda is a Kampala based rider allocation system for boda boda drivers. Boda boda are motorcycle taxis which provide employment, often to young men, across Kampala. With road accidents set to match HIV/AIDS as the highest cause of death in low/middle income countries by 2030, SafeBoda’s aim is to modernise informal transportation and ensure safe access to mobility.
Let’s consider a ride sharing app, for example the SafeBoda system.
Anne is on her way home now; she wishes to hail a car using a ride sharing app.
The app is designed in the following way. On opening her app Anne is notified about drivers in the nearby neighborhood. She is given an estimate of the time a ride may take to come.
Given this information about driver availability, Anne may feel encouraged to enter a destination. Given this destination, a price estimate can be given. This price is conditioned on other riders that may wish to go in the same direction, but the price estimate needs to be made before the user agrees to the ride.
Business customer service constraints dictate that this price may not change after Anne’s order is confirmed.
In this simple system, several decisions are being made, each of them on the basis of a hypothetical.
When Anne calls for a ride, she is provided with an estimate based on the expected time a ride can be with her. But this estimate is made without knowing where Anne wants to go. Constraints on drivers, imposed by regional boundaries, the end of their shifts, or their current passengers, mean that this estimate can only be a best guess.
This best guess may well be driven by previous data.
Ride Sharing: Service Oriented to Data Oriented
The modern approach to software systems design is known as service-oriented architecture (SOA). The idea is that software engineers are responsible for the availability and reliability of the API that accesses the service they own. Quality of service is maintained by rigorous standards around testing of software systems.
In data driven decision-making systems, the quality of decision-making is determined by the quality of the data. We need to extend the notion of service-oriented architecture to data-oriented architecture (DOA).
The focus in SOA is eliminating hard failures. Hard failures can occur due to bugs or systems overload. This notion needs to be extended in ML systems to capture soft failures associated with declining data quality, incorrect modeling assumptions and inappropriate re-deployments of models. We need to focus on data quality assessments. In data-oriented architectures engineering teams are responsible for the quality of their output data streams in addition to the availability of the service they support (Lawrence 2017). Quality here is not just accuracy, but fairness and explainability. This important cultural change would be capable of addressing both the challenge of technical debt (Sculley et al. 2015) and the social responsibility of ML systems.
Software development proceeds with a test-oriented culture. One where tests are written before software, and software is not incorporated in the wider system until all tests pass. We must apply the same standards of care to our ML systems, although for ML we need statistical tests for quality, fairness and consistency within the environment. Fortunately, the main burden of this testing need not fall to the engineers themselves: through leveraging classical statistics and emulation we will automate the creation and redeployment of these tests across the software ecosystem; we call this ML hypervision (WP5).
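As one small, hedged example of what such a statistical test might look like (the window sizes, shift and threshold are all illustrative), a two-sample Kolmogorov–Smirnov test can compare a live window of a data stream against a reference sample from training time and flag drift for revalidation.

# Minimal sketch of an automated statistical quality test on a data stream:
# compare a recent window against a training-time reference sample and flag
# drift. Window sizes, the shift and the threshold are illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, size=5000)   # distribution at training time
recent = rng.normal(0.3, 1.0, size=500)       # live window with a small shift

result = stats.ks_2samp(reference, recent)
if result.pvalue < 0.01:
    print(f"data drift detected (KS={result.statistic:.3f}, "
          f"p={result.pvalue:.1e}); schedule model revalidation")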
Modern AI can be based on ML models with many millions of parameters, trained on very large data sets. In ML, strong emphasis is placed on predictive accuracy whereas sister-fields such as statistics have a strong emphasis on interpretability. ML models are said to be ‘black boxes’ which make decisions that are not explainable.2
For the ride sharing system, we start to see a common issue with a more complex algorithmic decision-making system. Several decisions are being made multiple times. Let’s look at the decisions we need along with some design criteria.
- Driver Availability: Estimate time to arrival for Anne’s ride using Anne’s location and the locations of locally available cars. Latency 50 milliseconds
- Cost Estimate: Estimate cost for the journey using Anne’s destination and location, together with the current destinations and availability of local cars. Latency 50 milliseconds
- Driver Allocation: Allocate car to minimize transport cost to destination. Latency 2 seconds.
So we need:
- a hypothetical to estimate availability. It is constrained by the lack of destination information and a low latency requirement.
- a hypothetical to estimate cost. It is constrained by a low latency requirement and by the business rule that the price cannot change once Anne confirms the ride (a sketch of declaring these hypotheticals follows below).
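One lightweight way to make these hypotheticals and their constraints explicit in the architecture is to declare them as data. The sketch below is purely illustrative: the class, field names and numbers are hypothetical, not part of any existing system.

# Hypothetical sketch: declare each decision's hypothetical and its design
# constraints, so the architecture records what is estimated, from which
# inputs, and under what latency budget. Names and numbers are illustrative.
from dataclasses import dataclass

@dataclass
class Hypothetical:
    name: str
    inputs: tuple        # streams the estimate may condition on
    latency_ms: int      # upper limit on input-to-prediction latency
    decision: str        # the downstream decision the estimate supports

availability = Hypothetical(
    name="time_to_pickup",
    inputs=("rider_location", "driver_locations"),
    latency_ms=50,
    decision="show estimated wait before the destination is known",
)
cost = Hypothetical(
    name="journey_price",
    inputs=("rider_location", "destination", "driver_destinations"),
    latency_ms=50,
    decision="quote a price that cannot change after confirmation",
)
print(availability, cost, sep="\n")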
Simultaneously, drivers in this data ecosystem have an app which notifies them about new jobs and recommends where they should go.
A further advantage is that strategies for data retention (when to snapshot) can be set globally.
A few decisions need to be made in this system. First of all, when the user opens the app, the estimate of the time to the nearest ride may need to be computed quickly, to avoid latency in the service.
This may require a quick estimate of the ride availability.
Information Dynamics
With all the second guessing within a complex automated decision-making system, there are potential problems with information dynamics: the ‘closed loop’ problem, where sub-systems are being approximated (second guessed) and downstream predictions are affected as a result.
This leads to the need for a closed loop analysis, for example, see the “Closed Loop Data Science” project led by Rod Murray-Smith at Glasgow.
Emulation
In many real world systems, decisions are made through simulating the environment. Simulations may operate at different granularities. For example, simulations are used in weather forecasts and climate forecasts. The UK Met Office uses the same code for both, but runs climate simulations at different spatial and temporal resolutions.
A statistical emulator is a data-driven model that learns about the underlying simulation. Importantly, it learns with uncertainty, so it ‘knows what it doesn’t know’. In practice, we can call the emulator in place of the simulator. If the emulator ‘doesn’t know’, it can call the simulator for the answer.
As well as reconstructing an individual simulator, the emulator can calibrate the simulation to the real world, by monitoring differences between the simulator and real data. This allows the emulator to characterise where the simulation can be relied on, i.e. we can validate the simulator.
Similarly, the emulator can adjudicate between simulations. This is known as multi-fidelity emulation. The emulator characterizes which simulations perform well where.
If all this modelling is done with judicious handling of the uncertainty, the computational doubt, then the emulator can assist in deciding what experiment should be run next to aid a decision: should we run a simulator, and if so which one, or should we attempt to acquire data from a real world intervention?
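A minimal sketch of this pattern, using GPy (which appears later in these notes): a GP emulator is fitted to a handful of runs of a toy stand-in simulator, and a query falls back to the simulator whenever the emulator’s predictive variance exceeds an illustrative threshold.

# Sketch: a GP emulator that 'knows what it doesn't know'. When its predictive
# variance exceeds a threshold it falls back to the (expensive) simulator.
# The simulator is a toy stand-in and the threshold is illustrative.
import numpy as np
import GPy

def simulator(x):
    return np.sin(3 * x) + 0.1 * x ** 2

X = np.random.uniform(-2, 2, (20, 1))        # a few 'expensive' simulator runs
Y = simulator(X)
emulator = GPy.models.GPRegression(X, Y)
emulator.optimize()

def evaluate(x, max_variance=0.01):
    x = np.atleast_2d(x)
    mean, var = emulator.predict(x)
    if var[0, 0] > max_variance:
        return simulator(x)[0, 0]            # too uncertain: run the simulator
    return mean[0, 0]                        # confident: use the cheap emulator

print(evaluate(0.5), evaluate(5.0))          # inside vs. outside the training range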
Deep Emulation
As a solution we can make use of emulators. When constructing an ML system, software engineers, ML engineers, economists and operations researchers are explicitly defining relationships between variables of interest in the system. That implicitly defines a joint distribution, $p(\dataVector^*, \dataVector)$. In a decomposable system any sub-component may be defined as $p(\dataVector_\mathbf{i}|\dataVector_\mathbf{j})$ where $\dataVector_\mathbf{i}$ and $\dataVector_\mathbf{j}$ represent sub-sets of the full set of variables $\left\{\dataVector^*, \dataVector \right\}$. In those cases where the relationship is deterministic, the probability density would collapse to a vector-valued deterministic function, $\mappingFunctionVector_\mathbf{i}\left(\dataVector_\mathbf{j}\right)$.
Inter-variable relationships could be defined by, for example, a neural network (machine learning), an integer program (operational research), or a simulation (supply chain). This makes probabilistic inference in this joint density either very hard or impossible for real world systems.
Emulation is a form of meta-modelling: we construct a model of the model. We can define the joint density of an emulator as $s(\dataVector^*, \dataVector)$, but if this probability density is to be an accurate representation of our system, it is likely to be prohibitively complex. Current practice is to design an emulator to deal with a specific question. This is done by fitting an ML model to simulations from the appropriate conditional distribution, $p(\dataVector_\mathbf{i}|\dataVector_\mathbf{j})$, which is itself intractable. The emulator provides an approximated answer of the form $s(\dataVector_\mathbf{i}|\dataVector_\mathbf{j})$. Critically, an emulator should incorporate its uncertainty about its approximation. So the emulator answer will be less certain than direct access to the conditional $p(\dataVector_\mathbf{i}|\dataVector_\mathbf{j})$, but it may be sufficiently confident to act upon. Careful design of emulators to answer a given question leads to efficient diagnostics and understanding of the system. But in a complex interacting system an exponentially increasing number of questions can be asked. This calls for a system of automated construction of emulators which selects the right structure and redeploys the emulator as necessary. Rapid redeployment of emulators could exploit pre-existing emulators through transfer learning.
Automatically deploying these families of emulators for full system understanding is highly ambitious. It requires advances in engineering infrastructure, emulation and Bayesian optimization. However, the intermediate steps of developing this architecture also allow for automated monitoring of system accuracy and fairness. This facilitates AutoML on a component-wise basis which we can see as a simple implementation of AutoAI. The proposal is structured so that despite its technical ambition there is a smooth ramp of benefits to be derived across the programme of work.
In Applied Mathematics, the field studying these techniques is known as uncertainty quantification. The new challenge is the automation of emulator creation on demand to answer questions of interest and facilitate the system design, i.e. AutoAI through Bayesian system optimization (BSO).
At design stage, any particular AI task could be decomposed in multiple ways. Bayesian system optimization will assist both in determining the large-scale system design through exploring different decompositions and in refinement of the deployed system.
So far, most work on emulators has focussed on emulating a single component. Automated deployment and maintenance of ML systems requires networks of emulators that can be deployed and redeployed on demand depending on the particular question of interest. Therefore, the technical innovations we require are in the mathematical composition of emulator models (Damianou and Lawrence 2013; Perdikaris et al. 2017). Different chains of emulators will need to be rapidly composed to make predictions of downstream performance. This requires rapid retraining of emulators and propagation of uncertainty through the emulation pipeline, a process we call deep emulation.
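As a crude sketch of the idea (not the principled compositions cited above), two GP emulators can be chained with uncertainty pushed through by Monte Carlo sampling; the component functions here stand in for real sub-system simulators.

# Sketch of deep emulation: two GPy emulators chained, with uncertainty
# propagated by simple Monte Carlo sampling. The components are toy stand-ins
# for real sub-system simulators.
import numpy as np
import GPy

def component_a(x):                  # e.g. an upstream demand simulator
    return np.sin(2 * x)

def component_b(z):                  # e.g. a downstream cost simulator
    return z ** 2 + 0.5 * z

X = np.linspace(-2, 2, 25)[:, None]
Z = component_a(X)
emulator_a = GPy.models.GPRegression(X, Z)
emulator_a.optimize()
emulator_b = GPy.models.GPRegression(Z, component_b(Z))
emulator_b.optimize()

def deep_emulate(x, n_samples=500):
    """Propagate emulator_a's uncertainty into emulator_b by sampling."""
    x = np.atleast_2d(x)
    mean_a, var_a = emulator_a.predict(x)
    z_samples = np.random.normal(mean_a, np.sqrt(var_a), (n_samples, 1))
    mean_b, var_b = emulator_b.predict(z_samples)
    y_samples = np.random.normal(mean_b, np.sqrt(var_b))
    return y_samples.mean(), y_samples.std()

print(deep_emulate(0.3))             # downstream prediction with propagated uncertainty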
Recomposing the ML system requires structural learning of the network. By parameterizing covariance functions appropriately this can be done through Gaussian processes (e.g. (Damianou et al., n.d.)), but one could also consider Bayesian neural networks and other generative models, e.g. Generative Adversarial Networks (Goodfellow et al. 2014).
Bayesian System Optimization
We introduce the notion of Bayesian system optimisation. Standard Bayesian optimisation is about optimising individual components under a given (localised) optimisation criterion. Bayesian system optimisation is about realising that there are upstream and downstream effects, ‘no model is an island’. If we can use emulation to estimate those effects, then we can optimise individual components not just according to their own objective functions, but according to their situation in the wider system and their downstream effects.
Auto AI
Supervised machine learning models are data-driven statistical functional estimators. Each ML model is trained to perform a task. Machine learning systems are created when these models are integrated as interacting components in a more complex system that carries out a larger scale task, e.g. an autonomous drone delivery system.
Artificial Intelligence can also be seen as algorithmic decision-making. ML systems are data driven algorithmic decision-makers. Designing decision-making engines requires us to firstly decompose the system into its component parts. The decompositions are driven by (1) system performance requirements (2) the suite of ML algorithms at our disposal (3) the data availability. Performance requirements could be computational speed, accuracy, interpretability, and ‘fairness’. The current generation of ML Systems is often based around supervised learning and human annotated data. But in the future, we may expect more use of reinforcement learning and automated knowledge discovery using unsupervised learning.
The classical systems approach assumes decomposability of components. In ML, upstream components (e.g. a pedestrian detector in an autonomous vehicle) make decisions that require revisiting once a fuller picture is realized at a downstream stage (e.g. vehicle path planning). The relative weaknesses and strengths of the different component parts need to be assessed when resolving conflicts.
In long-term planning, e.g. logistics and supply chain, a plan may be computed multiple times under different constraints as data evolves. In logistics, an initial plan for delivery may be computed when an item is viewed on a webpage. Webpage waiting-time constraints dominate the solution we choose. However, when an order is placed the time constraint may be relaxed and an accuracy constraint or a cost constraint may now dominate.
Such sub-systems will make inconsistent decisions, but we should monitor and control the extent of the inconsistency.
One solution to aid with both the lack of decomposability of the components and the inconsistency between components is end-to-end learning of the system. End-to-end learning is when we use ML techniques to fit parameters across the entire decision pipeline. We exploit gradient descent and automated differentiation software to achieve this. However, components in the system may themselves be running a simulation (e.g. a transport delivery-time simulation) or optimization (e.g. a linear program) as a subroutine. This limits the universality of automatic differentiation. Another alternative is to replace the entire system with a single ML model, such as in Deep Reinforcement Learning. However, this can severely limit the interpretability of the resulting system.
We envisage AutoAI as allowing us to take advantage of end-to-end learning without sacrificing the interpretability of the underlying system. Instead of optimizing each component individually, we introduce Bayesian system optimization (BSO). We will make use of the end-to-end learning signals and attribute them to the system sub-components through the construction of an interconnected network of surrogate models, known as emulators, each of which is associated with an individual component from the underlying ML-system. Instead of optimizing each component individually (e.g. by classical Bayesian optimization) in BSO we account for upstream and downstream interactions in the optimization, leveraging our end-to-end knowledge without damaging the interpretability of the underlying system.
DeepFace
The DeepFace architecture (Taigman et al. 2014) consists of layers that deal with translation and rotational invariances. These layers are followed by three locally-connected layers and two fully-connected layers. Color illustrates feature maps produced at each layer. The neural network includes more than 120 million parameters, where more than 95% come from the local and fully connected layers.
Mathematically, each layer of a neural network is computed by applying the activation function, $\basisFunction(\cdot)$, to an affine transformation of the previous layer, or of the inputs. In this way the activation functions are composed to generate more complex interactions than would be possible with any single layer.
$$
\begin{align}
\hiddenVector_{1} &= \basisFunction\left(\mappingMatrix_1 \inputVector\right)\\
\hiddenVector_{2} &= \basisFunction\left(\mappingMatrix_2\hiddenVector_{1}\right)\\
\hiddenVector_{3} &= \basisFunction\left(\mappingMatrix_3 \hiddenVector_{2}\right)\\
\dataVector &= \mappingVector_4 ^\top\hiddenVector_{3}
\end{align}
$$
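A direct numpy rendering of this composition is given below; the weights are random and the activation is taken to be a ReLU purely for illustration.

# Numpy rendering of the composition above; random weights, ReLU activation
# chosen purely for illustration.
import numpy as np

def phi(a):                       # activation (basis) function
    return np.maximum(a, 0.0)

d, k1, k2, k3 = 10, 100, 50, 20
rng = np.random.default_rng(0)
W1 = rng.normal(size=(k1, d))
W2 = rng.normal(size=(k2, k1))
W3 = rng.normal(size=(k3, k2))
w4 = rng.normal(size=(k3, 1))

x = rng.normal(size=(d, 1))
h1 = phi(W1 @ x)                  # h_1 = phi(W_1 x)
h2 = phi(W2 @ h1)                 # h_2 = phi(W_2 h_1)
h3 = phi(W3 @ h2)                 # h_3 = phi(W_3 h_2)
y = w4.T @ h3                     # y   = w_4^T h_3
print(y.shape)                    # (1, 1)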
Overfitting
One potential problem is that as the number of nodes in two adjacent layers increases, the number of parameters in the affine transformation between layers, $\mappingMatrix$, increases. If there are $k_{i-1}$ nodes in one layer and $k_i$ nodes in the following, then that matrix contains $k_i k_{i-1}$ parameters. When we have layer widths in the 1000s, that leads to millions of parameters.
One proposed solution is known as dropout where only a sub-set of the neural network is trained at each iteration. An alternative solution would be to reparameterize $\mappingMatrix$ with its singular value decomposition.
$$
\mappingMatrix = \eigenvectorMatrix\eigenvalueMatrix\eigenvectwoMatrix^\top
$$
or
$$
\mappingMatrix = \eigenvectorMatrix\eigenvectwoMatrix^\top
$$
where if $\mappingMatrix \in \Re^{k_1\times k_2}$ then $\eigenvectorMatrix\in \Re^{k_1\times q}$ and $\eigenvectwoMatrix \in \Re^{k_2\times q}$, i.e. we have a low rank matrix factorization for the weights.
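The saving from this factorization can be checked directly; the sketch below (arbitrary sizes) compares the parameter counts and confirms the resulting weight matrix has rank $q$.

# The parameter saving from the low-rank factorization: W has k1*k2 entries,
# while U V^T needs only q*(k1 + k2) when the rank q is small. Sizes are arbitrary.
import numpy as np

k1, k2, q = 1000, 1000, 50
rng = np.random.default_rng(0)
U = rng.normal(size=(k1, q))
V = rng.normal(size=(k2, q))
W = U @ V.T                                      # full weight matrix, rank q

print("full parameters:", k1 * k2)               # 1,000,000
print("factorized parameters:", q * (k1 + k2))   # 100,000
print("rank of W:", np.linalg.matrix_rank(W))    # 50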
Bottleneck Layers in Deep Neural Networks
Including the low rank decomposition of $\mappingMatrix$ in the neural network, we obtain a new mathematical form. Effectively, we are adding additional latent layers, $\latentVector$, in between each of the existing hidden layers. In a neural network these are sometimes known as bottleneck layers. The network can now be written mathematically as
$$
\begin{align}
\latentVector_{1} &= \eigenvectwoMatrix^\top_1 \inputVector\\
\hiddenVector_{1} &= \basisFunction\left(\eigenvectorMatrix_1 \latentVector_{1}\right)\\
\latentVector_{2} &= \eigenvectwoMatrix^\top_2 \hiddenVector_{1}\\
\hiddenVector_{2} &= \basisFunction\left(\eigenvectorMatrix_2 \latentVector_{2}\right)\\
\latentVector_{3} &= \eigenvectwoMatrix^\top_3 \hiddenVector_{2}\\
\hiddenVector_{3} &= \basisFunction\left(\eigenvectorMatrix_3 \latentVector_{3}\right)\\
\dataVector &= \mappingVector_4^\top\hiddenVector_{3}.
\end{align}
$$
Eliminating the hidden layers, the same network can be written directly in terms of the bottleneck (latent) variables,
$$
\begin{align}
\latentVector_{1} &= \eigenvectwoMatrix^\top_1 \inputVector\\
\latentVector_{2} &= \eigenvectwoMatrix^\top_2 \basisFunction\left(\eigenvectorMatrix_1 \latentVector_{1}\right)\\
\latentVector_{3} &= \eigenvectwoMatrix^\top_3 \basisFunction\left(\eigenvectorMatrix_2 \latentVector_{2}\right)\\
\dataVector &= \mappingVector_4 ^\top \latentVector_{3}
\end{align}
$$
Cascade of Gaussian Processes
Now we replace each of these functions with a Gaussian process. This is equivalent to taking the limit as the width of each layer goes to infinity, while appropriately scaling down the outputs.
$$
\begin{align}
\latentVector_{1} &= \mappingFunctionVector_1\left(\inputVector\right)\\
\latentVector_{2} &= \mappingFunctionVector_2\left(\latentVector_{1}\right)\\
\latentVector_{3} &= \mappingFunctionVector_3\left(\latentVector_{2}\right)\\
\dataVector &= \mappingFunctionVector_4\left(\latentVector_{3}\right)
\end{align}
$$
Stochastic Process Composition
$$\dataVector = \mappingFunctionVector_4\left(\mappingFunctionVector_3\left(\mappingFunctionVector_2\left(\mappingFunctionVector_1\left(\inputVector\right)\right)\right)\right)$$
import numpy as np
import pods
import matplotlib.pyplot as plt
import mlai
# `plot` refers to the plotting helpers that accompany these notes.

data = pods.datasets.mcycle()
x = data['X']
y = data['Y']
scale = np.sqrt(y.var())
offset = y.mean()
yhat = (y - offset) / scale      # normalize the outputs for the GP models

fig, ax = plt.subplots(figsize=plot.big_wide_figsize)
_ = ax.plot(x, y, 'r.', markersize=10)
_ = ax.set_xlabel('time', fontsize=20)
_ = ax.set_ylabel('acceleration', fontsize=20)
xlim = (-20, 80)
ylim = (-175, 125)
ax.set_xlim(xlim)
ax.set_ylim(ylim)
mlai.write_figure(filename='../slides/diagrams/datasets/motorcycle-helmet.svg',
                  transparent=True, frameon=True)
Motorcycle Helmet Data
import GPy

m_full = GPy.models.GPRegression(x, yhat)
_ = m_full.optimize()  # optimize parameters of the covariance function
Motorcycle Helmet Data GP
The deep Gaussian process code we are using is research code by Andreas Damianou.
To extend the research code we introduce some approaches to initialization and optimization that we’ll use in the examples. These approaches can be found in the deepgp_tutorial.py file.
Deep Gaussian process models can also require some thought in the initialization. Here we choose to start by setting the noise variance to be one percent of the data variance.
Secondly, we introduce a staged optimization approach.
Optimization requires moving variational parameters in the hidden layer, representing the mean and variance of the expected values in that layer. Since all those values can be scaled up, with the only effect being a downscaling of the output of the first GP and of the input length scale to the second GP, it makes sense first of all to fix the scales of the covariance function in each of the GPs.
Sometimes, deep Gaussian processes can find a local minimum which involves increasing the noise level of one or more of the GPs. This often occurs because it allows a minimum in the KL divergence term in the lower bound on the likelihood. To avoid this minimum we habitually train with the likelihood variance (the noise on the output of the GP) fixed to some lower value for some iterations.
Next an optimization of the kernel function parameters at each layer is performed, but with the variance of the likelihood fixed. Again, this is to prevent the model minimizing the Kullback-Leibler divergence between the approximate posterior and the prior before achieving a good data-fit.
Finally, all parameters of the model are optimized together.
The next code is for visualizing the intermediate layers of the deep model. The pinball visualization brings the pinball analogy to life in the model, showing how a ball would fall through the model to end up in the right position. Both visualizations are only appropriate for models whose intermediate layers contain a single latent variable.
The posterior_sample code allows us to see the output sample locations for a given input. This is useful for visualizing the non-Gaussian nature of the output density.
Finally, we bind these methods to the DeepGP object for ease of calling.
# Bind the helper methods from deepgp_tutorial.py to the DeepGP class.
deepgp.DeepGP.initialize = initialize
deepgp.DeepGP.staged_optimize = staged_optimize
deepgp.DeepGP.posterior_sample = posterior_sample
deepgp.DeepGP.visualize = visualize
deepgp.DeepGP.visualize_pinball = visualize_pinball
import deepgp

# Layer widths: output dimension, a one-dimensional hidden layer, input dimension.
layers = [y.shape[1], 1, x.shape[1]]
inits = ['PCA'] * (len(layers) - 1)
kernels = []
for i in layers[1:]:
    kernels += [GPy.kern.RBF(i)]      # an RBF kernel for each layer
m = deepgp.DeepGP(layers, Y=yhat, X=x,
                  inits=inits,
                  kernels=kernels,    # the kernels for each layer
                  num_inducing=20, back_constraint=False)
m.initialize()
fig, ax = plt.subplots(figsize=plot.big_wide_figsize)
plot.model_output(m, scale=scale, offset=offset, ax=ax, xlabel='time',
                  ylabel='acceleration/$g$', fontsize=20, portion=0.5)
ax.set_ylim(ylim)
ax.set_xlim(xlim)
mlai.write_figure(filename='../slides/diagrams/deepgp/motorcycle-helmet-deep-gp.svg',
                  transparent=True, frameon=True)
Motorcycle Helmet Data Deep GP
fig, ax = plt.subplots(figsize=plot.big_wide_figsize)
plot.model_sample(m, scale=scale, offset=offset, samps=10, ax=ax,
                  xlabel='time', ylabel='acceleration/$g$', portion=0.5)
ax.set_ylim(ylim)
ax.set_xlim(xlim)
mlai.write_figure(figure=fig, filename='../slides/diagrams/deepgp/motorcycle-helmet-deep-gp-samples.svg',
                  transparent=True, frameon=True)
Motorcycle Helmet Data Deep GP
m.visualize(xlim=xlim, ylim=ylim, scale=scale, offset=offset,
            xlabel="time", ylabel="acceleration/$g$", portion=0.5,
            dataset='motorcycle-helmet',
            diagrams='../slides/diagrams/deepgp')
Motorcycle Helmet Data Latent 1
Motorcycle Helmet Data Latent 2
fig, ax = plt.subplots(figsize=plot.big_wide_figsize)
m.visualize_pinball(ax=ax, xlabel='time', ylabel='acceleration/g',
                    points=50, scale=scale, offset=offset, portion=0.1)
mlai.write_figure(figure=fig, filename='../slides/diagrams/deepgp/motorcycle-helmet-deep-gp-pinball.svg',
                  transparent=True, frameon=True)
Motorcycle Helmet Pinball Plot
Graphical Models
One way of representing a joint distribution is to consider conditional dependencies between data. Conditional dependencies allow us to factorize the distribution. For example, a Markov chain is a factorization of a distribution into components that represent the conditional relationships between points that are neighboring, often in time or space. It can be decomposed in the following form.
$$p(\dataVector) = p(\dataScalar_\numData | \dataScalar_{\numData-1}) p(\dataScalar_{\numData-1}|\dataScalar_{\numData-2}) \dots p(\dataScalar_{2} | \dataScalar_{1}) p(\dataScalar_{1})$$
By specifying conditional independencies we can reduce the parameterization required for our data: instead of directly specifying the parameters of the joint distribution, we can specify the parameters of each conditional independently. This can also give an advantage in terms of interpretability. Understanding a conditional independence structure gives a structured understanding of data. If developed correctly, according to causal methodology, it can even inform how we should intervene in the system to drive a desired result (Pearl 1995).
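As a small numerical check of this factorization, consider a Gaussian random-walk Markov chain: the joint log density of a sequence is the log of $p(\dataScalar_1)$ plus the sum of the log conditionals. A sketch:

# Numerical illustration of the Markov factorization for a Gaussian random
# walk: the joint log density is log p(y_1) plus the sum of log p(y_i | y_{i-1}).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
y = np.cumsum(rng.normal(size=20))                     # a realisation of the chain

log_p = stats.norm.logpdf(y[0])                        # log p(y_1)
log_p += stats.norm.logpdf(y[1:], loc=y[:-1]).sum()    # sum of log p(y_i | y_{i-1})
print(log_p)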
However, a challenge arises when the data becomes more complex. Consider the graphical model shown below, used to predict the perioperative risk of C Difficile infection following colon surgery (Steele et al. 2012).
To capture the complexity in the interrelationships between the data, the graph itself becomes more complex, and less interpretable.
Conclusion
We operate in a technologically evolving environment. Machine learning is becoming a key component in our decision-making capabilities, our intelligence and our strategic command. Technology has always driven changes in battlefield strategy: from the stalemate of the First World War, to the tank-dominated Blitzkrieg of the Second, to the asymmetric warfare of the present. Our technology, tactics and strategies are constantly evolving. Machine learning is part of that evolving solution, but the main challenge is not to become so fixated on the tactics of today that we miss the evolution of strategy that the technology is suggesting.
Data oriented programming offers a set of development methodologies which ensure that the system designer considers what decisions are required, how they will be made, and critically, declares this within the system architecture.
This allows for monitoring of data quality, fairness and model accuracy, and opens the door to Auto AI: a more sophisticated form of auto ML where full redeployments of models are considered while analyzing the information dynamics of a complex automated decision-making system.
Related Papers
Deep Gaussian Processes Damianou and Lawrence (2013)
Latent Force Models Álvarez, Luengo, and Lawrence (2013)
Gaussian Process Latent Force Models for Learning and Stochastic Control of Physical Systems Särkkä, Álvarez, and Lawrence (2018)
The Emergence of Organizing Structure in Conceptual Representation Lake, Lawrence, and Tenenbaum (2018)
Others’ Work
- How Deep Are Deep Gaussian Processes? Dunlop et al. (n.d.)
- Doubly Stochastic Variational Inference for Deep Gaussian Processes Salimbeni and Deisenroth (2017)
- Deep Multi-task Gaussian Processes for Survival Analysis with Competing Risks Alaa and van der Schaar (2017)
- Counterfactual Gaussian Processes for Reliable Decision-making and What-if Reasoning Schulam and Saria (2017)
Conclusions and Directions
We’ve introduced some of the challenges of real-world systems and outlined how to address them. The new ideas we are focussing on extend the field of uncertainty quantification and surrogate modelling to four different areas.
Automated Abstraction is the automated deployment of surrogate models, or emulators, for summarizing the underlying components in the system. It relies on data oriented architectures to be possible.
Deep emulation is the combination of chains of different emulators across the system to assess downstream performance.
Bayesian System Optimization is the resulting optimization of the entire system, end-to-end, in a manner that doesn’t destroy interpretability, because end-to-end signals are propagated down to the system components through the deep emulator.
Auto AI is the result: moving beyond Auto ML, we will be able to develop systems that identify problems in deployment and assess the appropriate system responses.
References
Alaa, Ahmed M., and Mihaela van der Schaar. 2017. “Deep Multi-Task Gaussian Processes for Survival Analysis with Competing Risks.” In Advances in Neural Information Processing Systems 30, edited by I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, 2326–34. Curran Associates, Inc. http://papers.nips.cc/paper/6827-deep-multi-task-gaussian-processes-for-survival-analysis-with-competing-risks.pdf.
Álvarez, Mauricio A., David Luengo, and Neil D. Lawrence. 2013. “Linear Latent Force Models Using Gaussian Processes.” IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (11): 2693–2705. https://doi.org/10.1109/TPAMI.2013.86.
Cooper, Brian. 1991. Transformation of a Valley: Derbyshire Derwent. Scarthin Books.
Damianou, Andreas, Carl Henrik Ek, Michalis K. Titsias, and Neil D. Lawrence. n.d. “Manifold Relevance Determination.” In.
Damianou, Andreas, and Neil D. Lawrence. 2013. “Deep Gaussian Processes.” In, 31:207–15.
Dunlop, Matthew M., Mark A. Girolami, Andrew M. Stuart, and Aretha L. Teckentrup. n.d. “How Deep Are Deep Gaussian Processes?” Journal of Machine Learning Research 19 (54): 1–46. http://jmlr.org/papers/v19/18-015.html.
Goodfellow, Ian, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. “Generative Adversarial Nets.” In Advances in Neural Information Processing Systems 27, edited by Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, 2672–80. Curran Associates, Inc.
Lake, Brenden M., Neil D. Lawrence, and Joshua B. Tenenbaum. 2018. “The Emergence of Organizing Structure in Conceptual Representation.” Cognitive Science 42 Suppl 3: 809–32. https://doi.org/10.1111/cogs.12580.
Lawrence, Neil D. 2017. “Data Readiness Levels.” arXiv.
Pearl, Judea. 1995. “From Bayesian Networks to Causal Networks.” In Probabilistic Reasoning and Bayesian Belief Networks, edited by A. Gammerman, 1–31. Alfred Waller.
Perdikaris, Paris, Maziar Raissi, Andreas Damianou, Neil D. Lawrence, and George Em Karnidakis. 2017. “Nonlinear Information Fusion Algorithms for Data-Efficient Multi-Fidelity Modelling.” Proc. R. Soc. A 473 (20160751). https://doi.org/10.1098/rspa.2016.0751.
Salimbeni, Hugh, and Marc Deisenroth. 2017. “Doubly Stochastic Variational Inference for Deep Gaussian Processes.” In Advances in Neural Information Processing Systems 30, edited by I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, 4591–4602. Curran Associates, Inc. http://papers.nips.cc/paper/7045-doubly-stochastic-variational-inference-for-deep-gaussian-processes.pdf.
Särkkä, Simo, Mauricio A. Álvarez, and Neil D. Lawrence. 2018. “Gaussian Process Latent Force Models for Learning and Stochastic Control of Physical Systems.” IEEE Transactions on Automatic Control. https://doi.org/10.1109/TAC.2018.2874749.
Schulam, Peter, and Suchi Saria. 2017. “Counterfactual Gaussian Processes for Reliable Decision-Making and What-If Reasoning.” In Advances in Neural Information Processing Systems 30, edited by I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, 1696–1706. Curran Associates, Inc. http://papers.nips.cc/paper/6767-counterfactual-gaussian-processes-for-reliable-decision-making-and-what-if-reasoning.pdf.
Sculley, D., Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-François Crespo, and Dan Dennison. 2015. “Hidden Technical Debt in Machine Learning Systems.” In Advances in Neural Information Processing Systems 28, edited by Corinna Cortes, Neil D. Lawrence, Daniel D. Lee, Masashi Sugiyama, and Roman Garnett, 2503–11. Curran Associates, Inc. http://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf.
Steele, S, A Bilchik, J Eberhardt, P Kalina, A Nissan, E Johnson, I Avital, and A Stojadinovic. 2012. “Using Machine-Learned Bayesian Belief Networks to Predict Perioperative Risk of Clostridium Difficile Infection Following Colon Surgery.” Interact J Med Res 1 (2): e6. https://doi.org/10.2196/ijmr.2131.
Taigman, Yaniv, Ming Yang, Marc’Aurelio Ranzato, and Lior Wolf. 2014. “DeepFace: Closing the Gap to Human-Level Performance in Face Verification.” In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. https://doi.org/10.1109/CVPR.2014.220.
These approaches are one area of focus for my own team’s research. A data first architecture is a prerequisite for efficient deployment of machine learning systems.↩
See for example “The Dark Secret at the Heart of AI” in Technology Review.↩