Interpretable End-to-End Learning
Abstract
Practical artificial intelligence systems can be seen as algorithmic decision-makers. The fractal nature of decision-making implies that this involves interacting systems of components where decisions are made multiple times across different time frames. This affects the decomposability of an artificial intelligence system. Classical systems design relies on decomposability for efficient maintenance and deployment of machine learning systems. In this talk we consider the challenges of optimizing and maintaining such systems.
Introduction
The fourth industrial revolution bears the particular hallmark of being the first revolution that has been named before it has happened. This is particularly unfortunate, because it is not in fact an industrial revolution at all. Nor is it necessarily a distinct phenomenon. It is part of a revolution in information, one that goes back to digitisation and the invention of the silicon chip.
Or to put it more precisely, it is a revolution in how information can affect the physical world. The interchange between information and the physical world.
Amazon’s New Delivery Drone
One example is autonomous vehicles, both those we intend to operate on the ground and also those in the air.
The drone highlights one of the important changes that is driving the innovation from machine learning, the interaction between the physical world and the information world.
Machine Learning in Supply Chain
On Sunday mornings in Sheffield, I often used to run across Packhorse Bridge in Burbage valley. The bridge is part of an ancient network of trails crossing the Pennines that, before Turnpike roads arrived in the 18th century, was the main way in which goods were moved. Given that the moors around Sheffield were home to sand quarries, tin mines, lead mines and the villages in the Derwent valley were known for nail and pin manufacture, this wasn’t simply movement of agricultural goods, but it was the infrastructure for industrial transport.
The profession of leading the horses was known as a Jagger and leading out of the village of Hathersage is Jagger’s Lane, a trail that headed underneath Stanage Edge and into Sheffield.
The movement of goods from regions of supply to areas of demand is fundamental to our society. The physical infrastructure of supply chain has evolved a great deal over the last 300 years.
Richard Arkwright is known as the father of the modern factory system. In 1771 he set up a mill for spinning cotton yarn in the village of Cromford, in the Derwent Valley. The Derwent Valley is relatively inaccessible. Raw cotton arrived in Liverpool from the US and India. It needed to be transported on packhorse across the bridleways of the Pennines. But Cromford was a good location due to proximity to Nottingham, where weavers were consuming the finished thread, and the availability of water power from small tributaries of the Derwent river for Arkwright’s water frames, which automated the production of yarn from raw cotton.
By 1794 the Cromford canal was opened to bring coal into Cromford and give better transport to Nottingham. The construction of the canals was driven by the need to improve the transport infrastructure, facilitating the movement of goods across the UK. Canals, roads and railways were initially constructed out of the economic need for moving goods: to improve the supply chain.
Containerization
Containerization has had a dramatic effect on global economics, placing many people in the developing world at the end of the supply chain.


For example, you can buy Wild Alaskan Cod fished from Alaska, processed in China, sold in North America. This is driven by the low cost of transport for frozen cod vs the higher relative cost of cod processing in the US versus China. Similarly, Scottish prawns are also processed in China for sale in the UK.
This effect on cost of transport vs cost of processing is the main driver of the topology of the modern supply chain and the associated effect of globalization. If transport is much cheaper than processing, then processing will tend to agglomerate in places where processing costs can be minimized.
Large scale global economic change has principally been driven by changes in the technology that drives supply chain.
Supply chain is a large-scale automated decision-making network. Our aim is to make decisions not only based on our models of customer behavior (as observed through data), but also by accounting for the structure of our fulfilment center and delivery network.
Many of the most important questions in supply chain take the form of counterfactuals. E.g. “What would happen if we opened a manufacturing facility in Cambridge?” A counterfactual is a question that implies a mechanistic understanding of a system. It goes beyond simple smoothness assumptions or translation invariances. It requires a physical, or mechanistic, understanding of the supply chain network. For this reason, the type of models we deploy in supply chain often involve simulations or more mechanistic understanding of the network.
In supply chain Machine Learning alone is not enough, we need to bridge between models that contain real mechanisms and models that are entirely data driven.
This is challenging, because as we introduce more mechanism to the models we use, it becomes harder to develop efficient algorithms to match those models to data.
Solve Supply Chain, then solve everything else.
End-to-End: Environment and Decision
From Model to Decision
The real challenge, however, is end-to-end decision-making. Taking information from the environment and using it to drive decision-making to achieve goals.
We don’t know what science we’ll want to do in 5 years time, but we won’t want slower experiments, we won’t want more expensive experiments and we won’t want a narrower selection of experiments.
- Faster, cheaper and more diverse experiments.
- Better ecosystems for experimentation.
- Data-oriented architectures.
Data Oriented Architectures
In a streaming architecture we shift from management of services to management of data streams. Instead of worrying about the availability of the services, we shift to worrying about the quality of the data those services are producing.
Streaming System
Characteristics of a streaming system include a move from pull updates to push updates, i.e. the computation is driven by a change in the input data rather than the service calling for input data when it decides to run a computation. Streaming systems operate on ‘rows’ of the data rather than ‘columns’. This is because the full column isn’t normally available as it changes over time. As an important design principle, the services themselves are stateless, they take their state from the streaming ecosystem. This ensures the inputs and outputs of given computations are easy to declare. As a result, persistence of the data is also handled by the streaming ecosystem and decisions around data retention or recomputation can be taken at the systems level rather than the component level.
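The characteristics above can be sketched in a few lines of Python. This is a toy illustration, not a real streaming framework: all state lives in the stream ecosystem (here a plain dict of streams), and a stateless service is pushed each new event as it arrives rather than pulling data on its own schedule.

```python
# The 'ecosystem' holds all state; services themselves are stateless.
streams = {"prices": [], "averages": []}

def publish(stream, event, subscribers):
    """Push an event: append it to the stream, then notify downstream
    services so that computation is driven by the arriving data."""
    streams[stream].append(event)
    for service in subscribers:
        service(event)

def moving_average_service(event):
    """Stateless service: it takes its state from the stream ecosystem,
    operating on recent 'rows' rather than a full column of history."""
    window = streams["prices"][-3:]
    streams["averages"].append(sum(window) / len(window))

for price in [10.0, 11.0, 12.0, 13.0]:
    publish("prices", price, [moving_average_service])

print(streams["averages"])  # one output per arriving input event
```

Because the service holds no state of its own, decisions about persistence and recomputation of `streams` can be taken at the systems level, exactly as described above.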
Apache Flink
Apache Flink is a stream processing framework. Flink is a foundation for event-driven processing. This gives a high-throughput, low-latency framework that operates on dataflows.
Data storage is handled by other systems such as Apache Kafka or AWS Kinesis.
stream.join(otherStream)
    .where(<KeySelector>)
    .equalTo(<KeySelector>)
    .window(<WindowAssigner>)
    .apply(<JoinFunction>)
Apache Flink allows operations on streams, for example the join operation above. In a traditional database management system, this join operation may be written in SQL and called on demand. In a streaming ecosystem, computations occur as and when the streams update.
The join is handled by the ecosystem surrounding the business logic.
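To make the join semantics concrete, here is a rough Python sketch of what a keyed, windowed join computes. The event fields (`t`, `user`, `item`, `page`) are made up for illustration, and this batch version ignores the incremental, out-of-order event handling a real engine like Flink provides.

```python
from collections import defaultdict

def windowed_join(stream_a, stream_b, key_a, key_b, window):
    """Join events from two streams whose keys match and whose
    timestamps fall within `window` of one another."""
    by_key = defaultdict(list)
    for event in stream_b:
        by_key[key_b(event)].append(event)
    joined = []
    for a in stream_a:
        for b in by_key[key_a(a)]:
            if abs(a["t"] - b["t"]) <= window:
                joined.append((a, b))
    return joined

orders = [{"t": 1, "user": "anne", "item": "book"}]
clicks = [{"t": 2, "user": "anne", "page": "checkout"},
          {"t": 9, "user": "anne", "page": "home"}]

result = windowed_join(orders, clicks,
                       key_a=lambda e: e["user"],
                       key_b=lambda e: e["user"],
                       window=3)
print(len(result))  # only the click inside the window joins
```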
Trading System
As a simple example we’ll consider a high frequency trading system. Anne wishes to build a share trading system. She has access to a high frequency trading system which provides prices and allows trades at millisecond intervals. She wishes to build an automated trading system.
Let’s assume that price trading data is available as a data stream. But the price now is not the only information that Anne needs, she needs an estimate of the price in the future.
import numpy as np
import pandas as pd

# Generate an artificial trading stream as a random walk
days = pd.date_range(start='21/5/2017', end='21/05/2020')
z = np.random.randn(len(days))
x = z.cumsum() + 400
prices = pd.Series(x, index=days)
hypothetical = prices.loc['21/5/2019':]
real = prices.copy()
real.loc['21/5/2019':] = np.nan
Hypothetical Streams
We’ll call the future price a hypothetical stream.
A hypothetical stream is a desired stream of information which cannot be directly accessed. The lack of direct access may be because the events happen in the future, or there may be some latency between the event and the availability of the data.
Any hypothetical stream will only be provided as a prediction, ideally with an error bar.
The nature of the hypothetical Anne needs is dependent on her decision-making process. In Anne’s case it will depend over what period she is expecting her returns. In MDOP Anne specifies a hypothetical that is derived from the pricing stream.
It is not the price stream directly, but Anne looks for future predictions from the price stream, perhaps for price in T days’ time.
At this stage, this stream is merely typed as a hypothetical.
There are constraints on the hypothetical, they include: the input information, the upper limit of latency between input and prediction, and the decision Anne needs to make (how far ahead, what her upside, downside risks are). These three constraints mean that we can only recover an approximation to the hypothetical.
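As a minimal sketch of such an approximation, assume the prices really are the random walk generated earlier. Then the best prediction T days ahead is simply the current price, and the error bar grows with the square root of the horizon; a real system would use a richer model, but any approximation to the hypothetical should carry this kind of uncertainty.

```python
import numpy as np

# Regenerate a random-walk price stream (self-contained version of the
# artificial trading stream above).
rng = np.random.default_rng(0)
prices = 400 + rng.standard_normal(1000).cumsum()

T = 5                               # horizon (days) Anne cares about
increments = np.diff(prices)
sigma = increments.std()            # estimated daily volatility

prediction = prices[-1]             # random-walk point forecast
error_bar = 2 * sigma * np.sqrt(T)  # ~95% interval half-width

print(f"{prediction:.1f} +/- {error_bar:.1f}")
```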
Hypothetical Advantage
What is the advantage to defining things in this way? By clearly defining the two streams as real and hypothetical variants of each other, we enable automation of the deployment and any redeployment process. The hypothetical can be instantiated against the real, and design criteria can be constantly evaluated, triggering retraining when necessary.
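A toy sketch of that evaluation loop: once the future arrives, the hypothetical’s past predictions can be compared against the realised stream, and retraining triggered when the design criterion is violated. The particular criterion and threshold here are illustrative.

```python
import numpy as np

def needs_retraining(predicted, realised, threshold=5.0):
    """Design criterion evaluated continuously on the paired streams:
    trigger redeployment when mean absolute error exceeds threshold."""
    error = np.abs(np.asarray(predicted) - np.asarray(realised)).mean()
    return error > threshold

# The hypothetical's earlier predictions, now paired with realised prices.
predicted = [400.0, 401.0, 402.0]
realised = [400.5, 407.0, 415.0]   # drifting away: the model is stale

print(needs_retraining(predicted, realised))
```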
Ride Sharing System
As a second example, we’ll consider a ride sharing app.
Anne is on her way home now; she wishes to hail a car using a ride sharing app.
The app is designed in the following way. On opening her app Anne is notified about drivers in the nearby neighborhood. She is given an estimate of the time a ride may take to come.
Given this information about driver availability, Anne may feel encouraged to enter a destination. Given this destination, a price estimate can be given. This price is conditioned on other riders that may wish to go in the same direction, but the price estimate needs to be made before the user agrees to the ride.
Business customer service constraints dictate that this price may not change after Anne’s order is confirmed.
In this simple system, several decisions are being made, each of them on the basis of a hypothetical.
When Anne calls for a ride, she is provided with an estimate based on the expected time a ride can be with her. But this estimate is made without knowing where Anne wants to go. Constraints on drivers, imposed by regional boundaries, the ends of their shifts, or their current passengers, mean that this estimate can only be a best guess.
This best guess may well be driven by previous data.
For the ride sharing system, we start to see a common issue with a more complex algorithmic decision-making system. Several decisions are being made multiple times. Let’s look at the decisions we need along with some design criteria.
- Car Availability: Estimate time to arrival for Anne’s ride using Anne’s location and local available car locations. Latency: 50 milliseconds.
- Cost Estimate: Estimate cost for journey using Anne’s destination, location and local available cars’ current destinations and availability. Latency: 50 milliseconds.
- Driver Allocation: Allocate car to minimize transport cost to destination. Latency: 2 seconds.
So we need:
- a hypothetical to estimate availability. It is constrained by lacking destination information and a low-latency requirement.
- a hypothetical to estimate cost. It is constrained by a low-latency requirement and
Simultaneously, drivers in this data ecosystem have an app which notifies them about new jobs and recommends where they should go.
A further advantage: strategies for data retention (when to snapshot) can be set globally.
A few decisions need to be made in this system. First of all, when the user opens the app, the estimate of the time to the nearest ride may need to be computed quickly, to avoid latency in the service.
This may require a quick estimate of the ride availability.
Information Dynamics
With all the second-guessing within a complex automated decision-making system, there are potential problems with information dynamics: the ‘closed loop’ problem, where the subsystems are being approximated (second-guessed) and predictions downstream are being affected.
This leads to the need for a closed loop analysis, for example, see the “Closed Loop Data Science” project led by Rod Murray-Smith at Glasgow.
Our aim is to release our first version of a data-oriented programming environment by end of June 2019 (pending internal approval).
Emulation
In many real-world systems, decisions are made through simulating the environment. Simulations may operate at different granularities. For example, simulations are used in both weather forecasts and climate forecasts. The UK Met Office uses the same code for both, but operates the simulations at different spatial and temporal resolutions.
A statistical emulator is a data-driven model that learns about the underlying simulation. Importantly, it learns with uncertainty, so it ‘knows what it doesn’t know’. In practice, we can call the emulator in place of the simulator. If the emulator ‘doesn’t know’, it can call the simulator for the answer.
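A toy sketch of that ‘knows what it doesn’t know’ behaviour, using a nearest-neighbour lookup in place of a real (e.g. Gaussian process) emulator; the distance threshold standing in for predictive uncertainty, and the simulator function itself, are purely illustrative.

```python
import numpy as np

def simulator(x):
    """Stands in for an expensive simulation run."""
    return np.sin(3 * x) + 0.5 * x

class Emulator:
    """Answers cheaply near previous simulator runs; when 'uncertain'
    (far from any run), defers to the simulator and caches the result."""
    def __init__(self, threshold=0.1):
        self.X, self.Y = [], []
        self.threshold = threshold
        self.simulator_calls = 0

    def __call__(self, x):
        if self.X:
            i = int(np.argmin([abs(x - xi) for xi in self.X]))
            if abs(x - self.X[i]) < self.threshold:
                return self.Y[i]          # confident: cheap answer
        y = simulator(x)                  # uncertain: call the simulator
        self.simulator_calls += 1
        self.X.append(x)
        self.Y.append(y)
        return y

em = Emulator()
for x in [0.0, 0.01, 0.02, 1.0]:
    em(x)
print(em.simulator_calls)  # nearby queries reuse the run at 0.0
```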
As well as reconstructing an individual simulator, the emulator can calibrate the simulation to the real world, by monitoring differences between the simulator and real data. This allows the emulator to characterise where the simulation can be relied on, i.e. we can validate the simulator.
Similarly, the emulator can adjudicate between simulations. This is known as multi-fidelity emulation. The emulator characterizes which simulations perform well where.
If all this modelling is done with judicious handling of the uncertainty, the computational doubt, then the emulator can assist in deciding what experiment should be run next to aid a decision: should we run a simulator, and if so which one, or should we attempt to acquire data from a real-world intervention?
Deep Emulation
As a solution we can make use of emulators. When constructing an ML system, software engineers, ML engineers, economists and operations researchers are explicitly defining relationships between variables of interest in the system. That implicitly defines a joint distribution, $p(\dataVector^*, \dataVector)$. In a decomposable system any subcomponent may be defined as $p(\dataVector_\mathbf{i} \mid \dataVector_\mathbf{j})$ where $\dataVector_\mathbf{i}$ and $\dataVector_\mathbf{j}$ represent subsets of the full set of variables $\left\{\dataVector^*, \dataVector \right\}$. In those cases where the relationship is deterministic, the probability density collapses to a vector-valued deterministic function, $\mappingFunctionVector_\mathbf{i}\left(\dataVector_\mathbf{j}\right)$.
Inter-variable relationships could be defined by, for example, a neural network (machine learning), an integer program (operational research), or a simulation (supply chain). This makes probabilistic inference in this joint density either very hard or impossible for real-world systems.
Emulation is a form of meta-modelling: we construct a model of the model. We can define the joint density of an emulator as $s(\dataVector^*, \dataVector)$, but if this probability density is to be an accurate representation of our system, it is likely to be prohibitively complex. Current practice is to design an emulator to deal with a specific question. This is done by fitting an ML model to samples from the appropriate conditional distribution, $p(\dataVector_\mathbf{i} \mid \dataVector_\mathbf{j})$, which is itself intractable. The emulator provides an approximate answer of the form $s(\dataVector_\mathbf{i} \mid \dataVector_\mathbf{j})$. Critically, an emulator should incorporate its uncertainty about its approximation. So the emulator answer will be less certain than direct access to the conditional $p(\dataVector_\mathbf{i} \mid \dataVector_\mathbf{j})$, but it may be sufficiently confident to act upon. Careful design of emulators to answer a given question leads to efficient diagnostics and understanding of the system. But in a complex interacting system an exponentially increasing number of questions can be asked. This calls for a system of automated construction of emulators which selects the right structure and redeploys the emulator as necessary. Rapid redeployment of emulators could exploit pre-existing emulators through transfer learning.
Automatically deploying these families of emulators for full system understanding is highly ambitious. It requires advances in engineering infrastructure, emulation and Bayesian optimization. However, the intermediate steps of developing this architecture also allow for automated monitoring of system accuracy and fairness. This facilitates AutoML on a component-wise basis, which we can see as a simple implementation of AutoAI. The proposal is structured so that despite its technical ambition there is a smooth ramp of benefits to be derived across the programme of work.
In Applied Mathematics, the field studying these techniques is known as uncertainty quantification. The new challenge is the automation of emulator creation on demand to answer questions of interest and facilitate the system design, i.e. AutoAI through Bayesian system optimization (BSO).
At design stage, any particular AI task could be decomposed in multiple ways. Bayesian system optimization will assist both in determining the large-scale system design, through exploring different decompositions, and in refinement of the deployed system.
So far, most work on emulators has focussed on emulating a single component. Automated deployment and maintenance of ML systems requires networks of emulators that can be deployed and redeployed on demand depending on the particular question of interest. Therefore, the technical innovations we require are in the mathematical composition of emulator models (Damianou and Lawrence 2013; Perdikaris et al. 2017). Different chains of emulators will need to be rapidly composed to make predictions of downstream performance. This requires rapid retraining of emulators and propagation of uncertainty through the emulation pipeline, a process we call deep emulation.
Recomposing the ML system requires structural learning of the network. By parameterizing covariance functions appropriately this can be done through Gaussian processes (e.g. (Damianou et al., n.d.)), but one could also consider Bayesian neural networks and other generative models, e.g. Generative Adversarial Networks (Goodfellow et al. 2014).
Bayesian System Optimization
- Aim: maintain interpretable components.
- Monitor downstream/upstream effects through emulation.
- Optimize individual components considering upstream and downstream.
Auto AI
Supervised machine learning models are data-driven statistical functional estimators. Each ML model is trained to perform a task. Machine learning systems are created when these models are integrated as interacting components in a more complex system that carries out a larger scale task, e.g. an autonomous drone delivery system.
Artificial Intelligence can also be seen as algorithmic decision-making. ML systems are data-driven algorithmic decision-makers. Designing decision-making engines requires us first to decompose the system into its component parts. The decompositions are driven by (1) system performance requirements, (2) the suite of ML algorithms at our disposal, and (3) data availability. Performance requirements could be computational speed, accuracy, interpretability, and ‘fairness’. The current generation of ML systems is often based around supervised learning and human-annotated data. But in the future, we may expect more use of reinforcement learning and automated knowledge discovery using unsupervised learning.
The classical systems approach assumes decomposability of components. In ML, upstream components (e.g. a pedestrian detector in an autonomous vehicle) make decisions that require revisiting once a fuller picture is realized at a downstream stage (e.g. vehicle path planning). The relative weaknesses and strengths of the different component parts need to be assessed when resolving conflicts.
In long-term planning, e.g. logistics and supply chain, a plan may be computed multiple times under different constraints as data evolves. In logistics, an initial plan for delivery may be computed when an item is viewed on a webpage. Webpage waiting-time constraints dominate the solution we choose. However, when an order is placed the time constraint may be relaxed and an accuracy constraint or a cost constraint may now dominate.
Such subsystems will make inconsistent decisions, but we should monitor and control the extent of the inconsistency.
One solution to aid with both the lack of decomposability of the components and the inconsistency between components is end-to-end learning of the system. End-to-end learning is when we use ML techniques to fit parameters across the entire decision pipeline. We exploit gradient descent and automatic differentiation software to achieve this. However, components in the system may themselves be running a simulation (e.g. a transport delivery-time simulation) or an optimization (e.g. a linear program) as a subroutine. This limits the universality of automatic differentiation. Another alternative is to replace the entire system with a single ML model, such as in Deep Reinforcement Learning. However, this can severely limit the interpretability of the resulting system.
We envisage AutoAI as allowing us to take advantage of end-to-end learning without sacrificing the interpretability of the underlying system. Instead of optimizing each component individually (e.g. by classical Bayesian optimization), we introduce Bayesian system optimization (BSO). We make use of the end-to-end learning signals and attribute them to the system sub-components through the construction of an interconnected network of surrogate models, known as emulators, each of which is associated with an individual component from the underlying ML system. In BSO we account for upstream and downstream interactions in the optimization, leveraging our end-to-end knowledge without damaging the interpretability of the underlying system.
DeepFace
The DeepFace architecture (Taigman et al. 2014) consists of layers that deal with translation and rotational invariances. These layers are followed by three locally-connected layers and two fully-connected layers. Color illustrates feature maps produced at each layer. The neural network includes more than 120 million parameters, where more than 95% come from the locally- and fully-connected layers.
Mathematically, each layer of a neural network is given through computing the activation function, $\basisFunction(\cdot)$, contingent on the previous layer, or the inputs. In this way the activation functions are composed to generate more complex interactions than would be possible with any single layer.
$$
\begin{align}
\hiddenVector_{1} &= \basisFunction\left(\mappingMatrix_1 \inputVector\right)\\
\hiddenVector_{2} &= \basisFunction\left(\mappingMatrix_2\hiddenVector_{1}\right)\\
\hiddenVector_{3} &= \basisFunction\left(\mappingMatrix_3 \hiddenVector_{2}\right)\\
\dataVector &= \mappingVector_4 ^\top\hiddenVector_{3}
\end{align}
$$
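In code, the composition above is just alternating affine transformations and nonlinearities. A minimal numpy sketch, with arbitrary small layer widths and tanh standing in for the activation $\basisFunction(\cdot)$:

```python
import numpy as np

rng = np.random.default_rng(0)

def phi(z):
    """Activation function applied elementwise at each layer."""
    return np.tanh(z)

# Weight matrices W1..W3 and output weights w4, with made-up widths.
x = rng.standard_normal(4)            # input vector
W1 = rng.standard_normal((5, 4))
W2 = rng.standard_normal((5, 5))
W3 = rng.standard_normal((5, 5))
w4 = rng.standard_normal(5)

h1 = phi(W1 @ x)                      # first hidden layer
h2 = phi(W2 @ h1)                     # second hidden layer
h3 = phi(W3 @ h2)                     # third hidden layer
y = w4 @ h3                           # linear output layer (scalar)

print(float(y))
```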
Overfitting
One potential problem is that as the number of nodes in two adjacent layers increases, the number of parameters in the affine transformation between layers, $\mappingMatrix$, increases. If there are $k_{i-1}$ nodes in one layer and $k_{i}$ nodes in the following, then that matrix contains $k_{i} k_{i-1}$ parameters. When we have layer widths in the 1000s, that leads to millions of parameters.
One proposed solution is known as dropout where only a subset of the neural network is trained at each iteration. An alternative solution would be to reparameterize $\mappingMatrix$ with its singular value decomposition.
$$
\mappingMatrix = \eigenvectorMatrix\eigenvalueMatrix\eigenvectwoMatrix^\top
$$
or
$$
\mappingMatrix = \eigenvectorMatrix\eigenvectwoMatrix^\top
$$
where if $\mappingMatrix \in \Re^{k_1\times k_2}$ then $\eigenvectorMatrix\in \Re^{k_1\times q}$ and $\eigenvectwoMatrix \in \Re^{k_2\times q}$, i.e. we have a low rank matrix factorization for the weights.
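A quick numpy check of the saving: with layer widths of 1000 and rank $q=50$, the factorized form stores an order of magnitude fewer parameters, and the resulting weight matrix has rank at most $q$. (The specific widths and rank here are illustrative.)

```python
import numpy as np

k1, k2, q = 1000, 1000, 50

full_params = k1 * k2              # parameters in the full matrix W
low_rank_params = (k1 + k2) * q    # parameters in U and V combined

rng = np.random.default_rng(0)
U = rng.standard_normal((k1, q))
V = rng.standard_normal((k2, q))
W = U @ V.T                        # low-rank reconstruction, rank <= q

print(full_params, low_rank_params)
print(np.linalg.matrix_rank(W))
```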
Including the low rank decomposition of $\mappingMatrix$ in the neural network, we obtain a new mathematical form. Effectively, we are adding additional latent layers, $\latentVector$, in between each of the existing hidden layers. In a neural network these are sometimes known as bottleneck layers. The network can now be written mathematically as
$$
\begin{align}
\latentVector_{1} &= \eigenvectwoMatrix^\top_1 \inputVector\\
\hiddenVector_{1} &= \basisFunction\left(\eigenvectorMatrix_1 \latentVector_{1}\right)\\
\latentVector_{2} &= \eigenvectwoMatrix^\top_2 \hiddenVector_{1}\\
\hiddenVector_{2} &= \basisFunction\left(\eigenvectorMatrix_2 \latentVector_{2}\right)\\
\latentVector_{3} &= \eigenvectwoMatrix^\top_3 \hiddenVector_{2}\\
\hiddenVector_{3} &= \basisFunction\left(\eigenvectorMatrix_3 \latentVector_{3}\right)\\
\dataVector &= \mappingVector_4^\top\hiddenVector_{3}.
\end{align}
$$
$$
\begin{align}
\latentVector_{1} &= \eigenvectwoMatrix^\top_1 \inputVector\\
\latentVector_{2} &= \eigenvectwoMatrix^\top_2 \basisFunction\left(\eigenvectorMatrix_1 \latentVector_{1}\right)\\
\latentVector_{3} &= \eigenvectwoMatrix^\top_3 \basisFunction\left(\eigenvectorMatrix_2 \latentVector_{2}\right)\\
\dataVector &= \mappingVector_4 ^\top \latentVector_{3}
\end{align}
$$
Now we can replace each of these stages with a Gaussian process. This is equivalent to taking the limit as the width of each layer goes to infinity, while appropriately scaling down the outputs.
$$
\begin{align}
\latentVector_{1} &= \mappingFunctionVector_1\left(\inputVector\right)\\
\latentVector_{2} &= \mappingFunctionVector_2\left(\latentVector_{1}\right)\\
\latentVector_{3} &= \mappingFunctionVector_3\left(\latentVector_{2}\right)\\
\dataVector &= \mappingFunctionVector_4\left(\latentVector_{3}\right)
\end{align}
$$
Stochastic Process Composition
$$\dataVector = \mappingFunctionVector_4\left(\mappingFunctionVector_3\left(\mappingFunctionVector_2\left(\mappingFunctionVector_1\left(\inputVector\right)\right)\right)\right)$$
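A minimal sketch of this composition as a prior: draw each function $\mappingFunctionVector_l$ as a sample from a Gaussian process with an exponentiated quadratic covariance, feeding each layer’s output in as the next layer’s input. The lengthscale and the number of input points are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

def rbf_kernel(x, lengthscale=1.0):
    """Exponentiated quadratic covariance between all pairs of inputs."""
    d = x[:, None] - x[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def gp_sample(x):
    """Draw one function sample from a zero-mean GP evaluated at x."""
    K = rbf_kernel(x) + 1e-8 * np.eye(len(x))  # jitter for stability
    return rng.multivariate_normal(np.zeros(len(x)), K)

x = np.linspace(-2, 2, 100)
layer = x
for _ in range(4):       # compose f1, f2, f3, f4
    layer = gp_sample(layer)

print(layer.shape)
```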
import numpy as np
import matplotlib.pyplot as plt
import pods
# 'plot' and 'mlai' below refer to the mlai teaching utilities used
# throughout these notes.

data = pods.datasets.mcycle()
x = data['X']
y = data['Y']
scale = np.sqrt(y.var())
offset = y.mean()
yhat = (y - offset)/scale

fig, ax = plt.subplots(figsize=plot.big_wide_figsize)
_ = ax.plot(x, y, 'r.', markersize=10)
_ = ax.set_xlabel('time', fontsize=20)
_ = ax.set_ylabel('acceleration', fontsize=20)
xlim = (-20, 80)
ylim = (-175, 125)
ax.set_xlim(xlim)
ax.set_ylim(ylim)
mlai.write_figure(filename='../slides/diagrams/datasets/motorcyclehelmet.svg',
                  transparent=True, frameon=True)
Motorcycle Helmet Data
import GPy

m_full = GPy.models.GPRegression(x, yhat)
_ = m_full.optimize() # Optimize parameters of covariance function
Motorcycle Helmet Data GP
The deep Gaussian process code we are using is research code by Andreas Damianou.
To extend the research code we introduce some approaches to initialization and optimization that we’ll use in examples. These approaches can be found in the deepgp_tutorial.py file.
Deep Gaussian process models also can require some thought in the initialization. Here we choose to start by setting the noise variance to be one percent of the data variance.
Secondly, we introduce a staged optimization approach.
Optimization requires moving variational parameters in the hidden layer, representing the mean and variance of the expected values in that layer. Since all those values can be scaled up, resulting only in a downscaling of the output of the first GP and a downscaling of the input length scale to the second GP, it makes sense first of all to fix the scales of the covariance function in each of the GPs.
Sometimes, deep Gaussian processes can find a local minimum which involves increasing the noise level of one or more of the GPs. This often occurs because it allows a minimum in the KL divergence term in the lower bound on the likelihood. To avoid this minimum, we habitually train with the likelihood variance (the noise on the output of the GP) fixed to some lower value for some iterations.
Next an optimization of the kernel function parameters at each layer is performed, but with the variance of the likelihood fixed. Again, this is to prevent the model minimizing the Kullback-Leibler divergence between the approximate posterior and the prior before achieving a good data fit.
Finally, all parameters of the model are optimized together.
The next code is for visualizing the intermediate layers of the deep model. This visualization is only appropriate for models with intermediate layers containing a single latent variable.
The pinball visualization brings the pinball analogy to life in the model. It shows how a ball would fall through the model to end up in the right position. This visualization is only appropriate for models with intermediate layers containing a single latent variable.
The posterior_sample code allows us to see the output sample locations for a given input. This is useful for visualizing the non-Gaussian nature of the output density.
Finally, we bind these methods to the DeepGP object for ease of calling.
deepgp.DeepGP.initialize=initialize
deepgp.DeepGP.staged_optimize=staged_optimize
deepgp.DeepGP.posterior_sample = posterior_sample
deepgp.DeepGP.visualize=visualize
deepgp.DeepGP.visualize_pinball=visualize_pinball
layers = [y.shape[1], 1, x.shape[1]]
inits = ['PCA']*(len(layers)-1)
kernels = []
for i in layers[1:]:
    kernels += [GPy.kern.RBF(i)]
m = deepgp.DeepGP(layers, Y=yhat, X=x,
                  inits=inits,
                  kernels=kernels, # the kernels for each layer
                  num_inducing=20, back_constraint=False)
m.initialize()
fig, ax = plt.subplots(figsize=plot.big_wide_figsize)
plot.model_output(m, scale=scale, offset=offset, ax=ax, xlabel='time', ylabel='acceleration/$g$', fontsize=20, portion=0.5)
ax.set_ylim(ylim)
ax.set_xlim(xlim)
mlai.write_figure(filename='../slides/diagrams/deepgp/motorcyclehelmetdeepgp.svg',
                  transparent=True, frameon=True)
Motorcycle Helmet Data Deep GP
fig, ax = plt.subplots(figsize=plot.big_wide_figsize)
plot.model_sample(m, scale=scale, offset=offset, samps=10, ax=ax, xlabel='time', ylabel='acceleration/$g$', portion=0.5)
ax.set_ylim(ylim)
ax.set_xlim(xlim)
mlai.write_figure(figure=fig, filename='../slides/diagrams/deepgp/motorcyclehelmetdeepgpsamples.svg',
                  transparent=True, frameon=True)
Motorcycle Helmet Data Deep GP
m.visualize(xlim=xlim, ylim=ylim, scale=scale, offset=offset,
            xlabel="time", ylabel="acceleration/$g$", portion=0.5,
            dataset='motorcyclehelmet',
            diagrams='../slides/diagrams/deepgp')
Motorcycle Helmet Data Latent 1
Motorcycle Helmet Data Latent 2
fig, ax = plt.subplots(figsize=plot.big_wide_figsize)
m.visualize_pinball(ax=ax, xlabel='time', ylabel='acceleration/g',
                    points=50, scale=scale, offset=offset, portion=0.1)
mlai.write_figure(figure=fig, filename='../slides/diagrams/deepgp/motorcyclehelmetdeepgppinball.svg',
                  transparent=True, frameon=True)
Motorcycle Helmet Pinball Plot
Graphical Models
One way of representing a joint distribution is to consider conditional dependencies between data. Conditional dependencies allow us to factorize the distribution. For example, a Markov chain is a factorization of a distribution into components that represent the conditional relationships between points that are neighboring, often in time or space. It can be decomposed in the following form.
$$p(\dataVector) = p(\dataScalar_\numData | \dataScalar_{\numData-1}) p(\dataScalar_{\numData-1}|\dataScalar_{\numData-2}) \dots p(\dataScalar_{2} | \dataScalar_{1}) p(\dataScalar_1)$$
By specifying conditional independencies we can reduce the parameterization required for our data: instead of directly specifying the parameters of the joint distribution, we can specify the parameters of each conditional independently. This can also give an advantage in terms of interpretability. Understanding a conditional independence structure gives a structured understanding of data. If developed correctly, according to causal methodology, it can even inform how we should intervene in the system to drive a desired result (Pearl 1995).
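As a small numerical illustration of the parameter saving (my own sketch in plain NumPy, not code from the talk): a binary first-order Markov chain over $n$ variables needs only $1 + 2(n-1)$ free parameters rather than the $2^n - 1$ of the unrestricted joint, and the factorized product still defines a valid distribution.

```python
import numpy as np

# A binary first-order Markov chain over n variables. Instead of the
# 2**n - 1 parameters of the full joint distribution, the factorized
# form needs one initial distribution plus n-1 transition tables:
# 1 + 2*(n-1) free parameters.
n = 5
rng = np.random.default_rng(0)

p_init = np.array([0.6, 0.4])                        # p(y_1)
# p(y_t | y_{t-1}) for each step; each row sums to one
transitions = rng.dirichlet(np.ones(2), size=(n - 1, 2))

def joint_prob(states):
    """p(y) = p(y_1) * prod_t p(y_t | y_{t-1})."""
    prob = p_init[states[0]]
    for t in range(1, len(states)):
        prob *= transitions[t - 1, states[t - 1], states[t]]
    return prob

# The factorized joint still normalizes over all 2**n configurations.
total = sum(joint_prob(s) for s in np.ndindex(*([2] * n)))
```

Because each conditional is locally normalized, the product over the chain normalizes automatically, which is exactly what makes the factorized parameterization so much cheaper to specify.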
However, a challenge arises when the data becomes more complex. Consider the graphical model shown below, used to predict the perioperative risk of Clostridium difficile infection following colon surgery (Steele et al. 2012).
To capture the complexity in the interrelationships between the data, the graph itself becomes more complex, and less interpretable.
Conclusion
We operate in a technologically evolving environment. Machine learning is becoming a key component in our decision-making capabilities, our intelligence and strategic command. However, technology has always driven changes in strategy: from the stalemate of the First World War, to the tank-dominated Blitzkrieg of the Second, to the asymmetric warfare of the present, our technology, tactics and strategies are constantly evolving. Machine learning is part of that evolution, but the main challenge is not to become so fixated on the tactics of today that we miss the evolution of strategy that the technology is suggesting.
Data-oriented programming offers a set of development methodologies which ensure that the system designer considers what decisions are required, how they will be made, and, critically, declares this within the system architecture.
This allows for monitoring of data quality, fairness and model accuracy, and opens the door to a more sophisticated form of AutoML, where full redeployments of models are considered while analyzing the information dynamics of a complex automated decision-making system.
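A minimal sketch of what such a declaration might look like in code (the `decision_node` decorator and its registry are hypothetical, not an existing library): each decision function is registered with the architecture, so its inputs, outputs and latency are logged and can be monitored centrally.

```python
import statistics
import time

# Central registry of decision-making components; monitoring tools
# inspect this rather than the components themselves.
monitored = {}

def decision_node(name):
    """Declare a function as a decision-making component of the system,
    logging every call so data quality and accuracy can be tracked."""
    def wrap(fn):
        log = monitored.setdefault(name, [])
        def inner(x):
            start = time.perf_counter()
            out = fn(x)
            log.append({'input': x, 'output': out,
                        'latency': time.perf_counter() - start})
            return out
        return inner
    return wrap

@decision_node('price_model')
def predict_price(x):
    # stand-in for a deployed model's prediction
    return 2.0 * x + 1.0

for x in [1.0, 2.0, 3.0]:
    predict_price(x)

# Downstream monitoring can now compute statistics over the log.
mean_output = statistics.mean(r['output'] for r in monitored['price_model'])
```

The point of the design is that the declaration lives in the architecture: the monitoring code never needs to know the internals of `predict_price`, only that it was declared as a decision node.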
Related Papers
Deep Gaussian Processes Damianou and Lawrence (2013)
Latent Force Models Álvarez, Luengo, and Lawrence (2013)
Gaussian Process Latent Force Models for Learning and Stochastic Control of Physical Systems Särkkä, Álvarez, and Lawrence (2018)
The Emergence of Organizing Structure in Conceptual Representation Lake, Lawrence, and Tenenbaum (2018)
Others’ Work
 How Deep Are Deep Gaussian Processes? Dunlop et al. (n.d.)
 Doubly Stochastic Variational Inference for Deep Gaussian Processes Salimbeni and Deisenroth (2017)
 Deep Multi-task Gaussian Processes for Survival Analysis with Competing Risks Alaa and van der Schaar (2017)
 Counterfactual Gaussian Processes for Reliable Decision-making and What-if Reasoning Schulam and Saria (2017)
Conclusions and Directions
We’ve introduced some of the challenges of real-world systems and outlined how to address them. The new ideas we are focussing on extend the field of uncertainty quantification and surrogate modelling to four different areas.
Automated Abstraction is the automated deployment of surrogate models, or emulators, for summarizing the underlying components in the system. It relies on data-oriented architectures to be possible.
Deep emulation is the combination of chains of different emulators across the system to assess downstream performance.
Bayesian System Optimization is the resulting optimization of the entire system, end-to-end, in a manner that doesn’t destroy interpretability, because end-to-end signals are propagated down to the system components through the deep emulator.
Auto AI is the result: moving beyond AutoML, we will be able to develop systems that identify problems in deployment and assess the appropriate system responses.
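A toy sketch of the deep emulation idea under simplifying assumptions (the two `emulator_*` functions here are hypothetical stand-ins for trained surrogates, each returning a predictive mean and variance): uncertainty from the first component is propagated into the second by sampling, giving an end-to-end predictive distribution while each component remains individually inspectable.

```python
import numpy as np

rng = np.random.default_rng(1)

def emulator_a(x):
    # Surrogate for component A: predictive mean and variance of its output.
    return np.sin(x), 0.01

def emulator_b(z):
    # Surrogate for component B, which consumes A's output.
    return z ** 2, 0.02

def deep_emulate(x, num_samples=2000):
    """Chain the two emulators, propagating A's predictive uncertainty
    into B by Monte Carlo sampling."""
    mean_a, var_a = emulator_a(x)
    z = rng.normal(mean_a, np.sqrt(var_a), size=num_samples)
    mean_b, var_b = emulator_b(z)
    samples = rng.normal(mean_b, np.sqrt(var_b))
    return samples.mean(), samples.var()

mean, var = deep_emulate(np.pi / 2)
```

Because the chain is built from per-component emulators, an end-to-end objective can still be attributed back to individual components, which is what preserves interpretability during system-wide optimization.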
References
Alaa, Ahmed M., and Mihaela van der Schaar. 2017. “Deep Multi-Task Gaussian Processes for Survival Analysis with Competing Risks.” In Advances in Neural Information Processing Systems 30, edited by I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, 2326–34. Curran Associates, Inc. http://papers.nips.cc/paper/6827-deep-multi-task-gaussian-processes-for-survival-analysis-with-competing-risks.pdf.
Álvarez, Mauricio A., David Luengo, and Neil D. Lawrence. 2013. “Linear Latent Force Models Using Gaussian Processes.” IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (11): 2693–2705. https://doi.org/10.1109/TPAMI.2013.86.
Damianou, Andreas, Carl Henrik Ek, Michalis K. Titsias, and Neil D. Lawrence. n.d. “Manifold Relevance Determination.” In.
Damianou, Andreas, and Neil D. Lawrence. 2013. “Deep Gaussian Processes.” In, 31:207–15.
Dunlop, Matthew M., Mark A. Girolami, Andrew M. Stuart, and Aretha L. Teckentrup. n.d. “How Deep Are Deep Gaussian Processes?” Journal of Machine Learning Research 19 (54): 1–46. http://jmlr.org/papers/v19/18-015.html.
Goodfellow, Ian, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. “Generative Adversarial Nets.” In Advances in Neural Information Processing Systems 27, edited by Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, 2672–80. Curran Associates, Inc.
Lake, Brenden M., Neil D. Lawrence, and Joshua B. Tenenbaum. 2018. “The Emergence of Organizing Structure in Conceptual Representation.” Cognitive Science 42 Suppl 3: 809–32. https://doi.org/10.1111/cogs.12580.
Pearl, Judea. 1995. “From Bayesian Networks to Causal Networks.” In Probabilistic Reasoning and Bayesian Belief Networks, edited by A. Gammerman, 1–31. Alfred Waller.
Perdikaris, Paris, Maziar Raissi, Andreas Damianou, Neil D. Lawrence, and George Em Karniadakis. 2017. “Nonlinear Information Fusion Algorithms for Data-Efficient Multi-Fidelity Modelling.” Proc. R. Soc. A 473 (20160751). https://doi.org/10.1098/rspa.2016.0751.
Salimbeni, Hugh, and Marc Deisenroth. 2017. “Doubly Stochastic Variational Inference for Deep Gaussian Processes.” In Advances in Neural Information Processing Systems 30, edited by I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, 4591–4602. Curran Associates, Inc. http://papers.nips.cc/paper/7045-doubly-stochastic-variational-inference-for-deep-gaussian-processes.pdf.
Särkkä, Simo, Mauricio A. Álvarez, and Neil D. Lawrence. 2018. “Gaussian Process Latent Force Models for Learning and Stochastic Control of Physical Systems.” IEEE Transactions on Automatic Control. https://doi.org/10.1109/TAC.2018.2874749.
Schulam, Peter, and Suchi Saria. 2017. “Counterfactual Gaussian Processes for Reliable Decision-Making and What-If Reasoning.” In Advances in Neural Information Processing Systems 30, edited by I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, 1696–1706. Curran Associates, Inc. http://papers.nips.cc/paper/6767-counterfactual-gaussian-processes-for-reliable-decision-making-and-what-if-reasoning.pdf.
Steele, S, A Bilchik, J Eberhardt, P Kalina, A Nissan, E Johnson, I Avital, and A Stojadinovic. 2012. “Using Machine-Learned Bayesian Belief Networks to Predict Perioperative Risk of Clostridium Difficile Infection Following Colon Surgery.” Interact J Med Res 1 (2): e6. https://doi.org/10.2196/ijmr.2131.
Taigman, Yaniv, Ming Yang, Marc’Aurelio Ranzato, and Lior Wolf. 2014. “DeepFace: Closing the Gap to HumanLevel Performance in Face Verification.” In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. https://doi.org/10.1109/CVPR.2014.220.