Don't Panic: Deep Learning will be Mostly Harmless

This is a blog post summarizing topics covered in a talk at the Dagstuhl workshop on “New Directions in Kernels and Gaussian Processes”.

Schloss Dagstuhl

Thoughts on New Directions in Kernels and Gaussian Processes

The field of machine learning is entering an interesting era, where due to success we are seeing a large expansion in our leading conference. Typical NIPS submissions have been increasing at around 10% a year, but attendances have started increasing at 60% a year, this year leading to 5,700 total attendees, up from under 800 in 2012.

The cause is the explosion of interest in deep learning, mainly triggered by the breakthrough paper of Krizhevsky, Sutskever and Hinton.

As Raquel said in the previous talk, the success of deep learning is not merely down to good one off performance. She listed a number of characteristics of the field including ease of access to software which allows construction of models in a modular way, with a simple interface.

However, the situation is also reminiscent of Claude Shannon’s 1956 paper “The Bandwagon”, in which he warned that Information Theory was not a global panacea and that expectations were too great in terms of what it could deliver. If such a charge could be leveled at information theory, then we would do well to bear it in mind in the context of deep learning. A term that once meant hierarchical (often probabilistic) models but now seems to have become a synonym for neural networks.

Back in 2008 I recall an invited talk Bernhard was giving at ICML, one which focussed on the changes we had seen in the field since the mid to late 1990s. Bernhard had remarked how much more mathematical knowledge was needed in 2008 than had been needed when he started in the field. I asked him a question: In the 1990s the only mathematical qualification to entering the field seemed to have been knowledge of the chain rule, but now the bar was much higher, was that presenting a barrier to entry to the field?

Now it seems you no longer even need the chain rule. With automatic differentiation facilitating the modular construction of neural network systems, the ease with which deep learning solutions can be trained and deployed indicates we are entering a new era of widespread use of powerful machine learning models.

One way of characterizing the change that Raquel described would be to think of deep learning as having developed into a field of engineering. By which I mean engineering as a profession. Importantly, a profession has an obligation to deliver results, not through innovating a process, but through delivering according to standards. Characteristics of engineering as a profession include the adherence to a certain set of standards about how methodologies are deployed. They don’t necessarily specify that the professional has a wider understanding of the field. They require that the professional has knowledge of the deployment of a particular tool or technique according to a set of standards upheld by the profession.

A professional engineer designs and builds to standards to ensure that their bridge doesn’t collapse, their plane flies and their buildings resist earthquakes or fire. Departure from standards is dangerous in that it exposes the engineer to unforeseen consequences which can lead to catastrophic failures.

These changes are to be welcomed, but in acknowledging them we should also consider what our role as machine learning researchers, or perhaps machine learning scientists should be.

Theory shines a torch light on the path ahead.

Against the background of practioners’ success sits a need for theory. This morning I went for a run around the Dagstuhl $n^2$ route. At 7:30 in the morning, it was still dark. A dim light of dawn ahead could be seen. But, in heading for that light it would have been easy for me to trip over on unseen branches or ruts in the path. Fortunately, I was wearing a head torch. The torch shines a narrow beam where I look allowing me to head down the path to wards the distant light. The head torch provides a similar role to that which theory provides for our research work. As practitioners we wish to reach the goal of the distant light ahead, but to avoid falling flat on our faces we need the torch of theory to illuminate our way.

A current danger is that the toolset of the engineer is being made widely available but the theoretical guarantees and the evolution of the right processes are not yet being deployed. So while deep learning has the appearance of an engineering profession it is missing some of the theoretical checks and practitioners run the risk of falling flat upon their faces.

Expertise Dilution

A further challenge is a common curse of success. When a field develops a successful methodology, the wider context of associated methodologies in that field are often forgotten. Hal Laning was the engineer behind the world’s first real time computer. He was a key contributor to the navigation systems that allowed Apollo to land on the moon. In 1956, along with Richard Battin, he wrote a book called “Random Processes in Automatic Control”. Curious as to what it was about, I got this book out of the University of Sheffield library. The book is a wealth of information and ideas based around Gaussian processes and differential equations and how to deploy them in continuous time control. There are concepts in there that are extremely closely related to work I’ve looked at with collaborators on latent force models. So why wasn’t this ground breaking work more widely known about? By 1960 the Kalman filter had been developed. This was a discrete time system suitable for deployment on the digital computer developed for the moon landings. Such discrete time systems, and also state space models, came to dominate the landscape and the work on continous time systems from Laning seems to have been forgotten. Perhaps the success of the Kalman filter outshone the important foundational work that Laning and his contemporaries (who included Wiener and Zadeh) had done in these continuous time systems.

A year-on-year 10% growth in NIPS attendance from 2012 would have led to a conference size of about 1,200 in 2016. Given this year’s likely attendance of over 5,700, we might conclude that 4,500 NIPS attendees will have come on the back of the success of deep learning, that’s about 80% of the conference. It is easy to see how this massive influx of researchers into the field can easily dilute the broader base of knowledge that we used to have. So what do we do to either preserve or deploy this knowledge?

Statistics is already familiar with the vital role that a profession plays in regulating claims. In the pharmaceutical industry a statistical trial is the standard by which a new drug is validated. Traditionally, in machine learning, we have paid less attention to the validation and explanation of our models because our models seemed less consequential. A prediction of a model to rerank a search result or our social media news feed. However, these methods, while each making a decision of perhaps little consequence, are now being applied to millions or billions of people. The sum of many inconsequential decisions could certainly end up being highly consequential. Whatever the effect of filter bubbles and fake news on recent elections, the very fact that such algorithms can have such widespread societal effects is of significant importance.

Back in 2005 we ran the Gaussian Process Round Table at the University of Sheffield. In those days you could get all of the NIPS Gaussian process community into a small room: there were about 16 of us at the meeting. Our focus was on many of the questions that persist today: how to scale Gaussian processes, how to make use of non-Gaussian likelihoods. But there was also a sense that kernel methods were dominating the community to the detriment of Gaussian process models. Up until that meeting, I think my own approach had been to try and compete with kernel methods, such as the support vector machine, in applications where they were performing well. However, in retrospect most of the success of Gaussian process models has been in domains where their characteristics are important. In particular the ability to represent, in a computationally tractable form, the uncertainty associated with their function representations. Successful approaches such as Bayesian optimization and our own work based on Gaussian process latent variable models, are almost entirely reliant on this uncertainty.

So one potential lesson that can be drawn on new directions is to focus on the strengths of the methods we wish to deploy, rather than attempting to compete directly with deep learning models.

Future Challenges

Looking forward to the future of machine learning, I sometimes talk about challenges in data science. I categorize these as 1) Paradoxes of the data society 2) Privacy, fairness, transparency and accountabililty, 3) Quantifying the value associated with the collection and curation of data.

The paradoxes of data society arise with the phenomenon that, as we measure society more, we don’t seem to be increasing our understanding of it. In particular, in relation to election polls, or challenges with characterizing the population better for clinical trials. Classical statistics focusses on large $n$ and small $p$ problems. In computational biology a lot of attention has been paid to large $p$ and small $n$ problems, but both of these ideas presume the existance of a table of data, much of which is filled with values. For many data situations the truth is somewhat different. We can characterize the situation much better as massively missing data where most of what we want to know about almost all things is missing at any time. Medical data presents an excellent case in point where new tests can be ordered, or even devised, for any given patient at any given time. This suggests a need for non parametric techniques or at least techniques that can handle large scale marginalizations of data, such as Gaussian processes can. A further challenge is $n\approx p$ challenges but where both are large.

Privacy, fairness, transparency and accountability should be easier for kernel methods and Gaussian processes as they are easier to characterize mathematically and therefore understand from the theoretical standpoint.

Neural network models are easy to understand algorithmically, but difficult to understand mathematically. Conversely kernel methods and Gaussian processes are harder to understand algorithmically but much easier to understand mathematically. This may allow us to wield our theory-torch deeper into their workings, to augment the models more straightforwardly with privacy guarantees, or to account for their decision making. Further, kernel methods and Gaussian processes can be deployed in tandem with neural network models in an effort to validate, augment, or explain their operation.

This happy combination of engineering delivery with the potential for scientifically validated guarantees from methods that are mathematically easier to study is a very powerful idea. We have seen examples already today from Arthur Gretton’s talk in describing the test of relative similarity for generative models.

Further, if the process of deploying models is becoming straightforward, there are also significant possiblities to automate that process itself. Initiatives like AutoML (Frank Hutter will present tomorrow), techniques such as Bayesian optimization. These ideas show the power of meta-learning as an idea. Learning about learning. In these domains methods that quantify uncertainty are very important.

The third challenge is accounting for the value of acquiring and curating data. Making it more widely available. This is currently an open area for study. How do we account for the value that is generated by preparing good quality data sets. It may well be that this becomes the bottleneck to delivering machine learning methods across a very wide range of domains.

Physical Systems

As well as these three challenges of data science, a further direction of interest is reaching out closer to physical models of systems. Whether it’s data efficient reinforcement learning such as Marc and Carl’s Pilco algorithm: a model based approach to effiicent reinforcement learning based on Gaussian processes, or attempts to incorporate more physical understanding within Gaussian processes, or the wider areas of Uncertainty Quantification and the related field of Probabilistic Numerics. Each of these fields represents a wide domain of potential application in which mathematical understanding and the quantification of uncertainty can be critical in the successful resolution of the challenge.

Summary

In the talk we’ve looked at new directions in machine learning science that are inspired by particular observations. Deep learning is becoming more engineering-like in its deployment and the expertise required to succesfully deploy a deep learning model is diminishing. Simultaneously the wider understanding of machine learning across the field is becoming diluted by a large influx of new researchers focussed on the (often engineering) challenges of deep learning. A significant opportunity exists in the area of meta-learning and validation of these deep learning models to automate the production of these methods and ensure that users don’t succumb to errors in the deployment.

Remaining challenges in machine learning include work in privacy, accountability and fairness where the mathematical elegance of kernels and Gaussian processes may provide more leeway for progress. Similarly, challenges in the nature of modern data that arise from massively missing data or “when $p \approx n$” also require properties such as non parametrics to deal with the challenge.

On open and unexplored domain is that of data-economics, how we reward those that provide value by curating and preparing data sets for model consumption. This is (perhaps) the most difficult stage of machine learning to automate and one of the most labour intensive. How can kernel methods and Gaussian processes help?

Finally, there is great promise at the interface of physical models and data driven models. We should be able to improve computational and data efficiency in these models by developing more hybrids. There are already promising developments at the interface of surrogate models and ABC but more theory and methodological ideas are required.

There is an exciting future for both kernel methods and Gaussian processes in the domain of machine learning, as The Hitchiker’s Guide to the Galaxy” states “Don’t Panic” by bringing our mathematical tools to bear on the new wave of deep learning methods we can ensure that they remain “mostly harmless”.