Questions on Deep Gaussian Processes

I was recently contacted by Chris Edwards, he’s putting together an article for Communications of the ACM on Deep Learning and had a few questions on deep Gaussian processes. He kindly agreed to let me use his questions and my answers in a blog post.

1) Are there applications that suit Gaussian processes well? Would they typically replace the neural network layers in a deep learning system or would they possibly be mixed and matched with neural layers, perhaps as preprocessors or using the neural layers for stuff like feature extraction (assuming that training algorithms allow for this)?

Yes, I think there are applications that suit Gaussian processes very well. In particular applications where data is scarce (this doesn’t necessarily mean small data sets, but when data is scarce relative to the complexity of the system being modeled). In these scenarios, handling uncertainty in the model appropriately becomes very important. Two examples which have exploited this characteristic in practice are GaussianFace by Lu & Tang, and Bayesian optimization (e.g. Snoek, Larochelle and Adams). Almost all my own group’s work also exploits this characteristic. A further manifestation of this effect is what I call “massively missing data”. Although we are getting a lot of data at the moment, when you think about it you realise that almost all the things we would like to know are still missing almost all of the time. Deep models have performed well in situations where data sets are very well characterised and labeled. However, one of the domains that inspires me is clinical data where this isn’t the case. In clinical data most people haven’t had most clinical tests applied to them most of the time. Also, the nature of clinical tests evolve (as do the diseases that affect patients). This is an example of massively missing data. I think Gaussian processes provide a very promising approach to handling this data.

With regard to whether they are a replacement for deep neural networks, I think in the end they may well be mixed and matched. From a Gaussian process perspective the neural network layers could be seen as a type of ‘mean function’ (a Gaussian process is defined by its mean function and its covariance function). So they can be seen as part of the deep GP framework: deep Gaussian processes enhance the toolkit available. So there is no conceptual reason why they shouldn’t be mixed and matched. I think you’re quite right that it might be that the low level feature extraction is still done by parametric models like neural networks, but it’s certainly important that we use the right techniques in the right domains and being able to interchange ideas enables that.

2) Are there training algorithms that allow Gaussian processes to be used today for deep-learning type applications or is this where work needs to be done?

There are algorithms, yes, we have three different approaches right now and its also clear that work in doubly stochastic variational inference (see for example Kingma and Welling or Rezende, Mohamed and Wierstra) could also be applicable. But more work still needs to be done. In particular, a lot of the success of deep learning has been down to the engineering of the system. How to implement these models on GPUs and scale them to billions of data. We’ve been starting to look at this (Dai, Damianou, Hensman and Lawrence) but there’s no doubt we are far behind and it’s a steep learning curve! We also don’t have quite the same computational resource of Facebook, Microsoft and Google!

3) Is the computational load similar to that of deep-learning neural networks or are the applications sufficiently different that a comparison is meaningless?

We carry an additional algorithmic burden, that of propagating uncertainty around the network. This is where the algorithmic problems begin, but is also where we’ve had most of the breakthroughs. Propagating this uncertainty will always come with an additional load for a particular network, but it has particular advantages like dealing with the massively missing data I mentioned above and automatic regularisation of the system. This has allowed us to automatically determine aspects like the number of layers in the network and the number of hidden nodes in each layer. This type of structural learning is very exciting and was one of the original motivations for considering these models. This has enabled us to develop variants of Gaussian processes that can be used for multiview learning (Damianou, Ek, Titsias and Lawrence), we intend to apply these ideas to deep GPs also.

4) I think I saw a suggestion that GPs are reasonably robust when trained with small datasets - do they represent a way in for smaller organisation without bags of data? Is access to data a key problem when dealing with these data science techniques?

I think it’s a very good question, it’s an area we’re particularly interested in addressing. How can we bring data science to smaller organisations? I think it might relates to our ‘open data science’ initiative (see this blog post here). I refer to this idea as ’analysis empowerment’. However, I hadn’t particularly thought deep GPs in this way before, but can I hazard a possible yes to that? Certainly with GaussianFace we saw they could outperform DeepFace (from Facebook) with a small fraction of the data. For us it wasn’t the main motivation for developing deep GPs, but I’d like to think it might be a characteristic of the models. The motivating examples we have are more in the domain of applications that the current generation of supervised deep learning algorithms can’t address: like interconnection of data sets in health. Many of my group’s papers are about interconnecting different views of the patient (genotype, environmental background, clinical data, survival information … with luck even information from social networks and loyalty cards). We approach this through Gaussian process frameworks to ensure that we can build models that will be fully interconnected in application. We call this approach “deep health”. We aren’t there yet, but I feel there’s a lot of evidence so far that we’re working with a class of models that will do the job. My larger concern is the ethical implications of pulling this scale and diversity of information together. I find the idea of a world where we have computer models outperforming humans in predicting their own behavior (perhaps down to the individual) quite disturbing. It seems to me that now the technology is coming within reach, we need to work hard to also address these ethical questions. And it’s important that this debate is informed by people who actually understand the technology.

5) On a more general point that I think can be explored within this feature, are techniques such as Gaussian processes at a disadvantage in computer science because of their heavy mathematical basis? (I’ve had interviews with people like Donald Knuth and Erol Gelenbe in the past where the idea has come up that computer science and maths should, if not merge, interact a lot more).

Yes, and no. It is true that people seem to have some difficulty with the concept of Gaussian processes. But it’s not that the mathematics is more complex than people are using (at the cutting edge) for deep neural networks. Any of the researchers leading the deep revolution could easily turn their hands to Gaussian processes if they chose to do so. Perhaps at ‘entry’ the concepts seem simpler in deep neural networks, but as you peer ‘deeper’ (forgive the pun) into those models it actually becomes a lot harder to understand what’s going on. The leading people (Hinton, Bengio, LeCun, etc) seem to have really good intuitions, but these are not always easy to teach. Certainly when Geoff Hinton explains something to me I always feel I’ve got a very good grasp of it at the time, but later, when I try and explain the same concept to someone else, I find I can’t always do it (i.e., he’s got better intuitions than me, and he’s better at explaining than I am). There may be similar issues for explaining deep GPs, but my hope is that once the conceptual hurdle of a GP is surmounted, the resulting models are much easier to analyze. Such analysis should also feed back into the wider deep learning community. I’m pleased that this is already starting to happen (see Duvenaud, Rippel, Adams and Ghahramani). Gaussian processes also generalise many different approaches to learning and signal processing (including neural networks), so understanding Gaussian processes well gives you an ‘in’ for many different areas. I agree, though, that the perception in the wider community matches your analysis. This is a major reason for the program of summer schools we’ve developed in Gaussian Processes. So far we’ve taught over 200 students, and we have two further schools planned for 2015 with a developing program for 2016. We’ve made material freely available on line including lectures (on YouTube) and lab notes. So I hope we are doing something to address the perception that these models are harder mathematically!

I totally agree on the Maths/CS interface. It is, however, slightly frustrating (and perhaps inevitable) how much different academic disciplines become dominated by a particular culture of research. This can create barriers, particularly when it comes to formal publication (e.g. in the ‘leading’ journals). My group’s been working very hard over the last decade to combat this through organization of workshops and summer schools that bridge the domains. It always seems to me that meeting people face to face helps us gain a shared understanding. For example, a lot of confusion can be generated by the slightly different ways we use technical terminology, it leads to a surprising number of misunderstandings that do take time to work through. However, through these meetings I’ve learned an enormous amount, particularly from the statistics community. Unfortunately, formal outlets and funding for this interface are still surprisingly difficult to find. This is not helped by the fact that the traditional professional societies don’t necessarily bridge the intellectual ground and sometimes engage in their own fights for territory. These cultural barriers also spill over into organization of funding. For example, in the UK it’s rare that my grant proposals are refereed by colleagues from Maths/Stats community or that their grant proposals are refereed by me. They actually go two totally separate parts of the relevant UK funding body. As a result both sets of proposals can be lost in the wider Maths and CS communities, which is not always conducive to expanding the interface. In the UK I’m hoping that the recent founding of the Alan Turing Institute will cause a bit of a shake up in this area, and that some of these artificial barriers will fall away. But in summary, I totally agree with the point, but also recognize that on both sides of the divide we have created communities which can make collaboration harder.