Questions on Deep Gaussian Processes

I was recently contacted by Chris Edwards, who is putting together an article for Communications of the ACM on deep learning and had a few questions on deep Gaussian processes. He kindly agreed to let me use his questions and my answers in a blog post.
1) Are there applications that suit Gaussian processes well? Would they typically replace the neural network layers in a deep learning system or would they possibly be mixed and matched with neural layers, perhaps as preprocessors or using the neural layers for stuff like feature extraction (assuming that training algorithms allow for this)?
Yes, I think there are applications that suit Gaussian processes very well. In particular, applications where data is scarce (this doesn’t necessarily mean small data sets, but data that is scarce relative to the complexity of the system being modeled). In these scenarios, handling uncertainty in the model appropriately becomes very important. Two examples which have exploited this characteristic in practice are GaussianFace by Lu & Tang, and Bayesian optimization (e.g. Snoek, Larochelle and Adams). Almost all my own group’s work also exploits this characteristic. A further manifestation of this effect is what I call “massively missing data”. Although we are getting a lot of data at the moment, when you think about it you realise that almost all the things we would like to know are still missing almost all of the time. Deep models have performed well in situations where data sets are very well characterised and labeled. However, one of the domains that inspires me is clinical data, where this isn’t the case. In clinical data most people haven’t had most clinical tests applied to them most of the time. Also, the nature of clinical tests evolves (as do the diseases that affect patients). This is an example of massively missing data. I think Gaussian processes provide a very promising approach to handling this data.
With regard to whether they are a replacement for deep neural networks, I think in the end they may well be mixed and matched. From a Gaussian process perspective the neural network layers could be seen as a type of ‘mean function’ (a Gaussian process is defined by its mean function and its covariance function). So they can be seen as part of the deep GP framework: deep Gaussian processes enhance the toolkit available. So there is no conceptual reason why they shouldn’t be mixed and matched. I think you’re quite right that it might be that the low-level feature extraction is still done by parametric models like neural networks, but it’s certainly important that we use the right techniques in the right domains, and being able to interchange ideas enables that.
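To make the ‘mean function’ point concrete, here is a minimal sketch of a GP prior whose mean function is supplied by a small neural network. It uses plain numpy, and all of the parameter names and sizes are made up for illustration; it isn’t code from any of our papers.

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale=1.0, variance=1.0):
    """Exponentiated quadratic (RBF) covariance function."""
    sqdist = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return variance * np.exp(-0.5 * sqdist / lengthscale**2)

def nn_mean(X, W1, b1, w2, b2):
    """A small neural network playing the role of the GP mean function."""
    return np.tanh(X @ W1 + b1) @ w2 + b2

rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 50)[:, None]

# Made-up network parameters, purely for illustration.
W1, b1 = rng.standard_normal((1, 10)), rng.standard_normal(10)
w2, b2 = rng.standard_normal((10, 1)), 0.0

# A GP prior is defined by its mean and covariance functions,
# f ~ GP(m(x), k(x, x')); here the parametric network supplies m(x).
mean = nn_mean(X, W1, b1, w2, b2).flatten()
cov = rbf_kernel(X, X) + 1e-8 * np.eye(len(X))  # jitter for numerical stability
sample = rng.multivariate_normal(mean, cov)     # one draw from the prior
```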
2) Are there training algorithms that allow Gaussian processes to be used today for deep-learning type applications or is this where work needs to be done?
There are algorithms, yes; we have three different approaches right now, and it’s also clear that work in doubly stochastic variational inference (see for example Kingma and Welling or Rezende, Mohamed and Wierstra) could also be applicable. But more work still needs to be done. In particular, a lot of the success of deep learning has been down to the engineering of the system: how to implement these models on GPUs and scale them to billions of data points. We’ve been starting to look at this (Dai, Damianou, Hensman and Lawrence) but there’s no doubt we are far behind and it’s a steep learning curve! We also don’t have quite the same computational resources as Facebook, Microsoft and Google!
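For context, the ‘doubly stochastic’ in that line of work refers to combining two sources of randomness: mini-batch subsampling of the data and Monte Carlo estimation of the variational expectations, typically via the reparameterisation trick. The toy sketch below shows just the reparameterisation gradient on a one-dimensional Gaussian; it is my own simplified illustration, not the inference scheme from any of the papers mentioned.

```python
import numpy as np

rng = np.random.default_rng(1)

def f(z):
    """Toy integrand whose expectation under q(z) we want to minimise."""
    return (z - 2.0) ** 2

def reparam_gradients(mu, log_sigma, n_samples=1000):
    """Monte Carlo gradients of E_q[f(z)] for q = N(mu, sigma^2),
    using the reparameterisation z = mu + sigma * eps, eps ~ N(0, 1)."""
    sigma = np.exp(log_sigma)
    eps = rng.standard_normal(n_samples)
    z = mu + sigma * eps
    df_dz = 2.0 * (z - 2.0)                        # gradient of f at the samples
    grad_mu = np.mean(df_dz)                       # dz/dmu = 1
    grad_log_sigma = np.mean(df_dz * eps * sigma)  # dz/dlog_sigma = eps * sigma
    return grad_mu, grad_log_sigma

# A few steps of stochastic gradient descent on the variational parameters.
mu, log_sigma = 0.0, 0.0
for _ in range(200):
    g_mu, g_ls = reparam_gradients(mu, log_sigma)
    mu -= 0.05 * g_mu
    log_sigma -= 0.05 * g_ls

print(mu, np.exp(log_sigma))  # mu heads towards 2 and sigma shrinks
```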
3) Is the computational load similar to that of deep-learning neural networks or are the applications sufficiently different that a comparison is meaningless?
We carry an additional algorithmic burden, that of propagating uncertainty around the network. This is where the algorithmic problems begin, but it is also where we’ve had most of the breakthroughs. Propagating this uncertainty will always come with an additional load for a particular network, but it has particular advantages, like dealing with the massively missing data I mentioned above and automatic regularisation of the system. This has allowed us to automatically determine aspects like the number of layers in the network and the number of hidden nodes in each layer. This type of structural learning is very exciting and was one of the original motivations for considering these models. It has also enabled us to develop variants of Gaussian processes that can be used for multiview learning (Damianou, Ek, Titsias and Lawrence); we intend to apply these ideas to deep GPs also.
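One simple way to see what ‘propagating uncertainty’ means is to sample: draw functions from the first GP layer and push each draw through the next layer, so that the uncertainty in the hidden layer shows up in the spread of the outputs. The sketch below does exactly that for a two-layer deep GP prior; it is a Monte Carlo illustration of the idea, not the variational scheme we actually use.

```python
import numpy as np

def rbf(X1, X2, lengthscale=1.0, variance=1.0):
    """Exponentiated quadratic covariance function."""
    d = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return variance * np.exp(-0.5 * d / lengthscale**2)

rng = np.random.default_rng(2)
X = np.linspace(-2, 2, 30)[:, None]
jitter = 1e-8 * np.eye(len(X))

# Layer 1: Monte Carlo samples of the hidden function h(x).
K1 = rbf(X, X) + jitter
h_samples = rng.multivariate_normal(np.zeros(len(X)), K1, size=200)

# Layer 2: each hidden sample is pushed through its own draw from the
# second-layer GP, so the uncertainty in h is carried into f = f2(h(x)).
f_samples = []
for h in h_samples:
    H = h[:, None]
    K2 = rbf(H, H, lengthscale=0.5) + jitter
    f_samples.append(rng.multivariate_normal(np.zeros(len(X)), K2))
f_samples = np.array(f_samples)

# The spread of f_samples at each input is the propagated uncertainty.
print(f_samples.std(0))
```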
4) I think I saw a suggestion that GPs are reasonably robust when trained with small datasets – do they represent a way in for smaller organisations without bags of data? Is access to data a key problem when dealing with these data science techniques?
I think it’s a very good question; it’s an area we’re particularly interested in addressing. How can we bring data science to smaller organisations? I think it might relate to our ‘open data science’ initiative (see this blog post here). I refer to this idea as ‘analysis empowerment’. However, I hadn’t particularly thought of deep GPs in this way before, but can I hazard a possible yes to that? Certainly with GaussianFace we saw they could outperform DeepFace (from Facebook) with a small fraction of the data. For us it wasn’t the main motivation for developing deep GPs, but I’d like to think it might be a characteristic of the models. The motivating examples we have are more in the domain of applications that the current generation of supervised deep learning algorithms can’t address: like interconnection of data sets in health. Many of my group’s papers are about interconnecting different views of the patient (genotype, environmental background, clinical data, survival information … with luck even information from social networks and loyalty cards). We approach this through Gaussian process frameworks to ensure that we can build models that will be fully interconnected in application. We call this approach “deep health”. We aren’t there yet, but I feel there’s a lot of evidence so far that we’re working with a class of models that will do the job. My larger concern is the ethical implications of pulling this scale and diversity of information together. I find the idea of a world where we have computer models outperforming humans in predicting their own behavior (perhaps down to the individual) quite disturbing. It seems to me that now the technology is coming within reach, we need to work hard to also address these ethical questions. And it’s important that this debate is informed by people who actually understand the technology.
5) On a more general point that I think can be explored within this feature, are techniques such as Gaussian processes at a disadvantage in computer science because of their heavy mathematical basis? (I’ve had interviews with people like Donald Knuth and Erol Gelenbe in the past where the idea has come up that computer science and maths should, if not merge, interact a lot more).
Yes, and no. It is true that people seem to have some difficulty with the concept of Gaussian processes. But it’s not that the mathematics is more complex than people are using (at the cutting edge) for deep neural networks. Any of the researchers leading the deep revolution could easily turn their hands to Gaussian processes if they chose to do so. Perhaps at ‘entry’ the concepts seem simpler in deep neural networks, but as you peer ‘deeper’ (forgive the pun) into those models it actually becomes a lot harder to understand what’s going on. The leading people (Hinton, Bengio, LeCun, etc) seem to have really good intuitions, but these are not always easy to teach. Certainly when Geoff Hinton explains something to me I always feel I’ve got a very good grasp of it at the time, but later, when I try and explain the same concept to someone else, I find I can’t always do it (i.e., he’s got better intuitions than me, and he’s better at explaining than I am). There may be similar issues for explaining deep GPs, but my hope is that once the conceptual hurdle of a GP is surmounted, the resulting models are much easier to analyze. Such analysis should also feed back into the wider deep learning community. I’m pleased that this is already starting to happen (see Duvenaud, Rippel, Adams and Ghahramani). Gaussian processes also generalise many different approaches to learning and signal processing (including neural networks), so understanding Gaussian processes well gives you an ‘in’ for many different areas. I agree, though, that the perception in the wider community matches your analysis. This is a major reason for the program of summer schools we’ve developed in Gaussian processes. So far we’ve taught over 200 students, and we have two further schools planned for 2015 with a developing program for 2016. We’ve made material freely available online, including lectures (on YouTube) and lab notes. So I hope we are doing something to address the perception that these models are harder mathematically!
I totally agree on the Maths/CS interface. It is, however, slightly frustrating (and perhaps inevitable) how much different academic disciplines become dominated by a particular culture of research. This can create barriers, particularly when it comes to formal publication (e.g. in the ‘leading’ journals). My group’s been working very hard over the last decade to combat this through the organization of workshops and summer schools that bridge the domains. It always seems to me that meeting people face to face helps us gain a shared understanding. For example, a lot of confusion can be generated by the slightly different ways we use technical terminology; it leads to a surprising number of misunderstandings that do take time to work through. However, through these meetings I’ve learned an enormous amount, particularly from the statistics community. Unfortunately, formal outlets and funding for this interface are still surprisingly difficult to find. This is not helped by the fact that the traditional professional societies don’t necessarily bridge the intellectual ground and sometimes engage in their own fights for territory. These cultural barriers also spill over into the organization of funding. For example, in the UK it’s rare that my grant proposals are refereed by colleagues from the Maths/Stats community or that their grant proposals are refereed by me. They actually go to two totally separate parts of the relevant UK funding body. As a result both sets of proposals can be lost in the wider Maths and CS communities, which is not always conducive to expanding the interface. In the UK I’m hoping that the recent founding of the Alan Turing Institute will cause a bit of a shake up in this area, and that some of these artificial barriers will fall away. But in summary, I totally agree with the point, but also recognize that on both sides of the divide we have created communities which can make collaboration harder.

Blogs on the NIPS Experiment

There are now quite a few blog posts on the NIPS experiment, so I just wanted to put together a place where I could link to them all. It’s a great set of posts from community mainstays, newcomers and those outside our research fields.

Just as a reminder, Corinna and I were extremely open about the entire review process, with a series of posts about how we engaged the reviewers and processed the data. All that background can be found through a separate post here.

At the time of writing there is also still quite a lot of twitter traffic on the experiment.

List of Blog Posts

What an exciting series of posts and perspectives!
For those of you that couldn’t make the conference, here’s what it looked like.
And that’s just one of five or six poster rows!

Open Collaborative Grant Writing

Thanks to an introduction to the Sage Math team by Fernando Perez, I just had the pleasure of participating in a large scale collaborative grant proposal construction exercise, co-ordinated by Nicolas Thiéry. I’ve collaborated on grants before, but for me this was a unique experience because the grant writing was carried out in the open, on github.

The proposal, ‘OpenDreamKit’, is principally about doing as much as possible to smooth collaboration between mathematicians so that advances in maths can be delivered as rapidly as possible to teachers, researchers, technologists etc. Although, of course, I don’t have to tell you, because you can read it on github.

It was a wonderful social experiment, and I think it really worked, although a lot of the credit for that surely goes to the people involved (most of whom were there before I came aboard). I really hope this is funded, because collaborating with these people is going to be great.

For the first time on a proposal, I wasn’t the one who was most concerned about the LaTeX template (actually the second time … I’ve worked on a grant once with Wolfgang Huber). But this took things to another level: as soon as a feature was required, the LaTeX template seemed to be updated, almost in real time, I think mainly by Michael Kohlhase.

Socially it was very interesting, because the etiquette of how to interact (on the editing side) was not necessarily clear at the outset. For example, at one point I was tasked with proofreading a section, but ended up doing a lot of rephrasing. I was worried about whether people would be upset that their text had been changed, but actually there was a positive reaction (at least from Nicolas and Hans Fangohr!), which emboldened me to try more edits. As the deadline approached I think others went through a similar transition, because the proposal really came together in the last few days. It was a little like a school dance, where at the start we were all standing at the edge of the room, eyeing each other up, but as DJ Nicolas ramped things up and the music became a little more hardcore (as dawn drew near), barriers broke down and everyone went a little wild. Nicolas produced a YouTube video, visualising the github commits.

As Alex Konovalov pointed out, we look like bees pollinating each other’s flowers!

I also discovered great new (for me) tools like appear.in that we used for brainstorming on ‘Excellence’ with Nicolas and Hans: much more convenient than Skype or Hangouts.

Many thanks to Nicolas, and all of the collaborators. I think it takes an impressive bunch of people to pull off such a thing, and regardless of outcome, which I very much hope will be positive, I look forward to further collaborations within this grouping.

The NIPS Experiment

Just back from NIPS, where it was really great to see the results of all the work everyone put in. I really enjoyed the program and thought the quality of all presented work was really strong. Both Corinna and I were particularly impressed by the work put in by oral presenters to make their work accessible to such a large and diverse audience.

We also released some of the figures from the NIPS experiment, and there was a lot of discussion at the conference about what the result meant.

As we announced at the conference, the consistency figure was 25.9%. I just wanted to confirm that, in the spirit of openness that we’ve pursued across the entire conference process, Corinna and I will provide a full write-up of our analysis and conclusions in due course!

Some of the comment in the existing debate is missing out some of the background information we’ve tried to generate, so I just wanted to write a post that summarises that information to highlight its availability.

Scicast Question

With the help of Nicolo Fusi, Charles Twardy and the entire Scicast team we launched a Scicast question a week before the results were revealed. The comment thread for that question already had a good amount of interesting comment before the conference. Just for informational purposes: before we began reviewing, Corinna forecast this figure would be 25% and I forecast it would be 20%. The box plot summary of predictions from Scicast is below.

[Figure: box plot of the Scicast forecast predictions]

Comment at the Conference

There was also a fair amount of debate at the conference about what the results mean. A few attempts to answer this question (based only on the inconsistency score and the expected accept rate for the conference) are available in this little Facebook discussion and on this blog post.
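Much of that back-of-the-envelope discussion rests on a simple null model: if two committees each accepted a fraction p of papers independently at random, how often would they disagree? A minimal sketch of that calculation is below; the accept rate used is a placeholder for illustration, not a statement of the conference’s figures.

```python
def independent_committees(p):
    """Expected disagreement if two committees each accept a random
    fraction p of papers, independently of one another."""
    disagree = 2 * p * (1 - p)   # probability one accepts and the other rejects
    accept_then_reject = 1 - p   # fraction of one committee's accepts that
                                 # the other committee would reject
    return disagree, accept_then_reject

# Placeholder accept rate of 25%, for illustration only.
print(independent_committees(0.25))  # (0.375, 0.75)
```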

Background Information on the Process

Just to emphasise the previous posts on this year’s conference, see below:

  1. NIPS Decision Time
  2. Reviewer Calibration for NIPS
  3. Reviewer Recruitment and Experience
  4. Paper Allocation for NIPS

Software on Github

And finally, there is a large amount of code available on a github site that allows our processes to be recreated. A lot of it is tidied up, but the last sections on the analysis are not yet done, because it was always my intention to finish those when the experimental results are fully released.

Alan Turing Institute: Critical Mass or Incubated Lungs?

On Wednesday last week I attended an “Open Meeting” organised by the UK’s EPSRC Research Council on the Alan Turing Institute. The Turing Institute is a new government initiative that stems from a letter from our Chief Scientific Advisor to our Prime Minister about the “age of algorithms”. It aims to provide an international centre of excellence in data science.

The government has provided 42 million pounds of funding (about 60-70 million dollars) and Universities interested in partnering in the Turing Institute are expected to bring 5 million pounds (8 million dollars) to the initiative, to be spent over 5 years.

It seemed clear that the EPSRC will require that the institute is located in one place, and there was much talk of ‘critical mass’. That made me think about what ‘critical mass’ means in data science: after all, we aren’t building a Large Hadron Collider, and one of the most interesting challenges of the new age of data is its distributed nature. I asked a question about this and was given the answers you might expect: flagship international centre of excellence, stimulating environment, attracting the best of the best, etc. Nothing was particularly specific to data science.

In my own area of machine learning the UK has a lot of international recognition, but one of the features I’ve always enjoyed is the distributed nature of the expertise. The groups that spring first to mind are Cambridge (Engineering), Edinburgh (Informatics), UCL (Computer Science and Gatsby) and recently Oxford has expanded significantly (CS, Engineering and Statistics). I’ve always enjoyed the robustness that such a network of leading groups brings. It’s evolved over a period of 20 years, and those of us that have watched it grow are incredibly proud of what the UK has been able to achieve with relatively few people.

Data science requires strong interactions between statisticians and computer scientists. It requires knowledge of classical techniques and modern computational capabilities. The pool of expertise is currently rather small relative to the demand. As a result I find myself constantly in demand within my own University, mainly to advise on the capabilities of current approaches to analysis. A recent xkcd comic cleverly reminded us of how hard it can be to explain the gap between those things that are easy and those things that are virtually impossible. Although in many cases where advice is needed it’s not the full explanation that’s required, just the knowledge. Many expensive errors can be avoided by just a little access to this knowledge. Back in July I posted a position paper that was targeting exactly this problem, and in Sheffield we are pursuing the “Open Data Science” agenda I proposed with vigour. Indeed, I sometimes wonder if my group is not more useful for this advice (which rarely involves any intellectual novelty) than for the ideas we push forward in our research. However, our utility as advisors is much more difficult to quantify, particularly because it often won’t lead to a formal collaboration.

I like analogies, but I think that ‘critical mass’ here is the wrong one. To give better access to expertise, what is required is a higher surface area to volume ratio, not a greater mass. Communication between experts is important, but we are fortunate in the UK to have a geographically close network of well connected Universities. Many international visitors take the time to visit two or three of the leading groups when they are here, so I think the analogy of a lung is a far better one for describing what is required for UK data science. I’m pleased the government has recognised the importance of data science, I just hope that in their rush to create a flagship institute, with a large headline-grabbing investment figure attached, they don’t switch off the incubator that sustains our developing lungs.

Gaussian Process Summer School

Yesterday we finished our third Sheffield school. As with the previous events we’ve ended with a one-day workshop focussed on Gaussian processes, this time on using them for feature extraction. With such a busy summer it was pretty intimidating to take on the school so shortly after we had sent out decisions on NIPS. As ever, the group came through with the organisation though. This time out Zhenwen Dai was the main organiser, but once again he could never have done it without the rest of the group chipping in. It’s another reminder that when you are working with great people, great things can happen.

The school always gives me a special kind of energy, that which you can only get from seeing people enthuse about the things you care about. We were very lucky to have such a great group of speakers: Carl Rasmussen, Dan Cornford, Mike Osborne, Rich Turner, Joaquin Quinonero Candela, and then at the workshop Carl Henrik Ek, Andreas Damianou, Victor Prisacariu and Chaochao Lu. It always feels part like a family reunion (we had brief overlaps between Carl, Joaquin (Sheffield Tap!), Lehel Csato and Magnus Rattray, all four of whom were in Sheffield for the 2005 GPRT) and part like a welcoming event for new researchers. We covered important new developments in probabilistic numerics (Mike Osborne), time series processing (Rich Turner) and control (Carl Rasmussen). Joaquin also gave us insights into the evidence and then presented to a University-wide audience on machine learning at Facebook.

In the workshop we also saw how GPs can be used for multiview learning (Carl Henrik Ek), audio processing (Rich Turner), deep learning (Andreas Damianou), shape representation (Victor Prisacariu) and face identification (Chaochao Lu).

We’ve now taught around 140 students through the schools in Sheffield and a further 60 through roadshows to Uganda and Colombia. Perhaps the best bit was watching everyone head for the Devonshire Cat after the last lecture to continue the debate. I think we all probably remember summer schools from our early times in research that were influential (for me the NATO ASI on Machine Learning and Generalisation, for many it will be the regular MLSS events). It’s nice to hope that this series of events may have also done something to influence others. The next scheduled events will be roadshows in Australia in February with Trevor Cohn and Kenya in June with Ciira wa Maina and John Quinn (although we plan to make the Kenyan event more data science focussed than GP focussed).

Thanks to all in the group for organising!

NIPS: Decision Time

Thursday 28th August

In the last two days I’ve spent nearly 20 hours in teleconferences; my last scheduled call will start in about half an hour. Given the available 25 minutes, it seemed to make sense to try and put down some thoughts about the decision process.

The discussion period has been constant: there is a stream of incoming queries from Area Chairs, requests for advice on additional reviewers, or on how to resolve deadlocked or disputing reviews. Corinna has handled many of these.

Since the author rebuttal period all the papers have been distributed to Google spreadsheet lists which are updated daily. They contain paper titles, reviewer names, quality scores, calibrated scores, a probability of accept (under our calibration model), a list of bot-compiled potential issues, as well as columns for accept/reject and poster/spotlight. Area chairs have been working in buddy pairs, ensuring that a second set of eyes can rest on each paper. For those papers around the borderline, or with contrasting reviews, the discussion period really can have an effect, as we see when calibrating the reviewer scores: over time the reviewer bias is reducing and the scores are becoming more consistent. For this reason we allowed this period to go on a week longer than originally planned, and we’ve been compressing our teleconferences into the last few days.
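The calibration model itself is described in the ‘Reviewer Calibration for NIPS’ post linked below. Roughly speaking, each raw score is treated as the sum of a per-paper quality and a per-reviewer bias. The sketch below is a simple least-squares variant of that idea on made-up review triples, intended to show the shape of the computation rather than the exact probabilistic model we ran.

```python
import numpy as np

# Made-up review triples: (paper index, reviewer index, raw score).
reviews = [(0, 0, 7), (0, 1, 5), (1, 1, 6), (1, 2, 8), (2, 0, 4), (2, 2, 6)]
n_papers, n_reviewers = 3, 3

# Model each raw score as paper quality + reviewer bias (plus noise) and
# solve the resulting linear system by least squares.
rows, targets = [], []
for paper, reviewer, score in reviews:
    row = np.zeros(n_papers + n_reviewers)
    row[paper] = 1.0                 # coefficient for the paper's quality
    row[n_papers + reviewer] = 1.0   # coefficient for the reviewer's bias
    rows.append(row)
    targets.append(score)

A = np.array(rows)
y = np.array(targets, dtype=float)
theta, *_ = np.linalg.lstsq(A, y, rcond=None)

quality = theta[:n_papers]   # calibrated per-paper scores
bias = theta[n_papers:]      # per-reviewer offsets
print(quality, bias)
```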

Most teleconferences consist of two buddy pairs coming together to discuss their papers. Perhaps ideally the pairs would have a similar subject background, but constraints of time zone and the fact that there isn’t a balanced number of subject areas mean that this isn’t necessarily the case.

Corinna and I have been following a similar format: listing the papers from highest scoring to lowest scoring, and starting at the top. For each paper, if it is a confident accept, we try and identify whether it might be a talk or a spotlight. This is where the opinion of a range of Area Chairs can be very useful. For uncontroversial accepts that aren’t nominated for orals we spend very little time. This proceeds until we start reaching borderline papers, those in the ‘grey area’: typically papers with an average score around 6. They fall broadly into two categories: those where the reviewers disagree (e.g. scores of 8,6,4), or those where the reviews are consistent but the reviewers, perhaps, feel underwhelmed (scores of 6,6,6). Area chairs will often work hard to try and get one of the reviewers to ‘champion’ a paper: it’s a good sign if a reviewer has been prepared to argue the case for a paper in the discussion. However, the decisions in this region are still difficult. It is clear that we are rejecting some very solid papers, for reasons of space and because of the overall quality of submissions. It’s hard for everyone to be on the ‘distributing’ end of this system, but at the same time, we’ve all been on the receiving end of it too.

In this difficult ‘grey area’, we are looking for sparks in a paper that push it over the edge to acceptance. So what sort of thing catches an area chair’s eye? A new direction is always welcome, but often leads to higher variance in the reviewer scores. Not all reviewers are necessarily comfortable with the unfamiliar. But if an area chair feels a paper is taking the machine learning field somewhere new, then even if the paper has some weaknesses (e.g. in evaluation, or in giving context and detailed derivations) we might be prepared to overlook this. We look at the borderline papers in some detail, scanning the reviews, looking for words like ‘innovative’, ‘new directions’ or ‘strong experimental results’. If we see these then as program chairs we definitely become more attentive. We all remember papers presented at NIPS in the past that led to revolutions in the way machine learning is done. Both Corinna and I would love to have such papers at ‘our’ NIPS.

A paper in a more developed area will be expected to have done a more rounded job in terms of setting the context and performing the evaluation, and to hit a higher standard overall.

It is often helpful to have an extra pair of eyes (or even two pairs) run through the paper. Each teleconference call normally ends with a few follow-up actions for a different area chair to look through a paper or clarify a particular point. Sometimes we also call in domain experts, who may have already produced four formal reviews of other papers, just to get clarification on a particular point. This certainly doesn’t happen for all papers, but those with scores around 7,6,6 or 6,6,6 or 8,6,4 often get this treatment. Much depends on the discussion and content of the existing reviews, but there are still, often, final checks that need carrying out. From a program chair’s perspective, the most important thing is that the Area Chair is comfortable with the decision, and I think most of the job is acting as a sounding board for the Area Chair’s opinion, which I try to reflect back to them. In the same manner as rubber duck debugging, just vocalising the issues sometimes causes them to be crystallised in the mind. Ensuring that Area Chairs are calibrated to each other is also important. The global probabilities of accept from the reviewer calibration model really help here. As we go through papers I keep half an eye on those, not to influence the decision of a particular paper so much as to ensure that at the end of the process we don’t have a surplus of accepts. At this stage all decisions are tentative, but we hope not to have to come back to too many of them.

Monday 1st September

Corinna finished her last video conference on Friday; Saturday, Sunday and Monday (Labor Day) were filled with making final decisions on accepts, then talks and finally spotlights. Accepts were hard: we were unable to take all the papers that were possible accepts, as we would have gone way over our quota of 400. We had to make a decision on duplicated papers where the decisions were in conflict; more details of this to come at the conference. Remembering what a pain it was to do the schedule after the acceptances, and also following advice from Leon Bottou that the talk program should emerge to reflect the accepted posters, we finalized the talk and spotlight program whilst putting talks and spotlights directly into the schedule. We had to hone the talks down to 20 from about 40 candidates, and for spotlights we squeezed in 62 from over a hundred suggestions. We spent three hours in teleconference each day, as well as preparation time, across the Labor Day weekend putting together the first draft of the schedule. It was particularly impressive how quickly area chairs responded to any of our follow-up queries to our notes from the teleconferences, particularly those in the US who were enjoying the traditional last weekend of summer.

Tuesday 2nd September

I had an all-day meeting in Manchester for a network of researchers focussed on mental illness. It was really good to have a day discussing research, my first in a long time. I thought very little about NIPS until, on the train home, I had a little look at the shape of the conference. I actually ended up looking at a lot of the papers we rejected, many from close colleagues and friends. I found it a little depressing. I have no doubt there is a lot of excellent work there, and I know how disappointed my friends and colleagues will be to receive those rejections. We did an enormous amount to ensure that the process was right, and I have every confidence in the area chairs and reviewers. But at the end of the day, you know that you will be rejecting a lot of good work. It brought to mind a thought I had at the allocation stage. When we had the draft allocation for each area chair, I went through several of them sanity checking the quality of the allocation. Naturally, I checked those associated with area chairs who are closer to my own areas of expertise. I looked through the paper titles, and I couldn’t help but think what a good workshop each of those allocations would make. There would be some great ideas and some partially developed ideas. There would be some really great experiments and some weaker experiments. But there would be a lot of debate at such a workshop. None or very few of the papers would be uninteresting. There would certainly be errors in papers, but that’s one of the charms of a workshop: there’s still a lot more to be said about an idea when it’s presented there.

Friday 5th September

Returning from an excellent two-day UCL-Duke workshop. There is a lot of curiosity about the NIPS experiment, but Corinna and I have agreed to keep the results embargoed until the conference.

Saturday 6th September

Area chairs had until Thursday to finalise their reviews in the light of the final decisions, and also to raise any concerns they had about those decisions. My own experience of area chairing is that you can have doubts about your reasoning when you are forced to put pen to paper and write the meta review. We felt it was important not to rush the final process, to allow any of those doubts to emerge. In the end, the final program has 3 or 4 changes from the draft we first distributed on Monday night, so there may be some merit in this approach. We had a further 3-hour teleconference today to go through the meta-reviews, with a particular focus on those for papers around the decision boundary. Other issues, such as comments in the wrong place (the CMT interface can be fairly confusing: 3% of meta reviews were actually placed in the box meant for notes to the program chairs), were also covered. Our big concern was whether the area chairs had written a review consistent with our final verdict. A handy learning task would have been to build a sentiment model to predict accept/reject from the meta review.
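For what it’s worth, that ‘handy learning task’ would only take a few lines with standard tools. Here is a sketch using scikit-learn on made-up meta-review snippets; the data and the pipeline are purely illustrative, not anything we actually trained.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up meta-review snippets and decisions, purely for illustration.
meta_reviews = [
    "Novel idea with strong experimental results; reviewers were enthusiastic.",
    "Reviewers found the contribution incremental and the evaluation weak.",
    "A clear, well written paper; one reviewer championed it in discussion.",
    "Serious concerns about correctness were not resolved in the rebuttal.",
]
decisions = [1, 0, 1, 0]  # 1 = accept, 0 = reject

# TF-IDF features followed by a logistic regression classifier.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(meta_reviews, decisions)

print(model.predict(["Reviewers agreed the results were strong and novel."]))
```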

Monday 8th September 

Our plan had been to release reviews this morning, but we were still waiting for a couple of meta-reviews to be tidied up and had an outstanding issue on one paper. I write this with CMT ‘loaded’ and ready to distribute decisions. However, when I preview the emails the variable fields are not filled in. If I hit ‘send’ I would send 5,000 emails that start “Dear $RecipientFirstName$,” which sounds somewhat impersonal … although perhaps more critical is that the authors would be informed of the fate of paper “$Title$,” which may lead to some confusion. CMT are in a different time zone, 8 hours behind. Fortunately, it is late here, so there is a good chance they will respond in time …

Tuesday 9th September

I was wide awake at 6:10 despite going to sleep at 2 am. I always remember that when I was an Area Chair with John Platt he would be up late answering emails and then out of bed again 4 hours later doing it again. A few final checks and the all clear: everything is there. Pressed the button at 6:22 … emails are still going out and it is 10:47. 3854 of the 5615 emails have been sent … one reply so far, an out-of-office email from China. Time to make a coffee …

Final Statistics

1678 submissions
414 papers accepted
20 papers for oral
62 for spotlight
331 for poster
19 rejected without review

Epilogue to Decision Mail: So what was wrong with those variable names? I particularly like the fact that something different was wrong with each one. $RecipientFirstName$ and $RecipientEmail$ are not available in the “Notification Wizard”, whereas they are in the normal email sending system. Then I got the other variables wrong, $Title$->$PaperTitle$ and $PaperId$->$PaperID$, but since neither of the two I knew to be right were working I assumed there was something wrong with the whole variable substitution system … rather than it being that (at least) two of the variable types just happen to be missing from this wizard … CMT responded nice and quickly though … that’s one advantage of working late.

Epilogue on Acceptances: At the time of the conference there were only 411 papers presented because three were withdrawn. Withdrawals were usually due to some deeper problem authors had found in their own work, perhaps triggered by comments from reviewers. So in the end there were 411 papers accepted and 328 posters.

Author Concerns

So the decisions have been out for a few days now, and of course we have had some queries about our processes. Everyone has been pretty reasonable, and their frustration is understandable when three reviewers have argued for accept but the final decision is to reject. This is an issue with ‘space-constrained’ conferences. Whether a paper gets through in the end can depend on subjective judgements about the paper’s qualities. In particular, we’ve been looking for three components to this: novelty, clarity and utility. Papers with borderline scores (and borderline here might be that the average score is in the weak accept range) are examined closely. The decision about whether the paper is accepted at this point necessarily must come down to judgement, because for a paper to get scores this high the reviewers won’t have identified a particular problem with it. The things that come through are how novel the paper is, how useful the idea is, and how clearly it’s presented. Several authors seem to think that the latter should be downplayed. As program chairs, we don’t necessarily agree. It’s true that it is a great shame when a great idea is buried in poor presentation, but it’s also true that the objective of a conference is communication, and therefore clarity of presentation definitely plays a role. However, it’s clear that all three criteria are a matter of academic judgement: that of the reviewers, the area chair and the quad groups in the teleconferences. All the evidence we’ve seen is that reviewers and area chairs did weigh these aspects carefully, but that doesn’t mean that all their decisions can be shown to be right, because they are often a matter of perspective. Naturally authors are upset when what feels like a perfectly good paper is rejected on more subjective grounds. Most of the queries are on papers where this is felt to be the case.

There has also been one query on process: whether we did enough to evaluate against these criteria, for those papers in the borderline area, before author rebuttal. Authors are naturally upset when the area chair raises such issues in the final decision’s meta review when those points weren’t there before. Personally I sympathise with both authors and area chairs in this case. We made some effort to encourage authors to identify such papers before rebuttal (we sent out attention reports that highlighted probable borderline papers) but our main efforts at the time were chasing missing, inappropriate or insufficient reviews. We compressed a lot into a fairly short time, and it was also a period when many are on holiday. We were very pleased with the performance of our area chairs, but I think it’s also unsurprising if an area chair didn’t have time to carefully think through these aspects before author rebuttal.

My own feeling is that the space constraint on NIPS is rather artificial, and a lot of these problems would be avoided if it wasn’t there. However, there is a counter argument that suggests that to be a top quality conference NIPS has to have a high reject rate. NIPS is used in tenure cases within the US and these statistics are important there. I reject these ideas: I don’t think the role of a conference is to allow people to get promoted in a particular country, nor is that the role of a journal; they are both involved in the communication and debate of scientific ideas. However, I do not view the program chair role as reforming the conference ‘in their own image’. You have to also consider what NIPS means to the different participants.

NIPS as Christmas

I came up with an analogy for this which has NIPS in the role of Christmas (you can substitute Thanksgiving, Chinese New Year, or your favourite traditional feast). In the UK Christmas is a traditional holiday about which people have particular expectations, some of them major (there should be turkey for Christmas dinner) and some of them minor (there should be an old Bond movie on TV). These expectations have changed over time. The Victorians used to eat goose, and the Christmas tree was introduced from Germany through Prince Albert’s influence in the Royal Household; they also didn’t have James Bond, I think they used Charles Dickens instead. However, you can’t just change Christmas ‘overnight’, it needs to be a smooth transition. You can make lots of arguments about how Christmas could be a better meal, or that presents make the occasion too commercial, but people have expectations, so the only way to make change is slowly, taking small steps in the right direction. For any established successful venture this approach makes a lot of sense. There are many more ways to fail than to be successful, and the rough argument is that if you are starting from a point of success you should be careful about how quickly you move, because you are likely to end up in failure. However, not moving at all also leads to failure. I think this year we’ve introduced some innovations and an analysis of the process that will hopefully lead to improvements. We certainly aren’t alone in these innovations, each NIPS before us has done the same thing (I’m a particular fan of Zoubin and Max’s publication of the reviews). Whether we did this well or not, like those borderline papers, is a matter for academic judgement. In the meantime I (personally) will continue to try to enjoy NIPS for what it is, whilst wondering about what it could be and how we might get there. I also know that as a community we will continue to innovate, launching new conferences with new models for reviewing (like ICLR).