25 November 2013

Back in 2002 or 2003, when this paper was going through the journal revision stage, I was asked by the reviewers to provide the software that implemented the algorithm. After the initial annoyance at another job to do, I thought about it a bit and realised that not only was it a perfectly reasonable request, but that the software was probably the main output of research. In paticular, in terms of reproducibility, the implementation of the algorithm seems particularly important. As a result, when I visited Mike Jordan’s group in Berkeley in 2004, he began to write a software framework for publishing his research, based on a simple MATLAB kernel toolbox, and a set of likelihoods. This led to a reissuing of the IVM software and these toolboxes underpinned my group’s work for the next seven or eight years, going through multiple releases.

The initial plan for code release was to provide implementations of published software, but over time the code base evolved into quite a usable framework for Gaussian process modelling. The release of code proved particularly useful in spreading the ideas underlying the GP-LVM, enabling Aaron Hertzman and collaborators to pull together a style based inverse kinematics approach at SIGGRAPH which has proved very influential.

Even at that time it was apparent what a horrible language MATLAB was, but it was the language of choice for machine learning. Efficient data processing requires an interactive shell, and the only ‘real’ programming language with such a facility was python. I remember exploring python whilst at Berkeley with Misha Belkin, but at the time there was no practical implementation of numerical algorithms (at that time it was done in a module called numeric, which was later abandoned). Perhaps more importantly, there was no graph-plotting capability. Although as a result of that exploration, I did stop using perl for scripting and switched to python.

The issues with python as a platform for data analysis were actually being addressed by John D. Hunter with matplotlib. He presented at theNIPS workshop on machine learning open source software in 2008, where I was a member of the afternoon discussion panel. John Eaton, creator of Octave, was also at the workshop, although in the morning session, which I missed due to other commitments. By this time, the group’s MATLAB software was also compatible with Octave. But Octave has similar limitations to MATLAB in terms of language and also did not provide such a rich approach to GUIs. These MATLAB GUIs, whilst quick clunky in implementation, allow live demonstration of algorithms with simple interfaces. This is a facet that I used regularly in my talks.

In the meantime the group was based in Manchester, where in response to the developments in matplotlib and the new numerical module numpy I opened a debate in Manchester about Python in machine learning with this MLO lunch-time talk. At that point I was already persuaded of the potential for python in both teaching and research, but for research in particular there was the problem of translating the legacy code. At this point scikit-learn was fairly immature, so as a test, I began reimplementing portions of the netlab toolbox in python. The (rather poor) result can be found here, with comments from me at the top of netlab.py about issues I was discovering that confuse you when you first move from MATLAB to numpy. I also went through some of these issues in my MLO lunch time talk.

When Nicolo Fusi arrived as a summer student in 2009, he was keen to not use MATLAB for his work, on eQTL studies, and since it was a new direction and the modelling wouldn’t rely too much legacy code, I encouraged him to do this. Others in the group from that period (such as Alfredo Kalaitzis, Pei Gao and our visitor Antti Honkela) were using R as well as MATLAB, because they were focussed on biological problems and delivering code to Bioconductor, but the main part of the group doing methodological development (Carl Henrik Ek, Mauricio Alvarez, Michalis Titsias) were still using MATLAB. I encouraged Nicolo to use python for his work, rather than the normal group practice, which was to stay within the software framework provided by the MATLAB code. By the time Nicolo returned as a PhD student in April 2010 I was agitating in the MLO group in Manchester that all our machine learning should be done in python. In the end, this lobbying attempt was unsuccessful, perhaps because I moved back to Sheffield in August 2010.

On the run into the move, it was already clear where my mind was going. I gave this presentation at a workshop on Validation in Statistics and ML in Berlin in June, 2010 where I talked about the importance of imposing a rigid structure for code (at the time we used an svn repository and a particular directory structure) when reproducible research is the aim, but also mention the importance of responding to individuals and new technology (such as git and python). Nicolo had introduced me to git but we had the legacy of an SVN code base in MATLAB to deal with. So at that point the intention was there to move both in terms of research and teaching, but I don’t think I could yet see how we were going to do it. My main plan was to move the teaching there first, and then follow with the research code.

On the move to Sheffield in August, 2010, we had two new post-docs start (Jaakko Peltonen andJames Hensman) and a new PhD student (Andreas Damianou). Jaakko and Andreas were started on the MATLAB code base but James also expressed a preference for python. Nicolo was progressing well with his python implementations, so James joined Nicolo in working in python. However, James began to work on particular methodological ideas that were targeted at Gaussian processes. I think this was the most difficult time in the transition. In particular, James initially was working on his own code, that was put together in a bespoke manner for solving a particular problem. A key moment in the transition was when James also realised the utility of a shared code base for delivering research, he set to work building a toolbox that replicated the functionality of the old code base, in particular focussing on covariance functions and sparse approximations. Now the development of the new code base had begun in earnest. With Nicolo joining in and the new recruits: Ricardo Andrade Pacheco (PhD student from September 2011), who focussed on developing the likelihood code with the EP approximation in mind and Nicolas Durrande (post-doc) working on covariance function (kernel) code. This tipped the balance in the group so all the main methodological work was now happening in the new python codebase, what was to become GPy. By the time of this talk at the RADIANT Project launch meeting in October 2012, the switch over had been pretty much completed, since then Alan Saul joined the team and has been focussing on the likelihood models and introducing the Laplace approximation. Max Zwiessele, who first visited us from MPI Tuebingen in 2012, returned in April 2013 and has been working on the Bayesian GP-LVM implementations (with Nicolo Fusi and Andreas Damianou).

GPy has now fully replaced the old MATLAB code base as the group’s approach to delivering code implementations.

I think the hardest part of the process was the period between fully committing to the transition and not yet having a fully functional code base in python. The fact that this transition was achieved so smoothly, and has led to a code base that is far more advanced than the MATLAB code, is entirely down to those that worked on the code, but particular thanks is due to James Hensman. As soon as James became convinced of the merits of a shared research code base he began to drive forward the development of GPy. Nicolo worked closely with James to get the early versions functional, and since then all the groups recruits have been contributing.

Four years ago, I knew where we wanted to be, but I didn’t know how (or if) we were going to get there. But actually, I think that’s the way of most things in research. As often, the answer is through the inspiration and perspiration of those that work with you. The result is a new software code base, more functional than before, and more appropriate for student projects, industrial collaborators and teaching. We have already used the code base in two summer schools, and have two more scheduled. It is still very much in alpha release, but we are sharing it through a BSD license to enable both industrial and academic collaborators to contribute. We hope for a wider user base thereby ensuring a more robust code base.