GPy: Moving from MATLAB to Python

Back in 2002 or 2003, when this paper was going through the journal revision stage, I was asked by the reviewers to provide the software that implemented the algorithm. After the initial annoyance at another job to do, I thought about it a bit and realised that not only was it a perfectly reasonable request, but that the software was probably the main output of research. In paticular, in terms of reproducibility, the implementation of the algorithm seems particularly important. As a result, when I visited Mike Jordan’s group in Berkeley in 2004, he began to write a software framework for publishing his research, based on a simple MATLAB kernel toolbox, and a set of likelihoods. This led to a reissuing of the IVM software and these toolboxes underpinned my group’s work for the next seven or eight years, going through multiple releases.

The initial plan for code release was to provide implementations of published software, but over time the code base evolved into quite a usable framework for Gaussian process modelling. The release of code proved particularly useful in spreading the ideas underlying the GP-LVM, enabling Aaron Hertzman and collaborators to pull together a style based inverse kinematics approach at SIGGRAPH which has proved very influential.

Even at that time it was apparent what a horrible language MATLAB was, but it was the language of choice for machine learning. Efficient data processing requires an interactive shell, and the only ‘real’ programming language with such a facility was python. I remember exploring python whilst at Berkeley with Misha Belkin, but at the time there was no practical implementation of numerical algorithms (at that time it was done in a module called numeric, which was later abandoned). Perhaps more importantly, there was no graph-plotting capability. Although as a result of that exploration, I did stop using perl for scripting and switched to python.

The issues with python as a platform for data analysis were actually being addressed by John D. Hunter with matplotlib. He presented at theNIPS workshop on machine learning open source software in 2008, where I was a member of the afternoon discussion panel. John Eaton, creator of Octave, was also at the workshop, although in the morning session, which I missed due to other commitments. By this time, the group’s MATLAB software was also compatible with Octave. But Octave has similar limitations to MATLAB in terms of language and also did not provide such a rich approach to GUIs. These MATLAB GUIs, whilst quick clunky in implementation, allow live demonstration of algorithms with simple interfaces. This is a facet that I used regularly in my talks.

In the meantime the group was based in Manchester, where in response to the developments in matplotlib and the new numerical module numpy I opened a debate in Manchester about Python in machine learning with this MLO lunch-time talk. At that point I was already persuaded of the potential for python in both teaching and research, but for research in particular there was the problem of translating the legacy code. At this point scikit-learn was fairly immature, so as a test, I began reimplementing portions of the netlab toolbox in python. The (rather poor) result can be found here, with comments from me at the top of netlab.py about issues I was discovering that confuse you when you first move from MATLAB to numpy. I also went through some of these issues in my MLO lunch time talk.

When Nicolo Fusi arrived as a summer student in 2009, he was keen to not use MATLAB for his work, on eQTL studies, and since it was a new direction and the modelling wouldn’t rely too much legacy code, I encouraged him to do this. Others in the group from that period (such as Alfredo Kalaitzis, Pei Gao and our visitor Antti Honkela) were using R as well as MATLAB, because they were focussed on biological problems and delivering code to Bioconductor, but the main part of the group doing methodological development (Carl Henrik Ek, Mauricio Alvarez, Michalis Titsias) were still using MATLAB. I encouraged Nicolo to use python for his work, rather than the normal group practice, which was to stay within the software framework provided by the MATLAB code. By the time Nicolo returned as a PhD student in April 2010 I was agitating in the MLO group in Manchester that all our machine learning should be done in python. In the end, this lobbying attempt was unsuccessful, perhaps because I moved back to Sheffield in August 2010.

On the run into the move, it was already clear where my mind was going. I gave this presentation at a workshop on Validation in Statistics and ML in Berlin in June, 2010 where I talked about the importance of imposing a rigid structure for code (at the time we used an svn repository and a particular directory structure) when reproducible research is the aim, but also mention the importance of responding to individuals and new technology (such as git and python). Nicolo had introduced me to git but we had the legacy of an SVN code base in MATLAB to deal with. So at that point the intention was there to move both in terms of research and teaching, but I don’t think I could yet see how we were going to do it. My main plan was to move the teaching there first, and then follow with the research code.

On the move to Sheffield in August, 2010, we had two new post-docs start (Jaakko Peltonen andJames Hensman) and a new PhD student (Andreas Damianou). Jaakko and Andreas were started on the MATLAB code base but James also expressed a preference for python. Nicolo was progressing well with his python implementations, so James joined Nicolo in working in python. However, James began to work on particular methodological ideas that were targeted at Gaussian processes. I think this was the most difficult time in the transition. In particular, James initially was working on his own code, that was put together in a bespoke manner for solving a particular problem. A key moment in the transition was when James also realised the utility of a shared code base for delivering research, he set to work building a toolbox that replicated the functionality of the old code base, in particular focussing on covariance functions and sparse approximations. Now the development of the new code base had begun in earnest. With Nicolo joining in and the new recruits: Ricardo Andrade Pacheco (PhD student from September 2011), who focussed on developing the likelihood code with the EP approximation in mind and Nicolas Durrande (post-doc) working on covariance function (kernel) code. This tipped the balance in the group so all the main methodological work was now happening in the new python codebase, what was to become GPy. By the time of this talk at the RADIANT Project launch meeting in October 2012, the switch over had been pretty much completed, since then Alan Saul joined the team and has been focussing on the likelihood models and introducing the Laplace approximation. Max Zwiessele, who first visited us from MPI Tuebingen in 2012, returned in April 2013 and has been working on the Bayesian GP-LVM implementations (with Nicolo Fusi and Andreas Damianou).

GPy has now fully replaced the old MATLAB code base as the group’s approach to delivering code implementations.

I think the hardest part of the process was the period between fully committing to the transition and not yet having a fully functional code base in python. The fact that this transition was achieved so smoothly, and has led to a code base that is far more advanced than the MATLAB code, is entirely down to those that worked on the code, but particular thanks is due to James Hensman. As soon as James became convinced of the merits of a shared research code base he began to drive forward the development of GPy. Nicolo worked closely with James to get the early versions functional, and since then all the groups recruits have been contributing.

Four years ago, I knew where we wanted to be, but I didn’t know how (or if) we were going to get there. But actually, I think that’s the way of most things in research. As often, the answer is through the inspiration and perspiration of those that work with you. The result is a new software code base, more functional than before, and more appropriate for student projects, industrial collaborators and teaching. We have already used the code base in two summer schools, and have two more scheduled. It is still very much in alpha release, but we are sharing it through a BSD license to enable both industrial and academic collaborators to contribute. We hope for a wider user base thereby ensuring a more robust code base.

License

We had quite an involved discussion about what license to release source code under. The original MATLAB code base (now rereleased as the GPmat toolbox on github) was under a academic use only license. Primarily because the code was being released as papers were being submitted, and I didn’t want to have to make decisions about licensing (beyond letting people see the source code for reproduction) on submission of the paper. When our code was being transferred to bioconductor (e.g. PUMA and tigre) we were releasing as GPL licensed software as required. But when it comes to developing a framework, what license to use? It does bother me that many people continue to use code with out attribution, this has a highly negative effect, particularly when it comes to having to account for the group’s activities. It’s always irritated me that BSD license code can be simply absorbed by a company, without proper acknowledgement of the debt the firm owes open source or long term support of the code development. However, at the end of the day, our job as academics is to allow society to push forward. To make our ideas as accessible as possible so that progress can be made. A BSD license seems to be the most compatible with this ideal. Add to that the fact that some of my PhD students (e.g. Nicolo Fusi, now at Microsoft) move on to companies which are unable to use GPL licenses, but can happily continue to work on BSD licensed code and BSD became the best choice. However, I would ask people, if they do use our code, please acknowledge our efforts. Either by referencing the code base, or, if the code implements a research idea, please reference the paper.

Directions

The original MATLAB code base was originally just a way to get the group’s research out to the ‘user base’. But I think GPy is much more than that. Firstly, it is likely that we will be releasing our research papers with GPy as a dependency, rather than re-releasing the whole of GPy. That makes it more of a platform for research. It also will be a platform for modelling. Influenced by the probabilistic programming community we are trying to make the GPy interface easy to use for modellers. I see all machine learning as separated into model and algorithm. The model is what you say about the data, the algorithm is how you fit (or infer) the parameters of that model. An aim for GPy is to make it easy for users to model without worrying about the algorithm. Simultaneously, we hope that ML researchers will use it as a platform to demonstrate their new algorithms, which are applicable to particular models (certainly we hope to do this). Finally we are using GPy as a teaching tool in our series of Gaussian Process Summer Schools, Winter Schools and Road Shows. The use of python as an underlying platform means we can teach industry and academic collaborators with limited resources the fundamentals of Gaussian processes without requiring them to buy extortionate licenses for out of date programming paradigms.