Open Data Science

I’m not sure if this is really a blog post; it’s more of a ‘position paper’ or a proposal. But it’s something I’d be very happy to have comments on, so publishing it in the form of a blog seems most appropriate.

We are in the midst of the information revolution, and it is being driven by our increasing ability to monitor, store, interconnect and analyse large interacting sets of data. Industrial mechanisation required the combination of coal and the heat engine; informational mechanisation requires the combination of data and data engines. By analogy with a heat engine, which takes high entropy heat energy and converts it to low entropy, actionable kinetic energy, a data engine is powered by large unstructured data sources and converts them to actionable knowledge. This can be achieved through a combination of mathematical and computational modelling, and the required combination of skill sets falls across traditional academic boundaries.

Outlook for Companies

From a commercial perspective, companies are looking to characterise consumers/users in unprecedented detail. They need to characterise their users’ behaviour in detail to

  1. provide better service to retain users,
  2. target those users with commercial opportunities.

These firms are competing for global dominance: each wants to be the data repository. They are excited by the power of interconnected data, but nervous about the next stage in this process: they foresee the natural monopoly that the interconnectedness of data implies, and they are pursuing it with the vigour of a young Microsoft. They view the current era as analogous to the early days of ‘microcomputers’, with competing platforms looking to dominate the market. They are paying very large fees to acquire potential competitors to ensure that they retain access to the data (e.g. Facebook’s purchase of WhatsApp for $19 billion), and they are acquiring expertise in the analysis of data from academia, either through direct hires (Yann LeCun from NYU to Facebook; Andrew Ng from Stanford to found a $300 million research lab for Baidu) or by purchasing academic start-ups (Geoff Hinton’s DNNResearch from Toronto to Google; Google’s purchase of DeepMind for $400 million). The interest of these leading internet firms in machine learning is exciting and a sign of the field’s major successes, but it leaves a major challenge for firms that want to enter the market and either provide competing services or introduce new ones. They are debilitated by

  1. lack of access to data,
  2. lack of access to expertise.



Science is far more evolved than the commercial world from the perspective of data sharing. Whilst its merits may not be universally accepted by individual scientists, communities and funding agencies encourage widespread sharing. One of the most significant endeavours was the human genome project, now nearly 25 years old. In computational biology there is now widespread sharing of data and methodologies: measurement technology moves so quickly that an efficient pipeline for development and sharing is vital to ensure that analysis adapts to the rapidly evolving nature of the data (e.g. cDNA arrays to Affymetrix arrays to RNAseq). There are also large scale modelling and sharing challenges at the core of other disciplines, such as astronomy (e.g. Sarah Bridle’s GREAT08 challenge for cosmic lensing) and climate science. However, for many scientists access to these methodologies is restricted not by a lack of availability of better methods, but by their technical inaccessibility. A major challenge in science is bridging the gap between the data analyst and the scientist: equipping scientists with the fundamental concepts that will allow them to explore their own systems with a complete mathematical and computational toolbox, rather than being constrained by the provisions of a commercial ‘analysis toolbox’ software provider.


Historically, in health, scientists have worked closely with clinicians to establish the causes of diseases and, ideally, eradicate them at source. Antibiotics and vaccinations have had major successes in this area. The diseases that remain are

  1. resulting from a large range of initial causes, and as a result having no discernible target for a ‘magic bullet’ cure (e.g. heart disease, cancers);
  2. difficult to diagnose at an early stage, leading to identification only when progress is irreversible (e.g. dementias); or
  3. coevolving with our clinical developments to subvert our solutions (e.g. C. difficile, multiple-drug-resistant tuberculosis).

Access to large scale interconnected data sources again gives the promise of a route to resolution. It will give us the ability to better characterize the cause of a given disease; the tools to monitor patients and form an early diagnosis of disease; and the biological
understanding of how disease agents manage to subvert our existing cures. Modern data allows us to obtain a very high resolution,
multifaceted perspective on the patient. We now have the ability to characterise their genotype (through high resolution sequencing) and their phenotype (through gene and protein expression, clinical measurements, shopping behaviour, social networks, music listening behaviour). A major challenge in health is ensuring that the privacy of patients is respected whilst leveraging this data for wider societal benefit in understanding human disease. This requires development of new methodologies that are capable of assimilating these information resources on population wide scales. Due to the complexity of the underlying system, the methodologies required are also more complex than the relatively simple approaches that are currently being used to, for example, understand commercial intent. We need more sophisticated and more efficient data engines.

International Development

The wide availability of mobile telephones in many developing countries provides opportunity for modes of development that differ considerably from the traditional paths that arose in the past (e.g. canals, railways, roads and fixed line telecommunications). If countries take advantage of these new approaches, it is likely that the nature of the resulting societies will be very different from those that arose through the industrial revolution. The rapid adoption of mobile money, which arguably places parts of the financial system in many sub-Saharan African countries ahead of their apparently ‘more developed’ counterparts, illustrates what is possible. These developments are facilitated by the low capital cost of deployment. They rely on the mobile telecommunications architecture and the widespread availability of handsets. The ease of deployment and development of mobile phone apps, and the rapidly increasing availability of affordable smartphone handsets, presents opportunities that exploit the particular advantages of the new telecommunications ecosystem. A key strand to our thinking is that these developments can be pursued by local entrepreneurs and software developers (to see this in action check out the work of the AI-DEV group here). The two main challenges for enabling this to happen are mechanisms for data sharing that retain the individual’s control over their data and the education of local researchers and students. These aims are both facilitated by the open data science agenda.

Common Strands to these Challenges

The challenges described above share related strands that can be summarised in three areas:

  1. Access to data: balancing the individual’s right to privacy with the societal need for advance.
  2. Advancing methodologies: developing the methodologies needed to characterise large, interconnected, complex data sets.
  3. Analysis empowerment: giving scientists, clinicians, students, commercial and academic partners the ability to analyse their own data using the latest methodological advances.

The Open Data Science Idea

It now seems absurd to posit a ‘magic bullet cure’ for the challenges described above across such diverse fields, and indeed the underlying circumstances of each challenge are sufficiently nuanced that any such sledgehammer would prove brittle. However, we will attempt to describe a philosophical approach that, when combined with the appropriate domain expertise (whether cultural, societal or technical), will aim to address these issues in the long term.

Microsoft’s quasi-monopoly on desktop computing was broken by open source software. It has been estimated that the development cost of a full Linux system would be $10.8 billion. Regardless of the veracity of this figure, we know that several leading modern operating systems are based on open source (Android is based on Linux, OSX is based on FreeBSD). If it weren’t for open source software, then these markets would have been closed to Microsoft’s competitors due to entry costs. We can do much to celebrate the competition provided by OSX and Android and the contributions of Apple and Google in bringing them to market, but the enablers were the open source software community. Similarly, at launch, both Google’s and Facebook’s architectures, for web search and social networking respectively, were entirely based on open source software, and both companies have contributed informally and formally to its development.

Open data science aims to bring together the same community resources, capitalising on the underlying social driver of this phenomenon: many talented people would like to see their ideas and work applied for the widest possible benefit. The modern internet provides tools such as github, the IPython notebook and reddit for easy distribution of, and comment on, this material. In Sheffield we have started making our ideas available through these mechanisms. As academics in open data science, part of our role should be to:

  1. make new analysis methodologies available as widely and rapidly as possible, with as few conditions on their use as possible;
  2. educate our commercial, scientific and medical partners in the use of these latest methodologies; and
  3. act to achieve a balance between data sharing for societal benefit and the right of an individual to own their data.

We can achieve 1) through widespread distribution of our ideas under flexible BSD-like licenses that give commercial, scientific and medical partners as much flexibility as possible to adapt our methods and analyses to their own circumstances. We will achieve 2) through undergraduate courses, postgraduate courses, summer schools and widespread distribution of teaching materials. We will host projects from departments across the University. We will develop new programs of study that address the gaps in current expertise. Our actions regarding 3) will be to support and advise initiatives that look to return control of their own data to the individual. We should do this while simultaneously engaging with the public on what the technologies behind data sharing are and how they will benefit society.


Open data science should be an inclusive movement that operates across traditional boundaries between companies and academia. It can bridge the technological gap between ‘data science’ and science, address the barriers to large scale analysis of health data, and build bridges between academia and companies to ease access to methodologies and data. It will make our ideas publicly available for consumption by individuals, whether in developing countries, commercial organisations or public institutes.

In Sheffield we have already been actively pursuing this agenda through different strands: we have been making software available for over a decade, and are now doing so with extremely liberal licenses. We are running a series of Gaussian process summer schools, which have included roadshows at UTP, Colombia (hosted by Mauricio Alvarez) and Makerere University, Uganda (hosted by John Quinn). We have organised workshops targeted at Big Data and we are making our analysis approaches freely available. We have organised courses locally in Sheffield in programming targeted at biologists (taught by Marta Milo) and have begun a series of meetings on Data Science (speakers have included Fernando Perez, Fabian Pedregosa, Michael Betancourt and Mike Croucher). We have taught on the ML Summer School and at EBI Summer Schools focused on Computational Systems Biology. Almost all of these activities have led to ongoing research collaborations, both for us and for other attendees. Open Data Science brings all these strands together, and it expands our remit to communicate, using the latest tools, with a wider cross section of clinicians and scientists. Driven by this agenda, we will also expand our interaction with commercial partners, as collaborators, consultants and educators. We welcome other groups, both in the UK and internationally, to join us in achieving these aims.

Paper Allocation for NIPS

With luck we will release papers to reviewers early next week. The paper allocations are being refined by area chairs at the moment.

Corinna and I thought it might be informative to give details of the allocation process we used, so I’m publishing it here. Note that this automatic process just gives the initial allocation. The current stage we are in is moving papers between Area Chairs (in response to their comments) whilst they also do some refinement of our initial allocation. If I find time I’ll also tidy up the python code that was used and publish it as well (in the form of an IPython notebook).

I wrote the process down in response to a query from Geoff Gordon. So the questions I answer are imagined questions from Geoff. If you like, you can picture Geoff asking them like I did, but in real life, they are words I put into Geoff’s mouth.

  •  How did you allocate the papers?

We ranked all paper-reviewer matches by similarity and allocated each paper-reviewer pair from the top of the list, rejecting an allocation if the reviewer had a full quota or the paper had a full complement of reviewers.
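As a sketch, that greedy pass might look like the following in python (the quota values here are illustrative defaults, not the actual figures we used):

```python
# Greedy allocation: walk the ranked list of (similarity, paper, reviewer)
# triples from the top, skipping any pair where the reviewer's quota is
# full or the paper already has its complement of reviewers.
def allocate(scores, reviewer_quota=6, reviewers_per_paper=3):
    """scores: list of (similarity, paper_id, reviewer_id) triples."""
    load = {}      # papers allocated to each reviewer so far
    assigned = {}  # reviewers allocated to each paper so far
    allocation = []
    for sim, paper, reviewer in sorted(scores, reverse=True):
        if load.get(reviewer, 0) >= reviewer_quota:
            continue  # reviewer has a full quota
        slots = assigned.setdefault(paper, [])
        if len(slots) >= reviewers_per_paper or reviewer in slots:
            continue  # paper has a full complement of reviewers
        allocation.append((paper, reviewer))
        slots.append(reviewer)
        load[reviewer] = load.get(reviewer, 0) + 1
    return allocation
```

The real implementation worked on pandas data frames, but the logic of a single pass down the ranked list is as above.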

  • How was the similarity computed?

The similarity consisted of the following weighted components.

s_p = 0.25 * primary subject match
s_s = 0.25 * bag-of-words match between primary and secondary subjects
m   = 0.50 * TPMS score (rescaled to be between 0 and 1)

  •  So how were the bids used?

Each of the similarity scores was multiplied by 1.5^b where b is the bid. For: “eager” b=2, “willing” b=1, “in a pinch” b=-1, “not willing” b=-2 and no bid was b=0. So the final score used in the ranking was (s_p+s_s+m)*1.5^b
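In code, the whole scoring formula amounts to the following (component inputs assumed already rescaled to lie between 0 and 1):

```python
# Bid multipliers: each similarity score is scaled by 1.5**b.
BID = {"eager": 2, "willing": 1, "no bid": 0, "in a pinch": -1, "not willing": -2}

def final_score(primary_match, subject_bow_match, tpms, bid="no bid"):
    s_p = 0.25 * primary_match      # primary subject match
    s_s = 0.25 * subject_bow_match  # bag of words match on subjects
    m = 0.5 * tpms                  # TPMS score, rescaled to [0, 1]
    return (s_p + s_s + m) * 1.5 ** BID[bid]
```

So a perfect match with an ‘eager’ bid scores (0.25 + 0.25 + 0.5) * 1.5^2 = 2.25, while the same match with no bid scores 1.0.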

  • But how did you deal with the fact that different reviewers used the bidding in different ways?

The rows and columns were crudely normalized by the *square root* of their standard deviations.
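One plausible reading of that crude normalization, sketched below (this is a reconstruction of the idea, not the exact code we ran): each entry of the score matrix is divided by the square root of its row’s standard deviation and the square root of its column’s, which combine into a single factor.

```python
import math

def crude_normalize(S):
    """Normalize a score matrix S (list of lists) by dividing each entry
    by the square roots of its row's and its column's standard deviation."""
    def std(xs):
        mu = sum(xs) / len(xs)
        return math.sqrt(sum((x - mu) ** 2 for x in xs) / len(xs))
    row_std = [std(row) for row in S]
    col_std = [std(col) for col in zip(*S)]
    # Dividing by sqrt(row std) and then sqrt(col std) is the same as
    # dividing once by sqrt(row_std * col_std).
    return [[S[i][j] / math.sqrt(row_std[i] * col_std[j])
             for j in range(len(S[0]))]
            for i in range(len(S))]
```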

  • So what about conflicting papers?

Conflicting papers were given similarities of -inf.

  • How did you ensure that ‘expertise’ was evenly distributed?

We split the reviewing body into two groups. The first group of ‘experts’ were those people with two or more NIPS papers since 2007 (thanks to Chris Hiestand for providing this information). This was about 1/3 of the total reviewing body. We allocated these reviewers first to a maximum of one ‘expert’ per paper. We then allocated the remainder of the reviewing body to the papers up to a maximum of 3 reviewers per paper.
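That two-stage expert allocation can be sketched by running the same greedy pass twice (quotas again illustrative):

```python
# Two passes over the ranked list: 'experts' first, capped at one expert
# per paper, then the rest of the reviewing body up to three reviewers
# per paper in total.
def two_pass_allocate(scores, experts, reviewer_quota=6):
    """scores: (similarity, paper, reviewer) triples; experts: set of ids."""
    assigned, load = {}, {}

    def greedy(pairs, per_paper_cap):
        for sim, paper, reviewer in sorted(pairs, reverse=True):
            if load.get(reviewer, 0) >= reviewer_quota:
                continue
            slots = assigned.setdefault(paper, [])
            if len(slots) >= per_paper_cap or reviewer in slots:
                continue
            slots.append(reviewer)
            load[reviewer] = load.get(reviewer, 0) + 1

    greedy([s for s in scores if s[2] in experts], 1)      # expert pass
    greedy([s for s in scores if s[2] not in experts], 3)  # everyone else
    return assigned
```

Because the expert pass runs first with a cap of one, no paper can end up with two experts while another paper has none available to it.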

  • One or more of my papers has fewer than three reviewers, how did that happen?

When forming the ranking to allocate papers, we only retained papers scoring in the upper half. This was to ensure that we didn’t drop too far down the rank list. After passing through the rank list of scores once, some papers were still left unallocated.

  • But you didn’t leave unallocated papers to area chairs did you?

No, we needed all papers to have an area chair, so for area chairs we continued to allocate these ‘inappropriate papers’ to the best matching area chair with remaining quota, but for reviewers we left these allocations ‘open’ because we felt manual intervention was appropriate here.

  • Was anything else different about area chair allocation?

Yes, we found there was a tendency for high-bidding area chairs to fill up their allocation quickly, meaning low-bidding/low-similarity area chairs weren’t getting a very good allocation. To try to distribute things more evenly, we used a ‘staged quota’ system: we started by allocating area chairs five papers each, then ten, then fifteen, etc. This meant that even if an area chair had the top 25 similarities in the overall list, many of those papers would still be matched to other reviewers. Our crude normalization was also designed to prevent this tendency. Perhaps a better idea still would be to rank similarities on a per-reviewer basis and use the rank as the score instead of the similarity itself, although we didn’t try this.
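The staged quota idea can be sketched as repeated greedy passes with a rising quota (stage size and per-paper cap here are illustrative):

```python
# Staged quotas: rerun the greedy pass with quotas of 5, 10, 15, ... so
# that a chair with many top similarities cannot absorb their whole
# allocation before lower-bidding chairs get any papers.
def staged_allocate(scores, per_paper=1, stage=5, final_quota=25):
    assigned, load = {}, {}
    for quota in range(stage, final_quota + 1, stage):
        for sim, paper, chair in sorted(scores, reverse=True):
            if load.get(chair, 0) >= quota:
                continue
            slots = assigned.setdefault(paper, [])
            if len(slots) >= per_paper or chair in slots:
                continue
            slots.append(chair)
            load[chair] = load.get(chair, 0) + 1
    return assigned
```

In the toy case below, staging (stage size 1) gives chair c1 only its best paper in the first round, leaving p2 for c2; a single pass with the full quota hands both papers to c1.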

  • Did you do the allocations for the bidding in the same way?

Yes, we did bidding allocations in a similar way, apart from two things. Firstly, the similarity score was different: we didn’t have a separate match to the primary keyword. This led to problems for reviewers who had one dominant primary keyword and many less important secondary keywords. Secondly, the allocated papers were distributed in a different way: each paper was allocated (for bidding) to those area chairs who were in the top 25 scores for that paper. This led to quite a wide variety in the number of papers each area chair saw for bidding, but each paper was (hopefully) seen at least 25 times.

  • That’s for area chairs, was it the same for the bidding allocation for reviewers?

No, for reviewers we wanted to restrict the number of papers each reviewer would see. We decided each reviewer should see a maximum of 25 papers, so we did something more similar to the ‘preliminary allocation’ that’s just been sent out: we went down the list allocating a maximum of 25 papers per reviewer and ensuring each paper was viewed by 17 different reviewers.

  • Why did you do it like this? Did you research this?

We never (originally) intended to intervene so heavily in the allocation system, but with this year’s record numbers of submissions and reviewers the CMT database was failing to allocate. This, combined with time delays between Sheffield/New York/Seattle, was causing delays in getting papers out for bidding, so at one stage we split the load: Corinna worked with CMT to achieve an allocation while Neil worked on coding the intervention described above. The intervention was finished first. Once we had a rough and ready system working for bids, we realised we could have finer control over the allocation than we’d get with CMT (for example, trying to ensure that each paper got at least one ‘expert’), so we chose to continue with our approach. There may certainly be better ways of doing this.

  • How did you check the quality of the allocation?

The main approach we used for checking allocation quality was to examine the allocation of an area chair whose domain we knew well: we looked at the list of papers and judged whether it made ‘sense’.

  • That doesn’t sound very objective, isn’t there a better way?

We agree that it’s not very objective, but then again people seem to evaluate topic models like that all the time, and a topic model is a key part of this system (the TPMS matching service). The other approach was to wait until people complained about their allocation. There were only a few very polite complaints at the bidding stage, but these led us to realise we needed to upweight the similarities associated with the primary keyword. We found that some people chose one very dominant primary keyword and many less important secondary keywords; these reviewers were not getting a very focussed allocation.

  • How about the code?

The code was written in python using pandas in the form of an IPython notebook.

And finally …

Thanks to all the reviewers and area chairs for their patience with the system and particular thanks to Laurent Charlin (TPMS) and the CMT support for their help getting everything uploaded.

Facebook Buys into Machine Learning

Yesterday was an exciting day. First, at the closing of the conference, it was announced that I, along with my colleague Corinna Cortes (of Google), would be one of the Program Chairs of next year’s conference. This is a really great honour. Many of the key figures in machine learning have done this job before me. It will be a lot of work, in particular because Max Welling and Zoubin Ghahramani did such a great job this year.

Then, in the evening, we had a series of industrially sponsored receptions: one from Microsoft, one from Amazon and one from Facebook. I didn’t manage to make the Microsoft reception but did attend those from Facebook and Amazon. The big news from Facebook was the announcement of a new AI research lab to be led by Yann LeCun, a long time friend and colleague in machine learning. They’ve also recruited Rob Fergus, a rising star in computer vision and deep learning.

The event was really big news, presented to a selected audience by my old friend and collaborator Joaquin Quinonero Candela. Facebook had already recruited Marc’Aurelio Ranzato, so they are really committed to this area. Mark Zuckerberg was there to endorse the proceedings, but I was really impressed by the way he let Joaquin and Yann take centre stage. It was a very exciting evening.

I’d guessed a big announcement was coming, so I climbed up onto the mezzanine level and took photos and recorded large parts from my mobile phone. They’d asked for an embargo until 8 am this morning, so I’m just posting about this now. I also cleared it with Facebook before posting; they asked that I remove the details of what Mark had to say, principally because they wanted the main focus of this to be on Yann (which I think is absolutely right … well done Yann!).

A commitment of this kind is a great endorsement for the machine learning community, but there is a lot of work now to be done to fulfil the promise it recognises. Today we (myself, James Hensman from Sheffield, and Joaquin and Tianshi from Facebook) are running a workshop on Probabilistic Models for Big Data.

Mark Zuckerberg will be attending the deep learning workshop. The methods that are going to be presented at these workshops will hopefully (in the long term) deal with some of the big issues that will face us when taking on these challenges. In the Probabilistic Models workshop we’ve already heard great talks from David Blei (Princeton), Max Welling (Amsterdam) and Zoubin Ghahramani (Cambridge), as well as some really interesting poster spotlights. This afternoon we will hear from Yoram Singer (Google), Ralf Herbrich (Amazon) and Joaquin Quinonero Candela (Facebook). I think the directions laid out at the workshop will address the challenges that face us in the coming years to fulfil the promise that Mark Zuckerberg and Facebook have seen in the field.

Very proud to be a program chair for next year’s event, wondering if we will be able to sustain the level of excitement we’ve had this year.


Update: I’ve written a piece in “the conversation” about the announcement here:

GPy: Moving from MATLAB to Python

Back in 2002 or 2003, when this paper was going through the journal revision stage, I was asked by the reviewers to provide the software that implemented the algorithm. After the initial annoyance at another job to do, I thought about it a bit and realised that not only was it a perfectly reasonable request, but that the software was probably the main output of the research. In particular, in terms of reproducibility, the implementation of the algorithm seems particularly important. As a result, when I visited Mike Jordan’s group in Berkeley in 2004, I began to write a software framework for publishing my research, based on a simple MATLAB kernel toolbox and a set of likelihoods. This led to a reissuing of the IVM software, and these toolboxes underpinned my group’s work for the next seven or eight years, going through multiple releases.

The initial plan for code release was to provide implementations of published software, but over time the code base evolved into quite a usable framework for Gaussian process modelling. The release of code proved particularly useful in spreading the ideas underlying the GP-LVM, enabling Aaron Hertzmann and collaborators to pull together a style-based inverse kinematics approach at SIGGRAPH which has proved very influential.

Even at that time it was apparent what a horrible language MATLAB was, but it was the language of choice for machine learning. Efficient data processing requires an interactive shell, and the only ‘real’ programming language with such a facility was python. I remember exploring python whilst at Berkeley with Misha Belkin, but at the time there was no practical implementation of numerical algorithms (numerical work was then done in a module called Numeric, which was later abandoned). Perhaps more importantly, there was no graph-plotting capability. As a result of that exploration, though, I did stop using perl for scripting and switched to python.

The issues with python as a platform for data analysis were actually being addressed by John D. Hunter with matplotlib. He presented at the NIPS workshop on machine learning open source software in 2008, where I was a member of the afternoon discussion panel. John Eaton, creator of Octave, was also at the workshop, although in the morning session, which I missed due to other commitments. By this time, the group’s MATLAB software was also compatible with Octave. But Octave has similar limitations to MATLAB in terms of language and also does not provide such a rich approach to GUIs. These MATLAB GUIs, whilst quite clunky in implementation, allow live demonstration of algorithms with simple interfaces, a facet that I used regularly in my talks.

In the meantime the group was based in Manchester where, in response to the developments in matplotlib and the new numerical module numpy, I opened a debate about python in machine learning with this MLO lunch-time talk. At that point I was already persuaded of the potential for python in both teaching and research, but for research in particular there was the problem of translating the legacy code. At this point scikit-learn was fairly immature, so as a test I began reimplementing portions of the netlab toolbox in python. The (rather poor) result can be found here, with comments from me at the top about issues that confuse you when you first move from MATLAB to numpy. I also went through some of these issues in my MLO lunch-time talk.

When Nicolo Fusi arrived as a summer student in 2009, he was keen not to use MATLAB for his work on eQTL studies, and since it was a new direction and the modelling wouldn’t rely too much on legacy code, I encouraged him to do this. Others in the group from that period (such as Alfredo Kalaitzis, Pei Gao and our visitor Antti Honkela) were using R as well as MATLAB, because they were focussed on biological problems and delivering code to Bioconductor, but the main part of the group doing methodological development (Carl Henrik Ek, Mauricio Alvarez, Michalis Titsias) was still using MATLAB. I encouraged Nicolo to use python for his work, rather than following the normal group practice, which was to stay within the software framework provided by the MATLAB code. By the time Nicolo returned as a PhD student in April 2010, I was agitating in the MLO group in Manchester for all our machine learning to be done in python. In the end this lobbying attempt was unsuccessful, perhaps because I moved back to Sheffield in August 2010.

In the run-up to the move, it was already clear where my mind was going. I gave this presentation at a workshop on Validation in Statistics and ML in Berlin in June 2010, where I talked about the importance of imposing a rigid structure for code (at the time we used an svn repository and a particular directory structure) when reproducible research is the aim, but also mentioned the importance of responding to individuals and new technology (such as git and python). Nicolo had introduced me to git, but we had the legacy of an SVN code base in MATLAB to deal with. So at that point the intention was there to move both in terms of research and teaching, but I don’t think I could yet see how we were going to do it. My main plan was to move the teaching first, and then follow with the research code.

On the move to Sheffield in August 2010, we had two new post-docs start (Jaakko Peltonen and James Hensman) and a new PhD student (Andreas Damianou). Jaakko and Andreas started on the MATLAB code base, but James also expressed a preference for python. Nicolo was progressing well with his python implementations, so James joined Nicolo in working in python. However, James began by working on particular methodological ideas targeted at Gaussian processes. I think this was the most difficult time in the transition. In particular, James initially was working on his own code, put together in a bespoke manner for solving a particular problem. A key moment in the transition came when James also realised the utility of a shared code base for delivering research: he set to work building a toolbox that replicated the functionality of the old code base, in particular focussing on covariance functions and sparse approximations. Now the development of the new code base had begun in earnest. Nicolo joined in, along with the new recruits: Ricardo Andrade Pacheco (PhD student from September 2011), who focussed on developing the likelihood code with the EP approximation in mind, and Nicolas Durrande (post-doc), who worked on the covariance function (kernel) code. This tipped the balance in the group, so that all the main methodological work was now happening in the new python codebase, what was to become GPy. By the time of this talk at the RADIANT Project launch meeting in October 2012, the switch-over had been pretty much completed. Since then Alan Saul has joined the team, focussing on the likelihood models and introducing the Laplace approximation, and Max Zwiessele, who first visited us from MPI Tuebingen in 2012, returned in April 2013 and has been working on the Bayesian GP-LVM implementations (with Nicolo Fusi and Andreas Damianou).

GPy has now fully replaced the old MATLAB code base as the group’s approach to delivering code implementations.

I think the hardest part of the process was the period between fully committing to the transition and not yet having a fully functional code base in python. The fact that this transition was achieved so smoothly, and has led to a code base that is far more advanced than the MATLAB code, is entirely down to those who worked on the code, but particular thanks is due to James Hensman. As soon as James became convinced of the merits of a shared research code base, he began to drive forward the development of GPy. Nicolo worked closely with James to get the early versions functional, and since then all the group’s recruits have been contributing.

Four years ago, I knew where we wanted to be, but I didn’t know how (or if) we were going to get there. But actually, I think that’s the way of most things in research. As so often, the answer comes through the inspiration and perspiration of those who work with you. The result is a new software code base, more functional than before, and more appropriate for student projects, industrial collaborators and teaching. We have already used the code base in two summer schools, and have two more scheduled. It is still very much in alpha release, but we are sharing it under a BSD license to enable both industrial and academic collaborators to contribute. We hope for a wider user base, thereby ensuring a more robust code base.


We had quite an involved discussion about what license to release source code under. The original MATLAB code base (now re-released as the GPmat toolbox on github) was under an academic-use-only license, primarily because the code was being released as papers were being submitted, and I didn't want to have to make decisions about licensing (beyond letting people see the source code for reproduction) on submission of the paper. When our code was being transferred to bioconductor (e.g. PUMA and tigre) we released it as GPL-licensed software, as required. But when it comes to developing a framework, what license to use? It does bother me that many people continue to use code without attribution; this has a highly negative effect, particularly when it comes to having to account for the group's activities. It has always irritated me that BSD-licensed code can simply be absorbed by a company, without proper acknowledgement of the debt the firm owes open source or long-term support of the code's development. However, at the end of the day, our job as academics is to allow society to push forward: to make our ideas as accessible as possible so that progress can be made. A BSD license seems the most compatible with this ideal. Add to that the fact that some of my PhD students (e.g. Nicolo Fusi, now at Microsoft) move on to companies which are unable to use GPL licenses, but can happily continue to work on BSD-licensed code, and BSD became the best choice. However, I would ask people, if they do use our code, please acknowledge our efforts: either by referencing the code base, or, if the code implements a research idea, by referencing the paper.


The MATLAB code base was originally just a way to get the group's research out to the 'user base'. But I think GPy is much more than that. Firstly, it is likely that we will be releasing our research papers with GPy as a dependency, rather than re-releasing the whole of GPy. That makes it more of a platform for research. It will also be a platform for modelling. Influenced by the probabilistic programming community, we are trying to make the GPy interface easy for modellers to use. I see all machine learning as separated into model and algorithm: the model is what you say about the data; the algorithm is how you fit (or infer) the parameters of that model. An aim for GPy is to make it easy for users to model without worrying about the algorithm. Simultaneously, we hope that ML researchers will use it as a platform to demonstrate their new algorithms, which are applicable to particular models (certainly we hope to do this). Finally, we are using GPy as a teaching tool in our series of Gaussian Process Summer Schools, Winter Schools and Road Shows. The use of python as an underlying platform means we can teach industry and academic collaborators with limited resources the fundamentals of Gaussian processes without requiring them to buy extortionate licenses for out-of-date programming paradigms.
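To make the model/algorithm separation concrete, here is a minimal numpy sketch of Gaussian process regression. This is not GPy's interface, and the function names and default parameter values are purely illustrative. The point is the division of labour: the covariance function and noise assumption are the model (what you say about the data); the Cholesky-based exact inference is the algorithm (how the posterior is computed).

```python
import numpy as np

# The "model": an RBF covariance function plus a Gaussian noise assumption.
def rbf_kernel(X1, X2, variance=1.0, lengthscale=1.0):
    sqdist = (np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :]
              - 2.0 * X1 @ X2.T)
    return variance * np.exp(-0.5 * sqdist / lengthscale**2)

# The "algorithm": exact GP inference via a Cholesky factorisation.
def gp_predict(X, y, X_star, noise=0.1):
    K = rbf_kernel(X, X) + noise * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # K^{-1} y
    K_star = rbf_kernel(X, X_star)
    mean = K_star.T @ alpha                               # posterior mean
    v = np.linalg.solve(L, K_star)
    cov = rbf_kernel(X_star, X_star) + noise * np.eye(len(X_star)) - v.T @ v
    return mean, np.diag(cov)                             # mean and variance

# Fit noisy observations of sin(x) and predict at a new input.
X = np.linspace(-3, 3, 20)[:, None]
y = np.sin(X).ravel()
mean, var = gp_predict(X, y, np.array([[0.0]]))
```

Swapping the kernel, the likelihood, or the approximation scheme changes one half of this split without touching the other, which is exactly the flexibility a modelling platform needs to expose.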

EPSRC College of Reviewers

Yesterday, I resigned from the EPSRC college of reviewers.

The EPSRC is the national funding body in the UK for Engineering and Physical Sciences. The college of reviewers is responsible for reading grant proposals and making recommendations to panels with regard to the quality, feasibility and utility of the underlying work.

The EPSRC aims to fund international quality science, but the college of reviewers is a national body of researchers. Allocation of proposals to reviewers is done within the EPSRC.

In 2012 I was asked to review only one proposal; in 2013 so far I have received none. The average number of review requests per college member in 2012 was 2.7.

It’s not that I haven’t been doing any proposal reviewing over the last 18 months, I’ve reviewed for the Dutch research councils, the EU, the Academy of Finland, the National Science Foundation (USA), BBSRC, MRC and I’m contracted as part of a team to provide a major review for the Canadian Institute for Advanced Research. I’d estimate that I’ve reviewed around 20 international applications in the area of machine learning and computational biology across this period.

I resigned from the EPSRC College of Reviewers because I don't wish people to read the list of names in the college and assume that, as a member, I am active in assessing the quality of work the EPSRC is funding. Looking back over the last ten years, all the proposals I have reviewed have come from a very small body of researchers, all of whom, I know, nominate me as a reviewer.

Each submitted proposal nominates a number of reviewers who the proposers consider to be appropriate. The EPSRC chooses one of these nominated reviewers, and selects the remainder from the wider college.

Over a 12 year period as an academic, I have never been selected to review an EPSRC proposal unless I’ve been nominated by the proposers to do so.

So in many senses this resignation changes nothing, but by resigning from the college I’m highlighting the fact that if you do think I am appropriate for reviewing your proposal, then the only way it will happen is if you nominate me.

Machine Learning as Engine Design

Originally posted on 5th July 2012:


Just back from ICML 2012 this week, as usual it was good to see everyone and as ever it was difficult to keep track of all the innovations across what is a very diverse field.


One talk that was submitted as a paper but presented across the conference has triggered this blog entry. The talk was popular amongst many attendees and seemed to reflect concerns some researchers in the field have. However, it didn't reflect my own perspective; if it had done, I wouldn't have been at the conference in the first place. It was Kiri Wagstaff's "Machine Learning that Matters". Kiri made some good points and presented some evidence to back up her claims. Her main argument was that ICML doesn't care enough about applications. A comment from one audience member also seemed to indicate that he (the audience member) felt we (the ICML conference) don't do enough to engage with application areas.


As a researcher who spends a large amount of time working directly in application areas, I must admit to feeling a little peeved by what Kiri said. Whilst she characterized a portion of machine learning papers correctly, I believe that these papers are in a minority, and I suspect an even larger proportion of such papers are submitted to the conference and then rightly rejected. The reason I attend machine learning conferences is that there are a large number of researchers who are active in trying to make a real difference in important application areas.


It was ironic that the speaker before Kiri was Yann LeCun, who presented tailored machine learning hardware for real-time segmentation of video images. Rather than focussing on this aspect of Yann's work, Kiri chose to mention the league table he maintains for the MNIST digits (something Yann does as a community service; I think he last published a paper on MNIST over 10 years ago). She presented the community's use of the MNIST digits and UCI data sets as indicative that we don't care about 'real applications'. Kiri found that 23% of ICML papers present results only on UCI and/or simulated data. However, given that ICML is a mainly methodological conference, I do not find this surprising at all. I did find it odd that Kiri focussed only on classification as an application. I attended no talks on 'classical' classification at the conference (i.e. discriminative classification of vectorial data without missing labels, missing inputs or any other distinguishing features). I see that very much as yesterday's problem. An up-to-date critique might have focussed on deep learning, latent topic models, compressed sensing or Bayesian nonparametrics (and I'm sure we could make similar claims about those methods too).


However, even if the talk had focussed on more contemporary machine learning, I would still find Kiri's criticisms misdirected. I'd like to use an analogy to explain my thinking. Machine learning is very much like the early days of engine design. From steam engines to jet engines, the aim of an engine is to convert heat into kinetic energy. The aim of machine learning algorithms is to convert data into actionable knowledge. Engine designers are concerned with aspects like power-to-weight ratio. They test these features through prescribed tests (such as maximum power output). These tests can only be indicative. For example, high power output for an internal combustion engine (as measured on a 'rig') doesn't give you the 'drivability' of that engine in a family car. That is much more difficult to gauge and involves user tests. The MNIST data is like a power test: it is indicative only (perhaps a necessary condition rather than a sufficient one), but it is still informative.


My own research is very much inspired by applications. I spend a large portion of my time working in computational biology and have always been inspired by challenges in computer vision. In my analogy this is like a particular engine designer being inspired by the challenges of aircraft engine design. Kiri's talk seemed to be proposing that designing engines isn't in itself worthwhile unless we simultaneously build the airplane around our engine. I'd think of such a system as a demonstrator for the engine, and building demonstrators is a very worthwhile endeavour (many early computers, such as the Manchester Baby, were built as demonstrators of an important component, such as memory systems). In my group we do try to do this: we make our methods available immediately on submission, often via MATLAB, and later in a (slightly!) more polished fashion through software such as Bioconductor. These are our demonstrators (of varying quality). However, I'd argue that in many cases the necessary characteristics of the engine being designed (power, efficiency, weight for engines; speed, accuracy, storage for ML) are so well understood that you don't need the demonstrator. This is why I think Kiri's criticisms, whilst well meaning, were misdirected. They were equivalent to walking into an engine development laboratory and shouting at the engineers for not producing finished cars. An engine development lab's success is determined by the demand for its engines. Our success is determined by the demand for our methods, which is high and sustained. It is absolutely true that we could do more to explain our engines to our user community, but we are a relatively small field (in terms of numbers: 700 at our international conference) and the burden of understanding our engines will also, necessarily, fall upon our potential users.


I know that you can find poorly motivated and undervalidated models in machine learning, but I try to avoid those papers. I would have preferred a presentation that focussed on successful machine learning work that makes a serious difference in the real world. I hope that is a characteristic of my work, but I know it is a characteristic of many of my colleagues'.

Personal Thoughts on Computer Science Degrees

Originally posted on my uSpace blog on May 20th 2012:


Computer science has evolved as a subject. The early days of computer science focused on languages and compilers. Ease of programming and reusability of code were key objectives. Ensuring the quality of the resulting software for reliability and safety concerns was a cornerstone of computer science research. The early days of the field were dominated by breakthroughs in these areas. The needs of modern Computer Science are very different. The very success of Computer Science has meant that computing is now pervasive, the consequence is vast realms of data. Automatically extracting knowledge from this data should now be the main goal of modern Computer Science.


I cannot say when I became a computer scientist, as my undergraduate degree was in Mechanical Engineering, and whilst my PhD in Machine Learning was in a Computer Science department, I was isolated there in terms of my research field and more closely related in my research to colleagues in Engineering and Physics.

My first postdoctoral position was with Microsoft, but I programmed only in MATLAB, and my second postdoctoral position was as a Computer Science Lecturer in Sheffield, but I still felt somewhat out of place. At the time Sheffield was rare in that it had a speech processing group based in Computer Science, and there was also a large and successful language processing group which overlapped more with my research.

My initial focus on arriving in the department was refining my own knowledge of computer science: I taught Networks, and two classic books on Operating Systems (Tanenbaum) and Compilers (Aho, Sethi and Ullman) still sit on my shelf. I still thought of myself as an Engineer in the classical sense, only one who was interested in data. A Data Engineer, if you will. My contemporaries in machine learning research were mostly from Physics, Engineering or Mathematics backgrounds. There were more Psychologists than Computer Scientists.


Machine Learning Today

Today the situation has very much changed. I am a convinced computer scientist. So what has happened in the intervening 10 years? Did I dedicate myself so much to the teaching of Networks and the reading of Operating Systems and Compilers that I forsook my original research field? No; in fact it turned out that I didn't have to conquer the mountain of computer science, the mountain chose to come to me. Today machine learning is at the core of Computer Science. The big four US institutions in Computer Science (MIT, Stanford, Berkeley and CMU) all have very large groups in machine learning. In all of these cases the groups have grown significantly since 1996, when I started my PhD. Whilst MIT was active then, that was mostly through its Brain Sciences unit. CMU already had a very large group, and has since moved machine learning to a separate department, but Berkeley and Stanford were also yet to grow such large groups.

At the first NIPS (the premier machine learning conference) I attended there were no industry stands, in recent years we have had stands from Google, Yahoo!, Microsoft as well as a range of financial institutions and even airline booking companies.

Machine learning is now at the core of computer science. Of my current cohort of PhD students 4 out of 5 have computer science undergraduate degrees. The fifth has an undergraduate statistics education. The quality of these students is excellent. They combine mathematical strengths with an excellent technical understanding of their machines and what they are physically capable of. They are trained in programming, but they use their programming like they use their ability to write English, as a means to an end: not as the end in itself.

Modern Computer Science

The research effort to standardize machines, simplify languages, encourage code reuse and formalize software specification has to a large extent been successful. Whilst it is not the case that everyone can program (as was envisaged by the inventors of BASIC), today you do not need a degree in Computer Science to implement very complex systems. You can capitalize on the years of experience integrated in modern high-level programming languages and their associated software libraries. There is a large demand for programmers who can combine PHP with MySQL to provide a complex retail interface, but there are many individuals who implement these systems without ever having attended a university. Indeed, the prevailing wisdom seems to be that such skills (implementation of a well-specified system in standard programming languages) will be subject to a worldwide labour market, causing the UK IT workforce to be undercut on cost by countries with a large proportion of highly educated people where labour costs are lower (e.g. India). In the UK (and more widely in Europe and the US) our target should not be to produce graduates who can only implement software to known specifications. What, then, is the role of computer science in a developed country like the UK? What graduates should we be producing?

Historically we would have hoped to produce graduates with a developed understanding of operating systems and compilers, graphics, and perhaps formal methods. We would have produced people who could have designed the next generation of computer languages, people who could conceive and design protocols for the internet. That would have been our target. These goals are still at the computer science core, but today we need to be much more ambitious. The success of the preceding generations has meant that computer science is pervasive, far beyond the technical domain where it previously dominated. The internet and social networking mean that computers are affecting our everyday lives in ways that were only imagined even 15 years ago.

This presents a major technical challenge. In the past, some of the most advanced uses of computers were in other technical fields (engine management systems, control etc.). Those fields had technical expertise which they were able to bring to bear; the software engineer provided a service role to the engineering experts. Today, there are very few technical experts in the vast realms of data that computers have facilitated. Even in technical domains such as Formula 1, the amount of data being produced means that the technical expertise required is in data analysis rather than engineering systems. To a large extent we made this mess, and now it is time for us to clean it up.

A modern Computer Science degree must retain a very large component of analysis of unstructured data. What do I mean by unstructured data? Data that is not well curated: it was not collected with a particular question in mind, it is just there. Traditional statistics worked by designing an experiment: carefully deciding what to measure in order to answer a specific question. The need of modern data analysis is very different. We need to be able to assimilate unstructured data sources and translate them into a system that can be queried. It may not be clear what questions we will be able to answer with the resulting system, and we are likely to have only minimal control over what variables are measured.

Examples of data of this type include: health records of patients and associated genomic information; connectivity data, such as links between friends in social networks or links between documents such as web pages; purchase information for customers of large supermarkets or web retailers; and preference information for consumers of films. These data sets will contain millions or billions of records and will be 'uncurated' in the sense that their size means no single individual will have been through all the data, consistently removing outliers or dealing with missing or corrupted values. The data may also not be in a traditional vectorial form: it could be images, text or recorded speech. We need algorithms that deal with these challenges automatically.

The Next Generation of Graduates

To address this situation we need to train a generation of computer scientists to deal with these challenges. The fundamentals they will require are language processing (extracting information from unstructured documents); speech processing (extracting information from informal meetings, conversations or direct speech interaction with a computer); bioinformatics (extracting information from biological experiments or medical tests); and computer vision (extracting information from images or videos). Sitting at the core of each of these areas is machine learning: the art of processing and assimilating a range of unstructured data sources into a single model.

These areas must form the basis of a modern computer science course if we are to provide added value over what will be achievable by farming out software engineering. At the core of each of the areas outlined above is a deep mathematical understanding. Mathematics is more important to computer science than at any time previously. The algorithms used in all the areas described above are derived and implemented through mathematics. The modern computer science education needs to be based on solid principles: probability and logic. These areas are at the core of mathematics and it is the responsibility of computer science to drive forward research in them. A modern computer science graduate must be fluent in programming languages and systems, not as an end in themselves, but as a means to an end: the construction of complex interacting systems for extracting knowledge from data. Teaching programming alone is like teaching someone how to write without giving them something to say.

It must be the target of a leading Computer Science undergraduate course to produce students who can address these challenges. All modern Computer Science courses should have a significant basis of data analysis beginning from the very first year. Computer science graduates should understand text and language processing: extracting meaning from documents. Rapid evolution of language through internet media requires flexible algorithms for decoding meaning. The large volume of text on the internet presents major analysis challenges, but the wider challenge of understanding video (images and speech, gesture and emotion recognition) has hardly had its surface scratched.

A depth of understanding of probabilistic modelling, language modelling and signal processing must be built up over the second and third years of the degree. Our best graduates would have at least a four-year education where they are given an opportunity in their final year to put the ideas they have learned into practice through thesis work on cutting-edge research questions. Our graduates must be adaptable; they need to be able to build on the analysis skills we equip them with to address new challenges. If Computer Science doesn't produce graduates in this mould, no other field will.


Many grand visions of Computer Science have largely been realized, perhaps to a greater degree than was even anticipated: a computer on every desktop has become a computer in every pocket; social interfaces through the internet connect across the world and through generations; international commerce is conducted with the click of a mouse. All of these successes have created an enormous challenge in the processing of uncurated data. Computer Science research has developed the first generation of tools to address these challenges; it is time for Computer Science departments to produce the first generation of graduates who will wield these tools with confidence.

The computer has changed the world, and I believe now it is time for the world to change the way we study computers.