First of all, I want to highlight something that seems to go missing in the current hype. There is a lot more to machine learning than deep convolutional neural nets and recurrent nets.
I don’t want to diminish the contribution of those methods, there have been some really cool breakthroughs in these domains. However, they rely on very large data and massive computational capabilities. However, many problems involve less data and even when there is large data it can be the case that the computational demands of conv-nets and recurrent networks are too high (just like they were 20 years ago when these methods were first proposed!).
This means that for many companies most of the machine learning is of the pre-deep learning era. Indeed, most of it could also be easily characterised as computational statistical methodologies. Google ad-matching, Facebook news feed ranking, Amazon product recommendations are all based on nicely engineered systems that exploit techniques like logistic regression, decision trees etc.. For smaller data systems (tens of thousands of data), if you are doing classification you’ll still be better off with kernel methods (like SVM) or even k-nearest neighbor. And I can’t go forward without saying that if you want uncertainty then I think Gaussian processes are still one of the dominant methods.
Whence Deep Learning?
Deep learning has had remarkable success on domains that had been considered very challenging like object recognition in images and speech recognition. We’re also starting to see breakthroughs in language, e.g. automatic translation. However, most of the breakthroughs in the methodologies for deep learning that are achieving these results are not new, they are founded on methods from the 1990s and the idea of ‘back propagation’. However, these methods really start to work well when you have massive data (many millions) and massive compute (large systems of GPUs). Unless you have this or you can leverage someone else’s network trained using this, you probably will get more out of a classical classifier like an SVM.
History of ML and Society
With some exceptions the main history of machine learning and society up until now has occurred through our interactions with industry.
Industry and machine learning goes back to the days of Bell labs, and many of the most significant figures of machine learning worked there. Unfortunately, just around the time I was graduating Bell Labs went into some sort of collapse driven by the break up of AT&T and (I believe) a new short termist perspective. Actually, even though Bell Labs was at the heart of ML, they didn’t really behave like a company, more like a University, they didn’t show up at conferences (like NIPS) with a booth. They just employed clever people who said clever things.
At around the time of Bell Labs’s machine learning demise (they still exist today … they just aren’t significant players in ML) something interesting was happening at Microsoft. NIPS had discovered probabilistic graphical models and variational methods, and driven by David Heckerman and Eric Horvitz Microsoft began to take an active interest in the NIPS community. Microsoft, at the time, were seen as an innovate by acquisition company. They were under threat of anti-trust by the Clinton administration. They’d grown so big that their acquisitions of innovative companies were coming under scrutiny as anti-competitive. I think partially in strategic response to this, from the late 1990s they began to expand “Microsoft Research” (MSR) with new labs in Cambridge,1 Bejing and a massive presence in Redmond. Many key ML researchers moved to MSR and they integrated with the ML community through internships, faculty awards and PhD studentships. They became a sort of Bell Labs with money, and influence, and products. If you wanted to use machine learning to influence things, that was the way to go. In those early days of the internet, there wasn’t enough data, nor was our understanding of implementation good enough, to scale machine learning to web levels, the web was in its early stages. Microsoft placed the first bet, but first to really capitalise on the community were the people that turned up next.
David “Pablo” Cohn used to be a fixture at machine learning conferences. One day he joined Google. At the time, we all used Google for our search engines, but none of us figured that ML had anything to contribute to Google. Fortunately, Google knew differently and they began hoovering up many of the minds of machine learning. These people were integrated with engineering teams and internally they had a massive influence on products. For those of us outside they also influenced the scale of our thinking about our methods. I remember Urs Hölzle had a keynote talk at NIPS 2005. He told us that the future of machine learning was massive data. That at future conferences nothing would be accepted unless it was run at enormous scale. And this is what Google did, they took our techniques and “MAP-reducerized” them to scale them up. It wasn’t so much Urs’s talk that changed things, but more the stories we heard of what was being done inside Google. Our students interned there and we dined with our friends who were working there.
Speaking to those involved at the time, the improvements in performance were worth literally billions (and I mean literally in the literal sense here). The leaning models were normally fairly simple: approaches such as logistic regression and related families. An important domain was ‘ad-click prediction’. It brings in a lot of money: users pay per click and it’s important to list ads they are interested in. These predictions need to be done for every user and the models are retrained every day. Speed is the watchword: fast learning, fast prediction. Consumer characterisation is the aim.
The Quiet Revolution
Google’s interest and exploitation of ML drove what I think of as “the quiet revolution”. It started somewhere around 2003 and is still ongoing. This was and is machine learning as a commodity, integrated into the systems. Most of the decisions made about you online are still driven by these methods. These are the drivers of companies’ quests to better characterise you. The models stay simple, so that they can run fast, but they collect more and more data about you to improve the performance of the prediction. This revolution is still ongoing, because it’s still the main way that these companies make money. Where Google led others have followed.
As well as Google, Yahoo! engaged the community early on but by 2012 they announced large scale layoffs, causing many of their researchers to leave. Particularly notable was the large scale acquisition of a lot of their NYC lab by Microsoft. By this time MSR had opened a lab in New England2 and extended it to NYC to bring these people in.
It Gets Noisy: Imagenet and Deep Learning
2012 was a big year for machine learning. The NIPS conference was in Lake Tahoe for the first time, in spitting distance of Silicon Valley. Amazon hosted a reception and announced their intention to join Google in recruiting in machine learning, but now things also got loud.
At that 2012 edition of NIPS Geoff Hinton’s group at Toronto announced a breakthrough in object recognition in images. With his students Alex Krizhevsky and Ilya Sutskever they rapidly formed a new company to exploit the ideas, and were bought by Google. At around the same time Facebook recruited Marc’Aurelio Ranzato from Google and he was instrumental in their decision to launch a new AI lab through hiring in Yann LeCun (launched at NIPS in 2013). Finally in early 2014 Google bought DeepMind: a London startup founded in 2011 for 400 million pounds.
Now everyone was interested, but actually things had evolved. On the back of success some of those in deep learning were no longer only trying to address the challenges of large interconnected data. They were speaking a new language for the community, the language of AI.