In Goethe’s poem The Sorcerer’s Apprentice, a young sorcerer learns one of their master’s spells and deploys it to assist in his chores. Unfortunately, he cannot control it. The poem was popularised by Paul Dukas’s musical composition, in 1940 Disney used the composition in the film Fantasia. Mickey Mouse plays the role of the hapless apprentice who deploys the spell but cannot control the results.
When it comes to our software systems, the same thing is happening. The Harvard Law professor, Jonathan Zittrain calls the phenomenon intellectual debt. In intellectual debt, like the sorcerer’s apprentice, a software system is created but it cannot be explained or controlled by its creator. The phenomenon comes from the difficulty of building and maintaining large software systems: the complexity of the whole is too much for any individual to understand, so it is decomposed into parts. Each part is constructed by a smaller team. The approach is known as separation of concerns, but it has the unfortunate side effect that no individual understands how the whole system works. When this goes wrong, the effects can be devastating. We saw this in the recent Horizon scandal, where neither the Post Office or Fujitsu were able to control the accounting system they had deployed, and we saw it when Facebook’s systems were manipulated to spread misinformation in the 2016 US election.
When Disney’s Fantasia was released, the philosopher Karl Popper was in exile in New Zealand. He wrote The Open Society and its Enemies when his hometown of Vienna was under Nazi rule. The book defends the political system of liberal democracy against totalitarianism. For Popper, the open society is one characterised by institutions that can engage in the pragmatic pursuit of solutions to social and political problems. Those institutions are underpinned by professions: lawyers, the accountants, civil administrators. To Popper these “piecemeal social engineers” are the pragmatic solution to how a society solves political and social problems.
In 2019 Mark Zuckerberg wrote an op-ed in the Washington Post calling for regulation of social media. He was repeating the realisation of Goethe’s apprentice, he had released a technology he couldn’t control. In Goethe’s poem, the master returns, “Besen, besen! Seid’s gewesen” he calls, and order is restored, but back in the real world the role of the master is played by Popper’s open society. Unfortunately, those institutions have been undermined by the very spell that these modern apprentices have cast. The book, the letter, the ledger, each of these has been supplanted in our modern information infrastructure by the computer. The modern scribes are software engineers, and their guilds are the big tech companies. Facebook’s motto was to “move fast and break things”. Their software engineers have done precisely that and the apprentice has robbed the master of his powers. This is a desperate situation, and it’s getting worse. The latest to reprise the apprentice’s role are Sam Altman and OpenAI who dream of “general intelligence” solutions to societal problems which OpenAI will develop, deploy, and control. Popper worried about the threat of totalitarianism to our open societies, today’s threat is a form of information totalitarianism which emerges from the way these companies undermine our institutions.
So, what to do? If we value the open society, we must expose these modern apprentices to scrutiny. Open development processes are critical here, Fujitsu would never have got away with their claims of system robustness for Horizon if the software they were using was open source. We also need to re-empower the professions, equipping them with the resources they need to have a critical understanding of these technologies. That involves redesigning the interface between these systems and the humans that empowers civil administrators to query how they are functioning. This is a mammoth task. But recent technological developments, such as code generation from large language models, offer a route to delivery.
The open society is characterised by institutions that collaborate with each otherin the pragmatic pursuit of solutions to social problems. The large tech companies that have thrived because of the open society are now putting that ecosystem in peril. For the open society to survive it needs to embrace open development practices that enable Popper’s piecemeal social engineers to come back together and chant “Besen, besen! Seid’s gewesen.” Before it is too late for the master to step in and deal with the mess the apprentice has made.
With thanks to Sylvie Delacroix and Jess Montgomery for comments and suggestions.
]]>Forty five years later, at a Parisian trilateral “International Cooperation on artificial intelligence” summit involving China, US and France, I watched as a US state department official made a calculated snub to their Chinese counterpart, declaring that the US would not co-operate with China on AI while it remained authoritarian and unpicking the fruits of a nearly half-century of diplomacy.
My son was given a brown stuffed bear when he was born, and by the time he was four years old we lived in a Victorian house in Sheffield, one big enough to easily accommodate my Italian in-laws on their visits to see him. The house had a large garage that had served as a billiard room, and behind that was a pile of garbage dumped by previous occupants as part of an unfinished remodelling. It contained glass, nails, lead painted boards and goodness knows what else.
On one of my in-laws’ visits, when my son headed towards the back of the garage my father-in-law, called out to him in Italian.
"Don't go there there's a boogeyman behind the garage."
This put me in a difficult position, I also didn't want my son to go behind the garage, but I’d prefer him to have understood the real dangers rather than giving him boogeyman nightmares.
This story reminds me of the UK AI Summit and a new era of what I’ve started thinking of as boogeyman diplomacy. The way the summit is focussing on long term risks of “frontier models” draws to mind popular notions of Terminator robots as portrayed by Schwarzenegger. This is what I think of as the AI boogeyman.
Just like behind my garage, there are many dangers to AI, and some of those dangers are unknown. In the rubbish there could have been rats and asbestos. Similarly with today’s information technologies we already face challenges around power asymmetries, where a few companies are controlling access to information. Competition authorities on both sides of the Atlantic are addressing these. We also face challenges around automated decision making: the European GDPR is an attempt to regulate how and when algorithmic decision making can be used. The AI boogeyman is frightening extreme conflation of these two challenges, just as an asbestos breathing rat would be a very disturbing conflation of the challenges behind my garage.
But just as the right approach to dealing with an asbestos breathing rat would be to deal with the rats and the asbestos separately, so the AI boogeyman can be dealt with by dealing with both power asymmetries and automated decision making. In both these areas many of us have already been supporting governments in developing new regulation to address these risks. But by combining them, boogeyman diplomacy runs the serious risk of highlighting the problems in a way that distracts us from the real dangers we face.
However, like Mao Zedong and Richard Nixon’s panda diplomacy, boogeyman diplomacy does promise a potential benefit. It is far easier to agree on an exchange of pandas than it is to bridge cultural and political divides between two great nations. Similarly, it will be far easier for China and the United States to agree on measures for imagined risks than it will be for them to agree on the significant challenges their different societies are facing. Ever since the calculated snub in that trilateral summit in Paris the US and China have not talked on these important issues.
As Connected by Data’s open letter from yesterday shows, the UK Prime Minister has sacrificed a great deal of credibility with technical experts, civil society, and citizens across the world with his cry of AI boogeyman. But if the United States and China can be brought back around the table to begin talking again, it may be that the sacrifice is worth it.
Happy Halloween from the Boogeyman.
]]>In the classic film Monty Python and the Holy Grail, John Cleese, as Sir Lancelot the Brave, finds a note – a plea from a prisoner of Swamp Castle – beseeching the discoverer to help them escape an unwanted marriage. Responding to this distress call, Sir Lancelot singlehandedly storms Swamp Castle, slaying the guards, terrorising the wedding guests, and fighting his way to the Tall Tower. There, instead of the expected damsel in distress, cruelly imprisoned, Sir Lancelot is surprised to find a wayward groom, Prince Herbert, who sent the note after an argument with his father.
The United Kingdom is considered an international leader in science, and a pioneer in the provision of science advice. The Government has well-established structures for accessing scientific expertise in emergencies through its network of science advisers, the Scientific Advisory Group for Emergencies (SAGE) and departmental advisory committees, including the Science for Pandemic Influenza Groups that provide advice on covid-19 modelling and behavioural science. Together, these structures might call to mind a different Arthurian vision, evoking the works of Thomas Malory: the scientist as Merlin, giving wise counsel to Arthur and honing the Government’s decision-making through deep knowledge of the scientific arts.
Scientists are concerned citizens, and it is perhaps with this vision of adviser as trusted arbiter that many researchers entered into public and policy debates surrounding covid-19. While pursuing the wise Merlin, however, efforts to advise government can easily drift towards Monty Python’s Lancelot. Confident in his knowledge of castle-storming, his individual dedication and his skills in damsel-rescuing, Sir Lancelot enters the fray with only a partial understanding of the challenges and circumstances at hand.
Science policy has long sought ways of bridging the gaps between scientists and policymakers, helping each understand the ways in which evidence can inform policymaking. The UK’s response to the covid-19 pandemic has highlighted the importance of this work, and the long-standing cultural issues that make this mission so challenging. Driven by experiment and theory, scientists often seek definitive answers to a particular question, with each new study prompting more questions and illuminating areas for investigation that stretch into the future. In contrast, policy advice is often rooted to a moment in time. Events cannot wait for definitive scientific understanding. Instead, policymakers need access to high-quality advice that provides actionable insights, based on current understandings, while acknowledging areas of uncertainty.
But what constitutes the best current understanding? Any policy issue can be viewed through multiple lenses: the scientific principles at hand, the economic implications, public acceptability of potential responses, the values embedded in those responses, or operational considerations in policy delivery, amongst others. Each of these lenses is important in considering the evidence available to inform responses to the covid-19 pandemic: the complexity of the pandemic, our relatively limited understanding of the virus, and the practical difficulties of implementing public health policy demand a range of expertise.
These complex and uncertain challenges require a multi-disciplinary response.
The unprecedented nature of the pandemic has spurred multiple efforts to bring research expertise to bear on covid-19 policymaking. These have included a call for rapid assistance from the modelling community, a group providing rapid review and literature analysis, and an independently convened SAGE group. Our own experience is of another of these efforts – the Royal Society-convened DELVE Initiative.
In the early stage of the pandemic, as many governments struggled to implement policies that held back the first wave of infections, data scientists began to explore how advanced analytics could complement traditional forms of government science advice. Chaired by the President of the Royal Society and feeding into SAGE, the DELVE Initiative set out to analyse data from countries at a more advanced stage of the covid-19 pandemic, using these insights to inform UK policy responses.
Data science for ‘real world’ policy questions can only be done effectively with access to domain expertise: extracting insights from data is important, but applying these insights to policy development requires the contextual understanding brought by domain experts, including those embedded in the policymaking process. Mapping this onto the attack of Swamp Castle, Professor Lancelot is better advised to consult his colleagues at the Round Table before charging the Tall Tower – while Lancelot has the tools to break down the doors, other knights may know the residents, the routes in, and why the distress call was sent.
Bridging the ‘supply chain of ideas’ between researchers and policymakers has been core to DELVE’s approach. The breadth of covid-19’s impacts and potential policy responses has required that DELVE make connections across public health, economics, behavioural science, immunology, and more, and the value of collaboration can be seen across DELVE reports. For example, in an early report on testing and tracing systems: researchers from public health brought a wealth of insights about how to detect and manage disease outbreaks in communities, data scientists translated this to analysis that quantified the compliance rates needed to ensure testing and tracing efforts would be successful, and economists contextualised this with evidence about what interventions would encourage individuals and organisations to comply with a test, trace, isolate regime. DELVE’s remit became interdisciplinary by default, with data the focal point around which to convene domain experts.
This type of evidence synthesis would traditionally rely on collaborations developed with the luxury of time – time to understand how different disciplines frame an issue and to identify the different types of evidence that might be policy-relevant. Using data as a convenor has offered a short-cut through these discussions, by creating a common focal point from which different domain experts can explore their ideas. Arthur brought his knights together through the convening power of a sword, Excalibur; DELVE convened its scientists through data.
Despite its importance in enabling rapid evidence synthesis, in pursuing this ambitious research agenda, a consistent barrier to further action has been access to data. Labouring the parallels to Arthurian legend, in many cases relevant data was so difficult to identify and access it may as well have been mythical. But in practice it was the idea that the data might exist and be accessible, rather its actual availability, that was sufficient to convene expertise through DELVE.
Barriers to government data sharing – whether resulting from perceived legal issues, lack of capability in government, or technical barriers to data use – were well-characterised before the pandemic, but have been thrown into sharp relief in recent months. Where successful data sharing arrangements have been established to support the covid-19 response, these have tended to rely on pre-existing relationships between data scientists and policymakers that foster shared understandings of how to use data in research and policy. In some ways, other disciplines have already learned this lesson – sustained engagement between government and academia has played a central role in major policy changes across government, from the smoking ban to the Climate Change Act. If data science is to find a role in policymaking, it will need to build on these experiences.
A new model of open data science, which capitalises on the power of data in convening multidisciplinary exchanges, will be vital, if we are to realise the potential of data science for research and policy. By building a community of researchers at the interface of data science and other disciplines, there is an opportunity to create exciting new research agendas that both advance data science methods and generate new insights for research and policy. Such a community would embed multidisciplinary engagement in its research culture, developing relationships and building capacity for rapid response to future policy challenges. It would seek to create a governance environment in which data can be used safely and rapidly, while ensuring that data analysis tools are made widely available, with clear information about how to use them.
This open data science model will be central to the work of the Accelerate Programme for Scientific Discovery, a new initiative from the Cambridge Computer Lab that will pursue research at the interface of machine learning and the sciences. By operating outside the traditional boundaries that separate disciplines, open data science could bridge the gap between ‘data science’ and the domains that would benefit from its tools and techniques, enabling ideas to spread rapidly and ultimately advancing scientific discovery for the benefit of society.
]]>The current coronavirus epidemic has broad to very public attention the frequent need for policy decisions, most dramatically the announcement of the UK government’s lockdown restrictions on 23 March 2020, to be made quickly, and on the basis of incomplete information.
The government and the general public receive scientific advice related to the coronavirus from a range of sources. One of these is the Royal Society’s DELVE initiative, whose particular focus on data-driven methods (as opposed to modelling, which provides another very important perspective). An immediate difficulty with providing such advice is that there is a great deal that we do not know about the virus and how it is transmitted, and the kind of information we would like takes a long time to obtain, especially if one wishes to apply the usual standards of scientific rigour. But the virus will not wait for us, so governments are forced to make decisions despite many important facts being unknown. This situation is known as ‘finite horizon’ – there is a fixed time after which if you have not acted, then the decision is effectively made for you.
What constitutes good scientific advice when there is a finite horizon? With complete information, one can offer a cost-benefit analysis of the various options between which the government must choose. But when the information is only partial, probability comes into the picture, and this becomes a risk-benefit analysis.
This can be illustrated with a simple example. Suppose you are on holiday and you visit an island. At the end of the day you need to catch a ferry back to your hotel, which leaves at around 11pm, but you are not quite sure of the precise time. You are running a bit late, and as you arrive at the terminal, you see that a ferry is just about to leave. You do not speak the language and do not have time to check whether it is the right ferry.
One part of deciding what to do will be to weigh up the costs and benefits of the possible outcomes. If you do not get on the ferry, you will probably spend the night on the island, so you will consider what that would be like: whether it would be safe, whether there is anywhere to stay, etc. If you do get on the ferry, then you may end up back at your hotel, which is the ideal outcome, but you may perhaps be taken somewhere else, where you will arrive, late at night, not speaking the language.
Another part of the decision is assessing probabilities. For instance, if there are very few ferries, then you may judge that it is likely that the departing ferry is the right one, but if there are many, then you will be less sure. This assessment will have an important effect on your decision: the more likely the ferry is to be the right one, the lower the risk of ending up in the wrong place, and therefore the more sensible it is to jump on.
DELVE comprises a highly diverse group of experts from public health, epidemiology, economics, immunology, mathematics, statistics, machine learning and psychology. The group aims to offer the best policy advice it can, on the time frames where that advice is useful. Its reports assimilate evidence from these different fields and advise on the implications of this evidence for policy-making’. Where scientific consensus is available it can give this advice on the basis of that consensus. But where there is no consensus, it can still offer advice based on our best understanding of the outcomes of any interventions that might be made. Often those outcomes will be uncertain: in such situations we try to assess the risks of the interventions and the likelihood of potential benefits, in order to offer the best possible advice given the evidence we have.
For example, DELVE recently reported on the use of face masks, concluding that more widespread face mask use can help reduce the risk of onward transmission from asymptomatic individuals and could play an important role in situations where physical distancing is not practical. The scientific evidence that this would make a significant difference to transmission rates is by no means conclusive, but neither is the scientific evidence that it would not make a significant difference. DELVE judged that there was a reasonable probability that masks would make a difference. For example, if they reduce the transmission rate, then they could potentially reduce the length of the lockdown, and each day of the lockdown is estimated to cost £2.4bn. On that basis, despite the uncertainties, we felt confident in our advice.
A few general points are worth bearing in mind.
It may be that as decisions-makers accumulate more evidence, they will come to realize that another decision is better: given how rapidly the evidence is changing, it is important to be flexible and ready to change our minds.
In some situations maintaining the status quo until more evidence accumulates is a good default decision. But in an emergency such as the current pandemic, inaction can have serious adverse consequences, so its risks and potential benefits should be assessed along with those of possible interventions.
While science has a very important part to play in decision-making, important aspects of decision-making are not scientific. For instance, science may be able to tell us that there is approximately a 1% chance of a certain catastrophe occurring, but science alone cannot determine how much money it would be worth spending on interventions to reduce that chance to 0.5%.
In summary, because there are many uncertainties associated with the situation in which we find ourselves, we will often have to make decisions in the absence of a scientific consensus. That need not prevent us from making good decisions. We should look at the evidence in order to assess the probabilities of the possible outcomes of any intervention and the likely costs and benefits of those outcomes. This will not always make the decisions easy, but it will give them a much better foundation.
Note. The authors are members of the DELVE Working Group, but are writing in their personal capacity. A good starting point for further reading is a blog-post by LSE Economists Matteo Galizzi, Benno Guenther, Maddie Quinlan and Jet Sanders: https://coronavirusandtheeconomy.com/question/risk-time-covid-19-what-do-we-know-andnot-know
]]>I’ve set the scene below by imagining the application process. Imagine a 1970s style office just off Whitehall (with apologies to John Cleese and Michael Palin).
Whitehall where our action takes place
Minister: Good morning. I’m sorry to have kept you waiting, but I’m afraid my model has become rather sillier recently, and so it takes me rather longer to train it. (sits at desk) Now then, what was it again?
Mr. Pudey: Well sir, I have a silly model and I’d like to obtain a Government grant to help me develop it.
Minister: I see. May I see your silly model?
Mr. Pudey: Yes, certainly, yes.
(Shows laptop with PyTorch open in Jupyter notebook)
Minister: That’s it, is it?
Mr. Pudey: Yes, that’s it, yes.
Minister: It’s not particularly silly, is it? I mean, the first layer isn’t silly at all and second layer merely does a forward pass with a partial recursion every alternate iteration.
Mr. Pudey: Yes, but I think that with Government backing I could make it very silly.
Minister: (rising) Mr. Pudey, (he walks about behind the desk) the very real problem is one of money. I’m afraid that the Ministry of Silly Models is no longer getting the kind of support it needs. You see there’s Defense, Social Security, Health, Housing, Education, Silly Models … they’re all supposed to get the same. But last year, the Government spent less on the Ministry of Silly Models than it did on National Defense! Now we get £348,000,000 a year, which is supposed to be spent on all our available products. (he sits down) Coffee?
Mr. Pudey: Yes please.
(Context: many of you may not yet have heard of the UK’s National Smart Assistant project, “Two-Lumps”, now widely deployed across our civil service)
Minister: Two-Lumps, would you bring us in two coffees please?
Disembodied Voice: Yes, Mr. Teabag.
Minister: … Out of its mind. Now the Japanese have a model which inverts each of its layers over its head and back again with every single iteration. While the Israelis… here’s the coffee.
(Enter a robot with tray with two cups on it. Its motor system has been trained via reinforcement learning through an evolutionary search strategy. This gives it a particularly jerky movement which means that by the time it reaches the minister there is no coffee left in the cups. The minister has a quick look in the cups, and smiles understandingly.)
Minister: Thank you - lovely. (robot exits still carrying tray and cups) You’re really interested in silly models, aren’t you?
Mr. Pudey: Oh rather. Yes.
Minister: Well take a look at this, then.
(He produces a projector from beneath his desk already spooled up and plugged in. He flicks a switch and it beams onto the opposite wall. The film shows a sequence of six old-fashioned silly modelers at the first Connectionist Models Summer School. Geoff Hinton, David Rumelhart, Michael Mozer, Terry Sejnowski, Mike Jordan, John Hopfield. The film is old silent-movie type, scratchy, jerky and 8mm quality. All the participants wear 1980’s type costume. One model has a huge input layer, one has no hidden units. Cut back to office. The minister hurls the projector away. Along with papers and everything else on his desk. He leans forward.)
Minister: Now Mr. Pudey. I’m not going to mince words with you. I’m going to offer you a Research Fellowship on the Anglo-French Silly Model.
Mr. Pudey: La Modèle Futile?
(Cut to two Frenchmen, wearing striped jerseys and berets, standing in a field with a third man who is entirely covered by a sheet.)
First Frenchman: Bonjour … et maintenant … comme d’habitude, au sujet du Le Modèle Commun. Et maintenant, je vous presente, encore une fois, mon ami, le pouf célèbre, Yann LeCun. (he removes his moustache and sticks it onto the other Frenchman)
Second Frenchman: Merci, mon petit chou-chou Leon Bottou. Et maintenant avec les couplages causales à droite, et les couplages dirigés au gauche, et maintenant l’Anglais-Française Modèle Futile, et voilà.
]]>The technology that has driven the revolution in AI is machine learning. And when it comes to capitalising on the new generation of deployed machine learning solutions there are practical difficulties we must address.
In 1987 the economist Robert Solow quipped "You can see the computer age everywehere but in the productivity statistics". Thirty years later, we could equally apply that quip to the era of artificial intelligence.
From my perspective, the current era is merely the continuation of the information revolution. A revolution that was triggered by the wide availability of the silicon chip. But whether we are in the midst of a new revolution, or this is just the continuation of an existing revolution, it feels important to characterize the challenges of deploying our innovation and consider what the solutions may be.
There is no doubt that new technologies based around machine learning have opened opportunities to create new businesses. When home computers were introduced there were also new opportunities in software publishing, computer games and a magazine industry around it. The Solow paradox arose because despite this visible activity these innovations take time to percolate through to existing businesses.
Understanding how to make the best use of new technology takes time. I call this approach, brownfield innovation. In the construction industry, a brownfield site is land with pre-existing infrastructure on it, whereas a greenfield site is where construction can start afresh.
The term brownfield innovation arises by analogy. Brownfield innovation is when you are innovating in a space where there is pre-existing infrastructure. This is the challenge of introducing artificial intelligence to existing businesses. Greenfield innovation, is innovating in areas where no pre-existing infrastructure exists.
One way we can make it easier to benefit from machine learning in both greenfield and brownfield innovation is to better characterise the steps of machine learning systems design. Just as software systems design required new thinking, so does machine learning systems design.
In this post we characterise the process for machine learning systems design, convering some of the issues we face, with the 3D process: Decomposition1, Data and Deployment.
We will consider each component in turn, although there is interplace between components. Particularly between the task decomposition and the data availability. We will first outline the decomposition challenge.
One of the most succesful machine learning approaches has been supervised learning. So we will mainly focus on supervised learning because this is also, arguably, the technology that is best understood within machine learning.
Machine learning is not magical pixie dust, we cannot simply automate all decisions through data. We are constrained by our data (see below) and the models we use. Machine learning models are relatively simple function mappings that include characteristics such as smoothness. With some famous exceptions, e.g. speech and image data, inputs are constrained in the form of vectors and the model consists of a mathematically well behaved function. This means that some careful thought has to be put in to the right sub-process to automate with machine learning. This is the challenge of decomposition of the machine learning system.
Any repetitive task is a candidate for automation, but many of the repetitive tasks we perform as humans are more complex than any individual algorithm can replace. The selection of which task to automate becomes critical and has downstream effects on our overall system design.
The machine learning systems design process calls for separating a complex task into decomposable separate entities. A process we can think of as pigeonholing.
Some aspects to take into account are
The representation necessary for the second aspect may involve massaging of the problem: feature selection or adaptation. It may also involve filtering out exception cases (perhaps through a pre-classification).
All else being equal, we’d like to keep our models simple and interpretable. If we can convert a complex mapping to a linear mapping through clever selection of sub-tasks and features this is a big win.
For example, Facebook have feature engineers, individuals whose main role is to design features they think might be useful for one of their tasks (e.g. newsfeed ranking, or ad matching). Facebook have a training/testing pipeline called FBLearner. Facebook have predefined the sub-tasks they are interested in, and they are tightly connected to their business model.
It is easier for Facebook to do this because their business model is heavily focused on user interaction. A challenge for companies that have a more diversified portfolio of activities driving their business is the identification of the most appropriate sub-task. A potential solution to feature and model selection is known as AutoML [1]. Or we can think of it as using Machine Learning to assist Machine Learning. It’s also called meta-learning. Learning about learning. The input to the ML algorithm is a machine learning task, the output is a proposed model to solve the task.
One trap that is easy to fall in is too much emphasis on the type of model we have deployed rather than the appropriateness of the task decomposition we have chosen.
Recommendation: Conditioned on task decomposition, we should automate the process of model improvement. Model updates should not be discussed in management meetings, they should be deployed and updated as a matter of course. Further details below on model deployment, but model updating needs to be considered at design time. This is the domain of AutoML.
The answer to the question which comes first, the chicken or the egg is simple, they co-evolve [2]. Similarly, when we place components together in a complex machine learning system, they will tend to co-evolve and compensate for one another.
To form modern decision making systems, many components are interlinked. We decompose our complex decision making into individual tasks, but the performance of each component is dependent on those upstream of it.
This naturally leads to co-evolution of systems, upstream errors can be compensated by downstream corrections.
To embrace this characteristic, end-to-end training could be considered. Why train individual systems by localized metrics when we can just optimize, globally, end-to-end for final system performance? End to end training can lead to improvements in performance, but it could also damage our systems decomposability, its interpretability, and perhaps its adaptability.
The less human interpretable our systems are, the harder they are to adapt to different circumstances or diagnose when there's a challenge. The trade-off between interpretability and performance is a constant tension which we should always retain in our minds when performing our system design.
It is difficult to overstate the importance of data. It is half of the equation for machine learning, but is often utterly neglected. We can speculate that there are two reasons for this. Firstly, data cleaning is perceived as tedious. It doesn’t seem to consist of the same intellectual challenges that are inherent in constructing complex mathematical models and implementing them in code. Secondly, data cleaning is highly complex, it requires a deep understanding of how machine learning systems operate and good intuitions about the data itself, the domain from which data is drawn (e.g. Supply Chain) and what downstream problems might be caused by poor data quality.
A consequence of these two reasons, data cleaning seems difficult to formulate into a readily teachable set of principles. As a result it is heavily neglected in courses on machine learning and data science. Despite data being half the equation, most University courses spend little to no time on its challenges.
However, these challenges aren't new, they are merely taking a different form.
Anecdotally, talking to data modelling scientists, most say they spend 80% of their time acquiring and cleaning data. The “software crisis” was the inability to deliver software solutions due to increasing complexity of implementation. There was no single shot solution for the software crisis, it involved better practice (scrum, test orientated development, sprints, code review), improved programming paradigms (object orientated, functional) and better tools (CVS, then SVN, then git).
The major cause of the software crisis is that the machines have become several orders of magnitude more powerful! To put it quite bluntly: as long as there were no machines, programming was no problem at all; when we had a few weak computers, programming became a mild problem, and now we have gigantic computers, programming has become an equally gigantic problem.
Edsger Dijkstra (1930-2002), The Humble Programmer
In the late sixties early software programmers made note of the increasing costs of software development and termed the challenges associated with it as the "Software Crisis". Edsger Dijkstra referred to the crisis in his 1972 Turing Award winner's address.
Modern software is driven by data. That is the essence of machine learning. We can therefore see the software crisis as merely the first wave of a (perhaps) larger phenomenon, "the data crisis".
We can update Dijkstra's quote for the modern era.
The major cause of the data crisis is that machines have become more interconnected than ever before. Data access is therefore cheap, but data quality is often poor. What we need is cheap high quality data. That implies that we develop processes for improving and verifying data quality that are efficient.
There would seem to be two ways for improving efficiency. Firstly, we should not duplicate work. Secondly, where possible we should automate work.
The quantity of modern data, and the lack of attention paid to data as it is initially "laid down" and the costs of data cleaning are bringing about a crisis in data-driven decision making. This crisis is at the core of the challenge of technical debt in machine learning [3].
Just as with software, the crisis is most correctly addressed by 'scaling' the manner in which we process our data. Duplication of work occurs because the value of data cleaning is not correctly recognised in management decision making processes. Automation of work is increasingly possible through techniques in "artificial intelligence", but this will also require better management of the data science pipeline so that data about data science (meta-data science) can be correctly assimilated and processed. The Alan Turing institute has a program focussed on this area, AI for Data Analytics.
Data is the new software, and the data crisis is already upon us. It is driven by the cost of cleaning data, the paucity of tools for monitoring and maintaining our deployments, the provenance of our models (e.g. with respect to the data they’re trained on).
Three principal changes need to occur in response. They are cultural and infrastructural.
First of all, to excel in data driven decision making we need to move from a software first paradigm to a data first paradigm. That means refocusing on data as the product. Software is the intermediary to producing the data, and its quality standards must be maintained, but not at the expense of the data we are producing. Data cleaning and maintenance need to be prized as highly as software debugging and maintenance. Instead of software as a service, we should refocus around data as a service. This first change is a cultural change in which our teams think about their outputs in terms of data. Instead of decomposing our systems around the software components, we need to decompose them around the data generating and consuming components.2 Software first is only an intermediate step on the way to be coming data first. It is a necessary, but not a sufficient condition for efficient machine learning systems design and deployment. We must move from software orientated architecture to a data orientated architecture.
Secondly, we need to improve our language around data quality. We cannot assess the costs of improving data quality unless we generate a language around what data quality means. Data Readiness Levels3 are an assessment of data quality that is based on the usage to which data is put.
Thirdly, we need to improve our mental model of the separation of data science from applied science. A common trap in current thinking around data is to see data science (and data engineering, data preparation) as a sub-set of the software engineer’s or applied scientist’s skill set. As a result we recruit and deploy the wrong type of resource. Data preparation and question formulation is superficially similar to both because of the need for programming skills, but the day to day problems faced are very different.
Recommendation: Build a shared understanding of the language of data readiness levels for use in planning documents, the costing of data cleaning, and the benefits of reusing cleaned data.
One analogy I find helpful for understanding the depth of change we need is the following. Imagine as a software engineer, you find a USB stick on the ground. And for some reason you know that on that USB stick is a particular API call that will enable you to make a significant positive difference on a business problem. However, you also know on that USB stick there is potentially malicious code. The most secure thing to do would be to not introduce this code into your production system. But what if your manager told you to do so, how would you go about incorporating this code base?
The answer is very carefully. You would have to engage in a process more akin to debugging than regular software engineering. As you understood the code base, for your work to be reproducible, you should be documenting it, not just what you discovered, but how you discovered it. In the end, you typically find a single API call that is the one that most benefits your system. But more thought has been placed into this line of code than any line of code you have written before.
Even then, when your API code is introduced into your production system, it needs to be deployed in an environment that monitors it. We cannot rely on an individual’s decision making to ensure the quality of all our systems. We need to create an environment that includes quality controls, checks and bounds, tests, all designed to ensure that assumptions made about this foreign code base are remaining valid.
This situation is akin to what we are doing when we incorporate data in our production systems. When we are consuming data from others, we cannot assume that it has been produced in alignment with our goals for our own systems. Worst case, it may have been adversarialy produced. A further challenge is that data is dynamic. So, in effect, the code on the USB stick is evolving over time.
Anecdotally, resolving a machine learning challenge requires 80% of the resource to be focused on the data and perhaps 20% to be focused on the model. But many companies are too keen to employ machine learning engineers who focus on the models, not the data.
A reservoir of data has more value if the data is consumable. The data crisis can only be addressed if we focus on outputs rather than inputs.
For a data first architecture we need to clean our data at source, rather than individually cleaning data for each task. This involves a shift of focus from our inputs to our outputs. We should provide data streams that are consumable by many teams without purification.
Recommendation: We need to share best practice around data deployment across our teams. We should make best use of our processes where applicable, but we need to develop them to become data first organizations. Data needs to be cleaned at output not at input.
Once the decomposition is understood, the data is sourced and the models are created, the model code needs to be deployed.
To extend our USB stick analogy further, how would we deploy that code if we thought it was likely to evolve in production? This is what datadoes. We cannot assume that the conditions under which we trained our model will be retained as we move forward, indeed the only constant we have is change.
This means that when any data dependent model is deployed into production, it requires continuous monitoring to ensure the assumptions of design have not been invalidated. Software changes are qualified through testing, in particular a regression test ensures that existing functionality is not broken by change. Since data is continually evolving, machine learning systems require 'continual regression testing': oversight by systems that ensure their existing functionality has not been broken as the world evolves around them. An approach we refer to as progression testing. Unfortunately, standards around ML model deployment yet been developed. The modern world of continuous deployment does rely on testing, but it does not recognize the continuous evolution of the world around us.
Recommendation: We establish best practice around model deployment. We need to shift our culture from standing up a software service, to standing up a data as a service. Data as a Service would involve continual monitoring of our deployed models in production. This would be regulated by 'hypervisor' systems4 that understand the context in which models are deployed and recognize when circumstance has changed and models need retraining or restructuring.
Recommendation: We should consider a major re-architecting of systems around our services. In particular we should scope the use of a streaming architecture (such as Apache Kafka) that ensures data persistence and enables asynchronous operation of our systems.5 This would enable the provision of QC streams, and real time dash boards as well as hypervisors.
Importantly a streaming architecture implies the services we build are stateless, internal state is deployed on streams alongside external state. This allows for rapid assessment of other services' data.
Machine learning has risen to prominence as an approach to scaling our activities. For us to continue to automate in the manner we have over the last two decades, we need to make more use of computer-based automation. Machine learning is allowing us to automate processes that were out of reach before.
We operate in a technologically evolving environment. Machine learning is becoming a key component in our decision making capabilities. But the evolving nature of data driven systems means that new approaches to model deployment are necessary. We have characterized three parts of the machine learning systems design process. Decomposition of the problem into separate tasks that are addressable with a machine learning solution. Collection and curation of appropriate data. Verificaction of data quality through data readiness levels. Using progression testing in our deployments. Continuously updating models as appropriate to ensure performance and quality is maintained.
[1] M. Feurer, A. Klein, K. Eggensperger, J. T. Springenberg, M. Blum, and F. Hutter, “Efficient and robust automated machine learning,” in Advances in neural information processing systems.
[2] K. R. Popper, Conjectures and refutations: The growth of scientific knowledge. London: Routledge, 1963.
[3] D. Sculley et al., “Hidden technical debt in machine learning systems,” in Advances in neural information processing systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, Eds. Curran Associates, Inc., 2015, pp. 2503–2511.
[4] N. D. Lawrence, “Data readiness levels,” arXiv, May 2017.
In earlier versions of the Three D process I've referred to this as the design stage, but decomposition feels more appropriate for what the stage involves and that preserves the word design for the overall process of machine learning systems design.↩
This is related to challenges of machine learning and technical debt [3], although we are trying to frame the solution here rather than the problem.↩
Data Readiness Levels [4] are an attempt to develop a language around data quality that can bridge the gap between technical solutions and decision makers such as managers and project planners. The are inspired by Technology Readiness Levels which attempt to quantify the readiness of technologies for deployment.↩
Emulation, or surrogate modelling, is one very promising approach to forming such a hypervisor. Emulators are models we fit to other models, often simulations, but the could also be other machine learning modles. These models operate at the meta-level, not on the systems directly. This means they can be used to model how the sub-systems interact. As well as emulators we shoulc consider real time dash boards, anomaly detection, mutlivariate analysis, data visualization and classical statistical approaches for hypervision of our deployed systems.↩
These approaches are one area of focus for my own team's reasearch. A data first architecture is a prerequisite for efficient deployment of machine learning systems.↩
How are we making computers do the things we used to associated only with humans? Have we made a breakthrough in understanding human intelligence?
While recent achievements might give the sense that the answer is yes, the short answer is that we are nowhere near. All we’ve achieved for the moment is a breakthrough in emulating intelligence.
Artificial intelligence is a badly defined term. Successful deployments of intelligent systems are common, but normally they are redefined to be non-intelligent. My favourite example is Watt’s governor. Immortalised in the arms of the statue of “Science” on the Holborn viaduct in London, Watt’s governor automatically regulated the speed of a steam engine, closing the inlet valve progressively as the engine ran faster. It did the job that an intelligent operator used to have to do, but few today would describe it as “artificial intelligence”.
A more recent example comes from the middle of the last century. A hundred years ago computers were human beings, often female, who conducted repetitive mathematical tasks for the creation of mathematical tables such as logarithms. Our modern digital computers were originally called automatic computers to reflect the fact that the intelligence of these human operators had been automated. But despite the efficiency with which they perform these tasks, very few think of their mobile phones or computers as intelligent.
Norbert Wiener launched last century’s first wave of interest in emulation of intelligence with his book “Cybernetics”. The great modern success that stemmed from that work is the modern engineering discipline of Automatic Control. The technology that allows fighter jets to fly. These ideas came out of the Second World War, when researchers explored the use of radar (automated sensing) and automatic computation for decryption of military codes (automated decision making). Post war a body of researchers, including Alan Turing, were seeing the potential for electronic emulation of what had up until then been the preserve of an animallian nervous system.
So what of the modern revolution? Is this any different? Are we finally encroaching on the quintessential nature of human intelligence? Or is there a last bastion of our minds that remains out of reach from this latest technical wave?
My answer has two parts. Firstly, current technology is a long way from emulating all aspects of human intelligence: there are a number of technological breakthroughs that remain before we crack the fundamental nature of human intelligence. Secondly, and perhaps more controversially, I believe that there are aspects of human intelligence that we will never be able to emulate, a preserve that remains uniquely ours.
Before we explore these ideas though, we first need to give some sense of current technology.
Recent breakthroughs in artificial intelligence are being driven by advances in machine learning. Or more precisely, a sub domain of that field known as deep learning. So what is deep learning? Well, machine learning algorithms proceed in the following way. They observe data, often from humans, and they attempt to emulate that data creation process with a mathematical function. Let’s explain that in a different way. Whichever activities we are interested in emulating, we acquire data about them. Data is simply a set of numbers representing the activity. It might include location, or number representing past behaviour. Once we have the behaviour represented in a set of numbers, then we can emulate it mathematically.
Different mathematical functions have different characteristics, in trigonometry we learnt about sine and cosine functions. Those are functions that have a period. They repeat over time. They are useful for emulating behaviour that repeats itself, perhaps behaviour that reflects day/night cycles. But these functions on their own are too simple to emulate human behaviour in all its richness. So how do we make things more complex? In practice we can add functions together, scale them up and scale them down. All these things are done. Deep learning refers to the practice of creating a new function by feeding one function in to another. So we create the output of a function, feed it in to a new function, and take the output of that as our answer. In maths this is known as composition of functions. The advantage is that we can generate much more complex functions from simpler functions. The resulting functions allow us to take an image as an input and produce an output that includes numbers representing what the objects or people are in that image. Doing this accurately is achieved through composition of mathematical functions, otherwise known as deep learning.
To describe this process, I sometimes use the following analogy. Think of a pinball machine with rows of pins. A plunger is used to fire a ball into the top of the machine. Think of each row of pins as a function. The input to the function is the position of the ball along the left to right axis of the machine at the top. The output of the function is the position of the ball as it leaves each row of pins. Composition of functions is simply feeding the output of one row of pins into the next. If we do this multiple times, then although the machine has multiple functions in it, by feeding one function into another, then the entire machine is in itself representing a single function. Just a more complex function than that that any individual row of pins can represent. Our modern machine learning systems are just like this machine. Except they take very high dimensional inputs. The pinball is only represented by its left to right position, a one dimensional function. A real learning machine will often represent multiple dimensions, like playing pinball in a hyperspace.
So the intelligence we’ve created is just a (complex) mathematical function. But how do we make this function match what humans do? We use more maths. In fact we use another mathematical function. To avoid confusion let’s call the function represented by our pinball machine the prediction function. In the pinball machine, we can move the pins around, and any configuration of pins leads to a different answer for our functions. So how do we find the correct configuration? For that we use a separate function known as the objective function.
The objective function measures the dissimilarity between the output of our prediction function, and data we have from the humans. The process of learning in these machines is the process of moving the pins around in the pinball machine to make the prediction more similar to the data. The quality of any given pin configuration is assessed by quantifying the differences between what the humans have done, and what the machine is predicting using the objective function. Different objective functions are used for different tasks, but objective functions are typically much simpler than prediction functions.
So, we have some background. And some understanding of the challenges for the designer of machine learning algorithms. You need data, you need a prediction function, and you need an objective function. Choice of prediction function and objective function are in the hands of the algorithm designer. Deep learning is a way of making very complex, and flexible, classes of prediction functions that deal well with the type of data we get from speech, written language and images.
But what if the designer gets the choice of prediction function wrong? Or the choice of the objective function wrong? Or what if they choose the right class of prediction function (such as a convolutional neural network, used for classifying images) but get the location of all those pins in the machine wrong?
This is a major challenge for our current generation of AI solutions. They are fragile. They are sensitive to incorrect choices and when they fail, they fail catastrophically.
To understand why, let’s take a step back from artificial intelligence, and consider natural intelligence. Or even more generally, let’s consider the contrast between an artificial system and an natural system. The key difference between the two is that artificial systems are designed whereas natural systems are evolved.
Systems design is a major component of all Engineering disciplines. The details differ, but there is a single common theme: achieve your objective with the minimal use of resources to do the job. That provides efficiency. The engineering designer imagines a solution that requires the minimal set of components to achieve the result. A water pump has one route through the pump. That minimises the number of components needed. Redundancy is introduced only in safety critical systems, such as aircraft control systems. Students of biology, however, will be aware that in nature system-redundancy is everywhere. Redundancy leads to robustness. For an organism to survive in an evolving environment it must first be robust, then it can consider how to be efficient. Indeed, organisms that evolve to be too efficient at a particular task, like those that occupy a niche environment, are particularly vulnerable to extinction.
So it is with natural vs artificial intelligences. Any natural intelligence that was not robust to changes in its external environment would not survive, and therefore not reproduce. In contrast the artificial intelligences we produce are designed to be efficient at one specific task: control, computation, playing chess. They are fragile.
The first criterion of a natural intelligence is don’t fail, not because it has a will or intent of its own, but because if it had failed it wouldn’t have stood the test of time. It would no longer exist. In contrast, the mantra for artificial systems is to be more efficient. Our artificial systems are often given a single objective (in machine learning it is encoded in a mathematical function) and they aim to achieve that objective efficiently. These are different characteristics. Even if we wanted to incorporate don’t fail in some form, it is difficult to design for. To design for “don’t fail”, you have to consider every which way in which things can go wrong, if you miss one you fail. These cases are sometimes called corner cases. But in a real, uncontrolled environment, almost everything is a corner. It is difficult to imagine everything that can happen. This is why most of our automated systems operate in controlled environments, for example in a factory, or on a set of rails. Deploying automated systems in an uncontrolled environment requires a different approach to systems design. One that accounts for uncertainty in the environment and is robust to unforeseen circumstances.
The systems we produce today only work well when their tasks are pigeonholed, bounded in some way. To achieve robust artificial intelligences we need new approaches to both the design of the individual components, and the combination of components within our AI systems. We need to deal with uncertainty and increase robustness. Today, it is easy to make a fool of an artificial intelligent agent, technology needs to address the challenge of the uncertain environment to achieve robust intelligences.
However, even if we find technological solutions for these challenges, it may be that the essence of human intelligence remains out of reach. It may be that the most quintessential element of our intelligence is defined by limitations. Limitations that computers have never experienced.
That characteristic is our limited ability to communicate, in particular our limited ability to communicate versus our ability to compute. Not compute in terms of working out arithmetic or solving sudoku problems, but the amount of compute that underpins our higher thoughts. The computation that our neurons are doing.
It is estimated that to simulate the human mind it would require a supercomputer operating as fast as the world’s fastest. The UK Met Office’s computer, that is used for simulating weather and climate across the world, might do it. At the time of writing its the 11th fastest computer in the world. Much faster than a typical desktop. But while a typical computer seems to be performing less operations than our brain does. A typical computer can communicate much faster than we can.
Claude Shannon developed the idea of information theory: the mathematics of information. He defined the amount of information we gain when we learn the result of a coin toss as a “bit” of information. A typical computer can communicate with another computer with a billion bits of information per second. Equivalent to a billion coin tosses per second. So how does this compare to us? Well, we can also estimate the amount of information in the English language. Shannon estimated that the average English word contains around 12 bits of information, twelve coin tosses, this means our verbal communication rates are only around the order of tens to hundreds of bits per second. Computers communicate tens of millions of times faster than us, in relative terms we are constrained to a bit of pocket money, while computers are corporate billionaires.
Our intelligence is not an island, it interacts, it infers the goals or intent of others, it predicts our own actions and how we will respond to others. We are social animals, and together we form a communal intelligence that characterises our species. For intelligence to be communal, our ideas to be shared somehow. We need to overcome this bandwidth limitation. The ability to share and collaborate, despite such constrained ability to communicate, characterises us. We must intellectually commune with one another. We cannot communicate all of what we saw, or the details of how we are about to react. Instead, we need a shared understanding. One that allows us to infer each other’s intent through context and a common sense of humanity. This characteristic is so strong that we anthropomorphise any object with which we interact. We apply moods to our cars, our cats, our environment. We seed the weather, volcanoes, trees with intent. Our desire to communicate renders us intellectually animist.
But our limited bandwidth doesn’t constrain us in our imaginations. Our consciousness, our sense of self, allows us to play out different scenarios. To internally observe how our self interacts with others. To learn from an internal simulation of the wider world. Empathy allows us to understand others’ likely responses without having the full detail of their mental state. We can infer their perspective. Self-awareness also allows us to understand our own likely future responses, to look forward in time, play out a scenario. Our brains contain a sense of self and a sense of others. Because our communication cannot be complete it is both contextual and cultural. When driving a car in the UK a flash of the lights at a junction concedes the right of way and invites another road user to proceed, whereas in Italy, the same flash asserts the right of way and warns another road user to remain.
Our main intelligence is our social intelligence, intelligence that is dedicated to overcoming our bandwidth limitation. We are individually complex, but as a society we rely on shared behaviours and oversimplification of our selves to remain coherent.
This nugget of our intelligence seems impossible for a computer to recreate directly, because it is a consequence of our evolutionary history. The computer, on the other hand, was born into a world of data, of high bandwidth communication. It was not there through the genesis of our minds and the cognitive compromises we made are lost to time. To be a truly human intelligence you need to have shared that journey with us.
Of course, none of this prevents us emulating those aspects of human intelligence that we observe in humans. We can form those emulations based on data. But even if an artificial intelligence can emulate humans to a high degree of accuracy it is a different type of intelligence. It is not constrained in the way human intelligence is. You may ask does it matter? Well, it is certainly important to us in many domains that there’s a human pulling the strings. Even in pure commerce it matters: the narrative story behind a product is often as important as the product itself. Handmade goods attract a price premium over factory made. Or alternatively in entertainment: people pay more to go to a live concert than for streaming music over the internet. People will also pay more to go to see a play in the theatre rather than a movie in the cinema.
In many respects I object to the use of the term Artificial Intelligence. It is poorly defined and means different things to different people. But there is one way in which the term is very accurate. The term artificial is appropriate in the same way we can describe a plastic plant as an artificial plant. It is often difficult to pick out from afar whether a plant is artificial or not. A plastic plant can fulfil many of the functions of a natural plant, and plastic plants are more convenient. But they can never replace natural plants.
In the same way, our natural intelligence is an evolved thing of beauty, a consequence of our limitations. Limitations which don’t apply to artificial intelligences and can only be emulated through artificial means. Our natural intelligence, just like our natural landscapes, should be treasured and can never be fully replaced.
]]>TL;DR Variation is important in decision making.
It should be remembered that just as the Declaration of Independence promises the pursuit of happiness rather than happiness itself, so the iterative scientific model building process offers only the pursuit of the perfect model. For even when we feel we have carried the model building process to a conclusion some new initiative may make further improvement possible. Fortunately to be useful a model does not have to be perfect.
George E. Box (1979)
In the book “Justice: What’s The Right Thing to Do?” (Sandel 2010) Michael Sandel aims to help us answer questions about how to do the right thing by giving some context and background in moral philosophy. Sandel is a philosopher based at Harvard University who is reknowned for his popular treatments of the subject. He starts by illustrating decision making through the ‘trolley’ problem.
Trolley problems seem to be a mainstay of moral philosophy: in the original variant (Foot 1967) there is a runaway trolley1 rolling at speed down a track, it is approaching a set of points beyond which is a group of five workers. The workers do not have time to get out of the way of the trolley. They will be killed. You have the opportunity to switch the track, saving the five workers, but the trolley will run onto another track, killing a single worker. You could shout a warning, but somehow the workers wouldn’t hear you, or if they did hear they wouldn’t have time to get out of the way. The moral question is should you switch the track? You would kill one worker, but save five. Apparently most of us think that in that situation we’d pull the switch.2
Sandel starts his book by looking at utilitarianism. Utilitarianism says that we should make decisions that lead to the the greatest benefit to humanity. Under utility theory, Sandel explains, we pull the lever because we expect fewer people will be unhappy when one person dies than when five people die. Of course we can debate that, and we will come back to that in a moment.
In moral philosophy utilitarianism is the idea that when considering a choice of actions we should choose the one that brings the most benefit to the population.
Utilitarianism is a philosophical variation of utility theory. It is due to Jeremy Bentham and John Stuart Mill. For their time the ideas in utilitarianism are very advanced. But it needs to be borne in mind that that time was rather more limited mathematically and scientifically than we are today. Jeremy Bentham predates Laplace and was born only 15 years after Newton. In Jeremy Bentham’s time differential calculus was only taught in advanced degrees in the most sophisticated universities. Probability was only just beginning to be understood.
Utilitarianism is an example of a philosophy that suggests that the result of a decision can be evaluated mathematically. By quantifying the benefits of the decision and weighing them against the downsides, the idea is that decision making can be rendered mathematical. Such mathematical functions are known as utility functions. The basic principle is that we can encode the good and the bad in a mathematical formula.
Since Bentham and Stuart Mill, the idea of framing our actions by formulating them mathematically has become inextricably interlinked with the domain of mathematical modelling. One way we have of quantifying the value of a mathematical model is its ‘goodness of fit’. In particular, we validate our models by evaluating how well the model does at predicting a quantity of interest. Sometimes that prediction can be associated with a direct monetary gain (such as in a stock market) or sometimes that prediction is associated with an improvement in the quality or length of life (such as in the diagnosis of disease).
Consider a model that tries to predict, on the basis of a test, whether an individual has cancer or not. Most, if not all, clinical tests are unreliable, i.e. a positive test does not always mean that the disease is present. The portion of positive tests where disease are present is known as the sensitivity of the test. It is a count of those people that are correctly diagnosed with the disease divided by the number who really have the disease. A test with 50% sensitivity will diagnose 50% of the people who really have the disease with the disease.
You might think it is a good thing to have a highly sensitive test, and you’d be right. But actually it is easy to increase the sensitivity of a test. Imagine a lazy lab technician, who rather than carrying out the test, just answers ‘disease present’ for every sample sent. The test will now have 100% sensitivity! Unfortunately all those who don’t have disease will also be categorised as having the disease. The technician has rendered the test non-specific. To account for a test that is weak in this way we also measure the quality of a test by its specificity: the number of patients who don’t have the disease that are correctly identified. For the lazy technician the specificity is 0%.
For clinical tests there is a trade off between sensitivity and specificity. A test that is highly sensitive (captures all the diseased population) will often be non specific (it will incorrectly diagnose a large portion of the non-diseased population).
When considering which tests to use we necessarily have to decide what the effect of wrong decisions is on indvidual people. If a test isn’t very sensitive, then patients will be given the all clear, when they actually have a disease. But if a test is highly sensitive but non-specific, many patients will be faced with unnecessary worry that they have the disease when none is present. The number of people involved will also depend on the base rates of the disease in the population. There will also be monetary cost associated with processing the tests and those who have a positive result.
We want to know the utility of the test, and Utility Theory is one approach we have to assessing that. If we make a decision that a particular sensitivity and specificity is acceptable for a test, if we make judgments of the numbers of worrying and unnecessary trips to the doctors that patients must endure then we are placing value on these aspects of life.
To make a single decision we must weigh up these different aspects against one another. We can then decide which outcomes we value the most. Even if we don’t write it all down explicitly, by our actions we can see that we are making decisions that define a utility function: a mathematical function that weights how we value the competing factors. Utility functions are so common that they receive many different names: objective function, cost function, error function, fitness function, risk function.3 The motivation for any given utility function can be different but the end result is the same, a mathematical formula by which we can quantify the relative value of a set of predictions or decisions.
The more we desire accountability for our decisions, the more that we require that they are rationalised, the greater our tendency to be explicit about the mathematical form of our utility function. The more explicit we become the easier it is to see how much value we are associating with aspects of our lives that we might instinctively feel cannot be priced. Aspects that cannot be bought or sold: our health or our life itself.
How can we reconcile this drive for accountability with our natural instinct that we should not be placing a price on such things?
The first thing we have to do is acknowledge the application of ideas about utility is much more sophisticated than is sometimes acknowledged in popular treatments of the subject. In Sandel’s book he criticises utilitarianism by framing it in the context of the historical arguments given in a specific era. But he doesn’t place those writings in context. The utility functions of Jeremy Bentham and John Stuart Mill provide a nascent theory, but there are a number of ways in which those ideas can be updated given the knowledge we have gained in the intervening 250 years.
In his book, Sandel follows up his initial trolley example with a more complex one.4 This time you are on a bridge. There is a rail line going under the bridge and there are workers on the line. This time there is a trolley going towards the five workers on the far side of the bridge. The workers will be killed by the runaway trolley. There is a man on the bridge with you, but there is nothing else to hand. As before you could shout, but the workers can’t get out of the way in time.
Apparently your only opportunity is to push the other man off the bridge onto the rails, thereby deflecting the trolley and saving the five men working on the line. You could jump onto the line yourself to deflect the trolley, but the man on the bridge is heavier than you, and you would be too light to deflect the trolley. Your only chance is to push the other man, the ‘fat man’, off.
It seems most people would choose not to push the fat man off. Sandel finds this difficult to explain from the perspective of utilitarianism and becomes dismissive of utilitarianism as a result.
We may be doing Sandel an injustice but it seems there is a significant over simplification in this scenario. In an attempt to construct a counter example to illustrate a point, there is a key aspect to the second example, which is not so present in the first. It’s an aspect that humans are faced with every day, but we are not entirely sure how we deal with it. It is a domain where humans can outperform machines, that aspect is uncertainty.
In an effort to define a situation that primes us for decision making the trolley scenarios tell stories. The story is far more complex in the second example than the first. We are unable to help but imagine the scenario. We might picture that the bridge is a viaduct. We might picture that it is made of sandstone. We might imagine that the fat man is wearing a blue shirt. We can picture what it means to push him off. One of Sandel’s arguments is that the difference between the two scenarios is that people are unable to kill the a man through direct interaction, and this aspect isn’t explained by utilitarianism. That is certainly a plausible explanation, but we don’t have to leave utilitarianism behind so quickly.
Bentham was not aware of Darwin’s principle of natural selection, he predates Darwin. Bentham believed that we should take an action if it maximised the happiness of the population (thereby minimising pain). So his utility function was the sum of the happiness of the population. For its time this is an interesting concept, but Bentham’s disciple, John Stuart Mill, struggled with the idea that a night of debauchery was the same form of happiness as, for example, staying in and reading a good book.5 He argued that these different happinesses should not be valued the same. This is a clear flaw in the idea of a single utility function which Bentham had proposed. However, we should probably be allowed to update Bentham’s ideas a little in the light of what we’ve discovered since.
Darwin’s idea was that species evolve through natural selection. Natural selection is a relatively simple concept, but it has some complex consequences. Natural selection just suggests that successful strategies will prevail. This is a somewhat self-verifying statement. The measure of success is that the strategy did prevail. However, there are some complex consequences. For a species to prevail it needs to survive. Natural selection forces us to think about why our behaviour might be helpful in determining our survival. Bentham’s idea was that we should maximize happiness. But given knowledge of Darwin, it may be that we’d prefer to identify happiness as an intermediate reward. Natural selection implies that one of our longer term goals is the survival of our species, with intermediat implications for our selves, our societies and our ways of life.
Most of us wouldn’t accept that our happiness at any precise moment is an absolute assessment of where we feel we are in life. In fact, we often find our state of happiness much more sensitive to changes in our circumstances than any particular absolute measurement we can make. If our circumstances are good in the absolute sense, but not improving, then we may have to consciously remind ourselves of how lucky we are to gain pleasure from our position.
In the children’s book “A Squash and a Squeeze” (Donaldson and Scheffler 2004) illustrates this idea nicely with a rendering of an old folk tale. An old lady lives in a house which she finds too small. She asks the advice of a “wise old man” who suggests she adds a chicken, a pig, a goat and a cow to her living quarters. The lady does as instructed finding her accommodation increasingly cramped as she does so. Finally the man tells the lady to let all the animals out again. The lady is now happy, finding her house to be far more spacious without the animals. She is happy at that moment, despite her absolute circumstances not having changed since she first went to the wise old man for advise. This folk tale may not provide a practical solution to the housing crisis: its moral is a caricature of our sensibilities: our happiness is more sensitive to our changes of circumstance than our absolute positioning. From an evolutionary perspective this makes sense. If as organisms we became satiated at a particular stage of achievement then our species or societies could become complacent.
The mathematical foundations of the study of change are given by Newton and Leibniz‘s work differential calculus. In Bentham’s time these ideas were taught only in the most advanced University courses. The argument above suggests that actually happiness is some (monotonic) function of the gradient of whatever personal utility function we have. Not the absolute value. This idea also may deal nicely with the issue of different types of happiness. A night of debauchery may make us instantaneously very happy, implying a high rate of change. But it is over quickly, implying that the absolute change in our circumstances is small (the absolute improvement in the utility would be the level of happiness multiplied by the time we were happy for). Of course this improvement may be offset by whatever the consequences of the debauchery are the next day. John Stuart Mill’s variation on utilitarianism considered ’higher pleasures’, such as the pleasure gained from literature and learning. See also Eleni Vasilaki’s perspective on Epicurus (Vasilaki 2017). Instantaneously we may experience less happiness when engaged in these activities compared to whatever our night of debauchery involved. However, these activities can be sustained for a very long period and allow us to achieve more. The absoute improvement in our circumstance would be given by the period of study multiplied by the pleasure given.
In the terminology of differential calculus, our happiness must be integrated to form our utility. Not instantaneously measured. It may be that Stuart Mill’s differentiating of happiness could have been reconciled with utility theory by realising he was actually distinguishing between sustainable forms of happiness and non-sustainable. Unfortunately we can’t revisit him to ask, but we can at least do him the justice of giving him the benefit of the doubt.
Daniel Kahneman’s Nobel Memorial Prize in Economics was awarded for the idea of prospect theory. Kahneman describes the theory and its background in his book, “Thinking Fast and Slow” (Kahneman 2011). Prospect theory is not a mathematical theory, but a theory in behavioral economics based on empirical observations of human behavior. Empirical observation about how we value different alternatives that involve risk. A key observation of Kahneman’s is in line with our analysis above: people are responsive to change in circumstance, not absolute circumstance. Prospect theory goes on to identify asymmetries in our sensitivities. A negative change in circumstance weights upon us greater than the equivalent positive change in circumstance.
Bentham’s ideas focussed around the idea of a global utility, maximisation of happiness across the population. Darwin’s principle of natural selection actually insists that there must be variation in the population, and therefore variation in our perception of our circumstances.
Natural selection relies on variation, because if there is no variation, then there can be no separation between effective and ineffective strategies. Different strategies arise from different value systems. If all organisms were to pursue the same strategy and when the circumstantial judgment of the selection process would fall upon all members of the species simultaneously, they would die or survive together.
One of the themes that Kahneman explores is the tendency of humans to produce, through their System 2 thought processes (their ‘slow thinking brains’), overcomplicated explanations of observed data. There’s a tendency for people to focus on a detailed narrative as if it was pre-determined and within the control of all participants. In practice we cannot control events in such a regulated manner.
To predict we need data, a model and computation. Data is the information we are given, the model is our belief in the way the world works and computation is required to assimilate the two. This is true for humans and computers. Ignoring the quality of the data for a moment, and focussing on the model, our predictive system can fail in one of two ways. It can either over simplify or it can over complicate.
This binary choice may seem obvious, but it has some significant consequences. The phenomenon was studied in machine learning by Geman et al (Geman, Bienenstock, and Doursat 1992) who referred to it as the ‘bias variance dilemma’. They decomposed errors into those due to oversimplification (the bias error) and those due to insufficient data to underpin a complex model.
Bias errors are errors that arises when your model is not rich enough to capture all the nuances of the world around. Bias errors occur when the rich underlying phenomena underpinning an observation are ignored and a simpler model explanation is given. An example of a bias error would be one that arises from the simple model “home teams always win sports games”. There is some truth to the home advantage: this model will do better than 50/50 guessing. But it is an oversimplification. It is biased. However, because we have a lot of data about sports games, then two experts using this rule to predict outcome would make consistent predictions.
An error due to variance is one which occurs when we go too far the other way. There are a myriad of factors that could effect the outcome of a sports event. Weather, balloons on the pitch, the mental and physical fitness of each of the players. The quality of the pitch. If we take them all into account we might hope for better predictions. But in reality, we haven’t seen enough data to determine how each of these factors effects outcome. Badly determined parameters lead to high variance error. Two experts see slightly different data and weight these complex factors according to their perspective. As a result their predictions can vary, leading to variance error.
The important point is that these errors are fundamentally different in their characteristics. The error due to bias, the simplification error, comes about by not taking taking all the factors into account. In statistical models bias errors are very common: indeed they are often preferred because they are associated with simpler models and the parameters often have some explanatory power.
The type of error Kahneman is describing in human explanations would be termed an error due to variance. A variance error is different from a bias error. In a variance error you may have a model that is sufficient to describe the underlying system, but you don’t have enough data or information to pin down exactly how your observed outcome came to be. A characteristic of error due to variance is that different observers may have highly rich, but conflicting, explanations of what brought about the phenomenon. The soft of conflicting explanations that bring about lively debate in television studios during half-time breaks in football matches
The bias-variance dilemma is a major challenge in machine learning. One widely accepted solution to the dilemma is that we choose a model which exhibits larger bias because even though it is known to be incorrect (too simple) for the data w have available it will make better predictions. Many of Kahneman’s mechanism’s and solutions for human irrationality actually do introduce simple statistical models to improve the quality of prediction. Kahneman relates how decision making can be rendered more consistent and higher quality in this manner.
An alternative and widely used solution in machine learning is do develop large families of complex models that exhibit variance, just as individual humans do. However, once these models are trained they are not relied on individually but the are combined in ‘ensembles’ to predict together. They vote on the solution or their average prediction is taken. This idea is very similar to the ‘wisdom of the crowds’. Seen from this context there are very good reasons why a ‘population’ of intelligent beings should exhibit variance-error instead of bias-error. A characteristic of bias-error would be that we would all be consistent in our predictions. This is good for accounting for behaviour, but it is a serious problem if we are all consistently wrong. Variance-error implies that we all come up with different, over-complicated, reasons why events transpire as they did. Taken as a population we cover a wide range of alternatives. As a result we act in different ways, and make different decisions. It is clear that we will never be all correct, but when it comes to evolution, the important thing is that we are never all wrong.
A further advantage of choosing variance-error over bias-error for a population is that a consistent and robust prediction can always be achieved by averaging outcome. In machine learning approaches such as bagging and boosting (Breiman 1996) can be used to reduce variance-error in a population of models.6 Advocates of the “Wisdom of the Crowds” propose the same principle. By preferring bias-error in our population we have no recourse, but models that exhibit variance-error can always be combined to create a more stable prediction from the population as a whole: democratic decision making is one way to achieve this.
So the ‘rational’ behaviour of a population under natural selection is to sustain a variety of approaches to life. It follows then, that if there is to be natural selection within our species, our ideas of achievement should vary. Our individual utility should be subjective. What brings me happiness, may not bring you happiness. We probably have different ideas of debauchery, literature and learning. While we can disagree with each others tastes, natural selection tells us that our species is more robust if there if there is a diversity of approaches. Darwin’s principle tells us we should be like this.
When we see half-time football pundits debating their convolved explanations of the way the match is evolving we should remember that their arguments are all important and each one may have some validity. They are the result of overly complex models being applied on little data. The game of football is fundamentally stochastic, but the analysts treat it as deterministic.
This phenomenon is not new, and it is not constrained to football punditry. In 1954 the psychologist Peter Meehl wrote a book about how clinical experts can be outperformed by simple statistical models (Meehl 1954). Meehl suggested they ‘try to be clever and think outside the box’. Kahneman addresses this challenge in Chapter 21 of his book.
complexity may work in the odd case, but more often than not it reduces validity
going on to say
humans are incorrigibly inconsistent in making summary judgments of complex information. When asked to evaluate the same information twice, they frequently give different answers.
and
Unreliable judgements cannot be predictors of anything.
The two approaches Kahneman proposes for dealing with this in human society are:
Replace human punditry making with simple statistical models. This also makes sense from a statistics point of view if we desire consistency we should replace the models that exhibit variance-error (the humans) with models that exhibit bias-error (simple statistical forumlae).
Exploit wisdom of the crowds. Wisdom of the crowds is a proposal that human opinion should be aggregated to improve predictions. This is consistent with the idea that humans tend to make variance-errors.
Punditry is widespread, and the bias-variance analysis shows us that there are good reasons why, under natural selection, we should value such diversity of opinions. What is also important is that we should develop mechanisms in society for these opinions to be properly represented when making a decision.
This error is dangerous, it is one of the intellectual failings of those that sought to put ideas from eugenics into political practice. Early philosophies based on natural selection were overly focussed on the average value of the ‘fitness’ of a population. Trying to increase this value whilst simultaneously reducing variation is a very dangerous game. It is artificial selection. It assumes that you have preordained what the future natural circumstances are going to be. It may be OK for race horses, greyhounds, crops, sheep and cows because in those circumstances we are aiming to control their environment. It is not OK for the human race.
We are right to express moral outrage at what the negative eugenecists tried to achieve. But it was also motivated from a flawed understanding of science. Their model was wrong. Populations that are capable of excelling in a particular environment, because they are highly tuned to it, are rapidly extinguished when circumstances change. If the animals in a species become too specialised then they may not be able to respond to changing circumstances. Think of cheetahs and eagles vs rats and pigeons.
Socially the same principle should hold. I may not agree with many people’s subjective approach to life, I may even believe it to be severely sub-optimal. But I should not presume to know better, even if prior experience shows that my own ‘way of being’ is effective. Variation is vitally important for robustness. There may be future circumstances where my approaches fail utterly, and other ways of being are better.
The quality of our subjective utilities at any given time is measured by their effectiveness in the world. Survival of the species indicates that it is the sustenance of the entire human species that should concern us in the long run. Although there will be many intermediate effects that we will be looking to achieve in the medium term. Indeed, we may even question if there are circumstances under which we would not wish the human species to survive.7 The universal utility by which we are judged is therefore difficult to define. We can pin down aspects of it, but perhaps the best we can do is seek compromise between our individual utilities, while maintaining awareness that there may be outside forces, for example climate change, that will have such a detrimental effect on all our lives that it is worth investing significant time and effort as a society to try and reduce our exposure.
Lets get back to the trolleys.
The trolley problem is an oversimplification, and one which is not useful in characterizing the moral dilemmas we are faced with in the modern era of computer decision making. Instead of the trolley problem, let’s propose a new dilemma, and let’s focus on driverless cars.
Most arguments for driverless cars focus on the overall reduction in human death rates we expect to result from their adoption. For example, if we introduce driverless cars and bring about a 90% reduction in deaths on the road, then that surely is a good thing. But what if the remaining 10% of deaths are focussed on a particular section of the population, for example, what if we find that the only people those cars do continue to kill are cyclists.
Now there are ethical and moral questions about what we have developed. Even if we have reduced the total number of cyclist deaths8 is it fair to disproportionately affect one section of the population? A simplistic utilitarian perspective would say yes, please proceed, although we individually might be uncomfortable with this. Let’s explore further.
Utilitarianism appears to be telling us to favour a set up that could disproportionately effect a minority: in this case cyclists. Let’s develop a more sophisticated view of the right utility and see how it might effect our conclusions.
There are two principles we should take into account when considering how we should aim to effect the evolution of our society. The first we have discussed above, uncertainty. The second is Darwin’s principle of natural selection.
Natural selection is a simple idea although it’s had a difficult history. Unfortunately, when applied on its own it leads to some unsophisticated principles that don’t work in practice. First of all, we might naively assume from natural selection that there is a single best way of doing things. This idea is embedded in the statement “survival of the fittest”. But that is a naive reinterpretation of natural selection that gives the wrong connotation. The phrase is not due to Darwin, but due to Herbert Spencer, a Victorian philosopher who applied principles of evolution to sociology. The phrase encourages the idea that one individual will survive, the one that has the best approach. This is a damaging idea, and one that should be put to bed.
The marvel of evolution is its responsiveness, success is defined by the environment, both at the individual level and the level of communities and species. But the environment itself is complex and evolving. It is dynamic. Strategies for success are therefore not static. The criteria for success are also uncertain. While we might accept that any given moment there is a ‘formula for success’, the dynamic nature of our evironment means that that formula for success is evolving.
Such a formula for success is equivalent to our utility function, but we are now placed in the circumstance where there is uncertainty around the utility function itself, because it is rapidly evolving and we don’t have access to it.
How quickly is that utility evolving? In the modern world, for humans, and animals, very quickly. Species are becoming extinct at an alarming rate, our world is warming, potentially changing our climate from our current relatively benign climate closer to the more uncomfortable climates of the past. This means the utility is evolving. The criteria for a lion born 200 years ago are very different from a lion today. How would we redesign the lion to survive today?
In a rapidly evolving environment, the species that are most vulnerable are those that are specialised to fragile niche environments that will disappear as our global environments evolve. As humans we are lucky in that we can consciously change our practices to react to our evolving environment. But one thing is clear: the idea that there is a single particular solution, an ‘absolute’ principle by which we should progress a society is seriously flawed.
In fact, that is not quite true. There is one absolute policy we should follow. That policy is: “There will be single absolute policy that should be followed slavishly in all circumstances”. Our environment is evolving, and will continue to evolve through our lives. Perhaps more so now than at any other time in human history. I’m not even referring to our climate, the temperature changes we might expect from global warming, I’m referring to our social circumstances, the rate at which new technologies are emerging, such that within the one individual’s lifespan our modes of intercommunication in our society are changing so as to be almost unrecognisable, and certainly unenvisageable from decade to decade.
In these periods uncertainty about the right thing to do dominates. And the correct form of response to uncertainty is to value diversity. Every form of extremism that forms a threat to the aspects of life I value can be characterised as absolutist. Whether communist, fascist, islamist or christianist. The malignant characteristic of their extremist forms prohibits any other strategy for society. You don’t need the existence of gods to interpret this as extreme folly and hubris. Ironically these absolutist philosophies are the only behaviours that we can absolutely exclude.
From this perspective Sandel’s use of the second trolley example takes on a new light. The first example, that of a simple switch in the points, is about as deterministic and mechanistic as a situation can get. You pull the lever, you expect the points to change. Even if they don’t there is no downside associated with the consequences, other than a maybe angry railworker who sees you just tried to send a quarter ton railway wagon down his throat. The second example is largely contrived, and riddled with uncertainty. The story suggests that we know that the man we want to push off the bridge is heavy enough to divert the trolley and yet we ourselves aren’t. That strikes me as an absurd notion. We’ve had moments to assess the situation, and yet we can judge this with confidence? Next we have to heave this heavier man over the barrier at the side of the bridge (I’m assuming it has a barrier). Assuming we achieve this, we now have to ensure our aim is accurate enough to land the gentleman fully on the track. We’ve already determined his general heft is only just sufficient to divert the trolley, we know he’ll have to lay very squarely on the track, and remain there. Finally, who’s to say what the path of the trolley will be after hitting him? The endangered workmen are clearly close enough to not be able to clear the track in time, so is it not possible that the diverted trolley will hurtle into them anyway?
My own belief is that, perhaps at a sub-conscious level, humans are very sensitive to the uncertainty in this scenario. We can vividly imagine attempting to explain our actions after the event, and I think the reaction of anyone we explained them to would be utter incredulity: “So the trolley was hurtling to the men, and instead of yelling to them you attempted to push another man off the bridge?”. Only the most trivial simplification of the situation combined with a naive interpretation of utility theory would conclude that pushing the man was the correct action. Indeed, if it had not been phrased as a decision process, it would not even have entered our mind. Why then is it a mainstay of moral philosophy? This may be an example of what Daniel Kahneman refers to as theory induced blindness, but what I think would be more correctly referred to as model-induced blindness. Here the model of reasoning is one that ignores uncertainty and our ability to subconciously quantify it. That’s a good model for the original trolley example but an extremely poor one for the second. The model is wrong.
This brings us to mind again George Box’s quote, after all, the model may be wrong but is it useful? My colleague Richard Wilkinson pointed out to me that a better quote for the modern era (from the same paper) might be
Since all models are wrong the scientist must be alert to what is importantly wrong. It is inappropriate to be concerned about mice when there are tigers abroad.
George E. P. Box (1976)
I use this quote in talks9, the intent of this quote is it is important to worry about the manner in which the model is wrong. Models are abstractions, and it is important when modelling to decide upon the correct level of granularity for our abstraction. Unfortunately, it seems that there is a widespread propensity to take such a model (as applied to the first trolley question) and unquestioningly apply it to the next similar seeming problem on the test paper. Kahneman also has something to say about this type of behaviour from a behavioural psychologist perspective: he categorises it as a consequence of the laziness of System 2 and the resulting dominance of System 1. Or to put it in plain terms: very often people don’t stop and think. And in this case it seems to me that failure to stop and think means that the moral philosopher was eaten by a tiger before he had a chance to get anywhere near the fat man.
Apologies for being trite and rather polemical, but these points emerge from an ongoing frustration about the extent to which our theorising about decision making ignores the almost omnipresent effects of uncertainty. Uncertainty about our individual values, our values as a society, our future circumstances, our present circumstances. Uncertainty is endemic, and explicitly accounting for it induces robust behaviour in the presence of changing circumstance. Paramount among these is respect for diversity. Diversity of opinion and behaviour. This is the reason for our cognitive bias towards variance. The importance of diversity in dealing with an uncertain environment.
Uncertainty of the correct utility and our wider values means that, in practice, the ‘correct’ decision is difficult produce or verify. In these circumstances, there has a been a tendency to seek proxies such as consistency in decision making. However, a consistent decision making algorithm will tend to err on the side of over-simplifying circumstances, potentially missing particular nuances and more negatively effecting one sector of society over another.
In this paper we have argued that the right response to uncertainty is diversity. As studied by Peter Meehl and Daniel Kahnemann, our own cognitive bias towards overcomplicating situations leads to a diversity of responses by experts when presented with the same data. Both Meehl and Kahnemann characterize this diversity of response as a bad thing because, provably, if experts differ opinion then at least one of them must be ‘wrong’.
However, obsession about consistency between experts, or algorithms, misses the point that such algorithms (or experts) could also be consistently wrong. This is particularly worrisome for algorithmic decision making which can be deployed en masse with rapid detrimental effect over particular sectors of society. Such deployments are encouraged by a naive-utilitarian perspective, but a more sophisticated understanding of societal utility actually suggests that preservation of diversity could be much more important.
In an uncertain environment, we are arguing that society would be more robust if diversity of solutions and opinions are sustained and respected. Consistency of opinion should not be substituted as a proxy for truth, and diversity should be understood as being more robust in an evolving society such as ours where there is he uncertainty around our shared value system and the nature of ‘fitness’ in a rapidly evolving environment.
Box, George E. P. 1976. “Science and Statistics.” Journal of the American Statistical Association 71 (356): 791–99. http://www.jstor.org/stable/2286841.
———. 1979. “Robustness in the Strategy of Scientific Model Building.” Edited by R. L. Launer and G. N. Wilkinson. Academic Press, 201–36. http://www.dtic.mil/docs/citations/ADA070213.
Breiman, Leo. 1996. “Bagging Predictors.” Machine Learning 24 (2): 123–40. doi:10.1007/BF00058655.
Donaldson, Julia, and Axel Scheffler. 2004. A Squash and a Squeeze.
Foot, Philippa. 1967. “The Problem of Abortion and the Doctrine of the Double Effect in Virtues and Vices.” Oxford Review 5: 5–15. doi:10.1093/0199252866.003.0002.
Geman, Stuart, Elie Bienenstock, and René Doursat. 1992. “Neural Networks and the Bias/Variance Dilemma.” Neural Computation 4 (1): 1–58. doi:10.1162/neco.1992.4.1.1.
Kahneman, Daniel. 2011. Thinking Fast and Slow.
Meehl, Paul E. 1954. Clinical Versus Statistical Prediction: A Theoretical Analysis and a Review of the Evidence.
“Reported Road Casualties in Great Britain: Main Results 2015.” 2016. UK Department for Transport.
Sandel, Michael. 2010. Justice: What’s the Right Thing to Do?
Thomson, Judith Jarvis. 1976. “Killing, Letting Die, and the Trolley Problem.” The Monist 59 (2): 204–17.
Vasilaki, Eleni. 2017. “Is Epicurus the Father of Reinforcement Learning?” ArXiv E-Prints. https://arxiv.org/abs/1710.04582.
In Phillipa Foot’s original version the example refers to a runaway tram and you are the driver, but this has evolved for it to be referred to as a trolley, and you have control of the points.↩
There is almost palpable excitement in the room amoung students of humanities when driverless cars are discussed, mainly in anticipation (I believe) of a real world application of the trolley problem. The car has a choice between killing a grandma, or a baby, which does it choose? What algorithm does it use? We will discuss a more sophisticated consequence of the algorithm below, but if the car gets in the situation where it’s about to kill someone, something will have gone seriously wrong. There will be a great deal of uncertainty in what happens next, and the main objective should be to reduce the energy of the car, and minimize the result of the impact. In these circumstances, no car manufacturer will be calling upon moral subroutines to determine who should survive.↩
Actually a risk function normally applies we are looking at expected utility, i.e. it incorporates some notion of the probability of outcomes.↩
This example is due to Judith Jarvis Thomson, her original paper (Thomson 1976) is a very interesting read. She introduces three separate variations on Phillipa Foot’s trolley scenario and impersonalizes them by associating them with three characters, Edward, Frank and George.↩
Since I first drafted these ideas, my colleague Eleni Vasilalki began exploring Epicurius’s philosophy as an underpinning of Reinforcement Learning. From a machine learner’s perspective the ideas of Epicurius also seem related to John Stuart Mill (???).↩
If you have an Xbox at home and use the Kinnect, it is using exactly this technique to determine where you are in the video frame. The algorithm is called a “random forest”, and it averages across many ‘decision trees’ to create an output using a voting scheme. A decision tree is an approach to categorisation that is used in machine learning to classify objects. It uses a tree-like structure to decide what an object is according to its attributes. The algorithm typically takes one feature of the data at a time and ‘branches’ according to the value of the attribute, for example one decision in a ‘vehicle classifier’ might be to go down one branch if the number of wheels are greater than 3 (car or truck) and another if the wheels are less than 3 (motorbike or bicycle). A follow on question might involve engine maximum power, sending it down a different branch. Each tree makes its own, detailed decision. Each individual tree may also not be very good on average. Leo Breiman suggested an approach to aggregating these decisions to give the final answer. The algorithm is known as a random forest. Decision trees are a key component in Facebook’s advertising ranking algorithm. They also are used to identify where you are in the image when the Kinnect turns on.↩
This idea emerges in the film “The Matrix”.↩
Cyclist deaths in the UK currently make up around 6% of the total deaths. For example in 2015 of 1,730 total deaths on the road, 100 were cyclists (6%). Pedestrians made up a further 408 (24%), motorcyclists 365 (21%) (“Reported Road Casualties in Great Britain: Main Results 2015” 2016).↩
I did once have to explain that it doesn’t mean the tigers are in foreign countries.↩
Machine learning is perhaps the principal technology behind two emerging domains: data science and artificial intelligence. The rise of machine learning is coming about through the availability of data and computation, but machine learning methdologies are fundamentally dependent on models.
\[\text{data} + \text{model} \xrightarrow{\text{compute}} \text{prediction}\]The emergence of machine learning is closely tied to the emergence of widely available data.
Large amounts of data and high interconnection bandwidth mean that we receive much of our information about the world around us through computers.
Economists try to measure productivity, one of the ways we can become more productive is by becoming more efficient. For example, moving from gathering food to settled agriculture. In the modern era one approach to becoming more efficient is automation of processes like manufacturing production lines. The manufacturing process is decomposed into a series of mechanical or manual processes each of which is applied sequentially.
Manufacturing processes consist of production lines and robotic automation. Logistics can also be decomposed into the supply chain processes. Whether it’s manufacturing or logistics, efficiency can be improved by automating components of the processes to improve the flow of goods.
An interesting challenge for modern society is the management of both the flow of goods and the flow of information. The flow of information is also highly automated. Processing of data is decomposed into stages in computer code.
In these processing pipelines, manufacturing, logistics or data management, the overall pipeline normally also requires human intervention from an operator. These interventions can create bottlenecks and slow the process of automation. Machine learning is the key technology in automating these manual stages.
The human interventions that were easy to replicate with technology have already been replaced. The components that still require human intervention are the knottier problems. Often they represent components that are difficult, or impossible, to decompose into stages which could then be further automated. In that sense these components are process-atoms. In manufacturing or logistics settings these atoms involve the sort of flexible manual skills that we cannot replicate with current robotic technology. They require emulation of a human’s motor skills. In information processing settings these atoms require emulation of our cognitive skills. For example, our ability to mentally process an image or some text.
Machine learning is a field dedicated to replicating such processes through the direct use of data. It is an approach to splitting these process-atoms so that they too be automated. In machine learning we aim to emulate cognitive processes through the use of data. Machine learning also uses data to provide new approaches in control and optimization that should allow for emulation of human motor skills.
A key idea in machine learning is to emulate the process as a mathematical function. Each function has a set of parameters which control its behavior. Learning is the process of changing these parameters to change the shape of the function to make it more representative of the data. Our choice of which class of mathematical functions we use is a vital component of our model.
Example of prediction: The Olympic gold medalist in the marathon’s pace is predicted using a regression fit. In this case the mathematical function is directly predicting the pace of the winner as a function of the year of the Olympics.
Machine learning has risen to prominence as the principal technology underlying the recent advances in artificial intelligence techniques. It takes a different approach to that developed in classical artificial intelligence (sometimes referred to as “good old fashioned AI” or GOFAI). GOFAI relied on symbolic logic as its mathematical engine. Early AI used expert systems: a set of logical rules implemented to reconstruct expertise. For example, rules to decide whether or not someone has cancer. Such rules prove hard to specify for very complex processes.1
We can split applications of machine learning broadly into data science and artificial intelligence. We define the field of data science to be the challenge of making sense of the large volumes of data that have now become available through the increase in sensors and the large interconnection of the internet. Phenomena variously known as “big data” or “the internet of things”. Data science differs from traditional statistics in that this data is not necessarily collected with a purpose or experiment in mind. It is collected by happenstance, and we try and extract value from it later. In classical statistics the question is formed first, and data is collected to answer the question. Karl Popper, the philosopher of science, once put the challenge of model (or question) and experiment (or data) as being similar to the challenge of the chicken and the egg. Which comes first? His answer is that they co-evolve. And a similar answer should apply to the model and the data. But in classical statistics it was normally the question (which could be encoded in a model) that comes first. In modern data science it is often the data that comes first. Regardless, to truly answer questions to our satisfaction several stages of modelling and data collection are likely to be required. This naturally implies that classical statistical analyses are as important as ever in this domain.
Artificial intelligence on the other hand is a field with its origins in Cybernetics. The challenge in artificial intelligence is to recreate “intelligent” behaviour. In particular there is a focus on having machines emulate the capabilities of humans. Machine learning has become important in this domain because of the success of data-driven artificial intelligence. In data-driven artificial intelligence, rather than solving a challenge in intelligence from first principles, we simply acquire data that guides us to the solution of the problem. Often such data is acquired from humans, and the computer is given the task of reconstructing that data. This is where machine learning techniques come in. Machine learning focuses on emulating the data creation process by combining a model with the data. The model incorporates assumptions about the data generating process. But unlike physical models (e.g. differential equation models) the assumptions are typically only weak. Smoothness assumptions are common, and the success of convolutional neural networks is driven by the assumptions they can encode about how an image is structured.
Machine learning takes the different approach of observing a system in practice and emulating its behavior with mathematics. One of the design aspects in designing machine learning solutions is where to put the mathematical function. Obtaining complex behavior in the resulting system can require some imagination in the design process. We will give an overview of the approaches that people take using the classical three subdivisions of the field: supervised learning, unsupervised learning and reinforcement learning. Each of these approaches uses the mathematical functions in a different way. Below we will introduce each of these learning challenges and explore how the mathematical function is deployed.
Supervised learning is one of the most widely deployed machine learning technologies, and a particular domain of success has been classification. Classification is the process of taking an input (which might be an image) and categorizing it into one of a number of different classes (e.g. dog or cat). This simple idea underpins a lot of machine learning. By scanning across the image we can also determine where the animal is in the image.
Classification is perhaps the technique most closely assocated with machine learning. In the speech based agents, on-device classifiers are used to determine when the wake word is used. A wake word is a word that wakes up the device. For the Amazon Echo it is “Alexa”, for Siri it is “Hey Siri”. Once the wake word detected with a classifier, the speech can be uploaded to the cloud for full processing, the speech recognition stages.
A major breakthrough in image classification came in 2012 with the ImageNet result of Alex Krizhevsky, Ilya Sutskever and Geoff Hinton from the University of Toronto. ImageNet is a large data base of 14 million images with many thousands of classes. The data is used in a community-wide challenge for object categorization. Krizhevsky et al used convolutional neural networks to outperform all previous approaches on the challenge. They formed a company which was purchased shortly after by Google. This challenge, known as object categorisation, was a major obstacle for practical computer vision systems. Modern object categorization systems are close to human performance.
In supervised learning the inputs, $\mathbf{x}$, are mapped to a label, $y$, through a function $f(\cdot)$ that is dependent on a set of parameters, $\mathbf{w}$,
\[y = f(\mathbf{x}; \mathbf{w}).\]The function $f(\cdot)$ is known as the prediction function. The key challenges are (1) choosing which features, $\mathbf{x}$, are relevant in the prediction, (2) defining the appropriate class of function, $f(\cdot)$, to use and (3) selecting the right parameters, $\mathbf{w}$.
Feature selection is a critical stage in the algorithm design process. In the Olympic prediction example above we’re only using time to predict the the pace of the runners. In practice we might also want to use characteristics of the course: how hilly it is, what the temperature was when the race was run. In 1904 the runners actually got lost during the race. Should we include ‘lost’ as a feature? It would certainly help explain the particularly slow time in 1904. The features we select should be ones we expect to correlate with the prediction. In statistics, these features are even called predictors which highlights their role in developing the prediction function. For Facebook newsfeed, we might use features that include how close your friendship is with the poster, or how often you react to that poster, or whether a photo is included in the post.
Sometimes we use feature selection algorithms, algorithms that automate the process of finding the features that we need. Classification is often used to rank search results, to decide which adverts to serve or, at Facebook, to determine what appears at the top of your newsfeed. In the Facebook example features might include how many likes a post has had, whether it has an image in it, whether you regularly interact with the friend who has posted. A good newsfeed ranking algorithm is critical to Facebook’s success, just as good ad serving choice is critical to Google’s success. These algorithms are in turn highly dependent on the feature sets used. Facebook in particular has made heavy investments in machine learning pipelines for evaluation of the feature utility.
By class of function we mean, what are the characteristics of the mapping between $\mathbf{x}$ and $y$. Often we might choose it to be a smooth function. Sometimes we will choose it to be a linear function. If the prediction is a forecast, for example the demand of a particular product, then the function would need some periodic components to reflect seasonal or weekly effects.
In the ImageNet challenge the input, $\mathbf{x}$, was in the form of an image. And the form of the prediction function was a convolutional neural network (more on this later). A convolutional neural network introduces invariances into the function that are particular to image classification. An invariance is a transformation of the input that we don’t want to effect the output. For example, a cat in an image is still a cat no matter where it’s located in the image (translation). The cat is also a cat regardless of how large it is (scale), or whether it’s upside-down (rotation). Convolutional neural networks encode these invariances: scale invariance, rotation invariance and translation invariance; in the mathematical function.
Encoding invariance in the prediction function is like encoding knowledge in the model. If we don’t specify these invariances then the model must learn them. This will require a lot more data to achieve the same performance, making the model less data efficient. Note that one invariance that is not encoded in a convolutional network is invariance to camera type. As a result practitioners need to be careful to ensure that their training data is representative of the type of cameras that will be used when the model is deployed.
In general the prediction function could be any set of parameterized functions. In the Olympic marathon data example above we used a polynomial fit,
\[f(x) = w_0 + w_1 x+ w_2 x^2 + w_3 x^3 + w_4 x^4.\]The Olympic example is also a supervised learning challenge. But it is a regression problem. A regression problem is one where the output is a continuous value (such as the pace in the marathon). In classification the output is constrained to be discrete. For example classifying whether or not an image contains a dog implies the output is binary. An early example of a regression problem used in machine learning was the Tecator data, where the fat, water and protein content of meat samples was predicted as a function of the absorption of infrared light.
Once we have a set of features and the class of functions we use is determined we need to find the parameters of the model.
The parameters of the model, $\mathbf{w}$, are estimated by specifying an objective function. The objective function specifies the quality of the match between the prediction function and the training data. In supervised learning the objective function incorporates both the input data (in the ImageNet data the image, in the Olympic marathon data the year of the marathon) and a label.
The label is where the term supervised learning comes from. The idea being that a supervisor, or annotator, has already looked at the data and given it labels. For regression problem, a typical objective function is the squared error,
\[E(\mathbf{w}) = \sum_{i=1}^n (y_i - f(\mathbf{x}_i))^2\]where the data is provided to us as a set of $n$ inputs, $\mathbf{x}_1$, $\mathbf{x}_2$, $\mathbf{x}_3$, $\dots$, $\mathbf{x}_n$ each one with an associated label, $y_1$, $y_2$, $y_3$, $\dots$, $y_n$. Sometimes the label is cheap to acquire. For example in Newsfeed ranking Facebook are acquiring a label each time a user clicks on a post in their Newsfeed. Similarly in ad-click prediction labels are obtained whenever an advert is clicked. More generally though, we have to employ human annotators to label the data. For example, ImageNet, the breakthrough deep learning result was annotated using Amazon’s Mechanical Turk. Without such large scale human input we would not have the breakthrough results on image categorization we have today.
Some tasks are easier to annotate than others. For example, in the Tecator data, to acquire the actual values of water, protein and fat content in the meat samples further experiments may be required. It is not simply a matter of human labelling. Even if the task is easy for humans to solve there can be problems. For example, humans will extrapolate the context of an image. A colleague mentioned once to me a challenge where humans were labelling images as containing swimming pools, even though none was visible, because they could infer there must be a pool nearby, perhaps because there are kids wearing bathing suits. But there is no swimming pool in the image for the computer to find. The quality of any machine learning solution is very sensitive to the quality of annotated data we have. Investing in processes and tools to improve annotation of data is therefore priority for improving the quality of machine learning solutions.
There can also be significant problems with misrepresentation in the data set. If data isn’t collected carefully, then it can reflect biases about the population that we don’t want our models to have. For example, if we design a face detector using Californians may not perform well when deployed in Kampala, Uganda.
Once a supervised learning system is trained it can be placed in a sequential pipeline to automate a process that used to be done manually.
Supervised learning is one of the dominant approaches to learning. But the cost and time associated with labeling data is a major bottleneck for deploying machine learning systems. The process for creating training data requires significant human intervention. For example, internationalization of a speech recognition system would require large speech corpora in new languages.
An important distinction in machine learning is the separation between training data and test data (or production data). Training data is the data that was used to find the model parameters. Test data (or production data) is the data that is used with the live system. The ability of a machine learning system to predict well on production systems given only its training data is known as its generalization ability. This is the system’s ability to predict in areas where it hasn’t previously seen data.
Alex Ihler on Polynomials and Overfitting
We can easily develop a simple prediction function that reconstructs the training data exactly, you can just use a look up table. But how would the lookup table predict between the training data, where examples haven’t been seen before? The choice of the class of prediction functions is critical in ensuring that the model generalizes well.
The generalization error is normally estimated by applying the objective function to a set of data that the model wasn’t trained on, the test data. To ensure good performance we normally want a model that gives us a low generalization error. If we weren’t sure of the right prediction function to use then we could try 1,000 different prediction functions. Then we could use the one that gives us the lowest error on the test data. But you have to be careful. Selecting a model in this way is like a further stage of training where you are using the test data in the training.2 So when this is done, the data used for this is not known as test data, it is known as validation data. And the associated error is the validation error. Using the validation error for model selection is a standard machine learning technique, but it can be misleading about the final generalization error. Almost all machine learning practitioners know not to use the test data in your training procedure, but sometimes people forget that when validation data is used for model selection that validation error cannot be used as an unbiased estimate of the generalization performance.
In unsupervised learning you have data, $\mathbf{x}$, but no labels $y$. The aim in unsupervised learning is to extract structure from data. The type of structure you are interested in is dependent on the broader context of the task. In supervised learning that context is very much driven by the labels. Supervised learning algorithms try and focus on the aspects of the data which are relevant to predicting the labels. But in unsupervised learning there are no labels.
Humans can easily sort a number of objects into objects that share similar characteristics. We easily categorize animals or vehicles. But if the data is very large this is too slow. Even for smaller data, it may be that it is presented in a form that is unintelligible for humans. We are good at dealing with high dimensional data when it’s presented in images, but if it’s presented as a series of numbers we find it hard to interpret. In unsupervised learning we want the computer to do the sorting for us. For example, an e-commerce company might need an algorithm that can go through its entire list of products and automatically sort them into groups such that similar products are located together.
Supervised learning is broadly divided into classification: i.e. wake word classification in the Amazon Echo, and regression, e.g. shelf life prediction for perishable goods. Similarly, unsupervised learning can be broadly split into methods that cluster the data (i.e. provide a discrete label) and methods that represent the data as a continuous value.
Clustering methods associate each data point with a different label. Unlike in classification the label is not provided by a human annotator. It is allocated by the computer. Clustering is quite intuitive for humans, we do it naturally with our observations of the real world. For example, we cluster animals into different groups. If we encounter a new animal we can immediately assign it to a group: bird, mammal, insect. These are certainly labels that can be provided by humans, but they were also originally invented by humans. With clustering we want the computer to recreate that process of inventing the label.
Unsupervised learning enables computers to form similar categorizations on data that is too large scale for us to process. When the Greek philosopher, Plato, was thinking about ideas, he considered the concept of the Platonic ideal. The Platonic ideal bird is the bird that is most bird-like or the chair that is most chair-like. In some sense, the task in clustering is to define different clusters, by finding their Platonic ideal (known as the cluster center) and allocate each data point to the relevant cluster center. So allocate each animal to the class defined by its nearest cluster center.
To perform clustering on a computer we need to define a notion of either similarity or distance between the objects and their Platonic ideal, the cluster center. We normally assume that our objects are represented by vectors of data, $\mathbf{x}_i$. Similarly we represent our cluster center for category $j$ by a vector $\mathbf{m}_j$. This vector contains the ideal features of a bird, a chair, or whatever category $j$ is. In clustering we can either think in terms of similarity of the objects, or distances. We want objects that are similar to each other to cluster together. We want objects that are distant from each other to cluster apart.
This requires us to formalize our notion of similarity or distance. Let’s focus on distances. A definition of distance between an object, $i$, and the cluster center of class $j$ is a function of two vectors, the data point, $\mathbf{x}_i$ and the cluster center, $\mathbf{m}_j$,
\[d_{ij} = f(\mathbf{x}_i, \mathbf{m}_j).\]Our objective is then to find cluster centers that are close to as many data points as possible. For example, we might want to cluster customers into their different tastes. We could represent each customer by the products they’ve purchased in the past. This could be a binary vector $\mathbf{x}_i$. We can then define a distance between the cluster center and the customer.
A commonly used distance is the squared distance,
\[d_{ij} = (\mathbf{x}_i - \mathbf{m}_j)^2.\]The squared distance comes up a lot in machine learning. In unsupervised learning it was used to measure dissimilarity between predictions and observed data. Here its being used to measure the dissimilarity between a cluster center and the data.
Once we have decided on the distance or similarity function we can decide a number of cluster centers, $K$. We find their location by allocating each center to a sub-set of the points and minimizing the sum of the squared errors,
\[E(\mathbf{M}) = \sum_{i \in \mathbf{i}_j} (\mathbf{x}_i - \mathbf{m}_j)^2,\]where the notation $\mathbf{i}_j$ represents all the indices of each data point which has been allocated to the $j$th cluster represented by the center $\mathbf{m}_j$.
One approach to minimizing this objective function is known as $k$-means clustering. It is simple and relatively quick to implement, but it is an initialization sensitive algorithm. Initialization is the process of choosing an initial set of parameters before optimization. For $k$-means clustering you need to choose an initial set of centers. In $k$-means clustering your final set of clusters is very sensitive to the initial choice of centers. For more technical details on $k$-means clustering you can watch a video of Alex Ihler introducing the algorithm here.
Other approaches to clustering involve forming taxonomies of the cluster centers, like humans apply to animals, to form trees. You can learn more about agglomerative clustering in this video from Alex Ihler
Indeed one application of machine learning techniques is performing a hierarchical clustering based on genetic data, i.e. the actual contents of the genome. If we do this across a number of species then we can produce a phylogeny. The phylogeny aims to represent the actual evolution of the species and some phylogenies even estimate the timing of the common ancestor between two species3. Similar methods are used to estimate the origin of viruses like AIDS or Bird flu which mutate very quickly. Determining the origin of viruses can be important in containing or treating outbreaks.
An e-commerce company could apply hierarchical clustering to all its products. That would give a phylogeny of products. Each cluster of products would be split into sub-clusters of products until we got down to individual products. For example we might expect a high level split to be Electronics/Clothing. Of course, a challenge with these tree like structures is that many products belong in more than one parent cluster: for example running shoes should be in more than one group, they are ‘sporting goods’ and they are ‘apparel’. A tree structure doesn’t allow this allocation.
Our own psychological grouping capabilities are studied as a domain of cognitive science. Researchers like Josh Tenenbaum have developed algorithms that decompose data in more complex ways, but they can normally only be applied to smaller data sets.
Dimensionality reduction methods compress the data by replacing the original data with a reduced number of continuous variables. One way of thinking of these methods is to imagine a marionette.
Thinking of dimensionality reduction as a marionette. We observe the high dimensional pose of the puppet, $\mathbf{x}$, but the movement of the puppeteer’s hand, $\mathbf{z}$ remains hidden to us. Dimensionality reduction aims to recover those hidden movements which generated the observations.
The position of each body part of a marionette could be thought of as our data, $\mathbf{x}_i$. So each data point consists of the 3-D co-ordinates of all the different body parts of the marionette. Lets say there are 13 different body parts (2 each of feet, knees, hips, hands, elbows, shoulders, one head). Each body part has an x, y, z position in Cartesian coordinates. So that’s 39 numbers associated with each observation.
The movement of these 39 parts is determined by the puppeteer via strings. Let’s assume it’s a very simple puppet, with just one stick to control it. The puppeteer can move the stick up and down, left and right. And they can twist it. This gives three parameters in the puppeteers control. This implies that the 39 variables we see moving are controlled by only 3 variables. These 3 variables are often called the hidden or latent variables.
Dimensionality reduction assumes something similar for real world data. It assumes that the data we observe is generated from some lower dimensional underlying process. It then seeks to recover the values associated with this low dimensional process.
Dimensionality reduction techniques underpin a lot of psychological scoring tests such as IQ tests or personality tests. An IQ test can involve several hundred questions, potentially giving a rich, high dimensional, characterization of some aspects of your intelligence. It is then summarized by a single number. Similarly the Myers-Briggs personality test involves answering
These tests are assuming that our intelligence is implicitly one dimensional and that our personality is implicitly four dimensional. Other examples include political belief which is typically represented on a left to right scale. A one dimensional distillation of an entire philosophy about how a country should be run. Our own leadership principles imply that our decisions have a fourteen dimensional space underlying them. Each decision could be characterized by judging to what extent it embodies each of the principles.
Political belief, personality, intelligence, leadership. None of these exist as a directly measurable quantity in the real world, rather they are inferred based on measurables. Dimensionality reduction is the process of allowing the computer to automatically find such underlying dimensions. This automatically allowing us to characterize each data point according to those explanatory variables. Neil is intelligent, moderately left wing extrovert who prefers rational thinking to feeling. His leadership decisions often exhibit ownership and backbone, but could don’t score highly on frugality. Each of these characteristics can be scored, and individuals can then be turned into vectors.
This doesn’t only apply to individuals, in recent years work on language modeling has taken a similar approach to words. The word2vec algorithm performed a dimensionality reduction on words, now you can take any word and map it to a latent space where similar words exhibit similar characteristics. A personality space for words.
Principal component analysis (PCA) is arguably the queen of dimensionality reduction techniques. PCA was developed as an approach to dimensionality reduction in 1930s by Hotelling as a method for the social sciences. In Hotelling’s formulation of PCA it was assumed that any data point, $\mathbf{x}$ could be represented as a weighted sum of the latent factors of interest, so that Hotelling described prediction functions (like in regression and classification above), only the regression is now multiple output. And instead of predicting a label, $y_i$, we now try and force the regression to predict the observed feature vector, $\mathbf{x}_i$. So for example, on an IQ test we would try and predict subject $i$’s answer to the $j$th question with the following function \(x_{ij} = f_j(z_i; \mathbf{w}).\) Here $z_i$ would be the IQ of subject $i$ and $f_j(\cdot)$ would be a function representing the relationship between the subject’s IQ and their score on the answer to question $j$. This function is the same for all subjects, but the subject’s IQ is assumed to differ leading to different scores for each subject.
Visualization of the first two principal components of an artificial data set. The data was generated by taking an image of a handwritten digit, 6, and rotating it 360 times, one degree each time. The first two principal components have been extracted in the diagram. The underlying circular shape is derived from the rotation of the data. Each image in the data set is projected on to the location its projected to in the latent space.
In Hotelling’s formulation he assumed that the function was a linear function. This idea is taken from a wider field known as factor analysis, so Hotelling described the challenge as
\[f_j(z_i; \mathbf{w}) = w_j z_i\]so the answer to the $j$th question is predicted to be a scaling of the subject’s IQ. The scale factor is given by $w_j$. If there are more latent dimensions then a matrix of parameters, $\mathbf{W}$ is used, for example if there were two latent dimensions we’d have
\[f_j(\mathbf{z}_i; \mathbf{W}) = w_{1j} z_{1i} + w_{2j} z_{2i},\]where, if this were a personality test, then $z_{1i}$ might represent the spectrum over a subject’s extrovert/introvert and $z_{2i}$ might represent where the subject was on the rational/perceptual scale. The function would make a prediction about the subjects answer to a particular question on the test (e.g. preference for office job vs preference for outdoor job). In factor analysis the parameters $\mathbf{W}$ are known as the factor loadings and in PCA they are known as the principal components.
Fitting the model involves finding estimates for the loadings, $\mathbf{W}$, and latent variables, $\mathbf{Z}$. There are different approaches including least squares. The least squares approach is used, for example, in recommender systems. In recommender systems this method is called matrix factorization. The the customer characteristics, $\mathbf{x}_i$ is the customer rating for each different product (or item) and the latent variables can be seen as a space of customer preferences. In the recommender system case the loadings matrix also has an interpretation as product similarities.4 Recommender systems have a particular characteristic in that most of the entries of the vector $\mathbf{x}_i$ are missing most of the time.
In PCA and factor analysis the unknown latent factors are dealt with through a probability distribution. They are each assumed to be drawn from a zero mean, unit variance normal distribution. This leaves the factor loadings to be estimated. For PCA the maximum likelihood solution for the factor loadings can be shown to be given by the eigenvalue decomposition of the data covariance matrix. This is algorithmically simple and convenient, although slow to compute for very large data sets with many features and many subjects. The eigenvalue problem can also be derived from many other starting points: e.g. the directions of maximum variance in the data or finding a latent space that best preserves inter-point distances between the data, or the optimal linear compression of the data given a linear reconstruction. These many and varied justifications for the eigenvalue decomposition may account for the popularity of PCA. Indeed, there is even an interpretation for Google’s original PageRank algorithm (which computed the smallest eigenvector of the internet’s linkage matrix) as seeking the dominant principal component of the web.5
Characterizing users according to past buying behavior and combining this with characteristics about products, is key to making good recommendations and returning useful search results. Further advances can be made if we understand the context of a particular session. For example, if a user is buying Christmas presents and searches for a dress, then it could be the case that the user is willing to spend a little more on the dress than in normal circumstances. Characterizing these effects requires more data and more complex algorithms. However, in domains such a search we are normally constrained by the speed with which we need to return results. Accounting for each of these factors while returning results with acceptable latency is a particular challenge.
The final domain of learning we will review is known as reinforcement learning. The domain of reinforcement learning is one that many researchers seem to believe is offering a route to general intelligence. The idea of general intelligence is to develop algorithms that are adaptable to many different circumstances. Supervised learning algorithms are designed to resolve particular challenges. Data is annotated with those challenges in mind. Unsupervised attempts to build representations without any context. But normally the algorithm designer has an understanding of what the broader objective is and designs the algorithms accordingly (for example, characterizing users). In reinforcement learning some context is given, in the form of a reward, but the reward is normally delayed. There may have been many actions that affected the outcome, but which actions had a positive effect and which a negative effect?
One issue for many companies is that the best way of testing the customer experience, A/B testing, prioritizes short term reward. The internet is currently being driven by short term rewards which make it distracting in the short term, but perhaps less useful in the long term. Click-bait is an example, but there are more subtle effects. The success of Facebook is driven by its ability to draw us in when likely we should be doing something else. This is driven by large scale A/B testing.
One open question is how to drive non-visual interfaces through equivalents to A/B testing. Speech interfaces, such as those used in intelligent agents, are less amenable to A/B testing when determining the quality of the interface. Improving interaction with them is therefore less exact science than the visual interface. Data efficient reinforcement learning methods are likely to be key to improving these agent’s ability to interact with the user and understand intent. However, they are not yet mature enough to be deployed in this application.
An area where reinforcement learning methods have been deployed with high profile success is game play. In game play the reward is delayed to the end of the game, and it comes in the form of victory or defeat. A significant advantage of game play as an application area is that, through simulation of the game, it is possible to generate as much data as is required to solve the problem. For this reason, many of the recent advances in reinforcement learning have occurred with methods that are not data efficient.
The company DeepMind is set up around reinforcement learning as an approach to general intelligence. All their most well known achievements are centered around artificial intelligence in game play. In reinforcement learning a decision made at any given time have a downstream affect on the result. Whether the effect if beneficial or not is unknown until a future moment.
We can think of reinforcement learning as providing a label, but the label is associated with a series of data involving a number of decisions taken. Each decision was taken given the understanding of game play at any given moment. Understanding which of these decisions was important in victory or defeat is a hard problem.
In machine learning the process of understanding which decisions were beneficial and which were detrimental is known as the credit allocation problem. You wish to reward decisions that led to success to encourage them, but punish decisions that lead to failure.
Broadly speaking, DeepMind uses an approach to Machine Learning where there are two mathematical functions at work. One determines the action to be taken at any given moment, the other estimates the quality of the board position at any given time. These are respectively known as the policy network and the value network.6 DeepMind made use of convolutional neural networks for both these models.
The ancient Chinese game of Go was considered a challenge for artificial intelligence for two reasons. Firstly, the game tree has a very high branching factor. The game tree is a discrete representation of the game. Every node in the game tree is associated with a board position. You can move through the game tree by making legal a move on the board to change the position. In Go, there are so many legal moves that the game tree increases exponentially. This challenge in Go was addressed by using stochastic game tree search. Rather than exploring the game tree exhaustively they explored it randomly.
Secondly, evaluating the quality of any given board position was deemed to be very hard.7 The value function determines for each player whether they are winning or loosing. Skilled Go players can assess a board position, but they do it by instinct, by intuition. Just as early AI researchers struggled to give rules for detecting cancer, it is challenging to give rules to assess a Go board. The machine learning approach that AlphaGo took is to train a value function network to make this assessment.
The approach that DeepMind took to conquering Go is a model free approach known as Q-learning.6 The model free approach refers to the fact that they don’t directly include a model of how the world evolves in the reinforcement learning algorithm. They make extensive use of the game tree, but they don’t model how it evolves. They do model the expected reward of each position in the game tree (the value function) but that is not the same as modeling how the game will proceed.
An alternative approach to reinforcement learning is to use a prediction function to suggest how the world will evolve in response to your actions. To predict how the game tree will evolve. You can then use this prediction to indirectly infer the expected reward associated with any action. This is known as model based reinforcement learning.
This model based approach is also closer to a control system. A classical control system is one where you give the system a set point. For example, a thermostat in the house. You set the temperature and the boiler switches off when it reaches it. Optimal control is about getting the house to the right temperature as quickly as possible. Classical control is widely used in robotic control and flight control.
One interesting crossover between classical control and machine learning arises because classical optimal control can be seen as a form of model based reinforcement learning. One where the reward is recovered when the set point is reached. In control engineering the prediction function is known as the transfer function. The process of fitting the transfer function in control is known as system identification.
There is some exciting work emerging at the interface between the areas of control and reinforcement learning. Results at this interface could be very important for improving the quality of robotic and drone control.
As we implied above, reinforcement learning can also used to improve user experience. In that case the reward is gained when the user buys a product from us. This makes it closely allied to the area of optimization. Optimization of our user interfaces can be seen as a reinforcement learning task, but more commonly it is thought about separately in the domains of Bayesian optimization or bandit learning.
We use optimization in machine learning to find the parameters of our models. We can do that because we have a mathematical representation of our objective function as a direct function of the parameters.
Examples in this form of optimization include, what is the best user interface for presenting adverts? What is the best design for a front wing for an F1 racing car? Which product should I return top of the list in response to this user’s search?
Bayesian optimization arises when we can’t directly relate the parameters in the system of interest to our objective through a mathematical function. For example, what is the mathematical function that relates a user’s experience to the probability that they will buy a product?
One approach to these problems is to use machine learning methods to develop a surrogate model for the optimization task. The surrogate model is a prediction function that attempts to recreate the process we are finding hard to model. We try to simultaneously fit the surrogate model and optimize the process.
Bayesian optimization methods use a surrogate model (normally a specific form of regression model). They use this to predict how the real system will perform. The surrogate model makes a prediction (with an estimate of the uncertainty) of what the response will be to any given input. Parameters to test are chosen by considering this prediction. Similar to reinforcement learning, this can be viewed as a model based approach because the surrogate model can be seen as a model of the real world. In bandit methods strategies are determined without turning to a model to motivate them. They are model free methods.
Because of their different philosophies, if a class of prediction functions is chosen, then a model based approach might have better average case performance. At least in terms of data efficiency. A model free approach may well have better worst case performance though, because it makes less assumptions about the nature of the data. To put it another way, making assumptions about the data is helpful if they are right: and if the model is sensible they’ll be right on average. However, it is unhelpful if the model is wrong. Indeed, it could be actively damaging. Since we can’t usually guarantee the model is absolutely right, the worst case performance of a model based approach would be poor.
We have introduced a range of machine learning approaches by focusing on their use of mathematical functions to replace manually coded systems of rules. The important characteristic of machine learning is that the form of these functions, as dictated by their parameters, is determined by acquiring data from the real world.
The methods we have introduced are roughly speaking introduced in order of difficulty of deployment. While supervised learning is more involved in terms of collection of data, it is the most straightforward method to deploy once that data is recovered. For this reason, a major focus with supervised learning should always be on maintaining data quality, increasing the efficiency and accountability8 of the data collection pipeline and the quality of features used.
In relation to what AI can and can’t do today Andrew Ng is quoted as saying:
If a typical person can do a mental task with less than one second of thought, we can probably automate it using AI either now or in the near future.9
I would broadly agree with this quote but only in the context of supervised learning. If a human expert takes around that amount of time, then it’s also likely we can acquire the data necessary to build a supervised learning algorithm that can emulate that human’s response.
The picture with regard to unsupervised learning and reinforcement learning is more clouded.
One observation is that for supervised learning we seem to be moving beyond the era where very deep machine learning expertise is required to deploy methods. A solid understanding of machine learning (say to Masters level) is certainly required, but the quality of the final result is likely more dependent on domain expertise and the quality of the data and the information processing pipeline. This seems part of a wider trend where some of the big successes in machine learning are moving rapidly from the domain of science to that of engineering.10
So if we can only emulate tasks that humans take around a second to do, how are we managing to deliver on self driving cars? The answer is that we are constructing engineered systems from sub-components, each of which is a machine learning subsystem. But they are tied together as a component based system in line with our traditional engineering approach. This has an advantage that each component in the system can be verified before its inclusion. This is important for debugging and safety. But in practice we can expect these systems to be very brittle. A human adapts the way in which they drive the car across their lifetime. A human can react to other road users. In extreme situations, such as a car jacking, a human can set to one side normal patterns of behavior, and purposely crash their car to draw attention to the situation.
Supervised machine learning solutions are normally trained offline. They do not adapt when deployed because this makes them less verifiable. But this compounds the brittleness of our solutions. By deploying our solutions we actually change the environment in which they operate. Therefore, it’s important that they can be quickly updated to reflect changing circumstances. This updating happens offline. For a complex mechanical system, such as a delivery drone, extensive testing of the system may be required when any component is updated. It is therefore imperative that these data processing pipelines are well documented so that they can be redeployed on demand.
In practice there can be challenges with the false dichotomy between reproducibility and performance. It is likely that most of our data scientists are caring less about their ability to redeploy their pipelines and only about their ability to produce an algorithm that achieves a particular performance. A key question is how reproducible is that process? There is a false dichotomy because ensuring reproducibility will typically improve performance as it will make it easier to run a rigorous set of explorative experiments. A worry is that, currently, we do not have a way to quantify the scale of this potential problem within companies.
Common to all machine learning methods is the initial choice of useful classes of functions. The deep learning revolution is associated with a particular class of mathematical functions that is proving very successful in what were seen to be challenging domains: speech, vision, language. This has meant that significant advances in problems that have been seen as hard have occurred in artificial intelligence.
GOFAI has its origins in the birth of computer science so it is unsurprising that logic should be the dominant paradigm. There was also a sense in which pure logic felt like it mapped well on to human reasoning. In practice, for any complex real world situation, the number of logical rules required causes an explosion in the state space of the system. This makes logical systems intractable in many applications. ↩
Using the test data in your training procedure is a major error in any machine learning procedure. It is extremely dangerous as it gives a misleading assessment of the model performance. The Baidu ImageNet scandal was an example of a team competing in the ImageNet challenge which did this. The team had announced via the publication pre-print server Arxiv that they had a world-leading performance on the ImageNet challenge. This was reported in the mainstream media. Two weeks later the challenge organizers revealed that the team had created multiple accounts for checking their test performance more times than was permitted by the challenge rules. This was then reported as “AI’s first doping scandal”. The team lead was fired by Baidu. ↩
These models are quite a lot more complex than the simple clustering we describe here. They represent a common ancestor through a cluster center that is then allowed to evolve over time through a mutation rate. The time of separation between different species is estimated via these mutation rates. ↩
One way of thinking about this is to flip the model on its side. Instead of thinking about the $i$th subject and the $j$th characteristic. Assume that each product is the subject. So the $j$th item is thought of as the subject, and each item’s characteristic is given by the rating from a particular user. In this case symmetries in the model show that the matrix $\mathbf{W}$ can now be seen as a matrix of latent variables and the matrix $\mathbf{Z}$ can be seen as factor loadings. So you can think of the method as simultaneously doing a dimensionality reduction on the products and the users. Recommender systems also use other approaches, some of them based on similarity measures. In a similarity measure based recommender system the rating prediction is given by looking for similar products in the user profile and scoring the new product with a score that is a weighted sum of those products. ↩
The interpretation requires you to think of the web as a series of web pages in a high dimensional space where distances between web pages are computed by moving along the links (in either direction). The PageRank is the one dimensional space that best preserves those distances in the sense of an L1 norm. The interpretation works because the smallest eigenvalue of the linkage matrix is the largest eigenvalue of the inverse of the linkage matrix. The inverse linkage matrix (which would be impossible to compute) embeds similarities between pages according to how far apart they are via a random walk along the linkage matrix. ↩
The approach was described early on in the history of machine learning by Chris Watkins, during his PhD thesis in the 1980s. It is known as Q-learning. It’s recent success in the games domain is driven by the use of deep learning for the policy and value functions as well as the use of fast compute to generate and process very large quantities of data. In its standard form it is not seen as a very data-efficient approach. ↩ ↩2
The situation in chess is much easier, firstly the number of possible moves at any time is about an order of magnitude lower, meaning the game tree doesn’t grow as quickly. Secondly, in chess, there are well defined value functions. For example a value function could be based on adding together the points that are associated with each piece. ↩
To try and better embody the state of data readiness in organizations I’ve been proposing “Data Readiness Levels”. More needs to be done in this area to improve the efficiency of the data science pipeline. ↩
The quote can be found in the Harvard Business Review Article “What Artificial Intelligence Can and Can’t Do Right Now”. ↩
This trend was very clear at the moment, I spoke about it at a recent Dagstuhl workshop on new directions for kernel methods and Gaussian processes. ↩
Something that I feel really helped is an analogy that, I think, gets software engineers into the right mindset for working with data analysts.
Data science is not like programming, it is like debugging.
Imagine you are given a body of code, perhaps that was found on a discarded hard drive on the street. Imagine that you knew that the hard drive contained code that you need to make your production system run. You need to integrate some of the code on the hard drive.
Unfortunately, there is a lot more code on the hard drive than the code you need. The code is undocumented, and it’s of uncertain provenance. How do you go about integrating this codebase? The answer is: very carefully.
In the end, the new code you need to run in production may only be a couple of lines: the correctly formated API call to the relevant library on the hard drive. But the amount of thought and time placed into those lines of code will be more than you’ve placed into any line of code you’ve previously written.
The tools we use for data science: interactive notebooks like jupyter, visualization techniques, plotting libraries etc.. These tools are sophisticated ways of peering into the memory of the computer, manipulating what’s in there, making sure the information inside is extracted in a safe way. You will write many lines of code exploring the data. Just as you would write many lines of test code to understand the behaviour of the discarded hard drive does. But the final lines of code that extract value from the data may be very few, like the short API call to code on the discarded hard drive.
This perhaps also explains why it can be hard to plan and resource data science projects, it’s akin to planning and resourcing the debugging of a codebase of unknown provenance.
So what are the lessons here?