Introduction to Machine Intelligence
Abstract
With breakthroughs in understanding images, translating language and transcribing speech, artificial intelligence promises to revolutionise the technological landscape. Machine learning algorithms are able to convert unstructured data into actionable knowledge. With the increasing impact of these technologies, society’s interest is also growing. The word intelligence conjures notions of human-like capabilities. But are we really on the cusp of creating machines that match us? We associate intelligence with knowledge, but in this talk I will argue that the true marvel of our intelligence is the way it deals with ignorance. Despite the large strides forward we have made, I will argue that we have a long way to go to deliver on the promise of artificial intelligence. And it is a journey that science and artificial intelligence need to take together.
GREAT AI FALLACY
UNPRECEDENTED COMBINATION OF SCIENCE, SOCIAL SCIENCE, ETC REQUIRED TO DELIVER
SUPERINTELLIGENCE
Given for one instant an intelligence which could comprehend all the forces by which nature is animated and the respective situation of the beings who compose it—an intelligence sufficiently vast to submit these data to analysis—it would embrace in the same formula the movements of the greatest bodies of the universe and those of the lightest atom; for it, nothing would be uncertain and the future, as the past, would be present in its eyes.
Philosophical Essay on Probabilities Laplace (1814) pg 3
If we do discover a theory of everything … it would be the ultimate triumph of human reason, for then we would truly know the mind of God
Stephen Hawking in A Brief History of Time 1988
Life
The curve described by a simple molecule of air or vapor is regulated in a manner just as certain as the planetary orbits; the only difference between them is that which comes from our ignorance.
Probability is relative, in part to this ignorance, in part to our knowledge.
Philosophical Essay on Probabilities Laplace (1814) pg 5
INTRODUCE ENTROPY TO MEASURE IGNORANCE
UNCERTAINTY
But if we conceive a being whose faculties are so sharpened that he can follow every molecule in its course, such a being, whose attributes are still as essentially finite as our own, would be able to do what is at present impossible to us. For we have seen that the molecules in a vessel full of air at uniform temperature are moving with velocities by no means uniform, though the mean velocity of any great number of them, arbitrarily selected, is almost exactly uniform. Now let us suppose that such a vessel is divided into two portions, A and B, by a division in which there is a small hole, and that a being, who can see the individual molecules, opens and closes this hole, so as to allow only the swifter molecules to pass from A to B, and only the slower ones to pass from B to A. He will thus, without expenditure of work, raise the temperature of B and lower that of A, in contradiction to the second law of thermodynamics.
Theory of Heat (Maxwell, 1871) page 308
MEASUREMENT
For instance, the temperature at which ice melts is found to be always the same under ordinary circumstances, though, as we shall see, it is slightly altered by change of pressure. The temperature of steam which issues from boiling water is also constant when the pressure is constant.
These two phenomena therefore–the melting of ice and the boiling of water–indicate in a visible manner two temperatures which we may use as points of reference, the position of which depends on the properties of water and not on the conditions of our senses.
Theory of Heat Maxwell (1871) page 3
HUMANS
| | | | |
|---|---|---|---|
| bits/min | billions | 2,000 | 6 |
| billion calculations/s | ~100 | a billion | a billion |
| embodiment | 20 minutes | 5 billion years | 15 trillion years |
Bandwidth Constrained Conversations
Embodiment factors imply that, in our communication between humans, what is not said is, perhaps, more important than what is said. To communicate with each other we need to have a model of who each of us is.
To aid this, in society, we are required to perform roles, whether as a parent, a teacher, an employee or a boss. Each of these roles requires that we conform to certain standards of behaviour to facilitate communication between ourselves.
Control of self is vitally important to these communications.
The high availability of data undermines human-to-human communication channels by providing new routes to undermining our control of self.
The consequences of this mismatch between power and delivery are to be seen all around us. Just as driving an F1 car with bicycle wheels would be a fine art, so is the process of communication between humans.
If I have a thought and I wish to communicate it, I first of all need to have a model of what you think. I should think before I speak. When I speak, you may react. You have a model of who I am and what I was trying to say, and why I chose to say what I said. Now we begin this dance, where we are each trying to better understand each other and what we are saying. When it works, it is beautiful, but when misdeployed, just like a badly driven F1 car, there is a horrible crash, an argument.
Stories, between humans.
Computer Conversations
Similarly, we find it difficult to comprehend how computers are making decisions, because they do so with more data than we can possibly imagine.
In many respects, this is not a problem; it’s a good thing. Computers and humans are good at different things. But when we interact with a computer, when it acts in a different way to us, we need to remember why.
Just as the first step to getting along with other humans is understanding other humans, so it needs to be with getting along with our computers.
Embodiment factors explain why, at the same time, computers are so impressive in simulating our weather, but so poor at predicting our moods. Our complexity is greater than that of our weather, and each of us is tuned to read and respond to one another.
Their intelligence is different. It is based on very large quantities of data that we cannot absorb. Our computers don’t have a complex internal model of who we are. They don’t understand the human condition. They are not tuned to respond to us as we are to each other.
Embodiment factors encapsulate a profound thing about the nature of humans. Our locked in intelligence means that we are striving to communicate, so we put a lot of thought into what we’re communicating with. And if we’re communicating with something complex, we naturally anthropomorphize them.
We give our dogs, our cats and our cars human motivations. We do the same with our computers. We anthropomorphize them. We assume that they have the same objectives as us and the same constraints. They don’t.
This means that, when we worry about artificial intelligence, we worry about the wrong things. We fear computers that behave like more powerful versions of ourselves that we will struggle to outcompete.
In reality, the challenge is that our computers cannot be human enough. They cannot understand us with the depth we understand one another. They drop below our cognitive radar and operate outside our mental models.
The real danger is that computers don’t anthropomorphize. They’ll make decisions in isolation from us without our supervision, because they can’t communicate truly and deeply with us.
ARTIFICIAL
Artificial Intelligence
One of the struggles of artificial intelligence is that the term means different things to different people. Our intelligence is precious to us, and the notion that it can be easily recreated is disturbing to us. This leads to some dystopian notions of artificial intelligence, such as the singularity.
Depending on whether this powerful technology is viewed as beneficent or maleficent, it can be viewed either as a helpful assistant, in the manner of Jeeves, or a tyrannical dictator.
The history of automation and technology is a history of us adapting to technological change. The invention of the railways created the need for consistent national time to timetable our movements. The development of the factory system in the mills of Derbyshire required workers to operate and maintain the machines that replaced them.
Day by day, however, the machines are gaining ground upon us; day by day we are becoming more subservient to them; more men are daily bound down as slaves to tend them, more men are daily devoting the energies of their whole lives to the development of mechanical life. The upshot is simply a question of time, but that the time will come when the machines will hold the real supremacy over the world and its inhabitants is what no person of a truly philosophic mind can for a moment question.
Samuel Butler in Darwin Among the Machines a letter to the Editor of The Press, 1863
Listening to modern conversations about artificial intelligence, I think the use of the term intelligence has given rise to an idea that this technology will be the But among these different assessments of artificial intelligence is buried an idea, one that
Introduce Linnaeus and the hydra.
The Hamburg Hydra
ARTIFICIAL INTELLIGENCE
DeepFace
The DeepFace architecture (Taigman et al., 2014) consists of layers that deal with translation and rotational invariances. These layers are followed by three locally-connected layers and two fully-connected layers. Color illustrates feature maps produced at each layer. The neural network includes more than 120 million parameters, where more than 95% come from the local and fully connected layers.
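For readers who prefer to see this in code, a very rough PyTorch sketch of that style of pipeline is below. It is not the published DeepFace network: the layer sizes are illustrative guesses, ordinary convolutions stand in for the paper’s locally-connected layers, and the number of output identities is an assumed placeholder.

import torch
import torch.nn as nn

# Illustrative sketch only, not the Taigman et al. (2014) architecture.
class FaceNetSketch(nn.Module):
    def __init__(self, n_identities=1000):  # number of identities is an assumed placeholder
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=11), nn.ReLU(),   # early convolutional/invariance stage
            nn.MaxPool2d(3, stride=2),
            nn.Conv2d(32, 16, kernel_size=9), nn.ReLU(),   # stand-ins for the locally-connected layers
            nn.Conv2d(16, 16, kernel_size=9), nn.ReLU(),
        )
        self.classifier = nn.Sequential(                   # the fully-connected layers hold most of the parameters
            nn.Flatten(),
            nn.LazyLinear(4096), nn.ReLU(),
            nn.Linear(4096, n_identities),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = FaceNetSketch()
logits = model(torch.randn(1, 3, 152, 152))  # DeepFace used 152x152 aligned face crops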
Deep Learning as Pinball
Sometimes deep learning models are described as being like the brain, or as too complex to understand, but one analogy I find useful to help convey the gist of these models is to think of them as being similar to early pinball machines.
In a deep neural network, we input a number (or numbers), whereas in pinball, we input a ball.
Think of the location of the ball on the left-right axis as a single number. Our simple pinball machine can only take one number at a time. As the ball falls through the machine, each layer of pins can be thought of as a different layer of ‘neurons.’ Each layer acts to move the ball from left to right.
In a pinball machine, when the ball gets to the bottom it might fall into a hole defining a score; in a neural network, that is equivalent to the decision: a classification of the input object.
An image has more than one number associated with it, so it is like playing pinball in a hyper-space.
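To make the analogy concrete, here is a minimal numerical sketch of such a ‘pinball machine’: a scalar input falls through two layers of pins (weights) and lands in one of two holes (classes). The layer sizes, random weights and threshold are all made up for illustration; this is not a model of any real task.

import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((5, 1)), rng.standard_normal(5)   # first layer of pins
W2, b2 = rng.standard_normal((1, 5)), rng.standard_normal(1)   # second layer of pins

def score(x):
    """Final left-right position of the ball after falling through the pins."""
    h = np.tanh(W1 @ np.atleast_1d(x) + b1)   # each pin nudges the ball left or right
    return (W2 @ h + b2).item()

def pinball(x):
    """The hole (class) the ball falls into."""
    return 1 if score(x) > 0 else 0

print([pinball(x) for x in [-2.0, -0.5, 0.5, 2.0]])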
%pip install --upgrade git+https://github.com/sods/ods
Learning involves moving all the pins to be in the correct position, so that the ball ends up in the right place when it’s fallen through the machine. But moving all these pins in hyperspace can be difficult.
In a hyper-space you have to put a lot of data through the machine to explore the positions of all the pins. Even when you feed many millions of data points through the machine, there are likely to be regions in the hyper-space where no ball has passed. When future test data passes through the machine along a new route, unusual things can happen.
Adversarial examples exploit this high dimensional space. If you have access to the pinball machine, you can use gradient methods to find a position for the ball in the hyper space where the image looks like one thing, but will be classified as another.
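Continuing with the toy machine above (and still purely for illustration), an adversarial search can be sketched as nudging the input in small steps in the direction that pushes the final score towards the decision boundary until the predicted class flips. Here a finite difference stands in for the gradient, since the toy input is one-dimensional.

def adversarial_nudge(x, step=0.01, max_steps=1000):
    original = pinball(x)
    eps = 1e-4
    for _ in range(max_steps):
        grad = (score(x + eps) - score(x - eps)) / (2 * eps)   # finite-difference gradient of the score
        # push the score towards zero, the boundary between the two holes
        if original == 1:
            x = x - step * np.sign(grad)
        else:
            x = x + step * np.sign(grad)
        if pinball(x) != original:
            break
    return x

x_adv = adversarial_nudge(0.5)
print(pinball(0.5), pinball(x_adv), x_adv)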
Probabilistic methods explore more of the space by considering a range of possible paths for the ball through the machine. This helps to make them more data efficient and gives some robustness to adversarial examples.
Universe isn’t as Gaussian as it Was
The Planck spacecraft was a European Space Agency space telescope that mapped the cosmic microwave background (CMB) from 2009 to 2013. The Cosmic Microwave Background is the first observable echo we have of the big bang. It dates to approximately 400,000 years after the big bang, at a time when the universe was approximately 1,100 times smaller and the temperature of the Universe was high, around \(3 \times 10^3\) Kelvin. The Universe was in the form of a hydrogen plasma. The echo we observe is the moment when the Universe was cool enough for protons and electrons to combine to form hydrogen atoms. At this moment, the Universe became transparent for the first time, and photons could travel through space.
The objective of the Planck spacecraft was to measure the anisotropy and statistics of the Cosmic Microwave Background. This was important because, if the standard model of the Universe is correct, the variations around the very high temperature of the CMB should be distributed according to a Gaussian process.1 Currently our best estimates show this to be the case (Elsner et al., 2016, 2015; Jaffe et al., 1998; Pontzen and Peiris, 2010).
To the high degree of precision that we could measure with the Planck space telescope, the CMB appears to be a Gaussian process. The parameters of its covariance function are given by the fundamental parameters of the universe, for example the amount of dark matter and ordinary matter in the universe.
Simulating a CMB Map
The simulation was created by Boris Leistedt; see the original Jupyter notebook here.
Here we use that code to simulate our own universe and sample from what it looks like.
First we install some specialist software, as well as the matplotlib, scipy and numpy libraries we require.
%pip install camb
%pip install healpy
%matplotlib inline
%config IPython.matplotlib.backend = 'retina'
%config InlineBackend.figure_format = 'retina'
import matplotlib
import matplotlib.pyplot as plt
from matplotlib import rc
from cycler import cycler
import numpy as np

rc("font", family="serif", size=14)
rc("text", usetex=False)
matplotlib.rcParams['lines.linewidth'] = 2
matplotlib.rcParams['patch.linewidth'] = 2
matplotlib.rcParams['axes.prop_cycle'] = cycler("color", ['k', 'c', 'm', 'y'])
matplotlib.rcParams['axes.labelsize'] = 16
import healpy as hp
import camb
from camb import model, initialpower
Now we use the theoretical power spectrum to design the covariance function.
nside = 512  # Healpix parameter, giving 12*nside**2 equal-area pixels on the sphere.
lmax = 3*nside  # band-limit. Should be 2*nside < lmax < 4*nside to get information content.
Now we design our Universe. It is parameterised according to the \(\Lambda\)CDM model. The variables are as follows. H0 is the Hubble parameter (in km/s/Mpc). ombh2 is the physical baryon density parameter. omch2 is the physical dark matter density parameter. mnu is the sum of the neutrino masses (in electron volts). omk is \(\Omega_k\), the curvature parameter, which is here set to 0, giving the minimal six-parameter \(\Lambda\)CDM model. tau is the reionization optical depth.
Then we set ns, the “scalar spectral index.” This was estimated by Planck to be 0.96. Then there’s r, the ratio of the tensor power spectrum to the scalar power spectrum. This has been estimated by Planck to be under 0.11. Here we set it to zero. These parameters are associated with inflation.
# Mostly following http://camb.readthedocs.io/en/latest/CAMBdemo.html with parameters from https://en.wikipedia.org/wiki/Lambda-CDM_model
pars = camb.CAMBparams()
pars.set_cosmology(H0=67.74, ombh2=0.0223, omch2=0.1188, mnu=0.06, omk=0, tau=0.066)
pars.InitPower.set_params(ns=0.96, r=0)
Having set the parameters, we now use the python software “Code for Anisotropies in the Microwave Background” to get the results.
pars.set_for_lmax(lmax, lens_potential_accuracy=0);
results = camb.get_results(pars)
powers = results.get_cmb_power_spectra(pars)
totCL = powers['total']
unlensedCL = powers['unlensed_scalar']

ells = np.arange(totCL.shape[0])
Dells = totCL[:, 0]
Cells = Dells * 2*np.pi / ells / (ells + 1) # change of convention to get C_ell
Cells[0:2] = 0

cmbmap = hp.synfast(Cells, nside,
                    lmax=lmax, mmax=None, alm=False, pol=False,
                    pixwin=False, fwhm=0.0, sigma=None, new=False, verbose=True)
The world we see today, of course, is not a Gaussian process. There are many discontinuities, for example in the density of matter, and therefore in the temperature of the Universe.
We can think of today’s observed Universe, though, as being a consequence of those temperature fluctuations in the CMB. Those fluctuations are only of order \(10^{-6}\) of the scale of the overall temperature of the Universe. But minor fluctuations in that density are what triggered the pattern of formation of the galaxies, and how stars formed and created the elements that are the building blocks of our Earth (Vogelsberger et al., 2020).
Modelling Herd Immunity
This example is taken from Thomas House’s blog post on Herd Immunity. This model was shared at the beginning of the Covid19 pandemic when the first UK lockdown hadn’t yet occurred.
# Pull in libraries needed
%matplotlib inline
import numpy as np
from scipy import integrate
The next piece of code sets up the dynamics of the compartmental model. He doesn’t give the specific details in the blog post, but my understanding is that the states are as follows. x[0] is the susceptible population, those that haven’t had the disease yet. The susceptible population decreases through encounters with infectious people. In Thomas’s model, both x[3] and x[4] are infectious. So the dynamics of the reduction of the susceptible population are given by \[
\frac{\text{d}{S}}{\text{d}t} = - \beta S (I_1 + I_2).
\] Here, I’ve used \(I_1\) and \(I_2\) to represent what appears to be two separate infectious compartments in Thomas’s model. We’ll speculate about why there are two in a moment.
The model appears to be an SEIR model, so rather than becoming infectious directly you first move to an ‘exposed’ state, where you have the disease but you are not yet infectious. There are again two exposed states; we’ll return to that in a moment. We denote the first, x[1], by \(E_1\). We have \[
\frac{\text{d}{E_1}}{\text{d}t} = \beta S (I_1 + I_2) - \sigma E_1.
\] Note that the first term matches the term from the susceptible equation. This is because it is the incoming exposed population.
The exposed population moves to a second compartment of exposure, \(E_2\). I believe the reason for this is that if you use only one exposure compartment, then the statistics of the duration of exposure are incorrect (implicitly they are exponentially distributed in the underlying stochastic version of the model). By using two exposure compartments, Thomas is making a slight correction, which imposes a first-order gamma distribution on those statistics. A similar trick is deployed for the infectious group. So we gain an additional equation to help with these statistics, \[
\frac{\text{d}{E_2}}{\text{d}t} = \sigma E_1 - \sigma E_2,
\] giving us the exposed group as the sum of the two compartments \(E_1\) and \(E_2\). The exposed group from the second compartment then becomes ‘infectious,’ which we represent with \(I_1\); in the code this is x[3], \[
\frac{\text{d}{I_1}}{\text{d}t} = \sigma E_2 - \gamma I_1,
\] and similarly, Thomas is using a two-compartment infectious group to fix up the duration model. So we have \[
\frac{\text{d}{I_2}}{\text{d}t} = \gamma I_1 - \gamma I_2.
\] And finally we have those that have recovered emerging from the second infectious compartment. In this model there is no separate compartment for ‘deaths,’ so the recovered compartment, \(R\), also includes those that die, \[
\frac{\text{d}R}{\text{d}t} = \gamma I_2.
\] All of these equations are then represented in code as follows.
def odefun(t,x,beta0,betat,t0,t1,sigma,gamma):
    # x = [S, E1, E2, I1, I2, R]; beta switches to betat during the intervention window [t0, t1]
    dx = np.zeros(6)
    if ((t>=t0) and (t<=t1)):
        beta = betat
    else:
        beta = beta0
    dx[0] = -beta*x[0]*(x[3] + x[4])              # susceptible
    dx[1] = beta*x[0]*(x[3] + x[4]) - sigma*x[1]  # exposed, first compartment
    dx[2] = sigma*x[1] - sigma*x[2]               # exposed, second compartment
    dx[3] = sigma*x[2] - gamma*x[3]               # infectious, first compartment
    dx[4] = gamma*x[3] - gamma*x[4]               # infectious, second compartment
    dx[5] = gamma*x[4]                            # recovered
    return dx
Here the code takes in the states of the compartments (the values of x) and returns the gradients of those states for the provided parameters (sigma, gamma and beta). Those parameters are set according to the known characteristics of the disease.
The next block of code sets up the parameters of the SEIR model. A particularly important parameter is the reproduction number (\(R_0\)); here Thomas has assumed a reproduction number of 2.5, implying that each infected member of the population transmits the infection, on average, to 2.5 other people. The effective \(R\) decreases over time though, because some of the people an infected person meets will no longer be in the susceptible group: for example, if only 60% of the population remains susceptible, the effective reproduction number falls to \(2.5 \times 0.6 = 1.5\).
# Parameters of the model
N = 6.7e7                 # Total population
i0 = 1e-4                 # 0.5*Proportion of the population infected on day 0
tlast = 365.0             # Consider a year
latent_period = 5.0       # Days between being infected and becoming infectious
infectious_period = 7.0   # Days infectious
R0 = 2.5                  # Basic reproduction number in the absence of interventions
Rt = 0.75                 # Reproduction number in the presence of interventions
tend = 21.0               # Number of days of interventions
The parameters are correct for the ‘discrete system,’ where the infectious period is a discrete time, and the numbers are discrete values. To translate them into our continuous differential equation system’s parameters, we need to do a couple of manipulations. Note the factor of 2 associated with gamma and sigma. This is a doubling of the rate to account for the fact that there are two compartments for each of these states (to fix up the statistics of the duration models).
beta0 = R0 / infectious_period
betat = Rt / infectious_period
sigma = 2.0 / latent_period
gamma = 2.0 / infectious_period
Next we solve the system using scipy’s initial value problem solver. The solution method is an adaptive Runge-Kutta method, as indicated by the 'RK45' solver. This is a numerical method for solving differential equations; the 4 and 5 refer to the orders of the error estimator and of the method.
We can view the solver itself as a piece of simulation code, but here it’s being called as a subroutine in the system. It returns a solution for each time step, stored in a list sol.
This is typical of this type of non-linear differential equation problem. Whether it’s partial differential equations or ordinary differential equations, there’s a step where a numerical solver needs to be called. These are often expensive to run. For climate and weather models, this would be where we solve the Navier-Stokes equations. For this simple model, the solution is relatively quick.
t0ran = np.array([-100, 40, 52.5, 65])
sol = []
for tt in range(0,len(t0ran)):
    sol.append(integrate.solve_ivp(
        lambda t,x: odefun(t,x,beta0,betat,t0ran[tt],t0ran[tt]+tend,sigma,gamma),
        (0.0,tlast),
        np.array([1.0-2.0*i0, 0.0, 0.0, i0, i0, 0.0]),
        'RK45',
        atol=1e-8,
        rtol=1e-9))
In practice, immunity for Covid19 may only last around 6 months. As an exercise, try to extend Thomas’s model for the case where immunity is temporary. You’ll need to account for deaths as well in your new model.
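One way to start on that exercise is sketched below under my own assumptions: the seven-compartment layout, the waning rate omega and the fatality fraction ifr are illustrative choices, not values from Thomas’s post. Immunity decays exponentially back into the susceptible group, and a fraction of those leaving the second infectious compartment move to a death compartment.

import numpy as np

# Illustrative extension only: states are [S, E1, E2, I1, I2, R, D].
def odefun_waning(t, x, beta0, betat, t0, t1, sigma, gamma, omega, ifr):
    dx = np.zeros(7)
    beta = betat if (t >= t0) and (t <= t1) else beta0
    dx[0] = -beta*x[0]*(x[3] + x[4]) + omega*x[5]   # susceptible: infections out, waning immunity back in
    dx[1] = beta*x[0]*(x[3] + x[4]) - sigma*x[1]    # exposed, first compartment
    dx[2] = sigma*x[1] - sigma*x[2]                 # exposed, second compartment
    dx[3] = sigma*x[2] - gamma*x[3]                 # infectious, first compartment
    dx[4] = gamma*x[3] - gamma*x[4]                 # infectious, second compartment
    dx[5] = (1 - ifr)*gamma*x[4] - omega*x[5]       # recovered, with immunity waning at rate omega
    dx[6] = ifr*gamma*x[4]                          # cumulative deaths
    return dx

omega = 1.0/180.0   # immunity lasting around six months (assumed)
ifr = 0.01          # assumed infection fatality fraction, for illustration only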
INTRODUCE THE SOLUTIONS
- Step Change in Science through Machine Learning
- You!
- ML and the Physical World
Thanks!
For more information on these subjects and more you might want to check the following resources.
- twitter: @lawrennd
- podcast: The Talking Machines
- newspaper: Guardian Profile Page
- blog: http://inverseprobability.com
References
Most of my understanding of this is taken from conversations with Kyle Cranmer, a physicist who makes extensive use of machine learning methods in his work. See e.g. Mishra-Sharma and Cranmer (2020) from Kyle and Siddharth Mishra-Sharma. Of course, any errors in the above text are mine and do not stem from Kyle.↩︎