Open Data Science Conference, London
\[\text{data} + \text{model} \stackrel{\text{compute}}{\rightarrow} \text{prediction}\]
|                        | computer   | human           |
|------------------------|------------|-----------------|
| bits/min               | billions   | 2,000           |
| billion calculations/s | ~100       | a billion       |
| embodiment             | 20 minutes | 5 billion years |
\[ \text{odds} = \frac{p(\text{bought})}{p(\text{not bought})} \]
\[ \log \text{odds} = \beta_0 + \beta_1 \text{age} + \beta_2 \text{latitude}.\]
\[ p(\text{bought}) = \sigma\left(\beta_0 + \beta_1 \text{age} + \beta_2 \text{latitude}\right).\]
\[ p(\text{bought}) = \sigma\left(\boldsymbol{\beta}^\top \mathbf{ x}\right).\]
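A minimal sketch of this model in Python (the coefficient values and the example age and latitude below are invented purely for illustration): the log odds are a linear function of the features, and the logistic function \(\sigma(\cdot)\) maps them back to a probability.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps log odds to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients (beta_0, beta_1, beta_2), for illustration only.
beta = np.array([-3.0, 0.05, 0.01])

# Feature vector x = [1, age, latitude]; the leading 1 multiplies the intercept.
x = np.array([1.0, 35.0, 52.0])

log_odds = beta @ x                   # beta^T x
p_bought = sigmoid(log_odds)          # p(bought) = sigma(beta^T x)
odds = p_bought / (1.0 - p_bought)    # p(bought) / p(not bought)

print(f"log odds {log_odds:.3f}, p(bought) {p_bought:.3f}, odds {odds:.3f}")
```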
\[ y= f\left(\mathbf{ x}, \boldsymbol{\beta}\right).\]
We call \(f(\cdot)\) the prediction function.
\[E(\boldsymbol{\beta}, \mathbf{Y}, \mathbf{X})\]
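One common concrete choice for such an objective (an assumption here, since the objective is not specified above) is the negative log likelihood of the logistic model. A short sketch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def objective(beta, Y, X):
    """Negative log likelihood E(beta, Y, X) for a logistic model.

    X is an (n, d) design matrix, Y an (n,) array of 0/1 labels,
    beta a (d,) parameter vector.
    """
    p = sigmoid(X @ beta)
    eps = 1e-12  # guard against log(0)
    return -np.sum(Y * np.log(p + eps) + (1 - Y) * np.log(1 - p + eps))

# Illustrative data: two customers described by [1, age, latitude].
X = np.array([[1.0, 35.0, 52.0],
              [1.0, 62.0, 40.0]])
Y = np.array([1, 0])
print(objective(np.array([-3.0, 0.05, 0.01]), Y, X))
```

Fitting the model means minimising this objective with respect to \(\boldsymbol{\beta}\), for example by gradient descent; the result is the fitted prediction function.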
These are interpretable models: vital for disease modeling and other applications where understanding the model matters.
Modern machine learning methods are less interpretable.
Example: face recognition
Outline of the DeepFace architecture. A front-end of a single convolution-pooling-convolution filtering on the rectified input, followed by three locally-connected layers and two fully-connected layers. Color illustrates feature maps produced at each layer. The net includes more than 120 million parameters, where more than 95% come from the locally-connected and fully-connected layers.
Compare the digital oligarchy with how Africa can benefit from the data revolution.
if statements, for loops, and procedures

The major cause of the software crisis is that the machines have become several orders of magnitude more powerful! To put it quite bluntly: as long as there were no machines, programming was no problem at all; when we had a few weak computers, programming became a mild problem, and now we have gigantic computers, programming has become an equally gigantic problem.
Edsger Dijkstra (1930-2002), The Humble Programmer
The major cause of the data crisis is that machines have become more interconnected than ever before. Data access is therefore cheap, but data quality is often poor. What we need is cheap, high-quality data, which implies efficient processes for improving and verifying data quality.

There would seem to be two ways of improving efficiency. Firstly, we should not duplicate work. Secondly, where possible we should automate work.
Me
Reusability of Data
Deployment of Machine Learning Systems
Data Readiness Levels (Lawrence, 2017b): https://arxiv.org/pdf/1705.02245.pdf
In a data-first company, teams own their data quality issues at least as far as grade B1.
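As a sketch of how a team might track this in practice (hypothetical; the grade scale below is a simplified reading of the bands in Lawrence (2017b), not code from the paper), the readiness grades can be treated as an ordered scale and checked before a dataset enters a pipeline:

```python
from enum import IntEnum

class DataReadiness(IntEnum):
    """Simplified, illustrative ordering of data readiness grades.

    Band C covers accessibility, band B faithfulness and representation,
    band A appropriateness for a particular task; higher value = more ready.
    """
    C4 = 1
    C1 = 2
    B1 = 3
    A1 = 4

def assert_ready(dataset_name: str, grade: DataReadiness,
                 required: DataReadiness = DataReadiness.B1) -> None:
    """Raise if a dataset has not reached the required readiness grade."""
    if grade < required:
        raise ValueError(f"{dataset_name} is at {grade.name}; "
                         f"needs at least {required.name}")

assert_ready("customer_purchases", DataReadiness.B1)  # passes: quality owned to B1
```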
On Governors, James Clerk Maxwell, 1868