Organisational Data Science

Neil D. Lawrence

at CDEI Away Day on Sep 4, 2022

Neil D. Lawrence, University of Cambridge

Abstract

In this talk we review the challenges in making an organisation data-driven in its decision making. Building on experience working within Amazon and providing advice through the Royal Society convened DELVE group we review challenges and solutions for improving the data capabilities of an institution. This talk is targeted at data-aware leaders working in an institution.

The Challenge

Institutional Character

Before we start, I’d like to highlight one idea that will be key for contextualisation of everything else. There is a strong interaction between the structure of an organisation and the structure of its software.

The API Mandate

The API Mandate was a memo issued by Jeff Bezos in 2002. Internet folklore has the memo making five statements:

All teams will henceforth expose their data and functionality through service interfaces.
Teams must communicate with each other through these interfaces.
There will be no other form of inter-process communication allowed: no direct linking, no direct reads of another team’s data store, no shared-memory model, no back-doors whatsoever. The only communication allowed is via service interface calls over the network.
It doesn’t matter what technology they use.
All service interfaces, without exception, must be designed from the ground up to be externalizable. That is to say, the team must plan and design to be able to expose the interface to developers in the outside world. No exceptions.

The mandate marked a shift in the way Amazon viewed software, moving to a model that dominates the way software is built today, so-called “Software-as-a-Service.”
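As a rough illustration of what the mandate asks of a team, the sketch below shows a consuming team depending only on a service interface, never on the owning team’s data store. All names here (`InventoryService`, `InMemoryInventoryService`, `stock_level`, `can_fulfil`) are hypothetical and not taken from Amazon. The call is made in-process to keep the example short, whereas the memo requires calls over the network.

```python
from typing import Protocol


class InventoryService(Protocol):
    """Hypothetical service interface: the only thing other teams may depend on."""

    def stock_level(self, product_id: str) -> int:
        ...


class InMemoryInventoryService:
    """The owning team's implementation; its data store stays private."""

    def __init__(self, stock: dict[str, int]) -> None:
        self._stock = dict(stock)  # private copy: nothing is shared with callers

    def stock_level(self, product_id: str) -> int:
        # Assumption for this sketch: unknown products are reported as zero stock.
        return self._stock.get(product_id, 0)


def can_fulfil(service: InventoryService, product_id: str, quantity: int) -> bool:
    """A consuming team's code: it talks to the interface, not to the data store."""
    return service.stock_level(product_id) >= quantity


inventory = InMemoryInventoryService({"book-123": 5})
print(can_fulfil(inventory, "book-123", 3))
print(can_fulfil(inventory, "book-123", 8))
```

Because `can_fulfil` is written against the interface, the owning team could change how stock is stored without the consuming team needing to change anything.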

What is less written about is the effect it had on Amazon’s organizational structure. Amazon is set up around the notion of the “two pizza team”: teams of 6-10 people that can theoretically be fed by two (American) pizzas. This structure is tightly interconnected with the software. Each of these teams owns one of these “services.” Amazon is strict about the team that develops the service owning the service in production. This approach is the secret to their scale as a company, and the approach has been adopted by many other large tech companies. The software-as-a-service approach changed the information infrastructure of the company, the routes through which information is shared. This had a knock-on effect on the corporate culture.

Amazon works through an approach I think of as “devolved autonomy.” The culture of the company is widely taught (e.g. Customer Obsession, Ownership, Frugality), a team’s inputs and outputs are strictly defined, but within those parameters, teams have a great deal of autonomy in how they operate. The information infrastructure was devolved, so the autonomy was devolved. The different parts of Amazon are then bound together through shared corporate culture.

Amazon prides itself on agility. I spent three years there and I can confirm things move very quickly. I used to joke that just as a dog year is seven normal years, an Amazon year is four normal years in any other company.

Not all institutions move quickly. My current role is at the University of Cambridge. There are similarities between the way a University operates and the way Amazon operates. In particular Universities exploit devolved autonomy to empower their research leads.

Naturally there are differences too, for example, Universities do less work on developing culture. Corporate culture is a critical element in ensuring that despite the devolved autonomy of Amazon, there is a common vision.

Cambridge University is over 800 years old. Agility is not a word that is commonly used to describe its institutional framework. I don’t want to make a claim for whether an agile institution is better or worse, it’s circumstantial. Institutions have characters, like people. The institutional character of the University is that of a steady and reliable friend. The institutional character of Amazon is more mercurial.

Why do I emphasise this? Because when it comes to organisational data science, when it comes to a data driven culture around our decision making, that culture interplays with the institutional character. If decision making is truly data-driven, then we should expect co-evolution between the information infrastructure and the institutional structures.

A common mistake I’ve seen is to transplant a data culture from one (ostensibly) successful institution to another. Such transplants commonly lead to organisational rejection. The institutional character of the new host will have cultural antibodies that operate against the transplant even if, at some (typically executive) level the institution is aware of the vital need for integrating the data driven culture.

A major part of my job at Amazon was dealing with these tensions. I was a scientist, initially working across the company, and working with my team introduced dependencies and practices that impinged on the devolved autonomy. I face a similar challenge at Cambridge. Our aim is to integrate data driven methodologies with the classical sciences, humanities and across the academic spectrum. The devolved autonomy inherent in University research provides a similar set of challenges to those I faced at Amazon.

My role before Amazon was at the University of Sheffield. Those were quieter times in terms of societal interest in machine learning and data science, but the Royal Society was already convening a working group on Machine Learning. This was my first interaction with policy advice. I’ve continued that interaction today by working with the AI Council, convening the DELVE group to give pandemic advice, serving on the Advisory Council for the Centre for Science and Policy, and the Advisory Board for the Centre for Data Ethics and Innovation. I’m not an expert on the civil service and government, but I can see that many of the themes I’ve outlined above also apply within government. The ideas I’ll talk about today build on the experiences I’ve had at Sheffield, Amazon, and Cambridge alongside the policy work I’ve been involved in to make suggestions of what the barriers are for enabling a culture of data driven policy making.

From Amazon to Policy

Many of the lessons I learned from the Amazon experience have also proved useful in policy advice. At the outbreak of the pandemic, the Royal Society convened a group of “data science experts” to give scientific advice. This group fed into SAGE via the Royal Society’s then president, Venki Ramakrishnan. But it also worked closely with Government departments (as required) to better understand the challenges they were facing and ensure that its policy advice was tailored to the problems at hand.

In Amazon’s case, institutional structure changed to reflect the information infrastructure. In the long term, cultural changes are likely across any institution that wants to level-up in terms of its data driven decision making. These institutional characters will be as varied as those we find in governments, hospitals, Universities and industrial manufacturers. The key question is how to trigger those cultural changes, while preserving the essence of what allows that institution to deliver on its responsibilities. What actions can we take to better understand the steps we need?

A major challenge with science advice is that scientists are unused to responding to operational pressures. A large part of my time at Amazon was spent in the supply chain. Within that, what I refer to as operational science, the best available answer was needed at the moment of decision. Many scientists struggle to operate while events are unfolding. The best examples I’ve seen of this being done in practice were at Amazon, during the weekly business reviews. These meetings looked at the status of the global supply of products. Any interventions required to deal with unexpected demand or restricted supply were decided quickly on the basis of the best information available. A different domain where similar skills are displayed is Formula One race strategy. I’ve worked closely with two of the major F1 teams’ strategy groups. Their need for data driven decision making in the moment leads to a very different set of priorities than those you find in the academic community.

Policy exhibits significant aspects of this form of operational science. I perceive a gap between our desire to deliver such data driven policy and the skill sets required to do this in practice. My best understanding is that this gap tends to be filled with consultants. Short term, this may lead to decisions being made, but long term this is highly problematic as the practice of data driven decision making needs to be tightly integrated with the institution.

In summary, my first point is that different institutions have different characters. The institutional character is, at least in part, driven by its information infrastructure. It’s the supply chain of information that enables informed decision making. The information must flow from where the data is generated, to those that make the decisions.

With the Amazon and F1 examples in mind, I’d like to suggest that no-one is (yet) doing data-science well at scale. And that is largely to do with how recently we’ve gained these capabilities. But with that in mind, I’d like to look at some solutions for integrating the necessary change in culture.

Solutions

1. Executive Awareness

Executive Awareness

The first challenge is Executive Awareness.

In most organisations decision making capability sits with a fairly restricted cadre. For this group to be empowered in decision making, they need to be aware of the problems. Institutional awareness honed through experience is a typical characteristic of this cadre. However, that experience was honed without the modern capabilities we have around data. This means that the intuitions that the executive have about the barriers to data driven decision making are often incorrect. This challenge was even true in Amazon. Despite the Software-as-a-Service approach, data quality within Amazon tended to suffer because it wasn’t monitored. The scale of this problem wasn’t apparent to senior managers until, with my colleague Daniel Marcu, we agitated for questions on data maturity to be assimilated within the organisation’s annual tech survey. I’ll refer to these questions as a Data Maturity Assessment.

Data Maturity Assessment

As part of the work for DELVE on Data Readiness in an Emergency (The DELVE Initiative, 2020), we made recommendations around assessing data maturity (Lawrence et al., 2020). These recommendations were part of a range of suggestions for government to adopt to improve data driven decision making.

Characterising Data Maturity

Different organisations differ in their ability to handle data. A data maturity assessment reviews the ways in which best practice in data management and use is embedded in teams, departments, and business processes. These indicators are themed according to the maturity level. These characteristics would be reviewed in aggregate to give a holistic picture of data management across an organisation.

1. Reactive

Data sharing is not possible or ad-hoc at best.

It is difficult to identify relevant data sets and their owners.
It is possible to access data, but this may take significant time, energy and personal connections.
Data is most commonly shared via ad hoc means, like attaching it to an email.
The quality of data available means that it is often incorrect or incomplete.

2. Repeatable

Some limited data service provision is possible and expected, in particular between neighboring teams. Some limited data provision to distinct teams may also be possible.

Data analysis and documentation is of sufficient quality to enable its replication one year later.
There are standards for documentation that ensure that data is usable across teams.
The time and effort involved in data preparation are commonly understood.
Data is used to inform decision-making, though not always routinely.

3. Managed and Integrated

Data is available through published APIs; corrections to requested data are monitored and API service quality is discussed within the team. Data security protocols are partially automated ensuring electronic access for the data is possible.

Within the organisation, teams publish and share data as a supported output.
Documentation is of sufficient quality to enable teams across the organisation that were not involved in its collection to use it directly.
Procedures for data access are documented for other teams, and there is a way to obtain secure access to data.

4. Optimized

Teams provide reliable data services to other teams. The security and privacy implications of data sharing are automatically handled through privacy and security aware ecosystems.

Within teams, data quality is constantly monitored, for instance through a dashboard. Errors could be flagged for correction.
There are well-established processes to allow easy sharing of high-quality data across teams and track how the same datasets are used by multiple teams across the organisation.
Data API access is streamlined by an approval process for joining digital security groups.

5. Transparent

Internal organizational data is available to external organizations with appropriate privacy and security policies. Decision making across the organisation is data-enabled, with transparent metrics that could be audited through organisational data logs. If appropriate governance frameworks are agreed, data dependent services (including AI systems) could be rapidly and securely redeployed on company data in the service of national emergencies.

Data from APIs are combined in a transparent way to enable decision-making, which could be fully automated or through the organization’s management.
Data generated by teams within the organisation can be used by people outside of the organization.
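The “Optimized” level mentions data quality being constantly monitored, with errors flagged for correction. A minimal sketch of one such check is given below, assuming records arrive as dictionaries and that “quality” is reduced to completeness of a few required fields. The function names, the field names and the treatment of empty strings as missing are all assumptions made for illustration; a real dashboard would track many more indicators.

```python
from typing import Any, Mapping, Sequence


def is_complete(record: Mapping[str, Any], required: Sequence[str]) -> bool:
    """A record is complete if every required field is present and non-empty."""
    # Assumption: None and "" count as missing; zero is a legitimate value.
    return all(record.get(field) not in (None, "") for field in required)


def completeness(records: Sequence[Mapping[str, Any]], required: Sequence[str]) -> float:
    """Fraction of records that are complete (the kind of number a dashboard might show)."""
    if not records:
        return 1.0  # assumption: an empty data set is treated as vacuously complete
    return sum(is_complete(r, required) for r in records) / len(records)


def flag_incomplete(records: Sequence[Mapping[str, Any]], required: Sequence[str]) -> list[int]:
    """Indices of records that should be flagged for correction."""
    return [i for i, r in enumerate(records) if not is_complete(r, required)]


records = [
    {"id": 1, "price": 9.99},
    {"id": 2, "price": None},  # missing value
    {"id": 3},                 # missing field
    {"id": 4, "price": 0.0},   # zero is a valid price, not a missing one
]
required_fields = ["id", "price"]

print(completeness(records, required_fields))
print(flag_incomplete(records, required_fields))
```

Tracking a number like this over time is one way a team could show whether its data quality is improving, rather than relying on impressions.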

Example Data Maturity Questions

Below is a set of questions that could be used in an organisation for assessing data maturity. The questions are targeted at individuals in roles where the decisions are data driven.

I regularly use data to make decisions in my job.
I don’t always know what data is available, or what data is best for my needs.
It is easy to obtain access to the data I need.
I document the processes I apply to render data usable for my department.
To access the data I need from my department I need to email or talk to my colleagues.
The data I would like to use is too difficult to obtain due to security restrictions.
When dealing with a new data set, I can assess whether it is fit for my purposes in less than two hours.
My co-workers appreciate the time and difficulty involved in preparing data for a task.
My management appreciates the time and difficulty involved in preparing data for a task.
I can repeat data analysis I created more than 6 months ago.
I can repeat data analysis my team created more than 6 months ago.
To repeat a data analysis from another member of my team I need to speak to that person directly.
The data my team owns is documented well enough to enable people outside the team to use it directly.
My team monitors the quality of the data it owns through the use of issue tracking (e.g. trouble tickets).
The data my team generates is used by other teams inside my department.
The data my team generates is used by other teams outside my department.
The data my team generates is used by other teams, though I’m not sure who.
The data my team generates is used by people outside of the organization.
I am unable to access the data I need due to technical challenges.
The quality of the data I commonly access is always complete and correct.
The quality of the data I commonly access is complete, but not always correct.
The quality of the data I commonly access is often incorrect or incomplete.
When seeking data, I find it hard to find the data owner and request access.
When seeking data, I am normally able to directly access the data in its original location.
Poor documentation is a major problem for me when using data from outside my team.
My team has a formal process for identifying and correcting errors in our data.
In my team it is easy to obtain resource for making clean data available to other teams.
For projects analyzing data my team owns, the majority of our time is spent on understanding data provenance.
For projects analyzing data other teams own, the majority of our time is spent on understanding data provenance.
My team views data wrangling as a specialized role and resources it accordingly.
My team can account for each data set they manage.
When a colleague requests data, the easiest way to share it is to attach it to an email.
My team’s main approach to analysis is to view the data in a spreadsheet program.
My team has goals that are centred around improving data quality.
The easiest way for me to share data outside the team is to provide a link to a document that explains how our data access APIs work.
I find it easy to find data scientists outside my team who have attempted similar analyses to those I’m interested in.
For data outside my team, corrupt data is the largest obstacle I face when performing a data analysis.
My team understands the importance of meta-data and makes it available to other teams.
Data I use from outside my team comes with meta-data for assessing its completeness and accuracy.
I regularly create dashboards for monitoring data quality.
My team uses metrics to assess our data quality.
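To make such a survey usable in aggregate, responses need to be turned into comparable scores. The sketch below is one possible approach, not the method used in the DELVE report or at Amazon: it assumes a 1 to 5 agreement scale and reverse-scores negatively worded statements so that a higher score always indicates greater maturity. The question identifiers, the scale, and the judgement of which statements are negatively worded are all assumptions for illustration.

```python
from statistics import mean

# Assumed coding: 1 = strongly disagree ... 5 = strongly agree.
SCALE_MIN, SCALE_MAX = 1, 5

# A small illustrative subset of the questions above.
# The boolean marks whether agreement suggests greater maturity (our reading).
QUESTIONS = {
    "use_data": ("I regularly use data to make decisions in my job.", True),
    "unknown_data": ("I don't always know what data is available, or what data is best for my needs.", False),
    "easy_access": ("It is easy to obtain access to the data I need.", True),
}


def item_score(question_id: str, response: int) -> int:
    """Score one response, reversing the scale for negatively worded statements."""
    if not SCALE_MIN <= response <= SCALE_MAX:
        raise ValueError(f"response {response} is outside the {SCALE_MIN}-{SCALE_MAX} scale")
    _, positive = QUESTIONS[question_id]
    return response if positive else SCALE_MIN + SCALE_MAX - response


def summarise(responses: list[dict[str, int]]) -> dict[str, float]:
    """Mean score per question across all respondents."""
    return {
        qid: mean(item_score(qid, r[qid]) for r in responses)
        for qid in QUESTIONS
    }


responses = [
    {"use_data": 5, "unknown_data": 4, "easy_access": 2},
    {"use_data": 3, "unknown_data": 2, "easy_access": 2},
]
summary = summarise(responses)
print(summary)
```

Keeping the scores per question, rather than collapsing them into a single number, preserves the holistic picture the assessment is meant to give: low scores on access point to a different problem than low scores on documentation.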

2. Executive Sponsorship

Executive Sponsorship

Another lever that can be deployed is that of executive sponsorship. My feeling is that organisational change is most likely if the executive is seen to be behind it. This feeds the corporate culture. While it may be a necessary condition, or at least it is helpful, it is not a sufficient condition. It does not solve the challenge of the institutional antibodies that will obstruct long term change. Here by executive sponsorship I mean that of the CEO of the organisation. That might be equivalent to the Prime Minister or the Cabinet Secretary.

A key part of this executive sponsorship is to develop understanding in the executive of how data driven decision making can help, while also helping senior leadership understand what the pitfalls of this decision making are.

Pathfinder Projects

I do exec education courses for the Judge Business School. One of my main recommendations there is that a project is developed that directly involves the CEO, the CFO and the CIO (or CDO, CTO … whichever the appropriate role is) and operates on some aspect of critical importance for the business.

The inclusion of the CFO is critical for two purposes. Firstly, financial data is one of the few sources of data that tends to be of high quality and availability in any organisation. This is because it is one of the few forms of data that is regularly audited. This means that such a project will have a good chance of success. Secondly, if the CFO is bought in to these technologies, and capable of understanding their strengths and weaknesses, then that will facilitate the funding of future projects.

In the DELVE data report (The DELVE Initiative, 2020), we translated this recommendation into that of “pathfinder projects”: projects that cut across departments and involve Treasury. I appreciate that the nuances of the relationship between Treasury and No 10 do not map precisely onto that of CEO and CFO in a normal business. However, the importance of cross-cutting exemplar projects that have the close attention of the executive remains.

3. Devolved Capability

Data Science Champions

This gives us two interventions that can be characterised as developing executive awareness. But the data maturity assessment has the additional advantage of raising awareness among technical staff. The questions in a data maturity assessment have the effect of reminding technical staff what should be possible. Unless those who are doing the analysis demand the right tools, the culture of the organisation won’t change to provide them. Cultural change needs to be bottom up as well as top down.

The third intervention focuses more specifically on the bottom-up.

Any capable institution will have a large degree of domain specific expertise that they deploy in their daily processes. For example, in Amazon we had supply chain experts who had been at the company for over 20 years. They often came from an Operations Research background, or perhaps an economics background. When introducing new machine learning technologies to these experts, one of two reactions was typically encountered. Either the technology was viewed with great suspicion, or it was seen as a panacea. In the former case, there was no trust in my team; in the latter case there was a naive and unwarranted total faith in my team. A productive relationship is only formed where there is the correct amount of respect between the two domains of expertise. Unfortunately, the typical approach is to parachute inexperienced (often younger) machine learning experts into teams of technically experienced (often older) domain experts.

One solution to this quandary is to make the domain experts data science champions. This involves championing their expertise and bringing them in to the centre of data science expertise to develop their understanding of how an information infrastructure could/should be developed in their domain. This leads to a two-way exchange of ideas. The core data science team understands the domains better, and the domain experts understand data science better.

The key point about these champions is that they are centrally educated, bringing together a cross-departmental community, but when appropriate they are redeployed into their own domains to represent these approaches “at the coal face.”

Once redeployed to their home department, they act as data science champions within their domain. The key idea is that it’s the domain experts that need to champion the data science techniques. This is important both for scalability of data science capability and for ensuring that data-driven decision making becomes embedded in the wider institutional culture.

I sometimes refer to these forms of operations as acting as multipliers of capability instead of additional capability. Additional capability would involve adding a data scientist to every team where she is needed. Multiplying capability involves infusing existing teams with data science capabilities.

4. Intersectional Projects

Departmental Planning

One technique used at Amazon to rapidly deploy machine learning capability across the company was to ask each team to include a short paragraph on how they are using machine learning in their operational plans. This was effective in ensuring that machine learning was taken into account in strategic planning for those teams.

To a large extent this may have already happened: in the last CSR I’m sure it was politically opportune to include plans for spending around e.g. artificial intelligence. But I’d be curious to know to what extent those plans received critical review, because there’s a danger that the funding doesn’t influence the culture of the department, but merely creates an additional arm of operations that doesn’t interconnect with the wider agenda.

Greenfield vs Brownfield

Stepping back from the pure government context, we can ask “what are the greatest opportunities for the UK in enabling data-driven innovation?” When answering this, I like to separate “greenfield” innovation and “brownfield” innovation. Both are important, but are different in character.

Greenfield innovation: a previous example of greenfield innovation is the computer games industry in the early days of software, a new form of business that capitalised on new technology to create new companies.

In the 1980s Robert Solow said: “You can see the computer age everywhere but in the productivity statistics.” I see this quote as a consequence of the challenge of brownfield innovation. The largest part of the economy will naturally not (at the outset) be the greenfield innovations. The dominant portion of the economy will be existing businesses, and brownfield innovation means retrofitting the technology into those existing businesses. For example, the process of digital transformation is still not complete in supply chain (my team at Amazon focussed on delivering machine learning technology in that domain).

Solow Paradox

You can see the computer age everywhere but in the productivity statistics.

Robert Solow in Solow (1987)

Artificial intelligence promises automated decision making that will alleviate and revolutionise the nature of work. In practice, we know from previous technological solutions that new technologies often take time to percolate through to productivity. Robert Solow’s paradox saw “the computer age everywhere but in the productivity statistics.” Solow was reviewing Cohen and Zysman (1987) and he quotes them:

We do not need to show that the new technologies produce a break with past patterns of productivity growth. … [That] would depend not just on the possibilities the technologies represent, but rather on how effectively they are used.

Stephen S. Cohen and John Zysman in Cohen and Zysman (1987) quoted by Solow.

This point about effectiveness of use applies equally to the current “revolution” in artificial intelligence. Today, when computers are integrated into our automation of processes they are rarely mentioned. We will know that AI is truly successful when we stop talking about it.

There is a massive opportunity for the UK in brownfield innovation, but it’s easier to talk about greenfield innovation (e.g. how Google does things) regardless of whether those ideas transfer to the institutional characters that we’re looking to form.

The key challenge/opportunity in data driven decision making is bridging the gap. Overall we require an ecosystem of companies that deliver greenfield innovators that address the challenges of the brownfield domains. We can’t afford for there to be such a large gap between e.g. car parts manufacturing in Rotherham and city trading. A similar story plays out at a smaller scale within government. But I suspect the subtleties of this story can be lost in spending plans. The Data Maturity Assessment gives a route to reviewing the extent to which such plans move beyond “fancy words” and translate into the reality of how decisions are being made in a given department.

When thinking of brownfield innovation I find it useful to think of the equivalent tasks in physical infrastructure: the retrofitting of 40 million appliances for use with natural gas that took place in the 1960s and 1970s, or the current moves to a carbon-neutral energy system. I also give the analogy of the flush toilet.

Imagine the challenge of retrofitting flush toilets to a tenement block (brownfield innovation) vs the challenge of integrating a flush toilet within a new build. In particular imagine that the tenement block has people living in it, and they won’t be moving out for the retrofit. There are aspects to this challenge that are infrastructural, and require working out how to put in the plumbing for the building as a whole. But there are other aspects that are personal, and require close discussion with the families that live in the building.

We might expect that in the wider economy traditional businesses will be disrupted by those that exploit the new capabilities; the market-driven nature of that economy would imply that they either evolve or die. But even if we believe that’s the correct approach in the wider economy, the same approach cannot be taken in our schools, hospitals, governments and Universities.

The data maturity assessments give a sense of the scale of the need for the infrastructure, and what might be needed in terms of close interaction with individual departments, teams or groups. The data science champions provide a mechanism to get a more qualitative understanding of the challenges within the different sub-domains and ensure that the facilities can be fitted with the minimal disruption, and when fitted will be properly used.

5. Relevant Education

Bridging Data Science Education

A final piece of the puzzle I would like to mention is education. While there are courses springing up teaching data science in Universities, on-line etc, my own feeling is that these approaches are far less effective than a tailored and integrated approach.

As part of the Accelerate Programme for Scientific Discovery in Cambridge, we have designed a set of bridging courses. These are bootcamp style courses that work closely with researchers, typically PhD students, post-docs but also Masters students and university lecturers. They teach the hands-on skills that are needed to deliver solutions.

Rather than starting with trying to communicate the complex machine learning models that have underpinned the AI revolution, we teach the fundamental tools of data science, for example how to build data sets, including tutorials on extracting data from the web. This approach is taken from noting that most people spend most of their time developing the data set, not modeling. It is this work we need to invest in.

Our courses are tailored and built around examples that attendees bring to their teaching. They are delivered in partnership with Cambridge Spark, a local start up focussed on data science education.

Four Current Examples of Interventions from My Work

Other examples of this form of work include our collaboration with Data Science Africa, which focusses on empowering individuals with solutions for solving challenges that emerge in the African context.

Figure: Address challenges in the way that complex software systems involving machine learning components are constructed to deal with the challenge of Intellectual Debt.

You can find our strategic research agenda here: https://mlatcl.github.io/papers/autoai-sra.pdf.

Challenge

It used to be true that computers only did what we programmed them to do, but today AI systems are learning from our data. This introduces new problems in how these systems respond to their environment.

We need to better monitor how data is influencing decision making and take corrective action as required.

Aim

Our aim is to scale our ability to deploy safe and reliable AI solutions. Our technical approach is to do this through data-oriented software engineering practices and deep system emulation. We will do this through a significant extension of the notion of Automated ML (AutoML) to Automated AI (AutoAI), this relies on a shift from Bayesian Optimisation to Bayesian System Optimisation. The project will develop a toolkit for automating the deployment, maintenance and monitoring of artificial intelligence systems.

Turing AI Fellowship

From December 2019 I began a Senior AI Fellowship at the Turing Institute funded by the Office for AI to investigate the consequences of deploying complex AI systems.

The notion relates to the “Promise of AI”: it promises to be the first generation of automation technology that will adapt to us, rather than us adapting to it. The premise of the project is that this promise will remain unfulfilled with current approaches to systems design and deployment.

A second intervention is dealing with the complexity of the software systems that underpin modern AI solutions. Even if two individuals, say African masters students, who are technically capable and have an interesting idea, deploy their idea, one challenge they face is the operational load in maintaining and explaining their software systems. The challenge of maintaining is known as technical debt (Sculley et al., 2015); the problem of explaining is known as intellectual debt.

The AutoAI project, sponsored by an ATI Senior AI Fellowship, addresses this challenge.

Data Trusts: Empower People through their Data

Figure: The Data Trusts Initiative (http://datatrusts.uk) hosts blog posts helping build understanding of data trusts and supports research and pilot projects.

The third intervention goes direct to the source of the machine’s power. What we are seeing is an emergent digital oligarchy based on the power that comes with aggregation of data. Data Trusts are a form of data intermediary designed to return the power associated with this data accumulation to the originators of the data, that is us.

Personal Data Trusts

The machine learning solutions we are dependent on to drive automated decision making are dependent on data. But with regard to personal data there are important issues of privacy. Data sharing brings benefits, but also exposes our digital selves. Examples range from the use of social media data for targeted advertising to influence us, to the use of genetic data to identify criminals or natural family members. Control of our virtual selves maps on to control of our actual selves.

The feudal system that is implied by current data protection legislation has significant power asymmetries at its heart, in that the data controller has a duty of care over the data subject, but the data subject may only discover failings in that duty of care when it’s too late. Data controllers may also have conflicting motivations: often their primary motivation is not towards the data subject, who is only one consideration in their wider agenda.

Personal Data Trusts (Delacroix and Lawrence, 2018; Edwards, 2004; Lawrence, 2016) are a potential solution to this problem. They are inspired by the land societies that formed in the 19th century to bring democratic representation to the growing middle classes. A land society was a mutual organisation where resources were pooled for the common good.

A Personal Data Trust would be a legal entity where the trustees’ responsibility was entirely to the members of the trust. So the motivation of the data controllers is aligned only with the data subjects. How data is handled would be subject to the terms under which the trust was convened. The success of an individual trust would be contingent on it satisfying its members with appropriate balancing of individual privacy with the benefits of data sharing.

Formation of Data Trusts became the number one recommendation of the Hall-Pesenti report on AI, but unfortunately, the term was confounded with more general approaches to data sharing that don’t necessarily involve fiduciary responsibilities or personal data rights. It seems clear that we need to better characterise the data sharing landscape as well as propose mechanisms for tackling specific issues in data sharing.

It feels important to have a diversity of approaches, and yet it feels important that any individual trust would be large enough to be taken seriously in representing the views of its members in wider negotiations.

Figure: For thoughts on data trusts see Guardian article on Data Trusts.

Figure: Data Trusts were the first recommendation of the Hall-Pesenti Report. Unfortunately, since then the role of data trusts vs other data sharing mechanisms in the UK has been somewhat confused.

See the Guardian articles on Digital Oligarchies and on Information Feudalism.

Data Trusts Initiative

The Data Trusts Initiative, funded by the Patrick J. McGovern Foundation, is supporting three pilot projects that consider how bottom-up empowerment can redress the imbalance associated with the digital oligarchy.

Progress So Far

In its first 18 months of operation, the Initiative has:

Convened over 200 leading data ethics researchers and practitioners;
Funded 7 new research projects tackling knowledge gaps in data trust theory and practice;
Supported 3 real-world data trust pilot projects establishing new data stewardship mechanisms.

Figure: AI@Cam is a Flagship Programme that supports AI research across the University.

Finally, we are working across the University to empower the diversity of expertise and capability we have to focus on these broad societal problems. We recently launched AI@Cam with a vision document that outlines these challenges for the University.

Conclusions

In this talk I’ve put forward the idea that an institution’s character and its information infrastructure are interlinked. This implies that by reforming the information infrastructure, through wider provision of data-driven decision making capabilities, we are also interacting with the institutional character.

Many approaches to integrating data science ignore this and suffer from a phenomenon we can think of as organisational rejection. Despite the intent of the executive, the new capability is isolated from the wider institution and rendered ineffective.

The solution is to co-evolve the information infrastructure and the organisational culture. I’ve introduced five specific actions for doing this.

Institution wide data maturity assessments.
Executive sponsorship of the data science core.
Pathfinder projects that are intersectional.
Data science champions who come from the domains of deployment.
Bridging courses that integrate institutional questions of importance within data science education.

The key design criterion for these actions is that each of these ideas brings short term benefits as well as effecting long term cultural change. Any idea that delivers on these two fronts is also likely to be suitable for integration in an institutional data-science programme.

Thanks!

For more information on these subjects and more you might want to check the following resources.

References

Cohen, S.S., Zysman, J., 1987. Manufacturing matters: The myth of the post-industrial economy. Basic Books, New York.

Delacroix, S., Lawrence, N.D., 2018. Disturbing the ‘one size fits all’ approach to data governance: Bottom-up data trusts. SSRN. https://doi.org/10.2139/ssrn.3265315

Edwards, L., 2004. The problem with privacy. International Review of Law Computers & Technology 18, 263–294.

Lawrence, N.D., 2016. Data trusts could allay our privacy fears.

Lawrence, N.D., Montgomery, J., Paquet, U., 2020. Organisational data maturity. The Royal Society.

Sculley, D., Holt, G., Golovin, D., Davydov, E., Phillips, T., Ebner, D., Chaudhary, V., Young, M., Crespo, J.-F., Dennison, D., 2015. Hidden technical debt in machine learning systems, in: Cortes, C., Lawrence, N.D., Lee, D.D., Sugiyama, M., Garnett, R. (Eds.), Advances in Neural Information Processing Systems 28. Curran Associates, Inc., pp. 2503–2511.

Solow, R.M., 1987. We’d better watch out: Review of Manufacturing Matters by Stephen S. Cohen and John Zysman. New York Times Book Review.

The DELVE Initiative, 2020. Data readiness: Lessons from an emergency. The Royal Society.