Computational oncologists know better than those in any other domain the importance of data sharing for progress in understanding complex decisions. Underlying the revolution in “artificial intelligence” is really a revolution in data. But when data is personal or has legal protections placed upon it, there are challenges to data sharing. In this talk we introduce the ideas behind data sharing and the model of data trusts, an approach to data sharing that relies on trust law to form its governance structure.
Deep Health 
From a machine learning perspective, we’d like to be able to interrelate all the different modalities that are informative about the state of the disease. For deep health, the notion is that the state of the disease appears at the more abstract levels. As we descend the model, we express relationships between the more abstract concept, which sits within the physician’s mind, and the data we can measure.
UK Government Stipulation on Data Availability 
As far back as 2012 the UK government was stipulating that NHS records should be made available online to patients. This seemed to herald a new era of patient understanding about their data.
The reaction of the Royal College of General Practitioners at the time was to say that the data that was accessible should be restricted. This raises interesting questions around what data should be made available to a doctor, but not to the individual patient it pertains to.
EMIS Patient Access
The majority of general practitioner patient data in the UK is held by two companies: EMIS and SystmOne. EMIS made their data available through a web portal, <patient.co.uk>.
Simultaneously, the government at the time was stipulating that data access requests should become electronic, allowing an ecosystem by which individuals could trigger access to their own data.
When it came to the advantages that could be derived from the data, NHS Digital proposed an opt-out system for sharing the data. This promised to deliver great benefits in terms of understanding disease. But the launch was botched. The limited scope for opting in and out of the scheme, added to the lack of a clear vision around whether the data was for sale or for health, undermined the confidence of the public. This set back the availability of health data for a number of years.
More recently, the health services are taking a more mature view of data and its utility in aiding an aging population. The Topol review is a recent report into how the NHS can benefit from the data revolution.
A more positive story is around UK Biobank, which is able to centralise records for individuals based on very permissive consent. Still, there are challenges to assimilating data in the same location.
Public Use of Data for Public Good 
Since machine learning methods are so dependent on data, understanding public attitudes to the use of their data is key to developing machine learning methods that maintain the trust of the public. Nowhere are the benefits of machine learning more profound, and the potential pitfalls more catastrophic, than in the use of machine learning in health data.
The promise is for methods that take a personalized perspective on our individual health, but health data is some of the most sensitive data available to us. This is recognised both by the public and by regulation.
With this in mind The Wellcome Trust launched a report on “Understanding Patient Data” authored by Nicola Perrin, driven by the National Data Guardian’s recommendations.
From this report we know that patients trust universities and hospitals more than they trust commercial entities and insurers. However, there are a number of different ways in which data can be mishandled. It is not only the intent of the data controllers that affects our data security.
For example, the recent WannaCry virus attack demonstrated the unpreparedness of much of the NHS IT infrastructure for a virus using an exploit that was well known to the security community. The key point is that the public trust the intent of academics and medical professionals, but actual capability could be at variance with the intent.
The situation is somewhat reminiscent of early aviation, which is where we are with our data science capabilities. By analogy, the engine of the plane is our data security infrastructure, the basic required technology to make us safe. The pilot is the health professional performing data analytics. The job of early pilots, and indeed of today’s bush pilots (who fly to remote places), included a need to understand the mechanics of the engine. Likewise, a health data scientist today needs to deal with the security of the infrastructure as well as the nature of the analysis.
I suspect most passengers would find it disconcerting if the pilot of a 747 was seen working on the engine shortly before a flight. As aviation has become more widespread, there is now a separation of responsibilities between pilots and mechanics. For example, Rolls Royce goes so far as to maintain ownership of their engines, and lease them to the airline. The responsibility for maintenance of the engine is entirely with Rolls Royce, yet the pilot is responsible for the safety of the aircraft and its passengers.
We need to develop a modern data infrastructure which separates the need for security of infrastructure from the decision making of the data analyst.
This separation of responsibility according to expertise needs to be emulated when considering health data infrastructure. This resolves the intent-capability dilemma, by ensuring a separation of responsibilities to those that are best placed to address the issues.
Feudal Era Data Ecosystem 
Our current information infrastructure bears a close relation to feudal systems of government. In the feudal system a lord had a duty of care over his serfs and vassals, a duty to protect subjects. But in practice there was a power asymmetry. In feudal days protection was against Viking raiders; today, it is against information raiders. However, when there is an information leak, when there is some failure in protections, it is already too late.
Alternatively, our data is publicly shared, as in an information commons, akin to the common land of the medieval village. But just as commons were subject to overgrazing and poor management, so it is that much of our data cannot be managed in this way, in particular personal, sensitive data.
I explored this idea further in this Guardian article on 2015/nov/16/information-barons-threaten-autonomy-privacy-online.
Data Governance Toolkit 
With Sylvie Delacroix and Jessica Montgomery we’ve been working towards a data governance toolkit, trying to understand the different approaches to data sharing and access we may need for different types of data.
One major challenge to data sharing is that it means different things to different people and different institutions. The objectives of data sharing can differ among different sharing parties. This leads to a great deal of confusion when discussing mechanisms for data sharing.
For example, in 2016, inspired by conversations with Jonathan Price, I proposed the idea of Data Trusts. This was a data sharing idea specifically targeted at protecting individuals from the vulnerabilities they are exposed to when sharing personal data.
Unfortunately, the idea has since been promoted as a universal panacea for data sharing. As a result, the original concept is inevitably misunderstood, watered down or derided. It seems clear that we need a better understanding of the data sharing landscape, and with Martin Tisne, under a remit given by the AI Council, we’ve begun to do that.
The framework we’re using comes from discussions with Sylvie Delacroix on how to characterise the data governance process. We’ve developed the color wheel shown above.
The color wheel is meant to broadly characterise the different considerations we need to have when developing data governance frameworks and some of the tensions we experience when describing different approaches to data sharing.
Typical social benefits derived from data sharing include better health, and better security. By sharing information widely about society we can better understand how to manage society to wider benefit.
There are individuals in society who are vulnerable due to disempowerment. Many of these vulnerabilities are due to minority status or particular conditions. Protecting vulnerable individuals is a vital component of good governance. This is reflected in, for example, data rights legislation such as the GDPR which defines protected characteristics and prohibited discriminations around sensitive areas such as health, race, religion, sexuality and gender. These protections very often recognise past injustices or systemic biases that we wish to prevent in future. Other vulnerable groups include those we don’t empower to decide for themselves, such as children.
In any governance structure, the route by which participating individuals (or institutions) can represent their own opinion, query decision making and realign the values of the governance organisation with their own is a critical component. Considering groups that are currently disenfranchised (for example, in digital systems, the elderly) also forms a component of the design of the governance structure. Dealing with power asymmetries, such as those we’re experiencing in our current somewhat feudal system of data governance, is also a key challenge for enfranchisement.
Representing individual aspirations as part of the governance structure relates strongly to traditional notions of liberty which involve freedom of action. A particular sense in which we think of individual liberty in digital systems is in terms of control we each have around our individual aspirations.
How the GDPR May Help
Early reactions to the General Data Protection Regulation by companies seem to have been fairly wary, but if we view the principles outlined in the GDPR as good practice, rather than regulation, it feels like companies can only improve their internal data ecosystems by conforming to the GDPR. For this reason, I like to think of the initials as standing for “Good Data Practice Rules” rather than General Data Protection Regulation. In particular, the term “data protection” is a misnomer, and indeed the earliest data protection directive from the EU (from 1981) refers to the protection of individuals with regard to the automatic processing of personal data, which is a much better sense of the term.
If we think of the legislation as protecting individuals, and instead of viewing it as regulation we view it as “Wouldn’t it be good if …”, then, for example, with respect to the “right to an explanation”, we might suggest: “Wouldn’t it be good if we could explain why our automated decision making system made a particular decision”. That seems like good practice for an organization’s automated decision making systems.
The same applies to data minimization principles. Retaining the minimum amount of personal data needed to drive decisions could well lead to better decision making, as it causes us to become intentional about which data is used rather than indulging the sloppier thinking that “more is better” encourages. This is particularly so when we consider that, to be truly useful, data has to be cleaned and maintained.
If GDPR is truly reflecting the interests of individuals, then it is also reflecting the interests of consumers, patients, users, etc., each of whom makes use of these systems. For any company that is customer facing, or any service that prides itself on the quality of its delivery to those individuals, “good data practice” should become part of the DNA of the organization.
Personal Data Trusts 
The machine learning solutions we rely on to drive automated decision making are dependent on data. But with regard to personal data there are important issues of privacy. Data sharing brings benefits, but also exposes our digital selves, from the use of social media data for targeted advertising to influence us, to the use of genetic data to identify criminals or natural family members. Control of our virtual selves maps on to control of our actual selves.
The feudal system that is implied by current data protection legislation has significant power asymmetries at its heart, in that the data controller has a duty of care over the data subject, but the data subject may only discover failings in that duty of care when it’s too late. Data controllers may also have conflicting motivations, and often their primary motivation is not towards the data subject, who is merely one consideration within their wider agenda.
Personal Data Trusts (Edwards 2004; Lawrence 2016; Delacroix and Lawrence 2018) are a potential solution to this problem. They are inspired by the land societies that formed in the 19th century to bring democratic representation to the growing middle classes. A land society was a mutual organisation where resources were pooled for the common good.
A Personal Data Trust would be a legal entity where the trustees’ responsibility was entirely to the members of the trust, so the motivation of the data controllers is aligned only with the data subjects. How data is handled would be subject to the terms under which the trust was convened. The success of an individual trust would be contingent on it satisfying its members with an appropriate balancing of individual privacy with the benefits of data sharing.
Formation of Data Trusts became the number one recommendation of the Hall-Pesenti report on AI, but unfortunately the term was conflated with more general approaches to data sharing that don’t necessarily involve fiduciary responsibilities or personal data rights. It seems clear that we need to better characterise the data sharing landscape as well as propose mechanisms for tackling specific issues in data sharing.
It feels important to have a diversity of approaches, and yet it feels important that any individual trust would be large enough to be taken seriously in representing the views of its members in wider negotiations.
To help clarify some of the issues around data sharing we’ve produced a new website focussing on data trusts, explaining what they do and what they don’t do.
Delacroix, Sylvie, and Neil D. Lawrence. 2018. “Disturbing the ‘One Size Fits All’ Approach to Data Governance: Bottom-up Data Trusts.” SSRN. https://doi.org/10.2139/ssrn.3265315; https://doi.org/10.1093/idpl/ipz014.
Edwards, Lilian. 2004. “The Problem with Privacy.” International Review of Law Computers & Technology 18 (3): 263–94.
Lawrence, Neil D. 2016. “Data Trusts Could Allay Our Privacy Fears.” The Guardian Media & Tech Network. https://www.theguardian.com/media-network/2016/jun/03/data-trusts-privacy-fears-feudalism-democracy.