This post contains thoughts for a talk given at the UN Global Pulse lab in Kampala as part of the second Data Science in Africa meeting. It covers challenges in data science.

Data is a pervasive phenomenon. It affects all aspects of our activities. This diffuseness is both a challenge and an opportunity. A challenge, because our expertise is spread thinly: like raisins in a fruitcake, or nuggets in a gold mine. An opportunity, because if we can resolve the challenges of diffusion we can foster multi-faceted benefits across the entire University.

What Got Us Here

The old world of data was formulated around the relationship between human and data. Data was expensive to collect, and the focus was on minimising subjectivity through randomised trials and hypothesis testing.

The modern phenomenon is highlighted by the trinity of human, data and computer. The communication channel between computer and data now has an extremely high bandwidth. The channels between human and computer and between data and human remain narrow.

Historically, the interaction between human and data was necessarily restricted by our capability to absorb its implications and the laborious tasks of collection, collation and validation. The bandwidth of communication between human and computer was limited (perhaps at best hundreds of bits per second).

This status quo has been significantly affected by the coming of the digital age and the development of fast digital computers with extremely high communication bandwidth. In particular, today, our computing power is widely distributed and communication occurs at Gigabits per second. Data is now often collected through happenstance. Its collation can be automated. The cost per bit has dropped dramatically, but the care with which it is collected has significantly decreased.

Traditional data analysis focussed on the interaction between data and human. Sometimes these data may have been processed by computer, but often through human-driven data entry.

Today, massively interconnected processing power combined with widely deployed sensorics has led to manyfold increases in the bandwidth of the channel between data and computer. This leads to two effects:

  • automated decision making within the computer based only on the data.
  • a requirement to better understand our own subjective biases to ensure that the human to computer interface formulates the correct conclusions from the data.

This process has already revolutionised biology, leading to computational biology and a closer interaction between computational, mathematical and wet lab scientists. Now we are seeing new challenges in health and computational social sciences. The area has been widely touted as ‘big data’ in the media and the sensorics side has been referred to as the ‘internet of things’. In some academic fields overuse of these terms has already caused them to be viewed with some trepidation. However, the phenomena to which they refer are very real. With this in mind we choose the term ‘data science’ to refer to the wider domain of studying these effects and developing new methodologies and practices for dealing with them.

The main shift in dynamic we’d like to highlight is from the direct pathway between human and data (the traditional domain of statistics) to the indirect pathway between human and data via the computer scientist. This change of dynamics gives us the modern and emerging domain of data science.

Challenges

The field of data science is rapidly evolving. Different practitioners from different domains have their own perspectives. In this post we identify three broad challenges that are emerging: challenges which have not been addressed in the traditional sub-domains of data science. The challenges have social implications but require technological advances for their solutions.

Paradoxes of the Data Society

The first challenge we’d like to highlight is the unusual paradoxes of the data society. It is too early to determine whether these paradoxes are fundamental or transient. Evidence for them is still somewhat anecdotal, but they seem worthy of further attention.

The Paradox of Measurement

The first paradox is the paradox of measurement in the data society. We are now able to quantify the actions of individuals in society to a greater and greater degree, and this might lead us to believe that social science, politics and economics can now obtain a far richer characterisation of the world around us. Paradoxically, it seems that as we measure more, we understand less.

Facebook and Twitter give us the opinions of many individuals on particular topics, but only of those who choose to express them.

How could this be possible? It may be that the greater preponderance of data is making society itself more complex. Traditional approaches to measurement (e.g. polling by random subsampling) are therefore becoming harder, for example due to more complex batch effects.
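To illustrate what such a batch effect can do to a subsampled poll, here is a minimal, hypothetical simulation (not from the talk: the group names, sizes, support levels and response rates are all invented for illustration). When every group responds at the same rate the subsample recovers the true level of support, but when response rates differ by group the estimate is systematically biased, even though the sample itself was drawn at random.

    # Hypothetical sketch: polling by random subsampling under differential response.
    import random

    random.seed(0)

    # Simulated population: two groups with different (invented) support levels.
    population = (
        [("urban", 1 if random.random() < 0.6 else 0) for _ in range(60_000)]
        + [("rural", 1 if random.random() < 0.4 else 0) for _ in range(40_000)]
    )
    true_support = sum(v for _, v in population) / len(population)

    def poll(population, n, response_rate):
        """Draw a random subsample of size n, keeping only those who 'respond'."""
        sample = random.sample(population, n)
        respondents = [v for g, v in sample if random.random() < response_rate[g]]
        return sum(respondents) / len(respondents)

    uniform = {"urban": 0.5, "rural": 0.5}  # everyone equally likely to respond
    skewed = {"urban": 0.7, "rural": 0.2}   # a 'batch effect': rural voices under-represented

    print(f"true support:     {true_support:.3f}")
    print(f"uniform response: {poll(population, 5_000, uniform):.3f}")  # close to the truth
    print(f"skewed response:  {poll(population, 5_000, skewed):.3f}")   # systematically biased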

The end result is that we have a Curate’s egg of a society: it is only ‘measured in parts’. Whether by examination of social media or through polling, we no longer obtain the overall picture that is necessary to understand society as a whole.

Two examples: polling and clinical trials.

The answer is more classical statistics and more investment in those techniques. A possible explanation is as follows: we now have direct access to, and better characterisation of, some parts of society, but simultaneously our ability to quantify the whole has not kept pace.

  • A tsunami of data makes it difficult to detect the ripples: the vocal minority can drown out the wider signal.

Filter Bubbles and Echo Chambers

A related effect is our own ability to judge the wider society in our countries and across the world. It is now possible to be connected with friends and relatives across the globe, and one might hope that would lead to greater understanding between people. Paradoxically, it may be the case that the opposite is occurring, that we understand each other less well.

This argument, sometimes summarised as the ‘filter bubble’ or the ‘echo chamber’ is based on the idea that our information sources are now curated, either by ourselves or by algorithms working to maximise our interaction. Twitter feeds, for example, contain comments from only those people you follow. Facebook’s newsfeed is ordered to increase your interaction with the site.

In our diagram above, if humans have a limited bandwidth through which to consume their data, and that bandwidth is saturated with filtered content, e.g. ideas which they agree with, then it might be the case that we become more entrenched in our opinions than we were before. We don’t see ideas that challenge our opinions.
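As a rough sketch of this argument (a hypothetical simulation, not part of the talk; the parameters and update rule are invented for illustration), consider an agent that can only consume a handful of content items per step. With an unfiltered feed its opinion drifts towards the population average; with a feed curated to match its existing view it becomes more entrenched.

    # Hypothetical sketch: limited bandwidth plus a curated feed entrenches opinion.
    import random

    random.seed(1)

    def simulate(filtered, steps=200, bandwidth=5):
        opinion = 0.6                                       # mildly positive view (scale 0..1)
        for _ in range(steps):
            stream = [random.random() for _ in range(100)]  # diverse incoming content
            if filtered:
                # curation: keep only items on the same side of the debate as the agent
                stream = [x for x in stream if (x > 0.5) == (opinion > 0.5)]
            consumed = stream[:bandwidth]                   # limited bandwidth
            opinion += 0.05 * (sum(consumed) / len(consumed) - opinion)
        return opinion

    print(f"unfiltered feed -> opinion {simulate(False):.2f}")  # ~0.50: the population average
    print(f"filtered feed   -> opinion {simulate(True):.2f}")   # ~0.75: more entrenched than at the start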

This is not a purely new phenomenon: in the past, people’s perspectives were certainly influenced by the community in which they lived, but the scale on which this can now occur is much larger than it has been before.

Quantifying the Value of Data

Most of the data we generate is not directly usable. To make it usable we first need to identify it, then curate it and finally make it accessible. Historically this work was done because data was actively collected and the collection of data was such a burden in itself that its curation was less of an additional overhead. Now we find that there’s a sea of data, but most of it is undrinkable. We require data-desalination before it can be consumed.

  • How do we measure value in the data economy?
  • How do we encourage data workers in curation and management?
    • Incentivization
    • Quantifying the value in their contribution.

This relates also to the questions of measurement above. Direct work on data generates an enormous amount of ‘value’ in the data economy, yet the credit allocation for this work is not properly accounted for. I have visited many organisations where the curation of data is treated almost as a dirty afterthought: you might be shown simulations of cities of questionable value (in the real world), but highlighting work at the data-face is rare. Until this work is properly quantified, the true worth of an organisation will not be understood.

Privacy and Loss of Control

While society is perhaps becoming harder to monitor, the individual is becoming easier. Our behaviour can now be tracked to a far greater extent than ever before.

  • Twitter monitoring for ‘hate speech’ vs Twitter monitoring for ‘political dissent’.
  • Recommendations, Filter bubbles.

Fairness, accountability and transparency.

Further

  • Education of the workforce? Integration of data and application? The advantage of data as being ‘infrastructure free’.

Data trusts

Google & NHS