Dean’s Blog: Responsible Data Science

While data science, as I have defined it in previous blogs, offers the opportunity to make a very positive impact on society, it can also have negative consequences. It is these negative consequences that get all the attention, thereby downgrading, in the public's eyes at least, what is actually being achieved. This is a slippery slope, in some ways akin to the emergence of atomic energy and, more recently, gene editing, to mention just two breakthroughs. The difference is that data science is much more pervasive: it can affect every aspect of society, and the consequences may not be immediately obvious to those creating the change. Those exploring the potential of atomic energy had no doubt about the nefarious uses of the same technology. With data science it is not so clear. Unintended bias in the data, mischievous use when combined with other forms of data, unexpected business models that cause harm, and so on are harder to anticipate. But we must try, and we must train our students to try.

Within the School of Data Science (SDS) at the University of Virginia (UVA), this effort is embodied in the term responsible data science (RDS). The term is new enough not to have a Wikipedia page, let alone an agreed-upon definition. I doubt that even within our own team we would agree on anything more than a superficial definition. Thus, what follows is more a personal interpretation, itself less important than what we need to accomplish to declare ourselves responsible.

Ethics can be defined as individual, occupational, organizational, or societal morals and values, while responsibility is the practical application of ethical concerns for the benefit of society as a whole. As Julia Stoyanovich and Armanda Lewis have pointed out in the teaching of RDS, considering ethics and responsibility separately is a mistake. Rather than a code of ethics, we need the coding of ethics, in other words, responsible data science. If the two are separated, the context is lost and students start asking, "why are we learning this?" Stoyanovich and Lewis go on to speak of RDS in terms of transparency and interpretability. The Dutch, on the other hand, define RDS in terms of Fairness, Accuracy, Confidentiality, and Transparency (FACT), further expanded upon by van der Aalst et al. Both are valid definitions and need to be embraced not so much in a set of statements as in a set of actions and lifelong modalities.

Statements (aka guiding principles) are likely to ring hollow to data scientists (students, as indicated above, but also professionals), who typically have an engineering bent. The statements must instead be embodied in the actions data scientists perform as they build things, namely, well-structured datasets (including appropriate metadata) and transparent software pipelines and applications. Metrics appeal to engineers, but they too ring hollow when one does not know which metrics to apply when. RDS should be practiced throughout the data lifecycle and indeed the lifecycle of the resulting products. RDS is also about having humans in the loop; applying automated RDS to only a fraction of the process is where problems arise. A data scientist will typically not control the whole process, which is why transparency is so important, but first let us consider the part they do control.

Starting with the metadata and associated data, every effort must be made to ensure the data are accurate, address privacy concerns, are unbiased, and so on, so that transparency can be brought to bear, namely, the ability of others to take these data and use them appropriately, including reproducing the results. The consequences of merging these data with other data (assuming permissible access) must be anticipated as far as possible and, where such consequences cannot be predicted, policies must be put in place that prevent misuse. The methods used to analyze the data and make predictions, notably machine learning, although there are many others, present significant challenges: outcomes are not fully understood and may or may not be valid. Sound statistical support and communication of likelihoods are critical. Visualization and narrative play an important role in communicating the validity of the findings.

All of the above reflects a very STEM bias and could be characterized as the application of research ethics. A humanities scholar likely has a very different view of what RDS represents. I would go so far as to say that a STEM scholar of today (myself included) lacks the perspective to see the broader role of what RDS entails. We were simply not trained to do so. Campuses are uniquely qualified to fill this void. They have experts in each area; they simply need to collaborate and understand each other more than they do now, design appropriate courses, and foster cultures that make RDS integral to every discipline touched by data science, which, of course, is all disciplines.

In a large organization, the work of an individual data scientist represents one component of a larger ecosystem, with a business model that may change over time. It is my view that the greatest contribution of our students to RDS will come not in engineering, but in choosing to work in places where RDS is integral to the business model at all times. In this instance, RDS is aligned less with the individual than with the organization writ large. While I will take great pride in a student who tells me they have contributed to a product that has a positive impact on society, I will take equal pride in a student who walks away from a company that does not practice RDS. Our research and education programs must be constant reminders of this principle. Beyond that, whom we work with, and certainly whom we take money from, requires constant vigilance, as recent cases have shown. We must also be reparatory, setting an example that either directly repairs the harm already done or stands as a counterexample to it.

A question that remains is: responsibility to what? Responsibility means response-ability. If we have the ability to respond, then what do we respond to, and how do we recognize the need or demand to respond? It is too easy to assume we somehow inherently know this, or that it can be easily taught in one or two classes. Can an entire way of being in the world with others (one definition of ethics) be taught so easily? Furthermore, what are the motivations and 'values' for responding? One can imagine that Fascists feel a great deal of responsibility for seeing a certain way of life come about. But we tend not to think of this. We tend to assume that responsibility = good. Yet if we have no critically considered notion of the good, then responsibility becomes a slippery term that can equally be applied to a commitment to, for example, healthcare for all or individual responsibility for one's own healthcare. There is much to understand and do.


It is no good writing about the need for STEM and humanities scholars to come together without actually doing so in what one writes. To that end, this blog was significantly improved through the thoughts and text of Jarret Zigon, Luis Felipe Rosado Murillo and Daniel Mietchen.

About Phil Bourne

Stephenson Founding Dean of the School of Data Science and Professor of Data Science & Biomedical Engineering, University of Virginia