Dean’s Blog: Data Science Must Further Incentivize Open Science

In forty years of academic research, what is happening during the COVID-19 pandemic is, like so much else, unprecedented. The emphasis on the researcher as an individual with the viewpoint, “I must hide all I am doing until it is published so I don’t get scooped,” has been replaced by a common purpose: “let’s share ideas, data, results and resulting papers throughout the research lifecycle so we can defeat this damn virus.” You have to ask yourself, why has research not been open and shared (aka accessible to all) all along?

We will get to what this has to do with the School of Data Science (SDS) in a moment. Stay with me as we look more generally at the issue of open science; the implications for a new School of Data Science will become apparent.

Ironically, the majority of the public, and indeed of politicians, have thought that research has been open all along. Why would it not be? The public, through their taxes, paid for a lot of that research after all. As an example, when former Vice President Joe Biden started the Cancer Moonshot after the tragic death of his son, Beau, from brain cancer, he said:

“And by the way, the taxpayers fund $5 billion a year in cancer research every year, but once it’s published, nearly all of that taxpayer-funded research sits behind walls. Tell me how this is moving the process along more rapidly.”

The implications of this throttling of knowledge can be devastating when someone has a serious health problem. To understand why so much research is hidden, we need to look at that research with respect to incentives and business models. An individual researcher, alone or with a small group of collaborators, applies for a grant to do research. At the same time, a colleague down the hall might be applying to the same program with a different group of collaborators. They are competing against each other as individuals for the same pot of money. Some would argue that competition is good at one level, since it drives innovation. On the other hand, greater achievements might be possible working as a team. Teaming involves compensating everyone who is involved (e.g., students, staff who always do the “invisible work”) so as to distribute the rewards. As humans we need that individual praise – some more than others. Once you get past the ideas stage embodied in grants, there is the work of generating data and analyzing that data. Most times, pre-COVID, those data might never be made publicly available at all. If they were made available, it would not be until after the work was published in a scientific journal, which might be years after the data were collected. The pièce de résistance comes when the work is published and the taxpayer has to pay to read the work. That’s right, we all pay for the work to be done and then we pay again to read the results of that work, as Joe Biden came to realize. To understand why this happens, let’s keep it simple and break the discussion into two parts: first, the final scientific articles, and second, other scientific products – data, software, documentation etc.

The first scientific paper was published in 1665 and scientific publishers have played a crucial part in the dissemination of scientific discourse since then. As scientific societies grew in areas of developing science, they too became intertwined with publishing. In my earlier research years I fondly remember going to the recent additions section of our university library and savoring the glossy issues of my favorite scientific journals before they were bound and placed in an orderly way on shelf-upon-shelf. I would marvel at both the content and the quality of its production. All was well until about 25 years ago, in 1995, when the costs of journal subscriptions began to escalate at the same time as the cost to produce and disseminate content electronically plummeted as a result of the Internet. Since that time the digital article (not the journal volume) has become the lingua franca and the paper copy (if it exists) a backup. Some scientists, myself included, and librarians began asking why the costs were so exorbitant and why only a limited audience should get to read work that is sitting there on the Internet. Finally, taking matters into their own hands, these pioneers created alternative business models around various forms of open access (OA) publishing. Among other things, OA flipped the business model from a reader-pays to an author-pays model. Peer review and hence quality was intended to stay the same, but overall costs to the scientific enterprise would be less. And oh yes, anyone in the world could read the content for free, not just those with a subscription. Various flaws appeared in this movement. To name a couple: first, authors (i.e., the scientists) did not see that they were paying directly in the old model, since the cost of library subscriptions was coming out of the indirect income their institution received for each grant they got. Only when it was apparent that that money could be used more directly to aid their research did they get engaged. Second, with costs of publishing much reduced, anyone could become a publisher, and many have, if the amount of junk email I get every morning from one predatory publisher or another wanting to publish or even republish my work is any indication. OA has brought about a quality versus quantity problem. The good news is that many closed access publishers have now moved to provide at least an OA option, driving the scientific publishing market further towards open science.

What has this brief personal view of OA publishing got to do with data science? The answer is everything. Here’s why. Not only would OA content be accessible to data scientists worldwide who want to mine that corpus, but it would be accessible in a form that is useful. Intertwined with the notion of open versus closed access are the issues of copyright and format. In a typical closed access model, the author signs over the ownership of the material to the publisher. Authors no longer own their work, and who can access that work, and how, is determined by the publisher. Moreover, that closed content is frequently only available in PDF format, which is not very accessible to machines. This is unfortunate, since an XML version with tagged markup is generated along the way to the final published article but is not made available. In short, current closed access publishing models make it difficult or impossible for data scientists to mine that content. Now do you see where we are going with this open science diatribe?

Consider the situation with data, although the same arguments can be made for software, experimental protocols and other products of the research pipeline and beyond, extending to, for example, open hardware.

Data are the raw material of science and can be mined in a variety of ways, perhaps ways unimaginable to the scientist(s) that originally generated that data. If the taxpayer paid for it, should not that data be available so that its value to the payer can be maximized? The answer is obviously yes, so why is this an issue? First, the culture of science, notably the reward system, has focused for too long on the final result found in a published paper, not the data that generated that result. For reasons of reuse, more reward to individual scientists should focus on providing high quality data to the community, not just on the resulting paper, which is arguably only an advertisement for the data. I represent a case in point here and it is a story I love to tell. I have a paper that, as of a few moments ago, had over 32,000 citations according to Google Scholar. In the world of scientific bibliometrics this is a big deal. But here is the thing. No one has ever read that paper! Well, not for a long time anyway. It is a paper about a dataset that is used widely by the biomedical community, namely the Protein Data Bank. As of now there is no consistently used way to cite data, so you have to write a paper about that data and cite that to be known and rewarded. Crazy. Here is one more crazy. In some fields, including my own field of biomedicine, we run experiments that generate digital data, we then include that in a research paper and present that paper in an essentially analog form (a PDF). We then pay people to extract that data from the paper and put it back into a digital database where it can be used by others. Crazy on steroids. I often think that if aliens discover us and review how we disseminate science, they will pass us by, assuming no intelligent life exists on Earth, regardless of the scientific achievements themselves.

There are two big levers (aka incentives) through which this situation can change – incentives from funders and incentives from publishers. The flaw in the latter is that some publishers just see data as another way to make money – they want us to buy back our data in the way we buy back our knowledge embodied in research papers. Some publishers have more of the right idea, insisting that the data be available as part of the publication. Funders generally support open data through their data management plans, which need to be submitted and are reviewed as part of the grant submission. The problem is there is little enforcement that the plan was actually followed and that the data are indeed available. Then there is the question of how usable the data are if they are available. Smart and dedicated folks have put a lot of effort into principles for how data should be made available – the FAIR principles, which ask that content be Findable, Accessible, Interoperable, and Reusable. Making data FAIR is time-consuming for the producers of that data and the rewards are few. That needs to change.

I like to think that such change is afoot. Smaller levers are emerging and growing. The new generations of digitally native scientists see the world as more open, and public universities, like our own, are beginning to see their institutional responsibility in a digital world. With a new School we have a particularly important opportunity to further that change. Read on.

The value of open data to data science is obvious – the field would not exist without it. In the seven years we have had students undertaking capstone research projects within the former Institute and now the School of Data Science, essentially all their research used data generated by others. To be fair (a different kind of fair), sometimes access has to be withheld for reasons of personal privacy, national security, protection of indigenous people, etc. However, as a general principle of the school, if you use public data, the results derived from that data, that is, more data and knowledge, should also be made publicly available and FAIR.

Thus a guiding principle of the School of Data Science is: 

Data science exists as a field by virtue of the open access to research tools (think Python and R) and products generated by others. We have an obligation to carry that forward with all the research products we produce.

As stated, an academic institution is not a big lever like a publisher or funder, but it should be able to make a difference. If only it were that simple. The simplicity is lost the moment you look at incentives. Researchers get promoted not by the quality of the research products they produce for use by others, but by the papers they publish in prestigious journals and the amount of grant money they bring in. Certainly it is too much of a stretch to expect that to change overnight, nor is it in alignment with why folks become researchers, but I would like to think there is a better place than where we sit now with respect to incentives. Scientists, like all of us, love to be rewarded, so let’s continue to reward them for their fine papers, but also for the ability of others to better build upon what they have done based on more than the paper alone, that is, based on all output of the research lifecycle – data, methods, software, protocols, workflows and so on. The reuse of research products by others, beyond findings in a paper, should be a reward unto itself.

The School of Data Science is new; as yet, it does not have anything beyond a startup culture and its foundational ethical principles to teach and practice science for maximum public benefit. We would be remiss, given what we gain from open science, not to make open science front and center in our culture. Even though we gain so much from open science, it won’t be easy. Some of our existing team have a long history in a different culture which they understandably find hard to give up. However, others in our team are dedicated to advancing scholarship in direct connection with the open science movement. We must try as a single team and it will be a process, not a foregone conclusion. We owe it to all our stakeholders. What we have agreed to thus far as a team is a promotion and tenure policy that states:

Openness: Tenure in SDS requires a commitment to open scholarship. Evidence of that openness comes in the form of observed degree of collaboration, data and software availability, publishing in open access journals, the use of preprints, the presence of an ORCID and Google Scholar profile and other evidence.

I am proud of this inclusion, but I also recognize parts of it are open to interpretation and indeed it has to be enacted. Since I, as Dean, make the final decisions regarding promotion and tenure we are at least off to a useful start. What else can we do?

Evaluating openness and promoting people accordingly is a hammer.  Instilling a true culture of openness as the norm is something else. Here are a few ways we have come up with to make all our stakeholders naturally see the value of openness and hopefully see it as a normal mode of operation. 

Undertake Research that Demonstrates the Value of Openness – We are avid supporters of the Wikimedia Foundation and their various projects, notably Wikipedia and Wikidata. We have a Wikimedian-in-Residence who supports our efforts in this regard and shows the value of such open products. An example is a capstone project working with the Metropolitan Museum of Art and their large digital collection to better catalog the content through image recognition and make that metadata catalog openly available as Wikidata. Similarly, Scholia is a project to present bibliographic information and scholarly profiles of authors and institutions using Wikidata, the community-curated database supporting Wikipedia and all other Wikimedia projects. Scholia is being developed in the framework of the larger WikiCite initiative, which seeks to index bibliographic metadata in Wikidata about resources that can be used to substantiate claims made on Wikidata, Wikipedia or elsewhere.

Beyond the virtual, we helped to start the Journal of Open Hardware to promote a culture of hardware sharing for scientific instrumentation and have established an Open Hardware Lab.

Build Openness into our Educational Programs – We have a way to go here. There are sessions on the value of openness in our bootcamps and professional development sessions, but no commitment yet to make openness part of our pedagogy, that is, to constantly educate as to the importance of openness to the data science enterprise and to reward that openness. For example, student research projects encourage the sharing of all aspects of the research lifecycle, but we currently fall short of delivering that capability. However, we are currently experimenting with introducing the topic of open data in classes, such as by using open datasets in classroom demonstrations to show the usefulness of a public data commons.

Establish the Commons Concept through Infrastructure – Like a commons as it relates to shared land, a data science commons is a shared virtual space whose governance is also shared and agreed upon for the benefit of all stakeholders. With such capabilities easily and readily available, openness will be facilitated. SDS is currently supporting two commons efforts. The first, through iTHRIV, supports various types of clinical data, which of course has some constraints based on privacy concerns, as it should. The second, the Open Data Lab, is intended as a more open platform to support research that does not involve personally identifiable information (PII), along with other types of output. The former is under active development; the latter needs further evaluation in the context of other open platforms, which continue to evolve and be used to support the commons concept. SDS is agnostic to the tools, but adamant about the concept.

Taken together, our incentives, key personnel, research initiatives and the supporting infrastructure give SDS the beginnings of support for open science. We will be hiring folks who believe in this vision. Time will tell if it evolves into a culture, which implies it is the natural thing to be doing, rather than something the Dean thinks the School should be doing.

Thanks to Brian Nosek, Lane Rasberry and Luis Felipe Rosado Murillo for their thoughtful input.

About Phil Bourne

Stephenson Founding Dean of the School of Data Science and Professor of Data Science & Biomedical Engineering, University of Virginia