Welcome. This is the first of what I anticipate will be periodic updates on the work of the Associate Director for Data Science (ADDS) team at NIH. The goal is to be transparent and informative and we welcome your input at any time.
When accepting the job of ADDS at the NIH, I asked the Director, Francis Collins, to summarize my job description. His answer was simply, “to change the culture of the NIH.” My response was, “and what do I do next week?” Nine weeks after my self-imposed deadline, I am certainly having fun, and while it will take much more than one person to change a deeply ingrained culture centered around specific diseases and organs, the complexity of disease and the value of sharing data across institutional boundaries will drive us forward. All at NIH seem to share this belief, hence the fun part.
How did we get to this ten-week point? In 2011 Dr. Collins formed the Data and Informatics Working Group, which in 2012 released a report highlighting the need to:
- Advance basic and translational science by facilitating and enhancing the sharing of research-generated data.
- Promote the development of new analytical methods and software for this emerging data.
- Increase the workforce in quantitative science toward maximizing the return on the NIH’s public investment in biomedical research.
The report provides a compelling roadmap, and thanks to the outstanding efforts of Drs. Eric Green, Mark Guyer and many others at the NIH, much had already been done to address these needs through the Big Data to Knowledge (BD2K) program. It is now my job to take over and build on this initiative. BD2K is a predominantly extramural program, which will make its first awards this summer, and is intended to foster developments in data science relevant to biomedicine. BD2K consists of training programs, calls to enable data and software discoverability, facilitation of standards efforts, and consortia that will improve all aspects of scientific data handling, analysis and reuse.
After ten weeks we have begun to formulate a strategic plan to elaborate upon BD2K and assemble folks to carry it out. The modus operandi is one of coordination of existing efforts, both intramural and extramural. Currently the “we” is Dr. Jennie Larkin, Program Director for the Advanced Technologies and Surgery Branch, Division of Cardiovascular Sciences, National Heart, Lung, and Blood Institute, and now the Deputy in the ADDS team; Eric and Mark; and the whole BD2K team of over 100 NIH staff, who have other roles but participate on a regular basis. Additional full-time ADDS team members will be joining in the coming months.
We began by talking to a vast array of people to get an idea of the landscape and determine what might be done. These conversations are ongoing and so far include all 27 NIH Institute and Center Directors as well as many members of their staff, a group of amazingly dedicated individuals pulling for the same overall goals and trying to maximize the research that can be done on a flat (at best) budget.
From discussions with stakeholders, an ADDS team-driven plan has begun to emerge that is based on a number of observations, some obvious, including:
- The US government Office of Science and Technology Policy (OSTP) directives on data sharing have defined the why of sharing, but not the how.
- The current situation of matching budgets to data growth will not scale as data grows rapidly and budgets remain flat.
- The problems of the long-term sustainability of biomedical research data have been identified and discussed; the solutions are not so clear.
- We need more information to begin to define and test possible data sustainability models. For example, we do not currently know enough about how existing data are used and thus are underinformed about how to proceed.
- The amount of data that will be generated in years to come, and the demand for those data as part of the digital enterprise, are difficult to estimate.
- As yet we have not fully addressed what fraction of the NIH data science budget should go into data management and analysis versus generating new data.
- The BD2K initiative as currently conceived is only part of the answer. Moreover, Big Data is a lot more than just data. We need to consider all digital assets: data, metadata, software, narrative, workflows, training materials, etc.
- Training in biomedical data science is in need of expansion and coordination.
- Better reward systems, and hence improved recognition of biomedical data scientists, are needed.
How to address the above observations? The answer, as we currently see it, is to have you (the community) help us work through these issues, by considering five programmatic themes that, with the exception of BD2K, are new. Those five themes have one strategic goal:
To enable biomedical research as a digital enterprise through which new discoveries are made and knowledge generated by maximizing community engagement and productivity.
Consider the five programmatic themes:
- BD2K – fostering innovation through partnership with the extramural biomedical research community. BD2K seeks to develop better ways to tackle the challenges (and harness the potential) of biomedical big data, with the goal of establishing a national infrastructure to support biomedical research.
- Sustainability – partnering with the community to address the challenges of maintaining the rapidly growing digital assets that are generated as part of biomedical research.
- Training – preparing the workforce to address the challenges and opportunities of biomedical research as a digital enterprise.
- Evaluation & Reward – defining the means to evaluate the value of data scientists, data, software and other digital assets to the research enterprise and getting all scholars to appreciate that value.
- Communication & Outreach – working with partners (other federal agencies and the private sector, both nationally and internationally, inside and outside of biomedicine) to learn from experience and maximize the value of the digital enterprise, within and across disciplines.
In the coming months we will implement a series of activities to move these five programmatic themes forward. Our activities will be agile – small steps followed by evaluation to determine what next steps to take. The community will be engaged in all aspects of the development – there will be no “build it and they will come.”
To this end we will begin in June with a workshop for all NIH personnel engaged in BD2K so that internally we are all on the same page. Subsequently, in late summer or early fall, we will convene a group of stakeholders to help us chart the course going forward. The draft action plan will be provided for public comment and modified, and a set of actions will subsequently be put in place. At this point in the evolution of our strategy, let's consider what this might look like.
Sustainability
The current thinking is to establish a commons and conduct a series of pilots to evaluate its value to the community. The commons is a concept, the instantiation of which could occur in a variety of ways. Conceptually, it is a shared workspace in which a variety of research objects can reside. It will have a business model such that the contents of the commons are governed by economic realities and hence sustainable. What that means in practice will be governed by cost versus value to the content stakeholders. Contents will also be governed by regulatory and ethical concerns. At a minimum, it will be a biomedical drop box. One step beyond that, we will encourage the community to write apps (also in the commons) to operate on the content and to provide a measure of authentication of different content types. That authentication is metadata important in defining the content. A further step is other computations performed on the content, generating new research objects. One final step is the analysis of the complete commons, perhaps finding commonality across research objects or yielding as yet unimagined findings in biomedicine. Instantiation of the commons requires a few decisions regarding object identification, provenance, etc. The goal is to not reinvent the wheel, keep it lightweight, and encourage exploration. Instantiation would likely be in the cloud, and various public-private partnerships will be explored as pilots. Assuming the commons is viable, there is no limitation on who can participate. As such, no single entity owns the commons, and oversight comes in a way defined by the community itself. The commons can be thought of as a thin layer added to the Internet specifically to support biomedical research. Its value at every step will be judged by the ability to make progress on data-driven biomedical research.
Clearly, if the community endorses the commons, there are many details associated with making it a reality. These details would require developments in data science and testing on biomedical problems by the community, and extramural funding will be provided to support these efforts.
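To make the commons idea a little more concrete, here is a minimal sketch in Python of the steps described above: depositing a research object, identifying it, and running an app that produces a new object with recorded provenance. Everything in it is an assumption made for illustration only: the names `ResearchObject` and `Commons`, the choice of a content hash as the identifier, and the in-memory storage are all hypothetical, and none of it reflects a decision about how the commons would actually be built.

```python
from dataclasses import dataclass
from hashlib import sha256
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class ResearchObject:
    """Hypothetical record for one item held in the commons."""
    content: bytes
    content_type: str                    # e.g. "dataset", "software", "narrative"
    derived_from: Tuple[str, ...] = ()   # identifiers of parent objects (provenance)

    @property
    def identifier(self) -> str:
        # Assumption: an object is identified by a hash of its type,
        # its parents and its content.
        digest = sha256()
        digest.update(self.content_type.encode())
        for parent in self.derived_from:
            digest.update(parent.encode())
        digest.update(self.content)
        return digest.hexdigest()


class Commons:
    """Hypothetical in-memory stand-in for the shared workspace."""

    def __init__(self) -> None:
        self._objects: Dict[str, ResearchObject] = {}

    def deposit(self, obj: ResearchObject) -> str:
        # The "drop box" step: store the object and return its identifier.
        self._objects[obj.identifier] = obj
        return obj.identifier

    def get(self, identifier: str) -> ResearchObject:
        return self._objects[identifier]

    def run_app(self, app: Callable[[bytes], bytes],
                identifier: str, content_type: str) -> str:
        # An "app" operates on existing content and yields a new
        # research object that records where it came from.
        source = self.get(identifier)
        result = ResearchObject(app(source.content), content_type,
                                derived_from=(identifier,))
        return self.deposit(result)

    def lineage(self, identifier: str) -> List[str]:
        # Walk provenance links back to the original deposits.
        chain: List[str] = []
        pending = [identifier]
        while pending:
            current = pending.pop()
            chain.append(current)
            pending.extend(self.get(current).derived_from)
        return chain
```

A deposit followed by one app run might look like `raw_id = commons.deposit(ResearchObject(b"a,b\n1,2\n", "dataset"))` and then `commons.run_app(bytes.upper, raw_id, "dataset")`, after which `lineage` on the new identifier leads back to `raw_id`. The point of the sketch is only that identification and provenance can stay lightweight.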
Training
There is a significant amount of training in components of data science already funded by the NIH, both as extramural grants supporting training/career development and as course materials offered intramurally and extramurally. Additional extramural grants for training are being offered by BD2K to help the workforce prepare for biomedical science as a digital enterprise. Additional courses and course materials are also needed to fill gaps, but first we need to understand the current scope of courses being offered in this fast-growing area. The sense at present is that there is redundancy in some areas and a lack in others. Our initial efforts are anticipated to be in rationalizing what is available so that we can share best practices and enable students and researchers to find relevant training. Next, we anticipate working towards a description of a complete curriculum, much of which we hope will be offered online. We will also contemplate one or more training centers, which would be available for serious hands-on training, both using reference data and in working with data from one’s own NIH-funded research.
Evaluation & Reward
Data-, software-, and standards-related grants need special consideration when reviewed to ensure best practices are used and the value to the community is fully appreciated. In turn, this requires new metrics, which will be the subject of discussion and community engagement in coming months. Those same metrics will be used to highlight the value of data scientists to the digital research enterprise.
Communication & Outreach
Scientific data are global, yet the way these data are maintained is typically national. We need improved cooperation between the funding agencies in the US and with our counterparts worldwide. Such exchanges are beginning and a recent workshop convened by the Gordon and Betty Moore Foundation discussed data sustainability across the federal agencies. The goal is to work towards some common principles for the maintenance of scientific data and to maximize the use of our research dollars through synergy.
BD2K
BD2K is the extramural component of the NIH’s effort towards the digital enterprise. It is aimed at taking full advantage of what the community has to offer and having it contribute to a coherent national infrastructure, likely centered around the commons. Components include significant cataloging efforts for data (the data discovery index), software and standards, training, and a series of BD2K centers that, taken together as a cohesive ecosystem, drive innovation in biomedical data science. Achieving this goal will require an appropriate oversight model and a willingness of all participants to work towards common goals.
So there you have it: a brushstroke of our thinking at the ten-week point as we embark on the notion of biomedical research as a digital enterprise. We look forward to your thoughts at any time – this is a community effort.
Philip E. Bourne 05/16/14
Associate Director for Data Science (ADDS), NIH