The vision for data science at the NIH will undoubtedly change over time. What follows is the current vision as we prepare for the first meeting with Big Data to Knowledge (BD2K) awardees.
Biomedical research is increasingly a digital enterprise demanding new approaches and new expertise. While the research lifecycle has not changed in hundreds of years – ideas turn into hypotheses, hypotheses are tested by experiments, experiments generate data, data are analyzed so that conclusions can be drawn, conclusions are disseminated as narrative, and narratives inspire further ideas – the processes by which this takes place are in a rapid state of change. The research assets that were once physical – lab notebooks on the shelf, images as hardcopy, software as punched cards, journals as hardcopy, etc. – are now digital. These digital research objects, which include data (of various types), software, workflows, publications, and more, not only increase the pace of the scientific process, but also allow connections to be made and patterns to be discovered that were previously hidden. Increasingly, those discoveries are made through the reuse of data for purposes different from those for which they were collected.
Whereas other sectors of the economy and society (e.g., the music industry, commerce, stock exchanges, marketing and social media) have transformed to adapt to the new possibilities of a digital enterprise, the biomedical research enterprise is far from fully leveraging its digital research objects, the transformative effects of which would include increasing the rate of discovery from biomedical research and accelerating translation of those discoveries into improved health and well-being. As the NIH mission is to foster the best biomedical research and discovery to improve health and reduce disease, the NIH has a serious need, and can play a significant role, in fully enabling the generation, preservation, and use of these emerging digital research objects, not only through research funding, but also by means of new policies, new procedures, and enhanced coordination with existing organizations and resources.
When all is said and done, the NIH cannot effect the change to a digital research enterprise – this change will come from the community itself. This represents a cultural shift and is the hardest kind of change to accomplish. Thus, even if the technologies and the policies are in place, the impetus must come from the community. Is the community ready to begin to effect this change? We believe so, and the NIH is poised to help with initiatives like Big Data to Knowledge (BD2K).
This is a huge undertaking and effective progress will include building the foundations upon which trust and belief in this digital enterprise can be based. The loosely used term “big data” is only part of the story, but serves the purpose of bringing attention to the need to think differently.
Let’s start by describing what the digital enterprise should look like, then what it should accomplish, and lastly, how we can get there, noting that, to bring this about, the NIH will need to work in partnership with the broader community (other federal agencies, international bodies, the private sector, societies, etc. beyond health care) to transform biomedical research into a digital enterprise by effecting the necessary scientific, technological, and social changes.
Biomedical Research as a Connected Digital Enterprise
The intent of the scientific enterprise is to support new discovery, to increase the knowledge base, to facilitate sharing of that knowledge among all stakeholders, and to provide appropriate attribution of such contributions. Increasingly, biomedical research both involves and generates a wide variety of digital research objects, each of which can be uniquely identified and each of which has associated provenance and annotation that allows it to be discoverable. Examples of digital research objects include individual datasets in various modalities (basic and clinical), software modules, narratives in the form of publications or grant applications, labels (descriptors) for physical objects (e.g., types of equipment or reagents used), as well as metadata (descriptors) about these digital objects themselves. As biomedical data and other digital research objects become larger, more numerous, and more distributed, and analyses become more complex, a connected digital enterprise will become essential for supporting the NIH mission. The connected biomedical digital research enterprise is the collection of all biomedical digital research objects along with the relationships between those objects. Because the enterprise will be digital, collection, manipulation, and use of all of these research objects can be distributed across geographically dispersed servers, and yet appear to be seamlessly connected with effective interoperability and (where appropriate) widely used common standards, annotation, and/or interfaces. From this foundation can come consistent annotation, consistent application of methods, appropriate access control, and much more, with the goal of increasing the quality of all aspects of the enterprise, a strategy already in play within the private sector.
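To make the definition above concrete, here is a minimal Python sketch of the enterprise as a collection of uniquely identified research objects plus the relationships between them. Everything in it is an illustrative assumption rather than an NIH or BD2K specification: the class names, the `obj:` identifier scheme, the field names, and the `used_by` relationship label are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ResearchObject:
    """A digital research object with a unique identifier and minimal metadata."""
    identifier: str          # hypothetical scheme, e.g. "obj:dataset-001"
    kind: str                # e.g. "dataset", "software", "publication"
    provenance: str          # who or what produced the object
    annotations: tuple = ()  # free-form descriptors that aid discovery


@dataclass
class DigitalEnterprise:
    """The collection of research objects plus the relationships between them."""
    objects: dict = field(default_factory=dict)
    relations: set = field(default_factory=set)  # (subject_id, label, object_id)

    def add(self, obj):
        # Identifiers must be unique across the whole collection.
        if obj.identifier in self.objects:
            raise ValueError(f"duplicate identifier: {obj.identifier}")
        self.objects[obj.identifier] = obj

    def relate(self, subject_id, label, object_id):
        # Both ends of a relationship must already be registered.
        for oid in (subject_id, object_id):
            if oid not in self.objects:
                raise KeyError(oid)
        self.relations.add((subject_id, label, object_id))

    def related_to(self, identifier):
        """Return the sorted identifiers of objects directly linked to this one."""
        linked = set()
        for subject_id, _, object_id in self.relations:
            if subject_id == identifier:
                linked.add(object_id)
            elif object_id == identifier:
                linked.add(subject_id)
        return sorted(linked)


# Usage: link a dataset and a software module to the publication that relies on them.
enterprise = DigitalEnterprise()
enterprise.add(ResearchObject("obj:dataset-001", "dataset", "lab A"))
enterprise.add(ResearchObject("obj:software-001", "software", "lab B"))
enterprise.add(ResearchObject("obj:paper-001", "publication", "labs A and B"))
enterprise.relate("obj:dataset-001", "used_by", "obj:paper-001")
enterprise.relate("obj:software-001", "used_by", "obj:paper-001")
print(enterprise.related_to("obj:paper-001"))
```

In this sketch the relationships are stored centrally for simplicity; in the distributed setting described above, the objects and their links would more plausibly live on different servers and be resolved through shared identifiers.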
What Can Be Accomplished?
Simply tagging research objects with unique identifiers, providing minimal metadata, including provenance, and sharing them (taking into account patient confidentiality etc.) will open up a vast array of possible development efforts, beyond those originally envisaged, that can increase the research objects’ value, particularly when they conform to some standards. As the connected biomedical digital research enterprise grows, here are a few examples of what this transformation can enable:
- Expanded metrics of scientific productivity beyond publications. If properly identified, contributions in the form of digital research objects (data, software, etc.) can be aggregated to open up a whole series of new metrics and indicators of scientific productivity and reward systems. Stated another way: an individual’s scholarly output could be described as a set of digital assets, which would allow those objects to be identified and a quantitative measure of the contribution to be determined. That measure should also recognize and favor collaboration, both to reduce the unhealthy sense of hyper-competitiveness that currently plagues the biomedical research community and to reduce unnecessary duplication of effort, making the biomedical research enterprise more cost-effective. Article-level metrics (ALMs) and tools such as Impactstory are early indicators of what is possible.
- Reproducible science. Including research object identifiers in publications, for both digital objects and physical objects, while not the complete answer, can help facilitate reproducibility by allowing all of the components of a given piece of research to be located and the research to be accurately repeated. In the same way, sharing of workflows (a workflow is an ordered use of digital objects) would support reproducibility of best practices, would allow scientific approaches to be compared and contrasted, and would allow errors to be quickly corrected. RRIDs for antibodies, software, and model organisms are early indicators of what is possible.
- Increased productivity and reduced costs while supporting increased creativity and impact. The envisioned digital enterprise would provide a supporting foundation for the development and evaluation of new standards, software tools, and algorithms, facilitate their adoption across the community, and increase their impact. Currently, the lack of a common foundation significantly hinders adoption of some tools. A widely used foundation would allow development efforts to be spent on advancing knowledge through efficient use of the research objects rather than on attempting to reconcile multiple esoteric, and often incompatible, formats. It would also reduce the re-creation of tools that already exist but are unknown to the new developer. Creative new tools, built to take advantage of the opportunities created by access to large quantities of high-quality data, will enable new discoveries and have an impact far beyond the original problem for which they were developed. Support for the development of the next generation of tools and technologies that are unique to biomedical research and translatable across diseases is essential for creating value from the digital research enterprise.
- Improved interoperability. The envisioned digital enterprise would enable a new level of interoperability and federation among all of the diverse views that the biomedical research enterprise has of its single objective, human health. Software, reagents, data, etc. would be uniquely identified and associated with each other and the publications that rely on them. Efforts in interoperability will in turn highlight anomalies in comparable data, offering new opportunities for improved annotation and quality control as well as the development and adoption of new standards, including standards from other domains.
- Advanced manipulation, analysis and modeling. Larger and more aggregated forms of data call for new methods of manipulation, analysis and modeling of living systems as well as new methods of evaluation.
- Emergence of a new work force. In order to reap the benefits of the digital research enterprise, many biomedical investigators will need additional training to gain the skills necessary to access and manipulate data, run appropriate analyses, and interpret the results. In addition, more data scientists who are also conversant in biomedical science will be needed as technicians, consultants, and research collaborators. A new, transformed work force will make obvious the value of the digital enterprise by fully utilizing biomedical data and methods to make new discoveries.
- A further democratization of science. An increasingly open culture of science and rapid advances in technology accelerate innovations that allow geographically dispersed individuals with limited resources, including minorities, to take advantage of data science, repositories and computing capabilities to work on their scientific questions of interest.
How Can We Get There?
We should be ever conscious of what has gone before, both good and bad, and learn from those experiences. Among the most relevant lessons are:
- “Build it and they will come” only works in rare circumstances.
- The vision should be driven by the desire to solve identified basic, translational and clinical research problems that are in line with the missions of the NIH institutes and centers.
- “Not invented here” thinking should be avoided and existing hardware, software and human resources fully utilized.
- The community owns the digital enterprise; hence transparency and community engagement are critical.
- Fostering new interdisciplinary virtual communities will be necessary.
- The individuality of each institute and center within NIH must be respected while encouragement to work across borders, for example in data integration, should be sought and valued. The intramural program offers a particular opportunity to make this happen.
- Iteration is essential to progress, as it allows agility and the opportunity to learn from small steps that are rigorously evaluated at every opportunity along the way.
- Engaging community-minded individuals to design, develop, and support the digital enterprise allows the advantages of synergy.
- Engaging the next generation of leading scientists is essential.
- Biomedical research problems are so complex that real progress requires a true partnership of different kinds of expertise all coming to bear on a common problem.
- Crowd sourcing can be a useful strategy.
- Competitions can be a useful support mechanism if used judiciously so as to avoid hyper-competitiveness. Such competitions should encourage team science, which brings together diverse expertise.
- Success comes from the correct blend of top-down (specific funding initiatives, regulatory requirements, etc.) and bottom-up (grassroots communities that come together to try to solve problems with or without the NIH, individual investigator-initiated research) approaches.
- Funding is typically national, yet the value of research objects is international. Collaboration between funding sources is needed to maximize the value and sustainability of research objects.
- Privacy and security through appropriate policies and authentication must be developed in tandem.
More specifically, with respect to the Associate Director for Data Science (ADDS) office, BD2K and other trans-NIH Data Science activities:
- On-going and future ADDS/BD2K programs should be related to this vision of the future as a digital enterprise and must be integrated and synergistic both within the program and with what already exists in the community and other parts of NIH.
- BD2K programs must be synchronized with the emergent regulations and policies around shared access, as well as preservation of patient privacy. Conversely, BD2K activities should drive appropriate policy and regulatory efforts.
- The current OSTP directive defines a why for data sharing, but says little as to the how. While a considerable amount of the data generated will find its way into existing repositories, a large fraction belongs to the long tail of science and does not have a home. The how has both practical and economic implications. Practically, data integrators must be part of the discussion, and from an economic perspective, new business models must be tried.
- A commons is needed to accommodate these data and other research objects; the commons would provide storage co-located with computing power to embrace the emergent community-based contributions to the digital enterprise. Multiple commons pilots will push forward this activity in specific instances, to identify and share best practices and allow the community to work across such efforts to ensure interoperability from the start.
- BD2K-funded initiatives can be the leading developers and evangelists for the connected digital enterprise and coordinate themselves in such a way as to develop, or contribute to, the commons and the provision of digital research object discovery, interoperability, annotation, and indexing.
- Standards should be defined and maintained with reference to specific types of research objects. Standards will promote the usability of digital assets and are necessary for interoperability and connectedness within the digital enterprise. Engagement of the community, both biomedical and beyond, is an essential component of both starting and maintaining standards efforts. Standards include the notion of reference datasets and controlled simulation environments to promote benchmarking, training, and repurposing.
- Training is needed at all levels, from the branding of biology (from K-12 education on up) as an analytical science, to retraining of established biomedical researchers to understand and contribute to the emerging digital enterprise. A first step is to catalog what is being taught already to determine what is lacking in specific areas. Such training should engage experts from a variety of disciplines such as computer science, behavioral and social sciences, statistics, mathematics, and others. New physical and virtual training facilities will likely be needed to complete coverage of relevant topics. Educational content itself should be part of the digital enterprise and hence be cataloged and findable using appropriate metadata standards.
- New processes, and perhaps funding mechanisms, are needed for supporting the digital enterprise as part of the NIH ecosystem. These include policies and procedures defining research object accessibility and protection, as well as processes for managing grants relating to the digital enterprise so that appropriate expertise is applied and the value of various types of research objects to the NIH mission is reflected.
- Collaboration between funding agencies is needed to minimize redundancy and maximize value to national and international communities. Cooperative agreements for a single foundation and for research and training exchanges, as well as international cooperation on standards and training, are examples of how the situation can be improved.
The ADDS and BD2K activities thus far are all valuable when cast into a shared vision for a biomedical research infrastructure that makes the sum of those activities greater than the individual parts. This shared vision will not be easy to build, but this is the time in history to make it happen. Without the eager cooperation of individual centers and cataloging efforts, the rate of progress will be slow. We must fund projects that provide the right balance of new science, innovation, and shared infrastructure in order to realize the biomedical discoveries and health applications that will flow from research in a connected digital enterprise. The foundation of the connected biomedical digital enterprise is the standards, identifiers, and commons, but the monument is the discoveries that are made through analyzing the interconnected digital research objects using new tools, technologies, and methods developed in response to the new challenges. The success is the new and improved health and well-being that arises, something we will need to evaluate.
Special thanks to Jennie Larkin, Michelle Dunn, Vivien Bonazzi, Leigh Finnegan, Beth Russell, and Mark Guyer of the ADDS team, to the many NIHers involved in BD2K, and to Carole Goble, Mark Patterson, Melissa Haendel, Maryann Martone, and Chris Mentzel.
Consider a hypothetical use case illustrating what this future might look like:
Researcher x is automatically made aware of researcher y through commonalities in their respective data located in the commons. Intrigued, researcher x reviews the data catalog, locating researcher y’s datasets with their associated usage statistics and commentary by the community. From the datasets, researcher x navigates to the associated publications and starts to explore various ideas, with help from their personal collaboration network when necessary. Now further convinced to contact researcher y, researcher x studies researcher y’s online presence and learns more about researcher y’s area of expertise. Researcher x formulates some specific questions and ideas to engage with researcher y and their research network. A fruitful collaboration ensues and they generate publications, datasets, and software using tools developed by the BD2K Consortium. Their output is captured in PubMed and the commons, and is indexed by the data and software catalogs. Company z automatically identifies all relevant NIH data and software in a specific domain that, based on the metrics from the catalogs, have utilization above a threshold indicating that those data and software are heavily used and respected by the community. An open source version remains, but the company adds services on top of the software for the novice user, and revenue flows back to the labs of researchers x and y, where it is used to develop new innovative software for open distribution. Researchers x and y come to the NIH data science training centers periodically to provide hands-on advice in the use of their new version, and their course is offered as a MOOC.
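The step in which company z selects heavily used data and software could be sketched as a simple filter over catalog entries. This is only an illustration under stated assumptions: the `CatalogEntry` fields, the `heavily_used` function, the idea of a single numeric utilization count, and the sample values are all hypothetical, since the scenario does not say how the catalogs would expose their metrics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogEntry:
    """One indexed data or software object (all fields are hypothetical)."""
    identifier: str
    domain: str       # e.g. "genomics"
    kind: str         # "data" or "software"
    utilization: int  # assumed usage count reported by the catalog


def heavily_used(entries, domain, threshold):
    """Return entries in `domain` with utilization above `threshold`, most used first."""
    selected = [
        entry for entry in entries
        if entry.domain == domain and entry.utilization > threshold
    ]
    return sorted(selected, key=lambda entry: entry.utilization, reverse=True)


# Usage with made-up catalog contents.
catalog = [
    CatalogEntry("obj:a", "genomics", "data", 120),
    CatalogEntry("obj:b", "genomics", "software", 40),
    CatalogEntry("obj:c", "imaging", "data", 500),
    CatalogEntry("obj:d", "genomics", "software", 300),
]
for entry in heavily_used(catalog, "genomics", threshold=100):
    print(entry.identifier, entry.utilization)
```

A raw count is the simplest possible signal; the scenario also mentions community commentary, so a real selection would likely combine several indicators rather than rely on one threshold.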
 The term “biomedical” is used in the broadest sense to include biological, biomedical, behavioral, social, environmental and clinical studies that relate to understanding health and disease.
 NIH’s mission is to “seek fundamental knowledge about the nature and behavior of living systems and the application of that knowledge to enhance health, lengthen life, and reduce illness and disability.”