ADDS Current Vision Statement, October, 2014

The vision for data science at the NIH will undoubtedly change over time. What follows is the current vision as we prepare for the first meeting with Big Data to Knowledge (BD2K) awardees.

Introduction

Biomedical[1] research is increasingly a digital enterprise demanding new approaches and new expertise. While the research lifecycle has not changed in hundreds of years – ideas turn into hypotheses, hypotheses are tested by experiments, experiments generate data, data are analyzed so that conclusions can be drawn, conclusions are disseminated as narrative, and narratives inspire further ideas – the processes by which this takes place are in a rapid state of change. The research assets that were once physical (lab notebooks on the shelf, images as hardcopy, software as punched cards, journals as hardcopy, etc.) are now digital. These digital research objects[2], which include data (of various types), software, workflows, publications, and more, not only increase the pace of the scientific process, but also allow connections to be made and patterns to be discovered that were previously hidden. Increasingly, those discoveries are made through the reuse of data for purposes different from those for which they were collected.

Whereas other sectors of the economy and society (e.g., the music industry, commerce, stock exchanges, marketing and social media) have transformed to adapt to the new possibilities of a digital enterprise, the biomedical research enterprise is far from fully leveraging its digital research objects, the transformative effects of which would include increasing the rate of discovery from biomedical research and accelerating translation of those discoveries into improved health and well-being. As the NIH mission[3] is to foster the best biomedical research and discovery to improve health and reduce disease, the NIH has a serious need, and can play a significant role, in fully enabling the generation, preservation, and use of these emerging digital research objects, not only through research funding, but also by means of new policies, new procedures, and enhanced coordination with existing organizations and resources.

When all is said and done, the NIH cannot effect the change to a digital research enterprise – this change will come from the community itself. This represents a cultural shift, which is the hardest kind of change to accomplish. Thus, even if the technologies and the policies are in place, the impetus must come from the community. Is the community ready to begin to effect this change? We believe so, and the NIH is poised to help with initiatives like Big Data to Knowledge (BD2K)[4].

This is a huge undertaking and effective progress will include building the foundations upon which trust and belief in this digital enterprise can be based. The loosely used term “big data” is only part of the story, but serves the purpose of bringing attention to the need to think differently.

Let’s start by describing what the digital enterprise should look like, then what it should accomplish, and lastly, how we can get there, noting that, to bring this about, the NIH will need to work in partnership with the broader community (other federal agencies, international bodies, the private sector, societies, etc. beyond health care) to transform biomedical research into a digital enterprise by effecting the necessary scientific, technological, and social changes.

Biomedical Research as a Connected Digital Enterprise

The intent of the scientific enterprise is to support new discovery and increase the knowledge base, to facilitate sharing of that knowledge among all stakeholders, and to provide appropriate attribution of such contributions. Increasingly, biomedical research both involves and generates a wide variety of digital research objects, each of which can be uniquely identified and each of which has associated provenance and annotation that allows it to be discoverable. Examples of digital research objects include individual datasets in various modalities (basic and clinical), software modules, narratives in the form of publications or grant applications, labels (descriptors) for physical objects (e.g., types of equipment, reagents used, etc.), as well as metadata (descriptors) about these digital objects themselves. As biomedical data and other digital research objects become larger, more numerous, and more distributed, and analyses become more complex, a connected digital enterprise will become essential for supporting the NIH mission. The connected biomedical digital research enterprise is the collection of all biomedical digital research objects along with the relationships between those objects. Because the enterprise will be digital, collection, manipulation and use of all of these research objects can be distributed across geographically dispersed servers, and yet appear to be seamlessly connected with effective interoperability and (where appropriate) widely used common standards, annotation, and/or interfaces. From this foundation can come consistent annotation, consistent application of methods, appropriate access control, and much more, with the goal of increasing the quality of all aspects of the enterprise, a strategy already in play within the private sector.

What Can be Accomplished?

Simply tagging research objects with unique identifiers, providing minimal metadata, including provenance, and sharing them (taking into account patient confidentiality etc.) will open up a vast array of possible development efforts, beyond those originally envisaged, that can increase the research objects’ value, particularly when they conform to some standards. As the connected biomedical digital research enterprise grows, here are a few examples of what this transformation can enable:

  • Expanded metrics of scientific productivity beyond publications. If properly identified, contributions in the form of digital research objects (data, software, etc.) can be aggregated to open up a whole series of new metrics and indicators of scientific productivity and reward systems. Stated another way: one way to describe an individual’s scholarly output could be as a set of digital assets, which will allow those objects to be identified and a quantitative measure of the contribution to be determined. That contribution should also measure and favor collaboration to reduce the unhealthy sense of hyper-competitiveness that currently plagues the biomedical research community, as well as reduce unnecessary duplication of effort, making the biomedical research enterprise more cost-effective. Article level metrics (ALMs)[5] and tools such as Impactstory[6] are early indicators of what is possible.
  • Reproducible science. Including research object identifiers for both digital and physical objects in publications, while not the complete answer, can help facilitate reproducibility by allowing all of the components of a given piece of research to be located and accurately repeated. In the same way, sharing of workflows (a workflow is an ordered use of digital objects) would support reproducibility of best practices, would allow scientific approaches to be compared and contrasted, and would allow errors to be quickly corrected. RRIDs[7] for antibodies, software and model organisms are early indicators of what is possible.
  • Increased productivity and reduced costs while supporting increased creativity and impact. The envisioned digital enterprise would provide a supporting foundation for the development and evaluation of new standards, software tools and algorithms, facilitate their adoption across the community, and increase their impact. Currently, the lack of a common foundation significantly hinders adoption of some tools. A widely used foundation would allow development efforts to be spent on advancing knowledge through efficient use of the research objects rather than wasting time attempting to reconcile multiple esoteric, and often incompatible, formats as well as reducing the re-creation of tools that already exist but are unknown to the new developer. Creative new tools, built to take advantage of the opportunities created by access to large quantities of high-quality data, will enable new discoveries and have an impact far beyond the original problem for which they were developed. Support for the development of the next generation of tools and technologies that are unique to biomedical research and translatable across diseases is essential for creating value from the digital research enterprise.
  • Improved interoperability. The envisioned digital enterprise would enable a new level of interoperability and federation among all of the diverse views that the biomedical research enterprise has of its single objective, human health. Software, reagents, data, etc. would be uniquely identified and associated with each other and the publications that rely on them. Efforts in interoperability will in turn highlight anomalies in comparable data, offering new opportunities for improved annotation and quality control as well as the development and adoption of new standards, including standards from other domains.
  • Advanced manipulation, analysis and modeling. Larger and more aggregated forms of data call for new methods of manipulation, analysis and modeling of living systems as well as new methods of evaluation.
  • Emergence of a new work force. In order to reap the benefits of the digital research enterprise, many biomedical investigators will need additional training to gain the skills necessary to access and manipulate data, run appropriate analyses, and interpret the results. In addition, more data scientists who are also conversant in biomedical science will be needed as technicians, consultants, and research collaborators. A new, transformed work force will make obvious the value of the digital enterprise by fully utilizing biomedical data and methods to make new discoveries.
  • A further democratization of science. An increasingly open culture of science and rapid advances in technology accelerate innovations that allow geographically dispersed individuals with limited resources, including minorities, to take advantage of data science, repositories and computing capabilities to work on their scientific questions of interest.
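The first item in the list above, expanded metrics built from identified digital research objects, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the tuple layout, the object types, and the function names are hypothetical and do not describe any NIH or BD2K system. The sketch only shows that once objects carry identifiers and contributor lists, per-person profiles and a simple collaboration measure follow from straightforward aggregation.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Sequence, Tuple

# Hypothetical record: (identifier, object type, list of contributors).
ObjectRecord = Tuple[str, str, List[str]]


def contribution_profile(objects: Sequence[ObjectRecord]) -> Dict[str, Dict[str, int]]:
    """Count, per contributor, how many objects of each type they contributed to."""
    profile: Dict[str, Counter] = defaultdict(Counter)
    for _identifier, kind, contributors in objects:
        for person in contributors:
            profile[person][kind] += 1
    return {person: dict(counts) for person, counts in profile.items()}


def collaborative_fraction(objects: Sequence[ObjectRecord], person: str) -> float:
    """Fraction of a person's objects that have more than one contributor."""
    mine = [contributors for _id, _kind, contributors in objects if person in contributors]
    if not mine:
        return 0.0
    shared = sum(1 for contributors in mine if len(contributors) > 1)
    return shared / len(mine)
```

A real indicator would need community agreement on what counts as a contribution and how to weight it; the counts above are only the simplest possible starting point.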

How Can We Get There?

We should be ever conscious of what has gone before, both good and bad, and learn from those experiences. Among the most relevant lessons are:

  • “Build it and they will come” only works in rare circumstances.
  • The vision should be driven by the desire to solve identified basic, translational and clinical research problems that are in line with the missions of the NIH institutes and centers.
  • “Not invented here” thinking should be avoided and existing hardware, software and human resources fully utilized.
  • The community owns the digital enterprise; hence transparency and community engagement are critical.
  • Fostering new interdisciplinary virtual communities will be necessary.
  • The individuality of each institute and center within NIH must be respected while encouragement to work across borders, for example in data integration, should be sought and valued. The intramural program offers a particular opportunity to make this happen.
  • Iteration is essential to progress, as it allows agility and the opportunity to learn from small steps that are rigorously evaluated at every opportunity along the way.
  • Engaging community-minded individuals to design, develop, and support the digital enterprise allows the advantages of synergy.
  • Engaging the next generation of leading scientists is essential.
  • Biomedical research problems are so complex that real progress requires a true partnership of different kinds of expertise all coming to bear on a common problem.
  • Crowd sourcing can be a useful strategy.
  • Competitions can be a useful support mechanism if used judiciously so as to avoid hyper-competitiveness. Such competitions should encourage team science, which brings together diverse expertise.
  • Success comes from the correct blend of top-down (specific funding initiatives, regulatory requirements etc.) and bottom-up (grassroots communities that come together to try to solve problems with or without the NIH, individual investigator-initiated research) approaches.
  • Funding is typically national yet the value of research objects international. Collaboration between funding sources is needed to maximize the value and sustainability of research objects.
  • Privacy and security through appropriate policies and authentication must be developed in tandem.

More specifically, with respect to the Associate Director for Data Science (ADDS) office, BD2K and other trans-NIH Data Science activities:

  • On-going and future ADDS/BD2K programs should be related to this vision of the future as a digital enterprise and must be integrated and synergistic both within the program and with what already exists in the community and other parts of NIH.
  • BD2K programs must be synchronized with the emergent regulations and policies around shared access, as well as preservation of patient privacy. Conversely, BD2K activities should drive appropriate policy and regulatory efforts.
  • The current OSTP directive defines a why for data sharing, but says little as to the how. While a considerable amount of the data generated will find its way into existing repositories, a large fraction belongs to the long tail of science and does not have a home. The how has both practical and economic implications. Practically, data integrators must be part of the discussion, and from an economic perspective, new business models must be tried.
  • A commons[8] is needed to accommodate these data and other research objects; the commons would provide storage co-located with computing power to embrace the emergent community-based contributions to the digital enterprise. Multiple commons pilots will push forward this activity in specific instances, to identify and share best practices and allow the community to work across such efforts to ensure interoperability from the start.
  • BD2K-funded initiatives can be the leading developers and evangelists for the connected digital enterprise and coordinate themselves in such a way as to develop, or be contributors to, the commons, provision of digital research object discovery, interoperability, annotation, and indexing.
  • Standards should be defined and maintained with reference to specific types of research objects. Standards will promote the usability of digital assets and are necessary for interoperability and connectedness within the digital enterprise. Engagement of the community, both biomedical and beyond, is an essential component of both starting and maintaining standards efforts. Standards include the notion of reference datasets and controlled simulation environments to promote benchmarking, training and repurposing.
  • Training is needed at all levels, from the branding of biology (from K12 education on up) as an analytical science, to retraining of established biomedical researchers to understand and contribute to the emerging digital enterprise. A first step is to catalog what is being taught already to determine what is lacking in specific areas. Such training should engage experts from a variety of disciplines such as computer science, behavioral and social sciences, statistics, mathematics and others. New physical and virtual training facilities will likely be needed to complete coverage of relevant topics. Educational content itself should be part of the digital enterprise and hence be cataloged, and findable using appropriate metadata standards.
  • New processes, and perhaps funding mechanisms, are needed for supporting the digital enterprise as part of the NIH ecosystem. This includes policies and procedures defining research object accessibility and protection, and processes for managing grants relating to the digital enterprise so that appropriate expertise is applied and the value of various types of research objects to the NIH mission is reflected.
  • Collaboration between funding agencies is needed to minimize redundancy and maximize value to national and international communities. Cooperative agreements for a single foundation, for research and training exchanges, as well as international cooperation on standards and training are examples of how the situation can be improved.

In Summary

The ADDS and BD2K activities thus far are all valuable when cast into a shared vision for a biomedical research infrastructure that makes the sum of those activities greater than the individual parts. This shared vision will not be easy to build, but this is the time in history to make it happen. Without the eager cooperation of individual centers and cataloging efforts, the rate of progress will be slow. We must fund projects that provide the right balance of new science and innovation and shared infrastructure in order to realize the biomedical discoveries and health applications that will flow from research in a connected digital enterprise. The foundation of the connected biomedical digital enterprise is the standards, identifiers, and commons, but the monument is the discoveries that are made through analyzing the interconnected digital research objects using new tools, technologies, and methods developed in response to the new challenges. The success is the new and improved health and well-being that arises, something we will need to evaluate.

Acknowledgements

Special thanks to Jennie Larkin, Michelle Dunn, Vivien Bonazzi, Leigh Finnegan, Beth Russell, and Mark Guyer of the ADDS team, to the many NIHers involved in BD2K and to Carole Goble, Mark Patterson, Melissa Haendel, Maryann Martone, and Chris Mentzel.

Appendix:

Consider a hypothetical use case that illustrates what this future might look like:

Researcher x is automatically made aware of researcher y through commonalities in their respective data located in the commons. Intrigued, researcher x reviews the data catalog, locating researcher y’s data sets with their associated usage statistics and commentary by the community. From the datasets, researcher x navigates to the associated publications and starts to explore various ideas, with help from a personal collaboration network when necessary. Researcher x is further convinced to contact researcher y. Researcher x studies the online presence of researcher y and learns more about researcher y’s area of expertise. Researcher x formulates some specific questions and ideas to engage with researcher y and their research network. A fruitful collaboration ensues and they generate publications, data sets and software using tools developed by the BD2K Consortium. Their output is captured in PubMed and the commons, and is indexed by the data and software catalogs. Company z automatically identifies all relevant NIH data and software in a specific domain that, based on the metrics from the catalogs, have utilization above a threshold indicating that those data and software are heavily utilized and respected by the community. An open source version remains, but the company adds services on top of the software for the novice user, and revenue flows back to the labs of researchers x and y, where it is used to develop new innovative software for open distribution. Researchers x and y come to the NIH data science training centers periodically to provide hands-on advice in the use of their new version, and their course is offered as a MOOC.

[1] The term “biomedical” is used in the broadest sense to include biological, biomedical, behavioral, social, environmental and clinical studies that relate to understanding health and disease.

[2] http://www.researchobject.org/

[3] NIH’s mission is to “seek fundamental knowledge about the nature and behavior of living systems and the application of that knowledge to enhance health, lengthen life, and reduce illness and disability.”

[4] http://bd2k.nih.gov

[5] http://www.sparc.arl.org/initiatives/article-level-metrics

[6] https://impactstory.org/

[7] https://www.force11.org/Resource_identification_initiative

[8] https://pebourne.wordpress.com/2014/10/07/the-commons/

The Commons

The Associate Director for Data Science (ADDS) team at the National Institutes of Health (NIH), in partnership with the research community and the private sector, is establishing The Commons as a means to support the digital biomedical research enterprise. What is The Commons and what will it enable?

In an era when biomedical research[1] is becoming increasingly digital and analytical, the current support system is neither cost-effective nor sustainable. Moreover, that digital content is hard to find and use. The Commons is a pilot experiment in the efficient storage, manipulation, analysis, and sharing of research output, from all parts of the research lifecycle. Should The Commons be successful we would achieve a level of comprehensive access and interoperability across the research enterprise far beyond what is possible today.

The Commons is a conceptual framework for a digital environment to allow efficient storage, manipulation, and sharing of research objects[2]. Borrowing and modifying the dictionary definition, The Commons belongs to and affects the whole research community. From the perspective of the NIH we are concerned with digital research assets that support and accelerate biomedical research, and that will be the focus here, but the concept is purposely quite general so as to foster interdisciplinary interaction and use. As the concept can be employed by the entire global biomedical research enterprise, the NIH does not own it, nor is it solely responsible for it, so it is not the NIH Commons; similarly it is not just for scientific data and hence is not the Data Commons. Rather, The Commons is the concept of sharing digital research objects from any domain, where sharing implies finding, using, reusing and attributing.

The Commons could be considered analogous to the Internet or World Wide Web – each user has his/her own definition of exactly what they are, but all are able to use them every day for their own purposes. No one seems to own either, yet they work because each participant abides by a simple set of agreed-upon rules. For the World Wide Web those rules are: (1) a URL scheme to find Web sites; (2) a protocol to communicate; (3) a standard format (HTML) in which to express Web pages. The initial definition of The Commons does not go much beyond (1) in an effort to keep it simple, but still be functional. However, if common application programming interfaces (APIs) were developed to access The Commons content, they would be analogous to (2), and data formats for specific types of data, if widely adopted by the community, would be analogous to (3).

The initial rules for the Commons are proposed as follows:

  1. Each unique research object placed into The Commons must have a unique identifier.
  2. That unique identifier must allow the research object to be found, shared and attributed.
  3. Attribution requires associated provenance that, minimally, identifies the creator(s) of the unique research object and those that have subsequently modified it and how it was modified.

Although not required, it is anticipated that the majority of research objects in The Commons will, in addition, have associated metadata, which will facilitate their use. The metadata might include descriptions of content for specific types of research objects, as well as details of who has the rights to obtain access to the research object.
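The three proposed rules and the optional metadata can be pictured as a small record type. The following is a minimal sketch under stated assumptions, not a specification: the class and function names are hypothetical, and a UUID-based URN merely stands in for whatever identifier scheme the community eventually adopts.

```python
import uuid
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class ProvenanceEvent:
    """One step in an object's history: who acted and what they did."""
    agent: str        # creator or modifier (hypothetical name or ID)
    action: str       # "created" or "modified"
    description: str  # how the object was created or modified


@dataclass
class ResearchObject:
    """Hypothetical record shaped by the three proposed rules."""
    identifier: str                    # rule 1: a unique identifier
    provenance: List[ProvenanceEvent]  # rule 3: creators and later modifications
    metadata: Optional[Dict[str, str]] = None  # anticipated, but not required

    def attribution(self) -> List[str]:
        """Rules 2 and 3: everyone who created or modified the object, in order."""
        agents: List[str] = []
        for event in self.provenance:
            if event.agent not in agents:
                agents.append(event.agent)
        return agents


def new_research_object(creator: str,
                        metadata: Optional[Dict[str, str]] = None) -> ResearchObject:
    """Mint an identifier and record the creator as the first provenance event."""
    # Assumption: a random UUID is unique enough for illustration purposes.
    identifier = f"urn:uuid:{uuid.uuid4()}"
    created = ProvenanceEvent(creator, "created", "initial deposit")
    return ResearchObject(identifier, [created], metadata)


def record_modification(obj: ResearchObject, agent: str, description: str) -> None:
    """Append a provenance event saying who modified the object and how."""
    obj.provenance.append(ProvenanceEvent(agent, "modified", description))
```

Keeping provenance as an append-only list of events is one design choice among many; it makes attribution a simple read of the history rather than a separately maintained field.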

The Commons concept needs to have a real implementation. That implementation can be on a combination of compute resources – public, private or hybrid clouds, high performance computing (HPC) resources (commercial, in national laboratories, and elsewhere), and/or institutional facilities. Each of these resources is referred to as a Commons provider. The only requirement for a Commons provider is that they agree to support the rules of the Commons as stated above and to provide or permit services that facilitate the use of The Commons. Those services could be APIs for access to research objects, tools for manipulation and analysis of research objects and many more that we cannot imagine at this time. Research objects within The Commons will be cataloged in an index being developed as part of the Big Data to Knowledge (BD2K)[3] Initiative and hence findable and shared regardless of physical location. Commons users are free to use any Commons provider; in this way, competition will be created in the market place to provide a cost-effective environment to perform digital research. Thus, a Commons user with data-intensive, minimal compute needs will be able to use a different provider than a data-light, compute-intensive user, yet the research output of each will be readily found and used by anyone interested and authorized to use that content.

Thus The Commons is a distributed collection of uniquely identifiable research objects with no explicitly defined relationship among them. The Commons is not a warehouse, a federation, nor a database. Such structures can however be instantiated on a subset of contents should a user choose to do so. While there is no necessary relationship between research objects, The Commons is intended to facilitate the discovery and instantiation of such relationships.
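The distinction drawn here, a flat collection of identified objects spread across providers with relationships added only when a user chooses to add them, can be sketched in a few lines. This is an illustrative assumption rather than a description of the BD2K index: `CommonsIndex`, `link_subset`, and the provider names are all hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple


class CommonsIndex:
    """Hypothetical catalog mapping each identifier to the provider that holds it."""

    def __init__(self) -> None:
        self._location: Dict[str, str] = {}

    def register(self, identifier: str, provider: str) -> None:
        """Record where an object lives; identifiers must be unique."""
        if identifier in self._location:
            raise ValueError(f"identifier already registered: {identifier}")
        self._location[identifier] = provider

    def locate(self, identifier: str) -> str:
        """Find an object regardless of which provider stores it."""
        return self._location[identifier]

    def identifiers(self) -> List[str]:
        return sorted(self._location)


def link_subset(index: CommonsIndex,
                pairs: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Build a user-defined, undirected relationship graph over indexed objects."""
    graph: Dict[str, Set[str]] = {}
    for first, second in pairs:
        # Only objects that are findable in the index may be related.
        index.locate(first)
        index.locate(second)
        graph.setdefault(first, set()).add(second)
        graph.setdefault(second, set()).add(first)
    return graph
```

The index itself stores no relationships, which mirrors the statement that The Commons is not a warehouse, federation, or database; the graph returned by `link_subset` is the kind of structure a user might instantiate on a chosen subset.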

How the NIH will utilize The Commons

While any organization is encouraged to utilize The Commons, the NIH will use The Commons as indicated below and views The Commons as an experiment in:

  • Sharing & Accessibility. A directive from the US Office of Science and Technology Policy (OSTP) requires federal scientific research agencies to share, as far as practical and allowable, research data generated with public dollars[4]. How this is done has been left to individual agencies, but they must do so on existing budgets. The Commons is one of a number of NIH responses to this request.
  • International. To be maximally successful, The Commons must be accepted and utilized by researchers around the world. As envisioned, funding agencies from around the world could support participation in The Commons while maintaining any necessary national identity by means of supporting their own Commons-compliant infrastructure.
  • The Commons should allow data science to become more cost-effective and hence more sustainable. In principle, through The Commons, data science will become focused around a smaller number of shared cost-effective compute resources, which will compete with each other for awarded NIH dollars, a situation that should be more cost-effective than the highly distributed model of computing currently used to support biomedical researchers. The Commons also holds the promise of enabling access to and assessment of reliable negative results, which could reduce the number of attempts to study a plausible, but incorrect, hypothesis.
  • Replicability. The opportunity and ability to reproduce, or at least replicate, experiments is a basic tenet of science. However, the issue has received a great deal of attention among scientists and the public of late as a result of an apparently increasing number of failures to demonstrate that published results and claims can be reproduced. The Commons provides a means to readily expose and make accessible the full research lifecycle that underlies the subset of that cycle that is normally described in a publication, but which is typically not accessible from the publisher or the authors.
  • The majority of research output is currently not easily findable, and some may not even be on-line. Therefore discoverability of research output through indexing or other methods will be an essential element of the Commons. Furthermore, we currently do not have the capability of knowing how useful most of that output has been, as we cannot determine how much has been accessed by others, nor what the users might have to say about it. The NIH Big Data to Knowledge (BD2K) initiative, through the Data Discovery Index Coordination Consortium (DDICC), is one approach that will address this for research objects within The Commons, making research output more accessible and its use more quantifiable. The intent is that others will define alternative schemes which make research more discoverable and usable. Further, while replication as outlined above is desirable, discovery also helps prevent unwanted duplication of effort, thus making the research enterprise more cost-effective.
  • With greater access and transparency, and hence scrutiny, of research objects and full research lifecycles, where ownership is easily ascertained, quality should improve as all components of a research project become part of the accessible public record. Further, The Commons offers the promise of larger accessible control data and hence greater confidence in baseline values.
  • Again, with more access by a larger number of researchers, it should be possible to perform more forms of novel analysis on existing data, with more analysis tools being contributed and applied to scientific questions.
  • Reward structures. Accessibility and metrics that describe the complete research lifecycle hold the promise of shifting emphasis away from solely the final peer-reviewed publication to additional forms of valuable scholarship, such as well-formed and annotated datasets and robust and accessible software.

There is no guarantee that the desired outcomes outlined above will be met. If they were, it would represent an important change in the culture of doing science that could have a significant impact on the way we do biomedical research. Such a change will not come from the NIH and other funding agencies alone, but rather from collaboration with the research community. The role of the NIH is to enable the community. We will attempt to do so through the funding by BD2K of science-driven applications that utilize the emergent Commons. Such applications represent a virtuous cycle where the scientist must see the scientific merit of operating in the Commons from the outset.

Evaluation will be a key part of The Commons. However, at no time will evaluating the infrastructure per se be the focus, but rather evaluating the quality of the science that results from the application of the infrastructure. The emphasis in Commons deployment is on agility: small steps, each of which can be evaluated before taking the next. The Commons must be, as far as possible, a come-and-then-build initiative.

Acknowledgements

Thanks to the ADDS and the complete BD2K teams for useful comments, also to Francis Collins, Larry Tabak, Susan Gregurick, Jerry Sheehan and Dave Glazer for useful feedback.

[1] Covering all aspects of basic, clinical and behavioral research.

[2] A research object is a bounded entity identifiable in the field of research. Examples are specific data sets, items of software, narrative about an experiment, a research paper etc. In short anything it makes sense to uniquely identify in the domain.

[3] http://bd2k.nih.gov/

[4] http://www.whitehouse.gov/sites/default/files/microsites/ostp/ostp_public_access_memo_2013.pdf