The Associate Director for Data Science (ADDS) team at the National Institutes of Health (NIH), in partnership with the research community and the private sector, is establishing The Commons as a means to support the digital biomedical research enterprise. What is The Commons and what will it enable?
In an era when biomedical research is becoming increasingly digital and analytical, the current support system is neither cost-effective nor sustainable. Moreover, that digital content is hard to find and use. The Commons is a pilot experiment in the efficient storage, manipulation, analysis, and sharing of research output from all parts of the research lifecycle. Should The Commons be successful, we would achieve a level of comprehensive access and interoperability across the research enterprise far beyond what is possible today.
The Commons is a conceptual framework for a digital environment to allow efficient storage, manipulation, and sharing of research objects. Borrowing and modifying the dictionary definition, The Commons belongs to and affects the whole research community. From the perspective of the NIH, we are concerned with digital research assets that support and accelerate biomedical research, and that will be the focus here, but the concept is purposely quite general so as to foster interdisciplinary interaction and use. As the concept can be employed by the entire global biomedical research enterprise, the NIH does not own it, nor is it solely responsible for it, so it is not the NIH Commons; similarly, it is not just for scientific data and hence is not the Data Commons. Rather, The Commons is the concept of sharing digital research objects from any domain, where sharing implies finding, using, reusing, and attributing.
The Commons could be considered analogous to the Internet or World Wide Web: each user has his or her own definition of exactly what they are, but all are able to use them every day for their own purposes. No one seems to own either, yet they work because each participant abides by a simple set of agreed-upon rules. For the World Wide Web those rules are: (1) a URL scheme to find Web sites; (2) a protocol to communicate; (3) a standard format (HTML) in which to express Web pages. The initial definition of The Commons does not go much beyond (1), in an effort to keep it simple but still functional. However, if common Application Programming Interfaces (APIs) were developed to access The Commons content, they would be analogous to (2), and data formats for specific types of data, if widely adopted by the community, would be analogous to (3).
The initial rules for The Commons are proposed as follows:
- Each unique research object placed into The Commons must have a unique identifier.
- That unique identifier must allow the research object to be found, shared and attributed.
- Attribution requires associated provenance that, minimally, identifies the creator(s) of the unique research object and those who have subsequently modified it, and records how it was modified.
Although not required, it is anticipated that the majority of research objects in The Commons will, in addition, have associated metadata, which will facilitate their use. The metadata might include descriptions of content for specific types of research objects, as well as details of who has the rights to obtain access to the research object.
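The three rules above, together with the optional metadata, can be sketched as a small data structure. This is only an illustration under stated assumptions, not a Commons specification: the class names `ResearchObject` and `ProvenanceEvent`, the field names, and the use of a random UUID as the identifier are all hypothetical choices made here, since the rules do not prescribe an identifier scheme or a provenance format.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List
import uuid


@dataclass(frozen=True)
class ProvenanceEvent:
    """One entry in an object's history (hypothetical structure)."""
    agent: str        # who created or modified the object
    action: str       # "created" or "modified"
    description: str  # how the object was created or changed
    timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


@dataclass
class ResearchObject:
    """Hypothetical research object that follows the three proposed rules."""
    creators: List[str]
    # Rule 1: a unique identifier (a UUID is an assumption, not a requirement).
    identifier: str = field(default_factory=lambda: uuid.uuid4().hex)
    # Rule 3: provenance naming the creators and any later modifiers.
    provenance: List[ProvenanceEvent] = field(default_factory=list)
    # Optional metadata, e.g. a content description or access rights.
    metadata: Dict[str, str] = field(default_factory=dict)

    def __post_init__(self) -> None:
        if not self.creators:
            raise ValueError("attribution requires at least one creator")
        if not self.provenance:
            # Start the history with one creation event per creator.
            for creator in self.creators:
                self.provenance.append(
                    ProvenanceEvent(creator, "created", "initial deposit")
                )

    def record_modification(self, agent: str, description: str) -> None:
        """Append who modified the object and how it was modified."""
        self.provenance.append(ProvenanceEvent(agent, "modified", description))

    def attribution(self) -> List[str]:
        """Everyone to credit, in order of first contribution, no duplicates."""
        return list(dict.fromkeys(event.agent for event in self.provenance))
```

In this sketch, creating an object with `ResearchObject(creators=["Alice"])` and then calling `record_modification("Bob", "normalized column names")` leaves a history from which `attribution()` lists Alice followed by Bob. The second rule (finding and sharing by identifier) depends on some lookup service outside the object itself, so it is not modeled here.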
The Commons concept needs to have a real implementation. That implementation can run on a combination of compute resources: public, private, or hybrid clouds; high-performance computing (HPC) resources (commercial, in national laboratories, and elsewhere); and institutional facilities. Each of these resources is referred to as a Commons provider. The only requirement for a Commons provider is that it agrees to support the rules of The Commons as stated above and to provide or permit services that facilitate the use of The Commons. Those services could be APIs for access to research objects, tools for manipulation and analysis of research objects, and many more that we cannot imagine at this time. Research objects within The Commons will be cataloged in an index being developed as part of the Big Data to Knowledge (BD2K) Initiative and will hence be findable and shareable regardless of physical location. Commons users are free to use any Commons provider; in this way, competition will be created in the marketplace to provide a cost-effective environment in which to perform digital research. Thus, a Commons user with data-intensive, minimal-compute needs will be able to use a different provider than a data-light, compute-intensive user, yet the research output of each will be readily found and used by anyone interested and authorized to use that content.
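The idea that research objects remain findable regardless of which provider hosts them can be illustrated with a toy index. This is a sketch under stated assumptions only: `CommonsIndex`, `IndexEntry`, and the methods `register`, `locate`, and `search` are hypothetical names, the provider names and locations in the usage note are invented, and nothing here reflects the actual design of the index being developed under BD2K.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class IndexEntry:
    """Where one research object lives (hypothetical record layout)."""
    identifier: str   # the object's unique identifier
    provider: str     # name of the Commons provider hosting it
    location: str     # provider-specific address of the object
    keywords: List[str] = field(default_factory=list)


class CommonsIndex:
    """Toy catalog mapping identifiers to providers (illustrative only)."""

    def __init__(self) -> None:
        self._entries: Dict[str, IndexEntry] = {}

    def register(self, entry: IndexEntry) -> None:
        """Add an object; identifiers must be unique across all providers."""
        if entry.identifier in self._entries:
            raise ValueError(
                f"identifier already registered: {entry.identifier}"
            )
        self._entries[entry.identifier] = entry

    def locate(self, identifier: str) -> IndexEntry:
        """Find an object by identifier; raises KeyError if it is unknown."""
        return self._entries[identifier]

    def search(self, keyword: str) -> List[IndexEntry]:
        """Return entries tagged with the keyword, ignoring case."""
        wanted = keyword.lower()
        return [
            entry
            for entry in self._entries.values()
            if wanted in (k.lower() for k in entry.keywords)
        ]
```

For example, registering `IndexEntry("obj-001", "cloud-a", "https://cloud-a.example/objects/obj-001", ["genomics"])` and `IndexEntry("obj-002", "hpc-b", "https://hpc-b.example/objects/obj-002", ["Genomics", "software"])` lets a single `search("genomics")` call return both entries, even though the two objects sit with different providers. Access control, which the text ties to authorization, is deliberately left out of this sketch.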
Thus, The Commons is a distributed collection of uniquely identifiable research objects with no explicitly defined relationship among them. The Commons is not a warehouse, a federation, or a database. Such structures can, however, be instantiated on a subset of the contents should a user choose to do so. While there is no necessary relationship between research objects, The Commons is intended to facilitate the discovery and instantiation of such relationships.
How the NIH will utilize The Commons
While any organization is encouraged to utilize The Commons, the NIH will use The Commons as indicated below and views The Commons as an experiment in:
- Sharing & Accessibility. A directive from the US Office of Science and Technology Policy (OSTP) requires federal scientific research agencies to share, as far as practical and allowable, research data generated with public dollars. How this is done has been left to individual agencies, but they must do so on existing budgets. The Commons is one of a number of NIH responses to this directive.
- International. To be maximally successful, The Commons must be accepted and utilized by researchers around the world. As envisioned, funding agencies from around the world could support participation in The Commons while maintaining any necessary national identity by means of supporting their own Commons-compliant infrastructure.
- The Commons should allow data science to become more cost-effective and hence more sustainable. In principle, through The Commons, data science will become focused around a smaller number of shared cost-effective compute resources, which will compete with each other for awarded NIH dollars, a situation that should be more cost-effective than the highly distributed model of computing currently used to support biomedical researchers. The Commons also holds the promise of enabling access to and assessment of reliable negative results, which could reduce the number of attempts to study a plausible, but incorrect, hypothesis.
- Replicability. The opportunity and ability to reproduce, or at least replicate, experiments is a basic tenet of science. However, the issue has received a great deal of attention among scientists and the public of late as a result of an apparently increasing number of failures to demonstrate that published results and claims can be reproduced. The Commons provides a means to expose and make accessible the full research lifecycle, of which a publication normally describes only a subset and which is typically not accessible from the publisher or the authors.
- The majority of research output is currently not easily findable, and some may not even be online. Therefore, discoverability of research output through indexing or other methods will be an essential element of The Commons. Furthermore, we currently have no way of knowing how useful most of that output has been, as we cannot determine how much has been accessed by others, nor what the users might have to say about it. The NIH Big Data to Knowledge (BD2K) initiative, through the Data Discovery Index Coordination Consortium (DDICC), is one approach that will address this for research objects within The Commons, making research output more accessible and its use more quantifiable. The intent is that others will define alternative schemes that make research more discoverable and usable. Further, while replication as outlined above is desirable, discovery also helps prevent unwanted duplication of effort, thus making the research enterprise more cost-effective.
- With greater access and transparency, and hence scrutiny, of research objects and full research lifecycles, where ownership is easily ascertained, quality should improve as all components of a research project become part of the accessible public record. Further, The Commons offers the promise of larger accessible control data and hence greater confidence in baseline values.
- Again, with more access by a larger number of researchers, it should be possible to perform more forms of novel analysis on existing data, with more analysis tools being contributed and applied to scientific questions.
- Reward structures. Accessibility and metrics that describe the complete research lifecycle hold the promise of shifting emphasis away from solely the final peer-reviewed publication to additional forms of valuable scholarship, such as well-formed and annotated datasets and robust and accessible software.
There is no guarantee that the desired outcomes outlined above will be met. If they are, the result would be an important change in the culture of science, one that could have a significant impact on the way we do biomedical research. Such a change will not come from the NIH and other funding agencies alone, but rather from collaboration with the research community. The role of the NIH is to enable the community. We will attempt to do so through BD2K funding of science-driven applications that utilize the emergent Commons. Such applications represent a virtuous cycle in which the scientist must see the scientific merit of operating in The Commons from the outset.
Evaluation will be a key part of The Commons. However, at no time will the focus be on evaluating the infrastructure per se, but rather on evaluating the quality of the science that results from the application of the infrastructure. Commons deployment emphasizes an agile approach: small steps, each of which can be evaluated before moving on to the next. The Commons must be, as far as possible, a come-and-then-build initiative.
Thanks to the ADDS and the complete BD2K teams for useful comments, and to Francis Collins, Larry Tabak, Susan Gregurick, Jerry Sheehan, and Dave Glazer for useful feedback.
 Covering all aspects of basic, clinical and behavioral research.
 A research object is a bounded entity identifiable in the field of research. Examples are specific data sets, items of software, a narrative about an experiment, a research paper, etc. In short, anything that it makes sense to uniquely identify in the domain.