ADDS Current Vision Statement, October 2014

The vision for data science at the NIH will undoubtedly change over time. What follows is the current vision as we prepare for the first meeting with Big Data to Knowledge (BD2K) awardees.

Introduction

Biomedical[1] research is increasingly a digital enterprise demanding new approaches and new expertise. While the research lifecycle has not changed in hundreds of years – ideas turn into hypotheses, hypotheses are tested by experiments, experiments generate data, data are analyzed so that conclusions can be drawn, conclusions are disseminated as narrative, and narratives inspire further ideas – the processes by which this takes place are in a rapid state of change. The research assets that were once physical – lab notebooks on the shelf, images as hardcopy, software as punched cards, journals as hardcopy, etc. – are now digital. These digital research objects[2], which include data (of various types), software, workflows, publications, and more, not only increase the pace of the scientific process, but also allow connections to be made and patterns to be discovered that were previously hidden. Increasingly those discoveries are made through the reuse of data for purposes different from those for which they were collected.

Whereas other sectors of the economy and society (e.g., the music industry, commerce, stock exchanges, marketing and social media) have transformed to adapt to the new possibilities of a digital enterprise, the biomedical research enterprise is far from fully leveraging its digital research objects, the transformative effects of which would include increasing the rate of discovery from biomedical research and accelerating translation of those discoveries into improved health and well-being. As the NIH mission[3] is to foster the best biomedical research and discovery to improve health and reduce disease, the NIH has a serious need for, and can play a significant role in, fully enabling the generation, preservation, and use of these emerging digital research objects, not only through research funding, but also by means of new policies, new procedures, and enhanced coordination with existing organizations and resources.

When all is said and done, the NIH cannot effect the change to a digital research enterprise on its own – this change must come from the community itself. It represents a cultural shift, and that is the hardest thing to accomplish. Thus, even if the technologies and the policies are in place, the impetus must come from the community. Is the community ready to begin to effect this change? We believe so, and the NIH is poised to help with initiatives like Big Data to Knowledge (BD2K)[4].

This is a huge undertaking and effective progress will include building the foundations upon which trust and belief in this digital enterprise can be based. The loosely used term “big data” is only part of the story, but serves the purpose of bringing attention to the need to think differently.

Let’s start by describing what the digital enterprise should look like, then what it should accomplish, and lastly, how we can get there, noting that, to bring this about, the NIH will need to work in partnership with the broader community (other federal agencies, international bodies, the private sector, societies, etc. beyond health care) to transform biomedical research into a digital enterprise by effecting the necessary scientific, technological, and social changes.

Biomedical Research as a Connected Digital Enterprise

The intent of the scientific enterprise is to support new discovery, increase the knowledge base, facilitate sharing of that knowledge among all stakeholders, and provide appropriate attribution of such contributions. Increasingly, biomedical research both involves and generates a wide variety of digital research objects, each of which can be uniquely identified and each of which has associated provenance and annotation that allows it to be discoverable. Examples of digital research objects include individual datasets in various modalities (basic and clinical), software modules, narratives in the form of publications or grant applications, labels (descriptors) for physical objects (e.g., types of equipment, reagents used), as well as metadata (descriptors) about these digital objects themselves. As biomedical data and other digital research objects become larger, more numerous, and more distributed, and analyses become more complex, a connected digital enterprise will become essential for supporting the NIH mission. The connected biomedical digital research enterprise is the collection of all biomedical digital research objects along with the relationships between those objects. Because the enterprise will be digital, collection, manipulation and use of all of these research objects can be distributed across geographically dispersed servers, and yet appear to be seamlessly connected with effective interoperability and (where appropriate) widely used common standards, annotation, and/or interfaces. From this foundation can come consistent annotation, consistent application of methods, appropriate access control, and much more, with the goal of increasing the quality of all aspects of the enterprise – a strategy already in play within the private sector.

What Can be Accomplished?

Simply tagging research objects with unique identifiers, providing minimal metadata (including provenance), and sharing them (taking into account patient confidentiality, etc.) will open up a vast array of possible development efforts, beyond those originally envisaged, that can increase the research objects’ value, particularly when they conform to some standards. As the connected biomedical digital research enterprise grows, here are a few examples of what this transformation can enable:

  • Expanded metrics of scientific productivity beyond publications. If properly identified, contributions in the form of digital research objects (data, software, etc.) can be aggregated to open up a whole series of new metrics and indicators of scientific productivity and reward systems. Stated another way: one way to describe an individual’s scholarly output could be as a set of digital assets, which will allow those objects to be identified and a quantitative measure of the contribution to be determined. That contribution should also measure and favor collaboration to reduce the unhealthy sense of hyper-competitiveness that currently plagues the biomedical research community, as well as reduce unnecessary duplication of effort, making the biomedical research enterprise more cost-effective. Article level metrics (ALMs)[5] and tools such as Impactstory[6] are early indicators of what is possible.
  • Reproducible science. Including research object identifiers, for both digital objects and physical objects in publications, while not the complete answer, can help facilitate reproducibility by allowing all of the components of a given piece of research to be located and accurately repeated. In the same way, sharing of workflows (a workflow is an ordered use of digital objects) would support reproducibility of best practices, would allow scientific approaches to be compared and contrasted, and would allow errors to be quickly corrected. RRIDs[7] for antibodies, software and model organisms are early indicators of what is possible.
  • Increased productivity and reduced costs while supporting increased creativity and impact. The envisioned digital enterprise would provide a supporting foundation for the development and evaluation of new standards, software tools and algorithms, facilitate their adoption across the community, and increase their impact. Currently, the lack of a common foundation significantly hinders adoption of some tools. A widely used foundation would allow development efforts to be spent on advancing knowledge through efficient use of the research objects rather than wasting time attempting to reconcile multiple esoteric, and often incompatible, formats as well as reducing the re-creation of tools that already exist but are unknown to the new developer. Creative new tools, built to take advantage of the opportunities created by access to large quantities of high-quality data, will enable new discoveries and have an impact far beyond the original problem for which they were developed. Support for the development of the next generation of tools and technologies that are unique to biomedical research and translatable across diseases is essential for creating value from the digital research enterprise.
  • Improved interoperability. The envisioned digital enterprise would enable a new level of interoperability and federation among all of the diverse views that the biomedical research enterprise has of its single objective, human health. Software, reagents, data, etc. would be uniquely identified and associated with each other and the publications that rely on them. Efforts in interoperability will in turn highlight anomalies in comparable data, offering new opportunities for improved annotation and quality control as well as the development and adoption of new standards, including standards from other domains.
  • Advanced manipulation, analysis and modeling. Larger and more aggregated forms of data call for new methods of manipulation, analysis and modeling of living systems as well as new methods of evaluation.
  • Emergence of a new work force. In order to reap the benefits of the digital research enterprise, many biomedical investigators will need additional training to gain the skills necessary to access and manipulate data, run appropriate analyses, and interpret the results. In addition, more data scientists who are also conversant in biomedical science will be needed as technicians, consultants, and research collaborators. A new, transformed work force will make obvious the value of the digital enterprise by fully utilizing biomedical data and methods to make new discoveries.
  • A further democratization of science. An increasingly open culture of science and rapid advances in technology accelerate innovations that allow geographically dispersed individuals with limited resources, including minorities, to take advantage of data science, repositories and computing capabilities to work on their scientific questions of interest.

How Can We Get There?

We should be ever conscious of what has gone before, both good and bad, and learn from those experiences. Among the most relevant lessons are:

  • “Build it and they will come” only works in rare circumstances.
  • The vision should be driven by the desire to solve identified basic, translational and clinical research problems that are in line with the missions of the NIH institutes and centers.
  • “Not invented here” thinking should be avoided and existing hardware, software and human resources fully utilized.
  • The community owns the digital enterprise; hence transparency and community engagement are critical.
  • Fostering new interdisciplinary virtual communities will be necessary.
  • The individuality of each institute and center within NIH must be respected while encouragement to work across borders, for example in data integration, should be sought and valued. The intramural program offers a particular opportunity to make this happen.
  • Iteration is essential to progress, as it allows agility and the opportunity to learn from small steps that are rigorously evaluated at every opportunity along the way.
  • Engaging community-minded individuals to design, develop, and support the digital enterprise allows the advantages of synergy.
  • Engaging the next generation of leading scientists is essential.
  • Biomedical research problems are so complex that real progress requires a true partnership of different kinds of expertise all coming to bear on a common problem.
  • Crowd sourcing can be a useful strategy.
  • Competitions can be a useful support mechanism if used judiciously so as to avoid hyper-competitiveness. Such competitions should encourage team science, which brings together diverse expertise.
  • Success comes from the correct blend of top-down (specific funding initiatives, regulatory requirements, etc.) and bottom-up (grassroots communities that come together to try to solve problems with or without the NIH, individual investigator-initiated research) approaches.
  • Funding is typically national, yet the value of research objects is international. Collaboration between funding sources is needed to maximize the value and sustainability of research objects.
  • Privacy and security, through appropriate policies and authentication, must be developed in tandem.

More specifically, with respect to the Associate Director for Data Science (ADDS) office, BD2K and other trans-NIH Data Science activities:

  • On-going and future ADDS/BD2K programs should be related to this vision of the future as a digital enterprise and must be integrated and synergistic both within the program and with what already exists in the community and other parts of NIH.
  • BD2K programs must be synchronized with the emergent regulations and policies around shared access, as well as preservation of patient privacy. Conversely, BD2K activities should drive appropriate policy and regulatory efforts.
  • The current OSTP directive defines a why for data sharing, but says little as to the how. While a considerable amount of the data generated will find its way into existing repositories, a large fraction belongs to the long tail of science and does not have a home. The how has both practical and economic implications: practically, data integrators must be part of the discussion; economically, new business models must be tried.
  • A commons[8] is needed to accommodate these data and other research objects; the commons would provide storage co-located with computing power to embrace the emergent community-based contributions to the digital enterprise. Multiple commons pilots will push forward this activity in specific instances, to identify and share best practices and allow the community to work across such efforts to ensure interoperability from the start.
  • BD2K-funded initiatives can be the leading developers and evangelists for the connected digital enterprise, coordinating themselves in such a way as to develop, or contribute to, the commons and the provision of digital research object discovery, interoperability, annotation, and indexing.
  • Standards should be defined and maintained with reference to specific types of research objects. Standards will promote the usability of digital assets and are necessary for interoperability and connectedness within the digital enterprise. Engagement of the community, both biomedical and beyond, is an essential component of both starting and maintaining standards efforts. Standards include the notion of reference datasets, controlled simulation environments to promote benchmarking, training and repurposing.
  • Training is needed at all levels, from the branding of biology (from K-12 education on up) as an analytical science, to retraining of established biomedical researchers to understand and contribute to the emerging digital enterprise. A first step is to catalog what is being taught already to determine what is lacking in specific areas. Such training should engage experts from a variety of disciplines such as computer science, behavioral and social sciences, statistics, mathematics and others. New physical and virtual training facilities will likely be needed to complete coverage of relevant topics. Educational content itself should be part of the digital enterprise and hence be cataloged and findable using appropriate metadata standards.
  • New processes, and perhaps funding mechanisms, are needed for supporting the digital enterprise as part of the NIH ecosystem. These include policies and procedures defining research object accessibility and protection, as well as processes for managing grants relating to the digital enterprise so that appropriate expertise is applied and the value of various types of research objects to the NIH mission is reflected.
  • Collaboration between funding agencies is needed to minimize redundancy and maximize value to national and international communities. Cooperative agreements for a single foundation, research and training exchanges, and international cooperation on standards and training are examples of how the situation can be improved.

In Summary

The ADDS and BD2K activities thus far are all valuable when cast into a shared vision for a biomedical research infrastructure that makes the sum of those activities greater than the individual parts. This shared vision will not be easy to build, but this is the moment in history to make it happen. Without the eager cooperation of individual centers and cataloging efforts, the rate of progress will be slow. We must fund projects that provide the right balance of new science, innovation, and shared infrastructure in order to realize the biomedical discoveries and health applications that will flow from research in a connected digital enterprise. The foundation of the connected biomedical digital enterprise is the standards, identifiers, and commons, but the monument is the discoveries that are made through analyzing the interconnected digital research objects using new tools, technologies, and methods developed in response to the new challenges. The success is the new and improved health and well-being that arises – something we will need to evaluate.

Acknowledgements

Special thanks to Jennie Larkin, Michelle Dunn, Vivien Bonazzi, Leigh Finnegan, Beth Russell, and Mark Guyer of the ADDS team, to the many NIHers involved in BD2K, and to Carole Goble, Mark Patterson, Melissa Haendel, Maryann Martone, and Chris Mentzel.

Appendix:

Consider a hypothetical scenario for what this future might look like in this example use case:

Researcher x is automatically made aware of researcher y through commonalities in their respective data located in the commons. Intrigued, researcher x reviews the data catalog, locating researcher y’s data sets with their associated usage statistics and commentary by the community. From the datasets, researcher x navigates to the associated publications and starts to explore various ideas, with help from a personal collaboration network where necessary. Researcher x is further convinced to contact researcher y. Researcher x studies the online presence of researcher y and learns more about researcher y’s specialty and expertise. Researcher x formulates some specific questions and ideas to engage researcher y and their research network. A fruitful collaboration ensues and they generate publications, data sets and software using tools developed by the BD2K Consortium. Their output is captured in PubMed and the commons, and is indexed by the data and software catalogs. Company z automatically identifies all relevant NIH data and software in a specific domain that, based on the metrics from the catalogs, have utilization above a threshold indicating that those data and software are heavily used and respected by the community. An open source version remains, but the company adds services on top of the software for the novice user, and revenue flows back to the labs of researchers x and y, where it is used to develop innovative new software for open distribution. Researchers x and y come to the NIH data science training centers periodically to provide hands-on advice in the use of their new version, and their course is offered as a MOOC.

[1] The term “biomedical” is used in the broadest sense to include biological, biomedical, behavioral, social, environmental and clinical studies that relate to understanding health and disease.

[2] http://www.researchobject.org/

[3] NIH’s mission is to “seek fundamental knowledge about the nature and behavior of living systems and the application of that knowledge to enhance health, lengthen life, and reduce illness and disability.”

[4] http://bd2k.nih.gov

[5] http://www.sparc.arl.org/initiatives/article-level-metrics

[6] https://impactstory.org/

[7] https://www.force11.org/Resource_identification_initiative

[8] https://pebourne.wordpress.com/2014/10/07/the-commons/

The Commons

The Associate Director for Data Science (ADDS) team at the National Institutes of Health (NIH), in partnership with the research community and the private sector, is establishing The Commons as a means to support the digital biomedical research enterprise. What is The Commons and what will it enable?

In an era when biomedical research[1] is becoming increasingly digital and analytical, the current support system is neither cost-effective nor sustainable. Moreover, that digital content is hard to find and use. The Commons is a pilot experiment in the efficient storage, manipulation, analysis, and sharing of research output from all parts of the research lifecycle. Should The Commons be successful, we would achieve a level of comprehensive access and interoperability across the research enterprise far beyond what is possible today.

The Commons is a conceptual framework for a digital environment to allow efficient storage, manipulation, and sharing of research objects[2]. Borrowing and modifying the dictionary definition, The Commons belongs to and affects the whole research community. From the perspective of the NIH we are concerned with digital research assets that support and accelerate biomedical research, and that will be the focus here, but the concept is purposely quite general so as to foster interdisciplinary interaction and use. As the concept can be employed by the entire global biomedical research enterprise, the NIH does not own it, nor is it solely responsible for it, so it is not the NIH Commons; similarly, it is not just for scientific data and hence is not the Data Commons. Rather, The Commons is the concept of sharing digital research objects from any domain, where sharing implies finding, using, reusing and attributing.

The Commons could be considered analogous to the Internet or World Wide Web – each user has his/her own definition of exactly what they are, but all are able to use them every day for their own purposes. No one seems to own either, yet they work because each participant abides by a simple set of agreed-upon rules. For the World Wide Web those rules are: (1) a URL scheme to find Web sites; (2) a protocol to communicate; (3) a standard format (HTML) in which to express Web pages. The initial definition of The Commons does not go much beyond (1) in an effort to keep it simple, but still be functional. However, if common Application Programming Interfaces (APIs) were developed to access The Commons content, they would be analogous to (2), and data formats for specific types of data, if widely adopted by the community, would be analogous to (3).

The initial rules for the Commons are proposed as follows:

  1. Each research object placed into The Commons must have a unique identifier.
  2. That unique identifier must allow the research object to be found, shared and attributed.
  3. Attribution requires associated provenance that, minimally, identifies the creator(s) of the research object, those who have subsequently modified it, and how it was modified.

Although not required, it is anticipated that the majority of research objects in The Commons will, in addition, have associated metadata, which will facilitate their use. The metadata might include descriptions of content for specific types of research objects, as well as details of who has the rights to obtain access to the research object.
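The three rules above, plus the optional metadata, amount to a small record structure. The following is a hypothetical sketch only – the field names, identifier scheme, and `ResearchObject`/`ProvenanceEvent` classes are illustrative, not a published Commons schema:

```python
import uuid
from dataclasses import dataclass, field

@dataclass
class ProvenanceEvent:
    agent: str        # who created or modified the object
    action: str       # e.g., "created" or "modified"
    detail: str = ""  # how the object was modified, if applicable

@dataclass
class ResearchObject:
    # Rule 1: every research object carries a unique identifier.
    identifier: str = field(default_factory=lambda: f"urn:uuid:{uuid.uuid4()}")
    # Rule 3: provenance minimally records creators and subsequent modifiers.
    provenance: list = field(default_factory=list)
    # Optional metadata to aid discovery and access control (rule 2 in practice).
    metadata: dict = field(default_factory=dict)

    def modify(self, agent: str, detail: str) -> None:
        """Record a modification so attribution stays complete."""
        self.provenance.append(ProvenanceEvent(agent, "modified", detail))

# A dataset deposited by one researcher and later reprocessed by another.
obj = ResearchObject(metadata={"type": "dataset", "access": "public"})
obj.provenance.append(ProvenanceEvent("Researcher X", "created"))
obj.modify("Researcher Y", "normalized column units")

assert obj.identifier.startswith("urn:uuid:")
assert [e.agent for e in obj.provenance] == ["Researcher X", "Researcher Y"]
```

The point of the sketch is that the rules impose very little: an identifier, an attribution trail, and whatever metadata a community chooses to layer on top.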

The Commons concept needs a real implementation. That implementation can be on a combination of a variety of compute resources – public, private or hybrid clouds, high performance computing (HPC) resources (commercial, in national laboratories, and elsewhere), and/or institutional facilities. Each of these resources is referred to as a Commons provider. The only requirement for a Commons provider is that they agree to support the rules of The Commons as stated above and to provide or permit services that facilitate the use of The Commons. Those services could be APIs for access to research objects, tools for manipulation and analysis of research objects, and many more that we cannot imagine at this time. Research objects within The Commons will be cataloged in an index being developed as part of the Big Data to Knowledge (BD2K)[3] Initiative and hence findable and shareable regardless of physical location. Commons users are free to use any Commons provider; in this way, competition will be created in the marketplace to provide a cost-effective environment to perform digital research. Thus, a Commons user with data-intensive, minimal compute needs will be able to use a different provider than a data-light, compute-intensive user, yet the research output of each will be readily found and used by anyone interested and authorized to use that content.
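The index that makes objects findable regardless of location can be thought of as a resolver mapping identifiers to provider locations. A minimal sketch under stated assumptions – the `CommonsIndex` class, provider names, and paths below are invented for illustration and are not the actual BD2K index design:

```python
class CommonsIndex:
    """Toy resolver: maps a research object identifier to where it lives."""

    def __init__(self):
        self._entries = {}  # identifier -> (provider, location)

    def register(self, identifier, provider, location):
        self._entries[identifier] = (provider, location)

    def resolve(self, identifier):
        """Return (provider, location), or None if the identifier is unknown."""
        return self._entries.get(identifier)

index = CommonsIndex()
# Two objects held by two different (hypothetical) Commons providers.
index.register("urn:uuid:1234", "university-hpc", "/data/genomics/run42")
index.register("urn:uuid:5678", "commercial-cloud", "s3://example-bucket/models/v1")

# A user needs only the identifier; the index says where to go.
assert index.resolve("urn:uuid:1234") == ("university-hpc", "/data/genomics/run42")
assert index.resolve("urn:uuid:9999") is None
```

Because resolution is decoupled from storage, a data-intensive user and a compute-intensive user can choose different providers while their outputs remain equally discoverable through the same index.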

Thus The Commons is a distributed collection of uniquely identifiable research objects with no explicitly defined relationship among them. The Commons is not a warehouse, a federation, nor a database. Such structures can however be instantiated on a subset of contents should a user choose to do so. While there is no necessary relationship between research objects, The Commons is intended to facilitate the discovery and instantiation of such relationships.

How the NIH will utilize The Commons

While any organization is encouraged to utilize The Commons, the NIH will use The Commons as indicated below and views The Commons as an experiment in:

  • Sharing & Accessibility A directive from the US Office of Science and Technology Policy (OSTP) requires federal scientific research agencies to share, as far as practical and allowable, research data generated with public dollars[4]. How this is done has been left to individual agencies, but they must do so on existing budgets. The Commons is one of a number of NIH responses to this request.
  • International To be maximally successful, The Commons must be accepted and utilized by researchers around the world. As envisioned, funding agencies from around the world could support participation in The Commons while maintaining any necessary national identity by means of supporting their own Commons-compliant infrastructure.
  • Sustainability The Commons should allow data science to become more cost-effective and hence more sustainable. In principle, through The Commons, data science will become focused around a smaller number of shared cost-effective compute resources, which will compete with each other for awarded NIH dollars, a situation that should be more cost-effective than the highly distributed model of computing currently used to support biomedical researchers. The Commons also holds the promise of enabling access to and assessment of reliable negative results, which could reduce the number of attempts to study a plausible, but incorrect, hypothesis.
  • Replicability The opportunity and ability to reproduce, or at least replicate, experiments is a basic tenet of science. However, the issue has received a great deal of attention among scientists and the public of late as a result of an apparently increasing number of failures to demonstrate that published results and claims can be reproduced. The Commons provides a means to readily expose and make accessible the full research lifecycle that underlies the subset of that cycle that is normally described in a publication, but which is typically not accessible from the publisher or the authors.
  • Discoverability The majority of research output is currently not easily findable, and some may not even be on-line. Therefore discoverability of research output through indexing or other methods will be an essential element of The Commons. Furthermore, we currently do not have the capability of knowing how useful most of that output has been, as we cannot determine how much has been accessed by others, nor what the users might have to say about it. The NIH Big Data to Knowledge (BD2K) initiative, through the Data Discovery Index Coordination Consortium (DDICC), is one approach that will address this for research objects within The Commons, making research output more accessible and its use more quantifiable. The intent is that others will define alternative schemes which make research more discoverable and usable. Further, while replication as outlined above is desirable, discovery also prevents unwanted duplication of effort, thus making the research enterprise more cost-effective.
  • Quality With greater access and transparency, and hence scrutiny, of research objects and full research lifecycles, where ownership is easily ascertained, quality should improve as all components of a research project become part of the accessible public record. Further, The Commons offers the promise of larger accessible control data and hence greater confidence in baseline values.
  • Novel analysis Again, with more access by a larger number of researchers, it should be possible to perform more forms of novel analysis on existing data, with more analysis tools being contributed and applied to scientific questions.
  • Reward structures. Accessibility and metrics that describe the complete research lifecycle hold the promise of shifting emphasis away from solely the final peer-reviewed publication to additional forms of valuable scholarship, such as well-formed and annotated datasets and robust and accessible software.

There is no guarantee that the desired outcomes outlined above will be met. If they were, it would represent an important change in the culture of doing science that could have a significant impact on the way we do biomedical research. Such a change will not come from the NIH and other funding agencies alone, but rather from collaboration with the research community. The role of the NIH is to enable the community. We will attempt to do so through BD2K funding of science-driven applications that utilize the emergent Commons. Such applications represent a virtuous cycle in which the scientist must see the scientific merit of operating in The Commons from the outset.

Evaluation will be a key part of The Commons. However, at no time will evaluating the infrastructure per se be the focus; rather, the focus will be on evaluating the quality of the science that results from the application of the infrastructure. The emphasis in Commons deployment is on agility – small steps, each of which can be evaluated before going to the next. The Commons must be, as far as possible, a come-and-then-build initiative.

Acknowledgements

Thanks to the ADDS and the complete BD2K teams for useful comments, also to Francis Collins, Larry Tabak, Susan Gregurick, Jerry Sheehan and Dave Glazer for useful feedback.

[1] Covering all aspects of basic, clinical and behavioral research.

[2] A research object is a bounded entity identifiable in the field of research. Examples are specific data sets, items of software, narrative about an experiment, a research paper, etc. In short, anything it makes sense to uniquely identify in the domain.

[3] http://bd2k.nih.gov/

[4] http://www.whitehouse.gov/sites/default/files/microsites/ostp/ostp_public_access_memo_2013.pdf

Ten Weeks as ADDS

Welcome. This is the first of what I anticipate will be periodic updates on the work of the Associate Director for Data Science (ADDS) team at NIH. The goal is to be transparent and informative and we welcome your input at any time.

When accepting the job of ADDS at the NIH I asked the NIH Director, Francis Collins, to summarize my job description. His answer was simply, “to change the culture of the NIH.” My response was, “and what do I do next week?” Nine weeks after my self-imposed deadline, I am certainly having fun, and while it will take much more than one person to change a deeply ingrained culture centered around specific diseases and organs, the complexity of disease and the value of sharing data across institutional boundaries will drive us forward. All at NIH seem to share this belief, hence the fun part.

How did we get to this ten week point? In 2011 Dr. Collins formed the Data and Informatics Working Group, which in 2012 released a report [1] highlighting the need to:

  • Advance basic and translational science by facilitating and enhancing the sharing of research-generated data.
  • Promote the development of new analytical methods and software for this emerging data.
  • Increase the workforce in quantitative science toward maximizing the return on the NIH’s public investment in biomedical research.

The report provides a compelling roadmap, and thanks to the outstanding efforts of Drs. Eric Green, Mark Guyer and many others at the NIH, much has already been done to address these needs through the Big Data to Knowledge (BD2K) program [2]. It is now my job to take over and build on this initiative. BD2K is a predominantly extramural program, which will make its first awards this summer, and is intended to foster developments in data science relevant to biomedicine. It consists of training programs, calls to enable data and software discoverability, facilitation of standards efforts, and consortia that will improve all aspects of scientific data handling, analysis and reuse.

After ten weeks we have begun to formulate a strategic plan to elaborate upon BD2K and to assemble the folks to carry it out. The modus operandi is one of coordinating existing efforts, both intramural and extramural. Currently the “we” is Dr. Jennie Larkin, Program Director in the Advanced Technologies and Surgery Branch, Division of Cardiovascular Sciences, National Heart, Lung and Blood Institute, and now the Deputy in the ADDS team; Eric, Mark; and the whole BD2K team of over 100 NIH staff, who have other roles but participate on a regular basis. Additional full-time ADDS team members will be joining in the coming months.

We began by talking to a vast array of people to get an idea of the landscape and determine what might be done. While ongoing, this includes all 27 NIH Institute and Center Directors as well as many members of their staff — a group of amazingly dedicated individuals pulling for the same overall goals and trying to maximize the research that can be done on a flat (at best) budget.

From discussions with stakeholders an ADDS team-driven plan has begun to emerge that is based on a number of observations, some obvious, including:

  • The US government Office of Science and Technology Policy (OSTP) directives on data sharing have defined the why of sharing, but not the how.
  • The current approach of matching budgets to data growth will not scale: data are growing rapidly while budgets remain flat.
  • The problems of the long-term sustainability of biomedical research data have been identified and discussed; the solutions are not so clear.
  • We need more information to begin to define and test possible data sustainability models. For example, we do not currently know enough about how existing data are used, and thus are under-informed about how to proceed.
  • The amount of data that will be generated in years to come, and the demand for those data as part of the digital enterprise, are difficult to estimate.
  • As yet we have not fully addressed what fraction of the NIH data science budget should go into data management and analysis versus generating new data.
  • The BD2K initiative as currently conceived is only part of the answer. Moreover, Big Data is a lot more than just data.  We need to consider all digital assets: data, metadata, software, narrative, workflows, training materials, etc.
  • Training in biomedical data science is in need of expansion and coordination.
  • Better reward systems, and hence improved recognition of biomedical data scientists, are needed.

How to address the above observations? The answer, as we currently see it, is to have you (the community) help us work through these issues, by considering five programmatic themes that, with the exception of BD2K, are new. Those five themes have one strategic goal:

To enable biomedical research as a digital enterprise through which new discoveries are made and knowledge generated by maximizing community engagement and productivity.

Consider the five programmatic themes:

  1. BD2K – fostering innovation through partnership with the extramural biomedical research community. BD2K seeks to develop better ways to tackle the challenges (and harness the potential) of biomedical big data, with the goal of establishing a national infrastructure to support biomedical research.
  2. Sustainability – partnering with the community to address the challenges of maintaining the rapidly growing digital assets that are generated as part of biomedical research.
  3. Training – preparing the workforce to address the challenges and opportunities of biomedical research as a digital enterprise.
  4. Evaluation & Reward – defining the means to evaluate the value of data scientists, data, software and other digital assets to the research enterprise and getting all scholars to appreciate that value.
  5. Communication & Outreach – working with partners – other federal agencies, the private sector, both nationally and internationally, inside and outside of biomedicine, to learn from experience and maximize the value of the digital enterprise, within and across disciplines.

In the coming months we will implement a series of activities to move these five programmatic themes forward. Our activities will be agile – small steps followed by evaluation to determine what next steps to take. The community will be engaged in all aspects of the development – there will be no “build it and they will come.”

To this end we will begin in June with a workshop for all NIH personnel engaged in BD2K, so that internally we are all on the same page. Subsequently, in late summer or early fall, we will convene a group of stakeholders to help us chart the course going forward. The draft action plan will be provided for public comment, modified, and a set of actions subsequently put in place. At this point in the evolution of our strategy, let's consider what this might look like.

Sustainability

The current thinking is to establish a commons and conduct a series of pilots to evaluate its value to the community. The commons is a concept, the instantiation of which could occur in a variety of ways. Conceptually, it is a shared workspace in which a variety of research objects can reside. It will have a business model such that the contents of the commons are governed by economic realities and hence sustainable; what that means in practice will be determined by cost versus value to the content stakeholders. Contents will also be governed by regulatory and ethical concerns.

Minimally, the commons will be a biomedical drop box. One step beyond that, we will encourage the community to write apps (also in the commons) that operate on the content and provide a measure of authentication of different content types; that authentication is metadata important in defining the content. A further step is other computations performed on the content, generating new research objects. A final step is analysis of the complete commons, perhaps finding commonality across research objects, or yet-to-be-imagined new findings in biomedicine.

Instantiation of the commons requires a few decisions regarding object identification, provenance, etc. The goal is to not reinvent the wheel, to keep it lightweight, and to encourage exploration. Instantiation would likely be in the cloud, and various public-private partnerships will be explored as pilots. Assuming the commons is viable, there is no limitation on who can participate. As such, no single entity owns the commons, and oversight comes in a way defined by the community itself. The commons can be thought of as a thin layer added to the Internet specifically to support biomedical research. Its value at every step will be judged by the ability to make progress on data-driven biomedical research.

Clearly, if the community endorses the Commons, there are many details associated with making it a reality: details that will require developments in data science and testing on biomedical problems by the community. Extramural funding will be provided to support these efforts.

Training

There is a significant amount of training in components of data science already funded by the NIH, both as extramural grants supporting training and career development and as course materials offered intramurally and extramurally. Additional extramural grants for training are being offered through BD2K to help the workforce prepare for biomedical science as a digital enterprise. Additional courses and course materials are also needed to fill gaps, but first we need to understand the current scope of courses being offered in this fast-growing area; the sense at present is that there is redundancy in some areas and a lack in others. Our initial efforts are anticipated to be in rationalizing what is available, so that we can share best practices and enable students and researchers to find relevant training. Next, we anticipate working towards a description of a complete curriculum, much of which we hope will be offered online. We will also contemplate one or more training centers, which would be available for serious hands-on training, both using reference data and working with data from one's own NIH-funded research.

Evaluation & Reward

Data-, software-, and standards-related grants need special consideration when reviewed, to ensure best practices are used and the value to the community is fully appreciated. In turn this requires new metrics, which will be the subject of discussion and community engagement in the coming months. Those same metrics will be used to highlight the value of data scientists to the digital research enterprise.

Communication & Outreach

Scientific data are global, yet the way these data are maintained is typically national. We need improved cooperation among funding agencies within the US and with our counterparts worldwide. Such exchanges are beginning: a recent workshop convened by the Gordon and Betty Moore Foundation discussed data sustainability across the federal agencies. The goal is to work towards common principles for the maintenance of scientific data and to maximize the use of our research dollars through synergy.

BD2K

BD2K is the extramural component of the NIH's effort towards the digital enterprise. It is aimed at taking full advantage of what the community has to offer and having it contribute to a coherent national infrastructure, likely centered around the commons. Components include significant cataloging efforts for data (the data discovery index), software and standards; training; and a series of BD2K centers that, taken together as a cohesive ecosystem, drive innovation in biomedical data science. Achieving this goal will require an appropriate oversight model and a willingness of all participants to work towards common goals.

So there you have it: a brushstroke of our thinking at the ten-week point, as we embark on the notion of biomedical research as a digital enterprise. We look forward to your thoughts at any time – this is a community effort.

[1] http://acd.od.nih.gov/diwg.htm

[2] http://bd2k.nih.gov/

Philip E. Bourne 05/16/14

Associate Director for Data Science (ADDS), NIH

philip.bourne@nih.gov

 

Universities As Big Data

While few believe the current university model will die, profound change will occur. Those universities that best leverage their big data (aka their digital assets) will see their coveted U.S. News & World Report ranking rise disproportionately in the next 5-10 years. In other words, prospective students and their parents, leading scholars, and skilled administrative staff will seek out the advantages to be had from the Big Data University.

 

Universities, like all businesses, are undergoing a period of rapid change, fuelled by a changing economy and an increasingly connected world where knowledge is broadly available and free to consume. Evidence of these changes can be seen in declining federal and state research dollars, the appearance of massive open online courses (MOOCs), and disconcertment among students (and parents) about how they are being educated and the cost of that education. At the same time, a degree from a prestigious university counts more than ever in the workplace. Universities, with their long traditions, generally conservative viewpoint, and prior sense of entitlement through federal, state and public support, seem slow to adapt to the new realities. One aspect of that new reality is the need to leverage one's data to keep up with the times and grow.

 

Universities have traditionally been analog – courses taught with slides or overheads, course notes printed, research data on shelves in notebooks, admission applications kept in endless file cabinets, and so on. Now all of that content is, or soon will be, purely digital. The problem is that universities treat these data as simply electronic versions of what was maintained in hard copy, and they have been very slow to leverage the power of the digital medium. This failure is not new in business; the music industry, the book and newspaper industries, manufacturing, etc. all initially responded slowly to the new digital reality, and when change accelerated, old businesses died and new ones emerged.

 

What does it mean in today's fast-paced environment to be a Big Data University and hence leverage the data? It means integrating data and information resources to improve one's business (universities are businesses). Once integrated, those data must be analyzed to yield findings that give the university a competitive advantage in ways that would not otherwise be possible.

 

Moving to a Big Data University will be a challenge for many, because the current organizational structure of most research universities makes such data and knowledge integration difficult. Research, education and administrative services are siloed, and each maintains its own separate, and sometimes duplicative, data and information infrastructure. Typically there are central services that provide computer networking, but how that networking is used is a free-for-all, with redundancy across schools, departments, colleges, or whatever organizational structure is in place. Breaking down the silos takes vision, leadership, and resources, but consider the gains that are in reach for Jane, a student at a Big Data University, and her colleagues.

 

Jane scores well in parts of her advanced on-line biology class. Professors who undertake research in the areas where Jane did well are automatically notified of her potential, based on a computer analysis of her scores and background interests, and Professor Smith interviews her and offers her a research internship for the summer. Over the summer, as she enters details of her experiments related to understanding a widespread neurodegenerative disease in an on-line laboratory notebook, the underlying computer system automatically puts Jane into contact with another student, Jack, in a different department, whose notebook reveals he is working on using bacteria for toxic waste cleanup. Why the connection? It turns out the same gene, which they both reference a number of times in their notes, is linked to two very different disciplines – mental health and the environment. In the analog university they would never have discovered each other, but at the Big Data University pooled knowledge can lead to a distinct advantage. The collaboration later results in a patent filing and triggers a notification to a number of biotech companies that might be interested in licensing the technology. A company licenses the technology and hires Jane and Jack to continue working on the project. Professor Smith hires another student using the revenue from the license, and this in turn leads to a federal grant to support further research. The students get good jobs, further research is supported, and societal benefit arises from the technology. A hypothetical example of why the Big Data University makes sense.

 

Today there are no technical reasons why this example cannot be realized. However, cultural and resource issues impede a move towards the Big Data University. It will be interesting to see which institutions can overcome these impediments and call themselves Big Data Universities. They are where I would want to send my kids.

 

Taking on the Role of Associate Director for Data Science at the NIH – My Original Vision Statement

On March 3, 2014 I will begin the job of Associate Director for Data Science (ADDS) at the National Institutes of Health (NIH). I will report directly to NIH Director Dr. Francis Collins. When I originally applied for the position in April 2013 I was asked to prepare a short vision statement. That statement follows here. It does not necessarily reflect what I will attempt to accomplish in the job, but rather the way I was thinking about data science at the time of my application. In the spirit of openness which I hope to bring to the position, I include it here and invite your comments.

Technology, including information technology, has had a profound impact on health-related research at all scales. Witness everything from the plummeting cost of sequencing and assembling a genome to the emergence of mobile health precipitated by smartphones. Yet this is just the beginning. I believe the future of research into health and well-being will be tied very much to our ability to sustain, trust, integrate, analyze/discover, disseminate/visualize and comprehend digital data. The work of the National Library of Medicine and the National Center for Biotechnology Information (NCBI) has been exemplary among the sciences in getting us this far, but it is just the beginning. Let me address each of these issues. Two pages preclude a detailed discussion of how to fulfill the vision, and I hope to speak further about the possibilities; they also preclude any discussion of the peculiarities, challenges and rewards associated with specific data types. Again, I hope to go beyond this generic discussion on the future of data.

 

Sustainability is the most critical, yet least addressed, aspect of digital health, at least in academia. Sustainability cannot simply mean asking the funding agencies for more money as the data continue to grow at unprecedented rates. We need new business models (academia is a business), including public-private partnerships, since private enterprise has been thinking about these problems for a while. We need to recognize that data sustainability is a global, not a national, problem, and finally we need to begin to make informed decisions about what data to discard.

Consider examples of the types of discussions that need to be had, leading to policies and procedures that subsequently need to be put in place. Discussions are needed around business models that provide services atop free, open content and thereby generate revenue to sustain that content. Discussions are needed that review other global industries, e.g., banking and commerce, to consider best (and worst) practices associated with the global management of data. Lastly, we need to consider what data we need to sustain. That consideration begins with how we actually use data; to date, the study of how data are actually utilized is in its infancy in academia. Funded data providers are required to give global statistics on data use, but this does not speak to how each element of data in a corpus is utilized and why. When we understand this better, we can make informed decisions about what to discard, with the understanding that such data could be regenerated later at a cost that makes storing them nonsensical. Data-rich private sector companies need to be engaged in this discussion so academia can learn from their best practices.

Sustainability is also an institutional problem. Academic institutions are at this time rarely taking full advantage of their digital assets, including the biomedical data being generated by their faculty and students. The recent Moore and Sloan Foundation initiative (I was involved with this) was a departure in that it rewards institutions, rather than individuals, for best data science practices. Mechanisms that reward institutions for their careful stewardship and open accessibility of biomedical data should be considered, as should programs that support and promote data scientists in these institutions. The lack of growth paths, and the perceptions of faculty review committees, need to change so that the value of institutional data scientists is elevated. Programs can be designed to support this.

 

Trust in the data has been the biggest factor in the success of the data and knowledge resources (databases and journals) I have been involved with over the years. Trust speaks to the security and quality of the data. Security is temporal and personal: what is secure today may not be secure with the analytical tools of tomorrow, and what one person wants to keep secure another wants to make public so as to benefit others. We need to be flexible in our approach to security. Surprisingly, quality is not something we pay enough attention to. Current modes of data and knowledge management (databases and journals) lack sufficient feedback mechanisms to report on their content. Likewise, there is a data curation-query cycle that is mostly missing in current data management practices. Query of a corpus informs us about outliers in that corpus; such outliers may be discoveries, or they may be errors that can be corrected or discarded. We need to stimulate more inquiry about the trust in the data we are generating.

 

Integration of disparate data, often at different biological scales, is a major characteristic of current and future biomedical research discoveries. Optimizing such integration speaks to data representation, metadata, ontologies, provenance and so on: aspects for which good technical solutions already exist, but for which the motivation and reward to create well-formed datasets from which integration can occur are missing. Facilitating the cataloging and comparison of datasets is one mechanism for creating motivation among researchers; funding mandates are another. Data sharing policies are a great step forward; once they are firmly in place, the next step is to define how the data should be presented so that they may be optimally shared.

 

Analyze/Discover Discovery informatics is in its infancy. Search engines are grappling with the need for deep search, but it is doubtful they will fulfill the needs of the biomedical research community when it comes to finding and analyzing the appropriate datasets. Let me cast the vision in a use case. As a research group winds down for the day, algorithms take over, deciphering from the day's on-line raw data, lab notes, grant drafts, etc. the underlying themes being explored by the laboratory (the lab's digital assets). Those themes are the seeds of a deep search to discover what is relevant to the lab that has appeared since a search was last conducted – in published papers, public data sets, blogs, open reviews, etc. Next morning the results of the deep search are presented to each member as a personalized view for further post-processing. We have a long way to go here, but programs that inspire groups of computer, domain and social scientists to work on these needs will move us forward.

 

Disseminate/Visualize/Comprehend In 2005 I wrote an editorial asking the question: is a biological database really different from a biological journal? The answer then, as now, is no; what is different is the way their value is perceived. What has changed since is the emergence of data journals, and of databases hiring more curators to extract information from the literature and add it to their holdings (how cost-ineffective is that!). In the world of digital scholarship the paper is a means to execute upon the underlying data, and it becomes a tool of interactive inquiry. Open access opens the door to these possibilities, but there is much to be done.

 

What I have tried to do here is simply introduce how I am thinking about the problems of biomedical research data; I appreciate that there is little here on how these problems might be addressed. I hope to have that discussion.