On March 3, 2014 I will begin the job of Associate Director for Data Science (ADDS) at the National Institutes of Health (NIH). I will report directly to the NIH Director, Dr. Francis Collins. When I originally applied for the position in April 2013 I was asked to prepare a short vision statement, which follows here. It does not necessarily reflect what I will attempt to accomplish in the job, but rather the way I was thinking about data science at the time of my application. In the spirit of openness that I hope to bring to the position, I include it here and invite your comments.
Technology, including information technology, has had a profound impact on health-related research at all scales. Witness everything from the plummeting cost of sequencing and assembling a genome to the emergence of mobile health precipitated by smartphones. Yet this is just the beginning. I believe the future of research into health and well-being will be tied very much to our ability to sustain, trust, integrate, analyze/discover, disseminate/visualize and comprehend digital data. The work of the National Library of Medicine and the National Center for Biotechnology Information (NCBI) has been exemplary among the sciences in getting us this far, but it is only a start. Let me address each of these issues in turn. Two pages preclude a detailed discussion of how to fulfill the vision, and I hope to speak further about the possibilities. The limit also precludes any discussion of the peculiarities, challenges and rewards associated with specific data types. Again, I hope to go beyond this generic discussion of the future of data.
Sustainability is the most critical, yet least addressed, aspect of digital health, at least in academia. Sustainability cannot simply mean asking the funding agencies for more money as the data continue to grow at unprecedented rates. We need new business models (academia is a business), including public-private partnerships with private enterprise, which has been thinking about these problems for a while. We need to recognize that data sustainability is a global, not a national, problem, and finally we need to begin to make informed decisions about what data to discard. Consider examples of the kinds of discussions that need to be had, leading to policies and procedures that can then be put in place. Discussions need to be had around business models that provide services atop free, open content and generate revenue to sustain that content. Discussions need to be had that review other global industries, e.g., banking and commerce, to consider best (and worst) practices associated with the global management of data. Lastly, we need to consider which data we need to sustain. That consideration begins with how we actually use data. To date, the study of how data are actually utilized is in its infancy in academia. Funded data providers are required to give global statistics on data use, but this does not speak to how each element of data in a corpus is utilized and why. When we understand this better we can make informed decisions about what to discard, with the understanding that it could be regenerated later at a cost that makes storing it nonsensical. Data-rich private sector companies need to be engaged in this discussion so academia can learn from their best practices.
Sustainability is also an institutional problem. Academic institutions at this time rarely take full advantage of their digital assets, including the biomedical data being generated by their faculty and students. The recent Moore and Sloan Foundation initiative (with which I was involved) was a departure in that it rewards institutions, rather than individuals, for best data science practices. Mechanisms that reward institutions for their careful stewardship and open accessibility of biomedical data should be considered, as should programs that support and promote data scientists within those institutions. The lack of career growth paths, and the perceptions of faculty review committees, need to change so that the value of institutional data scientists is elevated. Programs can be designed to support this.
Trust in the data has been the biggest factor in the success of the data and knowledge resources (databases and journals) I have been involved with over the years. Trust speaks to both the security and the quality of the data. Security is temporal and personal. What is secure today may not be secure with the analytical tools of tomorrow. What one person wants to keep secure, another wants to make public so as to benefit others. We need to be flexible in our approach to security. Surprisingly, quality is not something we pay enough attention to. Current modes of data and knowledge management (databases and journals) lack sufficient feedback mechanisms for reporting on their content. Likewise, there is a data curation-query cycle that is mostly missing from current data management practices. Querying a corpus informs us about outliers in that corpus. Such outliers may be discoveries, or they may be errors that can be corrected or discarded. We need to stimulate more inquiry into the trustworthiness of the data we are generating.
Integration of disparate data, often at different biological scales, is a major characteristic of current and future biomedical research discoveries. Optimizing such integration speaks to data representation, metadata, ontologies, provenance and so on. These are aspects for which good technical solutions already exist, but the motivation and reward to create well-formed datasets from which integration can occur are missing. Facilitating the cataloging and comparison of datasets is one mechanism for creating motivation among researchers; funding mandates are another. Data sharing policies are a great step forward; once they are firmly in place, the next step is to establish how data should be presented so that they may be optimally shared.
Analyze/Discover. Discovery informatics is in its infancy. Search engines are grappling with the need for deep search, but it is doubtful they will fulfill the needs of the biomedical research community when it comes to finding and analyzing the appropriate datasets. Let me cast the vision in a use case. As a research group winds down for the day, algorithms take over, deciphering from the day's online raw data, lab notes, grant drafts, etc. the underlying themes being explored by the laboratory (the lab's digital assets). Those themes are the seeds of a deep search to discover what is relevant to the lab that has appeared since a search was last conducted, across published papers, public datasets, blogs, open reviews, etc. The next morning, the results of the deep search are presented to each member as a personalized view for further post-processing. We have a long way to go here, but programs that incentivize groups of computer, domain and social scientists to work on these needs will move us forward.
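The theme-extraction step in this use case could, as a deliberately crude illustration, be sketched with nothing more than TF-IDF scoring over a lab's documents; the documents, terms and cutoff below are all hypothetical, and a real system would of course go far beyond term statistics:

```python
import math
import re
from collections import Counter

def top_themes(documents, k=3):
    """Rank terms across a set of documents by summed TF-IDF (toy sketch)."""
    tokenized = [re.findall(r"[a-z]+", d.lower()) for d in documents]
    df = Counter()  # document frequency: in how many documents each term appears
    for tokens in tokenized:
        df.update(set(tokens))
    n = len(tokenized)
    scores = Counter()
    for tokens in tokenized:
        tf = Counter(tokens)
        for term, count in tf.items():
            # smoothed IDF (log((1+n)/(1+df)) + 1) so no term scores exactly zero
            idf = math.log((1 + n) / (1 + df[term])) + 1
            scores[term] += (count / len(tokens)) * idf
    return [term for term, _ in scores.most_common(k)]

# Hypothetical "digital assets" from one day in a lab.
assets = [
    "kinase inhibitor binding assay raw data",
    "notes on kinase structure and inhibitor docking",
    "draft grant aim one kinase inhibitor selectivity",
]
print(top_themes(assets))  # the recurring terms rank highest
```

The top-ranked terms would then seed the overnight deep search against papers, datasets and blogs, with each lab member's view filtered by their own documents.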
Disseminate/Visualize/Comprehend. In 2005 I wrote an editorial asking the question: is a biological database really different from a biological journal? The answer then, as now, is no; what is different is the way their value is perceived. What has changed since is the emergence of data journals, and of databases hiring more curators to extract information from the literature and add it to databases (how cost-ineffective is that!). In the world of digital scholarship the paper becomes a means to execute upon the underlying data, a tool of interactive inquiry. Open access opens the door to these possibilities, but there is much to be done.
What I have tried to do here is simply introduce how I am thinking about the problems of biomedical research data; I appreciate that there is little here on how these problems might be addressed. I hope to have that discussion.