Dean’s Blog: Data Science Meets COVID-19

This past week I had an enjoyable hour On-Air answering questions from 600 University of Virginia alumni and friends on how data science is addressing the COVID-19 crisis. Many had submitted questions beforehand as well. Looking through these questions and thinking back to the On-Air session, I have distilled the three most sought-after answers to address here. 

As I said at the On-Air session, I am neither an epidemiologist nor a virologist. Rather, I have spent over 40 years at the interface of life sciences and computational science, and that is the perspective I bring, along with that of my colleagues who make up the very new University of Virginia School of Data Science (SDS) – A School without Walls.

Together we don’t have the answers – I wish we did. What we do have is an informed opinion on where we are headed, which we hope will ultimately reveal those answers. It is easy to refer to the current COVID-19 pandemic as unprecedented and ignore what has gone before, but that would be a mistake. With that in mind, let’s get into it.

Why are the models predicting infection and death rates so variable and what does that imply? 

It’s true: early on the CDC predicted 200,000 deaths in the US, whereas the Imperial College London model predicted 2.2M US deaths. As I write, nearly 78,000 people have died in the US and we have 1.3M confirmed cases, with some areas showing a decline in rates of infection and other areas a rise. First and foremost, a model is just that – a model: an effort to define outcomes using the information that is available. There are statistical models that provide an outcome based on a meaningful sample of the data; there are mechanistic models that break the complex problem into pieces and look at the interrelationship between those pieces. Purists will argue over the value of their respective models, but to me there are three aspects that are so much more critical than the model itself: the data any model is fed; the immediate impact that any model has on society; and how the model informs our future.

For an excellent discussion on the first issue I recommend Why It’s So Freaking Hard To Make A Good COVID-19 Model. Let me paraphrase a small part of that article to give you a flavor of why accurate data matters so much and why we do not have that data, a problem especially prevalent in the US. How many people will die is a function of the susceptible population, the infection rate and the fatality rate. Accurate estimates of these numbers are more important than which model one uses to make an estimate. Nuances between models come into play more when we have accurate counts. The irony of the situation is that we might never have them – how many people died in the Spanish Flu pandemic which first appeared in 1918 is still debated. In the current pandemic, to name a few confounders, we don’t know the susceptible population since we don’t know who has established immunity; we don’t know the infection rate, which is variable by region, and we know too little about how people become infected, the lifetime of the virus outside the host etc. We don’t know the fatality rate since many cases are unreported as COVID-19, and so it goes. The input data to models will improve over time, which will then lead to modification of the models, and a better picture will emerge. This will matter more for the next pandemic than it will for this one.
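To make the point about inputs concrete, here is a minimal sketch, assuming the simplest possible relationship: deaths equal the susceptible population times the infection rate times the fatality rate. The function name and every number below are hypothetical placeholders chosen only to show how uncertainty in the inputs carries through to the answer; they are not estimates for COVID-19.

```python
def estimated_deaths(susceptible, infection_rate, fatality_rate):
    """Simplest possible estimate: susceptible x infection rate x fatality rate."""
    return susceptible * infection_rate * fatality_rate


# Hypothetical (low, high) guesses for each input. Placeholders, not real estimates.
SUSCEPTIBLE = (200_000_000, 300_000_000)
INFECTION_RATE = (0.10, 0.40)
FATALITY_RATE = (0.005, 0.02)

# Combine all the low guesses, then all the high guesses.
low = estimated_deaths(SUSCEPTIBLE[0], INFECTION_RATE[0], FATALITY_RATE[0])
high = estimated_deaths(SUSCEPTIBLE[1], INFECTION_RATE[1], FATALITY_RATE[1])

print(f"low estimate:  {low:,.0f}")
print(f"high estimate: {high:,.0f}")
print(f"high / low:    {high / low:.0f}x")
```

Even with this one-line model, modest uncertainty in each input multiplies into a very wide range of outcomes, which is why better data matters more than the choice of model at this stage.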

Which brings us to a key point. Research is incremental and operates in a virtuous cycle: data fed into models produce outcomes whose flaws drive us to improve the data, which leads to better modeling, and on it goes. The work of improving the data will go on long after the current pandemic has subsided. Some estimates put data wrangling, as improving the data is called, at 80% or more of the work that goes on in data science. As part of the virtuous cycle we will get to the point where data are more usable immediately upon collection. With US health data we have a long way to go. The fragmented nature of our health system makes this a hard problem: different companies use different software platforms and conform to different standards for describing data (the metadata), which makes aggregation of data difficult. Perhaps a crisis will lead us to think more like countries with national health systems, where, regardless of what you think of the economic model, there is no denying a more unified system of clinical data.
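As a small illustration of why differing metadata makes aggregation hard, here is a sketch, assuming two imaginary record systems that describe the same facts with different field names. The system names, field names and records are all hypothetical; real clinical data standards are far richer than this.

```python
# Two hypothetical sources describing the same kind of record differently.
SYSTEM_A = [{"patient_dob": "1980-02-01", "dx_code": "CODE-1"}]
SYSTEM_B = [{"birthDate": "1975-11-30", "diagnosis": "CODE-1"}]

# Hand-written mapping from each source's field names to one shared vocabulary.
FIELD_MAPS = {
    "system_a": {"patient_dob": "birth_date", "dx_code": "diagnosis_code"},
    "system_b": {"birthDate": "birth_date", "diagnosis": "diagnosis_code"},
}


def harmonize(record, source):
    """Rename a record's fields to the shared vocabulary, dropping unmapped fields."""
    mapping = FIELD_MAPS[source]
    return {mapping[field]: value for field, value in record.items() if field in mapping}


# Only after renaming can the two sources be combined and analyzed together.
combined = [harmonize(r, "system_a") for r in SYSTEM_A]
combined += [harmonize(r, "system_b") for r in SYSTEM_B]

for row in combined:
    print(row)
```

Every new source needs its own mapping, written and checked by hand, which hints at why wrangling takes up so much of the work.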

Pandemic modeling is not a new field, but governments and other agents of change have never paid so much attention to the outcomes of such models. This is a good thing, even if it takes a pandemic. I contrast this with prior work on health disparities. Big data and data science generated excitement as to what they would tell us about health disparities. From my perspective they simply gave us a finer-grained view of disparities than that derived from census and related data. In other words, they told us what we already knew, perhaps in finer detail, but did not change anything. A paper on health disparities in the academic literature does not change anything. Agencies (local, state and federal) engaged in the study can be change agents. COVID-19 gives us that opportunity. We have our state government making decisions based on what our own Biocomplexity Institute is telling them based on their models; the federal government has used models from the University of Washington; and other states are relying on models from their respective universities and laboratories. The models are different, but they are being acted upon. Perhaps in the future we will have something like the National Hurricane Center – the National Pandemic Center – with experts coming up with meta models which get acted upon as part of a coordinated and nationwide response. I live in hope.

Privacy and security seem at risk during a pandemic?

There are undoubtedly tradeoffs between providing access to the data needed to develop the appropriate response to the pandemic and revealing unwanted information on an individual, organization, government and so on. For further reading on the general nature of the problem see here and for AI specifically see here. Mitigating these issues begins with the training of individuals who have the skills to extract knowledge from such data. At the School of Data Science we are focusing on responsible data science, where our efforts are directed not just to courses on ethics (themselves important), but also towards developing a culture of being responsible. This applies to everything from experimental design to how the final outcomes are represented and disseminated. That training must also recognize that those tradeoffs may change during a pandemic. Even so, there are at least three levels of consideration: the individual, the data scientist (including the organization they represent), and existing laws and policies. As far as possible the desires and the rights of the individual should be respected, unless trumped by laws and policies. Data scientists and their organizations should be subservient to both. Of course life at any time, and certainly during a pandemic, is not that simple, nor is the question of who makes the judgement call under such circumstances. Our mantra in the School of Data Science is to act for social good while preserving the rights and desires of the individual. Think what that means regarding contact tracing, which is sure to become more prevalent in coming months. Automated contact tracing uses location data obtained from smartphones. Responsible data science means we use these data to inform those who may be infected by virtue of contact, but without revealing who the infected person was. Any subsequent use, say in modeling the spread of a virus in a campus environment, would require the consent of all parties to use their data in an anonymized way and only for the purposes they have agreed to, all the while being sure the data are secure. Existing laws, policies and organizations, for example the Institutional Review Board (IRB) in a research organization, are there to protect these principles. However, protecting them when large amounts of data, including location data, are integrated, often in unexpected ways, presents challenges not faced before. Responsible data scientists must work hand-in-hand with IRBs to ensure appropriate data governance.
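The contact tracing idea can be sketched in a few lines, assuming location data arrive as simple (person, place, time slot) records. Everything here is hypothetical: the names, the records, the function names and the definition of "contact" as being in the same place in the same time slot. It leaves out consent, security and how proximity is really measured, and is meant only to show that those exposed can be told without being told by whom.

```python
from collections import defaultdict

# Hypothetical location pings: (person_id, place, time_slot).
PINGS = [
    ("alice", "library", 9),
    ("bob", "library", 9),
    ("carol", "cafe", 9),
    ("alice", "cafe", 10),
    ("carol", "cafe", 10),
    ("dave", "gym", 10),
]


def contacts_to_notify(pings, infected_id):
    """Return ids of people who shared a place and time slot with the infected person."""
    present = defaultdict(set)  # (place, time_slot) -> people seen there
    for person, place, slot in pings:
        present[(place, slot)].add(person)

    exposed = set()
    for people in present.values():
        if infected_id in people:
            exposed |= people - {infected_id}
    return exposed


def build_notifications(pings, infected_id):
    """Build one message per exposed person; the message never names the infected person."""
    message = "You may have been exposed to the virus. Please consider getting tested."
    return {person: message for person in sorted(contacts_to_notify(pings, infected_id))}


notifications = build_notifications(PINGS, "alice")
for person, text in notifications.items():
    print(person, "->", text)
```

The identity of the infected person is used only inside the matching step and never appears in what is sent out.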

What can I do to help?

First and foremost, remember data science is a team sport. Stay informed and share your ideas and concerns with us at any time. To stay informed, join our mailing list here, follow us on Twitter @uvadatascience and refer to our website. To communicate with us, send us email or contact me directly. Together we will get through this.

About Phil Bourne

Stephenson Founding Dean of the School of Data Science and Professor of Data Science & Biomedical Engineering, University of Virginia