What Should Our PhD in Data Science Imply?

As we prepare to offer a PhD degree in data science, a highly interdisciplinary field encompassing every imaginable discipline, we are asking what such a degree implies, what employers expect, and what a faculty actually trained in data science will look like. These are all questions we are contemplating as the program comes together. Here I focus not so much on the nuts and bolts of such a program as on what, philosophically, we wish to accomplish; philosophy is in the name of the degree, after all.

At a recent School of Data Science (SDS) search committee meeting we were discussing the qualifications needed for an Associate Dean of Academic & Faculty Affairs. We listed a PhD in data science. After a brief pause there was laughter. Does anyone in the world actually have a PhD in data science? A web search finds very few universities, at least in the US, that offer a PhD degree exclusively in data science. Many universities offer a data science specialization as part of a broader degree, for example in biomedical informatics, business analytics, or computer science. Currently, the US universities offering a “pure” PhD in data science appear to be:

  • School of Informatics and Computing, Indiana University – Purdue University Indianapolis: minimum of 60 credits
  • Center for Data Science, New York University: 72 credits
  • Data Science Program, Worcester Polytechnic Institute: minimum of 60 credits

The structure of these programs is not dissimilar to that of other fields, although expected durations vary. Here is an approximation:

  • Years 1-2 core courses, electives and research rotations
  • End of year 2 – qualifying exam to assess research potential
  • End of year 2 – identify research advisor(s)
  • Years 3 onwards research project with advisor(s)
  • Year 3 proposition exam of research topic
  • Graduate at the end of year 4

Standard stuff. More importantly, what do these programs, and what do we, want graduates to have accomplished when they walk away with that degree? How will our PhD graduates in data science with a specialization in x differ from graduates in x with a specialization in data science? By analogy in public health, what distinguishes a PhD in Public Health from a DrPH (Doctor of Public Health)? The former is designed for research-based contributions to the field; the latter is for leadership roles in practice-based settings (e.g., health department director, health officer). But even there the difference is shaky.

Deep research in some aspect of data science may or may not already belong to an existing field. For example, contributions to the fundamentals of deep learning most likely already belong to computer science and engineering; fundamental contributions in cloud computing belong to systems engineering or elsewhere, and so on. On the other hand, where does data ethics currently belong? Where will it belong in the future: perhaps in the social sciences, perhaps not? In short, since data science is a composite of existing disciplines (statistics, computer science, applied mathematics, and associated domains), would a PhD not represent a deeper study across all these domains than one would experience in a master’s degree, including a deeper dive into a domain area of specialization, with increased emphasis on developing the research method?

Another way of thinking about our PhD graduates is as ∏ (pi) shaped rather than T shaped. Both have broad expertise, but where the T has a single deep dive into a specific domain area, the ∏ has two: one into a specific domain area and one into data science.

If so, what should a PhD graduate in data science look like to an employer, either in academia or the private sector? We asked this question of our SDS advisory board, a distinguished group of private sector experts. We couch their response, as well as our own knowledge of a career in academia, in terms we are using to operationalize our data science school.

Value – determining the value of the research we do, accounting for the natural tensions between social good and business practice. Example PhD training areas:

  • Ethics for Data Science
  • Privacy and regulatory concerns with protected information
  • Sociology and its intersection with data science

Design – the ability to both consume data and produce data products of the highest value. Example PhD training areas:

  • Human computer interaction
  • Data representation and manipulation – e.g., metadata, ontologies
  • Data characteristics – e.g., sparsity, high dimensionality, complexity
  • Data visualization
  • Study design

Systems – infrastructure and architectures to support big data/data science. Example PhD training areas:

  • Cybersecurity
  • Databases
  • Cloud & distributed computing
  • Sensors
  • Algorithms and data structures
  • Signal processing

Analytics – statistical and machine learning theory & application to analyze, infer, simulate and predict. Example PhD training areas:

  • Theory
    • Statistics & probability – inference, Bayesian, multivariate
    • Graphs and networks
    • Linear algebra & linear models
    • Game theory, decision theory
  • Application
    • Deep learning/neural networks
    • Natural language processing

Practice – brings all of the above together in the form of research in one or more specific domain areas, for example biomedical data sciences, digital humanities, or finance. Whatever the domain area, there is a need for professional development in areas that are important to a successful research career:

  • Communication – written and verbal
  • Study design
  • Time management
  • The art of academia – effective grant and paper writing, collaboration, personnel management, etc.

Finally, there is training in the guiding principles by which we are establishing the school, namely:

  • Interdisciplinarity – comfort in multiple traditional disciplines
  • Provision of open knowledge in all we do – written articles, data accessibility and usability, software availability
  • Reproducibility – a prerequisite to the provision of open knowledge
  • Diversity, equality, inclusion – with respect to all with whom we work
  • Innovation & translation – research that makes a difference

No single mentor at this point in time is likely to meet the complete needs of our PhD students. Research is best conducted through dual mentorship, pairing an expert in data science with a domain expert, a model known to have worked well in other emergent interdisciplinary fields, for example bioinformatics. Research rotations allow the student to explore domains and research topics before homing in on a specific career direction.

Conferring a PhD in data science will be an experiment. How do you evaluate a PhD degree in predictive urban modeling vs radiological image analysis vs real-time stock market analysis? What are the comparative rubrics? Should there be comparable rubrics for such diverse domains? Should there be a written thesis to define original work in all domains? Are high quality data and software a partial or complete substitute for a thesis? The first students, like those at New York University, are brave souls. Or are they? Industry tells us it wants researchers in data science with a depth of research experience beyond the capstone project found in an MS degree, and professors are needed now to teach the next generation of data scientists. Those brave souls will be in high demand.


About Phil Bourne

Stephenson Founding Dean of the School of Data Science and Professor of Data Science & Biomedical Engineering, University of Virginia