March 15-18, 2026

Introductory Overview Lecture (IOL) Series

Augmenting Statistical Inference with Machine Learning
Speaker: Rajen Shah

Over the past few decades, statistics and machine learning have developed powerful regression tools—random forests, boosted trees, neural networks—that deliver exceptional predictive performance across diverse applications. Yet in many scientific, biomedical, and policy settings, predictive accuracy alone is not enough. For example, we may wish to understand the effects of individual predictors on clinical outcomes, infer aspects of the causal structure among biomarkers, or estimate treatment effects from observational health records. While classical parametric models can often support such inference, their reliance on rigid assumptions makes them poorly suited to the large and complex datasets common in modern biostatistical applications. To bridge this gap, a rapidly expanding body of work seeks to integrate the flexibility and accuracy of machine learning into rigorous frameworks for statistical inference.

In the first lecture, we will look at a major branch of this work: targeted or debiased machine learning (DML), grounded in semiparametric theory. Semiparametric models are a powerful generalisation of classical finite-dimensional statistical models that additionally incorporate infinite-dimensional “nuisance” parameters—not the primary target of inference but essential for faithful modelling of the data; for example, nuisance functions may represent nonparametric confounder adjustment in observational studies. While naïve estimation using machine learning methods typically results in high bias for the target parameter, we shall see how one can carefully combine estimates of the nuisance parameters to produce estimators that are insensitive to their errors. Remarkably, these DML estimators are typically asymptotically Gaussian, enabling familiar tools like confidence intervals and hypothesis tests, even in highly flexible models.
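To make the orthogonalisation concrete, here is a minimal worked example in the partially linear model—a standard illustration from the DML literature, not a summary of the lecture itself:

```latex
% Partially linear model: \theta_0 is the target; g_0, m_0 are
% infinite-dimensional nuisance functions
Y = \theta_0 D + g_0(X) + \varepsilon, \quad \mathbb{E}[\varepsilon \mid D, X] = 0,
\qquad D = m_0(X) + v, \quad \mathbb{E}[v \mid X] = 0.

% Writing \ell_0(x) = \mathbb{E}[Y \mid X = x], the orthogonalised
% (Robinson-type) estimator regresses residuals on residuals:
\hat{\theta} = \frac{\sum_{i=1}^{n} \bigl(D_i - \hat{m}(X_i)\bigr)\bigl(Y_i - \hat{\ell}(X_i)\bigr)}{\sum_{i=1}^{n} \bigl(D_i - \hat{m}(X_i)\bigr)^{2}}.
```

Errors in the nuisance estimates \hat{m} and \hat{\ell} enter the bias of \hat{\theta} only through their product, so \hat{\theta} remains root-n consistent and asymptotically Gaussian whenever each nuisance is estimated at a rate faster than n^{-1/4}—a rate many machine learning methods can attain—with the nuisances fitted on held-out folds (cross-fitting) to remove overfitting bias.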

The second lecture moves from conceptual foundations to practical methodology. We will explore the application of DML techniques to a variety of problems, including conditional independence testing, for example to infer causal structure; identifying shifts in regression models over time; treatment effect estimation from observational studies; and inference in partially linear models for longitudinal data.
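As a rough illustration of how such a procedure can look in practice, the sketch below implements cross-fitted DML for the partially linear model above using random forests for both nuisance regressions. The function name dml_plm, the choice of learner, the number of folds, and the synthetic data are all illustrative assumptions, not drawn from the lectures or from any particular package.

```python
# Sketch: cross-fitted DML estimate of theta in the partially linear
# model Y = theta*D + g(X) + eps, via Robinson-style orthogonalisation.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

def dml_plm(X, D, Y, n_folds=5, seed=0):
    """Cross-fitted estimate of theta with a Gaussian-based 95% CI."""
    n = len(Y)
    D_res = np.zeros(n)  # D - E[D|X], estimated out-of-fold
    Y_res = np.zeros(n)  # Y - E[Y|X], estimated out-of-fold
    for train, test in KFold(n_folds, shuffle=True, random_state=seed).split(X):
        m_hat = RandomForestRegressor(random_state=seed).fit(X[train], D[train])
        l_hat = RandomForestRegressor(random_state=seed).fit(X[train], Y[train])
        D_res[test] = D[test] - m_hat.predict(X[test])
        Y_res[test] = Y[test] - l_hat.predict(X[test])
    theta = np.sum(D_res * Y_res) / np.sum(D_res**2)
    # Plug-in variance of the orthogonal score (sandwich form)
    psi = (Y_res - theta * D_res) * D_res
    se = np.sqrt(np.mean(psi**2) / n) / np.mean(D_res**2)
    return theta, (theta - 1.96 * se, theta + 1.96 * se)

# Synthetic example with true theta = 2 and nonlinear nuisances
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5))
D = np.sin(X[:, 0]) + 0.5 * rng.normal(size=500)
Y = 2.0 * D + X[:, 1]**2 + rng.normal(size=500)
theta_hat, ci = dml_plm(X, D, Y)
print(f"theta_hat = {theta_hat:.2f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```

Because the nuisance predictions for each observation come from models trained on the other folds, the residuals on which theta is estimated are insulated from overfitting, which is what licenses the simple Gaussian confidence interval.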

The final lecture turns to hypothesis testing using machine learning, particularly in problems with broad or complex alternatives for which a classical parametric approach would be inadequate. For instance, a simple model may fail to detect the relevance of a biomarker if its effect arises through complex interactions. We will introduce a flexible “hunt-and-test” approach: data are split into two parts, one used to construct a test tailored to the observed alternative, the other used to evaluate its significance. This simple idea provides a practical approach to otherwise challenging problems, such as testing for subgroups among patient populations. We will show how combining this approach with DML enables powerful and flexible tests for nonparametric variable significance and goodness-of-fit of semiparametric models. Finally, we will discuss p-value aggregation across multiple data splits, a strategy that improves both statistical power and reproducibility—essential considerations in modern data analysis.
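The following sketch illustrates the sample-splitting idea in one simple instance: testing the null hypothesis that E[Y | X] = E[Y] (no regression signal). The choice of learner, the studentised-covariance test statistic, and all names are assumptions made for illustration; the lectures' own methodology may differ.

```python
# Sketch of "hunt-and-test": fold A hunts for a direction with a
# flexible learner; fold B tests it. Under H0: E[Y|X] = E[Y], the
# hunted prediction m_hat(X) is uncorrelated with Y on fold B, so a
# studentised covariance is asymptotically standard normal.
import numpy as np
from scipy import stats
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n = 600
X = rng.normal(size=(n, 4))
Y = np.sin(2 * X[:, 0]) * X[:, 1] + rng.normal(size=n)  # pure interaction signal

# Split: hunt on the first half, test on the second
A, B = np.arange(n // 2), np.arange(n // 2, n)
m_hat = GradientBoostingRegressor(random_state=0).fit(X[A], Y[A])

# One-sided test on fold B, independent of the hunting step
pred = m_hat.predict(X[B])
u = pred - pred.mean()
v = Y[B] - Y[B].mean()
w = u * v
T = w.sum() / np.sqrt(np.sum((w - w.mean())**2))  # self-normalised statistic
p_value = stats.norm.sf(T)
print(f"T = {T:.2f}, one-sided p = {p_value:.4f}")
```

A parametric main-effects test would struggle here, since the signal is a pure interaction; the hunting step lets the data suggest the direction in which to look. Because the p-value depends on the random split, one can rerun the procedure over several splits and aggregate; one standard rule, valid under arbitrary dependence, reports twice the median of the resulting p-values.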


Biography

Rajen Shah started as a University Lecturer in the Statistical Laboratory at the University of Cambridge in 2013, obtained his PhD in Statistics there in 2014, and became a Reader in 2019. His main research interests include developing methodology and theory for problems in high-dimensional statistics, large-scale data analysis and causal inference. He received the Royal Statistical Society Research Prize in 2017, the Faculty Lecturing Prize in 2021 and the Guy Medal in Bronze in 2022.