Presented By: Department of Statistics Dissertation Defenses
Structured Statistical Learning and Inference for Complex Scientific Data
Yang Li
This dissertation develops structured statistical learning and inference methods for complex scientific data. Here, structure refers to problem-specific patterns that can be modeled to improve learning or inference: cluster-specific abundance and presence-absence patterns in microbiome compositions, modular organization in high-dimensional conditional dependence networks, and the conditional predictive structure among outcomes, covariates, and black-box predictions. Modeling such structure can improve clustering, network learning, and inference while preserving interpretability and statistical validity.
The first part studies model-based clustering of microbiome compositional data. We develop an Ising-Dirichlet mixture model for zero-inflated compositions, where each cluster has a presence-absence dependence structure and a nonzero abundance profile. The method is designed to improve clustering with limited samples by using information from both taxon occurrence patterns and relative abundance variation. Simulations and a resistant potato starch study show improved clustering accuracy and interpretable microbiome subgroups.
The second part studies variable clustering in high-dimensional graphical models. We develop a one-step joint estimation framework for a sparse precision matrix and a latent variable partition. This allows graph estimation and partition recovery to reinforce each other, rather than clustering a separately estimated graph. The method treats the partition as an explicit estimation target and allows nonzero cross-cluster dependence, relying on a modularity criterion in which within-cluster connectivity is denser than between-cluster connectivity. Simulations and real-data applications show more stable and interpretable graph-and-cluster representations than two-stage alternatives.
The third part studies statistical inference with limited gold-standard labels and abundant black-box predictions. Because these predictions are not ground truth, valid use requires bias correction. We develop adaptive prediction-powered inference, which learns a score-side adjustment from labeled data to approximate the variance-optimal conditional score adjustment through Taylor-based and ensemble-based constructions. Simulations and real-data examples show that the method preserves coverage while producing smaller confidence regions than existing prediction-powered and surrogate-adjustment methods.
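For context, the bias correction mentioned above can be illustrated with the standard, non-adaptive prediction-powered estimator of a mean: average the predictions on the unlabeled data, then add the average labeled residual. The sketch below shows only this baseline idea and is not the adaptive method developed in the dissertation; the function name `ppi_mean` and the toy inputs are hypothetical, and the interval assumes independent labeled and unlabeled samples with a normal approximation.

```python
import math
from statistics import NormalDist, fmean, variance


def ppi_mean(y_labeled, f_labeled, f_unlabeled, alpha=0.05):
    """Baseline prediction-powered estimate of E[Y] (illustrative sketch).

    y_labeled   : gold-standard outcomes on the small labeled sample
    f_labeled   : black-box predictions on that same labeled sample
    f_unlabeled : black-box predictions on the large unlabeled sample
    """
    n, big_n = len(y_labeled), len(f_unlabeled)

    # Residuals measure how far the predictions are from the truth.
    residuals = [y - f for y, f in zip(y_labeled, f_labeled)]

    # Prediction average plus the estimated bias correction.
    estimate = fmean(f_unlabeled) + fmean(residuals)

    # The two terms come from independent samples, so variances add.
    se = math.sqrt(variance(f_unlabeled) / big_n + variance(residuals) / n)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return estimate, (estimate - z * se, estimate + z * se)


if __name__ == "__main__":
    # Toy data: predictions overshoot the truth by a constant 0.5.
    est, ci = ppi_mean([1.0, 2.0, 3.0], [1.5, 2.5, 3.5], [2.5, 2.5, 2.5, 2.5])
    print(est, ci)
```

When the predictions are accurate, the residual variance is small and the interval is driven mainly by the large unlabeled sample, which is the source of the efficiency gain over using the labeled data alone.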