Presented By: Department of Statistics Dissertation Defenses
Bayesian Generative Modeling of Latent Subpopulations with Nonparametric Distributions
Yilei Zhang
Across many scientific domains, researchers increasingly collect large heterogeneous datasets containing multiple meaningful subpopulations whose labels are unavailable. These subpopulations may be related in complex ways, and each may exhibit rich internal structure. Scientific analysis often requires not only assigning observations to latent subpopulations, but also characterizing the distributional structure within each subpopulation. Mixture models provide a natural framework for this goal. However, most existing work assumes that component distributions belong to specified parametric families, which are almost always misspecified in practice. Capturing complex subpopulation structures therefore requires extending mixture models to allow nonparametric component distributions. This extension immediately raises fundamental challenges of identifiability and inference: since only the overall population distribution is observed, it is unclear what should count as a distinct subpopulation; when components are highly flexible, it is unclear whether they can be separated, especially in overlapping regions; and even when separation is theoretically possible, reliably estimating latent subpopulations remains a major inferential challenge. In this dissertation, we address these theoretical and methodological challenges within a systematic Bayesian nonparametric framework.
First, we develop a unified framework based on mixtures of Dirichlet process mixtures (MDPMs) for two classes of nonparametric mixture structures: one in which components’ high-density regions are spatially differentiated, and another in which components may fully overlap but are distinguished by contrasting density levels. We develop scalable algorithms and evaluate them through simulations and real-data applications in univariate and multivariate settings, showing that component distributions can be accurately recovered under mild conditions.
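To make the MDPM structure concrete, here is a minimal generative sketch (illustrative only, not the dissertation's actual model or code): two latent subpopulations, each of which is itself a truncated Dirichlet process mixture of one-dimensional Gaussians built via stick-breaking. All parameter values (component centers, concentration `alpha`, truncation level `T`) are assumptions for illustration.

```python
import numpy as np

def stick_breaking(alpha, T, rng):
    """Truncated stick-breaking weights for a Dirichlet process (T atoms)."""
    betas = rng.beta(1.0, alpha, size=T)
    betas[-1] = 1.0  # force the weights to sum to 1 at the truncation level
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas[:-1])))
    return betas * remaining

def sample_mdpm(n, pi=(0.5, 0.5), alpha=1.0, T=10, seed=0):
    """Draw n points from a 2-component mixture whose components are
    themselves truncated DP mixtures of 1-D Gaussians (hypothetical setup)."""
    rng = np.random.default_rng(seed)
    components = []
    for center in (-4.0, 4.0):  # illustrative subpopulation locations
        w = stick_breaking(alpha, T, rng)
        mu = rng.normal(center, 1.0, size=T)  # atoms scattered around the center
        components.append((w, mu))
    z = rng.choice(len(pi), size=n, p=pi)  # latent subpopulation label
    x = np.empty(n)
    for i, zi in enumerate(z):
        w, mu = components[zi]
        k = rng.choice(T, p=w)             # atom within the subpopulation
        x[i] = rng.normal(mu[k], 0.5)
    return x, z

x, z = sample_mdpm(1000)
```

The nesting is the point: the outer mixture indexes scientifically meaningful subpopulations, while each inner DP mixture gives that subpopulation a flexible nonparametric density rather than a single parametric family.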
Second, we extend the approach to multivariate settings where component high-density regions are spatially differentiated but not convexly separable. To handle complex density-contour geometry, we approximate these regions by unions of hypercubes and construct MDPMs over the resulting coverings, allowing the model to learn component distributions with complex latent-support geometries. Simulation studies demonstrate strong performance across diverse settings.
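A hedged sketch of the covering idea (assumed function names and thresholds, not the dissertation's algorithm): approximate a nonconvex high-density region by the union of axis-aligned grid cells, i.e. hypercubes, that contain enough sample points. A ring-shaped point cloud illustrates a support that no convex set separates from its hole.

```python
import numpy as np

def hypercube_cover(points, cell_width, min_count=5):
    """Cover the empirical high-density region of a point cloud with
    axis-aligned hypercubes: keep every grid cell of side cell_width
    containing at least min_count points. Returns the set of integer
    cell indices (cell lower corner = index * cell_width)."""
    cells = np.floor(points / cell_width).astype(int)
    uniq, counts = np.unique(cells, axis=0, return_counts=True)
    return {tuple(c) for c in uniq[counts >= min_count]}

# Example: noisy points on a unit ring -- a nonconvex region whose
# high-density set is well approximated by a union of small squares.
rng = np.random.default_rng(1)
theta = rng.uniform(0.0, 2.0 * np.pi, 2000)
ring = np.column_stack([np.cos(theta), np.sin(theta)])
ring += rng.normal(0.0, 0.05, size=(2000, 2))
cover = hypercube_cover(ring, cell_width=0.25)
```

Shrinking `cell_width` refines the covering, trading a tighter geometric approximation against fewer points per cell; the dissertation's construction places MDPMs over such coverings rather than thresholding counts directly.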
Third, we provide theoretical support for the framework by establishing identifiability conditions for the first class of mixture structures. We further derive posterior contraction rates under the MDPM framework. These results show that MDPMs preserve the efficiency of learning the overall population density relative to a single Dirichlet process mixture, while enabling latent nonparametric component distributions to be learned at a nearly polynomial rate, substantially faster than the typical rates of nonparametric deconvolution.
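For readers outside Bayesian asymptotics, the contraction claims can be read against the standard definition below (a generic Ghosal–van der Vaart-style statement, not the dissertation's specific theorem; the symbols and rates shown are the textbook ones, given here only for context).

```latex
% Posterior contraction at rate \varepsilon_n: for some large constant M,
% the posterior mass outside an M\varepsilon_n-ball around the truth f_0 vanishes.
\Pi\bigl( f : d(f, f_0) > M \varepsilon_n \,\big|\, X_1, \dots, X_n \bigr)
  \;\longrightarrow\; 0 \quad \text{in } P_{f_0}\text{-probability}.
% Benchmark rates from the literature (illustrative, not the dissertation's):
% DP mixtures of Gaussians learn a \beta-smooth overall density at a
% near-polynomial rate, \varepsilon_n \asymp n^{-\beta/(2\beta+1)} (\log n)^{t},
% whereas classical Gaussian deconvolution of latent distributions attains
% only logarithmic rates, \varepsilon_n \asymp (\log n)^{-\gamma}.
```

The abstract's claim is that the latent component densities under the MDPM sit near the first regime rather than the second.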