Presented By: Department of Statistics Dissertation Defenses
Mixture and Admixture Models: Estimation Rate, Model Selection, Interpretation, and Applications in Heterogeneous Data Analysis
Trong Dat Do
Mixture and admixture models are powerful tools in modern statistics, with successful applications across diverse fields. By increasing the number of populations, these models can approximate any distribution arbitrarily well, albeit at the cost of interpretability. This poses a non-trivial question in practice: how should we select the complexity of mixture and admixture models and interpret them? This dissertation aims to answer this question by studying their asymptotic theory and developing novel statistical methods to select and interpret models fit to heterogeneous data, especially in genomics and population genetics. The two main tools used throughout the dissertation are optimal transport distances and non-parametric statistical methods.
First, we provide an in-depth asymptotic analysis of finite mixture models and their variants. By developing novel notions of strong identifiability, we prove convergence rates for mixing measures under optimal transport distances in mixture of regression models and deviating mixture models, which are widely used extensions of finite mixture models. The pointwise parameter estimation rate is shown to be optimal when the number of populations is known, but substantially slower when the model is over-fitted. Indeed, under over-fitting, many redundant mixture components can compete to estimate a common true component, and their individual estimation rates deteriorate. Motivated by this phenomenon, we develop a novel algorithm to combine mixture components, leading to a better parameter estimation rate from over-fitted mixing measures. The outcome of this algorithm is a hierarchical clustering tree of mixing measures, which visualizes the relative distances between mixture components and is useful for model selection. Our method is illustrated on several simulated datasets and a single-cell RNA-seq dataset.
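To make the component-merging idea concrete, the following is a minimal illustrative sketch in Python, not the dissertation's actual algorithm: given the fitted means and weights of a hypothetical over-fitted Gaussian mixture, it builds a hierarchical clustering tree over the component locations (ordinary Ward linkage stands in for an optimal-transport-based merging criterion) and merges nearby components by pooling their weights.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    # Hypothetical over-fitted fit: five components estimating three true populations.
    # In practice these would come from, e.g., an EM fit with too many components.
    means = np.array([[0.0], [0.1], [2.9], [3.0], [6.0]])
    weights = np.array([0.22, 0.20, 0.18, 0.20, 0.20])

    # Hierarchical clustering tree over component locations (Ward linkage is a
    # simple stand-in for an optimal-transport-aware merging criterion).
    tree = linkage(means, method="ward")

    # Cut the tree so that components competing for the same true population merge;
    # merged locations are weight-averaged and merged weights are summed.
    labels = fcluster(tree, t=1.0, criterion="distance")
    merged_means = np.array([np.average(means[labels == k], axis=0,
                                        weights=weights[labels == k])
                             for k in np.unique(labels)])
    merged_weights = np.array([weights[labels == k].sum() for k in np.unique(labels)])
    print(merged_means.ravel())   # roughly [0.05, 2.95, 6.0]
    print(merged_weights)         # roughly [0.42, 0.38, 0.20]

Cutting the tree at different heights corresponds to different candidate numbers of populations, which is how such a tree can support model selection.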
Next, we study the theory and methods for admixture models. In the context of topic modeling, by developing a general representation of Dirichlet moment tensors in terms of diagonal tensors (and vice versa) using techniques from enumerative combinatorics, we connect Latent Dirichlet Allocation to the (simpler) mixture of product models, which allows us to derive topic estimation rates both when the true number of topics is known and when it is unknown. Finally, we study the large-sample theory of admixture models in the context of population genetics. We propose an asymptotic version of a regularity condition known as the "anchor condition", which allows us to establish the parameter estimation rate in the large-sample, high-dimensional regime. Motivated by the theory, we propose a fast and accurate model selection method based on the parametric bootstrap. We illustrate our theory and methods on several datasets simulated from admixture and coalescent models.
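As a rough illustration of the parametric-bootstrap idea (not the method developed in the dissertation, which targets admixture models for genotype data), the sketch below uses a Gaussian mixture as a stand-in: a k-population fit is tested against a (k+1)-population fit by comparing the observed log-likelihood gain with its distribution over datasets simulated from the k-population fit.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    # Illustrative data from two well-separated populations (a stand-in for the
    # genotype matrices an admixture analysis would use).
    X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(4, 1, (200, 2))])

    def bootstrap_pvalue(X, k, n_boot=50):
        """Parametric-bootstrap check of 'k populations suffice' vs. 'k+1 are needed'."""
        fit_k = GaussianMixture(k, random_state=0).fit(X)
        fit_k1 = GaussianMixture(k + 1, random_state=0).fit(X)
        observed = fit_k1.score(X) - fit_k.score(X)   # per-sample log-likelihood gain
        null_gains = []
        for b in range(n_boot):
            Xb, _ = fit_k.sample(len(X))              # simulate from the k-population fit
            null_gains.append(GaussianMixture(k + 1, random_state=b).fit(Xb).score(Xb)
                              - GaussianMixture(k, random_state=b).fit(Xb).score(Xb))
        return np.mean(np.array(null_gains) >= observed)

    # Increase k until the observed gain is no longer extreme under the bootstrap null.
    for k in (1, 2, 3):
        print(k, bootstrap_pvalue(X, k))

Here a small p-value at k = 1 and a large one at k = 2 would point to two populations; the dissertation's method follows the same logic with admixture-model fits and simulators.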