Presented By: Department of Statistics Dissertation Defenses
Statistical Modeling for Structured Network and Functional Data
Yichao Chen
Complex modern datasets increasingly exhibit structured dependencies. These structures pose new challenges for statistical learning and call for frameworks that can capture higher-order interactions, relational patterns, and temporal dynamics. Motivated by these challenges, this dissertation develops models for structured network and functional data in three parts.
The first chapter focuses on modeling higher-order interactions in complex networks. Most statistical models for networks focus on pairwise interactions between nodes. However, many real-world networks involve higher-order interactions among multiple nodes, such as co-authors collaborating on a paper. Hypergraphs provide a natural representation for these networks, with each hyperedge representing a set of nodes. The majority of existing hypergraph models assume uniform hyperedges (i.e., edges of the same size) or rely on diversity among nodes. In this work, we propose a new hypergraph model based on non-symmetric determinantal point processes. The proposed model naturally accommodates non-uniform hyperedges, has tractable probability mass functions, and accounts for both node similarity and diversity in hyperedges. For model estimation, we maximize the likelihood function under constraints using a computationally efficient projected adaptive gradient descent algorithm. We establish the consistency and asymptotic normality of the estimator.
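To make the determinantal construction concrete, the sketch below evaluates hyperedge probabilities under a generic non-symmetric determinantal point process in L-ensemble form, P(S) = det(L_S) / det(L + I). The abstract does not state the dissertation's parameterization, so the kernel here (a positive semidefinite part plus a skew-symmetric part) and the names `make_nonsymmetric_kernel` and `hyperedge_prob` are illustrative assumptions, not the proposed model itself.

```python
import itertools
import numpy as np

def make_nonsymmetric_kernel(n_nodes, rank, seed=0):
    """Hypothetical kernel: PSD part V V^T plus a skew-symmetric part.

    A PSD symmetric part guarantees that every principal minor is
    nonnegative, so each subset receives a valid probability.
    """
    rng = np.random.default_rng(seed)
    V = rng.normal(size=(n_nodes, rank))
    B = rng.normal(size=(n_nodes, n_nodes))
    return V @ V.T + (B - B.T)

def hyperedge_prob(L, nodes):
    """P(S) = det(L_S) / det(L + I) for a node subset S."""
    n_nodes = L.shape[0]
    normalizer = np.linalg.det(L + np.eye(n_nodes))
    if len(nodes) == 0:
        return 1.0 / normalizer  # determinant of the empty submatrix is 1
    idx = np.asarray(nodes)
    return np.linalg.det(L[np.ix_(idx, idx)]) / normalizer

# Enumerate all subsets of 5 nodes; sizes vary, so hyperedges are non-uniform.
L = make_nonsymmetric_kernel(n_nodes=5, rank=2)
all_subsets = [s for k in range(6) for s in itertools.combinations(range(5), k)]
probs = [hyperedge_prob(L, s) for s in all_subsets]
total = sum(probs)  # the probabilities over all subsets sum to one
```

Because the normalizer is a single determinant, the mass function stays tractable even though the number of candidate hyperedges grows exponentially in the number of nodes. The skew-symmetric part is what lets such a kernel express positive as well as negative association between nodes.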
The second chapter presents a probabilistic model for community detection in signed networks. Community detection, discovering the underlying communities within a network from observed connections, is a fundamental problem in network analysis, yet it remains underexplored for signed networks. In signed networks, both edge connection patterns and edge signs are informative, and structural balance theory (e.g., triangles aligned with "the enemy of my enemy is my friend" and "the friend of my friend is my friend" are more prevalent) provides a global higher-order principle that guides community formation. We propose a Balanced Stochastic Block Model (BSBM), which incorporates balance theory into the network generating process such that balanced triangles are more likely to occur. We develop a fast profile pseudo-likelihood estimation algorithm with provable convergence and establish that our estimator achieves strong consistency under weaker signal conditions than methods for the binary SBM that rely solely on edge connectivity.
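As a small illustration of the balance notion used here, a triangle is conventionally called balanced when the product of its three edge signs is positive, which covers both sayings quoted above. The sketch below counts balanced and unbalanced triangles in a signed adjacency matrix; it assumes entries in {-1, 0, +1} and is not the BSBM estimation procedure. The function name `triangle_balance_counts` is hypothetical.

```python
import itertools
import numpy as np

def triangle_balance_counts(A):
    """Count balanced and unbalanced triangles in a signed network.

    A is a symmetric matrix with entries +1 (friend), -1 (enemy), 0 (no edge).
    A triangle is balanced when the product of its edge signs is positive.
    """
    n = A.shape[0]
    balanced = unbalanced = 0
    for i, j, k in itertools.combinations(range(n), 3):
        sign_product = A[i, j] * A[j, k] * A[i, k]
        if sign_product > 0:
            balanced += 1
        elif sign_product < 0:
            unbalanced += 1
        # sign_product == 0 means at least one edge is missing: not a triangle
    return balanced, unbalanced

# Two communities {0, 1} and {2, 3}: friends within, enemies between.
A = -np.ones((4, 4), dtype=int)
A[:2, :2] = 1
A[2:, 2:] = 1
np.fill_diagonal(A, 0)
counts_clean = triangle_balance_counts(A)      # every triangle is balanced

# Flipping one within-community edge breaks the triangles that contain it.
A_noisy = A.copy()
A_noisy[0, 1] = A_noisy[1, 0] = -1
counts_noisy = triangle_balance_counts(A_noisy)
```

In the clean two-community example all four triangles are balanced, while flipping the sign of the edge between nodes 0 and 1 leaves two balanced and two unbalanced triangles. This is the kind of higher-order signal that edge connectivity alone does not capture.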
The third chapter develops a generative modeling framework for functional data, where each sample is observed over a continuum of time or space. Classical functional data analysis mainly relies on low-rank representations such as functional principal component analysis (FPCA) or spline bases, and focuses on developing discriminative models such as regression and classification. These approaches do not characterize the probability distribution of functional observations. To directly learn the distribution of functional data, we propose a generative model defined on a separable Hilbert space. The generator is formulated as a latent neural ordinary differential equation (ODE) that captures the temporal dynamics of functional data, combined with a decoder incorporating Fourier features and learned time embeddings for flexible function representation. The target distribution is estimated via a generalized energy-score loss, which is well-defined for arbitrary measures on separable Hilbert spaces without requiring the existence of Radon–Nikodym derivatives. Furthermore, we establish error bounds between the learned and true functional distributions.
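The following is a minimal sketch of a standard empirical energy score for curves sampled on a common grid, E||X - Y||^beta - (1/2) E||X - X'||^beta, with the L2 norm approximated by a Riemann sum. The abstract does not specify the generalized loss used in the dissertation, so the exact form, the exponent `beta`, the grid spacing `dt`, and the function names are assumptions made for illustration only.

```python
import numpy as np

def l2_distances(X, Y, dt):
    """Pairwise L2 distances between curves (rows), via a Riemann sum."""
    diff = X[:, None, :] - Y[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1) * dt)

def energy_score_loss(generated, observed, dt, beta=1.0):
    """Empirical energy score: smaller when generated curves match the data.

    Needs only norms of differences between curves, not densities.
    """
    cross = l2_distances(generated, observed, dt) ** beta
    within = l2_distances(generated, generated, dt) ** beta
    m = generated.shape[0]
    within_mean = within.sum() / (m * (m - 1))  # diagonal terms are zero
    return cross.mean() - 0.5 * within_mean

# Toy functional data: random-amplitude sine curves on [0, 1).
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 50, endpoint=False)
dt = t[1] - t[0]

def sample_curves(n):
    amplitudes = rng.normal(size=(n, 1))
    return amplitudes * np.sin(2 * np.pi * t)

observed = sample_curves(200)
loss_good = energy_score_loss(sample_curves(200), observed, dt)       # same law
loss_bad = energy_score_loss(sample_curves(200) + 2.0, observed, dt)  # shifted
```

In this toy setup the shifted generator receives a clearly larger loss than the generator that matches the data distribution. A training procedure would minimize such a loss over the generator's parameters.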