Presented By: Department of Statistics Dissertation Defenses
Contributions to Distributed Learning and Selective Inference
Yumeng Wang
The rapid growth and distributed nature of modern datasets pose significant challenges for statistical learning and inference. Privacy concerns often prohibit direct data sharing across sites, while distributional heterogeneity complicates accurate modeling and inference. Moreover, high-dimensional data and model selection procedures necessitate statistical methods for valid post-selection inference. This dissertation addresses these challenges by developing methodologies in distributed statistical learning, inference with heterogeneous data, and selective inference.
In the first part of the dissertation, we propose a novel one-shot distributed learning algorithm via refitting bootstrap samples. We demonstrate that the proposed estimator achieves full-sample statistical rates with only one round of communication of subsample-based statistics in generalized linear models and noisy phase retrieval. We further extend this approach to an iterative algorithm and apply it to convolutional neural networks (CNNs); in simulation studies, the iterative algorithm outperforms existing methods.
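To make the one-shot recipe concrete, the following is a minimal sketch (not the dissertation's exact algorithm) for logistic regression, one of the generalized linear models covered: each site fits a local model and communicates only its estimate; the server regenerates parametric-bootstrap data from those local fits (with covariates redrawn from a reference design here, an assumption made purely for illustration) and refits once on the pooled bootstrap sample.

```python
# Sketch of one-shot distributed logistic regression via refitting
# bootstrap samples; simplified for illustration only.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
p, n_per_site, K = 5, 2000, 10
beta_true = rng.normal(size=p)

def simulate(beta, n):
    """Draw (X, y) from a logistic model with coefficient vector beta."""
    X = rng.normal(size=(n, p))
    prob = 1.0 / (1.0 + np.exp(-X @ beta))
    return X, rng.binomial(1, prob)

# Step 1: each site fits locally and communicates only its estimate.
local_betas = []
for _ in range(K):
    X_k, y_k = simulate(beta_true, n_per_site)
    fit_k = LogisticRegression(C=1e6, fit_intercept=False)  # large C ~ unpenalized
    fit_k.fit(X_k, y_k)
    local_betas.append(fit_k.coef_.ravel())

# Step 2: the server draws a parametric-bootstrap sample from each local fit
# (covariates redrawn from a reference design, an assumption) and refits once
# on the pooled bootstrap data.
Xb_all, yb_all = [], []
for beta_k in local_betas:
    Xb, yb = simulate(beta_k, n_per_site)
    Xb_all.append(Xb)
    yb_all.append(yb)
refit = LogisticRegression(C=1e6, fit_intercept=False)
refit.fit(np.vstack(Xb_all), np.concatenate(yb_all))
print("refit estimate:", refit.coef_.ravel())
print("truth:         ", beta_true)
```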
In the second part of the dissertation, we develop a novel one-shot distributed learning algorithm to address cross-site heterogeneity. The proposed method effectively accommodates heterogeneity by allowing nuisance parameters to vary across sites. We show that the proposed estimator attains the full-sample statistical error rate and efficiency with only a single round of communication of local estimators. Our simulation studies support these theoretical findings.
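As an illustration of the one-shot heterogeneous setting, the sketch below (a simplification, not the proposed estimator) treats a site-specific intercept as the nuisance parameter and a shared slope vector as the target: each site fits its full local model, transmits only the slope block of its estimate together with its covariance, and the server combines the slopes by inverse-variance weighting in a single round of communication.

```python
# Sketch of one-shot combination with site-specific nuisance intercepts
# and a shared slope vector; illustrative only.
import numpy as np

rng = np.random.default_rng(1)
p, n_per_site, K = 3, 500, 8
beta_true = np.array([1.0, -0.5, 2.0])
site_intercepts = rng.normal(scale=2.0, size=K)   # heterogeneous nuisance parameters

slope_ests, slope_precisions = [], []
for k in range(K):
    X = rng.normal(size=(n_per_site, p))
    y = site_intercepts[k] + X @ beta_true + rng.normal(size=n_per_site)
    Z = np.column_stack([np.ones(n_per_site), X])      # local design: intercept + slopes
    theta_hat = np.linalg.solve(Z.T @ Z, Z.T @ y)      # local OLS fit
    resid = y - Z @ theta_hat
    sigma2 = resid @ resid / (n_per_site - p - 1)
    cov = sigma2 * np.linalg.inv(Z.T @ Z)
    slope_ests.append(theta_hat[1:])                   # communicate slope block only
    slope_precisions.append(np.linalg.inv(cov[1:, 1:]))

# Server: single round of communication, inverse-variance-weighted combination.
P_sum = sum(slope_precisions)
combined = np.linalg.solve(P_sum, sum(P @ b for P, b in zip(slope_precisions, slope_ests)))
print("combined slope estimate:", combined)
print("truth:                  ", beta_true)
```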
In the third part of the dissertation, we introduce an asymptotic pivot for inference on the effects of selected variables on conditional quantile functions. Utilizing estimators from smoothed quantile regression, our proposed pivot is easy to compute and yields asymptotically exact selective inference without strict distributional assumptions on the response variable. By employing external randomization, our approach fully utilizes the data for both selection and inference, outperforming traditional methods such as data splitting by consistently delivering shorter and more reliable confidence intervals. Simulation studies and an empirical application analyzing risk factors for low birth weight validate the practical efficacy of our method.
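The pivot builds on smoothed quantile regression; the sketch below implements only convolution-smoothed quantile regression with a Gaussian kernel, omitting the selective-inference pivot, the external randomization, and data-driven bandwidth selection (the bandwidth h is an illustrative choice).

```python
# Sketch of convolution-smoothed (Gaussian-kernel) quantile regression;
# the selective-inference machinery is omitted.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def smoothed_qr(X, y, tau=0.5, h=0.5):
    """Minimize the Gaussian-kernel-smoothed check loss at quantile level tau."""
    def loss_and_grad(beta):
        u = y - X @ beta
        z = u / h
        # closed-form smoothed check loss and its gradient (Gaussian kernel)
        loss = np.mean(u * (tau - 1.0 + norm.cdf(z)) + h * norm.pdf(z))
        grad = -X.T @ (tau - 1.0 + norm.cdf(z)) / len(y)
        return loss, grad
    beta0 = np.linalg.lstsq(X, y, rcond=None)[0]       # least-squares warm start
    res = minimize(loss_and_grad, beta0, jac=True, method="L-BFGS-B")
    return res.x

rng = np.random.default_rng(2)
n, p = 1000, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([0.5, 1.0, -1.0, 2.0])
y = X @ beta_true + rng.standard_t(df=3, size=n)       # heavy-tailed errors
print("median regression fit:", smoothed_qr(X, y, tau=0.5))
```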