Presented By: Department of Statistics Dissertation Defenses
Topics in Modern Machine Learning: Sequential Decision Making, High-Dimensional Statistics and Differential Privacy
Gang Qiao
Modern machine learning is increasingly applied to make reliable decisions from limited, noisy, high-dimensional, or privacy-constrained data. This dissertation studies several mathematical problems that arise from this demand, with an emphasis on sequential decision making, sparse high-dimensional inference, differential privacy, and global testing.
The first part of the dissertation concerns stochastic convex hull membership, a pure-exploration problem in which one sequentially samples from a finite collection of distributions in order to decide whether a target point belongs to the convex hull of their unknown means. We first give a complete solution in one dimension, deriving the information-theoretic characteristic time and developing Thompson-CHM, an asymptotically optimal sampling algorithm whose allocation matches the lower bound. We then further extend the whole theory to the higher-dimensional Gaussian setting, where Euclidean geometry makes the least favorable alternatives explicit.
The second part develops differentially private procedures for the Dantzig selector in high-dimensional linear regression. We start by proposing a private sparse-regression method based on a noisy iterative hard-thresholding oracle tailored to the Dantzig selector's score geometry. The algorithm preserves sparsity by construction and satisfies privacy, parameter-error, and population excess-risk guarantees, with the main error rate matching the known differentially private minimax benchmark up to logarithmic factors. We then introduce a complementary active-set method that privatizes the Dantzig score more directly: it privately identifies violated score coordinates, refits on a restricted support, and prunes to exact sparsity. This second approach is closer to the defining feasibility constraint of the Dantzig selector and provides an alternative route to private sparse estimation under stronger sparse-design conditions.
The final part studies sparse-signal detection in high-dimensional regression by combining two classical ideas: knockoffs and higher criticism. Knockoffs provide dependence-adaptive negative controls, while higher criticism is designed to detect rare and weak alternatives near the sharp sparse-mixture boundary. We introduce a multi-knockoff higher-criticism statistic based on Lasso entry times in an augmented design containing multiple knockoff copies per feature. In the orthogonal-design regime, the proposed statistic attains the classical higher-criticism detection boundary for sparse alternatives against the global null. This result suggests a new way to use knockoff constructions beyond false discovery rate control: as a mechanism for calibrating global tests in high-dimensional models with structured dependence.
The first part of the dissertation concerns stochastic convex hull membership, a pure-exploration problem in which one sequentially samples from a finite collection of distributions in order to decide whether a target point belongs to the convex hull of their unknown means. We first give a complete solution in one dimension, deriving the information-theoretic characteristic time and developing Thompson-CHM, an asymptotically optimal sampling algorithm whose allocation matches the lower bound. We then further extend the whole theory to the higher-dimensional Gaussian setting, where Euclidean geometry makes the least favorable alternatives explicit.
The second part develops differentially private procedures for the Dantzig selector in high-dimensional linear regression. We start by proposing a private sparse-regression method based on a noisy iterative hard-thresholding oracle tailored to the Dantzig selector's score geometry. The algorithm preserves sparsity by construction and satisfies privacy, parameter-error, and population excess-risk guarantees, with the main error rate matching the known differentially private minimax benchmark up to logarithmic factors. We then introduce a complementary active-set method that privatizes the Dantzig score more directly: it privately identifies violated score coordinates, refits on a restricted support, and prunes to exact sparsity. This second approach is closer to the defining feasibility constraint of the Dantzig selector and provides an alternative route to private sparse estimation under stronger sparse-design conditions.
The final part studies sparse-signal detection in high-dimensional regression by combining two classical ideas: knockoffs and higher criticism. Knockoffs provide dependence-adaptive negative controls, while higher criticism is designed to detect rare and weak alternatives near the sharp sparse-mixture boundary. We introduce a multi-knockoff higher-criticism statistic based on Lasso entry times in an augmented design containing multiple knockoff copies per feature. In the orthogonal-design regime, the proposed statistic attains the classical higher-criticism detection boundary for sparse alternatives against the global null. This result suggests a new way to use knockoff constructions beyond false discovery rate control: as a mechanism for calibrating global tests in high-dimensional models with structured dependence.