Presented By: Department of Statistics Dissertation Defenses
Estimating the (Un)seen: Sample-dependent Mass Estimation
Vinod Raman
We study mass estimation for distributions over countably infinite domains, where the objective is to estimate the probability mass of sample-dependent sets. Classical results such as missing mass estimation and its k-heavy-hitters generalizations fit into this framework, but little is known beyond these examples. We introduce a systematic study of mass estimation tasks defined by set-valued functions that map a finite sample to a subset of the domain, and identify general conditions under which simple estimators succeed. In particular, we show that the empirical-distribution-based estimator achieves vanishing error whenever the size of the image space of the set-valued function grows sublinearly with the sample size, and that the leave-one-out estimator works whenever the set-valued function satisfies a natural stability property. These results unify and extend prior analyses, yielding new guarantees for functionals such as neighboring mass, pierced sets, and structured combinations via unions and intersections. We conclude by broadening our scope to understand the landscape of estimatability. To that end, we give an example of a set-valued function that is not estimatable and leave open the question of finding matching necessary and sufficient conditions for such functions to be estimatable.
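As an illustration of the two estimators named in the abstract, the following minimal Python sketch applies them to the classical missing-mass task. It rests on assumptions not stated in the abstract: the function names, the `in_set(sample, x)` membership interface, and the exact estimator formulas are hypothetical choices for illustration and may differ from the definitions used in the dissertation. For the missing-mass set (all domain elements absent from the sample), the leave-one-out form assumed here reduces to the fraction of sample points that occur exactly once, which is the classical Good-Turing estimate, while the empirical-distribution form always returns zero.

```python
from typing import Callable, Hashable, List, Sequence

# Hypothetical interface: in_set(sample, x) answers "is x in S(sample)?",
# where S is a set-valued function mapping a finite sample to a subset of the domain.
Membership = Callable[[Sequence[Hashable], Hashable], bool]


def empirical_estimate(sample: Sequence[Hashable], in_set: Membership) -> float:
    """Assumed plug-in form: empirical mass of S(sample) under the sample itself."""
    n = len(sample)
    return sum(in_set(sample, x) for x in sample) / n


def leave_one_out_estimate(sample: Sequence[Hashable], in_set: Membership) -> float:
    """Assumed leave-one-out form: fraction of points x_i that land in S(sample without x_i)."""
    items: List[Hashable] = list(sample)
    n = len(items)
    hits = 0
    for i, x in enumerate(items):
        rest = items[:i] + items[i + 1:]  # sample with the i-th point held out
        hits += in_set(rest, x)
    return hits / n


def unseen(sample: Sequence[Hashable], x: Hashable) -> bool:
    """Missing-mass set: x belongs to S(sample) iff x does not appear in the sample."""
    return x not in set(sample)


# Toy sample: "c" and "d" each appear exactly once.
sample = ["a", "b", "a", "c", "d", "a", "b"]
plug_in = empirical_estimate(sample, unseen)
loo = leave_one_out_estimate(sample, unseen)
print(plug_in, loo)
```

On this toy sample the plug-in value is zero because every observed point is, by construction, outside the unseen set, whereas the leave-one-out value equals the number of singletons divided by the sample size. This is consistent with the abstract's distinction between the two estimators: the missing-mass set can take many different values across samples, so it is handled here through the leave-one-out form rather than the empirical one.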