Presented By: Dissertation Defense - Department of Mathematics
Sequential Decision Making with Offline Data and under Partial Observability
Chinmaya Kausik
Thomas T on Unsplash
What Makes Partially Observable Decision Making Tractable?
Abstract:
Real-world sequential decision-making problems - such as those in healthcare, recommendation, and language model alignment - are complicated by latent variables that influence the data but are never directly observed. The most general framework for such settings, the partially observable MDP, is statistically intractable. What structure makes learning tractable despite the latent variables?
This work identifies a common structural pattern across four distinct settings: the latent variable's influence is confined to one part of the data-generating process, and within that part it acts through a low-dimensional or low-complexity channel. We study this pattern in mixtures of MDPs, confounded offline policy evaluation, RLHF with partially observed reward states, and linear latent contextual bandits. In each case, we establish impossibility results showing what fails without the right assumptions, then develop algorithms that exploit the structure through spectral subspace recovery, decoupled estimation, and optimism calibrated to the latent channel's complexity. The resulting regret bounds, sample complexity bounds, and structural characterizations scale with the dimension of the latent channel rather than the ambient problem, and are matched by minimax lower bounds in key settings. We validate our methods on both synthetic and real-world data.
Abstract:
Real-world sequential decision-making problems - such as those in healthcare, recommendation, and language model alignment - are complicated by latent variables that influence the data but are never directly observed. The most general framework for such settings, the partially observable MDP, is statistically intractable. What structure makes learning tractable despite the latent variables?
This work identifies a common structural pattern across four distinct settings: the latent variable's influence is confined to one part of the data-generating process, and within that part it acts through a low-dimensional or low-complexity channel. We study this pattern in mixtures of MDPs, confounded offline policy evaluation, RLHF with partially observed reward states, and linear latent contextual bandits. In each case, we establish impossibility results showing what fails without the right assumptions, then develop algorithms that exploit the structure through spectral subspace recovery, decoupled estimation, and optimism calibrated to the latent channel's complexity. The resulting regret bounds, sample complexity bounds, and structural characterizations scale with the dimension of the latent channel rather than the ambient problem, and are matched by minimax lower bounds in key settings. We validate our methods on both synthetic and real-world data.
Thomas T on Unsplash