Presented By: Department of Statistics Dissertation Defenses
Theoretical Advances in Reinforcement Learning: Online Average-Reward and Offline Constrained Settings
Kihyuk Hong
This dissertation presents theoretical advances in reinforcement learning (RL), focusing on two key settings: online RL under the infinite-horizon average-reward criterion and offline constrained RL under partial data coverage.
The first part addresses the online setting, where the objective is to optimize long-run average rewards through interaction with the environment. A family of value-iteration-based algorithms is proposed by approximating the average-reward objective using a carefully tuned discounted surrogate. This part resolves an open problem by establishing a computationally efficient algorithm for linear Markov decision processes under a weak structural assumption. The proposed algorithm employs span-constrained value clipping and a decoupled planning strategy that mitigates statistical inefficiencies arising from the complexity of the function class.
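As a rough illustration of the discounted-surrogate idea (the notation below is generic and not taken from the dissertation): in a weakly communicating MDP with optimal gain J^* and optimal bias h^*, the optimal discounted value V_\gamma^* is known to stay within the bias span of the rescaled gain, and span-constrained clipping keeps value iterates consistent with an assumed span bound H:

\[
  \bigl|\, V_\gamma^*(s) - \tfrac{J^*}{1-\gamma} \,\bigr| \;\le\; \mathrm{sp}(h^*) \quad \text{for all } s,
  \qquad
  \mathrm{clip}_H(V)(s) \;=\; \min\Bigl\{ V(s),\; \min_{s'} V(s') + H \Bigr\}.
\]

Choosing \gamma so that (1-\gamma)\,\mathrm{sp}(h^*) is small thus makes the discounted problem a faithful proxy for the average-reward objective; this standard relationship is given here only as context for the surrogate and clipping steps mentioned above.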
The second part is motivated by safety-critical applications and studies the offline setting, where the agent must learn a policy from a fixed dataset without further interaction. Primal-dual algorithms are developed for both linear MDPs and general function-approximation regimes, based on the linear programming formulation of RL. These methods are oracle-efficient and provably sample-efficient under partial data coverage. Moreover, they extend to the constrained RL setting, where the policy must satisfy additional safety constraints defined by auxiliary reward signals and threshold levels.
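To make the linear-programming viewpoint concrete, a minimal sketch in generic notation (again, not the dissertation's own) writes the discounted problem as a linear program over occupancy measures \mu, with a safety signal u and threshold \tau entering as one additional linear constraint:

\[
  \max_{\mu \ge 0} \ \sum_{s,a} \mu(s,a)\, r(s,a)
  \quad \text{s.t.} \quad
  \sum_{a} \mu(s,a) = (1-\gamma)\,\nu_0(s) + \gamma \sum_{s',a'} P(s \mid s',a')\, \mu(s',a') \ \ \forall s,
  \qquad
  \sum_{s,a} \mu(s,a)\, u(s,a) \;\ge\; \tau .
\]

A primal-dual method then runs first-order updates on the Lagrangian of this program; partial-coverage assumptions of the kind referenced above typically require the dataset to cover only the occupancy measure of a single comparator policy rather than that of all policies.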
Together, the contributions of this dissertation advance the theoretical foundations of reinforcement learning in settings that prioritize safety and long-term performance.