Presented By: Department of Statistics
Statistics Department Seminar Series: Kihyuk Hong, PhD Candidate, Department of Statistics, University of Michigan
"Learning Infinite-horizon Average-reward MDPs with Linear Function Approximation"
Abstract: The infinite-horizon average-reward Markov decision process (MDP) provides a natural framework for sequential decision-making under uncertainty in settings where an agent interacts with the environment continuously, such as inventory management and network routing systems. Unlike episodic settings, where the environment is periodically reset, continuous interaction without resets introduces unique challenges. For example, it necessitates assumptions on the underlying MDP to avoid pathological scenarios where the agent becomes trapped in an unfavorable state with no path to recovery. Additionally, the average-reward optimality criterion complicates algorithm design, as the corresponding Bellman operator is not a contraction, preventing the straightforward application of optimistic value iteration algorithms.
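To make the contraction point concrete, recall the standard discounted Bellman optimality operator and the average-reward optimality equation (textbook background, included here only as an illustration):

\[
(T_\gamma V)(s) = \max_{a} \Big[ r(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, V(s') \Big],
\qquad
\lVert T_\gamma V - T_\gamma V' \rVert_\infty \le \gamma\, \lVert V - V' \rVert_\infty,
\]

\[
\rho^* + h^*(s) = \max_{a} \Big[ r(s,a) + \sum_{s'} P(s' \mid s,a)\, h^*(s') \Big].
\]

The discounted operator T_\gamma is a \gamma-contraction in the sup norm, whereas the average-reward equation corresponds to \gamma = 1, where no such contraction is available.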
We address reinforcement learning in infinite-horizon average-reward settings, focusing on MDPs where both the reward function and the transition probability kernel are linear in a feature representation of state-action pairs. Existing approaches in this setting either face computational inefficiencies or rely on strong assumptions, such as ergodicity, to achieve order-optimal regret bounds. In this talk, we introduce a computationally efficient algorithm that attains an order-optimal regret bound under a mild assumption on the underlying MDP. The algorithm learns a discounted-reward MDP as a surrogate for the average-reward problem. Leveraging the contraction property of the associated Bellman operator for the surrogate problem, we design an optimistic value iteration algorithm and employ a value function clipping technique to improve statistical efficiency. We show that appropriately tuning the discount factor for the surrogate problem achieves an order-optimal regret bound for the original average-reward problem.
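As an informal sketch of the surrogate idea, the snippet below runs optimistic value iteration on a learned discounted model with least-squares estimation and value clipping. It is not the speaker's algorithm; the synthetic linear MDP, the bonus scale beta, the ridge parameter lam, and the discount gamma are all assumptions chosen for illustration.

# Minimal, illustrative sketch (not the talk's exact algorithm): optimistic
# discounted value iteration with least-squares estimation and value clipping
# on a small synthetic linear MDP.  All constants below are assumptions.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear MDP: P(s'|s,a) = phi(s,a)^T mu(s'), r(s,a) = phi(s,a)^T theta
n_states, n_actions, d = 5, 3, 4
phi = rng.dirichlet(np.ones(d), size=(n_states, n_actions))   # features, rows sum to 1
mu = rng.dirichlet(np.ones(n_states), size=d)                 # d x S, rows are distributions
theta = rng.uniform(0, 1, size=d)
P = phi @ mu                                                  # S x A x S transition kernel
R = phi @ theta                                               # S x A rewards in [0, 1]

gamma = 0.99                 # discount factor of the surrogate discounted MDP (assumed)
V_max = 1.0 / (1.0 - gamma)  # clipping threshold for value estimates
lam, beta = 1.0, 1.0         # ridge parameter and optimism bonus scale (assumed)

# Collect transitions with a uniformly random behaviour policy.
T = 5000
s = 0
data = []                    # (s, a, r, s') tuples
for _ in range(T):
    a = rng.integers(n_actions)
    s_next = rng.choice(n_states, p=P[s, a])
    data.append((s, a, R[s, a], s_next))
    s = s_next

Phi = np.array([phi[si, ai] for si, ai, _, _ in data])        # T x d design matrix
Lam = lam * np.eye(d) + Phi.T @ Phi                           # regularized Gram matrix
Lam_inv = np.linalg.inv(Lam)

# Optimistic value iteration on the learned discounted surrogate.
V = np.zeros(n_states)
for _ in range(200):
    # Least-squares estimate of w with phi(s,a)^T w ~ E[V(s') | s, a]
    targets = np.array([V[s_next] for _, _, _, s_next in data])
    w = Lam_inv @ (Phi.T @ targets)
    # Exploration bonus: beta * ||phi(s,a)||_{Lam^{-1}}
    bonus = beta * np.sqrt(np.einsum('sad,de,sae->sa', phi, Lam_inv, phi))
    Q = R + gamma * (phi @ w + bonus)
    V = np.clip(Q.max(axis=1), 0.0, V_max)                    # value clipping step

print("optimistic value estimates:", np.round(V, 2))
print("greedy policy:", Q.argmax(axis=1))

In this sketch, clipping the values to [0, 1/(1 - gamma)] keeps the optimistic estimates bounded, and the bonus term shrinks as the Gram matrix accumulates coverage of the feature directions; taking gamma close to 1 trades the surrogate's bias against the 1/(1 - gamma) scale of its value function.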
https://kihyukh.github.io/