Presented By: Financial/Actuarial Mathematics Seminar - Department of Mathematics
Convergence Analysis of Discrete Sampling in Continuous-Time Reinforcement Learning and High-Dimensional Numerical Integration
Du Ouyang, Tsinghua University
Stochastic policies (also known as relaxed controls) are widely used in continuous-time reinforcement learning (RL) algorithms. However, a critical disconnect remains between theory and practice. The theoretical aggregated dynamics, driven by averaged coefficients, provide a convenient basis for deriving RL algorithms but cannot be directly implemented. Physical execution requires the agent to sample concrete actions from the policy. Since continuously sampling independent actions poses significant mathematical and computational challenges, practical implementations must rely on discrete sampling. Yet, for general diffusion processes, the accuracy of such discretely sampled dynamics has lacked rigorous theoretical justification.
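As a schematic illustration (the notation below is chosen here for exposition and is not taken verbatim from the talk): given a stochastic policy $\pi(\cdot \mid x)$ over an action space $A$, the aggregated dynamics in the sense of Wang, Zariphopoulou, and Zhou (2020) take the form
\[
dX_t = \bar b(X_t)\,dt + \bar\sigma(X_t)\,dW_t, \qquad \bar b(x) = \int_A b(x,a)\,\pi(da \mid x), \qquad \bar\sigma\bar\sigma^{\top}(x) = \int_A \sigma\sigma^{\top}(x,a)\,\pi(da \mid x),
\]
whereas a physical agent can only execute concrete actions drawn from $\pi(\cdot \mid x)$, not the averaged coefficients themselves.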
In this talk, I will bridge this gap by introducing and rigorously analyzing a policy execution framework that samples actions from a stochastic policy at discrete time points and implements them as piecewise constant controls. We prove that as the sampling mesh size tends to zero, the controlled state process converges weakly to the dynamics with coefficients aggregated according to the stochastic policy. We explicitly quantify the convergence rate based on the regularity of the coefficients and establish an optimal first-order convergence rate for sufficiently regular coefficients. Additionally, we prove a 1/2-order weak convergence rate that holds uniformly over the sampling noise with high probability, and establish a 1/2-order pathwise convergence for each realization of the system noise in the absence of volatility control. Building on these results, we analyze the bias and variance of various policy evaluation and policy gradient estimators based on discrete-time observations. Our results provide theoretical justification for the exploratory stochastic control framework in [H. Wang, T. Zariphopoulou, and X.Y. Zhou, J. Mach. Learn. Res., 21 (2020), pp. 1-34].
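In the same illustrative notation, the execution framework can be sketched as follows: on a time grid $t_k = kh$, the agent samples $a_k \sim \pi(\cdot \mid X^h_{t_k})$ independently of the past and holds this action as a constant control on $[t_k, t_{k+1})$,
\[
dX^h_t = b(X^h_t, a_k)\,dt + \sigma(X^h_t, a_k)\,dW_t, \qquad t \in [t_k, t_{k+1}).
\]
The weak convergence results are then bounds of the form $\bigl|\mathbb{E}[f(X^h_T)] - \mathbb{E}[f(X_T)]\bigr| \le C\,h$ for sufficiently regular coefficients and test functions $f$, with rate $h^{1/2}$ in the uniform-over-sampling-noise and pathwise senses described above.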
Finally, I will also briefly discuss my research on Quasi-Monte Carlo sampling methods for efficient computation in high-dimensional numerical integration.
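As standard background for this part (not new material from the talk): Quasi-Monte Carlo methods replace i.i.d. samples with low-discrepancy point sets $\{u_1,\dots,u_n\} \subset [0,1]^d$, and the Koksma-Hlawka inequality bounds the integration error by
\[
\Bigl| \frac{1}{n}\sum_{i=1}^n f(u_i) - \int_{[0,1]^d} f(u)\,du \Bigr| \le V_{\mathrm{HK}}(f)\, D_n^*(u_1,\dots,u_n),
\]
where $V_{\mathrm{HK}}(f)$ is the Hardy-Krause variation of $f$ and $D_n^*$ is the star discrepancy of the point set, yielding error rates close to $O(n^{-1})$ (up to logarithmic factors in $n$) rather than the Monte Carlo rate $O(n^{-1/2})$.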