Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning

Policy gradient methods for reinforcement learning avoid some of the undesirable properties of the value function approaches, such as policy degradation (Baxter and Bartlett, 2001). However, the variance of the performance gradient estimates obtained from the simulation is sometimes excessive. In this paper, we consider variance reduction methods that were developed for Monte Carlo estimates of integrals. We study two commonly used policy gradient techniques, the baseline and actor-critic methods, from this perspective. Both can be interpreted as additive control variate variance reduction methods. We consider the expected average reward performance measure, and we focus on the GPOMDP algorithm for estimating performance gradients in partially observable Markov decision processes controlled by stochastic reactive policies. We give bounds for the estimation error of the gradient estimates for both baseline and actor-critic algorithms, in terms of the sample size and mixing properties o...
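As a rough illustration of the additive control variate idea described in the abstract (and not the paper's GPOMDP setting or its analysis), the sketch below compares score-function (REINFORCE-style) gradient samples for a two-armed bandit with and without a constant baseline subtracted from the reward. The softmax policy, reward means, noise level, and baseline value are all hypothetical choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

theta = 0.3                       # single softmax policy parameter (hypothetical)
arm_means = np.array([1.0, 3.0])  # hypothetical expected reward of each arm

def policy_probs(theta):
    """Softmax policy over two arms with preferences (+theta, -theta)."""
    prefs = np.array([theta, -theta])
    e = np.exp(prefs - prefs.max())
    return e / e.sum()

def gradient_estimates(baseline, n_samples=50_000):
    """Score-function gradient samples with a baseline subtracted from the
    reward, i.e. the baseline acts as an additive control variate."""
    p = policy_probs(theta)
    actions = rng.choice(2, size=n_samples, p=p)
    rewards = arm_means[actions] + rng.normal(0.0, 1.0, size=n_samples)
    # d/dtheta log pi(a): +1 for arm 0, -1 for arm 1, minus the mean preference gradient
    grad_log_pi = np.where(actions == 0, 1.0, -1.0) - (p[0] - p[1])
    return grad_log_pi * (rewards - baseline)

plain = gradient_estimates(baseline=0.0)
with_baseline = gradient_estimates(baseline=arm_means.mean())

print(f"mean estimate, no baseline:   {plain.mean():+.4f}  (var {plain.var():.3f})")
print(f"mean estimate, with baseline: {with_baseline.mean():+.4f}  (var {with_baseline.var():.3f})")
```

Because the expected score is zero, subtracting the baseline leaves the mean of the gradient estimate essentially unchanged while reducing its variance, which is the effect the paper's bounds quantify for the baseline and actor-critic variants of GPOMDP.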
Evan Greensmith, Peter L. Bartlett, Jonathan Baxter
Type: Conference
Year: 2001
Where: NIPS
Authors: Evan Greensmith, Peter L. Bartlett, Jonathan Baxter