The Optimal Reward Baseline for Gradient-Based Reinforcement Learning

A number of reinforcement learning algorithms learn by climbing the gradient of expected reward. Their long-run convergence has been proved, even in partially observable environments with non-deterministic actions, and without the need for a system model. However, the variance of the gradient estimator has been found to be a significant practical problem. Recent approaches have discounted future rewards, introducing a bias-variance trade-off into the gradient estimate. We incorporate a reward baseline into the learning system and show that it affects variance without introducing further bias. In particular, as we approach the zero-bias, high-variance parameterization, the optimal (or variance-minimizing) constant reward baseline is equal to the long-term average expected reward. Modified policy-gradient algorithms are presented, and a number of experiments demonstrate their improvement over previous work.
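The following is a minimal sketch of the idea described in the abstract, not the authors' exact algorithm: a REINFORCE-style policy-gradient update on an assumed two-armed bandit, comparing a zero baseline with a constant baseline tracking the long-run average reward. Subtracting the baseline leaves the gradient estimate unbiased (since the expectation of grad log pi is zero) while reducing its variance. The arm rewards, learning rate, and step count are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    ARM_MEANS = [1.0, 1.5]  # assumed expected rewards of the two arms

    def softmax(theta):
        z = np.exp(theta - theta.max())
        return z / z.sum()

    def run(use_baseline, steps=5000, lr=0.05):
        theta = np.zeros(2)      # policy parameters (one logit per arm)
        baseline = 0.0           # running estimate of the average reward
        for t in range(1, steps + 1):
            probs = softmax(theta)
            a = rng.choice(2, p=probs)
            r = rng.normal(ARM_MEANS[a], 1.0)
            # grad log pi(a | theta) for a softmax policy: e_a - probs
            grad_log_pi = -probs
            grad_log_pi[a] += 1.0
            b = baseline if use_baseline else 0.0
            # unbiased gradient estimate: (r - b) * grad log pi(a)
            theta += lr * (r - b) * grad_log_pi
            baseline += (r - baseline) / t   # long-run average reward
        return softmax(theta)

    print("no baseline        :", run(False))
    print("avg-reward baseline:", run(True))

Both runs converge toward the better arm; with the average-reward baseline the per-step updates have lower variance, which is the effect the paper quantifies for the zero-bias parameterization.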
Lex Weaver, Nigel Tao
Type: Conference
Year: 2001
Where: UAI
Authors: Lex Weaver, Nigel Tao