Off-Policy Temporal Difference Learning with Function Approximation

We introduce the first algorithm for off-policy temporal-difference learning that is stable with linear function approximation. Off-policy learning is of interest because it forms the basis for popular reinforcement learning methods such as Q-learning, which has been known to diverge with linear function approximation, and because it is critical to the practical utility of multi-scale, multi-goal learning frameworks such as options, HAMs, and MAXQ. Our new algorithm combines TD(λ) over state-action pairs with importance sampling ideas from our previous work. We prove that, given training under any ε-soft policy, the algorithm converges w.p.1 to a close approximation (as in Tsitsiklis and Van Roy, 1997; Tadic, 2001) to the action-value function for an arbitrary target policy. Variations of the algorithm designed to reduce variance introduce additional bias but are also guaranteed convergent. We also illustrate our method empirically on a small policy evaluation problem. Our current resu...
Doina Precup, Richard S. Sutton, Sanjoy Dasgupta
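The abstract describes combining TD(λ) over state-action pairs with importance-sampling corrections under an ε-soft behavior policy. The fragment below is a minimal illustrative sketch of one per-decision importance-sampling TD(λ) update with linear function approximation; the function name, the placement of the ratio rho in the trace, and the exact update form are assumptions for illustration, not the paper's precise algorithm.

import numpy as np

def off_policy_td_lambda_step(w, e, phi, phi_next, reward, rho, gamma, lam, alpha):
    """Sketched update of linear weights w and eligibility trace e.

    phi      -- feature vector of the current state-action pair
    phi_next -- feature vector of the next state-action pair
    rho      -- importance-sampling ratio pi(a|s) / b(a|s) for the action taken
                by the behavior policy b (assumed epsilon-soft, so b(a|s) > 0)
    """
    # TD error for the action-value estimate Q(s, a) = w . phi(s, a)
    delta = reward + gamma * np.dot(w, phi_next) - np.dot(w, phi)
    # decay the trace, add the current features, and reweight by rho
    e = rho * (gamma * lam * e + phi)
    # semi-gradient update of the linear weights
    w = w + alpha * delta * e
    return w, e

Iterating an update of this general form along trajectories generated by the behavior policy yields an off-policy estimate of the target policy's action-value function.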
Added: 17 Nov 2009
Updated: 17 Nov 2009
Type: Conference
Year: 2001
Where: ICML
Authors: Doina Precup, Richard S. Sutton, Sanjoy Dasgupta