Sciweavers

ICML
2008
IEEE

Sample-based learning and search with permanent and transient memories

14 years 5 months ago
Sample-based learning and search with permanent and transient memories
We present a reinforcement learning architecture, Dyna-2, that encompasses both samplebased learning and sample-based search, and that generalises across states during both learning and search. We apply Dyna-2 to high performance Computer Go. In this domain the most successful planning methods are based on sample-based search algorithms, such as UCT, in which states are treated individually, and the most successful learning methods are based on temporal-difference learning algorithms, such as Sarsa, in which linear function approximation is used. In both cases, an estimate of the value function is formed, but in the first case it is transient, computed and then discarded after each move, whereas in the second case it is more permanent, slowly accumulating over many moves and games. The idea of Dyna-2 is for the transient planning memory and the permanent learning memory to remain separate, but for both to be based on linear function approximation and both to be updated by Sarsa. To ap...
David Silver, Martin Müller 0003, Richard S.
Added 17 Nov 2009
Updated 17 Nov 2009
Type Conference
Year 2008
Where ICML
Authors David Silver, Martin Müller 0003, Richard S. Sutton
Comments (0)