Policy Gradient Critics

12 years 2 months ago
Policy Gradient Critics
We present Policy Gradient Actor-Critic (PGAC), a new model-free Reinforcement Learning (RL) method for creating limited-memory stochastic policies for Partially Observable Markov Decision Processes (POMDPs) that require long-term memories of past observations and actions. The approach involves estimating a policy gradient for an Actor through a Policy Gradient Critic which evaluates probability distributions on actions. Gradient-based updates of history-conditional action probability distributions enable the algorithm to learn a mapping from memory states (or event histories) to probability distributions on actions, solving POMDPs through a combination of memory and stochasticity. This goes beyond previous approaches to learning purely reactive POMDP policies, without giving up their advantages. Preliminary results on important benchmark tasks show that our approach can in principle be used as a general purpose POMDP algorithm that solves RL problems in both continuous and discrete ac...
Daan Wierstra, Jürgen Schmidhuber
Added 07 Jun 2010
Updated 07 Jun 2010
Type Conference
Year 2007
Where ECML
Authors Daan Wierstra, Jürgen Schmidhuber
Comments (0)