Free Online Productivity Tools
i2Speak
i2Symbol
i2OCR
iTex2Img
iWeb2Print
iWeb2Shot
i2Type
iPdf2Split
iPdf2Merge
i2Bopomofo
i2Arabic
i2Style
i2Image
i2PDF
iLatex2Rtf
Sci2ools

ICML

2010

IEEE

2010

IEEE

We present the first temporal-difference learning algorithm for off-policy control with unrestricted linear function approximation whose per-time-step complexity is linear in the number of features. Our algorithm, Greedy-GQ, is an extension of recent work on gradient temporal-difference learning, which has hitherto been restricted to a prediction (policy evaluation) setting, to a control setting in which the target policy is greedy with respect to a linear approximation to the optimal action-value function. A limitation of our control setting is that we require the behavior policy to be stationary. We call this setting latent learning because the optimal policy, though learned, is not manifest in behavior. Popular off-policy algorithms such as Q-learning are known to be unstable in this setting when used with linear function approximation. In reinforcement learning, the term "off-policy learning" refers to learning about one way of behaving, called the target policy, from da...

Related Content

Added |
09 Nov 2010 |

Updated |
09 Nov 2010 |

Type |
Conference |

Year |
2010 |

Where |
ICML |

Authors |
Hamid Reza Maei, Csaba Szepesvári, Shalabh Bhatnagar, Richard S. Sutton |

Comments (0)