Sciweavers

7 search results - page 2 / 2
» Exploiting Multiple Secondary Reinforcers in Policy Gradient...
Sort
View
ICML
2010
IEEE
13 years 5 months ago
Toward Off-Policy Learning Control with Function Approximation
We present the first temporal-difference learning algorithm for off-policy control with unrestricted linear function approximation whose per-time-step complexity is linear in the ...
Hamid Reza Maei, Csaba Szepesvári, Shalabh ...
COLT
2010
Springer
13 years 2 months ago
An Asymptotically Optimal Bandit Algorithm for Bounded Support Models
Multiarmed bandit problem is a typical example of a dilemma between exploration and exploitation in reinforcement learning. This problem is expressed as a model of a gambler playi...
Junya Honda, Akimichi Takemura