Sciweavers

Share
ICML
2007
IEEE

Combining online and offline knowledge in UCT

10 years 6 days ago
Combining online and offline knowledge in UCT
The UCT algorithm learns a value function online using sample-based search. The TD() algorithm can learn a value function offline for the on-policy distribution. We consider three approaches for combining offline and online value functions in the UCT algorithm. First, the offline value function is used as a default policy during Monte-Carlo simulation. Second, the UCT value function is combined with a rapid online estimate of action values. Third, the offline value function is used as prior knowledge in the UCT search tree. We evaluate these algorithms in 9 ? 9 Go against GnuGo 3.7.10. The first algorithm performs better than UCT with a random simulation policy, but surprisingly, worse than UCT with a weaker, handcrafted simulation policy. The second algorithm outperforms UCT altogether. The third algorithm outperforms UCT with handcrafted prior knowledge. We combine these algorithms in MoGo, the world's strongest 9 ? 9 Go program. Each technique significantly improves MoGo'...
Sylvain Gelly, David Silver
Added 17 Nov 2009
Updated 17 Nov 2009
Type Conference
Year 2007
Where ICML
Authors Sylvain Gelly, David Silver
Comments (0)
books