We model reinforcement learning as the problem of learning to control a Partially Observable Markov Decision Process (POMDP), and focus on gradient ascent approaches to this problem. In [3] we introduced GPOMDP, an algorithm for estimating the performance gradient of a POMDP from a single sample path, and we proved that this algorithm almost surely converges to an approximation to the gradient. In this paper, we provide a convergence rate for the estimates produced by GPOMDP, and give an improved bound on the approximation error of these estimates. Both of these bounds are in terms of mixing times of the POMDP.
Peter L. Bartlett, Jonathan Baxter
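As a concrete illustration of the kind of algorithm the abstract refers to, the following is a minimal sketch of a GPOMDP-style estimator: a discounted eligibility trace of policy log-likelihood gradients, averaged against observed rewards along a single sample path. The toy two-state POMDP, the logistic policy, and all names here (`gpomdp`, `sample_action`, `grad_logp`, `step`) are illustrative assumptions, not constructions from the paper; the discount parameter `beta` controls the bias/variance trade-off that the paper's bounds quantify in terms of mixing times.

```python
# Illustrative sketch of a GPOMDP-style gradient estimator; the toy
# environment and policy below are assumptions for demonstration only.
import numpy as np

def gpomdp(sample_action, grad_logp, step, obs0, theta, T, beta=0.9, seed=0):
    """Estimate the discounted performance gradient from one sample path.

    beta in [0, 1): larger beta reduces the approximation error of the
    estimated gradient but increases the variance of the estimate.
    """
    rng = np.random.default_rng(seed)
    z = np.zeros_like(theta)       # eligibility trace z_t
    delta = np.zeros_like(theta)   # running gradient estimate Delta_t
    obs = obs0
    for t in range(T):
        action = sample_action(rng, theta, obs)       # u_t ~ mu(.|theta, y_t)
        z = beta * z + grad_logp(theta, obs, action)  # trace update z_{t+1}
        obs, reward = step(rng, obs, action)          # x_{t+1}, r(x_{t+1})
        delta += (reward * z - delta) / (t + 1)       # incremental average
    return delta

# Toy two-state chain: observation = state, reward = state, and the
# chosen action moves the chain to that state with probability 0.9.
def sample_action(rng, theta, obs):
    p1 = 1.0 / (1.0 + np.exp(-theta[obs]))           # logistic policy
    return int(rng.random() < p1)

def grad_logp(theta, obs, action):
    p1 = 1.0 / (1.0 + np.exp(-theta[obs]))
    g = np.zeros_like(theta)
    g[obs] = action - p1                             # grad of log mu
    return g

def step(rng, obs, action):
    state = action if rng.random() < 0.9 else 1 - action
    return state, float(state)                       # reward r(x) = x

print(gpomdp(sample_action, grad_logp, step, obs0=0,
             theta=np.zeros(2), T=100_000, beta=0.9))
```

In this toy run both components of the estimate come out positive, pointing the policy toward the rewarding state; shrinking `beta` shortens the trace's effective memory, which lowers variance at the cost of the bias the paper's approximation-error bound controls.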