Abstract: Policy gradient methods are among the most effective approaches to challenging reinforcement learning problems with large state and/or action spaces. However, little is known about even their most basic theoretical convergence properties, including: 1) whether and how fast they converge to a globally optimal solution (say, with a sufficiently rich policy class); 2) how they cope with approximation error due to using a restricted class of parametric policies; and 3) their finite-sample behavior. In this talk, we will study all of these issues and provide a broad understanding of when first-order approaches to direct policy optimization in RL succeed. We will also identify the relevant notions of policy class expressivity underlying these guarantees in the approximate setting. Throughout, we will highlight the interplay of exploration with policy optimization, both in our upper bounds and in illustrative lower bounds. This talk is based on joint work with Sham Kakade, Jason Lee, and Gaurav Mahajan. Please see https://arxiv.org/pdf/1908.00261 for details.
Bio: Alekh Agarwal is a Principal Research Manager at Microsoft Research, where he has been since 2012. His research broadly focuses on designing theoretically sound and practically useful techniques for sequential decision-making problems. Within this context, he has worked on areas such as bandits, active learning and, most recently, reinforcement learning. He has also worked on several other aspects of machine learning, including convex optimization, high-dimensional statistics, and large-scale machine learning in distributed settings.