Research Post

Incremental Policy Gradients for Online Reinforcement Learning Control

Abstract

Policy gradient methods are built on the policy gradient theorem, which involves a term representing the complete sum of rewards into the future: the return. Due to this, one usually either waits until the end of an episode before performing updates, or learns an estimate of this return--a so-called critic. Our emphasis is on the first approach in this work, detailing an incremental policy gradient update which neither waits until the end of the episode, nor relies on learning estimates of the return. We provide on-policy and off-policy variants of our algorithm, for both the discounted return and average reward settings. Theoretically, we draw a connection between the traces our methods use and the stationary distributions of the discounted and average reward settings. We conclude with an experimental evaluation of our methods on both simple-to-understand and complex domains.

Latest Research Papers

Connect with the community

Get involved in Alberta's growing AI ecosystem! Speaker, sponsorship, and letter of support requests welcome.

Explore training and advanced education

Curious about study options under one of our researchers? Want more information on training opportunities?

Harness the potential of artificial intelligence

Let us know about your goals and challenges for AI adoption in your business. Our Investments & Partnerships team will be in touch shortly!