In reinforcement learning, direct policy optimization, and in particular policy gradient methods, have proven effective for solving complex control problems. However, these methods are highly sensitive to how the policy's stochasticity evolves during learning: sufficient exploration must be maintained to avoid premature convergence toward a deterministic or low-entropy policy. This thesis studies this issue, with a focus on policy parameterization choices and on reward-shaping methods based on intrinsic exploration bonuses.
First, we analyze the influence of policy stochasticity on the optimization process. We formulate direct policy optimization within the optimization-by-continuation framework, which consists in optimizing a sequence of surrogate objectives called continuations. We show that optimizing the expected return of an affine Gaussian policy that is kept sufficiently stochastic, either by manually controlling its variance or by regularizing its entropy, corresponds to optimizing a continuation of the expected return of an underlying deterministic policy. This continuation is the expected return filtered to remove local extrema. We thereby argue that policy gradient algorithms enforcing exploration can be understood as methods for optimizing policies by continuation, and that the policy variance should be a history-dependent function adapted to avoid local optima.
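As a rough illustration of this view (a minimal sketch, not taken from the thesis: the one-dimensional multimodal objective, the variance schedule, and all names are illustrative assumptions), Gaussian smoothing of an objective plays the role of the continuation, and gradient ascent is performed while the smoothing variance is gradually decreased:

    import numpy as np

    rng = np.random.default_rng(0)

    def objective(x):
        # Multimodal 1-D objective: wide global maximum at x = 2,
        # narrow local maximum at x = -2.
        return np.exp(-(x - 2.0) ** 2) + 0.5 * np.exp(-(x + 2.0) ** 2 / 0.1)

    def smoothed_gradient(x, sigma, n_samples=256):
        # Score-function estimate of the gradient of the smoothed objective
        # E_eps[objective(x + sigma * eps)], with a baseline to reduce variance.
        eps = rng.standard_normal(n_samples)
        return np.mean((objective(x + sigma * eps) - objective(x)) * eps) / sigma

    x = -2.5                                   # start in the basin of the local maximum
    for sigma in np.linspace(2.0, 0.05, 200):  # continuation: shrink the variance
        for _ in range(10):
            x += 0.1 * smoothed_gradient(x, sigma)

    print(f"final solution: {x:.2f}")          # close to the global maximum at x = 2

With a large variance the smoothed objective has a single basin around the global maximum; shrinking the variance then recovers the original objective, mirroring how a sufficiently stochastic Gaussian policy smooths the expected return of its deterministic counterpart.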
Next, we introduce a novel analysis of intrinsic bonuses through the lens of numerical optimization. We define two criteria for the learning objective and two for the stochastic gradient estimates, using them to evaluate the policy's quality after optimization. Our analysis highlights two key effects of exploration techniques: smoothing the learning objective to remove local optima while preserving the global maximum, and modifying gradient estimates to increase the likelihood of eventually finding an optimal policy. We empirically illustrate these effects, identifying limitations and suggesting directions for future work.
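The second effect can be illustrated with a minimal sketch (not from the thesis; the bandit problem, the entropy bonus, and all names are illustrative assumptions) in which an intrinsic bonus is added to the sampled return inside a REINFORCE-style gradient estimate, changing both the objective being ascended and each per-sample update:

    import numpy as np

    rng = np.random.default_rng(1)
    true_rewards = np.array([1.0, 0.0, 0.0])   # 3-armed bandit, arm 0 is optimal
    logits = np.zeros(3)                       # parameters of a softmax policy
    beta = 0.3                                 # weight of the intrinsic bonus

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    for _ in range(2000):
        p = softmax(logits)
        a = rng.choice(3, p=p)
        r = true_rewards[a] + rng.normal(0.0, 0.5)  # noisy extrinsic reward
        bonus = -np.log(p[a])                       # intrinsic bonus: surprise of the action
        grad_logp = -p
        grad_logp[a] += 1.0                         # gradient of log p[a] w.r.t. the logits
        # Score-function (REINFORCE) estimate of the gradient of E[r] + beta * entropy:
        logits += 0.05 * (r + beta * bonus) * grad_logp

    print(softmax(logits))  # most mass on arm 0, kept stochastic by the bonus

The bonus shifts the maximizer away from a fully deterministic policy and perturbs every gradient estimate toward actions the policy deems unlikely, a simple instance of how a bonus reshapes both the objective and its stochastic gradient estimates.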
Finally, we propose a new intrinsic reward bonus for exploration, in the spirit of maximum-entropy reinforcement learning methods. The intrinsic reward is defined as the relative entropy of the discounted distribution of future state-action pairs, or of features of these pairs. We prove that, under certain assumptions, an optimal exploration policy maximizing this reward also maximizes a lower bound on the state-action value function. We further show that the visitation distribution defining the intrinsic rewards is the fixed point of a contraction operator and describe how existing algorithms can be adapted to learn this fixed point. We then introduce a new off-policy maximum-entropy algorithm that explores effectively and computes high-quality control policies efficiently.
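A tabular sketch of the fixed-point computation (illustrative only: the toy transition matrix, the plain entropy used in place of the relative-entropy formulation, and all names are our assumptions) iterates the contraction defining the discounted visitation distribution and derives an intrinsic signal from it:

    import numpy as np

    # Toy 4-state Markov chain induced by a fixed policy; P[s, s'] is the
    # probability of moving from state s to state s'.
    P = np.array([[0.9, 0.1, 0.0, 0.0],
                  [0.1, 0.8, 0.1, 0.0],
                  [0.0, 0.1, 0.8, 0.1],
                  [0.0, 0.0, 0.1, 0.9]])
    gamma = 0.9

    def discounted_visitation(start, n_iter=200):
        # Fixed point of the contraction d <- (1 - gamma) * e_start + gamma * P^T d,
        # i.e. the normalized discounted distribution of future states.
        e = np.eye(4)[start]
        d = e.copy()
        for _ in range(n_iter):
            d = (1.0 - gamma) * e + gamma * P.T @ d
        return d

    for s in range(4):
        d = discounted_visitation(s)
        entropy = -np.sum(d * np.log(d + 1e-12))
        print(f"state {s}: entropy of future visitation = {entropy:.3f}")

In the thesis the distribution is over state-action pairs (or features of them) and is learned rather than computed in this tabular form; the sketch only shows the contraction structure underlying the fixed point.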
Disciplines :
Computer science
Author, co-author :
Bolland, Adrien ; Université de Liège - ULiège > Montefiore Institute of Electrical Engineering and Computer Science
Language :
English
Title :
Reimagining Exploration: Theoretical Insights and Practical Advancements in Policy Gradient Methods
Defense date :
February 2025
Institution :
ULiège - Université de Liège [Faculté des Sciences appliquées], Belgium
Degree :
Degree of Doctor of Engineering Sciences
Promotor :
Ernst, Damien ; Université de Liège - ULiège > Département d'électricité, électronique et informatique (Institut Montefiore) > Smart grids
Louppe, Gilles ; Université de Liège - ULiège > Département d'électricité, électronique et informatique (Institut Montefiore) > Big Data
Geurts, Pierre ; Université de Liège - ULiège > Département d'électricité, électronique et informatique (Institut Montefiore) > Algorithmique des systèmes en interaction avec le monde physique