Doctoral thesis (Dissertations and theses)
Reimagining Exploration: Theoretical Insights and Practical Advancements in Policy Gradient Methods
Bolland, Adrien
2025
 

Files


Full Text
Reimagining Exploration Theoretical Insights and Practical Advancements in Policy Gradient Methods.pdf
Author preprint (4.13 MB) Creative Commons License - Attribution, Non-Commercial, ShareAlike

Details



Abstract :
[en] In reinforcement learning, direct policy optimization, and policy gradient methods in particular, has proven effective for solving complex control problems. These methods are, however, highly sensitive to the evolution of the policy's stochasticity during learning: sufficient exploration must be maintained to avoid premature convergence toward a deterministic or low-entropy policy. This thesis studies this issue, with a focus on policy parameterization choices and on reward-shaping methods based on intrinsic exploration bonuses.

First, we analyze the influence of policy stochasticity on the optimization process. We formulate direct policy optimization within the optimization-by-continuation framework, which consists in optimizing a sequence of surrogate objectives called continuations. We show that optimizing the expected return of an affine Gaussian policy that is kept sufficiently stochastic, either by manually controlling the variance or by regularizing the entropy, corresponds to optimizing a continuation of the expected return of an underlying deterministic policy. This continuation is the expected return filtered to remove local extrema. We thereby argue that policy gradient algorithms enforcing exploration can be understood as methods for optimizing policies by continuation, and that the policy variance should be a history-dependent function adapted to avoid local optima.

Next, we introduce a novel analysis of intrinsic bonuses through the lens of numerical optimization. We define two criteria on the learning objective and two on the stochastic gradient estimates, and use them to evaluate the quality of the policy obtained after optimization. Our analysis highlights two key effects of exploration techniques: smoothing the learning objective to remove local optima while preserving the global maximum, and modifying the gradient estimates to increase the likelihood of eventually finding an optimal policy. We illustrate these effects empirically, identify limitations, and suggest directions for future work.

Finally, we propose a new intrinsic reward bonus for exploration in the spirit of maximum-entropy reinforcement learning. The intrinsic reward is defined as the relative entropy of the discounted distribution of future state-action pairs, or of features of these pairs. We prove that, under certain assumptions, an optimal exploration policy maximizing this reward also maximizes a lower bound on the state-action value function. We further show that the visitation distribution defining the intrinsic rewards is the fixed point of a contraction operator, and we describe how existing algorithms can be adapted to learn this fixed point. The thesis concludes with a new off-policy maximum-entropy algorithm that demonstrates effective exploration and efficiently computes high-quality control policies.
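
To make the continuation view above concrete, here is a minimal sketch in our own notation; it uses parameter-space smoothing as the filter, which is an illustrative simplification, whereas the thesis works with stochasticity in the action space. For a deterministic policy $\mu_\theta$ with expected return $J(\theta)$, Gaussian smoothing defines the surrogate family

$J_\sigma(\theta) = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, \sigma^2 I)}\left[J(\theta + \epsilon)\right],$

where a larger $\sigma$ filters out more local extrema and $\sigma \to 0$ recovers $J$ itself. Annealing $\sigma$ while following the gradient of $J_\sigma$ is precisely the structure of optimization by continuation; the result above can then be read as saying that a sufficiently stochastic affine Gaussian policy realizes such a filtering implicitly.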
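
The fixed-point property mentioned in the last paragraph can likewise be written out for the tabular case, using standard definitions in our notation rather than the thesis's. The discounted state-action visitation distribution

$d_\gamma^\pi(s,a) = (1-\gamma) \sum_{t \geq 0} \gamma^t \Pr(s_t = s,\, a_t = a)$

satisfies the recursion

$d_\gamma^\pi(s',a') = \pi(a' \mid s') \left[(1-\gamma)\, \mu_0(s') + \gamma \sum_{s,a} d_\gamma^\pi(s,a)\, P(s' \mid s,a)\right],$

whose right-hand side is a $\gamma$-contraction, so fixed-point iteration converges to $d_\gamma^\pi$. An entropy-style bonus of the form $r_{\mathrm{int}}(s,a) \propto -\log d_\gamma^\pi(s,a)$ is one illustrative instance; the thesis defines the bonus as a relative entropy of this distribution (or of features of the pairs), which this sketch does not capture.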
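
The short Python sketch below runs that fixed-point iteration on a small, randomly generated tabular MDP. It is entirely hypothetical code written for illustration, not code from the thesis; the sizes, seed, and discount factor are arbitrary choices.

import numpy as np

# Illustrative only: random tabular MDP (sizes and seed are arbitrary).
rng = np.random.default_rng(0)
nS, nA, gamma = 5, 3, 0.9

P = rng.random((nS, nA, nS))                 # transition kernel P(s' | s, a)
P /= P.sum(axis=2, keepdims=True)
pi = rng.random((nS, nA))                    # stochastic policy pi(a | s)
pi /= pi.sum(axis=1, keepdims=True)
mu0 = np.full(nS, 1.0 / nS)                  # initial state distribution

# Fixed-point iteration d <- (1 - gamma) * d0 + gamma * (transition under pi),
# where d0 is the t = 0 state-action distribution.
d0 = (mu0[:, None] * pi).ravel()
d = d0.copy()
for _ in range(1000):
    ds = d.reshape(nS, nA)
    next_s = np.einsum('sa,sat->t', ds, P)   # next-state marginal under d
    d_new = (1.0 - gamma) * d0 + gamma * (next_s[:, None] * pi).ravel()
    if np.max(np.abs(d_new - d)) < 1e-12:    # converged: contraction fixed point
        d = d_new
        break
    d = d_new

# Entropy-style bonus -log d(s, a); an illustrative stand-in for the thesis's
# relative-entropy reward, not its actual definition.
bonus = -np.log(d.reshape(nS, nA) + 1e-30)
print(d.sum())                               # ~1.0: d stays a distribution

Because each iteration is a contraction with modulus $\gamma$, the loop converges geometrically from any starting distribution; in the function-approximation setting, the thesis describes how existing algorithms can be adapted to learn the same fixed point.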
Disciplines :
Computer science
Author, co-author :
Bolland, Adrien ;  Université de Liège - ULiège > Montefiore Institute of Electrical Engineering and Computer Science
Language :
English
Title :
Reimagining Exploration: Theoretical Insights and Practical Advancements in Policy Gradient Methods
Defense date :
February 2025
Institution :
ULiège - Université de Liège [Faculty of Applied Sciences], Belgium
Degree :
Degree of Doctor of Engineering Sciences
Promotor :
Ernst, Damien;  Université de Liège - ULiège > Department of Electrical Engineering and Computer Science (Montefiore Institute) > Smart grids
Louppe, Gilles;  Université de Liège - ULiège > Department of Electrical Engineering and Computer Science (Montefiore Institute) > Big Data
Geurts, Pierre;  Université de Liège - ULiège > Department of Electrical Engineering and Computer Science (Montefiore Institute) > Algorithmics of systems interacting with the physical world
Geist, Matthieu;  UL - Université de Lorraine
Moulines, Eric;  École Polytechnique (l'X), France
Papini, Matteo;  Politecnico di Milano
Available on ORBi :
since 06 March 2025

