In reinforcement learning, direct policy optimization, and in particular policy gradient methods, have proven effective for solving complex control problems. However, these methods are highly sensitive to how the policy's stochasticity evolves during learning: sufficient exploration must be maintained to avoid premature convergence toward a deterministic or low-entropy policy. This thesis studies this issue, with a focus on policy parameterization choices and on reward-shaping methods based on intrinsic exploration bonuses.
First, we analyze the influence of policy stochasticity on the optimization process. We formulate direct policy optimization within the optimization-by-continuation framework, which consists in optimizing a sequence of surrogate objectives called continuations. We show that optimizing the expected return of an affine Gaussian policy that is kept sufficiently stochastic, either by manually controlling its variance or by regularizing its entropy, corresponds to optimizing a continuation of the expected return of an underlying deterministic policy. This continuation is the expected return filtered to remove local extrema. We thereby argue that policy gradient algorithms enforcing exploration can be understood as methods for optimizing policies by continuation, and that the policy variance should be a history-dependent function adapted to avoid local optima.
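As a rough illustration of this view (a minimal sketch, not taken from the thesis: the one-dimensional multimodal objective, the variance schedule, and all names are illustrative assumptions), Gaussian smoothing of an objective plays the role of the continuation, and gradient ascent is performed while the smoothing variance is gradually decreased:

    import numpy as np

    rng = np.random.default_rng(0)

    def objective(x):
        # Multimodal 1-D objective: wide global maximum at x = 2,
        # narrow local maximum at x = -2.
        return np.exp(-(x - 2.0) ** 2) + 0.5 * np.exp(-(x + 2.0) ** 2 / 0.1)

    def smoothed_gradient(x, sigma, n_samples=256):
        # Score-function estimate of the gradient of the smoothed objective
        # E_eps[objective(x + sigma * eps)], with a baseline to reduce variance.
        eps = rng.standard_normal(n_samples)
        return np.mean((objective(x + sigma * eps) - objective(x)) * eps) / sigma

    x = -2.5                                   # start in the basin of the local maximum
    for sigma in np.linspace(2.0, 0.05, 200):  # continuation: shrink the variance
        for _ in range(10):
            x += 0.1 * smoothed_gradient(x, sigma)

    print(f"final solution: {x:.2f}")          # close to the global maximum at x = 2

With a large variance the smoothed objective has a single basin around the global maximum; shrinking the variance then recovers the original objective, mirroring how a sufficiently stochastic Gaussian policy smooths the expected return of its deterministic counterpart.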
Next, we introduce a novel analysis of intrinsic bonuses through the lens of numerical optimization. We define two criteria for the learning objective and two for the stochastic gradient estimates, using them to evaluate the policy's quality after optimization. Our analysis highlights two key effects of exploration techniques: smoothing the learning objective to remove local optima while preserving the global maximum, and modifying gradient estimates to increase the likelihood of eventually finding an optimal policy. We empirically illustrate these effects, identifying limitations and suggesting directions for future work.
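The second effect can be illustrated with a minimal sketch (not from the thesis; the bandit problem, the entropy bonus, and all names are illustrative assumptions) in which an intrinsic bonus is added to the sampled return inside a REINFORCE-style gradient estimate, changing both the objective being ascended and each per-sample update:

    import numpy as np

    rng = np.random.default_rng(1)
    true_rewards = np.array([1.0, 0.0, 0.0])   # 3-armed bandit, arm 0 is optimal
    logits = np.zeros(3)                       # parameters of a softmax policy
    beta = 0.3                                 # weight of the intrinsic bonus

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    for _ in range(2000):
        p = softmax(logits)
        a = rng.choice(3, p=p)
        r = true_rewards[a] + rng.normal(0.0, 0.5)  # noisy extrinsic reward
        bonus = -np.log(p[a])                       # intrinsic bonus: surprise of the action
        grad_logp = -p
        grad_logp[a] += 1.0                         # gradient of log p[a] w.r.t. the logits
        # Score-function (REINFORCE) estimate of the gradient of E[r] + beta * entropy:
        logits += 0.05 * (r + beta * bonus) * grad_logp

    print(softmax(logits))  # most mass on arm 0, kept stochastic by the bonus

The bonus shifts the maximizer away from a fully deterministic policy and perturbs every gradient estimate toward actions the policy deems unlikely, a simple instance of how a bonus reshapes both the objective and its stochastic gradient estimates.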
Finally, we propose a new intrinsic reward bonus for exploration, in the spirit of maximum-entropy reinforcement learning methods. The intrinsic reward is defined as the relative entropy of the discounted distribution of future state-action pairs, or of features of these pairs. We prove that, under certain assumptions, an optimal exploration policy maximizing this reward also maximizes a lower bound on the state-action value function. We further show that the visitation distribution defining the intrinsic rewards is the fixed point of a contraction operator and describe how existing algorithms can be adapted to learn this fixed point. We then introduce a new off-policy maximum-entropy algorithm that explores effectively and computes high-quality control policies efficiently.
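A tabular sketch of the fixed-point computation (illustrative only: the toy transition matrix, the plain entropy used in place of the relative-entropy formulation, and all names are our assumptions) iterates the contraction defining the discounted visitation distribution and derives an intrinsic signal from it:

    import numpy as np

    # Toy 4-state Markov chain induced by a fixed policy; P[s, s'] is the
    # probability of moving from state s to state s'.
    P = np.array([[0.9, 0.1, 0.0, 0.0],
                  [0.1, 0.8, 0.1, 0.0],
                  [0.0, 0.1, 0.8, 0.1],
                  [0.0, 0.0, 0.1, 0.9]])
    gamma = 0.9

    def discounted_visitation(start, n_iter=200):
        # Fixed point of the contraction d <- (1 - gamma) * e_start + gamma * P^T d,
        # i.e. the normalized discounted distribution of future states.
        e = np.eye(4)[start]
        d = e.copy()
        for _ in range(n_iter):
            d = (1.0 - gamma) * e + gamma * P.T @ d
        return d

    for s in range(4):
        d = discounted_visitation(s)
        entropy = -np.sum(d * np.log(d + 1e-12))
        print(f"state {s}: entropy of future visitation = {entropy:.3f}")

In the thesis the distribution is over state-action pairs (or features of them) and is learned rather than computed in this tabular form; the sketch only shows the contraction structure underlying the fixed point.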
Disciplines :
Computer science
Author, co-author :
Bolland, Adrien ; Université de Liège - ULiège > Montefiore Institute of Electrical Engineering and Computer Science
Language :
English
Title :
Reimagining Exploration: Theoretical Insights and Practical Advancements in Policy Gradient Methods
Defense date :
February 2025
Institution :
ULiège - Université de Liège [Faculté des Sciences appliquées], Belgium
Degree :
Degree of Doctor of Engineering Sciences
Promotor :
Ernst, Damien ; Université de Liège - ULiège > Département d'électricité, électronique et informatique (Institut Montefiore) > Smart grids
Louppe, Gilles ; Université de Liège - ULiège > Département d'électricité, électronique et informatique (Institut Montefiore) > Big Data
Geurts, Pierre ; Université de Liège - ULiège > Département d'électricité, électronique et informatique (Institut Montefiore) > Algorithmique des systèmes en interaction avec le monde physique