Learning in a State of Confusion: Employing active perception and reinforcement learning in partially observable worlds
Date: 06/2007
Author: Crook, Paul A
Abstract
In applying reinforcement learning to agents acting in the real world, we are often faced with tasks that are non-Markovian in nature. Much work has been done using state estimation algorithms to try to uncover Markovian models of tasks in order to allow the learning of optimal solutions using reinforcement learning. Unfortunately, these algorithms, which attempt to simultaneously learn a Markov model of the world and how to act, have proved very brittle. Our focus differs. In considering embodied, embedded and situated agents, we prefer simple learning algorithms that reliably learn satisficing policies. The learning algorithms we consider do not try to uncover the underlying Markovian states; instead, they aim to learn successful deterministic reactive policies, such that agents' actions are based directly upon the observations provided by their sensors.
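
To make concrete what such a reactive policy looks like, here is a minimal sketch (not drawn from the thesis) of a tabular Q-learner whose value table is keyed on raw sensor observations rather than on underlying states, so that the greedy policy it extracts is a deterministic mapping from observation to action. The environment interface (`reset()`, `step()`) and all parameter values are assumptions made purely for illustration.

```python
import random
from collections import defaultdict

def learn_reactive_policy(env, actions, episodes=500,
                          alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning keyed on observations, not hidden states.

    The greedy policy extracted at the end is a deterministic
    reactive mapping: observation -> action.
    """
    q = defaultdict(float)  # (observation, action) -> value estimate

    for _ in range(episodes):
        obs = env.reset()  # assumed interface: returns an observation
        done = False
        while not done:
            # epsilon-greedy selection over the agent's observations
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: q[(obs, a)])

            # assumed interface: step() returns (observation, reward, done)
            next_obs, reward, done = env.step(action)
            best_next = max(q[(next_obs, a)] for a in actions)
            q[(obs, action)] += alpha * (
                reward + gamma * best_next - q[(obs, action)])
            obs = next_obs

    # deterministic reactive policy: one action per observed percept
    observations = {o for o, _ in q}
    return {o: max(actions, key=lambda a: q[(o, a)]) for o in observations}
```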
Existing results have shown that such reactive policies can be arbitrarily worse than a policy that has access to the underlying Markov process, and that in some cases no satisficing reactive policy can exist. Our first contribution is to show that providing agents with alternative actions and viewpoints on the task, through the addition of active perception, can provide a practical solution in such circumstances. We demonstrate empirically that: (i) adding arbitrary active perception actions to agents that can only learn deterministic reactive policies can allow the learning of satisficing policies where none were originally possible; (ii) active perception actions allow the learning of better satisficing policies than those that existed previously; and (iii) our approach converges more reliably to satisficing solutions than existing state estimation algorithms such as U-Tree and the Lion Algorithm.
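
The following toy wrapper illustrates the general idea of adding active perception: perceptual actions that change the agent's viewpoint, and hence its observations, without changing the underlying world state. The base environment interface (`reset()`, `step()`, `observe()`) and the particular "look" actions are hypothetical and are not taken from the thesis.

```python
class ActivePerceptionWrapper:
    """Toy illustration: augment an environment's action set with
    perceptual actions that redirect the agent's sensor without
    altering the underlying world state."""

    PERCEPTUAL_ACTIONS = ["look_left", "look_ahead", "look_right"]

    def __init__(self, base_env, physical_actions):
        self.base_env = base_env
        self.actions = list(physical_actions) + self.PERCEPTUAL_ACTIONS
        self.gaze = "look_ahead"

    def reset(self):
        self.gaze = "look_ahead"
        self.base_env.reset()
        # assumed interface: observe(viewpoint) returns a sensor reading
        return self.base_env.observe(self.gaze)

    def step(self, action):
        if action in self.PERCEPTUAL_ACTIONS:
            # Perceptual action: the world state is unchanged, only the
            # viewpoint (and hence the next observation) changes.
            self.gaze = action
            return self.base_env.observe(self.gaze), 0.0, False
        # Physical action: delegate to the underlying environment.
        _, reward, done = self.base_env.step(action)
        return self.base_env.observe(self.gaze), reward, done
```

A learner such as the one sketched above could then be run unchanged over the wrapped environment, with the enlarged action set giving it the chance to find a satisficing reactive policy where the original observations alone admitted none.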
Our other contributions focus on issues that affect the reliability with which deterministic reactive satisficing policies can be learnt in non-Markovian environments. We show that greedy action selection may be a necessary condition for the existence of stable deterministic reactive policies on partially observable Markov decision processes (POMDPs). We also set out the concept of Consistent Exploration: the idea of estimating state-action values by acting as though the policy has been changed to incorporate the action being explored. We demonstrate that this concept can be used to develop better algorithms for learning reactive policies for POMDPs by presenting a new reinforcement learning algorithm, the Consistent Exploration Q(λ) algorithm (CEQ(λ)). We demonstrate on a significant number of problems that CEQ(λ) is more reliable at learning satisficing solutions than the algorithm currently regarded as the best for learning deterministic reactive policies, SARSA(λ).
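
As a rough illustration of the Consistent Exploration idea described above, the sketch below shows a SARSA-style episode in which an exploratory choice is made by changing the deterministic reactive policy itself and then following the changed policy, so that the experience used for updates is consistent with the policy being evaluated. This is a simplification: it omits the eligibility traces of CEQ(λ) and other details of the algorithm as presented in the thesis, and the environment interface and parameter values are assumptions.

```python
import random

def consistent_exploration_episode(env, actions, q, policy,
                                   alpha=0.1, gamma=0.9, epsilon=0.1):
    """One episode of SARSA-style learning with Consistent Exploration
    (a simplified sketch; no eligibility traces).

    Exploration is performed by rewriting the deterministic reactive
    policy at the observation in question and then following it, rather
    than by a one-off deviation from the policy.
    """
    obs = env.reset()  # assumed interface: returns an observation
    if random.random() < epsilon:
        policy[obs] = random.choice(actions)
    action = policy.setdefault(obs, random.choice(actions))
    done = False

    while not done:
        # assumed interface: step() returns (observation, reward, done)
        next_obs, reward, done = env.step(action)

        if random.random() < epsilon:
            # explore by changing the policy itself at this observation
            policy[next_obs] = random.choice(actions)
        next_action = policy.setdefault(next_obs, random.choice(actions))

        # on-policy (SARSA-like) update towards the followed policy
        target = reward if done else (
            reward + gamma * q.get((next_obs, next_action), 0.0))
        q[(obs, action)] = q.get((obs, action), 0.0) + alpha * (
            target - q.get((obs, action), 0.0))

        obs, action = next_obs, next_action

    return q, policy
```

In use, the same `q` and `policy` dictionaries would be passed in and returned across many episodes, with `policy` converging towards a deterministic reactive mapping from observations to actions.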