Uniform positive recurrence and long term behavior of Markov decision processes, with applications in sensor scheduling

Date

2015-12-03

Authors

Carroll, Johnson Edward Hawes

Abstract

In this dissertation, we present several new results on stability, optimal control, and value iteration algorithms for discrete-time Markov decision processes (MDPs). First, we adapt two recent results on controlled diffusion processes to countable-state MDPs by making assumptions that approximate continuous behavior. We show that if the MDP is stable under every stationary policy, then it is uniformly stable over all policies. This abstract result is very useful in the analysis of optimal control problems and extends the characterization of uniform stability properties for MDPs. We then derive two useful local bounds on the discounted value functions for a large class of MDPs, facilitating analysis of the ergodic cost problem via the Arzelà-Ascoli theorem. We also examine and exploit the previously underutilized Harnack inequality for discrete Markov chains; one aim of this work is to determine how much can be accomplished for models with this property.
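
As a concrete, hedged illustration of the kind of stability property involved, the sketch below numerically checks a Foster-Lyapunov drift condition on a toy reflected random walk over the nonnegative integers. The chain, the Lyapunov function V(x) = x, and the constants are assumptions chosen for the example, not the dissertation's model.

```python
# Hedged illustration: check a Foster-Lyapunov drift condition
#   E[V(X_{t+1}) | X_t = x] - V(x) <= -eps + b * 1{x in C}
# for V(x) = x on a toy reflected random walk (assumed, for illustration).

def drift(x, p=0.4):
    """Expected one-step change in V(x) = x for a walk that moves +1
    with probability p and -1 with probability 1 - p, reflected at 0."""
    if x == 0:
        return p * 1.0        # from 0: move to 1 w.p. p, stay at 0 otherwise
    return 2.0 * p - 1.0      # negative whenever p < 1/2

eps, b, C = 0.1, 1.0, {0}
for x in range(50):
    assert drift(x) <= -eps + (b if x in C else 0.0), f"fails at x={x}"
print("drift condition holds on states 0..49")
```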

Convergence of the value iteration algorithm is typically treated in the literature under blanket stability assumptions. We prove two new sufficient conditions for the convergence of value iteration without blanket stability, requiring only geometric ergodicity under the optimal policy. These results provide the theoretical basis for applying value iteration to classes of problems to which it was previously inapplicable.
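
For orientation, here is a minimal sketch of discounted value iteration on a small finite MDP; the transition kernel, costs, and discount factor are randomly generated assumptions, not data from the dissertation.

```python
import numpy as np

# Minimal discounted value iteration sketch on a toy finite MDP.
# P[a, x, :] is the transition row under action a; c[a, x] the one-step cost.
n_states, n_actions, beta = 4, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))
c = rng.uniform(0.0, 1.0, size=(n_actions, n_states))

V = np.zeros(n_states)
for _ in range(1000):
    Q = c + beta * P @ V          # Bellman operator: c(x,a) + beta E[V(X')]
    V_new = Q.min(axis=0)
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new
policy = Q.argmin(axis=0)          # greedy stationary policy
print("V* ~", np.round(V, 4), "policy:", policy)
```

In this discounted, finite setting convergence is classical (the Bellman operator is a beta-contraction); the dissertation's results address the harder settings where such blanket conditions are unavailable.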

We then consider a discrete-time linear system with Gaussian white noise and quadratic costs, observed via multiple sensors that communicate over a congested network. Each observation is lost or received according to a Bernoulli random variable whose loss rate is determined by the state of the network and the choice of sensor. We completely analyze the finite-horizon, discounted, and long-term average optimal control problems. Assuming that the system is stabilizable, we use a partial separation principle to transform the problem into an MDP on the set of symmetric positive definite matrices. A special case of these results generalizes a known result for Kalman filtering with intermittent observations to the multiple-sensor case, with powerful implications.
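
The covariance dynamics behind this formulation can be sketched as a Riccati-type update driven by a Bernoulli arrival variable, in the spirit of Kalman filtering with intermittent observations. The scalar system matrices and loss rate below are assumptions for illustration only.

```python
import numpy as np

# Hedged sketch: error-covariance (Riccati) update with Bernoulli
# observation arrivals, for an assumed scalar system (A, C, Q, R).
A = np.array([[1.2]])   # unstable dynamics (assumption)
C = np.array([[1.0]])   # observation matrix
Q = np.array([[1.0]])   # process noise covariance
R = np.array([[1.0]])   # measurement noise covariance

def riccati_step(P, received):
    """P+ = A P A' + Q - gamma * A P C' (C P C' + R)^{-1} C P A',
    where gamma = 1 if the observation arrived and 0 if it was lost."""
    prior = A @ P @ A.T + Q
    if not received:
        return prior            # lost packet: covariance grows
    K = A @ P @ C.T @ np.linalg.inv(C @ P @ C.T + R)
    return prior - K @ C @ P @ A.T

rng = np.random.default_rng(1)
loss_rate = 0.3                 # assumed network loss probability
P = np.array([[1.0]])
for _ in range(20):
    P = riccati_step(P, rng.random() > loss_rate)
print("error covariance after 20 steps:", P)
```

The covariance P, rather than the raw state, becomes the state of the resulting MDP on symmetric positive definite matrices; the sensor choice enters through the loss rate (and, in general, the observation matrix).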

Finally, we show that the value iteration algorithm converges for this problem without additional assumptions, since the structure of the problem guarantees geometric ergodicity under the optimal policy. These results allow the incorporation of adaptive schemes for estimating unknown system parameters without affecting stability or the long-term average cost. We also show that after only a few steps of the value iteration algorithm, the generated policy induces a geometrically ergodic chain and is near-optimal.
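
As a hedged sketch of what a few-step value iteration policy can look like here, the following depth-1 (greedy) rule picks, at each time, the sensor minimizing the expected one-step-ahead trace of the covariance. The two sensors, their loss rates, and the trace cost are illustrative assumptions, not the dissertation's construction.

```python
import numpy as np

# Hedged sketch: greedy (depth-1 value iteration) sensor scheduling on
# the covariance state. Sensors, loss rates, and costs are assumptions.
A = np.array([[1.2]]); Qn = np.array([[1.0]]); R = np.array([[1.0]])
sensors = [(np.array([[1.0]]), 0.5),    # (observation matrix, loss rate)
           (np.array([[0.5]]), 0.1)]

def expected_next_cov(P, C, loss):
    """Average of the lost/received Riccati updates over the Bernoulli arrival."""
    prior = A @ P @ A.T + Qn
    K = A @ P @ C.T @ np.linalg.inv(C @ P @ C.T + R)
    return loss * prior + (1.0 - loss) * (prior - K @ C @ P @ A.T)

rng = np.random.default_rng(2)
P = np.array([[1.0]])
for _ in range(10):
    # greedy choice: minimize the expected one-step-ahead trace
    a = int(np.argmin([np.trace(expected_next_cov(P, C, q)) for C, q in sensors]))
    C, q = sensors[a]
    prior = A @ P @ A.T + Qn
    if rng.random() > q:        # observation received: apply the full update
        K = A @ P @ C.T @ np.linalg.inv(C @ P @ C.T + R)
        P = prior - K @ C @ P @ A.T
    else:                       # observation lost
        P = prior
print("covariance after 10 scheduled steps:", P)
```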
