Book contents
- Frontmatter
- Contents
- Preface
- 1 Introduction
- Part I Stochastic Models and Bayesian Filtering
- Part II Partially Observed Markov Decision Processes: Models and Applications
- 6 Fully observed Markov decision processes
- 7 Partially observed Markov decision processes (POMDPs)
- 8 POMDPs in controlled sensing and sensor scheduling
- Part III Partially Observed Markov Decision Processes: Structural Results
- Part IV Stochastic Approximation and Reinforcement Learning
- Appendix A Short primer on stochastic simulation
- Appendix B Continuous-time HMM filters
- Appendix C Markov processes
- Appendix D Some limit theorems
- References
- Index
6 - Fully observed Markov decision processes
from Part II - Partially Observed Markov Decision Processes: Models and Applications
Published online by Cambridge University Press: 05 April 2016
Summary
A Markov decision process (MDP) is a Markov process with feedback control. That is, as illustrated in Figure 6.1, a decision-maker (controller) uses the state xk of the Markov process at each time k to choose an action uk. This action is fed back to the Markov process and selects the transition matrix P(uk), which in turn determines the probability that the Markov process jumps to a particular state xk+1 at time k + 1, and so on. The aim of the decision-maker is to choose a sequence of actions over a time horizon to minimize the expected cumulative cost accrued along the trajectory of the Markov process.
MDPs arise in stochastic optimization models in telecommunication networks, discrete event systems, inventory control, finance, investment and health planning. Also, POMDPs can be viewed as continuous-state MDPs.
This chapter gives a brief description of MDPs, which provides a starting point for POMDPs. The main result is that the optimal choice of actions by the controller in Figure 6.1 is obtained by solving a backward stochastic dynamic programming recursion.
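The feedback loop of Figure 6.1 can be sketched in a few lines of code. The following is a minimal illustration, not from the text: the two-state, two-action transition matrices and the toy policy are hypothetical, chosen only to show how the chosen action uk selects the transition matrix governing the jump to xk+1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical transition matrices: P[u] is the X-by-X matrix P(u)
# selected when the controller chooses action u (illustrative numbers).
P = {
    0: np.array([[0.9, 0.1],
                 [0.2, 0.8]]),
    1: np.array([[0.5, 0.5],
                 [0.6, 0.4]]),
}

def simulate(policy, x0=0, horizon=10):
    """Run the feedback loop: observe state x_k, choose action
    u_k = policy(x_k), then jump according to row x_k of P(u_k)."""
    x = x0
    trajectory = [x]
    for _ in range(horizon):
        u = policy(x)                  # controller maps state -> action
        x = rng.choice(2, p=P[u][x])   # Markov jump under P(u)
        trajectory.append(x)
    return trajectory

traj = simulate(policy=lambda x: x)    # toy policy: action = current state
```

A deterministic map from state to action, as used here, is called a (stationary) policy; the dynamic programming recursion below shows how to choose such maps optimally.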
Finite state finite horizon MDP
Let k = 0, 1, …, N denote discrete time. N is called the time horizon or planning horizon. In this section we consider MDPs where the horizon N is finite.
The finite state MDP model consists of the following ingredients:
1. X = {1, 2, …, X} denotes the state space and xk ∈ X denotes the state of the controlled Markov chain at time k = 0, 1, …, N.
2. U = {1, 2, …, U} denotes the action space. The elements u ∈ U are called actions. In particular, uk ∈ U denotes the action chosen at time k.
3. For each action u ∈ U and time k ∈ {0, …, N−1}, P(u, k) denotes an X×X transition probability matrix with elements
Pij(u, k) = ℙ(xk+1 = j|xk = i, uk = u), i, j ∈ X.
4. For each state i ∈ X, action u ∈ U and time k ∈ {0, 1, …, N−1}, the scalar c(i, u, k) denotes the one-stage cost incurred by the decision-maker (controller).
From: Partially Observed Markov Decision Processes: From Filtering to Controlled Sensing, pp. 121–146. Publisher: Cambridge University Press. Print publication year: 2016.