Book contents
- Frontmatter
- Contents
- Preface
- 1 Introduction
- Part I Stochastic Models and Bayesian Filtering
- Part II Partially Observed Markov Decision Processes: Models and Applications
- 6 Fully observed Markov decision processes
- 7 Partially observed Markov decision processes (POMDPs)
- 8 POMDPs in controlled sensing and sensor scheduling
- Part III Partially Observed Markov Decision Processes: Structural Results
- Part IV Stochastic Approximation and Reinforcement Learning
- Appendix A Short primer on stochastic simulation
- Appendix B Continuous-time HMM filters
- Appendix C Markov processes
- Appendix D Some limit theorems
- References
- Index
7 - Partially observed Markov decision processes (POMDPs)
from Part II - Partially Observed Markov Decision Processes: Models and Applications
Published online by Cambridge University Press: 05 April 2016
Summary
A POMDP is a controlled HMM. Recall from §2.4 that an HMM consists of an X-state Markov chain {xk} observed via a noisy observation process {yk}. Figure 7.1 displays the schematic setup of a POMDP where the action uk affects the state and/or observation (sensing) process of the HMM. The HMM filter (discussed extensively in Chapter 3) computes the posterior distribution πk of the state. The posterior πk is called the belief state. In a POMDP, the stochastic controller depicted in Figure 7.1 uses the belief state to choose the next action.
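The belief-state update performed by the HMM filter is a one-step Bayesian recursion: predict the state forward through the controlled transition matrix, correct by the observation likelihood, then normalize. A minimal sketch (function name, array layout, and the convention P[u][i, j] = P(x_{k+1}=j | x_k=i, u_k=u), B[u][j, y] = P(y_{k+1}=y | x_{k+1}=j, u_k=u) are illustrative assumptions, not the book's notation):

```python
import numpy as np

def hmm_filter_update(pi, u, y, P, B):
    """One step of the controlled HMM filter (belief-state update).

    pi : current belief over states, shape (X,)
    u  : action chosen at time k
    y  : observation index received at time k+1
    P  : transition matrices; P[u][i, j] = P(x_{k+1}=j | x_k=i, u_k=u)
    B  : observation matrices; B[u][j, y] = P(y_{k+1}=y | x_{k+1}=j, u_k=u)
    """
    # Predict: propagate the belief through the transition matrix for action u.
    predicted = P[u].T @ pi
    # Correct: weight each state by the likelihood of the received observation.
    unnormalized = B[u][:, y] * predicted
    # Normalize so the result is again a probability vector (the belief pi_{k+1}).
    return unnormalized / unnormalized.sum()
```

The normalized output is the belief state πk+1 that the controller in Figure 7.1 uses to choose the next action.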
This chapter is organized as follows. §7.1 describes the POMDP model. Then §7.2 gives the belief state formulation and Bellman's dynamic programming equation for the optimal policy of a POMDP. It is shown that a POMDP is equivalent to a continuous-state MDP whose states are belief states (posteriors). Bellman's equation for continuous-state MDPs was discussed in §6.3. §7.3 gives a toy example of a POMDP. Although a POMDP is a continuous-state MDP, §7.4 shows that for finite horizon POMDPs, Bellman's equation has a finite-dimensional characterization. §7.5 discusses several algorithms that exploit this finite-dimensional characterization to compute the optimal policy. §7.6 considers discounted cost infinite horizon POMDPs. As an example of a POMDP, optimal search of a moving target is discussed in §7.7.
Finite horizon POMDP
A POMDP model with finite horizon N is a 7-tuple
(X, U,Y, P(u), B(u), c(u), cN).
Figure 7.1 Partially observed Markov decision process (POMDP) schematic setup. The Markov system together with the noisy sensor constitutes a hidden Markov model (HMM). The HMM filter computes the posterior (belief state) πk of the state of the Markov chain. The controller (decision-maker) then chooses the action uk at time k based on πk.
1. X = {1, 2, …, X} denotes the state space and xk ∈ X denotes the state of a controlled Markov chain at time k = 0, 1, …, N.
2. U = {1, 2, …, U} denotes the action space with uk ∈ U denoting the action chosen at time k by the controller.
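The seven tuple elements named above (X, U, Y, P(u), B(u), c(u), cN) can be collected in a small container to make the model's shapes explicit. This is only an illustrative sketch; the class name, field names, and array layouts are assumptions, not the book's definitions:

```python
from typing import NamedTuple
import numpy as np

class POMDPModel(NamedTuple):
    """Illustrative container for the 7-tuple (X, U, Y, P(u), B(u), c(u), cN)."""
    X: int            # number of states; state space {1, ..., X}
    U: int            # number of actions; action space {1, ..., U}
    Y: int            # number of observation symbols
    P: np.ndarray     # shape (U, X, X): controlled transition matrices P(u)
    B: np.ndarray     # shape (U, X, Y): controlled observation matrices B(u)
    c: np.ndarray     # shape (U, X): per-stage costs c(u)
    c_N: np.ndarray   # shape (X,): terminal cost at horizon N
```

With this layout, each P[u] and each row of B[u] must be a stochastic vector (nonnegative, summing to one), which is easy to check programmatically when building a model.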
From Partially Observed Markov Decision Processes: From Filtering to Controlled Sensing, pp. 147–178. Publisher: Cambridge University Press. Print publication year: 2016.