
CONVERGENCE OF SIMULATION-BASED POLICY ITERATION

Published online by Cambridge University Press:  27 February 2003

William L. Cooper
Affiliation:
Department of Mechanical Engineering, University of Minnesota, Minneapolis, MN 55455, E-mail: billcoop@me.umn.edu
Shane G. Henderson
Affiliation:
School of Operations Research and Industrial Engineering, Cornell University, Ithaca, NY 14853, E-mail: shane@orie.cornell.edu
Mark E. Lewis
Affiliation:
Department of Industrial and Operations Engineering, University of Michigan, Ann Arbor, MI 48109-2117, E-mail: melewis@engin.umich.edu

Abstract

Simulation-based policy iteration (SBPI) is a modification of the policy iteration algorithm for computing optimal policies for Markov decision processes. At each iteration, rather than solving the average evaluation equations, SBPI employs simulation to estimate a solution to these equations. For recurrent average-reward Markov decision processes with finite state and action spaces, we provide easily verifiable conditions that ensure that, almost surely, SBPI eventually enters and never leaves the set of optimal decision rules. We analyze three simulation estimators for solutions to the average evaluation equations. Using our general results, we derive simple conditions on the simulation run lengths that guarantee the almost-sure convergence of the algorithm.
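
The abstract describes the algorithm only at a high level. The sketch below is a minimal, illustrative SBPI loop for a finite, recurrent, average-reward MDP: it replaces exact policy evaluation with a single regenerative-cycle estimator of the gain and relative values, then applies the usual greedy improvement step. This is not the authors' code; the paper analyzes three evaluation estimators whose details differ, and the function names, array conventions (P of shape actions x states x states, r of shape actions x states), and run-length parameters here are assumptions made for illustration.

```python
import numpy as np


def evaluate_by_simulation(P, r, d, ref_state=0, n_cycles=500, rng=None):
    # Estimate the gain g and a relative value function h for the fixed
    # decision rule d, using regenerative cycles that start and end at
    # ref_state.  Illustrative estimator only; in the recurrent setting
    # every cycle returns to ref_state with probability 1.
    rng = np.random.default_rng() if rng is None else rng
    S = P.shape[1]
    reward_sum, time_sum = 0.0, 0
    h_rew = np.zeros(S)   # suffix reward sums recorded at first visits
    h_len = np.zeros(S)   # matching suffix lengths
    h_cnt = np.zeros(S)   # number of cycles in which each state was visited
    for _ in range(n_cycles):
        s, states, rewards = ref_state, [], []
        while True:                                   # simulate one cycle
            a = d[s]
            states.append(s)
            rewards.append(r[a, s])
            s = rng.choice(S, p=P[a, s])
            if s == ref_state:                        # cycle ends on return
                break
        rewards = np.asarray(rewards)
        reward_sum += rewards.sum()
        time_sum += len(rewards)
        suffix = np.cumsum(rewards[::-1])[::-1]       # reward-to-go within the cycle
        seen = set()
        for t, state in enumerate(states):
            if state not in seen:                     # first-visit contributions
                seen.add(state)
                h_rew[state] += suffix[t]
                h_len[state] += len(rewards) - t
                h_cnt[state] += 1
    g = reward_sum / time_sum                         # ratio estimator of the gain
    h = np.where(h_cnt > 0, (h_rew - g * h_len) / np.maximum(h_cnt, 1), 0.0)
    return g, h - h[ref_state]                        # normalize so h(ref_state) = 0


def sbpi(P, r, n_iters=20, **eval_kwargs):
    # Simulation-based policy iteration: evaluate the current decision rule
    # by simulation, then take a greedy improvement step with respect to the
    # estimated (g, h).
    S = P.shape[1]
    d = np.zeros(S, dtype=int)                        # arbitrary initial decision rule
    for _ in range(n_iters):
        g, h = evaluate_by_simulation(P, r, d, **eval_kwargs)
        q = r + P @ h                                 # q[a, s] = r(s, a) + sum_j P(j | s, a) h(j)
        d = q.argmax(axis=0)                          # improved decision rule
    return d
```

The convergence conditions in the paper concern how the simulation effort (here, the analogue of n_cycles) must grow across iterations so that, with probability 1, the generated decision rules are eventually all optimal.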

Type
Research Article
Copyright
© 2003 Cambridge University Press
