
18 - Expectation maximisation methods for solving (PO)MDPs and optimal control problems

from VI - Agent-based models

Published online by Cambridge University Press:  07 September 2011

Marc Toussaint, Technische Universität Berlin
Amos Storkey, University of Edinburgh
Stefan Harmeling, Max Planck Institute for Biological Cybernetics

Edited by
David Barber, University College London
A. Taylan Cemgil, Boğaziçi Üniversitesi, Istanbul
Silvia Chiappa, University of Cambridge

Summary

Introduction

As this book demonstrates, the development of efficient probabilistic inference techniques has made considerable progress in recent years, in particular with respect to exploiting the structure (e.g., factored, hierarchical or relational) of discrete and continuous problem domains. In this chapter we show that these techniques can also be used for solving Markov decision processes (MDPs) or partially observable MDPs (POMDPs) when these are formulated in terms of a structured dynamic Bayesian network (DBN).
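
To make the idea concrete, here is a minimal sketch, not taken from the chapter, of the kind of expectation maximisation policy update such a formulation leads to for a small tabular MDP: rewards are assumed to lie in [0, 1] so they can be read as the probability of a binary "success" observation, the E-step evaluates the current policy, and the M-step reweights each state's action distribution by that evaluation. The toy MDP and all variable names are illustrative assumptions.

```python
import numpy as np

# Minimal sketch (toy example, not from the chapter) of EM-style policy
# optimisation for a tabular MDP with rewards in [0, 1].
rng = np.random.default_rng(0)
nS, nA, gamma, eval_iters = 3, 2, 0.9, 200

P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a, s']: transition model
R = rng.uniform(size=(nS, nA))                  # expected reward in [0, 1]
pi = np.ones((nS, nA)) / nA                     # stochastic policy pi[s, a]

for _ in range(100):
    # "E-step": evaluate the current policy, i.e. compute Q^pi by iterating the
    # Bellman expectation backup (backward messages in the corresponding DBN).
    Q = R.copy()
    for _ in range(eval_iters):
        V = (pi * Q).sum(axis=1)                         # V(s) = E_pi[Q(s, a)]
        Q = R + gamma * np.einsum('sak,k->sa', P, V)
    # "M-step": the posterior over the action taken in state s, given that the
    # binary reward was observed, is proportional to pi(a|s) * Q(s, a); set the
    # new policy to this posterior.
    pi = pi * Q
    pi /= pi.sum(axis=1, keepdims=True)

print(np.round(pi, 3))   # the policy concentrates on high-value actions
```

In this toy tabular case the update simply reweights actions by their value under the current policy; the interest of the inference view is that the same E-step can be carried out with structured or approximate message passing when the state space is factored or continuous.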

The problems of planning in stochastic environments and inference in state space models are closely related, in particular in view of the challenges both face: scaling to large state spaces spanned by multiple state variables, and realising planning (or inference) in continuous or mixed continuous-discrete state spaces. Both fields have developed techniques to address these problems. In the field of planning, for instance, these include work on factored Markov decision processes [5, 17, 9, 18], abstractions [10], and relational models of the environment [37]. On the other hand, recent advances in inference techniques show how structure can be exploited both for exact inference and for making efficient approximations. Examples are message-passing algorithms (loopy belief propagation, expectation propagation), variational approaches, approximate belief representations (particles, assumed density filtering, Boyen–Koller) and arithmetic compilation (see, e.g., [22, 23, 7]).
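
As a small illustration of the kind of structure such methods exploit, the following hypothetical two-variable example (not taken from the chapter) shows how a factored transition model is specified per state variable, so that the exponentially large joint transition matrix never has to be stored explicitly; it can be assembled on demand when needed.

```python
import numpy as np

# Hypothetical factored MDP fragment: two binary state variables x1, x2 whose
# next values depend only on their own current value and the action, so
# P(x1', x2' | x1, x2, a) = P(x1' | x1, a) * P(x2' | x2, a).
P1 = np.array([[[0.9, 0.1], [0.2, 0.8]],    # P1[a, x1, x1']
               [[0.6, 0.4], [0.3, 0.7]]])
P2 = np.array([[[0.7, 0.3], [0.1, 0.9]],    # P2[a, x2, x2']
               [[0.5, 0.5], [0.4, 0.6]]])

def joint_transition(a):
    """Return the full 4x4 transition matrix over (x1, x2) for action a,
    built from the two per-variable CPTs rather than stored explicitly."""
    T = np.einsum('ij,kl->ikjl', P1[a], P2[a])   # shape (x1, x2, x1', x2')
    return T.reshape(4, 4)

print(joint_transition(0))   # rows sum to 1: a valid joint transition matrix
```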

In view of these similarities one may ask whether existing techniques for probabilistic inference can be translated directly into solving stochastic planning problems.

Type: Chapter
Publisher: Cambridge University Press
Print publication year: 2011


References

[1] C. G. Atkeson and J. C. Santamaría. A comparison of direct and model-based reinforcement learning. In International Conference on Robotics and Automation, 1997.
[2] H. Attias. Planning by probabilistic inference. In Christopher M. Bishop and Brendan J. Frey, editors, Proceedings of the 9th International Workshop on Artificial Intelligence and Statistics, 2003.
[3] M. J. Beal, Z. Ghahramani and C. E. Rasmussen. The infinite hidden Markov model. In T. Dietterich, S. Becker and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14. MIT Press, 2002.
[4] C. Boutilier, T. Dean and S. Hanks. Decision theoretic planning: structural assumptions and computational leverage. Journal of Artificial Intelligence Research, 11:1–94, 1999.
[5] C. Boutilier, R. Dearden and M. Goldszmidt. Exploiting structure in policy construction. In Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI 1995), pages 1104–1111, 1995.
[6] H. Bui, S. Venkatesh and G. West. Policy recognition in the abstract hidden Markov model. Journal of Artificial Intelligence Research, 17:451–499, 2002.
[7] M. Chavira, A. Darwiche and M. Jaeger. Compiling relational Bayesian networks for exact inference. International Journal of Approximate Reasoning, 42:4–20, 2006.
[8] G. F. Cooper. A method for using belief networks as influence diagrams. In Proceedings of the Fourth Workshop on Uncertainty in Artificial Intelligence, pages 55–63, 1988.
[9] C. Guestrin, D. Koller, R. Parr and S. Venkataraman. Efficient solution algorithms for factored MDPs. Journal of Artificial Intelligence Research, 19:399–468, 2003.
[10] M. Hauskrecht, N. Meuleau, L. P. Kaelbling, T. Dean and C. Boutilier. Hierarchical solution of Markov decision processes using macro-actions. In Proceedings of Uncertainty in Artificial Intelligence, pages 220–229, 1998.
[11] M. Hoffman, N. de Freitas, A. Doucet and J. Peters. An expectation maximization algorithm for continuous Markov decision processes with arbitrary rewards. In Twelfth International Conference on Artificial Intelligence and Statistics, 2009.
[12] M. Hoffman, A. Doucet, N. de Freitas and A. Jasra. Bayesian policy learning with trans-dimensional MCMC. In Advances in Neural Information Processing Systems 20. MIT Press, 2008.
[13] M. Hoffman, H. Kueck, A. Doucet and N. de Freitas. New inference strategies for solving Markov decision processes using reversible jump MCMC. In Uncertainty in Artificial Intelligence, 2009.
[14] L. P. Kaelbling, M. L. Littman and A. R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101:99–134, 1998.
[15] L. P. Kaelbling, M. L. Littman and A. W. Moore. Reinforcement learning: a survey. Journal of Artificial Intelligence Research, 4:237–285, 1996.
[16] J. Kober and J. Peters. Policy search for motor primitives in robotics. In D. Koller, D. Schuurmans and Y. Bengio, editors, Advances in Neural Information Processing Systems 21. MIT Press, 2009.
[17] D. Koller and R. Parr. Computing factored value functions for policies in structured MDPs. In Proceedings of the 16th International Joint Conference on Artificial Intelligence, pages 1332–1339, 1999.
[18] B. Kveton and M. Hauskrecht. An MCMC approach to solving hybrid factored MDPs. In Proceedings of the 19th International Joint Conference on Artificial Intelligence, pages 1346–1351, 2005.
[19] M. L. Littman, S. M. Majercik and T. Pitassi. Stochastic Boolean satisfiability. Journal of Automated Reasoning, 27(3):251–296, 2001.
[20] A. McGovern and A. G. Barto. Automatic discovery of subgoals in reinforcement learning using diverse density. In Proceedings of the 18th International Conference on Machine Learning, pages 361–368, 2001.
[21] N. Meuleau, L. Peshkin, K.-E. Kim and L. P. Kaelbling. Learning finite-state controllers for partially observable environments. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, pages 427–436, 1999.
[22] T. Minka. A family of algorithms for approximate Bayesian inference. PhD thesis, MIT, 2001.
[23] K. Murphy. Dynamic Bayesian networks: representation, inference and learning. PhD thesis, UC Berkeley, Computer Science Division, 2002.
[24] A. Y. Ng, R. Parr and D. Koller. Policy search via density estimation. In Advances in Neural Information Processing Systems, pages 1022–1028, 1999.
[25] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.
[26] J. Peters and S. Schaal. Reinforcement learning by reward-weighted regression for operational space control. In Proceedings of the International Conference on Machine Learning, 2007.
[27] P. Poupart and C. Boutilier. Bounded finite state controllers. In Advances in Neural Information Processing Systems 16 (NIPS 2003). MIT Press, 2004.
[28] T. Raiko and M. Tornio. Learning nonlinear state-space models for control. In Proceedings of the International Joint Conference on Neural Networks, 2005.
[29] R. D. Shachter. Probabilistic inference and influence diagrams. Operations Research, 36:589–605, 1988.
[30] G. Theocharous, K. Murphy and L. Kaelbling. Representing hierarchical POMDPs as DBNs for multi-scale robot localization. In International Conference on Robotics and Automation, 2004.
[31] M. Toussaint. Lecture note: influence diagrams. http://ml.cs.tu-berlin.de/~mtoussai/notes/, 2009.
[32] M. Toussaint, L. Charlin and P. Poupart. Hierarchical POMDP controller optimization by likelihood maximization. In Uncertainty in Artificial Intelligence, 2008.
[33] M. Toussaint, S. Harmeling and A. Storkey. Probabilistic inference for solving (PO)MDPs. Technical Report EDI-INF-RR-0934, University of Edinburgh, School of Informatics, 2006.
[34] M. Toussaint and A. Storkey. Probabilistic inference for solving discrete and continuous state Markov decision processes. In Proceedings of the 23rd International Conference on Machine Learning, pages 945–952, 2006.
[35] D. Verma and R. P. N. Rao. Goal-based imitation as probabilistic inference over graphical models. In Advances in Neural Information Processing Systems, 2006.
[36] N. Vlassis and M. Toussaint. Model-free reinforcement learning as mixture learning. In Proceedings of the 26th International Conference on Machine Learning, 2009.
[37] L. Zettlemoyer, H. Pasula and L. P. Kaelbling. Learning planning rules in noisy stochastic worlds. In Proceedings of the Twentieth National Conference on Artificial Intelligence, 2005.
