
Introspective Q-learning and learning from demonstration

Published online by Cambridge University Press:  01 January 2019

Mao Li
Affiliation:
Computer Science Department, University of York, Deramore Ln, Heslington, York YO10 5GH, United Kingdom; e-mail: ml1480@york.ac.uk;
Tim Brys
Affiliation:
Computer Science Department, Vrije Universiteit Brussel, Artificial Intelligence Lab, Pleinlaan 9, 3rd floor, 1050 Brussels, Belgium; e-mail: timbrys@vub.ac.be
Daniel Kudenko
Affiliation:
Computer Science Department, University of York, Deramore Ln, Heslington, York YO10 5GH, United Kingdom; JetBrains Research, St Petersburg, Russia; e-mail: daniel.kudenko@york.ac.uk

Abstract

One challenge faced by reinforcement learning (RL) agents is that in many environments the reward signal is sparse, leading to slow improvement of the agent's performance in early learning episodes. Potential-based reward shaping can help resolve this issue by incorporating an expert's domain knowledge into the learning process through a potential function. Past work on reinforcement learning from demonstration (RLfD) directly mapped (sub-optimal) human expert demonstrations to a potential function, which can speed up RL. In this paper we propose an introspective RL agent that speeds up learning significantly further. An introspective RL agent records its state–action decisions and experience during learning in a priority queue. Decisions judged to be of good quality according to a Monte Carlo estimate are kept in the queue, while poorer decisions are rejected. The queue is then used as a demonstration to speed up RL via reward shaping. A human expert's demonstration can be used to initialize the priority queue before learning starts. Experimental validation in the 4-dimensional CartPole domain and the 27-dimensional Super Mario AI domain shows that our approach significantly outperforms both non-introspective RL and state-of-the-art RLfD approaches in both domains.
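For concreteness, the following Python sketch illustrates one way the introspection mechanism described in the abstract could be realised in a simple tabular setting. The class name IntrospectiveQAgent, the bounded priority queue keyed on Monte Carlo returns, the membership-based potential function, and all hyper-parameter values are illustrative assumptions made for this sketch; they are not the paper's exact formulation.

# Sketch of an introspective Q-learning agent (tabular states, discrete actions).
# Episode decisions are scored by their Monte Carlo return; the best ones are kept
# in a bounded priority queue and reused as a potential function for reward shaping.
import heapq
import itertools
import random
from collections import defaultdict

class IntrospectiveQAgent:
    def __init__(self, actions, alpha=0.1, gamma=0.99, epsilon=0.1,
                 queue_size=100, shaping_scale=1.0):
        self.q = defaultdict(float)      # Q(s, a) table, default 0.0
        self.actions = list(actions)
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.queue = []                  # min-heap of (return, tie-breaker, state, action)
        self.queue_size = queue_size
        self.shaping_scale = shaping_scale
        self._tie = itertools.count()    # tie-breaker so states are never compared directly

    def potential(self, state, action):
        # Positive potential for decisions that match a queued (demonstration) decision.
        return self.shaping_scale if any(
            qs == state and qa == action for _, _, qs, qa in self.queue) else 0.0

    def act(self, state):
        # Epsilon-greedy action selection over the current Q-values.
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def update(self, s, a, r, s2):
        # Q-learning update with potential-based look-ahead advice on (state, action)
        # pairs; the greedy next action is used for the look-ahead term in this sketch.
        a2 = max(self.actions, key=lambda b: self.q[(s2, b)])
        f = self.gamma * self.potential(s2, a2) - self.potential(s, a)
        self.q[(s, a)] += self.alpha * (
            r + f + self.gamma * self.q[(s2, a2)] - self.q[(s, a)])

    def introspect(self, episode):
        # episode: list of (state, action, reward) tuples in the order they occurred.
        # Score every decision by its Monte Carlo return and keep only the best ones.
        # Calling this on a human expert's trajectory before training seeds the queue.
        g = 0.0
        for s, a, r in reversed(episode):
            g = r + self.gamma * g
            entry = (g, next(self._tie), s, a)
            if len(self.queue) < self.queue_size:
                heapq.heappush(self.queue, entry)
            elif g > self.queue[0][0]:
                heapq.heapreplace(self.queue, entry)

After each episode the agent would call introspect on the trajectory it just generated, so that its own good decisions gradually replace weaker entries in the queue. The lookup-based potential used here is the simplest possible choice; continuous or high-dimensional domains such as CartPole or Super Mario would instead require a similarity-based potential over state features.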

Type
Adaptive and Learning Agents
Copyright
© Cambridge University Press, 2019 
