
Introspective Q-learning and learning from demonstration

Published online by Cambridge University Press:  01 January 2019

Mao Li
Affiliation:
Computer Science Department, University of York, Deramore Ln, Heslington, York YO10 5GH, United Kingdom; e-mail: ml1480@york.ac.uk;
Tim Brys
Affiliation:
Computer Science Department, Vrije Universiteit Brussel, Artificial Intelligence Lab, Pleinlaan 9, 3rd floor, 1050 Brussels, Belgium; e-mail: timbrys@vub.ac.be
Daniel Kudenko
Affiliation:
Computer Science Department, University of York, Deramore Ln, Heslington, York YO10 5GH, United Kingdom; JetBrains Research, St Petersburg, Russia; e-mail: daniel.kudenko@york.ac.uk

Abstract

One challenge faced by reinforcement learning (RL) agents is that in many environments the reward signal is sparse, leading to slow improvement of the agent's performance in early learning episodes. Potential-based reward shaping can help resolve this issue by incorporating an expert's domain knowledge into the learning process through a potential function. Past work on reinforcement learning from demonstration (RLfD) directly mapped (sub-optimal) human expert demonstrations to a potential function, which can speed up RL. In this paper we propose an introspective RL agent that speeds up learning significantly further. An introspective RL agent records its state–action decisions and experience during learning in a priority queue. Decisions judged to be of good quality according to a Monte Carlo estimate are kept in the queue, while poorer decisions are rejected. The queue is then used as a demonstration to speed up RL via reward shaping. A human expert's demonstration can be used to initialize the priority queue before learning starts. Experimental validation in the 4-dimensional CartPole domain and the 27-dimensional Super Mario AI domain shows that our approach significantly outperforms both non-introspective RL and state-of-the-art RLfD approaches in both domains.
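For concreteness, the following Python sketch illustrates one way the introspection mechanism described in the abstract could be realised in a simple tabular setting. The class name IntrospectiveQAgent, the bounded priority queue keyed on Monte Carlo returns, the membership-based potential function, and all hyper-parameter values are illustrative assumptions made for this sketch; they are not the paper's exact formulation.

# Sketch of an introspective Q-learning agent (tabular states, discrete actions).
# Episode decisions are scored by their Monte Carlo return; the best ones are kept
# in a bounded priority queue and reused as a potential function for reward shaping.
import heapq
import itertools
import random
from collections import defaultdict

class IntrospectiveQAgent:
    def __init__(self, actions, alpha=0.1, gamma=0.99, epsilon=0.1,
                 queue_size=100, shaping_scale=1.0):
        self.q = defaultdict(float)      # Q(s, a) table, default 0.0
        self.actions = list(actions)
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.queue = []                  # min-heap of (return, tie-breaker, state, action)
        self.queue_size = queue_size
        self.shaping_scale = shaping_scale
        self._tie = itertools.count()    # tie-breaker so states are never compared directly

    def potential(self, state, action):
        # Positive potential for decisions that match a queued (demonstration) decision.
        return self.shaping_scale if any(
            qs == state and qa == action for _, _, qs, qa in self.queue) else 0.0

    def act(self, state):
        # Epsilon-greedy action selection over the current Q-values.
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def update(self, s, a, r, s2):
        # Q-learning update with potential-based look-ahead advice on (state, action)
        # pairs; the greedy next action is used for the look-ahead term in this sketch.
        a2 = max(self.actions, key=lambda b: self.q[(s2, b)])
        f = self.gamma * self.potential(s2, a2) - self.potential(s, a)
        self.q[(s, a)] += self.alpha * (
            r + f + self.gamma * self.q[(s2, a2)] - self.q[(s, a)])

    def introspect(self, episode):
        # episode: list of (state, action, reward) tuples in the order they occurred.
        # Score every decision by its Monte Carlo return and keep only the best ones.
        # Calling this on a human expert's trajectory before training seeds the queue.
        g = 0.0
        for s, a, r in reversed(episode):
            g = r + self.gamma * g
            entry = (g, next(self._tie), s, a)
            if len(self.queue) < self.queue_size:
                heapq.heappush(self.queue, entry)
            elif g > self.queue[0][0]:
                heapq.heapreplace(self.queue, entry)

After each episode the agent would call introspect on the trajectory it just generated, so that its own good decisions gradually replace weaker entries in the queue. The lookup-based potential used here is the simplest possible choice; continuous or high-dimensional domains such as CartPole or Super Mario would instead require a similarity-based potential over state features.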

Type
Adaptive and Learning Agents
Copyright
© Cambridge University Press, 2019 
