
Human–agent transfer from observations

Published online by Cambridge University Press:  27 November 2020

Bikramjit Banerjee
Affiliation: The University of Southern Mississippi, 118 College Drive #5106, Hattiesburg, MS 39406, USA
E-mail: Bikramjit.Banerjee@usm.edu

Sneha Racharla
Affiliation: The University of Southern Mississippi, 118 College Drive #5106, Hattiesburg, MS 39406, USA
E-mail: Sneha.Racharla@usm.edu

Abstract

Learning from human demonstration (LfD) is one of several techniques for speeding up reinforcement learning (RL) and has seen many successful applications. We consider one LfD technique called human–agent transfer (HAT), in which a model of the human demonstrator's decision function is induced via supervised learning and used as an initial bias for RL. Recent work in LfD has investigated learning from observations only, that is, when only the demonstrator's states (and not its actions) are available to the learner. Since HAT treats the demonstrator's actions as labels, supervised learning becomes untenable in their absence. We adapt the idea of learning an inverse dynamics model from data acquired through the learner's own interactions with the environment, and use this model to fill in the demonstrator's missing actions. The resulting version of HAT, called state-only HAT (SoHAT), is experimentally shown to preserve some of HAT's advantages in benchmark domains with both discrete and continuous actions. This paper also establishes principled modifications of an existing baseline algorithm, asynchronous advantage actor-critic (A3C), to create the HAT and SoHAT variants used in our experiments.
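To illustrate the mechanism the abstract describes, the following is a minimal sketch of the inverse-dynamics step: a model is fit on (state, action, next state) tuples gathered by the learner itself and then applied to consecutive demonstrator states to infer the missing action labels. It assumes a discrete-action domain and a PyTorch-style setup; the class names, network sizes, and training loop are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn


class InverseDynamicsModel(nn.Module):
    """Predicts which action caused the transition from s to s_next."""

    def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, s: torch.Tensor, s_next: torch.Tensor) -> torch.Tensor:
        # Concatenate current and next state; output action logits.
        return self.net(torch.cat([s, s_next], dim=-1))


def train_inverse_dynamics(model, transitions, epochs=10, lr=1e-3):
    """Fit the model on (s, a, s_next) batches collected by the learner's
    own interaction with the environment (a is a LongTensor of action ids)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for s, a, s_next in transitions:
            opt.zero_grad()
            loss = loss_fn(model(s, s_next), a)
            loss.backward()
            opt.step()
    return model


def infer_demo_actions(model, demo_states: torch.Tensor) -> torch.Tensor:
    """Label each consecutive pair of demonstrator states with the most
    likely action under the learned inverse dynamics model."""
    s, s_next = demo_states[:-1], demo_states[1:]
    with torch.no_grad():
        logits = model(s, s_next)
    return logits.argmax(dim=-1)  # inferred action labels
```

The inferred labels can then be used to pre-train a policy model by supervised learning, exactly where HAT would have used the demonstrator's true actions, before RL fine-tuning proceeds as usual.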

Type: Research Article
Copyright: © The Author(s), 2020. Published by Cambridge University Press

