
Reward shaping for knowledge-based multi-objective multi-agent reinforcement learning

Published online by Cambridge University Press: 04 December 2018

Patrick Mannion
Affiliation:
Department of Computer Science & Applied Physics, Galway-Mayo Institute of Technology, Dublin Road, Galway, H91 T8NW, Ireland; e-mail: patrick.mannion@gmit.ie
Sam Devlin
Affiliation:
Microsoft Research, 21 Station Road, Cambridge CB1 2FB, United Kingdom; e-mail: sam.devlin@microsoft.com
Jim Duggan
Affiliation:
Discipline of Information Technology, National University of Ireland Galway, Galway, H91 TK33, Ireland; e-mail: jim.duggan@nuigalway.ie
Enda Howley
Affiliation:
Discipline of Information Technology, National University of Ireland Galway, Galway, H91 TK33, Ireland; e-mail: enda.howley@nuigalway.ie

Abstract

The majority of multi-agent reinforcement learning (MARL) implementations aim to optimize systems with respect to a single objective, even though many real-world problems are inherently multi-objective. Research into multi-objective MARL is still in its infancy, and few studies to date have dealt with the issue of credit assignment. Reward shaping has been proposed as a means to address the credit assignment problem in single-objective MARL; however, it has been shown to alter the intended goals of a domain if misused, leading to unintended behaviour. Two popular shaping methods are potential-based reward shaping and difference rewards, and both have been repeatedly shown to improve learning speed and the quality of joint policies learned by agents in single-objective MARL domains. This work discusses the theoretical implications of applying these shaping approaches to cooperative multi-objective MARL problems and evaluates their efficacy using two benchmark domains. Our results constitute the first empirical evidence that agents using these shaping methodologies can sample true Pareto optimal solutions in cooperative multi-objective stochastic games.
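As background for readers unfamiliar with the two shaping methods named above, the standard single-objective formulations (following Ng et al. (1999) for potential-based reward shaping and Wolpert & Tumer for difference rewards; these definitions are general background, not quoted from this article) can be written as:

F(s, s') = \gamma\,\Phi(s') - \Phi(s)

D_i(z) = G(z) - G(z_{-i})

Here the agent learns from the shaped signal r + F, where \Phi is a potential function over states and \gamma is the discount factor; G(z) is the global evaluation of the joint state-action z, and z_{-i} denotes z with agent i's contribution removed or replaced by a default action.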

Type
Special Issue Contribution
Copyright
© Cambridge University Press, 2018 

