
Reward shaping for knowledge-based multi-objective multi-agent reinforcement learning

Published online by Cambridge University Press: 04 December 2018

Patrick Mannion
Affiliation:
Department of Computer Science & Applied Physics, Galway-Mayo Institute of Technology, Dublin Road, Galway, H91 T8NW, Ireland; e-mail: patrick.mannion@gmit.ie
Sam Devlin
Affiliation:
Microsoft Research, 21 Station Road, Cambridge CB1 2FB, United Kingdom; e-mail: sam.devlin@microsoft.com
Jim Duggan
Affiliation:
Discipline of Information Technology, National University of Ireland Galway, Galway, H91 TK33, Ireland; e-mail: jim.duggan@nuigalway.ie
Enda Howley
Affiliation:
Discipline of Information Technology, National University of Ireland Galway, Galway, H91 TK33, Ireland; e-mail: enda.howley@nuigalway.ie

Abstract

The majority of multi-agent reinforcement learning (MARL) implementations aim to optimize systems with respect to a single objective, even though many real-world problems are inherently multi-objective. Research into multi-objective MARL is still in its infancy, and few studies to date have dealt with the issue of credit assignment. Reward shaping has been proposed as a means to address the credit assignment problem in single-objective MARL; however, it has been shown to alter the intended goals of a domain if misused, leading to unintended behaviour. Two popular shaping methods are potential-based reward shaping and difference rewards, and both have been repeatedly shown to improve learning speed and the quality of joint policies learned by agents in single-objective MARL domains. This work discusses the theoretical implications of applying these shaping approaches to cooperative multi-objective MARL problems and evaluates their efficacy using two benchmark domains. Our results constitute the first empirical evidence that agents using these shaping methodologies can sample true Pareto optimal solutions in cooperative multi-objective stochastic games.
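As background for readers unfamiliar with the two shaping methods named above, the standard single-objective formulations (following Ng et al. (1999) for potential-based reward shaping and Wolpert & Tumer for difference rewards; these definitions are general background, not quoted from this article) can be written as:

F(s, s') = \gamma\,\Phi(s') - \Phi(s)

D_i(z) = G(z) - G(z_{-i})

Here the agent learns from the shaped signal r + F, where \Phi is a potential function over states and \gamma is the discount factor; G(z) is the global evaluation of the joint state-action z, and z_{-i} denotes z with agent i's contribution removed or replaced by a default action.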

Type
Special Issue Contribution
Copyright
© Cambridge University Press, 2018 

