
Chapter 7 - Exploration Beyond Bandits

from Part II - How Do Humans Search for Information?

Published online by Cambridge University Press: 19 May 2022

Irene Cogliati Dezza (University College London)
Eric Schulz (Max-Planck-Institut für biologische Kybernetik, Tübingen)
Charley M. Wu (Eberhard-Karls-Universität Tübingen, Germany)

Summary

The ability to seek out new information is crucial in many situations of our everyday lives, and people can display quite elaborate exploration behavior. However, exploration has mainly been studied in multiarmed bandit tasks, and theories have predominantly focused on simple directed and random exploration strategies. In this chapter, we review the results of prior studies and argue that the repertoire of human exploration strategies is far more diverse than the literature portrays. To find evidence for more sophisticated strategies, however, paradigms more complex than multiarmed bandits are required. In particular, we argue that Markov Decision Processes offer an interesting new setting that allows us to capture strategies beyond random and directed exploration, such as empowerment-based strategies or strategies that explore using explicit goals. We conclude the chapter by discussing several new experimental paradigms that could advance our understanding of human exploration to the next level.
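To make the distinction between the two strategies named in the summary concrete, here is a minimal sketch (not from the chapter) of directed versus random exploration on a Bernoulli multiarmed bandit: directed exploration adds an uncertainty bonus to each arm's estimated value (UCB-style), while random exploration injects choice noise via a softmax over value estimates. All names and parameter values (e.g. `beta`, `tau`, the reward probabilities) are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch: directed (uncertainty-bonus) vs. random (softmax) exploration
# on a Bernoulli multiarmed bandit. Parameters are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
true_p = np.array([0.3, 0.5, 0.7])       # unknown reward probabilities per arm
n_arms, n_trials = len(true_p), 500

def run(strategy, beta=2.0, tau=0.2):
    counts = np.ones(n_arms)              # pseudo-count prior (avoids divide-by-zero)
    values = np.full(n_arms, 0.5)         # estimated reward probability per arm
    total = 0.0
    for t in range(1, n_trials + 1):
        if strategy == "directed":        # value plus uncertainty bonus (UCB-style)
            bonus = np.sqrt(np.log(t) / counts)
            arm = int(np.argmax(values + beta * bonus))
        else:                             # random: softmax choice noise over values
            logits = values / tau
            probs = np.exp(logits - logits.max())
            probs /= probs.sum()
            arm = int(rng.choice(n_arms, p=probs))
        reward = float(rng.random() < true_p[arm])
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]   # incremental mean update
        total += reward
    return total / n_trials               # average reward obtained

for s in ("directed", "random"):
    print(s, round(run(s), 3))
```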

Type: Chapter
In: The Drive for Knowledge: The Science of Human Information Seeking, pp. 147–168
Publisher: Cambridge University Press
Print publication year: 2022


