
Chapter 7 - Exploration Beyond Bandits

from Part II - How Do Humans Search for Information?

Published online by Cambridge University Press: 19 May 2022

Irene Cogliati Dezza (University College London)
Eric Schulz (Max-Planck-Institut für biologische Kybernetik, Tübingen)
Charley M. Wu (Eberhard-Karls-Universität Tübingen, Germany)

Summary

The ability to seek out new information is crucial in many situations of our everyday lives, and people can display quite elaborate exploration behavior. However, exploration has mainly been studied in multiarmed bandit tasks, and theories have predominantly focused on simple directed and random exploration strategies. In this chapter, we review the results of prior studies and argue that the repertoire of human exploration strategies is far more diverse than the literature portrays. To find evidence for more sophisticated strategies, however, paradigms more complex than multiarmed bandits are required. In particular, we argue that Markov Decision Processes offer an interesting new setting that allows us to capture strategies beyond random and directed exploration, such as empowerment-based strategies or strategies that explore using explicit goals. We conclude the chapter by discussing several new experimental paradigms that could advance our understanding of human exploration to the next level.
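To make the distinction between the two strategies named in the summary concrete, here is a minimal sketch (not from the chapter) of directed versus random exploration on a Bernoulli multiarmed bandit: directed exploration adds an uncertainty bonus to each arm's estimated value (UCB-style), while random exploration injects choice noise via a softmax over value estimates. All names and parameter values (e.g. `beta`, `tau`, the reward probabilities) are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch: directed (uncertainty-bonus) vs. random (softmax) exploration
# on a Bernoulli multiarmed bandit. Parameters are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
true_p = np.array([0.3, 0.5, 0.7])       # unknown reward probabilities per arm
n_arms, n_trials = len(true_p), 500

def run(strategy, beta=2.0, tau=0.2):
    counts = np.ones(n_arms)              # pseudo-count prior (avoids divide-by-zero)
    values = np.full(n_arms, 0.5)         # estimated reward probability per arm
    total = 0.0
    for t in range(1, n_trials + 1):
        if strategy == "directed":        # value plus uncertainty bonus (UCB-style)
            bonus = np.sqrt(np.log(t) / counts)
            arm = int(np.argmax(values + beta * bonus))
        else:                             # random: softmax choice noise over values
            logits = values / tau
            probs = np.exp(logits - logits.max())
            probs /= probs.sum()
            arm = int(rng.choice(n_arms, p=probs))
        reward = float(rng.random() < true_p[arm])
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]   # incremental mean update
        total += reward
    return total / n_trials               # average reward obtained

for s in ("directed", "random"):
    print(s, round(run(s), 3))
```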

Type: Chapter
In: The Drive for Knowledge: The Science of Human Information Seeking, pp. 147–168
Publisher: Cambridge University Press
Print publication year: 2022


