Research on human and animal behaviour has long emphasised its hierarchical structure – the divisibility of ongoing behaviour into discrete tasks, which are composed of subtask sequences, which in turn are built from simple actions. The hierarchical structure of behaviour has also been of enduring interest within neuroscience, where it has been widely considered to reflect prefrontal cortical functions. In this chapter, we re-examine behavioural hierarchy and its neural substrates from the point of view of recent developments in computational reinforcement learning. Specifically, we consider a set of approaches known collectively as hierarchical reinforcement learning, which extend the reinforcement learning paradigm by allowing the learning agent to aggregate actions into reusable subroutines or skills. A close look at the components of hierarchical reinforcement learning suggests how they might map onto neural structures, in particular regions within the dorsolateral and orbital prefrontal cortex. It also suggests specific ways in which hierarchical reinforcement learning might provide a complement to existing psychological models of hierarchically structured behaviour. A particularly important question that hierarchical reinforcement learning brings to the fore is that of how learning identifies new action routines that are likely to provide useful building blocks in solving a wide range of future problems. Here and at many other points, hierarchical reinforcement learning offers an appealing framework for investigating the computational and neural underpinnings of hierarchically structured behaviour.
In recent years, it has become increasingly common within both psychology and neuroscience to explore the applicability of ideas from machine learning. Indeed, one can now cite numerous instances where this strategy has been fruitful. Arguably, however, no area of machine learning has had as profound and sustained an impact on psychology and neuroscience as that of computational reinforcement learning (RL). The impact of RL was initially felt in research on classical and instrumental conditioning (Barto and Sutton, 1981; Sutton and Barto, 1990; Wickens et al., 1995). Soon thereafter, its impact extended to research on midbrain dopaminergic function, where the temporal-difference learning paradigm provided a framework for interpreting temporal profiles of dopaminergic activity (Barto, 1995; Houk et al., 1995; Montague et al., 1996; Schultz et al., 1997). Subsequently, actor–critic architectures for RL have inspired new interpretations of functional divisions of labour within the basal ganglia and cerebral cortex (see Joel et al., 2002, for a review), and RL-based accounts have been advanced to address issues as diverse as motor control (e.g., Miyamoto et al., 2004), working memory (e.g., O’Reilly and Frank, 2006), performance monitoring (e.g., Holroyd and Coles, 2002), and the distinction between habitual and goal-directed behaviour (e.g., Daw et al., 2005).