The structure and development of explore-exploit decision making

People often face situations that require them to choose between a familiar option with a known value and a new option with an unknown (but perhaps advantageous) value. Examples of this type of decision problem include selecting a familiar meal at the cafeteria versus trying a new food; staying with a familiar peer group versus pursuing a different social opportunity; or sticking with a current job versus making an employer or career change. These decisions involve tradeoffs, and the optimal choice is often not clear at the time a decision is made. Familiar options might afford less stress and anxiety and avoid an outcome that is worse than the status quo. However, staying with the familiar may prevent a person from discovering and learning new information about the world. Acquisition of new information is particularly important in childhood and adolescence.

For this reason, the development of healthy decision-making involves flexibly navigating between exploration and exploitation, depending on the context and relative risks involved (Hills et al., 2015, Mekern et al., 2019, Mehlhorn et al., 2015, Schulz and Gershman, 2019, Wilson et al., 2021). Despite an emerging literature examining how humans learn to manage these decisions (Addicott et al., 2017, Giron et al., 2022, Gopnik, 2020, Meder et al., 2021, Schulz et al., 2019, Somerville et al., 2017, Wilson et al., 2014), little is currently understood about the specific cognitive processes underlying these behaviors or the extent to which these processes change across development. Here, we identify processes that contribute to explore-exploit decision making using multiple common measures in the field and examine whether these components change between early adolescence and adulthood.

We define exploration as seeking new information, and exploitation as utilizing existing knowledge at the expense of learning something new. Both processes can be deployed to seek rewards; however, these fundamental motivations are often, though not always, opposed to one another. Laboratory paradigms designed to measure explore-exploit decision-making are similar in that they create scenarios where an individual must choose to either explore or exploit on a given trial. But, as shown in Fig. 1, these paradigms vary in their emphasis on factors such as working memory, cognitive flexibility, learning from previous outcomes, and uncertainty tolerance (Gershman, 2018, van den Bos and Hertwig, 2017, von Helversen et al., 2018).

Two related components of exploration/exploitation involve learning about which aspects of an environment will be rewarded and then keeping track of the likelihood and magnitude of those various rewards. Methods that tap these aspects of explore-exploit decision making are called bandit tasks (Daw et al., 2006, Wilson et al., 2014); see Fig. 1D. In this type of task, individuals choose between several “bandits” (e.g., slot machines) that vary in the rewards they pay out. Individuals learn through exploration which bandits seem most profitable, allowing them to maximize their rewards. In bandit tasks, explore and exploit decisions probably require similar effort, since each subsequent choice could reveal new key information that should affect the next decision, making the tasks less susceptible to response sets. However, these tasks also rely heavily upon working memory because an individual must keep track of both the amount of information they have gathered about each bandit and the magnitude of rewards received, while ignoring irrelevant information (Brown, Hallquist, Frank, & Dombrovski, 2022). Dimensions such as uncertainty can be manipulated in these paradigms by making bandit payoffs more or less variable, and by providing more or less information before participants make a choice (Gershman, 2018).
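The bookkeeping that a bandit task demands can be illustrated with a minimal sketch. It is an illustration only, not the task or model used in any of the cited studies: the class name BanditTracker, the function simulate, the Gaussian payouts, and the purely random sampling policy are all assumptions made for the example. The sketch shows the two quantities a participant would need to hold in mind for each bandit: how many times it has been sampled and the average reward it has paid out.

```python
import random


class BanditTracker:
    """Hypothetical record of what a participant must remember per bandit."""

    def __init__(self, n_bandits):
        self.counts = [0] * n_bandits    # amount of information gathered
        self.means = [0.0] * n_bandits   # running average of rewards received

    def update(self, bandit, reward):
        """Fold one observed reward into the running average for a bandit."""
        self.counts[bandit] += 1
        self.means[bandit] += (reward - self.means[bandit]) / self.counts[bandit]


def simulate(payout_means, payout_sd, n_trials, seed=0):
    """Sample bandits uniformly at random (an assumed policy) and track outcomes.

    payout_sd controls how variable the payoffs are, which is one way
    uncertainty could be manipulated in such a task.
    """
    rng = random.Random(seed)
    tracker = BanditTracker(len(payout_means))
    for _ in range(n_trials):
        bandit = rng.randrange(len(payout_means))
        reward = rng.gauss(payout_means[bandit], payout_sd)
        tracker.update(bandit, reward)
    return tracker


if __name__ == "__main__":
    result = simulate(payout_means=[1.0, 5.0], payout_sd=2.0, n_trials=100)
    print(result.counts, result.means)
```

Raising payout_sd in this sketch makes the running averages less reliable after the same number of samples, which is the sense in which more variable payoffs increase uncertainty.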

The ability to generalize from prior experiences is another component of decisions about whether to explore. This type of learning is captured in sequential choice tasks (Dale et al., 2018, von Helversen et al., 2018); see Fig. 1A/B. Here, participants explore options with varying rewards until they find one with a sufficiently high payoff that leads them to choose to exploit this option for the rest of the task (some tasks allow an individual to go back to a previous option; others do not). Typically, participants are given minimal information about the highest payout available. Therefore, these types of tasks require that individuals generalize from previous outcomes to determine an expected range or distribution of reward outcomes (Dale et al., 2018). Thus, the nature of prior expectations regarding the task environment influences how individuals behave across these situations (von Helversen et al., 2018).

Patch foraging tasks require individuals to gather resources in various “patches” (Constantino and Daw, 2022, Lenow, Constantino, Daw, & Phelps, 2017); see Fig. 1C. These tasks may place a higher demand on cognitive flexibility than others because, on a trial-by-trial basis, individuals need to flexibly navigate the environment by switching between exploration and exploitation strategies (Hills & Dukas, 2012). They do so by choosing to continue exploiting their current location or deciding to leave in order to explore a new location. This paradigm introduces the tension between exploring and exploiting in two ways. First, the value of a current patch diminishes as it is exploited: for example, in a task that emulates foraging in an apple orchard, continuing to pick apples from the same tree results in fewer apples in that tree available for subsequent picking. Second, there is a cost in time associated with moving to a new patch because no apples can be picked while searching for a new tree. Typically, the total duration of the game is fixed; thus, to maximize reward within a finite amount of time, one needs to increase the harvest per time unit. The harvest per time unit will drop if one either switches too often or stays with one patch for too long. As such, one must strike a balance between staying and switching. Importantly, this ‘sweet spot’ is dependent on one’s estimate of the average reward rate in the given environment: one should increase switching in a generous environment and reduce switching in a scarcer environment. So, the explore-exploit trade-off in this foraging task is not just about information seeking, but also about flexibly adjusting one’s decisions based on the estimate of “richness” of the environment and avoiding getting “stuck” in an exploitation pattern (i.e., staying at the same location too long). A final feature of this type of task structure is that the least effortful behavior in a foraging environment is to exploit the current patch, whereas a decision to move to a new patch may require greater effort or motivation (see Fig. 1).
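The stay-or-switch balance described above can be made concrete with a small worked sketch. It is a toy model under stated assumptions, not the foraging task used in the cited studies: each harvest takes one time unit, the yield of a tree shrinks by a fixed proportion (decay) after every harvest, travelling to a new tree costs travel_time units with no reward, and environment richness is represented only by how short the travel time is. The function names and all parameter values are hypothetical.

```python
def harvest_rate(n_harvests, start_yield, decay, travel_time):
    """Average reward per time unit when leaving each tree after n_harvests."""
    total_reward = 0.0
    current_yield = start_yield
    for _ in range(n_harvests):
        total_reward += current_yield
        current_yield *= decay  # the tree is depleted a little with each harvest
    # One cycle = time spent harvesting plus time spent travelling to a new tree.
    return total_reward / (n_harvests + travel_time)


def best_leaving_point(start_yield, decay, travel_time, max_harvests=50):
    """Number of harvests per tree that maximizes reward per time unit."""
    return max(
        range(1, max_harvests + 1),
        key=lambda k: harvest_rate(k, start_yield, decay, travel_time),
    )


if __name__ == "__main__":
    # Assumed values: a "generous" environment has nearby trees (short travel),
    # a "scarcer" one has distant trees (long travel).
    for label, travel in [("generous", 2), ("scarce", 8)]:
        k = best_leaving_point(start_yield=10.0, decay=0.8, travel_time=travel)
        print(label, k, round(harvest_rate(k, 10.0, 0.8, travel), 2))
```

In this toy model, leaving after a single harvest and staying for many harvests both give a lower rate than the intermediate optimum, and the optimum shifts toward earlier leaving when travel is cheap, which matches the qualitative pattern described in the text.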

Although these various approaches all purport to measure the same construct, they differentially tap processes that may contribute to exploration. Working memory, prior expectations, and cognitive flexibility likely play some role in each of these tasks, but the degree to which each is relevant differs based on task structure. Here, we harness these task differences to examine the structure and development of exploration.

Evolutionary theories propose that childhood is a period of learning about the world via exploration (Gopnik, 2020). Indeed, numerous studies show that young children explore more than adults during computerized explore-exploit tasks (Blanco and Sloutsky, 2021, Giron et al., 2022, Schulz et al., 2019, Schulz and Gershman, 2019, Sumner et al., 2019). Moreover, the complexity and efficiency of children’s exploration increases from early childhood to early adolescence (Pelz & Kidd, 2020), and exploration becomes less random and more directed towards reducing uncertainty from age four to nine (Meder et al., 2021). However, less is known about developmental change in explore-exploit decision-making between early adolescence and adulthood. Using a paradigm (Horizon task; Wilson et al., 2017) that could mathematically separate explore-exploit decisions into two components, random exploration (gathering information by chance) and directed exploration (intentional exploring to reduce uncertainty), Somerville et al. (2017) found that early adolescents and adults were equally likely to engage in random exploration. However, younger individuals engaged in less strategic directed exploration to intentionally reduce uncertainty, as compared with older adolescents and adults. Lower directed exploration among the youngest age group was partially explained by a preference for immediate reward over information gathering. Contrary to other work suggesting that younger children explore more than adults, this study suggests that adolescents might explore less than adults (in some contexts) because of heightened reward drive and/or impulsivity. However, these conclusions are limited because they are derived from just a single task with a particular structure. Here, we investigated the concepts of random and directed exploration across multiple task structures.
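One way to see how random and directed exploration could be separated mathematically is a simple two-option choice rule. The sketch below is a generic illustration and is not claimed to be the model fit in the cited studies: the function name p_choose_a and the parameters info_bonus and noise are assumptions introduced here. Directed exploration is represented as a bonus added to the more uncertain option, and random exploration as decision noise that flattens choice probabilities regardless of value.

```python
import math


def p_choose_a(value_a, value_b, info_diff, info_bonus, noise):
    """Probability of choosing option A over option B (hypothetical rule).

    info_diff:  +1 if A is the less-sampled (more uncertain) option,
                -1 if B is, and 0 if both are equally well known.
    info_bonus: strength of directed exploration (value added for uncertainty).
    noise:      strength of random exploration (must be positive); larger
                values push the probability toward 0.5.
    """
    drive = (value_a - value_b) + info_bonus * info_diff
    return 1.0 / (1.0 + math.exp(-drive / noise))


if __name__ == "__main__":
    # A is worth less on average but is the more uncertain option.
    print(p_choose_a(4.0, 5.0, info_diff=1, info_bonus=0.0, noise=1.0))
    print(p_choose_a(4.0, 5.0, info_diff=1, info_bonus=3.0, noise=1.0))
    print(p_choose_a(4.0, 5.0, info_diff=1, info_bonus=0.0, noise=10.0))
```

Under this rule, both a larger info_bonus and a larger noise increase choices of the lower-valued, uncertain option, but only the bonus does so selectively for the uncertain option. This is why a task needs to vary information and value independently to tell the two apart.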

At the same time, there is converging evidence that adolescence is a developmental period characterized by qualitative changes in decision-making strategies (Hartley and Somerville, 2015, Shulman et al., 2016). Heightened risk-taking in adolescence often occurs in circumstances where the probability of positive versus negative outcomes is unknown. Adolescents appear to be comfortable with taking risks when they perceive outcomes as highly uncertain (Tymula et al., 2012). This tendency has been reflected in lower levels of sampling and information search prior to a decision during a bandit task among adolescents, relative to children and adults (van den Bos & Hertwig, 2017). In addition, children and adolescents tend not to perform as well as adults during probabilistic reward tasks, such as the Iowa Gambling Task, which, like bandit tasks, require information search combined with weighing potentials for risk and reward (Almy, Kuskowski, Malone, Myers, & Luciana, 2018, Cassoti et al., 2014). Inferring from developmental trends in these similar tasks, we might expect adolescents to show less information-driven exploration than adults in contexts such as the highly structured bandit task used by Somerville et al. (2017), but, similar to patterns seen in younger children, more exploration in ambiguous environments where an optimal strategy is less clear. Conceptual differences between random and directed exploration will be further explained in the next section.

The current study is the first to address two broad issues about exploratory behavior. First, we examined the structure of explore-exploit decision-making by contrasting prominent paradigms in the extant literature that differentially rely on a variety of cognitive processes. We hypothesized that explore-exploit decisions would comprise two components that vary depending upon the amount of information available to the learner. The first would resemble a noisy or random form of exploration that emerges when information availability is low. An individual engaging in random exploration explores not with a specific goal in mind, but because it is more interesting or engaging to sample something new. We hypothesized that this type of exploration would be reflected in performance on sequential decision-making tasks where an individual’s behavior is dependent upon expectations about the environment, novelty seeking, and tendencies to generalize from previous experience.

The second component, directed exploration, would reflect a more intentional, goal-directed gathering of information that we predicted would be reflected in the patch foraging context. As with other task structures, the goal in this task is to obtain the highest total reward possible. This form of exploration relies upon cognitive flexibility, as an individual’s hypotheses about reward availability must be constantly updated and revised in light of higher degrees of available information (richness of the environment, depletion rate, switch costs, etc.). Consistent with the findings of Wilson et al. (2014), we reasoned that decisions in a bandit task would contain elements of both random and directed exploration due to a moderate degree of uncertainty, but also substantial available information.

The second issue we tested concerned developmental differences in exploration between early adolescence (ages 10–13) and young adulthood. We focused on early adolescence because this period involves increasing exposure to novel social environments (e.g., transitioning from elementary to middle school; forming new peer groups) that afford opportunities to explore. Because executive processes like working memory and cognitive flexibility undergo substantial development from early adolescence to early adulthood (Ferguson, Brunsdon & Bradford, 2021), and previous work shows different exploration strategies in adolescents vs. adults, we hypothesized that the structure (sub-components) of exploration described above could change with age. For example, prior research shows that some cognitive constructs (e.g., sub-components of executive function) change in their level of differentiation with development (Howard, Okely, & Ellis, 2015). Finally, we predicted that age-related differences in exploration would depend on the task environment: early adolescents would engage in less directed exploration during a bandit task than adults, replicating Somerville et al. (2017), but younger individuals would engage in more exploration during sequential decision-making tasks that involve high uncertainty (and lend themselves to random exploration). Because exploration is tied to learning, we reasoned that the information gathered from our more comprehensive approach to studying developmental differences in explore-exploit decision making would have implications for the types of environments where adolescents can learn most efficiently.
