Never mind the repeat: How speech expectations reduce tracking at the cocktail party

Social living requires communicating and understanding. Soundscapes carry background noise, yet listeners often retain the ability to choose and follow target speakers. Selective attention operates on this “cocktail party” scenario (Cherry, 1953), where the peripheral representation of mixed sound is incrementally shaped for the brain to single out and parse its target (Har-shai Yahav & Zion Golumbic, 2021; Wood & Cowan, 1995). Separating the target from the rest of the scene in a behaviourally meaningful way requires that parallel neural processes feed back upon one another, and part of this balance relies on the observer’s experience and prior expectations (Shinn-Cunningham, 2008; Shinn-Cunningham et al., 2017). For example, if we have to select and understand somebody amid other talkers, knowing their particular voice well will likely help (Domingo et al., 2020). Voice familiarity may also prove advantageous when we must ignore the same person’s speech (Johnsrude et al., 2013; Newman & Evers, 2007). Foreknowledge about speech content is another type of behaviourally relevant experience in the cocktail party (Bhandari et al., 2021; Dekerle et al., 2014; Park et al., 2023). Indeed, exact or full prior knowledge about a target represents the ground truth condition for evaluating how experience biases the balance of neural processing, which favors segregation amid maskers (Wang et al., 2021). Yet the precise stage at which such changes first occur in the brain remains unresolved.

The representations of a speech stream that emerge across cortical networks during listening may be investigated using temporal response function (TRF) methods in combination with recordings of high temporal precision (Bednar & Lalor, 2020; Broderick et al., 2019; Di Liberto et al., 2015; Har-shai Yahav & Zion Golumbic, 2021; Wöstmann et al., 2019; Zion Golumbic et al., 2013). The TRF technique delineates how much and when the brain is likely to engage with specific acoustical characteristics that are present in the masked speech signal (Ding & Simon, 2012; Ding & Simon, 2012; Fiedler et al., 2019; Haykin & Chen, 2005; Kidd et al., 2016; O’Sullivan et al., 2015; Power et al., 2012). For instance, it is used to describe the ability of auditory responses to align with temporal regularities found in the speech envelope, a phenomenon known as ‘phase coding’ or ‘speech tracking’ (Obleser & Kayser, 2019; Zion Golumbic et al., 2013). This stimulus-locked response is often organized into a triphasic neural entrainment pattern whose individual stages typically correspond to the likewise named P1, N1, and P2 temporal components of auditory event-related potentials (Aiken & Picton, 2008; Fiedler et al., 2019; Martin et al., 2008; Steinschneider et al., 2011). In the cocktail party, attentional gains reliably appear in auditory responses to the target speech around 100 ms, corresponding to the N1 component of the TRF (Ding & Simon, 2012).
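As an illustration of the general idea only, a TRF can be sketched as a regularized linear regression from time-lagged copies of a stimulus feature onto the neural response. The sketch below is not the estimation pipeline of any study cited here: the function names (`lagged_design`, `estimate_trf`), the ridge parameter, the 100 Hz sampling rate, the white-noise stand-in for a speech envelope, and the simulated kernel with a negative deflection at 100 ms are all assumptions chosen for demonstration.

```python
import numpy as np

def lagged_design(stimulus, lags):
    """Stack time-shifted copies of a 1-D stimulus, one column per lag (in samples)."""
    n = len(stimulus)
    X = np.zeros((n, len(lags)))
    for j, lag in enumerate(lags):
        if lag >= 0:
            X[lag:, j] = stimulus[:n - lag]   # response at t depends on stimulus at t - lag
        else:
            X[:lag, j] = stimulus[-lag:]
    return X

def estimate_trf(stimulus, response, lags, ridge=1.0):
    """Ridge-regularized least-squares estimate of the TRF weights (hypothetical helper)."""
    X = lagged_design(stimulus, lags)
    XtX = X.T @ X
    return np.linalg.solve(XtX + ridge * np.eye(len(lags)), X.T @ response)

# Simulated example: all values below are assumptions for demonstration only.
rng = np.random.default_rng(0)
fs = 100                                   # sampling rate in Hz
lags = np.arange(0, 40)                    # 0 to 390 ms in 10 ms steps
true_trf = -np.exp(-0.5 * ((lags - 10) / 2.0) ** 2)   # N1-like trough at 100 ms
stimulus = rng.standard_normal(20000)      # white-noise stand-in for an envelope feature
response = np.convolve(stimulus, true_trf)[:len(stimulus)]
response = response + 0.5 * rng.standard_normal(len(stimulus))  # additive sensor noise

trf = estimate_trf(stimulus, response, lags, ridge=1.0)
peak_latency_ms = 1000 * lags[np.argmin(trf)] / fs
print(f"Latency of the most negative TRF weight: {peak_latency_ms:.0f} ms")
```

In this simulation the estimated weights should closely follow the simulated kernel, so the latency of the most negative weight indicates when the modeled response is most strongly driven by the stimulus feature. Real speech envelopes are autocorrelated, which is one reason regularization is typically needed in practice.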

Does prior experience synergistically boost or antagonistically decrease selective enhancement at this point? In sensory systems, unpredicted stimuli may carry surprise signals that involve additional processing. Auditory surprise signals may be differentially coded under selective attentional gains, as evidenced by the boosting of cortical mismatch responses to frequency deviant tones by temporal attention (Auksztulewicz & Friston, 2015). In the cocktail party, it is not yet clear whether attention effects and expectations arising from prior experience analogously interact earlier at the N1 or any other relevant stage of speech processing. Unlike in mismatch sequence paradigms, stream objects in the cocktail party are subject to different attention and expectation conditions that must be accounted for continuously and simultaneously. While mismatch designs focus on the novelty gain response (which attention may boost), in the cocktail party the focus is the attentional gain response (which repeated experience may modulate; Wang et al., 2019). Determining how gains interact with experience, and the stage of speech processing at which they may do so, matters in the context of perceptual inference problems, which are driven by comparisons of observer expectations against input signals (Clark, 2013; Ten Oever & Martin, 2024). The stage of the first interaction and the precise aspects of speech that are engaged by prior experience may hence single out the learning processes by which the auditory brain shapes objects in the cocktail party.

In a magnetoencephalography (MEG) study, Wang et al. (2019) directly addressed whether exact prior knowledge of a speech stream modulates selective locking responses in the auditory cortex during cocktail party masking conditions. Under target priming, a boosted attentional bias relative to unprimed conditions was observed and explained by weakened representations of masker speech. Although maskers were not primed, the findings superficially appeared at odds with activity reductions that result from more efficient encoding of expected stimuli (Grill-Spector, Henson, & Martin, 2006; Summerfield et al., 2008). Expectation-related reductions in speech processing include, for instance, evidence that visual predictions decrease the amplitude of listeners' N1 and P2 auditory responses to corresponding speech (Pinto et al., 2019). Such research indicates that response reductions under priming may not even require exact sensory repetitions, relying instead on input that is redundant, accurate, or consistent with prior expectations of the observer. Altogether, these aspects raise the issue of whether priming information from a speech object specifically alters its representation during redundant recoding in the cocktail party. At the same time, they underscore the question of whether prior expectations similarly adjust the processing of speech objects that are to be ignored as distractors. The latter may help to compare active inhibitory processes that arise from the intentions of the observer through attention with those that stem from the extraction of regularities over time through learning (Wöstmann et al., 2022).

In the present study, the interaction between selective attention and prior experience in speech processing was systematically addressed in the context of a speech comprehension task that captures typical challenges of the ‘cocktail party’ problem. Using single-trial EEG and TRF tools, we investigated this interaction at each of the relevant P1, N1, and P2 stages of speech processing. In particular, we focused on the impact of auditory priming on the selective tracking of a critical cue in the segmentation and parsing of the temporal structure of speech, namely its temporal envelope edges. These auditory features have been demonstrated to be a predominant basis for stream segregation (Brodbeck et al., 2018). Our aim was to understand how they are tracked in parallel with the observer’s unfolding temporal predictions about speech. One candidate mechanism is through online comparisons between edge signals and observer expectations about them, where exact correspondence minimizes uncertainty about temporal aspects of a target object during selection.
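To make the notion of envelope edges concrete, one common operationalization (assumed here for illustration, and not necessarily the feature definition used in this study) is the half-wave rectified first derivative of a smoothed amplitude envelope, which is positive where the envelope rises and zero elsewhere. In the sketch below, the helper name `envelope_edges`, the 20 ms moving-average smoother, and the gated 100 Hz tone used as input are all hypothetical choices.

```python
import numpy as np

def envelope_edges(signal, fs, smooth_ms=20.0):
    """Return a smoothed amplitude envelope and its half-wave rectified derivative.

    Assumed operationalization of 'envelope edges': only rising portions of the
    envelope (acoustic onsets) yield non-zero values.
    """
    win = max(1, int(round(fs * smooth_ms / 1000.0)))
    kernel = np.ones(win) / win
    # Rectify, then smooth with a moving average to approximate the envelope.
    envelope = np.convolve(np.abs(signal), kernel, mode="same")
    # First difference scaled to units per second; keep only positive slopes.
    slope = np.diff(envelope, prepend=envelope[0]) * fs
    return envelope, np.maximum(slope, 0.0)

# Toy input (an assumption for demonstration): a 100 Hz tone gated on from 0.3 s to 0.7 s.
fs = 1000
t = np.arange(fs) / fs
gate = ((t >= 0.3) & (t < 0.7)).astype(float)
tone = gate * np.sin(2 * np.pi * 100 * t)

envelope, edges = envelope_edges(tone, fs)
onset_time_s = t[np.argmax(edges)]
print(f"Largest edge value occurs at about {onset_time_s:.3f} s")
```

For this toy input the edge signal should be concentrated around the onset of the tone and vanish at its offset, since falling envelope slopes are discarded by the rectification. A feature of this kind could then serve as the stimulus regressor in a TRF analysis.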
