According to psychologists and cognitive scientists, language understanding requires the construction of a dynamic mental representation of the state of affairs denoted by a text (Van Dijk and Kintsch, 1983; Zwaan, 2016). A commonly used notion is that of situation models, data structures containing a representation of the event that is currently being processed and understood (Zwaan and Radvansky, 1998). The comprehension process takes place within an existing situation model, which is dynamically and incrementally updated by unifying the current content with the new incoming information. At the same time, psycholinguistic research has provided evidence that human semantic memory stores generalized knowledge about events and their participants (McRae et al., 1998; Ferretti et al., 2001; McRae et al., 2005; Hare et al., 2009; McRae and Matsuki, 2009). Humans quickly activate this knowledge to anticipate upcoming input while understanding texts, and the coherence of the actual input with these expectations affects processing complexity: for example, sentences including highly predictable verb-argument combinations are associated with shorter reading times, shorter eye fixations, and reduced N400 amplitudes in ERP studies, compared to sentences with more unexpected and “surprising” event participants (e.g., The journalist checked the spelling is read faster than The journalist checked the brakes, as brakes is a less likely patient of checked) (Bicknell et al., 2010; Matsuki et al., 2011). It has been previously suggested that the extra processing difficulty may be due to the cost of unifying into the situation model portions of event knowledge that have been activated by the linguistic input but have a low degree of semantic plausibility (Chersoni et al., 2016, 2017, 2021b).
Reading times and N400 effects have also been shown to depend on the wider discourse context, and not just on verb-argument relations within the sentence. For example, words that are acceptable in a local context but anomalous in the general discourse lead to longer reading times and larger N400 effects (Van Berkum et al., 1999, 2005). However, a text may explicitly signal that the upcoming propositions are unexpected or contradictory given what was said before. From this point of view, discourse connectives (e.g., but, although, because, therefore, etc.) play an important role in indicating the semantic relation between text spans (Danlos et al., 2018); speakers can use them to increase the coherence of their texts, at the same time helping listeners to update their situation models and thus modulating their expectations about plausible upcoming words.
For this latter purpose, the most interesting connectives are those expressing opposition relations, i.e., concessive and contrastive connectives (Izutsu, 2008). According to Lakoff (1971), contrastive connectives indicate a direct semantic opposition between two clauses (e.g., The weather today is sunny, but yesterday it was rainy), while concessive connectives inform the listener that a given expectation has been denied (e.g., She works as a lawyer for Sullivan & Co, but she is not a bad person, implying that the speaker generally has a negative opinion of the lawyers working for that firm). It is easy to see why connectives are important in articulating the grammar of the discourse and in facilitating sentence processing: when listeners have to reverse their expectations about what is coming next, such particles can inform them of the need to update their situation model accordingly. On the other hand, it should be kept in mind that there is no one-to-one mapping between connectives and discourse relations (Knott, 1996), and their interpretation is probabilistic (Asr and Demberg, 2020), i.e., it depends on the distribution of relations that a connective has been used to signal in one's own linguistic experience. Indeed, in our previous examples, it can be noticed that but can support either a contrastive or a concessive meaning. There might be some margin of subjectivity in the interpretation of a discourse connective, especially when the context does not provide strong disambiguation cues, and the odds of a given discourse relation might differ between connectives of the same type across languages.
In recent research in Natural Language Processing (NLP), the field of Artificial Intelligence concerned with giving machines the ability to generate and understand human language, a new class of neural language models (henceforth NLMs) has emerged as dominant in the literature. Such models are generally based on the Transformer architecture (Vaswani et al., 2017) and are able to generate rich and dynamic representations of words in context, leading to remarkable improvements in several supervised linguistic tasks (Devlin et al., 2019; Liu Y. et al., 2019; Yang et al., 2019). Over the last few years, with increases in architectural complexity and in the amount of training text, NLMs are more and more often evaluated in a zero-shot fashion, by letting them freely generate the answers to the task instances (Radford et al., 2019; Brown et al., 2020; Achiam et al., 2023).
NLMs are trained on the task of predicting the next word given a sequence of words, and one of the reasons for their success is that they can be pre-trained without the need for annotated data, as the objective of the network is to reproduce the original word sequences in a large corpus of text (self-supervised learning, see Manning et al., 2020). The pretraining phase allows the models to encode a great deal of knowledge about (the statistical patterns of) language in their internal representations. In this sense, the use of the distributional behavior of linguistic expressions in NLMs as a way to represent their meaning can be seen as fully continuous with the tradition of Distributional Semantics (Lenci, 2023).
Another common evaluation of NLMs makes use of their log-probability or Surprisal (the negative log of the probability) scores to account for a wide range of sentence processing phenomena (Futrell et al., 2018; Van Schijndel and Linzen, 2018; Wilcox et al., 2018), including facilitation (Michaelov and Bergen, 2020, 2022a,b; Michaelov et al., 2023) and interference (Ryu and Lewis, 2021; Cong et al., 2023b) effects in online sentence processing. The idea behind Surprisal is that words that are less predictable should take more time for humans to process, and this predictability can be estimated via an NLM. In psycholinguistics, the predictability of a target word is often manipulated so that it differs across experimental conditions. Therefore, the goal of modeling studies is to see whether the scores estimated by an NLM align with human behavioral results.
The goal of the present paper is to analyze, across different languages, the processing effects associated with discourse connectives expressing opposition relations, from two different perspectives: human perception and computational prediction. We are specifically interested in concessives because of their special effects on event-based expectations: a comprehender generally makes predictions expecting a plausible, coherent scenario to unfold, but a concessive connective signals that such expectations are going to be reversed. Therefore, we first want to see whether we can replicate the findings of Xiang and Kuperberg (2015) on concessives at the behavioral level in Italian and Mandarin Chinese, to verify whether the same effects are observed in two new languages. Furthermore, we also introduce a condition with contrastive connectives to see whether they behave similarly. Lastly, we want to test whether NLMs can keep track of the expectation reversal, and whether their Surprisal scores reflect the update in the situation described by the discourse. Concretely, on the basis of the story stimuli of the dataset introduced by Xiang and Kuperberg (2015), we first built similar datasets for Italian and Mandarin Chinese and collected judgements from native speakers about the plausibility of the events in the stories; then, we collected reading times in a self-paced reading experiment in both languages. We observed that the two languages exhibit distinct patterns, both in terms of plausibility ratings and of self-paced reading times.
Next, we computed the Surprisal scores for the target verbs in the experimental stimuli using the GPT-2 language models for Italian and Mandarin Chinese, in order to observe the extent to which they were affected by the general plausibility of the stories and by the presence of discourse connectives. We found that NLMs do not reproduce the same effects observed in human data, in particular in Italian. We speculated that this could be due either to the relatively small size of the NLM used in our experiments, or to a large percentage of target words in our datasets that are not included in the models' vocabulary. Our analysis of the results suggested that the latter could be the most important factor.
2 Related work

2.1 Discourse coherence and connectives in sentence comprehension

In natural language, individual sentences are generally preceded by a broader discourse context. Scholars since Zwaan and Radvansky (1998) have argued that humans use situation models to form an event-based representation of what is being communicated, and that such representations are continuously updated as new input comes in. Coherence in discourse is well known to facilitate human language processing, as shown by experimental studies using different methodologies, e.g., self-paced reading (Albrecht and O'Brien, 1993), naming tasks (Hess et al., 1995), eye-tracking and ERPs (Camblin et al., 2007). In other words, as long as the new information is coherent with event knowledge, the comprehender can easily integrate it into the current situation model.
Kuperberg (2013) distinguishes three layers in human event representations: (i) the layer of event sequences, related to our knowledge about the likely temporal, spatial, and causal connections linking events and states together to form sequences of events, also known as scripts (Schank and Abelson, 1975); (ii) the layer of event structures, corresponding to our knowledge of events and their typical participants (cf. the notion of “generalized event knowledge” in McRae and Matsuki, 2009); (iii) the layer of semantic features, concerning our knowledge of the features and properties of conceptual entities and categories. Xiang and Kuperberg (2015) argue that discourse connectives primarily influence the representation of event sequences: for example, when a comprehender processing a discourse hears or reads a concessive connective (e.g., even so), his or her predictions about the upcoming event will be reversed, as the connective signals an opposite-to-expected causal relationship.
In their experiment, Xiang and Kuperberg (2015) designed a set of 3-sentence stories in four different conditions, differing in the degree of coherence of the last sentence with the discourse context and in the presence or absence of the even so connective at the beginning of the last sentence. After collecting coherence judgements from humans, they found that no-connective coherent items had the highest coherence ratings, whereas no-connective incoherent items had the lowest ones. In the connective conditions, even so reverses the coherence of the final sentence: a continuation that would be incoherent on its own becomes coherent when introduced by the connective (e.g., Liz had a history exam on Monday. She took the test and failed it. Even so, she went home and celebrated wildly). In a follow-up ERP experiment, measuring the N400 amplitude at the main verb of the final sentence (e.g., celebrated, in the example above), they found that the N400 for the verb was reduced in the coherent even-so items (i.e., lower processing costs) compared to the plain coherent items, suggesting that the presence of the connective made the prediction of the verb even easier.
The data by Xiang and Kuperberg (2015) exemplify discourse expectations over a relatively short range, driven by linguistic elements within the same sentence or in an adjacent sentence. Their work was extended by Scholman et al. (2017), who focused on the discourse markers on the one hand… on the other hand… to check whether comprehenders can maintain discourse coherence expectations across multiple sentences. They set up a norming experiment, in which subjects were exposed to stories introduced by on the one hand, followed by other propositions introduced by connectives, including contrastive ones, and finally by a proposition introduced by on the other hand. The results showed that subjects kept track of all the embedded constituents and their dependencies, and that they dispreferred stories with situations of contrast introduced by on the other hand if the previous context already included a contrast with the situation introduced by on the one hand.
Köhne-Fuetterer et al. (2021) presented experiments using the visual world paradigm and ERPs in English and German, with story items containing causal or concessive connectives. The visual world experiment revealed that anticipatory looks toward a target object depended on the specific discourse connective in the item. Moreover, in the ERP experiment the authors found a late positivity on concessive connectives compared to causal ones in both languages, possibly reflecting the extra processing costs of expectation reversal.
It is not easy to determine to what extent connective- and coherence-driven effects transfer to other languages. Most of the existing studies have focused on English and, in some cases, on German (e.g., the visual world and ERP study by Köhne-Fuetterer et al., 2021). Even in English, the interpretation of connectives has been claimed to be probabilistic, and individual differences can be observed in the expectations about coherence relations (Asr and Demberg, 2020; Scholman et al., 2020). It is also worth mentioning that recent research has established that the use of predictive processing in sentence comprehension is itself flexible, depending heavily on the reliability of the predictions, and that it can be “dismissed” when a discourse is not coherent. For example, Brothers et al. (2019) presented an ERP study in which subjects heard sentences from different speakers, but one speaker was instructed to frequently produce completions that violated the expectations of the listeners (an “unreliable speaker”). The authors found that reliability affected N400 amplitudes, with larger effects of predictability when the speaker was reliable, as if the subjects engaged less in prediction after seeing that their predictions were often violated. In the same line of work, the self-paced reading experiments by Carter and Hoffman (2024) showed that the general coherence of the discourse is an important factor: comprehenders are sensitive to shifts in topic and tend to “predict less” when they face incoherent discourse structures.
2.2 NLM Surprisal for modeling sentence processing data

Transformer-based NLMs (Vaswani et al., 2017; Devlin et al., 2019; Radford et al., 2019) have become increasingly popular in NLP research, and a growing body of literature aims at investigating the kind of linguistic knowledge they encode and the extent to which they can reproduce human performance in language processing tasks. A common methodology is the so-called probing (see, inter alia, Tenney et al., 2019a,b; Jawahar et al., 2019; Hewitt and Liang, 2019; Liu N. F. et al., 2019; Wu et al., 2020; Vulić et al., 2020; Sorodoc et al., 2020; Koto et al., 2021; Chersoni et al., 2021a; Thrush et al., 2022). In this methodology, a relatively simple classifier is asked to solve a linguistic task (e.g., number agreement, anaphora agreement, etc.) using a representation derived from an NLM without any specific linguistic supervision. If the classifier succeeds, the NLM can be inferred to encode the target linguistic knowledge in its representations.
Other studies have focused directly on the Surprisal scores computed by the models, to understand the extent to which they are sensitive to linguistic phenomena that have been shown to affect human sentence processing. The Surprisal of a word is a measure of its predictability given the previous context (Hale, 2001; Levy, 2008). It is defined as the negative logarithm of the probability of the word given the context, and it generally correlates with human reading times (i.e., more surprising words are read more slowly by humans). For example, Misra et al. (2020) investigated the predictions of BERT (Devlin et al., 2019) in a setting aimed at reproducing human semantic priming; they reported that BERT was indeed sensitive to “priming”, in that it predicted a word with lower Surprisal values when the context included a related word as opposed to an unrelated one. Cho et al. (2021) modeled a priming effect on the prediction of typical event locations, which in humans was observed to be related to the imperfective aspect of the verb. The authors found that BERT outputs lower Surprisal scores for typical locations but, unlike humans, does so regardless of the aspect of the main verb.
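To make the measure concrete, Surprisal can be computed directly from a model's probability estimate for a word in context. The sketch below uses hypothetical probability values rather than actual model outputs, purely to illustrate the definition given above:

```python
import math

def surprisal(probability: float) -> float:
    """Surprisal of a word given its context: the negative log
    of the probability assigned to it by a language model (in bits)."""
    return -math.log2(probability)

# Hypothetical probabilities for two continuations of
# "The journalist checked the ..." (values are illustrative only):
p_spelling = 0.20  # predictable continuation
p_brakes = 0.001   # unexpected continuation

# The unexpected word carries much higher Surprisal, mirroring the
# longer reading times and larger N400s observed in humans.
print(surprisal(p_spelling))  # ~2.32 bits
print(surprisal(p_brakes))    # ~9.97 bits
```

In modeling studies, these probabilities come from an NLM's softmax output for the target word given the preceding context, and the resulting scores are compared across experimental conditions.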
Michaelov and Bergen (2022a) used Surprisal to investigate the issue of collateral facilitation, that is, a scenario in which anomalous words in a sentence are processed more easily by humans due to the presence of semantically related words in the context. They compared the scores obtained from several Transformer NLMs with human data from several sentence processing experiments, and found that most of the models reproduced the same significant differences between conditions that were observed in human behavior. In Michaelov et al. (2023), Surprisal was instead used to replicate the effect of the discourse context in reducing the N400 amplitude for anomalous words, using the Dutch stimuli from the experiments by Nieuwland and Van Berkum (2006) as the evaluation data. In those experiments, Nieuwland and Van Berkum (2006) showed that sentences containing verb-object animacy violations (e.g., The girl comforted the clock) elicited large N400 effects, but the inclusion of a supportive discourse context (e.g., a girl talking to a clock about its depression) led to a reduction of this effect. Language models showed, once again, a pattern very close to that of humans, suggesting that the reduction effect in the original study may have been due to lexical priming from the previous context.
2.3 Discourse connectives in NLP

The importance of connectives in NLP research is due to the fact that they lexicalize specific discourse relations (Braud and Denis, 2016; Ma et al., 2019). During the acquisition of annotations for discourse-parsing tasks, connectives provide a clue to discourse relations, which may otherwise remain implicit. In such cases, human annotators are asked to insert the connective that they consider most appropriate, given two discourse fragments (Prasad et al., 2007). Given the recent rise of NLMs, researchers in NLP have started to explore the capacity of the models to identify the right connectives, which in turn requires an understanding of the relations between discourse fragments.
Ko and Li (2020) proposed to investigate GPT-2's linguistic competence in terms of discourse coherence by testing the model's ability to produce the correct connectives, when given a discourse relation linking two clauses. Using both organic generation and fine-tuned scenarios, they observed that GPT-2 did not always generate coherent discourse, although the generations were better aligned with human behavior in the fine-tuned scenario.
Pandia et al. (2021) evaluated several NLMs on the prediction of the correct connectives in contexts that required Gricean-like pragmatic knowledge and in which a specific connective would correspond to an implicature. For example, in cases such as Maggie did the paperwork by hand and the company bought new computers, which is to say, Maggie did the paperwork by hand [MASK] the company bought new computers., the model had to predict before in the [MASK] position to show an understanding that the implied meaning of and in this context was and then. The authors showed that, when controlling strictly for low-level lexical and syntactic cues, the models performed at chance level at best.
The recent work of Cong et al. (2023a) is the closest to our study, since it investigates the impact of discourse connectives on the Surprisal scores of NLMs. The authors analyzed the effects of concessive and contrastive connectives on NLMs using the English stimuli by Xiang and Kuperberg (2015), measuring the Surprisal scores of the target verbs in the stories under different experimental conditions. They tested several NLMs of different sizes (e.g., GPT-2 Base, GPT-2 XL, and GPTNeo) and found that the larger GPTNeo model showed the pattern closest to human behavior when a concessive connective was used, leading to a reversal of the expectations about the final verb. Moreover, the results remained consistent after replacing the original even so with different concessive connectives. On the other hand, in line with the predictions of linguistic theory (Izutsu, 2008), they found that replacing a concessive with a contrastive connective does not lead to expectation reversal effects.
Our study expands on the previous work of Cong et al. (2023a). First of all, we test whether expectation reversal can also be observed in other languages: we ran our experiments in Italian and Mandarin Chinese, two languages typologically different from English (the former Romance, the latter Sinitic). We collect Italian and Chinese native speakers' coherence judgments and self-paced reading times for items translated from those used in Xiang and Kuperberg (2015). Then we model the same experimental items with each language's GPT-2 Base model.
3 Behavioral experiments

Concessive connectives can be used to create scenarios in which expectations have to be reversed. In translating the English dataset used in Xiang and Kuperberg (2015), we aimed at using connectives that prototypically express concession in the target languages, to reproduce the original stimuli as closely as possible. We collected human ratings on the naturalness and plausibility of the situations described by our experimental items and, unlike the original study, we tested human processing behavior using a self-paced reading task.
In addition to items with concessive connectives based on the original study, we created an additional version of our stimuli using contrastive connectives. Although contrastive connectives per se simply signal a contrast, rather than the denial of expectations, they can still be used in concessive constructions. While using a contrastive connective for a concessive relation may not be prototypical, it should still be understandable to the readers, as the interpretation of connectives is probabilistic (Asr and Demberg, 2020). Moreover, for our purposes, the use of two types of connectives helps examine the influence of connective types on human comprehension of concessive scenarios.
3.1 Experimental items

We used the dataset created by Xiang and Kuperberg (2015) as our starting point to develop similar datasets for Italian and Mandarin Chinese. In their experiment, the authors designed 180 sets of three-sentence discourse items in English, each with four conditions as in Example 1 (45 scenarios per condition). The target word (underlined) was always the main verb of the final sentence.
(1) a. Liz had a history exam on Monday. She took the test and aced it. She went home and celebrated wildly. (Plain, Coherent)
b. Liz had a history exam on Monday. She took the test and failed it. She went home and celebrated wildly. (Plain, Incoherent)
c. Liz had a history exam on Monday. She took the test and failed it. Even so, she went home and celebrated wildly. (Even so, Coherent)
d. Liz had a history exam on Monday. She took the test and aced it. Even so, she went home and celebrated wildly. (Even so, Incoherent)
Notice that in conditions (1c) and (1d), the presence of the connective even so changes the general coherence of the given scenario, reversing the expectations of the reader. In our work, we decided to refer to the notion of plausibility instead of coherence, in order to focus more on the plausibility/naturalness of the described scenario than on the discourse connections. We believe plausibility is a more intuitive notion for human readers to grasp. Therefore, in this paper, we refer to conditions like (1a) as Plaus, (1b) as Implaus, (1c) as ES-Plaus, and (1d) as ES-Implaus.
Compared to Xiang and Kuperberg (2015), we have two additional conditions in which the concessive connective in (1c) and (1d) is replaced by a contrastive one, as shown in the example in Table 1. We refer to the contrastive conditions in each language dataset as HW-Plaus for a more plausible story containing a contrastive connective, and HW-Implaus for a less plausible story containing a contrastive connective. We translated even so as 即使如此 in Chinese and Nonostante ciò in Italian, and used the contrastive connectives 但是 for Chinese and tuttavia for Italian, respectively, both translatable as however (Battaglia and Pernicone, 1985; Wang, 2011).
Table 1. Example of the connective dataset item in all the conditions for each language with translation.
The stories of the original dataset were initially translated into the new languages using the DeepL translation software. The sentences were then manually checked one by one by two native speakers of Italian and Mandarin, to ensure the naturalness and coherence contrast of the translations in the new language datasets and to correct possible mistakes. For both target languages, at least one of the authors is a native speaker. Stories for which it was not possible to achieve a natural-sounding translation were revised or excluded from the dataset. In the end, we obtained 100 stories for Italian and 180 stories for Mandarin Chinese for each of the three connective conditions (no connective, concessive, and contrastive). In both languages, each story has three clauses. In the Mandarin set, all sentences were adjusted to be 18 words in length, and in the Italian set all sentences were 22 words long. The connective is always the first word of the third clause (Region 12 in Mandarin and Region 16 in Italian), and the target word is the main verb of the third clause, which determines the potential plausibility of a given story. We ensured that no additional connectives occurred among the clauses in each story.
To prepare for the self-paced reading task, we split the whole dataset of each language (3 sentence types × 2 plausibility levels × 50 items in Italian, and 3 sentence types × 2 plausibility levels × 90 items in Chinese) using a Latin square design, so that each list contains all the connective conditions (examples for each language and condition can be seen in Table 1).
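A Latin-square split of this kind can be sketched as follows. This is a simplified illustration with a hypothetical item count; the condition labels follow the naming introduced above, and the rotation ensures that every list contains all conditions while each item appears exactly once per list:

```python
# Rotate conditions across lists in a Latin-square fashion, so that
# every list contains all conditions, every item appears exactly once
# per list, and each item is seen in a different condition per list.
CONDITIONS = ["Plaus", "Implaus", "ES-Plaus", "ES-Implaus",
              "HW-Plaus", "HW-Implaus"]

def latin_square_lists(n_items, conditions=CONDITIONS):
    n = len(conditions)
    lists = [[] for _ in range(n)]
    for item in range(n_items):
        for li in range(n):
            # Shift the condition assignment by the list index.
            lists[li].append((item, conditions[(item + li) % n]))
    return lists

# Example with 12 hypothetical items: 6 lists, each containing all
# 12 items, with the 6 conditions balanced (2 items per condition).
lists = latin_square_lists(12)
```

Each participant then reads one list, so that no participant sees the same item in two conditions, while every condition of every item is rated by some participants.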
3.2 Experiments

This section reports the findings of experiments with human participants.
3.2.1 Participants

138 Mandarin native speakers (mean age: 23.00, SD: 3.92; 61 females) and 133 Italian native speakers (mean age: 28.95, SD: 5.21; 70 females), who did not report left-handedness, family history of brain damage, or speech or hearing impairment, were recruited via online questionnaires. All received an explanation of the study and its procedures and gave their informed consent before its commencement. The participants were then directed to the online experiment in a self-paced reading format, administered on PCIbex (farm.pcibex.net). Participants were paid 100 HKD upon successfully completing the experiment.
3.2.2 Self-paced reading experiment and ratings

During the experimental session, each participant was instructed to make sure the Internet connection was stable and to sit in front of a computer with a proper screen display of the experiment's online page. The experimental material incorporated a “click-to-proceed” element that prevented participants from reading ahead, and thus from seeing any whole story until they had finished reading it. Using this method, it was possible to isolate their reactions to specific regions within each story.
Each story was presented on screen one word at a time, in a non-cumulative, centered-window, self-paced reading paradigm (Just et al., 1982). Participants pressed a key to mask the current word and reveal the next one. After completing four practice stories, each participant was asked to read the experimental stories carefully but at their natural reading pace. At the end of each story, the participant was asked to rate the plausibility and naturalness of the story they had just read on a Likert scale from 1 (not plausible at all) to 7 (very plausible). The whole procedure took each participant between 35 and 45 min to complete.
The participants' reading times for each word and the rating times (i.e., the time taken to make the choice) were recorded, allowing us to estimate the processing effort of reading comprehension (Just et al., 1982). Participants' ratings were also recorded for each item. A participant's data were excluded if the same rating was selected 10 or more times consecutively, or if only two of the seven possible points on the scale were used.
3.2.3 Analysis

We first excluded the data of participants with abnormally fast or slow completion times (2.5 SD from the mean reading time), and then carried out further recruitment to ensure that each of the eight lists had data from at least 15 valid participants. After data exclusion, the data of 120 Mandarin speakers and 133 Italian speakers were kept for the final analyses.
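The participant-level exclusion step can be sketched as follows. This is a minimal illustration with made-up completion times; the 2.5 SD cutoff is the criterion described above:

```python
from statistics import mean, stdev

def within_threshold(times, k=2.5):
    """Keep only participants whose completion time lies within
    k standard deviations of the group mean."""
    m, s = mean(times), stdev(times)
    return [t for t in times if abs(t - m) <= k * s]

# Made-up completion times in minutes; 200 is abnormally slow
# and falls outside the 2.5 SD band, so it is removed.
times = [40, 42, 38, 41, 39, 43, 40, 41, 39, 42, 38, 41, 200]
kept = within_threshold(times)
```

Note that with very small samples a single extreme value inflates the standard deviation itself, so in practice the criterion is applied over the full participant pool.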
The reading times (RTs) were then log-transformed to approximate a normal distribution for analysis purposes. For RTs, we analyzed the overall RTs of each item, the target regions (i.e., the connective region, the keyword region, and the end of the final sentence region), as well as the time participants took to provide ratings (i.e., the choice). RTs were fitted with generalized linear mixed-effects models, with CONDITION and PLAUSIBILITY as predictors, and the IDs of participants and items as random effects. As an additional predictor, we used the logarithmic frequency of the target word, extracted with the WORDFREQ Python tool (Speer, 2022). Pairwise post-hoc Tukey comparisons were then conducted with R's emmeans package.
For the plausibility ratings, we analyzed the impact of conditions and plausibility using cumulative link mixed models produced by the clmm() function of the ordinal package (Christensen, 2023) in R. Pairwise post-hoc Tukey's comparisons were then conducted with R's emmeans package.
3.3 Results and analyses

Based on the boxplots of the Italian (left) and Chinese (right) human ratings in Figure 1, both language groups rated the plausible condition without connective the highest. The median ratings in the Italian group show that implausible items with connectives are rated similarly to the corresponding no-connective conditions, but the means (the dots) of plausible items with connectives are much lower than those of the plain plausible items. Among the implausible conditions, the mean ratings for even so are slightly lower than for the other two implausible conditions.
Figure 1. Italian and Chinese human plausibility judgments plotted by condition. Dot, mean; inner box line, median.
In the Chinese group, the median and the mean ratings of plausible items with connectives are much lower than in the plausible condition without connectives, similarly to Italian. However, among the implausible conditions, while the median ratings are almost the same, the mean ratings of HW- and ES-Implaus are higher than the Implausible condition, and are close to the mean ratings of their plausible counterparts.
We fitted the ratings with a full CLMM, with connective and plausibility as main effects and their interaction also included. Participant and item were included as random effects. All main effects were significant (p < 0.01) for both Italian and Chinese. Post-hoc Tukey comparisons were carried out using the emmeans() function from the emmeans library in R (Lenth, 2024). Here, we focus on the results for the interaction effects. For the Italian set, there were significant contrasts (ps < 0.001) for all comparisons except:
• evenso-plausible/however-plausible (p = 1.0000)
• evenso-implausible/however-implausible (p = 0.3550)
• null-implausible/evenso-implausible (p = 0.4982)
• null-implausible/however-implausible (p = 0.9953)
For the Chinese set, there were significant contrasts (ps < 0.001) for all comparisons except:
• evenso-plausible/however-plausible (p = 0.1017)
• evenso-implausible/however-implausible (p = 0.9627)
In summary, for both Italian and Chinese there were no significant differences in the ratings between the two connective types, in either the plausible or the implausible condition. However, the two language groups still show differences. For the Italian data, no significant differences were found between the implausible condition with no connective (null) and the two connective implausible conditions (“however”: p = 0.9953, “even so”: p = 0.4982). In contrast, in the Chinese data the no-connective implausible condition was rated significantly lower than both connective conditions (p < 0.01).
As noted above, one important difference observable in the plots in Figure 1 is that all the Italian conditions show a difference between plausible and implausible, whereas in Chinese the presence of a connective appears to even out the differences in plausibility, so that the distinction between plausible and implausible is less clear. For example, although still significant, the effect for the however-implausible/however-plausible contrast (p < 0.05) is smaller in magnitude than all the other contrasts (see also the green and light blue boxplots in Figure 1, right).
It is noticeable that participants rated the plausible conditions with connectives higher than the no-connective implausible condition. This is an effect of expectation reversal: without the connective, the situations described in the even so/however plausible items would be implausible; but since the connective signals that the last sentence will “contradict” the expected scenario, the entire story sounds more coherent as a whole (e.g., compare Jane failed the test. She celebrated wildly. with Jane failed the test. However/Even so, she celebrated wildly.).
3.3.1 Reading times

We analyzed and report the reading times of two critical regions recorded in the self-paced reading studies, i.e., the region of the target word (the main verb of the third clause) and the region at the end of the third sentence. Before the target word, we also measured the effect of sentence type in the connective region (no connective, even so, or however), but not the other effects, since before the connective position there is no difference between the items across conditions. We used lmerTest (Kuznetsova et al., 2017) rather than the more standard lme4 because only the former returns p-values for the fixed effects of mixed models.
For the connective region, even so conditions produced significantly slower reading times in both Italian (p < 0.05) and Chinese (p < 0.001), but the however condition was only significantly slower in Italian (p < 0.01) and not in Chinese (p = 0.05).
Figure 2 shows the reading time (RT) of each word in the third clause of the Italian (left) and Chinese (right) data by connective and plausibility condition; the statistical analyses with the linear mixed-effects models are available in Tables 2, 3. In Table 2 we can see that, in the target verb region, however items were read more slowly than the no-connective condition in Italian (p < 0.001), but there was no significant difference for Chinese (p = 0.91). Additionally, in Italian the subjects took longer in the even so condition (p < 0.001) and were slower in reading implausible items (p < 0.01). In Chinese we have a similarly strong effect of even so (p < 0.001), and this is in fact the only significant effect observed for Chinese in this region: neither frequency nor plausibility had a significant effect at the target word. Finally, in Italian we also see a facilitatory effect of the frequency of the target (p < 0.001).
Figure 2. Log Reading Times per region. y-axis, reading time (ms); x-axis, sentence region.
Table 2. Results of Italian and Chinese reading times (RTs) for the target word region.
Table 3. Results of Italian and Chinese reading times (RTs) for the end of sentence region.
Considering the potential spillover effects, we also ran analyses of RTs at the end region of each condition (Table 3). Both languages showed significant effects for plausibility on RTs (p < 0.001). While no distinction between plausible and implausible was seen in Chinese at the target word region, significantly increased RTs for implausible items were observed at the end of sentence region (p < 0.001).
Moreover, stories containing a connective were read significantly more slowly than the baseline condition (ps < 0.001), this time regardless of the connective: in both Italian and Chinese, the even so and the however items are read significantly more slowly at the end of the final sentence. As for the interaction between connective and plausibility, no effects were found on the Italian side, but on the Chinese side there was a significant interaction with plausibility for both the evenso (p < 0.05) and the however (p < 0.01) connective.
For pairwise comparisons of conditions, as shown in Table 4, the null-plausible conditions are significantly different (p < 0.001) from their connective counterparts and from the null-implausible condition for both languages. Furthermore, null-implausible conditions are also significantly different (p < 0.001) from both connective implausible conditions in both languages. There are, however, several differences between the Italian and Chinese RTs in the comparisons between connective conditions. Firstly, the difference between evenso-plausible and however-plausible is not significant for Italian (p = 1.00) but it is for Chinese (p < 0.001), and the same is true of the implausible counterpart (Italian: p = 0.93; Chinese: p < 0.001): in both cases, even so in Chinese seems to elicit significantly longer reading times. Secondly, for evenso-plausible vs. evenso-implausible there is again a difference: the Italian contrast is not significant (p = 0.98), whereas in Chinese plausible items are read significantly faster (p < 0.01), although the difference is somewhat smaller than for the other conditions. Lastly, there is no significant difference between however-plausible and however-implausible for Italian (p = 0.58), compared to a significant difference in Chinese (p < 0.01).
Table 4. Interaction of connective and plausibility for Italian and Chinese reading times at the end of sentence region.
Once again, the patterns observed in the two languages are slightly different. For the purpose of observing the expectation reversal in reading, particularly relevant is the comparison between the two plausible conditions with connectives and the implausible condition without connective, since the presence of a connective can turn an otherwise implausible scenario into a plausible one. Noticeably, in Italian both the however and the even so plausible items are significantly different from plain implausible ones (p < 0.01), while in Chinese only the even so plausible condition reaches significance (p < 0.001); but, contrary to expectations, they are associated with longer reading times. We can hypothesize that, despite being plausible, conditions with connectives require additional processing time due to their greater structural complexity. This marks a difference with the plausibility judgments, where participants rated the plausible connective conditions higher. A possible explanation is that in the self-paced reading task participants read the sentence word by word and incur extra processing costs from the expectation reversal triggered by the connective, which could translate into regressions and extended reading times. But when they have to rate the semantic plausibility of the connective item as a whole, they still rate it as more plausible than the plain implausible item.
The effects of plausibility and of the however contrastive connective show up for Chinese only in the end region, not in the target region. Even so, in general, seems to be associated with greater difficulty in this language compared to however, which might be due to the difference in frequency between the two connectives: the Chinese even so (即使如此) has a logarithmic frequency of 3.21 in the combined corpora of the WORDFREQ tool, compared to 5.85 for however (但是).
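Assuming these values are on the Zipf scale used by the WORDFREQ tool (log10 of occurrences per billion words), the gap between the two connectives corresponds to a difference of more than two orders of magnitude in raw frequency:

```python
import math

def zipf_to_per_million(zipf: float) -> float:
    """Invert the Zipf scale: zipf = log10(occurrences per billion words),
    so occurrences per million words = 10 ** (zipf - 3)."""
    return 10 ** (zipf - 3.0)

# Log frequencies reported in the text for the two Chinese connectives
freq_even_so = zipf_to_per_million(3.21)   # 即使如此 "even so"
freq_however = zipf_to_per_million(5.85)   # 但是 "however"

# 但是 is more than 400 times as frequent as 即使如此
ratio = freq_however / freq_even_so
print(ratio > 400)
```

On this scale, a word at Zipf 3.0 occurs about once per million tokens, so even so is a comparatively rare connective while however is extremely common.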
4 Experiments with language models

After extracting the human reading times, we compared them with the Surprisal scores of the same experimental items extracted from language models. More specifically, we used the GPT-2 Base model for both languages, and the methods implemented in the Minicons library (Misra, 2022) for computing the Surprisal scores. Minicons is an open-source library that provides a high-level API for behavioral and representational analyses of neural language models (NLMs). We are aware that we could have chosen different and perhaps more powerful architectures for computing our Surprisal scores. However, we initially preferred to stick with GPT-2 Base in order to select a model that has a standard implementation on HuggingFace in all the target languages, which makes the results easily comparable.
We computed the Surprisal scores for the target word in the stimuli, i.e., the word that triggers the change in a story's plausibility, as illustrated in Section 3.1 Experimental items. Formally, the Surprisal of the target wt in the context w1…t−1 is defined as the negative logarithm of the probability of wt given the previous context, as in Equation 1. When a word was split by the NLM's tokenizer into multiple sub-tokens, we computed the target's Surprisal as the sum of the Surprisals of its sub-tokens.
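The summation over sub-tokens can be sketched as follows; the probabilities below are invented for illustration (natural log is used here, though the base only changes the unit):

```python
import math

def surprisal(p: float) -> float:
    """Surprisal of a token: the negative log of its conditional probability."""
    return -math.log(p)

# Hypothetical conditional probabilities for a target word that the
# tokenizer split into two sub-tokens
subtoken_probs = [0.02, 0.65]

# Word-level Surprisal = sum of sub-token Surprisals, which is equivalent
# to the surprisal of the joint probability of the sub-token sequence
word_surprisal = sum(surprisal(p) for p in subtoken_probs)
assert abs(word_surprisal - surprisal(0.02 * 0.65)) < 1e-9

print(round(word_surprisal, 2))  # 4.34
```

The equivalence in the assertion is why summing sub-token Surprisals is the standard way to recover a word-level score from a sub-word model: log probabilities of the pieces add up to the log probability of the whole.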
Surp(wt) = −log P(wt|w1...t−1)    (1)

For each dataset, we then fitted a linear mixed-effects model using the Surprisals of the target word computed by GPT-2 as the dependent variable, and the ID of each dataset item as the random intercept. The independent variables include the plausibility of the discourse (Plausibility: plausible vs. implausible), the discourse connective condition (Connective: EvenSo, However, and without connective), the token length of the stimulus (Seq_Len), and an out-of-vocabulary (OOV) binary label indicating whether or not the target word in the stimulus is out of the pre-trained NLM's vocabulary. This is potentially an important factor, since NLMs do not tokenize by words but by subwords, and GPT-2 makes use of a Byte-Pair Encoding (BPE) tokenizer: it has been argued that the Surprisal values for words with more than one subtoken in BPE models tend to be more uniform and less cognitively realistic (Nair and Resnik, 2023). Finally, we included the interaction between the Connective and the Plausibility conditions.
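The OOV label can be sketched as a check on whether the target word is a single entry in the model's sub-word vocabulary; the vocabulary fragment below is a toy stand-in for the real GPT-2 BPE vocabulary:

```python
def oov_label(word: str, vocab: set) -> int:
    """Binary OOV predictor: 1 if the word is not a single entry in the
    sub-word vocabulary (i.e., the BPE tokenizer would split it into
    multiple sub-tokens), else 0."""
    return 0 if word in vocab else 1

# Hypothetical vocabulary fragment, for illustration only
toy_vocab = {"checked", "the", "spelling"}

print(oov_label("spelling", toy_vocab))  # 0: a single vocabulary entry
print(oov_label("brakes", toy_vocab))    # 1: would be split into sub-tokens
```

In practice the check would be run against the tokenizer of the actual pre-trained model, but the predictor itself is just this binary indicator.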
Once again, we used the lmerTest package (Kuznetsova et al., 2017) for fitting the generalized linear mixed models, whose results are illustrated in Table 5. Unlike the models for the reading-time data, in this case we only have random effects for the items, not for the participants (all the predictions for a dataset in one language come from a single language model).
Table 5. Summary for the linear-mixed effects models results of predictors of Surprisals with the Italian and Chinese datasets.
Figure 3 shows the distribution of the NLMs' Surprisal scores across conditions for both the Italian and the Chinese datasets. The boxplots suggest that the plausible without-connective condition leads to the lowest overall Surprisal scores across languages and connective types. There are visibly more outliers in the Italian dataset than in the Chinese one, and the data distribution is closer to normal in the Chinese dataset, as suggested by the smaller gaps between mean and median.
Figure 3. Boxplots of the NLM Surprisals for discourse connectives. Inner box dot, mean; inner box line, median; outer box dots, outliers. Plaus, plausible without connective; Implaus, implausible without connective; HW-(Im)Plaus, (Im)plausible with however; ES-(Im)Plaus, (Im)plausible with even so.
Turning to the linear mixed-effects models, Table 5 shows that in the Italian dataset there is no plausibility effect on the Surprisal scores, while there is a strongly significant effect of Seq_Len and a significant effect of OOV: an increase in either variable leads to an increase in Surprisal. There is also a significant effect of the frequency of the target word: more frequent words are associated with lower Surprisal. The interactions of the predictor variables did not give rise to any significant effect.
On the other hand, in Chinese, all the predictors except for Seq_Len can be seen to have a significant effect on the scores: implausible items are associated with an increase in Surprisal, as well as both the Connective