Web-based care has become increasingly important in health care delivery as a means to accessibly reduce emotional distress. Online support groups (OSG) offer a convenient solution to those who cannot attend in-person support groups [-]. Professionally led OSGs occur in real time with participants engaging with a therapist and other participants in the group. Therapists facilitate the sharing of personal experiences to foster a mutually supportive environment. OSG participants report an increased sense of empowerment and control, as well as improved knowledge about their conditions [].
Cancer Chat Canada (CCC) offers web-based, professionally led, synchronous, text-based support groups to patients with cancer or caregivers across 6 Canadian provinces. The text-based format allows for anonymity while reaching people in rural areas. All groups provided via CCC are manual-based, consisting of 8-12 sessions. Each session focuses on a specific theme, homework readings, and web-based activities. Participants exchange their experiences and ideas through a chatbox on the CCC platform. During sessions, therapists facilitate discussions based on the weekly readings, address issues or concerns, attend to emergent emotional needs of the members, and employ therapeutic techniques that promote a continuous sense of mutual support among the 6-10 members [].
For group interventions to be effective, therapists encourage authentic emotional expression while effectively monitoring and addressing signals of distress []. However, the absence of visual cues, along with the simultaneous entries by multiple participants, can impose challenges for therapists to attend to all participants’ needs during the session []. Therapists’ failure to recognize and respond to participants’ expressions of distress can reduce the participants’ perceived level of support, safety, and trust in the group, leading to disengagement and attrition [].
One way to reduce attrition and improve OSG services is through tracking and monitoring group cohesion [,]. A cohesive group experiences a sense of warmth, comfort, acceptance, affiliation, and support from other members they value []. Group cohesion is associated with positive participant outcomes, including reductions in distress and improvements in interpersonal functioning [].
Traditionally, group cohesion is measured by participant self-report instruments, such as the Harvard Community Health Plan Group Cohesiveness Scale [] and the Group Cohesion Scale Revised []. Alternatively, it can be measured by content analysis, where analysts assign ratings to the participants’ statements []. While useful, these approaches are limited by participant recall bias and measurement fatigue in self-reports, and by the time and labor cost of post hoc qualitative analyses.
Previous studies demonstrate that a higher frequency of first-person singular pronoun use (eg, I, my), also referred to as “iTalk” or self-referential language, is a linguistic marker of general distress and is associated with negative psychological outcomes such as depression and suicidal behaviors [-]. In contrast, collective identity language use (eg, our group, us) was instrumental to group attachment [], with more frequent references to the group as a whole and to other members predicting reduced symptoms of grief []. Aside from content analysis systems, such as the Psychodynamic Work and Object Rating System [], many studies adopted computerized textual analysis systems such as dtSearch [] and Linguistic Inquiry and Word Count (LIWC) to track levels of cohesion through text [-]. In particular, Lieberman et al [] detected group cohesion in an OSG for patients with Parkinson disease by combining LIWC, to count the proportion of group referential language use, with dtSearch, to count words with positive connotations (eg, hope, altruistic, accept, affection) within 10 words of such group referential language. However, Alpers et al [] questioned the software’s ability to process complex communications, suggesting that future studies should develop systems that analyze the context of discourse for real-time analysis.
Given the evidence, group cohesion can be systematically measured by a well-designed computer analytical system. We designed the Artificial Intelligence–based Co-Facilitator (AICF) to contextually identify therapeutic outcomes, including group cohesion from conversations, and produce real-time analytics [-]. AICF can track basic emotions, including joy, sadness, anger, trust, fear, anticipation, disgust, surprise, and psychological outcomes such as distress, group cohesion, and hopelessness for each participant in the OSGs [,]. AICF extracted emotions from the text by parsing through over 120,000 lines of chat messages from a training data set to multiple levels of granularity: word, phrase, sentence, post, and posts by each user []. AICF employed several natural language processing (NLP) techniques, such as Word2Vec [] and text classification models. Classification models were trained to classify posts containing group cohesion mentions to determine the level of group cohesion in this web-based conversation setting. Each level of extraction served as an input for calibration for the subsequent extraction to increase accuracy [,,]. AICF could, therefore, track and inform facilitators of each participant’s level of cohesive statement use in their posts.
We hypothesized that AICF could detect first-person plural pronoun use (eg, we, our) and group-referential language use (“we-talk”) in OSGs as group cohesion, and that machine learning–based NLP could also identify a broader definition of group cohesion, including expressions of gratitude, mutual support, and a sense of belonging.
Objective
This study focuses on developing a method to train and evaluate AICF’s ability to detect group cohesion among cancer OSG members.
The steps involved in the training and development of AICF’s cohesion detection are outlined below.
Collecting Design Specifications
Experienced CCC therapists participated in phase 1 and phase 3 focus groups to provide design specifications on which clinically meaningful outcomes AICF should capture and provide real-time analytics for, and to discuss the pros and cons of AICF after using it clinically. All therapists who responded to our request to participate were involved in the study and are experienced in their field. In addition to the individual emotion tracking feature, the therapists expressed interest in tracking group processes, with a particular emphasis on group cohesion. Therapists described group cohesion as a high frequency of posting by members with a sense of interconnectedness through replying to others. A successful group session results in members feeling supported and acknowledged by other group members. The results of these focus groups are excluded from this manuscript, as they were published elsewhere.
Scoring Guide Development
A literature-based guide was developed to ensure that group cohesion statements were consistently identified and annotated by the human team.
Group cohesion is the sense of warmth, acceptance, support, and belongingness to the members [], a sense of closeness, and participation []. It is measured by statements that reflect a sense of belonging and support in the group.
This belonging and support could be expressed with the statement themes below []. The following examples were from the CCC chat training data.
Reassurance or encouragement between peers
Expressing support or feeling supported
Deepening emotional disclosure and trust
A sense of belonging
Gratitude for the group
Finding shared experiences and commonalities
Looking forward to future sessions or connecting outside of the group
Reflecting on the positive aspects of the group

Creating Training Data
To train AICF to identify group cohesion, 1000 examples of cohesive statements from 10 OSG sessions were annotated by 2 human group therapists (EW and JH). These annotated examples were used for training the algorithm.
Algorithm Development
Feature Selection of Group Cohesion Expressions
First, a corpus of CCC chat sessions (~80,000 messages) was used to train a Word2Vec word embedding model with the Gensim library in Python (gensim.models.Word2Vec(documents, size=100, window=10, min_count=2, workers=10)). This created a vector representation for each word in the corpus and positioned semantically similar expressions in closer proximity, capturing the contexts of cancer OSG discussion. Second, to expand the group cohesion mentions, the annotated samples were fed into the trained Word2Vec model as inputs to query for neighboring words. This resulted in a set of semantically similar, contextually relevant group cohesion expressions. It enabled AICF to identify statements representing group cohesion, including keywords such as “us,” “we,” and “our group,” as well as themes such as expressing gratitude, eagerness to attend upcoming group sessions, chatting outside of group time, mutual support, and a sense of belonging.
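The neighbor query described above can be illustrated with a minimal, standard-library sketch. It does not reproduce the study's model: the 3-dimensional vectors in `TOY_VECTORS` are invented stand-ins for trained Word2Vec embeddings, and `expand_seeds` is a hypothetical helper name. With a trained Gensim model, the comparable lookup would typically go through `model.wv.most_similar`; note also that Gensim 4 renamed the `size` argument to `vector_size`.

```python
import math

# Toy 3-dimensional vectors standing in for trained Word2Vec embeddings.
# The values are invented for illustration and are not from the study.
TOY_VECTORS = {
    "we": (0.9, 0.1, 0.0),
    "us": (0.85, 0.15, 0.05),
    "together": (0.8, 0.3, 0.1),
    "thankful": (0.6, 0.7, 0.1),
    "grateful": (0.62, 0.68, 0.12),
    "chemo": (0.0, 0.1, 0.95),
    "scan": (0.05, 0.0, 0.9),
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_seeds(seeds, vectors, top_n=2):
    """Add the top_n nearest neighbors of each seed word to the seed set."""
    expanded = set(seeds)
    for seed in seeds:
        if seed not in vectors:
            continue  # seed word unseen by the embedding model
        ranked = sorted(
            (word for word in vectors if word != seed),
            key=lambda word: cosine(vectors[seed], vectors[word]),
            reverse=True,
        )
        expanded.update(ranked[:top_n])
    return expanded


# Annotated seed expressions (hypothetical) are expanded with their neighbors.
expanded = expand_seeds(["we", "grateful"], TOY_VECTORS, top_n=2)
```

In this toy space the seeds pull in nearby group-oriented words while treatment-related words such as "chemo" stay out, which is the behavior the expansion step relies on.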
Training the Classifiers of Group Cohesion
To produce the probability of each post containing group cohesion, 3 models (multinomial naive Bayes, logistic regression, and a multilayer perceptron [MLP] classifier) were trained on the training data set using the selected group cohesion features.
Before training the classifier, a series of feature engineering steps were followed. Feature engineering is the process of creating features by extracting information from the data. For this purpose, the term frequency–inverse document frequency (TF-IDF) approach was used. TF-IDF is a statistical measure that evaluates how relevant a word is to a document in a collection of documents. It is performed by multiplying the term frequency and inverse document frequency of the word across a set of documents. In this classification task, the TF-IDF vectorizer was used with a limit of 5000 words capturing both unigrams (single words) and bi-grams (2 words that occur together). Next, the vectorizer was applied to the preprocessed training data set.
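As a rough illustration of the TF-IDF step, the sketch below computes term frequency times inverse document frequency over unigrams and bigrams using only the standard library. It is not the study's vectorizer: the whitespace tokenizer, the unsmoothed `log(N / df)` form of IDF, and the function names `ngrams` and `tfidf` are assumptions made for this example, and library vectorizers usually add smoothing and normalization.

```python
import math
from collections import Counter


def ngrams(text):
    """Return lowercase unigrams plus bigrams for one post."""
    tokens = text.lower().split()
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams


def tfidf(docs, max_features=5000):
    """Map each document to {term: tf * idf}, keeping the most frequent terms."""
    term_lists = [ngrams(doc) for doc in docs]
    corpus_counts = Counter(t for terms in term_lists for t in terms)
    # Vocabulary cap, mirroring the 5000-term limit described in the text.
    vocab = {term for term, _ in corpus_counts.most_common(max_features)}
    doc_freq = Counter(
        t for terms in term_lists for t in set(terms) if t in vocab
    )
    n_docs = len(docs)
    vectors = []
    for terms in term_lists:
        counts = Counter(t for t in terms if t in vocab)
        total = sum(counts.values()) or 1
        vectors.append(
            {
                term: (count / total) * math.log(n_docs / doc_freq[term])
                for term, count in counts.items()
            }
        )
    return vectors


# Invented example posts, not taken from the CCC data.
posts = [
    "we are here for you",
    "my scan is tomorrow",
    "we are so grateful for this group",
]
vectors = tfidf(posts)
```

Terms that occur in fewer posts receive a larger weight, so a rare word such as "scan" outweighs a word such as "we" that appears in most posts.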
Once the data set was transformed, it was used to train multiple classifiers: naïve Bayes, random forest, support vector machine (SVM), multilayer perceptron (MLP), and logistic regression models. The objective of training multiple classifiers was to increase the performance of the final classification by incorporating multiple high-performing classifiers. This technique is called “soft voting,” which is an ensemble machine learning technique that combines predictions from multiple models. Table 1 shows the F1-scores of the trained classifiers.
Table 1. The F1-scores of trained classifiers.
Classifier                F1-score
Support vector machine    0.63
Naïve Bayes               0.79
Multilayer perceptron     0.77
Random forest             0.72
Logistic regression       0.82

Group Cohesion Score Calculation
Based on this, the top 3 classifiers were selected: naïve Bayes, MLP, and logistic regression. The outcomes of all 3 classifiers were used to make the final prediction. If a post is classified into the same label by 2 of the 3 classifiers, then the output label is used as the final outcome. A confidence value was generated for each classification based on the weighted F1-scores of each classifier. The average F1-score using 3 classifiers was 0.8.
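The 2-of-3 agreement rule amounts to a majority vote, and it can be sketched as follows. The F1-scores are those listed in Table 1, but the confidence formula is an assumption made for this example (the summed F1 of the agreeing classifiers divided by the summed F1 of all three), since the exact weighting is not spelled out; the function name `vote` is likewise hypothetical.

```python
from collections import defaultdict

# F1-scores of the 3 selected classifiers, as listed in Table 1.
F1_WEIGHTS = {"naive_bayes": 0.79, "mlp": 0.77, "logistic_regression": 0.82}


def vote(predictions, weights=F1_WEIGHTS):
    """Return the majority label and an F1-weighted confidence.

    predictions maps classifier name -> predicted label (1 = cohesion).
    The confidence formula is an illustrative assumption, not the study's.
    """
    supporters_by_label = defaultdict(list)
    for name, label in predictions.items():
        supporters_by_label[label].append(name)
    # With 3 classifiers and 2 labels, one label always has at least 2 votes.
    label, supporters = max(
        supporters_by_label.items(), key=lambda item: len(item[1])
    )
    agreeing = sum(weights[name] for name in supporters)
    total = sum(weights[name] for name in predictions)
    return label, round(agreeing / total, 2)


label, confidence = vote(
    {"naive_bayes": 1, "mlp": 0, "logistic_regression": 1}
)
mean_f1 = sum(F1_WEIGHTS.values()) / len(F1_WEIGHTS)
```

Here `mean_f1` is about 0.79, which rounds to the reported average of 0.8.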
To improve the performance of the model, an active learning approach [] was used, in which human input serves as feedback to fine-tune the models. Therapists examined 20% (6797/34048) of the outputs using a confusion matrix (see Active Learning via Human Scoring section). The scoring results were then used as a feedback loop to improve the list of keywords of queried expressions. Lastly, to fine-tune group cohesion extraction, linguistic rules were hand-coded to handle exceptions such as past tense and empathetic questions by participants.
The following rules were added after the first round of scoring based on therapists’ feedback:
Intensifiers: We used the intensifiers from a pretrained library, Valence Aware Dictionary and Sentiment Reasoner (VADER) [], which considers intensity boosters such as “very” and “so much” to enhance the valence.
Past tense (identified by part-of-speech tagging via the NLTK Python library): The score was multiplied by 0.5 if past tense was present; because the event had happened in the past, we assumed that its effect on the person would have subsided.
Negation: The calculated cohesion score was set to zero if a negation expression was present.
First-person tagging: This was set to “False” if second- or third-person pronouns were found.
Empathy: If an empathy statement was found, the calculated group cohesion score was doubled to denote the intensity.

Finally, an aggregated score of group cohesion (β) was calculated for specific time intervals using the following formula:
where β is the group cohesion score; T is the specified time interval (30 minutes); A(t,t+T) is the set of all posts that occurred during the time t to t+T; and C(t,t+T) is the set of cohesion-mentioning posts that occurred during the time t to t+T.
A group cohesion score was displayed and updated at 30-minute intervals on the 90-minute timeline in a real-time dashboard for therapists.
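The numeric rules and the windowed score can be sketched together as below. Two points are assumptions rather than the study's stated method: β is read here as the share of posts in a window that mention cohesion, |C(t,t+T)| / |A(t,t+T)|, which is one form consistent with the definitions of A and C; and the rule flags are passed in as booleans instead of being derived from part-of-speech tags. The names `adjust_score` and `cohesion_by_window` are hypothetical.

```python
def adjust_score(score, past_tense=False, negation=False, empathy=False):
    """Apply the hand-coded numeric rules to one post's cohesion score."""
    if negation:
        return 0.0  # a negation expression cancels the cohesion score
    if past_tense:
        score *= 0.5  # past events are assumed to have a weaker effect
    if empathy:
        score *= 2.0  # empathy statements double the intensity
    return score


def cohesion_by_window(posts, window=30, session_length=90):
    """Return one beta value per time window.

    posts is a list of (minute, is_cohesion) pairs. Beta is assumed to be
    the proportion of posts in the window that mention cohesion.
    """
    scores = []
    for start in range(0, session_length, window):
        flags = [
            is_cohesion
            for minute, is_cohesion in posts
            if start <= minute < start + window
        ]
        scores.append(sum(flags) / len(flags) if flags else 0.0)
    return scores


# Invented session: (minute of post, whether it mentions cohesion).
session = [
    (2, True), (10, False),
    (35, True), (40, True), (50, False), (55, False),
    (70, False),
]
betas = cohesion_by_window(session)
```

A 90-minute session with 30-minute windows yields 3 values, matching the 3 dashboard updates described above.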
Active Learning via Human Scoring
Outputs were scored by undergraduate students (responsible for basic emotions), graduate students, and clinical experts (responsible for clinical and process outcomes). The team scored 20% of the output to inform AICF development, which was improved in light of the human scoring results. The updated AICF was run on the data of a new OSG (test data). Each AICF version was saved before training with new data. The team scored the output using definitions or examples from well-established psychometric measures such as the Group Cohesiveness Scale and Group Openness and Cohesion Questionnaire. A confusion matrix was used to score AICF outputs. The scoring process was based on recall, precision, and F1 measures. Scorers’ feedback, drawing on their domain expertise, was used to improve AICF’s performance until it achieved an F1-score of 80% before deployment in real-time OSGs for beta-testing [].
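For reference, the three measures follow directly from the confusion matrix counts. The sketch below uses the standard definitions; the counts in the example are invented for illustration and are not the study's data.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# Hypothetical counts: few false positives, many false negatives.
precision, recall, f1 = precision_recall_f1(tp=52, fp=1, fn=48)
```

Because F1 is a harmonic mean, a classifier that rarely raises a false alarm but misses about half of the cohesive statements still ends up with a modest F1, which is why reducing false negatives matters for reaching the 80% target.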
LIWC Evaluation
The Linguistic Inquiry and Word Count (LIWC) software [], considered the gold standard of psychology-based NLP, was used as a validation tool. LIWC reads a given text and calculates the percentage of total words in the text that match each of the LIWC dictionary categories. To capture the concept of group cohesion, we used 5 LIWC dictionary categories as the measurement criteria: “we,” “positive emotion,” “family,” “friend,” and “affiliation.” We classified the text as an instance of group cohesion when at least 3 of the 5 criteria were met.
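The 3-of-5 decision rule can be written as a small function over LIWC-style category percentages. This is a sketch under assumptions: it does not call LIWC, the input dictionary is a stand-in for LIWC output, and a criterion is treated as met when its percentage exceeds a `threshold` of zero, which the text does not specify.

```python
# The 5 dictionary categories named in the text.
CRITERIA = ("we", "positive emotion", "family", "friend", "affiliation")


def liwc_flags_cohesion(category_pct, min_met=3, threshold=0.0):
    """Return True when at least min_met of the 5 criteria are met.

    category_pct maps a category name to the percentage of words in the
    post matching that category. Treating "met" as "above threshold" is
    an assumption made for this example.
    """
    met = sum(
        1 for name in CRITERIA if category_pct.get(name, 0.0) > threshold
    )
    return met >= min_met


# Invented percentages for two hypothetical posts.
three_met = liwc_flags_cohesion(
    {"family": 6.7, "positive emotion": 6.7, "affiliation": 6.7}
)
one_met = liwc_flags_cohesion({"family": 8.3})
```

A rule of this kind counts category hits without regard to who or what the words refer to, so a post about a supportive person outside the group can satisfy 3 criteria just as easily as a post about the group itself.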
Ethical Considerations
The study protocol including the human participant recruitment method was approved by the University Health Network Research Ethics Board (confirmation number: UHN REB#18-5354). All identifiable information was removed from the quotes in this report. Participants were compensated with a CAD (73.34 USD) gift card upon the completion of the focus group.
The results herein focus only on the human evaluation of the AICF system and its ability to detect group cohesion. We compared AICF with LIWC against human judgment, using a confusion matrix to derive precision, recall, and F1-scores. AICF was run on 34,048 messages of CCC chat history to generate outputs for human scoring. Every fifth message was scored, totaling 6797 messages (20%). The precision, recall, and F1-scores are reported in Table 3; among the individual classifiers, logistic regression, followed by naive Bayes and MLP, performed the best (Table 1).
In this first round, AICF missed a high number of group cohesive statements (Table 3).
All scored statements were incorporated into AICF for improvement. In the second round, the team checked another 296 of 1208 messages (20%) from a separate set of CCC group conversations. AICF reduced its false negatives, improving recall from 0.52 to 0.70.
We also ran LIWC on new OSG data (12,034 messages) from the CCC platform. Precision, recall, and F1-scores are listed in Table 3.
Within the “true positive” instances identified by AICF in agreement with the human scorers, several thematic categories and keywords emerged. They closely align with established measures of group cohesion [], including expressions of support or a sense of belonging (Table 2). Moreover, some keywords consistently emerged within the true-positive statement classifications (eg, “we,” “us,” “our,” “group,” “support”). False-positive identifications were typically due to a missed subtle negation within the sentence or to a participant writing about a supportive person or activity from outside the group.
Where AICF missed a classification of group cohesion (ie, a false negative), it was typically also due to nuanced conversational features on which it had not yet been trained, such as local expressions or idioms, supportive responses to others, or statements missing identified group cohesion keywords (such as “we,” “us,” and “our”). These correct and incorrect classifications were used to refine AICF detection of group cohesion as the algorithm progresses in development.
Table 2. Themes, keywords, and examples of AICF(a) outputs.
Themes    Examples
True-positive themes
(a) AICF: Artificial Intelligence–based Co-Facilitator.
Table 3. Artificial Intelligence–based Co-Facilitator performance evaluation for identification of group cohesion.
Scoring round/method                 Precision    Recall    F1-score
First                                0.99         0.52      0.68
Second                               0.98         0.70      0.82
Linguistic Inquiry and Word Count    0.36         0.23      0.28

AICF, an ensemble of NLP and machine learning algorithms combined with annotation and human scoring, offers a novel way of measuring the group cohesion changes for each group member and alerting the therapist of these changes in real time. This affords therapists the opportunity to allocate their attention and resources for effective facilitation. The objective was to determine whether AICF can detect group cohesion beyond first-person plural pronoun use. The findings indicate that it is feasible to measure group cohesion in text-based complex human interactions using AICF. The level of congruency with human scoring suggests that it can be a helpful tool to therapists in improving the group cohesion outcome.
This study has opened an avenue to person-centered and process-outcome research using AI combined with human inputs to improve the quality of care, which is otherwise a labor-intensive research process. Initially, after being trained with 1000 annotated group cohesion statements processed by word embeddings and the domain expertise from therapists, AICF was able to achieve reasonable F1, precision, and recall scores. Furthermore, training the algorithm using only word embeddings allowed AICF to identify the various cohesion themes that emerged, which are consistent with previous research [34]. These themes include expressing support, reassurance, a sense of belonging, trust, deepening emotional disclosure, gratitude, remarking on shared experiences, reflecting on positive aspects of the group, and anticipating future chats. The findings suggest that training AICF to monitor therapeutic responses in web-based care is promising.
When human scoring of as little as 20% of the outputs was incorporated into the algorithm, AICF obtained a high F1-score. The human rater detected both false-positive examples (eg, “Just have to find what works for you, I listen to a lot of audible books while I do chores, it's a mental distraction and really helps me”) and false-negative examples (eg, “Thanks so much to all of you, for being in this moment. You've helped me get ready for yet another week.”). These examples contributed to the rule-based algorithms as a second layer of analysis. While precision values remained high in both rounds (0.99 vs 0.98), the recall value improved from 0.52 to 0.70 due to a reduction in false-negative classifications. This increase suggests that a continuous effort to train AICF using human input can lead to a higher level of accuracy in detecting group cohesion.
After running LIWC on a test data set, its performance was evaluated by a human scorer. The precision, recall, and F1-scores were lower than those of AICF. Unlike AICF, which is capable of identifying group cohesion expressions and idioms, LIWC is programmed to identify certain keywords. As an example of a false negative, LIWC was unable to detect the following quote as an instance of group cohesion due to the lack of the keyword “we”: “I feel like I’ve suddenly inherited a whole group of sisters.” As an example of a false positive, the LIWC dictionary categories “family,” “positive emotion,” and “affiliation” falsely detected group cohesion in this quote: “My husband has helped me see that it isn't something I did, or who I am.”
Comparison With Prior Work
This study successfully trained a machine learning system to detect cohesive statements, in contrast to qualitative content analysis, which tends to be onerous and prone to human errors when dealing with large amounts of data []. Emerging computer programs such as Discourse Attributes Analysis Program (DAAP) [] and LIWC [,] offer an iterative psycholinguistic approach to coding transcripts of psychotherapy for therapeutic moments [,]. For example, DAAP is based on a weighted dictionary that assigns weights to different words instead of solely detecting them as belonging to various categories where all matching words contribute equally to the generated scores. This method allows for greater accuracy in measuring different concepts compared to human coding while processing large amounts of data. However, these weighted dictionary approaches can be limited by a fixed number of instances that can be detected, and only one keyword can be considered in each match rather than taking context into account. Additionally, they do not consider emerging words, phrases, idiomatic expressions, word order, negation, or context-dependent factors, and they are post hoc in nature []. In this study, the word embedding approach was used to create contextual variables from the keywords to successfully detect a reasonably broad definition of cohesiveness. Thus, work will continue toward improving the accuracy of AICF in upcoming OSG sessions.
Limitations
AICF is based on a previously trained ensemble called Patient-Reported Information Multidimensional Exploration (PRIME) that was primarily trained on Australian web-based forum data []. Thus, Canadians may have used expressions or idioms that were unfamiliar to the original PRIME system and, therefore, not detected (eg, “My head is swirling” to describe feeling overwhelmed or “the clock is ticking” to describe an impending end of life). The local idioms and expressions were handled by the rule-based approach; ideally, AICF would be (re)trained with a large amount of local data in order to capture such idioms and expressions.
Currently, the interactional nature of the statement is not incorporated into AICF, including responses to other members’ or therapists’ statements. Furthermore, AICF cannot consistently distinguish whether participants are speaking about the group or about people outside of the group. When data accumulate, this distinction will become more obvious and refine AICF’s detection ability within the context of an OSG.
The performance of AICF’s group cohesion classification was evaluated in comparison to scores by 2 human experts, whose scoring was guided by the same criteria. However, given the nuanced nature of a group process like cohesion, there was still an element of personal judgment and openness to interpretation in the statements. Finally, emojis were not considered in the algorithm; future studies need to incorporate them as expressions of group cohesion.
Future Directions
AICF has been running in the background on 3 CCC groups and will soon be deployed for beta-testing on 10-12 groups. Participants will fill out a survey package that includes a psychometrically validated questionnaire that taps group cohesion for further validation. For algorithm development, sequencing the emotions expressed by each participant will be explored to capture more accurate emotional profiles.
The use of large language models (LLMs), such as ChatGPT, has substantially advanced natural language understanding in the field of affective computing. Research suggests that an LLM called RoBERTa [] has been equipped with emotion knowledge that contains 14 human conceptual attributes of emotions, including 2 affective, 6 appraisals, and 6 basic emotions. Future work will incorporate LLMs into our system to enhance AICF’s ability to detect group cohesion and other significant clinical outcomes. For example, LLMs already capture the syntactic difference between first-person and third-person pronoun use and their contexts. By combining these emotional attributes with syntax, we expect to be better able to formulate an equation to calculate the tendency of a writer to be self-focused or other-focused. This could improve the accuracy and precision of group cohesion detection.
Lastly, in this study, 5 LIWC dictionary categories were used to capture the concept of group cohesion. Future studies may test whether the performance of group cohesion prediction using LIWC can be improved by (1) adding more categories, (2) removing some categories, or (3) adding a weighting to each criterion.
Future work with AICF will explore ways to measure multiple processes comprising group climate, including the level of participation, expression of emotion, signs of cohesion, avoidance, and therapeutic factors such as conflict, altruism, universality, interpersonal learning input and output, catharsis, identification, self-understanding, and instillation of hope [,,]. If successful, AICF will be applied alongside mobile health chatbot technology to provide a scalable, automated monitoring and referral system that screens users for specific symptoms, recommends individualized web-based and community resources, tracks each user’s psychological outcomes, and refers them to local therapists when necessary.
Conclusions
Optimal OSG delivery requires rapid alerts for therapists to effectively monitor markers of positive and negative responses within the group. This study has demonstrated that advanced machine learning algorithms combined with human inputs can reasonably detect the clinically meaningful indicator of group cohesion in OSGs. Future research on using LLMs in AICF could enhance its ability to understand context, given that a highly customized model can be created in a short time. Therefore, AICF has the potential to assist therapists by highlighting issues that are amenable to intervention in real time, which allows therapists to provide greater levels of individualized support.
This research is funded by the Ontario Institute for Cancer Research Cancer Care Ontario Health Services Research Network.
The data sets generated during or analyzed during this study are not publicly available due to the presence of private health information but are available from the corresponding author on reasonable request.
YWL contributed to conceptualization, data curation, formal analysis, funding acquisition, investigation, methodology, project administration, resources, supervision, validation, writing the original draft, and reviewing and editing the manuscript. EW was involved in formal analysis and methodology. AA handled visualization, writing the original draft, reviewing and editing, formal analysis, and software. JH participated in formal analysis and methodology. VA contributed to writing the original draft and reviewing and editing. LD and CL were involved in the formal analysis. CK and KPC were involved in software validation. DDS provided supervision and funding acquisition. LT and HR contributed to validation and conceptualization. JW and MJE were responsible for writing, reviewing and editing, supervision, data curation, and funding acquisition.
None declared.
Edited by T de Azevedo Cardoso; submitted 03.10.22; peer-reviewed by D Hu, J Guendouzi, Y Haralambous, M Chatzimina, Pei-fu Chen, D Chrimes, W Ceron; comments to author 10.04.23; revised version received 07.07.23; accepted 08.05.24; published 22.07.24.
©Yvonne W Leung, Elise Wouterloot, Achini Adikari, Jinny Hong, Veenaajaa Asokan, Lauren Duan, Claire Lam, Carlina Kim, Kai P Chan, Daswin De Silva, Lianne Trachtenberg, Heather Rennie, Jiahui Wong, Mary Jane Esplen. Originally published in JMIR Cancer (https://cancer.jmir.org), 22.07.2024.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Cancer, is properly cited. The complete bibliographic information, a link to the original publication on https://cancer.jmir.org/, as well as this copyright and license information must be included.