Measuring Racial Sentiment Using Social Media Is Harder Than It Seems

Social media, and the internet more broadly, presents a historically unprecedented opportunity to assess the speech of millions of individuals across social classes, ages, nationalities, and racial or ethnic groups, making it a tempting resource for epidemiologists interested in understanding broad-scale public opinions. In the current issue, Nguyen et al.1 present a descriptive study of race-related content on Twitter (now renamed X) over the past decade, reporting patterns of negative sentiment language and concluding that its relative frequency has increased over time. They provide a detailed explanation of how to curate a dataset from social media and suggestions for the types of quantitative data analyses that can be done with these data. Interpreting these data, however, is less simple than the analyses might suggest. With any new data source, it is vital to carefully consider the assumptions or potential biases baked into the process of creating, collecting, and using the data. As methodologists and frequent users of social media, we highlight 4 questions we hope will help contextualize current and future analyses of social media data.

WHOSE DATA ARE WE ACTUALLY COLLECTING?

As noted by Nguyen and colleagues, users of Twitter tend to be younger and more diverse than the general US population—perhaps unsurprising, given that the users of Twitter are not confined to the US population. However, race and racial sentiment are socially constructed variables, and so analysis of racial sentiment on social media requires restricting to a clear cultural context.2 Nguyen and colleagues do this by limiting to account holders located in the United States. Unfortunately, the design of the Twitter system makes it difficult, if not impossible, to identify all US account holders.

Twitter users have the option to share or hide their location information, and when activated, this setting adds the user's geographical location at the time the content is created. Thus, many users who reside in the United States will not have their location information available, and others may have content tagged as US-based when traveling. The dataset of tweets tagged as US content will therefore contain neither all US-based users nor only US-based users. This will likely result in sampling bias because users who are frequently victims of harassment and users who are frequently perpetrators of harassment may both be more likely to opt out of sharing location data for privacy reasons. Further complicating the ability to understand whose data are represented, Twitter users have the option to make their entire account private at any time—users whose account is currently private are not included in the data available to researchers, regardless of where they reside or tweet from.

WHAT DATA ARE WE ACTUALLY COLLECTING?

Although social media sites are often thought of as a type of public square, this analogy is misleading. The image of a public square conveys a shared space where all individuals have the same opportunities to share and engage with all content (louder voices reaching wider audiences notwithstanding). This does not match the format of social media sites like Twitter, however. In these spaces, each individual user is presented with an algorithmically curated set of content unique to their past engagement on the site. The classic friendship paradox of networks3 shows that, because highly connected people are overrepresented among anyone's friends, most people have fewer friends than their friends do, on average. Similarly, we can expect that, on average, the tweets an individual is shown will be more highly viewed than the tweets that individual creates. In contrast, data available to be sampled from the Application Programming Interface will include a large amount of content with few or no views and a smaller amount of content with high views.
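To see the implication concretely, consider a minimal simulation sketch (our illustration, with an assumed heavy-tailed view distribution, not the authors' analysis): a tweet drawn uniformly at random, as from the Application Programming Interface, typically has few views, whereas a tweet drawn in proportion to how often it is seen, as in a curated feed, is typically a heavily viewed one.

import numpy as np

rng = np.random.default_rng(0)
views = rng.zipf(2.5, size=100_000).astype(float)   # hypothetical heavy-tailed view counts

api_sample = rng.choice(views, size=10_000)                          # uniform over tweets (API-like)
feed_sample = rng.choice(views, size=10_000, p=views / views.sum())  # weighted by views (feed-like)

print("mean views, API-style sample :", api_sample.mean())
print("mean views, feed-style sample:", feed_sample.mean())

The gap between the two sample means illustrates how different a researcher's sampled corpus can look from the content users actually experience.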

This has important implications for the interpretation of social media data, because it means researchers cannot easily recreate the experience of using Twitter. Nor, however, can researchers easily recreate the full corpus of created tweets. In their paper, Nguyen and colleagues acknowledge the first issue, but base their conclusions on an assumption that the racial sentiment patterns in their data accurately reflect the patterns in created content—and in society as a whole.

We have already discussed how privacy settings can affect content sampling. In addition, content sampling will be highly dependent on the time of data collection. This is because social media content can generally be deleted at any time, by both users and administrators. Of the many possible reasons for removing content, the use of content moderation is most relevant here. Content moderation occurs when users or moderators flag content as inappropriate, offensive, or illegal in nature. This typically results in content review by the platform, followed by either administrative deletion or a request for user deletion if the content in fact violates platform rules. Typically, tweets that have been deleted are not available to researchers attempting to create a new dataset after the date of deletion. This results in missing data, and this missingness is more likely for both older content and racially negative content. Although not always deemed a violation, race-related hate speech is often flagged for moderation and therefore more likely to be missing.

Finally, when curating a social media dataset, the inclusion criteria often include some indication of content topic. Nguyen and colleagues use a database of race-related terms in creating their dataset. The choice of keywords will be an important determinant of both the inclusion and classification of content. Keywords—particularly race-related terms—can have a sentiment valence, and the distribution of positive, negative, and neutral keywords may impact comparisons between groups. For example, in their list of keywords, Nguyen et al. include several strongly negative sentiment racial slurs as identifiers for Black-related tweets, whereas keywords for Asian-related tweets are neutral terms for ethnic or national origin (“Asian,” “Asians,” “Filipino,” “Japanese,” “Korean,” “Nepal,” “pacific islander,” “Thai,” “Vietnamese,” “Chinese”).
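The mechanics of keyword-based inclusion can be sketched in a few lines (our illustration; the keyword lists below are abbreviated, hypothetical placeholders rather than the authors' full lexicon). The point is that when a group's inclusion keywords are themselves negatively valenced, every tweet selected into that group's subset starts with a negative tilt, whereas a subset selected on neutral national-origin terms does not.

# Hypothetical, abbreviated keyword lists for illustration only.
keywords = {
    "asian": ["asian", "filipino", "japanese", "korean", "vietnamese", "chinese"],
    "black": ["black", "african american"],  # the authors' list also includes slurs, which carry negative valence on their own
}

def tag_groups(tweet_text):
    """Return the set of groups whose keywords appear in the tweet."""
    text = tweet_text.lower()
    return {group for group, terms in keywords.items()
            if any(term in text for term in terms)}

print(tag_groups("Celebrating Korean and Black filmmakers tonight"))  # both groups tagged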

HOW WELL ARE WE QUANTIFYING THE DATA?

Machine learning methods for sentiment analysis provide a rapid way to assign quantitative values to a large corpus of text, and without these methods analyzing datasets such as that collected by Nguyen et al. would be prohibitively time- and labor-intensive. But algorithms are not people, and it is important to consider what aspects of the text are visible and invisible to the algorithms, and how these might impact the results of the analysis.

A variety of machine learning approaches to analyzing text exist, but the most frequently used methods, including those used by Nguyen et al., share certain features. First, many machine learning packages for sentiment analysis cannot handle emojis or emoticons. The standard practice is thus to remove these from the input dataset before running the algorithm. This is not, however, an assumption-free practice, because emojis may be used as intensifiers, for negation, to indicate satire, or to provide emotional context. This is also true of other features of Twitter posts such as images, gifs, and hashtags—all of which are generally stripped from the database before the sentiment analysis algorithm can be used. Sarcasm and irony are common in both online and offline race-related content, where they are used both to challenge or disarm racial stereotypes and to provide plausible deniability for the expression of negative racial sentiment.4–7 On social media, sarcasm and irony may be particularly likely to be encoded via the use of hashtags or other extra-textual information, where the extra information is required to provide the shift in evaluative valence.8 When the intended valence of the content differs from the literal valence, even humans will often misclassify content that has been stripped of context.9 Choices about data cleaning can thus have important implications for the resulting analytic conclusions. Nguyen and colleagues report the removal of special characters, emojis, and accompanying nontextual information, including hash symbols (e.g., “#not” becomes “not”).
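As a concrete illustration of this point, the sketch below (a minimal example of common preprocessing practice, not the authors' exact pipeline) shows how stripping hash symbols and emoji can invert the apparent valence of a sarcastic tweet.

import re

def clean(tweet_text):
    text = tweet_text.replace("#", "")         # "#not" becomes "not"
    text = re.sub(r"[^\w\s.,!?']", "", text)   # drops emoji and other special characters
    return text

print(clean("Wow, what a thoughtful take 🙄 #not"))
# The eye-roll emoji and "#not" marked the sarcasm; the cleaned text that the
# algorithm sees reads, literally, as mildly positive.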

Once the data have been cleaned, researchers must choose how to turn the text-based content into numerical summary information. A simple approach is to create a polarity metric based on whether the overall tone of an item of content is positive, negative, or neutral. Aside from the issue of context, this is a relatively straightforward way to summarize attitudes present in the text. However, the complexity of language means that the interpretation of the polarity measure may not be consistent across topics. This is particularly salient when evaluating content on politically or emotionally charged topics, where a negative sentiment polarity for a particular item of content could nevertheless indicate overall positive sentiment towards the particular racial or ethnic group referenced. In addition, assigning polarity to content is generally done via the use of a human-coded training dataset. Nguyen and colleagues report that their training content was obtained from a mix of publicly available prelabeled datasets (Sentiment140, Kaggle, and Sanders) and a new dataset with labels assigned by graduate students. The authors do not provide details on the instructions given to the student coders, but it is of note that the analytic code uses the terminology “happy” and “sad” rather than the more generic “positive” and “negative” categories reported in the manuscript.
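A minimal sketch of this general supervised workflow (our illustration, with hypothetical toy data, not the authors' model) makes the dependence on coder labels explicit: whatever distinctions the coders drew are the only distinctions the fitted model can reproduce at scale.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical coder-labeled training tweets.
train_text = ["love this community", "what a beautiful day",
              "this is awful", "so tired of this nonsense"]
train_labels = ["happy", "happy", "sad", "sad"]   # coders label "happy"/"sad" rather than "positive"/"negative"

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_text, train_labels)

print(model.predict(["really awful"]))  # likely ['sad'], because "awful" appeared only in a "sad"-labeled tweet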

The choice of the training dataset, the approach to coding the training content, and the variability between individual coders can all affect the resulting estimated sentiment when the model is applied to the analytic dataset. As an example, consider the reported finding that the proportion of negative sentiment speech tagged “Asian” rose following the March 2021 murders of 8 people, 6 of them Asian women, by a white man at 3 Atlanta spas. Can we interpret this as an increase in the actual level of negative anti-Asian sentiment? A partial answer may lie in the fact that the primary slogan used by those protesting violence against Asians since February 2020 has been “Stop Asian Hate.”10 This phrase uses a double negative to express an overall positive sentiment towards Asians, via condemnation of negative sentiments. Humans can detect this, but would the algorithm assign a positive or negative sentiment to this phrase—does combining 2 negative terms into “stop hate” create, algorithmically, a stronger negative sentiment, or do they counteract each other to make the sentiment more positive? Because Nguyen and colleagues share their code, we were able to test this: their algorithm did in fact score “Stop Asian Hate” as a negative sentiment (though when fed the hashtag as a single word without the hash symbol, i.e., “stopAsianhate,” the algorithm reported a neutral sentiment).
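Readers can probe this behavior with a stand-in check using the off-the-shelf VADER lexicon scorer (explicitly not the authors' trained model): “hate” carries a strong negative weight in the lexicon, nothing in the phrase signals the double negative, and the run-together hashtag form is simply an unknown token.

# Requires: pip install nltk, then nltk.download('vader_lexicon').
from nltk.sentiment import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores("Stop Asian Hate"))  # compound score is negative
print(sia.polarity_scores("stopasianhate"))    # unknown token, scored as neutral

That a generic lexicon scorer reproduces the same pattern suggests the issue is not specific to any one model but to any algorithm lacking the pragmatic context humans bring to the phrase.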

WHAT ARE WE MISSING WHEN WE RELY ON ALGORITHMS?

So far, we have focused on issues that likely could be minimized if all analyses were done solely by human reviewers on the full content of the tweet (however implausible that may be!). But there is a larger issue with internet content which cannot be solved so easily. Much of the internet is publicly accessible, but many people use it to share ideas that are widely acknowledged as reprehensible and with which they do not want to be publicly associated. The hosting websites also generally do not want to be associated with this type of content. This creates an ecosystem governed by a Red Queen Hypothesis:11 a constant arms race between the creation and use of dog-whistles and coded racist content, and detection and deletion of that content by social media platforms. This has important implications for both the collection and assessment of racial sentiments over time, and the assessment of individual experiences of racial discrimination. Individuals from minoritized groups are often highly aware of the existence of these terms and how they evolve over time; indeed, being aware of them can be an important method of self-protection in online spaces.

Coded hate speech can be as seemingly subtle as a pair of numbers inserted into an otherwise unrelated set of text.12,13 Alternatively, it may reference racist memes such as the phrase “how would you have felt if you had not eaten breakfast?” For those not familiar with racist dog-whistles, this may seem like a completely innocuous, neutral sentiment statement, with no particular relevance to any individual racial or ethnic group. That perception is false: the phrase, referred to as “The Breakfast Question,”14 originates from a 4chan post in which a supposed researcher claims that people with IQ values below 90 cannot understand a question such as this, and it has been adopted by racists as a way to imply that Black people have low IQs. Unfortunately, without deep knowledge of the racist dog-whistles used against specific racial and ethnic groups, identifying content like “The Breakfast Question” can be challenging even for human reviewers—particularly if the images, gifs, links, or emojis that often accompany content of this type are removed from the tweets. If we cannot expect the average human reviewer to detect content with this type of coded racial sentiment, nor to correctly interpret the sentiment underlying this content, then we certainly cannot expect machine learning algorithms to do so either. Worse, the fact that these dog-whistles are frequently adapting and changing to avoid moderation filters means that descriptions or comparisons of racial sentiment in internet databases over time cannot rely on a time-invariant sentiment-assignment algorithm if they hope to detect this content.

CONCLUSION

Social media data provide a view into the personal opinions and experiences of millions of people in a way that has seldom before been available. Finding ways to use these data to better understand and improve the health of our communities is an important and worthwhile goal. However, doing so will not be without challenges. We focused here on data collection, cleaning, and measurement issues. We hope that the issues we present here provide a starting point for discussions about how to better understand the information obtainable from these data sources. Solving these problems will, however, be only the first step in designing epidemiologic analyses of social media data; many questions about the design and interpretation of descriptive, predictive, and causal studies of these data remain unasked and unanswered.

ABOUT THE AUTHOR

Eleanor J. Murray is an Assistant Professor of Epidemiology at Boston University School of Public Health, Associate Editor for Social Media for the American Journal of Epidemiology, and cohost of the podcast Casual Inference. Kareem C. Carr is a doctoral candidate in Biostatistics at the Harvard TH Chan School of Public Health. Murray and Carr both engage in science communication on Twitter (now called X) (@epiellie and @kareem_carr), providing education in epidemiology and statistics concepts, and the relationship between data, statistics, and society, to over a hundred thousand followers each. They have personal experience with many of the complex and varied ways that racism, misogyny, and anti-immigrant sentiment are conveyed via social media.

REFERENCES

1. Nguyen TT, Merchant JS, Yue X, et al. A decade of tweets: visualizing racial sentiments towards minoritized groups in the United States between 2011 and 2021. Epidemiology. 2023;35.
2. Ford CL, Griffith DM, Bruce MA, Gilbert KL, eds. Racism: Science & Tools for the Public Health Professional. American Public Health Association; 2019.
3. Feld SL. Why your friends have more friends than you do. AJS. 1991;96:1464–1477.
4. Pauwels M. Anti-racist critique through racial stereotype humour: what could go wrong? Theoria. 2021;68:85–113.
5. Mudambi A. Racial satire, race talk, and the model minority: South Asian Americans speak up. South Commun J. 2019;84:246–256.
6. Miller SS, O’Dea CJ, Lawless TJ, Saucier DA. Savage or satire: individual differences in perceptions of disparaging and subversive racial humor. Pers Individ Dif. 2019;142:28–41.
7. Douglass S, Mirpuri S, English D, Yip T. “They were just making jokes”: ethnic/racial teasing and discrimination among adolescents. Cultur Divers Ethnic Minor Psychol. 2016;22:69–82.
8. Kunneman F, Liebrecht C, van Mulken M, van den Bosch A. Signaling sarcasm: from hyperbole to hashtag. Inf Process Manage. 2015;51:500–509.
9. González-Ibáñez R, Muresan S, Wacholder N. Identifying sarcasm in Twitter: a closer look. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers, Vol. 2. HLT ’11. Association for Computational Linguistics; 2011:581–586.
10. Our Origins. Stop AAPI Hate. Available at: https://stopaapihate.org/our-origins/. Accessed August 29, 2023.
11. Van Valen L. A new evolutionary law. Evol Theory. 1973;1:1–30.
12. 1488. Hate Symbols Database. ADL. Available at: https://www.adl.org/resources/hate-symbol/1488. Accessed August 29, 2023.
13. 13/52 & 13/90. Hate Symbols Database. ADL. Available at: https://www.adl.org/resources/hate-symbol/1352-1390. Accessed August 29, 2023.
14. The Breakfast Question. Know Your Meme. Available at: https://knowyourmeme.com/memes/the-breakfast-question. Accessed August 1, 2023.
