A comparison of methods for coding race in linear and logistic regression models

The persistence of racial/ethnic disparities in health, education, and wealth in the United States is an intractable problem that is extensively researched [9], [10], [18], [21]. However, securing a stringent, predictable definition of race is an equally intractable issue in research. What is meant by “race” or “ethnicity” may influence what can be inferred from a study’s findings. While socially assigned race reflects perceptions rooted in societal norms and interactions, race has also been defined in some research as genetic differences between groups. However, evidence shows minimal genetic variation between races, especially regarding health-related genes [33]. Therefore, race is more commonly considered a man-made construct based on environmental, social, and cultural characteristics [3], [27], [28], [33]. Race and ethnicity are not causal variables or risk factors, but they are important variables for social stratification. Despite the high likelihood of measurement error, race is often included as a research variable because of its predictive nature across many outcomes. This conceptualization aligns with the focus of this paper, which examines how the estimated association between race, treated as a social stratification variable, and health outcomes depends on the methods used to code race in regression models. Understanding the implications of race as a socially constructed variable is essential for addressing racial health inequities.

A substantial portion of the racial/ethnic disparities literature is based on secondary data [14]. When using secondary data, analysts are limited in how they can define and examine racial/ethnic differences. The use of race for statistical analysis depends on the research question and available data [28]. Race can be interpreted in myriad ways in statistical analyses, depending on how an individual's race is specified in the data source and on who specifies it. Race classifications have changed over time, from mutually exclusive racial categories to potentially multiple categories (as demonstrated in instructions on forms to check all that apply), posing statistical challenges for the interpretation of multiracial responses. A person may identify with more than one race; increasingly, people select “other race,” which has limited meaning when interpreting analyses [40]. Combining many small groups into “other race” further limits interpretation because the combined group lacks a clear meaning. Several US governmental agencies are considering new guidelines for the collection of race data. However, the amorphous concept of race will continue to pose statistical challenges in measurement and analysis. As the definition and understanding of race change, appropriate changes are necessary in statistical practice and the use of race in analyses [41].

In statistical analyses, race is a categorical variable (i.e., a unit of observation based on a qualitative aspect) [2], [32]. For qualitative data, a coding scheme must be applied to the variable used in the regression model [2], [38]. A coding scheme is a method used to group observations into mutually exclusive and exhaustive categories, where a numerical value is assigned to each observation based on the category it falls into [2], [32]. Using an appropriate coding scheme for racial/ethnic groups can help the analyst select which groups are compared for the research hypothesis. This allows studies to define groups for the comparisons they are interested in, despite differences in the definition of race categories across secondary data sources. Determining a priori how race will be categorized and coded in a regression analysis has important implications for the inferences that an analyst can make. In this analysis, we compare six coding schemes for a categorical race variable to demonstrate how an analyst's decisions impact the interpretation of results in linear and logistic regression models. The study’s primary goal is to compare and contrast each coding scheme applied to two regression models we developed: one predicting body mass index (BMI) as a continuous outcome (linear regression) and the other predicting diabetes status as a dichotomous outcome (logistic regression). The estimated coefficients are evaluated to assess whether the interpretation of results differs based on the coding scheme applied. Here, we focus on secondary data analysis using the race data available in regression analyses. We do not address the myriad challenges related to primary data collection of race data (e.g., how race should be defined and collected).
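As a rough sketch of the two models described above, with any additional covariates omitted for brevity, the coded race variable enters each model as k − 1 numeric columns. The symbols used here (k, c_{ij}, \beta_j) are notation assumed for illustration and are not taken from the source.

```latex
\begin{align}
\mathrm{BMI}_i &= \beta_0 + \sum_{j=1}^{k-1} \beta_j\, c_{ij} + \varepsilon_i \\
\operatorname{logit}\,\Pr(\mathrm{Diabetes}_i = 1) &= \beta_0 + \sum_{j=1}^{k-1} \beta_j\, c_{ij}
\end{align}
```

Here k is the number of race categories and c_{ij} is the value of the j-th coded column for person i. Any full-rank coding scheme yields the same fitted values; what changes is the comparison that each \beta_j (and the intercept \beta_0) represents.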

Race as a categorical variable can be included in regression analyses in many ways.

1) Dummy coding is the most commonly used method to incorporate race in regression analysis [19], [28]. A dummy-coded race variable captures the difference in the outcome of interest between a race group and the reference group [19]. When using dummy coding, a reference group must be selected; this choice should not be arbitrary and should be related to the hypothesis.

2) Simple effect coding allows the comparison of the mean of the dependent variable for one race with that of the selected reference group [38]. It will provide the same group comparisons as dummy coding if the reference groups are the same; only the interpretation of the intercept differs. This method, with an appropriate reference group, could be used to examine hypotheses related to ethnic differences between Hispanics and non-Hispanics. As with dummy coding, a reference group must be selected; this choice should not be arbitrary and should be related to the hypothesis.

3) Forward difference coding can be used to compare the mean of one race subgroup with the mean of the adjacent level (one level below or above). For example, in a study investigating the effects of a categorical race variable with four levels (e.g., Hispanic = 1, Asian = 2, African American = 3, and White = 4), Hispanics would be compared to Asians, Asians would be compared to African Americans, and African Americans would be compared to Whites [21]. It should be used when a meaningful order among the categories exists (e.g., low, medium, or high). Since race is not an ordinal variable, this coding scheme would be appropriate in only a few instances. Forward difference coding can show a natural progression of mean or risk when there is a dose-response relationship between racial groups and the outcome of interest. For demonstration purposes, we include this method in the case study despite not having a relevant hypothesis, but we caution others about its use.

4) Backward difference coding is the reverse of forward difference coding. Like forward difference coding, it should be used when there is a hypothesized order of racial/ethnic groups regarding the outcome.

5) Deviation coding can be used to compare the mean of the dependent variable for a racial group to the mean of the dependent variable for the population [2]. It should be used when researchers seek to compare the outcomes for each racial/ethnic group to those of the overall population or community level, allowing researchers to examine the deviation of any racial/ethnic group from the overall population mean. The population mean is estimated as the mean of the means for each racial/ethnic group; this is different from the overall mean calculated using the observations in the sample. Deviation coding can be used to examine hypotheses where there is no clear choice for a reference group and no a priori hypotheses about race.

6) Analyst-defined coding allows analysts to control comparisons, with the potential to avoid positioning any race group as normative [27]. It is used when there is a hypothesis that is not covered by standard coding schemes or when a custom comparison is not included in other schemes. This approach can be used to examine hypotheses for combined racial categories (e.g., Whites compared to Blacks and Hispanics combined). Researchers may want to compare nontraditional or combined groups. For example, Sewell and Jefferson defined race as a four-category variable that grouped participants as White non-Latinos, Black non-Latinos, Latinos, or Asian/Pacific Islander non-Latinos [35]. Lo et al. implemented a three-category variable for race (non-Hispanic Black, non-Hispanic White, or Hispanic) [29]. Miller et al. used a dichotomous variable for race (White, non-White) and defined ethnicity as non-Hispanic or Hispanic [19]. The group labels differ in the previous examples, and some of these race categories include different subgroups consolidated into one, limiting the detail that can be explored. Most of these categorizations combine race and ethnicity as a single variable, while one includes race and ethnicity as separate dichotomous variables.
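The six schemes above can be illustrated with a minimal sketch in Python. It is not the authors' analysis: the data are simulated, the group sizes and means are made up, the reference group is an arbitrary choice, and the helper name `coding_matrix` is hypothetical. The sketch uses a standard device: write each desired comparison as a row of weights on the group means (a "hypothesis matrix"), then invert that matrix to obtain the numeric codes. The analyst-defined rows are one possible choice, in which "combined" means the average of the two group means rather than a pooled sample.

```python
import numpy as np

# Category order follows the four-level example in the text.
labels = ["Hispanic", "Asian", "African American", "White"]
k = len(labels)

def coding_matrix(intercept_row, contrast_rows):
    """Invert a hypothesis matrix (weights on group means) to get k x (k-1) codes."""
    H = np.vstack([intercept_row, contrast_rows]).astype(float)
    return np.linalg.inv(H)[:, 1:], H

I = np.eye(k)
grand = np.full(k, 1.0 / k)   # unweighted mean of the group means
ref = 0                        # assumption: arbitrary reference group for the demo
others = [j for j in range(k) if j != ref]
vs_ref = I[others] - I[ref]    # each non-reference group minus the reference

# Each entry: (what the intercept estimates, what the k-1 slopes estimate).
schemes = {
    "dummy":     (I[ref], vs_ref),
    "simple":    (grand,  vs_ref),
    "forward":   (grand,  I[:-1] - I[1:]),   # level j minus level j+1
    "backward":  (grand,  I[1:] - I[:-1]),   # level j+1 minus level j
    "deviation": (grand,  I[:-1] - grand),   # group minus mean of group means
    "analyst":   (grand,  np.array([
        [-0.5, 0.0, -0.5, 1.0],              # White vs. mean of Hispanic and African American
        [-1.0, 0.0, 1.0, 0.0],               # African American vs. Hispanic
        [-1 / 3, 1.0, -1 / 3, -1 / 3],       # Asian vs. mean of the other three
    ])),
}

# Simulated, unbalanced data (all numbers are made up for illustration).
rng = np.random.default_rng(0)
sizes = [40, 25, 60, 75]
true_means = np.array([29.0, 25.0, 30.0, 28.0])
group = np.repeat(np.arange(k), sizes)
bmi = rng.normal(true_means[group], 5.0)
cell_means = np.array([bmi[group == g].mean() for g in range(k)])

results = {}
for name, (b0_row, rows) in schemes.items():
    C, H = coding_matrix(b0_row, rows)
    X = np.column_stack([np.ones(group.size), C[group]])  # intercept + coded columns
    beta, *_ = np.linalg.lstsq(X, bmi, rcond=None)        # ordinary least squares
    # Each coefficient equals its hypothesis row applied to the sample group means.
    assert np.allclose(beta, H @ cell_means)
    results[name] = beta
    print(f"{name:9s}", np.round(beta, 2))
```

In this sketch, the dummy and simple schemes return identical slopes and differ only in the intercept, and the deviation intercept is the unweighted mean of the group means rather than the overall sample mean. The same coding matrices could be used as predictors in a logistic regression, where the coefficients would be differences in log-odds instead of differences in means.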

Most of the epidemiological literature uses dummy coding, while some studies use analyst-defined coding based on group sample sizes and study context [14]. Forward difference and backward difference coding are rarely appropriate approaches for analyzing race and, thus, are not used in the literature. Some have argued for the benefits of effect coding over dummy coding in regression, but the uptake of these methods has been slow [2], [7], [16], [30], [34]. Deviation coding is rarely used for race, although it is sometimes used for other variables [26].
