Trans-Balance: Reducing demographic disparity for prediction models in the presence of class imbalance

Promoting diversity and inclusion is crucial in precision medicine research, as many demographic sub-populations remain underrepresented despite the availability of large-scale clinical data [1]. For many clinical conditions, one bottleneck is poor predictive performance for underrepresented populations when heterogeneity across demographic populations is not negligible [2], [3]. In this work, we consider two types of data imbalance. The first is demographic disparity, which refers to scenarios where there are systematic biases in risk estimation across demographic groups [4], [5]. The second is class imbalance, which refers to scenarios where the outcomes (classes) being considered are not represented equally. In the binary outcome setting, the class with more data samples is called the majority class (e.g., healthy control), while the class with fewer data samples is called the minority class (e.g., disease) [6].

In the presence of class imbalance, conventional metrics such as the area under the receiver operating characteristic curve (AUROC) are predominantly influenced by performance on the majority class. More appropriate metrics, such as the area under the precision–recall curve (AUPRC), F1 score, sensitivity, and positive predictive value (PPV), provide a more balanced and meaningful measure of model performance across both minority and majority classes. To avoid confusion from using “minority” for two different issues, we use the term “underrepresented demographic group” to denote the demographic group with fewer data samples, and “minority class” to denote the outcome class with fewer data samples, which is the disease class (termed C1 for simplicity) in our setting.
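As a small illustration of why the choice of metric matters under class imbalance, the sketch below computes sensitivity, false positive rate, PPV, and F1 from a single confusion matrix. The counts (10 disease cases among 1,000 subjects) and the function name `confusion_metrics` are hypothetical and chosen only for illustration; they are not taken from the study data.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Return threshold-based metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0   # recall on the minority class
    fpr = fp / (fp + tn) if (fp + tn) else 0.0           # x-axis of the ROC curve
    ppv = tp / (tp + fp) if (tp + fp) else 0.0           # precision
    denom = ppv + sensitivity
    f1 = 2 * ppv * sensitivity / denom if denom else 0.0
    return {"sensitivity": sensitivity, "fpr": fpr, "ppv": ppv, "f1": f1}


# Hypothetical imbalanced cohort: 10 disease cases (C1) and 990 controls.
# The classifier flags 8 of the 10 cases and raises 50 false alarms.
metrics = confusion_metrics(tp=8, fp=50, fn=2, tn=940)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this toy setting the quantities that define a ROC point look favorable (sensitivity 0.8 at a false positive rate of about 0.05), yet fewer than one in seven flagged subjects actually has the disease. PPV and F1 expose this because they depend on the number of false positives relative to the small minority class rather than relative to the large majority class.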

We summarize the related work in Table 1. Transfer learning [7] was developed as a powerful framework for leveraging richer source data to facilitate medical discoveries in limited target data. [8] studied transfer learning in high-dimensional settings from a factor model perspective. [9] proposed a minimax optimal transfer learning algorithm for high-dimensional linear models. [10] studied minimax optimal and adaptive methods for non-parametric classification in the transfer learning setting. [11] used transfer learning to port the knowledge learned from diverse populations across multiple institutions to an underrepresented population, without sharing (combining) individual patient data. [12] proposed an inductive and model-assisted transfer learning approach that is robust to both covariate shift and model shift between the source and target populations.

For class imbalance, a widely used set of bias-correction approaches leverages sampling schemes such as oversampling. [13] proposed the SMOTE (Synthetic Minority Oversampling Technique) algorithm. Since then, many variants of and alternatives to SMOTE have been proposed. For example, [14] proposed a noise-reduced imbalanced data classification method based on active learning SMOTE. [15] investigated the properties of SMOTE from a theoretical and empirical point of view, using simulated and real high-dimensional data. [16] proposed ROSE (Random Oversampling Examples), a smoothed bootstrap-based technique for model estimation and assessment under imbalanced class distributions. However, existing sampling techniques focus on single-site studies and are not specifically designed to address demographic disparity.

Beyond the recent advances in transfer learning and imbalance learning, it has been shown that integrating larger collections of data from multiple studies may improve model performance in both underrepresented demographic groups and minority classes [11], [18]. In this work, we consider a general, and arguably practical, setting where data from multiple populations are stored at multiple sites with privacy constraints. More importantly, we consider a general setting in predictive modeling where the sample sizes in different outcome classes are highly imbalanced. We summarize the statement of significance in Table 2. Our contributions are twofold. First, we evaluate the prediction ability of existing transfer learning methods and sampling methods and propose a new framework, Trans-Balance, to address demographic disparity in the presence of class imbalance. Second, the proposed framework incorporates heterogeneous data from diverse populations and multiple cohorts to improve model fitting and prediction in an underrepresented population.
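For readers unfamiliar with the oversampling schemes reviewed above, the following is a minimal sketch of the interpolation idea behind SMOTE [13]: a synthetic minority sample is placed at a random point on the segment between a minority sample and one of its nearest minority-class neighbors. It is a simplified illustration, not the reference implementation and not part of the Trans-Balance framework; the function name `smote_like_oversample`, its parameters, and the toy data are assumptions made for this example.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic points by interpolating between minority samples.

    minority: list of feature vectors (lists of floats) from the minority class.
    k: number of nearest minority-class neighbors considered for each base point.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base point.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbor = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from base to neighbor
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


# Toy minority class with four samples in two dimensions (hypothetical data).
minority_samples = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_like_oversample(minority_samples, n_new=5)
print(len(new_points))
```

Because every synthetic point is a convex combination of two observed minority samples, the generated data stay inside the region spanned by the minority class. The sketch operates on a single pooled dataset, which matches the single-site focus of existing sampling techniques noted above.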
