Background: While skin cancers are less prevalent in people with skin of color, they are more often diagnosed at later stages and have a poorer prognosis. The use of artificial intelligence (AI) models can potentially improve early detection of skin cancers; however, the lack of skin color diversity in training datasets may only widen the pre-existing racial discrepancies in dermatology. Objective: The aim of this study was to systematically review the technique, quality, accuracy, and implications of studies using AI models trained or tested in populations with skin of color for classification of pigmented skin lesions. Methods: PubMed was used to identify any studies describing AI models for classification of pigmented skin lesions. Only studies that used training datasets with at least 10% of images from people with skin of color were eligible. Outcomes on study population, design of the AI model, accuracy, and quality of the studies were reviewed. Results: Twenty-two eligible articles were identified. The majority of studies used AI models trained on datasets obtained from Chinese (7/22), Korean (5/22), and Japanese (3/22) populations. Seven studies used diverse datasets containing Fitzpatrick skin types I–III in combination with at least 10% of images from black Americans, Native Americans, Pacific Islanders, or Fitzpatrick types IV–VI. AI models producing binary outcomes (e.g., benign vs. malignant) reported an accuracy ranging from 70% to 99.7%. Accuracy of AI models reporting multiclass outcomes (e.g., specific lesion diagnosis) was lower, ranging from 43% to 93%. Reader studies, in which dermatologists' classification is compared with AI model outcomes, reported similar accuracy in one study, higher AI accuracy in three studies, and higher clinician accuracy in two studies. A quality review revealed that dataset description and variety, benchmarking, public evaluation, and healthcare application were frequently not addressed. Conclusions: While this review provides promising evidence of accurate AI models in populations with skin of color, the majority of the studies reviewed were obtained from East Asian populations and therefore provide insufficient evidence to comment on the overall accuracy of AI models for darker skin types. Large discrepancies remain in the number of AI models developed in populations with skin of color (particularly Fitzpatrick types IV–VI) compared with those of largely European ancestry. A lack of publicly available datasets from diverse populations is likely a contributing factor, as is the inadequate reporting of patient-level metadata relating to skin color in training datasets.
© 2023 The Author(s). Published by S. Karger AG, Basel
Introduction
Skin cancer is the most common malignancy worldwide, with melanoma representing the deadliest form. While skin cancers are less prevalent in people with skin of color, they are more often diagnosed at a later stage and have a poorer prognosis when compared to Caucasian populations [1–3]. Even when diagnosed at the same stage, Hispanic, Native, Asian, and African Americans have a significantly shorter survival time than Caucasian Americans (p < 0.05) [4]. Skin cancers in people with skin of color often present differently from those in Caucasian skin and are often underrepresented in dermatology training [5, 6].
The use of artificial intelligence (AI) algorithms for image analysis and detection of skin cancer has the potential to decrease healthcare disparities by removing unintended clinician bias and improving accessibility and affordability [7]. Skin lesion classification by AI algorithms has to date performed equivalently to [8] and, in some cases, better than dermatologists [9]. Human-computer collaboration can increase diagnostic accuracy further [10]. However, most AI advances have used homogeneous datasets [11–15] collected from countries with predominantly European ancestry [16]. Exclusion of skin of color from training datasets poses the risk of incorrect diagnoses or of missing skin cancers entirely [8] and risks widening racial disparities that already exist in dermatology [8, 17].
While multiple reviews have compared AI-based model performance for skin cancer detection [18–20], the use of AI in populations with skin of color has not been evaluated. The objective of this study was to systematically review the current literature on AI models for the classification of pigmented skin lesion images in populations with skin of color.
Methods
Literature Search
The systematic review follows the PRISMA guidelines [21]. A protocol was registered with PROSPERO (International Prospective Register of Systematic Reviews) and can be accessed at https://www.crd.york.ac.uk/prospero/display_record.php?ID=CRD42021281347.
A PubMed search in March 2021 used search terms relating to artificial intelligence, skin cancer, and skin lesions (search strings in online suppl. eTable; for all online suppl. material, see www.karger.com/doi/10.1159/000530225). No date range was applied, language was restricted to English, and only original research was included. Covidence software was used for screening administration. Titles and abstracts of the search results were screened by two independent reviewers (Y.L. 100% and B.B.S. 20%) using the eligibility criteria described in Table 1. The remaining articles were assessed for eligibility by reviewing the methods or full text. Disagreements were resolved following discussion with a third independent reviewer (C.P.).
Table 1. Inclusion and exclusion criteria used for screening and assessing eligibility of articles
[Table columns: inclusion criteria; exclusion criteria. Full table not reproduced here; first inclusion criterion: 1. Any computer modeling or use of AI on diagnosis of skin conditions.]
Data extraction was performed using a standardized form by author Y.L. and confirmed by V.K. The following parameters were recorded: reference, ethnicity/ancestry/race, lesion number, sex, age, location, skin condition, public availability of dataset, number of images, type of images, methods of confirmation, deep learning system, model output, comparison with human input, and any missing data reported. Algorithm performance measures were recorded as accuracy, sensitivity, specificity, and/or area under the receiver operating characteristic curve. A narrative synthesis of the extracted data was used to present findings, as a meta-analysis was not feasible due to heterogeneity of the study designs, AI systems, skin lesions, and outcomes.
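For readers less familiar with these measures, the sketch below (illustrative only, not code from any of the reviewed studies) shows how accuracy, sensitivity, specificity, and AUC are derived from a binary benign/malignant classifier's outputs; the labels and scores are hypothetical, and scikit-learn is assumed to be available.

```python
# Illustrative example only (hypothetical data): how the performance measures
# extracted in this review relate to a binary benign/malignant classifier.
from sklearn.metrics import confusion_matrix, roc_auc_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]                    # ground truth: 1 = malignant, 0 = benign
y_score = [0.9, 0.2, 0.7, 0.4, 0.1, 0.6, 0.8, 0.3]   # model-predicted probability of malignancy
y_pred = [int(s >= 0.5) for s in y_score]            # binary prediction at a 0.5 threshold

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)   # overall proportion correct
sensitivity = tp / (tp + fn)                 # malignant lesions correctly flagged
specificity = tn / (tn + fp)                 # benign lesions correctly cleared
auc = roc_auc_score(y_true, y_score)         # threshold-independent discrimination

print(f"accuracy={accuracy:.2f}, sensitivity={sensitivity:.2f}, "
      f"specificity={specificity:.2f}, AUC={auc:.2f}")
```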
Quality Assessment
Quality was assessed using the Checklist for Evaluation of Image-Based Artificial Intelligence Reports in Dermatology (CLEAR Derm) Consensus Guidelines [22]. This 25-point checklist offers comprehensive recommendations on factors critical to the development, performance, and application of image-based AI algorithms in dermatology [22].
Author Y.L. performed the quality assessment of all included studies, and author B.B.S. assessed 20%. The inter-rater agreement rate was 87%, with disagreements resolved via a third independent reviewer (V.K.). Each criterion was evaluated as fully, partially, or not addressed and scored 1, 0.5, or 0, respectively, using the scoring rubric in online supplementary eTable 2.
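As a minimal sketch of this scoring scheme (illustrative only; the rating labels and values below are hypothetical, not taken from the reviewed studies), each checklist item maps to 1, 0.5, or 0, the per-study score is the sum over the items, and per-item agreement between two raters is the fraction of matching labels:

```python
# Illustrative sketch of CLEAR Derm-style scoring (hypothetical ratings).
SCORE = {"fully": 1.0, "partially": 0.5, "not": 0.0}

def study_score(ratings):
    """Sum the item scores for one study (one label per checklist item)."""
    return sum(SCORE[r] for r in ratings)

def agreement(rater_a, rater_b):
    """Fraction of checklist items on which two raters gave the same label."""
    return sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)

# Hypothetical ratings for a five-item excerpt of the 25-item checklist
rater_a = ["fully", "partially", "not", "fully", "fully"]
rater_b = ["fully", "partially", "fully", "fully", "fully"]
print(study_score(rater_a))         # 3.5 out of a possible 5.0 for this excerpt
print(agreement(rater_a, rater_b))  # 0.8 (4 of 5 items match)
```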
Results
The database search identified 993 articles, including 13 duplicates. After screening titles/abstracts, 535 records were excluded, and the remaining 445 records were screened by methods, with 63 articles reviewed by full text. Forward and backward citation searches revealed no additional articles. A total of 22 studies were included in the final review (PRISMA flow diagram in online supplementary eFig. 1).
Study Design
All 22 studies were performed between 2002 and 2021 [23–32], with 11 (50%) studies published between 2020 and 2021 [33–44]. An overview of study characteristics is displayed in Table 2. The median number of total images used in each study for all datasets combined was 5,846 (range: 212–185,192). The median dataset size for training, testing, and validation was 4,732 images (range: 247–22,608), 362 (range: 100–40,331), and 1,258 (range: 109–14,883), respectively.
Table 2. Overview of study characteristics
[Table columns: first author and year; patient population (ethnicity/ancestry/race/location; dataset information); public availability of dataset; image type/image number; validation (H = histology, C = clinical diagnosis); deep learning system; model output. Full table not reproduced here; first entry: Piccolo et al. (2002) [23], Fitzpatrick I–V, lesion n = 341.]
The majority of studies (15/22, 68%) analyzed clinical images (i.e., wide-field or regional images), while seven studies analyzed dermoscopy images [23, 24, 27, 29, 30, 40, 42] and one study included both [44]. All but one study included both malignant and benign pigmented skin lesions; the exception investigated only benign pigmented facial lesions [43].
Histopathology was used as the ground truth for all malignant lesions in 15 studies and partially in two studies [24, 26], while one study used histopathology only to resolve clinician disagreements [23]. Seven studies used histopathology as the ground truth for benign lesions [23, 27, 29, 34, 35, 41, 44]. In nine studies, ground truth was established by consensus of experienced dermatologists [25, 30–32, 38–40, 42, 43]. Other studies used a mix of both [24, 26, 33, 36] or did not clearly define the ground truth [28, 37].
The number of pigmented skin lesion classifications used for AI model evaluation ranged from binary outcomes (e.g., benign vs. malignant) to classification of up to 419 skin conditions [39]. While most studies (19/22, 86%) evaluated lesions across all body sites, one study exclusively analyzed the lips/mouth [33], another assessed only facial skin lesions [43], and one study specifically addressed acral melanoma [29].
Population
Homogeneous datasets were collected from Chinese/Taiwanese (n = 8, 36%) [25, 30, 32, 37, 40, 41, 43, 44], Korean (n = 5, 23%) [27–29, 33–35], and Japanese (n = 3, 14%) [31, 38, 42] populations. Seven studies (32%) included Caucasian/Fitzpatrick skin type I–III populations [23, 24, 26, 28, 36, 39, 42] in combination with at least 10% American Indian [26], Alaska Native [26], black or African American [26], Pacific Islander [26], Native American [26], or Fitzpatrick IV–VI [23, 39] representation in the training and/or test set (Table 2).
The majority of studies did not specify the sex distribution (n = 13, 59%) or participant age (n = 15, 68%). Seven studies specified participant age, which ranged from 18 to older than 85 years [23, 25, 26, 34–36, 41].
Outcome and Performance
The outcome of the classification algorithms was either a diagnostic model, a risk categorical model (e.g., low, medium, or high), or a combination of both. An overview of AI model performance is described in Table 3. The majority of studies (20/22, 91%) used a diagnostic model, either with binary classification of benign or malignant [23–27, 29, 33, 35, 37], multiclass classification of a specific lesion diagnosis [28, 30, 32, 39, 42, 43], or both [31, 34, 36, 38, 40, 41, 44]. One study used categorical risk as the outcome [32]. Another study reported both a diagnostic model and a risk categorical model [24].
Table 3. Measures of output and performance for AI models included in the review
Binary classification models (reference | accuracy, % | sensitivity, % | specificity, % | AUC):
Piccolo et al. (2002) [23] | n/a | 92 | 74 | n/a
Iyatomi et al. (2008) [24] | n/a | 86 | 86 | 0.93
Chang et al. (2013) [25] | 91 | 86 | 88 | 0.95
Chen et al. (2016) [26] | 91 | 90 | 92 | n/a
Yang et al. (2017) [27] | 99.7 | 100 | 99 | n/a
Yu et al. (2018) [29] | 82 | 93 | 72 | 0.80
Cho et al. (2020) [33] | n/a | Dataset 1: 76 … [row truncated in source]
[Remaining table rows not reproduced here.]
The AI models using binary classification (16/22) reported an accuracy ranging from 70% to 99.7%. Of these studies, 6/16 reported ≥90% accuracy [25–27, 31, 38, 41], three studies reported between 80 and 90% accuracy [29, 37, 44], and one study reported <80% accuracy [40]. Twelve AI models reported sensitivity and specificity as measures of performance, which ranged from 58 to 100% and 72 to 99%, respectively. Eight studies provided an area under the curve (AUC), with 5/8 reporting values >0.9 [24, 25, 35–37] and the remaining three models scoring between 0.77 and 0.86 [29, 33, 34].
For the 13 studies using multiclass output (i.e., >2 diagnoses), accuracy of models ranged from 43% to 93%. Six of these studies (6/13) scored <80% accuracy [31, 34, 36, 39, 41, 44], six others scored between 80 and 90% accuracy [30, 32, 38, 40, 42, 43], and one provided sensitivity and specificity of 86% and 86%, respectively, as a measure of performance [28].
Reader Studies
Reader studies, in which the performance of AI models is compared with clinician classification, were performed in 14/22 studies, with results provided in Table 4 [23, 25, 29, 31–39, 42, 44]. Six studies compared AI outcomes to classification by experts, e.g., dermatologists [25, 32, 34, 36, 42, 44]. Eight studies compared outcomes for both experts and non-experts, e.g., dermatology residents and general practitioners [23, 29, 31, 33, 35, 37–39].
Table 4. Reader studies between AI models and human experts (e.g., dermatologists) and non-experts (e.g., dermatology residents, GPs)
[Table columns: reference; AI performance; expert performance; non-expert performance. Full table not reproduced here; first entry: Piccolo et al. (2002) [23], AI sensitivity 92%.]
In reader studies comparing binary classification between AI and experts (n = 11), one study reported similar diagnostic accuracy/specificity [29], three showed higher accuracy for AI models [25, 31, 38], and two reported higher accuracy in experts [42, 44]. Five studies reported specificity, sensitivity, and AUC instead of accuracy, with varying outcomes [23, 32, 33, 35, 37]. For reader studies between AI and non-experts (n = 7), AI showed higher accuracy, specificity, sensitivity, and AUC in most studies [23, 29, 31, 33, 35, 37