Validation of scoring systems for the prediction of complicated appendicitis in adults using clinical and computed tomographic findings

Our investigation identified factors independently predictive of complicated appendicitis that are crucial to consider in the era of potential nonoperative management of acute appendicitis. We validated the diagnostic performance of 8 existing scoring systems and proposed a new scoring system to predict complicated appendicitis without the need for serum C-reactive protein. Of these, modified Atema, Kim HY, and our proposed scores showed similarly high AUCs with reasonably high sensitivities and modest specificities in the identification of complicated appendicitis.

Since 2015, multiple scoring systems have been proposed to identify appendicitis with complications, utilizing clinical-only [15,16,17,18], imaging-only [19], or both clinical and imaging data [5,6,7,8,9,10,11]. In this study, we validated eight systems that utilized both clinical features and CT findings, as these scores generally performed better than those that utilized only clinical or only CT features. Previous investigations have validated these models using traditional statistical methodology [10, 12, 13] and an artificial neural network [20]. Fujiwara et al. [13], Lin et al. [10], and Geerdink et al. [12] used 203 to 678 patients (52 to 175 with complicated appendicitis) for validation. In another study by Lin et al. [20], a dataset of 592 patients was split for training and validation of an artificial neural network.

The Atema score [5] was introduced in 2015, with an original sensitivity of 97% and specificity of 46% in the differentiation of complicated from uncomplicated appendicitis. The score demonstrated sensitivities from 64 to 90% and specificities from 51 to 95% in subsequent studies [10, 12, 13, 20]. Our investigation found that even with C-reactive protein excluded from the equation and the cutoff value reduced to ≥ 5, the modified Atema score still had the best performance, with a high AUC (0.831; 95% CI 0.787–0.875) and sensitivity (91%; 95% CI 84–95%). However, its specificity was only 61% (95% CI 53–68%).

Another scoring system that demonstrated promising results in our investigation was the Kim HY score [11]. In its original description, this score had an AUC of 0.81, a sensitivity of 93%, and a specificity of 28%. However, subsequent validations reported higher AUCs ranging from 0.84 to 0.92 and specificities between 88 and 100%, but lower sensitivities of 64% [10, 20]. Our study showed a balanced sensitivity and specificity of 73% (95% CI 64–81%) and 71% (95% CI 64–77%), respectively, indicating its potential usefulness. Other validated scoring systems showed varying results, with some demonstrating high specificity (Kim TH, Lin Model 2 scores) and others exhibiting variable performance (Imaoka, Avanesov, Khan, Lin Model 1 scores) [10, 13, 20].

When validated internally, the version of our proposed scoring system that used odds ratios demonstrated 100% sensitivity and 100% negative predictive value, allowing it to avoid misclassifying complicated appendicitis as uncomplicated, albeit at a moderate specificity. It was also simpler than the modified Atema score: it consisted of only 5 factors, did not require C-reactive protein, and summed to fewer total points.
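The link between 100% sensitivity and 100% negative predictive value follows directly from the definitions: both reach 100% exactly when there are no false negatives. The short Python sketch below illustrates this with a confusion matrix. The counts are hypothetical and chosen only for illustration; they are not the study's data, and the names `Confusion` and `diagnostic_metrics` are our own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """2x2 confusion matrix; 'positive' means complicated appendicitis."""
    tp: int  # complicated, score at or above cutoff
    fp: int  # uncomplicated, score at or above cutoff
    fn: int  # complicated, score below cutoff
    tn: int  # uncomplicated, score below cutoff


def diagnostic_metrics(c: Confusion) -> dict:
    """Return the standard diagnostic accuracy measures as fractions."""
    return {
        "sensitivity": c.tp / (c.tp + c.fn),
        "specificity": c.tn / (c.tn + c.fp),
        "ppv": c.tp / (c.tp + c.fp),
        "npv": c.tn / (c.tn + c.fn),
    }


# Hypothetical counts (NOT from the study): no complicated case is missed.
example = Confusion(tp=50, fp=40, fn=0, tn=60)
metrics = diagnostic_metrics(example)

for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

With zero false negatives, sensitivity and negative predictive value are both 1.0 regardless of the other counts, while specificity depends only on how many uncomplicated cases fall at or above the cutoff.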

The performance of other scoring systems in our evaluation was suboptimal. Specifically, the Khan score exhibited a lower AUC of 0.699 (95% CI 0.643–0.756), alongside moderate sensitivity (76%; 95% CI 67–83%) and specificity (48%; 95% CI 41–55%). Similarly, the modified Imaoka score demonstrated a lower AUC of 0.692 (95% CI 0.642–0.741), with moderate sensitivity (80%; 95% CI 72–81%) and specificity (58%; 95% CI 51–65%). Both of these were validated by Lin et al. [10], who reported similar diagnostic performance for predicting complicated appendicitis. Additionally, the Imaoka score had been validated by other studies [13, 20], which revealed inconsistent diagnostic performance. The modified Kim score exhibited very high sensitivity (98%; 95% CI 94–100%) but low specificity (23%; 95% CI 17–29%), limiting its utility. Notably, our results diverged significantly from the validations performed by Lin et al. [10, 20], who reported the original score as having much lower sensitivity but higher specificity.

When comparing the elements within the scoring systems that exhibited optimal vs. suboptimal performance, the factors contributing the most to enhanced performance were CT findings. Notably, the presence of extraluminal air, which was found in the modified Atema, Kim HY, and our proposed scores but absent in the modified Imaoka, Kim, or Khan scores, played a significant role. Additionally, the presence of appendicolith, which was included in the modified Atema and our proposed score but excluded from the modified Imaoka and Kim scores, also contributed to improved performance.

While our investigation provided a detailed evaluation of the performance of existing scoring systems, several limitations need to be acknowledged. Firstly, our study was retrospective and conducted in a single center with a small sample size. As appendectomy remained the standard of care for acute appendicitis in our hospital, we were unable to fully evaluate the success rate of nonoperative management. However, our approach allowed us to use pathological results as the reference standard for the diagnosis of complicated appendicitis. Secondly, the absence of C-reactive protein data in most patients prevented us from validating some scores in full. However, this allowed us to test the scores without C-reactive protein and to demonstrate that the modified Atema score still performed well. Thirdly, we designed our endpoint to prioritize high sensitivity for detecting complicated appendicitis, rather than balancing sensitivity and specificity. This approach supports patient safety by avoiding the selection of patients with complicated appendicitis for nonoperative management. Fourthly, we did not validate scores that utilized only clinical factors [16,17,18], as they did not apply to our target population. Cross-sectional imaging is necessary for safe selection of nonoperative management in this condition, even in young individuals [3, 21]. The scores proposed by Mahankali et al. [19], which utilized purely CT findings, were not validated in our study due to incomplete data. Additionally, we believe that some data points, including grading of periappendiceal fat stranding [10], may pose a challenge in terms of real-world applicability as they are subjective.
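A sensitivity-first endpoint such as the one described above can be sketched as a simple cutoff search: among candidate cutoffs, take the highest one whose sensitivity still meets a chosen target, and accept whatever specificity results. The Python sketch below is only an illustration under assumed inputs; the scores, labels, target, and function names are hypothetical and are not the procedure or data used in this study.

```python
def sensitivity_specificity(scores, labels, cutoff):
    """Treat score >= cutoff as predicting complicated appendicitis (label 1)."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= cutoff)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < cutoff)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < cutoff)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= cutoff)
    return tp / (tp + fn), tn / (tn + fp)


def highest_cutoff_meeting_sensitivity(scores, labels, target):
    """Return (cutoff, sensitivity, specificity) for the highest cutoff
    whose sensitivity is at least `target`, or None if no cutoff qualifies."""
    for cutoff in sorted(set(scores), reverse=True):
        sens, spec = sensitivity_specificity(scores, labels, cutoff)
        if sens >= target:
            return cutoff, sens, spec
    return None


# Hypothetical scores and pathology labels (1 = complicated), NOT study data.
scores = [1, 2, 2, 3, 4, 5, 5, 6, 7, 8]
labels = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

cutoff, sens, spec = highest_cutoff_meeting_sensitivity(scores, labels, target=1.0)
print(f"cutoff >= {cutoff}: sensitivity {sens:.2f}, specificity {spec:.2f}")
```

Lowering the cutoff can only keep or raise sensitivity and keep or lower specificity, so the highest qualifying cutoff gives the best specificity available under the sensitivity constraint.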

In conclusion, our study demonstrated that the modified Atema, Kim HY, and our proposed scores were effective in predicting complicated appendicitis with high AUC and reasonable sensitivities. These scores have the potential to aid in the safe selection of patients for nonoperative management. However, further validation is required in larger, multicenter studies with a diverse patient population. Recent publications have shown that artificial neural networks may play a crucial role in this regard [20, 22]. Additionally, it is important to note that a prospective trial [23] focused on this issue is currently ongoing, and its results are eagerly awaited to further guide clinical decision-making.
