A machine learning approach for a 15-year prediction model of liver cancer incidence: Results from two large Chinese population cohorts

Primary liver cancer (PLC) is among the most formidable global public health challenges. According to GLOBOCAN 2022, PLC ranks sixth in global cancer incidence and third in cancer-related mortality [1]. Despite significant advances in treatment, the prognosis of PLC remains poor because the disease is typically asymptomatic in its early stages, which often delays diagnosis. The five-year survival rate remains below 20% [2], [3]. Accurately identifying high-risk individuals within the general population and implementing timely interventions are therefore critical to reducing the burden of PLC.

Although several risk prediction models for PLC have been developed for specific high-risk populations, such as individuals with viral hepatitis or chronic liver disease, models tailored to the general population remain limited, particularly in high-prevalence regions such as China, which bears a disproportionate burden of the disease. Moreover, most existing models rely heavily on clinical indicators [4], [5], [6], which may limit their applicability in resource-constrained settings where access to laboratory or imaging data is restricted. The few existing models for the general population [7], [8] have important limitations in predictor scope, methodology, and performance. For instance, the model reported in reference [7] included only six predictors and showed modest predictive accuracy, with a 5-year area under the receiver operating characteristic curve (AUC) of 0.71 in the validation set, whereas the model reported in reference [8] relied heavily on five serological markers and non-prospective data, limiting its utility for broad screening. Beyond their individual limitations, these approaches share a more fundamental gap: the omission of detailed dietary factors. This gap matters because dietary factors are important risk modifiers whose effects are complex. For instance, high consumption of red meat has been linked to increased risk [9], whereas higher intake of vegetables, fruits, and dietary fiber appears to be protective [10]. These dietary factors often exhibit non-linear (e.g., J-shaped or U-shaped) dose-response relationships and may interact with one another. Such patterns are difficult to capture with traditional statistical models such as the Cox proportional hazards model or logistic regression [11].
It is precisely to address these complexities that machine learning algorithms have garnered increasing attention, due to their ability to accommodate non-linear relationships and high-dimensional data structures [12].
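
As a toy illustration of why a U-shaped dose-response relationship is hard for a score that is monotone in the exposure (as a single linear term would be), the sketch below compares the AUC of a linear score with that of a score that follows the U shape. The exposure levels, the outcome rule, and all variable names are hypothetical and synthetic; they are not drawn from the cohorts or from the models discussed here.

```python
def auc(scores, labels):
    """AUC as the probability that a random case outranks a random
    non-case; ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Synthetic exposure levels 0..10 in arbitrary units (assumption, not cohort data).
exposure = list(range(11))

# Toy U-shaped outcome: cases occur only at the low and high extremes.
labels = [1 if abs(x - 5) >= 4 else 0 for x in exposure]

# A score that rises monotonically with exposure cannot rank both
# extremes above the middle.
linear_score = [float(x) for x in exposure]

# A score that follows the U shape ranks both extremes highest.
u_shaped_score = [float((x - 5) ** 2) for x in exposure]

auc_linear = auc(linear_score, labels)
auc_u = auc(u_shaped_score, labels)
print(f"linear score AUC:   {auc_linear:.2f}")
print(f"U-shaped score AUC: {auc_u:.2f}")
```

In this constructed example the linear score discriminates no better than chance, while the score that follows the U shape separates cases from non-cases perfectly. Real dietary exposures are noisier, and the appropriate shape is generally not known in advance, which is part of the appeal of flexible learners.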

Despite the clear need and the suitability of machine learning for this task, few studies have applied machine learning to develop long-term PLC risk prediction models based on large-scale cohorts of the general Chinese population. In addition, practical and user-friendly tools suitable for clinical or public health settings are still lacking. To fill this gap, we used data from two large-scale prospective cohorts, the Shanghai Men’s Health Study (SMHS) and the Shanghai Women’s Health Study (SWHS), and integrated a wide range of modifiable lifestyle and dietary factors. Using machine learning algorithms, we aimed to develop a 15-year PLC risk prediction model, identify the optimal model through rigorous validation, and construct an online risk calculator to facilitate personalized risk assessment and support early prevention strategies in the general population.
