Background Due to its late stage of diagnosis, lung cancer is the commonest cause of death from cancer in the UK. Existing epidemiological risk models in clinical use, which have Positive Predictive Values (PPV) of less than 10%, do not consider the temporal relations expressed in sequential electronic health record (EHR) data. Machine learning with deep ‘transformer’ models can learn from these temporal relationships. We aimed to build such a model for lung cancer diagnosis in primary care using EHR data.
Methods In a nested case-control study within the Whole Systems Integrated Care (WSIC) dataset, we identified lung cancer cases and controls with ‘other’ cancers or respiratory conditions. GP EHR data covering the three years before the date of diagnosis, excluding the most recent month, were semantically pre-processed by mapping more than 30,000 terms to 450. Models were built using ALBERT with a Logistic Regression Classifier (LRC) head. Clustering was explored using k-means. We split the data into 70% training and 30% validation. A regression model alone was additionally built on the pre-processed data as a comparator.
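The observation window described in Methods (three years before diagnosis, excluding the most recent month) can be illustrated with a minimal Python sketch. This is not the study's code: the function names, the 365-day year, the 30-day month, and the half-open interval are all assumptions made for illustration.

```python
from datetime import date, timedelta


def pathway_window(index_date, lookback_days=3 * 365, exclusion_days=30):
    """Return (start, end) of the observation window before the index date.

    Assumption: a year is 365 days and a month is 30 days; the source does
    not state how these periods were measured.
    """
    start = index_date - timedelta(days=lookback_days)
    end = index_date - timedelta(days=exclusion_days)
    return start, end


def select_events(events, index_date):
    """Keep (event_date, code) pairs inside the window, in time order.

    Assumption: the window is half-open, start <= event_date < end.
    """
    start, end = pathway_window(index_date)
    kept = [(d, code) for d, code in events if start <= d < end]
    return sorted(kept)


if __name__ == "__main__":
    # Hypothetical coded events for one patient; the codes are placeholders.
    example_events = [
        (date(2019, 12, 20), "referral"),
        (date(2018, 6, 1), "cough"),
        (date(2018, 9, 15), "pneumonia"),
    ]
    for event in select_events(example_events, date(2020, 1, 1)):
        print(event)
```

Sorting by date preserves the temporal order of events, which is the information a sequence model is intended to use.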
Findings Based on 3,303,992 patients from January 1981 to December 2020, there were 11,847 lung cancer cases, of whom 9,629 had died. 5,789 cases and 7,240 controls were used for training and a population of 368,906 for validation. Our model achieved an AUROC of 0·924 (95% CI 0·921–0·927) with a PPV of 3·6% (95% CI 3·5–3·7) and Sensitivity of 86·6% (95% CI 85·3–87·8) based on three years of data prior to diagnosis, excluding the month immediately before the index diagnosis. The comparator regression model achieved a PPV of 3·1% (95% CI 3·0–3·1) and AUROC of 0·887 (95% CI 0·884–0·889).
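A PPV of a few percent can coexist with high sensitivity and AUROC because PPV depends strongly on how rare the condition is in the evaluated population. The sketch below shows the standard relationship via Bayes' rule. The function names and the input values are hypothetical illustrations, not figures from this study.

```python
def ppv_from_rates(sensitivity, specificity, prevalence):
    """Positive predictive value from sensitivity, specificity, prevalence."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


def ppv_from_counts(tp, fp):
    """PPV from confusion-matrix counts: TP / (TP + FP)."""
    return tp / (tp + fp)


def sensitivity_from_counts(tp, fn):
    """Sensitivity from confusion-matrix counts: TP / (TP + FN)."""
    return tp / (tp + fn)


if __name__ == "__main__":
    # Hypothetical values: the same test accuracy at three prevalences.
    for prevalence in (0.001, 0.01, 0.1):
        ppv = ppv_from_rates(0.9, 0.9, prevalence)
        print(f"prevalence={prevalence:.3f} ppv={ppv:.4f}")
```

With sensitivity and specificity held fixed, the PPV falls as prevalence falls, so PPV values are only comparable between models evaluated on populations with similar case rates.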
Interpretation Capturing temporal sequencing between cancer and non-cancer pathways to diagnosis enables much more accurate models. Future work will focus on external dataset validation and integration into GP clinical systems for evaluation.
Evidence before the study Predictive models for early detection of cancer are a priority because treatment intensity, cancer outcomes, and survival are strongly linked to cancer stage at diagnosis. To review contemporary research on prediction models for lung cancer, we searched PubMed and Embase using the search terms “lung cancer”, “diagnos$”, and “prediction model” for studies published between Jan 1, 2000 and Dec 31, 2023. The QCancer Lung model has been recommended for prediction of lung cancer in primary care. However, classic regression models do not consider the rich relationships and dependencies in electronic health record (EHR) data, such as cough followed by pneumonia rather than just cough in isolation. Since 2018, with advances in the natural language processing (NLP) domain, transformer-based models have been applied to large amounts of EHR data for clinical predictive modelling. We searched Google Scholar and PubMed for studies using transformer-based models on EHR data. We used the terms (“transformer” OR “bert” OR “pretrain” OR “prediction” OR “predictive modelling” OR “contextualised”) AND (“ehr” OR “health records” OR “healthcare” OR “clinical records” OR “cancer” OR “disease”) in free text, published from Jan 2019 to Dec 2023. We found these studies were limited to diagnosis and medication concepts/codes in patients’ records in secondary care, omitting symptom, test, procedure, and referral codes. Early detection of lung cancer requires improved prediction performance from deep learning models. We updated the literature review when writing this paper (Apr 2024) to include the latest published studies.
Added value of this study We pretrained a transformer-based deep learning model, MedAlbert, for learning deep patient pathway representations from coded EHR data in primary care. This ‘Pathway to Diagnosis’ for each patient is defined to contain the fullest possible elaboration of the coded medical records appearing over the three years before diagnosis. To our knowledge, we are the first to build models on such detailed clinical records in primary care without data aggregation. Developed and validated based on the pretrained MedAlbert, the prediction model, MedAlbert+LRC, shows improved prediction performance for diagnosis of suspected lung cancer as well as for one- and two-year early detection of lung cancer. Compared with a classic machine learning model (a single Logistic Regression Model), MedAlbert+LRC performed better in terms of sensitivity, specificity, PPV and AUROC. The explainability analysis of the model revealed a series of symptoms, comorbidities and procedures associated with lung cancer diagnosis and identified six groups of patients related to COPD, diabetes, other cancers, etc. The prediction model we developed could be applied to the UK primary care population for early diagnosis of lung cancer.
Implications of all available evidence In order to progress beyond simple ‘red flag’ driven referral guidance and to develop more accurate prediction models for early diagnosis of lung cancer, it is necessary to use more sophisticated machine learning methods. Additionally, the framework we designed for deriving, modelling, and analysing the patient pathways could be used for the prediction of other cancers or diseases. The improvement in early diagnosis of lung cancer could contribute to better cancer outcomes and survival rates. Deep learning for diagnosis could provide more efficient care delivery and more accurate decisions faster, reducing costs and suffering across societies in the UK and worldwide.
Competing Interest Statement The authors have declared no competing interest.
Funding Statement This work was supported by a project grant from Cancer Research UK 37891/A25310 and the NIHR Imperial Biomedical Research Centre.
Author Declarations I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below:
Ethical approval was from London Bromley Research Ethics Committee ID: 252487 REC Reference: 18/LO/2240. Data Access was approved by the WSIC Data Access Committee. All data used in this paper were fully anonymized before analysis.
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.
Yes
Data Availability Due to data governance limitations, the deidentified patient data used to develop and validate the models cannot be shared.