Identifying patients with undiagnosed small intestinal neuroendocrine tumours in primary care using statistical and machine learning: model development and validation study

Overall modelling strategy

This study evaluated four model-building approaches: logistic regression, two penalised regression methods (LASSO and ridge logistic regression), and XGBoost. Within a diagnostic modelling framework, models were developed to provide a probabilistic estimate that a given individual had an SI-NET. Model performance was then evaluated using internal-external cross-validation [12].

Minimum sample size calculation

The approach of Riley et al. for prediction models with a binary outcome was used to estimate the minimum sample size required [13]. Before the study commenced, a SQL query in the target database identified 709 recorded SI-NET cases. Using this number and the size of the entire potential cohort at that time (n ~ 15,000,000), an outcome prevalence of 0.0047% was estimated. Assuming this prevalence, and targeting an R-squared of 0.00015 (a conservative 15% of the maximum permitted in this setting, 0.001), a shrinkage factor of 0.9, and 50 predictor parameters, we required a dataset comprising 2,999,750 adults (141 cases, events per predictor parameter = 2.82). No clear guidance exists for estimating the minimum sample size for machine learning models.
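The shrinkage-based criterion of Riley et al. reduces to a one-line formula; the following minimal Python sketch reproduces the figures above (using the prevalence rounded to 0.0047% as stated in the text):

```python
import math

def riley_min_n(p, r2_cs, shrinkage):
    """Minimum sample size from the shrinkage criterion of Riley et al.:
    n = p / ((S - 1) * ln(1 - R2_CS / S))."""
    return p / ((shrinkage - 1) * math.log(1 - r2_cs / shrinkage))

n = riley_min_n(p=50, r2_cs=0.00015, shrinkage=0.9)
events = n * 0.000047    # outcome prevalence of 0.0047%
epp = events / 50        # events per predictor parameter
# n is approximately 2,999,750 adults, events approximately 141, EPP approximately 2.82
```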

Study population and data sources

This study used the Optimum Patient Care Research Database (OPCRD; https://www.opcrd.optimumpatientcare.org), which at the time had collected de-identified routine electronic primary care data from over 17 million patients registered at over 1000 general practices in the United Kingdom (UK). OPCRD collects data from practices using all UK clinical software systems. The data fields available in OPCRD include demographics, clinical encounters such as measurements and diagnoses (defined as the presence of recorded SNOMED and Read/CTV3 codes), prescriptions, and referrals to secondary care.

An open cohort of adults registered with general practices contributing data to OPCRD between 1st Jan 2000 and 30th March 2023 was identified. Follow-up started from the latest of: the cohort start date, the date of registration with the practice plus 1 year (to exclude 'temporary patients'), or the patient's 18th birthday. Follow-up continued until the earliest of: SI-NET diagnosis (for cases), the date of leaving the practice or death, the date of the 90th birthday, or the cohort end date. Individuals who had a diagnosis of SI-NET recorded prior to the cohort start date (prevalent cases) were excluded.
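The entry and exit rules above reduce to a max/min over dates. The following sketch is illustrative only (the function name is hypothetical, and the one-year and birthday arithmetic ignores leap-day edge cases):

```python
from datetime import date

COHORT_START, COHORT_END = date(2000, 1, 1), date(2023, 3, 30)

def follow_up_window(registration, birth, exit_events):
    """Entry: latest of the cohort start, registration + 1 year, and the
    18th birthday. Exit: earliest of the supplied exit dates (diagnosis,
    deregistration/death), the 90th birthday, and the cohort end date."""
    entry = max(COHORT_START,
                registration.replace(year=registration.year + 1),
                birth.replace(year=birth.year + 18))
    exit_ = min(exit_events + [birth.replace(year=birth.year + 90), COHORT_END])
    return entry, exit_

entry, exit_ = follow_up_window(registration=date(1999, 6, 1),
                                birth=date(1985, 2, 10),
                                exit_events=[date(2010, 5, 4)])
```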

As there is no precedent or consensus on an appropriate prediction horizon for a case-finding tool for SI-NET, and given that SI-NETs are rare, we sought to maximise the number of cases available for model development and evaluation. In preliminary analyses, using a time-to-event modelling framework with a prediction horizon of 2 years from cohort entry would have led to over 50% attrition of the NET cases available for analysis (i.e. most individuals were diagnosed more than 2 years after cohort entry). To maximise case numbers, and for computational efficiency when developing multiple case-finding (diagnostic) models with hyperparameter tuning repeated within a cross-validation framework, the extracted cohort dataset was converted to a matched, weighted case-control dataset for model fitting. Cases were assigned an index date equal to the recorded date of NET diagnosis. Each case was matched with 100 non-cases from the same geographical region (10 regions in total). To account for time-varying contributions of follow-up during the cohort period, and possible trends in diagnostic modalities, non-cases were assigned an index date drawn randomly from a uniform distribution between their cohort entry date and their follow-up end date. For all individuals, predictor values were assigned at this index date; the most recently recorded values of BMI and smoking status prior to or on the index date were used. To permit accurate estimation of model intercepts when fitting to case-control data (and therefore reliable prediction of probabilistic risks in unmatched data), participants were also assigned weights: cases received a weight of 1, and controls received a weight equal to the inverse of the sampling fraction. These weights were used when fitting models (see below).
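The conversion to a matched, weighted case-control dataset can be sketched in plain Python (field names are hypothetical, and follow-up is represented as numeric day offsets for brevity):

```python
import random

def build_case_control(cases, noncases, ratio=100, seed=1):
    """cases / noncases: dicts with 'region', 'entry', 'end' (follow-up as
    day offsets) and, for cases, 'diagnosis'. Returns rows carrying an
    index date and a sampling weight."""
    rng = random.Random(seed)
    # Cases keep their diagnosis date as the index date, with weight 1.
    rows = [{**c, "index": c["diagnosis"], "weight": 1.0} for c in cases]
    for region in {c["region"] for c in cases}:
        pool = [p for p in noncases if p["region"] == region]
        n_sampled = ratio * sum(c["region"] == region for c in cases)
        weight = len(pool) / n_sampled   # inverse of the sampling fraction
        for p in rng.sample(pool, n_sampled):
            # Index date drawn uniformly over the individual's own follow-up.
            rows.append({**p, "index": rng.uniform(p["entry"], p["end"]),
                         "weight": weight})
    return rows
```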

Outcome and candidate predictor definitions

The outcome was defined as the presence of a recorded SNOMED/Read code for SI-NET (see link to code below). Predictors were defined by SNOMED clinical codes, with code lists developed and cross-checked by two clinicians with experience in EHR research (AKC & OB). Three categories of candidate predictors, based on clinical understanding and epidemiological evidence [14], are summarised in Table 1: factors generally associated with the risk of developing a gastrointestinal cancer (e.g. age, family history); symptoms or signs that could be attributable to an underlying SI-NET (e.g. abdominal pain); and features that reflect the diagnostic journey or potential misdiagnoses (e.g. imaging or coeliac testing, and functional gastrointestinal disorder, respectively). Comorbidities were defined as a recorded clinical code in the primary care record at any point prior to the index date. Symptoms and investigations were defined as a recorded clinical code in the primary care record in the 5 years prior to the index date; this was based on recent studies suggesting that NET patients may start consulting their general practitioner up to 5 years before ultimate diagnosis [6, 7].
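The two lookback rules (any prior record for comorbidities; a 5-year window for symptoms and investigations) can be expressed as a small helper; the function name and structure below are illustrative:

```python
from datetime import date, timedelta

def predictor_recorded(code_dates, index_date, lookback_years=None):
    """True if any clinical code was recorded on or before the index date;
    when lookback_years is given (5 for symptoms/investigations), the code
    must also fall within that window before the index date."""
    for d in code_dates:
        if d > index_date:
            continue                       # recorded after the index date: ignore
        if lookback_years is None:         # comorbidities: any prior record
            return True
        if d >= index_date - timedelta(days=round(365.25 * lookback_years)):
            return True
    return False
```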

Table 1 Characteristics and predictor distributions in individuals with a recorded diagnosis of midgut NET and those without.

Fractional polynomials with up to two powers were used to model potential non-linear relationships of age and body mass index (BMI) with the outcome in the regression and penalised regression models. A closed-test procedure was used to identify the polynomial terms that minimised the deviance [15]. Pre-specified interaction terms were included between weight loss and BMI, age and diabetes, and age and functional gastrointestinal disorder.
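The fractional polynomial transforms themselves are simple to construct. A sketch of the conventional FP power set and term generation follows (the closed-test search over candidate powers, which compares deviances of fitted models, is omitted):

```python
import math

FP_POWERS = [-2, -1, -0.5, 0, 0.5, 1, 2, 3]   # conventional FP candidate powers

def fp_terms(x, powers):
    """FP transforms of a positive predictor x: power 0 means ln(x);
    a repeated power p contributes x**p and x**p * ln(x)."""
    terms, seen = [], []
    for p in powers:
        base = math.log(x) if p == 0 else x ** p
        terms.append(base * math.log(x) if p in seen else base)
        seen.append(p)
    return terms

# FP2 with the repeated power pair (0, 0) for age 60: [ln(60), ln(60)**2]
age_terms = fp_terms(60.0, [0, 0])
```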

Missing data

BMI and smoking status were incompletely recorded where no value had been entered by the index date. This was handled using single imputation with chained equations, chosen for computational reasons; the imputation model included all candidate predictors (including fractional polynomial terms), the pre-specified interactions, and the outcome, and imputation was performed separately for each region. This singly imputed dataset was used throughout all model development and evaluation steps.
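Chained-equations imputation alternates regressions of each incomplete variable on the others. The following is a deliberately simplified, numpy-only sketch of single imputation; the study's imputation model also included the outcome and interaction terms, and in practice each variable type would be imputed with an appropriate model (e.g. logistic for smoking status):

```python
import numpy as np

def chained_impute(X, n_cycles=10):
    """Single imputation by chained equations (illustrative): each column
    with missing values is regressed on all other columns by least squares
    and its missing entries replaced by the fitted values, cycling so the
    fill-ins stabilise. X: 2-D float array containing NaNs."""
    X = X.copy()
    miss = np.isnan(X)
    col_means = np.nanmean(X, axis=0)
    for j in range(X.shape[1]):                     # initial mean fill-in
        X[miss[:, j], j] = col_means[j]
    for _ in range(n_cycles):
        for j in np.where(miss.any(axis=0))[0]:
            others = np.delete(np.arange(X.shape[1]), j)
            A = np.column_stack([np.ones(len(X)), X[:, others]])
            beta, *_ = np.linalg.lstsq(A[~miss[:, j]], X[~miss[:, j], j],
                                       rcond=None)
            X[miss[:, j], j] = A[miss[:, j]] @ beta
    return X
```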

Model development and evaluation

All models were fit to the whole matched, weighted case-control dataset. Ten-fold cross-validation was used to identify the lambda values for the LASSO and ridge models that minimised the cross-validated deviance; these models were then refitted to the dataset using the selected lambda values.
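The penalty-selection step can be illustrated with scikit-learn as a stand-in for the tuning described above (synthetic data; `LogisticRegressionCV` searches a grid of C = 1/lambda values against cross-validated log loss, i.e. deviance, and `penalty="l1"` with `solver="saga"` would give the LASSO analogue):

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = (rng.random(500) < 1 / (1 + np.exp(-X[:, 0]))).astype(int)
w = np.where(y == 1, 1.0, 2.5)      # illustrative case/control weights

# 10-fold CV over penalty strengths, minimising cross-validated deviance;
# the model is then refitted with the selected penalty automatically.
ridge = LogisticRegressionCV(Cs=10, cv=10, penalty="l2",
                             scoring="neg_log_loss", max_iter=5000)
ridge.fit(X, y, sample_weight=w)
selected_C = ridge.C_[0]            # penalty used for the final refit
```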

Continuous variables were left unscaled for the XGBoost model, and categorical predictors were encoded as dummy variables. Hyperparameter tuning with Bayesian optimisation and 10-fold cross-validation was used to identify the XGBoost hyperparameter configuration that maximised the cross-validated area under the receiver operating characteristic curve (AUC). The final XGBoost model was then fit to the dataset with these hyperparameters.
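The tuning loop can be sketched with scikit-learn, using `GradientBoostingClassifier` and a random search as stand-ins for XGBoost and Bayesian optimisation (synthetic data; AUC remains the cross-validated selection criterion, as in the study):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = (rng.random(400) < 1 / (1 + np.exp(-X[:, 0] - X[:, 1]))).astype(int)

# Random search stands in for Bayesian optimisation here; each candidate
# configuration is scored by cross-validated AUC.
search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_distributions={"n_estimators": [50, 100],
                         "max_depth": [2, 3],
                         "learning_rate": [0.05, 0.1]},
    n_iter=4, cv=3, scoring="roc_auc", random_state=0)
search.fit(X, y)
best = search.best_params_          # configuration for the final refit
```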

The performance of each model was then assessed with internal-external cross-validation (IECV), using non-random dataset splitting by geographical region [12]. Our approach supported computational tractability by using the weighted, matched case-control data for model fitting (including the repeated fitting during IECV) while using the entire unmatched dataset for model evaluation. Splitting the whole dataset non-randomly into geographically distinct units provides a stronger assessment of transportability to new settings than a single random split, which would yield two sub-datasets with similar predictor distributions [12, 16]. In each IECV iteration, one region was held out and the model was refitted to the matched case-control data from all other regions, with the case weights applied at this stage; the performance of that model was then evaluated on the full (unmatched) data for the held-out region. This was iterated so that every region was used once as a test set. Region-level performance metrics (AUC, calibration slope, and calibration-in-the-large) were pooled with a random-effects meta-analysis model (Hartung-Knapp-Sidik-Jonkman approach [17]) to provide a pooled overall estimate, a 95% confidence interval, and a 95% prediction interval; the latter indicates the range of model performance expected if the model were applied in a new, similar setting [16]. Cross-validation of the lambda values for the penalised models and hyperparameter tuning for XGBoost were repeated in every iteration of IECV, providing a form of 'nested' cross-validation that avoided evaluating models on the same data used for tuning [18]. As the dataset was split by geographically distinct regions, there was no dependence across the 'outer folds', permitting meta-analytical pooling of the region-level performance metrics.
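The meta-analytical pooling step can be sketched in plain Python, using the DerSimonian-Laird estimate of the between-region variance with the Hartung-Knapp-Sidik-Jonkman standard error; t critical values are supplied by the caller (e.g. 2.262 and 2.306 for the 97.5th percentile with 9 and 8 degrees of freedom when k = 10 regions):

```python
import math

def hksj_pool(theta, v, t_ci, t_pi):
    """Random-effects pooling of region-level estimates (e.g. AUCs) with the
    Hartung-Knapp-Sidik-Jonkman variance correction. theta: estimates;
    v: their variances; t_ci / t_pi: t critical values with k-1 and k-2 df."""
    k = len(theta)
    w = [1 / vi for vi in v]
    mu_fixed = sum(wi * ti for wi, ti in zip(w, theta)) / sum(w)
    q = sum(wi * (ti - mu_fixed) ** 2 for wi, ti in zip(w, theta))
    # DerSimonian-Laird between-region variance (truncated at zero)
    tau2 = max(0.0, (q - (k - 1)) / (sum(w) - sum(wi ** 2 for wi in w) / sum(w)))
    ws = [1 / (vi + tau2) for vi in v]
    mu = sum(wi * ti for wi, ti in zip(ws, theta)) / sum(ws)
    # HKSJ standard error of the pooled estimate
    se = math.sqrt(sum(wi * (ti - mu) ** 2 for wi, ti in zip(ws, theta))
                   / ((k - 1) * sum(ws)))
    ci = (mu - t_ci * se, mu + t_ci * se)
    half = t_pi * math.sqrt(tau2 + se ** 2)       # 95% prediction interval
    return mu, ci, (mu - half, mu + half)
```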

Decision curve analysis [19] was used to explore the clinical utility (net benefit) of each model across a range of threshold probabilities. These analyses used the individual-level predictions generated for each participant during IECV (i.e. when their region was held out). The sensitivity and positive predictive value (PPV) of each model were assessed at different cut-offs of the predicted risk distribution.
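Net benefit at a threshold probability pt is TP/n - FP/n * pt/(1 - pt); a minimal sketch over held-out predictions (names illustrative):

```python
def net_benefit(y_true, p_hat, pt):
    """Net benefit of the policy 'investigate if predicted risk >= pt':
    NB = TP/n - FP/n * pt / (1 - pt)."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, p_hat) if p >= pt and y == 1)
    fp = sum(1 for y, p in zip(y_true, p_hat) if p >= pt and y == 0)
    return tp / n - fp / n * pt / (1 - pt)

# Decision curve: net benefit across a range of threshold probabilities
curve = [(pt, net_benefit([1, 1, 0, 0], [0.9, 0.2, 0.8, 0.1], pt))
         for pt in (0.05, 0.25, 0.5)]
```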

Software and code

Data extraction used SQL. Analyses were conducted using Stata v17 and R, with analysis code available in the following repository: https://github.com/Mendelian/NETs_prediction_modelling.

Study approval and conduct

Ethical approval for the OPCRD database for clinical research has been obtained from the NHS Health Research Authority (REC reference 20/EM/0148). This study was approved by the ADEPT committee (reference: PROTOCOL2318). The study is reported in accordance with TRIPOD guidance [20].
