In this study, participants were drawn from the UK Biobank study, one of the largest longitudinal biomedical databases in the world, which investigates the health of over half a million middle-aged and older UK adults [22]. Participants in our study were aged 55 to 70 at the baseline visit in 2006–2010 (i.e., Visit I). The first follow-up visit was held in 2012–2013 (i.e., Visit II). Visits I and II included (1) informed consent; (2) a touchscreen questionnaire; (3) a verbal interview; (4) eye measures (i.e., visual acuity, refractometry, retinal optical coherence tomography, and eye surgery history); (5) anthropometric measures; and (6) blood/urine sample collection. Socio-demographic characteristics, occupation, lifestyle, and cognitive function were gathered through questionnaires administered on touchscreens or laptops. Two subsequent follow-up visits, the third (Visit III) and the fourth (Visit IV), began in 2014 and 2019, respectively. Visits III and IV included the above items plus additional imaging of the brain, heart, and body [23]. In this study, we used an initial cohort of 8528 participants selected on the basis of the available cognitive tests, sMRI, and demographic features.
Demographic data
The demographic data include information on age, sex, socioeconomic class, education level, and tobacco use. Age was defined as each participant’s age in years at the baseline visit. Socioeconomic class was defined by the participant’s average total household income. Based on the UK Biobank Data-Coding, participants selected one of five options in British pounds: “Less than 18,000,” “18,000 to 30,999,” “31,000 to 51,999,” “52,000 to 100,000,” and “Greater than 100,000.” These groups were labeled “under,” “lower,” “middle,” “upper-middle,” and “upper class,” respectively. Education level was a categorical variable with the following levels: college or other higher-level status; post-secondary or vocational; secondary; and none of the previous ones. Finally, tobacco smoking status was a categorical variable with levels “never smoked,” “previously smoked,” and “currently smoking.”
Cognitive function tests
Cognitive function tests in the UK Biobank study were administered as part of the fully automated touchscreen questionnaire [22]. The cognitive function testing section of the UK Biobank documentation describes the administration procedures and basic statistics [24]. Here, we focused on Reaction Time (RT), Fluid Intelligence (FI), Prospective Memory (PM), and Pair Matching Memory (PMM), which are associated with processing speed, verbal and numerical reasoning, prospective memory, and visual declarative memory, respectively [25]. It is worth noting that the RT and PMM values were log(x)- and log(x + 1)-transformed, respectively.
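The two transforms mentioned above can be applied directly with NumPy; the raw values below are hypothetical placeholders, not UK Biobank data:

```python
import numpy as np

# Hypothetical raw scores: RT in milliseconds, PMM as an error count.
rt_raw = np.array([550.0, 620.0, 480.0])
pmm_raw = np.array([0, 2, 1])

rt_transformed = np.log(rt_raw)      # log(x) transform for Reaction Time
pmm_transformed = np.log1p(pmm_raw)  # log(x + 1) transform for Pair Matching Memory
```

The log(x + 1) form is the standard way to handle scores that can legitimately be zero, as an error count can.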
General cognitive ability
General Cognition (GC) can be derived from the cognitive function assessment scores while retaining most of the variability in the original exams’ scores [25,26,27]. One way to create a GC composite score is through PCA [28]. The composite score derived from the first principal component accounts for the largest share of variance in the cognitive tests and is considered the GC score [25].
In this study, we used a longitudinal combination of the cognitive tests described in the “Cognitive function tests” section at Visits I and III as the input to the PCA. We then extracted the first component, which accounted for 28% of the variance in the eight cognitive tests (see Supplementary Text A). This method is based on valid and comprehensive cognitive exams that are susceptible to age-related decline [25]. The composite scores extracted from the first principal component were taken as the GC score. The loading scores in Supplementary Table A.1 show that, regardless of the time point, the weights for the FI and PM tests are negative, while those for the PMM and RT exams are positive. Thus, a higher GC score corresponds to lower cognitive performance, and vice versa. To make the score more interpretable, we multiplied it by − 1, so that larger GC scores are associated with a higher likelihood of being a Positive-Ager.
The first and third quartiles of the composite scores were then used as thresholds to identify the “Cognitive Decliner” and “Positive-Ager” classes. The middle 50% of participants were considered “Cognitive Maintainers” (i.e., those who showed relatively stable GC scores) and were dropped from the analysis. This enabled us to focus on extreme cognitive performance and identify the groups most and least resilient to age-related changes in brain areas. It also let us compare Positive-Agers against Decliners and thereby find the brain regions associated with cognitive reserve.
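The GC derivation and quartile split described above can be sketched as follows. The data here are synthetic, scikit-learn's `PCA` is one implementation choice, and the pre-PCA standardization step is an assumption (the text does not state how the eight scores were scaled):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical matrix: n participants x 8 cognitive scores
# (RT, FI, PM, PMM at Visits I and III).
scores = rng.normal(size=(500, 8))

# First principal component as the General Cognition (GC) composite.
pca = PCA(n_components=1)
gc = pca.fit_transform(StandardScaler().fit_transform(scores)).ravel()
gc = -gc  # sign flip, as in the text, so larger GC means better cognition

# First and third quartiles define the extreme groups.
q1, q3 = np.quantile(gc, [0.25, 0.75])
labels = np.where(gc <= q1, "Cognitive Decliner",
                  np.where(gc >= q3, "Positive-Ager", "Cognitive Maintainer"))

# Maintainers (middle 50%) are dropped before classification.
keep = labels != "Cognitive Maintainer"
X, y = scores[keep], labels[keep]
```

With continuous scores this keeps roughly the bottom and top quarters of participants, about half the sample.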
sMRI data
UK Biobank conducted structural imaging at Visits III and IV. Imaging assessments were performed at three centers on identical Siemens Skyra scanners with a standard Siemens 32-channel head coil [29]. UK Biobank imaging comprises six modalities: T1, T2-weighted, and susceptibility-weighted MRI, resting-state functional MRI (rsfMRI), task functional MRI (tfMRI), and diffusion MRI (dMRI). Of these, T1, T2-weighted, and susceptibility-weighted MRI capture the anatomical and neuropathological structures of the brain. In this study, we used T1 at Visit III, which is informative about brain structure, the main tissue types, the main anatomical landmarks, and tissue loss [30, 31]. From the T1 sMRI features, we used Freesurfer ASEG, Freesurfer desikan gw, Freesurfer desikan pial, Freesurfer desikan white, and Freesurfer desikan sub-segmentation (558 sMRI features in total).
A multi-stage feature selection method
Data mining and machine learning algorithms have difficulty dealing with high-dimensional data, and the problem worsens when the feature space contains noisy, irrelevant, or highly correlated features. Such cases consume substantial computational resources and often end in poor model performance. One solution is to employ dimensionality reduction techniques, such as feature selection methods, to improve overall performance; this also decreases memory usage and may reduce the overall run time [32, 33]. In this study, we had high-dimensional and highly correlated feature sets, so incorporating feature selection was necessary to obtain reliable and accurate predictive models.
We faced the following challenges during the analysis of sMRI data: (1) high correlation between features; (2) the high number of features even after correlation-based filtering; (3) very long training and tuning times; (4) discordance of the final feature subset among different traditional feature selection approaches such as heuristic algorithms, filter, wrapper, and embedded methods.
To address the above issues and have a reliable and accurate model, we designed a multi-stage feature selection algorithm. The proposed algorithm removed irrelevant features, addressed the multicollinearity problem in the data, decreased the overall run time, and improved the classification performance. The three stages of the proposed algorithm are summarized below:
Stage A
Let \(\left\{\left(\mathbf{x}_{i}, y_{i}\right): \mathbf{x}_{i} \in \mathbb{R}^{p}, y_{i} \in \left\{0, 1\right\}, i=1, \dots, n\right\}\) be the training set, which consists of \(n\) pairs of feature vectors of size \(p\) and target values (binary, here). First, the algorithm drops the features whose mutual information (MI) value with respect to the response variable is less than a specified threshold \(\alpha\). Next, the algorithm sorts the features by MI score in ascending order and calculates the correlation between the first feature and the second. If the pairwise correlation exceeds a threshold \(\beta\), the feature with the lower MI score is dropped; otherwise, the search continues for the first feature until all remaining variables have been explored. The process is then repeated on the next feature, and continues until no pair of features with a high degree of correlation (i.e., Pearson correlation coefficient above \(\beta\)) remains in the feature set.
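Stage A can be sketched with scikit-learn's MI estimator. The function below is an illustrative reading of the description, not the authors' code; the default `alpha` and `beta` values are placeholders:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def stage_a(X, y, alpha=0.1, beta=0.95):
    """Sketch of Stage A: MI filtering, then correlation-based pruning.

    alpha and beta are the MI and Pearson-correlation thresholds
    from the text; the values here are arbitrary examples.
    """
    mi = mutual_info_classif(X, y, random_state=0)
    # Keep features with MI >= alpha, ordered by MI ascending.
    selected = [j for j in np.argsort(mi) if mi[j] >= alpha]
    corr = np.abs(np.corrcoef(X, rowvar=False))
    i = 0
    while i < len(selected):
        f = selected[i]
        # The list is MI-ascending, so f has the lower MI of any pair
        # it forms with a later feature; drop f if such a pair exceeds beta.
        if any(corr[f, g] > beta for g in selected[i + 1:]):
            selected.pop(i)
        else:
            i += 1
    return sorted(selected)
```

On data with a near-duplicate feature pair, exactly one member of the pair survives the pruning step.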
Stage B
Using L1-regularized Logistic Regression (L1-regularized LR), the algorithm computes the cross-validated area under the curve (AUC) for a range of regularization parameter values, \(C \in (0, C_{\max}]\), on the training set obtained from Stage A, where \(C_{\max}\) is a large positive value. In this notation, \(C\) is the inverse of the regularization strength (often denoted as \(\lambda\)). Next, the smallest \(C^{*}\) is selected for which no significant improvement in the cross-validated AUC is recorded over the interval \((C^{*}, C_{\max}]\). The algorithm then picks the subset of features obtained by training the L1-regularized LR with regularization parameter \(C^{*}\) on the training set.
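A minimal sketch of Stage B, assuming a discrete grid of C values and a simple tolerance test for "no significant improvement" (the paper's grid, tolerance, and significance criterion are not specified, so `c_grid` and `tol` here are placeholders):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def stage_b(X, y, c_grid=(0.001, 0.01, 0.1, 1.0, 10.0), tol=0.005):
    """Sketch of Stage B: pick the smallest C whose cross-validated AUC
    is within `tol` of the best AUC over the grid, then keep the features
    with nonzero L1 coefficients at that C."""
    aucs = []
    for c in c_grid:
        clf = LogisticRegression(penalty="l1", solver="liblinear", C=c)
        aucs.append(cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean())
    best = max(aucs)
    # Smallest C in the ascending grid with no significant AUC gain beyond it.
    c_star = next(c for c, a in zip(c_grid, aucs) if a >= best - tol)
    final = LogisticRegression(penalty="l1", solver="liblinear", C=c_star).fit(X, y)
    return np.flatnonzero(final.coef_.ravel() != 0)
```

Smaller C means stronger regularization, so choosing the smallest adequate C yields the sparsest feature subset that still matches the best cross-validated AUC.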
Stage C
The algorithm employs the sequential feature selection (SFS) method to find the final subset of best features. Bayesian optimization is used to ensure the optimal performance of the algorithm.
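Forward SFS over the features surviving Stages A and B can be sketched with scikit-learn's `SequentialFeatureSelector`. The wrapped classifier, scorer, and target feature count below are illustrative choices, and the Bayesian optimization step is omitted:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SequentialFeatureSelector

def stage_c(X, y, n_features=5):
    """Sketch of Stage C: forward sequential feature selection.

    X holds the features surviving Stages A and B. The RF wrapper model
    and n_features are placeholder choices for illustration.
    """
    sfs = SequentialFeatureSelector(
        RandomForestClassifier(n_estimators=50, random_state=0),
        n_features_to_select=n_features,
        direction="forward",
        scoring="roc_auc",
        cv=5,
    )
    sfs.fit(X, y)
    return np.flatnonzero(sfs.get_support())
```

Forward SFS greedily adds the feature that most improves the cross-validated score at each step, which is why running it last, on an already-reduced feature set, keeps the wrapper search tractable.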
A detailed description of the proposed algorithm is provided in Supplementary Text B.
Classification
Classification methods are supervised machine learning techniques designed to find a relationship between class labels and the corresponding input features. The classification process includes two main steps: training and testing. In the training phase, a classifier is trained on a portion of the observations; this phase may include feature selection and hyperparameter tuning. In the testing phase, the performance of the fitted model is evaluated on the remaining data, which the model has not seen during training [34].
In this study, we report Random Forest (RF) and Support Vector Machine (SVM) classifiers to discriminate between the Positive-Ager and Cognitive Decliner groups. RF is a powerful ensemble method consisting of many decision tree classifiers whose predictions are aggregated into a final predicted class label by majority vote. RF is robust against overfitting and multicollinearity, and it provides a feature importance ranking based on the mean decrease in impurity, which reflects each feature's contribution to the classification [35]. The second model is SVM, a widely used classification technique founded on statistical learning theory that locates the decision boundaries between class labels. SVM separates the classes with linear or nonlinear decision surfaces, depending on the kernel function [34].
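A minimal setup of the two classifiers, evaluated with cross-validated AUC on synthetic stand-in data (the hyperparameters here are placeholders, not the tuned values from the study; scaling before the RBF SVM is a standard practice the text does not mention explicitly):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical labels (1 = Positive-Ager, 0 = Cognitive Decliner)
# and a feature matrix standing in for the selected sMRI features.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 200)
X = y[:, None] + rng.normal(size=(200, 10))

# The two reported classifiers; hyperparameters are illustrative.
rf = RandomForestClassifier(n_estimators=500, random_state=0)
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

rf_auc = cross_val_score(rf, X, y, cv=5, scoring="roc_auc").mean()
svm_auc = cross_val_score(svm, X, y, cv=5, scoring="roc_auc").mean()
```

After fitting, `rf.feature_importances_` exposes the mean-decrease-in-impurity ranking mentioned above.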
Because the multi-stage feature selection algorithm incorporates L1-regularized LR and SFS, we compared the proposed algorithm against the L1-regularized LR and SFS methods individually. For this purpose, we evaluated each of the three feature selection algorithms (i.e., the multi-stage feature selection, L1-regularized LR, and SFS) in combination with the RF and SVM classifiers.