Infectious Disease Reports, Vol. 14, Pages 900-931: Risk Stratification of COVID-19 Using Routine Laboratory Tests: A Machine Learning Approach

1. Introduction

With the emergence of COVID-19, caused by the novel SARS-CoV-2 [1], in late 2019 and the early months of 2020, the world was put on hold. Since 30 January 2020, when the World Health Organization (WHO) declared it a public health emergency of international concern, the number of people who contracted COVID-19 increased dramatically. As of 15 November 2022, confirmed cases stood at 635 million, with a death toll of 6.6 million people worldwide. The corresponding figures for South Africa are four million and 102 thousand, respectively (https://www.google.com/search?q=covid+19+world+stats&rlz=1C1VDKB_enZA943ZA944&oq=&aqs=chrome.0.69i59i450l8.107752819j0j15&sourceid=chrome&ie=UTF-8, accessed on 20 September 2022). These numbers led world leaders to seek solutions to contain the viral spread. Among many methods, national lockdowns, social distancing, and wearing protective face masks were used, and these methods proved effective in at least managing the viral spread [2]. South Africa was in various national lockdown levels from April 2020 until June 2022.

The rapid spread, high infectivity and quick progression of the disease in positive patients [3] put health care systems under stress. This stress meant that there was always an urgent need for quick and effective diagnostic and risk stratification measures in order to identify patients who require intensive care. The most common test for the virus is the PCR test, which involves testing swabs from various parts of the respiratory tract, mainly nasopharyngeal swabs [4]. Delays in the traditional risk stratification methods cause a lag in hospital admission and bed assignment for patients, as well as possible exposure of healthcare workers to infected patients.

This study aimed to use machine and statistical learning models to predict the severity of COVID-19 from quick and readily accessible routine laboratory tests. The problem is approached as a classification problem using supervised machine learning algorithms. Routine lab tests are available within 30 min to 2 h. The test results provide values for a number of analyte markers which, together with some previously known health conditions, can be used for risk stratification. An analyte is a biochemical compound that is the target of chemical analysis; an analyte marker is made by spiking the compound to make it effective in measurements. Artificial intelligence has inspired machine learning algorithms capable of the initial diagnosis and risk stratification of various diseases [5,6,7,8,9,10,11].

1.1. Aim

The aim of the research was to classify whether positive COVID-19 cases had severe COVID-19 symptoms or not, thereby helping to decide the type of hospital ward to which patients would be admitted. Severe COVID-19 symptoms include difficulty breathing, body weakness, high fever, and muscle and joint pains. The classification was carried out using data provided by a network of laboratories in South Africa, which contain various analyte measurements. The analytes were used as variables in the classification.

1.2. Objectives

The objectives followed the aim. These were:

To use machine learning and statistical learning models to classify the severity of COVID-19, i.e., to perform risk stratification (RS).

To compare the fitted machine learning models using different measures of performance.

To identify the top-performing model for the above-mentioned objectives.

To identify the top important analytes relevant to the risk stratification of COVID-19.

1.3. Research Design

A retrospective study design was used to describe results extracted from the Central Data Warehouse (CDW) of the National Health Laboratory Service (NHLS) for all patients tested between 1 March 2020 and 7 July 2020 in the public sector healthcare facilities of the country. The CDW houses all laboratory results for public sector patients in South Africa.

1.4. Data

We extracted data for all individuals who had at least one PCR test conducted via the NHLS between 1 March 2020 and 7 July 2020. Patient data were anonymised to prevent traceability. A six-month period of demographic, biochemical, haematological and microbiology data was extracted for all patients who had a SARS-CoV-2 PCR test. Out of a total of 842,197 tests, 11.7% were positive and 88.3% negative. A critical case was defined as a patient who was admitted to a ward because of COVID-19 complications, and a non-critical case as a positive patient who was not admitted.

2. Literature Review

This section reviews the findings of studies that are similar to this one. It covers papers and other publications that looked, directly or indirectly, at the use of machine learning in the prediction of COVID-19 and in risk stratification. These studies used different predictors, including analytes and imaging techniques, to predict and stratify COVID-19 risk. The section also analyses and critiques the methods used by other researchers.

2.1. Background of Using Routine Lab Results for COVID-19 Diagnosis

Clinical characteristics of COVID-19 have been well studied, and many abnormalities have been observed in patients infected with the disease [12]. These abnormalities play an important role in the early diagnosis, detection and even management of the disease [13]. The table in Figure 1 shows the findings of that research and how COVID-19 affected the various listed analytes.

Given the devastating nature of COVID-19, it is important to identify and rank groups of patients who are at risk of severe COVID-19. Several authors have documented risk factors associated with severe COVID-19 outcomes. These studies mostly considered comorbid conditions such as HIV infection [14] and type 1 and type 2 diabetes [15]. Hesse et al. [16] presented findings using the same data as used for this research. That study looked at how comorbidity factors, including HIV, TB and diabetes (HbA1c), together with other related laboratory analytes, affect the severity of COVID-19. All these studies demonstrated that it is possible to use routine lab test results and comorbidity factors to predict and diagnose COVID-19 status and severity.

2.2. Machine and Statistical Learning in Predicting COVID-19 Severity

Zimmerman et al. [17] review the prospective uses of machine learning and artificial intelligence for cardiovascular diagnosis, prognosis, and treatment in COVID-19 infection across a number of cardiovascular applications. Applications of artificial intelligence, particularly machine learning, have the potential to take advantage of data-rich platforms and change how cardiovascular illness is identified, risk-stratified, prevented, and treated. The authors also cite improvements that AI has brought to various fields of cardiology.

Various studies have documented how statistical and machine learning models can be used to diagnose and predict COVID-19 and its severity [18,19,20]. These have used supervised learning, albeit with different features. The models that were developed demonstrated high predictive performance.

Zoabi et al. [18] used machine learning models to diagnose COVID-19 based on the symptoms experienced. The feature space of the study comprised sex, age, symptoms such as cough, fever, sore throat, shortness of breath and headache, and whether the person had been in contact with a positive COVID-19 case. The study employed a gradient-boosting model in Python and used the ROC curve to assess model performance with the bootstrap re-sampling method. The model demonstrated high predictive performance, with an area under the ROC curve of 86%, as shown in Figure 2.

Yang et al. [19] used 27 laboratory tests together with demographic features (age, sex, race) to fit machine learning models in the R software [21] for COVID-19 prediction and risk stratification. The research fitted logistic regression, decision tree, random forest, and gradient-boosting decision tree classifiers. A 5-fold cross-validation re-sampling method was employed, with the area under the ROC curve used predominantly as the measure of model performance. As shown in Figure 3, the two ensemble models, gradient boosting and random forest, outperformed the single models, logistic regression and the single decision tree, in that order. All the models, however, had high predictive performance.

Jucknewitz et al. [20] used statistical learning to analyse prior risk factors for the severity of COVID-19. The study used factors such as age, gender, nationality, occupation, employment, income, etc. LASSO was used for variable selection, and a regression model together with gradient boosting models were fitted. ROC, AUC, and accuracy were used to evaluate the models' performances. Figure 4 shows the ROC curves for the models; the gradient boosting model has an AUC of 88.79%, and the baseline model an AUC of 87.55%.

A study of 287 COVID-19 samples from King Fahad University Hospital in Saudi Arabia [22] predicted the disease using three classification algorithms, namely random forest, logistic regression, and extreme gradient boosting. The data were re-sampled using 10-fold cross-validation with SMOTE to alleviate the class imbalance that was present. The modeling was conducted on 20 features, which included some symptoms as well. The RF model outperformed the other classifiers with an accuracy of 0.938, sensitivity of 0.947, and specificity of 0.929, with the results given in a table shown in Figure 5.

Alballa et al. [23] compiled a review of a number of studies that employed machine learning in COVID-19 diagnosis, mortality and risk prediction. The review noted that most studies employed supervised machine learning models. The paper's aims were to:

Review ML algorithms used in the field mainly used for diagnosis of COVID-19 and prediction of mortality risk and severity, using routine clinical and laboratory data that can be accessed within an hour.

Analyse the features/variables that were found to be top predictors, i.e., the most important features relevant to machine learning predictor models.

Outline the algorithms mostly used and for which purpose.

Point out some areas of improvement as well as areas of further study.

The paper concluded that the results of machine learning and statistical models are consistent with those of pure medical studies. It also pointed out the issue of imbalance and missing values in the data usually used in the studies. The results from their study are shown in Figure 6, Figure 7 and Figure 8.

3. Methodology

This section provides detailed explanations and descriptions of various methods that were implemented and used in the study in order to arrive at the intended results.

Consider supervised data with response variable $Y$ and predictor variables $X=(X_1,X_2,\cdots,X_p)$, where $\bar{X}$ is a vector representation of the predictor variables $X_i$, i.e., the analytes from routine clinical tests. Let $Y_i\in\{0,1\}$ be an indicator variable, with $Y_i=0$ denoting non-severe COVID-19 and $Y_i=1$ denoting severe COVID-19. This study classified this supervised data set using Logistic Regression, Decision Trees, Random Forest, Extreme Gradient Boosting, the Self Normalising Neural Network and the Convolutional Neural Network.

3.1. Missing Values

The data used in this study contained missing values. The missingness is both structural (data missing for an explainable reason, e.g., patients who did not get any blood tests because they were not hospitalised) and Missing Completely at Random (MCAR) (i.e., data missing for reasons that cannot be traced). This is because some patients were not admitted and some test values were not available.

3.1.1. missForest Missing Value Imputation

missForest is a non-parametric missing value imputation method that uses the random forest algorithm [24] on each variable in turn to predict its missing values. We used the missForest package [25] in R, which gives control over the process through an adjustable number of trees, number of iterations and other tuning parameters.

3.1.2. Simple Statistics Missing Value Imputation (SSMVI)

SSMVI is a non-parametric method of missing value imputation that assumes a symmetric distribution of the data points of any given variable [26,27]. It imputes missing numeric values with the mean of the observed values and missing factor values with the modal class of the observed values. We created an algorithm in R that implemented what is known as predictive mean matching for numeric variables, as well as predictive mode matching for factor variables.

3.2. Variable Selection

3.2.1. Boruta Algorithm for Feature (Variable) Selection

With the high volumes of data encountered in machine and statistical learning, it is necessary to reduce the volume of the data, particularly the number of variables. This is done by removing redundant and correlated features, which in turn helps to produce simpler models that are relatively easy to interpret and faster to compute [28].

The Boruta algorithm is named after the Slavic mythological god of the forest; it modifies and improves on the variable importance algorithm used in RF models [29,30] (Algorithm 1).

Algorithm 1: Boruta Variable Selection Algorithm [29]

Create Shadow Features: the data set is duplicated column by column and the values in each duplicated column are randomly permuted, thereby removing any relationship that might originally have existed.

Random Forest Training: a random forest classifier is trained on the extended data set and the variable importances from the training are collected.

Comparison: for each variable, the algorithm compares the feature importance of the original variable with the maximum importance of all shadow variables (the best shadow variable). A shadow variable is one that has been created with similar characteristics to an original variable in the data set. If the feature importance is higher than that of the best shadow variable, the variable is recorded as important.

Iterations: the process continues until a pre-defined number of iterations is reached. A table of hits is recorded, and the variables with hits are the ones selected for the model.
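The steps of Algorithm 1 can be sketched in a few lines of Python. This is only a rough illustration under loud assumptions: the study used the Boruta package in R with random forest importances, whereas the sketch below swaps in the absolute Pearson correlation with the label as a stand-in importance score, and the function names and toy data are hypothetical.

```python
import random

def pearson_abs(x, y):
    """Absolute Pearson correlation, a stand-in for random forest importance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return abs(cov / (sx * sy)) if sx > 0 and sy > 0 else 0.0

def boruta_hits(features, y, n_iter=20, seed=0):
    """Count how often each feature beats the best shadow feature."""
    rng = random.Random(seed)
    hits = {name: 0 for name in features}
    for _ in range(n_iter):
        # Step 1: shadow features are randomly permuted copies of the originals.
        shadows = []
        for values in features.values():
            shadow = values[:]
            rng.shuffle(shadow)
            shadows.append(shadow)
        # Step 2 (stand-in): score the shadows with the same importance measure.
        best_shadow = max(pearson_abs(s, y) for s in shadows)
        # Step 3: a feature records a hit if it beats the best shadow.
        for name, values in features.items():
            if pearson_abs(values, y) > best_shadow:
                hits[name] += 1
    return hits

# Hypothetical toy data: one informative and one uninformative feature.
y = [0] * 20 + [1] * 20
features = {
    "signal": [float(v) for v in y],             # tracks the label exactly
    "noise": [float(i % 2) for i in range(40)],  # zero correlation with the label
}
hits = boruta_hits(features, y)
```

In this toy setting the informative feature should collect a hit in every iteration, while the uncorrelated one should collect none; the real algorithm additionally applies a statistical test to the hit counts, which is omitted here.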

The Boruta algorithm is widely used, as it gives the user flexibility in the number of iterations to run and has produced good results for biomedical data [30]. We used the Boruta package [29] in R, which is computationally cheap and allows the number of trees and the number of iterations to be tuned.

3.2.2. LASSO Feature Selection

The Least Absolute Shrinkage and Selection Operator (LASSO) was introduced by [31]. It performs two fundamental tasks, regularisation and feature selection, with regularisation being the driving factor in the feature selection. Regularisation is the shrinking of estimates towards a central point (zero, in the case of the LASSO coefficients). LASSO introduces a penalty on the sum of the absolute values of the model coefficients (model parameters). This results in shrinkage, where some of the coefficients are shrunk to zero. In the feature selection process, the variables that retain a non-zero coefficient after shrinking are selected for modeling. This is done with the objective of minimising the prediction error (SSE) [32].

The strength of the penalty is controlled by a tuning parameter, say $\zeta$. The larger the value of $\zeta$, the more coefficients are forced to zero, and hence the more variables are eliminated during shrinking. Note that if $\zeta=0$, the model is an ordinary least squares (OLS) regression [31,32].

The study used the Bühlmann and Van de Geer formulation of LASSO [33] for the linear model $Y=\bar{X}\beta+\epsilon$, with $\bar{X}$ and $Y$ as defined before, $\beta$ the coefficient vector and $\epsilon$ the error vector. The LASSO estimate is defined by the solution to the $\ell_1$-penalised optimisation problem

$$\min_{\beta}\ \frac{\|Y-\bar{X}\beta\|_2^2}{n}\quad\text{subject to}\quad \|\beta\|_1=\sum_{j=1}^{p}|\beta_j|<t$$

(2)

where $t$ is the upper bound on the sum of the absolute values of the coefficients $\beta_j$ and $n$ is the number of data points. This constrained minimisation is equivalent to the penalised parameter estimation

$$\hat{\beta}(\zeta)=\arg\min_{\beta}\left(\frac{\|Y-\bar{X}\beta\|_2^2}{n}+\zeta\|\beta\|_1\right)$$

(3)

where $\|Y-\bar{X}\beta\|_2^2=\sum_{i=1}^{n}\left(Y_i-(\bar{X}\beta)_i\right)^2$, $\|\beta\|_1=\sum_{j=1}^{p}|\beta_j|$ and $\zeta\geq 0$ is the penalty parameter.

LASSO was used in conjunction with, and compared to, Boruta. LASSO gives accurate models, since the shrinking process reduces the variance of the estimates (at the cost of some bias). Model interpretability is greatly improved by LASSO due to the elimination of irrelevant features [32,33]. The study used the Caret package [34] to perform LASSO in R, as it allows adjustment of various parameters and is computationally cheap.

3.3. Logistic Regression

Let $\pi$ denote the probability of a patient being COVID-19 positive, let $\beta_i$ be the regression coefficient associated with the feature $x_i$, and let $\beta_0$ be the intercept. Then a logistic regression (LR) model [35] is given by the equation:

$$\log\left(\frac{\pi_i}{1-\pi_i}\right)=\beta_0+\sum_{i=1}^{n}\beta_i x_i$$

(4)

The regression coefficients can be fitted using maximum likelihood estimation [36]. Solving the logit for the probability of success gives:

$$\pi_i=\frac{\exp\left(\beta_0+\sum_{i=1}^{n}\beta_i x_i\right)}{1+\exp\left(\beta_0+\sum_{i=1}^{n}\beta_i x_i\right)}$$
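As a rough illustration of the inverse-logit expression above, the following Python sketch computes the probability from an intercept and coefficients and then applies a cut-off. The coefficient and feature values are hypothetical, the function names are not from the study (which fitted the model with the Caret package in R), and the default cut-off of 0.5 is the one the study reports using.

```python
import math

def predict_probability(intercept, coefficients, features):
    """Inverse logit of the linear predictor beta_0 + sum_i beta_i * x_i."""
    linear = intercept + sum(b * x for b, x in zip(coefficients, features))
    # Numerically stable evaluation of exp(z) / (1 + exp(z)).
    if linear >= 0:
        return 1.0 / (1.0 + math.exp(-linear))
    ez = math.exp(linear)
    return ez / (1.0 + ez)

def classify(probability, threshold=0.5):
    """Return 1 (positive class) when the probability exceeds the threshold."""
    return 1 if probability > threshold else 0

# Hypothetical coefficients and analyte values, for illustration only.
p = predict_probability(-1.0, [0.8, -0.5], [2.0, 1.0])
label = classify(p)
```

Here the linear predictor is about 0.1, so the probability is just above one half and the observation falls in the positive class.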

The probability can be compared against a set threshold [36,37] to determine which class a patient belongs to. The study used a threshold of 0.5, that is, if $\pi_i>0.5$, the patient belongs to the positive class, and if $\pi_i\leq 0.5$, the patient belongs to the negative class. Based on the algorithm, predictions are made to quantify the accuracy and other measures of performance. We used the Caret package [34] in R to fit the logistic regression. The package allows the adjustment of various parameters as well as the implementation of automatic cross-validation.

3.4. Tree-Based Methods

3.4.1. Decision Tree

Decision Tree (DT) algorithms are non-parametric techniques that classify data according to a set of rules by repeatedly splitting the feature space (divide and conquer) [38,39]. The partitioning splits the feature space into small, non-overlapping regions whose response values are similar, as guided by the set rules. The predictions are then obtained by fitting simple models (such as a constant) to each region [40]. DTs are used because they require fewer assumptions than classical methods and can handle a wide variety of data types [39].

Tree Structure

Trees are characterised by two features, decision nodes and leaves. Leaves represent the decisions and/or the final label, while decision nodes are the points where the data are divided. Figure 9 shows an example tree from the data and how classification can be conducted.

The Tree Building Algorithm

The algorithm commences by looking for a variable that divides the data into two nodes. This division is arrived at by minimising the impurity measure at that node. A node that contains two or more classes is impure, while a node that contains only one class is pure; a measure of impurity therefore quantifies the extent to which a node contains multiple classes. The division is recursive and continues until a stopping criterion is met. Examples of stopping criteria include the tree becoming too large and complex, or the set depth of the tree being reached [41].

Classification and Regression Tree Algorithm

For the Classification and Regression Tree (CART) algorithm, the impurity measure at each node is the MSE. This results in a tree that is a collection of estimators at each node from the starting node to the terminal node [41]. In R, the study used the rpart package [42] to implement the CART algorithm.

Gini Importance

The Gini index is often used as a measure of impurity for splitting in tree-building algorithms for classification outcomes. The aim is to maximise the decrease in impurity at a node. A large Gini index indicates a large decrease in impurity at a node, and hence a covariate split with a large Gini index can be considered important for classification [43,44]. Let the decrease in impurity at a node $h$ be denoted $i(h)$. The Gini importance of a covariate $X_j$ in a tree is the total decrease in impurity, $I=\sum_h i(h)$, over all nodes $h$ at which $X_j$ is selected for splitting, that is, the sum of the Gini indices at all nodes in which covariate $X_j$ is selected for splitting. The average of the tree importance values for $X_j$ over all trees in a forest is then termed the Gini importance of the random forest for $X_j$.

3.4.2. Random Forest

A random forest is an ensemble of multiple CART decision trees [45]. The ensemble is fast and flexible, as it grows trees on bootstrap samples of the data without pruning. Random forest employs a modification of an algorithm known as bagging. There are many contemporary implementations of random forest algorithms, but the most popular is Leo Breiman's algorithm [40]. This involves bootstrap aggregation of single-tree learners [40,46].

3.4.3. Extension of the Bagging Algorithm

Out of Bag (OOB) Performance

RFs are popular because of their good OOB performance, usually giving high accuracy even with the default parameters in R (the study used the rfviz package [47]).

Variable Importance

Random forest models are usually considered black boxes, due to the opacity of the inner workings of the algorithm (Algorithm 2). To this end, for a degree of explainability it is recommended that one evaluate some form of variable importance when using RF models. Variable importance identifies which covariates were most influential, together with their degree of influence on the resulting classification model [48]. The most common measures of variable importance are the Gini importance and the permutation importance, although the Gini importance is often biased. Permutation importance, on the other hand, bases its value on the effect of the covariate on the predictive power of the resulting forest [49].

3.4.4. Extreme Gradient-Boosted Models (XGB)

Extreme Gradient Boosting (XGB) is an ensemble method that combines machine learning models of the same type [50]. For these data, the study used gradient boosting for classification; however, it can also be applied to regression. In extreme gradient boosting, the ensemble consists of decision trees. Unlike RF, which ensembles deep independent trees (multiple trees connected in parallel), XGB ensembles shallow trees (single trees connected in series). For this model, trees are added to the ensemble one at a time [50,51]. Each tree seeks to correct the errors made by the ensemble that precedes it [52]; see Figure 10 for an illustration.

Algorithm 2: Random Forest Algorithm [44]

Obtain training data (selecting a random number of data points) and select the number of trees to be built (say $n$ trees);

For every tree, obtain a bootstrap sample and grow a CART tree on this sample;

For each split, draw $m$ variables out of all variables (where $m$ is half the total number of variables) and select the best variable for the split. Then, divide the node into two;

Apply a tree stopping criterion (without pruning) to know when the tree is complete;

From the above, obtain the output of the ensemble of trees.
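The listing above can be sketched in Python as follows. This is a simplified illustration rather than the study's implementation (which was done in R): each CART tree is replaced by a one-split decision stump to keep the example short, so the variable subset is drawn once per tree instead of once per split, and all function names and the toy data are hypothetical.

```python
import random

def fit_stump(rows, labels, feature_ids):
    """Pick the (feature, threshold, polarity) with the fewest training errors."""
    best = None
    for j in feature_ids:
        for t in sorted({r[j] for r in rows}):
            for polarity in (0, 1):
                # polarity=1 predicts class 1 when the value exceeds t.
                preds = [polarity if r[j] > t else 1 - polarity for r in rows]
                errors = sum(p != y for p, y in zip(preds, labels))
                if best is None or errors < best[0]:
                    best = (errors, j, t, polarity)
    return best[1:]

def predict_stump(stump, row):
    j, t, polarity = stump
    return polarity if row[j] > t else 1 - polarity

def fit_forest(rows, labels, n_trees=25, seed=0):
    rng = random.Random(seed)
    n, p = len(rows), len(rows[0])
    m = max(1, p // 2)  # half of the variables, as in the listing above
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        feature_ids = rng.sample(range(p), m)       # random variable subset
        forest.append(fit_stump([rows[i] for i in idx],
                                [labels[i] for i in idx], feature_ids))
    return forest

def predict_forest(forest, row):
    """Majority vote over the individual learners."""
    votes = sum(predict_stump(s, row) for s in forest)
    return 1 if 2 * votes > len(forest) else 0

# Hypothetical, well-separated toy data with two informative variables.
rows = [[i / 10, i / 20] for i in range(10)] + \
       [[10 + i / 10, 5 + i / 20] for i in range(10)]
labels = [0] * 10 + [1] * 10
forest = fit_forest(rows, labels)
low = predict_forest(forest, [-1.0, -1.0])
high = predict_forest(forest, [20.0, 20.0])
```

Because each learner only sees a bootstrap sample and a subset of the variables, the individual stumps differ, and the forest prediction is their majority vote.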

With gradient boosting models, the study fitted models using a differentiable loss function and an algorithm that minimises it by gradient descent [52]. Extreme gradient boosting is designed for high effectiveness and computational efficiency, and it has an open-source implementation [53]; we used the xgboost package [54] in R.

3.5. Artificial Neural Networks (ANNs)

ANNs are a mathematical imitation of the way the brain is organised as an interconnected network of nodes called neurons. Each neuron is connected to others and can receive, process and output information. Neural networks are made of three main layers of neurons: (1) the input layer, (2) the hidden layer and (3) the output layer [55]. There are many arrangements and configurations in which the neurons can be connected to each other, and this configuration determines the type of network and how it functions. A mathematical neuron has three main features [55].

Deep Learning, also known as Deep Structured Learning, belongs to the family of ANN machine learning methods and is based on a group of algorithms built on multi-layer NNs that can perform various machine learning tasks [56]. These algorithms include more than one hidden layer in the NN structure, hence the name deep. The most common form of Deep Learning algorithm is the feed-forward network (FFN), which allows information to move in the forward direction only, without any recurrence [57]. We used the Keras package in R [58] to fit both the SNN and the CNN because of its computational flexibility, the ease of tuning hyper-parameters and its good visual outputs, especially during network training.

3.5.1. Self Normalising Neural Network (SNNs)

The study implemented SNNs, a deep learning method capable of performing classification and regression (statistical models capable of predicting the values of an outcome y using the values of predictor variables x). Normalisation is the adjustment of data values to a similar scale. The SNN, unlike other NNs, uses a distinct activation function called SELU as well as a distinct dropout method called alpha dropout [59]. These two features give this neural network its self-normalising property, i.e., normalisation of the data without the need for any human input.

3.5.2. Scaled Exponential Linear Units (SELU)

SNNs use SELU as the activation function. SELU makes the SNN self-normalising through the construction of a special mapping g, which maps normalised inputs to outputs. ReLU, leaky ReLU, tanh and sigmoid units cannot be used to construct SNNs [59]. SELU is defined, for any value, say x, as:

$$\mathrm{SELU}(x)=\lambda\begin{cases}x, & \text{if } x>0\\ \alpha e^{x}-\alpha, & \text{if } x\leq 0\end{cases}$$

(5)

The hyperparameters $\lambda$ and $\alpha$, with $\lambda>1$, can be controlled to ensure that there are positive net inputs, which can be computed and normalised to obtain a Gaussian distribution. To make sure that the distribution has mean 0 and variance 1, set $\alpha=1.67326$ and $\lambda=1.0507$.
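A minimal Python sketch of the SELU activation with the constants quoted above is given below. It assumes that the scale λ multiplies both branches, as in the standard SELU definition of [59]; the function and constant names are illustrative only.

```python
import math

ALPHA = 1.67326   # alpha from the text above
LAMBDA = 1.0507   # lambda from the text above

def selu(x, alpha=ALPHA, lam=LAMBDA):
    """Scaled exponential linear unit applied to a single value."""
    if x > 0:
        return lam * x
    return lam * (alpha * math.exp(x) - alpha)

positive = selu(2.0)      # linear branch, scaled by lambda
at_zero = selu(0.0)       # both branches meet at zero
saturated = selu(-50.0)   # approaches -lambda * alpha for very negative x
```

For large negative inputs the function saturates at about -λα, which is the value that alpha dropout uses in place of zero.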

Alpha Dropout

Another distinguishing feature of SNNs is their dropout technique. Dropout is the method by which a NN ignores neuron units during the training phase. Ordinary dropout sets the activation of x to zero with probability p; it preserves the mean of the input distribution in the output distribution, but not the variance. For the SNN properties to hold, a NN needs to keep the mean and variance at 0 and 1, respectively. Standard dropout fits ReLUs (ReLU functions are piecewise linear: the output equals the input for a positive input, otherwise the output is zero) and other activation functions without issues. For SELU, however, alpha dropout [59] is proposed, since standard dropout does not perform well with the SELU activation function. Hence, alpha dropout is unique in two ways:

It randomly sets the dropped input values to some value $\alpha'$ instead of zero.

It keeps the mean and variance at 0 and 1, respectively.
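The two properties above can be illustrated with a short Python sketch. It follows the formulation in the original SNN paper [59], taking $\alpha'=-\lambda\alpha$ and an affine correction $a\,(\cdot)+b$ that restores zero mean and unit variance; these details are not spelled out in the text above and are stated here as assumptions, and the function name is hypothetical (the study itself used the Keras package in R).

```python
import random

ALPHA = 1.67326
LAMBDA = 1.0507
ALPHA_PRIME = -LAMBDA * ALPHA  # value that dropped units are set to

def alpha_dropout(values, rate, rng):
    """Apply alpha dropout with drop probability `rate` to a list of activations."""
    q = 1.0 - rate  # keep probability
    # Affine correction so that standardised inputs stay at mean 0, variance 1.
    a = (q + ALPHA_PRIME ** 2 * q * (1.0 - q)) ** -0.5
    b = -a * (1.0 - q) * ALPHA_PRIME
    out = []
    for x in values:
        kept = rng.random() < q
        out.append(a * (x if kept else ALPHA_PRIME) + b)
    return out

# Standard-normal activations, used only to check the two moments empirically.
rng = random.Random(0)
inputs = [rng.gauss(0.0, 1.0) for _ in range(20000)]
outputs = alpha_dropout(inputs, rate=0.1, rng=rng)
mean = sum(outputs) / len(outputs)
variance = sum((v - mean) ** 2 for v in outputs) / len(outputs)
```

With a drop rate of zero the correction reduces to the identity, and for standardised inputs the sample mean and variance of the outputs should stay close to 0 and 1.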

3.5.3. Convolutional Neural Network

This study fitted a convolutional neural network (CNN) to the data. The CNN is a feed-forward neural network with a depth of up to 30 layers and multiple cells. The layers are connected in series (one after the other), with the hidden layers consisting of convolutional layers followed by either activation layers or pooling layers (pooling layers reduce the number of parameters and calculations required in the network). Convolutional layers differ from regular layers in other NNs because they use convolutions (a convolution combines two functions to produce an output) rather than matrix multiplication [60,61]. Activation layers in a CNN use different activation functions depending on the function of the network. This study used the sigmoid function and ReLU in the activation layers.

3.6. Measures of Model Performance

3.6.1. Confusion Matrix

The study used the confusion matrix given in Table 1 to define various measures of performance and thus to compare how the models performed against each other (Visa et al., 2011).

3.6.2. Accuracy, Precision, and Sensitivity

Accuracy, precision, and sensitivity are defined as [62,63]

$$\text{Accuracy}=\frac{TN+TP}{P+N},\qquad \text{Precision}=\frac{TP}{TP+FP},\qquad \text{Sensitivity}=\frac{TP}{TP+FN}$$
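These three definitions translate directly into code. The sketch below uses hypothetical confusion-matrix counts and an illustrative function name; P and N are taken to be the total numbers of actual positives and actual negatives.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision and sensitivity from confusion-matrix counts."""
    positives = tp + fn  # P: all actual positives
    negatives = tn + fp  # N: all actual negatives
    return {
        "accuracy": (tp + tn) / (positives + negatives),
        "precision": tp / (tp + fp),
        "sensitivity": tp / (tp + fn),
    }

# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
```

With these counts the accuracy is 0.85, the precision is 0.80 and the sensitivity is about 0.89.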

Higher values of accuracy, precision, and sensitivity indicate that the model makes more correct predictions. Hence, a model with higher classification or regression accuracy, precision, and sensitivity is better than one with lower accuracy, precision, and sensitivity [62,63].

3.6.3. Cohen's Kappa κ

This quantifies the reliability and accuracy of a classification method [64]. Unlike accuracy, Cohen's κ takes into account the agreement that can happen merely by chance. It is defined as

$$\kappa=\frac{p_0-p_e}{1-p_e}$$

where $p_0$ is the relative observed agreement and $p_e$ is the chance agreement probability. Cohen's κ is at most 1, with κ=1 meaning that there is complete agreement, while κ=0 means that there is no agreement beyond what is expected by chance; negative values indicate agreement worse than chance.

3.6.4. Predictive Values (PPV/NPV)

Predictive values measure the probability of correct predictions. Positive Predictive Value (PPV) is the probability that an individual is actually positive given that the model predicts a positive result. Negative Predictive Value (NPV) is the probability that an individual is actually negative given that the model predicts a negative result [65]. The higher the values of PPV/NPV, the better the predictive performance of a model. Lower values of PPV/NPV indicate that the model predicts many false positives/negatives [65,66]. PPV and NPV are computed by:

$$\text{PPV}=\frac{TP}{TP+FP},\qquad \text{NPV}=\frac{TN}{TN+FN}$$

3.6.5. Receiver Operating Characteristic

The Receiver Operating Characteristic (ROC) curve is a useful way of visualising the performance of a classifier. The graph has long been used to depict the trade-off between the false alarm rate and the hit rate of a classifier [67]. The ROC curve and the Area Under the Curve (AUC) have mostly been adopted in conjunction with other performance measures to provide a comprehensible comparison between classifiers [67,68]. The true positive rate (TPR) and false positive rate (FPR) are defined as:

$$\text{TPR}=\frac{TP}{P}=\frac{TP}{TP+FN}=\text{Sensitivity},\qquad \text{FPR}=\frac{FP}{N}=\frac{FP}{FP+TN}=1-\text{Specificity}$$
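To make the construction of the curve concrete, the following Python sketch sweeps a decision threshold over predicted scores, records the (FPR, TPR) pair at each threshold, and integrates the curve with the trapezoidal rule. It takes TPR = TP/(TP+FN) and FPR = FP/(FP+TN); the scores, labels and function names are hypothetical.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs obtained by sweeping the decision threshold."""
    p = sum(labels)        # number of actual positives
    n = len(labels) - p    # number of actual negatives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / n, tp / p))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Hypothetical predicted probabilities and true classes.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]
curve = roc_points(scores, labels)
area = auc(curve)
```

In this example eight of the nine positive-negative pairs are ranked correctly, so the area under the curve is 8/9.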

The ROC graphs are plotted in two dimensions, with TPR on the Y-axis and FPR on the X-axis; Figure 2 and Figure 3 are examples of ROC curves. The graph thus shows the trade-off between gains (true positives) and losses (false positives) [68]. The higher the value of AUC, the better the performance of the model. An AUC of 50% corresponds to random guessing, and an AUC below 50% is worse than random guessing.

3.6.6. The Wald Test

To test for the significance of a variable in a regression classifier, this study used the Wald test [69] at a 5% level of significance. Consider the LR model given in Equation (4) in Section 3.3:

$$\log\left(\frac{\pi_i}{1-\pi_i}\right)=\beta_0+\sum_{i=1}^{n}\beta_i x_i$$

The Wald test (sometimes known as the Z-test) tests the null hypothesis $H_0:\beta_i=0$ against the alternative hypothesis $H_1:\beta_i\neq 0$. Failure to reject $H_0$ means that the variable whose coefficient is $\beta_i$ is not significant to the model, while rejecting $H_0$ means that there is enough evidence to suggest that the variable with coefficient $\beta_i$ is significant to the model [70].

4. Exploratory Data Analysis

This section describes the methods of data exploration and the steps that were taken to obtain the data structure required by the analysis. It also provides summary statistics and visualisations of the data.

4.1. Data Preparation

For data preparation, we combined all repeated test results from individuals who had two or more tests into a single reading. An individual with at least one positive PCR COVID-19 test was labeled as positive. For other analytes and variables, an average over all the available tests was used. Analyte data were then filtered for time relevancy by taking results from seven days before to 14 days after the recorded COVID-19 test result, although data concerning chronic diseases such as HIV, TB, and DM were taken from the six months preceding the SARS-CoV-2 test. The exclusion process is shown in Figure 11, and a summary of the demographics of the final data is given in Table 2. For validation and re-sampling, the data set was split into training and testing sets in the ratio 70:30, respectively. The data split was coupled with 10-fold cross-validation repeated five times for each of the proposed models.

4.2. Missing Values and Imputation

Figure 12 (the blue bars represent the percentage of missing values for each variable) shows that most variables have at least 80% missing values. The missingness in the data was structural because not all tested patients were admitted and had routine blood tests. To deal with missing values, we used the missForest package in R [21], which is robust to such an amount of missing values, and compared it with simple statistics missing value imputation (SSMVI) using measures of central tendency, specifically the mean and mode. Results of the comparison of these methods, coupled with two robust variable selection methods and applied to the base ML method of logistic regression, are shown in Table 3.

4.3. Variables and Variable Selection

4.3.1. Variables

The data contained analytes that were grouped by physiological system as follows: inflammatory [C-reactive protein (CRP), IL-6, procalcitonin (PCT), ferritin, erythrocyte sedimentation rate (ESR)], coagulation (D-dimer, INR, fibrinogen), full blood count [white cell count (WCC) total, red cell count, haemoglobin, haematocrit, mean corpuscular volume, mean corpuscular haemoglobin, mean corpuscular haemoglobin concentration, red cell distribution width, platelet count], WCC differential [absolute count, neutrophil, lymphocyte, monocyte, eosinophil, basophil, as well as the neutrophil to lymphocyte ratio (NLR)], liver related [aspartate aminotransferase (AST), alanine aminotransferase (ALT), gamma-glutamyl transferase (GGT), lactate dehydrogenase (LDH), total bilirubin, albumin], cardiac related [troponin T, troponin I, N-terminal pro b-type natriuretic peptide (NT-proBNP)], endocrine related (HbA1c) and renal function related [urea, creatinine, estimated glomerular filtration rate (eGFR)].
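For analytes such as those listed above, the preparation rules of Section 4.1 (one record per individual, positive if any PCR test is positive, analytes averaged within a window of seven days before to 14 days after the COVID-19 test) might be sketched as follows. This is only an illustration and not the study's code: the function names, the record layout, and the choice of the first positive PCR date (or the first PCR date when none is positive) as the reference date are all assumptions.

```python
from datetime import date, timedelta
from statistics import mean

# Window bounds taken from the text: 7 days before, 14 days after the PCR test.
DAYS_BEFORE = 7
DAYS_AFTER = 14


def in_window(result_date, pcr_date):
    """True if an analyte result falls inside the time-relevance window."""
    start = pcr_date - timedelta(days=DAYS_BEFORE)
    end = pcr_date + timedelta(days=DAYS_AFTER)
    return start <= result_date <= end


def collapse_patient(pcr_results, analyte_results):
    """Collapse one patient's repeated tests into a single record.

    pcr_results: list of (date, is_positive) tuples (hypothetical layout).
    analyte_results: list of (date, analyte_name, value) tuples.
    """
    # An individual with at least one positive PCR test is labelled positive.
    positive = any(is_pos for _, is_pos in pcr_results)

    # ASSUMPTION: the reference date is the first positive PCR test,
    # or the first PCR test if none was positive.
    positive_dates = [d for d, is_pos in pcr_results if is_pos]
    if positive_dates:
        reference = min(positive_dates)
    else:
        reference = min(d for d, _ in pcr_results)

    # Gather in-window values per analyte, skipping missing results.
    values = {}
    for result_date, name, value in analyte_results:
        if value is not None and in_window(result_date, reference):
            values.setdefault(name, []).append(value)

    # Average over all available in-window tests for each analyte.
    record = {"covid_positive": positive}
    record.update({name: mean(vals) for name, vals in values.items()})
    return record


# Illustrative (made-up) data for a single patient.
pcr = [(date(2020, 7, 1), False), (date(2020, 7, 10), True)]
analytes = [
    (date(2020, 7, 2), "CRP", 50.0),   # before the window, dropped
    (date(2020, 7, 5), "CRP", 10.0),   # inside the window
    (date(2020, 7, 20), "CRP", 30.0),  # inside the window
    (date(2020, 8, 1), "CRP", 99.0),   # after the window, dropped
]
record = collapse_patient(pcr, analytes)
```

Analytes with no in-window result are simply absent from the record, which corresponds to the missing values discussed in Section 4.2.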

4.3.2. Variable Selection

To begin with, we removed features that had a confounding effect on the results and/or were created as a consequence of a positive COVID-19 test, such as the severity of the COVID-19 disease, as well as some features that contained results of the methods used to arrive at the HIV, TB and DM results. Eventually, there was a total of 37 variables, with names displayed in Table 4.

We compared two methods of variable selection, namely the LASSO and the Boruta algorithm. The two methods were paired with the two methods of missing data imputation, and the resulting data were run through the LR base model. The results for the four combinations of imputation and variable selection methods applied to our base ML model are given in Table 3.

In Table 3, the top-performing value for each measure is coloured in red. The combination of missForest imputation and the Boruta variable selection algorithm outperforms the other three combinations on five out of seven measures of model performance. Thus, we concluded that missForest imputation and the Boruta algorithm were the best methods for the given data. The data and variables selected by this combination are used in further ML modelling, both in status prediction and in risk stratification.

4.3.3. Boruta Variable Selection

We ran the initial 43 variables and the data obtained after missing value imputation with missForest through the Boruta algorithm. The results of the variable selection are shown in Figure 13. The variables with green box plots (38 variables) are the important variables, while those in red (five variables) are not important to the model. Shadow variables are shaded in blue. The variables whose box plots are coloured in green are selected for use in this study.

5. Results: Risk Stratification

This section analyses the results of the ML models fitted for risk stratification (RS). As noted, a critical case was defined as a patient who was admitted to a ward because of COVID-19 complications; non-critical patients are positive cases that were not admitted. Of the 3301 positive cases, 1036 were classified as risky and 2265 as not risky. The study fitted each model with a 70% training set and a 30% test set, together with 10-fold cross-validation repeated five times for re-sampling.
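The resampling scheme described above (a 70:30 split followed by 10-fold cross-validation repeated five times on the training set) can be sketched in plain Python as below. This is a minimal illustration under stated assumptions, not the study's implementation: the function names are hypothetical, stratification by class is an assumption (the text does not say whether the split was stratified), and model fitting is left out.

```python
import random


def stratified_split(labels, train_frac=0.7, seed=0):
    """Split indices into train/test, keeping class proportions (assumed)."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = int(train_frac * len(idx))
        train += idx[:cut]
        test += idx[cut:]
    return sorted(train), sorted(test)


def repeated_stratified_kfold(labels, indices, k=10, repeats=5, seed=0):
    """Yield (fit_idx, val_idx) pairs: k folds, reshuffled for each repeat."""
    rng = random.Random(seed)
    for _ in range(repeats):
        folds = [[] for _ in range(k)]
        for cls in sorted(set(labels[i] for i in indices)):
            idx = [i for i in indices if labels[i] == cls]
            rng.shuffle(idx)
            # Deal each class round-robin so every fold gets a similar share.
            for pos, i in enumerate(idx):
                folds[pos % k].append(i)
        for f in range(k):
            val_idx = folds[f]
            fit_idx = [i for g in range(k) if g != f for i in folds[g]]
            yield fit_idx, val_idx


# Class counts from the text: 1036 risky (1) and 2265 not risky (0).
labels = [1] * 1036 + [0] * 2265
train_idx, test_idx = stratified_split(labels)
splits = list(repeated_stratified_kfold(labels, train_idx))
# A model would be fitted on fit_idx and scored on val_idx for each split;
# the held-out test_idx is used only for the final evaluation.
```

With these settings the training set yields 50 fit/validation pairs (10 folds times five repeats), and the 30% test set never enters the cross-validation loop.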

5.1. Logistic Regression

The study fitted the equation given in Section 3.6.6 with the predicted variable being the severity of the patient, i.e., critical or not critical, and the predictors being the variables selected by running the Boruta algorithm as given in
