Prediction of complications in diabetes mellitus using machine learning models with transplanted topic model features

Fig. 1

Basic concept of LDA [4]

α: initial parameter of the Dirichlet distribution; β: parameter signifying the relations between words and topics; θ: parameter containing the relations between documents and topics; M: number of documents in a dataset; N: number of words in a document; w: words appearing in a document; Z: topic allocations to words in a document

As aforementioned, LDA was the basic approach to processing the clinical notes of our data. To provide a brief background on LDA, every word (w) in an actual document (d) is assumed to be produced under the influence of θ and β (see Fig. 1). α is an initial parameter of the Dirichlet distribution. θ expresses the document-topic relation, and β reflects the word-topic relation. Z records the combined effect of θ and β: it consists of pairs of words and topic numbers, indicating the assignment of each word to a particular topic in the document.

The gray circle in Fig. 1 represents the actual occurrence of a word, while the transparent circles represent hidden or abstract objects. Topic modeling is a posterior procedure estimating the approximate parameters of θ and β from a data set. Thus, the dataset can be translated into a matrix (M × K), where M represents the number of documents, and K is the topic count. The matrix (M × K) is called the topic structure.
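The estimation described above can be sketched in code. The study used Blei's LDA-C; the scikit-learn implementation below is only an illustration of how an M × V document-term matrix is reduced to an M × K topic structure, with toy documents and a small topic count standing in for the real data.

```python
# Sketch: estimating a document-topic matrix (M x K) from a document-term
# matrix (M x V) with LDA. The authors used Blei's LDA-C; scikit-learn's
# variational-EM implementation is used here purely for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "diabetes retinopathy eye exam blood glucose",
    "kidney creatinine nephropathy dialysis glucose",
    "fatty liver enzyme ultrasound glucose",
]

vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(docs)   # M x V term-frequency matrix

K = 2                                  # topic count (the study used K = 100)
lda = LatentDirichletAllocation(n_components=K, random_state=0)
theta = lda.fit_transform(dtm)         # M x K topic structure

print(theta.shape)                     # (3, 2): one topic mixture per document
```

Each row of `theta` is a document's weight distribution over the K topics, i.e., one row of the topic structure described above.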

The basic approach of this study aimed to predict the onset of DM-related complications using the clinical notes of patients through a semi-supervised classification model. LDA or topic modeling was employed to reduce the dimensions of the input data. Topic modeling is advantageous because it reduces the dimension of the TF matrix filled with 0s, which provides a memory space benefit. The data of the patients were grouped according to four types of well-known DM complications. In each group, an analogous number of positive cases (i.e., patients with DM who developed complications) and negative cases (i.e., patients with DM who did not develop complications) were included. This enabled the subsequent computation of the correlation between topic structure and complications. In each group, 90% of the data were used to train the classification model, and the remaining 10% were used as test data.

After the training data were text processed and indexed, they were organized into a document-term matrix (M × V). Through topic modeling, this matrix was converted into a document-topic matrix (M × K) that demonstrated the estimated topic structure. The topic structure and complication information of the training data were entered into the classification model, which then computed the correlation between them.

Subsequently, based on the topic structure of the training dataset, the weighted topic structure of the test data was computed; we refer to this as the transplanting process. In this way, we matched the structures of the training and test data. The weighted topic structure of the test data was inputted into the designed classification model. The classification model automatically computed the probability of complications in the test data based on the trained correlation between the topic structure and the onset of complications. Figure 2 illustrates the overall workflow of this study. Data acquisition from the SNUH EMR system precedes preprocessing and indexing. The preprocessing included tokenization and part-of-speech (POS) tagging: tokenization splits sentences into tokens, and POS tagging identifies the POS of each token and attaches the corresponding tag. Indexing counts the TF of each word in a document and composes an M × V matrix. Topic modeling accepts the M × V matrix as input and produces an M × K matrix. Examples of the M × V and M × K matrices are included in the supplement.
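The indexing step can be sketched as follows. Whitespace tokenization stands in here for the Korean POS tagging used in the study, and the documents are illustrative only; the point is how term frequencies are counted to fill the M × V matrix.

```python
# Sketch of the indexing step: counting term frequencies per document and
# assembling an M x V matrix. Whitespace tokenization stands in for the
# Korean POS tagging used in the study.
from collections import Counter

docs = [
    "glucose insulin glucose exam",
    "insulin kidney exam",
]

# Tokenize and build a shared vocabulary (V terms)
tokenized = [d.split() for d in docs]
vocab = sorted({tok for doc in tokenized for tok in doc})

# Fill the M x V matrix with term-frequency (TF) values
tf_matrix = []
for doc in tokenized:
    counts = Counter(doc)
    tf_matrix.append([counts.get(term, 0) for term in vocab])

print(vocab)       # ['exam', 'glucose', 'insulin', 'kidney']
print(tf_matrix)   # [[1, 2, 1, 0], [1, 0, 1, 1]]
```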

Fig. 2

Overall Workflow

Preprocessing: correcting typos, part of speech (POS) tagging, composing stop words list, and replacing drug product names with ingredient names; Indexing: filling a matrix (M × V) with term frequency (TF) values; Topic Modeling: filling a matrix (M × K) with the document-topic weight values; Classification: predicting the label variable utilizing the machine learning model

2.1 Data set

The clinical notes collected for this study were text documents written by clinicians in the outpatient clinics while treating patients. These generally contain the medical history of the patient, chief complaint, physical examination results, test results, impression, and a plan describing subsequent examinations and medications. We obtained the clinical notes of 9,430 patients with DM from the EMR system of the SNUH outpatient clinic, from 2013 to 2015. Furthermore, we collected diagnostic data for these patients from their outpatient clinic visits between 2013 and 2020. Data collection was approved by the Institutional Review Board (IRB) of Seoul National University Hospital (IRB NO: C-1612-085-815). Thereafter, we divided the data into four groups according to the type of DM complication: diabetic retinopathy (DMR), diabetic nephropathy (DMN), nonalcoholic fatty liver disease (NAFLD), and cerebrovascular accident (CVA). To analyze the correlation between the topic structure of the data and complications, negative cases were included in each group of data. The numbers of positive cases (i.e., DM patients who developed complications) and negative cases (i.e., DM patients who did not develop complications) in each group were balanced. For topic modeling, clinical notes of three years for each patient were merged into a single document. The average number of visits for positive cases in each group is described in each subsection.

2.1.1 DMR data set

The DMR group comprised 1,747 patients diagnosed with DMR (positive cases) and 1,653 patients with DM who did not develop DMR (negative cases). The ICD-10 codes used to identify the dataset were E14.3 (diabetic retinopathy), H36.0 (nonproliferative diabetic retinopathy), and E11.3 (type 2 diabetes mellitus with non-proliferative retinopathy). On average, the patients visited the outpatient clinic 13.9 times between the first diagnosis date of DM and that of DMR.

2.1.2 DMN data set

Using ICD-10 codes E14.2 (unspecified diabetes mellitus with renal complications) and E11.2 (diabetes mellitus with kidney complications), 970 patients with DM diagnosed with DMN were included in the DMN group. In total, 997 negative cases were included in this group. The average number of visits to the outpatient clinic by DMN-positive patients in this group was 20.8 times between the first diagnosis of DM and that of DMN.

2.1.3 NAFLD data set

In the NAFLD group, 444 patients with DM and NAFLD were selected as positive cases. In total, 411 negative cases were included. The ICD-10 codes used to obtain these data were K75.8 (nonalcoholic steatohepatitis) and K76.0 (fatty liver). NAFLD-positive patients in this group visited the outpatient clinic 13.2 times on average, between the first diagnosis of DM and that of NAFLD.

2.1.4 CVA data set

In the CVA group, 401 patients also diagnosed with CVA were selected as positive cases. There were 407 negative cases in this group. The ICD-10 codes I63.9 (cerebral infarction, unspecified) and I63.8 (other cerebral infarctions) were used to obtain this dataset. The CVA-positive patients in this group visited the outpatient clinic 15.2 times on average, between the first diagnosis of DM and that of CVA.

Table 1 summarizes the properties of each dataset. As shown in this table, the proportions of positive and negative cases in each group were balanced.

Table 1 Properties of Datasets

2.2 Text processing

The collected clinical notes were written using Korean syntax. For topic modeling, words in the functional category were excluded. Therefore, we employed a Korean POS tagging program to sort meaningful tokens. The POS tagger used was the Korean Intelligent Word Identifier, developed through the 21st century Sejong Project [13].

Another issue was that the collected clinical notes contained many English terms. English terms representing diseases, symptoms, laboratory tests, etc. were used as tokens in their normalized forms. Finally, the same drug was often referred to by different names: drugs appear in clinical notes under either their product names or their ingredient names. For example, "amlodipine," an ingredient name, can also be written as "Norvasc," its product name. We replaced product names with ingredient names to unify the different terms for the same drugs. Consequently, the document frequency (DF) of drug names increased.
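The drug-name normalization can be sketched with a simple lookup table. The mapping below is a tiny illustrative sample, not the table used in the study.

```python
# Sketch of normalizing drug names: replacing product names with ingredient
# names so the same drug is counted under a single token. The mapping here
# is an illustrative sample only.
PRODUCT_TO_INGREDIENT = {
    "norvasc": "amlodipine",
    "glucophage": "metformin",
}

def normalize_drugs(tokens):
    """Map product names to ingredient names; leave other tokens unchanged."""
    return [PRODUCT_TO_INGREDIENT.get(t.lower(), t) for t in tokens]

print(normalize_drugs(["Norvasc", "5mg", "glucophage"]))
# ['amlodipine', '5mg', 'metformin']
```

Merging product and ingredient names under one token is what raises the document frequency of each drug, as noted above.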

2.3 Held-out test data

As stated above, 10% of each dataset was used as the test data, and the remaining 90% was used to train the classification model. This is contrary to the general convention of machine learning projects that utilize dimensionality reduction, in which the test data are split off after dimensionality reduction. In our study, however, the test data were held out before topic modeling to ensure that the classification model learned only the pattern inherent in the training data. This is essential because the model must forecast the onset of any future complications using only the clinical notes of patients with DM and the pattern learned from the training data. Table 2 presents the properties of the test data.
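The split itself is straightforward; what matters is its position in the pipeline. The sketch below (illustrative document IDs only) holds out the test set first, so that topic modeling can later be fit on the training documents alone.

```python
# Sketch of holding out test data *before* topic modeling, so the topic
# structure is estimated from the training documents only and no
# test-document vocabulary leaks into the model.
import random

random.seed(0)
doc_ids = list(range(100))             # illustrative document IDs
random.shuffle(doc_ids)

split = int(len(doc_ids) * 0.9)        # 9:1 train/test ratio used in the study
train_ids, test_ids = doc_ids[:split], doc_ids[split:]

# Topic modeling would be fit on train_ids only; the held-out documents are
# later projected onto that topic structure (the transplanting process).
print(len(train_ids), len(test_ids))   # 90 10
```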

Table 2 Properties of Isolated Test Data

2.4 Topic modeling

For topic modeling, we used LDA-C, provided by David M. Blei [14], and translated it into Microsoft Visual C#.NET 2022. The topic count was set to 100 because our unpublished preliminary study estimated that 100 was the optimal number of topics. First, a document-term matrix was created from the training data. Thereafter, it was converted into a document-topic matrix and a topic-term matrix through topic modeling. Topic models can be optimized using two methods: Gibbs sampling and the EM algorithm. In this study, the EM algorithm was applied.

Next, the topic structure of the test data was estimated, considering the extracted topic structure of the training data. This process is called transplantation. Transplanting the topic models of the training data into the test data was necessary to match the dimensions of the topic structures of the two datasets. Matching the dimensions of the two structures is essential because the topic structure of the test data is inputted into the classification model. The model can compute the probabilities given a learned pattern in the training data when the input value has the same dimensions as the learned topic structure of the training data.

In the original LDA model proposed by Blei et al. [4], γ is a matrix (M × K) that represents the relationship between documents and topics, and ϕ is a matrix (K × V) showing the relation between topics and words. γ serves as the feature set for a supervised machine learning project. The main concern of the transplantation step in this study is how to infer γ for the documents in the test data.

Therefore, we first check whether the nth word in the mth document of the test data, wm,n, is included in ϕ, which was estimated from the training data. When wm,n is the tth word in ϕ, the weight value expressing the relationship between the mth document and the kth topic (γm,k) can be calculated as follows:

$$\gamma_{m,k} = \sum_{n=1}^{N} TF_{n} \times \exp\left(\phi_{k,t}\right)$$

Here, N is the number of words in the mth document of the test data, and TFn is the term frequency of the nth word in the mth document. The second term expresses the weight value of the tth word and kth topic in ϕ of the training data, converted from its logarithmic form into an ordinary number. Thereafter, we utilized the inferred γ as the feature set for supervised machine learning.
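The transplantation step can be sketched directly from this description. In the sketch below, ϕ holds log weights (as in LDA-C) with toy values and a toy vocabulary; words absent from the training vocabulary receive the −100.0 log weight used in the study, so they contribute essentially nothing.

```python
# Sketch of the transplantation step: inferring gamma (document-topic
# weights) for a held-out test document from the topic-term matrix phi
# estimated on the training data. phi stores log weights, as in LDA-C;
# the vocabulary and weight values here are illustrative only.
import math

K = 3
train_vocab = {"glucose": 0, "insulin": 1, "kidney": 2}  # word -> column t in phi
phi = [                     # phi[k][t]: log weight of word t under topic k
    [-0.5, -1.2, -3.0],
    [-2.0, -0.7, -1.5],
    [-3.5, -2.5, -0.3],
]

def infer_gamma(test_doc_tf):
    """test_doc_tf: dict mapping word -> term frequency in the test document."""
    gamma = [0.0] * K
    for word, tf in test_doc_tf.items():
        t = train_vocab.get(word)
        for k in range(K):
            # Words unseen in training get log weight -100.0 (the smoothing
            # value used in the study), i.e., effectively zero weight.
            logw = phi[k][t] if t is not None else -100.0
            gamma[k] += tf * math.exp(logw)
    return gamma

g = infer_gamma({"glucose": 2, "dialysis": 1})  # "dialysis" is unseen
print(g)
```

Each entry of `g` accumulates TF times the exponentiated topic-word weight, matching the formula above; the inferred gamma row then becomes one feature vector for the classifier.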

An important issue at this stage is the number of words appearing in the held-out test data that are absent from the transplanted topic model. These words are referred to as unseen data: data that the model has not yet learned. They must therefore be smoothed to improve the model quality. Accordingly, the log value of the unseen data was initialized to −100.0 to minimize their influence on the calculation of γ for the test dataset. Table 3 presents the percentage of unseen words with respect to the transplanted topic model for each held-out test dataset.
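A quick check shows why −100.0 is an effective smoothing value in log space:

```python
# Why a log value of -100.0 neutralizes unseen words: exponentiating it
# yields a weight indistinguishable from zero, so unseen words contribute
# essentially nothing to the inferred gamma of a test document.
import math

unseen_weight = math.exp(-100.0)
print(unseen_weight)   # ~3.72e-44
```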

Table 3 Percentages of the words included in the transplanted topic model

2.5 Prediction models

Three prediction methods were used in this study: Random Forest (RF), Gradient Boosting Machine (GBM), and Extreme Gradient Boosting (XGBoost or XG). For classification, we utilized the "randomForest" package for R 4.2.1 for RF [15], the "gbm" package for R 4.2.1 for GBM [16], and the "xgboost" package for R 4.2.1 for XGBoost [17]. First, we performed a preliminary study with 10-fold cross-validation of each group of data, using the "caret" package for R 4.0.2 [18]. This preliminary study ensured the reliability of the prediction performance of the model. In the preliminary study, topic modeling was conducted prior to data segmentation: after topic modeling of the entire set, each group of data was divided into 10 parts. In each trial, nine parts were used as the training set, and the remaining part (i.e., the test set) was predicted. The test sets were rotated over a total of ten trials so that each of the ten parts of the dataset was predicted once. As the main study, a held-out test was conducted for each group of data. As previously stated, the training set-test set ratio was 9:1. In contrast to the preliminary study, topic modeling was conducted after the training and test data were segmented.
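The classification setup can be sketched as follows. The study used the R packages named above; the scikit-learn equivalents below, with synthetic topic-structure features and labels, illustrate the 10-fold cross-validation design only.

```python
# Sketch of the prediction step: topic-structure features (M x K) with a
# binary complication label, evaluated by 10-fold cross-validation. The
# study used R packages (randomForest, gbm, xgboost via caret); the
# scikit-learn models here are illustrative equivalents on synthetic data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.random((200, 100))     # 200 patients x 100 topic weights (synthetic)
y = (X[:, 0] + rng.normal(0, 0.2, 200) > 0.5).astype(int)  # toy label

for model in (RandomForestClassifier(random_state=0),
              GradientBoostingClassifier(random_state=0)):
    scores = cross_val_score(model, X, y, cv=10)  # 10-fold CV, as in the study
    print(type(model).__name__, round(scores.mean(), 3))
```

For the main held-out experiment, the same models would instead be fit once on the 90% training features and evaluated on the transplanted 10% test features.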
