A Multilabel Text Classifier of Cancer Literature at the Publication Level: Methods Study of Medical Text Classification


Introduction

Background

According to World Health Organization (WHO) reports, cancer is one of the leading causes of death worldwide, accounting for nearly 10 million deaths in 2020 []. As cancer has emerged as one of the greatest threats to human life, the volume of literature published in the cancer field has grown rapidly. The trend of disciplinary convergence has led to publications requiring labels from multiple subjects. Consequently, there is a growing demand for accurate cancer literature classification for retrieval, evidence support, academic analysis, and statistical evaluation in order to support clinical research, precision medicine, and the discovery of interdisciplinary cancer research [,] by forecasting trends and hotspot statistics.

Text classification is the process of assigning specific labels to the literature based on individual features. Current methods of classifying the literature can be divided into 3 groups: mapping based, subject information based, and machine learning based [-]. Recently, an increasing number of studies have experimented with deep learning to enhance the effects of text classification []. Most existing literature classification (eg, Web of Science, Scopus [-]) is carried out at the journal level; that is, all papers in a given journal receive the same labeling categories as the hosting journal. However, given that interdisciplinary research is increasing in the cancer field [], there is a need for a more precise classifier, since the papers in a journal typically span a diverse range of topics [,]. Moreover, literature classification at the journal level can no longer adapt to the dynamics of newly developing subjects or fully characterize text features.

Related Works

Development of Text Classification Technology

Text classification technology has undergone rapid development, from expert systems to machine learning and finally deep learning []. Maron [] published the first paper on automatic text classification in 1961. By the end of the 20th century, machine learning had matured into a fully developed field []. Joachims [] used the bag-of-words model to transform text into a fixed-length vector, selected features with the information gain criterion to achieve dimensionality reduction, and then trained the feature vectors iteratively with a support vector machine (SVM) classifier. Rasjid [] focused on data classification using k-nearest neighbors (k-NN) and naive Bayes. Liang et al [] improved the feature recognition of the literature using appropriate clusters and the introduction of differential latent semantics index (DLSI) spaces. From 2006, with the rapid development of deep learning, text classification research based on deep learning gradually replaced traditional machine learning methods and became the mainstream, with wide applications in numerous tasks [].

Deep learning–based text classification methods adopt word vectors (eg, GloVe [] and word2vec []) for word semantic representation [], on which various deep neural network (DNN)–based text classification methods were subsequently developed. Convolutional neural networks (CNNs) [] were originally constructed for image processing and have been broadly used for text classification [-]. To address the computational complexity, vanishing gradients in deep networks, and short-text limitations of CNNs, a series of optimized models was gradually derived, including FastText [], deep pyramid convolutional neural networks (DPCNNs) [], knowledge pyramid convolutional neural networks (KPCNNs) [], and text convolutional neural networks (TextCNNs) []. In particular, the TextCNN is a simple, shallow network and requires only a small number of hyperparameters for fine-tuning. Compared with CNNs, recurrent neural networks (RNNs) can easily implement multilayer superposition to construct a multilayer neural network [], such as multilayer long short-term memory (LSTM) [] or multilayer gated recurrent units (GRUs). In terms of improvements to RNNs, the text recurrent neural network (TextRNN) uses a multitask learning framework to jointly learn across multiple related tasks, while the deep recurrent neural network (DRNN) [] incorporates position invariance into an RNN and captures local features to find the optimal window size, achieving marked improvements over RNN and CNN models. All these models have laid the foundation for follow-up studies.

Bidirectional Encoder Representation from Transformers (BERT) emerged as a new language representation model through the introduction of attention mechanisms, which have been broadly applied in machine translation [], image description generation [], machine reading comprehension [], and text classification []. Being a bidirectional encoder model based on the transformer, BERT became an important advance in natural language processing, especially text classification. For example, Shen et al [] trained a Chinese corpus–based BERTbase model for the classification of the literature on Chinese social science and technology and explored its application to practical production. In addition, Lu and Ni [] developed a multilayer model for patent classification using the combination “BERT + CNN,” while Liu et al [] proposed a sentence-BERT for hierarchical clustering of literature abstracts. However, the accuracy of applying universal language models directly to the biomedical field is insufficient, which has motivated studies to train biomedical BERT variants from scratch. Typical instances are BioBERT, pretrained on PubMed citations and PubMed Central (PMC) full text [], and PubMedBERT, pretrained from scratch on PubMed text []. To date, BioBERT and PubMedBERT have achieved success in named entity recognition, extraction of relationships between entities, entity normalization [,], International Classification of Diseases (ICD) autocoding [], and its multilabel classification (MLC) []. These studies offer lessons that merit attention.

Multilabel Text Classification

The deep learning model has contributed to the success of MLC due to its dynamic representation learning and end-to-end learning framework. Multilabel text classification (MLTC) is the application of MLC to the task of text classification, assigning a set of targeted labels to each sample [], and has long been a challenge in both academia and industry. In the biomedical domain, Du et al [] proposed ML-Net, an end-to-end deep learning model for biomedical text, while Glinka et al [] focused on a mixture of feature selection approaches combining filter and wrapper methods. In addition, Hughes et al [] classified medical text fragments at the sentence level based on a CNN, and Yogarajan et al [] used a multilabel variant of medical text classification to enhance the prediction of concurrent medical codes. Automatic question-and-answer systems and auxiliary decision-making are 2 typical applications of MLTC. For example, Wasim et al [] proposed a classification model for multilabel problems in the flow of biomedical question-and-answer systems with factlike and listlike questions. Similarly, Baumel et al [] presented a hierarchical attentional bidirectional gated recurrent unit that used attention weights to better identify the sentences and words with the greatest impact on the decision.

In this study, we adopted MLTC technology to classify cancer publications for better retrieval, academic analysis, and statistical evaluation. We introduced a multilabel classification method for cancer research at the publication level based on the “BERT + X” model, where BERT is a learning model pretrained on a large and easily accessible data set that can be migrated to other tasks to obtain better outcomes, and X is a deep learning model that accurately captures the semantic features of text. The combined model was trained on a corpus from a multilabel publication database, and cancer publications were then classified into appropriate categories directly at the publication level instead of the journal level.


Methods

Study Design

This study mainly aimed to train a deep learning–based multilabel classifier for cancer literature classification at the publication level. The overall framework of the study is illustrated in Figure 1. First, a corpus of titles and abstracts of cancer publications retrieved from the Dimensions database was divided into a training set and a testing set in a ratio of 7:3 after preprocessing. Second, to capture sufficient text features for multilabel classification, the titles and abstracts were taken separately as 2 independent layers, called the “tuple” in this study. Finally, “BERT + X” classifiers based on 5 deep learning models were trained, where X refers to “TextRNN,” “TextCNN,” “FastText,” “DPCNN,” and “DRNN.” The performance of the candidate classifiers was compared quantitatively in terms of 3 conventional metrics in order to identify the optimal model for classifying cancer literature.
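The 7:3 split can be reproduced in a few lines. The following is a minimal sketch, assuming the preprocessed corpus is held in a pandas DataFrame; the column names, toy rows, and fixed random seed are illustrative assumptions, as the study does not publish its code.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the preprocessed corpus; the real study used 70,599
# cancer publications from the Dimensions database with ICRP CT labels.
corpus = pd.DataFrame({
    "title": [
        "Breast cancer screening trends",
        "Lung cancer immunotherapy",
        "Ovarian cancer survivorship",
        "Gastric cancer epidemiology",
    ],
    "abstract": ["..."] * 4,
    "labels": [["Breast cancer"], ["Lung cancer"],
               ["Ovarian cancer"], ["Stomach cancer"]],
})

# 7:3 train/test split, as described in the study design.
train_set, test_set = train_test_split(corpus, test_size=0.3, random_state=42)
print(len(train_set), len(test_set))
```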

Figure 1. Study framework. BERT: Bidirectional Encoder Representation from Transformers; DPCNN: deep pyramid convolutional neural network; DRNN: deep recurrent neural network; TextRNN: text recurrent neural network; TextCNN: text convolutional neural network.

Data Collection and Preprocessing

Refined data with semantic and context features are the basis of deep learning model training. In this study, to train a lightweight, compatible, and highly applicable classifier for text classification, we first preprocessed the cancer literature data to extract features and sequential semantic information. The classification terminology maintained by the International Cancer Research Partnership (ICRP), called cancer type (CT), was used as the standard set of labels to characterize individual cancer studies in terms of 62 CTs. The ICRP CT has been linked to the ICD maintained by the WHO [] and is rapidly gaining recognition worldwide. Moreover, the ICRP CT has been applied in several international databases for labeling cancer literature and research documents with fine granularity. A typical example is the Dimensions database (Digital Science & Research Solutions, Inc), which covers more than 135 million publications, 6.5 million grants, and 153 million patents, providing a collaborative path to enhanced scientific discovery with transparent data sources. Importantly, cancer publications with ICRP CT–classified labels from the Dimensions database provide a way to prepare annotation data for model training.

Construction of the Corpus and Balanced Sampling

A set of 70,599 publications from 2003 to 2022 was randomly sampled from the Dimensions database using the keyword “cancer,” along with the ICRP CT labels for each publication. Figure 2 shows the distribution of different CTs among the corpus data. Here, to intuitively demonstrate the volume distribution across the 62 cancer categories, we ranked the categories in descending order of the number of corresponding publications. Categories with sample sizes larger than 500 were listed separately, while the remaining 41 (66.13%) categories were grouped into 3 classes: CT22-CT35, n=14 (34.15%) categories; CT36-CT49, n=14 (34.15%); and CT50-CT62, n=13 (31.71%). The top 9 (14.52%) categories (breast cancer, non-site-specific cancer, colon and rectal cancer, lung cancer, prostate cancer, ovarian cancer, stomach cancer, cervical cancer, and pancreatic cancer) accounted for 76.35% (53,899/70,599) of the total corpus. Clearly, the distribution of labeled data was uneven, with the top 9 labels containing more than three-fourths of the total corpus. If the original corpus were used directly to train the deep learning model without balanced sampling, the resulting classifier would overfit to the few categories with excessive volume and fail to generalize in practical use.

To avoid the degradation of precision caused by overfitting, we balanced the data sampling before model training. First, we ranked the cancer categories in descending order of the number of corresponding publications. Next, a threshold (70 in our study) used for setting the sampling index was obtained after multiple tests, followed by calculation of the index using the categories with more than 70 publications. The resulting mean, median, variance, and standard deviation of the number of samples contained in each category were as follows: mean=1127, median=282, variance=5,889,927, and standard deviation=2427. Here, the median was selected as the initial index according to the actual distribution of the sampling data. Considering that 30% of the corpus was used as the testing set, the final index was set to 500 publications to ensure a competent training set. Therefore, categories whose number of publications exceeded the final index were separately balanced and down-sampled, which means that 500 publications from each such category were randomly extracted for a uniform corpus.
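A minimal sketch of this balanced down-sampling follows, assuming one row per (publication, category) pair in a pandas DataFrame; the column name, helper name, and random seed are illustrative rather than taken from the study's code.

```python
import pandas as pd

THRESHOLD = 70     # categories below this size are excluded from the index statistics
FINAL_INDEX = 500  # per-category cap derived from the median and the 7:3 split

def balance_corpus(corpus: pd.DataFrame, seed: int = 42) -> pd.DataFrame:
    """Down-sample every category with more than FINAL_INDEX publications."""
    counts = corpus["category"].value_counts()
    eligible = counts[counts > THRESHOLD]
    # Statistics used to pick the sampling index (the paper reports
    # mean=1127, median=282, variance=5,889,927, SD=2427 on the full corpus).
    print(f"mean={eligible.mean():.0f}, median={eligible.median():.0f}, "
          f"var={eligible.var():.0f}, sd={eligible.std():.0f}")
    parts = []
    for category, n in counts.items():
        subset = corpus[corpus["category"] == category]
        if n > FINAL_INDEX:
            subset = subset.sample(n=FINAL_INDEX, random_state=seed)
        parts.append(subset)
    return pd.concat(parts, ignore_index=True)
```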

Table 1 shows part of the balanced sampling results. It is clear that the optimized corpus reduced the adverse effects of overfitting; it was subsequently used for keyword extraction and model training.

Figure 2. Original sample distribution of CTs. CT: cancer type.

Table 1. Example of balanced sampling of top 21 categories (N=70,599 publications).

Sequence number | ICRP CTa | ICRP codeb | ICD-10c code | Original sampled, n (%) | Balanced samplee, n (%)
1 | Breast cancer | 7 | C50 | 14,103 (19.98) | 500 (3.55)
2 | Non-site-specific cancer | 2 | N/Af | 12,170 (17.24) | 500 (4.11)
3 | Colon and rectal cancer | 64 | C18, C19, C20 | 6681 (9.46) | 500 (7.48)
4 | Lung cancer | 28 | C34, C45 | 6190 (8.77) | 500 (8.08)
5 | Prostate cancer | 42 | C61 | 3890 (5.51) | 500 (12.85)
6 | Ovarian cancer | 66 | C56 | 3294 (4.67) | 500 (15.18)
7 | Stomach cancer | 51 | C16 | 2783 (3.94) | 500 (17.97)
8 | Cervical cancer | 9 | C53 | 2659 (3.77) | 500 (18.80)
9 | Pancreatic cancer | 37 | C25 | 2129 (3.02) | 500 (23.49)
10 | Bladder cancer | 3 | C67 | 1197 (1.70) | 500 (41.77)
11 | Esophageal/oesophageal cancer | 12 | C15 | 1158 (1.64) | 500 (43.18)
12 | Endometrial cancer | 11 | C54 | 1131 (1.60) | 500 (44.21)
13 | Oral cavity and lip cancer | 36 | C00, C01, C02, C03, C04, C05, C06, C09 | 965 (1.37) | 500 (51.81)
14 | Liver cancer | 23 | C22 | 921 (1.30) | 500 (54.29)
15 | Thyroid cancer | 54 | C73 | 734 (1.04) | 500 (68.12)
16 | Melanoma | 29 | C43 | 702 (0.99) | 500 (71.23)
17 | Pharyngeal cancer | 61 | C14.0 | 661 (0.94) | 500 (75.64)
18 | Leukemia/leukaemia | 27 | C91, C92, C93, C94, C95 | 529 (0.75) | 500 (94.52)
19 | Laryngeal cancer | 26 | C32 | 522 (0.74) | 500 (95.79)
20 | Non-Hodgkin’s lymphoma | 35 | C82, C83, C84, C85, C96.3 | 514 (0.73) | 500 (97.28)
21 | Head and neck cancer | 21 | C76.0 | 502 (0.71) | 500 (99.60)

aICRP CT: International Cancer Research Partnership Cancer Type; here, “ICRP CT” denotes the label.

b“ICRP code” refers to the label code.

cICD-10: International Classification of Diseases, Tenth Revision; this is the ICD code linked to the appropriate ICRP CT.

d“Original sample” represents the number of publications obtained directly from the Dimensions database.

e“Balanced sample” means the number of publications after balanced sampling.

fN/A: not applicable.

Construction of a Tuple Consisting of a Title and an Abstract

The corpus of cancer publications consisted of titles and abstracts in English, and the title and abstract of each publication were saved separately for ease of use. Generally, the title is a short sentence of confined length expressing an independent meaning, which conforms closely to the content. Comparatively, an abstract is also valuable, since it clearly and accurately summarizes the main content of the publication by expressing its purpose, methods, results, and conclusions. In this study, the title and abstract were independently used to train a 2-layer classifier based on their semantic and context features, called a tuple for simplicity.

Keyword Extraction From the Abstracts of Cancer Publications

In this study, in contrast to the abstract layer, the operation of the title layer mainly focused on keyword training. However, the number of valid keywords contained in titles is quite limited. To improve feature representation and model performance, additional keywords were extracted from the abstracts and merged into the title layer. The TextRank algorithm [] was adopted for keyword extraction from the abstracts, taking advantage of the co-occurrence semantics between words in the given sentences.
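TextRank ranks candidate words with PageRank over a word co-occurrence graph. The sketch below is a minimal illustration of the idea using networkx, not the extraction pipeline used in the study; the stopword list, window size, and token filter are simplifying assumptions.

```python
import re
import networkx as nx

STOPWORDS = {"the", "of", "and", "in", "a", "an", "to", "for", "with", "is", "was", "we"}

def textrank_keywords(text: str, window: int = 4, top_k: int = 10) -> list[str]:
    """Rank words by PageRank over a sliding-window co-occurrence graph."""
    words = [w for w in re.findall(r"[a-z]+", text.lower())
             if w not in STOPWORDS and len(w) > 2]
    graph = nx.Graph()
    for i, w in enumerate(words):
        for u in words[i + 1:i + window]:  # co-occurrence within the window
            if u != w:
                graph.add_edge(w, u)
    scores = nx.pagerank(graph)  # default damping factor 0.85, as in TextRank
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

abstract = ("We evaluated trastuzumab in HER2-positive breast cancer patients "
            "and report improved survival across breast cancer cohorts.")
print(textrank_keywords(abstract, top_k=5))  # candidates merged into the title layer
```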

A lightweight classifier was desired in this study, so the length of the abstracts needed to be controlled to balance running speed, effectiveness, and the volume occupied. To ascertain the most appropriate abstract length, we analyzed the scatter plot of the abstract length distribution, as shown in Figure 3. Here, the horizontal axis represents the serial number of the publications, the vertical axis represents the length of the publication abstracts, and the red (length>512), green (256<length≤512), blue (128<length≤256), and orange (length≤128) colors denote the different abstract length ranges. Among them, the blue zone was evenly distributed and had a high proportion, the red zone had the smallest proportion, and the green and orange zones were comparable.

After statistical analysis, the maximum length of the abstracts was set to 256 characters, mainly for 3 reasons. First, 256 is the 8th power of 2, which facilitates machine processing after tuning []. Second, we wanted to retain as much valid information as possible while reducing sparsity. Third, we wanted to avoid shallow-network learning, in which layers become equivalent to follow-up layers during training (ie, vanishing gradients).

Figure 3. Scatter plot of the distribution of abstract length.

Training a Model for MLTC

Upstream Pretrained Language Models

Pretraining generally refers to gathering a large amount of inexpensively collected training data, learning the commonalities of those data, and then tuning the resulting model with a small amount of labeled data from a specific domain. Pretrained language models therefore start from the commonalities and learn the special parts of the specific task. BERT is a successful model pretrained on Wikipedia and a book corpus via self-supervision tasks, and fine-tuning it benefits downstream tasks. Being a pretrained language model based on the bidirectional transformer encoder architecture, BERT uses sentence-level negative sampling to obtain sentence representations and sentence pair relationships. In addition, BERT takes advantage of the transformer model instead of LSTM for expressive power and temporal efficiency, as well as the masked language model to extract contextual features. We used BERT to produce word vectors in the pretraining stage for the downstream natural language processing task. During fine-tuning, we then trained the pretrained BERT through the output layer on cancer publications in order to save time and improve accuracy.
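As a sketch of the encoding step, the following loads a general-domain BERT checkpoint via the Hugging Face transformers library and produces one 768-dimensional contextual word vector per token position; the checkpoint name is an assumption (the paper does not state which BERT weights were used), and the 256-token cap mirrors the abstract length chosen above.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # assumed checkpoint
encoder = AutoModel.from_pretrained("bert-base-uncased")

batch = tokenizer(
    ["Temporal trends of subsequent breast cancer among ovarian cancer survivors"],
    padding="max_length",
    truncation=True,
    max_length=256,      # matches Max_seq_length in Table 3
    return_tensors="pt",
)
with torch.no_grad():
    outputs = encoder(**batch)

# (batch, 256, 768): contextual word vectors handed to the downstream "X" model.
print(outputs.last_hidden_state.shape)
```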

Downstream Classification Model Training

To complete the actual downstream natural language processing task based on the fine-tuned upstream procedure, we trained several preliminary language models and then chose one as the optimal model for our classifier according to its all-round performance. The “BERT + X” pattern was adopted for the classifier to determine the optimal option for X and the best way to combine BERT and X. Based on the actual scenario and expert consultations, considering text length, tightness of context, and multidisciplinarity, 5 models suitable for cancer publications were compared for the classification model: TextCNN, TextRNN, FastText, DPCNN, and DRNN. The definitive combined classifier was then determined by a comprehensive performance analysis of these 5 models.

The structure of the classification model is shown in Figure 4, with the TextRNN selected as a representative of the 5 models. Here, the title and abstract were input into the title layer and the abstract layer, respectively, while the word vectors were converted by BERT and passed to the encoder layer. The word vectors output from the title and abstract layers were stitched together and then transferred to the fully connected layer for normalization, while the final output of the multilabel classification was generated by the sigmoid activation layer.
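A minimal PyTorch sketch of this two-layer (tuple) wiring for the “BERT + TextRNN” variant follows, using the hyperparameters reported in Table 3 (hidden size 256×2, Dense1=256, Dense2=62, dropout 0.5, BERT output dimension 768); the class name, checkpoint, and exact layer wiring are our reading of Figure 4, not the authors' released code.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class BertTextRnnClassifier(nn.Module):
    def __init__(self, n_labels: int = 62, bert_name: str = "bert-base-uncased"):
        super().__init__()
        self.bert = AutoModel.from_pretrained(bert_name)  # 768-dim outputs
        # One bidirectional LSTM per layer of the tuple (title, abstract).
        self.title_rnn = nn.LSTM(768, 256, bidirectional=True, batch_first=True)
        self.abstract_rnn = nn.LSTM(768, 256, bidirectional=True, batch_first=True)
        self.dropout = nn.Dropout(0.5)
        self.dense1 = nn.Linear(2 * 512, 256)   # stitched title + abstract vectors
        self.dense2 = nn.Linear(256, n_labels)  # 62 ICRP CT labels

    def _encode(self, rnn: nn.LSTM, ids, mask):
        hidden = self.bert(input_ids=ids, attention_mask=mask).last_hidden_state
        _, (h, _) = rnn(hidden)                 # final hidden states, both directions
        return torch.cat([h[0], h[1]], dim=-1)  # (batch, 512)

    def forward(self, title_ids, title_mask, abstract_ids, abstract_mask):
        title_vec = self._encode(self.title_rnn, title_ids, title_mask)
        abstract_vec = self._encode(self.abstract_rnn, abstract_ids, abstract_mask)
        joint = self.dropout(torch.cat([title_vec, abstract_vec], dim=-1))
        logits = self.dense2(torch.relu(self.dense1(joint)))
        return torch.sigmoid(logits)            # independent per-label probabilities
```

Trained with a binary cross-entropy loss (eg, nn.BCELoss), each of the 62 sigmoid outputs is learned as an independent binary decision, which is what makes the classifier multilabel rather than multiclass.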

Figure 4. Structure of the classification model. BERT: Bidirectional Encoder Representation from Transformers; TextRNN: text recurrent neural network.

Testing and Verification

To evaluate the performance of the trained classifier, a subset of testing samples covering all 62 classification labels defined in the ICRP CT was selected from the corpus. We applied 3 frequently used indexes, namely precision, recall, and the F1-score, to verify the classification results of the 5 models. Here, the F1-score is the harmonic mean of precision and recall; the larger the F1-score, the better the performance of the classification model. The quantitative indexes of the 5 “BERT + X” models were compared numerically and independently to choose X.
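These metrics can be computed directly from the multi-hot label matrices; a short sketch follows, in which the 0.5 decision threshold and micro averaging are our assumptions (the paper does not state either).

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

y_true = np.array([[1, 0, 1], [0, 1, 0]])               # gold multi-hot labels
y_prob = np.array([[0.9, 0.2, 0.7], [0.1, 0.8, 0.4]])   # sigmoid outputs
y_pred = (y_prob >= 0.5).astype(int)                     # assumed 0.5 cutoff

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="micro", zero_division=0
)
print(f"precision={precision:.2%}, recall={recall:.2%}, F1={f1:.2%}")
```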


Results

Quantitative Analysis of the Performance of Classification Models

The test results of the combined downstream classification models are shown in Table 2, where we compared the performance of 5 classification models (BERT + TextRNN, BERT + TextCNN, BERT + FastText, BERT + DPCNN, and BERT + DRNN) on 3 metrics (precision, recall, and the F1-score). All metrics of BERT + TextRNN were consistently at a high level, with a precision of 93.09%, a recall of 87.75%, and an F1-score of 90.34%. Here, BERT was directly used for fine-tuning training, combined with the TextRNN for multilabel classification. After adjusting and testing the parameters several times, the best parameters were obtained and are shown in Table 3.

Table 2. Performance comparison of 5 different “BERTa + X” models.

Model | Precision (%) | Recall (%) | F1-score (%)
BERT + TextRNNb | 93.09 | 87.75 | 90.34
BERT + TextCNNc | 84.19 | 79.69 | 81.88
BERT + FastText | 93.05 | 75.73 | 83.50
BERT + DPCNNd | 81.78 | 75.00 | 78.25
BERT + DRNNe | 88.98 | 53.05 | 66.47

aBERT: Bidirectional Encoder Representation from Transformers.

bTextRNN: text recurrent neural network.

cTextCNN: text convolutional neural network.

dDPCNN: deep pyramid convolutional neural network.

eDRNN: deep recurrent neural network.

Table 3. Parameters of the optimal “BERTa + TextRNNb” model.

Parameter | Value
Num_train_epochs | 200.0
Max_seq_length | 256
learning_rate | 0.0001
train_batch_size | 32
Predict_batch_size | 32
Drop | 0.5
Dense1 | 256
Dense2 | 62
TextRNN | 256×2
LSTMc_UNITS | 5
BERT_OUTDIM | 768

aBERT: Bidirectional Encoder Representation from Transformers.

bTextRNN: text recurrent neural network.

cLSTM: long short-term memory.

Supplementary Analysis of the Model Structure

The proposed classification model takes the title and abstract of a publication as independent inputs, namely a tuple, as mentioned in the “Methods” section. To verify the effectiveness of the tuple input of the trained model, we conducted a set of comparison experiments based on “BERT + TextRNN.” Specifically, “2 tuples and 2 levels” represents the model that took the title and abstract as 2 separate levels of the training model, “1 unit and 1 level” represents the model that combined the title and abstract as a whole-text input for training, and “the title alone” and “the abstract alone” represent the models that took the title or the abstract alone as input, respectively. Table 4 records the performance of the different models from the supplementary experiments, in which the “2 tuples and 2 levels” model was superior, with a precision of 93.09%, a recall of 87.75%, and an F1-score of 90.34%. The reason is that when the title or abstract alone is used to train the classification model, feature reduction occurs, which leads to inferior classification performance. In addition, compared with the “1 unit and 1 level” model, which takes the title and abstract as a single text input, the “2 tuples and 2 levels” model enhanced the specificity of feature extraction. Note that the title and abstract contribute differently to the subject of a publication, and the classification model loses sufficient features of the abstract if they are not trained separately. Eventually, the tuple input was selected for the proposed classification model. This is also consistent with most subject-based literature processing, such as subject indexing, which takes the title and abstract as independent texts.

In addition, the proposed classification model used the TextRank algorithm to extract keywords from the abstract and supplement the title layer with them. To demonstrate the necessity and effectiveness of this step, Figure 5 shows the numerical indexes of the classification model with and without keyword extraction and supplementation. Here, the performance of the model with TextRank keyword extraction is shown in blue, and the results of direct training using titles and abstracts separately are plotted in green. The model without keyword supplementation clearly had lower recall, precision, and F1-score, which confirms the efficiency and effectiveness of the proposed classification model from a different perspective.

Table 4. Comparison of different types of input.

Input | Precision (%) | Recall (%) | F1-score (%)
2 tuples and 2 levels | 93.09 | 87.75 | 90.34
1 unit and 1 level | 82.32 | 79.78 | 81.04
The title alone | 44.37 | 31.54 | 36.88
The abstract alone | 85.50 | 79.79 | 82.55

Figure 5. Comparison of the model with and without keyword supplementation.

Comprehensive Analysis of the Multilabel Classification

To explore whether there was a particular regularity in the distribution of multiple labels among different categories, the proportion of multilabel publications was statistically analyzed for classifier training. In total, 15,296 (21.67%) of the 70,599 publications had 2 or more labels; that is, more than one-fifth of the publications were multilabel. In addition, the categories were counted based on the number of labels (Figure 6) to visualize the multilabel distributions. Here, we list the top 20 categories by the volume of publications included. The blue color refers to the total number of samples collected under a specific category, the green color refers to the number of samples with multiple labels under that category, and the yellow color denotes those with at least 3 labels. The categories with fewer samples had a higher ratio of multiple labels, and multiple labels showed different characteristics among different categories. A deeper analysis of the relationships between multiple labels is necessary.
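The multilabel census itself reduces to counting labels per publication; the following small sketch uses toy rows standing in for the corpus (the real counts are 15,296 of 70,599).

```python
import pandas as pd

corpus = pd.DataFrame({
    "labels": [["Breast cancer"],
               ["Breast cancer", "Ovarian cancer"],
               ["Lung cancer", "Stomach cancer", "Colon and rectal cancer"]],
})
n_labels = corpus["labels"].str.len()
multilabel = int((n_labels >= 2).sum())
print(f"{multilabel}/{len(corpus)} publications carry 2+ labels "
      f"({multilabel / len(corpus):.2%})")
```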

Comprehensive analysis of multilabel publications principally plays 2 roles. First, it highlighted the process of balanced sampling. Since some of the publications belonged to multiple categories and some of the categories had a high co-occurrence frequency compared with other categories, direct model training on the original corpus would lead to overfitting due to the uneven distribution of samples. This is why we selected multilabel papers instead of those with a single label in order to obtain a balanced sample for classifier training. Second, the multilabel publications revealed the potential semantic correlation of texts, which provided a direction for the analysis of data characteristics. Based on the co-occurrence correlation and distribution between different categories, the semantic features were further characterized, and the proposed classification model can be extended to other data with the same characteristics.

To explore the inherent correlation between multiple labels, we selected 2500 multilabel publications from the corpus for characteristics analysis. Specifically, samples with 2 labels accounted for 59.04% (1476/2500) and samples with at least 3 labels accounted for 40.96% (1024/2500) of these publications. Table 5 lists part of the analysis results. For instance, different CTs often co-occur with only a weak association in the statistical surveys of literature reviews.

The correlation strength of the multiple labels of cancer publications was independently reviewed and assessed by 3 biocurators with relevant knowledge. Concretely, a publication with 2 labels and a clear semantic correlation between the corresponding subject classification labels was annotated as “1,” while a publication with 3 or more labels, more than two-thirds of which held an obvious semantic association, was also annotated as “1.” When the first 2 biocurators reached the same result, the publication was passed into the “review completed” data set. When they had different opinions, the publication was annotated as “pending review.” After the first round of reviewing, the “pending review” data set was discussed jointly in a second round, and a third biocurator was invited for confirmation and agreement.
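This two-round adjudication can be expressed as a simple routing rule; the sketch below is our simplification (a majority vote standing in for the discussion-plus-confirmation step), with “1” marking strongly correlated labels and “0” weakly correlated ones.

```python
def first_round(curator_a: int, curator_b: int) -> str:
    """Route a publication after 2 independent reviews."""
    return "review completed" if curator_a == curator_b else "pending review"

def second_round(curator_a: int, curator_b: int, curator_c: int) -> int:
    """Disagreements are discussed and confirmed by a third biocurator;
    modeled here as a majority vote (an assumption, not the paper's rule)."""
    return int(curator_a + curator_b + curator_c >= 2)

print(first_round(1, 1))      # -> review completed
print(first_round(1, 0))      # -> pending review
print(second_round(1, 0, 1))  # -> 1
```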

Figure 7 shows the specific numbers of labeled publications with interlabel correlations, where the “strong association” zone consists of publications whose co-occurring labels had explicit semantic links, the “low association” zone consists of publications whose co-occurring labels did not have clear semantic links, and the “independent examples” zone consists of publications whose cancer labels were taken as single entities or independent examples for observation without intrinsic correlations. Of the 1476 publications with 2 labels, 1201 (81.37%) had a strong association, while 572 of the 1024 (55.86%) publications with at least 3 labels had a low association. Among the publications with weakly correlated labels, 718/1024 (70.12%) took the different categories of cancers as single entities or independent examples for observation without intrinsic correlations. We noted that these association distributions may influence training. In addition, the relationship between 2 labels in a publication was stronger than that among 3 or more labels, which justifies the demand to classify publications by subject at the publication level. Therefore, the strength of interlabel association could assist decision-making after multilabel classification to further support clinical diagnosis and treatment. In the future, we will carry out knowledge mining based on the existing interlabel semantic network and strengthen the training of interlabel associations to improve the performance of the proposed classification model.

Figure 6. Comparison of samples with multilabels.

Table 5. Examples of data evaluated by experts.

Sequence number | Title | Labels, n | Correlation strengtha
1 | Temporal Trends of Subsequent Breast Cancer Among Women With Ovarian Cancer: A Population-Based Study [] | 2 | 1
2 | Clinical Characteristics and Survival Outcomes of Patients With Both Primary Breast Cancer and Primary Ovarian Cancer [] | 2 | 1
3 | Secondary Malignancies in Long-Term Ovarian Cancer Survivors: Results of the “Carolin Meets HANNA” Study [] | 2 | 1
4 | Trends in Participation Rates of the National Cancer Screening Program among Cancer Survivors in Korea [] | 3 | 0
5 | Increasing Trends in the Prevalence of Prior Cancer in Newly Diagnosed Lung, Stomach, Colorectal, Breast, Cervical, and Corpus Uterine Cancer Patients: A Population-Based Study [] | 4 | 1
6 | Cancer Registration in China and Its Role in Cancer Prevention and Control [] | 3 | 0
7 | Cancer Incidence, Mortality, and Burden in China: A Time‐Trend Analysis and Comparison With the United States and United Kingdom Based on the Global Epidemiological Data Released in 2020 [] | 5 | 0
8 | Excess Costs and Economic Burden of Obesity-Related Cancers in the United States [] | 3 | 1
9 | Cancer Attributable to Human Papillomavirus Infection in China: Burden and Trends [] | 4 | 0
10 | Excess Costs and Economic Burden of Obesity-Related Cancers in the United States [] | 3 | 1
11 | Cancer Awareness in the General Population Varies With Sex, Age and Media Coverage: A Population-Based Survey With Focus on Gynecologic Cancers [] | 5 | 0
12 | Public Attitudes Towards Cancer Survivors Among Korean Adults [] | 3 | 0
13 | Importance of Hospital Cancer Registries in Africa [] | 2 | 0
14 | Correlation Between Family History and Characteristics of Breast Cancer [] | 2 | 1
15 | Familial Aggregation of Early‐Onset Cancers [] | 3 | 0
16 | Trends in Regional Cancer Mortality in Taiwan 1992–2014 [] | 6 | 0
17 | Statin Use and Incidence and Mortality of Breast and Gynecology Cancer: A Cohort Study Using the National Health Insurance Claims Database [] | 4 | 1
18 | Management of Breast Cancer Risk in BRCA1/2 Mutation Carriers Who Are Unaffected With Cancer [] | 2 | 1
19 | Association Between Diabetes, Obesity, Aging, and Cancer: Review of Recent Literature [] | 4 | 0
20 | The European Cancer Burden in 2020: Incidence and Mortality Estimates for 40 Countries and 25 Major Cancers [] | 3 | 0

aIn the last column, “1” refers to a strong correlation, which means the labels of a given publication are semantically or syntactically linked to each other, such as through relaying, concurrency, and coupling effects. Conversely, “0” indicates a weak association between the multiple labels of a specific publication, with no obvious semantic or syntactic correlation.

Figure 7. Relational mapping of multilabel publication distributions.
Discussion

Principal Findings

There are several reasons for the “BERT + TextRNN” model showing optimal performance in cancer publication classification. First, cancer publications usually consist of long texts (eg, titles and abstracts) containing specialty terms and dense contextual semantic correlations, which suit the TextRNN model, as it is good at processing sequential information with strong correlation and a high degree of uniformity. Moreover, the comprehensive analysis of multilabel classification showed that cancer publications are characterized by a high multilabel rate in areas with low research intensity due to interdisciplinary and cooperative work, which enhances the contextual correlation to a certain extent. The “BERT + TextRNN” model is more likely to be efficient in such fields because it can effectively capture contextual semantics.

Compared with the TextRNN, the other models were insufficient and could be further improved. The TextCNN might not capture sufficient features, since it is neither highly interpretable nor well suited to address the fixed-length horizon issue. Although the DRNN is an enhanced version of the RNN, it has low computational speed and fails to consider any upcoming input to the current state; therefore, the DRNN is much less effective than the TextRNN. Being a log-linear model, FastText can hardly handle the recognition of the long texts of cancer publications and needs further optimization due to its limited recall rate.

Limitations

The proposed classifier based on the “BERT + TextRNN” model has 2 issues. On the one hand, the performance of the classifier may be reduced by the accumulation of errors caused by keyword extraction, which could be mitigated by adjusting the model parameters and adding a self-testing function. On the other hand, the tuple input of titles and abstracts was integrated to train the multilabel classifier, which proved to be better than the input of titles or abstracts alone. Therefore, cancer publications with both a title and an abstract are desired for the proposed classifier. However, for the few cancer publications without abstracts, the trained classifier will still be usable, albeit at a slight performance cost.

Major Applications of the Proposed Classifier

We trained a classifier based on the “BERT + TextRNN” model for classifying the cancer literature at the publication level, which can directly assign multiple labels to each publication. The proposed classifier has at least 2 major applications. First, the model can achieve efficient and effective multilabel classification of cancer publications at a finer granularity, not only for cancer publications in English but also for full-text literature in other languages whose titles and abstracts have English versions. Since the trained classifier is based on cancer publications with titles and abstracts, it should be suitable for any papers whose titles and abstracts are written in English (eg, Chinese medical publications). Another significant application is the fine-grained classification of scientific data on cancer research. Given that valuable data are accompanied by a brief description in English, the proposed model can classify them into the groups with the appropriate CTs. Therefore, a content-based label in terms of CTs can be assigned to scientific data and literature, which provides a way to construct a full-spectrum data foundation for precision medicine.

Conclusion

Given that existing classification methods operate at the journal level and that there is an urgent need for subject classification due to the proliferation of cancer research, a multilabel classifier was trained based on deep learning models, specifically “BERT + TextRNN.” Moreover, the proposed high-resolution classification model was evaluated as efficient and effective for cancer publications in terms of quantitative comparison and feature analysis.

The innovative exploration in this study is as follows:

- The “BERT + TextRNN” classification model was trained for classifying cancer literature at the publication level, which shows promise in automatically assigning each publication at least 1 label to which it belongs.
- The proposed model achieves high-quality multilabel classification at the publication level, which could reflect the features of cancer publications more accurately with multiple labels compared with the existing method that annotates papers with a single label at the journal level.
- Through comprehensive analysis of the correlation between multiple labels, as well as the data characteristics of multilabel cancer publications, the proposed model was verified to be suitable for literature with features such as high specialization, uniform entity nouns, and standardized long texts.

In the future, the classification model will be extended to classify medical literature on cardiovascular disease and diabetes, where a great number of highly specialized publications have accumulated and are attracting increasing research attention, in order to improve health conditions worldwide.

None declared.

Edited by A Benis; submitted 07.12.22; peer-reviewed by Z Ben-Miled, L Guo; comments to author 11.01.23; revised version received 07.03.23; accepted 06.09.23; published 05.10.23

©Ying Zhang, Xiaoying Li, Yi Liu, Aihua Li, Xuemei Yang, Xiaoli Tang. Originally published in JMIR Medical Informatics (https://medinform.jmir.org), 05.10.2023.

This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on https://medinform.jmir.org/, as well as this copyright and license information must be included.
