Extracting Clinical Information From Japanese Radiology Reports Using a 2-Stage Deep Learning Approach: Algorithm Development and Validation


Introduction

Radiology reports are important for radiologists to communicate with referring physicians. The reports include clinical information about observed structures, diagnostic possibilities, and recommendations for treatment plans. Such information is also valuable for various applications such as case retrieval, cohort building, diagnostic surveillance, and clinical decision support. However, since most radiology reports are written in a free-text format, important clinical information remains locked in the reports. This format presents major obstacles to secondary use [,]. To address this problem, a system for extracting structured information from the reports would be required.

Natural language processing (NLP) has demonstrated potential for improving the clinical workflow and reusing clinical text for various clinical applications [-]. Among the various NLP tasks, information extraction (IE) plays a central role in extracting structured information from unstructured texts. IE mainly consists of 2 steps: (1) the extraction of specified entities such as person, location, and organization from the text and (2) the extraction of semantic relations between 2 entities (eg, location_of and employee_of) [,].

Earlier IE systems mainly used heuristic methods such as dictionary-based approaches and regular expressions [-]. To extract clinical information from radiology reports, the Medical Language Extraction and Encoding system [] and Radiology Analysis tool [] have been developed. To detect clinical terms, these systems mainly use predefined dictionaries such as the Unified Medical Language System [] and their customized dictionaries and apply some grammatical rules to present them in a structured format.

The major issues of these systems include the lack of coverage and scalability []. A dictionary-based system often fails to detect clinical terms such as misspelled words, abbreviations, and nonstandard terminologies. Building exhaustive dictionaries to enhance the coverage and maintaining them are highly labor-intensive. It is also challenging to apply complicated grammar rules according to the context of the reports. In addition, IE systems based on dictionaries and grammar rules are highly language dependent and do not scale to other languages. The Medical Language Extraction and Encoding system and Radiology Analysis tool only cover English clinical texts and cannot handle non-English clinical texts. Languages other than English, including Japanese, do not have sufficient clinical resources such as the Unified Medical Language System. This has been a major obstacle in developing clinical NLP systems in countries where English is not the official language [].

Recently, machine learning approaches have been widely accepted in clinical NLP systems [,]. Hassanpour and Langlotz [] used a conditional random field (CRF) [] for extracting clinical information from computed tomography (CT) reports. They showed that their machine learning model had a superior ability compared to the dictionary-based systems.

Deep learning approaches have drawn a great deal of attention in more recent studies. Cornegruta et al [] built a bidirectional long short-term memory (BiLSTM) model [] to extract clinical terms from chest x-ray reports. Miao et al [] built a BiLSTM model to handle Chinese radiology reports. Both studies reported that deep learning approaches yielded better results than dictionary-based approaches.

Various state-of-the-art deep learning models have been applied to extract named entities [,,]. Clinical tasks such as concept extraction can be achieved through extracting named entities alone, whereas a relation extraction step is needed to obtain structured information about concepts and their attributes [,]. Extracting comprehensive information in a structured format is desirable when developing a complex system.

Xie et al [] developed a 2-stage IE system for processing chest CT reports. They exploited a hybrid approach involving deep learning to extract named entities and a rule-based method to organize the detected entities in a structured format. They reported that their deep learning model achieved better performance, whereas the rule-based structuring approach degraded the overall performance, since the rule-based approach could not capture the contextual relations in the reports. Jain et al [] developed RadGraph, an end-to-end deep learning system for structuring chest x-ray reports. They reported that their schema had a higher report coverage in their corpus.

In this study, we developed a 2-stage deep learning system for extracting clinical information from CT reports. For secondary use of radiology reports, we believe that our system has several advantages compared with recent related works [,,,,]. First, our 2-stage NLP system can represent clinical information in a structured format, which is challenging when using an entity extraction approach alone. Second, although a rule-based approach struggled to extract relations between entities in the reports [], leveraging state-of-the-art deep learning models leads to superior performance. Third, previous studies [,,] have combined clinical information about factual observations and radiologists' interpretations into a single entity, even though they have different semantic roles in the context. In our information model, distinct entity types are defined according to the context, which allows detailed clinical information in the reports to be captured; to structure the reports more appropriately, we defined distinct entities for these 2 kinds of clinical information.

The rest of this paper is organized as follows. First, an information model was built, mainly comprising observation entities, clinical finding entities, and their modifier entities. Second, a data set was created using in-house CT reports annotated by medical experts. Third, state-of-the-art deep learning models were trained and evaluated to extract the clinical entities and relations. The entire performance of our 2-stage system was also evaluated. Finally, we evaluated the coverage of the clinical information in the CT reports using our information model.

The development of the information model was already reported in our previous study []. However, the previous study only focused on extracting entities and did not cover extracting relations between the entities. This study developed a 2-stage system containing entity extraction and relation extraction modules. Furthermore, although the previous study only used chest CT reports, a data set using abdomen CT reports was created in this study to validate the generalizability of our information model and 2-stage system.


Methods

Our Information Model

An information model was built for extracting comprehensive clinical information from free-text radiology reports. Our information model contained observation entities, clinical finding entities, and modifier entities. Observation entities are specific terms representing observed abnormal features such as “nodule” or “pleural effusion.” Clinical finding entities encompass terms such as “cancer,” including diagnoses given by the radiologists based on the observation entities. Modifier entities are subdivided into the following entities: anatomical location, certainty, change, characteristics, and size. Thus, 7 entity types were defined in our information model. A detailed description of our information model is provided in our previous study [].

Furthermore, modifier and evidence relations between entities were defined. A modifier relation is derived from an observation or a clinical finding entity and a modifier entity. This relation type gives clinical information, such as the anatomical location of the observations and the characteristics of the clinical findings. An evidence relation is derived from an observation entity and a clinical finding entity. This relation is also clinically meaningful in capturing the diagnostic process of the radiologist. Report examples of entities and relations are shown in Figure 1.
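For illustration only, the entities and relations in our information model can be encoded with simple data structures. The sketch below is a minimal encoding with hypothetical class names and type abbreviations (only the observation and anatomical location tags reappear later in the paper); it is not our actual implementation.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Tuple

class EntityType(Enum):
    OBSERVATION = "OBS"            # observed abnormal feature, eg, "nodule"
    CLINICAL_FINDING = "CF"        # diagnosis, eg, "cancer"
    ANATOMICAL_LOCATION = "AE"     # modifier entities below
    CERTAINTY = "CE"
    CHANGE = "CHE"
    CHARACTERISTICS = "CHA"
    SIZE = "SE"

class RelationType(Enum):
    MODIFIER = "modifier"          # observation/clinical finding -> modifier
    EVIDENCE = "evidence"          # observation -> clinical finding

@dataclass
class Entity:
    entity_type: EntityType
    text: str
    span: Tuple[int, int]          # character offsets in the report

@dataclass
class Relation:
    relation_type: RelationType
    head: Entity
    tail: Entity

# "A 3 cm nodule is in the right upper lobe."
nodule = Entity(EntityType.OBSERVATION, "nodule", (7, 13))
size = Entity(EntityType.SIZE, "3 cm", (2, 6))
lobe = Entity(EntityType.ANATOMICAL_LOCATION, "right upper lobe", (24, 40))
relations: List[Relation] = [
    Relation(RelationType.MODIFIER, nodule, size),
    Relation(RelationType.MODIFIER, nodule, lobe),
]
```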

Figure 1. Report examples of entities and relations. IPMN: intraductal papillary mucinous neoplasm.

Data Set

Radiology reports from 2010 to 2021 that were stored in the radiology information system at Osaka University Hospital, Japan, were used. They consisted of 912,505 reports written in Japanese. To create a gold standard data set, 540 chest CT reports and 500 abdomen CT reports were randomly extracted. The remaining unannotated reports (911,465 reports) were used to pretrain the model.

Ethical Considerations

This study was performed in accordance with the World Medical Association Declaration of Helsinki, and the study protocol was approved by the institutional review board of the Osaka University Hospital (permission 19276). Only anonymized data were used in this study, and we did not have access to information that could identify individual participants during the study.

Annotation Scheme

Overall, 3 medical experts (2 clinicians and 1 radiological technologist) performed the annotation process. The gold standard data sets of chest and abdomen CT reports were developed by different annotation methods.

For the chest CT reports, the data set that was developed in our previous study was leveraged []. After making minor adjustments for entities, the relation types between entities were newly annotated by 2 clinicians. Following a guideline describing the rules and annotation examples, they independently annotated each report. Disagreements between the annotators were resolved by discussion. The interannotator agreement (IAA) score for the entities was 91%, as reported in our previous study []. To calculate the IAA score for the relations, we used Cohen κ [], resulting in an IAA score of 81%. Both IAA scores indicated very high agreement [].
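As an aside on the agreement metric, Cohen κ can be computed directly from the 2 annotators' binary decisions over the candidate relations. A minimal sketch using scikit-learn with made-up labels is shown below.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical per-candidate labels from the 2 annotators
# (1 = relation present, 0 = relation absent).
annotator_a = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
annotator_b = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen kappa: {kappa:.2f}")
```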

For the abdomen CT reports, to reduce the burden of the annotation work, a deep learning model trained on the chest CT reports was implemented to preannotate the entities and relations in the reports. Annotators were provided with the preannotated reports, and they modified the result according to the guidelines. We did not compute IAA scores for the abdomen data set because it was preannotated by the deep learning model.

All entities and relations were annotated using BRAT (Stenetorp et al []). The numbers of annotated entities and relations are shown in.

Our 2-Stage System

Overview

An overview of our 2-stage system is shown in Figure 2. The system pipeline mainly consists of 2 deep learning modules. In the first step, our module extracts the clinical entities in the radiology reports according to the predefined information model. The extracted entities are fed into the subsequent module. In the second step, the relations between clinical entities are extracted. The details of each module are described in the subsequent sections.

Figure 2. Overview of our 2-stage deep learning system.

Entity Extraction

According to the predefined information model, this module extracts clinical entities from a report. Named entity recognition (NER) [] is well suited for this task. As a preprocessing pipeline, the report was segmented into sentences using regular expressions, and each sentence was tokenized with MeCab (Kyoto University Graduate School of Informatics and Nippon Telegraph and Telephone Corporation’s Communication Science Research Institute) []. Then, a sequence of tokens was fed into the model. To represent the spans of specified entities, the IOB2 format [], a widely used tagging format in NER tasks, was used. In this format, the B and I tags represent the beginning and inside of an entity, respectively, and the O tag represents the outside of an entity. A tagging example is illustrated in Figure 3.
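To make the preprocessing concrete, the sketch below shows sentence segmentation with a regular expression, tokenization with MeCab (assuming the mecab-python3 binding with an installed dictionary), and an IOB2-tagged English example; the regular expression and tag names are illustrative, not our exact implementation.

```python
import re
import MeCab  # mecab-python3 binding (assumed); requires a MeCab dictionary

tagger = MeCab.Tagger("-Owakati")  # wakati mode: whitespace-separated surface forms

def split_sentences(report):
    # Split after Japanese or ASCII sentence-ending punctuation (illustrative rule).
    return [s for s in re.split(r"(?<=[。．.!?])\s*", report) if s]

def tokenize(sentence):
    return tagger.parse(sentence).strip().split()

# IOB2 tagging example for "A 3 cm nodule is in the right upper lobe ."
tokens = ["A", "3", "cm", "nodule", "is", "in", "the", "right", "upper", "lobe", "."]
tags = ["O", "B-Size", "I-Size", "B-Observation", "O", "O", "O",
        "B-Anatomical_location", "I-Anatomical_location", "I-Anatomical_location", "O"]
```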

Figure 3. An illustration of the entity extraction module. BERT: Bidirectional Encoder Representations from Transformers; BiLSTM: bidirectional long short-term memory; CRF: conditional random field.

State-of-the-art deep learning models for NER—BiLSTM-CRF [], BERT [], and BERT-CRF—were compared.

Relation Extraction

Following the implementation of the entity extraction module, reports with clinical entities were obtained. As a preprocessing step for relation extraction, the original text of the report was reconstructed by concatenating the sentences from beginning to end. This allowed relations to be extracted across multiple sentences in a report. Next, the pipeline generated the possible candidate relations for each relation type in a report (see Figure 4). Then, this module solved a binary classification problem to determine the existence of a relation for each candidate.
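A minimal sketch of the candidate generation step is shown below; the entity representation and pairing rules are simplified assumptions. Each observation or clinical finding entity is paired with each modifier entity to form modifier-relation candidates, and each observation is paired with each clinical finding to form evidence-relation candidates.

```python
from itertools import product

MODIFIER_TYPES = {"Anatomical_location", "Certainty", "Change", "Characteristics", "Size"}

def generate_candidates(entities):
    """entities: list of dicts such as {"type": "Observation", "text": "nodule"}."""
    heads = [e for e in entities if e["type"] in ("Observation", "Clinical_finding")]
    modifiers = [e for e in entities if e["type"] in MODIFIER_TYPES]
    observations = [e for e in entities if e["type"] == "Observation"]
    findings = [e for e in entities if e["type"] == "Clinical_finding"]

    candidates = []
    # Modifier-relation candidates: (observation or clinical finding, modifier) pairs.
    for head, modifier in product(heads, modifiers):
        candidates.append({"relation": "modifier", "head": head, "tail": modifier})
    # Evidence-relation candidates: (observation, clinical finding) pairs.
    for observation, finding in product(observations, findings):
        candidates.append({"relation": "evidence", "head": observation, "tail": finding})
    return candidates

# 2 observations and 3 modifiers yield 6 modifier-relation candidates, as in Figure 4.
```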

Figure 4. Example of instances generated for relation extraction. In this case, 6 candidate relations were generated from 2 observations and 3 modifiers. CT: computed tomography.

Next, we explain how we represented each relation candidate as a fixed-length sequence. Previous studies have introduced a method that adds position indicator tokens to the input sequence to mark the entity spans of the pair [,]. We expanded this method to also represent the entity types. These position indicator tokens are referred to as “entity span tokens.” For example, the input sequence representing the relation between an observation entity and an anatomical location modifier entity was represented as follows: “A 3 cm <OBS> nodule </OBS> is in the <AE> right upper lobe </AE>.” Here, “<OBS>,” “</OBS>,” “<AE>,” and “</AE>” are entity span tokens. The possible entity span tokens were appended to the vocabulary, and thus each entity span token was treated as a single token. The input sequence containing the 4 entity span tokens was fed into the model. A classification example is illustrated in Figure 5. All generated relation candidates were transformed into fixed-length sequences and fed into the model.
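The sketch below illustrates one way the entity span tokens can be inserted around a candidate pair before classification; the insertion logic and helper name are assumptions, while the <OBS> and <AE> tokens follow the example above.

```python
def add_entity_span_tokens(tokens, head_span, tail_span, head_tag, tail_tag):
    """Wrap the two entities of a candidate pair with opening/closing span tokens.

    head_span and tail_span are (start, end) token indices with exclusive ends;
    head_tag and tail_tag are entity type abbreviations such as "OBS" and "AE".
    """
    inserts = [
        (head_span[0], f"<{head_tag}>"), (head_span[1], f"</{head_tag}>"),
        (tail_span[0], f"<{tail_tag}>"), (tail_span[1], f"</{tail_tag}>"),
    ]
    out = list(tokens)
    # Insert from the rightmost position so earlier indices stay valid.
    for position, token in sorted(inserts, reverse=True):
        out.insert(position, token)
    return out

tokens = "A 3 cm nodule is in the right upper lobe .".split()
# "nodule" is token 3; "right upper lobe" spans tokens 7-9.
print(" ".join(add_entity_span_tokens(tokens, (3, 4), (7, 10), "OBS", "AE")))
# -> A 3 cm <OBS> nodule </OBS> is in the <AE> right upper lobe </AE> .
```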

Figure 5. An illustration of the relation extraction module. BERT: Bidirectional Encoder Representations from Transformers; BiLSTM: bidirectional long short-term memory.

The BiLSTM attention model [] and the BERT model were compared. For the BiLSTM attention model, the output vector representation for classification was obtained from the weighted sum of the sequence vector representations. For the BERT model, the representation of the first “[CLS]” token was used for classification, which is the straightforward approach to sequence classification tasks introduced with the original BERT.
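A minimal sketch of the BERT-based relation classifier using the Hugging Face transformers library is shown below. The model name and the use of AutoModelForSequenceClassification are assumptions about the setup rather than our exact implementation; the Japanese tokenizer additionally requires the fugashi and ipadic packages.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "cl-tohoku/bert-base-japanese-whole-word-masking"  # assumed pretrained model
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Register the entity span tokens so that each one is treated as a single token.
span_tokens = ["<OBS>", "</OBS>", "<AE>", "</AE>"]  # plus tokens for the other entity types
tokenizer.add_special_tokens({"additional_special_tokens": span_tokens})

model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
model.resize_token_embeddings(len(tokenizer))

text = "A 3 cm <OBS> nodule </OBS> is in the <AE> right upper lobe </AE>."
inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits  # classification head on the [CLS] representation
print(logits.softmax(dim=-1))        # probabilities of [no relation, relation]
```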

Experimental Settings

Data Set Splitting

A total of 540 annotated chest CT reports were divided into 3 groups: 378 reports for training, 54 reports for development, and 108 reports for testing. Similarly, a total of 500 annotated abdomen CT reports were divided into 3 groups: 350 reports for training, 50 reports for development, and 100 reports for testing. In total, 728 reports for training, 94 reports for development, and 208 reports for testing were prepared.

Parameter Optimization

For the BiLSTM-CRF model, minibatch stochastic gradient descent with momentum was used, and the initial learning rate and momentum were set to 0.1 and 0.9, respectively. The learning rate was reduced when the F1-score on the development data set stopped improving. Learning rate decay and gradient clipping of 5.0 were used. Dropout [] was applied to both the input and output vectors of the BiLSTM model. A batch size of 16, a dropout rate of 0.1, a word embedding dimension of 100, and a hidden layer dimension of 512 were chosen. For the BERT model, BERT-Base was used, which has 12 layers of transformer blocks, 768 hidden units, and 12 self-attention heads. The model was fine-tuned with an initial learning rate of 5 × 10^–5, a batch size of 16, and 10 training epochs. The best hyperparameter setting was chosen using the development data set.
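For concreteness, the training configuration described above roughly corresponds to the following PyTorch setup. This is a sketch only: an nn.LSTM stands in for the BiLSTM-CRF model, and the summed output stands in for the CRF negative log-likelihood loss.

```python
import torch
import torch.nn as nn

# Placeholder standing in for the BiLSTM-CRF model (sketch only).
model = nn.LSTM(input_size=100, hidden_size=512, bidirectional=True, batch_first=True)

# Minibatch SGD with momentum; initial learning rate 0.1, momentum 0.9.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# Reduce the learning rate when the development-set F1-score stops improving.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="max")

# One illustrative training step (batch size 16, word embedding dimension 100).
x = torch.randn(16, 32, 100)
output, _ = model(nn.Dropout(p=0.1)(x))   # dropout of 0.1 on the input vectors
loss = output.sum()                       # stand-in for the CRF negative log-likelihood
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)  # gradient clipping of 5.0
optimizer.step()
scheduler.step(metrics=0.95)              # would be the development-set F1-score
```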

Domain Adaptation

Previous studies have reported that pretraining on domain corpora improves model performance on various downstream tasks [,,]. However, some studies have pointed out that domain adaptation (DA) can degrade model performance due to forgetting of general domain knowledge [,]. To validate the effect of DA in our experiments, we evaluated the model performance with and without DA for both the entity extraction and relation extraction models.

For pretraining the word embeddings of the BiLSTM model with the general domain, Japanese Wikipedia articles [] (12 million sentences) were used. For pretraining the word embeddings of the BiLSTM model with DA, 911,465 in-house radiology reports were used. We used word2vec (Mikolov et al []) for both tasks of pretraining the word embeddings.
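A sketch of the word embedding pretraining with gensim's word2vec is shown below; the 2 example sentences are placeholders for the MeCab-tokenized pretraining corpus, and the 100-dimensional setting matches the word embedding dimension given above.

```python
from gensim.models import Word2Vec

# Each element is a tokenized sentence from the pretraining corpus
# (Wikipedia for the general-domain setting, in-house reports for DA).
sentences = [
    ["右", "上葉", "に", "結節", "を", "認める"],   # "A nodule is seen in the right upper lobe"
    ["胸水", "は", "認め", "ない"],                 # "No pleural effusion is seen"
]

model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)
vector = model.wv["結節"]  # 100-dimensional embedding used to initialize the BiLSTM
```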

For the BERT model, the publicly available pretrained Japanese BERT (Tohoku NLP Group, Tohoku University) [] was used for initialization. This model was pretrained on Japanese Wikipedia articles. The BERT-Base subword tokenization model pretrained with whole word masking was chosen. For DA, we continued pretraining on the 911,465 in-house radiology reports for approximately 100,000 steps with a batch size of 256.
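Continued pretraining on the in-house reports amounts to running masked language modeling over the domain corpus. A condensed sketch using the transformers Trainer API is shown below; the file path, model name, and masking probability are placeholders, not our exact configuration.

```python
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "cl-tohoku/bert-base-japanese-whole-word-masking"  # assumed base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# One report (or sentence) per line in a plain-text file (placeholder path).
dataset = load_dataset("text", data_files={"train": "inhouse_reports.txt"})["train"]
dataset = dataset.map(lambda x: tokenizer(x["text"], truncation=True, max_length=512),
                      batched=True, remove_columns=["text"])

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
args = TrainingArguments(output_dir="bert-radiology-da",
                         per_device_train_batch_size=256,  # batch size of 256
                         max_steps=100_000)                # approximately 100,000 steps
Trainer(model=model, args=args, train_dataset=dataset, data_collator=collator).train()
```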

Evaluation Metrics

To validate the capability of our system, we conducted 2 experiments. First, the performances of the deep learning modules were calculated. In this experiment, the mean scores were obtained over 5 runs with different parameter initializations to mitigate the effects of the random seed. For both the entity extraction and relation extraction, the F1-score was used for evaluation. For the entity extraction, the entity-level F1-score was used as the evaluation metric, and the results were aggregated by microaveraging. Second, to validate that our information model encompassed the clinical information in the reports, we measured the coverage with the following formula:

Coverage = (number of B-tagged tokens + number of I-tagged tokens) / (total number of tokens)

where B-tagged and I-tagged tokens were annotated as entities represented in the IOB2 format [], and O-tagged tokens outside entities were not annotated. Following the scope definition of our information model, sentences that only contained information about the technique of the imaging test, the surgical procedures of the patients, and recommendations were excluded. Punctuation and stop words were also excluded from the calculation. The list of stop words is presented in.
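A small sketch of the coverage computation over IOB2-tagged tokens is shown below; the stop word set and the punctuation check are simplified assumptions, and the input is restricted to in-scope sentences.

```python
STOP_WORDS = {"の", "に", "は", "を", "が"}  # illustrative subset, not the full list

def coverage(tagged_sentences):
    """tagged_sentences: in-scope sentences, each a list of (token, IOB2 tag) pairs."""
    annotated, total = 0, 0
    for sentence in tagged_sentences:
        for token, tag in sentence:
            if token in STOP_WORDS or all(ch in "、。，．,.:;()（）" for ch in token):
                continue  # exclude stop words and punctuation
            total += 1
            if tag.startswith(("B-", "I-")):
                annotated += 1
    return annotated / total

example = [[("結節", "B-Observation"), ("を", "O"), ("認める", "O"), ("。", "O")]]
print(coverage(example))  # 0.5: "結節" is annotated, "認める" is not; "を" and "。" are excluded
```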


Results

Entity Extraction

Table 1 shows the performance metrics of the entity extraction models. The BiLSTM-CRF model with DA achieved a microaveraged F1-score of 96.1%, the best performance among all the microaveraged scores in our experiments. For the BERT model, concatenating a CRF layer to the output of BERT improved the mean F1-scores both with and without DA. Given that the BiLSTM-CRF model with DA yielded the highest mean F1-score, it was used as the entity extraction module of our system in the remaining experiments.

Table 1. Comparison of entity extraction models using mean F1-scores (%).

Model         Without DA    With DA
BiLSTM-CRF    95.2          96.1
BERT          94.8          95.2
BERT-CRF      95.1          95.4

a DA: domain adaptation.
b BiLSTM: bidirectional long short-term memory.
c The best performance is italicized.
d BERT: Bidirectional Encoder Representations from Transformers.
e CRF: conditional random field.

The detailed performance of the BiLSTM-CRF model with DA is shown in Table 2. In the test set containing both chest and abdomen reports, the F1-scores of the observation, clinical finding, anatomical location modifier, certainty modifier, and size modifier entities were over 95%, whereas the change modifier and characteristics modifier entities had lower F1-scores than the other entities. Table 2 also shows that the test set of abdomen reports had a 0.5% higher F1-score than that of the chest reports. On the test set of abdomen reports, the clinical finding and change modifier entities achieved better F1-scores than on the chest reports, with increases of 2.9% and 2.5%, respectively. Conversely, the observation and characteristics modifier entities obtained better F1-scores on the test set of chest reports than on the abdomen reports, with increases of 1.0% and 2.6%, respectively.

Table 2. Comparison of the results of the entity extraction model for the test sets of chest and abdomen reports. Values are F1-scores (%).

Entity type                     Chest reports    Abdomen reports    Chest and abdomen reports
Observation                     96.1             95.1               95.6
Clinical finding                94.2             97.1               96.1
Anatomical location modifier    96.3             96.3               96.3
Certainty modifier              98.6             99.1               98.9
Change modifier                 90.5             93.0               91.5
Characteristics modifier        89.5             86.9               88.5
Size modifier                   98.7             98.7               98.7
Microaverage                    95.8             96.3               96.1

Relation Extraction

The performances of the relation extraction models were compared. In this experiment, to focus on evaluating the relation extraction module, human-annotated entities were used as the input of each model. Table 3 shows the comparison of the performance of the relation extraction models. A microaveraged F1-score of 95.6% was achieved by the BiLSTM attention model with DA and 97.6% by the BERT model with DA, indicating that both classification models could achieve satisfactory performance for relation extraction. Pretraining with domain corpora improved the performance of both relation extraction models. In contrast to the experimental results for the entity extraction models, the BERT model outperformed the BiLSTM attention model, by 2.0% in the F1-score.

Table 3. Microaveraged F1-scores (%) of the relation extraction models.

Model     Without DA    With DA
BiLSTM    95.5          95.6
BERT      97.2          97.6

a DA: domain adaptation.
b BiLSTM: bidirectional long short-term memory.
c The best performance is italicized.
d BERT: Bidirectional Encoder Representations from Transformers.

The performance difference between the chest and abdomen reports was also compared (Table 4). The F1-scores of the modifier relations were almost the same for the chest and abdomen reports, whereas the F1-score of the evidence relation was 6.3% lower for the abdomen reports than for the chest reports.

Table 4. Comparison of the results of the relation extraction model for the test sets of chest and abdomen reports. Values are F1-scores (%).

Relation type and entity type    Chest reports    Abdomen reports    Chest and abdomen reports
Modifier relation
  Anatomical location            97.9             97.6               97.6
  Certainty                      99.4             99.5               99.4
  Change                         95.4             95.0               95.1
  Characteristics                95.1             96.5               95.7
  Size                           99.1             98.0               98.8
Evidence relation
  Clinical finding               96.7             90.4               94.9
Microaverage                     97.7             97.4               97.6

Our 2-Stage System

To evaluate the performance of the entire pipeline of our system, the relation extraction module was evaluated using the output of the entity extraction module. Based on the experimental results above, the BiLSTM-CRF and BERT models were used as the entity extraction and relation extraction models, respectively. Table 5 shows that the 2-stage system obtained an overall F1-score of 91.9%. The overall F1-score was 5.7% lower than the result obtained using the human-annotated entities shown in Table 4. This decrease is reasonable, since the misclassifications of the entity extraction module are fed into the relation extraction model in this experiment.

Table 5. The F1-scores (%) of our 2-stage system.

Relation type and entity type    2-stage system
Modifier relation
  Anatomical location            92.8
  Certainty                      96.3
  Change                         81.4
  Characteristics                84.7
  Size                           94.6
Evidence relation
  Clinical finding               87.1
Microaverage                     91.9

Coverage of Clinical Entities

The test set of reports contained an average of 11.9 sentences per report. An average of 1.0 (8.4%) of these 11.9 sentences, concerning the technique of the imaging test, the surgical procedures of the patients, and recommendations, was excluded from the calculation. Table 6 shows the coverage of clinical entities with our information model. The coverage of the clinical entities across the entire sequence was 70.2% (7050/10,036). When punctuation and stop words were excluded from the sequences, 96.2% (6595/6853) of tokens were annotated.

Table 6. Coverage of the clinical entities with our information model.

Token scope                           Annotated tokens, n/N (%)
Entire sequence                       7050/10,036 (70.2)
Without punctuation and stop words    6595/6853 (96.2)

Error Analysis

A quantitative error analysis was performed to better understand our 2-stage system. For the entity extraction module, we found that entity mentions that rarely occurred in our corpus were likely to be missed. To evaluate this empirically, 2 additional test sets were used:

- Major test set: entity mentions that occurred multiple times in the training set
- Minor test set: entity mentions that occurred only once or did not occur in the training set
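For illustration, the 2 additional test sets can be constructed by counting how often each gold standard entity mention appears in the training set; the sketch below uses a simplified representation in which each entity is reduced to its surface mention string.

```python
from collections import Counter

def split_major_minor(train_mentions, test_mentions):
    """Split test entity mentions by their frequency in the training set."""
    train_counts = Counter(train_mentions)
    major = [m for m in test_mentions if train_counts[m] >= 2]  # occurred multiple times
    minor = [m for m in test_mentions if train_counts[m] <= 1]  # occurred once or never
    return major, minor

train = ["nodule", "nodule", "pleural effusion", "biloma"]
test = ["nodule", "biloma", "whirlpool sign"]
major, minor = split_major_minor(train, test)
# major -> ["nodule"]; minor -> ["biloma", "whirlpool sign"]
```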

Table 7 shows the comparison of the results on the major and minor test sets with those on the original test set (Table 2). On the major test set, the F1-score over all entities improved by 2.1% (from 96.1% to 98.2%). This increase was also observed for the individual entities, except for the size modifier entity. However, the F1-score over all entities decreased markedly, by 9%, on the minor test set. This was expected, as the deep learning model struggled to predict samples that were rare or unseen in the training set. Another reason for this difference may be the difficulty of determining the appropriate entities for the minor mentions; we observed that annotation disagreements during the adjudication process occurred more frequently for the minor mentions than for the major mentions. Interestingly, we found that the size modifier was robust to the minor entity mentions. The simplicity of these entity mentions, such as “5 cm” and “30×14 mm,” may have contributed to this result. Our analysis shows that the entity extraction module could accurately extract entity mentions that were frequent in the training set; however, there remains much room for improvement for rare or unseen terms.

Table 7. Error analysis. Values are F1-scores (%); differences are relative to the original test set.

Entity type                     Original test set    Major test set    Difference    Minor test set    Difference
Observation                     95.6                 97.9              +2.3          82.0              –13.6
Clinical finding                96.1                 97.9              +1.9          87.8              –8.2
Anatomical location modifier    96.3                 98.7              +2.4          89.6              –6.7
Certainty modifier              98.9                 99.3              +0.4          80.5              –18.4
Change modifier                 91.5                 93.5              +2.0          89.0              –2.5
Characteristics modifier        88.5                 95.5              +7.1          61.5              –26.9
Size modifier                   98.7                 98.4              –0.3          98.2              –0.6
Microaverage                    96.1                 98.2              +2.1          87.1              –9.0

To decrease the ratio of rare or unseen terms in the test set, more samples would be required in the training set. However, it is inefficient to sample reports randomly to improve the overall performance. For an efficient sampling strategy, active learning [,] may be a promising approach that allows for the selective sampling of reports in the current module.

The performance of the entity extraction and relation extraction modules was compared between the test sets of chest and abdomen reports. For the entity extraction, the F1-score of the clinical finding entities on the test set of abdomen reports was 2.9% better than that on the chest reports. In the abdomen reports, expressions such as “肝臓 : n.p. (Liver: n.p.)” were often used when there were no particular findings for a specific organ. This simple expression, “n.p.,” constituted 66.2% of the clinical finding entities in the test set of abdomen reports, which substantially impacted the performance.

The relation extraction module demonstrated excellent overall performance on the test sets of both the chest and abdomen reports. However, the F1-score for the evidence relation between the observation and clinical finding entities was 6.3% lower on the test set of abdomen reports than on that of the chest reports. We found a few examples where the observations and clinical findings were clinically related but where we could not determine whether the observation was the diagnostic basis for the finding. The first example shown in Figure 6 indicates that the “whirlpool sign” was the observation for the diagnostic basis of an “intestinal obstruction (イレウス),” whereas no observation was found for the diagnostic basis of an “intestinal obstruction (イレウス).” Even though a “whirlpool sign” is clinically related to an “intestinal obstruction (イレウス),” the evidence relation cannot be derived from this example; however, our model misclassified it as a positive example of the evidence relation. In the second example, the annotators did not assign the evidence relation between “air” and “biloma,” since they considered that the “air” had already disappeared. However, after discussion, we considered that the clinical finding of “biloma” was actually derived from the evidence of an unchanged “low density area (低吸収域)” and the disappeared “air.” Thus, the model prediction was preferable to the gold standard. To derive the diagnostic basis, it is preferable to consider information about the observation and its modifying entities.

Figure 6. Misclassification examples of the relation extraction model (blue highlighted relations are examples of false positives).
Discussion

Principal Findings

Table 2 shows the performance of the entity extraction model, which yielded a microaveraged F1-score of 96.1%. The F1-scores of the observation entity and the clinical finding entity were 95.6% and 96.1%, respectively. These superior performances are desirable for our system, since the observation and clinical finding entities are the principal components of our information model. Moreover, Table 4 shows that the modifier relation with the certainty entity also had superior performance. These results suggest that our system will be applicable to practical secondary uses, such as a query-based case retrieval system []. However, to reuse radiology reports for various clinical applications, improvements in extracting the change modifier and characteristics modifier would also be required.

BiLSTM Versus BERT

Table 1 shows that the BiLSTM-based model achieved better performance than the BERT-based models in the entity extraction task, whereas Table 3 shows that the BERT-based model outperformed the BiLSTM-based model in the relation extraction task. We consider that these differences between entity and relation extraction might be due to their task characteristics. Local neighborhood information and the representation of the token itself are considered important in the entity extraction task, whereas more global contextual information is required in the relation extraction task, especially for long-distance relations. Due to their attention mechanism, BERT and other transformer-based models are capable of learning long-range dependencies [], which probably contributed to the superiority of the BERT model in the relation extraction task.

DA Performance

Tables 1 and 3 show the comparison of the model performances with and without DA for each task. These results indicate that DA is beneficial for performance improvement, regardless of the model architecture. Since our system focuses on extracting information from radiology reports, we consider the problem of forgetting general domain knowledge to be outside the scope of this study.

Coverage of Clinical Entities

The coverage of the clinical entities with our information model was calculated. Sentences about the technique of the imaging test, the surgical procedures of the patients, and recommendations were excluded from the calculation, as such information was outside the scope of our information model. Punctuation and stop words were also excluded. A total of 96.2% (6595/6853) of tokens were annotated, which indicates that our information model covered most of the clinical information in the reports.

Limitations

This study has a limitation in terms of generalizability, since we only used a data set from a single institution for evaluation. Data sets from outside our institution would be needed to ensure generalizability. In addition, although we validated the capability of our system using only chest and abdomen CT reports, fine-tuning the deep learning models with reports for other body parts and modalities would be required for various secondary uses.

Furthermore, we are aware that there is still a gap to bridge before radiology reports can be reused for various applications. As reports usually contain misspellings, abbreviations, and nonstandard terminologies, we believe that term normalization techniques [,] will be needed for clinical applications.

Conclusions

This study developed a 2-stage system to extract structured clinical information from radiology reports. First, we developed an information model and annotated in-house chest and abdomen CT reports. Second, we trained and evaluated the performance of 2 deep learning modules. The microaveraged F1-scores of our best model for entity extraction and relation extraction were 96.1% and 97.4%, respectively. The entire pipeline of our system achieved a microaveraged F1-score of 91.9%. Finally, we measured the ratio of annotated entities in the reports. The coverage of the clinical information in the reports was 96.2% (6595/6853). To reuse radiology reports, future studies should focus on term normalization. We also plan to develop a platform that allows us to evaluate the generalizability of our system using reports from outside of our institution.

This research was supported by Japan Society for the Promotion of Science KAKENHI grant T22K12885A.

KS developed the entire system, conducted the experiments, and prepared the manuscript. KS, YM, and TT designed the project. YM and TT supervised the project. SW, SK, SM, and KO validated the data. All authors discussed the results and contributed to the final manuscript.

None declared.

Edited by Jeffrey Klann; submitted 16.05.23; peer-reviewed by Jamil Zaghir, Manabu Torii, Tian Kang; final revised version received 25.09.23; accepted 03.10.23; published 14.11.23

© Kento Sugimoto, Shoya Wada, Shozo Konishi, Katsuki Okada, Shirou Manabe, Yasushi Matsumura, Toshihiro Takeda. Originally published in JMIR Medical Informatics (https://medinform.jmir.org), 14.11.2023.

This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on https://medinform.jmir.org/, as well as this copyright and license information must be included.
