Patient Information Summarization in Clinical Settings: Scoping Review


Introduction

Background

Summarization is an essential element of human cognition and consists of taking a set of information and retaining the pertinent elements to take action []. Feblowitz et al [] defined information summarization in the health care context as “the act of collecting, distilling, and synthesizing patient information for the purpose of facilitating any of a wide range of clinical tasks.” In practice, this definition translates to simplifying the presented information so that health care professionals (HCPs) can act more smoothly and efficiently in different clinical situations.

Automatic summarization of information in electronic health records (EHRs) can serve as a solution for information overload [], a widespread problem in health care in which more data are presented than can be efficiently processed in a care situation. Information overload can have detrimental effects on patient care in the form of professional stress, fatigue, delays, and medical errors []. Although the phenomenon is not novel, it is increasingly present owing to an aging population with a rapidly increasing prevalence of chronic diseases, a growing administrative burden, and the overabundance and suboptimal storage of medical data [,]. Furthermore, current EHR systems present information in a fragmented manner [] with widespread repetition, copy-pasting [], and details not relevant to clinical care [].

Despite the need for automatic patient information summarization, there is no widely accepted theory or methodology. This report aimed to synthesize the contributions to patient information summarization scattered across the literature. A scoping review was chosen as the form, with the aim of mapping the ideas and concepts related to the question and identifying knowledge gaps.

The review is not unprecedented: in their narrative review, Pivovarov and Elhadad [] already summarized the most important contributions to clinical summarization in 2015. Moreover, several published studies have surveyed the literature in related fields, including the summarization of biomedical literature [,], the summarization of medical documents [], neural natural language processing (NLP) in EHRs [], named entity recognition (an information extraction and NLP technique) in free-text clinical notes [], automatic clinical documentation [], the visualization of medical information in the clinical context [-], the visualization of intensive care unit (ICU) data [], and the visualization of trends in medical data []. The latter reviews, although exhaustive within their specific scopes, do not permit the identification of state-of-the-art summarization methods for HCPs. For example, it is difficult to establish the current state of research on the management of uncertainty and time in clinical summarization. Moreover, they did not provide informed guidelines for clinical summarization.

Objective

This review, building on a broader scope of articles than the combination of all the previous studies, systematically evaluates where, what, and how medical information is summarized; whether summarization conserves temporality, uncertainty, and medical pertinence; and how the propositions are evaluated and deployed.

On the basis of cognitive science literature, this review also proposes a novel “collect—synthesize—communicate” framework to compare studies on how they contribute to clinical summarization.


Methods

Overview

The methodology was designed to process a broad scope of articles; hence, different search strategies were combined to diversify the sources. Two reviewers agreed on the selection method: 2 databases with a clinical focus were searched with similar queries and the retrieved articles were filtered by one of the reviewers according to their titles and abstracts. The same reviewer read the remaining reports in the full text and selected them according to the inclusion and exclusion criteria. The same filtering was then carried out on citations within these articles and the citations of these articles. The reporting was done using the PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews) [], and a checklist is provided in .

The 2 databases searched in this review were the Web of Science Core Collection and PubMed, as they cover a broad scope of articles in the medical field while being less inclusive of computer science articles unrelated to the medical or scientific domain.

The search query for Web of Science was designed as a combination of 2 parts: one capturing the summarization process and one capturing the health care content. An iterative process was used to define the exact search term, where the gain of adding a keyword was examined by determining whether the first 5% of the results (sorted by relevance as defined by Web of Science) from a query containing the new word but excluding the previous words contained any relevant articles. This led to the following query: “ALL=((‘ehr’ OR ‘emr’ OR ‘health’ OR ‘patient’ OR ‘medical’ OR ‘hospital’ OR ‘healthcare’) AND (‘summarization’ OR ‘summarisation’ OR ‘summarizing’))” searching the title, abstract, and metadata of the articles in the database, including the “keyword plus” field containing terms frequently appearing in the body of an article but not mentioned in the title or abstract.

The query in PubMed was “(‘ehr’ OR ‘emr’ OR ‘health’ OR ‘patient’ OR ‘medical’ OR ‘hospital’ OR ‘health care’ OR ‘medical records’[MeSH Terms]) AND (‘summarization’ OR ‘summarisation’ OR ‘summarizing’).” This query was almost identical to the search made in Web of Science, except that PubMed does not have a “keyword plus” field to search in. Instead, articles preindexed with the Medical Subject Headings term “medical records” were included in the search.
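To illustrate how such a query can be reproduced programmatically, the following minimal sketch (our illustration, not part of the reviewed methodology) runs the PubMed query through Biopython's Entrez utilities; the email address is a placeholder required by NCBI.

```python
# Minimal sketch: running the PubMed search query via Biopython's Entrez module.
# The email is a placeholder (NCBI requires a contact address); retmax caps the
# number of returned IDs.
from Bio import Entrez

Entrez.email = "reviewer@example.org"  # placeholder contact address

query = (
    "('ehr' OR 'emr' OR 'health' OR 'patient' OR 'medical' OR 'hospital' "
    "OR 'health care' OR 'medical records'[MeSH Terms]) "
    "AND ('summarization' OR 'summarisation' OR 'summarizing')"
)

handle = Entrez.esearch(db="pubmed", term=query, retmax=100)
record = Entrez.read(handle)
handle.close()

print(record["Count"])    # total number of matching records
print(record["IdList"])   # first 100 matching PubMed IDs
```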

All results of the queries were imported into the Rayyan app [], helping to organize the citations for a review article. After duplicates were removed, the abstracts and titles were scanned in this application to filter records that were obviously irrelevant to patient information summarization. The app enables highlighting specific words in the abstract with 2 distinct colors, speeding up the review process. After this filtering step, the remaining articles were read in the full text for inclusion using the inclusion and exclusion criteria detailed in the following section.

After identifying the relevant works, the list of references in the selected articles and the list of citing papers retrieved by Google Scholar were manually reviewed for titles related to the topic. These potentially relevant titles were manually added to the citation manager and further filtered by reading their abstract and eventually reading them in the full text (as with the original results). If these articles contained further (previously unseen) relevant references or citations toward them, they were also processed.

Inclusion and Exclusion Criteria

According to the inclusion criteria, all records mentioning the summarization of clinical or health data as a general goal in their abstract, proposing solutions for information overflow, or claiming to make steps toward these general goals were included.

All records were excluded where the corresponding article was not available in the full text for the authors (ie, at the University of Geneva campus), as the analysis could not be conducted on these records.

As several articles do not mention the source of their data in the abstract, an exclusion criterion removed articles summarizing non-EHR data (eg, summarizing research articles; “not EHR”). Similarly, articles developing summarization for users other than HCPs (“not for HCP”) or for contexts other than clinical applications (“not clinical”) were excluded.

As many different methods can be labeled as “summarization,” only records presenting some type of overview of a patient’s current or past status were intended to be included. Therefore, articles proposing alerts (eg, risk scores and cues) or similarly simple parameters to summarize the state of a disease or patient (“alert”), as well as articles proposing remedies for information overflow other than automatic summarization (“not automatic”), were excluded.

As previous reviews analyzed articles using different aspects, a broad timeframe was targeted for this review. However, as early EHR systems were very different from current systems, and hence the concept of summarization differs largely in these systems, articles published before 2005 were excluded (“<2005”). The cutoff year is a somewhat arbitrary (but round) threshold, although contributions before this year are sporadic.

Finally, articles presenting summarization solutions only for nontextual and nonstructured data (eg, video or signal summarization; “other data”) and review papers (“review”) were also excluded.

Articles found relevant to the review were evaluated by one of the reviewers for several criteria chosen to answer the following questions:

- Where is summarization performed?
- What is summarized? How?
- How are crucial aspects of clinical information conserved and used?
- How are the algorithms evaluated?

presents the detailed criteria. Some of these criteria were defined a priori, whereas others were shaped during the analysis process. For 1 aspect, the input data type for summarization, the analysis was carried out on a broader scope, and reports excluded by the “other data” criterion were also analyzed for this information.

Textbox 1. Criteria according to which articles are evaluated in the analysis part of this review. For some of the criteria, categories are defined a priori. For others, they are shaped during the analysis process (the ones defined a posteriori are marked with an asterisk).

General aspects

Type of the study:
- Prototype: articles describing a summarization system or algorithm that can be evaluated. The evaluation might be present or absent from the article.
- Evaluation study: articles evaluating summarization systems, algorithms, or current summarization processes in health care without presenting a new automatic summarization solution.
- Recommendations: articles with theoretical contributions that are not implemented.

Where are summaries needed?

Field of application*: the medical or clinical domain where the summarization is applied. The categories are discovered during the review process.

What should be summarized?

Source of information:
- Single encounter
- Multiple encounters
- This information cannot be inferred from the text.

Input for the summarization:
- Structured data: a combination of numerical and categorical data
- Textual data: free-text patient information present in electronic health record systems

How to summarize?

- The summarization method*: the categories of summarization methods are shaped during the review process.
- Presentation*: how the summary is presented to the end user. The types of presentations are shaped during the review process.
- View on the summarization problem:
  - The top-down group represents records where summarization consists of eliminating “disturbances” from all available information, that is, hiding information deemed unnecessary.
  - Bottom-up methods see summarization as a process of finding the most salient information available and building a summary from it.

Aspects to be conserved during summarization

- Temporality*: if and how temporal information is conserved and used during summarization. The categories are shaped by the discoveries in the scoping review.
- Uncertainty*: if and how the uncertainty of information is represented during summarization. The categories are shaped by the discoveries in the scoping review.
- Medical knowledge*: if any medical knowledge is included in the design of the summarization system or during summarization. The categories are shaped during the review process.

What is a good summary? Evaluation and deployment

- Evaluation*: the method of evaluation. The types are shaped according to the discoveries from the review process.
- Deployment: if the summarization system was deployed in real clinical settings

Collect—Synthesize—Communicate Framework

During the analysis, we developed a new framework to compare methods according to how they summarize clinical information. Following the definition presented in the introduction [], the model divides the summarization process into an ideally sequential process of information collection, information synthesis, and summary communication. Information collection refers to the extraction of information from raw data, synthesis describes the selection and eventual transformation of the retrieved information, and communication refers to the representation of the synthesized information in a human-digestible format.
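A minimal structural sketch of this framework, with plain Python functions as stand-ins for the concrete methods discussed later (all names here are illustrative, not from any reviewed system):

```python
# Structural sketch of the collect—synthesize—communicate framework.
# Each stage is a separate function so that concrete techniques (concept
# extraction, sentence scoring, text generation, etc) can be swapped in
# independently. Illustrative only.
from dataclasses import dataclass

@dataclass
class Finding:
    concept: str      # an extracted clinical concept, eg, "hypertension"
    timestamp: str    # when the information was recorded
    salience: float   # estimated importance of the finding

def collect(raw_records: list[str]) -> list[Finding]:
    """Collection: extract structured findings from raw EHR data (stub)."""
    return []  # a real system would plug in concept extraction, encoding, etc

def synthesize(findings: list[Finding], max_items: int = 10) -> list[Finding]:
    """Synthesis: select (and possibly transform) the most salient findings."""
    return sorted(findings, key=lambda f: f.salience, reverse=True)[:max_items]

def communicate(findings: list[Finding]) -> str:
    """Communication: render synthesized findings in a human-digestible form."""
    return "\n".join(f"{f.timestamp}: {f.concept}" for f in findings)

def summarize(raw_records: list[str]) -> str:
    return communicate(synthesize(collect(raw_records)))
```

The value of the decomposition is that each stage can be replaced independently, for example, swapping a rule-based collector for a machine learning one without touching the communication stage.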

This view was consistent with that of several sources of cognitive psychology. For example, Johnson [] describes summarization as a sequence of prerequisites for summarization (including comprehending individual propositions of a story, establishing connections, identifying the consistent structure of the story, and remembering the information), information selection, and formulating a concise summary. This is also similar to the view presented by Hidi and Anderson [], who discussed selection, condensation, and transformation.

Nevertheless, the few theoretical studies on clinical summarization have slightly different views on the process. Feblowitz et al [] described clinical summarization as a process of aggregation, organization, reduction, transformation, interpretation, and synthesis. Jones [,] describes textual summarization as a process of interpretation, transformation, and text generation. Although these theories mention seemingly different steps for summary creation, they can be mapped to the proposed simpler and more general 3-step framework. presents the mapping of these theories, and of some of the most commonly used summarization methods, to the present framework.

Table 1. Summary of how existing theoretical frameworks and the most abundant summarization methods relate to the collect—synthesize—communicate framework.

Theory or method | Collection | Synthesis | Communication
Feblowitz et al [] | Aggregation, organization, and interpretation | Reduction, transformation, and synthesis | Organization and synthesis
Jones [,] | Interpretation | Transformation | Text generation
Extractive summarization (eg, Liang et al []) | N/Aa (not covered) | Sentence selection | N/A (not covered)
Abstractive summarization (eg, Gundogdu et al []) | Encoding | Attention mechanism | N/A (not covered)
Topic modeling (eg, Botsis et al []) | Topic extraction | N/A (not covered) | N/A (not covered)

aN/A: not applicable.


Results

Overview

As shown in [], a total of 7925 titles were retrieved from PubMed and 3641 articles were retrieved from Web of Science. After removing duplicates, 9166 records were screened by their title and abstract for inclusion criteria and 380 records were chosen for full-text reading. From these, 1 could not be accessed by the authors and 328 were excluded based on the exclusion criteria.

From the reference lists and citations of the 52 articles included in the analysis, 612 records were identified as potentially relevant by their titles, and 175 were chosen to be read in the full text after screening the abstracts. Of these 175 titles, 2 could not be accessed, 97 records were excluded according to the exclusion criteria, and 76 titles were included in the analysis.

Among the 128 articles remaining in the analysis, 102 titles were categorized as a prototype, 20 were categorized as evaluation studies, and 6 were categorized as “recommendations.”

Fields of Application

This review identified diverse fields of application for which summarization methods have been developed. A grouping of these fields is as follows:

- ICU (27/128, 21.1%), where recent events and vital parameters are summarized
- Surgery (1/128, 0.8%) and related anesthesiology (5/128, 3.9%), requiring all the information related to surgery to be summarized
- Diagnostics, showing findings from one or several diagnostic sessions, including radiology (19/128, 14.8%), out of which 5.5% (7/128) were presented as solutions in the MEDIQA 2021 summarization task [], ultrasound (2/128, 1.6%), prostatectomy (1/128, 0.8%), and laboratory data management in a clinical context (1/128, 0.8%)
- Hospital care (9/128, 7%), where information related to a hospital stay requires efficient summarization
- Chronic disease monitoring, including diabetes (4/128, 3.1%), HIV (1/128, 0.8%), chronic obstructive pulmonary disease care (1/128, 0.8%), cardiology (2/128, 1.6%), nephrology (1/128, 0.8%), and monitoring of multiple chronic diseases (4/128, 3.1%), where salient events and information during a complex and long-lasting disease are required
- Oncology (5/128, 3.9%), where the main events and elements of complex treatment are summarized
- Drug prescription (3/128, 2.3%), where pharmaceutical history is summarized
- Other medical environments, including psychotherapy (3/128, 2.3%), opioid misuse treatment (1/128, 0.8%), general practice (2/128, 1.6%), emergency room (2/128, 1.6%), older adult care (2/128, 1.6%), and maternal care (1/128, 0.8%)

In addition, 25% (32/128) of articles did not specify their field of application or were meant to be usable in multiple types of medical environments and domains.

Input for Summarization

Regarding the source of information, 62.5% (80/128) of reports describe systems summarizing single patient encounters, 27.3% (35/128) explicitly describe summarizing multiple encounters, 6.3% (8/128) implicitly describe multiple-encounter summarization, and 3.9% (5/128) did not specify the cardinality of encounters.

Among the 128 articles in the review, 3 (2.3%) reports do not specify the input type for the summary, 59 (46.1%) worked only with structured data, 53 (41.4%) worked only with textual data, and 13 (10.2%) worked with both types. The trends in the number of articles with different input types are shown in . Although more records use only structured data as the input type, the number of articles using textual information has increased rapidly in recent years. Textual information is usually assumed to be in English [,], although some records present solutions for the Finnish and German languages [,].

Figure 1. The number of records by year of publication and the input type used in the summarization system or method presented or evaluated. Each column corresponds to a year; the different input types are aggregated into the column, and their proportions for the given year are visible in the figure. ICU: intensive care unit.

A rapid analysis of the excluded articles using other types of clinical data identified the following:

- Overall, 37 video and image sequence visualization tools in the clinical domain, mainly keyframe extraction [,] or motif discovery [] methods in various fields of medicine, including older adult care, endoscopy [], hysteroscopy [,], laparoscopy [], magnetic resonance imaging [,], and ultrasound []
- Overall, 20 sensory data simplification techniques using time-series analysis, motif discovery [], and classification [] methods for electrocardiograms or other types of signals
- Overall, 117 articles about summarizing genomic data []

How Can Data Be Summarized? Summarization Methods

Common summarization methods used in the analyzed studies include (a report might use several of these) the following:

- Visual design (59/128, 46.1%) organizes the information visually to help HCPs understand it within a short timeframe.
- Concept and relation extraction (30/128, 23.4%) extracts semantic information from textual data.
- Abstractive summarization (19/128, 14.8%) [,-] shortens texts by reformulating them, using different wording to describe the content of a document [].
- Extractive summarization (13/128, 10.2%) [,] shortens texts by omitting parts of them, that is, composing a short text (a summary) from extracts of the original document (a minimal scoring sketch is shown after this list).
- Summary specification (11/128, 8.6%) describes the content to be presented in a summary.
- Pipelines (7/128, 5.5%) extract information, synthesize it, and communicate it with natural language generation tools.
- Topic modeling (5/128, 3.9%) [-] categorizes documents according to their content and labels them with a list of representative words [].
- Time-series analysis (6/128, 4.7%) identifies characteristic properties in a temporal data series, including motif discovery, which identifies meaningful patterns in temporal data (used in the study by Jane et al []), trend detection [], and change detection [].
- Dimensionality reduction (3/128, 2.3%) treats patient data as a long vector encoding all patient information (ie, a row in a table with many columns) and reduces it to a shorter vector (ie, a row with far fewer columns) without losing too much information.
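As referenced in the extractive summarization item above, the following is a minimal, generic sketch of sentence selection by scoring (a simple TF-IDF weighting; our illustration, not a method from any specific reviewed study):

```python
# Generic extractive summarization sketch: score each sentence by the
# rarity-weighted frequency of its words, then return the top-scoring
# sentences in their original order.
import math
import re
from collections import Counter

def extractive_summary(text: str, n_sentences: int = 3) -> str:
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    tokenized = [re.findall(r"[a-z]+", s.lower()) for s in sentences]

    # Document frequency: in how many sentences each word appears.
    df = Counter(w for toks in tokenized for w in set(toks))
    n = len(sentences)

    def score(tokens: list[str]) -> float:
        if not tokens:
            return 0.0
        # Average inverse-sentence-frequency weight of the sentence's words.
        return sum(math.log(n / df[w]) for w in tokens) / len(tokens)

    ranked = sorted(range(n), key=lambda i: score(tokenized[i]), reverse=True)
    chosen = sorted(ranked[:n_sentences])  # keep original sentence order
    return " ".join(sentences[i] for i in chosen)
```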

Some of these methods are intrinsic to the input data type and work only with a particular data type. For example, time-series analysis (including motif discovery), risk scores, and dimensionality reduction are intrinsic methods for structured data. Although a large number of articles using these methodologies are not included in this review because their output is consumed by non-HCP users (eg, machine learning algorithms), some of the titles propose this approach as a first step toward clinical summarization [-].

The most common intrinsic methods for textual data are extraction, abstractive summarization, and topic modeling.

Some of these summarization methods can apply machine learning techniques. An overview of the applied machine learning methods is presented in . The table lists all records obtained using machine learning and categorizes the records according to the summarization method and the type of machine learning method they use. Machine learning methods can be categorized into traditional machine learning methods, deep neural networks, and transformers. Traditional methods include support vector machines, random forests, and conditional random field methods; deep neural networks include deep neural networks, recurrent neural networks, and convolutional neural networks; transformers contain BART [], BERT [], Pegasus-based [] methods, and pointer-generator models. In addition, Reunamo et al [] used an interpretable machine learning technique (Local Interpretable Model-Agnostic Explanations, LIME []). N/A indicates that a machine learning method is not used for a given type of summarization method.

Table 2. Summary of records applying machine learning methods for clinical summarizationa.

Extractive summarization
- Traditional techniques: SVMb+CRFc: Liang et al [], 2019; Liang et al [], 2021
- Deep neural networks: CNNd: Liang et al [], 2019; Liang et al [], 2021; Subramanian et al [], 2021. RNNe: Alsentzer and Kim [], 2018; Liu et al [], 2018; Chen et al [], 2019
- Transformers: BERTf: Chen et al [], 2019; Kanwal and Rizzo [], 2022; McInerney et al [], 2020; Liang et al [], 2022; Shah and Mohammed [], 2020. Other: Liang et al [], 2019; Liang et al [], 2021

Abstractive summarization
- Traditional techniques: N/Ag
- Deep neural networks: RNN: Gundogdu et al [], 2021; Hu et al [,], 2021
- Transformers: BERT: Cai et al [], 2021; Chang et al [], 2021; Mahajan et al [], 2021; Sotudeh et al [], 2020. BARTh: Dai et al [], 2021; He et al [], 2021; Kondadadi et al [], 2021; Shing et al [], 2021; Xu et al [], 2021. Pegasus: Dai et al [], 2021; He et al [], 2021; Kondadadi et al [], 2021; Zhu et al [], 2021; Xu et al [], 2021. Pointer generator: MacAvaney et al [], 2019; Zhang et al [], 2018; Zhang et al [], 2019. Own architecture; GPT-2i

Concept and relation extraction
- N/Ag for all categories

Pipeline
- N/Ag for all categories

Topic modeling
- Traditional techniques: Alternating decision tree: Devarakonda et al [], 2017
- Deep neural networks: N/Ag
- Transformers: N/Ag

aMachine learning methods are categorized into traditional machine learning methods, deep neural networks, and transformers.

bSVM: support vector machine.

cCRF: conditional random field.

dCNN: convolutional neural network.

eRNN: recurrent neural network.

fBERT: Bidirectional Encoder Representation from Transformers.

gN/A: not applicable.

hBART: Bidirectional Autoregressive Transformer.

iGPT-2: Generative Pre-trained Transformer 2.
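To make the transformer-based rows of Table 2 concrete, the following hedged sketch applies a pretrained general-domain BART checkpoint through the Hugging Face transformers library; the checkpoint and the toy note are illustrative only, and none of the reviewed studies is implied to have used this exact setup.

```python
# Illustrative abstractive summarization with a pretrained transformer.
# facebook/bart-large-cnn is a general-domain model, not a clinically
# validated one; the clinical note below is invented for the example.
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

note = (
    "Patient admitted with chest pain radiating to the left arm. "
    "ECG showed ST elevation; troponin elevated. Treated with PCI and "
    "started on dual antiplatelet therapy. Discharged in stable condition."
)

result = summarizer(note, max_length=40, min_length=10, do_sample=False)
print(result[0]["summary_text"])
```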

Summarization methods can also be categorized based on their outputs. The review identified several ways in which the summarized information is presented to the end user:

- A graphical display (53/128, 41.4%) is a specific way (interactive or not) to present information on the computer screen.
- A short textual summary (41/128, 32%) describes information in ordinary language (eg, English).
- A preset static report, with content designed to include specific medical information (6/128, 4.7%) or chosen statistical distributions representative of the patient (1/128, 0.8%)
- A problem-oriented view (7/128, 5.5%): a view grouping findings according to the problems the patient may present
- A low-dimensional vector (4/128, 3.1%): encoding information in n numbers (where n is the dimension), each representing the state of the patient from a particular aspect (a sketch of this output type follows this list)
- A list of words representing a topic (5/128, 3.9%), a problem list (2/128, 1.6%), a list of medical concepts found in the document (2/128, 1.6%), or a label (2/128, 1.6%)
- A table (1/128, 0.8%) with rows and columns, or a directed graph or concept map representing information in a graph-structured data model (2/128, 1.6%)
- No presentation: the articles in the “recommendation” group (5/128, 3.9%) did not present the results to the end user.
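As noted in the low-dimensional vector item above, the following sketch (our illustration, using scikit-learn's PCA on a random toy matrix standing in for a wide table of patient variables) shows how such an output can be produced:

```python
# Illustrative dimensionality reduction: each patient row (50 clinical
# variables) is compressed into a 3-number vector with PCA. The random
# matrix is a stand-in for real patient data.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
patients = rng.normal(size=(100, 50))   # 100 patients x 50 clinical variables

pca = PCA(n_components=3)
embedded = pca.fit_transform(patients)  # each patient becomes a 3-number summary
print(embedded.shape)                   # (100, 3)
```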

depicts the evolution (by the time of publication of records) of the most abundant formats for communicating the summarization results.

Figure 2. The number of records by year of publication and the most common ways of summary presentation used in the summarization method presented or evaluated in the report.

Concerning their view on summarization, 32.8% (42/128) of the records regarded summarization as a bottom-up approach and 64.8% (83/128) used the top-down view, whereas 1.6% (2/128) of records do not show a clear opinion on summarization.

Using the proposed framework, 42.2% (54/128) of the records contributed to information collection, 33.6% (43/128) to information synthesis, and 46.1% (59/128) presented solutions for summary communication. [,,,-,-,-,-,-] visualizes all the analyzed prototype articles and how they fall into these categories (ie, which step of the framework is addressed within the corresponding work). The records’ year of publication, the presentation of summaries, and the relationships between records are also displayed. The diagram has a vertical axis showing the year of publication, and all the “prototype” records (presented as references) published in a given year appear in a line (or in 2 lines if the number of publications for that year is very high). The order within a line has no significance, although the records were grouped within a line to show their contributions. The shape or shapes surrounding a reference symbolize the steps that the record addresses in the “collect—synthesize—communicate” framework. For example, the reference to the study by Liang et al [] is surrounded by all 3 shapes, indicating that the study addresses all 3 steps. The records also have a color representing the format in which the summaries are presented to the HCPs. Closer relationships (ie, follow-up studies) are also presented. The studies submitted to the MEDIQA-2021 challenge [] are also marked in the diagram.

Figure 3. Diagram showing references to all analyzed “prototype” records and how they contribute to the “collect—synthesize—communicate” framework (ie, which step of the framework is addressed within the corresponding work). The records’ year of publication, the presentation of summaries, and the relation between records are also displayed [,,,-,-,-,-,-].

Regarding information collection, concept and relation extraction (30/54, 56%), time-series analysis (6/54, 11%), encoding (22/54, 41%), temporal abstraction (6/54, 11%), and topic extraction (5/54, 9%) were proposed as solutions. Medical concepts are extracted from textual data either using publicly available solutions (eg, cTAKES [] in the study by Goff and Loehfelm []) or tools developed by the authors (eg, [,,]). The retrieved list of concepts can be used for simpler tasks, such as problem list generation [], although some records present systems that go a step further, extracting the context [], the syntactic structure [], or the approximate semantic structure of a sentence [] as well.
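A deliberately simplified, dictionary-based sketch of concept extraction is given below as a stand-in for the tools cited above (eg, cTAKES); the mini-lexicon is invented for illustration, and real extractors add negation detection, disambiguation, and full terminology resources.

```python
# Toy dictionary-based concept extraction. The LEXICON maps surface forms
# to concept identifiers (UMLS-style CUIs, listed here for illustration).
import re

LEXICON = {
    "myocardial infarction": "C0027051",
    "heart attack": "C0027051",
    "hypertension": "C0020538",
    "diabetes mellitus": "C0011849",
}

def extract_concepts(note: str) -> list[tuple[str, str]]:
    """Return (surface form, concept ID) pairs found in the note."""
    found = []
    lowered = note.lower()
    for surface, cui in LEXICON.items():
        # Whole-phrase match with word boundaries on both sides.
        if re.search(r"\b" + re.escape(surface) + r"\b", lowered):
            found.append((surface, cui))
    return found

print(extract_concepts("History of hypertension; rule out myocardial infarction."))
```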

With regard to information synthesis, sentence selection by scoring (13/43, 30%), knowledge-based rules (18/43, 42%), and attention mechanism (19/43, 44%) were possible solutions.

Proposals for summary visualizations are usually features on a graphical screen; they are listed and compared in . For unprocessed textual data, the solutions included highlighting important concepts (3/5, 60%) and creating graphs that visualize the semantic structure of the textual data (2/5, 40%).

Table 3. The number of records presenting various features for visualizations in works with graphical displays. A record can use several features.

Feature | Occurrence (out of 58), n
Colors | 43
Selection of features | 37
Tabular interface | 35
Change in time | 31
Visualization of divergence | 22
Placement of variables | 19
Interactive display | 18
Pictograms | 10
Physical location shown | 6
Alerts | 5
Size difference | 5
Customizability | 3
Shape | 3
Word cloud | 3
Comparison | 2
Variability of parameters | 1

Aspects to Be Conserved and Used

In total, 58.6% (75/128) of the titles did not conserve temporal information, whereas 2.3% (3/128) of titles were agnostic to temporal information (they conserve but do not use it). The remaining articles used a variety of approaches:

- Timeline visualization: plotting information along a horizontal or vertical temporal axis (34/128, 26.6%)
- Other visualizations: showing only the trend of parameters (1/128, 0.8%) or providing a complex visualization framework in which temporal information can be displayed and analyzed (1/128, 0.8%)
- Information extraction from the temporal domain by analyzing how the parameters change during the patient journey. This group included time-series analysis (6/128, 4.7%), pattern recognition (2/128, 1.6%), and change detection (1/128, 0.8%). Time-series analysis (applied in several studies [,,-,]) extracts statistical information from the temporal evolution of one or several variables. Pattern recognition [,] attempts to identify meaningful patterns in temporal data. Change detection [] seeks important events that manifest in the trends and patterns of temporal variables (a minimal sketch follows below). Some studies have revealed the relationships between these events.
- A theoretical model of temporal events, which can either describe more complex interactions between temporal events (6/128, 4.7%) or be very simple (eg, creating an order: 1/128, 0.8%; or describing events with a single time point: 1/128, 0.8%)

It is worth noting that timeline visualization was also applied in 3 articles in the temporal information extraction group and in 1 article in the complex model of temporality group.
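The change detection sketch referenced in the list above could, under the simplest assumptions, look like the following rolling z-score approach (a generic illustration, not a reconstruction of any reviewed method):

```python
# Generic change detection on a vital-sign time series: flag a point when it
# deviates strongly (in standard deviations) from the recent history.
from statistics import mean, stdev

def detect_changes(values: list[float], window: int = 5, threshold: float = 3.0) -> list[int]:
    """Return indices deviating > `threshold` SDs from the preceding `window` points."""
    changes = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        sd = stdev(history)
        if sd > 0 and abs(values[i] - mean(history)) / sd > threshold:
            changes.append(i)
    return changes

heart_rate = [72, 74, 71, 73, 72, 70, 73, 71, 110, 112, 111, 109]
print(detect_changes(heart_rate))  # flags index 8, the jump to ~110 bpm
```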

Regarding information uncertainty, 89.1% (114/128) of the articles did not consider the uncertainty of information. Others have proposed the following solutions:

- Statistical methods were used to treat uncertainty in data. These methods included correcting detectable errors (3/128, 2.3% [,,]) and optimizing the statistical description of the data using robust statistics (1/128, 0.8% []).
- Uncertainty of temporal events was described (2/128, 1.6%).
- Uncertainty of statements was described by assigning them to uncertainty categories (3/128, 2.3% [,,]) or by using an existing ontology (3/128, 2.3%).

Medical pertinence was not conserved in 34.4% (44/128) of the studies (ie, they had no requirement that the summary bear any relation to medical concepts or knowledge). A total of 35.9% (46/128) of records used medical knowledge to specify the information to be included in the summary and its design. Other propositions included the following:

- Using ontologies to find and relate concepts within textual notes (20/128, 15.6% [,,]), using Unified Medical Language System (UMLS) extraction tools (6/128, 4.7%) to extract them (eg, [-]), or improving the performance of abstractive summarization (2/128, 1.6% [])
- Using risk scores to create visualizations (3/128, 2.3% [-]) or applying guidelines to assess risks (2/128, 1.6% [,])
- Using medically salient rules to constrain summarization (3/128, 2.3% [,,])
- Evaluating the factual correctness of the created summaries, integrated into reinforcement learning (2/128, 1.6% [,])
- Applying medical knowledge to select pertinent information (2/128, 1.6% [,])
- Using medical knowledge to construct evaluation metrics (2/128, 1.6% [,])

What Is a Good Summary? Evaluation and Deployment

Several types of evaluation methods and metrics are presented in the publications:

- Quantitative measurements in experiments with human participants (31/128, 24.2%)
- Quantitative measurements when summarization was performed in a real clinical environment (8/128, 6.3%)
- Interviews (7/128, 5.5%), focus groups (2/128, 1.6%), or surveys (19/128, 14.8%) asking the opinions of users after exposure to the summarization system
- Intrinsic evaluation (25/128, 19.5%), measuring quality by comparing the results to a ground truth
- Performance on a proxy task (ie, disease prediction; 10/128, 7.8%)
- Performance in identifying human-annotated concepts (9/128, 7%) or topics (2/128, 1.6%)

The distribution of the number of human evaluators is shown in . Two records [,] used significantly more evaluators than other solutions, which are represented as 2 distinct groups at the tail of the histogram. One of these records [] is a large-scale survey, whereas the other [] is a pilot study measuring user performance.

Although some records present several evaluation techniques, in 9.4% (12/128) of the articles, no evaluation is presented; in 1.6% (2/128) of articles, the evaluation is not detailed; and in 4.7% (6/128) of records, the evaluation consists of a subjective evaluation carried out by the authors of the article.

The metrics used in the evaluations are as follows:

- Performance metrics (eg, precision, recall, and F score) on a prediction or classification task, measuring the validity of predictions or classifications (used both in usability experiments and formative evaluations; 11/128, 8.6%)
- Performance metrics (eg, accuracy) of human participants (ie, the validity of their decisions) on an experimental task (24/128, 18.8%)
- Time savings due to summarization systems: time to completion (ie, the time needed to perform a predefined task) in experiments (21/128, 16.4%) or time saved during patient visits (1/128, 0.8%) in deployed systems
- Patient outcome metrics (6/128, 4.7%), including mortality and hospital readmission rates
- The NASA-TLX score, describing the workload of the user (5/128, 3.9%), and the relationship between the NASA-TLX score and error count (1/128, 0.8%)
- Number of interactions (eg, clicks and screen changes) in usability studies (3/128, 2.3%)
- Grades given by human evaluators measuring the utility and usability of a system (13/128, 10.2%) or trust in it (1/128, 0.8%)
- Scores comparing textual summaries with properties of the input text, including Recall-Oriented Understudy for Gisting Evaluation (ROUGE) [] (20/128, 15.6%), bilingual evaluation understudy [] (2/128, 1.6%), and comparison between input and output distributions (2/128, 1.6%)
- Other properties of the output textual summaries, including readability and fluency (10/128, 7.8%), accuracy or factual correctness (5/128, 3.9%), completeness (7/128, 5.5%), and overall quality (7/128, 5.5%) in qualitative evaluations of textual outputs. Two (2/128, 1.6%) records distinguished between ontological and nonontological correctness.
- Proxy measures for the faithfulness of textual summarization (6/128, 4.7%)
- Heuristics derived from requirement specifications (5/128, 3.9%)

The evaluation metrics used in quantitative evaluations usually depend on the method of summarization: for dimensionality reduction, it is often a performance metric for predicting diseases; for extractive and abstractive summarization, the ROUGE score [] is the most commonly used metric, as it is considered the most reliable []; and for topic modeling, it is the empirical likelihood []. A minimal ROUGE-1 computation is sketched below.
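The following self-contained sketch computes ROUGE-1 (unigram overlap) to make the metric concrete; published evaluations typically rely on established packages rather than hand-rolled code like this.

```python
# Minimal ROUGE-1: precision, recall, and F1 over clipped unigram overlap
# between a candidate summary and a reference summary.
from collections import Counter

def rouge_1(candidate: str, reference: str) -> dict[str, float]:
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-9)
    return {"precision": precision, "recall": recall, "f1": f1}

print(rouge_1("patient stable after treatment",
              "the patient was stable after treatment"))
# -> precision 1.0, recall ~0.67, f1 0.8
```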

For text summarization, evaluations with human participants are often carried out by annotators subjectively grading each produced summary along some metrics, including readability, factual correctness, and completeness. For other summarization methods, this task is usually approximated by either usability tests [-] or experiments [-] where performance and workload are measured. The few systems deployed in clinical settings are often evaluated by measuring patient outcomes or clinical indicators.

Reviewing the results of each report, some records compared their results with summarization methods in the general domain [], and 6 (4.7%) [,,,,,] presented a comparison of clinical summarization methods. The distribution of cross-citations between articles, that is, the number of other publications appearing in this review cited by each report, is represented in . Furthermore, 80% of the records cited fewer than 3 other articles analyzed in this review.

Among the 128 records analyzed, 4 (3.1%) described a method deployed on a large scale, 7 (5.5%) described deployment in a pilot study, and 1 (0.8%) disclosed the code alongside the publication.


Discussion

Principal Findings

Where Are Summaries Needed in Health Care?

Publications on clinical summarization are tied to several different medical and clinical fields, mainly where quick decision-making is crucial (eg, ICU) or where a large amount of information is routinely produced (eg, oncology, chronic disease management, and hospital care).

However, some fields requiring quick decision-making (eg, emergency room environments) have seen less progress. In contrast, others where quick decision-making is less critical (eg, radiology) are covered by a relatively large number of records. This discrepancy suggests that clinical summarization can be beneficial in almost all medical fields, although the idea may not have reached all domains at the same pace. Although the previous drivers are easily identifiable, we speculate that the presence of other solutions proposed to handle information overload (eg, the study by Xu et al []; see the study by Hall and Walton [] for a review) can decelerate adoption, whereas a shortage of personnel in a field (eg, radiology []) can accelerate it.

What Should Be Summarized?

The increasing trend in both single-encounter and multiencounter summarizations suggests that both types are salient and should be used depending on the care situation.

Regarding the input for summarization, several experiments show that HCPs can act at least as accurately and quickly with summarized structured data [-,], with summarized textual data [,], or with most information coded in these forms [] as with complete documentation. Therefore, the focus should be on summarizing textual and structured data when creating summaries for HCPs.

The increasing trend of using textual data for summarization might be attributed to the improvement of NLP, the improved computing power available for some NLP tasks, and the results published by Van Vleck et al [], who claimed that a significant portion of patient information lies in clinical notes. In contrast, Hsu et al [] challenged this hypothesis by presenting experiments predicting clinical measures (eg, hospital readmission and mortality) from textual and structured patient information sources. They concluded that textual sources have little predictive power for these outcomes. However, their analysis might be biased by their methodology, as they used only simple syntactic metrics to describe textual information, whereas semantic information was not included in their model.

How Can Data Be Summarized?

The records analyzed in this review show myriad techniques for summarizing clinical data. Some are intrinsic to the input data type and work only with a particular data type, whereas others are not dependent on the input data type.

For textual data, the review reveals more works on abstractive summarization than on extractive summarization and topic modeling combined, whereas in the general domain, topic modeling and extractive summarization techniques are the most researched [,]. This discrepancy suggests that, despite abstractive summarization techniques being immature [], general problems with extractive summarization, such as redundancy [], lack of coherence [,], and lengthiness [], can be problematic for clinical applications. The verdict on topic modeling is unclear: Arnold et al [] argue that clinicians are good at interpreting topic model results, but other records using this technique do not present evaluations with human participants.

An alternative (and natural) way of organizing summarization methods is to assess how they contribute to the summarization process. Motivated by the lack of a widely accepted theory of the summarization process, this review proposes a 3-step (collect—synthesize—communicate) framework to describe the summarization process, where each step should ideally be addressed by all summarization methods.

For the information collection step, many studies assume an easily queryable information source or propose medical concept extraction from textual data as a solution. More complex information (the context or the syntactic or semantic structure of statements) is extracted in only a few studies, and some works propose extracting specific aspects as information.

Concerning information synthesis, a common approach is to precisely define the content of the summary (eg, [,,]) or at least its format []. However, these studies do not evaluate the quality of their propositions (except the study by Ham et al []). In contrast, some records carried out experiments on the information needs of physicians [,]; however, the results were not integrated into any of the reviewed systems.

Concerning summary visualizations, there is no clear opinion on whether textual or graphical summaries are preferable in the medical context. Although there is a slight dominance of graphical displays among the analyzed records, some works [,] argue that textual summaries lead to more accurate decisions. However, a general pattern of these works is that they compare a specific graphical display with a particular textual display, limiting generalizability. These contradictory results suggest that both formats are helpful for clinical summarization, provided that relevant features are present. Problem-oriented views presented in some records (eg, []) can include both types of display and might have other advantages, as they group all available information about patient-specific problems [].

Concerning the view on summarization, both top-down and bottom-up approaches are justifiable in a clinical setting. However, several bottom-up approaches have been inspired by studies that use top-down approaches. One example is the recent development of techniques for identifying salient concepts in source documents for abstractive summarization. This phenomenon may be due to the natural need for accountability and interpretability, which can be achieved more easily with a bottom-up approach closer to human cognition. The need for bottom-up approaches also suggests that summarization techniques should address all 3 steps of the proposed “collect—synthesize—communicate” framework: information collection, synthesis, and communication.

How Are the Temporal and Uncertain Aspects of Information and Its Medical Pertinence Conserved and Used?

The temporal nature of clinical data is an essential aspect of clinical reasoning [], and a relatively large portion of analyzed records presents solutions to use this aspect of information. However, in most of these studies, this aspect was only represented as a visualization feature. Most visualizations are timeline visualizations, plotting information along a horizontal or vertical temporal axis following Plaisant et al [], although some alternative methods exist [-]. Alternatively, some studies have revealed the relationship between events by analyzing how variables change during the patient journey.

Although temporality in clinical settings is believed to be more complex [] than a series of punctual events, current solutions to clinical summarization hardly reflect this complexity. Very few studies have attempted to incorporate more temporal information by using more complex models of temporality (eg, events lasting during an interval). Complex temporal information is usually not directly available in patient records and must be deduced from the context and knowledge-based rules. This process is called temporal abstraction and was applied in previous studies [,,-,]. Hunter et al [,] considered the uncertainty of temporal information by defining the beginning and end of each time interval as an interval.

Although several levels of uncertainty exist in clinical care [], the majority of the analyzed reports do not present solutions to conserve or handle any information uncertainty.

To a lesser extent, the pertinence of medical knowledge has been overlooked by many summarization approaches. Many records do not consider medical pertinence or use it only for some design considerations. However, the few records that handle this aspect of medical data provide a relatively wide range of solutions to constrain the resulting summaries. In most cases, these constraints are relatively weak; for example, concepts are assumed to be part of a specific medical ontology. This is obviously the case for concept extraction tools, but the records using reinforcement learning approximate factual correctness using this approach as well.

Deeper integration of medical knowledge is only present in works using medical rules to select salient information and in the 2 works using medical rules to create summaries. Liang et al [] used medical knowledge to create components of their proposed NLP pipeline, whereas Shi et al [] used medical knowledge–based rules to visualize abnormalities in the human body.

How to Identify a Good Summary?

Using the definition of clinical summarization (ie, simplifying and presenting information so that HCPs can act more smoothly and efficiently in different clinical situations), the ultimate purpose of an evaluation might be to determine whether using the proposed summarization systems would improve the efficiency of HCPs.

However, such an evaluation is often unfeasible owing to the high costs and ethical issues associated with potential medical errors.

This is supported by the results, as many of the proposed evaluations are approximative solutions, and there is quasi-uniform agreement that evaluating summarization is challenging and that current practice is suboptimal []. The spectrum of these evaluations is broad, but a common trend is to carry out a quantitative evaluation using an easily calculable metric describing either the quality of the summary or its “usefulness” for performing a proxy task (ie, disease prediction).

These quantitative analyses are suboptimal. For example, one of the most common metrics, the ROUGE score, assumes a human-annotated “gold standard” summary to compare with, but this standard may not exist given the high cost of annotation or because people disagree about what a “gold standard summary” would be [,]. To tackle this problem, some records [,] present a comparison between the semantic distributions of the input and the summary, whereas others [,,] use heuristics to evaluate the results. Another problem with the ROUGE score is that even a summary with a high ROUGE score can be very inaccurate []; therefore, there have been attempts to measure the “faithfulness” of summaries either by the number of medical concepts retrieved [] or with a more complex faithfulness measure defined by Zhang et al [].

Evaluations with human participants often complement these quantitative evaluations. Human evaluations have mainly positive outcomes (except in the study by van Amsterdam et al []); however, most of the evaluations are carried out on a small scale. This can explain why very few long-lasting implementations in health care have been presented in the literature.

It is also important to mention that there is very little comparison between summarization methods, and citations between records are scarce. This suggests that the research in this domain is fragmented.

These shortcomings suggest that evaluation is a weak point in clinical summarization proposals, and the lack of widely accepted evaluation metrics and methodology might be a main obstacle for research in the field.

Limitations

Methodological biases are present in the selection, synthesis, and reporting. First, the number of reviewers was limited in both the selection and the analysis of records, resulting in selection and synthesis bias.

Selection bias also stems from the fact that the review covered only works published as scientific papers or in the gray literature, and the initial search was carried out on 2 databases that are more specific to medicine. However, several unpublished summarization solutions have been applied in current EHR systems.

Moreover, publishing bias also adds to selection bias, as there is a clear dominance of positive results in scientific publishing.

Furthermore, the applied search queries and the selection of databases are also sources of bias. Using other data sources (eg, IEEE Xplore and Scopus) might have introduced further bias to the analysis. However, including the citations and references in the review process might have reduced this bias significantly. Moreover, the queries were formulated in English; therefore, results not in English and containing non-English terms might be missed if their abstract was not translated or if they do not appear in the citation lists or references of the retrieved articles. Finally, there were potentially relevant records [,,] whose full text could not be read and analyzed, as it was not available at the time of writing the manuscript.

Conclusions

Clinical summarization has not reached all domains at the same pace, although it is potentially beneficial in most medical fields. Two aspects, the requirement for quick decision-making and the overabundance of data, were identified as the main drivers for the development of automatic summarization methods. However, other less-evident drivers might also play a role in adoption.

Despite this need, very few [,,] scientific publications present adoption in real clinical settings, suggesting a low success rate in clinical environments.

Despite the large number and variety of propositions, hardly any comparisons exist between the solutions. This is largely due to the difficulty of comparing summarization methods.

From a cognitive psychological perspective, and to measure how summarization methods align with the definition of summarization, this review proposes to compare these algorithms through a “collect—synthesize—communicate” framework, referring to information collection, information synthesis, and summary communication.

gif