Web-Based Application Based on Human-in-the-Loop Deep Learning for Deidentifying Free-Text Data in Electronic Medical Records: Development and Usability Study


Introduction

Narrative free-text data in electronic medical records (EMRs) include a variety of clinical documents such as consultation notes, nursing notes, progress notes, and discharge summaries, which contain valuable information for analysis and research to inform better patient care [-]. The free-text data can include personally identifiable information (PII), for example, patient name, date of birth, address, phone number, and patient identifier, which can be used to identify an individual on its own or with other information. It is necessary to deidentify the free-text data by removing this PII before releasing to researchers for secondary purposes where the reuse of the data is not covered in patients’ informed consent forms or when requested as part of a waiver of informed consent by an institutional review board or any other human research ethics committees, as required by legislation including the Privacy Rule of the HIPAA (Health Insurance Portability and Accountability Act) [] in the United States and the Privacy Act, 1988 [], in Australia. However, manual deidentification has been proven to be a time-consuming and labor-intensive task [].

In the past decade, researchers have investigated many different automated deidentification approaches including rule-based matching [-] and machine learning (ML) models [-]. Hand-written regular expressions and ad hoc knowledge dictionaries are used in rule-based deidentification approaches for a specific free-text data set []. In contrast, ML-based deidentification approaches use manually annotated data to train named entity recognition (NER) models, enabling the prediction of PII entities from free-text data. Although rule-based methods do not necessitate the preparation of annotated data, the rules are challenging to generalize to other corpora without manual adjustments from experienced domain experts []. The Bidirectional Long Short-Term Memory-Conditional Random Field (BiLSTM-CRF) has been proven to achieve state-of-the-art or competitive results on the free text deidentification task [-]. With the success of transformer models in the NLP domain, some studies began to explore its use on deidentification tasks [-]. Johnson et al [] fine-tuned pretrained transformer models and achieved a binary token-level F1-score of 0.984 on the 2014 i2b2 test set. However, a benchmark study [] showed that BiLSTM-CRF achieved better performances compared to transformer-based models. Another study conducted by Tang et al [] demonstrated that incorporating the pretrained transformer language model as the word embedding layer in BiLSTM-CRF led to an improvement in the F1-scores on the 2014 i2b2 deidentification data set, compared to using other word embeddings (eg, Word2Vec and ELMo). Furthermore, several ensemble approaches that combine multiple individual ML models have been proposed on deidentification tasks [,,]. By leveraging the strengths of individual methods, these ensemble methods have demonstrated improved performance on deidentification tasks.

The traditional workflow of the ML-based deidentification approaches consist of three stages: (1) annotation: human annotators manually tag all the PII in the free text. The interannotator agreement is calculated to measure the quality of the annotation []; (2) model training: ML experts train models using the annotated free text; and (3) deidentification: the PII predicted by the models are substituted by surrogates or tags or removed completely. Although pretrained ML solutions can potentially be used “out of the box,” there are significant variations between hospitals, vendors, and countries in the structure and content of EMRs and the nature of PIIs. Furthermore, data custodians may require performance metrics based on their specific data before gaining sufficient confidence to use these tools. Therefore, manual annotation remains a time-consuming process and is the main bottleneck in training ML-based deidentification models [,]. To overcome this, several annotation tools (eg, BRAT [] and WAT []) have been used to speed up the annotation stage [,]. Nevertheless, the second and third stages of the workflow still require considerable input from ML experts. In recent years, some annotation tools (eg, ezTag [], INCEpTION [], and Prodigy []) have integrated interactive learning for iteratively retraining the models using the latest annotated free text to provide preannotation suggestions, which automates the second stage of the workflow. These tools can be used to handle some of the deidentification task (ie, PII tagging). Aberdeen et al [] developed an open-source deidentification tool, MITRE Identification Scrubber Toolkit (MIST), which comprises a web-based graphical annotation tool, a training module, a tagging module, a redaction and resynthesis module, and an experiment engine. These modules work together using a “tag-a-little, learn-a-little” loop strategy to complete the deidentification task, bypassing the need for ML experts for the second and third stages of the traditional deidentification workflow. The annotation tool is used by annotators to tag the PII from the free-text files. The training module trains a conditional random field-based sequence tagger using these annotated files. The tagging module automatically tags the PII for the new files that can be manually corrected by the annotators. Furthermore, MIST provides a workspace mode to conveniently manage a corpus which needs to be deidentified. However, many operations of MIST need to be done via the command line, for example, creating a workspace, importing files into a workspace, and training models. The MIST server needs to be restarted to get the newly trained model into effect. Moreover, the end users of the MIST tool are required to have the technical knowledge to run command lines.

Off-the-shelf tools [] such as Amazon Comprehend Medical [], Clinacuity CliniDeID [], and National Library of Medicine (NLM) Scrubber [], can be used to deidentify free text directly without following the traditional deidentification workflow. These 3 tools are all HIPAA-compliant, following the HIPAA’s “Safe Harbor” method [] to remove 18 types of identifiers. Amazon Comprehend Medical is a cloud-based service which needs the data be uploaded to its service end point. This represents a significant barrier to adopting it for use with EMRs, which are stored in a secure and internet access-restricted environment. CliniDeID, originally a commercial clinical text deidentification software, has recently been made available as free open-source software since November 2022. This allows for additional retraining on specific data sets to enhance the model's performance. Although NLM Scrubber can be installed locally, its performance cannot be improved because it has no ability to learn from the end users’ free-text data, which may vary considerably from the data used to develop the tools. Therefore, data custodians are responsible for reviewing and evaluating the deidentification results provided by these off-the-shelf tools to ensure it meets their benchmarks. Another main obstacle for adopting the off-the-shelf tools is that the PII types present in specific free-text data outside of a HIPAA covered entity or country (such as Australia) can be different from the 18 HIPAA PII types.

In this study, we designed and implemented a web-based system, deidentifying free text (DEFT), for tagging and substituting designated PII in free-text data in EMRs with human (annotators) in the loop. The system can be readily and easily adopted for the free text deidentification task in secure and internet access–restricted network environments. The main features of the system are listed below:

Web-based: the web-based deidentification system can be easily accessed by multiple end users via a web browser. Compared to desktop-based applications, web-based applications are less hardware dependent and can easily be updated and upgraded.Powered by deep learning models: we used a deep learning NER model to recognize and tag PII entities from free-text data.Preannotation bootstrapped by a semiautomatic learning loop: following the learning loop we have proposed previously [], DEFT preannotates the free text using the ML model that is automatically trained on the previous annotated free text completed by the annotators.Customized PII types: the PII types can vary depending on the free text corpus and different data sharing scenarios. For example, full dates may need to be kept for future research in a secure environment which can only be accessed by ethically approved users [].Suitable for nontechnical end users: experienced health-domain annotators can complete the whole deidentification task using DEFT. No technical or ML knowledge is required.Simple and task-focused web user interface (UI): DEFT makes the annotators focus on the annotation work through a simple and well-designed web UI.Implements autosave: each annotation action, including annotator name, PII entity positions, PII type, and annotation time, is saved automatically.Fewer clicks for annotation: fewer clicks mean quick annotation and less deidentification time.Supports project management and team collaboration: the deidentification task can be managed as a project, which can contain one or more data sets. The team annotators can work on the same data set in the project at the same time to accelerate the annotation process.Implements user access control: only approved users can access specific projects.Uses central data storage: all the data can be stored in 1 central location which the DEFT server can access through filesystem path. This avoids importing or transferring thousands or millions of small free-text files across the network.
MethodsSystem Architecture

DEFT has been designed to be simple and manageable, so end users can easily conduct the deidentification work on their own free-text data that are stored in a secure and internet access–restricted environment. shows the overview of the DEFT system architecture. DEFT communicates with the central data storage by using a filesystem path to retrieve the free-text data and generate deidentified data. The annotators remotely access the DEFT web UI to annotate the free text via their own devices. DEFT is responsible for all the business logic processing. The data manager helps to manage projects and data through the DEFT web UI and direct connection to the data storage, respectively.

We implemented DEFT using Django, a high-level Python web framework. As shown in , DEFT comprises frontend and backend modules. The frontend module was built with HTML, CSS, and jQuery and has 2 different web UIs including an end user UI and admin UI. The former is used by the annotator to tag the PII and the latter is for the data manager to manage users, models, projects, and PII types. The backend consists of 2 components, that is, business logic and database. The first one is the controller that receives the requests from the frontend and invokes the relevant business logic to produce the responses which are sent back to the frontend. The second one has the persistence component built with SQLite and interacts with the business logic component to store all the application data such as project information, and PII positions and types.

Figure 1. Overview of the DEFT system architecture. DEFT: deidentifying free text; UI: user interface. Project-Based Deidentification Work

In the DEFT system, the deidentification work is organized as projects which must have at least 1 data set. Multiple projects can be processed at the same time. A project is created by the users for a deidentification task of a new free text corpus. Multiple data sets can be added into the project according to the users’ requirements. The users can flexibly customize any number of PII types with different display colors for the project and control who has access to the project. The system does not provide default PII types, because these can vary depending on the project. The raw data files are stored in specific data storage outside the DEFT system and the access path is configured in DEFT for the relevant project. Users can import the data files (txt format) into DEFT manually or wait for the system to import them automatically. The importing operation only saves the file names into the DEFT database rather than the file contents so that all the identified data can be safely maintained in specific data storage. shows an example of a deidentification project structure. A project named “Clinical Notes” is created and 2 data sets are added with the names “Discharge Summaries” and “Progress Notes,” respectively. A list of PII types (PERSON, date of birth [DOB], ADDRESS, PHONE, ID) are configured for the project according to the users’ requirements. The user shown in solid black is assigned access to the project as the annotator. All the project configurations can be done via the admin UI by the data managers of the team.

Figure 2. An example of a deidentification project structure and access control. DOB: date of birth; PII: personally identifiable information. Deep Learning NER Model

In the DEFT system, we used the BiLSTM-CRF architecture to train a NER model that can identify word spans related to specific PII types in the working text. Although an ensemble model may perform better than a single BiLSTM-CRF model, we selected the latter for DEFT considering its competitive performance and faster training time, noting that DEFT may be deployed in settings where computing resources are constrained. The Python library FLAIR [] was used to implement the BiLSTM-CRF NER model. The pretrained RoBERTa model [] was selected to generate input representations in our model. shows the model architecture.

Figure 3. Deep learning named entity recognition model architecture. BiLSTM: Bidirectional Long Short-Term Memory; CRF: conditional random field; LSTM: Long Short-Term Memory. Learning Loop

The core of the DEFT system’s workflow is a learning loop, which comprises 3 elements: end user UI, annotated free text, and ML model, as shown in . At the beginning of the deidentification work for a new project, there is no ML model in the system. The deidentification process follows the following six steps:

Feed raw free text into DEFT system. An initial set of raw free-text files are loaded into the end user UI where the annotators begin to manually tag PII in the raw free text. The end user UI provides a simple and task-focused interface to the annotators so that the annotation can be done quickly and easily.Annotate raw free text or correct preannotated free text. shows a screenshot of the end user UI which contains 4 main areas: file list area, PII type area, annotation work area, and PII list area. First, the annotator selects 1 file in the file list area to load its free text content into the annotation work area and clicks on 1 PII type in the PII type area. Second, the annotator reviews the free text and tags the words using double left click or word spans using click drag-and-release that are related to the selected PII type. The PII entities will be surrounded by colored boxes with the PII type underneath the words. Incorrectly tagged PII entities can be removed by single left click on the entity text. Finally, the tagged PII entities are listed in the PII list area with detailed information about start index, end index, PII entity type, PII entity text, and annotator name. The annotator needs to change the “Edit” mode to “Complete” mode for marking the completion of the annotation work for the current working file. The above mentioned PII entity details, except PII entity text, are automatically saved in the DEFT database so that they can be integrated with the raw free-text file contents to generate the annotated free-text files for model training. Furthermore, not having PII entity text stored in the DEFT system protects the identified data.Train ML models. Once the completed amount of the annotated free-text files reaches the preconfigured model retraining threshold which is the number of clinical notes (eg, 200 clinical notes), the system starts the ML model training process, which splits the annotated data into training, validation, and test sets. The model is automatically trained in the backend and loaded into the system after that.Preannotate raw free text. Another small set of free-text files, that has been preannotated by the trained model from Step 3, are assigned to the annotators to add the PIIs which the model does not preannotate or to correct incorrect PIIs that the model has preannotated, via the end user UI. The added or corrected PII entity details are added into the system database.Iterative ML model training. If the number of the new completed files reaches the retraining threshold, it will trigger the ML model to be retrained on all the annotated free-text files. Steps 2, 3, and 4 are iteratively conducted until the performance of the model meets the specified benchmark for the data set. In the DEFT system, we use the strict entity-level microaverage F1-score, which is the primary metric in previous deidentification challenges [,,], to evaluate the performance of the model and deidentification work.Deidentify raw free text. The final model is used to deidentify the remaining free text in the data set, including automatically tagging the PIIs and replacing the PIIs with the special tags. For example, the person names are replaced by “<**PERSON**>” in the deidentified free text.Figure 4. Learning loop of DEFT system. (1) Feed raw free text into DEFT system; (2) annotate raw free text or correct preannotated free text; (3) train ML models; (4) preannotate raw free text; (5) iterative ML model training; and (6) deidentify raw free text. DEFT: deidentifying free text; ML: machine learning; UI: user interface. Figure 5. A demo of the end user UI for the annotators. The patient information is dummy data. PII: personally identifiable information; UI: user interface. DEFT Functionalities

describes the main functionalities of the DEFT system which are grouped according to the 2 DEFT web UIs. The admin UI includes the management components of the key elements such as projects, data sets, PII types, and users. The functionalities of the end user UI are mainly focused on the annotation work, for example, tagging the PII, changing the file status, and preannotation. Currently, only txt format data files are supported in DEFT. There are two export options: (1) export the annotated data files in the XML format which include the raw free text and all the PII entities (Figure S1 in ); the annotated XML data files could be a valuable data source for future deidentification research; and (2) export the deidentified data files in TXT format (Figure S2 in ). The file contents are the same as the original one except that the PII words are replaced by the special tags.

Table 1. The main functionalities of the DEFTa system.UIb and functionalityDescription
Admin

Project managementCreate or delete or modify project information
Configure the project data path
Assign user access right


Data set managementCreate or delete or modify data set information
Import data file names
Export all the annotated data files
Export all the deidentified data files


Data file managementCreate or delete or modify data files
Export single annotated data file
Export single deidentified data file


PIIc type managementCreate or delete or modify PII types


Model managementReview all the trained models of the project


User managementCreate or delete or modify users

End user
Project list

Data set listList the data sets of the selected projects


Annotation dashboardList all the files of the selected data set
Open 1 file
Tag the PII entities in the free text


Auto saveThe add or remove PII actions are saved automatically


PreannotationPretag the possible PII entities by the trained MLd model when the users open a file


Hide completed filesFilter completed files out from the data file list


File status managementSwitch the file status between “Edit” and “Complete”

aDEFT: deidentifying free text.

bUI: user interface.

cPII: personally identifiable information.

dML: machine learning.

Ethics Approval

This study has obtained ethical approval from the South Eastern Sydney Local Health District Human Research Ethics Committee (reference 2019/ETH12625) and the Population Health Services Research Ethics Committee (reference number 2020/ETH01614). The ethics committees allow the data usage for this study without additional consent. The experiments in this study were conducted in the E-Research Institutional Cloud Architecture [], a secure cloud computing infrastructure for individuals working with sensitive data.


ResultsOverview

We selected 2 deidentification systems (MIST and NLM Scrubber) for functionality comparison with DEFT. We chose these systems because MIST has a similar design strategy of “human in the loop” to DEFT, and NLM Scrubber is an accessible open-source off-the-shelf deidentification system. We also considered INCEpTION, ezTag, and Prodigy, which are text annotation systems with interactive learning loops, because they can be used for the deidentification task with extra effort from technical or ML experts. The system features we compared are (1) support for project-based free text file management; (2) support for user access control; (3) support for customized PII (NER) types; (4) support for bulk file import; (5) support for automated preannotation based on a pretrained model; (6) support for auto save of the tagging actions; (7) support for interactive learning loop; (8) support for annotated data export; (9) support for deidentified data export; (10) suitability for nontechnical end users; (11) support for team collaboration; (12) web-based system; (13) central data storage; (14) off-the-shelf; and (15) autotag matches (all occurrences are automatically tagged when annotators tag a PII entity in the whole working free text). shows that DEFT, MIST, and INCEpTION support the user access control functionality, which is important in deploying the systems for team collaboration on the deidentification task. Otherwise, any devices on the same network can access the system via the system’s URL to potentially access the identified data. Although ezTag provides a session-based login, it is not a secure way to control access because anyone who has the session URL can access the project freely. Both MIST and Prodigy require end users to use a command-line interface to configure the projects or data sets and retrain the models in the interactive learning loop. MIST and ezTag need to manually trigger the model retraining and preannotation from the command-line interface and the web UI, respectively. When starting the Prodigy system, a path to the free-text data needs to be configured. Therefore, it partially supports the project management and bulk file import. Different from other systems, MIST must manually save each tagging action by the user clicking on the save button. In both the INCEpTION and ezTag systems, importing the free-text data requires transferring the files from the original data location to the specified location using the web UI, while MIST uses the command-line interface to do the same thing. This could be a bottleneck when importing large volumes of data sets due to network delays. Because NLM Scrubber is an off-the-shelf desktop software, most of the comparison functionalities are not supported by it. All the systems except NLM Scrubber cannot be used “out-of-the-box.” Only the MIST system provides “Autotag matches” functionality.

Fewer mouse clicks make the annotation process more efficient. We counted the mouse clicks of the add-PII and remove-PII operations for all the selected systems except NLM Scrubber, which is pretrained and doesn’t support annotation. The “drag-and-release” action is counted as 1 mouse click. As shown in , DEFT and Prodigy had the fewest clicks for both operations. MIST and INCEpTION needed 3 mouse clicks to annotate a PII entity. However, MIST provides an “Autotag matches” functionality, which can automatically tag all the occurrences of the same word spans of the selected PII entity in the whole working file. ezTag needed the most mouse clicks to remove a tagged PII entity.

We evaluated the performance of our model by comparing it with 5 benchmark models [,,,,] on the 2014 i2b2 data set. The microaverage strict entity–level scores and binary PII token–level scores are reported in . Our model achieved the highest strict entity–level recall and F1-scores at 0.9563 and 0.9627, respectively. Table S1 in lists the hyperparameters used for model training. The microaverage scores by i2b2 category for strict entity matching are shown in Table S2 in .

Figure 6. Comparison of the selected tools and DEFT. DEFT: deidentifying free text; MIST: MITRE Identification Scrubber Toolkit; NER: named entity recognition; NLM: National Library of Medicine; PII: personally identifiable information. Table 2. Mouse click comparison of add-PIIa and remove-PII operations.OperationDEFTbMISTcNLMd ScrubberINCEpTIONezTagProdigyAdd-PIIe1-23N/Af31-21-2Remove-PII12N/A231

aPII: personally identifiable information.

bDEFT: deidentifying free text.

cMIST: MITRE Identification Scrubber Toolkit.

dNLM: National Library of Medicine.

eThe mouse-click number of add-PII operation for DEFT, ezTag, and Prodigy can be 1 or 2, depending on whether the relevant PII type has been selected or not.

fN/A: not applicable.

Table 3. Microaverage scores comparison on the 2014 i2b2 data set.Model (reference)Model architectureStrict entity-levelBinary PIIa token-level

PrecisionRecallF1-scorePrecisionRecallF1-scoreLiu et al []Ensemble model0.96460.9380.95110.9930.97280.9828Kim et al []Ensemble model0.9704b0.94450.95730.99160.98060.9861Tang et al []BiLSTM-CRFc (BERTd)0.95990.95020.9550.99020.9838b0.987bCatelli et al []BiLSTM-CRF0.96530.95060.95790.9910.97550.9832Liu et al []Ensemble model0.9690.95590.96240.9943b0.97860.9864Our modelBiLSTM-CRF (RoBERTa)0.96920.9563b0.9627b0.98920.97660.9829

aPII: personally identifiable information.

bThe best result for each metric.

cBiLSTM-CRF: Bidirectional Long Short-Term Memory-Conditional Random Field.

dBERT: Bidirectional Encoder Representations from Transformers.

Use Case

To evaluate its efficiency and performance, we used DEFT to deidentify clinical notes for the project CardiacAI []. CardiacAI is a prospective data repository that collects EMR data for patients with cardiac issues who are admitted to a group of participating hospitals in New South Wales, Australia. The EMR data are linked to state-wide hospital and emergency department visits, the state’s death registry, as well as mobile health remote monitoring data. The aim of this data collection is to enable collaborative research, facilitate the use and rapid translation of state-of-the-art tools and technologies, and ultimately drive improvement in patient care and outcomes. The CardiacAI data repository holds data from January 1, 2017 onwards. The repository currently holds EMRs of 44,201 patients with 61,721 individual hospitalizations to a cardiac, cardiothoracic, or vascular surgical specialty. There are 2,087,737 clinical documents including discharge summaries, progress notes, and other semistructured forms with an average of 34 (SD 44) documents per hospitalization. Discharge summaries have undergone some rule-based deidentification where the document header, which contains structured identifier fields, has been removed. However, embedded identifiers remain within the unstructured text.

The DEFT system was deployed on a workspace with 32 GB memory and 8 vCPU in the E-Research Institutional Cloud Architecture [] environment. A project named “CardiacAI,” and 2 data sets named “Discharge Summary” and “Progress Notes” were created in the system. According to the data content and research requirements, 5 PII types (PERSON, identification number [IDN], DOB, PHONE, ADDRESS) were defined for the project. At the beginning, we randomly selected 400 clinical notes including 200 discharge summaries and 200 progress notes to be manually annotated using the DEFT end user UI by 1 annotator who is an experienced health-domain data analyst. An intermediate NER model was automatically trained by DEFT using the 400 clinical notes with a split ratio of 0.8, 0.1, and 0.1 for training, validation, and test, respectively. shows the model performance on the test set. Nearly 81% PII entities were person type and only 2 were address type. The model achieved a micro-F1-score of 0.9432.

To compare annotation time with and without auto-preannotation, we randomly selected another 200 clinical notes that consisted of 100 discharge summaries and 100 progress notes and split them into 10 rounds, of which each round contained 10 discharge summaries and 10 progress notes. We also considered the text length when splitting the data to make sure that the compared rounds had similar total words as other rounds. The annotator manually tagged the PIIs in the first 5 rounds without auto-preannotation and continued to complete the last 5 rounds with auto-preannotation enabled. Each round was completed in a single session and the annotation time was recorded. As shown in , the total time of 5 rounds with-preannotation was about 58 minutes, which was approximately 40% less than the total time of 5 rounds without preannotation (~98 minutes). Moreover, the annotation of the last 5 rounds (with preannotation) was much quicker than the one of the first 5 rounds, despite there being more PII entities in the last 5 rounds. Through the 5-round comparison, the preannotation by the trained model sped up the annotation work by 43% (95% CI 33.9-52.1; P<.001).

To compare manual annotation performance without and with preannotation, a gold standard corpus was created by having 2 annotators independently review and correct the annotations of the 200 clinical notes. All disagreements were resolved through a consensus meeting between the 2 annotators. The performance results are shown in . Manual annotation with preannotation by the pretrained model achieved higher overall microaverage recall and F1-scores by 0.013 and 0.0053, respectively. Although there were a few false negative and positive errors on the PHONE and ADDRESS types, manual annotation with preannotation demonstrated higher accuracy on the PERSON type (the majority type within the data set) than manual annotation without preannotation.

We reconfigured the retraining threshold to 200 to trigger the system to retrain the model using the whole 600 clinical notes. The results are shown in . Although the test sets were different between this training and the previous training, we do observe that the overall F1-score increased by 0.0075, compared to the previous intermediate model trained on 400 clinical notes. The new model improved the F1-score of DOB and PHONE PII types by large margins of 0.1367 and 0.1298, respectively. Moreover, the recall, which is the most important metric for the deidentification tasks, was improved for each PII type, except ADDRESS.

Table 4. The NERa model performance on the test set from the corpus with 400 clinical notes.PIIb typePrecisionRecallmicro-F1-scorePII entity numberPERSON0.96570.95050.958444IDNc10.96490.982157DOBd0.8750.80770.8426PHONE0.72730.72730.727322ADDRESS0.33330.50.42Overall0.95190.93470.9432551

aNER: named entity recognition.

bPII: personally identifiable information.

cIDN: identification number.

dDOB: date of birth.

Table 5. Annotation time comparison.Without preannotationWith preannotationRoundClinical note numberAnnotation time (min)PIIa entity numberWord numberRoundClinical note numberAnnotation time (min)PII entity numberWord number12013:038126871a206:28113277422015:5611740892a207:55184415332019:0815551983a209:31179536942020:2413971334a2012:13185740452029:0518911,5035a2022:0830012,304Total10097:3668130,610Total10058:1596132,004

aPII: personally identifiable information.

Table 6. Performance comparison without and with preannotation.Without preannotationWith preannotationPIIa TypePrecisionRecallmicro-F1-scorePII TypePrecisionRecallmicro-F1-scorePERSON0.99800.96750.9825PERSON0.99830.98510.9917IDNb111IDN111DOBc111DOB111PHONE111PHONE0.96670.96670.9667ADDRESS111ADDRESS0.857110.9231Overall0.9985d0.97560.9869Overall0.99580.9886d0.9922d

aPII: personally identifiable information.

bIDN: identification number.

cDOB: date of birth.

dThe best result for each metric between performance without and that with preannotation.

Table 7. The NERa model performance on the test set from the corpus with 600 clinical notes.PIIb typePrecisionRecallmicro-F1-scorePII entity numberPERSON0.94230.95830.9502528IDNc10.97620.98884DOBd0.98440.96920.976765PHONE0.85710.85710.857135ADDRESS10.50.66676Overall0.94870.95260.9507718

aNER: named entity recognition.

bPII: personally identifiable information.

cIDN: identification number.

dDOB: date of birth.

In order to evaluate the deidentification efficiency, we used the trained model to deidentify CardiacAI clinical notes that summarized the encounters of patients who have had heart failure, containing on average about 170 words. The whole process took about 95 hours to export 280,785 deidentified clinical notes (about 0.82 seconds per clinical note). There are on average about 5 PII entities per clinical note.


DiscussionPrincipal Findings

We have presented the design and implementation of DEFT, a simple web-based deidentification system with interactive learning loop. DEFT provides a task-focused web UI to end users so that annotation of PII can be done quickly and easily. In addition, an admin web UI is provided to manage the projects, users, and team collaboration on the deidentification task. The DEFT system can be deployed on a central server to provide deidentification services to multiple teams within the same network environment at the same time through user access control functionality.

The results of the functionality comparison showed that our DEFT system is better than the selected comparison systems on the deidentification task, in terms of the ability to customize the task, ease and security of management of users and files, automation of the interactive learning loop, and central data storage.

DEFT can easily customize PII types at the project level using the admin web UI. It provides flexibility for determining the level of deidentification according to risk assessment and analysis requirements of the project. The selected systems which we compared with DEFT have similar customization functionality, with the exception of NLM Scrubber. However, only INCEpTION and ezTag provide a web UI for this function. MIST and Prodigy uses either a configuration file or command line to predefine PII types. NLM Scrubber uses fixed HIPAA-compliant PII types, which restrict the adoption of the system outside of a HIPAA-covered country or organization [].

DEFT has a 2-level file management structure (ie, project and data set) which allows the users to organize the data more flexibly, compared to the 1-level structure (ie, project) of the other systems. For example, the deidentification project may have many different types of clinical notes. Users can create different data sets for each type of clinical note. Like INCEpTION, DEFT can easily manage user credentials and assign user access to the project level through the admin web UI. However, in the MIST system, user management and access control need to be done from a command-line interface, and all the users share 1 project access key. When starting up the Prodigy system, the users need to configure an input data source. Therefore, the different project users cannot work on the system at the same time. Moreover, there is no user access control functionality in Prodigy. ezTag uses a session-based login to generate a unique URL for each user to manage the user access. However, the session can be accessed by anyone with the unique URL, so it is not suitable for annotating sensitive information in EMR free-text data.

DEFT supports a fully automated interactive learning loop. The model training process is triggered when the newly annotated files reach the preset retraining threshold. The retrained model is automatically loaded for the relevant project. When the annotator opens a nonannotated file, the model automatically preannotates it to provide suggestions for the annotator. In contrast, MIST, Prodigy, and ezTag need human manual operations during the learning loop. MIST retrains a model and preannotates the free-text files via a command-line interface. Moreover, the system needs to be restarted to load the retrained model. Similarly, Prodigy also needs different command lines to perform retraining and preannotation operations. Users of ezTag need to click the “Auto Annotation” and “Train” buttons to trigger the relevant tasks. The real-world use case study we present demonstrated that DEFT’s annotation speed can be increased by 43% with the automated preannotation in the learning loop. Furthermore, the DEFT system can use only 600 annotated clinical notes to achieve good performance with an F1-score of 0.9507, which is greater than 0.95, the rule-of-thumb benchmark for evaluating the reliability of a deidentification system [,]. The low scores for ADDRESS PII entity as shown in and may be caused by the lack of training samples, as there were only 44 and 55 ADDRESS entities in the training sets of 400 and 600 clinical notes, respectively. The model performance will continually improve as the amount of annotated free-text data increases. We also evaluated the annotation time and accuracy of manual annotation without and with preannotation. Using preannotation during the manual annotation process resulted in a time savings of 43% and a micro-F1-score improvement of 0.005 compared to manual annotation performed without preannotation.

The design of DEFT allows the raw free-text data to be stored in a central data storage location which is under the control of the researchers or data custodians. The system-related data, such as project and data set information, users, PII index and type details, and the free-text file names, are saved in DEFT’s database. All the raw data and exported data are managed centrally by the data manager. This not only protects the identified data from unauthorized access, but also avoids transferring large volumes of data through the network. A data source path is required when starting up the Prodigy system and therefore data can be stored at a central location. However, none of the other systems support this.

In the traditional annotation process, more than 2 annotators are needed to generate a gold standard data set. The interannotator agreement is measured to evaluate the annotation reliability between different annotators. Although a single annotator was used in this case study, the interactive learning loop enabled the model and the annotator to suggest and correct each other to continually improve the model performance. However, DEFT can easily perform the traditional annotation process. For example, the data manager creates a data set named “gold standard” in the project. In total, 2 annotators are required to tag the PIIs from all the files in the data set. Another annotator as an audit changes the file status to completion after reviewing the annotation results. The NER model will be trained on the gold-standard data set.

Limitations

In DEFT, we preannotate all the PII from the free text using the trained NER model. The annotators need to review the whole text to correct the preannotations. In contrast, INCEpTION and Prodigy combine preannotation with active learning which queries the user for feedback (accept, reject, or skip) on the annotation suggestions that are most informative to the model []. Active learning can achieve rapid and accurate annotation with less annotation time []. Integration of active learning within the learning loop could enhance the next version of DEFT. Moreover, we will explore fine-tuning pretrained transformer language models for NER in DEFT to improve the accuracy of deidentification. Furthermore, implementing “Autotag matches” in DEFT could have the potential to further decrease manual effort involved in the annotation process. Another limitation of DEFT is that it relies solely on deep learning models. The system’s performance could potentially be enhanced by incorporating a hybrid method that combines knowledge-based methods (such as predefined regular expressions or knowledge dictionaries) and deep learning. The evaluation results in this study are derived from clinical notes which were annotated by a single annotator. This may introduce potential reliability issues when assessing the deidentification performance. To address this, an annotation process, which is manually annotated by 3 annotators (2 annotators and 1 adjudicator), is needed to be introduced in the next version of DEFT.

Conclusions

DEFT is a web-based deidentification system, which is designed for health domain researchers and data custodians to easily deidentify free-text data in EMRs with the support of an interactive learning loop. End users can perform all the operations through a well-designed web UI. DEFT has many good features to help manage and organize the deidentification tasks. In particular, the central data storage feature ensures that the identified data are protected properly in a central location without transfer through the network. The real-world use case demonstrated that DEFT can speed up the annotation process and quickly complete the deidentification work for large volumes of data with high reliability. The source code of the DEFT system is available at GitHub [].

LL designed and i

留言 (0)

沒有登入
gif