Multi-task transfer learning for the prediction of entity modifiers in clinical text: application to opioid use disorder case detection

Data

We utilized two data sets to evaluate modifier detection in clinical text: the published ShARe corpus from SemEval-2015 Task 14 [13] and an unpublished corpus from the University of Alabama at Birmingham annotated for OUD-related entities and their modifiers. An overview of both corpora, including the number of documents, entities, modifier types, and counts, is shown in Table 1.

ShARe data set

Task 14 of SemEval-2015 [13] provided a data set for two tasks: clinical disorder named entity recognition and template slot filling. It consists of 531 de-identified clinical notes. In this work, we focus only on the template slot filling task, which requires the identification of negation, severity, course, subject, uncertainty, conditional, and generic modifiers of each clinical disorder entity. The assigned training and development sets are combined to build our final model, and results are reported on the test set.

OUD data set

After training by WC (the OUD research co-coordinator), annotators created a corpus consisting of 3,295 clinical notes from 59 patients (23 controls) drawn from physician case referrals between 2016 and 2021. They annotated 25,478 OUD entity mentions and modifiers using the BRAT 1.3 software. Annotators marked entities for negation and subject, and assigned a DocTime value of before, overlaps, or after. Additionally, they annotated mentions of substance and opioid use, OUD, and Substance Use Disorder (SUD) as illicit. To the best of our knowledge, illicitDrugUse is a modifier unique to our data set. We split the data set into 80% for training, 10% for development, and 10% for testing based on entities, not documents. The training and development sets are combined to build our final model, and results are reported on the test set. We plan to de-identify and release this data set as part of a future shared task.
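For concreteness, a minimal sketch of such an entity-level 80/10/10 split; the random seed and the record format are illustrative assumptions, not details from the paper:

```python
import random

def split_entities(entities, seed=13):
    """entities: list of annotated mention records (format assumed)."""
    shuffled = list(entities)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    # Split over entity mentions rather than over documents.
    train = shuffled[: int(0.8 * n)]
    dev = shuffled[int(0.8 * n): int(0.9 * n)]
    test = shuffled[int(0.9 * n):]
    return train, dev, test
```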

Table 1 Statistics of the ShARe and OUD corpora

Architecture

We created a single-task and a multi-task architecture, as shown in Fig. 1. The single-task architecture uses a separate classification head for each modifier, trained independently, whereas the multi-task architecture has one head per clinical modifier, trained jointly. Both architectures use BioBERT [26] as the base model. We chose BioBERT over other variants such as BERT [23], ClinicalBERT [27], and PubMedBERT [28] because it performed better at detecting modifiers in initial experiments. No additional pre-training is performed using text from either the ShARe corpus or the OUD corpus.

Fig. 1

Overview of our modifier prediction model. The multi-task (MT) architecture contains a classification head for each distinct modifier type. The single-task (ST) architecture has a single classification head and is trained separately for each modifier

Models

We name our single-task fine-tuned BioBERT model ST and our multi-task fine-tuned model MT. For our transfer learning experiments, model names carry a dash-separated suffix indicating the data set(s) they were fine-tuned on; for example, SHR and OUD denote fine-tuning on the ShARe corpus and the OUD corpus, respectively. Fine-tuning stages are ordered from left (most distant) to right (most recent) based on the order in which they occurred; see Fig. 2 for reference. MT-SHR-OUD and MT-OUD-SHR have 5 and 7 classification heads, respectively. We also perform an experiment that combines the two data sets, called MT-BOTH, with a total of 9 distinct classification heads representing all clinical modifiers from both data sets.

Fig. 2

Overview of the transfer learning process. Thick arrows indicate aggregation of training data. Thin arrows indicate the training data used for supervised fine-tuning. Medium-width grey arrows indicate the use of a previously fine-tuned model. Model names are prefixed with the architecture type (ST or MT) and suffixed with the most recent training data set the model has been fine-tuned on

For all models, we use the final hidden vector corresponding to the [CLS] token (\(\mathbf{h}_{[CLS]} \in \mathbb{R}^H\)) generated by BioBERT from the input \(\widehat{x}\) as the common feature vector, which is passed to each modifier classification head. Each head is a linear layer with learnable parameters \(\mathbf{W}_i \in \mathbb{R}^{K_i \times H}\) and \(\mathbf{b}_i \in \mathbb{R}^{K_i}\), where \(K_i\) is the number of classes of modifier \(m_i\). Formally, the probability distribution over the values of a modifier type is:

$$P\left(\hat{y}_{m_i} \mid \widehat{x}\right) = \mathrm{softmax}\left(\mathbf{W}_i \mathbf{h}_{[CLS]} + \mathbf{b}_i\right)$$

(1)

where \(y_{m_i}\) is the label of modifier \(m_i\). We train the model using the cross-entropy loss for each classifier:

$$L_i = -\sum\limits_{j=1}^{k} P\left(y_{m_i}^{(j)}\right) \log P\left(\hat{y}_{m_i}^{(j)} \mid \widehat{x}^{(j)}\right)$$

(2)

where k is the length of the batch. The final loss is the average over all classifiers:

$$L = \frac{1}{n} \sum\limits_{i=1}^{n} L_i$$

(3)

where n represents the number of modifier heads in the model. Additionally, we experiment with focal loss [29], a loss function commonly used in deep learning for tasks involving imbalanced data sets, as an alternative to cross-entropy.
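To make the architecture concrete, below is a minimal PyTorch sketch of the MT model and its losses (Eqs. 1-3), assuming a Hugging Face BioBERT checkpoint; the class name, label format, and focal-loss settings are illustrative assumptions, not the authors' released code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel

class MultiTaskModifierModel(nn.Module):
    def __init__(self, classes_per_modifier, name="dmis-lab/biobert-base-cased-v1.1"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(name)
        hidden = self.encoder.config.hidden_size  # H
        # One linear classification head per modifier type m_i (Eq. 1).
        self.heads = nn.ModuleDict(
            {m: nn.Linear(hidden, k) for m, k in classes_per_modifier.items()})

    def forward(self, input_ids, attention_mask, labels=None):
        # h_[CLS]: final hidden vector of the [CLS] token, shared by all heads.
        h_cls = self.encoder(input_ids=input_ids,
                             attention_mask=attention_mask).last_hidden_state[:, 0]
        logits = {m: head(h_cls) for m, head in self.heads.items()}
        if labels is None:
            return logits
        # Cross-entropy per head (Eq. 2), averaged over the n heads (Eq. 3);
        # labels is assumed to be a dict of class-index tensors keyed by modifier.
        per_head = [F.cross_entropy(logits[m], labels[m]) for m in self.heads]
        return logits, torch.stack(per_head).mean()

def focal_loss(logits, targets, gamma=2.0):
    # Drop-in alternative to F.cross_entropy (focal loss [29]); gamma=2.0 is
    # a common default, not a value reported in the paper.
    log_pt = F.log_softmax(logits, dim=-1).gather(1, targets[:, None]).squeeze(1)
    return (-(1.0 - log_pt.exp()) ** gamma * log_pt).mean()
```

In the ST ablation, the same architecture would carry a single head and be fine-tuned once per modifier.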

Feature extraction

We adopted the question-answering input format described in the original BERT paper [23] to fine-tune BioBERT and adapt it to modifier prediction. As illustrated in Fig. 1, the model receives two sequences as input. The first (left) sequence is a disorder mention with its surrounding context, and the second (right) sequence is the string of the entity itself. We chose the context to be 200 characters before the mention and 50 characters after it, an empirically driven hyper-parameter choice that achieved better performance than word-based and sentence-based contexts. This choice was made after experimenting with combinations of 50, 100, 150, and 200 character offsets before and after the disorder mention. We did not experiment with sentence-boundary offsets because sentence boundaries are not well formed in clinical text [30]; for instance, some clinical notes contain paragraphs that use commas instead of periods. Additionally, OUD modifier annotations were not restricted to sentence boundaries. For discontiguous entities, only the strings that represent the entity are used. An example of what is passed to the model is: [CLS] The patent [sic] was found to be in fulminant liver failure. There she was having hallucinations, suicidal ideations and ... [SEP] hallucinations. A second example would pair the same first sequence with suicidal ideations as the second sequence.

The goal of this design is to direct the model's attention to the desired entity in order to extract its modifiers. Specifically, the entity is contextualized by the surrounding words in the first sequence, which generally include the modifiers. In addition, the second sequence redirects the attention of the aggregate sequence representation ([CLS]) to the entity under consideration (hallucinations in the first example above).
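As an illustration, a sketch of how such an input pair might be constructed and tokenized; the helper and its character-offset arguments are our assumptions, while the 200/50 character window and the 144-token limit come from the paper:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-base-cased-v1.1")

def build_example(note_text, start, end):
    """start/end: character offsets of the entity mention within the note."""
    context = note_text[max(0, start - 200): end + 50]   # first sequence
    entity = note_text[start:end]                        # second sequence
    # Yields [CLS] context [SEP] entity [SEP], truncated to 144 tokens.
    return tokenizer(context, entity, truncation=True, max_length=144)
```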

Model training

For training, we follow standard procedures and use the curated training data from both sources to develop our models. Hyper-parameters are optimized using the designated development sets. We trained our models for up to 10 epochs with an empirically derived early stop; details are in Fig. 1 of the Supplementary Materials. The maximum sequence length is 144, the learning rate is 2e-5, the weight decay is 1e-2, and the batch size is 64. AdamW is used as our optimizer. Similar to Xu et al. [17], the training and development sets are combined to build our final models, and we report results on the respective test sets. We used a single Tesla P100 GPU with 16 GB of memory to run all experiments. The model will be made available upon acceptance of the publication.
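A sketch of the corresponding optimizer and training loop follows; train_one_epoch and evaluate are hypothetical helpers, and the patience-based criterion is an assumption, since the paper states only that the early stop was empirically derived:

```python
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=2e-5, weight_decay=1e-2)

best_dev, patience, bad_epochs = 0.0, 2, 0
for epoch in range(10):
    train_one_epoch(model, train_loader, optimizer)  # hypothetical helper; batch size 64
    dev_score = evaluate(model, dev_loader)          # hypothetical helper
    if dev_score > best_dev:
        best_dev, bad_epochs = dev_score, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break  # early stop (criterion assumed for illustration)
```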

Experiments

We conducted the following experiments:

To assess the performance of transfer learning and multi-task training for clinical modifiers, we evaluate MT on the OUD corpus and the ShARe corpus. We compare our results to previously reported results for the ShARe corpus and to a generalized clinical modifier model that combines all training examples from both corpora.

To evaluate the feasibility of domain adaptation for clinical modifiers when only a portion of the clinical modifiers match between the target and source domains, we perform bidirectional fine-tuning between the OUD and ShARe corpora. We fine-tune on the source domain and then perform an additional round of fine-tuning on the target domain, creating two models, MT-SHR-OUD and MT-OUD-SHR, which are fine-tuned first on the ShARe corpus or the OUD corpus, respectively; a minimal sketch of this two-stage process is shown after this list. We include a classification head for each modifier from both data sets.

We performed two ablation studies on both the OUD and ShARe corpora: (1) removing the disorder mention after the [SEP] token, and (2) replacing the MT model's heads with a single-headed (ST) model, where a separate model is trained for each modifier.
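As referenced above, a minimal sketch of the two-stage fine-tuning, reusing the hypothetical MultiTaskModifierModel from the earlier sketch; train is a hypothetical helper wrapping the training loop, and the head configuration is assumed:

```python
# Example: MT-SHR-OUD (source = ShARe, target = OUD).
model = MultiTaskModifierModel(classes_per_modifier=all_modifier_classes)
train(model, share_train_loader)  # stage 1: fine-tune on the source corpus
train(model, oud_train_loader)    # stage 2: continue fine-tuning on the target
```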

Evaluation

To evaluate a system for identifying rare values of different modifiers, the original challenge for the ShARe corpus used weighted accuracy, which accounts for the prevalence of the different values of each modifier. Specifically, for each modifier \(m_i\) the weights are calculated as follows:

$$\mathrm{weight}\left(m_i^k\right) = 1 - \mathrm{prevalence}\left(m_i^k\right)$$

where k represents the different classes of the modifier \(m_i\), as described in the task description paper [13]. We used the evaluation script from the challenge organizers to compare with the previous state-of-the-art system [17]. We also used standard unweighted accuracy and micro-averaged F1 to compare against later works (Table 2). For the OUD data set, we used standard unweighted accuracy, macro-averaged F1 (including the null class), and micro-averaged F1 to enable future comparisons.
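For reference, one plausible re-implementation of this weighted accuracy; the per-example weighting and normalization reflect our reading of the published formula, not the organizers' script:

```python
from collections import Counter

def weighted_accuracy(gold, pred):
    """gold/pred: lists of values of one modifier type over the test set."""
    prevalence = Counter(gold)
    # weight(class) = 1 - prevalence(class), so rare values count more.
    weight = {c: 1 - prevalence[c] / len(gold) for c in prevalence}
    scored = sum(weight[g] for g, p in zip(gold, pred) if g == p)
    total = sum(weight[g] for g in gold)
    return scored / total
```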

Chi-square test

We perform a Chi-square test [31] to compute the statistical difference between our model's results and previous results by comparing the numbers of correct and incorrect predictions. When these numbers were not available, we reconstructed them from the reported accuracy and the total number of examples in the test set.
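A sketch of this comparison using SciPy; the helper name and the rounding used to reconstruct counts from accuracy are our assumptions:

```python
from scipy.stats import chi2_contingency

def compare_systems(acc_ours, n_ours, acc_prev, n_prev):
    # Reconstruct correct/incorrect counts from accuracy and test-set size.
    correct_a, correct_b = round(acc_ours * n_ours), round(acc_prev * n_prev)
    table = [[correct_a, n_ours - correct_a],   # our model
             [correct_b, n_prev - correct_b]]   # previous system
    chi2, p_value, dof, expected = chi2_contingency(table)
    return chi2, p_value
```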
