Randomized Clinical Trials (RCTs) are considered the gold standard for evidence in the biomedical field [1]. However, the way in which studies are conducted and results are published can sometimes introduce bias [2]. The COVID-19 pandemic, which triggered a surge in biomedical research globally, exemplifies this issue. With an overwhelming number of publications emerging in a short time, the urgency to deliver rapid findings increased the risk of inaccuracies, misconduct, or retractions [3].
The Risk of Bias (RoB) is a critical framework for evaluating the reliability of clinical trials by identifying systematic errors that may occur during the planning, conduct, or analysis phases [4]. Categorizing RoB into specific domains helps systematically assess the quality of clinical studies. For example, selection bias may result from deviations in patient selection methods, while performance bias may arise when participants or researchers are aware of the treatment being administered. Similarly, detection bias occurs when the outcome assessor is aware of the treatment, and this may affect their judgment [5]. These categories provide structured insights into potential sources of bias.
Over the years, several instruments have been proposed to identify and evaluate RoB in clinical trials [6], [7], [8], [9], [10], [11]. Although these instruments differ methodologically, they share a focus on various types (or domains) of bias. For example, the Cochrane RoB Tool [10] provides guidelines based on signaling questions (i.e., questions answered with yes, no, or not informed) that a human reviewer answers and then supports by manually extracting evidence sentences. Based on these answers, the RoB for the entire study is assessed.
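To make the structure of such an assessment concrete, the sketch below represents one answered signaling question as a small record. The field and class names are illustrative assumptions for this example, not terminology taken from the Cochrane tool itself.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical schema for a Cochrane-style signaling question record.
@dataclass
class SignalingQuestion:
    domain: str          # e.g., "selection bias", "performance bias"
    question: str        # the signaling question posed to the reviewer
    answer: str          # "yes", "no", or "not informed"
    evidence: List[str]  # sentences manually extracted from the trial report

# Example: one answered question for the random sequence generation domain.
q = SignalingQuestion(
    domain="selection bias",
    question="Was the allocation sequence adequately generated?",
    answer="yes",
    evidence=["Participants were randomized using a computer-generated sequence."],
)
```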
Recent research has explored automating RoB assessment using machine learning approaches such as Support Vector Machines (SVM) [12], [13], [14], [15], Convolutional Neural Networks (CNNs) [16], Logistic Regression (LR) [15], [17], [18] and BERT [15]. Large Language Models (LLMs) have recently been applied to RoB assessment in RCTs [19]. However, while these studies created (or used) RoB datasets for training and evaluation, none of these datasets have been made publicly available, making reproducibility and comparison of methods challenging.
This work introduces two main contributions to automating RoB assessment. To address the lack of publicly available datasets, we created the RoBIn dataset, which is, to the best of our knowledge, the first public dataset focused on RoB assessment. Constructed using distant supervision techniques [20], [21], the dataset includes tuples of signaling questions and their corresponding answers, supported by evidence. The label of each tuple indicates whether the RoB is low, high, or unclear.
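The following minimal sketch illustrates the distant-supervision idea assumed here: evidence quoted in an existing review is aligned to sentences of the trial report to produce weakly labelled (question, answer, evidence, label) tuples. The matching function and threshold are illustrative choices, not the exact construction procedure of the RoBIn dataset.

```python
from difflib import SequenceMatcher
from typing import List, Tuple

def align_evidence(review_quote: str, trial_sentences: List[str],
                   threshold: float = 0.8) -> List[Tuple[str, float]]:
    """Return trial sentences that closely match a quote used as evidence in a review."""
    matches = []
    for sentence in trial_sentences:
        ratio = SequenceMatcher(None, review_quote.lower(), sentence.lower()).ratio()
        if ratio >= threshold:
            matches.append((sentence, ratio))
    return sorted(matches, key=lambda pair: pair[1], reverse=True)

# A matched sentence becomes the evidence span of a tuple such as
# (signaling_question, answer, evidence_sentence, rob_label),
# with rob_label in {"low", "high", "unclear"}.
```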
The second main contribution of this work is the proposal of two novel Transformer-based models [22] to automate RoB assessment. The key idea behind our models is to address the problem as a pipeline consisting of a machine reading comprehension (MRC) task followed by a classification task. The two models that implement this pipeline are RoBInExt, an extractive model, and RoBInGen, a generative model. The extractive model uses the Transformer encoder architecture to identify and extract relevant evidence spans. The generative model leverages both encoder and decoder components to generate textual outputs supporting RoB classification.
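The sketch below illustrates the two-stage pipeline under explicit assumptions: a generic extractive question-answering model stands in for RoBInExt, and a generic text classifier stands in for the RoB classification step. Neither checkpoint is the authors' model, and the report text is a made-up fragment.

```python
from transformers import pipeline

# Stand-in models for illustration only; RoBInExt/RoBInGen are not public checkpoints here.
qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")
clf = pipeline("text-classification", model="distilbert-base-uncased-finetuned-sst-2-english")

report = "Participants were randomized using a computer-generated sequence."
question = "Was the allocation sequence adequately generated?"

# Stage 1 (MRC): extract the span most likely to answer the signaling question.
evidence = qa(question=question, context=report)["answer"]

# Stage 2: classify the extracted evidence; a real system would map the output
# to {low, high, unclear} risk of bias instead of the generic labels used here.
label = clf(evidence)[0]
print(evidence, label)
```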
Both models were evaluated through experiments designed to assess their performance on two tasks: (a) identifying sentences that provide supporting evidence for the RoB assessment (the MRC task) and (b) performing the RoB classification itself. RoBInExt and RoBInGen performed strongly on the MRC task across all bias types and achieved competitive results for RoB classification, reaching the best overall performance with an AUROC of 0.83 and outperforming classical methods. In some cases, they also outperformed LLMs in terms of F1-score, precision, and recall.
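For reference, the snippet below shows how the reported metrics can be computed from model scores; the labels and probabilities are toy placeholders, not outputs of RoBInExt or RoBInGen.

```python
from sklearn.metrics import roc_auc_score, precision_recall_fscore_support

y_true = [1, 0, 1, 1, 0, 0, 1, 0]                    # 1 = low risk of bias, 0 = high/unclear
y_score = [0.9, 0.2, 0.7, 0.6, 0.4, 0.1, 0.8, 0.3]   # predicted probability of the positive class
y_pred = [int(s >= 0.5) for s in y_score]

auroc = roc_auc_score(y_true, y_score)
precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="binary")
print(f"AUROC={auroc:.2f} P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```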
The main contributions of this article are summarized in the statement of significance table below.