Research studies often involve comparing groups to understand the relationship between an exposure or intervention and an outcome of interest. The quality of evidence from such studies varies with study design: meta-analyses, systematic reviews, and randomized controlled trials (RCTs) sit at the top of the evidence pyramid, with observational studies lower down. However, healthcare decisions are often based on observational studies, since these far outnumber RCTs and may be the only feasible design for many research questions. Hence, it is important to understand the strengths and weaknesses of such studies.
In RCTs, randomization ensures that the treatment groups are (on average) similar and differ only in the intervention (i.e. exposure) received. Thus, any differences in outcomes between the groups can be expected to be causally related to the exposure of interest.1,2 Further, this design ensures that exposure precedes outcome—an essential criterion for causation. Of course, the outcomes could differ across groups purely by chance. However, this risk can be assessed using statistical tests of hypotheses, and if it is below a certain threshold (usually 5% or 0.05), we conclude that the difference is likely to be ‘real’.
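As a minimal, hypothetical illustration of such a test (the counts below are invented, and the numpy and scipy packages are used purely for illustration), a comparison of outcome rates between two trial arms might look like this:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 table: rows are trial arms, columns are outcome / no outcome
table = np.array([
    [30, 170],   # intervention arm: 30 of 200 developed the outcome
    [50, 150],   # control arm: 50 of 200 developed the outcome
])

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"p-value = {p_value:.3f}")

# At the conventional 5% threshold, p < 0.05 suggests the difference is
# unlikely to have arisen by chance alone
print("Unlikely to be chance alone" if p_value < 0.05 else "Could well be chance")
```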
Observational studies are of several types. In decreasing order of their perceived strength of evidence, these are: (i) cohort studies, where groups with distinct exposures are followed longitudinally and their outcomes compared; (ii) case–control studies, where those who have developed a particular outcome (cases) and those who have remained free of it (controls) are compared with respect to the exposure of interest; (iii) cross-sectional studies, where a snapshot of a group is taken to record both exposure and outcome; and (iv) ecological studies, where aggregated data from several populations are compared. In contrast to RCTs, associations identified even in analytical observational studies are not necessarily causal and could arise, besides chance, from one or more of the following: confounding, bias, or reverse causation.3,4
Confounding arises when the compared groups are dissimilar in characteristics other than the factors being studied.5 For instance, in a cohort study looking at the effect of a nutritional supplement on a disease, persons taking the supplement and those not taking it may differ in numerous ways, such as economic status, nutrition, and health-related behaviours (e.g. supplement users may be richer, better nourished, more health conscious, and less likely to smoke). Thus, any inter-group differences in outcomes (e.g. incidence of a disease, duration of survival) may be due either to the exposure of interest (the dietary supplement), to these other differences, or to a combination thereof.
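A small simulation (with entirely hypothetical probabilities) makes this concrete: if economic status influences both supplement use and disease risk, supplement users appear protected even when the supplement has no effect at all, and the apparent benefit vanishes once the comparison is made within levels of the confounder:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical confounder: higher economic status
rich = rng.random(n) < 0.5

# Exposure: richer people are more likely to take the supplement
takes_supplement = rng.random(n) < np.where(rich, 0.7, 0.2)

# Outcome: disease risk depends only on economic status, not on the supplement
disease = rng.random(n) < np.where(rich, 0.05, 0.15)

def risk(mask):
    return disease[mask].mean()

# Crude comparison: supplement users appear protected
print("Crude risk (supplement vs none):",
      round(risk(takes_supplement), 3), "vs", round(risk(~takes_supplement), 3))

# Within each stratum of the confounder, the 'effect' disappears
for label, stratum in [("richer", rich), ("poorer", ~rich)]:
    print(label,
          round(risk(stratum & takes_supplement), 3), "vs",
          round(risk(stratum & ~takes_supplement), 3))
```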
Bias occurs when the study subjects are not representative of the entire population.5 In the example above, this could occur if the study enrols only well-nourished individuals whose diet is already rich in the nutrient of interest. A negative result would then not necessarily negate a beneficial effect of supplementation in a population with a lower overall level of dietary intake. Bias can also occur if either exposure or outcome is not measured accurately. For instance, if exposure is determined by self-report, some participants may report taking the supplement when they do not actually take it regularly, or vice versa, distorting the study's conclusions.
Reverse causation refers to an association where the presumed exposure is not the cause but a consequence of the presumed outcome. For instance, a recent observational study looked at whether regular physical activity could prevent cognitive decline and found that physically active individuals had better electrophysiological brain activity. However, this finding could also arise if the relationship is reversed, i.e. better electrophysiological activity permits a more physically active lifestyle.6
Thus, findings of observational studies are not infrequently misleading. In fact, the biomedical literature is replete with examples where observational studies suggested that a particular intervention was associated with a positive health outcome, leading to its widespread adoption. However, when an RCT was subsequently conducted to study the effect of the intervention, the results differed from what was expected.
An oft-quoted example of this divergence between observational and trial data is the effect of oestrogen–progestin therapy in post-menopausal women. Several observational studies had shown that users of this therapy had a 40%–60% lower risk of coronary artery disease after multivariable adjustment for numerous potential confounders.7,8 However, in the subsequent large Women’s Health Initiative RCT, women administered this therapy had an increased risk of disease.9 This indicates either that the multivariable analyses failed to fully adjust for the confounders included in them, or that other unmeasured confounders existed; in either case, the results of the observational studies were unreliable. Additional examples of such disparity include the non-skeletal effects of vitamin D deficiency,10 and the effect of dietary supplementation with retinoids and vitamin E on cardiac disease.11
Further, certain specific study questions are prone to particular forms of bias in observational studies. In studies on the use of diagnostic tests to screen for a disease, the group of persons undergoing the screening test may appear to live longer due to various biases, such as lead-time, length-time, detection, and selection biases. Similarly, in Mendelian randomization studies, a form of observational design, linkage disequilibrium and pleiotropy can lead to false associations.12
Can we avoid such misinterpretation? First, researchers conducting observational studies need to be aware of these pitfalls. Second, it helps to understand well the issues around the study question and to review past studies on the topic to identify likely sources of bias and potential confounders. This knowledge can then help reduce bias through more suitable procedures for participant selection and for measurement of exposure and outcome. Ways to limit confounding include (i) restricting the study to those without the confounding variable, (ii) matching the exposed and unexposed subjects in a cohort design (or cases and controls in a case–control design) on the confounding variable, (iii) undertaking stratified analysis, or (iv) using multivariable statistical techniques to control for the confounding variables. For example, during the Covid-19 pandemic, test-negative case–control studies were used to assess vaccine efficacy. In these studies, individuals with proven Covid-19 and those who were Covid-19 negative were compared for a history of prior vaccination. However, the controls, instead of being healthy persons, were individuals who had attended the same hospitals with symptoms suggestive of Covid-19 but tested negative for it.13 This ensured that the controls were more akin to the cases in characteristics such as health-seeking behaviour. Such procedures, however, can only partly mitigate, not eliminate, the risk of bias and confounding. For instance, even after matching, stratified analysis, or multivariable analysis, some ‘residual’ confounding persists. Importantly, there is no way to quantify such residual confounding.
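To illustrate approach (iv), the sketch below reuses the hypothetical data from the earlier simulation and fits a multivariable logistic regression (via the statsmodels package, chosen here purely for illustration). Including the measured confounder moves the estimate back towards the null, while the crude estimate remains distorted; the same approach cannot correct for confounders that were never measured:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 100_000

# Recreate the hypothetical data from the earlier sketch
rich = (rng.random(n) < 0.5).astype(float)
supplement = (rng.random(n) < np.where(rich == 1, 0.7, 0.2)).astype(float)
disease = (rng.random(n) < np.where(rich == 1, 0.05, 0.15)).astype(float)

# Crude model: exposure only
crude = sm.Logit(disease, sm.add_constant(supplement)).fit(disp=False)

# Adjusted model: exposure plus the measured confounder
X = sm.add_constant(np.column_stack([supplement, rich]))
adjusted = sm.Logit(disease, X).fit(disp=False)

print("Crude odds ratio for supplement:   ", round(np.exp(crude.params[1]), 2))
print("Adjusted odds ratio for supplement:", round(np.exp(adjusted.params[1]), 2))
# Adjustment works only for confounders that were measured; any unmeasured
# confounder leaves 'residual' confounding in the adjusted estimate.
```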
An even bigger challenge arises from a thoughtless overuse of observational studies, without paying attention to the relevance or validity of the underlying research question. In recent years, several large datasets, such as the National Health and Nutrition Examination Surveys (NHANES) in the US, National Family Health Surveys (NFHS) in India, and large genomic databases, have become available. These contain data on numerous demographic, social, economic, and nutritional variables, as well as genetic variations at millions of loci across the human genome, for several thousand individuals, including those with various diseases.
By convention, we consider an association statistically significant if the probability of its having arisen by chance alone is less than 5%. This approach works well when only a few hypotheses are tested; however, if numerous hypotheses are tested, a fair number can be expected to appear significant purely by chance. Such large databases allow assessment of a virtually unlimited number of possible exposure–outcome pairs. Hence, any undirected ‘data dredging’ will lead to the ‘discovery’ of several ‘significant’ associations: chance findings that are likely ultimately to be proven untrue.14,15 It is possible to correct for such multiple hypothesis testing using methods such as the Bonferroni correction. However, the main challenge is that much of the ‘shooting in the dark’ occurs before the hypothesis is framed (e.g. in choosing which genetic polymorphism to study) and hence cannot be accounted for. Authors, reviewers, and editors therefore need to be aware of these limitations and publish only studies with well-founded hypotheses.
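The scale of the problem is easy to demonstrate with simulated data in which no true associations exist: roughly 5% of tests come out ‘significant’ at the conventional threshold, and a Bonferroni correction (dividing the threshold by the number of tests) removes most of these chance findings. A hypothetical sketch:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
n_tests = 1_000        # hypothetical number of exposure-outcome pairs examined
n_per_group = 50

p_values = []
for _ in range(n_tests):
    # Both groups are drawn from the SAME distribution: no true association exists
    a = rng.normal(size=n_per_group)
    b = rng.normal(size=n_per_group)
    p_values.append(ttest_ind(a, b).pvalue)

p_values = np.array(p_values)
print("'Significant' at p < 0.05:            ", (p_values < 0.05).sum())            # ~50, all by chance
print("Significant after Bonferroni (0.05/N):", (p_values < 0.05 / n_tests).sum())  # ~0
```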
Another recent concern is the use of novel phrases to describe research studies based on observational data, such as ‘real-world data’, ‘propensity score matching’, and ‘target trial emulation’. Although these labels suggest greater rigour, such studies retain many of the limitations of observational designs discussed above. For instance, ‘propensity score matching’ probably performs no better than multivariable statistical analysis in adjusting for confounders, and its results continue to carry the risk of ‘residual confounding’ and of bias due to the study cohort differing from the general population.16 Similarly, the ‘target trial emulation’ design, often described as equivalent to a hypothetical RCT, is susceptible to immortal time bias and various forms of confounding.17 Hence, such studies cannot be equated with RCTs, where random allocation of participants ensures the equivalence of study groups.
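For readers unfamiliar with the term, propensity score matching models the probability of receiving the exposure from measured covariates and then compares exposed and unexposed individuals with similar predicted probabilities. The sketch below is purely illustrative (hypothetical variable names, simple nearest-neighbour matching with replacement, omitting refinements such as calipers); note that only measured covariates enter the model, which is why residual confounding remains possible:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 5_000

# Hypothetical measured covariates and exposure
age = rng.normal(60, 10, n)
comorbidity = rng.binomial(1, 0.3, n)
p_exposure = 1 / (1 + np.exp(-(0.03 * (age - 60) + 0.8 * comorbidity - 0.5)))
exposed = rng.binomial(1, p_exposure)

# Step 1: model the probability of exposure from the measured covariates
X = sm.add_constant(np.column_stack([age, comorbidity]))
propensity = sm.Logit(exposed, X).fit(disp=False).predict(X)

# Step 2: match each exposed person to the unexposed person with the
# closest propensity score (nearest neighbour, with replacement)
exposed_idx = np.where(exposed == 1)[0]
control_idx = np.where(exposed == 0)[0]
matched_controls = [control_idx[np.argmin(np.abs(propensity[control_idx] - propensity[i]))]
                    for i in exposed_idx]

# Outcomes would then be compared between exposed_idx and matched_controls.
# Covariates not included in the model (unmeasured confounders) remain unbalanced.
print("Matched pairs formed:", len(matched_controls))
```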
There is a common belief that meta-analyses represent the highest level of evidence. However, this holds only when such analyses pool trial data, not when they pool observational data. Pooling of observational studies may serve to amplify the effect of a bias or confounder that is present across multiple similar studies, inducing a spurious sense of confidence in the results of such analyses. It is also often argued that a higher relative risk and the demonstration of a dose response (a higher risk with greater exposure) indicate that an association is more likely to be true; however, this is not always the case.
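The point about pooling can be illustrated numerically: if several observational studies share the same bias, inverse-variance (fixed-effect) pooling merely narrows the confidence interval around the same distorted estimate. The figures below are invented for illustration:

```python
import numpy as np

# Hypothetical log relative risks and standard errors from five observational
# studies that all share the same confounder, biasing each towards RR ~ 0.7
log_rr = np.log(np.array([0.72, 0.68, 0.71, 0.69, 0.70]))
se = np.array([0.10, 0.12, 0.09, 0.11, 0.10])

# Fixed-effect (inverse-variance) pooling
weights = 1 / se**2
pooled = np.sum(weights * log_rr) / np.sum(weights)
pooled_se = np.sqrt(1 / np.sum(weights))

low, high = np.exp(pooled - 1.96 * pooled_se), np.exp(pooled + 1.96 * pooled_se)
print(f"Pooled RR = {np.exp(pooled):.2f} (95% CI {low:.2f} to {high:.2f})")
# The pooled interval is far narrower than any single study's, yet the shared
# bias is untouched: precision has improved, validity has not.
```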
Thus, the results of observational studies need to be interpreted with considerable caution. Ideally, one needs to test the association using a more robust design, such as an RCT, where feasible. This may need some ingenuity on the part of researchers. For instance, where the exposure carries a risk of adverse outcomes and an RCT of the exposure is therefore ethically impermissible, an RCT of withdrawal can be considered. Thus, to reliably assess the effect of smoking on health outcomes, a cluster randomized trial, wherein an anti-smoking behavioural intervention is offered in some locations but not others, can be done. An improvement in health outcomes in the intervention clusters would indicate that the benefit is due to the intervention, indirectly establishing the role of smoking in causing poor health outcomes. An alternative, though less reliable, approach is to rely on data from mechanistic studies, while recognizing that data from in vitro or animal models may not always be transferable to humans.