An overview of real‐world data sources for oncology and considerations for research

Introduction

Generating accurate evidence on the patterns and effectiveness of preventing, diagnosing, and treating cancer in real-world settings is a priority for researchers, health care providers, payers, and regulators. Real-world data (RWD), or data relating to patient health and/or the delivery of health care from routinely collected sources as opposed to clinical trials,1 can be an important component in addressing a range of research questions across the cancer continuum. When combined with rigorous design and analytic methods, RWD can be used to generate real-world evidence about preventive and cancer-focused care delivered outside the selected trial populations in which such care is often studied. Previous reviews have summarized different RWD sources for oncology research, their potential uses, and important biases for consideration.2-4 In this review, we extend this prior work to: 1) introduce a conceptual model to help researchers with the process of RWD source selection for a given research question; 2) update and describe features of commonly used RWD source types, including their strengths and limitations; and 3) provide an example of RWD source selection using a case study from a recently published article.

Conceptual Model

We propose a conceptual model (Fig. 1) to assist researchers in assessing the suitability of an RWD source for answering a specific cancer-related research question. The model has 3 primary steps: 1) clearly define the research question, 2) understand the data source contents and target population coverage, and 3) assess the data source relevance to the research question. In step 1, we recommend applying the previously published PICOTS framework to clearly delineate the population, intervention, comparator, outcome, timing, and setting.5, 6 This framework is often used in evidence-based practice and thus can be adapted as a way of emulating a target trial using nonrandomized RWD.7 It may also be useful for researchers to think through whether the study goal is description (eg, summarizing patterns), prediction (eg, identifying those likely at risk of an event), or effect estimation (eg, identifying effects of interventions or policies) to clarify objectives and interpretation.8 Step 2 highlights the importance of fully understanding the representation and content of the data and the coverage of the target population to which the study results will ultimately apply. The PICOTS framework and a clear specification of the target population outline the data requirements for a specific study question. In step 3, researchers must then assess the relevance of the data source for the proposed research question. This includes understanding the original purpose for which it was generated and key steps in data provenance and processing. Understanding the original data collection processes provides insight into the quality of specific data elements and whether the RWD source is suitable, or fit-for-purpose, for the intended use case.9, 10 Information about the availability of specific variables, their completeness, and their validity is another key component of this assessment. There is substantial variability across RWD sources in the type, breadth, completeness, and quality of data elements. Understanding the underlying differences in RWD sources is central to the appropriate selection and valid use of RWD for cancer research.
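The 3 steps can be made concrete by writing the PICOTS elements down as a structured record and comparing the variables a study needs against what a candidate source is known to capture. The Python sketch below is only an illustration and is not part of the published model: the class names, the variable names, the example question, and the 0.8 completeness threshold are all hypothetical assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Picots:
    """Step 1: the research question, spelled out element by element."""
    population: str
    intervention: str
    comparator: str
    outcome: str
    timing: str
    setting: str
    # Variables the analysis needs (hypothetical names).
    required_variables: set[str] = field(default_factory=set)


@dataclass
class DataSource:
    """Step 2: what a candidate source contains."""
    name: str
    # Variable name -> proportion of records with a non-missing value.
    completeness: dict[str, float]


def assess_relevance(question: Picots, source: DataSource,
                     min_completeness: float = 0.8) -> dict[str, list[str]]:
    """Step 3: list required variables that are absent or too incomplete."""
    missing = sorted(v for v in question.required_variables
                     if v not in source.completeness)
    incomplete = sorted(v for v in question.required_variables
                        if v in source.completeness
                        and source.completeness[v] < min_completeness)
    return {"missing": missing, "incomplete": incomplete}


# Hypothetical question and source, for illustration only.
question = Picots(
    population="Adults with newly diagnosed cancer",
    intervention="Systemic therapy A",
    comparator="Systemic therapy B",
    outcome="Overall survival",
    timing="Follow-up from treatment start",
    setting="Routine outpatient oncology care",
    required_variables={"stage", "systemic_therapy", "death_date"},
)
claims_only = DataSource(
    name="claims_only",
    completeness={"systemic_therapy": 0.95, "death_date": 0.90},
)
result = assess_relevance(question, claims_only)
print(result)
```

In this toy case the check flags cancer stage as missing, which mirrors the limitation of claims data discussed later in this review. A real assessment would also need to consider validity and provenance, which a completeness table alone cannot capture.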

FIGURE 1. Conceptual Model to Guide the Selection of a Real-World Data Source for a Specific Cancer-Related Research Question.

Real-World Data Types

The landscape of RWD is broad, expanding, and includes a variety of data source types that represent the complex and fragmented delivery of health care in the US system. Because of this fragmentation, RWD may be influenced by several key factors: who is paying for care (insurer), who is delivering the care (provider), where the care is delivered (geography or health system), or the specific population represented (disease or demographic). The main categories of RWD sources covered in this review, although not fully comprehensive, include those most commonly used by researchers. These include the following: 1) administrative claims, 2) electronic health records (EHRs), 3) registries, 4) health care data aggregators, and 5) specialty data providers and networks (Table 1).11-26 Of note, these categories are somewhat subjective and data sources are dynamic, continually expanding their capture of information through data linkage and collation of other resources. As such, we acknowledge that others may consider specific datasets in different categories. Within each of these high-level categories are more detailed types and subtypes related to the network, organization, facility, setting, or modality of health care covered by the data source. Appropriate analysis of RWD requires an understanding of both the original purpose and current use cases of the data because the primary use case and subsequent changes made provide important context about the data elements captured and the data structure. Table 2 provides details about each RWD type, including the population and estimated coverage, strengths and limitations, and example studies from the literature.

TABLE 1. Overview of Oncology Real-World Data Sources: Data Elements, Intended Purpose, and Examples

| DATA TYPE | DATA SUBTYPE | DESCRIPTION | TYPES OF DATA AVAILABLE | INTENDED PURPOSE | EXAMPLE |
| --- | --- | --- | --- | --- | --- |
| Administrative claims | Private insurers | Administrative claims are generated to record health care transactions between a health care plan and health care providers for covered individuals; private health insurers may provide accessibility to these data for researchers through licensing and signing a data use agreement directly or through third-party vendors | Enrollment, demographics, dates of service, diagnosis codes, procedure codes, vital status, and pharmacy transactions | Data are collected for the purposes of billing and reimbursement for health care services (eg, medical, pharmacy) | Sharma 2020 (ref 11) |
| Administrative claims | Public/federal insurers (Medicare) | Federally sponsored health insurance coverage for adults aged 65 y and older and selected individuals with disabilities; administrative claims capture health care transactions between covered individuals and health care providers; researchers can access these data through a submission and approval process, which also requires a data use agreement | Enrollment, demographics, dates of service, diagnosis codes, procedure codes, vital status, and pharmacy transactions | Data are collected for the purposes of billing and reimbursement for health care services (eg, medical, pharmacy) | Potosky 1992 (ref 12) |
| Administrative claims | Public/state insurers (Medicaid) | State-provided health insurance coverage for specific populations (eg, income-based, pregnant women, and children); administrative claims generated through the reimbursement of covered services are recorded; researchers can access these data through a submission and approval process, which also requires a data use agreement | Enrollment, demographics, dates of service, diagnosis codes, procedure codes, vital status, and pharmacy transactions | Data are collected for the purposes of billing and reimbursement for health care services (eg, medical, pharmacy) | Maclean 2020 (ref 13) |
| Electronic health record (EHR) | Health maintenance organizations (HMOs) | Health system or catchment area provides patient care aggregated through an integrated model of health care delivery, including coordination of a health care plan, medical physician groups, and a health care facility system | Varies by HMO; typically includes data required for the provision of clinical care across the HMO settings as well as billing purposes, such as demographics, clinical variables, diagnosis, radiology, laboratory, diagnostics, and pharmacy | Data are collected for the documentation, assessment, and provision of clinical care and treatment pathways within health care systems or inpatient or outpatient settings | Bowles 2012 (ref 14) |
| Electronic health record (EHR) | Ambulatory care | EHR systems developed to facilitate the provision of care in the outpatient setting, including physician office visits, radiology centers, laboratories, and other treatment centers, for the primary purposes of clinical care, documentation, and quality assessment | Demographics, diagnosis, clinical variables, medical oncology, radiation oncology, radiology, laboratory, diagnostics, and pharmacy | Data are collected for the documentation, assessment, and provision of clinical care and treatment pathways within health care systems or inpatient or outpatient settings | Lau 2011 (ref 15) |
| Electronic health record (EHR) | Inpatient care | EHR systems developed to facilitate the provision of care in the inpatient setting, including hospitals, systems, and long-term care facilities, for the primary purposes of highly monitored clinical care, documentation, and quality assessment | Demographics, diagnosis, clinical variables, radiology, laboratory, diagnostics, and pharmacy | Data are collected for the documentation, assessment, and provision of clinical care and treatment pathways within health care systems or inpatient or outpatient settings | Callahan 2020 (ref 16) |
| Registry data | Federally sponsored (SEER, NPCR) | Data that are collected and curated systematically on a specific disease, condition, or population and entered into federally managed registry | Data variables are typically organized around variables to evaluate the etiology, diagnosis, treatment, and outcomes of patients within the registry | Provides epidemiology of disease incidence, prevalence, and trends for disease monitoring; data that are curated systematically according to data standards as part of public health reporting; data are HIPAA-exempt for maintaining PII and linking longitudinally | Cronin 2018 (ref 17) |
| Registry data | State or regional registries | Data that are collected and curated systematically on a specific disease, condition, or population that cover a specific geography, population, or is captured under state regulation (public health reporting) | Incidence data and trends, high-level data collection, survival trends | Contributes to epidemiology of disease, monitors trends in disease, and supports public health planning; data that are curated systematically according to data standards as part of public health reporting for the specific disease state; data are HIPAA-exempt for maintaining personal identities and linking longitudinally | Gearhart-Serna 2020 (ref 18) |
| Registry data | Industry-sponsored (drug specific) | Voluntarily developed or mandated for postmarketing to collect data specifically on patients receiving a specific drug or combination of drugs to allow longitudinal exposure monitoring for potential adverse events, safety, outcomes, and follow-up data | Demographics; pharmacy, including drug dosing and administration, laboratory, and adverse drug events (related to the drug of interest) | Collect data elements on patients receiving a specific agent | Brown 2013 (ref 19) |
| Registry data | Hospital-based registries (NCDB) | Registries developed to capture information for quality assurance at the facility level, with the focus on patients treated within the health care system | Demographics, clinical variables, limited treatment, incidence data and trends, survival outcomes; subset of practices include detailed data on cases, including quality measures | Monitoring quality of care at the facility level | Boffa 2017 (ref 20) |
| Registry data | Disease-specific registries (cancer site) | Registries developed or established to collect data on patients with a specific disease (eg, rare cancer) | Demographics, treatment data, pharmacy, diagnosis, laboratory, and clinical variables (related to the specific disease) | Data collected from patients with a disease for longitudinal monitoring and epidemiologic studies | Steele 2006 (ref 22) |
| Health care data aggregators | Nonprofit | Data aggregators combine data across varied sources using a specified data model (eg, federated or software system-based) to provide composite data for evaluation | Demographics, diagnosis, and clinical variables (can vary: radiology, laboratory, diagnostics, pharmacy) | Used to measure care delivery or improve quality of care (CancerLinQ); used to gather data on patient-centered outcomes (PCORnet); used to query signals to assess drug safety for marketed products (Sentinel) | Brown 2020 (ref 23) |
| Health care data aggregators | Commercial | Single-sourced, curated data | Demographics, diagnosis, clinical variables (can vary: radiology, laboratory, diagnostics, pharmacy) from one system or group of systems | Clinical data that are refined and cleaned for research purposes (eg, Flatiron, primarily Oncology EMR software) | Khozin 2019 (ref 24) |
| Health care data aggregators | Commercial | Multiple unique sourced | Demographics, diagnosis (can vary: radiology, laboratory, diagnostics, pharmacy) from multiple sources across geographic areas | Data that are collected and curated from heterogenous data sources to fit commercial research models (eg, COTA Inc, Symphony Health, HealthVerity, OptumLabs) | Kabadi 2019 (ref 25) |
| Specialty data providers and networks | Varied | Organizations capturing a specific individual data type, such as radiology images or reports, administrative pharmacy data, or genomic information | Demographics, diagnosis, clinical variables (can vary: radiology, laboratory, diagnostics, pharmacy), document type can vary by image or report (eg, DICOM or pdf) | Data exchange, typically to enhance clinical care, enabling providers across different entities to view patient results | Gajra & Feinberg 2020 (ref 26) |

Abbreviations: DICOM, Digital Imaging and Communications in Medicine system; HIPAA, the Health Insurance Portability and Accountability Act of 1996; NCDB, National Cancer Data Base; NPCR, National Program of Cancer Registries; pdf, portable document format; PII, personally identifiable information; SEER, Surveillance, Epidemiology, and End Results program of the National Cancer Institute.

TABLE 2. Characteristics of Oncology Real-World Data Sources: Coverage, Strengths, and Limitations

| DATA TYPE | DATA SUBTYPE | POPULATION | COVERAGE/LONGITUDINALITY | COVERAGE ACROSS SETTINGS | STRENGTHS | LIMITATIONS |
| --- | --- | --- | --- | --- | --- | --- |
| Administrative claims | Private insurers | Individuals with specific insurance coverage (eg, employer-based, self-insured, other) | Longitudinal capture of health care service encounters while enrolled in benefits; vital status is often available for state and federal insurance programs | Medium to high; coverage is based on benefits enrollment; can include capture of inpatient, outpatient, and pharmacy services | Clear population denominator; longitudinal data capture | Often short enrollment periods; lacks clinical details and laboratory results; no information on provider or patient intent/preference |
| Administrative claims | Public/federal insurers (Medicare) | Adults aged 65 y and older or with qualifying disabilities | Longitudinal capture of health care service encounters while enrolled in benefits; vital status is often available for state and federal insurance programs | Medium to high; coverage is based on benefits enrollment; can include capture of inpatient, outpatient, and pharmacy services | Clear population denominator; longitudinal data capture; often a more stable enrolled population; has been linked to other forms of data (eg, registry); vital status data available | Does not include individuals enrolled in Medicare Advantage; lacks clinical details and laboratory results; no information on provider or patient intent/preference |
| Administrative claims | Public/state insurers (Medicaid) | Income-based eligibility or coverage for special populations (eg, pregnant women and children) | Longitudinal capture of health care service encounters while enrolled in benefits; vital status is often available for state and federal insurance programs | Medium to high; coverage is based on benefits enrollment; can include capture of inpatient, outpatient, and pharmacy services | Clear population denominator; longitudinal data capture; several states' data can be accessed through centralized processes; vital status data available | Often population is unstable because of fluctuating eligibility requirements; lacks clinical details and laboratory results; no information on provider or patient intent/preference |
| Electronic health record (EHR) | Integrated delivery organizations | Individuals enrolled and receiving care in a health maintenance organization | Longitudinal capture of health care service encounters while enrolled in benefits | High | Clear population denominator; longitudinal data capture; high level of completeness across care settings; low rates of attrition within plan | Not representative of general population or patients in fee-for-service plans |
| Electronic health record (EHR) | Ambulatory care | Patients receiving care within the specified outpatient setting captured through the source | Coverage may be sporadic, depending on sharing between centers based on the use of specific EHR software | Medium to low; only available through central linkage by EHR software or previously linked clinical centers | Contains data that may not be captured elsewhere | Care received outside of the system would not be documented; not longitudinal (question-dependent); population denominator often unclear |
| Electronic health record (EHR) | Inpatient care | Patients receiving care within the specified inpatient setting captured through the source | Lacks longitudinality because the data are episodic and typically best used for short-term studies | Medium to low; only available within health systems with a common EHR software | Provides detailed data for episodic study | Care received outside of the system would not be documented; population denominator may not be clear |
| Registry data | Federally sponsored (SEER, NPCR) | Combined data from all patients who have cancer within a specific set of geographic catchment areas (based on regional and/or state registries) | Longitudinal capture of health care for available data sources; may have gaps in knowledge (eg, treatment over time, recurrence) | Medium to high; consolidates data from across multiple health care settings and providers | Large sample size of population-based data; facilitates temporal trends assessment across different strata | Delays in reporting of data; limited detail currently |
| Registry data | State or regional registries | All patients who have cancer within a specific geographic catchment area | Longitudinal capture of health care for available data sources; may have gaps in knowledge (eg, treatment over time, recurrence) | Medium; consolidates data from across multiple health care settings and providers | Includes all cancers diagnosed within geographic area; followed longitudinally | Limited outcomes; limited detailed data on treatment, genomic characterization |
| Registry data | Industry-sponsored (drug-specific) | Limited, drug-specific only | Coverage is limited; longitudinality is typically good | Medium to high; for a very specific population only | Very detailed information on specific data elements | Very narrow data collection |
| Registry data | Hospital-based registries (NCDB) | Patients receiving diagnosis or treatment in inpatient facilities or associated outpatient facilities | Limited capture of longitudinal follow-up of patient; dependent on access to information outside the institutional setting | Medium; consolidates data from across multiple health care settings and providers | More detailed data on each patient if within selected site; focus on facility quality of care | Not a population-based sample; limited information on care delivered outside the facility setting |
| Registry data | Disease-specific registries (voluntary) | Voluntary submission for a specific disease | Coverage is limited as focus on a particular disease; longitudinality is typically good; volunteer-based | Medium to high; for a very specific population only | Well defined cohort of interest; potential to target rare or unusual cancers | Limited data because of volunteer reporting |
| Health care data aggregators | Nonprofit | Variable, depends on the aggregator's purpose | Highly varied on data source; may be similar to EHR or claims-based sources | Medium to high; varies by source, although the objective is often longitudinal | If purpose is well defined, produces high-quality studies | Convenience, not population-based, sample |
| Health care data aggregators | Commercial | Patients receiving care within the specified setting captured through the source | Highly varied on data source; may be similar to EHR or claims-based sources; based on care received in the specific system | Medium to high; varies by source, although the objective is often longitudinal | Ability to curate data for specific purpose or extract variables (eg, EGFR) | Convenience, not population-based, sample |
| Health care data aggregators | Commercial | Patients receiving care within the specified setting captured through the source | Coverage is complex and varies significantly by the intersection of linked sources; highly varied on data source; may be similar to EHR or claims-based sources | Medium to high; includes various settings | Includes multiple, heterogenous data sources to provide a detailed, longitudinal understanding of clinical interaction | Complete coverage for all data types may be sparse; convenience, not population-based, sample |
| Specialty data providers and networks | Varied | Variable, depending on the network size and mission | Highly varied based on data source purpose and structure | Typically crosses multiple health care settings | Variable by data source; may provide detailed clinical data elements from the specific source | Variable by data source; may have limited capture of complete clinical picture; may require linkage with other sources |

Abbreviations: NCDB, National Cancer Data Base; NPCR, National Program of Cancer Registries; SEER, Surveillance, Epidemiology, and End Results program of the National Cancer Institute.

Administrative Claims Data

Administrative claims data have been a longstanding source of RWD for cancer research. These data, which are recorded for reimbursement purposes, include coded diagnoses and services rendered during patient visits, as reported on claims submitted to insurers. Longitudinal data from claims can be captured on individuals who are continuously enrolled in specific health insurance plans or in pharmacy or other specific programs. Common sources of administrative data used by cancer researchers include enrollment and claims data generated from government insurers, including Medicare (federal level) and Medicaid (state level); commercial insurance providers; and health care claims data aggregators.

Approval of the Health Insurance Portability and Accountability Act (HIPAA) of 1996 led to requirements that resulted in claims data sources sharing many common data elements. Importantly, most administrative claims databases contain enrollment files, which track individual monthly enrollment in a covered health plan over the time span of the data source. This distinct longitudinal feature enables a clear description of a population over time that can be used to define a study denominator. In addition, many claims data sources contain patient health data across health care settings, including inpatient visits, outpatient visits, or other specialty health care providers. Increasingly, health plans provide additional pharmacy benefits and thus include prescription medication dispensing information from outpatient or community pharmacies. These dispensing data are increasingly important in understanding cancer outcomes in the context of treatment. In general, health care services that are not reimbursable by the health plan or program (eg, over-the-counter medicines or services paid out of pocket by the patient) are not captured. In addition, the type of insurance plan or program participation by the patient or provider may influence the sensitivity and specificity of care as recorded in the claim (eg, fee-for-service vs managed care, such as health maintenance organizations or accountable care organizations). Claims data can also include valuable information on health care delivery that enables research on providers, care quality, access, hospital volume, and prescribing patterns.
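To show how monthly enrollment files translate into a study denominator, the short Python sketch below flags individuals who were continuously enrolled for a lookback window before an index date. It is a simplified illustration under stated assumptions: the enrollment rows, person identifiers, index date, and 12-month window are invented, and real enrollment files differ in layout across data sources.

```python
from datetime import date


def month_index(d: date) -> int:
    """Number calendar months so that consecutive months differ by 1."""
    return d.year * 12 + (d.month - 1)


def continuously_enrolled(enrolled_months: set[int], index_date: date,
                          lookback_months: int) -> bool:
    """True if every month in the lookback window before the index month is covered."""
    idx = month_index(index_date)
    return all(m in enrolled_months for m in range(idx - lookback_months, idx))


# Hypothetical enrollment file: one row per person per enrolled month.
enrollment_rows = (
    [("A", date(2020, m, 1)) for m in range(1, 13)]
    # Person B has a coverage gap from April through June 2020.
    + [("B", date(2020, m, 1)) for m in (1, 2, 3, 7, 8, 9, 10, 11, 12)]
)

months_by_person: dict[str, set[int]] = {}
for person_id, month_start in enrollment_rows:
    months_by_person.setdefault(person_id, set()).add(month_index(month_start))

index_date = date(2021, 1, 15)  # hypothetical index date
denominator = sorted(
    person for person, months in months_by_person.items()
    if continuously_enrolled(months, index_date, lookback_months=12)
)
print(denominator)
```

Here only person A enters the denominator because person B has a gap in coverage. Whether short gaps should be tolerated is a study-specific design choice that this sketch does not address.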

Because administrative claims are generated for billing purposes rather than for patient care, the validity and completeness of costly procedures (eg, surgical resection) are likely to be high; however, the accuracy of specific diagnoses (eg, hypertension) is variable and depends on several factors. These include the specific patient population and the provider setting (eg, physician office vs inpatient care). On a specific claim, only the diagnoses and procedures that are needed to describe clinical care provided for reimbursement are likely to be included, which may lead to reduced sensitivity in the capture of certain outcomes. In addition, administrative data often lack important clinical, laboratory, or behavioral health information that may be important for cancer research, such as the cancer stage, genomic biomarker testing results, and smoking status.

Substantial efforts have been made to address some of these limitations in oncology by linking administrative claims with registry (eg, the National Cancer Institute's Surveillance, Epidemiology, and End Results [SEER] program) and survey (eg, the Health and Retirement Study) resources. The National Cancer Institute has led several efforts to enhance cancer research data for the scientific community, resulting in widely used resources, including SEER-Medicare,27 SEER-Consumer Assessment of Healthcare Providers and Systems,28 and the SEER-Medicare Health Outcomes Survey.29, 30

Electronic Health Records

EHRs are another increasingly prevalent RWD type. EHRs can provide rich information that may not be available from other types of RWD because they contain data from multiple sources within the health care system (eg, pathology reports, laboratory results, medication records, provider notes). However, the vast majority of information held within EHRs is maintained in unstructured text documents or is captured as scanned portable document format files without optical character recognition, requiring curation and translation to extract structured data. Furthermore, EHRs do not include comprehensive information on health care provided outside the facility covered by the system. A patient with cancer may have data held independently within multiple EHRs across hospitals, community oncology practices, radiation oncology practices, or other settings, depending on the software used and its integration across these practices.31, 32 This is especially true for patients at different stages in their cancer journey. For example, a newly diagnosed patient may see a general practitioner or urologist, and early treatment phases may involve a combination of surgical and systemic treatments provided in different practices. Patients undergoing passive surveillance or cancer survivors may also receive a large proportion of their care outside of an oncology practice.

There are now emerging opportunities to extract data across electronic health systems using Fast Healthcare Interoperability Resources (FHIR) technologies. There are also potential opportunities for manually assisted natural language processing or deep-learning methods to capture vast, unstructured data directly within EHRs. These tools may partially overcome the limitations of fragmented and unstructured data but are still early in implementation or systematic use. EHRs may include granular information, but the data are not adjudicated or quality checked as part of routine practice, which may result in inconsistencies in key data elements (eg, cancer stage).33

EHR data systems often are not interoperable, even across the same EHR system, which is a critical barrier to their use in research. However, new requirements issued by the US Department of Health and Human Services Office of the National Coordinator for Health Information Technology (ONC) mandate an increased ability to share data across these various systems to assure continuity of patient care.34, 35 As part of the 21st Century Cures Act,36 2 new laws colloquially known as information (or data) blocking laws are being enacted by the ONC and the Centers for Medicare and Medicaid Services. The ONC rule specifically requires health care providers to adopt or integrate standardized application programming interfaces into their electronic medical records. These requirements mean that all patients will have direct access to their electronic health information (structured and/or unstructured) using smartphones (or computers) at no cost. Similar to the application of HIPAA on claims data, this law will require a standardized set of data (referred to as data classes and data elements) outlined as the United States Core Data for Interoperability.37 Although these data are still untested and their ability to capture specialty care like oncology is less clear, broad adoption of these application programming interfaces is likely to significantly improve data interoperability and the ability to share electronic data between and across health care systems.
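Standardized interfaces of this kind typically exchange data as JSON resources under the HL7 fast health care interoperability resources standard mentioned above. As a rough sketch of what consuming such a response could look like, the Python below flattens diagnosis codes from a small, hand-written Bundle. The payload and patient identifier are invented for illustration, no real server is contacted, and the function name `extract_diagnoses` is hypothetical; only the field names (`entry`, `resource`, `code.coding`, `subject.reference`, `onsetDateTime`) follow the published Bundle and Condition resource definitions.

```python
import json

# Hypothetical, hand-written payload shaped like a search-result Bundle.
bundle_json = """
{
  "resourceType": "Bundle",
  "type": "searchset",
  "entry": [
    {"resource": {
      "resourceType": "Condition",
      "subject": {"reference": "Patient/example-1"},
      "code": {"coding": [
        {"system": "http://hl7.org/fhir/sid/icd-10-cm", "code": "C34.90"}
      ]},
      "onsetDateTime": "2020-03-01"
    }},
    {"resource": {
      "resourceType": "Observation",
      "subject": {"reference": "Patient/example-1"},
      "status": "final"
    }}
  ]
}
"""


def extract_diagnoses(bundle: dict) -> list[dict]:
    """Return one flat row per diagnosis coding found in Condition resources."""
    rows = []
    for entry in bundle.get("entry", []):
        resource = entry.get("resource", {})
        if resource.get("resourceType") != "Condition":
            continue  # skip Observations and any other resource types
        for coding in resource.get("code", {}).get("coding", []):
            rows.append({
                "patient": resource.get("subject", {}).get("reference"),
                "system": coding.get("system"),
                "code": coding.get("code"),
                "onset": resource.get("onsetDateTime"),
            })
    return rows


diagnoses = extract_diagnoses(json.loads(bundle_json))
for row in diagnoses:
    print(row)
```

Structured codes like these are only part of the record. Oncology-specific detail such as stage or biomarker results may not be represented in the core data elements, consistent with the caution above.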

Integrated care delivery by health maintenance organizations represents a different type of health care delivery in which comprehensive care is provided to patients for almost all health care services. The integrated care delivery model includes the coordination of a health care plan, medical physician groups, and a health care facility system.38 From a data perspective within the EHR, the model includes all billed services for patients within the closed system, unlike fee-for-service insurance plans, in which patients can select care across multiple systems. Several integrated care organizations have consolidated their data into a virtual data warehouse to facilitate research.38, 39 The use of an EHR system across integrated care providers facilitates data access and, if patients do not receive care outside that system, potentially enables complete data on each patient. This is in contrast to the data from fee-for-service care plans and systems that are fragmented across various practices and EHRs. An additional caution is that there is little assessment of the quality of the data contained in the EHR system.

In summary, although EHRs can provide rich and deep data on a patient, the data may yield only a partial picture of the patient trajectory longitudinally because cancer care may be received in multiple facilities with different EHR systems.31 Nevertheless, with the appropriate evaluation and study design, EHR data can be used effectively and appropriately to address many research questions.40-43

Cancer Registries

Registries are designed to collect uniform and systematic data on a population of patients based on exposure, disease, or outcome. Registries may or may not be independent of any one health system, payer, or EHR data vendor. Cancer registries compile records specifically on patients with cancer. These can be convenience-type samples (eg, volunteer registries for specific cancer types or drug-specific registries), based on a health care setting (eg, hospital registries), or population-based (state or central cancer registries). Most registries are designed for a specific purpose, for example, registries of patients representing rare tumors or familial syndromes. Many registries collect information that may not be available from more traditional sources, such as detailed exposure data (eg, diet, physical activity) and patient-reported outcomes, but they typically represent a nonrandom sample of patients. Hospital-based registries capture detailed data on each patient and are useful for understanding the quality of care provided within a specific hospital setting. However, these health care system-based registries may not capture data on care provided outside the system in which the registry is based, similar to the limitations of EHR data. For example, if recurrence is diagnosed in the oncology office, the recurrence may go unidentified by the hospital-based registry. Facility-based registries may also have limitations for understanding the outcomes of tests ordered by the oncologist because the test results are sent directly to the oncologist and are not entered into the hospital-based information system.44 These hospital-based registries are becoming increasingly agile, as new data items may be added readily, and data are often available in real time to enable analysis of quickly evolving clinical issues. The most important role for these hospital-based registries is in monitoring provider metrics and improving the quality of care for patients treated at that facility.45

Central cancer registries are unique in that they are legally mandated in each state and provide a census of all patients with cancer in a well defined geographic area. Central registries (usually state-based) collect data under state regulations that require the reporting of patient identifying health information (personally identifiable information [PII] and protected health information) from all health care providers. This data collection is HIPAA-exempt as part of public health reporting. Registries must maintain PII to comply with the requirement to consolidate data from multiple sources into a single record and to follow patients over time. This consolidation of multiple sources provides a more comprehensive picture of the cancer case, although currently the data collection is focused on the incident diagnosis and subsequent therapies. These registries do perform routine, and often active, follow-up of every patient from diagnosis until death. They contain detailed data on the characterization of each cancer case. By using new linkage methodologies, that characterization is being enhanced to include more clinically relevant information—such as genomic characterization of the cancer and detailed treatment received by each patient. National cancer registries include SEER and the National Program of Cancer Registries, which collate de-identified data from participating state central cancer registries that are then made accessible to researchers.

Limitations of registry data include a lack of information about longitudinal treatment and outcomes other than survival. Those deficiencies are being addressed through several new initiatives, including linkage of registry data with data collected by other organizations and external partners.46, 47 These new methods, along with the integration of real-time access to pathology reports, will also enable data to be reported on a more timely basis. With the addition of these new data, and because population-based cancer registries cover all patients within a defined geographic area, such registries provide an important opportunity to supplement understanding of therapeutic advances and their impact and effectiveness outside the clinical trial setting for population subgroups that may be underrepresented in clinical trials. An additional strength of population-based registries is that, even when a linked study does not cover the entire population, the registry can describe the characteristics of the individuals not included in the linkage, which helps in assessing bias.
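As a minimal illustration of the last point, the following Python sketch compares one characteristic between registry patients who were and were not captured by an external linkage. The field names (`linked`, `age_65_plus`) and the toy records are hypothetical, and the standardized difference is only one of several summaries an analyst might choose; this is a sketch under those assumptions, not a prescribed method.

```python
from math import sqrt


def share(patients, characteristic, linked):
    """Proportion with a 0/1 characteristic among linked or unlinked patients."""
    values = [p[characteristic] for p in patients if p["linked"] == linked]
    return sum(values) / len(values)


def standardized_difference(p_linked, p_unlinked):
    """Standardized difference between two proportions."""
    pooled = (p_linked * (1 - p_linked) + p_unlinked * (1 - p_unlinked)) / 2
    return 0.0 if pooled == 0 else (p_linked - p_unlinked) / sqrt(pooled)


# Toy registry extract (hypothetical values): every patient is in the registry,
# but only some were matched in the external linkage.
patients = [
    {"linked": True, "age_65_plus": 1},
    {"linked": True, "age_65_plus": 0},
    {"linked": True, "age_65_plus": 0},
    {"linked": True, "age_65_plus": 0},
    {"linked": False, "age_65_plus": 1},
    {"linked": False, "age_65_plus": 1},
    {"linked": False, "age_65_plus": 1},
    {"linked": False, "age_65_plus": 0},
]

p_linked = share(patients, "age_65_plus", True)
p_unlinked = share(patients, "age_65_plus", False)
std_diff = standardized_difference(p_linked, p_unlinked)
print(f"linked: {p_linked:.2f}, unlinked: {p_unlinked:.2f}, std. diff.: {std_diff:.2f}")
```

A standardized difference far from zero suggests that the linked subset differs from the rest of the registry population on that characteristic, which is the kind of signal that would prompt closer attention to selection bias.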

Health Data Aggregators

The use of health data aggregators is increasingly common with the development of novel technology platforms, privacy-preserving linkages via encryption, and the need for more rapid and advanced data analytics. Health data aggregators, often called health technology data companies, enable health care data to be harnessed from across different clinical sources and sites in an integrated fashion. In the current review, we define data aggregators as entities that combine data across varied clinical sources and sites using a specified data model (eg, federated or software system-based) to provide multimodal composite data for evaluation. The resulting data sources may include patients from the general population and diverse clinical settings (ie, general practice, hospital, specialty clinic, pharmacy, etc) or may be restricted to certain disease populations (eg, oncology clinics). The organization performing the aggregation may be gathering data for nonprofit purposes (eg, quality improvement), commercial purposes (eg, drug development), or both. It is critical to understand the diversity of sources being aggregated, the primary research intention, and the business model driving the data aggregation as well as to recognize that these data generally do not represent the entire population of patients.

Generally, the objective of data aggregators is to address the challenges of capturing longitudinal data from disparate sources in the US health care delivery system. To that end, they provide an infrastructure to capture patient care across the various health care facilities, physician practices, and laboratories that make up the fragmented US health care system. Examples of data aggregators include, but are not limited to, HealthVerity,48 IQVIA,49 Symphony Health,50 Flatiron,51 and OptumLabs.52 Although individual data sets may be limited to a single practice, health care system, or EHR software vendor, health data aggregators reduce those barriers by linking on a common protected identifier (usually encrypted) to provide aggregated, individual-level data across data sources. This approach provides a potentially more complete picture of health care utilization. The ability of aggregators to securely link patients may also result in an increased sample size, especially for rare diseases.
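To illustrate how linkage on an encrypted identifier can work in principle, the following minimal Python sketch derives a keyed hash (a token) from normalized identifiers and groups records from two sources by that token. The key, field names, normalization rules, and toy records are all hypothetical. This is not the method of any aggregator named above, and production systems typically rely on more elaborate tokenization and governance.

```python
import hashlib
import hmac
from collections import defaultdict

# Hypothetical shared key; in practice a key would be managed securely.
SECRET_KEY = b"illustration-only-key"


def make_token(first_name, last_name, birth_date):
    """Derive a keyed SHA-256 hash from normalized identifiers."""
    parts = (first_name, last_name, birth_date)
    normalized = "|".join(s.strip().lower() for s in parts)
    return hmac.new(SECRET_KEY, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


def link_sources(sources):
    """Group de-identified records from several sources by their token."""
    linked = defaultdict(dict)
    for source_name, records in sources.items():
        for record in records:
            linked[record["token"]].setdefault(source_name, []).append(record["data"])
    return dict(linked)


# Toy records: each source tokenizes locally and shares only token plus data.
ehr = [
    {"token": make_token("Ana", "Diaz", "1960-02-01"), "data": {"diagnosis": "C50.9"}},
]
claims = [
    {"token": make_token(" ana", "DIAZ ", "1960-02-01"), "data": {"drug": "tamoxifen"}},
    {"token": make_token("Bo", "Lee", "1955-07-30"), "data": {"drug": "letrozole"}},
]

linked = link_sources({"ehr": ehr, "claims": claims})
# Tokens that appear in both sources identify patients with linkable data.
both = [token for token, by_source in linked.items() if len(by_source) == 2]
print(f"{len(linked)} distinct tokens, {len(both)} found in both sources")
```

In this toy example, only one of the two patients appears in both sources, which also shows why patients with linkable data can be a selected subset of the full population.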

Limitations of data aggregation include the potential for selection bias (because patients with linkable data across clinical settings may differ from those without) and missing information that is unlikely to occur at random. When designing a study, missingness across different clinical data types can be challenging to interpret or adequately understand. This can be particularly problematic for data analysts, who must be familiar with the underlying data structure and provenance of all data types. Moreover, the data pipelines and capture of elements often lack transparency and may not be systematic across all data sources. Often, these data are more challenging to use beca
