Development and psychometric evaluation of item banks for memory and attention – supplements to the EORTC CAT Core instrument

Sample

This study was based on data used within the development of the EORTC CAT Core CF item bank [12, 13]. The sample consisted of 1,030 cancer patients from four countries: Denmark (13.4%), France (15.3%), Poland (27.2%), and the United Kingdom (44.1%). Sample characteristics were: mean age 63 years (range 26–97); 52.6% female; breast cancer as the most common diagnosis (23.3%); 59.7% cancer stage I–II. The local ethics committees of the participating countries approved the study, and written informed consent was obtained before study participation [13].

Study procedure and statistics

In a multistep process, it was determined whether two item banks, one per domain (memory and attention), could be formed from the existing EORTC CAT Core CF item bank [13]; for clarity, the study in which that bank was developed is hereafter referred to as the previous study.

All data analyses were conducted using SAS Enterprise Guide 7.1, except the factor analyses, which were conducted using CBID (https://biostats-shinyr.kumc.edu/CBID/).

Step 1: content analysis for memory and attention items

The static EORTC QLQ-C30 questionnaire contains only two CF items. To develop an item bank suitable for CAT use, more CF items are needed. During the development of the EORTC CAT Core CF item bank, 42 additional items for memory and attention were generated based on a literature search, expert evaluations, and interviews with cancer patients [12] to provide a comprehensive picture of the memory and attention domains. These 42 items were formulated in the QLQ-C30 item style, with a 4-point Likert-type scale ranging from ‘not at all’ to ‘very much’ and a recall period of one week [12]. In the present study, these 42 new items together with the two established CF items from the EORTC QLQ-C30 formed the initial item pool. The resulting 44 CF items were then evaluated by experts in the fields of neurology, (neuro-)psychology, epidemiology, and oncology (n = 5) and assigned to either the memory or the attention domain. Item allocation was considered successful when at least four of the five experts (80% consensus threshold) agreed on the same domain. Items not reaching this threshold were discussed within the expert group and then either allocated to one domain only or dismissed.
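
Purely for illustration, the consensus rule can be expressed as a tally over expert votes. The sketch below is not part of the original analysis; the item names and votes are hypothetical, and the 80% threshold corresponds to four of five experts.

```python
# Hypothetical sketch of the 80% expert-consensus rule from step 1.
# Item names and votes are invented for illustration.
from collections import Counter

CONSENSUS = 0.8  # four out of five experts

def allocate(votes):
    """Return the assigned domain, or None if the item needs expert discussion."""
    domain, count = Counter(votes).most_common(1)[0]
    return domain if count / len(votes) >= CONSENSUS else None

votes_per_item = {
    "forgetting appointments": ["memory"] * 5,
    "losing track of conversations": ["attention"] * 4 + ["memory"],
    "misplacing objects": ["memory"] * 3 + ["attention"] * 2,  # below threshold
}

for item, votes in votes_per_item.items():
    print(item, "->", allocate(votes) or "expert discussion")
```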

Step 2: psychometric analysis of items allocated to the memory and attention domain, respectively

The results of step 1 were subsequently evaluated using confirmatory factor analysis for ordinal data to test item allocation to each domain. In case of misfit to a one-factor model, items were removed from the item pool. For both domains, higher scores per item and in total reflect better patient functioning.

The statistical analyses followed the previous study [13]. Fit of the one-factor model was assessed against the following criteria for acceptable/good fit: root mean square error of approximation (RMSEA) < 0.08, Tucker-Lewis index (TLI) and comparative fit index (CFI) each > 0.95, and standardized root mean square residual (SRMR) < 0.05 [18, 19]. These threshold levels followed the procedure introduced by the COSMIN initiative [20].
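
As a minimal illustration, the decision rule implied by these cut-offs can be written as a simple check; the index values below are hypothetical and not outputs of the CFA software (CBID) used in the study.

```python
# Sketch of the model-fit decision rule from step 2.
# Fit indices are hypothetical; in the study they were produced by CBID.

def acceptable_fit(rmsea, tli, cfi, srmr):
    """Apply the cut-offs used for the one-factor CFA models."""
    return rmsea < 0.08 and tli > 0.95 and cfi > 0.95 and srmr < 0.05

# Example: a model meeting all four criteria
print(acceptable_fit(rmsea=0.06, tli=0.97, cfi=0.98, srmr=0.03))  # True
```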

Step 3: item response theory (IRT) model calibration and evaluation

IRT models assume monotonicity (i.e., the likelihood of an item response reflecting good memory functioning increases with increasing memory score). Hence, monotonicity was assessed for each item by inspecting whether the mean item score increased as the rest score increased (the sum score of all items except the item in focus) [21]. Additionally, infit and outfit indices were calculated to detect differences between the model-expected and the observed responses to each item (acceptable range 0.7 to 1.3). Mean item residuals with 95% confidence intervals (CIs) for each scale were evaluated (acceptable range −0.1 to 0.1) for good model fit.
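
A minimal sketch of the rest-score monotonicity check, assuming item responses coded 0–3 (data simulated for illustration; the study's analyses were run in SAS):

```python
# Sketch of the rest-score monotonicity check from step 3.
# Responses are simulated, coded 0-3; one row per patient, one column per item.
import numpy as np

rng = np.random.default_rng(0)
responses = rng.integers(0, 4, size=(500, 10))  # hypothetical ordinal responses

def mean_by_rest_score(responses, item):
    """Mean score of `item` within each rest-score group."""
    rest = responses.sum(axis=1) - responses[:, item]
    groups = np.unique(rest)
    return groups, np.array([responses[rest == g, item].mean() for g in groups])

# Monotonicity holds (approximately) if the means increase with the rest score.
groups, means = mean_by_rest_score(responses, item=0)
print(np.all(np.diff(means) >= 0))  # strict check; small dips may be tolerated
```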

A generalized partial credit model (GPCM) was calibrated to each domain's item set [22]. In case of potentially inflated slope parameters, possibly caused by local dependence of items [23], an item was re-estimated in a separate model excluding items with high correlations (> 0.8). The slope parameter was then fixed at the obtained estimate and the item added back to the full model. Item residual correlations were used to assess local item independence; acceptable local independence was assumed if correlations were < 0.25 [24].
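
For reference, the GPCM gives the probability of response category k as a function of the latent score θ, an item slope a, and step parameters b_v. The sketch below implements these category probabilities with hypothetical parameters (not estimates from the study):

```python
# Sketch of GPCM category probabilities for a single item (step 3).
# Slope and step parameters are hypothetical.
import numpy as np

def gpcm_probs(theta, a, b):
    """P(X = k | theta) for k = 0..len(b); b holds the step parameters b_1..b_m."""
    # Cumulative sums of a * (theta - b_v); the empty sum for category 0 is 0.
    z = np.concatenate(([0.0], np.cumsum(a * (theta - np.asarray(b)))))
    z -= z.max()                      # numerical stability
    expz = np.exp(z)
    return expz / expz.sum()

# Example: a 4-category item (0-3), matching the QLQ-C30 response format
probs = gpcm_probs(theta=0.5, a=1.2, b=[-1.0, 0.0, 1.5])
print(probs, probs.sum())             # category probabilities sum to 1
```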

Analysis for differential item functioning (DIF) was conducted using ordinal logistic regression for age, sex, cancer site, cancer stage, treatment, country, educational level, cohabitation status, and work status. Owing to the large sample size and multiple testing, DIF was regarded as significant if p < .001. Although statistically significant, a DIF finding may still have only a trivial impact at the domain score level; such ‘trivial’ DIF would not be a concern. We mainly aimed to detect indications of practically relevant DIF, i.e., DIF that might have a clinically relevant impact at the domain level. As described in a previous study, translation differences could lead to differences at the domain score level [25] and hence result in biased findings. To evaluate the practical impact of possible DIF on IRT score estimation, scores based on the standard model were compared with scores from a model allowing different parameters in the two DIF groups for the DIF item in focus [13, 26]. If the domain scores obtained with the two models were similar, the DIF was considered to have no practical importance. Here, we focused in particular on differences greater than the median standard error of the estimated domain scores [27].
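
The general logic of an ordinal-logistic-regression DIF test can be sketched as a likelihood-ratio comparison of models with and without the group covariate. The code below uses simulated data and statsmodels' OrderedModel; it illustrates the approach only and is not the study's SAS implementation.

```python
# Sketch of an ordinal-logistic-regression DIF test (step 3), on simulated data.
import numpy as np
from scipy.stats import chi2
from statsmodels.miscmodels.ordinal_model import OrderedModel

rng = np.random.default_rng(1)
n = 1000
theta = rng.normal(size=n)                       # latent CF score
group = rng.integers(0, 2, size=n)               # e.g., sex or country
latent = 1.5 * theta + 0.6 * group + rng.logistic(size=n)  # group effect -> DIF
item = np.digitize(latent, [-1.5, 0.0, 1.5])     # ordinal response, 0-3

# Reduced model: item response explained by the trait only
reduced = OrderedModel(item, theta[:, None], distr="logit").fit(
    method="bfgs", disp=False)
# Full model: trait plus group membership
full = OrderedModel(item, np.column_stack([theta, group]), distr="logit").fit(
    method="bfgs", disp=False)

lr = 2 * (full.llf - reduced.llf)                # likelihood-ratio statistic, 1 df
print("p =", chi2.sf(lr, df=1))                  # DIF flagged if p < .001
```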

Step 4: evaluation of measurement properties of the item banks used in CAT

For each item bank (memory and attention, respectively), simulations based on observed and Monte Carlo simulated data were used to mimic the performance of CATs of varying lengths (1, 2, 3, …, up to x items). All new CATs were initiated with the original CF item (attention or memory) from the EORTC QLQ-C30. The relative validity (RV) of the newly developed CAT item bank per domain compared with the EORTC QLQ-C30 CF item was evaluated using known-group comparisons. For the observed data, the following groups were compared: disease stage I–II vs. III–IV; working (yes/no); on treatment (yes/no); and the EORTC QLQ-C30 sum scores for the 14 other domains (sum score < 33 vs. ≥ 33).
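
As a sketch of a known-groups RV comparison: RV is often computed as the ratio of F statistics obtained when both instruments are compared across the same known groups. The example below assumes that convention (the study's estimator follows [28]) and uses simulated scores.

```python
# Sketch of a known-groups relative-validity (RV) computation (step 4).
# RV is taken here as the ratio of F statistics, a common convention;
# the exact estimator used in the study follows reference [28].
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(2)
n = 200
stage_advanced = rng.integers(0, 2, size=n).astype(bool)

# Hypothetical scores: the CAT score separates the groups more sharply
cat_score = rng.normal(loc=np.where(stage_advanced, -0.5, 0.0), scale=1.0)
single_item = rng.normal(loc=np.where(stage_advanced, -0.5, 0.0), scale=1.6)

f_cat = f_oneway(cat_score[stage_advanced], cat_score[~stage_advanced]).statistic
f_item = f_oneway(single_item[stage_advanced], single_item[~stage_advanced]).statistic
print("RV =", f_cat / f_item)  # RV > 1 favours the CAT over the single item
```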

For the simulated data, 1,000 simulations were conducted for each subdomain bank. In each simulation, two groups of random size between 50 and 250 were sampled and responses to all items were simulated. The groups’ true subdomain scores (memory or attention) differed randomly, corresponding to an effect size of 0.2–1.2; hence, the simulated groups were known to differ. For both the observed and the simulated data, the RV and sample size requirements of the newly developed CATs compared with using the EORTC QLQ-C30 CF items (memory and attention, respectively) were estimated [28].
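
The Monte Carlo set-up can be sketched as follows; group sizes and effect sizes follow the description above, while the response-generation step (which in the study would draw from the calibrated GPCM) is stubbed out, and all specifics are hypothetical.

```python
# Sketch of the Monte Carlo set-up in step 4: two groups of random size whose
# true latent scores differ by a random effect size. Response generation from
# the calibrated GPCM (cf. gpcm_probs above) is stubbed out.
import numpy as np

rng = np.random.default_rng(3)

def one_simulation():
    n1, n2 = rng.integers(50, 251, size=2)      # group sizes 50-250
    effect = rng.uniform(0.2, 1.2)              # true standardized group difference
    theta1 = rng.normal(0.0, 1.0, size=n1)
    theta2 = rng.normal(effect, 1.0, size=n2)   # groups known to differ
    # ...simulate item responses from the calibrated GPCM and run the CAT...
    return effect, theta1, theta2

results = [one_simulation() for _ in range(1000)]
print(len(results), "simulations; first effect size:", round(results[0][0], 2))
```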
