Comparative Psychometric Performance of Common Generic Paediatric Health-Related Quality of Life Instrument Descriptive Systems: Results from the Australian Paediatric Multi-Instrument Comparison Study

P-MIC study data from children aged 5–18 years (inclusive) were used [17, 19]. P-MIC participants (children and their caregivers) were recruited between June 2021 and August 2022 into three samples: Sample (1) children with or without health conditions recruited via a large tertiary paediatric hospital based in Victoria, Australia; Sample (2) general population children recruited via an online panel available nationally (Pureprofile Australia); and Sample (3) children from nine condition-specific groups (attention-deficit/hyperactivity disorder (ADHD), anxiety and/or depression, autism spectrum disorder (ASD), asthma, eating disorder, epilepsy, recurrent abdominal pain, sleep problems, and tooth problems) recruited via the same online panel as above or, for rarer conditions, via patient organisations associated with the condition. P-MIC study data were from Data Cut 2, dated 10 August 2022, which includes approximately 94% of the total planned P-MIC participants.

2.1 Data Collection

All participants consented and completed an initial survey online via REDCap. Participants were then asked to complete a second online follow-up survey at 4 weeks. A small subset of participants from the online panel general population sample (Sample 2) were asked to complete the follow-up survey at 2 days to enable assessment of test–retest reliability.

All instruments were self-completed by the participant (i.e., no instruments were interviewer administered). Instruments were either proxy reported by the caregiver or self-reported by the child. Children aged 7 years or older who were deemed by their caregiver as currently able to complete questions about their health completed the HRQoL instruments themselves (child self-report), otherwise these were completed by the caregiver (proxy report). Where an instrument was proxy reported, the proxy was asked to rate the child’s health from their perspective (i.e., from the caregiver’s perspective).

For further information on P-MIC study methodology, including details of participant recruitment (i.e., quotas), survey structure, instruments, survey questions, and statistical analysis plans, please see the technical methods paper [19].

2.2 Instruments

The PedsQL core generic version 4.0, EQ-5D-Y-3L, EQ-5D-Y-5L, CHU9D, AQoL-6D adolescent, and HUI3 were included in both the initial and follow-up surveys. As per the prespecified protocol [17], the PedsQL, EQ-5D-Y-3L, EQ-5D-Y-5L, and CHU9D were included in the core set of instruments received by all participants, and the HUI3 and AQoL-6D were included as additional instruments that only some participants were randomised to receive. Although the study team wanted to include all instruments for all participants, feedback received from the consumer group during the design phase of the study raised concern about respondent burden, so efforts were made to reduce this burden where possible. The HUI3 and AQoL-6D were not included in the sample recruited via hospital (Sample 1) to minimise respondent burden (following patient feedback), and in the online panel samples (Samples 2 and 3), participants were randomised to receive either the HUI3, the AQoL-6D, or another generic instrument not included in this analysis. A summary of participants who received each instrument is available in Table 1, and characteristics of the instruments included in the analysis are available in Supplementary Table 1 (see electronic supplementary material [ESM]). The order of the core set of instruments (PedsQL, EQ-5D-Y-3L, EQ-5D-Y-5L, and CHU9D) was randomised to minimise order effects, and another instrument was always placed between the EQ-5D-Y-3L and EQ-5D-Y-5L given their similarity. The AQoL-6D or HUI3 was completed after the other generic HRQoL instruments. Where participants were allocated to an instrument, they were required to answer all instrument questions; hence, there are no missing HRQoL instrument data.

Table 1 Participant characteristics by child age, report type, and child health status

The prioritisation of instruments for inclusion in the study, and of which instruments to include in the core set (received by all participants) or the additional set (received only by some participants via randomisation), was determined by the study team following a review of key literature available at the time of study design [10, 11, 15] and consultation with experts (including clinical, health technology assessment, health economics, government, and consumer experts). The decision was guided by the following factors: (1) instruments commonly used to measure HRQoL in children (instruments were prioritised if they had evidence of strong psychometric performance from single studies), (2) recently developed instruments for measuring HRQoL in children that were likely to be commonly used in future, and (3) instruments that would be useful in informing policy and healthcare decision making in Australia. It was not a requirement that instruments had preference weights available, although the study team did consider which instruments had preference weights available at the time of study design and which were likely to have preference weights available in future. For example, although the EQ-5D-Y-5L and PedsQL did not have preference weights available at the time of study design, they were considered likely to have preference weights available in future. Further details on the justification for the inclusion of each instrument are available in the published study protocol [17].

2.3 Instrument Scoring Used for Analysis

The PedsQL total score was calculated by reverse scoring and linearly transforming raw item responses (0 = 100, 1 = 75, 2 = 50, 3 = 25, 4 = 0), then dividing the sum of all item scores by the number of items [20]. PedsQL domain scores were calculated using a similar approach, whereby raw items were linearly transformed and the sum of all item scores in a domain was divided by the number of items in that domain [20]. An exploratory level sum score (LSS) approach was used to obtain an overall instrument total score for all other instruments. LSSs were calculated by summing the numerical value attached to each item response (e.g., 1 for ‘no problems’ and 5 for ‘extreme problems’ in the EQ-5D-Y-5L) across all items in the instrument. The possible total score range for each instrument varies and is described in Supplementary Table 1 (see ESM). The LSS approach is considered exploratory: it has the advantage of providing an equally weighted score for comparison, but disadvantages include its non-normal distribution and its inability to distinguish between health states that may be quite different from one another [21]. In addition, preference weights were not available for all instruments included in this study, and the aim of this analysis was to understand the descriptive system of each instrument.
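For illustration, a minimal Python sketch of the two scoring approaches is given below (the study analyses were conducted in Stata 17; the item responses shown are hypothetical):

```python
# Minimal sketch of the two scoring approaches described above (illustrative only;
# the study analyses were conducted in Stata 17). Item responses are hypothetical.

def pedsql_total(raw_items):
    """PedsQL: reverse score and linearly transform 0-4 responses to 0-100
    (0 -> 100, 1 -> 75, 2 -> 50, 3 -> 25, 4 -> 0), then average across items."""
    transformed = [100 - 25 * r for r in raw_items]
    return sum(transformed) / len(transformed)

def level_sum_score(item_levels):
    """Exploratory LSS: sum the numerical level of each item response
    (e.g., 1 = 'no problems' ... 5 = 'extreme problems' for the EQ-5D-Y-5L)."""
    return sum(item_levels)

# Hypothetical example responses
print(pedsql_total([0, 1, 2, 0]))        # 81.25 on the 0-100 scale
print(level_sum_score([1, 2, 1, 3, 1]))  # 8; possible range 5-25 for a 5-item, 5-level instrument
```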

2.4 Statistical Analysis

Analyses were completed in Stata Version 17 (StataCorp, College Station, TX, USA). Statistical tests, hypotheses, and thresholds were based on the statistical analysis protocol set a priori by the study team, which is available in the technical methods paper [19]. Where appropriate, subgroup analyses were completed using the following prespecified subgroups: child age (5–12 years vs 13–18 years), report type (proxy vs self-report), and health status (children without a special healthcare need vs children with a special healthcare need) [19, 22]. The child age subgroups (5–12 years and 13–18 years) reflect key child developmental stages (pre-adolescence and adolescence), and this age cut point is consistent with the age versions of the PedsQL, one of the most well-validated paediatric HRQoL instruments [23]. Adjustment for multiple comparisons was not required in the primary analyses as all statistical tests were hypothesis driven or involved different samples. Adjustment for multiple comparisons may have been applicable for the subgroup analyses; however, as this is not commonly performed in psychometric research, subgroup analyses were not adjusted for multiple comparisons.

2.4.1 Distribution of Responses

The distribution of responses was evaluated by descriptively assessing and visually inspecting participant responses to each instrument item. Additionally, ceiling and floor effects at the total instrument level were assessed. As this study includes both general population children and children with health conditions, ceiling effects were assessed only in children with a special healthcare need [22], as these children were expected to report health problems on HRQoL instruments. An instrument was considered to have a ceiling or floor effect if > 15% of participants with a special healthcare need reported the lowest severity category (e.g., ‘no problems’) or the highest severity category, respectively, across all items. This 15% threshold is based on previous thresholds used in the literature [24, 25].
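A minimal Python sketch of the ceiling-effect check is given below (the study analyses were conducted in Stata 17; the column names and item data are hypothetical):

```python
import pandas as pd

# Sketch of the ceiling-effect check described above (illustrative only). An instrument
# shows a ceiling effect if > 15% of children with a special healthcare need report the
# lowest severity level (here coded 1 = 'no problems') on every item.

def ceiling_proportion(df, item_cols, best_level=1):
    """Return the proportion of respondents at the ceiling (best level on all items)."""
    at_ceiling = (df[item_cols] == best_level).all(axis=1)
    return at_ceiling.mean()

# Hypothetical item-level data for children with a special healthcare need
shcn = pd.DataFrame({
    "item1": [1, 1, 2, 1],
    "item2": [1, 1, 1, 3],
    "item3": [1, 2, 1, 1],
})
prop = ceiling_proportion(shcn, ["item1", "item2", "item3"])
print(f"{prop:.0%} at ceiling; ceiling effect = {prop > 0.15}")
```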

2.4.2 Test–Retest Reliability

Test–retest reliability was assessed by comparing instrument total scores between the initial and follow-up measurements for participants who reported no change in health and were allocated to receive their first reminder for the follow-up survey at 2 days. Only participants in the online panel general population sample were allocated to receive the follow-up survey at 2 days. Test–retest reliability was assessed using intraclass correlation coefficient (ICC) estimates and corresponding 95% confidence intervals. ICC estimates were calculated based on an absolute-agreement, two-way mixed-effects model [26]. As per Koo and Li (2016), an ICC of < 0.5 indicates poor reliability, 0.50–0.74 moderate reliability, 0.75–0.90 good reliability, and > 0.90 excellent reliability [26]. An ICC ≥ 0.5 (moderate reliability or better) was considered acceptable test–retest reliability. The primary analysis was completed using the Koo and Li (2016) thresholds [26]; however, it is acknowledged that other thresholds for interpreting ICC results exist. Cicchetti (1994) thresholds, whereby an ICC of < 0.4 indicates poor agreement, 0.40–0.59 fair agreement, 0.60–0.74 good agreement, and ≥ 0.75 excellent agreement, were applied in a sensitivity analysis [27].
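As a worked illustration, the sketch below computes a two-way, absolute-agreement, single-measurement ICC from its ANOVA mean squares in Python (the study analyses were conducted in Stata 17; the paired scores shown are hypothetical):

```python
import numpy as np

# Sketch of a two-way, absolute-agreement, single-measurement ICC (illustrative only;
# the study analyses were run in Stata 17). Data are hypothetical test-retest totals.

def icc_absolute_agreement(scores):
    """scores: n_subjects x k_occasions array of total scores (e.g., test and retest)."""
    n, k = scores.shape
    grand_mean = scores.mean()
    row_means = scores.mean(axis=1)   # per-subject means
    col_means = scores.mean(axis=0)   # per-occasion means
    # Mean squares from the two-way ANOVA decomposition
    ms_rows = k * np.sum((row_means - grand_mean) ** 2) / (n - 1)
    ms_cols = n * np.sum((col_means - grand_mean) ** 2) / (k - 1)
    ss_err = np.sum((scores - row_means[:, None] - col_means[None, :] + grand_mean) ** 2)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)

# Hypothetical initial and 2-day follow-up total scores for five children
scores = np.array([[10, 11], [14, 13], [20, 19], [8, 9], [15, 15]], dtype=float)
print(f"ICC = {icc_absolute_agreement(scores):.2f}")  # ICC >= 0.5 was considered acceptable
```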

2.4.3 Known-Group Validity

Known-group validity was assessed by comparing groups with expected differences in HRQoL, which were set a priori by the study team [19]. Group differences were assessed by comparing the mean instrument total score for each group, and effect sizes were estimated using Cohen’s d [28]. Effect sizes of 0.2–0.49 were considered small, 0.5–0.79 moderate, and ≥ 0.8 large [28, 29]. A mean difference with a p value of < 0.05 and a large effect size (≥ 0.8) was considered acceptable. Children with a special healthcare need were considered a known group hypothesised to differ in HRQoL from children without special healthcare needs [30]. Additionally, sensitivity analyses were conducted on other known groups: children with a chronic health condition, EQ VAS score ≤ 80 [31], PedsQL total score ≤ 69.7 (one standard deviation below the child self-reported population mean for children aged 5–18 years), and PedsQL total score ≤ 74.2 (child self-reported mean from a sample of children with chronic conditions) [20]. PedsQL known-group cut points were not used to assess the known-group validity of the PedsQL itself and were used only to assess the known-group validity of other instruments.
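A minimal Python sketch of the known-group comparison is given below (the study analyses were conducted in Stata 17; the group scores shown are hypothetical):

```python
import numpy as np
from scipy import stats

# Sketch of the known-group comparison described above (illustrative only;
# the study analyses were run in Stata 17). Total scores are hypothetical.

def cohens_d(group1, group2):
    """Cohen's d using the pooled standard deviation."""
    n1, n2 = len(group1), len(group2)
    pooled_var = ((n1 - 1) * group1.var(ddof=1) + (n2 - 1) * group2.var(ddof=1)) / (n1 + n2 - 2)
    return (group1.mean() - group2.mean()) / np.sqrt(pooled_var)

# Hypothetical total scores: children with vs without a special healthcare need
shcn = np.array([14.0, 16.0, 12.0, 18.0, 15.0, 17.0])
no_shcn = np.array([8.0, 9.0, 10.0, 7.0, 11.0, 9.0])

t_stat, p_value = stats.ttest_ind(shcn, no_shcn)   # test of the mean difference
d = cohens_d(shcn, no_shcn)
# p < 0.05 together with a large effect size (d >= 0.8) was considered acceptable
print(f"p = {p_value:.3f}, Cohen's d = {d:.2f}")
```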

2.4.4 Convergent and Divergent Validity

The assessment of an instrument’s convergent or divergent validity usually requires a ‘gold standard’ against which to examine how much the instrument converges or diverges. Although there is currently no gold standard instrument for measuring quality of life in children, the PedsQL is a very commonly used instrument that has undergone extensive content validity testing [20, 23]. Hence, for the purposes of assessing convergent and divergent validity, the PedsQL was chosen as the comparator instrument. Convergent and divergent validity were assessed by correlating each item in the EQ-5D-Y-3L, EQ-5D-Y-5L, CHU9D, AQoL-6D, and HUI3 with each item and domain in the PedsQL. Correlations were calculated using Spearman’s correlation, as data were not normally distributed. Correlations of 0.1–0.29 were considered weak, 0.3–0.49 moderate, and ≥ 0.5 strong [28]. Through an a priori consensus approach, members of the study team reviewed instrument item combinations and hypothesised whether an item of one instrument would be at least moderately correlated with a PedsQL item (to assess convergence) or not correlated at all with a PedsQL item (to assess divergence) [19]. Hypotheses were based on similarity (convergence) or dissimilarity (divergence) of item wording [19]. Convergent validity was evaluated as the proportion of an instrument’s items hypothesised to be at least moderately correlated with PedsQL items that showed at least a statistically moderate correlation. Divergent validity was evaluated as the proportion of an instrument’s items hypothesised not to be correlated with PedsQL items that showed a statistically weak correlation.
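The sketch below illustrates one such item-level Spearman correlation and its interpretation against the thresholds above (the item names and responses are hypothetical; the study analyses were conducted in Stata 17):

```python
import numpy as np
from scipy import stats

# Sketch of an item-level Spearman correlation (illustrative only;
# item names and ordinal responses are hypothetical).

eq5dy_mobility = np.array([1, 2, 1, 3, 2, 1, 4, 2])   # hypothetical EQ-5D-Y item levels
pedsql_walking = np.array([0, 1, 0, 2, 2, 0, 3, 1])   # hypothetical PedsQL item responses

rho, p_value = stats.spearmanr(eq5dy_mobility, pedsql_walking)

# Thresholds used above: 0.1-0.29 weak, 0.3-0.49 moderate, >= 0.5 strong
strength = "strong" if rho >= 0.5 else "moderate" if rho >= 0.3 else "weak"
print(f"rho = {rho:.2f} ({strength}), p = {p_value:.3f}")
```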

2.4.5 Responsiveness

Responsiveness was assessed by comparing total instrument scores between the initial and follow-up surveys for children whose caregiver reported a change in the child’s health between the two surveys. Analysis focused on participants allocated to receive the follow-up survey at 4 weeks. The mean total scores at the initial and follow-up surveys were compared using a paired t-test; a mean difference in the expected direction with a p value of < 0.05 was considered acceptable and was used as the main indicator of responsiveness. Responsiveness was also assessed by calculating the standardised response mean (SRM), the ratio of the mean change to the standard deviation of that change, to provide a more detailed picture of instrument responsiveness [32]. An SRM of 0.2–0.49 was considered small, 0.5–0.79 moderate, and ≥ 0.8 large [28, 29, 32]. Caregivers were asked to report their child’s change in health in the follow-up survey. Change in health was determined as follows:

1. Change in general health was reported as (1) much better, (2) somewhat better, (3) about the same, (4) somewhat worse, or (5) much worse. Responses were split into two categories for analysis: ‘much better’ and ‘somewhat worse and much worse’.

2. For participants who reported a health condition in the initial survey, caregivers were asked to report their child’s change in ‘main health condition’. The same categorisation used for change in general health was applied.

For improving health, responsiveness was assessed only in those who reported a change of ‘much better’, rather than ‘somewhat better or much better’, as this more stringent classification was felt to provide a clearer indication that a change in health had occurred. Owing to the small number of children with worsening health, the same stringent classification was not possible, and ‘somewhat worse’ and ‘much worse’ were pooled together.

A sensitivity analysis was conducted whereby the responsiveness analysis described above was repeated in only participants recruited via hospital (Sample 1), as this sample had a higher follow-up survey response rate compared with other samples.
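A minimal Python sketch of the paired t-test and SRM calculation is given below (the study analyses were conducted in Stata 17; the paired scores shown are hypothetical):

```python
import numpy as np
from scipy import stats

# Sketch of the responsiveness analysis described above (illustrative only; the study
# analyses were run in Stata 17). Scores are hypothetical paired totals for children
# whose caregivers reported their health as 'much better' at follow-up.

initial = np.array([18.0, 20.0, 16.0, 22.0, 19.0, 21.0])
follow_up = np.array([12.0, 15.0, 13.0, 16.0, 14.0, 17.0])

# Paired t-test on the mean change (main indicator of responsiveness)
t_stat, p_value = stats.ttest_rel(initial, follow_up)

# Standardised response mean: mean change divided by the SD of that change
change = follow_up - initial
srm = change.mean() / change.std(ddof=1)

# SRM thresholds used above (in absolute value): 0.2-0.49 small, 0.5-0.79 moderate, >= 0.8 large
print(f"p = {p_value:.3f}, SRM = {srm:.2f}")
```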

2.4.6 Summary of Psychometric Performance

The psychometric performance of all instruments was summarised by categorising each instrument, for each psychometric attribute assessed, as (1) having significant evidence of good performance (tick), (2) having significant evidence of poor performance (cross), or (3) having inconclusive evidence of performance (question mark). Significant evidence of good performance (tick) for each psychometric attribute was based on the following thresholds:

Response distribution (no ceiling effect): < 15% of participants with a special healthcare need report the lowest severity or frequency level (e.g., ‘no problems’) across all instrument items.

Test–retest reliability: moderate, good, or excellent agreement (ICC ≥ 0.5).

Known-group validity: mean difference with a p value of < 0.05 and a large effect size (Cohen’s d ≥ 0.8).

Convergent and divergent validity: items at least moderately correlated (Spearman’s correlation ≥ 0.3) with other instrument items where hypothesised to be correlated (convergent validity), and weakly correlated (Spearman’s correlation < 0.3) where hypothesised not to be correlated (divergent validity).

Responsiveness: significant mean difference (p value < 0.05).

An instrument was considered to have inconclusive evidence for a psychometric attribute if the sample size used to assess the psychometric attribute was too small (i.e., inadequate or doubtful according to the 2019 Consensus-based Standards for the selection of health Measurement Instruments [COSMIN] guidelines) [33], or the direction of evidence was unclear.
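A small sketch of this summary categorisation is given below (the attribute result and sample-size adequacy flag are hypothetical inputs):

```python
from typing import Optional

# Sketch of the tick/cross/question mark categorisation described above
# (illustrative only; inputs are hypothetical).

def summarise_attribute(meets_threshold: Optional[bool], adequate_sample: bool) -> str:
    """Categorise one psychometric attribute as tick, cross, or question mark."""
    if not adequate_sample or meets_threshold is None:
        return "?"          # inconclusive: sample too small or direction of evidence unclear
    return "tick" if meets_threshold else "cross"

print(summarise_attribute(True, True))    # tick: significant evidence of good performance
print(summarise_attribute(False, True))   # cross: significant evidence of poor performance
print(summarise_attribute(True, False))   # ?: inadequate sample size per COSMIN guidance
```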
