Estimating Costs Associated with Disease Model States Using Generalized Linear Models: A Tutorial

To inform decision analytic disease models with cost evidence, our research question is: what are the costs associated with disease states over discrete time periods corresponding to the cycle length of a decision model? The costs can be of any type, such as total healthcare costs (for example, primary and/or hospital care costs), patient out-of-pocket costs or social care costs. The disease states are key states related to the disease and/or intervention, which are included in the decision analytic model to assess the cost effectiveness of the intervention. For example, disease states may be disease stages or events, such as cancer progression stages or the occurrence of a myocardial infarction (MI). The scope of costs, disease states of interest and cycle length should be consistent with the choices made while conceptualizing the economic evaluation and decision model. In addition, key patient characteristics may be important factors in the economic evaluation, and thus in estimating the costs of disease states from participant-level data, since they may modify health effects, costs and possibly the cost effectiveness of the intervention.

To answer the research question, we will ideally use a longitudinal dataset from a cohort of participants reporting their healthcare and other resource use and costs and disease status over time. This longitudinal data will be used to form estimation data, which have multiple records per participant with each record including the costs accumulated over the periods of interest and the disease state status in the respective periods. All the records from all the participants will be pooled to develop the cost prediction model using participants’ profiles and time-updated disease state status. The developed cost model will allow the prediction of individual patient costs, taking into account participant characteristics, model states and the interactions between them.

2.1 Step 1. Preparing the Dataset for Estimating Costs of Disease States

2.1.1 Raw Dataset Generation

The first step is to prepare the dataset to support the cost estimation analysis. This dataset should include records for each participant for discrete time periods over which costs are estimated, with each record including the outcome cost variable and a number of covariates representing the participant’s characteristics. For example, if data is available for the hospital care costs of an individual over 10 years but we are interested in estimating annual hospital care costs, we would allocate costs into respective annual periods in chronological order and generate 10 records or rows with annual costs for this individual. Each row represents a unique record contributing to the analysis. The column of costs over the discrete periods (e.g. annual hospital care cost) represents the outcome.
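The allocation of costs into annual records described above can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the function name, the record fields (`id`, `year`, `annual_cost`) and the example costs are hypothetical, and years are approximated as 365.25 days since entry.

```python
from datetime import date

def annual_cost_records(participant_id, entry_date, cost_events, n_years):
    """Allocate dated costs into consecutive annual periods since entry.

    cost_events: list of (date, cost) pairs for one participant.
    Returns one record (dict) per annual period, in chronological order.
    """
    records = [
        {"id": participant_id, "year": k + 1, "annual_cost": 0.0}
        for k in range(n_years)
    ]
    for event_date, cost in cost_events:
        # Whole years elapsed since entry (assumption: 365.25-day years).
        k = int((event_date - entry_date).days // 365.25)
        if 0 <= k < n_years:
            records[k]["annual_cost"] += cost
    return records

# Hypothetical participant followed for 10 years with three hospital episodes.
rows = annual_cost_records(
    participant_id=1,
    entry_date=date(2010, 1, 1),
    cost_events=[
        (date(2010, 3, 5), 1200.0),
        (date(2010, 11, 20), 300.0),
        (date(2014, 6, 1), 5400.0),
    ],
    n_years=10,
)
```

Each element of `rows` is one record contributing to the analysis; years without any episode keep a cost of zero, which is why zero observations are common in such datasets.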

Two types of individual characteristics are further needed to estimate costs of disease states: the disease state indicators and the other individual characteristics associated with the costs. The disease state indicators are specific to the individual and to each discrete time period, and can change across time periods with an individual’s disease trajectory. For example, an individual remains in the ‘without MI’ state until they experience an MI, then moves into an annual period ‘had MI in same year’, followed by ‘had MI 1 year ago’ and so on, corresponding to the timing of the MI with respect to the current time period. In this example, ‘without MI’, ‘had MI in same year’ and ‘had MI 1 year ago’ represent different states and, therefore, distinct columns in the dataset to support estimation of costs. Distinct disease states could be specified by more than one disease state descriptor (e.g. ‘without MI or stroke’ requires both ‘without MI’ and ‘without stroke’ descriptors to be met) (Fig. 1). The choice of disease state descriptors is pre-specified but could be adjusted (e.g. the number of temporal categories) alongside covariate selection in the model selection step (see Step 3). The other individual characteristics of interest include, for example, the individual’s age, sex, and other socio-demographic and clinical risk factors that determine the extent of healthcare costs. Ideally, characteristics that are plausible predictors of healthcare costs, given data availability, should be identified from previous evidence before cost modelling. Most of these characteristics are likely to be specific to individuals and fixed at entry into the model but some, such as age, may be updated over the time periods in the dataset.
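The time-updated MI indicators in this example can be generated as in the sketch below. The column names and the pooling of all later periods into a single ‘2 or more years ago’ category are illustrative assumptions; the number of temporal categories is a modelling choice.

```python
# Hypothetical state columns; the last one pools all later periods.
STATES = ["without_mi", "mi_same_year", "mi_1y_ago", "mi_2plus_y_ago"]

def mi_state(year, mi_year):
    """Return the MI state for an annual period, given the period of the MI."""
    if mi_year is None or year < mi_year:
        return "without_mi"
    lag = year - mi_year
    if lag == 0:
        return "mi_same_year"
    if lag == 1:
        return "mi_1y_ago"
    return "mi_2plus_y_ago"

def state_indicator_rows(n_years, mi_year=None):
    """Build one row per annual period with 0/1 indicators for each state."""
    rows = []
    for year in range(1, n_years + 1):
        current = mi_state(year, mi_year)
        row = {"year": year}
        row.update({state: int(state == current) for state in STATES})
        rows.append(row)
    return rows

# Participant followed for 5 years who has an MI in year 3.
state_rows = state_indicator_rows(n_years=5, mi_year=3)
```

Exactly one indicator equals 1 in each row, so in a regression one of the states (typically ‘without MI’) would serve as the reference category.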

Fig. 1 Schematic of dataset for modelling healthcare costs associated with disease states

2.1.2 Handling Censored and Missing Data

Typically, individual patient data is subject to administrative censoring (e.g. end of data collection due to end of follow-up in the study). In our context, death is an event of interest and not a censoring event; all costs in year of death are observed. In effect, ‘death in year’ is usually a covariate in the cost model as we want to assess its impact on costs. Simple approaches to handling censored cost data are to (1) add a covariate indicating the proportion of period unobserved; or (2) exclude all observations with partially observed data due to censoring (if sample size is generous).
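Approach (1) needs, for each record, the proportion of the period that is unobserved. A minimal sketch, assuming time is measured in days since entry and that follow-up ends at a known day (`censor_day`, a hypothetical name):

```python
def proportion_unobserved(period_start, period_length, censor_day):
    """Proportion of a period lost to administrative censoring (0 to 1)."""
    period_end = period_start + period_length
    if censor_day >= period_end:
        return 0.0  # period fully observed
    if censor_day <= period_start:
        return 1.0  # period not observed at all
    return (period_end - censor_day) / period_length

# Year 2 runs from day 365 to day 730; follow-up ends on day 638.75.
censoring_covariate = proportion_unobserved(365.0, 365.0, 638.75)
```

Records with a value of 1 carry no cost information and would be dropped; under approach (2) any record with a value above 0 would be excluded instead.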

We may encounter missing cost data, which is frequently the case when cost data is collected from patients (e.g. a case report form in a clinical trial) rather than sourced from linked routine healthcare data (e.g. hospital or primary care data). Generally, multiple imputation under the missing-at-random assumption is used in this context, as single imputation methods overstate precision [10]. Violations of the missing-at-random assumption, a particular consideration in the presence of substantial attrition in the sample, would require further methods [14,15,16]. We may also need to handle missing values of covariates, which has been discussed in detail elsewhere [17].

2.1.3 Covariate Specification

For continuous covariates, we will need to specify their functional form in the model. If the relationship between the covariate and the outcome is known, we can transform the covariate correspondingly (e.g. natural logarithm transformation). Such a relationship can be informed by previous studies or preliminary analyses. When the relationship is complex, other approaches, including (1) spline effects, (2) polynomial effects and (3) categorization [16], should be considered.

To facilitate model interpretation, we recommend standardizing continuous covariates and explicitly choosing a reference category for discrete (binary and categorical) covariates. For example, for a cohort with a mean and standard deviation of age of 59 and 9 years, respectively, we can standardize age by centring at 60 years, a round number close to the mean, and expressing it per 10 years using the transformation (age − 60)/10; for BMI (kg/m²) categorized into underweight (< 18.5), healthy weight (18.5–25), overweight (25–30) and obesity (≥ 30), we can choose the healthy weight category as the reference.
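The two transformations in this example can be written as the short sketch below. The function and column names are hypothetical, and assigning the shared boundary value of 25 to the overweight category is an assumption, since the text leaves it open.

```python
def standardize_age(age, centre=60.0, scale=10.0):
    """Centre age at 60 years and express it per 10 years."""
    return (age - centre) / scale

def bmi_dummies(bmi):
    """Indicator covariates for BMI category.

    Healthy weight (18.5 to below 25) is the reference category, so it has
    no column of its own: all three indicators are 0 for healthy weight.
    """
    return {
        "bmi_underweight": int(bmi < 18.5),
        "bmi_overweight": int(25 <= bmi < 30),
        "bmi_obese": int(bmi >= 30),
    }

age_std = standardize_age(70)      # 1.0, i.e. 10 years above the centre
reference = bmi_dummies(22.0)      # healthy weight: all indicators are 0
obese = bmi_dummies(31.5)
```

With this coding, the model intercept refers to a 60-year-old of healthy weight, and each coefficient is read as a contrast with that profile.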

2.2 Step 2. Candidate Statistical Models for Estimating Costs of Disease States

2.2.1 Common Candidate Statistical Models

The statistical models for modelling costs are chosen based on the features of the cost data and of the statistical models. One feature of cost data is that the distribution of costs is typically right skewed (a long tail at the higher costs), which may make ordinary linear regression, with its assumptions of normality and homoscedasticity of the residuals (i.e. errors), unsuitable. Therefore, the GLM framework is often employed by specifying a link function \(g\) and a distribution family, which specify the mean and variance functions. Through the inverse link function \((g^{-1}(\cdot))\), \(E\left(y|x\right)=\mu\), the expected value of the cost \(y\) given a vector of covariates \(x\), can be calculated from the linear predictor \((x\beta )\):

$$g\left(\mu \right)=x\beta$$

where \(\beta\) is the vector of the regression coefficients.

In a GLM, the variance is a power function of the mean: \(v\left(y|x\right)={\theta }_{0}{\mu }^{{\theta }_{1}}\)

where \(\mu\), \(y\) and \(x\) are as above, \(v\) is the variance, \({\theta }_{0}\) is a constant, and \({\theta }_{1}\) indicates the mean–variance power relationship.

\({\theta }_{1}=0\) corresponds to a Gaussian error variance, \({\theta }_{1}=1\) to a Poisson variance, and \({\theta }_{1}=2\) to a Gamma variance.

For modelling healthcare costs, three common distributions are the Gaussian, Poisson and Gamma distributions. Depending on the distribution, common link functions are the identity, natural logarithm, inverse and square root links. The most popular combinations of link function and distribution for healthcare costs are linear regression (identity link with Gaussian distribution) and Gamma regression with a natural logarithm link [9].

Another feature of cost data is a large proportion of zero observations. This is usually addressed using two-part models, with the first part, typically a logistic or probit regression, modelling the probability of incurring any cost, and the second part modelling the cost conditional on incurring any [9]. The expected cost from the two-part model is the product of the expectation of each part:

$$E\left(y|x\right)=Prob\left(y>0|x\right)E\left(y|x, y>0\right)$$

where \(y\) is the cost outcome and \(x\) is a vector of covariates.

Both a one-part model (i.e. a single regression equation) and two-part model (two regression equations with the first modelling the probability of incurring costs and the second the costs, conditional on incurring any) should be considered. We should use six GLM specifications defined using the combinations of two link functions (identity and natural logarithm link) and three variance functions (Gaussian, Poisson and Gamma distribution) as candidate models for the one-part model and the second part of the two-part model.
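The resulting set of candidate models can be enumerated as in the sketch below, which only lists the specifications and does not fit them. The dictionary keys are hypothetical labels, and logistic regression is assumed for the first part of the two-part models (a probit first part would be an equally valid choice).

```python
from itertools import product

LINKS = ["identity", "log"]
FAMILIES = ["gaussian", "poisson", "gamma"]

# Six GLM specifications: every combination of link and variance function.
glm_specs = list(product(LINKS, FAMILIES))

candidates = []
for link, family in glm_specs:
    # One-part model: a single GLM fitted to all records.
    candidates.append({"structure": "one-part", "link": link, "family": family})
    # Two-part model: part 1 models any cost, part 2 is a GLM on positive costs.
    candidates.append({
        "structure": "two-part",
        "first_part": "logistic",
        "link": link,
        "family": family,
    })
```

This gives 12 candidate models in total, each of which is then fitted and screened with the tests described in this step.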

2.2.2 Initial Set of Covariates

For each candidate model specification, the model should be fitted to the data to aid model selection. Initially, the full set of pre-specified covariates from the prepared dataset could be used in every candidate statistical model. We can also perform covariate selection (described in Step 3) for each candidate model before selecting the promising candidate statistical models in the next step.

2.2.3 Tests to Choose Statistical Model Specification

2.2.3.1 The Hosmer-Lemeshow test

The appropriateness of the link function can be tested using the Hosmer-Lemeshow test [9, 18]. The test regresses the residual error \((e)\) on binary indicators for the deciles of the predicted costs \(({D}_{1}\,\mathrm{to}\,{D}_{10})\), and tests the joint significance of the coefficients, with a significant test indicating an inappropriate link function:

$$e\sim {D}_{1}+{D}_{2}+{D}_{3}+{D}_{4}+{D}_{5}+{D}_{6}+{D}_{7}+{D}_{8}+{D}_{9}+{D}_{10}$$
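A minimal sketch of this test is given below. It relies on the fact that regressing the residuals on ten decile indicators without an intercept returns the mean residual in each decile as the coefficients. The sketch assumes homoscedastic errors and returns only the F statistic, to be compared with an F distribution with 10 and n − 10 degrees of freedom; the function name and the toy data are hypothetical.

```python
def hosmer_lemeshow_f(y, y_hat, n_groups=10):
    """Decile mean residuals and the F statistic for their joint significance."""
    n = len(y)
    resid = [obs - pred for obs, pred in zip(y, y_hat)]
    # Assign each record to a decile of predicted cost.
    order = sorted(range(n), key=lambda i: y_hat[i])
    groups = [[] for _ in range(n_groups)]
    for rank, i in enumerate(order):
        groups[rank * n_groups // n].append(resid[i])
    # Coefficients of the indicator regression are the group means.
    means = [sum(g) / len(g) for g in groups]
    explained = sum(len(g) * m * m for g, m in zip(groups, means))
    unexplained = sum((e - m) ** 2 for g, m in zip(groups, means) for e in g)
    f_stat = (explained / n_groups) / (unexplained / (n - n_groups))
    return means, f_stat

# Toy data: residuals of +1/-1 that average to zero within every decile.
y_hat = [float(v) for v in range(1, 101)]
y_ok = [h + (1.0 if i % 2 == 0 else -1.0) for i, h in enumerate(y_hat)]
means_ok, f_ok = hosmer_lemeshow_f(y_ok, y_hat)

# Same data, but costs in the top decile are under-predicted by 5.
y_bad = [v + 5.0 if i >= 90 else v for i, v in enumerate(y_ok)]
means_bad, f_bad = hosmer_lemeshow_f(y_bad, y_hat)
```

In the second dataset the mean residual in the top decile is 5 and the F statistic is large, which is the pattern that signals a poorly fitting link function.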

2.2.3.2 The Pregibon link test

The appropriateness of the link function can also be tested using the Pregibon link test [19]. The test regresses the costs from the data \((y)\) on the linear predictor \(\left(x\widehat{\beta }\right)\) and the squared linear predictor \(\left[{\left(x\widehat{\beta }\right)}^{2}\right]\) using an identical GLM specification, with a significant coefficient for the squared linear predictor indicating an inappropriate link function:

$$y\sim 1+x\widehat{\beta }+{\left(x\widehat{\beta }\right)}^{2}$$
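The sketch below illustrates the link test for the simplest case only, an identity link with Gaussian distribution, where the refit reduces to ordinary least squares. For other GLM specifications the refit would use the same link and family as the original model, which this sketch does not implement. The function names and the toy data are hypothetical.

```python
import math

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def link_test_identity(y, linear_predictor):
    """OLS of y on [1, xb, xb^2]; returns the xb^2 coefficient and its t value."""
    design = [[1.0, lp, lp * lp] for lp in linear_predictor]
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(3)]
    beta = solve(xtx, xty)
    resid = [yi - sum(b * v for b, v in zip(beta, r)) for r, yi in zip(design, y)]
    sigma2 = sum(e * e for e in resid) / (len(y) - 3)
    # Variance of the xb^2 coefficient: sigma2 times the last diagonal
    # element of the inverse of X'X.
    var_sq = sigma2 * solve(xtx, [0.0, 0.0, 1.0])[2]
    return beta[2], beta[2] / math.sqrt(var_sq)

# Toy data in which costs are curved in the linear predictor.
lp = [i / 2 for i in range(1, 41)]
noise = [0.5 if i % 2 == 0 else -0.5 for i in range(40)]
y = [2.0 + v + 0.05 * v * v + e for v, e in zip(lp, noise)]
coef_sq, t_sq = link_test_identity(y, lp)
```

Here the coefficient on the squared linear predictor is clearly different from zero, which would indicate that an identity link is inappropriate for these data.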

2.2.3.3 The modified Park’s test

The appropriateness of the distribution family can be checked using the modified Park’s test [20]. The test examines the relationship between the variance and the mean, based on the power function described above for the different GLM distributions. The modified Park’s test regresses the squared residual error \(({\left(y-\widehat{y}\right)}^{2})\) on the natural logarithm of predicted costs \((\mathrm{ln}\left(\widehat{y}\right))\) using a GLM specification with Gamma distribution and usually a natural logarithm link, so that the squared residual is modelled on the logarithmic scale. A coefficient on \(\mathrm{ln}\left(\widehat{y}\right)\) close to 0 indicates a Gaussian distribution, close to 1 a Poisson distribution, and close to 2 a Gamma distribution:

$${\left(y-\widehat{y}\right)}^{2}\sim \mathrm{ln}\left(\widehat{y}\right)$$
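The idea behind the test can be illustrated with a simplified sketch. Note the substitution: instead of the Gamma GLM used by the modified test, the code fits an ordinary least squares line to the log of the squared residuals (the original Park test), which targets the same mean–variance power but is not the modified test itself. Function names and the toy data are hypothetical, and residuals must be non-zero for the logarithm to exist.

```python
import math

def park_slope(y, y_hat):
    """OLS slope of ln(squared residual) on ln(predicted cost)."""
    log_pred = [math.log(h) for h in y_hat]
    log_sq_resid = [math.log((obs - h) ** 2) for obs, h in zip(y, y_hat)]
    mean_x = sum(log_pred) / len(log_pred)
    mean_y = sum(log_sq_resid) / len(log_sq_resid)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(log_pred, log_sq_resid))
    sxx = sum((a - mean_x) ** 2 for a in log_pred)
    return sxy / sxx

def nearest_family(slope):
    """Map the estimated power to the closest of the three candidate families."""
    powers = {0: "gaussian", 1: "poisson", 2: "gamma"}
    return powers[min(powers, key=lambda p: abs(slope - p))]

# Toy data: errors are always 20% of the predicted cost, so the variance
# is proportional to the squared mean.
y_hat = [100.0, 200.0, 400.0, 800.0, 1600.0]
y = [h * 1.2 if i % 2 == 0 else h * 0.8 for i, h in enumerate(y_hat)]
slope = park_slope(y, y_hat)
family = nearest_family(slope)
```

In this constructed example the slope is 2, pointing to a Gamma variance function; with real data the estimate will rarely land exactly on 0, 1 or 2 and some judgement is needed.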

Statistical models that demonstrate promise are taken forward.

2.3 Step 3. Selecting the Final Model

The model selection thereafter has two parts: selection of covariates for each remaining candidate statistical model and selection of the statistical model from the final specifications of all candidate statistical models.

2.3.1 Covariate Selection

The cost models are intended to predict costs in decision models for patients with particular characteristics at entry. Therefore, cost models should perform well not only across the population but potentially also at the individual patient level. Thus, all covariates retained in models should be reliably associated with cost. To minimize the likelihood of spurious associations, the covariates in final cost models, unless their inclusion was informed by strong previous evidence with consistent estimates in our dataset, are expected to reach statistical significance, and their inclusion and retention are subject to covariate selection.

Stepwise selection using a pre-specified level of statistical significance (e.g. 5%) is widely used given its simplicity and availability in statistical software [21, 22]. However, stepwise approaches may lead to unstable selection and overfitting. Alternative covariate selection approaches that aim to address these issues, such as bootstrapped stepwise selection and penalised techniques (e.g. the least absolute shrinkage and selection operator, LASSO), have been proposed [15]. The bootstrapping approach extends the stepwise approach by performing selection in bootstrap samples and selecting covariates based on how frequently they are selected. It has the potential to address the instability of the selection, but has a much higher computational burden. The LASSO method constrains the regression coefficients and shrinks some coefficient estimates to zero to optimize covariate selection. This approach may address overfitting, but it may also end up including implausible covariates or omitting known predictive factors [15].

For a two-part model, covariate selection could be performed for each part of the model, as covariates may have different impacts on the probability of incurring costs and on the costs conditional on incurring any.

2.3.2 Final Model Selection

Finally, the performance of each final statistical model specification should be checked against the observed costs. The model performance can be assessed with three measures: mean error, mean absolute error, and root mean squared error. Mean error (ME) is the mean of the residual errors, which tests for aggregate bias. Mean absolute error (MAE) is the mean of the absolute values of the residual errors, which tests for individual-level predictive accuracy. Root mean squared error (RMSE) is the square root of the mean of the squared residual errors, which tests for goodness of fit. Smaller absolute values of these measures indicate better performing models.

We can also perform a visual inspection of model performance by plotting the mean prediction error by decile of predicted outcome to check for systematic errors not detected by the ME, MAE and RMSE above. Better fitting models have smaller errors across deciles of predicted outcomes.
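The three measures can be computed directly from observed and predicted costs, as in this short sketch (the function name and the example values are hypothetical):

```python
import math

def performance_measures(y, y_hat):
    """Mean error, mean absolute error and root mean squared error."""
    resid = [obs - pred for obs, pred in zip(y, y_hat)]
    n = len(resid)
    me = sum(resid) / n
    mae = sum(abs(e) for e in resid) / n
    rmse = math.sqrt(sum(e * e for e in resid) / n)
    return me, mae, rmse

observed = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 330.0, 370.0]
me, mae, rmse = performance_measures(observed, predicted)
```

In this example the ME is 0 while the MAE is 20 and the RMSE is about 22.4: errors that cancel on aggregate can still be sizeable for individuals, which is why the measures are read together.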

2.3.3 Consideration of Interactions

We can further refine the cost model by considering interactions between key covariates. Such considerations should be pre-specified to limit data dredging. For the cost model of interest, we focus on the interactions between acute disease events (e.g. experiencing MI and stroke in the same year). The overall impact of co-occurring acute disease events on costs may not be a simple addition of the impact of each event. However, it is also difficult to assess all possible interactions in view of the number of possible combinations. We suggest practical criteria for the choice of interactions to consider, based on (1) the number of co-occurrences in the same period and (2) the percentage of co-occurrences out of the total individual occurrences of the respective events. The purpose is to ensure that sufficient data is available to reliably estimate interactions. For example, we may investigate the interaction between MI and stroke if (1) the number of cases in which both MI and stroke occur in the same year is more than 50; and (2) this number is > 5% of both the total number of MIs and the total number of strokes. The thresholds may be smaller if we focus on rarer but costly events. We may also need to consider interactions between other participant characteristics, which have been discussed in detail elsewhere [16].
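The two criteria can be checked with a small helper such as the one below. The function name and the event counts are hypothetical; the default thresholds follow the example in the text (more than 50 co-occurrences and more than 5% of each event).

```python
def consider_interaction(n_both, n_event_a, n_event_b,
                         min_count=50, min_share=0.05):
    """Return True if an interaction between two events is worth investigating.

    n_both: periods in which both events occur.
    n_event_a, n_event_b: total periods with each event (including n_both).
    """
    if n_event_a == 0 or n_event_b == 0:
        return False
    enough_cases = n_both > min_count
    enough_share = (n_both / n_event_a > min_share
                    and n_both / n_event_b > min_share)
    return enough_cases and enough_share

# 60 co-occurrences among 800 MIs (7.5%) and 1000 strokes (6%).
mi_stroke = consider_interaction(60, 800, 1000)
# Same count, but only 3% of 2000 strokes.
too_diluted = consider_interaction(60, 800, 2000)
```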

2.4 Step 4. Use of the Cost Model

The final cost model can be used to (1) predict the cost for individuals, and (2) derive the mean effects of events on costs across particular patient populations.

2.4.1 Cost Prediction Given Individual’s Characteristics

To predict costs of an individual in a particular time period, we should prepare the individual’s characteristics to correspond to the respective characteristics in the model’s specification. Thereafter, for one-part models, we can use the prepared individual’s characteristics together with the model’s parameter estimates to generate the predicted cost. For two-part models, we should use the prepared individual’s characteristics together with the parameter estimates of each part of the model, with the first part generating the probability of incurring any costs \(({Probability}_{P1})\) and the second part generating the costs conditional on incurring any costs \(({Cost}_{P2})\). With the predictions from both parts, we can generate the predicted costs with the following formula:

$$Predicted\, costs={Probability}_{P1}\times {Cost}_{P2}$$

If logistic regression is used for the first part of the two-part model, \({Probability}_{P1}\) can be calculated from the odds of incurring any costs \(({Odds}_{P1})\) from the logistic regression using the following formula:

$${Probability}_{P1}={Odds}_{P1}/(1+{Odds}_{P1})$$
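A two-part prediction can be sketched as follows, assuming logistic regression for the first part and a natural logarithm link for the second part. The covariate names and all coefficient values are invented for illustration and do not come from any fitted model.

```python
import math

def linear_predictor(x, beta):
    """Sum of coefficient times covariate value over the model's covariates."""
    return sum(beta[name] * x[name] for name in beta)

def predict_two_part(x, beta_part1, beta_part2):
    """Predicted cost = probability of any cost times cost given any cost."""
    odds = math.exp(linear_predictor(x, beta_part1))        # logistic part
    probability = odds / (1.0 + odds)
    cost_if_any = math.exp(linear_predictor(x, beta_part2))  # log-link part
    return probability * cost_if_any

# Hypothetical individual and coefficients (illustrative values only).
person = {"intercept": 1.0, "age_std": 1.0, "mi_same_year": 1.0}
beta_part1 = {"intercept": 0.0, "age_std": 0.0, "mi_same_year": math.log(3.0)}
beta_part2 = {"intercept": math.log(1000.0), "age_std": 0.0,
              "mi_same_year": math.log(4.0)}

predicted_cost = predict_two_part(person, beta_part1, beta_part2)
```

With these values the odds of incurring any cost are 3, so the probability is 0.75, the conditional cost is 4000, and the predicted cost is 3000.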

2.4.2 Effect of a Disease State on Costs

Entry into a disease state is often associated with a change in healthcare costs. Cost models can inform changes in healthcare costs associated with a disease state by calculating the marginal effect of the disease state in the cost model. For a one-part model with identity link, the marginal effect is represented by the corresponding coefficient in the cost model. For a one-part non-linear model or a two-part model, marginal effects can be derived using recycled predictions. This involves two steps: (1) run two scenarios across the target population by setting the disease state of interest to be (a) present (e.g. recurrent cancer) or (b) absent (e.g. no cancer recurrence); (2) calculate the difference in mean costs between the two scenarios. Standard errors of the mean difference can be estimated using bootstrapping.
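The two steps of recycled prediction can be sketched for a one-part model with a natural logarithm link, as below. The population, covariate names and coefficients are hypothetical. Bootstrapped standard errors are not shown, since they would require refitting the cost model in each bootstrap sample.

```python
import math

def predict_cost(x, beta):
    """One-part log-link prediction: exp of the linear predictor."""
    return math.exp(sum(beta[name] * x[name] for name in beta))

def recycled_difference(population, beta, state):
    """Mean cost with the state set to present minus mean cost with it absent."""
    n = len(population)
    # Step 1: predict for everyone under both scenarios, keeping all other
    # characteristics as observed.
    with_state = sum(predict_cost({**p, state: 1.0}, beta) for p in population) / n
    without_state = sum(predict_cost({**p, state: 0.0}, beta) for p in population) / n
    # Step 2: difference in mean costs between the two scenarios.
    return with_state - without_state

# Hypothetical target population and coefficients (illustrative values only).
population = [
    {"intercept": 1.0, "age_std": 0.0, "mi_same_year": 0.0},
    {"intercept": 1.0, "age_std": 1.0, "mi_same_year": 1.0},
]
beta = {"intercept": math.log(1000.0), "age_std": math.log(2.0),
        "mi_same_year": math.log(3.0)}

marginal_effect = recycled_difference(population, beta, "mi_same_year")
```

Here the mean predicted cost is 4500 with an MI in the same year and 1500 without, giving a marginal effect of 3000. Because the link is non-linear, the effect depends on the other characteristics of the population over which it is averaged.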
