Optimizing the maximum reported cluster size for the multinomial-based spatial scan statistic

Spatial scan statistic for multinomial data

The multinomial-based spatial scan statistic [6] is used to detect disease clusters with statistically different disease-type distributions. Let \(p_k\) and \(q_k\) denote the probabilities of category \(k\) inside and outside the scanning window \(z\), respectively. If we want to identify regions with different disease-type distributions, the null and alternative hypotheses are stated as

$$H_0: p_1=q_1, \ldots, p_K=q_K \;\text{ for all } z\in Z \quad \text{vs.} \quad H_a: \text{not } H_0$$

where \(Z\) denotes the set of all scanning windows and \(K\) denotes the total number of categories. The likelihood ratio test statistic, given the scanning window \(z\), is denoted as

$$\lambda_z=\frac{\prod_{k=1}^{K}\left\{\left(\frac{\sum_{i\in z}c_{ik}}{\sum_{k'=1}^{K}\sum_{i\in z}c_{ik'}}\right)^{\sum_{i\in z}c_{ik}}\cdot \left(\frac{\sum_{i\notin z}c_{ik}}{\sum_{k'=1}^{K}\sum_{i\notin z}c_{ik'}}\right)^{\sum_{i\notin z}c_{ik}}\right\}}{\prod_{k=1}^{K}\left\{\left(\frac{C_k}{C}\right)^{C_k}\right\}}$$

where \(c_{ik}\) is the number of cases belonging to category \(k\) inside the region \(i\), \(C_k\) is the total number of cases belonging to category \(k\) in the whole study area and \(C\) is the total number of cases in the whole study area.
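As an illustration of the likelihood ratio statistic above, the following minimal sketch computes its logarithm for a single scanning window. The function name `multinomial_llr`, the array layout, and the convention that \(0\cdot \log 0=0\) are assumptions made for this example and are not taken from SaTScan™.

```python
import numpy as np

def multinomial_llr(counts, in_window):
    """Log likelihood ratio of one scanning window (illustrative sketch).

    counts    : (n_regions, K) array, counts[i, k] = cases of category k in region i
    in_window : boolean mask of length n_regions, True for regions inside the window
    """
    counts = np.asarray(counts, dtype=float)
    in_window = np.asarray(in_window, dtype=bool)

    inside = counts[in_window].sum(axis=0)   # per-category cases inside the window
    total = counts.sum(axis=0)               # per-category cases in the study area
    outside = total - inside                 # per-category cases outside the window

    def loglik(c):
        # sum_k c_k * log(c_k / sum(c)); terms with c_k = 0 contribute 0 (assumption)
        nz = c > 0
        return float(np.sum(c[nz] * np.log(c[nz] / c.sum())))

    # log(numerator) - log(denominator) of the likelihood ratio
    return loglik(inside) + loglik(outside) - loglik(total)

# Toy data: two regions, two categories (hypothetical numbers)
llr_same = multinomial_llr([[10, 10], [10, 10]], [True, False])
llr_diff = multinomial_llr([[10, 0], [0, 10]], [True, False])
```

In the toy data, the window whose category distribution matches the rest of the study area gives a log likelihood ratio of zero, while the window with a completely different distribution gives a positive value.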

Spatial cluster information criterion (SCIC)

Now we propose an optimization criterion called the spatial cluster information criterion (SCIC) for selecting the optimal maximum reported cluster size (MRCS) value. Our criterion draws inspiration from the formulation of the Bayesian information criterion (BIC) [25], which is widely used in statistical modeling for model selection. The BIC for a candidate model \(M\) is defined as

$$BIC\left(M\right)=-2\cdot \log L\left(\widehat{\theta}_M|y\right)+u\cdot \log\left(v\right),$$

where \(y\) is the observed data, \(L\left(\theta_M|y\right)\) is the likelihood of \(y\) given the model \(M\), \(\widehat{\theta}_M\) is the maximum likelihood estimate (MLE) of \(\theta_M\), that is, the value that maximizes \(L\left(\theta_M|y\right)\), \(u\) is the number of parameters in the model \(M\), and \(v\) is the total number of observations. The second component of the BIC is a penalty term that penalizes models with additional parameters. The model with the minimum BIC value is considered the most appropriate selection [26].

We define the SCIC as \(-2\) times the sum of the log-likelihood ratio (LLR) test statistics of all significant clusters plus a penalty term. In the multinomial-based spatial scan statistic, the LLR test statistic for each scanning window measures the degree of heterogeneity in the spatial distribution of the categories. A higher LLR test statistic indicates that the category distribution within the scanning window differs more strongly from that of the surrounding area. However, as the scanning window size increases, the LLR test statistic tends to rise because more cases are included within the window.

The spatial scan statistic has faced criticism for its tendency to identify clusters that are considerably larger than the actual clusters, often incorporating neighboring regions with no elevated risk of disease occurrence [27,28,29]. This tendency is most noticeable when the default settings of the maximum scanning window size (MSWS) and the MRCS, both set at 50%, are used with circular scanning windows. Optimizing the MRCS improves the spatial scan statistic’s ability to identify clusters with greater precision [17,19,20,21]. To use the sum of the LLR test statistics as an optimization criterion, we need to offset the inflation of the test statistic caused by a large number of observations within the window.

The penalty term in the SCIC is defined in two versions. In the first version, the penalty term is calculated by multiplying the logarithm of the number of cases within the significant clusters by the product of the number of categories and the number of significant clusters. In the second version, we substitute the number of regions inside the significant clusters for the number of cases. This is based on the understanding that the number of cases within a cluster tends to increase as the number of regions inside the cluster increases. Both versions serve as optimization criteria with similar implications. For the multinomial model, the algorithm for computing the SCIC is as follows:

(Step 1) For a given MRCS of \(m\)% (\(m\) = 1, …, 50), denote the \(J_m\) significant clusters reported by the multinomial-based spatial scan statistic by \(z_1^{(m)}, \cdots , z_{J_m}^{(m)}\).

(Step 2) For each \(m\), calculate the SCIC for all significant clusters as follows:

$$SCIC_1\left(m\right)=-2\sum_{j=1}^{J_m}\log\left(\lambda_{z_j^{(m)}}\right)+K\cdot J_m\cdot \log\left(C^{(m)}\right)$$

(Version 1)

$$SCIC_2\left(m\right)=-2\sum_{j=1}^{J_m}\log\left(\lambda_{z_j^{(m)}}\right)+K\cdot J_m\cdot \log\left(R^{(m)}\right)$$

(Version 2)

where \(\lambda_{z_j^{(m)}}\) denotes the LRT statistic for the multinomial-based spatial scan statistic given the \(j^{th}\) significant cluster \(z_j^{(m)}\), \(K\) is the total number of categories, and \(C^{(m)}\) and \(R^{(m)}\) denote the total number of cases and the total number of regions inside all significant clusters, respectively.

(Step 3) Choose the MRCS which minimizes the SCIC as the optimal MRCS.
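The three steps above can be sketched as follows. This is only an illustration under stated assumptions: the function names `scic` and `select_mrcs`, the layout of the `results` dictionary, and the choice to skip MRCS values with no significant cluster are hypothetical and are not specified in the text. The per-cluster LLR values are assumed to have been obtained beforehand from the scan statistic.

```python
import math

def scic(llrs, n_categories, size):
    """SCIC for one MRCS value.

    llrs         : LLR values of the significant clusters found at this MRCS
    n_categories : number of categories K
    size         : total cases inside the significant clusters (version 1)
                   or total regions inside them (version 2)
    """
    penalty = n_categories * len(llrs) * math.log(size)
    return -2.0 * sum(llrs) + penalty

def select_mrcs(results, n_categories, version=1):
    """Return the MRCS with the smallest SCIC and all SCIC values.

    results maps each candidate MRCS to (llrs, total_cases, total_regions).
    MRCS values without any significant cluster are skipped (an assumption).
    """
    scores = {}
    for m, (llrs, total_cases, total_regions) in results.items():
        if not llrs:
            continue
        size = total_cases if version == 1 else total_regions
        scores[m] = scic(llrs, n_categories, size)
    best = min(scores, key=scores.get) if scores else None
    return best, scores

# Hypothetical scan output for two candidate MRCS values and K = 4 categories
results = {
    10: ([12.0], 100, 5),        # one significant cluster
    50: ([12.0, 3.0], 400, 20),  # two larger significant clusters
}
best_v1, scores_v1 = select_mrcs(results, n_categories=4, version=1)
best_v2, scores_v2 = select_mrcs(results, n_categories=4, version=2)
```

In this made-up example both versions prefer the smaller MRCS, because the extra cluster found at 50% adds little to the LLR sum relative to the growth in the penalty.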

Figure 1 illustrates the flowchart of the proposed method.

Fig. 1 The flowchart of the proposed method

Elbow method, MCS-P, and MCHS-P

For the Poisson-based spatial scan statistic, optimization criteria such as the elbow method [22], the maximum clustering set–proportion (MCS-P) [23], and the maximum clustering heterogeneous set-proportion (MCHS-P) [24] have been proposed to determine the optimal value of MRCS or MSWS. Since these methods are likelihood-based optimization criteria, we adapted them to the multinomial model in order to compare their performance with that of our proposed approaches. The procedure follows the same steps as for the SCICs; the only difference is the measure that is calculated. We optimize the MRCS rather than the MSWS to avoid the multiple testing problem, as noted by Han et al. [17].

The elbow method [30] is commonly employed in unsupervised learning to determine the optimal number of clusters by identifying the elbow point. In the context of selecting the optimal MRCS value, Meysami et al. [22] proposed an optimization criterion for the Poisson model by adopting the method for finding the optimal elbow point suggested by Delgado et al. [31]. We apply the method to the multinomial model by calculating the negative sum of the likelihood ratio test (LRT) statistic values over all \(J_m\) significant clusters for each \(m\) as

$$-LRT\left(m\right)=-\sum_{j=1}^{J_m}\lambda_{z_j^{(m)}}$$

where \(\lambda_{z_j^{(m)}}\) denotes the LRT statistic value for the \(j^{th}\) significant cluster \(z_j^{(m)}\) (\(j\) = 1, …, \(J_m\)). If no significant cluster is present, we use the maximum LRT statistic instead. The elbow plot is constructed by connecting the points (\(m, -LRT\left(m\right)\)) for \(m\) = 1, …, 50. For each \(m\), we calculate the orthogonal distance between the point (\(m, -LRT(m)\)) and the line connecting the first and last points. The optimal MRCS is the one that maximizes this orthogonal distance.
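The orthogonal-distance step can be sketched as follows. The function name `elbow_mrcs` and the example values are hypothetical. The sketch works on the raw coordinates without rescaling the axes, which is a simplifying assumption; the procedure of Delgado et al. [31] may differ in such details.

```python
import numpy as np

def elbow_mrcs(mrcs_values, neg_lrt):
    """Pick the MRCS whose point lies farthest from the first-to-last chord.

    mrcs_values : candidate MRCS values in increasing order
    neg_lrt     : -LRT(m) for each candidate, in the same order
    """
    x = np.asarray(mrcs_values, dtype=float)
    y = np.asarray(neg_lrt, dtype=float)

    # Direction of the line joining the first and last points
    dx, dy = x[-1] - x[0], y[-1] - y[0]

    # Perpendicular distance = |cross product| / length of the chord
    dist = np.abs(dx * (y - y[0]) - dy * (x - x[0])) / np.hypot(dx, dy)

    return mrcs_values[int(np.argmax(dist))], dist

# Hypothetical curve that drops sharply and then flattens
candidates = [1, 2, 3, 4, 5]
curve = [0.0, -8.0, -9.0, -9.5, -10.0]
best_m, distances = elbow_mrcs(candidates, curve)
```

The two end points always have distance zero, so the selected value lies strictly between the smallest and largest candidates whenever the curve is not a straight line.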

Ma et al. [23] proposed the maximum clustering set–proportion (MCS-P) as an optimization criterion to determine the optimal value of the MSWS for the Poisson-based spatial scan statistic. This criterion assumes that all identified significant clusters are homogeneous clusters with the same relative risks. However, because of the multiple testing issue, analyzing the data repeatedly with different MSWS values and selecting the best result might not be appropriate. In our study, we adapt the MCS-P criterion to the multinomial model and use it to select the optimal MRCS while keeping the MSWS fixed at 50%. To apply the MCS-P to the multinomial model, we first define the union cluster set \(z_U^{(m)}\) by merging all \(J_m\) clusters for each \(m\) as

$$z_U^{(m)}=\bigcup_{j=1}^{J_m}z_j^{(m)}$$

where \(z_j^{(m)}\) is the \(j^{th}\) detected significant cluster (\(j\) = 1, …, \(J_m\)). Then, we calculate the union log-likelihood ratio (LLR) test statistic \(\log\lambda_{z_U^{(m)}}\) given the union cluster set \(z_U^{(m)}\) as

$$\log\lambda_{z_U^{(m)}}=\sum_{k=1}^{K}\left\{\sum_{i\in z_U^{(m)}}c_{ik}\cdot \log\left(\frac{\sum_{i\in z_U^{(m)}}c_{ik}}{\sum_{i\in z_U^{(m)}}c_i}\right)+\left(C_k-\sum_{i\in z_U^{(m)}}c_{ik}\right)\cdot \log\left(\frac{C_k-\sum_{i\in z_U^{(m)}}c_{ik}}{\sum_{i\notin z_U^{(m)}}c_i}\right)\right\}+\sum_{k=1}^{K}C_k\cdot \log\left(\frac{C}{C_k}\right)$$

where \(c_{ik}\), \(C_k\), and \(C\) are as defined previously and \(c_i\) is the number of cases inside the region \(i\). The optimal MRCS is the one that maximizes the union LLR test statistic \(\log\lambda_{z_U^{(m)}}\).
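A minimal sketch of the MCS-P selection for the multinomial model is given below. It assumes that the significant clusters for each candidate MRCS are already available as lists of region indices; the names `union_llr` and `select_mrcs_mcsp` and the toy data are hypothetical.

```python
import numpy as np

def _loglik(c):
    """sum_k c_k * log(c_k / sum(c)), with 0 * log(0) taken as 0 (assumption)."""
    c = np.asarray(c, dtype=float)
    nz = c > 0
    return float(np.sum(c[nz] * np.log(c[nz] / c.sum())))

def union_llr(counts, clusters):
    """Union LLR: all significant clusters are merged into one region set."""
    counts = np.asarray(counts, dtype=float)
    union = np.array(sorted(set().union(*clusters)), dtype=int)
    inside = counts[union].sum(axis=0)    # per-category cases inside the union
    total = counts.sum(axis=0)            # per-category cases in the study area
    return _loglik(inside) + _loglik(total - inside) - _loglik(total)

def select_mrcs_mcsp(counts, clusters_by_mrcs):
    """Return the MRCS that maximizes the union LLR, plus all scores."""
    scores = {m: union_llr(counts, cl) for m, cl in clusters_by_mrcs.items()}
    return max(scores, key=scores.get), scores

# Toy data: three regions, two categories (hypothetical numbers)
counts = [[10, 0], [10, 0], [0, 20]]
clusters_by_mrcs = {10: [[0]], 50: [[0], [0, 1]]}
best_m, scores = select_mrcs_mcsp(counts, clusters_by_mrcs)
```

Overlapping clusters are handled by the set union, so a region that appears in several clusters is counted only once.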

Considering the possibility of detected significant clusters being heterogeneous with varying relative risks, Wang et al. [24] introduced the maximum clustering heterogeneous set-proportion (MCHS-P) as an optimization criterion to determine the optimal value of the MSWS. As with the MCS-P, we adapt the MCHS-P criterion to the multinomial model and use it to select the optimal MRCS while keeping the MSWS fixed at 50%. For each \(m\), we define the heterogeneous cluster set \(z_H^{(m)}\) by merging the \(J_m\) detected significant clusters into \(G_m\) (\(G_m\le J_m\)) merged clusters according to their spatial contiguity.

$$z_H^{(m)}=\left\{z_{h_1}^{(m)}, \ldots , z_{h_{G_m}}^{(m)}\right\}$$

Then we calculate the union LLR test statistic \(\log\lambda_{z_H^{(m)}}\) given the heterogeneous cluster set \(z_H^{(m)}\) as

$$\log\lambda_{z_H^{(m)}}=\sum_{k=1}^{K}\left\{\sum_{i\in z_{h_1}^{(m)}}c_{ik}\cdot \log\left(\frac{\sum_{i\in z_{h_1}^{(m)}}c_{ik}}{\sum_{i\in z_{h_1}^{(m)}}c_i}\right)+\cdots +\sum_{i\in z_{h_{G_m}}^{(m)}}c_{ik}\cdot \log\left(\frac{\sum_{i\in z_{h_{G_m}}^{(m)}}c_{ik}}{\sum_{i\in z_{h_{G_m}}^{(m)}}c_i}\right)+\left(C_k-\sum_{i\in z_H^{(m)}}c_{ik}\right)\cdot \log\left(\frac{C_k-\sum_{i\in z_H^{(m)}}c_{ik}}{\sum_{i\notin z_H^{(m)}}c_i}\right)\right\}+\sum_{k=1}^{K}C_k\cdot \log\left(\frac{C}{C_k}\right)$$

Here \(i\in z_H^{(m)}\) means that region \(i\) belongs to one of the merged clusters. The optimal MRCS is the one that maximizes the union LLR test statistic \(\log\lambda_{z_H^{(m)}}\).
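The difference between treating the detected clusters as one homogeneous set and as several heterogeneous merged clusters can be illustrated with a small sketch. It assumes that the merged clusters are already given as disjoint lists of region indices, so the contiguity-based merging step itself is not shown; the function name `heterogeneous_llr` and the toy data are hypothetical.

```python
import numpy as np

def heterogeneous_llr(counts, merged_clusters):
    """LLR in which each merged cluster has its own category distribution.

    counts          : (n_regions, K) array of case counts
    merged_clusters : list of disjoint lists of region indices (assumption)
    """
    counts = np.asarray(counts, dtype=float)
    total = counts.sum(axis=0)

    def loglik(c):
        # sum_k c_k * log(c_k / sum(c)); zero counts contribute 0 (assumption)
        nz = c > 0
        return float(np.sum(c[nz] * np.log(c[nz] / c.sum())))

    llr = -loglik(total)                  # null model term
    covered = np.zeros_like(total)
    for regions in merged_clusters:
        inside = counts[np.asarray(regions, dtype=int)].sum(axis=0)
        llr += loglik(inside)             # one term per merged cluster
        covered += inside
    llr += loglik(total - covered)        # regions outside every merged cluster
    return llr

# Toy data: two clusters with opposite category patterns (hypothetical numbers)
counts = [[10, 0], [0, 10], [5, 5]]
llr_hetero = heterogeneous_llr(counts, [[0], [1]])   # two separate clusters
llr_homog = heterogeneous_llr(counts, [[0, 1]])      # same regions as one set
```

In this toy example the two clusters cancel each other out when pooled, so the pooled statistic is zero while the heterogeneous statistic stays clearly positive.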

Simulation study

We conducted a simulation study to evaluate the performance of the proposed method for the multinomial model in comparison to other existing methods. The study region comprised Seoul and Gyeonggi Province in South Korea, consisting of 69 districts. For the simulation, we considered five different true cluster models as depicted in Fig. 2. True cluster models (A) and (B) represented one circular-shaped and one elliptical-shaped true cluster, respectively, each consisting of 5 districts, which accounted for 8% of the entire study region. True cluster model (C) depicted one irregular-shaped true cluster with 10 districts, representing 15% of the entire study region. True cluster models (D) and (E) assumed two circular-shaped and two elliptical-shaped true clusters, respectively, each consisting of 5 districts.

Fig. 2 True cluster models in the simulation study

For each true cluster model, we considered various scenarios of the alternative hypothesis, assuming four categories. The parameter settings for the alternative hypotheses were adopted from a previous study [6]. The null hypothesis was set to equal probabilities of 0.25 for each of the four categories. The previous study [6] used several different alternative hypotheses to evaluate the multinomial-based spatial scan statistic and showed that it worked well under those hypotheses. Because this study aimed to assess a method for optimizing the MRCS for the multinomial-based spatial scan statistic, we evaluated its performance under the same hypotheses. Furthermore, because the alternative hypotheses satisfy the likelihood ratio ordering, we were also able to evaluate the performance of the ordinal model [3]. For the true cluster models with two clusters, we included heterogeneous settings, where a different alternative hypothesis was assigned to each cluster, as well as homogeneous settings, where the same alternative hypothesis was applied to both clusters. This allowed us to examine the performance of the proposed method in more plausible heterogeneous settings, where the relative risks of each category differ between the two clusters. We considered four alternative hypotheses for the true cluster models with one cluster or two homogeneous clusters, and three alternative hypotheses for the true cluster models with two heterogeneous clusters. This resulted in a total of 26 scenarios (4 × 5 = 20 for models (A) to (E) with one cluster or two homogeneous clusters, plus 3 × 2 = 6 for models (D) and (E) with two heterogeneous clusters). Table 1 presents the simulation scenarios for the true cluster model along with their respective alternative hypotheses.

Table 1 Simulation scenarios for the true cluster model and alternative hypothesis

Under each scenario, we generated 1000 datasets, each containing 1000 cases distributed among four categories. For each dataset, we repeatedly identified clusters by varying the MRCS value. In SaTScan™, the MRCS value was set to 1%, 2%, 3%, 4%, 5%, 6%, 8%, 10%, 12%, 15%, 20%, 25%, 30%, 35%, 40%, 45%, and 50%. As SaTScan™ provides Gini coefficient values for these 17 candidate MRCS values in the Bernoulli and Poisson models, we computed the SCICs, Gini coefficient (for the ordinal model), elbow method, MCS-P, and MCHS-P values for the same 17 candidate MRCS values for consistency. Then, we compared the clusters reported by each method at its selected optimal MRCS with the true clusters. Regarding the scanning window shape, we present the results obtained with elliptical windows as the main results because Kulldorff et al. [32] found that the spatial scan statistic with elliptical windows exhibited good power when the true cluster is elliptical or circular.

Over the 1000 randomly generated datasets, we recorded the frequency at which each candidate MRCS value was selected as the optimal MRCS by each method. To compare the performance of the proposed method with the other existing methods and the default setting (MRCS value of 50%), we used sensitivity, positive predictive value (PPV), and misclassification as the performance measures, as per a previous study [33]. Sensitivity represents the proportion of districts in the true cluster that were correctly identified, while PPV represents the proportion of districts in the detected cluster that belong to the true cluster. A lower sensitivity means that the method failed to identify some districts that belong to the true cluster. A lower PPV means that the method identified some districts that do not belong to the true cluster. Misclassification indicates the proportion of incorrectly identified districts within the true or detected cluster. Higher sensitivity and PPV values, along with lower misclassification values, indicate better performance in accurately identifying clusters. We calculated the average sensitivity, PPV, and misclassification over the 1000 simulated datasets for two sets of MRCS values: (1) those selected by SCIC1, SCIC2, Gini coefficient (only for the ordinal model), elbow method, MCS-P, and MCHS-P, and (2) the default value of 50%. The simulation was conducted using SaTScan™ version 10.0 and R software version 4.0.2, employing the ‘rsatscan’ package [34].
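The three performance measures can be sketched for a single simulated dataset as follows. The function name `cluster_accuracy` is hypothetical. The denominator used for misclassification (the union of the true and detected clusters) is our reading of the verbal definition above and should be treated as an assumption; the definition in [33] may differ.

```python
def cluster_accuracy(true_regions, detected_regions):
    """Sensitivity, PPV and misclassification for one dataset (illustrative).

    true_regions     : district identifiers in the true cluster(s), non-empty
    detected_regions : district identifiers in the detected significant cluster(s)
    """
    true_set = set(true_regions)
    detected = set(detected_regions)
    hits = len(true_set & detected)          # correctly identified districts

    sensitivity = hits / len(true_set)
    # PPV is undefined when nothing is detected; NaN is returned (assumption)
    ppv = hits / len(detected) if detected else float("nan")
    # Missed plus falsely included districts, relative to the union (assumption)
    misclassification = len(true_set ^ detected) / len(true_set | detected)
    return sensitivity, ppv, misclassification

# Hypothetical example: 5 true districts, 3 detected, 2 in common
sens, ppv, mis = cluster_accuracy([1, 2, 3, 4, 5], [4, 5, 6])
```

Averaging these values over all simulated datasets of a scenario gives the summary measures reported for each method.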
