The data set from Sprick et al. [7] assesses the damage inflicted by four different horseshoe materials (steel, aluminium, polyurethane, horn) on the long bones of horses. For welfare reasons, horses are increasingly kept in groups. During social interactions, kicks, particularly with the hind limbs, can cause fractures of the long bones (radii and tibiae) when loads are applied perpendicular to the longitudinal axis. In the study by Sprick et al. [7], kicks at a velocity of 16 m/s were simulated in a drop impact test setup for the four horseshoe materials. To obtain a random and representative sample, the bones were allocated to the groups such that age, sex, and type of bone were uniformly distributed, and no group contained more than one bone from the same horse. We focus on only one condition: horn (2 radial or tibial fractures out of 16 kicks). The authors reported a relative frequency of fractures equal to 12.5% and provided a Clopper-Pearson CI(\(\pi \)) of (2 to 38%).
Classical and Bayesian approaches
There are two approaches in statistics: classical and Bayesian. In the classical or frequentist branch, the unknown true parameter of interest is assumed to be fixed and can be learned or estimated from repeatedly drawn samples of identical, independent observations from the population. Thus, classical statistics defines statistical procedures by requiring certain properties to hold. Although classical statistics covers many inferential methods, likelihood-based approaches are very popular for parametric models. By definition, the estimate of the parameter of interest is the value of the parameter for which the likelihood attains its maximum. A 95% classical confidence interval alludes to the sampling experiment: “If one repeatedly calculates such intervals from many independent random samples, 95% of the intervals would, in the long run, correctly include the actual value of the parameter of interest” (Meeker et al. [8], p. 26).
In contrast, Bayesian methodology assumes that the parameter of interest is random rather than a fixed quantity, and that the observed sample is fixed. Bayesian procedures are valid if they are arrived at by following Bayes' theorem, which specifies how to combine a prior and the likelihood. In addition to the likelihood, which contains the information about the unknown parameters of the data-generating model, prior information needs to be provided. Based on the likelihood and the prior, the posterior or “post-data” [9] distribution is derived, from which Bayesian interval estimates can be read off. The Bayesian interpretation therefore describes the properties of the distribution of the true parameter after the data have been observed, subject to the prior. Thus, Bayesian intervals, also called credible intervals (CrI), which are based on posterior distributions, have a completely different interpretation from the repeated-sampling (i.e., frequency) probabilities used in classical statistics. The Bayesian 95% CrI contains 95% of the posterior probability of the parameter of interest.
In applications of the Bayesian methodology, the use of a minimally informative Jeffreys prior has been recommended [10]. For binary observations, the Jeffreys prior is a Beta distribution with both shape parameters \(a\) and \(b\) fixed at 0.5. For this particular choice of shape parameters, the prior has a minimal impact on the posterior results. In fact, for \(a=b=0.5\), the sum of both shape parameters \(a+b=1\) reveals that the impact of the Jeffreys prior corresponds to one observation (Additional file 1). The Jeffreys prior, which is also called the reference or default prior, is quite convenient because practitioners do not need to decide on any prior themselves.
Random sample and point estimates
Below, we focus on one random sample with independent observations generated by a binary primary outcome at the patient, specimen or object level attaining only two values (0 = “no”, 1 = “yes”). Usually, the value 1 corresponds to an event of interest. Assume that the sample size is equal to \(n\) and the observations form a vector (of length \(n\)) of 0/1-values. From a statistical point of view, these observations are independent and identically distributed (iid) realisations of a Bernoulli distribution (\(Be(\pi )\)), which attains value 1 with a true probability \(\pi \) and value 0 with a true probability \(1-\pi \). What researchers are interested in is the true probability \(\pi \) of an event of interest. Usually, this true value \(\pi \) is unknown, so experiments need to be conducted to obtain an estimate \(\widehat{\pi }\) from the data that estimates the true probability \(\pi \). The estimate \(\widehat{\pi }\) is obtained by dividing \(x\), the sum of all events of interest in the sample, by the total sample size \(n\).
In applications, a random sample of independent 0/1 observations is usually summarised by two numbers: \(n\) (the sample size: the total number of considered objects in the sample) and \(x\) (the number of objects in the sample that show an event of interest). For the horn data set, a total of \(n=16\) independent kicks were performed, and \(x=2\) of these kicks resulted in a fracture (the event of interest). These numbers are frequently presented as a relative frequency \(\widehat{\pi }=x/n=2/16=0.125=12.5\%\). In statistics, \(\widehat{\pi }\) is called a point estimate, which indicates what proportion of kicks resulted in a fracture when a sample of \(n=16\) independent kicks was considered. The problem with the estimate \(\widehat{\pi }\) is that it is only an estimate of the true probability \(\pi \) and is likely to be close to, but not exactly equal to, the truth. In fact, a point estimate does not convey any uncertainty on its own and corresponds to confidence = 0 (Fig. 4). Therefore, to mitigate this serious drawback of point estimates, three interval estimates, CI, PI, and TI, have been developed [8]. These interval estimates share three common properties. First, they indicate an interval marked by two bounds: a lower and an upper one. Second, they require the specification of a confidence or probability level, which we set throughout at \(0.95 = (1-\alpha )\) by fixing the value of the statistical error \(\alpha \) at 0.05. Third, although CI, PI, and TI are computed from one sample of 0/1 observations, they provide new insights into the true underlying distribution \(Be(\pi )\). As we will show below, the three interval estimates CI, PI, and TI inform us about either the unknown probability \(\pi \) or new realisations from the true distribution \(Be(\pi )\). In the following three subsections, we demonstrate the differences in interpretation and use of the three interval estimates. We present either our own functions or functions implemented in specific packages in R [1].
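In R, the point estimate is simply the relative frequency; a minimal sketch for the horn data:

# Horn data from Sprick et al. [7]: x fractures in n kicks
x <- 2
n <- 16
pi_hat <- x / n
pi_hat  # 0.125, i.e., 12.5%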
Fig. 4 Funnel plot depicting Wilson-CIs for confidence levels ranging between 0 and 100%. The grey dashed line indicates that the Wilson 95% CI (0.034, 0.360) reported in Table 1 corresponds to a confidence level of 95%. The funnel plot points at the point estimate, \(\widehat{\pi }=x/n=2/16=0.125\). This indicates that one may claim that the true probability \(\pi \) is equal to 0.125 only with a confidence level of 0
Applications of CI, PI, and TI
In what follows, we provide a description of the methods for CI, PI, and TI combined with results, interpretation, and some remarks on their applicability. Table 1 reports the CI, PI, and TI obtained for the data from Sprick et al. [7] with \(x=2\) out of \(n=16\) fractures with a horn impactor. Note that the interpretation of CI, PI, and TI hinges on the assumption that these data are from a random sample, i.e., long bones were collected from 16 different and unrelated animals, which are representative of the population of horses. For a binary variable, the original scale of the CI (CrI) is the probability scale, and for both PI and TI, it is the count scale. Multiplying (dividing) the interval bounds by the constant sample size transforms the result from one scale to the other (see Table 1).
Table 1 Confidence interval (CI), prediction interval (PI), and tolerance interval (TI) estimates for horn: \(\widehat{\pi }=x/n=2/16=0.125=12.5\%\) with confidence level \(\left(1-\alpha \right)=0.95\), classical Wilson (W) and Bayesian Jeffreys (J), for different contents \(P\) and different numbers of predicted future observations \(m\)

Confidence interval (CI)
In classical statistics, the original approach to compute a CI for a mean was first described by Student [11], Neyman [12] and Welch [13]. Procedures for the computation of a CI for an unknown probability followed [14,15,16]. Morey et al. [17] and Gelman and Greenland [18] warn that, in practice, a classical CI can be (mis)interpreted in the Bayesian way. Occasionally, users claim that there is a 95% probability that the true parameter lies between the lower and the upper bounds of the CI, although the following interpretation of a classical CI for an unknown probability \(\pi \) applies: “For identical and independent repetitions of the underlying statistical sampling experiment, a \(\left(1-\alpha \right)\times 100\)% confidence interval will cover \(\pi \) in \(\left(1-\alpha \right)\times 100\)% of all cases” [19].
This property of the CI(\(\pi \)) is illustrated in Fig. 1. Confidence intervals marked in red do not overlap the true probability \(\pi \). Red CIs(\(\pi \)) convey an incorrect piece of information, as the true probability \(\pi \) is not included within their lower and upper bounds. Note that such an incorrect result should occur for a 95% CI(\(\pi \)) in only 5 out of 100 repetitions on average. In Fig. 1, there are 7 red CIs(\(\pi \)) out of a total of 100 simulations, resulting in an error rate of 7%.
There are several different approaches to computing a CI(\(\pi \)), such as the Clopper-Pearson CI [15], the Wilson-CI [14] and the Wald-CI [16]. Held and Sabanés Bové ([19], pp. 113-119) show that the Wilson procedure for CI(\(\pi \)) computation has the best statistical properties, and we recommend it for wide use in practice. The Wilson-CI(\(\pi \)) can be conveniently computed in R using the package DescTools [20] with the command BinomCI(), specifying the number of successes \(x\) out of \(n\) trials. A \(\left(1-\alpha \right)=95\%\) Wilson-CI is obtained by:
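A minimal sketch of such a call (the exact code in the original may differ):

library(DescTools)
# 95% Wilson confidence interval for x = 2 fractures in n = 16 kicks
BinomCI(x = 2, n = 16, conf.level = 0.95, method = "wilson")
# est = 0.125 with lower and upper bounds (0.034, 0.360), cf. Table 1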
Note that there are also other packages in R offering such functionality: most prominently binom [21] with the command binom.confint() and PropCIs [22] with the command scoreci().
The interpretation of the classical Wilson-CI (\(\pi \)) (0.034 to 0.360) from Table 1 is as follows: For repeated, i.e., independent, identical realisations of the kick experiment with a horn impactor at a velocity of 16 m/s, the Wilson-CI(\(\pi \)) will contain the (unknown) true probability \(\pi \) of a fracture in 95% of repeated kick experiments.
Bayesian CrI
An alternative to the classical approach is the Bayesian approach, resulting in a credible interval (CrI) based on a posterior distribution. The unknown parameter \(\pi \) is contained in the \((1-\alpha )\) credible interval with probability \(\left(1-\alpha \right)\).
To calculate the posterior distribution of the parameter \(\pi \), the concept of conjugacy is useful. A prior distribution is called conjugate if the resulting posterior distribution belongs to the same family of distributions as the prior [19]. For a binomial distribution, a beta distribution, with support ranging from 0 to 1, is a convenient choice of conjugate prior [10].
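Concretely, combining a Beta(\(a\), \(b\)) prior with a binomial likelihood for \(x\) events in \(n\) trials yields the posterior \(\pi \mid x \sim \text{Beta}(a+x,\, b+n-x)\); for the Jeffreys prior (\(a=b=0.5\)) and the horn data (\(x=2\), \(n=16\)), this is a Beta(2.5, 14.5) distribution.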
A Jeffreys credible interval for \(x\) out of \(n\) trials is computed based on a \(\left(1-\alpha \right)=95\%\) probability and a minimally informative Beta prior with both parameters \(a\) and \(b\) fixed at 0.5 [8]. This approach is demonstrated in Fig. 5. In [16], it is proven that an equal-tailed Jeffreys CrI is always contained within the corresponding confidence interval computed according to the classical Clopper-Pearson approach, so it can be regarded as an improved version of the Clopper-Pearson interval. Moreover, the Jeffreys CrI has good frequentist coverage properties.
Fig. 5 Density plots of the posterior distributions based on the Jeffreys prior (Beta(0.5,0.5)) and the binomial likelihood for \(x=2\) and \(n=16\) for a horn impactor from [7]. The likelihood (dotted black) and the posterior distribution (red) are similar. The \((1-\alpha =0.95)\) credible interval (0.026 to 0.344) is indicated by green lines
In R [1], a number of packages facilitate the calculation of Bayesian Jeffreys credible intervals, such as the package DescTools with BinomCI() [20] (see details on the Jeffreys CrI in Additional file 1).
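For example (a minimal sketch; the Jeffreys interval is selected via the method argument):

library(DescTools)
# 95% Jeffreys credible interval for x = 2 events in n = 16 trials
BinomCI(x = 2, n = 16, conf.level = 0.95, method = "jeffreys")
# bounds approximately (0.026, 0.344), cf. Table 1 and Fig. 5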
The recommended Bayesian approach leads to the Jeffreys CrI(\(\pi \)) (0.026 to 0.344) interval estimate shown in Table 1, which can be interpreted as follows: the probability \(\pi \) of a fracture in a kick experiment with a horn impactor at a velocity of 16 m/s lies in the Jeffreys CrI(\(\pi \)) with a posterior probability of 95%, when a minimally informative Jeffreys prior is assumed. The corresponding prior, likelihood and posterior distributions are displayed in Fig. 5.
If the main objective is the true probability \(\pi \), the CI (CrI) is useful when planning the design of a new study. For example, the length of the CI (CrI) can facilitate the computation of the sample size for a future study. Given a target precision of the result (the length of the CI (CrI) after the study), one computes the sample size necessary to achieve this target precision.
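A minimal sketch of such a precision-based calculation (the helper ci_length, the target length of 0.05 and the assumption that the observed proportion of 0.125 carries over to the new study are our own illustration):

library(DescTools)
# Length of the 95% Wilson-CI if a proportion of 0.125 is observed at size n
ci_length <- function(n, p = 0.125) {
  ci <- BinomCI(x = round(p * n), n = n, method = "wilson")
  unname(ci[, "upr.ci"] - ci[, "lwr.ci"])
}
# Smallest n on a grid for which the expected CI length falls below 0.05
n_grid <- seq(100, 3000, by = 100)
n_grid[min(which(sapply(n_grid, ci_length) < 0.05))]  # here: 700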
Note that the length of both the CI and the CrI depends strongly on the sample size \(n\). The lengths of the Bayesian CrI, 0.317 for 2/16 and 0.032 for 200/1600, differ drastically. This clearly demonstrates that the CI(\(\pi \)) and CrI(\(\pi \)) are mostly concerned with the value of the true probability \(\pi \) and do not predict the outcome of any new future study.
Prediction interval (PI)
The main idea behind a prediction interval is to provide an interval that covers the outcome of \(m\) future observations with confidence \((1-\alpha )\), given the data (\(x\) and \(n\)) at hand. If the main focus is on the outcome of the \(m\) future observations, prediction intervals are recommended for planning future studies, power calculations, model checking or deciding whether to conduct a future trial. For details, see [10, 23] and the references therein.
Classical approaches to prediction intervals are mainly based on regression methods, which are conveniently applicable to quantitative primary outcomes [2]. To our knowledge, there is no simple classical procedure that shows good statistical properties in the setting with one sample and a binary primary outcome. Instead, the Bayesian methodology relying on predictive distributions is recommended in such a situation [10].
For the posterior predictive distribution, the binomial distribution is combined with a conjugate Beta prior with parameters \(a\) and \(b\); the parameters of the posterior predictive distribution are determined by the sum of the initially chosen \(a\) and \(b\) parameters and the already observed data [10]. Further details are presented in Additional file 1. The Jeffreys PI is obtained for \(a=b=0.5\). In the Bayesian approach, the unknown predicted value lies in the prediction interval with a probability of \(1-\alpha =0.95\). This probability statement is induced by the posterior predictive distribution and should not be mistaken for a coverage probability (see Coverage properties and asymptotic behaviour of CI, PI, and TI).
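Concretely, with the posterior \(\text{Beta}(a+x,\, b+n-x)\), the posterior predictive probability of observing \(k\) events among \(m\) future observations is beta-binomial: \(P(k)=\binom{m}{k}\, B(k+a+x,\; m-k+b+n-x)\,/\,B(a+x,\; b+n-x)\), where \(B(\cdot ,\cdot )\) denotes the beta function.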
The following R functions compute the Bayesian Jeffreys prediction interval for the number of events of interest in a future sample of size \(m\) at a \((1-\alpha )\) probability level, using the data from an observed sample with \(x\) events out of \(n\) observations.
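A minimal sketch of such functions (our own implementation via beta-binomial quantiles; the original functions may differ):

# Posterior predictive (beta-binomial) probability of k events among m future
# observations, given x events in n trials and a Beta(a, b) prior
dpostpred <- function(k, m, x, n, a = 0.5, b = 0.5) {
  exp(lchoose(m, k) + lbeta(k + a + x, m - k + b + n - x) -
        lbeta(a + x, b + n - x))
}

# Equal-tailed (1 - alpha) Jeffreys prediction interval on the count scale
jeffreys_pi <- function(x, n, m, alpha = 0.05, a = 0.5, b = 0.5) {
  k <- 0:m
  cdf <- cumsum(dpostpred(k, m, x, n, a, b))
  c(lower = k[min(which(cdf >= alpha / 2))],
    upper = k[min(which(cdf >= 1 - alpha / 2))])
}

jeffreys_pi(x = 2, n = 16, m = 50)   # (0, 19), cf. Table 1
jeffreys_pi(x = 2, n = 16, m = 100)  # (2, 36), cf. Table 1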
The computation behind the PI, based on the posterior predictive distribution that predicts the number of events of interest (fractures) in a future sample of \(m=100\) kicks, is illustrated in Fig. 6. Based on \(x=2\) fractures in \(n=16\) kicks, the Bayesian PI states that the predicted number of fractures in a future experiment with \(m=50\) or \(m=100\) kicks lies between (0 to 19) or (2 to 36) fractures, respectively.
Fig. 6 Posterior predictive distribution for a future sample of \(m=100\) kicks based on the Jeffreys prior (Beta(0.5,0.5)) and the binomial likelihood for \(x=2\) and \(n=16\) for the horn impactor from [7]. The \((1-\alpha =0.95)\) J-PI (2 to 36) is indicated by green lines
A PI is less concerned with the true probability \(\pi \); rather, it aims to show the variability in the future data when the same experiment is conducted again several (\(m\)) times, given the information contained in the current data (\(x\), \(n\)). A PI provides lower and upper bounds on the count of observations that show the event of interest (attaining value 1) in the future sample of \(m\) observations. This is shown in Fig. 2 with 100 simulations of a PI for \(m=50\). A red PI indicates a situation where the number of events actually generated in \(m=50\) future iid Be(0.5) experiments is not included in the PI predicting those \(m=50\) future observations, which is based on \(x\) events in \(n=20\) observations obtained from iid Be(0.5). Eight out of 100 simulated PIs are red, i.e., the empirical coverage of 92% is approximately equal to the assumed \(1-\alpha =95\%\) confidence level. The main drawback of the PI is that it is only useful for predicting the performance of one, or a small number, of future observations and does not explicitly specify the proportion of the population to be covered [8]. To mitigate this drawback, tolerance intervals (TIs) have been suggested.
Tolerance interval (TI)
Frequentist definitions of tolerance intervals have a long history, dating back at least to the seminal works of Wilks [24] and Hamada et al. [25]. The origins of Bayesian tolerance intervals can be traced to Aitchison [26]. Krishnamoorthy and Mathew [27] and Meeker et al. [8] define the Bayesian tolerance interval by a frequentist formula applied to the posterior distribution. Similar to a PI, a TI provides lower and upper bounds on the count of observations showing an event of interest (attaining value 1) in the future sample of \(m\) observations. A TI requires the specification of two inputs: the percentage of the population \(P\) that is covered by the TI and its confidence level \(\left(1-\alpha \right)\). \(P\) is also called the content of the tolerance interval. For two-sided and equal-tailed tolerance intervals with an upper and a lower limit, a specified proportion \(P\) of the population is contained within the bounds with a specified level of confidence \(\left(1-\alpha \right)\) [27, 28] (Additional file 1). It is also possible to create one-sided tolerance intervals with respect to a threshold of interest. Both \(\alpha \) (i.e., \(1-\alpha \)) and \(P\) can be varied independently to adjust the requested level of confidence and content. Several authors indicate that TIs are underused in the literature [29, 30] and are frequently not applied in situations where they actually should be. For example, reference values for diagnostic purposes are a special case of the application of tolerance intervals. In the R code below, \(P\) denotes the chosen content or proportion of the population and has nothing in common with p-values.
In R, the command bintol.int(), available in the package tolerance [28], calculates a two-sided TI (side = 2) of content \(P=0.9\) for a future sample of size \(m\), based on \(x\) fractures out of \(n\) kicks, using Wilson's approach (“WS”) together with a statistical error of \(\alpha =0.05\). Note that the package tolerance facilitates the computation of a broad range of tolerance intervals far beyond this binomial application.
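A minimal sketch of such a call (argument names as in the package documentation):

library(tolerance)
# Two-sided Wilson ("WS") tolerance interval with content P = 0.9 for a
# future sample of m = 50, based on x = 2 fractures in n = 16 kicks
bintol.int(x = 2, n = 16, m = 50, alpha = 0.05, P = 0.9,
           side = 2, method = "WS")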
Based on the current \(x=2\) and \(n=16\) observations, the classical Wilson-TI for the confidence level \(1-\alpha =0.95\), content \(P=0.8\) and a future sample of \(m=50\) observations yields a TI for counts of (0 to 22) in Table 1. This result can be interpreted as follows: When predicting the count of radial or tibial fractures for \(m=50\) future kicks, based on the observed \(x=2\) fractures in \(n=16\) kicks, at least a proportion of 80% of future fractures (when repeating such an experiment a large number of times) will be covered by the Wilson-TI (0 to 22) with a confidence of 95% (i.e., in 95% of repeated independent, identical realisations of such a kick experiment with a horn impactor at a velocity of 16 m/s). It is also possible to obtain TIs based on a Bayesian approach by setting the method argument to the Jeffreys approach (“JF”).
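For example (a sketch; a1 and a2 specify the Beta prior parameters, here the Jeffreys choice of 0.5):

# Bayesian Jeffreys ("JF") tolerance interval with prior Beta(0.5, 0.5)
bintol.int(x = 2, n = 16, m = 50, alpha = 0.05, P = 0.8,
           side = 2, method = "JF", a1 = 0.5, a2 = 0.5)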
The Bayesian Jeffreys-TI for the confidence level \(1-\alpha =0.95\), content \(P=0.8\) and a future sample of \(m=50\) observations yields a TI for counts of (0 to 22) in Table 1. This result can be interpreted as follows: When predicting the count of radial or tibial fractures for \(m=50\) future kicks, given the already observed \(x=2\) fractures in \(n=16\) kicks and the minimally informative Jeffreys prior, at least a proportion of 80% of future fractures (when repeating such an experiment independently a large number of times) will be covered by the Jeffreys-TI (0 to 22) with a probability of 95%.
Given a fixed future sample size \(m\), both the classical and the Bayesian TI show that a larger content \(P\) induces a wider TI on the count scale. Moreover, for a fixed content \(P\), an increased future sample size \(m\) is linked to a narrower TI on the probability scale (Table 1).
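The first behaviour can be checked directly (a sketch varying the content P at a fixed m = 50):

library(tolerance)
# A larger content P yields a wider tolerance interval on the count scale
for (P in c(0.5, 0.8, 0.9))
  print(bintol.int(x = 2, n = 16, m = 50, alpha = 0.05, P = P,
                   side = 2, method = "WS"))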
A Bayesian TI is computed by a hybrid approach. First, a posterior distribution based on the data and a Jeffreys prior is computed. Second, the classical methodology for TI computation is applied to the posterior distribution [27]. Consequently, the interpretation of a Bayesian TI only partly benefits from the Bayesian argument. For one part of a TI, the classical “when sampling multiple times…” interpretation remains.
Figure 3 demonstrates the properties of the TI. TIs do not need to cover any true parameter \(\pi \), but they contain at least a specified proportion \(P\) of the population with confidence \((1-\alpha )\). Red TIs indicate TIs that are too short and do not contain the requested proportion \(P\) of the population. This occurred in 7 out of 100 simulated samples.
The use of a TI is recommended if a researcher wants to use the observed data to make predictions for a large number of future observations and, simultaneously, wants the interval to contain a prespecified proportion (\(P\)) of typical observations with confidence \(\left(1-\alpha \right)\) [8]. For large sample sizes, the bounds of the TI approach the corresponding quantiles of the underlying population, so that the requested content \(P\) is guaranteed for any future sample size \(m\).
Coverage properties and asymptotic behaviour of CI, PI, and TI
An important indicator of the adequacy of interval estimates is their coverage. According to Meeker et al. ([8], p. 403), the coverage probability “is the probability that the interval obtained using the procedure actually contains what it is claimed to contain, as a function of the procedure's definition”. Coverage can be verified either by mathematical derivations or through extensive Monte Carlo simulations. The adequacy of a mathematical procedure used to compute interval estimates is proven if its effective coverage agrees well with the nominal level stipulated by the assumptions imposed for its computation. For example, in the context of confidence intervals, a procedure for 95% CI computation is adequate if the computed intervals effectively cover the true probability \(\pi \) in 95% of cases. For CI, it was shown that not every mathematical procedure suggested for the computation of a 95% CI attains nominal coverage [8, 16, 19, 31,32,33]. For PI, the coverage was investigated by [
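Such a Monte Carlo check of effective coverage can be sketched as follows (our own illustration, mirroring the Be(0.5), \(n=20\) setting used in Fig. 2):

library(DescTools)
set.seed(1)
pi_true <- 0.5; n <- 20; B <- 10000
# Simulate B samples and record whether the 95% Wilson-CI covers pi_true
covered <- replicate(B, {
  x <- rbinom(1, size = n, prob = pi_true)
  ci <- BinomCI(x, n, conf.level = 0.95, method = "wilson")
  ci[, "lwr.ci"] <= pi_true && pi_true <= ci[, "upr.ci"]
})
mean(covered)  # effective coverage, expected to be close to the nominal 0.95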