Capturing emergency dispatch address points as geocoding candidates to quantify delimited confidence in residential geolocation

Introduction to the proposed methodology

CCRs lack control over domain constraints of attribute sets that manifest the epidemiologic concepts of Person, Place, and Time (to a lesser extent) [15]. Limits to confidence in the data stem from the availability of information; some information is available to upstream stewards but is effectively missing to downstream stewards, which is typical of a disease surveillance data stream. Managed by upstream stewards, such attribute sets can be seen only as attribute associations (AAs) by downstream stewards like CCRs, meaning that downstream stewards have limited capacity to assess concordance or discordance between different fields of these attribute sets [14, 15]. Considering the cancer surveillance data stream, we proposed a convention of 15 AAs that enable and/or modify spatiotemporal relationships in cancer surveillance data as the basis for assessing delimited confidence in residential geolocation [14, 15]. We limited the number of AAs in cancer incidence data to just those for which an argument for quantifying confidence can be reasonably made. These AAs and their metadata store information about both the patient's residential geolocation and its uncertainty. ED address points present one of the most heavily used AAs for geocoding. Here, we focus on their address components, such as house number and street name, and their concordance or lack thereof.

The theoretical basis of our methodology is rooted in the discriminant power of residential addresses. The term residence implies the spatiotemporal finitude of the residential anthroposphere. This finitude is delimited by the domain of known residential addresses, which are designed to be perfectly discriminant and thus confer discriminant power. We introduce the term Residential Address Discriminant Power (RADP), which is theoretically perfect within the domain of all residential addresses. ED data also have a certain discriminant power, which stems from the aspiration to the following objectives: (a) data completeness, including both object and attribute completeness as defined by Bleiholder and Naumann [20], and (b) database integrity as defined by Motro [21, 22]. To best identify patients’ residential geolocation, CCRs employ geocoding record linkage against ED data. ED data are used to assign a geocode to the patient’s address by matching it to an ED address point. However, ED address points may not perfectly discriminate between residential addresses (e.g., when a sub-address is missing). Therefore, residential geolocation (i.e., the process of assigning residential geocodes) may incur a loss of discriminant power. We propose the term Residential Geolocation Discriminant Power (RGDP) to quantify losses of the discriminant power of a patient’s address during the process of geolocation.

ED address point candidates of equivalent likelihood (CEL) to quantify RGDP for a single cancer case

Matching a patient address to record linkage candidates most often yields a 1:1 cardinality between the residential address and the ED address point candidate, i.e., a unique best geolocation candidate for the address is identified. In such a case, there is no uncertainty in assigning a geocode to a residence. Uncertainty arises when the input value (i.e., the patient address) can be matched to more than one candidate in ED data (i.e., more than one address and/or residence).

A typical situation in which ED address points do not perfectly discriminate between residential addresses is when ED address points are missing sub-addresses. This means, for example, that 12 ED address points are assigned to 12 sub-address-specific residences, but the data do not specify the correspondence between address points and sub-addresses. As a result, these 12 ED address points have the same address but different geocodes; they become record linkage candidates for a patient address that matches one of the 12 unspecified sub-addresses. In this scenario, each of the candidates is equally likely to be the correct match; we denote the set of these match candidates as candidates of equivalent likelihood (CEL). Any of the CEL can be used to assign a geocode to this residence, while incurring uncertainty or a loss of confidence in residential geolocation. We quantify the loss in RGDP as inverse to the number of CEL per linkage:

$$ \mathrm{RGDP} = \frac{1}{\mathrm{CEL}} $$

where RGDP is a fraction ranging from > 0 to 1 that equals 1 when record linkage identifies a unique best candidate. RGDP decreases as the number of CEL increases, corresponding to the loss of confidence in residential geolocation. For some patient addresses, no ED candidates can be identified (for example, in residential areas undergoing construction to which ED address points have not been assigned). We then deem CEL = 0. For such cases, the number of candidates can still be estimated, as opposed to captured by enumerating existing ED CEL. This estimation technique is a subject for future research and is beyond the scope of this manuscript. Because the presented formula does not apply to cases with CEL = 0, for this manuscript, cases with no ED candidates are deemed to have RGDP = 0.
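The per-case rule described above can be sketched in a few lines of Python. This is a minimal illustration rather than the registry's implementation; the function name `rgdp` is hypothetical, and the only inputs assumed are the CEL counts discussed in the text.

```python
def rgdp(cel_count: int) -> float:
    """Return RGDP for one case, given its number of CEL.

    Follows the convention stated in the text: RGDP is the inverse of
    the number of CEL, and cases with no ED candidates (CEL = 0) are
    deemed to have RGDP = 0.
    """
    if cel_count < 0:
        raise ValueError("The number of CEL cannot be negative")
    if cel_count == 0:
        return 0.0  # no ED candidate identified
    return 1.0 / cel_count


print(rgdp(1))   # unique best candidate
print(rgdp(12))  # 12 address points sharing one address, about 0.083
print(rgdp(0))   # no ED candidates
```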

Using CEL to quantify confidence with geocoding record linkage substitution

When the best unique candidate cannot be identified, either one of the CEL or a less resolute proxy (e.g., postal code area centroid) can be chosen. We term these practices geocoding record linkage substitution. This term can serve as an umbrella for existing geocoding techniques used to avoid record linkage failure. Such techniques include hierarchical geocoding and others that have been summarized and described in more detail by Goldberg [23] and by other authors [24,25,26,27,28,29]. Record linkage substitution presents a fundamental difference between the practices in geocoding and patient identity linkage. Record linkage produces a binary result for patient identity: success or failure. In contrast, geocoding produces results with degrees of success, depending on the size of the area of uncertainty and the number of CEL within the area. In some situations, a patient address has more than one set of CEL (also known as a ‘multi-match’) because more than one patient address component has uncertainty. Following the principle of information entropy maximization [30], these cancer cases must undergo geocoding record linkage substitution, incurring less resolute geocodes, until they can be geocoded with a single set of CEL. In Appendix 1, we illustrate different choices for record linkage substitution with a hypothetical example and show how the principle of entropy maximization is a consideration in record linkage choices. In Appendix 2, we propose a convention for record linkage substitution and for calculating RGDP for cancer cases when ED address point candidates are derived from vector planimetric features, including street centerlines.

Summarized RGDP

To meet the requirements for protecting the confidentiality of patient data, CCRs cannot release RGDP for a single cancer case. These requirements can be better met when RGDP is summarized across cases, which we denote as sRGDP and propose the following calculation formula:

$$ \mathrm{sRGDP} = \frac{\sum_{i=1}^{N} \frac{1}{\mathrm{CEL}_i}}{N} $$

In this formula, \(\mathrm{CEL}_i\) represents the number of ED address point candidates for the \(i\)th case; \(N\) is the total number of cases in an area for a certain time period. As such, sRGDP for \(N\) cases will diminish as the number of cases with multiple CEL increases. Cases with CEL = 0 contribute 0 to the numerator and 1 to the denominator. We propose to present sRGDP for all cases used to generate published incidence rates as a measure of the data quality for residential geolocation in a certain area.
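As a worked illustration of the summary measure, the sketch below averages per-case RGDP over a list of CEL counts, with cases having CEL = 0 adding 0 to the numerator while still counting toward \(N\). The function name `summarized_rgdp` and the example counts are hypothetical and do not come from NC CCR data.

```python
def summarized_rgdp(cel_counts):
    """Return sRGDP for a group of cases, given one CEL count per case."""
    if not cel_counts:
        raise ValueError("At least one case is required")
    if any(c < 0 for c in cel_counts):
        raise ValueError("The number of CEL cannot be negative")
    # Cases with CEL = 0 contribute 0 to the numerator and 1 to N.
    numerator = sum(1.0 / c if c > 0 else 0.0 for c in cel_counts)
    return numerator / len(cel_counts)


# Hypothetical area with four cases: two unique matches, one case with
# 12 CEL, and one case with no ED candidates.
example_counts = [1, 1, 12, 0]
print(round(summarized_rgdp(example_counts), 4))
```

For these four hypothetical cases the numerator is 1 + 1 + 1/12 + 0, so sRGDP is roughly 0.52, which shows how a single unmatched case and a single multi-candidate case pull the area summary well below 1.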

Proposed geocoding record linkage steps with computation of sRGDP using a subset of NC CCR cancer cases

To demonstrate how our approach works with real data, we present the number of CEL per case using the recent data from the NC CCR that included all cancer cases reported to the NC CCR in the month of January 2022. We describe the workflow as consecutive steps in geocoding along with the quantification of uncertainty in residential geolocation. Below we describe these steps and summarize the workflow in Fig. 1.

Fig. 1 Workflow for Residential Geolocation of Cancer Cases and RGDP Capture

Step 1: Identification of CEL for vector residential geographic data proxied by ED address points

We prepared for geocoding record linkage by identifying duplicates in the ED address domain (Fig. 1, Step 1). Often, these duplicates reflect missing information about sub-addresses. We stored the number of duplicates as an additional meta-data field in the ED address data. Further, we prepared for the use of less resolute geographic reference data by creating cross-reference tables to link, for example, street centerline segments with their associated ED address points. Similarly, we created a cross-reference table to enumerate the ED address points associated with each postal code area, in preparation for using the postal code area centroid as a proxy.
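The duplicate enumeration described in this step can be sketched as a simple count of address points per normalized address string. The record layout `(address, x, y)`, the `normalize` helper, and the sample points are assumptions made for illustration; actual ED data schemas and address standardization rules will differ.

```python
from collections import Counter


def normalize(address: str) -> str:
    """Uppercase and collapse whitespace (a deliberately crude standardization)."""
    return " ".join(address.upper().split())


def count_points_per_address(ed_points):
    """Map each normalized address to the number of ED points that share it.

    ed_points is an iterable of (address, x, y) tuples. A count above 1
    marks duplicates, e.g., address points lacking sub-addresses.
    """
    return Counter(normalize(address) for address, _x, _y in ed_points)


# Hypothetical ED address points: three share an address but not a geocode.
ed_points = [
    ("12 Oak St", 100.0, 200.0),
    ("12 Oak St", 101.5, 200.0),
    ("12  oak st", 103.0, 200.0),
    ("14 Oak St", 120.0, 200.0),
]
duplicate_counts = count_points_per_address(ed_points)
print(duplicate_counts["12 OAK ST"], duplicate_counts["14 OAK ST"])
```

The resulting count per address is what would be stored as the additional meta-data field and later read back as the number of CEL.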

Another part of the preparation included creating generalizations of all ED addresses using 2014 vintage NC ED data (Fig. 1, Step 1). At the time of our analysis, the 2014 vintage ED address points [31] were the most recent statewide ED dataset. ED address generalizations are versions of an ED address from which one or more address components have been subtracted, following the principle of attribute relaxation described by Levine and Kim [32] and Goldberg [23]. For example, the postal code in the patient’s address may not match the combination of house number and street name. However, if the ED candidate address remains unique when we remove the postal code, we use that generalization for geocoding to make a match in spite of the reported postal code being discordant. Approximately 80% of 2014 NC ED addresses were still unique without postal codes. We used unique address generalizations as a threshold for record linkage exhaustion because such a threshold is replicable and allows tolerance for error or missingness in patient address components during record linkage. Currently, there are no CCR community-specific conventions for record linkage exhaustion in geocoding of cancer incidence patient addresses, beyond verification of domain inclusion/exclusion during automated record linkage with commonly employed algorithms such as Fellegi-Sunter [33]. Individual CCRs have the discretion to delimit the extent of their record linkage. On a monthly basis, 20–60% of NC CCR deterministic linkage matches involve ED address generalizations.
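To make the idea of a unique address generalization concrete, the following sketch drops the postal code from each distinct ED address and keeps only those house number and street name combinations that still identify a single address. The tuple layout, the function name `unique_without_postal`, and the sample addresses are illustrative assumptions, not the NC CCR procedure.

```python
from collections import Counter


def unique_without_postal(ed_addresses):
    """Return (house_number, street_name) keys that remain unique
    after the postal code is removed.

    ed_addresses is an iterable of (house_number, street_name, postal_code).
    Exact duplicates are collapsed first, so uniqueness is judged across
    distinct addresses rather than across address points.
    """
    distinct = set(ed_addresses)
    counts = Counter((house, street) for house, street, _postal in distinct)
    return {key for key, n in counts.items() if n == 1}


# Hypothetical ED addresses: "12 OAK ST" exists in two postal code areas,
# so its generalization is ambiguous; "40 ELM AVE" stays unique.
ed_addresses = [
    ("12", "OAK ST", "27601"),
    ("12", "OAK ST", "27513"),
    ("40", "ELM AVE", "27601"),
]
usable = unique_without_postal(ed_addresses)
print(("40", "ELM AVE") in usable, ("12", "OAK ST") in usable)
```

A patient address reporting "40 ELM AVE" with a discordant postal code could still be linked through the generalization, whereas "12 OAK ST" could not be resolved this way.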

Step 2: Probabilistic and deterministic geocoding record linkage

Our procedure has been designed to mitigate additional expenses associated with the capture of CEL. By introducing quantification of confidence in residential geolocation, we are compelled to monitor record linkage productivity, which was not necessary for geocoding record linkage alone. In Table 1, we contrast different categories of geocoding productivity providing a framework to discuss this effort.

Table 1 Productivity of Geocoding Record Linkage and Quantification of Confidence in Residential Geolocation for Large Subsets of Cancer Cases

To maximize the number of cases geocoded at the higher productivity level, we used automated geocoding with a probabilistic algorithm to link a batch of patients’ addresses against ED address points and parcels, supplemented by a semi-automated match with a deterministic algorithm (Fig. 1, Step 2). In this research, we started with probabilistic record linkage. Probabilistic record linkage utilizes an unsupervised classification algorithm designed to allocate field-specific weights, as demonstrated by Fellegi and Sunter [33]. These weights are determined by evaluating the level of concordance between the linked fields in order to compare the probability of a match to the probability of a non-match. Because of its performance and scalability, this approach has been extensively adopted in a variety of record linkage applications, including geocoding. Generally, some addresses are not matched probabilistically and are therefore linked using a deterministic algorithm, which is a rule-based evaluation of string similarity based on edit distance.
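The field-specific weighting in the Fellegi-Sunter framework is commonly expressed through m-probabilities (agreement on a field given a true match) and u-probabilities (agreement given a non-match). The sketch below sums log-ratio weights over three address fields. The probabilities are invented for illustration, and in practice they would be estimated from the data; nothing here reflects the parameters of the geocoding software used in this study.

```python
import math

# Hypothetical (m, u) probabilities per field; illustrative values only.
FIELD_PROBS = {
    "house_number": (0.95, 0.01),
    "street_name": (0.90, 0.05),
    "postal_code": (0.85, 0.10),
}


def match_weight(patient: dict, candidate: dict) -> float:
    """Sum Fellegi-Sunter style agreement/disagreement weights over fields."""
    total = 0.0
    for field, (m, u) in FIELD_PROBS.items():
        if patient.get(field) == candidate.get(field):
            total += math.log2(m / u)              # agreement weight
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement weight
    return total


patient = {"house_number": "12", "street_name": "OAK ST", "postal_code": "27601"}
same = {"house_number": "12", "street_name": "OAK ST", "postal_code": "27601"}
other_postal = {"house_number": "12", "street_name": "OAK ST", "postal_code": "27513"}

print(match_weight(patient, same) > match_weight(patient, other_postal))
```

A candidate would be accepted as a match when its total weight exceeds an upper threshold and rejected below a lower one, with the interval in between left for review; those thresholds are not specified in the text and are omitted here.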

To further minimize the number of cases requiring interactive geocoding, we deterministically linked patient addresses against unique ED address point address generalizations (Table 1). In this research, approximately 20% of deterministic matches involved address generalizations. The outcome of the probabilistic and deterministic linkage was determining the best ED address point candidate for a patient’s address or alternatively, capturing its CEL. The number of ED address point CEL for a patient address was derived from metadata (e.g., enumerated duplicate ED address points per address) or the number of ED address points corresponding to the street centerline, and stored in a cross-reference table. For parcel centroids, we determined CEL based on the geometric intersection between parcels and ED address points as this placement of address point relative to parcel is based on current standards and best practices [34]. Specifically, we considered CEL = 1 when a parcel geometrically contains one ED address point; generally, CEL was equal to the number of ED address points that coincide with a parcel.
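Counting the ED address points that fall within a parcel can be illustrated with a standard ray-casting point-in-polygon test. This is a simplified stand-in for the geometric intersection that would normally be performed in GIS software; the function names and coordinates are hypothetical, and points lying exactly on a parcel boundary are not treated specially.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test; polygon is a list of (x, y) vertices in order."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray from (x, y) cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def parcel_cel(parcel, ed_points):
    """Number of ED address points contained in the parcel polygon."""
    return sum(1 for x, y in ed_points if point_in_polygon(x, y, parcel))


# Hypothetical square parcel with two address points inside and one outside.
parcel = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
ed_points = [(5.0, 5.0), (2.0, 8.0), (15.0, 5.0)]
print(parcel_cel(parcel, ed_points))
```

Here the parcel centroid would carry CEL = 2, since two address points coincide with the parcel.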

Our strategy was to maximize the number of cases matched to the most address-discriminant candidates, which are either ED address points or parcel centroids. When all other accessible evidence for identifying a match to an ED address point or parcel centroid was exhausted, we attempted to match the remaining addresses to a street centerline.

Step 3: Interactive geocoding record linkage

The remaining addresses did not match an ED candidate automatically because their match candidates did not meet probabilistic or deterministic thresholds of similarity. For these addresses, we conducted interactive geocoding, attempting to further maximize matches to an ED address point or parcel centroid. For example, the street name in the patient’s address sometimes could not be found in the street domain of the ED data. This happens when an apartment complex has street names within the complex, whereas ED address points are assigned only to the street that the complex faces. We attempted to use additional sources besides ED data (such as linking the patient’s name to the parcel owner’s name) to match an address to an ED candidate. Using additional sources of information allowed us to match addresses to either a parcel centroid or a street centerline interactively (Fig. 1, Step 3). When all other evidence for a street-level or manual match had been exhausted, we attempted to geocode to the postal code area centroid. Addresses that contained a PO Box instead of a street address were geocoded to USPS branch offices.

We propose to recycle manually placed geocodes (code ‘08’ in the GIS Coordinate Quality metadata item [35]). The geocodes placed by CCR staff to proxy a residence (generally in areas distinguished by the absence of ED address points or parcel lot lines) can be recycled across cases. The meta-data associated with these geocodes communicates that their (x, y) coordinates are not included in the ED address point (x, y) domain, or other geographic reference data, at the time of geocoding. Manually placed geocodes generally indicate a building or parcel that has not had a planimetric feature assigned. As manual placement is a relatively expensive operation, its recycling increases productivity while quantifying confidence in residential geolocation.

The meta-data ideally includes post-geocoding referential integrity between the patient’s address and its CEL and/or storage of convex hulls that delineate the area of uncertainty in residential geolocation. The interactive process of encoding verification and establishing referential integrity with CEL is by far the most expensive part of geocoding with quantification of confidence. Consequently, it is important to verify that all possibilities of matching at a higher level of productivity have been ruled out.

Step 4: Quality control

We distinguish between the quality control edits traditionally used in cancer surveillance data and residential geo-edits. Traditionally, quality control of cancer surveillance data has been achieved through tests of cross-column logical consistency called edits [35]. This is done in part to maintain a minimum threshold of comparability of data quality across subsets of cases. Applying these principles to residential geolocation necessitates residential geo-edits, i.e., tests for logical consistency of the domains of data fields that enable and/or modify spatiotemporal relationships in cancer data (Fig. 1, Step 4). These domains are controlled by stewards other than CCRs [15]. Because of this, meeting a minimum threshold of comparability across subsets of cases requires disproportionately more edits than for domains controlled by CCRs. In part because they verify geography, residential geo-edits tend to be more numerous than edits for CCR-constrained attribute sets on a per-assessed-field basis.

Examples of residential geo-edits are tests for consistency between nested census enumeration areas (i.e., US census block group, tract, and county). Geo-edits must verify, for example, that certainty of census tract [35] corresponds to levels of geocoding record linkage substitution as indicated by other meta-data. In addition to testing for consistency in cross-column domain and/or metadata fields, residential geo-edits also need to scan for discordance between reference data, e.g., whether two reference datasets disagree on the county associated with a given address point.
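A nesting edit of this kind can be sketched using the structure of US Census GEOIDs, in which a 5-digit county code is the prefix of the 11-digit tract code, which in turn is the prefix of the 12-digit block group code. The function name, the failure messages, and the example codes are hypothetical; the actual geo-edits applied in this research are not reproduced here.

```python
def nested_census_edit(county_fips: str, tract_geoid: str, block_group_geoid: str):
    """Return a list of failure messages; an empty list means the edit passes."""
    failures = []
    if len(county_fips) != 5 or not county_fips.isdigit():
        failures.append("county code must be 5 digits")
    if len(tract_geoid) != 11 or not tract_geoid.isdigit():
        failures.append("tract code must be 11 digits")
    if len(block_group_geoid) != 12 or not block_group_geoid.isdigit():
        failures.append("block group code must be 12 digits")
    # Nesting checks: each finer unit must begin with the coarser unit's code.
    if not tract_geoid.startswith(county_fips):
        failures.append("tract is not nested in county")
    if not block_group_geoid.startswith(tract_geoid):
        failures.append("block group is not nested in tract")
    return failures


# Hypothetical codes: the first record is consistent, the second places
# the tract in a different county than the one recorded.
print(nested_census_edit("37183", "37183050100", "371830501001"))
print(nested_census_edit("37063", "37183050100", "371830501001"))
```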

For this research, we used more than 500 residential geo-edits, many of which are specific to NC. These geo-edits included a core set that is ostensibly applicable to all US states to meet minimum thresholds of confidence in residential geolocation. Two of these are needed to screen for false positive matches in address and geocode. Cases exceeding a threshold of edit distance between the original and matched address, or of Euclidean distance between the geocode and the centroid of the original postal code area, were interactively reviewed; their values were changed, or else they were assigned override meta-data based on patient address research. Further, we scanned for parcels and building footprints that span a census enumeration area boundary when parcels and building footprints are used to derive the ED geocode. Of these, a small proportion were address points with missing sub-addresses on either side of the enumeration area boundary, or ED address points sharing a common geocode (e.g., high-rise buildings with multiple residences). In these circumstances, the relationship between address and enumeration area based on geometric coincidence breaks down. We quantified the loss of confidence in residential geolocation that this scenario incurs.
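The two false-positive screens can be sketched as a Levenshtein edit distance between the original and matched address strings, combined with a Euclidean distance between the geocode and the postal code area centroid. The threshold values, the function names, and the assumption that coordinates are in a projected system (so that Euclidean distance is meaningful) are all illustrative; the thresholds used in this research are not stated in the text.

```python
import math


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def needs_review(original, matched, geocode, postal_centroid,
                 max_edit=3, max_distance=10000.0):
    """Flag a case when either screen exceeds its (hypothetical) threshold."""
    edit = levenshtein(original, matched)
    distance = math.dist(geocode, postal_centroid)  # projected units assumed
    return edit > max_edit or distance > max_distance


print(needs_review("12 OAK ST", "12 OAK ST", (0.0, 0.0), (3.0, 4.0)))
print(needs_review("12 OAK ST", "12 OAK STREET", (0.0, 0.0), (3.0, 4.0)))
```

Flagged cases would then go to interactive review, where values are corrected or override meta-data is assigned.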

The disproportionately large number of geo-edits, relative to the fields they assess, necessitated cost management. The larger the number of edits, the more time was needed to ensure that they are mutually exclusive (non-redundant). Another important consideration was runtime performance. We have managed the costs of geo-edits by running them at off-peak times.

Sources of uncertainty in geolocation of residence at diagnosis

We broadly divided the origins of uncertainty in residential geolocation between two sources of addresses: ED address point data and demographic data (patient address). The patient address is authored by the patient and/or by a medical facility on behalf of the patient. This distinction is important to citizens impacted by the publication of cancer incidence rates at the sub-county scale because it ostensibly allows concerned citizens to follow up with the organizations responsible for the uncertainty in the data used to produce the sub-county incidence rates.

We used the meta-data generated in Step 1 to identify cases for which uncertainty in residential geolocation stems from missing attributes in ED data. When CEL = 0 but a residence was apparent in orthophoto, then ED candidate points were effectively missing; the uncertainty in residential geolocation for such cases was attributed to the current vintage ED data. Missing address points may be an artifact of data vintage as their missingness is apparent during analysis but not necessarily at a later date. Note, assignment of uncertainty origin to ED stewardship for cases with CEL = 0 may be at best preliminary. If the uncertainty could not be traced to ED data, it is assumed to stem from the patient or medical facility.

Statistical analysis

To illustrate how enumeration areas differ in confidence in residential geolocation, we calculated sRGDP for cancer cases by county. In January 2022, medical facilities reported cases that were geocoded to 74 out of 100 NC counties. In addition, we assessed ED data quality for these 74 counties as a general measure of ED data quality, using ED meta-data and the cross-link files that were created in preparation for geocoding. Specifically, we identified the number of CEL for each ED address in the 2014 version of the NC ED data and calculated ED_sRGDP for each county.

We explored whether each measure, sRGDP and ED_sRGDP, differs between rural and urban counties using the Wilcoxon test. Counties were categorized as rural or urban according to the US Census urban/rural designations as of 2010. We hypothesized that confidence in residential geolocation of cancer cases is correlated with the quality of ED data as assessed by ED_sRGDP. To test this hypothesis, we determined the degree of correlation between sRGDP and ED_sRGDP by county using the Pearson correlation coefficient. We also hypothesized that sRGDP within a given county is lower than ED_sRGDP in the same county due to added uncertainty originating from patient and/or medical facility reporting. To evaluate formally whether sRGDP and ED_sRGDP differ, we combined them in a regression model as a continuous outcome with two categorical predictors: case-related vs. ED-related and rural vs. urban. A negative beta-coefficient for case vs. ED data with a p-value < 0.05 was interpreted as evidence in support of the hypothesis.
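For readers who wish to reproduce the county-level correlation step, a minimal sketch of the Pearson correlation coefficient is given below. The county values are invented purely to exercise the function and are not results from this study; in practice a statistical package would be used and would also supply the p-value.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("Need two sequences of equal length, at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("Correlation is undefined for a constant sequence")
    return covariance / math.sqrt(var_x * var_y)


# Invented county-level values (one pair per county), for illustration only.
county_srgdp = [0.90, 0.85, 0.70, 0.95, 0.60]
county_ed_srgdp = [0.95, 0.90, 0.80, 0.97, 0.75]
print(round(pearson_r(county_srgdp, county_ed_srgdp), 3))
```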
