Reconstruction of private genomes through reference-based genotype imputation

Seed-sieve-extend (SSE) strategy for haplotype reconstruction

Relevant details about minimac

The haplotype reconstruction portion of our attack was designed with the minimac imputation algorithm used by the Michigan Imputation Server (MIS) in mind [6]. Like other imputation algorithms, minimac takes as input a set of query sequences (typically samples genotyped on an array) and uses a reference panel of sequenced samples to predict the most likely genotypes at positions where the input is untyped. For each position typed in the reference panel and for each input sample haplotype, it outputs a genotype prediction and a corresponding dosage value. Assuming biallelic variants, a genotype value of 0 corresponds to the reference allele (REF) and a value of 1 corresponds to the alternative allele (ALT). The output dosage value indicates, roughly, confidence in the genotype prediction: a dosage close to 0 indicates the REF allele is likely, and a dosage close to 1 indicates the ALT allele is likely. Minimac has been updated over the years for space and computational efficiency, and the current state of the art is minimac4 (the specific version of the package we use throughout), but the high-level algorithm remains the same as the original.

Minimac models the imputation problem with a hidden Markov model (HMM) such that a hidden state \(S_i\) at SNP base position i takes on a value in \(\{1, \dots , n\}\), where n is the number of haplotypes in the reference panel, and an observation \(X_i\) represents the observed genotype and takes on a value in \(\{0, 1\}\), viewed as a noisy copy of the genotype of the reference haplotype \(S_i\) at the same position. Given a set of observed genotypes at a subset of positions, minimac uses the standard forward-backward algorithm to get a distribution expressing the most likely hidden states (reference haplotypes) at each untyped position. This distribution is translated into a genotype distribution based on the genotypes of the reference haplotypes, and the expected value associated with this genotype distribution corresponds to the dosage value output by the program. The discrete genotype prediction is the most likely genotype according to the distribution.
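To make this concrete, the following is a minimal sketch of a haploid Li-Stephens-style forward-backward computation that produces dosages as posterior expectations over reference haplotypes. It assumes uniform switch and mismatch probabilities and a dense 0/1 reference matrix; minimac's actual parameterization (e.g., genetic-map-based transition probabilities and state-space compression) differs, so this illustrates only the general mechanism, not minimac's implementation.

```python
import numpy as np

def impute_dosages(ref, typed_idx, query, switch=0.01, mismatch=0.001):
    """Toy haploid Li-Stephens forward-backward (illustrative only).

    ref       : (n_haps, n_sites) 0/1 matrix (reference panel haplotypes)
    typed_idx : indices of sites typed in the query
    query     : 0/1 genotypes of the query at typed_idx
    Returns a per-site dosage: the posterior expected ALT allele.
    """
    n_haps, n_sites = ref.shape
    obs = dict(zip(typed_idx, query))

    def emission(site):
        if site not in obs:                      # untyped site: uninformative
            return np.ones(n_haps)
        match = ref[:, site] == obs[site]
        return np.where(match, 1.0 - mismatch, mismatch)

    # Forward pass (rescaled to avoid underflow)
    fwd = np.zeros((n_sites, n_haps))
    fwd[0] = emission(0) / n_haps
    for t in range(1, n_sites):
        stay = (1.0 - switch) * fwd[t - 1]
        jump = switch * fwd[t - 1].sum() / n_haps
        fwd[t] = emission(t) * (stay + jump)
        fwd[t] /= fwd[t].sum()

    # Backward pass
    bwd = np.zeros((n_sites, n_haps))
    bwd[-1] = 1.0
    for t in range(n_sites - 2, -1, -1):
        msg = emission(t + 1) * bwd[t + 1]
        bwd[t] = (1.0 - switch) * msg + switch * msg.sum() / n_haps
        bwd[t] /= bwd[t].sum()

    post = fwd * bwd
    post /= post.sum(axis=1, keepdims=True)      # posterior over reference haplotypes
    return (post * ref.T).sum(axis=1)            # expected ALT allele = dosage
```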

Intuitively, imputation models each query haplotype as a “mosaic” of the haplotypes in the reference panel, as if the query sequence is composed of pieces of the reference haplotypes, so that those pieces can be used to fill in the gaps in the query. This conceptualization is justified by the fact that even distantly related individuals share short stretches of DNA, known as identity-by-descent (IBD) segments. Considered this way, minimac can be thought of as finding the closest matching reference haplotype at each point along the genome. In actuality, minimac computes a probability distribution over all reference haplotypes at each position, but a haplotype that matches the query closely over a stretch of positions will be weighted heavily in the distribution at those positions.

Key insights enabling the strategy

The reconstruction strategy we devised hinges on a consequence of the points made just above: A query sequence that exactly matches only one single reference panel haplotype will result in genotype and dosage output that reveal information about the matching haplotype. Specifically, at positions where the query matches the haplotype and at other nearby positions, the output genotypes will tend to be those of the matching reference haplotype, and the output dosages will tend to be very close to 0 or 1 (extreme dosages), indicating “high confidence” in the genotype output. This occurs because, as noted above, a haplotype that closely matches the query at a set of positions will be heavily weighted in the distribution at those positions: in HMM terminology, no other reference haplotype is as likely to have emitted the observed genotypes in the query as the one haplotype with exactly those genotypes.

The insight described above means that when a query exactly matches a single reference panel haplotype, we may be able to recognize from the output dosages that this has occurred (a pattern of uniformly extreme dosage values is atypical) and infer that the corresponding output genotypes are likely to match a particular reference panel haplotype. In other words, if we can generate queries that exactly match the reference haplotypes, we may get reconstructed haplotypes in the imputation output and can recognize the likely success of this reconstruction. This observation provides the basis for our haplotype reconstruction strategy.

Overview of the approach

The high-level structure of our seed-sieve-extend (SSE) strategy for haplotype reconstruction is as follows: first, it generates a large number of short queries (“seeds”) and runs them together through minimac. Then, it utilizes a classifier operating on the output dosages to predict how many reference haplotypes each query matches. If a query is predicted to match one or a small number of reference haplotypes (say, 2 to 5, which also produce distinct dosage patterns), the strategy keeps it and discards the rest (“sieving”). If the dosage pattern indicates that a uniquely matching reference chromosome or chunk has been retrieved in its entirety (imputation operates at the chromosome or subchromosome chunk level, rather than over the whole genome), the strategy has achieved its objective for that query. Alternatively, if a query matches some small number of reference haplotypes, the original query is strategically extended to attempt to retrieve each of the matching haplotypes (“extension”). This whole process is repeated with different queries and on different chromosomes/chunks to retrieve additional reference panel haplotypes. Below, we detail the attack scenario and each step of the attack; values and data used in experiments, and other experiment-specific details, are discussed later in the “Haplotype reconstruction experiments” section.

Attack scenario

Our haplotype reconstruction strategy was devised based on the following scenario: An imputation server holds a hidden reference panel (meaning a user cannot directly access the genotype data), and an adversary’s objective is to retrieve as much of the genomic data in that reference panel as possible through interaction with the imputation server alone. The adversary may submit to the server an unlimited number of queries, where a query is a properly formatted set of variants and genotype assignments at those variants. The server’s output returned to the adversary, given an input query sequence, is the result of a minimac run with that query as input. Initially, we assume that the imputation output contains both imputed genotypes and corresponding dosages, at exactly the positions where reference panel samples are typed. Later, we also discuss an alternate strategy which assumes only discretely imputed genotypes, and not continuous dosages, are output.

Step 1: Seed

The first step of the strategy is to generate a large number of query sequences to be input into minimac, the goal being that some of them perfectly match one or a small number of reference panel haplotypes. These are the “seeds” around which we hope to reconstruct the reference haplotypes. In order to accomplish this goal, short queries (of, say, 8 variants) are used because a given conformation of genotypes over a large number of variants would be very unlikely to match any reference haplotype due to the exponential number of possible conformations. Specifically, the strategy repeatedly chooses a small set of variants, enumerates all possible genotype conformations on that set, and submits all of these conformations as queries. In this way, some of the queries are guaranteed to exactly match reference panel haplotypes.

Variant sets are chosen strategically in an attempt to increase the probability that some conformations match one or a small number of haplotypes. A variant with a low minor allele frequency (MAF) is chosen per set and fixed to the ALT allele, and the rest of the set is populated with variants with high MAF. The idea here is that because only a small number of haplotypes will have the ALT allele at the low-MAF SNP, it is more likely that some queries will match a small number of haplotypes. For example, if 8 haplotypes in the reference panel have the ALT allele at variant X, it is certain that at least one of the queries will exactly match between 1 and 8 haplotypes, since we are generating all possible conformations over the other set variant genotypes. High-MAF variants are chosen for the rest of the set to make it more likely that haplotypes with the ALT allele at the low-MAF variant will have different genotypes at the other set variants, resulting in fewer matches per query. High-MAF variants are chosen not to be too close together, since, due to linkage disequilibrium, variants very near each other in centimorgan (cM) distance are more likely to have the same genotypes across haplotypes. However, the set variants are also chosen to be not too far apart, as we observed that variant sets with variants nearer to each other produced cleaner dosage patterns when a query sequence on those sets matched one or a small number of reference haplotypes.

Our implementation, to create each set, generates a random position within the bounds of the chromosome and attempts to build a variant set centered roughly around that position, moving on to another random position if it fails. It takes arguments specifying query length (number of set variants), allele frequency cutoffs for what counts as a low-MAF or high-MAF variant, and the minimum and maximum genetic distance spacing allowed between variants during set construction. To enable the variant spacing strategy, genetic distance values (in cM) were obtained for all variants using hg38 genetic maps provided by the authors of SHAPEIT4 and using code adapted from theirs to interpolate genetic distances at positions not included in the maps [24]. For the variant MAF values, we used population-specific gnomAD allele frequency data [25] combined by weighted average to fit the composition of the reference panel. The specific weightings for the different reference panels used in the experiments are noted in the “Haplotype reconstruction experiments” section.
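The sketch below illustrates one way to implement this seeding step. It assumes a hypothetical variants table (a list of records with position, interpolated cM coordinate, and MAF) and uses a simplified spacing rule (each added variant must lie between the minimum and maximum cM distance from its nearest already-chosen variant); our actual implementation builds sets around a random position and applies the spacing constraints somewhat differently.

```python
import random
from itertools import product

def build_seed_queries(variants, n_high=7, low_maf=0.005, high_maf=0.2,
                       min_cm=0.001, max_cm=0.0035, tries=1000):
    """Choose one low-MAF variant (fixed to ALT) plus n_high high-MAF variants
    with bounded cM spacing, then enumerate every genotype conformation over
    the high-MAF variants as a seed query.

    variants: list of dicts with keys 'pos', 'cm', 'maf' (hypothetical format),
              assumed sorted by position.
    Returns (variant_set, queries) or None if no valid set was found.
    """
    low = [v for v in variants if v['maf'] < low_maf]
    if not low:
        return None
    for _ in range(tries):
        anchor = random.choice(low)              # low-MAF variant, fixed to ALT
        chosen = [anchor]
        for v in variants:
            if v['maf'] < high_maf or v is anchor:
                continue
            nearest = min(abs(v['cm'] - c['cm']) for c in chosen)
            # not too close to any chosen variant, not too far from all of them
            if min_cm <= nearest <= max_cm:
                chosen.append(v)
                if len(chosen) == n_high + 1:
                    break
        if len(chosen) < n_high + 1:
            continue                             # failed; try another anchor
        # 2^n_high conformations; the anchor genotype is always ALT (1)
        positions = [c['pos'] for c in chosen]
        queries = [dict(zip(positions, (1,) + conf))
                   for conf in product((0, 1), repeat=n_high)]
        return chosen, queries
    return None
```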

Step 2: Sieve

The second step of the strategy, once the query sequences have been run through minimac, is to utilize the imputation dosage output to predict, for each query, how many reference panel haplotypes it matched. As discussed above, when a query matches a small number of reference haplotypes, this fact is likely to be recognizable by specific patterns in the dosages (Additional file 1: Fig. S1; Additional file 1: Fig. S2). A random forest classifier, trained on the dosage output resulting from queries with known numbers of matches, is used to recognize these patterns and make predictions. We configured and trained the classifier to classify dosage data from the full length of the chromosome and assign a match count prediction of 1 through 5 or Other, with Other covering queries with more than five matches and also those with none.

When the classifier predicts no matches or a large number of matches (6 or more) for a query, the corresponding output is discarded. This is because the imputation outputs resulting from these queries represent a mixture of many samples, and extracting a single reference panel haplotype from these outputs is difficult. What we have left after this “sieving” is only the imputation output corresponding to queries predicted to have matched a small number of reference haplotypes (1 through 5). For queries the classifier predicted to have matched a single haplotype, we infer that a reference panel haplotype has been reconstructed, and the program saves the corresponding output genotype data as such. The rest of the remaining queries move on to the extension step.

We used the scikit-learn RandomForestClassifier module [26] to implement the classifier. The classifier was designed to operate on bucketed dosage data, so both for training and for use in the attack, output dosage data were first processed into counts of dosages falling into each of 21 non-overlapping dosage range buckets. These buckets consisted of the following: (1) buckets of size \(\alpha =0.1\) at each end (extending from 0 and from 1), (2) buckets of size \(\beta =0.02\) centered at the expected values of the dosage “stripes” anticipated in the patterns for 1 to 5 matches (see the “Step 3: Extend” section for elaboration), and (3) additional buckets that span the remaining gaps in the dosage domain of 0 to 1. We used 75 decision trees in the classifier. Training data included 100 sets of dosage data from 1KG for each of the six classes (1 through 5, Other). Although our classifier serves only as an example approach to separate out the most promising queries from the rest, we proceeded with it in the rest of our experiments, since it led to successful reconstruction results across various settings, as shown in our results. An evaluation of our classifier’s performance on test data is provided in Additional file 1: Fig. S18.
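For illustration, the sketch below shows a bucketing and classifier setup consistent with the description above. The exact bucket boundaries used by our implementation may differ slightly; the construction shown (end buckets of width \(\alpha\), narrow buckets of width \(\beta\) around the anticipated stripe values, and gap buckets in between) yields 21 buckets, and the training inputs are assumed to be dosage vectors labeled with their known match counts.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def bucket_edges(alpha=0.1, beta=0.02, max_matches=5):
    """Bucket boundaries: end buckets of width alpha (from 0 and from 1),
    width-beta buckets centred on the expected stripe values k/n for
    n = 2..max_matches, and gap buckets in between (21 buckets in total)."""
    edges = {0.0, alpha, 1.0 - alpha, 1.0}
    for n in range(2, max_matches + 1):
        for k in range(1, n):
            centre = k / n
            edges.update((centre - beta / 2, centre + beta / 2))
    return np.array(sorted(edges))

def featurize(dosages, edges):
    """Counts of output dosage values falling into each bucket."""
    counts, _ = np.histogram(dosages, bins=edges)
    return counts

edges = bucket_edges()
clf = RandomForestClassifier(n_estimators=75, random_state=0)
# Training: bucketed dosage vectors with known labels 1-5 or 'other', e.g.
#   X_train = np.array([featurize(d, edges) for d in training_dosage_vectors])
#   clf.fit(X_train, labels)
# Prediction for a new query's output dosages:
#   clf.predict([featurize(new_dosages, edges)])
```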

Step 3: Extend

For queries classified as matching a small number of reference haplotypes, an extension strategy is used to attempt to retrieve all the matching haplotypes. As shown in Additional file 1: Fig. S1, the number of haplotypes a query matches corresponds to the number of “stripes” evident in the pattern of dosage values in its output. For example, a query constructed strategically (as discussed in the “Step 1: Seed” section) that matches two reference panel haplotypes will display extreme dosages and a stripe of dosages close to 0.5. This happens because the two matches are weighted more heavily along those positions, and so the output dosages are close to 0 where both have the REF allele, close to 1 where they both have the ALT allele, and close to 0.5 where one has the REF allele and the other the ALT allele. The explanation works analogously for greater numbers of matches: the output of a query matching n haplotypes is characterized by concentrations of dosages (“stripes”) at multiples of 1/n, corresponding to the proportions of REF/ALT alleles among the matches at different positions. These are the stripes that the classifier in the previous step is effectively trained to recognize.

It follows from this understanding, for the two-matches example, that if we add to the original query a variant with an output dosage near 0.5 and assign it the REF allele in a new query, the new query will match only one of the two haplotypes matching the original query. If we assign it the ALT allele in another new query, we will only match the other haplotype. In this way, we may be able to reconstruct both matching haplotypes, and this process can be generalized to a wider range of match counts.

Specifically, our program, given a query matching n haplotypes, chooses a nearby variant whose output dosage falls within a specified distance of 1/n (in our experiments 0.025, but this distance could be optimized to be as small as possible while still reliably capturing nearby variants). We then create two new extended queries using both alleles of this variant; that is, the added variant is set to the ALT allele in one query and to the REF allele in the other. The former is expected to match only one reference haplotype, as the dosage of the variant indicated only one ALT allele among the matches. The latter new query is expected to match \(n - 1\) haplotypes and is further extended by choosing another variant in the same way as before (now using \(n - 1\) instead of n to determine the dosage range from which to draw the variant).
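A minimal sketch of one extension round is shown below. The data structures (dictionaries keyed by variant position) are illustrative, and the proximity requirement on the chosen variant is omitted for brevity.

```python
def extend_query(query, dosages, n_matches, tol=0.025):
    """One extension round for a query predicted to match n_matches > 1
    reference haplotypes.

    query   : dict {position: genotype (0/1)} of the current query
    dosages : dict {position: dosage} from this query's imputation output
    Returns (alt_query, ref_query), or None if no suitable variant is found.
    """
    target = 1.0 / n_matches
    candidates = [p for p in dosages
                  if p not in query and abs(dosages[p] - target) <= tol]
    if not candidates:
        return None
    pos = min(candidates, key=lambda p: abs(dosages[p] - target))
    # ALT-extended query: expected to match only the single haplotype carrying
    # the ALT allele among the current matches.
    alt_query = dict(query, **{pos: 1})
    # REF-extended query: expected to match the remaining n_matches - 1
    # haplotypes; it is extended again in the next round using 1/(n_matches-1).
    ref_query = dict(query, **{pos: 0})
    return alt_query, ref_query
```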

Alternative strategy for discrete genotype predictions

This alternative strategy assumes that imputation only outputs discrete genotypes and not continuous dosages. It hinges on the following insight: let us say we have a query that perfectly matches a single reference haplotype. If we impute on this query, form a new query by appending to the original query a nearby variant and genotype assignment from the imputation output, and impute with this new query, the output genotype predictions will tend to be the same as for the original query. This occurs because, as discussed earlier, the imputation output genotypes resulting from a query matching a single reference haplotype will tend to be those of the matching haplotype, at least at the query variants and others nearby. As a result, the variant and genotype appended to form the new query will usually also correspond to the matching haplotype, and the new query will only be a longer exact match to that haplotype. An important corollary to the above insight is that this will not necessarily happen for an initial query which does not uniquely match a reference haplotype, since the logic described will not apply.

Taken together, these observations point to a reconstruction strategy similar to the original that does not require continuous dosage output: First, query sequences are constructed as before and run through imputation. Then, each query sequence is extended with a variant and genotype assignment from its own imputation output, and these new query sequences are run through imputation. In our implementation, we use for extension the variant with the highest MAF below some cutoff (default 0.6) within the range stretching some base pair distance to either side of the low-MAF variant, by default 50,000 base pairs. For each extended query, the resulting output is compared to the original query’s output, and queries for which the outputs differ are filtered out. Each remaining query proceeds to the next round, in which the original version of the query is extended using a different nearby variant, and filtering is repeated in the same way. In our implementation, we choose in each round the in-range variant with the highest MAF below the cutoff that has not already been chosen. This process is repeated for k rounds, specified as input to the attack, or until no queries remain. Any queries remaining after the specified number of rounds are predicted to match a single reference panel haplotype, and their outputs can be saved as such. In the language used to describe the “standard” method for haplotype reconstruction presented first, this “discrete genotype” method can be described in general as seeding and then iteratively extending and sieving, with a sieving mechanism that does not rely on dosage values.
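The following sketch outlines this iterative extend-and-compare filtering. The helpers impute (one imputation run returning discrete genotypes) and pick_extension_variant (the MAF-based variant-choice rule above) are assumed rather than implemented, and the output comparison is simplified to full equality.

```python
def discrete_sieve(queries, impute, pick_extension_variant, k_rounds=15):
    """Filter seed queries down to those predicted to match a single
    reference haplotype, using only discrete genotype output.

    queries: list of dicts {position: genotype} (the seed queries)
    impute(query) -> dict {position: genotype} of imputed discrete genotypes
    pick_extension_variant(query, round_idx) -> position to append this round
    Returns the imputed outputs of the surviving queries.
    """
    baseline = {i: impute(q) for i, q in enumerate(queries)}
    surviving = set(baseline)
    for r in range(k_rounds):
        if not surviving:
            break
        for i in list(surviving):
            pos = pick_extension_variant(queries[i], r)
            # Extend the original query with a genotype taken from its own output
            extended = dict(queries[i], **{pos: baseline[i][pos]})
            # A query uniquely matching a reference haplotype should yield the
            # same output after this consistent extension; filter out the rest.
            if impute(extended) != baseline[i]:
                surviving.discard(i)
    return [baseline[i] for i in surviving]
```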

The chosen number of rounds of extension k affects the number of incorrect reconstructions that the attack makes. A greater number of rounds will catch and filter more of these errors, at the cost of increased running time, so the choice of k is a trade-off between priorities. We would not expect the same k to result in the same amount of error across the imputation tools, and in our experiments we found that different tools required different settings in order to achieve a similar quality of filtering (see the “Imputation tools” section below for specific settings). We generally observed that, once sufficiently high, the value of k did not have much effect on performance; in practice, an adversary could further optimize it based on the rate of reconstruction as it progresses.

Although we have not explored this idea in our work, imputing a subset of variants in a sample then checking whether the output is consistent with the input could be a useful approach for distinguishing correctly reconstructed haplotypes from incorrect ones even in the fractional dosage output setting.

Rounds of interaction and input size limits

It is worth commenting on the number of times that the imputation software must be run in the execution of haplotype reconstruction. This aspect is relevant for an accurate understanding of the risk posed to imputation servers, since the number of times imputation is run is, in practice, the number of times an adversary must upload data to the server. We submit the query conformations on a single variant set (of 8 variants) as a discrete unit for a run of imputation, and each following extension step requires an additional round of interaction. Thus, the overall number of rounds is at least as large as the number of rounds of extension plus one (in our experiments at most 5 for standard reconstruction, and 15 or 30, depending on the imputation tool, for the discrete-genotype version). Since each set of target variants is processed independently of other target sets, they can be explored by the attacker in parallel. This implies that the total number of rounds depends on the limit imposed by the server on the maximum size of the input; the larger the limit, the more target variant sets can be included in each input to reduce the total interaction rounds. As a reference point, the maximum input sample size of MIS is currently 50,000 [8]. Optimization in terms of query sequences required per reconstructed haplotype, via adjustment of parameters (discussed briefly below), could further reduce the number of rounds of interaction required, but this optimization was not a priority of our investigation. For minimac, we note that including different sets of target variants in a single input file can affect the behavior of the imputation, because the HMM computation for each query also covers the sites of the other target variant sets in the same file, which are missing for that query; however, the expected difference in probabilities is small, and thus we do not view this as a major limitation of the attack.

Haplotype reconstruction experiments

Datasets

For experiments carried out in 1000 Genomes Phase 3 (1KG) [17], we used phased data downloaded from the International Genome Sample Resource (IGSR). For full chromosome experiments, the full 1KG chromosome 20 VCF data were used as a reference panel. For chunk experiments, the VCF was first subset variant-wise to the bounds of the chunk (using the same 20-Mbp chunks used by MIS and TIS), and this subset file was used as the reference panel.

All of Us (AoU) [3] Controlled Tier Dataset v5 was used for experiments with AoU data. The Asian American and Black or African American panels of the AoU were populated based only on the value of the “RACE” column of the database, including participants with values “Asian” and “African,” respectively, in the following way: for the Asian American panel, we started with all 3064 Asian American AoU participants with whole genome sequencing data; used Hail [27] to filter to only chromosome 20, filter to only biallelic variants, and filter out variants with greater than 10% missingness; phased the data using Eagle2 [28]; and finally, randomly sampled the data down to the desired panel size of 1250 samples. For the Black or African American panel, we used an identical process except that the data were downsampled to 5000 participants prior to phasing, because 5000 is considered sufficiently large for accurate phasing [28].

For the variant MAF values used in reconstruction, we used population-specific gnomAD allele frequency data [25] combined by weighted average to fit the demographics of each of these reference panels. For 1KG, these (relative) weightings were 99/2015 for the Finnish allele frequency, 404/2015 for Non-Finnish European, 504/2015 for East Asian, 347/2015 for American Admixed/Latino, and 661/2015 for African/African American, where the numerators are the number of individuals in 1KG that fall into each group. 1KG also contains 489 South Asian samples, but the gnomAD v2 file used included this population in the “other” category due to the small number of South Asian samples, so we opted not to directly account for this population. For the Asian American AoU panel, the weightings were 1/2 for the East Asian allele frequency and 1/2 for South Asian (gnomAD v3.1 used here does have a separate South Asian allele frequency). For the African American AoU panel, we simply used the gnomAD v3.1 African/African American allele frequency.
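As a small illustration of the weighted-average computation, the sketch below combines per-population gnomAD allele frequencies using the 1KG weightings listed above (the population codes shown are gnomAD's standard labels for the populations named in the text; whether to fold to a minor allele frequency is an illustrative choice here).

```python
# Relative 1KG weightings described above, keyed by gnomAD population code:
# Finnish (fin), Non-Finnish European (nfe), East Asian (eas),
# American Admixed/Latino (amr), African/African American (afr).
WEIGHTS_1KG = {'fin': 99/2015, 'nfe': 404/2015, 'eas': 504/2015,
               'amr': 347/2015, 'afr': 661/2015}

def panel_maf(pop_af, weights=WEIGHTS_1KG):
    """pop_af: dict mapping population codes to the ALT allele frequency of a
    variant in that population; returns the weighted-average allele frequency,
    folded to a minor allele frequency."""
    af = sum(weights[p] * pop_af[p] for p in weights)
    return min(af, 1.0 - af)
```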

Imputation tools

Below are the details of our use of minimac4 [6] and the other three imputation tools that we tested: Beagle5.4 [19], IMPUTE5 [18], and PBWT [16].

minimac4

Version 1.0.3 (also known as v4.0.3) was used. It was run with the --minRatio parameter set to 0.000001, --chunkLengthMb set to 140 (longer than chromosome 20), and the --ignoreDuplicates flag. The package accepted our 8-variant queries in their haploid form, merely inserted into VCF format. For the discrete genotype reconstruction tests, the number of extension rounds was set to 15. Updates to minimac4 have been released since we first began our experiments (currently v4.1.2), but these updates do not change our conclusions. We verified by a small test that our code is easily adapted to reconstruct successfully using the newest version by simply changing command-line syntax for calling the package, reference panel file format conversion, and indexing of the query file. This software can be found at https://github.com/statgen/Minimac4.

Beagle5.4

Version 22Jul22.46e was used. The reference panel was first converted to the bref3 format before being passed to the package. The window parameter was set to 110. The package accepted our queries in their haploid form in gzipped VCF format. The number of extension rounds in the reconstruction test was set at 15. This software can be found at http://faculty.washington.edu/browning/beagle/beagle.html.

IMPUTE5

Version 1.1.5 was used. Following its usage instructions, the reference panel was first converted to the imp5 format before being passed to the package. It was run with the --pbwt-depth parameter set to 32, --pbwt-cm set to 0.001, and the --neigh-select flag. To comply with input requirements, queries were converted to diploid form when inserted into the VCF format by simply duplicating each allele. The reconstruction test was set to use 30 rounds of extension, as it seemed a greater number was needed for satisfactory performance with this tool. This software can be found at https://jmarchini.org/software/#impute-5.

PBWT

Version 3.0 was used. The reference panel was first converted to the pbwt format before being passed to the package. To conform to input requirements, queries were converted into diploid form when inserted into VCF format by simply duplicating each allele. The first and last variants of the chromosome were also added to query VCF files, set to the reference allele, so that matches extending the whole length of the chromosome could occur, resulting in imputed genotypes over the entire length. The extension strategy used in the reconstruction test was modified slightly for this package because its algorithm is the most distinct from the others (it does not use an HMM) and as a result required slightly more targeted treatment: in each round, queries were extended with multiple variants chosen from different gaps in the query, rather than with just one variant. The test was set to use 30 rounds of extension, as it seemed that a greater number was needed for satisfactory performance. This software can be found at https://github.com/richarddurbin/pbwt.

Attack settings

For all haplotype reconstruction experiments, we used query variant sets of 8 variants (1 low-MAF variant, 7 high-MAF), low-MAF cutoff of 0.005 (variant considered low-MAF if it had MAF below this value), high-MAF cutoff of 0.2 (MAF above this value required), minimum genetic distance variant spacing of 0.001 cM, and maximum spacing of 0.0035 cM. While this parameter setting was effective for all our experiments, we did not prioritize optimization, and hence further exploration may lead to better parameter choices with a higher reconstruction performance.

Discarding all rare variants in the reference panel (e.g., those below an MAF cutoff of 0.01 or 0.005), which reduces the effectiveness of the low-MAF variant in each query, slows reconstruction but only to a limited extent, as shown in Additional file 1: Fig. S17. For example, we see in this figure that around 20% of the reference panel is reconstructed for both the 0.005 and 0.01 MAF thresholds, in contrast to 30% without filtering, using the same number of queries. For all experiments using the standard fractional-dosage version of reconstruction, we only considered queries constructed in the middle 80% of the chromosome (in base position), as we found that this was a simple measure that would reduce error due to boundary effects (see Additional file 1: Supplementary Note 1 for more details).

Accommodating MIS/TIS query requirements in practice

Although we ran our primary experiments directly on imputation packages, rather than against real imputation servers (for ethical reasons, as we discuss in the main text), in the early stages of testing, we checked whether our queries would be accepted by MIS, imputing against the 1KG reference panel. We found that two slight adjustments, compared to the code used in our experiments, were required for our queries to pass MIS quality control (QC). First, only single-nucleotide polymorphisms (SNPs) were accepted by MIS, with any indels filtered out. This is easily addressed by limiting the variants that the attack uses to form queries to SNPs only. Second, QC filtered out monomorphic sites (where all samples have the same allele). This could be addressed, ensuring no query variants are filtered out for this reason, by including in any attack input file a dummy query that is the inverse (allele flipped at every site) of one of the real queries. In summary, we found that none of the input requirements used in practice by MIS precluded reconstruction in the way we have described, and we were able to successfully impute on the server from a sample of our constructed queries as of July 28, 2023. TIS is based on the same software as MIS and has the same QC requirements, so the same statement applies, though we did not run any queries on TIS to avoid reconstructing from any reference panel to which we did not already have access.

Probabilistic inference algorithm for haplotype linking

The output of our reconstruction attack, performed on an imputation server such as MIS or TIS that utilizes chunking, is a collection of decoupled haplotype chunks of unknown origin. However, since the recent genetic lineage of each individual is likely to be reflected across chunks, having access to the genetic sequence of a relative of an individual can enable linkage of reconstructed haplotypes to reveal a portion of the individual’s genome. Here, we describe the probabilistic algorithm we developed to demonstrate accurate linkage of haplotypes from the same individual.

Capturing haplotype-diplotype relatedness via semi-kinship (SK) coefficients

To link reconstructed haplotypes, we need a way to quantify the genetic relatedness of each compared to diploid samples in an external set of potential relative genomes (the “relative set,” versus the “target set” of reconstructed haplotypes). A common measure of the genetic relatedness of two samples is the kinship coefficient, \(\phi\) [29]. The kinship coefficient of two samples i and j is generally defined as the probability that a random allele chosen from i and a random allele chosen from j at the same locus are identical by descent (IBD). Formally, it can be defined as \(\phi = k_1/4 + k_2/2\), where \(k_1\) is the probability that the samples share one IBD allele at a randomly chosen locus and \(k_2\) is the probability that the samples share two IBD alleles. However, this definition assumes that the two samples being compared are diploid. To quantify the relatedness of a haploid sample to a diploid one, we formulated the semi-kinship (SK) coefficient, \(\phi_{SK}\), which we define as \(\phi_{SK} = k_1/2\). The SK coefficient extends the notion of kinship coefficient to the haploid setting; it is the probability that an allele from a haploid sample and a randomly chosen allele from a diploid sample at the same locus are IBD.

Obtaining an SK value for each (chunk-length) pair of target set haplotype and relative set diplotype required computing \(k_1\) for the pair, and to this end, we ran both sets together through GERMLINE2 [30], which computes IBD segments for every pair of haplotypes. The \(k_1\) value was calculated by dividing the total genetic length (in centimorgans) shared IBD with either haplotype of the relative set sample by the total centimorgan length of the chunk.
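A minimal sketch of this computation is shown below; it assumes the detected IBD segments for a target haplotype against either haplotype of a relative-set sample are available as (start, end) intervals in cM, and merges overlapping intervals so shared regions are not double-counted.

```python
def semi_kinship(ibd_segments, chunk_cm_length):
    """Semi-kinship coefficient between one (haploid) target haplotype and one
    diploid relative-set sample: phi_SK = k1 / 2, where k1 is the fraction of
    the chunk (in cM) shared IBD with either haplotype of the relative.

    ibd_segments    : iterable of (start_cm, end_cm) IBD intervals
    chunk_cm_length : total genetic length of the chunk in cM
    """
    merged = []
    for start, end in sorted(ibd_segments):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)   # merge overlapping segments
        else:
            merged.append([start, end])
    shared_cm = sum(end - start for start, end in merged)
    k1 = shared_cm / chunk_cm_length
    return k1 / 2
```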

Formal definition of the haplotype linking problem

In formally presenting our linking algorithm here and in the following, we consider the problem with respect to linking chromosome-length haplotypes, for simplicity. It is straightforward to modify the algorithm to link chunks rather than full chromosomes, as we did for our main linking experiment.

Let N be the number of haplotypes reconstructed in the target set, T be the number of individuals in the target set, and M be the number of diplotypes in the relative set. We use \(S\in [0,1]^{N\times M}\) to denote the observed SK values between every haplotype-diplotype pair between the target set and the relative set. The SK values are based on true (unobserved) genetic relationships between each pair of individuals \(j \in [T]\) and \(\ell \in [M]\), which we represent using the variable \(r_{j\ell} \in \mathcal{R} := \{\emptyset , 1, 2, 3+\}\). Note that \([n] := \{1,\dots ,n\}\) for an integer n. \(\emptyset\) represents unrelated individuals, while the other values of \(r_{j\ell}\) represent the degree of relatedness, with \(3+\) corresponding to the 3rd degree or higher relatives (up to the 5th degree in our experiments, to distinguish from unrelated). Let \(R\in \mathcal{R}^{T\times M}\) be the matrix whose elements correspond to the \(r_{j\ell}\)'s.

The goal of the haplotype linking problem is to find a mapping \(g:[N] \mapsto [T]\cup \{\emptyset \}\), such that for each \(j\in [T]\), the set \(g^{-1}(j) := \{i \in [N] : g(i) = j\}\) corresponds to a set of target haplotypes that belong to the same individual j. We consider \(g^{-1}(\emptyset )\) to be the haplotypes that could not be linked with sufficient confidence. Also, we require that \(g(\cdot )\) not assign more than two haplotypes of the same chromosome to the same target individual.

Probabilistic generative model

To infer the mapping \(g(\cdot )\) based on the observed SK values S, where g in turn reveals which haplotypes are likely to originate from the same individual, we first formulate a probabilistic model that induces a distribution over S given a mapping g, then obtain a maximum likelihood estimate of g conditioned on S. We define the probabilistic model below, illustrated in Additional file 1: Fig. S9A.

We first let \(\theta_{ij} \in \{0,1\}\) for \(i\in [N]\) and \(j\in [T+1]\) be the parameters defining \(g(\cdot )\). We will use \(g_\Theta (\cdot )\) to clarify its dependence on \(\Theta := \{\theta_{ij}\}\), where \(g_\Theta (i)=j\) if and only if \(\theta_{ij}=1\) for \(j\in [T]\), and \(g_\Theta (i)=\emptyset\) if and only if \(\theta_{i,T+1}=1\). The probabilistic model is then fully specified by the variables \(\Theta\), R, and S, where only S is observed. The generative process follows a two-step procedure: we first sample the latent variables \(r_{j\ell}\) for each \(j\in [T]\) and \(\ell \in [M]\) from a prior distribution \(p(r_{j\ell})\), which we allow to be controlled by an experimental variable. We then sample each \(s_{i\ell}\) from \(p(s_{i\ell} \mid r_{g_\Theta (i),\ell})\) when \(g_\Theta (i)\ne \emptyset\), which represents the conditional distribution over the SK value for a given pair of individuals \((g_\Theta (i),\ell )\) of a known relatedness degree \(r_{g_\Theta (i),\ell}\). When \(g_\Theta (i) = \emptyset\), we instead sample \(s_{i\ell}\) from \(p_\emptyset (s_{i\ell})\), a background distribution over the SK value given an unknown pair of individuals. These conditional distributions can be readily estimated in advance based on an auxiliary dataset that includes related individuals or theoretically derived based on a population genetic model (e.g., [31]). Thus, we consider them fixed during the inference described in the following.

Expectation-maximization (EM) formulation for maximum likelihood estimation

Our probabilistic model induces a distribution over S, conditioned on \(\Theta\) and R. For simplicity, we view \(\Theta\) as model parameters without a prior, and R as latent variables with an input prior. We adopt the Expectation-Maximization (EM) formulation for optimizing \(\Theta\) with respect to the expected log-likelihood of the model in the presence of unobserved variables in R, defined by the optimization problem

$$\mathop{\mathrm{maximize}}_{\Theta} \ \ L(\Theta ; S) := \mathbb{E}_{R}\left[\log p(S,R;\Theta )\right],$$

where \(\Theta\) is subject to the constraints of a valid mapping, i.e., unique assignment of each haplotype to an individual and inclusion of at most two haplotypes per individual for each chromosome. A local optimum for this problem can be obtained by alternating between two optimization steps: (1) an E-step, where the posterior distribution over R in iteration t, denoted \(p(R|S;\Theta^{(t)})\), is computed based on the current parameter estimates \(\Theta^{(t)}\); and (2) an M-step, where updated parameters \(\Theta^{(t+1)}\) are obtained by maximizing the expected log-likelihood

$$\mathbb{E}_{R \sim p(R|S;\Theta^{(t)})}\left[\log p(S,R;\Theta^{(t+1)})\right],$$

where the expectation is taken with respect to the posterior estimated in the previous E-step. It can be shown that this procedure returns a non-decreasing objective after every update and converges to a local optimum [32].

Belief updates for relatedness profiles (E-step)

We first describe the computation of \(p(R|S;\Theta )\) given \(\Theta\). Recall that \(\Theta\) defines a mapping \(g_\Theta\) from reconstructed haplotypes to the target individuals. The posterior belief over each \(r_{j\ell}\) can be computed separately as follows:

$$p(r_{j\ell} \mid S,\Theta ) \propto p(r_{j\ell}) \prod_{i \in g_\Theta^{-1}(j)} p(s_{i\ell} \mid r_{j\ell}).$$

In other words, the distribution over \(r_{j\ell}\) is proportional to its prior probability multiplied by the product of the probabilities of generating the observed SK values between individual \(\ell\) in the relative set and each haplotype assigned to individual j by the current linkage \(\Theta\), across all chromosomes.
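A sketch of this E-step update is shown below. The SK matrix indexing, the relatedness-degree labels, and the inputs prior and cond_logpdf (the estimated prior and conditional distributions over SK values) are illustrative placeholders.

```python
import numpy as np

def e_step(S, assignment, T, prior, cond_logpdf, degrees=('unrel', 1, 2, '3+')):
    """Posterior over the relatedness degree r_{j,l} for every target
    individual j and relative-set sample l.

    S           : (N, M) array of SK values (target haplotypes x relatives)
    assignment  : length-N array; assignment[i] = j in [0, T) or -1 for "no match"
    prior       : dict degree -> prior probability p(r)
    cond_logpdf : function (sk_value, degree) -> log p(s | r)
    Returns posterior[j][l] as a dict degree -> probability.
    """
    N, M = S.shape
    posterior = [[None] * M for _ in range(T)]
    for j in range(T):
        haps = [i for i in range(N) if assignment[i] == j]   # g_Theta^{-1}(j)
        for l in range(M):
            logw = {r: np.log(prior[r]) +
                       sum(cond_logpdf(S[i, l], r) for i in haps)
                    for r in degrees}
            mx = max(logw.values())
            w = {r: np.exp(v - mx) for r, v in logw.items()}  # stabilize
            z = sum(w.values())
            posterior[j][l] = {r: w[r] / z for r in degrees}
    return posterior
```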

Minimum-cost network flow algorithm for linkage optimization (M-step)

The M-step represents a combinatorial problem of matching haplotypes to individuals so as to maximize the likelihood of the given probabilistic model. We introduce a network flow algorithm to efficiently perform this optimization. First, note that the expected log-likelihood objective can be expanded as

$$\sum_{i\in [N]} \left[ \left( \sum_{j\in [T]} \theta_{ij}\cdot \sum_{\ell \in [M]} \mathbb{E}_{r_{j\ell}\sim p(r_{j\ell}|S ;\Theta )}\left[ \log p(s_{i\ell} \mid r_{j\ell}) \right] \right) + \theta_{i,T+1} \cdot \sum_{\ell \in [M]} \log p_{\emptyset} (s_{i\ell}) \right],$$

which is a linear combination of the model parameters \(\Theta\). To further simplify, given the output of the E-step, i.e., \(p(r_{j\ell}|S ;\Theta )\), consider precomputing the following quantity

$$\pi_{ij} := \begin{cases} \sum_{\ell \in [M]} \mathbb{E}_{r_{j\ell}\sim p(r_{j\ell}|S ;\Theta )}\left[ \log p(s_{i\ell} \mid r_{j\ell}) \right] & \text{if } j\in [T], \\ \sum_{\ell \in [M]} \log p_{\emptyset} (s_{i\ell}) & \text{if } j=T+1, \end{cases}$$

for each \(i\in [N]\) and \(j\in [T+1]\). Then, the optimization problem for the M-step can be alternatively expressed as the integer linear program

$$\begin{aligned} \mathop{\mathrm{maximize}}_{\Theta} \ \ \ \ & \sum_{i\in [N]} \sum_{j\in [T+1]} \theta_{ij}\, \pi_{ij}, \\ \text{subject to} \ \ \ \ & \theta_{ij}\in \{0,1\}, \ \ \ \forall i\in [N],\, j\in [T+1], \\ & \sum_{j\in [T+1]} \theta_{ij} = 1, \ \ \ \forall i\in [N], \\ & \sum_{i\in [N]} \mathbb{1}\{\textsf{chr}(i) = c \} \cdot \theta_{ij} \le 2, \ \ \ \forall j\in [T],\, c\in [22], \end{aligned}$$

where \(\textsf{chr}(i)\) denotes the chromosome of reconstructed haplotype i (considering only the 22 autosomes for simplicity).

We solve this problem using a minimum-cost network flow algorithm as follows and as illustrated in Additional file 1: Fig. S9B. We introduce N source nodes (one for each reconstructed haplotype), each with a supply of 1, and 22 sets of \(T+1\) sink nodes (one set per chromosome), each with a maximum capacity of 2, except for the last sink node of each set, which represents no match (hence not subjected to a capacity constraint). Each source node i is connected to all \(T+1\) sink nodes in the set corresponding to the chromosome of haplotype i (i.e., \(\textsf{chr}(i)\)). The cost of each edge between the source i and the sink j (in the corresponding chromosome set) is set to \(-\pi_{ij}\). It is straightforward to see that the minimum-cost flow on this graph corresponds to a solution to the above problem with the integer domain constraint (\(\theta_{ij}\in \{0,1\}\)) removed as a relaxation. The amount of flow on each edge in the solution between the source i and the sink j (in the corresponding chromosome set) is taken as \(\theta_{ij}\). Once all EM iterations have been completed, we finally round the solution to obtain an approximate integer solution; we note that in our experiments the optimal solution always coincided with an integer solution.
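The sketch below illustrates this flow construction with NetworkX (the package used in our implementation, as noted later). The graph layout follows the description above, with an added collector node to absorb all flow since min-cost flow solvers require balanced supplies and demands; the cost scaling and rounding is needed because NetworkX expects integral edge weights, and the input formats are illustrative.

```python
import networkx as nx

def m_step(pi, chrom, T, scale=10**6):
    """Solve the M-step assignment as a minimum-cost network flow.

    pi    : dict (i, j) -> pi_ij for j in 0..T (j == T is the "no match" column)
    chrom : dict i -> chromosome of reconstructed haplotype i
    Returns assignment[i] = j (with j == T meaning "no match").
    """
    haplotypes = sorted(chrom)
    G = nx.DiGraph()
    G.add_node('collector', demand=len(haplotypes))       # absorbs all flow
    for i in haplotypes:
        G.add_node(('hap', i), demand=-1)                  # supply of 1 per haplotype
        c = chrom[i]
        for j in range(T + 1):
            # Edge cost is -pi_ij, scaled and rounded to an integer as required
            # by NetworkX's min_cost_flow.
            G.add_edge(('hap', i), ('ind', j, c), capacity=1,
                       weight=-int(round(pi[(i, j)] * scale)))
    sinks = [n for n in G.nodes if isinstance(n, tuple) and n[0] == 'ind']
    for (_, j, c) in sinks:
        if j == T:                                         # "no match": uncapacitated
            G.add_edge(('ind', j, c), 'collector', weight=0)
        else:                                              # at most 2 haplotypes per
            G.add_edge(('ind', j, c), 'collector',          # individual and chromosome
                       capacity=2, weight=0)
    flow = nx.min_cost_flow(G)
    return {i: next(j for (_, j, c), f in flow[('hap', i)].items() if f == 1)
            for i in haplotypes}
```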

Construction of an initial solution

Here, we describe the preliminary steps used to generate an initial linking prediction to be input into the algorithm.

Step 1: Chaining adjacent chunks by extending the sequence via imputation

We first attempt to form short chains of adjacent chromosome chunks from the same individual by imputing the target set chunks against the relative set and trying to match the imputed dosages with the genotypes of reconstructed haplotypes in adjacent chunks. More specifically, we do the following: (1) for each chunk, we try to “extend” it by imputing (using minimac4) against the relative set over a region that extends one chunk in each direction (where possible; for a chunk at the start or end of a chromosome, the region only extends in one direction). (2) For each boundary between chunks, we consider a window extending 500 kb in each direction and, for each pair of target set chunks across the boundary, compute a score which attempts to capture the likelihood that they come from the same individual. The score is the number of window variants at which the difference between one chunk’s actual genotype and the adjacent chunk’s imputed dosage exceeds a threshold (we use 0.5); a sketch of this score is given below. (3) At each boundary, pairs of chunks are linked if they are each other’s sole closest match (by minimum score) and their score is sufficiently low (we use \(\le\) 5). (4) Chains are formed by simply joining linked pairs that have a chunk in common.
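A minimal sketch of the score from step (2), assuming the window genotypes and the boundary-crossing imputed dosages are provided as dictionaries keyed by variant position:

```python
def boundary_score(chunk_genotypes, neighbor_dosages, threshold=0.5):
    """Number of window variants at which a chunk's reconstructed genotype
    disagrees (by more than `threshold`) with the dosage imputed from the
    adjacent chunk across the boundary; a lower score suggests the two
    chunks come from the same individual.
    """
    return sum(1 for pos, g in chunk_genotypes.items()
               if pos in neighbor_dosages
               and abs(neighbor_dosages[pos] - g) > threshold)
```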

Step 2: Grouping chunks with identical relatedness patterns

We separately form groups of target set chunks in the following way: we first threshold the SK values at a fixed constant corresponding to the expected lower bound for a 3rd-degree relationship [33] and group haplotypes that have identical relatedness patterns with respect to the individuals in the relative set, discarding any groups smaller than 5 haplotypes. Then, we threshold the SK values of the ungrouped haplotypes again at a lower value corresponding to the expected lower bound for a 5th-degree relationship, adding to existing groups or creating new ones based on the relatedness pattern, and again discard groups smaller than 5 haplotypes. Groups are capped to include at most two haplotypes per chunk, as the intent is to predict the haplotypes of a diploid individual. This capping is executed for each group by finding the relative with the strongest relationship by SK value to any haplotype in the group, then selecting up to two haplotypes per chunk with the highest SK value with that particular relative.
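The sketch below shows one thresholding-and-grouping pass (the second, lower-threshold pass and the per-group capping step are omitted; data formats are illustrative).

```python
def group_by_relatedness_pattern(sk_profiles, threshold, min_group=5):
    """Group target-set chunk haplotypes whose thresholded SK profiles against
    the relative set are identical.

    sk_profiles: dict hap_id -> tuple of SK values, one per relative-set sample
    Returns dict pattern -> list of hap_ids, keeping groups of >= min_group.
    """
    groups = {}
    for hap, values in sk_profiles.items():
        pattern = tuple(v >= threshold for v in values)
        if any(pattern):                    # related to at least one relative
            groups.setdefault(pattern, []).append(hap)
    return {p: haps for p, haps in groups.items() if len(haps) >= min_group}
```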

Step 3: Combining the solutions from steps 1 and 2

We finally combine the information from the two steps above to get an initial prediction to be input into the linking algorithm, in the following way: the chunks in a chain formed in step 1 above are assigned to a group (representing a single individual of origin) if more than half of the chunks in the chain are in a corresponding group produced by step 2. We are, in effect, connecting chains formed in the imputation step using SK information.

Implementation details

We implement the EM algorithm described above, using the NetworkX Python package [34] for the network flow implementation of the M-step. We initialize \(\Theta\) before the initial E-step using the preliminary steps described above in the “Construction of an initial solution” section.

We set T to the number of individuals represented in the target set; a precise value of T is not required in practice, as long as it is sufficiently large, because our algorithm includes a no-match state which allows some of the target individuals to have an empty set in the output if the quality of the match is not high.

We estimate the conditional distribution over the SK values for each relatedness degree, including the background distribution, based on a global set of related pairs of individuals from our dataset. While in a real attack scenario these distributions would be estimated from an independent source of information, we adopted this approach due to a lack of another large dataset with a sufficient number of relatives; note that these distributions are not expected to be dataset-specific and are instead based on the fundamental properties of the recombination process. To further establish that the algorithm could successfully leverage SK distributions based on an external dataset, we conducted an additional experiment in which we split our dataset sample-wise into two equal-sized, non-overlapping groups, estimating distributions from one group and running the haplotype linking algorithm on the other. We found that the algorithm performed with comparable success (Additional file 1: Fig. S19).

Haplotype linking experimentsData

We tested our haplotype linking program using genetic data from 2000 AoU participants, including 100 pairs of related individuals (20 pairs each of 1st through 5th degree relatives) and 900 randomly sampled pairs of unrelated individuals. For the primary linking experiment with chunks, all 44 haploid autosomal chromosomes from one individual in
