We derive a characterization of constrained k-Dollo phylogeny matrices by building on previous work on characterization of k-Dollo phylogeny matrices [25, 26]. Recall that in the k-Dollo model, a 0 entry in the mutation matrix A indicates that either the mutation did not occur in the cell or that the mutation occurred but then was subsequently lost in the cell. If we could distinguish these two cases, then we could replace the 0 entries resulting from losses by additional character states \(\\) representing the k possible losses of a mutation in the k-Dollo phylogeny. This idea forms the basis of the following (extended) definition of k-completion of a mutation matrix A.
Definition 2(El-Kebir 2018 [25]) A matrix \(B\in \^\) is a k -completion of a mutation matrix \(A\in \^\) provided: (1) \(b_ = 1\) if and only if \(a_ = 1\); (2) \(b_\in \\setminus \\) if and only if \(a_ = 0\); (3) \(b_ \ge 1\) if j is an SNP.
The following definition defines a subset of all possible k-completion matrices of a mutation matrix A.
Definition 3(El-Kebir 2018 [25]) A matrix \(B\in \^\) is a k -Dollo completion of mutation matrix A provided it is a k-completion of mutation matrix A such that there exists no two columns and three rows in B of the following forms:
$$\begin \left( \begin i_1 & 0\\ 0 & j_1 \\ i^_1 & j^_1 \end\right) \text &\left( \begin i_1 & j_1^\\ 0 & j_2 \\ i^_1 & j_2 \end\right) \text \left( \begin i_2 & 0\\ i^_1 & j_1 \\ i_2 & j^_1 \end\right) \text \\&\left( \begin i_2 & j_1^\\ i^_1 & j_2 \\ i_2 & j_2 \end\right) , \end$$
where \(i_1,i'_1,j_1,j'_1\in I^, i_2,j_2\in I^, i^_1\in I^\setminus \\) and \(j^_1\in I^\setminus \\), and \(I^ = \\).
According to this definition, the number of \(3\times 2\) submatrices that are forbidden to exist in k-Dollo completion matrices is \((k+1)^4 + 2k^2(k+1)^2 + k^4\). Ciccolella et al.[26] provided an alternate characterization of k-Dollo completion matrices, which we describe in Additional file 1: Section B.
k-Dollo completion matrices are useful in characterization of k-Dollo phylogeny matrices due to the following theorem.
Theorem 1(El-Kebir 2018 [25]) \(A\in \^\) is a k-Dollo phylogeny matrix if and only if there exists a k-Dollo completion \(B\in \^\) of A.
Constrained k-Dollo phylogenies are a subset of k-Dollo phylogenies that satisfy some additional constraints. In particular, a constrained k-Dollo completion must be consistent with copy-number clustering \(\sigma\), according to the following definition.
Definition 4(Consistency) A k-Dollo completion \(B\in \^\) of a mutation matrix A is consistent with a copy-number clustering \(\sigma\) with p clusters provided the following conditions are true for every mutation j.
1There is at most one cluster \(\ell\) such that for two distinct cells \(i, i' \in \sigma ^(\ell )\), \(b_ = 0\) and \(b_ = 1\).
2If there exists cell i such that \(\sigma (i) = \ell\) and \(b_ = s\) for \(s\in \\), then \(b_ = s\) for all \(i'\in \sigma ^(\ell )\).
Using this definition, we have the following characterization of constrained k-Dollo phylogeny matrices.
Theorem 2A mutation matrix A is a constrained k-Dollo phylogeny matrix for copy-number clustering \(\sigma\) if and only if there exists a k-Dollo completion \(B\in \^\) of A that is consistent with \(\sigma\).
We provide a proof of Theorem 2 in Additional file 1: Section A and show that given a k-Dollo completion B of mutation matrix A that is consistent with \(\sigma\), we can find a constrained k-Dollo phylogeny for A and \(\sigma\) in O(nmk) time. In addition, we also show the following result on the complexity of the CkDP-RC problem (Problem 1).
Theorem 3The CkDP-RC problem is NP-hard, even for \(k = 0\).
A proof of Theorem 3 is provided in Additional file 1: Section A.
ConDoR algorithm for constrained k-Dollo modelWe formulate a mixed integer linear program (MILP) to solve Problem 1 exactly. Specifically, for given read count matrices Q and R, copy-number clustering \(\sigma\) and integer k, the MILP finds a k-Dollo completion B that is consistent with \(\sigma\) and that maximizes the likelihood \(\Pr (Q\mid R, A)\), where A is the mutation matrix corresponding to B.
The MILP is based on encoding the combinatorial characterization of constrained k-Dollo completion matrices described in the previous section. We introduce a binary variables \(a_\) for cell i and mutation j to represent the mutation matrix A. Further, we introduce binary variables \(c_\) for cluster \(\ell\), mutation j and state \(s\in \\) to represent the presence of loss state s of mutation j in cluster \(\ell\). These binary variables are used to model the entries of the k-completion matrix B as follows: \(b_ = 1\) if \(a_ = 1\); \(b_ = s\) if \(c_ = 1\) and \(\sigma (i) = \ell\) for \(s\in \\); and \(b_ = 0\) otherwise.
Since \(b_\) can only attain one value, we enforce the following constraints for all cells \(i\in \\) and all mutations \(j\in \\).
$$\begin a_ + \sum \limits _^ c_ \le 1. \end$$
We also define variables \(x_\) for cell i and mutation j which indicates if \(b_ \ge 1\). As such, for all cells \(i\in \\) and all mutations \(j\in \\) we enforce
$$\begin x_ = a_ + \sum \limits _^ c_. \end$$
Once we have modeled the k-completion matrix B, we need to enforce constraints for (i) consistency with the copy-number clustering \(\sigma\), (ii) handling germline mutations and (iii) B to be a k-Dollo completion matrix. We describe the constraints for (i), (ii) and the objective function of the MILP in the following and refer to Additional file 1: Section B for (iii).
Handling germline mutationsHere, we describe the constraints to handle germline mutations. Note that, if mutation j is germline, it must either be present in cell i, i.e., \(a_ = 1\), or it must have been lost, i.e., \(c_ = 1\) for some \(s\in \\). As such, if \(G\subseteq \\) is the set of germline mutations, we enforce the following constraints for all cells \(i\in \\) and germline mutations \(j\in G\),
$$\begin a_ + \sum \limits _^ c_ = 1. \end$$
Consistency constraintsWe now describe the constraints to enforce consistency between the k-completion matrix B and the copy-number clustering \(\sigma\). Note that Condition 2 of Definition 4 is satisfied by the way B is modeled, and we only need to introduce constraints to satisfy Condition 1 of Definition 4. To that end, we introduce two set of continuous auxiliary variables. First, we introduce \(g^_\in [0,1]\) and enforce constraints so that \(g^_ = 1\) if there exists at least one cell \(i\in \sigma ^(\ell )\) such that \(b_ = 0\) for cluster \(\ell\) and mutation j, and \(g^_ = 0\) otherwise. Similarly, we introduce \(g^_\in [0,1]\) and enforce constraints so that \(g^_ = 1\) if there exists at least one cell \(i\in \sigma ^(\ell )\) such that \(b_ = 1\) for cluster \(\ell\) and mutation j, and \(g^_ = 0\) otherwise. We model these variables using the following constraints for all mutations \(j\in \\) and clusters \(\ell \in \\),
$$\begin \begin g^_ \ge 1 - x_,\quad &\text \ i\in \sigma ^(\ell ),\\ g^_ \le |\sigma ^(\ell )| - \sum \limits _(\ell )} x_,\quad &\\ g^_ \ge a_,\quad &\text \ i\in \sigma ^(\ell ),\\ g^_ \le \sum \limits _(\ell )} a_.\quad & \end \end$$
Next, we introduce continuous variables \(g_\in [0,1]\) such that \(g_ = 1\) if and only if mutation j is gained in cluster \(\ell\) and \(g_ = 0\) otherwise, for cluster \(\ell\) and mutation j. Specifically, \(g_ = 1\) if there exists two distinct cells \(i, i'\in \sigma ^(\ell )\) such that \(b_ = 0\) and \(b_ = 1\). We use \(g^_\) and \(g^_\) to model \(g_\) for all mutations \(j\in \\) and clusters \(\ell \in \\) with the constraints,
$$\begin g_\le & g^_,\\ g_\le & g^_,\\ g_\ge & g^_ + g^_ - 1. \end$$
Finally to enforce that each mutation can be gained in at most one cluster, we have the following constraint for all mutations \(j\in \\),
$$\begin \sum \limits _^p g_ \le 1. \end$$
Objective functionRecall that we want to maximize the likelihood function \(P(Q\mid R, A)\) (Eq. 1), where A is the mutation matrix and B is its k-Dollo completion consistent with copy-number clustering \(\sigma\). After taking \(\log\) on both sides in Eq. 1, we can linearize the log-likelihood to get the following objective function.
$$\begin \max \sum \limits _^n\sum \limits _^m\ &\Bigl (a_\log \Pr (q_\mid r_, a_ = 1) +\\&(1 - a_)\log \Pr (q_\mid r_, a_ = 0)\Bigr ). \end$$
This MILP has \(O(nm + pmk)\) binary variables, \(O(m^2k^2 + pm)\) continuous variables, and \(O(nm^2k^2)\) constraints.
Simulation detailsIn this section, we provide details about the simulations and the input files generated for each method.
Simulation of the phylogenyWe used a growing random network [44] to generate a tree T with \(m + p\) edges. Specifically, starting from the root vertex, T is built by iteratively adding child nodes while choosing the parent uniformly at random from the nodes in the tree in that iteration. The root node r(T) represents the normal cell and is assigned to cluster \(\ell = 0\). The edges are then labeled by either the gain of a mutation \(j\in \\) or change to cluster \(\ell \in \\). For each edge \((v,w)\in E(T)\) labeled with a change in cluster, we allow loss of the mutations gained along the path from the root r(T) to node v with probability \(\lambda = 0.8\). We generate a copy-number profile for each node in the tree, as described below.
Simulation of copy-number statesThe SCARLET [32] algorithm requires the copy-number profile of each cell, as well as the copy-number tree as input. We simulate the copy-number tree as follows.
Each node of the tree is labeled by a copy-number between 0 and 8 for each mutation j. We first initialize the root of the tree with a copy-number profile in which the copy-number for each position is picked uniformly at random between 0 and 8. We then label the remaining nodes as we traverse the tree in a breadth-first order. If the edge \((\pi (w), w)\) does not contain loss of mutation j, the copy-number for node w is the same as the copy-number of \(\pi (w)\). On the other hand, if the edge \((\pi (w), w)\) induces the loss of mutation j, the copy-number at j for node \(\pi (w)\) is picked uniformly at random between 1 and 8, while the copy-number of node w is picked uniformly at random between 0 and \(\pi (w)-1\). This ensures that, (i) the copy-number profile only changes if there is a loss event on the edge and (ii) each loss of mutation is supported by decrement of copy-number. Let \(C\in \^\) be the copy-number matrix such that \(c_\) is the copy-number at locus j in cell i. This copy-number matrix is used during simulation of the variant read counts which we describe in the next subsection.
Read count modelThe total read count \(r_\) for each cell i and mutation j is modeled as Poisson variable with mean coverage \(\text = 50\), i.e.
$$\begin r_\sim \text (\text ). \end$$
We use beta-binomial model, similar to previous works [32, 41, 42] for the variant read count \(q_\) for each cell i and mutation j. Our model accounts for sequencing errors and allelic imbalance during sequencing as follows. For sequencing error, we set error rate \(\epsilon = 0.001\) which is similar to the error rates of most recent Illumina sequencing platforms [54]. Specifically, we assume that the false positive rate \(\alpha _\) and the false negative rate \(\alpha _\) of observing a read with the variant allele is \(\epsilon\). When the mutation j is not present in cell i, i.e., \(a_ = 0\), the number of copies of the variant allele is 0. When the mutation j is present in cell i, i.e., \(a_ = 1\), we assume that the number of copies of the variant allele is 1. As such, the value of \(a_\) indicates the number of variant allele. Given that the total copies of the locus for mutation j in cell i is \(c_\), the true variant allele frequency, which we denote by \(y_\), is given by \(y_ = a_ / c_\). Due to sequencing errors \(\alpha _\) and \(\epsilon _\), the probability \(p_\) of producing a read containing the variant allele for mutation j in cell i is
$$\begin p_ = \left( 1 - \alpha _\right) \frac}} + \left( 1 - \frac}}\right) \alpha _. \end$$
The number of variant reads \(q_\) is given by
$$\begin \pi _\sim & \text (p_, s),\\ q_\sim & \text (r_, \pi _), \end$$
where, we set the dispersion parameter \(s = 15\) in our simulations to simulate allelic imbalance. Finally, we spike-in missing entries in the variant read count matrix Q and total read count matrix R by setting \(q_ = 0\) and \(r_ = 0\) in \(\lfloor d n m \rfloor\) entries where d is the rate of missing entries in the data.
While ConDoR and SCARLET take the variant and total read counts as input, several methods (such as SPhyR, SCITE and SiFit) require an observed mutation matrix \(A'\) as input. In the following section, we show how we obtained the observed mutation matrix from the simulated read counts.
Obtaining the observed mutation matrix from the read countsMethods such as SPhyR, SCITE, and SiFit take an observed mutation matrix \('\in \^\) as input. This observed mutation matrix \('\) may contain missing entries (represented by \(-1\)) and errors (false positives and false negatives). The aforementioned methods estimate the true binary mutation matrix A and build a tumor phylogeny while correcting the errors and imputing the missing entries in the observed mutation matrix \(A'\). We denote the estimated mutation matrix by \(\hat\).
We obtain \('\) from the read count matrices Q and R as follows. We use three filtering parameters to discretize the read count matrices: (i) total read count threshold \(r_t = 10\), (ii) variant read count threshold \(q_t = 5\), and (iii) variant allele frequency threshold \(y_t = 0.1\). We say that mutation j is present in cell i if and only if the total read count \(r_\) is greater than or equal to \(r_t\), the variant read count \(q_\) is greater than or equal to \(q_t\), and the observed variant allele frequency \('_ = q_ / r_\) is greater than or equal to \(y_t\). Specifically, we set \('_ = 1\) if \((r_ \ge r_t) \wedge (q_ \ge q_t) \wedge ('_ \ge y_t)\). For the remaining entries, we set \('_ = 0\) if \(r_ \ge 0\), indicating absence of mutation, and \('_ = -1\), indicating missing entry, otherwise. The median false positive and false negative rates of the observed mutation matrices of the simulated instances are 0.0018 and 0.158, respectively (Additional file 1: Fig. S1). It is possible that using additional procedures to define a higher quality set of mutations could improve the performance of methods such as SPhyR, SCITE and SiFit.
Computation of pairwise ancestral relation accuracyHere, we describe the computation of pairwise ancestral relation accuracy \(E(T, \hat)\) for two tumor phylogenies T and \(\hat\). Under the assumption that a mutation can be gained only once in the phylogeny, any pair \((j, j')\) of mutations can be related in exactly one of the following four ways.
1Mutation j occurs along the path from the root to source node of edge on which mutation \(j'\) occurs.
2Mutation \(j'\) occurs along the path from the root to source node of edge on which mutation j occurs.
3Mutation j and \(j'\) occur on the same edge of the phylogeny.
4Mutation j and \(j'\) occur on distinct branches of the phylogeny.
We compute the accuracy of inferring the correct relationship between all possible pairs of mutations from the inferred tumor phylogeny.
Generation and pre-processing of the PDAC dataHere, we provide details regarding the generation and pre-processing of targeted sequencing data of pancreatic ductal adenocarcinoma tumor used in this study.
Bulk WES library preparation, sequencing, and variant callingGenomic DNA was extracted from each tissue using the phenol-chloroform extraction protocol [55] or the QIAamp DNA Mini Kits (Qiagen) [56]. WES library preparation and sequencing were performed by the Integrated Genomics Operation at Memorial Sloan Kettering Cancer (MSKCC, NY). Briefly, an Illumina HiSeq 2000, HiSeq 2500, HiSeq 4000, or NovaSeq 6000 platform was used to target sequencing coverages of \(>250\times\) for WES samples.
The raw FASTQ files were processed with the standard pipeline of the Bioinformatics Core at MSKCC. Sequencing reads were analyzed in silico to assess quality, coverage, and aligned to the human reference genome (hg19) using BWA [57]. After read de-duplication, base quality recalibration and multiple sequence realignment were completed with the PICARD Suite [58] and GATK v.3.1 [59]; somatic single-nucleotide variants and insertion-deletion mutations were detected using Mutect v.1.1.6 [60] and HaplotypeCaller v.2.4 [61]. This pipeline generated a set of mutations for every single sample. Then, all mutations of all samples of the same sequencing cohort were pooled as a single set. Each sample’s BAM file was used to compute “fillout” values (total depth, reference allele read counts, alternative allele read counts) for each mutation in the pooled list. Mutation with alternate read count less than 2 across all samples were removed to trim down false positives. The purpose was to rescue mutations that were detected with high confidence in one sample but with low confidence in another sample of the same patient/tumor. This generated the final output in mutational annotation format (MAF).
Single-cell DNA sequencing (Tapestri) library preparation and variant callingSingle nuclei were extracted from snap frozen primary patient samples embedded in optimal cutting temperature (OCT) compound using the protocol developed by Zhang et al. [15].
Nuclei were suspended in Mission Bio cell buffer at a maximum concentration of 4000 nuclei/\(\mu\)l, encapsulated in Tapestri microfluidics cartridge, lysed and barcoded. Barcoded samples were then put through targeted PCR amplification with a custom 596-amplicon panel covering important PDAC mutational hotspots in our sample cohort (table with all the amplicons is available at https://github.com/raphael-group/ConDoR).
The 596-amplicon panel was designed based on curation of bulk whole exome/genome sequencing data of PDAC samples collected by the Iacobuzio lab. The goal was to cover as many PDAC-related SNVs within our patient cohort as possible within a 600-amplicon limit, which was deemed economically optimal. The genes/SNVs of interest were determined by querying several resources, such as cBioportal [62, 63] and openCRAVAT [64]. Particular interest was paid to genes in the TGF\(\beta\) pathway as relevant mutations are currently being investigated as clinical biomarkers [65]. In addition to the SNVs, we added amplicons to cover as much exon region as possible for genes that are of particular interest for CNV analyses in PDAC: KRAS, TP53, SMAD4, CDKN2A, TGFBR1, TGFBR2, ACVR1B, ACVR2A, BMPR1A, BMPR1B, SMAD2, SMAD3, MYC, GATA6, BAP1, MUS81, and KAT5.
PCR products were removed from individual droplets, purified with Ampure XP beads and used as templates for PCR to incorporate Illumina i5/i7 indices. PCR products were purified again, quantified with an Agilent Bioanlyzer for quality control, and sequenced on an Illumina NovaSeq. The minimum total read depth was determined by same formula as used in [15].
As described in [15], FASTQ files for single-nuclei DNA libraries were processed through Mission Bio’s Tapestri pipeline with default parameters to arrive at the output H5 file, which mainly consists of two matrices: a cell-by-per-amplicon-read-count matrix \(\mathbf \), and a cell-by-SNV matrix \(\mathbf \). Briefly, the pipeline has the following steps.
1Adaptor sequences are trimmed and align the reads to the hg19 genome (UCSC).
2The reads are assigned to individual cell barcodes, while filtering out the low-quality cell barcodes. For each of these barcodes, the number of forward reads aligned to each amplicon is used to form matrix \(\mathbf \).
3Variant calling is performed to generate a gVCF (genomic variant call format) file from the BAM file for each cell.
4The cells are jointly genotyped to form a cell-by-SNV matrix \(\mathbf \).
A more detailed documentation of the pipeline is available at: https://support.missionbio.com/hc/en-us/categories/360002512933-Tapestri-Pipeline. In respect of Mission Bio’s request, the pipeline code is not to be publicized because it contains proprietary information per industry standard. However, the pipeline used in the paper that demonstrated this scDNA-seq library preparation technology [66] is publicly available as a Github repository at https://github.com/AbateLab/DAb-seq. Although we have not formally tested that it performs identically as the Mission Bio pipeline, we believe it is sufficient to replicate our results.
Variant callingWe detect 40 mutations in the bulk tumor sample with a variant allele frequency (VAF) of at least 0.05. Out of these 40 mutations, 34 mutations were also detected in the matched normal sample indicating that they were germline mutations. From the remaining 6 somatic mutations, we filter out mutations with low prevalence in the scDNA-seq data. Specifically, we only include mutations with variant allele frequency more than 0.1, read depth of more than 20 and variant read depth of more than 10 in at least \(5\%\) of the cells. We end up with 4 somatic mutations: chr3:30715617:C/G (TGFBR2_1), chr3:30715619:G/T (TGFBR2_2), chr8:38314915:G/T (FGFR1), and chr13:32907415:T/A (BRCA2).
Most phylogeny inference methods only consider somatic SNVs as input, and filter out all germline SNPs. However, germline SNPs that have undergone loss in a subset of cells are informative during phylogeny inference. We identify germline SNPs with putative loss by including SNPs with variant allele frequency less than 0.1, variant read depth more than 10, and total read depth less than 20 in at least \(15\%\) of the cells. We find 3 such SNPs: chr10:131506283:C/T (MGMT_1), chr10:131506192:C/T (MGMT_2), and chr1:158612236:A/G (SPTA1). In summary, we consider 3 germline SNPs and 4 somatic SNVs in our analysis.
Copy-number clusteringIn this section, we describe the method to cluster the PDAC cells based on the total reads in each cell. Let \(\mathcal \) be the set of amplicons, \(\mathcal \) be the set of genes and \(\mathcal (g)\) denote the set of amplicons contained in gene \(g\in \mathcal \). Let \(R^_ = [r^_]\) be the \(n\times |\mathcal |\) read depth matrix that contains the number of reads in cell i and amplicon a. We start by normalizing the amplicon-level read depth matrix by the total reads in each cell. Specifically, we form matrix \(\bar^\in \mathbb ^|}\) such that,
$$\begin \bar^_ = \frac^_}} r^A_}. \end$$
Next, we assume that all loci in the same gene have the same copy-number. As such, we compute the average normalized read depth as follows
$$\begin \bar^_ = \frac(g)} \bar^A_}(g)|}. \end$$
This step helps nullify some noise in the amplicon-level total read count data. We focus on 30 genes with the highest number of amplicons and perform k-means clustering on the resulting matrix \(\bar^\).
We use the Silhouette score to determine the number of copy-number clusters. Additional file 1: Fig. S9a shows the the Silhouette score for increasing number of clusters in the k-means clustering. We choose the clustering with the highest Silhouette score resulting in \(p = 3\) clusters as the copy-number clustering \(\sigma\). Additional file 1: Fig. S9b shows a t-SNE [67] where each point is a cell labeled by the cluster index from the copy-number clustering \(\sigma\).
Comments (0)