Matches in SemOpenAlex for { <https://semopenalex.org/work/W4200454918> ?p ?o ?g. }
- W4200454918 endingPage "47" @default.
- W4200454918 startingPage "41" @default.
- W4200454918 abstract "An important step of somatic variant calling algorithms for deep sequencing data is quantifying the errors. For targeted sequencing in which hotspot mutations are of interest, site-specific error estimation allows more accurate calling. The site-specific error rates are often estimated from a panel of normal samples, which has limited size and is subject to sampling bias and variance. We propose a novel statistical validation method for single-nucleotide variation (SNV) calling based on historical data. The validation method extracts the high-quality reads from the Binary Alignment/Map (BAM) files, finds the negative samples in the data, and builds a statistical model to call individual samples. It is particularly useful in detecting low-frequency variants that may be missed by traditional panel of normal–based SNV methods. The proposed method makes it possible to launch a simple and parallel validation pipeline for SNV calling and improve the detection limit. 
Development of next-generation sequencing (NGS) technologies in the last decade has boosted the growth of cancer diagnostics, offering the opportunity of personalized treatments that target specific mutations. With reduced cost and turnaround time, the market of the molecular diagnostics of cancer has been expanding steadily, accompanied by rapid accumulation of sequencing data collected through clinical laboratories or in vitro diagnostic assays. NGS has great advantages in detecting multiple low-frequency mutations in parallel because of the high-throughput data it generates. It has a variety of important applications in tumor genotyping, early diagnosis, and timely detection of tumor resistance.1Heitzer E. Perakis S. Geigl J.B. Speicher M.R. The potential of liquid biopsies for the early detection of cancer.NPJ Precision Oncol. 2017; 1: 36Crossref PubMed Google Scholar Among different variations NGS can detect, single-nucleotide variation (SNV) is the most common type. Several methods are developed for SNV calling, using either matched tumor-normal samples or tumor samples only.2Xu C. A review of somatic single nucleotide variant calling algorithms for next-generation sequencing data.Comput Struct Biotechnol J. 2018; 16: 15-24Crossref PubMed Scopus (118) Google Scholar, 3Yoon S. Xuan Z. Makarov V. Ye K. Sebat J. Sensitive and accurate detection of copy number variants using read depth of coverage.Genome Res. 2009; 19: 1586-1592Crossref PubMed Scopus (434) Google Scholar, 4Kumar A. White T.A. MacKenzie A.P. Clegg N. Lee C. Dumpit R.F. Coleman I. Ng S.B. Salipante S.J. Rieder M.J. Nickerson D.A. Corey E. Lange P.H. Morrissey C. Vessella R.L. Nelson P.S. Shendure J. Exome sequencing identifies a spectrum of mutation frequencies in advanced and lethal prostate cancers.Proc Natl Acad Sci U S A. 
2011; 108: 17087-17092Crossref PubMed Scopus (194) Google Scholar However, accurately detecting low-frequency SNVs remains a challenge because of various sources of errors during DNA fragmentation, library preparation, PCR, sequencing, and reads alignment. For each site, the error was related to nucleotide substitution, sequence context, and storage conditions.5Ma X. Shao Y. Tian L. Flasch D.A. Mulder H.L. Edmonson M.N. Liu Y. Chen X. Newman S. Nakitandwe J. Li Y. Li B. Shen S. Wang Z. Shurtleff S. Robison L.L. Levy S. Easton J. Zhang J. Analysis of error profiles in deep next-generation sequencing data.Genome Biol. 2019; 20: 1-15Crossref Scopus (91) Google Scholar Therefore, proper handling of the relationship between noise from different technical artifacts and true mutations is needed. Several variant calling methods are based on site-specific error rate.2Xu C. A review of somatic single nucleotide variant calling algorithms for next-generation sequencing data.Comput Struct Biotechnol J. 2018; 16: 15-24Crossref PubMed Scopus (118) Google Scholar,6Shiraishi Y. Sato Y. Chiba K. Okuno Y. Nagata Y. Yoshida K. Shiba N. Hayashi Y. Kume H. Homma Y. Sanada M. Ogawa S. Miyano S. An empirical Bayesian framework for somatic mutation detection from cancer genome sequencing data.Nucleic Acids Res. 2013; 41: 89Crossref PubMed Scopus (129) Google Scholar, 7Carrot-Zhang J. Majewski J. LoLoPicker: detecting low allelic-fraction variants from low-quality cancer samples.Oncotarget. 2017; 8: 37032-37040Crossref PubMed Scopus (14) Google Scholar, 8Gerstung M. Beisel C. Rechsteiner M. Wild P. Schraml P. Moch H. Beerenwinkel N. Reliable detection of subclonal single-nucleotide variants in tumour cell populations.Nat Commun. 
2012; 3: 811Crossref PubMed Scopus (162) Google Scholar In general, a panel of normal (PON) samples is required to estimate the error rate in advance, and the observed variant allele frequency (VAF) is then compared with the estimated error based on a statistical model (ie, the binomial distribution and its extensions).9Kleftogiannis D. Punta M. Jayaram A. Sandhu S. Wong S.Q. Tandefelt D.G. Conteduca V. Wetterskog D. Attard G. Lise S. Identification of single nucleotide variants using position-specific error estimation in deep sequencing data.BMC Med Genomics. 2019; 12: 115Crossref Scopus (7) Google Scholar A limitation with these PON-based error models is that the amount of normal samples is often insufficient to obtain an accurate profile of the errors, especially for the positions with low coverage.10Van der Auwera G.A. Carneiro M.O. Hartl C. Poplin R. Del Angel G. Levy-Moonshine A. Jordan T. Shakir K. Roazen D. Joel T. Banks E. Garimella K.V. Altshuler D. Gabriel S. DePristo M.A. From FastQ data to high-confidence variant calls: the genome analysis toolkit best practices pipeline.Curr Protoc Bioinformatics. 2013; 43: 11.10.1-11.10.33Google Scholar It is recommended to choose normal from young and healthy subjects and technically similar to the tumor.10Van der Auwera G.A. Carneiro M.O. Hartl C. Poplin R. Del Angel G. Levy-Moonshine A. Jordan T. Shakir K. Roazen D. Joel T. Banks E. Garimella K.V. Altshuler D. Gabriel S. DePristo M.A. From FastQ data to high-confidence variant calls: the genome analysis toolkit best practices pipeline.Curr Protoc Bioinformatics. 2013; 43: 11.10.1-11.10.33Google Scholar Alternatively, cell lines provide good control for individual sites, but limitation of PON size still exists and prefiltering of clonal mutations is needed to scale up to large regions. In some cases, there is no mismatch for some positions in PON; then, the information has to be borrowed from other positions, which may be unreliable. 
Although a large PON can be built for whole-genome sequencing/whole-exome sequencing by merging public data from different sources,11Ghandi M. Huang F.W. Jané-Valbuena J. Kryukov G.V. Lo C.C. McDonald E.R. et al.Next-generation characterization of the cancer cell line Encyclopedia.Nature. 2019; 569: 503-508Crossref PubMed Scopus (944) Google Scholar this is difficult for proprietary target panels. Some methods utilize a large panel of normal samples, and separate germline mutations and background errors via k-means clustering of VAF.7Carrot-Zhang J. Majewski J. LoLoPicker: detecting low allelic-fraction variants from low-quality cancer samples.Oncotarget. 2017; 8: 37032-37040Crossref PubMed Scopus (14) Google Scholar However, they are not applicable to control samples if low-frequency mutations exist. Another issue is that when applying the somatic variant calling methods in a clinical setting, more attention is given to suppressing the false positives and using conservative thresholds in post filtering.12Holt J.M. Wilk M. Sundlof B. Nakouzi G. Bick D. Lyon E. Reducing Sanger confirmation testing through false positive prediction algorithms.Genet Med. 2021; 23: 1255-1262Abstract Full Text Full Text PDF PubMed Scopus (3) Google Scholar,13Bobo D. Lipatov M. Rodriguez-Flores J.L. Auton A. Henn B.M. False negatives are a significant feature of next generation sequencing callsets.bioRxiv. 2016; ([Preprint] doi:10.1101/066043)Google Scholar In tumor genotyping, this could result in missed treatment opportunities for patients. A third issue is that there is not yet a statistical framework for validating the variant calls. Some of the variants can be confirmed experimentally via Sanger sequencing or PCR-based methods. The problems with experimental validation are feasibility and cost. The throughput and sensitivity of Sanger sequencing are low, whereas PCR requires proprietary probes, which may not be available. 
Besides, there is a significant increase in expense and turnaround time to patients. When experimental validation is not available, the variants are often manually reviewed using data visualization tools, like Integrative Genomics Viewer14Thorvaldsdóttir H. Robinson J.T. Mesirov J.P. Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration.Brief Bioinform. 2013; 14: 178-192Crossref PubMed Scopus (4539) Google Scholar or GenomeBrowse (Golden Helix Inc., Bozeman, MT). Manual review relies on individuals' training knowledge and therefore lacks intralaboratory reproducibility. With the demand for tumor NGS testing that keeps increasing in the foreseeable future, there is a pressing need for validating variant calling results more efficiently. Herein, we proposed a statistical method for validating SNV calling results based on retrospective data. Instead of PON, a collection of clinical pan-cancer samples was used to empirically profile the site-specific error distribution of the SNVs. The method detected mutations using an outlier test with high sensitivity and specificity. It was performed on 38 cancer-related variant positions and compared with an in-house developed algorithm. The validation method discovered new low-frequency mutations at these positions. The new SNVs were experimentally validated by droplet digital PCR. A simulation study was also conducted over a larger set of SNVs. We showed the advantages of utilizing retrospective data in validating and further improving the accuracy of SNV calling. The BAM files of 6580 pan-cancer tumor samples sequenced between October 2019 and July 2020 were collected for model training. The sample types included lung, colon, liver, and gastric, of which lung and colon comprised >40% (Supplemental Figure S1). NGS profiling of these samples was performed on a panel of 381 genes. DNA preparation and sequencing were conducted at 3DMed Clinical Laboratory Inc. 
(Shanghai, China), a College of American Pathologists–certified and Clinical Laboratory Improvement Amendments–certified laboratory. Genomic DNA was extracted with QIAamp DNA FFPE Tissue Kit (Qiagen, Venlo, the Netherlands) and quantified by PicoGreen fluorescence assay (Invitrogen, Carlsbad, CA). Extracted DNA (50 to 200 ng) was sonicated to fragments around 200 bp (Covaris, Woburn, MA), then constructed into sequencing libraries with KAPA Hyper Prep Kit (Kapa Biosystems, Wilmington, MA). The DNA libraries were hybridized with probes targeting a 381-gene panel to capture the targeted genomic regions. Target was enriched by capturing with target-specific biotinylated probes, which were then isolated by magnetic pulldown. The samples were sequenced on a NextSeq 550 instrument (Illumina, San Diego, CA). Sequencing reads were mapped against the human reference genome (hg19/GRCh37) with BWA version 0.7.1215Li H. Durbin R. Fast and accurate short read alignment with Burrows-Wheeler Transform.Bioinformatics. 2009; 25: 1754-1760Crossref PubMed Scopus (24592) Google Scholar and sorted by SAMtools version 1.3.16Li H. Handsaker B. Wysoker A. Fennell T. Ruan J. Homer N. Marth G. Abecasis G. Durbin R. 1000 Genome project data processing subgroup: the sequence alignment/map (SAM) format and SAMtools.Bioinformatics. 2009; 25: 2078-2079Crossref PubMed Scopus (28964) Google Scholar Duplicate reads were removed using Picard version 1.130 (http://broadinstitute.github.io/picard, last accessed August 26, 2021). SNV calling in the target positions employed a custom-built pipeline based on local realignment, context error rate calculation, and binomial error testing, followed by filtering processes for systematic error (PON), strand bias, orientation bias, low base quality, low mapping quality, short tandem repeats, position bias, and DNA damage. 
Cross-platform droplet digital PCR was conducted to experimentally validate the discordant results between the previous variant calling pipeline and the validation method. Genomic DNA (50 to 100 ng) was cleaved using BamH1 (Sigma-Aldrich, Schnelldorf, Germany). The Mastermix and sample DNA were thoroughly mixed and transferred to a DG8 Cartridge for a QX100/QX200 Droplet Generator (Bio-Rad, Hercules, CA). Subsequent amplification was performed in a T100 Thermal Cycler (Bio-Rad). Droplets were read in a QX200 Droplet reader (Bio-Rad), and then data were analyzed using Quantasoft version 1.6.6 (Bio-Rad). In each run, a negative human sample and a nontemplate control were included. The data and code are available at https://github.com/xingmars/ReVal-SNV (last accessed August 27, 2021). The validation method starts by extracting the counts of filtered reads from the deduplicated sample BAM files. For a variant position in sample i, let xi be the number of variant reads and ni be the total number of reads mapped to the position (sequencing depth). The historical data can be viewed as a mixture of negative samples and true mutations. The fundamental assumption is the following: variant reads in the negative samples are errors generated from a homogeneous process that is different from true mutations. The goal is to estimate the underlying distribution of xi in the negative samples (referred to as the null distribution) and to call out samples that are outliers. Assume $x_i \sim f(n_i, \theta)$, where f is the probability mass function of the random error and θ is the set of parameters. Let F be the cumulative distribution function $$F(x_i) = \sum_{k=0}^{x_i} f(k, n_i, \theta).$$ Here f is chosen to be the β binomial distribution. It is commonly used in SNV calling algorithms.6Shiraishi Y. Sato Y. Chiba K. Okuno Y. Nagata Y. Yoshida K. Shiba N. Hayashi Y. Kume H. Homma Y. Sanada M. Ogawa S. Miyano S. 
An empirical Bayesian framework for somatic mutation detection from cancer genome sequencing data.Nucleic Acids Res. 2013; 41: 89Crossref PubMed Scopus (129) Google Scholar,17Shugay M. Zaretsky A.R. Shagin D.A. Shagina I.A. Volchenkov I.A. Shelenkov A.A. Lebedin M.Y. Bagaev D.V. LuKyanov Chudakov D.M. MAGERI: computational pipeline for molecular-barcoded targeted resequencing.PLoS Comput Biol. 2017; 13: e1005480Crossref PubMed Scopus (23) Google Scholar, 18Roth A. Ding J. Morin R. Crisan A. Ha G. Giuliany R. Bashashati A. Hirst M. Turashvili G. Oloumi A. Marra M.A. Aparicio S. Shah S.P. JointSNVMix: a probabilistic model for accurate detection of somatic mutations in normal/tumour paired next-generation sequencing data.Bioinformatics. 2012; 28: 907-913Crossref PubMed Scopus (126) Google Scholar, 19Gerstung M. Papaemmanuil E. Campbell P.J. Subclonal variant calling with multiple samples and prior knowledge.Bioinformatics. 2014; 30: 1198-1204Crossref PubMed Scopus (65) Google Scholar It extends from the binomial model by adding a dispersion parameter to account for the heterogeneity of multiple binomial processes. Without further information, f is not identifiable because the data are a mutation/normal mixture. A zero assumption is made that the mutation distribution has no support in the range xi/ni ≤ a for some threshold a.20Efron B. Large-scale simultaneous hypothesis testing: the choice of a null hypothesis.J Am Stat Assoc. 2004; 99: 96-104Crossref Scopus (645) Google Scholar This is equivalent to the sample being negative if the observed VAF ≤ a (not vice versa). This assumption allows us to estimate f using only low VAF samples. Essentially, f is estimated using the partial data truncated within the threshold xi/ni ≤ a (Figure 1). 
For the truncated data, xi follows a truncated β binomial distribution: $$P(x_i = k \mid n_i, \theta, a) = \frac{f(k, n_i, \theta)}{F(a n_i, n_i, \theta)} \quad \text{for } k \le a n_i \qquad (1)$$ The joint log likelihood of all truncated data points (xi, ni), i = 1,…,n, is $$ll(\theta) = \sum_{x_i \le a n_i} \log P(x_i \mid n_i, \theta) \qquad (2)$$ The parameter θ can be obtained via the maximum likelihood estimation method: $$\hat{\theta} = \arg\max_{\theta} \, ll(\theta) \qquad (3)$$ To determine a, the local minimum of the kernel density of log xi/ni is identified, then a is set to be the 80% quantile of the positive VAFs (xi/ni) below the local minimum. The calculation of a is based on the observation that in the historical data for many SNV sites, the nonzero VAFs have either a single-mode or a bimodal distribution. The single-mode distribution may have a heavy right tail that contains the true mutations. For a bimodal distribution, the second component is likely to be exclusively composed of true mutations; however, there may be a nonnegligible number of true mutations in the right tail of the first component. Here, the local minimum that separates the two components is found, and it is assumed that the lower 80% of the first component is composed entirely of negative samples. Note that although a is determined by the nonzero VAFs, all samples under a, including those with VAF = 0, are used to calculate the likelihood in equation 2. For outlier detection, at nominal type I error α, sample i is called mutant if $$1 - F(x_i - 1, n_i, \hat{\theta}) \le \alpha$$ (Figure 2). The accuracy of the outlier detection depends on how well the estimated distribution fits the truncated data, especially the nonzero VAFs. A modified goodness-of-fit test is designed to evaluate the model fit to the data in the range 0 < xi/ni ≤ a. The nonzero VAFs under a are grouped into 10 equal-sized bins by qj, where qj, j = 0, 1, 2, …, 10, are the 0%, 10%, 20%, …, 100% quantiles, so that the jth bin (qj−1, qj] contains a proportion ej ≈ 10% of the samples. 
The estimated probability pj that the VAF falls in the jth bin is calculated from the estimated β binomial distribution: $$p_{ij} = F(q_j, n_i, \hat{\theta}) - F(q_{j-1}, n_i, \hat{\theta}), \qquad p_j = \frac{\sum_i p_{ij}}{\sum_{i,j} p_{ij}} \qquad (4)$$ A goodness-of-fit statistic G is calculated and compared against a χ² distribution with 9 degrees of freedom. A large value of G indicates a bad fit to the nonzero VAFs. $$G = \sum_{j=1}^{10} \frac{(p_j - e_j)^2}{e_j} \, n^* \qquad (5)$$ where n∗ is the number of nonzero VAFs under a, $n^* = \#\{i : 0 < x_i/n_i \le a\}$. Note that G is calculated only for the nonzero VAFs. Because most of the VAFs are zero, the model fitting is dominated by the zeroes, and the ability to detect model deviance on the nonzero VAFs would be significantly compromised if the zeroes were included in calculating G. The median sequencing error rate of the 6580 samples was 1.7 × 10^−3, consistent with other studies.21Pfeiffer F. Gröber C. Blank M. Händler K. Beyer M. Schultze J.L. Mayer G. Systematic evaluation of error rates and causes in short samples in next-generation sequencing.Sci Rep. 2018; 8: 10950Crossref PubMed Scopus (123) Google Scholar For data cleaning, reads from secondary alignments, reads containing nonmatched bases (insertions/deletions, clipping, skipping, and padding), and reads with low base quality (<37) at the variant positions were filtered out. The number of variant reads (xi) (Supplemental Table S1) and sequencing depth (ni) (Supplemental Table S2) were extracted at 38 variant sites (Supplemental Table S3) from the filtered data. The variants were from six genes (BRAF, EGFR, KIT, KRAS, MET, and PIK3CA) that have shown important clinical relevance with certain therapies.22Chapman P.B. Hauschild A. Robert C. Haanen J.B. Ascierto P. Larkin J. Dummer R. Garbe C. Testori A. Mail M. Hogg D. Lorigan P. Lebbe C. Jouary T. Schadendorf D. Ribas A. O’Day S.J. Sosman J.A. Kirkwood J.M. Eggermont A.M.M. Dreno B. Nolop K. Li J. Nelson B. Hou J. Lee R.J. Flaherty K.T. McArthur G.A. Improved survival with vemurafenib in melanoma with BRAF V600E mutation.N Engl J Med. 
2011; 364: 2507-2516Crossref PubMed Scopus (5954) Google Scholar,23Paez J.G. Jänne P.A. Lee J.C. Tracy S. Greulich H. Gabriel S. Herman P. Kaye F.J. Lindeman N. Boggon T.J. Naoki K. Sasaki H. Fujii Y. Eck M.J. Sellers W.R. Johnson B.E. Meyerson M. EGFR mutations in lung cancer: correlation with clinical response to gefitinib therapy.Science. 2004; 304: 1497-1500Crossref PubMed Scopus (8323) Google Scholar The median depth was 1500× for the seven positions on EGFR and between 300× and 600× on the other five genes. The value of a was determined from the density of log xi/ni using gaussian kernel with smoothing bandwidth 0.25. The bandwidth was selected large enough to smooth out the noise, while maintaining the major modes of the VAF. The maximum likelihood estimates in equation 3 were obtained by the optim function in R version 3.6.3 (https://www.r-project.org, last accessed July 14, 2021) using the Nelder-Mead algorithm.24Nelder J.A. Mead R. A simplex method for function minimization.Computer J. 1965; 7: 308-313Crossref Google Scholar Five of the 38 variant sites (PIK3CA-E545A, PIK3CA-Q546E, PIK3CA-E542K, KIT-V559G, and KIT-V560G) showed a lack of fit by the χ2 goodness-of-fit test in equation 5 (P < 0.05), and thus were excluded from downstream analysis (Supplemental Table S4 and Supplemental Figure S2). For comparison, the analysis was rerun by replacing the β binomial with binomial distribution in equation 1 (data not shown). Thirteen more variant sites were rejected by the goodness-of-fit test (P < 0.05), suggesting that β binomial distribution provided a better fit in general. For outlier detection, the type I error was set at α = 10−6. In summary, 43 new low-frequency mutations (VAF <5%) were discovered at 14 variant sites (Figure 3). Forty-two of them were confirmed positive by droplet digital PCR, and the other one was negative. In the high-coverage positions of EGFR, the validation method was able to detect mutations <1%. 
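The truncated likelihood of equation 2 and the outlier rule 1 − F(xi − 1, ni, θ̂) ≤ α can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: it assumes the β binomial is parameterized by two shape parameters (here called `alpha` and `beta`, names chosen for this sketch), it takes the truncation point as the floor of a·ni, and it leaves out the Nelder-Mead optimization step entirely. All function names are hypothetical.

```python
import math


def betabinom_logpmf(k, n, alpha, beta):
    """Log probability of k variant reads out of n under a beta-binomial.

    alpha and beta are shape parameters (an assumed parameterization).
    """
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    log_beta_num = (math.lgamma(k + alpha) + math.lgamma(n - k + beta)
                    - math.lgamma(n + alpha + beta))
    log_beta_den = math.lgamma(alpha) + math.lgamma(beta) - math.lgamma(alpha + beta)
    return log_choose + log_beta_num - log_beta_den


def betabinom_cdf(x, n, alpha, beta):
    """F(x) = sum of the pmf over k = 0..x (0 when x is negative)."""
    if x < 0:
        return 0.0
    upper = min(x, n)
    total = sum(math.exp(betabinom_logpmf(k, n, alpha, beta)) for k in range(upper + 1))
    return min(1.0, total)


def truncated_loglik(samples, alpha, beta, a):
    """Equation 2: log likelihood over samples with x_i <= a * n_i.

    samples is a list of (x_i, n_i) pairs; the cutoff floor(a * n_i) is an
    assumption about how a non-integer a * n_i is handled.
    """
    ll = 0.0
    for x, n in samples:
        cutoff = math.floor(a * n)
        if x > cutoff:
            continue  # sample lies outside the truncation range
        ll += betabinom_logpmf(x, n, alpha, beta)
        ll -= math.log(betabinom_cdf(cutoff, n, alpha, beta))
    return ll


def is_mutant(x, n, alpha, beta, type1_error=1e-6):
    """Outlier rule: call mutant if P(X >= x) <= type1_error.

    The upper tail is summed directly, which equals 1 - F(x - 1) but avoids
    subtracting two nearly equal numbers.
    """
    tail = sum(math.exp(betabinom_logpmf(k, n, alpha, beta)) for k in range(x, n + 1))
    return tail <= type1_error
```

In practice the two shape parameters would be chosen to maximize `truncated_loglik` for each site (the text reports using `optim` in R with Nelder-Mead), and `is_mutant` would then be applied to every sample at that site with the fitted values.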
In other positions with lower coverage, more than half of the new positive cases had VAF <2%. The only negative case was EGFR-G719C, with VAF 6 of 1335. Partly because of higher coverage of the in-house samples, the positive rates were generally higher than The Cancer Genome Atlas Mutect results (https://www.cancer.gov/tcga, last accessed August 26, 2021), especially for the high-coverage EGFR (Supplemental Figure S3 and Supplemental Figure S4). To assess how the validation method scales up to a broader range of genomic positions, the model training procedure was repeated on 1337 more variations on 208 genes with clinical relevance to tumorigenesis and treatment (Supplemental Table S5). Of the total 1375 SNVs (including the previous 38 variations), 90% interval of median depth was between 280× and 1520× (Supplemental Figure S5). The estimated error rates were highly heterogeneous across different positions, ranging from 3.4 × 10−6 to 2.1 × 10−3 (Supplemental Figure S6). A total of 110 positions did not pass the goodness-of-fit test (P < 0.05) (Supplemental Table S4). To generate truth data sets for performance evaluation, three samples (two lung cancer and one colorectal colon cancer) were selected as the control group. In silico mutation samples were generated by computationally spiking SNVs into the BAM files of the control samples. The 1265 SNVs that have passed the goodness-of-fit test were spiked in at VAF = 0.01 using VarBen.25Li Z. Fang S. Zhang R. Yu L. Zhang J. Bu D. Sun L. Zhao Y. Li J. VarBen: generating in silico reference data sets for clinical next-generation sequencing bioinformatics pipeline evaluation.J Mol Diagn. 2021; 3: 285-299Abstract Full Text Full Text PDF Scopus (3) Google Scholar The P values were calculated from the trained models. Sensitivity and specificity were calculated from the control and case samples. The performance was compared with CleanDeepSeq5Ma X. Shao Y. Tian L. Flasch D.A. Mulder H.L. Edmonson M.N. Liu Y. Chen X. Newman S. 
Nakitandwe J. Li Y. Li B. Shen S. Wang Z. Shurtleff S. Robison L.L. Levy S. Easton J. Zhang J. Analysis of error profiles in deep next-generation sequencing data.Genome Biol. 2019; 20: 1-15Crossref Scopus (91) Google Scholar followed by deepSNV.18Roth A. Ding J. Morin R. Crisan A. Ha G. Giuliany R. Bashashati A. Hirst M. Turashvili G. Oloumi A. Marra M.A. Aparicio S. Shah S.P. JointSNVMix: a probabilistic model for accurate detection of somatic mutations in normal/tumour paired next-generation sequencing data.Bioinformatics. 2012; 28: 907-913Crossref PubMed Scopus (126) Google Scholar The validation method maintained higher sensitivity across reasonable range of specificity (Figure 4). The plateau of the sensitivity was a result of the data cleaning step in which the most/all of the variant reads for some positions were filtered out. At 99% specificity, the sensitivity of the validation method reaches 85%. In this article, we proposed a novel statistical framework for validating SNV mutation calls. The method aims to replace PON of limited size with tumor samples to improve the accuracy of site-wise error estimation. The proposed method takes input from BAM-derived read count files and can run in parallel with other SNV callers. The method is based on a large collection of historical samples to statistically validate the variant calls. An advantage of the validation method is the ability to identify low-frequency mutations while keeping high specificity. As mutation rate is related to sequence context, utilizing such large set of samples offers more accurate estimation on site-specific error compared with using a fixed PON. We showed that it was able to detect mutations with VAF as low as 0.5% at 1500× depth. Another feature is that it does not require a prior list of validated normal samples. Instead, it heuristically finds a set of samples that are likely to be negative based on simple guided parameters. 
Of the 1375 SNVs studied, the range of the estimated error rate (θ1) was 3.4 × 10−6 to 2.1 × 10−3, and the median was 9.6 × 10−5, significantly lower than the overall sequencing error rate of the panel. The site-wise error rate was in the same range of other studies (ie, the error rate of BRAF-V600E was 9.8 × 10−4 versus 5 × 10−4 by CleanDeepSeq).5Ma X. Shao Y. Tian L. Flasch D.A. Mulder H.L. Edmonson M.N. Liu Y. Chen X. Newman S. Nakitandwe J. Li Y. Li B. Shen S. Wang Z. Shurtleff S. Robison L.L. Levy S. Easton J. Zhang J. Analysis of error profiles in deep next-generation sequencing data.Genome Biol. 2019; 20: 1-15Crossref Scopus (91) Google Scholar For the analysis of 38 SNVs, the type I error α was set small (10−6) in the outlier detection for individual site and sample so that the mutations were called with high statistical confidence. The expected number of family-wise type I error was 6580 × 38 × 10−6 = 0.25. For some positions (eg, PIK3CA-E545K and BRAF-V600E), the method significantly increased sensitivity while maintaining specificity. The importance of evaluating the null model was emphasized using the historical data. More importantly, a statistical criterion was proposed to assess the model fit in practice. With a small PON, the null model is likely to be biased. Even a small bias in the empirical data may result in several biased results in large-scale hypothesis testing.19Gerstung M. Papaemmanuil E. Campbell P.J. Subclonal variant calling with multiple samples and prior knowledge.Bioinformatics. 2014; 30: 1198-1204Crossref PubMed Scopus (65) Google Scholar Therefore, attention is needed when a theoretical or assumed null distribution is used. Availability of large historical data makes it possible to evaluate the models in PON-based SNV calling pipelines. On the other hand, the VAFs of some positions were overdispersed from the estimated β binomial model that resulted in rejection of the goodness-of-fit test. 
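The goodness-of-fit statistic of equation 5 is simple to compute once the bin probabilities are available. The sketch below assumes the estimated bin probabilities pj and the observed bin proportions ej have already been obtained; the function names are hypothetical, and the critical value 16.919 is the 95% quantile of a χ² distribution with 9 degrees of freedom, matching the P < 0.05 criterion used in the text.

```python
# 95% quantile of the chi-square distribution with 9 degrees of freedom
CHI2_CRITICAL_9DF_P05 = 16.919


def gof_statistic(p, e, n_star):
    """Equation 5: G = n* times the sum over bins of (p_j - e_j)^2 / e_j.

    p      estimated bin probabilities from the fitted null model
    e      observed bin proportions (about 0.1 each for decile bins)
    n_star number of nonzero VAFs under the threshold a
    """
    if len(p) != len(e):
        raise ValueError("p and e must have the same number of bins")
    return n_star * sum((pj - ej) ** 2 / ej for pj, ej in zip(p, e))


def rejects_fit(p, e, n_star, critical=CHI2_CRITICAL_9DF_P05):
    """True when G exceeds the critical value, i.e. the site fails the test."""
    return gof_statistic(p, e, n_star) > critical
```

For example, with ten decile bins, a model that puts probability 0.2 in the first bin and 0 in the second (0.1 elsewhere) gives G = 20 when n∗ = 100, which exceeds 16.919, so that site would be excluded from downstream analysis.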
Of all the 1375 SNVs studied, 110 failed the test, exceeding the expected number of failures by random (1375 × 0.05 = 69). This shows the heterogeneity of errors for a small portion of sites, which might be associated with uncaptured systematic errors because of the complexity of the regions. The performance of the validation method relies on two factors. The first factor is zero assumption in the truncated data. The idea of zero assumption is introduced to estimate the empirical null distribution in large-scale hypothesis testing.19Gerstung M. Papaemmanuil E. Campbell P.J. Subclonal variant calling with multiple samples and prior knowledge.Bioinformatics. 2014; 30: 1198-1204Crossref PubMed Scopus (65) Google Scholar We adopted the similar idea to select a conservative range (ie, only composed of negative samples) and to estimate the distribution in the full spectrum. Although the assumption may not be entirely correct in practice, it allows us to get a more accurate profile of the site-specific error. The choice of the parameter a is a bias-variance trade-off. Sources of bias include the misspecification of the model and the contamination of true mutations in the range of a, whereas variance is related to the sample size. A smaller a leads to less bias but larger variance. As the sample size grows, it allows us to reach higher precision with the same a, or to select a more conservative a while controlling for the variance. The 80% cutoff for the in-house pan-cancer training samples is reasonable, as it is unlikely that a significantly large portion of samples carry the same mutations. However, it can result in a decrease in sensitivity if many samples have the same driver mutations. The second factor is homogeneous error distribution. Generally, variations at a certain site can be attributable to random errors within sample and random errors across samples, which can be estimated from the whole panel and PON, respectively. 
The validation method aims to estimate the combined error, assuming that the error rate varies across samples according to a β distribution. The truncated data modeling approach worked well for the SNVs that were validated by droplet digital PCR. It is effective when a site is dominated by either within-sample error or cross-sample error, or when both errors have similar dispersion. For SNVs in which both within-sample and cross-sample errors are high and the two errors are differently dispersed, the negative samples on the truncated right tail may not be homogeneous with the remaining samples, in which case the overall error rate may be underestimated. A more conservative a can be chosen for positions subject to heterogeneous cross-sample errors. There are several limitations to the validation method. First, it takes only x and n for modeling the errors and does not incorporate other information from the reads; therefore, it cannot distinguish SNVs from other types of variation. It is intended to be used as a validation procedure for variant calling results, not to replace them. Second, to ensure the quality of the input data, all reads with softclips and some other quality control flags were discarded in the data cleaning step, resulting in a failure to detect some low VAF mutations, as shown in the simulation. Third, the parameters are estimated for each site. Because of the small number of nonzero VAFs at some positions, the estimates may have relatively large variance, which limits performance on small data sets. To reduce the variance, pooling across different sites could be considered in future work. Fourth, the null model is estimated by maximum likelihood; quantifying the inherent bias of maximum likelihood estimation for truncated data is beyond the scope of this work ([26] Schwartz J, Giles DE. Bias-reduced maximum likelihood estimation of the zero-inflated Poisson distribution. Commun Stat Theory Methods. 2016; 45: 465-478; [27] Godwin RT. Econometric analysis of non-standard count data (doctoral dissertation). University of Victoria, Victoria, BC, Canada; 2012).
Last, other methods have been developed to estimate the empirical distribution in the negative samples in different applications ([28] Schwartzman A, Dougherty RF, Lee J, Ghahremani D, Taylor JE. Empirical null and false discovery rate analysis in neuroimaging. Neuroimage. 2009; 44: 71-82; [29] Jin J, Cai TT. Estimating the null and the proportion of nonnull effects in large-scale multiple comparisons. J Am Stat Assoc. 2007; 102: 495-506; [30] Gauran IIM, Park J, Lim J, Park D, Zylstra J, Peterson T, Kann M, Spouge JL. Empirical null estimation using zero-inflated discrete mixture distributions and its application to protein domain data. Biometrics. 2018; 74: 458-471). These methods may also be adapted for the error model estimation in SNV validation. Supplemental Figure S1: Proportion of tumor types in the 6580 pan-cancer samples. Supplemental Figure S2: Histogram of variant allele frequencies (VAFs; log scale) for the five single-nucleotide variations that did not pass the goodness-of-fit test (PIK3CA-E542K, PIK3CA-E545A, PIK3CA-Q546E, KIT-V559G, and KIT-V560G) in the 6580 historical samples. The dashed line th is the global maximum corresponding to the mode of background error. The dashed line tl is the local minimum that separates positives and negatives. The dashed line a is the 80% quantile of the nonzero VAFs below tl of the kernel density.
Supplemental Figure S3: Positive rate of hotspot single-nucleotide variations in the 1765 in-house lung samples and 1045 The Cancer Genome Atlas (TCGA) lung (lung adenocarcinoma and lung squamous cell carcinoma) samples. Supplemental Figure S4: Positive rate of hotspot single-nucleotide variations in the 941 in-house colon samples and 537 The Cancer Genome Atlas (TCGA) colon (colon adenocarcinoma and rectum adenocarcinoma) samples. Supplemental Figure S5: Histogram of median depth of the 1375 single-nucleotide variations. Supplemental Figure S6: Bar plot of estimated mean error rate (θ1) of all the 1375 single-nucleotide variations. Supplemental Table S1. Supplemental Table S2. Supplemental Table S3. Supplemental Table S4. Supplemental Table S5." @default.
- W4200454918 created "2021-12-31" @default.
- W4200454918 creator A5001869015 @default.
- W4200454918 creator A5008702262 @default.
- W4200454918 creator A5008879037 @default.
- W4200454918 creator A5023974795 @default.
- W4200454918 creator A5037086767 @default.
- W4200454918 creator A5067037152 @default.
- W4200454918 creator A5069536911 @default.
- W4200454918 creator A5078530516 @default.
- W4200454918 date "2022-01-01" @default.
- W4200454918 modified "2023-10-13" @default.
- W4200454918 title "A Retrospective Statistical Validation Approach for Panel of Normal–Based Single-Nucleotide Variant Detection in Tumor Sequencing" @default.
- W4200454918 cites W1824047490 @default.
- W4200454918 cites W1919257374 @default.
- W4200454918 cites W1989840284 @default.
- W4200454918 cites W2009302881 @default.
- W4200454918 cites W2010882525 @default.
- W4200454918 cites W2080660780 @default.
- W4200454918 cites W2081769927 @default.
- W4200454918 cites W2099377894 @default.
- W4200454918 cites W2103441770 @default.
- W4200454918 cites W2108234281 @default.
- W4200454918 cites W2115976081 @default.
- W4200454918 cites W2117131162 @default.
- W4200454918 cites W2128542677 @default.
- W4200454918 cites W2131039280 @default.
- W4200454918 cites W2137779717 @default.
- W4200454918 cites W2161821474 @default.
- W4200454918 cites W2171074980 @default.
- W4200454918 cites W2597365302 @default.
- W4200454918 cites W2611802561 @default.
- W4200454918 cites W2756787081 @default.
- W4200454918 cites W2763620027 @default.
- W4200454918 cites W2794000666 @default.
- W4200454918 cites W2884441543 @default.
- W4200454918 cites W2940391028 @default.
- W4200454918 cites W2944727022 @default.
- W4200454918 cites W2964368730 @default.
- W4200454918 cites W3112894169 @default.
- W4200454918 cites W3137023660 @default.
- W4200454918 doi "https://doi.org/10.1016/j.jmoldx.2021.09.010" @default.
- W4200454918 hasPubMedId "https://pubmed.ncbi.nlm.nih.gov/34974877" @default.
- W4200454918 hasPublicationYear "2022" @default.
- W4200454918 type Work @default.
- W4200454918 citedByCount "1" @default.
- W4200454918 countsByYear W42004549182023 @default.
- W4200454918 crossrefType "journal-article" @default.
- W4200454918 hasAuthorship W4200454918A5001869015 @default.
- W4200454918 hasAuthorship W4200454918A5008702262 @default.
- W4200454918 hasAuthorship W4200454918A5008879037 @default.
- W4200454918 hasAuthorship W4200454918A5023974795 @default.
- W4200454918 hasAuthorship W4200454918A5037086767 @default.
- W4200454918 hasAuthorship W4200454918A5067037152 @default.
- W4200454918 hasAuthorship W4200454918A5069536911 @default.
- W4200454918 hasAuthorship W4200454918A5078530516 @default.
- W4200454918 hasBestOaLocation W42004549181 @default.
- W4200454918 hasConcept C105795698 @default.
- W4200454918 hasConcept C33923547 @default.
- W4200454918 hasConcept C41008148 @default.
- W4200454918 hasConcept C54355233 @default.
- W4200454918 hasConcept C70721500 @default.
- W4200454918 hasConcept C86803240 @default.
- W4200454918 hasConceptScore W4200454918C105795698 @default.
- W4200454918 hasConceptScore W4200454918C33923547 @default.
- W4200454918 hasConceptScore W4200454918C41008148 @default.
- W4200454918 hasConceptScore W4200454918C54355233 @default.
- W4200454918 hasConceptScore W4200454918C70721500 @default.
- W4200454918 hasConceptScore W4200454918C86803240 @default.
- W4200454918 hasIssue "1" @default.
- W4200454918 hasLocation W42004549181 @default.
- W4200454918 hasLocation W42004549182 @default.
- W4200454918 hasOpenAccess W4200454918 @default.
- W4200454918 hasPrimaryLocation W42004549181 @default.
- W4200454918 hasRelatedWork W1641042124 @default.
- W4200454918 hasRelatedWork W1990804418 @default.
- W4200454918 hasRelatedWork W1993764875 @default.
- W4200454918 hasRelatedWork W2013243191 @default.
- W4200454918 hasRelatedWork W2082860237 @default.
- W4200454918 hasRelatedWork W2117258802 @default.
- W4200454918 hasRelatedWork W2130076355 @default.
- W4200454918 hasRelatedWork W2151865869 @default.
- W4200454918 hasRelatedWork W2748952813 @default.
- W4200454918 hasRelatedWork W2899084033 @default.
- W4200454918 hasVolume "24" @default.
- W4200454918 isParatext "false" @default.
- W4200454918 isRetracted "false" @default.
- W4200454918 workType "article" @default.
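
The abstract above describes a null model in which a site's error rate varies across samples according to a β distribution and each sample is summarized only by x and n. A minimal sketch of how one sample might be scored under such a model follows. It assumes that x is the variant-supporting read count, that n is the read depth, and that the site's shape parameters (here called alpha and beta) have already been estimated. The function names, the example parameter values, and the 0.05 threshold are illustrative assumptions, not the authors' implementation, and the truncated-data estimation step is not shown.

```python
from math import exp, fsum, lgamma


def log_beta(a: float, b: float) -> float:
    """Natural log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def betabinom_logpmf(k: int, n: int, alpha: float, beta: float) -> float:
    """Log-probability of k error reads out of n under a beta-binomial null."""
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return log_choose + log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta)


def upper_tail(x: int, n: int, alpha: float, beta: float) -> float:
    """P(X >= x): chance of seeing at least x variant reads from error alone."""
    if x <= 0:
        return 1.0
    total = fsum(exp(betabinom_logpmf(k, n, alpha, beta)) for k in range(x, n + 1))
    return min(1.0, total)


def exceeds_background(x: int, n: int, alpha: float, beta: float,
                       level: float = 0.05) -> bool:
    """True if the count is unlikely under the null (level is an assumed cutoff)."""
    return upper_tail(x, n, alpha, beta) < level


if __name__ == "__main__":
    # Hypothetical site with mean error rate alpha / (alpha + beta) = 0.001.
    alpha, beta = 1.0, 999.0
    for x in (2, 40):
        print(x, upper_tail(x, 2000, alpha, beta),
              exceeds_background(x, 2000, alpha, beta))
```

The upper tail is summed directly rather than computed as one minus the lower tail, which avoids cancellation when the tail probability is very small, as it would be for a clear variant at high depth.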