Matches in SemOpenAlex for { <https://semopenalex.org/work/W3093350113> ?p ?o ?g. }
- W3093350113 endingPage "100127" @default.
- W3093350113 startingPage "100127" @default.
- W3093350113 abstract "• Msuite provides a unique 4-letter analysis mode for emerging bisulfite-free protocols • Msuite outperforms current tools in terms of higher accuracy and lower resource usage • Msuite has integrated quality control and fruitful data-visualization utilities • Msuite provides an all-in-one solution for DNA methylation data analysis. DNA methylation is an essential epigenetic modification responsible for many biological regulation pathways. Although various high-throughput methods have been developed for base-resolution DNA methylome profiling, DNA methylation data analysis remains a complex and challenging task. Here, we present Msuite, which has integrated quality control, read alignment, methylation call, and fruitful data-visualization functionalities, aiming to offer an all-in-one package for most of the current DNA methylation profiling assays. Msuite also provides dedicated support for emerging bisulfite-free protocols and outperforms the current tools in terms of higher accuracy and lower computational resource requirement. Hence, Msuite could serve as the optimal toolkit for DNA methylation data analysis as well as facilitating the popularization of emerging bisulfite-free protocols. DNA methylation is a pervasive and important epigenetic regulator in mammalian genomes. For DNA methylome profiling, emerging bisulfite-free methods have demonstrated desirable superiority over the conventional bisulfite-treatment-based approaches, although current analysis software cannot make full use of their advantages. In this work, we present Msuite, an easy-to-use, all-in-one data-analysis toolkit. Msuite implements a unique 4-letter analysis mode specifically optimized for emerging protocols; it also integrates quality controls, methylation call, and data visualizations. 
Msuite demonstrates substantial performance improvements over current state-of-the-art tools as well as fruitful functionalities, thus holding the potential to serve as an optimal toolkit to facilitate DNA methylome studies. Source codes and testing datasets for Msuite are freely available at https://github.com/hellosunking/Msuite/. 
DNA methylation is an important epigenetic regulator that plays crucial roles in a broad range of biological processes. In mammalian genomes, DNA methylation mostly involves the addition of a methyl group to cytosine nucleotides and is linked to gene repression [1: Smith Z.D., Meissner A. DNA methylation: roles in mammalian development. Nat. Rev. Genet. 2013; 14: 204-220] [2: Moore L.D., Le T., Fan G. DNA methylation and its basic function. Neuropsychopharmacology. 2013; 38: 23-38]. The cytosine methylation pattern has been found to be tissue specific and possesses high biological and translational value, for example in transcription regulation [1, 2] [3: Wang L., Zhao Y., Bao X., Zhu X., Kwok Y.K., Sun K., Chen X., Huang Y., Jauch R., Esteban M.A., et al. LncRNA Dum interacts with Dnmts to regulate Dppa2 expression during myogenic differentiation and muscle regeneration. Cell Res. 2015; 25: 335-350] [4: Li L., Zhang Y., Fan Y., Sun K., Su X., Du Z., Tsao S.W., Loh T.K., Sun H., Chan A.T., et al. Characterization of the nasopharyngeal carcinoma methylome identifies aberrant disruption of key signaling pathways and methylated tumor suppressor genes. Epigenomics. 2015; 7: 155-173] [5: Zhao Y., Yang Y., Trovik J., Sun K., Zhou L., Jiang P., Lau T.S., Hoivik E.A., Salvesen H.B., Sun H., et al. A novel wnt regulatory axis in endometrioid endometrial cancer. Cancer Res. 2014; 74: 5103-5117] and cancer liquid biopsy studies [6: Chan K.C.A., Jiang P., Chan C.W., Sun K., Wong J., Hui E.P., Chan S.L., Chan W.C., Hui D.S., Ng S.S., et al. Noninvasive detection of cancer-associated genome-wide hypomethylation and copy number aberrations by plasma DNA bisulfite sequencing. Proc. Natl. Acad. Sci. U S A. 2013; 110: 18761-18768] [7: Sun K., Jiang P., Chan K.C.A., Wong J., Cheng Y.K., Liang R.H., Chan W.K., Ma E.S., Chan S.L., Cheng S.H., et al. Plasma DNA tissue mapping by genome-wide methylation sequencing for noninvasive prenatal, cancer, and transplantation assessments. Proc. Natl. Acad. Sci. U S A. 2015; 112: E5503-E5512] [8: Xu R.H., Wei W., Krawczyk M., Wang W., Luo H., Flagg K., Yi S., Shi W., Quan Q., Li K., et al. Circulating tumour DNA methylation markers for diagnosis and prognosis of hepatocellular carcinoma. Nat. Mater. 2017; 16: 1155-1161] [9: Lam W.K.J., Gai W., Sun K., Wong R.S.M., Chan R.W.Y., Jiang P., Chan N.P.H., Hui W.W.I., Chan A.W.H., Szeto C.C., et al. DNA of erythroid origin is present in human plasma and informs the types of anemia. Clin. Chem. 2017; 63: 1614-1623] [10: Gai W., Ji L., Lam W.K.J., Sun K., Jiang P., Chan A.W.H., Wong J., Lai P.B.S., Ng S.S.M., Ma B.B.Y., et al. Liver- and colon-specific DNA methylation markers in plasma for investigation of colorectal cancers with or without liver metastases. Clin. Chem. 2018; 64: 1239-1249] [11: Gai W., Sun K. Epigenetic biomarkers in cell-free DNA and applications in liquid biopsy. Genes (Basel). 2019; 10: 32] [12: Sun K., Jiang P., Cheng S.H., Cheng T.H.T., Wong J., Wong V.W.S., Ng S.S.M., Ma B.B.Y., Leung T.Y., Chan S.L., et al. Orientation-aware plasma cell-free DNA fragmentation analysis in open chromatin regions informs tissue of origin. Genome Res. 2019; 29: 418-427]. As a consequence, DNA methylation is actively and widely investigated in various research fields. 
Multiple biochemical assays have been developed for high-resolution DNA methylome profiling in the past years [13: Raiber E.-A., Hardisty R., van Delft P., Balasubramanian S. Mapping and elucidating the function of modified bases in DNA. Nat. Rev. Chem. 2017; 1: 0069]. To differentiate methylated cytosines from unmethylated ones, conventional approaches (e.g., whole-genome bisulfite sequencing, WGBS [13] [14: Lister R., Pelizzola M., Dowen R.H., Hawkins R.D., Hon G., Tonti-Filippini J., Nery J.R., Lee L., Ye Z., Ngo Q.M., et al. Human DNA methylomes at base resolution show widespread epigenomic differences. Nature. 2009; 462: 315-322]) apply a bisulfite treatment procedure to DNA molecules, which converts all unmethylated cytosines into uracil while leaving methylated cytosines unchanged. During the subsequent PCR amplifications, uracils are recognized as thymines, resulting in a cytosine-to-thymine transition relative to the original DNA. Emerging bisulfite-free techniques (e.g., TET-assisted pyridine borane sequencing, TAPS [15: Liu Y., Siejka-Zielinska P., Velikova G., Bi Y., Yuan F., Tomkova M., Bai C., Chen L., Schuster-Bockler B., Song C.X. Bisulfite-free direct detection of 5-methylcytosine and 5-hydroxymethylcytosine at base resolution. Nat. Biotechnol. 2019; 37: 424-429] [16: Zeng H., He B., Xia B., Bai D., Lu X., Cai J., Chen L., Zhou A., Zhu C., Meng H., et al. Bisulfite-free, nanoscale analysis of 5-hydroxymethylcytosine at single base resolution. J. Am. Chem. Soc. 2018; 140: 13190-13194] [17: Schutsky E.K., DeNizio J.E., Hu P., Liu M.Y., Nabel C.S., Fabyanic E.B., Hwang Y., Bushman F.D., Wu H., Kohli R.M. Nondestructive, base-resolution sequencing of 5-hydroxymethylcytosine using a DNA deaminase. Nat. Biotechnol. 2018; https://doi.org/10.1038/nbt.4204]), however, introduce opposite modifications to DNA molecules whereby only the methylated cytosines are converted into thymines while the unmethylated cytosines are left untouched (Figure 1A). In mammalian genomes, methylated cytosines mostly appear in CpG dinucleotides, which account for a very limited proportion (e.g., ~5% in human) of all cytosines. As a result, DNA libraries generated by bisulfite-free approaches show a significantly higher nucleotide complexity (Figure 1A) because only a small proportion of the cytosines are converted after chemical treatment, which results in lower GC bias and more even coverage of the genome in the sequencing data [15] [18: Olova N., Krueger F., Andrews S., Oxley D., Berrens R.V., Branco M.R., Reik W. Comparison of whole-genome bisulfite sequencing library preparation strategies identifies sources of biases affecting DNA methylation data. Genome Biol. 2018; 19: 33] [19: Ross M.G., Russ C., Costello M., Hollinger A., Lennon N.J., Hegarty R., Nusbaum C., Jaffe D.B. Characterizing and measuring bias in sequence data. Genome Biol. 2013; 14: R51]. However, most of the current mainstream data-analysis tools only support 3-letter read alignment (i.e., they convert all the cytosines in both the reference genome and sequencing reads into thymines) [13] [20: Pedersen B.S., Eyring K., De S., Yang I.V., Schwartz D.A. Fast and accurate alignment of long bisulfite-seq reads. arXiv. 2014; (1401.1129)] [21: Jiang P., Sun K., Lun F.M.F., Guo A.M., Wang H., Chan K.C.A., Chiu R.W.K., Lo Y.M.D., Sun H. Methy-pipe: an integrated bioinformatics pipeline for whole genome bisulfite sequencing data analysis. PLoS One. 2014; 9: e100360] [22: Krueger F., Kreck B., Franke A., Andrews S.R. DNA methylome analysis using short bisulfite sequencing data. Nat. Methods. 2012; 9: 145-151]; the fundamental change introduced by the emerging assays thus renders the current tools outdated due to their inability to make full use of such an advantage. 
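
The 3-letter strategy described above can be illustrated with a minimal Python sketch. It is not Msuite's code: the function name `to_three_letter` and the toy sequences are hypothetical, and only the forward-strand C-to-T case is shown.

```python
def to_three_letter(seq: str) -> str:
    """Collapse every cytosine to thymine (in silico C-to-T conversion)."""
    return seq.upper().replace("C", "T")


# Hypothetical toy sequences, not taken from the paper.
reference = "ACGTCCGA"
bisulfite_read = "ATGTTCGA"  # same locus; two cytosines already read as T

# Both collapse to the same 3-letter string, so the read can be placed
# regardless of which cytosines were converted.
print(to_three_letter(reference))       # ATGTTTGA
print(to_three_letter(bisulfite_read))  # ATGTTTGA

# The cost: sequences that differ only at C/T positions become
# indistinguishable, which lowers the effective sequence complexity.
print(to_three_letter("ACCT") == to_three_letter("ATTT"))  # True
```

The last line shows the information lost by the conversion, which is the loss a 4-letter alignment is meant to avoid when only a small fraction of cytosines is actually converted.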
Other analysis tools utilize a wild-card mapping strategy; however, they usually suffer from low speed and unsatisfactory mapping efficiency [22] [23: Sun X., Han Y., Zhou L., Chen E., Lu B., Liu Y., Pan X., Cowley Jr., A.W., Liang M., Wu Q., et al. A comprehensive evaluation of alignment software for reduced representation bisulfite sequencing data. Bioinformatics. 2018; 34: 2715-2723] or are also optimized for 3-letter alignment [24: Chen H., Smith A.D., Chen T. WALT: fast and accurate read mapping for bisulfite sequencing. Bioinformatics. 2016; 32: 3507-3509]. In addition, most of the current software focuses on sequencing read alignment and requires the users to perform quality control, downstream analysis (e.g., methylation call), and data visualization themselves. Hence, a user-friendly, multi-functional toolkit with better support for current bisulfite-free assays is urgently needed. In this study, we present Msuite, a package that supports data analysis of all the current mainstream DNA methylome assays. As a versatile toolkit, Msuite provides various practical functions, including quality control, a novel 4-letter sequencing read alignment algorithm specifically optimized for bisulfite-free protocols, methylation call, as well as plentiful data-visualization utilities. Msuite also outperforms current state-of-the-art software in terms of higher accuracy, faster speed, and lower computational resource usage. Msuite thus provides an easy-to-use, all-in-one solution for DNA methylation data analysis. Msuite is freely available at https://github.com/hellosunking/Msuite/. 
Figure 1 shows the schematic workflow of current DNA methylation profiling assays and the Msuite data-analysis toolkit. 
Typically, the genomic DNA of interest is first fragmented into small pieces several hundred base pairs long (sometimes the DNA molecules are inherently fragmented, such as plasma cell-free DNA [25: Sun K., Jiang P., Wong A.I.C., Cheng Y.K.Y., Cheng S.H., Zhang H., Chan K.C.A., Leung T.Y., Chiu R.W.K., Lo Y.M.D. Size-tagged preferred ends in maternal plasma DNA shed light on the production mechanism and show utility in noninvasive prenatal testing. Proc. Natl. Acad. Sci. U S A. 2018; 115: E5106-E5114]); then the short DNA molecules are treated by various chemistries and several rounds of PCR cycles to differentiate methylated and unmethylated cytosines. The biochemically treated DNA molecules are then subjected to library preparation and sequencing. Raw sequencing data directly serve as the input for Msuite. Msuite adapts our previous sequencing data preprocessing tool, Ktrim [26: Sun K. Ktrim: an extra-fast and accurate adapter- and quality-trimmer for sequencing data. Bioinformatics. 2020; 36: 3561-3562], to perform extra-fast, accurate adapter-/quality-trimming and in silico cytosine-to-thymine conversion of the sequencing reads. Notably, Msuite supports sequencing data generated from various library preparation kits and is able to directly handle conventional WGBS and emerging sequencing protocols such as TAPS [15], 5hmC-CATCH [16], and ACE-seq [17], as well as ATAC-me [27: Barnett K.R., Decato B.E., Scott T.J., Hansen T.J., Chen B., Attalla J., Smith A.D., Hodges E. ATAC-Me captures prolonged DNA methylation of dynamic chromatin accessibility loci during cell fate transitions. Mol. Cell. 2020; 77: 1350-1364.e6], methyl-ATAC-seq [28: Spektor R., Tippens N.D., Mimoso C.A., Soloway P.D. methyl-ATAC-seq measures DNA methylation at accessible chromatin. Genome Res. 2019; 29: 969-977], and EpiMethylTag [29: Lhoumaud P., Sethia G., Izzo F., Sakellaropoulos T., Snetkova V., Vidal S., Badri S., Cornwell M., Di Giammartino D.C., Kim K.T., et al. EpiMethylTag: simultaneous detection of ATAC-seq or ChIP-seq signals with DNA methylation. Genome Biol. 2019; 20: 248] (integrative methods that measure DNA methylation at regulatory elements, such as accessible chromatin or transcription factor binding domains). Msuite then aligns the pre-processed reads to the reference genome in either 3- or 4-letter mode based on the assay and users' settings (see Experimental Procedures). Notably, the 4-letter mode is specifically designed for TAPS-like protocols that are optimized for detecting CpG methylations, while 3-letter mode is generic and works for most kinds of current DNA methylation assays as well as scenarios with gross non-CpG methylation. The sequencing reads are aligned to Watson and Crick strands of the reference genome separately, since cytosine-to-thymine conversion has disrupted their reverse-complementary relationship. 
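
The last point, that conversion breaks the reverse-complementary relationship between the two strands, can be checked on a toy example. The sketch below uses hypothetical helper names (`reverse_complement`, `c_to_t`) and a made-up six-base sequence, and for simplicity assumes every cytosine is converted.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def c_to_t(seq: str) -> str:
    """Simplified conversion: every cytosine on this strand becomes thymine."""
    return seq.replace("C", "T")


watson = "ACGGTC"                   # hypothetical toy sequence
crick = reverse_complement(watson)  # GACCGT

# Before conversion the two strands are reverse complements of each other.
print(reverse_complement(crick) == watson)  # True

# After converting each strand independently, they no longer are.
converted_watson = c_to_t(watson)   # ATGGTT
converted_crick = c_to_t(crick)     # GATTGT
print(reverse_complement(converted_watson) == converted_crick)  # False
```

Because the converted strands no longer mirror each other, a read derived from the Crick strand cannot be found by searching the converted Watson strand alone, which is consistent with aligning to the two strands separately.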
After this initial alignment, Msuite recognizes and handles the ambiguously aligned reads as well as PCR duplicates to generate the final alignment result, based on which Msuite further performs a methylation call (i.e., it reports methylation status of all cytosines in both CpG and non-CpG contexts) and data visualizations. 
To demonstrate the usability of Msuite, we compared the most valuable features for DNA methylation data analysis between Msuite and current state-of-the-art software: Bismark [30: Krueger F., Andrews S.R. Bismark: a flexible aligner and methylation caller for Bisulfite-Seq applications. Bioinformatics. 2011; 27: 1571-1572], BWA-meth [20], and our previously developed Methy-Pipe [21] (Table 1). Msuite employs bowtie2 [31: Langmead B., Salzberg S.L. Fast gapped-read alignment with Bowtie 2. Nat. Methods. 2012; 9: 357-359] as the underlying aligner, which is the same as Bismark (which also supports Hisat2 [32: Kim D., Langmead B., Salzberg S.L. HISAT: a fast spliced aligner with low memory requirements. Nat. Methods. 2015; 12: 357-360]) but different from BWA-meth and Methy-Pipe (which use BWA [33: Li H., Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009; 25: 1754-1760] and SOAP2 [34: Li R., Yu C., Li Y., Lam T.W., Yiu S.M., Kristiansen K., Wang J. SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics. 2009; 25: 1966-1967], respectively). As a result, Msuite, Bismark, and BWA-meth output the alignment results in standardized SAM/BAM [35: Li H., Handsaker B., Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R., Genome Project Data Processing Subgroup. The sequence alignment/map format and SAMtools. Bioinformatics. 2009; 25: 2078-2079] format while Methy-Pipe records the data in an alternative format similar to SOAP2. Moreover, Msuite tolerates insertions and deletions in the sequencing data, a feature also supported by Bismark and BWA-meth but not by Methy-Pipe. Msuite supports both 4- and 3-letter alignment while the others only provide 3-letter alignment. In addition, both Msuite and Methy-Pipe automatically perform DNA methylation calls in the data. In contrast, Bismark provides a script but requires the users to run it manually; BWA-meth only performs read alignment without any downstream analysis support. Lastly, Msuite has integrated built-in quality controls, including adaptor-/quality-trimming and removal of PCR duplicates, as well as various data-visualization functions. Msuite thus provides an easy-to-use, all-in-one solution for DNA methylation data analysis and can be readily integrated with other software for comprehensive data mining. 
Table 1. Comparison of Major Features between Msuite and Current Software
| Feature | Msuite | Bismark | BWA-meth | Methy-Pipe |
| --- | --- | --- | --- | --- |
| Underlying aligner | bowtie2 | bowtie2/Hisat2 | BWA | SOAP2 |
| Output format | SAM/BAM | BAM | SAM | SOAP-like |
| Align mode | 3-/4-letter | 3-letter only | 3-letter only | 3-letter only |
| Indel support | yes | yes | yes | no |
| Quality control | yes | no | no | yes |
| Methylation call | yes | manually | no | yes |
| Data visualization | yes | no | no | yes |
| Multiple-file support | yes | yes | yes | yes |
| Sequencing mode | paired-/single-end | paired-/single-end | paired-/single-end | paired-/single-end |
| Parallelization | yes | yes | yes | yes |
Sequencing read alignment is the most challenging and imperative step in DNA methylation data analysis. 
We benchmarked and compared the performance of read alignment algorithms between Msuite and current software. For a fair comparison, Methy-Pipe is excluded from this analysis because both Methy-Pipe and its underlying aligner have not been updated for more than 5 years. The latest versions of Bismark (v0.22.3) and BWA-meth (v0.2.2) as well as their underlying aligners (bowtie2 v2.3.5.1 and BWA v0.7.17) were obtained as described in the literature [20, 30, 31, 33] and installed on a computing server equipped with an Intel Xeon CPU, 192 GB memory, and a standard CentOS 64-bit Linux system. A total of 900 in silico experiments following the BS-seq or TAPS protocol were performed (see Experimental Procedures). The averaged alignment statistics, running time, and peak memory usage on 1 million in silico simulated paired-end reads following the TAPS protocol are shown in Table 2, and the results for paired-end reads following the BS-seq protocol as well as single-end data are included in Table S1 (notably, BWA-meth fails in processing single-end 36-bp reads). We measured mapping efficiency as the proportion of reads that could be mapped by the aligner, and accuracy as the proportion of correct alignments (i.e., the aligned loci are exactly the same as those in the simulation) among the mapped reads. 
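
The two metrics defined above can be written out explicitly. The following Python sketch is only an illustration of the definitions; the function name `alignment_metrics`, the dictionary layout, and the four toy reads are assumptions, not part of Msuite or the benchmark scripts.

```python
def alignment_metrics(truth: dict, aligned: dict) -> tuple:
    """Compute (mapping efficiency, accuracy).

    truth   -- read ID -> simulated locus, for every simulated read
    aligned -- read ID -> reported locus, only for reads the aligner mapped
    """
    mapped = len(aligned)
    # An alignment is correct only if it matches the simulated locus exactly.
    correct = sum(1 for rid, locus in aligned.items() if truth.get(rid) == locus)
    efficiency = mapped / len(truth)                 # mapped / all reads
    accuracy = correct / mapped if mapped else 0.0   # correct / mapped reads
    return efficiency, accuracy


# Hypothetical toy data: four simulated reads, three mapped, two correct.
truth = {"r1": ("chr1", 100), "r2": ("chr1", 250),
         "r3": ("chr2", 40), "r4": ("chr3", 7)}
aligned = {"r1": ("chr1", 100), "r2": ("chr1", 250), "r3": ("chr5", 900)}

efficiency, accuracy = alignment_metrics(truth, aligned)
print(f"mapping efficiency = {efficiency:.2%}, accuracy = {accuracy:.2%}")
```

Note that accuracy is computed over mapped reads only, so an aligner can trade one metric against the other by discarding uncertain alignments; this is why the two values are best read together.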
In brief, both mapping efficiency and accuracy are high and comparable among the software benchmarked, while Msuite in 4-letter mode achieves the highest mapping efficiency. Notably, even though Msuite performs an additional adapter- and quality-trimming step before alignment (whose time is counted in Table 2), it uses much less memory than Bismark and BWA-meth and, in 4-letter mode, also runs faster than both. 
Table 2. Benchmark Evaluation of Msuite and Current Software
| Metric | Msuite (4-letter mode) | Msuite (3-letter mode) | Bismark | BWA-meth (a) |
| --- | --- | --- | --- | --- |
| 1 million paired-end 100-bp reads | | | | |
| Mapping efficiency (%) | 96.19 | 95.75 | 95.79 | 95.58 |
| Accuracy (%) | 99.94 | 99.95 | 99.93 | 99.96 |
| Running time (s) | 119.45 | 195.80 | 236.37 | 155.14 |
| Peak memory (GB) | 3.74 | 3.74 | 40.21 | 12.52 |
| 1 million paired-end 36-bp reads | | | | |
| Mapping efficiency (%) | 92.46 | 91.62 | 91.71 | 91.46 |
| Accuracy (%) | 99.79 | 99.63 | 99.65 | 99.67 |
| Running time (s) | 126.60 | 173.84 | 164.80 | 283.55 |
| Peak memory (GB) | 3.77 | 3.78 | 40.16 | 22.52 |
| 1 million paired-end 36-bp reads originating from CT/GA ≥ 70% regions | | | | |
| Mapping efficiency (%) | 79.67 | 67.14 | 68.39 | 76.31 |
| Accuracy (%) | 99.78 | 99.03 | 98.94 | 92.06 |
| Running time (s) | 120.30 | 307.00 | 294.80 | 360.80 |
| 1 million paired-end 36-bp reads originating from CT/GA ≥ 80% regions | | | | |
| Mapping efficiency (%) | 73.53 | 53.14 | 54.60 | 65.85 |
| Accuracy (%) | 99.81 | 98.38 | 98.19 | 86.91 |
| Running time (s) | 112.60 | 375.00 | 366.00 | 469.70 |
Eight threads were used for benchmark testing, and the data were simulated following the TAPS protocol. (a) For BWA-meth, alignments with a score of 0 were discarded due to abnormally high error rate. 
To further explore the advantage of Msuite's unique 4-letter alignment mode, we generated in silico data originating from CT- or GA-rich regions in the human genome (see Experimental Procedures). 
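
How a region might be scored as CT- or GA-rich can be sketched as follows. The exact definition is given in the Experimental Procedures, which are not reproduced here, so this is an assumption: the hypothetical function `ct_ga_proportion` takes the larger of the C+T and G+A fractions among unambiguous bases.

```python
def ct_ga_proportion(seq: str) -> float:
    """Larger of the (C+T) and (G+A) fractions among A/C/G/T bases.

    Assumed definition for illustration only; ambiguous bases such as N
    are ignored.
    """
    seq = seq.upper()
    valid = sum(seq.count(base) for base in "ACGT")
    if valid == 0:
        return 0.0
    ct = seq.count("C") + seq.count("T")
    ga = valid - ct
    return max(ct, ga) / valid


# Hypothetical toy sequences.
print(ct_ga_proportion("CTCTCTCTAG"))  # 0.8 -> CT-rich at an 80% cutoff
print(ct_ga_proportion("ACGT"))        # 0.5 -> balanced composition
```

Under this assumed definition the score can never fall below 0.5, so thresholds such as 70% or 80% select regions with a strongly skewed composition. In such regions a 3-letter alphabet leaves little sequence information, which matches the drop in 3-letter mapping efficiency reported in Table 2.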
For reads generated from regions with CT/GA proportion higher than 70%, 4-letter alignment mode shows clearly better mapping efficiency and accuracy; for regions where CT/GA proportion is higher than 80%, the advantage of 4-letter alignment mode becomes highly remarkable in terms of superior mapping efficiency, higher accuracy, and less alignment time (Table 2). In fact, CT/GA-rich regions are ~9.2 Mbp long in total, which accounts for ~0.31% of the human genome; however, ~4.5% of them are located in promoter regions, a proportion much higher than the genomic background (~2.9%); i.e., these regions are enriched in regulatory elements and thus possess biological relevance. Hence, prominently improved performance in aligning reads in CT/GA-rich regions justifies the merit of 4-letter mode in DNA methylation data analysis. We further profiled the accuracy of the methylation call function of Msuite on the benchmark dataset. The overall methylation densities deduced by Msuite are in close approximation to the preset methylation densities during simulation; the differences are on a similar level to the preset sequencing error rate and show no relationship with the preset methylation densities in the data (Figure S1). 
To illustrate the usage of Msuite, we applied it to a real dataset generated from both WGBS and TAPS protocols on murine embryonic stem cells (ESCs) by Liu et al. [15] The WGBS data were analyzed using the 3-letter mode, while the TAPS data were analyzed using both 3- and 4-letter modes against the mouse reference genome (NCBI assembly GRCm38). The final analysis report on the TAPS data using 4-letter mode is shown in Figure 2, and the reports for the other two analyses are provided in Figure S2. 
Various figures are provided to help the users to assess the quality of their data, including base composition, M-bias (average methylation level for each cycle) plots [36: Hansen K.D., Langmead B., Irizarry R.A. BSmooth: from whole genome bisulfite sequencing reads to differentially methylated regions. Genome Biol. 2012; 13: R83], and DNA methylation signals around promoters. In this dataset, the base composition plots show high cytosine proportion in read 1 as well as high guanine proportion in read 2, which directly reflects the improved sequence complexity of the TAPS protocol. In addition, the methylation level shows a decreased signal around promoters, which is consistent with the knowledge that most promoters are hypomethylated for active transcription, thus providing a preliminary assessment for the users to inspect the validity of their data. Notably, the 4-letter mode took ~290 min to complete read alignment of the TAPS data using 32 threads, while the 3-letter mode took ~440 min; therefore, the 4-letter mode was ~50% faster. In addition, despite the imperfect cytosine-to-thymine conversion rate in this TAPS experiment [15] and gross non-CpG methylation in ESCs [14, 15], which means a high proportion of cytosines in CpH context are converted into thymines and thus affect the performance of 4-letter mode, we found that the 4-letter mode only shows a 2.17% deficit in final mapped reads compared with the 3-letter mode. Moreover, the number of final mapped reads reported by Msuite is comparable with the original report by Liu et al. using a different alignment strategy; however, we find that Msuite provides even more read coverage on CpG islands, especially for the highly methylated ones (Figure S3; see Discussion) [15]. On the other hand, the deduced DNA methylation densities on CpG loci between 4- and 3-letter modes are largely the same (Pearson's R > 0.99, p < 2.2 × 10^-16), with only 3.78% showing methylation differences higher than 10% (such CpG sites suffer from much lower coverage and are enriched in repeat regions; Figure S4). Moreover, the methylation densities deduced from TAPS data are also in good agreement with the WGBS data (Figure S4), which is consistent with the original report by Liu et al. [15] 
Besides the analysis report that allows the users to conveniently inspect the key statistics as well as quality assessments of their data (Figure 2), Msuite has also packaged various data-visualization utilities. 
For instance, during methylation call, Msuite records the DNA methylation densities for each CpG site in a BEDGRAPH (http://genome.ucsc.edu/goldenPath/help/bedgraph.html) format file, which can be readily visualized in the UCSC genome browser37Kent W.J. Sugnet C.W. Furey T.S. Roskin K.M. Pringle T.H. Zahler A.M. Haussler D. The human genome browser at UCSC.Genome Res. 2002; 12: 996-1006Crossref PubMed Scopus (6442) Google Scholar or Integrative Genomics Viewer (IGV).38Thorvaldsdottir H. Robinson J.T. Mesirov J.P. Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration.Brief. Bioinform. 2013; 14: 178-192Crossref PubMed Scopus (4562) Google Scholar As illustrated in Figure 3A, a low methylation level and open chromatin signal are found on Pou5f1 gene (also known as Oct4, a transcription factor expressed in ESCs but not in somatic tissues such as liver39Sun K. Wang H. Sun H. mTFkb: a knowledgebase for fundamental annotation of mouse transcription factors.Sci. Rep. 2017; 7: 3022Crossref PubMed Scopus (10) Google Scholar) in murine ESCs40Altun G. Loring J.F. Laurent L.C. DNA methylation in embryonic stem cells.J. Cell. Biochem. 2010; 109: 1-6PubMed Google Scholar in contrast to liver tissue. Msuite also provides utilities to summarize the DNA methylation densities for easy incorporation with other data-visualization software, such as Circos.41Krzywinski M. Schein J. Birol I. Connors J. Gascoyne R. Horsman D. Jones S.J. Marra M.A. Circos: an information aesthetic for comparative genomics.Genome Res. 2009; 19: 1639-1645Crossref PubMed Scopus (5966) Google Scholar An example is shown in Figure 3B, where murine placental tissue presents conspicuous global hypomethylation compared with ESCs and liver tissue. In addition, Msuite contains a dedicated tool, Mviewer, adapted from our previous BSviewer42Sun K. Lun F.F.M. Jiang P. Sun H. 
BSviewer: a genotype-preserving, nucleotide-level visualizer for bisulfite sequencing data. Bioinformatics. 2017; 33: 3495-3496 software, to provide fast, nucleotide-level, and genotype-preserved DNA methylation sequencing data visualization. An example of Mviewer's output on a WGBS dataset generated from human placental tissue42Sun K. Lun F.F.M. Jiang P. Sun H. BSviewer: a genotype-preserving, nucleotide-level visualizer for bisulfite sequencing data. Bioinformatics. 2017; 33: 3495-3496, 43Lun F.M.F. Chiu R.W.K. Sun K. Leung T.Y. Jiang P. Chan K.C.A. Sun H. Lo Y.M.D. Noninvasive prenatal methylomic analysis by genomewide bisulfite sequencing of maternal plasma DNA. Clin. Chem. 2013; 59: 1583-1594, 44Sun K. Lun F.M.F. Leung T.Y. Chiu R.W.K. Lo Y.M.D. Sun H. Noninvasive reconstruction of placental methylome from maternal plasma DNA: potential for prenatal testing and monitoring. Prenat. Diagn. 2018; 38: 196-203 is shown in Figure 3C, which highlights the genotype information (i.e., an A/G heterozygous locus at chr7:130493085, which is recorded as rs2301335 in dbSNP45Sherry S.T. Ward M.H. Kholodov M. Baker J. Phan L. Smigielski E.M. Sirotkin K. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 2001; 29: 308-311) as well as the allele-specific methylation pattern of the imprinted MEST gene in human placenta. The development of a WGBS protocol as well as the first base-resolution human methylome was accomplished over a decade ago.14Lister R. Pelizzola M. Dowen R.H. Hawkins R.D. Hon G. Tonti-Filippini J. Nery J.R. Lee L. Ye Z. Ngo Q.M. et al. Human DNA methylomes at base resolution show widespread epigenomic differences. Nature. 
2009; 462: 315-322. However, due to the non-unified, context-dependent cytosine-to-thymine conversions introduced in DNA methylation profiling assays, the bioinformatics analysis is still challenging and complex.13Raiber E.-A. Hardisty R. van Delft P. Balasubramanian S. Mapping and elucidating the function of modified bases in DNA. Nat. Rev. Chem. 2017; 1: 0069, 22Krueger F. Kreck B. Franke A. Andrews S.R. DNA methylome analysis using short bisulfite sequencing data. Nat. Methods. 2012; 9: 145-151. Substantial advances have been made in emerging approaches such as the TAPS protocol, which have demonstrated desirable benefits compared with conventional bisulfite treatment protocols, including higher sequence complexity and lower DNA degradation.15Liu Y. Siejka-Zielinska P. Velikova G. Bi Y. Yuan F. Tomkova M. Bai C. Chen L. Schuster-Bockler B. Song C.X. Bisulfite-free direct detection of 5-methylcytosine and 5-hydroxymethylcytosine at base resolution. Nat. Biotechnol. 2019; 37: 424-429, 46Tanaka K. Okamoto A. Degradation of DNA by bisulfite treatment. Bioorg. Med. Chem. Lett. 2007; 17: 1912-1915. It is therefore of value to design and implement dedicated analysis tools to meet the requirements raised by these novel protocols as well as facilitate their applications and advances. To this end, we present Msuite, a modern data-analysis toolkit that supports almost all of the current mainstream DNA methylation profiling assays. In fact, we have analyzed sequencing data generated from conventional WGBS, emerging TAPS, and EpiMethylTag assays in this work (Figure 3). Moreover, Msuite outperforms current state-of-the-art software in terms of better mapping efficiency, higher accuracy, faster speed, and lower computing resource requirements. 
Even though Msuite and Bismark both utilize bowtie2 as the underlying aligner, Msuite shows higher speed and lower memory usage. This is mostly because Msuite only calls one bowtie2 instance and runs it in multi-thread mode, while Bismark calls multiple bowtie2 instances and runs them in single-thread mode, i.e., Bismark loads multiple copies of the genome indices and therefore requires more time and memory. The unique 4-letter analysis mode of Msuite is designed for emerging bisulfite-free assays, such as TAPS and 5hmC-CATCH, and indeed demonstrates improved performance on an in silico simulated dataset, especially in the CT/GA-rich regions. The 3-letter mode, on the other hand, is also essential as it is more generic and can handle datasets generated using conventional bisulfite treatment approaches or from species/tissues with gross non-CpG methylations (e.g., plants, the brain). On the real dataset generated using the TAPS protocol, despite imperfect chemical treatment, the 4-letter mode still shows high-quality results and completes in much less time. In addition, on this real dataset, Msuite also shows certain advantages over the original method used by Liu et al., which directly aligns the reads to the genome without any modifications (i.e., they treat the data as normal DNA sequencing during alignment).15Liu Y. Siejka-Zielinska P. Velikova G. Bi Y. Yuan F. Tomkova M. Bai C. Chen L. Schuster-Bockler B. Song C.X. Bisulfite-free direct detection of 5-methylcytosine and 5-hydroxymethylcytosine at base resolution. Nat. Biotechnol. 
2019; 37: 424-429. As shown in Figure S3, Msuite shows much better coverage for reads originating from hypermethylated CpG islands (e.g., suppressed regulatory elements in the specific cell type); such reads usually contain various methylated cytosines that are converted into thymines in the TAPS assay, and therefore contain too many “mismatches” compared with the reference genome to be aligned efficiently; by contrast, such reads do not affect Msuite because both cytosines and converted thymines in CpG sites are accepted as “matches” after the cytosine-to-thymine conversion of the reference genome and sequencing reads. Together, these results demonstrate the advantage and rationale of Msuite's 4-letter analysis mode for better support of the emerging assays. Interestingly, 4- and 3-letter modes generate rather consistent results, although the deduced methylation levels show a large difference for a small proportion of CpG sites, which are enriched in repeat regions and indeed show much lower coverage (Figure S4), suggesting that they are located in genomic regions that are difficult to align. Although the 4-letter analysis mode has demonstrated higher mapping accuracy on the benchmark datasets, we do not have strong evidence that the 4-letter mode is more accurate on those inconsistent CpG sites; therefore, it is meaningful to further explore and/or validate the accuracy of 4- and 3-letter analysis modes, as well as the limitation of current sequencing-based protocols, using additional methods (e.g., microarrays) on such loci. In addition, Msuite integrates quality control, sequencing read alignment, and downstream analyses (e.g., methylation call) into one pipeline, thus providing an easy-to-use, all-in-one solution for DNA methylation data analysis. Msuite provides multiple data-visualization functions, which could further help users to inspect and interpret their data. 
For instance, its accompanying tool, Mviewer, provides favorable characteristics that can be specifically meaningful in scenarios with allele-specific DNA methylation, such as imprinted genes (Figure 3) and tissue-specific signatures.42Sun K. Lun F.F.M. Jiang P. Sun H. BSviewer: a genotype-preserving, nucleotide-level visualizer for bisulfite sequencing data. Bioinformatics. 2017; 33: 3495-3496, 47Chan K.C.A. Jiang P. Sun K. Cheng Y.K. Tong Y.K. Cheng S.H. Wong A.I. Hudecova I. Leung T.Y. Chiu R.W.K. et al. Second generation noninvasive fetal genome analysis reveals de novo mutations, single-base parental inheritance, and preferred DNA ends. Proc. Natl. Acad. Sci. U S A. 2016; 113: E8159-E8168. Msuite also provides all the features required for a modern data analyzer, including multi-file support, outputs in standardized format, and parallelization (Table 1). Hence, Msuite holds the full potential to serve as an optimal data-analysis toolkit to facilitate DNA methylation studies. In conclusion, we have designed and implemented Msuite, a versatile and high-performance DNA methylation data-analysis toolkit, with dedicated support for emerging bisulfite-free assays and enhanced performance over the state-of-the-art tools, providing an easy-to-use and all-in-one solution for analysis of DNA methylation data." @default.
- W3093350113 created "2020-10-22" @default.
- W3093350113 creator A5001836351 @default.
- W3093350113 creator A5020847968 @default.
- W3093350113 creator A5027271553 @default.
- W3093350113 creator A5056069573 @default.
- W3093350113 creator A5072382248 @default.
- W3093350113 creator A5079836538 @default.
- W3093350113 creator A5082217568 @default.
- W3093350113 date "2020-11-01" @default.
- W3093350113 modified "2023-10-15" @default.
- W3093350113 title "Msuite: A High-Performance and Versatile DNA Methylation Data-Analysis Toolkit" @default.
- W3093350113 cites W1971584645 @default.
- W3093350113 cites W1987557775 @default.
- W3093350113 cites W2001964616 @default.
- W3093350113 cites W2038523604 @default.
- W3093350113 cites W2045938670 @default.
- W3093350113 cites W2059099579 @default.
- W3093350113 cites W2065024667 @default.
- W3093350113 cites W2069796345 @default.
- W3093350113 cites W2089470652 @default.
- W3093350113 cites W2092911110 @default.
- W3093350113 cites W2097341408 @default.
- W3093350113 cites W2102278945 @default.
- W3093350113 cites W2103081012 @default.
- W3093350113 cites W2103441770 @default.
- W3093350113 cites W2108234281 @default.
- W3093350113 cites W2117131162 @default.
- W3093350113 cites W2122732537 @default.
- W3093350113 cites W2128016314 @default.
- W3093350113 cites W2131374955 @default.
- W3093350113 cites W2150936392 @default.
- W3093350113 cites W2152656267 @default.
- W3093350113 cites W2158804744 @default.
- W3093350113 cites W2158868251 @default.
- W3093350113 cites W2169673165 @default.
- W3093350113 cites W2170551349 @default.
- W3093350113 cites W2173732482 @default.
- W3093350113 cites W2487424965 @default.
- W3093350113 cites W2541006501 @default.
- W3093350113 cites W2620634673 @default.
- W3093350113 cites W2743402317 @default.
- W3093350113 cites W2743456344 @default.
- W3093350113 cites W2750798189 @default.
- W3093350113 cites W2764195238 @default.
- W3093350113 cites W2783098393 @default.
- W3093350113 cites W2790957875 @default.
- W3093350113 cites W2804652818 @default.
- W3093350113 cites W2808282032 @default.
- W3093350113 cites W2895076946 @default.
- W3093350113 cites W2895610595 @default.
- W3093350113 cites W2909393089 @default.
- W3093350113 cites W2916127307 @default.
- W3093350113 cites W2917613203 @default.
- W3093350113 cites W2950199861 @default.
- W3093350113 cites W2952479320 @default.
- W3093350113 cites W2989638633 @default.
- W3093350113 cites W3003474164 @default.
- W3093350113 cites W3012532094 @default.
- W3093350113 doi "https://doi.org/10.1016/j.patter.2020.100127" @default.
- W3093350113 hasPubMedCentralId "https://www.ncbi.nlm.nih.gov/pmc/articles/7691389" @default.
- W3093350113 hasPubMedId "https://pubmed.ncbi.nlm.nih.gov/33294868" @default.
- W3093350113 hasPublicationYear "2020" @default.
- W3093350113 type Work @default.
- W3093350113 sameAs 3093350113 @default.
- W3093350113 citedByCount "6" @default.
- W3093350113 countsByYear W30933501132022 @default.
- W3093350113 countsByYear W30933501132023 @default.
- W3093350113 crossrefType "journal-article" @default.
- W3093350113 hasAuthorship W3093350113A5001836351 @default.
- W3093350113 hasAuthorship W3093350113A5020847968 @default.
- W3093350113 hasAuthorship W3093350113A5027271553 @default.
- W3093350113 hasAuthorship W3093350113A5056069573 @default.
- W3093350113 hasAuthorship W3093350113A5072382248 @default.
- W3093350113 hasAuthorship W3093350113A5079836538 @default.
- W3093350113 hasAuthorship W3093350113A5082217568 @default.
- W3093350113 hasBestOaLocation W30933501131 @default.
- W3093350113 hasConcept C104317684 @default.
- W3093350113 hasConcept C118524514 @default.
- W3093350113 hasConcept C150194340 @default.
- W3093350113 hasConcept C190727270 @default.
- W3093350113 hasConcept C41008148 @default.
- W3093350113 hasConcept C54355233 @default.
- W3093350113 hasConcept C70721500 @default.
- W3093350113 hasConcept C86803240 @default.
- W3093350113 hasConceptScore W3093350113C104317684 @default.
- W3093350113 hasConceptScore W3093350113C118524514 @default.
- W3093350113 hasConceptScore W3093350113C150194340 @default.
- W3093350113 hasConceptScore W3093350113C190727270 @default.
- W3093350113 hasConceptScore W3093350113C41008148 @default.
- W3093350113 hasConceptScore W3093350113C54355233 @default.
- W3093350113 hasConceptScore W3093350113C70721500 @default.
- W3093350113 hasConceptScore W3093350113C86803240 @default.
- W3093350113 hasIssue "8" @default.
- W3093350113 hasLocation W30933501131 @default.
- W3093350113 hasLocation W30933501132 @default.
- W3093350113 hasLocation W30933501133 @default.
- W3093350113 hasLocation W30933501134 @default.