Matches in SemOpenAlex for { <https://semopenalex.org/work/W2509588349> ?p ?o ?g. }
- W2509588349 endingPage "3512" @default.
- W2509588349 startingPage "3501" @default.
- W2509588349 abstract "Peptide and protein identification remains challenging in organisms with poorly annotated or rapidly evolving genomes, as are commonly encountered in environmental or biofuels research. Such limitations render tandem mass spectrometry (MS/MS) database search algorithms ineffective as they lack corresponding sequences required for peptide-spectrum matching. We address this challenge with the spectral networks approach to (1) match spectra of orthologous peptides across multiple related species and then (2) propagate peptide annotations from identified to unidentified spectra. We here present algorithms to assess the statistical significance of spectral alignments (Align-GF), reduce the impurity in spectral networks, and accurately estimate the error rate in propagated identifications. Analyzing three related Cyanothece species, a model organism for biohydrogen production, spectral networks identified peptides from highly divergent sequences from networks with dozens of variant peptides, including thousands of peptides in species lacking a sequenced genome. Our analysis further detected the presence of many novel putative peptides even in genomically characterized species, thus suggesting the possibility of gaps in our understanding of their proteomic and genomic expression. A web-based pipeline for spectral networks analysis is available at http://proteomics.ucsd.edu/software.
Microorganisms have evolved their cellular metabolism to generate energy for life in unusual environments (1. Falkowski P.G. Fenchel T. Delong E.F. The microbial engines that drive Earth's biogeochemical cycles. Science. 2008; 320: 1034-1039), and their capabilities are of great interest in the production of renewable bioenergy and could contribute toward managing the world's current energy and climate crisis (2. Rittmann B.E. Opportunities for renewable bioenergy using microorganisms. Biotechnol. Bioeng. 2008; 100: 203-212). Genomics studies have increased the number of sequenced bioenergy-related microbial genomes and revealed the possible biological reactions involved in bioenergy production (3. Rittmann B.E. Krajmalnik-Brown R. Halden R.U. Pre-genomic, genomic and post-genomic study of microbial communities involved in bioenergy. Nat. Rev. Microbiol. 
2008; 6: 604-612). Studies of photosynthetic microorganisms, for example, have yielded insights into how they harvest solar energy and use it to produce bioenergy products (4. Ferreira K.N. Iverson T.M. Maghlaoui K. Barber J. Iwata S. Architecture of the photosynthetic oxygen-evolving center. Science. 2004; 303: 1831-1838). Despite the importance of microorganisms, the characterization of diverse microbial phenotypes by proteomics tandem mass spectrometry (MS/MS) has been limited. The dominant approaches for MS/MS analysis heavily rely on the availability of completely annotated genomes (i.e. accurate protein databases) (5. Eng J.K. McCormack A.L. Yates J.R. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J. Am. Soc. Mass Spectrom. 1994; 5: 976-989; 6. Perkins D.N. Pappin D.J. Creasy D.M. Cottrell J.S. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis. 1999; 20: 3551-3567; 7. Craig R. Beavis R.C. A method for reducing the time required to match protein sequences with tandem mass spectra. Rapid Commun. Mass Spectrom. 2003; 17: 2310-2316), yet most microorganisms populating the planet have unsequenced or poorly annotated genomes. Thus it remains challenging to identify proteins from environmental and unculturable organisms. One solution to protein identification in a species with no sequenced genome is to use the genomes of closely related species (8. Wright J.C. Beynon R.J. Hubbard S.J. Cross species proteomics. Methods Mol. Biol. 2010; 604: 123-135). 
This requires matching MS/MS data to peptides with slightly different amino acid sequences (polymorphic, orthologous peptides), but matching shifted masses of peptides and their fragment ions is computationally expensive and challenging. Moreover, different species-specific post-translational modifications (PTMs) can make the cross-species identification more complex. (Abbreviations used: PTM, post-translational modification; PSM, peptide-spectrum match; FDR, false discovery rate; PRM, prefix-residue mass; CID, collision induced dissociation; HCD, higher-energy C-trap dissociation; ETD, electron transfer dissociation.) The common computational approach is tolerantly matching de novo sequences derived from MS/MS data to the database while allowing for amino acid mutations and modifications (9. Habermann B. Oegema J. Sunyaev S. Shevchenko A. The power and the limitations of cross-species protein identification by mass spectrometry-driven sequence similarity searches. Mol. Cell. Proteomics. 2004; 3: 238-249; 10. Han Y. Ma B. Zhang K. SPIDER: software for protein identification from sequence tags with de novo sequencing error. J. Bioinform. Comput. Biol. 2005; 3: 697-716; 11. Searle B.C. Dasari S. Wilmarth P.A. Turner M. Reddy A.P. David L.L. Nagalla S.R. Identification of protein modifications using MS/MS de novo sequencing and the OpenSea alignment algorithm. J. Proteome Res. 2005; 4: 546-554). However, this approach critically depends on good de novo interpretations, which are nearly always partially incorrect and yield high-quality subsequences only for a small fraction of all spectra. 
The blind database search approach, developed to identify peptides with unexpected modifications, can also be used to directly match MS/MS data from unknown species to a database of closely related species, but its utilization is limited because of its exceptionally large search space (12. Tsur D. Tanner S. Zandi E. Bafna V. Pevzner P.A. Identification of post-translational modifications by blind search of mass spectra. Nat. Biotechnol. 2005; 23: 1562-1567; 13. Chalkley R.J. Baker P.R. Medzihradszky K.F. Lynn A.J. Burlingame A.L. In-depth analysis of tandem mass spectrometry data from disparate instrument types. Mol. Cell. Proteomics. 2008; 7: 2386-2398; 14. Chen Y. Chen W. Cobb M.H. Zhao Y. PTMap-a sequence alignment software for unrestricted, accurate, and full-spectrum identification of post-translational modification sites. Proc. Natl. Acad. Sci. U.S.A. 2009; 106: 761-766; 15. Baliban R.C. DiMaggio P.A. Plazas-Mayorca M.D. Young N.L. Garcia B.A. Floudas C.A. A novel approach for untargeted post-translational modification identification using integer linear optimization and tandem mass spectrometry. Mol. Cell. Proteomics. 2010; 9: 764-779; 16. Dasari S. Chambers M.C. Slebos R.J. Zimmerman L.J. Ham A.J. Tabb D.L. TagRecon: high-throughput mutation identification through sequence tagging. J. Proteome Res. 2010; 9: 1716-1726; 17. Han X. He L. Xin L. Shan B. Ma B. PeaksPTM: Mass spectrometry-based identification of peptides with unspecified modifications. J. Proteome Res. 2011; 10: 2930-2936; 18. Na S. Bandeira N. Paek E. Fast multi-blind modification search through tandem mass spectrometry. Mol. Cell. Proteomics. 
2012; 11: M111.010199). These spectrum-database matching approaches to cross-species identification face significant challenges in speed and sensitivity when the database is huge, which leads to much longer search times and more false positive identifications (19. Ahrné E. Müller M. Lisacek F. Unrestricted identification of modified proteins using MS/MS. Proteomics. 2010; 10: 671-686; 20. Na S. Paek E. Software eyes for protein post-translational modifications. Mass Spectrom. Rev. 2015; 34: 133-147). As a complementary approach to spectrum-database matching, spectral library searching is an emerging and promising approach (21. Lam H. Building and searching tandem mass spectral libraries for peptide identification. Mol. Cell. Proteomics. 2011; 10: R111.008565). A spectral library is a large collection of identified MS/MS spectra, and an unknown query spectrum can then be identified by direct spectral matching to the library. The great advantage of this approach is the reduction of search space and the use of fragmentation patterns of peptides. The spectral networks approach expands this concept to the identification of modified peptides in MS/MS data sets (22. Bandeira N. Tsur D. Frank A. Pevzner P.A. Protein identification by spectral networks analysis. Proc. Natl. Acad. Sci. U.S.A. 2007; 104: 6140-6145; 23. Guthals A. Watrous J.D. Dorrestein P.C. Bandeira N. The spectral networks paradigm in high throughput mass spectrometry. Mol. Biosyst. 2012; 8: 2535-2544). 
The spectral networks approach does not directly search a database, but groups MS/MS spectra by computing the pairwise similarity between MS/MS spectra of peptide variants and then constructs networks where each spectrum defines a node and each significant spectral pair, highly correlated in the fragmentation pattern, defines an edge (Fig. 1). In spectral networks, identifications of spectra belonging to the same subnetwork should be related, and thus the peptide sequence for an identified spectrum can be propagated to neighboring unidentified spectra. We recently reported that a vast number of polymorphic, orthologous peptides across species are present in MS/MS data sets (24. Payne S.H. Monroe M.E. Overall C.C. Kiebel G.R. Degan M. Gibbons B.C. Fujimoto G.M. Purvine S.O. Adkins J.N. Lipton M.S. Smith R.D. The Pacific Northwest National Laboratory library of bacterial and archaeal proteomic biodiversity. Sci. Data. 2015; 2: 150041). We propose a new approach in cross-species proteomics research that aggregates MS/MS data of multiple related species followed by spectral networks analysis of the pooled data to capitalize on pairs of spectra from orthologous peptides, as shown in Fig. 1. This approach does not require advance knowledge of the genomes for all species, and enables the identification of novel, polymorphic peptides across species via interspecies propagation. Compared with previous approaches, cross-species spectral network analysis has two major advantages. First, by matching spectra to spectra instead of spectra to database sequences, spectral networks only consider the sequence variability of peptides present in the samples instead of considering all possible variability across the whole database of related species; thus the performance of spectral networks is independent of database size. 
Second, the analysis of the set of highly related spectra increases the reliability in identifying polymorphic peptides in that multiple different spectra can support the same novel identification. The utility of spectral networks can also be extended to the proteomic analysis of microbial communities that often contain hundreds of distinct organisms (25. Ram R.J. Verberkmoes N.C. Thelen M.P. Tyson G.W. Baker B.J. Blake 2nd, R.C. Shah M. Hettich R.L. Banfield J.F. Community proteomics of a natural microbial biofilm. Science. 2005; 308: 1915-1920; 26. VerBerkmoes N.C. Denef V.J. Hettich R.L. Banfield J.F. Systems biology: Functional analysis of natural microbial consortia using community proteomics. Nat. Rev. Microbiol. 2009; 7: 196-205). But despite the success of spectral networks in low complexity data sets (22. Bandeira N. Tsur D. Frank A. Pevzner P.A. Protein identification by spectral networks analysis. Proc. Natl. Acad. Sci. U.S.A. 2007; 104: 6140-6145; 23. Guthals A. Watrous J.D. Dorrestein P.C. Bandeira N. The spectral networks paradigm in high throughput mass spectrometry. Mol. Biosyst. 2012; 8: 2535-2544), the analysis of large multi-species proteomics data requires significantly higher reliability in spectral similarity scores because the number of pairwise spectral comparisons grows quadratically with the number of spectra. In this work, we present algorithmic and statistical advances to spectral networks to improve their utility with large and diverse spectral data sets. To statistically assess the significance of spectral alignments in pairing millions of spectra, we propose Align-GF (generating function for spectral alignment) to compute rigorous p values of a spectral pair based on the complete score histogram of all possible alignments between two spectra. 
We show that Align-GF successfully addressed the reliability challenge in a large data set analysis and demonstrated its utility by leading to a 4-fold increase in the sensitivity of spectral pairs. Even with this dramatically improved accuracy, a very small number of incorrect pairs in a network can still complicate propagation of annotations. To further progress toward the ideal scenario where each subnetwork consists of only spectra from a single peptide family, we introduce new procedures to split mixed networks from different peptide families and show that these effectively eliminate many false spectral pairs. Finally, we propose the first approach to calculating the false discovery rate (FDR) for spectral networks propagation of identifications from unmodified to progressively more modified peptides. The proposed FDR estimation was conservative and was more rigorous for highly modified peptides, and thus now makes propagation results comparable to other peptide identification approaches. The cross-species spectral networks techniques proposed here enabled the proteomic analysis of three different Cyanothece species, including a strain where the genome sequence is not known. Cyanobacteria are one of the most diverse and widely distributed microorganisms and have received significant consideration as satisfying various demands required in bioenergy generation (27. Quintana N. Van der Kooy F. Van de Rhee M.D. Voshol G.P. Verpoorte R. Renewable energy from Cyanobacteria: energy production optimization by metabolic pathway engineering. Appl. Microbiol. Biotechnol. 2011; 91: 471-490). We show that spectral networks can improve peptide identification by up to 38% compared with mainstream approaches, including many polymorphic and modified peptides. 
Spectral networks could identify peptides with highly divergent sequences (with 7 amino acid mutations) by leveraging networks of variant peptides, and one example subnetwork of species-specific variants of phycobilisome proteins reflects the diversity of photosynthetic light-harvesting strategies (28. Shih P.M. Wu D. Latifi A. Axen S.D. Fewer D.P. Talla E. Calteau A. Cai F. Tandeau de Marsac N. Rippka R. Herdman M. Sivonen K. Coursin T. Laurent T. Goodwin L. Nolan M. Davenport K.W. Han C.S. Rubin E.M. Eisen J.A. Woyke T. Gugger M. Kerfeld C.A. Improving the coverage of the cyanobacterial phylum using diversity-driven genome sequencing. Proc. Natl. Acad. Sci. U.S.A. 2013; 110: 1053-1058). Our approach thus demonstrates the potential gains in multi-species proteomics and sets the stage for related developments in higher-complexity metaproteomics samples. Finally, spectral networks revealed many unidentified subnetworks containing only unidentified spectra, thus strongly suggesting the presence of novel peptides that are missing from current protein databases. Although we illustrate the potential of our approach on a specific set of bioenergy-related species, we note that the proposed approach is generic and should be applicable to any other set of related species. The diversity of biologically important protein families could be studied by comparing closely and more remotely related species. MS/MS data from Cyanothece sp. ATCC 51142 was previously described (29. Aryal U.K. Stöckel J. Krovvidi R.K. Gritsenko M.A. Monroe M.E. Moore R.J. Koppenaal D.W. Smith R.D. Pakrasi H.B. Jacobs J.M. Dynamic proteomic profiling of a unicellular cyanobacterium Cyanothece ATCC51142 across light-dark diurnal cycles. BMC Syst. Biol. 2011; 5: 194; 30. Stöckel J. Jacobs J.M. Elvitigala T.R. Liberton M. Welsh E.A. Polpitiya A.D. Gritsenko M.A. Nicora C.D. Koppenaal D.W. Smith R.D. Pakrasi H.B. 
Diurnal rhythms result in significant changes in the cellular protein complement in the cyanobacterium Cyanothece 51142. PLoS ONE. 2011; 6: e16680), and MS/MS data for Cyanothece sp. PCC 8801 and Cyanothece sp. ATCC 51472 were prepared in a similar manner. Briefly, proteins from Cyanothece sp. 51142, Cyanothece sp. 8801, and 51472 were treated identically, using 8 M urea and 5 mM tributylphosphine (Sigma-Aldrich, Saint Louis, MO) at 37 °C for 60 min, except for the addition of 1% CHAPS for 45 min prior to digestion in the insoluble protein fraction of 51142. Samples were not alkylated prior to LC-MS/MS analysis. The trypsin-digested samples were separated using strong cation exchange chromatography (SCX) with a PolySulfoethyl A, 200 mm × 2.1 mm, 5 μm, 300-Å column and a 10 mm × 2.1 mm guard column (PolyLC, Inc., Columbia, MD) at a flow rate of 0.2 ml/min. The fractions were subjected to LC-MS/MS analysis that coupled a constant pressure (5000 psi) reversed phase capillary liquid chromatography system (150 μm i.d. × 360 μm o.d. × 65 cm capillary; Polymicro Technologies Inc., Phoenix, AZ) with an LTQ Orbitrap Mass Spectrometer (Thermo, San Jose, CA). MS1 scans were acquired at 100,000 resolution in the Orbitrap. MS/MS analysis of the ten most abundant precursors was performed at low resolution in the CID ion trap. All RAW files were converted to mzXML using ProteoWizard (ver. 3.0.5655; http://proteowizard.sourceforge.net), which is a set of open-source, cross-platform tools and software libraries that convert various vendor formats to readable standard formats and facilitate proteomics data analysis. All data files and results including annotated mass spectra were deposited on MassIVE (http://massive.ucsd.edu), with the accession MSV000079552. Typically, MS/MS data sets contain substantial amounts of redundancy, with multiple spectra coming from the same peptide. We used MS-Cluster (ver. 2.0) (31. Frank A.M. Bandeira N. 
Shen Z. Tanner S. Briggs S.P. Smith R.D. Pevzner P.A. Clustering millions of tandem mass spectra. J. Proteome Res. 2008; 7: 113-122) to group spectra of the same peptides and compute cluster consensus spectra prior to spectral networks analysis. In brief, MS-Cluster retains peaks in a cluster consensus spectrum based on peak occurrences in the clustered spectra. MS-Cluster reduced the MS/MS data set to a smaller set of spectra, consequently improving the speed of spectral networks by reducing the number of spectra that undergo pairwise comparisons. Originally, the data consisted of 275,756 MS/MS spectra for Cyanothece sp. 8801, 481,411 spectra for Cyanothece sp. 51142 and 257,442 spectra for Cyanothece sp. 51472. MS-Cluster was applied to each species with the precursor window size of 0.1 Da and the fragment ion mass tolerance of 0.4 Da (MS-Cluster does not support ppm tolerance). MS-Cluster yielded 141,140 cluster-consensus spectra for Cyanothece sp. 8801, 171,430 cluster-consensus spectra for Cyanothece sp. 51142 and 126,148 cluster-consensus spectra for Cyanothece sp. 51472. The clustered spectra were used for all subsequent data analysis. The clustered spectra were searched using MS-GF+ (ver. 9881) (32. Kim S. Pevzner P.A. MS-GF+ makes progress towards a universal database search tool for proteomics. Nat. Commun. 2014; 5: 5277) to identify seed peptides for propagation in our constructed spectral networks. Spectra were searched with the following parameters: enzyme = trypsin, the number of enzymatic termini = 1/2, the number of missed cleavages = any, precursor mass tolerance = ± 20 ppm, variable modifications = oxidation (Met) and pyro-glu (N-terminal Gln), the number of modifications/peptide = up to 1. The database consisted of 4,335 and 5,239 protein sequences of Cyanothece sp. 
8801 and 51142, respectively, downloaded from NCBI (August 2014) and was appended with their reversed sequences for target-decoy FDR estimation (33. Elias J.E. Gygi S.P. Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry. Nat. Methods. 2007; 4: 207-214). Finally, the identifications were filtered using a spectrum probability of 1e-10 (which corresponded to 0.03% spectrum level FDR), resulting in 61,799 peptide-spectrum matches (PSMs). These identifications were also used as the gold standard to evaluate the correctness of spectral pairs. A spectral pair is i) true if two peptides are identical, one peptide is a prefix/suffix of the other, or one peptide is a singly modified form of the other; ii) ambiguous if two peptides share 12 or more consecutive amino acids or the overlap of theoretical fragment ions of the two peptides is more than 60%; iii) false for other cases. Ambiguous pairs were not counted when evaluating spectral pairs (i.e. these were neither true nor false). Spectral networks analyses were performed against the clustered spectra from each data set with ± 0.4 Da mass tolerance for fragment ions; mass tolerance for precursor masses was not required as the mass difference between the precursor masses of aligned spectra was assigned to the mass of modification. The maximum possible mass of considered modifications was ± 375 Da. Typically samples digested with trypsin include many partially tryptic peptides (where only one end corresponds to a tryptic cleavage) and peptides containing internal tryptic cleavage sites (i.e. missed cleavages). Besides amino acid mutations and modifications, spectral networks can detect those truncated/extended peptides from exact tryptic peptides. The mass range of 375 Da would allow the deletion/extension of up to two Trp residues. 
Initial spectral pairs were accepted if the Align-GF p value was less than 5e-9 (see Generation of Align-GF score histogram section), and were then filtered to restrict the number of precursor masses (100 in our work) contained in a subnetwork (see Splitting mixed subnetworks section below). When seed identifications were loaded into spectral networks for propagation, the edge FDR and Align-GF p value threshold could be calculated based on the annotated spectral pairs. Finally, spectral pairs were filtered out to bring the edge FDR to the specified value. Align-GF computes the score histogram of all possible alignments against a spectrum using the generating function approach, and computes rigorous p values of spectral pairs matched to the spectrum based on the score histogram. Each MS/MS spectrum was converted into a Prefix-Residue Mass (PRM) spectrum (scored version of spectrum) using PepNovo (34. Frank A. Pevzner P. PepNovo: de novo peptide sequencing via probabilistic network modeling. Anal. Chem. 2005; 77: 964-973), where MS/MS peak intensities were converted into log-likelihood scores by considering complementary ions (b/y), multiply charged ions, neutral losses (-H2O and -NH3), and 13C isotopes for each peak. In PRM spectra, peaks at masses corresponding to fragmentation of peptide bonds tend to have high scores whereas peaks at other masses tend to have very low scores (or are removed if the resulting likelihood scores are negative), thus improving the signal-to-noise ratio. Align-GF score histograms were computed on converted PRM spectra. A possible alignment against a spectrum S is defined as a subset of peaks in S, and all possible alignments against S can be represented as all possible subsets of peaks in S (i.e. $2^N$ possible alignments, where N is the number of peaks in S). 
The score of an alignment is calculated as the sum of peak scores in the corresponding subset of S, and its probability is calculated as $\theta^m (1-\theta)^{N-m}$, where m is the number of peaks in the subset, and θ is the probability of randomly matching a peak in a spectrum, modeled as an independent Bernoulli event (we use θ = 0.05 as estimated by matching randomly generated theoretical peak lists to identified spectra in our data set using fragment mass tolerance of 0.4 Da). Calculated scores and probabilities for all possible subsets of S define the Align-GF score histogram as the probability density function for all possible alignments against S. Supplemental Fig. S1 shows an example of generating the Align-GF score histogram and illustrates the difference between the score histogram generated by Align-GF and the previously used Gaussian empirical approximation (22. Bandeira N. Tsur D. Frank A. Pevzner P.A. Protein identification by spectral networks analysis. Proc. Natl. Acad. Sci. U.S.A. 2007; 104: 6140-6145; 23. Guthals A. Watrous J.D. Dorrestein P.C. Bandeira N. The spectral networks paradigm in high throughput mass spectrometry. Mol. Biosyst. 2012; 8: 2535-2544). The calculation of Align-GF histograms can be done rapidly by dynamic programming. Let a variable D[i, t] be the overall probability of alignments that have score t up to the i-th peak in S. The variable D[i, t] can be calculated using the following recursion: $D[i,t] = \theta \cdot D[i-1,\, t-\mathrm{score}(i)] + (1-\theta) \cdot D[i-1,\, t]$ (Eq. 1), where θ is the probability of randomly matching a peak and score(i) is the score of the i-th peak. The first term in the sum updates the distribution when the i-th peak is matched, and the second term updates the distribution when the i-th peak is not matched. D[0,0] is initialized to 1 and all other entries to zero. 
Then, if an alignment in S has score T, the probability that the alignment randomly obtained a score of at least T is calculated as follows, where e is the last peak index: $\mathrm{Prob}_S(T) = \sum_{t \ge T} D[e,t]$ (Eq. 2). In spectral networks, the ideal scenario is that each subnetwork consists of only spectra from a single “peptide family” (where the subnetwork remains connected if using only correct edges), whereas a mixed subnetwork contains spectra from different peptide families. Mixed subnetworks can be caused by incorrect spectral pairs because just one incorrect spectral pair with spectra from distinct peptide families is sufficient to combine the two peptide families into a single subnetwork. Mixed subnetworks can also be caused by co-fragmented, multiplexed MS/MS spectra. For example, if there is a multiplexed spectrum S(A, B) including fragment ions from both peptides A and B, the multiplexed spectrum could possibly be paired with S(A) and S(B), and as a result, a subnetwork of peptide A would be connected with that of peptide B by S(A, B). Co-fragmentation is commonly observed, with 5-10% of MS/MS spectra often coming from co-fragmented precursors (35. Wang J. Bourne P.E. Bandeira N. MixGF: spectral probabilities for mixture spectra from more than one peptide. Mol. Cell. Proteomics. 2014; 13: 3688-3697). Although most multiplexed spectra S(A, B) have suboptimal Align-GF p values against S(A) or S(B), frequent co-fragmentation could still lead to mixed subnetworks consisting of up to tens of thousands of spectra, which can substantially complicate propagation" @default.
- W2509588349 created "2016-09-16" @default.
- W2509588349 creator A5040201243 @default.
- W2509588349 creator A5087034314 @default.
- W2509588349 creator A5088522308 @default.
- W2509588349 date "2016-11-01" @default.
- W2509588349 modified "2023-10-17" @default.
- W2509588349 title "Multi-species Identification of Polymorphic Peptide Variants via Propagation in Spectral Networks" @default.
- W2509588349 cites W1498897499 @default.
- W2509588349 cites W1507573241 @default.
- W2509588349 cites W1566086199 @default.
- W2509588349 cites W1973051173 @default.
- W2509588349 cites W1978929500 @default.
- W2509588349 cites W1981593008 @default.
- W2509588349 cites W1984566266 @default.
- W2509588349 cites W1991310235 @default.
- W2509588349 cites W1995229312 @default.
- W2509588349 cites W1995981930 @default.
- W2509588349 cites W2003781119 @default.
- W2509588349 cites W2004079683 @default.
- W2509588349 cites W2007258729 @default.
- W2509588349 cites W2018375934 @default.
- W2509588349 cites W2020340169 @default.
- W2509588349 cites W2020973783 @default.
- W2509588349 cites W2025075986 @default.
- W2509588349 cites W2026465178 @default.
- W2509588349 cites W2028673251 @default.
- W2509588349 cites W2029330897 @default.
- W2509588349 cites W2029674856 @default.
- W2509588349 cites W2038615001 @default.
- W2509588349 cites W2040876567 @default.
- W2509588349 cites W2052877141 @default.
- W2509588349 cites W2057863409 @default.
- W2509588349 cites W2063493461 @default.
- W2509588349 cites W2065438615 @default.
- W2509588349 cites W2073371472 @default.
- W2509588349 cites W2075555781 @default.
- W2509588349 cites W2080258378 @default.
- W2509588349 cites W2081787927 @default.
- W2509588349 cites W2086540936 @default.
- W2509588349 cites W2090968821 @default.
- W2509588349 cites W2096057003 @default.
- W2509588349 cites W2119200038 @default.
- W2509588349 cites W2119523161 @default.
- W2509588349 cites W2129718327 @default.
- W2509588349 cites W2130488902 @default.
- W2509588349 cites W2146122183 @default.
- W2509588349 cites W2148681712 @default.
- W2509588349 cites W2151041036 @default.
- W2509588349 cites W2153290883 @default.
- W2509588349 cites W2162443513 @default.
- W2509588349 cites W2163078532 @default.
- W2509588349 cites W2314833185 @default.
- W2509588349 cites W2326414500 @default.
- W2509588349 cites W2335177920 @default.
- W2509588349 cites W2434124247 @default.
- W2509588349 cites W4214911971 @default.
- W2509588349 doi "https://doi.org/10.1074/mcp.o116.060913" @default.
- W2509588349 hasPubMedCentralId "https://www.ncbi.nlm.nih.gov/pmc/articles/5098046" @default.
- W2509588349 hasPubMedId "https://pubmed.ncbi.nlm.nih.gov/27609420" @default.
- W2509588349 hasPublicationYear "2016" @default.
- W2509588349 type Work @default.
- W2509588349 sameAs 2509588349 @default.
- W2509588349 citedByCount "8" @default.
- W2509588349 countsByYear W25095883492016 @default.
- W2509588349 countsByYear W25095883492017 @default.
- W2509588349 countsByYear W25095883492018 @default.
- W2509588349 countsByYear W25095883492019 @default.
- W2509588349 countsByYear W25095883492020 @default.
- W2509588349 crossrefType "journal-article" @default.
- W2509588349 hasAuthorship W2509588349A5040201243 @default.
- W2509588349 hasAuthorship W2509588349A5087034314 @default.
- W2509588349 hasAuthorship W2509588349A5088522308 @default.
- W2509588349 hasBestOaLocation W25095883491 @default.
- W2509588349 hasConcept C116834253 @default.
- W2509588349 hasConcept C2779281246 @default.
- W2509588349 hasConcept C54355233 @default.
- W2509588349 hasConcept C55493867 @default.
- W2509588349 hasConcept C59822182 @default.
- W2509588349 hasConcept C70721500 @default.
- W2509588349 hasConcept C78458016 @default.
- W2509588349 hasConcept C86803240 @default.
- W2509588349 hasConceptScore W2509588349C116834253 @default.
- W2509588349 hasConceptScore W2509588349C2779281246 @default.
- W2509588349 hasConceptScore W2509588349C54355233 @default.
- W2509588349 hasConceptScore W2509588349C55493867 @default.
- W2509588349 hasConceptScore W2509588349C59822182 @default.
- W2509588349 hasConceptScore W2509588349C70721500 @default.
- W2509588349 hasConceptScore W2509588349C78458016 @default.
- W2509588349 hasConceptScore W2509588349C86803240 @default.
- W2509588349 hasFunder F4320306084 @default.
- W2509588349 hasFunder F4320306151 @default.
- W2509588349 hasFunder F4320332161 @default.
- W2509588349 hasIssue "11" @default.
- W2509588349 hasLocation W25095883491 @default.
- W2509588349 hasLocation W25095883492 @default.
- W2509588349 hasLocation W25095883493 @default.
- W2509588349 hasLocation W25095883494 @default.