Matches in SemOpenAlex for { <https://semopenalex.org/work/W3148902524> ?p ?o ?g. }
- W3148902524 endingPage "102361" @default.
- W3148902524 startingPage "102361" @default.
- W3148902524 abstract "• Analysis of omics data from different spaceflight studies presents unique challenges. • A standardized pipeline for RNA-seq analysis eliminates data processing variation. • The GeneLab RNA-seq pipeline includes QC, trimming, mapping, quantification, and DGE. • Space-relevant data processed with this pipeline are available at genelab.nasa.gov. With the development of transcriptomic technologies, we are able to quantify precise changes in gene expression profiles from astronauts and other organisms exposed to spaceflight. Members of NASA GeneLab and GeneLab-associated analysis working groups (AWGs) have developed a consensus pipeline for analyzing short-read RNA-sequencing data from spaceflight-associated experiments. The pipeline includes quality control, read trimming, mapping, and gene quantification steps, culminating in the detection of differentially expressed genes. 
This data analysis pipeline and the results of its execution using data submitted to GeneLab are now all publicly available through the GeneLab database. We present here the full details and rationale for the construction of this pipeline in order to promote transparency, reproducibility, and reusability of pipeline data; to provide a template for data processing of future spaceflight-relevant datasets; and to encourage cross-analysis of data from other databases with the data available in GeneLab. Opportunities to perform biological studies in space are rare due to high costs and a limited number of funding sources, rocket launches, and spaceflight crew hours for experimental procedures. In addition, spaceflight research is decentralized and distributed across numerous laboratories in the United States and abroad. As a result, studies performed in different laboratories often utilize different organisms, strains, cell lines, and experimental procedures. Adding to this complexity is variance in spaceflight factors and/or confounders within each study, such as degree of radiation exposure, experiment duration, CO2 concentration, light cycle, and water availability, all of which can have effects on an organism's health and gene expression profiles during spaceflight (Rutter et al., 2020). In order to optimize the integration of data from this diverse array of spaceflight experiments, it is paramount that variations in data processing are minimized. There is presently no consensus on how best to analyze RNA-seq data, and the impact of analysis tool selection on results is an active field of research. 
Indeed, the selection of trimming parameters (Williams et al., 2016), read aligner (Yang et al., 2015), quantification tool (Teng et al., 2016), and differential expression detection algorithm (Costa-Silva et al., 2017) all affect results. Because of such challenges, groups such as ENCODE and MINSEQE have developed standardized analysis pipelines for better comparison of RNA-seq datasets (ENCODE Project Consortium et al., 2020; Functional Genomics Data Society, 2012). The NASA GeneLab database (https://genelab-data.ndc.nasa.gov/genelab/projects) was created as a central repository for spaceflight-related omics data. The repository includes data from experiments that profile transcription (RNA-seq, microarray), DNA/RNA methylation, protein expression, metabolite pools, and metagenomes. 
The most prevalent data type in this repository is RNA-seq from organisms exposed to spaceflight conditions. As of August 2020, the NASA GeneLab database has over eighty datasets with RNA-sequencing data (Table S1). These datasets include Homo sapiens (human), Mus musculus (mouse), Drosophila melanogaster (fruit fly), Arabidopsis thaliana (model higher plant), Oryzias latipes (Japanese rice fish), Helix lucorum (land snail), Brassica rapa (Fast Plant), Eruca vesicaria (arugula/edible plant), Euprymna scolopes (Hawaiian bobtail squid), Ceratopteris richardii (aquatic fern), and the bacterium Bacillus subtilis, from experiments performed during true spaceflight on various orbital platforms such as the Space Shuttle and International Space Station (ISS), as well as spaceflight-analog studies, such as hindlimb unloading and bed rest studies (Berrios et al., 2020). NASA's GeneLab and Ames Life Sciences Data Archive (ALSDA) projects have put forward an ambitious strategy focused on integrating data, metadata, and biospecimens to fully utilize the 40+ years of archived NASA Life Sciences data (Scott et al., 2020). One of the first steps in this effort is the ability to analyze how experimental factors common to multiple datasets impact molecular signaling. Such meta-analysis can only occur if metadata, data, and processed data are harmonized. As part of this strategy, GeneLab engaged with the scientific community and held its first Analysis Working Group (AWG) workshop in 2018. 
Spaceflight researchers from universities and organizations across the United States and abroad met to begin the creation of a standardized, consensus data-processing pipeline for one of the most common types of spaceflight datasets: transcription profiling via RNA-sequencing. Scientists at this workshop met to discuss the merits of various bioinformatic software tools for processing RNA-sequencing data and ultimately agreed on a single pipeline of these tools. The main driver for developing the consensus pipeline was to present consistently processed data to the public, therefore making space-relevant multi-omics data more accessible and reusable. The overall goals were (1) to get more consistently processed data to the public; (2) to provide output data from every step of the consensus pipeline so users can download and use these “intermediate” data; (3) to support easier and more consistent analysis of space-relevant data by users including those in the NASA AWGs; and (4) to allow easier cross-analysis of experiments to identify effects that result from the spaceflight environment, independent of confounding factors. In addition, many of these data in the GeneLab database have not been previously analyzed, as their generation was relatively recent. Therefore, providing new and processed datasets to the public allows biologists and others to more easily interpret these data and contributes significantly to our collective knowledge of the effects of spaceflight on terrestrial organisms. Here we present the RNA-seq consensus pipeline (RCP) developed by the GeneLab AWG along with the rationale behind the tool settings and options selected. The RCP includes three distinct steps: data preprocessing, data processing, and differential gene expression computation/annotation (Figure 1A). These steps use tools for quality control (FastQC, MultiQC) (Andrews, 2010; Ewels et al., 2016), read trimming (TrimGalore) (Krueger, 2019), mapping (STAR) (Dobin et al., 2013), quantification (RSEM) (Li and Dewey, 2011), and differential gene expression calculation/annotation (DESeq2) (Love et al., 2014) (Figure 1B). 
The RCP has been integrated into the GeneLab database, and files produced by the RCP for each RNA-seq dataset hosted in GeneLab are and will continue to be publicly available for download. There are three distinct steps to the RCP, the first of which is data preprocessing (Figure 2A). The pipeline begins with quality control (QC) of raw FASTQ files from a short-read Illumina sequencer using the FastQC software (Andrews, 2010) (Figure 2B). FastQC is one of the most widely used QC programs for short-read sequencing data. 
It provides information that can be used to assess sample and sequencing quality, including base statistics, per base sequencing quality, per sequence quality scores, per base sequence content, per base GC content, per sequence GC content, per base N content, sequence length distributions, sequence duplication levels, overrepresented sequences, and k-mer content. The FastQC program is run on each individual sample file. However, reviewing the FastQC results for each sample file can be tedious and time-consuming. Experiments typically have many sample files (biological and/or technical replicates) for multiple experimental conditions (spaceflight, ground control, etc.). For this reason, we also use the MultiQC package (Ewels et al., 2016) (Figure 2C) to create a summary statistics report that includes the same quality control result categories from FastQC across all experiment samples. After performing quality control on the raw FASTQ data, reads are trimmed using TrimGalore (Krueger, 2019) to remove sequencing adapters and low-quality bases that would disrupt read mapping during the data processing pipeline step (Figure 2D). TrimGalore is a wrapper program that uses the cutadapt program (Martin, 2011) for read trimming. TrimGalore was selected for the RCP due to its simplified command line interface, thorough output of trimming metrics, and ability to automatically detect adapters. 
In this step, bases that are part of a sequencing adapter or of low quality are removed from each read, and reads that become too short are subsequently removed. After trimming, the quality control programs, FastQC and MultiQC, are again run on the trimmed FASTQ files for viewing the quality control metrics of the reads that will be used for data processing. Once the data have been preprocessed, the sequenced reads are ready for mapping and quantification. In the data processing step (Figure 1; Step 2A), the trimmed reads are first aligned to the reference genome (Figure 3A) with STAR, a splice-aware aligner (Dobin et al., 2013). STAR must be run in two steps. The first step is to create indexed genome files (Figure 3B). These files are used to assist read mapping and only need to be generated once for each reference genome file. This step requires reference FASTA and GTF files (Table S2). Some datasets include the External RNA Control Consortium (ERCC) spike-in control, a pool of 96 synthetic RNAs with various lengths and GC content covering a 2^20 concentration range (Jiang et al., 2011). If ERCC spike-ins were included, the spike-in FASTA and GTF files are appended to the reference FASTA and GTF files, respectively. The second step of STAR mapping is to use the indexed reference genome and the trimmed reads from the preprocessing step in order to map the reads to the genome and the transcriptome (Figure 3C). 
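The reference preparation described above can be sketched as follows. This is a minimal illustration, assuming plain-text (uncompressed) FASTA and GTF files; the function names and file paths are hypothetical and are not taken from the GeneLab scripts. The STAR options shown are standard index-generation options, but the exact parameter set agreed on by the AWG is not given in this text, so the command is only an outline and is never executed here.

```python
def append_spike_ins(reference, spike_in, combined):
    # Write the reference file followed by the spike-in file into `combined`.
    # Works for FASTA and GTF alike because both are line-oriented text.
    with open(combined, 'w') as out:
        for source in (reference, spike_in):
            last = '\n'
            with open(source) as handle:
                for line in handle:
                    out.write(line)
                    last = line
            # Guard against a missing final newline so records never fuse.
            if not last.endswith('\n'):
                out.write('\n')
    return combined


def star_index_command(genome_dir, fasta, gtf, threads=4):
    # Argument list for STAR index generation (hypothetical paths).
    # The list is only built, not run, so STAR need not be installed.
    return [
        'STAR', '--runMode', 'genomeGenerate',
        '--genomeDir', str(genome_dir),
        '--genomeFastaFiles', str(fasta),
        '--sjdbGTFfile', str(gtf),
        '--runThreadN', str(threads),
    ]
```

The combined FASTA and GTF files would then be passed to the index-generation step in place of the plain reference files, so that ERCC transcripts can be mapped and counted alongside endogenous genes.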
STAR will also produce genome-mapped data, which can optionally be used to find reads that map outside of annotated reference transcripts. STAR mapping output data are in Binary Alignment Map (BAM) format, which has a separate entry for each mapped read and states which transcript each read is mapped to. In order to improve the detection and quantification of splice sites, STAR is run in “two-pass mode.” Here, splice sites are detected in the initial mapping to the reference and used to build a new reference that includes these splice sites. Reads are then re-mapped to this dynamically generated reference to improve the quantification of splice isoforms (Dobin et al., 2013). Users are provided with these results (as per-sample SJ.out files) for further analysis of differential splicing. The second part of processing is quantifying the number of reads mapped to each annotated transcript and gene (Figure 1A; Step 2B, Figure 4A). For this task, the RCP uses RSEM (Li and Dewey, 2011). The main reasons for using RSEM are its ability to account for reads that map to multiple transcripts and distinguish gene isoforms. In short-read sequencing experiments it is likely that some number of reads will map to multiple regions in the genome. RSEM computes maximum likelihood abundance estimates to split the read count across multiple genes. Similar to STAR, RSEM is run in two distinct phases (Figure 4A). The first phase uses the reference genome and GTF files (with or without ERCC as appropriate) (Table S2) to prepare indexed genome files (Figure 4B). 
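To make the idea of splitting multi-mapped reads by maximum likelihood concrete, the toy expectation-maximization loop below may help. It is a deliberately simplified sketch, not RSEM's actual model: it ignores transcript length, read quality, and fragment-size information, and the function name and input layout are assumptions made for illustration.

```python
def em_abundances(read_compat, n_transcripts, n_iter=200):
    # read_compat: one tuple per read listing the transcript indices
    # that the read aligns to (length 1 for uniquely mapped reads).
    theta = [1.0 / n_transcripts] * n_transcripts
    expected = [0.0] * n_transcripts
    for _ in range(n_iter):
        # E-step: share each read among its compatible transcripts
        # in proportion to the current abundance estimates.
        expected = [0.0] * n_transcripts
        for compat in read_compat:
            total = sum(theta[t] for t in compat)
            for t in compat:
                expected[t] += theta[t] / total
        # M-step: new abundances are the normalized expected counts.
        theta = [count / len(read_compat) for count in expected]
    return theta, expected
```

For instance, with six reads unique to transcript 0, two unique to transcript 1, and four compatible with both, the estimates converge to proportions of 0.75 and 0.25, so the four ambiguous reads are split 3 to 1 rather than evenly.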
The second phase uses the indexed files and the mapped reads from STAR to assign counts to each gene (Figure 4C). There are two output files generated for each sample: counts assigned to genes and counts assigned to isoforms. Gene counts are used to calculate differential gene expression. Isoform counts are also generated as an option to look at differential isoform expression but are not used during differential gene expression calculation in the RCP. Once the RSEM count files are generated, the data are used to compute differentially expressed genes. A list of the reference genomes used in the GeneLab pipeline is available in Table S2. These reference genomes were the most recent releases at the time each STAR and RSEM indexed reference was created. Although it is possible to run STAR mapping through the RSEM toolkit, we elected not to do this because the alignment parameters used in this case are from ENCODE's STAR-RSEM pipeline and are not customizable. Thus, we would have been precluded from using the precise mapping parameters agreed to by the GeneLab AWG. We elected to adopt a mapping-based approach rather than rapidly quantifying the reads via a k-mer-based counting algorithm, pseudo-aligners, or a quasi-mapping method that utilizes RNA-seq inference procedures such as Kallisto (Bray et al., 2016) or Salmon (Patro et al., 2017), despite their speed advantages. This is because alignment-free quantification tools do not accurately quantify low-abundance and small RNAs, especially when biological variation is present (Wu et al., 2018). 
Furthermore, alignment of reads allows for additional analyses beyond transcript and gene quantification such as measurement of gene body coverage and detection of novel transcripts. There are several alignment-based mapping tools available and each has advantages and disadvantages. An alignment tool that is sensitive to splice-isoforms is critical to accurately identify how expression of splice-isoforms is affected by the spaceflight environment. DNA-specific aligners such as BWA (Li and Durbin, 2009) and Bowtie (Langmead et al., 2009) cannot handle intron-sized gaps and thus an RNA-seq-specific aligner is needed (Baruzzo et al., 2017). In addition to splice-awareness, when selecting an aligner the following criteria were also considered: ability to input both single- and paired-end reads, handle strand-specific data, applicability to a variety of different model organisms with both low- and high-complexity genomic regions, efficient runtime and memory usage, ability to identify chimeric reads, high sensitivity, low rate of false discovery, and ability to output both genome and transcriptome alignments. 
Several studies have been conducted to compare the wide variety of available RNA-seq specific alignment tools, and of these, the STAR aligner consistently performs better than or on par with the tools tested for the indicated criteria (Baruzzo et al., 2017; Schaarschmidt et al., 2020; Raplee et al., 2019). Once reads have been mapped and quantified, differential expression analysis is performed using the DESeq2 R package (Figure 1; Step 3, Figure 5A). Unlike the previous steps, a custom R script (GeneLab_DGE_wERCC.R or GeneLab_DGE_noERCC.R) (Data S1 and S2) is used to run DESeq2; to create both unnormalized and normalized counts tables; and to generate a differential gene expression (DGE) output table containing normalized counts for each sample, DGE results, and gene annotations (Figure 5B). The GeneLab DGE R script also creates computer-readable tables that are used by the GeneLab visualization portal to generate various plots so users can easily view and begin interpreting the processed data. These scripts are provided in the NASA GeneLab_Data_Processing GitHub repository (https://github.com/nasa/GeneLab_Data_Processing). In the following sections we describe each step of these scripts in order. 
The GeneLab DGE R script requires three inputs: the quantified count data from the previous (RSEM) step; sample metadata from the Investigation, Study, and Assay (ISA) tables in the ISA.zip file (provided in the GeneLab repository with each dataset) (Sansone et al., 2012; Rocca-Serra et al., 2010); and the organisms.csv file (Table S3), which is used to specify the organism used in the study and relevant gene annotations to load. Because samples from some GeneLab RNA-seq datasets contain ERCC spike-in and others do not, there are two versions of the GeneLab DGE R script, one for datasets with ERCC spike-in (GeneLab_DGE_wERCC.R, Data S1) and one for those without (GeneLab_DGE_noERCC.R, Data S2). Prior to running either script, paths to directories containing the input data and the output data location must be defined. Each script starts by defining the organism used in the study, which should be consistent with the name in the organisms.csv file so that it matches the abbreviations used in the PANTHER database (Mi et al., 2013; Thomas, 2003) for that organism. 
Next, the metadata from the ISA.zip file are imported and formatted for use with the DESeq2 package. During metadata formatting, groups for comparison are defined based on experimental factors, and a sample table is created to specify the group to which each sample belongs. Next, a contrasts matrix is generated, which specifies the groups that will be compared during DGE analysis; each group is compared with every other group in a pairwise manner in both directions (i.e., spaceflight versus ground and ground versus spaceflight). This approach provides the user with the results for all possible group comparisons, allowing each user to select the most relevant comparisons for their particular scientific questions. After metadata formatting, the RSEM gene count data files from each sample are listed and re-ordered (to match the order the samples appear in the metadata), then imported with the R package tximport (Soneson et al., 2015), and sample names are assigned. Prior to running DESeq2, a value of 1 is added to genes with lengths of zero, which is necessary to make a DESeqDataSet object. A DESeqDataSet object is then created using the formatted metadata and the count data that were imported with tximport. For datasets that contain samples with ERCC spike-in, we use the GeneLab_DGE_wERCC.R script (Data S1). To reduce the possibility of skewing the data during DESeq2 normalization (McIntyre et al., 2011; Risso et al., 2011; Conesa et al., 2016; Law et al., 2016), all genes that have a sum of less than 10 counts across all samples are removed. 
The cutoff value of 10 is a best practice recommended by the DESeq2 tutorial on Bioconductor. These filtered data are then prepared for normalization and DGE analysis with DESeq2. Because there is no consensus on whether or not ERCC-normalization improves the accuracy of the results (Risso et al., 2014), the GeneLab project and its AWG members decided to perform the DGE analysis both with and without ERCC-normalization (for datasets with samples containing ERCC spike-in). To enable DESeq2 analysis with and without considering ERCC reads, the DESeqDataSet object is used to create a DESeqDataSet object containing only ERCC reads. Because all samples must contain ERCC spike-in for ERCC-normalization, the DESeqDataSet object containing only ERCC reads is used to identify and remove any samples that do not contain ERCC reads. Next, a DESeqDataSet object containing only non-ERCC reads is created by removing rows containing ERCC reads. These data are then used for DESeq2 analysis. 
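Three of the preparation steps described above (pairwise contrasts in both directions, replacing zero gene lengths with 1, and dropping genes with fewer than 10 total counts) can be sketched in a few lines. The GeneLab scripts themselves are written in R; the Python below is only an illustrative analogue, and the function names, group labels, and dictionary layout are assumptions rather than anything taken from those scripts.

```python
from itertools import permutations


def pairwise_contrasts(groups):
    # Every ordered pair of distinct groups, so both A-vs-B and
    # B-vs-A are included in the list of comparisons.
    return [(a, b) for a, b in permutations(groups, 2)]


def fix_zero_lengths(lengths):
    # Replace gene lengths of zero with 1; other values are unchanged.
    return {
        gene: [1 if value == 0 else value for value in row]
        for gene, row in lengths.items()
    }


def drop_low_count_genes(counts, min_total=10):
    # Keep only genes whose counts, summed across all samples,
    # reach min_total (genes summing to less than 10 are removed).
    return {
        gene: row for gene, row in counts.items()
        if sum(row) >= min_total
    }
```

With three hypothetical groups such as flight, ground control, and vivarium control, pairwise_contrasts returns six ordered comparisons, which is why the output tables grow quickly as experimental groups are added.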
For DESeq2 analysis with ERCC-normalization (Data S2), the size factor object of the non-ERCC data is replaced with group B ERCC size factors for re-scaling in the first DESeq2 step. Group B ERCC genes are present at the same concentration in both mix 1 and mix 2. Therefore, only group B ERCC genes are used for generating the size factors for re-scaling during ERCC-normalization. For DESeq2 analysis without ERCC-normalization, the DESeq2 default algorithm is applied to the DESeqDataSet object containing only non-ERCC reads. The unnormalized and DESeq2-normalized count data as well as the sample table are then output as CSV files. The “Unnormalized_Counts.c" @default.
- W3148902524 created "2021-04-13" @default.
- W3148902524 creator A5001481592 @default.
- W3148902524 creator A5004275474 @default.
- W3148902524 creator A5005084000 @default.
- W3148902524 creator A5013920526 @default.
- W3148902524 creator A5014222363 @default.
- W3148902524 creator A5016505833 @default.
- W3148902524 creator A5020068823 @default.
- W3148902524 creator A5021022362 @default.
- W3148902524 creator A5024898846 @default.
- W3148902524 creator A5028573253 @default.
- W3148902524 creator A5029296740 @default.
- W3148902524 creator A5030302727 @default.
- W3148902524 creator A5031216179 @default.
- W3148902524 creator A5031623146 @default.
- W3148902524 creator A5033588544 @default.
- W3148902524 creator A5035403159 @default.
- W3148902524 creator A5038723989 @default.
- W3148902524 creator A5042690361 @default.
- W3148902524 creator A5047578706 @default.
- W3148902524 creator A5048384124 @default.
- W3148902524 creator A5052164874 @default.
- W3148902524 creator A5058181019 @default.
- W3148902524 creator A5059648760 @default.
- W3148902524 creator A5060677015 @default.
- W3148902524 creator A5061600787 @default.
- W3148902524 creator A5062687508 @default.
- W3148902524 creator A5062903553 @default.
- W3148902524 creator A5064276499 @default.
- W3148902524 creator A5065675756 @default.
- W3148902524 creator A5067299383 @default.
- W3148902524 creator A5069667659 @default.
- W3148902524 creator A5070274577 @default.
- W3148902524 creator A5070403963 @default.
- W3148902524 creator A5070652111 @default.
- W3148902524 creator A5072342349 @default.
- W3148902524 creator A5073480340 @default.
- W3148902524 creator A5075032246 @default.
- W3148902524 creator A5076358770 @default.
- W3148902524 creator A5077595545 @default.
- W3148902524 creator A5077718077 @default.
- W3148902524 creator A5079774065 @default.
- W3148902524 creator A5080636187 @default.
- W3148902524 creator A5087834833 @default.
- W3148902524 creator A5089349209 @default.
- W3148902524 creator A5091885695 @default.
- W3148902524 date "2021-04-01" @default.
- W3148902524 modified "2023-10-14" @default.
- W3148902524 title "NASA GeneLab RNA-seq consensus pipeline: Standardized processing of short-read RNA-seq data" @default.
- W3148902524 cites W1973094248 @default.
- W3148902524 cites W1999574084 @default.
- W3148902524 cites W2022472106 @default.
- W3148902524 cites W2036897871 @default.
- W3148902524 cites W2039521726 @default.
- W3148902524 cites W2055027171 @default.
- W3148902524 cites W2098425296 @default.
- W3148902524 cites W2103441770 @default.
- W3148902524 cites W2124649657 @default.
- W3148902524 cites W2124985265 @default.
- W3148902524 cites W2130410032 @default.
- W3148902524 cites W2130430745 @default.
- W3148902524 cites W2141297584 @default.
- W3148902524 cites W2151876471 @default.
- W3148902524 cites W2160697532 @default.
- W3148902524 cites W2167778120 @default.
- W3148902524 cites W2168342464 @default.
- W3148902524 cites W2169456326 @default.
- W3148902524 cites W2179438025 @default.
- W3148902524 cites W2197124664 @default.
- W3148902524 cites W2236822143 @default.
- W3148902524 cites W2287064828 @default.
- W3148902524 cites W2323326409 @default.
- W3148902524 cites W2340210804 @default.
- W3148902524 cites W2432815617 @default.
- W3148902524 cites W2566242138 @default.
- W3148902524 cites W2566567183 @default.
- W3148902524 cites W2592811885 @default.
- W3148902524 cites W2780155440 @default.
- W3148902524 cites W2946173685 @default.
- W3148902524 cites W2949303487 @default.
- W3148902524 cites W2952109315 @default.
- W3148902524 cites W2953336103 @default.
- W3148902524 cites W2979832776 @default.
- W3148902524 cites W3010410008 @default.
- W3148902524 cites W3045696640 @default.
- W3148902524 cites W3109199706 @default.
- W3148902524 cites W3110340265 @default.
- W3148902524 cites W4212903522 @default.
- W3148902524 cites W4231539689 @default.
- W3148902524 doi "https://doi.org/10.1016/j.isci.2021.102361" @default.
- W3148902524 hasPubMedCentralId "https://www.ncbi.nlm.nih.gov/pmc/articles/8044432" @default.
- W3148902524 hasPubMedId "https://pubmed.ncbi.nlm.nih.gov/33870146" @default.
- W3148902524 hasPublicationYear "2021" @default.
- W3148902524 type Work @default.
- W3148902524 sameAs 3148902524 @default.
- W3148902524 citedByCount "17" @default.
- W3148902524 countsByYear W31489025242021 @default.