Matches in SemOpenAlex for { <https://semopenalex.org/work/W2109972285> ?p ?o ?g. }
- W2109972285 endingPage "440" @default.
- W2109972285 startingPage "435" @default.
- W2109972285 abstract "The completion of the human genome has shifted the attention from deciphering the sequence to the identification and characterization of the encoded components. The identification and functional annotation of the proteome is here of special interest and starts with the identification of genes and transcripts as a prerequisite of proteome annotation. Gene predictions are very powerful in predicting most of the exons in a genome, but reliable gene structure predictions of both known and novel genes are dependent on existing transcript and protein information. An enormous amount of data already exists on the function of many human proteins, but this is scattered over many resources. Public domain databases are required to manage and collate this information and present it to the user community in both a human and machine readable manner.

In November 2004, an article was published in Nature by the International Human Genome Sequencing Centre announcing the finishing of the sequencing of the human genome (1. Stein L.D. Human genome: end of the beginning. Nature. 2004; 431: 915-916).
The published sequence covered 99% of the euchromatic genome and contained only 341 gaps. This incredible achievement has rightly been hailed as a foundation for biomedical research in the decades ahead but, in practice, is only the first step in a long and complicated path to decipher the complexity of the proteome content of the human cell.

To fully understand the workings of the human proteome, scientists must first be able to identify every protein coding region contained within the genome and the amino acid sequence of the proteins that these regions encode. In addition to this basic information, an incredible amount of metadata needs to be assembled. For example, the signals that trigger the expression of these proteins must be identified, and the actual protein expression experimentally observed and catalogued. The subsequent duration of gene expression, along with the factors that can control its eventual repression, the stability of mRNA transcripts, and the rates at which they are translated into protein products must also be known and understood. Every potential site of posttranslational modification of the protein should be identified, and the conditions under which these modifications are made and their biological significance understood. The biological function of each protein molecule needs to be catalogued along with how this varies according to the cell type in which it is expressed and the temporal position within the cell's life cycle.
The significance of the intermolecular interactions each protein makes with other proteins, lipids, and nucleic acids also needs to be understood at both a temporal and functional level and in conjunction with knowledge of the intracellular pathways and processes performed by the cell.

Not only does all this information first need to be generated, a task currently being tackled in laboratories all over the world, but it then needs to be collated, annotated, and stored in a manner that makes it easily accessible to anyone with an interest in the field. Although much of this data is already available in published literature, and is being added to with every journal issue, the potential user is faced with a daunting task should they wish to search out information on a particular gene product and compare this with that pertaining to other related sequences. To assist in this task, public domain databases exist to gather this information and curate it to a similar standard allowing easy comparison across individual records, while still allowing the user to access the original underlying data.

ASSESSING THE SIZE OF THE CHALLENGE

Just how many protein coding genes are present in the human genome has been a question that has interested scientists since long before the start of large-scale sequencing efforts. In 1994, Antequera and Bird estimated the number to be 80,000 (2. Antequera F. Bird A. Predicting the total number of human genes. Nat. Genet. 1994; 8: 114) based on the number of CpG islands, while approximations based on EST data varied between 35,000 and 64,000 (3. Ewing B. Green P. Analysis of expressed sequence tags indicates 35,000 human genes. Nat. Genet. 2000; 25: 232-234; 4. Fields C. Adams M.D. White O. Venter J.C. How many genes in the human genome? Nat. Genet. 1994; 7: 345-346).
Deriving an accurate count entails reconciling the large number of experimentally determined sequences stored in the nucleotide databases, which range from individually sequenced mRNAs to large-scale collections of cDNAs, with the output of ab initio gene prediction tools. This is done by Ensembl (www.ebi.ac.uk/ensembl) (5. Birney E. Andrews T.D. Bevan P. Caccamo M. Chen Y. Clarke L. Coates G. Cuff J. Curwen V. Cutts T. Down T. Eyras E. Fernandez-Suarez X.M. Gane P. Gibbins B. Gilbert J. Hammond M. Hotz H.R. Iyer V. Jekosch K. Kahari A. Kasprzyk A. Keefe D. Keenan S. Lehvaslaiho H. McVicker G. Melsopp C. Meidl P. Mongin E. Pettett R. Potter S. Proctor G. Rae M. Searle S. Slater G. Smedley D. Smith J. Spooner W. Stabenau A. Stalker J. Storey R. Ureta-Vidal A. Woodwark K.C. Cameron G. Durbin R. Cox A. Hubbard T. Clamp M. An overview of Ensembl. Genome Res. 2004; 14: 925-928), a database that organizes biological information around the sequences of large genomes. Developed in response to the acceleration of the public effort to sequence the human genome, Ensembl employs two gene prediction programs: GeneWise, which predicts gene structure using similar protein sequences (6. Birney E. Clamp M. Durbin R. GeneWise and Genomewise. Genome Res. 2004; 14: 988-995), and Genomewise, which provides a gene structure final parse across cDNA- and EST-defined spliced structure (6. Birney E. Clamp M. Durbin R. GeneWise and Genomewise. Genome Res. 2004; 14: 988-995). Both algorithms provide high-specificity gene prediction at the expense of some loss of sensitivity. The number of protein coding genes predicted by such algorithms has varied with each build and rebuild of the genomic sequence. However, the current prediction by Ensembl of the number of coding sequences based on the recently announced final release of the human genome is 22,221, excluding pseudogenes (Release 26.35.1).
RefSeq also provides predicted coding sequences derived automatically from the human genome (www.ncbi.nlm.nih.gov/RefSeq/) (7. Pruitt K.D. Tatusova T. Maglott D.R. NCBI Reference Sequence (RefSeq): A curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 2005; 33: 501-504).

Reconciliation of these datasets is performed by the International Protein Index (IPI) (www.ebi.ac.uk/IPI) (8. Kersey P.J. Duarte J. Williams A. Karavidopoulou Y. Birney E. Apweiler R. The International Protein Index: An integrated database for proteomics experiments. Proteomics. 2004; 4: 1985-1988), which was first developed for the original analysis of the human genome draft. IPI merges the experimentally determined protein sequences held in the UniProt sequence database (9. Bairoch A. Apweiler R. Wu C.H. Barker W.C. Boeckmann B. Ferro S. Gasteiger E. Huang H. Lopez R. Magrane M. Martin M.J. Natale D.A. O'Donovan C. Redaschi N. Yeh L.S. The Universal Protein Resource (UniProt). Nucleic Acids Res. 2005; 33: 154-159) with the protein predictions of Ensembl and both protein predictions and experimentally derived datasets provided by RefSeq to provide a minimally redundant yet maximally complete set of human, mouse, rat, and zebrafish proteins consisting of one sequence per transcript. All annotated splice variants are included in IPI as separate entries (unless their protein sequences are identical). IPI is produced automatically by mapping between the different datasets on the basis of protein similarity and maintains cross-references between the primary data sources.

IPI is updated monthly but maintains stable identifiers (with incremental versioning) so that sequences can be tracked between IPI releases.
When proteins disappear from source databases and a corresponding sequence cannot be identified, IPI identifiers are archived and can be traced by researchers who used the identifier in a particular dataset. Similarly, if two IPI entries are merged as a result of changing data within the source databases, a secondary identifier will be maintained to allow the original entry to be traced.

Version 3.0 of the human IPI suggests that there are 47,094 unique transcripts (including splice variants) produced from the human genome, with only 1,500 of those suggested solely by predictive programs. It is to be hoped that the existence of these final 1,500 gene products can be experimentally confirmed (or disproven) over the next few years, to give a full profile of the human transcriptome, although it is to be expected that the number of splice variants will increase as methods both to predict their existence and to experimentally confirm the predictions are improved.

The Reference Sequence (RefSeq) collection also aims to provide a comprehensive, integrated, nonredundant set of sequences, including genomic DNA, transcript (RNA), and protein products, for the human proteome.

MANUAL ANNOTATION OF PROTEIN SEQUENCE AND FUNCTION

With protein sequence information coming from a variety of sources, including the translation of transcripts from many different origins such as genome projects, cDNAs, and individual gene sequencing in addition to data generated by direct protein sequencing, there arose the need for a single, central database where these sequences could be merged into a unique entry and annotated with additional functional and structural information. UniProt (www.ebi.ac.uk/uniprot/) (9. Bairoch A. Apweiler R. Wu C.H. Barker W.C. Boeckmann B. Ferro S. Gasteiger E. Huang H. Lopez R. Magrane M. Martin M.J. Natale D.A. O'Donovan C. Redaschi N. Yeh L.S. The Universal Protein Resource (UniProt). Nucleic Acids Res.
2005; 33: 154-159) was created to fulfil this role and was formed through the merger of the existing Swiss-Prot (10. Boeckmann B. Bairoch A. Apweiler R. Blatter M.C. Estreicher A. Gasteiger E. Martin M.J. Michoud K. O’Donovan C. Phan I. Pilbout S. Schneider M. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res. 2003; 31: 365-370), TrEMBL (10. Boeckmann B. Bairoch A. Apweiler R. Blatter M.C. Estreicher A. Gasteiger E. Martin M.J. Michoud K. O’Donovan C. Phan I. Pilbout S. Schneider M. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res. 2003; 31: 365-370), and PIR (11. Wu C.H. Yeh L.S. Huang H. Arminski L. Castro-Alvear J. Chen Y. Hu Z. Kourtesis P. Ledley R.S. Suzek B.E. Vinayaka C.R. Zhang J. Barker W.C. The Protein Information Resource. Nucleic Acids Res. 2003; 31: 345-347) sequence databases. UniProt is produced in a collaboration between the European Bioinformatics Institute, the Swiss Institute of Bioinformatics, and Protein Information Resource, Washington, D.C. UniProt comprises three components, each optimized for different uses. The UniProt Knowledgebase (UniProt) is the central access point for extensive curated protein information, including function, protein classification, and cross-references. The UniProt Nonredundant Reference (UniRef) databases combine closely related sequences into a single record to speed up searches. The UniProt Archive (UniParc) is a comprehensive repository, reflecting the history of all protein sequences.

The central UniProt Knowledgebase consists of two core databases, UniProt/Swiss-Prot and UniProt/TrEMBL. Within Swiss-Prot, protein sequences from many sources are merged to provide a single entry, which describes all unique protein products produced by an individual gene from a particular species.
Sequences are curated to correct sequencing errors and to identify both splice variants and sites of polymorphisms (12. Farriol-Mathis N. Garavelli J.S. Boeckmann B. Duvaud S. Gasteiger E. Gateau A. Veuthey A.L. Bairoch A. Annotation of post-translational modifications in the Swiss-Prot knowledge base. Proteomics. 2004; 4: 1537-1550). These observations are mapped and given unique identifiers such that each original sequence may be recreated from within the entry. Potential sites of posttranslational modification are identified, and those confirmed by experimental observation are recorded as such. The protein is given both a systematic protein and gene name, and all known synonyms are recorded. Taxonomic data and citation information are checked and amended, if necessary. If further information on the protein is available, the entries contain detailed annotation on items such as the function(s) of the protein, enzyme-specific information (catalytic activity, cofactors, metabolic pathway, regulation mechanisms), biologically relevant domains and sites, molecular weight determined by mass spectrometry, subcellular location(s) of the protein, tissue-specific expression, developmentally specific expression of the protein, secondary structure, quaternary structure, similarities to other proteins, use of the protein in a biotechnological process, diseases associated with deficiencies in the protein, use of the protein as a pharmaceutical drug, etc. Extensive (and increasing) use of controlled vocabularies improves computer readability.

High-quality manual annotation is time consuming and limits the rate at which the UniProt/Swiss-Prot dataset can grow. TrEMBL (translation of EMBL nucleotide sequence database) was established in 1996 and consists of computer-annotated entries derived from the translation of all coding sequences in the nucleotide sequence databases, except for coding sequences already included in Swiss-Prot.
It also contains those protein sequences extracted from the literature or submitted directly by the user community that are not directly entered in Swiss-Prot. TrEMBL has a certain degree of sequence redundancy, namely a single gene from an individual species may be represented by more than one entry. The UniProt/TrEMBL data content is enhanced by extensive automatic annotation procedures (13. Wieser D. Kretschmann E. Apweiler R. Filtering erroneous protein annotation. Bioinformatics. 2004; 20: I342-I347). The UniProt Knowledgebase contains a nonredundant set of ∼29,000 human sequences; however, this will include many splice variants, which will eventually be merged into a single entry within UniProt/Swiss-Prot.

One of the many strengths of the UniProt Knowledgebase is the extensive cross-referencing made to other, more-specialized databases. No one database can hold all the diverse pieces of information on a protein but UniProt cross-references to more than 60 other data sources, including model organism, protein classification, and structural and disease databases (Fig. 1). UniProt may be regarded as a central hub of knowledge, which extends out to many additional sources to expand the information summarized in the source record.

UniProt/Swiss-Prot has initiated a major project to annotate all known human sequences according to the quality standards of Swiss-Prot, the Human Proteome Initiative (HPI) (14. O'Donovan C. Apweiler R. Bairoch A. The human proteomics initiative (HPI). Trends Biotechnol. 2001; 19: 178-181). To date, 11,638 human protein records have been fully manually annotated with an additional 4,932 splice variants being identified within these entries (Table I).

Table I. The Human Proteome Initiative
Statistic | Total | Max. per entry | Average per entry | No. of entries
Total number of annotated human entries in Swiss-Prot | 11,638 | - | - | -
Number of splice variants | 4,932 | 32 | 0.42 | 2,696 (23.17%)
Number of variants (disease mutations and polymorphisms) | 19,775 | 244 | 1.70 | 2,642 (22.70%)
Number of annotated posttranslational modifications (experimentally proven or potential) | 28,615 | 212 | 2.46 | 5,265 (45.24%)
Number of references to published articles | 46,544 (29,000 distinct references) | 143 | 4.00 | 11,281 (96.93%)
Number of comment blocks | 59,071 | 29 | 5.08 | 11,438 (98.28%)
Number of feature lines | 212,183 | 568 | 18.23 | 11,098 (95.36%)
Number of cross-referenced EMBL protein_ids | 49,142 (49,077 distinct protein_ids) | 539 | 4.22 | 11,465 (98.51%)
Number of cross-references to InterPro | 27,189 | 19 | 2.34 | 10,677 (91.74%)
Number of cross-references to PDB (3D-structure) | 5,510 | 193 | 0.47 | 1,371 (11.78%)
Number of cross-references to MIM | 10,181 (9,648 distinct MIM entries) | 13 | 0.87 | 8,403 (72.20%)
Number of cross-references to Genew | 10,697 (10,585 distinct Genew entries) | 13 | 0.92 | 10,640 (91.42%)

PROTEIN CLASSIFICATION AND AUTOMATIC ANNOTATION OF FUNCTION

As previously stated, the process of manual annotation is necessarily slow and can only represent data that has been experimentally verified for a given protein in a particular species. In order to transfer some or all of this information to closely related proteins within the same species or across species, there must be a means of identifying closely related families of proteins or particular functional domains or regions within less closely related sequences. A number of groups have individually developed signature and sequence cluster-based methods for protein classification. Many of these have been collated and merged into an integrated resource, InterPro (www.ebi.ac.uk/interpro/) (15. Mulder N.J. Apweiler R. Attwood T.K. Bairoch A. Bateman A. Binns D. Bradley P. Bork P. Bucher P. Cerruti L. Copley R. Courcelle E. Das U. Durbin R. Fleischmann W. Gough J. Haft D. Harte N. Hulo N. Kahn D. Kanapin A. Krestyaninova M. Lonsdale D. Lopez R. Letunic I. Madera M.
Maslen J. McDowall J. Mitchell A. Nikolskaya A.N. Orchard S. Pagni M. Ponting C.P. Quevillon E. Selengut J. Sigrist C.J. Silventoinen V. Studholme D.J. Vaughan R. Wu C.H. InterPro, progress and status in 2005. Nucleic Acids Res. 2005; 33: 201-205). InterPro (Release 8.1) is formed from signatures provided by PROSITE (16. Falquet L. Pagni M. Bucher P. Hulo N. Sigrist C.J.A. Hofmann K. Bairoch A. The PROSITE database, its status in 2002. Nucleic Acids Res. 2002; 30: 235-238), PRINTS (17. Attwood T.K. The PRINTS database: A resource for identification of protein families. Brief Bioinform. 2002; 3: 252-263), Pfam (18. Bateman A. Birney E. Cerruti L. Durbin R. Etwiller L. Eddy S.R. Griffiths-Jones S. Howe K.L. Marshall M. Sonnhammer E.L. The Pfam protein families database. Nucleic Acids Res. 2002; 30: 276-280), ProDom (19. Corpet F. Servant F. Gouzy J. Kahn D. ProDom and ProDom-CG: Tools for protein domain analysis and whole genome comparisons. Nucleic Acids Res. 2000; 28: 267-269), SMART (20. Ponting C.P. Schultz J. Milpetz F. Bork P. SMART: Identification and annotation of domains from signalling and extracellular protein sequences. Nucleic Acids Res. 1999; 27: 229-232), TIGRFAMs (21. Haft D.H. Selengut J.D. White O. The TIGRFAMs database of protein families. Nucleic Acids Res. 2003; 31: 371-373), PIRSF (22. Huang H. Xiao C. Wu C.H. ProClass protein family database. Nucleic Acids Res. 2000; 28: 273-276), and SUPERFAMILY (23. Andreeva A. Howort H.D. Brenner S.E. Hubbard T.J.P. Chothia C. Murzin A.G. SCOP database in 2004: Refinements integrate structure and sequence family data. Nucleic Acids Res. 2004; 32: D226-D229), with InterPro protein matches calculated for all UniProt proteins and cross-referenced within the UniProt entries.
InterPro release 8.1 contains 11,330 entries, representing 2,933 domains, 8,126 families, 222 repeats, 27 active sites, 21 binding sites, and 20 posttranslational modification sites. Structural links are generated automatically to the CATH and SCOP databases through residue-by-residue mappings with UniProt proteins and there are links to all the PDB entries for proteins that match the InterPro entry, provided they cover the signatures within that entry.

By using the tool provided, InterProScan (www.ebi.ac.uk/InterProScan/) (24. Zdobnov E.M. Apweiler R. InterProScan - An integration platform for the signature-recognition methods in InterPro. Bioinformatics. 2001; 17: 847-848), users have the ability to take a novel protein sequence and ascribe function by similarity to known protein families and to identify functional domains, active sites or binding sites within the molecule. InterPro is utilized within UniProt as the basis for automatically transferring annotation from the manually annotated Swiss-Prot entries to similar, closely related protein sequences in the TrEMBL database. This adds valuable information to a large percentage of the 1.5 million protein sequences currently residing in the UniProt/TrEMBL database (Release 28.2).

CAPTURING THE PROTEIN EXPRESSION AND INTERACTIONS

While the human genome encodes all potentially expressed proteins, our understanding of the mechanisms governing protein expression is still too limited to reliably predict the protein content of a given cell in a given state. The systematic experimental analysis of protein expression is currently being pursued in a number of large-scale proteomics projects, e.g. the HUPO Plasma Proteome Project (25. Omenn G.S. The Human Proteome Organization Plasma Proteome Project pilot phase: Reference specimens, technology platform comparisons, and standardized data submissions and analyses. Proteomics. 2004; 4: 1235-1240).
A major challenge in the systematic capture of protein expression data is the diversity of experimental technologies and data formats in the field. The HUPO Proteomics Standards Initiative (PSI) (26. Orchard S. Hermjakob H. Julian Jr., R.K. Runte K. Sherman D. Wojcik J. Zhu W. Apweiler R. Common interchange standards for proteomics data: Public availability of tools and schema. Proteomics. 2004; 4: 490-491; 27. Orchard S. Taylor C.F. Hermjakob H. Weimin-Zhu, Julian Jr., R.K. Apweiler R. Advances in the development of common interchange standards for proteomic data. Proteomics. 2004; 4: 2363-2365) develops community standards for proteomics to facilitate the capture, analysis, and distribution of proteomics data. Two data formats have now been produced by the MS group within the PSI: mzData, which allows the capture and interchange of peak list information, and mzIdent, which describes both protein identity and the corresponding peptides from which the identification was made. The PRIDE (PRoteomics IDEntification) database (www.ebi.ac.uk/pride) implements these standards and provides a public repository for protein identification data, which is extensively cross-referenced to UniProt and further external data sources (L. Martens, in preparation).

Proteins do not function in isolation, and the role of a protein may vary with the point in a cell cycle at which the molecule is expressed, the tissue in which it is present, and the availability of the other molecules with which it is capable of interacting. It is impossible to capture such a level of detail in any one database. UniProt/Swiss-Prot summarizes this information within the Comment lines but enhances this by extensive cross-referencing to other, more specialized data sources. For example, protein interaction data is captured in IntAct (www.ebi.ac.uk/intact), a freely available, open source database (28. Hermjakob H. Montecchi-Palazzi L. Lewington C. Mudali S. Kerrien S. Orchard S.
Vingron M. Roechert B. Roepstorff P. Valencia A. Margalit H. Armstrong J. Bairoch A. Cesareni G. Sherman D. Apweiler R. IntAct: An open source molecular interaction database. Nucleic Acids Res. 2004; 32: D452-D455). Information within IntAct is manually curated from two sources: either extracted from existing literature by the curation team or directly submitted by laboratories prior to publication and made available to the journal reader concomitant with publication. IntAct also makes freely available a number of tools for viewing and analyzing the data, for example ProViz (29. Iragne F. Nikolski M. Mathieu B. Auber D. Sherman D. ProViz: Protein interaction visualization and exploration. Bioinformatics. 2005; 21: 272-274), a graph visualization system, and MiNe, an application that computes minimal connecting networks for protein sets.

The IntAct data model has three main components: Experiment, Interaction, and Interactor. An Experiment groups a number of Interactions from one publication and classifies the experimental conditions under which these Interactions have been generated. An Experiment may have only a single interaction, or hundreds of interactions in the case of large-scale experiments. An Interactor is a biological entity participating in an Interaction, usually a protein, but potentially also a DNA sequence, or a small molecule. An Interaction contains one or more Interactors participating in the Interaction. Extensive use of controlled vocabularies both enables data consistency and increases the ability of computers to easily parse and extract specific portions of the data; for example, it is easy to select all interactions identified by x-ray crystallography or deselect all that were generated using yeast two-hybrid technology.

IntAct is fully compatible with the Proteomic Standards Initiative XML interchange standard and can import and export data in both PSI-MI Level 1 and 2 (30. Hermjakob H. Montecchi-Palazzi L. Bader G.
Wojcik J. Salwinski L. Ceol A. Moore S. Orchard S. Sarkans U. von Mering C. Roechert B. Poux S. Jung E. Mersch H. Kersey P. Lappe M. Li Y. Zeng R. Rana D. Nikolski M. Husi H. Brun C. Shanker K. Grant S.G. Sander C. Bork P. Zhu W. Pandey A. Brazma A. Jacq B. Vidal M. Sherman D. Legrain P. Cesareni G. Xenarios I. Eisenberg D. Steipe B. Hogue C. Apweiler R. The HUPO PSI’s molecular interaction format - A community standard for the representation of protein interaction data. Nat. Biotechnol. 2004; 22: 177-183). IntAct is also a founder member of the IMEx consortium, a collaboration of interaction databases, currently also including BIND (31. Bader G.D. Betel D. Hogue C.W.V. BIND: The Biomolecular Interaction Network database. Nucleic Acids Res. 2003; 31: 248-250), DIP (32. Xenarios I. Salwinski L. Duan X.J. Higney P. Kim S. Eisenberg D. DIP: The Database of Interacting Proteins. A research tool for studying cellular networks of protein interactions. Nucleic Acids Res. 2002; 30: 303-305), MINT (33. Zanzoni A. Montecchi-Palazzi L. Quondam M. Ausiello G. Helmer-Citterich M. Cesareni G. MINT: A Molecular INTeraction database. FEBS Lett. 2002; 513: 135-140), and MIPS (MPACT) (34. Pagel P. Kovac S. Oesterheld M. Brauner B. Dunger-Kaltenbach I. Frishman G. Montrone C. Mark P. Stumpflen V. Mewes H.W. Ruepp A. Frishman D. The MIPS mammalian protein-protein interaction database. Bioinformatics. 2004; [Epub ahead of print]), which plan to regularly exchange curated interaction data to ensure users may eventually access an identical dataset at any one of the member databases.

Higher level information, namely the metabolic and signal transduction pathways that these molecules participate in, is collected and annotated in pathway databases such as Reactome (www.reactome.org) (35. Robertson M. Reactome: Clear view of a starry sky. Drug Discov. Today. 2004; 9: 684-685).
Reactome is authored by biological researchers with expertise in their field and maintained and curated by the Reactome editorial staff. Reactome maintains links to the underlying proteins by cross-linking to specific UniProt records, with corresponding links to Reactome from the UniProt entry giving information as to which pathways or reactions each specific protein plays a role in.

MAINTAINING DATA COMPATIBILITY

Data on the human proteome is now spread over an increasing number of databases, and a certain degree of compatibility must be maintained to allow all information on a particular protein to be parsed and collated. Use of a stable protein identifier, such as the UniProt accession number, or of a stable gene identifier, such as those generated by the Human Gene Nomenclature Committee (35. Robertson M. Reactome: Clear view of a starry sky. Drug Discov. Today. 2004; 9: 684-685), allows one degree of compatibility in that the protein can be unambiguously identified across all the databases. Other efforts in establishing data standardization largely center on the increasing use of controlled vocabularies and ontologies. Leaders in this field are the GO Consortium (geneontology.org), which produces terms to describe the attributes of gene products, enabling the description of their molecular function, the biological processes in which they play a role, and cellular components in which they are expressed (36. Wain H.M. Lush M.J. Ducluzeau F. Khodiyar V.K. Povey S. Genew: The Human Gene" @default.
- W2109972285 created "2016-06-24" @default.
- W2109972285 creator A5026050023 @default.
- W2109972285 creator A5066667859 @default.
- W2109972285 creator A5066783652 @default.
- W2109972285 date "2005-04-01" @default.
- W2109972285 modified "2023-09-29" @default.
- W2109972285 title "Annotating the Human Proteome" @default.
- W2109972285 cites W1527927437 @default.
- W2109972285 cites W1966353174 @default.
- W2109972285 cites W1968050545 @default.
- W2109972285 cites W1982428412 @default.
- W2109972285 cites W1994025789 @default.
- W2109972285 cites W2006933206 @default.
- W2109972285 cites W2019075603 @default.
- W2109972285 cites W2021177388 @default.
- W2109972285 cites W2021275596 @default.
- W2109972285 cites W2022366078 @default.
- W2109972285 cites W2051795211 @default.
- W2109972285 cites W2093992569 @default.
- W2109972285 cites W2108491719 @default.
- W2109972285 cites W2108787198 @default.
- W2109972285 cites W2111373249 @default.
- W2109972285 cites W2119195082 @default.
- W2109972285 cites W2122232025 @default.
- W2109972285 cites W2125725297 @default.
- W2109972285 cites W2125806930 @default.
- W2109972285 cites W2126653676 @default.
- W2109972285 cites W2130863229 @default.
- W2109972285 cites W2131924932 @default.
- W2109972285 cites W2132582966 @default.
- W2109972285 cites W2137859534 @default.
- W2109972285 cites W2138821983 @default.
- W2109972285 cites W2141885858 @default.
- W2109972285 cites W2143173841 @default.
- W2109972285 cites W2146638336 @default.
- W2109972285 cites W2147285856 @default.
- W2109972285 cites W2148853951 @default.
- W2109972285 cites W2149124035 @default.
- W2109972285 cites W2152545090 @default.
- W2109972285 cites W2155723007 @default.
- W2109972285 cites W2161062388 @default.
- W2109972285 cites W2165625418 @default.
- W2109972285 cites W2166043581 @default.
- W2109972285 cites W4210531204 @default.
- W2109972285 cites W4230932396 @default.
- W2109972285 cites W4234937118 @default.
- W2109972285 cites W4255855797 @default.
- W2109972285 doi "https://doi.org/10.1074/mcp.r500003-mcp200" @default.
- W2109972285 hasPubMedId "https://pubmed.ncbi.nlm.nih.gov/15691850" @default.
- W2109972285 hasPublicationYear "2005" @default.
- W2109972285 type Work @default.
- W2109972285 sameAs 2109972285 @default.
- W2109972285 citedByCount "28" @default.
- W2109972285 countsByYear W21099722852012 @default.
- W2109972285 countsByYear W21099722852013 @default.
- W2109972285 countsByYear W21099722852014 @default.
- W2109972285 crossrefType "journal-article" @default.
- W2109972285 hasAuthorship W2109972285A5026050023 @default.
- W2109972285 hasAuthorship W2109972285A5066667859 @default.
- W2109972285 hasAuthorship W2109972285A5066783652 @default.
- W2109972285 hasBestOaLocation W21099722852 @default.
- W2109972285 hasConcept C104317684 @default.
- W2109972285 hasConcept C104397665 @default.
- W2109972285 hasConcept C185592680 @default.
- W2109972285 hasConcept C41008148 @default.
- W2109972285 hasConcept C46111723 @default.
- W2109972285 hasConcept C55493867 @default.
- W2109972285 hasConcept C70721500 @default.
- W2109972285 hasConcept C86803240 @default.
- W2109972285 hasConcept C94795543 @default.
- W2109972285 hasConceptScore W2109972285C104317684 @default.
- W2109972285 hasConceptScore W2109972285C104397665 @default.
- W2109972285 hasConceptScore W2109972285C185592680 @default.
- W2109972285 hasConceptScore W2109972285C41008148 @default.
- W2109972285 hasConceptScore W2109972285C46111723 @default.
- W2109972285 hasConceptScore W2109972285C55493867 @default.
- W2109972285 hasConceptScore W2109972285C70721500 @default.
- W2109972285 hasConceptScore W2109972285C86803240 @default.
- W2109972285 hasConceptScore W2109972285C94795543 @default.
- W2109972285 hasIssue "4" @default.
- W2109972285 hasLocation W21099722851 @default.
- W2109972285 hasLocation W21099722852 @default.
- W2109972285 hasLocation W21099722853 @default.
- W2109972285 hasOpenAccess W2109972285 @default.
- W2109972285 hasPrimaryLocation W21099722851 @default.
- W2109972285 hasRelatedWork W1813496818 @default.
- W2109972285 hasRelatedWork W2088324202 @default.
- W2109972285 hasRelatedWork W2165553666 @default.
- W2109972285 hasRelatedWork W2548623071 @default.
- W2109972285 hasRelatedWork W2582535884 @default.
- W2109972285 hasRelatedWork W4210992813 @default.
- W2109972285 hasRelatedWork W4223652769 @default.
- W2109972285 hasRelatedWork W50015136 @default.
- W2109972285 hasRelatedWork W2572638152 @default.
- W2109972285 hasRelatedWork W4234452205 @default.
- W2109972285 hasVolume "4" @default.
- W2109972285 isParatext "false" @default.
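
The abstract of this record describes the IntAct data model as three components (Experiment, Interaction, Interactor) and notes that controlled vocabularies make it easy to select interactions by detection method. The following is a minimal Python sketch of that idea only, not IntAct's actual schema or API: the field names, the `select_interactions` helper, the identifiers, and the method strings are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Interactor:
    """A biological entity taking part in an interaction (usually a protein)."""
    identifier: str           # hypothetical identifier, not a real accession
    kind: str = "protein"     # could also be "DNA" or "small molecule"


@dataclass
class Interaction:
    """One or more Interactors observed together."""
    interactors: List[Interactor]


@dataclass
class Experiment:
    """Groups the Interactions from one publication under one set of conditions."""
    publication: str
    detection_method: str     # stands in for a controlled-vocabulary term
    interactions: List[Interaction] = field(default_factory=list)


def select_interactions(experiments, include=None, exclude=frozenset()):
    """Collect interactions whose experiment passes the detection-method filter.

    include: if given, only these methods are kept.
    exclude: methods that are always dropped.
    """
    selected = []
    for exp in experiments:
        if include is not None and exp.detection_method not in include:
            continue
        if exp.detection_method in exclude:
            continue
        selected.extend(exp.interactions)
    return selected


# Hypothetical example data (assumption: names and methods are made up).
a = Interactor("PROT_A")
b = Interactor("PROT_B")
c = Interactor("DNA_C", kind="DNA")
experiments = [
    Experiment("pub-1", "x-ray crystallography", [Interaction([a, b])]),
    Experiment("pub-2", "two hybrid", [Interaction([a, c]), Interaction([b, c])]),
]

xray_only = select_interactions(experiments, include={"x-ray crystallography"})
without_y2h = select_interactions(experiments, exclude={"two hybrid"})
print(len(xray_only), len(without_y2h))
```

Because each Experiment carries a single vocabulary term for its method, selecting or deselecting a whole technique reduces to a set-membership test, which is the benefit the abstract attributes to controlled vocabularies.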