From 6dbfde8969825bf7b159996a9d2286c9b3ea7e0f Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Andreas=20Kusalananda=20K=C3=A4h=C3=A4ri?= <ak4@sanger.ac.uk>
Date: Mon, 22 Jan 2007 16:39:43 +0000
Subject: [PATCH] Now on the right branch... I hope.

---
 .../xref_mapping/parsing_information.txt | 371 +++++++++--------
 1 file changed, 185 insertions(+), 186 deletions(-)

diff --git a/misc-scripts/xref_mapping/parsing_information.txt b/misc-scripts/xref_mapping/parsing_information.txt
index 8c18e5157f..bc864e5764 100644
--- a/misc-scripts/xref_mapping/parsing_information.txt
+++ b/misc-scripts/xref_mapping/parsing_information.txt
@@ -1,233 +1,251 @@
+UniProt/Swissprot - UniProt/Trembl (UNIversal PROTein resource)
+---------------------------------------------------------------

-UniProt/Swissprot - Uniprot/Trembl (UNIversal PROTein resource)
-----------------------------------
-
-The files cans come in two types;
+The files cans come in two types:

-1) contains data for all species
-ftp://ftp.ebi.ac.uk/pub/databases/uniprot/knowledgebase/uniprot_sprot.dat.gz
-or
-ftp://ftp.ebi.ac.uk/pub/databases/uniprot/knowledgebase/uniprot_trembl.dat.gz
-This is the normal case.
+1) Contains data for all species

-2) contains data for that one species
-ftp://ftp.ebi.ac.uk/pub/databases/integr8/uniprot/proteomes/17.D_melanogaster.dat.gz
+ ftp://ftp.ebi.ac.uk/pub/databases/uniprot/knowledgebase/uniprot_sprot.dat.gz
+ or

-These are primary xrefs in that they contain sequence and hence can be mapped
-to the ensembl entities via normal alignment methods (we use exonerate).
+ ftp://ftp.ebi.ac.uk/pub/databases/uniprot/knowledgebase/uniprot_trembl.dat.gz
+ This is the normal case.
+
+2) Contains data for that one species
+
+ ftp://ftp.ebi.ac.uk/pub/databases/integr8/uniprot/proteomes/17.D_melanogaster.dat.gz

-Below is a list of dependent xrefs that might be added.
-EMBL
-PDB
-protein_id
-GO
-MIM_GENE (human only)
-MIM_MORBID (human only)
-HUGO (human only)
-MarkerSymbol (mouse only) aka MGI.
+
+These are primary Xrefs in that they contain sequence and hence can be
+mapped to the Ensembl entities via normal alignment methods (we use
+exonerate).
+
+
+Below is a list of dependent Xrefs that might be added:
+
+ EMBL
+ PDB
+ protein_id
+ GO
+ MIM_GENE (human only)
+ MIM_MORBID (human only)
+ HUGO (human only)
+ MarkerSymbol (mouse only) aka MGI.
 Refseq_peptide
 --------------
 The files come in two types those for specific species i.e.
-ftp://ftp.ncbi.nih.gov/genomes/Canis_familiaris/protein/protein.gbk.gz
+
+ ftp://ftp.ncbi.nih.gov/genomes/Canis_familiaris/protein/protein.gbk.gz
 or as a series of numbered none specific species files i.e.
-ftp://ftp.ncbi.nih.gov/refseq/release/vertebrate_other/vertebrate_other3.protein.gpff.gz
+
+ ftp://ftp.ncbi.nih.gov/refseq/release/vertebrate_other/vertebrate_other3.protein.gpff.gz
 These files are parsed by the parser RefSeqGPFFParser.pm
-These are primary xrefs in that they contain sequence and hence can be mapped
-to the ensembl entities via normal alignment methods (we use exonerate).
+These are primary Xrefs in that they contain sequence and hence can be
+mapped to the Ensembl entities via normal alignment methods (we use
+exonerate).
+
+Below is a list of dependent Xrefs that might be added:
+
+ GO
+ EntrezGene
+ HUGO (human only)
+ RGD (rat only)

-Below is a list of dependent xrefs that might be added.
-GO
-EntrezGene
-HUGO (human only)
-RGD (rat only)
 Refseq_dna
 ----------
-Refseq_dna is now a priority xref source for human, so in addition to the ncbi file used it will
-also use a local file that is generated from the CCDS data which DIRECTLY links refseqs to the ensembl
-trancsripts. If a refseq is not in this file then the sequence data from the ncbi is used to mapped
-via exonerate in the normal manner.
-
-More generally. The files come in two types those for specific species i.e.
-ftp://ftp.ncbi.nih.gov/genomes/Gallus_gallus/RNA/rna.gbk.gz
+ ftp://ftp.ncbi.nih.gov/genomes/Gallus_gallus/RNA/rna.gbk.gz
 or as a series of numbered none specific species files i.e.
-ftp://ftp.ncbi.nih.gov/refseq/release/vertebrate_mammalian/vertebrate_mammalian46.rna.fna.gz
+
+ ftp://ftp.ncbi.nih.gov/refseq/release/vertebrate_mammalian/vertebrate_mammalian46.rna.fna.gz
 These files are parsed by the parser RefSeqParser.pm
-These are primary xrefs in that they contain sequence and hence can be mapped
-to the ensembl entities via normal alignment methods (we use exonerate).
+These are primary Xrefs in that they contain sequence and hence can be
+mapped to the Ensembl entities via normal alignment methods (we use
+exonerate).

-Below is a list of dependent xrefs that might be added.
-HUGO (human only)
-RGD (rat only)
+Below is a list of dependent Xrefs that might be added:
+
+ HUGO (human only)
+ RGD (rat only)
 IPI (International Protein Index)
----
+---------------------------------
+
+Comes as species specific file i.e.

-comes as species specific file i.e.
-ftp://ftp.ebi.ac.uk/pub/databases/IPI/current/ipi.HUMAN.fasta.gz
+ ftp://ftp.ebi.ac.uk/pub/databases/IPI/current/ipi.HUMAN.fasta.gz

-The files have something like :-
+The files have something like
 >IPI:IPI00000005.1|SWISS-PROT:P01111|TREMBL:Q5U091|ENSEMBL:ENSP00000261444;ENSP00000358548|REFSEQ:NP_002515|VEGA:OTTHUMP00000013879 Tax_Id=9606 GTPase NRas precursor
 seqeunce..................
-But most of the header information is ignored except for the description and
-the IPI value. The sequence is used to position the ipi xref.
+But most of the header information is ignored except for the description
+and the IPI value. The sequence is used to position the ipi Xref.
+These are primary Xrefs in that they contain sequence and hence can be
+mapped to the Ensembl entities via normal alignment methods (we use
+exonerate).

-These are primary xrefs in that they contain sequence and hence can be mapped
-to the ensembl entities via normal alignment methods (we use exonerate).
-
-Has no dependent xrefs.
+Has no dependent Xrefs.
 UniGene
 -------
-comes as species specific file i.e.
-ftp://ftp.ncbi.nih.gov/repository/UniGene/Bos_taurus/Bt.seq.uniq.gz
-ftp://ftp.ncbi.nih.gov/repository/UniGene/Bos_taurus/Bt.data.gz
+Comes as species specific file i.e.
+ ftp://ftp.ncbi.nih.gov/repository/UniGene/Bos_taurus/Bt.seq.uniq.gz
+ ftp://ftp.ncbi.nih.gov/repository/UniGene/Bos_taurus/Bt.data.gz

-These are primary xrefs in that they contain sequence and hence can be mapped
-to the ensembl entities via normal alignment methods (we use exonerate).
-No longer loaded via Uniprot.
+These are primary Xrefs in that they contain sequence and hence can be
+mapped to the Ensembl entities via normal alignment methods (we use
+exonerate). No longer loaded via UniProt.

-Has no dependent xrefs.
+Has no dependent Xrefs.
 AgilentProbe
 ------------
-This is a local and is not down loaded automatically via ftp so an AgilentProbe
-the file must be copied by hand. this will be some thing like:-
-LOCAL:AgilentProbe/HumanExpression.fasta
+This is a local and is not down loaded automatically via ftp so an
+AgilentProbe the file must be copied by hand. This will be some thing
+like
+
+ LOCAL:AgilentProbe/HumanExpression.fasta

-These are primary xrefs in that they contain sequence and hence can be mapped
-to the ensembl entities via normal alignment methods (we use exonerate).
+These are primary Xrefs in that they contain sequence and hence can be
+mapped to the Ensembl entities via normal alignment methods (we use
+exonerate).
+
+Has no dependent Xrefs.

-Has no dependent xrefs.
 AgilentCGH
 ----------
-This is a local and is not down loaded automatically via ftp so an AgilentProbe
-the file must be copied by hand. this will be some thing like:-
-LOCAL:AgilentCGH/HumanCGH.fasta
-
+This is a local and is not down loaded automatically via ftp so an
+AgilentProbe the file must be copied by hand. This will be some thing
+like

-These are primary xrefs in that they contain sequence and hence can be mapped
-to the ensembl entities via normal alignment methods (we use exonerate).
+ LOCAL:AgilentCGH/HumanCGH.fasta

-Has no dependent xrefs.
+These are primary Xrefs in that they contain sequence and hence can be
+mapped to the Ensembl entities via normal alignment methods (we use
+exonerate).
+Has no dependent Xrefs.
 EMBL
 ----
-These are dependent xrefs and are linked to ensembl via the Uniprot entrys.
+These are dependent Xrefs and are linked to Ensembl via the UniProt
+entries.
 PDB
 ---
-Protein Data Bank entries are dependent xrefs and are linked to ensembl via the
-Uniprot entrys.
+Protein Data Bank entries are dependent Xrefs and are linked to Ensembl
+via the UniProt entries.
 protein_id
 ----------
-These are dependent xrefs and are linked to ensembl via the Uniprot entrys.
+These are dependent Xrefs and are linked to Ensembl via the UniProt
+entries.
 PUBMED + Medline
 ----------------
-These are no longer stored due to the large numbers of these. If you want to
-add these then see the UniprotParser and RefseqPArser for more details.
+These are no longer stored due to the large numbers of these. If you
+want to add these then see the UniProtParser and RefseqPArser for more
+details.
 GO
 --
 Can come in a species specific file or can contain all species.
-ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gene_association.goa_uniprot.gz
-ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/HUMAN/gene_association.goa_human.gz
-GO information in the Uniprot and refseq files are ignored and just the information
-from the above files are used. The files have references to uniprot and refseq entries
-and so the GO entries are set to be dependent xref on these.
+
+ ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gene_association.goa_uniprot.gz
+ ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/HUMAN/gene_association.goa_human.gz
+
+GO information in the UniProt and Refseq files are ignored and just the
+information from the above files are used. The files have references to
+UniProt and Refseq entries and so the GO entries are set to be dependent
+Xref on these.
 EntrezGene
 ----------
-gene-centered information at NCBI is stored as a depenedent xref and the mappings are
-obtained from the refseq entires. Data about descriptions and synonyms are obtained from
-the file gene_info.gz file from ncbi.
-
+Gene-centered information at NCBI is stored as a depenedent Xref and is
+obtained from the Refseq entires.
 Interpro
 --------
-InterPro is a database of protein families, domains and functional sites and
-gets it data from the file:-
-ftp://ftp.ebi.ac.uk/pub/databases/interpro/interpro.xml.gz
-
-NOTE: Interpro has its own table and hence the xrefs are stored but are not linked to
-the ensembl entities directly but a list of interpro and identifiers are stored.
-The identifiers stored are of the type :-
-PROSITE, PFAM, PRINTS, PREFILE, PROFILE, TIGRFAMs
+InterPro is a database of protein families, domains and functional sites
+and gets it data from the file
+ ftp://ftp.ebi.ac.uk/pub/databases/interpro/interpro.xml.gz
+NOTE: Interpro has its own table and hence the Xrefs are stored but
+are not linked to the Ensembl entities directly but a list of interpro
+and identifiers are stored. The identifiers stored are of the type
+ PROSITE, PFAM, PRINTS, PREFILE, PROFILE, TIGRFAMs
 UniProt/Varsplic
 ----------------
-Alternative splice forms are obtained from the follwing file;-
-ftp://ftp.ebi.ac.uk/pub/databases/uniprot/knowledgebase/uniprot_sprot_varsplic.fasta.gz
+Alternative splice forms are obtained from the follwing file
+
+ ftp://ftp.ebi.ac.uk/pub/databases/uniprot/knowledgebase/uniprot_sprot_varsplic.fasta.gz
+These are primary Xrefs in that they contain sequence and hence can be
+mapped to the Ensembl entities via normal alignment methods (we use
+exonerate).

-These are primary xrefs in that they contain sequence and hence can be mapped
-to the ensembl entities via normal alignment methods (we use exonerate).
+Has no dependent Xrefs.

-Has no dependent xrefs.
+ncRNA, RFAM, miRNA_Registry
+---------------------------

-ncRNA,RFAM,miRNA_Registry
--------------------------
+This is a local and is not down loaded automatically via ftp so you must
+copy this file first before running the parser.

-This is a local and is not down loaded automatically via ftp so you must copy this
-file first before running the parser.
-LOCAL:ncRNA/ncRNA.txt
+ LOCAL:ncRNA/ncRNA.txt

-These are direct xrefs so the file contains data on what the xref is and which
-ensembl entity it matches too.
+These are direct Xrefs so the file contains data on what the Xref is and
+which Ensembl entity it matches too.

-SPECIES SPECIFIC ENTRYS
------------------------
------------------------
+SPECIES SPECIFIC ENTRIES
+------------------------
+------------------------
 Human
@@ -237,86 +255,72 @@ Human
 MIM - Online Mendelian Inheritance in Man
 -----------------------------------------
-Descriptions and types are obtained from the file:-
-ftp://ftp.ncbi.nih.gov/repository/OMIM/omim.txt.Z
+Descriptions and types are obtained from the file

-This creates two set of xrefs these being :-
+ ftp://ftp.ncbi.nih.gov/repository/OMIM/omim.txt.Z
+
+This creates two set of Xrefs:
 1) MIM_GENE (disease genes and other expressed gene)
 2) MIM_MORBID (the disease genes)
 Note those in set 2 will also be in set 1.
-These MIM xrefs are linked to UniProt/SwissProt entries using the
-UniProtParser.pm creating dependent xrefs. Note if the Swissprot Entrie does
-not specify wether the MIM entrie is a phenotype or a gene then it is ignored.
-Fro this same reason MIM dependent xrefs are NOT obtained from the refseq
-entries
+These MIM Xrefs are linked to UniProt/SwissProt entries using the
+UniProtParser.pm creating dependent Xrefs. Note if the Swissprot Entrie
+does not specify wether the MIM entrie is a phenotype or a gene then it
+is ignored. For this same reason MIM dependent Xrefs are NOT obtained
+from the Refseq entries.

-So when the swissport entries are matched to ensembl the MIM
-entries will also be matched.
+So when the swissport entries are matched to Ensembl the MIM entries
+will also be matched.
 HUGO
 ----
-HUGO data uses prioritys to allocate each identifier to one ensembl id.
-The prioritys are :-
-1) via Havana
-2) Via CCDS
-3) Via Refseq
-4) Via Uniprot
-5) Via Entrezgene
-
-1) DIRECT relationships are made by transfering the manually annotated ones from
- havana to ensembl.
- LOCAL:HUGO/HUGO_TO_ENSG
-
-2) DIRECT relationships are made by transfering the ones from CCDS to ensembl.
- LOCAL:HUGO/CCDS_TO_HUGO
-
-3,4 and 5)
-
-The Human Genome Organisation xrefs are obtained from using the following url:-
+The Human Genome Organisation Xrefs are obtained from using the
+following url:
 http://www.gene.ucl.ac.uk/cgi-bin/nomenclature/gdlw.pl?title=Genew+output+data
 &col=gd_hgnc_id&col=gd_app_sym&col=gd_app_name&col=gd_prev_sym&col=gd_aliases
-&col=md_prot_id&col=gd_pub_refseq_ids&col=md_eg_id&status=Approved
-&status=Approved+Non-Human&status_opt=3&=on&where=&order_by=gd_hgnc_id&limit=
-&format=text&submit=submit&.cgifields=&.cgifields=status&.cgifields=chr
+&col=md_prot_id&col=gd_pub_refseq_ids&status=Approved&status=Approved+Non-Human
+&status_opt=3&=on&where=&order_by=gd_hgnc_id&limit=&format=text&submit=submit
+&.cgifields=&.cgifields=status&.cgifields=chr
+Which is a script that produces a list of HUGO identifiers with the
+UniProt and Refseq entries they are linked to.

-Which is a script that produces a list of HUGO identifiers with the Uniprot and
-Refseq and EntrezGene entries they are linked to.
+The files have references to UniProt and Refseq entries and so the GO
+entries are set to be dependent Xref on these.

-The files have references to uniprot, refseq and entrezgen entries and so the
-HUGO entries are set to be dependent xref on these.
-
-NOTE: due to length of its name the file is stored in the name of its checksum.
+NOTE: due to length of its name the file is stored in the name of its
+checksum.
 OTTT
 ----
-These are the direct mapping between the vega genes and ensembl ones. Not all
-of these are mapped but a fair proportion are.
-These create direct xrefs.
-the file used should be :-
-LOCAL:OTTT/OTTT.txt
+These are the direct mapping between the vega genes and Ensembl
+ones. Not all of these are mapped but a fair proportion are. These
+create direct Xrefs. The file used should be
+
+ LOCAL:OTTT/OTTT.txt
 CCDS
 ----
-The CCDS database identifies a core set of human protein coding regions that
-are consistently annotated by multiple public resources and pass quality tests.
+The CCDS database identifies a core set of human protein coding regions
+that are consistently annotated by multiple public resources and pass
+quality tests.

-A local file is used here:-
-LOCAL:CCDS/CCDS.txt
+A local file is used here:

-The file contains a list of ccds identifiers and the ensembl entities they match to.
-So direct xrefs are created for these.
+ LOCAL:CCDS/CCDS.txt
+The file contains a list of ccds identifiers and the Ensembl entities
+they match to. So direct Xrefs are created for these.
 Mouse
@@ -327,14 +331,12 @@ MarkerSymbol
 Also known as MGI.
-ftp://ftp.informatics.jax.org/pub/reports/MRK_SwissProt_TrEMBL.rpt
-ftp://ftp.informatics.jax.org/pub/reports/MRK_Synonym.sql.rpt
+ ftp://ftp.informatics.jax.org/pub/reports/MRK_SwissProt_TrEMBL.rpt
+ ftp://ftp.informatics.jax.org/pub/reports/MRK_Synonym.sql.rpt

-This is mouse specific xref being the Mouse Genome Informatics data.
-Xrefs are generated via the Uniprot entries in the MGI files first, and
-via the RefSeq entries if there is no Uniprot entry.
-The files have references to uniprot entries and so the GO entries are
-set to be dependent xrefs on these.
+This is mouse specific Xref being the Mouse Genome Informatics data.
+The files have references to UniProt entries and so the GO entries are
+set to be dependent Xrefs on these.
 Rat
@@ -343,11 +345,11 @@ Rat
 RGD
 --
-Rat Genome Database entires are populate by using the file:-
-ftp://rgd.mcw.edu/pub/data_release/GENES
+Rat Genome Database entries are populate by using the file

-The rgd xrefs are dependent xrefs on the refseq entries.
+ ftp://rgd.mcw.edu/pub/data_release/GENES
+The rgd Xrefs are dependent Xrefs on the Refseq entries.
 Zebra fish
@@ -357,31 +359,28 @@ ZFIN_ID
 -------
 The two files
-http://zfin.org/data_transfer/Downloads/refseq.txt
-http://zfin.org/data_transfer/Downloads/swissprot.txt
-contain list of zfin identifiers and refseq or swissprot indentifiers depending
-on the file.
-This creates a set of dependent xrefs on refseq and uniprot entries.
+ http://zfin.org/data_transfer/Downloads/refseq.txt
+ http://zfin.org/data_transfer/Downloads/swissprot.txt
+
+contain list of zfin identifiers and Refseq or swissprot indentifiers
+depending on the file.
+
+This creates a set of dependent Xrefs on Refseq and UniProt entries.
 C Elegans
 ---------
-
-wormpep_id , wormbase_locus, wormbase_gene, wormbase_transcript
+wormpep_id, wormbase_locus, wormbase_gene, wormbase_transcript
 ---------------------------------------------------------------
-Uses the file
-ftp://ftp.sanger.ac.uk/pub/databases/wormpep/wormpep150/wormpep.table150
-
-and the database (last release should do)
-mysql:ensembldb.ensembl.org::caenorhabditis_elegans_core_39_150a:anonymous
-
-This creates direct xrefs for all these.
-
-
+Uses the file
+ ftp://ftp.sanger.ac.uk/pub/databases/wormpep/wormpep150/wormpep.table150
+and the database (last release should do)
+ mysql:ensembldb.ensembl.org::caenorhabditis_elegans_core_39_150a:anonymous
+This creates direct Xrefs for all these.
-- 
GitLab