Now on the right branch... I hope.

Commit 6dbfde89 authored 18 years ago by Andreas Kusalananda Kähäri
--- a/misc-scripts/xref_mapping/parsing_information.txt
+++ b/misc-scripts/xref_mapping/parsing_information.txt
+UniProt/Swissprot - UniProt/Trembl (UNIversal PROTein resource)
+---------------------------------------------------------------

-UniProt/Swissprot - Uniprot/Trembl (UNIversal PROTein resource)
----------------------------------
- 
-The files cans come in two types;
+The files can come in two types:

-1) contains data for all species
-ftp://ftp.ebi.ac.uk/pub/databases/uniprot/knowledgebase/uniprot_sprot.dat.gz
-or 
-ftp://ftp.ebi.ac.uk/pub/databases/uniprot/knowledgebase/uniprot_trembl.dat.gz
-This is the normal case.
+1)  Contains data for all species

-2) contains data for that one species
-ftp://ftp.ebi.ac.uk/pub/databases/integr8/uniprot/proteomes/17.D_melanogaster.dat.gz
+    ftp://ftp.ebi.ac.uk/pub/databases/uniprot/knowledgebase/uniprot_sprot.dat.gz

+    or

-These are primary xrefs in that they contain sequence and hence can be mapped 
-to the ensembl entities via normal alignment methods (we use exonerate).
+    ftp://ftp.ebi.ac.uk/pub/databases/uniprot/knowledgebase/uniprot_trembl.dat.gz

+    This is the normal case.
+
+2)  Contains data for that one species
+
+    ftp://ftp.ebi.ac.uk/pub/databases/integr8/uniprot/proteomes/17.D_melanogaster.dat.gz

-Below is a list of dependent xrefs that might be added.
-EMBL
-PDB
-protein_id
-GO
-MIM_GENE     (human only)
-MIM_MORBID   (human only)
-HUGO         (human only)
-MarkerSymbol (mouse only) aka MGI.
+
+These are primary Xrefs in that they contain sequence and hence can be
+mapped to the Ensembl entities via normal alignment methods (we use
+exonerate).
+
+
+Below is a list of dependent Xrefs that might be added:
+
+    EMBL
+    PDB
+    protein_id
+    GO
+    MIM_GENE     (human only)
+    MIM_MORBID   (human only)
+    HUGO         (human only)
+    MarkerSymbol (mouse only) aka MGI.


 Refseq_peptide
 --------------

 The files come in two types: those for a specific species, e.g.
-ftp://ftp.ncbi.nih.gov/genomes/Canis_familiaris/protein/protein.gbk.gz
+
+    ftp://ftp.ncbi.nih.gov/genomes/Canis_familiaris/protein/protein.gbk.gz

 or a series of numbered, non-species-specific files, e.g.
-ftp://ftp.ncbi.nih.gov/refseq/release/vertebrate_other/vertebrate_other3.protein.gpff.gz
+
+    ftp://ftp.ncbi.nih.gov/refseq/release/vertebrate_other/vertebrate_other3.protein.gpff.gz

 These files are parsed by the parser RefSeqGPFFParser.pm

-These are primary xrefs in that they contain sequence and hence can be mapped 
-to the ensembl entities via normal alignment methods (we use exonerate).
+These are primary Xrefs in that they contain sequence and hence can be
+mapped to the Ensembl entities via normal alignment methods (we use
+exonerate).
+
+Below is a list of dependent Xrefs that might be added:
+
+    GO
+    EntrezGene
+    HUGO (human only)
+    RGD (rat only)

-Below is a list of dependent xrefs that might be added.
-GO
-EntrezGene
-HUGO (human only)
-RGD (rat only) 

 Refseq_dna
 ----------

-Refseq_dna is now a priority xref source for human, so in addition to the ncbi file used it will
-also use a local file that is generated from the CCDS data which DIRECTLY links refseqs to the ensembl 
-trancsripts. If a refseq is not in this file then the sequence data from the ncbi is used to mapped 
-via exonerate in the normal manner.
-
-More generally.
 The files come in two types: those for a specific species, e.g.
-ftp://ftp.ncbi.nih.gov/genomes/Gallus_gallus/RNA/rna.gbk.gz

+    ftp://ftp.ncbi.nih.gov/genomes/Gallus_gallus/RNA/rna.gbk.gz

 or a series of numbered, non-species-specific files, e.g.
-ftp://ftp.ncbi.nih.gov/refseq/release/vertebrate_mammalian/vertebrate_mammalian46.rna.fna.gz
+
+    ftp://ftp.ncbi.nih.gov/refseq/release/vertebrate_mammalian/vertebrate_mammalian46.rna.fna.gz

 These files are parsed by the parser RefSeqParser.pm

-These are primary xrefs in that they contain sequence and hence can be mapped 
-to the ensembl entities via normal alignment methods (we use exonerate).
+These are primary Xrefs in that they contain sequence and hence can be
+mapped to the Ensembl entities via normal alignment methods (we use
+exonerate).

-Below is a list of dependent xrefs that might be added.
-HUGO (human only)
-RGD  (rat only)
+Below is a list of dependent Xrefs that might be added:
+
+    HUGO (human only)
+    RGD  (rat only)


 IPI (International Protein Index)
---
+---------------------------------
+
+Comes as a species-specific file, e.g.

-comes as species specific file i.e.
-ftp://ftp.ebi.ac.uk/pub/databases/IPI/current/ipi.HUMAN.fasta.gz
+    ftp://ftp.ebi.ac.uk/pub/databases/IPI/current/ipi.HUMAN.fasta.gz

-The files have something like :-
+The files contain entries like

 >IPI:IPI00000005.1|SWISS-PROT:P01111|TREMBL:Q5U091|ENSEMBL:ENSP00000261444;ENSP00000358548|REFSEQ:NP_002515|VEGA:OTTHUMP00000013879 Tax_Id=9606 GTPase NRas precursor
 sequence..................

-But most of the header information is ignored except for the description and 
-the IPI value. The sequence is used to position the ipi xref.
+Most of the header information is ignored; only the description and
+the IPI value are kept.  The sequence is used to position the IPI Xref.
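
As a rough illustration of the header handling described above, here is a
minimal Python sketch that keeps only the IPI value and the description.
It is not the Ensembl parser itself; the function name parse_ipi_header
and the assumption that the description follows a Tax_Id=NNNN token are
taken from the single example header shown and are otherwise hypothetical.

```python
import re


def parse_ipi_header(header):
    """Return (ipi_accession, description) from an IPI FASTA header line.

    Hypothetical helper: assumes the layout of the example header, i.e.
    '|'-separated KEY:VALUE identifiers, a space, 'Tax_Id=NNNN', then
    the free-text description.
    """
    # Split the identifier block from the free-text remainder.
    ids, _, rest = header.lstrip(">").partition(" ")

    ipi = None
    for field in ids.split("|"):
        key, _, value = field.partition(":")
        if key == "IPI":
            ipi = value  # every other identifier is ignored

    # Drop the Tax_Id token; what remains is the description.
    description = re.sub(r"^Tax_Id=\d+\s*", "", rest).strip()
    return ipi, description


header = (
    ">IPI:IPI00000005.1|SWISS-PROT:P01111|TREMBL:Q5U091|"
    "ENSEMBL:ENSP00000261444;ENSP00000358548|REFSEQ:NP_002515|"
    "VEGA:OTTHUMP00000013879 Tax_Id=9606 GTPase NRas precursor"
)
ipi, description = parse_ipi_header(header)
```

The cross-references embedded in the header (SWISS-PROT, ENSEMBL and so
on) are deliberately discarded here, since the text says the mapping is
done from the sequence rather than from those identifiers.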

+These are primary Xrefs in that they contain sequence and hence can be
+mapped to the Ensembl entities via normal alignment methods (we use
+exonerate).

-These are primary xrefs in that they contain sequence and hence can be mapped 
-to the ensembl entities via normal alignment methods (we use exonerate).
-
-Has no dependent xrefs.
+Has no dependent Xrefs.


 UniGene
 -------

-comes as species specific file i.e.
-ftp://ftp.ncbi.nih.gov/repository/UniGene/Bos_taurus/Bt.seq.uniq.gz 
-ftp://ftp.ncbi.nih.gov/repository/UniGene/Bos_taurus/Bt.data.gz
+Comes as a species-specific file, e.g.

+    ftp://ftp.ncbi.nih.gov/repository/UniGene/Bos_taurus/Bt.seq.uniq.gz
+    ftp://ftp.ncbi.nih.gov/repository/UniGene/Bos_taurus/Bt.data.gz

-These are primary xrefs in that they contain sequence and hence can be mapped 
-to the ensembl entities via normal alignment methods (we use exonerate).
-No longer loaded via Uniprot.
+These are primary Xrefs in that they contain sequence and hence can be
+mapped to the Ensembl entities via normal alignment methods (we use
+exonerate).  No longer loaded via UniProt.

-Has no dependent xrefs.
+Has no dependent Xrefs.


 AgilentProbe
 ------------

-This is a local and is not down loaded automatically via ftp so an AgilentProbe
-the file must be copied by hand. this will be some thing like:-
-LOCAL:AgilentProbe/HumanExpression.fasta
+This is a local file and is not downloaded automatically via FTP, so
+the AgilentProbe file must be copied by hand.  This will be something
+like
+
+    LOCAL:AgilentProbe/HumanExpression.fasta

-These are primary xrefs in that they contain sequence and hence can be mapped 
-to the ensembl entities via normal alignment methods (we use exonerate).
+These are primary Xrefs in that they contain sequence and hence can be
+mapped to the Ensembl entities via normal alignment methods (we use
+exonerate).
+
+Has no dependent Xrefs.

-Has no dependent xrefs.

 AgilentCGH
 ----------

-This is a local and is not down loaded automatically via ftp so an AgilentProbe
-the file must be copied by hand. this will be some thing like:-
-LOCAL:AgilentCGH/HumanCGH.fasta
-
+This is a local file and is not downloaded automatically via FTP, so
+the AgilentCGH file must be copied by hand.  This will be something
+like

-These are primary xrefs in that they contain sequence and hence can be mapped 
-to the ensembl entities via normal alignment methods (we use exonerate).
+    LOCAL:AgilentCGH/HumanCGH.fasta

-Has no dependent xrefs.
+These are primary Xrefs in that they contain sequence and hence can be
+mapped to the Ensembl entities via normal alignment methods (we use
+exonerate).

+Has no dependent Xrefs.


 EMBL
 ----

-These are dependent xrefs and are linked to ensembl via the Uniprot entrys.
+These are dependent Xrefs and are linked to Ensembl via the UniProt
+entries.


 PDB
 ---

-Protein Data Bank entries are dependent xrefs and are linked to ensembl via the
-Uniprot entrys.
+Protein Data Bank entries are dependent Xrefs and are linked to Ensembl
+via the UniProt entries.


 protein_id
 ----------

-These are dependent xrefs and are linked to ensembl via the Uniprot entrys.
+These are dependent Xrefs and are linked to Ensembl via the UniProt
+entries.


 PUBMED + Medline
 ----------------

-These are no longer stored due to the large numbers of these. If you want to 
-add these then see the UniprotParser and RefseqPArser for more details.
+These are no longer stored because there are so many of them.  If you
+want to add them, see the UniProtParser and RefSeqParser for more
+details.


 GO
 --

 Can come in a species specific file or can contain all species.
-ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gene_association.goa_uniprot.gz
-ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/HUMAN/gene_association.goa_human.gz

-GO information in the Uniprot and refseq files are ignored and just the information 
-from the above files are used. The files have references to uniprot and refseq entries
-and so the GO entries are set to be dependent xref on these.
+    ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gene_association.goa_uniprot.gz
+    ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/HUMAN/gene_association.goa_human.gz
+
+GO information in the UniProt and Refseq files is ignored and just the
+information from the above files is used.  The files have references to
+UniProt and Refseq entries and so the GO entries are set to be dependent
+Xrefs on these.
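
To make the dependency concrete, the following Python sketch pulls
(source database, accession, GO term) triples out of gene association
lines.  It assumes the standard tab-separated GO annotation layout, in
which column 1 is the source database, column 2 the accession and
column 5 the GO identifier; the function name go_dependent_xrefs and the
sample line are illustrative only and do not come from the parser.

```python
def go_dependent_xrefs(lines):
    """Yield (source_db, accession, go_id) for each annotation line.

    Assumption: tab-separated gene association format with the source
    database in column 1, the accession in column 2 and the GO term in
    column 5.  Lines starting with '!' are comments.
    """
    for line in lines:
        if line.startswith("!"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 5:
            continue  # malformed or truncated line
        yield cols[0], cols[1], cols[4]


# Made-up sample input, for illustration only.
sample = [
    "!gaf-version: 2.0\n",
    "UniProtKB\tP01111\tNRAS\t\tGO:0005525\tGO_REF:0000004\tIEA\n",
]
triples = list(go_dependent_xrefs(sample))
```

Each triple names the master entry (the UniProt or Refseq accession)
that the GO Xref would hang off, which is what makes it a dependent
Xref rather than a primary or direct one.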


 EntrezGene
 ----------

-gene-centered information at NCBI is stored as a depenedent xref and the mappings are 
-obtained from the refseq entires. Data about descriptions and synonyms are obtained from
-the file gene_info.gz file from ncbi.
-
+Gene-centered information at NCBI is stored as a dependent Xref and is
+obtained from the Refseq entries.


 Interpro
 --------

-InterPro is a database of protein families, domains and functional sites and 
-gets it data from the file:-
-ftp://ftp.ebi.ac.uk/pub/databases/interpro/interpro.xml.gz
-
-NOTE: Interpro has its own table and hence the xrefs are stored but are not linked to
-the ensembl entities directly but a list of interpro and identifiers are stored.
-The identifiers stored are of the type :-
-PROSITE, PFAM, PRINTS, PREFILE, PROFILE, TIGRFAMs
+InterPro is a database of protein families, domains and functional sites
+and gets its data from the file

+    ftp://ftp.ebi.ac.uk/pub/databases/interpro/interpro.xml.gz

+NOTE:  InterPro has its own table, so the Xrefs are stored but are not
+linked to the Ensembl entities directly; instead, a list of InterPro
+entries and identifiers is stored.  The identifiers stored are of the type

+    PROSITE, PFAM, PRINTS, PREFILE, PROFILE, TIGRFAMs


 UniProt/Varsplic
 ----------------

-Alternative splice forms are obtained from the follwing file;-
-ftp://ftp.ebi.ac.uk/pub/databases/uniprot/knowledgebase/uniprot_sprot_varsplic.fasta.gz
+Alternative splice forms are obtained from the following file
+
+    ftp://ftp.ebi.ac.uk/pub/databases/uniprot/knowledgebase/uniprot_sprot_varsplic.fasta.gz

+These are primary Xrefs in that they contain sequence and hence can be
+mapped to the Ensembl entities via normal alignment methods (we use
+exonerate).

-These are primary xrefs in that they contain sequence and hence can be mapped 
-to the ensembl entities via normal alignment methods (we use exonerate).
+Has no dependent Xrefs.

-Has no dependent xrefs.

+ncRNA, RFAM, miRNA_Registry
+---------------------------

-ncRNA,RFAM,miRNA_Registry
-------------------------
+This is a local file and is not downloaded automatically via FTP, so you
+must copy this file first before running the parser.

-This is a local and is not down loaded automatically via ftp so you must copy this 
-file first before running the parser.
-LOCAL:ncRNA/ncRNA.txt
+    LOCAL:ncRNA/ncRNA.txt

-These are direct xrefs so the file contains data on what the xref is and which 
-ensembl entity it matches too.
+These are direct Xrefs so the file contains data on what the Xref is and
+which Ensembl entity it matches to.


-SPECIES SPECIFIC ENTRYS
-----------------------
-----------------------
+SPECIES SPECIFIC ENTRIES
+------------------------
+------------------------


 Human
@@ -237,86 +255,72 @@ Human
 MIM - Online Mendelian Inheritance in Man
 -----------------------------------------

-Descriptions and types are obtained from the file:-
-ftp://ftp.ncbi.nih.gov/repository/OMIM/omim.txt.Z
+Descriptions and types are obtained from the file

-This creates two set of xrefs these being :-
+    ftp://ftp.ncbi.nih.gov/repository/OMIM/omim.txt.Z
+
+This creates two sets of Xrefs:

 1) MIM_GENE   (disease genes and other expressed gene)
 2) MIM_MORBID (the disease genes)

 Note those in set 2 will also be in set 1.

-These MIM xrefs are linked to UniProt/SwissProt entries using the 
-UniProtParser.pm creating dependent xrefs. Note if the Swissprot Entrie does
-not specify wether the MIM entrie is a phenotype or a gene then it is ignored.
-Fro this same reason MIM dependent xrefs are NOT obtained from the refseq 
-entries  
+These MIM Xrefs are linked to UniProt/SwissProt entries using the
+UniProtParser.pm, creating dependent Xrefs.  Note if the Swissprot entry
+does not specify whether the MIM entry is a phenotype or a gene then it
+is ignored.  For this same reason MIM dependent Xrefs are NOT obtained
+from the Refseq entries.
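
The gene/phenotype rule above can be sketched in Python as follows.
This is a hedged illustration, not the logic of UniProtParser.pm: it
assumes the cross-reference appears on a UniProt "DR   MIM; <id>;
<type>." line, that type "gene" corresponds to MIM_GENE and type
"phenotype" to MIM_MORBID, and the MIM numbers in the sample are
placeholders.

```python
import re

# Assumed shape of a UniProt flat-file MIM cross-reference line.
DR_MIM = re.compile(r"^DR\s+MIM;\s*(\d+);\s*(\w+)\.")


def mim_dependent_xrefs(record_lines):
    """Return [(source_name, mim_id)] for the DR MIM lines of one record."""
    xrefs = []
    for line in record_lines:
        match = DR_MIM.match(line)
        if not match:
            continue  # not a MIM line, or no usable type given
        mim_id, kind = match.groups()
        if kind == "gene":
            xrefs.append(("MIM_GENE", mim_id))
        elif kind == "phenotype":
            xrefs.append(("MIM_MORBID", mim_id))
        # Any other type is ignored, as the text describes.
    return xrefs


# Placeholder lines for illustration; the last one has no gene/phenotype
# type and is therefore dropped.
sample = [
    "DR   MIM; 164790; gene.",
    "DR   MIM; 114500; phenotype.",
    "DR   MIM; 999999; -.",
]
xrefs = mim_dependent_xrefs(sample)
```

Because the Refseq records carry no such type field, the same filter
could not be applied to them, which is consistent with MIM Xrefs not
being taken from Refseq.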

-So when the swissport entries are matched to ensembl the MIM
-entries will also be matched.
+So when the Swissprot entries are matched to Ensembl the MIM entries
+will also be matched.


 HUGO
 ----

-HUGO data uses prioritys to allocate each identifier to one ensembl id.
-The prioritys are :-
-1) via Havana
-2) Via CCDS
-3) Via Refseq
-4) Via Uniprot
-5) Via Entrezgene
-
-1) DIRECT relationships are made by transfering the manually annotated ones from
-   havana to ensembl.
-   LOCAL:HUGO/HUGO_TO_ENSG
-
-2) DIRECT relationships are made by transfering the ones from CCDS to ensembl.
-   LOCAL:HUGO/CCDS_TO_HUGO
-
-3,4 and 5)
-
-The Human Genome Organisation xrefs are obtained from using the following url:-
+The Human Genome Organisation Xrefs are obtained using the
+following URL:

 http://www.gene.ucl.ac.uk/cgi-bin/nomenclature/gdlw.pl?title=Genew+output+data
 &col=gd_hgnc_id&col=gd_app_sym&col=gd_app_name&col=gd_prev_sym&col=gd_aliases
-&col=md_prot_id&col=gd_pub_refseq_ids&col=md_eg_id&status=Approved
-&status=Approved+Non-Human&status_opt=3&=on&where=&order_by=gd_hgnc_id&limit=
-&format=text&submit=submit&.cgifields=&.cgifields=status&.cgifields=chr
+&col=md_prot_id&col=gd_pub_refseq_ids&status=Approved&status=Approved+Non-Human
+&status_opt=3&=on&where=&order_by=gd_hgnc_id&limit=&format=text&submit=submit
+&.cgifields=&.cgifields=status&.cgifields=chr

+This is a script that produces a list of HUGO identifiers with the
+UniProt and Refseq entries they are linked to.
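
To see which columns that URL requests, its query string can be
decoded with the Python standard library.  This is only an inspection
aid; the URL is the one quoted above with its line breaks removed, and
nothing here contacts the server.

```python
from urllib.parse import parse_qs, urlsplit

# The download URL from the text, joined back into a single string.
url = (
    "http://www.gene.ucl.ac.uk/cgi-bin/nomenclature/gdlw.pl"
    "?title=Genew+output+data"
    "&col=gd_hgnc_id&col=gd_app_sym&col=gd_app_name&col=gd_prev_sym"
    "&col=gd_aliases&col=md_prot_id&col=gd_pub_refseq_ids"
    "&status=Approved&status=Approved+Non-Human"
    "&status_opt=3&=on&where=&order_by=gd_hgnc_id&limit=&format=text"
    "&submit=submit&.cgifields=&.cgifields=status&.cgifields=chr"
)

# parse_qs groups repeated keys into lists and decodes '+' as a space.
query = parse_qs(urlsplit(url).query)
columns = query["col"]
statuses = query["status"]
```

The md_prot_id and gd_pub_refseq_ids columns are the ones that appear to
carry the UniProt and Refseq links mentioned above; that reading is
inferred from the column names rather than stated in the text.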

-Which is a script that produces a list of HUGO identifiers with the Uniprot and
-Refseq and EntrezGene entries they are linked to.
+The files have references to UniProt and Refseq entries and so the HUGO
+entries are set to be dependent Xrefs on these.

-The files have references to uniprot, refseq and entrezgen entries and so the 
-HUGO entries are set to be dependent xref on these.
-
-NOTE: due to length of its name the file is stored in the name of its checksum.
+NOTE:  Due to the length of its name, the file is stored under the name
+of its checksum.


 OTTT
 ----

-These are the direct mapping between the vega genes and ensembl ones. Not all
-of these are mapped but a fair proportion are.
-These create direct xrefs. 
-the file used should be :-
-LOCAL:OTTT/OTTT.txt 
+These are the direct mappings between the Vega genes and Ensembl
+ones.  Not all of these are mapped but a fair proportion are.  These
+create direct Xrefs.  The file used should be
+
+    LOCAL:OTTT/OTTT.txt


 CCDS
 ----

-The CCDS database identifies a core set of human protein coding regions that
-are consistently annotated by multiple public resources and pass quality tests.
+The CCDS database identifies a core set of human protein coding regions
+that are consistently annotated by multiple public resources and pass
+quality tests.

-A local file is used here:-
-LOCAL:CCDS/CCDS.txt 
+A local file is used here:

-The file contains a list of ccds identifiers and the ensembl entities they match to.
-So direct xrefs are created for these.
+    LOCAL:CCDS/CCDS.txt 

+The file contains a list of CCDS identifiers and the Ensembl entities
+they match to, so direct Xrefs are created for these.


 Mouse
@@ -327,14 +331,12 @@ MarkerSymbol

 Also known as MGI.

-ftp://ftp.informatics.jax.org/pub/reports/MRK_SwissProt_TrEMBL.rpt
-ftp://ftp.informatics.jax.org/pub/reports/MRK_Synonym.sql.rpt
+    ftp://ftp.informatics.jax.org/pub/reports/MRK_SwissProt_TrEMBL.rpt
+    ftp://ftp.informatics.jax.org/pub/reports/MRK_Synonym.sql.rpt

-This is mouse specific xref being the Mouse Genome Informatics data.
-Xrefs are generated via the Uniprot entries in the MGI files first, and
-via the RefSeq entries if there is no Uniprot entry.
-The files have references to uniprot entries and so the GO entries are 
-set to be dependent xrefs on these.
+This is a mouse-specific Xref source: the Mouse Genome Informatics
+data.  The files have references to UniProt entries and so the
+MarkerSymbol entries are set to be dependent Xrefs on these.


 Rat
@@ -343,11 +345,11 @@ Rat
 RGD
 ---

-Rat Genome Database entires are populate by using the file:-
-ftp://rgd.mcw.edu/pub/data_release/GENES
+Rat Genome Database entries are populated by using the file

-The rgd xrefs are dependent xrefs on the refseq entries.
+    ftp://rgd.mcw.edu/pub/data_release/GENES

+The RGD Xrefs are dependent Xrefs on the Refseq entries.


 Zebra fish
@@ -357,31 +359,28 @@ ZFIN_ID
 -------

 The two files
-http://zfin.org/data_transfer/Downloads/refseq.txt 
-http://zfin.org/data_transfer/Downloads/swissprot.txt
-contain list of zfin identifiers and refseq or swissprot indentifiers depending 
-on the file.  

-This creates a set of dependent xrefs on refseq and uniprot entries.  
+    http://zfin.org/data_transfer/Downloads/refseq.txt
+    http://zfin.org/data_transfer/Downloads/swissprot.txt
+
+contain lists of ZFIN identifiers and Refseq or Swissprot identifiers,
+depending on the file.
+
+This creates a set of dependent Xrefs on Refseq and UniProt entries.


 C Elegans
 ---------

-
-wormpep_id , wormbase_locus, wormbase_gene, wormbase_transcript
+wormpep_id, wormbase_locus, wormbase_gene, wormbase_transcript
 --------------------------------------------------------------- 

-Uses the file 
-ftp://ftp.sanger.ac.uk/pub/databases/wormpep/wormpep150/wormpep.table150
-
-and the database (last release should do)
-mysql:ensembldb.ensembl.org::caenorhabditis_elegans_core_39_150a:anonymous
-
-This creates direct xrefs for all these.
-
-
+Uses the file

+    ftp://ftp.sanger.ac.uk/pub/databases/wormpep/wormpep150/wormpep.table150

+and the database (last release should do)

+    mysql:ensembldb.ensembl.org::caenorhabditis_elegans_core_39_150a:anonymous

+This creates direct Xrefs for all these.
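
The source locations used throughout this file come in three shapes:
plain ftp/http URLs, LOCAL:<path> entries for files copied by hand, and
a mysql: specifier.  The Python sketch below shows one way to tell them
apart.  The function name parse_source is hypothetical, and the field
order assumed for the mysql form (host:port:dbname:user, with an
optional password) is a guess based on the single example above, not a
documented format.

```python
def parse_source(spec):
    """Classify a source specifier as local file, MySQL database or URL."""
    scheme, _, rest = spec.partition(":")

    if scheme == "LOCAL":
        # File that must be copied into place by hand.
        return {"type": "local", "path": rest}

    if scheme == "mysql":
        # ASSUMPTION: host:port:dbname:user[:password]; an empty field
        # (such as the port in the example) means "use the default".
        fields = rest.split(":")
        fields += [""] * (5 - len(fields))
        host, port, dbname, user, password = fields[:5]
        return {
            "type": "mysql",
            "host": host,
            "port": port,
            "dbname": dbname,
            "user": user,
            "password": password,
        }

    # Anything else (ftp://, http://) is treated as a downloadable URL.
    return {"type": "url", "url": spec}


local = parse_source("LOCAL:CCDS/CCDS.txt")
db = parse_source(
    "mysql:ensembldb.ensembl.org::caenorhabditis_elegans_core_39_150a:anonymous"
)
remote = parse_source("ftp://rgd.mcw.edu/pub/data_release/GENES")
```

Keeping the three cases behind one function makes it explicit which
sources need manual preparation before the parsers are run.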