Commit 6dbfde89, authored 18 years ago by Andreas Kusalananda Kähäri
Now on the right branch... I hope.
parent 525d9ea5
misc-scripts/xref_mapping/parsing_information.txt
185 additions, 186 deletions
UniProt/Swissprot - UniProt/Trembl (UNIversal PROTein resource)
---------------------------------------------------------------
UniProt/Swissprot - Uniprot/Trembl (UNIversal PROTein resource)
----------------------------------
The files cans come in two types;
The files cans come in two types:
1) contains data for all species
ftp://ftp.ebi.ac.uk/pub/databases/uniprot/knowledgebase/uniprot_sprot.dat.gz
or
ftp://ftp.ebi.ac.uk/pub/databases/uniprot/knowledgebase/uniprot_trembl.dat.gz
This is the normal case.
1) Contains data for all species
2) contains data for that one species
ftp://ftp.ebi.ac.uk/pub/databases/integr8/uniprot/proteomes/17.D_melanogaster.dat.gz
ftp://ftp.ebi.ac.uk/pub/databases/uniprot/knowledgebase/uniprot_sprot.dat.gz
or
These are primary xrefs in that they contain sequence and hence can be mapped
to the ensembl entities via normal alignment methods (we use exonerate).
ftp://ftp.ebi.ac.uk/pub/databases/uniprot/knowledgebase/uniprot_trembl.dat.gz
This is the normal case.
2) Contains data for that one species
ftp://ftp.ebi.ac.uk/pub/databases/integr8/uniprot/proteomes/17.D_melanogaster.dat.gz
Below is a list of dependent xrefs that might be added.
EMBL
PDB
protein_id
GO
MIM_GENE (human only)
MIM_MORBID (human only)
HUGO (human only)
MarkerSymbol (mouse only) aka MGI.
These are primary Xrefs in that they contain sequence and hence can be
mapped to the Ensembl entities via normal alignment methods (we use
exonerate).
Below is a list of dependent Xrefs that might be added:
EMBL
PDB
protein_id
GO
MIM_GENE (human only)
MIM_MORBID (human only)
HUGO (human only)
MarkerSymbol (mouse only) aka MGI.
Refseq_peptide
--------------
The files come in two types those for specific species i.e.
ftp://ftp.ncbi.nih.gov/genomes/Canis_familiaris/protein/protein.gbk.gz
ftp://ftp.ncbi.nih.gov/genomes/Canis_familiaris/protein/protein.gbk.gz
or as a series of numbered none specific species files i.e.
ftp://ftp.ncbi.nih.gov/refseq/release/vertebrate_other/vertebrate_other3.protein.gpff.gz
ftp://ftp.ncbi.nih.gov/refseq/release/vertebrate_other/vertebrate_other3.protein.gpff.gz
These files are parsed by the parser RefSeqGPFFParser.pm
These are primary xrefs in that they contain sequence and hence can be mapped
to the ensembl entities via normal alignment methods (we use exonerate).
These are primary Xrefs in that they contain sequence and hence can be
mapped to the Ensembl entities via normal alignment methods (we use
exonerate).
Below is a list of dependent Xrefs that might be added:
GO
EntrezGene
HUGO (human only)
RGD (rat only)
Below is a list of dependent xrefs that might be added.
GO
EntrezGene
HUGO (human only)
RGD (rat only)
Refseq_dna
----------
Refseq_dna is now a priority xref source for human, so in addition to the ncbi file used it will
also use a local file that is generated from the CCDS data which DIRECTLY links refseqs to the ensembl
trancsripts. If a refseq is not in this file then the sequence data from the ncbi is used to mapped
via exonerate in the normal manner.
More generally.
The files come in two types those for specific species i.e.
ftp://ftp.ncbi.nih.gov/genomes/Gallus_gallus/RNA/rna.gbk.gz
ftp://ftp.ncbi.nih.gov/genomes/Gallus_gallus/RNA/rna.gbk.gz
or as a series of numbered none specific species files i.e.
ftp://ftp.ncbi.nih.gov/refseq/release/vertebrate_mammalian/vertebrate_mammalian46.rna.fna.gz
ftp://ftp.ncbi.nih.gov/refseq/release/vertebrate_mammalian/vertebrate_mammalian46.rna.fna.gz
These files are parsed by the parser RefSeqParser.pm
These are primary xrefs in that they contain sequence and hence can be mapped
to the ensembl entities via normal alignment methods (we use exonerate).
These are primary Xrefs in that they contain sequence and hence can be
mapped to the Ensembl entities via normal alignment methods (we use
exonerate).
Below is a list of dependent xrefs that might be added.
HUGO (human only)
RGD (rat only)
Below is a list of dependent Xrefs that might be added:
HUGO (human only)
RGD (rat only)
IPI (International Protein Index)
---
---------------------------------
Comes as species specific file i.e.
comes as species specific file i.e.
ftp://ftp.ebi.ac.uk/pub/databases/IPI/current/ipi.HUMAN.fasta.gz
ftp://ftp.ebi.ac.uk/pub/databases/IPI/current/ipi.HUMAN.fasta.gz
The files have something like:-
The files have something like
>IPI:IPI00000005.1|SWISS-PROT:P01111|TREMBL:Q5U091|ENSEMBL:ENSP00000261444;ENSP00000358548|REFSEQ:NP_002515|VEGA:OTTHUMP00000013879 Tax_Id=9606 GTPase NRas precursor
seqeunce..................
But most of the header information is ignored except for the description and
the IPI value. The sequence is used to position the ipi xref.
But most of the header information is ignored except for the description
and the IPI value. The sequence is used to position the ipi Xref.
These are primary Xrefs in that they contain sequence and hence can be
mapped to the Ensembl entities via normal alignment methods (we use
exonerate).
These are primary xrefs in that they contain sequence and hence can be mapped
to the ensembl entities via normal alignment methods (we use exonerate).
Has no dependent xrefs.
Has no dependent Xrefs.
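As an illustration of the IPI header handling described above, here is a minimal Python sketch that keeps only the IPI value and the description. It is not the parser in the repository: the function name `parse_ipi_header` is hypothetical, and it assumes that the description is whatever follows an optional `Tax_Id=` token and that the version suffix of the IPI value is kept.

```python
from typing import Tuple


def parse_ipi_header(header: str) -> Tuple[str, str]:
    """Return (ipi_value, description) from one IPI FASTA header line.

    Hypothetical helper: every other cross-reference in the header
    (SWISS-PROT, TREMBL, ENSEMBL, REFSEQ, VEGA) is ignored.
    """
    line = header.lstrip(">").strip()
    # The first space-separated token holds the pipe-delimited identifiers.
    ids, _, rest = line.partition(" ")
    ipi_value = None
    for field in ids.split("|"):
        key, _, value = field.partition(":")
        if key == "IPI":
            ipi_value = value
            break
    if ipi_value is None:
        raise ValueError("no IPI identifier in header: %r" % header)
    # Assumption: a leading Tax_Id=... token is not part of the description.
    words = rest.split()
    if words and words[0].startswith("Tax_Id="):
        words = words[1:]
    return ipi_value, " ".join(words)
```

Applied to the example header above, this returns the IPI value `IPI00000005.1` and the description `GTPase NRas precursor`.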
UniGene
-------
comes as species specific file i.e.
ftp://ftp.ncbi.nih.gov/repository/UniGene/Bos_taurus/Bt.seq.uniq.gz
ftp://ftp.ncbi.nih.gov/repository/UniGene/Bos_taurus/Bt.data.gz
Comes as species specific file i.e.
ftp://ftp.ncbi.nih.gov/repository/UniGene/Bos_taurus/Bt.seq.uniq.gz
ftp://ftp.ncbi.nih.gov/repository/UniGene/Bos_taurus/Bt.data.gz
These are primary xrefs in that they contain sequence and hence can be mapped
to the ensembl entities via normal alignment methods (we use exonerate).
No longer loaded via Uniprot.
These are primary Xrefs in that they contain sequence and hence can be
mapped to the Ensembl entities via normal alignment methods (we use
exonerate). No longer loaded via UniProt.
Has no dependent xrefs.
Has no dependent Xrefs.
AgilentProbe
------------
This is a local and is not down loaded automatically via ftp so an AgilentProbe
the file must be copied by hand. this will be some thing like:-
LOCAL:AgilentProbe/HumanExpression.fasta
This is a local and is not down loaded automatically via ftp so an
AgilentProbe the file must be copied by hand. This will be some thing
like
LOCAL:AgilentProbe/HumanExpression.fasta
These are primary xrefs in that they contain sequence and hence can be mapped
to the ensembl entities via normal alignment methods (we use exonerate).
These are primary Xrefs in that they contain sequence and hence can be
mapped to the Ensembl entities via normal alignment methods (we use
exonerate).
Has no dependent Xrefs.
Has no dependent xrefs.
AgilentCGH
----------
This is a local and is not down loaded automatically via ftp so an AgilentProbe
the file must be copied by hand. this will be some thing like:-
LOCAL:AgilentCGH/HumanCGH.fasta
This is a local and is not down loaded automatically via ftp so an
AgilentProbe the file must be copied by hand. This will be some thing
like
These are primary xrefs in that they contain sequence and hence can be mapped
to the ensembl entities via normal alignment methods (we use exonerate).
LOCAL:AgilentCGH/HumanCGH.fasta
Has no dependent xrefs.
These are primary Xrefs in that they contain sequence and hence can be
mapped to the Ensembl entities via normal alignment methods (we use
exonerate).
Has no dependent Xrefs.
EMBL
----
These are dependent xrefs and are linked to ensembl via the Uniprot entrys.
These are dependent Xrefs and are linked to Ensembl via the UniProt
entries.
PDB
---
Protein Data Bank entries are dependent xrefs and are linked to ensembl
via the Uniprot entrys.
Protein Data Bank entries are dependent Xrefs and are linked to Ensembl
via the UniProt entries.
protein_id
----------
These are dependent xrefs and are linked to ensembl via the Uniprot entrys.
These are dependent Xrefs and are linked to Ensembl via the UniProt
entries.
PUBMED + Medline
----------------
These are no longer stored due to the large numbers of these. If you want to
add these then see the UniprotParser and RefseqPArser for more details.
These are no longer stored due to the large numbers of these. If you
want to add these then see the UniProtParser and RefseqPArser for more
details.
GO
--
Can come in a species specific file or can contain all species.
ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gene_association.goa_uniprot.gz
ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/HUMAN/gene_association.goa_human.gz
GO information in the Uniprot and refseq files are ignored and just the information
from the above files are used. The files have references to uniprot and refseq entries
and so the GO entries are set to be dependent xref on these.
ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gene_association.goa_uniprot.gz
ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/HUMAN/gene_association.goa_human.gz
GO information in the UniProt and Refseq files are ignored and just the
information from the above files are used. The files have references to
UniProt and Refseq entries and so the GO entries are set to be dependent
Xref on these.
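The dependency described above (each GO entry hangs off the UniProt or Refseq entry named on the same line) can be sketched in Python as follows. This is an illustration only: it assumes the files use the tab-separated GO gene association layout, in which column 1 is the source database, column 2 the accession and column 5 the GO identifier, and the name `iter_go_dependents` is hypothetical.

```python
from typing import Iterator, TextIO, Tuple


def iter_go_dependents(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (source_db, master_accession, go_id) per annotation line.

    Assumed layout: tab-separated columns with the source database in
    column 1, the accession in column 2 and the GO identifier in column 5.
    """
    for line in handle:
        # Lines starting with '!' are comments in gene association files.
        if line.startswith("!") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 5:
            continue  # too short to carry a GO identifier
        yield cols[0], cols[1], cols[4]
```

Each yielded triple names the master entry (UniProt or Refseq accession) that the GO term would be attached to as a dependent Xref.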
EntrezGene
----------
gene-centered information at NCBI is stored as a depenedent xref and the mappings are
obtained from the refseq entires. Data about descriptions and synonyms are obtained from
the file gene_info.gz file from ncbi.
Gene-centered information at NCBI is stored as a depenedent Xref and is
obtained from the Refseq entires.
Interpro
--------
InterPro is a database of protein families, domains and functional sites and
gets it data from the file:-
ftp://ftp.ebi.ac.uk/pub/databases/interpro/interpro.xml.gz
NOTE: Interpro has its own table and hence the xrefs are stored but are not linked to
the ensembl entities directly but a list of interpro and identifiers are stored.
The identifiers stored are of the type :-
PROSITE, PFAM, PRINTS, PREFILE, PROFILE, TIGRFAMs
InterPro is a database of protein families, domains and functional sites
and gets it data from the file
ftp://ftp.ebi.ac.uk/pub/databases/interpro/interpro.xml.gz
NOTE: Interpro has its own table and hence the Xrefs are stored but
are not linked to the Ensembl entities directly but a list of interpro
and identifiers are stored. The identifiers stored are of the type
PROSITE, PFAM, PRINTS, PREFILE, PROFILE, TIGRFAMs
UniProt/Varsplic
----------------
Alternative splice forms are obtained from the follwing file;-
ftp://ftp.ebi.ac.uk/pub/databases/uniprot/knowledgebase/uniprot_sprot_varsplic.fasta.gz
Alternative splice forms are obtained from the follwing file
ftp://ftp.ebi.ac.uk/pub/databases/uniprot/knowledgebase/uniprot_sprot_varsplic.fasta.gz
These are primary Xrefs in that they contain sequence and hence can be
mapped to the Ensembl entities via normal alignment methods (we use
exonerate).
These are primary xrefs in that they contain sequence and hence can be mapped
to the ensembl entities via normal alignment methods (we use exonerate).
Has no dependent Xrefs.
Has no dependent xrefs.
ncRNA, RFAM, miRNA_Registry
---------------------------
ncRNA,RFAM,miRNA_Registry
-------------------------
This is a local and is not down loaded automatically via ftp so you must
copy this file first before running the parser.
This is a local and is not down loaded automatically via ftp so you must copy this
file first before running the parser.
LOCAL:ncRNA/ncRNA.txt
LOCAL:ncRNA/ncRNA.txt
These are direct xrefs so the file contains data on what the xref is and
which ensembl entity it matches too.
These are direct Xrefs so the file contains data on what the Xref is and
which Ensembl entity it matches too.
-----------------------
SPECIES SPECIFIC ENTRYS
-----------------------
------------------------
SPECIES SPECIFIC ENTRIES
------------------------
Human
...
...
@@ -237,86 +255,72 @@ Human
MIM - Online Mendelian Inheritance in Man
-----------------------------------------
Descriptions and types are obtained from the file:-
ftp://ftp.ncbi.nih.gov/repository/OMIM/omim.txt.Z
Descriptions and types are obtained from the file
This creates two set of xrefs these being :-
ftp://ftp.ncbi.nih.gov/repository/OMIM/omim.txt.Z
This creates two set of Xrefs:
1) MIM_GENE (disease genes and other expressed gene)
2) MIM_MORBID (the disease genes)
Note those in set 2 will also be in set 1.
These MIM xrefs are linked to UniProt/SwissProt entries using the
UniProtParser.pm creating dependent xrefs. Note if the Swissprot Entrie does
not specify wether the MIM entrie is a phenotype or a gene then it is ignored.
Fro this same reason MIM dependent xrefs are NOT obtained from the refseq entries
These MIM Xrefs are linked to UniProt/SwissProt entries using the
UniProtParser.pm creating dependent Xrefs. Note if the Swissprot Entrie does
not specify wether the MIM entrie is a phenotype or a gene then it is ignored.
For this same reason MIM dependent Xrefs are NOT obtained from the Refseq
entries.
So when the swissport entries are matched to ensembl the MIM entries
will also be matched.
So when the swissport entries are matched to Ensembl the MIM entries
will also be matched.
HUGO
----
HUGO data uses prioritys to allocate each identifier to one ensembl id.
The prioritys are :-
1) via Havana
2) Via CCDS
3) Via Refseq
4) Via Uniprot
5) Via Entrezgene
1) DIRECT relationships are made by transfering the manually annotated ones from
havana to ensembl.
LOCAL:HUGO/HUGO_TO_ENSG
2) DIRECT relationships are made by transfering the ones from CCDS to ensembl.
LOCAL:HUGO/CCDS_TO_HUGO
3,4 and 5)
The Human Genome Organisation xrefs are obtained from using the following url:-
The Human Genome Organisation Xrefs are obtained from using the
following url:
http://www.gene.ucl.ac.uk/cgi-bin/nomenclature/gdlw.pl?title=Genew+output+data
&col=gd_hgnc_id&col=gd_app_sym&col=gd_app_name&col=gd_prev_sym&col=gd_aliases
&col=md_prot_id&col=gd_pub_refseq_ids&col=md_eg_id&status=Approved
&status=Approved+Non-Human
&status_opt=3&=on&where=&order_by=gd_hgnc_id&limit=
&format=text&submit=submit
&.cgifields=&.cgifields=status&.cgifields=chr
&col=md_prot_id&col=gd_pub_refseq_ids&status=Approved&status=Approved+Non-Human
&status_opt=3&=on&where=&order_by=gd_hgnc_id&limit=
&format=text&submit=submit
&.cgifields=&.cgifields=status&.cgifields=chr
Which is a script that produces a list of HUGO identifiers with the
UniProt and Refseq entries they are linked to.
Which is a script that produces a list of HUGO identifiers with the Uniprot and
Refseq and EntrezGene entries they are linked to.
The files have references to UniProt and Refseq entries and so the GO
entries are set to be dependent Xref on these.
The files have references to uniprot, refseq and entrezgen entries and so the
HUGO entries are set to be dependent xref on these.
NOTE: due to length of its name the file is stored in the name of its checksum.
NOTE: due to length of its name the file is stored in the name of its
checksum.
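The priority order listed in this HUGO section (Havana, CCDS, Refseq, Uniprot, Entrezgene) amounts to keeping, for each HUGO identifier, the mapping from the highest-ranked source. A minimal Python sketch of that selection follows. The function name, the tuple layout and the identifiers in the usage note are hypothetical, and how the real pipeline breaks ties within one source is not stated in this file.

```python
from typing import Dict, Iterable, Tuple

# Source names in descending priority, as listed in the HUGO section.
PRIORITY = ["Havana", "CCDS", "Refseq", "Uniprot", "Entrezgene"]


def pick_by_priority(
    candidates: Iterable[Tuple[str, str, str]]
) -> Dict[str, Tuple[str, str]]:
    """Map each HUGO identifier to the (source, ensembl_id) of best priority.

    `candidates` holds (hugo_id, source, ensembl_id) tuples. On a tie
    within one source the first candidate seen is kept (an assumption).
    """
    rank = {name: i for i, name in enumerate(PRIORITY)}
    best: Dict[str, Tuple[str, str]] = {}
    for hugo_id, source, ensembl_id in candidates:
        current = best.get(hugo_id)
        if current is None or rank[source] < rank[current[0]]:
            best[hugo_id] = (source, ensembl_id)
    return best
```

For example, given made-up candidates `("H1", "Uniprot", "ENSG_A")` and `("H1", "CCDS", "ENSG_B")`, the CCDS mapping to `ENSG_B` is the one kept for `H1`.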
OTTT
----
These are the direct mapping between the vega genes and ensembl ones. Not all
of these are mapped but a fair proportion are.
These create direct xrefs.
the file used should be :-
LOCAL:OTTT/OTTT.txt
These are the direct mapping between the vega genes and Ensembl ones. Not all
of these are mapped but a fair proportion are. These create direct Xrefs.
The file used should be
LOCAL:OTTT/OTTT.txt
CCDS
----
The CCDS database identifies a core set of human protein coding regions that
are consistently annotated by multiple public resources and pass quality tests.
The CCDS database identifies a core set of human protein coding regions
that are consistently annotated by multiple public resources and pass
quality tests.
A local file is used here:-
LOCAL:CCDS/CCDS.txt
A local file is used here:
The file contains a list of ccds identifiers and the ensembl entities they match to.
So direct xrefs are created for these.
LOCAL:CCDS/CCDS.txt
The file contains a list of ccds identifiers and the Ensembl entities
they match to. So direct Xrefs are created for these.
Mouse
...
...
@@ -327,14 +331,12 @@ MarkerSymbol
Also known as MGI.
ftp://ftp.informatics.jax.org/pub/reports/MRK_SwissProt_TrEMBL.rpt
ftp://ftp.informatics.jax.org/pub/reports/MRK_Synonym.sql.rpt
ftp://ftp.informatics.jax.org/pub/reports/MRK_SwissProt_TrEMBL.rpt
ftp://ftp.informatics.jax.org/pub/reports/MRK_Synonym.sql.rpt
This is mouse specific xref being the Mouse Genome Informatics data.
Xrefs are generated via the Uniprot entries in the MGI files first, and
via the RefSeq entries if there is no Uniprot entry.
The files have references to uniprot entries and so the GO entries are
set to be dependent xrefs on these.
This is mouse specific Xref being the Mouse Genome Informatics data.
The files have references to UniProt entries and so the GO entries are
set to be dependent Xrefs on these.
Rat
...
...
@@ -343,11 +345,11 @@ Rat
RGD
--
Rat Genome Database entires are populate by using the file:-
ftp://rgd.mcw.edu/pub/data_release/GENES
Rat Genome Database entries are populate by using the file
The rgd xrefs are dependent xrefs on the refseq entries.
ftp://rgd.mcw.edu/pub/data_release/GENES
The rgd Xrefs are dependent Xrefs on the Refseq entries.
Zebra fish
...
...
@@ -357,31 +359,28 @@ ZFIN_ID
-------
The two files
http://zfin.org/data_transfer/Downloads/refseq.txt
http://zfin.org/data_transfer/Downloads/swissprot.txt
contain list of zfin identifiers and refseq or swissprot indentifiers depending
on the file.
This creates a set of dependent xrefs on refseq and uniprot entries.
http://zfin.org/data_transfer/Downloads/refseq.txt
http://zfin.org/data_transfer/Downloads/swissprot.txt
contain list of zfin identifiers and Refseq or swissprot indentifiers
depending on the file.
This creates a set of dependent Xrefs on Refseq and UniProt entries.
C Elegans
---------
wormpep_id , wormbase_locus, wormbase_gene, wormbase_transcript
wormpep_id, wormbase_locus, wormbase_gene, wormbase_transcript
---------------------------------------------------------------
Uses the file
ftp://ftp.sanger.ac.uk/pub/databases/wormpep/wormpep150/wormpep.table150
and the database (last release should do)
mysql:ensembldb.ensembl.org::caenorhabditis_elegans_core_39_150a:anonymous
This creates direct xrefs for all these.
Uses the file
ftp://ftp.sanger.ac.uk/pub/databases/wormpep/wormpep150/wormpep.table150
and the database (last release should do)
mysql:ensembldb.ensembl.org::caenorhabditis_elegans_core_39_150a:anonymous
This creates direct Xrefs for all these.
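The source locations in this file come in several shapes: ftp/http URLs, `LOCAL:` paths for files copied by hand, and the `mysql:` string above. The Python sketch below shows one way to tell them apart. It is an assumption, not something this file states, that the colon-separated fields of the `mysql:` form are host, port, database name and user in that order; `parse_source_uri` is a hypothetical name.

```python
from typing import Dict, Optional


def parse_source_uri(uri: str) -> Dict[str, Optional[str]]:
    """Split a source location into its parts.

    ASSUMPTION: the mysql form is mysql:host:port:dbname:user, with an
    empty port field allowed. The file itself does not document this.
    """
    scheme, _, rest = uri.partition(":")
    if scheme == "LOCAL":
        # Hand-copied file, given relative to some local data directory.
        return {"scheme": "LOCAL", "path": rest}
    if scheme == "mysql":
        parts = rest.split(":")
        parts += [""] * (4 - len(parts))  # pad any missing trailing fields
        host, port, dbname, user = parts[:4]
        return {
            "scheme": "mysql",
            "host": host,
            "port": port or None,
            "dbname": dbname,
            "user": user,
        }
    # ftp:// and http:// locations are passed through unchanged.
    return {"scheme": scheme, "url": uri}
```

Under that assumption, the C Elegans string splits into host `ensembldb.ensembl.org`, no port, database `caenorhabditis_elegans_core_39_150a` and user `anonymous`.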