Commit f8e86770 authored by Ian Longden

Moved to docs directory

parent 2151b2cc
UniProt/Swissprot - UniProt/Trembl (UNIversal PROTein resource)
---------------------------------------------------------------
The files can come in two types:
1) Contains data for all species
ftp://ftp.ebi.ac.uk/pub/databases/uniprot/knowledgebase/uniprot_sprot.dat.gz
or
ftp://ftp.ebi.ac.uk/pub/databases/uniprot/knowledgebase/uniprot_trembl.dat.gz
This is the normal case.
2) Contains data for one species only
ftp://ftp.ebi.ac.uk/pub/databases/integr8/uniprot/proteomes/17.D_melanogaster.dat.gz
These are primary Xrefs in that they contain sequence and hence can be
mapped to the Ensembl entities via normal alignment methods (we use
Exonerate).
This is a list of dependent Xrefs that might be added:
EMBL
PDB
protein_id
Note: For human, mouse and rat we also take the direct mappings from
UniProt for the SwissProt entries. Those not mapped by UniProt are then
processed in the normal way.
Refseq_peptide
--------------
The files come in two types: those for a specific species, e.g.
ftp://ftp.ncbi.nih.gov/genomes/Canis_familiaris/protein/protein.gbk.gz
or a series of numbered files that are not species-specific, e.g.
ftp://ftp.ncbi.nih.gov/refseq/release/vertebrate_other/vertebrate_other3.protein.gpff.gz
These files are parsed by the parser RefSeqGPFFParser.pm.
These are primary Xrefs in that they contain sequence and hence can be
mapped to the Ensembl entities via normal alignment methods (we use
Exonerate).
Below is a list of dependent Xrefs that might be added:
EntrezGene
Refseq_dna
----------
The files come in two types: those for a specific species, e.g.
ftp://ftp.ncbi.nih.gov/genomes/Gallus_gallus/RNA/rna.gbk.gz
or a series of numbered files that are not species-specific, e.g.
ftp://ftp.ncbi.nih.gov/refseq/release/vertebrate_mammalian/vertebrate_mammalian46.rna.fna.gz
These files are parsed by the parser RefSeqParser.pm.
These are primary Xrefs in that they contain sequence and hence can be
mapped to the Ensembl entities via normal alignment methods (we use
Exonerate).
IPI (International Protein Index)
---------------------------------
Comes as a species-specific file, e.g.
ftp://ftp.ebi.ac.uk/pub/databases/IPI/current/ipi.HUMAN.fasta.gz
Each entry in the file looks something like
>IPI:IPI00000005.1|SWISS-PROT:P01111|TREMBL:Q5U091|ENSEMBL:ENSP00000261444;ENSP00000358548|REFSEQ:NP_002515|VEGA:OTTHUMP00000013879 Tax_Id=9606 GTPase NRas precursor
sequence..................
Most of the header information is ignored, except for the description
and the IPI value. The sequence is used to position the IPI Xref.
These are primary Xrefs in that they contain sequence and hence can be
mapped to the Ensembl entities via normal alignment methods (we use
Exonerate).
Has no dependent Xrefs.
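As a rough illustration of the header handling described above, the
Python sketch below pulls the IPI value and the description out of a
header line of the form shown. It is not the parser used by the Xref
system: the function name is hypothetical, and it assumes that the
description is whatever follows the 'Tax_Id=' field.

```python
import re

# Hypothetical helper, not part of the Xref system: extract the IPI
# accession, its version and the description from an IPI FASTA header.
HEADER_RE = re.compile(r"^>IPI:(?P<acc>IPI\d+)\.(?P<ver>\d+)\|")


def parse_ipi_header(header):
    """Return (accession, version, description), or None if not an IPI header."""
    match = HEADER_RE.match(header)
    if match is None:
        return None
    # Assumption: the free-text description is everything after 'Tax_Id=<n> '.
    parts = re.split(r"\sTax_Id=\d+\s+", header, maxsplit=1)
    description = parts[1].strip() if len(parts) == 2 else ""
    return match.group("acc"), int(match.group("ver")), description


example = (">IPI:IPI00000005.1|SWISS-PROT:P01111|TREMBL:Q5U091|"
           "ENSEMBL:ENSP00000261444;ENSP00000358548|REFSEQ:NP_002515|"
           "VEGA:OTTHUMP00000013879 Tax_Id=9606 GTPase NRas precursor")
print(parse_ipi_header(example))
```

The other database identifiers in the header are deliberately left
unparsed, mirroring the statement that they are ignored.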
UniGene
-------
Comes as species-specific files, e.g.
ftp://ftp.ncbi.nih.gov/repository/UniGene/Bos_taurus/Bt.seq.uniq.gz
ftp://ftp.ncbi.nih.gov/repository/UniGene/Bos_taurus/Bt.data.gz
These are primary Xrefs in that they contain sequence and hence can be
mapped to the Ensembl entities via normal alignment methods (we use
Exonerate). No longer loaded via UniProt.
Has no dependent Xrefs.
EMBL
----
These are dependent Xrefs and are linked to Ensembl via the UniProt
entries.
PDB
---
Protein Data Bank entries are dependent Xrefs and are linked to Ensembl
via the UniProt entries.
protein_id
----------
These are dependent Xrefs and are linked to Ensembl via the UniProt
entries.
PUBMED + Medline
----------------
These are no longer stored because there are so many of them. If you
want to add them, see UniProtParser.pm and RefSeqParser.pm for more
details.
GO
--
Comes either as a species-specific file or as a file containing all species.
ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gene_association.goa_uniprot.gz
ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/HUMAN/gene_association.goa_human.gz
GO information in the UniProt and RefSeq files is ignored and only the
information from the above files is used. The files have references to
UniProt and RefSeq entries, so the GO entries are set to be dependent
Xrefs on these.
EntrezGene
----------
Gene-centred information at NCBI is stored as a dependent Xref and is
obtained from the RefSeq entries.
InterPro
--------
InterPro is a database of protein families, domains and functional sites
and gets its data from the file
ftp://ftp.ebi.ac.uk/pub/databases/interpro/interpro.xml.gz
NOTE: InterPro has its own table, so the Xrefs are stored but are not
linked to the Ensembl entities directly. Instead, a list of InterPro
identifiers and associated identifiers is stored. The associated
identifiers are of the types
PROSITE, PFAM, PRINTS, PREFILE, PROFILE, TIGRFAMs
ncRNA, RFAM, miRNA_Registry
---------------------------
This is a local file and is not downloaded automatically via FTP, so you
must copy it into place before running the parser.
file:ncRNA/ncRNA.txt
These are direct Xrefs, so the file contains data on what the Xref is
and which Ensembl entity it matches.
SPECIES SPECIFIC ENTRIES
------------------------
------------------------
Human
-----
MIM - Online Mendelian Inheritance in Man
-----------------------------------------
Descriptions and types are obtained from the file
ftp://grcf.jhmi.edu/OMIM/omim.txt.Z
This creates two sets of Xrefs:
1) MIM_GENE (disease genes and other expressed genes)
2) MIM_MORBID (the disease genes)
Note that those in set 2 will also be in set 1.
These MIM Xrefs are linked to UniProt/SwissProt entries using
UniProtParser.pm, creating dependent Xrefs. Note that if the SwissProt
entry does not specify whether the MIM entry is a phenotype or a gene,
the MIM entry is ignored. For the same reason, MIM dependent Xrefs are
NOT obtained from the RefSeq entries.
So when the SwissProt entries are matched to Ensembl, the MIM entries
will also be matched.
HGNC
----
The HGNC (HUGO Gene Nomenclature Committee) Xrefs are obtained from
various sources:
1) HGNC (ensembl_mapped)
HGNC has direct mappings to Ensembl which have been manually curated.
This information is obtained from the script
http://www.genenames.org/cgi-bin/hgnc_downloads.cgi
2) CCDS
The HGNC Xrefs are connected to the same Ensembl objects that the CCDS
entries are linked to. We connect to the CCDS database to get this
information.
3) Vega
This is made from the Havana manually curated database.
4) HGNC
HGNC has links to other databases such as UniProt and RefSeq, and these
can be used to link to Ensembl.
Which of these is chosen at the mapping stage is based on the priorities
of the sources, which are listed above in priority order.
This is known as a priority Xref, as the mapping with the best priority
is chosen.
CCDS
----
The CCDS database identifies a core set of human protein coding regions
that are consistently annotated by multiple public resources and pass
quality tests.
A local file is used here:
file:CCDS/CCDS.txt
The file contains a list of CCDS identifiers and the Ensembl entities
they match to. So direct Xrefs are created for these.
Mouse
-----
MGI
---
Previously known as 'MarkerSymbol'.
ftp://ftp.informatics.jax.org/pub/reports/MRK_SwissProt_TrEMBL.rpt
ftp://ftp.informatics.jax.org/pub/reports/MRK_Synonym.sql.rpt
This is a mouse-specific Xref holding the Mouse Genome Informatics data.
The files have references to UniProt entries, so the MGI entries are
set to be dependent Xrefs on these.
Rat
---
RGD
---
Rat Genome Database entries are populated using the file
ftp://rgd.mcw.edu/pub/data_release/GENES
The RGD Xrefs are dependent Xrefs on the RefSeq entries.
Zebrafish
---------
ZFIN_ID
-------
The two files
http://zfin.org/data_transfer/Downloads/refseq.txt
http://zfin.org/data_transfer/Downloads/swissprot.txt
contain lists of ZFIN identifiers paired with RefSeq or SwissProt
identifiers, depending on the file.
This creates a set of dependent Xrefs on RefSeq and UniProt entries.
C. elegans
----------
wormpep_id, wormbase_locus, wormbase_gene, wormbase_transcript
--------------------------------------------------------------
Uses the file
ftp://ftp.sanger.ac.uk/pub/databases/wormpep/wormpep180/wormpep.table180
and the database (the latest release should do)
mysql:ensembldb.ensembl.org:3306:caenorhabditis_elegans_core_46_170b:anonymous
This creates direct Xrefs for all these.
The Xref System
========================================================================
The external database references (Xrefs) are added to the Ensembl
databases using the code found in this directory. The process consists
of two parts. The first part parses the data into a temporary database
(the Xref database). The second part maps the new Xrefs to the Ensembl
database.
Parsing the external database references
------------------------------------------------------------------------
In this directory you will find an ini-file called 'xref_config.ini'.
This file contains two types of configuration sections: source sections
and species sections. A source section defines Xref priority, order
etc. (as key-value pairs, see the comment at the top of the source
sections for a fuller explanation of these keys) for the source and
also the URIs pointing to the data files that the source should use.
The source label is only used to refer to the source within the
ini-file (from a species section), so it can be any text string whose
meaning is easy to understand.
A species section contains information about species aliases, the
numerical taxonomy ID(s) and what sources to use for that species. If
a species has more than one taxonomy ID (in the case where there are
multiple strains or subspecies, for example), there can be more than one
'taxonomy_id' key. The name of the species is defined by the source
label and will be stored in the Xref database.
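To make the point about repeated 'taxonomy_id' keys concrete, here is a
minimal Python sketch that reads an ini-style file and keeps every value
of a repeated key. The sample text, section names and key names are
assumptions made for illustration; consult the real 'xref_config.ini'
for the actual layout. This is not how 'xref_config2sql.pl' works
internally.

```python
from collections import defaultdict

# Hypothetical sample: section and key names are illustrative assumptions,
# not copied from the real 'xref_config.ini'.
SAMPLE = """\
[source ExampleSource]
priority = 1
uri = ftp://example.org/data.gz

[species example_species]
taxonomy_id = 1111
taxonomy_id = 2222
source = ExampleSource
"""


def parse_config(text):
    """Map each section header to a dict of key -> list of values.

    Values are kept in lists so that a key may appear more than once
    (for example several 'taxonomy_id' lines in one species section).
    """
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue  # skip blank lines and comments
        if line.startswith("[") and line.endswith("]"):
            current = sections.setdefault(line[1:-1].strip(), defaultdict(list))
        elif "=" in line and current is not None:
            key, value = line.split("=", 1)
            current[key.strip()].append(value.strip())
    return sections


config = parse_config(SAMPLE)
for name, keys in config.items():
    kind, label = name.split(None, 1)
    print(kind, label, dict(keys))
```

A hand-written reader is used because Python's standard configparser
either rejects repeated keys (the default strict mode) or keeps only the
last value, which would lose the extra taxonomy IDs.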
For now, the script 'xref_config2sql.pl' (also found in this directory)
should be used to convert the ini-file into an SQL file, which should
then replace the file 'sql/populate_metadata.sql'. The
'xref_config2sql.pl' script expects to find 'xref_config.ini' in the
current directory, but you may specify an alternative file as the first
command line argument to the script if you have moved or renamed the
ini-file. When 'xref_parser.pl' is run it will load the generated SQL
file into the database and will then download and parse all external
data files for one or several specified species.
If you want to add a new source you will have to add a new source
section, following the pattern used by the other source sections. You
will then have to add it to the species that require the data.
If the new data comes in files not previously handled by the Xref
system, you will also have to write a parser, e.g. NewSourceParser.pm
(the parser name may be chosen arbitrarily), in the XrefParser
directory. You can find lots of examples of parsers in that directory.
Before running the Xref parser, make sure that the environment
variable 'http_proxy' is set to point to the local HTTP proxy to get
outside the firewall. For Sanger, the value of the variable should be
"http://cache.internal.sanger.ac.uk:3128", i.e. for tcsh shells you
should have

  setenv http_proxy http://cache.internal.sanger.ac.uk:3128

in your ~/.tcshrc file, while for bash-like shells you should have

  export http_proxy=http://cache.internal.sanger.ac.uk:3128

in your ~/.profile or ~/.bashrc file.
The script 'xref_parser.pl' accepts several options, but for most runs
all you need to specify is the user (user name on the database), pass
(password), host (database host), dbname, and species, e.g.

  perl xref_parser.pl -host mymachine -user admin -pass XXXX \
    -dbname new_human_xref -species human
Please keep the output from this script and check it later. At the end
of the output there will be a summary of what was successful and what
failed to run. This is important.
The parsing can create three types of Xrefs:
1) Primary (These have sequence and are mapped via exonerate)
2) Dependent (Have no sequence but are dependent on the Primary ones)
3) Direct (These are directly linked to the Ensembl entities, so the
mapping is already done)
Some sources have more than one set of files associated with them. In
these cases the sets have the same source name but different source IDs.
These are known as "priority Xrefs", as the Xrefs are mapped according
to the priority of the source. An example of this is HGNC (HUGO).
For more information on what data can be parsed, see the
'parsing_information.txt' file.
Mapping the external database references to the Ensembl core database
------------------------------------------------------------------------
This is an overview of what goes on in the script 'xref_mapper.pl'.
Primary Xrefs are dumped out to two Fasta files, one for peptides and
the other for DNA. Ensembl Transcripts and Translations are then dumped
out to two files in Fasta format.
Exonerate is then used to find the best matches for the Xrefs.
If there is more than one best match then the Xref is mapped to
more than one Ensembl entity. A cutoff is used to filter the best
matches to make sure they pass certain criteria. By default this
is that the query identity OR the target identity must be over
90%. This can be changed by creating your own '<method>.pm' file
in the directory 'XrefMapper/Methods' and creating subroutines
'query_identity_threshold()' and 'target_identity_threshold()' which
return the new values.
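The default cutoff described above can be sketched in Python as follows.
This only illustrates the "query identity OR target identity" rule; the
function and variable names are hypothetical, the hit values are made
up, and "over 90%" is assumed to mean strictly greater than 90. The real
thresholds are supplied by the Perl method modules.

```python
# Assumed defaults taken from the text: either identity must be over 90%.
QUERY_IDENTITY_THRESHOLD = 90
TARGET_IDENTITY_THRESHOLD = 90


def passes_cutoff(query_identity, target_identity,
                  query_threshold=QUERY_IDENTITY_THRESHOLD,
                  target_threshold=TARGET_IDENTITY_THRESHOLD):
    """Keep a match if EITHER identity exceeds its threshold."""
    return (query_identity > query_threshold
            or target_identity > target_threshold)


# Made-up hits: (xref label, query identity %, target identity %).
hits = [
    ("xref_a", 95, 60),   # kept: query identity is over 90
    ("xref_b", 70, 99),   # kept: target identity is over 90
    ("xref_c", 85, 88),   # dropped: neither identity is over 90
]
kept = [label for label, q_id, t_id in hits if passes_cutoff(q_id, t_id)]
print(kept)
```

Passing different threshold arguments plays the same role as overriding
the two threshold subroutines in a custom method module.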
Exonerate generates a set of .map files containing the mappings. The
map-files are then parsed and any mappings that pass the criteria are
stored in the 'xref' table, 'object_xref' table and the 'identity_xref'
table.
All dependent Xrefs are also stored if the parent is mapped.
Direct Xrefs are also stored at this stage, but no mapping is needed
here as we already know what each Xref maps to.
For priority Xrefs (ones that have multiple sources) only the
highest-priority one is stored.
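A minimal Python sketch of this selection is given below. It is not the
mapper's code: the record layout, accessions and source labels are
invented for illustration, and it assumes that a lower priority number
means a better (higher) priority.

```python
def best_priority(mappings):
    """For each Xref accession keep only the mapping with the best priority.

    Assumption: a LOWER number is a better priority.
    """
    best = {}
    for accession, source, priority, ensembl_id in mappings:
        current = best.get(accession)
        if current is None or priority < current[1]:
            best[accession] = (source, priority, ensembl_id)
    return best


# Made-up candidate mappings: (accession, source label, priority, Ensembl id).
candidates = [
    ("X1", "source_low", 4, "gene_b"),
    ("X1", "source_top", 1, "gene_a"),
    ("X1", "source_mid", 2, "gene_a"),
    ("X2", "source_mid", 2, "gene_c"),
]
chosen = best_priority(candidates)
for accession, (source, priority, ensembl_id) in sorted(chosen.items()):
    print(accession, source, priority, ensembl_id)
```

Here 'X1' is stored once, from the best-priority source, even though
three sources offered a mapping for it.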
Any Xrefs which fail to be mapped are written to the unmapped_object
table with a brief explanation of why they could not be mapped.
Once all the mappings have been stored, the display Xrefs and the
descriptions are generated for the transcripts and genes.
If you want to change any of the default settings you can create a new
'<species>.pm' for your particular species, or '<taxon>.pm', and
override the defaults in 'BasicMapper.pm' (see 'rattus_norvegicus.pm'
as an example).
The 'xref_mapper.pl' script needs a configuration file which has
information on the Xref database and the core database and also the
species name. Below is an example of running the mapping.

  perl ~/ensembl-live/ensembl/misc-scripts/xref_mapping/xref_mapper.pl \
    -file xref_input -upload >&MAPPER.OUT

Here is an example of a configuration file for 'xref_mapper.pl':
------------------------------------------------------------------------
xref
host=ensembl-machine
port=3306
dbname=human_xref_42
user=admin
password=xxxx
dir=./xref

species=homo_sapiens
taxon=mammalia (this is optional - use taxon if you need more than one species to use the same '<taxon>.pm' module)
host=ensembl-machine
port=3306
dbname=homo_sapiens_core_42_36d
user=admin
password=xxxx
dir=./ensembl

farm
queue=long
exonerate=/software/ensembl/bin/exonerate-1.4.0
------------------------------------------------------------------------
Note that it is good practice to use a sub-directory for the Ensembl
directory, as many files are generated. Keeping them together and away
from everything else makes them much easier to find. The directory can
also be tarred and zipped in case you need to check things later.