Commit c0930eaa authored by Glenn Proctor

Removed - superseded by ensembl_xrefs.sxw/pdf

parent b1bab90e
The Ensembl Xref System
=======================
Overview
--------
What is an xref
---------------
The xref holding database
-------------------------
Parsing xrefs
-------------
Mapping xrefs to Ensembl objects
--------------------------------
The Ensembl core schema xref tables
-----------------------------------
The following is a first draft/initial mind dump which will be tidied up.
The xref tables are created and populated by the scripts in this directory.
The process has two parts: first the external data is parsed, and then the
parsed xrefs are mapped to the Ensembl data.
Parsing.
What data is stored.
All species have the same "core" set of data, which consists of:
source name          Parser(s) used to populate
-------------------------------------------------------------------------------
Uniprot/SPtrembl     UniProtParser.pm
Uniprot/Swissprot    UniProtParser.pm
Refseq               RefSeqGPFFParser.pm
MIM                  UniProtParser.pm + RefSeqGPFFParser.pm
PDB                  UniProtParser.pm
EMBL                 UniProtParser.pm
Protein_id           UniProtParser.pm
LocusLink            RefSeqGPFFParser.pm
GO                   GOParser.pm
Interpro             InterproParser.pm (no link to primary_xref, special case)
*pubmed              UniProtParser.pm + RefSeqGPFFParser.pm
*medline             UniProtParser.pm + RefSeqGPFFParser.pm
*mim2                MIMParser.pm
Sources marked with * are new data sources.
Species-specific data:
source name              species      parser
------------------------------------------------------
HUGO                     Human        HUGOParser.pm
MarkerSymbol (MGD/MGI)   Mouse        MGDParser.pm
RGD                      Rat          RGDParser.pm
ZFIN                     Zebrafish    ZFINParser.pm
General Tutorial
The perl script to create and populate the database is xref_parser.pl
xref_parser.pl --help produces:
xref_parser.pl -user {user} -pass {password} -host {host} -port {port}
    -dbname {database} -species {species1,species2}
    -source {source1,source2} -skipdownload -create
If no source is specified then all sources are loaded. The same applies to
species, so it is best to specify the species or the script may take a while.
So to load/parse all the xrefs for human the command would be:
perl xref_parser.pl -host host1 -port 3350 -user admin -pass password
    -dbname xref_store -species human -create
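When the script is driven from another program, it can help to assemble the
argument list in one place. The following is only a sketch: the helper name
build_xref_parser_command is hypothetical, and the only flags it uses are the
ones listed in the usage text above.

```python
def build_xref_parser_command(host, port, user, password, dbname,
                              species=(), sources=(),
                              create=False, skip_download=False):
    """Assemble an argument list for xref_parser.pl (hypothetical helper)."""
    cmd = ["perl", "xref_parser.pl",
           "-host", host, "-port", str(port),
           "-user", user, "-pass", password,
           "-dbname", dbname]
    # Species and sources are passed as comma-separated lists.
    if species:
        cmd += ["-species", ",".join(species)]
    if sources:
        cmd += ["-source", ",".join(sources)]
    if skip_download:
        cmd.append("-skipdownload")
    if create:
        cmd.append("-create")
    return cmd


# Same values as the human example in the text.
human_cmd = build_xref_parser_command(
    "host1", 3350, "admin", "password", "xref_store",
    species=["human"], create=True)
print(" ".join(human_cmd))
```

The resulting list can be handed to subprocess.run; leaving species empty
omits -species altogether, which per the text means every species is loaded.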
The following is the typical output for a rat run :-)
perl xref_parser.pl -host host1 -port 3350 -user admin -pass pass -dbname xref_test -species rat -create
Removed existing database xref_test
Creating xref_test from sql/table.sql
Populating metadata in xref_test from sql/populate_metadata.sql
Species rat is valid (name = rattus_norvegicus, ID = 10116)
Downloading ftp://ftp.ebi.ac.uk/pub/databases/SPproteomes/swissprot_files/proteomes/10116.SPC to ./UniprotSWISSPROT/10116.SPC
Checksum for 10116.SPC does not match, parsing
Parsing 10116.SPC with UniProtParser
Taxonomy ID 10116 corresponds to species ID 10116 name rattus_norvegicus
SwissProt source id for ./UniprotSWISSPROT/10116.SPC: 1
SpTREMBL source id for ./UniprotSWISSPROT/10116.SPC: 2
Read 3970 SwissProt xrefs and 4432 SPTrEMBL xrefs from ./UniprotSWISSPROT/10116.SPC
Uploading xrefs
Downloading ftp://ftp.ncbi.nih.gov/genomes/R_norvegicus/protein/protein.gbk.gz to ./RefSeq/protein.gbk.gz
Uncompressing ./RefSeq/protein.gbk.gz
Checksum for protein.gbk does not match, parsing
Parsing protein.gbk with RefSeqGPFFParser
Read 21178 xrefs from ./RefSeq/protein.gbk
Uploading xrefs
Downloading ftp://ftp.ncbi.nih.gov/genomes/R_norvegicus/RNA/rna.gbk.gz to ./RefSeq/rna.gbk.gz
Uncompressing ./RefSeq/rna.gbk.gz
Checksum for rna.gbk does not match, parsing
Parsing rna.gbk with RefSeqGPFFParser
Read 21178 xrefs from ./RefSeq/rna.gbk
Uploading xrefs
Downloading ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/RAT/gene_association.goa_rat.gz to ./GO/gene_association.goa_rat.gz
Uncompressing ./GO/gene_association.goa_rat.gz
Checksum for gene_association.goa_rat does not match, parsing
Parsing gene_association.goa_rat with GOParser
44991 GO dependent xrefs added
Downloading ftp://rgd.mcw.edu/pub/data_release/genbank_to_gene_ids.txt to ./RGD/genbank_to_gene_ids.txt
Checksum for genbank_to_gene_ids.txt does not match, parsing
Parsing genbank_to_gene_ids.txt with RGDParser
11863 xrefs succesfully loaded
2461 xrefs ignored
NOTE: Some Uniprot/Refseq entries may no longer be valid (and so are not
loaded), so xrefs that depend on them are ignored. Do not panic if you see
"xrefs ignored" messages with non-zero values.
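The "Checksum for ... does not match, parsing" lines in the output above
suggest that a file is only re-parsed when its checksum differs from the one
recorded on a previous run. A minimal sketch of that idea follows. The function
names are hypothetical, and MD5 is an assumption: this document does not say
which checksum the scripts actually use.

```python
import hashlib


def file_checksum(data: bytes) -> str:
    # Assumption: MD5 stands in for whatever checksum the real scripts use.
    return hashlib.md5(data).hexdigest()


def needs_parsing(data: bytes, stored_checksum: str) -> bool:
    """Return True when the file content differs from the recorded checksum."""
    return file_checksum(data) != stored_checksum


stored = ""                      # nothing recorded yet for this file
data = b"example file contents"  # stand-in for a downloaded file

first_run = needs_parsing(data, stored)   # no match, so the file is parsed
stored = file_checksum(data)              # record the checksum after parsing
second_run = needs_parsing(data, stored)  # unchanged file, parsing is skipped

print(first_run, second_run)
```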
So we now have a set of xrefs that are dependent on the Uniprot and Refseq
entries loaded. These can then be mapped to the Ensembl entities with the
xref_mapper.pl script.
To add new data to the xrefs you will have to edit sql/populate_metadata.sql
and/or type in the SQL by hand.
To add a new source, insert a new row into the source table, e.g.
INSERT INTO source VALUES (2000, 'NEW', 1, 'Y', 4);
Because some sources depend on others being loaded first, the last value is
the order. Lower numbers are processed first.
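The effect of the order value can be sketched as a simple sort. The tuples
below mirror the shape of the VALUES list in the INSERT above. Only the last
field (the order) is interpreted; the meaning of the other fields is not
described here, and every row except the 'NEW' one is made up for illustration.

```python
# Rows shaped like: INSERT INTO source VALUES (2000, 'NEW', 1, 'Y', 4);
sources = [
    (2000, "NEW", 1, "Y", 4),
    (9001, "HYPOTHETICAL_PRIMARY", 1, "Y", 1),    # made-up row
    (9002, "HYPOTHETICAL_DEPENDENT", 1, "Y", 2),  # made-up row
]

# Lower order values are processed first, so sources that others depend on
# must carry the smaller number.
processing_order = sorted(sources, key=lambda row: row[-1])

for row in processing_order:
    print(row[1], "order", row[-1])
```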
You will also have to specify the files to download and the parser to use,
e.g.
INSERT INTO source_url (source_id, species_id, url, checksum, file_modified_date,
upload_date, parser) VALUES (2000, 9606,'ftp://ftp.new.org/new.gz', '',
now(), now(), "NEWParser");
You will also have to create the parser module itself,
XrefParser/NEWParser.pm, so that its name matches the parser value above.
The parsers.
UniProtParser.pm
Each record is stored as an xref and a primary_xref.
The accession is the main key and is taken from the AC line.
Any DR line that matches a valid source name (in the database) will
be processed and stored as xrefs and dependent xrefs. Currently these
are MIM, PDB and EMBL. In addition, protein_id is taken from the EMBL line.
NOTE: InterPro is not loaded here because the source name Interpro does not
match InterPro as written on the DR line (note the capital P). It is loaded
separately via the Interpro parser.
New entries are now added for the medline and pubmed lines by parsing RX
lines with MEDLINE or PUBMED xrefs.
Species specific files are parsed here so all records are stored.
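The AC/DR handling described for UniProtParser.pm can be sketched as follows.
This is not the Perl parser itself: parse_record and VALID_SOURCES are
hypothetical names, the accessions in the sample record are made up, and
taking the protein_id from the third field of the EMBL DR line is an
assumption based on the usual UniProt flat-file layout.

```python
# Source names as they would appear in the database; matching is
# case-sensitive, which is why a "DR   InterPro; ..." line is skipped.
VALID_SOURCES = {"MIM", "PDB", "EMBL", "Interpro"}


def parse_record(lines):
    """Return (accession, [(source, dependent_accession), ...]) for one record."""
    accession = None
    dependents = []
    for line in lines:
        code, _, rest = line.partition("   ")
        if code == "AC" and accession is None:
            # The first accession on the AC line is the main key.
            accession = rest.split(";")[0].strip()
        elif code == "DR":
            fields = [f.strip() for f in rest.rstrip(".").split(";")]
            if len(fields) < 2 or fields[0] not in VALID_SOURCES:
                continue
            dependents.append((fields[0], fields[1]))
            # Assumption: the protein_id is the third field of an EMBL DR line.
            if fields[0] == "EMBL" and len(fields) > 2 and fields[2] != "-":
                dependents.append(("Protein_id", fields[2]))
    return accession, dependents


record = [
    "ID   TEST_RAT   (made-up record)",
    "AC   P00001; Q00002;",
    "DR   EMBL; X00001; CAA00001.1; -; mRNA.",
    "DR   MIM; 100000; gene.",
    "DR   PDB; 1ABC; X-ray; 2.00 A; A=1-100.",
    "DR   InterPro; IPR000001; Kringle.",
]
accession, dependents = parse_record(record)
print(accession, dependents)
```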
RefSeqGPFFParser.pm
Each record is stored as an xref and a primary_xref.
The accession is the main key and is taken from the AC line.
LocusLink, OMIM, pubmed and medline are stored as xrefs and dependent xrefs.
Species specific files are parsed here so all records are stored.
GOParser.pm
Entries are only added if the corresponding Uniprot or Refseq entry has
already been loaded.
ENSEMBL entries are also ignored as these will only map onto themselves.
Most GO entries will already exist in the xref table (from Uniprot parsing)
but most will not have the description, so this is added if there is none.
Dependent xrefs are created if they do not exist already.
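The GOParser.pm rules above can be summarised in a short sketch. The function
add_go_annotations and the in-memory dict/set stand-ins for the xref and
dependent xref tables are hypothetical, the protein accessions are made up,
and passing the description in alongside each annotation is an assumption,
since this document does not say where the description comes from.

```python
def add_go_annotations(annotations, loaded_accessions, xrefs, dependent_xrefs):
    """Apply the GO rules to in-memory stand-ins for the xref tables."""
    added = 0
    for db, accession, go_id, description in annotations:
        if db == "ENSEMBL":
            continue  # would only map onto itself
        if accession not in loaded_accessions:
            continue  # master Uniprot/Refseq entry was not loaded
        entry = xrefs.setdefault(go_id, {"description": None})
        if not entry["description"]:
            entry["description"] = description  # fill in only if missing
        link = (accession, go_id)
        if link not in dependent_xrefs:
            dependent_xrefs.add(link)  # create the dependent xref once
            added += 1
    return added


loaded = {"P00001"}                            # made-up loaded accession
xrefs = {"GO:0005515": {"description": None}}  # exists, but no description
dependent_xrefs = set()
annotations = [
    ("UniProt", "P00001", "GO:0005515", "protein binding"),
    ("UniProt", "P00001", "GO:0005515", "protein binding"),  # duplicate
    ("UniProt", "Q99999", "GO:0005634", "nucleus"),          # not loaded
    ("ENSEMBL", "ENSRNOP00000000001", "GO:0005634", "nucleus"),
]
added = add_go_annotations(annotations, loaded, xrefs, dependent_xrefs)
print(added, "GO dependent xrefs added")
```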
InterproParser.pm
The xrefs are stored for each Interpro entry but NO dependent xrefs are stored.
Instead a separate table (interpro) is populated with the interpro/pfam
mappings. Uniprot/Refseq accessions are NOT checked to see if they are already
in the database, so the parsing itself is not species-specific, but the xref
is stored with the species specified in the run.
MIMParser.pm
Uses the gene names to map the MIM numbers to the protein accessions via the
HUGO numbers. So HUGO has to be parsed already, as well as Uniprot and Refseq.
These are stored as MIM2 at present and are expected to replace the disease
database.
HUGOParser.pm, MGDParser.pm, RGDParser.pm, ZFINParser.pm
Uniprot and Refseq must already be parsed. Entries are added to the
xref table and the dependent xrefs are linked to the proteins (if they have
been loaded). So entries are added only if the accession is valid for Uniprot
or Refseq for that particular species.
NOTE: RefSeqParser.pm also exists and can be used to parse the fasta-style
files for Refseq. At the moment the genbank-style files are parsed for
both the protein and rna files, but the xrefs are on the whole just duplicated
as the two files contain basically the same xref data. A decision will have to
be made as to the benefits/disadvantages of this. The alternative is to parse
the rna as fasta (which I think is what the old system used to do, judging by
the numbers of xrefs).