Commit
c0930eaa
authored
20 years ago
by
Glenn Proctor
Removed - superceded by ensembl_xrefs.sxw/pdf
parent
b1bab90e
misc-scripts/xref_mapping/README
+0
-257
0 additions, 257 deletions
misc-scripts/xref_mapping/README
deleted
100644 → 0
The Ensembl Xref System
=======================
Overview
--------
What is an xref
---------------
The xref holding database
-------------------------
Parsing xrefs
-------------
Mapping xrefs to Ensembl objects
--------------------------------
The Ensembl core schema xref tables
-----------------------------------
The following is a first draft (an initial mind dump) which will be tidied
up.
The Xref tables are created and populated by the scripts in this directory.
The process has two parts: first the data is parsed, and then the parsed
xrefs are mapped to the Ensembl data.
Parsing.
What data is stored.
All species have the same "core" set of data, which consists of:

source name            Parser(s) used to populate
-------------------------------------------------------------------------------
Uniprot/SPtrembl       UniProtParser.pm
Uniprot/Swissprot      UniProtParser.pm
Refseq                 RefSeqGPFFParser.pm
MIM                    UniProtParser.pm + RefSeqGPFFParser.pm
PDB                    UniProtParser.pm
EMBL                   UniProtParser.pm
Protein_id             UniProtParser.pm
LocusLink              RefSeqGPFFParser.pm
GO                     GOParser.pm
Interpro               InterproParser.pm (no link to primary_xref, special case)
*pubmed                UniProtParser.pm + RefSeqGPFFParser.pm
*medline               UniProtParser.pm + RefSeqGPFFParser.pm
*mim2                  MIMParser.pm

* marks new data sources.
Species specific data:-

source name              species      parser
------------------------------------------------------
HUGO                     Human        HUGOParser.pm
MarkerSymbol (MGD/MGI)   Mouse        MGDParser.pm
RGD                      Rat          RGDParser.pm
ZFIN                     Zebrafish    ZFINParser.pm
General Tutorial
The perl script to create and populate the database is xref_parser.pl
xref_parser.pl --help produces:-
xref_parser.pl -user {user} -pass {password} -host {host} -port {port}
               -dbname {database} -species {species1,species2}
               -source {source1,source2} -skipdownload -create
If no source is specified then all sources are loaded. The same applies to
species, so it is best to specify the species or the script may take a while.
So to load/parse all the xrefs for human the command would be:-
xref_parser.pl -host host1 -port 3350 -user admin -pass password
               -dbname xref_store -species human -create
The following is the typical output for a rat run :-)
perl xref_parser.pl -host host1 -port 3350 -user admin -pass pass -dbname xref_test -species rat -create
Removed existing database xref_test
Creating xref_test from sql/table.sql
Populating metadata in xref_test from sql/populate_metadata.sql
Species rat is valid (name = rattus_norvegicus, ID = 10116)
Downloading ftp://ftp.ebi.ac.uk/pub/databases/SPproteomes/swissprot_files/proteomes/10116.SPC to ./UniprotSWISSPROT/10116.SPC
Checksum for 10116.SPC does not match, parsing
Parsing 10116.SPC with UniProtParser
Taxonomy ID 10116 corresponds to species ID 10116 name rattus_norvegicus
SwissProt source id for ./UniprotSWISSPROT/10116.SPC: 1
SpTREMBL source id for ./UniprotSWISSPROT/10116.SPC: 2
Read 3970 SwissProt xrefs and 4432 SPTrEMBL xrefs from ./UniprotSWISSPROT/10116.SPC
Uploading xrefs
Downloading ftp://ftp.ncbi.nih.gov/genomes/R_norvegicus/protein/protein.gbk.gz to ./RefSeq/protein.gbk.gz
Uncompressing ./RefSeq/protein.gbk.gz
Checksum for protein.gbk does not match, parsing
Parsing protein.gbk with RefSeqGPFFParser
Read 21178 xrefs from ./RefSeq/protein.gbk
Uploading xrefs
Downloading ftp://ftp.ncbi.nih.gov/genomes/R_norvegicus/RNA/rna.gbk.gz to ./RefSeq/rna.gbk.gz
Uncompressing ./RefSeq/rna.gbk.gz
Checksum for rna.gbk does not match, parsing
Parsing rna.gbk with RefSeqGPFFParser
Read 21178 xrefs from ./RefSeq/rna.gbk
Uploading xrefs
Downloading ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/RAT/gene_association.goa_rat.gz to ./GO/gene_association.goa_rat.gz
Uncompressing ./GO/gene_association.goa_rat.gz
Checksum for gene_association.goa_rat does not match, parsing
Parsing gene_association.goa_rat with GOParser
44991 GO dependent xrefs added
Downloading ftp://rgd.mcw.edu/pub/data_release/genbank_to_gene_ids.txt to ./RGD/genbank_to_gene_ids.txt
Checksum for genbank_to_gene_ids.txt does not match, parsing
Parsing genbank_to_gene_ids.txt with RGDParser
11863 xrefs succesfully loaded
2461 xrefs ignored
NOTE: Some Uniprot/Refseq entries may no longer be valid (i.e. they were not
loaded), so xrefs that depend on them are ignored. Do not panic if you see a
non-zero "xrefs ignored" count.
So we now have a set of xrefs that are dependent on the Uniprot and Refseq
entries loaded. These can then be mapped to the Ensembl entities with the
xref_mapper.pl script.
To add new data to the xrefs you will have to edit sql/populate_metadata.sql
and/or enter the SQL by hand.
To add a new source, insert a new row into the source table, e.g.
INSERT INTO source VALUES (2000, 'NEW', 1, 'Y', 4);
Because some sources depend on others having been loaded first, the last value
is the processing order. Lower numbers are processed first.
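
The ordering rule above can be illustrated with a small Python sketch. It is not
part of the xref scripts (which are Perl); the field name "order" and every row
except (2000, 'NEW', 4) are made up for the example.

```python
from typing import NamedTuple


class Source(NamedTuple):
    source_id: int
    name: str
    order: int  # hypothetical name for the last value in the INSERT above


def processing_order(sources):
    """Return the sources sorted so that lower order values come first."""
    # source_id is used only as a tie-breaker to keep the result deterministic.
    return sorted(sources, key=lambda s: (s.order, s.source_id))


# Only the 'NEW' row comes from the README; the other rows are illustrative.
sources = [
    Source(2000, "NEW", 4),
    Source(1, "Uniprot/Swissprot", 1),
    Source(3, "GO", 3),
]

ordered_names = [s.name for s in processing_order(sources)]
print(ordered_names)
```

With these made-up values the new source is processed last, after the sources
it may depend on.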
You will also have to specify the files to download and the parser to use,
e.g.
INSERT INTO source_url (source_id, species_id, url, checksum, file_modified_date,
upload_date, parser) VALUES (2000, 9606,'ftp://ftp.new.org/new.gz', '',
now(), now(), "NEWParser");
You will then have to create XrefParser/NEWParser.pm.
The parsers.
UniProtParser.pm
Each record is stored as an xref and a primary_xref.
The accession is the main key and is taken from the AC line.
Any DR line that matches a valid source name (in the database) will
be processed and stored as xrefs and dependent xrefs. Currently these
are MIM, PDB and EMBL. In addition, protein_id is taken from the EMBL line.
NOTE: InterPro is not loaded here, because the source name "Interpro" in the
database does not match "InterPro" in the DR line (note the capital P). It is
loaded separately via the Interpro parser.
New entries are now added for the medline and pubmed lines by parsing RX
lines with MEDLINE or PUBMED xrefs.
Species specific files are parsed here so all records are stored.
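
The AC/DR handling described above can be sketched in Python as follows. This
is only an illustration of the matching rule, not the code in UniProtParser.pm;
the function name, the set of valid source names and the sample record are all
made up.

```python
def parse_uniprot_record(lines, valid_sources):
    """Return (accession, dependents) for one flat-file record.

    dependents is a list of (source_name, accession) pairs taken from DR
    lines whose database name exactly matches a name in valid_sources.
    """
    accession = None
    dependents = []
    for line in lines:
        # Flat-file lines start with a two-letter code; data starts at column 6.
        code, body = line[:2], line[5:].strip()
        if code == "AC" and accession is None:
            # The first accession on the AC line is used as the main key.
            accession = body.split(";")[0].strip()
        elif code == "DR":
            fields = [f.strip() for f in body.rstrip(".").split(";")]
            source = fields[0]
            if source not in valid_sources:
                continue  # case-sensitive, so "InterPro" != "Interpro"
            dependents.append((source, fields[1]))
            # For EMBL lines the third field is the protein_id, if present.
            if source == "EMBL" and len(fields) > 2 and fields[2] != "-":
                dependents.append(("protein_id", fields[2]))
    return accession, dependents


# Made-up record for illustration only.
record = [
    "AC   P12345; Q00001;",
    "DR   EMBL; X12345; CAA12345.1; -; mRNA.",
    "DR   MIM; 123456; -.",
    "DR   InterPro; IPR000001; Kringle.",
]
accession, dependents = parse_uniprot_record(
    record, valid_sources={"MIM", "PDB", "EMBL", "Interpro"}
)
print(accession, dependents)
```

The InterPro line is skipped because the comparison is an exact string match,
which is the behaviour the NOTE above describes.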
RefSeqGPFFParser.pm
Each record is stored as an xref and a primary_xref.
The accession is the main key and is taken from the AC line.
LocusLink, OMIM, pubmed and medline are stored as xrefs and dependent xrefs.
Species specific files are parsed here so all records are stored.
GOParser.pm
Entries are only added if the corresponding Uniprot or Refseq entry has
already been loaded.
ENSEMBL entries are also ignored, as these would only map onto themselves.
Most GO entries will already exist in the xref table (from Uniprot parsing),
but most will not have a description, so one is added if there is none.
Dependent xrefs are created if they do not already exist.
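
The rules above can be sketched in Python roughly as follows. This is an
illustration only, not the code in GOParser.pm; the tuple layout is a
simplification of the association file, and all names and sample values are
made up.

```python
def add_go_xrefs(associations, loaded_accessions, xrefs):
    """Apply the GO rules to (db, master_accession, go_id, description) tuples.

    xrefs maps GO id -> description (None when no description is stored yet)
    and is updated in place. Returns the set of (master_accession, go_id)
    dependent xrefs.
    """
    dependents = set()
    for db, master, go_id, description in associations:
        if db == "ENSEMBL":
            continue  # would only map onto itself
        if master not in loaded_accessions:
            continue  # master Uniprot/Refseq entry was never loaded
        if xrefs.get(go_id) is None:
            xrefs[go_id] = description  # add the xref or fill in its description
        dependents.add((master, go_id))  # a set, so duplicates are not re-added
    return dependents


# Made-up data for illustration only.
xrefs = {"GO:0005515": None}
associations = [
    ("UniProt", "P12345", "GO:0005515", "protein binding"),
    ("ENSEMBL", "ENSRNOP00000000001", "GO:0005515", "protein binding"),
    ("UniProt", "Q99999", "GO:0008150", "biological_process"),
    ("UniProt", "P12345", "GO:0005515", "protein binding"),
]
go_dependents = add_go_xrefs(associations, {"P12345"}, xrefs)
print(go_dependents, xrefs)
```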
InterproParser.pm
The xrefs are stored for each Interpro entry but NO dependent xrefs are stored.
Instead a separate table (interpro) is populated with the interpro/pfam
mappings. Uniprot/Refseq accessions are NOT checked to see if they are already
in the database, so the data is not species specific, but the xref is stored
with the species specified in the run.
MIMParser.pm
Uses the gene names to map the MIM numbers to the protein accessions via the
HUGO numbers. So HUGO must already have been parsed, as well as Uniprot and
Refseq. These are stored as MIM2 at present and are expected to replace the
disease database.
HUGOParser.pm, MGDParser.pm, RGDParser.pm, ZFINParser.pm
Uniprot and Refseq must already have been parsed. Entries are added to the
xref table and the dependent xrefs are linked to the proteins (if they have
been loaded). So entries are only added if the accession is valid for Uniprot
or Refseq for that particular species.
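
The loaded/ignored counts reported at the end of a run (see the RGD lines in
the rat output above) follow from this rule. A minimal Python sketch, with
made-up names and data and not taken from the Perl parsers:

```python
def load_species_xrefs(pairs, loaded_accessions):
    """Split (protein_accession, symbol) pairs into loaded xrefs and an ignored count."""
    loaded = []
    ignored = 0
    for protein_accession, symbol in pairs:
        if protein_accession in loaded_accessions:
            loaded.append((protein_accession, symbol))
        else:
            # The Uniprot/Refseq entry is not in the database for this species.
            ignored += 1
    return loaded, ignored


# Made-up accessions and symbols for illustration only.
pairs = [("P12345", "Abc1"), ("Q99999", "Xyz2"), ("NP_000001", "Abc1")]
loaded, ignored = load_species_xrefs(pairs, {"P12345", "NP_000001"})
print(f"{len(loaded)} xrefs loaded, {ignored} xrefs ignored")
```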
NOTE: RefSeqParser.pm also exists and can be used to parse the fasta-style
files for Refseq. At the moment the GenBank-style files are parsed for
both the protein and rna files, but the xrefs are on the whole just duplicated,
as the two files contain basically the same xref data. A decision will have to
be made as to the benefits/disadvantages of this. The alternative is to parse
the rna as fasta (which I think is what the old system used to do, judging by
the numbers of xrefs).