Commit 3b4ca4a9 authored by Monika Komorowska
Questions
---------
1) What software do I need to run the external database cross-reference
mapping?
2) What is the recommended way to run the external database cross-references
for an already entered species?
3) How do I add a new species?
4) How do I add a new external database source?
5) How do I track my progress?
6) I have mapping errors; how do I fix them?
7) How do I start again from the parsing_finished stage?
8) How do I start again from the mapping_finished stage?
9) How do I run my external database references without a compute farm?
10) I want to use a different list of external database sources for my
display_xrefs (names)?
11) I want to use a different list of external database sources for my gene
descriptions?
Answers
-------
1) What software do I need to run the external database cross-reference mapping?
You will need a copy of exonerate and the ensembl API code.
Exonerate installation instructions can be found at
http://www.ebi.ac.uk/~guy/exonerate/
To install the ensembl API see
http://www.ensembl.org/info/docs/api/api_installation.html
2) What is the recommended way to run the xrefs for an already entered species?
The xref system comes in two parts: first the external database sources are
parsed into a temporary xref database, and then these are mapped to the core
database.
a) To parse the data into the xref database you should use the script
xref_parser.pl, which can be found in ensembl/misc-scripts/xref_mapping
directory.
xref_parser.pl -user rwuser -pass XXX -host host1 -species human
-dbname human_xref -stats -create >& PARSER.OUT
Check the file PARSER.OUT to make sure everything is okay. It could be that
the parser was unable to connect to an external site, in which case not
everything will have been loaded.
If there was a problem with the connections, try again, but this time use
the option -checkdownload; this will not re-download data you already have
but will try to get the data you are missing, saving time.
The xref_parser.pl script may wait for you to answer a couple of questions
about overwriting the database or redoing the configuration, so you will
also have to keep an eye on the output file; saving the output is usually
worth doing anyway, to keep a record of what the parser did.
At the end of the parsing you should get a summary which should look
something like:-
============================================================================
Summary of status
============================================================================
EntrezGene EntrezGeneParser OKAY
GO GOParser OKAY
GO InterproGoParser OKAY
Interpro InterproParser OKAY
RefSeq_dna RefSeqParser OKAY
RefSeq_peptide RefSeqGPFFParser OKAY
UniGene UniGeneParser OKAY
Uniprot/SPTREMBL UniProtParser OKAY
Uniprot/SWISSPROT UniProtParser OKAY
ncRNA ncRNA_DBParser OKAY
If any of these are not OKAY then there has been a problem, so look further
up in the file to find out why it failed.
b) Map the external database entries to the core database.
First you need to create a configuration file.
Below is an example of a configuration file
####################################################
xref
host=host1
port=3306
dbname=macaca_xref
user=user1
password=pass1
dir=./xref_dir
species=macaca_mulatta
host=host2
port=3306
dbname=macaca_core
user=user2
password=pass2
dir=./ensembl_dir
farm
queue=long
exonerate=/software/ensembl/bin/exonerate-1.4.0
####################################################
Note that the directories specified must exist when the mapping is done.
The farm section is optional and can be left out, but it may be needed if
you have different queue names or if exonerate is not installed in the
default place.
Now we can do the mapping.
Ideally this should be done in two steps so that after the first step you
can check the output to make sure you are happy with everything before
loading into the core database.
i) Map the entities in the xref database and do some checks etc.
perl ~/src/ensembl/misc-scripts/xref_mapper/xref_mapper.pl
-file xref_config -nofarm >& MAPPER1.OUT
or if using the farm
bsub -o mapper.out -e mapper.err
perl ~/src/ensembl/misc-scripts/xref_mapper/xref_mapper.pl -file xref_config
Check the output file. Do not worry about warnings that xref numbers have
increased; the main things to be concerned about are reductions in numbers,
and xrefs that are in the core database but not in the xref database.
If you get errors about the mapping files then a couple of things could
have gone wrong; the first and usual culprit is that the system ran out of
disk space, or that the compute farm job got lost.
In this case you have two options
1) reset the database to the parsing stage and rerun all the mappings
To reset the database use the option -reset_to_parsing_finished
xref_mapper.pl -file xref_config -reset_to_parsing_finished
then redo the mapping
xref_mapper.pl -file xref_config -dumpcheck >& MAPPER.OUT
Note here we use -dumpcheck to make sure the program does not dump the
fasta files if they are already there, as this process can take a long
time and the fasta files will not have changed.
2) just redo those jobs that failed.
Run the mapper with the -resubmit_failed_jobs flag
xref_mapper.pl -file xref_config -resubmit_failed_jobs
Option 2 will be much faster as it will only redo the jobs that failed.
ii) Load the data into the core database and calculate the display_xrefs etc
xref_mapper.pl -file xref_config -upload >& MAPPER2.OUT
3) How do I add a new species?
Edit the file xref_config.ini and add a new entry in the species section
Here is an example:-
[species macaca_mulatta]
taxonomy_id = 9544
aliases = macaque, rhesus, rhesus macaque, rmacaque
source = EntrezGene::MULTI
source = GO::MULTI
source = InterproGO::MULTI
source = Interpro::MULTI
source = RefSeq_dna::MULTI-vertebrate_mammalian
source = RefSeq_peptide::MULTI-vertebrate_mammalian
source = Uniprot/SPTREMBL::MULTI
source = Uniprot/SWISSPROT::MULTI
source = UniGene::macaca_mulatta
source = ncRNA::MULTI
[species xxxx] and taxonomy_id must be present.
It is usually best just to cut and paste an already existing similar species
and start from that.
4) How do I add a new external database source?
Edit the file xref_config.ini and add a new entry in the sources section
Here is an example:-
[source Fantom::mus_musculus]
# Used by mus_musculus
name = Fantom
download = Y
order = 100
priority = 1
prio_descr =
parser = FantomParser
release_uri =
data_uri = ftp://fantom.gsc.riken.jp/DDBJ_fantom3_HTC_accession.txt.gz
name: The name you want to call the external database.
You must also add this to the core databases
download: Y if the data needs to be obtained online (i.e. not a local file)
N if you are getting the data from a file.
order: The order in which the source should be parsed, 1 being the first.
priority: This is for sources where we get the data from multiple places
i.e. HGNC. For most sources just set this to 1.
prio_descr: Only used for priority sources; sets a description that gives
a way to differentiate them and track which is which.
parser: Which parser to use. If this is a new source then you will probably
need a new parser. Find a parser that is similar and start from this.
Parsers must be in the ensembl/misc-scripts/xref_mapping/XrefParser
directory.
release_uri: a uri to get the release information from. The parser should
handle this.
data_uri: Explains how and where to get the data from. There can be multiple
lines of this.
The uri can get data via several methods; here is the list with a brief
explanation of each.
ftp: Get the file via ftp.
script: Passes arguments to the parser. This might be something like a
database to connect to, to run some SQL to get the data.
file: The name, with full path, of the file to be parsed.
http: Get data via an external webpage/cgi script.
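As an illustration of the file method, a local-file source (download = N)
entry might look like the following. The source name, parser name and path
here are hypothetical; substitute your own:

```ini
[source MyLocalSource::homo_sapiens]
# A source parsed from a file already on local disk, so download = N
name = MyLocalSource
download = N
order = 50
priority = 1
prio_descr =
parser = MyLocalSourceParser
release_uri =
data_uri = file:/data/xrefs/my_local_source.txt
```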
5) How do I track my progress?
If you did not use -noverbose then the output file should give you a general
idea of what stage you are at. By examining the xref database directly you
can see the last stage that was completed by viewing the entries in the
process_status table.
Another option is to use the script xref_tracker.pl, which will give you
some information about the status. The script is run in a similar way to
xref_mapper.pl, in that it needs a config file:
xref_tracker.pl -file xref_config
This script gives more information when the xref_mapper is running the
mapping jobs or processing the mapping files as it will tell you how many
have finished and how many are left to run etc. These are the longer stages
of the process.
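If you prefer to look at the process_status table yourself, a query along
the following lines should list the stages completed so far. This is a
sketch that assumes the table has status and date columns; check the
schema of your xref database version before relying on it:

```sql
-- List completed stages in the order they were recorded
SELECT status, date
FROM process_status
ORDER BY date;
```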
6) I have mapping errors; how do I fix them?
A mapping job can fail for reasons such as running out of disk space or the
compute farm losing a job. In this case you have a couple of options.
i) reset the database to the parsing stage and rerun all the mappings
To reset the database use the option -reset_to_parsing_finished
xref_mapper.pl -file xref_config -reset_to_parsing_finished
then redo the mapping
xref_mapper.pl -file xref_config -dumpcheck
Note here we use -dumpcheck to make sure the program does not dump the
fasta files if they are already there, as this process can take a long
time and the fasta files will not have changed.
ii) just redo those jobs that failed.
Run the mapper with the -resubmit_failed_jobs flag
xref_mapper.pl -file xref_config -resubmit_failed_jobs
7) How do I start again from the parsing_finished stage?
To reset the database use the option -reset_to_parsing_finished
xref_mapper.pl -file xref_config -reset_to_parsing_finished
8) How do I start again from the mapping_finished stage?
To reset the database use the option -reset_to_mapping_finished
xref_mapper.pl -file xref_config -reset_to_mapping_finished
Remember to use -dumpcheck when you run xref_mapper.pl the next
time to save time.
9) How do I run my external database references without a compute farm?
Simply use the -nofarm option with the xref_mapper.pl script.
This will run the exonerate jobs locally.
10) I want to use a different list of external database sources for my
display_xrefs (names)?
The external databases to be used for the display_xrefs are taken either
from the subroutine transcript_display_xref_sources in DisplayXrefs.pm, i.e.
sub transcript_display_xref_sources {
my @list = qw(miRBase
RFAM
HGNC_curated_gene
HGNC_automatic_gene
MGI_curated_gene
MGI_automatic_gene
Clone_based_vega_gene
Clone_based_ensembl_gene
HGNC_curated_transcript
HGNC_automatic_transcript
MGI_curated_transcript
MGI_automatic_transcript
Clone_based_vega_transcript
Clone_based_ensembl_transcript
IMGT/GENE_DB
HGNC
SGD
MGI
flybase_symbol
Anopheles_symbol
Genoscope_annotated_gene
Uniprot/SWISSPROT
Uniprot/Varsplic
RefSeq_peptide
RefSeq_dna
Uniprot/SPTREMBL
EntrezGene
IPI);
my %ignore;
$ignore{"EntrezGene"}= 'FROM:RefSeq_[pd][en][pa].*_predicted';
return [\@list,\%ignore];
}
or, if you want to create your own list, you need to create a species .pm
file and define the subroutine there. An example here is for
drosophila_melanogaster: in the file drosophila_melanogaster.pm
(found in the directory ensembl/misc-scripts/xref_mapping/XrefMapper)
we have :-
sub transcript_display_xref_sources {
my @list = qw(FlyBaseName_transcript FlyBaseCGID_transcript flybase_annotation_id);
my %ignore;
$ignore{"EntrezGene"}= 'FROM:RefSeq_[pd][en][pa].*_predicted';
return [\@list,\%ignore];
}
11) I want to use a different list of external database sources for my gene
descriptions?
As above, but this time override the subroutine gene_description_sources.
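A minimal override in a species .pm file might look like the following
sketch. The source names listed are illustrative only; use whichever
external databases you want descriptions taken from, and check the
existing gene_description_sources in DisplayXrefs.pm for the exact return
style expected by your version of the code:

```perl
sub gene_description_sources {
  # Sources to take gene descriptions from, in order of preference.
  return ("FlyBaseName_gene",
          "FlyBaseCGID_gene");
}
```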
Example output
--------------
Options: -file xref_input
running in verbose mode
current status is parsing_finished
No alt_alleles found for this species.
Dumping xref & Ensembl sequences
Dumping Xref fasta files
Dumping Ensembl Fasta files
53067 Transcripts dumped 41693 Transaltions dumped
Deleting out, err and map files from output dir: /workdir/release_65/zebrafish/ensembl
Deleting txt and sql files from output dir: /workdir/release_65/zebrafish/ensembl
LSF job ID for main mapping job: 887287, name ExonerateGappedBest1_1318933449 with 481 arrays elements)
LSF job ID for main mapping job: 887288, name ExonerateGappedBest1_1318933451 with 253 arrays elements)
LSF job ID for Depend job: 887289 (job array with 1 job)
already processed = 0, processed = 734, errors = 0, empty = 0
Could not find stable id ENSDART00000126968 in table to get the internal id hence ignoring!!! (for RFAM)
Could not find stable id ENSDART00000121043 in table to get the internal id hence ignoring!!! (for RFAM)
The foillowing will be processed as priority xrefs
Uniprot/SPTREMBL
ZFIN_ID
Process Pairs
Starting at object_xref of 837705
NEW 2733
2733 new relationships added
Writing InterPro
246386 already existed
Wrote 0 interpro table entries
including 51399 object xrefs,
and 51399 go xrefs
ZFIN_ID is associated with both Transcript and Translation object types
Therefore moving all associations from Translation to Transcript
DBASS3 moved to Gene level.
DBASS3 moved to Gene level.
DBASS5 moved to Gene level.
DBASS5 moved to Gene level.
EntrezGene moved to Gene level.
EntrezGene moved to Gene level.
miRBase moved to Gene level.
miRBase moved to Gene level.
RFAM moved to Gene level.
RFAM moved to Gene level.
TRNASCAN_SE moved to Gene level.
TRNASCAN_SE moved to Gene level.
RNAMMER moved to Gene level.
RNAMMER moved to Gene level.
UniGene moved to Gene level.
UniGene moved to Gene level.
Uniprot_genename moved to Gene level.
Uniprot_genename moved to Gene level.
WikiGene moved to Gene level.
WikiGene moved to Gene level.
MIM_GENE moved to Gene level.
MIM_GENE moved to Gene level.
MIM_MORBID moved to Gene level.
MIM_MORBID moved to Gene level.
HGNC moved to Gene level.
HGNC moved to Gene level.
MOVE SQL
UPDATE IGNORE object_xref ox, xref x, source s
SET ox.ensembl_id = ?
WHERE x.source_id = s.source_id AND
ox.xref_id = x.xref_id AND
ox.ensembl_id = ? AND
ox.ensembl_object_type = 'Gene' AND
ox.ox_status = 'DUMP_OUT' AND
s.name in (
'DBASS3', 'DBASS5', 'EntrezGene', 'miRBase', 'RFAM', 'TRNASCAN_SE', 'RNAMMER', 'UniGene', 'Uniprot_genename', 'WikiGene', 'MIM_GENE', 'MIM_MORBID', 'HGNC')
Number of rows:- moved = 0, identitys deleted = 0, object_xrefs deleted = 0
Added 0 new mapping but ignored 0
ZFIN_ID moved to Gene level.
ZFIN_ID moved to Gene level.
MAX xref_id = 620426 MAX object_xref_id = 985210, max_object_xref from identity_xref = 985210
LIST to delete 23, 21, 135, 278, 22, 136, 279, 253
_ins_xref sql is:-
insert into xref (xref_id, source_id, accession, label, version, species_id, info_type, info_text, description) values (?, ?, ?, ?, 0, 7955, 'MISC', ?, ? )
For gene ENSDARG00000001014 we have mutiple ZFIN_ID's
Keeping the best one si:ch211-150d5.2
removing myh9b from gene
For gene ENSDARG00000001470 we have mutiple ZFIN_ID's
Keeping the best one si:ch211-287j19.6
removing zgc:162351 from gene
For gene ENSDARG00000001559 we have mutiple ZFIN_ID's
Keeping the best one si:ch211-46o5.1
removing csmd2 from gene
For gene ENSDARG00000001733 we have mutiple ZFIN_ID's
Keeping the best one si:ch211-198b21.4
removing gulp1 from gene
For gene ENSDARG00000001832 we have mutiple ZFIN_ID's
Keeping the best one si:ch1073-403i13.1
removing zgc:113912 from gene
removing zgc:103599 from gene
For gene ENSDARG00000001879 we have mutiple ZFIN_ID's
Keeping the best one si:ch211-169k21.2
removing im:7156396 from gene
For gene ENSDARG00000001889 we have mutiple ZFIN_ID's
Keeping the best one tuba1l2
removing zgc:123298 from gene
For gene ENSDARG00000001890 we have mutiple ZFIN_ID's
Keeping the best one si:dkey-239i15.3
removing stt3b from gene
For gene ENSDARG00000002084 we have mutiple ZFIN_ID's
Keeping the best one lamb2
removing hm:zehs0001 from gene
Multiple best ZFIN_ID's using vega to find the most common for ENSDARG00000002670
zgc:113944 (chosen as first)
tbpl2 (left as ZFIN_ID reference but not gene symbol)
For gene ENSDARG00000002937 we have mutiple ZFIN_ID's
Keeping the best one meis4.1a
removing meis4.1b from gene
For gene ENSDARG00000003635 we have mutiple ZFIN_ID's
Keeping the best one mogat3b
removing atp6v1e1a from gene
Multiple best ZFIN_ID's using vega to find the most common for ENSDARG00000087402
tpm1 (chosen as first)
zgc:171719 (left as ZFIN_ID reference but not gene symbol)
Multiple best ZFIN_ID's using vega to find the most common for ENSDARG00000087472
For gene ENSDARG00000087472 we have mutiple ZFIN_ID's
removing zgc:154164 from gene
removing zgc:163040 from gene
removing hist1h4l from gene
Keeping the best one wu:fe37d09
Keeping the best one wu:fe38f03
Keeping the best one zgc:165555
wu:fe37d09 (chosen as first)
zgc:165555 (left as ZFIN_ID reference but not gene symbol)
wu:fe38f03 (left as ZFIN_ID reference but not gene symbol)
Multiple best ZFIN_ID's using vega to find the most common for ENSDARG00000087543
For gene ENSDARG00000087543 we have mutiple ZFIN_ID's
removing zgc:154164 from gene
removing zgc:163040 from gene
removing hist1h4l from gene
Keeping the best one wu:fe37d09
Keeping the best one wu:fe38f03
removing zgc:165555 from gene
wu:fe37d09 (chosen as first)
wu:fe38f03 (left as ZFIN_ID reference but not gene symbol)
For gene ENSDARG00000087583 we have mutiple ZFIN_ID's
Keeping the best one si:ch211-226h8.13
removing si:ch211-154a22.8 from gene
Multiple best ZFIN_ID's using vega to find the most common for ENSDARG00000087670
For gene ENSDARG00000087670 we have mutiple ZFIN_ID's
removing zgc:154164 from gene
removing zgc:163040 from gene
removing hist1h4l from gene
Keeping the best one wu:fe37d09
Keeping the best one wu:fe38f03
Keeping the best one zgc:165555
wu:fe37d09 (chosen as first)
zgc:165555 (left as ZFIN_ID reference but not gene symbol)
wu:fe38f03 (left as ZFIN_ID reference but not gene symbol)
Multiple best ZFIN_ID's using vega to find the most common for ENSDARG00000087694
For gene ENSDARG00000087694 we have mutiple ZFIN_ID's
Keeping the best one zgc:112234
Keeping the best one zgc:171759
removing zgc:171937 from gene
Keeping the best one wu:fe11b02
wu:fe11b02 (chosen as first)
zgc:171759 (left as ZFIN_ID reference but not gene symbol)
zgc:112234 (left as ZFIN_ID reference but not gene symbol)
For gene ENSDARG00000096097 we have mutiple ZFIN_ID's
Keeping the best one si:dkeyp-98a7.5
removing zgc:172150 from gene
For gene ENSDARG00000096159 we have mutiple ZFIN_ID's
Keeping the best one si:dkeyp-98a7.4
removing zgc:172150 from gene
For gene.... (lots more of these; they have been cut out to save time and space)
WARNING: Clone_based_ensembl_gene has decreased by -5 % was 7652 now 7194
WARNING: Clone_based_ensembl_transcript has decreased by -8 % was 8260 now 7554
WARNING: Clone_based_vega_gene has increased by 144% was 276 now 675
WARNING: GO has increased by 56% was 87289 now 136827
WARNING: goslim_goa has increased by 54% was 62738 now 96927
WARNING: xrefs miRBase_gene_name are not in the new database but are in the old???
WARNING: xrefs OTTG are not in the new database but are in the old???
WARNING: xrefs OTTT are not in the new database but are in the old???
WARNING: RefSeq_ncRNA has increased by 5% was 644 now 677
WARNING: xrefs RFAM_gene_name are not in the new database but are in the old???
WARNING: xrefs shares_CDS_and_UTR_with_OTTT are not in the new database but are in the old???
WARNING: xrefs shares_CDS_with_ENST are not in the new database but are in the old???
WARNING: xrefs shares_CDS_with_OTTT are not in the new database but are in the old???
WARNING: xrefs Vega_transcript are not in the new database but are in the old???
WARNING: xrefs Vega_translation are not in the new database but are in the old???
WARNING: ZFIN_ID_curated_transcript_notransfer has 9748 xrefs in the new database but NONE in the old
xref_mapper.pl FINISHED NORMALLY
------------------------------------------------------------
Sender: LSF System <lsfadmin@bc-24-1-04>
Subject: Job 886769: <perl ~/src/ensembl/misc-scripts/xref_mapping/xref_mapper.pl -file xref_input> Done
Job <perl ~/src/ensembl/misc-scripts/xref_mapping/xref_mapper.pl -file xref_input> was submitted from host <farm2-head4> by user <ianl> in cluster <farm2>.
Job was executed on host(s) <bc-24-1-04>, in queue <normal>, as user <ianl> in cluster <farm2>.
<~/> was used as the home directory.
</workdir/release_65/zebrafish> was used as the working directory.
Started at Tue Oct 18 11:01:18 2011
Results reported at Tue Oct 18 12:17:34 2011
Your job looked like:
------------------------------------------------------------
# LSBATCH: User input
perl ~/src/ensembl/misc-scripts/xref_mapping/xref_mapper.pl -file xref_input
------------------------------------------------------------
Successfully completed.
Resource usage summary:
CPU time : 734.06 sec.
Max Memory : 173 MB
Max Swap : 204 MB
Max Processes : 6
Max Threads : 7
The output (if any) is above this job summary.
PS:
Read file <mapper.err> for stderr output of this job.
Options: -file xref_input -upload
running in verbose mode
current status is tests_finished
Deleting data for Clone_based_ensembl_gene from core before updating from new xref database
Deleting data for Clone_based_ensembl_transcript from core before updating from new xref database
Deleting data for Clone_based_vega_gene from core before updating from new xref database
Deleting data for Clone_based_vega_transcript from core before updating from new xref database
Deleting data for EMBL from core before updating from new xref database
Deleting data for EntrezGene from core before updating from new xref database
Deleting data for GO from core before updating from new xref database
Deleting data for goslim_goa from core before updating from new xref database
Deleting data for IPI from core before updating from new xref database
Deleting data for MEROPS from core before updating from new xref database
Deleting data for miRBase from core before updating from new xref database
Deleting data for miRBase_transcript_name from core before updating from new xref database
Deleting data for PDB from core before updating from new xref database