Commit ada79199 authored by Ian Longden

doc changes

parent 6c6b37f6
@@ -10,11 +10,10 @@ Questions
6) I have mapping errors, how do I fix them?
7) How do I start again from the parsing has finished stage?
8) How do I start again from the mapping_finished stage?
9) What is fullmode and partupdate?
10) How do I run my external database references without a compute farm?
11) I want to use a different list of external database sources for my
9) How do I run my external database references without a compute farm?
10) I want to use a different list of external database sources for my
display_xrefs (names)?
12) I want to use a different list of external database sources for my gene
11) I want to use a different list of external database sources for my gene
descriptions?
@@ -113,9 +112,15 @@ Answers
loading into the core database.
i) Map the entities in the xref database and do some checks etc.
xref_mapper.pl -file xref_config >& MAPPER1.OUT
If you have no compute farm then add the -nofarm option.
perl ~/src/ensembl/misc-scripts/xref_mapper/xref_mapper.pl
-file xref_config -nofarm >& MAPPER1.OUT
or if using the farm
bsub -o mapper.out -e mapper.err
perl ~/src/ensembl/misc-scripts/xref_mapper/xref_mapper.pl -file xref_config
Check the output file. A warning about the number of xrefs increasing is
nothing to worry about; the main thing to be concerned about is a reduction
in the number, i.e. xrefs that are in the core database but not in the xref
database.
@@ -305,28 +310,7 @@ data_uri = ftp://fantom.gsc.riken.jp/DDBJ_fantom3_HTC_accession.txt.gz
9) What is fullmode and partupdate?
Fullmode means that all the xrefs are being updated and not just a few specific
external database sources. This is important as it affects the way the
display_xrefs and descriptions are calculated at the end. The user can override
this by setting the -partupdate option in the mapper options or by changing the
entry in the table (the key is "fullmode" in the meta table).
If we are doing all the xref sources then we know that all the data is local
and hence we can use some SQL to get the display_xrefs etc. But if this is not
the case then the core database will have extra information in it that may be
needed, so we have to query the core database. When everything is in the xref
database simple SQL can be used, whereas with the core database we have to go
through each gene and then each transcript etc. using the API, which is a lot
slower.
In summary, only alter the mode here if you know what you are doing and what
the consequences are.
10) How do I run my external database references without a compute farm?
9) How do I run my external database references without a compute farm?
Simply use the -nofarm option with the xref_mapper.pl script.
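For example, reusing the mapper command shown earlier (the path to the script
depends on where your checkout lives):
perl ~/src/ensembl/misc-scripts/xref_mapper/xref_mapper.pl \
     -file xref_config -nofarm >& MAPPER1.OUT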
@@ -334,7 +318,7 @@ data_uri = ftp://fantom.gsc.riken.jp/DDBJ_fantom3_HTC_accession.txt.gz
11) I want to use a different list of external database sources for my
10) I want to use a different list of external database sources for my
display_xrefs (names)?
The external databases to be used for the display_xrefs are taken from either
@@ -401,7 +385,7 @@ data_uri = ftp://fantom.gsc.riken.jp/DDBJ_fantom3_HTC_accession.txt.gz
12) I want to use a different list of external database sources for my gene
11) I want to use a different list of external database sources for my gene
descriptions?
As above but this time we use the sub gene_description_sources.
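As a rough sketch only (the source names below are just examples, and the
exact return form should be checked against the default
gene_description_sources in DisplayXrefs.pm), an override in a species module
could look something like:
sub gene_description_sources {
  # External database sources to use for gene descriptions, in order of
  # preference. The names here are illustrative only.
  return ("RFAM",
          "miRBase",
          "Uniprot/SWISSPROT",
          "RefSeq_peptide");
}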
@@ -413,41 +413,37 @@ calculate display xref for genes and transcripts
-----------------------------------------------
The external databases to be used for the display_xrefs are taken from either the
BasicMapper.pm subroutine transcript_display_xref_sources, i.e.
DisplayXrefs.pm subroutine transcript_display_xref_sources, i.e.
sub transcript_display_xref_sources {
my @list = qw(miRBase
my @list = qw(HGNC
MGI
Clone_based_vega_gene
Clone_based_ensembl_gene
HGNC_transcript_name
MGI_transcript_name
Clone_based_vega_transcript
Clone_based_ensembl_transcript
miRBase
RFAM
HGNC_curated_gene
HGNC_automatic_gene
MGI_curated_gene
MGI_automatic_gene
Clone_based_vega_gene
Clone_based_ensembl_gene
HGNC_curated_transcript
HGNC_automatic_transcript
MGI_curated_transcript
MGI_automatic_transcript
Clone_based_vega_transcript
Clone_based_ensembl_transcript
IMGT/GENE_DB
HGNC
SGD
MGI
flybase_symbol
Anopheles_symbol
Genoscope_annotated_gene
Uniprot/SWISSPROT
Uniprot/Varsplic
RefSeq_peptide
RefSeq_dna
Uniprot/SPTREMBL
EntrezGene
IMGT/GENE_DB
SGD
flybase_symbol
Anopheles_symbol
Genoscope_annotated_gene
Uniprot/SWISSPROT
Uniprot/Varsplic
Uniprot/SPTREMBL
EntrezGene
IPI);
my %ignore;
$ignore{"EntrezGene"}= 'FROM:RefSeq_[pd][en][pa].*_predicted';
$ignore{"Uniprot/SPTREMBL"} =(<<BIGN);
SELECT object_xref_id
FROM object_xref JOIN xref USING(xref_id) JOIN source USING(source_id)
WHERE ox_status = 'DUMP_OUT' AND name = 'Uniprot/SPTREMBL'
AND priority_description = 'protein_evidence_gt_3'
BIGN
return [\@list,\%ignore];
}
@@ -456,30 +452,35 @@ sub transcript_display_xref_sources {
or, if you want to create your own list, then you need to create a species.pm file
and create a new subroutine there; an example here is for drosophila_melanogaster.
So in the file drosophila_melanogaster.pm we have:-
and create a new subroutine there; an example here is for tetraodon_nigroviridis.
So in the file tetraodon_nigroviridis.pm we have:-
sub transcript_display_xref_sources {
my @list = qw(FlyBaseName_transcript FlyBaseCGID_transcript
flybase_annotation_id);
my @list = qw(HGNC
MGI
wormbase_transcript
flybase_symbol
Anopheles_symbol
Genoscope_annotated_gene
Genoscope_predicted_transcript
Uniprot/SWISSPROT
RefSeq
Uniprot/SPTREMBL
LocusLink);
my %ignore;
$ignore{"EntrezGene"}= 'FROM:RefSeq_[pd][en][pa].*_predicted';
return [\@list,\%ignore];
}
$ignore{"Uniprot/SPTREMBL"} =(<<BIGN);
SELECT object_xref_id
FROM object_xref JOIN xref USING(xref_id) JOIN source USING(source_id)
WHERE ox_status = 'DUMP_OUT' AND name = 'Uniprot/SPTREMBL'
AND priority_description = 'protein_evidence_gt_3'
BIGN
If we are in fullmode then a temporary table is created with these sources in it
and some SQL is used to find the best display_xrefs for the transcripts and genes.
If we are not in fullmode then we have to use the API to cycle through each gene,
transcript and translation, fetch all their xrefs and get the best. This is not
that quick. The results are stored directly in the core database.
return [\@list,\%ignore];
}
calculate the gene descriptions.
@@ -11,12 +11,12 @@ database.
Parsing the external database references
------------------------------------------------------------------------
In this directory you will find an ini-file called 'xref_config.ini'.
This file contains two types of configuration sections: source sections
and species sections. A source section defines Xref priority, order
etc. (as key-value pairs, see the comment at the top of the source
sections for a fuller explanation of these keys) for the source and
also the URIs pointing to the data files that the source should use.
In the xref_mapper directory you will find an ini-file called
'xref_config.ini'. This file contains two types of configuration
sections: source sections and species sections. A source section defines
Xref priority, order etc. (as key-value pairs, see the comment at the top
of the source sections for a fuller explanation of these keys) for the source
and also the URIs pointing to the data files that the source should use.
The source label will only be used to refer to the source within the
ini-file (from a species section), so this can be any text string whose
meaning is easy to understand.
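Purely as an illustration of the shape of a source section (only 'priority',
'order' and 'data_uri' are keys mentioned in this document; the section header
syntax and the full set of keys should be checked against the comments in
'xref_config.ini' itself):
[source ExampleSource]
# how this source is prioritised and the order in which it is parsed
priority        = 1
order           = 10
# one or more URIs pointing at the data files for this source
data_uri        = ftp://example.org/pub/example_source_data.txt.gz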
@@ -28,15 +28,14 @@ multiple strains or subspecies, for example), there can be more than one
'taxonomy_id' key. The name of the species is defined by the source
label and will be stored in the Xref database.
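Again only as a sketch (the 'taxonomy_id' key is mentioned above; anything
else, including the section header syntax, is an assumption to be checked
against the real ini-file), a species section might look like:
[species homo_sapiens]
# more than one taxonomy_id line may be present, e.g. for strains or subspecies
taxonomy_id     = 9606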
For now, the script 'xref_config2sql.pl' (also found in this directory)
should be used to convert the ini-file into a SQL file which you
should replace the file 'sql/populate_metadata.sql' with. The
'xref_config2sql.pl' script expects to find 'xref_config.ini' in the
current directory, but you may specify an alternative file as the first
command line argument to the script if you have moved or renamed the
ini-file. When 'xref_parser.pl' is run it will load the generated SQL
file into the database and will then download and parse all external
data files for one or several specified species.
For now, the script 'xref_config2sql.pl' should be used to convert the
ini-file into a SQL file with which you should replace the file
'sql/populate_metadata.sql'. The 'xref_config2sql.pl' script expects
to find 'xref_config.ini' in the current directory, but you may specify an
alternative file as the first command line argument to the script if you have
moved or renamed the ini-file. When 'xref_parser.pl' is run it will load the
generated SQL file into the database and will then download and parse all
external data files for one or several specified species.
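For example (this assumes, as is not stated explicitly here, that the script
writes the generated SQL to standard output):
perl xref_config2sql.pl xref_config.ini > sql/populate_metadata.sql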
If you want to add a new source you will have to add a new source
section, following the pattern used by the other source sections. You
@@ -67,7 +66,15 @@ user (user name on the database), pass (password), host (database host),
dbname, and species, i.e.
perl xref_parser.pl -host mymachine -user admin -pass XXXX \
-dbname new_human_xref -species human
-dbname new_human_xref -species human -stats
If you are using the farm then I strongly advise submitting the job with bsub
(as below), as it makes the systems people happier and it is easier to get the
output and error files separately.
bsub -o parse.out -e parse.err perl xref_parser.pl -host mymachine \
-user admin -pass XXXX -dbname new_human_xref -species human \
-stats -force
Please keep the output from this script and check it later. At the end
of the output there will be a summary of what was successful and what
@@ -83,7 +90,7 @@ The parsing can create three types of Xrefs these are
Some sources will have more than one set of files associated with them;
in these cases they have the same source name but different source IDs.
These are known as "priority Xrefs" as the Xrefs are mapped according to
the priority of the source. An example of this is the HUGOs.
the priority of the source. An example of this is HGNCs.
For more information on what data can be parsed, see the
'parsing_information.txt' file.
@@ -94,9 +101,8 @@ Mapping the external database references to the Ensembl core database
This is an overview of what goes on in the script 'xref_mapper.pl'.
Primary Xrefs are dumped out to two Fasta files, one for peptides and
the other for DNA. Ensembl Transcripts and Translations are then dumped
out to two files in Fasta format.
Primary Xrefs are dumped out to Fasta files; Ensembl Transcripts and
Translations are then dumped out to two files in Fasta format.
Exonerate is then used to find the best matches for the Xrefs.
If there is more than one best match then the Xref is mapped to
@@ -163,6 +169,6 @@ exonerate=/software/ensembl/bin/exonerate-1.4.0
Note it is good practice to use a sub-directory for the Ensembl directory,
as many files are generated and hence it is best to put these all
together and way from everything else or it will be hard to find things.
together and away from everything else or it will be hard to find things.
Also the directory can be tarred and zipped in case you need to check
things later.
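For example (the directory name is only a placeholder for whatever
sub-directory you used for the run):
tar -czf xref_run.tar.gz xref_run/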