Commit f10fc0e4 authored by Magali Ruffier

xref projection is now run from ensembl-production

removing copy from ensembl-core
parent 04e328ad
GENE NAME AND XREF PROJECTION
==============================
Introduction
------------
Gene display xrefs and GO terms are projected between species, using homology information from the Ensembl Compara database. This means that species that have little or no such data can have gene names and GO terms assigned based on homology with species that have more data, typically human and mouse.
Prerequisites
-------------
The projection script needs all the core databases to be on ens-staging and the Compara homologies to be available somewhere (see step 1). The homologies are generally available in a database of their own a few days before the rest of Compara is finished, so keep in contact with the Compara team to find out when they're ready.
It's useful to do a dry run using the previous release's Compara database, just to make sure things are working normally. This may cause some errors in individual jobs (errors of the type "Can't find homology for ..." occur when there are transcripts/translations in a new genebuild that don't appear in the "old" Compara - these will go away when the new Compara is used).
Don't forget to update the .ini file to point to the new homology database when it's ready, and run the projections for real.
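As a minimal sketch of that update, assuming the registry file contains a single "dbname = ..." line as in the example registry file in this document, the database name can be swapped with sed. The function name set_compara_dbname and the file and database names in the example are placeholders, not part of the Ensembl scripts.

```shell
#!/bin/sh
# Hypothetical helper: point the registry file at a new Compara database.
# Assumes one "dbname = ..." line per file, as in the example registry file.
set_compara_dbname() {
    ini_file=$1
    new_dbname=$2
    tmp_file="${ini_file}.tmp"
    # Rewrite only the dbname line; every other line is copied unchanged.
    sed "s/^dbname[[:space:]]*=.*/dbname = ${new_dbname}/" "$ini_file" > "$tmp_file" &&
        mv "$tmp_file" "$ini_file"
}

# Example (file and database names are placeholders):
# set_compara_dbname release_58.ini ensembl_compara_58
```

Writing to a temporary file and then moving it into place avoids relying on sed's in-place option, which differs between implementations.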
Running the projection
----------------------
Check out the latest version of the ensembl module from CVS. The scripts referred to here are in the ensembl/misc-scripts/xref-projection directory.
The script which actually does the projection is called project_display_xrefs.pl; however this is not the one that will be run during the release cycle. The script to run is called submit_projections.pl. This uses LSF to run all the projections concurrently. The projections which are run can be found by looking in submit_projections.pl itself.
The steps to run the projection are as follows:
1. Create a registry file to show the location of the Compara database to be used. A typical example will look something like this:
[Compara]
user = ensro
host = compara2
group = Compara
dbname = ensembl_compara_58
2. Edit submit_projections.pl to set some parameters, all of which are located at the top of the script. The ones to set/check, and example values, are:
my $release = 58; # release number
my $base_dir = "/lustre/scratch103/ensembl/gp1/projections/"; # working base directory; output will be written to a subdirectory with the number of the release
my $conf = "release_58.ini"; # registry config file, specifies Compara location - see above
# location of other databases - note read/write access is required
# Fill in the @config array with the details of the 2 staging servers, e.g.
my @config = ( {
'-host' => 'HOST',
'-port' => 'PORT',
'-user' => 'USER',
'-pass' => 'PASS',
'-db_version' => $release
},
{
'-host' => 'HOST',
'-port' => 'PORT',
'-user' => 'USER',
'-pass' => 'PASS',
'-db_version' => $release
} );
3. Run submit_projections.pl. It will submit all the Farm jobs and then exit.
4. Monitor the progress of the run using bjobs. The lsload command is useful for monitoring the load on the server where the databases are being modified (typically ens-staging):
lsload -s | grep myens_staging
The gene name (display_xref) projections typically start to finish after about 20 minutes, while the GO term projections take longer. Currently the full set of projections takes about 4 hours to run.
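For step 4, a quick way to see how far the run has got is to count jobs per state. The following is a sketch only: the function name summarise_job_states is made up for illustration, and it assumes the usual bjobs layout of one header line with the job state (STAT) in the third column.

```shell
#!/bin/sh
# Hypothetical helper: count jobs per state from bjobs output read on stdin.
# Assumes the usual bjobs layout: one header line, state (STAT) in column 3.
summarise_job_states() {
    awk 'NR > 1 { count[$3]++ }
         END { for (state in count) print state, count[state] }' | sort
}

# Example: bjobs -w | summarise_job_states
```

Reading from standard input keeps the helper independent of how bjobs is invoked, so any filtering options can be added on the bjobs side.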
Results
-------
As jobs finish, they write .out and .err files to the working directory specified in the script. If a job finished successfully, its .err file will exist but be empty; a non-empty .err file indicates that something has gone wrong.
Note that if you need to re-run individual jobs, the command-line for doing so is at the top of the appropriate .out file.
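The two checks above can be combined in a small shell function. This is a sketch under assumptions: the function name report_failed_jobs is hypothetical, and the number of lines printed from the top of each .out file (default 3) is a guess, since the exact position of the command line in the file is not specified here.

```shell
#!/bin/sh
# Hypothetical helper: report jobs whose .err file is non-empty and show the
# top of the matching .out file, where the re-run command line is recorded.
# The number of header lines to show (default 3) is a guess; adjust as needed.
report_failed_jobs() {
    work_dir=$1
    header_lines=${2:-3}
    for err_file in "$work_dir"/*.err; do
        # Skip the unexpanded pattern (no .err files) and empty .err files.
        [ -s "$err_file" ] || continue
        out_file="${err_file%.err}.out"
        echo "FAILED: $err_file"
        [ -f "$out_file" ] && head -n "$header_lines" "$out_file"
    done
    return 0
}

# Example (the directory is a placeholder for the release working directory):
# report_failed_jobs /path/to/projections/58
```

No output means every .err file in the directory is empty.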
All databases that have been projected to should be healthchecked; in particular CoreForeignKeys and the xrefs group of healthchecks should be run. To do this, check out the ensj-healthcheck module, cd into ensj-healthcheck, configure database.properties, then run
./run-healthcheck.sh -d '.*_core_58.*' CoreForeignKeys core_xrefs
Once all the projections have been run and checked, inform the release coordinator.
# Table structure for projection info database
CREATE TABLE projections (
db_release INT NOT null,
timestamp DATETIME,
from_db VARCHAR(255),
from_species_latin VARCHAR(255),
from_species_common VARCHAR(255),
to_db VARCHAR(255),
to_species_latin VARCHAR(255),
to_species_common VARCHAR(255)
);
; Configuration file for release 48
[Compara]
user = ensro
host = compara2
group = Compara
dbname = avilella_compara_homology_48
use strict;
use Data::Dumper;
use Bio::EnsEMBL::ApiVersion qw/software_version/;
$Data::Dumper::Useqq=1;
$Data::Dumper::Terse = 1;
$Data::Dumper::Indent = 0;
# Submits the display name and GO term projections as farm jobs
# Remember to check/set the various config options
# ------------------------------ config -------------------------------
my $release = software_version();
my $base_dir = "mydir";
my $conf = "release_${release}.ini"; # registry config file, specifies Compara location
# location of other databases
my @config = ( {
'-host' => 'HOST',
'-port' => 'PORT',
'-user' => 'USER',
'-pass' => 'PASS',
'-db_version' => $release
},
{
'-host' => 'HOST',
'-port' => 'PORT',
'-user' => 'USER',
'-pass' => 'PASS',
'-db_version' => $release
} );
my $registryconf = Dumper(\@config);
# -------------------------- end of config ----------------------------
# check that base directory exists
die ("Cannot find base directory $base_dir") if (! -e $base_dir);
# create release subdir if necessary
my $dir = "$base_dir/$release"; # works whether or not $base_dir ends in "/"
if (! -e $dir) {
mkdir $dir;
print "Created $dir\n";
} else {
print "Cleaning and re-using $dir\n";
unlink <$dir/*.out>, <$dir/*.err>, <$dir/*.sql.gz>;
}
# common options
my $script_opts = "-conf '$conf' -registryconf '$registryconf' -version '$release' -release '$release' -quiet -backup_dir '$dir'";
my $bsub_opts = "";
$bsub_opts .= "-M2000000 -R'select[mem>2000] rusage[mem=2000]'";
my %names_1_1;
######
# When editing xref projection lists below, remember to check the species is in
# the execution order array that follows.
######
$names_1_1{'human'} = [qw(
alpaca
anolis
armadillo
bushbaby
cat
chicken
chimp
coelacanth
cow
dog
dolphin
elephant
gibbon
gorilla
ground_shrew
guinea_pig
horse
hyrax
macaque
marmoset
megabat
microbat
mouse_lemur
mustela_putorius_furo
opossum
orang_utan
panda
pig
pika
platypus
psinensis
rabbit
sloth
squirrel
tarsier
tasmanian_devil
tenrec
tree_shrew
turkey
wallaby
western_european_hedgehog
xenopus
zebrafinch
)];
$names_1_1{'mouse'} = [qw(
kangaroo_rat
mustela_putorius_furo
rat
)];
my %names_1_many;
$names_1_many{'human'} = [qw(
cod
fugu
lamprey
medaka
stickleback
tetraodon
tilapia
xiphophorus_maculatus
zebrafish
)];
my %go_terms;
$go_terms{'human'} = [qw(
alpaca
anolis
armadillo
bushbaby
cat
chicken
chimp
cow
dog
dolphin
elephant
gibbon
gorilla
ground_shrew
guinea_pig
horse
hyrax
kangaroo_rat
macaque
marmoset
megabat
microbat
mouse
mouse_lemur
mustela_putorius_furo
opossum
orang_utan
panda
pig
pika
platypus
psinensis
rabbit
rat
sloth
squirrel
tarsier
tasmanian_devil
tenrec
tree_shrew
turkey
wallaby
western_european_hedgehog
zebrafinch
)];
$go_terms{'mouse'} = [qw(
alpaca
anolis
armadillo
bushbaby
cat
chicken
chimp
cow
dog
dolphin
elephant
gorilla
ground_shrew
guinea_pig
horse
human
hyrax
kangaroo_rat
macaque
marmoset
megabat
microbat
mouse_lemur
mustela_putorius_furo
opossum
orang_utan
panda
pig
pika
platypus
psinensis
rabbit
rat
sloth
squirrel
tarsier
tasmanian_devil
tenrec
tree_shrew
turkey
wallaby
western_european_hedgehog
zebrafinch
)];
$go_terms{'rat'} = [qw(
human
mouse
)];
$go_terms{'zebrafish'} = [qw(
cod
coelacanth
fugu
lamprey
stickleback
tetraodon
tilapia
xenopus
xiphophorus_maculatus
)];
$go_terms{'xenopus'} = [qw(zebrafish)];
# Order to run projections in, in case they are order-sensitive.
my @execution_order = qw/human mouse rat zebrafish xenopus/;
# The job queue itself ignores this order. The -w option in the bsub commands
# below makes each source species' jobs wait for the previous species' jobs.
# ----------------------------------------
# Display names
print "Deleting projected names (one to one)\n";
foreach my $species (keys %names_1_1) {
foreach my $to (@{$names_1_1{$species}}) {
system "perl project_display_xrefs.pl $script_opts -to $to -delete_names -delete_only\n";
};
}
# 1:1
my $last_name; # for waiting in queue; declared once so it persists across source species
foreach my $from (@execution_order) {
if (not exists($names_1_1{$from})) {next;}
foreach my $to (@{$names_1_1{$from}}) {
my $o = "$dir/names_${from}_$to.out";
my $e = "$dir/names_${from}_$to.err";
my $n = substr("n_${from}_$to", 0, 10); # job name display limited to 10 chars
my $all;
if ($from eq "human" || $from eq "mouse") { $all = "" ; }
else { $all = "--all_sources"; }
my $wait;
if ($last_name) { $wait = "-w 'ended(${last_name}*)'";}
print "Submitting name projection from $from to $to\n";
system "bsub $bsub_opts -o $o -e $e -J $n $wait perl project_display_xrefs.pl $script_opts -from $from -to $to -names -no_database $all\n";
}
$last_name = substr("n_".$from, 0 ,10);
}
$last_name = "";
print "Deleting projected names (one to many)\n";
foreach my $from (keys %names_1_many) {
foreach my $to (@{$names_1_many{$from}}) {
system "perl project_display_xrefs.pl $script_opts -to $to -delete_names -delete_only\n";
}
}
# 1:many
foreach my $from (@execution_order) {
if (not exists($names_1_many{$from})) {next;}
foreach my $to (@{$names_1_many{$from}}) {
my $o = "$dir/names_${from}_$to.out";
my $e = "$dir/names_${from}_$to.err";
my $n = substr("n_${from}_$to", 0, 10);
my $wait;
if ($last_name) { $wait = "-w 'ended(${last_name}*)'";}
print "Submitting name projection from $from to $to (1:many)\n";
system "bsub $bsub_opts -o $o -e $e -J $n $wait perl project_display_xrefs.pl $script_opts -from $from -to $to -names -no_database -one_to_many\n";
}
$last_name = substr("n_".$from, 0 ,10);
}
$last_name = "";
# ----------------------------------------
# GO terms
$script_opts .= " -nobackup";
print "Deleting projected GO terms\n";
foreach my $from (keys %go_terms) {
foreach my $to (@{$go_terms{$from}}) {
system "perl project_display_xrefs.pl $script_opts -to $to -delete_go_terms -delete_only\n";
}
}
foreach my $from (@execution_order) {
if (not exists($go_terms{$from})) {next;}
foreach my $to (@{$go_terms{$from}}) {
my $o = "$dir/go_${from}_$to.out";
my $e = "$dir/go_${from}_$to.err";
my $n = substr("g_${from}_$to", 0, 10);
my $wait;
if ($last_name) { $wait = "-w 'ended(${last_name}*)'";}
print "Submitting GO term projection from $from to $to\n";
system "bsub $bsub_opts -q long -o $o -e $e -J $n $wait perl project_display_xrefs.pl $script_opts -from $from -to $to -go_terms\n";
}
$last_name = substr("g_".$from, 0 ,10);
}
# ----------------------------------------