Commit 06d07040 authored by Glenn Proctor's avatar Glenn Proctor
Browse files

*** empty log message ***

parent 110a575c
Gene display xrefs and GO terms are projected between species, using homology information from the Ensembl Compara database. This means that species that have little or no such data can have gene names and GO terms assigned based on homology with species that have more data, typically human and mouse.
The projection script needs all the core databases to be on ens-staging and the Compara homlogies to be available somewhere (see step 1). The homolgies are generally available in a database of their own a few days before the rest of Compara is finished, so keep in contact with the Compara team to find out when they're ready.
It's useful to do a dry run using the previous release's Compara database, just to make sure things are working normally. This may cause some errors in individual jobs (errors of the type "Can't find homology for ..." occur when there are transcripts/translations in a new genebuild that don't appear in the "old" Compara - these will go away when the new Compara is used).
Don't forget to update the .ini file to point to the new homology database when it's ready, and run the projections for real.
Running the projection
Check out the latest version of the ensembl module from CVS. The scripts referred to here are in the ensembl/misc-scripts/xref-projection directory.
The script which actually does the projection is called; however this is not the one that will be run during the release cycle. The script to run is called This uses LSF to run all the projections concurrently. The projections which are run can be found by looking in itself.
The steps to run the projection are as follows:
1. Create a registry file to show the location of the Compara database to be used. A typical example will look something like this:
user = ensro
host = compara2
group = Compara
dbname = ensembl_compara_48
2. Edit to set some parameters, all of which are located at the top of the script. The ones to set/check, and example values, are:
my $release = 48; # release number
my $base_dir = "/lustre/scratch1/ensembl/gp1/projections/"; # working base directory; output will be written to a subdirectory with the number of the release
my $conf = "release_48.ini"; # registry config file, specifies Compara location - see above
# location of other databases - note read/write access is required
my $host = "ens-staging";
my $port = 3306;
my $user = "ensadmin";
my $pass = "ensembl";
3. Run It will submit all the Farm jobs and then exit.
4. Monitor the progress of the run using bjobs. The lsload command is useful for monitoring the load on the server where the databases are being modified (typically ens-staging):
lsload -s | grep myens_staging
The gene name (display_xref) projections typically start to finish after about 20 minutes, while the GO term projections take longer. Currently the full set of projections takes about 4 hours to run.
As jobs finish, they write .out and .err files to the working directory specified in the script. If a job finished successfully, the .err file will be of zero size (but will exist). .err files of non-zero length indicate that something has gone wrong.
Note that if you need to re-run individual jobs, the command-line for doing so is at the top of the appropriate .out file.
All databases that have been projected to should be healthchecked; in particular CoreForeignKeys and the xrefs group of healthchecks should be run. To do this, check out the ensj-healthcheck module, cd into ensj-healthcheck, configure, then run
./ -d '.*_core_48.*' CoreForeignKeys core_xrefs
Once all the projections have been run and checked, inform the release coordinator.
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment