

Gene display xrefs and GO terms are projected between species, using homology information from the Ensembl Compara database. This means that species that have little or no such data can have gene names and GO terms assigned based on homology with species that have more data, typically human and mouse.

The projection script needs all the core databases to be on ens-staging and the Compara homologies to be available somewhere (see step 1). The homologies are generally available in a database of their own a few days before the rest of Compara is finished, so keep in contact with the Compara team to find out when they're ready.

It's useful to do a dry run using the previous release's Compara database, just to make sure things are working normally. This may cause some errors in individual jobs (errors of the type "Can't find homology for ..." occur when there are transcripts/translations in a new genebuild that don't appear in the "old" Compara - these will go away when the new Compara is used).

Don't forget to update the .ini file to point to the new homology database when it's ready, and run the projections for real.

Running the projection

Check out the latest version of the ensembl module from CVS. The scripts referred to here are in the ensembl/misc-scripts/xref-projection directory.

The script which actually does the projection is not the one that will be run during the release cycle. The script to run instead is an LSF submission script, which runs all the projections concurrently. The projections which are run can be found by looking in the submission script itself.

The steps to run the projection are as follows:

1. Create a registry file to show the location of the Compara database to be used. A typical example will look something like this:

	user    = ensro
	host    = compara2
	group   = Compara
	dbname  = ensembl_compara_48

2. Edit the LSF submission script to set some parameters, all of which are located at the top of the script. The ones to set/check, and example values, are:

	my $release = 48;   # release number

	my $base_dir = "/lustre/scratch1/ensembl/gp1/projections/"; # working base directory; output will be written to a subdirectory with the number of the release

	my $conf = "release_48.ini"; # registry config file, specifies Compara location - see above

	# location of other databases - note read/write access is required
	my $host = "ens-staging";
	my $port = 3306;
	my $user = "ensadmin";
	my $pass = "ensembl";

3. Run the LSF submission script. It will submit all the Farm jobs and then exit.

4. Monitor the progress of the run using bjobs. The lsload command is useful for monitoring the load on the server where the databases are being modified (typically ens-staging):

 	lsload -s | grep myens_staging
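
When many jobs are running, a per-state count can be easier to read than the raw bjobs listing. The following is a minimal sketch, not part of the projection scripts: the function name summarise_jobs is hypothetical, and it assumes the default bjobs column layout, in which the first line is a header and the third column is the job state (STAT).

```shell
# Summarise LSF job states from bjobs-style output read on stdin.
# Assumption: default column layout, header on line 1, STAT in column 3
# (e.g. PEND, RUN, DONE, EXIT).
summarise_jobs() {
    awk 'NR > 1 { count[$3]++ }
         END { for (s in count) print s, count[s] }' | sort
}

# Typical use on the Farm (commented out so the sketch is safe to source):
# bjobs -a | summarise_jobs
```

Reading from stdin keeps the function independent of how bjobs is invoked, so any filtering options can be added on the left-hand side of the pipe.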

The gene name (display_xref) projections typically start to finish after about 20 minutes, while the GO term projections take longer. Currently the full set of projections takes about 4 hours to run.


As jobs finish, they write .out and .err files to the working directory specified in the script. If a job finished successfully, the .err file will be of zero size (but will exist). .err files of non-zero length indicate that something has gone wrong.
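
The check for non-empty .err files can be scripted. This is a minimal sketch under stated assumptions: the function name list_failed_jobs is hypothetical, and the directory in the usage line is only an example built from the base directory and release number shown in step 2.

```shell
# List .err files of non-zero size in a working directory, i.e. jobs
# where something has probably gone wrong.
# $1: the working directory the jobs write their .out/.err files to.
list_failed_jobs() {
    find "$1" -maxdepth 1 -type f -name '*.err' -size +0c | sort
}

# Example (path is an assumption based on the values in step 2):
# list_failed_jobs /lustre/scratch1/ensembl/gp1/projections/48
```

No output means every .err file found so far is empty; jobs that are still running may not have written their files yet.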

Note that if you need to re-run individual jobs, the command-line for doing so is at the top of the appropriate .out file.
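
To pull up that command line quickly for a failed job, something like the following sketch may help. The function name show_rerun_command is hypothetical, and it assumes that a job's .out and .err files share the same base name; the default of 5 lines is arbitrary, since the exact layout of the top of the .out file is not specified here.

```shell
# Print the first lines of the .out file that matches a given .err file.
# $1: path to a .err file (assumed to sit next to its .out file)
# $2: number of lines to show (optional, defaults to 5)
show_rerun_command() {
    out_file="${1%.err}.out"
    head -n "${2:-5}" "$out_file"
}

# Example: show_rerun_command path/to/some_job.err
```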

All databases that have been projected to should be healthchecked; in particular CoreForeignKeys and the xrefs group of healthchecks should be run. To do this, check out the ensj-healthcheck module, cd into ensj-healthcheck, configure, then run

./ -d '.*_core_48.*' CoreForeignKeys xrefs