FASTA Pipeline

This is a re-implementation of an existing pipeline developed originally by
core and the webteam. The new version uses eHive, so familiarity with this
system is essential, and it has been written to use as little memory as
possible.

The Registry File

This is the way we retrieve the database connections to work with. The
registry file should specify:

  - the core (and any other) databases to dump from
  - a production database (species = multi, group = production)
  - a web database (species = multi, group = web)

Here is an example of a file for v67 of Ensembl. Note the use of the
Registry object within a registry file and the scoping of the package. If
you omit the -db_version parameter and only use HEAD checkouts of Ensembl
then this will automatically select the latest version of the API. Any
change to version here must be reflected in the configuration file.

	package Reg;
	use Bio::EnsEMBL::Registry;
	use Bio::EnsEMBL::DBSQL::DBAdaptor;
	Bio::EnsEMBL::Registry->no_version_check(1);
	Bio::EnsEMBL::Registry->no_cache_warnings(1);
	{
	  my $version = 67;
	  Bio::EnsEMBL::Registry->load_registry_from_multiple_dbs(
	    {
	      -host => "mydb-1",
	      -port => 3306,
	      -db_version => $version,
	      -user => "user",
	      -NO_CACHE => 1,
	    },
	    {    
	      -host => "mydb-2",
	      -port => 3306,
	      -db_version => $version,
	      -user => "user",
	      -NO_CACHE => 1,
	    },
	  );
	  Bio::EnsEMBL::DBSQL::DBAdaptor->new(
	    -HOST => 'mydb-2',
	    -PORT => 3306,
	    -USER => 'user',
	    -DBNAME => 'ensembl_website',
	    -SPECIES => 'multi',
	    -GROUP => 'web'
	  );
	  Bio::EnsEMBL::DBSQL::DBAdaptor->new(
	    -HOST => 'mydb-2',
	    -PORT => 3306,
	    -USER => 'user',
	    -DBNAME => 'ensembl_production',
	    -SPECIES => 'multi',
	    -GROUP => 'production'
	  );
	}
	1;

You give the registry file to the init_pipeline.pl script via the -registry option.

Overriding Defaults Using a New Config File

If you have a number of parameters which do not change between releases,
we recommend creating a configuration file which inherits from the root
config file e.g.

	package MyCnf;
	use base qw/Bio::EnsEMBL::Pipeline::FASTA::FASTA_conf/;
	sub default_options {
	  my ($self) = @_;
	  return {
	    %{ $self->SUPER::default_options() },
	    #Override of options
	  };
	}
	1;

If you do override the config then you should use the package name for your overridden config in the upcoming example commands.

Environment

PERL5LIB

Must include the Ensembl core API, ensembl-hive and BioPerl so the pipeline
and hive modules can be loaded.

PATH

Must include ensembl-hive/scripts so that init_pipeline.pl and beekeeper.pl
can be found.

ENSEMBL_CVS_ROOT_DIR

Set to the base checkout of Ensembl. We should be able to add ensembl-hive/sql onto this path to find the SQL directory for hive e.g.

	export ENSEMBL_CVS_ROOT_DIR=$HOME/work/ensembl-checkouts
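The hive SQL directory is then found by simple concatenation. A minimal sketch (the checkout path is hypothetical; adjust to your own layout):

```shell
# Hypothetical checkout location; adjust to your own layout
export ENSEMBL_CVS_ROOT_DIR="$HOME/work/ensembl-checkouts"

# The pipeline appends ensembl-hive/sql to locate the hive SQL directory
HIVE_SQL_DIR="$ENSEMBL_CVS_ROOT_DIR/ensembl-hive/sql"
echo "$HIVE_SQL_DIR"
```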

ENSADMIN_PSW

Give the password to use to log into a database server e.g.

	export ENSADMIN_PSW=wibble
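The same credential can later be used when assembling the hive database URL that beekeeper.pl expects. As an illustration only (the user name, host and database name below are hypothetical):

```shell
# ENSADMIN_PSW as set above; the other connection details are made up
export ENSADMIN_PSW=wibble
HIVE_URL="mysql://ensadmin:${ENSADMIN_PSW}@my-db-host:3306/my_fasta_dump_hive"
echo "$HIVE_URL"
```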

Command Line Arguments

Where the table below lists Multiple Supported as Yes, the parameter may be
specified more than once on the command line. For example -species is one of
these options e.g.

	-species human -species cele -species yeast
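How repeated flags accumulate can be sketched with a small bash loop; this is purely illustrative (init_pipeline.pl performs its own option parsing):

```shell
# Illustrative only: collect every value passed to a repeated -species flag
set -- -species human -species cele -species yeast
species=()
while [ "$#" -gt 0 ]; do
  case "$1" in
    -species) species+=("$2"); shift 2 ;;
    *) shift ;;
  esac
done
echo "${species[@]}"   # human cele yeast
```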
Each parameter is listed as: Name (Type; Multiple Supported; Default; Required).

-registry (String; no; default: -; REQUIRED)
    Location of the Ensembl registry to use with this pipeline.
-base_path (String; no; default: -; REQUIRED)
    Location of the dumps.
-pipeline_db -host= (String; no; default: see hive generic config; REQUIRED)
    Specify a host for the hive database e.g. -pipeline_db -host=myserver.mysql
-pipeline_db -dbname= (String; no; default: pipeline name; optional)
    Specify a different database to use as the hive DB e.g. -pipeline_db -dbname=my_dumps_test
-ftp_dir (String; no; default: -; optional)
    Location of the current FTP directory with the previous release's files. We
    will use this to copy DNA files from one release to another. If not given
    we do not do any reuse.
-species (String; yes; default: -; optional)
    Specify one or more species to process. The pipeline will only consider
    these species. Use -force_species if you want to force a species run.
-force_species (String; yes; default: -; optional)
    Specify one or more species to force through the pipeline. This is useful
    to force a dump when you know reuse will not do the "right thing".
-dump_types (String; yes; default: all; optional)
    Specify each type of dump you want to produce. Supported values are dna,
    cdna and ncrna.
-db_types (String; yes; default: core; optional)
    The database types to use. Supports the normal db adaptor groups e.g.
    core, otherfeatures etc.
-release (Integer; no; default: software version; optional)
    The release to dump.
-previous_release (Integer; no; default: software version minus 1; optional)
    The previous release to use. Used to calculate reuse.
-blast_servers (String; yes; default: -; optional)
    The servers to copy blast indexes to.
-blast_genomic_dir (String; no; default: -; optional)
    Location to copy the DNA blast indexes to.
-blast_genes_dir (String; no; default: -; optional)
    Location to copy the gene (cdna, ncrna and protein) blast indexes to.
-scp_user (String; no; default: current user; optional)
    User to perform the SCP as. Defaults to the current user.
-scp_identity (String; no; default: -; optional)
    The SSH identity file to use when performing SCPs. Normally used in
    conjunction with -scp_user.
-no_scp (Boolean; no; default: 0; optional)
    Skip SCP altogether.
-pipeline_name (String; no; default: fasta_dump_$release; optional)
    Name to use for the pipeline.
-wublast_exe (String; no; default: xdformat; optional)
    Location of the WUBlast indexing binary.
-blat_exe (String; no; default: faToTwoBit; optional)
    Location of the Blat indexing binary.
-port_offset (Integer; no; default: 30000; optional)
    The offset of the ports to use when generating blat indexes. This figure
    is added onto the web database species ID.
-email (String; no; default: $USER@sanger.ac.uk; optional)
    Email to send pipeline summaries to upon its successful completion.

Example Commands

Normal usage:

	init_pipeline.pl Bio::EnsEMBL::Pipeline::PipeConfig::FASTA_conf \
	-pipeline_db -host=my-db-host -base_path /path/to/dumps -registry reg.pm

Run a subset of species (no forcing & supports registry aliases):

	init_pipeline.pl Bio::EnsEMBL::Pipeline::PipeConfig::FASTA_conf \
	-pipeline_db -host=my-db-host -species anolis -species celegans -species human \
	-base_path /path/to/dumps -registry reg.pm

Specifying species to force (supports all registry aliases):

	init_pipeline.pl Bio::EnsEMBL::Pipeline::PipeConfig::FASTA_conf \
	-pipeline_db -host=my-db-host -force_species anolis -force_species celegans -force_species human \
	-base_path /path/to/dumps -registry reg.pm

Running & forcing a species:

	init_pipeline.pl Bio::EnsEMBL::Pipeline::PipeConfig::FASTA_conf \
	-pipeline_db -host=my-db-host -species celegans -force_species celegans \
	-base_path /path/to/dumps -registry reg.pm

Dumping just gene data (no DNA or ncRNA):

	init_pipeline.pl Bio::EnsEMBL::Pipeline::PipeConfig::FASTA_conf \
	-pipeline_db -host=my-db-host -dump_types cdna \
	-base_path /path/to/dumps -registry reg.pm

Using a different SCP user & identity:

	init_pipeline.pl Bio::EnsEMBL::Pipeline::PipeConfig::FASTA_conf \
	-pipeline_db -host=my-db-host -scp_user anotherusr -scp_identity /users/anotherusr/.pri/identity \
	-base_path /path/to/dumps -registry reg.pm

Running the Pipeline

  1. Start a screen session or get ready to run the beekeeper with a nohup
  2. Choose a dump location
  3. Run one of the init_pipeline.pl commands from above
  4. Sync the database using one of the commands displayed by init_pipeline.pl
  5. Run the pipeline in a loop with a good sleep between submissions and redirect log output (the following assumes you are using bash):

	beekeeper.pl -url mysql://usr:pass@server:port/db -reg_conf reg.pm -loop -sleep 5 > my_run.log 2>&1 &

  6. Wait
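The order of the redirections in step 5 matters: stdout must be pointed at the log file before stderr is duplicated onto it, otherwise errors keep going to the terminal. A minimal sketch of the behaviour:

```shell
# '> file 2>&1' sends both streams to the file
log=$(mktemp)
{ echo out; echo err 1>&2; } > "$log" 2>&1
captured=$(cat "$log")
rm -f "$log"
echo "$captured"   # prints both 'out' and 'err'
```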