	eHive installation, setup and usage.

1. Download and install the necessary external software:

1.1. Perl 5.6 or higher, since eHive code is written in Perl

	# see http://www.perl.com/download.csp
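
	A quick way to check which Perl version is already installed on your machine:

	$ perl -v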

1.2. MySQL 5.1 or higher

	eHive keeps its state in a MySQL database, so you will need
	(1) a MySQL server installed on the machine where you want to maintain the state and
	(2) MySQL clients installed on the machines where the jobs are to be executed.
	MySQL version 5.1 or higher is recommended to maintain compatibility with Compara pipelines.

	# see http://dev.mysql.com/downloads/
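
	A quick way to check that both the client and the server meet this requirement
	(the host, port and user below are placeholders for your own setup):

	# client version:
	$ mysql --version

	# server version:
	$ mysql -h your_db_host -P 3306 -u your_user -p -e 'SELECT VERSION()'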

1.3. Perl DBI API
	Perl database interface that includes an API to MySQL

	# see http://dbi.perl.org/
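
	If DBI and the MySQL driver are not already installed system-wide, they can usually be
	installed from CPAN. This is only a sketch and assumes you are allowed to install Perl
	modules on your machine; your system administrator may already have done this:

	$ perl -MCPAN -e 'install DBI'
	$ perl -MCPAN -e 'install DBD::mysql'

	# quick check that DBI can be loaded:
	$ perl -MDBI -e 'print "DBI version: $DBI::VERSION\n"'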

1.4. Perl UUID API
	eHive uses Universally Unique Identifiers (UUIDs) to identify workers internally.

	# see http://search.cpan.org/dist/Data-UUID/
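
	If the Data::UUID module is missing, it can likewise be installed from CPAN
	(again a sketch, assuming CPAN access):

	$ perl -MCPAN -e 'install Data::UUID'

	# quick check that a UUID can be generated:
	$ perl -MData::UUID -e 'print Data::UUID->new->create_str, "\n"'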


2. Download and install essential and optional packages from BioPerl and EnsEMBL CVS

2.1. It is advised to have a dedicated directory where EnsEMBL-related packages will be deployed.
Unlike the DBI or UUID modules, which can be installed system-wide by the system administrator,
the EnsEMBL packages are best installed under your home directory, where you will have
full (read+write) access to their files and directories. For example,
	
	$ mkdir $HOME/ensembl_main

It will be convenient to set a variable pointing at this directory for future use:

# using bash syntax:
	$ export ENS_CODE_ROOT="$HOME/ensembl_main"
		#
		# (for best results, append this line to your ~/.bashrc configuration file)

# using [t]csh syntax:
	$ setenv ENS_CODE_ROOT "$HOME/ensembl_main"
		#
		# (for best results, append this line to your ~/.cshrc or ~/.tcshrc configuration file)

2.2. Change into your ensembl codebase directory:

	$ cd $ENS_CODE_ROOT

2.3. Log into the BioPerl CVS server (using "cvs" for password):

	$ cvs -d :pserver:cvs@code.open-bio.org:/home/repository/bioperl login

	# see http://www.bioperl.org/wiki/Using_CVS

2.4. Export the bioperl-live package:

	$ cvs -d :pserver:cvs@code.open-bio.org:/home/repository/bioperl export bioperl-live

2.5. Log into the EnsEMBL CVS server at Sanger (using "CVSUSER" for password):

	$ cvs -d :pserver:cvsuser@cvs.sanger.ac.uk:/cvsroot/ensembl login
	Logging in to :pserver:cvsuser@cvs.sanger.ac.uk:2401/cvsroot/ensembl
	CVS password: CVSUSER

2.6. Export ensembl and ensembl-hive CVS modules:

	$ cvs -d :pserver:cvsuser@cvs.sanger.ac.uk:/cvsroot/ensembl export ensembl
	$ cvs -d :pserver:cvsuser@cvs.sanger.ac.uk:/cvsroot/ensembl export ensembl-hive

2.7. If, as is likely, you are going to use eHive in the context of Compara pipelines,
	you will also need to install ensembl-compara:

	$ cvs -d :pserver:cvsuser@cvs.sanger.ac.uk:/cvsroot/ensembl export ensembl-compara
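
	At this point your codebase directory should contain one subdirectory per exported package:

	$ ls $ENS_CODE_ROOT
		#
		# (should list bioperl-live, ensembl and ensembl-hive, plus ensembl-compara if you performed step 2.7)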

2.8. Add new packages to the PERL5LIB variable:

# using bash syntax:
	$ export PERL5LIB=${PERL5LIB}:${ENS_CODE_ROOT}/bioperl-live
	$ export PERL5LIB=${PERL5LIB}:${ENS_CODE_ROOT}/ensembl/modules
	$ export PERL5LIB=${PERL5LIB}:${ENS_CODE_ROOT}/ensembl-hive/modules
	$ export PERL5LIB=${PERL5LIB}:${ENS_CODE_ROOT}/ensembl-compara/modules # optional but recommended, see 2.7.
		#
		# (for best results, append these lines to your ~/.bashrc configuration file)

# using [t]csh syntax:
	$ setenv PERL5LIB  ${PERL5LIB}:${ENS_CODE_ROOT}/bioperl-live
	$ setenv PERL5LIB  ${PERL5LIB}:${ENS_CODE_ROOT}/ensembl/modules
	$ setenv PERL5LIB  ${PERL5LIB}:${ENS_CODE_ROOT}/ensembl-hive/modules
	$ setenv PERL5LIB  ${PERL5LIB}:${ENS_CODE_ROOT}/ensembl-compara/modules # optional but recommended, see 2.7.
		#
		# (for best results, append these lines to your ~/.cshrc or ~/.tcshrc configuration file)
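
	To verify that the new packages are visible to Perl, try loading one module from each of them
	(the module names below are just examples of files shipped with each package):

	$ perl -e 'use Bio::SeqIO;'                  # from bioperl-live
	$ perl -e 'use Bio::EnsEMBL::Registry;'      # from ensembl/modules
	$ perl -e 'use Bio::EnsEMBL::Hive::Queen;'   # from ensembl-hive/modules
		#
		# (each command should produce no output; a "Can't locate ..." error means PERL5LIB is not set correctly)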


3. Useful files and directories of the eHive repository.

3.1 In ensembl-hive/scripts we keep the Perl scripts used for controlling the pipelines.
    Adding this directory to your $PATH may make your life easier.
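
    For example, mirroring the ENS_CODE_ROOT setup from 2.1:

# using bash syntax:
	$ export PATH="${PATH}:${ENS_CODE_ROOT}/ensembl-hive/scripts"

# using [t]csh syntax:
	$ setenv PATH ${PATH}:${ENS_CODE_ROOT}/ensembl-hive/scripts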

    * init_pipeline.pl is used to create hive databases, populate hive-specific and pipeline-specific tables and load data

    * beekeeper.pl is used to run the pipeline: it sends 'Workers' to the 'Meadow' to run the jobs of the pipeline

3.2 In ensembl-hive/modules/Bio/EnsEMBL/Hive/PipeConfig we keep example pipeline configuration modules that can be used by init_pipeline.pl .
    A PipeConfig is a parametric module that defines the structure of the pipeline,
    that is, which analyses will have to be run, with what parameters and in which order.
    The code for each analysis is contained in a RunnableDB module.
    For some tasks bespoke RunnableDBs have to be written, whereas other problems can be solved using only 'universal building blocks'.
    A typical pipeline is a mixture of both.

3.3 In ensembl-hive/modules/Bio/EnsEMBL/Hive/RunnableDB we keep 'universal building block' RunnableDBs:

    * SystemCmd.pm  is a parameter substitution wrapper for any command line executed by the current shell

    * SqlCmd.pm     is a parameter substitution wrapper for running any MySQL query or a session of linked queries
                    against a particular database (the eHive pipeline database by default, but not necessarily)

    * JobFactory.pm is a universal module for dynamically creating batches of jobs of the same analysis (with different parameters)
                    to be run within the current pipeline

3.4 In ensembl-hive/modules/Bio/EnsEMBL/Hive/RunnableDB/LongMult we keep bespoke RunnableDBs for the long multiplication example pipeline.
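
    As an end-to-end illustration, the long multiplication example can be initialized and run
    with the two scripts from 3.1. This is only a sketch: it assumes the example PipeConfig
    module is called LongMult_conf (see the PipeConfig directory from 3.2), and the MySQL
    credentials, host and database name shown are placeholders for your own values.

	# create and populate the eHive database for the example pipeline:
	$ init_pipeline.pl Bio::EnsEMBL::Hive::PipeConfig::LongMult_conf -password your_mysql_password

	# then let the beekeeper send Workers to the Meadow in a loop until all jobs are done
	# (the -url should point at the pipeline database created by init_pipeline.pl above):
	$ beekeeper.pl -url mysql://your_user:your_password@your_db_host:3306/your_pipeline_db -loop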