Commit 11863831 authored by Leo Gordon

new style LongMult pipeline config and running added

parent 8cfb6bfd
@@ -87,7 +87,6 @@ It will be convenient to set a variable pointing at this directory for future use:
#
# (for best results, append these lines to your ~/.bashrc configuration file)
# using [t]csh syntax:
$ setenv PERL5LIB ${PERL5LIB}:${ENS_CODE_ROOT}/bioperl-live
$ setenv PERL5LIB ${PERL5LIB}:${ENS_CODE_ROOT}/ensembl/modules
@@ -97,59 +96,111 @@ It will be convenient to set a variable pointing at this directory for future use:
# (for best results, append these lines to your ~/.cshrc or ~/.tcshrc configuration file)
3. It will be convenient to set a variable with MySQL connection parameters to the MySQL instance
where you'll be creating eHive pipelines
(which means you'll need privileges to create databases and to write into them):
3. Useful files and directories of the eHive repository.
# using bash syntax:
$ export MYCONN="--host=hostname --port=3306 --user=mysql_username --password=SeCrEt"
3.1 In ensembl-hive/scripts we keep the Perl scripts used for controlling the pipelines.
Adding this directory to your $PATH may make your life easier (see the example below).
# using [t]csh syntax:
$ setenv MYCONN "--host=hostname --port=3306 --user=mysql_username --password=SeCrEt"
* init_pipeline.pl is used to create hive databases, populate hive-specific and pipeline-specific tables and load data
* beekeeper.pl is used to run the pipeline; it sends 'Workers' to the 'Meadow' to run the jobs of the pipeline
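For example, assuming a bash shell and the $ENS_CODE_ROOT variable set as above, the scripts directory could be added to your PATH like this (use setenv for [t]csh):
$ export PATH=${PATH}:${ENS_CODE_ROOT}/ensembl-hive/scripts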
3.2 In ensembl-hive/modules/Bio/EnsEMBL/Hive/PipeConfig we keep example pipeline configuration modules that can be used by init_pipeline.pl .
A PipeConfig is a parametric module that defines the structure of the pipeline.
That is, which analyses with what parameters will have to be run and in which order.
The code for each analysis is contained in a RunnableDB module.
For some tasks bespoke RunnableDBs have to be written, whereas other problems can be solved using only 'universal building blocks'.
A typical pipeline is a mixture of both.
3.3 In ensembl-hive/modules/Bio/EnsEMBL/Hive/RunnableDB we keep 'universal building block' RunnableDBs:
* SystemCmd.pm is a parameter substitution wrapper for any command line executed by the current shell
* SqlCmd.pm is a parameter substitution wrapper for running any MySQL query or a session of linked queries
against a particular database (eHive pipeline database by default, but not necessarily)
* JobFactory.pm is a universal module for dynamically creating batches of same-analysis jobs (with different parameters)
to be run within the current pipeline
3.4 In ensembl-hive/modules/Bio/EnsEMBL/Hive/RunnableDB/LongMult we keep the bespoke RunnableDBs for the long multiplication example pipeline.
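A quick way to see both the universal building blocks and the bespoke LongMult Runnables is simply to list these directories:
$ ls $ENS_CODE_ROOT/ensembl-hive/modules/Bio/EnsEMBL/Hive/RunnableDB
$ ls $ENS_CODE_ROOT/ensembl-hive/modules/Bio/EnsEMBL/Hive/RunnableDB/LongMult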
4. Long multiplication example pipeline.
The long multiplication pipeline solves the problem of multiplying two very long integer numbers by pretending that the computations have to be done in parallel on the farm.
While performing the task it uses various features of eHive, so by studying this and other examples you can learn how to put together your own pipelines.
4.1 The pipeline is defined in 4 files:
* ensembl-hive/modules/Bio/EnsEMBL/Hive/RunnableDB/LongMult/Start.pm splits a multiplication job into sub-tasks and creates corresponding jobs
* ensembl-hive/modules/Bio/EnsEMBL/Hive/RunnableDB/LongMult/PartMultiply.pm performs a partial multiplication and stores the intermediate result in a table
* ensembl-hive/modules/Bio/EnsEMBL/Hive/RunnableDB/LongMult/AddTogether.pm waits for the partial multiplication results to be computed and adds them together into the final result
* ensembl-hive/modules/Bio/EnsEMBL/Hive/PipeConfig/LongMult_conf.pm is the pipeline configuration module that links the previous Runnables into one pipeline
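To study the pipeline definition itself you can open the PipeConfig module in a pager, for example:
$ less $ENS_CODE_ROOT/ensembl-hive/modules/Bio/EnsEMBL/Hive/PipeConfig/LongMult_conf.pm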
4.2 The main part of any PipeConfig file, the pipeline_analyses() method, defines the pipeline graph whose nodes are analyses and whose arcs are control and dataflow rules.
Each analysis hash must have:
-logic_name the string name by which this analysis is referred to,
-module the name of the Runnable module that contains the code to be run (several analyses can use the same Runnable)
Optionally, it can also have:
-input_ids an array of hashes, each hash defining job-specific parameters (if empty, jobs are created dynamically using the dataflow mechanism)
-parameters usually a hash of analysis-wide parameters (each such parameter can be overridden by a parameter of the same name contained in an input_id hash)
-wait_for an array of other analyses *controlling* this one (jobs of this analysis cannot start before all jobs of the controlling analyses have completed)
-flow_into usually a hash that defines dataflow rules (rules of dynamic job creation during pipeline execution) from this particular analysis.
The meaning of these parameters should become clearer after some experimentation with the pipeline.
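Once the pipeline has been initialized (section 5 below) you can also see how these definitions map onto the database; a suggested check, run from the mysql session described in section 5 (analysis and dataflow_rule are standard eHive tables):
MySQL> SELECT analysis_id, logic_name, module FROM analysis;
MySQL> SELECT * FROM dataflow_rule;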
5. Initializing and running the long multiplication pipeline.
5.1 Before running the pipeline you will have to initialize it using the init_pipeline.pl script, supplying the PipeConfig module and the necessary parameters.
Have another look at the LongMult_conf.pm file. The default_options() method returns a hash that pretty much defines what parameters you can/should supply to init_pipeline.pl.
You will probably need to specify the following:
$ init_pipeline.pl Bio::EnsEMBL::Hive::PipeConfig::LongMult_conf \
-ensembl_cvs_root_dir $ENS_CODE_ROOT \
-pipeline_db -host=<your_mysql_host> \
-pipeline_db -user=<your_mysql_username> \
-pipeline_db -pass=<your_mysql_password>
The same syntax will work for eHive control scripts, so you'll be using the same variable.
This should create a fresh eHive database and initialize it with the long multiplication pipeline data (the two numbers to be multiplied are taken from the defaults).
Upon successful completion init_pipeline.pl will print several beekeeper commands and
a mysql command for connecting to the newly created database.
Copy and run the mysql command in a separate shell session to follow the progress of the pipeline.
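Inside that mysql session, a quick first look at what init_pipeline.pl has created might be:
MySQL> SHOW TABLES;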
4. Create an "empty" eHive:
5.2 Run the first beekeeper command, the one that contains the '-sync' option. This will initialize the database's internal stats and determine which jobs can be run.
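The exact command is printed by init_pipeline.pl; it typically looks something like the following (the hostname, credentials and database name here are only placeholders):
$ beekeeper.pl -url mysql://mysql_username:SeCrEt@hostname:3306/<pipeline_dbname> -sync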
The state of each eHive is maintained in a MySQL database that has to be created from the file ensembl-hive/sql/tables.sql :
5.3 Now you have two options: either run beekeeper.pl in automatic mode using the '-loop' option and wait until it completes,
or run it in step-by-step mode, initiating every step with a separate execution of the 'beekeeper.pl ... -run' command.
We will use the step-by-step mode in order to see what is going on.
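Again using placeholder connection details, the two modes look like this:
# run just one round of job submission:
$ beekeeper.pl -url mysql://mysql_username:SeCrEt@hostname:3306/<pipeline_dbname> -run
# or let the beekeeper loop until the pipeline is complete:
$ beekeeper.pl -url mysql://mysql_username:SeCrEt@hostname:3306/<pipeline_dbname> -loop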
$ mysql $MYCONN -e 'CREATE DATABASE ehive_test'
$ mysql $MYCONN ehive_test < $ENS_CODE_ROOT/ensembl-hive/sql/tables.sql
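A quick check that the schema has been created (all the tables should still be empty at this point):
$ mysql $MYCONN ehive_test -e 'SHOW TABLES'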
5.4 Go to the mysql window and check the contents of the analysis_job table:
In this step you have created a database with empty tables
that will have to be "loaded" with tasks to perform, depending on the particular pipeline we want to run.
MySQL> SELECT * FROM analysis_job;
5. Configure and load a pipeline:
It will only contain the jobs that set up the multiplication tasks, in the 'READY' state - meaning 'ready to be taken by workers and executed'.
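A more compact way to follow the progress is to count jobs per analysis and status (analysis_id and status are standard analysis_job columns):
MySQL> SELECT analysis_id, status, COUNT(*) FROM analysis_job GROUP BY analysis_id, status;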
Now the structure of a particular pipeline has to be defined:
(A) Each 'analysis' table entry describes a particular type of job that can be run,
with the corresponding Perl module to run and the generic parameters for that module.
Most of our pipelines have more than one analysis.
(B) Each 'control_rule' table entry links two analyses A and B in such a way that until all A-jobs
have been completed none of the B-jobs can be started.
(C) Each 'dataflow_rule' table entry links two analyses A and B in such a way that when an A-job completes,
it is said to "flow into" a B-job (a B-job is automatically created for each A-job,
and parameters are passed from A to B individually).
(D) A particular pipeline may have extra tables defined to store the intermediate and final
results of computation. They may need to be loaded with some initial data.
(E) A certain number of jobs will have to be loaded into 'analysis_job' table (this number won't usually
reflect the total number of jobs, as jobs can create other jobs or "flow into" them).
Go to the beekeeper window and run the 'beekeeper.pl ... -run' command once.
It will submit a worker to the farm that will at some point get the 'start' job(s).
The task of loading all the components of pipelines is usually automated by a configuration script
(or sometimes two, the first for loading "pipeline definition" (A-D) and the second for loading jobs (E) )
5.5 Go to the mysql window again and check the contents of the analysis_job table. Keep checking, as the worker may spend some time in the 'pending' state.
Please familiarize yourself with the file $ENS_CODE_ROOT/ensembl-hive/docs/long_mult_example_pipeline.txt
that gives step-by-step instructions on how to load and run our toy pipeline for distributed multiplication of long numbers.
Although it is a toy pipeline, it gives a good example of how to address each of the points (A)..(E) listed above.
After the first worker is done you will see that 'start' jobs are now done and new 'part_multiply' and 'add_together' jobs have been created.
Also check the contents of the 'intermediate_result' table; it should still be empty at this point:
MySQL> SELECT * FROM intermediate_result;
6. Scripts used for sending Workers to the farm.
Go back to the beekeeper window and run the 'beekeeper.pl ... -run' command for the second time.
It will submit another worker to the farm that will at some point get the 'part_multiply' jobs.
The scripts used to control the loading/execution of eHive pipelines are stored in "$ENS_CODE_ROOT/ensembl-hive/scripts" directory.
(Again, we suggest that you add $ENS_CODE_ROOT/ensembl-hive/scripts to your executable PATH variable to avoid much typing.)
5.6 Now check both 'analysis_job' and 'intermediate_result' tables again.
At some point the 'part_multiply' jobs will have completed and their results will go into the 'intermediate_result' table;
'add_together' jobs are still to be done.
Check the contents of the 'final_result' table (it should still be empty) and run the third and last round of 'beekeeper.pl ... -run'.
The main script to query and run the eHive pipelines is 'beekeeper.pl',
but the other two scripts 'runWorker.pl' and 'cmd_hive.pl' may also become useful at some point.
Running each script without parameters will provide the list of options and usage examples.
5.7 Eventually you will see that all jobs have completed and the 'final_result' table contains the final result(s) of the multiplication.
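You can check it with the same kind of query used for the other tables:
MySQL> SELECT * FROM final_result;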