Commit 6efd59d5 authored by Leo Gordon

bringing docs up-to-date with new init_pipeline

parent 11863831
@@ -16,6 +16,15 @@ Summary:
Bio::EnsEMBL::Analysis::RunnableDB perl wrapper objects as nodes/blocks in
the graphs but could be adapted more generally.
12 May, 2010 : Leo Gordon
* init_pipeline.pl can be given a PipeConfig file name instead of full module name.
* init_pipeline.pl has its own help that displays pod documentation (same mechanism as other eHive scripts)
* 3 pipeline initialization modes supported:
full (default), -analysis_topup (pipeline development mode) and -job_topup (add more data to work with)
11 May, 2010 : Leo Gordon
@@ -124,83 +124,3 @@ It will be convenient to set a variable pointing at this directory for future use
3.4 In ensembl-hive/modules/Bio/EnsEMBL/Hive/RunnableDB/LongMult we keep bespoke RunnableDBs for the long multiplication example pipeline.
4 Long multiplication example pipeline.
The long multiplication pipeline solves the problem of multiplying two very long integers by pretending the computations have to be done in parallel on the farm.
While performing the task it uses various features of eHive, so by studying this and other examples you can learn how to put together your own pipelines.
4.1 The pipeline is defined in 4 files:
* ensembl-hive/modules/Bio/EnsEMBL/Hive/RunnableDB/LongMult/Start.pm splits a multiplication job into sub-tasks and creates corresponding jobs
* ensembl-hive/modules/Bio/EnsEMBL/Hive/RunnableDB/LongMult/PartMultiply.pm performs a partial multiplication and stores the intermediate result in a table
* ensembl-hive/modules/Bio/EnsEMBL/Hive/RunnableDB/LongMult/AddTogether.pm waits for the partial multiplication results to be computed and adds them together into the final result
* ensembl-hive/modules/Bio/EnsEMBL/Hive/PipeConfig/LongMult_conf.pm is the pipeline configuration module that links the previous Runnables into one pipeline
4.2 The main part of any PipeConfig file, the pipeline_analyses() method, defines the pipeline graph whose nodes are analyses and whose arcs are control and dataflow rules.
Each analysis hash must have:
-logic_name string name by which this analysis is referred to,
-module a name of the Runnable module that contains the code to be run (several analyses can use the same Runnable)
Optionally, it can also have:
-input_ids an array of hashes, each hash defining job-specific parameters (if empty, it means jobs are created dynamically using the dataflow mechanism)
-parameters usually a hash of analysis-wide parameters (each such parameter can be overridden by a parameter of the same name contained in an input_id hash)
-wait_for an array of other analyses, *controlling* this one (jobs of this analysis cannot start before all jobs of controlling analyses have completed)
-flow_into usually a hash that defines dataflow rules (rules of dynamic job creation during pipeline execution) from this particular analysis.
The meaning of these parameters should become clearer after some experimentation with the pipeline.
5 Initialization and running the long multiplication pipeline.
5.1 Before running the pipeline you will have to initialize it using the init_pipeline.pl script, supplying the PipeConfig module and the necessary parameters.
Have another look at the LongMult_conf.pm file. The default_options() method returns a hash that pretty much defines what parameters you can/should supply to init_pipeline.pl .
You will probably need to specify the following:
$ init_pipeline.pl Bio::EnsEMBL::Hive::PipeConfig::LongMult_conf \
-ensembl_cvs_root_dir $ENS_CODE_ROOT \
-pipeline_db -host=<your_mysql_host> \
-pipeline_db -user=<your_mysql_username> \
-pipeline_db -pass=<your_mysql_password>
This should create a fresh eHive database and initialize it with the long multiplication pipeline data (the two numbers to be multiplied are taken from the defaults).
Upon successful completion init_pipeline.pl will print several beekeeper commands and
a mysql command for connecting to the newly created database.
Copy and run the mysql command in a separate shell session to follow the progress of the pipeline.
5.2 Run the first beekeeper command that contains the '-sync' option. This will initialize the database's internal stats and determine which jobs can be run.
5.3 Now you have two options: either run beekeeper.pl in automatic mode using the '-loop' option and wait until it completes,
or run it in step-by-step mode, initiating every step by a separate execution of the 'beekeeper.pl ... -run' command.
We will use the step-by-step mode in order to see what is going on.
5.4 Go to the mysql window and check the contents of the analysis_job table:
MySQL> SELECT * FROM analysis_job;
It will only contain the jobs that set up the multiplication tasks, in 'READY' state - meaning 'ready to be taken by workers and executed'.
Go to the beekeeper window and run the 'beekeeper.pl ... -run' once.
It will submit a worker to the farm that will at some point get the 'start' job(s).
5.5 Go to the mysql window again and check the contents of the analysis_job table. Keep checking, as the worker may spend some time in the 'pending' state.
After the first worker is done you will see that the 'start' jobs are now done and new 'part_multiply' and 'add_together' jobs have been created.
Also check the contents of the 'intermediate_result' table; it should be empty at this point:
MySQL> SELECT * from intermediate_result;
Go back to the beekeeper window and run the 'beekeeper.pl ... -run' for the second time.
It will submit another worker to the farm that will at some point get the 'part_multiply' jobs.
5.6 Now check both the 'analysis_job' and 'intermediate_result' tables again.
At some point the 'part_multiply' jobs will have completed and their results will have gone into the 'intermediate_result' table;
the 'add_together' jobs are still to be done.
Check the contents of the 'final_result' table (it should be empty) and run the third and last round of 'beekeeper.pl ... -run'.
5.7 Eventually you will see that all jobs have completed and the 'final_result' table contains the final result(s) of the multiplication.
############################################################################################################################
#
# Bio::EnsEMBL::Hive::RunnableDB::LongMult is an example eHive pipeline that demonstrates the following features:
#
# A) A pipeline can have multiple analyses (this one has three: 'start', 'part_multiply' and 'add_together').
#
# B) A job of one analysis can create jobs of another analysis (one 'start' job creates up to 8 'part_multiply' jobs).
#
# C) A job of one analysis can "flow the data" into another analysis (a 'start' job "flows into" an 'add_together' job).
#
# D) Execution of one analysis can be blocked until all jobs of another analysis have been successfully completed
# ('add_together' is blocked both by 'start' and 'part_multiply').
#
# E) As filesystems are frequently a bottleneck for big pipelines, it is advised that eHive processes store intermediate
# and final results in a database (in this pipeline, 'intermediate_result' and 'final_result' tables are used).
#
############################################################################################################################
# 0. Cache MySQL connection parameters in a variable (they will work as eHive connection parameters as well) :
export MYCONN="--host=hostname --port=port_number --user=username --password=secret"
#
# also, set the ENS_CODE_ROOT to the directory where ensembl packages are installed:
export ENS_CODE_ROOT="$HOME/ensembl_main"
# 1. Create an empty database:
mysql $MYCONN -e 'DROP DATABASE IF EXISTS long_mult_test'
mysql $MYCONN -e 'CREATE DATABASE long_mult_test'
# 2. Create eHive infrastructure:
mysql $MYCONN long_mult_test <$ENS_CODE_ROOT/ensembl-hive/sql/tables.sql
# 3. Create analyses/control_rules/dataflow_rules of the LongMult pipeline:
mysql $MYCONN long_mult_test <$ENS_CODE_ROOT/ensembl-hive/sql/create_long_mult.sql
# 4. "Load" the pipeline with a multiplication task:
mysql $MYCONN long_mult_test <$ENS_CODE_ROOT/ensembl-hive/sql/load_long_mult.sql
#
# or you can add your own task(s). Several tasks can be added at once:
mysql $MYCONN long_mult_test <<EoF
INSERT INTO analysis_job (analysis_id, input_id) VALUES ( 1, "{ 'a_multiplier' => '9650516169', 'b_multiplier' => '327358788' }");
INSERT INTO analysis_job (analysis_id, input_id) VALUES ( 1, "{ 'a_multiplier' => '327358788', 'b_multiplier' => '9650516169' }");
EoF
# 5. Initialize the newly created eHive for the first time:
beekeeper.pl $MYCONN --database=long_mult_test -sync
# 6. You can either execute three individual workers yourself, by running the following command three times (each worker picking one analysis of the pipeline):
runWorker.pl $MYCONN --database=long_mult_test
#
#
# ... or run an automatic loop that will run workers for you:
beekeeper.pl $MYCONN --database=long_mult_test -loop
# 7. The results of the computations are to be found in 'final_result' table:
mysql $MYCONN long_mult_test -e 'SELECT * FROM final_result'
# 8. You can add more multiplication tasks by repeating from step 4.
4 Long multiplication example pipeline.
4.1 The long multiplication pipeline solves the problem of multiplying two very long integers by pretending the computations have to be done in parallel on the farm.
While performing the task it demonstrates the use of the following features:
A) A pipeline can have multiple analyses (this one has three: 'start', 'part_multiply' and 'add_together').
B) A job of one analysis can create jobs of other analyses by 'flowing the data' down numbered channels or branches.
These branches are then assigned specific analysis names in the pipeline configuration file
(one 'start' job flows partial multiplication subtasks down branch #2 and a task of adding them together down branch #1).
C) Execution of one analysis can be blocked until all jobs of another analysis have been successfully completed
('add_together' is blocked by 'part_multiply').
D) As filesystems are frequently a bottleneck for big pipelines, it is advised that eHive processes store intermediate
and final results in a database (in this pipeline, 'intermediate_result' and 'final_result' tables are used).
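
Before diving into the files it may help to see, outside eHive, the arithmetic that gets split up.
The following sketch is purely illustrative (it is not code from the Runnables): each distinct non-zero
digit of 'b_multiplier' gives one 'part_multiply'-style subtask (mirroring the (a_multiplier, digit) key
of the 'intermediate_result' table used later), and the 'add_together'-style step shifts the partial
products into place and sums them:

    #!/usr/bin/env perl
    # standalone illustration of the long multiplication decomposition (not the actual Runnable code)
    use strict;
    use warnings;
    use Math::BigInt;                       # keeps "very long" integers exact

    my ($a_multiplier, $b_multiplier) = ('9650516169', '327358788');

    # the "fan": one partial product per distinct non-zero digit of b_multiplier
    my %partial_product;
    foreach my $digit (split //, $b_multiplier) {
        next unless $digit;                                             # skip zero digits
        $partial_product{$digit} ||= Math::BigInt->new($a_multiplier)->bmul($digit);
    }

    # the "funnel": shift each partial product into its decimal position and add everything up
    my $result = Math::BigInt->bzero();
    my @digits = split //, $b_multiplier;
    foreach my $i (0 .. $#digits) {
        my $digit = $digits[$i] or next;
        $result->badd( $partial_product{$digit}->copy->blsft($#digits - $i, 10) );   # append trailing zeroes
    }
    print "$a_multiplier * $b_multiplier = $result\n";
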
4.2 The pipeline is defined in 4 files (a minimal Runnable skeleton is sketched after this list):
* ensembl-hive/modules/Bio/EnsEMBL/Hive/RunnableDB/LongMult/Start.pm splits a multiplication job into sub-tasks and creates corresponding jobs
* ensembl-hive/modules/Bio/EnsEMBL/Hive/RunnableDB/LongMult/PartMultiply.pm performs a partial multiplication and stores the intermediate result in a table
* ensembl-hive/modules/Bio/EnsEMBL/Hive/RunnableDB/LongMult/AddTogether.pm waits for the partial multiplication results to be computed and adds them together into the final result
* ensembl-hive/modules/Bio/EnsEMBL/Hive/PipeConfig/LongMult_conf.pm is the pipeline configuration module that links the previous Runnables into one pipeline
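
All three Runnables listed above implement eHive's standard interface: a Runnable module inherits from
Bio::EnsEMBL::Hive::Process and provides fetch_input(), run() and write_output(), which eHive calls in
that order for every job. Below is a minimal sketch, loosely modelled on PartMultiply; the package name,
the trivial arithmetic and the exact SQL are illustrative only, and the $self->db->dbc->prepare(...)
calls assume the usual Ensembl database adaptor accessors available to a Process:

    package Bio::EnsEMBL::Hive::RunnableDB::LongMult::PartMultiplySketch;    # hypothetical name, for illustration only

    use strict;
    use warnings;
    use base ('Bio::EnsEMBL::Hive::Process');

    sub fetch_input {   # check that the job-specific parameters we rely on have been supplied
        my $self = shift;
        defined( $self->param('a_multiplier') ) or die "'a_multiplier' is an obligatory parameter";
        defined( $self->param('digit') )        or die "'digit' is an obligatory parameter";
    }

    sub run {           # do the computation and keep the result in a parameter for write_output()
        my $self = shift;
        # placeholder arithmetic, only valid for small numbers; the real PartMultiply handles arbitrarily long integers
        my $product = $self->param('a_multiplier') * $self->param('digit');
        $self->param('partial_product', $product);
    }

    sub write_output {  # store the result into the 'intermediate_result' table created by the pipeline
        my $self = shift;
        my $sth = $self->db->dbc->prepare(
            'REPLACE INTO intermediate_result (a_multiplier, digit, result) VALUES (?, ?, ?)' );
        $sth->execute( $self->param('a_multiplier'), $self->param('digit'), $self->param('partial_product') );
    }

    1;
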
4.3 The main part of any PipeConfig file, the pipeline_analyses() method, defines the pipeline graph whose nodes are analyses and whose arcs are control and dataflow rules.
Each analysis hash must have:
-logic_name string name by which this analysis is referred to,
-module a name of the Runnable module that contains the code to be run (several analyses can use the same Runnable)
Optionally, it can also have:
-input_ids an array of hashes, each hash defining job-specific parameters (if empty, it means jobs are created dynamically using the dataflow mechanism)
-parameters usually a hash of analysis-wide parameters (each such parameter can be overridden by a parameter of the same name contained in an input_id hash)
-wait_for an array of other analyses, *controlling* this one (jobs of this analysis cannot start before all jobs of controlling analyses have completed)
-flow_into usually a hash that defines dataflow rules (rules of dynamic job creation during pipeline execution) from this particular analysis.
The meaning of these parameters should become clearer after some experimentation with the pipeline.
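
For orientation, here is a made-up two-analysis fragment of what a pipeline_analyses() return value can
look like; the logic_names, the Runnable module name and the parameters are hypothetical, chosen only to
show each of the keys described above in context (the real LongMult analyses appear in the configuration
files further down):

    sub pipeline_analyses {
        my ($self) = @_;
        return [
            {   -logic_name => 'split_task',                                   # hypothetical analysis
                -module     => 'Bio::EnsEMBL::Hive::RunnableDB::YourRunnable', # hypothetical Runnable (several analyses may share one)
                -parameters => { 'chunk_size' => 100 },                        # analysis-wide default, can be overridden per job
                -input_ids  => [ { 'chunk_size' => 10 } ],                     # one seed job, overriding 'chunk_size'
                -flow_into  => { 1 => [ 'process_chunk' ] },                   # dataflow rule: create 'process_chunk' jobs on branch #1
            },
            {   -logic_name => 'process_chunk',
                -module     => 'Bio::EnsEMBL::Hive::RunnableDB::YourRunnable',
                -input_ids  => [],                                             # empty: jobs are created dynamically via dataflow
                -wait_for   => [ 'split_task' ],                               # control rule: blocked until all 'split_task' jobs complete
            },
        ];
    }
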
5 Initialization and running the long multiplication pipeline.
5.1 Before running the pipeline you will have to initialize it using the init_pipeline.pl script, supplying the PipeConfig module and the necessary parameters.
Have another look at the LongMult_conf.pm file. The default_options() method returns a hash that pretty much defines what parameters you can/should supply to init_pipeline.pl .
You will probably need to specify the following:
$ init_pipeline.pl Bio::EnsEMBL::Hive::PipeConfig::LongMult_conf \
-ensembl_cvs_root_dir $ENS_CODE_ROOT \
-pipeline_db -host=<your_mysql_host> \
-pipeline_db -user=<your_mysql_username> \
-pipeline_db -pass=<your_mysql_password>
This should create a fresh eHive database and initialize it with the long multiplication pipeline data (the two numbers to be multiplied are taken from the defaults).
Upon successful completion init_pipeline.pl will print several beekeeper commands and
a mysql command for connecting to the newly created database.
Copy and run the mysql command in a separate shell session to follow the progress of the pipeline.
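
To see where the '-ensembl_cvs_root_dir' and '-pipeline_db ...' options above end up, here is an abridged
sketch of the shape a default_options() method typically has in a PipeConfig module; the exact keys and
defaults of the real LongMult_conf.pm may differ, and the 'password' option name is an assumption used
here only to illustrate the $self->o() deferred-option mechanism:

    sub default_options {
        my ($self) = @_;
        return {
            %{ $self->SUPER::default_options() },               # inherit the standard options of the base class

            'ensembl_cvs_root_dir' => $ENV{'HOME'}.'/work',     # overridden above via -ensembl_cvs_root_dir

            'pipeline_db' => {                                  # nested keys are overridden via "-pipeline_db -host=..." etc.
                -host   => 'localhost',
                -port   => 3306,
                -user   => 'ensadmin',
                -pass   => $self->o('password'),                # deferred: expected to come from the command line
                -dbname => $ENV{'USER'}.'_long_mult',
            },
        };
    }

Any top-level key of such a hash can be overridden with a plain '-key value' option, and keys of nested
hashes with the repeated '-pipeline_db -key=value' syntax used in the command above.
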
5.2 Run the first beekeeper command that contains the '-sync' option. This will initialize the database's internal stats and determine which jobs can be run.
5.3 Now you have two options: either run beekeeper.pl in automatic mode using the '-loop' option and wait until it completes,
or run it in step-by-step mode, initiating every step by a separate execution of the 'beekeeper.pl ... -run' command.
We will use the step-by-step mode in order to see what is going on.
5.4 Go to the mysql window and check the contents of the analysis_job table:
MySQL> SELECT * FROM analysis_job;
It will only contain the jobs that set up the multiplication tasks, in 'READY' state - meaning 'ready to be taken by workers and executed'.
Go to the beekeeper window and run the 'beekeeper.pl ... -run' once.
It will submit a worker to the farm that will at some point get the 'start' job(s).
5.5 Go to the mysql window again and check the contents of the analysis_job table. Keep checking, as the worker may spend some time in the 'pending' state.
After the first worker is done you will see that the 'start' jobs are now done and new 'part_multiply' and 'add_together' jobs have been created.
Also check the contents of the 'intermediate_result' table; it should be empty at this point:
MySQL> SELECT * from intermediate_result;
Go back to the beekeeper window and run the 'beekeeper.pl ... -run' for the second time.
It will submit another worker to the farm that will at some point get the 'part_multiply' jobs.
5.6 Now check both the 'analysis_job' and 'intermediate_result' tables again.
At some point the 'part_multiply' jobs will have completed and their results will have gone into the 'intermediate_result' table;
the 'add_together' jobs are still to be done.
Check the contents of the 'final_result' table (it should be empty) and run the third and last round of 'beekeeper.pl ... -run'.
5.7 Eventually you will see that all jobs have completed and the 'final_result' table contains the final result(s) of the multiplication.
## Configuration file for the long multiplication pipeline example
#
## Run it like this:
#
# init_pipeline_old.pl -conf long_mult_pipeline.conf
#
# code directories:
my $ensembl_cvs_root_dir = $ENV{'HOME'}.'/work';
#my $ensembl_cvs_root_dir = $ENV{'HOME'}.'/ensembl_main'; ## for some Compara developers
# long multiplication pipeline database connection parameters:
my $pipeline_db = {
-host => 'compara2',
-port => 3306,
-user => 'ensadmin',
-pass => 'ensembl',
-dbname => $ENV{USER}.'_long_mult_pipeline',
};
return {
# pass connection parameters into the pipeline initialization script to create adaptors:
-pipeline_db => $pipeline_db,
# shell commands that create and possibly pre-fill the pipeline database:
-pipeline_create_commands => [
'mysql '.dbconn_2_mysql($pipeline_db, 0)." -e 'CREATE DATABASE $pipeline_db->{-dbname}'",
# standard eHive tables and procedures:
'mysql '.dbconn_2_mysql($pipeline_db, 1)." <$ensembl_cvs_root_dir/ensembl-hive/sql/tables.sql",
'mysql '.dbconn_2_mysql($pipeline_db, 1)." <$ensembl_cvs_root_dir/ensembl-hive/sql/procedures.sql",
# additional tables needed for long multiplication pipeline's operation:
'mysql '.dbconn_2_mysql($pipeline_db, 1)." -e 'CREATE TABLE intermediate_result (a_multiplier char(40) NOT NULL, digit tinyint NOT NULL, result char(41) NOT NULL, PRIMARY KEY (a_multiplier, digit))'",
'mysql '.dbconn_2_mysql($pipeline_db, 1)." -e 'CREATE TABLE final_result (a_multiplier char(40) NOT NULL, b_multiplier char(40) NOT NULL, result char(80) NOT NULL, PRIMARY KEY (a_multiplier, b_multiplier))'",
# name the pipeline to differentiate the submitted processes:
'mysql '.dbconn_2_mysql($pipeline_db, 1)." -e 'INSERT INTO meta (meta_key, meta_value) VALUES (\"name\", \"lmult\")'",
],
-pipeline_analyses => [
{ -logic_name => 'start',
-module => 'Bio::EnsEMBL::Hive::RunnableDB::LongMult::Start',
-parameters => {},
-input_ids => [
{ 'a_multiplier' => '9650516169', 'b_multiplier' => '327358788' },
{ 'a_multiplier' => '327358788', 'b_multiplier' => '9650516169' },
],
-flow_into => {
2 => [ 'part_multiply' ], # will create a fan of jobs
1 => [ 'add_together' ], # will create a funnel job to wait for the fan to complete and add the results
},
},
{ -logic_name => 'part_multiply',
-module => 'Bio::EnsEMBL::Hive::RunnableDB::LongMult::PartMultiply',
-parameters => {},
-input_ids => [
# (jobs for this analysis will be flown_into via branch-2 from 'start' jobs above)
],
},
{ -logic_name => 'add_together',
-module => 'Bio::EnsEMBL::Hive::RunnableDB::LongMult::AddTogether',
-parameters => {},
-input_ids => [
# (jobs for this analysis will be flown_into via branch-1 from 'start' jobs above)
],
-wait_for => [ 'part_multiply' ], # we can only start adding when all partial products have been computed
},
],
};
## Configuration file for the long multiplication semaphored pipeline example
#
## Run it like this:
#
# init_pipeline_old.pl -conf long_mult_sema_pipeline.conf
#
# code directories:
my $ensembl_cvs_root_dir = $ENV{'HOME'}.'/work';
#my $ensembl_cvs_root_dir = $ENV{'HOME'}.'/ensembl_main'; ## for some Compara developers
# long multiplication pipeline database connection parameters:
my $pipeline_db = {
-host => 'compara2',
-port => 3306,
-user => 'ensadmin',
-pass => 'ensembl',
-dbname => $ENV{USER}.'_long_mult_sema_pipeline',
};
return {
# pass connection parameters into the pipeline initialization script to create adaptors:
-pipeline_db => $pipeline_db,
# shell commands that create and possibly pre-fill the pipeline database:
-pipeline_create_commands => [
'mysql '.dbconn_2_mysql($pipeline_db, 0)." -e 'CREATE DATABASE $pipeline_db->{-dbname}'",
# standard eHive tables and procedures:
'mysql '.dbconn_2_mysql($pipeline_db, 1)." <$ensembl_cvs_root_dir/ensembl-hive/sql/tables.sql",
'mysql '.dbconn_2_mysql($pipeline_db, 1)." <$ensembl_cvs_root_dir/ensembl-hive/sql/procedures.sql",
# additional tables needed for long multiplication pipeline's operation:
'mysql '.dbconn_2_mysql($pipeline_db, 1)." -e 'CREATE TABLE intermediate_result (a_multiplier char(40) NOT NULL, digit tinyint NOT NULL, result char(41) NOT NULL, PRIMARY KEY (a_multiplier, digit))'",
'mysql '.dbconn_2_mysql($pipeline_db, 1)." -e 'CREATE TABLE final_result (a_multiplier char(40) NOT NULL, b_multiplier char(40) NOT NULL, result char(80) NOT NULL, PRIMARY KEY (a_multiplier, b_multiplier))'",
# name the pipeline to differentiate the submitted processes:
'mysql '.dbconn_2_mysql($pipeline_db, 1)." -e 'INSERT INTO meta (meta_key, meta_value) VALUES (\"name\", \"slmult\")'",
],
-pipeline_analyses => [
{ -logic_name => 'sema_start',
-module => 'Bio::EnsEMBL::Hive::RunnableDB::LongMult::SemaStart',
-parameters => {},
-input_ids => [
{ 'a_multiplier' => '9650516169', 'b_multiplier' => '327358788' },
{ 'a_multiplier' => '327358788', 'b_multiplier' => '9650516169' },
],
-flow_into => {
1 => [ 'add_together' ], # will create a semaphored funnel job to wait for the fan to complete and add the results
2 => [ 'part_multiply' ], # will create a fan of jobs that control the semaphored funnel
},
},
{ -logic_name => 'part_multiply',
-module => 'Bio::EnsEMBL::Hive::RunnableDB::LongMult::PartMultiply',
-parameters => {},
-input_ids => [
# (jobs for this analysis will be flown_into via branch-2 from 'start' jobs above)
],
},
{ -logic_name => 'add_together',
-module => 'Bio::EnsEMBL::Hive::RunnableDB::LongMult::AddTogether',
-parameters => {},
-input_ids => [
# (jobs for this analysis will be flown_into via branch-1 from 'start' jobs above)
],
# jobs in this analysis are semaphored, so there is no need for '-wait_for'
},
],
};
############################################################################################################################
#
# This is a follow-up on 'long_mult_example_pipeline.txt' (so make sure you have read it first),
# showing how to set up a pipeline with counting semaphores.
#
############################################################################################################################
# 0. Cache MySQL connection parameters in a variable (they will work as eHive connection parameters as well) :
export MYCONN="--host=hostname --port=port_number --user=username --password=secret"
#
# also, set the ENS_CODE_ROOT to the directory where ensembl packages are installed:
export ENS_CODE_ROOT="$HOME/ensembl_main"
# 1. Create an empty database:
mysql $MYCONN -e 'DROP DATABASE IF EXISTS long_mult_test'
mysql $MYCONN -e 'CREATE DATABASE long_mult_test'
# 2. Create eHive infrastructure:
mysql $MYCONN long_mult_test <$ENS_CODE_ROOT/ensembl-hive/sql/tables.sql
# 3. Create analyses/control_rules/dataflow_rules of the LongMult pipeline:
mysql $MYCONN long_mult_test <$ENS_CODE_ROOT/ensembl-hive/sql/create_sema_long_mult.sql
# 4. "Load" the pipeline with a multiplication task:
mysql $MYCONN long_mult_test <<EoF
INSERT INTO analysis_job (analysis_id, input_id) VALUES ( 1, "{ 'a_multiplier' => '9650516169', 'b_multiplier' => '327358788' }");
INSERT INTO analysis_job (analysis_id, input_id) VALUES ( 1, "{ 'a_multiplier' => '327358788', 'b_multiplier' => '9650516169' }");
EoF
# 5. Initialize the newly created eHive for the first time:
beekeeper.pl $MYCONN --database=long_mult_test -sync
# 6. You can either execute three individual workers yourself, by running the following command three times (each worker picking one analysis of the pipeline):
runWorker.pl $MYCONN --database=long_mult_test
#
# ... or run an automatic loop that will run workers for you:
beekeeper.pl $MYCONN --database=long_mult_test -loop
#
# KNOWN BUG: if you keep suggesting your own analysis_id/logic_name, the system may sometimes think there is no work,
# when in fact some previously semaphored jobs have become available but remain invisible to some workers.
# KNOWN FIX: just run "beekeeper.pl $MYCONN --database=long_mult_test -sync" once, and the problem should rectify itself.
# 7. The results of the computations are to be found in 'final_result' table:
mysql $MYCONN long_mult_test -e 'SELECT * FROM final_result'
# 8. You can add more multiplication tasks by repeating from step 4.
# mini-pipeline for testing meta-parameter evaluation and SqlCmd in "external_db" mode
my $cvs_root_dir = $ENV{'HOME'}.'/work';
# family database connection parameters (our main database):
my $pipeline_db = {
-host => 'compara3',
-port => 3306,
-user => 'ensadmin',
-pass => 'ensembl',
-dbname => "lg4_test_sqlcmd",
};
my $slave_db = {
-host => 'compara3',
-port => 3306,
-user => 'ensadmin',
-pass => 'ensembl',
-dbname => "lg4_test_sqlcmd_slave",
};
return {
# pass connection parameters into the pipeline initialization script to create adaptors:
-pipeline_db => $pipeline_db,
# shell commands that create and pre-fill the pipeline database:
-pipeline_create_commands => [
'mysql '.dbconn_2_mysql($pipeline_db, 0)." -e 'CREATE DATABASE $pipeline_db->{-dbname}'",
'mysql '.dbconn_2_mysql($pipeline_db, 1)." <$cvs_root_dir/ensembl-hive/sql/tables.sql",
'mysql '.dbconn_2_mysql($pipeline_db, 1)." <$cvs_root_dir/ensembl-hive/sql/procedures.sql",
'mysql '.dbconn_2_mysql($pipeline_db, 0)." -e 'CREATE DATABASE $slave_db->{-dbname}'",
],
-pipeline_wide_parameters => { # these parameter values are visible to all analyses, can be overridden by parameters{} and input_id{}
'db_conn' => $slave_db, # testing the stringification of a structure here
},
-resource_classes => {
0 => { -desc => 'default, 8h', 'LSF' => '' },
1 => { -desc => 'urgent', 'LSF' => '-q yesterday' },
},
-pipeline_analyses => [
{ -logic_name => 'create_table',
-module => 'Bio::EnsEMBL::Hive::RunnableDB::SqlCmd',
-parameters => { },
-hive_capacity => 20, # to enable parallel branches
-input_ids => [
{ 'sql' => 'CREATE TABLE distance (place_from char(40) NOT NULL, place_to char(40) NOT NULL, miles float, PRIMARY KEY (place_from, place_to))', },
],
-rc_id => 1,
},
{ -logic_name => 'fill_in_table',
-module => 'Bio::EnsEMBL::Hive::RunnableDB::SqlCmd',
-parameters => {
'sql' => [ "INSERT INTO distance (place_from, place_to, miles) VALUES ('#from#', '#to#', #miles#)",
"INSERT INTO distance (place_from, place_to, miles) VALUES ('#to#', '#from#', #miles#)", ],
},
-hive_capacity => 20, # to enable parallel branches
-input_ids => [
{ 'from' => 'Cambridge', 'to' => 'Ely', 'miles' => 18.3 },
{ 'from' => 'London', 'to' => 'Cambridge', 'miles' => 60 },
],
-wait_for => 'create_table',
-rc_id => 1,
},
],
};
@@ -9,7 +9,7 @@
# and
# ensembl-hive/modules/Bio/EnsEMBL/Hive/PipeConfig/SemaLongMult_conf.pm
#
# which are used to load the Long Multiplicaton pipeline in "analysis control" and "semaphore job control" modes respectively.
# which are used to load the Long Multiplication pipeline in "analysis control" and "semaphore job control" modes respectively.
#
#
# Create these pipelines using init_pipeline.pl and run them using beekeeper.pl in step-by-step mode (use -run instead of -loop option).
# create the 3 analyses we are going to use:
INSERT INTO analysis (created, logic_name, module) VALUES (NOW(), 'start', 'Bio::EnsEMBL::Hive::RunnableDB::LongMult::Start');
INSERT INTO analysis (created, logic_name, module) VALUES (NOW(), 'part_multiply', 'Bio::EnsEMBL::Hive::RunnableDB::LongMult::PartMultiply');
INSERT INTO analysis (created, logic_name, module) VALUES (NOW(), 'add_together', 'Bio::EnsEMBL::Hive::RunnableDB::LongMult::AddTogether');
# link the analyses with control- and dataflow-rules:
# 'add_together' waits for 'part_multiply':
INSERT INTO analysis_ctrl_rule (condition_analysis_url, ctrled_analysis_id) VALUES ('part_multiply', (SELECT analysis_id FROM analysis WHERE logic_name='add_together'));
# 'start' flows into a fan:
INSERT INTO dataflow_rule (from_analysis_id, to_analysis_url, branch_code) VALUES ((SELECT analysis_id FROM analysis WHERE logic_name='start'), 'part_multiply', 2);
# 'start' flows into a funnel:
INSERT INTO dataflow_rule (from_analysis_id, to_analysis_url, branch_code) VALUES ((SELECT analysis_id FROM analysis WHERE logic_name='start'), 'add_together', 1);
# create a table for holding intermediate results (written by 'part_multiply' and read by 'add_together')
CREATE TABLE intermediate_result (
a_multiplier char(40) NOT NULL,
digit tinyint NOT NULL,
result char(41) NOT NULL,
PRIMARY KEY (a_multiplier, digit)
);
# create a table for holding final results (written by 'add_together')
CREATE TABLE final_result (
a_multiplier char(40) NOT NULL,
b_multiplier char(40) NOT NULL,
result char(80) NOT NULL,
PRIMARY KEY (a_multiplier, b_multiplier)
);
# create the 3 analyses we are going to use:
INSERT INTO analysis (created, logic_name, module) VALUES (NOW(), 'start', 'Bio::EnsEMBL::Hive::RunnableDB::LongMult::SemaStart');
INSERT INTO analysis (created, logic_name, module) VALUES (NOW(), 'part_multiply', 'Bio::EnsEMBL::Hive::RunnableDB::LongMult::PartMultiply');
INSERT INTO analysis (created, logic_name, module) VALUES (NOW(), 'add_together', 'Bio::EnsEMBL::Hive::RunnableDB::LongMult::AddTogether');
# (no control rules anymore, jobs are controlled via semaphores)
# 'start' flows into a fan:
INSERT INTO dataflow_rule (from_analysis_id, to_analysis_url, branch_code) VALUES ((SELECT analysis_id FROM analysis WHERE logic_name='start'), 'part_multiply', 2);
# 'start' flows into a funnel:
INSERT INTO dataflow_rule (from_analysis_id, to_analysis_url, branch_code) VALUES ((SELECT analysis_id FROM analysis WHERE logic_name='start'), 'add_together', 1);
# create a table for holding intermediate results (written by 'part_multiply' and read by 'add_together')
CREATE TABLE intermediate_result (
a_multiplier char(40) NOT NULL,
digit tinyint NOT NULL,
result char(41) NOT NULL,
PRIMARY KEY (a_multiplier, digit)
);
# create a table for holding final results (written by 'add_together')
CREATE TABLE final_result (
a_multiplier char(40) NOT NULL,
b_multiplier char(40) NOT NULL,
result char(80) NOT NULL,
PRIMARY KEY (a_multiplier, b_multiplier)
);
# To multiply two long numbers using the long_mult pipeline
# we have to create the 'start' job and provide the two multipliers:
INSERT INTO analysis_job (analysis_id, input_id) VALUES (
(SELECT analysis_id FROM analysis WHERE logic_name='start'),
"{ 'a_multiplier' => '123456789', 'b_multiplier' => '90319' }");