Commit d6b29f94 authored by Leo Gordon's avatar Leo Gordon


free 'Start' from dealing with 'a_multiplier' by using an input_id_template in PipeConfig instead; renamed 'Start' to 'DigitFactory' to reflect that
parent a0fb3f03
@@ -4,11 +4,11 @@
4.1 Long multiplication pipeline solves a problem of multiplying two very long integer numbers by pretending the computations have to be done in parallel on the farm.
While performing the task it demonstrates the use of the following features:
- A) A pipeline can have multiple analyses (this one has three: 'start', 'part_multiply' and 'add_together').
+ A) A pipeline can have multiple analyses (this one has three: 'take_b_apart', 'part_multiply' and 'add_together').
B) A job of one analysis can create jobs of other analyses by 'flowing the data' down numbered channels or branches.
These branches are then assigned specific analysis names in the pipeline configuration file
- (one 'start' job flows partial multiplication subtasks down branch #2 and a task of adding them together down branch #1).
+ (one 'take_b_apart' job flows partial multiplication subtasks down branch #2 and a task of adding them together down branch #1).
C) Execution of one analysis can be blocked until all jobs of another analysis have been successfully completed
('add_together' is blocked by 'part_multiply').
@@ -18,7 +18,7 @@
4.2 The pipeline is defined in 4 files:
- * ensembl-hive/modules/Bio/EnsEMBL/Hive/RunnableDB/LongMult/Start.pm splits a multiplication job into sub-tasks and creates corresponding jobs
+ * ensembl-hive/modules/Bio/EnsEMBL/Hive/RunnableDB/LongMult/DigitFactory.pm splits a multiplication job into sub-tasks and creates corresponding jobs
* ensembl-hive/modules/Bio/EnsEMBL/Hive/RunnableDB/LongMult/PartMultiply.pm performs a partial multiplication and stores the intermediate result in a table
@@ -33,7 +33,7 @@
Optionally, it can also have:
-input_ids an array of hashes, each hash defining job-specific parameters (if empty it means jobs are created dynamically using dataflow mechanism)
-parameters usually a hash of analysis-wide parameters (each such parameter can be overridden by the same name parameter contained in an input_id hash)
-wait_for an array of other analyses, *controlling* this one (jobs of this analysis cannot start before all jobs of controlling analyses have completed)
-flow_into usually a hash that defines dataflow rules (rules of dynamic job creation during pipeline execution) from this particular analysis.
The meaning of these parameters should become clearer after some experimentation with the pipeline.
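Putting those options together, a single analysis definition in a PipeConfig file might look like the following sketch (the logic names and the parameter values here are illustrative placeholders, not part of the LongMult pipeline):

```perl
# A sketch of one analysis definition combining the options described above.
# 'my_analysis', 'another_analysis' and 'downstream_analysis' are made-up names.
{   -logic_name => 'my_analysis',
    -module     => 'Bio::EnsEMBL::Hive::RunnableDB::LongMult::PartMultiply',
    -parameters => { 'take_time' => 1 },                # analysis-wide defaults
    -input_ids  => [ { 'digit' => 7 } ],                # job-specific parameters
    -wait_for   => [ 'another_analysis' ],              # block until that analysis completes
    -flow_into  => { 1 => [ 'downstream_analysis' ] },  # dataflow rule on branch #1
},
```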
@@ -69,11 +69,11 @@
It will only contain jobs that set up the multiplication tasks in 'READY' mode - meaning 'ready to be taken by workers and executed'.
Go to the beekeeper window and run the 'beekeeper.pl ... -run' once.
- It will submit a worker to the farm that will at some point get the 'start' job(s).
+ It will submit a worker to the farm that will at some point get the 'take_b_apart' job(s).
5.5 Go to the mysql window again and check the contents of the job table. Keep checking as the worker may spend some time in 'pending' state.
- After the first worker is done you will see that 'start' jobs are now done and new 'part_multiply' and 'add_together' jobs have been created.
+ After the first worker is done you will see that 'take_b_apart' jobs are now done and new 'part_multiply' and 'add_together' jobs have been created.
Also check the contents of 'intermediate_result' table, it should be empty at that moment:
MySQL> SELECT * from intermediate_result;
@@ -29,10 +29,10 @@ init_pipeline.pl Bio::EnsEMBL::Hive::PipeConfig::LongMult_conf -job_topup -passw
So long that they do not fit into registers of the CPU and should be multiplied digit-by-digit.
For the purposes of this example we also assume this task is very computationally intensive and has to be done in parallel.
- The long multiplication pipeline consists of three "analyses" (types of tasks): 'start', 'part_multiply' and 'add_together'
+ The long multiplication pipeline consists of three "analyses" (types of tasks): 'take_b_apart', 'part_multiply' and 'add_together'
that we will be using to exemplify various features of the Hive.
- * A 'start' job takes in two string parameters, 'a_multiplier' and 'b_multiplier',
+ * A 'take_b_apart' job takes in two string parameters, 'a_multiplier' and 'b_multiplier',
takes the second one apart into digits, finds what _different_ digits are there,
creates several jobs of the 'part_multiply' analysis and one job of 'add_together' analysis.
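The arithmetic being distributed here can be sketched in a few lines of self-contained Perl (the numbers below are made up for illustration; the pipeline's defaults are much longer):

```perl
use strict;
use warnings;

# Self-contained sketch of the decomposition the pipeline distributes:
my ($a, $b) = (123, 1221);

# 'take_b_apart': find the distinct digits of b, multiply a by each of them once:
my %partial;
foreach my $digit (split //, $b) {
    $partial{$digit} //= $a * $digit;    # the 'part_multiply' step, done once per digit
}

# 'add_together': shift each partial product into place and sum:
my @digits = split //, $b;
my $result = 0;
foreach my $i (0 .. $#digits) {
    $result += $partial{ $digits[$i] } * 10 ** ($#digits - $i);
}
print "$result\n";    # prints 150183, i.e. 123 * 1221
```

Only the _distinct_ digits of 'b_multiplier' need a partial product, which is exactly why the factory job deduplicates them before creating 'part_multiply' jobs.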
@@ -123,47 +123,40 @@ sub pipeline_wide_parameters {
Description : Implements pipeline_analyses() interface method of Bio::EnsEMBL::Hive::PipeConfig::HiveGeneric_conf that defines the structure of the pipeline: analyses, jobs, rules, etc.
Here it defines three analyses:
- * 'start' with two jobs (multiply 'first_mult' by 'second_mult' and vice versa - to check the commutativity of multiplication).
+ * 'take_b_apart' with two jobs (multiply 'first_mult' by 'second_mult' and vice versa - to check the commutativity of multiplication).
Each job will dataflow (create more jobs) via branch #2 into 'part_multiply' and via branch #1 into 'add_together'.
- * 'part_multiply' initially without jobs (they will flow from 'start')
+ * 'part_multiply' initially without jobs (they will flow from 'take_b_apart')
- * 'add_together' initially without jobs (they will flow from 'start').
+ * 'add_together' initially without jobs (they will flow from 'take_b_apart').
All 'add_together' jobs will wait for completion of 'part_multiply' jobs before their own execution (to ensure all data is available).
There are two control modes in this pipeline:
- A. The default mode is to use the '2' and '1' dataflow rules from 'start' analysis and a -wait_for rule in 'add_together' analysis for analysis-wide synchronization.
- B. The semaphored mode is to use '2->A' and 'A->1' semaphored dataflow rules from 'start' instead, and comment out the analysis-wide -wait_for rule, relying on semaphores.
+ A. The default mode is to use the '2' and '1' dataflow rules from 'take_b_apart' analysis and a -wait_for rule in 'add_together' analysis for analysis-wide synchronization.
+ B. The semaphored mode is to use '2->A' and 'A->1' semaphored dataflow rules from 'take_b_apart' instead, and comment out the analysis-wide -wait_for rule, relying on semaphores.
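The two modes correspond to two alternative -flow_into sections; a sketch of both side by side (mode A must be paired with an explicit -wait_for on the 'add_together' analysis, mode B relies on the semaphore alone):

```perl
# Mode A: plain numbered branches + an explicit blocking rule elsewhere:
-flow_into => {
    2 => [ 'part_multiply' ],   # fan of jobs
    1 => [ 'add_together' ],    # funnel; 'add_together' also needs -wait_for => [ 'part_multiply' ]
},

# Mode B: semaphored fan/funnel; no -wait_for needed:
-flow_into => {
    '2->A' => [ 'part_multiply' ],  # semaphored fan of jobs
    'A->1' => [ 'add_together' ],   # funnel job runs once the fan's semaphore is released
},
```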
=cut
sub pipeline_analyses {
my ($self) = @_;
return [
- { -logic_name => 'start',
-   -module => 'Bio::EnsEMBL::Hive::RunnableDB::LongMult::Start',
+ { -logic_name => 'take_b_apart',
+   -module => 'Bio::EnsEMBL::Hive::RunnableDB::LongMult::DigitFactory',
-meadow_type=> 'LOCAL', # do not bother the farm with such a simple task (and get it done faster)
-parameters => {},
-analysis_capacity => 1, # use per-analysis limiter
-input_ids => [
{ 'a_multiplier' => $self->o('first_mult'), 'b_multiplier' => $self->o('second_mult') },
{ 'a_multiplier' => $self->o('second_mult'), 'b_multiplier' => $self->o('first_mult') },
],
-flow_into => {
# 2 => [ 'part_multiply' ], # will create a fan of jobs
# 1 => [ 'add_together' ], # will create a funnel job to wait for the fan to complete and add the results
- '2->A' => [ 'part_multiply' ], # will create a semaphored fan of jobs (comment out the -wait_for rule from 'add_together')
+ '2->A' => { 'part_multiply' => { 'a_multiplier' => '#a_multiplier#', 'digit' => '#digit#' } }, # will create a semaphored fan of jobs; will use a template to top-up the hashes
'A->1' => [ 'add_together' ], # will create a semaphored funnel job to wait for the fan to complete and add the results
},
},
{ -logic_name => 'part_multiply',
-module => 'Bio::EnsEMBL::Hive::RunnableDB::LongMult::PartMultiply',
-parameters => {},
-analysis_capacity => 4, # use per-analysis limiter
-input_ids => [
# (jobs for this analysis will be flown_into via branch-2 from 'take_b_apart' jobs above)
],
-flow_into => {
1 => [ ':////intermediate_result' ],
},
@@ -171,12 +164,7 @@ sub pipeline_analyses {
{ -logic_name => 'add_together',
-module => 'Bio::EnsEMBL::Hive::RunnableDB::LongMult::AddTogether',
-parameters => {},
# -analysis_capacity => 0, # this is a way to temporarily block a given analysis
-input_ids => [
# (jobs for this analysis will be flown_into via branch-1 from 'take_b_apart' jobs above)
],
# -wait_for => [ 'part_multiply' ], # we can only start adding when all partial products have been computed
-flow_into => {
1 => [ ':////final_result' ],
},
@@ -3,7 +3,7 @@
=head1 NAME
- Bio::EnsEMBL::Hive::RunnableDB::LongMult::Start
+ Bio::EnsEMBL::Hive::RunnableDB::LongMult::DigitFactory
=head1 SYNOPSIS
@@ -12,7 +12,7 @@ to understand how this particular example pipeline is configured and ran.
=head1 DESCRIPTION
- 'LongMult::Start' is the first step of the LongMult example pipeline that multiplies two long numbers.
+ 'LongMult::DigitFactory' is the first step of the LongMult example pipeline that multiplies two long numbers.
It takes apart the second multiplier and creates several 'LongMult::PartMultiply' jobs
that correspond to the different digits of the second multiplier.
@@ -22,7 +22,7 @@ complete and will arrive at the final result.
=cut
- package Bio::EnsEMBL::Hive::RunnableDB::LongMult::Start;
+ package Bio::EnsEMBL::Hive::RunnableDB::LongMult::DigitFactory;
use strict;
@@ -33,9 +33,7 @@ use base ('Bio::EnsEMBL::Hive::Process');
Description : Implements fetch_input() interface method of Bio::EnsEMBL::Hive::Process that is used to read in parameters and load data.
Here the task of fetch_input() is to read in the two multipliers, split the second one into digits and create a set of input_ids that will be used later.
- param('a_multiplier'): The first long number (a string of digits - doesn't have to fit a register).
- param('b_multiplier'): The second long number (also a string of digits).
+ param('b_multiplier'): The second long number (a string of digits - doesn't have to fit a register)
param('take_time'): How much time to spend sleeping (seconds).
@@ -44,7 +42,6 @@ use base ('Bio::EnsEMBL::Hive::Process');
sub fetch_input {
my $self = shift @_;
- my $a_multiplier = $self->param('a_multiplier') || die "'a_multiplier' is an obligatory parameter";
my $b_multiplier = $self->param('b_multiplier') || die "'b_multiplier' is an obligatory parameter";
my %digit_hash = ();
@@ -54,7 +51,7 @@ sub fetch_input {
}
# output_ids of partial multiplications to be computed:
- my @output_ids = map { { 'a_multiplier' => $a_multiplier, 'digit' => $_ } } keys %digit_hash;
+ my @output_ids = map { { 'digit' => $_ } } keys %digit_hash;
# store them for future use:
$self->param('output_ids', \@output_ids);
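The stored output_ids are later flown out of the runnable; a hedged sketch of what the corresponding write_output() step looks like (dataflow_output_id() is the standard Bio::EnsEMBL::Hive::Process method; the branch number matches the '2->A' rule in the PipeConfig above):

```perl
# Sketch: fanning the prepared output_ids out of the runnable.
# Branch #2 creates one 'part_multiply' job per distinct digit;
# with the '2->A'/'A->1' rules the 'add_together' funnel comes from branch #1.
sub write_output {
    my $self = shift @_;

    $self->dataflow_output_id($self->param('output_ids'), 2);
}
```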