Commit 728f039b authored by Leo Gordon

another documentation day

parent 7673dbd2
......@@ -17,7 +17,7 @@
Analysis_1: JobFactory.pm is used to turn the list of files in a given directory into jobs
these jobs are sent down the branch #2 into the second analysis
Analysis_2: SystemCmd.pm is used to run these compression/decompression jobs in parallel.
......@@ -31,8 +31,23 @@ package Bio::EnsEMBL::Hive::PipeConfig::FileZipperUnzipper_conf;
use strict;
use warnings;
use base ('Bio::EnsEMBL::Hive::PipeConfig::HiveGeneric_conf'); # All Hive databases configuration files should inherit from HiveGeneric, directly or indirectly
=head2 default_options
Description : Implements default_options() interface method of Bio::EnsEMBL::Hive::PipeConfig::HiveGeneric_conf that is used to initialize default options.
In addition to the standard things it defines three options:
o('unzip') controls whether the files will be zipped or unzipped (zipped by default)
o('only_files') defines which files in the directory will be (un)zipped
o('zipping_capacity') defines how many files can be (un)zipped in parallel
There are rules dependent on two options that do not have defaults (this makes them mandatory):
o('password') your read-write password for creation and maintenance of the hive database
o('directory') name of the directory where the files are to be (un)zipped
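For example (a hypothetical invocation; placeholder values shown for the two mandatory options):
init_pipeline.pl Bio::EnsEMBL::Hive::PipeConfig::FileZipperUnzipper_conf -password <mypass> -directory /tmp/my_dump_dir -only_files '*.sql*' -zipping_capacity 20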
=cut
sub default_options {
my ($self) = @_;
return {
......@@ -48,11 +63,19 @@ sub default_options {
-dbname => $ENV{USER}.'_'.$self->o('pipeline_name'), # a rule where a previously defined parameter is used (which makes both of them optional)
},
'unzip' => 0, # set to '1' to switch to decompression
'only_files' => '*', # use '*.sql*' to only (un)zip these files
'zipping_capacity' => 10, # how many files can be (un)zipped in parallel
};
}
=head2 pipeline_create_commands
Description : Implements pipeline_create_commands() interface method of Bio::EnsEMBL::Hive::PipeConfig::HiveGeneric_conf that lists the commands that will create and set up the Hive database.
It is just the standard stuff, so we could have as well omitted this method altogether.
=cut
sub pipeline_create_commands {
my ($self) = @_;
return [
......@@ -60,6 +83,18 @@ sub pipeline_create_commands {
];
}
=head2 pipeline_analyses
Description : Implements pipeline_analyses() interface method of Bio::EnsEMBL::Hive::PipeConfig::HiveGeneric_conf that defines the structure of the pipeline: analyses, jobs, rules, etc.
Here it defines two analyses:
* 'get_files' generates a list of files whose names match the pattern o('only_files')
Each job of this analysis will dataflow (create jobs) via branch #2 into 'zipper_unzipper' analysis.
* 'zipper_unzipper' actually performs the (un)zipping of the files in parallel
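A sketch of how the two analyses are wired together (the shape is assumed from the description above; the actual definitions below carry more parameters):

{ -logic_name => 'get_files',
  -module     => 'Bio::EnsEMBL::Hive::RunnableDB::JobFactory',
  -flow_into  => { 2 => [ 'zipper_unzipper' ] },    # branch #2 fans out one job per matching file
},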
=cut
sub pipeline_analyses {
my ($self) = @_;
return [
......@@ -81,7 +116,7 @@ sub pipeline_analyses {
-parameters => {
'cmd' => 'gzip '.($self->o('unzip')?'-d ':'').'#filename#',   # the '#filename#' placeholder is substituted with each job's 'filename' parameter
},
-hive_capacity => $self->o('zipping_capacity'), # allow several workers to perform identical tasks in parallel
-input_ids => [
# (jobs for this analysis will be flown_into via branch-2 from 'get_tables' jobs above)
],
......
## Generic configuration module for all Hive pipelines with loader functionality (all other Hive pipeline config modules should inherit from it)
=pod
=head1 NAME
Bio::EnsEMBL::Hive::PipeConfig::HiveGeneric_conf
=head1 SYNOPSIS
# Example 1: specifying only the mandatory option:
init_pipeline.pl Bio::EnsEMBL::Hive::PipeConfig::HiveGeneric_conf -password <mypass>
# Example 2: specifying the mandatory options as well as overriding some defaults:
init_pipeline.pl Bio::EnsEMBL::Hive::PipeConfig::HiveGeneric_conf -ensembl_cvs_root_dir ~/ensembl_main -pipeline_db -host <myhost> -pipeline_db -dbname <mydbname> -password <mypass>
=head1 DESCRIPTION
Generic configuration module for all Hive pipelines with loader functionality.
All other Hive PipeConfig modules should inherit from this module and will probably need to redefine some or all of the following interface methods:
* default_options: returns a hash of (possibly multilevel) defaults for the options on which the rest of the configuration depends
* pipeline_create_commands: returns a list of strings that will be executed as system commands needed to create and set up the pipeline database
* pipeline_wide_parameters: returns a hash of pipeline-wide parameter names and their values
* resource_classes: returns a hash of resource class definitions
* pipeline_analyses: returns a list of hash structures that define analysis objects bundled with definitions of corresponding jobs, rules and resources
When defining anything except the keys of default_options(), a call to $self->o('myoption') can be used.
This call means "substitute this call for the value of 'myoption' at the time of configuring the pipeline".
All option names mentioned in $self->o() calls within the five interface methods above can be given non-default values from the command line.
Please make sure you have studied the pipeline configuration examples in Bio::EnsEMBL::Hive::PipeConfig before creating your own PipeConfig modules.
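As a minimal illustration, a complete PipeConfig module (a hypothetical one, not shipped with the distribution) could be as small as:

package Bio::EnsEMBL::Hive::PipeConfig::MyTiny_conf;

use strict;
use warnings;
use base ('Bio::EnsEMBL::Hive::PipeConfig::HiveGeneric_conf');

sub pipeline_analyses {
    my ($self) = @_;
    return [
        {   -logic_name => 'do_nothing',
            -module     => 'Bio::EnsEMBL::Hive::RunnableDB::Dummy',     # a trivial Runnable shipped with the Hive
            -input_ids  => [ {} ],                                      # a single job with no parameters
        },
    ];
}

1;

Everything else (including the standard options and the mandatory 'password') is inherited from HiveGeneric_conf.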
=head1 CONTACT
Please contact ehive-users@ebi.ac.uk mailing list with questions/suggestions.
=cut
package Bio::EnsEMBL::Hive::PipeConfig::HiveGeneric_conf;
......@@ -12,6 +53,13 @@ use Bio::EnsEMBL::Hive::Extensions;
# ---------------------------[the following methods will be overridden by specific pipelines]-------------------------
=head2 default_options
Description : Interface method that should return a hash of option_name->default_option_value pairs.
Please see existing PipeConfig modules for examples.
=cut
sub default_options {
my ($self) = @_;
return {
......@@ -29,6 +77,13 @@ sub default_options {
};
}
=head2 pipeline_create_commands
Description : Interface method that should return a list of command lines to be run in order to create and set up the pipeline database.
Please see existing PipeConfig modules for examples.
=cut
sub pipeline_create_commands {
my ($self) = @_;
return [
......@@ -40,6 +95,14 @@ sub pipeline_create_commands {
];
}
=head2 pipeline_wide_parameters
Description : Interface method that should return a hash of pipeline_wide_parameter_name->pipeline_wide_parameter_value pairs.
The value doesn't have to be a scalar; it can be any Perl structure (it will be stringified and de-stringified automagically).
Please see existing PipeConfig modules for examples.
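For example (a hypothetical parameter):

'release_info' => { 'release' => 58, 'species' => [ 'human', 'mouse' ] },   # stored as a string, restored as the same structure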
=cut
sub pipeline_wide_parameters {
my ($self) = @_;
return {
......@@ -47,6 +110,13 @@ sub pipeline_wide_parameters {
};
}
=head2 resource_classes
Description : Interface method that should return a hash of resource_description_id->resource_description_hash.
Please see existing PipeConfig modules for examples.
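For example (a hypothetical pair of classes for an LSF-based farm):

0 => { -desc => 'default', 'LSF' => '' },
1 => { -desc => 'urgent',  'LSF' => '-q urgent_queue' },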
=cut
sub resource_classes {
my ($self) = @_;
return {
......@@ -55,6 +125,13 @@ sub resource_classes {
};
}
=head2 pipeline_analyses
Description : Interface method that should return a list of hashes that define analyses bundled with the corresponding jobs, dataflow and analysis_ctrl rules and resource_id.
Please see existing PipeConfig modules for examples.
=cut
sub pipeline_analyses {
my ($self) = @_;
return [
......@@ -66,6 +143,13 @@ sub pipeline_analyses {
my $undef_const = '-=[UnDeFiNeD_VaLuE]=-'; # we don't use undef, as it cannot be detected as a part of a string
=head2 new
Description : Just a trivial constructor for this type of object.
Caller : init_pipeline.pl or any other script that will drive this module.
=cut
sub new {
my ($class) = @_;
......@@ -74,6 +158,13 @@ sub new {
return $self;
}
=head2 o
Description : This is the method you call in the interface methods when you need to substitute an option: $self->o('password') .
To reach down several levels of a multilevel option (such as $self->o('pipeline_db') ) just list the keys down the desired path: $self->o('pipeline_db', '-user') .
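For example, using the multilevel 'pipeline_db' option that default_options() defines:

$self->o('pipeline_db', '-user')    # returns the value stored under the '-user' key of the 'pipeline_db' hash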
=cut
sub o { # descends the option hash structure (vivifying all encountered nodes) and returns the value if found
my $self = shift @_;
......@@ -82,7 +173,7 @@ sub o { # descends the option hash structure (vivifying all enco
while(defined(my $option_syll = shift @_)) {
if(exists($value->{$option_syll})
and ((ref($value->{$option_syll}) eq 'HASH') or _completely_defined($value->{$option_syll}))
) {
$value = $value->{$option_syll}; # just descend one level
} elsif(@_) {
......@@ -94,6 +185,12 @@ sub o { # descends the option hash structure (vivifying all enco
return $value;
}
=head2 dbconn_2_mysql
Description : A convenience method used to stringify a connection-parameters hash into a parameter string that both mysql and beekeeper.pl can understand
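With hypothetical connection values, $self->dbconn_2_mysql('pipeline_db', 1) would produce something like:

--host=myhost --port=3306 --user=ensadmin --pass=mypass --database=my_pipeline_db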
=cut
sub dbconn_2_mysql { # will save you a lot of typing
my ($self, $db_conn, $with_db) = @_;
......@@ -104,12 +201,31 @@ sub dbconn_2_mysql { # will save you a lot of typing
.($with_db ? ('--database='.$self->o($db_conn,'-dbname').' ') : '');
}
=head2 dbconn_2_url
Description : A convenience method used to stringify a connection-parameters hash into a 'url' that beekeeper.pl will understand
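With the same hypothetical values, $self->dbconn_2_url('pipeline_db') would produce:

mysql://ensadmin:mypass@myhost:3306/my_pipeline_db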
=cut
sub dbconn_2_url {
my ($self, $db_conn) = @_;
return 'mysql://'.$self->o($db_conn,'-user').':'.$self->o($db_conn,'-pass').'@'.$self->o($db_conn,'-host').':'.$self->o($db_conn,'-port').'/'.$self->o($db_conn,'-dbname');
}
=head2 process_options
Description : The method that does all the parameter parsing magic.
It makes two passes through the interface methods: the first pass collects the options, the second performs the intelligent substitution.
Caller : init_pipeline.pl or any other script that will drive this module.
Note : You can override parsing the command line bit by providing a hash as the argument to this method.
This hash should contain definitions of all the parameters you would otherwise be providing from the command line.
Useful if you are creating batches of hive pipelines using a script.
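For example (hypothetical values):

$pipe_config->process_options( { 'password' => 'mypass', 'pipeline_name' => 'my_test_run' } );   # parsing of the command line is skipped entirely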
=cut
sub process_options {
my $self = shift @_;
......@@ -117,20 +233,21 @@ sub process_options {
$self->default_options();
$self->pipeline_create_commands();
$self->pipeline_wide_parameters();
$self->resource_classes();
$self->pipeline_analyses();
# you can override parsing of commandline options if creating pipelines by a script - just provide the overriding hash
my $cmdline_options = $self->{_cmdline_options} = shift @_ || $self->_load_cmdline_options();
print "\nPipeline: ".ref($self)."\n";
if($cmdline_options->{'help'}) {
my $all_needed_options = $self->_hash_undefs();
$self->_saturated_merge_defaults_into_options();
my $mandatory_options = $self->_hash_undefs();
print "Available options:\n\n";
foreach my $key (sort keys %$all_needed_options) {
......@@ -141,11 +258,11 @@ sub process_options {
} else {
$self->_merge_into_options($cmdline_options);
$self->_saturated_merge_defaults_into_options();
my $undefined_options = $self->_hash_undefs();
if(scalar(keys(%$undefined_options))) {
print "Undefined options:\n\n";
......@@ -158,6 +275,14 @@ sub process_options {
# by this point we have either exited or options are good
}
=head2 run
Description : The method that uses the Hive/EnsEMBL API to actually create all the analyses, jobs, dataflow and control rules and resource descriptions.
Caller : init_pipeline.pl or any other script that will drive this module.
=cut
sub run {
my $self = shift @_;
my $topup_flag = $self->{_cmdline_options}{topup};
......@@ -303,11 +428,23 @@ sub run {
# -------------------------------[the rest are dirty implementation details]-------------------------------------
=head2 _completely_defined
Description : a private function (not a method) that checks whether a given string is free of undefined-option placeholders
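For example, given the $undef_const placeholder defined above:

_completely_defined('mysql://user:mypass@host/dbname')                    # true: nothing left to substitute
_completely_defined('mysql://user:-=[UnDeFiNeD_VaLuE]=-@host/dbname')     # false: the placeholder is still embedded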
=cut
sub _completely_defined {
return (index(shift @_, $undef_const) == ($[-1) ); # i.e. $undef_const is not a substring ($[ is Perl's array-base variable, normally 0, so the comparison is against -1)
}
=head2 _load_cmdline_options
Description : a private method that deals with parsing the command line (currently it drives GetOptions, which has some limitations)
=cut
sub _load_cmdline_options {
my $self = shift @_;
my %cmdline_options = ();
......@@ -320,7 +457,13 @@ sub load_cmdline_options {
return \%cmdline_options;
}
=head2 _merge_into_options
Description : a private method to merge one options-containing structure into another
=cut
sub _merge_into_options {
my $self = shift @_;
my $hash_from = shift @_;
my $hash_to = shift @_ || $self->o;
......@@ -331,12 +474,12 @@ sub merge_into_options {
if(exists($hash_to->{$key})) { # simply ignore the unused options
if(ref($value) eq 'HASH') {
if(ref($hash_to->{$key}) eq 'HASH') {
$subst_counter += $self->_merge_into_options($hash_from->{$key}, $hash_to->{$key});
} else {
$hash_to->{$key} = { %$value };
$subst_counter += scalar(keys %$value);
}
} elsif(_completely_defined($value) and !_completely_defined($hash_to->{$key})) {
$hash_to->{$key} = $value;
$subst_counter++;
}
......@@ -345,14 +488,28 @@ sub merge_into_options {
return $subst_counter;
}
=head2 _saturated_merge_defaults_into_options
Description : a private method to merge defaults into options as many times as required to resolve the dependencies
(for example, a 'pipeline_db' entry whose '-dbname' is built from $self->o('pipeline_name') needs one extra pass after 'pipeline_name' itself has been resolved).
Use with caution, as it doesn't check for loops!
=cut
sub _saturated_merge_defaults_into_options {
my $self = shift @_;
# Note: every time the $self->default_options() has to be called afresh, do not cache!
while(my $res = $self->_merge_into_options($self->default_options)) { }
}
=head2 _hash_undefs
Description : a private method that collects all the options that are undefined at the moment
(used at different stages to find 'all_options', 'mandatory_options' and 'undefined_options').
=cut
sub _hash_undefs {
my $self = shift @_;
my $hash_to = shift @_ || {};
my $hash_from = shift @_ || $self->o;
......@@ -362,8 +519,8 @@ sub hash_undefs {
my $new_prefix = $prefix ? $prefix.' -> '.$key : $key;
if(ref($value) eq 'HASH') { # go deeper
$self->_hash_undefs($hash_to, $value, $new_prefix);
} elsif(!_completely_defined($value)) {
$hash_to->{$new_prefix} = 1;
}
}
......
## Configuration file for the long multiplication pipeline example
=pod
=head1 NAME
Bio::EnsEMBL::Hive::PipeConfig::LongMult_conf
=head1 SYNOPSIS
# Example 1: specifying only the mandatory option:
init_pipeline.pl Bio::EnsEMBL::Hive::PipeConfig::LongMult_conf -password <mypass>
# Example 2: specifying the mandatory options as well as overriding some defaults:
init_pipeline.pl Bio::EnsEMBL::Hive::PipeConfig::LongMult_conf -password <mypass> -first_mult 2344556 -second_mult 777666555
=head1 DESCRIPTION
This is the PipeConfig file for the long multiplication pipeline example.
The main point of this pipeline is to provide an example of how to write Hive Runnables and link them together into a pipeline.
Please refer to Bio::EnsEMBL::Hive::PipeConfig::HiveGeneric_conf module to understand the interface implemented here.
The setting. Let's assume we are given two loooooong numbers to multiply. Reeeeally long.
So long that they do not fit into registers of the CPU and should be multiplied digit-by-digit.
For the purposes of this example we also assume this task is very computationally intensive and has to be done in parallel.
The long multiplication pipeline consists of three "analyses" (types of tasks): 'start', 'part_multiply' and 'add_together'
that we will be using to exemplify various features of the Hive.
* A 'start' job takes in two string parameters, 'a_multiplier' and 'b_multiplier',
takes the second one apart into digits, finds what _different_ digits are there,
creates several jobs of the 'part_multiply' analysis and one job of 'add_together' analysis.
* A 'part_multiply' job takes in 'a_multiplier' and 'digit', multiplies them and records the result in the 'intermediate_result' table.
* An 'add_together' job waits for the first two analyses to complete,
takes in 'a_multiplier', 'b_multiplier' and the 'intermediate_result' table and produces the final result in the 'final_result' table.
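A worked example: multiplying a_multiplier=123 by b_multiplier=212 creates 'part_multiply' jobs only for the distinct digits 2 and 1
(yielding 123*2=246 and 123*1=123), and the 'add_together' job then combines the partial results positionally:
246*100 + 123*10 + 246*1 = 26076.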
Please see the implementation details in the Runnable modules themselves.
=head1 CONTACT
Please contact ehive-users@ebi.ac.uk mailing list with questions/suggestions.
=cut
package Bio::EnsEMBL::Hive::PipeConfig::LongMult_conf;
use strict;
use warnings;
use base ('Bio::EnsEMBL::Hive::PipeConfig::HiveGeneric_conf'); # All Hive databases configuration files should inherit from HiveGeneric, directly or indirectly
=head2 default_options
Description : Implements default_options() interface method of Bio::EnsEMBL::Hive::PipeConfig::HiveGeneric_conf that is used to initialize default options.
In addition to the standard things it defines two options, 'first_mult' and 'second_mult' that are supposed to contain the long numbers to be multiplied.
=cut
sub default_options {
my ($self) = @_;
return {
......@@ -26,6 +82,13 @@ sub default_options {
};
}
=head2 pipeline_create_commands
Description : Implements pipeline_create_commands() interface method of Bio::EnsEMBL::Hive::PipeConfig::HiveGeneric_conf that lists the commands that will create and set up the Hive database.
In addition to the standard creation of the database and populating it with Hive tables and procedures it also creates two pipeline-specific tables used by Runnables to communicate.
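For example, such an extra entry takes roughly this shape (a sketch; see the module itself for the actual table schema):

'mysql '.$self->dbconn_2_mysql('pipeline_db', 1)." -e 'CREATE TABLE intermediate_result (a_multiplier char(40) NOT NULL, digit tinyint NOT NULL, result char(41) NOT NULL, PRIMARY KEY (a_multiplier, digit))'",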
=cut
sub pipeline_create_commands {
my ($self) = @_;
return [
......@@ -37,6 +100,21 @@ sub pipeline_create_commands {
];
}
=head2 pipeline_analyses
Description : Implements pipeline_analyses() interface method of Bio::EnsEMBL::Hive::PipeConfig::HiveGeneric_conf that defines the structure of the pipeline: analyses, jobs, rules, etc.
Here it defines three analyses:
* 'start' with two jobs (multiply 'first_mult' by 'second_mult' and vice versa - to check the commutativity of multiplication).
Each job will dataflow (create more jobs) via branch #2 into 'part_multiply' and via branch #1 into 'add_together'.
* 'part_multiply' initially without jobs (they will flow from 'start')
* 'add_together' initially without jobs (they will flow from 'start').
All 'add_together' jobs will wait for completion of *all* 'part_multiply' jobs before their own execution (to ensure all data is available).
=cut
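# A sketch of the 'start' analysis with its two seeding jobs (the shape is assumed from the
# description above; the actual definition carries the full parameter set):
#
#    {   -logic_name => 'start',
#        -module     => 'Bio::EnsEMBL::Hive::RunnableDB::LongMult::Start',
#        -input_ids  => [
#            { 'a_multiplier' => $self->o('first_mult'),  'b_multiplier' => $self->o('second_mult') },
#            { 'a_multiplier' => $self->o('second_mult'), 'b_multiplier' => $self->o('first_mult')  },
#        ],
#        -flow_into  => {
#            2 => [ 'part_multiply' ],    # fan out the partial multiplications
#            1 => [ 'add_together'  ],    # and create the final addition job
#        },
#    },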
sub pipeline_analyses {
my ($self) = @_;
return [
......
## Configuration file for the *semaphored* long multiplication pipeline example
=pod
=head1 NAME
Bio::EnsEMBL::Hive::PipeConfig::SemaLongMult_conf
=head1 SYNOPSIS
# Example 1: specifying only the mandatory option:
init_pipeline.pl Bio::EnsEMBL::Hive::PipeConfig::SemaLongMult_conf -password <mypass>
# Example 2: specifying the mandatory options as well as overriding some defaults:
init_pipeline.pl Bio::EnsEMBL::Hive::PipeConfig::SemaLongMult_conf -password <mypass> -first_mult 2344556 -second_mult 777666555
=head1 DESCRIPTION
This is the PipeConfig file for the *semaphored* long multiplication pipeline example.
The main point of this pipeline is to provide an example of how to set up job-level semaphored control instead of using analysis-level control rules.
Please refer to Bio::EnsEMBL::Hive::PipeConfig::LongMult_conf to understand how long multiplication pipeline works in its original form.
=head1 CONTACT
Please contact ehive-users@ebi.ac.uk mailing list with questions/suggestions.
=cut
package Bio::EnsEMBL::Hive::PipeConfig::SemaLongMult_conf;
use strict;
use warnings;
use base ('Bio::EnsEMBL::Hive::PipeConfig::HiveGeneric_conf'); # All Hive databases configuration files should inherit from HiveGeneric, directly or indirectly
=head2 default_options
Description : Implements default_options() interface method of Bio::EnsEMBL::Hive::PipeConfig::HiveGeneric_conf that is used to initialize default options.
In addition to the standard things it defines two options, 'first_mult' and 'second_mult' that are supposed to contain the long numbers to be multiplied.
=cut
sub default_options {
my ($self) = @_;
return {
......@@ -26,6 +61,13 @@ sub default_options {
};
}
=head2 pipeline_create_commands
Description : Implements pipeline_create_commands() interface method of Bio::EnsEMBL::Hive::PipeConfig::HiveGeneric_conf that lists the commands that will create and set up the Hive database.
In addition to the standard creation of the database and populating it with Hive tables and procedures it also creates two pipeline-specific tables used by Runnables to communicate.
=cut
sub pipeline_create_commands {
my ($self) = @_;
return [
......@@ -37,6 +79,24 @@ sub pipeline_create_commands {
];
}
=head2 pipeline_analyses
Description : Implements pipeline_analyses() interface method of Bio::EnsEMBL::Hive::PipeConfig::HiveGeneric_conf that defines the structure of the pipeline: analyses, jobs, rules, etc.
Here it defines three analyses:
* 'sema_start' with two jobs (multiply 'first_mult' by 'second_mult' and vice versa - to check the commutativity of multiplication).
Each job will dataflow (create more jobs) via branch #2 into 'part_multiply' and via branch #1 into 'add_together'.
Unlike LongMult_conf, there is no analysis-level control here, but SemaStart analysis itself is more intelligent
in that it can dataflow a group of partial multiplication jobs in branch #2 linked with one job in branch #1 by a semaphore.
* 'part_multiply' initially without jobs (they will flow from 'sema_start')
* 'add_together' initially without jobs (they will flow from 'sema_start').
Unlike LongMult_conf here we do not use analysis control and rely on job-level semaphores to keep the jobs in sync.
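Conceptually (a sketch, not code):

sema_start --branch #2--> part_multiply, part_multiply, ...   (a fan of jobs)
sema_start --branch #1--> add_together                        (one funnel job; its semaphore keeps it blocked until every fan job created alongside it has completed)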
=cut
sub pipeline_analyses {