Description : Implements fetch_input() interface method of Bio::EnsEMBL::Hive::Process that is used to read in parameters and load data.
...
...
@@ -62,19 +62,20 @@ use base ('Bio::EnsEMBL::Hive::Process');
Description : Implements run() interface method of Bio::EnsEMBL::Hive::Process that is used to perform the main bulk of the job (minus input and output).
param('input_id'): The template that will become the input_id of newly created jobs (Note: this is something entirely different from $self->input_id of the current JobFactory job).
param('column_names'): Controls the column names that come out of the parser: 0 = "no names", 1 = "parse names from data", arrayref = "take names from this array"
param('step'): The requested size of the minibatch (1 by default). The real size may be smaller.
param('delimiter'): If you set it, your lines in file/cmd mode will be split into columns that you can use individually when constructing the template input_id hash.
param('randomize'): Shuffles the ids before creating jobs - can sometimes lead to better overall performance of the pipeline. Doesn't make any sense for minibatches (step>1).
param('input_id'): The template that will become the input_id of newly created jobs (Note: this is something entirely different from $self->input_id of the current JobFactory job).
Since the introduction of param('column_names') its significance has dropped, but it may still come in handy.
param('delimiter'): If you set it, your lines in file/cmd mode will be split into columns that you can use individually when constructing the template input_id hash.
param('randomize'): Shuffles the rows before creating jobs - can sometimes lead to better overall performance of the pipeline. Doesn't make any sense for minibatches (step>1).
param('step'): The requested size of the minibatch (1 by default). The real size of a range may be smaller than the requested size.
param('key_column'): If every line of your input is a list (it happens, for example, when your SQL returns multiple columns or you have set the 'delimiter' in file/cmd mode)
this is the way to say which column is undergoing 'ranging'
param('hashed_column_number'): if defined, turns 'hashed_column_number' into a dir_revhash and appends it to the list of fields.
# The following 4 parameters are mutually exclusive and define the source of ids for the jobs:
...
...
@@ -91,45 +92,56 @@ use base ('Bio::EnsEMBL::Hive::Process');
sub run {
    my $self = shift @_;
    my $template_hash = $self->param('input_id') || die "'input_id' is an obligatory parameter";
    my $step = $self->param('step') || 1;
    my $column_names = $self->param('column_names') || 0; # can be 0 (no names), 1 (names from data) or an arrayref (names from this array)
    my $delimiter = $self->param('delimiter');
    my $randomize = $self->param('randomize') || 0;
    # minibatching-related:
    my $step = $self->param('step') || 0;
    my $key_column = $self->param('key_column') || 0;
    my $delimiter = $self->param('delimiter');
    my $hashed_column_number = $self->param('hashed_column_number'); # skip this step if undefined