Commit e45d4761 authored by Jessica Severin

complete switch over to new DataflowRule design.  Dataflow rules use
URLs to specify analysis objects in MySQL databases distributed
across a network.  AnalysisJobAdaptor was switched to create jobs with
a class method that gets the db connection from the analysis object that
is passed in.  Thus the system now operates in a distributed state.
The dataflow rule also implements branching via the branch_code.
SimpleRule will be deprecated.
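
As a rough sketch of how the new pieces fit together (the URL string and the
$from_analysis, $to_analysis, $db_adaptor and $input_id variables are
placeholders; the exact to_analysis_url syntax is not shown in this commit):

  use Bio::EnsEMBL::Hive::DataflowRule;
  use Bio::EnsEMBL::Hive::DBSQL::AnalysisJobAdaptor;

  # wire one analysis to another; the target is addressed by URL and may
  # live in a different mysql database somewhere on the network
  my $rule = Bio::EnsEMBL::Hive::DataflowRule->new;
  $rule->from_analysis_id($from_analysis->dbID);
  $rule->to_analysis_url('mysql://user@otherhost/other_hive?analysis=chunk_blast');  # hypothetical URL
  $rule->branch_code(1);   # different branch_code values implement branching
  $db_adaptor->get_DataflowRuleAdaptor->store($rule);

  # jobs are created via a class method; the db connection comes from the
  # analysis object itself, so the job lands in whichever database holds it
  my $job_id = Bio::EnsEMBL::Hive::DBSQL::AnalysisJobAdaptor->CreateNewJob(
      -input_id => $input_id,
      -analysis => $to_analysis,
  );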
parent 3d318052
@@ -6,55 +6,25 @@
=pod
=head1 NAME
Bio::EnsEMBL::Hive::AnalysisJob
=cut
=head1 SYNOPSIS
Object which encapsulates the details of how to find jobs, how to run those
jobs, and how to check the rules to create the next jobs in the chain.
Essentially it knows where to find data, how to process data, and where to
put it when it's done (placed in the next person's INBOX) so the next Worker
in the chain can find data to work on.
Hive based processing is a concept based on a more controlled version
of an autonomous agent type system.  Each worker is not told what to do
(as in a centralized control system like the current pipeline system)
but rather queries a central database for jobs (give me jobs).
Each worker is linked to an analysis_id, registers itself on creation
into the Hive, creates a RunnableDB instance of the Analysis->module,
gets $runnable->batch_size() jobs from the analysis_job table, does its
work, and creates the next layer of analysis_job entries by querying the
simple_rule table where condition_analysis_id = $self->analysis_id.  It repeats
this cycle until it has lived its lifetime or until there are no more jobs left.
The lifetime limit is just a safety limit to prevent these from 'infecting'
a system.

The Queen's job is simply to birth Workers of the correct analysis_id to get the
work done.  The only other thing the Queen does is free up jobs that were
claimed by Workers that died unexpectedly so that other workers can take
over the work.
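
Very roughly, the cycle above as a sketch (only claim_jobs_for_worker,
fetch_by_claim_analysis and update_status are real AnalysisJobAdaptor methods;
every other name here is an assumption for illustration):

  while($jobs_done < $worker_lifetime) {
    $job_adaptor->claim_jobs_for_worker($worker);
    my $jobs = $job_adaptor->fetch_by_claim_analysis($claim_uuid, $analysis->dbID);
    last unless(scalar @$jobs);

    foreach my $job (@$jobs) {
      $runnable_db->input_id($job->input_id);   # assumed RunnableDB interface
      $runnable_db->run;
      $job->status('DONE');                     # assumed status setter
      $job_adaptor->update_status($job);
      $jobs_done++;
    }
  }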
=cut
=head1 DESCRIPTION
An AnalysisJob is the link between the input_id control data, the analysis and
the rule system.  It also tracks the state of the job as it is processed.
=cut
=head1 CONTACT
Jessica Severin, jessica@ebi.ac.uk
Contact Jessica Severin on EnsEMBL::Hive implementation/design detail: jessica@ebi.ac.uk
Contact Ewan Birney on EnsEMBL in general: birney@sanger.ac.uk
=cut
=head1 APPENDIX
The rest of the documentation details each of the object methods.
Internal methods are usually preceded with a _
=cut
package Bio::EnsEMBL::Hive::AnalysisJob;
......
@@ -6,55 +6,19 @@
=pod
=head1 NAME
Bio::EnsEMBL::Hive::AnalysisStats
=cut
=head1 SYNOPSIS
Object which encapsulates the overall statistics on an analysis and
all the jobs associated with it in the hive. Used as a cache of the
stats at a given moment in time (last_update_time). The Queen is
responsible for monitoring the Hive and updating most stats. Certain
status states (ALL_CLAIMED) and batch_size are updated by the workers.
Hive based processing is a concept based on a more controlled version
of an autonomous agent type system.  Each worker is not told what to do
(as in a centralized control system like the current pipeline system)
but rather queries a central database for jobs (give me jobs).
Each worker is linked to an analysis_id, registers itself on creation
into the Hive, creates a RunnableDB instance of the Analysis->module,
gets $runnable->batch_size() jobs from the analysis_job table, does its
work, and creates the next layer of analysis_job entries by querying the
simple_rule table where condition_analysis_id = $self->analysis_id.  It repeats
this cycle until it has lived its lifetime or until there are no more jobs left.
The lifetime limit is just a safety limit to prevent these from 'infecting'
a system.

The Queen's job is simply to birth Workers of the correct analysis_id to get the
work done.  The only other thing the Queen does is free up jobs that were
claimed by Workers that died unexpectedly so that other workers can take
over the work.
=cut
=head1 DESCRIPTION
=cut
=head1 CONTACT
Jessica Severin, jessica@ebi.ac.uk
=cut
Contact Jessica Severin on EnsEMBL::Hive implementation/design detail: jessica@ebi.ac.uk
Contact Ewan Birney on EnsEMBL in general: birney@sanger.ac.uk
=head1 APPENDIX
The rest of the documentation details each of the object methods.
Internal methods are usually preceded with a _
=cut
package Bio::EnsEMBL::Hive::AnalysisStats;
......
@@ -46,11 +46,53 @@ use Bio::EnsEMBL::DBSQL::BaseAdaptor;
use Sys::Hostname;
use Data::UUID;
use Bio::EnsEMBL::Utils::Argument qw(rearrange);
use Bio::EnsEMBL::Utils::Exception qw(throw warning);
our @ISA = qw(Bio::EnsEMBL::DBSQL::BaseAdaptor);
###############################################################################
#
# CLASS methods
#
###############################################################################
sub CreateNewJob {
  my ($class, @args) = @_;

  return undef unless(scalar @args);

  my ($input_id, $analysis, $input_analysis_job_id, $blocked) =
      rearrange([qw(INPUT_ID ANALYSIS input_job_id BLOCK)], @args);

  $input_analysis_job_id = 0 unless($input_analysis_job_id);
  throw("must define input_id") unless($input_id);
  throw("must define analysis") unless($analysis);
  throw("analysis must be [Bio::EnsEMBL::Analysis] not a [$analysis]")
    unless($analysis->isa('Bio::EnsEMBL::Analysis'));

  # INSERT ignore: a duplicate insert is skipped rather than raising an error
  my $sql = "INSERT ignore into analysis_job ".
            " SET input_id=\"$input_id\" ".
            " ,input_analysis_job_id='$input_analysis_job_id' ".
            " ,analysis_id='".$analysis->dbID ."' ";
  $sql .= " ,status='BLOCKED', job_claim='BLOCKED'" if($blocked);

  # class method: the database connection is taken from the analysis object,
  # so the job is created in whichever database that analysis lives in
  my $dbc = $analysis->adaptor->db;
  my $sth = $dbc->prepare($sql);
  $sth->execute();
  my $dbID = $sth->{'mysql_insertid'};
  $sth->finish;

  return $dbID;
}
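
# Usage sketch (assumes $analysis was fetched from an adaptor attached to the
# target hive database, so that $analysis->adaptor->db is defined; the other
# variables are placeholders):
#
#   my $job_id = Bio::EnsEMBL::Hive::DBSQL::AnalysisJobAdaptor->CreateNewJob(
#       -input_id     => $input_id,
#       -analysis     => $analysis,
#       -input_job_id => $creating_job_id,  # optional, defaults to 0
#       -block        => 1,                 # optional: create the job pre-BLOCKED
#   );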
###############################################################################
#
# INSTANCE methods
#
###############################################################################
=head2 fetch_by_dbID
Arg [1] : int $id
the unique database identifier for the feature to be obtained
Example : $feat = $adaptor->fetch_by_dbID(1234);
@@ -59,9 +101,7 @@ our @ISA = qw(Bio::EnsEMBL::DBSQL::BaseAdaptor);
Returntype : Bio::EnsEMBL::Hive::AnalysisJob
Exceptions : thrown if $id is not defined
Caller : general
=cut
sub fetch_by_dbID {
my ($self,$id) = @_;
@@ -81,6 +121,7 @@ sub fetch_by_dbID {
return $obj;
}
=head2 fetch_by_claim_analysis
Arg [1] : string job_claim (the UUID used to claim jobs)
Arg [2] : int analysis_id
@@ -90,7 +131,6 @@ sub fetch_by_dbID {
Exceptions : thrown if claim_id or analysis_id is not defined
Caller : general
=cut
sub fetch_by_claim_analysis {
my ($self,$claim,$analysis_id) = @_;
@@ -102,16 +142,13 @@ sub fetch_by_claim_analysis {
=head2 fetch_all
Arg : None
Example :
Description:
Returntype :
Exceptions :
Caller :
=cut
sub fetch_all {
my $self = shift;
@@ -126,7 +163,6 @@ sub fetch_all {
###################
=head2 _generic_fetch
Arg [1] : (optional) string $constraint
An SQL query constraint (i.e. part of the WHERE clause)
Arg [2] : (optional) listref $join for additional join tables, conditions and columns
@@ -137,9 +173,7 @@ sub fetch_all {
Returntype : listref of Bio::EnsEMBL::Hive::AnalysisJob objects
Exceptions : none
Caller : BaseFeatureAdaptor, ProxyDnaAlignFeatureAdaptor::_generic_fetch
=cut
sub _generic_fetch {
my ($self, $constraint, $join) = @_;
@@ -193,12 +227,14 @@ sub _generic_fetch {
return $self->_objs_from_sth($sth);
}
sub _tables {
my $self = shift;
return (['analysis_job', 'a']);
}
sub _columns {
my $self = shift;
@@ -215,6 +251,7 @@ sub _columns {
);
}
sub _objs_from_sth {
my ($self, $sth) = @_;
@@ -244,11 +281,13 @@ sub _objs_from_sth {
return \@jobs
}
sub _default_where_clause {
my $self = shift;
return '';
}
sub _final_clause {
my $self = shift;
return '';
@@ -261,16 +300,13 @@ sub _final_clause {
################
=head2 update_status
Arg [1] : Bio::EnsEMBL::Hive::AnalysisJob $job
Example :
Description:
Returntype : Bio::EnsEMBL::Hive::Worker
Exceptions :
Caller :
=cut
sub update_status {
my ($self,$job) = @_;
@@ -310,33 +346,6 @@ sub store_out_files {
}
sub create_new_job {
  my ($self, @args) = @_;

  return undef unless(scalar @args);

  my ($input_id, $analysis_id, $input_analysis_job_id, $blocked) =
      $self->_rearrange([qw(INPUT_ID ANALYSIS_ID input_job_id BLOCK)], @args);

  $input_analysis_job_id = 0 unless($input_analysis_job_id);
  throw("must define input_id") unless($input_id);
  throw("must define analysis_id") unless($analysis_id);

  my $sql = "INSERT ignore into analysis_job ".
            " SET input_id=\"$input_id\" ".
            " ,input_analysis_job_id='$input_analysis_job_id' ".
            " ,analysis_id='$analysis_id' ";
  $sql .= " ,status='BLOCKED', job_claim='BLOCKED'" if($blocked);

  my $sth = $self->prepare($sql);
  $sth->execute();
  my $dbID = $sth->{'mysql_insertid'};
  $sth->finish;

  return $dbID;
}
sub claim_jobs_for_worker {
my $self = shift;
my $worker = shift;
......
# Perl module for Bio::EnsEMBL::Hive::DBSQL::DataflowRuleAdaptor
#
# Date of creation: 22.03.2004
# Original Creator : Jessica Severin <jessica@ebi.ac.uk>
#
# Copyright EMBL-EBI 2004
#
# You may distribute this module under the same terms as perl itself
# POD documentation - main docs before the code
=head1 NAME
Bio::EnsEMBL::Hive::DBSQL::DataflowRuleAdaptor
=head1 SYNOPSIS
$dataflowRuleAdaptor = $db_adaptor->get_DataflowRuleAdaptor;
$dataflowRuleAdaptor = $dataflowRuleObj->adaptor;
=head1 DESCRIPTION
Module to encapsulate all db access for persistent class DataflowRule.
There should be just one per application and database connection.
=head1 CONTACT
Contact Jessica Severin on implementation/design detail: jessica@ebi.ac.uk
Contact Ewan Birney on EnsEMBL in general: birney@sanger.ac.uk
=head1 APPENDIX
The rest of the documentation details each of the object methods.
Internal methods are usually preceded with a _
=cut
# Let the code begin...
package Bio::EnsEMBL::Hive::DBSQL::DataflowRuleAdaptor;
use strict;
use Carp;
use Bio::EnsEMBL::DBSQL::BaseAdaptor;
use Bio::EnsEMBL::Hive::DataflowRule;
our @ISA = qw(Bio::EnsEMBL::DBSQL::BaseAdaptor);
=head2 fetch_from_analysis_job
Args : Bio::EnsEMBL::Hive::AnalysisJob
Example : my @rules = @{$ruleAdaptor->fetch_from_analysis_job($job)};
Description: searches database for rules whose 'from' analysis and branch_code
match the given job; returns all such rules in a list (by reference)
Returntype : reference to list of Bio::EnsEMBL::Hive::DataflowRule objects
Exceptions : none
Caller : ?
=cut
sub fetch_from_analysis_job
{
  my $self = shift;
  my $fromAnalysisJob = shift;

  $self->throw("arg is required\n") unless($fromAnalysisJob);
  $self->throw("arg must be a [Bio::EnsEMBL::Hive::AnalysisJob] not a $fromAnalysisJob")
    unless ($fromAnalysisJob->isa('Bio::EnsEMBL::Hive::AnalysisJob'));

  my $constraint = "r.from_analysis_id = '".$fromAnalysisJob->analysis_id."'"
                  ." AND r.branch_code=". $fromAnalysisJob->branch_code;

  return $self->_generic_fetch($constraint);
}
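
# Example of the dataflow step after a job finishes (a sketch only:
# resolve_analysis_url() is a hypothetical helper, since turning a
# to_analysis_url back into an Analysis object is not shown in this file):
#
#   my $rules = $dataflowRuleAdaptor->fetch_from_analysis_job($job);
#   foreach my $rule (@{$rules}) {
#     my $to_analysis = resolve_analysis_url($rule->to_analysis_url);
#     Bio::EnsEMBL::Hive::DBSQL::AnalysisJobAdaptor->CreateNewJob(
#         -input_id     => $job->input_id,
#         -analysis     => $to_analysis,
#         -input_job_id => $job->dbID,
#     );
#   }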
=head2 store
Title : store
Usage : $self->store( $rule );
Function: Stores a rule in db
Sets adaptor and dbID in DataflowRule
Returns : -
Args : Bio::EnsEMBL::Hive::DataflowRule
=cut
sub store {
  my ( $self, $rule ) = @_;

  #print("\nDataflowRuleAdaptor->store()\n");
  my $dataflow_rule_id;

  my $sth = $self->prepare( q{INSERT ignore INTO dataflow_rule
       SET from_analysis_id = ?, to_analysis_url = ? } );

  if($sth->execute($rule->from_analysis_id, $rule->to_analysis_url)) {
    $dataflow_rule_id = $sth->{'mysql_insertid'};
    $sth->finish();
    $rule->dbID($dataflow_rule_id);
    #print("  stored with dbID = $dataflow_rule_id\n");
  } else {
    #print("  failed to execute -> already inserted -> need to get dbID\n");
    $sth->finish();

    $sth = $self->prepare(q{SELECT dataflow_rule_id FROM dataflow_rule WHERE
         from_analysis_id = ? AND to_analysis_url = ? } );
    $sth->execute($rule->from_analysis_id, $rule->to_analysis_url);
    $sth->bind_columns(\$dataflow_rule_id);
    if($sth->fetch()) {
      $rule->dbID($dataflow_rule_id);
    }
    $sth->finish;
  }

  #print("  dataflow_rule_id = '".$rule->dbID."'\n");
  $rule->adaptor( $self );
}
=head2 remove
Title : remove
Usage : $self->remove( $rule );
Function: removes given object from database.
Returns : -
Args : Bio::EnsEMBL::Hive::DataflowRule which must be persistent.
( dbID set )
=cut
sub remove {
  my ( $self, $rule ) = @_;

  my $dbID = $rule->dbID;
  if( !defined $dbID ) {
    $self->throw( "DataflowRuleAdaptor->remove called with non persistent DataflowRule" );
  }

  my $sth = $self->prepare("DELETE FROM dataflow_rule WHERE dataflow_rule_id = $dbID");
  $sth->execute;
}
############################
#
# INTERNAL METHODS
# (pseudo subclass methods)
#
############################
#internal method used in multiple calls above to build objects from table data
sub _tables {
  my $self = shift;

  return (['dataflow_rule', 'r']);
}


sub _columns {
  my $self = shift;

  return qw (r.dataflow_rule_id
             r.from_analysis_id
             r.to_analysis_url
             r.branch_code
            );
}
sub _objs_from_sth {
  my ($self, $sth) = @_;
  my @rules = ();

  my ($dataflow_rule_id, $from_analysis_id, $to_analysis_url, $branch_code);
  $sth->bind_columns(\$dataflow_rule_id, \$from_analysis_id, \$to_analysis_url, \$branch_code);

  while ($sth->fetch()) {
    my $rule = Bio::EnsEMBL::Hive::DataflowRule->new;
    $rule->adaptor($self);
    $rule->dbID($dataflow_rule_id);
    $rule->from_analysis_id($from_analysis_id);
    $rule->to_analysis_url($to_analysis_url);
    $rule->branch_code($branch_code);
    push @rules, $rule;
  }
  return \@rules;
}
sub _default_where_clause {
  my $self = shift;
  return '';
}


sub _final_clause {
  my $self = shift;
  return '';
}
###############################################################################
#
# General access methods that could be moved
# into a superclass
#
###############################################################################
=head2 fetch_by_dbID
Arg [1] : int $id
the unique database identifier for the feature to be obtained
Example : $feat = $adaptor->fetch_by_dbID(1234);
Description: Returns the DataflowRule created from the database defined by
the id $id.
Returntype : Bio::EnsEMBL::Hive::DataflowRule
Exceptions : thrown if $id is not defined
Caller : general
=cut
sub fetch_by_dbID{
  my ($self,$id) = @_;

  unless(defined $id) {
    $self->throw("fetch_by_dbID must have an id");
  }

  my @tabs = $self->_tables;

  my ($name, $syn) = @{$tabs[0]};

  #construct a constraint like 't1.table1_id = 1'
  my $constraint = "${syn}.${name}_id = $id";

  #return first element of _generic_fetch list
  my ($obj) = @{$self->_generic_fetch($constraint)};
  return $obj;
}
=head2 fetch_all
Arg : None
Example :
Description:
Returntype :
Exceptions :
Caller :
=cut
sub fetch_all {
  my $self = shift;
  return $self->_generic_fetch();
}
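
# Usage sketch, with $dataflowRuleAdaptor obtained as in the SYNOPSIS
# (fetch_all is simply _generic_fetch with no constraint):
#
#   my $all_rules = $dataflowRuleAdaptor->fetch_all;
#   printf("hive contains %d dataflow rules\n", scalar(@{$all_rules}));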
=head2 _generic_fetch
Arg [1] : (optional) string $constraint
An SQL query constraint (i.e. part of the WHERE clause)
Arg [2] : (optional) listref $join
additional join tables, conditions and columns to add to the query
Example : $rules = $self->_generic_fetch('r.from_analysis_id = 1234');
Description: Performs a database fetch and returns DataflowRule objects.
Returntype : listref of Bio::EnsEMBL::Hive::DataflowRule objects
Exceptions : none
Caller : BaseFeatureAdaptor, ProxyDnaAlignFeatureAdaptor::_generic_fetch
=cut
sub _generic_fetch {
  my ($self, $constraint, $join) = @_;

  my @tables = $self->_tables;
  my $columns = join(', ', $self->_columns());

  if ($join) {
    foreach my $single_join (@{$join}) {
      my ($tablename, $condition, $extra_columns) = @{$single_join};
      if ($tablename && $condition) {
        push @tables, $tablename;

        if($constraint) {
          $constraint .= " AND $condition";
        } else {
          $constraint = " $condition";
        }
      }
      if ($extra_columns) {
        $columns .= ", " . join(', ', @{$extra_columns});
      }
    }
  }

  #construct a nice table string like 'table1 t1, table2 t2'
  my $tablenames = join(', ', map({ join(' ', @$_) } @tables));

  my $sql = "SELECT $columns FROM $tablenames";

  my $default_where = $self->_default_where_clause;
  my $final_clause = $self->_final_clause;

  #append a where clause if it was defined
  if($constraint) {
    $sql .= " WHERE $constraint ";
    if($default_where) {
      $sql .= " AND $default_where ";