# Bio::EnsEMBL::Hive project
#
# Copyright Team Ensembl
# You may distribute this package under the same terms as Perl itself

Contact:
  Please contact the ehive-users@ebi.ac.uk mailing list with questions/suggestions.

Summary:
  This is a distributed processing system based on 'autonomous agents' and
  the hive behavioural structure of honey bees. It implements all the functionality
  of both data-flow graphs and block-branch diagrams, which should allow it to codify
  any program, algorithm, or parallel processing job control system. It is
  not bound to any processing 'farm' system and can be adapted to any GRID.


30.Mar, 2011 : Leo

* fixed a bug when checking for definedness of branch_name in DataflowRuleAdaptor.pm

29.Mar, 2011 : Leo

* fixed a bug in MySQLTransfer.pm where it never reached automatic dataflow on success
* fixed Process.pm to not import warning() to avoid clashing with ensembl-core's warning

26.Mar-11.Apr, 2011 : Leo

* trying to enforce the 'fetch_by' and 'fetch_all_by' convention for adaptor calls

24.Mar, 2011 : Leo

* some branches can be named instead of numbered (-2, -1, 0, 1)
    and 0 now stands for 'ANYFAILURE' (please use with extreme care!)
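
    A hedged sketch of how this might look in a PipeConfig's -flow_into section
    (assuming the usual name-to-code mapping, e.g. 'MEMLIMIT' => -1, 'ANYFAILURE' => 0;
    the analysis names are made up):

        -flow_into => {
            'MEMLIMIT'   => [ 'same_task_highmem' ],   # a named alternative to branch -1
            'ANYFAILURE' => [ 'failure_logger' ],      # branch 0: catches any failure; use with extreme care!
        },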

23.Mar, 2011 : Leo

* sql/procedures.sql gives timing in minutes as well

18.Mar, 2011 : Leo

* one PREPARE statement per dataflow-to-table call (so works for array of output_ids with identical structure)

16.Mar, 2011 : Leo

* Reviewed procedures.sql, adding a procedure for timing and several tricks suggested by Greg.

10-12.Mar, 2011 : Andy

* Introduced a new script for "off-line" creation of pipeline diagrams based on GraphViz.
    To benefit from the drawing, you will have to refrain from creating jobs/analyses/rules
    directly in your pipeline and do dataflow instead.

9.Mar, 2011 : Leo

* Tried a new debugging trick (dumping the Hive database at key steps in the pipeline)
    and documented it in SystemCmd.pm module.

2.Mar, 2011 : Leo

* -keep_alive option added to the Beekeeper to allow it to loop even when all jobs are done.
    Requested by Bethan for tracking Blast jobs submitted from the web.

1.Feb, 2011 : Miguel 

* fixed init_pipeline.pl that was not propagating an error message from importing a PipeConfig module

27.Jan, 2011 : Leo

* Beekeeper will warn if a pipeline doesn't have a name defined
    (to avoid clashes with other unnamed pipelines on the farm).

21.Jan, 2011 : Leo

* Foreign key constraints have been moved into a separate .sql file,
    to allow the user to switch them all off.

19-20.Jan, 2011 : Leo

* An analysis can now be empty for blocking purposes (blocks only while it contains undone jobs).
    This allows creating more flexible pipelines (that have branches that may never execute, but will not block the rest).

19.Jan, 2011 : Leo

* AnalysisJobAdaptor now retries 3 times to avoid deadlock situations when claiming jobs (may need further attention)

14.Jan, 2011 : Leo

* several textual fields in the schema extended to take longer strings

12.Jan, 2011 : Leo

* MySQLTransfer can now param_substitute()

10-21.Jan, 2011 : Leo

* fixed an issue with $self->o() mechanism in lists (incomplete substitutions).

4-5.Jan, 2011 : Leo

* fixed multiple issues that appeared after introduction of the foreign keys
but only became visible after some testing (thanks to Gautier for extra testing!).

31.Dec, 2010 : Leo Gordon

* Implemented the long-standing plan to remove the schema/code dependency on UUIDs.
Removed the job_claim field from the schema and the code, and changed the way jobs are claimed.

28.Dec, 2010 : Leo Gordon

* Updated schema drawings that contain newly added foreign keys

21-23.Dec, 2010 : Leo Gordon

* Added foreign key constraints, figured out that foreign keys ARE enforced in MySQL 5.1.47,
so had to fix some code (and some ensembl-compara code as well, so keep yours up-to-date).

14.Dec, 2010 : Leo Gordon

* first attempt at creating schema drawings using MySQL Workbench. Drawings added to docs/ .
A lot of foreign key constraints were missing, which affected the drawing.
If they are not enforced by MySQL anyway, why not just add them?

26.Nov, 2010 : Leo Gordon

* changed JobFactory so that parts of it can be re-used by subclassing. Examples in ensembl-compara.

26.Oct, 2010 : Leo Gordon

* fixed a long-standing bug: input_id was supposed to be able to set things (according to compara code)

19-22.Oct, 2010 : Leo Gordon

* Fixed both rule adaptors and the HiveGeneric_conf to prevent them from creating duplicated rules
when a PipeConfig is re-run.

8.Oct, 2010 : Leo Gordon

* A new MySQLTransfer.pm Hive Runnable to copy tables over (an amazingly popular task in our pipelines).
It does an integrity check and fails if the underlying mysqldump fails.

1.Oct, 2010 : Leo Gordon

* runWorker.pl now prints the Worker only once per stream,
so if output is redirected to a file, both the file and the on-screen stream will contain it.

30.Sept-19.Oct, 2010 : Leo Gordon

* detected strange behaviour in a Worker that was running a RunnableDB
with 'runaway next' statements. Since it was not possible to fix (it seems to be a Perl language issue),
the Worker's code now does its best to detect this and exit. Please check that your RunnableDBs do not have runaway 'next' statements.

------------------------------------[previous 'stable' tag]----------------------------------

21-22 Sept, 2010 : Leo Gordon

* a new switch -worker_output_dir allows a particular worker to send its stdout/stderr into the given directory
    bypassing the -hive_output_dir if specified.

* streamlining runWorker.pl-Queen.pm communication so that runWorker.pl is now a very lightweight script
    (only manages the parameters and output, but no longer runs actual unique functionality)

20 Sept, 2010 : Leo Gordon

* big change: added gc_dataflow (jobs dying because of MEMLIMIT or RUNLIMIT can now be automatically sent
    to another analysis with more memory or a longer runtime limit). Schema change + multiple code changes.

16 Sept, 2010 : Leo Gordon

* code cleanup and unification of parameter names (older names still supported but not encouraged)

13-14 Sept, 2010 : Leo Gordon

* big change: created a separate Params class, making it a base class for AnalysisJob,
  and removed parameter parsing/reading/setting functionality from the Process. There is no need for ProcessWithParams now.
  This is a big preparation for post-mortem dataflow for resource-overusing jobs.

11 Sept, 2010 : Leo Gordon

* schema change: we are producing release 60!

* bugfix: -alldead did not set 'cause_of_death', now it always sets 'FATALITY' (should we invoke proper GarbageCollection?)

7-9 Sept, 2010 : Leo Gordon

* autoflow() should be a property of a job, not the process. Moved and optimized.

* avoiding filename/pid collisions in Worker::worker_temp_directory, improved reliability.

* removed some Extensions by creating proper hive adaptors (AnalysisAdaptor and MetaContainer)

* changed the way a RunnableDB declares its module defaults. NB!

2-3 Sept, 2010 : Leo Gordon

* optimizing the reliability and the time spent on finding out why LSF killed the jobs

* let MEMLIMIT jobs go into 'FAILED' state from the first attempt (don't waste time retrying)

31 Aug - 1 Sept, 2010 : Leo Gordon

* Added support for finding out WHY a worker is killed by the LSF (MEMLIMIT, RUNLIMIT, KILLED_BY_USER),
  the schema is extended to allow this information to be recorded in the 'hive' table.

24 Aug, 2010 : Leo Gordon

* experimental: Queen, Meadow, Meadow::LOCAL and Meadow::LSF changed to make it possible to run several beekeepers
  owned by different users over the same database.  They _should_not_ collide, but it has not been very thoroughly tested.

23 Aug, 2010 : Leo Gordon

* Worker now reports the reason why it decides to die + good working example (FailureTest framework)

20 Aug, 2010 : Leo Gordon

* Added a generic Stopwatch.pm module to allow for fine timing to be done in a cleaner way

* Added the ability for Runnables to throw messages (which will be recorded in 'job_error' table)
  not to be necessarily associated with the job's failure. This change involved schema change as well.

* 'job_error' table is renamed to 'job_message' with the extra field (is_error=0|1) added

13 Aug, 2010 : Javier Herrero

* scripts/cmd_hive.pl: better support for adding new jobs to an existing analysis. It also supports adding a single job.

13 Aug, 2010 : Leo Gordon

* AnalysisJob and Worker were changed to allow jobs to decide whether it makes any sense to restart them or not.

* a command line switch -retry_throwing_jobs and a corresponding getter/setter method was added to
    beekeeper.pl, runWorker.pl and Worker.pm to let the user decide whether to restart failing jobs or not.

11-12 Aug, 2010 : Leo Gordon

* A new table 'job_error' was added to keep track of jobs' termination messages (thrown via 'throw' or 'die'),
  this involved schema change and lots of changes in the modules.

* Another big new change is that the Workers no longer die when a Job dies. At least, not by default.
  If a Worker managed to catch a dying Job, this fact is registered in the database, but the Worker keeps on taking other jobs.

9-10 Aug, 2010 : Leo Gordon

* RunnableDB::Test renamed into RunnableDB::FailureTest and extended; PipeConfig::FailureTest_conf added to drive this module
(this prepared the testing ground for the job_error introduction).

16 July, 2010 : Leo Gordon

* added -hive_output_dir to beekeeper.pl so that it could be set/overridden from the command line

* dir_revhash is now an importable Util subroutine that is used by both Worker and JobFactory

14 July, 2010 : Leo Gordon

* fixed Meadow::LOCAL so that MacOS's ps would also be supported. eHive now runs locally on Macs :)

13 July, 2010 : Leo Gordon

* added ability to compute complex expressions while doing parameter substitution
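
    A hedged illustration, assuming the #expr(...)expr# notation (the parameter names are made up):

        'combined_score' => '#expr( 0.7 * #score_a# + 0.3 * #score_b# )expr#',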

12 July, 2010 : Leo Gordon

* added the slides of my HiveTalk_12Jul2010 into docs/

* changed the ambiguous 'output_dir' getter/setter into two methods:
    worker_output_dir (if you set it - it will output directly into the given directory) and
    hive_output_dir (if you set it, it will perform *reverse_decimal* hashing of the worker_id and create a directory for that particular worker)

2 July, 2010 : Leo Gordon

* [Bugfix] Process::dataflow_output_id() is simplified and generalized

* [Feature/experimental] ProcessWithParams::param_substitute() can now understand #stringifier:param_name# syntax,
    several stringifiers added (the syntax is not final, and the stringifiers will probably move out of the module)

* [Feature] TableDumperZipper_conf can now understand negative patterns:
    it is an example of how to emulate the syntax SHOW TABLES NOT LIKE '%abc%' (which does not exist in MySQL 5.1)
    by using queries against information_schema.tables

* [Cleanup] the 'did' parameter was finally removed from SystemCmd and SqlCmd to avoid confusion
    (the same functionality is already encapsulated in AnalysisJobAdaptor)

* [Feature] SqlCmd can now produce mysql_insert_ids and pass them on as params.
    This allows us to grab auto-incremented values on PipeConfig level (see Compara/PipeConfig for examples)

* [Convenience] beekeeper.pl and runWorker.pl can take dbname as a "naked" command line option
    (which makes the option's syntax even closer to that of mysql/mysqldump)

* [Convenience] -job_id is now the standard option name understood by beekeeper.pl and runWorker.pl
    (older syntax is kept for compatibility)

* [Cleanup] Some unused scripts have been removed from sql/ directory,
    drop_hive_tables() added to procedures.sql

* [Bugfix] claim_analysis_status index on analysis_job table has been fixed in tables.sql
    and a corresponding patch file added


13 June, 2010 : Leo Gordon

* Added support for dataflow-into-tables, see LongMult example.
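
    A hedged sketch of a dataflow rule targeting a table rather than an analysis
    (the ':////table_name' URL notation is assumed from the LongMult example):

        -flow_into => {
            1 => [ ':////intermediate_result' ],   # each flowed output_id becomes a row in this table
        },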

10 June, 2010 : Leo Gordon

* A bug preventing users from setting hive_output_dir via pipeline_wide_parameters has been fixed.

3 June, 2010 : Leo Gordon

* an important workaround for an LSF command-line parsing bug
    (LSF was unable to create job arrays where the pipeline name started with certain letters)

* lots of new documentation added and old docs removed (POD documentation in modules as well as eHive initialization/running manuals).
    Everyone is encouraged to use the new init_pipeline.pl and PipeConfig-style configuration files.

* a schema change that makes it possible to have multiple input_id_templates per dataflow branch
    (this feature is already accessible via API, but not yet implemented in init_pipeline.pl)

* JobFactory now understands multi-column input, and input_id templates can be written to refer to individual columns.
    The 'inputquery' mode has been tested and it works.
    Both 'inputfile' and 'inputcmd' should be able to split their input on param('delimiter'), but this has not yet been tested.


12 May, 2010 : Leo Gordon

* init_pipeline.pl can be given a PipeConfig file name instead of full module name.

* init_pipeline.pl has its own help that displays pod documentation (same mechanism as other eHive scripts)

* 3 pipeline initialization modes supported:
    full (default), -analysis_topup (pipeline development mode) and -job_topup (add more data to work with)
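
    For illustration (using the bundled LongMult example; the password option is just for completeness):

        init_pipeline.pl Bio::EnsEMBL::Hive::PipeConfig::LongMult_conf -password <pass>                   # full initialization
        init_pipeline.pl Bio::EnsEMBL::Hive::PipeConfig::LongMult_conf -password <pass> -analysis_topup   # add new analyses/rules only
        init_pipeline.pl Bio::EnsEMBL::Hive::PipeConfig::LongMult_conf -password <pass> -job_topup        # add new jobs only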


11 May, 2010 : Leo Gordon

* We finally have a universal framework for the setup/initialization of command-line-configurable pipelines.
    Each pipeline is defined by a Bio::EnsEMBL::Hive::PipeConfig module
    that derives from Bio::EnsEMBL::Hive::PipeConfig::HiveGeneric_conf .
    Compara pipelines derive from Bio::EnsEMBL::Compara::PipeConfig::ComparaGeneric_conf .
    These configuration modules are driven by ensembl-hive/scripts/init_pipeline.pl script.

    Having set up what is an 'option' in your config file, you can then supply values for it
    from the command line. Option interdependency rules are also supported to a certain extent,
    so you can supply *some* options, and rely on the rules to compute the rest.
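
    A minimal sketch of such a module (the names are illustrative, not a shipped PipeConfig):

        package Bio::EnsEMBL::Hive::PipeConfig::MyToy_conf;

        use strict;
        use warnings;
        use base ('Bio::EnsEMBL::Hive::PipeConfig::HiveGeneric_conf');

        sub default_options {
            my ($self) = @_;
            return {
                %{ $self->SUPER::default_options() },                    # inherit the generic connection options
                'pipeline_name' => 'my_toy',                             # overridable from the command line: -pipeline_name
                'output_dir'    => '/tmp/' . $self->o('pipeline_name'),  # an option computed from another option
            };
        }

        1;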

* Several example PipeConfig files have been written to show how to build pipelines both from 'standard blocks'
    (SystemCmd, SqlCmd, JobFactory, Dummy, etc) and from RunnableDBs written specifically for the task
    (components of the LongMult pipeline).

* Both eHive RunnableDB::* and PipeConfig::* modules have been POD-documented.

* A new 'input_id_template' feature has been added to the dataflow mechanism to allow for more flexibility
    when integrating external scripts or unsupported software into eHive pipelines.
    You can now dataflow from pretty much anything, even if the Runnable did not support dataflow natively.
    The corresponding schema patch is in ensembl-hive/sql

* pipeline-wide parameters (kept in 'meta' table) no longer have to be scalar.
    Feel free to use arrays or hashes if you need them. init_pipeline.pl also supports multilevel options.
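
    A hedged sketch (the parameter names are made up):

        sub pipeline_wide_parameters {
            my ($self) = @_;
            return {
                %{ $self->SUPER::pipeline_wide_parameters() },
                'species_list' => [ 'human', 'mouse', 'zebrafish' ],   # an arrayref is now fine
                'length_range' => { 'min' => 20, 'max' => 10000 },     # and so is a hashref
            };
        }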

* SqlCmd now has a concept of 'sessions': you can supply several queries in a list that will be executed
    one after another. If a query creates a temporary table, all the following ones down the list
    will be able to use it.

* SqlCmd can run queries against any database - not necessarily the eHive one. You have to supply a hashref
    of connection parameters via $self->param('db_conn') to make it work. It still runs against the eHive
    database by default.
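
    A hedged sketch combining both features (connection details, table and column names are all made up):

        {   -logic_name => 'refresh_other_db',
            -module     => 'Bio::EnsEMBL::Hive::RunnableDB::SqlCmd',
            -parameters => {
                'db_conn' => { -host => 'otherhost', -port => 3306,      # a hashref of connection parameters
                               -user => 'writer', -pass => $self->o('password'),
                               -dbname => 'other_db' },
                'sql'     => [   # one 'session': the second query can still see the temporary table
                    'CREATE TEMPORARY TABLE tmp_ids SELECT id FROM src WHERE flag = 1',
                    'UPDATE dst JOIN tmp_ids USING (id) SET dst.done = 1',
                ],
            },
        },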

* JobFactory now supports 4 sources: inputlist, inputfile, inputquery and inputcmd.
    *All* of them now support deep param_substitution. Enjoy.
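
    Hedged examples of the four sources (each would go into a separate analysis' -parameters;
    the values are made up):

        'inputlist'  => [ 10, 20, 30 ],                                    # a ready-made Perl list
        'inputfile'  => '#work_dir#/ids.txt',                              # one entry per line
        'inputquery' => 'SELECT table_name FROM information_schema.tables WHERE table_schema = "mydb"',
        'inputcmd'   => 'find #work_dir# -name "*.fasta"',                 # note the deep param_substitution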

* NB! JobFactory's substituted parameter syntax changed:
    it no longer understands '$RangeStart', '$RangeEnd' and '$RangeCount'.
    But it understands '#_range_start#', '#_range_end#' and '#_range_count#' - should be pretty easy to fix.
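
    For instance, a template that used to interpolate '$RangeStart' could now be written as follows
    (a hedged illustration; the surrounding parameter names are hypothetical):

        'input_id' => { 'chunk_name' => 'chunk_#_range_start#_#_range_end#', 'chunk_size' => '#_range_count#' },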

* several smaller bug fixes and optimizations of the code have also been done.
    A couple of utility methods have moved places, but it looks like they were mostly used internally. 
    Shout if you have lost anything and we'll try to find it together.


26 March, 2010 : Leo Gordon

* branch_code column in analysis_job table is unnecessary and was removed

    Branching using branch_codes is a very important and powerful mechanism,
    but it is completely defined in dataflow_rule table.

    branch_code() WAS at some point a getter/setter method in AnalysisJob,
    but it was only used to pass parameters around in the code (now obsolete),
    and this information was never reflected in the database,
    so analysis_job.branch_code was always 1 no matter what.

* stringification using Data::Dumper with parameters was moved out of init_pipeline.pl and JobFactory.pm
    and now lives in a separate Hive::Utils.pm module (Hive::Utils::stringify can be imported, inherited or just called).
    It is transparently called by AnalysisJobAdaptor when creating jobs, which allows
    input_ids to be passed as hashrefs rather than strings. The magic happens at the adaptor level.
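
    A hedged usage sketch:

        use Bio::EnsEMBL::Hive::Utils ('stringify');    # can also be inherited or called fully qualified

        my $input_id = stringify( { 'species' => 'human', 'chunk_id' => 42 } );   # a canonical string form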

* Queen->flow_output_job() method has been made obsolete and removed from Queen.pm .
    Dataflow is now completely handled by the Process->dataflow_output_id() method,
    which now handles arrays/fans of jobs and semaphores (more on this below).
    Please always use dataflow_output_id() if you need to create a new job or fan of jobs,
    as it is the top-level method for doing exactly this.
    Only call the naked adaptor's method if you know what you're doing.
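
    A hedged sketch from inside a RunnableDB's write_output() (the parameter names are made up):

        $self->dataflow_output_id( [ { 'chunk_id' => 1 }, { 'chunk_id' => 2 } ], 2 );   # a fan of jobs on branch 2
        $self->dataflow_output_id( { 'report_name' => 'summary' }, 1 );                 # a single job on branch 1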

* JobFactory module has been upgraded (simplified) to work through dataflow mechanism.
    It no longer can create analyses, but that's not necessary as it should be init_pipeline's job.
    Family pipeline has been patched to work with the new JobFactory module.

* branched dataflow was going to meet semaphores at some point, the time is near.
    dataflow_output_id() is now semaphore aware, and can propagate semaphores through the control graph.
    A new fan is hooked on its own semaphore; when the semaphored_job is not specified we do semaphore propagation.
    Inability to create a job in the fan is tracked and the corresponding semaphore_count decreased
    (so users do not have to worry about it).

* LongMult examples have been patched to work with the new dataflow_output_id() method.

* init_pipeline.pl is now more flexible and can understand simplified syntax for dataflow/control rules


22 March, 2010 : Leo Gordon

* Bio::EnsEMBL::Hive::ProcessWithParams is the preferred way of parsing/passing around the parameters.
    Module-wide, pipeline-wide, analysis-wide and job-wide parameters and their precedence.
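
    Inside a RunnableDB the resolution is invisible; a hedged sketch (the parameter name is made up):

        my $batch_size = $self->param('batch_size');   # job-wide value if set, else analysis-wide,
                                                       # else pipeline-wide, else the module default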

* A new init_pipeline.pl script to create and populate pipelines from a perl hash structure.
    Tested with ensembl-hive/docs/long_mult_pipeline.conf and ensembl-compara/scripts/family/family_pipeline.conf . It works.

* Bio::EnsEMBL::Hive::RunnableDB::SystemCmd now supports parameter substitution via #param_name# patterns.
    See usage examples in the ensembl-compara/scripts/family/family_pipeline.conf
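
    A hedged illustration (the command and parameter names are made up):

        'cmd' => 'gzip -c #input_file# > #input_file#.gz',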

* There is a new Bio::EnsEMBL::Hive::RunnableDB::SqlCmd that does what it says,
    and also supports parameter substitution via #param_name# patterns.
    See usage examples in the ensembl-compara/scripts/family/family_pipeline.conf

* Bio::EnsEMBL::Hive::RunnableDB::JobFactory has 3 modes of operation: inputlist, inputfile, inputquery.
    See usage examples in the ensembl-compara/scripts/family/family_pipeline.conf

* some rewrite of the Queen/Adaptors code to give us more developmental flexibility

* support for semaphores (job-level control rules) in SQL schema and API
    - partially tested, has some quirks, waiting for a more serious test by Albert

* support for resource requirements in SQL schema, API and on init_pipeline config file level
    Tested in the ensembl-compara/scripts/family/family_pipeline.conf . It works.


3 December, 2009 : Leo Gordon

beekeeper.pl, runWorker.pl and cmd_hive.pl
got new built-in documentation accessible via perldoc or directly.


2 December, 2009 : Leo Gordon

Bio::EnsEMBL::Hive::RunnableDB::LongMult example toy pipeline has been created
to show how to do various things "adult pipelines" perform
(job creation, data flow, control blocking rules, usage of intermediate tables, etc).

Read Bio::EnsEMBL::Hive::RunnableDB::LongMult for a step-by-step instruction
on how to create and run this pipeline.


30 November, 2009 : Leo Gordon

Bio::EnsEMBL::Hive::RunnableDB::JobFactory module has been added.
It is a generic way of creating batches of jobs with the parameters
given by a file or a range of ids.
Entries in the file can also be randomly shuffled.


13 July, 2009 : Leo Gordon

Merging the "Meadow" code from this March' development branch.
Because it separates LSF-specific code from higher level, it will be easier to update.

-------------------------------------------------------------------------------------------------------
Albert, sorry - in the process of merging into the development branch I had to remove your HIGHMEM code.
I hope it is a temporary measure and we will be having hive-wide queue control soon.
If not - you can restore the pre-merger state by updating with the following command:

    cvs update -r lg4_pre_merger_20090713

('maximise_concurrency' option was carried over)
-------------------------------------------------------------------------------------------------------


3 April, 2009 : Albert Vilella

  Added a new maximise_concurrency 1/0 option. When set to 1, it will
  fetch jobs in an order that maximises the number of different analyses
  being run. This is useful for cases where different analyses hit
  different tables, so the overall SQL load can be kept higher without
  breaking the server, instead of having lots of jobs for the same
  analysis trying to hit the same tables.

  Added quick HIGHMEM option. This option is useful when a small
  percent of jobs are too big and fail in normal conditions. The
  runnable can check if it's the second time it's trying to run the
  job, if it's because it contains big data (e.g. gene_count > 200)
  and if it isn't already in HIGHMEM mode. Then, it will call
  reset_highmem_job_by_dbID and quit:

  if ($self->input_job->retry_count == 1) {
    if ($self->{'protein_tree'}->get_tagvalue('gene_count') > 200 && !defined($self->worker->{HIGHMEM})) {
      $self->input_job->adaptor->reset_highmem_job_by_dbID($self->input_job->dbID);
      $self->DESTROY;
      throw("Alignment job too big: send to highmem and quit");
    }
  }

  Assuming there is a

   beekeeper.pl -url <blah> -highmem -meadow_options "<lots of mem>"

   running, or a 
   
   runWorker.pl <blah> -highmem 1

   with lots of mem running, it will fetch the HIGHMEM jobs as if they
   were "READY but needs HIGHMEM".

   Also added a modification to Queen that will not synchronize as
   often when more than 450 jobs are running and the load is above
   0.9, so that the queries to analysis tables are not hitting the sql
   server too much.

23 July, 2008 : Will Spooner
  Removed remaining ensembl-pipeline dependencies.

11 March, 2005 : Jessica Severin
  Project is reaching a very stable state.  New 'node' object Bio::EnsEMBL::Hive::Process
  allows for independence from Ensembl Pipeline and provides extended process functionality
  to manipulate hive job objects, branch, modify hive graphs, create jobs, and other hive
  process specific tasks.  Some of this extended 'Process' API may still evolve.

7 June, 2004 : Jessica Severin
  This project is under active development and should be classified as pre-alpha.
  Most of the design has been settled and I'm in the process of implementing the details,
  but entire objects could disappear or drastically change as I approach the end.
  Watch this space for further developments.
