# Bio::EnsEMBL::Hive project
#
# Copyright Team Ensembl
# You may distribute this package under the same terms as Perl itself

Contact:
  Please contact the ehive-users@ebi.ac.uk mailing list with questions or suggestions.

Summary:
  This is a distributed processing system based on 'autonomous agents' and the
  hive behavioural structure of honey bees.  It implements the functionality of
  both data-flow graphs and block-branch diagrams, which should allow it to
  codify any program, algorithm, or parallel processing job control system.  It
  is not bound to any processing 'farm' system and can be adapted to any GRID.
  It builds on the design of the Ensembl Pipeline/Analysis system and presently uses
  Bio::EnsEMBL::Analysis::RunnableDB Perl wrapper objects as nodes/blocks in
  the graphs, but could be adapted more generally.

22 March, 2010 : Leo Gordon

* Bio::EnsEMBL::Hive::ProcessWithParams is now the preferred way of parsing and passing around parameters.
    It handles module-wide, pipeline-wide, analysis-wide and job-wide parameters and resolves
    their precedence (job-wide being the most specific); see the sketch below.
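    A minimal sketch of a RunnableDB using this mechanism (the package name and the
    'message' parameter are made up for illustration; param() is the accessor provided
    by ProcessWithParams):

        package MyPipeline::RunnableDB::HelloParams;    # hypothetical module

        use strict;
        use warnings;

        use base ('Bio::EnsEMBL::Hive::ProcessWithParams');

        sub run {
            my $self = shift;

            # param() returns the value after precedence resolution, so a job-wide
            # 'message' overrides an analysis-wide or pipeline-wide one:
            my $message = $self->param('message') || 'hello from the hive';

            print "$message\n";
        }

        1;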

* A new init_pipeline.pl script creates and populates pipelines from a Perl hash structure; see the sketch below.
    Tested with ensembl-hive/docs/long_mult_pipeline.conf and ensembl-compara/scripts/family/family_pipeline.conf; it works.
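    The general shape of such a hash (the logic_names, module and parameter values below
    are made up, and the exact set of supported keys should be checked against the two
    conf files above; this is only an illustrative sketch):

        {
            'pipeline_analyses' => [    # one entry per analysis (node) of the pipeline graph

                {   -logic_name => 'say_hello',
                    -module     => 'MyPipeline::RunnableDB::HelloParams',
                    -input_ids  => [ { 'message' => 'hello' }, { 'message' => 'world' } ],
                },

                {   -logic_name => 'say_goodbye',
                    -module     => 'MyPipeline::RunnableDB::HelloParams',
                    -parameters => { 'message' => 'goodbye' },
                    -wait_for   => [ 'say_hello' ],     # control rule: only runs after say_hello is done
                },
            ],
        }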

* Bio::EnsEMBL::Hive::RunnableDB::SystemCmd now supports parameter substitution via #param_name# patterns.
    See usage examples in the ensembl-compara/scripts/family/family_pipeline.conf
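    A hypothetical analysis definition along these lines (the logic_name, file names and
    parameter values are made up; 'cmd' is the command that SystemCmd runs, with the
    #...# tokens substituted from the job's parameters):

        {   -logic_name => 'compress_file',
            -module     => 'Bio::EnsEMBL::Hive::RunnableDB::SystemCmd',
            -parameters => { 'cmd' => 'gzip -c #input_file# > #input_file#.gz' },
            -input_ids  => [ { 'input_file' => '/tmp/example.txt' } ],
        },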

* There is a new Bio::EnsEMBL::Hive::RunnableDB::SqlCmd that does what its name says,
    and also supports parameter substitution via #param_name# patterns.
    See usage examples in the ensembl-compara/scripts/family/family_pipeline.conf
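    A hypothetical example (the table, columns and parameter names are made up; 'sql'
    holds the statement(s) to be executed, with #...# tokens substituted as above):

        {   -logic_name => 'store_result',
            -module     => 'Bio::EnsEMBL::Hive::RunnableDB::SqlCmd',
            -parameters => { 'sql' => 'INSERT INTO final_result (label, value) VALUES ("#label#", #value#)' },
        },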

* Bio::EnsEMBL::Hive::RunnableDB::JobFactory has three modes of operation: inputlist, inputfile and inputquery.
    See usage examples in the ensembl-compara/scripts/family/family_pipeline.conf
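    The three modes correspond to three alternative parameters; only one of them would be
    set for a given analysis (the values below are made up, and exactly how the generated
    entries are turned into jobs depends on the rest of the analysis definition):

        -parameters => { 'inputlist'  => [ 'alpha', 'beta', 'gamma' ] },      # an explicit Perl list
        -parameters => { 'inputfile'  => '/path/to/ids.txt' },                # one entry per line of a file
        -parameters => { 'inputquery' => 'SELECT member_id FROM member' },    # rows returned by an SQL query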

* Some rewriting of the Queen/Adaptors code to give us more development flexibility.

* Support for semaphores (job-level control rules) in the SQL schema and API.
    Partially tested and has some quirks; waiting for a more serious test by Albert.

* Support for resource requirements in the SQL schema, in the API and at the init_pipeline config file level.
    Tested with ensembl-compara/scripts/family/family_pipeline.conf; it works.


3 December, 2009 : Leo Gordon

beekeeper.pl, runWorker.pl and cmd_hive.pl
now have built-in documentation, accessible via perldoc or directly.
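For example (adjust the path to wherever your ensembl-hive checkout lives):

    perldoc ensembl-hive/scripts/beekeeper.pl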


2 December, 2009 : Leo Gordon

The Bio::EnsEMBL::Hive::RunnableDB::LongMult example toy pipeline has been created
to show how to do the various things that "adult" pipelines perform
(job creation, data flow, control (blocking) rules, usage of intermediate tables, etc.).

Read Bio::EnsEMBL::Hive::RunnableDB::LongMult for step-by-step instructions
on how to create and run this pipeline.


30 November, 2009 : Leo Gordon

The Bio::EnsEMBL::Hive::RunnableDB::JobFactory module has been added.
It is a generic way of creating batches of jobs with parameters
taken from a file or a range of ids.
Entries in the file can also be randomly shuffled.


13 July, 2009 : Leo Gordon

Merged the "Meadow" code from this March's development branch.
Because it separates the LSF-specific code from the higher-level logic, it will be easier to update.

-------------------------------------------------------------------------------------------------------
Albert, sorry - in the process of merging into the development branch I had to remove your HIGHMEM code.
I hope this is a temporary measure and that we will have hive-wide queue control soon.
If not, you can restore the pre-merger state by updating with the following command:

    cvs update -r lg4_pre_merger_20090713

('maximise_concurrency' option was carried over)
-------------------------------------------------------------------------------------------------------


3 April, 2009 : Albert Vilella

  Added a new maximise_concurrency 1/0 option.  When set to 1, it will
  fetch jobs in an order that maximises the number of different analyses
  being run at the same time.  This is useful when different analyses hit
  different tables: the overall SQL load can be kept higher without
  breaking the server, instead of having lots of jobs for the same
  analysis all trying to hit the same tables.

  Added a quick HIGHMEM option.  This option is useful when a small
  percentage of jobs are too big and fail under normal conditions.  The
  runnable can check whether this is the second attempt at running the
  job, whether that is because the job contains big data (e.g.
  gene_count > 200), and whether it is not already in HIGHMEM mode.  If
  so, it calls reset_highmem_job_by_dbID and quits:
  # second attempt, data too big, and not already on a HIGHMEM worker:
  if ($self->input_job->retry_count == 1) {
    if ($self->{'protein_tree'}->get_tagvalue('gene_count') > 200 && !defined($self->worker->{HIGHMEM})) {
      # flag the job for a HIGHMEM worker and bail out of this one:
      $self->input_job->adaptor->reset_highmem_job_by_dbID($self->input_job->dbID);
      $self->DESTROY;
      throw("Alignment job too big: send to highmem and quit");
    }
  }

  Assuming there is a

   beekeeper.pl -url <blah> -highmem -meadow_options "<lots of mem>"

   running, or a 
   
   runWorker.pl <blah> -highmem 1

   with lots of memory running, it will fetch the HIGHMEM jobs as if they
   were "READY but needs HIGHMEM".

   Also modified the Queen so that it does not synchronize as often when
   more than 450 jobs are running and the load is above 0.9, so that the
   queries to the analysis tables do not hit the SQL server too hard.

23 July, 2008 : Will Spooner
  Removed remaining ensembl-pipeline dependencies.

11 March, 2005 : Jessica Severin
  Project is reaching a very stable state.  New 'node' object Bio::EnsEMBL::Hive::Process
  allows for independence from Ensembl Pipeline and provides extended process functionality
  to manipulate hive job objects, branch, modify hive graphs, create jobs, and perform other
  hive-process-specific tasks.  Some of this extended 'Process' API may still evolve.

7 June, 2004 : Jessica Severin
  This project is under active development and should be classified as pre-alpha.
  Most of the design has been settled and I'm in the process of implementing the details,
  but entire objects could disappear or drastically change as I approach the end.
  Watch this space for further developments.
