Running eHive pipelines


A quick overview

Each eHive pipeline is a potentially complex computational process.

Whether it runs locally, on the farm, or on multiple compute resources, this process is centered around a "blackboard" (a MySQL, SQLite or PostgreSQL database) where individual jobs of the pipeline are created, claimed by independent Workers and later recorded as done or failed.

Running the pipeline involves the following steps:


Initialization of the pipeline database

Every eHive pipeline is centered around a "blackboard", which is usually a MySQL/SQLite/PostgreSQL database. This database contains both static information (general definitions of analyses, associated runnables, parameters and resources, dependency rules, etc.) and runtime information about the states of individual jobs running on the farm or locally.

By initialization we mean the act of creating such a new pipeline database from a PipeConfig file. This is done by feeding the PipeConfig file to the ensembl-hive/scripts/init_pipeline.pl script.
A typical example:

		init_pipeline.pl Bio::EnsEMBL::Hive::PipeConfig::LongMult_conf -pipeline_url mysql://user:password@host:port/long_mult
    
This will create a MySQL pipeline database called 'long_mult' with the given connection parameters. For newer PipeConfig files these may be the only parameters needed, as the rest can be set at a later stage via "seeding" (see below).

If heavy concurrent traffic to the database is not expected, we may choose to keep the blackboard in a local SQLite file:

		init_pipeline.pl Bio::EnsEMBL::Hive::PipeConfig::LongMult_conf -pipeline_url sqlite:///long_mult
    
In the latter case no connection parameters other than the filename are necessary, so they are omitted.

A couple of more complicated examples:

		init_pipeline.pl Bio::EnsEMBL::Compara::PipeConfig::ProteinTrees_conf -user "my_db_username" -password "my_db_password" -mlss_id 12345
    
This sets the 'user', 'password' and 'mlss_id' parameters via command-line options. At this stage you can also override any of the other options mentioned in the default_options section of the PipeConfig file.

If you need to modify second-level values of a "hash option" (such as the '-host' or '-port' parts of the 'pipeline_db' option), the syntax is as follows (it follows the extended syntax of Getopt::Long):

		init_pipeline.pl Bio::EnsEMBL::Compara::PipeConfig::ProteinTrees_conf -pipeline_db -host=myhost -pipeline_db -port=5306
    

PLEASE NOTE: Although many older PipeConfig files make extensive use of command-line options such as -password and -mlss_id above (so-called o() syntax), this is no longer the only, nor the recommended, way of pre-configuring pipelines. There are better ways to configure pipelines, so if you find yourself struggling to make sense of an existing PipeConfig's o() syntax, please talk to eHive developers or power-users, who are usually happy to help.

Normally, a single run of init_pipeline.pl should create the pipeline database for you.
If anything goes wrong and the process does not complete successfully, you will need to drop the partially created database in order to try again. You can either drop the database manually, or use the "-hive_force_init 1" option, which will automatically drop the database before trying to create it.
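
For example, re-using the first initialization command from above:

		init_pipeline.pl Bio::EnsEMBL::Hive::PipeConfig::LongMult_conf -pipeline_url mysql://user:password@host:port/long_mult -hive_force_init 1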

If init_pipeline.pl completes successfully, it will print a legend of commands that can be run next.

Please remember that these command lines are for use only with a particular pipeline database, and are likely to be different next time you run the pipeline. Moreover, they will contain a sensitive password! So don't write them down.


Generating a pipeline's flow diagram

As soon as the pipeline database is ready you can store its visual flow diagram in an image file. This diagram is a much better tool for understanding what is going on in the pipeline. Run the following command to produce it:

        generate_graph.pl -url sqlite:///my_pipeline_database -out my_diagram.png
    
You only have to choose the format (gif, jpg, png, svg, etc.) by setting the output file extension.
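
For example, to produce a scalable SVG version of the same diagram:

        generate_graph.pl -url sqlite:///my_pipeline_database -out my_diagram.svg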

LEGEND:

  • The rounded nodes on the flow diagram represent Analyses (classes of jobs).
  • The white rectangular nodes represent Tables that hold user data.
  • The blue solid arrows are called "dataflow rules". They either generate new jobs (if they point to an Analysis node) or store data (if they point at a Table node).
  • The red solid arrows with T-heads are "analysis control rules". They block the pointed-at Analysis until all the jobs of the pointing Analysis are done.
  • Light-blue shadows behind some analyses stand for "semaphore rules". Together with red and green dashed lines they represent our main job control mechanism that will be described elsewhere.

Each flow diagram thus generated is a momentary snapshot of the pipeline state, and these snapshots will be changing as the pipeline runs. One of the things changing will be the colour of the Analysis nodes. The default colour legend is as follows:

  •  [ EMPTY ]  : the Analysis never had any jobs to do. Since pipelines are dynamic it may be ok for some Analyses to stay EMPTY until the very end.
  •  [ DONE ]  : all jobs of the Analysis are DONE. Since pipelines are dynamic, it may be a temporary state, until new jobs are added.
  •  [ READY ]  : some jobs are READY to be run, but nothing is running at the moment.
  •  [ IN PROGRESS ]  : some jobs of the Analysis are being processed at the moment of the snapshot.
  •  [ BLOCKED ]  : none of the jobs of this Analysis can be run at the moment because of job dependency rules.
  •  [ FAILED ]  : the number of FAILED jobs in this Analysis has gone over a threshold (which is 0 by default). By default beekeeper.pl will exit if it encounters a FAILED analysis.

Another thing that will be changing from snapshot to snapshot is the job "breakout" formula displayed under the name of the Analysis. It shows how many jobs are in which state and the total number of jobs. Separate parts of this formula are similarly colour-coded:

  • grey :  s  (SEMAPHORED) - individually blocked jobs
  • green :  r  (READY) - jobs that are ready to be claimed by Workers
  • yellow :  i  (IN PROGRESS) - jobs that are currently being processed by Workers
  • skyblue :  d  (DONE) - successfully completed jobs
  • red :  f  (FAILED) - unsuccessfully completed jobs
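
For example, a breakout of the (purely illustrative) form 0s+2r+1i+10d+0f=13 would mean 2 jobs ready, 1 in progress, 10 done, none semaphored or failed, and 13 jobs in total.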

Actually, you don't even need to generate a pipeline database to see its diagram, as the diagram can be generated directly from the PipeConfig file:

        generate_graph.pl -pipeconfig Bio::EnsEMBL::Hive::PipeConfig::LongMult_conf -out my_diagram2.png
    
Such a "standalone" diagram may look slightly different (analysis_ids will be missing).

PLEASE NOTE: A very friendly guiHive web app can periodically regenerate the pipeline flow diagram for you, so you can now monitor (and to a certain extent control) your pipeline from a web browser.


Seeding jobs into the pipeline database

The pipeline database contains a dynamic collection of jobs (tasks) to be done. The jobs can be added to the "blackboard" either by the user (we call this process "seeding") or dynamically, by already running jobs. When a database is created using init_pipeline.pl it may or may not be already seeded, depending on the PipeConfig file (you can always check whether it has been automatically seeded by looking at the flow diagram). If the pipeline needs seeding, this is done by running the seed_pipeline.pl script, providing both the Analysis to be seeded and the parameters of the job being created:

		seed_pipeline.pl -url sqlite:///my_pipeline_database -logic_name "analysis_name" -input_id '{ "paramX" => "valueX", "paramY" => "valueY" }'
    
It only makes sense to seed certain analyses, typically the ones that do not have any incoming dataflow on the flow diagram.


Synchronizing ("sync"-ing) the pipeline database

In order to function properly (to monitor the progress, block and unblock analyses, and send the correct number of workers to the farm) the eHive system needs to maintain a certain number of job counters. These counters and associated analysis states are updated in the process of "synchronization" (or "sync"). This has to be done once before running the pipeline, and normally the pipeline will take care of synchronization by itself, triggering the 'sync' process automatically. However, sometimes things go out of sync, especially when people try to outsmart the scheduler by manually stopping and running jobs :) This is when you might want to re-sync the database. It is done by running ensembl-hive/scripts/beekeeper.pl in "sync" mode:

		beekeeper.pl -url sqlite:///my_pipeline_database -sync
    


Running the pipeline in automatic mode

As mentioned previously, the usual lifecycle of an eHive pipeline revolves around the pipeline database. There are several "Worker" processes that run on the farm. The Workers pick suitable tasks from the database, run them, and report back to the database. There is also one "Beekeeper" process that normally loops on a head node of the farm, monitors the progress of Workers and, whenever needed, submits more Workers to the farm (since Workers die from time to time for natural and not-so-natural reasons, the Beekeeper maintains the correct load).

So to "run the pipeline" all you have to do is to run the Beekeeper:

		beekeeper.pl -url sqlite:///my_pipeline_database -loop
    

You can also restrict running to a subset of Analyses (either by analysis_id or by name pattern):

		beekeeper.pl -url sqlite:///my_pipeline_database -analyses_pattern 'alignment_%' -loop       # all analyses whose name starts with 'alignment_'
    
or
		beekeeper.pl -url sqlite:///my_pipeline_database -analyses_pattern '1..5,fasta_check' -loop  # only analyses with analysis_id between 1 and 5 and 'fasta_check'
    

In order to make sure the beekeeper.pl process doesn't die when you disconnect your ssh session from the farm, it is normally run in a "screen session".
If your Beekeeper process gets killed for some reason, don't worry - you can re-sync and start another Beekeeper process. It will pick up from where the previous Beekeeper left off.
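
A minimal sketch of that pattern (the session name is arbitrary; tmux or nohup would work equally well):

		screen -S long_mult_beekeeper            # start a named screen session; detach with Ctrl-a d, re-attach later with 'screen -r long_mult_beekeeper'
		beekeeper.pl -url sqlite:///my_pipeline_database -loop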


Monitoring the progress via a direct database session

In addition to monitoring the visual flow diagram (which can be generated manually using generate_graph.pl or via the guiHive web app) you can also connect to the pipeline database directly and issue SQL commands. To avoid typing in all the connection details (the syntax differs depending on the particular database engine used) you can use the bespoke db_cmd.pl script that takes the eHive database URL and performs the connection for you:

		db_cmd.pl -url sqlite:///my_pipeline_database
    
or
		db_cmd.pl -url mysql://user:password@host:port/long_mult
    
or
		db_cmd.pl -url pgsql://user:password@host:port/long_mult
    
Once connected, you can run any SQL queries using the eHive schema (see the eHive schema diagram and the eHive schema description).
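
Depending on your eHive version, db_cmd.pl may also accept a -sql option for running a single query non-interactively (treat this flag as an assumption and check the script's built-in help if unsure):

		db_cmd.pl -url sqlite:///my_pipeline_database -sql 'SELECT COUNT(*) FROM job'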

In addition to the tables, there is a "progress" view from which you can select and see how your jobs are doing:

		SELECT * FROM progress;
    
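
For example, to focus only on problematic entries (assuming the view exposes 'status' and 'retry_count' columns; the exact column names may vary between eHive versions):

		SELECT * FROM progress WHERE status='FAILED' OR retry_count>0;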

If you see jobs in 'FAILED' state or jobs with retry_count>0 (which means they have failed at least once and had to be retried), you may need to look at the "msg" view in order to find out the reason for the failures:

		SELECT * FROM msg WHERE job_id=1234;	# a specific job
    
or
		SELECT * FROM msg WHERE analysis_id=15;	# jobs of a specific analysis
    
or
		SELECT * FROM msg;	# show me all messages
    

Some of the messages indicate temporary errors (such as a temporary lack of connectivity with a database or file), but others may indicate critical problems (such as a wrong path to a binary) that will eventually make all jobs of an analysis fail. If the "is_error" flag of a message is false, it may just be a diagnostic message which is not critical.
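
For example, to list only the messages flagged as errors (assuming the is_error flag is exposed as a 0/1 column, which may differ between eHive versions):

		SELECT * FROM msg WHERE is_error=1;	# critical messages only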