Commit 9e019329 authored by Brandon Walts's avatar Brandon Walts Committed by Matthieu Muffato

added several documents and removed creating_runnables/runnables.rst

parent 7cc04842
Continuously running pipelines
There are two main strategies for running different instances of an analysis within the same workflow -- i.e. running the same workflow with different starting data. One method, probably the more commonly used, is to instantiate a new hive database for each new analysis run. Another method is to seed a new job into an existing pipeline (into an already existing hive database). In the second case, the seeded job will start a new parallel path through the pipeline.
The latter method can be used to set up an eHive pipeline to provide a service for on-demand computation. In this arrangement, a single hive pipeline is set up, and a beekeeper is set running continuously. When a job is seeded, the beekeeper will notice the new job during its next loop, and will create workers to take that job as appropriate.
Beekeeper options
A few options should be considered when operating a pipeline in continuous mode:
- Continuous looping is ensured by setting ``-loop_until FOREVER``
- It may be desirable to make the pipeline more responsive by reducing the sleep time below one minute using ``-sleep [minutes]``
- It may be desirable to set the beekeeper to stop after a certain number of loops using ``-max_loops [number of loops]``
Hoovering the pipeline
A continuously running pipeline has the potential to accumulate thousands of DONE job rows in the job table. As these grow in number, they can slow down the pipeline, as workers' queries from and updates to the job table take longer. To rectify this, a script is provided to remove DONE jobs from the job table, reducing the size of the table and thereby speeding up operations involving it.
By default, the script removes DONE jobs that finished more than one week ago. The age of DONE jobs to be deleted can be adjusted with the -days_ago and -before_datetime options:
- `` -url "sqlite:///my_hive_db" # removes DONE jobs that have been DONE for at least one week``
- `` -url "sqlite:///my_hive_db" -days_ago 1 # removes DONE jobs that have been DONE for at least one day``
- `` -url "sqlite:///my_hive_db" -before_datetime "2017-01-01 08:00:00" # removes DONE jobs that became DONE before 08:00 on January 1st, 2017``
Note that the resource usage statistics are computed for all runs through the pipeline.
Runnables Included in the eHive Distribution
Several Runnables are included in the standard eHive distribution, serving as something like a standard library of components that are commonly helpful when creating pipelines. All of these are located in the directory modules/Bio/EnsEMBL/Hive/RunnableDB/. In addition, there are Runnables included with the examples under modules/Bio/EnsEMBL/Hive/Examples/. Although those are written to fit into specific example pipelines to illustrate specific eHive concepts, some users may find them useful in their own pipelines.
The included Runnables are:
- Bio::EnsEMBL::Hive::RunnableDB::DatabaseDumper
- Bio::EnsEMBL::Hive::RunnableDB::DbCmd
- Bio::EnsEMBL::Hive::RunnableDB::Dummy
- Bio::EnsEMBL::Hive::RunnableDB::FastaFactory
- Bio::EnsEMBL::Hive::RunnableDB::JobFactory
- Bio::EnsEMBL::Hive::RunnableDB::MySQLTransfer
- Bio::EnsEMBL::Hive::RunnableDB::NotifyByEmail
- Bio::EnsEMBL::Hive::RunnableDB::SlackNotification
- Bio::EnsEMBL::Hive::RunnableDB::SqlCmd
- Bio::EnsEMBL::Hive::RunnableDB::SqlHealthcheck
- Bio::EnsEMBL::Hive::RunnableDB::SystemCmd
IO and Error Handling in Runnables
This section covers the details of programming a runnable to accept and transmit data. Because a large component of handling errors is properly signalling that an error has occurred, along with the nature of that error, it will also be covered in this section.
Parameter Handling
Receiving input parameters
In eHive, parameters are the primary method of passing messages between components in a pipeline. Due to the central role of parameters, one of eHive's paradigms is to try to make most data sources look like parameters, somewhat analogous to the UNIX philosophy of "make everything look like a file." Therefore, the syntax for accessing parameters also applies to accessing accumulators and user-defined tables in the hive database.
Within a runnable, parameter values can be set or retrieved using the param() method or one of its variants -- e.g. ``$self->param('parameter_name') #get`` or ``$self->param('parameter_name', $new_value) #set``:
- param() - Sets or gets the value of the named parameter. When attempting to get the value for a parameter that has no value (this could be because it is not in scope), a warning "ParamWarning: value for param([parameter name]) is used before having been initialized!" will be logged.
- param_required() - Like param(), except the job will fail if the named parameter has no value.
- param_exists() - True/false test for existence of a parameter with the given name. Note that this will return true if the parameter's value is undefined. Compare to param_is_defined().
- param_is_defined() - True/false test for the existence of a parameter with the given name with a value. Note that this will return false if the parameter's value is undefined. Compare to param_exists().
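The difference between these accessors can be sketched with a small stand-alone model (illustrative Python only -- the class below is a toy, not the real eHive Process/BaseRunnable API):

```python
# Toy model of eHive's param accessors (illustrative only).
class ParamDemo:
    def __init__(self, params):
        self._params = params  # name -> value; a key may map to None (undefined)

    def param_exists(self, name):
        # True even if the stored value is undefined (None)
        return name in self._params

    def param_is_defined(self, name):
        # False when the parameter is missing OR its value is undefined
        return self._params.get(name) is not None

    def param_required(self, name):
        # like param(), but the job fails if the value is missing
        if self._params.get(name) is None:
            raise KeyError(f"param('{name}') is required but has no value")
        return self._params[name]

job = ParamDemo({"alpha": 42, "beta": None})
assert job.param_exists("beta") is True        # present, value undefined
assert job.param_is_defined("beta") is False   # undefined value
assert job.param_required("alpha") == 42
```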
Passing parameters within a runnable
It is often desirable to pass data between methods of a runnable. For example, parameter values may need to be moved from fetch_input() into run(), and the results of computation may need to be carried from run() into write_output(). The eHive parameter mechanism is intended to facilitate this kind of data handling.
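As a sketch of this pattern, the following toy runnable (a hypothetical stand-in for the real Process base class, written in Python for illustration) stashes derived data in a parameter during fetch_input() and reads it back in later phases:

```python
class MiniRunnable:
    """Toy sketch of carrying data between phases via params
    (an assumed stand-in, not the real eHive base class)."""
    def __init__(self, params):
        self._params = dict(params)

    def param(self, name, *value):
        if value:                        # setter form: param(name, value)
            self._params[name] = value[0]
        return self._params.get(name)    # getter form: param(name)

    def fetch_input(self):
        # validate/derive inputs, stash them for later phases
        self.param("numbers", [int(x) for x in self.param("raw").split(",")])

    def run(self):
        # main computation reads what fetch_input() stored
        self.param("total", sum(self.param("numbers")))

    def write_output(self):
        # results carried from run() are available here
        return {"total": self.param("total")}

r = MiniRunnable({"raw": "1,2,3"})
r.fetch_input()
r.run()
assert r.write_output() == {"total": 6}
```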
Reading in data from external files and databases
At a basic level, a runnable is simply a Perl, Python, or Java module, which has access to all of the database and file IO facilities of any standard program. There are some extra facilities provided by eHive for convenience in working with external data sources:
- Database URLs: Runnables can identify any MySQL, PostgreSQL, or SQLite database using a URL, not just the eHive pipeline database. Runnable writers can obtain a database connection from a URL using the method ``Bio::EnsEMBL::Hive::Utils::go_figure_dbc()``.
- Database connections handled through eHive's DBSQL modules automatically disconnect when inactive, and reconnect if disconnected.
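The URL scheme itself can be illustrated with a short Python sketch (the helper function below is hypothetical -- real URL handling in eHive is done by go_figure_dbc()):

```python
from urllib.parse import urlsplit

def split_db_url(url):
    """Hypothetical helper: split an eHive-style database URL
    (driver://user:password@host:port/dbname) into its parts."""
    parts = urlsplit(url)
    return {
        "driver": parts.scheme,
        "user": parts.username,
        "host": parts.hostname,
        "port": parts.port,
        "dbname": parts.path.lstrip("/"),
    }

info = split_db_url("mysql://rw_user:secret@dbhost:3306/my_hive_db")
assert info["driver"] == "mysql"
assert info["host"] == "dbhost"
assert info["dbname"] == "my_hive_db"
```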
Error Handling
eHive provides a number of mechanisms to detect and handle error conditions. These include special dataflow events triggered by certain error conditions, somewhat akin to a try-catch system.
Special Dataflow when Jobs Exceed Resource Limits
The eHive system can react when the job scheduler notifies it that a job's memory requirements exceeded the job's memory request (MEMLIMIT error), or when a job's runtime exceeds the job's runtime request (RUNLIMIT error). When receiving notification from the scheduler that a job has been killed for one of those reasons, eHive will catch the error and perform the following actions:
- The job's status will be updated to PASSED_ON (instead of FAILED).
- The job will not be retried.
- A dataflow event will be generated on branch -1 (for MEMLIMIT) or -2 (for RUNLIMIT). This event will pass along the same parameters and values that were passed to the original job. The intent of this event is to seed a job of a new analysis that uses the same Runnable as the PASSED_ON job, but with a different resource class. However, eHive does not enforce any special restrictions on this event -- it can be wired in the same way as any other analysis.
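The behaviour above can be sketched as a toy dispatcher (illustrative Python only, not eHive internals):

```python
# Toy model of the resource-limit handling described above: a scheduler
# kill maps to PASSED_ON status, no retry, and a dataflow event on
# branch -1 (MEMLIMIT) or -2 (RUNLIMIT) carrying the same parameters.
BRANCH_FOR_ERROR = {"MEMLIMIT": -1, "RUNLIMIT": -2}

def handle_scheduler_kill(job, error):
    job["status"] = "PASSED_ON"   # instead of FAILED
    job["retry"] = False          # the job is not retried
    return {
        "branch": BRANCH_FOR_ERROR[error],
        "params": dict(job["params"]),   # same params as the original job
    }

job = {"params": {"seq_file": "chunk7.fa"}, "status": "RUN"}
event = handle_scheduler_kill(job, "MEMLIMIT")
assert job["status"] == "PASSED_ON" and event["branch"] == -1
assert event["params"] == {"seq_file": "chunk7.fa"}
```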
Logging Messages
Runnables have STDOUT and STDERR output streams available, but these are redirected and function differently than they would in a conventional script. During normal eHive operation, when jobs are run by workers submitted via a beekeeper loop, output to these streams is not sent to the shell in the conventional manner. Instead, it is either discarded to /dev/null, or is written to files specified by the -hive_log_dir option. Because of this redirection, STDERR and STDOUT should be treated as "verbose-level debug" output streams in runnables. When a job is run by a worker started directly from the command line, STDOUT and STDERR are handled normally (unless the -hive_log_dir option has been set, in which case output is directed to files in the directory specified by -hive_log_dir).
When writing a Runnable, the preferred method for sending messages to the user is via the message log. An API is provided to facilitate logging messages in the log.
- warning(message, message_class) causes the string passed in the message parameter to be logged. A message class (one of the valid classes for a message log entry) can optionally be added. For backwards compatibility, if a non-zero number is passed to message_class, this will be converted to WORKER_ERROR.
- Perl ``die`` messages are redirected to the message log, and will be classified as WORKER_ERROR.
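The backwards-compatibility rule can be sketched as follows (illustrative Python; mapping 0 back to INFO is an assumption based on the old is_error column described later in this guide):

```python
VALID_CLASSES = {"INFO", "PIPELINE_CAUTION", "PIPELINE_ERROR",
                 "WORKER_CAUTION", "WORKER_ERROR"}

def normalise_message_class(message_class):
    """Toy sketch of the compatibility rule described above:
    a non-zero number passed as the class becomes WORKER_ERROR.
    (Mapping 0 to INFO is an assumption, mirroring old is_error=0.)"""
    if isinstance(message_class, int):
        return "WORKER_ERROR" if message_class else "INFO"
    if message_class in VALID_CLASSES:
        return message_class
    raise ValueError(f"unknown message class: {message_class!r}")

assert normalise_message_class(1) == "WORKER_ERROR"   # old is_error=1
assert normalise_message_class(0) == "INFO"           # old is_error=0
assert normalise_message_class("WORKER_CAUTION") == "WORKER_CAUTION"
```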
.. eHive guide to creating runnables
Creating a Runnable
Runnables Overview
The code a worker actually runs to accomplish its task is in a module called a "runnable." At its simplest, a runnable is simply a bit of code that implements one of:
- Bio::EnsEMBL::Hive::Process (Perl)
- eHive.Process (Python)
- org.ensembl.hive.BaseRunnable (Java)
When a worker specializes to perform a job, it compiles and runs the runnable associated with the job's analysis.
The ...Process base class (in whichever language) provides a common interface for the workers. In particular, it provides a set of methods that are guaranteed to be called in order:
#. pre_cleanup()
#. fetch_input()
#. run()
#. write_output()
#. post_healthcheck()
#. post_cleanup()
Note that the method names are only suggestive of their intended roles: with the exception of pre_cleanup and write_output, there is no special behaviour for these methods beyond the order in which they are run. There is no need to implement all (or indeed, any) of these methods. If nothing is provided for a method, the default is to do nothing.
There are a number of example runnables provided with eHive. They can be found in two locations:
- Utility runnables are located in ``modules/Bio/EnsEMBL/Hive/RunnableDB/``.
- Runnables associated with example pipelines can be found in the ``RunnableDB/`` subdirectories under the example directories in ``modules/Bio/EnsEMBL/Hive/Examples/``.
If the job has a retry count greater than zero, then pre_cleanup() is the first method called when a worker runs a job. This provides an opportunity to clean up database entries or files that may be leftover from a failed attempt to run the job before trying again.
The fetch_input() method is the first method called the first time a job is run (if a job has a retry count greater than zero, then pre_cleanup() will be the first method called). This method is provided to check that input parameters exist and are valid. The benefits of putting input parameter checks here include:
- Making the code easier to understand and maintain; users of the runnable will know where to look to quickly discover which parameters are required or optional.
- If there are problems with input parameters, the job will fail quickly.
The run() method is called after fetch_input() completes. This method is provided as a place to put the main analysis logic of the runnable.
The write_output() method is called after run() completes. This method is provided as a place to put statements that create dataflow events. It is generally good practice to put dataflow statements here to aid users in understanding and maintaining the runnable.
The post_healthcheck() method is called after write_output() completes. This method is provided as a place to verify that the runnable executed correctly.
There are two possible triggers for calling the post_cleanup() method. It is called immediately after post_healthcheck(), and it is called (if possible) if a job is failing (e.g. if a die statement is reached elsewhere in the runnable). Therefore, this method is somewhat similar to an exception-handling catch block. It should contain code performing cleanup that needs to happen regardless of whether or not the job completed successfully, such as closing database connections or filehandles.
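The call order, including the retry-only pre_cleanup() and the catch-block-like post_cleanup(), can be sketched with a toy Python model (a hypothetical stand-in, not the real base class):

```python
class LifecycleDemo:
    """Toy model of the runnable life cycle described above."""
    def __init__(self, retry_count=0):
        self.retry_count = retry_count
        self.calls = []   # records the order in which phases ran

    # each phase just records that it was invoked
    def pre_cleanup(self):      self.calls.append("pre_cleanup")
    def fetch_input(self):      self.calls.append("fetch_input")
    def run(self):              self.calls.append("run")
    def write_output(self):     self.calls.append("write_output")
    def post_healthcheck(self): self.calls.append("post_healthcheck")
    def post_cleanup(self):     self.calls.append("post_cleanup")

    def life_cycle(self):
        try:
            if self.retry_count > 0:   # only called on a retry
                self.pre_cleanup()
            self.fetch_input()
            self.run()
            self.write_output()
            self.post_healthcheck()
        finally:
            self.post_cleanup()        # also runs if a phase dies

first_try = LifecycleDemo(retry_count=0)
first_try.life_cycle()
assert first_try.calls == ["fetch_input", "run", "write_output",
                           "post_healthcheck", "post_cleanup"]
```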
There are many reasons an eHive pipeline could encounter problems:
- Incomplete or incorrect setup of the eHive system or its associated environment
- Problems with the local compute environment
- Problems with the underlying data
- Lack of availability of resources
- Bugs in the eHive system
In this section, we will describe the tools eHive provides to help diagnose faults. In addition, this section will cover general troubleshooting strategies, and list common problems along with their solutions. See the section on `Error Recovery` for details on how to resume a pipeline once a fault has been diagnosed and corrected.
Several tools are included in eHive to show details of a pipeline's execution:
- The message log
- The hive log directory and submission log directory, which contain
- Beekeeper log files
- Worker log files
- The script, allowing one worker to run and produce output in a defined environment
- The script, allowing a runnable to be executed independently of a pipeline
Message log
The message log stores messages sent from the beekeeper and from workers. Messages in the log are not necessarily indications of trouble. Broadly speaking, they can be categorized into two classes: information messages or error messages. In eHive prior to version 2.5, those were the only two classes available -- indicated by either a 0 or 1 value respectively stored in the is_error column. Starting in eHive version 2.5, the is_error column was replaced with message_class, which expanded the categories available for messages to:
- INFO class messages provide information on run progress, details about the operation of a job, and record certain internal bookkeeping data (such as beekeeper "heartbeats")
- PIPELINE_CAUTION messages are sent when an abnormal condition is detected in a component that affects the entire pipeline, but the condition is not serious enough to stop pipeline execution. Examples of conditions that can generate PIPELINE_CAUTION messages include inconsistencies in semaphore counts, or transient failures to connect to the eHive database.
- PIPELINE_ERROR messages are sent when an abnormal condition is detected in a component that affects the entire pipeline, and the condition is serious enough to stop pipeline execution.
- WORKER_CAUTION messages are sent when a worker encounters an abnormal condition relating to its particular lifecycle or job execution, but the condition is not serious enough to end the worker's lifecycle. Examples of conditions that can generate WORKER_CAUTION messages include a preregistered worker taking a long time to contact the database, or problems updating a worker's resource usage.
- WORKER_ERROR messages are sent when a worker encounters an abnormal condition related to its particular lifecycle or job execution, and that condition causes it to end prematurely. Examples of conditions that can generate WORKER_ERROR messages include failure to compile a runnable, or a runnable generating a failure message.
The log can be viewed in guiHive's log tab, or by directly querying the hive database. In the database, the log is stored in the log_message table. To aid with discovery of relevant messages, eHive also provides a view called msg, which includes analysis logic_names. For example, to find all non-INFO messages for an analysis with a logic_name of "align_sequences" one could run:
`` -url sqlite:///my_hive_db -sql 'SELECT * FROM msg WHERE logic_name="align_sequences" AND message_class != "INFO"'``
.. _hive-log-directory:
Hive log directory
In addition to the message log, eHive is equipped to produce additional debugging output and capture that output in an organised collection of files. There are two options which turn on this output capture: -submit_log_dir and -hive_log_dir.
- -submit_log_dir [directory] stores the job manager's STDERR and STDOUT output (e.g. the output from LSF's -e and -o options) in a collection of directories created under the specified directory. There is one directory per beekeeper per iteration. Each job submission's output is stored in a file named log_default_[pid].[err|out]. If the process is part of a job array, the array index is separated from the pid by an underscore (so the -o output for array job 12345[9] would be stored in the file log_default_12345_9.out).
- -hive_log_dir [directory] stores STDERR and STDOUT from each worker. This includes anything explicitly output in a runnable (e.g. with a Perl print or warn statement), as well as information generated by the worker as it goes through its lifecycle. There is one directory per worker created under the specified directory, indexed by worker id. Two files are created in each worker's directory: worker.err and worker.out storing STDERR and STDOUT respectively.
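The -submit_log_dir file-naming scheme can be sketched with a small helper (hypothetical Python, mirroring the names described above; eHive itself does not expose such a function):

```python
def submit_log_filename(pid, array_index=None, stream="out"):
    """Hypothetical helper mirroring the -submit_log_dir naming above:
    log_default_[pid].[err|out], with an array index appended after an
    underscore for job-array members."""
    suffix = f"_{array_index}" if array_index is not None else ""
    return f"log_default_{pid}{suffix}.{stream}"

assert submit_log_filename(12345, stream="err") == "log_default_12345.err"
# -o output for array job 12345[9]:
assert submit_log_filename(12345, 9) == "log_default_12345_9.out"
```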
.. note::
It is generally safe to restart a beekeeper, or start multiple beekeepers on a pipeline, and have them log to the same -submit_log_dir and/or -hive_log_dir. In the case of -submit_log_dir, each subsequent beekeeper will increment the beekeeper number for the submit output directory. For example, the first beekeeper run on a pipeline will start by creating directory submit_bk1_iter1 for the first loop, followed by submit_bk1_iter2 for the second iteration. A second beekeeper started on that same pipeline will create a submit directory submit_bk2_iter1 for its first iteration, and so on. Worker IDs will also automatically increment within the same pipeline, preventing worker directory names from colliding.
However, if a pipeline is re-initialized, then all beekeeper and worker identifiers will restart from 1. In that case, -submit_log_dir and -hive_log_dir will overwrite files and directories within the specified directory.
The script
The script can be useful for observing the execution of a job or analysis within the context of a pipeline. This script directly runs a worker process in the environment (machine and environment variables) of the command line where it is run. When running a job using runWorker, STDERR and STDOUT can be viewed in the terminal, or redirected in the usual way. There are many command-line options to control the worker's behaviour -- the following are a few that may be useful when diagnosing problems with a particular job or analysis:
- -analyses_pattern and -analysis_id can be used to restrict the worker to claiming jobs from a particular analysis or class of analyses. Note that there is no guarantee of which job out of the jobs in those analyses will be claimed. It could be any READY job (or even a non-READY job if -force 1 is also specified).
- -job_id runs a specific job identified by job id, provided that the job is in a READY state or -force 1 is also specified.
- Combine any of the above with -force 1 to force a worker to run a job even if the job is not READY and/or the analysis is BLOCKED or EXCLUDED.
- -job_limit and -can_respecialize can be set to limit the number of jobs the worker will claim and run. Otherwise, the worker started by runWorker will run until the end of its lifespan, possibly respecializing to claim jobs from different analyses.
- -hive_log_dir works here in the same way as with the beekeeper. See :ref:`hive-log-directory` for details.
- -worker_log_dir will output STDERR and STDOUT into a log directory. Note that this will simply create a file called worker.out in the specified directory. If a worker is run multiple times with -worker_log_dir set to the same directory, only the output from the most recent run will be in worker.out.
- -no_cleanup will leave temporary files in the temporary directory (usually /tmp).
- -no_write will prevent write_output() from being called in runnables.
The script
The script executes a particular runnable, and allows that execution to be partially or completely detached from any existing pipeline. This can be useful to see in detail what a particular runnable is doing, or for checking parameter values. There are many command-line options to control its behaviour -- the following are a few that may be useful when diagnosing problems with a particular job or analysis:
- -url combined with -job_id makes it possible to "clone" a job that already exists in a hive database. When these options are given, the script will copy the parameters of the "donor" job specified by -job_id from the database specified by -url, and use those parameters to create and run a new job of the "donor" job's analysis type. Note that this new job is *not* part of the pipeline. In particular:
- No new job will be created in the job table
- The status of the "cloned" job will not be changed
- Dataflow events will not be passed into the pipeline (unless explicitly directed there using -flow_into)
- Also note that, when "cloning" a job with -url and -job_id, the state of the "donor" job is ignored. It is entirely possible to specify the job_id of a FAILED, SEMAPHORED, READY, or any other state of job. The script will still copy the parameters and attempt to run a job of that analysis type.
- -no_cleanup will leave temporary files in the temporary directory (usually /tmp).
- -no_write will prevent write_output() from being called in the runnable.
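The "cloning" behaviour can be sketched as follows (illustrative Python only; the real standalone script does considerably more):

```python
def clone_job_params(donor_job):
    """Toy sketch of the "cloning" behaviour described above: copy the
    donor's parameters and analysis type regardless of the donor's
    state, without touching the job table or the donor's status."""
    return {
        "analysis": donor_job["analysis"],
        "params": dict(donor_job["params"]),   # independent copy
    }

donor = {"job_id": 42, "status": "FAILED",
         "analysis": "align_sequences", "params": {"seq_file": "a.fa"}}
clone = clone_job_params(donor)
assert clone["params"] == {"seq_file": "a.fa"}
assert donor["status"] == "FAILED"   # donor job is left untouched
```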
.. warning::
If the runnable interacts with files or non-hive databases, it may still do so when running as a standalone job. Take care that important data is not overwritten or deleted in this situation.
- The first indications of problems with a pipeline generally appear in the beekeeper's output and in guiHive, in the form of failed jobs.
- Analyses with failed jobs, and analyses immediately adjacent to them, are good places to start looking for informative messages in the message log.
- When running on a farm, it is possible that certain nodes or groups of nodes are problematic for some reason (e.g. failure to mount NFS shares). The worker table in the database records, in the meadow_host column, which node each worker was submitted to. It is sometimes worth checking to see if there is a common node amongst failed workers. Workers are associated with jobs via the role table, so a query can be constructed to see if failed jobs share a common node or nodes.
- If the failing analysis reads from or writes to the filesystem or another database, checking the relevant files or database tables may reveal clues to the cause of the failure.
- Remember that the beekeeper accepts the -analyses_pattern option, limiting the workers it submits to working on jobs from a specific subset of analyses. This can be useful when restarting the beekeeper with -hive_log_dir to get detailed information about a problematic analysis or analyses.