Commit 681ee109 authored by Matthieu Muffato's avatar Matthieu Muffato

Added a Sphinx extension to generate eHive analysis diagrams for sample PipeConfigs

parent 9e019329
user_manual/creating_pipelines/dataflows
\ No newline at end of file
@@ -43,6 +43,5 @@ Internal scripts
:titlesonly:
scripts/create_sql_patches.rst
scripts/make_branch_glossary.rst
scripts/make_docs.rst
=========================
make\_branch\_glossary.pl
=========================
NAME
----
scripts/make\_branch\_glossary.pl
DESCRIPTION
-----------
::
An internal eHive script for regenerating the document that lists all (or at least most) of the dataflow patterns.
LICENSE
-------
::
Copyright [1999-2015] Wellcome Trust Sanger Institute and the EMBL-European Bioinformatics Institute
Copyright [2016-2017] EMBL-European Bioinformatics Institute
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License
is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and limitations under the License.
CONTACT
-------
::
Please subscribe to the Hive mailing list: http://listserver.ebi.ac.uk/mailman/listinfo/ehive-users to discuss Hive-related questions or to be notified of our updates.
@@ -19,7 +19,9 @@ import sphinx_rtd_theme
# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
#sys.path.insert(0, os.path.abspath('.'))
sys.path.insert(0, os.path.abspath('.'))
from xhive import *
# -- General configuration ------------------------------------------------
@@ -344,3 +346,4 @@ epub_exclude_files = ['search.html']
# (see https://github.com/rtfd/sphinx_rtd_theme/issues/117)
def setup(app):
app.add_stylesheet("theme_overrides.css")
app.add_directive('hive_diagram', HiveDiagramDirective)
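For context, a minimal sketch of what such a directive can look like. This is an illustration only, not the committed ``xhive`` code (which also takes care of generating the PNG from the PipeConfig snippet); it merely shows the Docutils plumbing behind a directive that renders the snippet next to a pre-generated diagram::

    # Hypothetical sketch of xhive.HiveDiagramDirective -- illustration only.
    from docutils import nodes
    from docutils.parsers.rst import Directive

    class HiveDiagramDirective(Directive):
        required_arguments = 1      # path of the diagram image, e.g. dataflow_targets/101.png
        has_content = True          # the PipeConfig snippet to display next to the diagram

        def run(self):
            snippet = '\n'.join(self.content)
            # Show the snippet as a literal block, followed by the diagram image.
            literal = nodes.literal_block(snippet, snippet)
            image = nodes.image(uri=self.arguments[0])
            return [literal, image]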
@@ -13,7 +13,7 @@ Dataflow to one analysis
This is what we have used in the Dataflow document. Simply name the target analysis after the ``=>``.
::
.. hive_diagram:: dataflow_targets/101.png
{ -logic_name => 'A',
-flow_into => {
@@ -23,7 +23,6 @@ This is what we have used in the Dataflow document. Simply name the target analy
{ -logic_name => 'B',
},
.. figure:: dataflow_targets/101.png
Dataflow to multiple analyses
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -31,7 +30,7 @@ Dataflow to multiple analyses
A branch can actually be connected to multiple analyses. When a Dataflow
event happens, it will create a job in each of them.
::
.. hive_diagram:: dataflow_targets/102.png
{ -logic_name => 'A',
-flow_into => {
@@ -43,7 +42,6 @@ event happens, it will create a job in each of them.
{ -logic_name => 'C',
},
.. figure:: dataflow_targets/102.png
Multiple dataflows to the same analysis
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -53,7 +51,7 @@ from the same analysis.
Here, jobs are created in B whenever there is an event on branch #2, in C
when there is an event on branch #2 or #3, and D when there is an event on branch #1.
::
.. hive_diagram:: dataflow_targets/103.png
{ -logic_name => 'A',
-flow_into => {
@@ -69,7 +67,6 @@ when there is an event on branch #2 or #3, and D when there is an event on branc
{ -logic_name => 'D',
},
.. figure:: dataflow_targets/103.png
Table
-----
@@ -83,7 +80,7 @@ This is what we have used in the Dataflow document. Simply name the target analy
with a URL that contains the ``table_name`` key. URLs can be *degenerate*, i.e. skip the part before
the question mark (like below) or *completely defined*, i.e. start with ``driver://user@host/database_name``.
::
.. hive_diagram:: dataflow_targets/201.png
{ -logic_name => 'A',
-flow_into => {
@@ -91,7 +88,6 @@ the question mark (like below) or *completely defined*, i.e. start with ``driver
},
},
.. figure:: dataflow_targets/201.png
Dataflow to multiple tables
~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -99,7 +95,7 @@ Dataflow to multiple tables
A branch can actually be connected to multiple tables. When a Dataflow
event happens, it will create a row in each of them.
::
.. hive_diagram:: dataflow_targets/202.png
{ -logic_name => 'A',
-flow_into => {
@@ -107,7 +103,6 @@ event happens, it will create a row in each of them.
},
},
.. figure:: dataflow_targets/202.png
Multiple dataflows to tables and analyses
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -119,7 +114,7 @@ In the example below, a row from the table C will typically not have information
about the analysis (job) that generated it.
This can, however, be enabled by explicitly adding the job_id to the dataflow payload.
::
.. hive_diagram:: dataflow_targets/203.png
{ -logic_name => 'A',
-flow_into => {
@@ -135,7 +130,6 @@ This can however be enabled by explicitly adding the job_id to the dataflow payl
},
},
.. figure:: dataflow_targets/203.png
Accumulator
-----------
@@ -152,7 +146,7 @@ of accumulators (scalar, pile, multiset, array and hash), all described in :doc:
Accumulators can **only** be connected to *fan* analyses of a semaphore group. All the data flowed into them
is *accumulated* and passed on to the *funnel* once the latter is released.
::
.. hive_diagram:: dataflow_targets/301.png
{ -logic_name => 'A',
-flow_into => {
@@ -168,7 +162,6 @@ is *accumulated* and passed on to the *funnel* once the latter is released.
{ -logic_name => 'D',
},
.. figure:: dataflow_targets/301.png
Multiple accumulators and semaphore propagation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -177,8 +170,7 @@ During the semaphore propagation, more jobs are added to the current semaphore-g
in order to block the current funnel. Similarly a funnel may receive data from multiple
accumulators (possibly fed by different analyses) of a semaphore-group.
::
.. hive_diagram:: dataflow_targets/302.png
{ -logic_name => 'A',
-flow_into => {
@@ -200,5 +192,4 @@ accumulators (possibly fed by different analyses) of a semaphore-group.
{ -logic_name => 'D',
}
.. figure:: dataflow_targets/302.png
Analysis
--------

In eHive, a job can create another job via a Dataflow event by wiring the branch to another analysis.

Dataflow to one analysis
~~~~~~~~~~~~~~~~~~~~~~~~

This is what we have used in the Dataflow document. Simply name the target analysis after the ``=>``.

.. hive_diagram:: dataflow_targets/101.png

    { -logic_name => 'A',
      -flow_into => {
        1 => [ 'B' ],
      },
    },
    { -logic_name => 'B',
    },

Dataflow to multiple analyses
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A branch can actually be connected to multiple analyses. When a Dataflow
event happens, it will create a job in each of them.

.. hive_diagram:: dataflow_targets/102.png

    { -logic_name => 'A',
      -flow_into => {
        1 => [ 'B', 'C' ],
      },
    },
    { -logic_name => 'B',
    },
    { -logic_name => 'C',
    },

Multiple dataflows to the same analysis
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Reciprocally, an analysis can be the target of several branches coming
from the same analysis.
Here, jobs are created in B whenever there is an event on branch #2, in C
when there is an event on branch #2 or #3, and in D when there is an event on branch #1.

.. hive_diagram:: dataflow_targets/103.png

    { -logic_name => 'A',
      -flow_into => {
        2 => [ 'B', 'C' ],
        3 => [ 'C' ],
        1 => [ 'D' ],
      },
    },
    { -logic_name => 'B',
    },
    { -logic_name => 'C',
    },
    { -logic_name => 'D',
    },

Table
-----

A job can store data in a table via the Dataflow mechanism instead of raw SQL access.

Dataflow to one table
~~~~~~~~~~~~~~~~~~~~~

This is what we have used in the Dataflow document. Simply name the target table after the ``=>``,
with a URL that contains the ``table_name`` key. URLs can be *degenerate*, i.e. skip the part before
the question mark (like below), or *completely defined*, i.e. start with ``driver://user@host/database_name``
(a fully defined example is sketched after the diagram).

.. hive_diagram:: dataflow_targets/201.png

    { -logic_name => 'A',
      -flow_into => {
        1 => [ '?table_name=B' ],
      },
    },
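For comparison, a *completely defined* table URL spells out the driver, user, host and database name. A hypothetical sketch (the server, user and database names below are made up)::

    { -logic_name => 'A',
      -flow_into => {
        1 => [ 'mysql://ensadmin@localhost/my_hive_db?table_name=B' ],
      },
    },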
Dataflow to multiple tables
~~~~~~~~~~~~~~~~~~~~~~~~~~~

A branch can actually be connected to multiple tables. When a Dataflow
event happens, it will create a row in each of them.

.. hive_diagram:: dataflow_targets/202.png

    { -logic_name => 'A',
      -flow_into => {
        1 => [ '?table_name=B', '?table_name=C' ],
      },
    },

Multiple dataflows to tables and analyses
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

An analysis can dataflow to multiple targets, both of the analysis and table types.
Rows inserted by table-dataflows are usually not linked to the emitting job_id.
In the example below, a row from table C will typically not carry information
about the job that generated it.
This can, however, be enabled by explicitly adding the job_id to the dataflow
payload (see the sketch after the diagram).

.. hive_diagram:: dataflow_targets/203.png

    { -logic_name => 'A',
      -flow_into => {
        2 => [ 'B', '?table_name=C' ],
        1 => [ 'D' ],
      },
    },
    { -logic_name => 'B',
    },
    { -logic_name => 'D',
      -flow_into => {
        3 => [ '?table_name=C' ],
      },
    },
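A minimal sketch of how the job_id can be added to the payload, inside analysis A's runnable. It assumes table C has a ``job_id`` column; ``dataflow_output_id`` and ``input_job`` are part of the eHive Process API, while the ``value`` field is a made-up payload parameter::

    # In the runnable's write_output(), flow the emitting job's dbID
    # together with the rest of the payload on branch #2:
    $self->dataflow_output_id( {
        'job_id' => $self->input_job->dbID,
        'value'  => $self->param('value'),    # hypothetical payload field
    }, 2 );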
Accumulator
-----------

The last type of dataflow-target is called an *accumulator*. It is a way of passing data from *fan* jobs
to their *funnel*.

Single accumulator
~~~~~~~~~~~~~~~~~~

An accumulator is defined with a special URL that contains the ``accu_name`` key. There are five types
of accumulators (scalar, pile, multiset, array and hash), all described in :doc:`accumulators`.
Accumulators can **only** be connected to *fan* analyses of a semaphore group. All the data flowed into them
is *accumulated* and passed on to the *funnel* once the latter is released.

.. hive_diagram:: dataflow_targets/301.png

    { -logic_name => 'A',
      -flow_into => {
        '2->A' => [ 'B' ],
        'A->1' => [ 'D' ],
      },
    },
    { -logic_name => 'B',
      -flow_into => {
        1 => [ '?accu_name=pile_accu&accu_input_variable=variable_name&accu_address=[]' ],
      },
    },
    { -logic_name => 'D',
    },
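The other accumulator flavours use the same URL keys. For instance, a *hash* accumulator keyed by one of the fan job's parameters could hypothetically be wired as below (``key_name`` and ``variable_name`` are placeholders)::

    { -logic_name => 'B',
      -flow_into => {
        1 => [ '?accu_name=hash_accu&accu_input_variable=variable_name&accu_address={key_name}' ],
      },
    },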
Multiple accumulators and semaphore propagation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

During the semaphore propagation, more jobs are added to the current semaphore-group
in order to block the current funnel. Similarly, a funnel may receive data from multiple
accumulators (possibly fed by different analyses) of a semaphore-group.

.. hive_diagram:: dataflow_targets/302.png

    { -logic_name => 'A',
      -flow_into => {
        '2->A' => [ 'B' ],
        'A->1' => [ 'D' ],
      },
    },
    { -logic_name => 'B',
      -flow_into => {
        2 => [ 'C' ],
        1 => [ '?accu_name=pile_accu&accu_input_variable=variable_name&accu_address=[]' ],
      },
    },
    { -logic_name => 'C',
      -flow_into => {
        1 => [ '?accu_name=multiset_accu&accu_input_variable=variable_name&accu_address={}' ],
      },
    },
    { -logic_name => 'D',
    }
@@ -14,7 +14,7 @@ Autoflow
Upon success, each job from A will generate a Dataflow event on branch #1, which is connected to analysis B. This is called
*autoflow*, as jobs seem to automatically flow from A to B.
::
.. hive_diagram:: dataflows/101.png
{ -logic_name => 'A',
-flow_into => {
@@ -24,14 +24,13 @@ Upon success, each job from A will generate a Dataflow event on branch #1, which
{ -logic_name => 'B',
},
.. figure:: dataflows/101.png
Autoflow v2
~~~~~~~~~~~
Same as above, but more concise.
::
.. hive_diagram:: dataflows/102.png
{ -logic_name => 'A',
-flow_into => [ 'B' ],
@@ -39,14 +38,13 @@ Same as above, but more concise.
{ -logic_name => 'B',
},
.. figure:: dataflows/102.png
Autoflow v3
~~~~~~~~~~~
Same as above, but even more concise.
::
.. hive_diagram:: dataflows/103.png
{ -logic_name => 'A',
-flow_into => 'B'
@@ -54,7 +52,6 @@ Same as above, but even more concise
{ -logic_name => 'B',
},
.. figure:: dataflows/103.png
Custom, independent, dataflows
------------------------------
@@ -68,7 +65,7 @@ Factory
Analysis A triggers 0, 1 or many Dataflow events on branch #2 (this is the convention for non-autoflow events).
In this pattern, A is called the *factory*, B the *fan*.
::
.. hive_diagram:: dataflows/201.png
{ -logic_name => 'A',
-flow_into => {
@@ -78,7 +75,6 @@ In this pattern, A is called the *factory*, B the *fan*.
{ -logic_name => 'B',
},
.. figure:: dataflows/201.png
Factory in parallel of the autoflow
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -89,7 +85,7 @@ was thus lost. You can in fact have both branches connected.
An analysis can use multiple branches at the same time and, for instance, produce a fan of jobs on branch #2
*and* still a job on branch #1. Both streams of jobs (B and C) are executed in parallel.
::
.. hive_diagram:: dataflows/202.png
{ -logic_name => 'A',
-flow_into => {
@@ -102,7 +98,6 @@ An analysis can use multiple branches at the same time and for instance produce
{ -logic_name => 'C',
},
.. figure:: dataflows/202.png
Many factories and an autoflow
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -112,7 +107,7 @@ They however have to be integers, preferably positive integers for the sake of
this tutorial, as negative branch numbers have a special meaning (which is
addressed in :doc:`events`).
::
.. hive_diagram:: dataflows/203.png
{ -logic_name => 'A',
-flow_into => {
@@ -134,7 +129,6 @@ addressed in :doc:`events`).
{ -logic_name => 'F',
},
.. figure:: dataflows/203.png
Dependent dataflows and semaphores
----------------------------------
@@ -156,7 +150,7 @@ has to wait for *all* the jobs in group **A** before it can start.
This pattern is called a *semaphore*, and C is called the *funnel* analysis.
::
.. hive_diagram:: dataflows/301.png
{ -logic_name => 'A',
-flow_into => {
@@ -169,7 +163,6 @@ This pattern is called a *semaphore*, and C is called the *funnel* analysis.
{ -logic_name => 'C',
},
.. figure:: dataflows/301.png
Semaphore propagation
~~~~~~~~~~~~~~~~~~~~~
@@ -183,8 +176,7 @@ the jobs these may have created in D as well.
This process is called *semaphore propagation*.
::
.. hive_diagram:: dataflows/302.png
{ -logic_name => 'A',
-flow_into => {
@@ -202,7 +194,6 @@ This process is called *semaphore propagation*.
{ -logic_name => 'D',
},
.. figure:: dataflows/302.png
Semaphore independent from the autoflow
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -221,7 +212,7 @@ emit the events in the right order. There are as many semaphore groups as events
each job created on branch #2 is the *funnel* of 0, 1 or many jobs of the *fan* that is defined
on branch #3.
::
.. hive_diagram:: dataflows/303.png
{ -logic_name => 'A',
-flow_into => {
@@ -234,7 +225,6 @@ on branch #3.
{ -logic_name => 'C',
},
.. figure:: dataflows/303.png
Mixing all patterns
~~~~~~~~~~~~~~~~~~~
@@ -245,7 +235,7 @@ with the jobs created in the analysis D.
Upon success of the A job, the *autoflow* will create a job in analysis E, which is *not* controlled
by any of the B or C jobs. It can thus start immediately.
::
.. hive_diagram:: dataflows/304.png
{ -logic_name => 'A',
-flow_into => {
@@ -266,5 +256,4 @@ by any of the B or C jobs. It can thus start immediately.
{ -logic_name => 'E',
},
.. figure:: dataflows/304.png
Autoflow
--------

*Autoflow* is the default event that happens between consecutive analyses.

Autoflow
~~~~~~~~

Upon success, each job from A will generate a Dataflow event on branch #1, which is connected to analysis B. This is called
*autoflow*, as jobs seem to automatically flow from A to B.

.. hive_diagram:: dataflows/101.png

    { -logic_name => 'A',
      -flow_into => {
        1 => [ 'B' ],
      },
    },
    { -logic_name => 'B',
    },

Autoflow v2
~~~~~~~~~~~

Same as above, but more concise.

.. hive_diagram:: dataflows/102.png

    { -logic_name => 'A',
      -flow_into => [ 'B' ],
    },
    { -logic_name => 'B',
    },