<p>The name "Hive" comes from the way pipelines are processed by a swarm of autonomous agents.</p>
<ul>
<li>Introduction to eHive: <a href="presentations/HiveWorkshop_Sept2013/index.html">Sept. 2013 workshop</a> (parts <a href="presentations/HiveWorkshop_Sept2013/Slides_part1.pdf">1</a>, <a href="presentations/HiveWorkshop_Sept2013/Slides_part2.pdf">2</a> and <a href="presentations/HiveWorkshop_Sept2013/Slides_part3.pdf">3</a> in PDF)</li>
<li><a href="install.html">Dependencies, installation and setup</a></li>
</ul>
<p>
You only have to choose the format (gif, jpg, png, svg, etc.) by setting the output file extension.
</p>
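<p>For example, to render the current diagram as a PNG (a minimal sketch, reusing the example database URL from the sections below):
<pre>
<b>generate_graph.pl</b> -url sqlite:///my_pipeline_database -output my_pipeline_diagram.png <i># the .png extension selects the output format</i>
</pre>
</p>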
<table><tr>
<td><img src="LongMult_diagram.png" height=450></td>
<td width=50></td>
<td>
<h3>LEGEND:</h3>
<ul>
<li>The rounded nodes on the flow diagram represent Analyses (classes of jobs).</li>
<li>The white rectangular nodes represent Tables that hold user data.</li>
<li>The blue solid arrows are called "dataflow rules". They either generate new jobs (if they point to an Analysis node) or store data (if they point to a Table node).</li>
<li>The red solid arrows with T-heads are "analysis control rules". They block the pointed-at Analysis until all the jobs of the pointing Analysis are done.</li>
<li>Light-blue shadows behind some analyses stand for "semaphore rules". Together with red and green dashed lines they represent our main job control mechanism, which will be described elsewhere.</li>
</ul>
<p>Each flow diagram thus generated is a momentary snapshot of the pipeline state, and these snapshots will change as the pipeline runs.
One of the things changing will be the colour of the Analysis nodes. The default colour legend is as follows:
<ul>
<li><spanstyle="background-color:white"> [ EMPTY ] </span> : the Analysis never had any jobs to do. Since pipelines are dynamic it may be ok for some Analyses to stay EMPTY until the very end.</li>
<li><spanstyle="background-color:DeepSkyBlue"> [ DONE ] </span> : all jobs of the Analysis are DONE. Since pipelines are dynamic, it may be a temporary state, until new jobs are added.</li>
<li><spanstyle="background-color:green"> [ READY ] </span> : some jobs are READY to be run, but nothing is running at the moment.</li>
<li><spanstyle="background-color:yellow"> [ IN PROGRESS ] </span> : some jobs of the Analysis are being processed at the moment of the snapshot.</li>
<li><spanstyle="background-color:grey"> [ BLOCKED ] </span> : none of the jobs of this Analysis can be run at the moment because of job dependency rules.</li>
<li><spanstyle="background-color:red"> [ FAILED ] </span> : the number of FAILED jobs in this Analysis has gone over a threshold (which is 0 by default). By default <b>beepeeper.pl</b> will exit if it encounters a FAILED analysis.</li>
</ul>
</p>
<p>
Another thing that will change from snapshot to snapshot is the job "breakout" formula displayed under the name of each Analysis.
It shows how many jobs are in which state and the total number of jobs. Separate parts of this formula are similarly colour-coded.
</p>
<p>You can also restrict running to a subset of Analyses (either by analysis_id or by name pattern):
<pre>
<b>beekeeper.pl</b> -url sqlite:///my_pipeline_database -analyses_pattern 'alignment_%' -loop <i># all analyses whose name starts with 'alignment_'</i>
</pre>
or
<pre>
<b>beekeeper.pl</b> -url sqlite:///my_pipeline_database -analyses_pattern '1..5,fasta_check' -loop <i># only analyses with analysis_id between 1 and 5 and 'fasta_check'</i>
</pre>
</p>
<p>To make sure the <b>beekeeper.pl</b> process doesn't die when you disconnect your ssh session from the farm,
it is normally run inside a "screen" session.<br/>
If your Beekeeper process gets killed for some reason, don't worry - you can re-sync and start another Beekeeper process.
It will pick up from where the previous one left off.
</p>
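<p>For example (a minimal sketch; the session name is arbitrary):
<pre>
screen -S my_pipeline_run                                    <i># start a named screen session; detach with Ctrl-a d</i>
<b>beekeeper.pl</b> -url sqlite:///my_pipeline_database -loop       <i># run the Beekeeper inside it</i>
</pre>
If a Beekeeper dies, you can re-sync the job counts and start looping again:
<pre>
<b>beekeeper.pl</b> -url sqlite:///my_pipeline_database -sync       <i># re-synchronize the job counts</i>
<b>beekeeper.pl</b> -url sqlite:///my_pipeline_database -loop       <i># start a new Beekeeper loop</i>
</pre>
</p>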
<hr width="50%"/>
<h2>Monitoring the progress via a direct database session</h2>
<p>In addition to monitoring the visual flow diagram (which can be generated manually using <b>generate_graph.pl</b> or via the <b>guiHive</b> web app)
you can also connect to the pipeline database directly and issue SQL commands. To avoid typing in all the connection details (the syntax differs
depending on the particular database engine used) you can use the bespoke <b>db_cmd.pl</b> that takes the eHive database URL and performs the connection for you:
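<pre>
<b>db_cmd.pl</b> -url sqlite:///my_pipeline_database <i># reusing the example database URL from above</i>
</pre>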
Once connected, you can run SQL queries against the eHive schema (see the <a href="hive_schema.png">eHive schema diagram</a> and <a href="hive_schema.html">eHive schema description</a>).
</p>
<p>In addition to the tables, there is a "progress" view from which you can select and see how your jobs are doing:
<pre>
SELECT * FROM progress;
</pre>
</p>
<p>If you see jobs in 'FAILED' state or jobs with retry_count>0 (which means they have failed at least once and had to be retried),
you may need to look at the "msg" view in order to find out the reason for the failures:
<pre>
SELECT * FROM msg WHERE job_id=1234; -- a specific job
</pre>
or
<pre>
SELECT * FROM msg WHERE analysis_id=15; -- jobs of a specific analysis
</pre>
or
<pre>
SELECT * FROM msg; -- show me all messages
</pre>
</p>
<p>Some of the messages indicate temporary errors (such as a temporary lack of connectivity with a database or file),
but others may be critical (e.g. a wrong path to a binary) and will eventually make all jobs of an analysis fail.
If the "is_error" flag of a message is false, it may just be a diagnostic message which is not critical.</p>