Commit 88a4ad61 authored by Brandon Walts, committed by Matthieu Muffato

Transformed some documentation to RST

parent c658d688
How to use MPI on eHive
=======================
---
With this tutorial, our goal is to give you some insight into how to set up the Hive to run jobs using Shared Memory Parallelism (threads) and Distributed Memory Parallelism (MPI).

First of all, your institution / compute-farm provider may have documentation on this topic; please refer to it for implementation details (intranet-only links: [EBI](http://www.ebi.ac.uk/systems-srv/public-wiki/index.php/EBI_Good_Computing_Guide), [Sanger institute](http://mediawiki.internal.sanger.ac.uk/index.php/How_to_run_MPI_jobs_on_the_farm)).

We won't discuss the inner workings of the modules, but real examples can be found in the [ensembl-compara](https://github.com/Ensembl/ensembl-compara) repository. It ships modules used for phylogenetic tree inference: [RAxML](https://github.com/Ensembl/ensembl-compara/blob/release/77/modules/Bio/EnsEMBL/Compara/RunnableDB/ProteinTrees/RAxML.pm) and [ExaML](https://github.com/Ensembl/ensembl-compara/blob/feature/update_pipeline/modules/Bio/EnsEMBL/Compara/RunnableDB/ProteinTrees/ExaML.pm). They look very lightweight (only command-line definitions) because most of the logic is in the base class (*GenericRunnable*), but they nevertheless show the command lines used and the parametrization of multi-core and MPI runs.
---
How to setup a module using Shared Memory Parallelism (threads)
---------------------------------------------------------------
> If you have already compiled your code and know how to enable the use of multiple threads / cores, this case should be very straightforward. It basically consists of defining the proper resource class in your pipeline. We also include some tips on how to compile code in an MPI environment, but be aware that the details will vary across systems.

1. You need to set up a resource class that encodes those requirements, e.g. *16 cores and 24Gb of RAM*:

        sub resource_classes {
            my ($self) = @_;
            return {
                #...
                '24Gb_16_core_job' => { 'LSF' => '-n 16 -M24000 -R"select[mem>24000] span[hosts=1] rusage[mem=24000]"' },
                #...
            };
        }
2. You need to add the analysis to your PipeConfig file:

        { -logic_name => 'app_multi_core',
          -module     => 'Namespace::Of::Thread_app',
          -parameters => {
              'app_exe' => $self->o('app_pthreads_exe'),
              'cmd'     => '#app_exe# -T 16 -input #alignment_file#',
          },
          -rc_name => '24Gb_16_core_job',
        },

We would like to call your attention to the `cmd` parameter, where we define the command line used to run Thread_app. The actual command line will vary between programs; in this case, the `-T` parameter sets the number of threads to 16. You should check the documentation of the code you want to run to find out how to define the number of threads it will use.

With just this basic configuration, the Hive is able to run Thread_app on 16 cores.
---
How to setup a module using Distributed Memory Parallelism (MPI)
---------------------------------------------------------------
> This case requires a bit more attention, so please be very careful to include / load the right libraries / modules.

### Tips for compiling for MPI
MPI usually comes in two implementations: OpenMPI and MPICH2. One of the most common sources of problems is to compile the code with one MPI implementation and try to run it with another. You must compile and run your code with the **same** MPI implementation.

This can easily be taken care of by properly setting up your `.bashrc` to load the right modules.

If you have access to Intel compilers, we strongly recommend trying to compile your code with them and checking for performance improvements.
#### If your compute environment uses [Module](http://modules.sourceforge.net/)
*Module* provides configuration files (module-files) for the dynamic modification of the user's environment.

Here is how to list the modules that your system provides:

    module avail

And here is how to load one (OpenMPI in this example):

    module load openmpi-x86_64

Don't forget to put this line in your `~/.bashrc` so that it is automatically loaded.
#### Otherwise, follow the recommended usage in your institute
If you don't have modules for the MPI environment available on your system, please make sure you include the right libraries (`PATH`, and any other relevant environment variables).
### The Hive bit
Here again, once the environment is properly set up, we only have to define the correct resource class and command lines in the Hive.

1. You need to set up a resource class that uses, e.g., *64 cores and 16Gb of RAM*:

        sub resource_classes {
            my ($self) = @_;
            return {
                # ...
                '16Gb_64c_mpi' => { 'LSF' => '-q mpi -a openmpi -n 64 -M16000 -R"select[mem>16000] rusage[mem=16000] same[model] span[ptile=4]"' },
                # ...
            };
        }
The resource description is specific to our LSF environment, so adapt it to yours, but note that:
* `-q mpi -a openmpi` tells LSF that the job will run in the MPI/OpenMPI environment
* `same[model]` ensures that the selected compute nodes all have the same hardware. You may also need something like `select[avx]` to select the nodes that have the [AVX instruction set](http://en.wikipedia.org/wiki/Advanced_Vector_Extensions)
* `span[ptile=4]` specifies the granularity at which LSF splits the job slots across nodes: here, 4 slots per node, so a 64-slot job spans 16 nodes. This might affect queuing times.
2. You need to add the analysis to your PipeConfig file:

        { -logic_name => 'MPI_app',
          -module     => 'Bio::EnsEMBL::Compara::RunnableDB::ProteinTrees::MPI_app',
          -parameters => {
              'mpi_exe' => $self->o('mpi_exe'),
          },
          -rc_name => '16Gb_64c_mpi',
          # ...
        },
---
How to write a module that uses MPI
-----------------------------------
Here is an excerpt of Ensembl Compara's [ExaML](https://github.com/Ensembl/ensembl-compara/blob/feature/update_pipeline/modules/Bio/EnsEMBL/Compara/RunnableDB/ProteinTrees/ExaML.pm) MPI module. Note that LSF needs the MPI command to be run through `mpirun.lsf`, and that the `-np 64` given to it matches the 64 cores requested in the resource class. You can also run several single-threaded commands in the same Runnable.

    sub param_defaults {
        my $self = shift;
        return {
            %{ $self->SUPER::param_defaults },
            'cmd' => 'cmd 1 ; cmd 2 ; mpirun.lsf -np 64 -mca btl tcp,self #examl_exe# -examl_parameter_1 value1 -examl_parameter_2 value2',
        };
    }
### !!!Temporary files!!!
Because ExaML uses MPI, it has to be run from a directory that is shared across all the compute nodes. Here we override the eHive method to use #examl_dir# instead:

    sub worker_temp_directory_name {
        my $self = shift @_;
        my $username = $ENV{'USER'};
        my $worker_id = $self->worker ? $self->worker->dbID : "standalone.$$";
        return $self->param('examl_dir')."/worker_${username}.${worker_id}/";
    }
How to connect eHive to Slack
=============================
---------
> With this tutorial, our goal is to explain how to configure eHive and
> Slack so that eHive can report messages to a Slack channel.

> First of all, you obviously need to have a Slack team. You or someone
> else will have to be allowed to configure Apps.
---------
1. Let's first add the "Incoming WebHooks" app to your team
    1. In the main Slack menu, select "Apps & Custom Integrations"
    2. Find "Incoming WebHooks" via the search box and select it
    3. You should now be on a page that gives an introduction to WebHooks and lists the teams you belong to. Somebody may have already configured some WebHooks for your team.
        1. If that is the case, click on the "Configure" button next to your team name and then "Add Configuration"
        2. Otherwise, click on the "Install" button next to your team name
2. Let's now configure a webhook to use with eHive
    1. You first need to choose the channel eHive will write to. Although the Slack API allows overriding the channel, and thus using a single webhook to post to different channels, we advise configuring one webhook per channel
    2. Click "Add Incoming WebHooks Integration"
    3. The page now shows the advanced configuration of the integration. The most important item here is the "Webhook URL": this is what eHive needs
    4. If you scroll down to "Integration Settings", you can give the WebHook a description and change its name and emoji. Note that the latter can be overridden in the SlackNotification Runnable
3. Use the WebHook in eHive
    1. Define the `EHIVE_SLACK_WEBHOOK` environment variable when running your beekeeper
    2. Configure the `slack_webhook` parameter of the SlackNotification Runnable
@@ -16,10 +16,8 @@ The name "Hive" comes from the way pipelines are processed by a swarm
<ul>
<li>Introduction to eHive: <a href="presentations/HiveWorkshop_Sept2013/index.html">Sept. 2013 workshop</a> (parts <a href="presentations/HiveWorkshop_Sept2013/Slides_part1.pdf">1</a>, <a href="presentations/HiveWorkshop_Sept2013/Slides_part2.pdf">2</a> and <a href="presentations/HiveWorkshop_Sept2013/Slides_part3.pdf">3</a> in PDF)</li>
<li><a href="install.html">Dependencies, installation and setup</a></li>
<li><a href="running_eHive_pipelines.html">Running eHive pipelines</a></li>
<li><a href="long_mult_walkthrough/long_mult_walkthrough.html">Long Multiplication pipeline walkthrough (with lots of pictures)</a></li>
<li><a href="MPI_howto.md">How to run MPI applications with eHive</a></li>
<li><a href="hive_schema.html">Database schema</a></li>
<li><a href="doxygen/index.html">API Doxygen documentation</a></li>
<li><a href="../wrappers/python3/doxygen/index.html">Python Doxygen documentation (still in beta)</a></li>
<html>
<head>
<title>eHive installation and setup</title>
<link rel="stylesheet" type="text/css" media="all" href="ehive_doc.css" />
</head>
<body>
<center><h1>eHive installation and setup</h1></center>
<hr width=50% />
<center><h2>eHive dependencies</h2></center>
The eHive system depends on the following components, which you may need to download and install first:
<ol>
<li>Perl 5.10 <a href=http://www.perl.org/get.html>or higher</a></li>
<li>A database engine of your choice. eHive keeps its state in a database, so you will need
<ol>
<li>a server installed on the machine where you want to maintain the state of your pipeline(s) and</li>
<li>clients installed on the machines where the jobs are to be executed.</li>
</ol>
At the moment, the following database options are available:
<ul>
<li>MySQL 5.1 <a href=http://dev.mysql.com/downloads/>or higher</a>.<br/>
<b>Warning:</b> eHive is not compatible with MySQL 5.6.20, but works with versions 5.6.16 and 5.6.23. We suggest avoiding the 5.6.[17-22] interval.
</li>
<li>SQLite 3.6 <a href=http://www.sqlite.org/download.html>or higher</a></li>
<li>PostgreSQL 9.2 <a href=http://www.postgresql.org/download/>or higher</a></li>
</ul>
</li>
<li>GraphViz visualization package (includes "dot" executable and libraries used by the Perl dependencies).
<ol>
<li>Check in your terminal that you have "dot" installed</li>
<li>If not, visit <a href=http://graphviz.org/>graphviz.org</a> to download it</li>
</ol>
</li>
<li>cpanm -- a handy utility to recursively install Perl dependencies.
<ol>
<li>Check in your terminal that you have "cpanm" installed</li>
<li>If not, visit <a href=https://cpanmin.us>cpanmin.us</a> to download it (just read and follow the instructions in the header of the script)</li>
</ol>
</li>
</ol>
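<p>
As a quick sanity check (a sketch assuming a typical Linux shell; replace the database client with the engine you chose), you can confirm the dependencies from your terminal:
</p>
<pre>
perl -v              # should report 5.10 or higher
mysql --version      # or: sqlite3 --version / psql --version
dot -V               # GraphViz
cpanm --version
</pre>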
<hr width=50% />
<center><h2>Installing eHive code</h2></center>
<h3>Check out the repository by cloning it from GitHub:</h3>
<p>
All eHive pipelines will require the ensembl-hive repository, which can be found on
<a href="https://github.com/Ensembl/ensembl-hive">GitHub</a>. As such it is assumed that <a href="http://git-scm.com/">Git</a> is
installed on your system, if not follow the instructions <a href="https://help.github.com/articles/set-up-git/">here</a>
</p>
<p>
To download the repository, move to a suitable directory and run the following on the
command line:
</p>
<pre>
git clone https://github.com/Ensembl/ensembl-hive.git
</pre>
<p>
This will create an ensembl-hive directory with all the code and documentation.<br/>
If you cd into the ensembl-hive directory and do an ls, you should see something like the
following:
</p>
<pre>
ls
Changelog docs hive_config.json modules README.md scripts sql t
</pre>
The major directories here are:
<dl>
<dt>modules</dt>
<dd>This contains all the eHive modules, which are written in Perl</dd>
<dt>scripts</dt>
<dd>Has various scripts that are key to initialising, running and debugging the pipeline</dd>
<dt>sql</dt>
<dd>Contains the SQL used to build a standard pipeline database</dd>
</dl>
<hr width=50% />
<center><h2>Use cpanm to recursively install the Perl dependencies declared in ensembl-hive/cpanfile</h2></center>
<pre>
cd ensembl-hive
cpanm --installdeps .
</pre>
<p>
If installation of either DBD::mysql or DBD::Pg fails, check that the corresponding database system (MySQL or PostgreSQL) was installed correctly.
</p>
<hr width=50% />
<center><h2>Optional configuration of the system:</h2></center>
<p>
You may find it convenient (although it is not necessary) to add "ensembl-hive/scripts"
to your <code>$PATH</code> variable to make it easier to run beekeeper.pl and other useful Hive scripts.
</p>
<ul>
<li><i>using bash syntax:</i>
<pre>
export PATH=$PATH:$ENSEMBL_CVS_ROOT_DIR/ensembl-hive/scripts<i>
#
# (for best results, append this line to your ~/.bashrc or ~/.bash_profile configuration file)</i>
</pre></li>
<li><i>using [t]csh syntax:</i>
<pre>
set path = ( $path ${ENSEMBL_CVS_ROOT_DIR}/ensembl-hive/scripts )<i>
#
# (for best results, append this line to your ~/.cshrc or ~/.tcshrc configuration file)</i>
</pre></li>
</ul>
<p>
Also, if you are developing the code and not just running existing pipelines,
you may find it convenient to add "ensembl-hive/modules" to your <code>$PERL5LIB</code> variable.
</p>
<ul>
<li><i>using bash syntax:</i>
<pre>
export PERL5LIB=${PERL5LIB}:${ENSEMBL_CVS_ROOT_DIR}/ensembl/modules
export PERL5LIB=${PERL5LIB}:${ENSEMBL_CVS_ROOT_DIR}/ensembl-hive/modules<i>
#
# (for best results, append these lines to your ~/.bashrc or ~/.bash_profile configuration file)</i>
</pre></li>
<li><i>using [t]csh syntax:</i>
<pre>
setenv PERL5LIB ${PERL5LIB}:${ENSEMBL_CVS_ROOT_DIR}/ensembl/modules
setenv PERL5LIB ${PERL5LIB}:${ENSEMBL_CVS_ROOT_DIR}/ensembl-hive/modules<i>
#
# (for best results, append these lines to your ~/.cshrc or ~/.tcshrc configuration file)</i>
</pre></li>
</ul>
</body>
</html>
How to use MPI on eHive
=======================
With this tutorial, our goal is to give you some insight into how to set
up the Hive to run jobs using Shared Memory Parallelism (threads) and
Distributed Memory Parallelism (MPI).

First of all, your institution / compute-farm provider may have
documentation on this topic; please refer to it for implementation
details (intranet-only links:
`EBI <http://www.ebi.ac.uk/systems-srv/public-wiki/index.php/EBI_Good_Computing_Guide>`__,
`Sanger
institute <http://mediawiki.internal.sanger.ac.uk/index.php/How_to_run_MPI_jobs_on_the_farm>`__).

We won't discuss the inner workings of the modules, but real examples
can be found in the
`ensembl-compara <https://github.com/Ensembl/ensembl-compara>`__
repository. It ships modules used for phylogenetic tree inference:
`RAxML <https://github.com/Ensembl/ensembl-compara/blob/release/77/modules/Bio/EnsEMBL/Compara/RunnableDB/ProteinTrees/RAxML.pm>`__
and
`ExaML <https://github.com/Ensembl/ensembl-compara/blob/feature/update_pipeline/modules/Bio/EnsEMBL/Compara/RunnableDB/ProteinTrees/ExaML.pm>`__.
They look very lightweight (only command-line definitions) because most
of the logic is in the base class (*GenericRunnable*), but they
nevertheless show the command lines used and the parametrization of
multi-core and MPI runs.
--------------
How to setup a module using Shared Memory Parallelism (threads)
---------------------------------------------------------------
If you have already compiled your code and know how to enable the
use of multiple threads / cores, this case should be very
straightforward. It basically consists of defining the proper
resource class in your pipeline. We also include some tips on how to
compile code in an MPI environment, but be aware that the details
will vary across systems.
1. You need to set up a resource class that encodes those requirements,
   e.g. *16 cores and 24Gb of RAM*:

   ::

       sub resource_classes {
           my ($self) = @_;
           return {
               #...
               '24Gb_16_core_job' => { 'LSF' => '-n 16 -M24000 -R"select[mem>24000] span[hosts=1] rusage[mem=24000]"' },
               #...
           };
       }
2. You need to add the analysis to your PipeConfig file:

   ::

       { -logic_name => 'app_multi_core',
         -module     => 'Namespace::Of::Thread_app',
         -parameters => {
             'app_exe' => $self->o('app_pthreads_exe'),
             'cmd'     => '#app_exe# -T 16 -input #alignment_file#',
         },
         -rc_name => '24Gb_16_core_job',
       },
   We would like to call your attention to the ``cmd`` parameter, where
   we define the command line used to run Thread_app. The actual
   command line will vary between programs; in this case, the ``-T``
   parameter sets the number of threads to 16. You should check the
   documentation of the code you want to run to find out how to define
   the number of threads it will use.

With just this basic configuration, the Hive is able to run Thread_app
on 16 cores.
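To reduce the risk of the command line drifting out of sync with the
resource class, one option (a sketch reusing the ``#param#``
substitution already shown above; ``num_threads`` is a name we made up
for illustration) is to factor the thread count out as a parameter:

::

    { -logic_name => 'app_multi_core',
      -module     => 'Namespace::Of::Thread_app',
      -parameters => {
          'num_threads' => 16,   # keep in sync with '-n 16' in the resource class
          'app_exe'     => $self->o('app_pthreads_exe'),
          'cmd'         => '#app_exe# -T #num_threads# -input #alignment_file#',
      },
      -rc_name => '24Gb_16_core_job',
    },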
--------------
How to setup a module using Distributed Memory Parallelism (MPI)
----------------------------------------------------------------
This case requires a bit more attention, so please be very careful
to include / load the right libraries / modules.
Tips for compiling for MPI
~~~~~~~~~~~~~~~~~~~~~~~~~~
MPI usually comes in two implementations: OpenMPI and MPICH2. One of the
most common sources of problems is to compile the code with one MPI
implementation and try to run it with another. You must compile and run
your code with the **same** MPI implementation. This can easily be taken
care of by properly setting up your ``.bashrc`` to load the right modules.

If you have access to Intel compilers, we strongly recommend trying to
compile your code with them and checking for performance improvements.
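Before relying on it, you can double-check which implementation is
currently picked up from your ``PATH`` (a sketch; the exact version
strings vary between systems and versions):

::

    which mpicc mpirun
    mpirun --version    # OpenMPI identifies itself as "Open MPI";
                        # MPICH-based builds print HYDRA build details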
If your compute environment uses `Module <http://modules.sourceforge.net/>`__
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
*Module* provides configuration files (module-files) for the dynamic
modification of the user's environment.

Here is how to list the modules that your system provides:

::

    module avail

And here is how to load one (OpenMPI in this example):

::

    module load openmpi-x86_64

Don't forget to put this line in your ``~/.bashrc`` so that it is
automatically loaded.
Otherwise, follow the recommended usage in your institute
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If you don't have modules for the MPI environment available on your
system, please make sure you include the right libraries (``PATH``, and
any other relevant environment variables).
The Hive bit
~~~~~~~~~~~~
Here again, once the environment is properly set up, we only have to
define the correct resource class and command lines in Hive.

1. You need to set up a resource class that uses, e.g., *64 cores and
   16Gb of RAM*:

   ::

       sub resource_classes {
           my ($self) = @_;
           return {
               # ...
               '16Gb_64c_mpi' => { 'LSF' => '-q mpi -a openmpi -n 64 -M16000 -R"select[mem>16000] rusage[mem=16000] same[model] span[ptile=4]"' },
               # ...
           };
       }
The resource description is specific to our LSF environment, so adapt
it to yours, but note that:

- ``-q mpi -a openmpi`` tells LSF that the job will run in the
  MPI/OpenMPI environment
- ``same[model]`` ensures that the selected compute nodes all have the
  same hardware. You may also need something like ``select[avx]`` to
  select the nodes that have the `AVX instruction
  set <http://en.wikipedia.org/wiki/Advanced_Vector_Extensions>`__
- ``span[ptile=4]`` specifies the granularity at which LSF splits the
  job slots across nodes: here, 4 slots per node, so a 64-slot job
  spans 16 nodes. This might affect queuing times.
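For reference, outside eHive the same request corresponds roughly to the
following ``bsub`` command line (a sketch; ``my_mpi_app`` and its
arguments are placeholders):

::

    bsub -q mpi -a openmpi -n 64 -M16000 \
         -R"select[mem>16000] rusage[mem=16000] same[model] span[ptile=4]" \
         mpirun.lsf -np 64 my_mpi_app <app arguments>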
2. You need to add the analysis to your PipeConfig file:

   ::

       { -logic_name => 'MPI_app',
         -module     => 'Bio::EnsEMBL::Compara::RunnableDB::ProteinTrees::MPI_app',
         -parameters => {
             'mpi_exe' => $self->o('mpi_exe'),
         },
         -rc_name => '16Gb_64c_mpi',
         # ...
       },
--------------
How to write a module that uses MPI
-----------------------------------
Here is an excerpt of Ensembl Compara's
`ExaML <https://github.com/Ensembl/ensembl-compara/blob/feature/update_pipeline/modules/Bio/EnsEMBL/Compara/RunnableDB/ProteinTrees/ExaML.pm>`__
MPI module. Note that LSF needs the MPI command to be run through
``mpirun.lsf``, and that the ``-np 64`` given to it matches the 64 cores
requested in the resource class. You can also run several
single-threaded commands in the same Runnable.

::

    sub param_defaults {
        my $self = shift;
        return {
            %{ $self->SUPER::param_defaults },
            'cmd' => 'cmd 1 ; cmd 2 ; mpirun.lsf -np 64 -mca btl tcp,self #examl_exe# -examl_parameter_1 value1 -examl_parameter_2 value2',
        };
    }
!!!Temporary files!!!
~~~~~~~~~~~~~~~~~~~~~

Because ExaML uses MPI, it has to be run from a directory that is shared
across all the compute nodes. Here we override the eHive method to use
``#examl_dir#`` instead:

::

    sub worker_temp_directory_name {
        my $self = shift @_;
        my $username = $ENV{'USER'};
        my $worker_id = $self->worker ? $self->worker->dbID : "standalone.$$";
        return $self->param('examl_dir')."/worker_${username}.${worker_id}/";
    }
How to connect eHive to Slack
=============================
With this tutorial, our goal is to explain how to configure eHive
and Slack so that eHive can report messages to a Slack channel.

First of all, you obviously need to have a Slack team. You or
someone else will have to be allowed to configure Apps.
--------------
1. Let's first add the "Incoming WebHooks" app to your team

   1. In the main Slack menu, select "Apps & Custom Integrations"
   2. Find "Incoming WebHooks" via the search box and select it
   3. You should now be on a page that gives an introduction to
      WebHooks and lists the teams you belong to. Somebody may have
      already configured some WebHooks for your team.

      1. If that is the case, click on the "Configure" button next to
         your team name and then "Add Configuration"
      2. Otherwise, click on the "Install" button next to your team name

2. Let's now configure a webhook to use with eHive

   1. You first need to choose the channel eHive will write to.
      Although the Slack API allows overriding the channel, and thus
      using a single webhook to post to different channels, we advise
      configuring one webhook per channel
   2. Click "Add Incoming WebHooks Integration"
   3. The page now shows the advanced configuration of the integration.
      The most important item here is the "Webhook URL": this is what
      eHive needs
   4. If you scroll down to "Integration Settings", you can give the
      WebHook a description and change its name and emoji. Note that
      the latter can be overridden in the SlackNotification Runnable

3. Use the WebHook in eHive (see the sketch after this list)

   1. Define the ``EHIVE_SLACK_WEBHOOK`` environment variable when
      running your beekeeper
   2. Configure the ``slack_webhook`` parameter of the
      SlackNotification Runnable
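As a minimal sketch of that last step (the webhook URL is a placeholder;
apart from ``slack_webhook``, the module path and the ``text`` parameter
name are illustrative assumptions, so check the SlackNotification
Runnable for its exact interface):

::

    # shell: export the webhook before running the beekeeper (placeholder URL)
    export EHIVE_SLACK_WEBHOOK='https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXX'

Alternatively, the webhook can be wired into a notification analysis in
your PipeConfig:

::

    # a hypothetical notification analysis
    { -logic_name => 'notify_slack',
      -module     => 'Bio::EnsEMBL::Hive::RunnableDB::SlackNotification',
      -parameters => {
          'slack_webhook' => $self->o('slack_webhook'),
          'text'          => 'The pipeline has reached this analysis',
      },
    },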
@@ -42,6 +42,13 @@ Creating runnables
creating_runnables/*
Advanced usage
==============

.. toctree::
   :glob:

   advanced_usage/*

Indices and tables
==================
eHive installation and setup
============================
eHive dependencies
------------------