In this tutorial, our goal is to give you insights on how to set up the Hive to run jobs using Shared Memory Parallelism (threads) and Distributed Memory Parallelism (MPI).
First of all, your institution / compute-farm provider may have documentation on this topic. Please refer to it for implementation details (intranet-only links: [EBI](http://www.ebi.ac.uk/systems-srv/public-wiki/index.php/EBI_Good_Computing_Guide), [Sanger Institute](http://mediawiki.internal.sanger.ac.uk/index.php/How_to_run_MPI_jobs_on_the_farm)).
We won't discuss the inner parts of the modules, but real examples can be found in the [ensembl-compara](https://github.com/Ensembl/ensembl-compara) repository. It ships modules used for phylogenetic tree inference: [RAxML](https://github.com/Ensembl/ensembl-compara/blob/release/77/modules/Bio/EnsEMBL/Compara/RunnableDB/ProteinTrees/RAxML.pm) and [ExaML](https://github.com/Ensembl/ensembl-compara/blob/feature/update_pipeline/modules/Bio/EnsEMBL/Compara/RunnableDB/ProteinTrees/ExaML.pm). They look very lightweight (only command-line definitions) because most of the logic is in the base class (*GenericRunnable*), but they nevertheless show the command lines used and the parametrization of multi-core and MPI runs.
---
For running binaries we use the module Bio::EnsEMBL::Compara::RunnableDB::GeneTrees::GenericRunnable, which gives us a good interface with the command line, allowing us to submit different command lines and parse/store the results in the database. For more information on how to use GenericRunnable, please check the module's documentation.
### How to set up a module using Shared Memory Parallelism (threads)
>If you have already compiled your code and know how to enable the use of multiple threads / cores, this case should be very straightforward. It basically consists of defining the proper resource class in your pipeline. We also include some tips on how to compile code for an MPI environment, but be aware that this will vary across systems.
---
#### Here is an example of how to set up a module using Shared Memory Parallelism (threads):
1. You need to set up a resource class that encodes those requirements, e.g. *16 cores and 24Gb of RAM*, as in the sketch below:
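A minimal sketch of such a resource class, assuming an LSF meadow (the class name `24Gb_16c_job` is illustrative, and the memory value and its units depend on how your LSF installation is configured):

```perl
sub resource_classes {
    my ($self) = @_;
    return {
        %{$self->SUPER::resource_classes},   # inherit the default resource classes
        # 16 cores on a single host and 24Gb of RAM (illustrative name and values)
        '24Gb_16c_job' => { 'LSF' => '-n 16 -M24000 -R"select[mem>24000] rusage[mem=24000] span[hosts=1]"' },
    };
}
```

`span[hosts=1]` asks LSF to allocate all 16 slots on the same node, which is required for threads to share memory.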
We would like to call your attention to the `cmd` parameter, where we define the command line used to run Thread_app. The actual command line will vary between programs, but in this case the parameter `-T` is set to 16 cores, as shown in the sketch below.
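For illustration, here is a sketch of what the analysis could look like in the PipeConfig (the logic-name, the `#thread_app_exe#` parameter and its options are placeholders, not part of any real pipeline):

```perl
        # goes in the list returned by pipeline_analyses()
        {   -logic_name => 'thread_app_16_cores',
            -module     => 'Bio::EnsEMBL::Compara::RunnableDB::GeneTrees::GenericRunnable',
            -parameters => {
                # -T 16 must match the number of cores booked by the resource class
                'cmd' => '#thread_app_exe# -T 16 -i #input_file# -o #output_file#',
            },
            -rc_name    => '24Gb_16c_job',   # the 16-core resource class defined above
        },
```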
You should check the documentation of the code you want to run to find out how to define the number of threads it will use.
Just with this basic configuration, the Hive is able to run Thread_app on 16 cores.
---
### Tips for compiling for Shared Memory Parallelism (OpenMP)
If you have access to Intel compilers, we strongly recommend you try compiling your code with them and checking for performance improvements.

---
### How to set up a module using Distributed Memory Parallelism (MPI)
>This case requires a bit more attention, so please be very careful about including / loading the right libraries / modules.
### Tips for compiling for MPI
MPI usually comes in two implementations: OpenMPI and MPICH2. One of the most common sources of problems is compiling the code with one MPI implementation and trying to run it with another. You must compile and run your code with the **same** MPI implementation. This can easily be taken care of by properly setting up your `~/.bashrc` to load the right modules.
---
#### If your compute environment uses [Modules](http://modules.sourceforge.net/)
*Modules* provides configuration files (modulefiles) for the dynamic modification of the user’s environment.
Here is how to list the modules that your system provides:
```
module avail
```
And here is how to load one (OpenMPI in this example):
```
module load openmpi-x86_64
```
Don't forget to put this line in your `~/.bashrc` so that it is automatically loaded.
#### Otherwise, follow the recommended usage in your institute
>If you don't have modules for the MPI environment available on your system, please make sure you include the right libraries (`PATH` and any other relevant environment variables).
### The Hive bit
**1)** Include the module in your `~/.bashrc`:

You must load the MPI module (or libraries) in your shell's startup file (e.g. `~/.bashrc`), otherwise your code won't run properly.
Here again, once the environment is properly set up, we only have to define the correct resource class and command lines in the Hive.
**2)** You need to set up a resource class that uses e.g. *64 cores and 16Gb of RAM*:
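A sketch of such a resource class, again assuming an LSF meadow (the name `16Gb_64c_mpi` and the memory value are illustrative); the LSF options are explained below:

```perl
sub resource_classes {
    my ($self) = @_;
    return {
        %{$self->SUPER::resource_classes},
        # 64 MPI slots, at least 4 per node, on identical hardware
        '16Gb_64c_mpi' => { 'LSF' => '-q mpi -a openmpi -n 64 -M16000 -R"select[mem>16000] rusage[mem=16000] same[model] span[ptile=4]"' },
    };
}
```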
* `-q mpi -a openmpi` is needed to tell LSF you will run a job in the MPI/OpenMPI environment.
* `same[model]` is needed to ensure that the selected compute nodes all have the same hardware. You may also need something like `select[avx]` to select the nodes that have the [AVX instruction set](http://en.wikipedia.org/wiki/Advanced_Vector_Extensions).
* `span[ptile=4]` specifies the granularity with which LSF will split the jobs per node. In this example we ask for at least 4 jobs to be executed on the same machine. This may affect queuing times.
**3)** You need to add the analysis to your PipeConfig:
For a real-world example, see Ensembl Compara's [ExaML](https://github.com/Ensembl/ensembl-compara/blob/feature/update_pipeline/modules/Bio/EnsEMBL/Compara/RunnableDB/ProteinTrees/ExaML.pm) MPI module. Note that LSF needs the MPI command to be run through mpirun.lsf, as in the sketch below.
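Below is a simplified sketch of such an analysis (not a verbatim excerpt from the module; the `#examl_exe#` parameter and the ExaML options are illustrative):

```perl
        # goes in the list returned by pipeline_analyses()
        {   -logic_name => 'examl_64_cores',
            -module     => 'Bio::EnsEMBL::Compara::RunnableDB::GeneTrees::GenericRunnable',
            -parameters => {
                # under LSF the MPI binary is started through mpirun.lsf, which
                # places the processes on the slots booked by the resource class
                'cmd' => 'mpirun.lsf #examl_exe# -s #alignment_file# -t #starting_tree# -n #output_prefix#',
            },
            -rc_name    => '16Gb_64c_mpi',   # the 64-core MPI resource class defined above
        },
```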
You can also run several single-threaded commands in the same runnable.