First pass documentation

Commit bfa05e64 authored 13 years ago by Andy Yates
--- a/docs/pipelines/fasta.html
+++ b/docs/pipelines/fasta.html
+<h1 id="fasta_pipeline">FASTA Pipeline</h1>
+
+<p>This is a re-implementation of an existing pipeline originally developed by
+core and the webteam. The new version uses eHive, so familiarity with that
+system is essential. It has been written to use as little memory as possible.</p>
+
+<h2 id="the_registry_file">The Registry File</h2>
+
+<p>The registry file is how the pipeline retrieves the database connections it
+works with. It should specify:</p>
+
+<ul>
+<li>The core (and any other) databases to dump from</li>
+<li>A production database
+<ul>
+<li><strong>species = multi</strong></li>
+<li><strong>group = production</strong></li>
+<li>Used to find which species require new DNA</li>
+</ul></li>
+<li>A web database
+<ul>
+<li><strong>species = multi</strong></li>
+<li><strong>group = web</strong></li>
+<li>Used to name BLAT index files</li>
+</ul></li>
+</ul>
+
+<p>Here is an example of a file for v67 of Ensembl. Note the use of the
+Registry object within a registry file and the scoping of the package. If
+you omit the <em>-db_version</em> parameter and only use HEAD checkouts of Ensembl
+then this will automatically select the latest version of the API. Any
+change to version here must be reflected in the configuration file.</p>
+
+<pre><code>package Reg;
+
+use Bio::EnsEMBL::Registry;
+use Bio::EnsEMBL::DBSQL::DBAdaptor;
+
+Bio::EnsEMBL::Registry-&gt;no_version_check(1);
+Bio::EnsEMBL::Registry-&gt;no_cache_warnings(1);
+
+{
+  my $version = 67;
+  Bio::EnsEMBL::Registry-&gt;load_registry_from_multiple_dbs(
+    {
+      -host =&gt; "mydb-1",
+      -port =&gt; 3306,
+      -db_version =&gt; $version,
+      -user =&gt; "user",
+      -NO_CACHE =&gt; 1,
+    },
+    {    
+      -host =&gt; "mydb-2",
+      -port =&gt; 3306,
+      -db_version =&gt; $version,
+      -user =&gt; "user",
+      -NO_CACHE =&gt; 1,
+    },
+  );
+
+  Bio::EnsEMBL::DBSQL::DBAdaptor-&gt;new(
+    -HOST =&gt; 'mydb-2',
+    -PORT =&gt; 3306,
+    -USER =&gt; 'user',
+    -DBNAME =&gt; 'ensembl_website',
+    -SPECIES =&gt; 'multi',
+    -GROUP =&gt; 'web'
+  );
+
+  Bio::EnsEMBL::DBSQL::DBAdaptor-&gt;new(
+    -HOST =&gt; 'mydb-2',
+    -PORT =&gt; 3306,
+    -USER =&gt; 'user',
+    -DBNAME =&gt; 'ensembl_production',
+    -SPECIES =&gt; 'multi',
+    -GROUP =&gt; 'production'
+  );
+}
+
+1;
+</code></pre>
+
+<h2 id="overriding_defaults_using_a_new_config_file">Overriding Defaults Using a New Config File</h2>
+
+<p>If you have a number of parameters which do not change between releases,
+we recommend creating a configuration file which inherits from the root
+config file, e.g.</p>
+
+<pre><code>package MyCnf;
+
+use base qw/Bio::EnsEMBL::Pipeline::PipeConfig::FASTA_conf/;
+
+sub default_options {
+  my ($self) = @_;
+  return {
+    %{ $self-&gt;SUPER::default_options() },
+
+    #Override of options
+  };
+}
+
+1;
+</code></pre>
+
+<h2 id="environment">Environment</h2>
+
+<h3 id="perl5lib">PERL5LIB</h3>
+
+<ul>
+<li>ensembl</li>
+<li>ensembl-hive</li>
+<li>bioperl</li>
+</ul>
+
+<h3 id="path">PATH</h3>
+
+<ul>
+<li>ensembl-hive/scripts</li>
+<li>faToTwoBit (if not using a custom location)</li>
+<li>xdformat (if not using a custom location)</li>
+</ul>
+
+<h2 id="example_commands">Example Commands</h2>
+
+<h3 id="to_load_use_normally">To load for normal use:</h3>
+
+<pre><code>init_pipeline.pl Bio::EnsEMBL::Pipeline::PipeConfig::FASTA_conf \
+-pipeline_db -host=my-db-host -base_path /path/to/dumps
+</code></pre>
+
+<h3 id="run_a_subset_of_species_no_forcing_supports_registry_aliases">Run a subset of species (no forcing &amp; supports registry aliases):</h3>
+
+<pre><code>init_pipeline.pl Bio::EnsEMBL::Pipeline::PipeConfig::FASTA_conf \
+-pipeline_db -host=my-db-host -species anolis -species celegans -species human \
+-base_path /path/to/dumps
+</code></pre>
+
+<h3 id="specifying_species_to_force_supports_all_registry_aliases">Specifying species to force (supports all registry aliases):</h3>
+
+<pre><code>init_pipeline.pl Bio::EnsEMBL::Pipeline::PipeConfig::FASTA_conf \
+-pipeline_db -host=my-db-host -force_species anolis -force_species celegans -force_species human \
+-base_path /path/to/dumps
+</code></pre>
+
+<h3 id="running_forcing_a_species">Running &amp; forcing a species:</h3>
+
+<pre><code>init_pipeline.pl Bio::EnsEMBL::Pipeline::PipeConfig::FASTA_conf \
+-pipeline_db -host=my-db-host -species celegans -force_species celegans \
+-base_path /path/to/dumps
+</code></pre>
+
+<h3 id="dumping_just_gene_data">Dumping just gene data:</h3>
+
+<pre><code>init_pipeline.pl Bio::EnsEMBL::Pipeline::PipeConfig::FASTA_conf \
+-pipeline_db -host=my-db-host -dump_type cdna \
+-base_path /path/to/dumps
+</code></pre>
+
+<h3 id="using_a_different_scp_user_identity">Using a different SCP user &amp; identity:</h3>
+
+<pre><code>init_pipeline.pl Bio::EnsEMBL::Pipeline::PipeConfig::FASTA_conf \
+-pipeline_db -host=my-db-host -scp_user anotherusr -scp_identity /users/anotherusr/.pri/identity \
+-base_path /path/to/dumps
+</code></pre>
+
+<h2 id="running_the_pipeline">Running the Pipeline</h2>
+
+<ol>
+<li>Start a screen session or get ready to run the beekeeper under <strong>nohup</strong></li>
+<li>Choose a dump location
+<ul>
+<li>fasta, blast and blat directories will be created one level below it</li>
+</ul></li>
+<li>Use an <em>init_pipeline.pl</em> configuration from above
+<ul>
+<li>Make sure to give it the <strong>-base_path</strong> parameter</li>
+</ul></li>
+<li>Sync the database using one of the commands displayed by <em>init_pipeline.pl</em></li>
+<li><p>Run the pipeline in a loop with a good sleep between submissions and redirect log output (the following assumes you are using <strong>bash</strong>)</p>
+
+<ul>
+<li><strong>2&gt;&amp;1</strong> is important as it redirects STDERR into STDOUT; it must come after the file redirection, otherwise STDERR is not written to the log</li>
+<li><strong>&gt; my_run.log</strong> sends the output to this file. Use <strong>tail -f</strong> to track the pipeline
+<pre><code>beekeeper.pl -url mysql://usr:pass@server:port/db -reg_conf reg.pm -loop -sleep 5 &gt; my_run.log 2&gt;&amp;1 &amp;
+</code></pre></li>
+</ul></li>
+<li><p>Wait</p></li>
+</ol>
--- a/docs/pipelines/fasta.markdown
+++ b/docs/pipelines/fasta.markdown
+# FASTA Pipeline
+
+This is a re-implementation of an existing pipeline originally developed by
+core and the webteam. The new version uses eHive, so familiarity with that
+system is essential. It has been written to use as little memory as possible.
+
+## The Registry File
+
+The registry file is how the pipeline retrieves the database connections it
+works with. It should specify:
+
+* The core (and any other) databases to dump from
+* A production database
+	* **species = multi**
+	* **group = production**
+	* Used to find which species require new DNA
+* A web database
+	* **species = multi**
+	* **group = web**
+	* Used to name BLAT index files
+
+Here is an example of a file for v67 of Ensembl. Note the use of the
+Registry object within a registry file and the scoping of the package. If
+you omit the *-db_version* parameter and only use HEAD checkouts of Ensembl
+then this will automatically select the latest version of the API. Any
+change to version here must be reflected in the configuration file.
+
+	package Reg;
+
+	use Bio::EnsEMBL::Registry;
+	use Bio::EnsEMBL::DBSQL::DBAdaptor;
+
+	Bio::EnsEMBL::Registry->no_version_check(1);
+	Bio::EnsEMBL::Registry->no_cache_warnings(1);
+  
+	{
+	  my $version = 67;
+	  Bio::EnsEMBL::Registry->load_registry_from_multiple_dbs(
+	    {
+	      -host => "mydb-1",
+	      -port => 3306,
+	      -db_version => $version,
+	      -user => "user",
+	      -NO_CACHE => 1,
+	    },
+	    {    
+	      -host => "mydb-2",
+	      -port => 3306,
+	      -db_version => $version,
+	      -user => "user",
+	      -NO_CACHE => 1,
+	    },
+	  );
+  
+	  Bio::EnsEMBL::DBSQL::DBAdaptor->new(
+	    -HOST => 'mydb-2',
+	    -PORT => 3306,
+	    -USER => 'user',
+	    -DBNAME => 'ensembl_website',
+	    -SPECIES => 'multi',
+	    -GROUP => 'web'
+	  );
+  
+	  Bio::EnsEMBL::DBSQL::DBAdaptor->new(
+	    -HOST => 'mydb-2',
+	    -PORT => 3306,
+	    -USER => 'user',
+	    -DBNAME => 'ensembl_production',
+	    -SPECIES => 'multi',
+	    -GROUP => 'production'
+	  );
+	}
+
+	1;
+
+## Overriding Defaults Using a New Config File 
+
+If you have a number of parameters which do not change between releases,
+we recommend creating a configuration file which inherits from the root
+config file, e.g.
+
+	package MyCnf;
+
+	use base qw/Bio::EnsEMBL::Pipeline::PipeConfig::FASTA_conf/;
+
+	sub default_options {
+	  my ($self) = @_;
+	  return {
+	    %{ $self->SUPER::default_options() },
+    
+	    #Override of options
+	  };
+	}
+
+	1;
+
+## Environment
+
+### PERL5LIB
+
+* ensembl
+* ensembl-hive
+* bioperl
+
+### PATH
+
+* ensembl-hive/scripts
+* faToTwoBit (if not using a custom location)
+* xdformat (if not using a custom location)
+
+## Example Commands
+
+### To load for normal use:
+
+	init_pipeline.pl Bio::EnsEMBL::Pipeline::PipeConfig::FASTA_conf \
+	-pipeline_db -host=my-db-host -base_path /path/to/dumps
+
+### Run a subset of species (no forcing & supports registry aliases):
+
+	init_pipeline.pl Bio::EnsEMBL::Pipeline::PipeConfig::FASTA_conf \
+	-pipeline_db -host=my-db-host -species anolis -species celegans -species human \
+	-base_path /path/to/dumps
+
+### Specifying species to force (supports all registry aliases):
+
+	init_pipeline.pl Bio::EnsEMBL::Pipeline::PipeConfig::FASTA_conf \
+	-pipeline_db -host=my-db-host -force_species anolis -force_species celegans -force_species human \
+	-base_path /path/to/dumps
+
+### Running & forcing a species:
+
+	init_pipeline.pl Bio::EnsEMBL::Pipeline::PipeConfig::FASTA_conf \
+	-pipeline_db -host=my-db-host -species celegans -force_species celegans \
+	-base_path /path/to/dumps
+
+### Dumping just gene data:
+
+	init_pipeline.pl Bio::EnsEMBL::Pipeline::PipeConfig::FASTA_conf \
+	-pipeline_db -host=my-db-host -dump_type cdna \
+	-base_path /path/to/dumps
+
+### Using a different SCP user & identity:
+
+	init_pipeline.pl Bio::EnsEMBL::Pipeline::PipeConfig::FASTA_conf \
+	-pipeline_db -host=my-db-host -scp_user anotherusr -scp_identity /users/anotherusr/.pri/identity \
+	-base_path /path/to/dumps
+
+
+## Running the Pipeline
+
+1. Start a screen session or get ready to run the beekeeper under **nohup**
+2. Choose a dump location
+	* fasta, blast and blat directories will be created one level below it
+3. Use an *init_pipeline.pl* configuration from above
+	* Make sure to give it the **-base_path** parameter
+4. Sync the database using one of the commands displayed by *init_pipeline.pl*
+5. Run the pipeline in a loop with a good sleep between submissions and redirect log output (the following assumes you are using **bash**)
+	* **2>&1** is important as it redirects STDERR into STDOUT; it must come after the file redirection, otherwise STDERR is not written to the log
+	* **> my_run.log** sends the output to this file. Use **tail -f** to track the pipeline
+		beekeeper.pl -url mysql://usr:pass@server:port/db -reg_conf reg.pm -loop -sleep 5 > my_run.log 2>&1 &
+
+6. Wait
+
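A note on the log redirection in step 5 of "Running the Pipeline": the shell applies redirections from left to right, so the position of `2>&1` relative to `> my_run.log` decides whether STDERR ends up in the log. The following is a minimal sketch of that behaviour, not part of the pipeline. `emit_both` is a hypothetical stand-in for `beekeeper.pl`, and the log file names are made up for the illustration.

```shell
# Hypothetical stand-in for beekeeper.pl: writes one line to STDOUT
# and one line to STDERR.
emit_both() {
  echo "to stdout"
  echo "to stderr" >&2
}

# Scratch directory so the demonstration does not touch real logs.
workdir=$(mktemp -d)

# 2>&1 first: STDERR is pointed at wherever STDOUT goes right now
# (the terminal), and only afterwards is STDOUT sent to the file.
# The STDERR line therefore does not reach the log.
emit_both 2>&1 > "$workdir/wrong_order.log"

# File redirection first, then 2>&1: STDERR follows STDOUT into the file.
emit_both > "$workdir/right_order.log" 2>&1

echo "--- wrong_order.log"
cat "$workdir/wrong_order.log"
echo "--- right_order.log"
cat "$workdir/right_order.log"
```

With the first ordering only the STDOUT line is captured. With the second, both lines land in the file, which is what you want when following a beekeeper run with `tail -f`.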