From 3d190b8294db8bf9df2adf4e842ffc4670b10918 Mon Sep 17 00:00:00 2001 From: Andrew Yates <ayates@ebi.ac.uk> Date: Wed, 13 Jun 2012 10:16:31 +0000 Subject: [PATCH] First pass of the flatfile documentation --- docs/pipelines/flatfile.html | 46 +++++++++++ docs/pipelines/flatfile.textile | 137 ++++++++++++++++++++++++++++++++ 2 files changed, 183 insertions(+) create mode 100644 docs/pipelines/flatfile.html create mode 100644 docs/pipelines/flatfile.textile diff --git a/docs/pipelines/flatfile.html b/docs/pipelines/flatfile.html new file mode 100644 index 0000000000..da4d4b26b6 --- /dev/null +++ b/docs/pipelines/flatfile.html @@ -0,0 +1,46 @@ +<?xml version='1.0' encoding='utf-8' ?><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"/></head><body><h1 id="FlatFilePipeline">FlatFile Pipeline</h1><p>This is a re-implementation of an existing pipeline developed originally by core and the webteam. The new version uses eHive, so familiarity with this system is essential, and has been written to use as little memory as possible.</p><h2 id="TheRegistryFile">The Registry File</h2><p>This is the way we retrieve the database connections to work with. The registry file should specify:</p><ul><li>The core (and any other) databases to dump from</li></ul><p>Here is an example of a file for v67 of Ensembl. Note the use of the Registry object within a registry file and the scoping of the package. If you omit the <strong>-db_version</strong> parameter and only use HEAD checkouts of Ensembl then this will automatically select the latest version of the API. 
Any change to the version here must be reflected in the configuration file.</p><pre><code> package Reg; + use Bio::EnsEMBL::Registry; + Bio::EnsEMBL::Registry->no_version_check(1); + Bio::EnsEMBL::Registry->no_cache_warnings(1); + { + my $version = 67; + Bio::EnsEMBL::Registry->load_registry_from_multiple_dbs( + { + -host => "mydb-1", + -port => 3306, + -db_version => $version, + -user => "user", + -NO_CACHE => 1, + }, + { + -host => "mydb-2", + -port => 3306, + -db_version => $version, + -user => "user", + -NO_CACHE => 1, + }, + ); + } + 1; +</code></pre><p>Pass the registry to the <strong>init_pipeline.pl</strong> script via the <strong>-registry</strong> option.</p><h2 id="OverridingDefaultsUsingaNewConfigFile">Overriding Defaults Using a New Config File </h2><p>If a number of your parameters do not change between releases, we recommend creating a configuration file which inherits from the root config file, e.g.</p><pre><code> package MyCnf; + use base qw/Bio::EnsEMBL::Pipeline::PipeConfig::Flatfile_conf/; + sub default_options { + my ($self) = @_; + return { + %{ $self->SUPER::default_options() }, + # Override options here + }; + } + 1; +</code></pre><p>If you do override the config then use the package name of your overriding config in place of the root config in the example commands.</p><h2 id="Environment">Environment</h2><h3 id="PERL5LIB">PERL5LIB</h3><ul><li>ensembl</li><li>ensembl-hive</li><li>bioperl</li></ul><h3 id="PATH">PATH</h3><ul><li>ensembl-hive/scripts</li></ul><h3 id="ENSEMBLCVSROOTDIR">ENSEMBL_CVS_ROOT_DIR</h3><p>Set to the base checkout of Ensembl. 
Appending <strong>ensembl-hive/sql</strong> to this path should give the SQL directory for hive, e.g.</p><pre><code> export ENSEMBL_CVS_ROOT_DIR=$HOME/work/ensembl-checkouts +</code></pre><h3 id="ENSADMINPSW">ENSADMIN_PSW</h3><p>Set this to the password used to log into the database server, e.g.</p><pre><code> export ENSADMIN_PSW=wibble +</code></pre><h2 id="CommandLineArguments">Command Line Arguments</h2><p>Where <strong>Multiple Supported</strong> is <strong>Yes</strong>, the parameter may be given more than once on the command line. <code>-species</code> is one such option, e.g.</p><pre><code>-species human -species cele -species yeast +</code></pre><table><tr><th>Name</th><th>Type</th><th>Multiple Supported</th><th>Description</th><th>Default</th><th>Required</th></tr><tr><td><code>-registry</code></td><td>String</td><td>No</td><td>Location of the Ensembl registry to use with this pipeline</td><td>-</td><td><strong>YES</strong></td></tr><tr><td><code>-base_path</code></td><td>String</td><td>No</td><td>Location of the dumps</td><td>-</td><td><strong>YES</strong></td></tr><tr><td><code>-pipeline_db -host=</code></td><td>String</td><td>No</td><td>Specify a host for the hive database e.g. <code>-pipeline_db -host=myserver.mysql</code></td><td>See hive generic config</td><td><strong>YES</strong></td></tr><tr><td><code>-pipeline_db -dbname=</code></td><td>String</td><td>No</td><td>Specify a different database to use as the hive DB e.g. <code>-pipeline_db -dbname=my_dumps_test</code></td><td>Uses pipeline name by default</td><td><strong>NO</strong></td></tr><tr><td><code>-species</code></td><td>String</td><td>Yes</td><td>Specify one or more species to process. The pipeline will only <em>consider</em> these species</td><td>-</td><td><strong>NO</strong></td></tr><tr><td><code>-types</code></td><td>String</td><td>Yes</td><td>Specify each type of dump you want to produce. 
Supported values are <strong>embl</strong> and <strong>genbank</strong></td><td>All</td><td><strong>NO</strong></td></tr><tr><td><code>-db_types</code></td><td>String</td><td>Yes</td><td>The database types to use. Supports the normal db adaptor groups e.g. <strong>core</strong>, <strong>otherfeatures</strong> etc.</td><td>core</td><td><strong>NO</strong></td></tr><tr><td><code>-pipeline_name</code></td><td>String</td><td>No</td><td>Name to use for the pipeline</td><td>flatfile_dump_$release</td><td><strong>NO</strong></td></tr><tr><td><code>-email</code></td><td>String</td><td>No</td><td>Email address to send pipeline summaries to upon successful completion</td><td>$USER@sanger.ac.uk</td><td><strong>NO</strong></td></tr></table><h2 id="ExampleCommands">Example Commands</h2><h3 id="Toloadusenormally">Normal use:</h3><pre><code> init_pipeline.pl Bio::EnsEMBL::Pipeline::PipeConfig::Flatfile_conf \ + -pipeline_db -host=my-db-host -base_path /path/to/dumps -registry reg.pm +</code></pre><h3 id="Runasubsetofspeciesnoforcingsupportsregistryaliases">Run a subset of species (no forcing & supports registry aliases):</h3><pre><code> init_pipeline.pl Bio::EnsEMBL::Pipeline::PipeConfig::Flatfile_conf \ + -pipeline_db -host=my-db-host -species anolis -species celegans -species human \ + -base_path /path/to/dumps -registry reg.pm +</code></pre><h3 id="DumpingjustEMBLdatanogenbank">Dumping just EMBL data (no genbank):</h3><pre><code> init_pipeline.pl Bio::EnsEMBL::Pipeline::PipeConfig::Flatfile_conf \ + -pipeline_db -host=my-db-host -types embl \ + -base_path /path/to/dumps -registry reg.pm +</code></pre><h2 id="RunningthePipeline">Running the Pipeline</h2><ol><li>Start a screen session, or be ready to run the beekeeper under <code>nohup</code></li><li>Choose a dump location<ul><li>A fasta, blast and blat directory will be created 1 level below</li></ul></li><li>Use an <code>init_pipeline.pl</code> configuration from above<ul><li>Make sure to give it the 
<code>-base_path</code> parameter</li></ul></li><li>Sync the database using one of the commands displayed by <code>init_pipeline.pl</code></li><li>Run the pipeline in a loop with a good sleep between submissions and redirect log output (the following assumes you are using <strong>bash</strong>)<ul><li><code>> my_run.log</code> sends STDOUT to this file. Use <code>tail -f</code> to track the pipeline</li><li><code>2>&1</code> must come after the file redirect; it then sends STDERR to the same file as STDOUT</li></ul></li><li><code>beekeeper.pl -url mysql://usr:pass@server:port/db -reg_conf reg.pm -loop -sleep 5 > my_run.log 2>&1 &</code></li><li>Wait</li></ol></body></html> \ No newline at end of file diff --git a/docs/pipelines/flatfile.textile b/docs/pipelines/flatfile.textile new file mode 100644 index 0000000000..6d19ecd9a6 --- /dev/null +++ b/docs/pipelines/flatfile.textile @@ -0,0 +1,137 @@ +h1. FlatFile Pipeline + +This is a re-implementation of an existing pipeline developed originally by core and the webteam. The new version uses eHive, so familiarity with this system is essential, and has been written to use as little memory as possible. + +h2. The Registry File + +This is the way we retrieve the database connections to work with. The registry file should specify: + +* The core (and any other) databases to dump from + +Here is an example of a file for v67 of Ensembl. Note the use of the Registry object within a registry file and the scoping of the package. If you omit the *-db_version* parameter and only use HEAD checkouts of Ensembl then this will automatically select the latest version of the API. Any change to the version here must be reflected in the configuration file. + +bc. 
+ package Reg; + use Bio::EnsEMBL::Registry; + Bio::EnsEMBL::Registry->no_version_check(1); + Bio::EnsEMBL::Registry->no_cache_warnings(1); + { + my $version = 67; + Bio::EnsEMBL::Registry->load_registry_from_multiple_dbs( + { + -host => "mydb-1", + -port => 3306, + -db_version => $version, + -user => "user", + -NO_CACHE => 1, + }, + { + -host => "mydb-2", + -port => 3306, + -db_version => $version, + -user => "user", + -NO_CACHE => 1, + }, + ); + } + 1; + +Pass the registry to the *init_pipeline.pl* script via the *-registry* option. + +h2. Overriding Defaults Using a New Config File + +If a number of your parameters do not change between releases, we recommend creating a configuration file which inherits from the root config file, e.g. + +bc. + package MyCnf; + use base qw/Bio::EnsEMBL::Pipeline::PipeConfig::Flatfile_conf/; + sub default_options { + my ($self) = @_; + return { + %{ $self->SUPER::default_options() }, + # Override options here + }; + } + 1; + +If you do override the config then use the package name of your overriding config in place of the root config in the example commands. + +h2. Environment + +h3. PERL5LIB + +* ensembl +* ensembl-hive +* bioperl + +h3. PATH + +* ensembl-hive/scripts + +h3. ENSEMBL_CVS_ROOT_DIR + +Set to the base checkout of Ensembl. Appending *ensembl-hive/sql* to this path should give the SQL directory for hive, e.g. + +bc. + export ENSEMBL_CVS_ROOT_DIR=$HOME/work/ensembl-checkouts + +h3. ENSADMIN_PSW + +Set this to the password used to log into the database server, e.g. + +bc. + export ENSADMIN_PSW=wibble + +h2. Command Line Arguments + +Where *Multiple Supported* is *Yes*, the parameter may be given more than once on the command line. @-species@ is one such option, e.g. + +bc. -species human -species cele -species yeast + +|_. Name |_. Type|_. Multiple Supported|_. Description|_. Default|_. 
Required| +|@-registry@|String|No|Location of the Ensembl registry to use with this pipeline|-|*YES*| +|@-base_path@|String|No|Location of the dumps|-|*YES*| +|@-pipeline_db -host=@|String|No|Specify a host for the hive database e.g. @-pipeline_db -host=myserver.mysql@|See hive generic config|*YES*| +|@-pipeline_db -dbname=@|String|No|Specify a different database to use as the hive DB e.g. @-pipeline_db -dbname=my_dumps_test@|Uses pipeline name by default|*NO*| +|@-species@|String|Yes|Specify one or more species to process. The pipeline will only _consider_ these species|-|*NO*| +|@-types@|String|Yes|Specify each type of dump you want to produce. Supported values are *embl* and *genbank*|All|*NO*| +|@-db_types@|String|Yes|The database types to use. Supports the normal db adaptor groups e.g. *core*, *otherfeatures* etc.|core|*NO*| +|@-pipeline_name@|String|No|Name to use for the pipeline|flatfile_dump_$release|*NO*| +|@-email@|String|No|Email address to send pipeline summaries to upon successful completion|$USER@sanger.ac.uk|*NO*| + +h2. Example Commands + +h3. Normal use: + +bc. + init_pipeline.pl Bio::EnsEMBL::Pipeline::PipeConfig::Flatfile_conf \ + -pipeline_db -host=my-db-host -base_path /path/to/dumps -registry reg.pm + +h3. Run a subset of species (no forcing & supports registry aliases): + +bc. + init_pipeline.pl Bio::EnsEMBL::Pipeline::PipeConfig::Flatfile_conf \ + -pipeline_db -host=my-db-host -species anolis -species celegans -species human \ + -base_path /path/to/dumps -registry reg.pm + +h3. Dumping just EMBL data (no genbank): + +bc. + init_pipeline.pl Bio::EnsEMBL::Pipeline::PipeConfig::Flatfile_conf \ + -pipeline_db -host=my-db-host -types embl \ + -base_path /path/to/dumps -registry reg.pm + +h2. 
Running the Pipeline + +# Start a screen session, or be ready to run the beekeeper under @nohup@ +# Choose a dump location +#* A fasta, blast and blat directory will be created 1 level below +# Use an @init_pipeline.pl@ configuration from above +#* Make sure to give it the @-base_path@ parameter +# Sync the database using one of the commands displayed by @init_pipeline.pl@ +# Run the pipeline in a loop with a good sleep between submissions and redirect log output (the following assumes you are using *bash*) +#* @> my_run.log@ sends STDOUT to this file. Use @tail -f@ to track the pipeline +#* @2>&1@ must come after the file redirect; it then sends STDERR to the same file as STDOUT +# @beekeeper.pl -url mysql://usr:pass@server:port/db -reg_conf reg.pm -loop -sleep 5 > my_run.log 2>&1 &@ +# Wait + -- GitLab