<h1 id="fasta_pipeline">FASTA Pipeline</h1>
<p>This is a re-implementation of an existing pipeline originally developed by
core and the webteam. The new version runs on eHive, so familiarity with that
system is essential. It has been written to use as little memory as possible.</p>
<h2 id="the_registry_file">The Registry File</h2>
<p>The registry file is how the pipeline retrieves the database connections it
works with. The registry file should specify:</p>
<ul>
<li>The core (and any other) databases to dump from</li>
<li>A production database
<ul>
<li><strong>species = multi</strong></li>
<li><strong>group = production</strong></li>
<li>Used to find which species require new DNA</li>
</ul></li>
<li>A web database
<ul>
<li><strong>species = multi</strong></li>
<li><strong>group = web</strong></li>
<li>Used to name BLAT index files</li>
</ul></li>
</ul>
<p>Here is an example of a file for v67 of Ensembl. Note the use of the
Registry object within a registry file and the scoping of the package. If
you omit the <em>-db_version</em> parameter and only use HEAD checkouts of Ensembl
then this will automatically select the latest version of the API. Any
change to version here must be reflected in the configuration file.</p>
<pre><code>package Reg;
use Bio::EnsEMBL::Registry;
use Bio::EnsEMBL::DBSQL::DBAdaptor;
Bio::EnsEMBL::Registry->no_version_check(1);
Bio::EnsEMBL::Registry->no_cache_warnings(1);
{
  my $version = 67;
  # Core (and any other) databases to dump from
  Bio::EnsEMBL::Registry->load_registry_from_multiple_dbs(
    {
      -host => "mydb-1",
      -port => 3306,
      -db_version => $version,
      -user => "user",
      -NO_CACHE => 1,
    },
    {
      -host => "mydb-2",
      -port => 3306,
      -db_version => $version,
      -user => "user",
      -NO_CACHE => 1,
    },
  );
  # Web database: used to name BLAT index files
  Bio::EnsEMBL::DBSQL::DBAdaptor->new(
    -HOST => 'mydb-2',
    -PORT => 3306,
    -USER => 'user',
    -DBNAME => 'ensembl_website',
    -SPECIES => 'multi',
    -GROUP => 'web'
  );
  # Production database: used to find which species require new DNA
  Bio::EnsEMBL::DBSQL::DBAdaptor->new(
    -HOST => 'mydb-2',
    -PORT => 3306,
    -USER => 'user',
    -DBNAME => 'ensembl_production',
    -SPECIES => 'multi',
    -GROUP => 'production'
  );
}
1;
</code></pre>
<h2 id="overriding_defaults_using_a_new_config_file">Overriding Defaults Using a New Config File</h2>
<p>If you have a number of parameters which do not change between releases,
we recommend creating a configuration file which inherits from the
root config file, e.g.</p>
<pre><code>package MyCnf;
use base qw/Bio::EnsEMBL::Pipeline::PipeConfig::FASTA_conf/;
sub default_options {
  my ($self) = @_;
  return {
    # Start from the defaults of the root config
    %{ $self->SUPER::default_options() },
    #Override of options
  };
}
1;
</code></pre>
<h2 id="environment">Environment</h2>
<h3 id="perl5lib">PERL5LIB</h3>
<ul>
<li>ensembl</li>
<li>ensembl-hive</li>
<li>bioperl</li>
</ul>
<h3 id="path">PATH</h3>
<ul>
<li>ensembl-hive/scripts</li>
<li>faToTwoBit (if not using a custom location)</li>
<li>xdformat (if not using a custom location)</li>
</ul>
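<p>A minimal bash sketch of the settings above, assuming all three checkouts live under one directory. <code>ENSEMBL_ROOT</code>, its <code>/path/to/checkouts</code> default and the <code>bioperl-live</code> directory name are assumptions for illustration, not part of the pipeline. <code>faToTwoBit</code> and <code>xdformat</code> still need to be on <strong>PATH</strong> separately if you are not using custom locations.</p>
<pre><code># ENSEMBL_ROOT is a hypothetical checkout directory; adjust to your layout.
ENSEMBL_ROOT="${ENSEMBL_ROOT:-/path/to/checkouts}"

# Perl modules for ensembl and ensembl-hive live in their modules/ directories.
# The bioperl-live directory name is an assumption about your BioPerl checkout.
PERL5LIB="$ENSEMBL_ROOT/ensembl/modules:$ENSEMBL_ROOT/ensembl-hive/modules:$ENSEMBL_ROOT/bioperl-live${PERL5LIB:+:$PERL5LIB}"
export PERL5LIB

# eHive scripts such as init_pipeline.pl and beekeeper.pl.
PATH="$ENSEMBL_ROOT/ensembl-hive/scripts:$PATH"
export PATH
</code></pre>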
<h2 id="example_commands">Example Commands</h2>
<h3 id="to_load_use_normally">To load for normal use:</h3>
<pre><code>init_pipeline.pl Bio::EnsEMBL::Pipeline::PipeConfig::FASTA_conf \
-pipeline_db -host=my-db-host -base_path /path/to/dumps
</code></pre>
<h3 id="run_a_subset_of_species_no_forcing_supports_registry_aliases">Run a subset of species (no forcing; supports registry aliases):</h3>
<pre><code>init_pipeline.pl Bio::EnsEMBL::Pipeline::PipeConfig::FASTA_conf \
-pipeline_db -host=my-db-host -species anolis -species celegans -species human \
-base_path /path/to/dumps
</code></pre>
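<p>When the species list is long, repeating <strong>-species</strong> by hand is error-prone. The sketch below builds the flags from a list and prints the resulting command as a dry run. <code>species_args</code> is a hypothetical helper, not part of eHive, and it assumes species names contain no whitespace.</p>
<pre><code># Hypothetical helper: emit one "-species NAME" pair per argument.
species_args() {
  for name in "$@"; do
    printf '%s %s ' -species "$name"
  done
}

# Dry run: print the command instead of executing it.
# $(species_args ...) is left unquoted so the flags split into separate words.
echo init_pipeline.pl Bio::EnsEMBL::Pipeline::PipeConfig::FASTA_conf \
  -pipeline_db -host=my-db-host $(species_args anolis celegans human) \
  -base_path /path/to/dumps
</code></pre>
<p>Remove the leading <code>echo</code> to run the command for real.</p>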
<h3 id="specifying_species_to_force_supports_all_registry_aliases">Specifying species to force (supports all registry aliases):</h3>
<pre><code>init_pipeline.pl Bio::EnsEMBL::Pipeline::PipeConfig::FASTA_conf \
-pipeline_db -host=my-db-host -force_species anolis -force_species celegans -force_species human \
-base_path /path/to/dumps
</code></pre>
<h3 id="running_forcing_a_species">Running and forcing a species:</h3>
<pre><code>init_pipeline.pl Bio::EnsEMBL::Pipeline::PipeConfig::FASTA_conf \
-pipeline_db -host=my-db-host -species celegans -force_species celegans \
-base_path /path/to/dumps
</code></pre>
<h3 id="dumping_just_gene_data">Dumping just gene data:</h3>
<pre><code>init_pipeline.pl Bio::EnsEMBL::Pipeline::PipeConfig::FASTA_conf \
-pipeline_db -host=my-db-host -dump_type cdna \
-base_path /path/to/dumps
</code></pre>
<h3 id="using_a_different_scp_user_identity">Using a different SCP user and identity:</h3>
<pre><code>init_pipeline.pl Bio::EnsEMBL::Pipeline::PipeConfig::FASTA_conf \
-pipeline_db -host=my-db-host -scp_user anotherusr -scp_identity /users/anotherusr/.pri/identity \
-base_path /path/to/dumps
</code></pre>
<h2 id="running_the_pipeline">Running the Pipeline</h2>
<ol>
<li>Start a screen session or get ready to run the beekeeper with a <strong>nohup</strong></li>
<li>Choose a dump location
<ul>
<li>A fasta, a blast and a blat directory will be created one level below it</li>
</ul></li>
<li>Use an <em>init_pipeline.pl</em> configuration from above
<ul>
<li>Make sure to give it the <strong>-base_path</strong> parameter</li>
</ul></li>
<li>Sync the database using one of the commands displayed by <em>init_pipeline.pl</em></li>
<li><p>Run the pipeline in a loop with a good sleep between submissions and redirect log output (the following assumes you are using <strong>bash</strong>)</p>
<ul>
<li><strong>> my_run.log</strong> sends STDOUT to this file. Use <strong>tail -f</strong> to track the pipeline</li>
<li><strong>2>&1</strong> is important as it redirects STDERR into STDOUT. It must come after the file redirect, otherwise STDERR still goes to the terminal
<pre><code>beekeeper.pl -url mysql://usr:pass@server:port/db -reg_conf reg.pm -loop -sleep 5 > my_run.log 2>&1 &
</code></pre></li>
</ul></li>
<li><p>Wait</p></li>
</ol>
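<p>Redirect order matters in bash, because redirections are applied left to right. The sketch below shows the difference using a stand-in function in place of <em>beekeeper.pl</em>; <code>noisy</code> and the temporary log file are illustrative only.</p>
<pre><code># Stand-in for beekeeper.pl: one line to STDOUT, one to STDERR.
noisy() {
  echo "to stdout"
  echo "to stderr" 1>&2
}

log=$(mktemp)

# File redirect first, then 2>&1: both streams end up in the log.
noisy > "$log" 2>&1
both=$(grep -c 'to std' "$log")

# 2>&1 first: STDERR is duplicated from the current STDOUT (the terminal)
# before STDOUT is moved to the file, so only STDOUT is logged.
noisy 2>&1 > "$log"
only=$(grep -c 'to std' "$log")

echo "file redirect first: $both lines logged"
echo "2>&1 first: $only line logged"
rm -f "$log"
</code></pre>
<p>The first form logs both lines. The second logs only the STDOUT line, and the STDERR line appears on the terminal instead.</p>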