UniProt Public / UniFIRE · Commits

Commit 4beb3164, authored Aug 06, 2020 by Hermann Zellner

Merge branch 'singularity-podman' into 'master'

Singularity podman

See merge request uniprot-public/unifire!11

Parents: a72050e0, d040deff
Changes: 6 changed files, with 245 additions and 20 deletions (+245 −20)

- README.md: +63 −12
- docker/Dockerfile: +1 −1
- docker/bin/run_unifire_docker.sh: +42 −5
- docker/scripts/unifire-workflow.sh: +3 −2
- docker/scripts/update-taxonomy-cache.sh: +29 −0
- misc/taxonomy/fetchTaxonomicLineage.py: +107 −0
README.md
@@ -64,18 +64,24 @@ The only input data that need to be provided are the protein sequence data in mu
### Usage
```diff
-usage: run_unifire_docker.sh -i <INPUT_FILE> -i <OUTPUT_FOLDER> [-v <VERSION> [-w <WORKING_FOLDER [-c]]]
-  -i: Path to multi-FASTA input file with headers in UniProt FASTA header format, containing at least
-      OX=<taxid>. (Required)
-  -o: Path to output folder. All output files with predictions in TSV format will be available in this
-      folder at the end of the procedure. (Required)
-  -v: Version of the docker image to use, e.g. 2020.2. Available versions are listed under
-      https://gitlab.ebi.ac.uk/uniprot-public/unifire/container_registry. (Optional), DEFAULT: 2020.2
-  -w: Path to an empty working directory. If this option is not given, then a temporary folder will be
-      created and used to store intermediate files. (Optional)
-  -c: Clean up temporary files. If set, then all temporary files will be cleaned up at the end of the
-      procedure. If no working directory is provided through option -w then the temporary files are cleaned
-      up by default
+usage: ./docker/bin/run_unifire_docker.sh -i <INPUT_FILE> -o <OUTPUT_FOLDER> [-v <VERSION>] [-w <WORKING_FOLDER] [-c]
+       [-s docker|singularity|podman]
+  -i: Path to multi-FASTA input file with headers in UniProt FASTA header format, containing at least
+      OX=<taxid>. (Required)
+  -o: Path to output folder. All output files with predictions in TSV format will be available in this
+      folder at the end of the procedure. (Required)
+  -v: Version of the docker image to use, e.g. 2020.2. Available versions are listed under
+      https://gitlab.ebi.ac.uk/uniprot-public/unifire/container_registry. (Optional), DEFAULT: 2020.4.1
+  -w: Path to an empty working directory. If this option is not given, then a temporary folder will be
+      created and used to store intermediate files. (Optional)
+  -c: Clean up temporary files. If set, then all temporary files will be cleaned up at the end of the
+      procedure. If no working directory is provided through option -w then the temporary files are cleaned
+      up by default
+  -s: Container software to be used. (Optional), DEFAULT: docker
+      Allowed values:
+        docker:      Use Docker to run UniFIRE Docker image
+        singularity: Use Singularity to run UniFIRE Docker image
+        podman:      Use Podman to run UniFIRE Docker image
```
### Example
...
...
@@ -104,6 +110,51 @@ The application of the UniFIRE Docker image on a complete bacterial proteome wit
procedure.
<br/>
### Alternatives to Docker
In a multi-user environment such as most HPC clusters, Docker is often not a practical option. Therefore the alternatives *Singularity* and *Podman* have been tested for running the UniFIRE Docker image.
#### Singularity
Instead of Docker, an existing Singularity installation can be used to run the UniFIRE Docker image. The `singularity` executable must be available in the PATH environment variable. The UniFIRE Docker image has been tested successfully with Singularity version 3.6.1.

Because the UniFIRE image is large, you may want to use a folder with enough free disk space (~200 GB) for temporary and cached files:
```
export SINGULARITY_CACHEDIR=/path/to/cache/folder
export SINGULARITY_TMPDIR=/path/to/tmp/folder
export SINGULARITY_LOCALCACHEDIR=/path/to/localcache/folder
```
Run the Docker image with Singularity:
```
./docker/bin/run_unifire_docker.sh -i samples/proteins.fasta -o . -s singularity
```
#### Podman
Instead of Docker, an existing Podman installation can be used to run the UniFIRE Docker image. The `podman` executable must be available in the PATH environment variable. The UniFIRE Docker image has been tested successfully with Podman version 2.0.3.

Because the UniFIRE image is large, you may want to use a folder with enough free disk space (~200 GB) for temporary files:
```
export TMPDIR=/path/to/tmp/folder
```
Run the Docker image with Podman:
```
./docker/bin/run_unifire_docker.sh -i samples/proteins.fasta -o . -s podman
```
In both cases, Singularity and Podman, the resulting output files will be located in ${run_folder} with the filenames
```
predictions_unirule.out
predictions_unirule-pirsr.out
predictions_arba.out
```
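As a quick post-run sanity check, you can count the prediction rows in each of these files. The helper below is a hypothetical snippet, not part of the repository; the function name is made up for illustration:

```shell
# Hypothetical helper (not from the UniFIRE repository): report how many
# prediction rows each expected output file contains, or flag it as missing.
check_predictions() {
    # $1: run folder holding the prediction TSV files
    for f in predictions_unirule.out predictions_unirule-pirsr.out predictions_arba.out; do
        if [ -f "$1/${f}" ]; then
            printf '%s: %s predictions\n' "${f}" "$(wc -l < "$1/${f}")"
        else
            printf '%s: missing\n' "${f}"
        fi
    done
}
```

For example, `check_predictions "${run_folder}"` prints one line per expected file.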
## 2. Run UniFIRE after building it from its source code
### Prerequisites
...
...
docker/Dockerfile
@@ -26,7 +26,7 @@ RUN pip install --upgrade ete3
```diff
 COPY scripts /opt/scripts/bin
 RUN chmod 775 /opt/scripts/bin/*.sh
-RUN /opt/scripts/bin/update-taxonomy-cache.py
+RUN /opt/scripts/bin/update-taxonomy-cache.sh
 RUN /opt/scripts/bin/download-interproscan.sh
 RUN /opt/scripts/bin/download-unifire.sh
```
...
...
docker/bin/run_unifire_docker.sh
@@ -23,11 +23,13 @@ infile=""
```diff
 outdir=""
 workdir=""
 cleanworkdir=0
+container_software="docker"
-docker_version="2020.2"
+docker_version="2020.4.1"
 predictionfiles="predictions_unirule.out predictions_arba.out predictions_unirule-pirsr.out"

 function usage() {
-    echo "usage: $0 -i <INPUT_FILE> -o <OUTPUT_FOLDER> [-v <VERSION> [-w <WORKING_FOLDER [-c]]]"
+    echo "usage: $0 -i <INPUT_FILE> -o <OUTPUT_FOLDER> [-v <VERSION>] [-w <WORKING_FOLDER] [-c]"
+    echo "       [-s docker|singularity|podman]"
     echo " -i: Path to multi-FASTA input file with headers in UniProt FASTA header format, containing at least"
     echo "     OX=<taxid>. (Required)"
     echo " -o: Path to output folder. All output files with predictions in TSV format will be available in this"
```
...
...
@@ -39,10 +41,15 @@ function usage() {
```diff
     echo " -c: Clean up temporary files. If set, then all temporary files will be cleaned up at the end of the"
     echo "     procedure. If no working directory is provided through option -w then the temporary files are cleaned"
     echo "     up by default"
+    echo " -s: Container software to be used. (Optional), DEFAULT: docker"
+    echo "     Allowed values:"
+    echo "       docker:      Use Docker to run UniFIRE Docker image"
+    echo "       singularity: Use Singularity to run UniFIRE Docker image"
+    echo "       podman:      Use Podman to run UniFIRE Docker image"
     exit 1
 }

-while getopts "i:o:w:c:v:" optionName
+while getopts "i:o:w:c:v:s:" optionName
 do
     case "${optionName}" in
         i) infile=${OPTARG};;
```
...
...
@@ -50,9 +57,26 @@ do
```diff
         w) workdir=${OPTARG};;
         v) docker_version=${OPTARG};;
         c) cleanworkdir=1;;
+        s) container_software=${OPTARG};;
     esac
 done

+if [ ${container_software} != "docker" ] && [ ${container_software} != "singularity" ] && \
+        [ ${container_software} != "podman" ]
+then
+    echo "Invalid container software ${container_software} given!"
+    printf "This script supports docker, singularity or podman only at this time.\n\n"
+    usage
+fi
+
+if ! command -v ${container_software} &> /dev/null
+then
+    echo "${container_software} executable could not be found. Please make sure ${container_software} is installed and available"
+    printf "in the PATH environment variable. Exiting.\n\n"
+    usage
+fi
+
 # infile
 function check_infile() {
     if [[ ! -f ${infile} ]]
```
...
...
@@ -119,9 +143,22 @@ function check_workdir() {
```diff
 # Run the docker image on the prepared ${workdir}
 function run_docker_image() {
     cp ${infile} ${workdir}/proteins.fasta
-    docker run \
-        --mount type=bind,source=${workdir},target=/volume \
-        dockerhub.ebi.ac.uk/uniprot-public/unifire:${docker_version}
+    if [ ${container_software} == "docker" ]
+    then
+        docker run \
+            --mount type=bind,source=${workdir},target=/volume \
+            dockerhub.ebi.ac.uk/uniprot-public/unifire:${docker_version}
+    elif [ ${container_software} == "singularity" ]
+    then
+        singularity run \
+            --bind ${workdir}:/volume \
+            docker://dockerhub.ebi.ac.uk/uniprot-public/unifire:${docker_version}
+    elif [ ${container_software} == "podman" ]
+    then
+        podman run \
+            --mount type=bind,source=${workdir},target=/volume \
+            docker://dockerhub.ebi.ac.uk/uniprot-public/unifire:${docker_version}
+    fi
 }
```
# Move output files from ${workdir} to ${outdir}
...
...
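The runtime validation added to run_unifire_docker.sh can be exercised in isolation. The sketch below (a hypothetical helper name, not from the repository) expresses the same check as the script's chained `!=` tests, using a `case` statement:

```shell
# Hypothetical standalone version of the -s option validation in
# run_unifire_docker.sh; returns 0 only for a supported container runtime.
is_supported_container_software() {
    case "$1" in
        docker|singularity|podman) return 0 ;;
        *) return 1 ;;
    esac
}
```

A `case` pattern list keeps the accepted values in one place, so adding another runtime later means touching a single line.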
docker/scripts/unifire-workflow.sh
@@ -18,12 +18,13 @@
```diff
 UNIFIRE_REPO="/opt/git/unifire"
 INTERPROSCAN_REPO="/opt/interproscan-5.45-80.0"
 ETE3FOLDER="/opt/ete3"
 VOLUME=/volume
 infilename=infile.fasta

 cd ${UNIFIRE_REPO}
-./misc/taxonomy/fetchLineageLocal.py ${VOLUME}/proteins.fasta ${VOLUME}/proteins_lineage.fasta
+${UNIFIRE_REPO}/misc/taxonomy/fetchTaxonomicLineage.py -i ${VOLUME}/proteins.fasta -o ${VOLUME}/proteins_lineage.fasta \
+    -t ${ETE3FOLDER}/taxa.sqlite
 ${INTERPROSCAN_REPO}/interproscan.sh -f xml -dp -i ${VOLUME}/proteins_lineage.fasta \
     --appl "Hamap,ProSiteProfiles,ProSitePatterns,Pfam,TIGRFAM,SMART,PRINTS,SFLD,CDD,Gene3D,ProDom,PIRSF,PANTHER,SUPERFAMILY" \
```
...
...
docker/scripts/update-taxonomy-cache.sh (new file, mode 100644)
```
#!/usr/bin/env bash
############################################################################
# Copyright (c) 2018 European Molecular Biology Laboratory
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
############################################################################

set -e
set -u

SCRIPT_PATH=`dirname $0`
ETE3FOLDER="/opt/ete3"

${SCRIPT_PATH}/update-taxonomy-cache.py
mkdir -p ${ETE3FOLDER}
mv ~/.etetoolkit/taxa.sqlite ${ETE3FOLDER}/
chmod 644 ${ETE3FOLDER}/taxa.sqlite
```
misc/taxonomy/fetchTaxonomicLineage.py (new file, mode 100755)
```
#!/usr/bin/python
# -*- coding: utf-8 -*-

"""
The fetchTaxonomicLineage.py script reads a multi-FASTA file and replaces any occurrence of "OX={taxId}" in the header
with the full lineage corresponding to this taxId. The output is the resolved multi-FASTA file written to the given
output path.
This script should be used when processing large amounts of sequences from different species.
All the taxonomy data from NCBI are stored locally via ete3 (by default in ~/.etetoolkit/taxa.sqlite).
This local storage can be updated using the following Python lines:
    from ete3 import NCBITaxa
    NCBITaxa().update_taxonomy_database()
Library dependencies (via pip / conda / ...):
    * ete3 (pip install ete3 / conda install -c etetoolkit ete3)
    * biopython
"""

import argparse
import re
import sys

from Bio import SeqIO
from ete3 import NCBITaxa

__copyright__ = "Copyright 2018, European Molecular Biology Laboratory"
__license__ = "Apache 2.0"
__maintainer__ = "EMBL-EBI - Protein Function Development Team"
__status__ = "Prototype"
__author__ = "Alexandre Renaux"

header_DE_remove_pattern = re.compile("([a-zA-Z0-9]+\|[a-zA-Z0-9]+\|[a-zA-Z0-9_]+)\s(.+?)(\s[A-Z]{2}=.+)")
header_OX_pattern = re.compile('(OX=)(\d+)')

taxId_to_lineage = {}


def get_taxonomy_full_lineage(tax_id, ncbi):
    if tax_id in taxId_to_lineage:
        return taxId_to_lineage[tax_id]
    else:
        lineage = ncbi.get_lineage(tax_id)
        taxId_to_lineage[tax_id] = lineage
        return lineage


def resolve_header(header, ncbi):
    tax_id_match = header_OX_pattern.search(header)
    if tax_id_match:
        tax_id = int(tax_id_match.group(2))
        lineage = get_taxonomy_full_lineage(tax_id, ncbi)
        if lineage:
            replacement = "\g<1>" + ",".join(str(i) for i in lineage)
            return re.sub(header_OX_pattern, replacement, header)
    return header


def remove_long_protein_name(description):
    match = header_DE_remove_pattern.search(description)
    if match:
        groups = list(match.groups())
        if len(groups[1]) > 127:
            del groups[1]
        return " ".join(groups)
    else:
        return description


def main(arguments):
    file_in = arguments.infile
    file_out = arguments.outfile
    if arguments.taxadb is None:
        ncbi = NCBITaxa()
    else:
        ncbi = NCBITaxa(dbfile=arguments.taxadb)
    with open(file_out, 'w') as f_out:
        for seq_record in SeqIO.parse(open(file_in, mode='r'), "fasta"):
            seq_record.description = remove_long_protein_name(resolve_header(seq_record.description, ncbi))
            seq_record.id = ""
            r = SeqIO.write(seq_record, f_out, "fasta")
            if r != 1:
                print("Error while writing sequence: " + seq_record.id)


def parse_args():
    parser = argparse.ArgumentParser(description="""
        The script fetchTaxonomicLineage.py reads an input file in multifasta format and will replace any occurrence
        of "OX={taxId}" in the header by the full lineage corresponding to this taxId.""")
    parser.add_argument('--infile', '-i', dest="infile", required=True, help="""
        Path to the input file in multifasta format with one tax-id in each fasta header in the format OX={taxId}""")
    parser.add_argument('--outfile', '-o', dest="outfile", required=True, help="""
        Path to the output file in multifasta format with the full taxonomic lineage in each fasta header in the
        format OX={taxId1,taxId2,...}""")
    parser.add_argument('--taxa-sqlite', '-t', dest="taxadb", required=False, help="""
        Path to the sqlite DB file for taxonomy database. Default location is ~/.etetoolkit/taxa.sqlite""")
    return parser.parse_args()


if __name__ == "__main__":
    arguments = parse_args()
    main(arguments)
```
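The OX-replacement logic in this new script can be illustrated in isolation. The snippet below re-implements the same regex substitution with a hard-coded lineage; in the real script the lineage comes from ete3's NCBITaxa, and the function name here is hypothetical:

```python
import re

# Same pattern as in fetchTaxonomicLineage.py: captures "OX=" and the taxId.
header_OX_pattern = re.compile(r'(OX=)(\d+)')

def resolve_header_with_lineage(header, lineage):
    """Replace OX=<taxId> with OX=<id1,id2,...> using a precomputed lineage.

    In fetchTaxonomicLineage.py the lineage is looked up via
    NCBITaxa.get_lineage(); here it is passed in so the example runs
    without the local taxonomy database.
    """
    replacement = r"\g<1>" + ",".join(str(i) for i in lineage)
    return header_OX_pattern.sub(replacement, header)

header = "sp|P12345|EXAMPLE_ECOLI Example protein OS=Escherichia coli OX=562"
print(resolve_header_with_lineage(header, [1, 131567, 2, 1224, 562]))
# → sp|P12345|EXAMPLE_ECOLI Example protein OS=Escherichia coli OX=1,131567,2,1224,562
```

Headers without an `OX=` tag pass through unchanged, matching the fallback branch of `resolve_header`.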