Commit 4beb3164 authored by Hermann Zellner's avatar Hermann Zellner

Merge branch 'singularity-podman' into 'master'

Singularity podman

See merge request uniprot-public/unifire!11
parents a72050e0 d040deff
......@@ -64,18 +64,24 @@ The only input data that need to be provided are the protein sequence data in mu
### Usage
```
usage: run_unifire_docker.sh -i <INPUT_FILE> -i <OUTPUT_FOLDER> [-v <VERSION> [-w <WORKING_FOLDER [-c]]]
-i: Path to multi-FASTA input file with headers in UniProt FASTA header format, containing at least
OX=<taxid>. (Required)
-o: Path to output folder. All output files with predictions in TSV format will be available in this
folder at the end of the procedure. (Required)
-v: Version of the docker image to use, e.g. 2020.2. Available versions are listed under
https://gitlab.ebi.ac.uk/uniprot-public/unifire/container_registry. (Optional), DEFAULT: 2020.2
-w: Path to an empty working directory. If this option is not given, then a temporary folder will be
created and used to store intermediate files. (Optional)
-c: Clean up temporary files. If set, then all temporary files will be cleaned up at the end of the
procedure. If no working directory is provided through option -w then the temporary files are cleaned
up by default
usage: ./docker/bin/run_unifire_docker.sh -i <INPUT_FILE> -o <OUTPUT_FOLDER> [-v <VERSION>] [-w <WORKING_FOLDER>] [-c]
[-s docker|singularity|podman]
-i: Path to multi-FASTA input file with headers in UniProt FASTA header format, containing at least
OX=<taxid>. (Required)
-o: Path to output folder. All output files with predictions in TSV format will be available in this
folder at the end of the procedure. (Required)
-v: Version of the docker image to use, e.g. 2020.2. Available versions are listed under
https://gitlab.ebi.ac.uk/uniprot-public/unifire/container_registry. (Optional), DEFAULT: 2020.4.1
-w: Path to an empty working directory. If this option is not given, then a temporary folder will be
created and used to store intermediate files. (Optional)
-c: Clean up temporary files. If set, then all temporary files will be cleaned up at the end of the
procedure. If no working directory is provided through option -w then the temporary files are cleaned
up by default
-s: Container software to be used. (Optional), DEFAULT: docker
Allowed values:
docker: Use Docker to run UniFIRE Docker image
singularity: Use Singularity to run UniFIRE Docker image
podman: Use Podman to run UniFIRE Docker image
```
### Example
......@@ -104,6 +110,51 @@ The application of the UniFIRE Docker image on a complete bacterial proteome wit
procedure.
<br/>
### Alternatives to Docker
Docker is often not a practical solution in a multi-user environment like most HPC clusters, mainly because the Docker
daemon requires root privileges. Therefore, alternatives like *Singularity* and *Podman* have been tested for running
the UniFIRE Docker image.
#### Singularity
Instead of Docker, an existing Singularity installation can be used to run the UniFIRE Docker image. The executable
"singularity" must be available in the PATH environment variable. The UniFIRE Docker image has been tested successfully
with Singularity version 3.6.1.
Because the UniFIRE image is large, you may want to use a folder with enough free disk space
(~200 GB) for temporary and cached files:
```
export SINGULARITY_CACHEDIR=/path/to/cache/folder
export SINGULARITY_TMPDIR=/path/to/tmp/folder
export SINGULARITY_LOCALCACHEDIR=/path/to/localcache/folder
```
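To check whether a candidate folder has enough room before pointing these variables at it, a minimal sketch (`/tmp` is a placeholder for your intended cache location):

```
# Print the available space, in GB, on the filesystem holding the folder.
# /tmp is a placeholder -- substitute your intended SINGULARITY_CACHEDIR path.
avail_gb=$(df -Pk /tmp | awk 'NR==2 {printf "%d", $4 / 1024 / 1024}')
echo "Available space: ${avail_gb} GB"
```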
Run the Docker image with Singularity:
```
./docker/bin/run_unifire_docker.sh -i samples/proteins.fasta -o . -s singularity
```
#### Podman
Instead of Docker, an existing Podman installation can be used to run the UniFIRE Docker image. The executable
"podman" must be available in the PATH environment variable. The UniFIRE Docker image has been tested successfully
with Podman version 2.0.3.
Because the UniFIRE image is large, you may want to use a folder with enough free disk space
(~200 GB) for temporary and cached files:
```
export TMPDIR=/path/to/tmp/folder
```
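If no suitable scratch folder exists yet, one way to create one and export it (a sketch; `mktemp -d` stands in for a path on a large filesystem):

```
# Create a fresh temporary folder and point Podman's TMPDIR at it.
scratch=$(mktemp -d)
export TMPDIR="$scratch"
echo "TMPDIR set to $TMPDIR"
```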
Run the Docker image with Podman:
```
./docker/bin/run_unifire_docker.sh -i samples/proteins.fasta -o . -s podman
```
In both cases, Singularity and Podman, the resulting output files will be located in ${run_folder} with the filenames
```
predictions_unirule.out
predictions_unirule-pirsr.out
predictions_arba.out
```
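To get a quick idea of how many predictions each run produced, you can count the data rows in these files. The sketch below first writes a one-row sample file so it is self-contained; the column names are illustrative, not the exact UniFIRE TSV schema:

```
# Write a tiny illustrative prediction file (one header line + one data row).
printf 'ProteinId\tAnnotationType\tValue\n' > predictions_unirule.out
printf 'P12345\tcomment\tExample annotation\n' >> predictions_unirule.out

# Count data rows (total lines minus the header) in each file that exists.
for f in predictions_unirule.out predictions_unirule-pirsr.out predictions_arba.out; do
    if [ -f "$f" ]; then
        echo "$f: $(($(wc -l < "$f") - 1)) prediction(s)"
    fi
done
```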
## 2. Run UniFIRE after building it from its source code
### Prerequisites
......
......@@ -26,7 +26,7 @@ RUN pip install --upgrade ete3
COPY scripts /opt/scripts/bin
RUN chmod 775 /opt/scripts/bin/*.sh
RUN /opt/scripts/bin/update-taxonomy-cache.py
RUN /opt/scripts/bin/update-taxonomy-cache.sh
RUN /opt/scripts/bin/download-interproscan.sh
RUN /opt/scripts/bin/download-unifire.sh
......
......@@ -23,11 +23,13 @@ infile=""
outdir=""
workdir=""
cleanworkdir=0
container_software="docker"
docker_version="2020.4.1"
predictionfiles="predictions_unirule.out predictions_arba.out predictions_unirule-pirsr.out"
function usage() {
echo "usage: $0 -i <INPUT_FILE> -o <OUTPUT_FOLDER> [-v <VERSION> [-w <WORKING_FOLDER [-c]]]"
echo "usage: $0 -i <INPUT_FILE> -o <OUTPUT_FOLDER> [-v <VERSION>] [-w <WORKING_FOLDER>] [-c]"
echo " [-s docker|singularity|podman]"
echo " -i: Path to multi-FASTA input file with headers in UniProt FASTA header format, containing at least"
echo " OX=<taxid>. (Required)"
echo " -o: Path to output folder. All output files with predictions in TSV format will be available in this"
......@@ -39,10 +41,15 @@ function usage() {
echo " -c: Clean up temporary files. If set, then all temporary files will be cleaned up at the end of the"
echo " procedure. If no working directory is provided through option -w then the temporary files are cleaned"
echo " up by default"
echo " -s: Container software to be used. (Optional), DEFAULT: docker"
echo " Allowed values:"
echo " docker: Use Docker to run UniFIRE Docker image"
echo " singularity: Use Singularity to run UniFIRE Docker image"
echo " podman: Use Podman to run UniFIRE Docker image"
exit 1
}
while getopts "i:o:w:c:v:" optionName
while getopts "i:o:w:c:v:s:" optionName
do
case "${optionName}" in
i) infile=${OPTARG};;
......@@ -50,9 +57,26 @@ do
w) workdir=${OPTARG};;
v) docker_version=${OPTARG};;
c) cleanworkdir=1;;
s) container_software=${OPTARG};;
esac
done
if [ ${container_software} != "docker" ] && [ ${container_software} != "singularity" ] && \
[ ${container_software} != "podman" ]
then
echo "Invalid container software ${container_software} given!"
printf "This script supports docker, singularity or podman only at this time.\n\n"
usage
fi
if ! command -v ${container_software} &> /dev/null
then
echo "${container_software} executable could not be found. Please make sure ${container_software} is installed and available"
printf "in the PATH environment variable. Exiting.\n\n"
usage
fi
# infile
function check_infile() {
if [[ ! -f ${infile} ]]
......@@ -119,9 +143,22 @@ function check_workdir() {
# Run the docker image on the prepared ${workdir}
function run_docker_image() {
cp ${infile} ${workdir}/proteins.fasta
docker run \
--mount type=bind,source=${workdir},target=/volume \
dockerhub.ebi.ac.uk/uniprot-public/unifire:${docker_version}
if [ ${container_software} == "docker" ]
then
docker run \
--mount type=bind,source=${workdir},target=/volume \
dockerhub.ebi.ac.uk/uniprot-public/unifire:${docker_version}
elif [ ${container_software} == "singularity" ]
then
singularity run \
--bind ${workdir}:/volume \
docker://dockerhub.ebi.ac.uk/uniprot-public/unifire:${docker_version}
elif [ ${container_software} == "podman" ]
then
podman run \
--mount type=bind,source=${workdir},target=/volume \
docker://dockerhub.ebi.ac.uk/uniprot-public/unifire:${docker_version}
fi
}
# Move output files from ${workdir} to ${outdir}
......
......@@ -18,12 +18,13 @@
UNIFIRE_REPO="/opt/git/unifire"
INTERPROSCAN_REPO="/opt/interproscan-5.45-80.0"
ETE3FOLDER="/opt/ete3"
VOLUME=/volume
infilename=infile.fasta
cd ${UNIFIRE_REPO}
./misc/taxonomy/fetchLineageLocal.py ${VOLUME}/proteins.fasta ${VOLUME}/proteins_lineage.fasta
${UNIFIRE_REPO}/misc/taxonomy/fetchTaxonomicLineage.py -i ${VOLUME}/proteins.fasta -o ${VOLUME}/proteins_lineage.fasta \
-t ${ETE3FOLDER}/taxa.sqlite
${INTERPROSCAN_REPO}/interproscan.sh -f xml -dp -i ${VOLUME}/proteins_lineage.fasta \
--appl "Hamap,ProSiteProfiles,ProSitePatterns,Pfam,TIGRFAM,SMART,PRINTS,SFLD,CDD,Gene3D,ProDom,PIRSF,PANTHER,SUPERFAMILY" \
......
#!/usr/bin/env bash
############################################################################
# Copyright (c) 2018 European Molecular Biology Laboratory
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
############################################################################
set -e
set -u
SCRIPT_PATH=`dirname $0`
ETE3FOLDER="/opt/ete3"
${SCRIPT_PATH}/update-taxonomy-cache.py
mkdir -p ${ETE3FOLDER}
mv ~/.etetoolkit/taxa.sqlite ${ETE3FOLDER}/
chmod 644 ${ETE3FOLDER}/taxa.sqlite
#!/usr/bin/python
# -*- coding: utf-8 -*-
"""
The fetchTaxonomicLineage.py script reads a multi-FASTA file and replaces any occurrence of "OX={taxId}" in the header
with the full taxonomic lineage corresponding to that taxId. The resolved multi-FASTA file is written to the given
output path.
This script should be used when processing a large number of sequences from different species.
All the taxonomy data from NCBI are stored locally via Ete (by default in ~/.etetoolkit/taxa.sqlite)
This local storage can be updated using the following Python lines:
from ete3 import NCBITaxa
NCBITaxa().update_taxonomy_database()
Library dependencies (via pip / conda / ...):
* ete3 (pip install ete3 / conda install -c etetoolkit ete3)
* biopython
"""
import argparse
from ete3 import NCBITaxa
from Bio import SeqIO
import sys, re
__copyright__ = "Copyright 2018, European Molecular Biology Laboratory"
__license__ = "Apache 2.0"
__maintainer__ = "EMBL-EBI - Protein Function Development Team"
__status__ = "Prototype"
__author__ = "Alexandre Renaux"
header_DE_remove_pattern = re.compile("([a-zA-Z0-9]+\|[a-zA-Z0-9]+\|[a-zA-Z0-9_]+)\s(.+?)(\s[A-Z]{2}=.+)")
header_OX_pattern = re.compile('(OX=)(\d+)')
taxId_to_lineage = {}
def get_taxonomy_full_lineage(tax_id, ncbi):
    if tax_id in taxId_to_lineage:
        return taxId_to_lineage[tax_id]
    else:
        lineage = ncbi.get_lineage(tax_id)
        taxId_to_lineage[tax_id] = lineage
        return lineage


def resolve_header(header, ncbi):
    tax_id_match = header_OX_pattern.search(header)
    if tax_id_match:
        tax_id = int(tax_id_match.group(2))
        lineage = get_taxonomy_full_lineage(tax_id, ncbi)
        if lineage:
            replacement = r"\g<1>" + ",".join(str(i) for i in lineage)
            return re.sub(header_OX_pattern, replacement, header)
    return header


def remove_long_protein_name(description):
    match = header_DE_remove_pattern.search(description)
    if match:
        groups = list(match.groups())
        if len(groups[1]) > 127:
            del groups[1]
        return " ".join(groups)
    else:
        return description


def main(arguments):
    file_in = arguments.infile
    file_out = arguments.outfile
    if arguments.taxadb is None:
        ncbi = NCBITaxa()
    else:
        ncbi = NCBITaxa(dbfile=arguments.taxadb)
    with open(file_out, 'w') as f_out:
        for seq_record in SeqIO.parse(open(file_in, mode='r'), "fasta"):
            seq_record.description = remove_long_protein_name(resolve_header(seq_record.description, ncbi))
            seq_record.id = ""
            r = SeqIO.write(seq_record, f_out, "fasta")
            if r != 1:
                print("Error while writing sequence: " + seq_record.description)


def parse_args():
    parser = argparse.ArgumentParser(description="""
    The script fetchTaxonomicLineage.py reads an input file in multifasta format and will replace any occurrence of
    "OX={taxId}" in the header with the full lineage corresponding to this taxId.
    """)
    parser.add_argument('--infile', '-i', dest="infile", required=True, help="""
    Path to the input file in multifasta format with one tax-id in each fasta header in the format OX={taxId}
    """)
    parser.add_argument('--outfile', '-o', dest="outfile", required=True, help="""
    Path to the output file in multifasta format with the full taxonomic lineage in each fasta header in the format
    OX={taxId1,taxId2,...}
    """)
    parser.add_argument('--taxa-sqlite', '-t', dest="taxadb", required=False, help="""
    Path to the sqlite DB file for taxonomy database. Default location is ~/.etetoolkit/taxa.sqlite
    """)
    return parser.parse_args()


if __name__ == "__main__":
    arguments = parse_args()
    main(arguments)