# UniFIRE Project
UniFIRE (The UniProt Functional annotation Inference Rule Engine) is an engine to execute rules in the UniProt Rule Markup Language (URML) format.
It can be used to execute the UniProt annotation rules (UniRule and SAAS).
This project is a work in progress, open for collaboration.
Introducing presentation: [UniFIRE-URML.pptx](misc/media/UniFIRE-URML.pptx)
There are two different ways to run UniFIRE:
1. **Downloading and running the UniFIRE *Docker* image**<br/> The UniFIRE Docker image allows you to run the whole
UniFIRE workflow, including all dependencies such as InterProScan and HMMER, with a single command. The only necessary
software dependency is an installation of Docker.
Therefore we recommend this way of running UniFIRE to new users.
Because the large InterProScan package and its dependencies are included in the Docker image, a user
needs to download ~25 GB and allow ~100 GB of disk space on the system. <br/><br/>
2. **Running UniFIRE after building it from source**<br/>
This approach requires more manual interaction. Each step of a UniFIRE workflow has to be executed separately or
combined in a script. Also, some steps require external software such as InterProScan or HMMER, which may need to be
installed by the user separately or run through a web interface. Therefore we recommend this approach to advanced
users who would like to create a particular workflow, e.g. who would like to run the heavy InterProScan step within a
separate procedure.
This documentation uses scripts and sample data provided by the UniFIRE Gitlab repository, so please make sure you
have checked out a local copy of the UniFIRE Gitlab repository. This is done with the command below, which requires
git to be installed on your system:
```
$ git clone https://gitlab.ebi.ac.uk/uniprot-public/unifire.git
```
## 1. Using the Docker image
### Prerequisites
#### Hardware
A machine with at least 4 GB of memory and ~100 GB of available disk space.
#### Operating system support
The Docker image is expected to run on any operating system.
#### Software
A recent version of Docker is necessary to start the UniFIRE Docker image as a new container. It has been tested
successfully on Ubuntu 18.04 with Docker version 19.03.6.
### Data preparation
The only input data that needs to be provided is the protein sequence data in multi-FASTA format for which
functional predictions should be created. The FASTA header needs to follow the UniProtKB conventions
([https://www.uniprot.org/help/fasta-headers](https://www.uniprot.org/help/fasta-headers)).
The minimal structure of the header is:
```
>{id}|{name} {flags}
```
* `{id}` must be a unique string amongst the processed sequences
* `{name}`:
    * can be any string following the previous separator and should not contain any flags
    * might contain `(Fragment)` if applicable (e.g. "`ACN2_ACAGO Acanthoscurrin-2 (Fragment)`")
* `{flags}` \[mandatory]: to be considered a valid header, the following flag must be provided:
    * OX=`{taxonomy Id}`
* `{flags}` \[optional]: if possible / applicable, you should also provide:
    * OS=`{organism name}`
    * GN=`{recommended gene name}`
    * GL=`{recommended ordered locus name (OLN) or Open Reading Frame (ORF) name}`
    * OG=`{gene location(s), comma-separated if multiple}` ([cf. organelle ontology](https://www.ebi.ac.uk/ena/WebFeat/qualifiers/organelle.html))
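The header rules above can be checked with a small script. The sketch below is illustrative only: it is not the official UniFIRE validator, and for simplicity each flag value is read up to the next whitespace, so multi-word values such as a full organism name would need a fuller parser.

```python
import re

def check_header(header):
    """Return the recognised flags of a UniProt-style FASTA header.

    Raises ValueError if the mandatory OX= flag is missing. Unknown
    flags (e.g. PE=, SV=) are ignored. Values are read up to the next
    whitespace, which is a simplification.
    """
    body = header.lstrip(">").strip()
    flags = dict(re.findall(r"\b(OX|OS|GN|GL|OG)=(\S+)", body))
    if "OX" not in flags:
        raise ValueError("mandatory OX=<taxonomy id> flag is missing: " + header)
    return flags
```

For example, `check_header(">123|Mystery protein OX=62977")` accepts the minimal customized header shown later in this document, while a header without `OX=` is rejected.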
### Usage
```
usage: run_unifire_docker.sh -i <INPUT_FILE> -o <OUTPUT_FOLDER> [-v <VERSION>] [-w <WORKING_FOLDER>] [-c]
-i: Path to multi-FASTA input file with headers in UniProt FASTA header format, containing at least
OX=<taxid>. (Required)
-o: Path to output folder. All output files with predictions in TSV format will be available in this
folder at the end of the procedure. (Required)
-v: Version of the docker image to use, e.g. 2020.2. Available versions are listed under
https://gitlab.ebi.ac.uk/uniprot-public/unifire/container_registry. (Optional), DEFAULT: 2020.2
-w: Path to an empty working directory. If this option is not given, then a temporary folder will be
created and used to store intermediate files. (Optional)
-c: Clean up temporary files. If set, then all temporary files will be cleaned up at the end of the
procedure. If no working directory is provided through option -w then the temporary files are cleaned
up by default
```
### Example
This is a simple example, which shows how to use the UniFIRE Docker image to run the whole UniFIRE workflow on some
sample protein data.
**Warning:** The first time this command is run, it will download the ~25 GB UniFIRE Docker image from the
Docker container registry and extract it on the local machine. Depending on the speed of your network and your CPU,
this can take a few hours.
```bash
./docker/bin/run_unifire_docker.sh -i samples/proteins.fasta -o .
```
This command will use as input the file samples/proteins.fasta, which is in multi-FASTA format with headers in
the format described above. It will run the whole UniFIRE workflow to predict functional annotations from UniRule
and SAAS rules. The resulting functional predictions will be written into these files in the current working
directory:
```
predictions_unirule.out
predictions_unirule-pirsr.out
predictions_saas.out
```
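As a quick sanity check, you can confirm that the run produced all three prediction files listed above. This is a small illustrative helper (not part of UniFIRE); it assumes the output folder passed via `-o`:

```python
from pathlib import Path

# File names as produced by the UniFIRE Docker workflow
EXPECTED = ["predictions_unirule.out",
            "predictions_unirule-pirsr.out",
            "predictions_saas.out"]

def missing_predictions(outdir="."):
    """Return the expected prediction files that are absent from outdir."""
    return [name for name in EXPECTED if not (Path(outdir) / name).exists()]
```

An empty return value means all three prediction files are in place.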
<br/>
## 2. Run UniFIRE after building it from its source code
### Prerequisites
#### Hardware
A machine with at least 4 GB of memory.
#### Operating system support
Depending on the speed of your internet connection, it will take a few minutes to download all dependencies through
maven. You will require in total ~500 MB of disk space in the UniFIRE folder and in your local maven cache. The
script also downloads the latest UniRule, UniRule-PIRSR and SAAS rules in URML format and UniRule template
alignments in fact XML format from the EBI FTP into the samples/ folder. Additionally, it downloads data necessary
to run UniRule-PIRSR rules from https://proteininformationresource.org/pirsr/pirsr_data_latest.tar.gz and places it
underneath the folder samples/pirsr_data.
### Usage
We provide some sample files in the [sample](samples) folder to test the software.
<br/>
**Example with UniRule rules & InterProScan XML input:**
``` bash
$ ./distribution/bin/unifire.sh -r samples/unirule-urml-latest.xml -i samples/input_ipr.fasta.xml -t samples/unirule-templates-latest.xml -o output_unirule_annotations.csv
```
*Note: To be able to predict the UniRule positional annotations, a template file is provided (`samples/unirule-templates-latest.xml`) (optional).*
<br/>
**Example with SAAS rules & Fact XML input:**
``` bash
$ ./distribution/bin/unifire.sh -r samples/saas-urml-latest.xml -i samples/input_facts.xml -s XML -o output_saas_annotations.csv
```
<br/>
**Example with PIRSR rules and protein data in InterProScan XML format:**
In order to use UniRule-PIRSR rules to annotate protein input data, alignments of the protein sequences against
SRHMM signatures need to be calculated in a preparation step. This requires *HMMER*, in particular an
installation of the executable *hmmalign*. With Ubuntu 18.04, *hmmalign* can be installed at /usr/bin/hmmalign
by the command below:
``` bash
$ sudo apt-get install hmmer
```
As an alternative, the *hmmer* source code can be downloaded from http://hmmer.org/. In the example below we
assume the hmmalign binary is available at this path on the filesystem: /usr/bin/hmmalign
Running UniRule-PIRSR rules is a two step process:
First, calculate the alignment(s) of your protein(s) against all SRHMM signatures, combine data from the input in
InterProScan XML format with these alignments and write the output to the Fact XML file PIRSR-input-iprscan-urml.xml:
``` bash
$ ./distribution/bin/pirsr.sh -i ./samples/pirsr_data/PIRSR-input-iprscan.xml -o . -a /usr/bin/hmmalign -d ./samples/pirsr_data
```
Second, run UniFIRE with UniRule-PIRSR rules and PIRSR templates on the protein data in PIRSR-input-iprscan-urml.xml:
``` bash
$ ./distribution/bin/unifire.sh -r samples/unirule.pirsr-urml-latest.xml -i ./PIRSR-input-iprscan-urml.xml -s XML -t samples/pirsr_data/PIRSR_templates.xml -o ./pirsr_unifire_annotation.csv
```
_Note_: With all rule systems, it is possible that a protein gets the exact same annotation from different rules due
to overlap in condition spaces.
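Since the prediction output is line-oriented, such exact duplicates can be removed in a small post-processing step if desired. This is an illustrative sketch (not part of UniFIRE) assuming one prediction per line:

```python
def dedupe_lines(lines):
    """Drop exact duplicate lines, preserving first-seen order."""
    seen = set()
    for line in lines:
        if line not in seen:
            seen.add(line)
            yield line
```

Applied to a prediction file, this keeps the first occurrence of each identical annotation line and drops the repeats.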
#### Options
## Data preparation
This section is a walkthrough on how to prepare your data, assuming you are starting from scratch: from a set of sequences (multifasta) that you would like to annotate.
More advanced users / developers with an existing bioinformatics pipeline already integrating InterProScan results should try to load their existing data into the fact model described on the Developer Guide below.
#### Examples of valid headers:
The standard header used in UniProt:
```
>tr|Q3SA23|Q3SA23_9HIV1 Protein Nef (Fragment) OS=Human immunodeficiency virus 1 OX=11676 GN=nef PE=3 SV=1
```
The standard UniProt header, customized with additional flags:
```
>tr|A0A0D6DT88|A0A0D6DT88_BRADI Maturase K (Fragment) OS=Brachypodium distachyon OX=15368 GN=matK GL=BN3904_34004 OG=Plastid,Chloroplast PE=3 SV=1
```
Customized minimal header:
```
>123|Mystery protein OX=62977
```
Customized full header:
```
>MyPlantDB|P987|Photosystem II protein D1 OS=Lolium multiflorum OX=4521 GN=psbA GL=LomuCp001 OG=Plastid
```
### Fetching the full lineages
From the previously described multifasta format, you can use the following scripts to fetch the full NCBI taxonomy id lineage. Both scripts have dependencies, which are detailed at the start of each script.
* python [./misc/taxonomy/fetchLineageLocal.py](misc/taxonomy/fetchLineageLocal.py) `<input>` `<output>` - for large amount of data on multiple species
* python [./misc/taxonomy/fetchLineageRemote.py](misc/taxonomy/fetchLineageRemote.py) `<input>` `<output>` - for one-off usage / few species
Both scripts will simply replace OX={taxId} with OX={fullLineage}.
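As a sketch of that replacement (the real scripts resolve each taxonomy id against the NCBI taxonomy, locally or remotely; here a plain taxId-to-lineage dict stands in for that lookup):

```python
import re

def expand_lineage(line, lineages):
    """Replace OX=<taxId> with OX=<full lineage> in a FASTA header line.

    `lineages` maps a taxonomy id to its full comma-separated lineage;
    unknown ids are left unchanged. Non-header lines pass through as-is.
    """
    if not line.startswith(">"):
        return line
    return re.sub(r"OX=(\d+)",
                  lambda m: "OX=" + lineages.get(m.group(1), m.group(1)),
                  line)
```

For example, with a lookup of `{"62977": "1,131567,62977"}` (a hypothetical lineage for illustration), the header `>123|Mystery protein OX=62977` becomes `>123|Mystery protein OX=1,131567,62977`.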
### Running InterProScan
Once the multifasta file is ready (cf. previous steps), you can find the matches of all sequences using InterProScan.
It is advised to download the latest version from [https://www.ebi.ac.uk/interpro/download.html](https://www.ebi.ac.uk/interpro/download.html) and keep it up-to-date. The current version of InterProScan (5.42-78.0) requires Java 11 to run.
The output format must be XML to be accepted as a valid input for UniFIRE.
## Limitations
### Memory
A minimum of 4 GB of memory is required for this software to run.
For a large number of proteins to process, it is advised to split them into chunks of approx. 1000 proteins per rule evaluation to keep the memory usage low.
This is automatically handled by the `-n / --chunksize` option of UniFIRE (by default 1000).
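UniFIRE handles this chunking internally via `-n / --chunksize`, but if you want to pre-split a multi-FASTA file yourself (e.g. to distribute InterProScan runs across machines), the grouping logic can be sketched as follows (an illustration, not part of UniFIRE):

```python
def chunk_fasta(lines, chunk_size=1000):
    """Yield lists of FASTA lines, each holding at most chunk_size records.

    A record starts at a '>' header line; sequence lines stay with
    their header so records are never split across chunks.
    """
    chunk, count = [], 0
    for line in lines:
        if line.startswith(">"):
            if count == chunk_size:
                yield chunk
                chunk, count = [], 0
            count += 1
        chunk.append(line)
    if chunk:
        yield chunk
```

Each yielded chunk can then be written to its own file and processed independently.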
## Issues & Suggestions
If you have any questions regarding this software, if you experience any bugs, or if you have suggestions for
improvements on the software, the models, the helper scripts, the documentation, etc., please contact us through the
UniFIRE mailing list:
* **UniFIRE Mailing List** - [unifire@ebi.ac.uk](mailto:unifire@ebi.ac.uk)
[UniFIRE Issue Tracker](https://gitlabci.ebi.ac.uk/uniprot.aa/UniFIRE/issues)
## Authors
* **Alexandre Renaux**
* **Chuming Chen**
* **Hermann Zellner**
function run {
local cmdArgs="${@}"
java -cp "${SCRIPT_DIR}/../target/*:${SCRIPT_DIR}/../target/dependency/*" uk.ac.ebi.uniprot.unifire.validators.fasta.MultiFastaValidatorApp ${cmdArgs}
}
function main {
run "${@}"
}
main "${@}"
############################################################################
# Copyright (c) 2018 European Molecular Biology Laboratory
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
############################################################################
FROM ubuntu:18.04
ENV JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
RUN apt-get update \
&& apt-get install -y wget openjdk-8-jdk maven git coreutils hmmer python-numpy python-qt4 python-lxml python-six \
python-pip python-biopython python-requests python3 ncbi-data libdw1
RUN pip install --upgrade ete3
COPY scripts /opt/scripts/bin
RUN chmod 775 /opt/scripts/bin/*.sh
RUN /opt/scripts/bin/update-taxonomy-cache.py
RUN /opt/scripts/bin/download-interproscan.sh
RUN /opt/scripts/bin/download-unifire.sh
RUN mkdir /volume
VOLUME /volume
CMD /opt/scripts/bin/unifire-workflow.sh
#!/usr/bin/env bash
############################################################################
# Copyright (c) 2018 European Molecular Biology Laboratory
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
############################################################################
set -e
set -u
function usage() {
echo "usage: $0 -i <INPUT_FILE> -o <OUTPUT_FOLDER> [-v <VERSION>] [-w <WORKING_FOLDER>] [-c]"
echo " -i: Path to multi-FASTA input file with headers in UniProt FASTA header format, containing at least"
echo " OX=<taxid>. (Required)"
echo " -o: Path to output folder. All output files with predictions in TSV format will be available in this"
echo " folder at the end of the procedure. (Required)"
echo " -v: Version of the docker image to use, e.g. 2020.2. Available versions are listed under"
echo " https://gitlab.ebi.ac.uk/uniprot-public/unifire/container_registry. (Optional), DEFAULT: 2020.2"
echo " -w: Path to an empty working directory. If this option is not given, then a temporary folder will be"
echo " created and used to store intermediate files. (Optional)"
echo " -c: Clean up temporary files. If set, then all temporary files will be cleaned up at the end of the"
echo " procedure. If no working directory is provided through option -w then the temporary files are cleaned"
echo " up by default"
exit 1
}
infile=""
outdir=""
workdir=""
cleanworkdir=0
docker_version="2020.2"
predictionfiles="predictions_unirule.out predictions_saas.out predictions_unirule-pirsr.out"
while getopts "i:o:w:cv:" optionName
do
case "${optionName}" in
i) infile=${OPTARG};;
o) outdir=${OPTARG};;
w) workdir=${OPTARG};;
v) docker_version=${OPTARG};;
c) cleanworkdir=1;;
esac
done
# infile
function check_infile() {
if [[ ! -f ${infile} ]]
then
echo "Error: Input file ${infile} not found!"
usage
fi
}
# outdir
function check_outdir() {
if [[ ! -d ${outdir} ]]
then
echo "Given outdir ${outdir} does not exist. Trying to create it..."
set +e
mkdir -p ${outdir}
if [[ $? == 0 ]]
then
echo "Successfully created output directory ${outdir}."
else
echo "Failed to create output directory ${outdir}."
usage
fi
set -e
fi
}
# workdir
function check_workdir() {
usemktmp=0
if [[ ${workdir} == "" ]]
then
usemktmp=1
echo "No working directory given. Creating temporary directory."
elif [[ ! -d ${workdir} ]]
then
echo "Given working directory does not exist. Trying to create it ..."
set +e
mkdir -p ${workdir}
if [[ $? == 0 ]]
then
echo "Successfully created working directory ${workdir}"
else
usemktmp=1
echo "Failed to create working directory ${workdir}. Creating temporary directory."
fi
set -e
fi
if [[ ${usemktmp} == 0 ]] && [[ ! -z "$(ls -A ${workdir})" ]]
then
usemktmp=1
echo "Given working directory ${workdir} is not empty. Creating temporary directory instead."
fi
if [[ ${usemktmp} == 1 ]]
then
workdir=`mktemp -d`
cleanworkdir=1
echo "Using ${workdir} for temporary files. Please make sure there is enough free space on the according filesystem."
fi
}
# Run the docker image on the prepared ${workdir}
function run_docker_image() {
cp ${infile} ${workdir}/proteins.fasta
docker run \
--mount type=bind,source=${workdir},target=/volume \
dockerhub.ebi.ac.uk/uniprot-public/unifire:${docker_version}
}
# Move output files from ${workdir} to ${outdir}
function move_output_files() {
for predictionfile in ${predictionfiles}
do
echo Copying prediction file ${predictionfile} to ${outdir}
cp -p ${workdir}/${predictionfile} ${outdir}/
done
}
# Clean up
function cleanup_workdir() {
if [[ ${cleanworkdir} == 1 ]]
then
echo "Cleaning up folder ${workdir}"
for predictionfile in ${predictionfiles}
do
rm -f ${workdir}/${predictionfile}
done
rm -f ${workdir}/proteins.fasta
rm -f ${workdir}/proteins_lineage.fasta
rm -f ${workdir}/proteins_lineage-ipr-urml.xml
rm -f ${workdir}/proteins_lineage-ipr.xml
rm -f ${workdir}/seq/*.fasta
rm -f ${workdir}/aln/*.aln
if [[ -d ${workdir}/aln ]];
then
rmdir ${workdir}/aln
fi
if [[ -d ${workdir}/seq ]]
then
rmdir ${workdir}/seq
fi
fi
}
# main
check_infile
check_outdir
check_workdir
run_docker_image
move_output_files
cleanup_workdir
#!/usr/bin/env bash
############################################################################
# Copyright (c) 2018 European Molecular Biology Laboratory
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
############################################################################
set -e
set -u
ROOT_FOLDER="/opt"
DOWNLOAD_FOLDER="/opt/download"
mkdir -p ${DOWNLOAD_FOLDER}
cd ${DOWNLOAD_FOLDER}
echo "Downloading InterProScan..."
wget -q ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.41-78.0/interproscan-5.41-78.0-64-bit.tar.gz
echo "Done."
wget -q ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.41-78.0/interproscan-5.41-78.0-64-bit.tar.gz.md5
ipr_check=`md5sum -c interproscan-5.41-78.0-64-bit.tar.gz.md5`
if [[ ${ipr_check} != "interproscan-5.41-78.0-64-bit.tar.gz: OK" ]]
then
exit 11
fi
mkdir -p ${ROOT_FOLDER}
cd ${ROOT_FOLDER}
echo "Extracting InterProScan..."
tar -pxzf ${DOWNLOAD_FOLDER}/interproscan-5.41-78.0-64-bit.tar.gz
echo "Done."
cd ${DOWNLOAD_FOLDER}
echo "Downloading Panther data..."
wget -q ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/data/panther-data-14.1.tar.gz
echo "Done."
wget -q ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/data/panther-data-14.1.tar.gz.md5
panther_check=`md5sum -c panther-data-14.1.tar.gz.md5`
if [[ ${panther_check} != "panther-data-14.1.tar.gz: OK" ]]
then
exit 12
fi
cd ${ROOT_FOLDER}/interproscan-5.41-78.0/data
echo "Extracting Panther data..."
tar -pxzf ${DOWNLOAD_FOLDER}/panther-data-14.1.tar.gz
echo "Done."
# Clean up tar to reduce the size of the image
rm -f ${DOWNLOAD_FOLDER}/interproscan-5.41-78.0-64-bit.tar.gz
rm -f ${DOWNLOAD_FOLDER}/interproscan-5.41-78.0-64-bit.tar.gz.md5
rm -f ${DOWNLOAD_FOLDER}/panther-data-14.1.tar.gz
rm -f ${DOWNLOAD_FOLDER}/panther-data-14.1.tar.gz.md5
#!/usr/bin/env bash
############################################################################
# Copyright (c) 2018 European Molecular Biology Laboratory
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
############################################################################
set -e
set -u
GIT_REPO="/opt/git"
echo "Downloading and building UniFIRE..."
mkdir -p ${GIT_REPO}
cd ${GIT_REPO}
git clone https://gitlab.ebi.ac.uk/uniprot-public/unifire.git
cd unifire
./build.sh
echo "Done building UniFIRE."
#!/usr/bin/env bash
############################################################################
# Copyright (c) 2018 European Molecular Biology Laboratory
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
############################################################################
UNIFIRE_REPO="/opt/git/unifire"
INTERPROSCAN_REPO="/opt/interproscan-5.41-78.0"
VOLUME=/volume
infilename=infile.fasta
cd ${UNIFIRE_REPO}
./misc/taxonomy/fetchLineageLocal.py ${VOLUME}/proteins.fasta ${VOLUME}/proteins_lineage.fasta
${INTERPROSCAN_REPO}/interproscan.sh -f xml -dp -i ${VOLUME}/proteins_lineage.fasta \
--appl "Hamap,ProSiteProfiles,ProSitePatterns,Pfam,TIGRFAM,SMART,PRINTS,SFLD,CDD,Gene3D,ProDom,PIRSF,PANTHER,SUPERFAMILY" \
-o ${VOLUME}/proteins_lineage-ipr.xml
PATH="/usr/lib/jvm/java-8-openjdk-amd64/bin:${PATH}"
${UNIFIRE_REPO}/distribution/bin/pirsr.sh -i ${VOLUME}/proteins_lineage-ipr.xml \
-o ${VOLUME} -a /usr/bin/hmmalign -d ${UNIFIRE_REPO}/samples/pirsr_data
${UNIFIRE_REPO}/distribution/bin/unifire.sh -r ${UNIFIRE_REPO}/samples/unirule-urml-latest.xml \
-i ${VOLUME}/proteins_lineage-ipr.xml -t ${UNIFIRE_REPO}/samples/unirule-templates-latest.xml \
-o ${VOLUME}/predictions_unirule.out
${UNIFIRE_REPO}/distribution/bin/unifire.sh -r ${UNIFIRE_REPO}/samples/saas-urml-latest.xml \
-i ${VOLUME}/proteins_lineage-ipr.xml \
-o ${VOLUME}/predictions_saas.out
${UNIFIRE_REPO}/distribution/bin/unifire.sh -n 100 -r ${UNIFIRE_REPO}/samples/unirule.pirsr-urml-latest.xml \
-i ${VOLUME}/proteins_lineage-ipr-urml.xml -s XML -t ${UNIFIRE_REPO}/samples/pirsr_data/PIRSR_templates.xml \
-o ${VOLUME}/predictions_unirule-pirsr.out
# prediction output files must belong to the same user and group as proteins.fasta input file
ownership=`stat -c "%u:%g" ${VOLUME}/proteins.fasta`
for outfile in proteins_lineage.fasta proteins_lineage-ipr.xml proteins_lineage-ipr-urml.xml predictions_unirule.out \
predictions_saas.out predictions_unirule-pirsr.out seq aln
do
chown -R ${ownership} ${VOLUME}/${outfile}
done
#!/usr/bin/env python
############################################################################
# Copyright (c) 2018 European Molecular Biology Laboratory
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software