Commit 907018f2 authored by Bishoy Wadie

Initial commit

Pipeline #113112 canceled with stages
^HVSlimPred\.Rproj$
^\.Rproj\.user$
^LICENSE\.md$
^\.gitlab-ci\.yml$
^README\.Rmd$
.Rproj.user
.Rhistory
.RData
image: rocker/tidyverse
stages:
  - build
  - test
  - deploy
building:
  stage: build
  script:
    - R -e "remotes::install_deps(dependencies = TRUE)"
    - R -e 'devtools::check()'
# To have the coverage percentage appear as a gitlab badge follow these
# instructions:
# https://docs.gitlab.com/ee/user/project/pipelines/settings.html#test-coverage-parsing
# The coverage parsing string is
# Coverage: \d+\.\d+
testing:
  stage: test
  allow_failure: true
  when: on_success
  only:
    - master
  script:
    - Rscript -e 'install.packages("DT")'
    - Rscript -e 'covr::gitlab(quiet = FALSE)'
  artifacts:
    paths:
      - public
# To produce a code coverage report as a GitLab page see
# https://about.gitlab.com/2016/11/03/publish-code-coverage-report-with-gitlab-pages/
pages:
  stage: deploy
  dependencies:
    - testing
  script:
    - ls
  artifacts:
    paths:
      - public
    expire_in: 30 days
  only:
    - master
Package: HVSlimPred
Title: Identification of novel functional linear motifs using the host-viral protein interaction network and the principle of convergent evolution
Version: 0.0.0.9000
Authors@R:
    c(person(given = "Bishoy",
             family = "Wadie",
             role = c("aut", "cre"),
             email = "bwadie@ebi.ac.uk"),
      person(given = "Evangelia",
             family = "Petsalaki",
             role = "cre",
             email = "petsalaki@ebi.ac.uk"))
Description: This package complements the performance evaluation analysis for the manuscript entitled "Identifying novel functional linear motifs using the host-viral protein interaction network and the principle of convergent evolution".
License: MIT + file LICENSE
Encoding: UTF-8
LazyData: true
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.1.0
Imports:
    dplyr,
    tidyr,
    plyr,
    ggplot2,
    hypergea,
    bio3d,
    readxl,
    stringr,
    magrittr,
    forcats,
    utils,
    mltools
Depends:
    R (>= 2.10)
Version: 1.0
RestoreWorkspace: No
SaveWorkspace: No
AlwaysSaveHistory: Default
EnableCodeIndexing: Yes
UseSpacesForTab: Yes
NumSpacesForTab: 2
Encoding: UTF-8
RnwWeave: Sweave
LaTeX: pdfLaTeX
AutoAppendNewline: Yes
StripTrailingWhitespace: Yes
LineEndingConversion: Posix
BuildType: Package
PackageUseDevtools: Yes
PackageInstallArgs: --no-multiarch --with-keep.source
PackageRoxygenize: rd,collate,namespace
YEAR: 2020
COPYRIGHT HOLDER: Evangelia Petsalaki
# MIT License
Copyright (c) 2020 Evangelia Petsalaki
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
# Generated by roxygen2: do not edit by hand
export("%>%")
export("%nin%")
export(HV_motif_level_eval)
export(HV_prot_dom_int_eval)
export(HV_prot_level_eval)
export(parse_qslim)
importFrom(magrittr,"%>%")
importFrom(utils,download.file)
importFrom(utils,read.csv)
importFrom(utils,read.delim)
importFrom(utils,read.table)
This diff is collapsed.
#' @keywords internal
"_PACKAGE"
# The following block is used by usethis to automatically manage
# roxygen namespace tags. Modify with care!
## usethis namespace: start
## usethis namespace: end
NULL
#' Parse qslim output per host-viral interaction data
#'
#' Parses the raw combined qslim output per host-viral interaction dataset. Then it adds the corresponding ELM instances along with the CompariMotif similarity between the predicted regular expression motif patterns and that of ELM classes. It also includes the distance between the predicted motif and that of an ELM instance, only if the motif-carrying protein and ELM class match.
#'
#' @return
#' A data frame containing all the predicted hits annotated with corresponding ELM instances and CompariMotif similarity scores.
#'
#' @format
#' - *Dataset*: Uniprot IDs of Host-viral interaction denoted as Viral-prot_Human-prot
#' - *Pattern*: Motif's regular expression pattern
#' - *uniprot*: Uniprot ID of motif-carrying protein
#' - *Org*: Organism info about the motif-carrying protein
#' - *Start_Pos*: First position of the predicted motif
#' - *End_Pos*: Last position of the predicted motif
#' - *Desc*: Uniprot description of the motif-carrying protein
#' - *Id*: ELM Identifier
#' - *Sim_rank*: Ranking of CompariMotif relationships. A value of 9 corresponds to Exact match, while a value of 1 corresponds to Complex Overlap (Check CompariMotif [Website](http://bioware.ucd.ie/~compass/biowareweb/Server_pages/help/slimfinder/comparimotif_help.html#relationships) for more details)
#' - *Score*: CompariMotif heuristic similarity score, defined as the product of Matched positions and Normalized Information content (Check CompariMotif [paper](https://academic.oup.com/bioinformatics/article/24/10/1307/177233) for more details)
#' - *Accession*: ELM Instance Accession
#' - *start*: First position of the ELM motif
#' - *end*: Last position of the ELM motif
#' - *Logic*: ELM instance logic as reported by ELM database.
#' - *motif_distance*: Distance between predicted and ELM motif on the same protein
#' - *motif_overlap*: Overlap proportion between predicted and ELM motif on the same protein
#'
#' @author Bishoy Wadie, Evangelia Petsalaki
#'
#' @export
#' @importFrom utils read.csv
#' @examples
#' pred_hits = parse_qslim()
parse_qslim = function(){
  input_data = download_data(path = "Evaluation_data/allhitsperint.occ.csv")
  comparimotif_data = preprocess_comparimotif()
  ELM_all_instances = Get_ELM_all()
  # Read strings as character so that strsplit() also works on R < 4.0
  all_hitsint = read.csv(input_data, stringsAsFactors = FALSE)
  all_hitsint = all_hitsint[,-c(2,3,12,14)]
  # Seq is "_"-delimited: field 1 is the UniProt ID, field 4 the organism
  seq_fields = strsplit(all_hitsint$Seq, split = "_")
  all_hitsint$uniprot = vapply(seq_fields, function(x) x[1], character(1))
  all_hitsint$Org = vapply(seq_fields, function(x) x[4], character(1))
  all_hitsint = all_hitsint[,c(1:3,12,13,5:11)]
  all_hitsint = all_hitsint[!duplicated(all_hitsint),]
  colnames(comparimotif_data)[4] = "Pattern"
  new_hits_ints = dplyr::left_join(all_hitsint, comparimotif_data[,c(1,2,4,3,8,14)], by = c("Dataset", "Pattern", "uniprot"))
  new_hits_ints = new_hits_ints %>% dplyr::select(-Sig, -Prot_Len, -Variant, -Match)
  # Strip the isoform suffix (e.g. "-2") so isoforms join on the canonical accession;
  # unlike trimming a fixed two characters, this also handles suffixes such as "-10"
  new_hits_ints$uniprot = sub("-[0-9]+$", "", new_hits_ints$uniprot)
  new_hits_ints = dplyr::left_join(new_hits_ints, ELM_all_instances[,c(1,3,5,7,8,11)], by = c("Id","uniprot"))
  new_hits_ints = qslim_ELM_motif_distance(new_hits_ints)
  new_hits_ints = new_hits_ints %>% dplyr::select(-PepDesign)
  return(new_hits_ints)
}
qslim_ELM_motif_distance = function(df){
  df$motif_distance = NA
  df$motif_overlap = NA
  for (i in seq_len(nrow(df))){
    # No ELM instance on this protein: leave distance and overlap as NA
    if (is.na(df$start[i])){
      next
    }
    # ELM motif starts after the predicted motif ends: no overlap
    if (abs(df$start[i]) > abs(df$End_Pos[i])){
      df$motif_distance[i] = (abs(df$start[i]) - abs(df$End_Pos[i]))
      df$motif_overlap[i] = 0
    }
    # ELM motif starts inside the predicted motif
    else if (abs(df$start[i]) > abs(df$Start_Pos[i])){
      df$motif_distance[i] = (abs(df$start[i]) - abs(df$Start_Pos[i]))
      df$motif_overlap[i] = (abs(df$End_Pos[i]) - abs(df$start[i])) / (abs(df$End_Pos[i]) - abs(df$Start_Pos[i]))
    }
    # Predicted motif starts after the ELM motif ends: no overlap
    else if (abs(df$Start_Pos[i]) > abs(df$end[i])){
      df$motif_distance[i] = (abs(df$Start_Pos[i]) - abs(df$end[i]))
      df$motif_overlap[i] = 0
    }
    # Predicted motif starts inside the ELM motif
    else{
      df$motif_distance[i] = (abs(df$Start_Pos[i]) - abs(df$start[i]))
      df$motif_overlap[i] = (abs(df$end[i]) - abs(df$Start_Pos[i])) / (abs(df$end[i]) - abs(df$start[i]))
    }
  }
  return(df)
}
motif_sim_to_score = function(Sim){
  # CompariMotif relationship -> rank (9 = Exact Match, 1 = Complex Overlap)
  sim_ranks = c("Exact Match" = 9, "Variant Match" = 8, "Degenerate Match" = 8,
                "Complex Match" = 7, "Exact Parent" = 6, "Exact Subsequence" = 6,
                "Degenerate Parent" = 5, "Degenerate Subsequence" = 5, "Variant Parent" = 5,
                "Variant Subsequence" = 5, "Complex Parent" = 4, "Complex Subsequence" = 4,
                "Exact Overlap" = 3, "Degenerate Overlap" = 2, "Variant Overlap" = 2,
                "Complex Overlap" = 1)
  # Vectorised lookup by name; unrecognised relationships yield NA
  return(unname(sim_ranks[as.character(Sim)]))
}
preprocess_comparimotif = function(){
  comparimotif_Ids = download_data(path = "Evaluation_data/compari_motif_ELM_ids.RDS", is.rds = TRUE)
  # Following the qslim_finder paper, apply MatchIC >= 1.5, NormIC >= 0.5 and MatchPos >= 2
  comparimotif_Ids = comparimotif_Ids %>% dplyr::filter(MatchIC >= 1.5, NormIC >= 0.5, MatchPos >= 2)
  comparimotif_Ids$Sim_rank = motif_sim_to_score(comparimotif_Ids$Sim1)
  comparimotif_Ids = comparimotif_Ids[,c(1:8,17,9:16)]
  # Count rows per (Name1, Name2, Motif1); n() is qualified because dplyr::n is not imported
  qslim_elm = comparimotif_Ids %>% dplyr::group_by(Name1, Name2, Motif1) %>% dplyr::summarise(n = dplyr::n())
  qslim_elm = as.data.frame(qslim_elm)
  uniques = qslim_elm[which(qslim_elm$n == 1),]
  dupls = qslim_elm[which(qslim_elm$n > 1),]
  dupls = as.data.frame(dupls)
  # For duplicated combinations keep only the highest-scoring row
  result = list()
  for (i in seq_len(nrow(dupls))){
    filtered = dplyr::filter(comparimotif_Ids, Name1 == dupls$Name1[i], Name2 == dupls$Name2[i], Motif1 == dupls$Motif1[i])
    filtered = filtered[which.max(filtered$Score),]
    result[[i]] = filtered
  }
  result = dplyr::bind_rows(result)
  unique_df = suppressMessages(plyr::match_df(comparimotif_Ids, uniques))
  Final_data = rbind.data.frame(result, unique_df)
  new_compari_results = Final_data %>% tidyr::separate(Name1, into = c("Dataset", "uniprot"), sep = ":")
  colnames(new_compari_results)[5] = "Id"
  new_compari_results = new_compari_results[,-c(1:2)]
  return(new_compari_results)
}
Get_ELM_all = function(){
  ELM_input = download_data(path = "Evaluation_data/elm_all_instances.csv")
  ELM_all_instances = read.csv(ELM_input, stringsAsFactors = FALSE)
  colnames(ELM_all_instances) = c("Accession","Class_type","Id","Entry_name","uniprot","other_uniprots","start","end","Pubmeds","Method","Logic","PDBs","Organism")
  # Strip the isoform suffix (e.g. "-2") to keep the canonical accession
  ELM_all_instances$uniprot = sub("-[0-9]+$", "", ELM_all_instances$uniprot)
  return(ELM_all_instances)
}
File added
#' Pipe operator
#'
#' See \code{magrittr::\link[magrittr:pipe]{\%>\%}} for details.
#'
#' @name %>%
#' @rdname pipe
#' @keywords internal
#' @export
#' @importFrom magrittr %>%
#' @usage lhs \%>\% rhs
NULL
#' @export
`%nin%` <- Negate("%in%")
mismatch_df = function (x, y, on = NULL){
  if (is.null(on)) {
    on <- dplyr::intersect(names(x), names(y))
    message("Matching on: ", paste(on, collapse = ", "))
  }
  keys <- plyr::join.keys(x, y, on)
  x[keys$x %nin% keys$y, , drop = FALSE]
}
#' @importFrom utils download.file
download_data = function(path, is.rds = FALSE){
  ftp_base = "ftp://ftp.ebi.ac.uk/pub/contrib/petsalaki/Wadie_et_al"
  if (!is.rds){
    # Download to a temporary file and return its path for the caller to read
    tmp = tempfile()
    download.file(file.path(ftp_base, path), destfile = tmp)
    return(tmp)
  }
  else{
    # Read the RDS object directly from the remote connection
    data = readRDS(url(file.path(ftp_base, path), "rb"))
    return(data)
  }
}
---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```
# HVSlimPred
<!-- badges: start -->
<!-- badges: end -->
This R package complements the performance evaluation analysis for the manuscript entitled "Identifying novel functional linear motifs using the host-viral protein interaction network and the principle of convergent evolution".
## Installation
You can install the development version of HVSlimPred from GitLab with:
``` r
devtools::install_gitlab("petsalakilab/hvslimpred", host = "gitlab.ebi.ac.uk")
```
## Get Protein-level evaluation metrics
For protein-level enrichment, we measured the enrichment of true positives in our predicted dataset using a one-tailed Fisher's exact test, where the odds ratio represents the magnitude of the enrichment. True positives are the motif-carrying proteins present in both the predicted dataset and the ELM dataset, regardless of whether the predicted protein has the right motif or whether the motif is found in the right location.
The output of the following command is a data frame containing all the relevant protein-level performance metrics for each domain enrichment filter in addition to the non-filtered qslim output.
```{r Protein-level evaluation, eval=FALSE}
library(HVSlimPred)
prot_eval_metrics = HV_prot_level_eval()
```
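As a rough illustration of the test described above, the sketch below runs a one-tailed Fisher's exact test on a 2x2 contingency table with base R's `fisher.test()`. The counts and the names `contingency` and `enrichment` are invented for this example and are not taken from the package or the manuscript; `HV_prot_level_eval()` may build its table and run the test differently.

```r
# Hypothetical 2x2 table: rows = predicted / not predicted,
# columns = in ELM / not in ELM. All counts are invented for illustration.
contingency <- matrix(
  c(40, 160,    # predicted:     in ELM, not in ELM
    60, 1740),  # not predicted: in ELM, not in ELM
  nrow = 2, byrow = TRUE,
  dimnames = list(
    predicted = c("yes", "no"),
    in_ELM    = c("yes", "no")
  )
)

# alternative = "greater" tests only for enrichment (odds ratio > 1)
enrichment <- fisher.test(contingency, alternative = "greater")

enrichment$estimate  # conditional maximum-likelihood odds ratio
enrichment$p.value   # one-tailed p-value
```

With `alternative = "greater"`, a small p-value indicates that ELM proteins are over-represented among the predictions, and the odds ratio gives the size of that enrichment.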
## Get Motif-level evaluation metrics
For motif-level evaluation, a binary classification like the one used at the protein level is not appropriate, because a predicted motif can be partially correct: it may contain some true-positive residues within a given sequence stretch. We therefore re-implemented the evaluation protocol proposed in [Prytuliak et al. 2017](https://academic.oup.com/nar/article/45/W1/W470/3782606) and computed the common performance metrics (recall, precision, F1, etc.) both residue-wise and site-wise. This analysis was restricted to motif-carrying proteins that are also found in the ELM benchmarking dataset; proteins not reported in the ELM dataset were not evaluated.
The output of the following command is a data frame containing all the relevant motif-level performance metrics for each domain enrichment filter in addition to the non-filtered qslim output.
```{r Motif-level evaluation, eval=FALSE}
library(HVSlimPred)
motif_eval_metrics = HV_motif_level_eval()
```
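To make the residue-wise idea concrete, here is a minimal sketch for a single protein. The protein length, the positions, and all variable names are hypothetical; this is not the package's implementation, and the site-wise variant of the protocol is not shown.

```r
# Hypothetical protein of 20 residues; positions are 1-based and invented.
protein_length <- 20
predicted_pos  <- 5:10   # residues covered by a predicted motif
elm_pos        <- 8:12   # residues covered by an annotated ELM instance

residues  <- seq_len(protein_length)
predicted <- residues %in% predicted_pos
true_elm  <- residues %in% elm_pos

tp <- sum(predicted & true_elm)    # predicted and annotated
fp <- sum(predicted & !true_elm)   # predicted but not annotated
fn <- sum(!predicted & true_elm)   # annotated but missed

precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)
```

Here the prediction is neither fully right nor fully wrong: half of the predicted residues fall inside the annotated motif, which is the partial credit a binary per-protein label would discard.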
## Get Protein-domain interactions evaluation metrics
For evaluating protein-domain interactions, we measured the enrichment of true-positive interactions between a given motif-carrying protein and its associated domains as reported in the ELM interaction dataset. As in the motif-level evaluation, this analysis was performed only on the motif-carrying proteins reported in the ELM interactions dataset. True positives are the number of correctly associated domains for a given motif-carrying protein, summed over all motif-carrying proteins in the predicted dataset.
The output of the following command is a data frame containing all the relevant protein-domain interactions' performance metrics for each domain enrichment filter.
```{r Protein-domain interactions evaluation, eval=FALSE}
library(HVSlimPred)
ProtDom_int_eval_metrics = HV_prot_dom_int_eval()
```
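The true-positive count described above can be sketched as follows. The protein identifiers, the domain associations, and the variable names are all hypothetical and serve only to show the counting; `HV_prot_dom_int_eval()` may organise its data differently.

```r
# Hypothetical domain associations per motif-carrying protein (all invented).
predicted_domains <- list(
  protA = c("SH3", "PDZ"),
  protB = c("WW"),
  protC = c("SH2", "PTB")
)
elm_domains <- list(
  protA = c("SH3"),
  protB = c("SH3"),
  protC = c("SH2", "PTB")
)

# Only proteins present in both sets are evaluated
shared <- intersect(names(predicted_domains), names(elm_domains))

# True positives per protein = domains that are both predicted and reported
tp_per_protein <- vapply(
  shared,
  function(p) length(intersect(predicted_domains[[p]], elm_domains[[p]])),
  integer(1)
)

# Sum over all motif-carrying proteins in the predicted dataset
total_tp <- sum(tp_per_protein)
```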
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/HVSlimPred-package.R
\docType{package}
\name{HVSlimPred-package}
\alias{HVSlimPred}
\alias{HVSlimPred-package}
\title{HVSlimPred: Identification of novel functional linear motifs using the host-viral protein interaction network and the principle of convergent evolution}
\description{
This package complements the performance evaluation analysis for the manuscript entitled "Identifying novel functional linear motifs using the host-viral protein interaction network and the principle of convergent evolution".
}
\author{
\strong{Maintainer}: Evangelia Petsalaki \email{petsalaki@ebi.ac.uk}
}
\keyword{internal}
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/Evaluation.R
\name{HV_motif_level_eval}
\alias{HV_motif_level_eval}
\title{Motif-level evaluation of predicted hits}
\usage{
HV_motif_level_eval()
}
\value{
A data frame containing all the relevant motif-level performance metrics for each domain enrichment filter in addition to the non-filtered qslim output.
}
\description{
Calculates performance metrics for evaluating predicted hits against ELM true positive instances on the motif-level.
}
\details{
For motif-level evaluation, a binary classification like the one used at the protein level is not appropriate, because a predicted motif can be partially correct: it may contain some true-positive residues within a given sequence stretch. We therefore re-implemented the evaluation protocol proposed in \href{https://academic.oup.com/nar/article/45/W1/W470/3782606}{Prytuliak et al. 2017} and computed the common performance metrics (recall, precision, F1, etc.) both residue-wise and site-wise. This analysis was restricted to motif-carrying proteins that are also found in the ELM benchmarking dataset; proteins not reported in the ELM dataset were not evaluated.
}
\examples{
motif_eval_metrics = HV_motif_level_eval()
}
\references{
Prytuliak, Roman, et al. "HH-MOTiF: de novo detection of short linear motifs in proteins by Hidden Markov Model comparisons." \emph{Nucleic acids research} 45.W1 (2017): W470-W477.
}
\author{
Bishoy Wadie, Evangelia Petsalaki
}
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/Evaluation.R
\name{HV_prot_dom_int_eval}
\alias{HV_prot_dom_int_eval}
\title{Evaluation of protein-domain interactions}
\usage{
HV_prot_dom_int_eval()
}
\value{
A data frame containing all the relevant protein-domain interactions' performance metrics for each domain enrichment filter.
}
\description{
Calculates performance metrics for evaluating protein-domain interactions between the motif-carrying protein and the enriched domain against ELM interactions.
}
\details{
For evaluating protein-domain interactions, we measured the enrichment of true-positive interactions between a given motif-carrying protein and its associated domains as reported in the ELM interaction dataset. As in the motif-level evaluation, this analysis was performed only on the motif-carrying proteins reported in the ELM interactions dataset. True positives are the number of correctly associated domains for a given motif-carrying protein, summed over all motif-carrying proteins in the predicted dataset.
}
\examples{
ProtDom_int_eval_metrics = HV_prot_dom_int_eval()
}
\author{
Bishoy Wadie, Evangelia Petsalaki
}
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/Evaluation.R
\name{HV_prot_level_eval}
\alias{HV_prot_level_eval}
\title{Protein-level evaluation of predicted hits}
\usage{
HV_prot_level_eval()
}
\value{
A data frame containing all the relevant protein-level performance metrics for each domain enrichment filter in addition to the non-filtered qslim output.
}
\description{
Calculates performance metrics for evaluating predicted hits against ELM true positive instances on the protein-level.
}
\details{
For protein-level enrichment, we measured the enrichment of true positives in our predicted dataset using a one-tailed Fisher's exact test, where the odds ratio represents the magnitude of the enrichment. True positives are the motif-carrying proteins present in both the predicted dataset and the ELM dataset, regardless of whether the predicted protein has the right motif or whether the motif is found in the right location.
}
\examples{
prot_eval_metrics = HV_prot_level_eval()
}
\author{
Bishoy Wadie, Evangelia Petsalaki
}
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/parse_qslim.R
\name{parse_qslim}
\alias{parse_qslim}
\title{Parse qslim output per host-viral interaction data}
\format{
\itemize{
\item \emph{Dataset}: Uniprot IDs of Host-viral interaction denoted as Viral-prot_Human-prot
\item \emph{Pattern}: Motif's regular expression pattern
\item \emph{uniprot}: Uniprot ID of motif-carrying protein
\item \emph{Org}: Organism info about the motif-carrying protein
\item \emph{Start_Pos}: First position of the predicted motif
\item \emph{End_Pos}: Last position of the predicted motif
\item \emph{Desc}: Uniprot description of the motif-carrying protein
\item \emph{Id}: ELM Identifier
\item \emph{Sim_rank}: Ranking of CompariMotif relationships. A value of 9 corresponds to Exact match, while a value of 1 corresponds to Complex Overlap (Check CompariMotif \href{http://bioware.ucd.ie/~compass/biowareweb/Server_pages/help/slimfinder/comparimotif_help.html#relationships}{Website} for more details)
\item \emph{Score}: CompariMotif heuristic similarity score, defined as the product of Matched positions and Normalized Information content (Check CompariMotif \href{https://academic.oup.com/bioinformatics/article/24/10/1307/177233}{paper} for more details)
\item \emph{Accession}: ELM Instance Accession
\item \emph{start}: First position of the ELM motif
\item \emph{end}: Last position of the ELM motif
\item \emph{Logic}: ELM instance logic as reported by ELM database.
\item \emph{motif_distance}: Distance between predicted and ELM motif on the same protein
\item \emph{motif_overlap}: Overlap proportion between predicted and ELM motif on the same protein
}
}
\usage{
parse_qslim()
}
\value{
A data frame containing all the predicted hits annotated with corresponding ELM instances and CompariMotif similarity scores.
}
\description{
Parses the raw combined qslim output per host-viral interaction dataset. Then it adds the corresponding ELM instances along with the CompariMotif similarity between the predicted regular expression motif patterns and that of ELM classes. It also includes the distance between the predicted motif and that of an ELM instance, only if the motif-carrying protein and ELM class match.
}
\examples{
pred_hits = parse_qslim()
}
\author{
Bishoy Wadie, Evangelia Petsalaki
}