Commit 20549520 authored by Lukas Pravda's avatar Lukas Pravda
Browse files

Merge branch 'development'

parents e3e33c72 1349158a
......@@ -2,7 +2,7 @@ image: continuumio/miniconda
before_script:
- apt-get update && apt-get install make
- conda create -c rdkit -n rdkit-doc rdkit python=3.6.5
- conda create -c rdkit -n rdkit-doc rdkit python=3.7
- source activate rdkit-doc
- pip install -e .
- pip install pytest-cov sphinx sphinx_rtd_theme recommonmark sphinx-autodoc-typehints sphinx-markdown-tables
......
# Changelog
## RELEASE 0.5 - May 15, 2019
### Features
* Add support for UniChem mapping.
* Add bond information to the SVG decomposition.
* Allow PARITY method to be atom/bond specific.
* Improve and extend physchem properties.
* Enhanced CIF export (physchem, scaffolds, fragments, 2D, mapping).
* Few improvements to match newest RDKit version (**breaking changes**).
## RELEASE 0.4 - January 12, 2019
### Features
......@@ -7,7 +18,7 @@
* Add SVG decomposition in the SVG format.
* Protein-ligand interaction pipeline moved to separate [repository](https://gitlab.ebi.ac.uk/pdbe/release/interactions).
* Add basic properties calculation (Abhik).
* Extension and improvements of the Fragment library.
* Extension and improvements of the Fragment library.
## RELEASE 0.3 - October 12, 2018
......
......@@ -5,9 +5,13 @@
* A set of python tools to deal with PDB chemical components definitions
for small molecules, taken from the [wwPDB Chemical Component Dictionary](https://www.wwpdb.org/data/ccd)
* The tools use:
* [RDKit](http://www.rdkit.org/) for chemistry
* [PDBeCIF](https://gitlab.com/glenveegee/PDBeCIF.git) cif parser.
* [PDBeCIF](https://gitlab.com/glenveegee/PDBeCIF.git) cif parser.
* [scipy](https://www.scipy.org/) for depiction quality check.
* [numpy](https://www.numpy.org/) for molecular scaling.
* Please note that the project is under active development.
## Installation instructions
......@@ -17,7 +21,7 @@
For linux/mac OS this is most easily done using the anaconda python with commands similar to:
```console
conda create -c rdkit -n rdkit-env rdkit python=3.6
conda create -c rdkit -n rdkit-env rdkit python=3.7
conda activate rdkit-env
```
......@@ -35,7 +39,9 @@
* Fragment library search.
* Chemical scaffolds (Murcko scaffold, Murcko general, BRICS).
* Lightweight implementation of parity method by Jon Tyczak.
* Generate some basic molecular properties per component.
* RDKit molecular properties per component.
* UniChem mapping.
## TODO list
......@@ -56,13 +62,13 @@ The documentation depends on the following packages:
* `recommonmark`
* `sphinx-autodoc-typehints`
Note that `sphinx` needs to be a part of the virtual environmnent, if you want to generate documentation by yourself.
Note that `sphinx` needs to be a part of the virtual environment, if you want to generate documentation by yourself.
Otherwise it cannot pick `rdkit` module. `sphinx_rtd_theme` is a theme providing nice `ReadtheDocs` mobile friendly style.
* Generate *.rst* files to be included as a part of the documentation. Inside the directory `pdbeccdutils/doc` run the following commands to generate documentation.
* Alternativelly, use the `recommonmark` package along with the proper configuration to get the Markdown working.
Use the following to generate initial markup files to be used by sphinx. This needs to be used when adding another subpackages.
Use the following to generate initial markup files to be used by sphinx. This needs to be used when adding another sub-packages.
```console
sphinx-apidoc -f -o /path/to/output/dir ../pdbeccdutils/
......
......@@ -5,4 +5,6 @@
```eval_rst
.. automodule:: pdbeccdutils.utils.pubchem_downloader
:members:
.. automodule:: pdbeccdutils.utils.web_services
:members:
```
\ No newline at end of file
__version__ = '0.4'
__version__ = '0.5'
"""Lightweight implementation of the parity method provided by
Jon Tyczak to Abhik to be used by the cofactors pipeline.
Jon Tyzack to Abhik to be used by the cofactors pipeline.
"""
from typing import NamedTuple
from rdkit import Chem
from rdkit.Chem import rdFMCS
class ParityResult(NamedTuple):
"""
NamedTuple for the result of parity method along with the details
necessary for calculating the similarity score.
Attributes:
template_atoms (int): Number of template atoms.
query_atoms (int): Number of query molecule atoms.
match_count (int): Size o the common subgaph match.
similarity_score (float): Calculate similarity score.
"""
template_atoms: int
query_atoms: int
match_count: int
similarity_score: float
from pdbeccdutils.core.models import ParityResult
def _get_matches(mol, smarts):
......@@ -86,35 +69,45 @@ def _generate_sim_score(template, query, smarts):
return (best_matches, best_sim_score)
def compare_molecules(template, query, thresh=0.01):
def compare_molecules(template, query, thresh=0.01, exact_match=False):
"""Given the two molecules calculates their similarity score.
If expected similarity score is lower than the threshold nothing
is calculated.
Args:
template(rdkit.Chem.rdchem.Mol): Template molecule
query(rdkit.Chem.rdchem.Mol): Query molecule
thresh(float, optional): Defaults to 0.01: Threshold score for
the match to be considered.
template (rdkit.Chem.rdchem.Mol): Template molecule
query (rdkit.Chem.rdchem.Mol): Query molecule
thresh (float, optional): Threshold score for
the match to be considered. Defaults to 0.01.
exact_match (bool, optional): Controls whether atom type and
bond order should be checked too. Defaults to False.
Returns:
ParityResult: Result of the PARITY comparison.
"""
templ_atoms = template.GetNumAtoms()
template_atoms = template.GetNumAtoms()
query_atoms = query.GetNumAtoms()
min_num_atoms = min(templ_atoms, query_atoms)
max_sim_score = float(min_num_atoms) / float(templ_atoms + query_atoms - min_num_atoms)
min_num_atoms = min(template_atoms, query_atoms)
max_sim_score = float(min_num_atoms) / float(template_atoms + query_atoms - min_num_atoms)
if max_sim_score < thresh:
return ParityResult(templ_atoms, query_atoms, 0, 0)
mcs_graph = rdFMCS.FindMCS([template, query],
bondCompare=rdFMCS.BondCompare.CompareAny,
atomCompare=rdFMCS.AtomCompare.CompareAny,
timeout=40,
completeRingsOnly=True)
return ParityResult(template_atoms, query_atoms, 0, 0.0)
if not exact_match:
mcs_graph = rdFMCS.FindMCS([template, query],
bondCompare=rdFMCS.BondCompare.CompareAny,
atomCompare=rdFMCS.AtomCompare.CompareAny,
timeout=40,
completeRingsOnly=True)
else:
mcs_graph = rdFMCS.FindMCS([template, query],
bondCompare=rdFMCS.BondCompare.CompareOrderExact,
atomCompare=rdFMCS.AtomCompare.CompareElements,
timeout=40,
completeRingsOnly=True)
matches, sim_score = _generate_sim_score(template, query, mcs_graph.smartsString)
return ParityResult(templ_atoms, query_atoms, matches, sim_score)
return ParityResult(template_atoms, query_atoms, matches, sim_score)
......@@ -15,9 +15,9 @@
# specific language governing permissions and limitations
# under the License.
"""A set of methods for reading in data and creating internal representation
of molecules. The basic use can be as easy as this::
"""
A set of methods for reading in data and creating internal representation
of molecules. The basic use can be as easy as this:
from pdbeccdutils.core import ccd_reader
......@@ -26,15 +26,17 @@ of molecules. The basic use can be as easy as this::
"""
import os
import sys
from datetime import date
from typing import Dict, List, NamedTuple
import rdkit
from mmCif.mmcifIO import MMCIF2Dict
from pdbeccdutils.core.component import Component
from pdbeccdutils.helpers import str_conversions
from pdbeccdutils.helpers import collection_ext
from pdbeccdutils.core.models import Descriptor, CCDProperties
from pdbeccdutils.core.exceptions import CCDUtilsError
from pdbeccdutils.core.models import CCDProperties, Descriptor, ReleaseStatus
from pdbeccdutils.helpers import collection_ext, conversions
class CCDReaderResult(NamedTuple):
......@@ -59,7 +61,7 @@ class CCDReaderResult(NamedTuple):
def read_pdb_cif_file(path_to_cif: str) -> CCDReaderResult:
"""
Read in single wwpdb cif component and create its internal
Read in single wwPDB CCD CIF component and create its internal
representation.
Args:
......@@ -82,7 +84,7 @@ def read_pdb_cif_file(path_to_cif: str) -> CCDReaderResult:
def read_pdb_components_file(path_to_cif: str) -> Dict[str, CCDReaderResult]:
"""
Process multiple compounds stored in the wwpdb CCD
Process multiple compounds stored in the wwPDB CCD
`components.cif` file.
Args:
......@@ -94,13 +96,20 @@ def read_pdb_components_file(path_to_cif: str) -> Dict[str, CCDReaderResult]:
Returns:
dict(str, CCDReaderResult): Internal representation of all
the compponents in the `components.cif` file.
the components in the `components.cif` file.
"""
if not os.path.isfile(path_to_cif):
raise ValueError('File \'{}\' does not exists'.format(path_to_cif))
return {k: _parse_pdb_mmcif(v)
for k, v in MMCIF2Dict().parse(path_to_cif).items()}
result_bag = {}
for k, v in MMCIF2Dict().parse(path_to_cif).items():
try:
result_bag[k] = _parse_pdb_mmcif(v)
except CCDUtilsError as e:
print(f'ERROR: Data block {k} not processed. Reason: ({str(e)}).', file=sys.stderr)
return result_bag
# region parse mmcif
......@@ -165,9 +174,11 @@ def _parse_pdb_atoms(mol, atoms):
element = '*'
atom = rdkit.Chem.Atom(element)
atom.SetChiralTag(_atom_chiral_tag(atoms['pdbx_stereo_config'][i]))
atom.SetProp('name', atoms['atom_id'][i])
atom.SetProp('alt_name', atoms['alt_atom_id'][i])
atom.SetFormalCharge(str_conversions.str_to_int(atoms['charge'][i]))
atom.SetBoolProp('leaving_atom', atoms['pdbx_leaving_atom_flag'][i] == "Y")
atom.SetFormalCharge(conversions.str_to_int(atoms['charge'][i]))
if isotope is not None:
atom.SetIsotope(isotope)
......@@ -211,9 +222,9 @@ def _setup_pdb_conformer(atoms, label):
conformer = rdkit.Chem.Conformer(len(atoms['atom_id']))
for i in range(len(atoms['atom_id'])):
x = str_conversions.str_to_float(atoms[label.format(('x'))][i])
y = str_conversions.str_to_float(atoms[label.format(('y'))][i])
z = str_conversions.str_to_float(atoms[label.format(('z'))][i])
x = conversions.str_to_float(atoms[label.format(('x'))][i])
y = conversions.str_to_float(atoms[label.format(('y'))][i])
z = conversions.str_to_float(atoms[label.format(('z'))][i])
atom_position = rdkit.Chem.rdGeometry.Point3D(x, y, z)
conformer.SetAtomPosition(i, atom_position)
......@@ -242,7 +253,10 @@ def _parse_pdb_bonds(mol, bonds, atoms, errors):
if any(a is None for a in [atom_1, atom_2, bond_order]):
errors.append('Problem with the {}-th bond in the _chem_comp_bond group'.format(i + 1))
mol.AddBond(atom_1, atom_2, bond_order)
try:
mol.AddBond(atom_1, atom_2, bond_order)
except RuntimeError:
errors.append(f'Duplicit bond {bonds["atom_id_1"][i]}-{bonds["atom_id_2"][i]}')
def _handle_implicit_hydrogens(mol):
......@@ -268,7 +282,7 @@ def _handle_implicit_hydrogens(mol):
def _parse_pdb_descriptors(pdbx_chem_comp_descriptors, label='descriptor'):
"""
Parse useful informations from _pdbx_chem_comp_* category
Parse useful information from _pdbx_chem_comp_* category
Args:
pdbx_chem_comp_descriptors (dict): mmcif category with the
......@@ -295,22 +309,28 @@ def _parse_pdb_descriptors(pdbx_chem_comp_descriptors, label='descriptor'):
def _parse_pdb_properties(chem_comp):
"""
Parse useful informations from _chem_comp category
Parse useful information from _chem_comp category
Args:
chem_comp (dict): the mmcif category with rdkit.Chem_comp info.
chem_comp (dict): the mmcif category with _chem_comp info
Returns:
Properties: namedtuple with the property info
Properties: dataclass with the CCD properties.
"""
properties = None
if chem_comp:
mod_date = chem_comp['pdbx_modified_date'][0].split('-')
d = date(1970, 1, 1) if mod_date[0] == '?' else date(int(mod_date[0]), int(mod_date[1]), int(mod_date[2]))
rel_status = chem_comp['pdbx_release_status'][0]
rel_status = ReleaseStatus.from_str(chem_comp['pdbx_release_status'][0])
properties = CCDProperties(id=chem_comp['id'][0],
name=chem_comp['name'][0],
formula=chem_comp['formula'][0],
modified_date=chem_comp['pdbx_modified_date'][0],
pdbx_release_status=chem_comp['pdbx_release_status'][0],
weight=chem_comp['formula_weight'][0])
modified_date=d,
pdbx_release_status=rel_status,
weight=0.0 if chem_comp['formula_weight'][0] == '?' else float(chem_comp['formula_weight'][0]))
return properties
......@@ -351,10 +371,29 @@ def _bond_pdb_order(value_order):
"""
if value_order == 'SING':
return rdkit.Chem.rdchem.BondType(1)
elif value_order == 'DOUB':
if value_order == 'DOUB':
return rdkit.Chem.rdchem.BondType(2)
elif value_order == 'TRIP':
if value_order == 'TRIP':
return rdkit.Chem.rdchem.BondType(3)
else:
return None
return None
def _atom_chiral_tag(tag):
"""Parse _chem_comp.pdbx_stereo.config from chem_comp
Args:
tag (str): R/S/N identification of chiral center.
Returns:
rdkit.Chem.ChiralType: Chiral center in RDKit language.
"""
if tag == 'N':
return rdkit.Chem.ChiralType.CHI_UNSPECIFIED
if tag == 'S':
return rdkit.Chem.ChiralType.CHI_TETRAHEDRAL_CCW
if tag == 'R':
return rdkit.Chem.ChiralType.CHI_TETRAHEDRAL_CW
return rdkit.Chem.ChiralType.CHI_UNSPECIFIED
# endregion parse mmcif
......@@ -20,15 +20,15 @@
Raises:
CCDUtilsError: If deemed format is not supported or an unrecoverable
error occures.
error occurres.
"""
import copy
import json
import math
import xml.etree.ElementTree as ET
from xml.dom import minidom
from collections import OrderedDict
from typing import List
from xml.dom import minidom
import mmCif.mmcifIO as mmcif
import rdkit
......@@ -274,6 +274,13 @@ def to_pdb_ccd_cif_file(path, component: Component, remove_hs=True):
cif_copy = copy.deepcopy(component.ccd_cif_dict)
_add_sw_info_cif(cif_copy)
_add_2d_depiction_cif(component, cif_copy)
_add_fragments_and_scaffolds_cif(component, cif_copy)
_add_rdkit_properties_cif(component, cif_copy)
_add_unichem_mapping(component, cif_copy)
_add_rdkit_conformer(component, cif_copy, remove_hs)
if remove_hs:
h_indices: List[int] = [i for i, x in enumerate(cif_copy['_chem_comp_atom']['type_symbol']) if x == "H"]
h_names: List[str] = [cif_copy['_chem_comp_atom']['atom_id'][i] for i in h_indices]
......@@ -381,7 +388,7 @@ def to_json_dict(component: Component, remove_hs=True, conf_type=ConformerType.I
:obj:`dict` of :obj:`str`: dictionary representation of the component
"""
if conf_type == ConformerType.AllConformers:
raise AttributeError('All conformer export is not suppoted for json export')
raise AttributeError('All conformer export is not supported for json export')
(mol_to_save, conf_id, conf_type) = _prepate_structure(component, remove_hs, conf_type)
result = {'atoms': [], 'bonds': []}
......@@ -454,7 +461,7 @@ def _prepate_structure(component, remove_hs, conf_type):
to be exported.
"""
conf_id = 0 if conf_type == ConformerType.Depiction else component.conformers_mapping[conf_type]
mol_to_save = component._2dmol if conf_type == ConformerType.Depiction else component.mol
mol_to_save = component.mol2D if conf_type == ConformerType.Depiction else component.mol
if remove_hs:
mol_to_save = rdkit.Chem.RemoveHs(mol_to_save, sanitize=False)
......@@ -499,11 +506,6 @@ def _write_pdb_ccd_cif_info(cif_dict, component):
calc_weight = rdkit.Chem.rdMolDescriptors.CalcExactMolWt(component.mol)
mod_date = "{}-{:02d}-{:02d}".format(component.modified_date.year, component.modified_date.month, component.modified_date.day)
cif_dict['pdbeccdutils'] = OrderedDict([])
cif_dict['pdbeccdutils']['rdkit_version'] = rdkit.__version__
cif_dict['pdbeccdutils']['core_version'] = pdbeccdutils.__version__
label = '_chem_comp'
cif_dict[label] = OrderedDict([])
cif_dict[label]['id'] = component.id
......@@ -516,9 +518,6 @@ def _write_pdb_ccd_cif_info(cif_dict, component):
cif_dict[label]['pdbx_modified_date'] = mod_date
cif_dict[label]['pdbx_release_status'] = component.pdbx_release_status.name
cif_dict['pdbeccdutils']['rdkit_version'] = rdkit.__version__
cif_dict['pdbeccdutils']['core_version'] = pdbeccdutils.__version__
def _write_pdb_ccd_cif_atoms(cif_dict, component):
"""Writes the _chem_comp_atom namespace with atom details.
......@@ -793,7 +792,7 @@ def _get_atom_coord(component, at_id, conformer_type):
# region fallbacks
def _to_sdf_str_fallback(mol, id, mappings):
def _to_sdf_str_fallback(mol, ccd_id, mappings):
"""Fallback method to generate SDF file in case the default one in
RDKit fails.
......@@ -816,7 +815,7 @@ def _to_sdf_str_fallback(mol, id, mappings):
atom_count = mol.GetNumAtoms()
bond_count = mol.GetNumBonds()
content.append('{} - {} conformer\n RDKit 3D\n'.format(id, k))
content.append('{} - {} conformer\n RDKit 3D\n'.format(ccd_id, k))
content.append('{:>3}{:3} 0 0 0 0 0 0 0 0999 V2000'.format(atom_count, bond_count))
for i in range(0, atom_count):
......@@ -825,7 +824,7 @@ def _to_sdf_str_fallback(mol, id, mappings):
rdkit_conformer.GetAtomPosition(i).y,
rdkit_conformer.GetAtomPosition(i).z,
mol.GetAtomWithIdx(i).GetSymbol(),
_charge_to_sdf(mol.GetAtomWithIdx(i).GetFormalCharge()),
__charge_to_sdf(mol.GetAtomWithIdx(i).GetFormalCharge()),
))
for i in range(0, bond_count):
......@@ -833,8 +832,8 @@ def _to_sdf_str_fallback(mol, id, mappings):
content.append('{:>3}{:>3}{:>3}{:>3} 0 0 0'.format(
bond.GetBeginAtom().GetIdx() + 1,
bond.GetEndAtom().GetIdx() + 1,
_bond_type_to_sdf(bond),
_bond_stereo_to_sdf(bond)
__bond_type_to_sdf(bond),
__bond_stereo_to_sdf(bond)
))
content.append('M END')
content.append('$$$$')
......@@ -846,8 +845,8 @@ def _to_pdb_str_fallback(mol, conf_id):
RDKit fails.
Args:
mol (rdkit.Chem.rdchem.Mol): Molecule to be writter.
conf_id (int): conformer id to be writen.
mol (rdkit.Chem.rdchem.Mol): Molecule to be written.
conf_id (int): conformer id to be written.
Returns:
str: String representation the component in the PDB format.
......@@ -905,8 +904,11 @@ def _to_pdb_str_fallback(mol, conf_id):
return "\n".join(content)
# endregion fallbacks
def _charge_to_sdf(charge):
# region helpers
def __charge_to_sdf(charge):
"""Translate RDkit charge to the SDF language.
Args:
......@@ -933,7 +935,7 @@ def _charge_to_sdf(charge):
return "0"
def _bond_stereo_to_sdf(bond):
def __bond_stereo_to_sdf(bond):
"""Translate bond stereo information to the sdf language. Needs to
be checked.
......@@ -947,15 +949,14 @@ def _bond_stereo_to_sdf(bond):
if stereo == rdkit.Chem.rdchem.BondStereo.STEREONONE:
return '0'
elif stereo == rdkit.Chem.rdchem.BondStereo.STEREOANY:
if stereo == rdkit.Chem.rdchem.BondStereo.STEREOANY:
return '4'
elif stereo in (rdkit.Chem.rdchem.BondStereo.STEREOCIS, rdkit.Chem.rdchem.BondStereo.STEREOTRANS):
if stereo in (rdkit.Chem.rdchem.BondStereo.STEREOCIS, rdkit.Chem.rdchem.BondStereo.STEREOTRANS):
return '3'
else:
return '0'
return '0'
def _bond_type_to_sdf(bond):
def __bond_type_to_sdf(bond):
"""Get bond type in sdf language. Based on:
http://www.nonlinear.com/progenesis/sdf-studio/v0.9/faq/sdf-file-format-guidance.aspx
......@@ -979,4 +980,277 @@ def _bond_type_to_sdf(bond):
return '0'
# endregion fallbacks
def __post_process_cif_category(cif_copy, category_name):
"""Single value category needs to be string rather than array
with a single value. Also if the category contains nothing, it should
be removed.
Args:
cif_copy (dict of str: dict): Dictionary like structure of
the CIF file.
category_name (str): Category name to be accounted for
"""
if not cif_copy[category_name]: # nothing in the category => should be removed
cif_copy.pop(category_name)
return
for k, v in cif_copy[category_name].items():
if isinstance(v, list):
if len(v) == 1:
cif_copy[category_name][k] = v[0]
if not v:
cif_copy.pop(category_name)
return
# endregion helpers
# region cif-export addition
def _add_sw_info_cif(cif_copy):
"""Add information to the cif file about software versions used
to generate this information
Args:
cif_copy (dict of str: dict): Dictionary like structure of
the CIF file.
"""
category = '_software'
cif_copy[category] = {}
cif_copy[category]['name'] = ['rdkit', 'pdbeccdutils']
cif_copy[category]['version'] = [rdkit.__version__, pdbeccdutils.__version__]
cif_copy[category]['description'] = ['Computation functionality.', 'Wrapper to provide 2D templates and molecular fragments.']
def _add_2d_depiction_cif(component, cif_copy):
"""Add 2D coordinates of the component depiction
Args:
component (Component): pdbeccdutils component.
cif_copy (dict of str: dict): Dictionary like structure of
the CIF file.
"""
if component.mol2D is None:
return
__add_rdkit_2d_atoms_cif(component, cif_copy)
__add_rdkit_2d_bonds_cif(component, cif_copy)
def _add_fragments_and_scaffolds_cif(component, cif_copy):
"""Add fragments and scaffolds information to the CIF export
Args:
component (Component): Component to be exported.
cif_copy (dict of str: dict): Dictionary like structure of
the CIF file.
"""
substructure_category = '_chem_comp_pdbe_substructure'
mapping_category = '_chem_comp_pdbe_substructure_mapping'
ids = [f'S{i+1}' for i in range(0, len(component.scaffolds))]
ids += [f'F{i+1}' for i in range(0, len(component.fragments))]
types = [f'scaffold' for i in range(0, len(component.scaffolds))]
types += [f'fragment' for i in range(0, len(component.fragments))]