Commit d66449b6 authored by Wojtek Bazant

C. elegans references use WormBase mapping to INSDC protein ids

- maintain naming convention: WormBase-specific code has WormBase at the front of its name
- rewrite WormBaseDirectParser
- WormBaseDirectParser populates protein_ids
- superclass method to make dependent protein_ids as parent
- tap into UniProtParser
  + also skip EMBL scaffold ids (we can't reliably assign them)
- tap into RefSeqGPFFParser
  + extract a method
- tests for new stuff
  + add %args to parametrise test_parser

Benefits for RefSeqGPFFParser:
RefSeq proteins have coordinates as part of their identity, so we
can't reliably sequence match them: a sequence match also picks up
all paralogs. This change fixes that spurious mapping.

Benefits for UniProtParser:
The above does not apply: UniProt entries are not tied to coordinates,
so all paralogs map to the same entry. What we gain is slightly better
handling of versioning and updates: if WormBase updates an entry and a
protein id changes but UniProt doesn't reflect this yet, we will still
pick up the UniProt entry even though we can no longer sequence match
it.
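
Illustration only, not the code in this commit: a minimal Python sketch
of linking external entries to genes through shared INSDC protein ids
instead of sequence matching. The function names, the accessions and
the choice to ignore the version suffix of a protein id are all
assumptions made for this example.

```python
# Hypothetical sketch: every name and identifier below is invented for
# illustration and does not come from the parsers touched by this commit.

def strip_version(protein_id):
    """Drop a trailing '.N' version (assumption: versions are ignored when matching)."""
    return protein_id.split(".", 1)[0]


def link_by_protein_id(wormbase_protein_ids, external_entries):
    """Link each external entry to the genes whose protein ids it cites.

    wormbase_protein_ids: gene id -> list of INSDC protein ids for that gene
    external_entries: external accession -> list of INSDC protein ids it cites
    """
    # Index genes by unversioned protein id.
    genes_by_protein_id = {}
    for gene, protein_ids in wormbase_protein_ids.items():
        for pid in protein_ids:
            genes_by_protein_id.setdefault(strip_version(pid), set()).add(gene)

    # An external entry is linked only to genes that share a protein id
    # with it, so a paralog with an identical sequence is not picked up.
    links = {}
    for accession, cited_ids in external_entries.items():
        genes = set()
        for pid in cited_ids:
            genes |= genes_by_protein_id.get(strip_version(pid), set())
        if genes:
            links[accession] = sorted(genes)
    return links


wormbase = {
    "WBGene00000001": ["CAA00001.2"],  # version bumped on the WormBase side
    "WBGene00000002": ["CAA00002.1"],
}
external = {
    "P00001": ["CAA00001.1"],       # still cites the older version
    "NP_000002.1": ["CAA00002.1"],
    "Q99999": ["CAA99999.1"],       # cites an id no gene has; left unlinked
}
links = link_by_protein_id(wormbase, external)
```

Here the entry citing the older version is still linked to its gene,
and the entry with no matching protein id is dropped instead of being
assigned by sequence similarity.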
parent 4a7bf4f0