C. elegans references use WormBase mapping to INSDC protein ids

Marek Szuba requested to merge github/fork/wbazant/master into master

Created by: wbazant

Description

Changes affecting the vertebrates run of the pipeline

  • Parts of RefSeqGPFFParser. The rest of the new behaviour was added by subclassing.

All changes

  • rewrite WormbaseDirectParser
  • WormbaseDirectParser populates protein_ids
  • new xref: wormbase_cds
  • superclass method to make dependent protein_ids as parent
  • tap into UniProtParser
    • also skip EMBL scaffold ids (we can't reliably assign them)
  • tap into RefSeqGPFFParser
    • extract two methods so that I can subclass them
    • fix a stray "next" that should have been a "return"
  • tests for new stuff
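
To illustrate the "extract two methods so that I can subclass them" item, here is a minimal sketch in Python rather than the project's own language. All names in it (BaseParser, WormbaseParser, accept, build_xref, the record fields) are hypothetical and are not the actual RefSeqGPFFParser API. It also shows why a loop-control statement has to become a "return" once the loop body lives in its own method.

```python
class BaseParser:
    """Hypothetical parser whose loop body has been split into two hooks."""

    def run(self, records):
        xrefs = []
        for record in records:
            xref = self.build_xref(record)
            if xref is not None:
                xrefs.append(xref)
        return xrefs

    def accept(self, record):
        # Hook 1: decide whether a record is usable at all.
        return bool(record.get("protein_id"))

    def build_xref(self, record):
        # Hook 2: turn a record into an xref, or None to skip it.
        if not self.accept(record):
            # Inside an extracted method, skipping is a return,
            # not a loop-control statement.
            return None
        return {"source": "RefSeq", "id": record["protein_id"]}


class WormbaseParser(BaseParser):
    """Hypothetical subclass that only overrides the filtering hook."""

    C_ELEGANS_TAXON = 6239

    def accept(self, record):
        return super().accept(record) and record.get("taxon") == self.C_ELEGANS_TAXON


records = [
    {"protein_id": "NP_000001.1", "taxon": 6239},  # made-up ids
    {"protein_id": "NP_000002.1", "taxon": 9606},
    {"taxon": 6239},  # no protein_id, skipped by both parsers
]
base_result = BaseParser().run(records)
worm_result = WormbaseParser().run(records)
```

The base class keeps its behaviour for other callers, and the subclass changes only the hook it needs.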

Use case

C. elegans can have better xrefs in WormBase ParaSite.

Benefits

For Ensembl: The changes in RefSeqGPFFParser are a very minor improvement. UniProtParser and RefSeqGPFFParser gain test coverage.

Organisational: WormBase-specific behaviour is easier to maintain inside the code base, for as long as we are responsible for it. Ensembl can refactor without silently breaking our code, because a test will fail if it breaks.

Benefits - RefSeqGPFFParser: This change fixes a spurious mapping of paralogs.

Benefits - UniProtParser: The paralog fix above does not apply here: UniProt entries are not tied to coordinates, so all paralogs map to the same entry. What improves is the handling of versioning and updates: if WormBase updates an entry and its protein id changes before UniProt reflects this, we will still pick up the UniProt entry, although we can no longer match it by sequence.
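
A minimal Python sketch of the idea, not the parser's actual code. It assumes that the part of the protein id that changes is the version suffix (for example ".1" becoming ".2"), and that matching on the unversioned id is what lets the UniProt entry still be found. The function names, accessions and ids are hypothetical.

```python
def strip_version(protein_id):
    """Drop a trailing '.N' version suffix, if there is one."""
    base, sep, version = protein_id.rpartition(".")
    return base if sep and version.isdigit() else protein_id


def build_index(uniprot_to_protein_ids):
    """Map unversioned protein id -> set of UniProt accessions citing it."""
    index = {}
    for accession, protein_ids in uniprot_to_protein_ids.items():
        for pid in protein_ids:
            index.setdefault(strip_version(pid), set()).add(accession)
    return index


def uniprot_for_cds(cds_protein_id, index):
    """Look up UniProt accessions for a CDS by its unversioned protein id."""
    return sorted(index.get(strip_version(cds_protein_id), ()))


# UniProt still cites version 1; the CDS has already moved to version 2.
index = build_index({
    "P00001": ["CAA12345.1"],  # made-up accession and protein ids
    "P00002": ["CAB99999.3"],
})
hits = uniprot_for_cds("CAA12345.2", index)
```

The lookup never compares sequences, so it keeps working when the sequences have diverged. The trade-off is that the link is no longer confirmed by a sequence match.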

Possible Drawbacks

The repo will take a bit longer to download. The test suite will run a bit longer.

Testing

I unit tested the relevant old code and the new code. I did about 7 full-pipeline runs of this code at various stages of completion.
