C. elegans references use WormBase mapping to INSDC protein ids
- maintain naming convention: WormBase specific stuff says Wormbase at the front
- rewrite WormBaseDirectParser
- WormBaseDirectParser populates protein_ids
- superclass method to make dependent protein_ids as parent
- tap into UniProtParser
  + also skip EMBL scaffold ids (we can't reliably assign them)
- tap into RefSeqGPFFParser
  + extract a method
- tests for new stuff
  + add %args to parametrise test_parser

Benefits for RefSeqGPFFParser:
RefSeq proteins have coordinates as part of their identity, so we can't reliably sequence match them: a sequence match will also pick up all paralogs. This change fixes this spurious mapping.

Benefits for UniProtParser:
The above does not apply: UniProt entries are not tied to coordinates, so all paralogs map to the same entry.
We can handle versioning and updates a bit better: if WormBase updates an entry and a protein id changes but UniProt doesn't reflect this yet, with this change we will still pick up the UniProt entry even though we can no longer sequence match it.
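The idea of mapping by INSDC protein id instead of by sequence match can be sketched roughly as below. This is only an illustration and not the code in this commit: the lookup table, the `direct_targets` function, the version stripping, and the `SCAFFOLD_` rule are all hypothetical, since the commit message does not say how scaffold ids are recognised or how versions are compared.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical table of the kind a WormBase direct parser might build:
# INSDC protein id (version stripped) => WormBase transcript id.
my %transcript_for_protein_id = (
    'CAA00001' => 'T0001.1',
    'CAA00002' => 'T0002.1a',
);

# Resolve the INSDC protein ids quoted by one external entry (for example a
# UniProt or RefSeq record) to direct targets, without any sequence matching.
# $is_unassignable is a caller-supplied predicate for ids that should be
# ignored; it is a parameter because the real rule is not given in the text.
sub direct_targets {
    my ($protein_ids, $lookup, $is_unassignable) = @_;
    my %seen;
    my @targets;
    for my $id (@{$protein_ids}) {
        next if $is_unassignable->($id);
        # Assumption: compare ids without their version suffix,
        # e.g. "CAA00001.2" is looked up as "CAA00001".
        (my $unversioned = $id) =~ s/\.\d+\z//;
        my $target = $lookup->{$unversioned};
        next unless defined $target;    # unknown id: no direct xref
        push @targets, $target unless $seen{$target}++;
    }
    return \@targets;
}

my $targets = direct_targets(
    [ 'CAA00001.2', 'SCAFFOLD_1', 'CAA99999.1', 'CAA00001.1' ],
    \%transcript_for_protein_id,
    sub { $_[0] =~ /\ASCAFFOLD_/ },    # made-up rule for illustration only
);
print join(',', @{$targets}), "\n";
```

Because the target is found through the protein id alone, paralogs with identical sequence are not pulled in, and an entry whose sequence has since changed can still be linked.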
Changed files:
- misc-scripts/xref_mapping/XrefParser/RefSeqGPFFParser.pm (36 additions, 15 deletions)
- misc-scripts/xref_mapping/XrefParser/WormbaseCElegansBase.pm (44 additions, 0 deletions)
- misc-scripts/xref_mapping/XrefParser/WormbaseCElegansRefSeqGPFFParser.pm (49 additions, 0 deletions)
- misc-scripts/xref_mapping/XrefParser/WormbaseCElegansUniProtParser.pm (43 additions, 0 deletions)
- misc-scripts/xref_mapping/XrefParser/WormbaseDirectParser.pm (65 additions, 121 deletions)
- modules/t/xref_parser.t (182 additions, 13 deletions)