Rewritten UniProtParser
Created by: mkszuba
Edited: DO NOT MERGE at this point; please see the "Testing" section below for details.
Description
The old UniProtParser is very complex, includes a lot of legacy code, contains several bugs (see e.g. ENSCORESW-2837) and would be difficult to adapt to the Extract-Transform-Load (ETL) model of operation.
Use case
UniProt-KB xrefs are a crucial component of every Ensembl release.
Benefits
A hopefully cleaner, more modular and ETL-friendly code base, which should be easier to transplant to the refactored xref pipeline or to extend in the future (e.g. to process XML rather than text dumps, or to query the UniProt database directly).
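To illustrate what "ETL-friendly" means here, the following is a minimal sketch of the extract/transform/load separation. It is written in Python purely for illustration and is not the code in this PR; the function names (extract, transform, load, run_pipeline) and the dictionary keys are hypothetical. The only format facts it relies on are that UniProt flat-file entries end with a "//" line and that accessions appear on "AC" lines separated by semicolons.

```python
from typing import Dict, Iterable, Iterator, List


def extract(lines: Iterable[str]) -> Iterator[List[str]]:
    """Extract: split a flat-file stream into records terminated by '//'."""
    record: List[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if line == "//":
            if record:
                yield record
            record = []
        else:
            record.append(line)


def transform(record: List[str]) -> Dict[str, object]:
    """Transform: reduce one record to the fields of interest.

    Hypothetical output shape; a real parser would extract far more.
    """
    accessions: List[str] = []
    for line in record:
        if line.startswith("AC   "):
            accessions.extend(
                acc.strip() for acc in line[5:].split(";") if acc.strip()
            )
    return {
        "primary_accession": accessions[0] if accessions else None,
        "synonyms": accessions[1:],
    }


def load(xrefs: Iterable[Dict[str, object]], sink: List[Dict[str, object]]) -> int:
    """Load: store transformed records; a list stands in for the database."""
    count = 0
    for xref in xrefs:
        sink.append(xref)
        count += 1
    return count


def run_pipeline(lines: Iterable[str], sink: List[Dict[str, object]]) -> int:
    """Chain the three stages lazily, one record at a time."""
    return load((transform(rec) for rec in extract(lines)), sink)
```

Because each stage only depends on the shape of the data handed to it, replacing the text extractor with an XML or database one would, in a design like this, leave the transform and load stages untouched.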
Possible Drawbacks
Differences in the output of the old and the new parser due to bug fixes, which will be fairly challenging to validate given the size of the input.
Testing
Have you added/modified unit tests to test the changes? No.
If so, do the tests pass/fail? N/A
Have you run the entire test suite and no regression was detected?
- I have confirmed that the new parser runs on a small subset of both Swiss-Prot and TrEMBL data and produces the expected results, but only if the input is uncompressed; gzipped files cause a "broken pipe" error in zcat.
- xref_parser.t, which was added to the repository by PR #286 and which tests, among others, UniProtParser, fails. Problems pertaining directly to UniProtParser have been traced to bugs in the test suite and subsequently fixed. Unfortunately, it turns out that WormbaseCElegansUniProtParser inherits from UniProtParser and will therefore have to be refactored as well in order to become compatible with the ETL version.
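On the gzipped-input problem mentioned in the first point above: a "broken pipe" from zcat typically means the reading side closed the pipe before zcat had finished writing, although whether that is the cause here has not been confirmed. One way to sidestep the external process altogether is to decompress in-process. The sketch below shows that idea in Python for illustration only; it is not part of this PR, and the helper names open_text_maybe_gzipped and count_entries are hypothetical.

```python
import gzip
from typing import IO

# The first two bytes of every gzip stream.
GZIP_MAGIC = b"\x1f\x8b"


def open_text_maybe_gzipped(path: str) -> IO[str]:
    """Open a plain or gzipped text file without spawning zcat."""
    with open(path, "rb") as probe:
        is_gzip = probe.read(2) == GZIP_MAGIC
    if is_gzip:
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "r", encoding="utf-8")


def count_entries(path: str) -> int:
    """Count flat-file entries, each of which ends with a '//' line."""
    with open_text_maybe_gzipped(path) as handle:
        return sum(1 for line in handle if line.rstrip("\n") == "//")
```

With no child process involved, stopping early (for example after reading only a few entries) cannot leave a writer blocked on a closed pipe.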