Rewritten UniProtParser

Marek Szuba requested to merge feature/UniProtParserETL into feature/xref_sprint

Created by: mkszuba

Edited: DO NOT MERGE at this point. Please see the "Testing" section below for details.

Description

The old UniProtParser is very complex, includes a lot of legacy code, contains several bugs (see e.g. ENSCORESW-2837) and would be difficult to adapt to the Extract-Transform-Load (ETL) mode of operation. This merge request provides a rewritten version.

Use case

UniProt-KB xrefs are a crucial component of every Ensembl release.

Benefits

Hopefully a cleaner, more modular and ETL-friendly code base, which should be easier to transplant to the refactored xref pipeline or to extend in the future (e.g. to process XML rather than text dumps, or to query the UniProt database directly).
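To illustrate the kind of separation an ETL-style parser aims for, here is a minimal sketch in Python. It is not the code in this merge request: the function names, the output structure and the sample entries are all invented for illustration. The only thing taken from UniProt itself is the flat-file layout (a two-letter line code, data starting in column 6, and `//` terminating each entry).

```python
import io
from typing import Dict, Iterator, List, TextIO

# Invented sample in the UniProtKB flat-file layout; names and accessions are made up.
SAMPLE = """\
ID   EXAMPLE1_HUMAN          Reviewed;         100 AA.
AC   X00001; X00002;
DE   RecName: Full=Hypothetical example protein 1;
//
ID   EXAMPLE2_MOUSE          Unreviewed;        50 AA.
AC   X00003;
//
"""


def extract(handle: TextIO) -> Iterator[List[str]]:
    """Extract: yield each entry as a list of lines, split on the '//' terminator."""
    record: List[str] = []
    for raw in handle:
        line = raw.rstrip("\n")
        if line == "//":
            if record:
                yield record
            record = []
        else:
            record.append(line)
    # An unterminated trailing record is treated as truncated and dropped.


def transform(record: List[str]) -> Dict[str, object]:
    """Transform: pull the entry name, review status and accessions out of one entry."""
    entry: Dict[str, object] = {"name": None, "status": None, "accessions": []}
    accessions: List[str] = []
    for line in record:
        code, value = line[:2], line[5:]
        if code == "ID":
            fields = value.split()
            if fields:
                entry["name"] = fields[0]
            if len(fields) > 1:
                entry["status"] = fields[1].rstrip(";")
        elif code == "AC":
            # AC lines may repeat; accessions are separated by semicolons.
            accessions.extend(a.strip() for a in value.split(";") if a.strip())
    entry["accessions"] = accessions
    return entry


def load(entries: Iterator[Dict[str, object]], sink: List[Dict[str, object]]) -> None:
    """Load: here just append to a list; a real loader would write xrefs to a database."""
    for entry in entries:
        sink.append(entry)


def run_pipeline(handle: TextIO) -> List[Dict[str, object]]:
    """Chain the three stages; each one can be swapped out independently."""
    sink: List[Dict[str, object]] = []
    load((transform(rec) for rec in extract(handle)), sink)
    return sink


if __name__ == "__main__":
    for row in run_pipeline(io.StringIO(SAMPLE)):
        print(row)
```

With the stages kept apart like this, switching to XML input or a direct database query would only mean replacing `extract`, which is the kind of extension mentioned above.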

Possible Drawbacks

Differences in the output of the old and the new parser due to bug fixes, which will be fairly challenging to validate given the size of the input.

Testing

Have you added/modified unit tests to test the changes? No.

If so, do the tests pass/fail? N/A

Have you run the entire test suite and no regression was detected?

  1. I have confirmed that the new parser works and produces the expected results on a small subset of both Swiss-Prot and TrEMBL data, but only if the input is uncompressed; gzipped files cause a "broken pipe" error in zcat.
  2. xref_parser.t, which was added to the repository by PR #286 and which tests UniProtParser among others, fails. Problems pertaining directly to UniProtParser have been traced to bugs in the test suite and subsequently fixed. Unfortunately, it turns out that WormbaseCElegansUniProtParser inherits from UniProtParser and will therefore have to be refactored as well in order to become compatible with the ETL version.
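Regarding item 1: a "broken pipe" from zcat generally means the reading side closed the pipe before zcat had finished writing. Whether that is the cause here has not been established. One way to avoid the external pipe altogether is to decompress in-process. The following is only a sketch in Python of that idea, with a hypothetical helper name, and is not how the parser in this merge request opens its input.

```python
import gzip
from typing import TextIO

# First two bytes of every gzip stream.
GZIP_MAGIC = b"\x1f\x8b"


def open_maybe_gzipped(path: str) -> TextIO:
    """Open a text file for reading, decompressing transparently if it is gzipped.

    Detection uses the gzip magic bytes rather than the file extension,
    so a misnamed file is still handled correctly.
    """
    with open(path, "rb") as probe:
        magic = probe.read(2)
    if magic == GZIP_MAGIC:
        # No child process and no pipe: decompression happens inside this process.
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "rt", encoding="utf-8")
```

The returned handle can be iterated line by line in either case, so the rest of the reader does not need to know whether the input was compressed.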
