ENSCORESW-3196: do not overwrite source_id

Merged Marek Szuba requested to merge bugfix/checksum_source into master

Created by: magaliruffier

Requirements

  • Filling out the template is required. Any pull request that does not include enough information to be reviewed in a timely manner may be closed at the maintainers' discretion;
  • Review the contributing guidelines for this repository; remember in particular:
    • do not modify code without testing for regression
    • provide simple unit tests to test the changes
    • if you change the schema you must patch the test databases as well, see Updating the schema
    • the PR must not fail unit testing

Description

Using one or more sentences, describe in detail the proposed changes.

We have checksum entries for RNAcentral (transcript sequences) and UniParc (protein sequences). Currently, we do not distinguish between the two sources and attempt to map checksums from both sources to all available sequences. Recently, because of a checksum clash, we have started mapping some UniParc checksums to transcript sequences. To avoid this, we want to store each checksum with the source it comes from and use that information in the later mapping stage, so that RNAcentral checksums are only mapped to transcript sequences and UniParc checksums only to protein sequences.
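The idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline's actual code: the function name `map_checksums`, the `SOURCE_TO_SEQ_TYPE` table and the tuple layouts are all hypothetical. The identifiers and checksum in the usage lines are the ones quoted in the use case of this request.

```python
from collections import defaultdict

# Assumption: each checksum source is paired with exactly one sequence type.
SOURCE_TO_SEQ_TYPE = {"RNAcentral": "transcript", "UniParc": "protein"}


def map_checksums(checksum_rows, sequences):
    """Map external accessions to sequences, never across source types.

    checksum_rows: iterable of (accession, checksum, source) tuples
    sequences:     iterable of (stable_id, checksum, seq_type) tuples
    Returns a list of (accession, stable_id) pairs.
    """
    # Index the sequences by type first, then by checksum.
    by_type = defaultdict(lambda: defaultdict(list))
    for stable_id, checksum, seq_type in sequences:
        by_type[seq_type][checksum].append(stable_id)

    mappings = []
    for accession, checksum, source in checksum_rows:
        # Only look at the sequence type that belongs to this source.
        seq_type = SOURCE_TO_SEQ_TYPE[source]
        for stable_id in by_type[seq_type].get(checksum, []):
            mappings.append((accession, stable_id))
    return mappings


CHECKSUM = "CFECF93F30021B262B430492E53B847C"
uniparc_rows = [("UPI0001765128", CHECKSUM, "UniParc")]
transcripts = [("ENST00000616594.2", CHECKSUM, "transcript")]

# The UniParc entry is only compared to proteins, so nothing is mapped here.
print(map_checksums(uniparc_rows, transcripts))  # prints []
```

Keeping the source next to each checksum means the clash is resolved at lookup time, without having to change how the checksums themselves are computed.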

Use case

Describe the problem. Please provide an example representing the motivation behind the need for having these changes in place.

UPI0001765128 is a UniParc entry with checksum CFECF93F30021B262B430492E53B847C. The transcript sequence for ENST00000616594.2 has the same checksum and currently maps to UPI0001765128. With this change, UPI0001765128 is only compared to protein sequences and will no longer map to ENST00000616594.2.

Benefits

If applicable, describe the advantages the changes will have.

Cross-source checksum mapping no longer happens.

Possible Drawbacks

If applicable, describe any possible undesirable consequence of the changes.

The individual checksum mapping stages might be slower, as the SQL query restricts on an individual source_id.
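For reference, the kind of source-restricted lookup meant here can be sketched with Python's built-in `sqlite3` module. The table layout (`checksum_xref` with `accession`, `checksum` and `source_id` columns), the helper `find_accessions` and the numeric source ids are assumptions made for this illustration; the real pipeline's schema and queries may differ.

```python
import sqlite3


def find_accessions(conn, checksum, source_id=None):
    """Return accessions with this checksum, optionally for one source only."""
    sql = "SELECT accession FROM checksum_xref WHERE checksum = ?"
    params = [checksum]
    if source_id is not None:
        # The extra condition is what restricts the lookup to one source.
        sql += " AND source_id = ?"
        params.append(source_id)
    sql += " ORDER BY accession"
    return [row[0] for row in conn.execute(sql, params)]


# Made-up source ids, used only in this sketch.
UNIPARC, RNACENTRAL = 1, 2
CHECKSUM = "CFECF93F30021B262B430492E53B847C"

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE checksum_xref (accession TEXT, checksum TEXT, source_id INTEGER)"
)
conn.execute(
    "INSERT INTO checksum_xref VALUES (?, ?, ?)",
    ("UPI0001765128", CHECKSUM, UNIPARC),
)

print(find_accessions(conn, CHECKSUM))              # unrestricted lookup
print(find_accessions(conn, CHECKSUM, RNACENTRAL))  # transcript stage
print(find_accessions(conn, CHECKSUM, UNIPARC))     # protein stage
```

In this sketch the transcript stage, which asks only for RNAcentral rows, no longer sees the UniParc accession.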

Testing

Have you added/modified unit tests to test the changes? NA

If so, do the tests pass/fail? NA

Have you run the entire test suite and no regression was detected?

There is no test case for this pipeline. The whole xref pipeline was run before and after the change. In particular, the DataChecks run at the end reported the mismapped UPI0001765128 xref before the change, but not after.
