Created by: magaliruffier
Using one or more sentences, describe in detail the proposed changes. We have checksum entries for RNAcentral (transcript sequences) and UniParc (protein sequences). Currently, we do not distinguish between the two sources and attempt to map checksums from both sources to all available sequences. Recently, due to a clash in checksums, some UniParc checksums have started being mapped to transcript sequences. To avoid this, we want to store each checksum with the source it comes from and use that information in the later mapping stage, so that RNAcentral checksums are only mapped to transcript sequences and UniParc checksums only to protein sequences.
Describe the problem. Please provide an example representing the motivation behind the need for having these changes in place. UPI0001765128 is a UniParc entry with checksum CFECF93F30021B262B430492E53B847C. The transcript sequence for ENST00000616594.2 has the same checksum and therefore maps to UPI0001765128. With this change, UPI0001765128 is only compared to protein sequences and will not map to ENST00000616594.2.
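For illustration only, the effect of restricting the lookup by source can be sketched as follows. This is a minimal Python sketch using an in-memory SQLite database, not the pipeline's actual code: the `checksum_xref` table layout, the numeric `source_id` values and the `map_checksum` helper are assumptions made for this example.

```python
import sqlite3

# Hypothetical source identifiers; the real values used by the pipeline
# are not stated in this PR.
RNACENTRAL = 1
UNIPARC = 2

# Checksum shared by UPI0001765128 and the ENST00000616594.2 transcript sequence.
SHARED_CHECKSUM = "CFECF93F30021B262B430492E53B847C"


def build_db():
    """Create an in-memory table that stores each checksum with its source."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE checksum_xref ("
        "accession TEXT, source_id INTEGER, checksum TEXT)"
    )
    conn.execute(
        "INSERT INTO checksum_xref VALUES (?, ?, ?)",
        ("UPI0001765128", UNIPARC, SHARED_CHECKSUM),
    )
    return conn


def map_checksum(conn, checksum, source_id=None):
    """Return accessions matching a sequence checksum.

    With source_id=None the lookup ignores the source (old behaviour);
    otherwise it is restricted to the given source (new behaviour).
    """
    if source_id is None:
        cursor = conn.execute(
            "SELECT accession FROM checksum_xref "
            "WHERE checksum = ? ORDER BY accession",
            (checksum,),
        )
    else:
        cursor = conn.execute(
            "SELECT accession FROM checksum_xref "
            "WHERE checksum = ? AND source_id = ? ORDER BY accession",
            (checksum, source_id),
        )
    return [row[0] for row in cursor]


conn = build_db()

# Transcript stage, old behaviour: the UniParc entry is picked up by mistake.
before = map_checksum(conn, SHARED_CHECKSUM)

# Transcript stage, new behaviour: only RNAcentral checksums are considered.
after_transcript = map_checksum(conn, SHARED_CHECKSUM, RNACENTRAL)

# Protein stage, new behaviour: UniParc checksums are still available.
after_protein = map_checksum(conn, SHARED_CHECKSUM, UNIPARC)

print(before, after_transcript, after_protein)
```

In this sketch the transcript stage no longer sees the UniParc entry once the query is restricted to the RNAcentral source, while the protein stage can still retrieve it.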
If applicable, describe the advantages the changes will have. Cross-source checksum mapping no longer happens.
If applicable, describe any possible undesirable consequence of the changes. The individual checksum mapping stages might be slower, as the SQL query now restricts on an individual source_id.
Have you added/modified unit tests to test the changes? NA
If so, do the tests pass/fail? NA
Have you run the entire test suite and no regression was detected? There is no test case for this pipeline. The whole xref pipeline was run before and after the change. In particular, the DataChecks run at the end reported the mismapped UPI0001765128 xref before the change, but not after.