ENSCORESW-2819: clean up steps in mapping

Marek Szuba requested to merge feature/clearer_mapping_stage into master

Created by: magaliruffier

Requirements

  • Filling out the template is required. Any pull request that does not include enough information to be reviewed in a timely manner may be closed at the maintainers' discretion;
  • Review the contributing guidelines for this repository; remember in particular:
    • do not modify code without testing for regression
    • provide simple unit tests to test the changes
    • if you change the schema you must patch the test databases as well, see Updating the schema
    • the PR must not fail unit testing

Description

Using one or more sentences, describe in detail the proposed changes. This change aims to make the mapping stage of the xref pipeline slightly clearer.

The test for unlinked_entries, which checks for broken foreign keys, is currently run a large number of times, in some cases both before and after a change. With this update, it is run only once, after each step that makes a change. Because this is done consistently for every step, the test will always have been run before the next change starts.

The other change filters the list of gene-specific xref sources. There is a list of sources whose xrefs should be mapped to genes, and which are copied across to the gene if they are mapped to a transcript or translation instead. That list is quite long and includes a number of species-specific sources, so fewer than half of those sources apply to any one species. The code now checks whether a source is used in the current species before adding it to the list, which removes the need for large join queries that return empty results.
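The two ideas above can be illustrated with a minimal Python sketch. It is not the code in this merge request: the two-table schema, the function names (`build_demo_db`, `sources_in_use`, `run_steps`) and the sample data are all assumptions made for illustration, and an in-memory SQLite database stands in for the real xref database.

```python
import sqlite3

# Hypothetical, heavily simplified stand-in for the xref database schema.
SCHEMA = """
CREATE TABLE source (source_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE xref (xref_id INTEGER PRIMARY KEY, source_id INTEGER);
"""


def build_demo_db():
    """Create an in-memory database in which only one source has xrefs."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    # Illustrative source names; only source 1 is used by this "species".
    conn.executemany(
        "INSERT INTO source (source_id, name) VALUES (?, ?)",
        [(1, "HGNC"), (2, "MGI"), (3, "ZFIN_ID")],
    )
    conn.executemany(
        "INSERT INTO xref (xref_id, source_id) VALUES (?, ?)",
        [(10, 1), (11, 1)],
    )
    return conn


def sources_in_use(conn, candidate_names):
    """Return the candidate sources that have at least one xref loaded."""
    if not candidate_names:
        return []
    placeholders = ",".join("?" for _ in candidate_names)
    rows = conn.execute(
        f"""
        SELECT s.name
        FROM source s
        WHERE s.name IN ({placeholders})
          AND EXISTS (SELECT 1 FROM xref x WHERE x.source_id = s.source_id)
        ORDER BY s.name
        """,
        list(candidate_names),
    ).fetchall()
    return [name for (name,) in rows]


def run_steps(steps, check):
    """Run each step, then the integrity check exactly once after it."""
    for step in steps:
        step()
        check()
```

With the demo data, `sources_in_use(build_demo_db(), ["HGNC", "MGI", "ZFIN_ID"])` keeps only `HGNC`, so any later per-source queries skip the two unused sources. `run_steps` shows the ordering argument: because the check follows every step, it has already been run by the time the next step begins.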

Use case

Describe the problem. Please provide an example representing the motivation behind the need for having these changes in place. The whole mapping stage is slightly faster as a result. It is also easier to tell which step the pipeline has reached, because the same check is no longer run several times in a row.

Benefits

If applicable, describe the advantages the changes will have. Clearer and faster mapping stage.

Possible Drawbacks

If applicable, describe any possible undesirable consequence of the changes.

Testing

Have you added/modified unit tests to test the changes? The pipeline was run on several vertebrate species, and the results were identical to those of previous runs.

If so, do the tests pass/fail?

Have you run the entire test suite and no regression was detected? NA
