[ENSCORESW-3183][ENSCORESW-3176] Mapper performance improvements
Created by: ens-bwalts
Requirements
- Filling out the template is required. Any pull request that does not include enough information to be reviewed in a timely manner may be closed at the maintainers' discretion;
- Review the contributing guidelines for this repository; remember in particular:
- do not modify code without testing for regression
- provide simple unit tests to test the changes
- if you change the schema you must patch the test databases as well, see Updating the schema
- the PR must not fail unit testing
Description
These changes improve the speed of the recent interval tree based Mapper implementation. There are two main sets of fixes in this PR:
- Improvements to the immutable interval tree code to speed up building a new tree (by not sorting already-sorted lists, and using less stringent typechecking)
- Caching interval trees used by the mapper for reuse, rather than re-creating them each time
map_coordinates()
ormap_insert()
is called
I also fixed up _dump()
so that it works again and is more useful, as I needed to use that method to aid in implementing these changes.
Use case
Other teams have raised several JIRA tickets complaining about the slowness of the release/97 interval tree based Mapper. The Mapper is used directly or indirectly in many, many places throughout Ensembl, so its performance issues have wide impact for production as well as other Ensembl users. These are gathered in JIRA epic ENSCORESW-3183, and this in particular addresses the difficulty reported by production in ENSCORESW-3176
Benefits
A 2-3 order of magnitude speedup in performance. For example, the Mapper calls involved in dumping the sequence of human chromosome 20 speeds up from 4954 seconds (82 minutes, 465 ms per call) to 19.7 seconds (1.8 ms per call). Most of this improvement is due to caching of trees in Mapper. The improvements in the IntervalTree itself cut the runtime by about 1/3 to 1/2
Possible Drawbacks
We need to make sure that Mapper adaptors (e.g. AssemblyMapperAdaptor) continue to cache the Mappers themselves, so that they can be re-used. Otherwise, we end up having to go through the expense of rebuilding all these data structures all over again.
Because of these changes, Mapper is now fully subject to the two hard things in computer science: naming things, cache invalidation, and off by one errors. Any further changes to the Mapper need to be carefully considered to be sure stored trees are correctly instantiated and removed.
Testing
Have you added/modified unit tests to test the changes? Yes, tests for tree caching and invalidation in mapper.t
If so, do the tests pass/fail? pass
Have you run the entire test suite and no regression was detected? yes
In addition, I have run sequence dumper scripts with these new changes and the output is identical to the pre-change output.