[ENSCORESW-3183][ENSCORESW-3176] Mapper performance improvements (!446) · Merge requests · ensembl-gh-mirror / ensembl

Marek Szuba requested to merge bugfix/intervaltree_performance into master Nov 20, 2019

Created by: ens-bwalts

Requirements

Filling out the template is required. Any pull request that does not include enough information to be reviewed in a timely manner may be closed at the maintainers' discretion;
Review the contributing guidelines for this repository; remember in particular:
- do not modify code without testing for regression
- provide simple unit tests to test the changes
- if you change the schema you must patch the test databases as well, see Updating the schema
- the PR must not fail unit testing

Description

These changes improve the speed of the recent interval tree based Mapper implementation. There are two main sets of fixes in this PR:

Improvements to the immutable interval tree code to speed up building a new tree (by not sorting already-sorted lists, and using less stringent typechecking)
Caching interval trees used by the mapper for reuse, rather than re-creating them each time map_coordinates() or map_insert() is called

I also fixed up _dump() so that it works again and is more useful, as I needed to use that method to aid in implementing these changes.

Use case

Other teams have raised several JIRA tickets complaining about the slowness of the release/97 interval tree based Mapper. The Mapper is used directly or indirectly in many, many places throughout Ensembl, so its performance issues have wide impact for production as well as other Ensembl users. These are gathered in JIRA epic ENSCORESW-3183, and this in particular addresses the difficulty reported by production in ENSCORESW-3176

Benefits

A 2-3 order of magnitude speedup in performance. For example, the Mapper calls involved in dumping the sequence of human chromosome 20 speeds up from 4954 seconds (82 minutes, 465 ms per call) to 19.7 seconds (1.8 ms per call). Most of this improvement is due to caching of trees in Mapper. The improvements in the IntervalTree itself cut the runtime by about 1/3 to 1/2

Possible Drawbacks

We need to make sure that Mapper adaptors (e.g. AssemblyMapperAdaptor) continue to cache the Mappers themselves, so that they can be re-used. Otherwise, we end up having to go through the expense of rebuilding all these data structures all over again.

Because of these changes, Mapper is now fully subject to the two hard things in computer science: naming things, cache invalidation, and off by one errors. Any further changes to the Mapper need to be carefully considered to be sure stored trees are correctly instantiated and removed.

Testing

Have you added/modified unit tests to test the changes? Yes, tests for tree caching and invalidation in mapper.t

If so, do the tests pass/fail? pass

Have you run the entire test suite and no regression was detected? yes

In addition, I have run sequence dumper scripts with these new changes and the output is identical to the pre-change output.

[ENSCORESW-3183][ENSCORESW-3176] Mapper performance improvements