ENSCORESW-2850: optimised for single species run (!295)

Marek Szuba requested to merge feature/parser_simplify into master Sep 07, 2018

Created by: magaliruffier

Requirements

Filling out the template is required. Any pull request that does not include enough information to be reviewed in a timely manner may be closed at the maintainers' discretion.
Review the contributing guidelines for this repository; remember in particular:
- do not modify code without testing for regression
- provide simple unit tests to test the changes
- if you change the schema you must patch the test databases as well, see Updating the schema
- the PR must not fail unit testing

Description

Using one or more sentences, describe in detail the proposed changes.

This change ensures that the xref_parser script works correctly for a single species with a simplified configuration file.

Use case

Describe the problem. Please provide an example representing the motivation behind the need for having these changes in place.

The xref code has been largely updated to accommodate a hive-based pipeline with less manual intervention. As part of this, a default configuration has been set up for each division, removing the need to add every new species to the configuration file. This has broken some of the assumptions in the historical parser script. The proposed fix clearly separates the species_id, which is passed as a parameter to the various parsers to retrieve the taxon-specific data, from the division_id, which is used to determine which sources to run.
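The separation described above can be illustrated with a minimal sketch. This is not the code from this merge request (the actual script is part of the Ensembl Perl codebase); the names `SOURCES_BY_DIVISION`, `ParserJob` and `build_jobs`, as well as the ids and source lists, are hypothetical and only show the idea that the division_id selects the sources while the species_id is what each parser receives.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical per-division defaults: sources are configured once per
# division rather than once per species. Ids and names are illustrative.
SOURCES_BY_DIVISION: Dict[int, List[str]] = {
    1: ["RefSeq", "UniProt"],
    2: ["UniProt"],
}


@dataclass(frozen=True)
class ParserJob:
    """One parser run: a source plus the species it should be run for."""
    source: str
    species_id: int  # used by the parser to fetch taxon-specific data


def build_jobs(species_id: int, division_id: int) -> List[ParserJob]:
    """Select sources by division_id, but pass species_id to each parser."""
    sources = SOURCES_BY_DIVISION.get(division_id, [])
    return [ParserJob(source=name, species_id=species_id) for name in sources]


# Illustrative call: one species, sources taken from its division.
jobs = build_jobs(species_id=9606, division_id=1)
```

Keeping the two ids as separate arguments means a parser can never receive the division_id by mistake where the species_id is expected.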

Benefits

If applicable, describe the advantages the changes will have.

The parser can be run as a standalone job on any species and source.

Possible Drawbacks

If applicable, describe any possible undesirable consequence of the changes.

The parser cannot be run on more than one species at a time. Runs covering multiple species should use the eHive pipeline instead.

Testing

Have you added/modified unit tests to test the changes?

The script was run on some sample species.

If so, do the tests pass/fail?

Before the change, sources are submitted correctly for the division, but any source requiring a connection to the species database fails, because the script uses the division_id rather than the species_id. After the change, all sources are submitted and run correctly.

Have you run the entire test suite and no regression was detected?

N/A
