sequence/proteome/:species GET endpoint for whole proteome download
Created by: vsitnik
Requirements
- Filling out the template is required. Any pull request that does not include enough information to be reviewed in a timely manner may be closed at the maintainers' discretion;
- Review the contributing guidelines for this repository; remember in particular:
- do not modify code without testing for regression
- provide simple unit tests to test the changes
- the PR must not fail unit testing
- if you're adding/updating documentation of an endpoint, make sure you add/update the necessary parameters to the (template) configuration files in the ensembl-rest_private repo
Description
The new endpoint allows downloading all protein sequences for the specified species. It is enabled only for species whose core database has meta.proteome_download_allowed set to 'true'; for all other species the request is forbidden.
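The opt-in rule can be pictured with a small sketch. This is an illustration only, not the Perl code in this PR: the helper name and the idea of passing the species' meta entries as a mapping are assumptions made for the example.

```python
from typing import Mapping

# Meta key named in the description above.
META_KEY = "proteome_download_allowed"


def proteome_download_allowed(meta: Mapping[str, str]) -> bool:
    """Hypothetical helper: allow the download only on an explicit 'true'.

    A missing key or any other value means the species has not opted in,
    so the request should be refused.
    """
    return meta.get(META_KEY) == "true"
```

Treating a missing key as "not allowed" matches the description: the feature is off unless a species explicitly enables it.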
Use case
Will be used by UniProt to download protein FASTA files from vectorbase.org.
wget --header='Content-type:text/x-fasta' 'http://127.0.0.1:34274/sequence/proteome/Anopheles atroparvus?' -O - | gzip - > Aatr.prot.fasta.gz
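For clients that prefer a script over wget, the same request could look roughly like the Python sketch below. It assumes the local test server address from the wget example; the function names are hypothetical, and percent-encoding the space in the species name is a choice made here, not something the endpoint is known to require.

```python
import gzip
import shutil
import urllib.request
from urllib.parse import quote

# Assumption: the local test server used in the wget example.
BASE_URL = "http://127.0.0.1:34274"


def proteome_url(species: str, base_url: str = BASE_URL) -> str:
    """Build the endpoint URL, percent-encoding the species name."""
    return f"{base_url}/sequence/proteome/{quote(species, safe='')}"


def download_proteome(species: str, out_path: str, base_url: str = BASE_URL) -> None:
    """Stream the FASTA response into a gzip-compressed file."""
    request = urllib.request.Request(
        proteome_url(species, base_url),
        # Same header as the wget example.
        headers={"Content-type": "text/x-fasta"},
    )
    # urlopen raises urllib.error.HTTPError on 4xx/5xx responses, for example
    # when the species does not have proteome downloads enabled.
    with urllib.request.urlopen(request) as response, gzip.open(out_path, "wb") as out:
        shutil.copyfileobj(response, out)


# Example call (needs a running server):
# download_proteome("Anopheles atroparvus", "Aatr.prot.fasta.gz")
```

Streaming with shutil.copyfileobj avoids holding the whole proteome in memory, which matters given how long the download runs.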
Benefits
The endpoint downloads all 'canonical' protein sequences for Anopheles atroparvus in 2 minutes 25 seconds, instead of approximately 3 hours with the current approach.
Possible Drawbacks
Still slow, and probably not appropriate for large genomes, so meta.proteome_download_allowed should be set with caution. seq_regions should have proper 'coding_cnt' and 'toplevel' attributes set.
Testing
t/sequences.t was updated to test the new endpoint behaviour.
No regression was seen for the affected features.
VectorBase prod db anopheles_atroparvus_core_1810_93_3 was used for the performance testing.
Changelog
New endpoint that allows whole-proteome FASTA downloads.