sequence/proteome/:species GET endpoint for whole proteome download

Marek Szuba requested to merge github/fork/vsitnik/vb_proteome_download into master

Created by: vsitnik

Requirements

  • Filling out the template is required. Any pull request that does not include enough information to be reviewed in a timely manner may be closed at the maintainers' discretion;
  • Review the contributing guidelines for this repository; remember in particular:
    • do not modify code without testing for regression
    • provide simple unit tests to test the changes
    • the PR must not fail unit testing
    • if you're adding/updating documentation of an endpoint, make sure you add/update the necessary parameters to the (template) configuration files in the ensembl-rest_private repo

Description

The new endpoint allows downloading all protein sequences for the specified species. It is enabled only for species whose core database has meta.proteome_download_allowed set to 'true'; for all other species the request is forbidden.
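
The gating rule described above can be illustrated with a minimal Python sketch. This is not the code in this merge request (the endpoint itself lives in the REST server); the function names, the dictionary standing in for the core database meta table, and the use of HTTP status 403 for "forbidden" are all assumptions made for illustration.

```python
# Hypothetical sketch of the per-species gate; names and status codes are assumptions.

def proteome_download_allowed(meta: dict) -> bool:
    """Return True only if the meta key is present and set to the string 'true'."""
    return meta.get("proteome_download_allowed") == "true"


def check_species(species: str, meta: dict) -> tuple:
    """Return an (HTTP status, message) pair for a proteome download request.

    403 is an assumed status for the 'forbidden' case.
    """
    if not proteome_download_allowed(meta):
        return (403, f"Proteome download is not allowed for {species}")
    return (200, "OK")
```

With this sketch, a species whose meta table lacks the key, or has any value other than 'true', is rejected.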

Use case

Will be used by UniProt to download protein FASTA files from vectorbase.org, for example:

wget --header='Content-type:text/x-fasta' 'http://127.0.0.1:34274/sequence/proteome/Anopheles atroparvus?' -O - | gzip - > Aatr.prot.fasta.gz
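
For clients that prefer a script over wget, the same request can be sketched in Python using only the standard library. The endpoint path and the Content-type header come from this description; the helper names build_proteome_request and download_proteome are hypothetical, and percent-encoding the species name is an assumption about what the server accepts.

```python
import gzip
import shutil
from urllib.parse import quote
from urllib.request import Request, urlopen


def build_proteome_request(base_url: str, species: str) -> Request:
    """Build a GET request for sequence/proteome/:species asking for FASTA output."""
    # Percent-encode the species name so that spaces are safe in the URL path.
    path = "/sequence/proteome/" + quote(species, safe="")
    return Request(base_url.rstrip("/") + path,
                   headers={"Content-type": "text/x-fasta"})


def download_proteome(base_url: str, species: str, out_path: str) -> None:
    """Stream the FASTA response into a gzip-compressed file."""
    req = build_proteome_request(base_url, species)
    with urlopen(req) as resp, gzip.open(out_path, "wb") as out:
        shutil.copyfileobj(resp, out)
```

Calling download_proteome("http://127.0.0.1:34274", "Anopheles atroparvus", "Aatr.prot.fasta.gz") would mirror the wget pipeline, provided a server is listening at that address.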

Benefits

The endpoint allows downloading all 'canonical' protein sequences for Anopheles atroparvus in 2 minutes 25 seconds, compared with approximately 3 hours using the current approach (roughly a 75-fold speed-up).

Possible Drawbacks

Still slow, and probably not appropriate for large genomes, so meta.proteome_download_allowed should be set with caution. The seq_regions must have proper 'coding_cnt' and 'toplevel' attributes set.

Testing

t/sequences.t was updated to test the new endpoint behaviour. No regression was seen for the affected features.

The VectorBase production database anopheles_atroparvus_core_1810_93_3 was used for the performance testing.

Changelog

New endpoint that allows whole-proteome FASTA downloads.