Update to annotation provider in EMBL/GenBank files

Created by: james-monkeyshines

Description

We used to have a single meta_key, provider.name (with an associated provider.url), but it could not capture the fact that the assembly sometimes needs to be credited to one provider and the gene annotation to another. Vertebrate databases used the field to indicate the annotation provider, and non-vertebrates used it for the assembly provider. To better represent these use cases, provider.name was replaced by two new meta_keys, assembly.provider_name and annotation.provider_name (https://www.ebi.ac.uk/panda/jira/browse/ENSINT-361).

In the comments of EMBL and GenBank files, provider.name was formerly used to indicate the source of the annotation. A previous PR (#506) replaced it with assembly.provider_name, but it is more accurate to use annotation.provider_name and to fall back to assembly.provider_name only if that is undefined.

Further, since an annotation can have multiple providers, it is better to list them all in the comments rather than select just one.
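
The selection order described above can be sketched roughly as follows. This is an illustration only, not the code in this PR: the function name `annotation_credit`, the dictionary shape used for the meta values, and the comma separator are all assumptions made for the example. The final fallback to 'Ensembl' follows the behaviour described under "Use case".

```python
from typing import Mapping, Sequence

# Generic fallback named in this description.
DEFAULT_PROVIDER = "Ensembl"


def annotation_credit(meta: Mapping[str, Sequence[str]]) -> str:
    """Return the provider text for an EMBL/GenBank comment.

    `meta` is a hypothetical mapping from meta_key to its list of values;
    the real meta container may expose these values differently.
    """
    # Prefer the annotation providers; there may be more than one.
    providers = [p for p in meta.get("annotation.provider_name", ()) if p]

    # Only if no annotation provider is defined, use the assembly provider.
    if not providers:
        providers = [p for p in meta.get("assembly.provider_name", ()) if p]

    # If neither key is defined, credit the generic default.
    if not providers:
        providers = [DEFAULT_PROVIDER]

    # List every provider rather than picking one (separator is an assumption).
    return ", ".join(providers)
```

With this sketch, a database that defines two annotation providers and one assembly provider is credited to both annotation providers only, and a database that defines neither key is credited to 'Ensembl'.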

Use case

For vertebrates, this change does not make much difference, because assembly.provider_name is typically not defined, so the code falls back to the generic 'Ensembl' in any case. But for non-vertebrates, which are typically annotated by non-Ensembl groups, this allows for more accurate attribution.

Benefits

Better attribution for annotation in the EMBL/GenBank files on the FTP site.

Possible Drawbacks

None I can think of.

Testing

Have you added/modified unit tests to test the changes? Yes

If so, do the tests pass/fail? Pass

Have you run the entire test suite and no regression was detected? Yes

Edited by Stefano Giorgetti