For non-merge species the xref pipeline will use these source lists to assign gene and transcript display xrefs:
==Genes
1. RFAM (Best name)
2. miRBase
3. Uniprot_genename
4. EntrezGene_no_LOC% (Worst name)
==Transcripts
1. RFAM (Best name)
2. miRBase
3. Uniprot/SWISSPROT
4. UniProt/Varsplic (Worst name)
When assigning diplay xrefs exclude xrefs with accessions prefixed with XM_, XP_, YP_, NT_, NC_, LOC, NR_ followed by at least 1 digit.
Use the pig core database to compare display xrefs generated using the old gene display xrefs sources and the new gene/transcript lists to see if there is an improvement in gene/transcript names and in case of any unwanted names make sure we exclude those xrefs from display xrefs.
Review the quality of labels projected to other species.
Review the logic of the projection code, make any changes if necessary.
Add functionality to specify display_xref types which take priority over projections and shouldn't be overridden by them.
For human:
For alt_alleles which have a Clone_based_ensembl_gene display xref and are on the same clone use the same xref without incrementing the number appended to the clone name to make up the xref accession.
Potentially we may want to use Uniprot_genename xrefs for human, mouse and zebrafish before we assign a clone based name - Amonida will discuss this with Havana.
2) Bugs (high priority)
Investigate why we have Uniprot xrefs sequence matched with target and query identity < 100% in human.
3) Make configuration and running the xref pipeline smoother (high priority)
We will remove all unnecessary species.pm modules from ensembl/misc-scripts/xref_mapping/XrefMapper/
Use the latest branched version of the xref pipeline code to run xrefs.
Core team to thoroughly test the head code before branching.
Remove the need to copy ncRNA xrefs into sw4_ncRNA_Xrefs@genebuild7 in order to import them into the xref database. Possible solution is to calculate display xrefs in core.
Use a variable in xref_config.ini for the current release and apply it to ontology and ccds database names.
4) Xref pipeline testing (medium priority)
Write unit/integration tests for the xref pipline.
Check if Michael's QC work can be extended to xrefs.
Additional healthchecks to test if orthologous genes have the same names.
5) Additional functionality (low priority)
Flag dependent xrefs on the website and point to the master xref.
Improve tracing of where xrefs come from. Possible solution could be an extra table in core db to store information on where xrefs were loaded from (file location, database etc..) for each external_db, xref analysis id etc.
Add functionality to flag old xrefs which haven't been updated by the xref pipeline to make sure they are still valid links to external entities.
6) Improve the speed of xref runs/ data storage (low priority)
Calculate display xrefs on one gene in an alt_allele set and copy the display xref across to other genes.
Create an xref database which will store locations of downloaded xref source files, so that multi-species files are only downloaded once and stored in one location.