updated info for priority xrefs

f00dc2bc · Ian Longden · a4dcb98e · f00dc2bc
Commit f00dc2bc authored 18 years ago by Ian Longden
--- a/misc-scripts/xref_mapping/parsing_information.txt
+++ b/misc-scripts/xref_mapping/parsing_information.txt
@@ -52,6 +52,12 @@ RGD (rat only)
 Refseq_dna
 ----------

+Refseq_dna is now a priority xref source for human, so in addition to the NCBI file it
+also uses a local file, generated from the CCDS data, which DIRECTLY links RefSeqs to the Ensembl
+transcripts. If a RefSeq is not in this file then the sequence data from NCBI is mapped
+via exonerate in the normal manner.
+
+More generally:
 The files come in two types those for specific species i.e.
 ftp://ftp.ncbi.nih.gov/genomes/Gallus_gallus/RNA/rna.gbk.gz
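
The Refseq_dna rule added in this hunk (use the DIRECT link from the CCDS-derived file when one exists, otherwise fall back to sequence mapping) could be sketched roughly as below. This is only an illustration and not the xref mapper's code: the function name, the dictionary layout and the identifiers are all made up for the example.

```python
def split_refseqs(refseq_ids, ccds_direct):
    """Partition RefSeq ids into DIRECT links and ids that still need mapping.

    ccds_direct is a hypothetical dict of RefSeq id -> Ensembl transcript id,
    standing in for the local file generated from the CCDS data.
    """
    direct = {}
    needs_alignment = []
    for refseq_id in refseq_ids:
        if refseq_id in ccds_direct:
            # A DIRECT link exists, so no sequence comparison is needed.
            direct[refseq_id] = ccds_direct[refseq_id]
        else:
            # Not in the local file: leave it for the normal exonerate step.
            needs_alignment.append(refseq_id)
    return direct, needs_alignment


# Made-up identifiers, for illustration only.
ccds_direct = {"NM_EXAMPLE_1": "ENST_EXAMPLE_1"}
direct, needs_alignment = split_refseqs(["NM_EXAMPLE_1", "NM_EXAMPLE_2"], ccds_direct)
```

Here NM_EXAMPLE_1 is linked directly and NM_EXAMPLE_2 is left for the sequence-based mapping.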

@@ -253,19 +259,37 @@ entries will also be matched.
 HUGO
 ----

+HUGO data uses priorities to allocate each identifier to one Ensembl id.
+The priorities are:-
+1) Via Havana
+2) Via CCDS
+3) Via Refseq
+4) Via Uniprot
+5) Via Entrezgene
+
+1) DIRECT relationships are made by transferring the manually annotated ones from
+   Havana to Ensembl.
+   LOCAL:HUGO/HUGO_TO_ENSG
+
+2) DIRECT relationships are made by transferring the ones from CCDS to Ensembl.
+   LOCAL:HUGO/CCDS_TO_HUGO
+
+3, 4 and 5)
+
 The Human Genome Organisation xrefs are obtained from using the following url:-

 http://www.gene.ucl.ac.uk/cgi-bin/nomenclature/gdlw.pl?title=Genew+output+data
 &col=gd_hgnc_id&col=gd_app_sym&col=gd_app_name&col=gd_prev_sym&col=gd_aliases
-&col=md_prot_id&col=gd_pub_refseq_ids&status=Approved&status=Approved+Non-Human
-&status_opt=3&=on&where=&order_by=gd_hgnc_id&limit=&format=text&submit=submit
-&.cgifields=&.cgifields=status&.cgifields=chr
+&col=md_prot_id&col=gd_pub_refseq_ids&col=md_eg_id&status=Approved
+&status=Approved+Non-Human&status_opt=3&=on&where=&order_by=gd_hgnc_id&limit=
+&format=text&submit=submit&.cgifields=&.cgifields=status&.cgifields=chr
+

-Which is a script that produces a list of HUGO identifiers with the uniprot and
-refseq entries they are linked to.
+This is a script that produces a list of HUGO identifiers with the Uniprot,
+Refseq and EntrezGene entries they are linked to.

-The files have references to uniprot and refseq entries and so the GO entries are 
-set to be dependent xref on these.
+The files have references to Uniprot, Refseq and EntrezGene entries and so the
+HUGO entries are set to be dependent xrefs on these.

 NOTE: due to length of its name the file is stored in the name of its checksum.
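
The HUGO priority scheme described in this hunk (each identifier is allocated to one Ensembl id, preferring Havana, then CCDS, Refseq, Uniprot and Entrezgene) could be sketched as follows. This is a minimal illustration under assumed data shapes, not the mapper's actual implementation: the tuple layout, the function name and the identifiers are hypothetical.

```python
# Lower number = higher priority, following the order listed in the text.
PRIORITY = {"Havana": 1, "CCDS": 2, "Refseq": 3, "Uniprot": 4, "Entrezgene": 5}


def allocate(candidates):
    """Keep, for each HUGO id, the link coming from the highest-priority source.

    candidates is an iterable of (hugo_id, source, ensembl_id) tuples
    (a made-up layout for this example).
    """
    best = {}
    for hugo_id, source, ensembl_id in candidates:
        current = best.get(hugo_id)
        if current is None or PRIORITY[source] < PRIORITY[current[0]]:
            best[hugo_id] = (source, ensembl_id)
    return best


# Made-up identifiers, for illustration only.
candidates = [
    ("HUGO_A", "Uniprot", "ENSG_EXAMPLE_2"),
    ("HUGO_A", "Havana", "ENSG_EXAMPLE_1"),
    ("HUGO_B", "Entrezgene", "ENSG_EXAMPLE_3"),
]
allocation = allocate(candidates)
```

HUGO_A ends up on the Havana link even though the Uniprot link was seen first, while HUGO_B keeps its only candidate.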