Removed - superseded by ensembl_xrefs.sxw/pdf

Commit c0930eaa authored 20 years ago by Glenn Proctor
--- a/misc-scripts/xref_mapping/README
+++ b/misc-scripts/xref_mapping/README
-The Ensembl Xref System
-=======================
-
-
-Overview
--------
-
-What is an xref
---------------
-
-The xref holding database
-------------------------
-
-Parsing xrefs
-------------
-
-Mapping xrefs to Ensembl objects
--------------------------------
-
-The Ensembl core schema xref tables
-----------------------------------
-
-
-
-The following is an initial draft/mind dump which will be tidied up.
-
-The xref tables are created and populated by the scripts in this directory.
-The process has two parts: first the data is parsed, then the parsed xrefs
-are mapped to the Ensembl data.
-
-Parsing.
-
-
-What data is stored.
-
-All species have the same "core" set of data, which consists of:
-
-
-source name           Parser(s) used to populate
-------------------------------------------------------------------------------
-Uniprot/SPtrembl      UniProtParser.pm
-Uniprot/Swissprot     UniProtParser.pm
-Refseq                RefSeqGPFFParser.pm
-MIM                   UniProtParser.pm + RefSeqGPFFParser.pm
-PDB                   UniProtParser.pm
-EMBL                  UniProtParser.pm
-Protein_id            UniProtParser.pm
-LocusLink             RefSeqGPFFParser.pm
-GO                    GOParser.pm
-Interpro              InterproParser.pm (no link to primary_xref, special case)
-*pubmed               UniProtParser.pm + RefSeqGPFFParser.pm
-*medline              UniProtParser.pm + RefSeqGPFFParser.pm
-*mim2                 MIMParser.pm
-
-
-* are new data sources.
-
-Species specific data:-
-
-
-source name              species      parser
------------------------------------------------------
-HUGO                     Human        HUGOParser.pm
-MarkerSymbol (MGD/MGI)   Mouse        MGDParser.pm
-RGD                      Rat          RGDParser.pm
-ZFIN                     Zebrafish    ZFINParser.pm
-
-
-
-General Tutorial
-
-The perl script to create and populate the database is xref_parser.pl
-
-  xref_parser --help produces:-
-
-xref_parser.pl -user {user} -pass {password} -host {host} -port {port} 
-	-dbname {database} -species {species1,species2} 
-	-source {source1,source2} -skipdownload -create
-
-If no source is specified then all sources are loaded. The same applies to
-species, so it is best to specify a species or the script may take a while.
-
-
-So to load/parse all the xrefs for the human the command would be:-
-
-  xref_parser.pl -host host1 -port 3350 -user admin -pass password 
-                 -dbname xref_store -species human -create
-
-
-The following is the typical output for a rat run :-)
-
-perl xref_parser.pl -host host1 -port 3350 -user admin -pass pass -dbname xref_test -species rat -create
-
-Removed existing database xref_test
-Creating xref_test from sql/table.sql
-Populating metadata in xref_test from sql/populate_metadata.sql
-Species rat is valid (name = rattus_norvegicus, ID = 10116)
-Downloading ftp://ftp.ebi.ac.uk/pub/databases/SPproteomes/swissprot_files/proteomes/10116.SPC to ./UniprotSWISSPROT/10116.SPC
-Checksum for 10116.SPC does not match, parsing
-Parsing 10116.SPC with UniProtParser
-Taxonomy ID 10116 corresponds to species ID 10116 name rattus_norvegicus
-SwissProt source id for ./UniprotSWISSPROT/10116.SPC: 1
-SpTREMBL source id for ./UniprotSWISSPROT/10116.SPC: 2
-Read 3970 SwissProt xrefs and 4432 SPTrEMBL xrefs from ./UniprotSWISSPROT/10116.SPC
-Uploading xrefs
-Downloading ftp://ftp.ncbi.nih.gov/genomes/R_norvegicus/protein/protein.gbk.gz to ./RefSeq/protein.gbk.gz
-Uncompressing ./RefSeq/protein.gbk.gz
-Checksum for protein.gbk does not match, parsing
-Parsing protein.gbk with RefSeqGPFFParser
-Read 21178 xrefs from ./RefSeq/protein.gbk
-Uploading xrefs
-Downloading ftp://ftp.ncbi.nih.gov/genomes/R_norvegicus/RNA/rna.gbk.gz to ./RefSeq/rna.gbk.gz
-Uncompressing ./RefSeq/rna.gbk.gz
-Checksum for rna.gbk does not match, parsing
-Parsing rna.gbk with RefSeqGPFFParser
-Read 21178 xrefs from ./RefSeq/rna.gbk
-Uploading xrefs
-Downloading ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/RAT/gene_association.goa_rat.gz to ./GO/gene_association.goa_rat.gz
-Uncompressing ./GO/gene_association.goa_rat.gz
-Checksum for gene_association.goa_rat does not match, parsing
-Parsing gene_association.goa_rat with GOParser
-        44991 GO dependent xrefs added
-Downloading ftp://rgd.mcw.edu/pub/data_release/genbank_to_gene_ids.txt to ./RGD/genbank_to_gene_ids.txt
-Checksum for genbank_to_gene_ids.txt does not match, parsing
-Parsing genbank_to_gene_ids.txt with RGDParser
-        11863 xrefs succesfully loaded
-        2461 xrefs ignored
-
-
-
-NOTE: Some Uniprot/Refseq entries may no longer be valid (and so are not loaded),
-so some xrefs are ignored. Do not panic if you see "xrefs ignored" messages
-with non-zero values.
-
-
-
-So we now have a set of xrefs that are dependent on the Uniprot and Refseq
-entries loaded. These can then be mapped to the ENSEMBL entities with the
-xref_mapper.pl script.
-
-
-To add new data to the xrefs you will have to edit sql/populate_metadata.sql
-and/or run the SQL by hand.
-
-To add a new source, insert a new entry into the source table, e.g.
-
-INSERT INTO source VALUES (2000, 'NEW', 1, 'Y', 4);
-
-Because some sources depend on others being loaded first, the last value is
-the order. Lower numbers are processed first.
-
-
-You will also have to specify the files to download and the parser to use,
-e.g.
-
-INSERT INTO source_url (source_id, species_id, url, checksum, file_modified_date,
-	upload_date, parser) VALUES (2000, 9606,'ftp://ftp.new.org/new.gz', '',
-	now(), now(), "NEWParser");
-
-You will then have to create XrefParser/NEWParser.pm.
-
-
-
-
-
-The parsers.
-
-
-UniProtParser.pm
-
-Each record is stored as an xref and a primary_xref.
-The accession is the main key and is taken from the AC line.
-Any DR line that matches a valid source name (in the database) is
-processed and stored as an xref and a dependent xref. Currently these
-are MIM, PDB and EMBL. In addition, protein_id is taken from the EMBL line.
-NOTE: InterPro is not loaded here because the name on the DR line, InterPro
-(note the capital P), does not match the source name Interpro. It is loaded
-separately via the Interpro parser.
-Medline and pubmed entries are now also added by parsing RX lines that
-contain MEDLINE or PUBMED xrefs.
-Species specific files are parsed here so all records are stored.
-
-
-
-RefSeqGPFFParser.pm	
-
-Each record is stored as an xref and a primary_xref.
-The accession is the main key and is taken from the AC line.
-LocusLink, OMIM, pubmed and medline are stored as xrefs and dependent xrefs.
-Species specific files are parsed here so all records are stored.
-
-
-GOParser.pm
-
-Entries are only added if the Uniprot or Refseq entry has already been loaded.
-ENSEMBL entries are also ignored as these will only map onto themselves.
-Most GO entries will already exist in the xref table (from Uniprot parsing) 
-but most will not have the description, so this is added if there is none. 
-Dependent xrefs are created if they do not exist already.
-
-
-InterproParser.pm
-
-An xref is stored for each Interpro entry but NO dependent xrefs are stored.
-Instead a separate table (interpro) is populated with the interpro/pfam
-mappings. Uniprot/Refseq accessions are NOT checked to see if they are already
-in the database, so the data is not species specific, but each xref is stored
-with the species specified in the run.
-
-
-MIMParser.pm
-
-Uses the gene names to map the MIM numbers to the protein accessions via the
-HUGO numbers, so HUGO must already be parsed, as well as Uniprot and Refseq.
-These are stored as MIM2 at present and are expected to replace the disease
-database.
-
-
-HUGOParser.pm, MGDParser.pm, RGDParser.pm, ZFINParser.pm
-
-Uniprot and Refseq must already be parsed. Entries are added to the
-xref table and the dependent xrefs are linked to the proteins (if they have
-been loaded). So entries are added if the accession is valid for Uniprot or
-Refseq for that particular species.
-
-
-
-
-NOTE: RefSeqParser.pm also exists and can be used to parse the fasta type
-files for Refseq. At the moment the genbank style files are parsed for
-both protein and rna files, but the xrefs are on the whole just duplicated
-as the files contain basically the same xref data. A decision will have to be
-made as to the benefits/disadvantages of this. The alternative is to parse the
-rna as fasta (which I think is what the old system used to do, judging by the
-numbers of xrefs).
-
-
- 
-
-
-
-
-
-
-
-
-
-
-
- 
-
-
-
-
-
-
-
-
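
The UniProtParser.pm notes in the removed README (accession from the AC line, DR lines kept only when they match a known source name, InterPro skipped because of its capital P) can be illustrated with a rough sketch. This is Python rather than the Perl of the real parser, and it is not the actual implementation: the function name `parse_record`, the `VALID_SOURCES` set, and the sample record with its accessions are all made up for illustration.

```python
# Hypothetical sketch only; not the UniProtParser.pm implementation.
# Assumed source names as they might be stored in the xref database.
VALID_SOURCES = {"MIM", "PDB", "EMBL", "Interpro"}


def parse_record(lines, valid_sources=VALID_SOURCES):
    """Return (accession, [(source, identifier), ...]) for one flat-file record."""
    accession = None
    dependents = []
    for line in lines:
        # Flat-file lines start with a two-letter code followed by three spaces.
        code, _, rest = line.partition("   ")
        if code == "AC" and accession is None:
            # The first accession on the first AC line is used as the main key.
            accession = rest.split(";")[0].strip()
        elif code == "DR":
            fields = [f.strip() for f in rest.rstrip(".").split(";")]
            # Exact, case-sensitive match: "InterPro" does not match "Interpro".
            if len(fields) >= 2 and fields[0] in valid_sources:
                dependents.append((fields[0], fields[1]))
    return accession, dependents


# Illustrative record; the accessions and identifiers are invented.
record = [
    "AC   P00001; Q00002;",
    "DR   EMBL; X00001; CAA00001.1; -.",
    "DR   PDB; 1ABC; X-ray; 2.00 A; A=1-100.",
    "DR   InterPro; IPR000001; Example.",
]
accession, dependents = parse_record(record)
print(accession, dependents)
```

The case-sensitive set lookup is what drops the InterPro line here, which mirrors why the README says Interpro needs its own parser.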