Moved to docs directory

f8e86770 · Ian Longden · 2151b2cc
Commit f8e86770 authored 13 years ago by Ian Longden
--- a/misc-scripts/xref_mapping/parsing_information.txt
+++ b/misc-scripts/xref_mapping/parsing_information.txt
-UniProt/Swissprot - UniProt/Trembl (UNIversal PROTein resource)
---------------------------------------------------------------
-
-The files can come in two types:
-
-1)  Contains data for all species
-
-    ftp://ftp.ebi.ac.uk/pub/databases/uniprot/knowledgebase/uniprot_sprot.dat.gz
-
-    or
-
-    ftp://ftp.ebi.ac.uk/pub/databases/uniprot/knowledgebase/uniprot_trembl.dat.gz
-
-    This is the normal case.
-
-2)  Contains data for one species only
-
-    ftp://ftp.ebi.ac.uk/pub/databases/integr8/uniprot/proteomes/17.D_melanogaster.dat.gz
-
-
-These are primary Xrefs in that they contain sequence and hence can be
-mapped to the Ensembl entities via normal alignment methods (we use
-Exonerate).
-
-
-This is a list of dependent Xrefs that might be added:
-
-    EMBL
-    PDB
-    protein_id
-
-
-Note: For human, mouse and rat we also take the direct mappings from UniProt for the Swiss-Prot entries.
-Entries not mapped by UniProt are then processed in the normal way.
-
-Refseq_peptide
--------------
-
-The files come in two types: those for a specific species, e.g.
-
-    ftp://ftp.ncbi.nih.gov/genomes/Canis_familiaris/protein/protein.gbk.gz
-
-or a series of numbered files that are not species-specific, e.g.
-
-    ftp://ftp.ncbi.nih.gov/refseq/release/vertebrate_other/vertebrate_other3.protein.gpff.gz
-
-These files are parsed by the parser RefSeqGPFFParser.pm.
-
-These are primary Xrefs in that they contain sequence and hence can be
-mapped to the Ensembl entities via normal alignment methods (we use
-Exonerate).
-
-Below is a list of dependent Xrefs that might be added:
-
-    EntrezGene
-
-
-Refseq_dna
----------
-
-The files come in two types: those for a specific species, e.g.
-
-    ftp://ftp.ncbi.nih.gov/genomes/Gallus_gallus/RNA/rna.gbk.gz
-
-or a series of numbered files that are not species-specific, e.g.
-
-    ftp://ftp.ncbi.nih.gov/refseq/release/vertebrate_mammalian/vertebrate_mammalian46.rna.fna.gz
-
-These files are parsed by the parser RefSeqParser.pm.
-
-These are primary Xrefs in that they contain sequence and hence can be
-mapped to the Ensembl entities via normal alignment methods (we use
-Exonerate).
-
-
-
-IPI (International Protein Index)
---------------------------------
-
-Comes as a species-specific file, e.g.
-
-    ftp://ftp.ebi.ac.uk/pub/databases/IPI/current/ipi.HUMAN.fasta.gz
-
-Each entry in the file looks something like
-
->IPI:IPI00000005.1|SWISS-PROT:P01111|TREMBL:Q5U091|ENSEMBL:ENSP00000261444;ENSP00000358548|REFSEQ:NP_002515|VEGA:OTTHUMP00000013879 Tax_Id=9606 GTPase NRas precursor
-sequence..................
-
-Most of the header information is ignored; only the description and the
-IPI value are used.  The sequence is used to position the IPI Xref.
-
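As a rough illustration of the header handling described above, here is a minimal Python sketch that pulls the IPI value and the description out of a header line. It is not the Ensembl parser itself: the function name `parse_ipi_header` is hypothetical, and treating leading `key=value` tokens such as `Tax_Id=9606` as non-description text is an assumption based on the example header.

```python
def parse_ipi_header(header):
    """Return (ipi_value, description) from an IPI FASTA header line."""
    # Drop the leading '>' and split the identifier block from the free text.
    ids, _, rest = header.lstrip(">").strip().partition(" ")

    # Look for the 'IPI:...' field among the '|'-separated identifiers.
    ipi_value = None
    for field in ids.split("|"):
        db, _, value = field.partition(":")
        if db == "IPI":
            ipi_value = value
            break

    # Assumption: leading key=value tokens (e.g. 'Tax_Id=9606') are not
    # part of the description, so skip them.
    words = rest.split()
    while words and "=" in words[0]:
        words.pop(0)
    return ipi_value, " ".join(words)


header = (">IPI:IPI00000005.1|SWISS-PROT:P01111|TREMBL:Q5U091|"
          "ENSEMBL:ENSP00000261444;ENSP00000358548|REFSEQ:NP_002515|"
          "VEGA:OTTHUMP00000013879 Tax_Id=9606 GTPase NRas precursor")
ipi_value, description = parse_ipi_header(header)
print(ipi_value, "|", description)
```

All other identifiers in the header (SWISS-PROT, TREMBL, ENSEMBL and so on) are deliberately discarded, matching the statement that they are ignored.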
-These are primary Xrefs in that they contain sequence and hence can be
-mapped to the Ensembl entities via normal alignment methods (we use
-Exonerate).
-
-Has no dependent Xrefs.
-
-
-UniGene
-------
-
-Comes as species-specific files, e.g.
-
-    ftp://ftp.ncbi.nih.gov/repository/UniGene/Bos_taurus/Bt.seq.uniq.gz
-    ftp://ftp.ncbi.nih.gov/repository/UniGene/Bos_taurus/Bt.data.gz
-
-These are primary Xrefs in that they contain sequence and hence can be
-mapped to the Ensembl entities via normal alignment methods (we use
-Exonerate).  No longer loaded via UniProt.
-
-Has no dependent Xrefs.
-
-
-EMBL
----
-
-These are dependent Xrefs and are linked to Ensembl via the UniProt
-entries.
-
-
-PDB
---
-
-Protein Data Bank entries are dependent Xrefs and are linked to Ensembl
-via the UniProt entries.
-
-
-protein_id
----------
-
-These are dependent Xrefs and are linked to Ensembl via the UniProt
-entries.
-
-
-PUBMED + Medline
----------------
-
-These are no longer stored because there are so many of them.  If you
-want to add them, see the UniProtParser and RefSeqParser for more
-details.
-
-
-GO
--
-
-Can come as a species-specific file or as a file containing all species.
-
-    ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gene_association.goa_uniprot.gz
-    ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/HUMAN/gene_association.goa_human.gz
-
-GO information in the UniProt and RefSeq files is ignored and only the
-information from the above files is used.  The files have references to
-UniProt and RefSeq entries and so the GO entries are set to be dependent
-Xrefs on these.
-
-
-EntrezGene
----------
-
-Gene-centred information at NCBI is stored as a dependent Xref and is
-obtained from the RefSeq entries.
-
-
-InterPro
--------
-
-InterPro is a database of protein families, domains and functional sites
-and gets its data from the file
-
-    ftp://ftp.ebi.ac.uk/pub/databases/interpro/interpro.xml.gz
-
-NOTE:  InterPro has its own table, so the Xrefs are stored but are not
-linked to the Ensembl entities directly; instead, a list of InterPro
-entries and identifiers is stored.  The identifiers stored are of the type
-
-    PROSITE, PFAM, PRINTS, PREFILE, PROFILE, TIGRFAMs
-
-
-
-ncRNA, RFAM, miRNA_Registry
---------------------------
-
-This is a local file and is not downloaded automatically via FTP, so you
-must copy it into place before running the parser.
-
-    file:ncRNA/ncRNA.txt
-
-These are direct Xrefs so the file contains data on what the Xref is and
-which Ensembl entity it matches.
-
-
-SPECIES SPECIFIC ENTRIES
------------------------
------------------------
-
-
-Human
-----
-
-
-MIM - Online Mendelian Inheritance in Man
-----------------------------------------
-
-Descriptions and types are obtained from the file
-
- ftp://grcf.jhmi.edu/OMIM/omim.txt.Z
-
-This creates two sets of Xrefs:
-
-1) MIM_GENE   (disease genes and other expressed genes)
-2) MIM_MORBID (the disease genes)
-
-Note those in set 2 will also be in set 1.
-
-These MIM Xrefs are linked to UniProt/SwissProt entries using the
-UniProtParser.pm creating dependent Xrefs.  Note if the Swissprot entry
-does not specify whether the MIM entry is a phenotype or a gene then it
-is ignored.  For this same reason MIM dependent Xrefs are NOT obtained
-from the RefSeq entries.
-
-So when the Swissprot entries are matched to Ensembl the MIM entries
-will also be matched.
-
-
-HGNC
----
-
-The HGNC (HUGO Gene Nomenclature Committee) Xrefs are obtained from various sources:
-
-
-1) HGNC (ensembl_mapped)
-HGNC has direct mappings to Ensembl which have been manually curated.
-This information is obtained from the script http://www.genenames.org/cgi-bin/hgnc_downloads.cgi
-
-2) CCDS 
-The HGNCs are connected to the same Ensembl object that the CCDSs are linked
-to. We connect to the CCDS database to get this information.
-
-3) Vega
-This is made from the Havana manually curated database.
-
-4) HGNC
-HGNC has links to other databases such as UniProt and RefSeq, and these can be used to link to Ensembl.
-
-
-
-
-Which of these is chosen at the mapping stage is based on the priorities of
-the sources, which are listed above in priority order.
-This is known as a priority Xref, as the mapping with the best priority is
-chosen.
-
-
-
-CCDS
----
-
-The CCDS database identifies a core set of human protein coding regions
-that are consistently annotated by multiple public resources and pass
-quality tests.
-
-A local file is used here:
-
-    file:CCDS/CCDS.txt
-
-The file contains a list of CCDS identifiers and the Ensembl entities
-they match to.  So direct Xrefs are created for these.
-
-
-Mouse
-----
-
-MGI
------------
-
-Previously known as 'MarkerSymbol'.
-
-    ftp://ftp.informatics.jax.org/pub/reports/MRK_SwissProt_TrEMBL.rpt
-    ftp://ftp.informatics.jax.org/pub/reports/MRK_Synonym.sql.rpt
-
-This is a mouse-specific Xref, taken from the Mouse Genome Informatics
-data.  The files have references to UniProt entries and so the MGI
-entries are set to be dependent Xrefs on these.
-
-
-Rat
---
-
-RGD
--
-
-Rat Genome Database entries are populated using the file
-
-    ftp://rgd.mcw.edu/pub/data_release/GENES
-
-The RGD Xrefs are dependent Xrefs on the RefSeq entries.
-
-
-Zebrafish
----------
-
-ZFIN_ID
-------
-
-The two files
-
-    http://zfin.org/data_transfer/Downloads/refseq.txt
-    http://zfin.org/data_transfer/Downloads/swissprot.txt
-
-contain lists of ZFIN identifiers and RefSeq or Swissprot identifiers,
-depending on the file.
-
-This creates a set of dependent Xrefs on RefSeq and UniProt entries.
-
-
-C. elegans
-----------
-
-wormpep_id, wormbase_locus, wormbase_gene, wormbase_transcript
--------------------------------------------------------------
-
-Uses the file
-
-    ftp://ftp.sanger.ac.uk/pub/databases/wormpep/wormpep180/wormpep.table180
-
-and the database (the latest release should do)
-
-    mysql:ensembldb.ensembl.org:3306:caenorhabditis_elegans_core_46_170b:anonymous
-
-This creates direct Xrefs for all these.
--- a/misc-scripts/xref_mapping/xrefs_detailed_docs.txt
+++ b/misc-scripts/xref_mapping/xrefs_detailed_docs.txt
--- a/misc-scripts/xref_mapping/xrefs_overview.txt
+++ b/misc-scripts/xref_mapping/xrefs_overview.txt
-The Xref System
-========================================================================
-
-The external database references (Xrefs) are added to the Ensembl
-databases using the code found in this directory.  The process consists
-of two parts.  The first part parses the data into a temporary database
-(the Xref database).  The second part maps the new Xrefs to the Ensembl
-database.
-
-
-Parsing the external database references
------------------------------------------------------------------------
-
-In this directory you will find an ini-file called 'xref_config.ini'.
-This file contains two types of configuration sections: source sections
-and species sections.  A source section defines Xref priority, order
-etc. (as key-value pairs, see the comment at the top of the source
-sections for a fuller explanation of these keys) for the source and
-also the URIs pointing to the data files that the source should use.
-The source label is only used to refer to the source within the
-ini-file (from a species section), so it can be any text string whose
-meaning is easy to understand.
-
-A species section contains information about species aliases, the
-numerical taxonomy ID(s) and what sources to use for that species.  If
-a species has more than one taxonomy ID (in the case where there are
-multiple strains or subspecies, for example), there can be more than one
-'taxonomy_id' key.  The name of the species is defined by the source
-label and will be stored in the Xref database.
-
-For now, the script 'xref_config2sql.pl' (also found in this directory)
-should be used to convert the ini-file into an SQL file, which should
-then replace the file 'sql/populate_metadata.sql'.  The
-'xref_config2sql.pl' script expects to find 'xref_config.ini' in the
-current directory, but you may specify an alternative file as the first
-command line argument to the script if you have moved or renamed the
-ini-file.  When 'xref_parser.pl' is run it will load the generated SQL
-file into the database and will then download and parse all external
-data files for one or several specified species.
-
-If you want to add a new source you will have to add a new source
-section, following the pattern used by the other source sections.  You
-will then have to add it to the species that require the data.
-
-If the new data comes in files not previously handled by the Xref
-system, you will also have to write a parser, e.g. NewSourceParser.pm
-(the parser name may be chosen arbitrarily), in the XrefParser directory.
-You can find lots of examples of parsers in that directory.
-
-Before running the Xref parser, make sure that the environment
-variable 'http_proxy' is set to point to the local HTTP proxy to get
-outside the firewall.  For Sanger, the value of the variable should be
-"http://cache.internal.sanger.ac.uk:3128", i.e. for tcsh shells you
-should have
-
-    setenv http_proxy http://cache.internal.sanger.ac.uk:3128
-
-in your ~/.tcshrc file, while for bash-like shells you should have
-
-    export http_proxy=http://cache.internal.sanger.ac.uk:3128
-
-in your ~/.profile or ~/.bashrc file.
-
-When you run the script 'xref_parser.pl' to parse the Xrefs you can pass
-it several options, but for most runs all you need to specify is the
-user (user name on the database), pass (password), host (database host),
-dbname, and species, e.g.
-
-    perl xref_parser.pl -host mymachine -user admin -pass XXXX \
-        -dbname new_human_xref -species human
-
-Please keep the output from this script and check it later.  At the end
-of the output there will be a summary of what was successful and what
-failed to run.  This is important.
-
-The parsing can create three types of Xrefs:
-
-1) Primary   (These have sequence and are mapped via exonerate)
-2) Dependent (Have no sequence but are dependent on the Primary ones)
-3) Direct    (These are directly linked to the Ensembl entities, so the
-             mapping is already done)
-
-Some sources have more than one set of files associated with them.  In
-these cases the sets have the same source name but different source IDs.
-These are known as "priority Xrefs" as the Xrefs are mapped according to
-the priority of the source.  An example of this is HGNC (HUGO).
-
-For more information on what data can be parsed see the
-'parsing_information.txt' file.
-
-
-Mapping the external database references to the Ensembl core database
------------------------------------------------------------------------
-
-This is an overview of what goes on in the script 'xref_mapper.pl'.
-
-Primary Xrefs are dumped out to two Fasta files, one for peptides and
-the other for DNA.  Ensembl Transcripts and Translations are then dumped
-out to two files in Fasta format.
-
-Exonerate is then used to find the best matches for the Xrefs.
-If there is more than one best match then the Xref is mapped to
-more than one Ensembl entity.  A cutoff is used to filter the best
-matches to make sure they pass certain criteria.  By default this
-is that the query identity OR the target identity must be over
-90%.  This can be changed by creating your own '<method>.pm' file
-in the directory 'XrefMapper/Methods' and creating subroutines
-'query_identity_threshold()' and 'target_identity_threshold()' which
-return the new values.
-
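The default cutoff described above can be sketched in a few lines of Python. This is only an illustration of the "query identity OR target identity over 90%" rule, not the XrefMapper code: the function name `passes_cutoff` and the hit names are hypothetical, and reading "over" as a strict comparison is an assumption.

```python
def passes_cutoff(query_identity, target_identity,
                  query_threshold=90.0, target_threshold=90.0):
    """Return True if either identity (in percent) is over its threshold."""
    # Assumption: 'over 90%' is read as a strict comparison.
    return (query_identity > query_threshold
            or target_identity > target_threshold)


# Hypothetical best hits: (Ensembl object, query identity, target identity).
hits = [
    ("ENST_A", 95.0, 40.0),  # passes on query identity alone
    ("ENST_B", 80.0, 85.0),  # fails both thresholds
    ("ENST_C", 60.0, 99.0),  # passes on target identity alone
]
kept = [name for name, query, target in hits if passes_cutoff(query, target)]
print(kept)
```

Because the two conditions are joined with OR, a short Xref that aligns fully to part of a long transcript can still pass on its query identity.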
-Exonerate generates a set of .map files containing the mappings.  The
-map-files are then parsed and any that pass the criteria are stored in
-the 'xref' table, 'object_xref' table and the 'identity_xref' table.
-All dependent Xrefs are also stored if the parent is mapped.
-
-Direct Xrefs are also stored at this stage but no mapping is needed here
-as we already know what each Xref maps to.
-
-For priority Xrefs (ones that have multiple sources) only the
-highest-priority one is stored.
-
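The priority rule can be illustrated with a small Python sketch. It assumes that a lower priority number means a better priority and that candidates arrive as simple tuples; both are assumptions for illustration, and the accessions, source names and Ensembl identifiers below are made up.

```python
def best_priority_mappings(candidates):
    """Keep, for each accession, only the mapping with the best priority.

    candidates: iterable of (accession, source, priority, ensembl_id).
    Assumption: a lower priority number is a better priority.
    """
    best = {}
    for accession, source, priority, ensembl_id in candidates:
        current = best.get(accession)
        if current is None or priority < current[1]:
            best[accession] = (source, priority, ensembl_id)
    return best


# Hypothetical candidates for two accessions from several sources.
candidates = [
    ("HGNC_X", "hgnc_uniprot", 4, "ENSG_B"),
    ("HGNC_X", "ensembl_mapped", 1, "ENSG_A"),
    ("HGNC_Y", "ccds", 2, "ENSG_C"),
]
for accession, chosen in sorted(best_priority_mappings(candidates).items()):
    print(accession, chosen)
```

Here the `ensembl_mapped` candidate wins for `HGNC_X`, and the lower-priority mapping is simply dropped rather than stored alongside it.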
-Any Xrefs which fail to be mapped are written to the unmapped_object
-table with a brief explanation of why they could not be mapped.
-
-Once all the mappings have been stored the display Xrefs and the
-descriptions are generated for the transcripts and genes.
-
-If you want to change any of the default settings you can create a new
-'<species>.pm' for your particular species, or '<taxon>.pm', and override
-the defaults in 'BasicMapper.pm' (see 'rattus_norvegicus.pm' as an example).
-
-The 'xref_mapper.pl' script needs a configuration file which has
-information on the Xref database and the core database and also the
-species name.  Below is an example of running the mapping.
-
-    perl ~/ensembl-live/ensembl/misc-scripts/xref_mapping/xref_mapper.pl \
-        -file xref_input -upload >&MAPPER.OUT
-
-
-Here is an example of a configuration file for 'xref_mapper.pl':
------------------------------------------------------------------------
-xref
-host=ensembl-machine
-port=3306
-dbname=human_xref_42
-user=admin
-password=xxxx
-dir=./xref
-
-species=homo_sapiens
-taxon=mammalia (this is optional - use taxon if you need more than one species to use the same '<taxon>.pm' module) 
-host=ensembl-machine
-port=3306
-dbname=homo_sapiens_core_42_36d
-user=admin
-password=xxxx
-dir=./ensembl
-
-farm
-queue=long
-exonerate=/software/ensembl/bin/exonerate-1.4.0
------------------------------------------------------------------------
-
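To make the layout of the example configuration file easier to follow, here is a minimal Python sketch that reads it into one dictionary per block. It is not how 'xref_mapper.pl' reads the file. The grammar is inferred from the example alone: a bare word such as 'xref' or 'farm' opens a block, 'species=...' opens the core database block, and every other line is key=value. The function name `parse_mapper_config` is hypothetical.

```python
def parse_mapper_config(text):
    """Group the key=value lines of a mapper config file by block."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        key, sep, value = line.partition("=")
        if not sep:
            # A bare word ('xref', 'farm') starts a new block.
            current = sections.setdefault(line, {})
        elif key == "species":
            # Assumption: 'species=...' starts the core database block.
            current = sections.setdefault("species", {})
            current["species"] = value
        elif current is not None:
            current[key] = value
    return sections


example = """\
xref
host=ensembl-machine
port=3306
dbname=human_xref_42
dir=./xref

species=homo_sapiens
host=ensembl-machine
dbname=homo_sapiens_core_42_36d
dir=./ensembl

farm
queue=long
"""
config = parse_mapper_config(example)
print(config["xref"]["dbname"], config["species"]["dbname"])
```

Keeping the blocks separate matters because the same keys (host, port, dbname, dir) appear once for the Xref database and once for the core database.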
-Note: it is good practice to use a dedicated sub-directory for the
-Ensembl directory, as many files are generated and it is best to keep
-them together and away from everything else; otherwise it will be hard
-to find things.  The directory can also be tarred and zipped in case you
-need to check things later.