Questions
---------
1) What software do I need to run the external database cross-reference
mapping?
2) What is the recommended way to run the external database cross-references
for an already entered species?
3) How do I add a new species?
4) How do I add a new external database source?
5) How do I track my progress?
6) I have mapping errors; how do I fix them?
7) How do I start again from the parsing_finished stage?
8) How do I start again from the mapping_finished stage?
9) What is fullmode and partupdate?
10) How do I run my external database references without a compute farm?
11) How do I use a different list of external database sources for my
display_xrefs (names)?
12) How do I use a different list of external database sources for my gene
descriptions?
Answers
-------
1) What software do I need to run the external database cross-reference mapping?
You will need a copy of exonerate and the Ensembl API code.
Exonerate installation instructions can be found at
http://www.ebi.ac.uk/~guy/exonerate/
To install the Ensembl API see
http://www.ensembl.org/info/docs/api/api_installation.html
2) What is the recommended way to run the xrefs for an already entered species?
The xref system comes in two parts: first, parsing the external database sources
into a temporary xref database, and then mapping these to the core database.
a) To parse the data into the xref database you should use the script
xref_parser.pl, which can be found in the ensembl/misc-scripts/xref_mapping
directory.
xref_parser.pl -user rwuser -pass XXX -host host1 -species human
-dbname human_xref -stats -create >& PARSER.OUT
Check the file PARSER.OUT to make sure everything is okay. It could be that
the script was unable to connect to an external site and so did not load
everything.
If there was a problem with the connections, try again but this time use the
option -checkdownload: this will not download data you already have but
will try to get the data you are missing, saving time.
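For example, a rerun after a failed download might look like this (reusing
the host and database names from the example above; note that -create is
dropped so the existing database is not recreated):
xref_parser.pl -user rwuser -pass XXX -host host1 -species human
-dbname human_xref -stats -checkdownload >& PARSER2.OUT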
The xref_parser.pl script may wait for you to answer a couple of questions
about overwriting the database or redoing the configuration, so you will also
have to keep an eye on the output file; keeping this file is usually worth
doing to have a record of what the parser did.
At the end of the parsing you should get a summary which should look
something like:-
============================================================================
Summary of status
============================================================================
EntrezGene EntrezGeneParser OKAY
GO GOParser OKAY
GO InterproGoParser OKAY
Interpro InterproParser OKAY
RefSeq_dna RefSeqParser OKAY
RefSeq_peptide RefSeqGPFFParser OKAY
UniGene UniGeneParser OKAY
Uniprot/SPTREMBL UniProtParser OKAY
Uniprot/SWISSPROT UniProtParser OKAY
ncRNA ncRNA_DBParser OKAY
If any of these are not OKAY then there has been a problem, so look further
up in the file to find out why it failed.
b) Map the external database entries to the core database.
First you need to create a configuration file.
Below is an example of a configuration file
####################################################
xref
host=host1
port=3306
dbname=macaca_xref
user=user1
password=pass1
dir=./xref_dir
species=macaca_mulatta
host=host2
port=3306
dbname=macaca_core
user=user2
password=pass2
dir=./ensembl_dir
farm
queue=long
exonerate=/software/ensembl/bin/exonerate-1.4.0
####################################################
Note that the directories specified must exist when the mapping is done.
The farm options are optional and can be left out, but may be needed
if you have different queue names or have exonerate installed somewhere
other than the default place.
Now we can do the mapping.
Ideally this should be done in two steps so that after the first step you
can check the output to make sure you are happy with everything before
loading into the core database.
i) Map the entities in the xref database and do some checks etc.
xref_mapper.pl -file xref_config >& MAPPER1.OUT
If you have no compute farm then add the -nofarm option.
Check the output file. Do not worry about warnings that the number of
xrefs has increased; the main thing to be concerned about is a reduction
in numbers, i.e. xrefs that are in the core database but are not in the
xref database.
If you get errors about the mapping files then a couple of things could
have gone wrong; the first and usual culprit is that the system ran out of
disk space, or else the compute farm job got lost.
In this case you have two options
1) reset the database to the parsing stage and rerun all the mappings
To reset the database use the option -reset_to_parsing_finished
xref_mapper.pl -file xref_config -reset_to_parsing_finished
then redo the mapping
xref_mapper.pl -file xref_config -dumpcheck >& MAPPER.OUT
Note here we use -dumpcheck to make sure the program does not dump the
fasta files if they are already there, as this process can take
a long time and the fasta files will not have changed.
2) just redo those jobs that failed.
Run the mapper with the -resubmit_failed_jobs flag
xref_mapper.pl -file xref_config -resubmit_failed_jobs
Option 2 will be much faster as it will only redo the jobs that failed.
ii) Load the data into the core database and calculate the display_xrefs etc
xref_mapper.pl -file xref_config -upload >& MAPPER2.OUT
3) How do I add a new species?
Edit the file xref_config.ini and add a new entry in the species section.
Here is an example:-
[species macaca_mulatta]
taxonomy_id = 9544
aliases = macaque, rhesus, rhesus macaque, rmacaque
source = EntrezGene::MULTI
source = GO::MULTI
source = InterproGO::MULTI
source = Interpro::MULTI
source = RefSeq_dna::MULTI-vertebrate_mammalian
source = RefSeq_peptide::MULTI-vertebrate_mammalian
source = Uniprot/SPTREMBL::MULTI
source = Uniprot/SWISSPROT::MULTI
source = UniGene::macaca_mulatta
source = ncRNA::MULTI
[species xxxx] and taxonomy_id must be present.
It is usually best just to cut and paste an already existing, similar
species entry and start from that.
4) How do I add a new external database source?
Edit the file xref_config.ini and add a new entry in the sources section.
Here is an example:-
[source Fantom::mus_musculus]
# Used by mus_musculus
name = Fantom
download = Y
order = 100
priority = 1
prio_descr =
parser = FantomParser
release_uri =
data_uri = ftp://fantom.gsc.riken.jp/DDBJ_fantom3_HTC_accession.txt.gz
name: The name you want to call the external database.
You must also add this to the core databases.
download: Y if the data needs to be obtained online (i.e. not a local file),
N if you are getting the data from a file.
order: The order in which the source should be parsed, 1 being the first.
priority: This is for sources where we get the data from multiple places,
e.g. HGNC. For most sources just set this to 1.
prio_descr: Only used for priority sources; sets a description to give
a way to differentiate them and track which is which.
parser: Which parser to use. If this is a new source then you will probably
need a new parser. Find a parser that is similar and start from this.
Parsers must be in the ensembl/misc-scripts/xref_mapping/XrefParser
directory.
release_uri: a uri to get the release information from. The parser should
handle this.
data_uri: Explains how and where to get the data from. There can be multiple
lines of this.
The uri can get data via several methods; here is the list with a brief
explanation of each.
ftp: Get the file via FTP.
script: Passes arguments to the parser. These might be things like a
database to connect to, or some SQL to run to get the data.
file: The name, with full path, of the file to be parsed.
http: Get data via an external webpage/cgi script.
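For example, a hypothetical source parsed from a local file rather than
downloaded (the name, parser and path below are purely illustrative) might
be configured as:
[source MyLocalSource::homo_sapiens]
name = MyLocalSource
download = N
order = 100
priority = 1
prio_descr =
parser = MyLocalSourceParser
data_uri = file:MyLocalSource/my_local_source.txt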
5) How do I track my progress?
If you did not use -noverbose then the output file should give you a general
idea of what stage you are at. By directly examining the xref database you
can see the last stage that was completed by viewing the entries in the
process_status table.
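For example, using the connection details from the configuration file above,
the last completed stage can be checked with the standard mysql client:
mysql -h host1 -P 3306 -u user1 -ppass1 macaca_xref -e 'SELECT * FROM process_status'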
Another option is to use the script xref_tracker.pl, which will give you some
information about the status. The script is run similarly to the xref_mapper.pl
code in that it needs a config file.
xref_tracker.pl -file xref_config
This script gives more information when the xref_mapper is running the
mapping jobs or processing the mapping files as it will tell you how many
have finished and how many are left to run etc. These are the longer stages
of the process.
6) I have mapping errors; how do I fix them?
If for some reason a mapping job failed (this tends to be things like running
out of disk space, the compute farm losing a job, etc.) then you have a couple
of options.
i) reset the database to the parsing stage and rerun all the mappings
To reset the database use the option -reset_to_parsing_finished
xref_mapper.pl -file xref_config -reset_to_parsing_finished
then redo the mapping
xref_mapper.pl -file xref_config -dumpcheck
Note here we use -dumpcheck to make sure the program does not dump the fasta
files if they are already there, as this process can take a long time and the
fasta files will not have changed.
ii) just redo those jobs that failed.
Run the mapper with the -resubmit_failed_jobs flag
xref_mapper.pl -file xref_config -resubmit_failed_jobs
7) How do I start again from the parsing_finished stage?
To reset the database use the option -reset_to_parsing_finished
xref_mapper.pl -file xref_config -reset_to_parsing_finished
8) How do I start again from the mapping_finished stage?
To reset the database use the option -reset_to_mapping_finished
xref_mapper.pl -file xref_config -reset_to_mapping_finished
Remember to use -dumpcheck when you run xref_mapper.pl the next
time to save time.
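For example, the next run after the reset might look like this (whether you
include -upload depends on whether you are ready to load into the core
database; the output file name is just illustrative):
xref_mapper.pl -file xref_config -dumpcheck -upload >& MAPPER3.OUT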
9) What is fullmode and partupdate?
Fullmode means that all the xrefs are being updated, not just a few specific
external database sources. This is important as it affects the way the
display_xrefs and descriptions are calculated at the end. The user can
override this by setting the -partupdate option in the mapper options, or by
changing the entry in the meta table (the key is "fullmode").
If we are doing all the xref sources then we know that all the data is local,
and hence we can use some simple SQL on the xref database to get the
display_xrefs etc. But if this is not the case then the core database will
contain extra information that is not in the xref database and that may be
needed, so we have to query the core database instead, going through each
gene and then each transcript etc. using the API, which is a lot slower.
In summary, only alter the mode here if you know what you are doing and what
the consequences are.
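For example, a run that updates only the sources you have re-parsed, rather
than recalculating everything, might be invoked as (the output file name is
just illustrative):
xref_mapper.pl -file xref_config -upload -partupdate >& MAPPER_PART.OUT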
10) How do i run my external database references without a compute farm?
Simply use the -nofarm option with the xref_mapper.pl script.
This will run the exonerate jobs locally.
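For example:
xref_mapper.pl -file xref_config -nofarm >& MAPPER1.OUT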
11) How do I use a different list of external database sources for my
display_xrefs (names)?
The external databases to be used for the display_xrefs are taken either from
the BasicMapper.pm subroutine transcript_display_xref_sources, i.e.
sub transcript_display_xref_sources {
  my @list = qw(miRBase
                RFAM
                HGNC_curated_gene
                HGNC_automatic_gene
                MGI_curated_gene
                MGI_automatic_gene
                Clone_based_vega_gene
                Clone_based_ensembl_gene
                HGNC_curated_transcript
                HGNC_automatic_transcript
                MGI_curated_transcript
                MGI_automatic_transcript
                Clone_based_vega_transcript
                Clone_based_ensembl_transcript
                IMGT/GENE_DB
                HGNC
                SGD
                MGI
                flybase_symbol
                Anopheles_symbol
                Genoscope_annotated_gene
                Uniprot/SWISSPROT
                Uniprot/Varsplic
                RefSeq_peptide
                RefSeq_dna
                Uniprot/SPTREMBL
                EntrezGene
                IPI);

  my %ignore;
  $ignore{"EntrezGene"} = 'FROM:RefSeq_[pd][en][pa].*_predicted';

  return [\@list, \%ignore];
}
or, if you want to create your own list, you need to create a <species>.pm
file and override the subroutine there. An example here is for
drosophila_melanogaster.
So in the file drosophila_melanogaster.pm
(found in the directory ensembl/misc-scripts/xref_mapping/XrefMapper)
we have:-
sub transcript_display_xref_sources {
  my @list = qw(FlyBaseName_transcript
                FlyBaseCGID_transcript
                flybase_annotation_id);

  my %ignore;
  $ignore{"EntrezGene"} = 'FROM:RefSeq_[pd][en][pa].*_predicted';

  return [\@list, \%ignore];
}
12) How do I use a different list of external database sources for my gene
descriptions?
As above, but this time override the subroutine gene_description_sources.
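As a rough sketch, assuming your species module follows the same pattern as
the existing ones in the XrefMapper directory (the package name and source
names below are purely illustrative):
package XrefMapper::my_species;

use strict;
use base qw( XrefMapper::BasicMapper );

# Return the external database sources, in order of preference, to be
# used when choosing gene descriptions.
sub gene_description_sources {
  return ("MySourceA", "MySourceB");
}

1;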
UniProt/Swissprot - UniProt/Trembl (UNIversal PROTein resource)
---------------------------------------------------------------
The files can come in two types:
1) Contains data for all species
ftp://ftp.ebi.ac.uk/pub/databases/uniprot/knowledgebase/uniprot_sprot.dat.gz
or
ftp://ftp.ebi.ac.uk/pub/databases/uniprot/knowledgebase/uniprot_trembl.dat.gz
This is the normal case.
2) Contains data for one species only
ftp://ftp.ebi.ac.uk/pub/databases/integr8/uniprot/proteomes/17.D_melanogaster.dat.gz
These are primary Xrefs in that they contain sequence and hence can be
mapped to the Ensembl entities via normal alignment methods (we use
Exonerate).
This is a list of dependent Xrefs that might be added:
EMBL
PDB
protein_id
Note: For human, mouse and rat we also take the direct mappings from UniProt
for the SWISSPROT entries. Those not mapped by UniProt are then processed in
the normal way.
RefSeq_peptide
--------------
The files come in two types: those for specific species, e.g.
ftp://ftp.ncbi.nih.gov/genomes/Canis_familiaris/protein/protein.gbk.gz
or a series of numbered files that are not species specific, e.g.
ftp://ftp.ncbi.nih.gov/refseq/release/vertebrate_other/vertebrate_other3.protein.gpff.gz
These files are parsed by the parser RefSeqGPFFParser.pm
These are primary Xrefs in that they contain sequence and hence can be
mapped to the Ensembl entities via normal alignment methods (we use
Exonerate).
Below is a list of dependent Xrefs that might be added:
EntrezGene
RefSeq_dna
----------
The files come in two types: those for specific species, e.g.
ftp://ftp.ncbi.nih.gov/genomes/Gallus_gallus/RNA/rna.gbk.gz
or a series of numbered files that are not species specific, e.g.
ftp://ftp.ncbi.nih.gov/refseq/release/vertebrate_mammalian/vertebrate_mammalian46.rna.fna.gz
These files are parsed by the parser RefSeqParser.pm
These are primary Xrefs in that they contain sequence and hence can be
mapped to the Ensembl entities via normal alignment methods (we use
Exonerate).
IPI (International Protein Index)
---------------------------------
Comes as a species-specific file, e.g.
ftp://ftp.ebi.ac.uk/pub/databases/IPI/current/ipi.HUMAN.fasta.gz
The file headers look something like
>IPI:IPI00000005.1|SWISS-PROT:P01111|TREMBL:Q5U091|ENSEMBL:ENSP00000261444;ENSP00000358548|REFSEQ:NP_002515|VEGA:OTTHUMP00000013879 Tax_Id=9606 GTPase NRas precursor
sequence..................
Most of the header information is ignored, except for the description
and the IPI accession. The sequence is used to position the IPI Xref.
These are primary Xrefs in that they contain sequence and hence can be
mapped to the Ensembl entities via normal alignment methods (we use
Exonerate).
Has no dependent Xrefs.
UniGene
-------
Comes as species-specific files, e.g.
ftp://ftp.ncbi.nih.gov/repository/UniGene/Bos_taurus/Bt.seq.uniq.gz
ftp://ftp.ncbi.nih.gov/repository/UniGene/Bos_taurus/Bt.data.gz
These are primary Xrefs in that they contain sequence and hence can be
mapped to the Ensembl entities via normal alignment methods (we use
Exonerate). No longer loaded via UniProt.
Has no dependent Xrefs.
EMBL
----
These are dependent Xrefs and are linked to Ensembl via the UniProt
entries.
PDB
---
Protein Data Bank entries are dependent Xrefs and are linked to Ensembl
via the UniProt entries.
protein_id
----------
These are dependent Xrefs and are linked to Ensembl via the UniProt
entries.
PUBMED + Medline
----------------
These are no longer stored due to their large numbers. If you
want to add them then see the UniProtParser and RefSeqParser for more
details.
GO
--
Can come in a species specific file or can contain all species.
ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gene_association.goa_uniprot.gz
ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/HUMAN/gene_association.goa_human.gz
GO information in the UniProt and RefSeq files is ignored and just the
information from the above files is used. The files have references to
UniProt and RefSeq entries, and so the GO entries are set to be dependent
Xrefs on these.
EntrezGene
----------
Gene-centred information at NCBI. These are stored as dependent Xrefs
obtained from the RefSeq entries.
InterPro
--------
InterPro is a database of protein families, domains and functional sites,
and gets its data from the file
ftp://ftp.ebi.ac.uk/pub/databases/interpro/interpro.xml.gz
NOTE: InterPro has its own table, so the Xrefs are stored but are not
linked to the Ensembl entities directly; instead a list of InterPro
accessions and their member identifiers is stored. The identifiers stored
are of the types PROSITE, PFAM, PRINTS, PREFILE, PROFILE and TIGRFAMs.
ncRNA, RFAM, miRNA_Registry
---------------------------
This is a local file and is not downloaded automatically via FTP, so you
must put this file in place before running the parser.
file:ncRNA/ncRNA.txt
These are direct Xrefs, so the file contains data on what each Xref is and
which Ensembl entity it matches to.
SPECIES SPECIFIC ENTRIES
------------------------
------------------------
Human
-----
MIM - Online Mendelian Inheritance in Man
-----------------------------------------
Descriptions and types are obtained from the file
ftp://grcf.jhmi.edu/OMIM/omim.txt.Z
This creates two sets of Xrefs:
1) MIM_GENE (disease genes and other expressed genes)
2) MIM_MORBID (the disease genes)
Note that those in set 2 will also be in set 1.
These MIM Xrefs are linked to UniProt/SwissProt entries by the
UniProtParser.pm, creating dependent Xrefs. Note that if the SwissProt
entry does not specify whether the MIM entry is a phenotype or a gene then
it is ignored. For the same reason, MIM dependent Xrefs are NOT obtained
from the RefSeq entries.
So when the SwissProt entries are matched to Ensembl, the MIM entries
will also be matched.
HGNC
----
The HUGO Gene Nomenclature Committee Xrefs are obtained from various sources:-
1) HGNC (ensembl_mapped)
HGNC has direct mappings to Ensembl which have been manually curated.
This information is obtained from the script http://www.genenames.org/cgi-bin/hgnc_downloads.cgi
2) CCDS
The HGNCs are connected to the same Ensembl objects that the CCDS entries
are linked to. We connect to the CCDS database to get this information.
3) Vega
This is made from the Havana manually curated database.
4) HGNC
HGNC has links to other databases like UniProt, RefSeq etc. and these can
be used to link to Ensembl.
Which of these is chosen at the mapping stage is based on the priorities of
the sources; they are listed in priority order above.
This is known as a priority Xref, as the mapping with the best priority is
chosen.
CCDS
----
The CCDS database identifies a core set of human protein coding regions
that are consistently annotated by multiple public resources and pass
quality tests.
A local file is used here:
file:CCDS/CCDS.txt
The file contains a list of CCDS identifiers and the Ensembl entities
they match to. So direct Xrefs are created for these.
Mouse
-----
MGI
---
Previously known as 'MarkerSymbol'.
ftp://ftp.informatics.jax.org/pub/reports/MRK_SwissProt_TrEMBL.rpt
ftp://ftp.informatics.jax.org/pub/reports/MRK_Synonym.sql.rpt
This is a mouse-specific Xref source, being the Mouse Genome Informatics
data. The files have references to UniProt entries, and so the MGI entries
are set to be dependent Xrefs on these.
Rat
---
RGD
---
Rat Genome Database entries are populated by using the file
ftp://rgd.mcw.edu/pub/data_release/GENES
The RGD Xrefs are dependent Xrefs on the Refseq entries.
Zebrafish
---------
ZFIN_ID
-------
The two files
http://zfin.org/data_transfer/Downloads/refseq.txt
http://zfin.org/data_transfer/Downloads/swissprot.txt
contain lists of ZFIN identifiers paired with RefSeq or SwissProt
identifiers, depending on the file.
This creates a set of dependent Xrefs on the RefSeq and UniProt entries.
C. elegans
----------
wormpep_id, wormbase_locus, wormbase_gene, wormbase_transcript
--------------------------------------------------------------
Uses the file
ftp://ftp.sanger.ac.uk/pub/databases/wormpep/wormpep180/wormpep.table180
and the database (the latest release should do)
mysql:ensembldb.ensembl.org:3306:caenorhabditis_elegans_core_46_170b:anonymous
This creates direct Xrefs for all these.
The Xref System
========================================================================
The external database references (Xrefs) are added to the Ensembl
databases using the code found in this directory. The process consists
of two parts: the first part is parsing the data into a temporary database
(the Xref database); the second part is mapping the new Xrefs to the
Ensembl database.
Parsing the external database references
------------------------------------------------------------------------
In this directory you will find an ini-file called 'xref_config.ini'.
This file contains two types of configuration sections: source sections
and species sections. A source section defines Xref priority, order
etc. (as key-value pairs, see the comment at the top of the source
sections for a fuller explanation of these keys) for the source and
also the URIs pointing to the data files that the source should use.
The source label will only be used to refer to the source within the
ini-file (from a species section), so this can be any text string whose
meaning is easy to understand.
A species section contains information about species aliases, the
numerical taxonomy ID(s) and what sources to use for that species. If
a species has more than one taxonomy ID (in the case where there are
multiple strains or subspecies, for example), there can be more than one
'taxonomy_id' key. The name of the species is defined by the section
label and will be stored in the Xref database.
For now, the script 'xref_config2sql.pl' (also found in this directory)
should be used to convert the ini-file into an SQL file, with which you
should replace the file 'sql/populate_metadata.sql'. The
'xref_config2sql.pl' script expects to find 'xref_config.ini' in the
current directory, but you may specify an alternative file as the first
command line argument to the script if you have moved or renamed the
ini-file. When 'xref_parser.pl' is run it will load the generated SQL
file into the database and will then download and parse all external
data files for one or several specified species.
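A typical invocation (assuming the script writes the generated SQL to its
standard output) would be:
perl xref_config2sql.pl xref_config.ini > sql/populate_metadata.sql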
If you want to add a new source you will have to add a new source
section, following the pattern used by the other source sections. You
will then have to add it to the species that require the data.
If the new data comes in files not previously handled by the Xref
system, you will now also have to write the parser NewSourceParser.pm
(the parser name may be arbitrarily chosen) in the XrefParser directory.
You can find lots of examples of parsers in this directory.
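As a rough sketch only (the arguments passed to run() have varied between
releases, so copy the exact signature from an existing parser rather than
from here):
package XrefParser::NewSourceParser;

use strict;
use base qw( XrefParser::BaseParser );

# Called by the parsing framework: read the data file(s) and store the
# xrefs using the methods inherited from BaseParser.
sub run {
  my ( $self, @args ) = @_;

  # ... parse the file and store the xrefs here ...

  return 0;    # by convention in the existing parsers, 0 means success
}

1;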
Before running the Xref parser, make sure that the environment
variable 'http_proxy' is set to point to the local HTTP proxy to get
outside the firewall. For Sanger, the value of the variable should be
"http://cache.internal.sanger.ac.uk:3128", i.e. for tcsh shells you
should have
setenv http_proxy http://cache.internal.sanger.ac.uk:3128
in your ~/.tcshrc file, while for bash-like shell you should have
export http_proxy=http://cache.internal.sanger.ac.uk:3128
in your ~/.profile or ~/.bashrc file.
When you run the script 'xref_parser.pl' to do the Xrefs you must pass
it several options, but for most runs all you need to specify is the
user (user name on the database), pass (password), host (database host),
dbname, and species, i.e.
perl xref_parser.pl -host mymachine -user admin -pass XXXX \
-dbname new_human_xref -species human
Please keep the output from this script and check it later. At the end
of the output there will be a summary of what was successful and what
failed to run. This is important.
The parsing can create three types of Xrefs; these are:
1) Primary (These have sequence and are mapped via exonerate)
2) Dependent (Have no sequence but are dependent on the Primary ones)
3) Direct (These are directly linked to the Ensembl entities, so the
mapping is already done)
Some sources will have more than one set of files associated with them;
in these cases they have the same source name but different source IDs.
These are known as "priority Xrefs", as the Xrefs are mapped according to
the priority of the source. An example of this is HGNC.
For more information on what data can be parsed see the
'parsing_information.txt' file.
Mapping the external database references to the Ensembl core database
------------------------------------------------------------------------
This is an overview of what goes on in the script 'xref_mapper.pl' .
Primary Xrefs are dumped out to two Fasta files, one for peptides and
the other for DNA. Ensembl Transcripts and Translations are then dumped
out to two files in Fasta format.
Exonerate is then used to find the best matches for the Xrefs.
If there is more than one best match then the Xref is mapped to
more than one Ensembl entity. A cutoff is used to filter the best
matches to make sure they pass certain criteria. By default this
is that the query identity OR the target identity must be over
90%. This can be changed by creating your own '<method>.pm' file
in the directory 'XrefMapper/Methods' and creating subroutines
'query_identity_threshold()' and 'target_identity_threshold()' which
return the new values.
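As a rough sketch (the module name and thresholds here are illustrative; a
real method module would normally be copied from an existing one in the
'XrefMapper/Methods' directory):
package XrefMapper::Methods::MyMethod;

use strict;

# Accept a match if the query (xref) sequence identity is over 70%...
sub query_identity_threshold { return 70; }

# ...or if the target (Ensembl) sequence identity is over 70%.
sub target_identity_threshold { return 70; }

1;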
Exonerate will generate a set of .map files containing the mappings. The
map files are then parsed and any mappings that pass the criteria are
stored in the 'xref', 'object_xref' and 'identity_xref' tables.
All dependent Xrefs are also stored if the parent is mapped.
Direct Xrefs are also stored at this stage, but no mapping is needed here
as we already know what each Xref maps to.
For priority Xrefs (ones that have multiple sources) only the highest
priority one is stored.
Any Xrefs which fail to be mapped are written to the unmapped_object
table with a brief explanation of why they could not be mapped.
Once all the mappings have been stored, the display Xrefs and the
descriptions are generated for the transcripts and genes.
If you want to change any of the default settings you can create a new
'<species>.pm' for your particular species, or '<taxon>.pm', and override
the module 'BasicMapper.pm' (see 'rattus_norvegicus.pm' as an example).
The 'xref_mapper.pl' script needs a configuration file which has
information on the Xref database and the core database and also the
species name. Below is an example of running the mapping.
perl ~/ensembl-live/ensembl/misc-scripts/xref_mapping/xref_mapper.pl \
-file xref_input -upload >&MAPPER.OUT
Here is an example of a configuration file for 'xref_mapper.pl':
------------------------------------------------------------------------
xref
host=ensembl-machine
port=3306
dbname=human_xref_42
user=admin
password=xxxx
dir=./xref
species=homo_sapiens
taxon=mammalia (this is optional - use taxon if you need more than one species to use the same '<taxon>.pm' module)
host=ensembl-machine
port=3306
dbname=homo_sapiens_core_42_36d
user=admin
password=xxxx
dir=./ensembl
farm
queue=long
exonerate=/software/ensembl/bin/exonerate-1.4.0
------------------------------------------------------------------------
Note that it is good practice to use a sub-directory for the Ensembl
dumps, as many files are generated; it is best to keep these all together
and away from everything else, or it will be hard to find things.
The directory can also be tarred and zipped in case you need to check
things later.