What is the purpose of this document
------------------------------------
This document shows the steps and best practices for running the external
database references (xrefs) pipeline for various species.
Who is this document written for
--------------------------------
Anyone wanting to run the xref pipeline. For information on what xrefs are
and general details please see xrefs_overview.txt, xrefs_detailed_docs.txt,
FAQ.txt and parsing_information.txt in this directory.
Overview of steps
-----------------
1) Configure the system.
2) Update the ccds database (if human or mouse).
3) Update the alt_alleles (if human).
4) Update the LRGs (if human).
5) Run the parsing.
6) Run the mapping.

Please note the stable_id mapping has to be done and the Vega databases
available (for human, mouse and zebrafish) before the xref pipeline can be
run.
Configuring the system
----------------------
Edit the config_xref.ini file; see FAQ.txt for more details. If the species
is already in the file then this may just be a case of checking that the
correct versions of the databases are being used.

It is also important to have the correct version of the API at this stage,
as by default the API version is used to define which database to connect
to, i.e. ensembl_ontology_xx where xx is the version. So for Ensembl release
65 this would be the database ensembl_ontology_65, where the 65 is obtained
from the API.
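
A quick way to check which API version your PERL5LIB is picking up (this
assumes the core API's Bio::EnsEMBL::ApiVersion module, which exports
software_version, is on your path):

perl -MBio::EnsEMBL::ApiVersion -e 'print software_version(), "\n"'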
Update alt_allele table
-----------------------
At present this is just for human. The following script examines the Vega
database and, based on the names, creates the alt_alleles for the core
database. The Vega database should already have links to the core database,
so that is how we map from a Vega stable_id to an Ensembl stable_id.

In the ensembl/misc-scripts/alt_alleles directory you need to run the script
alt_alleles.pl. This can be run using the API to automatically pick up the
correct databases (make sure the API version is correct for it to pick these
up):
perl alt_alleles.pl -cpass XXXX >& human_release_65_alt_alleles

or specify all the arguments explicitly, i.e.

perl alt_alleles.pl -vhost ens-staging1 -vport 3306 \
     -vdbname homo_sapiens_vega_65_37 \
     -cdbname homo_sapiens_core_65_37 \
     -chost ens-staging1 -cpass XXXX \
     >& human_release_65_alt_alleles
Update ccds database
--------------------
Because the stable ids may have changed in the core database we need to
update these in the ccds databases. At present only human and mouse have
these. The script to run is store_ccds_xrefs.pl and it is in the directory
ensembl-personal/genebuilders/ccds/scripts:

perl store_ccds_xrefs.pl -ccds_dbname ccds_human_65 \
     -ccds_host ens-livemirror -ccds_user rw -ccds_pass password \
     -dbname homo_sapiens_core_65_37 -host ens-staging1 -port 3306 -user ro \
     -verbose -species human -path GRCh37 -write -delete_old
Update LRGs
-----------
Good docs can be found at
https://www.ebi.ac.uk/seqdb/confluence/display/ENS/Importing+LRGs+into+Ensembl
which comes down to doing the following:
perl scripts/import.lrg.pl -verbose -do_all -host ens-staging -port 3306 \
     -user rw -pass password -core homo_sapiens_core_65_37 \
     -otherfeatures homo_sapiens_otherfeatures_65_37 \
     -cdna homo_sapiens_cdna_65_37 -vega homo_sapiens_vega_65_37 \
     -rnaseq homo_sapiens_rnaseq_65_37 -clean >& clean.OUT

perl scripts/import.lrg.pl -verbose -do_all -host ens-staging -port 3306 \
     -user rw -pass password -core homo_sapiens_core_65_37 \
     -otherfeatures homo_sapiens_otherfeatures_65_37 \
     -cdna homo_sapiens_cdna_65_37 -vega homo_sapiens_vega_65_37 \
     -rnaseq homo_sapiens_rnaseq_65_37 -import -xrefs >& import.OUT

perl scripts/import.lrg.pl -verbose -do_all -host ens-staging -port 3306 \
     -user rw -pass password -core homo_sapiens_core_65_37 \
     -otherfeatures homo_sapiens_otherfeatures_65_37 \
     -cdna homo_sapiens_cdna_65_37 -vega homo_sapiens_vega_65_37 \
     -rnaseq homo_sapiens_rnaseq_65_37 -overlap >& overlap.OUT

perl scripts/import.lrg.pl -verbose -do_all -host ens-staging -port 3306 \
     -user rw -pass password -core homo_sapiens_core_65_37 \
     -otherfeatures homo_sapiens_otherfeatures_65_37 \
     -cdna homo_sapiens_cdna_65_37 -vega homo_sapiens_vega_65_37 \
     -rnaseq homo_sapiens_rnaseq_65_37 -verify >& verify.OUT
You need to add the LRG modules to PERL5LIB so that Perl knows where to find
them; for my instance I set
setenv PERL5LIB ${PERL5LIB}:/nfs/users/nfs_i/ianl/LRG/code/modules
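
If your shell is bash rather than csh, the equivalent would be:

export PERL5LIB=${PERL5LIB}:/nfs/users/nfs_i/ianl/LRG/code/modules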
If the cdna database is not yet ready then remove the
"-cdna homo_sapiens_cdna_65_37" part and continue, but let whoever is
building this database know that you are doing the LRGs so that they get the
same data.
Run the parsing
---------------
More detailed instructions can be found in FAQ.txt, but basically you should
cd to where you want the files to be downloaded and run the following:

bsub -o parse.out -e parse.err perl \
     ~/src/ensembl/misc-scripts/xref_mapping/xref_parser.pl \
     -user rw -pass password -host ens-research \
     -dbname ianl_dog_xref_65 -species dog -create -stats -force
-species : which species to start the parsing for
-create : tells the script to create a new database even if one exists already
-stats : gives you statistics about what xrefs have been added for each
parser
-force : means no interaction (i.e. for the farm) so it assumes yes to all
questions
By running on the farm the systems people are happier, and by using -o and
-e we keep the output and error files separate.

In this directory you will find parse.out, which shows a sample output for
running the human xref parsing stage. I will add ">" to the start of the
output lines to differentiate them from my comments.
Explanation of the output:
> Options: -user rw -pass password -host ens-research
> -dbname ianl_human_xref_65 -species human -stats -
> create -force
Tells us what options were used when the parser script was run.
> ----{ XXXX }-----------------------------------------------------------------
Output from the parser XXXX.
> Parsing script:host=>ens-livemirror,dbname=>ccds_human_65,tran_name=>ENST,
> with XXXXParser
XXXX is being parsed with the XXXXParser (see
ensembl/misc-scripts/xref_mapping/XrefParser/XXXXParser.pm for the module).
>source           xrefs   prim    dep   gdir   tdir   tdir  coord  synonyms
>XXX_transcript       0      0      0      0  33689
>XXXX             26451      0      0      0      0      0      0      0
So the parser added 26451 xrefs and 33689 direct xrefs to the transcripts.
Note: we can have more direct xrefs than xrefs, as one xref may go to
several transcripts; this is not a problem.
>================================================================================
>Summary of status
>================================================================================
> CCDS CCDSParser OKAY
> DBASS3 DBASSParser OKAY
> DBASS5 DBASSParser OKAY
> EntrezGene EntrezGeneParser OKAY
> GO GOParser OKAY
> GO InterproGoParser OKAY
> HGNC VegaOfficialNameParser OKAY
> HGNC HGNC_CCDSParser OKAY
> HGNC HGNCParser OKAY
The status for each parser should be "OKAY"; if it is not then there was a
problem.
Run the mapping
---------------
First create a configuration file to tell the mapper program the information
it needs. Here is an example.
#############################################################
xref
host=ensembl-host1
port=3306
dbname=human_xref_65
user=rw
password=xxxx
dir=./xref

species=homo_sapiens
host=ensembl-host2
port=3306
dbname=homo_sapiens_core_65_37
user=rw
password=xxxx
dir=./ensembl
pr_host = ensembl-archive
pr_user = ro
pr_dbname = homo_sapiens_core_64_37

farm
queue=long
exonerate=/software/ensembl/bin/exonerate-1.4.0
#############################################################
>xref
>host=ensembl-host1
>port=3306
>dbname=human_xref_65
>user=rw
>password=xxxx
Defines what is needed to connect to the xref database.
>dir=./xref
Sets where to dump the xref database's fasta files. Note the directory must
exist already.
>species=homo_sapiens
>host=ensembl-host2
>port=3306
>dbname=homo_sapiens_core_65_37
>user=rw
>password=xxxx
Defines what is needed to connect to the core database.
>dir=./ensembl
Sets where to dump the core database's fasta files. Note the directory must
exist already.
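
For example, create both dump directories up front:

mkdir -p ./xref ./ensembl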
>pr_host = ensembl-archive
>pr_user = ro
>pr_dbname = homo_sapiens_core_64_37
Normally as part of the xref mapping we compare the number of xrefs in the
core database against the number in the xref database, and flag any sources
that have changed by more than 5%, as this may indicate a problem. By
specifying pr_... we instruct the comparison to be against another core
database instead. This is normally done when the core database we are
updating does not already have a full set of xrefs, and hence the comparison
would be useless.
>farm
>queue=long
>exonerate=/software/ensembl/bin/exonerate-1.4.0
Instead of using the default farm queue or exonerate executable, we can
override these here. Typically the EBI and Sanger have different queues, and
other organisations may also differ, so this is very useful.
So we are now ready to run the mapping. We need to tell the mapper where the
configuration file is (see above).
The mapper is generally run twice. The first time does all the major work,
like dumping the fasta files, mapping these files, reading in the mapping
files and creating all the connections. At this stage a comparison of the
xrefs in the core database and the new xref database is done.

A typical command line call would be:
bsub -o mapper1.out -e mapper1.err perl xref_mapper.pl -file config_file
If you do not have access to a compute farm then:
perl xref_mapper.pl -file config_file -nofarm >& mapper1.out
(but this will be slow)
If everything looks okay we will then transfer the data by adding -upload to
the command line options, i.e. when using the farm:

bsub -o mapper2.out -e mapper2.err perl xref_mapper.pl \
     -file config_file -upload

In this directory you will find examples of mapper1.out and mapper2.out, but
again the important bits will be explained.
So for mapper1.out:
>Options: -file xref_input
>running in verbose mode
Informs the user how the mapper was run.
>current status is parsing_finished
Reports the current status of the xref database. This is used to work out
what to do next.
>No alt_alleles found for this species.
Only for human do we import the alt_alleles.
>Dumping xref & Ensembl sequences
>Dumping Xref fasta files
>Dumping Ensembl Fasta files
>53067 Transcripts dumped 41693 Transaltions dumped
Reports what files are dumped. If the option -dumpcheck was used and the
fasta files already exist, this will be reported and they will not be
re-dumped.
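
So for a re-run where the fasta files are unchanged you can skip the
dumping, e.g. using the -dumpcheck option mentioned above:

bsub -o mapper1.out -e mapper1.err perl xref_mapper.pl \
     -file config_file -dumpcheck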
>Deleting out, err and map files from output dir: /workdir/release_65/zebrafish/ensembl
>Deleting txt and sql files from output dir: /workdir/release_65/zebrafish/ensembl
>LSF job ID for main mapping job: 887287, name ExonerateGappedBest1_1318933449 with
> 481 arrays elements)
>LSF job ID for main mapping job: 887288, name ExonerateGappedBest1_1318933451 with
> 253 arrays elements)
>LSF job ID for Depend job: 887289 (job array with 1 job)
>already processed = 0, processed = 734, errors = 0, empty = 0
This is information on the mapping of the fasta files using exonerate. Check
that the errors count is 0, otherwise one of the mapping jobs went wrong.
>Could not find stable id ENSDART00000126968 in table to get the internal id hence
> ignoring!!! (for RFAM)
>Could not find stable id ENSDART00000121043 in table to get the internal id hence
> ignoring!!! (for RFAM)
Sometimes external databases will have links to EnsEMBL that are no longer
valid, usually due to time delays in the releases with respect to the
external database. Here we can see two of these for RFAM; as long as this
number is not too large this is not a problem.
>The foillowing will be processed as priority xrefs
> Uniprot/SPTREMBL
> ZFIN_ID
Priority xrefs are those xrefs where we get the data from more than one
place. These have priorities which tell us which source is better, so the
best ones are chosen at this point.
>Process Pairs
>Starting at object_xref of 837705
> NEW 2733
>2733 new relationships added
Some xrefs can be considered as being paired, e.g. RefSeq_peptide and
RefSeq_mRNA, so if we match one of these but not its pair then we add this
relationship now.
>Writing InterPro
>
>246386 already existed
>
> Wrote 0 interpro table entries
> including 51399 object xrefs,
> and 51399 go xrefs
We create extra mappings using the InterPro table; these are the stats for
this.
>ZFIN_ID is associated with both Transcript and Translation object types
>Therefore moving all associations from Translation to Transcript
If a particular source (in this example ZFIN_ID) is linked to more than one
of Gene, Transcript or Translation, then all are moved to the highest level,
Gene being the highest and Translation the lowest.
>DBASS3 moved to Gene level.
>DBASS5 moved to Gene level.
Some sources are considered to belong to genes but may be mapped to
transcripts or translations, so we now move these to the gene.
>For gene ENSDARG00000001832 we have mutiple ZFIN_ID's
> Keeping the best one si:ch1073-403i13.1
> removing zgc:113912 from gene
> removing zgc:103599 from gene
>Multiple best ZFIN_ID's using vega to find the most common for ENSDARG00000057813
> lratb (chosen as first)
> wu:fj89a05 (left as ZFIN_ID reference but not gene symbol)
For some sources (HGNC in human, MGI in mouse and ZFIN_ID in zebrafish) we
only want one reference per gene, so using things like their priorities, %id
mapping values etc. we try to find the best one and remove the others. If we
cannot find a best one then all are kept.
>WARNING: Clone_based_ensembl_gene has decreased by -5 % was 7652 now 7194
>WARNING: Clone_based_ensembl_transcript has decreased by -8 % was 8260 now 7554
>WARNING: xrefs miRBase_gene_name are not in the new database but are in the old???
>WARNING: xrefs OTTG are not in the new database but are in the old???
>WARNING: xrefs OTTT are not in the new database but are in the old???
>WARNING: RefSeq_ncRNA has increased by 5% was 644 now 677
>WARNING: xrefs RFAM_gene_name are not in the new database but are in the old???
>WARNING: xrefs shares_CDS_and_UTR_with_OTTT are not in the new database but are
> in the old???
>WARNING: xrefs Vega_translation are not in the new database but are in the old???
>WARNING: ZFIN_ID_curated_transcript_notransfer has 9748 xrefs in the new database
> but NONE in the old
Look through the warnings to see if anything is obviously wrong. Note that
some xrefs are only ever in the core database and are left alone; these are
sources like OTTG, OTTT and Vega_translation, which are set by the merging
code (used by the genebuilders to produce the core database).

NOTE: the xrefs are updated by deleting the sources being updated and then
adding the new ones, so if we are not updating a source it will stay in the
core database.
>xref_mapper.pl FINISHED NORMALLY
The script has finished normally.
If you are happy with the messages we can now transfer the data to the core
database. This is done by adding -upload to the command line (see above).
mapper2.out gives a sample output for this.
>Options: -file xref_input -upload
>running in verbose mode
>current status is tests_finished
Reports the current status of the xref database. This is used to work out
what to do next. So we can see here that the tests are finished and we are
ready to load the data.
>Deleting data for EMBL from core before updating from new xref database
>Deleting data for EntrezGene from core before updating from new xref database
>Deleting data for GO from core before updating from new xref database
>Deleting data for goslim_goa from core before updating from new xref database
>Deleting data for IPI from core before updating from new xref database
Deletes the data for the sources we are updating.
>updating (236) EMBL in core (for DEPENDENT xrefs)
>DEP 42665 xrefs, 94223 object_xrefs
>updating (39) EntrezGene in core (for DEPENDENT xrefs)
>DEP 21473 xrefs, 23897 object_xrefs
> added 30853 synonyms
>updating (52) GO in core (for DEPENDENT xrefs)
>GO 4535
>updating (274) goslim_goa in core (for DEPENDENT xrefs)
>DEP 99 xrefs, 96927 object_xrefs
>updating (91) IPI in core (for SEQUENCE_MATCH xrefs)
>SEQ 35478
So we report the number and type of xrefs that are loaded.
>Setting Transcript and Gene display_xrefs from xref database into core and
> setting the desc
In the official naming routine, which human, mouse and zebrafish run, we set
the display_xrefs and descriptions.
>Using xref_off set of 722445
So an xref_id in the xref database plus the offset will be the same as the
core xref_id. Used mainly for checking/debugging.
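
As a rough sanity check you can compare one entry in both databases using
this offset (a sketch only; hosts and database names here are taken from the
example configuration above, and the accession column is called accession in
the xref database but dbprimary_acc in the core schema):

mysql -h ensembl-host1 -u ro human_xref_65 \
      -e 'select xref_id, accession from xref where xref_id = 100'
mysql -h ensembl-host2 -u ro homo_sapiens_core_65_37 \
      -e 'select xref_id, dbprimary_acc from xref where xref_id = 100 + 722445'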
>24488 gene descriptions added
>Only setting those not already set
>Presedence for Gene Descriptions
> Uniprot/SPTREMBL 1
> RefSeq_dna 3
> RefSeq_peptide 4
> Uniprot/SWISSPROT 5
> IMGT/GENE_DB 6
> ZFIN_ID 7
> miRBase 8
> RFAM 9
>6437 gene descriptions added
For those that the official naming routine could not set, we now add
display_xrefs and descriptions. NOTE: the higher the number, the greater the
priority for naming.
>xref_mapper.pl FINISHED NORMALLY
The script has finished successfully. If you do not see this then it crashed for
some reason and you need to look at the mapper2.err file.
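
A quick way to scan the error file, for example:

grep -in error mapper2.err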