What is the purpose of this document
------------------------------------

This document describes the steps and best practices for running the
external database reference (xref) pipeline for various species.

Who is this document written for
--------------------------------

Anyone wanting to run the xref pipeline. For information on what xrefs
are, and for general details, please see xrefs_overview.txt,
xrefs_detailed_docs.txt, FAQ.txt and parsing_information.txt in this
directory.

Overview of steps
-----------------

1) Configure the system.
2) Update the CCDS database (human and mouse only).
3) Update the LRGs (human only).
4) Run the parsing.
5) Run the mapping.

Please note that the stable_id mapping has to be done, and the Vega
databases have to be available (for human, mouse and zebrafish), before
the xref pipeline can be run.

Configuring the system
----------------------

Edit the xref_config.ini file (see FAQ.txt for more details). If the
species is already in the file then this may just be a case of checking
that the correct versions of the databases are being used.

It is also important to have the correct version of the API at this
stage, as by default the API version is used to decide which database to
connect to, i.e. ensembl_ontology_XX where XX is the release. So for
Ensembl release 65 this would be the database ensembl_ontology_65, where
the 65 is obtained from the API.
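If you are unsure which release your checkout reports (and therefore
which ensembl_ontology_XX database will be used), the core API can tell
you, assuming the ensembl core modules are on your PERL5LIB:

# Print the API release of the checkout currently on PERL5LIB; the
# pipeline will connect to ensembl_ontology_<this number>.
perl -MBio::EnsEMBL::ApiVersion -e 'print software_version(), "\n"'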
Update the CCDS database
------------------------

Because the stable ids may have changed in the core database, we need to
update them in the CCDS databases. At present only human and mouse have
these. The script to run is store_ccds_xrefs.pl, which is in the
directory ensembl-personal/genebuilders/ccds/scripts. Submit the job to
the farm with a 500Mb memory requirement:

bsub -q normal -M 500000 -R'select[mem>500] rusage[mem=500]' \
  -o /lustre/scratch103/ensembl/mk8/xrefs_67/human/ccds.out \
  -e /lustre/scratch103/ensembl/mk8/xrefs_67/human/ccds.err \
  perl ~/ensembl-personal/genebuilders/ccds/scripts/store_ccds_xrefs.pl \
  -ccds_dbname ccds_human_67 -ccds_host ens-livemirror -ccds_user ensadmin \
  -ccds_pass ensembl -dbname homo_sapiens_core_67_37 -host ens-staging1 \
  -port 3306 -user ensro -verbose -species human -path GRCh37 \
  -write -delete_old

Update LRGs
-----------

Good documentation can be found at

https://www.ebi.ac.uk/seqdb/confluence/display/ENS/Importing+LRGs+into+Ensembl

which comes down to doing the following.

Check that the LRG modules have been added to PERL5LIB; for my instance
I set

setenv PERL5LIB ${PERL5LIB}:/nfs/users/nfs_i/ianl/LRG/code/modules

Then run the import script four times, once per stage (-clean, -import
-xrefs, -overlap, -verify):

perl scripts/import.lrg.pl -verbose -do_all -host ens-staging -port 3306 \
  -user rw -pass password -core homo_sapiens_core_65_37 \
  -otherfeatures homo_sapiens_otherfeatures_65_37 \
  -cdna homo_sapiens_cdna_65_37 -vega homo_sapiens_vega_65_37 \
  -rnaseq homo_sapiens_rnaseq_65_37 -clean >& clean.OUT

perl scripts/import.lrg.pl -verbose -do_all -host ens-staging -port 3306 \
  -user rw -pass password -core homo_sapiens_core_65_37 \
  -otherfeatures homo_sapiens_otherfeatures_65_37 \
  -cdna homo_sapiens_cdna_65_37 -vega homo_sapiens_vega_65_37 \
  -rnaseq homo_sapiens_rnaseq_65_37 -import -xrefs >& import.OUT

perl scripts/import.lrg.pl -verbose -do_all -host ens-staging -port 3306 \
  -user rw -pass password -core homo_sapiens_core_65_37 \
  -otherfeatures homo_sapiens_otherfeatures_65_37 \
  -cdna homo_sapiens_cdna_65_37 -vega homo_sapiens_vega_65_37 \
  -rnaseq homo_sapiens_rnaseq_65_37 -overlap >& overlap.OUT

perl scripts/import.lrg.pl -verbose -do_all -host ens-staging -port 3306 \
  -user rw -pass password -core homo_sapiens_core_65_37 \
  -otherfeatures homo_sapiens_otherfeatures_65_37 \
  -cdna homo_sapiens_cdna_65_37 -vega homo_sapiens_vega_65_37 \
  -rnaseq homo_sapiens_rnaseq_65_37 -verify >& verify.OUT

A C shell script to generate the bsub commands for all four stages:

#!/bin/csh
set db_args = '-host ens-staging -port 3306 -user xxx -pass xxx'
# delete cdna if it's not ready yet; LRGs will be imported into cdna from core
set db_types = (core otherfeatures vega rnaseq cdna)
set species = 'homo_sapiens'
set dbs = ''
set options = (clean import overlap verify)
set version = '67_37'
set spath = '/nfs/users/nfs_m/mk8/code'
set outpath = '/lustre/scratch103/ensembl/mk8/xrefs_67/human'

# Build the list of -core/-otherfeatures/... database arguments.
foreach type ($db_types)
  set dbs = "${dbs} -${type} ${species}_${type}_${version}"
end

# Emit one bsub command per stage; the import stage needs more memory.
foreach option ($options)
  set memory_line = ''
  if ($option == 'import') then
    set command_line_args = "$db_args $dbs -$option -xrefs"
    set memory_line = "-q normal -M 1500000 -R'select[mem>1500] rusage[mem=1500]'"
  else
    set command_line_args = "$db_args $dbs -$option"
  endif
  echo "bsub $memory_line -o $outpath/lrg_$option.out -e $outpath/lrg_$option.err perl $spath/scripts/import.lrg.pl -verbose -do_all $command_line_args"
end

exit

If the cdna database is not yet ready then remove the
"-cdna homo_sapiens_cdna_65_37" argument and continue, but let whoever
is building that database know that you are doing the LRGs. They should
wait until the LRG import is complete, so that they can use all of the
assembly info from core, including the LRG data, to build the cDNA
database.

Run the parsing
---------------

*PROXY: Some sources require files to be downloaded over HTTP. If you
are behind a firewall then make sure you have set the HTTP_PROXY
environment variable.*
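For example, under csh (the proxy host and port below are placeholders;
substitute your site's own proxy):

setenv HTTP_PROXY http://your.proxy.host:3128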
More detailed instructions can be found in FAQ.txt, but basically you
should cd to the directory you want the files to be downloaded to and
run the following:

bsub -o parse.out -e parse.err perl \
  ~/src/ensembl/misc-scripts/xref_mapping/xref_parser.pl \
  -user rw -pass password -host ens-research -dbname ianl_dog_xref_65 \
  -species dog -create -stats -force

-species : which species to run the parsing for
-create  : tells the script to create a new database even if one exists already
-stats   : gives you statistics about what xrefs have been added for each parser
-force   : no interaction (i.e. for the farm), so yes is assumed for all questions

Running on the farm keeps the systems people happier, and by using -o
and -e we keep the output and error files separate.

In this directory you will find parse.out, which shows sample output
from the human xref parsing stage. Below, ">" is added to the start of
the output lines to differentiate them from the comments.

Explanation of the output:

> Options: -user rw -pass password -host ens-research
>          -dbname ianl_human_xref_65 -species human -stats -create -force

Tells us what options were used when the parser script was run.

> ----{ XXXX }-----------------------------------------------------------------

Output from the parser XXXX.

> Parsing script:host=>ens-livemirror,dbname=>ccds_human_65,tran_name=>ENST,
> with XXXXParser

XXXX is being parsed with the XXXXParser (see
ensembl/misc-scripts/xref_mapping/XrefParser/XXXXParser.pm for the
module).

>source          xrefs   prim    dep     gdir    tdir    tdir    coord   synonyms
>XXX_transcript          0       0       0       0       33689
>XXXX            26451   0       0       0       0       0       0       0

So the parser added 26451 xrefs and 33689 direct xrefs to the
transcripts. Note that we can have more direct xrefs than xrefs, as one
xref may map to several transcripts; this is not a problem.

>================================================================================
>Summary of status
>================================================================================
>       CCDS            CCDSParser              OKAY
>       DBASS3          DBASSParser             OKAY
>       DBASS5          DBASSParser             OKAY
>       EntrezGene      EntrezGeneParser        OKAY
>       GO              GOParser                OKAY
>       GO              InterproGoParser        OKAY
>       HGNC            VegaOfficialNameParser  OKAY
>       HGNC            HGNC_CCDSParser         OKAY
>       HGNC            HGNCParser              OKAY

The status for each parser should be "OKAY"; if it is not, then there
was a problem with that parser.
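Since a full parse log is long, a crude but quick way to scan for
failures (assuming the parse.out file name used above; banner lines
mentioning a parser may also match and can be ignored):

# Print parser status lines that are not OKAY; an empty result for the
# summary block means all parsers succeeded.
grep 'Parser' parse.out | grep -v 'OKAY'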
Run the mapping
---------------

First create a configuration file to give the mapper program the
information it needs. Here is an example:

#############################################################
xref
host=ensembl-host1
port=3306
dbname=human_xref_65
user=rw
password=xxxx
dir=./xref

species=homo_sapiens
host=ensembl-host2
port=3306
dbname=homo_sapiens_core_65_37
user=rw
password=xxxx
dir=./ensembl
pr_host = ensembl-old
pr_user = ro
pr_dbname = homo_sapiens_core_64_37

farm
queue=long
exonerate=/software/ensembl/bin/exonerate-1.4.0
#############################################################

>xref
>host=ensembl-host1
>port=3306
>dbname=human_xref_65
>user=rw
>password=xxxx

Defines what is needed to connect to the xref database.

>dir=./xref

Sets where to dump the xref database's fasta files. Note that the
directory must already exist.

>species=homo_sapiens
>host=ensembl-host2
>port=3306
>dbname=homo_sapiens_core_65_37
>user=rw
>password=xxxx

Defines what is needed to connect to the core database.

>dir=./ensembl

Sets where to dump the core database's fasta files. Note that the
directory must already exist.

>pr_host = ensembl-old
>pr_user = ro
>pr_dbname = homo_sapiens_core_64_37

Normally, as part of the xref mapping, we compare the number of xrefs in
the core database to the number in the xref database, and flag any
sources that have changed by more than 5%, as this may indicate a
problem. By specifying pr_... we instruct the comparison to be made
against another core database instead. This is normally done when the
core database we are updating does not have a full set of xrefs already,
which would make the comparison useless.

>farm
>queue=long
>exonerate=/software/ensembl/bin/exonerate-1.4.0

Instead of using the default farm queue or exonerate executable we can
override them here. Typically the EBI and Sanger have different queues,
and other organisations may differ as well, so this is very useful.
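The mapper will not create those dump directories for you, so create
them before the first run (names taken from the example configuration
above):

mkdir -p ./xref ./ensembl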
So we are now ready to run the mapping. We need to tell the mapper where
the configuration file is (see above). The mapper is generally run
twice. The first run does all the major work: dumping the fasta files,
mapping them, reading in the mapping files and creating all the
connections. At this stage a comparison of the xrefs in the core
database and the new xref database is also done. A typical command line
call would be:

bsub -o mapper1.out -e mapper1.err perl xref_mapper.pl -file config_file

If you do not have access to a compute farm then:

perl xref_mapper.pl -file config_file -nofarm >& mapper1.out

(but this will be slow).

If everything looks okay we then transfer the data by adding -upload to
the command line options, i.e. when using the farm:

bsub -o mapper2.out -e mapper2.err perl xref_mapper.pl -file config_file -upload

In this directory you will find examples of mapper1.out and mapper2.out;
the important parts are explained below.

So for mapper1.out:

>Options: -file xref_input
>running in verbose mode

Informs the user how the mapper was run.

>current status is parsing_finished

Reports the current status of the xref database. This is used to work
out what to do next.

>No alt_alleles found for this species.

Only for human do we import the alt_alleles.

>Dumping xref & Ensembl sequences
>Dumping Xref fasta files
>Dumping Ensembl Fasta files
>53067 Transcripts dumped 41693 Transaltions dumped

Reports which files are dumped. If these were already dumped and the
option -dumpcheck was used, then this will be reported and the existing
fasta files will not be re-dumped.

>Deleting out, err and map files from output dir: /workdir/release_65/zebrafish/ensembl
>Deleting txt and sql files from output dir: /workdir/release_65/zebrafish/ensembl
>LSF job ID for main mapping job: 887287, name ExonerateGappedBest1_1318933449 with
> 481 arrays elements)
>LSF job ID for main mapping job: 887288, name ExonerateGappedBest1_1318933451 with
> 253 arrays elements)
>LSF job ID for Depend job: 887289 (job array with 1 job)
>already processed = 0, processed = 734, errors = 0, empty = 0

This is information on the mapping of the fasta files using exonerate.
Check that errors is 0, otherwise one of the mappings went wrong.

>Could not find stable id ENSDART00000126968 in table to get the internal id hence
> ignoring!!! (for RFAM)
>Could not find stable id ENSDART00000121043 in table to get the internal id hence
> ignoring!!! (for RFAM)

Sometimes external databases will have links to Ensembl that are no
longer valid, usually due to release time lags between Ensembl and the
external database. Here we can see two of these for RFAM; as long as
this number is not too large, this is not a problem.

>The foillowing will be processed as priority xrefs
> Uniprot/SPTREMBL
> ZFIN_ID

Priority xrefs are xrefs for which we get the data from more than one
place. Each source has a priority telling us which is better, so the
best ones are chosen at this point.

>Process Pairs
>Starting at object_xref of 837705
> NEW 2733
>2733 new relationships added

Some xrefs can be considered as paired (e.g. RefSeq_peptide and
RefSeq_mRNA); if we match one of these but not its pair, then we add
that relationship now.

>Writing InterPro
>246386 already existed
> Wrote 0 interpro table entries
> including 51399 object xrefs,
> and 51399 go xrefs

We create extra mappings using the InterPro table; these are the stats
for this.

>ZFIN_ID is associated with both Transcript and Translation object types
>Therefore moving all associations from Translation to Transcript

If a particular source (in this example ZFIN_ID) is linked to more than
one of Gene, Transcript or Translation, then all associations are moved
to the highest level, Gene being the highest and Translation the lowest.

>DBASS3 moved to Gene level.
>DBASS5 moved to Gene level.

Some sources are considered to belong to genes but may be mapped to
transcripts or translations, so we move these to the gene now.

>For gene ENSDARG00000001832 we have mutiple ZFIN_ID's
> Keeping the best one si:ch1073-403i13.1
> removing zgc:113912 from gene
> removing zgc:103599 from gene
>Multiple best ZFIN_ID's using vega to find the most common for ENSDARG00000057813
> lratb (chosen as first)
> wu:fj89a05 (left as ZFIN_ID reference but not gene symbol)

For some sources (HGNC in human, MGI in mouse and ZFIN_ID in zebrafish)
we only want one reference per gene, so using things like their
priorities and %id mapping values we try to find the best one and remove
the others. If we cannot find a single best one then all are kept.

>WARNING: Clone_based_ensembl_gene has decreased by -5 % was 7652 now 7194
>WARNING: Clone_based_ensembl_transcript has decreased by -8 % was 8260 now 7554
>WARNING: xrefs miRBase_gene_name are not in the new database but are in the old???
>WARNING: xrefs OTTG are not in the new database but are in the old???
>WARNING: xrefs OTTT are not in the new database but are in the old???
>WARNING: RefSeq_ncRNA has increased by 5% was 644 now 677
>WARNING: xrefs RFAM_gene_name are not in the new database but are in the old???
>WARNING: xrefs shares_CDS_and_UTR_with_OTTT are not in the new database but are
> in the old???
>WARNING: xrefs Vega_translation are not in the new database but are in the old???
>WARNING: ZFIN_ID_curated_transcript_notransfer has 9748 xrefs in the new database
> but NONE in the old

Look through the warnings to see if anything is obviously wrong. Note
that some xrefs are only ever in the core database and are left alone;
these are sources like OTTG, OTTT and Vega_translation, which are set by
the merging code (used by the genebuilders to produce the core
database).

NOTE: The xrefs are updated by deleting the sources being updated and
then adding the new ones, so a source that is not being updated will
remain untouched in the core database.

>xref_mapper.pl FINISHED NORMALLY

If you are happy with the messages, we can now transfer the data to the
core database. This is done by adding -upload to the command line (see
above). mapper2.out gives sample output for this:

>Options: -file xref_input -upload
>running in verbose mode
>current status is tests_finished

Reports the current status of the xref database. This is used to work
out what to do next. We can see here that the tests are finished and we
are ready to load the data.

>Deleting data for EMBL from core before updating from new xref database
>Deleting data for EntrezGene from core before updating from new xref database
>Deleting data for GO from core before updating from new xref database
>Deleting data for goslim_goa from core before updating from new xref database
>Deleting data for IPI from core before updating from new xref database

Delete the data for the sources we are updating.

>updating (236) EMBL in core (for DEPENDENT xrefs)
>DEP 42665 xrefs, 94223 object_xrefs
>updating (39) EntrezGene in core (for DEPENDENT xrefs)
>DEP 21473 xrefs, 23897 object_xrefs
> added 30853 synonyms
>updating (52) GO in core (for DEPENDENT xrefs)
>GO 4535
>updating (274) goslim_goa in core (for DEPENDENT xrefs)
>DEP 99 xrefs, 96927 object_xrefs
>updating (91) IPI in core (for SEQUENCE_MATCH xrefs)
>SEQ 35478

Reports the number and type of xrefs that are loaded.

>Setting Transcript and Gene display_xrefs from xref database into core and
> setting the desc

In the official naming routine, which human, mouse and zebrafish run, we
set the display_xrefs and descriptions.

>Using xref_offset of 722445

The xref_id in the xref database plus this offset equals the
corresponding xref_id in the core database. Used mainly for checking and
debugging.
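As a sketch of such a check, using the example connection details from
the configuration above and the standard xref and core schema columns
(accession in the xref database, dbprimary_acc in core); the xref_id of
1000 is just an illustration, pick one that was actually uploaded:

# xref database side: look up an accession by xref_id.
mysql -h ensembl-host1 -u rw -pxxxx human_xref_65 \
  -e 'SELECT xref_id, accession FROM xref WHERE xref_id = 1000'
# core side: the same accession should appear at xref_id + offset
# (722445 in the log line above).
mysql -h ensembl-host2 -u rw -pxxxx homo_sapiens_core_65_37 \
  -e 'SELECT xref_id, dbprimary_acc FROM xref WHERE xref_id = 1000 + 722445'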
>24488 gene descriptions added
>Only setting those not already set
>Presedence for Gene Descriptions
>       Uniprot/SPTREMBL        1
>       RefSeq_dna              3
>       RefSeq_peptide          4
>       Uniprot/SWISSPROT       5
>       IMGT/GENE_DB            6
>       ZFIN_ID                 7
>       miRBase                 8
>       RFAM                    9
>6437 gene descriptions added

For the genes that the official naming routine could not set, we now add
display_xrefs and descriptions. NOTE: the higher the number, the greater
the priority for naming.

>xref_mapper.pl FINISHED NORMALLY

The script has finished successfully. If you do not see this line then
it crashed for some reason and you need to look at the mapper2.err file.
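As a final sanity check after the upload run (file names as used in the
bsub command above):

# Expect the success line to be present and mapper2.err to contain
# nothing alarming.
grep 'FINISHED NORMALLY' mapper2.out
cat mapper2.err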