What is the purpose of this document
------------------------------------
This document shows the steps and best practices for running external
databases references (xrefs) for various species.
Who is this document written for
--------------------------------
Anyone wanting to run the xref pipeline. For information on what xrefs are
and general details please see xrefs_overview.txt, xrefs_detailed_docs.txt,
FAQ.txt and parsing_information.txt in this directory.
Overview of steps
-----------------
1) Configure the system
2) Update the CCDS database (if human or mouse)
3) Update LRGs (if human)
4) Run the parsing
5) Run the mapping
Please note that the stable_id mapping must be done, and the Vega databases
must be available (for human, mouse and zebrafish), before the xref pipeline
can be run.
Configuring the system
----------------------
Edit the xref_config.ini file; see FAQ.txt for more details. If the
species is already in the file then this may just be a case of checking that
the correct versions of the databases are being used.
It is also important to have the correct version of the API at this stage, as
by default the API version is used to decide which database to connect to,
i.e. ensembl_ontology_xx, where xx is the release. So for Ensembl release 65
this would be the database ensembl_ontology_65, where the 65 is obtained from
the API.
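For orientation, here is a minimal sketch of the kind of stanzas found in
xref_config.ini. The exact keys vary between releases, so treat the entries
below as illustrative rather than definitive (the URI is deliberately left
as a placeholder):

[source HGNC::homo_sapiens]
name            = HGNC
download        = Y
order           = 30
priority        = 2
parser          = HGNCParser
data_uri        = ...

[species homo_sapiens]
taxonomy_id     = 9606
aliases         = human
source          = HGNC::homo_sapiens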
Update ccds database
--------------------
Because the stable IDs may have changed in the core database, we need to
update them in the CCDS databases. At present only human and mouse have these.
The script to run is store_ccds_xrefs.pl and is in the directory
ensembl-personal/genebuilders/ccds/scripts.
Submit the job to the farm with a 500 MB memory requirement:
bsub -q normal -M 500000 -R'select[mem>500] rusage[mem=500]' \
  -o /lustre/scratch103/ensembl/mk8/xrefs_67/human/ccds.out \
  -e /lustre/scratch103/ensembl/mk8/xrefs_67/human/ccds.err \
  perl ~/ensembl-personal/genebuilders/ccds/scripts/store_ccds_xrefs.pl \
  -ccds_dbname ccds_human_67 -ccds_host ens-livemirror \
  -ccds_user ensadmin -ccds_pass ensembl \
  -dbname homo_sapiens_core_67_37 -host ens-staging1 -port 3306 \
  -user ensro -verbose -species human -path GRCh37 -write -delete_old
Update LRGs
-----------
Good docs can be found at
https://www.ebi.ac.uk/seqdb/confluence/display/ENS/Importing+LRGs+into+Ensembl
which comes down to the following:
Check that the LRG modules are added to PERL5LIB; for my instance I set:
setenv PERL5LIB ${PERL5LIB}:/nfs/users/nfs_i/ianl/LRG/code/modules
perl scripts/import.lrg.pl -verbose -do_all -host ens-staging -port 3306 \
  -user rw -pass password -core homo_sapiens_core_65_37 \
  -otherfeatures homo_sapiens_otherfeatures_65_37 \
  -cdna homo_sapiens_cdna_65_37 -vega homo_sapiens_vega_65_37 \
  -rnaseq homo_sapiens_rnaseq_65_37 -clean >& clean.OUT

perl scripts/import.lrg.pl -verbose -do_all -host ens-staging -port 3306 \
  -user rw -pass password -core homo_sapiens_core_65_37 \
  -otherfeatures homo_sapiens_otherfeatures_65_37 \
  -cdna homo_sapiens_cdna_65_37 -vega homo_sapiens_vega_65_37 \
  -rnaseq homo_sapiens_rnaseq_65_37 -import -xrefs >& import.OUT

perl scripts/import.lrg.pl -verbose -do_all -host ens-staging -port 3306 \
  -user rw -pass password -core homo_sapiens_core_65_37 \
  -otherfeatures homo_sapiens_otherfeatures_65_37 \
  -cdna homo_sapiens_cdna_65_37 -vega homo_sapiens_vega_65_37 \
  -rnaseq homo_sapiens_rnaseq_65_37 -overlap >& overlap.OUT

perl scripts/import.lrg.pl -verbose -do_all -host ens-staging -port 3306 \
  -user rw -pass password -core homo_sapiens_core_65_37 \
  -otherfeatures homo_sapiens_otherfeatures_65_37 \
  -cdna homo_sapiens_cdna_65_37 -vega homo_sapiens_vega_65_37 \
  -rnaseq homo_sapiens_rnaseq_65_37 -verify >& verify.OUT
A C shell script to generate the commands:
set db_args = '-host ens-staging -port 3306 -user xxx -pass xxx'
#delete cdna if it's not ready yet, LRGs will be imported into cdna from core
set db_types =(core otherfeatures vega rnaseq cdna)
set species = 'homo_sapiens'
set dbs = ''
set options = (clean import overlap verify)
set version = '67_37'
set spath = '/nfs/users/nfs_m/mk8/code'
set outpath = '/lustre/scratch103/ensembl/mk8/xrefs_67/human'
foreach type ($db_types)
set dbs = "${dbs} -${type} ${species}_${type}_${version}"
end
foreach option ($options)
set memory_line = ''
if ($option == 'import') then
set command_line_args = "$db_args $dbs -$option -xrefs"
set memory_line = "-q normal -M 1500000 -R'select[mem>1500] rusage[mem=1500]'"
else
set command_line_args = "$db_args $dbs -$option "
endif
echo "bsub $memory_line -o $outpath/lrg_$option.out -e $outpath/lrg_$option.err perl $spath/scripts/import.lrg.pl -verbose -do_all $command_line_args"
end
exit;
If the cdna database is not yet ready then remove the "-cdna
homo_sapiens_cdna_65_37" part and continue, but let whoever is building
that database know that you are doing the LRGs. They should wait until the
LRG import is complete so that they can use all of the assembly info from
core, including the LRG data, to build the cDNA db.
Run the parsing
---------------
*PROXY: Some sources require the downloading of files over HTTP. If you are
behind a firewall then make sure you have set the HTTP_PROXY environment
variable.*
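For example, under csh this is just (the proxy host below is a placeholder;
use whatever your site requires):

setenv HTTP_PROXY http://your.proxy.host:3128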
More detailed instructions can be found in FAQ.txt, but basically you should
cd to where you want the files to be downloaded and run the following:
bsub -o parse.out -e parse.err perl \
  ~/src/ensembl/misc-scripts/xref_mapping/xref_parser.pl \
  -user rw -pass password -host ens-research \
  -dbname ianl_dog_xref_65 -species dog -create -stats -force
-species : which species to run the parsing for
-create  : create a new xref database, even if one already exists
-stats   : print statistics about the xrefs added by each parser
-force   : no interaction (i.e. for the farm), so the script assumes yes to
           all questions
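For example, if a run died part-way through and you want to keep the existing
xref database rather than rebuild it, the same command can be rerun without
-create (a sketch based on the option descriptions above, reusing the
placeholder host and database names):

bsub -o parse2.out -e parse2.err perl \
  ~/src/ensembl/misc-scripts/xref_mapping/xref_parser.pl \
  -user rw -pass password -host ens-research \
  -dbname ianl_dog_xref_65 -species dog -stats -force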
Running on the farm keeps the systems people happier, and using -o and -e
keeps the output and error files separate.
In this directory you will find parse.out, which shows sample output from
running the human xref parsing stage.
I will add ">" to the start of the output lines to differentiate them from
my comments.
Explanation of the output:
> Options: -user rw -pass password -host ens-research
> -dbname ianl_human_xref_65 -species human -stats -
> create -force
Tells us what options were used when the parser script was run.
> ----{ XXXX }-----------------------------------------------------------------
output from the parser XXXX
> Parsing script:host=>ens-livemirror,dbname=>ccds_human_65,tran_name=>ENST,
> with XXXXParser
XXXX is being parsed with the XXXXParser (see
ensembl/misc-scripts/xref_mapping/XrefParser/XXXXParser.pm for the module).
>source          xrefs   prim   dep   gdir   tdir  tldir  coord  synonyms
>XXX_transcript      0      0     0      0  33689
>XXXX            26451      0     0      0      0      0      0         0
So the parser added 26451 xrefs and 33689 direct xrefs to the transcripts.
Note that we can have more direct xrefs than xrefs, as one xref may go to
several transcripts; this is not a problem.
>================================================================================
>Summary of status
>================================================================================
> CCDS CCDSParser OKAY
> DBASS3 DBASSParser OKAY
> DBASS5 DBASSParser OKAY
> EntrezGene EntrezGeneParser OKAY
> GO GOParser OKAY
> GO InterproGoParser OKAY
> HGNC VegaOfficialNameParser OKAY
> HGNC HGNC_CCDSParser OKAY
> HGNC HGNCParser OKAY
The status for each parser should be "OKAY"; if it is not, there was a
problem.
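With many parsers the summary block can get long; here is a quick shell
sketch to pull out anything that did not finish cleanly (assuming, as above,
that healthy lines end in OKAY and the summary runs to the end of parse.out):

awk '/Summary of status/,0' parse.out | grep -v 'OKAY'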
Run the mapping
---------------
First create a configuration script to tell the mapper program information it
needs. Here is an example.
#############################################################
xref
host=ensembl-host1
port=3306
dbname=human_xref_65
user=rw
password=xxxx
dir=./xref
species=homo_sapiens
host=ensembl-host2
port=3306
dbname=homo_sapiens_core_65_37
user=rw
password=xxxx
dir=./ensembl
pr_host = ensembl-old
pr_user = ro
pr_dbname = homo_sapiens_core_64_37
farm
queue=long
exonerate=/software/ensembl/bin/exonerate-1.4.0
#############################################################
>xref
>host=ensembl-host1
>port=3306
>dbname=human_xref_65
>user=rw
>password=xxxx
defines what is needed to connect to the xref database
>dir=./xref
Sets where to dump the xref databases fasta files
Note the directory must exist already.
>species=homo_sapiens
>host=ensembl-host2
>port=3306
>dbname=homo_sapiens_core_65_37
>user=rw
>password=xxxx
defines what is needed to connect to the core database
>dir=./ensembl
Sets where to dump the core databases fasta files
Note the directory must exist already.
>pr_host = ensembl-old
>pr_user = ro
>pr_dbname = homo_sapiens_core_64_37
Normally, as part of the xref mapping, we compare the number of xrefs per
source in the core database with the number in the xref database and flag
any sources that have changed by more than 5%, as this may indicate that we
have a problem. By specifying the pr_... settings we instruct the comparison
to be made against another core database instead. This is normally done when
the core database we are updating does not have a full set of xrefs already
and hence the comparison would be useless.
>farm
>queue=long
>exonerate=/software/ensembl/bin/exonerate-1.4.0
Instead of using the default farm queue or exonerate executable, we can
override these here. The EBI and Sanger typically have different queues, and
other organisations may also differ, so this is very useful.
So we are now ready to run the mapping. We need to tell the mapper where the
configuration file is (see above).
The mapper is generally run twice. The first run does all the major work:
dumping the fasta files, mapping these files, reading in the mapping files,
and creating all the connections. At this stage a comparison of the xrefs in
the core database and the new xref database is done.
A typical command line call would be:
bsub -o mapper1.out -e mapper1.err perl xref_mapper.pl -file config_file
If you do not have access to a compute farm then:
perl xref_mapper.pl -file config_file -nofarm >& mapper1.out
(but this will be slow)
If everything looks okay we then transfer the data by adding -upload to
the command line options, i.e. when using the farm:
bsub -o mapper2.out -e mapper2.err perl xref_mapper.pl -file config_file \
  -upload
In this directory you will find examples of mapper1.out and mapper2.out but
again the important bits will be explained.
So for mapper1.out
>Options: -file xref_input
>running in verbose mode
Informs the user how the mapper was run
>current status is parsing_finished
Reports the current status of the xref database. This is used to work out
what to do next.
>No alt_alleles found for this species.
Only for human do we import the alt_alleles.
>Dumping xref & Ensembl sequences
>Dumping Xref fasta files
>Dumping Ensembl Fasta files
>53067 Transcripts dumped 41693 Transaltions dumped
Reports which files are dumped. If these have already been dumped and the
option -dumpcheck was used, this will be reported, and if the fasta files
already exist they will not be re-dumped.
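So a rerun that reuses existing dumps might look like this (a sketch using
the -dumpcheck option mentioned above; file names are just examples):

bsub -o mapper1b.out -e mapper1b.err perl xref_mapper.pl -file config_file \
  -dumpcheck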
>Deleting out, err and map files from output dir: /workdir/release_65/zebrafish/ensembl
>Deleting txt and sql files from output dir: /workdir/release_65/zebrafish/ensembl
>LSF job ID for main mapping job: 887287, name ExonerateGappedBest1_1318933449 with
> 481 arrays elements)
>LSF job ID for main mapping job: 887288, name ExonerateGappedBest1_1318933451 with
> 253 arrays elements)
>LSF job ID for Depend job: 887289 (job array with 1 job)
>already processed = 0, processed = 734, errors = 0, empty = 0
This is information on the mapping of the fasta files using exonerate. Check
that the errors are 0, otherwise one of the mappings went wrong.
>Could not find stable id ENSDART00000126968 in table to get the internal id hence
> ignoring!!! (for RFAM)
>Could not find stable id ENSDART00000121043 in table to get the internal id hence
> ignoring!!! (for RFAM)
Sometimes external databases will have links to Ensembl that are no longer
valid, usually due to time delays between our releases and those of the
external database. Here we can see two of these for RFAM; as long as this
number is not too large, this is not a problem.
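A quick way to gauge how many of these there are (a shell sketch against the
mapper1.out file used in this example):

grep -c 'Could not find stable id' mapper1.out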
>The foillowing will be processed as priority xrefs
> Uniprot/SPTREMBL
> ZFIN_ID
Priority xrefs are those xrefs where we get the data from more than one
place. Each source has a priority that tells us which is better, so the best
ones are chosen at this point.
>Process Pairs
>Starting at object_xref of 837705
> NEW 2733
>2733 new relationships added
Some xrefs can be considered as paired, i.e. RefSeq peptides and RefSeq
mRNAs; if we match one of these but not its pair, then we add this
relationship now.
>Writing InterPro
>
>246386 already existed
>
> Wrote 0 interpro table entries
> including 51399 object xrefs,
> and 51399 go xrefs
We create extra mappings using the InterPro table, and these are the stats
for this.
>ZFIN_ID is associated with both Transcript and Translation object types
>Therefore moving all associations from Translation to Transcript
If a particular source (in this example ZFIN_ID) is linked to more than one
of Gene, Transcript or Translation, then all are moved to the highest level,
Gene being the highest and Translation the lowest.
>DBASS3 moved to Gene level.
>DBASS5 moved to Gene level.
Some sources are considered to belong to genes but may be mapped to
transcripts or translations, so we now move these to the gene.
>For gene ENSDARG00000001832 we have mutiple ZFIN_ID's
> Keeping the best one si:ch1073-403i13.1
> removing zgc:113912 from gene
> removing zgc:103599 from gene
>Multiple best ZFIN_ID's using vega to find the most common for ENSDARG00000057813
> lratb (chosen as first)
> wu:fj89a05 (left as ZFIN_ID reference but not gene symbol)
For some sources (HGNC in human, MGI in mouse and ZFIN_ID in zebrafish) we
only want one reference per gene, so using things like their priorities, %id
mapping values etc. we try to find the best one and remove the others. If we
cannot find a single best one then all are kept.
>WARNING: Clone_based_ensembl_gene has decreased by -5 % was 7652 now 7194
>WARNING: Clone_based_ensembl_transcript has decreased by -8 % was 8260 now 7554
>WARNING: xrefs miRBase_gene_name are not in the new database but are in the old???
>WARNING: xrefs OTTG are not in the new database but are in the old???
>WARNING: xrefs OTTT are not in the new database but are in the old???
>WARNING: RefSeq_ncRNA has increased by 5% was 644 now 677
>WARNING: xrefs RFAM_gene_name are not in the new database but are in the old???
>WARNING: xrefs shares_CDS_and_UTR_with_OTTT are not in the new database but are
> in the old???
>WARNING: xrefs Vega_translation are not in the new database but are in the old???
>WARNING: ZFIN_ID_curated_transcript_notransfer has 9748 xrefs in the new database
> but NONE in the old
Look through the warnings to see if anything is obviously wrong. Note that
some xrefs are only ever in the core database and are left alone; these are
sources like OTTG, OTTT and Vega_translation, which are set by the merging
code (used by the genebuilders to produce the core database).
NOTE: The xrefs are updated by deleting the data for the sources being
updated and then adding the new ones, so if we are not updating a source it
will stay in the core database.
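When chasing up one of these warnings it can help to count the xrefs per
external source directly in the core database; a sketch using the standard
core schema (host, user and database below are placeholders):

mysql -h ens-staging1 -u ensro homo_sapiens_core_65_37 -e \
  'SELECT e.db_name, COUNT(*) FROM xref x
   JOIN external_db e USING (external_db_id) GROUP BY e.db_name'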
>xref_mapper.pl FINISHED NORMALLY
If you are happy with the messages we can now transfer the data to the core
database. This is done by adding -upload to the command line (see above).
mapper2.out gives a sample output for this.
>Options: -file xref_input -upload
>running in verbose mode
>current status is tests_finished
Reports the current status of the xref database. This is used to work out
what to do next. We can see here that the tests are finished and we are
ready to load the data.
>Deleting data for EMBL from core before updating from new xref database
>Deleting data for EntrezGene from core before updating from new xref database
>Deleting data for GO from core before updating from new xref database
>Deleting data for goslim_goa from core before updating from new xref database
>Deleting data for IPI from core before updating from new xref database
Delete the data for the sources we are updating.
>updating (236) EMBL in core (for DEPENDENT xrefs)
>DEP 42665 xrefs, 94223 object_xrefs
>updating (39) EntrezGene in core (for DEPENDENT xrefs)
>DEP 21473 xrefs, 23897 object_xrefs
> added 30853 synonyms
>updating (52) GO in core (for DEPENDENT xrefs)
>GO 4535
>updating (274) goslim_goa in core (for DEPENDENT xrefs)
>DEP 99 xrefs, 96927 object_xrefs
>updating (91) IPI in core (for SEQUENCE_MATCH xrefs)
>SEQ 35478
So we report the number and type of xrefs that are loaded.
>Setting Transcript and Gene display_xrefs from xref database into core and
> setting the desc
In the official naming routine, which mouse, human and zebrafish run, we set
the display_xrefs and descriptions.
>Using xref_off set of 722445
So the xref_id in the xref database plus the offset will be the same as the
core xref_id. Used mainly for checking/debugging.
>24488 gene descriptions added
>Only setting those not already set
>Presedence for Gene Descriptions
> Uniprot/SPTREMBL 1
> RefSeq_dna 3
> RefSeq_peptide 4
> Uniprot/SWISSPROT 5
> IMGT/GENE_DB 6
> ZFIN_ID 7
> miRBase 8
> RFAM 9
>6437 gene descriptions added
For those that the official naming routine could not set, we now add the
display_xrefs and descriptions. NOTE: the higher the number, the greater the
priority for naming.
>xref_mapper.pl FINISHED NORMALLY
The script has finished successfully. If you do not see this then it crashed
for some reason and you need to look at the mapper2.err file.