What is the purpose of this document
------------------------------------
This document shows the steps and best practices for running the external
database references (xrefs) pipeline for various species.
Who is this document written for
--------------------------------
Anyone wanting to run the xref pipeline. For information on what xrefs are
and general details please see xrefs_overview.txt, xrefs_detailed_docs.txt,
FAQ.txt and parsing_information.txt in this directory.
Overview of steps
-----------------
1) Configure the system.
2) Update the ccds database (if human or mouse).
3) Update the alt_alleles (if human).
4) Update the LRGs (if human).
5) Run the parsing.
6) Run the mapping.

Please note the stable_id mapping has to be done and the Vega databases
available (for human, mouse and zebrafish) before the xref pipeline can be
run.
Configuring the system
----------------------
Edit the config_xref.ini file; see FAQ.txt for more details. If the species
is already in the file then this may just be a case of checking that the
correct versions of the databases are being used.

It is also important to have the correct version of the API at this stage,
as by default the API version is used to define which database to connect
to, i.e. ensembl_ontology_xx where xx is the version. So for Ensembl release
65 this would be the database ensembl_ontology_65, where the 65 is obtained
from the API.
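
A quick way to check which API version your PERL5LIB is picking up (this
assumes the core API's Bio::EnsEMBL::ApiVersion module, which exports
software_version, is on your path):

perl -MBio::EnsEMBL::ApiVersion -e 'print software_version(), "\n"'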
Update alt_allele table
-----------------------
At present this is just for human. The following script examines the Vega
database and, based on the names, creates the alt_alleles for the core
database. The Vega database should already have links to the core database,
so that is how we map from a Vega stable_id to an Ensembl stable_id.

In the ensembl/misc-scripts/alt_alleles directory you need to run the script
alt_alleles.pl. This can be run using the API to automatically pick up the
correct databases (make sure the API version is correct for it to pick these
up):
perl alt_alleles.pl -cpass XXXX >& human_release_65_alt_alleles

or specify all the arguments explicitly, i.e.

perl alt_alleles.pl -vhost ens-staging1 -vport 3306 \
     -vdbname homo_sapiens_vega_65_37 \
     -cdbname homo_sapiens_core_65_37 \
     -chost ens-staging1 -cpass XXXX \
     >& human_release_65_alt_alleles
Update ccds database
--------------------
Because the stable ids may have changed in the core database we need to
update these in the ccds databases. At present only human and mouse have
these. The script to run is store_ccds_xrefs.pl and it is in the directory
ensembl-personal/genebuilders/ccds/scripts:

perl store_ccds_xrefs.pl -ccds_dbname ccds_human_65 \
     -ccds_host ens-livemirror -ccds_user rw -ccds_pass password \
     -dbname homo_sapiens_core_65_37 -host ens-staging1 -port 3306 -user ro \
     -verbose -species human -path GRCh37 -write -delete_old
Update LRGs
-----------
Good docs can be found at
https://www.ebi.ac.uk/seqdb/confluence/display/ENS/Importing+LRGs+into+Ensembl
which comes down to doing the following:
perl scripts/import.lrg.pl -verbose -do_all -host ens-staging -port 3306 \
     -user rw -pass password -core homo_sapiens_core_65_37 \
     -otherfeatures homo_sapiens_otherfeatures_65_37 \
     -cdna homo_sapiens_cdna_65_37 -vega homo_sapiens_vega_65_37 \
     -rnaseq homo_sapiens_rnaseq_65_37 -clean >& clean.OUT

perl scripts/import.lrg.pl -verbose -do_all -host ens-staging -port 3306 \
     -user rw -pass password -core homo_sapiens_core_65_37 \
     -otherfeatures homo_sapiens_otherfeatures_65_37 \
     -cdna homo_sapiens_cdna_65_37 -vega homo_sapiens_vega_65_37 \
     -rnaseq homo_sapiens_rnaseq_65_37 -import -xrefs >& import.OUT

perl scripts/import.lrg.pl -verbose -do_all -host ens-staging -port 3306 \
     -user rw -pass password -core homo_sapiens_core_65_37 \
     -otherfeatures homo_sapiens_otherfeatures_65_37 \
     -cdna homo_sapiens_cdna_65_37 -vega homo_sapiens_vega_65_37 \
     -rnaseq homo_sapiens_rnaseq_65_37 -overlap >& overlap.OUT

perl scripts/import.lrg.pl -verbose -do_all -host ens-staging -port 3306 \
     -user rw -pass password -core homo_sapiens_core_65_37 \
     -otherfeatures homo_sapiens_otherfeatures_65_37 \
     -cdna homo_sapiens_cdna_65_37 -vega homo_sapiens_vega_65_37 \
     -rnaseq homo_sapiens_rnaseq_65_37 -verify >& verify.OUT
You need to add the LRG modules to PERL5LIB so that Perl knows where to find
them; for my instance I set
setenv PERL5LIB ${PERL5LIB}:/nfs/users/nfs_i/ianl/LRG/code/modules
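
If your shell is bash rather than csh, the equivalent would be:

export PERL5LIB=${PERL5LIB}:/nfs/users/nfs_i/ianl/LRG/code/modules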
If the cdna database is not yet ready then remove the
"-cdna homo_sapiens_cdna_65_37" part and continue, but let whoever is
building this database know that you are doing the LRGs so that they get the
same data.
Run the parsing
---------------
More detailed instructions can be found in FAQ.txt, but basically you should
cd to where you want the files to be downloaded and run the following:

bsub -o parse.out -e parse.err perl \
     ~/src/ensembl/misc-scripts/xref_mapping/xref_parser.pl \
     -user rw -pass password -host ens-research \
     -dbname ianl_dog_xref_65 -species dog -create -stats -force
-species : which species to start the parsing for
-create : tells the script to create a new database even if one exists already
-stats : gives you statistics about what xrefs have been added for each
parser
-force : means no interaction (i.e. for the farm) so it assumes yes to all
questions
By running on the farm the systems people are happier, and by using -o and
-e we keep the output and error files separate.

In this directory you will find parse.out, which shows a sample output for
running the human xref parsing stage. I will add ">" to the start of the
output lines to differentiate them from my comments.
Explanation of the output:
> Options: -user rw -pass password -host ens-research
> -dbname ianl_human_xref_65 -species human -stats -
> create -force
Tells us what options were used when the parser script was run.
> ----{ XXXX }-----------------------------------------------------------------
Output from the parser XXXX.
> Parsing script:host=>ens-livemirror,dbname=>ccds_human_65,tran_name=>ENST,
> with XXXXParser
XXXX is being parsed with the XXXXParser (see
ensembl/misc-scripts/xref_mapping/XrefParser/XXXXParser.pm for the module).
>source           xrefs   prim    dep   gdir   tdir   tdir  coord  synonyms
>XXX_transcript       0      0      0      0  33689
>XXXX             26451      0      0      0      0      0      0      0
So the parser added 26451 xrefs and 33689 direct xrefs to the transcripts.
Note: we can have more direct xrefs than xrefs, as one xref may go to
several transcripts; this is not a problem.
>================================================================================
>Summary of status
>================================================================================
> CCDS CCDSParser OKAY
> DBASS3 DBASSParser OKAY
> DBASS5 DBASSParser OKAY
> EntrezGene EntrezGeneParser OKAY
> GO GOParser OKAY
> GO InterproGoParser OKAY
> HGNC VegaOfficialNameParser OKAY
> HGNC HGNC_CCDSParser OKAY
> HGNC HGNCParser OKAY
The status for each parser should be "OKAY"; if it is not then there was a
problem.
Run the mapping
---------------
First create a configuration file to tell the mapper program the information
it needs. Here is an example.
#############################################################
xref
host=ensembl-host1
port=3306
dbname=human_xref_65
user=rw
password=xxxx
dir=./xref

species=homo_sapiens
host=ensembl-host2
port=3306
dbname=homo_sapiens_core_65_37
user=rw
password=xxxx
dir=./ensembl
pr_host = ensembl-archive
pr_user = ro
pr_dbname = homo_sapiens_core_64_37

farm
queue=long
exonerate=/software/ensembl/bin/exonerate-1.4.0
#############################################################
>xref
>host=ensembl-host1
>port=3306
>dbname=human_xref_65
>user=rw
>password=xxxx
Defines what is needed to connect to the xref database.
>dir=./xref
Sets where to dump the xref database's fasta files. Note the directory must
exist already.
>species=homo_sapiens
>host=ensembl-host2
>port=3306
>dbname=homo_sapiens_core_65_37
>user=rw
>password=xxxx
Defines what is needed to connect to the core database.
>dir=./ensembl
Sets where to dump the core database's fasta files. Note the directory must
exist already.
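
For example, create both dump directories up front:

mkdir -p ./xref ./ensembl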
>pr_host = ensembl-archive
>pr_user = ro
>pr_dbname = homo_sapiens_core_64_37
Normally as part of the xref mapping we compare the number of xrefs in the
core database against the number in the xref database, and flag any sources
that have changed by more than 5%, as this may indicate a problem. By
specifying pr_... we instruct the comparison to be against another core
database instead. This is normally done when the core database we are
updating does not already have a full set of xrefs, and hence the comparison
would be useless.
>farm
>queue=long
>exonerate=/software/ensembl/bin/exonerate-1.4.0
Instead of using the default farm queue or exonerate executable, we can
override these here. Typically the EBI and Sanger have different queues, and
other organisations may also differ, so this is very useful.
So we are now ready to run the mapping. We need to tell the mapper where the
configuration file is (see above).
The mapper is generally run twice. The first time does all the major work,
like dumping the fasta files, mapping these files, reading in the mapping
files and creating all the connections. At this stage a comparison of the
xrefs in the core database and the new xref database is done.

A typical command line call would be:
bsub -o mapper1.out -e mapper1.err perl xref_mapper.pl -file config_file
If you do not have access to a compute farm then:
perl xref_mapper.pl -file config_file -nofarm >& mapper1.out
(but this will be slow)
If everything looks okay we will then transfer the data by adding -upload to
the command line options, i.e. when using the farm:

bsub -o mapper2.out -e mapper2.err perl xref_mapper.pl \
     -file config_file -upload

In this directory you will find examples of mapper1.out and mapper2.out, but
again the important bits will be explained.
So for mapper1.out:
>Options: -file xref_input
>running in verbose mode
Informs the user how the mapper was run.
>current status is parsing_finished
Reports the current status of the xref database. This is used to work out
what to do next.
>No alt_alleles found for this species.
Only for human do we import the alt_alleles.
>Dumping xref & Ensembl sequences
>Dumping Xref fasta files
>Dumping Ensembl Fasta files
>53067 Transcripts dumped 41693 Transaltions dumped
Reports what files are dumped. If the option -dumpcheck was used and the
fasta files already exist, this will be reported and they will not be
re-dumped.
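
So for a re-run where the fasta files are unchanged you can skip the
dumping, e.g. using the -dumpcheck option mentioned above:

bsub -o mapper1.out -e mapper1.err perl xref_mapper.pl \
     -file config_file -dumpcheck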
>Deleting out, err and map files from output dir: /workdir/release_65/zebrafish/ensembl
>Deleting txt and sql files from output dir: /workdir/release_65/zebrafish/ensembl
>LSF job ID for main mapping job: 887287, name ExonerateGappedBest1_1318933449 with
> 481 arrays elements)
>LSF job ID for main mapping job: 887288, name ExonerateGappedBest1_1318933451 with
> 253 arrays elements)
>LSF job ID for Depend job: 887289 (job array with 1 job)
>already processed = 0, processed = 734, errors = 0, empty = 0
This is information on the mapping of the fasta files using exonerate. Check
that the errors count is 0, otherwise one of the mapping jobs went wrong.
>Could not find stable id ENSDART00000126968 in table to get the internal id hence
> ignoring!!! (for RFAM)
>Could not find stable id ENSDART00000121043 in table to get the internal id hence
> ignoring!!! (for RFAM)
Sometimes external databases will have links to EnsEMBL that are no longer
valid, usually due to time delays in the releases with respect to the
external database. Here we can see two of these for RFAM; as long as this
number is not too large this is not a problem.
>The foillowing will be processed as priority xrefs
> Uniprot/SPTREMBL
> ZFIN_ID
Priority xrefs are those xrefs where we get the data from more than one
place. These have priorities which tell us which source is better, so the
best ones are chosen at this point.
>Process Pairs
>Starting at object_xref of 837705
> NEW 2733
>2733 new relationships added
Some xrefs can be considered as being paired, e.g. RefSeq_peptide and
RefSeq_mRNA, so if we match one of these but not its pair then we add this
relationship now.
>Writing InterPro
>
>246386 already existed
>
> Wrote 0 interpro table entries
> including 51399 object xrefs,
> and 51399 go xrefs
We create extra mappings using the InterPro table; these are the stats for
this.
>ZFIN_ID is associated with both Transcript and Translation object types
>Therefore moving all associations from Translation to Transcript
If a particular source (in this example ZFIN_ID) is linked to more than one
of Gene, Transcript or Translation, then all are moved to the highest level,
Gene being the highest and Translation the lowest.
>DBASS3 moved to Gene level.
>DBASS5 moved to Gene level.
Some sources are considered to belong to genes but may be mapped to
transcripts or translations, so we now move these to the gene.
>For gene ENSDARG00000001832 we have mutiple ZFIN_ID's
> Keeping the best one si:ch1073-403i13.1
> removing zgc:113912 from gene
> removing zgc:103599 from gene
>Multiple best ZFIN_ID's using vega to find the most common for ENSDARG00000057813
> lratb (chosen as first)
> wu:fj89a05 (left as ZFIN_ID reference but not gene symbol)
For some sources (HGNC in human, MGI in mouse and ZFIN_ID in zebrafish) we
only want one reference per gene, so using things like their priorities, %id
mapping values etc. we try to find the best one and remove the others. If we
cannot find a best one then all are kept.
>WARNING: Clone_based_ensembl_gene has decreased by -5 % was 7652 now 7194
>WARNING: Clone_based_ensembl_transcript has decreased by -8 % was 8260 now 7554
>WARNING: xrefs miRBase_gene_name are not in the new database but are in the old???
>WARNING: xrefs OTTG are not in the new database but are in the old???
>WARNING: xrefs OTTT are not in the new database but are in the old???
>WARNING: RefSeq_ncRNA has increased by 5% was 644 now 677
>WARNING: xrefs RFAM_gene_name are not in the new database but are in the old???
>WARNING: xrefs shares_CDS_and_UTR_with_OTTT are not in the new database but are
> in the old???
>WARNING: xrefs Vega_translation are not in the new database but are in the old???
>WARNING: ZFIN_ID_curated_transcript_notransfer has 9748 xrefs in the new database
> but NONE in the old
Look through the warnings to see if anything is obviously wrong. Note that
some xrefs are only ever in the core database and are left alone; these are
sources like OTTG, OTTT and Vega_translation, which are set by the merging
code (used by the genebuilders to produce the core database).

NOTE: the xrefs are updated by deleting the sources being updated and then
adding the new ones, so if we are not updating a source it will stay in the
core database.
>xref_mapper.pl FINISHED NORMALLY
The script has finished normally.
If you are happy with the messages we can now transfer the data to the core
database. This is done by adding -upload to the command line (see above).
mapper2.out gives a sample output for this.
>Options: -file xref_input -upload
>running in verbose mode
>current status is tests_finished
Reports the current status of the xref database. This is used to work out
what to do next. So we can see here that the tests are finished and we are
ready to load the data.
>Deleting data for EMBL from core before updating from new xref database
>Deleting data for EntrezGene from core before updating from new xref database
>Deleting data for GO from core before updating from new xref database
>Deleting data for goslim_goa from core before updating from new xref database
>Deleting data for IPI from core before updating from new xref database
Deletes the data for the sources we are updating.
>updating (236) EMBL in core (for DEPENDENT xrefs)
>DEP 42665 xrefs, 94223 object_xrefs
>updating (39) EntrezGene in core (for DEPENDENT xrefs)
>DEP 21473 xrefs, 23897 object_xrefs
> added 30853 synonyms
>updating (52) GO in core (for DEPENDENT xrefs)
>GO 4535
>updating (274) goslim_goa in core (for DEPENDENT xrefs)
>DEP 99 xrefs, 96927 object_xrefs
>updating (91) IPI in core (for SEQUENCE_MATCH xrefs)
>SEQ 35478
So we report the number and type of xrefs that are loaded.
>Setting Transcript and Gene display_xrefs from xref database into core and
> setting the desc
In the official naming routine, which human, mouse and zebrafish run, we set
the display_xrefs and descriptions.
>Using xref_off set of 722445
So an xref_id in the xref database plus the offset will be the same as the
core xref_id. Used mainly for checking/debugging.
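
As a rough sanity check you can compare one entry in both databases using
this offset (a sketch only; hosts and database names here are taken from the
example configuration above, and the accession column is called accession in
the xref database but dbprimary_acc in the core schema):

mysql -h ensembl-host1 -u ro human_xref_65 \
      -e 'select xref_id, accession from xref where xref_id = 100'
mysql -h ensembl-host2 -u ro homo_sapiens_core_65_37 \
      -e 'select xref_id, dbprimary_acc from xref where xref_id = 100 + 722445'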
>24488 gene descriptions added
>Only setting those not already set
>Presedence for Gene Descriptions
> Uniprot/SPTREMBL 1
> RefSeq_dna 3
> RefSeq_peptide 4
> Uniprot/SWISSPROT 5
> IMGT/GENE_DB 6
> ZFIN_ID 7
> miRBase 8
> RFAM 9
>6437 gene descriptions added
For those that the official naming routine could not set, we now add
display_xrefs and descriptions. NOTE: the higher the number, the greater the
priority for naming.
>xref_mapper.pl FINISHED NORMALLY
The script has finished successfully. If you do not see this then it crashed for
some reason and you need to look at the mapper2.err file.
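
A quick way to scan the error file, for example:

grep -in error mapper2.err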