Commit ada79199 authored by Ian Longden

doc changes

parent 6c6b37f6
@@ -10,11 +10,10 @@ Questions
6) I have mapping errors, how do I fix them?
7) How do I start again from the parsing has finished stage?
8) How do I start again from the mapping_finished stage?
9) What is fullmode and partupdate?
10) How do I run my external database references without a compute farm?
11) I want to use a different list of external database sources for my
9) How do I run my external database references without a compute farm?
10) I want to use a different list of external database sources for my
display_xrefs (names)?
12) I want to use a different list of external database sources for my gene
11) I want to use a different list of external database sources for my gene
descriptions?
@@ -113,9 +112,15 @@ Answers
loading into the core database.
i) Map the entities in the xref database and do some checks etc.
xref_mapper.pl -file xref_config >& MAPPER1.OUT
If you have no compute farm then add the -nofarm option.
perl ~/src/ensembl/misc-scripts/xref_mapper/xref_mapper.pl
-file xref_config -nofarm >& MAPPER1.OUT
or if using the farm
bsub -o mapper.out -e mapper.err
perl ~/src/ensembl/misc-scripts/xref_mapper/xref_mapper.pl -file xref_config
Check the output file. A warning about the number of xrefs increasing is
nothing to worry about; the main thing to be concerned about is a reduction
in the number, i.e. xrefs that are in the core database but not in the xref
database.
@@ -305,28 +310,7 @@ data_uri = ftp://fantom.gsc.riken.jp/DDBJ_fantom3_HTC_accession.txt.gz
9) What is fullmode and partupdate?
Fullmode means that all the xrefs are being updated and not just a few specific
external database sources. This is important as it affects the way the
display_xrefs and descriptions are calculated at the end. The user can override
this by setting the -partupdate option in the mapper options or by changing the
entry in the table (the key is "fullmode" in the meta table).
If we are doing all the xref sources then we know that all the data is local
and hence we can use some SQL to get the display_xrefs etc. But if this is not
the case then the core database will have extra information in it that may be
needed, so we have to query the core database. When everything is in the xref
database simple SQL can be used, whereas with the core database we have to go
through each gene and then each transcript etc. using the API, which is a lot
slower.
In summary, only alter the mode here if you know what you are doing and what
the consequences are.
10) How do I run my external database references without a compute farm?
9) How do I run my external database references without a compute farm?
Simply use the -nofarm option with the xref_mapper.pl script.
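For example, reusing the mapper command shown earlier (the path to the script
depends on where your checkout lives):
perl ~/src/ensembl/misc-scripts/xref_mapper/xref_mapper.pl \
     -file xref_config -nofarm >& MAPPER1.OUT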
@@ -334,7 +318,7 @@ data_uri = ftp://fantom.gsc.riken.jp/DDBJ_fantom3_HTC_accession.txt.gz
11) I want to use a different list of external database sources for my
10) I want to use a different list of external database sources for my
display_xrefs (names)?
The external databases to be used for the display_xrefs are taken from either
@@ -401,7 +385,7 @@ data_uri = ftp://fantom.gsc.riken.jp/DDBJ_fantom3_HTC_accession.txt.gz
12) I want to use a different list of external database sources for my gene
11) I want to use a different list of external database sources for my gene
descriptions?
As above but this time we use the sub gene_description_sources.
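As a rough sketch only (the source names below are just examples, and the
exact return form should be checked against the default
gene_description_sources in DisplayXrefs.pm), an override in a species module
could look something like:
sub gene_description_sources {
  # External database sources to use for gene descriptions, in order of
  # preference. The names here are illustrative only.
  return ("RFAM",
          "miRBase",
          "Uniprot/SWISSPROT",
          "RefSeq_peptide");
}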
@@ -413,41 +413,37 @@ calculate display xref for genes and transcripts
-----------------------------------------------
The external databases to be used for the display_xrefs are taken from either the
BasicMapper.pm subroutine transcript_display_xref_sources, i.e.
DisplayXrefs.pm subroutine transcript_display_xref_sources, i.e.
sub transcript_display_xref_sources {
my @list = qw(miRBase
my @list = qw(HGNC
MGI
Clone_based_vega_gene
Clone_based_ensembl_gene
HGNC_transcript_name
MGI_transcript_name
Clone_based_vega_transcript
Clone_based_ensembl_transcript
miRBase
RFAM
HGNC_curated_gene
HGNC_automatic_gene
MGI_curated_gene
MGI_automatic_gene
Clone_based_vega_gene
Clone_based_ensembl_gene
HGNC_curated_transcript
HGNC_automatic_transcript
MGI_curated_transcript
MGI_automatic_transcript
Clone_based_vega_transcript
Clone_based_ensembl_transcript
IMGT/GENE_DB
HGNC
SGD
MGI
flybase_symbol
Anopheles_symbol
Genoscope_annotated_gene
Uniprot/SWISSPROT
Uniprot/Varsplic
RefSeq_peptide
RefSeq_dna
Uniprot/SPTREMBL
EntrezGene
IMGT/GENE_DB
SGD
flybase_symbol
Anopheles_symbol
Genoscope_annotated_gene
Uniprot/SWISSPROT
Uniprot/Varsplic
Uniprot/SPTREMBL
EntrezGene
IPI);
my %ignore;
$ignore{"EntrezGene"}= 'FROM:RefSeq_[pd][en][pa].*_predicted';
$ignore{"Uniprot/SPTREMBL"} =(<<BIGN);
SELECT object_xref_id
FROM object_xref JOIN xref USING(xref_id) JOIN source USING(source_id)
WHERE ox_status = 'DUMP_OUT' AND name = 'Uniprot/SPTREMBL'
AND priority_description = 'protein_evidence_gt_3'
BIGN
return [\@list,\%ignore];
}
@@ -456,30 +452,35 @@ sub transcript_display_xref_sources {
or, if you want to create your own list, then you need to create a species.pm file
and create a new subroutine there; an example here is for drosophila_melanogaster.
So in the file drosophila_melanogaster.pm we have:-
and create a new subroutine there; an example here is for tetraodon_nigroviridis.
So in the file tetraodon_nigroviridis.pm we have:-
sub transcript_display_xref_sources {
my @list = qw(FlyBaseName_transcript FlyBaseCGID_transcript
flybase_annotation_id);
my @list = qw(HGNC
MGI
wormbase_transcript
flybase_symbol
Anopheles_symbol
Genoscope_annotated_gene
Genoscope_predicted_transcript
Uniprot/SWISSPROT
RefSeq
Uniprot/SPTREMBL
LocusLink);
my %ignore;
$ignore{"EntrezGene"}= 'FROM:RefSeq_[pd][en][pa].*_predicted';
return [\@list,\%ignore];
}
$ignore{"Uniprot/SPTREMBL"} =(<<BIGN);
SELECT object_xref_id
FROM object_xref JOIN xref USING(xref_id) JOIN source USING(source_id)
WHERE ox_status = 'DUMP_OUT' AND name = 'Uniprot/SPTREMBL'
AND priority_description = 'protein_evidence_gt_3'
BIGN
If we are in fullmode then a temporary table is created with these sources in it
and some SQL is used to find the best display_xrefs for the transcripts and genes.
If we are not in fullmode then we have to use the API to cycle through each gene,
transcript and translation, fetch all their xrefs and get the best. This is not
that quick. The results are stored directly in the core database.
return [\@list,\%ignore];
}
calculate the gene descriptions.
@@ -11,12 +11,12 @@ database.
Parsing the external database references
------------------------------------------------------------------------
In this directory you will find an ini-file called 'xref_config.ini'.
This file contains two types of configuration sections: source sections
and species sections. A source section defines Xref priority, order
etc. (as key-value pairs, see the comment at the top of the source
sections for a fuller explanation of these keys) for the source and
also the URIs pointing to the data files that the source should use.
In the xref_mapper directory you will find an ini-file called
'xref_config.ini'. This file contains two types of configuration
sections: source sections and species sections. A source section defines
Xref priority, order etc. (as key-value pairs, see the comment at the top
of the source sections for a fuller explanation of these keys) for the source
and also the URIs pointing to the data files that the source should use.
The source label will only be used to refer to the source within the
ini-file (from a species section), so this can be any text string whose
meaning is easy to understand.
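Purely as an illustration of the shape of a source section (only 'priority',
'order' and 'data_uri' are keys mentioned in this document; the section header
syntax and the full set of keys should be checked against the comments in
'xref_config.ini' itself):
[source ExampleSource]
# how this source is prioritised and the order in which it is parsed
priority        = 1
order           = 10
# one or more URIs pointing at the data files for this source
data_uri        = ftp://example.org/pub/example_source_data.txt.gz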
@@ -28,15 +28,14 @@ multiple strains or subspecies, for example), there can be more than one
'taxonomy_id' key. The name of the species is defined by the source
label and will be stored in the Xref database.
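Again only as a sketch (the 'taxonomy_id' key is mentioned above; anything
else, including the section header syntax, is an assumption to be checked
against the real ini-file), a species section might look like:
[species homo_sapiens]
# more than one taxonomy_id line may be present, e.g. for strains or subspecies
taxonomy_id     = 9606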
For now, the script 'xref_config2sql.pl' (also found in this directory)
should be used to convert the ini-file into a SQL file which you
should replace the file 'sql/populate_metadata.sql' with. The
'xref_config2sql.pl' script expects to find 'xref_config.ini' in the
current directory, but you may specify an alternative file as the first
command line argument to the script if you have moved or renamed the
ini-file. When 'xref_parser.pl' is run it will load the generated SQL
file into the database and will then download and parse all external
data files for one or several specified species.
For now, the script 'xref_config2sql.pl' should be used to convert the
ini-file into a SQL file with which you should replace the file
'sql/populate_metadata.sql'. The 'xref_config2sql.pl' script expects
to find 'xref_config.ini' in the current directory, but you may specify an
alternative file as the first command line argument to the script if you have
moved or renamed the ini-file. When 'xref_parser.pl' is run it will load the
generated SQL file into the database and will then download and parse all
external data files for one or several specified species.
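For example (this assumes, as is not stated explicitly here, that the script
writes the generated SQL to standard output):
perl xref_config2sql.pl xref_config.ini > sql/populate_metadata.sql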
If you want to add a new source you will have to add a new source
section, following the pattern used by the other source sections. You
@@ -67,7 +66,15 @@ user (user name on the database), pass (password), host (database host),
dbname, and species, i.e.
perl xref_parser.pl -host mymachine -user admin -pass XXXX \
-dbname new_human_xref -species human
-dbname new_human_xref -species human -stats
If you are using the farm then I strongly advise submitting the job with bsub
(as below), as it makes the systems people happier and it is easier to get the
output and error files separately.
bsub -o parse.out -e parse.err perl xref_parser.pl -host mymachine \
-user admin -pass XXXX -dbname new_human_xref -species human \
-stats -force
Please keep the output from this script and check it later. At the end
of the output there will be a summary of what was successful and what
@@ -83,7 +90,7 @@ The parsing can create three types of Xrefs these are
Some sources will have more than one set of files associated with them;
in these cases they have the same source name but different source IDs.
These are known as "priority Xrefs" as the Xrefs are mapped according to
the priority of the source. An example of this is the HUGOs.
the priority of the source. An example of this is HGNCs.
For more information on what data can be parsed, see the
'parsing_information.txt' file.
@@ -94,9 +101,8 @@ Mapping the external database references to the Ensembl core database
This is an overview of what goes on in the script 'xref_mapper.pl'.
Primary Xrefs are dumped out to two Fasta files, one for peptides and
the other for DNA. Ensembl Transcripts and Translations are then dumped
out to two files in Fasta format.
Primary Xrefs are dumped out to Fasta files; Ensembl Transcripts and
Translations are then dumped out to two files in Fasta format.
Exonerate is then used to find the best matches for the Xrefs.
If there is more than one best match then the Xref is mapped to
@@ -163,6 +169,6 @@ exonerate=/software/ensembl/bin/exonerate-1.4.0
Note it is good practice to use a sub-directory for the Ensembl directory,
as many files are generated and hence it is best to put these all
together and way from everything else or it will be hard to find things.
together and away from everything else or it will be hard to find things.
Also the directory can be tarred and zipped in case you need to check
things later.
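For example (the directory name is only a placeholder for whatever
sub-directory you used for the run):
tar -czf xref_run.tar.gz xref_run/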