Commit 3ac7e909 authored by Glenn Proctor's avatar Glenn Proctor
parent 95ddcefa
java -jar /scratch/saxon/saxon7.jar tables.xml xml2html.xsl > tables.html
java -jar /scratch/saxon/saxon7.jar tables.xml xml2wiki.xsl > tables.txt
java org.apache.xalan.xslt.Process -in tables.xml -xsl schema-description.xsl -out tables.html
<!ELEMENT schemadescription (introduction?, diagram*, tablegroup*, concepts*)>
<!ATTLIST schemadescription schema-version CDATA #REQUIRED>
<!ATTLIST schemadescription document-version CDATA #REQUIRED>
<!ELEMENT introduction (text, process?)>
<!ELEMENT process (step*)>
<!ATTLIST process intro CDATA #IMPLIED>
<!ELEMENT concepts (concept*)>
<!ELEMENT concept (#PCDATA)>
<!ATTLIST concept description CDATA #REQUIRED>
<!ELEMENT diagram (#PCDATA)>
<!ATTLIST diagram description CDATA #IMPLIED>
<!ELEMENT tablegroup (table*)>
<!ATTLIST tablegroup name CDATA #REQUIRED>
<!ATTLIST tablegroup description CDATA #IMPLIED>
<!ELEMENT table (name, description, used, see?)>
<!ELEMENT description (#PCDATA)>
<!ELEMENT see (tableref*, conceptref*, urlref*)>
<!ELEMENT tableref (#PCDATA)>
<!ATTLIST tableref name CDATA #IMPLIED>
<!ATTLIST tableref reason CDATA #IMPLIED>
<!ELEMENT conceptref (#PCDATA)>
<!ATTLIST conceptref name CDATA #IMPLIED>
<!ATTLIST conceptref reason CDATA #IMPLIED>
<!ELEMENT urlref (#PCDATA)>
<!ATTLIST urlref reason CDATA #IMPLIED>
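As a rough illustration of the kind of document this DTD describes, the Python sketch below parses a small hand-written `schemadescription` instance with the standard-library `xml.etree.ElementTree` and lists the tables in each group. The sample content is made up, and it assumes that `name` and `used` hold plain text, since their declarations are not shown above.

```python
import xml.etree.ElementTree as ET

# Hand-written sample; element values are illustrative, not taken from tables.xml.
SAMPLE = """
<schemadescription schema-version="15" document-version="1.21">
  <tablegroup name="Fundamental tables">
    <table>
      <name>dna</name>
      <description>Contains DNA sequence.</description>
      <used>yes</used>
      <see>
        <tableref name="contig" reason="1:1 relationship to the dna table."/>
      </see>
    </table>
  </tablegroup>
</schemadescription>
"""


def tables_by_group(xml_text):
    """Return {tablegroup name: [table names]} for a schemadescription document."""
    root = ET.fromstring(xml_text)
    return {
        group.get("name"): [table.findtext("name") for table in group.findall("table")]
        for group in root.findall("tablegroup")
    }


groups = tables_by_group(SAMPLE)
print(groups)
```

Note that `ElementTree` does not validate against the DTD; it only checks that the document is well-formed.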
=EnsEMBL Core Schema Documentation=
This document gives a high-level description of the tables that make up the EnsEMBL core schema. Tables are grouped into logical groups, and the purpose of each table is explained. It is intended to allow people to familiarise themselves with the schema when encountering it for the first time, or when they need to use some tables that they've not used before. Note that while some of the more important columns in some of the tables are discussed, this document makes no attempt to enumerate all of the names, types and contents of every single table. Some concepts which are referred to in the table descriptions are given at the end of this document; these are linked to from the table description where appropriate.
Different tables are populated throughout the gene build process:
||0||Create empty schema, populate meta table||
||1||Load DNA - populates dna, clone, contig, chromosome, assembly tables||
||2||Analyze DNA (raw computes) - populates genomic feature/analysis tables||
||3||Build genes - populates exon, transcript and other gene-related tables||
||4a||Analyze genes - populates protein_feature, xref and interpro tables||
||4b||ID mapping||
This document refers to version ''15'' of the EnsEMBL core schema. You are looking at revision ''1.21'' of this document.
* - Schema diagram on 2 A4 pages for easy printing.
* - Schema diagram on one A3 page.
''Quick links to tables:''
'''Fundamental tables'''
* assembly
* chromosome
* clone
* contig
* dna
* exon
* exon_stable_id
* exon_transcript
* gene
* gene_description
* gene_stable_id
* karyotype
* meta
* supporting_feature
* transcript
* transcript_stable_id
* translation
* translation_stable_id
'''Features and analyses'''
* analysis
* dna_align_feature
* marker
* marker_feature
* marker_map_location
* marker_synonym
* prediction_transcript
* protein_align_feature
* protein_feature
* qtl
* qtl_feature
* qtl_synonym
* repeat_consensus
* repeat_feature
* simple_feature
'''ID Mapping'''
(Tables involved in mapping identifiers between releases)
* gene_archive
* mapping_session
* peptide_archive
* stable_id_event
'''External references'''
(Tables used for storing links to and details about objects that are stored in other databases)
* external_db
* external_synonym
* go_xref
* identity_xref
* object_xref
* xref
'''Miscellaneous'''
(Tables that don't fit anywhere else.)
* interpro
== Fundamental tables ==
Contains information about BAC clones inside the EnsEMBL database. As this was the way that DNA came into EnsEMBL for the human genome, all DNA has to be specified in terms of BAC clones, even if they don't exist. Also, some of the BAC clones in early versions of the EnsEMBL database haven't been submitted to EMBL but were Sanger Institute internal BAC names. This is why clone has two different possible identifiers (and versions for each of them). Note that although there are many dates inside this table, they are not well maintained. The htg_phase column describes whether High Throughput Genomic sequencing is finished or unfinished: draft is 1, 2 or 3, finished is 4.
A contig is a piece of DNA inside a clone that is contiguous, i.e. not interrupted by sequencing gaps. One or more contigs make up the BAC clone sequence. Because of this historical importance, contigs are currently the reference coordinate system for all features inside EnsEMBL, although not all species come in clones/contigs. Contigs link directly to dna table entries, which contain the actual sequence information. Currently, clone and contig entries must be faked for genomes that don't have these things. The sequence is that of the contig, not that of the golden path, i.e. to construct the golden path from the dna entries, the sequence of contigs with an orientation of -1 must be reversed and bases complemented. The assembly table has the contig orientation (raw_ori). Note that the length of the dna.sequence field is always equal to the corresponding length field in the contig table.
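The reverse-and-complement step for contigs with an orientation of -1 can be sketched in Python as follows. This is a minimal illustration, not EnsEMBL code; the function name is hypothetical and only A, C, G, T and N are handled.

```python
# Complement table for the four bases plus N; other IUPAC codes are not handled.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def golden_path_sequence(contig_sequence, raw_ori):
    """Return a contig's sequence as it reads along the golden path.

    raw_ori is the orientation from the assembly table: 1 keeps the stored
    sequence, -1 reverses it and complements each base.
    """
    if raw_ori == 1:
        return contig_sequence
    if raw_ori == -1:
        return contig_sequence.translate(_COMPLEMENT)[::-1]
    raise ValueError(f"unexpected orientation: {raw_ori!r}")


print(golden_path_sequence("AACGT", -1))  # ACGTT
```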
Contains DNA sequence. This table has a 1:1 relationship with the contig table.
'''See also:'''
* contig - 1:1 relationship to the dna table.
* external_synonym - Allows xrefs to have more than one name
Describes how contig sequences make up the chromosomal sequence. The data in this table defines the "static golden path", i.e. the best effort draft full genome sequence as determined by the UCSC or NCBI (depending on which assembly you are using). Each row represents a contig (raw_id, FK from contig table), at least part of which is present in the golden path. The part of the contig that is in the path is delimited by the fields raw_start and raw_end, and its absolute position within the golden path chromosome (chromosome_id) is given by chr_start and chr_end. Each contig is in a "supercontig" such as a "fingerprint clone contig" or NT contig. The supercontig is identified by the superctg_name column, and the position of the specified section of the contig within its supercontig is given by the fields superctg_start and superctg_end.
'''See also:'''
* contig -
* co-ordinates - Chromosome start and end positions are stored in chromosomal co-ordinates whereas the contig start and end positions are stored in contig co-ordinates.
* supercontigs - The mapping between contigs and supercontigs is also stored in the assembly table.
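To make the relationship between contig and chromosomal co-ordinates concrete, here is a small Python sketch that maps a contig position onto the chromosome using one assembly row. It assumes 1-based inclusive co-ordinates and that a raw_ori of -1 means the contig runs backwards along the chromosome; the `AssemblyRow` type and the function name are hypothetical.

```python
from typing import NamedTuple


class AssemblyRow(NamedTuple):
    """The subset of assembly columns needed for the mapping."""
    raw_start: int
    raw_end: int
    raw_ori: int
    chr_start: int
    chr_end: int


def contig_to_chromosome(row, contig_pos):
    """Map a contig position to a chromosomal position via one assembly row."""
    if not row.raw_start <= contig_pos <= row.raw_end:
        raise ValueError("position is outside the part of the contig in the golden path")
    offset = contig_pos - row.raw_start
    if row.raw_ori == 1:
        return row.chr_start + offset
    # Reverse orientation: the contig start maps to the chromosomal end.
    return row.chr_end - offset


forward = AssemblyRow(raw_start=101, raw_end=200, raw_ori=1, chr_start=5001, chr_end=5100)
print(contig_to_chromosome(forward, 150))  # 5050
```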
Describes chromosomes. Currently contains the name and length of each chromosome for the species. Currently also contains additional rows to allow genes to be stored even if the chromosome on which the gene appears is not positively identified.
Describes bands that can be stained on the chromosome.
Stores data about exons. Associated with transcripts via exon_transcript. Allows access to contigs via the contig_* columns. The sticky_rank differentiates between fragments of the same exon, i.e. for exons that span multiple contigs, all the fragments are in this table with the same id, but different sticky_rank values.
'''See also:'''
* exon_transcript - Used to associate exons with transcripts.
* sticky_rank - Differentiates between exons that span multiple contigs.
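The way sticky_rank ties together the fragments of one exon can be sketched as below. The rows are plain dictionaries standing in for database rows; the `exon_id` and `contig_id` keys and the function name are assumptions made for illustration.

```python
from collections import defaultdict


def group_sticky_exons(rows):
    """Group exon rows by exon_id, ordering each exon's fragments by sticky_rank."""
    fragments = defaultdict(list)
    for row in rows:
        fragments[row["exon_id"]].append(row)
    return {
        exon_id: sorted(parts, key=lambda r: r["sticky_rank"])
        for exon_id, parts in fragments.items()
    }


# Exon 7 spans two contigs; exon 8 lies on a single contig.
rows = [
    {"exon_id": 7, "sticky_rank": 2, "contig_id": 11},
    {"exon_id": 7, "sticky_rank": 1, "contig_id": 10},
    {"exon_id": 8, "sticky_rank": 1, "contig_id": 11},
]
grouped = group_sticky_exons(rows)
```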
Relates exon IDs in this release to release-independent stable identifiers.
'''See also:'''
* stable_id - Describes the rationale behind the use of stable identifiers.
TBC Note that a transcript is usually associated with a translation, but may not be, e.g. in the case of pseudogenes and RNA genes (those that code for RNA molecules).
Relates transcript IDs in this release to release-independent stable identifiers.
'''See also:'''
* stable_id - Describes the rationale behind the use of stable identifiers.
Relationship table linking exons with transcripts. The rank column indicates the 5' to 3' position of the exon within the transcript, i.e. a rank of 1 means the exon is the 5' most within this transcript.
'''See also:'''
* exon - One of the entities related by the exon_transcript table.
* transcript - One of the entities related by the exon_transcript table.
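As a small sketch of how the rank column is used, the following Python function returns the exons of one transcript in 5' to 3' order. The `exon_id` and `transcript_id` keys are assumed column names, and the rows are plain dictionaries rather than real query results.

```python
def exons_in_transcript_order(exon_transcript_rows, transcript_id):
    """Return the exon IDs of a transcript, ordered 5' to 3' by rank."""
    rows = [r for r in exon_transcript_rows if r["transcript_id"] == transcript_id]
    return [r["exon_id"] for r in sorted(rows, key=lambda r: r["rank"])]


exon_transcript = [
    {"exon_id": 30, "transcript_id": 1, "rank": 3},
    {"exon_id": 10, "transcript_id": 1, "rank": 1},
    {"exon_id": 20, "transcript_id": 1, "rank": 2},
    {"exon_id": 20, "transcript_id": 2, "rank": 1},
]
print(exons_in_transcript_order(exon_transcript, 1))  # [10, 20, 30]
```

The same exon can appear in more than one transcript, as exon 20 does here, which is why the relationship needs its own table.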
Allows transcripts to be related to genes.
Relates gene IDs in this release to release-independent stable identifiers.
'''See also:'''
* stable_id - Describes the rationale behind the use of stable identifiers.
Where appropriate, allows specific genes to be given a description.
'''See also:'''
* gene - for the actual gene.
Describes which parts of which exons are used in translation. The seq_start and seq_end columns are 1-based offsets into the *relative* coordinate system of start_exon_id and end_exon_id, i.e. if the translation starts at the first base of the exon, seq_start would be 1.
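The 1-based, exon-relative offsets can be illustrated with a short Python sketch that cuts the translated region out of a transcript's exons. It assumes the exon sequences are already in transcript orientation and in 5' to 3' order, and that seq_end is inclusive; the function name is hypothetical.

```python
def coding_sequence(exons, start_exon_id, seq_start, end_exon_id, seq_end):
    """Return the translated part of a transcript.

    exons is a list of (exon_id, sequence) pairs in 5' to 3' order.
    seq_start is a 1-based offset into the start exon, and seq_end is a
    1-based inclusive offset into the end exon.
    """
    parts = []
    inside = False
    for exon_id, sequence in exons:
        begin, end = 0, len(sequence)
        if exon_id == start_exon_id:
            inside = True
            begin = seq_start - 1  # convert the 1-based offset to a 0-based index
        if not inside:
            continue  # exon lies entirely in the 5' untranslated region
        if exon_id == end_exon_id:
            end = seq_end
        parts.append(sequence[begin:end])
        if exon_id == end_exon_id:
            break
    return "".join(parts)


exons = [(1, "AAATG"), (2, "CCC"), (3, "TAAGG")]
print(coding_sequence(exons, 1, 3, 3, 3))  # ATGCCCTAA
```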
Relates translation IDs in this release to release-independent stable identifiers.
'''See also:'''
* stable_id - Describes the rationale behind the use of stable identifiers.
Describes the exon prediction process by linking exons to DNA or protein alignment features. As in several other tables, the feature_id column is a foreign key; the feature_type column specifies which table feature_id refers to.
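The feature_type/feature_id pattern amounts to choosing the target table at query time. The sketch below shows one way to do that safely in Python; it assumes that feature_type holds the target table's name and that each table's primary key is named `<table>_id`, neither of which is confirmed by this document.

```python
# Assumed set of tables that feature_type may name.
FEATURE_TABLES = {"dna_align_feature", "protein_align_feature"}


def supporting_feature_query(feature_type, feature_id):
    """Build a parameterised query for the feature a supporting_feature row points at."""
    if feature_type not in FEATURE_TABLES:
        # Checking against a fixed set keeps arbitrary text out of the SQL string.
        raise ValueError(f"unknown feature_type: {feature_type!r}")
    sql = f"SELECT * FROM {feature_type} WHERE {feature_type}_id = %s"
    return sql, (feature_id,)


sql, params = supporting_feature_query("dna_align_feature", 42)
```

Because the table name cannot be passed as a query parameter, validating it against a known set is what keeps the string formatting safe.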
Stores data about the data in the current schema. Taxonomy information, version information and the default value for the type column in the assembly table are stored here. Unlike other tables, data in the meta table is stored as key/value pairs.
'''See also:'''
* assembly - The default value for assembly.type is stored in the meta table.
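Because the meta table holds key/value pairs, reading it is closer to loading a configuration than to querying an ordinary table. A minimal Python sketch follows, assuming each row is a (key, value) pair and that a key may occur on more than one row; the keys and values shown are made up for illustration.

```python
from collections import defaultdict


def load_meta(rows):
    """Collect (key, value) rows into {key: [values]}, preserving row order."""
    meta = defaultdict(list)
    for meta_key, meta_value in rows:
        meta[meta_key].append(meta_value)
    return dict(meta)


# Hypothetical rows; real key names and values may differ.
rows = [
    ("assembly.default", "EXAMPLE_ASSEMBLY"),
    ("species.classification", "sapiens"),
    ("species.classification", "Homo"),
]
meta = load_meta(rows)
default_assembly_type = meta["assembly.default"][0]
```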
== Features and analyses ==
Usually describes a program and some database that together are used to create a feature on a piece of sequence. Each feature is marked with an analysis_id. The most important column is logic_name, which is used by the webteam to render a feature correctly on contigview (or even retrieve the right feature). Logic_name is also used in the pipeline to identify the analysis which has to run in a given status of the pipeline. The module column tells the pipeline which Perl module does the whole analysis, typically a RunnableDB module.
Stores DNA sequence alignments generated from Blast (or Blast-like) comparisons.
'''See also:'''
* cigar_line - Used to encode gapped alignments.
Stores translation alignments generated from Blast (or Blast-like) comparisons.
'''See also:'''
* cigar_line - Used to encode gapped alignments.
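As a rough illustration of how a cigar_line might be decoded, the Python sketch below parses a run-length string into (count, operation) pairs. The format is an assumption here: a sequence of optional counts followed by one of M (match), I (insertion) or D (deletion), with a missing count meaning 1. The concept description at the end of this document is the authority on the real encoding.

```python
import re

# One token: an optional run length followed by an operation letter.
_CIGAR_TOKEN = re.compile(r"(\d*)([MID])")


def parse_cigar(cigar_line):
    """Split a cigar string into a list of (count, operation) pairs."""
    operations = []
    position = 0
    for match in _CIGAR_TOKEN.finditer(cigar_line):
        if match.start() != position:
            raise ValueError(f"unexpected text at offset {position}")
        operations.append((int(match.group(1) or 1), match.group(2)))
        position = match.end()
    if position != len(cigar_line):
        raise ValueError(f"unexpected text at offset {position}")
    return operations


print(parse_cigar("3M2D5M"))  # [(3, 'M'), (2, 'D'), (5, 'M')]
```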
Describes sequence repeat regions.
Used to describe marker positions.
'''See also:'''
* marker - Stores details about the markers themselves.
* marker_map_location -
* marker_synonym - Holds alternative names for markers.
Describes Quantitative Trait Loci (QTL) positions as obtained from inbreeding experiments. Note the values in this table are in chromosomal co-ordinates. Also, this table is not populated in all schemas.
'''See also:'''
* qtl - Describes the markers used to define a QTL.
* qtl_synonym - Stores alternative names for QTLs
Stores information about ab initio gene transcript predictions.
Describes general genomic features that don't fit into any of the more specific feature tables.
Describes features on the translations (as opposed to the DNA sequence itself), i.e. parts of the peptide. Positions are given in peptide co-ordinates rather than contig co-ordinates.
'''See also:'''
* analysis - Describes how protein features were derived.
* co-ordinates -
Describes the markers (of which there may be up to three) which define Quantitative Trait Loci. Note that QTL is a statistical technique used to find links between certain expressed traits and regions in a genetic map.
'''See also:'''
* qtl_synonym - Describes alternative names for QTLs
Describes alternative names for Quantitative Trait Loci (QTLs).
Stores data about the marker itself - e.g. the primer sequences used.
'''See also:'''
* marker_synonym - Stores alternative names for markers.
* marker_map_location -
Allows storage of information about the position of a marker.
'''See also:'''
* marker - Stores marker data.
Stores alternative names for markers, as well as their sources.
'''See also:'''
* marker - Stores the original marker.
Stores consensus sequences obtained from analysing repeat features.
== ID Mapping ==
Tables involved in mapping identifiers between releases
Stores details of ID mapping sessions - a mapping session represents the session when stable IDs were mapped from one database to another. Details of the "old" and "new" databases are stored.
'''See also:'''
* stable_id_event - Stores details of what happened during the mapping session.
* stable_id - Describes the need for ID mapping.
Represents what happened to all gene, transcript and translation stable IDs during a mapping session. This includes which IDs were deleted, created and related to each other. Each event is represented by one or more rows in the table.
'''See also:'''
* mapping_session - Describes the session when events stored in this table occurred.
Contains a snapshot of the stable IDs associated with genes deleted or changed between releases. Includes gene, transcript and translation stable IDs.
Contains the peptides for deleted or changed translations.
== External references ==
Tables used for storing links to and details about objects that are stored in other databases
Holds data about objects which are external to EnsEMBL, but need to be associated with EnsEMBL objects. Information about the database that the external object is stored in is held in the external_db table entry referred to by the external_db column.
'''See also:'''
* external_db - Describes the database that xrefs are stored in
* external_synonym - Allows xrefs to have more than one name
Stores data about the external databases in which the objects described in the xref table are stored.
'''See also:'''
* xref - Holds data about the external objects that are stored in the external_dbs.
Some xref objects can be referred to by more than one name. This table relates names to xref IDs.
'''See also:'''
* xref - Holds most of the data about xrefs.
Describes links between EnsEMBL objects and objects held in external databases. The EnsEMBL object can be one of several types; the type is held in the ensembl_object_type column. The ID of the particular EnsEMBL gene, translation or whatever is given in the ensembl_id column. The xref_id points to the entry in the xref table that holds data about the external object. Each EnsEMBL object can be associated with zero or more xrefs. An xref object can be associated with one or more EnsEMBL objects.
'''See also:'''
* xref - Stores the data about each externally-referenced object.
* go_xref - Stores extra data for relationships to GO objects.
* identity_xref - Stores data about how 'good' the relationships are
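The many-to-many link described above can be sketched in Python as a lookup from one EnsEMBL object to its xrefs. Rows are plain dictionaries standing in for query results; the object type value "Gene" and the `display_id` key are assumptions made for illustration.

```python
def xrefs_for_object(object_xref_rows, xref_rows, ensembl_object_type, ensembl_id):
    """Return the xref rows linked to one EnsEMBL object (possibly none)."""
    xref_by_id = {xref["xref_id"]: xref for xref in xref_rows}
    return [
        xref_by_id[link["xref_id"]]
        for link in object_xref_rows
        if link["ensembl_object_type"] == ensembl_object_type
        and link["ensembl_id"] == ensembl_id
    ]


xrefs = [
    {"xref_id": 1, "display_id": "EXAMPLE_A"},
    {"xref_id": 2, "display_id": "EXAMPLE_B"},
]
object_xrefs = [
    {"ensembl_object_type": "Gene", "ensembl_id": 100, "xref_id": 1},
    {"ensembl_object_type": "Gene", "ensembl_id": 100, "xref_id": 2},
    {"ensembl_object_type": "Gene", "ensembl_id": 101, "xref_id": 2},
]
linked = xrefs_for_object(object_xrefs, xrefs, "Gene", 100)
```

Both ensembl_object_type and ensembl_id are needed in the filter, since the same numeric ID may be used by objects of different types.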
Links between EnsEMBL objects and external objects produced by GO (Gene Ontology) require some additional data which is not stored in the object_xref table.
'''See also:'''
* object_xref - Stores basic, non GO-specific information for GO xrefs
* GO - Gene Ontology website
Describes how well a particular xref object matches the EnsEMBL object.
'''See also:'''
* object_xref - Stores basic information about EnsEMBL object-xref mapping
== Miscellaneous ==
Tables that don't fit anywhere else.
Allows storage of links to the InterPro database. InterPro is a database of protein families, domains and functional sites in which identifiable features found in known proteins can be applied to unknown protein sequences.
'''See also:'''