Commit fc6ced68 authored by Arne Stabenau's avatar Arne Stabenau
Browse files

Merged in the new-seqstore into HEAD version

parent ef4bc069

Too many changes to show.

To preserve performance only 121 of 121+ files are displayed.
The most up-to-date version of this document can be found on
branch-new-seqstore. It has been moved there so it does not have to be
maintained in two locations.
\ No newline at end of file
ENSEMBL - API Change Specification
==================================
CONTENTS
--------
Introduction
Goals
Schema Modifications
Proposed New/Modified Tables
seq_region
coord_system
seq_region_annotation
dna
assembly
gene
transcript
translation
all feature tables
meta_coord
misc_feature
misc_set
misc_feature_misc_set
misc_attrib
Removed Tables
contig
clone
chromosome
Meta Information
API Changes
Slice
Tile
SliceAdaptor
RawContig
RawContigAdaptor
Clone
CloneAdaptor
Chromosome
ChromosomeAdaptor
Root
Storable Base Class
Features
transform
transfer
move
project
StickyExon
AssemblyMapper
FeatureAdaptors
CoordSystemAdaptor
New Features
Assembly Exceptions
Haplotypes
Pseudo Autosomal Regions
Multiple Assemblies
Other Considerations
Loci
INTRODUCTION
------------
This document describes the changes that are being made to the EnsEMBL core
schema and Perl/Java/C APIs.
GOALS
-----
-A cleaner, more intuitive API
-A more general schema able to better capture divergent assembly types
-More flexibility with regards to assembly related data such as haplotypes,
PARs, WGS assemblies etc.
SCHEMA MODIFICATIONS
--------------------
Proposed New/Modified Tables:
-----------------------------
seq_region
----------
The seq_region table is a generic replacement for the clone, contig,
and chromosome tables. Additionally supercontigs which were formerly in the
assembly table are also present in this table. The name column can contain
chromosome names, clone accessions, supercontig names or anything that is
appropriate for the seq_region it describes. The coord_system_id is a
foreign key to the new coordinate system and is used to distinguish
between the divergent types of sequence regions in the table.
seq_region_id int
name varchar
coord_system_id int references coord_system table
length int
coord_system
------------
The coordinate system table lists the available coordinate systems in the
database. The attrib is mysql set and is used to denote the default version
of each named coordinate system. E.g. there may be two 'chromosome' coordiate
systems and the default may be version 'NCBI34'. The 'top_level' and
sequence level attribs denote the coordinate system from which sequence is
retrieved and the coordinate system which has the largest assembled pieces.
The top_level coordinate system will usually be 'chromosome' but for some
shrapnel assemblied this may be something like 'supercontig' or 'clone'.
There may be multiple toplevel coordinate systems providing that they share
the same name (but different version) and providing one of them is the
default. There may only be a single sequence level coordinate system.
Note that the version in the coordinate system can be viewed as applying
to every seq_region of a given coordinate system. It is analagous to
a CVS tag, not a CVS version. E.g. The version would 'NCBI33' apply to
every chromosome seq_region so it is a valid version. A clone accession of
'8' would not be a valid version because it only describes a particular
seq_region of the coordinate system - not all of them.
coord_system_id int
name varchar
version varchar
attrib set ('top_level', 'default_version', 'sequence_level')
seq_region_annotation
---------------------
This table allows for extra arbitrary information to be attached to
seq_regions. For example the htg_phase was formerly part of the clone table
but now is stored in this table.
seq_region_id int
attrib_type_id smallint references attrib_type table
value varchar
dna
---
Formerly the contig table referenced the dna table. Now the dna table
refrences the seq_region_table. Every seq_region which has a coordinate
system with the 'sequence_level' attrib should be referenced by an entry in
the dna table.
seq_region_id int
sequence varchar
assembly
--------
The assembly table has been made more generic. Columns that previously
were names chr_* and contig_* have been renamed asm_* and cmp_* (assembled
and component) respectively. The superctg_name column has been removed.
Supercontigs are now defined in the seq_region table.
The makeup of all seq_regions from smaller seq_regions can be described in
this table. The relationships which are explicitly defined must be listed
in the meta table. For example, the clone <-> contig mapping used to be
defined in the contig table with an embl_offset column. This information is
now found in this table instead.
asm_seq_region_id int
asm_start int
asm_end int
cmp_seq_region_id int
cmp_start int
cmp_end int
ori tinyint
gene
----
For faster retrieval and retrieval independently of transcripts and
exons, genes have a seq_region_id, seq_region_start and seq_region_end
which defines the span of their transcript.
The transcript_count column has been removed as it was never used.
gene_id int
type varchar
analysis_id int
seq_region_id int
seq_region_start int
seq_region_end int
seq_region_strand tinyint
display_xref_id int
transcript
----------
For faster retrieval and retrieval independently of genes and exons
transcripts also have a seq_region_id, seq_region_start and
seq_region_end. The translation_id has been removed; translations will point
to transcripts instead (and pseudogenes will have no translation).
The exon_count column has been removed as it was never used.
transcript_id int
gene_id int
seq_region_id int
seq_region_start int
seq_region_end int
seq_region_strand tinyint
display_xref_id int
translation
-----------
Translations now reference transcripts rather than transcripts referencing
a single (or no) translation. This allows for more elegant handling of
pseudogenes (where there is no translation) and also can be used to supply
multiple translations for a single transcript (e.g. polycistronic genes).
translation_id int
transcript_id int
start_exon_id int
end_exon_id int
seq_start int
seq_end int
all feature tables
------------------
All feature tables would now have seq_region_id, seq_region_start,
seq_region_end, seq_region_strand instead of contig_id, contig_start,
contig_end. This includes the repeat_feature, simple_feature,
dna_align_feature, protein_align_feature, exon, marker_feature,
karyotype and qtl_feature tables.
meta_coord
----------
The meta coord table defines what coordinate systems are used to store each
type of feature. A given type of feature may be stored in multiple
coordinate systems, but these will not be retrieved by the API unless there
is an entry in the meta_coord table.
table_name varchar
coord_system_id int
misc_feature
------------
This is a renaming of the mapfrag table. The renaming reflects the fact that
this table can be used to store any type of feature.
misc_feature_id
seq_region_id
seq_region_start
seq_region_end
seq_region_strand
misc_set
--------
This table was formerly names mapset. It defines 'sets' that can be used
to group misc_features together.
misc_set_id smallint
code varchar
name description
description text
max_length int
misc_feature_misc_set
---------------------
This is a link table defining the many-to-many relationship between the
misc_set and misc_feature tables.
misc_feature_id int
misc_set_id smallint
misc_attrib
-----------
This table was formerly named mapfrag_annotation. It contains arbitrary
annotations of misc_features and links to the same attrib_type table that
the seq_region_attrib table uses.
misc_feature_id int
attrib_type_id smallint
value varchar
Removed Tables
--------------
contig
------
Contigs are no longer needed. They are stored as entries in the seq_region
table with type 'contig'. The embl_offset and clone_id will not be
necessary as their relationship to clones can be described by the
assembly table.
clone
-----
Clones are no longer needed. Clones are stored as entries in the seq_region
table with coord_system 'clone'. The modified timestamp will be discarded
as it is no longer maintained anyway. The embl_acc, version, and
embl_version columns are redundant and will also be discarded. Versions
are simply appended onto the end of the name with a delimiting '.'.
Any additional information that needs to be present (such as htg_phase) can
be added to the seq_region_attrib table.
chromosome
----------
This table is no longer needed. Chromosomes can be stored in the
seq_region table with a 'chromosome' coord_system.
META INFORMATION
----------------
Considerable more meta information is stored in the core
databases in order for the general approach to be maintained.
This information is stored in the new coord_system table and in the
meta, and meta_coord tables.
Meta information includes the following:
* The coordinate system that features of a given type are stored in. This
information is stored in the meta_coord table and is used when constructing
queries for a particular feature table.
* The top-level coordinate system. For human
this would be 'chromosome'. For briggsae this may be something like
'scaffold' or 'super contig'. This information would be used to construct
the web display and would possibly be the default coordinate system when
a coordinate system is unspecified by a user. This is stored as a flag
in the coord_system table.
* The default version of each coordinate system. This is stored as a flag
in the coord_system table.
* The coord_system where sequence is stored. This will be stored as a
flag in the coord_system table. Initially it will only be possible
to have a single coord_system in which sequence is stored. This
may be extended in the future to allow sequence to be stored for multiple
coord_systems.
* The coordinate system relationships between that are explicitly defined
in the assembly table. The new API is capable of 2 step (implicit) mapping
between coordinate systems, but these relationships can be determined
through the direct relationship information.
For example the clone, chromosome and nt_contig coordinate systems may all
be constructed from the contig coordinate system:
contig -> clone
contig -> chromosome
contig -> nt_contig
Or there may be a more hierarchical approach:
contig -> clone
clone -> nt_contig
nt_contig -> chromosome
This information is stored in the meta table under the key
'assembly.mapping' with the following format (versions are optional):
assembled_coord_system_name[:version]|component_coord_system_name[:version]
For example the meta table for human might contain the following entries:
mysql> select * from meta where meta_key = 'assembly.mapping';
+---------+------------------+--------------------------+
| meta_id | meta_key | meta_value |
+---------+------------------+--------------------------+
| 43 | assembly.mapping | chromosome:NCBI33|contig |
| 44 | assembly.mapping | clone|contig |
| 45 | assembly.mapping | supercontig|contig |
+---------+------------------+--------------------------+
* The names of the allowable coordinate systems. This would allow for
quick validation of API requests and provide a list that could be used
by the website for coordinate system selection. This information will be
stored in the coord_system table.
* The coordinate system(s) that each feature type is stored in. This is
stored in the meta_coord table.
API CHANGES
-----------
Slice
-----
Slice methods chr_start, chr_end, chr_name will be renamed start, end,
seq_region_name. For backwards compatibility the old methods are
chained to the new methods with deprecated warnings.
A new slice method 'coord_system' will be added and will return a
Bio::EnsEMBL::CoordSystem object.
Slices will represent a region on a seq_region as opposed to a region on a
chromosome. Slices will be immutable (i.e. their attributes will not be
changeable). A new slice will have to be created if the attributes are to
be changed.
The following attributes will therefore define a unique slice:
coord_system (e.g. object with name and version)
seq_region_name (e.g. 'X' or 'AL035554.1')
start (e.g. 1000000 or 1)
end (e.g. 2000001 or 800)
strand (e.g. 1 or -1)
The name method will return the above values joined by a ':' delimiter, and
will not be settable:
e.g. 'chromosome:NCBI33:X:1000000:2000001:1' or 'clone::AL035554.1:1:800:-1'
This value can be used as a hashvalue that uniquely defines a slice.
The concept of an 'empty' slice will no longer exist.
The get_tiling_path method will be deprecated in favour of a more general
method project(). Whereas get_tiling_path() implies a relationship between
an assembly and the coordinate system which makes up the assembly the
project method will allow conversion accross any two coordinate systems.
It will take a coord_system string as an argument and rather
than returning a list of Tile objects it will return a listref of triplets
containing a start int, and end int, and a 'to' slice object. The following
is an example of how this method would be used ($clone is a reference to a
slice object in the clone coordinate system):
my $clone_path = $slice->project('clone');
foreach my $segment (@$clone_path) {
my ($start, $end, $clone) = @$segment;
print $slice->seq_region_name, ':', $start, '-', $end , ' -> ',
$clone->seq_region_name, ':', $clone->start, '-', $clone->end,
$clone->strand, "\n";
}
An optional second argument to project() will be the coordinate
system version. E.g.:
$ncbi34_path = $slice->project('chromosome','NCBI34').
Tile
----
The tile object will no longer be necessary. However for backwards
compatibility it will remain in the system for some time before being phased
out along with the get_tiling_path method.
SliceAdaptor
------------
The Slice adaptor must provide a method to fetch a slice via its coordinate
system, seq_region_name, start, end, and strand.
The old, commonly used method fetch_by_chr_start_end has been altered to
simply chain to this new method (with a warning) as do most other
SliceAdaptor methods.
Another method which is necessary with the disapearence of the Clone,
RawContig and Chromosome adaptors is one which allows for all slices
of a certain type to be retrieved. For example it is often necessary to
retrieve all chromosomes, or clones for a species. This method is simply
named fetch_all. The old fetch_all methods on the ChromosomeAdaptor,
RawContigAdaptor, CloneAdaptor, etc. chain to the new method for backwards
compatibility.
Method Names and Signatures
---------------------------
Slice fetch_by_region(coord_system, name)
Slice fetch_by_region(coord_system, name, start)
Slice fetch_by_region(coord_system, name, start, end)
Slice fetch_by_region(coord_system, name, start, end, strand)
Slice fetch_by_region(coord_system, name, start, end, strand, version)
listref of Slices fetch_all(coord_system)
listref of Slices fetch_all(coord_system, version)
RawContig
---------
The RawContig object is no longer necessary with the new system. RawContigs
are replaced by Slices with coord_system = 'contig'. In the interests of
backwards compatibility the RawContig class will still be present for
sometime as a minimal implmentation inheriting from the Slice class.
RawContigAdaptor
----------------
The RawContigAdaptor is no longer necessary. The RawContigAdaptor is
replaced by the SliceAdaptor. For backwards compatibility a minimal
implementation of the RawContigAdaptor will remain which inherits from the
SliceAdaptor.
Clone
-----
The Clone object is no longer necessary in the new system. Clones are
replaced by Slices with coord_system = 'clone'. For backwards compatibility
a minimal implementation will remain which inherits from the Slice object.
CloneAdaptor
------------
The CloneAdaptor object is no longer necessary in the new system. The
CloneAdaptor is replaced by the SliceAdaptor. For backwards compatibility
a minimal implementation will remain which inherits from the SliceAdaptor.
Chromosome
----------
The Chromosome object is no longer necessary in the new system. The
Chromosome is replaced by Slices with coord system 'chromosome' (or
whatever the top level seq_region type is for that species). For backwards
compatibility a minimal implementation will remain which inherits from the
Slice object.
Statistical information (e.g. known genes, genes, snps) that
was on chromosomes may be stored in the seq_region_attrib table or
in some sort of density table.
ChromosomeAdaptor
-----------------
The Chromosome object is no longer necessary in the new system. The
ChromosomeAdaptor is replaced by the SliceAdaptor. For backwards
compatibility a minimal implementation which inherits from the SliceAdaptor
will remain.
Root
----
Every class in the current EnsEMBL perl API inherits directly or indirectly
from Bio::EnsEMBL::Root. This inheritance is almost exclusively for the
following following three methods:
throw
warn
_rearrange
Nothing is gained by implementing this relationship as inheritance, and there
are several disadvantages:
(1) Everything must inherit from this class to use those 3 object methods.
This can result in patterns of multiple inheritance which are generally
considered to be a bad thing.
(2) It is not possible to use the throw, warn or rearrange method within
the constructor until the object is blessed. Blessing the object first
and then calling rearrange to extract named arguments is slower because
the blessed hash needs to be expanded as more keys are added and several
key access/value assignements may need to be performed.
(3) Objects become larger and object construction becomes slightly slower
because constructors traverse an additional level of inheritance.
A better approach, which we have used, is to make the methods static and
create a static utility class that exports the methods. The warn method has
been renamed warning so as not to conflict with the builtin perl function
warn and the _rearrange method has be renamed rearrange.
The following is an example of the old styl Root inheritance and the new
style static utility methods:
#
# OLD STYLE
#
package Old;
use Bio::EnsEMBL::Root;
@ISA = qw(Bio::EnsEMBL::Root);
sub new {
my $caller = shift;
my $class = ref($caller) || $caller;
$self = $class->SUPER::new(@_);
my ($start, $end) = $self->_rearrange(['START', 'END'], @_);
if(!defined($start) || !$defined($end)) {
$self->throw('-START and -END arguments are required');
}
$self->{'start'} = $start;
$self->{'end'} = $end;
return $self;
}
#
# NEW STYLE
#
package New;
use Bio::EnsEMBL::Utils::Exception qw(throw warning);
use Bio::EnsEMBL::Utils::Argument qw(rearrange);
sub new {
my $caller = shift;
my $class = ref($caller) || $caller;
my ($start, $end) = rearrange(['START', 'END'], @_);
if(!defined($start) || !defined($end)) {
throw('-START and -END arguments are required');
}
return bless {'start' => $start, 'end' => $end}, $class;
}
The calls to $self->rearrange $self->warn and $self->throw have been
replaced by class method calls to warning() and throw() inside the core API.
However, for backwards compatibility the existance inheritance to
Bio::EnsEMBL::Root will remain in many cases (and be removed at a later date)
Storable Base Class
-------------------
Almost all business objects in the EnsEMBL system are storable in the db
and the ones which are always require 2 methods: dbID() and adaptor(). These
methods have been moved to a Storable base class which most of the
business objects now inherit from. This module has an additional method
is_stored() which takes a database argument and returns true if the object
appears to already have been stored in the provided database.
Features
--------
All features should inherit from a base class that implements common feature
functionality. Formerly this role was filled by the bloated SeqFeature class
which inherits from Bio::SeqFeature and Bio::SeqFeatureI etc.
This class has been replaced by a smaller, less complicated
implementation named Feature. To make classes more polymorphic in general,
the gene, and transcript objects should now also inherit from the Feature
class. This class implements the following core methods common to all
features:
start
end
strand
slice (formerly named contig/entire_seq/etc.)
transform
transfer
project
analysis
The feature class inherits from the Storable base class and thereby inherits
the following methods:
adaptor
dbID
is_stored
The signature and behaviour of the transform method has been changed. The
existing method works differently depending on the arguments passed as
described below.
OLD transform(no arguments)
-----------------------