From 987b78576644bb42c8c58bf3354d99ca1228a92b Mon Sep 17 00:00:00 2001 From: Magali Ruffier <mr6@ebi.ac.uk> Date: Thu, 29 Sep 2016 16:52:49 +0100 Subject: [PATCH] file not used any more --- docs/ensembl_changes_spec.txt | 955 ---------------------------------- 1 file changed, 955 deletions(-) delete mode 100644 docs/ensembl_changes_spec.txt diff --git a/docs/ensembl_changes_spec.txt b/docs/ensembl_changes_spec.txt deleted file mode 100644 index 0670adfca6..0000000000 --- a/docs/ensembl_changes_spec.txt +++ /dev/null @@ -1,955 +0,0 @@ -ENSEMBL - API Change Specification -================================== - -CONTENTS --------- - -Introduction -Goals -Schema Modifications - Proposed New/Modified Tables - seq_region - coord_system - seq_region_annotation - dna - assembly - gene - transcript - translation - all feature tables - meta_coord - misc_feature - misc_set - misc_feature_misc_set - misc_attrib - Removed Tables - contig - clone - chromosome -Meta Information -API Changes - Slice - Tile - SliceAdaptor - RawContig - RawContigAdaptor - Clone - CloneAdaptor - Chromosome - ChromosomeAdaptor - Root - Storable Base Class - Features - transform - transfer - move - project - StickyExon - AssemblyMapper - FeatureAdaptors - CoordSystemAdaptor -New Features - Assembly Exceptions - Haplotypes - Pseudo Autosomal Regions - Multiple Assemblies -Other Considerations - Loci - - -INTRODUCTION ------------- - -This document describes the changes that are being made to the EnsEMBL core -schema and Perl/Java/C APIs. - -GOALS ------ --A cleaner, more intuitive API --A more general schema able to better capture divergent assembly types --More flexibility with regards to assembly related data such as haplotypes, - PARs, WGS assemblies etc. - -SCHEMA MODIFICATIONS --------------------- - -Proposed New/Modified Tables: ------------------------------ - - seq_region - ---------- - The seq_region table is a generic replacement for the clone, contig, - and chromosome tables. Additionally supercontigs which were formerly in the - assembly table are also present in this table. The name column can contain - chromosome names, clone accessions, supercontig names or anything that is - appropriate for the seq_region it describes. The coord_system_id is a - foreign key to the new coordinate system and is used to distinguish - between the divergent types of sequence regions in the table. - - seq_region_id int - name varchar - coord_system_id int references coord_system table - length int - - - coord_system - ------------ - The coordinate system table lists the available coordinate systems in the - database. The attrib is mysql set and is used to denote the default version - of each named coordinate system. E.g. there may be two 'chromosome' coordiate - systems and the default may be version 'NCBI34'. The 'top_level' and - sequence level attribs denote the coordinate system from which sequence is - retrieved and the coordinate system which has the largest assembled pieces. - The top_level coordinate system will usually be 'chromosome' but for some - shrapnel assemblied this may be something like 'supercontig' or 'clone'. - - There may be multiple toplevel coordinate systems providing that they share - the same name (but different version) and providing one of them is the - default. There may only be a single sequence level coordinate system. - - Note that the version in the coordinate system can be viewed as applying - to every seq_region of a given coordinate system. It is analagous to - a CVS tag, not a CVS version. E.g. The version would 'NCBI33' apply to - every chromosome seq_region so it is a valid version. A clone accession of - '8' would not be a valid version because it only describes a particular - seq_region of the coordinate system - not all of them. - - coord_system_id int - name varchar - version varchar - attrib set ('top_level', 'default_version', 'sequence_level') - - - seq_region_annotation - --------------------- - This table allows for extra arbitrary information to be attached to - seq_regions. For example the htg_phase was formerly part of the clone table - but now is stored in this table. - - seq_region_id int - attrib_type_id smallint references attrib_type table - value varchar - - - dna - --- - Formerly the contig table referenced the dna table. Now the dna table - refrences the seq_region_table. Every seq_region which has a coordinate - system with the 'sequence_level' attrib should be referenced by an entry in - the dna table. - - seq_region_id int - sequence varchar - - - assembly - -------- - The assembly table has been made more generic. Columns that previously - were names chr_* and contig_* have been renamed asm_* and cmp_* (assembled - and component) respectively. The superctg_name column has been removed. - Supercontigs are now defined in the seq_region table. - - The makeup of all seq_regions from smaller seq_regions can be described in - this table. The relationships which are explicitly defined must be listed - in the meta table. For example, the clone <-> contig mapping used to be - defined in the contig table with an embl_offset column. This information is - now found in this table instead. - - asm_seq_region_id int - asm_start int - asm_end int - cmp_seq_region_id int - cmp_start int - cmp_end int - ori tinyint - - gene - ---- - For faster retrieval and retrieval independently of transcripts and - exons, genes have a seq_region_id, seq_region_start and seq_region_end - which defines the span of their transcript. - - The transcript_count column has been removed as it was never used. - - gene_id int - type varchar - analysis_id int - seq_region_id int - seq_region_start int - seq_region_end int - seq_region_strand tinyint - display_xref_id int - - - transcript - ---------- - For faster retrieval and retrieval independently of genes and exons - transcripts also have a seq_region_id, seq_region_start and - seq_region_end. The translation_id has been removed; translations will point - to transcripts instead (and pseudogenes will have no translation). - - The exon_count column has been removed as it was never used. - - transcript_id int - gene_id int - seq_region_id int - seq_region_start int - seq_region_end int - seq_region_strand tinyint - display_xref_id int - - - translation - ----------- - Translations now reference transcripts rather than transcripts referencing - a single (or no) translation. This allows for more elegant handling of - pseudogenes (where there is no translation) and also can be used to supply - multiple translations for a single transcript (e.g. polycistronic genes). - - translation_id int - transcript_id int - start_exon_id int - end_exon_id int - seq_start int - seq_end int - - - all feature tables - ------------------ - All feature tables would now have seq_region_id, seq_region_start, - seq_region_end, seq_region_strand instead of contig_id, contig_start, - contig_end. This includes the repeat_feature, simple_feature, - dna_align_feature, protein_align_feature, exon, marker_feature, - karyotype and qtl_feature tables. - - meta_coord - ---------- - The meta coord table defines what coordinate systems are used to store each - type of feature. A given type of feature may be stored in multiple - coordinate systems, but these will not be retrieved by the API unless there - is an entry in the meta_coord table. - - table_name varchar - coord_system_id int - - - misc_feature - ------------ - This is a renaming of the mapfrag table. The renaming reflects the fact that - this table can be used to store any type of feature. - - misc_feature_id - seq_region_id - seq_region_start - seq_region_end - seq_region_strand - - misc_set - -------- - This table was formerly names mapset. It defines 'sets' that can be used - to group misc_features together. - - misc_set_id smallint - code varchar - name description - description text - max_length int - - misc_feature_misc_set - --------------------- - This is a link table defining the many-to-many relationship between the - misc_set and misc_feature tables. - - misc_feature_id int - misc_set_id smallint - - - misc_attrib - ----------- - This table was formerly named mapfrag_annotation. It contains arbitrary - annotations of misc_features and links to the same attrib_type table that - the seq_region_attrib table uses. - - misc_feature_id int - attrib_type_id smallint - value varchar - - - -Removed Tables --------------- - - contig - ------ - Contigs are no longer needed. They are stored as entries in the seq_region - table with type 'contig'. The embl_offset and clone_id will not be - necessary as their relationship to clones can be described by the - assembly table. - - clone - ----- - Clones are no longer needed. Clones are stored as entries in the seq_region - table with coord_system 'clone'. The modified timestamp will be discarded - as it is no longer maintained anyway. The embl_acc, version, and - embl_version columns are redundant and will also be discarded. Versions - are simply appended onto the end of the name with a delimiting '.'. - - Any additional information that needs to be present (such as htg_phase) can - be added to the seq_region_attrib table. - - chromosome - ---------- - This table is no longer needed. Chromosomes can be stored in the - seq_region table with a 'chromosome' coord_system. - - - -META INFORMATION ----------------- - -Considerable more meta information is stored in the core -databases in order for the general approach to be maintained. -This information is stored in the new coord_system table and in the -meta, and meta_coord tables. - -Meta information includes the following: - - * The coordinate system that features of a given type are stored in. This - information is stored in the meta_coord table and is used when constructing - queries for a particular feature table. - - * The top-level coordinate system. For human - this would be 'chromosome'. For briggsae this may be something like - 'scaffold' or 'super contig'. This information would be used to construct - the web display and would possibly be the default coordinate system when - a coordinate system is unspecified by a user. This is stored as a flag - in the coord_system table. - - * The default version of each coordinate system. This is stored as a flag - in the coord_system table. - - * The coord_system where sequence is stored. This will be stored as a - flag in the coord_system table. Initially it will only be possible - to have a single coord_system in which sequence is stored. This - may be extended in the future to allow sequence to be stored for multiple - coord_systems. - - * The coordinate system relationships between that are explicitly defined - in the assembly table. The new API is capable of 2 step (implicit) mapping - between coordinate systems, but these relationships can be determined - through the direct relationship information. - - For example the clone, chromosome and nt_contig coordinate systems may all - be constructed from the contig coordinate system: - contig -> clone - contig -> chromosome - contig -> nt_contig - Or there may be a more hierarchical approach: - contig -> clone - clone -> nt_contig - nt_contig -> chromosome - This information is stored in the meta table under the key - 'assembly.mapping' with the following format (versions are optional): - assembled_coord_system_name[:version]|component_coord_system_name[:version] - - For example the meta table for human might contain the following entries: - mysql> select * from meta where meta_key = 'assembly.mapping'; - +---------+------------------+--------------------------+ - | meta_id | meta_key | meta_value | - +---------+------------------+--------------------------+ - | 43 | assembly.mapping | chromosome:NCBI33|contig | - | 44 | assembly.mapping | clone|contig | - | 45 | assembly.mapping | supercontig|contig | - +---------+------------------+--------------------------+ - - * The names of the allowable coordinate systems. This would allow for - quick validation of API requests and provide a list that could be used - by the website for coordinate system selection. This information will be - stored in the coord_system table. - - * The coordinate system(s) that each feature type is stored in. This is - stored in the meta_coord table. - - -API CHANGES ------------ - -Slice ------ - Slice methods chr_start, chr_end, chr_name will be renamed start, end, - seq_region_name. For backwards compatibility the old methods are - chained to the new methods with deprecated warnings. - - A new slice method 'coord_system' will be added and will return a - Bio::EnsEMBL::CoordSystem object. - - Slices will represent a region on a seq_region as opposed to a region on a - chromosome. Slices will be immutable (i.e. their attributes will not be - changeable). A new slice will have to be created if the attributes are to - be changed. - - The following attributes will therefore define a unique slice: - coord_system (e.g. object with name and version) - seq_region_name (e.g. 'X' or 'AL035554.1') - start (e.g. 1000000 or 1) - end (e.g. 2000001 or 800) - strand (e.g. 1 or -1) - - The name method will return the above values joined by a ':' delimiter, and - will not be settable: - e.g. 'chromosome:NCBI33:X:1000000:2000001:1' or 'clone::AL035554.1:1:800:-1' - This value can be used as a hashvalue that uniquely defines a slice. - - The concept of an 'empty' slice will no longer exist. - - The get_tiling_path method will be deprecated in favour of a more general - method project(). Whereas get_tiling_path() implies a relationship between - an assembly and the coordinate system which makes up the assembly the - project method will allow conversion accross any two coordinate systems. - It will take a coord_system string as an argument and rather - than returning a list of Tile objects it will return a listref of triplets - containing a start int, and end int, and a 'to' slice object. The following - is an example of how this method would be used ($clone is a reference to a - slice object in the clone coordinate system): - - my $clone_path = $slice->project('clone'); - - foreach my $segment (@$clone_path) { - my ($start, $end, $clone) = @$segment; - print $slice->seq_region_name, ':', $start, '-', $end , ' -> ', - $clone->seq_region_name, ':', $clone->start, '-', $clone->end, - $clone->strand, "\n"; - } - - An optional second argument to project() will be the coordinate - system version. E.g.: - $ncbi34_path = $slice->project('chromosome','NCBI34'). - - -Tile ----- - The tile object will no longer be necessary. However for backwards - compatibility it will remain in the system for some time before being phased - out along with the get_tiling_path method. - - -SliceAdaptor ------------- - The Slice adaptor must provide a method to fetch a slice via its coordinate - system, seq_region_name, start, end, and strand. - The old, commonly used method fetch_by_chr_start_end has been altered to - simply chain to this new method (with a warning) as do most other - SliceAdaptor methods. - - Another method which is necessary with the disapearence of the Clone, - RawContig and Chromosome adaptors is one which allows for all slices - of a certain type to be retrieved. For example it is often necessary to - retrieve all chromosomes, or clones for a species. This method is simply - named fetch_all. The old fetch_all methods on the ChromosomeAdaptor, - RawContigAdaptor, CloneAdaptor, etc. chain to the new method for backwards - compatibility. - - Method Names and Signatures - --------------------------- - Slice fetch_by_region(coord_system, name) - Slice fetch_by_region(coord_system, name, start) - Slice fetch_by_region(coord_system, name, start, end) - Slice fetch_by_region(coord_system, name, start, end, strand) - Slice fetch_by_region(coord_system, name, start, end, strand, version) - listref of Slices fetch_all(coord_system) - listref of Slices fetch_all(coord_system, version) - -RawContig ---------- - The RawContig object is no longer necessary with the new system. RawContigs - are replaced by Slices with coord_system = 'contig'. In the interests of - backwards compatibility the RawContig class will still be present for - sometime as a minimal implmentation inheriting from the Slice class. - - -RawContigAdaptor ----------------- - The RawContigAdaptor is no longer necessary. The RawContigAdaptor is - replaced by the SliceAdaptor. For backwards compatibility a minimal - implementation of the RawContigAdaptor will remain which inherits from the - SliceAdaptor. - -Clone ------ - The Clone object is no longer necessary in the new system. Clones are - replaced by Slices with coord_system = 'clone'. For backwards compatibility - a minimal implementation will remain which inherits from the Slice object. - -CloneAdaptor ------------- - The CloneAdaptor object is no longer necessary in the new system. The - CloneAdaptor is replaced by the SliceAdaptor. For backwards compatibility - a minimal implementation will remain which inherits from the SliceAdaptor. - -Chromosome ----------- - The Chromosome object is no longer necessary in the new system. The - Chromosome is replaced by Slices with coord system 'chromosome' (or - whatever the top level seq_region type is for that species). For backwards - compatibility a minimal implementation will remain which inherits from the - Slice object. - - Statistical information (e.g. known genes, genes, snps) that - was on chromosomes may be stored in the seq_region_attrib table or - in some sort of density table. - -ChromosomeAdaptor ------------------ - The Chromosome object is no longer necessary in the new system. The - ChromosomeAdaptor is replaced by the SliceAdaptor. For backwards - compatibility a minimal implementation which inherits from the SliceAdaptor - will remain. - - -Root ----- - Every class in the current EnsEMBL perl API inherits directly or indirectly - from Bio::EnsEMBL::Root. This inheritance is almost exclusively for the - following following three methods: - throw - warn - _rearrange - - Nothing is gained by implementing this relationship as inheritance, and there - are several disadvantages: - (1) Everything must inherit from this class to use those 3 object methods. - This can result in patterns of multiple inheritance which are generally - considered to be a bad thing. - - (2) It is not possible to use the throw, warn or rearrange method within - the constructor until the object is blessed. Blessing the object first - and then calling rearrange to extract named arguments is slower because - the blessed hash needs to be expanded as more keys are added and several - key access/value assignements may need to be performed. - - (3) Objects become larger and object construction becomes slightly slower - because constructors traverse an additional level of inheritance. - - A better approach, which we have used, is to make the methods static and - create a static utility class that exports the methods. The warn method has - been renamed warning so as not to conflict with the builtin perl function - warn and the _rearrange method has be renamed rearrange. - - The following is an example of the old styl Root inheritance and the new - style static utility methods: - - # - # OLD STYLE - # - package Old; - - use Bio::EnsEMBL::Root; - - @ISA = qw(Bio::EnsEMBL::Root); - - sub new { - my $caller = shift; - my $class = ref($caller) || $caller; - - $self = $class->SUPER::new(@_); - - my ($start, $end) = $self->_rearrange(['START', 'END'], @_); - - if(!defined($start) || !$defined($end)) { - $self->throw('-START and -END arguments are required'); - } - - $self->{'start'} = $start; - $self->{'end'} = $end; - - return $self; - } - - # - # NEW STYLE - # - package New; - - use Bio::EnsEMBL::Utils::Exception qw(throw warning); - use Bio::EnsEMBL::Utils::Argument qw(rearrange); - - sub new { - my $caller = shift; - - my $class = ref($caller) || $caller; - - my ($start, $end) = rearrange(['START', 'END'], @_); - - if(!defined($start) || !defined($end)) { - throw('-START and -END arguments are required'); - } - - return bless {'start' => $start, 'end' => $end}, $class; - } - - The calls to $self->rearrange $self->warn and $self->throw have been - replaced by class method calls to warning() and throw() inside the core API. - However, for backwards compatibility the existance inheritance to - Bio::EnsEMBL::Root will remain in many cases (and be removed at a later date) - - -Storable Base Class -------------------- - Almost all business objects in the EnsEMBL system are storable in the db - and the ones which are always require 2 methods: dbID() and adaptor(). These - methods have been moved to a Storable base class which most of the - business objects now inherit from. This module has an additional method - is_stored() which takes a database argument and returns true if the object - appears to already have been stored in the provided database. - -Features --------- - All features should inherit from a base class that implements common feature - functionality. Formerly this role was filled by the bloated SeqFeature class - which inherits from Bio::SeqFeature and Bio::SeqFeatureI etc. - This class has been replaced by a smaller, less complicated - implementation named Feature. To make classes more polymorphic in general, - the gene, and transcript objects should now also inherit from the Feature - class. This class implements the following core methods common to all - features: - - start - end - strand - slice (formerly named contig/entire_seq/etc.) - transform - transfer - project - analysis - - The feature class inherits from the Storable base class and thereby inherits - the following methods: - - adaptor - dbID - is_stored - - The signature and behaviour of the transform method has been changed. The - existing method works differently depending on the arguments passed as - described below. - - OLD transform(no arguments) - ----------------------- - Transforms from slice coordinates to contig coordinates. The feature - is changed in place and returned. If the feature already is in contig - coordinates an exception is thrown. The feature may be split into two - features in which case both features are returned (not sure if one of - them is transformed in place). Some features are not permitted to be - split in two in which case an exception is thrown? (not sure) if it is - to be split accross contigs. - - OLD transform(slice) - ---------------- - If the feature is already in slice coordinates and the slice is on the - same chromosome the features coordinates are simply shifted. If the - feature is already in slice coordinates but on a different chromosome - an exception is thrown. - It the feature is in contig coordinates and the slice is not empty then - it is transformed onto the new slice (or an exception is thrown if the - transform would cause the feature to end up on a different chromosome - than the slice). If the feature is in contig coordinates and the - slice is an empty slice the feature is transformed into chromosomal - coordinates and placed on a newly created slice of the entire chromosome. - - The new transformation has only a single valid signature and splits its - responsibilities with the new transfer method. The transfer - method transfers a feature onto another slice, whereas the transform - method simply converts coordinate systems. Transform does - NOT transform features in place but rather returns the newly - transformed feature as a new object: - - transform(coord_system, [version]) - ----------------------- - Takes a single string specifying the new coord system. If the coord - system is not valid an exception is thrown. If the coord system is the - same coord system as the feature is currently in a new feature that is - a copy of the old one is still be returned. This also retrieves - a slice which is the entire span of the region of the coordinate system - that this feature is being transformed to. For example transforming - an exon in contig coordinates to chromosomal coodinates will place a - copied exon on a slice of an entire chromosome. If a feature spans a - boundary in the coordinate system, undef is returned by the method - instead. - - transfer(slice) - ---------------- - Shifts a feature from one slice to another. If the new slice is in the - same coordinate system but different seq_region_name (e.g. both - chromosomal but different chromosomes) an exception is thrown. - If the new slice is in a different coordinate system then the - transform method is internally called first. If the feature would be - split across a boundary undef is returned instead. After the transform - there follows a potential move, if the slice does not cover the full - seq_region. If there is no transform call necessary, the feature is - copied and then moved. - - move( start, end, strand ) - -------------------------- - In place change of the coordinates of the feature. It will stay on the - same slice. - - project(coord_system, [version]) - ----------------- - This method is analagous to the project method on Bio::EnsEMBL::Slice. - It 'projects' a feature onto another coordinate system and returns the - results formatted as a listref of [$start, $end, $feature] triplets. - The $features returned are copies of the feature on which the method was - called, but with coordinates in the coordinate system that was projected - to. If the feature maps entirely to a gap then an empty list ref [] will - be returned. If the feature is mapped to multiple locations a listref - containing split features will be returned. - -StickyExon ----------- - The sticky exon object is not be present in the new system. It does not - make sense to define features in a coordinate system where they are simply - not present. Exons are calculated in chromosomal coordinates and they - will generally be retrieved in the same coordinates system. It will - of course be possible to still retrieve exons in contig coordinates but only - is they are fully defined on the contigs of interest. - The split coordinates can be obtained through a call to the project - method. - - -AssemblyMapper --------------- - The assembly mapper and assembly mapper adaptor classes have become more - general and sophisticated.Not only is it possible to map between two - coordinates systems whose relationship is explicitly defined in the assembly - table, but it is also possible to perform implicit, 2-step mapping - using 'coordinate system chaining'. - - For example if no explicit relationship is defined between the supercontig - and clone coordinate systems but relationships between the clone and contig - and the supercontig and contig coordinate systems is present the mapper has - the faculty to perform the mapping between the clone and supercontig systems. - In this case the contig cooridinate system is used as an intermediary: - - NTContig <-> Contig <-> Clone - - In the above example the assembly mapper adaptor internally does the - following: - - (1) Create a mapper object between the NTContig and Contig region - (2) Create a mapper object between the Contig and Clone region - (3) Create and return a third mapper constructed from the sets of mappings - generated by the intermediate mappers. - - -FeatureAdaptors ---------------- - Most FeatureAdaptors inherit from the BaseFeature adaptor. As a - minimum feature adaptors provide fetch_all_by_Slice and fetch_by_dbID - methods. The fetch by slice method provides the same return types and - requires the same arguments as before, but required some internal changes. - - The simplified algorithm for fetching features via a slice is: - - (1) Check with coord system is requested or that slice is in. - (2) Check which coord system features are in - (3) Obtain mapper between coord systems - (5) Retrieve features in their native coord system. - (6) Remap features to the requested coord system using the mapper - (7) Return the features - - The method fetch_all_by_RawContig is obsolete (it is equivalent to - fetching by a slice of a contig) but has be left in as an alias for the - fetch all by slice method for backwards compatibility. - - When performing a non-locational fetch (e.g. by dbID) features are still - returned in the coordinate system that they are calculated in. This is to - ensure that the feature can always be retrieved in this manner of fetching - and so that features which are not in the database can be distinguished from - features which are simply not in the requested coordinate sytem. When a - single feature which is not in the database is requested via a non-locational - fetch undef is returned instead. If multiple features are requested but none - are present in the database a reference to an empty list is returned. If the - features are required in a specific coordinate system the transfer, project - or transform method can always be used. - - -CoordSystemAdaptor ------------------- - A CoordSystemAdaptor provides access to the information in the - coord_system, meta and meta_coord tables. This adaptor provides - Bio::EnsEMBL::CoordSystem objects. - - -NEW FEATURES ------------- - -Assembly Exceptions (Symbolic Sequence Links) ---------------------------------------------- - - It is sometimes desirable to have multiple regions refer to the same - sequence. - - In much the same way a symlinked file acts as a pointer to a real file, - a symlinked region can point to another region of sequence. - - This can be described in the database through the addition of a table which - has a structure that mirrors that of the assembly table. The assembly table - does not define the structure underlying this seq_region, and it does not - have sequence of its own. By means of the assembly_exception table this - seq_region points to another seq_region where the underlying sequence is - defined: - - assembly_exception - ------------------ - seq_region_id int - seq_region_start int - seq_region_end int - exc_type enum('HAP', 'PAR') - exc_seq_region_id int - exc_seq_region_start int - exc_seq_region_end int - ori int (may not be needed, may implicitly be 1) - - When fetching features and sequence from a slice that overlaps a symlinked - region, the features and sequence from the symlinked region are returned. - This may be implemented by altering fetch by slice calls and adding a - SliceAdaptor method with splits a slice into non-symlinked components. - The following algorithm would apply to sequence and feature fetches: - (1) Split the slice into non-symlinked component slices - (2) Recursively call the method with the component slices - (3) Adjust the start and end of the returned features and place them - back on the original slice (or splice the sequence together if this - is a sequence fetch) - (4) Return the features or sequence - - Consider a slice which overlaps regions (A), (B), and (C) on chromosome Y: - - =============== (chrX) - ^^^^^^^^^^^ - ======== symlink ========= (chrY) - (A) (B) (C) - - Regions (A) and (C) are described by the assembly table, but region (B) - is described in the assembly_exception table and points to a region of - chromosome Y. When features or sequence are retrieved the slice is split - into 3 component slices which have no symlinks: region (A) and (C) are - slices on chromosome Y but region (B) is made into a slice on chromsome (X). - All of the features are fetched from the individual slices adjusted by - some addition and placed on back on the original Slice before being - returned. - - - -Haplotypes (and the MHC region) -------------------------------- - There are several requirements related to haplotypes: - - Must be able to determine which haplotypes overlap a slice - - Must be able to run genebuild/raw computes over the haplotypes - - Must be able to retrieve a slice on a haplotype and its flanking - regions (i.e. the regions of the default assembly bordering the - haplotype). - - It may be desireable to interpolate features from the default sequence - onto the haplotype - - Proposal: - The haplotype will be present as a full length 'chromosome' in the - seq_region table (or other appropriate coordinate system) but only the - region which differs from the the default assembly will be described in - the chromosome table. The regions which are identical will be described - by the assembly_exception table. - - It is possible to retrieve a slice on a haplotype just as any other slice - is retrieved from the SliceAdaptor. For example: - $slice = $slice_adaptor->fetch_by_region('chromosome', '6_DR52'); - - A slice created on a haplotype will have coordinates relative to the - start of the chromosome NOT relative to the start of the haplotype - region. For all intents and purposes a haplotype slice will behave as - a normal slice. - - For example, the assembly table could define the composition of the - divergent region of chromosome 6_DR52 (C), but leave the remainder of the - chromosomal composition undefined. The remainder of the - chromosome composition would be accounted for by 2 rows in the - assembly_exception table which described the synonymous regions in terms - of chromosome 6: - - ============== 6 ============== 6 - ^ ^ - _____|________ _______|______ - C ========== 6_DR52 - - - - - -Pseudo Autosomal Regions (PARs) -------------------------------- - There are several requirements related to PARs: - - The same sequence and features must be present on a region of - both chromosome X and chromosome Y - - The region and features should be returned when retreiving features - from either chromosome. - - It must still be possible to retrieve one of the features via its - identifier - - It must still be possible to transform features in the region from - chromosomal coordinates to contig coords and vice-versa. - - The genebuild should run over the region, but only once. - - Proposal: - Use the assembly_exception table in a similar fashion as it is used - for the haplotypes described above. Chromosome X can be the 'default' - chromosome for the PAR and Chromosome Y can be described by the assembly - table except in the PAR. The PAR on chromosome Y can be defined by the - assembly_exception table and refer to the corresponding sequence on - chromosome X. The same algorithm as used for haplotypes can then be used - when retrieving sequence or features from slices which overlap this - exception on chromosome Y. - - The following diagram illustrates how chromsome X and chromosome Y could - be defined: - - ========================================== X - ^ ^ - _|_ ____|____ - ====== ============ ============= Y - - -Multiple Assemblies -------------------- - -In theory it is possible to load multiple assemblies into the same database. -For example two coordinate systems with two versions chromosome:NCBI33 and -chromosome:NCBI34 could be loaded into the database. Leveraging the fact that -two step mapping is possible and that these coordinate systems share a -coincident mapping with the contig coorinate system it is possible to pull -across annotation from one assembly to the other. The following example -illustrates the transfer of genes from the chromosome X on the NCBI33 assembly -to the NCBI34 assembly: - - $slice = $slice_adaptor->fetch_by_region('chromosome', 'X', undef, - undef,undef, 'NCBI33'); - @genes = @{$gene_adaptor->fetch_all_by_Slice($slice)}; - - foreach my $gene (@genes) { - $gene->transform('chromosome', 'NCBI34'); - #... - } - - -OTHER CONSIDERATIONS --------------------- - -Loci ----- - -Similar genes which are defined across haplotypes need -to be somehow linked into loci. The intent is that -a user would be able to see that a gene has a counterpart -on an equivalent haplotypic sequence. - -This is work in progress. The current opinion among us is to implement -it via a relationship table that specifies which genes on default -haplotypes are considered equivalent to other genes on -(overlapping) haplotypes. - -- GitLab