Many object adaptors can provide a set of features which overlap a slice. The Slice itself also provides a means to obtain features which overlap its region. The following are two ways to obtain a list of genes which overlap a Slice:
...
...
@@ -298,7 +298,7 @@ foreach my $tr (@$transcripts) {
Translation objects and peptide sequence can be extracted from a Transcript object. It is important to remember that some Ensembl transcripts are non-coding (pseudogenes, ncRNAs, etc.) and have no translation. The primary purpose of a Translation object is to define the CDS and UTRs of its associated Transcript object. Peptide sequence is obtained directly from a Transcript object, not from a Translation object as might be expected. The following example obtains the peptide sequence of a Transcript and the Translation's stable identifier:
my $stable_id = 'ENST00000044768';
my $transcript_adaptor = $db->get_TranscriptAdaptor();
ProteinFeatures are features which are on an amino acid sequence rather than a nucleotide sequence. The method get_all_ProteinFeatures can be used to obtain a set of protein features from a Translation object.
...
...
@@ -426,7 +426,7 @@ my $ptranscripts = $slice->get_all_PredictionTranscripts;
foreach my $ptrans (@$ptranscripts) {
my $exons = $ptrans->get_all_Exons();
my $type = $ptrans->analysis->logic_name();
print "$type prediction has ".scalar(@$exons)." exons\n";
foreach my $exon (@$exons) {
print $exon->start . " - " .
...
...
@@ -489,8 +489,8 @@ Repetitive regions found by RepeatMasker and TRF (Tandem Repeat Finder) are repr
my $repeats = $slice->get_all_RepeatFeatures();
foreach my $repeat (@$repeats) {
print $repeat->display_id(), " ",
$repeat->start(), "-", $repeat->end(), "\n";
}
RepeatFeatures are used to perform repeat masking of the genomic sequence. Hard or softmasked genomic sequence can be retrieved from Slice objects using the get_repeatmasked_seq method. Hardmasking replaces sequence in repeat regions with Ns. Softmasking replaces sequence in repeat regions with lowercase sequence.
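The difference between the two masking styles can be illustrated without the API. The following is a minimal pure-Perl sketch, not the implementation behind get_repeatmasked_seq: the function name mask_repeats, the example sequence and the repeat coordinates are all made up for illustration, and repeat regions are assumed to be 1-based, inclusive and within the sequence.

```perl
use strict;
use warnings;

# Mask repeat regions in a sequence string.
# $repeats is a reference to a list of [start, end] pairs (1-based, inclusive).
# $mode is 'soft' (lowercase the region) or anything else for hard masking (Ns).
sub mask_repeats {
    my ($seq, $repeats, $mode) = @_;
    foreach my $r (@$repeats) {
        my ($start, $end) = @$r;
        my $len    = $end - $start + 1;
        my $region = substr($seq, $start - 1, $len);
        substr($seq, $start - 1, $len) =
            $mode eq 'soft' ? lc($region) : 'N' x length($region);
    }
    return $seq;
}

my $seq     = 'ACGTACGTACGT';
my @repeats = ([3, 6]);    # hypothetical repeat covering positions 3-6

print mask_repeats($seq, \@repeats, 'hard'), "\n";    # ACNNNNGTACGT
print mask_repeats($seq, \@repeats, 'soft'), "\n";    # ACgtacGTACGT
```

Softmasking keeps the underlying bases recoverable, whereas hardmasking discards them.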
...
...
@@ -523,17 +523,17 @@ foreach my $synonym ($marker->get_all_MarkerSynonyms()}) {
A coordinate system is uniquely defined by its name and version. Most coordinate systems do not have a version, and the ones that do have a default version, so it is usually sufficient to use only the name when requesting a coordinate system. For example, chromosome coordinate systems have a version which is the assembly that defined the construction of the coordinate system. The version of human chromosome coordinate system might be NCBI33 or NCBI34.
...
...
@@ -673,7 +673,7 @@ Now suppose that you wish to write code which is independent of the species used
@@ -687,12 +687,12 @@ Features on a Slice in a given coordinate system may be moved to another slice i
The method transform can be used to move a feature to any coordinate system which is in the database. The feature will be placed on a Slice which spans the entire sequence that the feature is on in the requested coordinate system.
print "Feature is not defined in clonal coordinate system\n";
}
The transform method returns a copy of the original feature in the new coordinate system, or undef if the feature is not defined in that coordinate system. A feature is considered to be undefined in a coordinate system if it overlaps an undefined region or if it crosses a coordinate system boundary. Take for example the tiling path relationship between chromosome and contig coordinate systems:
...
...
@@ -710,11 +710,11 @@ Both Feature A and Feature B are defined in the chromosomal coordinate system d
The special toplevel coordinate system can also be used in this instance to move the feature to the highest possible coordinate system in a given region:
my $new_feature = $feature->transform('toplevel');
Another useful method is display_id. This will return a string that can be used as the name or identifier for a particular feature. For a gene or transcript this method would return the stable_id, for an alignment feature this would return the hit sequence name (hseqname), etc.
# display_id returns a suitable display value for any feature type
print $feat->display_id(), "\n";
The feature_Slice method will return a Slice which is the exact overlap of the feature the method was called on. This slice can then be used to obtain the underlying sequence of the feature or to retrieve other features that overlap the same region, etc.
$feat_slice = $feat->feature_Slice();
# print the sequence of the feature region
print $feat_slice->seq(), "\n";
# print the sequence of the feature region + 5000bp flanking
The registry is a convenient storage and retrieval area for all the adaptors and provides an easy way to access them. If you have an Ensembl Web Server set up, you can automatically load all its adaptors with the load_registry_with_web_adaptors method from the Registry module.
use Bio::EnsEMBL::Registry;
my $reg = "Bio::EnsEMBL::Registry";
$reg->load_registry_with_web_adaptors();
my $ga = $reg->get_adaptor("Homo_sapiens", "estgene", "Gene");
my $gene = $ga->fetch_by_stable_id("ENSESTG00000015126");
print $gene->seq()."\n";
The above example uses the database settings held by the Ensembl Web Server, which eases code maintenance: the host, database name and other connection details do not need to be added because they are already set up. The code should also be more readable.
Another example of a general script is given below. It takes four arguments: the species, chromosome, start and end. For every gene found on the named chromosome between the specified start and end points, it prints the gene name, its start and end points, and the group database in which it was found.
#test2.pl
use Bio::EnsEMBL::Registry;
my $reg = "Bio::EnsEMBL::Registry";
my ($species, $chrom, $start, $end) = @ARGV;
die("Error species chrom start and end needed\n") unless defined($end);
$reg->load_registry_with_web_adaptors();
$species = $reg->get_alias($species);
my @dbs = $reg->get_all_DBAdaptors();
foreach my $db (@dbs){
if($db->species eq $species){
my $slice_adap = $reg->get_adaptor
($db->species, $db->group,"Slice");
if(defined($slice_adap)){
my $slice = $slice_adap->fetch_by_region
('chromosome',$chrom, $start, $end);
foreach my $gene ( @{$slice->get_all_Genes} ) {
my $gene2= $gene->transform('chromosome');
my $name = $gene->stable_id() || $gene->type().".".
Note that the path to the SiteDefs.pm module must first be added to the PERL5LIB environment variable if you want to use the load_registry_with_web_adaptors method. The next example lists all the databases that have been set up for the Ensembl Web Server:
use Bio::EnsEMBL::Registry;
my $reg = "Bio::EnsEMBL::Registry";
$reg->load_registry_with_web_adaptors();
my @dbs = $reg->get_all_DBAdaptors();
foreach my $db (@dbs){
print $db->species()."\t".
$db->group()."\t".
$db->dbc->dbname()."\t".
$db->dbc->host()."\t".
$db->dbc->port()."\n";
}
To ensure that the Registry stores adaptors in an organised way, two new arguments, species and group, have been added to the DBAdaptor new method. Default values are used if these are not given. Configuration scripts can be written to enable an easy setup of the Registry for all scripts to use. Below is an example of a configuration script.
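A configuration script of this kind might look like the following sketch. It assumes the Ensembl Perl modules are installed; the host, port, user and database name are placeholders rather than real connection details, and the alias list is illustrative only.

```perl
# Hypothetical registry configuration file.
# All connection details below are placeholders, not a real server.
use strict;
use warnings;
use Bio::EnsEMBL::DBSQL::DBAdaptor;
use Bio::EnsEMBL::Utils::ConfigRegistry;

# Create an adaptor with an explicit species and group so that the
# Registry can file it under those two keys.
Bio::EnsEMBL::DBSQL::DBAdaptor->new(
    -species => 'Homo_sapiens',
    -group   => 'estgene',
    -host    => 'db.example.org',          # placeholder
    -port    => 3306,                      # placeholder
    -user    => 'anonymous',               # placeholder
    -dbname  => 'homo_sapiens_estgene',    # placeholder
);

# Let scripts refer to the species by shorter names.
Bio::EnsEMBL::Utils::ConfigRegistry->add_alias(
    -species => 'Homo_sapiens',
    -alias   => ['human', 'homo sapiens'],
);

1;    # assumed: the file is loaded as Perl code and should return true
```

With an alias such as "human" registered, scripts can request adaptors by that name instead of the full species name.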
The script is run by calling the method load_all and passing it the file name. If no file name is given, the environment variable ENSEMBL_REGISTRY is checked for a valid file. If that fails, the file ./ensembl_initrc is checked. A central configuration script can therefore be set up, and occasional API programmers no longer have to remember which databases are where, on what port, and so on. Using the above configuration, and presuming that ENSEMBL_REGISTRY has been set up, the sequence for an estgene stable_id can be obtained as follows:
use Bio::EnsEMBL::Registry;
my $reg = "Bio::EnsEMBL::Registry";
$reg->load_all();
my $gadap = $reg->get_adaptor("human", "estgene", "Gene");
my $gene = $gadap->fetch_by_stable_id("ENSESTG00000015126");
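
The lookup order described for load_all (explicit file name, then ENSEMBL_REGISTRY, then ./ensembl_initrc) can be sketched in plain Perl. This is an illustration only, not the Registry's own code: the function name resolve_registry_config is made up, a "valid" file is assumed to mean one that exists, and an explicit file name that does not exist is assumed to fall through to the next candidate.

```perl
use strict;
use warnings;

# Return the first configuration file that exists, or nothing if none does.
sub resolve_registry_config {
    my ($file) = @_;
    my @candidates = (
        $file,                     # 1. explicit file name argument
        $ENV{ENSEMBL_REGISTRY},    # 2. environment variable
        './ensembl_initrc',        # 3. fallback in the current directory
    );
    foreach my $candidate (@candidates) {
        next unless defined $candidate && length $candidate;
        return $candidate if -e $candidate;
    }
    return;
}

my $config = resolve_registry_config($ARGV[0]);
print defined $config
    ? "Using $config\n"
    : "No configuration file found\n";
```

A script can be pointed at a different set of databases by changing ENSEMBL_REGISTRY, with no edits to the script itself.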