BaseAlignFeature.pm 25.9 KB
Newer Older
1 2
=head1 LICENSE

3
  Copyright (c) 1999-2013 The European Bioinformatics Institute and
4 5 6 7 8 9 10 11 12 13
  Genome Research Limited.  All rights reserved.

  This software is distributed under a modified Apache license.
  For license details, please see

    http://www.ensembl.org/info/about/code_licence.html

=head1 CONTACT

  Please email comments or questions to the public Ensembl
14
  developers list at <dev@ensembl.org>.
15 16 17 18 19

  Questions may also be sent to the Ensembl help desk at
  <helpdesk@ensembl.org>.

=cut
20 21 22

=head1 NAME

23 24
Bio::EnsEMBL::BaseAlignFeature - Baseclass providing a common abstract
implmentation for alignment features
25 26 27

=head1 SYNOPSIS

28 29 30 31 32 33 34 35 36 37 38
  my $feat = new Bio::EnsEMBL::DnaPepAlignFeature(
    -slice        => $slice,
    -start        => 100,
    -end          => 120,
    -strand       => 1,
    -hseqname     => 'SP:RF1231',
    -hstart       => 200,
    -hend         => 220,
    -analysis     => $analysis,
    -cigar_string => '10M3D5M2I'
  );
39

40
  where $analysis is a Bio::EnsEMBL::Analysis object.
41

42
  Alternatively if you have an array of ungapped features:
43

44 45
    my $feat =
      new Bio::EnsEMBL::DnaPepAlignFeature( -features => \@features );
46

47
  where @features is an array of Bio::EnsEMBL::FeaturePair objects.
48

49
  There is a method to (re)create ungapped features from the cigar_string:
50

51
    my @ungapped_features = $feat->ungapped_features();
52

53
  where @ungapped_features is an array of Bio::EnsEMBL::FeaturePair's.
54

55 56 57 58
  Bio::EnsEMBL::BaseAlignFeature inherits from:
    Bio::EnsEMBL::FeaturePair, which in turn inherits from:
      Bio::EnsEMBL::Feature,
  thus methods from both parent classes are available.
59

60

61 62 63
  The cigar_string is a condensed representation of the matches and gaps
  which make up the gapped alignment (where CIGAR stands for
  Concise Idiosyncratic Gapped Alignment Report).
64

65 66 67
  CIGAR format is: n <matches> [ x <deletes or inserts> m <matches> ]*
  where M = match, D = delete, I = insert; n, m are match lengths;
  x is delete or insert length.
68

69 70 71 72 73
  Spaces are omitted, thus: "23M4I12M2D1M"
  as are counts for any lengths of 1, thus 1M becomes M: "23M4I12M2DM"


  To make things clearer this is how a blast HSP would be parsed:
74

75 76
  >AK014066
         Length = 146
77

78
    Minus Strand HSPs:
79

80 81
    Score = 76 (26.8 bits), Expect = 1.4, P = 0.74
    Identities = 20/71 (28%), Positives = 29/71 (40%), Frame = -1
82

83 84 85
  Query:   479 GLQAPPPTPQGCRLIPPPPLGLQAPLPTLRAVGSSHHHP*GRQGSSLSSFRSSLASKASA 300
               G  APPP PQG R   P P G + P   L             + + ++  R  +A   +
  Sbjct:     7 GALAPPPAPQG-RWAFPRPTG-KRPATPLHGTARQDRQVRRSEAAKVTGCRGRVAPHVAP 64
86

87 88 89
  Query:   299 SSPHNPSPLPS 267
                  H P+P P+
  Sbjct:    65 PLTHTPTPTPT 75
90

91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120
  The alignment goes from 479 down to 267 in the query sequence on the reverse
  strand, and from 7 to 75 in the subject sequence.

  The alignment is made up of the following ungapped pieces:

  query_seq start 447 , sbjct_seq hstart  7 , match length  33 , strand -1
  query_seq start 417 , sbjct_seq hstart 18 , match length  27 , strand -1
  query_seq start 267 , sbjct_seq hstart 27 , match length 147 , strand -1

  When assembled into a DnaPepAlignFeature where:
    (seqname, start, end, strand) refer to the query sequence,
    (hseqname, hstart, hend, hstrand) refer to the subject sequence,
  these ungapped pieces are represented by the cigar string:
    33M3I27M3I147M
  with start 267, end 479, strand -1, and hstart 7, hend 75, hstrand 1.

=head1 CAVEATS

  AlignFeature cigar strings have the opposite 'sense'
  ('D' and 'I' swapped) compared with Exonerate cigar strings.

  Exonerate modules in Bio::EnsEMBL::Analysis use this convention:

   The longer genomic sequence specified by:
      exonerate:    target
      AlignFeature: (sequence, start, end, strand)

   A shorter sequence (such as EST or protein) specified by:
      exonerate:    query
      AlignFeature: (hsequence, hstart, hend, hstrand)
121

122 123
  The resulting AlignFeature cigar strings have 'D' and 'I'
  swapped compared with the Exonerate output, i.e.:
124

125 126
    exonerate:    M 123 D 1 M 11 I 1 M 39
    AlignFeature: 123MI11MD39M
127

128
=head1 METHODS
129 130

=cut
131

132

133 134 135
package Bio::EnsEMBL::BaseAlignFeature;

use Bio::EnsEMBL::FeaturePair;
136 137
use Bio::EnsEMBL::Utils::Argument qw(rearrange);
use Bio::EnsEMBL::Utils::Exception qw(throw warning);
Monika Komorowska's avatar
Monika Komorowska committed
138
use Scalar::Util qw(weaken isweak);
139 140 141 142

use vars qw(@ISA);
use strict;

143
@ISA = qw(Bio::EnsEMBL::FeaturePair);
144

145 146 147 148 149

=head2 new

  Arg [..]   : List of named arguments. (-cigar_string , -features) defined
               in this constructor, others defined in FeaturePair and 
Graham McVicker's avatar
Graham McVicker committed
150 151 152
               SeqFeature superclasses.  Either cigar_string or a list
               of ungapped features should be provided - not both.
  Example    : $baf = new BaseAlignFeatureSubclass(-cigar_string => '3M3I12M');
153
  Description: Creates a new BaseAlignFeature using either a cigar string or
Graham McVicker's avatar
Graham McVicker committed
154 155 156 157 158 159 160
               a list of ungapped features.  BaseAlignFeature is an abstract
               baseclass and should not actually be instantiated - rather its
               subclasses should be.
  Returntype : Bio::EnsEMBL::BaseAlignFeature
  Exceptions : thrown if both feature and cigar string args are provided
               thrown if neither feature nor cigar string args are provided
  Caller     : general
161
  Status     : Stable
162 163 164

=cut

165
sub new {
166
  
167 168 169 170 171 172 173 174 175 176 177 178
  my $caller = shift;

  my $class = ref($caller) || $caller;

  my $self = $class->SUPER::new(@_);

  my ($cigar_string,$features) = rearrange([qw(CIGAR_STRING FEATURES)], @_);

  if (defined($cigar_string) && defined($features)) {
    throw("CIGAR_STRING or FEATURES argument is required - not both.");
  } elsif (defined($features)) {
    $self->_parse_features($features);
Laura Clarke's avatar
 
Laura Clarke committed
179
    
180 181 182 183 184
  } elsif (defined($cigar_string)) {
    $self->{'cigar_string'} = $cigar_string;
  } else {
    throw("CIGAR_STRING or FEATURES argument is required");
  }
Laura Clarke's avatar
 
Laura Clarke committed
185
  
186
  return $self;
187 188 189
}


190 191 192 193 194 195 196 197 198 199 200 201 202 203 204
=head2 new_fast

  Arg [1]    : hashref $hashref
               A hashref which will be blessed into a PepDnaAlignFeature.
  Example    : none
  Description: This allows for very fast object creation when a large number
               of PepDnaAlignFeatures needs to be created.  This is a bit of
               a hack but necessary when thousands of features need to be
               generated within a couple of seconds for web display. It is
               not recommended that this method be called unless you know what
               you are doing.  It requires knowledge of the internals of this
               class and its superclasses.
  Returntype : Bio::EnsEMBL::BaseAlignFeature
  Exceptions : none
  Caller     : none currently
205
  Status     : Stable
206 207 208 209 210

=cut

sub new_fast {
  my ($class, $hashref) = @_;
Monika Komorowska's avatar
Monika Komorowska committed
211 212 213
  my $self = bless $hashref, $class;
  weaken($self->{adaptor})  if ( ! isweak($self->{adaptor}) );
  return $self;
214
}
Graham McVicker's avatar
Graham McVicker committed
215

216

Arne Stabenau's avatar
Arne Stabenau committed
217
=head2 cigar_string
218

Arne Stabenau's avatar
Arne Stabenau committed
219
  Arg [1]    : string $cigar_string
Kieron Taylor's avatar
Kieron Taylor committed
220
  Example    : $feature->cigar_string( "12MI3M" );
221 222 223 224 225 226 227 228 229
  Description: get/set for attribute cigar_string.
               cigar_string describes the alignment:
                 "xM" stands for x matches (or mismatches),
                 "xI" for x inserts into the query sequence,
                 "xD" for x deletions from the query sequence
                 where the query sequence is specified by (seqname, start, ...)
                 and the subject sequence by (hseqname, hstart, ...).
               An "x" that is 1 can be omitted.
               See the SYNOPSIS for an example.
Arne Stabenau's avatar
Arne Stabenau committed
230
  Returntype : string
231
  Exceptions : none
Arne Stabenau's avatar
Arne Stabenau committed
232
  Caller     : general
233
  Status     : Stable
234 235 236 237

=cut

sub cigar_string {
238 239 240
  my $self = shift;
  $self->{'cigar_string'} = shift if(@_);
  return $self->{'cigar_string'};
241 242 243
}


244 245 246 247 248 249 250
=head2 alignment_length

  Arg [1]    : None
  Description: return the alignment length (including indels) based on the cigar_string
  Returntype : int
  Exceptions : 
  Caller     : 
251
  Status     : Stable
252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275

=cut

sub alignment_length {
  my $self = shift;

  if (! defined $self->{'_alignment_length'} && defined $self->cigar_string) {
    
    my @pieces = ( $self->cigar_string =~ /(\d*[MDI])/g );
    unless (@pieces) {
      print STDERR "Error parsing cigar_string\n";
    }
    my $alignment_length = 0;
    foreach my $piece (@pieces) {
      my ($length) = ( $piece =~ /^(\d*)/ );
      if (! defined $length || $length eq "") {
        $length = 1;
      }
      $alignment_length += $length;
    }
    $self->{'_alignment_length'} = $alignment_length;
  }
  return $self->{'_alignment_length'};
}
Graham McVicker's avatar
Graham McVicker committed
276

277 278
=head2 ungapped_features

Arne Stabenau's avatar
Arne Stabenau committed
279
  Args       : none
Graham McVicker's avatar
Graham McVicker committed
280
  Example    : @ungapped_features = $align_feature->get_feature
Arne Stabenau's avatar
Arne Stabenau committed
281 282 283 284 285
  Description: converts the internal cigar_string into an array of
               ungapped feature pairs
  Returntype : list of Bio::EnsEMBL::FeaturePair
  Exceptions : cigar_string not set internally
  Caller     : general
286
  Status     : Stable
287

Arne Stabenau's avatar
Arne Stabenau committed
288
=cut
289 290 291 292

sub ungapped_features {
  my ($self) = @_;

293 294
  if (!defined($self->{'cigar_string'})) {
    throw("No cigar_string defined.  Can't return ungapped features");
295 296
  }

297
  return @{$self->_parse_cigar()};
298 299
}

300 301 302 303 304 305 306 307 308 309 310 311 312
=head2 strands_reversed
 
  Arg [1]    : int $strands_reversed
  Description: get/set for attribute strands_reversed
               0 means that strand and hstrand are the original strands obtained
                 from the alignment program used
               1 means that strand and hstrand have been flipped as compared to
                 the original result provided by the alignment program used.
                 You may want to use the reverse_complement method to restore the
                 original strandness.
  Returntype : int
  Exceptions : none
  Caller     : general
313
  Status     : Stable
314 315
 
=cut
Graham McVicker's avatar
Graham McVicker committed
316

317 318 319 320 321 322
sub strands_reversed {
   my ($self, $arg) = @_;
 
   if ( defined $arg ) {
      $self->{'strands_reversed'} = $arg ;
   }
323 324 325

   $self->{'strands_reversed'} = 0 unless (defined $self->{'strands_reversed'});

326 327
   return $self->{'strands_reversed'};
}
Graham McVicker's avatar
Graham McVicker committed
328

329 330 331 332 333 334 335 336
=head2 reverse_complement

  Args       : none
  Description: reverse complement the FeaturePair,
               modifing strand, hstrand and cigar_string in consequence
  Returntype : none
  Exceptions : none
  Caller     : general
337
  Status     : Stable
338 339 340 341 342 343 344 345 346

=cut

sub reverse_complement {
  my ($self) = @_;

  # reverse strand in both sequences
  $self->strand($self->strand * -1);
  $self->hstrand($self->hstrand * -1);
347

348 349 350 351 352 353 354 355
  # reverse cigar_string as consequence
  my $cigar_string = $self->cigar_string;
  $cigar_string =~ s/(D|I|M)/$1 /g;
  my @cigar_pieces = split / /,$cigar_string;
  $cigar_string = "";
  while (my $piece = pop @cigar_pieces) {
    $cigar_string .= $piece;
  }
356 357 358 359 360 361 362 363
  
  $self->{'strands_reversed'} = 0 unless (defined $self->{'strands_reversed'});

  if ($self->strands_reversed) {
    $self->strands_reversed(0)
  } else {
    $self->strands_reversed(1);
  }
364 365 366

  $self->cigar_string($cigar_string);
}
367

Graham McVicker's avatar
Graham McVicker committed
368 369


370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398
=head2 transform

  Arg  1     : String $coordinate_system_name
  Arg [2]    : String $coordinate_system_version
  Example    : $feature = $feature->transform('contig');
               $feature = $feature->transform('chromosome', 'NCBI33');
  Description: Moves this AlignFeature to the given coordinate system.
               If the feature cannot be transformed to the destination 
               coordinate system undef is returned instead.
  Returntype : Bio::EnsEMBL::BaseAlignFeature;
  Exceptions : wrong parameters
  Caller     : general
  Status     : Medium Risk
             : deprecation needs to be removed at some time

=cut

sub transform {
  my $self = shift;

  # catch for old style transform calls
  if( ref $_[0] eq 'HASH') {
    deprecate("Calling transform with a hashref is deprecate.\n" .
              'Use $feat->transfer($slice) or ' .
              '$feat->transform("coordsysname") instead.');
    my (undef, $new_feat) = each(%{$_[0]});
    return $self->transfer($new_feat->slice);
  }

399 400 401 402 403
  my $new_feature = $self->SUPER::transform(@_);
  if ( !defined($new_feature)
    || $new_feature->length() != $self->length() )
  {
    my @segments = @{ $self->project(@_) };
404

405 406 407
    if ( !@segments ) {
      return undef;
    }
408 409

    my @ungapped;
410 411 412 413
    foreach my $f ( $self->ungapped_features() ) {
      $f = $f->transform(@_);
      if ( defined($f) ) {
        push( @ungapped, $f );
414
      } else {
415 416
        warning( "Failed to transform alignment feature; "
            . "ungapped component could not be transformed" );
417 418 419 420
        return undef;
      }
    }

421 422
    eval { $new_feature = $self->new( -features => \@ungapped ); };

423 424 425 426
    if ($@) {
      warning($@);
      return undef;
    }
427
  } ## end if ( !defined($new_feature...))
428 429 430 431 432

  return $new_feature;
}


433 434 435
=head2 _parse_cigar

  Args       : none
Graham McVicker's avatar
Graham McVicker committed
436 437
  Description: PRIVATE (internal) method - creates ungapped features from 
               internally stored cigar line
438 439 440
  Returntype : list of Bio::EnsEMBL::FeaturePair
  Exceptions : none
  Caller     : ungapped_features
441
  Status     : Stable
442 443 444

=cut

445 446 447 448 449
sub _parse_cigar {
  my ( $self ) = @_;

  my $query_unit = $self->_query_unit();
  my $hit_unit = $self->_hit_unit();
450

451 452 453
  my $string = $self->{'cigar_string'};

  throw("No cigar string defined in object") if(!defined($string));
454 455

  my @pieces = ( $string =~ /(\d*[MDI])/g );
Laura Clarke's avatar
 
Laura Clarke committed
456
  #print "cigar: ",join ( ",", @pieces ),"\n";
457

458
  my @features;
459 460
  my $strand1 = $self->{'strand'} || 1;
  my $strand2 = $self->{'hstrand'}|| 1;
461 462

  my ( $start1, $start2 );
463

464
  if( $strand1 == 1 ) {
465
    $start1 = $self->{'start'};
466
  } else {
467
    $start1 = $self->{'end'};
468 469 470
  }

  if( $strand2 == 1 ) {
471
    $start2 = $self->{'hstart'};
472
  } else {
473
    $start2 = $self->{'hend'};
474 475
  }

476 477 478
  #
  # Construct ungapped blocks as FeaturePairs objects for each MATCH
  #
479 480 481 482 483 484 485 486 487 488 489 490 491 492 493
  foreach my $piece (@pieces) {

    my ($length) = ( $piece =~ /^(\d*)/ );
    if( $length eq "" ) { $length = 1 }
    my $mapped_length;

    # explicit if statements to avoid rounding problems
    # and make sure we have sane coordinate systems
    if( $query_unit == 1 && $hit_unit == 3 ) {
      $mapped_length = $length*3;
    } elsif( $query_unit == 3 && $hit_unit == 1 ) {
      $mapped_length = $length / 3;
    } elsif ( $query_unit == 1 && $hit_unit == 1 ) {
      $mapped_length = $length;
    } else {
494 495
      throw("Internal error $query_unit $hit_unit, currently only " .
            "allowing 1 or 3 ");
496
    }
497

498 499
    if( int($mapped_length) != $mapped_length and
        ($piece =~ /M$/ or $piece =~ /D$/)) {
500 501
      throw("Internal error with mismapped length of hit, query " .
            "$query_unit, hit $hit_unit, length $length");
502 503 504
    }

    if( $piece =~ /M$/ ) {
505 506 507 508
      #
      # MATCH
      #
      my ( $qstart, $qend);
509
      if( $strand1 == 1 ) {
510 511 512
        $qstart = $start1;
        $qend = $start1 + $length - 1;
        $start1 = $qend + 1;
513
      } else {
514 515 516
        $qend = $start1;
        $qstart = $start1 - $length + 1;
        $start1 = $qstart - 1;
517
      }
518

519
      my ($hstart, $hend);
520
      if( $strand2 == 1 ) {
521 522 523
        $hstart = $start2;
        $hend = $start2 + $mapped_length - 1;
        $start2 = $hend + 1;
524
      } else {
525 526 527
        $hend = $start2;
        $hstart = $start2 - $mapped_length + 1;
        $start2 = $hstart - 1;
528 529
      }

530

531 532
      push @features, Bio::EnsEMBL::FeaturePair->new
        (-SLICE      => $self->{'slice'},
533
         -SEQNAME   => $self->{'seqname'},
534 535 536
         -START      => $qstart,
         -END        => $qend,
         -STRAND     => $strand1,
Abel Ureta-Vidal's avatar
Abel Ureta-Vidal committed
537
         -HSLICE      => $self->{'hslice'},
538 539 540 541 542 543 544
         -HSEQNAME   => $self->{'hseqname'},
         -HSTART     => $hstart,
         -HEND       => $hend,
         -HSTRAND    => $strand2,
         -SCORE      => $self->{'score'},
         -PERCENT_ID => $self->{'percent_id'},
         -ANALYSIS   => $self->{'analysis'},
545
         -P_VALUE    => $self->{'p_value'},
546 547
         -EXTERNAL_DB_ID => $self->{'external_db_id'}, 
         -HCOVERAGE   => $self->{'hcoverage'},
548 549 550
         -GROUP_ID    => $self->{'group_id'},
         -LEVEL_ID    => $self->{'level_id'});
      
551 552 553

      # end M cigar bits 
    } elsif( $piece =~ /I$/ ) {
554 555 556
      #
      # INSERT
      #
557
      if( $strand1 == 1 ) {
558
        $start1 += $length;
559
      } else {
560
        $start1 -= $length;
561 562
      }
    } elsif( $piece =~ /D$/ ) {
563 564 565
      #
      # DELETION
      #
566
      if( $strand2 == 1 ) {
567
        $start2 += $mapped_length;
568
      } else {
569
        $start2 -= $mapped_length;
570 571
      }
    } else {
572 573
      throw( "Illegal cigar line $string!" );
    }
574
  }
575 576

  return \@features;
577 578 579 580 581
}




582
=head2 _parse_features
583

Kieron Taylor's avatar
Kieron Taylor committed
584
  Arg  [1]   : listref Bio::EnsEMBL::FeaturePair $ungapped_features
585
  Description: creates internal cigar_string and start,end hstart,hend
586 587
               entries.
  Returntype : none, fills in values of self
Graham McVicker's avatar
Graham McVicker committed
588
  Exceptions : argument list undergoes many sanity checks - throws under many
589
               invalid conditions
590
  Caller     : new
591
  Status     : Stable
592

593
=cut
594

595 596
my $message_only_once = 1;

597 598
sub _parse_features {
  my ($self,$features ) = @_;
599

600 601
  my $query_unit = $self->_query_unit();
  my $hit_unit = $self->_hit_unit();
602

603
  if (ref($features) ne "ARRAY") {
604
    throw("features must be an array reference not a [".ref($features)."]");
605 606
  }

607 608 609 610
  my $strand  = $features->[0]->strand;

  throw ('FeaturePair needs to have strand == 1 or strand == -1') if(!$strand);

611
  my @f;
612

613 614 615 616 617 618 619 620 621
  #
  # Sort the features on their start position
  # Ascending order on positive strand, descending on negative strand
  #
  if( $strand == 1 ) {
    @f = sort {$a->start <=> $b->start} @$features;
  } else {
    @f = sort { $b->start <=> $a->start} @$features;
  }
622

623
  my $hstrand     = $f[0]->hstrand;
624
  my $slice       = $f[0]->slice();
Abel Ureta-Vidal's avatar
Abel Ureta-Vidal committed
625
  my $hslice       = $f[0]->hslice();
626
  my $name        = $slice ? $slice->name() : undef;
627 628 629 630
  my $hname       = $f[0]->hseqname;
  my $score       = $f[0]->score;
  my $percent     = $f[0]->percent_id;
  my $analysis    = $f[0]->analysis;
631
  my $pvalue      = $f[0]->p_value();
632 633
  my $external_db_id = $f[0]->external_db_id;
  my $hcoverage   = $f[0]->hcoverage;
634 635 636
  my $group_id    = $f[0]->group_id;
  my $level_id    = $f[0]->level_id;

Laura Clarke's avatar
 
Laura Clarke committed
637
  my $seqname = $f[0]->seqname;
638
  # implicit strand 1 for peptide sequences
639 640
  $strand  ||= 1;
  $hstrand ||= 1;
641
  my $ori = $strand * $hstrand;
642 643

  throw("No features in the array to parse") if(scalar(@f) == 0);
644

645 646
  my $prev1; # where last feature q part ended
  my $prev2; # where last feature s part ended
647

648 649
  my $string;

650 651
  # Use strandedness info of query and hit to make sure both sets of 
  # start and end  coordinates are oriented the right way around.
652 653
  my $f1start;
  my $f1end;
654 655
  my $f2end;
  my $f2start;
656

657
  if ($strand == 1) {
658 659
    $f1start = $f[0]->start;
    $f1end = $f[-1]->end;
660
  } else {
661 662
    $f1start = $f[-1]->start;
    $f1end = $f[0]->end;
663
  }
664

665 666 667 668 669 670 671
  if ($hstrand == 1) {
    $f2start = $f[0]->hstart;
    $f2end = $f[-1]->hend;
  } else {
    $f2start = $f[-1]->hstart;
    $f2end = $f[0]->hend;
  }
672

673 674 675
  #
  # Loop through each portion of alignment and construct cigar string
  #
676

677 678 679 680
  foreach my $f (@f) {
    #
    # Sanity checks
    #
681

682
    if (!$f->isa("Bio::EnsEMBL::FeaturePair")) {
683
      throw("Array element [$f] is not a Bio::EnsEMBL::FeaturePair");
684
    }
685 686
    if( defined($f->hstrand()) && $f->hstrand() != $hstrand ) {
      throw("Inconsistent hstrands in feature array");
687
    }
688 689
    if( defined($f->strand()) && ($f->strand != $strand)) {
      throw("Inconsistent strands in feature array");
690
    }
691 692 693
    if ( defined($name) && $name ne $f->slice->name()) {
      throw("Inconsistent names in feature array [$name - ".
            $f->slice->name()."]");
694
    }
695 696 697
    if ( defined($hname) && $hname ne $f->hseqname) {
      throw("Inconsistent hit names in feature array [$hname - ".
            $f->hseqname . "]");
698
    }
699 700 701
    if ( defined($score) && $score ne $f->score) {
      throw("Inconsisent scores in feature array [$score - " .
            $f->score . "]");
702
    }
703
    if (defined($f->percent_id) && $percent ne $f->percent_id) {
704 705 706 707 708 709
      throw("Inconsistent pids in feature array [$percent - " .
            $f->percent_id . "]");
    }
    if(defined($pvalue) && $pvalue != $f->p_value()) {
      throw("Inconsistant p_values in feature arraw [$pvalue " .
            $f->p_value() . "]");
710
    }
Laura Clarke's avatar
 
Laura Clarke committed
711 712 713 714
    if($seqname && $seqname ne $f->seqname){
      throw("Inconsistent seqname in feature array [$seqname - ".
            $f->seqname . "]");
    }
715 716
    my $start1 = $f->start;      #source sequence alignment start
    my $start2 = $f->hstart();   #hit sequence alignment start
717

718 719 720 721 722
    #
    # More sanity checking
    #
    if (defined($prev1)) {
      if ( $strand == 1 ) {
723 724 725 726 727
        if ($f->start < $prev1) {
          throw("Inconsistent coords in feature array (forward strand).\n" .
		       "Start [".$f->start()."] in current feature should be greater " .
           "than previous feature end [$prev1].");
        }
728
      } else {
729 730 731 732 733
        if ($f->end > $prev1) {
          throw("Inconsistent coords in feature array (reverse strand).\n" .
                "End [".$f->end() ."] should be less than previous feature " .
                "start [$prev1].");
        }
734
      }
735
    }
736

737
    my $length = ($f->end - $f->start + 1);    #length of source seq alignment
738
    my $hlength = ($f->hend - $f->hstart + 1); #length of hit seq alignment
739

740 741
    # using multiplication to avoid rounding errors, hence the
    # switch from query to hit for the ratios
742

743 744 745 746 747 748
    #
    # Yet more sanity checking
    #
    if($query_unit > $hit_unit){
      # I am going to make the assumption here that this situation will 
      # only occur with DnaPepAlignFeatures, this may not be true
Laura Clarke's avatar
Laura Clarke committed
749 750 751
      my $query_p_length = sprintf "%.0f", ($length/$query_unit);
      my $hit_p_length = sprintf "%.0f", ($hlength * $hit_unit);
      if( $query_p_length != $hit_p_length) {
752 753 754
        throw( "Feature lengths not comparable Lengths:" .$length .
               " " . $hlength . " Ratios:" . $query_unit . " " .
               $hit_unit );
Arne Stabenau's avatar
Arne Stabenau committed
755
      }
756
    } else{
Laura Clarke's avatar
Laura Clarke committed
757 758
      my $query_d_length = sprintf "%.0f", ($length*$hit_unit);
      my $hit_d_length = sprintf "%.0f", ($hlength * $query_unit);
759
      if( $length * $hit_unit != $hlength * $query_unit ) {
760 761 762
        throw( "Feature lengths not comparable Lengths:" . $length .
               " " . $hlength . " Ratios:" . $query_unit . " " .
               $hit_unit );
Arne Stabenau's avatar
Arne Stabenau committed
763
      }
764 765 766
    }

    my $hlengthfactor = ($query_unit/$hit_unit);
767

768 769 770 771 772 773 774 775 776
    #
    # Check to see if there is an I type (insertion) gap:
    #   If there is a space between the end of the last source sequence 
    #   alignment and the start of this one, then this is an insertion
    #

    my $insertion_flag = 0;
    if( $strand == 1 ) {
      if( ( defined $prev1 ) && ( $f->start > $prev1 + 1  )) {
777

778 779 780 781 782 783 784
        #there is an insertion
        $insertion_flag = 1;
        my $gap = $f->start - $prev1 - 1;
        if( $gap == 1 ) {
          $gap = ""; # no need for a number if gap length is 1
        }
        $string .= "$gap"."I";
785

Arne Stabenau's avatar
Arne Stabenau committed
786
      }
787 788 789 790

      #shift our position in the source seq alignment
      $prev1 = $f->end();
    } else {
791

792 793
      if(( defined $prev1 ) && ($f->end + 1 < $prev1 )) {

794 795 796 797 798 799 800
        #there is an insertion
        $insertion_flag = 1;
        my $gap = $prev1 - $f->end() - 1;
        if( $gap == 1 ) {
          $gap = ""; # no need for a number if gap length is 1
        }
        $string .= "$gap"."I";
Arne Stabenau's avatar
Arne Stabenau committed
801
      }
802

803 804 805
      #shift our position in the source seq alignment
      $prev1 = $f->start();
    }
806

807 808 809 810 811 812 813
    #
    # Check to see if there is a D type (deletion) gap
    #   There is a deletion gap if there is a space between the end of the
    #   last portion of the hit sequence alignment and this one
    #
    if( $hstrand == 1 ) {
      if((  defined $prev2 ) && ( $f->hstart() > $prev2 + 1 )) {
814

815 816 817
        #there is a deletion
        my $gap = $f->hstart - $prev2 - 1;
        my $gap2 = int( $gap * $hlengthfactor + 0.5 );
818
	
819 820 821 822 823 824 825
        if( $gap2 == 1 ) {
          $gap2 = "";  # no need for a number if gap length is 1
        }
        $string .= "$gap2"."D";

        #sanity check,  Should not be an insertion and deletion
        if($insertion_flag) {
826 827 828 829 830
          if ($message_only_once) {
            warning("Should not be an deletion and insertion on the " .
                    "same alignment region. cigar_line=$string\n");
            $message_only_once = 0;
          }
831 832
        }
      }
833 834 835 836 837
      #shift our position in the hit seq alignment
      $prev2 = $f->hend();

     } else {
      if( ( defined $prev2 ) && ( $f->hend() + 1 < $prev2 )) {
838

839 840 841
        #there is a deletion
        my $gap = $prev2 - $f->hend - 1;
        my $gap2 = int( $gap * $hlengthfactor + 0.5 );
842
	
843 844 845 846 847 848 849
        if( $gap2 == 1 ) {
          $gap2 = "";  # no need for a number if gap length is 1
        }
        $string .= "$gap2"."D";

        #sanity check,  Should not be an insertion and deletion
        if($insertion_flag) {
850 851 852 853 854 855
          if ($message_only_once) {
            warning("Should not be an deletion and insertion on the " .
                    "same alignment region. prev2 = $prev2; f->hend() = " .
                    $f->hend() . "; cigar_line = $string;\n");
            $message_only_once = 0;
          }
856
        }
857
      }
858 859 860
      #shift our position in the hit seq alignment
      $prev2 = $f->hstart();
    }
861

862 863 864
    my $matchlength = $f->end() - $f->start() + 1;
    if( $matchlength == 1 ) {
      $matchlength = "";
865
    }
866 867
    $string .= $matchlength."M";
  }
868

869 870
  $self->{'start'}      = $f1start;
  $self->{'end'}        = $f1end;
Laura Clarke's avatar
 
Laura Clarke committed
871
  $self->{'seqname'}    = $seqname;
872 873 874 875 876
  $self->{'strand'}     = $strand;
  $self->{'score'}      = $score;
  $self->{'percent_id'} = $percent;
  $self->{'analysis'}   = $analysis;
  $self->{'slice'}      = $slice;
Abel Ureta-Vidal's avatar
Abel Ureta-Vidal committed
877
  $self->{'hslice'}     = $hslice;
878 879 880 881 882 883
  $self->{'hstart'}     = $f2start;
  $self->{'hend'}       = $f2end;
  $self->{'hstrand'}    = $hstrand;
  $self->{'hseqname'}   = $hname;
  $self->{'cigar_string'} = $string;
  $self->{'p_value'}      = $pvalue;
884 885
  $self->{'external_db_id'} = $external_db_id; 
  $self->{'hcoverage'}    = $hcoverage; 
886 887
  $self->{'group_id'}     = $group_id;
  $self->{'level_id'}     = $level_id;
888
}
889

Laura Clarke's avatar
 
Laura Clarke committed
890

Arne Stabenau's avatar
Arne Stabenau committed
891

892

Laura Clarke's avatar
 
Laura Clarke committed
893 894


895 896 897 898 899 900
=head2 _hit_unit

  Args       : none
  Description: abstract method, overwrite with something that returns
               one or three
  Returntype : int 1,3
Arne Stabenau's avatar
Arne Stabenau committed
901
  Exceptions : none
902
  Caller     : internal
903
  Status     : Stable
Laura Clarke's avatar
 
Laura Clarke committed
904 905 906

=cut

907 908
sub _hit_unit {
  my $self = shift;
909
  throw( "Abstract method call!" );
Laura Clarke's avatar
 
Laura Clarke committed
910 911
}

Graham McVicker's avatar
Graham McVicker committed
912 913


914
=head2 _query_unit
Laura Clarke's avatar
 
Laura Clarke committed
915

916