
Feature/blast parser

Marek Szuba requested to merge feature/blast_parser into master

Created by: avullo

This is a parser for the formatted output of BLAST+ applications (e.g. blast_formatter).

WARNING: Only a LIMITED number of output formats are supported, namely the column-based ones.

In other words, this parser will only correctly parse output files which have been produced by a BLAST+ application by specifying one of the following "alignment view options":

  • 6: tabular
  • 7: tabular with comment lines
  • 10: comma-separated values

The parser's "open" method takes two arguments: the first is the name of the file to parse, and the second is the same output format string that was passed to the BLAST+ application with the '-outfmt' option.

A valid output format may contain just the alignment view option with no format specifiers, e.g. '6', '7' or '10', in which case the parser reads the columns in the order of the blast_formatter default format specifiers:

'qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore'

Alternatively, it may list format specifiers after the view option, in which case the parser reads the columns in the order given in the second argument of the open method, e.g.:

'7 qacc sacc evalue score nident pident qstart qend sstart send length positive ppos qseq sseq'

makes the parser read the tab-separated columns as qacc, sacc, ... in that order.
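To make the format-string handling concrete, here is a minimal Python sketch of the idea. It is not the parser's actual code: the names `parse_outfmt`, `split_record`, `DEFAULT_FIELDS` and `DELIMITERS` are hypothetical, and it assumes that options 6 and 7 are tab-separated, option 10 is comma-separated, and comment lines start with '#'.

```python
# Default column order used when the format string has no specifiers.
DEFAULT_FIELDS = (
    "qseqid sseqid pident length mismatch gapopen "
    "qstart qend sstart send evalue bitscore"
).split()

# Assumed column delimiter for each supported alignment view option.
DELIMITERS = {"6": "\t", "7": "\t", "10": ","}


def parse_outfmt(outfmt):
    """Split an -outfmt string into (delimiter, list of field names)."""
    tokens = outfmt.split()
    if not tokens or tokens[0] not in DELIMITERS:
        raise ValueError(f"unsupported output format: {outfmt!r}")
    # No specifiers after the view option: fall back to the defaults.
    fields = tokens[1:] or list(DEFAULT_FIELDS)
    return DELIMITERS[tokens[0]], fields


def split_record(line, delimiter, fields):
    """Map one output line to a dict keyed by field name.

    Returns None for blank lines and for comment lines (option 7).
    """
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    values = line.split(delimiter)
    if len(values) != len(fields):
        raise ValueError(
            f"expected {len(fields)} columns, found {len(values)}"
        )
    return dict(zip(fields, values))
```

Calling `parse_outfmt("7 qacc sacc evalue")` in this sketch yields a tab delimiter and the three field names, while `parse_outfmt("10")` yields a comma delimiter and the twelve default fields.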

IMPORTANT: The parser automatically generates get_raw_[field_name] and get_[field_name] accessor methods, where [field_name] is the name of each format specifier given in the output format string. Invoking a getter for a field that is not in the output format raises an exception.
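The accessor behaviour can be illustrated with a short Python sketch. This is only an analogy for the mechanism, not the parser's implementation: the `HitRecord` class and `_convert` helper are hypothetical, and the split between the two getters (get_raw_* returning the column text unchanged, get_* returning a value converted to a number where possible) is an assumption made for illustration.

```python
def _convert(text):
    """Assumed conversion: try int, then float, else keep the string."""
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            pass
    return text


class HitRecord:
    """One parsed line, with getters derived from the field names."""

    def __init__(self, fields, values):
        self._data = dict(zip(fields, values))

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        data = self.__dict__.get("_data", {})
        # Check "get_raw_" first, since "get_" is a prefix of it.
        for prefix, raw in (("get_raw_", True), ("get_", False)):
            if name.startswith(prefix):
                field = name[len(prefix):]
                if field not in data:
                    raise AttributeError(
                        f"field {field!r} is not in the output format"
                    )
                value = data[field]
                return (lambda: value) if raw else (lambda: _convert(value))
        raise AttributeError(name)
```

With fields `["qacc", "evalue"]`, a record in this sketch answers `get_qacc()` and `get_raw_evalue()`, whereas `get_bitscore()` raises an exception because bitscore was not part of the format string.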

All three supported alignment view options (6, 7 and 10) have been tested. Option 7 has been tested with the output format used by Compara in Bio::EnsEMBL::Compara::RunnableDB::BlastAndParsePAF, and with an output format as close as possible to that used by the Ensembl Web team in their NCBIBLAST module in the private sanger-plugins repo.
