Create a way to Tabix query against chr containing ':'

Marek Szuba requested to merge github/fork/keiranmraine/hotfix/2.5.1 into master

Created by: keiranmraine

Hi,

The HLA contigs in human GRCh38 have names like 'HLA-A*01:01:01:01'. Constructing a query against these results in:

HLA-A*01:01:01:01:0-13503
EVAL_ERROR: You must specify a region in the format chr, chr:start or chr:start-end at /.../local/lib/perl5/x86_64-linux-thread-multi/Bio/DB/HTS/Tabix.pm line 64.

The failure comes from the way the region format is checked: the check does not allow for ':' inside the contig name.

The C tabix command line works, with one caveat: if the chromosome name contains ':', it must be followed by a trailing ':' when addressed on its own:

$ tabix test.tsv.gz 'HLA-A*01:01:01:01:1-50'
HLA-A*01:01:01:01   1   500 0.65
$ tabix test.tsv.gz 'HLA-A*01:01:01:01:1'
HLA-A*01:01:01:01   1   500 0.65
$ tabix test.tsv.gz 'HLA-A*01:01:01:01:'
HLA-A*01:01:01:01   1   500 0.65
$ tabix test.tsv.gz 'HLA-A*01:01:01:01'
# no result

This PR adds a new Tabix method, query_full, which lets the calling code specify the elements of a region separately rather than as a single string, and then constructs a valid region string to pass to the C layer.

As there is overlap, the original query function calls the query_full method for the common components.
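
To make the idea concrete, here is a minimal sketch in Python of how a region string could be assembled from separate components. It is an illustration only, not the Perl code in this PR: the function name build_region and its parameters are hypothetical, and the trailing ':' rule is taken from the tabix command line behaviour shown above.

```python
def build_region(seq_id, start=None, end=None):
    """Assemble a tabix-style region string from separate components.

    Hypothetical helper for illustration; not the query_full implementation.
    """
    if start is None and end is not None:
        raise ValueError("an end coordinate requires a start coordinate")
    if start is None:
        # A name containing ':' needs a trailing ':' when used on its own,
        # otherwise its last field would be read as a coordinate.
        return seq_id + ":" if ":" in seq_id else seq_id
    if end is None:
        return f"{seq_id}:{start}"
    return f"{seq_id}:{start}-{end}"


if __name__ == "__main__":
    print(build_region("HLA-A*01:01:01:01", 1, 50))
    print(build_region("HLA-A*01:01:01:01"))
    print(build_region("chr1"))
```

Because the caller passes the name and coordinates separately, the name never has to be parsed back out of a string, which is what avoids the ambiguity.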

I would recommend extending the error messages to indicate that a ':' in the contig name is a possible cause of the parse failure, and deprecating the original query function.

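Alternatively, the original pattern match in query could be modified. However, the only restriction on contig naming that I'm aware of is that names cannot contain whitespace, so the pattern would have to accept almost any character in the name.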
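
As a rough illustration of why a pattern change alone is awkward, the Python sketch below splits a region string at its last ':'. The function split_region and its pattern are hypothetical and are not taken from Tabix.pm. It handles the forms with explicit coordinates or a trailing ':', but a bare name that contains ':' is still misread, because its last field looks like a start coordinate.

```python
import re

# Hypothetical pattern: whatever follows the LAST colon is either empty,
# "start", or "start-end".
_COORDS = re.compile(r"^(?:(\d+)(?:-(\d+))?)?$")


def split_region(region):
    """Split a region string into (name, start, end) at its last colon."""
    name, sep, tail = region.rpartition(":")
    if not sep:
        # No colon at all: the whole string is the contig name.
        return region, None, None
    match = _COORDS.match(tail)
    if match is None:
        # The tail is not coordinate-like, so treat the colon as part of the name.
        return region, None, None
    start, end = match.groups()
    return name, int(start) if start else None, int(end) if end else None


if __name__ == "__main__":
    for text in ("HLA-A*01:01:01:01:1-50",
                 "HLA-A*01:01:01:01:",
                 "HLA-A*01:01:01:01"):
        print(text, "->", split_region(text))
```

The last case comes back as the name 'HLA-A*01:01:01' with a start of 1, which is the same ambiguity that makes the trailing ':' necessary on the command line.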
