Create a way to Tabix query against chr containing ':'
Created by: keiranmraine
Hi,
The HLA contigs in human GRCh38 have names like 'HLA-A*01:01:01:01'. Constructing a query against these results in:
HLA-A*01:01:01:01:0-13503
EVAL_ERROR: You must specify a region in the format chr, chr:start or chr:start-end at /.../local/lib/perl5/x86_64-linux-thread-multi/Bio/DB/HTS/Tabix.pm line 64.
This is a failure in the way the region format checking is carried out.
The C tabix command line works, with one caveat: if the chromosome name contains ':', it must be followed by a trailing ':' when addressed on its own:
$ tabix test.tsv.gz 'HLA-A*01:01:01:01:1-50'
HLA-A*01:01:01:01 1 500 0.65
$ tabix test.tsv.gz 'HLA-A*01:01:01:01:1'
HLA-A*01:01:01:01 1 500 0.65
$ tabix test.tsv.gz 'HLA-A*01:01:01:01:'
HLA-A*01:01:01:01 1 500 0.65
$ tabix test.tsv.gz 'HLA-A*01:01:01:01'
# no result
This PR adds a new Tabix method, query_full, which allows the calling code to specify the elements of a region separately rather than as a single string, and then constructs a valid region string to pass to the C layer.
As there is overlap, the original query function calls the query_full method for the common components.
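As a rough illustration of the region construction described above, here is a minimal shell sketch that mirrors the tabix commands shown earlier. It is not the Perl code from this PR: build_region is a hypothetical helper name, and the argument order (contig, start, end) is an assumption made for the example. The one design point it shows is that the colon is always emitted, so a contig name containing ':' is never mistaken for a contig plus a position.

```shell
# Hypothetical helper (not part of Bio::DB::HTS): build a tabix region
# string from separate parts.
# Usage: build_region CONTIG [START [END]]
# The body runs in a subshell so its variables do not leak to the caller.
build_region() (
    contig=${1:-}
    start=${2:-}
    end=${3:-}

    if [ -z "$contig" ]; then
        echo "build_region: contig name is required" >&2
        exit 1
    fi

    # Always append the colon, even for a contig-only query, so that
    # names such as 'HLA-A*01:01:01:01' stay unambiguous.
    region="${contig}:"

    if [ -n "$start" ]; then
        region="${region}${start}"
        # An end coordinate only makes sense together with a start.
        if [ -n "$end" ]; then
            region="${region}-${end}"
        fi
    fi

    printf '%s\n' "$region"
)
```

With this helper, a call such as tabix test.tsv.gz "$(build_region 'HLA-A*01:01:01:01' 1 50)" would pass 'HLA-A*01:01:01:01:1-50' to tabix, and omitting both coordinates would pass 'HLA-A*01:01:01:01:'.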
I would recommend extending the error messages to indicate that a ':' in the contig name is a possible cause of the failure to parse, as well as deprecating the original query function.
Alternatively, the original pattern match in query could be modified; however, the only restriction on contig naming I'm aware of is whitespace.
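To show what a modified pattern match might look like, and why it stays ambiguous, here is a small shell sketch that splits a region string at its last colon. parse_region is a hypothetical name, and the splitting rule is an assumption chosen to be consistent with the command-line behaviour shown earlier; it is not taken from the tabix or Bio::DB::HTS source.

```shell
# Hypothetical parser: split a region string at its LAST colon.
# Prints "contig=... start=... end=..."; start/end are empty when absent.
parse_region() (
    region=${1:-}

    case $region in
        *:*) ;;   # at least one colon: fall through and split
        *)   printf 'contig=%s start= end=\n' "$region"; exit 0 ;;
    esac

    contig=${region%:*}    # everything before the last colon
    range=${region##*:}    # everything after the last colon

    case $range in
        '')                 start=; end= ;;   # trailing colon, whole contig
        *[!0-9-]*|-*|*-*-*) echo "parse_region: bad range '$range'" >&2
                            exit 1 ;;
        *-*)                start=${range%-*}; end=${range#*-} ;;
        *)                  start=$range; end= ;;
    esac

    printf 'contig=%s start=%s end=%s\n' "$contig" "$start" "$end"
)

parse_region 'HLA-A*01:01:01:01:1-50'  # contig=HLA-A*01:01:01:01 start=1 end=50
parse_region 'HLA-A*01:01:01:01:'      # contig=HLA-A*01:01:01:01 start= end=
parse_region 'HLA-A*01:01:01:01'       # contig=HLA-A*01:01:01 start=01 end=
```

The last call shows the limitation: without the trailing colon, the final ':01' of the contig name is read as a start position, so a string-only interface cannot tell the two cases apart.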