Created by: keiranmraine
The HLA contigs in human GRCh38 have names like 'HLA-A*01:01:01:01'. Constructing a query against these results in:

    HLA-A*01:01:01:01:0-13503 EVAL_ERROR: You must specify a region in the format chr, chr:start or chr:start-end at /.../local/lib/perl5/x86_64-linux-thread-multi/Bio/DB/HTS/Tabix.pm line 64.

This is a failure in the way the region format checking is carried out.
The C tabix command line works, with one caveat: if the chromosome name contains `:` then it must be followed by a trailing `:` when addressed on its own:

    $ tabix test.tsv.gz 'HLA-A*01:01:01:01:1-50'
    HLA-A*01:01:01:01 1 500 0.65
    $ tabix test.tsv.gz 'HLA-A*01:01:01:01:1'
    HLA-A*01:01:01:01 1 500 0.65
    $ tabix test.tsv.gz 'HLA-A*01:01:01:01:'
    HLA-A*01:01:01:01 1 500 0.65
    $ tabix test.tsv.gz 'HLA-A*01:01:01:01'
    # no result

This PR adds a new Tabix method, `query_full`, which lets the calling code specify the elements of a region separately rather than as a single string, and then constructs a valid region string to pass to the C layer.
As there is overlap, the original `query` function calls the `query_full` method for the common components.
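
To illustrate the idea of building the region string from separate elements, here is a minimal sketch in Python rather than the Perl used by the PR. The function name `build_region` and its argument handling are hypothetical and are not the PR's implementation; the only behaviour taken from this description is the trailing `:` rule shown in the tabix examples above.

```python
from typing import Optional


def build_region(chrom: str, start: Optional[int] = None,
                 end: Optional[int] = None) -> str:
    """Build a tabix-style region string from separate elements.

    Hypothetical helper for illustration only; not the query_full code.
    """
    # Whitespace is the only contig-naming restriction assumed here.
    if not chrom or any(ch.isspace() for ch in chrom):
        raise ValueError("contig name must be non-empty and contain no whitespace")
    if end is not None and start is None:
        raise ValueError("end given without start")

    if start is None:
        # A bare contig name containing ':' needs a trailing ':' so the
        # last name component is not read as a start coordinate.
        return chrom + ":" if ":" in chrom else chrom
    if end is None:
        return f"{chrom}:{start}"
    return f"{chrom}:{start}-{end}"
```

With this sketch, `build_region("HLA-A*01:01:01:01")` yields `HLA-A*01:01:01:01:` and `build_region("HLA-A*01:01:01:01", 1, 50)` yields `HLA-A*01:01:01:01:1-50`, matching the forms that the tabix command line accepted above.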
I would recommend extending the error messages to indicate that this is a possible cause of a failure to parse, as well as deprecating the original method.
Alternatively, the original pattern match in `query` could be modified; however, the only restriction on contig naming that I'm aware of is whitespace.
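
As a rough illustration of why a pattern-based fix is awkward, the Python sketch below parses a region string with a permissive pattern that allows any non-whitespace contig name. The function `parse_region` and its regular expression are assumptions made for this example; they are not the check used in `Tabix.pm`.

```python
import re
from typing import Optional, Tuple

# Permissive pattern: any non-whitespace name, then :start and optional -end.
_REGION = re.compile(r"(\S+):(\d+)(?:-(\d+))?")


def parse_region(region: str) -> Tuple[str, Optional[int], Optional[int]]:
    """Split a region string into (contig, start, end). Illustrative only."""
    if region.endswith(":"):
        # Trailing ':' marks a bare contig name, as in the tabix examples.
        return region[:-1], None, None
    match = _REGION.fullmatch(region)
    if match is None:
        return region, None, None
    name, start, end = match.groups()
    return name, int(start), int(end) if end is not None else None


# A bare colon-containing name is misread as contig:start.
print(parse_region("HLA-A*01:01:01:01"))
# The same name parses correctly once coordinates or a trailing ':' follow.
print(parse_region("HLA-A*01:01:01:01:1-50"))
print(parse_region("HLA-A*01:01:01:01:"))
```

Under these assumptions the bare name is split into contig `HLA-A*01:01:01` and start `1`, so a string pattern alone cannot tell the two readings apart; passing the elements separately avoids the ambiguity.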