New optimizer pass that extracts point queries and interval filters

hl.filter_intervals is one of the best ways to speed up pipelines that only need to look at a small fraction of your data, but it is much more cumbersome to use than simpler methods like filter_rows.

As of 0.2.15, Hail includes an optimizer pass that tries to rewrite filter / filter_rows method calls as filter_intervals when possible. This can make pipelines orders of magnitude faster, since it reduces the amount of data read from disk – and data you don’t read can’t slow you down! You’ll know this is happening if you see messages about Hail reading X of Y data partitions.
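
To see why reading fewer partitions matters, here is a minimal sketch (plain Python, not Hail's actual implementation) of how a range-partitioned dataset can answer an interval query by touching only the partitions that overlap it:

```python
import bisect

def partitions_to_read(bounds, query_start, query_end):
    """Given sorted partition boundaries `bounds` (partition i covers keys in
    [bounds[i], bounds[i+1])), return indices of the partitions that overlap
    the half-open query interval [query_start, query_end)."""
    # The first partition whose key range could contain query_start.
    first = max(bisect.bisect_right(bounds, query_start) - 1, 0)
    # One past the last partition that starts before query_end.
    last = bisect.bisect_left(bounds, query_end)
    return list(range(first, min(last, len(bounds) - 1)))

# 128 partitions, each covering 1,000,000 consecutive keys.
bounds = [i * 1_000_000 for i in range(129)]

# A query over [20_000_000, 30_000_000) touches 10 of 128 partitions;
# the other 118 are never read from disk at all.
print(len(partitions_to_read(bounds, 20_000_000, 30_000_000)))  # 10
```

Everything outside the returned partitions is skipped entirely, which is where the "reading X of Y data partitions" numbers come from.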

Specifically, this pass looks for comparisons between key fields and constant values. Some example queries:

>>> mt.filter_rows(mt.locus.contig == '16').count_rows()
2019-06-07 17:16:01 Hail: INFO: reading 5 of 128 data partitions
384

>>> mt.filter_rows((mt.locus.contig == '16') | (mt.locus.contig == '19')).count_rows()
2019-06-07 17:16:07 Hail: INFO: reading 10 of 128 data partitions
730

>>> mt.filter_rows(hl.literal({'16', '19'}).contains(mt.locus.contig)).count_rows()
2019-06-07 17:16:19 Hail: INFO: reading 10 of 128 data partitions
730

>>> mt.filter_rows((mt.locus.contig == '16') & (mt.locus.position > 10_000_000)).count_rows()
2019-06-07 17:16:24 Hail: INFO: reading 4 of 128 data partitions
302

>>> mt.filter_rows(mt.locus == hl.parse_locus('1:3761547')).count_rows()
2019-06-07 17:16:32 Hail: INFO: reading 1 of 128 data partitions
1

>>> mt.filter_rows(hl.parse_locus_interval('16:20000000-30000000').contains(mt.locus)).count_rows()
2019-06-07 17:16:55 Hail: INFO: reading 1 of 128 data partitions
35
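
Conceptually, the pass maps each comparison against a key field to a set of key intervals: equality becomes a point interval, `|` unions interval sets, and `&` intersects them. Here is a toy sketch of that idea over integer keys (Hail's real pass works on its internal IR with typed interval values, so this is purely illustrative):

```python
def eq(v):
    # key == v  ->  the single point interval [v, v + 1)
    return [(v, v + 1)]

def gt(v, key_max):
    # key > v  ->  [v + 1, key_max), given an upper bound on the key space
    return [(v + 1, key_max)]

def union(a, b):
    # (key in A) | (key in B)  ->  sorted, merged interval list
    merged = []
    for start, end in sorted(a + b):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def intersect(a, b):
    # (key in A) & (key in B)  ->  pairwise overlaps
    out = []
    for s1, e1 in a:
        for s2, e2 in b:
            s, e = max(s1, s2), min(e1, e2)
            if s < e:
                out.append((s, e))
    return out

# (key == 5) | (key == 9)  ->  two point intervals
print(union(eq(5), eq(9)))           # [(5, 6), (9, 10)]
# (key > 3) & (key == 5)   ->  still just the point 5
print(intersect(gt(3, 100), eq(5)))  # [(5, 6)]
```

Once the predicate has been turned into an interval list like this, pruning partitions is just an overlap check against each partition's key range.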

In conclusion, if you see a log message about Hail reading only some of your data partitions for a filter you never wrote as an interval filter, you didn't do anything wrong: Hail is doing something cool.