Interval filtering in Hail
While this functionality has been in Hail for a while, we never actually stopped to announce some pretty awesome changes that went in at the end of 2016. Interval filtering is a very common operation on genomic datasets, and something that’s definitely worth optimizing. As part of their scientific process, we have seen analysts:
- drop sex chromosomes
- restrict to sex chromosomes
- remove low-complexity regions
- select the exons of their favorite genes
- and many more specialized interval filters!
How fast is it?
You all know that Hail uses Spark as a distributed compute engine, and maybe know something about how Spark has distributed collections that contain a number of ‘partitions’, which are shards of the dataset. Hail extends Spark to keep track of a little bit of metadata about these shards, like the first and last variant. We originally built this feature to enable us to do more efficient joins (which we’ll blog about in the future), but realized it could be used to do a lot better for certain types of filter operations.
Most operations in Hail involve reading the full dataset, processing it, and returning a result. If you’re looking for just variants annotated as “Loss of Function”, you still need to peek at the synonymous and missense variants to know not to include them. But intervals are special. Instead of looking at all the data, we can look at much, much less: in “Big O” notation, interval filtering is not
N being the number of variants in the dataset) but
n is the number of variants actually contained in the intervals).
If the above description sounded too abstract, now let’s make it real. I started by downloading the [gnomAD] (http://gnomad.broadinstitute.org/about) publicly-released exome sites VDS (you can get it here) to my laptop. This is a dataset which, while sites-only, is still pretty big. It contains almost 15M variants, and the compressed on-disk representation is more than 16 gigabytes. If you download it and peek at the variant annotation schema (
print(vds.variant_schema)), you’ll see why!
O(N) operation on this dataset is to read it and count the variants. This operation will be bounded by the time it takes to decompress and decode all the data from disk. On my laptop, this takes 17 minutes.
In : %time vds.count_variants() [Stage 13:=================================================>(13589 + 3) / 13592] CPU times: user 105 ms, sys: 34.8 ms, total: 140 ms Wall time: 16min 53s Out: 14646572L
But suppose we just care about chromosome 19. Let’s write a filter using
filter_variants_expr to do this. Even though we just care about one chromosome, the time is about the same:
In : %time vds.filter_variants_expr('v.contig == "19"').count_variants() [Stage 15:=================================================>(13587 + 5) / 13592] CPU times: user 107 ms, sys: 35.9 ms, total: 143 ms Wall time: 16min 59s Out: 1061475L
Now, we’ll start using
filter_intervals, and things are about to get a whole lot speedier.
In : %time vds.filter_intervals(Interval.parse('19')).count_variants() hail: info: pruned 0 redundant intervals hail: info: interval filter loaded 987 of 13592 partitions [Stage 16:======================================================>(984 + 3) / 987 ]CPU times: user 12.2 ms, sys: 4.15 ms, total: 16.4 ms Wall time: 1min 37s Out: 1061475L
Using an interval filter was about 10 times faster. The log output above hints at what’s going on behind the scenes: of the 13592 data shards, we’re only touching 987 of them! But the interval magic doesn’t stop here. We can go even more granular, and go even faster. This time, we’ll query just 1 megabase:
In : %time vds.filter_intervals(Interval.parse('19:10M-11M')).count_variants() hail: info: pruned 0 redundant intervals hail: info: interval filter loaded 30 of 13592 partitions [Stage 17:=================================================> (26 + 4) / 30] CPU times: user 2.45 ms, sys: 1.14 ms, total: 3.59 ms Wall time: 2.41 s Out: 29930L
Let’s shrink our query by another factor of 10:
In : %time vds.filter_intervals(Interval.parse('19:10M-10.1M')).count_variants() hail: info: pruned 0 redundant intervals hail: info: interval filter loaded 3 of 13592 partitions CPU times: user 1.94 ms, sys: 840 µs, total: 2.78 ms Wall time: 305 ms Out: 2186L
There we have it! Even using the same data representation which we optimized for maximal bandwidth in processing entire datasets, we’re able to query a 100-kilobase window of the genome in well under one second.
We haven’t yet seen many people taking full advantage of this functionality, partly because we never told them about it. Well, we’ve just done that. Ready, set, go!