Slow Performance using filter_intervals

I’m testing on a small (1.2 GB) dataset, and I would like to filter the variants from a VCF file that overlap many short intervals. When I filter by a single interval, counting takes >12 s.

This seems to be much slower than I would expect based on the results in this blog post (16 GB on a laptop in 300 ms). I’ve got 40 cores and 256 GB of memory, and have seen high CPU and memory use. How is my use different from the example? Why does it take so long when it says it is only accessing 1 of 38 partitions?

Here’s my code, and the matrix table description. Apologies if I’m missing something basic.

%time wes_genotypes.count()
CPU times: user 20 ms, sys: 10.6 ms, total: 30.6 ms
Wall time: 34.2 s
(622351, 520)


intervals = [hl.parse_locus_interval('1:10027520-10027542')]
exon_vars = hl.filter_intervals(wes_genotypes, intervals, keep=True)
%time exon_vars.count()

CPU times: user 5.01 ms, sys: 0 ns, total: 5.01 ms
Wall time: 12.8 s
(1, 520)
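
For the many-intervals case I’m ultimately after, filter_intervals accepts a list, so I assume something like this (the extra interval strings are made up, not real exon coordinates) would apply them all in one pass:

# Filter to many short intervals in a single filter_intervals call.
# All interval strings except the first are placeholders.
exon_strs = ['1:10027520-10027542', '1:10030000-10030050', '2:5000000-5000100']
intervals = [hl.parse_locus_interval(s) for s in exon_strs]
exon_vars = hl.filter_intervals(wes_genotypes, intervals, keep=True)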

wes_genotypes.describe()
(omitting a large number of INFO fields)

----------------------------------------
Global fields:
    None
----------------------------------------
Column fields:
    's': str
----------------------------------------
Row fields:
    'locus': locus<GRCh37>
    'alleles': array<str>
    'rsid': str
    'qual': float64
    'filters': set<str>
    'info': struct {
        ...
    }
----------------------------------------
Entry fields:
    'AD': array<int32>
    'DP': int32
    'GQ': int32
    'GT': call
    'MIN_DP': int32
    'PL': array<int32>
    'SB': array<int32>
----------------------------------------
Column key: ['s']
Row key: ['locus', 'alleles']
----------------------------------------

Thanks for flagging this. If there are performance bugs hiding here, I want to get them fixed.

The performance of filter_intervals is proportional to the number of partitions kept, but it also depends on the operations upstream and downstream of the filter.
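
You can check the partition count directly; a quick sketch against the dataset above:

# The VCF import produced 38 partitions here; filter_intervals cost
# scales with how many of them survive the interval filter.
print(wes_genotypes.n_partitions())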

There are two reasons the computations in that blog post ran in under a second:

  1. The simplicity of the pipeline. It was (sketched in code after this list):

    • read (read_matrix_table)
    • filter intervals
    • count

    Hail’s native formats are much faster to read than VCFs, and counting is very cheap.

  2. Tiny partitions. In the edge case where the queried interval overlaps only one variant, execution takes as long as a single core needs to run every piece of the pipeline on that one partition.

    My partitions were ~1 MB in that post, and the native format was much faster to read. Coming from VCF, things are going to be slower, and your partitions are ~30x bigger (32 MB partitions are very reasonable for batch-processing tasks!).
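
For reference, that blog-post pipeline amounts to roughly the sketch below, assuming the data has already been written to Hail’s native format (the path is a placeholder):

import hail as hl

# Reading a native-format matrix table is much faster than re-parsing a VCF.
mt = hl.read_matrix_table('data/wes.mt')  # placeholder path

# filter_intervals prunes to the partitions overlapping the interval,
# so count() only touches those partitions.
interval = [hl.parse_locus_interval('1:10027520-10027542')]
print(hl.filter_intervals(mt, interval, keep=True).count())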

I have a couple of follow-ups:

  1. What is the upstream pipeline? Just import_vcf, or are there QC/filtering steps before the interval filter? Those will definitely slow Hail down.

  2. Could you try importing with more partitions in import_vcf (min_partitions=1000?) and see whether that helps?
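
Concretely, the second suggestion would look something like this (the VCF path is a placeholder):

# Ask import_vcf for at least 1000 partitions; min_partitions is a
# lower-bound hint, so the exact partition count may differ.
wes_genotypes = hl.import_vcf('data/wes.vcf.bgz', min_partitions=1000)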

My pipeline has no additional steps, just import, filter, and count.

I tried min_partitions=1000 and it sped things up. With ~1 MB partitions, the count takes ~600 ms. Thanks for pointing out the importance of partition size.

I’ll update when I try with larger data.

The speed here shouldn’t really vary with data size. However, we see problems with more than ~100K partitions in Spark, so if you do have large data, tiny partitions won’t be an option.
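
If the data does grow, one rough way to choose a partition count is to target a per-partition size while capping the total; this helper is purely illustrative, not a Hail API:

# Illustrative helper: aim for ~1 MB partitions but stay under the
# ~100K-partition range where Spark starts to struggle.
def pick_min_partitions(dataset_bytes, target_bytes=1 << 20, cap=100_000):
    return min(max(dataset_bytes // target_bytes, 1), cap)

print(pick_min_partitions(int(1.2e9)))  # 1.2 GB -> ~1100 partitions
print(pick_min_partitions(int(1e12)))   # 1 TB -> capped at 100,000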