Slow Performance using filter_intervals

I’m testing on a small (1.2 GB) dataset, and I would like to filter the variants from a VCF file that overlap many short intervals. When I filter by a single interval, counting takes >12 s.

This seems to be much slower than I would expect based on the results in this blog post (16 GB on a laptop in 300 ms). I’ve got 40 cores and 256 GB of memory, and have seen high CPU and memory use. How is my use different from the example? Why does it take so long when it says it is only accessing 1 of 38 partitions?

Here’s my code, and the matrix table description. Apologies if I’m missing something basic.

%time wes_genotypes.count()
CPU times: user 20 ms, sys: 10.6 ms, total: 30.6 ms
Wall time: 34.2 s
(622351, 520)


intervals = [hl.parse_locus_interval('1:10027520-10027542')]
exon_vars = hl.filter_intervals(wes_genotypes, intervals, keep=True)
%time exon_vars.count()

CPU times: user 5.01 ms, sys: 0 ns, total: 5.01 ms
Wall time: 12.8 s
(1, 520)
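
For the many-intervals case I’m ultimately after, filter_intervals accepts a list, so I assume something like this (the extra interval strings are made up, not real exon coordinates) would apply them all in one pass:

# Filter to many short intervals in a single filter_intervals call.
# All interval strings except the first are placeholders.
exon_strs = ['1:10027520-10027542', '1:10030000-10030050', '2:5000000-5000100']
intervals = [hl.parse_locus_interval(s) for s in exon_strs]
exon_vars = hl.filter_intervals(wes_genotypes, intervals, keep=True)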

wes_genotypes.describe()
(omitting a large number of INFO fields)

----------------------------------------
Global fields:
    None
----------------------------------------
Column fields:
    's': str
----------------------------------------
Row fields:
    'locus': locus<GRCh37>
    'alleles': array<str>
    'rsid': str
    'qual': float64
    'filters': set<str>
    'info': struct {
        ...
    }
----------------------------------------
Entry fields:
    'AD': array<int32>
    'DP': int32
    'GQ': int32
    'GT': call
    'MIN_DP': int32
    'PL': array<int32>
    'SB': array<int32>
----------------------------------------
Column key: ['s']
Row key: ['locus', 'alleles']
----------------------------------------

Thanks for flagging this. If there are performance bugs hiding here, I want to get them fixed.

The performance of filter_intervals is proportional to the number of partitions kept, but it also depends on the operations upstream and downstream of the filter.
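
You can check the partition count directly; a quick sketch against the dataset above:

# The VCF import produced 38 partitions here; filter_intervals cost
# scales with how many of them survive the interval filter.
print(wes_genotypes.n_partitions())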

There are two reasons the computations in that blog post ran in under a second:

  1. The simplicity of the pipeline. It was (sketched in code after this list):

    • read (read_matrix_table)
    • filter intervals
    • count

    Hail’s native formats are much faster to read than VCFs, and counting is very cheap.

  2. Tiny partitions. In the edge case where the queried interval overlaps only one variant, execution takes as long as a single core needs to run every piece of the pipeline on that one partition.

    My partitions were ~1 MB in that post, and the native format was much faster to read. Coming from VCF, things are going to be slower, and your partitions are ~30x bigger (32 MB partitions are very reasonable for batch-processing tasks!).
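
For reference, that blog-post pipeline amounts to roughly the sketch below, assuming the data has already been written to Hail’s native format (the path is a placeholder):

import hail as hl

# Reading a native-format matrix table is much faster than re-parsing a VCF.
mt = hl.read_matrix_table('data/wes.mt')  # placeholder path

# filter_intervals prunes to the partitions overlapping the interval,
# so count() only touches those partitions.
interval = [hl.parse_locus_interval('1:10027520-10027542')]
print(hl.filter_intervals(mt, interval, keep=True).count())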

I have a couple of follow-ups:

  1. What is the upstream pipeline? Just import_vcf, or are there QC/filtering steps before the interval filter? Those will definitely slow Hail down.

  2. Could you try importing with more partitions in import_vcf (min_partitions=1000?) and see whether that helps?
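
Concretely, the second suggestion would look something like this (the VCF path is a placeholder):

# Ask import_vcf for at least 1000 partitions; min_partitions is a
# lower-bound hint, so the exact partition count may differ.
wes_genotypes = hl.import_vcf('data/wes.vcf.bgz', min_partitions=1000)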

My pipeline has no additional steps, just import, filter, and count.

I tried min_partitions=1000 and it sped things up. With ~1 MB partitions, the count takes ~600 ms. Thanks for pointing out the importance of partition size.

I’ll update when I try with larger data.

The speed here shouldn’t really vary with data size. However, we see problems with more than ~100K partitions in Spark, so if you do have large data, tiny partitions won’t be an option.
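
If the data does grow, one rough way to choose a partition count is to target a per-partition size while capping the total; this helper is purely illustrative, not a Hail API:

# Illustrative helper: aim for ~1 MB partitions but stay under the
# ~100K-partition range where Spark starts to struggle.
def pick_min_partitions(dataset_bytes, target_bytes=1 << 20, cap=100_000):
    return min(max(dataset_bytes // target_bytes, 1), cap)

print(pick_min_partitions(int(1.2e9)))  # 1.2 GB -> ~1100 partitions
print(pick_min_partitions(int(1e12)))   # 1 TB -> capped at 100,000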