Count_variants() very slow after simple filter_variants


I have a vds file with 40 million variants; it loads very quickly (less than a second).

vds ='intersect_imputed.vds')

Counting the number of variants is quite slow, but acceptable:

from datetime import datetime

startTime =
[Stage 3:====================================================>(2255 + 2) / 2257]40405505L
print "It took: ", ( - startTime)
It took: 0:04:23.685591

Note: I use datetime because timeit doesn’t seem to like Hail output

QC and filtering are also lightning fast:

startTime =
vds = vds.variant_qc().cache()
vds = vds.filter_variants_expr(' >= 0.4 && (va.qc.AF > 0.005 && va.qc.AF < 0.995)')
print "It took: ", ( - startTime)
It took: 0:00:00.330016

The problem appears when attempting to count the variants remaining after the filtering is done:

startTime =
[Stage 4:> (0 + 2) / 2257]

after 10 minutes:
[Stage 4:=> (63 + 2) / 2257]

It’s so slow I haven’t even let it finish (which I estimate would take around 6 hours at that pace).
Any idea about why this is happening?



PS: the system is multicore, running Ubuntu 14.04.5, with 15GB of RAM, Spark 2.1.0
Memory usage while counting variants:
free -m
             total       used       free     shared    buffers     cached
Mem:         15671      14821        849          0         11       8777
-/+ buffers/cache:       6033       9638
Swap:            0          0          0

Hi Steph,
What you’re encountering is the lazy execution model of Hail / Spark. Hail methods are divided into transformations (operations that describe a change to the underlying data state / logical layout, but are only recorded, not executed immediately) and actions (operations that return results directly, and actually need to run when the method is called).

Actions are methods like write, export, query, and count. Pretty much everything else is a transformation, including read!

Here’s an easy way to see this:

kt = KeyTable.range(10)

# this is an array index error, but it runs successfully
kt = kt.annotate('x = range(10)[10]')

# we'll crash only when we try to execute an action
kt.count()
The first count_variants() is taking 4 minutes, which is pretty slow because it’s reading everything from disk and counting the number of records.

The second count_variants() has to do much more than the first: it has to read the VDS from disk, compute all the QC statistics, use them in the filter, and count the resulting number of variants post-filter.

Variant QC is pretty slow for large datasets, so I imagine that’s the bottleneck. The Spark progress bar indicates you only have 2 cores running in parallel, so I imagine many Hail operations will be pretty slow. Unfortunately, our current infrastructure does add a fair bit of overhead to allow for scalability, so the single-core performance isn’t great. We’re working hard to fix this.
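To make the recomputation concrete, here is a plain-Python sketch (the function names are made up) of why each action re-runs the whole lazy pipeline, and what cache()-style materialization buys you:

```python
# Count how often an expensive per-record step runs. Each "action"
# over the lazy pipeline re-runs it from scratch, which mirrors why
# the post-filter count_variants() repeats read + QC + filter.
calls = {"qc": 0}

def expensive_qc(v):            # stand-in for per-variant QC stats
    calls["qc"] += 1
    return v % 2 == 0

def pipeline(data):             # lazy: nothing runs until consumed
    return (v for v in data if expensive_qc(v))

data = range(1000)
count1 = sum(1 for _ in pipeline(data))   # action 1: 1000 QC calls
count2 = sum(1 for _ in pipeline(data))   # action 2: 1000 more

# Materializing once (akin to .cache() or writing a VDS) pays the
# cost a single time; later uses are free.
cached = list(pipeline(data))             # +1000 calls
count3 = len(cached)                      # no extra QC calls
print(calls["qc"])  # 3000
```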

PS - how large is your dataset, in GB? It’s always useful for us to have benchmarks.

One quick optimization – if the info score filter is removing lots of variants, then do that before variant qc so that you don’t need to compute QC stats on those variants:

startTime =
vds = vds.filter_variants_expr(' >= 0.4')
vds = vds.variant_qc()
vds = vds.filter_variants_expr('va.qc.AF > 0.005 && va.qc.AF < 0.995')
print "It took: ", ( - startTime)

I see… I obviously didn’t understand how Hail was working, very interesting, thank you.
That explains why the logistic regression only takes 0.78s to run: the heavy lifting happens when I export or plot the results. On that point, if I save the results as a VDS, which will take some time, then if I later want to do, let’s say, a QQ plot, it won’t have to recalculate the regression, am I correct?

Applying the INFO filter drops half of the variants, and running count after info filtering takes slightly less than five minutes. Given all I need from qc is the MAF, is there a faster way than variant_qc() to get it?

The vds file is 8.4G, it was created by merging 23 vcf files obtained from imputation at Sanger.
I’ll try to get more cores and memory, that should help then.

It’s possible that using an expression could be faster than variant QC, but not by much. Moving the computation after the filter should definitely help, though.
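For reference, the quantity itself is cheap to compute; here is a plain-Python sketch of an AF-only calculation (a hypothetical helper, not the Hail expression language) over diploid calls coded as alt-allele counts:

```python
# AF = alt allele count / total called alleles; None marks a missing
# genotype. MAF is then min(AF, 1 - AF).
def allele_frequency(calls):
    called = [c for c in calls if c is not None]
    if not called:
        return None
    return sum(called) / (2.0 * len(called))

af = allele_frequency([0, 1, 2, None, 1])  # 4 alt alleles / 8 called
print(af)  # 0.5
```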

To your first question - if your pipeline is something like:

vds = vds.logreg(...)
df = vds.variants_table().to_pandas()

then you will actually compute logreg twice. The better thing would be to write the result and then read back the VDS you’ve just written:

vds = vds.logreg(...)
vds.write('logreg.vds')
df ='logreg.vds').variants_table().to_pandas()

Do note that in 0.1, most of the operations that give you large amounts of data in python (collect, query, to_pandas) are pretty slow and you might be better off making an intermediate text file with export_variants.


It is perhaps also worth mentioning that even without a cluster, if you can get access to a more-than-two-core machine you should see significant improvements in runtime due to scaling across many cores. You may see that an eight core machine completes your pipeline in about an hour.
