Poor performance for QC filtering on medium-sized genotype data

I am trying to run some basic QC on a somewhat large dataset (30M variants by 3k samples) distributed as PLINK files. I am finding that the same variant and sample call-rate filtering steps that take about 2 minutes with PLINK 1.9 (135 seconds, to be exact) take over 2 hours with Hail, and more like 2.5 hours if you count the initial time taken for the MatrixTable conversion.

What can I do to bridge some of this ~50x performance difference? Here is some relevant Hail code:

# Convert to MatrixTable
mt_qc0 = hl.import_plink(
    bed='NB_final_snp.bed',
    bim='NB_final_snp.bim',
    fam='NB_final_snp.fam'
)
# This takes ~32 minutes
mt_qc0.write('NB_final_snp.mt', overwrite=True) 
# mt_qc0 reads in with 668 partitions

# Define a function for call rate QC
def qc_step_1(mt):
    mt = hl.variant_qc(mt)
    mt = mt.filter_rows(mt.variant_qc.call_rate >= .8)
    mt = hl.sample_qc(mt)
    mt = mt.filter_cols(mt.sample_qc.call_rate >= .8)
    mt = hl.variant_qc(mt)
    mt = mt.filter_rows(mt.variant_qc.call_rate >= .98)
    mt = hl.sample_qc(mt)
    mt = mt.filter_cols(mt.sample_qc.call_rate >= .98)
    return mt

# Load the converted table for QC
mt_qc0 = hl.read_matrix_table('NB_final_snp.mt')
mt_qc1 = qc_step_1(mt_qc0)

# This takes ~125 minutes
print(mt_qc1.count())
> (13381703, 2900)

The data itself is taken directly from Rice SNP-Seek (the “3K RG 29mio biallelic SNPs” dataset). I don’t actually care about rice data; I was just using it in some performance comparisons between Glow, Hail, and PLINK. As far as I know, there shouldn’t be anything about it that Hail handles poorly: it is bi-allelic, and the contig names / lengths all clear the locus checks for the default GRCh37 genome. Are there any other assumptions baked into the indexing or data structures that might degrade performance?

This is the comparable PLINK code (which is run on the same system):

# Use different intermediate results for each step with a cleanup at the end to 
# better emulate what Spark/Hail is doing
plink --bfile NB_final_snp --geno 0.2 --make-bed --out NB_final_snp_2
plink --bfile NB_final_snp_2 --mind 0.2 --make-bed --out NB_final_snp_3
plink --bfile NB_final_snp_3 --geno 0.02 --make-bed --out NB_final_snp_4
plink --bfile NB_final_snp_4 --mind 0.02 --make-bed --out NB_final_snp_5
rm NB_final_snp_{2,3,4}.*

That all takes about 2 minutes and results in the same number of variants (13381703) and samples (2900), so correctness issues are very unlikely. This QC process was taken from the Marees et al. 2018 tutorial, so I would like to stick to it rather than using fewer sequential steps or different thresholds.

Also, I’m using Hail 0.2 (0.2.30-2ae07d872f43 specifically) and Spark 2.4.4 in local mode. I know local mode is a less typical use case, but for this experiment I’m running everything on a machine with 128G RAM and a 92G driver heap. CPU utilization sits at nearly 100% on 16 cores (8 physical) for the whole 2 hours, and memory usage peaks around 50G, so GC pressure is not the problem.

My first thought was that stripping out all the extraneous QC fields might help when serializing intermediate results. But then again, it takes ~30 minutes to write the original MatrixTable with none of those fields, and the 4 steps in my qc_step_1 function take ~2 hours in total, so that alone seems unlikely to get me back to the same order of magnitude for performance.
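For reference, this is the kind of stripped-down version I had in mind, computing only the call rate with a single aggregation per step instead of the full variant_qc/sample_qc structs (an untested sketch on my part; I haven’t measured whether it actually changes anything):

def qc_step_1_minimal(mt):
    # Keep only the rows/cols whose fraction of defined genotypes clears each threshold
    mt = mt.filter_rows(hl.agg.fraction(hl.is_defined(mt.GT)) >= 0.8)
    mt = mt.filter_cols(hl.agg.fraction(hl.is_defined(mt.GT)) >= 0.8)
    mt = mt.filter_rows(hl.agg.fraction(hl.is_defined(mt.GT)) >= 0.98)
    mt = mt.filter_cols(hl.agg.fraction(hl.is_defined(mt.GT)) >= 0.98)
    return mt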

What am I doing here that is suboptimal? Thanks in advance for the help!

This is a great post; thanks for your attention to detail in documenting the configurations behind your observed performance. First, I don’t think you’re doing anything to misuse Hail’s interfaces. PLINK is much faster than Hail on a comparable configuration in almost all cases. Frank McSherry’s “Scalability! But at what COST?” (COST: Configuration that Outperforms a Single Thread) is a team favorite that describes how scalable systems nearly ubiquitously require more CPU hours than well-optimized single-threaded programs to perform the same tasks.

There is another Hail feature that comes with a high COST – flexibility. variant_qc and sample_qc are implemented on Hail’s MatrixTable aggregation interface, which is able to express a huge variety of queries over data. To make this somewhat efficient, Hail’s backend has an optimizing compiler that emits JVM bytecode. This code is much slower than the hand-optimized C and assembly code inside PLINK2 – Chris Chang is extremely good at what he does. Additionally, PLINK datasets are represented in memory the same as on disk – 2 bits per genotype. Hail represents a genotype with far more memory, in order to accommodate multiallelic calls and differing ploidy. These two properties make Hail much more expensive than PLINK for the workflow above. If your workflow is entirely encompassed by PLINK2 functionality, I’d definitely recommend you use PLINK2!
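To put rough numbers on the representation difference for your dataset (treating a Hail call as roughly 4 bytes for the sake of argument; the real in-memory layout has additional overhead):

n_entries = 30_000_000 * 3_000       # 9e10 genotypes
plink_bytes = n_entries * 2 / 8      # 2 bits per genotype, ~22.5 GB
hail_bytes = n_entries * 4           # ~4 bytes per call (assumed), ~360 GB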

Two additional notes:

  1. Optimizing for performance on “hard calls” (just genotypes) has not been a priority in the last few years, since most Hail users are working with data that contains additional fields in the genotype schema (e.g. AD, DP, GQ, PL in GATK VCFs). There are many fewer tricks that can be played to speed up processing these values, and the relative improvement from optimizing GT (PLINK-schema) is small for these data.

  2. We do have a roadmap for bringing the performance of this set of queries to rough parity with PLINK2 over the next 12-24 months. We have a set of planned infrastructure improvements to our compiler that will let us take advantage of specialized disk and memory representations for specific data types (like unphased diploid calls, AKA PLINK-style 2-bit genotypes), and we intend to build a native code backend that will let us generate code that uses the same SIMD instructions that make PLINK2 so fast.

Also, while the above applies, I suspect something bad is going on here and I’m digging into profiling results for your pipeline on some 1kg data.

Thank you for the quick and thoughtful response @tpoterba!

The larger subtext in my original post is that I’m trying to understand where Hail will fit best within our infrastructure, and the thread containing your discussion with @pfsulliv is definitely helpful too. In the 6 months since that post, have you encountered any good public examples of end-to-end pipelines that highlight the features that make Hail best-in-class for certain types of analysis? In other words, are there examples of WGS QC workloads that illustrate some of the tricks you mentioned?

I would also be very curious to hear any recommendations on how to best manage workflows that integrate Hail with tools like those you suggested may be better suited to certain tasks at a given dataset size. For example, would you recommend any workflow orchestration tools or best practices for interoperating Hail, PLINK, Beagle, BOLT-LMM, etc.? I know many such tools exist, but I’m curious whether you have any strong opinions on how they interface with Hail/Spark specifically.

Lastly, would you say that the MatrixTable/BlockMatrix/Table layouts on disk and the intermediate representation compilation will remain relatively stable over the next year or two as the backend rewrite occurs? I was wondering, as a software developer + potential contributor and not a geneticist, if any documentation exists that explains how these things were designed and whether or not it’s worth trying to learn more about them now (assuming they’re not going to be drastically refactored). Also, does it look like this new backend, sans Spark, will utilize any existing systems for distributed computing or will it be a rewrite from the ground up?

Thanks again for your time Tim! And thank you for taking a deeper look at that filtering on 1kg data. I’d love to hear if you find anything interesting.

First, I want to thank you for prompting me to look into the performance of this pipeline – I discovered a really nasty performance regression that wasn’t being caught by our benchmark suite. However, it was introduced after 0.2.30 (it’s only present in 0.2.31) so that doesn’t immediately suggest a resolution for your experience. I’ll continue looking after I fix the problem I’ve already found.

I certainly appreciate the difficulty of trying to decide on the right set of tools and configurations for various kinds of analyses. I roughly think that if your planned analyses are well-defined statistical genetics pipelines that are supported by PLINK2/BOLT/SAIGE and your data don’t require much ad-hoc QC/exploration, then don’t bother setting up Hail yet. However, if you’d need to run some custom scripts or queries over the data, PLINK2 starts to look a lot less attractive – and this was one of the early motivators for building such expressive query functionality inside of Hail. I also think that there’s probably no better tool than Hail for dealing with WGS data right now, due to the ease of (a) representing information about annotations, phenotypes, and genotypes in the same data structure, (b) computing the wide variety of statistics about the data necessary to do principled QC, and (c) scaling compute resources to be able to process large datasets into the hundreds of thousands of WGS.

That said, I still think there’s a solid place for tools like SAIGE in final steps of analysis even if Hail is the primary platform for slicing and dicing the data. It’s quite easy to go from Hail to VCF or a similar interchange format for these tools. Local scientists also use a variety of workflow engines to power this type of task.

With regards to file stability, it’s a little hard to answer briefly. We do intend to maintain backward compatibility, defined roughly as “newer versions of Hail can read files written by older versions of Hail”. However, we also intend to introduce new on-disk encodings and compression parameterizations that will improve size and performance, and these improvements mean that the bytes that import_plink(...).write(...) produces might (hopefully!) be different in six months. I’m intending to write a blog post that provides a high-level overview of this system when we’ve got a fully-functional prototype. We certainly want it to be easy for outside contributors to dig into the code and introduce optimizations for their own workflows!

most Hail users are working with data that contains additional fields in the genotype schema (e.g. AD, DP, GQ, PL in GATK VCFs)

computing the wide variety of statistics about the data necessary to do principled QC

@tpoterba can you share any example programs that implement principled QC of VCF files with those additional fields? We’d love to include a more sophisticated workload like that in our benchmarks.

I’m intending to write a blog post that provides a high-level overview of this system when we’ve got a fully-functional prototype

Why wait until you have a fully-functional prototype? Do you have any design documents or roadmaps that you can easily share with potential developers here on Discourse? I understand it’s hard to write up a brand new blog post, but hopefully it’s easy to share something that y’all have already written internally to coordinate development.

The gnomAD team has built a QC and analysis library on Hail. That’s probably the most sophisticated analysis at this point. It’s all open source:

and a blog post describing the analysis can be found here:

I think Tim is referring to a blog post announcing the features to users and explaining the user-visible impact.

Design of these things is ongoing and can be found on the dev Discourse forum: https://dev.hail.is

A lot also happens in person since the team basically sits in the same room, but if/when we get more engagement from external developers, we can push on formalizing the design process and developer documentation.

I actually meant a post for engineering-minded folks that roughly describes how our 3-type system works, because it’s pretty cool and worth sharing for people who want to start looking under the hood. The user-visible impact won’t be a full blog post: “Hail is faster, and its files are smaller.” :smile:

the dev Discourse forum: https://dev.hail.is

Wow thanks for the pointer, I did not know this discussion forum existed! I should probably read the README a little more closely…

@eric-czech I want to thank you again for posting this pipeline: it exposes an infrastructure flaw in the computation for variant_qc and sample_qc that leads to runtime exponential in the number of iterations (in your case, 2). We can absolutely make this faster (though probably not enormously faster in the 2-iteration case).

Interesting! Will you post an issue on GitHub to track the fix?

Whoa, good catch then @tpoterba!

I’m also wondering if there is a way to rewrite it so that it works like this instead:

  • Do a pass over the original dataset, calculate variant call rates, build a list of variants that don’t pass the call rate threshold (but don’t actually apply this filter to the original dataset yet)
  • Do another pass over the original dataset calculating sample call rates in a way that excludes the variants found above, building a similar list of samples that don’t clear the threshold
  • Do a third pass that recalculates variant call rates excluding any samples / variants found above
  • Do a fourth pass to get new sample call rates respecting all previous exclusions
  • Finally, apply all the filters to the original dataset and produce a materialized result on disk

Would I be correct in assuming that this might eliminate the need to write out intermediate tables in full? If so, any tips on how to do that with the API would be awesome!
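To make that concrete, here is roughly what I have in mind in expression terms (an untested sketch on my part that uses hl.agg.fraction and hl.agg.filter directly; the output path is just illustrative):

mt = hl.read_matrix_table('NB_final_snp.mt')

# Pass 1: variant call rates over the original data
mt = mt.annotate_rows(v_cr1=hl.agg.fraction(hl.is_defined(mt.GT)))

# Pass 2: sample call rates, counting only variants that cleared pass 1
mt = mt.annotate_cols(s_cr1=hl.agg.filter(
    mt.v_cr1 >= 0.8, hl.agg.fraction(hl.is_defined(mt.GT))))

# Pass 3: variant call rates, excluding samples that failed pass 2
mt = mt.annotate_rows(v_cr2=hl.agg.filter(
    mt.s_cr1 >= 0.8, hl.agg.fraction(hl.is_defined(mt.GT))))

# Pass 4: sample call rates respecting all previous exclusions
mt = mt.annotate_cols(s_cr2=hl.agg.filter(
    (mt.v_cr1 >= 0.8) & (mt.v_cr2 >= 0.98),
    hl.agg.fraction(hl.is_defined(mt.GT))))

# Apply all the filters at once and materialize a single result
mt = mt.filter_rows((mt.v_cr1 >= 0.8) & (mt.v_cr2 >= 0.98))
mt = mt.filter_cols((mt.s_cr1 >= 0.8) & (mt.s_cr2 >= 0.98))
mt.write('NB_final_snp_qc1.mt', overwrite=True)  # illustrative output path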

@hammer also suggested that column-wise aggregations might have different performance characteristics than row-wise aggregations, so out of curiosity I tried benchmarking a simple aggregation along either axis for a square dataset (notebook) and found something like this:

That suggests that aggregating across columns is faster, so presumably the MT layout on disk is row-major rather than column-major. Is that right?

I don’t know how well this ~2x disparity would hold up with larger datasets, but it also suggests that some substantial gains might come from putting the longer axis (i.e. variants) of a rectangular dataset into the columns for operations like this, where aggregations happen across one axis and then the other. Is there a way to transpose an MT like that?

Yep, absolutely. Doing an aggregation per row (variant_qc, implemented as annotate_rows with aggregation) is an aggregation over an in-memory array, while doing aggregation per column (sample_qc, implemented as annotate_cols with aggregation) requires us to build an array of aggregation states, fold in entries for each row, and then use a Spark tree aggregate (which involves a small shuffle) to compute the final result in a memory-safe way. I’m pretty surprised it’s only 2x…

I don’t think you’d have a good experience trying to transpose a MT at the moment. We don’t expose any primitive to do that IIRC, and Hail will have trouble scaling to 100s of millions of columns right now. It’s also certainly true that any gain from aggregating over the in-memory axis will be cancelled out (and more) by the need to do a transpose to go between two representations.

We basically do this now, but only for samples. This is certainly a multi-pass pipeline, but we don’t currently automatically insert write/read or persistence into our generated graphs, which means that each time you filter the variants to compute sample QC, you’re redoing all the work above (quadratic work in the number of times you iterate over the qc/filters). For a small number of iterations, this is a small fraction of total time, so it’s acceptable. We separately have the exponential-growth bug, which we’ll fix ASAP!
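If you want to avoid that recomputation today, you can insert the write/read by hand between passes, roughly like this (a sketch only; the paths are illustrative, and the extra I/O may or may not pay for itself at your scale):

def persist(mt, path):
    # Materialize an intermediate result and start a fresh graph from it
    mt.write(path, overwrite=True)
    return hl.read_matrix_table(path)

mt = hl.variant_qc(mt)
mt = mt.filter_rows(mt.variant_qc.call_rate >= 0.8)
mt = persist(mt, '/tmp/qc_pass1.mt')

mt = hl.sample_qc(mt)
mt = mt.filter_cols(mt.sample_qc.call_rate >= 0.8)
mt = persist(mt, '/tmp/qc_pass2.mt')

# ...and similarly before the 0.98 passes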

here’s a tracking issue:

OK, I’ve got a patch to fix this:

Benchmarks for your pipeline (see the PR page) indicate a ~4x improvement. This brings us down to about 1 order of magnitude slower than PLINK, which is roughly what I’d expect given current infrastructure.

Thanks a lot @tpoterba!

One last question, then: is it possible to do this (i.e. compute matrix row/col sums via Hail) in a way that hits code paths in Breeze that use netlib-java to call out to a native BLAS/LAPACK implementation?

I tried this using the BlockMatrix API on some smaller 1KG data but it’s only about 2x faster than the MatrixTable version (though I haven’t tried yet with your patch, so that may explain the difference):

# Write the MT out as a BlockMatrix of 0/1 "genotype is defined" flags
mt = hl.read_matrix_table('1KG.mt')
mt.count() # (25488488, 629)
path = '1KG_gt_defined.bm'  # illustrative output path
hl.linalg.BlockMatrix.write_from_entry_expr(
    hl.float64(hl.is_defined(mt.GT)),  # cast the boolean flag to float64 for the BlockMatrix
    path, overwrite=True, block_size=4096)
bm = hl.linalg.BlockMatrix.read(path)

# Do the same kind of sequential filtering on call rates 

def filter_by_variant_call_rate(bm, threshold):
    # I tried this first since matrix multiplication seems to use native BLAS in 
    # breeze while summations do not, but it was much slower
    # cr = ((bm @ np.ones((bm.shape[1], 1))) / bm.shape[1]).to_numpy().squeeze()

    cr = (bm.sum(axis=1) / bm.shape[1]).to_numpy().squeeze()
    idx = np.argwhere(cr >= threshold).squeeze()
    return bm.filter_rows(idx.tolist())

def filter_by_sample_call_rate(bm, threshold):
    idx = np.argwhere((bm.sum(axis=0) / bm.shape[0]).to_numpy().squeeze() >= threshold).squeeze()
    return bm.filter_cols(idx.tolist())

bm = filter_by_variant_call_rate(bm, .8)
bm = filter_by_sample_call_rate(bm, .8)
bm = filter_by_variant_call_rate(bm, .98)
bm = filter_by_sample_call_rate(bm, .98)

Would the issue have affected BlockMatrix operations as well? Either way, I’m wondering whether I should expect any substantial gains from using that API instead of MatrixTable like this.

Also, I see that access to SIMD instruction sets is a big goal for Spark 3.x and that some companies have managed to hack it into Spark 2.x in one way or another. Do you know of any users who have managed to bring this kind of capability to Hail on Spark 2.x? I’m guessing not, but I was surprised at how much faster the same approach was with numpy + dask block matrices (notebook), so I’m just reaching for ways to speed things up in advance of the new backend.
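For context, the numpy + dask version is essentially just two axis reductions per pass, roughly like this (an illustrative sketch rather than the exact notebook code; the random array is only there to make it self-contained):

import dask.array as da

# Stand-in for an (n_variants, n_samples) 0/1 array of "GT is defined" flags
defined = da.random.randint(0, 2, size=(25_488_488, 629), chunks=(65_536, 629))

# Variant call rates, then drop failing variants
variant_cr = defined.mean(axis=1)
defined = defined[(variant_cr >= 0.8).compute(), :]

# Sample call rates on the remaining variants, then drop failing samples
sample_cr = defined.mean(axis=0)
defined = defined[:, (sample_cr >= 0.8).compute()]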

If you have any thoughts let me know, and either way thanks again for putting out that patch.

Hail’s BlockMatrix exists primarily to enable scalable matmul within the Hail ecosystem.

I am not aware of anyone who has used SIMD in Hail. I wonder how much of the 2x comes from better cache locality, since the floats are packed together. Unfortunately, column-wise storage is also still on our in-progress list.

I’m sad, but not surprised that numpy + dask is way faster. :confused:

Thanks @danking, good to know. I’ll stick with the MatrixTable API then.