I am trying to run some basic QC on a somewhat large dataset (30M variants by 3k samples) distributed as PLINK files. I am finding that the same variant and sample call-rate filters that take about 2 minutes with PLINK 1.9 (135 seconds, to be exact) take over 2 hours with Hail. It’s more like 2.5 hours if you count the initial time taken for the MatrixTable conversion.
What can I do to bridge some of this ~50x performance difference? Here is some relevant Hail code:
# Convert to MatrixTable
mt_qc0 = hl.import_plink(
    bed='NB_final_snp.bed',
    bim='NB_final_snp.bim',
    fam='NB_final_snp.fam'
)
# This takes ~32 minutes
mt_qc0.write('NB_final_snp.mt', overwrite=True)
# mt_qc0 reads in with 668 partitions
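For context on partition granularity, here is a quick back-of-the-envelope on the per-partition workload (the variant count is approximate, taken from the dataset shape above):

```python
# Rough per-partition workload, assuming ~30M variants x 3k samples
# spread across the 668 partitions Hail reports on read.
n_variants = 30_000_000
n_samples = 3_000
n_partitions = 668

variants_per_partition = n_variants // n_partitions
entries_per_partition = variants_per_partition * n_samples

print(variants_per_partition)  # ~45k variants per partition
print(entries_per_partition)   # ~135M genotype entries per partition
```

So each of the 668 tasks is scanning on the order of 10^8 genotypes per QC pass, which at least gives a sense of the work being scheduled per core.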
# Define a function for call rate QC
def qc_step_1(mt):
    mt = hl.variant_qc(mt)
    mt = mt.filter_rows(mt.variant_qc.call_rate >= 0.8)
    mt = hl.sample_qc(mt)
    mt = mt.filter_cols(mt.sample_qc.call_rate >= 0.8)
    mt = hl.variant_qc(mt)
    mt = mt.filter_rows(mt.variant_qc.call_rate >= 0.98)
    mt = hl.sample_qc(mt)
    mt = mt.filter_cols(mt.sample_qc.call_rate >= 0.98)
    return mt
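One variant I have been considering, in case the later filters are recomputing the whole upstream pipeline: checkpointing after each pass so the next pass reads the already-filtered matrix from disk. This is an untested sketch, not something I have benchmarked; `qc_step_1_checkpointed` and `tmp_prefix` are names of my own invention, and it assumes `mt.checkpoint` behaves as documented:

```python
def qc_step_1_checkpointed(mt, tmp_prefix):
    # Untested sketch: same two-pass call-rate QC as qc_step_1, but each
    # pass is materialized to disk so the next pass scans the (smaller)
    # filtered matrix instead of recomputing everything upstream.
    # `tmp_prefix` is a hypothetical scratch path.
    import hail as hl  # deferred import so the sketch stands alone

    for i, call_rate in enumerate([0.8, 0.98]):
        mt = hl.variant_qc(mt)
        mt = mt.filter_rows(mt.variant_qc.call_rate >= call_rate)
        mt = hl.sample_qc(mt)
        mt = mt.filter_cols(mt.sample_qc.call_rate >= call_rate)
        mt = mt.checkpoint(f'{tmp_prefix}_pass{i}.mt', overwrite=True)
    return mt
```

Whether this actually helps presumably depends on whether Hail is recomputing the earlier passes at all; I haven’t measured it.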
# Load the converted table for QC
mt_qc0 = hl.read_matrix_table('NB_final_snp.mt')
mt_qc1 = qc_step_1(mt_qc0)
# This takes ~125 minutes
print(mt_qc1.count())
> (13381703, 2900)
The data itself comes directly from Rice SNP-Seek (the “3K RG 29mio biallelic SNPs” dataset). I don’t actually care about rice data; I was just using it to compare the performance of Glow, Hail, and PLINK. As far as I know, nothing about it should be handled poorly: it is bi-allelic, and the contig names / lengths all clear the locus checks for the default GRCh37 genome. Are there any other assumptions baked into the indexing or data structures that might degrade performance?
This is the comparable PLINK code (which is run on the same system):
# Use different intermediate results for each step with a cleanup at the end to
# better emulate what Spark/Hail is doing
plink --bfile NB_final_snp --geno 0.2 --make-bed --out NB_final_snp_2
plink --bfile NB_final_snp_2 --mind 0.2 --make-bed --out NB_final_snp_3
plink --bfile NB_final_snp_3 --geno 0.02 --make-bed --out NB_final_snp_4
plink --bfile NB_final_snp_4 --mind 0.02 --make-bed --out NB_final_snp_5
rm NB_final_snp_{2,3,4}.*
That all takes about 2 minutes and results in the same number of variants (13381703) and samples (2900), so correctness issues are very unlikely. This QC process was taken from the Marees et al. 2018 tutorial, so I would like to stick to it rather than using fewer sequential steps or different thresholds.
Also, I’m using Hail 0.2 (0.2.30-2ae07d872f43 specifically) and Spark 2.4.4 in local mode. I know local mode is a less typical use case, but for this experiment I’m running everything on a machine with 128G RAM and a 92G driver heap. CPU utilization sits at nearly 100% on 16 cores (8 physical) for the whole 2 hours, and memory usage peaks around 50G, so GC pressure is not the problem.
My first thought was that stripping out all the extraneous QC fields might help when serializing intermediate results. But it takes ~30 minutes to write the original MatrixTable with none of those fields, and the 4 steps in my qc_step_1 function take ~2 hours in total, so that alone seems unlikely to get me back on the same order of magnitude for performance.
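For reference, the stripping I had in mind is just this (untested sketch; `strip_qc_fields` is a name of my own):

```python
def strip_qc_fields(mt):
    # Hypothetical helper: drop the annotations added by hl.variant_qc /
    # hl.sample_qc so any intermediate write carries only the genotypes.
    # Untested; assumes both fields are present on `mt`.
    return mt.drop('variant_qc', 'sample_qc')
```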
What am I doing here that is suboptimal? Thanks in advance for the help!