Hail performance reference/benchmark

Hello Hail team,

Our team is using Hail for the first time on large datasets in a cluster setup (AWS EMR). We are concerned about the processing performance of our current setup. Unfortunately, we couldn't find a "benchmark" reference (e.g., "this task on this dataset with x cores is expected to finish in this amount of time"). I know that several factors can influence performance on a cluster, but a simple test would still be useful just to make sure there isn't any critical issue.

The task we are dealing with is filtering samples (mostly sample_qc(); code below) on the UK Biobank (UKBB) microarray + imputed dataset (~2.6 TB in BGEN files). We used 2 core nodes + 14 task nodes, each with 16 vCPUs. This process took over 25 hours to finish. We could increase the number of nodes to speed things up, but we wanted to make sure (before running any other tests) that we aren't wasting resources due to a bad configuration.

Any insight would be really appreciated! Thank you all!


import hail as hl
hl.init(sc, default_reference="GRCh37")  # sc: the pre-existing SparkContext on EMR

def load_dataset():
    filenames = [ f's3a://.../ukb22828_chr{i}_v3.bgen' for i in range(1,23) ]
    mt = hl.import_bgen(
        filenames,
        entry_fields=['GT'],
        sample_file='s3a://.../ukb22828_chr22_v3.sample'
    )
    pheno_table = hl.import_table(...)   # phenotype table; arguments elided from the post
    pheno_table = pheno_table.annotate(...)  # field cleanup; arguments elided from the post
    pheno_table = pheno_table.key_by('cd_participante')
    return mt.annotate_cols(pheno=pheno_table[mt.s])

mt = load_dataset()
mt = mt.filter_cols(
    (mt.pheno.cd_sexo_genetico == 'Female') &  # <- caution
    (mt.pheno.used_in_phase_chr1_22 == 1) &
    (mt.pheno.used_in_pca == 1) &
    hl.is_defined(mt.pheno.pca1) &
    hl.is_defined(mt.pheno.idade_entrevista) &
    hl.is_defined(mt.pheno.breast_cancer) &
    mt.pheno.ancestry_filter_pass
)

mt = hl.sample_qc(mt)

mt = mt.filter_cols(mt.sample_qc.call_rate >= 0.95)