Hail performance reference/benchmark

Hello Hail team,

Our team is using Hail for the first time on large datasets in a cluster setup (AWS EMR). We are concerned about the processing performance of our current setup. Unfortunately, we couldn't find a "benchmark" reference (e.g., "this task on this dataset with x cores is expected to finish in this amount of time"). I know that several factors can influence performance on a cluster, but a simple test would still be useful just to make sure there isn't any critical issue.

The task we are dealing with is filtering samples (mostly sample_qc(); code below) on the UK Biobank (UKBB) microarray + imputed dataset (~2.6 TB in BGEN files). We used 2 core nodes + 14 task nodes, each with 16 vCPUs. This process took over 25 hours to finish. We could increase the number of nodes to speed things up, but we wanted to make sure (before running any other tests) that we aren't wasting resources due to a bad configuration.

Any insight would be really appreciated! Thank you all!


import hail as hl
hl.init(sc, default_reference="GRCh37")  # sc: the pre-existing SparkContext on EMR

def load_dataset():
    filenames = [ f's3a://.../ukb22828_chr{i}_v3.bgen' for i in range(1,23) ]
    mt = hl.import_bgen(
        filenames,
        entry_fields=['GT'],
        sample_file='s3a://.../ukb22828_chr22_v3.sample'
    )
    pheno_table = hl.import_table(...)   # phenotype table; arguments elided from the post
    pheno_table = pheno_table.annotate(...)  # field cleanup; arguments elided from the post
    pheno_table = pheno_table.key_by('cd_participante')
    return mt.annotate_cols(pheno=pheno_table[mt.s])

mt = load_dataset()
mt = mt.filter_cols(
    (mt.pheno.cd_sexo_genetico == 'Female') &  # <- caution
    (mt.pheno.used_in_phase_chr1_22 == 1) &
    (mt.pheno.used_in_pca == 1) &
    hl.is_defined(mt.pheno.pca1) &
    hl.is_defined(mt.pheno.idade_entrevista) &
    hl.is_defined(mt.pheno.breast_cancer) &
    mt.pheno.ancestry_filter_pass
)

mt = hl.sample_qc(mt)

mt = mt.filter_cols(mt.sample_qc.call_rate >= 0.95)