Slow speed when using the gnomAD v3 callset

I am trying to build my own ancestry classifier using genotypes from the 1000 Genomes and HGDP datasets, taken from the gnomAD Downloads page (broadinstitute.org). The speed I observe is very slow (for example during LD pruning and filtering) compared to when I analyzed my own dataset. In both cases, I initialize Hail like this:

hl.init(tmp_dir='/scratch/hail',
        local_tmpdir='/scratch/hail',
        master='local[128]',
        spark_conf={'spark.driver.memory': '1800g',
                    'spark.executor.memory': '1800g',
                    'spark.local.dir': '/scratch/hail',
                    'java.io.tmpdir': '/scratch/hail'})

I’ve noticed that the gnomAD dataset contains 50,000 partitions, so I tried to change the partitioning during read_matrix_table, but I didn’t observe a difference. Is there anything else I could try?
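For reference, this is roughly how I changed the partitioning at read time (the path below is a placeholder for the gnomAD MatrixTable I downloaded; `_n_partitions` is the read-time repartitioning argument, and `naive_coalesce` is the shuffle-free alternative I also considered):

```python
import hail as hl

# Placeholder path to the downloaded gnomAD HGDP + 1KG MatrixTable
path = 'hgdp_1kg_subset_dense.mt'

# Repartition at read time instead of keeping the original 50,000 partitions
mt = hl.read_matrix_table(path, _n_partitions=5000)

# Alternative: merge adjacent partitions after reading, without a shuffle
# mt = hl.read_matrix_table(path).naive_coalesce(5000)
```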