I am trying to build my own ancestry classifier using genotypes from the 1000 Genomes and HGDP datasets, downloaded from the gnomAD downloads page (gnomad.broadinstitute.org). The speed I observe is very slow (for example during LD pruning and filtering) compared to when I analyzed my own dataset. In both cases, I initialize Hail like this:
hl.init(tmp_dir='/scratch/hail',
        local_tmpdir='/scratch/hail',
        master='local[128]',
        spark_conf={'spark.driver.memory': '1800g',
                    'spark.executor.memory': '1800g',
                    'spark.local.dir': '/scratch/hail',
                    'java.io.tmpdir': '/scratch/hail'})
I've noticed that the gnomAD dataset contains 50,000 partitions, so I tried to change the partitioning during read_matrix_table, but I didn't observe a difference. Is there anything else I could try?
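
For reference, my repartitioning attempt looked roughly like this. The path and partition count below are placeholders, and I may be misusing the API (my understanding is that recent Hail versions accept an `_n_partitions` argument on `read_matrix_table`, and that `naive_coalesce` can reduce partitions without a full shuffle):

```python
import hail as hl

# Placeholder path to the local copy of the HGDP + 1000 Genomes MatrixTable
MT_PATH = '/path/to/hgdp_1kg.mt'  # placeholder, not the real path

# Attempt 1: repartition at read time (assumes the _n_partitions
# argument available in recent Hail versions)
mt = hl.read_matrix_table(MT_PATH, _n_partitions=512)

# Attempt 2: read with the original 50,000 partitions, then coalesce
# down without a full shuffle
# mt = hl.read_matrix_table(MT_PATH).naive_coalesce(512)
```

Neither variant made a noticeable difference for my LD pruning and filtering steps, so I am not sure whether the partition count is actually the bottleneck here.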