Limit memory usage

I’m having trouble limiting the memory Hail uses when running it locally on a server. I have set spark.driver.memory, spark.executor.pyspark.memory and spark.executor.memory:

export PYSPARK_SUBMIT_ARGS="--conf spark.driver.memory=48G --conf spark.executor.pyspark.memory=48G --conf spark.executor.memory=48G --conf spark.task.maxFailures=5 pyspark-shell"

But the job still eats up more than 500G of memory before getting killed. Could anyone suggest what I’m doing wrong?
Thanks.

500G seems nuts. If there’s a memory leak inside Hail, this could be responsible, though. What’s your pipeline?

I’m trying to extract a subset of samples and variants from the UKB imputed data. Script below:

import hail as hl
import logging
import sys

chrom = int(sys.argv[1])
hl.init(default_reference = 'GRCh37', min_block_size=128, log = 'hail-chr{}.log'.format(chrom))
logging.getLogger("py4j").setLevel(logging.ERROR)

ind = hl.import_table('validation_samples.tsv', key = 'eid')

nealelab_variants = hl.import_table('variants.tsv.bgz', types = {'chr': 'tstr', 'pos': 'tint32' })
nealelab_variants = nealelab_variants.filter(nealelab_variants.chr == str(chrom))
select_variants = hl.parse_variant(nealelab_variants.variant)

vds = hl.import_bgen('imputed/bgen/ukb_imp_chr{}_v3.bgen'.format(chrom), entry_fields = ['GT', 'GP', 'dosage'], sample_file = 'ukb_imp_chr{}_v3.sample'.format(chrom), variants = select_variants)

vds = vds.semi_join_cols(ind)
hl.export_bgen(vds, 'validation.chr{}'.format(chrom))

I’m pretty sure there’s a memory leak in export_bgen, from a quick glance. We can get this fixed early next week.

Thanks a lot!

I believe this will fix it: https://github.com/hail-is/hail/pull/9006

Thanks, but unfortunately it does not fix the problem for me. Memory usage even exceeded 1TB in one case…

As a direct answer to your question, you cannot limit the memory used by Hail. That said, using 1 terabyte of RAM is probably a memory leak. We’ll continue to investigate.
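Some context on why the settings in the original post do not act as a cap: in local mode Spark runs the driver and the executors inside a single JVM, so spark.executor.memory has no separate process to apply to, and spark.driver.memory only sizes the JVM heap. Memory allocated outside the heap is not bounded by either setting. Below is a minimal sketch of building the same PYSPARK_SUBMIT_ARGS string from Python; the helper name build_submit_args is made up for illustration and is not part of Hail or PySpark.

```python
import os


def build_submit_args(conf):
    """Turn a dict of Spark settings into a PYSPARK_SUBMIT_ARGS string.

    Hypothetical helper for illustration; not a Hail or PySpark function.
    """
    parts = ["--conf {}={}".format(key, value) for key, value in conf.items()]
    # PySpark expects the argument string to end with "pyspark-shell".
    parts.append("pyspark-shell")
    return " ".join(parts)


# In local mode only the driver JVM exists, so spark.driver.memory is the
# setting that sizes the heap. It bounds the heap, not total process memory.
spark_conf = {
    "spark.driver.memory": "48G",
    "spark.task.maxFailures": "5",
}

# The variable has to be set before the JVM is launched, i.e. before hl.init().
os.environ["PYSPARK_SUBMIT_ARGS"] = build_submit_args(spark_conf)
print(os.environ["PYSPARK_SUBMIT_ARGS"])
```

Setting the variable this way is equivalent to the export line in the original post, minus the executor settings. It still will not stop a leak like the one discussed here from growing past the configured heap size.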

How large is select_variants?

You have to update to 0.2.47, which was just released this morning, to get the fix. (It seems you posted your update about the problem not being fixed before the release came out.)
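A quick way to check which version is actually installed is to look at the string returned by hl.version(), which has the form seen in this thread (for example 0.2.46-3a514a199ccd: release number, then commit hash). The sketch below parses such a string; the function names are invented for illustration. Note that a source build can carry commits made after its release number, so the release number alone does not show whether a particular fix is included.

```python
def parse_hail_version(version):
    """Split a string like '0.2.46-3a514a199ccd' into ((0, 2, 46), '3a514a199ccd')."""
    release, _, commit = version.partition("-")
    numbers = tuple(int(part) for part in release.split("."))
    return numbers, commit


def is_at_least(version, minimum=(0, 2, 47)):
    """Return True if the release part of `version` is >= `minimum`."""
    numbers, _ = parse_hail_version(version)
    return numbers >= minimum


# With Hail installed, the string would come from:
#   import hail as hl
#   version = hl.version()
version = "0.2.46-3a514a199ccd"  # the source build mentioned in this thread
print(parse_hail_version(version))
print(is_at_least(version))
```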

The update indeed solves it, thank you very much! Previously I built from source on GitHub (version 0.2.46-3a514a199ccd) after seeing your post, so perhaps I still missed something.

Sorry, I got confused. We made two memory fixes in the span of two days and I forgot which fix we were talking about. It’s true that 0.2.46-3a514a199ccd would have contained the fix I intended for you. However, it seems you also needed the fix in https://github.com/hail-is/hail/pull/9009.

Again, sorry for the confusion, but glad things are working now!