Hail process seems unreasonably expensive

Hi all,

Below I have pasted the code I want to use to subset the All of Us Hail ACAF matrix table to the set of SNPs I want (7.7 million), and to repartition it for faster performance downstream. The matrix table currently has about 145,000 partitions, which I believe leads to massive overhead, and it contains roughly 100 million SNPs and 400,000 people. I filter the matrix table using my Hail table of hits. Currently, the process looks like it will cost north of $1,000. I have tried a variety of partitioning methods and cluster configurations, and all of them seem similarly slow (128 workers with 4 CPUs / 26 GB RAM each, 128 workers with 8 CPUs / 52 GB RAM each, 64 workers with 16 CPUs / 104 GB RAM each, etc.); they all look like they will take 10+ hours. I am doing a very similar process in the UKBB using plink (subsetting plink genotype files of 500,000 people and 100 million variants down to 7.7 million variants), and the whole thing costs about $20. Am I making some glaring mistake? I have built much of my All of Us code (subsetting a MatrixTable, PCA, GWAS) using Hail and I would love to keep using it, but I am having trouble making it cost-effective. Please advise, and thank you!
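(For context, here is a minimal sketch of how the numbers above can be checked, assuming Hail is already initialized in the Researcher Workbench and using the same WGS_ACAF_THRESHOLD_SPLIT_HAIL_PATH environment variable as in the cell below; the partition count comes from the MatrixTable metadata, so this part is cheap.)

import os
import hail as hl

mt = hl.read_matrix_table(os.getenv("WGS_ACAF_THRESHOLD_SPLIT_HAIL_PATH"))
print(mt.n_partitions())   # ~145,000 partitions, read from metadata
print(mt.count_cols())     # ~400,000 samples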

(everything is inside one cell)
import os
import hail as hl

bucket = os.getenv("WORKSPACE_BUCKET")  # workspace bucket used for all paths below

# Hits table: the ~7.7M SNPs to keep (CHR/POS/REF/ALT, GRCh38 positions).
hits = hl.import_table(f'{bucket}/data/rsid_alleles_b38.tsv.bgz',
                       types={'CHR': hl.tstr, 'POS': hl.tint32,
                              'REF': hl.tstr, 'ALT': hl.tstr})

# Map numeric sex-chromosome codes to GRCh38 contig names.
hits = hits.annotate(
    contig = (
        hl.switch(hits.CHR)
          .when('23', 'chrX')
          .when('24', 'chrY')
          .default('chr' + hits.CHR)
    )
)

# Build a locus key plus both allele orderings so REF/ALT swaps still match.
hits = (hits
    .annotate(
        locus = hl.locus(hits.contig, hits.POS, 'GRCh38'),
        alleles_fwd = [hits.REF, hits.ALT],
        alleles_rev = [hits.ALT, hits.REF]
    )
    .key_by('locus'))
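# Optional sanity check (my addition, not part of the original pipeline): confirm the
# hits table parsed and keyed as expected before the expensive join below.
hits.describe()
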
# All of Us ACAF-threshold WGS MatrixTable (~100M variants x ~400K samples, ~145K partitions).
mt_path = os.getenv("WGS_ACAF_THRESHOLD_SPLIT_HAIL_PATH")
mt = hl.read_matrix_table(mt_path)

# Genetic ancestry predictions, joined onto the columns by research ID.
ancestry_pred_path = "gs://fc-aou-datasets-controlled/v8/wgs/short_read/snpindel/aux/ancestry/ancestry_preds.tsv"
ancestry_pred = hl.import_table(ancestry_pred_path,
                                key="research_id",
                                impute=True,
                                types={"research_id": hl.tstr})

mt = mt.annotate_cols(ancestry_pred = ancestry_pred[mt.s])

# Per-variant QC metrics (call rate, allele frequencies, etc.) computed across all samples.
mt = hl.variant_qc(mt)

# Keep only rows whose locus is in the hits table and whose alleles match in either order.
mt = mt.filter_rows(
    hl.is_defined(hits[mt.locus]) &
    (
        (mt.alleles == hits[mt.locus].alleles_fwd) |
        (mt.alleles == hits[mt.locus].alleles_rev)
    )
)

# 4. Checkpoint after the expensive filter

ckpt_path = f'{bucket}/tmp/full_mt_filtered.mt'
mt = mt.checkpoint(ckpt_path, overwrite=True)  # writes the ~7.7M filtered variants and reads them back

# 5. Repartition & write the final table

mt = mt.repartition(2048, shuffle=True)  # small shuffle on ~7.7M rows
mt.write(f'{bucket}/data/hail_mt/mt_sbrc_common_all.mt', overwrite=True)
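
For completeness, a small read-back sketch (same output path as above, nothing new assumed) that I could run afterwards to confirm the new partitioning and subset size before any downstream steps:

sub = hl.read_matrix_table(f'{bucket}/data/hail_mt/mt_sbrc_common_all.mt')
print(sub.n_partitions())   # expect 2048 after the repartition
print(sub.count())          # (n_variants, n_samples): roughly 7.7M x 400,000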