Hail process seems unreasonably expensive

Hi all,

Below I have pasted the code I want to use to subset the All of Us Hail ACAF matrix table to the set of SNPs I want (7.7 million), and to repartition it for faster performance downstream. The matrix table currently has about 145,000 partitions, which I believe leads to massive overhead, and it contains roughly 100 million SNPs and 400,000 people. I filter the matrix table using my Hail table of hits. Currently, the process looks like it will cost north of $1,000. I have tried a variety of partitioning methods and cluster configurations, and all of them seem similarly slow (128 workers with 4 CPUs / 26 GB RAM each, 128 workers with 8 CPUs / 52 GB RAM each, 64 workers with 16 CPUs / 104 GB RAM each, etc.); they all look like they will take 10+ hours. I am doing a very similar process in the UKBB using plink (subsetting plink genotype files of 500,000 people and 100 million variants down to 7.7 million variants), and the whole thing costs about $20. Am I making some glaring mistake? I have built much of my All of Us code (subsetting a MatrixTable, PCA, GWAS) using Hail and I would love to keep using it, but I am having trouble making it cost-effective. Please advise, and thank you!
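(For context, here is a minimal sketch of how the numbers above can be checked, assuming Hail is already initialized in the Researcher Workbench and using the same WGS_ACAF_THRESHOLD_SPLIT_HAIL_PATH environment variable as in the cell below; the partition count comes from the MatrixTable metadata, so this part is cheap.)

import os
import hail as hl

mt = hl.read_matrix_table(os.getenv("WGS_ACAF_THRESHOLD_SPLIT_HAIL_PATH"))
print(mt.n_partitions())   # ~145,000 partitions, read from metadata
print(mt.count_cols())     # ~400,000 samples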

(everything is inside one cell)
import os
import hail as hl

bucket = os.getenv("WORKSPACE_BUCKET")  # workspace bucket used for all paths below

# Hits table: the ~7.7M SNPs to keep (CHR/POS/REF/ALT, GRCh38 positions).
hits = hl.import_table(f'{bucket}/data/rsid_alleles_b38.tsv.bgz',
                       types={'CHR': hl.tstr, 'POS': hl.tint32,
                              'REF': hl.tstr, 'ALT': hl.tstr})

# Map numeric sex-chromosome codes to GRCh38 contig names.
hits = hits.annotate(
    contig = (
        hl.switch(hits.CHR)
          .when('23', 'chrX')
          .when('24', 'chrY')
          .default('chr' + hits.CHR)
    )
)

# Build a locus key plus both allele orderings so REF/ALT swaps still match.
hits = (hits
    .annotate(
        locus = hl.locus(hits.contig, hits.POS, 'GRCh38'),
        alleles_fwd = [hits.REF, hits.ALT],
        alleles_rev = [hits.ALT, hits.REF]
    )
    .key_by('locus'))
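# Optional sanity check (my addition, not part of the original pipeline): confirm the
# hits table parsed and keyed as expected before the expensive join below.
hits.describe()
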
# All of Us ACAF-threshold WGS MatrixTable (~100M variants x ~400K samples, ~145K partitions).
mt_path = os.getenv("WGS_ACAF_THRESHOLD_SPLIT_HAIL_PATH")
mt = hl.read_matrix_table(mt_path)

# Genetic ancestry predictions, joined onto the columns by research ID.
ancestry_pred_path = "gs://fc-aou-datasets-controlled/v8/wgs/short_read/snpindel/aux/ancestry/ancestry_preds.tsv"
ancestry_pred = hl.import_table(ancestry_pred_path,
                                key="research_id",
                                impute=True,
                                types={"research_id": hl.tstr})

mt = mt.annotate_cols(ancestry_pred = ancestry_pred[mt.s])

# Per-variant QC metrics (call rate, allele frequencies, etc.) computed across all samples.
mt = hl.variant_qc(mt)

# Keep only rows whose locus is in the hits table and whose alleles match in either order.
mt = mt.filter_rows(
    hl.is_defined(hits[mt.locus]) &
    (
        (mt.alleles == hits[mt.locus].alleles_fwd) |
        (mt.alleles == hits[mt.locus].alleles_rev)
    )
)

# 4. Checkpoint after the expensive filter

ckpt_path = f'{bucket}/tmp/full_mt_filtered.mt'
mt = mt.checkpoint(ckpt_path, overwrite=True)  # writes the ~7.7M filtered variants and reads them back

# 5. Repartition & write the final table

mt = mt.repartition(2048, shuffle=True)  # small shuffle on ~7.7M rows
mt.write(f'{bucket}/data/hail_mt/mt_sbrc_common_all.mt', overwrite=True)
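
For completeness, a small read-back sketch (same output path as above, nothing new assumed) that I could run afterwards to confirm the new partitioning and subset size before any downstream steps:

sub = hl.read_matrix_table(f'{bucket}/data/hail_mt/mt_sbrc_common_all.mt')
print(sub.n_partitions())   # expect 2048 after the repartition
print(sub.count())          # (n_variants, n_samples): roughly 7.7M x 400,000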