Most efficient way to filter and densify VDS

Hello!

I am working on a rare noncoding variant burden analysis with the All of Us genomic data, which is stored only as a VDS. We are analyzing a total of 180 Mb spread across 120k intervals scattered throughout the genome for about 3000 samples as part of a nested case-control study.

I densified the VDS once in the past (Oct-ish 2023) using the following script:

import hail as hl

#filter the VDS by intervals, then by samples
my_intervals = ['chr17:43044295-43125364', ...]
vds = hl.vds.filter_intervals(
    vds,
    [hl.parse_locus_interval(x, reference_genome='GRCh38') for x in my_intervals])
vds = hl.vds.filter_samples(vds, samples_to_keep, keep=True, remove_dead_alleles=True)

#transform local allele fields (LAD, LGT) to global fields (AD, GT)
vd = vds.variant_data
mt = vd.annotate_entries(
    AD=hl.vds.local_to_global(vd.LAD, vd.LA,
                              n_alleles=hl.len(vd.alleles),
                              fill_value=0, number='R'))
mt = mt.annotate_entries(GT=hl.vds.lgt_to_gt(mt.LGT, mt.LA))
#FT is a boolean in the VDS; recode it as a PASS/FAIL string
mt = mt.transmute_entries(FT=hl.if_else(mt.FT, 'PASS', 'FAIL'))

#densify to mt
mt = hl.vds.to_dense_mt(hl.vds.VariantDataset(vds.reference_data, mt))

However, that test run ended up being quite expensive ($100 for 80 kb and 74 samples on a 75/75 cluster), even though it covered a much smaller sample set and a far shorter region than my current analysis. I was advised that filter_rows() is more efficient than filter_intervals() as a way of reducing cost, so in principle I understand that the following code would run:

#filter the VDS variant data by locus
my_loci = hl.literal({hl.Locus('chr17', 43044295, reference_genome='GRCh38'),
                      hl.Locus('chr17', 43044296, reference_genome='GRCh38'), ...})
vd = vds.variant_data
vd = vd.filter_rows(my_loci.contains(vd.locus))
vds = hl.vds.VariantDataset(vds.reference_data, vd)

I have two main questions:

  1. This second version passes a set of 180 million individual loci to filter_rows(), which feels clunky. Is this the most efficient way to handle larger regions, or is there a way to pass a list of intervals to filter_rows() instead? (One idea I had is the interval-join sketch after this list.)

  2. I saw here that Hail doesn't handle filtering a small subset of the full 245k AoU sample set in the VDS very well. In terms of the optimal order of operations, do you recommend filtering by samples first, writing out a smaller VDS, and then filtering by intervals before densifying (as in the second sketch after this list)?
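
For question 1, here is the interval-based filter_rows() idea I had in mind. It's untested: it uses the interval-join pattern from the Hail docs (indexing an interval-keyed table with a locus), and the BED path is just a placeholder for wherever my 120k intervals would live:

#hypothetical path: a GRCh38 BED file listing the 120k target intervals
intervals_ht = hl.import_bed('gs://my-bucket/my_intervals.bed',
                             reference_genome='GRCh38')

#interval join: keep variant rows whose locus falls inside any interval
vd = vds.variant_data
vd = vd.filter_rows(hl.is_defined(intervals_ht[vd.locus]))
vds = hl.vds.VariantDataset(vds.reference_data, vd)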
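
And for question 2, just to make the question concrete, this is the order of operations I'm imagining (output path is a placeholder; the interval list is the same one as above):

#1) subset samples first, then checkpoint the smaller VDS
vds = hl.vds.filter_samples(vds, samples_to_keep, keep=True,
                            remove_dead_alleles=True)
vds.write('gs://my-bucket/subset.vds')

#2) re-read the subset, filter to the target intervals, and densify
vds = hl.vds.read_vds('gs://my-bucket/subset.vds')
vds = hl.vds.filter_intervals(
    vds,
    [hl.parse_locus_interval(x, reference_genome='GRCh38') for x in my_intervals])
mt = hl.vds.to_dense_mt(vds)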

Thanks for your help!