Hello!
I am working on a rare noncoding variant burden analysis with the All of Us genomic data, which is stored only as a VDS. We are analyzing a total of 180 Mb spread across 120k intervals scattered throughout the genome for about 3000 samples as part of a nested case-control study.
I densified the VDS once in the past (Oct-ish 2023) using the following script:
import hail as hl

#filter VDS by interval, then samples
my_intervals = ['chr17:43044295-43125364', ...]
vds = hl.vds.filter_intervals(
    vds, [hl.parse_locus_interval(x, reference_genome='GRCh38') for x in my_intervals])
vds = hl.vds.filter_samples(vds, samples_to_keep, keep=True, remove_dead_alleles=True)

#transform local allele fields (LAD, LGT) to global (AD, GT)
mt = vds.variant_data.annotate_entries(
    AD=hl.vds.local_to_global(vds.variant_data.LAD, vds.variant_data.LA,
                              n_alleles=hl.len(vds.variant_data.alleles),
                              fill_value=0, number='R'))
mt = mt.annotate_entries(GT=hl.vds.lgt_to_gt(mt.LGT, mt.LA))
#convert the boolean FT entry field to a string filter status
mt = mt.transmute_entries(FT=hl.if_else(mt.FT, "PASS", "FAIL"))

#densify to mt
mt = hl.vds.to_dense_mt(hl.vds.VariantDataset(vds.reference_data, mt))
However, that particular test run ended up being quite expensive (about $100 for 80 kb and 74 samples on a 75/75 cluster), even though it used a much smaller sample set and shorter intervals than my current analysis. I was advised that filter_rows() is more efficient than filter_intervals() as a way of reducing cost. So in principle, I understand that the following code would run:
#filter VDS by interval
my_loci = hl.literal({hl.Locus('chr17', 43044295, reference_genome='GRCh38'),
                      hl.Locus('chr17', 43044296, reference_genome='GRCh38'), ...})
vd = vds.variant_data
vd = vd.filter_rows(my_loci.contains(vd.locus))
vds = hl.vds.VariantDataset(vds.reference_data, vd)
I have two main questions:
- This second version involves passing a set of 180 million loci individually to filter_rows(), which feels a little clunky. Is this the most efficient way to go about this for larger regions? Is there a way to more efficiently pass a list of intervals to filter_rows() instead? (I sketch one idea below.)
- I saw here that Hail doesn't handle filtering a small subset of the full 245k AoU sample set out of the VDS very well. In terms of the optimal order of operations, do you recommend filtering by samples first, writing a smaller VDS, and then filtering by intervals before densifying? (Also sketched below.)
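For question 1, here is the kind of thing I have in mind, assuming I first export my 120k intervals to a BED file (the bucket path is a placeholder, and I haven't verified this is actually cheaper):

#sketch: filter rows via a point-locus join against an interval-keyed
#table, instead of enumerating every locus in a set
intervals_ht = hl.import_bed('gs://my-bucket/my_intervals.bed', reference_genome='GRCh38')
vd = vds.variant_data
#the lookup is defined only where the locus falls inside some interval
vd = vd.filter_rows(hl.is_defined(intervals_ht[vd.locus]))
vds = hl.vds.VariantDataset(vds.reference_data, vd)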
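And for question 2, to make the order of operations I'm imagining concrete (again, the path is a placeholder):

#sketch: subset to the ~3000 study samples first and write the smaller
#VDS, so the later interval filter and densify scan far less data
vds = hl.vds.filter_samples(vds, samples_to_keep, keep=True, remove_dead_alleles=True)
vds.write('gs://my-bucket/my_subset.vds', overwrite=True)

#continue from the written copy: filter by interval, then densify as above
vds = hl.vds.read_vds('gs://my-bucket/my_subset.vds')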
Thanks for your help!