Most efficient way to filter and densify VDS

Hello!

I am working on a rare noncoding variant burden analysis with the All of Us genomic data, which is stored only as a VDS. We are analyzing a total of 180 Mb spread across 120k intervals scattered throughout the genome for about 3000 samples as part of a nested case-control study.

I densified the VDS once in the past (Oct-ish 2023) using the following script:

#filter VDS by interval, then samples
my_intervals = ['chr17:43044295-43125364', ...]
vds = hl.vds.filter_intervals(vds, [hl.parse_locus_interval(x) for x in my_intervals])
vds = hl.vds.filter_samples(vds, samples_to_keep, keep = True, remove_dead_alleles = True)

#transform local alleles to global
mt = vds.variant_data.annotate_entries(
    AD = hl.vds.local_to_global(vds.variant_data.LAD,
                                vds.variant_data.LA,
                                n_alleles = hl.len(vds.variant_data.alleles),
                                fill_value = 0,
                                number = 'R'))
mt = mt.annotate_entries(GT = hl.vds.lgt_to_gt(mt.LGT, mt.LA))
mt = mt.transmute_entries(FT = hl.if_else(mt.FT, "PASS", "FAIL"))

#densify to mt
mt = hl.vds.to_dense_mt(hl.vds.VariantDataset(vds.reference_data, mt))

However, that particular test run ended up being quite expensive ($100 for 80 kb and 74 samples on a 75/75 cluster) despite covering a much smaller sample set and shorter interval than my current analysis. I was advised that filter_rows() is more efficient than filter_intervals() as a way of reducing cost. So in principle, I understand that the following code would run:

#filter VDS by interval
my_loci = hl.literal({hl.Locus('chr17', 43044295), hl.Locus('chr17', 43044296), ...})
vd = vds.variant_data
vd = vd.filter_rows(my_loci.contains(vd.locus))
vds = hl.vds.VariantDataset(vds.reference_data, vd)

I have two main questions:

  1. This second version involves passing a set of 180 million loci individually to filter_rows(), which feels a little clunky. Is this the most efficient way to go about doing this for larger regions? Is there a way to more efficiently pass a list of intervals to filter_rows() instead?

  2. I saw here that Hail doesn’t support filtering a subset of the full 245k AoU sample set very well in the VDS. In terms of the optimal order of operations here, do you recommend filtering by samples first, writing a smaller VDS, then filtering by intervals before densifying?

Thanks for your help!

Hi @jbs!

I was advised that filter_rows() is more efficient than filter_intervals() as a way of reducing cost.

That’s not right. filter_intervals will avoid reading any partitions that don’t overlap any of the intervals, while filter_rows has to scan through every row and apply the filter condition (though our optimizer tries very hard to rewrite filter_rows to filter_intervals whenever possible). Also, that snippet only filters the variant data, not the reference data, which will likely cause downstream performance issues.
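
To make that concrete, here's a rough sketch of the difference (the read path is just a placeholder; substitute your own VDS):

import hail as hl

vds = hl.vds.read_vds('gs://my-bucket/my_dataset.vds')  # placeholder path

# filter_intervals on the whole VDS filters both component tables
# (reference_data and variant_data) and can skip partitions that don't
# overlap any of the intervals
intervals = [hl.parse_locus_interval('chr17:43044295-43125364',
                                     reference_genome='GRCh38')]
vds = hl.vds.filter_intervals(vds, intervals)

# by contrast, filtering rows of vds.variant_data alone leaves
# vds.reference_data at full size, so a later to_dense_mt still has to
# read every reference block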

I saw here that Hail doesn’t support filtering a subset of the full 245k AoU sample set very well in the VDS. In terms of the optimal order of operations here, do you recommend filtering by samples first, writing a smaller VDS, then filtering by intervals before densifying?

You want to filter the variants first. The reason filtering samples is expensive is that a VDS (like a MatrixTable) is stored “variant major”, i.e. for each variant it stores an array of all the entries for all the samples. Filtering samples means rewriting every one of those entries arrays, whereas filtering variants just requires skipping over the filtered rows. By filtering variants first, only the entries arrays for the kept variants need to be rewritten.
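
If it helps, here's a toy illustration in plain Python (not Hail's actual on-disk format) of why that is:

# toy "variant major" layout: one entries array per variant
rows = [
    {'locus': 'chr17:43044295', 'entries': ['0/0', '0/1', './.']},
    {'locus': 'chr17:43044296', 'entries': ['0/0', '0/0', '0/1']},
]

# filtering samples rewrites every entries array...
keep_sample_idx = [0, 2]
by_sample = [{**r, 'entries': [r['entries'][i] for i in keep_sample_idx]}
             for r in rows]

# ...whereas filtering variants just drops whole rows
keep_loci = {'chr17:43044295'}
by_variant = [r for r in rows if r['locus'] in keep_loci]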

So in short, I think you were doing everything right before. One thought: can you run vds.reference_data.describe() and check if there’s a global field ref_block_max_length? That’s important for making filter_intervals on the reference data efficient.
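
For example, assuming vds is the opened VariantDataset:

vds.reference_data.describe()             # look for ref_block_max_length under the global fields
print(vds.reference_data.globals.dtype)   # or just print the global struct type

If it's missing (VDSes written by older versions of Hail may predate it), I believe there's an hl.vds.store_ref_block_max_length utility that can patch an existing VDS in place, but double-check the docs for your Hail version.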

@chrisvittal Can you think of anything else to suggest? Is the cost for the first run in the ballpark of what you would expect?

Actually, you probably want to write out the vds after filtering before densifying. That may actually be the cause of the high cost. Does that sound right, Chris?

That is correct. Here is what I would recommend:

#filter VDS by interval, then samples
my_intervals = ['chr17:43044295-43125364', ...]
vds = hl.vds.filter_intervals(vds, [hl.parse_locus_interval(x) for x in my_intervals])
vds = hl.vds.filter_samples(vds, samples_to_keep, keep = True, remove_dead_alleles = True)
vds = vds.checkpoint(...)

#transform local alleles to global
mt = vds.variant_data.annotate_entries(
    AD = hl.vds.local_to_global(vds.variant_data.LAD, 
                                vds.variant_data.LA, 
                                n_alleles = hl.len(vds.variant_data.alleles), 
                                fill_value = 0, 
                                number = 'R'))
mt = mt.annotate_entries(GT = hl.vds.lgt_to_gt(mt.LGT, mt.LA))
# unless you're going to convert to something like VCF immediately, 
# don't convert from a boolean to a string, it's a waste of memory/space
# mt = mt.transmute_entries(FT = hl.if_else(mt.FT, "PASS", "FAIL"))

#densify to mt
mt = hl.vds.to_dense_mt(hl.vds.VariantDataset(vds.reference_data, mt))

Thanks for the extremely helpful responses! That makes a lot of sense. This worked, and the full densification cost was in the ~$300 range.

It does look like ref_block_max_length is defined in a global field. I re-ran the original filter_intervals()-based snippet I had tested in Oct 2023, and it now runs in 2-3 mins on a 2/50 cluster. So it’s possible that this wasn’t the case back when I originally tested it in the fall.

Sorry, one more thing…

One of our downstream tasks will be to filter variants by allele frequency, so we will need to calculate AC/AF/AN fields which are not stored in the VDS.

I know that hl.agg.call_stats() will compute these from a MatrixTable. AoU seems to suggest densifying the target sites for all 245k samples and then computing these fields on the resulting MatrixTable. So something like:

# filter VDS by interval
my_intervals = ['chr17:43044295-43125364', ...]
vds = hl.vds.filter_intervals(vds, [hl.parse_locus_interval(x) for x in my_intervals])
vds = vds.checkpoint(...)

# transform local alleles to global
mt = vds.variant_data.annotate_entries(
    AD = hl.vds.local_to_global(vds.variant_data.LAD, 
                                vds.variant_data.LA, 
                                n_alleles = hl.len(vds.variant_data.alleles), 
                                fill_value = 0, 
                                number = 'R'))
mt = mt.annotate_entries(GT = hl.vds.lgt_to_gt(mt.LGT, mt.LA))

# densify to mt
mt = hl.vds.to_dense_mt(hl.vds.VariantDataset(vds.reference_data, mt))

# compute AC/AF/AN fields
mt = mt.annotate_rows(info = hl.agg.call_stats(mt.GT, mt.alleles))

# filter samples
mt = mt.semi_join_cols(phenotypes_table)

However, this would obviously involve writing a much larger MatrixTable. Is this really optimal? Or is there a better way to compute these fields from the VDS to allow filtering both variants and samples before densifying in order to avoid this?

Thanks again for your help!