When to densify?

Hi Dan King,

Thank you for your insightful presentation today!
Per our conversation, I’d appreciate it if you could take a moment to assess whether the two code snippets below are computationally equivalent. Our goal is to extract biallelic variants with AC >= 100.
I expect that filtering the variant data before densifying, rather than densifying the entire VDS first, will significantly reduce computation time.

While my initial tests suggest consistent results, I want to make sure the equivalence holds across various scenarios. Your expertise in this review would be highly appreciated.

## Code1
vds = hl.vds.read_vds("some_very_big.vds")
mt = vds.variant_data
mt = hl.split_multi_hts(mt)
mt = hl.variant_qc(mt)
mt = mt.annotate_rows(AC100 = mt.variant_qc.AC[1] > 99)
mt = mt.filter_rows(mt.AC100)
vds.variant_data = mt
mt = hl.vds.to_dense_mt(vds)
mt.write("filtered.mt", overwrite=True)

## Code2
vds = hl.vds.read_vds("some_very_big.vds")
mt = hl.vds.to_dense_mt(vds)
mt = hl.split_multi_hts(mt)
mt = hl.variant_qc(mt)
mt = mt.annotate_rows(AC100 = mt.variant_qc.AC[1] > 99)
mt = mt.filter_rows(mt.AC100)
mt.write("filtered.mt", overwrite=True)

OK, Patrick and I gave this some thought and we agree that both should produce the same output. If a variant row doesn’t exist before densification, it won’t exist in the densified result. If you remove a row after densification, it won’t exist in the output either.

The presence or absence of a variant row should not affect the value of other variant rows. Densified variant rows are affected by the source variant row and the overlapping reference data. You don’t touch the reference data, so there should be no effect.
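To make that argument concrete, here is a minimal sketch in plain Python. It is a toy model, not Hail's implementation or API: `variant_data`, `reference_data`, `filter_rows`, and `densify` are hypothetical stand-ins, and each entry is simplified to the number of alternate alleles in a call. It only illustrates that a row filter based on variant data alone commutes with densification.

```python
# Toy model only: plain-Python stand-ins, not Hail's data structures or API.
SAMPLES = ["s1", "s2", "s3"]

# Sparse variant data: locus -> {sample: number of alt alleles called}.
# A sample absent from a row has no variant call at that locus.
variant_data = {
    100: {"s1": 1, "s2": 2},
    200: {"s3": 1},
    300: {"s1": 1, "s2": 1, "s3": 2},
}

# Reference data: per sample, inclusive (start, end) blocks of hom-ref calls.
reference_data = {
    "s1": [(150, 250)],
    "s2": [(250, 350)],
    "s3": [(50, 120)],
}


def alt_count(row):
    """Alt alleles in a row; hom-ref (0) and missing (None) add nothing."""
    return sum(v for v in row.values() if v is not None)


def filter_rows(rows, min_ac):
    """Keep only rows whose alt allele count is at least min_ac."""
    return {locus: row for locus, row in rows.items() if alt_count(row) >= min_ac}


def densify(variants, reference):
    """Fill every sample's entry for each existing variant row."""
    dense = {}
    for locus, row in variants.items():
        dense_row = {}
        for sample in SAMPLES:
            if sample in row:
                dense_row[sample] = row[sample]  # keep the variant call
            elif any(start <= locus <= end for start, end in reference[sample]):
                dense_row[sample] = 0  # covered by a reference block: hom-ref
            else:
                dense_row[sample] = None  # no information: missing
        dense[locus] = dense_row
    return dense


filter_then_densify = densify(filter_rows(variant_data, 2), reference_data)
densify_then_filter = filter_rows(densify(variant_data, reference_data), 2)
assert filter_then_densify == densify_then_filter
```

In this toy, densification never creates or removes rows and never changes a row's alt allele count, so the order of the two steps does not matter. The filtered-first path simply densifies fewer rows.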

We’ll look into why Hail doesn’t automatically push the filter up into the variant data.

Hmm. I think part of the challenge for Hail automatically doing this is that variant_qc very much depends on reference data, but you’ve cleverly chosen to only look at the first alternate allele’s allele count.
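A small plain-Python sketch of that distinction, under the same kind of toy assumptions (this is not how `variant_qc` is implemented; `allele_count` and `allele_number` are hypothetical helpers, and each entry is the number of alternate alleles in a diploid call): the alternate allele count is the same with or without reference data, while a statistic like the allele number is not.

```python
# Toy model only: plain-Python stand-ins, not Hail's variant_qc.
# Each entry is the number of alt alleles in a diploid call; None = no call.
sparse_row = {"s1": 1, "s2": 2}                      # variant data only
dense_row = {"s1": 1, "s2": 2, "s3": 0, "s4": None}  # after adding reference data


def allele_count(row):
    """Alt alleles observed; hom-ref (0) and missing (None) add nothing."""
    return sum(v for v in row.values() if v is not None)


def allele_number(row):
    """Total called alleles: two per non-missing diploid call."""
    return 2 * sum(1 for v in row.values() if v is not None)


# The alt allele count is identical before and after densification.
print(allele_count(sparse_row), allele_count(dense_row))    # 3 3
# The allele number changes once hom-ref calls are filled in.
print(allele_number(sparse_row), allele_number(dense_row))  # 4 6
```

A filter on a reference-dependent field such as `AN` or `call_rate` could therefore give different results on the sparse variant data than on the dense matrix table, which is why the shortcut is only safe for filters like this one.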

I think it’d be hard for us to automatically do this optimization. We’ll keep this in mind as we develop further.

And an issue to track the optimization: “[query] Hail should be able to automatically push backwards alternate allele only VDS filters” (Issue #13695, hail-is/hail on GitHub).