When to densify?

skoyamamd · September 21, 2023, 6:33pm

Hi Dan King,

Thank you for your insightful presentation today!
Per our conversation, I’d appreciate it if you could take a moment to check whether the two code snippets below are computationally equivalent. Our goal is to extract biallelic variants with AC >= 100.
I expect that filtering before densifying, rather than densifying the entire VDS, will significantly reduce computation time.

While my initial tests suggest consistent results, I want to ensure its reliability across various scenarios. Your expertise in this review would be highly appreciated.

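## Code1
import hail as hl

vds = hl.vds.read_vds("some_very_big.vds")
# Filter the sparse variant data first, before densifying.
mt = vds.variant_data
mt = hl.split_multi_hts(mt)
mt = hl.variant_qc(mt)
# After splitting, AC[1] is the allele count of the single alternate allele.
mt = mt.annotate_rows(AC100 = mt.variant_qc.AC[1] > 99)
mt = mt.filter_rows(mt.AC100)
vds.variant_data = mt
# Densify only the rows that passed the filter.
mt = hl.vds.to_dense_mt(vds)
mt.write("filtered.mt", overwrite=True)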

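## Code2
import hail as hl

vds = hl.vds.read_vds("some_very_big.vds")
# Densify the entire VDS first, then filter the dense matrix table.
mt = hl.vds.to_dense_mt(vds)
mt = hl.split_multi_hts(mt)
mt = hl.variant_qc(mt)
# After splitting, AC[1] is the allele count of the single alternate allele.
mt = mt.annotate_rows(AC100 = mt.variant_qc.AC[1] > 99)
mt = mt.filter_rows(mt.AC100)
mt.write("filtered.mt", overwrite=True)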

danking · September 22, 2023, 5:54pm

OK, Patrick and I gave this some thought and we agree that both should produce the same output. If a variant row doesn’t exist in the variant data, it won’t exist in the densified result. If you remove a row after densification, it won’t exist either.

The presence or absence of a variant row should not affect the value of other variant rows. Densified variant rows are affected by the source variant row and the overlapping reference data. You don’t touch the reference data, so there should be no effect.
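To make that argument concrete, here is a toy pure-Python model of densification. It is only a sketch and not Hail's implementation: the dictionary layout, the `densify` and `keep_common` helpers, and the sample names are all made up for illustration. It shows that filtering rows on the alternate allele count commutes with densifying, because each dense row is built only from its own variant row plus the untouched reference blocks.

```python
SAMPLES = ["s0", "s1", "s2"]

# Hypothetical sparse variant data: position -> {sample: alt alleles called}.
# Samples without a non-reference call at a position are simply absent.
variant_data = {
    100: {"s0": 1, "s1": 2},
    200: {"s2": 1},
    300: {"s0": 2, "s1": 1, "s2": 1},
}

# Hypothetical reference data: sample -> inclusive (start, end) hom-ref blocks.
reference_data = {
    "s0": [(150, 250)],
    "s1": [(150, 250)],
    "s2": [(50, 120)],
}


def densify(variant_rows, ref_blocks):
    """Fill every sample at every variant position, row by row."""
    dense = {}
    for pos, calls in variant_rows.items():
        row = {}
        for sample in SAMPLES:
            if sample in calls:
                row[sample] = calls[sample]  # non-reference call
            elif any(start <= pos <= end for start, end in ref_blocks.get(sample, [])):
                row[sample] = 0  # covered by a hom-ref block
            else:
                row[sample] = None  # no data for this sample here
        dense[pos] = row
    return dense


def alt_allele_count(row):
    """Sum alt alleles; hom-ref (0) and missing (None) entries add nothing."""
    return sum(n for n in row.values() if n)


def keep_common(rows, min_ac):
    """Keep only rows whose alternate allele count is at least min_ac."""
    return {pos: row for pos, row in rows.items() if alt_allele_count(row) >= min_ac}


filter_then_densify = densify(keep_common(variant_data, 3), reference_data)
densify_then_filter = keep_common(densify(variant_data, reference_data), 3)
assert filter_then_densify == densify_then_filter
```

The filter here depends only on alternate alleles, which is what makes the two orders interchangeable in this toy model; a filter on a quantity that changes when hom-ref entries are filled in would not commute this way.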

We’ll look into why Hail doesn’t automatically push the filter up into the variant data.

danking · September 22, 2023, 5:57pm

Hmm. I think part of the challenge for Hail automatically doing this is that variant_qc very much depends on reference data, but you’ve cleverly chosen to only look at the first alternate allele’s allele count.

I think it’d be hard for us to automatically do this optimization. We’ll keep this in mind as we develop further.
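A small illustration of why this particular filter is safe while most variant_qc-style statistics are not. This is a toy sketch, not Hail's variant_qc: the `qc` helper and the sample calls are invented, and only the field names AC, AN, and AF mirror the ones variant_qc reports. The alternate allele count is the same whether or not the hom-ref calls from reference data are present, but AN, AF, and the reference allele count all change.

```python
# One biallelic site. Variant data holds only the non-reference calls;
# the hom-ref calls for s2 and s3 would only appear after densification.
sparse_calls = {"s0": (0, 1), "s1": (1, 1)}
dense_calls = {"s0": (0, 1), "s1": (1, 1), "s2": (0, 0), "s3": (0, 0)}


def qc(calls, n_alleles=2):
    """Toy per-site statistics computed from diploid calls."""
    ac = [0] * n_alleles
    for genotype in calls.values():
        for allele in genotype:
            ac[allele] += 1
    an = sum(ac)
    return {"AC": ac, "AN": an, "AF": [count / an for count in ac]}


sparse_qc = qc(sparse_calls)
dense_qc = qc(dense_calls)

# The alternate allele count does not depend on reference data...
assert sparse_qc["AC"][1] == dense_qc["AC"][1]
# ...but the reference allele count, AN, and AF do.
assert sparse_qc["AC"][0] != dense_qc["AC"][0]
assert sparse_qc["AN"] != dense_qc["AN"]
assert sparse_qc["AF"][1] != dense_qc["AF"][1]
```

So a filter on AF or call rate computed from the variant data alone would give different results from the same filter applied after densifying, which is presumably why the rewrite cannot be applied blindly.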

danking · September 22, 2023, 5:58pm

And an issue to track the optimization: hail-is/hail issue #13695, “[query] Hail should be able to automatically push backwards alternate allele only VDS filters”.
