I’m trying to filter an MT to 31895315 sites and densify the result (line 70 of gnomad_qc/subpop_analysis.py at commit ca0d06f58b4095ba2af0ca31e9e615948666a370 in broadinstitute/gnomad_qc), but I keep getting what looks like a shuffle error. Do you have any suggestions? Would it be better to densify and then filter? I can also send along the log file.
I’d definitely recommend rewriting this to use the new VDS `to_dense_mt`, which naturally densifies only the sites you want to keep and has a much-improved implementation.
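As a rough sketch of that approach (assuming the data were already stored as a VDS; `hl.vds.read_vds`, `hl.vds.filter_variants`, and `hl.vds.to_dense_mt` are Hail's VDS calls, but the paths below are hypothetical placeholders):

```python
import hail as hl

# Hypothetical paths -- substitute the real VDS and sites table.
vds = hl.vds.read_vds("gs://my-bucket/dataset.vds")
sites_ht = hl.read_table("gs://my-bucket/sites.ht")

# filter_variants drops variant rows outside the sites table, so the
# densify step only has to reconstruct reference information at the
# retained sites rather than across the whole genome.
vds = hl.vds.filter_variants(vds, sites_ht)

# to_dense_mt produces a dense MatrixTable over just those sites.
mt = hl.vds.to_dense_mt(vds)
```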
This code is still working with the gnomAD v3 data, so we don’t actually have a VDS to start with
It’s been some time since I dug into the sparse_mt utilities, but I remember thinking that densify_sites would have issues; I think a full densify followed by filtering will perform better. That said, it’s certainly possible to convert a sparse MT to a VDS.
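If that conversion route were taken, a minimal sketch might look like the following (assuming Hail's `hl.vds.VariantDataset.from_merged_representation`, which builds a VDS from a sparse/merged MatrixTable; the paths are hypothetical):

```python
import hail as hl

# Hypothetical path -- substitute the real v3.1 sparse MT.
sparse_mt = hl.read_matrix_table("gs://my-bucket/sparse.mt")

# from_merged_representation splits a sparse ("merged") MatrixTable
# into the VDS pair of reference data and variant data.
vds = hl.vds.VariantDataset.from_merged_representation(sparse_mt)
vds.write("gs://my-bucket/dataset.vds")
```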
Do you think we should just convert the v3.1 sparse MT to a VDS, remove the original MT, and store only the VDS? Then we could work from that for this densify. I think it was previously decided that densify_sites didn’t help much over a full densify when the variants were spread across the whole genome and the variant set was large.
Moving to VDS still preserves all the data, just in the new representation.
We likely won’t need this MT for more than the current script (and might archive it afterward), so in terms of cost, is it better to do the VDS conversion, run this script with that update, and archive the new VDS, or to just do a full densify of the current MT followed by the site filter?
As you point out, it really depends on downstream usage. It’s pretty cheap to convert back and forth, though slightly more expensive to go sparse MT => VDS, because that direction requires reading the data twice (and because VDS is a more efficient representation to read). If I had to guess, it’s probably not worth converting just for this one pipeline, and you can run this as a full densify followed by a filter.
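That suggested path (full densify first, then filter to the sites) could be sketched as follows (assuming Hail's `hl.experimental.densify`; the paths are hypothetical):

```python
import hail as hl

# Hypothetical paths -- substitute the real sparse MT and sites table.
sparse_mt = hl.read_matrix_table("gs://my-bucket/sparse.mt")
sites_ht = hl.read_table("gs://my-bucket/sites.ht")

# Densify the full sparse MT in one scan, filling in reference blocks,
# rather than densifying per-site (which can trigger a shuffle).
dense_mt = hl.experimental.densify(sparse_mt)

# Then filter rows down to the sites of interest.
dense_mt = dense_mt.filter_rows(hl.is_defined(sites_ht[dense_mt.row_key]))
```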