I’m trying to filter an MT to 31895315 sites and densify the result (line 70 of gnomad_qc/subpop_analysis.py at commit ca0d06f58b4095ba2af0ca31e9e615948666a370 in broadinstitute/gnomad_qc), but I keep getting what looks like a shuffle error. Do you have any suggestions? Would it be better to densify and then filter? I can also send along the log file.
I’d definitely recommend rewriting this to use the new VDS `to_dense_mt`, which naturally densifies only the sites you want to keep and has a much-improved implementation.
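As a rough sketch of that approach (assuming the data were already stored as a VDS; `hl.vds.read_vds`, `hl.vds.filter_variants`, and `hl.vds.to_dense_mt` are Hail's VDS calls, but the paths below are hypothetical placeholders):

```python
import hail as hl

# Hypothetical paths -- substitute the real VDS and sites table.
vds = hl.vds.read_vds("gs://my-bucket/dataset.vds")
sites_ht = hl.read_table("gs://my-bucket/sites.ht")

# filter_variants drops variant rows outside the sites table, so the
# densify step only has to reconstruct reference information at the
# retained sites rather than across the whole genome.
vds = hl.vds.filter_variants(vds, sites_ht)

# to_dense_mt produces a dense MatrixTable over just those sites.
mt = hl.vds.to_dense_mt(vds)
```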
This code is still working with the gnomAD v3 data, so we don’t actually have a VDS to start with
It’s been some time since I dug into the sparse_mt utilities, but I remember thinking that densify_sites would have issues; I think a full densify followed by filtering will perform better. That said, it’s certainly possible to convert a sparse MT to a VDS.
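If that conversion route were taken, a minimal sketch might look like the following (assuming Hail's `hl.vds.VariantDataset.from_merged_representation`, which builds a VDS from a sparse/merged MatrixTable; the paths are hypothetical):

```python
import hail as hl

# Hypothetical path -- substitute the real v3.1 sparse MT.
sparse_mt = hl.read_matrix_table("gs://my-bucket/sparse.mt")

# from_merged_representation splits a sparse ("merged") MatrixTable
# into the VDS pair of reference data and variant data.
vds = hl.vds.VariantDataset.from_merged_representation(sparse_mt)
vds.write("gs://my-bucket/dataset.vds")
```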
Do you think we should just convert the v3.1 sparse MT to a VDS, remove the original MT, and store only the VDS? Then we could work from that for this densify. I think it was previously decided that densify_sites didn’t help much over a full densify when the variants were spread across the whole genome and the variant set was large.
Moving to VDS still preserves all the data, just in the new representation.
We likely won’t need this MT for more than the current script (and might archive it afterward), so in terms of cost, is it better to do the VDS conversion, run this script with that update, and archive the new VDS, or to just do a full densify of the current MT followed by the site filter?
As you point out, it really depends on downstream usage. It’s pretty cheap to convert back and forth, though slightly more expensive to go sparse MT => VDS, because that direction requires reading the data twice (and because VDS is a more efficient representation to read). If I had to guess, it’s probably not worth converting just for this one pipeline, and you can run this as a full densify followed by a filter.
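That suggested path (full densify first, then filter to the sites) could be sketched as follows (assuming Hail's `hl.experimental.densify`; the paths are hypothetical):

```python
import hail as hl

# Hypothetical paths -- substitute the real sparse MT and sites table.
sparse_mt = hl.read_matrix_table("gs://my-bucket/sparse.mt")
sites_ht = hl.read_table("gs://my-bucket/sites.ht")

# Densify the full sparse MT in one scan, filling in reference blocks,
# rather than densifying per-site (which can trigger a shuffle).
dense_mt = hl.experimental.densify(sparse_mt)

# Then filter rows down to the sites of interest.
dense_mt = dense_mt.filter_rows(hl.is_defined(sites_ht[dense_mt.row_key]))
```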