Densifying VDS to MatrixTable very expensive

Hi,

I’m running Hail in the All of Us Researcher Workbench, and I’m trying to densify 8602 samples at 264 sites from the AoU VDS to a MatrixTable, cache it, and run subsequent operations on it. This spins up three successive Spark stages of 84,648 tasks each and costs about $100 in compute, which seems very expensive for looking at a comparatively small number of entries. Running the same commands on a MatrixTable imported from VCF takes only about 10 seconds, while densifying the VDS takes hours or longer, even on a large cluster.

I used the code below, adapted from the AoU tutorial notebooks, as a proof of concept. It accesses only one site in one sample after densifying the VDS filtered to 8602 samples at 264 sites, and it still spins up at least one stage of 84,648 tasks, which ends up being very expensive.

import os

import hail as hl

vds = hl.vds.read_vds(os.getenv("WGS_VDS_PATH"))
vds_filtered = hl.vds.filter_samples(vds, person_ids)  # 8602 samples
vds_filtered = hl.vds.filter_intervals(vds_filtered, locus_intervals)  # 264 LocusIntervals (each 1 bp long)

mt = vds_filtered.variant_data
mt = mt.annotate_entries(GT=hl.vds.lgt_to_gt(mt.LGT, mt.LA))
mt = hl.vds.to_dense_mt(hl.vds.VariantDataset(vds_filtered.reference_data, mt))

# Access only one sample at one site
mt_sample = mt.filter_cols(mt.s == '{SOME_AoU_PERSON_ID}')
mt_sample_site = mt_sample.filter_rows(mt_sample.locus == hl.parse_locus('chr2:...')).cache()

I remember hearing that converting from VDS to MT is quite cheap. Is this cost expected for this scale, or is there a bug in the code that I adapted from the tutorial?

Hey @mgatzen! I’m sorry Hail isn’t performing the way you expect. I’m also surprised by the performance in this case.

What version of Hail are you using?

I would expect this code to execute in time proportional to the size of the intervals, and definitely not take hours, if both of these things are true: (1) you’re using Hail 0.2.114 or later, and (2) your VDS has been updated with a ref_block_max_len. If (2) is not true (but (1) is), you should see a warning about “filtering intervals without a known max reference block length”. A quick way to check both conditions is sketched after the code below.

vds = hl.vds.read_vds(os.getenv("WGS_VDS_PATH"))
vds_filtered = hl.vds.filter_samples(vds, person_ids) # 8602 samples
vds_filtered = hl.vds.filter_intervals(vds_filtered, locus_intervals) # 264 LocusIntervals (each 1bp long)

mt = vds_filtered.variant_data
mt = mt.annotate_entries(GT = hl.vds.lgt_to_gt(mt.LGT, mt.LA))
mt = hl.vds.to_dense_mt(hl.vds.VariantDataset(vds_filtered.reference_data, mt))
mt.write(...)
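
To check both conditions from a Workbench notebook, something like the sketch below should work. hl.vds.store_ref_block_max_length is my understanding of the 0.2.114+ API for recording the max reference block length in an existing VDS; for the released AoU callset it would presumably have to be run by whoever maintains that path, not by individual researchers.

import hail as hl

# Condition (1): which Hail version is this kernel running?
print(hl.version())

# Condition (2): has the VDS been patched with its max reference block length?
# Hail 0.2.114+ can patch an existing VDS in place, e.g.:
#   hl.vds.store_ref_block_max_length('gs://path/to/dataset.vds')
# Without that patch, hl.vds.filter_intervals warns about "filtering intervals
# without a known max reference block length" and reads far more data.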

cache is likely not what you’re looking for. Are you trying to save the matrix table for future use? Use write and then read in that case.
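For example, something like this sketch, where the output path is a placeholder for your own workspace bucket:

out_path = 'gs://<your-workspace-bucket>/densified_264_sites.mt'

# Pay the densification cost once by writing the dense MatrixTable ...
mt.write(out_path, overwrite=True)

# ... then read the small written result back for all downstream queries.
mt = hl.read_matrix_table(out_path)
mt_sample = mt.filter_cols(mt.s == '{SOME_AoU_PERSON_ID}')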


I remember hearing that converting from VDS to MT is quite cheap.

This is not true; in fact, the opposite is true: converting from the sparse VDS format to a dense MatrixTable is expensive, precisely because it turns a sparse representation into a dense one.


Hail does not store data in a manner that lets you look at a subset of samples in O(SUBSET_SIZE) time. We’re working on something called columnar storage, which would permit this kind of access, but it’s a work in progress.

Hi @danking,

Thank you very much for your response! The version installed on the AoU Researcher Workbench is in fact 0.2.107, which predates the improvements you mentioned in 0.2.114. Updating the Hail version from within the cloud environment doesn’t seem to work, so I reached out to Lee and Sophie to make them aware of this; hopefully the Workbench can be updated to a more recent version.

Thanks again for your input,
Michael