Is there a more computationally efficient way to get a single variant from VDS?


I have to do some candidate gene test for a handful of rare variants (like a dozen or so, not GWAS or anything like that) in all of us data. These rare variants are only available in a VDS.

I am hoping I can just grab those few variants and convert to hail matrix table prior to association testing. I tested this with just one variant and am finding that this is taking an extremely long time (about a day with the compute I have), is there a faster way to grab a single variant from VDS for association testing?

Here is the code I am using:

# initiate up hail
import hail as hl
hl.init(default_reference='GRCh38', idempotent=True)
# read in the VDS file 
vds = hl.vds.read_vds(vds_srwgs_path)

# Get only the region for single variant rs189316643 
rs189316643_interval = ['chr12:29477233-chr12:29477234']
vds = hl.vds.filter_intervals(
     for x in rs189316643_interval], split_reference_blocks = True)
# Ensure that results return only one variant
# convert VDS to dense matrix table
# all these steps taken from example here:
mt = vds.variant_data.annotate_entries(AD = hl.vds.local_to_global(vds.variant_data.LAD, 
                                                                   fill_value=0, number='R'))
# convert Local Allele Depth (LAD) to Allele Depth (AD) 
# convert local genotype (LGT) to genotype (GT)
mt = mt.annotate_entries(GT = hl.experimental.lgt_to_gt(mt.LGT, mt.LA))

mt = mt.transmute_entries(FT = hl.if_else(mt.FT, "PASS", "FAIL"))

# run count to debug lazy loading  (this one runs fast)

mt = hl.vds.to_dense_mt(hl.vds.VariantDataset(vds.reference_data, mt))

# run count to debug lazy loading (this step is what is taking a very long time)

Any help would be greatly appreciated, thank you.

Hi @Brett_Vanderwerff,

Our resident VDS expert is on vacation, but I can try to help.

This should be something we can execute effeciently, but that depends on the input VDS having split its ref blocks to some max length. What do you get if you run

vds = hl.vds.read_vds(vds_srwgs_path)

Hi @patrick-schultz

I get: 1000

Great, that means the reference blocks have already been split up to be no larger than 1000 base pairs. In that case, I think you’ll have a better time using split_reference_blocks = False in filter_intervals. This will filter the variants to the specified locus intervals, and filter the reference data to blocks starting within one of the intervals, or up to 1000 base pairs before. So that still avoids reading most of the reference data, and should be fast. If you’re immediately densifying, there’s no benefit to doing the work of splitting reference blocks further.

However, I’ll admit that I’m surprised the densify is taking so long. If it’s possible for you to share a log from a run, even if you terminate before it completes, that would be a huge help in diagnosing what’s happening.