Is there a more computationally efficient way to get a single variant from VDS?

Brett_Vanderwerff · March 24, 2024, 5:01pm

Hello,

I have to do some candidate gene test for a handful of rare variants (like a dozen or so, not GWAS or anything like that) in all of us data. These rare variants are only available in a VDS.

I am hoping I can just grab those few variants and convert to hail matrix table prior to association testing. I tested this with just one variant and am finding that this is taking an extremely long time (about a day with the compute I have), is there a faster way to grab a single variant from VDS for association testing?

Here is the code I am using:

# initiate up hail
import hail as hl
hl.init(default_reference='GRCh38', idempotent=True)

# read in the VDS file 
vds = hl.vds.read_vds(vds_srwgs_path)


# Get only the region for single variant rs189316643 
rs189316643_interval = ['chr12:29477233-chr12:29477234']
vds = hl.vds.filter_intervals(
    vds,
    [hl.parse_locus_interval(x,)
     for x in rs189316643_interval], split_reference_blocks = True)

# Ensure that results return only one variant
vds.variant_data.count()

# convert VDS to dense matrix table
# all these steps taken from example here: https://workbench.researchallofus.org/workspaces/aou-rw-b7598f6e/duplicateofhowtoworkwithallofusgenomicdatahailplinkv7prep/notebooks/preview/03_Manipulate%20Hail%20VariantDataset.ipynb
mt = vds.variant_data.annotate_entries(AD = hl.vds.local_to_global(vds.variant_data.LAD, 
                                                                   vds.variant_data.LA, 
                                                                   n_alleles=hl.len(vds.variant_data.alleles), 
                                                                   fill_value=0, number='R'))

# convert Local Allele Depth (LAD) to Allele Depth (AD) 
# convert local genotype (LGT) to genotype (GT)
mt = mt.annotate_entries(GT = hl.experimental.lgt_to_gt(mt.LGT, mt.LA))

mt = mt.transmute_entries(FT = hl.if_else(mt.FT, "PASS", "FAIL"))

# run count to debug lazy loading  (this one runs fast)
mt.count()

mt = hl.vds.to_dense_mt(hl.vds.VariantDataset(vds.reference_data, mt))

# run count to debug lazy loading (this step is what is taking a very long time)
mt.count()

Any help would be greatly appreciated, thank you.

patrick-schultz · March 25, 2024, 5:35pm

Hi @Brett_Vanderwerff,

Our resident VDS expert is on vacation, but I can try to help.

This should be something we can execute effeciently, but that depends on the input VDS having split its ref blocks to some max length. What do you get if you run

vds = hl.vds.read_vds(vds_srwgs_path)
hl.eval(vds.reference_data.index_globals().ref_block_max_length)

Brett_Vanderwerff · March 27, 2024, 1:41pm

Hi @patrick-schultz

I get: 1000

patrick-schultz · March 27, 2024, 2:12pm

Great, that means the reference blocks have already been split up to be no larger than 1000 base pairs. In that case, I think you’ll have a better time using split_reference_blocks = False in filter_intervals. This will filter the variants to the specified locus intervals, and filter the reference data to blocks starting within one of the intervals, or up to 1000 base pairs before. So that still avoids reading most of the reference data, and should be fast. If you’re immediately densifying, there’s no benefit to doing the work of splitting reference blocks further.

However, I’ll admit that I’m surprised the densify is taking so long. If it’s possible for you to share a log from a run, even if you terminate before it completes, that would be a huge help in diagnosing what’s happening.

Topic		Replies	Views
Most efficient way to filter and densify VDS Hail Query & hailctl	4	288	May 14, 2024
Export VDS to VCF Hail Query & hailctl	12	1258	January 12, 2023
Densifying VDS to MatrixTable very expensive Hail Query & hailctl	2	363	November 13, 2023
Counting Rows More Quickly in VDS Hail Query & hailctl	12	534	July 17, 2023
'GT' are present in MT but missing from VDS for same samples Hail Query & hailctl	6	377	October 4, 2023

Is there a more computationally efficient way to get a single variant from VDS?

Related topics