Hello,
I need to run a candidate gene test for a handful of rare variants (about a dozen, so not a GWAS or anything like that) in the All of Us data. These rare variants are only available in the VDS.
I am hoping I can just grab those few variants and convert them to a Hail MatrixTable prior to association testing. I tested this with a single variant and found that it takes an extremely long time (about a day with the compute I have). Is there a faster way to grab a single variant from the VDS for association testing?
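In case it matters for scaling this up: here is a minimal sketch of how I plan to build the interval strings for the full set of variants. The dict below is a placeholder (only the rs189316643 coordinates are from my actual code; the rest would be filled in by hand), and it assumes Hail's end-exclusive interval convention:

```python
# Placeholder rsID -> (contig, 1-based position) map; only rs189316643
# is a real coordinate, the rest would be looked up manually.
variants = {
    "rs189316643": ("chr12", 29477233),
    # ... remaining candidate variants here
}

# Hail locus intervals are end-exclusive, so a single base at `pos`
# is covered by the interval pos-(pos + 1).
intervals = [f"{contig}:{pos}-{pos + 1}" for contig, pos in variants.values()]
```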
Here is the code I am using:
# initialize hail
import hail as hl
hl.init(default_reference='GRCh38', idempotent=True)
# read in the VDS file
vds = hl.vds.read_vds(vds_srwgs_path)
# Get only the region for the single variant rs189316643
rs189316643_interval = ['chr12:29477233-29477234']
vds = hl.vds.filter_intervals(
    vds,
    [hl.parse_locus_interval(x) for x in rs189316643_interval],
    split_reference_blocks=True)
# Sanity check: confirm that only one variant remains
vds.variant_data.count()
# convert VDS to dense matrix table
# all these steps taken from example here: https://workbench.researchallofus.org/workspaces/aou-rw-b7598f6e/duplicateofhowtoworkwithallofusgenomicdatahailplinkv7prep/notebooks/preview/03_Manipulate%20Hail%20VariantDataset.ipynb
# convert Local Allele Depth (LAD) to Allele Depth (AD)
mt = vds.variant_data.annotate_entries(
    AD=hl.vds.local_to_global(vds.variant_data.LAD,
                              vds.variant_data.LA,
                              n_alleles=hl.len(vds.variant_data.alleles),
                              fill_value=0,
                              number='R'))
# convert local genotype (LGT) to genotype (GT)
mt = mt.annotate_entries(GT=hl.experimental.lgt_to_gt(mt.LGT, mt.LA))
# convert the boolean FT entry field to a PASS/FAIL string
mt = mt.transmute_entries(FT=hl.if_else(mt.FT, "PASS", "FAIL"))
# force evaluation with a count, since Hail is lazy (this one runs fast)
mt.count()
mt = hl.vds.to_dense_mt(hl.vds.VariantDataset(vds.reference_data, mt))
# force evaluation with a count (this is the step that takes a very long time)
mt.count()
Any help would be greatly appreciated, thank you.