Create project VCFs from VDS

Hi there,

I am completely new to Hail, and I am tasked with creating project VCFs (per chromosome) from a VDS. I have figured out how to filter the VDS by chromosome, but the main issue is the entry schema. I need a project VCF entry schema with GT, PL, and AD rather than the LGT, LAD, LPL, and LA fields. I have tried hl.vds.to_dense_mt(vds) and hl.experimental.densify(sparse_mt), but neither of these has produced the entry field schema that I require. Does anyone know the implementation for this, or whether it can even be done? Annotating entries with GT = lgt_to_gt(lgt, la) partially solves the issue, but what about PL and AD? Are there equivalent functions for those fields? Is there a way to convert all three with one function?

EDIT

Hail now ships with local_to_global, which can convert a locally A-, R-, or G-indexed array into a globally A-, R-, or G-indexed array. See the update at the end of this thread for an example.


Hey @bgorbet!

EDIT: LPL to PL conversion needs more thought… We’ll get back to you shortly.

You might find these slides on the Scalable Variant Call Representation helpful (particularly the section on local alleles). The practical answer to your question is that LPL and LAD are indexed by local alleles, so:

alleles = ["A", "C", "T"]

# 7 reads supporting A, 0 supporting C, 5 supporting T
AD = [7, 0, 5]
LA = [0, 2]   # the first "local allele" is alleles[0] = A
              # the second "local allele" is alleles[2] = T
LAD = [7, 5]  # AD restricted to the local alleles, in LA order

So you can convert like this:

import hail as hl

def lad_to_ad(alleles, lad, la):
    # For each global allele index, find its position in LA;
    # alleles that are not in the local list get a depth of 0.
    return hl.range(hl.len(alleles)).map(
        lambda global_index: hl.or_else(lad[la.index(global_index)], 0)
    )

def lpl_to_pl(alleles, lpl, la):
    # ??? PL is G-indexed (one entry per genotype pair, not per
    # allele), so this needs more thought -- see the EDIT above
    # and the sketch below.
    raise NotImplementedError

mt = mt.annotate_entries(
    AD=lad_to_ad(mt.alleles, mt.LAD, mt.LA),
    PL=lpl_to_pl(mt.alleles, mt.LPL, mt.LA)
)
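For anyone who needs the PL conversion before upgrading: below is a sketch of how lpl_to_pl could be written, assuming diploid calls and the VCF spec’s G-ordering, where the genotype with allele indices j <= k sits at position k * (k + 1) / 2 + j (hl.triangle(k) + j in Hail). This is not the implementation Hail eventually shipped (see the local_to_global update below); genotypes involving a non-local allele are left missing here, though a large placeholder PL may suit some downstream tools better.

def lpl_to_pl(alleles, lpl, la):
    # Enumerate global diploid genotypes (j, k) with j <= k, in G-order.
    pairs = hl.range(hl.len(alleles)).flatmap(
        lambda k: hl.range(k + 1).map(lambda j: hl.tuple([j, k]))
    )

    def pl_entry(pair):
        j, k = pair[0], pair[1]
        lj = la.index(j)  # local index of allele j; missing if j is not local
        lk = la.index(k)
        return hl.if_else(
            hl.is_defined(lj) & hl.is_defined(lk),
            # the same triangle formula, applied to the local indices
            lpl[hl.triangle(hl.max(lj, lk)) + hl.min(lj, lk)],
            hl.missing(hl.tint32),
        )

    return pairs.map(pl_entry)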

We should 💯 add this to Hail. Apologies for its absence! The VDS and VDS combiner have taken off much faster than we anticipated!


Aside: @bgorbet, may I ask how you heard about VDS and Hail?


Of course! I am a statistician working at the UT Health School of Public Health. We got some data for a project we are collaborating on, and it came in VDS format along with VCF files that had been created from the VDS. Those VCF files have the genotype fields from the sparse MatrixTable format (LGT, LAD, LPL, indexed by LA). So, I am tasked with learning just enough Hail (my background is actually in computer science, and I have experience with Python, Jupyter Lab, and various cloud platforms) to get this data into a format that is familiar to the rest of the team for downstream analysis. So far, I am very impressed with Hail. If we were able to utilize its scalability on a Spark cluster or cloud platform, we would probably stick with Hail and I wouldn’t need to figure out how to undo the scalable format, but that is not our situation right now.

I also wanted to say that I saw your suggestion on memory allocation when working with Hail on a single machine, and that was very helpful. Thank you!


Thanks so much for the context! Is this project CCDG or AoU? Those are two where we’ve helped with callset generation.

I’m planning on adding a feature that can translate all the “local” fields (LGT, LAD, LPL, etc.) to global in a single step, and make this possible as part of hl.vds.to_dense_mt. I’m hoping to have time to do that this afternoon or tomorrow morning, and will post here when we’ve got a pull request up.

The project is CCDG. That feature would be greatly appreciated! This is behavior I had sort of expected from hl.vds.to_dense_mt anyway, so that would be a great home for it!

Hey!

Just an update for future users:

Hail now ships with local_to_global, which can convert local A-, R-, and G-indexed arrays into global A-, R-, and G-indexed arrays.
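Here is a minimal end-to-end sketch of how this fits together, assuming the signature local_to_global(array, local_alleles, n_alleles, fill_value, number). The paths and the chromosome are placeholders, and the fill values are judgment calls: 0 for AD, and a large placeholder PL (999) for genotypes involving a non-local allele, with hl.missing(hl.tint32) as another reasonable choice:

import hail as hl

vds = hl.vds.read_vds('path/to/dataset.vds')       # placeholder path
vds = hl.vds.filter_chromosomes(vds, keep='chr1')  # one project VCF per chromosome
mt = hl.vds.to_dense_mt(vds)

mt = mt.annotate_entries(
    GT=hl.vds.lgt_to_gt(mt.LGT, mt.LA),
    AD=hl.vds.local_to_global(
        mt.LAD, mt.LA, n_alleles=hl.len(mt.alleles), fill_value=0, number='R'
    ),
    PL=hl.vds.local_to_global(
        mt.LPL, mt.LA, n_alleles=hl.len(mt.alleles), fill_value=999, number='G'
    ),
)
mt = mt.drop('LGT', 'LA', 'LAD', 'LPL')  # keep only the global fields
hl.export_vcf(mt, 'chr1.vcf.bgz')        # placeholder output path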