Create project VCFs from VDS

Hi there,

I am completely new to Hail, and I am tasked with creating project VCFs (per chromosome) from a VDS. I have figured out how to filter the VDS by chromosome, but the main issue is the entry schema. I need a project VCF entry schema with GT, PL, and AD rather than the LGT, LAD, LPL, and LA fields. I have tried hl.vds.to_dense_mt(vds) and hl.experimental.densify(sparse_mt), but neither of these has produced the entry field schema that I require. Does anyone know the implementation for this, or whether it can even be done? Annotating entries with GT = lgt_to_gt(lgt, la) partially solves the issue, but what about PL and AD? Are there equivalent functions for those fields? Is there a way to convert all three with one function?

EDIT

Hail now ships with local_to_global, which can convert a locally A-, R-, or G-indexed array into a globally A-, R-, or G-indexed array. See the update at the end of this thread for an example.


Hey @bgorbet!

EDIT: LPL to PL conversion needs more thought… We’ll get back to you shortly.

You might find these slides on the Scalable Variant Call Representation helpful (particularly the section on local alleles). The practical answer to your question is that LPL and LAD are indexed by local alleles, so:

alleles = ["A", "C", "T"]

# 7 reads supporting A, 0 supporting C, 5 supporting T
AD = [7, 0, 5]
LA = [0, 2]   # the first "local allele" is alleles[0] = A
              # the second "local allele" is alleles[2] = T
LAD = [7, 5]  # AD restricted to the local alleles, in LA order

So you can convert like this:

import hail as hl

def lad_to_ad(alleles, lad, la):
    # For each global allele index, find its position in LA;
    # alleles that are not in the local list get a depth of 0.
    return hl.range(hl.len(alleles)).map(
        lambda global_index: hl.or_else(lad[la.index(global_index)], 0)
    )

def lpl_to_pl(alleles, lpl, la):
    # ??? PL is G-indexed (one entry per genotype pair, not per
    # allele), so this needs more thought -- see the EDIT above
    # and the sketch below.
    raise NotImplementedError

mt = mt.annotate_entries(
    AD=lad_to_ad(mt.alleles, mt.LAD, mt.LA),
    PL=lpl_to_pl(mt.alleles, mt.LPL, mt.LA)
)
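For anyone who needs the PL conversion before upgrading: below is a sketch of how lpl_to_pl could be written, assuming diploid calls and the VCF spec’s G-ordering, where the genotype with allele indices j <= k sits at position k * (k + 1) / 2 + j (hl.triangle(k) + j in Hail). This is not the implementation Hail eventually shipped (see the local_to_global update below); genotypes involving a non-local allele are left missing here, though a large placeholder PL may suit some downstream tools better.

def lpl_to_pl(alleles, lpl, la):
    # Enumerate global diploid genotypes (j, k) with j <= k, in G-order.
    pairs = hl.range(hl.len(alleles)).flatmap(
        lambda k: hl.range(k + 1).map(lambda j: hl.tuple([j, k]))
    )

    def pl_entry(pair):
        j, k = pair[0], pair[1]
        lj = la.index(j)  # local index of allele j; missing if j is not local
        lk = la.index(k)
        return hl.if_else(
            hl.is_defined(lj) & hl.is_defined(lk),
            # the same triangle formula, applied to the local indices
            lpl[hl.triangle(hl.max(lj, lk)) + hl.min(lj, lk)],
            hl.missing(hl.tint32),
        )

    return pairs.map(pl_entry)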

We should 💯 add this to Hail. Apologies for its absence! The VDS and VDS combiner have taken off much faster than we anticipated!


Aside: @bgorbet, may I ask how you heard about VDS and Hail?


Of course! I am a statistician working at the UT Health School of Public Health. We got some data for a project we are collaborating on, and it came in VDS format along with VCF files that had been created from the VDS. Those VCF files have the genotype fields from the sparse MatrixTable format (LGT, LAD, LPL, indexed by LA). So, I am tasked with learning just enough Hail (my background is actually in computer science, and I have experience with Python, Jupyter Lab, and various cloud platforms) to get this data into a format that is familiar to the rest of the team for downstream analysis. So far, I am very impressed with Hail. If we were able to utilize its scalability on a Spark cluster or cloud platform, we would probably stick with Hail and I wouldn’t need to figure out how to undo the scalable format, but that is not our situation right now.

I also wanted to say that I saw your suggestion on memory allocation when working with Hail on a single machine, and that was very helpful. Thank you!


Thanks so much for the context! Is this project CCDG or AoU? Those are two where we’ve helped with callset generation.

I’m planning on adding a feature that can translate all the “local” fields (LGT, LAD, LPL, etc.) to global in a single step, and make this possible as part of hl.vds.to_dense_mt. I’m hoping to have time to do that this afternoon or tomorrow morning, and will post here when we’ve got a pull request up.

The project is CCDG. That feature would be greatly appreciated! This is behavior I had sort of expected from hl.vds.to_dense_mt anyway, so that would be a great home for it!

Hey!

Just an update for future users:

Hail now ships with local_to_global, which can convert local A-, R-, and G-indexed arrays into global A-, R-, and G-indexed arrays.
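Here is a minimal end-to-end sketch of how this fits together, assuming the signature local_to_global(array, local_alleles, n_alleles, fill_value, number). The paths and the chromosome are placeholders, and the fill values are judgment calls: 0 for AD, and a large placeholder PL (999) for genotypes involving a non-local allele, with hl.missing(hl.tint32) as another reasonable choice:

import hail as hl

vds = hl.vds.read_vds('path/to/dataset.vds')       # placeholder path
vds = hl.vds.filter_chromosomes(vds, keep='chr1')  # one project VCF per chromosome
mt = hl.vds.to_dense_mt(vds)

mt = mt.annotate_entries(
    GT=hl.vds.lgt_to_gt(mt.LGT, mt.LA),
    AD=hl.vds.local_to_global(
        mt.LAD, mt.LA, n_alleles=hl.len(mt.alleles), fill_value=0, number='R'
    ),
    PL=hl.vds.local_to_global(
        mt.LPL, mt.LA, n_alleles=hl.len(mt.alleles), fill_value=999, number='G'
    ),
)
mt = mt.drop('LGT', 'LA', 'LAD', 'LPL')  # keep only the global fields
hl.export_vcf(mt, 'chr1.vcf.bgz')        # placeholder output path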