Variant Quality Score Recalibration

A Hail version of the GATK VQSR step (Variant Quality Score Recalibration) would be very useful to help scale QC. The variant filter would then be based on a Gaussian mixture model rather than on fixed cutoffs for the various QC metrics.

There is a Spark Gaussian mixture module available to help implement this:
https://spark.apache.org/docs/latest/mllib-clustering.html#gaussian-mixture
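
To make the idea concrete, here is a minimal pure-Python sketch of filtering on a Gaussian mixture model instead of a fixed cutoff. It is a toy under stated assumptions, not GATK's VQSR and not Hail or Spark code: it uses a single made-up QC metric with simulated values, and the names `fit_gmm_1d`, `log_likelihood`, `passes`, and the 1% threshold are all hypothetical choices for illustration. The real VQSR models several annotations jointly and does more than this.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a 1-D normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_gmm_1d(values, k=2, iters=50):
    """Fit a k-component 1-D Gaussian mixture with plain EM (toy version)."""
    xs = sorted(values)
    n = len(xs)
    # Deterministic start: means at evenly spaced quantiles, shared variance.
    means = [xs[int((j + 0.5) * n / k)] for j in range(k)]
    overall = sum(xs) / n
    var0 = max(sum((x - overall) ** 2 for x in xs) / n, 1e-6)
    variances = [var0] * k
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in xs:
            dens = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means, and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp) + 1e-12
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            spread = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(spread, 1e-6)  # floor avoids a collapsed component
    return weights, means, variances


def log_likelihood(x, model):
    """Log density of x under the fitted mixture."""
    weights, means, variances = model
    dens = sum(w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances))
    return math.log(max(dens, 1e-300))


# Simulated metric values for "trusted" variants (an assumption, not real data).
random.seed(0)
trusted = [random.gauss(60.0, 5.0) for _ in range(500)]
model = fit_gmm_1d(trusted)

# Pick the score threshold so that about 99% of trusted variants are retained.
trusted_scores = sorted(log_likelihood(x, model) for x in trusted)
threshold = trusted_scores[int(0.01 * len(trusted_scores))]


def passes(x):
    """Keep a variant if its metric looks at least as typical as the threshold."""
    return log_likelihood(x, model) >= threshold


candidates = [62.0, 58.5, 20.0, 95.0]
kept = [x for x in candidates if passes(x)]
print(kept)
```

The point of the sketch is that the cutoff is learned from the distribution of trusted variants rather than set by hand; with Spark's GMM the fitting step would be replaced by the library's model and the metric would become a feature vector.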

Hi jjfarrell,

I have a branch with an implementation of random forests for Hail that we have been using for filtering variants in ExAC / gnomAD, and it has worked very well! Now that Hail can export variant features to DataFrames (VDS -> KeyTable -> DataFrame), it should be very easy to use PySpark to access MLlib directly. I am planning to rewrite the random forests I am using to access MLlib this way and will post that code once done. Using the MLlib GMM should be equally easy. Please comment / post code if you experiment with MLlib too!

Cheers,
Laurent

Hi @jjfarrell,

I came across the gnomAD v3 blog post and was trying to find the page about the allele-specific version of GATK Variant Quality Score Recalibration (VQSR). I searched for those keywords and found your post here.
I cannot open the link provided in the blog post (the link above). Do you or others know where I can find a VQSR function, or a way to do this, in Hail?
Thanks for the help!