Variant Quality Score Recalibration


#1

A Hail version of the GATK VQSR (Variant Quality Score Recalibration) step would be very useful for scaling QC. The variant filter would then be based on a Gaussian mixture model fitted to the QC metrics rather than on fixed cutoffs for each metric.

Spark MLlib has a Gaussian mixture model implementation that could help implement this:
https://spark.apache.org/docs/latest/mllib-clustering.html#gaussian-mixture
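
To make the "mixture model instead of fixed cutoffs" idea concrete, here is a minimal sketch in plain Python (not Spark, and not GATK's actual VQSR procedure). It fits a two-component, one-dimensional Gaussian mixture with EM to a simulated QC metric and scores each value by its posterior probability of belonging to the higher-mean component. The metric values, the function names, and the assumption that the higher-mean component is the "good" one are all illustrative; real VQSR trains on known truth sites and uses several annotations at once.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a 1-D normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_two_component_gmm(values, n_iter=50):
    """Fit a 1-D, two-component Gaussian mixture with plain EM."""
    ordered = sorted(values)
    half = len(ordered) // 2
    # Initialise the two means from the lower and upper halves of the data.
    means = [sum(ordered[:half]) / half,
             sum(ordered[half:]) / (len(ordered) - half)]
    overall_mean = sum(values) / len(values)
    overall_var = sum((v - overall_mean) ** 2 for v in values) / len(values)
    variances = [overall_var, overall_var]
    weights = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: responsibility of each component for each value.
        resp = []
        for x in values:
            p = [weights[k] * normal_pdf(x, means[k], variances[k]) for k in range(2)]
            total = sum(p)
            resp.append([pk / total for pk in p])
        # M-step: re-estimate weights, means and variances.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(values)
            means[k] = sum(r[k] * x for r, x in zip(resp, values)) / nk
            var_k = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, values)) / nk
            variances[k] = max(var_k, 1e-6)  # floor to avoid a collapsed component
    return weights, means, variances


def posterior_good(x, weights, means, variances):
    """Posterior probability that x belongs to the higher-mean component.

    Treating the higher-mean component as "good" is an assumption of this sketch.
    """
    good = 0 if means[0] > means[1] else 1
    p = [weights[k] * normal_pdf(x, means[k], variances[k]) for k in range(2)]
    return p[good] / sum(p)


# Simulated QC metric: 200 "good" variants and 100 artifacts (made-up numbers).
rng = random.Random(0)
qd_values = [rng.gauss(20.0, 3.0) for _ in range(200)]
qd_values += [rng.gauss(5.0, 2.0) for _ in range(100)]

weights, means, variances = fit_two_component_gmm(qd_values)
for x in (3.0, 12.0, 22.0):
    print(x, round(posterior_good(x, weights, means, variances), 3))
```

A filter could then keep variants whose posterior exceeds some threshold, so the decision boundary comes from the fitted distributions rather than from a hand-picked cutoff on the metric itself.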


#2

Hi jjfarrell,

I have a branch with an implementation of random forests for Hail that we have been using to filter variants in ExAC / gnomAD, and it has worked very well! Now that Hail can export variant features to DataFrames (VDS -> KeyTable -> DataFrame), it should be very easy to access MLlib directly from PySpark. I am planning to rewrite my random forest code to access MLlib this way and will post that code once it is done. Using the MLlib GMM should be equally easy. Please comment / post code if you experiment with MLlib too!

Cheers,
Laurent