A Hail version of the GATK VQSR step (Variant Quality Score Recalibration) would be very useful to help scale QC. The variant filter would then be based on a Gaussian mixture model rather than on fixed cutoffs for the individual QC metrics.
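To make the contrast with fixed cutoffs concrete, here is a minimal pure-Python sketch: fit a two-component 1-D Gaussian mixture to a single QC metric with EM, then keep a variant when its posterior probability of belonging to the higher-scoring component is large. This is a simplification for illustration, not GATK's actual VQSR procedure and not Hail code. The function names, the QD-like values, and the 0.9 threshold are all made up.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a 1-D normal distribution."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def fit_two_component_gmm(values, iterations=100, min_var=1e-6):
    """Fit a 2-component 1-D Gaussian mixture with plain EM.

    Returns (weights, means, variances). Hypothetical helper for illustration.
    """
    lo, hi = min(values), max(values)
    means = [lo, hi]  # start the components at the extremes
    start_var = max(((hi - lo) / 2.0) ** 2, min_var)
    variances = [start_var, start_var]
    weights = [0.5, 0.5]
    n = len(values)
    for _ in range(iterations):
        # E-step: responsibility of each component for each value
        resp = []
        for x in values:
            joint = [weights[k] * normal_pdf(x, means[k], variances[k]) for k in range(2)]
            total = sum(joint)
            resp.append([j / total for j in joint] if total > 0 else [0.5, 0.5])
        # M-step: re-estimate weight, mean and variance of each component
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk == 0:
                continue
            weights[k] = nk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, values)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, values)) / nk
            variances[k] = max(var, min_var)  # floor to avoid a collapsed component
    return weights, means, variances

def posterior_high(x, model):
    """Posterior probability that x belongs to the component with the larger mean."""
    weights, means, variances = model
    joint = [weights[k] * normal_pdf(x, means[k], variances[k]) for k in range(2)]
    high = 0 if means[0] > means[1] else 1
    return joint[high] / sum(joint)

# Made-up QD-like scores: one low-quality cluster and one high-quality cluster.
qd_scores = [2.0, 3.0, 1.5, 2.5, 4.0, 18.0, 20.0, 22.0, 19.0, 21.0, 23.0, 20.5]
model = fit_two_component_gmm(qd_scores)

# Keep a variant when the model is confident it sits in the high-scoring component.
THRESHOLD = 0.9  # arbitrary, for illustration only
decisions = {x: posterior_high(x, model) >= THRESHOLD for x in qd_scores}
```

The point of the sketch is that the boundary between "keep" and "filter" comes out of the fitted model rather than being chosen by hand for each metric. A real implementation would fit on several metrics jointly.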
I have a branch with an implementation of random forests for Hail that we have been using to filter variants in ExAC / gnomAD, and it has worked very well! Now that Hail can export variant features to DataFrames (VDS -> KeyTable -> DataFrame), it should be very easy to use pyspark to access MLlib directly. I am planning to rewrite my random forest code to access MLlib this way and will post it once done. Using the MLlib GMM should be equally easy. Please comment / post code if you experiment with MLlib too!