We are trying to use the Hail package to create a GWAS tutorial comparable to one we have made using R and PLINK, in hopes of leveraging Spark and possibly improving computational speed and/or memory consumption (especially for larger datasets). In our quality-control steps so far, we get similar results in R and Hail for variant call rate, minor allele frequency, and sample call rate. However, when looking at heterozygosity to further assess sample quality and calculate a heterozygosity F-statistic cutoff for filtering samples, we find that the sample QC statistics and available methods differ from R.
Our Current Approach in R:
The R snpStats package provides per-sample heterozygosity as an output, which we plotted to choose thresholds. We then manually calculated a heterozygosity/inbreeding F statistic for each sample, using minor allele frequency for expected heterozygosity and the heterozygosity and N-called outputs for observed heterozygosity, and derived an F-statistic cutoff from the heterozygosity distribution to filter samples.
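For concreteness, the manual calculation described above might be sketched in plain Python as follows. The helper name is hypothetical, and it assumes the per-sample heterozygous count and the MAFs of that sample's called variants are already extracted; the formula is the PLINK-style inbreeding coefficient, which algebraically reduces to 1 - observed_het / expected_het:

```python
def inbreeding_f(n_het, mafs):
    """PLINK-style per-sample inbreeding coefficient.

    F = (O_hom - E_hom) / (N - E_hom), which simplifies to
    1 - observed_het / expected_het.

    n_het: observed heterozygous calls for the sample.
    mafs:  minor allele frequencies of the variants called
           in this sample (one entry per called variant).
    """
    # Expected heterozygous calls under Hardy-Weinberg: sum of 2p(1-p).
    expected_het = sum(2.0 * p * (1.0 - p) for p in mafs)
    return 1.0 - n_het / expected_het


# Example: three called variants with MAFs summing to E[het] = 1.0.
mafs = [0.1, 0.2, 0.5]  # 2p(1-p): 0.18 + 0.32 + 0.50 = 1.0
print(inbreeding_f(1, mafs))  # observed == expected -> F = 0.0
print(inbreeding_f(0, mafs))  # no hets -> F = 1.0
```

Samples with F far from 0 (unusually low or high heterozygosity) are the ones flagged for removal.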
In Hail we calculated mean heterozygosity as n_het / n_called, using the fields output by the sample_qc method.
What is the best approach for filtering samples with extreme heterozygosity in Hail?
Currently we are attempting an approach similar to what we did in R: calculate a heterozygosity F-statistic threshold by taking the maximum absolute F statistic among samples whose heterozygosity lies within 2-2.5 standard deviations of the mean. To do this we plan to use annotate_cols with the inbreeding aggregator to calculate the F statistic, and to determine the thresholds from the n_het / n_called fields output by the sample_qc method. Are there fields or statistics in the sample_qc output that we should be using instead?
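For reference, here is a minimal sketch of the thresholding step we are planning, in plain Python. It assumes the per-sample mean heterozygosity (n_het / n_called) and the per-sample F statistics (e.g. the f_stat field of the struct returned by hl.agg.inbreeding) have already been exported from Hail into parallel lists; the function names are hypothetical:

```python
import statistics


def f_stat_threshold(het_rates, f_stats, k=2.5):
    """Max |F| among samples whose heterozygosity lies within
    k standard deviations of the mean heterozygosity.

    het_rates: per-sample mean heterozygosity (n_het / n_called).
    f_stats:   per-sample inbreeding F statistics (parallel list).
    """
    mean = statistics.mean(het_rates)
    sd = statistics.stdev(het_rates)
    inliers = [f for h, f in zip(het_rates, f_stats)
               if abs(h - mean) <= k * sd]
    return max(abs(f) for f in inliers)


def keep_samples(het_rates, f_stats, k=2.5):
    """Indices of samples whose |F| does not exceed the cutoff."""
    cutoff = f_stat_threshold(het_rates, f_stats, k)
    return [i for i, f in enumerate(f_stats) if abs(f) <= cutoff]


# Toy example: four typical samples and one heterozygosity outlier.
het_rates = [0.30, 0.31, 0.29, 0.30, 0.10]
f_stats = [0.01, -0.02, 0.015, -0.005, 0.6]
print(f_stat_threshold(het_rates, f_stats, k=1.0))  # 0.02
print(keep_samples(het_rates, f_stats, k=1.0))      # [0, 1, 2, 3]
```

With real data the cutoff would then be applied back in Hail, e.g. with filter_cols on the annotated F statistic. Note that with very small toy samples (as above) the outlier drags the mean and standard deviation, so a tighter k is needed than the 2-2.5 we plan to use at scale.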