I have been using Hail to do WGS QC on a very large cohort (100K samples). Because the sample size is larger than in previous Freezes, the upstream joint-calling pipeline has changed in many ways that I was not aware of: for example, AD was missing only for hom-ref calls, PL was sometimes missing for hom-ref calls, and PL was missing for multi-allelic variants. These unexpected issues came to light one at a time, after we noticed something wrong in our QC process and spent weeks or even months tracking down the cause, only to find that a field was missing.
I am wondering how I can use Hail to check efficiently for missingness of the genotype (./.), ADs, GQs, and PLs, and to figure out under which circumstances they are missing. This is critical for me to do my job well; otherwise there will always be another surprise waiting for me.
This is a good question and not one we can give a one-size-fits-all answer to. Can you give a couple of concrete examples of things you want to compute?
The code will often take a form like this:
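import hail as hl
# mt is your matrix table; GT, GQ, and PL are assumed to be entry fields
results = mt.aggregate_entries(hl.struct(
    # summary statistics of GQ among entries whose genotype call is missing
    gq_stats_for_missing_gt=hl.agg.filter(hl.is_missing(mt.GT), hl.agg.stats(mt.GQ)),
    # fraction of hom-ref entries whose PL is missing
    pl_missing_distribution_hom_ref=hl.agg.filter(mt.GT.is_hom_ref(), hl.agg.fraction(hl.is_missing(mt.PL))),
))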
Thanks very much, Tim. It is actually a very embarrassing issue.
I need to find these accidental NAs before I can work out a solution, but at the moment I don’t know what the data look like.
Immediately after our meeting a long time ago, I began working on QC of the large-scale WGS data, and suddenly my QC results went wrong. It took months of hard work before I found that some variants have the AD genotype field missing for hom-ref calls; all of the hom-ref ADs are NA in the VCF file.
Then, a few months later, I found that PL is also missing for all hom-ref calls.
And then, quite a few months after that, I found that some PL fields are missing elsewhere as well: according to my upstream joint-calling collaborators, PL is missing for multi-allelic variants.
This kind of random, sudden missingness in the joint-calling results is very likely to happen again, and we will not be informed of it. Therefore, each time I receive new data, I need a complete screening and overview of the missingness and NAs in the genotype data.
Now I would like to explore how this sudden missingness in AD, PL, and GQ is distributed.
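One way to make that screening routine for every new release is to compare the missing fractions from the new data against those from a release you already trust, and flag any group that got worse. The sketch below is plain Python and makes several assumptions of my own: the helper name `flag_unexpected_missingness`, the `(group, field)` tuple keys, and the default tolerance of 0.01 are all hypothetical, and the input dicts are whatever summary you have collected from your Hail aggregations.

```python
def flag_unexpected_missingness(current, baseline, tolerance=0.01):
    """Return (key, baseline_fraction, current_fraction) for every key whose
    missing fraction rose by more than `tolerance` relative to the baseline.

    Both arguments map a key such as ('hom_ref', 'AD') to a missing
    fraction between 0 and 1. Keys absent from the baseline are treated
    as having had no missingness before.
    """
    flagged = []
    for key in sorted(current):
        previous = baseline.get(key, 0.0)
        if current[key] - previous > tolerance:
            flagged.append((key, previous, current[key]))
    return flagged


# Illustrative numbers only, not taken from any real dataset.
baseline = {('hom_ref', 'AD'): 0.0, ('het', 'PL'): 0.0}
current = {('hom_ref', 'AD'): 1.0, ('het', 'PL'): 0.0, ('hom_ref', 'GQ'): 0.002}

for key, before, after in flag_unexpected_missingness(current, baseline):
    print(f"{key}: missing fraction went from {before} to {after}")
```

Running a check like this as the first step on each new release would surface a newly missing field before it propagates into downstream QC.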