Identify problems of AD, PL, GQ effectively

zhouhufeng · January 31, 2021, 12:50pm

I have been using Hail to do WGS QC on a really large sample set (100K). Because the sample size is larger than in previous freezes, there have been many changes to the upstream joint calling pipeline that I was not aware of, e.g. AD was missing only for hom-refs, PL was sometimes missing for hom-refs, and PL was missing for multi-allelic variants. These unexpected issues came to light from time to time: we would notice something wrong in our QC process, spend weeks or even months checking what was wrong, and then find that something was missing.
I am wondering how I can use Hail to effectively check the missingness of the genotype (./.), AD, GQ, and PL, and figure out under which circumstances they are missing. This is critical for me to do my job well; otherwise there will always be some surprise waiting for me.

tpoterba · February 3, 2021, 10:50pm

This is a good question and not one we can give a one-size-fits-all answer to. Can you give a couple of concrete examples of things you want to compute?

The code will often take a form like this:


import hail as hl

# mt is your matrix table
results = mt.aggregate_entries(hl.struct(
    # summary statistics of GQ over entries whose GT is missing
    gq_stats_for_missing_gt = hl.agg.filter(hl.is_missing(mt.GT), hl.agg.stats(mt.GQ)),
    # fraction of hom-ref entries whose PL is missing
    pl_missing_distribution_hom_ref = hl.agg.filter(mt.GT.is_hom_ref(), hl.agg.fraction(hl.is_missing(mt.PL)))
))

zhouhufeng · February 8, 2021, 7:48pm

Thanks very much Tim, it is actually a very embarrassing issue.
I have to find those accidental NAs before I can work out a solution, and currently I don’t know what the data look like.
Right after our meeting a long time ago, I began working on QC of the large-scale WGS data, and suddenly my QC results went wrong. It took me months of hard work to find that some of the variants have the AD field missing for hom-ref genotypes; all hom-ref ADs are NA in the VCF file.
Then a few months later, I found that PL is also missing for all hom-refs.
And then, quite a few months after that, I found some more PL fields missing; according to my upstream joint calling collaborators, PL is missing for multi-allelic variants.
This kind of random and sudden missingness in joint calling results is very likely to happen again, and we will not be informed of it. Therefore, each time I get new data, I need a complete screening and overview of the missingness and NAs in the genotype data.
Now I wish to explore how these sudden missing values and NAs of AD, PL, and GQ are distributed.
