Does Variant_qc() takes into account FT field? e.g. do not count GT that didn’t pass FT in AN and AF calculations?
hl.variant_qc
does not consider the FT
field. We should really make hl.variant_qc
into an aggregator rather than a method so that you can provide any hl.agg.filter
you like.
I think the currently recommend approach depends on what “removed” means to you. If you want to pretend those genotypes were never even sampled, use filter_entries
:
mt = mt.filter_entries(hl.len(mt.FT) > 0, keep=False)
mt = hl.variant_qc(mt)
In particular, the denominator used for means will exclude filtered entries.
If you instead want to treat them as sampled but with an unknown value, set all the entry fields to missing:
mt = mt.annotate_entries(**{
field: hl.missing(mt[field].dtype)
for field in ['GT', 'AD', ...]
})
EDIT: Removed typo in which I suggested hl.vds.variant_qc
exists.
Thank you Dan for the details and the suggestions. Indeed making it an aggregator would be useful.
As a followup, if my goal is to keep the GTs that are annotated as Failed in FT but not consider them in the calculations of variant_qc, then neither of the aforementioned suggestion applies, correct? I would modify variant_qc to achieve that. Is this a correct assumption?
Thank you again for all the help and support.
Yeah, so, this is a really subtle part of variant_qc
. Suppose you had this variant with five samples:
0/0, 0/1, NA, FILTERED, 1/1
Variant QC would say:
call_rate
is 3/5 (because there are 5 samples in your dataset, but at this site, only three of those five samples have a well-defined genotype).n_not_called
is 1. Because only one entry isNA
.n_filtered
is 1. One entry has been filtered withfilter_entries
.- HWE p-values, homozygote counts, AC, AN, etc. are all based on the genotypes which are neither missing nor filtered. The AF will use as its denominator 3 because there are three non-missing, non-filtered entries.
gq_stats.mean
likewise uses 3 as the denominator.dp_stats.mean
likewise uses 3 as the denominator.
I think the easiest thing to do is this:
mt = mt.annotate_entries(GT_copy = mt.GT)
mt = mt.annotate_entries(GT = hl.or_missing(mt.FT.contains("PASS"), mt.GT))
mt = hl.variant_qc(mt.GT)
mt = mt.annotate_entries(GT = mt.GT_copy)
mt = mt.drop('GT_copy')
This should calculate variant QC on only those entries that passed without affecting the GT entries themselves downstream.
You will need to define your own n_filtered
if you want that to be:
hl.agg.count_where(mt.FT.contains("PASS"))
Does this answer your question?
EDIT: Confused myself a couple times, but this should now be correct. For future travelers, here’s a simple MT which demonstrates that neither missing nor filtered values are included in aggregations:
In [16]: mt = hl.utils.range_matrix_table(3,3)
...: mt = mt.annotate_entries(x = (mt.row_idx + 1) * (mt.col_idx + 1))
...: mt = mt.filter_entries(mt.x != 4)
...: mt = mt.annotate_entries(x = hl.or_missing(mt.x != 6, mt.x))
...: mt.show()
...: mt = mt.annotate_rows(mean = hl.agg.mean(mt.x))
...: mt.mean.show()
+---------+-------+-------+-------+
| row_idx | 0.x | 1.x | 2.x |
+---------+-------+-------+-------+
| int32 | int32 | int32 | int32 |
+---------+-------+-------+-------+
| 0 | 1 | 2 | 3 |
| 1 | 2 | NA | NA |
| 2 | 3 | NA | 9 |
+---------+-------+-------+-------+
+---------+----------+
| row_idx | mean |
+---------+----------+
| int32 | float64 |
+---------+----------+
| 0 | 2.00e+00 |
| 1 | 2.00e+00 |
| 2 | 6.00e+00 |
+---------+----------+
Thank you, Dan!