Hl.variant_qc and the FT field

Does Variant_qc() takes into account FT field? e.g. do not count GT that didn’t pass FT in AN and AF calculations?

1 Like

hl.variant_qc does not consider the FT field. We should really make hl.variant_qc into an aggregator rather than a method so that you can provide any hl.agg.filter you like.

I think the currently recommend approach depends on what “removed” means to you. If you want to pretend those genotypes were never even sampled, use filter_entries:

mt = mt.filter_entries(hl.len(mt.FT) > 0, keep=False)
mt = hl.variant_qc(mt)

In particular, the denominator used for means will exclude filtered entries.

If you instead want to treat them as sampled but with an unknown value, set all the entry fields to missing:

mt = mt.annotate_entries(**{
    field: hl.missing(mt[field].dtype)
    for field in ['GT', 'AD', ...]
})

EDIT: Removed typo in which I suggested hl.vds.variant_qc exists.

Thank you Dan for the details and the suggestions. Indeed making it an aggregator would be useful.

As a followup, if my goal is to keep the GTs that are annotated as Failed in FT but not consider them in the calculations of variant_qc, then neither of the aforementioned suggestion applies, correct? I would modify variant_qc to achieve that. Is this a correct assumption?

Thank you again for all the help and support.

1 Like

Yeah, so, this is a really subtle part of variant_qc. Suppose you had this variant with five samples:

    0/0, 0/1, NA, FILTERED, 1/1

Variant QC would say:

  • call_rate is 3/5 (because there are 5 samples in your dataset, but at this site, only three of those five samples have a well-defined genotype).
  • n_not_called is 1. Because only one entry is NA.
  • n_filtered is 1. One entry has been filtered with filter_entries.
  • HWE p-values, homozygote counts, AC, AN, etc. are all based on the genotypes which are neither missing nor filtered. The AF will use as its denominator 3 because there are three non-missing, non-filtered entries.
  • gq_stats.mean likewise uses 3 as the denominator.
  • dp_stats.mean likewise uses 3 as the denominator.

I think the easiest thing to do is this:

mt = mt.annotate_entries(GT_copy = mt.GT)
mt = mt.annotate_entries(GT = hl.or_missing(mt.FT.contains("PASS"), mt.GT))
mt = hl.variant_qc(mt.GT)
mt = mt.annotate_entries(GT = mt.GT_copy)
mt = mt.drop('GT_copy')

This should calculate variant QC on only those entries that passed without affecting the GT entries themselves downstream.

You will need to define your own n_filtered if you want that to be:

hl.agg.count_where(mt.FT.contains("PASS"))

Does this answer your question?

EDIT: Confused myself a couple times, but this should now be correct. For future travelers, here’s a simple MT which demonstrates that neither missing nor filtered values are included in aggregations:

In [16]: mt = hl.utils.range_matrix_table(3,3)
    ...: mt = mt.annotate_entries(x = (mt.row_idx + 1) * (mt.col_idx + 1))
    ...: mt = mt.filter_entries(mt.x != 4)
    ...: mt = mt.annotate_entries(x = hl.or_missing(mt.x != 6, mt.x))
    ...: mt.show()
    ...: mt = mt.annotate_rows(mean = hl.agg.mean(mt.x))
    ...: mt.mean.show()
+---------+-------+-------+-------+
| row_idx |   0.x |   1.x |   2.x |
+---------+-------+-------+-------+
|   int32 | int32 | int32 | int32 |
+---------+-------+-------+-------+
|       0 |     1 |     2 |     3 |
|       1 |     2 |    NA |    NA |
|       2 |     3 |    NA |     9 |
+---------+-------+-------+-------+
+---------+----------+
| row_idx |     mean |
+---------+----------+
|   int32 |  float64 |
+---------+----------+
|       0 | 2.00e+00 |
|       1 | 2.00e+00 |
|       2 | 6.00e+00 |
+---------+----------+
1 Like

Thank you, Dan!