Filtering MatrixTables where column values do not match

Hi all, I’m relatively new to Hail and am having some difficulty filtering a MatrixTable by rows. I’ve read in a vcf and done some basic QC, giving me a MatrixTable I’ve named mts. I have 3 samples, and I need to remove all rows where GT is not 0/0 for all samples. When I run I can clearly see the values for each sample.

Could someone point me in the right direction for filtering these rows?

The MatrixTable interface intentionally doesn’t let you query single entries by column value, but instead forces you to write your pipeline in terms of computations applied to all entry values as aggregations.

Here are two ways to remove sites that are 0/0 in every sample:

Using aggregators:

mts = mts.filter_rows(hl.agg.all(mts.GT.is_hom_ref()), keep=False)

Using variant_qc (which uses aggregators in its implementation):

mts = hl.variant_qc(mts)
mts = mts.filter_rows(mts.variant_qc.AC[0] < mts.variant_qc.AN)  # keep sites with at least one called non-reference allele

Is there a way to do the opposite, i.e. remove all sites that are NOT 0/0? I’ve tried using is_het_ref() and mts.GT.is_hom_ref() == False, but both seem to give me the same output as mts = mts.filter_rows(hl.agg.all(mts.GT.is_hom_ref()), keep=False).

f = mts.filter_rows(mts.variant_qc.AC[0] == 6) seems to work, since with 3 diploid samples AC[0] == 6 means all six alleles are the reference.

This is pretty tailored to your example. I think I might do something like:

mts = mts.filter_rows(hl.agg.all(mts.GT.is_non_ref()), keep=False)

is_non_ref returns True if the call has at least one non-reference allele.
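
To make the difference between aggregating with "all" and with "any" concrete, here is a small plain-Python model of a row filter. It is not Hail code: the site names, the tuple encoding of calls, and the filter_rows helper are made up for illustration, and it only mimics what I understand hl.agg.all and hl.agg.any to compute per row.

```python
# Toy model, not Hail: each row is a list of calls, one per sample,
# and each call is a tuple of allele indices (0 = reference).
rows = {
    "site1": [(0, 0), (0, 0), (0, 0)],  # 0/0 in every sample
    "site2": [(0, 0), (0, 1), (0, 0)],  # one het sample
    "site3": [(0, 1), (1, 1), (0, 1)],  # non-ref in every sample
}

def is_non_ref(call):
    """True if the call carries at least one non-reference allele."""
    return any(allele != 0 for allele in call)

def filter_rows(rows, predicate, keep=True):
    """Hypothetical helper: keep (or drop) rows where predicate(calls) is True."""
    return {site: calls for site, calls in rows.items() if predicate(calls) == keep}

# Drop rows where *every* sample is non-ref (the "all" form).
remaining_all = filter_rows(rows, lambda calls: all(map(is_non_ref, calls)), keep=False)

# Drop rows where *any* sample is non-ref (the "any" form).
remaining_any = filter_rows(rows, lambda calls: any(map(is_non_ref, calls)), keep=False)

print(sorted(remaining_all))  # ['site1', 'site2']
print(sorted(remaining_any))  # ['site1']
```

In this model only the "any" form leaves exactly the sites that are 0/0 in every sample, which is what AC[0] == 6 selects for three fully called samples. If that is the goal, swapping hl.agg.all for hl.agg.any in the Hail line is probably worth trying.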

I have tried to translate this to filter_cols to remove all genotypes which do not show a variant by using this:

mtf2 = mtf.filter_cols(hl.agg.all(mtf.GT.is_het_non_ref()), keep=False)

which in principle seems to work, as mtf.count() gives (4, 1680) and mtf2.count() gives (4, 517). However, I still have genotypes left that have a combination of 0/0 and NA. So I assume having NA as a genotype is a problem. Is there a way to also remove those?
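
One possible explanation, sketched in plain Python rather than Hail: the filter above only drops a sample when every defined call is het non-ref, so a sample with nothing but 0/0 and NA never satisfies it and stays in. The model below assumes that missing entries are skipped by the aggregators, which is my reading of the Hail docs for hl.agg.all and hl.agg.any. The sample names and the agg_all / agg_any helpers are hypothetical.

```python
# Toy model, not Hail. None stands for a missing (NA) genotype.
# Assumption: like Hail's aggregators, agg_all/agg_any skip missing values.
cols = {
    "s1": [(0, 0), None, (0, 0), (0, 0)],    # only 0/0 and NA
    "s2": [(0, 1), (0, 0), None, (1, 1)],    # carries variants
    "s3": [(1, 2), (1, 2), (1, 2), (1, 2)],  # het non-ref at every site
}

def is_non_ref(call):
    """True if the call has at least one non-reference allele."""
    return any(a != 0 for a in call)

def is_het_non_ref(call):
    """True if the call has two different alleles, neither of them reference."""
    return call[0] != call[1] and all(a != 0 for a in call)

def agg_all(pred, calls):
    """True if pred holds for every defined call (missing calls are skipped)."""
    return all(pred(c) for c in calls if c is not None)

def agg_any(pred, calls):
    """True if pred holds for at least one defined call."""
    return any(pred(c) for c in calls if c is not None)

# The attempt from the post: drop samples where all calls are het non-ref.
attempt = [s for s, calls in cols.items() if not agg_all(is_het_non_ref, calls)]

# Alternative: keep only samples with at least one non-ref call.
with_variant = [s for s, calls in cols.items() if agg_any(is_non_ref, calls)]

print(attempt)       # ['s1', 's2']
print(with_variant)  # ['s2', 's3']
```

In Hail terms the alternative would correspond to something like mtf.filter_cols(hl.agg.any(mtf.GT.is_non_ref())), with the default keep=True. I have not run this against your data, so treat it as a starting point.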