Entry filtering semantics question: allele read balance

I am having different results using two different approaches for Allele read balance

Approach 1:

ab = mt_sqc2.AD[1] / hl.sum(mt_sqc2.AD)

filter_condition_ab = ((mt_sqc2.GT.is_hom_ref() & (ab <= 0.1)) |
                        (mt_sqc2.GT.is_het() & (ab >= 0.25) & (ab <= 0.75)) |
                        (mt_sqc2.GT.is_hom_var() & (ab >= 0.9)))

fraction_filtered = mt_sqc2.aggregate_entries(hl.agg.fraction(~filter_condition_ab))
print(f'Filtering {fraction_filtered * 100:.2f}% entries out of downstream analysis.')
mt_sqc2 = mt_sqc2.filter_entries(filter_condition_ab)

Result - Filtering 0.36% entries out of downstream analysis.

Approach 2:

ab = mt_sqc2.AD[1] / hl.sum(mt_sqc2.AD)

filter_condition_ab = (
    hl.case(missing_false=True)
    .when(mt_sqc2.GT.is_hom_ref(), ab <= 0.1)
    .when(mt_sqc2.GT.is_het(), (ab >= 0.25) & (ab <= 0.75))
    .when(mt_sqc2.GT.is_hom_var(), ab >= 0.9)
    .default(True) # keep everything else
)


fraction_filtered = mt_sqc2.aggregate_entries(hl.agg.fraction(~filter_condition_ab))
print(f'Filtering {fraction_filtered * 100:.2f}% entries out of downstream analysis.')
mt_sqc2 = mt_sqc2.filter_entries(filter_condition_ab)

Result - Filtering 0.26% entries out of downstream analysis.

Note that 0.58% missing data is already present in the original MatrixTable.

I’ve spent a lot of time grappling with this, here are some functions I’ve written that work- you have to be quite careful to deal with missingness properly, as filter conditions where you have a boolean results True/False/Missing will either filter OUT the missing entries/rows if you set keep=False, or keep them IN if you set keep=True.

Note: if you are interested in the overall allelic balance in hets across samples in a variant (agg across rows), you’d have to annotate that FIRST before running the function below. I have some functions for that too and can post if you’re interested.

def filter_genotypes_final(mt, min_het_ref_reads=0.2, max_het_ref_reads=0.8,
                       min_hom_ref_ref_reads=0.9, max_hom_alt_ref_reads=0.1):
'''
Filters genotypes for final variant QC. Filters out genotypes with poor allelic balance, defined as the percentage
of ref reads that should correspond to that genotype. For hets, percent of ref reads should be between 20-80%,
for hom ref they should be at least 90%, and for hom alt they should be maximum 10%.

:param mt: matrix table to annotat
:param min_het_ref_reads: minimum allowed fraction of ref reads for het genotypes (default 0.2)
:param max_het_ref_reads: maximum allowed fraction of ref reads for het genotypes (default 0.8)
:param min_hom_ref_ref_reads: minimum allowed fraction of ref reads for hom ref genotypes (default 0.9)
:param max_hom_alt_ref_reads: maximum allowed fraction of ref reads for hom alt genotypes (default 0.1)
:return:
'''
logging.info('\nFiltering out genotypes on het ref reads > ' + str(max_het_ref_reads) +
             ', \nor het ref reads < ' + str(min_het_ref_reads) +
             ', \nor hom_ref ref reads < ' + str(min_hom_ref_ref_reads) +
             ', \nor hom_alt ref reads > ' + str(max_hom_alt_ref_reads))

mt = mt.annotate_globals(genotype_qc_thresholds_final={'max_het_ref_reads': max_het_ref_reads,
                                                       'min_het_ref_reads': min_het_ref_reads,
                                                       'min_hom_ref_ref_reads': min_hom_ref_ref_reads,
                                                       'max_hom_alt_ref_reads': max_hom_alt_ref_reads})
# Get starting genotypes count
gtstart = mt.aggregate_entries(hl.agg.count_where(hl.is_defined(mt.GT)))
logging.info('Starting count for number of genotypes: ' + str(gtstart))

gthet = mt.aggregate_entries(hl.agg.count_where(hl.is_defined(mt.GT) & mt.GT.is_het()))
gthomvar = mt.aggregate_entries(hl.agg.count_where(hl.is_defined(mt.GT) & mt.GT.is_hom_var()))
gthomref = mt.aggregate_entries(hl.agg.count_where(hl.is_defined(mt.GT) & mt.GT.is_hom_ref()))

# GENOTYPE FILTERING NEEDS TO BE FILTERING OUT!! or else you filter out everything that's not a het
# in the first step!!! (keep = FALSE)

# Filter het genotypes on allelic balance
mt = mt.filter_entries(mt.GT.is_het() & (((mt.AD[0] / hl.sum(mt.AD)) < min_het_ref_reads) |
                                         ((mt.AD[0] / hl.sum(mt.AD)) > max_het_ref_reads)), keep=False)
gtcount1 = mt.aggregate_entries(hl.agg.count_where(hl.is_defined(mt.GT)))
het_excluded = gtstart - gtcount1

# Filter hom ref genotypes on allelic balance
mt = mt.filter_entries(mt.GT.is_hom_ref() & (((mt.AD[0] / hl.sum(mt.AD)) < min_hom_ref_ref_reads)), keep=False)
gtcount2 = mt.aggregate_entries(hl.agg.count_where(hl.is_defined(mt.GT)))
hom_ref_excluded = gtcount1 - gtcount2

# Filter hom var genotypes on allelic balance
mt = mt.filter_entries(mt.GT.is_hom_var() & (((mt.AD[0] / hl.sum(mt.AD)) > max_hom_alt_ref_reads)), keep=False)
gtcount3 = mt.aggregate_entries(hl.agg.count_where(hl.is_defined(mt.GT)))
hom_var_excluded = gtcount2 - gtcount3

logging.info("Number of het GTs excluded: "+str(het_excluded) + "(" + str(het_excluded/gthet) + "%)")
logging.info("Number of hom ref GTs excluded: "+str(hom_ref_excluded) + "(" + str(hom_ref_excluded/gthomref*100) +
             "%)")
logging.info("Number of hom alt GTs excluded: "+str(hom_var_excluded) + "(" + str(hom_var_excluded/gthomvar*100) +
             "%)")

mt = mt.annotate_globals(genotype_qc_exclusions={'het_excluded': het_excluded,
                                                 'hom_ref_excluded': hom_ref_excluded,
                                                 'hom_var_excluded': hom_var_excluded})

logging.info('Unfiltering entries (setting filtered entries to missing) after genotype QC.')
mt = mt.unfilter_entries()

return mt

Not sure if this 100% answers your question, but the function above should do the same thing you’re trying to do.

Thanks a lot @lea-urpa for sharing the code. It was really helpful.
I am able to replicate/reproduce same result as your script does by below mention code

mt = mt.filter_entries(hl.is_defined(mt.GT)) ## This I used to get rid of initial missing genotypes.

ab = mt.AD[1] / hl.sum(mt.AD)

filter_condition_ab = (
    hl.case(missing_false=True)
    .when(mt.GT.is_hom_ref(), ab > 0.1)
    .when(mt.GT.is_het(), (ab < 0.25) | (ab > 0.75))
    .when(mt.GT.is_hom_var(), ab < 0.9)
    .default(False) # remove everything else
)

mt = mt.filter_entries(filter_condition_ab, keep = False)

all_geno =  mt.aggregate_entries(hl.agg.fraction(hl.is_defined(mt.GT)))
print(f'All genotype entries: {all_geno * 100:.2f}%.') ## Result is 100%

The only difference is in my case, I removed the initial missing genotypes (first line) at the beginning. Did you want to keep them throughout ?

print(f'All genotype entries: {all_geno * 100:.2f}%.') ## Result should be <100% 

I really appreciate your help. Thank you very much.
––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
TO THE HAIL TEAM:

ab = mt.AD[1] / hl.sum(mt.AD)

filter_condition_ab = (
    hl.case(missing_false=True)
    .when(mt.GT.is_hom_ref(), ab > 0.1)
    .when(mt.GT.is_het(), (ab < 0.25) | (ab > 0.75))
    .when(mt.GT.is_hom_var(), ab < 0.9)
    .default(False) # remove everything else
)
mtA = mt.filter_entries(filter_condition_ab, keep = False)
ab = mt.AD[1] / hl.sum(mt.AD)
filter_condition_ab = (
    hl.case(missing_false=True)
    .when(mt.GT.is_hom_ref(), ab <= 0.1)
    .when(mt.GT.is_het(), (ab >= 0.25) & (ab <= 0.75))
    .when(mt.GT.is_hom_var(), ab >= 0.9)
    .default(True) ## keep everything else
)
mtB = mt.filter_entries(filter_condition_ab)

In my case, the number of total genotypes for mtA and mtB are different. I am a bit puzzled here to understand that.

Total genotypes calculated using -

Total_geno1 = mtA.aggregate_entries(hl.agg.count_where(hl.is_defined(mtA.GT)))
print ('Total genotypes mtA: ' + str(Total_geno1))

Total_geno2 = mtB.aggregate_entries(hl.agg.count_where(hl.is_defined(mtB.GT)))
print ('Total genotypes mtB: ' + str(Total_geno2))

Not sure how the initial missing genotypes play a role here, but anyway when I calculated total genotyes I used hl.is_defined(mt.GT), which should exclude missing data.

I probably still won’t have a full answer until after ASHG (or at least our workshop) but one of the important subtleties here is that there is a difference between a missing GT and a filtered entry.

A missing GT can be interrogated for using hl.is_defined and such, but a filtered entry doesn’t participate in aggregations. This computation:

Total_geno1 = mtA.aggregate_entries(hl.agg.count_where(hl.is_defined(mtA.GT)))

Will include the filtered entries neither in the numerator or denominator of the fraction.

Thanks Tim. To interrogate missing GT using hl.is_defined, I used below mentioned code.

defined_geno =  mt.aggregate_entries(hl.agg.fraction(hl.is_defined(mt.GT)))
print(f' Defined genotypes: {defined_geno * 100:.2f}%.')
non_defined = 100-(defined_geno * 100)
print(f' Non-defined genotypes: {non_defined:.2f}%.')

Result-
Defined genotypes: 99.41%.
Non-defined genotypes: 0.59%.

I thought 0.59% genotypes are missing GT in the mt. Is that correct ?

Any update on this @tpoterba?

will dig in later today

1 Like

I think this is best illustrated through an example – I’ll write that on my train commute tomorrow.

Take this matrix table:

In [4]: mt.show()
+---------+------+------+------+
| row_idx | 0.GT | 1.GT | 2.GT |
+---------+------+------+------+
|   int32 | call | call | call |
+---------+------+------+------+
|       0 | 0/0  | 0/1  | 1/1  |
|       1 | 0/1  | 1/1  | 0/0  |
|       2 | 1/1  | 0/0  | 0/1  |
+---------+------+------+------+

In [5]: mt.entries().show()
+---------+---------+------+
| row_idx | col_idx | GT   |
+---------+---------+------+
|   int32 |   int32 | call |
+---------+---------+------+
|       0 |       0 | 0/0  |
|       0 |       1 | 0/1  |
|       0 |       2 | 1/1  |
|       1 |       0 | 0/1  |
|       1 |       1 | 1/1  |
|       1 |       2 | 0/0  |
|       2 |       0 | 1/1  |
|       2 |       1 | 0/0  |
|       2 |       2 | 0/1  |
+---------+---------+------+

If I filter entries as so:

In [6]: mt = mt.filter_entries(~mt.GT.is_hom_var())

Then it looks like I have set GTs to missing in show():

+---------+------+------+------+
| row_idx | 0.GT | 1.GT | 2.GT |
+---------+------+------+------+
|   int32 | call | call | call |
+---------+------+------+------+
|       0 | 0/0  | 0/1  | NA   |
|       1 | 0/1  | NA   | 0/0  |
|       2 | NA   | 0/0  | 0/1  |
+---------+------+------+------+

However, what I’ve really done is remove those entries entirely:

In [8]: mt.entries().show()
+---------+---------+------+
| row_idx | col_idx | GT   |
+---------+---------+------+
|   int32 |   int32 | call |
+---------+---------+------+
|       0 |       0 | 0/0  |
|       0 |       1 | 0/1  |
|       1 |       0 | 0/1  |
|       1 |       2 | 0/0  |
|       2 |       1 | 0/0  |
|       2 |       2 | 0/1  |
+---------+---------+------+

So when I aggregate to compute the fraction defined, I get 6/6 = 1.0:

In [9]: mt.aggregate_entries(hl.agg.fraction(hl.is_defined(mt.GT)))
Out[9]: 1.0

I think this is what’s going on in your case, in a bit more of a subtle way.

Thanks a lot @tpoterba. Much appreciated !

How do you interrogate missing data ? Imagine you have genotype call as ./. .
So if you have to remove those, would you use -
mt= mt.filter_entries(hl.is_defined(mt.GT)) ?
Because if you run
mt.aggregate_entries(hl.agg.fraction(hl.is_defined(mt.GT))) at the beginning, it doesn’t give 1.0 (at least for my case)

In my case, I was more confused why my mtA and mtB giving me different results. They should be ideally provide same answer.

yes, exactly. There’s a difference between a missing GT and a filtered entry.

If you run mt= mt.filter_entries(hl.is_defined(mt.GT)), then the aggregate fraction you listed should return 1.0 afterwards.

1 Like

Thank you Tim for the clarification.