Entry filtering semantics question: allele read balance

kousikbioinfo · October 11, 2019, 2:43pm

I am having different results using two different approaches for Allele read balance

Approach 1:

ab = mt_sqc2.AD[1] / hl.sum(mt_sqc2.AD)

filter_condition_ab = ((mt_sqc2.GT.is_hom_ref() & (ab <= 0.1)) |
                        (mt_sqc2.GT.is_het() & (ab >= 0.25) & (ab <= 0.75)) |
                        (mt_sqc2.GT.is_hom_var() & (ab >= 0.9)))

fraction_filtered = mt_sqc2.aggregate_entries(hl.agg.fraction(~filter_condition_ab))
print(f'Filtering {fraction_filtered * 100:.2f}% entries out of downstream analysis.')
mt_sqc2 = mt_sqc2.filter_entries(filter_condition_ab)

Result - Filtering 0.36% entries out of downstream analysis.

Approach 2:

ab = mt_sqc2.AD[1] / hl.sum(mt_sqc2.AD)

filter_condition_ab = (
    hl.case(missing_false=True)
    .when(mt_sqc2.GT.is_hom_ref(), ab <= 0.1)
    .when(mt_sqc2.GT.is_het(), (ab >= 0.25) & (ab <= 0.75))
    .when(mt_sqc2.GT.is_hom_var(), ab >= 0.9)
    .default(True) # keep everything else
)


fraction_filtered = mt_sqc2.aggregate_entries(hl.agg.fraction(~filter_condition_ab))
print(f'Filtering {fraction_filtered * 100:.2f}% entries out of downstream analysis.')
mt_sqc2 = mt_sqc2.filter_entries(filter_condition_ab)

Result - Filtering 0.26% entries out of downstream analysis.

Note that 0.58% missing data is already present in the original MatrixTable.

lea-urpa · October 14, 2019, 8:03am

I’ve spent a lot of time grappling with this, here are some functions I’ve written that work- you have to be quite careful to deal with missingness properly, as filter conditions where you have a boolean results True/False/Missing will either filter OUT the missing entries/rows if you set keep=False, or keep them IN if you set keep=True.

Note: if you are interested in the overall allelic balance in hets across samples in a variant (agg across rows), you’d have to annotate that FIRST before running the function below. I have some functions for that too and can post if you’re interested.

def filter_genotypes_final(mt, min_het_ref_reads=0.2, max_het_ref_reads=0.8,
                       min_hom_ref_ref_reads=0.9, max_hom_alt_ref_reads=0.1):
'''
Filters genotypes for final variant QC. Filters out genotypes with poor allelic balance, defined as the percentage
of ref reads that should correspond to that genotype. For hets, percent of ref reads should be between 20-80%,
for hom ref they should be at least 90%, and for hom alt they should be maximum 10%.

:param mt: matrix table to annotat
:param min_het_ref_reads: minimum allowed fraction of ref reads for het genotypes (default 0.2)
:param max_het_ref_reads: maximum allowed fraction of ref reads for het genotypes (default 0.8)
:param min_hom_ref_ref_reads: minimum allowed fraction of ref reads for hom ref genotypes (default 0.9)
:param max_hom_alt_ref_reads: maximum allowed fraction of ref reads for hom alt genotypes (default 0.1)
:return:
'''
logging.info('\nFiltering out genotypes on het ref reads > ' + str(max_het_ref_reads) +
             ', \nor het ref reads < ' + str(min_het_ref_reads) +
             ', \nor hom_ref ref reads < ' + str(min_hom_ref_ref_reads) +
             ', \nor hom_alt ref reads > ' + str(max_hom_alt_ref_reads))

mt = mt.annotate_globals(genotype_qc_thresholds_final={'max_het_ref_reads': max_het_ref_reads,
                                                       'min_het_ref_reads': min_het_ref_reads,
                                                       'min_hom_ref_ref_reads': min_hom_ref_ref_reads,
                                                       'max_hom_alt_ref_reads': max_hom_alt_ref_reads})
# Get starting genotypes count
gtstart = mt.aggregate_entries(hl.agg.count_where(hl.is_defined(mt.GT)))
logging.info('Starting count for number of genotypes: ' + str(gtstart))

gthet = mt.aggregate_entries(hl.agg.count_where(hl.is_defined(mt.GT) & mt.GT.is_het()))
gthomvar = mt.aggregate_entries(hl.agg.count_where(hl.is_defined(mt.GT) & mt.GT.is_hom_var()))
gthomref = mt.aggregate_entries(hl.agg.count_where(hl.is_defined(mt.GT) & mt.GT.is_hom_ref()))

# GENOTYPE FILTERING NEEDS TO BE FILTERING OUT!! or else you filter out everything that's not a het
# in the first step!!! (keep = FALSE)

# Filter het genotypes on allelic balance
mt = mt.filter_entries(mt.GT.is_het() & (((mt.AD[0] / hl.sum(mt.AD)) < min_het_ref_reads) |
                                         ((mt.AD[0] / hl.sum(mt.AD)) > max_het_ref_reads)), keep=False)
gtcount1 = mt.aggregate_entries(hl.agg.count_where(hl.is_defined(mt.GT)))
het_excluded = gtstart - gtcount1

# Filter hom ref genotypes on allelic balance
mt = mt.filter_entries(mt.GT.is_hom_ref() & (((mt.AD[0] / hl.sum(mt.AD)) < min_hom_ref_ref_reads)), keep=False)
gtcount2 = mt.aggregate_entries(hl.agg.count_where(hl.is_defined(mt.GT)))
hom_ref_excluded = gtcount1 - gtcount2

# Filter hom var genotypes on allelic balance
mt = mt.filter_entries(mt.GT.is_hom_var() & (((mt.AD[0] / hl.sum(mt.AD)) > max_hom_alt_ref_reads)), keep=False)
gtcount3 = mt.aggregate_entries(hl.agg.count_where(hl.is_defined(mt.GT)))
hom_var_excluded = gtcount2 - gtcount3

logging.info("Number of het GTs excluded: "+str(het_excluded) + "(" + str(het_excluded/gthet) + "%)")
logging.info("Number of hom ref GTs excluded: "+str(hom_ref_excluded) + "(" + str(hom_ref_excluded/gthomref*100) +
             "%)")
logging.info("Number of hom alt GTs excluded: "+str(hom_var_excluded) + "(" + str(hom_var_excluded/gthomvar*100) +
             "%)")

mt = mt.annotate_globals(genotype_qc_exclusions={'het_excluded': het_excluded,
                                                 'hom_ref_excluded': hom_ref_excluded,
                                                 'hom_var_excluded': hom_var_excluded})

logging.info('Unfiltering entries (setting filtered entries to missing) after genotype QC.')
mt = mt.unfilter_entries()

return mt

lea-urpa · October 14, 2019, 8:06am

Not sure if this 100% answers your question, but the function above should do the same thing you’re trying to do.

kousikbioinfo · October 15, 2019, 12:17pm

Thanks a lot @lea-urpa for sharing the code. It was really helpful.
I am able to replicate/reproduce same result as your script does by below mention code

mt = mt.filter_entries(hl.is_defined(mt.GT)) ## This I used to get rid of initial missing genotypes.

ab = mt.AD[1] / hl.sum(mt.AD)

filter_condition_ab = (
    hl.case(missing_false=True)
    .when(mt.GT.is_hom_ref(), ab > 0.1)
    .when(mt.GT.is_het(), (ab < 0.25) | (ab > 0.75))
    .when(mt.GT.is_hom_var(), ab < 0.9)
    .default(False) # remove everything else
)

mt = mt.filter_entries(filter_condition_ab, keep = False)

all_geno =  mt.aggregate_entries(hl.agg.fraction(hl.is_defined(mt.GT)))
print(f'All genotype entries: {all_geno * 100:.2f}%.') ## Result is 100%

The only difference is in my case, I removed the initial missing genotypes (first line) at the beginning. Did you want to keep them throughout ?

print(f'All genotype entries: {all_geno * 100:.2f}%.') ## Result should be <100%

I really appreciate your help. Thank you very much.
––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
TO THE HAIL TEAM:

ab = mt.AD[1] / hl.sum(mt.AD)

filter_condition_ab = (
    hl.case(missing_false=True)
    .when(mt.GT.is_hom_ref(), ab > 0.1)
    .when(mt.GT.is_het(), (ab < 0.25) | (ab > 0.75))
    .when(mt.GT.is_hom_var(), ab < 0.9)
    .default(False) # remove everything else
)
mtA = mt.filter_entries(filter_condition_ab, keep = False)

ab = mt.AD[1] / hl.sum(mt.AD)
filter_condition_ab = (
    hl.case(missing_false=True)
    .when(mt.GT.is_hom_ref(), ab <= 0.1)
    .when(mt.GT.is_het(), (ab >= 0.25) & (ab <= 0.75))
    .when(mt.GT.is_hom_var(), ab >= 0.9)
    .default(True) ## keep everything else
)
mtB = mt.filter_entries(filter_condition_ab)

In my case, the number of total genotypes for mtA and mtB are different. I am a bit puzzled here to understand that.

Total genotypes calculated using -

Total_geno1 = mtA.aggregate_entries(hl.agg.count_where(hl.is_defined(mtA.GT)))
print ('Total genotypes mtA: ' + str(Total_geno1))

Total_geno2 = mtB.aggregate_entries(hl.agg.count_where(hl.is_defined(mtB.GT)))
print ('Total genotypes mtB: ' + str(Total_geno2))

Not sure how the initial missing genotypes play a role here, but anyway when I calculated total genotyes I used hl.is_defined(mt.GT), which should exclude missing data.

tpoterba · October 15, 2019, 12:22pm

I probably still won’t have a full answer until after ASHG (or at least our workshop) but one of the important subtleties here is that there is a difference between a missing GT and a filtered entry.

A missing GT can be interrogated for using hl.is_defined and such, but a filtered entry doesn’t participate in aggregations. This computation:

Total_geno1 = mtA.aggregate_entries(hl.agg.count_where(hl.is_defined(mtA.GT)))

Will include the filtered entries neither in the numerator or denominator of the fraction.

kousikbioinfo · October 16, 2019, 1:56pm

Thanks Tim. To interrogate missing GT using hl.is_defined, I used below mentioned code.

defined_geno =  mt.aggregate_entries(hl.agg.fraction(hl.is_defined(mt.GT)))
print(f' Defined genotypes: {defined_geno * 100:.2f}%.')
non_defined = 100-(defined_geno * 100)
print(f' Non-defined genotypes: {non_defined:.2f}%.')

Result-
Defined genotypes: 99.41%.
Non-defined genotypes: 0.59%.

I thought 0.59% genotypes are missing GT in the mt. Is that correct ?

kousikbioinfo · October 28, 2019, 3:08pm

Any update on this @tpoterba?

tpoterba · October 28, 2019, 3:14pm

will dig in later today

tpoterba · October 29, 2019, 12:09am

I think this is best illustrated through an example – I’ll write that on my train commute tomorrow.

tpoterba · October 30, 2019, 12:32am

Take this matrix table:

In [4]: mt.show()
+---------+------+------+------+
| row_idx | 0.GT | 1.GT | 2.GT |
+---------+------+------+------+
|   int32 | call | call | call |
+---------+------+------+------+
|       0 | 0/0  | 0/1  | 1/1  |
|       1 | 0/1  | 1/1  | 0/0  |
|       2 | 1/1  | 0/0  | 0/1  |
+---------+------+------+------+

In [5]: mt.entries().show()
+---------+---------+------+
| row_idx | col_idx | GT   |
+---------+---------+------+
|   int32 |   int32 | call |
+---------+---------+------+
|       0 |       0 | 0/0  |
|       0 |       1 | 0/1  |
|       0 |       2 | 1/1  |
|       1 |       0 | 0/1  |
|       1 |       1 | 1/1  |
|       1 |       2 | 0/0  |
|       2 |       0 | 1/1  |
|       2 |       1 | 0/0  |
|       2 |       2 | 0/1  |
+---------+---------+------+

If I filter entries as so:

In [6]: mt = mt.filter_entries(~mt.GT.is_hom_var())

Then it looks like I have set GTs to missing in show():

+---------+------+------+------+
| row_idx | 0.GT | 1.GT | 2.GT |
+---------+------+------+------+
|   int32 | call | call | call |
+---------+------+------+------+
|       0 | 0/0  | 0/1  | NA   |
|       1 | 0/1  | NA   | 0/0  |
|       2 | NA   | 0/0  | 0/1  |
+---------+------+------+------+

However, what I’ve really done is remove those entries entirely:

In [8]: mt.entries().show()
+---------+---------+------+
| row_idx | col_idx | GT   |
+---------+---------+------+
|   int32 |   int32 | call |
+---------+---------+------+
|       0 |       0 | 0/0  |
|       0 |       1 | 0/1  |
|       1 |       0 | 0/1  |
|       1 |       2 | 0/0  |
|       2 |       1 | 0/0  |
|       2 |       2 | 0/1  |
+---------+---------+------+

So when I aggregate to compute the fraction defined, I get 6/6 = 1.0:

In [9]: mt.aggregate_entries(hl.agg.fraction(hl.is_defined(mt.GT)))
Out[9]: 1.0

I think this is what’s going on in your case, in a bit more of a subtle way.

kousikbioinfo · October 30, 2019, 10:10am

Thanks a lot @tpoterba. Much appreciated !

How do you interrogate missing data ? Imagine you have genotype call as ./. .
So if you have to remove those, would you use -
mt= mt.filter_entries(hl.is_defined(mt.GT)) ?
Because if you run
mt.aggregate_entries(hl.agg.fraction(hl.is_defined(mt.GT))) at the beginning, it doesn’t give 1.0 (at least for my case)

In my case, I was more confused why my mtA and mtB giving me different results. They should be ideally provide same answer.

tpoterba · October 30, 2019, 11:38pm

yes, exactly. There’s a difference between a missing GT and a filtered entry.

If you run mt= mt.filter_entries(hl.is_defined(mt.GT)), then the aggregate fraction you listed should return 1.0 afterwards.

kousikbioinfo · October 31, 2019, 2:12pm

Thank you Tim for the clarification.

Topic		Replies	Views
How to keep missing GT in n_alt_alleles()? Hail Query & hailctl	7	748	February 9, 2022
Gentoype filtering - missing hom ref data Hail Query & hailctl	2	31	March 20, 2025
Issues grouping by cols and then filtering by GT Hail Query & hailctl	10	769	June 23, 2022
Filtering MatrixTable for genotype in specific sample Hail Query & hailctl	7	1694	January 8, 2019
Allele frequency filter - keep missings Hail Query & hailctl	2	477	July 6, 2021

Entry filtering semantics question: allele read balance

Related topics