Filtering MatrixTable for genotype in specific sample

ihelbig · January 8, 2019, 3:47am

Hi everybody, I have a relatively simple question that I’m stuck with. I have a MatrixTable mt with 50 samples that I would like to filter for all heterozygous variants in a given sample USAUPNEG0002P.

I managed to filter for the specific sample with mt_filtered = mt.filter_cols(mt.s == 'USAUPNEG0002P'), but I struggling to filter by the GT field for heterozygous variants. The solution is probably quite straightforward, but I am having trouble understanding how to properly handle the MatrixTable format for this filtering step?

mt_gt = mt_filtered.filter_rows(mt_filtered.GT == hl.call(1, 0))

results in the error message

    Found indices ['column', 'row'], with unexpected indices ['column']. Invalid fields:
        'GT' (indices ['column', 'row'])
    'MatrixTable.filter_rows' supports aggregation over axes ['column'], so these fields may appear inside an aggregator function.```

tpoterba · January 8, 2019, 10:54am

This isn’t obvious! The problem is that a MatrixTable intentionally doesn’t give you random access to specific rows or columns, which we do because we want to write distributed, streaming implementations over both axes.

There are several solutions.

First, you can use make_table when you want to go from a matrix orientation to a table with one field per column.

ht = mt.filter_cols(mt.s == 'USAUPNEG0002P').make_table()
ht.filter(ht['USAUPNEG0002P.GT'].is_het()).show()

Second, if you want to stay a MatrixTable (say, for instance, you are interested in looking at all 50 samples at variants where the USAUPNEG0002P sample is het), then you can use aggregators to do this filtering.

mt_filtered = mt.filter_rows(
    hl.count_where((mt.s == 'USAUPNEG0002P') & mt.GT.is_het()) > 0)

ihelbig · January 8, 2019, 9:34pm

Thank you so much! The first solution worked beautifully. However, when I try

mt_filtered = mt.filter_rows(hl.count_where((mt.s == 'USAUPNEG0002P') & mt.GT.is_het()) > 0)

I get the following error message:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-27-7b9d64f5f2ad> in <module>()
----> 1 mt_filtered = mt.filter_rows(hl.count_where((mt.s == 'USAUPNEG0002P') & mt.GT.is_het()) > 0)

AttributeError: module 'hail' has no attribute 'count_where'

I must be doing something wrong as I don’t find the count_where attribute in the hail module?

ihelbig · January 8, 2019, 9:42pm

Also, I was wondering whether you could briefly point me towards a quick way of identifying samples that are het or non-ref at a certain locus. I am able to filter the rows for a given locus, but then I need to switch to pandas to filter by 0/1.

l1 = hl.Locus.parse('1:900505')
mtx = mt.filter_rows(mt.locus == l1)

Using the make_table() would be difficult as I would need to filter all samples, not just a single sample?

Is there any good way I can accomplish this?

tpoterba · January 8, 2019, 9:53pm

sorry, typo. Should be hl.agg.count_where.

a quick way of identifying samples that are het or non-ref at a certain locus

So you have one locus you care about, and want to find all non-ref samples?

This is an efficient solution:

# use filter_intervals because it has great performance behavior
mt = hl.filter_intervals(mt, [hl.parse_locus_interval('1:900505-900506')])
print(mt.aggregate_entries(hl.agg.filter(mt.GT.is_non_ref(), mt.s)))

ihelbig · January 8, 2019, 10:55pm

Thank you! Unfortunately, the command

print(mt.aggregate_entries(hl.agg.filter(mt.GT.is_non_ref(), mt.s)))

results in an assertion error

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-54-bdb7919987d4> in <module>()
----> 1 print(mt.aggregate_entries(hl.agg.filter(mt.GT.is_non_ref(), mt.s)))

<decorator-gen-1036> in filter(condition, aggregation)

~/anaconda3/lib/python3.6/site-packages/hail/typecheck/check.py in wrapper(__original_func, *args, **kwargs)
    557     @decorator
    558     def wrapper(__original_func, *args, **kwargs):
--> 559         args_, kwargs_ = check_all(__original_func, args, kwargs, checkers, is_method=is_method)
    560         return __original_func(*args_, **kwargs_)
    561 

~/anaconda3/lib/python3.6/site-packages/hail/typecheck/check.py in check_all(f, args, kwargs, checks, is_method)
    485                     if i < len(args):
    486                         arg = args[i]
--> 487                         args_.append(checker.check(arg, name, arg_name))
    488                     # passed as keyword
    489                     else:

~/anaconda3/lib/python3.6/site-packages/hail/expr/aggregators/aggregators.py in check(self, x, caller, param)
     24     def check(self, x, caller, param):
     25         x = self.coercer.check(x, caller, param)
---> 26         assert(len(x._ir.search(lambda node: isinstance(node, BaseApplyAggOp))) != 0)
     27         return x
     28 

AssertionError:

Do you have any good idea what this could be due to?

tpoterba · January 8, 2019, 10:58pm

oops, typo again:

print(mt.aggregate_entries(hl.agg.filter(mt.GT.is_non_ref(), hl.agg.collect(mt.s))))

Long day!

ihelbig · January 8, 2019, 10:59pm

…just figured it out in parallel as I stumbled across the post about the redesign of the aggregator interface (link)

Thank you so much, this work beautifully now and is very fast!

Topic		Replies	Views
Filtering MatrixTables where column values do not match Hail Query & hailctl	4	622	February 22, 2021
Is there a way to get a list of non-reference heterozygous and homozygous variants by gene for each sample from a MatrixTable? Hail Query & hailctl	0	229	January 8, 2024
Efficient way to filter on large number of samples by an entry value Hail Query & hailctl	3	438	May 10, 2021
Select certain samples from MatrixTable Hail Query & hailctl	9	880	October 6, 2022
Gene-based sample statistics Hail Query & hailctl	6	763	January 27, 2020

Filtering MatrixTable for genotype in specific sample

Related topics