Filtering MatrixTable for genotype in specific sample


#1

Hi everybody, I have a relatively simple question that I’m stuck with. I have a MatrixTable mt with 50 samples that I would like to filter for all heterozygous variants in a given sample USAUPNEG0002P.

I managed to filter for the specific sample with mt_filtered = mt.filter_cols(mt.s == 'USAUPNEG0002P'), but I struggling to filter by the GT field for heterozygous variants. The solution is probably quite straightforward, but I am having trouble understanding how to properly handle the MatrixTable format for this filtering step?

mt_gt = mt_filtered.filter_rows(mt_filtered.GT == hl.call(1, 0))

results in the error message

    Found indices ['column', 'row'], with unexpected indices ['column']. Invalid fields:
        'GT' (indices ['column', 'row'])
    'MatrixTable.filter_rows' supports aggregation over axes ['column'], so these fields may appear inside an aggregator function.```

#2

This isn’t obvious! The problem is that a MatrixTable intentionally doesn’t give you random access to specific rows or columns, which we do because we want to write distributed, streaming implementations over both axes.

There are several solutions.

First, you can use make_table when you want to go from a matrix orientation to a table with one field per column.

ht = mt.filter_cols(mt.s == 'USAUPNEG0002P').make_table()
ht.filter(ht['USAUPNEG0002P.GT'].is_het()).show()

Second, if you want to stay a MatrixTable (say, for instance, you are interested in looking at all 50 samples at variants where the USAUPNEG0002P sample is het), then you can use aggregators to do this filtering.

mt_filtered = mt.filter_rows(
    hl.count_where((mt.s == 'USAUPNEG0002P') & mt.GT.is_het()) > 0)

#3

Thank you so much! The first solution worked beautifully. However, when I try

mt_filtered = mt.filter_rows(hl.count_where((mt.s == 'USAUPNEG0002P') & mt.GT.is_het()) > 0)

I get the following error message:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-27-7b9d64f5f2ad> in <module>()
----> 1 mt_filtered = mt.filter_rows(hl.count_where((mt.s == 'USAUPNEG0002P') & mt.GT.is_het()) > 0)

AttributeError: module 'hail' has no attribute 'count_where'

I must be doing something wrong as I don’t find the count_where attribute in the hail module?


#4

Also, I was wondering whether you could briefly point me towards a quick way of identifying samples that are het or non-ref at a certain locus. I am able to filter the rows for a given locus, but then I need to switch to pandas to filter by 0/1.

l1 = hl.Locus.parse('1:900505')
mtx = mt.filter_rows(mt.locus == l1)

Using the make_table() would be difficult as I would need to filter all samples, not just a single sample?

Is there any good way I can accomplish this?


#5

sorry, typo. Should be hl.agg.count_where.

a quick way of identifying samples that are het or non-ref at a certain locus

So you have one locus you care about, and want to find all non-ref samples?

This is an efficient solution:

# use filter_intervals because it has great performance behavior
mt = hl.filter_intervals(mt, [hl.parse_locus_interval('1:900505-900506')])
print(mt.aggregate_entries(hl.agg.filter(mt.GT.is_non_ref(), mt.s)))

#6

Thank you! Unfortunately, the command

print(mt.aggregate_entries(hl.agg.filter(mt.GT.is_non_ref(), mt.s)))

results in an assertion error

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-54-bdb7919987d4> in <module>()
----> 1 print(mt.aggregate_entries(hl.agg.filter(mt.GT.is_non_ref(), mt.s)))

<decorator-gen-1036> in filter(condition, aggregation)

~/anaconda3/lib/python3.6/site-packages/hail/typecheck/check.py in wrapper(__original_func, *args, **kwargs)
    557     @decorator
    558     def wrapper(__original_func, *args, **kwargs):
--> 559         args_, kwargs_ = check_all(__original_func, args, kwargs, checkers, is_method=is_method)
    560         return __original_func(*args_, **kwargs_)
    561 

~/anaconda3/lib/python3.6/site-packages/hail/typecheck/check.py in check_all(f, args, kwargs, checks, is_method)
    485                     if i < len(args):
    486                         arg = args[i]
--> 487                         args_.append(checker.check(arg, name, arg_name))
    488                     # passed as keyword
    489                     else:

~/anaconda3/lib/python3.6/site-packages/hail/expr/aggregators/aggregators.py in check(self, x, caller, param)
     24     def check(self, x, caller, param):
     25         x = self.coercer.check(x, caller, param)
---> 26         assert(len(x._ir.search(lambda node: isinstance(node, BaseApplyAggOp))) != 0)
     27         return x
     28 

AssertionError: 

Do you have any good idea what this could be due to?


#7

oops, typo again:

print(mt.aggregate_entries(hl.agg.filter(mt.GT.is_non_ref(), hl.agg.collect(mt.s))))

Long day!


#8

…just figured it out in parallel as I stumbled across the post about the redesign of the aggregator interface (link)

Thank you so much, this work beautifully now and is very fast!