Hi everybody, I have a relatively simple question that I’m stuck with. I have a MatrixTable mt with 50 samples that I would like to filter for all heterozygous variants in a given sample USAUPNEG0002P.
I managed to filter for the specific sample with mt_filtered = mt.filter_cols(mt.s == 'USAUPNEG0002P'), but I am struggling to filter by the GT field for heterozygous variants. The solution is probably quite straightforward, but I am having trouble understanding how to properly handle the MatrixTable format for this filtering step.
```
Found indices ['column', 'row'], with unexpected indices ['column']. Invalid fields:
'GT' (indices ['column', 'row'])
'MatrixTable.filter_rows' supports aggregation over axes ['column'], so these fields may appear inside an aggregator function.
```
This isn’t obvious! GT is an entry field, indexed by both row and column, so it can’t be used directly in filter_rows, which expects a row-indexed expression. A MatrixTable intentionally doesn’t give you random access to specific rows or columns, because we want to write distributed, streaming implementations over both axes.
There are several solutions.
First, you can use make_table when you want to go from a matrix orientation to a table with one field per column.
Second, if you want to stay a MatrixTable (say, for instance, you are interested in looking at all 50 samples at variants where the USAUPNEG0002P sample is het), then you can use aggregators to do this filtering.
Also, I was wondering whether you could briefly point me towards a quick way of identifying samples that are het or non-ref at a certain locus. I am able to filter the rows for a given locus, but then I need to switch to pandas to filter by 0/1.
> a quick way of identifying samples that are het or non-ref at a certain locus
So you have one locus you care about, and want to find all non-ref samples?
This is an efficient solution:
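import hail as hl
# use filter_intervals because it has great performance behavior
mt = hl.filter_intervals(mt, [hl.parse_locus_interval('1:900505-900506')])
# collect the IDs of samples whose call is non-ref (use mt.GT.is_het() for het only)
print(mt.aggregate_entries(hl.agg.filter(mt.GT.is_non_ref(), hl.agg.collect(mt.s))))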