Filter Table by max within group

jsmadsen · December 22, 2021, 10:45pm

Probably simple, but I have spent way too long on this. I have a table and would like to “de-duplicate” the rows, keeping only one row for each group in id2: the row with the maximum value of val. I currently manage with a gnarly group_by() -> hl.agg.max() -> join(), but there must be a neater way. Essentially, I would like to turn this table:

id1  id2  val  more_columns
---------------------------
a    x    0.1  ...
b    x    0.3  ...
c    x    0.4  ...
d    y    0.2  ...
e    y    0.9  ...
f    y    0.5  ...

Into this:

id1  id2  val  more_columns
---------------------------
c    x    0.4  ...
e    y    0.9  ...

I hope this makes sense, and thank you for your time.

tpoterba · December 23, 2021, 12:52am

I think the group_by is exactly the right thing:

rows = mt.rows()
to_keep = rows.group_by(rows.id2).aggregate(top1=hl.agg.take(rows.id1, ordering=-rows.id2)[0]).key_by('top1')
mt = mt.semi_join_rows(to_keep)

jsmadsen · December 23, 2021, 1:18am

Small correction for those who come looking in the future:

rows = mt.rows()
# For each id2 group, keep the id1 of the row with the largest val
to_keep = rows.group_by(rows.id2).aggregate(top1=hl.agg.take(rows.id1, 1, ordering=-rows.val)[0]).key_by('top1')
mt = mt.semi_join_rows(to_keep)
# or
# mt = mt.filter_rows(hl.is_defined(to_keep[mt.id1]))
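
For anyone who wants to sanity-check the logic without a Hail session, here is a minimal plain-Python sketch of the same idea, using the example rows from the question. It is not Hail code: the function name top_row_per_group and the tuple layout (id1, id2, val) are assumptions made for illustration. heapq.nsmallest with a negated key plays the role that take(..., 1, ordering=-val)[0] plays above.

```python
import heapq
from collections import defaultdict

# Example rows from the question, as (id1, id2, val) tuples.
rows = [
    ("a", "x", 0.1),
    ("b", "x", 0.3),
    ("c", "x", 0.4),
    ("d", "y", 0.2),
    ("e", "y", 0.9),
    ("f", "y", 0.5),
]


def top_row_per_group(rows):
    """Keep, for each id2, the single row with the largest val."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[1]].append(row)  # group by id2
    # Taking the 1 smallest element under the key -val yields the row
    # with the largest val in each group.
    return [
        heapq.nsmallest(1, group, key=lambda r: -r[2])[0]
        for group in groups.values()
    ]


kept = top_row_per_group(rows)
print(kept)  # [('c', 'x', 0.4), ('e', 'y', 0.9)]
```

If two rows in a group tie on val, this sketch still keeps exactly one row per group.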

tpoterba · December 23, 2021, 1:30am

oops, thanks! Was it just the 1 argument to take that I missed?

jsmadsen · December 23, 2021, 5:41pm

And ordering=-rows.val instead of ordering=-rows.id2, I think.

alexsunny123 · March 23, 2022, 9:11am

Thanks for the awesome information.

alexsunny123 · February 4, 2023, 10:00am

Thanks, my issue has been fixed.
