Subsampling within a MatrixTable based on column condition

Hi all, let’s say I have a MatrixTable of 500 males and 500 females (annotated with a column field, “gender”). I want a final MatrixTable with all 500 males and a random sample of 250 females. Can I do this natively within the MatrixTable, i.e. randomly sample 250 of the 500 females based on mt.gender while leaving the male samples intact? Thanks!

Here’s an easy way to do this:

samples_to_keep = mt.aggregate_cols(hl.agg.filter(mt.gender == 'M', hl.agg.collect(mt.s)).extend(hl.agg.filter(mt.gender == 'F', hl.agg.take(mt.s, 250, ordering=hl.rand_unif()))
mt = mt.filter_cols(hl.literal(samples_to_keep).contains(mt.s))

I tried:
samples_to_keep = mt.aggregate_cols(hl.agg.filter(mt.gender == 'M', hl.agg.collect(mt.s)).extend(hl.agg.filter(mt.gender == 'F', hl.agg.take(mt.s, 250, ordering=hl.rand_unif()))
and I got:
SyntaxError: unexpected EOF while parsing
Could you tell me which part causes the error? Thank you very much.

I realized the line was missing closing parentheses. After correcting that:

samples_to_keep = mt.aggregate_cols(
    hl.agg.filter(
        mt.gender == 'M', hl.agg.collect(mt.s)
    ).extend(
        hl.agg.filter(mt.gender == 'F', hl.agg.take(mt.s, 105, ordering=hl.rand_unif()))
    )
)

I got this error:
TypeError: missing a required argument: 'lower'

hl.rand_unif requires both a lower and an upper argument; it doesn’t default to 0 and 1. I’ll make a PR to fix that. In the meantime, change the call to hl.rand_unif(0, 1).

PR for the defaults: “[query] default rand_unif to [0, 1]” by danking (Pull Request #11833, hail-is/hail on GitHub)