Subsampling within a MatrixTable based on column condition

CuriousGeneticist · May 13, 2022, 3:38pm

Hi all, let’s say I have a MatrixTable of 500 males and 500 females (annotated with a column field, “gender”). If I want a final mt with 500 males and 250 females (a random sample), can I do this natively within the MatrixTable, i.e. randomly sample 250 of the 500 females based on mt.gender while leaving all the male samples intact? Thanks!

tpoterba · May 13, 2022, 3:44pm

Here’s an easy way to do this:

samples_to_keep = mt.aggregate_cols(hl.agg.filter(mt.gender == 'M', hl.agg.collect(mt.s)).extend(hl.agg.filter(mt.gender == 'F', hl.agg.take(mt.s, 250, ordering=hl.rand_unif()))
mt = mt.filter_cols(hl.literal(samples_to_keep).contains(mt.s))
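
For intuition, here is a rough plain-Python analogue of what the aggregation above is meant to compute. This is not Hail code. The function name subsample_females and the toy sample IDs are made up for illustration. The idea is the same: keep every 'M' sample, give each 'F' sample an independent uniform random key, and keep the 250 with the smallest keys, which amounts to a uniform random sample without replacement.

```python
import random


def subsample_females(gender_by_sample, n_females, seed=None):
    """Keep all 'M' samples plus n_females 'F' samples chosen at random.

    Hypothetical helper for illustration only; samples whose gender is
    neither 'M' nor 'F' are dropped, as in the two-filter aggregation.
    """
    rng = random.Random(seed)
    males = [s for s, g in gender_by_sample.items() if g == 'M']
    females = [s for s, g in gender_by_sample.items() if g == 'F']
    # Attach an independent uniform key to each female sample, then keep
    # the n_females smallest keys (the role a random ordering plays in a
    # "take the first n by ordering" aggregation).
    keyed = sorted((rng.random(), s) for s in females)
    sampled = [s for _, s in keyed[:n_females]]
    return males + sampled


# Toy data: 500 males and 500 females with made-up sample IDs.
gender_by_sample = {f"M{i}": 'M' for i in range(500)}
gender_by_sample.update({f"F{i}": 'F' for i in range(500)})

samples_to_keep = subsample_females(gender_by_sample, 250, seed=0)
print(len(samples_to_keep))  # 750
```

The resulting list plays the same role as samples_to_keep in the Hail version: it is the set of sample IDs you then filter the columns by.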

CuriousGeneticist · May 13, 2022, 4:34pm

I tried:
samples_to_keep = mt.aggregate_cols(hl.agg.filter(mt.gender == 'M', hl.agg.collect(mt.s)).extend(hl.agg.filter(mt.gender == 'F', hl.agg.take(mt.s, 250, ordering=hl.rand_unif()))
and I got:
SyntaxError: unexpected EOF while parsing
Can I check which part causes the error? Thank you very much.

CuriousGeneticist · May 13, 2022, 4:38pm

I realized there were not enough right brackets for the line, and after correcting that:

samples_to_keep = mt.aggregate_cols(
    hl.agg.filter(
        mt.gender == 'M', hl.agg.collect(mt.s)
    ).extend(
        hl.agg.filter(mt.gender == 'F', hl.agg.take(mt.s, 105, ordering=hl.rand_unif()))
    )
)

I got this error:
TypeError: missing a required argument: 'lower'

danking · May 13, 2022, 4:46pm

hl.rand_unif requires a lower and an upper argument; it doesn’t default to 0, 1. I’ll make a PR to fix that. In the meantime you’ll need to change the call to hl.rand_unif(0, 1), i.e. pass ordering=hl.rand_unif(0, 1) to hl.agg.take.

danking · May 13, 2022, 4:50pm

PR for the defaults: “[query] default rand_unif to [0, 1]” by danking (Pull Request #11833, hail-is/hail on GitHub)
