Group_rows_by: number of rows per group

FFS · February 12, 2020, 8:56am

When using group_rows_by, is there a way to get the group annotated with how many rows went into that group?

For example, when grouping by gene and aggregating the entries:

dataset_result = dataset.group_rows_by(dataset.gene).aggregate(
    n_non_ref = hl.agg.count_where(dataset.GT.is_non_ref())
)

I would also like dataset_result to have a row field holding the number of SNPs that were grouped into each gene.

I expect the same question applies to group_cols_by.

tpoterba · February 12, 2020, 1:36pm

Yes, you can do:

dataset_result = dataset.group_rows_by(dataset.gene) \
    .aggregate_rows(n = hl.agg.count()) \
    .aggregate(n_non_ref = hl.agg.count_where(dataset.GT.is_non_ref()))
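
To make the semantics concrete, here is a rough plain-Python sketch of what the two aggregations compute. It is an illustration only, not how Hail implements them: the toy `rows` data and the helper `group_rows_by_gene` are hypothetical, and each call is simplified to a count of non-reference alleles.

```python
from collections import defaultdict

# Hypothetical toy data: each row is a variant with a gene label and one
# genotype call per sample (simplified to the number of non-ref alleles).
rows = [
    {"gene": "A", "calls": [0, 1, 2]},
    {"gene": "A", "calls": [0, 0, 1]},
    {"gene": "B", "calls": [1, 0, 0]},
]


def group_rows_by_gene(rows):
    """Return (rows per gene, non-ref call count per gene and sample)."""
    n_rows = defaultdict(int)  # row-level result, one value per gene
    n_non_ref = {}             # entry-level result, one value per gene and sample
    for row in rows:
        gene = row["gene"]
        # Analogue of the row aggregation: count rows in the group.
        n_rows[gene] += 1
        # Analogue of the entry aggregation: count non-ref calls per sample.
        counts = n_non_ref.setdefault(gene, [0] * len(row["calls"]))
        for j, call in enumerate(row["calls"]):
            if call > 0:
                counts[j] += 1
    return dict(n_rows), n_non_ref


n_rows, n_non_ref = group_rows_by_gene(rows)
print(n_rows)
print(n_non_ref)
```

The point is that `n` ends up as one value per gene (a row field of the result), while `n_non_ref` is one value per gene and sample (an entry field). For group_cols_by, I believe the analogous method is aggregate_cols, but check the GroupedMatrixTable docs for your Hail version.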
