Annotate rows with aggregations by grouped columns

tsoare · November 16, 2018, 3:15pm

I’m likely missing something basic because I am trying to find the best way to annotate rows with aggregations by grouped columns. For example, the case of adding call_rate in cases and controls to the original MatrixTable.

ann = mt.annotate_rows(group_call_rate = hl.agg.group_by(hl.struct(case_status = mt[‘Is.Control’]),
hl.agg.fraction(hl.is_defined(mt.GT))))

However this returns a dict (and ideally I’d like an array, like with PCA scores).

±----------------------------------------+
| group_call_rate |
±----------------------------------------+
| dict<struct{case_status: str}, float64> |
±----------------------------------------+
| {(“No”):9.99e-01,(“Yes”):9.60e-01} |
| {(“No”):9.99e-01,(“Yes”):9.92e-01} |
| {(“No”):9.97e-01,(“Yes”):9.93e-01} |
| {(“No”):1.00e+00,(“Yes”):9.90e-01} |
| {(“No”):1.00e+00,(“Yes”):9.91e-01} |
| {(“No”):1.00e+00,(“Yes”):9.71e-01} |
| {(“No”):1.00e+00,(“Yes”):9.77e-01} |
| {(“No”):1.00e+00,(“Yes”):9.93e-01} |
| {(“No”):1.00e+00,(“Yes”):9.98e-01} |
| {(“No”):1.00e+00,(“Yes”):9.98e-01} |
±----------------------------------------+
showing top 10 rows

I also tried group_cols_by but I haven’t been able to annotate the values back to the original MatrixTable.

results = (mt.group_cols_by(case_status = mt[‘Is.Control’])
.aggregate(call_rate = agg.fraction(hl.is_defined(mt.GT))))
results.describe()

results.entry.show(10)
±--------------±-----------±------------±----------+
| locus | alleles | case_status | call_rate |
±--------------±-----------±------------±----------+
| locus | array | str | float64 |
±--------------±-----------±------------±----------+
| chr1:39203 | [“C”,“T”] | “No” | 9.99e-01 |
| chr1:39203 | [“C”,“T”] | “Yes” | 9.60e-01 |
| chr1:39224 | [“C”,“T”] | “No” | 9.99e-01 |
| chr1:39224 | [“C”,“T”] | “Yes” | 9.92e-01 |
| chr1:115746 | [“C”,“T”] | “No” | 9.97e-01 |
| chr1:115746 | [“C”,“T”] | “Yes” | 9.93e-01 |
| chr1:133160 | [“G”,“A”] | “No” | 1.00e+00 |
| chr1:133160 | [“G”,“A”] | “Yes” | 9.90e-01 |
| chr1:135203 | [“G”,“A”] | “No” | 1.00e+00 |
| chr1:135203 | [“G”,“A”] | “Yes” | 9.91e-01 |
±--------------±-----------±------------±----------+
showing top 10 rows

Maybe this is possible with collect()? But this workaround doesn’t seem optimal…

Thanks so much!

tpoterba · November 16, 2018, 4:43pm

However this returns a dict (and ideally I’d like an array, like with PCA scores).

what would the elements be? There’s probably a way to transform this dictionary into the array you want.

tsoare · November 16, 2018, 4:46pm

For this case, I am simply thinking the values [9.99e-01, 9.60e-01], without keys. (I do realize that dropping the keys from the dict might be blasphemy.)

tpoterba · November 16, 2018, 4:49pm

you can just do .values() on the result.

But you don’t know whether or case or control comes first!

Topic		Replies	Views
Group by columns and aggregate entries over all entries in the group Hail Query & hailctl	2	448	August 30, 2021
Error when trying to annotate a new row with a genotypes of the sample Hail Query & hailctl	2	330	July 13, 2023
Annotate rows based on GT Hail Query & hailctl	4	396	December 7, 2022
Group rows by several columns merging values into arrays Hail Query & hailctl	1	461	January 6, 2021
Multiple group statistics Hail Query & hailctl	6	450	May 8, 2020

Annotate rows with aggregations by grouped columns

Related topics