Annotate rows with aggregations by grouped columns

I’m likely missing something basic because I am trying to find the best way to annotate rows with aggregations by grouped columns. For example, the case of adding call_rate in cases and controls to the original MatrixTable.

ann = mt.annotate_rows(group_call_rate = hl.agg.group_by(hl.struct(case_status = mt[‘Is.Control’]),
hl.agg.fraction(hl.is_defined(mt.GT))))

However this returns a dict (and ideally I’d like an array, like with PCA scores).

±----------------------------------------+
| group_call_rate |
±----------------------------------------+
| dict<struct{case_status: str}, float64> |
±----------------------------------------+
| {(“No”):9.99e-01,(“Yes”):9.60e-01} |
| {(“No”):9.99e-01,(“Yes”):9.92e-01} |
| {(“No”):9.97e-01,(“Yes”):9.93e-01} |
| {(“No”):1.00e+00,(“Yes”):9.90e-01} |
| {(“No”):1.00e+00,(“Yes”):9.91e-01} |
| {(“No”):1.00e+00,(“Yes”):9.71e-01} |
| {(“No”):1.00e+00,(“Yes”):9.77e-01} |
| {(“No”):1.00e+00,(“Yes”):9.93e-01} |
| {(“No”):1.00e+00,(“Yes”):9.98e-01} |
| {(“No”):1.00e+00,(“Yes”):9.98e-01} |
±----------------------------------------+
showing top 10 rows

I also tried group_cols_by but I haven’t been able to annotate the values back to the original MatrixTable.

results = (mt.group_cols_by(case_status = mt[‘Is.Control’])
.aggregate(call_rate = agg.fraction(hl.is_defined(mt.GT))))
results.describe()

results.entry.show(10)
±--------------±-----------±------------±----------+
| locus | alleles | case_status | call_rate |
±--------------±-----------±------------±----------+
| locus | array | str | float64 |
±--------------±-----------±------------±----------+
| chr1:39203 | [“C”,“T”] | “No” | 9.99e-01 |
| chr1:39203 | [“C”,“T”] | “Yes” | 9.60e-01 |
| chr1:39224 | [“C”,“T”] | “No” | 9.99e-01 |
| chr1:39224 | [“C”,“T”] | “Yes” | 9.92e-01 |
| chr1:115746 | [“C”,“T”] | “No” | 9.97e-01 |
| chr1:115746 | [“C”,“T”] | “Yes” | 9.93e-01 |
| chr1:133160 | [“G”,“A”] | “No” | 1.00e+00 |
| chr1:133160 | [“G”,“A”] | “Yes” | 9.90e-01 |
| chr1:135203 | [“G”,“A”] | “No” | 1.00e+00 |
| chr1:135203 | [“G”,“A”] | “Yes” | 9.91e-01 |
±--------------±-----------±------------±----------+
showing top 10 rows

Maybe this is possible with collect()? But this workaround doesn’t seem optimal…

Thanks so much!

However this returns a dict (and ideally I’d like an array, like with PCA scores).

what would the elements be? There’s probably a way to transform this dictionary into the array you want.

For this case, I am simply thinking the values [9.99e-01, 9.60e-01], without keys. (I do realize that dropping the keys from the dict might be blasphemy.)

you can just do .values() on the result.

But you don’t know whether or case or control comes first!