Annotate rows with aggregations by grouped columns


#1

I’m likely missing something basic because I am trying to find the best way to annotate rows with aggregations by grouped columns. For example, the case of adding call_rate in cases and controls to the original MatrixTable.

ann = mt.annotate_rows(group_call_rate = hl.agg.group_by(hl.struct(case_status = mt[‘Is.Control’]),
hl.agg.fraction(hl.is_defined(mt.GT))))

However this returns a dict (and ideally I’d like an array, like with PCA scores).

±----------------------------------------+
| group_call_rate |
±----------------------------------------+
| dict<struct{case_status: str}, float64> |
±----------------------------------------+
| {(“No”):9.99e-01,(“Yes”):9.60e-01} |
| {(“No”):9.99e-01,(“Yes”):9.92e-01} |
| {(“No”):9.97e-01,(“Yes”):9.93e-01} |
| {(“No”):1.00e+00,(“Yes”):9.90e-01} |
| {(“No”):1.00e+00,(“Yes”):9.91e-01} |
| {(“No”):1.00e+00,(“Yes”):9.71e-01} |
| {(“No”):1.00e+00,(“Yes”):9.77e-01} |
| {(“No”):1.00e+00,(“Yes”):9.93e-01} |
| {(“No”):1.00e+00,(“Yes”):9.98e-01} |
| {(“No”):1.00e+00,(“Yes”):9.98e-01} |
±----------------------------------------+
showing top 10 rows

I also tried group_cols_by but I haven’t been able to annotate the values back to the original MatrixTable.

results = (mt.group_cols_by(case_status = mt[‘Is.Control’])
.aggregate(call_rate = agg.fraction(hl.is_defined(mt.GT))))
results.describe()

results.entry.show(10)
±--------------±-----------±------------±----------+
| locus | alleles | case_status | call_rate |
±--------------±-----------±------------±----------+
| locus | array | str | float64 |
±--------------±-----------±------------±----------+
| chr1:39203 | [“C”,“T”] | “No” | 9.99e-01 |
| chr1:39203 | [“C”,“T”] | “Yes” | 9.60e-01 |
| chr1:39224 | [“C”,“T”] | “No” | 9.99e-01 |
| chr1:39224 | [“C”,“T”] | “Yes” | 9.92e-01 |
| chr1:115746 | [“C”,“T”] | “No” | 9.97e-01 |
| chr1:115746 | [“C”,“T”] | “Yes” | 9.93e-01 |
| chr1:133160 | [“G”,“A”] | “No” | 1.00e+00 |
| chr1:133160 | [“G”,“A”] | “Yes” | 9.90e-01 |
| chr1:135203 | [“G”,“A”] | “No” | 1.00e+00 |
| chr1:135203 | [“G”,“A”] | “Yes” | 9.91e-01 |
±--------------±-----------±------------±----------+
showing top 10 rows

Maybe this is possible with collect()? But this workaround doesn’t seem optimal…

Thanks so much!


#2

However this returns a dict (and ideally I’d like an array, like with PCA scores).

what would the elements be? There’s probably a way to transform this dictionary into the array you want.


#3

For this case, I am simply thinking the values [9.99e-01, 9.60e-01], without keys. (I do realize that dropping the keys from the dict might be blasphemy.)


#4

you can just do .values() on the result.

But you don’t know whether or case or control comes first!