I’m likely missing something basic because I am trying to find the best way to annotate rows with aggregations by grouped columns. For example, the case of adding call_rate in cases and controls to the original MatrixTable.
ann = mt.annotate_rows(group_call_rate = hl.agg.group_by(hl.struct(case_status = mt[‘Is.Control’]),
hl.agg.fraction(hl.is_defined(mt.GT))))
However this returns a dict (and ideally I’d like an array, like with PCA scores).
±----------------------------------------+
| group_call_rate |
±----------------------------------------+
| dict<struct{case_status: str}, float64> |
±----------------------------------------+
| {(“No”):9.99e-01,(“Yes”):9.60e-01} |
| {(“No”):9.99e-01,(“Yes”):9.92e-01} |
| {(“No”):9.97e-01,(“Yes”):9.93e-01} |
| {(“No”):1.00e+00,(“Yes”):9.90e-01} |
| {(“No”):1.00e+00,(“Yes”):9.91e-01} |
| {(“No”):1.00e+00,(“Yes”):9.71e-01} |
| {(“No”):1.00e+00,(“Yes”):9.77e-01} |
| {(“No”):1.00e+00,(“Yes”):9.93e-01} |
| {(“No”):1.00e+00,(“Yes”):9.98e-01} |
| {(“No”):1.00e+00,(“Yes”):9.98e-01} |
±----------------------------------------+
showing top 10 rows
I also tried group_cols_by but I haven’t been able to annotate the values back to the original MatrixTable.
results = (mt.group_cols_by(case_status = mt[‘Is.Control’])
.aggregate(call_rate = agg.fraction(hl.is_defined(mt.GT))))
results.describe()
results.entry.show(10)
±--------------±-----------±------------±----------+
| locus | alleles | case_status | call_rate |
±--------------±-----------±------------±----------+
| locus | array | str | float64 |
±--------------±-----------±------------±----------+
| chr1:39203 | [“C”,“T”] | “No” | 9.99e-01 |
| chr1:39203 | [“C”,“T”] | “Yes” | 9.60e-01 |
| chr1:39224 | [“C”,“T”] | “No” | 9.99e-01 |
| chr1:39224 | [“C”,“T”] | “Yes” | 9.92e-01 |
| chr1:115746 | [“C”,“T”] | “No” | 9.97e-01 |
| chr1:115746 | [“C”,“T”] | “Yes” | 9.93e-01 |
| chr1:133160 | [“G”,“A”] | “No” | 1.00e+00 |
| chr1:133160 | [“G”,“A”] | “Yes” | 9.90e-01 |
| chr1:135203 | [“G”,“A”] | “No” | 1.00e+00 |
| chr1:135203 | [“G”,“A”] | “Yes” | 9.91e-01 |
±--------------±-----------±------------±----------+
showing top 10 rows
Maybe this is possible with collect()? But this workaround doesn’t seem optimal…
Thanks so much!