 # Multiple group statistics

Hi!
I try to calculate aggregated by multiple groups statistics:

``````agg_table = hl.agg.group_by(hl.struct(gene=mt.csq_parsed,
pheno=mt.pheno),
hl.struct(
non_miss=hl.agg.count_where(hl.is_defined(mt.GT)),
hom_ref=hl.agg.count_where(mt.GT.is_hom_ref()),
hom_var=hl.agg.count_where(mt.GT.is_hom_var()),
het=hl.agg.count_where(mt.GT.is_het())
)
)
``````

Object structure:

``````agg_table.describe()
--------------------------------------------------------
Type:
dict<struct {
gene: str,
pheno: str
}, struct {
non_miss: int64,
hom_ref: int64,
hom_var: int64,
het: int64
}>
--------------------------------------------------------
Source:
<hail.matrixtable.MatrixTable object at 0x7fcd265bd9d0>
Index:
[] (aggregated)
--------------------------------------------------------
Includes aggregation with index []
(Aggregation index may be promoted based on context)
--------------------------------------------------------
``````

Then I try to extract data from the `agg_table`:

``````value = agg_table.get(hl.struct(gene=hl.literal('GENE_NAME'), pheno='case'))\
.get('het')
value
<Int64Expression of type int64>
``````

But if I use any function I gen an error:

``````value.collect()
NotImplementedError: cannot convert aggregated expression to table
``````
1. how could I get individual values ​​from this data,
2. or get a aggregated by gene MatrixTable?

Thanks!

Hey @Rost,

You’ve stumbled into a weird part of our interface where the error messages get a bit confusing!

The `hl.agg.xxx` functions always must be used inside a Table or MatrixTable method such as `annotate`, `annotate_rows`, `aggregate`. Note that `agg_table.describe()` indicates that `age_table` is an expression that is `(aggregated)`.

``````grouped_stats = mt.aggregate_entries(
hl.agg.group_by(hl.struct(gene=mt.csq_parsed, pheno=mt.pheno),
hl.struct(
non_miss=hl.agg.count_where(hl.is_defined(mt.GT)),
hom_ref=hl.agg.count_where(mt.GT.is_hom_ref()),
hom_var=hl.agg.count_where(mt.GT.is_hom_var()),
het=hl.agg.count_where(mt.GT.is_het())))

``````

``````gene_mt = mt.group_rows_by(gene=mt.csq_parsed).aggregate(
non_miss=hl.agg.count_where(hl.is_defined(mt.GT)),
hom_ref=hl.agg.count_where(mt.GT.is_hom_ref()),
hom_var=hl.agg.count_where(mt.GT.is_hom_var()),
het=hl.agg.count_where(mt.GT.is_het()))
)
gene_pheno_mt = gene_mt.group_cols_by(pheno=gene_mt.pheno).aggregate(
non_miss=hl.agg.sum(gene_mt.non_miss),
hom_ref=hl.agg.sum(gene_mt.hom_ref),
hom_var=hl.agg.sum(gene_mt.hom_var),
het=hl.agg.sum(gene_mt.het)
)
``````

It’s unfortunate that you have to do this in two steps. It’s on my todo list to support row and column grouping at the same time.

Thanks @danking!

1. The first issue is solved, but the obtained result structure seems uncomfortable.
``````{
Struct(gene='GENE1', pheno='case'): Struct(non_miss=1241, hom_ref=1241, hom_var=0, het=0),
Struct(gene='GENE2', pheno='control'): Struct(non_miss=4214, hom_ref=4201, hom_var=0, het=13),
...}
``````

Maybe there is a straight approach to get the pandas dataframe without several nested iterations and loss of efficiency?

1. After aggregation by gene, the second aggregation by columns give me the following error:
``````TypeError: sum: parameter 'expr': expected expression of type int64 or float64, found <DictExpression of type dict<str, int64>>
``````

The second approach gives you the two-dimensional structure you seek. Are you sure you’re using exactly what I posted above? Count_where should not return a dictionary.

Yes.
Maybe there should be aggregated `gene mt` instead initial `mt`?

``````gene_pheno_mt = gene_mt.group_cols_by(pheno=gene_mt.pheno).aggregate(
non_miss=hl.agg.sum(gene_mt.non_miss),
hom_ref=hl.agg.sum(gene_mt.hom_ref),
hom_var=hl.agg.sum(gene_mt.hom_var),
het=hl.agg.sum(gene_mt.het)
)
``````

In this case all working fine.

1 Like

Ah, yes, sorry. I’ll edit the original.