Multiple group statistics

Hi!
I try to calculate aggregated by multiple groups statistics:

agg_table = hl.agg.group_by(hl.struct(gene=mt.csq_parsed[3],
                                      pheno=mt.pheno),
                hl.struct(
                    non_miss=hl.agg.count_where(hl.is_defined(mt.GT)),
                    hom_ref=hl.agg.count_where(mt.GT.is_hom_ref()),
                    hom_var=hl.agg.count_where(mt.GT.is_hom_var()),
                    het=hl.agg.count_where(mt.GT.is_het())
                         )
                           )

Object structure:

agg_table.describe()
--------------------------------------------------------
Type:
        dict<struct {
        gene: str, 
        pheno: str
    }, struct {
        non_miss: int64, 
        hom_ref: int64, 
        hom_var: int64, 
        het: int64
    }>
--------------------------------------------------------
Source:
    <hail.matrixtable.MatrixTable object at 0x7fcd265bd9d0>
Index:
    [] (aggregated)
--------------------------------------------------------
Includes aggregation with index []
    (Aggregation index may be promoted based on context)
--------------------------------------------------------

Then I try to extract data from the agg_table:

value = agg_table.get(hl.struct(gene=hl.literal('GENE_NAME'), pheno='case'))\
                            .get('het')
value
<Int64Expression of type int64>

But if I use any function I gen an error:

value.collect()
NotImplementedError: cannot convert aggregated expression to table
  1. how could I get individual values ​​from this data,
  2. or get a aggregated by gene MatrixTable?

Thanks!

Hey @Rost,

You’ve stumbled into a weird part of our interface where the error messages get a bit confusing!

The hl.agg.xxx functions always must be used inside a Table or MatrixTable method such as annotate, annotate_rows, aggregate. Note that agg_table.describe() indicates that age_table is an expression that is (aggregated).

If you want to run your query, try this instead:

grouped_stats = mt.aggregate_entries(
   hl.agg.group_by(hl.struct(gene=mt.csq_parsed[3], pheno=mt.pheno),
                   hl.struct(
                     non_miss=hl.agg.count_where(hl.is_defined(mt.GT)),
                     hom_ref=hl.agg.count_where(mt.GT.is_hom_ref()),
                     hom_var=hl.agg.count_where(mt.GT.is_hom_var()),
                     het=hl.agg.count_where(mt.GT.is_het())))

For your second question:

gene_mt = mt.group_rows_by(gene=mt.csq_parsed[3]).aggregate(
    non_miss=hl.agg.count_where(hl.is_defined(mt.GT)),
    hom_ref=hl.agg.count_where(mt.GT.is_hom_ref()),
    hom_var=hl.agg.count_where(mt.GT.is_hom_var()),
    het=hl.agg.count_where(mt.GT.is_het()))
)
gene_pheno_mt = gene_mt.group_cols_by(pheno=gene_mt.pheno).aggregate(
    non_miss=hl.agg.sum(gene_mt.non_miss),
    hom_ref=hl.agg.sum(gene_mt.hom_ref),
    hom_var=hl.agg.sum(gene_mt.hom_var),
    het=hl.agg.sum(gene_mt.het)
)

It’s unfortunate that you have to do this in two steps. It’s on my todo list to support row and column grouping at the same time.

Thanks @danking!

  1. The first issue is solved, but the obtained result structure seems uncomfortable.
{
Struct(gene='GENE1', pheno='case'): Struct(non_miss=1241, hom_ref=1241, hom_var=0, het=0),
 Struct(gene='GENE2', pheno='control'): Struct(non_miss=4214, hom_ref=4201, hom_var=0, het=13),
...}

Maybe there is a straight approach to get the pandas dataframe without several nested iterations and loss of efficiency?

  1. After aggregation by gene, the second aggregation by columns give me the following error:
TypeError: sum: parameter 'expr': expected expression of type int64 or float64, found <DictExpression of type dict<str, int64>>

The second approach gives you the two-dimensional structure you seek. Are you sure you’re using exactly what I posted above? Count_where should not return a dictionary.

Yes.
Maybe there should be aggregated gene mt instead initial mt?

gene_pheno_mt = gene_mt.group_cols_by(pheno=gene_mt.pheno).aggregate(
    non_miss=hl.agg.sum(gene_mt.non_miss),
    hom_ref=hl.agg.sum(gene_mt.hom_ref),
    hom_var=hl.agg.sum(gene_mt.hom_var),
    het=hl.agg.sum(gene_mt.het)
)

In this case all working fine.

1 Like

Ah, yes, sorry. I’ll edit the original.