How to capture summarize() outputs?

Hello,

I find the outputs from the summarize() expression incredibly useful for characterizing data and performing coherence checks. How can I capture the output so I can use it programmatically in downstream analyses?

Thanks,
Paul

Hi Paul, I’m sure someone from the Hail team will jump in here. In the meantime, I’ve been using hl.agg.stats, since there doesn’t appear to be a way to capture the output of summarize() at the moment.

For example, mt.entry.DP.summarize() prints:

3930772 records.

  • DP (int32):
    Non-missing: 3930772 (100.00%)
    Missing: 0
    Minimum: 0
    Maximum: 3361
    Mean: 72.40
    Std Dev: 66.06

You can get most of the same information with agg.stats (everything except the missing count, which needs a separate aggregator):

q = mt.aggregate_entries(hl.agg.stats(mt.entry.DP))

You can then access:

q.mean
q.stdev
q.n
q.sum
q.min
q.max
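
If you also want the record and missing counts that summarize() prints, one option is to bundle several aggregators into a single call. The following is a sketch rather than anything official: capture_dp_summary and to_record are made-up names, and the stand-in object at the bottom just reuses the numbers from the output above so the flattening step can be tried without a Hail session.

```python
from types import SimpleNamespace


def capture_dp_summary(mt):
    """Collect the numbers summarize() prints for the entry field DP in one pass.

    Hypothetical helper; `mt` is assumed to be a MatrixTable with an int entry field DP.
    """
    import hail as hl  # imported lazily so the rest of this block runs without Hail

    return mt.aggregate_entries(hl.struct(
        n_records=hl.agg.count(),                            # all entries
        n_missing=hl.agg.count_where(hl.is_missing(mt.DP)),  # entries with missing DP
        stats=hl.agg.stats(mt.DP),                           # mean, stdev, min, max, n, sum
    ))


def to_record(result):
    """Flatten the aggregation result into a plain dict for downstream use."""
    s = result.stats
    return {
        "n_records": result.n_records,
        "n_missing": result.n_missing,
        "frac_missing": result.n_missing / result.n_records if result.n_records else None,
        "min": s.min,
        "max": s.max,
        "mean": s.mean,
        "stdev": s.stdev,
    }


# Stand-in for the Struct Hail would return, filled with the values printed above.
example = SimpleNamespace(
    n_records=3930772,
    n_missing=0,
    stats=SimpleNamespace(min=0, max=3361, mean=72.40, stdev=66.06),
)
record = to_record(example)
print(record)
```

With a real MatrixTable you would call to_record(capture_dp_summary(mt)) instead of using the stand-in.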

This doesn’t work out of the box if the data is in an array field like AD, but with a few modifications you can still compute statistics on the individual array components or count the number of missing array elements.
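
For array fields, something along these lines might work. It is a rough sketch, assuming AD is an array entry field whose defined values all have the same length (hl.agg.array_agg raises an error otherwise). ad_summary and by_allele are hypothetical names, the ref/alt labels assume biallelic sites, and the stand-in values at the bottom are made up purely to show the labelling step.

```python
from types import SimpleNamespace


def ad_summary(mt):
    """Per-position stats and missingness counts for the array entry field AD.

    Hypothetical helper; assumes every defined AD array has the same length.
    """
    import hail as hl  # imported lazily so the rest of this block runs without Hail

    has_ad = hl.is_defined(mt.AD)
    return mt.aggregate_entries(hl.struct(
        # Entries where the whole array is missing.
        n_missing_arrays=hl.agg.count_where(hl.is_missing(mt.AD)),
        # Missing elements inside arrays that are themselves defined.
        n_missing_elements=hl.agg.filter(
            has_ad,
            hl.agg.explode(lambda x: hl.agg.count_where(hl.is_missing(x)), mt.AD),
        ),
        # One stats struct per array position (AD[0], AD[1], ...).
        per_position=hl.agg.filter(
            has_ad,
            hl.agg.array_agg(lambda x: hl.agg.stats(x), mt.AD),
        ),
    ))


def by_allele(per_position, labels=("ref", "alt")):
    """Pair each per-position stats struct with a label (biallelic by default)."""
    if len(per_position) != len(labels):
        raise ValueError("number of labels must match the array length")
    return dict(zip(labels, per_position))


# Made-up stand-ins for two stats structs, only to show the labelling step.
fake = [SimpleNamespace(mean=30.0), SimpleNamespace(mean=12.5)]
labelled = by_allele(fake)
print(labelled["alt"].mean)
```

If you only need one component, indexing directly, e.g. hl.agg.stats(mt.AD[1]), is simpler than aggregating the whole array.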

It would be nice if you could directly capture this from summarize() in a way similar to hl.summarize_variants(mt, show=False) - it would be a lot fewer lines of code!

Angus

As @abg points out, Hail’s summarize is built out of normal aggregators. We don’t provide a simple way to directly access these aggregators.

If you’re comfortable looking at the source code: every field has a type, and that type has an associated Expression class in base_expression.py or typed_expressions.py. The base Expression class has _all_summary_aggs(), which returns all the aggregators used by summarize. It is defined in terms of _summary_aggs, which supplies the type-specific aggregations.

I’ll let the team know that it would be great to have a programmatic way to access summary information.

Thank you @abg and @danking! The agg.stats() method looks like a great solution. I’ll give it a shot.

I’m still finding my bearings with expressions, but I’ll definitely take a look at the source code, and I appreciate you passing this on to the team.

Best,
Paul