How to capture summarize() outputs?

pbilling · April 20, 2023, 6:12pm

Hello,

I find the outputs from the summarize() expression incredibly useful for characterizing data and performing coherence checks. How can I capture the output so I can use it programmatically in downstream analyses?

Thanks,
Paul

abg · April 27, 2023, 9:15am

Hi Paul, I’m sure someone from the Hail team will jump in here. In the meantime though I’ve been using agg.stats as it doesn’t look like there is any way to do this from summarize() at the moment.

Looking at mt.entry.DP.summarize() gives:

3930772 records.

DP (int32):
Non-missing: 3930772 (100.00%)
Missing: 0
Minimum: 0
Maximum: 3361
Mean: 72.40
Std Dev: 66.06

You can get all the same info with agg.stats:

q = mt.aggregate_entries(hl.agg.stats(mt.entry.DP))

You can then access:

q.mean
q.stdev
q.n
q.sum
q.min
q.max
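
A minimal sketch of how this could be wrapped up so that one aggregation pass returns the same numbers summarize() prints, including the missing count. The names stats_to_summary and summarize_entry_field are made up for illustration and are not part of Hail. The missing count relies on hl.agg.count_where and hl.is_missing, on the understanding that the n reported by agg.stats only counts non-missing values.

```python
def stats_to_summary(stats, n_missing):
    """Turn an agg.stats result plus a missing count into a plain dict.

    `stats` is any object exposing n, min, max, mean and stdev attributes,
    such as the struct returned by hl.agg.stats.
    """
    return {
        "records": stats.n + n_missing,  # non-missing + missing
        "non_missing": stats.n,
        "missing": n_missing,
        "min": stats.min,
        "max": stats.max,
        "mean": stats.mean,
        "stdev": stats.stdev,
    }


def summarize_entry_field(mt, expr):
    """Hypothetical helper: aggregate a numeric entry expression of `mt`."""
    # Imported here so that stats_to_summary can be used without Hail installed.
    import hail as hl

    # Both aggregators run in the same pass over the entries.
    result = mt.aggregate_entries(hl.struct(
        stats=hl.agg.stats(expr),
        n_missing=hl.agg.count_where(hl.is_missing(expr)),
    ))
    return stats_to_summary(result.stats, result.n_missing)
```

Something like summary = summarize_entry_field(mt, mt.entry.DP) would then give a dict whose values (summary["mean"], summary["missing"], and so on) can be passed to downstream code.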

agg.stats doesn’t work out of the box when the data is in an array field like AD, but with a few modifications you can still access the individual array elements or count the missing array elements.
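
One possible shape for those modifications, as a sketch only: index into the array for a single element, and use hl.agg.explode to aggregate over all elements. The helper name array_field_summary and the result field names are hypothetical. The sketch assumes every defined array has at least one element (true for AD in typical callsets) and that explode contributes nothing for entries whose whole array is missing.

```python
def array_field_summary(mt, array_expr):
    """Hypothetical helper: summarize an array entry field such as mt.AD."""
    import hail as hl  # imported lazily, as this is only a sketch

    return mt.aggregate_entries(hl.struct(
        # Statistics of the first element only (the reference depth for AD).
        first=hl.agg.stats(array_expr[0]),
        # Statistics pooled over every element of every array.
        pooled=hl.agg.explode(lambda x: hl.agg.stats(x), array_expr),
        # Number of missing elements found inside the arrays.
        n_missing_elements=hl.agg.explode(
            lambda x: hl.agg.count_where(hl.is_missing(x)), array_expr),
        # Number of entries whose whole array is missing.
        n_missing_arrays=hl.agg.count_where(hl.is_missing(array_expr)),
    ))
```

With r = array_field_summary(mt, mt.AD), the pieces would be available as r.first.mean, r.pooled.max, r.n_missing_elements and so on.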

It would be nice if you could directly capture this from summarize() in a way similar to hl.summarize_variants(mt, show=False) - it would be a lot fewer lines of code!

Angus

danking · April 27, 2023, 3:54pm

As @abg points out, Hail’s summarize is built out of normal aggregators. We don’t provide a simple way to directly access these aggregators.

If you’re comfortable looking at the source code: every field has a type, and that type has an associated Expression class in base_expression.py or typed_expressions.py. The base Expression class has _all_summary_aggs(), which gives you all the aggregators used by summarize. It is defined in terms of _summary_aggs, which holds the type-specific aggregations.

I’ll let the team know that it would be great to have a programmatic way to access summary information.

pbilling · April 27, 2023, 6:21pm

Thank you @abg and @danking! The agg.stats() method looks like a great solution. I’ll give it a shot.

I’m still finding my bearings with expressions, but I’ll definitely take a look at the source code, and I appreciate you passing this on to the team.

Best,
Paul
