Hello,
I find the outputs from the summarize() expression incredibly useful for characterizing data and performing coherence checks. How can I capture the output so I can use it programmatically in downstream analyses?
Thanks,
Paul
Hi Paul, I’m sure someone from the Hail team will jump in here. In the meantime, though, I’ve been using agg.stats, as it doesn’t look like there is any way to do this from summarize() at the moment.
Looking at mt.entry.DP.summarize() gives:
3930772 records.
You can get all the same info with agg.stats:
q = mt.aggregate_entries(hl.agg.stats(mt.entry.DP))
You can then access:
q.mean
q.stdev
q.n
q.sum
q.min
q.max
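To pass these values on to later steps, the fields can be copied into a plain Python dict. This is only a minimal sketch: it assumes Hail is installed and that mt has a numeric entry field called DP. The helper names stats_to_dict and summarize_entry_field are made up for illustration and are not part of Hail. It also adds a count of missing entries using hl.agg.count_where.

```python
STAT_FIELDS = ("mean", "stdev", "min", "max", "n", "sum")


def stats_to_dict(stats, prefix=""):
    """Copy the fields of an hl.agg.stats result into a plain dict."""
    return {prefix + name: getattr(stats, name) for name in STAT_FIELDS}


def summarize_entry_field(mt, field="DP"):
    """Aggregate one numeric entry field and return plain Python values.

    Hypothetical helper: `mt` is assumed to be a Hail MatrixTable.
    """
    import hail as hl  # imported here so stats_to_dict works without Hail

    expr = mt[field]
    # A single pass over the entries computes both aggregations.
    result = mt.aggregate_entries(
        hl.struct(
            stats=hl.agg.stats(expr),
            n_missing=hl.agg.count_where(hl.is_missing(expr)),
        )
    )
    summary = stats_to_dict(result.stats, prefix=field + "_")
    summary[field + "_n_missing"] = result.n_missing
    return summary
```

With that in place, summary = summarize_entry_field(mt, "DP") gives a dict whose values (summary["DP_mean"], summary["DP_n_missing"], and so on) can be used in ordinary Python code.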
This doesn’t work out of the box if the data is in an array field like AD, but with a few modifications you can still access the individual array components or count the number of missing array elements.
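One possible set of modifications for an array field is sketched below; it is not necessarily the exact approach used above. It assumes Hail is installed, that mt has an array entry field called AD, and that every defined AD array has the same length (hl.agg.array_agg aggregates element-wise, so this will not hold for datasets where the array length varies between rows). The helper names per_index_rows and summarize_array_entry_field are hypothetical.

```python
def per_index_rows(per_index_stats):
    """Turn a list of hl.agg.stats results into one plain dict per array index."""
    fields = ("mean", "stdev", "min", "max", "n", "sum")
    return [
        {"index": i, **{name: getattr(s, name) for name in fields}}
        for i, s in enumerate(per_index_stats)
    ]


def summarize_array_entry_field(mt, field="AD"):
    """Per-index stats and missingness counts for an array entry field.

    Hypothetical helper: `mt` is assumed to be a Hail MatrixTable whose
    defined `field` arrays all have the same length.
    """
    import hail as hl  # imported here so per_index_rows works without Hail

    arr = mt[field]
    result = mt.aggregate_entries(
        hl.struct(
            # Element-wise stats, computed over entries where the array is defined.
            per_index=hl.agg.filter(
                hl.is_defined(arr),
                hl.agg.array_agg(lambda x: hl.agg.stats(x), arr),
            ),
            # Entries where the whole array is missing.
            n_missing_arrays=hl.agg.count_where(hl.is_missing(arr)),
            # Missing elements inside arrays that are themselves defined.
            n_missing_elements=hl.agg.sum(
                hl.len(arr.filter(lambda x: hl.is_missing(x)))
            ),
        )
    )
    return {
        "per_index": per_index_rows(result.per_index or []),
        "n_missing_arrays": result.n_missing_arrays,
        "n_missing_elements": result.n_missing_elements,
    }
```

If only one component is of interest, a simpler route is to index the array directly, for example mt.aggregate_entries(hl.agg.stats(mt.AD[0])).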
It would be nice if you could directly capture this from summarize() in a way similar to hl.summarize_variants(mt, show=False) - it would be a lot fewer lines of code!
Angus
As @abg points out, Hail’s summarize is built out of normal aggregators. We don’t provide a simple way to directly access these aggregators.
If you’re comfortable looking at the source code: every field has a type, and that type has an associated Expression in base_expression.py or typed_expressions.py. The base expression has _all_summary_aggs(), which gives you all the aggregators used by summarize. It is defined in terms of _summary_aggs, which has the type-specific aggregations.
I’ll let the team know that it would be great to have a programmatic way to access summary information.