We have added a new aggregator called `agg.approx_cdf`. Most users probably won’t need to call it directly and can instead use one of the convenience methods that wrap it:

- The new aggregator `agg.approx_quantiles`.
- The new plotting functions `plot.cdf` and `plot.pdf`.
- The plotting functions `cdf`, `pdf`, and `histogram` can take the results of the `approx_cdf` aggregator, which allows creating multiple plots from a single aggregation.

### Warning

These methods are all considered experimental. In particular, be aware that they are non-deterministic: computing `approx_cdf` multiple times will give slightly different results each time. It is currently not possible to seed the aggregator. The interface to the plotting functions is likely to change in the future.

### Highlights

`approx_cdf` computes a compressed representation of the distribution of the aggregated values. If `data` is the result of this aggregator, e.g. `data = t.aggregate(hl.agg.approx_cdf(t.foo))`, then `data` can be used for several things without requiring further computation on the source data. It can be passed to the plotting functions `plots.pdf`, `plots.cdf`, and `plots.histogram`, and it can be used to estimate quantiles. The aggregator `agg.approx_quantiles` is a convenience wrapper around `approx_cdf` that estimates one or more quantiles.

## Details

`agg.approx_cdf(expr, k)`

This is the core aggregator providing the new functionality. It takes an `expr` to summarize and a parameter `k` that controls the tradeoff between memory used and accuracy. The aggregator will use enough working memory to store a bit more than `3k` values sampled from `expr`, and the sample it returns contains fewer values than its working memory holds.

`approx_cdf` returns a struct containing two arrays, `values` and `ranks`. `values` is a sorted sample of values from `expr`. `ranks` is an array of `int64` of length `len(values) + 1`. Perhaps the easiest way to think about `ranks` is to consider the consecutive differences `weights = np.diff(ranks)`. `weights` has the same length as `values`, and `sum(weights)` equals `count(expr)`, the number of values being summarized. Together, `values` and `weights` approximate the true distribution of `expr` by the collection `values`, with each `values[i]` repeated `weights[i]` times.

For example, suppose `values = [0,2,5,6,9]` and `ranks = [0,3,4,5,8,10]`. Then `weights = [3,1,1,3,2]`. Together, these approximate the true distribution by the array `[0,0,0,2,5,6,6,6,9,9]`.
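The worked example can be reproduced in a few lines of plain Python. This is only an illustrative sketch: the variable name `expanded` is introduced here, and the list comprehension stands in for `np.diff` so that nothing beyond the standard library is needed.

```python
values = [0, 2, 5, 6, 9]
ranks = [0, 3, 4, 5, 8, 10]

# Consecutive differences of ranks: a pure-Python analogue of np.diff(ranks).
weights = [hi - lo for lo, hi in zip(ranks, ranks[1:])]

# Repeat each value by its weight to obtain the approximating collection.
expanded = [v for v, w in zip(values, weights) for _ in range(w)]

print(weights)   # [3, 1, 1, 3, 2]
print(expanded)  # [0, 0, 0, 2, 5, 6, 6, 6, 9, 9]
```

Note that `ranks[-1]` equals `sum(weights)`, which is the total number of values summarized.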

This pair of arrays can be used to estimate the rank of any value, or to estimate the value at any rank. Here, we define the rank `R(x)` of a value `x` to be the number of values of `expr` less than `x`. Equivalently, `R(x)` is the smallest index at which `x` could be found if `expr` were collected and sorted.

To estimate the rank of a value `x`, let `i` be the smallest index such that `values[i] >= x`, or `len(values)` if `x` is greater than all elements of `values`. Then we estimate `R(x) = ranks[i]`, and the quantile `Q(x) = ranks[i] / ranks[-1]`.
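The rank lookup is a binary search over `values`. Below is a minimal sketch operating on plain Python lists rather than on Hail's result struct; the function names `estimate_rank` and `estimate_quantile` are hypothetical and not part of the Hail API, and the summary is assumed to be non-empty.

```python
from bisect import bisect_left

def estimate_rank(values, ranks, x):
    # bisect_left returns the smallest i with values[i] >= x,
    # or len(values) if x is greater than every element.
    i = bisect_left(values, x)
    return ranks[i]

def estimate_quantile(values, ranks, x):
    # Assumes at least one value was summarized, so ranks[-1] > 0.
    return estimate_rank(values, ranks, x) / ranks[-1]

values = [0, 2, 5, 6, 9]
ranks = [0, 3, 4, 5, 8, 10]
print(estimate_rank(values, ranks, 6))        # 5
print(estimate_rank(values, ranks, 7))        # 8
print(estimate_quantile(values, ranks, 100))  # 1.0
```

For the example arrays, five of the ten approximating values are less than `6`, which matches the estimated rank of `5`.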

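To estimate the value at rank `r`, let `i` be the largest index such that `ranks[i] <= r`, capped at `len(values) - 1` so that `r = ranks[-1]` maps to the largest value. Then we estimate the value to be `values[i]`. In the example above, `r = 7` gives `i = 3` and an estimated value of `6`.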

To estimate the value with quantile `q`, we estimate the value at rank `floor(q * ranks[-1])`.
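The inverse lookup can be sketched the same way. This sketch picks the last index `i` with `ranks[i] <= r`, which is the entry whose run of repeated copies covers position `r` in the approximating array, and clamps the index so that `q = 1` maps to the largest value. The function names are hypothetical, and this is not necessarily how Hail implements the lookup internally.

```python
from bisect import bisect_right
from math import floor

def value_at_rank(values, ranks, r):
    # bisect_right gives the first index whose rank exceeds r, so
    # subtracting 1 yields the last index i with ranks[i] <= r.
    i = bisect_right(ranks, r) - 1
    # Clamp into the valid range of `values` (handles r == ranks[-1]).
    i = max(0, min(i, len(values) - 1))
    return values[i]

def value_at_quantile(values, ranks, q):
    return value_at_rank(values, ranks, floor(q * ranks[-1]))

values = [0, 2, 5, 6, 9]
ranks = [0, 3, 4, 5, 8, 10]
print(value_at_rank(values, ranks, 3))        # 2
print(value_at_quantile(values, ranks, 0.5))  # 6
print(value_at_quantile(values, ranks, 1.0))  # 9
```

For every rank from `0` to `9`, `value_at_rank` returns the element at that position of the expanded array `[0,0,0,2,5,6,6,6,9,9]`.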

`agg.approx_quantiles(expr, qs, k)`

This is a convenience wrapper around `approx_cdf`. `expr` and `k` are passed directly to `approx_cdf`. `qs` is either a single quantile or an array of quantiles, where a quantile is a number `q` with `0 <= q <= 1`. Returns an array of values whose true quantiles are close to the requested quantiles.

`plots.cdf(data, k)`

Produces a cumulative distribution plot. If `data` is an expression, this will first run the `approx_cdf` aggregator. Alternatively, you can run `data = t.aggregate(hl.agg.approx_cdf(t.foo))` yourself and pass the result to `plots.cdf`. This allows you to produce multiple plots without having to run multiple aggregations.

The plot can be panned and zoomed, and a hover tooltip will display value/rank pairs.

`plots.pdf(data, k, smoothing, interactive)`

Produces a probability density plot. As with `cdf`, `data` can either be an expression or the results of an `agg.approx_cdf` aggregation. `smoothing` controls the amount of smoothing applied.

If `interactive=True`, this returns two values, e.g. `p, i = plots.pdf(data, interactive=True)`. Displaying the plot using `plots.show(p, interact=i)` then shows the plot along with a slider to interactively change the smoothing parameter. Note that interactivity requires the ipywidgets package.

`plots.histogram(data, bins, interactive)`

The existing `histogram` plot has been modified so that `data` can now be the results of an `agg.approx_cdf` aggregation. If `interactive=True`, this returns two values as with `pdf`, and passing both to `plots.show` adds sliders to the histogram to vary the number of bins and to shift the bin edges left and right.
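To see why the bin count and bin edges can be changed without re-running the aggregation, here is one way approximate bin counts could be derived from the `values` and `ranks` arrays alone. This is an assumed, simplified scheme for illustration (each sampled value contributes its whole weight to the bin containing it), not Hail's actual plotting code; `histogram_counts` and its parameters are hypothetical names.

```python
def histogram_counts(values, ranks, bins, start, end):
    """Approximate counts for `bins` equal-width bins over [start, end)."""
    # Weight of each sampled value: consecutive differences of ranks.
    weights = [hi - lo for lo, hi in zip(ranks, ranks[1:])]
    width = (end - start) / bins
    counts = [0] * bins
    for v, w in zip(values, weights):
        if start <= v < end:
            # Guard against floating-point rounding at the upper edge.
            idx = min(int((v - start) / width), bins - 1)
            counts[idx] += w
    return counts

values = [0, 2, 5, 6, 9]
ranks = [0, 3, 4, 5, 8, 10]
print(histogram_counts(values, ranks, bins=2, start=0, end=10))  # [4, 6]
print(histogram_counts(values, ranks, bins=5, start=0, end=10))  # [3, 1, 1, 3, 2]
```

Both calls reuse the same summary, so rebinning only costs a pass over the compressed sample rather than over the source data.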