We have added a new aggregator called agg.approx_cdf. Most users probably won’t need to use it directly, and will instead use one of the convenience methods that wrap it:
- The new aggregator agg.approx_quantiles
- The new plotting functions plots.cdf and plots.pdf
- The plotting function plots.histogram can now take the results of the approx_cdf aggregator, which allows creating multiple plots from a single aggregation.
These methods are all considered experimental. In particular, be aware that they are all non-deterministic: computing approx_cdf multiple times will give slightly different results each time. It is currently not possible to seed the aggregator. The interface to the plotting functions is likely to change in the future.
approx_cdf computes a compressed representation of the distribution of the aggregated values. If data is the result of this aggregator, e.g. data = t.aggregate(hl.agg.approx_cdf(t.foo)), then data can be used for several things without requiring further computation on the source data. data can be passed to the plotting functions plots.cdf, plots.pdf, and plots.histogram. It can also be used to estimate quantiles. The aggregator agg.approx_quantiles is added as a convenience wrapper around approx_cdf, to estimate one or more quantiles.
agg.approx_cdf(expr, k)
This is the core aggregator providing the new functionality. It takes an expr to summarize, and a parameter k which controls the tradeoff between memory used and accuracy. The aggregator will use enough working memory to store a bit more than 3k values sampled from expr, and will produce a sample containing fewer values than its working memory can hold.
approx_cdf returns a struct containing two arrays, values and ranks. values is a sorted sample of values from expr. ranks is an array of int64 of length len(values) + 1. Perhaps the easiest way to think about ranks is to consider the consecutive differences weights = np.diff(ranks). weights is the same length as values, and sums to count(expr), the number of values being summarized. Together, values and weights approximate the true distribution of expr by the collection values, with each values[i] repeated weights[i] many times.
For example, suppose values = [0,2,5,6,9] and ranks = [0,3,4,5,8,10]. Then weights = [3,1,1,3,2]. Together, these approximate the true distribution by the array [0,0,0,2,5,6,6,6,9,9].
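The expansion in this example can be reproduced with a few lines of plain Python. This is only an illustrative sketch: expand_cdf is a hypothetical helper, not part of the library, and it uses a list comprehension in place of np.diff so that it runs without NumPy.

```python
def expand_cdf(values, ranks):
    """Expand a (values, ranks) pair into the sorted array it approximates.

    Hypothetical helper for illustration only; not part of the library.
    """
    # Consecutive differences of ranks, equivalent to np.diff(ranks).
    weights = [hi - lo for lo, hi in zip(ranks, ranks[1:])]
    expanded = []
    for value, weight in zip(values, weights):
        # Each sampled value stands in for `weight` values of the original data.
        expanded.extend([value] * weight)
    return weights, expanded


values = [0, 2, 5, 6, 9]
ranks = [0, 3, 4, 5, 8, 10]
weights, expanded = expand_cdf(values, ranks)
print(weights)
print(expanded)
```

Note that the expanded array always has ranks[-1] elements, which is the number of values that were summarized.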
This pair of arrays can be used to estimate the rank of any value, or to estimate the value at any rank. Here, we define the rank R(x) of some value x to be the number of values of expr less than x. An equivalent view is that R(x) is the (smallest) index where x could be found if expr were collected and sorted.
To estimate the rank of a value x, let i be the smallest index such that values[i] >= x, or len(values) if x is greater than all elements of values. Then we estimate R(x) = ranks[i], and quantile Q(x) = ranks[i] / ranks[-1].
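As a rough sketch of this lookup, the standard-library function bisect.bisect_left finds exactly the index described: the smallest i with values[i] >= x, or len(values) when no such element exists. The function names estimate_rank and estimate_quantile are hypothetical and chosen only for this illustration, and the arrays are the ones from the example above.

```python
from bisect import bisect_left


def estimate_rank(values, ranks, x):
    """Estimate R(x), the number of summarized values less than x."""
    # Smallest i with values[i] >= x, or len(values) if x exceeds every sample.
    i = bisect_left(values, x)
    return ranks[i]


def estimate_quantile(values, ranks, x):
    """Estimate Q(x) = R(x) / N, where N = ranks[-1] is the total count."""
    return estimate_rank(values, ranks, x) / ranks[-1]


values = [0, 2, 5, 6, 9]
ranks = [0, 3, 4, 5, 8, 10]
for x in (-1, 5, 6, 7, 100):
    print(x, estimate_rank(values, ranks, x), estimate_quantile(values, ranks, x))
```

Because ranks has one more element than values, the index len(values) is always valid, so values above the whole sample get rank ranks[-1].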
To estimate the value at rank r, let i be the largest index such that ranks[i] <= r. Then we estimate the value to be values[i].
To estimate the value with quantile q, we estimate the value at rank floor(q * ranks[-1]).
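The inverse lookup can be sketched in the same way. The sketch below takes i to be the largest index with ranks[i] <= r, which bisect.bisect_right gives after subtracting one. It also clamps i to the last element of values; that clamp is an assumption added here so that r = ranks[-1] (that is, q = 1) stays in bounds, and it may differ from what the library does. The function names are hypothetical.

```python
import math
from bisect import bisect_right


def value_at_rank(values, ranks, r):
    """Estimate the value found at index r of the sorted data."""
    # bisect_right(ranks, r) - 1 is the largest i with ranks[i] <= r.
    i = bisect_right(ranks, r) - 1
    # Assumption: clamp so that r == ranks[-1] maps to the last sampled value.
    i = max(0, min(i, len(values) - 1))
    return values[i]


def value_at_quantile(values, ranks, q):
    """Estimate the value with quantile q, where 0 <= q <= 1."""
    return value_at_rank(values, ranks, math.floor(q * ranks[-1]))


values = [0, 2, 5, 6, 9]
ranks = [0, 3, 4, 5, 8, 10]
for q in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(q, value_at_quantile(values, ranks, q))
```

For every rank r below ranks[-1], this returns the element at index r of the expanded array [0,0,0,2,5,6,6,6,9,9] from the earlier example.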
agg.approx_quantiles(expr, qs, k)
This is a convenience wrapper around approx_cdf. expr and k are passed directly to approx_cdf. qs is either a single quantile or an array of quantiles, where a quantile is a number 0 <= q <= 1. Returns an array of values whose true quantiles are close to the requested quantiles.
plots.cdf produces a cumulative density plot. If data is an expression, this will first run the approx_cdf aggregator. Alternatively, you can run data = t.aggregate(hl.agg.approx_cdf(t.foo)) yourself and pass the results to plots.cdf. This allows you to produce multiple plots without having to run multiple aggregations.
The plot can be panned and zoomed, and a hover tooltip will display value/rank pairs.
plots.pdf(data, k, smoothing, interactive)
Produces a probability density plot. As with plots.cdf, data can either be an expression or the results of an approx_cdf aggregation. smoothing controls the amount of smoothing being applied. If interactive=True, this returns two values, e.g. p, i = plots.pdf(data, interactive=True). Then displaying the plot using plots.show(p, interact=i) shows the plot along with a slider to interactively change the smoothing parameter. Note that interactivity requires the ipywidgets package.
plots.histogram(data, bins, interactive)
The histogram plot has been modified so that data can now be the results of an agg.approx_cdf aggregation. If interactive=True, this returns two values as with plots.pdf. Displaying the plot using plots.show adds sliders to the histogram to vary the number of bins and to shift the bin edges left and right.
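To see why the bins can be changed without going back to the source data, here is a minimal sketch of estimating bin counts directly from a values/ranks pair. It is not taken from the plots.histogram implementation, which may well work differently; bin_counts is a hypothetical helper that simply takes the difference of the estimated ranks at consecutive bin edges.

```python
from bisect import bisect_left


def bin_counts(values, ranks, edges):
    """Estimate how many summarized values fall in each bin [edges[j], edges[j+1]).

    Hypothetical helper for illustration; not the library's implementation.
    """
    # Estimated rank of each edge: the number of values less than that edge.
    edge_ranks = [ranks[bisect_left(values, e)] for e in edges]
    # The count in a bin is the difference between the ranks of its two edges.
    return [hi - lo for lo, hi in zip(edge_ranks, edge_ranks[1:])]


values = [0, 2, 5, 6, 9]
ranks = [0, 3, 4, 5, 8, 10]

# The same aggregation result serves any choice of bins.
print(bin_counts(values, ranks, [0, 5, 10]))
print(bin_counts(values, ranks, [0, 3, 6, 9, 12]))
```

Changing the number of bins or shifting the edges only changes the edges argument, which is why a slider can update the plot without re-running the aggregation.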