We have added a new aggregator called agg.approx_cdf. Most users probably won’t need to use this directly, and can instead use one of the convenience methods wrapping it:
- The new aggregator agg.approx_quantiles.
- The new plotting functions plot.cdf and plot.pdf.
- The plotting functions cdf, pdf, and histogram can take the results of the approx_cdf aggregator, to allow creating multiple plots from a single aggregation.
Warning
These methods are all considered experimental. In particular, be aware that these methods are all non-deterministic: computing approx_cdf multiple times will give slightly different results each time. It is currently not possible to seed the aggregator. The interface to the plotting functions is likely to change in the future.
Highlights
approx_cdf computes a compressed representation of the distribution of the values aggregated. If data is the result of this aggregator, e.g. data = t.aggregate(hl.agg.approx_cdf(t.foo)), then data can be used for several things without requiring further computation on the source data. It can be passed to the plotting functions plots.pdf, plots.cdf, and plots.histogram. It can also be used to estimate quantiles. The aggregator agg.approx_quantiles is added as a convenience wrapper around approx_cdf, to estimate one or more quantiles.
Details
agg.approx_cdf(expr, k)
This is the core aggregator providing the new functionality. It takes an expr to summarize, and a parameter k which controls the tradeoff between memory used and accuracy. The aggregator will use enough working memory to store a bit more than 3k values sampled from expr, and will produce a sample of fewer values than its working memory.
approx_cdf returns a struct containing two arrays, values and ranks. values is a sorted sample of values from expr. ranks is an array of int64 of length len(values) + 1. Perhaps the easiest way to think about ranks is to consider the consecutive differences weights = np.diff(ranks). weights is the same length as values, and sum(weights) equals count(expr), the number of values being summarized. Together, values and weights approximate the true distribution of expr by the collection values, with each values[i] repeated weights[i] times.
For example, suppose values = [0,2,5,6,9] and ranks = [0,3,4,5,8,10]. Then weights = [3,1,1,3,2]. Together, this approximates the true distribution by the array [0,0,0,2,5,6,6,6,9,9].
This pair of arrays can be used to estimate the rank of any value, or to estimate the value at any rank. Here, we define the rank R(x) of some value x to be the number of values of expr less than x. An equivalent view is that R(x) is the (smallest) index where x could be found if expr were collected and sorted.
To estimate the rank of a value x, let i be the smallest index such that values[i] >= x, or len(values) if x is greater than all elements of values. Then we estimate R(x) = ranks[i], and quantile Q(x) = ranks[i] / ranks[-1].
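As a minimal sketch of the rank estimate, the index i described above is exactly what Python's bisect.bisect_left returns on a sorted list. The function names estimate_rank and estimate_quantile are hypothetical helpers, not part of Hail, and the arrays are the ones from the worked example.

```python
from bisect import bisect_left

# Arrays from the worked example (not produced by a real aggregation).
values = [0, 2, 5, 6, 9]
ranks = [0, 3, 4, 5, 8, 10]

def estimate_rank(values, ranks, x):
    """Estimate R(x), the number of summarized values less than x."""
    # Smallest i with values[i] >= x, or len(values) if x exceeds them all.
    i = bisect_left(values, x)
    return ranks[i]

def estimate_quantile(values, ranks, x):
    """Estimate Q(x) = R(x) / total count."""
    return estimate_rank(values, ranks, x) / ranks[-1]

print(estimate_rank(values, ranks, 6))      # 5
print(estimate_quantile(values, ranks, 6))  # 0.5
```

In the approximating array [0,0,0,2,5,6,6,6,9,9], five elements are less than 6, which matches the estimate.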
To estimate the value at rank r, let i be the largest index such that ranks[i] <= r (capped at len(values) - 1, which matters only when r equals ranks[-1]). Then we estimate the value to be values[i].
To estimate the value with quantile q, we estimate the value at rank floor(q * ranks[-1]).
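The lookup in the other direction can be sketched the same way. This sketch takes i to be the largest index with ranks[i] <= r, which is the reading under which the worked example comes out right, and clamps i so that q = 1 maps to the last value. The helper names value_at_rank and value_at_quantile are assumptions for illustration; this is not Hail's implementation.

```python
from bisect import bisect_right
from math import floor

# Arrays from the worked example (not produced by a real aggregation).
values = [0, 2, 5, 6, 9]
ranks = [0, 3, 4, 5, 8, 10]

def value_at_rank(values, ranks, r):
    """Estimate the value that would sit at index r of the sorted data."""
    # bisect_right(ranks, r) - 1 is the largest i with ranks[i] <= r.
    i = bisect_right(ranks, r) - 1
    # Clamp so that r == ranks[-1] (quantile 1.0) maps to the last value.
    i = max(0, min(i, len(values) - 1))
    return values[i]

def value_at_quantile(values, ranks, q):
    """Estimate the value with quantile q, for 0 <= q <= 1."""
    return value_at_rank(values, ranks, floor(q * ranks[-1]))

print(value_at_rank(values, ranks, 3))        # 2
print(value_at_quantile(values, ranks, 0.5))  # 6
```

For every rank r from 0 to 9, value_at_rank agrees with element r of the approximating array [0,0,0,2,5,6,6,6,9,9].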
agg.approx_quantiles(expr, qs, k)
This is a convenient wrapper around approx_cdf. expr and k are passed directly to approx_cdf. qs is either a single quantile or an array of quantiles, where a quantile is a number q with 0 <= q <= 1. Returns an array of values whose true quantiles are close to the requested quantiles.
plots.cdf(data, k)
Produces a cumulative distribution plot. If data is an expression, this will first run the approx_cdf aggregator. Alternatively, you can run data = t.aggregate(hl.agg.approx_cdf(t.foo)) yourself, and pass the results to plots.cdf. This allows you to produce multiple plots without having to run multiple aggregations.
The plot can be panned and zoomed, and a hover tooltip will display value/rank pairs.
plots.pdf(data, k, smoothing, interactive)
Produces a probability density plot. As with cdf, data can either be an expression or the results of an agg.approx_cdf aggregation. smoothing controls the amount of smoothing being applied.
If interactive=True, this returns two values, e.g. p, i = plots.pdf(data, interactive=True). Then displaying the plot using plots.show(p, interact=i) shows the plot along with a slider to interactively change the smoothing parameter. Note that interactivity requires the ipywidgets package.
plots.histogram(data, bins, interactive)
The existing histogram plot has been modified, so that data can now be the results of an agg.approx_cdf aggregation. If interactive=True, this returns two values as with pdf, and passing both to plots.show adds sliders to the histogram to vary the number of bins and to shift the bin edges left and right.
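To see why a single approx_cdf result is enough to redraw a histogram with different bins, here is a rough sketch that derives bin counts from values and ranks alone. It is not how plots.histogram is implemented; histogram_counts and its edges parameter are hypothetical names, and the arrays are the ones from the worked example.

```python
from bisect import bisect_right

# Arrays from the worked example (not produced by a real aggregation).
values = [0, 2, 5, 6, 9]
ranks = [0, 3, 4, 5, 8, 10]

def histogram_counts(values, ranks, edges):
    """Approximate per-bin counts for sorted bin edges.

    Bins are half-open [edges[j], edges[j + 1]), except the last bin,
    which also includes its right edge.
    """
    weights = [hi - lo for lo, hi in zip(ranks, ranks[1:])]
    counts = [0] * (len(edges) - 1)
    for v, w in zip(values, weights):
        if edges[0] <= v <= edges[-1]:
            # Index of the bin containing v, clamped for v == edges[-1].
            b = min(bisect_right(edges, v) - 1, len(counts) - 1)
            counts[b] += w
    return counts

# Re-binning needs no second pass over the source data.
print(histogram_counts(values, ranks, [0, 5, 10]))           # [4, 6]
print(histogram_counts(values, ranks, [0, 2, 4, 6, 8, 10]))  # [3, 1, 1, 3, 2]
```

Changing the number of bins or shifting the edges only re-runs this cheap computation on the summary, which is what makes the interactive sliders practical.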