[Feature] Approximate quantiles, cdf and pdf plots

We have added a new aggregator called agg.approx_cdf. Most users probably won’t need to use this directly, and can instead use one of the convenience methods wrapping it:

  • The new aggregator agg.approx_quantiles.
  • The new plotting functions plot.cdf and plot.pdf.
  • The plotting functions cdf, pdf, and histogram can take the results of the approx_cdf aggregator, to allow creating multiple plots from a single aggregation.

Warning

These methods are all considered experimental. In particular, be aware that these methods are all non-deterministic: computing approx_cdf multiple times will give slightly different results each time. It is currently not possible to seed the aggregator. The interface to the plotting functions is likely to change in the future.

Highlights

approx_cdf computes a compressed representation of the distribution of the aggregated values. If data is the result of this aggregator, e.g. data = t.aggregate(hl.agg.approx_cdf(t.foo)), then data can be used for several things without requiring further computation on the source data: it can be passed to the plotting functions plots.pdf, plots.cdf, and plots.histogram, and it can be used to estimate quantiles. The aggregator agg.approx_quantiles is added as a convenience wrapper around approx_cdf, to estimate one or more quantiles.

Details

agg.approx_cdf(expr, k)

This is the core aggregator providing the new functionality. It takes an expr to summarize, and a parameter k which controls the tradeoff between memory used and accuracy. The aggregator will use enough working memory to store a bit more than 3k values sampled from expr, and will produce a sample of fewer values than its working memory.

approx_cdf returns a struct containing two arrays, values and ranks. values is a sorted sample of values from expr. ranks is an array of int64 of length len(values) + 1. Perhaps the easiest way to think about ranks is to consider the consecutive differences weights = np.diff(ranks). weights is the same length as values, and sum(weights) equals count(expr), the number of values being summarized. Together, values and weights approximate the true distribution of expr by the collection values, with each values[i] repeated weights[i] many times.

For example, suppose values = [0,2,5,6,9] and ranks = [0,3,4,5,8,10]. Then weights = [3,1,1,3,2]. Together, these approximate the true distribution by the array [0,0,0,2,5,6,6,6,9,9].
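The relationship between values, ranks, and weights can be checked with a few lines of plain Python. This is only an illustrative sketch: expand_cdf is a hypothetical helper name, not part of the new API, and the list comprehension stands in for np.diff(ranks).

```python
def expand_cdf(values, ranks):
    """Rebuild the approximating sorted array from values and ranks.

    Hypothetical helper for illustration; not part of the aggregator API.
    """
    # Consecutive differences of ranks, equivalent to np.diff(ranks).
    weights = [hi - lo for lo, hi in zip(ranks, ranks[1:])]
    expanded = []
    for v, w in zip(values, weights):
        # Each values[i] is repeated weights[i] times.
        expanded.extend([v] * w)
    return weights, expanded


values = [0, 2, 5, 6, 9]
ranks = [0, 3, 4, 5, 8, 10]
weights, expanded = expand_cdf(values, ranks)
print(weights)
print(expanded)
```

Note that sum(weights) equals ranks[-1], which is the number of values summarized.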

This pair of arrays can be used to estimate the rank of any value, or to estimate the value at any rank. Here, we define the rank R(x) of some value x to be the number of values of expr less than x. An equivalent view is that R(x) is the (smallest) index where x could be found if expr were collected and sorted.

To estimate the rank of a value x, let i be the smallest index such that values[i] >= x, or len(values) if x is greater than all elements of values. Then we estimate R(x) = ranks[i], and quantile Q(x) = ranks[i] / ranks[-1].
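The rank estimate described above amounts to a binary search over values. The following is a minimal sketch using the standard bisect module; estimate_rank and estimate_quantile are hypothetical names chosen for illustration, and the sketch assumes a non-empty summary (ranks[-1] > 0).

```python
from bisect import bisect_left


def estimate_rank(values, ranks, x):
    """Estimate R(x), the number of summarized values less than x."""
    # bisect_left returns the smallest i with values[i] >= x,
    # or len(values) if x is greater than every element.
    i = bisect_left(values, x)
    return ranks[i]


def estimate_quantile(values, ranks, x):
    """Estimate Q(x) = R(x) / N. Assumes ranks[-1] > 0."""
    return estimate_rank(values, ranks, x) / ranks[-1]


values = [0, 2, 5, 6, 9]
ranks = [0, 3, 4, 5, 8, 10]
print(estimate_rank(values, ranks, 6))
print(estimate_quantile(values, ranks, 6))
```

With the example arrays, five of the ten approximated values (0, 0, 0, 2, 5) are less than 6, so the estimated rank of 6 is 5 and its estimated quantile is 0.5.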

To estimate the value at rank r, where 0 <= r < ranks[-1], let i be the largest index such that ranks[i] <= r. Then we estimate the value to be values[i].

To estimate the value with quantile q, we estimate the value at rank floor(q * ranks[-1]).
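The inverse lookup can be sketched the same way. This sketch takes i to be the largest index with ranks[i] <= r, since ranks[0] is always 0. The function names are hypothetical, and clamping the index so that q = 1 maps to the last value is an assumption of this sketch rather than documented behaviour.

```python
from bisect import bisect_right
import math


def value_at_rank(values, ranks, r):
    """Estimate the value at rank r in the sorted data."""
    # bisect_right(ranks, r) - 1 is the largest i with ranks[i] <= r.
    i = bisect_right(ranks, r) - 1
    # Assumption: clamp so r == ranks[-1] (i.e. q == 1) returns the last value.
    i = min(max(i, 0), len(values) - 1)
    return values[i]


def value_at_quantile(values, ranks, q):
    """Estimate the value with quantile q, where 0 <= q <= 1."""
    return value_at_rank(values, ranks, math.floor(q * ranks[-1]))


values = [0, 2, 5, 6, 9]
ranks = [0, 3, 4, 5, 8, 10]
print(value_at_rank(values, ranks, 5))
print(value_at_quantile(values, ranks, 0.5))
```

For every rank r below ranks[-1], value_at_rank agrees with indexing position r of the expanded array [0,0,0,2,5,6,6,6,9,9].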

agg.approx_quantiles(expr, qs, k)

This is a convenient wrapper around approx_cdf. expr and k are passed directly to approx_cdf. qs is either a single quantile or an array of quantiles, where a quantile is a number q with 0 <= q <= 1. Returns an array of values whose true quantiles are close to the requested quantiles.

plots.cdf(data, k)

Produces a cumulative distribution plot. If data is an expression, this will first run the approx_cdf aggregator. Alternatively, you can run data = t.aggregate(hl.agg.approx_cdf(t.foo)) yourself and pass the result to plots.cdf. This allows you to produce multiple plots without having to run multiple aggregations.

The plot can be panned and zoomed, and a hover tooltip will display value/rank pairs.

plots.pdf(data, k, smoothing, interactive)

Produces a probability density plot. As with cdf, data can either be an expression or the results of an agg.approx_cdf aggregation. smoothing controls the amount of smoothing applied.

If interactive=True, this returns two values, e.g. p, i = plots.pdf(data, interactive=True). Then displaying the plot using plots.show(p, interact=i) shows the plot along with a slider to interactively change the smoothing parameter. Note that interactivity requires the ipywidgets package.

plots.histogram(data, bins, interactive)

The existing histogram plot has been modified so that data can now be the results of an agg.approx_cdf aggregation. If interactive=True, this returns two values as with pdf, and passing both to plots.show adds sliders to the histogram to vary the number of bins and to shift the bin edges left and right.
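To illustrate why the bin count can be varied without re-running the aggregation, here is a rough sketch of re-binning the values and weights from an approx_cdf result. It is not necessarily how plots.histogram computes its bins: histogram_from_cdf, its parameters, and the equal-width binning with an inclusive right edge are all assumptions made for this example.

```python
def histogram_from_cdf(values, ranks, bins, start=None, end=None):
    """Bin an approx_cdf-style summary into equal-width bins.

    Hypothetical helper for illustration. Assumes end > start.
    """
    weights = [hi - lo for lo, hi in zip(ranks, ranks[1:])]
    start = values[0] if start is None else start
    end = values[-1] if end is None else end
    width = (end - start) / bins
    counts = [0] * bins
    for v, w in zip(values, weights):
        if v < start or v > end:
            continue  # ignore values outside the plotted range
        # The right edge is inclusive, so v == end lands in the last bin.
        idx = min(int((v - start) / width), bins - 1)
        counts[idx] += w
    edges = [start + i * width for i in range(bins + 1)]
    return edges, counts


values = [0, 2, 5, 6, 9]
ranks = [0, 3, 4, 5, 8, 10]
# Changing the number of bins only touches the small summary.
edges3, counts3 = histogram_from_cdf(values, ranks, bins=3)
edges2, counts2 = histogram_from_cdf(values, ranks, bins=2)
print(edges3, counts3)
print(edges2, counts2)
```

Both calls reuse the same values and ranks arrays, which is what makes a bin-count slider cheap once the aggregation has run.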