Ways to speed up QC plots computation


Sample QC and variant QC, together with computing the QC plots (18 plots), took around 3 hours for a Hail matrix table with 10 million variants. The dataset was around 60 GB, and I ran it on a 2-node Spark cluster.

I think these are decent timings for our pipeline. However, I wanted to see whether there is any way to speed up the QC plot computation (for example sample call_rate, variant call_rate, etc.).

I went through the Hail Plotting Tutorial, which says:

The histogram() method takes as an argument an aggregated hist expression, as well as optional arguments for the legend and title of the plot.

dp_hist = mt.aggregate_entries(hl.expr.aggregators.hist(mt.DP, 0, 30, 30))
p = hl.plot.histogram(dp_hist, legend='DP', title='DP Histogram')

This method, like all Hail plotting methods, also allows us to pass in fields of our data set directly. Choosing not to specify the range and bins arguments would result in a range being computed based on the largest and smallest values in the dataset and a default bins value of 50.

p = hl.plot.histogram(mt.DP, range=(0, 30), bins=30)

Do you think it would be beneficial to do the aggregation ahead of time and then pass the aggregated result to the Hail histogram plot?
Or does Hail take care of this internally, so there wouldn’t be any speedup?

Hi Abhishek,

If you don’t pass a range, histogram will do the following computation:

start, end = mt.aggregate_entries((hl.agg.min(mt.DP), hl.agg.max(mt.DP)))
dp_hist = mt.aggregate_entries(hl.agg.hist(mt.DP, start, end, bins))

If you do pass a range, it will only compute the second step. So if you are doing the same thing and passing the result to histogram, it will make no performance difference. However, one benefit of doing the aggregation yourself is that you can save dp_hist, and regenerate the plot without rerunning the aggregation.
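To illustrate the "save dp_hist and regenerate the plot" idea, here is a minimal plain-Python sketch. It does not call Hail: `hist` is a hypothetical stand-in that mimics the fields of an `hl.agg.hist` result (`bin_edges`, `bin_freq`, `n_smaller`, `n_larger`), and `cached_hist`, the JSON file name, and the DP values are all made up for the example.

```python
import json
import os
import tempfile

def hist(values, start, end, bins):
    """Plain-Python stand-in for a histogram aggregation (one pass over the data)."""
    width = (end - start) / bins
    freq = [0] * bins
    n_smaller = n_larger = 0
    for v in values:
        if v < start:
            n_smaller += 1
        elif v > end:
            n_larger += 1
        else:
            # The last bin is closed on the right, so v == end lands in it.
            freq[min(int((v - start) / width), bins - 1)] += 1
    return {
        "bin_edges": [start + i * width for i in range(bins + 1)],
        "bin_freq": freq,
        "n_smaller": n_smaller,
        "n_larger": n_larger,
    }

def cached_hist(path, compute):
    """Return the histogram stored at `path`; run `compute` only if the file is missing."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    result = compute()
    with open(path, "w") as f:
        json.dump(result, f)
    return result

depths = [3, 7, 7, 12, 18, 25, 30, 41]  # made-up DP values
calls = []  # records how many times the expensive step actually runs

def expensive():
    calls.append(1)
    return hist(depths, 0, 30, 3)

path = os.path.join(tempfile.mkdtemp(), "dp_hist.json")
first = cached_hist(path, expensive)   # computes the histogram and writes the file
second = cached_hist(path, expensive)  # reads the file, no recomputation
print(first["bin_freq"], first["n_larger"], len(calls))
```

In a real pipeline, `compute` would wrap something like `mt.aggregate_entries(hl.agg.hist(mt.DP, 0, 30, 30))`. The aggregation result would probably need converting to plain Python values before writing it out, and converting back before plotting; that part is an assumption and is not shown here.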

If you’re doing exploratory analysis and might want to try plotting with different numbers of bins, histogram can also take the results of the approx_cdf aggregator, which is a more sophisticated summarization of a distribution of values (see [Feature] Approximate quantiles, cdf and pdf plots for more details). With the interactive=True flag, you can interactively modify the number of bins in the histogram. The tradeoff is that it won’t be as accurate as a hist aggregator with a predetermined number of bins.
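As a rough intuition for that tradeoff, here is a toy plain-Python sketch. It is not how approx_cdf is implemented: `summarize`, `approx_rank`, and `rebin` are hypothetical helpers that keep only every k-th sorted value as a small summary, then derive histograms with any number of bins from that summary without touching the original data again.

```python
from bisect import bisect_right

def summarize(values, k):
    """Keep every k-th sorted value and its rank: a crude stand-in for a CDF summary."""
    ordered = sorted(values)
    kept = ordered[k - 1::k]
    ranks = list(range(k, len(ordered) + 1, k))
    return kept, ranks

def approx_rank(summary, x):
    """Estimated number of values <= x, read off the summary alone."""
    kept, ranks = summary
    i = bisect_right(kept, x)
    return ranks[i - 1] if i else 0

def rebin(summary, start, end, bins):
    """Estimated counts per bin; bin i covers (edge_i, edge_{i+1}]."""
    width = (end - start) / bins
    edges = [start + i * width for i in range(bins + 1)]
    cumulative = [approx_rank(summary, e) for e in edges]
    return [b - a for a, b in zip(cumulative, cumulative[1:])]

values = list(range(1, 101))       # made-up data: 1..100
summary = summarize(values, k=5)   # one pass over the data, 20 values kept

four_bins = rebin(summary, 0, 100, 4)   # edges line up with kept values
three_bins = rebin(summary, 0, 100, 3)  # edges fall between kept values
print(four_bins, three_bins)
```

When the bin edges happen to coincide with values kept in the summary, the counts are exact; with three bins they come out as 30, 35, 35 where a full scan would give 33, 33, 34. The bin count can be changed freely after a single pass, at the cost of some accuracy per bin.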
