Ways to speed up QC plots computation

patrick-schultz · September 1, 2021, 1:33pm

Hi Abhishek,

If you don’t pass a range, histogram will do the following computation:

start, end = mt.aggregate_entries((hl.agg.min(mt.DP), hl.agg.max(mt.DP)))
dp_hist = mt.aggregate_entries(hl.agg.hist(mt.DP, start, end, bins))

If you do pass a range, it will only compute the second step. So if you are doing the same thing and passing the result to histogram, it will make no performance difference. However, one benefit of doing the aggregation yourself is that you can save dp_hist, and regenerate the plot without rerunning the aggregation.
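To make the two steps concrete without needing a Hail session, here is a minimal plain-Python sketch of the same logic. It is not Hail's implementation: `fixed_range_hist` is a hypothetical helper, and the field names (`bin_edges`, `bin_freq`, `n_smaller`, `n_larger`) are chosen to resemble what I believe `hl.agg.hist` returns. The point is that the min/max pass is only needed when no range is known, and that the small result can be saved and reloaded for plotting.

```python
import json

def fixed_range_hist(values, start, end, bins):
    """Hypothetical helper: one pass over `values`, counting into
    `bins` equal-width bins on [start, end]. Assumes end > start."""
    width = (end - start) / bins
    freq = [0] * bins
    n_smaller = 0
    n_larger = 0
    for v in values:
        if v < start:
            n_smaller += 1
        elif v > end:
            n_larger += 1
        else:
            # The top edge is inclusive, so v == end is clamped into the last bin.
            idx = min(int((v - start) / width), bins - 1)
            freq[idx] += 1
    edges = [start + i * width for i in range(bins + 1)]
    return {"bin_edges": edges, "bin_freq": freq,
            "n_smaller": n_smaller, "n_larger": n_larger}

dp = [1, 2, 3, 4, 10]  # toy stand-in for the DP entries

# No range given: an extra pass over the data to find min and max first.
start, end = min(dp), max(dp)
auto_hist = fixed_range_hist(dp, start, end, bins=3)

# Range given up front: only the histogram pass is needed.
dp_hist = fixed_range_hist(dp, 0, 10, bins=5)

# The result is tiny, so it can be stored and reloaded later
# to redraw the plot without touching the original data again.
saved = json.dumps(dp_hist)
reloaded = json.loads(saved)
```

In Hail the equivalent would be keeping the struct returned by the second `aggregate_entries` call around (or writing it to disk) and handing it back to the plotting function whenever you want to redraw.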

If you’re doing exploratory analysis and might want to try plotting with different numbers of bins, histogram can also take the results of the approx_cdf aggregator, which is a more sophisticated summarization of a distribution of values (see [Feature] Approximate quantiles, cdf and pdf plots for more details). With the interactive=True flag, you can interactively modify the number of bins in the histogram. The tradeoff is that it won’t be as accurate as a hist aggregator with a predetermined number of bins.
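As a rough illustration of why a CDF-style summary lets you change the bin count after the fact, here is a plain-Python sketch. It is only a stand-in: `rebin_from_cdf` is a hypothetical helper, the summary here is an exact cumulative count over sorted values, and Hail's approx_cdf uses its own approximate data structure whose details are not shown in this thread. The idea is just that bin counts are differences of the cumulative count at the bin edges, so any number of bins can be derived from one stored summary.

```python
from bisect import bisect_right

def rebin_from_cdf(values, cum_counts, bins):
    """Hypothetical helper. `values` is sorted; cum_counts[i] is the number
    of observations <= values[i]. Returns counts for `bins` equal-width bins
    spanning [values[0], values[-1]]. Assumes values[-1] > values[0]."""
    start, end = values[0], values[-1]
    width = (end - start) / bins

    def cdf(x):
        # Number of observations <= x according to the summary.
        i = bisect_right(values, x)
        return cum_counts[i - 1] if i > 0 else 0

    freq = []
    below = 0  # first bin is closed on the left, so nothing lies below it
    for b in range(1, bins + 1):
        # Later bins are (lo, hi]; use `end` exactly for the last edge.
        hi = end if b == bins else start + b * width
        at_hi = cdf(hi)
        freq.append(at_hi - below)
        below = at_hi
    return freq

# Toy summary built from the data [1, 2, 3, 4, 10].
values = [1, 2, 3, 4, 10]
cum_counts = [1, 2, 3, 4, 5]

# Same stored summary, two different bin counts, no second pass over the data.
coarse = rebin_from_cdf(values, cum_counts, bins=3)
fine = rebin_from_cdf(values, cum_counts, bins=9)
```

With a real approximate summary, only a subset of the values is retained, so the counts that come out of this kind of re-binning are estimates. That is the accuracy tradeoff mentioned above.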
