Ways to speed up QC plots computation

Abhishek · September 1, 2021, 8:05am

Hi,

It took around 3 hours to run sample QC and variant QC, plus the computation of 18 QC plots, on a Hail matrix table with 10 million variants. The dataset was around 60 GB, and I ran it on a 2-node Spark cluster.

I think these are decent timings for QC plot computation in our pipeline. However, I wanted to see if there is any way to speed up the QC plot computation (sample call_rate, variant call_rate, etc.).

I went through this link, Hail | Plotting Tutorial, which says:

The histogram() method takes as an argument an aggregated hist expression, as well as optional arguments for the legend and title of the plot.

dp_hist = mt.aggregate_entries(hl.expr.aggregators.hist(mt.DP, 0, 30, 30))
p = hl.plot.histogram(dp_hist, legend='DP', title='DP Histogram')
show(p)

This method, like all Hail plotting methods, also allows us to pass in fields of our data set directly. Choosing not to specify the range and bins arguments would result in a range being computed based on the largest and smallest values in the dataset and a default bins value of 50.

p = hl.plot.histogram(mt.DP, range=(0, 30), bins=30)
show(p)

Question:
Would it be beneficial to do the aggregation ahead of time and then pass the aggregated result to the Hail histogram plot?
Or
Does Hail take care of this internally, so there wouldn't be any speedup?

patrick-schultz · September 1, 2021, 1:33pm

Hi Abhishek,

If you don’t pass a range, histogram will do the following computation:

start, end = mt.aggregate_entries((hl.agg.min(mt.DP), hl.agg.max(mt.DP)))
dp_hist = mt.aggregate_entries(hl.agg.hist(mt.DP, start, end, bins))

If you do pass a range, it will only compute the second step. So if you are doing the same thing and passing the result to histogram, it will make no performance difference. However, one benefit of doing the aggregation yourself is that you can save dp_hist, and regenerate the plot without rerunning the aggregation.
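To illustrate the point about saving dp_hist, here is a minimal sketch of writing an aggregated histogram to disk and reading it back. It assumes the hist result exposes the fields bin_edges, bin_freq, n_smaller and n_larger and can be indexed by field name; the helper names save_hist and load_hist are hypothetical, and a plain dict stands in for the Hail result so the snippet runs without a cluster.

```python
import json
import os
import tempfile


def save_hist(hist, path):
    """Write the histogram fields to a JSON file (hypothetical helper)."""
    payload = {
        "bin_edges": list(hist["bin_edges"]),
        "bin_freq": list(hist["bin_freq"]),
        "n_smaller": hist["n_smaller"],
        "n_larger": hist["n_larger"],
    }
    with open(path, "w") as f:
        json.dump(payload, f)


def load_hist(path):
    """Read the histogram fields back as a plain dict (hypothetical helper)."""
    with open(path) as f:
        return json.load(f)


# Stand-in for dp_hist = mt.aggregate_entries(hl.agg.hist(mt.DP, 0, 30, 3)).
# Assumption: the real result can be indexed the same way as this dict.
example_hist = {
    "bin_edges": [0.0, 10.0, 20.0, 30.0],
    "bin_freq": [5, 12, 3],
    "n_smaller": 0,
    "n_larger": 1,
}

path = os.path.join(tempfile.mkdtemp(), "dp_hist.json")
save_hist(example_hist, path)
restored = load_hist(path)
```

In a later session, the restored dict would presumably need to be wrapped back into a Hail struct before being passed to hl.plot.histogram; that step is not shown here.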

If you’re doing exploratory analysis and might want to try plotting with different numbers of bins, histogram can also take the results of the approx_cdf aggregator, which is a more sophisticated summarization of a distribution of values (see [Feature] Approximate quantiles, cdf and pdf plots for more details). With the interactive=True flag, you can interactively modify the number of bins in the histogram. The tradeoff is that it won’t be as accurate as a hist aggregator with a predetermined number of bins.
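As a rough illustration of why a CDF summary allows rebinning without another pass over the data, here is a plain-Python sketch. It assumes the summary consists of a sorted sample of values plus a ranks array that is one element longer, so that ranks[i + 1] - ranks[i] is the approximate number of original values represented by values[i]. The function rebin_from_cdf and the sample numbers are hypothetical, and this is not how Hail itself draws the plot.

```python
def rebin_from_cdf(values, ranks, start, end, bins):
    """Estimate per-bin counts from a weighted sample summary."""
    width = (end - start) / bins
    counts = [0] * bins
    for i, v in enumerate(values):
        # Each retained value stands in for this many original values.
        weight = ranks[i + 1] - ranks[i]
        if v < start or v > end:
            continue
        # A value equal to `end` goes into the last bin.
        idx = min(int((v - start) / width), bins - 1)
        counts[idx] += weight
    return counts


# Made-up summary of 100 values; not real approx_cdf output.
values = [2, 5, 9, 14, 21, 28]
ranks = [0, 10, 30, 60, 80, 95, 100]

coarse = rebin_from_cdf(values, ranks, 0, 30, 3)  # [60, 20, 20]
fine = rebin_from_cdf(values, ranks, 0, 30, 6)    # [10, 50, 20, 0, 15, 5]
```

Both binnings come from the same small summary, which is why the bin count can be changed interactively. The counts are only estimates, because all the weight of a retained value lands in a single bin.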
