Hi,
Sample QC and variant QC, together with computing the QC plots (18 plots), took around 3 hours on a Hail MatrixTable with 10 million variants. The dataset was around 60 GB, and I ran it on a 2-node Spark cluster.
I think these timings are decent for our pipeline. However, I was wondering whether there is any way to speed up the QC plot computation (e.g. sample call_rate, variant call_rate, etc.).
I went through this link: Hail | Plotting Tutorial. It says:
The histogram() method takes as an argument an aggregated hist expression, as well as optional arguments for the legend and title of the plot.
dp_hist = mt.aggregate_entries(hl.expr.aggregators.hist(mt.DP, 0, 30, 30))
p = hl.plot.histogram(dp_hist, legend='DP', title='DP Histogram')
show(p)
This method, like all Hail plotting methods, also allows us to pass in fields of our data set directly. Choosing not to specify the range and bins arguments would result in a range being computed based on the largest and smallest values in the dataset and a default bins value of 50.
p = hl.plot.histogram(mt.DP, range=(0, 30), bins=30)
show(p)
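To picture what the tutorial text above describes, here is a small plain-Python sketch. It is not Hail's implementation: `histogram_counts`, its parameters, and the sample `depths` list are hypothetical names made up for illustration. It assumes that leaving out the range costs one extra scan over the data to find the minimum and maximum before binning, which is what the tutorial wording suggests. Values outside the range are simply skipped here.

```python
from typing import List, Optional, Sequence, Tuple


def histogram_counts(
    values: Sequence[float],
    value_range: Optional[Tuple[float, float]] = None,
    bins: int = 50,
) -> Tuple[List[float], List[int], int]:
    """Return (bin_edges, counts, number_of_scans_over_values).

    Illustration only: mimics the described behaviour, not Hail's code.
    """
    scans = 0
    if value_range is None:
        # No range given: one extra scan to find the smallest and largest value.
        lo = hi = values[0]
        for v in values:
            lo = min(lo, v)
            hi = max(hi, v)
        scans += 1
    else:
        lo, hi = value_range
    if hi <= lo:
        raise ValueError("range must have positive width")

    width = (hi - lo) / bins
    counts = [0] * bins
    # Binning scan: this one is needed in both cases.
    for v in values:
        if v < lo or v > hi:
            continue  # out-of-range values are skipped in this sketch
        idx = min(int((v - lo) / width), bins - 1)  # upper edge goes in last bin
        counts[idx] += 1
    scans += 1

    edges = [lo + i * width for i in range(bins + 1)]
    return edges, counts, scans


depths = [3, 7, 12, 18, 25, 29, 30, 41]  # made-up DP values

# Explicit range and bins: a single scan.
explicit_edges, explicit_counts, explicit_scans = histogram_counts(
    depths, value_range=(0, 30), bins=3
)

# Range left out: min/max scan plus the binning scan, default of 50 bins.
inferred_edges, inferred_counts, inferred_scans = histogram_counts(depths)

print(explicit_counts, explicit_scans)
print(sum(inferred_counts), inferred_scans)
```

If this assumption holds for Hail as well, specifying range and bins up front would avoid one pass per plot, independent of whether the aggregation is done ahead of time.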
Question:
Do you think it would be beneficial to do the aggregation ahead of time and then pass the aggregated result to the Hail histogram plot?
Or
Does Hail take care of this internally, so there wouldn't be any speed-up?