I just started using Hail 0.2 and am finding QC a bit of a struggle.
The main issue is how long a count() or count_cols() takes.
In Hail 0.1 I would use sample_num, and it was much faster than the count methods.
Is there a similar method in Hail 0.2? Has anyone found good solutions or tips for dealing with this?
These two methods behave the same way. However, Hail’s backend infrastructure has changed a lot between 0.1 and 0.2.
One major difference is that we’ve made many operations lazy. For instance, compare these two pipelines:
# Hail 0.1
vds = vds.sample_qc()
vds = vds.filter_samples_expr('sa.qc.callRate > 0.97')
# Hail 0.2
mt = hl.sample_qc(mt)
mt = mt.filter_cols(mt.sample_qc.call_rate > 0.97)
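In 0.1, most of the time is spent in the first line (sample_qc). In 0.2, neither line does any real work when it is called; the whole pipeline runs only when a result is requested, for example by a later count() or count_cols(). This makes it look as if sample_qc is instant and counting takes a long time.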
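To make the effect concrete, here is a toy sketch in plain Python. It is not Hail code: LazyTable, expensive_qc, and the sample data are all made up for illustration, and the real backend is far more involved. It only shows why the cost of an earlier step surfaces at the count.

```python
class LazyTable:
    """Toy stand-in for a lazy table (hypothetical, not a Hail class).

    annotate() and filter() only record a step; count() runs them all.
    """

    def __init__(self, rows, steps=()):
        self.rows = rows
        self.steps = tuple(steps)

    def annotate(self, fn):
        # Record the step and return immediately; nothing is computed here.
        return LazyTable(self.rows, self.steps + (("annotate", fn),))

    def filter(self, pred):
        # Same: the predicate is stored, not evaluated.
        return LazyTable(self.rows, self.steps + (("filter", pred),))

    def count(self):
        # The first real work: every recorded step runs for every row.
        n = 0
        for row in self.rows:
            keep = True
            for kind, fn in self.steps:
                if kind == "annotate":
                    row = fn(row)
                elif not fn(row):
                    keep = False
                    break
            if keep:
                n += 1
        return n


calls = {"qc": 0}  # counts how often the "expensive" step actually runs


def expensive_qc(row):
    # Stand-in for a costly per-sample QC computation.
    calls["qc"] += 1
    return {**row, "call_rate": row["called"] / row["total"]}


samples = [
    {"id": "s1", "called": 99, "total": 100},
    {"id": "s2", "called": 90, "total": 100},
    {"id": "s3", "called": 98, "total": 100},
]

t = LazyTable(samples)
t = t.annotate(expensive_qc)                   # returns instantly
t = t.filter(lambda r: r["call_rate"] > 0.97)  # returns instantly
calls_before_count = calls["qc"]               # QC has not run yet
kept = t.count()                               # QC runs here, for every sample
calls_after_count = calls["qc"]
```

In this sketch the QC function has not been called at all before count(), and it is called once per sample during count(). Timing the individual lines would therefore blame the count, even though the QC step is what costs the time.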
If you share your pipeline, we can probably help optimize it.