Looking for a "count()" alternative for QC


#1

Hello,
I just started using Hail 0.2 and am finding it a bit of a struggle to perform QC.
The main issue I face is the extremely long time it takes to do a count() or count_cols().
In Hail 0.1 I would use num_samples, and it was much faster than the count methods.

Is there a similar method in Hail 0.2? Has anyone found good solutions or tips for dealing with this issue?


#2

num_samples in 0.1 and count_cols() in 0.2 behave the same way. However, Hail’s backend infrastructure has changed a lot between 0.1 and 0.2.

One major difference is that we’ve made many operations lazy: they only record what should be computed, and nothing actually runs until a result is requested. For instance, compare these two:

[0.1]

vds = vds.sample_qc()
vds = vds.filter_samples_expr('sa.qc.callRate > 0.97')
print(vds.num_samples)

[0.2]

mt = hl.sample_qc(mt)
mt = mt.filter_cols(mt.sample_qc.call_rate > 0.97)
print(mt.count_cols())

In 0.1, most of the time is spent in the first line (sample_qc). In 0.2, nothing is executed until line 3 (count_cols), so it looks as if sample_qc is instant and counting takes a long time.
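
To make the timing effect concrete, here is a rough toy sketch in plain Python. It is not Hail code and not how Hail is implemented; the LazyPipeline class, expensive_qc function, and the sample data are all made up for illustration. It only shows the general pattern: building the pipeline is nearly free, and the cost shows up when count() is called.

```python
import time


class LazyPipeline:
    """Hypothetical toy stand-in for a lazy table: records steps, runs them on demand."""

    def __init__(self, rows, steps=()):
        self.rows = rows
        self.steps = tuple(steps)

    def map(self, fn):
        # Nothing is computed here; the step is only recorded.
        return LazyPipeline(self.rows, self.steps + (("map", fn),))

    def filter(self, pred):
        # Same here: the predicate is stored, not applied.
        return LazyPipeline(self.rows, self.steps + (("filter", pred),))

    def count(self):
        # The whole recorded pipeline runs here, every time count() is called.
        n = 0
        for row in self.rows:
            keep = True
            for kind, fn in self.steps:
                if kind == "map":
                    row = fn(row)
                elif not fn(row):
                    keep = False
                    break
            if keep:
                n += 1
        return n


def expensive_qc(row):
    time.sleep(0.001)  # pretend this is costly per-sample QC
    return {**row, "call_rate": row["called"] / row["total"]}


# Ten made-up samples with call rates 0.90 through 0.99.
rows = [{"called": c, "total": 100} for c in range(90, 100)]

t0 = time.perf_counter()
pipeline = LazyPipeline(rows).map(expensive_qc).filter(lambda r: r["call_rate"] > 0.97)
build_seconds = time.perf_counter() - t0  # tiny: only bookkeeping happened

t0 = time.perf_counter()
n_kept = pipeline.count()  # all the QC work happens here
count_seconds = time.perf_counter() - t0

print(n_kept, build_seconds < count_seconds)
```

Note that in this toy model a second call to count() repeats all of the QC work. The same reasoning applies to a real lazy pipeline: if you need several results from the same expensive intermediate step, it can help to write that intermediate out (for example with mt.write(...)) and read it back before counting.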

If you share your pipeline, we can probably help optimize it.


#3

Thank you!