Looking for a "count()" alternative for QC

Ella · November 4, 2018, 9:08am

Hello,
I just started using Hail 0.2 and am finding it a bit of a struggle to perform QC.
The main issue I face is the extremely long time it takes to do a count() or count_cols().
In Hail 0.1 I would use num_samples, and it was much faster than the count methods.

Is there a similar method in Hail 0.2? Has anyone found good solutions or tips for dealing with this issue?

tpoterba · November 4, 2018, 2:37pm

These two methods behave the same way. However, Hail’s backend infrastructure has changed a lot between 0.1 and 0.2.

One major difference is that we’ve made many operations lazy. For instance, compare these two:

[0.1]

vds = vds.sample_qc()
vds = vds.filter_samples_expr('sa.qc.callRate > 0.97')
print(vds.num_samples)

[0.2]

mt = hl.sample_qc(mt)
mt = mt.filter_cols(mt.sample_qc.call_rate > 0.97)
print(mt.count_cols())

In 0.1, the majority of the time is spent in the first line (sample_qc). In 0.2, nothing is executed until line 3 (count_cols), so it appears as if sample_qc is instant and counting takes a long time.
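
To make the timing effect concrete, here is a minimal pure-Python sketch of lazy evaluation. It is a toy model only, not how Hail is implemented: `LazyPipeline`, `slow_qc`, and `timed` are hypothetical names made up for this illustration, and the `time.sleep` call stands in for an expensive per-sample QC computation.

```python
import time


class LazyPipeline:
    """Toy stand-in for a lazy pipeline: steps are recorded, not run."""

    def __init__(self, data, steps=()):
        self.data = data
        self.steps = tuple(steps)

    def map(self, fn):
        # Only records the step and returns immediately.
        return LazyPipeline(self.data, self.steps + (("map", fn),))

    def filter(self, pred):
        # Also only records the step.
        return LazyPipeline(self.data, self.steps + (("filter", pred),))

    def count(self):
        # The action: every recorded step actually runs here.
        rows = iter(self.data)
        for kind, fn in self.steps:
            rows = map(fn, rows) if kind == "map" else filter(fn, rows)
        return sum(1 for _ in rows)


def slow_qc(call_rate):
    time.sleep(0.01)  # pretend per-sample QC is expensive
    return call_rate


def timed(fn):
    """Run fn() and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


call_rates = [0.99, 0.95, 0.98, 0.90, 0.999]  # made-up example values
pipeline = LazyPipeline(call_rates)

pipeline, t_qc = timed(lambda: pipeline.map(slow_qc))
pipeline, t_filter = timed(lambda: pipeline.filter(lambda r: r > 0.97))
n_kept, t_count = timed(pipeline.count)

print(f"qc: {t_qc:.4f}s  filter: {t_filter:.4f}s  count: {t_count:.4f}s  kept: {n_kept}")
```

In this sketch the qc and filter steps return almost immediately, and the count step carries the full cost of the simulated QC, which mirrors the pattern described above for 0.2.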

If you share your pipeline, we can probably help optimize it.

Ella · November 6, 2018, 11:54am

Thank you!
