Drop unnecessary QC metrics

Hi,

I’m performing variant QC on a MatrixTable with dimensions (56993186, 414830). The data has already been filtered to include only common variants. The remaining steps are to filter for missingness and retain only biallelic SNPs.

For missingness QC, I’m using .variant_qc.call_rate >= 0.90. While hl.variant_qc() provides a comprehensive set of QC metrics, I only need the .call_rate field, and the rest will be removed when I export the results.

Given the long runtime of variant_qc(), is it possible to compute only .call_rate without calculating all other metrics, to potentially reduce computation time? Additionally, do you have any suggestions for optimizing these QC steps further? Thank you

As we state in the variant_qc documentation, call rate is equivalent to n_called/count_cols, so the following will suffice:

mt = mt.annotate_rows(
    call_rate = hl.agg.count_where(hl.is_defined(mt.GT)) / mt.count_cols()
)

While we attempt to optimize away (and therefore not compute) fields that are not used in your pipeline, that optimization isn’t perfect. The best way to speed up QC is to write it like this yourself: compute only the metrics you need, and compute each of them only once. For example, if you don’t even need call_rate after filtering, you could simply do:

mt = mt.filter_rows(
    hl.agg.count_where(hl.is_defined(mt.GT)) / mt.count_cols() >= 0.90
)

Thank you so much for your suggestion. I wanted to confirm whether using / mt.count_cols() is correct here, or if I should use / mt.count_rows() to keep it consistent with mt.filter_rows()?

Also, for sample QC, can I use the following similar code to filter on samples?

mt = mt.filter_cols(
    hl.agg.count_where(hl.is_defined(mt.GT)) / mt.count_cols() >= 0.90
)

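For the variant filter, / mt.count_cols() is correct: inside filter_rows, the aggregation runs over the columns (samples) of each row, so the denominator is the number of samples. For the sample filter, you’re going to want count_rows() there, because inside filter_cols the aggregation runs over the rows (variants) of each column.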
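
To make the denominators concrete, here is a minimal plain-Python sketch (not Hail code) on a made-up 3-variant by 4-sample matrix. The matrix, the function names, and the use of None for a missing call are all illustrative assumptions; the point is only that a per-variant call rate divides by the number of samples, while a per-sample call rate divides by the number of variants.

```python
# Toy genotype matrix: rows are variants, columns are samples.
# None stands in for a missing call (an undefined GT); values are made up.
genotypes = [
    ["0/0", "0/1", None, "1/1"],   # variant 0: 3 of 4 samples called
    ["0/1", None, None, "0/0"],    # variant 1: 2 of 4 samples called
    ["0/0", "0/0", "0/1", "0/1"],  # variant 2: 4 of 4 samples called
]


def variant_call_rates(matrix):
    """Per-variant call rate: called entries in a row / number of columns."""
    n_samples = len(matrix[0])
    return [sum(g is not None for g in row) / n_samples for row in matrix]


def sample_call_rates(matrix):
    """Per-sample call rate: called entries in a column / number of rows."""
    n_variants = len(matrix)
    n_samples = len(matrix[0])
    return [
        sum(row[j] is not None for row in matrix) / n_variants
        for j in range(n_samples)
    ]


variant_rates = variant_call_rates(genotypes)
sample_rates = sample_call_rates(genotypes)

# Apply a 0.90 threshold in each direction, as in the filters above.
kept_variants = [i for i, r in enumerate(variant_rates) if r >= 0.90]
kept_samples = [j for j, r in enumerate(sample_rates) if r >= 0.90]
```

With this toy data only the last variant passes the variant filter, and only the first and last samples pass the sample filter. Dividing the per-sample counts by the number of samples instead (4 rather than 3) would cap every sample's rate at 0.75, so no sample would pass; that is the mistake the count_cols() denominator introduces inside filter_cols.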