I’m performing variant QC on a MatrixTable with dimensions (56993186, 414830). The data has already been filtered to include only common variants. The remaining steps are to filter for missingness and retain only biallelic SNPs.
For missingness QC, I’m using .variant_qc.call_rate >= 0.90. While hl.variant_qc() provides a comprehensive set of QC metrics, I only need the .call_rate field, and the rest will be removed when I export the results.
Given the long runtime of variant_qc(), is it possible to compute only .call_rate without calculating all the other metrics, to reduce computation time? Additionally, do you have any suggestions for optimizing these QC steps further? Thank you.
We try to optimize away (and therefore not compute) fields that your pipeline never uses, but that optimization isn’t perfect. Computing the one metric you need directly, as you suggest, is the best way to optimize QC: make sure you compute only the metrics you need, and compute each of them only once. For example, if you don’t even need call_rate after filtering, you could simply do:
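A minimal sketch of that idea, not necessarily the code originally posted in this reply. It assumes the entry field holding the calls is named GT, and the function name filter_variants and its min_call_rate parameter are made up for illustration. It uses hl.agg.fraction, which inside filter_rows aggregates over the columns (samples) of each row, so the result should match counting defined calls and dividing by mt.count_cols().

```python
def filter_variants(mt, min_call_rate=0.90):
    """Keep biallelic SNPs whose per-variant call rate is >= min_call_rate.

    Hypothetical helper: assumes `mt` is a Hail MatrixTable with an
    entry field named GT and the usual `alleles` row field.
    """
    # Imported inside the function so that defining it does not
    # require Hail to be importable.
    import hail as hl

    # Cheap row-level filter first: exactly two alleles, and the
    # alternate allele is a SNP relative to the reference.
    mt = mt.filter_rows(
        (hl.len(mt.alleles) == 2) & hl.is_snp(mt.alleles[0], mt.alleles[1])
    )

    # Per-variant call rate: fraction of samples with a defined GT.
    # Nothing is annotated, so no other QC metrics are computed or stored.
    mt = mt.filter_rows(hl.agg.fraction(hl.is_defined(mt.GT)) >= min_call_rate)
    return mt
```

Putting the allele filter first means the aggregation over entries only has to run on rows that survive it. Whether that ordering changes the runtime in practice depends on how the pipeline is optimized, so it is worth timing on a subset.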
Thank you so much for your suggestion. I wanted to confirm whether using / mt.count_cols() is correct here, or if I should use / mt.count_rows() to keep it consistent with mt.filter_rows()?
Also, for sample QC, can I use the following similar code to filter on samples?
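On the denominator question, a toy example in plain Python (no Hail, made-up data) may help. A per-variant call rate is computed across the samples of one row, so it divides by the number of columns; a per-sample call rate is computed across the variants of one column, so it divides by the number of rows.

```python
# Toy genotype matrix: rows are variants, columns are samples.
# None marks a missing call. All values are invented for illustration.
calls = [
    ["0/1", "0/0", None,  "1/1"],  # variant 0: 3 of 4 samples called
    ["0/0", None,  None,  "0/1"],  # variant 1: 2 of 4 samples called
    ["1/1", "0/1", "0/0", "0/0"],  # variant 2: 4 of 4 samples called
]
n_variants = len(calls)     # analogous to mt.count_rows()
n_samples = len(calls[0])   # analogous to mt.count_cols()

# Variant QC: one value per row, denominator is the number of samples.
variant_call_rate = [
    sum(call is not None for call in row) / n_samples for row in calls
]

# Sample QC: one value per column, denominator is the number of variants.
sample_call_rate = [
    sum(row[j] is not None for row in calls) / n_variants
    for j in range(n_samples)
]

# Indices of variants that pass a 0.75 call-rate threshold.
kept_variants = [i for i, rate in enumerate(variant_call_rate) if rate >= 0.75]
```

So with filter_rows the matching denominator is mt.count_cols(), and the sample-level analogue with filter_cols would divide by mt.count_rows(). In Hail, an aggregator such as hl.agg.fraction(hl.is_defined(mt.GT)) avoids the explicit denominator in both cases, assuming the calls live in an entry field named GT.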