Getting low call rate when converting from VDS to Sparse MatrixTable

Hi!

I’m trying to perform QC on a Hail VDS of ~40K samples, but I’m getting a very low average call rate if I convert the VDS into a sparse MatrixTable instead of a dense MatrixTable.

From what I understand, it appears that hail.vds.sample_qc does not return the call rate due to the lack of a GT field in the SVCR format. As such, I’ve tried to convert the VDS to a MatrixTable and add the GT field using hl.vds.lgt_to_gt. Below is a snippet of what I’ve tested.

vds = hl.vds.read_vds('/path/to/vds')
mt = hl.vds.to_merged_sparse_mt(vds)
mt = mt.annotate_entries(GT = hl.vds.lgt_to_gt(mt.LGT, mt.LA))
mt = hl.sample_qc(mt)
call_rate_stats = mt.aggregate_cols(hl.agg.stats(mt.sample_qc.call_rate))
print(call_rate_stats)

But the call rate stats come back as:

Struct(mean=0.02902165238671222, stdev=0.012164182297370613, min=0.0073442539129184746, max=0.09326385727999727, n=42105, sum=1221.956673742518)

Conversely, if I replace line 2 of the snippet (the hl.vds.to_merged_sparse_mt call) with hl.vds.to_dense_mt(vds), then the call rate stats look more reasonable:

Struct(mean=0.9465120866668351, stdev=0.009011747423607979, min=0.9220336485184929, max=0.9535544189899148, n=42105, sum=39852.89140910709)

Is there a way to get around this issue? Ideally, I’d like to use the sparse MatrixTable for QC so that I can convert it back to a VDS for further analysis.

Hi,
Sorry your first post was marked as spam by the automoderator! I started writing a reply there and only realized when you posted this again.

The short answer is that you should basically never use to_merged_sparse_mt. It exists for compatibility with an older representation of reference-blocked data, and that representation has many pitfalls. For example, it is easy to call a method that expects a different representation (like hl.sample_qc) and get garbage results back.
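To make the pitfall concrete, here is a toy model in plain Python (not Hail code). It assumes a simplified picture of the sparse representation: each sample stores one entry at the start of every reference block and one entry per variant call, and a method that counts missing entries treats every other row as "not called". The block layout and the helper names sparse_call_rate and dense_call_rate are made up for illustration.

```python
# Toy model only: each sample is a list of (start, end, kind) blocks,
# where kind is "ref" (a reference block) or "var" (a variant call).
samples = {
    "A": [(1, 4, "ref"), (5, 5, "var"), (6, 10, "ref")],
    "B": [(1, 2, "ref"), (3, 3, "var"), (4, 10, "ref")],
    "C": [(1, 7, "ref"), (8, 8, "var"), (9, 10, "ref")],
}


def sparse_call_rate(samples, name):
    """Fraction of sparse rows at which `name` has an entry of its own.

    Rows are the union of all block start positions across samples.
    A sample only has an entry where one of its own blocks starts, so
    rows that fall inside its reference blocks look missing.
    """
    rows = {start for blocks in samples.values() for start, _end, _kind in blocks}
    own = {start for start, _end, _kind in samples[name]}
    return len(own) / len(rows)


def dense_call_rate(samples, name):
    """Fraction of variant sites covered by any block of `name`.

    Rows are the variant sites only, and a site counts as called if it
    falls inside one of the sample's blocks (reference or variant).
    """
    sites = {
        start
        for blocks in samples.values()
        for start, _end, kind in blocks
        if kind == "var"
    }
    covered = sum(
        any(start <= site <= end for start, end, _kind in samples[name])
        for site in sites
    )
    return covered / len(sites)


for name in samples:
    print(name, sparse_call_rate(samples, name), dense_call_rate(samples, name))
```

In this toy every sample is fully covered, so the dense call rate is 1.0, but each sample has entries at only 3 of the 7 sparse rows. With tens of thousands of samples, the number of rows where some other sample starts a block grows much faster than any one sample's own entries, which is consistent with the very low numbers in the question.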

If you want to use hl.sample_qc, you can use hl.vds.to_dense_mt to convert the VDS to a dense MatrixTable and compute call rate from that. It’s not possible to compute call rate as hl.sample_qc defines it without densifying, but Hail users with large datasets at Broad are starting to use different quality metrics instead: those included in hl.vds.sample_qc, which may be more meaningful and interpretable.
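Since the goal in the question was to end up with a VDS again, one possible workflow is to densify only to compute the metric and then apply the resulting sample list to the original VDS. The sketch below assumes the VDS has LGT and LA entry fields as in the question. The function name, its arguments, and the 0.9 threshold are placeholders rather than recommendations, and it has not been run against your data.

```python
def filter_vds_by_dense_call_rate(vds_path, out_path, min_call_rate=0.9):
    """Hypothetical helper: keep samples whose dense call rate passes a threshold.

    `min_call_rate` is an arbitrary placeholder; pick a value that suits your data.
    """
    # Imported inside the function so that defining this sketch does not
    # itself require Hail to be installed.
    import hail as hl

    vds = hl.vds.read_vds(vds_path)

    # Densify, then derive GT from LGT/LA as in the question's snippet.
    mt = hl.vds.to_dense_mt(vds)
    mt = mt.annotate_entries(GT=hl.vds.lgt_to_gt(mt.LGT, mt.LA))
    mt = hl.sample_qc(mt)

    # Column table (keyed by sample ID) of the samples that pass.
    passing = mt.filter_cols(mt.sample_qc.call_rate >= min_call_rate).cols()

    # Apply the sample filter to the VDS itself, so nothing has to be
    # converted back from a MatrixTable.
    filtered = hl.vds.filter_samples(vds, passing, keep=True)
    filtered.write(out_path)
    return filtered
```

Densifying ~40K samples is expensive, so it may be worth checkpointing the passing-samples table before filtering. If the metrics from hl.vds.sample_qc(vds) turn out to be enough, the same hl.vds.filter_samples step can be driven from that table with no densify at all.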