Running polygenic risk scoring with VariantDataset

Do you have advice for how to run a polygenic risk score with a VariantDataset? My current flow is taking a painful amount of time to run on the All Of Us short-read WGS dataset and I’m not sure if it’s optimal.

vds = hl.vds.read_vds(vds_wgs_path)
scores = hl.read_table(score_file)
vds = hl.vds.filter_samples(vds, samples, keep=True)
vds = hl.vds.split_multi(vds)
vds = hl.vds.filter_variants(vds, scores)
mt = hl.vds.to_dense_mt(vds)
mt = mt.annotate_rows(**scores[mt.locus, mt.alleles])
mt = hl.variant_qc(mt)
mt = mt.annotate_rows(prior=2 * hl.if_else(mt.variant_qc.AF.length() == 1, 0, mt.variant_qc.AF[1]))
mt = mt.annotate_cols(
prs=hl.agg.sum(
    mt['effect_allele_weight'] * hl.coalesce(mt.GT.n_alt_alleles(), mt.prior)))
mt.cols().export(prs_file)

AoU makes available an already-dense MatrixTable with just common variants, right? Using that as the starting point rather than the VDS (+ to_dense_mt) is probably going to be a lot more efficient for queries that don’t need rare variation.

In addition to agreeing with Tim’s comments I’ll add that Hail does not yet have columnar storage (it’s in the works) so even running this operation on a dense matrix table will likely be bottlenecked by Hail’s decoding speed of data coming off the wire. This particularly bad for complex formats with lots of fields.

Also writing to a Hail Table will always yield better performance and avoid a concatenation step.

Unfortunately, it seems like many of the variants are rare by AoU standards, so as a starting point I believe have to use the VDS. For this query, we’re using approximately 1M variants, which were common across a different population. I assume that the number of variants, along with the way they are semi-uniformly distributed, makes interval filtering a no-go.

I’m planning to pass these PRS to R for statistical analysis, so unfortunately writing to a text file may be the only way. Would it be cheaper to write it to a Hail table in between as a checkpointing step?

I noticed that filter_samples actually triggers a row-wise filtering operation. Does this impede any interval filtering pushdown, so we should filter by interval first (if possible)? I was also surprised to see that interval filtering is expensive compared to MatrixTables, unless you specifically trigger a chromosome filter that allows you to use unchecked_filter_both.

This may be wishful thinking, but is it possible to just use the variant_data member of the VDS and get back into the MT regime without triggering to_dense_mt? Also, when would you recommend using a dense versus sparse MT?