Running polygenic risk scoring with VariantDataset

Do you have advice for how to run a polygenic risk score with a VariantDataset? My current flow is taking a painful amount of time to run on the All Of Us short-read WGS dataset and I’m not sure if it’s optimal.

vds = hl.vds.read_vds(vds_wgs_path)
scores = hl.read_table(score_file)
vds = hl.vds.filter_samples(vds, samples, keep=True)
vds = hl.vds.split_multi(vds)
vds = hl.vds.filter_variants(vds, scores)
mt = hl.vds.to_dense_mt(vds)
mt = mt.annotate_rows(**scores[, mt.alleles])
mt = hl.variant_qc(mt)
mt = mt.annotate_rows(prior=2 * hl.if_else(mt.variant_qc.AF.length() == 1, 0, mt.variant_qc.AF[1]))
mt = mt.annotate_cols(
    mt['effect_allele_weight'] * hl.coalesce(mt.GT.n_alt_alleles(), mt.prior)))

AoU makes available an already-dense MatrixTable with just common variants, right? Using that as the starting point rather than the VDS (+ to_dense_mt) is probably going to be a lot more efficient for queries that don’t need rare variation.

1 Like

In addition to agreeing with Tim’s comments I’ll add that Hail does not yet have columnar storage (it’s in the works) so even running this operation on a dense matrix table will likely be bottlenecked by Hail’s decoding speed of data coming off the wire. This particularly bad for complex formats with lots of fields.

Also writing to a Hail Table will always yield better performance and avoid a concatenation step.

Unfortunately, it seems like many of the variants are rare by AoU standards, so as a starting point I believe have to use the VDS. For this query, we’re using approximately 1M variants, which were common across a different population. I assume that the number of variants, along with the way they are semi-uniformly distributed, makes interval filtering a no-go.

I’m planning to pass these PRS to R for statistical analysis, so unfortunately writing to a text file may be the only way. Would it be cheaper to write it to a Hail table in between as a checkpointing step?

I noticed that filter_samples actually triggers a row-wise filtering operation. Does this impede any interval filtering pushdown, so we should filter by interval first (if possible)? I was also surprised to see that interval filtering is expensive compared to MatrixTables, unless you specifically trigger a chromosome filter that allows you to use unchecked_filter_both.

This may be wishful thinking, but is it possible to just use the variant_data member of the VDS and get back into the MT regime without triggering to_dense_mt? Also, when would you recommend using a dense versus sparse MT?