Calculate PGS score from UKB data using PGS Catalog

Hi!

I am trying to calculate polygenic risk scores (PRS/PGS) for UK Biobank data (imputed bgen files) using pre-computed scores from the PGS Catalog available online.

So basically I’m just trying to write a script that takes two inputs: a file of pre-computed weights from the PGS Catalog (for example, https://ftp.ebi.ac.uk/pub/databases/spot/pgs/scores/PGS000001/ScoringFiles/Harmonized/PGS000001_hmPOS_GRCh37.txt.gz) and the bgen files (one per chromosome) from UKB. The script would match the variants in the PGS Catalog file to those present in UKB and then, for each individual, sum the beta values (from the PGS Catalog file) weighted by that individual’s effect-allele dosage at each matched variant, giving one PGS per individual.
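For illustration, the match-and-sum step described above could be sketched in plain Python roughly as follows. This is a minimal sketch rather than a UKB-ready pipeline. It assumes the harmonized scoring file has the columns `hm_chr`, `hm_pos`, `effect_allele`, `other_allele` and `effect_weight` (worth checking against the actual file header), and that dosages have already been extracted from the bgen files into a hypothetical in-memory dict, `dosages`, whose values count copies of the second allele for each sample. The function names are made up for this example.

```python
import csv
import io


def parse_scoring_file(handle):
    """Read a PGS Catalog-style scoring file into a weight lookup.

    Returns {(chrom, pos, effect_allele, other_allele): weight}.
    Column names are assumptions; check them against the real header.
    """
    # Metadata lines start with '#'; the tab-delimited table follows.
    table = (line for line in handle if not line.startswith("#"))
    weights = {}
    for rec in csv.DictReader(table, delimiter="\t"):
        if not rec["hm_chr"] or not rec["hm_pos"]:
            continue  # no harmonized position for this variant; skip it
        key = (rec["hm_chr"], int(rec["hm_pos"]),
               rec["effect_allele"], rec["other_allele"])
        weights[key] = float(rec["effect_weight"])
    return weights


def score_samples(weights, dosages):
    """Sum weight * effect-allele dosage over matched variants.

    dosages: {(chrom, pos, first_allele, second_allele): [dosage per sample]},
    where each dosage counts copies of the second allele (an assumption).
    Returns (per-sample scores, number of matched variants).
    """
    n_samples = len(next(iter(dosages.values()), []))
    scores = [0.0] * n_samples
    matched = 0
    for (chrom, pos, first, second), dos in dosages.items():
        if (chrom, pos, second, first) in weights:
            # Effect allele is the counted (second) allele: use dosage as is.
            weight = weights[(chrom, pos, second, first)]
            effect_dosage = dos
        elif (chrom, pos, first, second) in weights:
            # Effect allele is the first allele: flip the dosage.
            weight = weights[(chrom, pos, first, second)]
            effect_dosage = [2.0 - d for d in dos]
        else:
            continue  # variant not in the scoring file (or alleles differ)
        matched += 1
        for i, d in enumerate(effect_dosage):
            scores[i] += weight * d
    return scores, matched


# Toy data standing in for a scoring file and three samples' dosages.
demo_file = io.StringIO(
    "#pgs_id=DEMO\n"
    "rsID\tchr_name\tchr_position\teffect_allele\tother_allele\t"
    "effect_weight\thm_chr\thm_pos\n"
    "rs1\t1\t100\tA\tG\t0.5\t1\t100\n"
    "rs2\t1\t200\tC\tT\t-0.25\t1\t200\n"
)
demo_dosages = {
    ("1", 100, "G", "A"): [0.0, 1.0, 2.0],
    ("1", 200, "C", "T"): [0.0, 2.0, 1.0],
}
demo_scores, demo_matched = score_samples(parse_scoring_file(demo_file), demo_dosages)
print(demo_matched, demo_scores)
```

Matching on position plus both alleles (with a flip when the effect allele is the uncounted one) is a deliberately strict choice here; strand-ambiguous variants and variants missing from UKB are simply skipped, so the matched count is worth reporting alongside the scores.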

Here is a bgenix and plink version of what I am considering doing, except that I would use Hail instead: PRS Pipeline

Is there something similar already available?
Do you think it is sensible to use Hail for this use case, or should I stick with bgenix or plink? As I hope I explained, I am not trying to develop or validate PRS, but just to do a “mapping of the variants” and use the scores already computed in the PGS Catalog.

Thank you in advance!

Hey @irun! If you’re just computing PRS scores for samples stored in BGEN with scores stored in compressed text, I don’t think Hail will provide much additional benefit.

Hail shines when you use it for an entire analysis on data in Hail native format. Things like QC’ing a sequencing matrix, computing PCA, LD, and performing regressions.

FWIW, we do have an example of how to use Hail to compute PGS.

Hi!

Thank you for the informative answer!
Thank you for pointing out that Hail might be better suited for other purposes; it makes sense. My main reason for using Hail was that I could visualize the data, filter, plot distributions, and do QC. But yes, my main goal is to match the variants from a pre-computed PGS to the variants in UKB, pretty much a “grep” between the two files.

Maybe a bit of an off-topic question, but do you know whether it is better to use the imputed bgen files for this purpose, or the directly genotyped calls (not imputed)? Is there a way to distinguish imputed variants from directly genotyped (called) variants in an imputed bgen file using Hail?