Calculate PGS score from UKB data using PGS Catalog

irun · November 5, 2023, 9:05am

Hi!

I am trying to calculate polygenic risk scores (PRS/PGS) for UK Biobank data (imputed bgen files) using pre-computed scores from the PGS Catalog available online.

So basically I’m just trying to write a script that takes two inputs: a file with the pre-computed weights from the PGS Catalog (for example, https://ftp.ebi.ac.uk/pub/databases/spot/pgs/scores/PGS000001/ScoringFiles/Harmonized/PGS000001_hmPOS_GRCh37.txt.gz) and the bgen files (one per chromosome) from UKB. The script would match the variants present in the PGS Catalog file to those present in UKB. Then, for each individual, the beta values (from the PGS Catalog file), each multiplied by that individual’s effect-allele dosage at the variant, would be summed to give the PGS per individual.
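For illustration, the match-and-sum step described above can be sketched in plain Python. This is a minimal sketch under stated assumptions, not a tested pipeline: the column names (`hm_chr`, `hm_pos`, `effect_allele`, `effect_weight`) are assumed to follow the PGS Catalog harmonized scoring-file layout, the function names are hypothetical, and reading the bgen files is left out entirely. Variants are assumed to arrive as `(chrom, pos, ref, alt, dosages)` tuples, where `dosages` holds each sample's expected count (0 to 2) of the `alt` allele.

```python
import csv


def read_scoring_file(lines):
    """Parse scoring-file lines into {(chrom, pos): (effect_allele, weight)}.

    Assumes PGS Catalog-style columns; '#' lines are metadata and are skipped.
    """
    rows = csv.DictReader(
        (ln for ln in lines if not ln.startswith("#")), delimiter="\t"
    )
    weights = {}
    for row in rows:
        if not row.get("hm_chr") or not row.get("hm_pos"):
            continue  # variant has no harmonized position; skip it
        key = (row["hm_chr"], int(row["hm_pos"]))
        weights[key] = (row["effect_allele"], float(row["effect_weight"]))
    return weights


def compute_pgs(weights, variants, n_samples):
    """Sum weight * effect-allele dosage over matched variants, per sample."""
    scores = [0.0] * n_samples
    matched = 0
    for chrom, pos, ref, alt, dosages in variants:
        hit = weights.get((chrom, pos))
        if hit is None:
            continue  # variant is not part of this score
        effect_allele, weight = hit
        if effect_allele == alt:
            counts = dosages
        elif effect_allele == ref:
            # Dosages count the alt allele, so flip them to count the ref allele.
            counts = [2.0 - d for d in dosages]
        else:
            continue  # alleles disagree at this position; skip rather than guess
        for i, count in enumerate(counts):
            scores[i] += weight * count
        matched += 1
    return scores, matched


# Tiny made-up inputs, only to show the data shapes (not real PGS Catalog rows).
scoring_lines = [
    "#pgs_id=EXAMPLE\n",
    "effect_allele\tother_allele\teffect_weight\thm_chr\thm_pos\n",
    "A\tG\t0.5\t1\t100\n",
    "C\tT\t-0.2\t2\t200\n",
    "G\tA\t0.1\t3\t300\n",
]
variants = [
    ("1", 100, "G", "A", [0.0, 1.0, 2.0]),  # effect allele is alt
    ("2", 200, "C", "T", [2.0, 1.0, 0.0]),  # effect allele is ref, so flipped
    ("4", 400, "A", "C", [1.0, 1.0, 1.0]),  # absent from the scoring file
]
weights = read_scoring_file(scoring_lines)
scores, matched = compute_pgs(weights, variants, n_samples=3)
```

With a real scoring file, the lines could come from `gzip.open(path, "rt")`. Chromosome labels would need to be normalized so both inputs use the same form, and the number of matched variants is worth reporting alongside the scores, since unmatched variants silently contribute nothing.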

Here is a bgenix and plink version of what I am considering doing, except that I would use Hail instead: PRS Pipeline

Is there something similar already available?
Do you think it is sensible to use Hail for this use case, or should I stick with bgenix or plink? As I hope I explained, I am not trying to develop or validate PRS, but just to do a “mapping of the variants” and use the scores already computed in the PGS Catalog.

Thank you in advance!

danking · November 9, 2023, 7:02pm

Hey @irun! If you’re just computing PRS scores for samples stored in BGEN with scores stored in compressed text, I don’t think Hail will provide much additional benefit.

Hail shines when you use it for an entire analysis on data in Hail native format. Things like QC’ing a sequencing matrix, computing PCA, LD, and performing regressions.

FWIW, we do have an example of how to use Hail to compute PGS.

irun · November 13, 2023, 5:17pm

Hi!

Thank you for the informative answer!
Thank you for pointing out that Hail might be better suited for other purposes; that makes sense. My main reason for using Hail was that I could visualize the data, filter, plot distributions, and do QC. But yes, my main goal is to match the variants from a pre-computed PGS to the variants in UKB, pretty much a “grep” between the two files.

Maybe a bit of an off-topic question, but do you know if it is better to use the imputed bgen files for this purpose, or just the directly genotyped calls (not imputed)? Is there a way to differentiate imputed variants from directly genotyped (called) variants in an imputed bgen file using Hail?
