I am working on PRS analysis these days and have some feature requests regarding SNP matching of the target data and GWAS summary statistics.
Allele frequency harmonization: When matching SNPs, I want to harmonize allele frequencies between the two datasets so that I can validate the transferability of PRS across ancestries. A chi-squared test looks suitable for finding outlier SNPs, given that many phasing tools flag potential mismatches between the target data and the reference panel using chi-squared statistics. The problem is that the chi-squared test in Hail, ‘hl.chi_squared_test’, only takes integer values as input and returns a p-value and an odds ratio. If I want to run a chi-squared test on allele frequencies and get the chi-squared statistic itself, what should I do? Do you guys have any plans to implement this in the function?
SNP matching: Bigsnpr (R), one of the most popular PRS packages, matches SNPs from two different datasets by chromosome and position using the 3 steps below. I wonder whether you guys would be willing to build this kind of SNP matching function.
Match SNPs, reversing the allele (ex. alleles[0] == A1 & alleles[1] == A2)
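To make the reversed-allele step concrete, here is a minimal sketch in plain Python (not Hail or Bigsnpr code). The function name `match_snps`, the dict keys, and the convention that `a1` is the effect allele are all assumptions for illustration. The output effect size is expressed per copy of the target's alt allele (`alleles[1]`), so the sign of `beta` is flipped when the summary statistics list the alleles in the opposite order.

```python
def match_snps(sumstats, target):
    """Match summary-statistic rows to target variants by (chrom, pos).

    sumstats: iterable of dicts with keys chrom, pos, a1, a2, beta
              (assumption: a1 is the effect allele)
    target:   iterable of dicts with keys chrom, pos, alleles = (ref, alt)
    Returns target rows annotated with beta (per copy of alt) and a
    `reversed` flag.
    """
    # Index target variants by locus; a locus may hold several variants.
    by_locus = {}
    for t in target:
        by_locus.setdefault((t["chrom"], t["pos"]), []).append(t)

    matched = []
    for s in sumstats:
        for t in by_locus.get((s["chrom"], s["pos"]), []):
            ref, alt = t["alleles"]
            if (s["a1"], s["a2"]) == (alt, ref):
                # Effect allele is the target alt allele: keep the sign.
                matched.append({**t, "beta": s["beta"], "reversed": False})
            elif (s["a1"], s["a2"]) == (ref, alt):
                # Effect allele is the target ref allele: flip the sign.
                matched.append({**t, "beta": -s["beta"], "reversed": True})
            # Anything else (different alleles) is left unmatched.
    return matched
```

Strand flips (A/T and C/G complements) are not handled here and would need their own pass.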
Also, two minor issues. Do you guys have any plans to incorporate PRS computation algorithms (PRScs, for example) other than the plink sum? And would it be possible for ‘hl.sample_qc’ to support a het_freq_hwe metric like the ‘hl.variant_qc’ function does, by any chance? I want to replicate the pre-imputation QC conducted by ricopili and thus need the Fhet value. (RICOPILI - Preimputation (QC))
re 1), could the args to chi_squared_test be the number of alt alleles and the number of total alleles for the two datasets? I’m not sure what else you could do here (comparing two frequencies alone seems like an underspecified problem)
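As a rough illustration of what that test computes, here is a plain-Python sketch of the Pearson chi-squared statistic on the 2x2 table of alt/ref allele counts. The helper name `allele_count_chi2` is hypothetical and this is not how `hl.chi_squared_test` is implemented; it assumes no continuity correction and uses the closed-form 1-degree-of-freedom tail probability.

```python
import math


def allele_count_chi2(ac1, an1, ac2, an2):
    """Pearson chi-squared test on a 2x2 table of allele counts.

    ac1, ac2: alt allele counts in dataset 1 and 2
    an1, an2: total allele counts in dataset 1 and 2
    Returns (chi2, p_value) with 1 degree of freedom.
    """
    table = [[ac1, an1 - ac1], [ac2, an2 - ac2]]
    row_sums = [an1, an2]
    col_sums = [ac1 + ac2, (an1 - ac1) + (an2 - ac2)]
    total = an1 + an2

    # A zero margin means the site is monomorphic or a dataset is empty;
    # the statistic is undefined, so report "no evidence of difference".
    if 0 in row_sums or 0 in col_sums:
        return 0.0, 1.0

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected

    # Survival function of chi-squared with 1 df: erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value
```

For example, `allele_count_chi2(30, 100, 50, 100)` compares an alt frequency of 0.30 against 0.50 on 100 alleles each.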
re 2), this is totally possible to implement using Hail’s Python interface. The reason we don’t have something boxed up in a function already is that there are a bunch of decisions to make about how to do this, and we don’t want to force those choices on users (also because it’s work and we have a lot to do). I can help you write something that accomplishes the bullet points you listed if you’re interested.
re: PRS, we don’t have concrete plans to implement any other algorithms right now. PRScs looks more challenging due to requiring an iterative fit, but other approaches might be trivial on top of existing Hail interfaces.
re: Fhet, we don’t really need to add this to sample_qc, because it’s super easy to compute on your own:
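A minimal sketch of that computation in plain Python rather than Hail expressions, assuming the PLINK `--het` style definition F = (O_hom - E_hom) / (N - E_hom) without its small-sample correction. The name `f_het` and its arguments are illustrative, not Hail API; in Hail the per-variant expected heterozygosity would presumably come from the `het_freq_hwe` field of `hl.variant_qc`, which is approximated here as 2p(1-p).

```python
def f_het(genotypes, alt_freqs):
    """Inbreeding coefficient (Fhet) for one sample.

    genotypes: alt-allele counts per variant (0, 1, 2), or None if missing
    alt_freqs: alt allele frequency per variant, same order as genotypes
    Returns F, or None when it is undefined (no informative variants).
    """
    n_called = 0
    observed_hom = 0
    expected_hom = 0.0
    for g, p in zip(genotypes, alt_freqs):
        if g is None:
            continue  # skip missing calls
        n_called += 1
        if g != 1:
            observed_hom += 1
        # Expected homozygosity under HWE is 1 - 2p(1-p).
        expected_hom += 1.0 - 2.0 * p * (1.0 - p)

    denominator = n_called - expected_hom
    if denominator == 0:
        return None
    return (observed_hom - expected_hom) / denominator
```

Values near 0 are consistent with Hardy-Weinberg expectations; strongly positive or negative values indicate a deficit or excess of heterozygous calls for that sample.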
Sorry, I must have left out some important information here. The purpose of allele frequency (AF) harmonization is to exclude SNPs that have inconsistent AF between the GWAS summary statistics and the target data, so that we use only SNPs that are transferable across ancestries. The ancestry of my target data is EAS whereas that of the GWAS summary statistics is EUR, and many ethnicity-specific PRS results are driven by ethnicity-dependent SNP frequencies, so I wanted to run an AF harmonization step before SNP matching to increase PRS accuracy. Though I have the allele count (AC) info for the target data like you suggested, the GWAS summary statistics don’t include AC, so I have no choice but to multiply AF by the number of samples and round the result to use the existing ‘hl.chi_squared_test’ function. Thus, what I requested is something like ‘chisq.test’ in R, where one of the two input vectors can be a vector of probabilities. (chisq.test function - RDocumentation) Hope it makes sense to you.
Yes, I would be very grateful if you could help me get through it!
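To illustrate the kind of test I mean, here is a plain-Python sketch of a chi-squared goodness-of-fit test that takes observed counts from the target data and an expected frequency from the GWAS summary statistics, similar in spirit to `chisq.test(x, p = ...)` in R. The function name `af_goodness_of_fit` is made up for this example and is not a Hail function. It treats the GWAS AF as a fixed expected probability and ignores its sampling error.

```python
import math


def af_goodness_of_fit(ac, an, expected_af):
    """Chi-squared goodness-of-fit of observed alt/ref allele counts
    against an expected alt allele frequency.

    ac: observed alt allele count in the target data
    an: observed total allele count in the target data
    expected_af: alt allele frequency from the GWAS summary statistics
    Returns (chi2, p_value) with 1 degree of freedom.
    """
    if not 0.0 < expected_af < 1.0:
        raise ValueError("expected_af must be strictly between 0 and 1")
    if an <= 0:
        raise ValueError("an must be positive")

    expected_alt = an * expected_af
    expected_ref = an * (1.0 - expected_af)
    observed_ref = an - ac

    chi2 = ((ac - expected_alt) ** 2 / expected_alt
            + (observed_ref - expected_ref) ** 2 / expected_ref)

    # Survival function of chi-squared with 1 df: erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value
```

For example, `af_goodness_of_fit(50, 100, 0.3)` tests 50 alt alleles out of 100 against an expected frequency of 0.3, with no rounding of AF times sample size needed.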