Hi, I am working on a building a script to perform GWAS on the imputed UKBiobank fileset. I have been building off this post: https://discuss.hail.is/t/how-to-run-gwas-from-uk-biobank-efficiently-on-hail/1400/9
After loading genotype data from been files, I would like to apply a distance filter on QC-filtered variants. The idea is basically to: create a set of variants that are at X bp (or on the next chromosome) from every other variant in the set.
For example, if these were the positions of QC-filtered variants and X=100 bp:
snp1 100000
snp2 100075
snp3 100086
snp4 100500
snp5 100550
snp6 100600
snp7 100800
Then, I would want to filter out: snp2, snp3, and snp5. Essentially, I want to loop over all of the variants in my MatrixTable and create a filter based on that. I have thought of two ways to do this:
-
Write the positions and locus, array information (what my genotype MatrixTable is keyed by) to file. Process the file with native Python syntax (I really just need a simple for loop and list). Output the variants I want to keep into a list. And then use this list to filter out variants with
.contains
-
Use hail.scan to create either (1) list of variants to keep (2) some sort of boolean expression that I can pass to hl.filter_rows…
Here is my attempt at using scan:
# Scan over locus field
def applyDistanceScan(selectedSNPs, currSNP):
selectedSNPs = hl.if_else((selectedSNPs.length() >= 0) &
(
(selectedSNPs.last().contig != currSNP.contig) |
(hl.abs(selectedSNPs.last().position - currSNP.position) >= hl.literal(1000 * 1000))
),
selectedSNPs.append(currSNP)
, selectedSNPs.append(currSNP))
return selectedSNPs
# # Get last added SNP, if on next chromosome or 1000 kbp away, add to selected SNPs
# lastSNP = selectedSNPs.last()
# dist = lastSNP.position - currSNP.position
# if (lastSNP.contig != currSNP.contig) or (dist >= 1000 * 1000):
# selectedSNPS = selectedSNPs.append(currSNP)
# return selectedSNPs
bgenFile = pathToUKBcentralizedGeno + 'ukb_imp_chr'+ chrom + '_v3.bgen'
mfiFile = pathToUKBcentralizedGeno + 'reformat_ukb_mfi_chr' + chrom + '_v3.txt'
# Genotype information for all individuals
genotypeInfo = hl.import_bgen(
path=bgenFile,
entry_fields=['GT','dosage'], # GT required for variant QC, dosage is better to regress on?
sample_file=sampleFile
)
locusArray = hl.agg.collect(genotypeInfo.locus)
# Apply distance filter on SNPs
distFiltSNPs = locusArray.scan(applyDistanceScan, hl.empty_array(hl.tlocus(reference_genome='GRCh37')))[0]
centralizedGeno = genotypeInfo.filter_rows(distFiltSNPs.contains(genotypeInfo.locus))
With this approach, I get the following error:
hail.expr.expressions.base_expression.ExpressionException: Cannot combine expressions from different source objects.
Found fields from 2 objects:
<hail.matrixtable.MatrixTable object at 0x7feccc4409a0>: ['locus']
<hail.matrixtable.MatrixTable object at 0x7feccc20f700>: ['locus']
I think that I want to do is more a scan across the locus field in my matrix table and use that to generate the same filter. Any ideas on how to tackle this would be much appreciated! I am very new to Hail, I attempted to build a scan on my own but clearly am doing something very wrong here! I can try the first approach, but would greatly appreciate any thoughts on how to do this.