Prune variants based on distance threshold

Hi, I am working on a building a script to perform GWAS on the imputed UKBiobank fileset. I have been building off this post: https://discuss.hail.is/t/how-to-run-gwas-from-uk-biobank-efficiently-on-hail/1400/9

After loading genotype data from been files, I would like to apply a distance filter on QC-filtered variants. The idea is basically to: create a set of variants that are at X bp (or on the next chromosome) from every other variant in the set.

For example, if these were the positions of QC-filtered variants and X=100 bp:

snp1 100000
snp2 100075
snp3 100086
snp4 100500
snp5 100550
snp6 100600
snp7 100800

Then, I would want to filter out: snp2, snp3, and snp5. Essentially, I want to loop over all of the variants in my MatrixTable and create a filter based on that. I have thought of two ways to do this:

  • Write the positions and locus, array information (what my genotype MatrixTable is keyed by) to file. Process the file with native Python syntax (I really just need a simple for loop and list). Output the variants I want to keep into a list. And then use this list to filter out variants with .contains

  • Use hail.scan to create either (1) list of variants to keep (2) some sort of boolean expression that I can pass to hl.filter_rows…

Here is my attempt at using scan:

# Scan over locus field
def applyDistanceScan(selectedSNPs, currSNP):
    selectedSNPs = hl.if_else((selectedSNPs.length() >= 0) & 
                (
                    (selectedSNPs.last().contig != currSNP.contig) | 
                    (hl.abs(selectedSNPs.last().position - currSNP.position) >= hl.literal(1000 * 1000))
                ),
    selectedSNPs.append(currSNP)
    , selectedSNPs.append(currSNP))

    return selectedSNPs

        # # Get last added SNP, if on next chromosome or 1000 kbp away, add to selected SNPs
        # lastSNP = selectedSNPs.last()
        # dist = lastSNP.position - currSNP.position
        # if (lastSNP.contig != currSNP.contig) or (dist >= 1000 * 1000):
        #     selectedSNPS = selectedSNPs.append(currSNP)
        
    # return selectedSNPs
  

bgenFile = pathToUKBcentralizedGeno + 'ukb_imp_chr'+ chrom + '_v3.bgen'
mfiFile = pathToUKBcentralizedGeno + 'reformat_ukb_mfi_chr' + chrom + '_v3.txt'

# Genotype information for all individuals
genotypeInfo = hl.import_bgen(
    path=bgenFile,
    entry_fields=['GT','dosage'], # GT required for variant QC, dosage is better to regress on? 
    sample_file=sampleFile
)
locusArray = hl.agg.collect(genotypeInfo.locus)
# Apply distance filter on SNPs
distFiltSNPs = locusArray.scan(applyDistanceScan, hl.empty_array(hl.tlocus(reference_genome='GRCh37')))[0]
centralizedGeno = genotypeInfo.filter_rows(distFiltSNPs.contains(genotypeInfo.locus))

With this approach, I get the following error:

hail.expr.expressions.base_expression.ExpressionException: Cannot combine expressions from different source objects.
    Found fields from 2 objects:
        <hail.matrixtable.MatrixTable object at 0x7feccc4409a0>: ['locus']
        <hail.matrixtable.MatrixTable object at 0x7feccc20f700>: ['locus']

I think that I want to do is more a scan across the locus field in my matrix table and use that to generate the same filter. Any ideas on how to tackle this would be much appreciated! I am very new to Hail, I attempted to build a scan on my own but clearly am doing something very wrong here! I can try the first approach, but would greatly appreciate any thoughts on how to do this.

I think this works –

WINDOW_SIZE=100
mt = hl.balding_nichols_model(3, 10, 1000)

def outside_window(prev, current):
  return hl.is_missing(prev) | (prev.contig != current.contig) | (current.position - prev.position >= WINDOW_SIZE)

# need to add hl.scan.last as an alias for this
last_locus = hl.scan.fold(
  hl.null(mt.locus.dtype),
  lambda prev: hl.if_else(outside_window(prev, mt.locus), mt.locus, prev),
  lambda prev, next: next) 

mt = mt.annotate_rows(last = last_locus)
mt = mt.filter_rows(outside_window(mt.last, mt.locus))
mt.rows().show()
+---------------+------------+--------------+------------------------------+---------------+
| locus         | alleles    | ancestral_af | af                           | last          |
+---------------+------------+--------------+------------------------------+---------------+
| locus<GRCh37> | array<str> |      float64 | array<float64>               | locus<GRCh37> |
+---------------+------------+--------------+------------------------------+---------------+
| 1:1           | ["A","C"]  |     5.39e-01 | [5.23e-01,5.65e-01,7.37e-01] | NA            |
| 1:101         | ["A","C"]  |     6.42e-01 | [7.66e-01,5.65e-01,5.63e-01] | 1:1           |
| 1:201         | ["A","C"]  |     2.56e-01 | [3.52e-01,2.94e-01,1.10e-01] | 1:101         |
| 1:326         | ["A","C"]  |     5.90e-01 | [6.86e-01,5.44e-01,7.41e-01] | 1:226         |
| 1:451         | ["A","C"]  |     4.95e-01 | [5.80e-01,4.71e-01,5.43e-01] | 1:351         |
| 1:576         | ["A","C"]  |     6.64e-01 | [8.12e-01,8.60e-01,5.07e-01] | 1:476         |
| 1:701         | ["A","C"]  |     4.39e-01 | [2.83e-01,6.35e-01,7.63e-01] | 1:601         |
| 1:826         | ["A","C"]  |     8.38e-01 | [8.91e-01,9.26e-01,7.22e-01] | 1:726         |
| 1:951         | ["A","C"]  |     5.47e-01 | [8.61e-01,6.26e-01,6.58e-01] | 1:851         |
+---------------+------------+--------------+------------------------------+---------------+

Fun problem :slight_smile:

the place your solution went wrong is using selectedSNPs like a mutable Python list, which isn’t valid Hail syntax (everything in Hail is functional to be able to build up our program representation). This is a big part of the steep learning curve for Hail.