Filter vars on MAF and output rsIDs and freq


#1

Hi!
I downloaded and imported gnomad.exomes.r2.1.sites.vcf.bgz and saved it as MatrixTable:

hl.import_vcf('/data/gnomad.exomes.r2.1.sites.vcf.bgz').write('/data/gnomad.exomes.mt', overwrite=True)
mt = hl.read_matrix_table('/data/gnomad.exomes.mt')

Now, I’d like to filter vars with MAF < 10e-5 by non_neuro_AF_nfe and print out rsIDs and MAFs of those vars that pass that threshold.

I tried several R-minded options, but none of them really worked:
mtf = mt.filter_cols(mt.info.non_neuro_AF_nfe < 0.00001)

Can you help please?


#2

The gnomad release data is sites-only (no sample-level genotype data). The best way to process this data is as a Hail Table (1-dimensional structured object).

To that end, importing as follows:

hl.import_vcf('/data/gnomad.exomes.r2.1.sites.vcf.bgz').rows().write('/data/gnomad.exomes.ht', overwrite=True)
ht = hl.read_table('/data/gnomad.exomes.ht')

Now there’s only one filter method, which will do what you expect:

htf = ht.filter(ht.info.non_neuro_AF_nfe < 0.00001)

#3

Last command results in this error:

TypeError                                Traceback (most recent call last)
<ipython-input-4-a362e52bebb0> in <module>
----> 1 htf = ht.filter(ht.info.non_neuro_AF_nfe < 0.00001)

~/anaconda3/envs/hail/lib/python3.7/site-packages/hail/expr/expressions/base_expression.py in __lt__(self, other)
    341 
    342     def __lt__(self, other):
--> 343         return self._compare_op("<", other)
    344 
    345     def __le__(self, other):

~/anaconda3/envs/hail/lib/python3.7/site-packages/hail/expr/expressions/base_expression.py in _compare_op(self, op, other)
    366         left, right, success = unify_exprs(self, other)
    367         if not success:
--> 368             raise TypeError(f"Invalid '{op}' comparison, cannot compare expressions "
    369                             f"of type '{self.dtype}' and '{other.dtype}'")
    370         res = left._bin_op(op, right, hl.tbool)

TypeError: Invalid '<' comparison, cannot compare expressions of type 'array<float64>' and 'float64'

Typing ht.describe() shows that ‘info’ is a row field with various subfields, including non_neuro_AF_nfe, which is array (float64).


#4

if info.non_neuro_AF_nfe is an array, then you need to decide what the scientifically appropriate thing to do is. If you know all variants are biallelic, then info.non_neuro_AF_nfe[0] is probably the value to compare against. If there are multiallelics, then that’s probably going to be wrong.


#5

Thanks! Yes, I’m interested in biallelic vars. Can you please show how to extract and export rsIDs and MAF (i.e. info.non_neuro_AF_nfe[0]) into a new table?