Filter vars on MAF and output rsIDs and freq


I downloaded and imported gnomad.exomes.r2.1.sites.vcf.bgz and saved it as MatrixTable:

hl.import_vcf('/data/gnomad.exomes.r2.1.sites.vcf.bgz').write('/data/', overwrite=True)
mt = hl.read_matrix_table('/data/')

Now, I’d like to filter vars with MAF < 10e-5 by non_neuro_AF_nfe and print out rsIDs and MAFs of those vars that pass that threshold.

I tried several R-minded options, but none of them really worked:
mtf = mt.filter_cols( < 0.00001)

Can you help please?


The gnomad release data is sites-only (no sample-level genotype data). The best way to process this data is as a Hail Table (1-dimensional structured object).

To that end, importing as follows:

hl.import_vcf('/data/gnomad.exomes.r2.1.sites.vcf.bgz').rows().write('/data/', overwrite=True)
ht = hl.read_table('/data/')

Now there’s only one filter method, which will do what you expect:

htf = ht.filter( < 0.00001)


Last command results in this error:

TypeError                                Traceback (most recent call last)
<ipython-input-4-a362e52bebb0> in <module>
----> 1 htf = ht.filter( < 0.00001)

~/anaconda3/envs/hail/lib/python3.7/site-packages/hail/expr/expressions/ in __lt__(self, other)
    342     def __lt__(self, other):
--> 343         return self._compare_op("<", other)
    345     def __le__(self, other):

~/anaconda3/envs/hail/lib/python3.7/site-packages/hail/expr/expressions/ in _compare_op(self, op, other)
    366         left, right, success = unify_exprs(self, other)
    367         if not success:
--> 368             raise TypeError(f"Invalid '{op}' comparison, cannot compare expressions "
    369                             f"of type '{self.dtype}' and '{other.dtype}'")
    370         res = left._bin_op(op, right, hl.tbool)

TypeError: Invalid '<' comparison, cannot compare expressions of type 'array<float64>' and 'float64'

Typing ht.describe() shows that ‘info’ is a row field with various subfields, including non_neuro_AF_nfe, which is array (float64).


if info.non_neuro_AF_nfe is an array, then you need to decide what the scientifically appropriate thing to do is. If you know all variants are biallelic, then info.non_neuro_AF_nfe[0] is probably the value to compare against. If there are multiallelics, then that’s probably going to be wrong.


Thanks! Yes, I’m interested in biallelic vars. Can you please show how to extract and export rsIDs and MAF (i.e. info.non_neuro_AF_nfe[0]) into a new table?