Hi!
I downloaded and imported gnomad.exomes.r2.1.sites.vcf.bgz and saved it as MatrixTable:
hl.import_vcf('/data/gnomad.exomes.r2.1.sites.vcf.bgz').write('/data/gnomad.exomes.mt', overwrite=True)
mt = hl.read_matrix_table('/data/gnomad.exomes.mt')
Now, I’d like to filter vars with MAF < 10e-5 by non_neuro_AF_nfe and print out rsIDs and MAFs of those vars that pass that threshold.
I tried several R-minded options, but none of them really worked:
mtf = mt.filter_cols(mt.info.non_neuro_AF_nfe < 0.00001)
Can you help please?
The gnomad release data is sites-only (no sample-level genotype data). The best way to process this data is as a Hail Table (1-dimensional structured object).
To that end, importing as follows:
hl.import_vcf('/data/gnomad.exomes.r2.1.sites.vcf.bgz').rows().write('/data/gnomad.exomes.ht', overwrite=True)
ht = hl.read_table('/data/gnomad.exomes.ht')
Now there’s only one filter
method, which will do what you expect:
htf = ht.filter(ht.info.non_neuro_AF_nfe < 0.00001)
Last command results in this error:
TypeError Traceback (most recent call last)
<ipython-input-4-a362e52bebb0> in <module>
----> 1 htf = ht.filter(ht.info.non_neuro_AF_nfe < 0.00001)
~/anaconda3/envs/hail/lib/python3.7/site-packages/hail/expr/expressions/base_expression.py in __lt__(self, other)
341
342 def __lt__(self, other):
--> 343 return self._compare_op("<", other)
344
345 def __le__(self, other):
~/anaconda3/envs/hail/lib/python3.7/site-packages/hail/expr/expressions/base_expression.py in _compare_op(self, op, other)
366 left, right, success = unify_exprs(self, other)
367 if not success:
--> 368 raise TypeError(f"Invalid '{op}' comparison, cannot compare expressions "
369 f"of type '{self.dtype}' and '{other.dtype}'")
370 res = left._bin_op(op, right, hl.tbool)
TypeError: Invalid '<' comparison, cannot compare expressions of type 'array<float64>' and 'float64'
Typing ht.describe()
shows that ‘info’ is a row field with various subfields, including non_neuro_AF_nfe, which is array (float64).
if info.non_neuro_AF_nfe
is an array, then you need to decide what the scientifically appropriate thing to do is. If you know all variants are biallelic, then info.non_neuro_AF_nfe[0]
is probably the value to compare against. If there are multiallelics, then that’s probably going to be wrong.
Thanks! Yes, I’m interested in biallelic vars. Can you please show how to extract and export rsIDs and MAF (i.e. info.non_neuro_AF_nfe[0]) into a new table?