Gnomad allele frequency query

hi, as I am still learning hail (python), I have a very basic question, because I do not yet fully understand hail tables and how to query them.

I want to get allele frequencies for all gnomAD populations for a subset of variants and export it as table or vcf.

I tried exemplarily for one random variant:

ht = hl.read_table(‘data/gnomad.genomes.v3.1.sites.ht/’)
favorite_locus = hl.parse_locus(‘chr1:156129878’, ‘GRCh38’)

When comparing it with the data on the gnomAD-browser, I can see that:

ht.filter(ht.locus == favorite_locus).freq[0].show()
±---------------±-----------±----------±----------±----------±------------------------+
| locus | alleles | .AC | .AF | .AN | .homozygote_count |
±---------------±-----------±----------±----------±----------±------------------------+
| locus | array | int32 | float64 | int32 | int32 |
±---------------±-----------±----------±----------±----------±------------------------+
| chr1:156129878 | [“T”,“G”] | 27542 | 1,81e-01 | 151952 | 4757 |
±---------------±-----------±----------±----------±----------±------------------------+

gives the information for total AF and

ht.filter(ht.locus == favorite_locus).freq[2].show()

±---------------±-----------±----------±----------±----------±------------------------+
| locus | alleles | .AC | .AF | .AN | .homozygote_count |
±---------------±-----------±----------±----------±----------±------------------------+
| locus | array | int32 | float64 | int32 | int32 |
±---------------±-----------±----------±----------±----------±------------------------+
| chr1:156129878 | [“T”,“G”] | 4832 | 7,11e-02 | 67994 | 180 |
±---------------±-----------±----------±----------±----------±------------------------+

for NFE AF.

Can someone explain me how to use the “meta” information to more easily query for populations. Or more general, how to get names for the information stored in an array of a row-field ?

Thanks a lot for you help!

The gnomAD Hail Tables have a global field named freq_meta. Each entry in freq_meta describes the freq entry at the same index (freq_meta[0] contains information about freq[0], freq_meta[1] about freq[1], etc.).

To get freq_meta as a Python list, use hl.eval(ht.globals.freq_meta).

Thanks! That already helps. Again, sorry for the basic questions, but can you tell me how to combine this information to:
a) generate a ht with e.g. freq[0] and freq[2]
and
b) finally export it to a tab, tsv or csv delimited text file which looks like this:

locus | alleles | gnomAD.AC | gnomAD.AF | gnomAD.AN | gnomAD.homozygote_count | nfe.AC | nfe.AF | nfe.AN | nfe.homozygote_count

Thanks!

Something along the lines of:

ht.select(
  "locus",
  "alleles",
  gnomad_ac=ht.freq[0].AC,
  gnomad_af=ht.freq[0].AF,
  gnomad_an=ht.freq[0].AN,
  gnomad_homozygote_count=ht.freq[0].homozygote_count,
).export("/path/to/export.tsv")

To include more columns, include expressions for them in the select call.

Thanks a lot, that really helped.
It worked without the “locus”, “alleles” part.

Hello,

I want to extract the counts but for the whole genome, not just few lociis. I downloaded gnomad v211 liftover ht and ran the command, I met the following error -

2021-03-02 17:09:23 Hail: ERROR: Analysis exception: 'Table.select': cannot overwrite key field 'locus' with annotate, select or drop; use key_by to modify keys.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<decorator-gen-970>", line 2, in select
  File "/root/anaconda3/envs/hail/lib/python3.7/site-packages/hail/typecheck/check.py", line 614, in wrapper
    return __original_func(*args_, **kwargs_)
  File "/root/anaconda3/envs/hail/lib/python3.7/site-packages/hail/table.py", line 922, in select
    self._row)
  File "/root/anaconda3/envs/hail/lib/python3.7/site-packages/hail/utils/misc.py", line 423, in get_select_exprs
    check_keys(caller, name, protected_key)
  File "/root/anaconda3/envs/hail/lib/python3.7/site-packages/hail/utils/misc.py", line 397, in check_keys
    raise ExpressionException(msg)
hail.expr.expressions.base_expression.ExpressionException: 'Table.select': cannot overwrite key field 'locus' with annotate, select or drop; use key_by to modify keys.

I get disk space error when I try the command without "locus","alleles" part, maybe need to set a tmp directory(?)

What is the code you ran?

Sorry, my earlier example incorrectly included the locus and alleles key fields in the select. It should have been:

ht.select(
  gnomad_ac=ht.freq[0].AC,
  gnomad_af=ht.freq[0].AF,
  gnomad_an=ht.freq[0].AN,
  gnomad_homozygote_count=ht.freq[0].homozygote_count,
).export("/path/to/export.tsv")

Hello @nawatts and @tpoterba ,

I tried this command

 ht.select(
   gnomad_ac=ht.freq[0].AC,
   gnomad_af=ht.freq[0].AF,
   gnomad_an=ht.freq[0].AN,
   gnomad_homozygote_count=ht.freq[0].homozygote_count,
 ).export("gnomad_v211_genome_counts.tsv")

and it worked without issue, but the locus and alleles are printed as

chr1:10067      ["T","TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC"]

Could I get them in this format?

chr\tStart\tEnd\tRef\tAlt\tAC\Homozygotes

Since I need the end positions, I would rather try to get the output from hail itself, rather than parsing this file myself.

That would look something like:

ht.select(
  chr=ht.locus.contig,
  tStart=ht.locus.position,
  tRef=ht.alleles[0],
  tAlt=ht.alleles[1],
  tAC=ht.freq[0].AC,
  Homozygotes=ht.freq[0].homozygote_count,
).export("file.tsv")

I’m not sure what tEnd would refer to. The gnomAD variants have a single locus for a position.

1 Like

@kvn95ss, If your table uses the locus and alleles and key fields (check ht.describe()), you’ll need to. run ht = ht.key_by() first.

A post was split to a new topic: How do I create a locus and allele keyed table from chromosome, start position, end position, reference allele and alt allele?