Help for annotating a matrixtable variant data in DNAnexus with gnomAD database

Hello team,

I am trying to annotate a variant data in matrixtable format on the DNAnexus platform using hail with gnomAD database. The hail version in DNAnexus is 0.2.78. I am using hl.experimental.DB() and db.annotate_rows_db() for annotate my mt data. My code ran ok, but I got all missing values for the annotation fields from gnomAD. Can you please help me to figure it out what went wrong and I can get the annotation values from gnomAD? Thanks very much for your help.

My code is below:
fmturl = ‘dnax://…/filtered.mt’
filtered_mt=hl.read_matrix_table(fmturl)
db = hl.experimental.DB(region=‘us’, cloud=‘aws’)
ann_mt = db.annotate_rows_db(filtered_mt, ‘gnomad_genome_sites’)

I got the message below after running the above code

2023-02-06 20:05:46 Hail: INFO: index_compatible_version: More than one compatible version exists for annotation dataset: gnomad_genome_sites. Rows have been annotated with version 3.1.

convert matrixtable to hail table

ann_tb = ann_mt.rows()
ann_tb.show()

all gnomad_genome_sites fields are NA

I suspect you have different alleles from gnomad. Have you already split your dataset? Look at hl.split_multi_hts. You can verify this by checking this:

ht = hl.experimental.load_dataset('gnomad_genome_sites', version='3.1')
# drop the alleles from the key
ht = ht.key_by('locus')
mt = filtered_mt
mt = mt.annotate_rows(gnomad_alleles = ht[mt.locus].alleles)
mt.gnomad_alleles.show()

Thanks Dan for your advice. I tried to run the code you provided above to see if I have different alleles from gnomad.

after running mt.gnomad_alleles.show(2)

|locus|alleles|gnomad_alleles|
|locus|array|array|
|chr1:18182422|[“G”,“A”,“<NON_REF>”]|[“G”,“A”]|
|chr1:18183539|[“T”,“G”,“<NON_REF>”]|[“T”,“G”]|

You are correct! If I can remove “<NON_REF>” from my alleles, then I will have the same allele as gnomad. Is there any method I can use to remove “<NON_REF>”?

My filtered_mt was created by the code below:
filtered_mt = orig_mt.filter_rows(hl.agg.any(orig_mt.GT.is_non_ref()))

Thanks again for your help!

You can do:

mt = mt.key_rows_by(locus=mt.locus, alleles=mt.alleles.filter(lambda x: x != '<NON_REF>'))

Thanks very much, Tim. I will try it out.

With the help from Dan and Tim, I am able to annotate my variant data with gnomad_genome_sites. Thanks again to Dan and Tim.

From the gnomad annotation, I am specially interested in the allele frequency and we want to filter out variants with allele frequency > 0.001 to find rare variants. I tried to understand the annotated data by just looking at one row, and found that the row field of gnomad_genome_sites.freq has the nested structure with 598 records, where each is also a structure with the AC, AF, AN, homozygote_count fields. I am confused here with this one-to-many match results. Can you please help me to understand this?

I know that AC is for allele count, AF is for allele frequency, and AN is for allele number. I also found another row field called popmax which also has the same structure of freq but only 1 record. For my purpose of filtering, should I be using the popmax?

I am a beginner just starting to learn genomic data analysis. Any advice/guidance are greatly appreciated!

I think your best bet is to contact the gnomAD team. They should have some documentation of what the numbers mean.

I believe there are annotations in the globals of the table which describe what each element of the AF array corresponds to. For example, AF[0], I believe, refers to the overall allele frequency. The other elements correspond to the allele frequency within certain sub-populations.

Thanks Dan for your advice. I will check the global fields for further info and also check with the gnomAD team.

This might be helpful: gnomAD

Thanks Dan for providing the link of gnomAD.

I tried to see the content of the freq_meta in the global field of gnomad_genome_sites and my code is below:

ht = hl.experimental.load_dataset(‘gnomad_genome_sites’,
version=‘3.1’,
reference_genome=‘GRCh38’,
region=‘us’,
cloud=‘aws’)
hl.eval(ht.freq_meta)

Following are the eval results

[frozendict({‘group’: ‘adj’}),
frozendict({‘group’: ‘raw’}),
frozendict({‘group’: ‘adj’, ‘pop’: ‘nfe’}),
frozendict({‘group’: ‘adj’, ‘pop’: ‘fin’}),
frozendict({‘group’: ‘adj’, ‘pop’: ‘mid’}),
frozendict({‘group’: ‘adj’, ‘pop’: ‘oth’}),
frozendict({‘group’: ‘adj’, ‘pop’: ‘ami’}),
frozendict({‘group’: ‘adj’, ‘pop’: ‘asj’}),
frozendict({‘group’: ‘adj’, ‘pop’: ‘afr’}),
frozendict({‘group’: ‘adj’, ‘pop’: ‘eas’}),
frozendict({‘group’: ‘adj’, ‘pop’: ‘sas’}),
frozendict({‘group’: ‘adj’, ‘pop’: ‘amr’}),
frozendict({‘group’: ‘adj’, ‘sex’: ‘XX’}),
frozendict({‘group’: ‘adj’, ‘sex’: ‘XY’}),

I don’t know what ‘adj’ and ‘raw’ are referring to. What I need is the allele frequency of total population. Maybe I just need to use the first element, which refers to the overall allele frequency as you pointed out above.

Take a look at “How do I access the gnomAD Hail Table frequency annotation?” at gnomAD’s help page.

Thanks Dan. This help page is very helpful!

1 Like