Help for annotating a matrixtable variant data in DNAnexus with gnomAD database

shengwei66 · February 6, 2023, 9:05pm

Hello team,

I am trying to annotate a variant data in matrixtable format on the DNAnexus platform using hail with gnomAD database. The hail version in DNAnexus is 0.2.78. I am using hl.experimental.DB() and db.annotate_rows_db() for annotate my mt data. My code ran ok, but I got all missing values for the annotation fields from gnomAD. Can you please help me to figure it out what went wrong and I can get the annotation values from gnomAD? Thanks very much for your help.

My code is below:
fmturl = ‘dnax://…/filtered.mt’
filtered_mt=hl.read_matrix_table(fmturl)
db = hl.experimental.DB(region=‘us’, cloud=‘aws’)
ann_mt = db.annotate_rows_db(filtered_mt, ‘gnomad_genome_sites’)

I got the message below after running the above code

2023-02-06 20:05:46 Hail: INFO: index_compatible_version: More than one compatible version exists for annotation dataset: gnomad_genome_sites. Rows have been annotated with version 3.1.

convert matrixtable to hail table

ann_tb = ann_mt.rows()
ann_tb.show()

all gnomad_genome_sites fields are NA

danking · February 6, 2023, 10:44pm

I suspect you have different alleles from gnomad. Have you already split your dataset? Look at hl.split_multi_hts. You can verify this by checking this:

ht = hl.experimental.load_dataset('gnomad_genome_sites', version='3.1')
# drop the alleles from the key
ht = ht.key_by('locus')
mt = filtered_mt
mt = mt.annotate_rows(gnomad_alleles = ht[mt.locus].alleles)
mt.gnomad_alleles.show()

shengwei66 · February 7, 2023, 3:04am

Thanks Dan for your advice. I tried to run the code you provided above to see if I have different alleles from gnomad.

after running mt.gnomad_alleles.show(2)

You are correct! If I can remove “<NON_REF>” from my alleles, then I will have the same allele as gnomad. Is there any method I can use to remove “<NON_REF>”?

My filtered_mt was created by the code below:
filtered_mt = orig_mt.filter_rows(hl.agg.any(orig_mt.GT.is_non_ref()))

Thanks again for your help!

tpoterba · February 7, 2023, 3:22am

You can do:

mt = mt.key_rows_by(locus=mt.locus, alleles=mt.alleles.filter(lambda x: x != '<NON_REF>'))

shengwei66 · February 7, 2023, 1:51pm

Thanks very much, Tim. I will try it out.

shengwei66 · February 7, 2023, 8:52pm

With the help from Dan and Tim, I am able to annotate my variant data with gnomad_genome_sites. Thanks again to Dan and Tim.

From the gnomad annotation, I am specially interested in the allele frequency and we want to filter out variants with allele frequency > 0.001 to find rare variants. I tried to understand the annotated data by just looking at one row, and found that the row field of gnomad_genome_sites.freq has the nested structure with 598 records, where each is also a structure with the AC, AF, AN, homozygote_count fields. I am confused here with this one-to-many match results. Can you please help me to understand this?

I know that AC is for allele count, AF is for allele frequency, and AN is for allele number. I also found another row field called popmax which also has the same structure of freq but only 1 record. For my purpose of filtering, should I be using the popmax?

I am a beginner just starting to learn genomic data analysis. Any advice/guidance are greatly appreciated!

danking · February 7, 2023, 11:04pm

I think your best bet is to contact the gnomAD team. They should have some documentation of what the numbers mean.

I believe there are annotations in the globals of the table which describe what each element of the AF array corresponds to. For example, AF[0], I believe, refers to the overall allele frequency. The other elements correspond to the allele frequency within certain sub-populations.

shengwei66 · February 8, 2023, 2:43am

Thanks Dan for your advice. I will check the global fields for further info and also check with the gnomAD team.

danking · February 8, 2023, 8:12pm

This might be helpful: gnomAD

shengwei66 · February 9, 2023, 2:55am

Thanks Dan for providing the link of gnomAD.

I tried to see the content of the freq_meta in the global field of gnomad_genome_sites and my code is below:

ht = hl.experimental.load_dataset(‘gnomad_genome_sites’,
version=‘3.1’,
reference_genome=‘GRCh38’,
region=‘us’,
cloud=‘aws’)
hl.eval(ht.freq_meta)

Following are the eval results

[frozendict({‘group’: ‘adj’}),
frozendict({‘group’: ‘raw’}),
frozendict({‘group’: ‘adj’, ‘pop’: ‘nfe’}),
frozendict({‘group’: ‘adj’, ‘pop’: ‘fin’}),
frozendict({‘group’: ‘adj’, ‘pop’: ‘mid’}),
frozendict({‘group’: ‘adj’, ‘pop’: ‘oth’}),
frozendict({‘group’: ‘adj’, ‘pop’: ‘ami’}),
frozendict({‘group’: ‘adj’, ‘pop’: ‘asj’}),
frozendict({‘group’: ‘adj’, ‘pop’: ‘afr’}),
frozendict({‘group’: ‘adj’, ‘pop’: ‘eas’}),
frozendict({‘group’: ‘adj’, ‘pop’: ‘sas’}),
frozendict({‘group’: ‘adj’, ‘pop’: ‘amr’}),
frozendict({‘group’: ‘adj’, ‘sex’: ‘XX’}),
frozendict({‘group’: ‘adj’, ‘sex’: ‘XY’}),
…

I don’t know what ‘adj’ and ‘raw’ are referring to. What I need is the allele frequency of total population. Maybe I just need to use the first element, which refers to the overall allele frequency as you pointed out above.

danking · February 9, 2023, 7:28pm

Take a look at “How do I access the gnomAD Hail Table frequency annotation?” at gnomAD’s help page.

shengwei66 · February 9, 2023, 8:53pm

Thanks Dan. This help page is very helpful!

Topic		Replies	Views
Annotating variants in a matrix table with 1000genomes database Hail Query & hailctl	0	341	April 20, 2023
Cannot Annotate DB Hail Query & hailctl	6	479	October 5, 2020
Annotate variants with hom var samples Hail Query & hailctl	0	342	January 19, 2023
Annotate variants with genes Hail Query & hailctl	9	482	May 22, 2023
Variant annotation in MatrixTable Hail Query & hailctl	13	791	July 3, 2020