Querying gnomad using hail table by gene symbol

Hi fantastic hail team,

Here is my simple question:

let’s say I have a list of genes("/path/gene_list.tsv") and I would like to extract the gnomad(v2/v3) data for this gene list. I have tried it by below code and I am not getting the results as expected. Could you please help with where I do it wrongly? Also why when I try to download the hail table from gs:// it only can be downloaded as hail table not hail matrixtable?

mt = hl.read_table(“gs://gnomad-public/release/2.1.1/ht/exomes/gnomad.exomes.r2.1.1.sites.ht/”)

gene_list= pandas.read_table(‘path/gene_list.tsv’)

mt_f1 = mt.filter(mt.vep.transcript_consequences.gene_symbol.show() == gene_list[“gene_symbol”])

Thank you!

Amir

The released gnomAD data is site-level summary data, not the genotype-level matrix table. The genotypes cannot be made public for privacy reasons.

A couple of things to note here.

First, show() is only used to print, it’s not used in filtering. Second, Hail doesn’t support filtering on a Pandas Series like this.

Third, VEP output is a bit tough to manipulate because mt.vep.transcript_consequences.gene_symbol is an array of gene symbols, not a single string. I recommend installing the gnomAD utilities first using pip install gnomad or adding --pkgs gnomad to hailctl dataproc start. Then you can do the following:

from gnomad.utils.vep import process_consequences
ht = hl.read_table(“gs://gnomad-public/release/2.1.1/ht/exomes/gnomad.exomes.r2.1.1.sites.ht/”)
gene_list= pandas.read_table(‘path/gene_list.tsv’)

ht_pc = process_consequences(ht)

genes_as_hail_literal = hl.literal(set(gene_list['gene_symbol']))
ht_gene_symbol = ht_pc.vep.worst_csq_per_variant_canonical.gene_symbol
ht_pc.filter(
    hl.len(genes_as_hail_literal.intersection(hl.set(ht_gene_symbol))) > 0)

EDIT: Fixed bug noted in post 4 below.

1 Like

Thanks a lot for the reply, very useful (clearly I need more practice to adapt hail expression!). However, I have tried it, and running the last line

ht_filt = ht_pc.filter(genes_as_hail_literal.contains(ht_gene_symbol))

got this error:

TypeError: 'SetExpression.contains' expects 'item' to be the same type as its elements
set element type:  'str'
type of arg 'item': array<str>'

which I assume it has the same problem as mt.vep.transcript_consequences.gene_symbol?

ah, it’s a doubly-nested array! I think you can do:

ht_pc.filter(hl.len(genes_as_hail_literal.intersection(hl.set(ht_gene_symbol))) > 0)
1 Like

3 posts were split to a new topic: Filtering is really fast but show is really slow, why is that?

Hi tpoterba,

I follow your solution about filtering gnomad data with a list of gene symbol. However, I don’t know why after filtering, I got the same number of records (variants).
Could you please take a look at my code below. Many thanks!

ht = hl.read_table(“path/gnomad.exomes.r2.1.sites.ht/”)
ht_pc = process_consequences(ht)
gene_list= pd.read_table(‘path/gene_list.tsv’)
gene_list=gene_list.dropna()
genes_as_hail_literal = hl.literal(set(gene_list[‘gene_symbol’]))
ht_gene_symbol = ht_pc.vep.worst_csq_by_gene_canonical.gene_symbol
ht_pc.filter(hl.len(genes_as_hail_literal.intersection(hl.set(ht_gene_symbol))) > 0)
ht_pc.count()

The last line returns the same number of variants as ht.count(). My gene_list.tsv file contains only one column (named gene_symbol with each row represents a gene name).

One more note is that to not raise and error, I need to put “vep.worst_csq_by_gene_canonical.gene_symbol” but not “vep.worst_csq_per_variant_canonical.gene_symbol” like in your reply and I don’t really understand it.