Hi fantastic hail team,
Here is my simple question:
let’s say I have a list of genes("/path/gene_list.tsv") and I would like to extract the gnomad(v2/v3) data for this gene list. I have tried it by below code and I am not getting the results as expected. Could you please help with where I do it wrongly? Also why when I try to download the hail table from gs:// it only can be downloaded as hail table not hail matrixtable?
mt = hl.read_table(“gs://gnomad-public/release/2.1.1/ht/exomes/gnomad.exomes.r2.1.1.sites.ht/”)
mt_f1 = mt.filter(mt.vep.transcript_consequences.gene_symbol.show() == gene_list[“gene_symbol”])
The released gnomAD data is site-level summary data, not the genotype-level matrix table. The genotypes cannot be made public for privacy reasons.
A couple of things to note here.
First, show() is only used to print, it’s not used in filtering. Second, Hail doesn’t support filtering on a Pandas Series like this.
Third, VEP output is a bit tough to manipulate because
mt.vep.transcript_consequences.gene_symbol is an array of gene symbols, not a single string. I recommend installing the gnomAD utilities first using
pip install gnomad or adding
--pkgs gnomad to
hailctl dataproc start. Then you can do the following:
from gnomad.utils.vep import process_consequences
ht = hl.read_table(“gs://gnomad-public/release/2.1.1/ht/exomes/gnomad.exomes.r2.1.1.sites.ht/”)
ht_pc = process_consequences(ht)
genes_as_hail_literal = hl.literal(set(gene_list['gene_symbol']))
ht_gene_symbol = ht_pc.vep.worst_csq_per_variant_canonical.gene_symbol
hl.len(genes_as_hail_literal.intersection(hl.set(ht_gene_symbol))) > 0)
EDIT: Fixed bug noted in post 4 below.
Thanks a lot for the reply, very useful (clearly I need more practice to adapt hail expression!). However, I have tried it, and running the last line
ht_filt = ht_pc.filter(genes_as_hail_literal.contains(ht_gene_symbol))
got this error:
TypeError: 'SetExpression.contains' expects 'item' to be the same type as its elements
set element type: 'str'
type of arg 'item': array<str>'
which I assume it has the same problem as
ah, it’s a doubly-nested array! I think you can do:
ht_pc.filter(hl.len(genes_as_hail_literal.intersection(hl.set(ht_gene_symbol))) > 0)