Let’s say I have a list of genes ("/path/gene_list.tsv") and I would like to extract the gnomAD (v2/v3) data for this gene list. I tried the code below, but I am not getting the results I expected. Could you please help me see where I went wrong? Also, why is it that when I download the data from gs://, it is only available as a Hail Table and not as a Hail MatrixTable?
The released gnomAD data is site-level summary data, not the genotype-level matrix table. The genotypes cannot be made public for privacy reasons.
A couple of things to note here.
First, show() only prints rows; it does not filter anything. Second, Hail doesn’t support filtering on a Pandas Series like this.
Third, VEP output is a bit tough to manipulate because mt.vep.transcript_consequences.gene_symbol is an array of gene symbols, not a single string. I recommend installing the gnomAD utilities first using pip install gnomad or adding --pkgs gnomad to hailctl dataproc start. Then you can do the following:
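To illustrate the second and third points, here is a minimal sketch that does not use the gnomAD utilities and works on the raw VEP annotation instead. It is not the snippet from the original reply. It assumes the gene list is a tab-separated file with a header column named gene_symbol, and that the table has the vep.transcript_consequences array mentioned above. The helper names (parse_gene_symbols, read_gene_symbols, filter_to_genes) are hypothetical and exist only for this example.

```python
import csv


def parse_gene_symbols(lines, column="gene_symbol"):
    """Collect the non-empty values of `column` from tab-separated lines.

    `lines` is any iterable of strings whose first line is the header.
    """
    reader = csv.DictReader(lines, delimiter="\t")
    symbols = set()
    for row in reader:
        value = (row.get(column) or "").strip()
        if value:
            symbols.add(value)
    return symbols


def read_gene_symbols(path, column="gene_symbol"):
    """Read the gene list file into a plain Python set (not a Pandas Series)."""
    with open(path, newline="") as handle:
        return parse_gene_symbols(handle, column)


def filter_to_genes(ht, gene_symbols):
    """Keep rows where at least one transcript consequence hits a listed gene.

    Assumes `ht` carries `vep.transcript_consequences`, an array of structs
    with a `gene_symbol` field, and that `gene_symbols` is a non-empty set.
    """
    # Imported here so the parsing helpers above can be used without Hail.
    import hail as hl

    # Turn the Python set into a Hail expression usable inside filter().
    genes = hl.literal(gene_symbols)
    # gene_symbol lives inside an array, so test each element with any()
    # instead of comparing the whole field to a single string.
    return ht.filter(
        ht.vep.transcript_consequences.any(
            lambda tc: genes.contains(tc.gene_symbol)
        )
    )
```

With a table ht already loaded through hl.read_table, usage would look like filter_to_genes(ht, read_gene_symbols("/path/gene_list.tsv")), and calling count() on the result should then give fewer rows than ht.count() whenever the list does not cover every gene in the table.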
I followed your solution for filtering gnomAD data with a list of gene symbols. However, I don’t know why I get the same number of records (variants) after filtering.
Could you please take a look at my code below? Many thanks!
The last line returns the same number of variants as ht.count(). My gene_list.tsv file contains only one column (named gene_symbol), with each row holding one gene name.
One more note: to avoid raising an error, I had to use “vep.worst_csq_by_gene_canonical.gene_symbol” rather than “vep.worst_csq_per_variant_canonical.gene_symbol” as in your reply, and I don’t really understand why.