Hello all! I am currently attempting to iterate through every single gene in my Hail table (total of ~ 100,000,000 rows), filtering to the appropriate consequence class, and summing up the mutability values for all variants in the class with this code:
consequence_classes = ['synonymous_variant', 'missense_variant', 'start_lost', 'stop_gained', 'stop_lost', 'splice_donor_variant',
'splice_acceptor_variant', 'splice_region_variant']
results_dict = {}
for gene in ensembl_vep.gene.collect(): #ensembl_vep is the Hail table
gene_ht = ensembl_vep.filter(ensembl_vep.gene == gene)
mutability_sums = [0] * len(consequence_classes)
for i, consequence_class in enumerate(consequence_classes):
consq_ht = gene_ht.filter(gene_ht.major_consequence == consequence_class)
mutability_sums[i] = consq_ht.aggregate(hl.agg.sum(consq_ht.mutability_float))
results_dict[gene] = mutability_sums
However, this takes quite long for around 18,000 genes. Can I ask if there is any way to make the process more efficient? I have a slightly more complex code that also filters to different combinations of pathogenicity predictors, and it takes like 15 mins per gene… (tracked by printing progress)
I’d appreciate any input on this, thank you very much!
Thank you very much Tim for your insights! Also, just another technical question to improve my understanding of Hail, for this line: for gene in ensembl_vep.gene.collect():
Does it iterate through unique genes? I am printing the progress with the “gene” variable, and so far I see 5 loops where the gene is the same gene.