How to make iterating through every gene more efficient?

Hello all! I am currently attempting to iterate through every single gene in my Hail table (total of ~ 100,000,000 rows), filtering to the appropriate consequence class, and summing up the mutability values for all variants in the class with this code:

consequence_classes = ['synonymous_variant', 'missense_variant', 'start_lost', 'stop_gained', 'stop_lost', 'splice_donor_variant',
                      'splice_acceptor_variant', 'splice_region_variant']

results_dict = {}

for gene in ensembl_vep.gene.collect():    #ensembl_vep is the Hail table
    gene_ht = ensembl_vep.filter(ensembl_vep.gene == gene)
    mutability_sums = [0] * len(consequence_classes)
    for i, consequence_class in enumerate(consequence_classes):
        consq_ht = gene_ht.filter(gene_ht.major_consequence == consequence_class)
        mutability_sums[i] = consq_ht.aggregate(hl.agg.sum(consq_ht.mutability_float))
    results_dict[gene] = mutability_sums

However, this takes quite long for around 18,000 genes. Can I ask if there is any way to make the process more efficient? I have a slightly more complex code that also filters to different combinations of pathogenicity predictors, and it takes like 15 mins per gene… (tracked by printing progress)

I’d appreciate any input on this, thank you very much!

The way to do this efficiently is to do all genes at the same time!

results_dict = ensembl_vep.aggregate(
    hl.agg.group_by(ensembl_vep.gene,
        hl.agg.group_by(ensembl_vep.major_consequence,
            hl.agg.sum(ensembl_vep.mutability_float))))

Thank you!! I will try that. Also, if I am specify particular classes like this snapshot:

    for i in predictor_classes:
        if i == 'mis_metaSVM':
            consq_ht = gene_ht.filter((gene_ht.major_consequence == 'missense_variant') & (gene_ht.meta_svm == "D"))
            mutability_sums.append(consq_ht.aggregate(hl.agg.sum(consq_ht.mutability_float)))
        elif i == 'mis_CADD20':
            consq_ht = gene_ht.filter((gene_ht.major_consequence == 'missense_variant') & (gene_ht.cadd > 20))
            mutability_sums.append(consq_ht.aggregate(hl.agg.sum(consq_ht.mutability_float)))
        elif i == 'mis_CADD30':
            consq_ht = gene_ht.filter((gene_ht.major_consequence == 'missense_variant') & (gene_ht.cadd > 30))
            mutability_sums.append(consq_ht.aggregate(hl.agg.sum(consq_ht.mutability_float)))

Is there a way to do this in a more efficient way?

results_dict = ensembl_vep.aggregate(
    hl.agg.group_by(ensembl_vep.gene,
        hl.struct(mutability_per_csq=hl.agg.group_by(ensembl_vep.major_consequence,
                      hl.agg.sum(ensembl_vep.mutability_float)),
                  mis_metaSVM=hl.agg.filter((ensembl_vep.major_consequence == 'missense_variant') & (ensembl_vep.meta_svm == "D"), hl.agg.sum(ensembl_vep.mutability_float))),
                  mis_CADD20=...)




Thank you very much Tim for your insights! Also, just another technical question to improve my understanding of Hail, for this line:
for gene in ensembl_vep.gene.collect():
Does it iterate through unique genes? I am printing the progress with the “gene” variable, and so far I see 5 loops where the gene is the same gene.

ensembl_vep.gene.collect() is the same as:

ensembl_vep.aggregate(hl.agg.collect(ensembl_vep.gene))

This aggregates over the rows of the table, collecting each value. This will collect duplicates. If you want uniques, you could do:

ensembl_vep.aggregate(hl.agg.collect_as_set(ensembl_vep.gene))

1 Like