How to make iterating through every gene more efficient?

MsUTR · May 18, 2023, 7:13pm

Hello all! I am currently attempting to iterate through every single gene in my Hail table (total of ~ 100,000,000 rows), filtering to the appropriate consequence class, and summing up the mutability values for all variants in the class with this code:

consequence_classes = ['synonymous_variant', 'missense_variant', 'start_lost', 'stop_gained', 'stop_lost', 'splice_donor_variant',
                      'splice_acceptor_variant', 'splice_region_variant']

results_dict = {}

for gene in ensembl_vep.gene.collect():    #ensembl_vep is the Hail table
    gene_ht = ensembl_vep.filter(ensembl_vep.gene == gene)
    mutability_sums = [0] * len(consequence_classes)
    for i, consequence_class in enumerate(consequence_classes):
        consq_ht = gene_ht.filter(gene_ht.major_consequence == consequence_class)
        mutability_sums[i] = consq_ht.aggregate(hl.agg.sum(consq_ht.mutability_float))
    results_dict[gene] = mutability_sums

However, this takes quite long for around 18,000 genes. Can I ask if there is any way to make the process more efficient? I have a slightly more complex code that also filters to different combinations of pathogenicity predictors, and it takes like 15 mins per gene… (tracked by printing progress)

I’d appreciate any input on this, thank you very much!

tpoterba · May 18, 2023, 7:43pm

The way to do this efficiently is to do all genes at the same time!

results_dict = ensembl_vep.aggregate(
    hl.agg.group_by(ensembl_vep.gene,
        hl.agg.group_by(ensembl_vep.major_consequence,
            hl.agg.sum(ensembl_vep.mutability_float))))

MsUTR · May 18, 2023, 7:46pm

Thank you!! I will try that. Also, if I am specify particular classes like this snapshot:

    for i in predictor_classes:
        if i == 'mis_metaSVM':
            consq_ht = gene_ht.filter((gene_ht.major_consequence == 'missense_variant') & (gene_ht.meta_svm == "D"))
            mutability_sums.append(consq_ht.aggregate(hl.agg.sum(consq_ht.mutability_float)))
        elif i == 'mis_CADD20':
            consq_ht = gene_ht.filter((gene_ht.major_consequence == 'missense_variant') & (gene_ht.cadd > 20))
            mutability_sums.append(consq_ht.aggregate(hl.agg.sum(consq_ht.mutability_float)))
        elif i == 'mis_CADD30':
            consq_ht = gene_ht.filter((gene_ht.major_consequence == 'missense_variant') & (gene_ht.cadd > 30))
            mutability_sums.append(consq_ht.aggregate(hl.agg.sum(consq_ht.mutability_float)))

Is there a way to do this in a more efficient way?

tpoterba · May 18, 2023, 7:50pm

results_dict = ensembl_vep.aggregate(
    hl.agg.group_by(ensembl_vep.gene,
        hl.struct(mutability_per_csq=hl.agg.group_by(ensembl_vep.major_consequence,
                      hl.agg.sum(ensembl_vep.mutability_float)),
                  mis_metaSVM=hl.agg.filter((ensembl_vep.major_consequence == 'missense_variant') & (ensembl_vep.meta_svm == "D"), hl.agg.sum(ensembl_vep.mutability_float))),
                  mis_CADD20=...)

MsUTR · May 18, 2023, 7:52pm

Thank you very much Tim for your insights! Also, just another technical question to improve my understanding of Hail, for this line:
for gene in ensembl_vep.gene.collect():
Does it iterate through unique genes? I am printing the progress with the “gene” variable, and so far I see 5 loops where the gene is the same gene.

tpoterba · May 18, 2023, 7:53pm

ensembl_vep.gene.collect() is the same as:

ensembl_vep.aggregate(hl.agg.collect(ensembl_vep.gene))

This aggregates over the rows of the table, collecting each value. This will collect duplicates. If you want uniques, you could do:

ensembl_vep.aggregate(hl.agg.collect_as_set(ensembl_vep.gene))

Topic		Replies	Views
Counting number of variants for each interval Hail Query & hailctl	3	372	September 22, 2022
More efficient way to extract calls? Hail Query & hailctl	2	368	December 14, 2022
Compound heterozygote analysis? Hail Query & hailctl	3	431	May 17, 2022
Compound hets and array<str> to list help Hail Query & hailctl	2	541	May 12, 2020
Gene-based sample statistics Hail Query & hailctl	6	735	January 27, 2020

How to make iterating through every gene more efficient?

Related topics