Method too large help

Hi there,

I have a dictionary, ch_dict, that holds per-gene conditions defining a type of variant. For example, if a variant in one of these genes is a stop-gain mutation or has a mutation in the 76th amino acid, I want to annotate it with a True/False flag. Each gene has different conditions for its variants.

ch_dict
{'ASXL1': {'mutation_types': <SetExpression of type set<str>>,
  'missense_mutation_ranges': <ArrayNumericExpression of type array<int32>>,
  'missense_mutation_locations': <ArrayExpression of type array<str>>,
  'exons': <SetExpression of type set<int32>>},
 'ASXL2': {'mutation_types': <SetExpression of type set<str>>,
  'missense_mutation_ranges': <ArrayNumericExpression of type array<int32>>,
  'missense_mutation_locations': <ArrayExpression of type array<str>>,
  'exons': <SetExpression of type set<int32>>},
...

When I use the following code to do the annotation:

mt = mt.annotate_rows(chip_variant = hl.literal(False))

def check_variants(row, info):
    mutation_type = info['mutation_types'].contains(row.variant_type)
    missense_mutation_range = info['missense_mutation_ranges'].contains(row.amino_acid_number)
    missense_mutation_locations = info['missense_mutation_locations'].contains(row.amino_acid_change)
    return mutation_type | missense_mutation_locations | missense_mutation_range


for gene, gene_info in ch_dict.items():
    is_chip = check_variants(mt, gene_info)
    mt = mt.annotate_rows(chip_variant = hl.cond(mt.row.gene_name == gene, is_chip, mt.row.chip_variant))

It runs just fine, but when I run

mt.row.chip_variant.show(5)

I get the method too large error. I’m guessing it comes from all these conditions. What would be a smarter way to do this?

This hack works:

import numpy as np
chips = np.zeros((mt.rows().count()))
mt.row.chip_variant = hl.literal(False)
mt.row.chip_variant.collect()
counter = 0
for gene, gene_info in ch_dict.items():
    is_chip = check_variants(mt.filter_rows(mt.row.gene_name == gene), gene_info)
    gene_idxs = (mt.row.gene_name == gene).collect()
    is_chip_collected = is_chip.collect()
    chips[gene_idxs] = is_chip_collected
    counter += 1
    print(counter, sum(chips))

But it is very slow, even for ~70 genes. Any comments about how to improve the code would be great.

The method too large error is a bug caused by an infrastructural deficiency that we’re working to fix. Even if it worked, though, your pipeline would have terrible performance, because you’re doing an if/else check per gene per variant. Doing a dictionary lookup should make this quite fast:


import hail as hl

def check_variants(row, info):
    # True if the variant matches any of the gene's conditions
    mutation_type = info['mutation_types'].contains(row.variant_type)
    missense_mutation_range = info['missense_mutation_ranges'].contains(row.amino_acid_number)
    missense_mutation_locations = info['missense_mutation_locations'].contains(row.amino_acid_change)
    return mutation_type | missense_mutation_locations | missense_mutation_range

# Look up each row's gene once; a gene that is not in ch_dict gives a missing value
info_at_gene = hl.literal(ch_dict).get(mt.gene_name)
chip_variant = check_variants(mt.row, info_at_gene)
# Treat missing results (genes without conditions) as False
mt = mt.annotate_rows(chip_variant = hl.coalesce(chip_variant, False))
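
If it helps to see the idea outside of Hail, here is a rough plain-Python sketch of the same logic. It is not Hail code and is only meant to illustrate the lookup: conditions_by_gene, check_variant, is_chip_variant, and every value in the example data are made up for illustration. Each variant does one dictionary lookup by gene name instead of walking through one if/else branch per gene, and a gene with no entry falls back to False, which is the role hl.coalesce(..., False) plays above.

```python
# Plain-Python analogue of the dictionary lookup (illustration only, not Hail).
# All conditions and variants below are invented example values.
conditions_by_gene = {
    "ASXL1": {
        "mutation_types": {"stop_gained", "frameshift"},
        "missense_mutation_ranges": [76, 77, 78],
        "missense_mutation_locations": ["G646W"],
    },
    "ASXL2": {
        "mutation_types": {"stop_gained"},
        "missense_mutation_ranges": [100],
        "missense_mutation_locations": ["R100H"],
    },
}


def check_variant(variant, info):
    """Return True if the variant meets any of the gene's conditions."""
    return (
        variant["variant_type"] in info["mutation_types"]
        or variant["amino_acid_number"] in info["missense_mutation_ranges"]
        or variant["amino_acid_change"] in info["missense_mutation_locations"]
    )


def is_chip_variant(variant, conditions):
    # One lookup by gene name replaces one if/else branch per gene.
    info = conditions.get(variant["gene_name"])
    # Genes without an entry fall back to False.
    return False if info is None else check_variant(variant, info)


variants = [
    {"gene_name": "ASXL1", "variant_type": "missense",
     "amino_acid_number": 76, "amino_acid_change": "A76T"},
    {"gene_name": "ASXL2", "variant_type": "synonymous",
     "amino_acid_number": 10, "amino_acid_change": "L10L"},
    {"gene_name": "TP53", "variant_type": "stop_gained",
     "amino_acid_number": 5, "amino_acid_change": "Q5*"},
]

flags = [is_chip_variant(v, conditions_by_gene) for v in variants]
print(flags)
```

The work per variant stays roughly constant no matter how many genes are in the dictionary, whereas the loop of nested conditionals adds one more branch to every variant's expression for each gene.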