Hi there,
I have a dictionary, ch_dict, holding per-gene conditions that define a type of variant. For example, if a variant in one of these genes is a stop-gain mutation or has a mutation in the 76th amino acid, I want to annotate it as True, and otherwise as False. Each gene has different conditions for its variants.
ch_dict
{'ASXL1': {'mutation_types': <SetExpression of type set<str>>,
'missense_mutation_ranges': <ArrayNumericExpression of type array<int32>>,
'missense_mutation_locations': <ArrayExpression of type array<str>>,
'exons': <SetExpression of type set<int32>>},
'ASXL2': {'mutation_types': <SetExpression of type set<str>>,
'missense_mutation_ranges': <ArrayNumericExpression of type array<int32>>,
'missense_mutation_locations': <ArrayExpression of type array<str>>,
'exons': <SetExpression of type set<int32>>},
...
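To spell out the rule in plain Python (no Hail involved), here is a toy sketch of the True/False call I want per variant. The `toy_rules` dictionary, its values, and the `is_chip_variant` helper are all made up for illustration, and `exons` is left out because my check below does not use it.

```python
# Plain-Python illustration only: the rules and variant values are invented.
toy_rules = {
    "ASXL1": {
        "mutation_types": {"stop_gained", "frameshift"},
        "missense_mutation_ranges": [76],
        "missense_mutation_locations": ["A100V"],
    },
    "ASXL2": {
        "mutation_types": {"stop_gained"},
        "missense_mutation_ranges": [],
        "missense_mutation_locations": [],
    },
}


def is_chip_variant(gene_name, variant_type, amino_acid_number, amino_acid_change, rules):
    """Return True if the variant meets any condition listed for its gene."""
    info = rules.get(gene_name)
    if info is None:
        # Genes without rules can never be flagged.
        return False
    return (
        variant_type in info["mutation_types"]
        or amino_acid_number in info["missense_mutation_ranges"]
        or amino_acid_change in info["missense_mutation_locations"]
    )


# Stop gain in ASXL1: flagged by mutation type.
print(is_chip_variant("ASXL1", "stop_gained", 12, "R12*", toy_rules))
# Missense at amino acid 76 in ASXL1: flagged by position.
print(is_chip_variant("ASXL1", "missense", 76, "A76V", toy_rules))
# Same missense in ASXL2, which has no position rules: not flagged.
print(is_chip_variant("ASXL2", "missense", 76, "A76V", toy_rules))
```

The point is that each variant only needs the rules of its own gene, so conceptually this is one lookup by gene name followed by three membership tests.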
When I use the following code to do the annotation:
mt = mt.annotate_rows(chip_variant=hl.literal(False))
def check_variants(row, info):
    mutation_type = info['mutation_types'].contains(row.variant_type)
    missense_mutation_range = info['missense_mutation_ranges'].contains(row.amino_acid_number)
    missense_mutation_locations = info['missense_mutation_locations'].contains(row.amino_acid_change)
    return (mutation_type | missense_mutation_locations | missense_mutation_range)
for gene, gene_info in ch_dict.items():
    is_chip = check_variants(mt, gene_info)
    mt = mt.annotate_rows(chip_variant = hl.cond(mt.row.gene_name == gene, is_chip, mt.row.chip_variant))
It runs just fine, but when I call
mt.row.chip_variant.show(5)
I get the "method too large" error. I'm guessing it comes from all these chained conditions. What would be a smarter way to do this?
This hack works:
import numpy as np
chips = np.zeros((mt.rows().count()))
mt = mt.annotate_rows(chip_variant=hl.literal(False))
mt.row.chip_variant.collect()  # result is discarded
counter = 0
for gene, gene_info in ch_dict.items():
    # Evaluate the rules only on the rows of this gene
    is_chip = check_variants(mt.filter_rows(mt.row.gene_name == gene), gene_info)
    # Boolean mask over all rows marking where this gene occurs
    gene_idxs = (mt.row.gene_name == gene).collect()
    is_chip_collected = is_chip.collect()
    chips[gene_idxs] = is_chip_collected
    counter += 1
    print(counter, sum(chips))
But it is very slow, even for ~70 genes. Any comments about how to improve the code would be great.